



Why is this? There is no one answer. Some organizations are simply unaware of the security recommendations, tools, and techniques available to them. Others lack the necessary skill and experience to implement the guidance and maintain secured configurations. It is not uncommon for these organizations to feel overwhelmed by the sheer number of recommendations, settings and options. Still others may feel that security is not an issue in their environment. The list goes on and on, yet the need for security and integrity has never been more important.
Interestingly, the evolution and convergence of technology is cultivating new ideas and solutions to help organizations better protect their services and data. One such idea is being demonstrated by the Immutable Service Container (ISC) project. Immutable Service Containers are an architectural deployment pattern used to describe a platform for highly secure service delivery. Building upon concepts and functionality enabled by operating systems, hypervisors, virtualization, and networking, ISCs provide a secured container into which a service or set of services is deployed. Each ISC embodies at its core the key principles inherent in the Sun Systemic Security framework including: self-preservation, defense in depth, least privilege, compartmentalization and proportionality. Further, ISC design borrows from Cloud Computing principles such as service abstraction, micro-virtualization, automation, and "fail in place".
Designing service delivery platforms using the Immutable Service Containers model offers a number of significant security benefits.
It is expected that Immutable Service Containers will form the most basic architectural building block for more complex, highly dynamic and autonomic architectures. The goal of the ISC project is to more fully describe the architecture and attributes of ISCs, their inherent benefits, their construction as well as to document practical examples using various software applications.
While the notion of ISCs is not based upon any one product or technology, an instantiation has been recently developed using OpenSolaris 2009.06. This instantiation offers a pre-integrated configuration leveraging OpenSolaris security recommended practices and settings. With ISCs, you are not starting from a blank slate, but rather you can now build upon the security expertise of others. Let's look at the OpenSolaris-based ISC more closely.
In an ISC configuration, the global zone is treated as a system controller and exposed services are deployed (only) into their own non-global zones. From a networking perspective, however, the entire environment is viewed as a single entity (one IP address) where the global zone acts as a security monitoring and arbitration point for all of the services running in non-global zones.
As a foundation, this highly optimized environment is pre-configured with:
Further, the default OpenSolaris ISC uses:
Further, in the ISC model, each non-global zone has its own encrypted scratch space (with its own ephemeral key), its own persistent storage location, as well as a pre-configured auditing and networking configuration that matches that of the global zone. You do not need to use the encrypted scratch space or persistent storage, but it is there if you want to take advantage of it. Obviously, additional resource controls (CPU, memory, etc.) can be added as necessary. These are not pre-configured due to the variability of service payloads.
So what does all of this really mean? Using the ISC model, you can deploy your services in a micro-virtualized environment that offers protection against kernel-based root kits (and some forms of user-land root kits), offers flexible file system immutability (based upon read-only file systems mounted into the non-global zone), can take advantage of process least privilege and resource controls, and is operated in a hardened environment where there is a packet filtering, NAT and auditing policy that is effectively out of the reach of the deployed service. This means that should a service be compromised in a non-global zone, it will not be able to impact the integrity or validity of the auditing, packet filtering, and NAT configuration or logs. While you may not be able to stop every form of attack, having reliable audit trails can significantly help to determine the extent of the breach and facilitate recovery.
The following diagram puts all of the pieces together:

Additional private virtual networking models are also being considered. All in all, the ISC model offers a very compelling deployment model. The accessibility and attractiveness of this model is further enhanced by the availability of an ISC construction kit that allows you to take an OpenSolaris 2009.06 system and convert it to the ISC model with a single command. Sound interesting? Give it a try, come join the project and be sure to send along your feedback!
Desktop Summit, which is made up of GUADEC and AKademy, will be held this year in Gran Canaria from 3rd-11th July. This is by far the biggest event of its kind: FOSS and totally desktop oriented.
I will be arriving on the evening of 2nd July with the whole bunch from the Desktop group at Sun. Some of the people, Alberto and Luis, are native Canarians, so I am looking forward to their local hospitality.
We will also meet up with hackers old and new.
Many of the talks, including GNOME Shell, GNOME 3.0 and Mobile Development, cover exciting topics that I look forward to hearing and seeing! I will be there until 9th July.
The installation of a zone in OpenSolaris is a bit different than in Solaris 10 (or SXCE) and it's due to IPS, which is unique to OpenSolaris. When you create a zone in Solaris 10, you get a native zone, which is very lightweight because it shares much of its system software with the base Solaris 10 installation. However, native zones presume you are using the SVR4 packaging system (as opposed to IPS). Therefore, OpenSolaris uses a branded zone called ipkg.
The ipkg branded zone doesn't share any of its system information with the base OpenSolaris installation. As a matter of fact, when installed, it's not even copied from the base installation, but rather downloaded from an IPS repository. Obviously this makes working with zones in OpenSolaris a bit more restrictive (it took about 10 minutes to download and install on my machine). Supposedly, work is underway to add IPS support to native zones. But until that happens, here's my guide to working with zones in OpenSolaris.
Setting up a zone involves 4 steps: create, install, boot and configure.
If you're not interested in zones, you should at least be aware that you're already running in one - the global zone:
bleonard@opensolaris:~$ zoneadm list -v
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
Zones must be installed within a ZFS file system, otherwise the zone install command will generate the error "no zonepath dataset" (see defect 8468 for details). You can either use an existing ZFS file system, such as /export/home or create a new one, as I chose to do here:
pfexec zfs create -o mountpoint=/zones rpool/zones
Before we actually create the zone, let's pre-determine some information that will be required. I'm going to set the zone name to myzone. The zone needs a network interface, which can match that of the global zone. This is easiest to figure out by hovering over the connection properties icon in the top panel and noting the Network Connection:
In my case it's e1000g0.
Since we'll be using a shared IP stack with the global zone, the non-global zone is not at liberty to select its own IP address (or use DHCP). I may talk about exclusive IP stacks in another entry, but for now we need to choose a free IP address on the subnet (I'm running OpenSolaris in VirtualBox, which provides its own subnet). I'll be using 10.0.2.25.
Once you have that information collected, you can begin to create the zone:
bleonard@opensolaris:~$ pfexec zonecfg -z myzone
myzone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:myzone> create
zonecfg:myzone> set zonepath=/zones/myzone
zonecfg:myzone> add net
zonecfg:myzone:net> set physical=e1000g0
zonecfg:myzone:net> set address=10.0.2.25
zonecfg:myzone:net> end
zonecfg:myzone> exit
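If you'd rather not type the session interactively, the same subcommands can be generated as text. Here is a minimal sketch in Python, assuming a single shared-IP network interface as in the session above; the function name zonecfg_commands and its parameters are my own, not part of any OpenSolaris tool.

```python
def zonecfg_commands(zonepath, physical, address):
    """Return the zonecfg subcommands from the interactive session, one per line.

    Hypothetical helper: it only builds text and does not run zonecfg itself.
    """
    lines = [
        "create",
        f"set zonepath={zonepath}",
        "add net",
        f"set physical={physical}",
        f"set address={address}",
        "end",
        "exit",
    ]
    return "\n".join(lines) + "\n"


# Values taken from the walkthrough above.
commands = zonecfg_commands("/zones/myzone", "e1000g0", "10.0.2.25")
print(commands)
```

The resulting text could be saved to a file and handed to zonecfg with its -f command file option (for example, pfexec zonecfg -z myzone -f followed by the file name), which should leave you with the same configuration as the interactive session.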
To see your zone's current configuration, run:
bleonard@opensolaris:~$ zonecfg -z myzone info
zonename: myzone
zonepath: /zones/myzone
brand: ipkg
autoboot: false
bootargs:
pool:
limitpriv:
scheduling-class:
ip-type: shared
net:
        address: 10.0.2.25
        physical: e1000g0
        defrouter not specified
List the zones again, using the -c option to show all zones (not just those installed):
bleonard@opensolaris:~$ zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   - myzone           configured /zones/myzone                  ipkg     shared
Notice the brand is ipkg.
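If you want to check a zone's status or brand from a script rather than by eye, the table is easy to pick apart. The following is only a sketch, assuming whitespace-separated columns and a zone path without spaces; parse_zone_list is a hypothetical helper name. (zoneadm list also has a -p option that prints colon-delimited output, which is likely the sturdier choice for real scripts.)

```python
def parse_zone_list(text):
    """Parse the table printed by `zoneadm list -cv` into a list of dicts.

    Assumes columns are separated by whitespace and no field contains a space.
    """
    rows = [line.split() for line in text.strip().splitlines()]
    header = [name.lower() for name in rows[0]]
    return [dict(zip(header, fields)) for fields in rows[1:]]


# Sample copied from the listing above.
SAMPLE = """\
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   - myzone           configured /zones/myzone                  ipkg     shared
"""

zones = parse_zone_list(SAMPLE)
for zone in zones:
    print(zone["name"], zone["status"], zone["brand"])
```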
Now that the zone's configured, let's install it. Zone installation on OpenSolaris is a much different experience than on Solaris 10, as the zone must be downloaded from the package repository rather than simply copied from the global zone:
bleonard@opensolaris:~$ pfexec zoneadm -z myzone install
A ZFS file system has been created for this zone.
Authority: Using http://pkg.opensolaris.org/release/.
Image: Preparing at /zones/myzone/root ... done.
Installing: (output follows)
DOWNLOAD PKGS FILES XFER (MB)
Completed 52/52 7862/7862 72.41/72.41
PHASE ACTIONS
Install Phase 12939/12939
PHASE ITEMS
Reading Existing Index 9/9
Indexing Packages 52/52
Note: Man pages can be obtained by installing SUNWman
Postinstall: Copying SMF seed repository ... done.
Postinstall: Working around http://defect.opensolaris.org/bz/show_bug.cgi?id=741
Done: Installation completed in 595.162 seconds.
Next Steps: Boot the zone, then log into the zone console
(zlogin -C) to complete the configuration process
We can verify the installation via its status:
bleonard@opensolaris:~$ zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   - myzone           installed  /zones/myzone                  ipkg     shared
Log into the zone and wait for it to boot:
bleonard@opensolaris:~$ pfexec zlogin -C myzone
[Connected to zone 'myzone' console]
Open a 2nd terminal window and boot the zone. If you see the same warning I did, don't worry about it; I address it at the end of this entry.
bleonard@opensolaris:~$ pfexec zoneadm -z myzone boot
zone 'myzone': WARNING: e1000g0:1: no matching subnet found in netmasks(4) for 10.0.2.25; using default of 255.0.0.0.
Then back in the 1st terminal, proceed with system configuration:
[NOTICE: Zone booting up]
SunOS Release 5.11 Version snv_101b 32-bit
Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: myzone
Loading smf(5) service descriptions: 68/68
Reading ZFS config: done.
Mounting ZFS filesystems: (5/5)
What type of terminal are you using?
 1) ANSI Standard CRT
 2) DEC VT100
 3) PC Console
 4) Sun Command Tool
 5) Sun Workstation
 6) X Terminal Emulator (xterms)
 7) Other
Type the number of your choice and press Return: 6
Creating new rsa public/private host key pair
Creating new dsa public/private host key pair
Configuring network interface addresses: e1000g0.
Give the zone a host name (or select the default):
─ Host Name for e1000g0:1 ─────────────────────────────────────────────────────
Enter the host name which identifies this system on the network. The name
must be unique within your domain; creating a duplicate host name will cause
problems on the network after you install Solaris.
A host name must have at least one character; it can contain letters,
digits, and minus signs (-).
Host name for e1000g0:1 myzone
────────────────────────────────────────────────────────────────────────────────
F2_Continue F6_Help
Note, for some reason the "Continue" command switches from F2, as in the screen shot above, to Esc+2, as seen in the following screens.
Confirm the host name:
─ Confirm Information for e1000g0:1 ────────────────────────────────────────────
> Confirm the following information. If it is correct, press F2;
to change any information, press F4.
Host name: myzone
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-4_Change Esc-6_Help
Configure the security policy:
─ Configure Security Policy: ───────────────────────────────────────────────────
Specify Yes if the system will use the Kerberos security mechanism.
Specify No if this system will use standard UNIX security.
Configure Kerberos Security
───────────────────────────
[ ] Yes
[X] No
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
Confirm the security policy:
─ Confirm Information ──────────────────────────────────────────────────────────
> Confirm the following information. If it is correct, press F2;
to change any information, press F4.
Configure Kerberos Security: No
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-4_Change Esc-6_Help
Set the name service. I will be using DNS:
─ Name Service ─────────────────────────────────────────────────────────────────
On this screen you must provide name service information. Select the name
service that will be used by this system, or None if your system will either
not use a name service at all, or if it will use a name service not listed
here.
> To make a selection, use the arrow keys to highlight the option
and press Return to mark it [X].
Name service
────────────
[ ] NIS+
[ ] NIS
[X] DNS
[ ] LDAP
[ ] None
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
If you also selected DNS, set the domain name, DNS servers and search domains. I'm using the same settings as my global zone, which you can find in /etc/resolv.conf:
bleonard@opensolaris:~$ cat /etc/resolv.conf
domain hsd1.ct.comcast.net.
nameserver 10.0.2.3
Set the domain name:
─ Domain Name ──────────────────────────────────────────────────────────────
On this screen you must specify the domain where this system resides. Make
sure you enter the name correctly including capitalization and punctuation.
Domain name: hsd1.ct.comcast.net
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
Add the DNS Server Addresses:
─ DNS Server Addresses ─────────────────────────────────────────────────────────
On this screen you must enter the IP address of your DNS server(s). You
must enter at least one address. IP addresses must contain four sets of
numbers separated by periods (for example 129.200.9.1).
Server's IP address: 10.0.2.3
Server's IP address:
Server's IP address:
───────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
And any search domains:
─ DNS Search List ──────────────────────────────────────────────────────────────
On this screen you can enter a list of domains that will be searched when a
DNS query is made. If you do not enter any domains, DNS will only search
the DNS domain chosen for this system. The domains entered, when
concatenated, may not be longer than 250 characters.
Search domain:
Search domain:
Search domain:
Search domain:
Search domain:
Search domain:
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
Confirm the network information:
─ Confirm Information ──────────────────────────────────────────────────────────
> Confirm the following information. If it is correct, press F2;
to change any information, press F4.
Name service: DNS
Domain name: hsd1.ct.comcast.net
Server address(es): 10.0.2.3
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-4_Change Esc-6_Help
Ignore the Name Service Error (i.e., do not enter new name service information):
─ Name Service Error ───────────────────────────────────────────────────────────
Unable to find an address entry for myzone with the specified DNS
configuration.
Enter new name service information?
───────────────────────────────────
[ ] Yes
[X] No
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
NFSv4 Domain Name:
─ NFSv4 Domain Name ────────────────────────────────────────────────────────────
NFS version 4 uses a domain name that is automatically derived from the
system's naming services. The derived domain name is sufficient for most
configurations. In a few cases, mounts that cross domain boundaries might
cause files to appear to be owned by "nobody" due to the lack of a common
domain name.
The current NFSv4 default domain is: "hsd1.ct.comcast.net"
NFSv4 Domain Configuration
──────────────────────────────────────────────
[X] Use the NFSv4 domain derived by the system
[ ] Specify a different NFSv4 domain
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
Confirm:
─ Confirm Information for NFSv4 Domain ─────────────────────────────────────────
> Confirm the following information. If it is correct, press F2;
to change any information, press F4.
NFSv4 Domain Name: << Value to be derived dynamically >>
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-4_Change Esc-6_Help
Select your time zone:
─ Time Zone ────────────────────────────────────────────────────────────────────
On this screen you must specify your default time zone. You can specify a
time zone in three ways: select one of the continents or oceans from the
list, select other - offset from GMT, or other - specify time zone file.
> To make a selection, use the arrow keys to highlight the option and
press Return to mark it [X].
Continents and Oceans
──────────────────────────────────
- [ ] Africa
│ [X] Americas
│ [ ] Antarctica
│ [ ] Arctic Ocean
│ [ ] Asia
│ [ ] Atlantic Ocean
│ [ ] Australia
│ [ ] Europe
v [ ] Indian Ocean
──────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
Country:
─ Country or Region ────────────────────────────────────────────────────────────
> To make a selection, use the arrow keys to highlight the option and
press Return to mark it [X].
Countries and Regions
───────────────────────────
- [X] United States
│ [ ] Anguilla
│ [ ] Antigua & Barbuda
│ [ ] Argentina
│ [ ] Aruba
│ [ ] Bahamas
│ [ ] Barbados
│ [ ] Belize
│ [ ] Bolivia
│ [ ] Brazil
│ [ ] Canada
│ [ ] Cayman Islands
v [ ] Chile
──────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
Time Zone:
─ Time Zone ───────────────────────────────────────────────────────────────────
> To make a selection, use the arrow keys to highlight the option and
press Return to mark it [X].
Time zones
──────────────────────────────────────────────────────────────────────────
- [X] Eastern Time
│ [ ] Eastern Time - Michigan - most locations
│ [ ] Eastern Time - Kentucky - Louisville area
│ [ ] Eastern Time - Kentucky - Wayne County
│ [ ] Eastern Time - Indiana - most locations
│ [ ] Eastern Time - Indiana - Daviess, Dubois, Knox & Martin Counties
│ [ ] Eastern Time - Indiana - Starke County
│ [ ] Eastern Time - Indiana - Pulaski County
│ [ ] Eastern Time - Indiana - Crawford County
│ [ ] Eastern Time - Indiana - Switzerland County
│ [ ] Central Time
│ [ ] Central Time - Indiana - Perry County
v [ ] Central Time - Indiana - Pike County
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
Confirm Time Zone:
─ Confirm Information ─────────────────────────────────────────────────────────
> Confirm the following information. If it is correct, press F2;
to change any information, press F4.
Time zone: Eastern Time
(US/Eastern)
──────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-4_Change Esc-6_Help
And finally, set the root password:
─ Root Password ────────────────────────────────────────────────────────────────
Please enter the root password for this system.
The root password may contain alphanumeric and special characters. For
security, the password will not be displayed on the screen as you type it.
> If you do not want a root password, leave both entries blank.
Root password: *********
Root password: *********
────────────────────────────────────────────────────────────────────────────────
Esc-2_Continue Esc-6_Help
Zone configuration is complete. You can now log into the zone:
System identification is completed.
myzone console login: root
Password:
Apr 1 21:48:04 myzone login: ROOT LOGIN /dev/console
Sun Microsystems Inc. SunOS 5.11 snv_101b November 2008
root@myzone:~#
From the other terminal, the zone's status now shows as running:
bleonard@opensolaris:~$ zoneadm list -v
  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   1 myzone           running    /zones/myzone                  ipkg     shared
To drop off the zone console, exit the shell prompt and then type ~. at the console login prompt:
root@myzone:~# exit
logout
myzone console login: ~.
[Connection to zone 'myzone' console closed]
bleonard@opensolaris:~$
The zone is still running. Log in again:
bleonard@opensolaris:~$ pfexec zlogin -C myzone
[Connected to zone 'myzone' console]
Hit return to get the login prompt:
myzone console login: root
Password:
Last login: Wed Apr 1 22:00:12 on console
Sun Microsystems Inc. SunOS 5.11 snv_101b November 2008
root@myzone:~#
The zone can be shut down, halted or rebooted from within the zone (here's a reboot example):
root@myzone:~# reboot
Apr 2 01:18:50 myzone reboot: initiated by root on /dev/console
[NOTICE: Zone rebooting]
SunOS Release 5.11 Version snv_101b 32-bit
Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Hostname: myzone
Reading ZFS config: done.
Mounting ZFS filesystems: (5/5)
myzone console login:
Or from the global zone:
pfexec zoneadm -z myzone reboot
Now that we have a zone, there's plenty of opportunity to experiment...
When you're done experimenting, you can uninstall the zone and delete its configuration:
pfexec zoneadm -z myzone uninstall
pfexec zonecfg -z myzone delete -F
If you're getting the netmask warning as I did when the zone boots:
zone 'myzone': WARNING: e1000g0:1: no matching subnet found in netmasks(4) for 10.0.2.25; using default of 255.0.0.0.
You can eliminate it by adding the zone's IP subnet into /etc/inet/netmasks. However, before we can edit the netmasks file, we need to make it writable:
pfexec chmod u+w /etc/inet/netmasks
Then add the proper subnet for your network. For example:
10.0.2.0 255.255.255.0
Now the zone will boot cleanly. For more information see netmasks Warning Displayed When Booting Zone.
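To see why that one line makes the warning go away, it helps to look at the arithmetic. The sketch below is my own illustration in Python, not the logic of any OpenSolaris tool: I'm assuming that the 255.0.0.0 fallback is the classful default (10.x.x.x being a class A address) and that a netmasks entry matches when the address falls inside the listed network. Both function names are hypothetical.

```python
import ipaddress


def classful_default_mask(address):
    """Return the classful default netmask for a unicast IPv4 address."""
    first_octet = int(ipaddress.IPv4Address(address)) >> 24
    if first_octet < 128:
        return "255.0.0.0"        # class A
    if first_octet < 192:
        return "255.255.0.0"      # class B
    if first_octet < 224:
        return "255.255.255.0"    # class C
    raise ValueError("class D/E addresses have no default netmask")


def matches_netmasks_entry(address, network, mask):
    """True if address lies inside the network/mask pair from one netmasks line."""
    net = ipaddress.IPv4Network(f"{network}/{mask}")
    return ipaddress.IPv4Address(address) in net


# The zone's address from this entry and the line added to /etc/inet/netmasks.
print(classful_default_mask("10.0.2.25"))
print(matches_netmasks_entry("10.0.2.25", "10.0.2.0", "255.255.255.0"))
```

With the 10.0.2.0 entry in place, the zone's address has a matching subnet, so there is no need to fall back to the much wider default mask.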
Codeina's a nifty little feature of OpenSolaris 2009.06. It's an application from Fluendo that points applications such as Totem or Rhythmbox to Fluendo's web shop if you are missing the codec for the file you are trying to play. In the case of MP3s, the codec is free. For other audio and video formats, you'll have to pay a small license fee.
For example, when I attempt to play an MP3 in Totem, I'm presented with the following:
Clicking Install launches the Codeina Web Shop application where I can register to "buy" the free MP3 decoder:

And then Install the decoder:

Which will quickly complete:

After which my song will automatically start playing.
Note, if Codeina isn't working for you, you may be running into the issue "Codeina fails to start". The quick fix for this is to remove the Fluendo configuration file and try again:
rm ~/.local/share/codeina/providers/fluendo.xml
I was poking around with the Just enough Operating System (JeOS) delivery form of OpenSolaris when I found the Web Space Server project built on top of that platform. The Web Space Server project is based on the Liferay Portal and it has been packaged up nicely into a variety of virtual machines for both VirtualBox and VMware.
For this example I grabbed the OVF (Open Virtualization Format) which is designed for packaging up such appliances and was a snap to import into VirtualBox (File > Import Appliance):
When the import is complete, go ahead and start the WebSpaceServer. As this is a "just enough" distribution of OpenSolaris, there is no GUI. When the machine boots up, it will give you the HTTP address of the server, which in my case is 10.0.1.14:
Note, the machine takes anywhere from 5-10 minutes to complete service startup. To monitor its progress, you can log in using template / template.
One thing you probably want to do right off the bat is enable ssh so you can log into the server remotely:
svcadm enable ssh
Then, from another host:
bleonard@opensolaris:~$ ssh template@10.0.1.14
The authenticity of host '10.0.1.14 (10.0.1.14)' can't be established.
RSA key fingerprint is 7c:6d:df:62:e2:ec:1f:a5:d7:2a:5f:3f:72:9a:ac:45.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.1.14' (RSA) to the list of known hosts.
Password:
Last login: Tue Jun 30 03:55:15 2009
Welcome to a Sun GlassFish Web Space Server and OpenSolaris virtual machine image.
Use of this virtual machine image is subject to the license terms found in /etc/notices.
template@webspace:~$
You need to wait for the GlassFish server to finish starting up. You can check the domain2 SMF service:
template@webspace:~$ svcs -l domain2
fmri         svc:/application/SUNWappserver/domain2:default
name         Appserver Domain Administration Server
enabled      true
state        offline
next_state   online
state_time   Tue Jun 30 04:35:59 2009
logfile      /var/svc/log/application-SUNWappserver-domain2:default.log
restarter    svc:/system/svc/restarter:default
contract_id  46
dependency   require_all/none svc:/milestone/network:default (online)
dependency   require_all/none svc:/system/filesystem/local:default (online)
Here you can see its current state is offline, but it's transitioning to online. You can tail the server log file to track its progress:
tail -f /opt/webspace/glassfish2/domains/domain2/logs/server
Ultimately, you'll see a line like the following:
[#|2009-06-30T11:44:37.899+0000|INFO|sun-appserver2.1|javax.enterprise.system.core|_ThreadID=10;_ThreadName=main;|Application server startup complete.|#
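If you'd rather have a script tell you when the server is up than watch the tail output yourself, you can look for that message in the log lines. Here's a rough sketch in Python; the field layout is inferred only from the single sample line above, and parse_log_record and startup_complete are hypothetical helper names.

```python
STARTUP_MARKER = "Application server startup complete."


def parse_log_record(line):
    """Split one log record into timestamp, level and message.

    Layout assumed from the sample line:
    [#|timestamp|level|product|logger|thread info|message|#
    """
    fields = line.strip().split("|")
    return {"timestamp": fields[1], "level": fields[2], "message": fields[6]}


def startup_complete(lines):
    """Return True once any line carries the startup marker."""
    return any(STARTUP_MARKER in line for line in lines)


# Sample copied from the log line above.
SAMPLE = (
    "[#|2009-06-30T11:44:37.899+0000|INFO|sun-appserver2.1|"
    "javax.enterprise.system.core|_ThreadID=10;_ThreadName=main;|"
    "Application server startup complete.|#"
)

record = parse_log_record(SAMPLE)
print(record["timestamp"], record["level"], record["message"])
print(startup_complete([SAMPLE]))
```

A message that itself contains a | character would break this simple split, so treat it as a starting point only.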
Once that is complete, you can then browse to the server. It has a professional home page with links to try the Web Space Server and a Quick Start Tour. Webmin is also included for managing MySQL and OpenSolaris.
When you first try the Web Space Server, be patient as it configures itself to run for the first time. Eventually, the Welcome page will appear:
The Crossbow project is probably the most exciting new feature in OpenSolaris 2009.06. In a nutshell, project Crossbow brings virtualization to the networking layer. In this quick example I'm going to create a virtual network interface card (VNIC) and dynamically alter its maximum bandwidth as traffic is flowing over it.
The VNIC must be linked to an actual network interface device. To see the existing devices on the system:
bleonard@opensolaris:~$ dladm show-phys
LINK         MEDIA                STATE      SPEED  DUPLEX    DEVICE
e1000g0      Ethernet             up         1000   full      e1000g0
iwh0         WiFi                 down       0      unknown   iwh0
vboxnet0     Ethernet             unknown    0      unknown   vboxnet0
The current device in use is e1000g0, so I'll create my VNIC over that device:
pfexec dladm create-vnic -l e1000g0 vnic0
View the new VNIC:
bleonard@opensolaris:~$ dladm show-vnic
LINK         OVER         SPEED  MACADDRESS           MACADDRTYPE         VID
vnic0        e1000g0      1000   2:8:20:e2:77:62      random              0
Note its speed matches that of the physical link. Let's reduce this from 1000 megabits/second to 2 megabits/second:
pfexec dladm set-linkprop -p maxbw=2 vnic0
Before the VNIC can be used, it must be plumbed:
pfexec ifconfig vnic0 plumb
And assigned an IP address. If you're using DHCP, the following will work:
pfexec ifconfig vnic0 dhcp start
Now you can see vnic0 using ifconfig:
bleonard@opensolaris:~$ ifconfig vnic0
vnic0: flags=1104843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,ROUTER,IPv4> mtu 1500 index 7
        inet 10.0.1.16 netmask ffffff00 broadcast 10.0.1.255
The VNIC has been assigned IP address 10.0.1.16 and is now ready for use. To test it out, we'll copy a large file from one host to another, over the VNIC. To test this on a single machine, use a virtual machine configured with bridged networking or a zone. From the other host:
bleonard@os200906:~$ mkfile 100M big-file
bleonard@os200906:~$ scp big-file bleonard@10.0.1.16:bile-file
The authenticity of host '10.0.1.16 (10.0.1.16)' can't be established.
RSA key fingerprint is f7:1d:2c:d7:24:e3:1c:57:53:0f:59:75:31:4a:0f:7d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.1.16' (RSA) to the list of known hosts.
Password:
big-file 0% | | 128 KB 13:22 ETA
Notice it's estimated to take over 13 minutes to copy this file. While the file is being copied, back on the host machine, change the maxbw property to 1000 megabits / second:
pfexec dladm set-linkprop -p maxbw=1000 vnic0
Immediately you'll notice the copy operation speed up and quickly complete:
big-file 100% |*****************************| 100 MB 00:37
Very cool!
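For a feel of what the maxbw cap means for a copy like this one, here's some back-of-the-envelope arithmetic in Python. It's only a sketch under stated assumptions: that the 100M file is 100 mebibytes, that a megabit is 1,000,000 bits, and that protocol and encryption overhead are ignored. The function name transfer_seconds is mine.

```python
def transfer_seconds(size_mebibytes, link_megabits_per_second):
    """Ideal time to move a file over a link capped at the given rate."""
    bits = size_mebibytes * 1024 * 1024 * 8
    return bits / (link_megabits_per_second * 1_000_000)


slow = transfer_seconds(100, 2)      # the file with maxbw=2
fast = transfer_seconds(100, 1000)   # the same file with maxbw=1000
print(f"at 2 Mbps: about {slow / 60:.1f} minutes")
print(f"at 1000 Mbps: about {fast:.2f} seconds")
```

The ETA that scp prints right at the start of a copy is a rough early estimate, and the measured time above includes the slow stretch before the cap was raised, so don't expect the real numbers to line up exactly with these ideal figures.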
The Japan OpenSolaris Community got together on Saturday. Nice day (and night). About 60 people came by for the three sessions, two of which were in Japanese and the third in English. Then all three groups came together for a nomikai. I think the model works well to start integrating the Japanese and international OpenSolaris communities.
I used a new lens for this event. My f/1.4 lens is getting fixed, so I borrowed Jon's 50mm f/1.2, which is one scary smart lens. It's a tad expensive, too, so I was more than a little nervous shooting with it. Anyway, at f/1.2 the focus is just razor thin. Focus on someone's glasses and their entire face is out. I messed up a few images that way, but by the end of the night I was getting used to it. Amazing piece of glass. By the way, you can see Jon's stuff here. He's one of the best photographers around.
I've been checking out the Tokyo Hackerspace gmail list for a few weeks. Looks very interesting. The project grew out of some discussions at BarCamp Tokyo a couple of months ago, and I spoke to Karamoon about it at the OpenSolaris community event this weekend. In a world of ever expanding global digital communities, it seems like a nice idea to have a very local and very physical space to hang out in and hack on things that need hacking. Global and digital are fine, but local and physical are needed too. For info, check it out on the wiki.
I had the privilege to be a guest on FLOSS Weekly with Leo Laporte and Jono Bacon this week, thanks guys! Of course Aaron and David had done awesome groundwork with an interview on ZFS a few weeks earlier. It was a fun hour, and I enjoyed it, though I can think of many things I’d answer differently now! Looking forward to catching up with Jono and others at the Community Leadership Summit next month in San Jose, the weekend before OSCON.
And yes, OpenSolaris is officially ‘not bollocks’. Check it out!
Four members of the Japan OpenSolaris Community wrote a book on ZFS recently. It's coming out in July, and it's specifically for the Japanese market. The cover has not been selected yet, but here are the early details: ZFS 仮想化されたファイルシステムの徹底活用 (大型本) by Hisayoshi Kato, Michitoshi Sato, Nagahara Niroharu, and Imai Satoshi. This is quite a significant contribution to the community in Japan because it's important to have technical content written by Japanese engineers for Japanese engineers. Translating English content from the west is good, of course, but the generation of original content in Japanese also needs to be part of this community's growth plans.
Here are some more books on OpenSolaris: http://blogs.sun.com/jimgris/tags/book
I previously posted screenshots of the L2ARC: the ZFS second level cache which uses read optimized SSDs ("Readzillas") to cache random read workloads. That's the read side of the Hybrid Storage Pool. On the write side, the ZFS separate intent log (SLOG) can use write optimized SSDs (which we call "Logzillas") to accelerate performance of synchronous write workloads.
In the screenshots that follow I'll show how Logzillas have delivered 12x more IOPS and over 20x reduced latency for a synchronous write workload over NFS. These screenshots are from Analytics on the Sun Storage 7000 series. In particular, the following heat map shows the dramatic reduction in NFS latency when turning on the Logzillas:
Before the 17:54 line, latency was at the mercy of spinning 7200 RPM disks, and reached as high as 9 ms. After 17:54, these NFS operations were served from Logzillas, which consistently delivered latency much lower than 1 ms. Click for a larger screenshot.
While the use of Logzilla devices can dramatically improve the performance of synchronous write workloads, what is a synchronous write?
When an application writes data, if it waits for that data to complete writing to stable storage (such as a hard disk), it is a synchronous write. If the application requests the write but continues on without waiting (the data may be buffered in the filesystem's DRAM cache, but not yet written to disk), it is an asynchronous write. Asynchronous writes are faster for the application, and are the default behavior.
There is a down side to asynchronous writes - the application doesn't know if the write completed successfully. If there is a power outage before the write could be flushed to disk, the write will be lost. For some applications such as database log writers, this risk is unacceptable - and so they perform synchronous writes instead.
There are two forms of synchronous writes: individual I/O which is written synchronously, and groups of previous writes which are synchronously committed.
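The two forms can be sketched in a few lines of Python. This is only an illustration, assuming a POSIX system; the function names are hypothetical, and the fallback from O_DSYNC to O_SYNC is my own assumption for platforms that lack O_DSYNC.

```python
import os
import tempfile

# Assumption: a POSIX system. O_DSYNC is not defined everywhere, so this
# sketch falls back to O_SYNC when it is missing.
SYNC_FLAG = getattr(os, "O_DSYNC", os.O_SYNC)


def write_each_synchronously(path, records):
    """Form 1: each write() returns only once its data is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | SYNC_FLAG, 0o644)
    try:
        for record in records:
            os.write(fd, record)
    finally:
        os.close(fd)


def write_then_commit(path, records):
    """Form 2: asynchronous writes, then one synchronous commit of the group."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for record in records:
            os.write(fd, record)  # may only reach the filesystem cache
        os.fsync(fd)              # blocks until the earlier writes are stable
    finally:
        os.close(fd)


if __name__ == "__main__":
    records = [b"x" * 512 for _ in range(8)]
    with tempfile.TemporaryDirectory() as directory:
        for fn in (write_each_synchronously, write_then_commit):
            target = os.path.join(directory, fn.__name__)
            fn(target, records)
            print(fn.__name__, os.path.getsize(target), "bytes")
```

Both functions leave the same bytes in the file; the difference is how many times the application waits for stable storage (once per record versus once per group).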
Write I/O will become synchronous when:
Rather than synchronously writing each individual I/O, an application may synchronously commit previous writes at logical checkpoints in the code. This can improve performance by grouping the synchronous writes. On ZFS these may also be handled by the SLOG and Logzilla devices. These are performed by:
Applications that are expected to perform synchronous writes include:
The second example is easy to demonstrate. Here I've taken firefox-1.5.0.6-source.tar, a tar file containing 43,000 small files, and unpacked it to an NFS share (tar xf.) This will create thousands of small files, which becomes a synchronous write workload as these files are created. Writing of the file contents will be asynchronous; it is just the act of creating the file entries in their parent directories which is synchronous. I've unpacked this tar file twice, the first time without Logzilla devices and the second time with. The difference is clear:
Without logzillas, it took around 20 minutes to unpack the tar file. With logzillas, it took around 2 minutes - 10x faster. This improvement is also visible as the higher number of NFS ops/sec, and the lower NFS latency in the heat map.
For this example I used a Sun Storage 7410 with 46 disks and 2 Logzillas, stored in 2 JBODs. The disks and Logzillas can serve many more IOPS than I've demonstrated here, as I'm only running a single threaded application from a single modest client as a workload.
How do you know if your particular write workload is synchronous or asynchronous? There are many ways to check; here I'll describe a scenario where a client side application is performing writes to an NFS file server.
To determine if your application is performing synchronous writes, one way is to debug the open system calls. The following example uses truss on Solaris (try strace on Linux) on two different programs:
client# truss -ftopen ./write
4830: open64("outfile", O_WRONLY|O_CREAT, 0644) = 3
client# truss -ftopen ./odsync-write
4832: open64("outfile", O_WRONLY|O_DSYNC|O_CREAT, 0644) = 3
The second program (odsync-write) opened its file with the O_DSYNC flag, so subsequent writes will be synchronous. In these examples, the program was executed by the debugger, truss. If the application is already running, on Solaris you can use pfiles to examine the file flags for processes (on Linux, try lsof.)
Apart from tracing the open() syscall, also look for frequent fsync() or fdsync() calls, which synchronously commit the previously written data.
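If you have a lot of trace output to go through, a small filter helps. The following is a rough sketch, not a general truss or strace parser: the function name and regular expression are my own, and it only recognizes open() and open64() lines in the format shown above.

```python
import re

# Flags that make subsequent writes on the descriptor synchronous.
SYNC_FLAGS = {"O_SYNC", "O_DSYNC"}

# Matches lines such as:
#   4832: open64("outfile", O_WRONLY|O_DSYNC|O_CREAT, 0644) = 3
# Assumption: only open() and open64() in this exact layout are handled.
OPEN_CALL = re.compile(r'\bopen(?:64)?\("([^"]*)",\s*([A-Z0-9_|]+)')


def sync_opens(trace_lines):
    """Return (path, flags) for each traced open that requested sync writes."""
    found = []
    for line in trace_lines:
        match = OPEN_CALL.search(line)
        if not match:
            continue
        path, flags = match.group(1), match.group(2).split("|")
        if SYNC_FLAGS.intersection(flags):
            found.append((path, flags))
    return found


if __name__ == "__main__":
    trace = [
        '4830: open64("outfile", O_WRONLY|O_CREAT, 0644) = 3',
        '4832: open64("outfile", O_WRONLY|O_DSYNC|O_CREAT, 0644) = 3',
    ]
    for path, flags in sync_opens(trace):
        print(path, "|".join(flags))
```

Fed the two trace lines from the example above, it reports only the second open, the one carrying O_DSYNC.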
If you'd like to determine if you have a synchronous write workload from the NFS server itself, you can't run debuggers like truss on the target process (since it's running on another host), so instead we'll need to examine the NFS protocol. You can do this either with packet sniffers (snoop, tcpdump), or with DTrace if it and its NFS provider are available:
topknot# dtrace -n 'nfsv3:::op-write-start { @[args[2]->stable] = count(); }'
dtrace: description 'nfsv3:::op-write-start ' matched 1 probe
^C
2 1398
Here I've frequency counted the "stable" member of the NFSv3 write protocol, which was "2" some 1,398 times. 2 is a synchronous write workload (FILE_SYNC), and 0 is for asynchronous:
enum stable_how {
UNSTABLE = 0,
DATA_SYNC = 1,
FILE_SYNC = 2
};
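To turn a DTrace aggregation like the one above into a quick summary, something along these lines would do. It is a sketch under assumptions: the helper names are hypothetical, the input is assumed to be "value count" pairs one per line, and I treat both DATA_SYNC and FILE_SYNC as synchronous since both require data to reach stable storage before the server replies.

```python
from enum import IntEnum


class StableHow(IntEnum):
    """Mirror of the NFSv3 stable_how enum shown above."""
    UNSTABLE = 0
    DATA_SYNC = 1
    FILE_SYNC = 2


def parse_aggregation(text):
    """Parse 'value count' lines (assumed layout) into {StableHow: count}."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 or not all(f.isdigit() for f in fields):
            continue  # skip headers, blank lines and the ^C marker
        how = StableHow(int(fields[0]))
        counts[how] = counts.get(how, 0) + int(fields[1])
    return counts


def summarize(counts):
    """Return total writes, synchronous writes and the synchronous fraction."""
    total = sum(counts.values())
    sync = sum(n for how, n in counts.items() if how != StableHow.UNSTABLE)
    return {
        "total": total,
        "synchronous": sync,
        "sync_fraction": sync / total if total else 0.0,
    }


if __name__ == "__main__":
    print(summarize(parse_aggregation("^C\n2 1398\n")))
```

For the output above, every one of the 1,398 writes was FILE_SYNC, so the synchronous fraction is 1.0.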
If you don't have DTrace available, you'll need to dig this out of the NFS protocol headers. For example, using snoop:
# snoop -d nxge4 -v | grep NFS
[...]
NFS: ----- Sun NFS -----
NFS:
NFS: Proc = 7 (Write to file)
NFS: File handle = [A2FC]
NFS: 7A6344B308A689BA0A00060000000000073000000A000300000000001F000000
NFS: Offset = 96010240
NFS: Size = 8192
NFS: Stable = FSYNC
NFS:
That's an example of a synchronous write.
The NFS client may be performing a synchronous-write-like workload by frequently calling the NFS commit operation. To identify this, use whichever tool is available to show NFS operations broken down by type (Analytics on the Sun Storage 7000; nfsstat -s on Solaris).
The following diagram shows the difference when adding a separate intent log (SLOG) to ZFS:
Major components:
For the most detailed information of these and the other components of ZFS, you can browse their source code at the ZFS Source Tour.
When data is asynchronously written to ZFS, it is buffered in memory and gathered periodically by the DMU as transaction groups, and then written to disk. Transaction groups either succeed or fail as a whole, and are part of the ZFS design to always deliver on disk data consistency.
This periodical writing of transaction groups can improve asynchronous writes by aggregating data and streaming it to disk. However the interval of these writes is in the order of seconds, which makes it unsuitable to serve synchronous writes directly - as the application would stall until the next transaction group was synced. Enter the ZIL.
The ZIL handles synchronous writes by immediately writing their data and information to stable storage, an "intent log", so that ZFS can claim that the write completed. The written data hasn't reached its final destination on the ZFS filesystem yet, that will happen sometime later when the transaction group is written. In the meantime, if there is a system failure, the data will not be lost as ZFS can replay that intent log and write the data to its final destination. This is a similar principle as Oracle redo logs, for example.
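The principle can be shown with a toy model. This is not how ZFS is implemented; it is a deliberately tiny sketch with hypothetical names, keeping only the three ideas from the text: buffered data, an intent log on stable storage, and replay after a failure.

```python
class ToyPool:
    """Toy model of transaction groups plus an intent log (not real ZFS)."""

    def __init__(self):
        self.on_disk = {}     # data at its final location; survives a crash
        self.intent_log = []  # stable log of synchronous writes; survives a crash
        self.in_memory = {}   # buffered writes awaiting the next transaction group

    def write(self, key, value, sync=False):
        if sync:
            # Log first, so the write can be acknowledged straight away.
            self.intent_log.append((key, value))
        self.in_memory[key] = value

    def commit_txg(self):
        """Write buffered data to its final location, then discard the log."""
        self.on_disk.update(self.in_memory)
        self.in_memory.clear()
        self.intent_log.clear()

    def crash(self):
        """Lose everything that was held only in memory."""
        self.in_memory.clear()

    def replay(self):
        """After a crash, apply logged writes to their final location."""
        for key, value in self.intent_log:
            self.on_disk[key] = value
        self.intent_log.clear()


if __name__ == "__main__":
    pool = ToyPool()
    pool.write("db-log", "commit 42", sync=True)  # acknowledged once logged
    pool.write("scratch", "temp data")            # buffered only
    pool.crash()                                  # before any commit_txg()
    pool.replay()
    print(pool.on_disk)  # {'db-log': 'commit 42'}
```

The synchronous write survives the crash because it was in the log; the asynchronous one is lost, which is the trade-off described earlier.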
In the above diagram on the left, the ZIL stores the log data on the same disks as the ZFS pool. While this works, the performance of synchronous writes can suffer as they compete for disk access along with the regular pool requests (reads, and transaction group writes.) Having two different workloads compete for the same disk can negatively affect both workloads, as the heads seek between the hot data for one and the other. The solution is to give the ZIL its own dedicated "log devices" to write to, as pictured on the right. These dedicated log devices form the separate intent log: the SLOG.
By writing to dedicated log devices, we can improve performance further by choosing a device which is best suited for fast writes. Enter Logzilla. Logzilla was the name we gave to write-optimized flash-memory-based solid state disks (SSDs.) Flash memory is known for slow write performance, so to improve this the Logzilla device buffers the write in DRAM and uses a super-capacitor to power the device long enough to write the DRAM buffer to flash, should it lose power. These devices can write an I/O as fast as 0.1 ms (depends on I/O size), and do so consistently. By using them as our SLOG, we can serve synchronous write workloads consistently fast.
It's worth mentioning that when ZFS synchronously writes to disk, it uses new ioctl()s to ensure that the data is flushed properly to the disk platter, and isn't just buffered on the disk's write cache (which is a small amount of DRAM inside the disk itself.) Which is exactly what we want to happen - that's why these writes are synchronous. Other filesystems didn't bother to do this (eg, UFS), and believed that data had reached stable storage when it may have just been cached. If there was a power outage, and the disk's write cache is not battery backed, then those writes would be lost - which means data corruption for the filesystem. Since ZFS waits for the disk to properly write out the data, synchronous writes on ZFS are slower than other filesystems - but they are also correct. There is a way to turn off this behaviour in ZFS (zfs_nocacheflush), and ZFS will perform as fast or faster than other filesystems for synchronous writes - but you've also sacrificed data integrity, so this is highly unrecommended. By using fast SLOG devices on ZFS, we get both speed and data integrity.
I won't go into too much more detail of the inner workings of the SLOG, which is best described by the author Neil Perrin.
To create a screenshot to best show the effect of Logzillas, I applied a synchronous write workload over NFS from a single threaded process on a single client. The target was the same 7410 used for the tar test, which has 46 disks and 2 Logzillas. At first I let the workload run with the Logzillas disabled, then enabled them:
This is the effect of adding 2 Logzillas to a pool of 46 disks, on both latency and IOPS. My last post discussed the odd latency pattern that these 7200 RPM disks cause for synchronous writes, which looks a bit like an icy lake.
The latency heat map shows the improvement very well, but it's also shown something unintended: These heat maps use a false color palette which draws the faintest details much darker than they (linearly) should be, so that they are visible. This has made visible a minor and unrelated performance issue: those faint vertical trails on the right side of the plot. These occurred every 30 seconds, when ZFS flushed a transaction group to disk - which stalled a fraction of NFS I/O while this happened. The fraction is so small it's almost lost in the rounding, but appears to be about 0.17% when examining the left panel numbers. While minor, we are working to fix it. This perf issue is known internally as the "picket fence". Logzilla is still delivering a fantastic improvement over 99% of the time.
To quantify the Logzilla improvement, I'll now zoom to the before and after periods to see the range averages from Analytics:
With just the 46 disks:
Reading the values from the left panels: with just the 46 disks, we've averaged 143 NFS synchronous writes/sec. Latency has reached as high as 8.93ms - most of the 8 ms would be worst case rotational latency for a 7200 RPM disk.
46 disks + 2 Logzillas:
Nice - a tight latency cloud from 229 to 305 microseconds, showing very fast and consistent responses from the Logzillas.
We've now averaged 1,708 NFS synchronous writes/sec - 12x faster than without the Logzillas. Beyond 381 us, latency has rounded down to zero ops/sec, and was mostly in the 267 to 304 microsecond range. Previously latency stretched from a few to over 8 ms, making the delivered NFS latency improvement reach over 20x.
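The arithmetic behind these figures is easy to check. The sketch below uses only numbers quoted above; comparing the 8.93 ms worst case against the 381 microsecond cutoff is my own choice of how to read "over 20x", so treat that ratio as one reasonable interpretation.

```python
def full_rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def ratio(larger, smaller):
    """How many times larger one figure is than another."""
    return larger / smaller


if __name__ == "__main__":
    # Worst case rotational latency for a 7200 RPM disk.
    print(round(full_rotation_ms(7200), 2))  # 8.33

    # NFS synchronous writes/sec: 1,708 with Logzillas vs 143 without.
    print(round(ratio(1708, 143), 1))        # 11.9

    # Latency: 8.93 ms worst case before vs 0.381 ms cutoff after (assumed pairing).
    print(round(ratio(8.93, 0.381), 1))      # 23.4
```

So "12x" is 11.9 rounded, and the 8.33 ms full rotation accounts for most of the 8.93 ms worst case seen on the disks.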
This 7410 and these Logzillas can handle much more than 1,708 synchronous writes/sec, if I apply more clients and threads (I'll show a higher load demo in a moment.)
Another way to see the effect of Logzillas is through disk I/O. The following shows before and after in separate graphs:
I've used the Hierarchical breakdown feature to highlight the JBODs in green. In the top plot (before), the I/O to each JBOD was equal - about 160 IOPS. After enabling the Logzillas, the bottom plot shows one of the JBODs is now serving over 1800 IOPS - which is the JBOD with the Logzillas in.
There are spikes of IOPS every 30 seconds in both these graphs. That is when ZFS flushes a transaction group to disk - committing the writes to their final location. Zooming into the before graph:
Instead of highlighting JBODs, I've now highlighted individual disks (yes, the pie chart is pretty now.) The non-highlighted slice is for the system disks, which are recording the numerous Analytics statistics I've enabled.
We can see that all the disks have data flushed to them every 30 seconds, but they are also being written to constantly. This constant writing is the synchronous writes being written to the ZFS intent log, which is being served from the pool of 46 disks (since it isn't a separate intent log, yet.)
After enabling the Logzillas:
Now those constant writes are being served from 2 disks: HDD 20 and HDD 16, which are both in the /2029...012 JBOD. Those are the Logzillas, and are our separate intent log (SLOG):
I've highlighted one in this screenshot, HDD 20.
The 2 Logzillas I've been demonstrating have been acting as individual log devices. ZFS lists them like this:
topknot# zpool status
pool: pool-0
state: ONLINE
[...]
NAME STATE READ WRITE CKSUM
pool-0 ONLINE 0 0 0
[...many lines snipped...]
mirror ONLINE 0 0 0
c2t5000CCA216CCB913d0 ONLINE 0 0 0
c2t5000CCA216CE289Cd0 ONLINE 0 0 0
logs
c2tATASTECZEUSIOPS018GBYTES00000AD6d0 ONLINE 0 0 0
c2tATASTECZEUSIOPS018GBYTES00000CD8d0 ONLINE 0 0 0
spares
c2t5000CCA216CCB958d0 AVAIL
c2t5000CCA216CCB969d0 AVAIL
Log devices can be mirrored instead (see zpool(1M)), and zpool will list them as part of a "mirror" under "logs". On the Sun Storage 7000 series, it's configurable from the BUI (and CLI) when creating the pool:
With a single Logzilla, if there was a system failure when a client was performing synchronous writes, and if that data was written to the Logzilla but hadn't been flushed to disk, and if that Logzilla also fails, then data will be lost. So it is system failure + drive failure + unlucky timing. ZFS should still be intact.
It sounds a little paranoid to use mirroring, however from experience system failures and drive failures sometimes go hand in hand, so it is understandable that this is often chosen for high availability environments. The default for the Sun Storage 7000 config when you have multiple Logzillas is to mirror them.
While the Logzillas can greatly improve performance, it's important to understand which workloads this is for and how well they work, to help set realistic expectations. Here's a summary:
So far I've only used a single threaded process to apply the workload. To push the 7410 and Logzillas to their limit, I'll multiply this workload by 200. To do this I'll use 10 clients, and on each client I'll add another workload thread every minute until it reaches 20 per client. Each workload thread performs 512 byte synchronous write ops continually. I'll also upgrade the 7410: I'll now use 6 JBODs containing 8 Logzillas and 132 data disks, to give the 7410 a fighting chance against this bombardment.
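For a feel of what one client's share of such a workload looks like, here is a minimal sketch. It is not the tool used for this demo: the function names are hypothetical, all threads start at once rather than ramping up one per minute, each thread does a fixed number of writes instead of running continually, and it assumes a POSIX system.

```python
import os
import tempfile
import threading
import time

# Assumption: POSIX. Fall back to O_SYNC where O_DSYNC is not defined.
SYNC_FLAG = getattr(os, "O_DSYNC", os.O_SYNC)


def sync_writer(path, writes, results, index, size=512):
    """One workload thread: repeated synchronous writes of `size` bytes."""
    buf = b"\0" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | SYNC_FLAG, 0o644)
    try:
        for _ in range(writes):
            os.pwrite(fd, buf, 0)  # returns only after the data is stable
        results[index] = writes
    finally:
        os.close(fd)


def run_workload(directory, threads, writes_per_thread):
    """Run the threads in parallel; return (total ops, ops per second)."""
    results = [0] * threads
    workers = [
        threading.Thread(
            target=sync_writer,
            args=(os.path.join(directory, f"file{i}"), writes_per_thread, results, i),
        )
        for i in range(threads)
    ]
    start = time.monotonic()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    elapsed = time.monotonic() - start
    total = sum(results)
    return total, (total / elapsed if elapsed > 0 else float("inf"))


if __name__ == "__main__":
    # Point `directory` at an NFS mount to exercise a server; a temporary
    # local directory is used here so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as directory:
        total, rate = run_workload(directory, threads=4, writes_per_thread=100)
        print(f"{total} synchronous writes, {rate:.0f} ops/sec")
```

The ops/sec it reports depends entirely on where the directory lives, so no particular number should be expected from it.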
We've reached 114,000 synchronous write ops/sec over NFS, and the delivered NFS latency is still mostly less than 1 ms - awesome stuff! We can study the numbers in the left panel of the latency heat map to quantify this: there were 132,453 ops/sec total, and (adding up the numbers) 112,121 of them completed faster than 1.21 ms - which is 85%.
The bottom plot suggests we have driven this 7410 close to its CPU limit (for what is measured as CPU utilization, which includes bus wait cycles.) The Logzillas are only around 30% utilized. The CPUs are 4 sockets of 2.3 GHz quad core Opteron - and for them to be the limit means a lot of other system components are performing extremely well. There are now faster CPUs available, which we need to make available in the 7410 to push that ops/sec limit much higher than 114k.
Ouch! We are barely reaching 7,000 sync write ops/sec, and our latency is pretty ordinary - often 15 ms and higher. These 7,200 RPM disks are not great to serve synchronous writes from alone, which is why Adam created the Hybrid Storage Pool model.
When comparing the above two screenshots, note the change in the vertical scale. With Logzillas, the NFS latency heat map is plotted to 10 ms; without, it's plotted to 100 ms to span the higher latency from the disks.
The full description of the server and clients used in the above demo is:
The clients are a little old. The server is new, and is a medium sized 7410.
I've been comparing the performance of write optimized SSDs vs cheap 7,200 RPM SATA disks as the primary storage. What if I used high quality 15K RPM disks instead?
Rotational latency will be half. What about seek latency? Perhaps if the disks are of higher quality the heads move faster, but they also have to move further to span the same data (15K RPM disks are usually lower density than 7,200 RPM disks), so that could cancel out. They could have a larger write cache, which would help. Let's be generous and say that 15K RPM disks are 3x faster than 7,200 RPM disks.
So instead of reaching 7,000 sync write ops/sec, perhaps we could reach 21,000 if we used 15K RPM disks. That's a long way short of 114,000 using Logzilla SSDs. Comparing the latency improvement also shows this is no contest. In fact, to comfortably match the speed of current SSDs, disks would need to spin at 100K RPM and faster. Which may never happen. SSDs, on the other hand, are getting faster and faster.
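The estimate can be worked through as a back-of-envelope calculation. This sketch uses only the figures quoted above plus the "generous" 3x factor; the constant and function names are hypothetical, and rotation time is just one component of disk latency, so this is not a model of real drives.

```python
DISK_ONLY_OPS = 7_000   # sync write ops/sec measured with 7,200 RPM disks alone
LOGZILLA_OPS = 114_000  # sync write ops/sec measured with Logzillas
GENEROUS_FACTOR = 3     # the assumed 15K RPM advantage from the text


def full_rotation_ms(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000.0 / rpm


def estimate_15k_ops(measured_ops=DISK_ONLY_OPS, factor=GENEROUS_FACTOR):
    """Scale the measured disk-only result by the assumed speedup."""
    return measured_ops * factor


if __name__ == "__main__":
    print(round(full_rotation_ms(15_000), 2))             # 4.0
    print(round(full_rotation_ms(100_000), 2))            # 0.6
    print(estimate_15k_ops())                             # 21000
    print(round(LOGZILLA_OPS / estimate_15k_ops(), 1))    # 5.4
```

Even with the generous factor, the Logzilla result is still more than 5x ahead, and a full rotation only drops below 1 ms at around 100K RPM.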
For the SLOG devices, I've been using Logzillas - write optimized SSDs. It is possible to use NVRAM cards as SLOG devices, which should deliver much faster writes than Logzilla can. While these may be an interesting option, there are some down sides to consider:
Switching to NVRAM cards may also make little relative difference for NFS clients over the network, when other latencies are factored. It may be an interesting option, but it's not necessarily better than Logzillas.
For synchronous write workloads, the ZFS intent log can use dedicated SSD based log devices (Logzillas) to greatly improve performance. The screenshots above from Analytics on the Sun Storage 7000 series showed 12x more IOPS and over 20x reduced latency.
Thanks to Neil Perrin for creating the ZFS SLOG, Adam Leventhal for designing it into the Hybrid Storage Pool, and Sun's Michael Cornwell and STEC for making the Logzillas a reality.
Logzillas as SLOG devices work great, making the demos for this blog entry easy to create. It's not every day you find a performance technology that can deliver 10x and higher.
Apart from KDE4, Gnome 2.26 will also be available in BeleniX 0.8. I pulled the Desktop Consolidation trunk (JDS) and built it with a bunch of changes. The packages are already available in the package repository trunk, but upgrading to trunk is not yet recommended as it is seeing a lot of churn at present. One of the things pending is to replace the default OpenSolaris branding with BeleniX branding. Below is a screenshot showing Gnome 2.26 + Compiz + Avant + Google Gadgets + Webkit on my box. The Gnome developer help documentation browser is built with Webkit support.

In perhaps a new trend, I’m blogging from 39011 feet (or so says the seatback in front of me). I’m traveling back home to the east coast from San Jose, CA where I attended (and spoke at) this year’s O’Reilly Velocity Conference.
I participated in (and blogged about) the Velocity Summit, as I have for the past two years. The summit is the unconference preceding the real conference that helps the organizers digest current hot topics and better define the conference track for the actual conference. The summit itself is filled with enough brain power to warp space-time, so I drop everything to go to that.
Ironically, despite being a well respected authority in web site (and general internet) scalability and performance, my talk proposals for Velocity 2008 were not accepted — I clearly need to write better proposals. This year, I managed to work my way into the workshop track on Monday. Despite having a bad headache and feeling "off" the day before, I managed to get my act together and put on an A-game for my workshop. For those of you interested, here is my scalable09 slide stack.
I thought I’d take a moment to talk about what I liked about the conference and what I think could use some improvement. I realize this is a down economy and that might be a legitimate justification for some of the actions that resulted in some of my disappointment.
First, the negative. I usually start with positive and end with negative because I’m a pessimist. However, all in all the conference was awesome, so I thought I’d get my short list of gripes out of the way early.
O’Reilly is infamous for throwing good conferences for geeks. In my opinion, the field of web