
July 02, 2009

OpenSolaris Power User Tutorial at OSCON

I really should have posted this quite some time ago, but between getting the OpenSolaris 2009.06 release out, speaking at CommunityOne, speaking at the OpenSolaris user group in New York, and trying to sleep once in a while, it's been a little tough to keep up.  Anyway, Nick and I are giving a three-hour OpenSolaris tutorial at OSCON 2009 on July 21. Looking at the content draft, we've probably got more like five hours of material, but we'll figure out how to cram most of it in.  Even if you've read OpenSolaris Bible you're likely to learn a lot, as a fair amount of the material is on technology that's not covered in the book, such as Crossbow and the Automated Installer.  I'm also expecting to spend some time wandering around at the conference, so hope to see you there!


OSCON 2009


July 01, 2009

NEW: OpenSolaris Immutable Service Containers

While the need for security and integrity is well-recognized, it is less often well-implemented. Security assessments and industry reports regularly show how sporadic and inconsistent security configurations become for organizations both large and small. Published recommended security practices and settings remain unused in many environments and existing, once secured, deployments suffer from atrophy due to neglect.

Why is this? There is no one answer. Some organizations are simply unaware of the security recommendations, tools, and techniques available to them. Others lack the necessary skill and experience to implement the guidance and maintain secured configurations. It is not uncommon for these organizations to feel overwhelmed by the sheer number of recommendations, settings and options. Still others may feel that security is not an issue in their environment. The list goes on and on, yet the need for security and integrity has never been more important.

Interestingly, the evolution and convergence of technology is cultivating new ideas and solutions to help organizations better protect their services and data. One such idea is being demonstrated by the Immutable Service Container (ISC) project. Immutable Service Containers are an architectural deployment pattern used to describe a platform for highly secure service delivery. Building upon concepts and functionality enabled by operating systems, hypervisors, virtualization, and networking, ISCs provide a secured container into which a service or set of services is deployed. Each ISC embodies at its core the key principles inherent in the Sun Systemic Security framework including: self-preservation, defense in depth, least privilege, compartmentalization and proportionality. Further, ISC design borrows from Cloud Computing principles such as service abstraction, micro-virtualization, automation, and "fail in place".

Designing service delivery platforms using the Immutable Service Containers model offers a number of significant security benefits:

  • For application owners:
    • ISCs help to protect applications and services from tampering
    • ISCs provide a consistent set of security interfaces and resources for applications and services to use
  • For system administrators:
    • ISCs isolate services from one another to avoid contamination
    • ISCs separate service delivery from security enforcement/monitoring
    • ISCs can be (mostly) pre-configured by security experts
  • For IT managers:
    • ISC creation can be automated, pre-integrating security functionality and making them faster and easier to build and deploy
    • ISCs leverage industry accepted security practices making them easier to audit and support

It is expected that Immutable Service Containers will form the most basic architectural building block for more complex, highly dynamic and autonomic architectures. The goal of the ISC project is to more fully describe the architecture and attributes of ISCs, their inherent benefits, their construction as well as to document practical examples using various software applications.

While the notion of ISCs is not based upon any one product or technology, an instantiation has been recently developed using OpenSolaris 2009.06. This instantiation offers a pre-integrated configuration leveraging OpenSolaris security recommended practices and settings. With ISCs, you are not starting from a blank slate, but rather you can now build upon the security expertise of others. Let's look at the OpenSolaris-based ISC more closely.

In an ISC configuration, the global zone is treated as a system controller and exposed services are deployed (only) into their own non-global zones. From a networking perspective, however, the entire environment is viewed as a single entity (one IP address) where the global zone acts as a security monitoring and arbitration point for all of the services running in non-global zones.

As a foundation, this highly optimized environment is pre-configured with:

Further, the default OpenSolaris ISC uses:

  • Non-Global Zone. Exposed services are deployed in a non-global zone. There they can take advantage of the core security benefits enabled by OpenSolaris non-global zones such as restricted access to the kernel, memory, devices, etc. For more information on non-global zone security capabilities, see the Sun BluePrint titled "Understanding the Security Capabilities of Solaris Zones Software". Using a fresh ISC, you can simply install your service into the provided non-global zone as you normally would.

    Further in the ISC model, each non-global zone has its own encrypted scratch space (w/its own ephemeral key), its own persistent storage location, as well as a pre-configured auditing and networking configuration that matches that of the global zone. You do not need to use the encrypted scratch space or persistent storage, but it is there if you want to take advantage of it. Obviously, additional resource controls (CPU, memory, etc.) can be added as necessary. These are not pre-configured due to the variability of service payloads.

  • Solaris Auditing. A default audit policy is implemented in the global zone and all non-global zones that tracks login and logout events, administrative events as well as all commands (and command line arguments) executed on the system. The audit configuration and audit trail are kept in the global zone where they cannot be accessed by any of the non-global zones. The audit trail is also pre-configured to be delivered by SYSLOG (by default this information is captured in /var/log/auditlog).
  • Private Virtual Network. A private virtual network is configured by default for all of the non-global zones. This network isolates each non-global zone to its own virtual NIC. By default, the global and non-global zones can freely initiate external communications, although this can be restricted if needed. A non-global zone is not permitted to accept connections, by default. Non-global zone services can be exposed through the global zone IP address by adjusting the IP Filter and IP NAT policies (below).
  • Solaris IP NAT. Each non-global zone is pre-configured to have a private address assigned to its virtual NIC. To allow the non-global zone to communicate with external systems and networks, an IP NAT policy is implemented. Outgoing connections are masked using the IP address of the global zone. Incoming connections are redirected based upon the port used to communicate. Beyond simple hardening of the non-global zone (a state which can be altered from within the non-global zone itself), this mechanism ensures that the global zone can control which services are exposed by the non-global zone and on which ports.
  • Solaris IP Filter. A default packet filtering policy is implemented in the global zone allowing only DHCP (for the exposed network interface) and SSH (to the global zone). Additional rules are available (but disabled) to allow access to non-global zones on an as-needed basis. Further, rules are implemented to deny external access to any non-global zone that has changed its pre-assigned (private) IP address. Packet filtering is pre-configured to log packets to SYSLOG (by default this information is captured in /var/log/ipflog).
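To make the NAT and packet filtering items above a little more concrete, here is a minimal sketch of the kind of rules such a policy might contain, written in ipnat.conf and ipf.conf syntax. It is not taken from the ISC construction kit: the interface name, the private network, the zone address and the exposed port are all assumptions chosen for illustration, and the DHCP rule is left out.

```shell
#!/bin/sh
# Hypothetical values: none of these come from the ISC construction kit.
EXT_IF=e1000g0            # exposed interface in the global zone (assumed)
ZONE_NET=192.168.0.0/24   # private virtual network (assumed)
ZONE_IP=192.168.0.2       # private address of one non-global zone (assumed)

# Print ipnat.conf-style rules: hide outgoing zone traffic behind the
# global zone's address and redirect one inbound TCP port to the zone.
nat_rules() {
    port=$1
    echo "map $EXT_IF $ZONE_NET -> 0/32 portmap tcp/udp auto"
    echo "map $EXT_IF $ZONE_NET -> 0/32"
    echo "rdr $EXT_IF 0.0.0.0/0 port $port -> $ZONE_IP port $port tcp"
}

# Print ipf.conf-style rules: allow SSH to the global zone, allow the
# redirected service port, then log and block everything else inbound.
# Inbound NAT is applied before filtering, so the second rule matches
# the zone's private address.
filter_rules() {
    port=$1
    echo "pass in quick on $EXT_IF proto tcp from any to any port = 22 keep state"
    echo "pass in quick on $EXT_IF proto tcp from any to $ZONE_IP port = $port keep state"
    echo "block in log on $EXT_IF all"
}

nat_rules 80
filter_rules 80
```

Because both rule sets live in the global zone, a service compromised inside the non-global zone cannot rewrite them, which is the point the bullets above are making.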

So what does all of this really mean? Using the ISC model, you can deploy your services in a micro-virtualized environment that offers protection against kernel-based root kits (and some forms of user-land root kits), offers flexible file system immutability (based upon read-only file systems mounted into the non-global zone), can take advantage of process least privilege and resource controls, and is operated in a hardened environment where there is a packet filtering, NAT and auditing policy that is effectively out of the reach of the deployed service. This means that should a service be compromised in a non-global zone, it will not be able to impact the integrity or validity of the auditing, packet filtering, and NAT configuration or logs. While you may not be able to stop every form of attack, having reliable audit trails can significantly help to determine the extent of the breach and facilitate recovery.

The following diagram puts all of the pieces together:

Additional private virtual networking models are also being considered. All in all, the ISC model offers a very compelling deployment model. The accessibility and attractiveness of this model is further enhanced by the availability of an ISC construction kit that allows you to take an OpenSolaris 2009.06 system and convert it to the ISC model with a single command. Sound interesting? Give it a try, come join the project and be sure to send along your feedback!

Desktop Summit - Here I come!

The Desktop Summit, which is made up of GUADEC and AKademy, will be held this year in Gran Canaria from 3rd to 11th July. It is by far the biggest event of its kind: FOSS and totally desktop oriented.

I will be arriving on the evening of 2nd July with the whole bunch from the Desktop group at Sun. Since some of them, like Alberto and Luis, are native Canarians, I am looking forward to their local hospitality :) We will also be meeting up with hackers old and new.

Many of the talks, including GNOME Shell, GNOME 3.0 and Mobile Development, cover exciting topics that I look forward to hearing and seeing! I will be there until 9th July.

Zones

The installation of a zone in OpenSolaris is a bit different than in Solaris 10 (or SXCE) and it's due to IPS, which is unique to OpenSolaris. When you create a zone in Solaris 10, you get a native zone, which is very lightweight because it shares much of its system software with the base Solaris 10 installation. However, native zones presume you are using the SVR4 packaging system (as opposed to IPS). Therefore, OpenSolaris uses a branded zone called ipkg.

The ipkg branded zone doesn't share any of its system software with the base OpenSolaris installation. As a matter of fact, when installed, it's not even copied from the base installation, but rather downloaded from an IPS repository. Obviously this makes working with zones in OpenSolaris a bit more restrictive (it took about 10 minutes to download and install on my machine). Supposedly, work is underway to add IPS support to native zones. But until that happens, here's my guide to working with zones in OpenSolaris.

Setting up a zone involves 4 steps: create, install, boot and configure.

Step 1: Create the Zone

If you're not interested in zones, you should at least be aware that you're already running in one - the global zone:

bleonard@opensolaris:~$ zoneadm list -v
  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              native   shared

Zones must be installed within a ZFS file system, otherwise the zone install command will generate the error "no zonepath dataset" (see defect 8468 for details). You can either use an existing ZFS file system, such as /export/home or create a new one, as I chose to do here:

pfexec zfs create -o mountpoint=/zones rpool/zones

Before we actually create the zone, let's pre-determine some information that will be required. I'm going to set the zone name to myzone. The zone needs a network interface, which can match that of the global zone. This is easiest to figure out by hovering over the connection properties icon in the top panel and noting the Network Connection:

In my case it's e1000g0.

Since we'll be using a shared IP stack with the global zone, the non-global zone is not at liberty to select its own IP address (or use DHCP). I may talk about exclusive IP stacks in another entry, but for now we need to choose a free IP address on the subnet (I'm running OpenSolaris in VirtualBox, which provides its own subnet). I'll be using 10.0.2.25.

Once you have that information collected, you can begin to create the zone:

bleonard@opensolaris:~$ pfexec zonecfg -z myzone
myzone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:myzone> create
zonecfg:myzone> set zonepath=/zones/myzone
zonecfg:myzone> add net 
zonecfg:myzone:net> set physical=e1000g0
zonecfg:myzone:net> set address=10.0.2.25
zonecfg:myzone:net> end
zonecfg:myzone> exit
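The same configuration can also be expressed as a command file and fed to zonecfg with its -f option instead of typing at the prompt. The following is a minimal sketch under that assumption; the function name zone_commands and the path /tmp/myzone.cfg are made up for illustration, and the explicit verify and commit lines are added here as a precaution rather than copied from the session above.

```shell
#!/bin/sh
# Print a zonecfg command file equivalent to the interactive session.
# Arguments: zonepath, physical interface, IP address.
zone_commands() {
    zonepath=$1 nic=$2 addr=$3
    cat <<EOF
create
set zonepath=$zonepath
add net
set physical=$nic
set address=$addr
end
verify
commit
EOF
}

# Show the commands for the zone used in this entry.
zone_commands /zones/myzone e1000g0 10.0.2.25

# On an OpenSolaris host the output could be saved and applied with:
#   zone_commands /zones/myzone e1000g0 10.0.2.25 > /tmp/myzone.cfg
#   pfexec zonecfg -z myzone -f /tmp/myzone.cfg
```

Keeping the values as arguments makes it easy to stamp out several zones that differ only in path and address.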

To see your zone's current configuration, run:

bleonard@opensolaris:~$ zonecfg -z myzone info
zonename: myzone
zonepath: /zones/myzone
brand: ipkg
autoboot: false
bootargs: 
pool: 
limitpriv: 
scheduling-class: 
ip-type: shared
net:
	address: 10.0.2.25
	physical: e1000g0
	defrouter not specified  

List the zones again, using the -c option to show all zones (not just those installed):

bleonard@opensolaris:~$ zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              native   shared
   - myzone           configured /zones/myzone                  ipkg     shared

Notice the brand is ipkg.

Step 2: Install the Zone

Now that the zone's configured, let's install it. Zone installation on OpenSolaris is a much different experience than on Solaris 10, as the zone must be downloaded from the package repository rather than simply copied from the global zone:

bleonard@opensolaris:~$ pfexec zoneadm -z myzone install
A ZFS file system has been created for this zone.
  Authority: Using http://pkg.opensolaris.org/release/.
      Image: Preparing at /zones/myzone/root ... done.
 Installing: (output follows)
DOWNLOAD                                    PKGS       FILES     XFER (MB)
Completed                                  52/52   7862/7862   72.41/72.41 

PHASE                                        ACTIONS
Install Phase                            12939/12939 
PHASE                                          ITEMS
Reading Existing Index                           9/9 
Indexing Packages                              52/52 

       Note: Man pages can be obtained by installing SUNWman
Postinstall: Copying SMF seed repository ... done.
Postinstall: Working around http://defect.opensolaris.org/bz/show_bug.cgi?id=741
       Done: Installation completed in 595.162 seconds.

 Next Steps: Boot the zone, then log into the zone console
             (zlogin -C) to complete the configuration process

We can verify the installation via its status:

bleonard@opensolaris:~$ zoneadm list -cv
  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              native   shared
   - myzone           installed  /zones/myzone                  ipkg     shared
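If you want to check the status from a script rather than by eye, the STATUS column can be pulled out of that listing. Here is a small sketch that assumes the column layout shown above (ID, NAME, STATUS, PATH, BRAND, IP) and a POSIX awk; on Solaris-family systems that may mean nawk or /usr/xpg4/bin/awk rather than the legacy awk. The function name zone_status is hypothetical.

```shell
#!/bin/sh
# Print the STATUS column for one zone, reading `zoneadm list -cv` style
# output on standard input. Assumes NAME is column 2 and STATUS column 3.
zone_status() {
    awk -v zone="$1" '$2 == zone { print $3 }'
}

# Sample input copied from the listing in this entry.
sample='  ID NAME             STATUS     PATH                           BRAND    IP
   0 global           running    /                              native   shared
   - myzone           installed  /zones/myzone                  ipkg     shared'

printf '%s\n' "$sample" | zone_status myzone    # prints: installed
```

On a live system the sample would be replaced by the real command, for example `zoneadm list -cv | zone_status myzone`.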

Steps 3 & 4: Boot and Configure

The next steps are to boot and configure the zone. When the zone boots for the first time, sysidtool is going to run to configure the system. We will boot the zone using two terminal windows: one to boot the system and the other to configure it. Note, it is possible to automate these system configuration steps, which I'll cover in a future blog.

Log into the zone and wait for it to boot:

bleonard@opensolaris:~$ pfexec zlogin -C myzone
[Connected to zone 'myzone' console]  

Open a 2nd terminal window and boot the zone. If you see the warning like I did, don't worry about it; I address it at the end of this entry.

bleonard@opensolaris:~$ pfexec zoneadm -z myzone boot
zone 'myzone': WARNING: e1000g0:1: no matching subnet found in netmasks(4) for 10.0.2.25; using default of 255.0.0.0.

Then back in the 1st terminal, proceed with system configuration:

[NOTICE: Zone booting up]


SunOS Release 5.11 Version snv_101b 32-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: myzone
Loading smf(5) service descriptions: 68/68
Reading ZFS config: done.
Mounting ZFS filesystems: (5/5)

 

What type of terminal are you using?
 1) ANSI Standard CRT
 2) DEC VT100
 3) PC Console
 4) Sun Command Tool
 5) Sun Workstation
 6) X Terminal Emulator (xterms)
 7) Other
Type the number of your choice and press Return: 6
Creating new rsa public/private host key pair
Creating new dsa public/private host key pair
Configuring network interface addresses: e1000g0.

 
  

Give the zone a host name (or select the default):

─ Host Name for e1000g0:1 ─────────────────────────────────────────────────────

  Enter the host name which identifies this system on the network.  The name
  must be unique within your domain; creating a duplicate host name will cause
  problems on the network after you install Solaris.

  A host name must have at least one character; it can contain letters,
  digits, and minus signs (-).


    Host name for e1000g0:1 myzone                          











────────────────────────────────────────────────────────────────────────────────
    F2_Continue    F6_Help

Note, for some reason the "Continue" command switches from F2, as in the screen shot above, to Esc+2, as seen in the following screens.

Confirm the host name:

─ Confirm Information for e1000g0:1 ────────────────────────────────────────────

  > Confirm the following information.  If it is correct, press F2;
    to change any information, press F4.


    Host name: myzone















────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-4_Change    Esc-6_Help

Configure the security policy:

─ Configure Security Policy: ───────────────────────────────────────────────────

  Specify Yes if the system will use the Kerberos security mechanism.

  Specify No if this system will use standard UNIX security.

      Configure Kerberos Security
      ───────────────────────────
      [ ] Yes
      [X] No












────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

Confirm the security policy:

 ─ Confirm Information ──────────────────────────────────────────────────────────

  > Confirm the following information.  If it is correct, press F2;
    to change any information, press F4.


    Configure Kerberos Security: No















────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-4_Change    Esc-6_Help

Set the name service. I will be using DNS:

─ Name Service ─────────────────────────────────────────────────────────────────

  On this screen you must provide name service information.  Select the name
  service that will be used by this system, or None if your system will either
  not use a name service at all, or if it will use a name service not listed
  here.

  > To make a selection, use the arrow keys to highlight the option
    and press Return to mark it [X].


      Name service
      ────────────
      [ ] NIS+
      [ ] NIS
      [X] DNS
      [ ] LDAP
      [ ] None




────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

If you also selected DNS, set the domain name, DNS servers and search domains. I'm using the same settings as my global zone, which you can find in /etc/resolv.conf:

bleonard@opensolaris:~$ cat /etc/resolv.conf 
domain hsd1.ct.comcast.net.
nameserver 10.0.2.3

Set the domain name:

─ Domain Name ──────────────────────────────────────────────────────────────

  On this screen you must specify the domain where this system resides.  Make
  sure you enter the name correctly including capitalization and punctuation.


    Domain name: hsd1.ct.comcast.net             















────────────────────────────────────────────────────────────────────────────────

    Esc-2_Continue    Esc-6_Help

Add the DNS Server Addresses:

─ DNS Server Addresses ─────────────────────────────────────────────────────────

  On this screen you must enter the IP address of your DNS server(s).  You
  must enter at least one address.  IP addresses must contain four sets of
  numbers separated by periods (for example 129.200.9.1).



    Server's IP address: 10.0.2.3        
    Server's IP address:
    Server's IP address:











───────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

And any search domains:

─ DNS Search List ──────────────────────────────────────────────────────────────

  On this screen you can enter a list of domains that will be searched when a
  DNS query is made.  If you do not enter any domains, DNS will only search
  the DNS domain chosen for this system.  The domains entered, when
  concatenated, may not be longer than 250 characters.



    Search domain:                                 
    Search domain:
    Search domain:
    Search domain:
    Search domain:
    Search domain:







────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

Confirm the network information:

─ Confirm Information ──────────────────────────────────────────────────────────

  > Confirm the following information.  If it is correct, press F2;
    to change any information, press F4.


          Name service: DNS
           Domain name: hsd1.ct.comcast.net
    Server address(es): 10.0.2.3













────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-4_Change    Esc-6_Help

Ignore the Name Service Error (i.e., do not enter new name service information):

─ Name Service Error ───────────────────────────────────────────────────────────

  Unable to find an address entry for myzone with the specified DNS
  configuration.


      Enter new name service information?
      ───────────────────────────────────
      [ ] Yes
      [X] No












────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

NFSv4 Domain Name:

─ NFSv4 Domain Name ────────────────────────────────────────────────────────────

  NFS version 4 uses a domain name that is automatically derived from the
  system's naming services. The derived domain name is sufficient for most
  configurations. In a few cases, mounts that cross domain boundaries might
  cause files to appear to be owned by "nobody" due to the lack of a common
  domain name.

  The current NFSv4 default domain is: "hsd1.ct.comcast.net"


      NFSv4 Domain Configuration
      ──────────────────────────────────────────────
      [X] Use the NFSv4 domain derived by the system
      [ ] Specify a different NFSv4 domain







────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

Confirm:

─ Confirm Information for NFSv4 Domain ─────────────────────────────────────────

  > Confirm the following information.  If it is correct, press F2;
    to change any information, press F4.



    NFSv4 Domain Name:  << Value to be derived dynamically >>















────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-4_Change    Esc-6_Help

Select your time zone:

─ Time Zone ────────────────────────────────────────────────────────────────────

  On this screen you must specify your default time zone.  You can specify a
  time zone in three ways:  select one of the continents or oceans from the
  list, select other - offset from GMT, or other - specify time zone file.

  > To make a selection, use the arrow keys to highlight the option and
    press Return to mark it [X].


      Continents and Oceans
      ──────────────────────────────────
  -   [ ] Africa
  │   [X] Americas
  │   [ ] Antarctica
  │   [ ] Arctic Ocean
  │   [ ] Asia
  │   [ ] Atlantic Ocean
  │   [ ] Australia
  │   [ ] Europe
  v   [ ] Indian Ocean

──────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

Country:

─ Country or Region ────────────────────────────────────────────────────────────

  > To make a selection, use the arrow keys to highlight the option and
    press Return to mark it [X].


      Countries and Regions
      ───────────────────────────
  -   [X] United States
  │   [ ] Anguilla
  │   [ ] Antigua & Barbuda
  │   [ ] Argentina
  │   [ ] Aruba
  │   [ ] Bahamas
  │   [ ] Barbados
  │   [ ] Belize
  │   [ ] Bolivia
  │   [ ] Brazil
  │   [ ] Canada
  │   [ ] Cayman Islands
  v   [ ] Chile

──────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

Time Zone:

─ Time Zone ───────────────────────────────────────────────────────────────────

  > To make a selection, use the arrow keys to highlight the option and
    press Return to mark it [X].


      Time zones
      ──────────────────────────────────────────────────────────────────────────
  -   [X] Eastern Time
  │   [ ] Eastern Time - Michigan - most locations
  │   [ ] Eastern Time - Kentucky - Louisville area
  │   [ ] Eastern Time - Kentucky - Wayne County
  │   [ ] Eastern Time - Indiana - most locations
  │   [ ] Eastern Time - Indiana - Daviess, Dubois, Knox & Martin Counties
  │   [ ] Eastern Time - Indiana - Starke County
  │   [ ] Eastern Time - Indiana - Pulaski County
  │   [ ] Eastern Time - Indiana - Crawford County
  │   [ ] Eastern Time - Indiana - Switzerland County
  │   [ ] Central Time
  │   [ ] Central Time - Indiana - Perry County
  v   [ ] Central Time - Indiana - Pike County

────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

Confirm Time Zone:

─ Confirm Information ─────────────────────────────────────────────────────────

  > Confirm the following information.  If it is correct, press F2;
    to change any information, press F4.


    Time zone: Eastern Time
               (US/Eastern)














──────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-4_Change    Esc-6_Help

And finally, set the root password:

─ Root Password ────────────────────────────────────────────────────────────────

  Please enter the root password for this system.

  The root password may contain alphanumeric and special characters.  For
  security, the password will not be displayed on the screen as you type it.

  > If you do not want a root password, leave both entries blank.


    Root password:  *********
    Root password:  *********       










────────────────────────────────────────────────────────────────────────────────
    Esc-2_Continue    Esc-6_Help

Zone configuration is complete. You can now log into the zone:

System identification is completed.

myzone console login: root
Password: 
Apr  1 21:48:04 myzone login: ROOT LOGIN /dev/console
Sun Microsystems Inc.   SunOS 5.11      snv_101b        November 2008
root@myzone:~# 

From the other terminal, the zone's status now shows as running:

bleonard@opensolaris:~$ zoneadm list -v
  ID NAME             STATUS     PATH                           BRAND    IP    
   0 global           running    /                              native   shared
   1 myzone           running    /zones/myzone                  ipkg     shared
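The answers given on the screens above are the kind of thing a sysidcfg file is meant to capture for sysidtool. As a rough sketch only, here is what such a file might look like for this zone. The keywords are standard sysidcfg keywords, but whether the ipkg brand in this release honors the file the way native zones do, and the placement shown in the closing comment, are assumptions not confirmed in this entry. The function name is hypothetical and the root password line is a placeholder, since the real value must be an encrypted hash.

```shell
#!/bin/sh
# Print a sysidcfg-style answer file mirroring the interactive choices.
# Arguments: host name, DNS domain, DNS server address, time zone.
sysidcfg_text() {
    host=$1 domain=$2 dns=$3 tz=$4
    cat <<EOF
system_locale=C
terminal=xterms
network_interface=primary {hostname=$host}
security_policy=NONE
name_service=DNS {domain_name=$domain name_server=$dns}
nfs4_domain=dynamic
timezone=$tz
root_password=REPLACE_WITH_ENCRYPTED_HASH
EOF
}

# Values taken from the screens in this entry.
sysidcfg_text myzone hsd1.ct.comcast.net 10.0.2.3 US/Eastern

# Assumed placement, after install and before the first boot:
#   sysidcfg_text ... > /zones/myzone/root/etc/sysidcfg
```

The system_locale line has no counterpart in the screens above and is included only because sysidcfg files commonly set it.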

Working with the Zone

To drop off the zone console, exit the shell prompt and then type ~. at the console login prompt:

root@myzone:~# exit
logout

myzone console login: ~.
[Connection to zone 'myzone' console closed]
bleonard@opensolaris:~$ 

The zone is still running. Log in again:

bleonard@opensolaris:~$ pfexec zlogin -C myzone
[Connected to zone 'myzone' console]

Hit return to get the login prompt:

myzone console login: root
Password: 
Last login: Wed Apr  1 22:00:12 on console
Sun Microsystems Inc.   SunOS 5.11      snv_101b        November 2008
root@myzone:~# 

The zone can be shut down, halted or rebooted from within the zone (here's a reboot example):

root@myzone:~# reboot
Apr  2 01:18:50 myzone reboot: initiated by root on /dev/console

[NOTICE: Zone rebooting]


SunOS Release 5.11 Version snv_101b 32-bit
Copyright 1983-2008 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Hostname: myzone
Reading ZFS config: done.
Mounting ZFS filesystems: (5/5)

myzone console login: 

Or from the global zone:

pfexec zoneadm -z myzone reboot

Now that we have a zone, there's plenty of opportunity to experiment...

Deleting the Zone

pfexec zoneadm -z myzone uninstall 
pfexec zonecfg -z myzone delete -F
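Since a running zone has to be halted before it can be uninstalled, the two commands above are usually preceded by a halt. The sketch below wraps all three steps in a function with a dry-run switch so you can see what would be executed first. The helper names and the DRY_RUN variable are made up for this example, and the -F on uninstall (which skips the confirmation prompt) is an addition to the commands shown above.

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default here) or run the commands that remove a zone.
DRY_RUN=${DRY_RUN:-1}

# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Halt, uninstall and unconfigure the named zone, in that order.
delete_zone() {
    zone=$1
    run pfexec zoneadm -z "$zone" halt          # a running zone cannot be uninstalled
    run pfexec zoneadm -z "$zone" uninstall -F  # -F skips the confirmation prompt
    run pfexec zonecfg -z "$zone" delete -F
}

delete_zone myzone
```

Running it with DRY_RUN=0 on an OpenSolaris host would execute the commands instead of printing them.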

Fixing the netmask Warnings

If you're getting the netmask warning as I did when the zone boots:

zone 'myzone': WARNING: e1000g0:1: no matching subnet found in netmasks(4) for 10.0.2.25; using default of 255.0.0.0.

You can eliminate it by adding the zone's IP subnet to /etc/inet/netmasks. However, before we can edit the netmasks file, we need to make it writable:

pfexec chmod u+w /etc/inet/netmasks

Then add the proper subnet for your network. For example:

10.0.2.0 255.255.255.0

Now the zone will boot cleanly. For more information see netmasks Warning Displayed When Booting Zone.
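If you're unsure what to put in the first column of the netmasks entry, it is the zone's IP address ANDed with the netmask, octet by octet. Here is a small sketch of that arithmetic; the function name network_of is made up for this example. It also shows why the fallback mask in the warning is so broad: with no matching entry, a 10.x address gets the classful 255.0.0.0 default, which puts it on network 10.0.0.0.

```shell
#!/bin/sh
# Print the network address for a dotted-quad IP and netmask by ANDing
# each pair of octets.
network_of() {
    ip=$1
    mask=$2
    old_ifs=$IFS
    IFS=.
    set -- $ip $mask      # split both values into eight octets
    IFS=$old_ifs
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

network_of 10.0.2.25 255.255.255.0   # prints: 10.0.2.0
network_of 10.0.2.25 255.0.0.0       # prints: 10.0.0.0
```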

Codeina

Codeina's a nifty little feature of OpenSolaris 2009.06. It's an application from Fluendo that points applications such as Totem or Rhythmbox to Fluendo's web shop if you are missing the codec for the file you are trying to play. In the case of MP3s, the codec is free. For other audio and video formats, you'll have to pay a small license fee.

For example, when I attempt to play an MP3 in Totem, I'm presented with the following:

Clicking Install launches the Codeina Web Shop application where I can register to "buy" the free MP3 decoder:


And then Install the decoder:


Which will quickly complete:

After which my song will automatically start playing.

Note, if Codeina isn't working for you, you may be running into the issue "Codeina fails to start". The quick fix for this is to remove the Fluendo configuration file and try again.

rm ~/.local/share/codeina/providers/fluendo.xml

Just enough Web Space Server

I was poking around with the Just enough Operating System (JeOS) delivery form of OpenSolaris, when I found the Web Space Server project built on top of that platform. The Web Space Server project is based on the Liferay Portal and it has been packaged up nicely into a variety of virtual machines for both VirtualBox and VMware.

For this example I grabbed the OVF (Open Virtualization Format) which is designed for packaging up such appliances and was a snap to import into VirtualBox (File > Import Appliance):

When the import is complete, go ahead and start the WebSpaceServer. As this is a "just enough" distribution of OpenSolaris, there is no GUI. When the machine boots up, it will give you the HTTP address of the server, which in my case is 10.0.1.14:


Note, the machine takes anywhere from 5-10 minutes to complete service startup. To monitor its progress, you can log in using template / template.

One thing you probably want to do right off the bat is enable ssh so you can log into the server remotely:

svcadm enable ssh

Then, from another host:

bleonard@opensolaris:~$ ssh template@10.0.1.14
The authenticity of host '10.0.1.14 (10.0.1.14)' can't be established.
RSA key fingerprint is 7c:6d:df:62:e2:ec:1f:a5:d7:2a:5f:3f:72:9a:ac:45.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.1.14' (RSA) to the list of known hosts.
Password: 
Last login: Tue Jun 30 03:55:15 2009

Welcome to a Sun GlassFish Web Space Server and OpenSolaris virtual 
machine image.

Use of this virtual machine image is subject to the license terms found 
in /etc/notices.

template@webspace:~$ 

You need to wait for the GlassFish server to finish starting up. You can check the domain2 SMF service:

template@webspace:~$ svcs -l domain2
fmri         svc:/application/SUNWappserver/domain2:default
name         Appserver Domain Administration Server
enabled      true
state        offline
next_state   online
state_time   Tue Jun 30 04:35:59 2009
logfile      /var/svc/log/application-SUNWappserver-domain2:default.log
restarter    svc:/system/svc/restarter:default
contract_id  46 
dependency   require_all/none svc:/milestone/network:default (online)
dependency   require_all/none svc:/system/filesystem/local:default (online)

Here you can see its current state is offline, but it's transitioning to online. You can tail the server log file to track its progress:

tail -f /opt/webspace/glassfish2/domains/domain2/logs/server

Ultimately, you'll see a line like the following:

[#|2009-06-30T11:44:37.899+0000|INFO|sun-appserver2.1|javax.enterprise.system.core|_ThreadID=10;_ThreadName=main;|Application server startup complete.|#
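If you would rather script the wait than watch the tail output, a small parser is enough. The following is a minimal sketch in Python, assuming each record sits on one line in the pipe-delimited layout of the sample above. The function names are hypothetical, and real server logs can contain multi-line messages that this does not handle.

```python
STARTUP_MESSAGE = "Application server startup complete."

def parse_record(line):
    """Split one pipe-delimited log record into named fields (None if it doesn't match)."""
    parts = line.strip().split("|")
    # Assumed layout: [#, timestamp, level, product, logger, thread info, message, #
    if len(parts) < 8 or parts[0] != "[#":
        return None
    return {
        "timestamp": parts[1],
        "level": parts[2],
        "product": parts[3],
        "logger": parts[4],
        "thread": parts[5],
        "message": "|".join(parts[6:-1]),
    }

def startup_complete(lines):
    """Return the timestamp of the startup-complete record, or None if not seen."""
    for line in lines:
        record = parse_record(line)
        if record and record["message"] == STARTUP_MESSAGE:
            return record["timestamp"]
    return None

sample = ("[#|2009-06-30T11:44:37.899+0000|INFO|sun-appserver2.1|"
          "javax.enterprise.system.core|_ThreadID=10;_ThreadName=main;|"
          "Application server startup complete.|#")
print(startup_complete([sample]))
```

In practice you would pass an open file object for the server log instead of the one-element list.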

Once that is complete, you can then browse to the server. It has a professional home page with links to try the Web Space Server and a Quick Start Tour. Webmin is also included for managing MySQL and OpenSolaris.


When you first try the Web Space Server, be patient as it configures itself to run for the first time. Eventually, the Welcome page will appear:




June 30, 2009

FMA Messages WIKI way…

Almost perfect!! Wow, that was a really, really nice surprise! I do like all the new stuff from Solaris 10/OpenSolaris, and i say that in every opportunity, like now. I did write some posts about FMA, like this one… but one problem about the fmdump messages IMHO, was the fact that the messages on console were [...]


1 Minute at CommunityOne

Ashwin Bhat and Angad Singh asked me for one minute of my time outside the keynote hall at CommunityOne a few weeks ago. Hey, what's a minute, right? Happy to. But this minute was to be digitally recorded. Uha. Video. I generally shy away from such things because I'm shy about being interviewed. But these are really good guys and they've done great work as Campus Ambassadors in India, so I felt safe in front of their camera (though I'd clearly much rather be behind the camera). It wasn't too bad, though. But the 1 minute ran for 2 minutes and 21 seconds! Anyway. Thanks, guys. Great fun. Hope to get back to India sometime soon.

Fun with Crossbow

The Crossbow project is probably the most exciting new feature in OpenSolaris 2009.06. In a nutshell, project Crossbow brings virtualization to the networking layer. In this quick example I'm going to create a virtual network interface card (VNIC) and dynamically alter its maximum bandwidth as traffic is flowing over it.

The VNIC must be linked to an actual network interface device. To see the existing devices on the system:

bleonard@opensolaris:~$ dladm show-phys
LINK         MEDIA                STATE      SPEED  DUPLEX    DEVICE
e1000g0      Ethernet             up         1000   full      e1000g0
iwh0         WiFi                 down       0      unknown   iwh0
vboxnet0     Ethernet             unknown    0      unknown   vboxnet0

The current device in use is e1000g0, so I'll create my VNIC over that device:

pfexec dladm create-vnic -l e1000g0 vnic0

View the new VNIC:

bleonard@opensolaris:~$ dladm show-vnic
LINK         OVER         SPEED  MACADDRESS           MACADDRTYPE         VID
vnic0        e1000g0      1000   2:8:20:e2:77:62      random              0

Note its speed matches that of the physical link. Let's reduce this from 1000 megabits/second to 2 megabits/second:

pfexec dladm set-linkprop -p maxbw=2 vnic0

Before the VNIC can be used, it must be plumbed:

pfexec ifconfig vnic0 plumb

And assigned an IP address. If you're using DHCP, the following will work:

pfexec ifconfig vnic0 dhcp start

Now you can see vnic0 using ifconfig:

bleonard@opensolaris:~$ ifconfig vnic0
vnic0: flags=1104843<UP,BROADCAST,RUNNING,MULTICAST,DHCP,ROUTER,IPv4> mtu 1500 index 7
	inet 10.0.1.16 netmask ffffff00 broadcast 10.0.1.255

The VNIC has been assigned IP address 10.0.1.16 and is now ready for use. To test it out, we'll copy a large file from one host to another, over the VNIC. To test this on a single machine, use a virtual machine configured with bridged networking or a zone. From the other host:

bleonard@os200906:~$ mkfile 100M big-file
bleonard@os200906:~$ scp big-file bleonard@10.0.1.16:bile-file
The authenticity of host '10.0.1.16 (10.0.1.16)' can't be established.
RSA key fingerprint is f7:1d:2c:d7:24:e3:1c:57:53:0f:59:75:31:4a:0f:7d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '10.0.1.16' (RSA) to the list of known hosts.
Password: 
big-file               0% |                             |   128 KB    13:22 ETA

Notice it's estimated to take over 13 minutes to copy this file. While the file is being copied, back on the host machine, change the maxbw property to 1000 megabits / second:

pfexec dladm set-linkprop -p maxbw=1000 vnic0

Immediately you'll notice the copy operation speed up and quickly complete:

big-file             100% |*****************************|   100 MB    00:37 

Very cool!
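As a rough sanity check on those numbers, here is a back-of-the-envelope calculation in Python. It assumes mkfile's "100M" means 100 x 1,048,576 bytes and ignores all protocol overhead, so it gives only an ideal lower bound, not what scp will actually report.

```python
def transfer_seconds(size_bytes, link_mbps):
    """Ideal time to move size_bytes over a link capped at link_mbps megabits/second."""
    bits = size_bytes * 8
    return bits / (link_mbps * 1_000_000)

FILE_SIZE = 100 * 1024 * 1024  # assumption: "100M" is 100 MiB

slow = transfer_seconds(FILE_SIZE, 2)      # maxbw=2: about 419 seconds
fast = transfer_seconds(FILE_SIZE, 1000)   # maxbw=1000: under one second

print(f"at 2 Mbit/s: {slow:.0f} s, at 1000 Mbit/s: {fast:.2f} s")
```

The ideal figure at 2 megabits/second is about 7 minutes. The scp estimate of 13:22 was taken at 0% of the transfer and includes overhead this sketch ignores, so the two are not expected to match closely.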

June 29, 2009

Japan OpenSolaris Community Meeting 062709

The Japan OpenSolaris Community got together on Saturday. Nice day (and night). About 60 people came by for the three sessions, two of which were in Japanese and the third in English. Then all three groups came together for a nomikai. I think the model works well to start integrating the Japanese and international OpenSolaris communities.

I used a new lens for this event. My f/1.4 lens is getting fixed, so I borrowed Jon's 50mm f/1.2, which is one scary smart lens. It's a tad expensive, too, so I was more than a little nervous shooting with it. Anyway, at f/1.2 the focus is just razor thin. Focus on someone's glasses and their entire face is out. I messed up a few images that way, but by the end of the night I was getting used to it. Amazing piece of glass. By the way, you can see Jon's stuff here. He's one of the best photographers around.

links for 2009-06-29

Tokyo Hackerspace

I've been checking out the Tokyo Hackerspace gmail list for a few weeks. Looks very interesting. The project grew out of some discussions at BarCamp Tokyo a couple of months ago, and I spoke to Karamoon about it at the OpenSolaris community event this weekend. In a world of ever expanding global digital communities, it seems like a nice idea to have a very local and very physical space to hang out in and hack on things that need hacking. Global and digital are fine, but local and physical are needed too. For info, check it out on the wiki.

Website Transition: Two Phase 1 Documents Posted

This morning Bonnie posted two documents supporting the Phase 1 website transition plans:
  1. The plan to implement the governance and website roles and collectives, and
  2. The data migration strategy outlining how data will be migrated from existing databases into the new Auth database.
Bonnie and Alan drafted the documents and all three of us iterated for a couple of weeks as they went through multiple drafts. It's amazing experiencing the distinction between writing a document that articulates some issue in theory and writing a document that articulates a specific implementation that has to actually work. It's the distinction between night and day. Ideas are fine, but if you don't build them they are not real. That lesson is learned.

Also, I appreciate more than ever the process I went through in the recent past attempting to re-write the OpenSolaris Constitution. Building and describing the new site would have been so much easier had that Constitution been approved in March. But it wasn't. That's life I suppose. So, now we have to implement the old Constitution while also accounting for things that document doesn't even mention because they came about after the original site was designed. Not to mention all the odd stuff that evolved (and broke) on the current site -- all of which has to be migrated to the new site. As a result, in August we will have some things covered under Governance and some things covered under generally accepted practice -- and that last bit was really the basis of the concept we were trying to move toward with the proposed Constitution. Hopefully, the OGB will at some point this year take up that proposed Constitution again, get it updated, and get it approved so our Governance documents reflect the reality of how the community operates in real life.

Anyway, until that happens we will continue building what we have to build, and it will be good to finally break with the past of the old site. So, it's important for anyone with an account on opensolaris.org to review these new documents and the other information we have posted in the Website community recently to be prepared for the changes coming in July and August. All users on the site will be affected by this multi-phase transition (hopefully in a good way, of course). More documents will be posted in the coming weeks on website feature mappings, Auth transition instructions, and content migration plans. And that's just Phase 1 and Phase 2. There will be a Phase 3 that will take us well into the fall.

Website Transition Documentation | Auth System Beta | XWiki Website Beta | Program Roadmap

OpenSolaris PDAs?

"I hope that after some time we'll see OpenSolaris powered PDAs." -- Alexander Eremin

OpenSolaris on FLOSS Weekly

I had the privilege to be a guest on FLOSS Weekly with Leo Laporte and Jono Bacon this week, thanks guys! Of course Aaron and David had done awesome groundwork with an interview on ZFS a few weeks earlier. It was a fun hour, and I enjoyed it, though I can think of many things I’d answer differently now! Looking forward to catching up with Jono and others at the Community Leadership Summit next month in San Jose, the weekend before OSCON.

And yes, OpenSolaris is officially ‘not bollocks’. Check it out!

June 28, 2009

New ZFS Book Coming to Japan

Four members of the Japan OpenSolaris Community wrote a book on ZFS recently. It's coming out in July, and it's specifically for the Japanese market. The cover has not been selected yet, but here are the early details: ZFS 仮想化されたファイルシステムの徹底活用 (大型本) by Hisayoshi Kato, Michitoshi Sato, Nagahara Niroharu, and Imai Satoshi. This is quite a significant contribution to the community in Japan because it's important to have technical content written by Japanese engineers for Japanese engineers. Translating English content from the west is good, of course, but the generation of original content in Japanese also needs to be part of this community's growth plans.

Here are some more books on OpenSolaris: http://blogs.sun.com/jimgris/tags/book

Photos of AGC Family Party



I saw so many old friends, and it made me really happy! I hope we won't have to wait "another twenty years before we meet again" ...

June 27, 2009

Japan Linux Symposium: Tokyo, October 2009

It's excellent to see the 9th Annual Linux Kernel Summit and the 1st Annual Japan Linux Symposium coming to Tokyo in October. Check out this language from the LF website: "The Japan Linux Symposium will be the showcase Japan and Asia Pacific conference for The Linux Foundation." Showcase. This is significant. The Japanese may not shout about it much, but developers in this country are contributing to FOSS and their contributions are growing. The potential in Japan for open source is huge. I've been saying it since I got here. So cool that the LF clearly recognizes this potential by bringing their conferences here. Also interesting: the LF website appears in two languages -- English and Japanese.



I already hang out with the Tokyo Linux User Group (here, here), so I hope to attend this gig in October.

OpenSolaris in Brazil: Right at the Top!

Nothing like going right to the very top, eh? My goodness. Here's the OpenSolaris community in Brazil at FISL hanging out with Brazil's President Luiz Inácio Lula da Silva. I think this sets a new standard in government relations for the entire OpenSolaris community, don't you think? So, each one of us around the world now has to go out and shoot some images (video is fine, too) of our country's leader standing with our respective communities all dressed up in OpenSolaris stuff. Ok. Should be easy enough. Just send your images to osug-leaders or advocacy-discuss, and we'll collect them there.

Absolutely. Outrageous.

June 26, 2009

FISL 10

Congratulations to Vitório Sassi and all OSOL users from OpenSolaris-BR (especially PoaOSUG ;-) who are doing great work at FISL 10!


SLOG Screenshots

I previously posted screenshots of the L2ARC: the ZFS second level cache which uses read optimized SSDs ("Readzillas") to cache random read workloads. That's the read side of the Hybrid Storage Pool. On the write side, the ZFS separate intent log (SLOG) can use write optimized SSDs (which we call "Logzillas") to accelerate performance of synchronous write workloads.

In the screenshots that follow I'll show how Logzillas have delivered 12x more IOPS and over 20x reduced latency for a synchronous write workload over NFS. These screenshots are from Analytics on the Sun Storage 7000 series. In particular, the following heat map shows the dramatic reduction in NFS latency when turning on the Logzillas:

Before the 17:54 line, latency was at the mercy of spinning 7200 RPM disks, and reached as high as 9 ms. After 17:54, these NFS operations were served from Logzillas, which consistently delivered latency much lower than 1 ms. Click for a larger screenshot.

What is a Synchronous Write?

While the use of Logzilla devices can dramatically improve the performance of synchronous write workloads, what is a synchronous write?

When an application writes data, if it waits for that data to complete writing to stable storage (such as a hard disk), it is a synchronous write. If the application requests the write but continues on without waiting (the data may be buffered in the filesystem's DRAM cache, but not yet written to disk), it is an asynchronous write. Asynchronous writes are faster for the application, and are the default behavior.

There is a down side to asynchronous writes - the application doesn't know if the write completed successfully. If there is a power outage before the write could be flushed to disk, the write will be lost. For some applications such as database log writers, this risk is unacceptable - and so they perform synchronous writes instead.

There are two forms of synchronous writes: individual I/O which is written synchronously, and groups of previous writes which are synchronously committed.

Individual synchronous writes

Write I/O will become synchronous when:

  • A file is opened using a flag such as O_SYNC or O_DSYNC
  • NFS clients use the mount option (if available):
    • "sync": to force all write I/O to be synchronous
    • "forcedirectio": which avoids caching, and the client may decide to make write I/O synchronous as part of avoiding its cache
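To make the first case concrete, here is a minimal Python sketch of a program that opens its output file with O_DSYNC, in the spirit of an odsync-write style test program. The function name and file name are hypothetical. It assumes a Unix-like system; where os.O_DSYNC is not defined, it falls back to O_SYNC, and failing that to ordinary buffered writes.

```python
import os
import tempfile

# os.O_DSYNC is not defined on every platform; fall back to O_SYNC, then to 0
# (plain asynchronous writes) so the sketch still runs.
SYNC_FLAG = getattr(os, "O_DSYNC", getattr(os, "O_SYNC", 0))

def write_synchronously(path, chunks):
    """Write each chunk through a descriptor opened for synchronous I/O."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | SYNC_FLAG, 0o644)
    try:
        total = 0
        for chunk in chunks:
            # With the sync flag set, each write() should not return until
            # the data has been handed to stable storage.
            total += os.write(fd, chunk)
        return total
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as tmp:
    outfile = os.path.join(tmp, "outfile")
    written = write_synchronously(outfile, [b"x" * 512] * 4)
    print(written, "bytes written")
```

Every write here pays the full stable-storage latency, which is the cost the rest of this post is about reducing.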

Synchronously committing previous writes

Rather than synchronously writing each individual I/O, an application may synchronously commit previous writes at logical checkpoints in the code. This can improve performance by grouping the synchronous writes. On ZFS these may also be handled by the SLOG and Logzilla devices. These are performed by:

  • An application calling fsync()
  • An NFS client calling commit. It can do this because:
    • It closed a file handle that it wrote to (Solaris clients can avoid this using the "nocto" mount option)
    • Directory operations (file creation and deletion)
    • Too many uncommitted buffers on a file (eg, FreeBSD/MacOS X)
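The fsync() case can be sketched the same way. This hypothetical Python example writes records normally and commits them at checkpoints; the checkpoint interval of 100 records is an arbitrary choice for illustration.

```python
import os
import tempfile

def write_then_commit(path, records, checkpoint_every=100):
    """Write records asynchronously, calling fsync() at logical checkpoints."""
    commits = 0
    with open(path, "wb") as f:
        for i, record in enumerate(records, start=1):
            f.write(record)
            if i % checkpoint_every == 0:
                f.flush()               # push Python's buffer to the OS
                os.fsync(f.fileno())    # ask the OS to commit to stable storage
                commits += 1
        # Final commit so that no trailing records are left uncommitted.
        f.flush()
        os.fsync(f.fileno())
        commits += 1
    return commits

with tempfile.TemporaryDirectory() as tmp:
    commits = write_then_commit(os.path.join(tmp, "log"), [b"r" * 64] * 250)
    print(commits, "synchronous commits for 250 records")
```

Here 250 records cost three synchronous commits rather than 250 synchronous writes, which is the grouping benefit described above.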

Examples of Synchronous Writes

Applications that are expected to perform synchronous writes include:

  • Database log writers: eg, Oracle's LGWR.
  • Expanding archives over NFS: eg, tar creating thousands of small files.
  • iSCSI writes on the Sun Storage 7000 series (unless lun write cache is enabled.)

The second example is easy to demonstrate. Here I've taken firefox-1.5.0.6-source.tar, a tar file containing 43,000 small files, and unpacked it to an NFS share (tar xf.) This will create thousands of small files, which becomes a synchronous write workload as these files are created. Writing of the file contents will be asynchronous, it is just the act of creating the file entries in their parent directories which is synchronous. I've unpacked this tar file twice, the first time without Logzilla devices and the second time with. The difference is clear:

Without logzillas, it took around 20 minutes to unpack the tar file. With logzillas, it took around 2 minutes - 10x faster. This improvement is also visible as the higher number of NFS ops/sec, and the lower NFS latency in the heat map.
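If you don't have a large tar file handy, a create-heavy workload of the same general shape can be generated with a few lines of Python. This is a minimal sketch, not the test used here: the file count, file size, and function name are arbitrary. To exercise an NFS server you would point the target at a directory on an NFS mount instead of a temporary directory.

```python
import os
import tempfile
import time

def create_small_files(directory, count, size=1024):
    """Create `count` files of `size` bytes each; return elapsed seconds."""
    payload = b"\0" * size
    start = time.monotonic()
    for i in range(count):
        with open(os.path.join(directory, f"file{i:06d}"), "wb") as f:
            f.write(payload)
    return time.monotonic() - start

with tempfile.TemporaryDirectory() as target:
    elapsed = create_small_files(target, 200)
    created = len(os.listdir(target))
    print(f"created {created} files in {elapsed:.3f} s")
```

For scale, 43,000 files in around 20 minutes is roughly 36 file creations per second, and in around 2 minutes it is roughly 358 per second.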

For this example I used a Sun Storage 7410 with 46 disks and 2 Logzillas, stored in 2 JBODs. The disks and Logzillas can serve many more IOPS than I've demonstrated here, as I'm only running a single threaded application from a single modest client as a workload.

Identifying Synchronous Writes

How do you know if your particular write workload is synchronous or asynchronous? There are many ways to check; here I'll describe a scenario where a client side application is performing writes to an NFS file server.

Client side

To determine if your application is performing synchronous writes, one way is to debug the open system calls. The following example uses truss on Solaris (try strace on Linux) on two different programs:

    client# truss -ftopen ./write
    4830:   open64("outfile", O_WRONLY|O_CREAT, 0644)            = 3
    
    client# truss -ftopen ./odsync-write
    4832:   open64("outfile", O_WRONLY|O_DSYNC|O_CREAT, 0644)    = 3   
    

The second program (odsync-write) opened its file with the O_DSYNC flag, so subsequent writes will be synchronous. In these examples, the program was executed by the debugger, truss. If the application is already running, on Solaris you can use pfiles to examine the file flags for processes (on Linux, try lsof.)

Apart from tracing the open() syscall, also look for frequent fsync() or fdsync() calls, which synchronously commit the previously written data.

Server side

If you'd like to determine if you have a synchronous write workload from the NFS server itself, you can't run debuggers like truss on the target process (since it's running on another host), so instead we'll need to examine the NFS protocol. You can do this either with packet sniffers (snoop, tcpdump), or with DTrace if it and its NFS provider are available:

    topknot# dtrace -n 'nfsv3:::op-write-start { @[args[2]->stable] = count(); }'
    dtrace: description 'nfsv3:::op-write-start ' matched 1 probe
    ^C
            2             1398
    

Here I've frequency counted the "stable" member of the NFSv3 write protocol, which was "2" some 1,398 times. 2 is a synchronous write workload (FILE_SYNC), and 0 is for asynchronous:

    enum stable_how {
            UNSTABLE = 0,
            DATA_SYNC = 1,
            FILE_SYNC = 2
    };
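If you collect these counts regularly, a small helper can label them. This Python sketch maps the stable values to the enum names above and totals them. Counting DATA_SYNC together with FILE_SYNC as synchronous is my assumption for this summary, and the function name is hypothetical.

```python
STABLE_HOW = {0: "UNSTABLE", 1: "DATA_SYNC", 2: "FILE_SYNC"}

def summarize(counts):
    """Return (sync_writes, async_writes, counts keyed by enum name)."""
    by_label = {STABLE_HOW.get(k, f"unknown({k})"): v for k, v in counts.items()}
    # Assumption: DATA_SYNC (1) and FILE_SYNC (2) both count as synchronous.
    sync = sum(v for k, v in counts.items() if k in (1, 2))
    unstable = counts.get(0, 0)
    return sync, unstable, by_label

# The aggregation from the DTrace one-liner: stable=2 seen 1,398 times.
sync, unstable, by_label = summarize({2: 1398})
print(sync, unstable, by_label)
```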
    

If you don't have DTrace available, you'll need to dig this out of the NFS protocol headers. For example, using snoop:

    # snoop -d nxge4 -v | grep NFS
    [...]
    NFS:  ----- Sun NFS -----
    NFS:  
    NFS:  Proc = 7 (Write to file)
    NFS:  File handle = [A2FC]
    NFS:   7A6344B308A689BA0A00060000000000073000000A000300000000001F000000
    NFS:  Offset = 96010240
    NFS:  Size   = 8192
    NFS:  Stable = FSYNC
    NFS:  
    

That's an example of a synchronous write.

The NFS client may be performing a synchronous-write-like workload by frequently calling the NFS commit operation. To identify this, use whichever tool is available to show NFS operations broken down by type (Analytics on the Sun Storage 7000; nfsstat -s on Solaris).

SLOG and Synchronous Writes

The following diagram shows the difference when adding a separate intent log (SLOG) to ZFS:

Major components:

  • ZPL: ZFS POSIX Layer. Primary interface to ZFS as a filesystem.
  • ZIL: ZFS Intent Log. Synchronous write data for replay in the event of a crash.
  • DMU: Data Management Unit. Transactional object management.
  • ARC: Adaptive Replacement Cache. Main memory filesystem cache.
  • ZIO: ZFS I/O Pipeline. Processing of disk I/O.

For the most detailed information on these and the other components of ZFS, you can browse their source code at the ZFS Source Tour.

When data is asynchronously written to ZFS, it is buffered in memory and gathered periodically by the DMU as transaction groups, and then written to disk. Transaction groups either succeed or fail as a whole, and are part of the ZFS design to always deliver on disk data consistency.

This periodic writing of transaction groups can improve asynchronous writes by aggregating data and streaming it to disk. However, the interval of these writes is on the order of seconds, which makes it unsuitable to serve synchronous writes directly - as the application would stall until the next transaction group was synced. Enter the ZIL.

ZFS Intent Log

The ZIL handles synchronous writes by immediately writing their data and information to stable storage, an "intent log", so that ZFS can claim that the write completed. The written data hasn't reached its final destination on the ZFS filesystem yet, that will happen sometime later when the transaction group is written. In the meantime, if there is a system failure, the data will not be lost as ZFS can replay that intent log and write the data to its final destination. This is a similar principle as Oracle redo logs, for example.
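The principle can be illustrated with a toy model. The Python class below is purely a sketch of the log-then-replay idea and is not how ZFS implements the ZIL: all names are hypothetical, and plain Python lists and dicts stand in for stable storage and memory.

```python
class ToyIntentLog:
    """Toy model: log synchronous writes first, apply them later, replay after a crash."""

    def __init__(self):
        self.log = []        # stands in for the intent log on stable storage
        self.pending = {}    # in-memory changes awaiting the next transaction group
        self.pool = {}       # stands in for the data's final on-disk location

    def sync_write(self, key, value):
        self.log.append((key, value))   # logged first, so the write can be acknowledged
        self.pending[key] = value
        return True

    def commit_txg(self):
        self.pool.update(self.pending)  # data reaches its final destination
        self.pending.clear()
        self.log.clear()                # committed records no longer need replaying

    def crash(self):
        self.pending.clear()            # memory is lost; log and pool survive

    def replay(self):
        for key, value in self.log:
            self.pool[key] = value
        self.log.clear()

fs = ToyIntentLog()
fs.sync_write("a", 1)
fs.commit_txg()          # "a" reaches the pool normally
fs.sync_write("b", 2)
fs.crash()               # "b" was acknowledged but never committed
fs.replay()              # the log brings "b" back
print(fs.pool)
```

The acknowledged write to "b" survives the crash only because it was logged before the acknowledgement.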

Separate Intent Log

In the above diagram on the left, the ZIL stores the log data on the same disks as the ZFS pool. While this works, the performance of synchronous writes can suffer as they compete for disk access along with the regular pool requests (reads, and transaction group writes.) Having two different workloads compete for the same disk can negatively affect both workloads, as the heads seek between the hot data for one and the other. The solution is to give the ZIL its own dedicated "log devices" to write to, as pictured on the right. These dedicated log devices form the separate intent log: the SLOG.

Logzilla

By writing to dedicated log devices, we can improve performance further by choosing a device which is best suited for fast writes. Enter Logzilla. Logzilla was the name we gave to write-optimized flash-memory-based solid state disks (SSDs.) Flash memory is known for slow write performance, so to improve this the Logzilla device buffers the write in DRAM and uses a super-capacitor to power the device long enough to write the DRAM buffer to flash, should it lose power. These devices can write an I/O as fast as 0.1 ms (depends on I/O size), and do so consistently. By using them as our SLOG, we can serve synchronous write workloads consistently fast.

Speed + Data Integrity

It's worth mentioning that when ZFS synchronously writes to disk, it uses new ioctl()s to ensure that the data is flushed properly to the disk platter, and isn't just buffered on the disk's write cache (which is a small amount of DRAM inside the disk itself.) This is exactly what we want to happen - that's why these writes are synchronous. Other filesystems didn't bother to do this (eg, UFS), and believed that data had reached stable storage when it may have just been cached. If there was a power outage, and the disk's write cache is not battery backed, then those writes would be lost - which means data corruption for the filesystem. Since ZFS waits for the disk to properly write out the data, synchronous writes on ZFS are slower than on other filesystems - but they are also correct. There is a way to turn off this behaviour in ZFS (zfs_nocacheflush), and ZFS will then perform as fast or faster than other filesystems for synchronous writes - but you've also sacrificed data integrity, so this is strongly discouraged. By using fast SLOG devices on ZFS, we get both speed and data integrity.

I won't go into too much more detail of the inner workings of the SLOG, which is best described by the author Neil Perrin.

Screenshots

To create a screenshot to best show the effect of Logzillas, I applied a synchronous write workload over NFS from a single threaded process on a single client. The target was the same 7410 used for the tar test, which has 46 disks and 2 Logzillas. At first I let the workload run with the Logzillas disabled, then enabled them:

This is the effect of adding 2 Logzillas to a pool of 46 disks, on both latency and IOPS. My last post discussed the odd latency pattern that these 7200 RPM disks cause for synchronous writes, which looks a bit like an icy lake.

The latency heat map shows the improvement very well, but it's also shown something unintended: These heat maps use a false color palette which draws the faintest details much darker than they (linearly) should be, so that they are visible. This has made visible a minor and unrelated performance issue: those faint vertical trails on the right side of the plot. These occurred every 30 seconds, when ZFS flushed a transaction group to disk - which stalled a fraction of NFS I/O while this happened. The fraction is so small it's almost lost in the rounding, but appears to be about 0.17% when examining the left panel numbers. While minor, we are working to fix it. This perf issue is known internally as the "picket fence". Logzilla is still delivering a fantastic improvement over 99% of the time.

To quantify the Logzilla improvement, I'll now zoom to the before and after periods to see the range averages from Analytics:

Before

With just the 46 disks:

Reading the values from the left panels: with just the 46 disks, we've averaged 143 NFS synchronous writes/sec. Latency has reached as high as 8.93ms - most of the 8 ms would be worst case rotational latency for a 7200 RPM disk.

After

46 disks + 2 Logzillas:

Nice - a tight latency cloud from 229 to 305 microseconds, showing very fast and consistent responses from the Logzillas.

We've now averaged 1,708 NFS synchronous writes/sec - 12x faster than without the Logzillas. Beyond 381 us, latency has rounded down to zero ops/sec, and was mostly in the 267 to 304 microsecond range. Previously latency stretched from a few to over 8 ms, making the delivered NFS latency improvement reach over 20x.
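The ratios are easy to check from the panel values quoted above. In this small Python calculation, comparing the worst-case "before" latency against the 381 us upper edge of the "after" range is my choice of a conservative comparison; other reasonable pairings give larger ratios.

```python
iops_before, iops_after = 143, 1708
iops_gain = iops_after / iops_before                  # about 11.9, i.e. roughly 12x

latency_before_ms = 8.93                              # highest latency seen with disks only
latency_after_ms = 0.381                              # 381 us, upper edge with Logzillas
latency_gain = latency_before_ms / latency_after_ms   # about 23x

print(f"IOPS gain: {iops_gain:.1f}x, latency gain: {latency_gain:.1f}x")
```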

This 7410 and these Logzillas can handle much more than 1,708 synchronous writes/sec, if I apply more clients and threads (I'll show a higher load demo in a moment.)

Disk I/O

Another way to see the effect of Logzillas is through disk I/O. The following shows before and after in separate graphs:

I've used the Hierarchical breakdown feature to highlight the JBODs in green. In the top plot (before), the I/O to each JBOD was equal - about 160 IOPS. After enabling the Logzillas, the bottom plot shows one of the JBODs is now serving over 1800 IOPS - which is the JBOD with the Logzillas in.

There are spikes of IOPS every 30 seconds in both these graphs. That is when ZFS flushes a transaction group to disk - committing the writes to their final location. Zooming into the before graph:

Instead of highlighting JBODs, I've now highlighted individual disks (yes, the pie chart is pretty now.) The non-highlighted slice is for the system disks, which are recording the numerous Analytics statistics I've enabled.

We can see that all the disks have data flushed to them every 30 seconds, but they are also being written to constantly. This constant writing is the synchronous writes being written to the ZFS intent log, which is being served from the pool of 46 disks (since it isn't a separate intent log, yet.)

After enabling the Logzillas:

Now those constant writes are being served from 2 disks: HDD 20 and HDD 16, which are both in the /2029...012 JBOD. Those are the Logzillas, and are our separate intent log (SLOG):

I've highlighted one in this screenshot, HDD 20.

Mirrored Logzillas

The 2 Logzillas I've been demonstrating have been acting as individual log devices. ZFS lists them like this:

topknot# zpool status
  pool: pool-0
 state: ONLINE
[...]
        NAME                                     STATE     READ WRITE CKSUM
        pool-0                                   ONLINE       0     0     0
[...many lines snipped...]
          mirror                                 ONLINE       0     0     0
            c2t5000CCA216CCB913d0                ONLINE       0     0     0
            c2t5000CCA216CE289Cd0                ONLINE       0     0     0
        logs
          c2tATASTECZEUSIOPS018GBYTES00000AD6d0  ONLINE       0     0     0
          c2tATASTECZEUSIOPS018GBYTES00000CD8d0  ONLINE       0     0     0
        spares
          c2t5000CCA216CCB958d0                  AVAIL
          c2t5000CCA216CCB969d0                  AVAIL

Log devices can be mirrored instead (see zpool(1M)), and zpool will list them as part of a "mirror" under "logs". On the Sun Storage 7000 series, it's configurable from the BUI (and CLI) when creating the pool:

With a single Logzilla, if there was a system failure when a client was performing synchronous writes, and if that data was written to the Logzilla but hadn't been flushed to disk, and if that Logzilla also fails, then data will be lost. So it is system failure + drive failure + unlucky timing. ZFS should still be intact.

It sounds a little paranoid to use mirroring, however from experience system failures and drive failures sometimes go hand in hand, so it is understandable that this is often chosen for high availability environments. The default for the Sun Storage 7000 config when you have multiple Logzillas is to mirror them.

Expectations

While the Logzillas can greatly improve performance, it's important to understand which workloads this is for and how well they work, to help set realistic expectations. Here's a summary:

  • Logzillas do not help every write workload; they are for synchronous write workloads, as described earlier.
  • Our current Logzilla devices for the Sun Storage 7000 family deliver as high as 100 Mbytes/sec each (less with small I/O sizes), and as high as 10,000 IOPS (less with big I/O sizes).
  • A heavy multi threaded workload on a single Logzilla device can queue I/O, increasing latency. Use multiple Logzilla devices to serve I/O concurrently.
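To see how the two ceilings in that summary interact with I/O size, here is a deliberately simple Python model. It assumes a single device is limited by whichever of the two quoted maximums it hits first, and treats "100 Mbytes/sec" as 100,000,000 bytes per second. Both are my simplifications, so real devices will behave less neatly.

```python
def device_ceiling(io_size_bytes, max_iops=10_000, max_bytes_per_sec=100_000_000):
    """Return (iops, bytes_per_sec) for one device under a simple two-limit model."""
    iops = min(max_iops, max_bytes_per_sec / io_size_bytes)
    return iops, iops * io_size_bytes

small_iops, small_bw = device_ceiling(512)           # small I/O: limited by IOPS
large_iops, large_bw = device_ceiling(128 * 1024)    # large I/O: limited by throughput

print(f"512 byte I/O:   {small_iops:.0f} IOPS, {small_bw / 1e6:.2f} MB/s")
print(f"128 Kbyte I/O:  {large_iops:.0f} IOPS, {large_bw / 1e6:.2f} MB/s")
```

Under this model 512 byte I/O tops out at 10,000 IOPS but only about 5 Mbytes/sec, while 128 Kbyte I/O reaches the throughput limit at well under 1,000 IOPS.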

Stepping up the workload

So far I've only used a single threaded process to apply the workload. To push the 7410 and Logzillas to their limit, I'll multiply this workload by 200. To do this I'll use 10 clients, and on each client I'll add another workload thread every minute until it reaches 20 per client. Each workload thread performs 512 byte synchronous write ops continually. I'll also upgrade the 7410: I'll now use 6 JBODs containing 8 Logzillas and 132 data disks, to give the 7410 a fighting chance against this bombardment.

8 Logzillas

We've reached 114,000 synchronous write ops/sec over NFS, and the delivered NFS latency is still mostly less than 1 ms - awesome stuff! We can study the numbers in the left panel of the latency heat map to quantify this: there were 132,453 ops/sec total, and (adding up the numbers) 112,121 of them completed faster than 1.21 ms - which is 85%.
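For reference, the arithmetic behind those figures, as a short Python calculation using only numbers quoted in this post:

```python
total_ops = 132_453
fast_ops = 112_121                        # completed faster than 1.21 ms
fraction_fast = fast_ops / total_ops      # about 0.85

clients, threads_per_client = 10, 20
threads = clients * threads_per_client    # 200 workload threads at full ramp
per_thread = 114_000 / threads            # 570 sync write ops/sec per thread

print(f"{fraction_fast:.1%} under 1.21 ms, {per_thread:.0f} ops/sec per thread")
```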

The bottom plot suggests we have driven this 7410 close to its CPU limit (for what is measured as CPU utilization, which includes bus wait cycles.) The Logzillas are only around 30% utilized. The CPUs are 4 sockets of 2.3 GHz quad core Opteron - and for them to be the limit means a lot of other system components are performing extremely well. There are now faster CPUs available, which we need to make available in the 7410 to push that ops/sec limit much higher than 114k.

No Logzillas

Ouch! We are barely reaching 7,000 sync write ops/sec, and our latency is pretty ordinary - often 15 ms and higher. These 7,200 RPM disks are not great to serve synchronous writes from alone, which is why Adam created the Hybrid Storage Pool model.

When comparing the above two screenshots, note the change in the vertical scale. With Logzillas, the NFS latency heat map is plotted to 10 ms; without, it's plotted to 100 ms to span the higher latency from the disks.

Config

The full description of the server and clients used in the above demo is:

Server
  • Sun Storage 7410
  • Latest software version: 2009.04.10.2.1,1-1.15
  • 4 x quad core AMD Opteron 2.3 GHz CPUs
  • 128 Gbytes DRAM
  • 2 x SAS HBAs
  • 1 x 2x10 GbE network card (1 port connected)
  • 6 JBODs (136 disks total, each 1 Tbyte 7,200 RPM)
  • Storage configured with mirroring
  • 8 Logzillas (installed in the JBODs)
Clients
  • 10 x blade servers, each:
  • 2 x quad core Intel Xeon 1.6 GHz CPUs
  • 3 Gbytes DRAM
  • 2 x 1 GbE network ports (1 connected)
  • mounted using NFSv3, wsize=512

The clients are a little old. The server is new, and is a medium sized 7410.

What about 15K RPM disks?

I've been comparing the performance of write optimized SSDs vs cheap 7,200 RPM SATA disks as the primary storage. What if I used high quality 15K RPM disks instead?

Rotational latency will be half. What about seek latency? Perhaps if the disks are of higher quality the heads move faster, but they also have to move further to span the same data (15K RPM disks are usually lower density than 7,200 RPM disks), so that could cancel out. They could have a larger write cache, which would help. Let's be generous and say that 15K RPM disks are 3x faster than 7,200 RPM disks.

So instead of reaching 7,000 sync write ops/sec, perhaps we could reach 21,000 if we used 15K RPM disks. That's a long way short of 114,000 using Logzilla SSDs. Comparing the latency improvement also shows this is no contest. In fact, to comfortably match the speed of current SSDs, disks would need to spin at 100K RPM and faster. Which may never happen. SSDs, on the other hand, are getting faster and faster.
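Here is the back-of-envelope version of that comparison as a sketch. Average rotational latency is taken as the time for half a revolution, which is the usual approximation. The 3x factor is the generous guess from above, not a measurement, and the helper name is mine.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency in ms: time for half a revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2

lat_7200 = avg_rotational_latency_ms(7_200)
lat_15k = avg_rotational_latency_ms(15_000)
lat_100k = avg_rotational_latency_ms(100_000)

DISK_OPS_7200 = 7_000      # sync write ops/sec measured without Logzillas
GENEROUS_SPEEDUP = 3       # the guess above for 15K RPM disks, not a measurement
SSD_OPS = 114_000          # sync write ops/sec measured with 8 Logzillas

est_15k_ops = DISK_OPS_7200 * GENEROUS_SPEEDUP
gap = SSD_OPS / est_15k_ops

if __name__ == "__main__":
    print(f"7,200 RPM: {lat_7200:.2f} ms, 15K RPM: {lat_15k:.2f} ms, "
          f"100K RPM: {lat_100k:.2f} ms")
    print(f"15K RPM estimate: {est_15k_ops:,} ops/sec, "
          f"Logzillas still {gap:.1f}x ahead")
```

Rotational latency alone drops from roughly 4.2 ms to 2 ms, and even a hypothetical 100K RPM disk would still average 0.3 ms per rotation wait before any seek time.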

What about NVRAM cards?

For the SLOG devices, I've been using Logzillas - write optimized SSDs. It is possible to use NVRAM cards as SLOG devices, which should deliver much faster writes than Logzilla can. While these may be an interesting option, there are some down sides to consider:

  • Slot budget: A Logzilla device consumes one disk bay, whereas an NVRAM card would consume one PCI-E slot. The Sun Storage 7410 has 6 PCI-E slots, and a current maximum of 288 disk bays. Consuming disk bays for Logzillas may be a far easier trade-off than spending one or more PCI-E slots.
  • Scaling: With limited slots compared to disk bays, the scaling of PCI-E cards will be very coarse. With Logzillas in disk bays it is easier to fine tune their number to match the workload, and to keep scaling to a dozen devices and more.
  • Complexity: One advantage of the Logzillas is that they exist in the pool of disks, and in clustered environments can be accessed by the other head node during failover. NVRAM cards are in the head node, and would require a high speed cluster interconnect to keep them both in sync should one head node fail.
  • Capacity: NVRAM cards have limited capacity when compared to Logzillas to begin with, and in clustered environments the available capacity from NVRAM cards may be halved (each head node's NVRAM cards must support both pools.) Logzillas are currently 18 Gbytes, whereas NVRAM cards are currently around 2 Gbytes.
  • Thermal: Battery backed NVRAM cards may have large batteries attached to the card. Their temperature envelope needs to be considered when placed in head nodes, which with CPUs and 10 GbE cards can get awfully hot (batteries don't like getting awfully hot.)
  • Cost: These are expected to be more expensive per Gbyte than Logzillas.

Switching to NVRAM cards may also make little relative difference for NFS clients over the network, when other latencies are factored. It may be an interesting option, but it's not necessarily better than Logzillas.

Conclusion

For synchronous write workloads, the ZFS intent log can use dedicated SSD based log devices (Logzillas) to greatly improve performance. The screenshots above from Analytics on the Sun Storage 7000 series showed 12x more IOPS and over 20x lower latency.

Thanks to Neil Perrin for creating the ZFS SLOG, Adam Leventhal for designing it into the Hybrid Storage Pool, and Sun's Michael Cornwell and STEC for making the Logzillas a reality.

Logzillas as SLOG devices work great, making the demos for this blog entry easy to create. It's not every day you find a performance technology that can deliver 10x and higher.

OpenSolaris b117

I just updated to OpenSolaris development build 117. Easy. Go get it here

Gnome 2.26


Apart from KDE4, Gnome 2.26 will also be available in BeleniX 0.8. I pulled the Desktop Consolidation trunk (JDS) and built it with a bunch of changes. The packages are already available in the package repository trunk, but upgrading to trunk is not yet recommended as it is seeing a lot of churn at present. One of the things pending is to replace the default OpenSolaris branding with BeleniX branding. Below is a screenshot showing Gnome 2.26 + Compiz + Avant + Google Gadgets + Webkit on my box. The Gnome developer help documentation browser is built with Webkit support.

Gnome 2.26

Web: you can now not suck.

In perhaps a new trend, I’m blogging from 39011 feet (or so says the seatback in front of me). I’m traveling back home to the east coast from San Jose, CA where I attended (and spoke at) this year’s O’Reilly Velocity Conference.

I participated in (and blogged about) the Velocity Summit, as I have for the past two years. The summit is the unconference preceding the real conference that helps the organizers digest current hot topics and better define the conference track for the actual conference. The summit itself is filled with enough brain power to warp space-time, so I drop everything to go to that.

Ironically, despite being a well respected authority in web site (and general internet) scalability and performance, my talk proposals for Velocity 2008 were not accepted — I clearly need to write better proposals. This year, I managed to work my way into the workshop track on Monday. Despite having a bad headache and feeling "off" the day before, I managed to get my act together and put on an A-game for my workshop. For those of you interested, here is my scalable09 slide stack.

I thought I’d take a moment to talk about what I liked about the conference and what I think could use some improvement. I realize this is a down economy and that might be a legitimate justification for some of the actions that resulted in some of my disappointment.

First, the negative. I usually start with positive and end with negative because I’m a pessimist. However, all in all the conference was awesome, so I thought I’d get my short list of gripes out of the way early.

O’Reilly is infamous for throwing good conferences for geeks. In my opinion, the field of web