Rustyvisor Network Performance

I thought it’d be useful to have a look at the performance of lguest networking, to establish a baseline for potential future improvements.

The I/O model uses a DMA-like scheme for transferring data between the host and the guest, which is clean and simple, but currently involves extra transitions across ring boundaries. For example, when an external packet for the guest arrives at the host, it is passed to host userland via the TAP driver. An I/O monitor detects that the TAP fd is readable via select(2), and sends a USR1 signal to the guest launcher, which causes the guest to switch out of ring 1 (where it was executing as the guest kernel) and back into a normal ring 3 process in the host. The guest process then prepares a DMA buffer with the packet and triggers a virtual IRQ, which fires after it switches back to the guest kernel in its main processing loop. The guest kernel then processes the packet via a paravirt network driver and its own network stack, after which, a userland process of the guest is switched in to receive the data.

So, in summary, the packet traverses two network stacks across several ring boundaries and context switches. Oh, and NAT or bridging would have been used in the host to get the packet into the TAP interface.

How does this impact performance?

I ran some very simple TCP bandwidth tests with iperf.
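
Each run was along the lines of starting an iperf server on one machine and pushing data to it from the other; something like this, reporting in MBytes (the exact options may have varied):

    # on the receiving system
    $ iperf -s -f M

    # on the transmitting system
    $ iperf -c <server-address> -f M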

First, a vanilla kernel, without paravirt_ops configured, to help isolate the impact of lguest later. The figures here are in MByte/s for a local 2.4GHz 4-way Xeon system talking over switched gigabit ethernet to a uniprocessor 2.4GHz P4.

loopback   client   server
--------------------------
     113     78.3     48.6

‘client’ indicates the bandwidth obtained with the local system acting as an iperf client, while ‘server’ means the local system was acting as an iperf server.

Next, I configured a bridge device as the external interface, to get an idea of the impact of the bridging code alone, as it will be used to get packets to the guest.
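
For reference, the bridge configuration was along these lines (the interface name and address are placeholders):

    # create a bridge, enslave the external NIC, and move its IP address to the bridge
    $ sudo brctl addbr br0
    $ sudo brctl addif br0 eth0
    $ sudo ifconfig eth0 0.0.0.0
    $ sudo ifconfig br0 192.168.1.10 netmask 255.255.255.0 up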

loopback   client   server
--------------------------
       -     30.2     44.3

Looks like bridging imposes quite a hit, mainly on the transmit path (local system acting as the iperf client).

Then, the kernel was recompiled with paravirt_ops, to see if that alone would affect performance:

loopback   client   server
--------------------------
   113.6     77.1       49

Not really.

Bridging and paravirt_ops together had more of an impact than just bridging:

loopback   client   server
--------------------------
       -     26.4       40

This is the figure from which we measure the impact of lguest.

One thing I wanted to check before running a guest was the impact of simply loading the lguest module, as it disables PGE in the host, so that the TLBs covering kernel memory are flushed when switching between the guest and host kernels. This has the side-effect of causing global page TLB flushes on all context switches.

loopback   client   server
--------------------------
     110     26.1     39.9

Seems like there was only a slight performance hit in this case.

Now, a guest instance:

loopback   client   server
--------------------------
    42.6      8.3     10.5

So, it seems the network performance of lguest over the link is around 25-30% of the equivalent host figures, allowing for the overhead of the bridging code. This is about an order of magnitude faster than what I’ve seen with QEMU (which is not really a fair comparison, as it’s an emulator, but a useful smoke-test), and competitive with VMware and UML according to some Xen performance figures. Obviously, it’d be nice to get up to Xen’s near-native performance at some stage.

I also ran the same test using NAT instead of bridging, with similar results:

loopback   client   server
--------------------------
       -      8.5      9.9
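
The NAT setup in the host was roughly the usual masquerading recipe (the external interface name is a placeholder):

    # forward and masquerade guest traffic leaving via the external interface
    $ sudo sh -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'
    $ sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE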

Here are the figures for guest to host networking:

loopback   client   server
--------------------------
       -        8     11.8

while guest to guest networking (via --sharenet) ran at 20.1 MB/s.

Rustyvisor Adventures

Hacking continues on lguest, with changes now committed for host SMP support. This is obviously a precursor to guest SMP support (coming next), but also scratches a major itch in being able to move away from a UP kernel in the development environment and get back to 21st century kernel compile times.

Adding SMP host support involved the removal of a bunch of globals in the assembly hypervisor shim (a tiny chunk of code which performs guest/host switching and interrupt para-virtualization), which is thus now even smaller and better encapsulated. The other notable change was converting the hypervisor’s pte page from global to per-cpu, another form of cleanup.

It’s nice when you can add a feature which actually simplifies and reduces the code. One of the great things about lguest is its simplicity, and a key challenge will be avoiding complexity and bloat as it matures.

These patches could do with some more testing. There was a bug in the PGE-disable code which caused a bit of head scratching for a while. The PGE feature of PPro+ i386 chips allows pages to be marked as ‘global’ and excluded from TLB flushes. This is used for marking things like the kernel text as global, so that kernel TLBs are not flushed on every context-switch, which is handy because the kernel doesn’t get out much (like kernel hackers), and always sits at PAGE_OFFSET after the end of process address space (unlike kernel hackers, who sit at desks). PGE becomes a problem with a guest running, as each guest involves having a real Linux kernel running (in ring 1), so the kernel address space is now shared between kernels and can no longer be treated as global.

This bug, which caused all kinds of ugly oopses in the host, exhibited some odd behavior: things would work perfectly under qemu, and also if you used taskset on a real machine to bind the lguest VMM (and thus the guest) to a single CPU. It seems that qemu’s SMP model also binds processes to a single CPU (as far as I observed, at least), which meant that debugging the problem under qemu[1] wasn’t going to be much help. It was also a big clue. What was happening is that PGE was only being disabled on the current CPU, and when a guest kernel was run on a different CPU, it would collide with global TLB entries for the host kernel previously running on that CPU. Ahh, fun!
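
For anyone wanting to reproduce the single-CPU behaviour, pinning the launcher (and hence the guest) with taskset looks something like this, reusing the lhype_add invocation from the quickstart further down:

    $ sudo taskset -c 0 linux-2.6-lhype/drivers/lhype/lhype_add 32M 1 linux-2.6-lhype/vmlinux \
            linux-test/linux.img netfile root=/dev/lhba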

Btw, for anyone who wants to help out, there’s a great overview of lguest by Jon Corbet of LWN. Between that, and Rusty’s documentation, lguest could be one of the best documented projects of its type, making it even easier to hack on.

[1] I should mention how easy it is to debug the lguest host with qemu (via Rusty):

  1. Compile kernel with CONFIG_DEBUG_INFO=y
  2. Run under $ qemu -s
  3. $ gdb vmlinux
  4. (gdb) target remote localhost:1234
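
From there it’s standard gdb usage; for example, setting a breakpoint on a function of interest (panic is a classic choice for catching host crashes) and letting the kernel run:

     (gdb) break panic
     (gdb) continue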

Linux virtualization advances

lhype is now lguest, avoiding the scandalous implication that virtualization is in any way associated with hype. Of course, its real name is still the rustyvisor.

Ingo’s paravirtualization work on KVM now includes a reported 5000% performance improvement in host-guest networking (further demonstrating that paravirtualization is beneficial). Hopefully, this will become part of a general kernel paravirt infrastructure which can also be used by the rustyvisor.

Bletchley Park Photos

I’ve spent the last week in London, including a visit to Bletchley Park, the site where the British conducted codebreaking efforts during WWII. The park is a Victorian estate, including a sprawling museum of cryptographic and other wartime exhibits. I’ve uploaded some photos to a flickr photoset.

I’d highly recommend visiting this place if in the London area. It’s an easy 30-minute train ride from the city, and there’s a lot of unique and historically fascinating stuff there. Thankfully, the facility has not been significantly renovated or destroyed, and many of the buildings even seem as if they’ve been sealed up since the end of the war.

2007 SELinux Symposium Agenda

The agenda for the 2007 SELinux Symposium has been published. Should be a solid couple of days for learning about the latest engineering & research efforts in the SELinux community. I’ll be especially interested in the SEDarwin talk, and also seeing what Karl has been doing with his new automated policy generation tool, Madison.

Speaking of whom, a sample chapter of SELinux by Example has been made available online.

We got jwz’d. Awesome.

SELinux kernel git repo

There’s been some flux recently in the git repos I maintain, also complicated by the ongoing kernel.org git issues.

I’ve now consolidated all of the SELinux-related repos into one repo, and use branches instead of separate repos. The public git URL is now:

      git://git.infradead.org/~jmorris/selinux-2.6

Web interface:
http://git.infradead.org/?p=users/jmorris/selinux-2.6;a=summary

Branches of interest:

    fixes      - patches which probably belong in the current Linus kernel
    for-2.6.20 - patches being queued for the 2.6.20 merge window
    for-davem  - network related SELinux patches heading for a DaveM repo
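
For example, to work with one of these branches after cloning (standard git usage):

    $ git clone git://git.infradead.org/~jmorris/selinux-2.6
    $ cd selinux-2.6
    $ git checkout for-2.6.20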

Patches in the first two branches are typically also found in the current -mm tree.

This repo is also available via kernel.org:

  git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6

(which is probably only of use to people with kernel.org accounts while the performance problems with the public repo system persist).

Thanks to David Woodhouse for providing the resources for this.

Update #2:
Thanks to some help from Sean Neakums (see comments), the branch for feeding into net-2.6.20 now works, i.e.:

    git://git.infradead.org/~jmorris/selinux-2.6#net-2.6.20

Rustyvisor Quickstart

I’ve been poking at Rusty’s hypervisor, lhype, which he’s developing as example code for the upcoming paravirt ops framework. The lhype patch applies cleanly to recent -mm kernels, and I’m queuing up some patches for Rusty here (or browse the git repo). The latest patch there adds support for Thread Local Storage, allowing things like Fedora binaries to run without segfaulting.

[ NOTE: an updated version of this procedure is now published here ]

If anyone wants a very quick and simple recipe to get up and running with lhype, try this:

1. Obtain an appropriate kernel and lhype patches [1]:

    $ git clone git://git.infradead.org/~jmorris/linux-2.6-lhype

2. Check out the ‘patches’ branch:

    $ cd linux-2.6-lhype
    $ git-checkout patches

3. Configure the kernel:

    $ make menuconfig

Keep it simple initially, to get up and running. Also keep in mind that lhype currently only works for i386, no high memory and no SMP. Enable the paravirt ops framework and select lhype as a module:

    CONFIG_PARAVIRT=y
    CONFIG_LHYPE=m

4. Build and install the kernel:

    $ make -j12 && sudo make modules_install && sudo make -j12 install

Or whatever usually works for you. Note that the kernel you’re building will be for both the guest and the host domains.

5. Boot into host kernel (dom0)

This should just work. Now the fun really begins.

6. Grab a nice Linux OS image [2]:

    $ wget http://fabrice.bellard.free.fr/qemu/linux-test-0.5.1.tar.gz
    $ tar -xzf linux-test-0.5.1.tar.gz linux-test/linux.img

7. Create a file for shared network i/o:

    $ dd if=/dev/zero of=netfile bs=1024 count=4

8. Launch a guest!

    $ sudo modprobe lhype
    $ sudo linux-2.6-lhype/drivers/lhype/lhype_add 32M 1 linux-2.6-lhype/vmlinux \
            linux-test/linux.img netfile root=/dev/lhba

If all went well, you should see the kernel boot messages fly past and then something like:

Linux version 2.6.19-rc5-mm2-gf808425d (jmorris@xeon.namei)
(gcc version 4.1.1 20060928 (Red Hat 4.1.1-28)) #1 PREEMPT Tue Nov 28 00:53:39 EST 2006

QEMU Linux test distribution (based on Redhat 9)

Type 'exit' to halt the system

sh: no job control in this shell
sh-2.05b#

It’s already useful for kernel hacking. Local networking works. It’d also likely be useful for teaching purposes, being relatively simple yet quite concrete.

‘lhype_add’ is an app included with the kernel which launches and monitors guest domains. It’s actually a simple ELF loader, which maps the guest kernel image into the host’s memory, then opens /proc/lhype and writes some config info about the guest. This kicks the hypervisor into action to initialize and launch the guest, while the open procfile fd is used for control, console i/o, and DMA-like i/o via shared memory (using ideas from Rusty’s earlier XenShare work). The hypervisor is simply a loadable kernel module. Cool stuff.

It’s a little different to Xen, in that the host domain (dom0) is simply a normal kernel running in ring 0 with userspace in ring 3. The hypervisor is a small ELF object loaded into the top of memory (when the lhype module is loaded), which contains some simple domain switching code, interrupt handlers, a few low-level objects which need to be virtualized, and finally an array of structs to maintain information for each guest domain (drivers/lhype/hypervisor.S).

The hypervisor runs in ring 0, with the guest domains running as host domain tasks in ring 1, trapping into the hypervisor for virtualized operations via paravirt ops hooks (arch/i386/kernel/lhype.c) and subsequent hypercalls (drivers/lhype/hypercalls.c). Thus, the hypervisor and host kernel run in the same ring, rather than, say, the hypervisor in ring 0 with the host kernel in ring 1, as is the case with Xen. The advantage for lhype is simplicity: the hypervisor can be kept extremely small and simple, because it only needs to handle tasks related solely to virtualization. It’s just 463 lines of assembler, with comments. Of course, from an isolation point of view the host kernel is effectively part of the hypervisor, because they share the same hardware privilege level. It has also been noted that in practice, a typical dom0 has so much privileged access to the hypervisor that it’s not necessarily meaningful to run them in separate rings. Probably a good beer @ OLS discussion topic.

Overall, it seems like a very clean and elegant design. I’ll see if I can write up some more detailed notes on what I’ve learned about it soon.

Note that Rusty will be giving a presumably canonical talk on lhype at LCA 2007.

The OSDL Virtualization list is probably the best place to keep up with development at this stage, as well as Rusty’s accurately titled bleeding edge page.


[1] You can also roll your own via the paravirt patch queue, where the core development takes place.
[2] QEMU image suggested for simplicity; e.g., an FC6 image will now work, although it won’t get past single user mode due to lack of support for initrd.

***

On a hypervisor-related note, IBM researcher Reiner Sailer has posted an ACM/sHype HOWTO for Fedora Core 6, which explains how to enable the current Xen security extensions.

***

While I was doing some ELF reading, I found this great document, A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux.

***

Looks like FOSS.IN/2006 was another big success.
o Photos
o Google news query

selinuxnews.org stats

About nine months ago, I set up selinuxnews.org to host a blog aggregator and news blog for the SELinux project. Today, I thought it’d be interesting to see how many hits the site was getting and ran the logs through webalizer:

webalizer stats for selinuxnews.org
(more detail …)

It’s been fairly consistent, with about 2300 hits and 600 visits per day, not including syndicated access, which seems difficult to estimate. Also, it’s interesting to discover that most hits on the site were for RSS and similar feeds. So much for HTML.