
SELinux Symposium 2007

I’m in Baltimore for the 2007 SELinux Symposium. The past two days have been for tutorials, with two days now for the conference, and then a final developer summit day.


This morning’s keynote featured a talk by Richard Schaeffer, Director of Information Assurance at the NSA. Richard has been in this business for a very long time, and he provided some very interesting perspectives on computer security (I think the slides will eventually go up on the web site). We then had two sessions of solid technical talks, and are currently in the first of two WIP sessions. There’s a lot of interesting work happening now extending SELinux out past the base OS, with increasingly mature analysis and development tools, as well as continued refinements to the core technology.

Chris Vance’s SEDarwin talk was particularly interesting: apparently the next version of OS X will ship with the TrustedBSD MAC framework. This won’t initially include the SEDarwin work, but hopefully it’ll be possible to get it running without too much trouble.

There’s also good progress being made on extending SELinux to the desktop, as described in NSA talks on GConf integration and X.org integration. Xinwen Zhang of Samsung gave an interesting talk on extending SELinux to mobile platforms (such as cell phones), and related research into platform integrity.

The SELinux community is growing and looking very healthy, although I think we can still do a lot more to encourage wider participation.

Snippets

Steve Rostedt and Glauber de Oliveira Costa have just posted initial patches for a 64-bit (x86-64) version of lguest. Looks like the next steps will be to consolidate the code into common and per-arch components. Initial feedback from Rusty seems good.

I’ve been working on converting lguest over to the new clockevent and dyntick frameworks. (Thomas Gleixner’s OLS paper, Hrtimers and Beyond: Transforming the Linux Time Subsystems, is also a great reference on the topic).

Tickless operation is particularly useful for virtual machines, allowing clock events (for timers etc.) to be programmed on demand, with events only being delivered to VMs as required, rather than, say, generating a synthetic tick stream for each VM (or, in the case of lguest, switching out of each guest on every host clock tick). It’ll be interesting to re-run the networking benchmarks with tickless & high-resolution timer support.
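As a rough illustration of what the clockevent conversion involves, here’s a minimal sketch of a one-shot paravirt clock event device against the 2.6.21-era clockevents API (guest_clockevent and guest_program_host_timer are hypothetical names, not the actual lguest code, and several required fields are omitted for brevity):

    #include <linux/clockchips.h>

    /* hypothetical hypercall wrapper, assumed to be provided elsewhere */
    extern int guest_program_host_timer(unsigned long delta);

    /* Hypothetical sketch, not the actual lguest code: a one-shot clock
     * event device which asks the host to fire a timer interrupt only
     * when the guest actually needs one. */
    static int guest_set_next_event(unsigned long delta,
                                    struct clock_event_device *evt)
    {
        return guest_program_host_timer(delta);
    }

    static struct clock_event_device guest_clockevent = {
        .name           = "guest",
        .features       = CLOCK_EVT_FEAT_ONESHOT,
        .rating         = 400,
        .set_next_event = guest_set_next_event,
        /* .set_mode, .mult, .shift, .min_delta_ns etc. omitted */
    };

    /* Registered once at boot with
     *     clockevents_register_device(&guest_clockevent);
     * after which the generic timer code programs events on demand
     * instead of relying on a periodic tick. */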

There may be some scope to consolidate common clock code between several HV projects (Xen & lguest have nearly identical clockevent code in progress), although it’s not entirely clear yet how much can be usefully gained, as the new clock APIs make the client code fairly simple.

Máirín Duffy has created a new logo for lguest (also now known as the puppyvisor):

[image: new lguest logo]

Via Val Henson: some hilarious slides from a talk on network protocols by Radia Perlman. I’m sad to say I haven’t seen Radia give a talk as yet.

The SELinux Symposium is on next week!

Snippets

The rustyvisor is now segmentless, paving the way for x86_64, nested interrupts (useful for oprofile) and probably simpler guest SMP. It’s amazing how much can happen in Linux over a weekend. I was pleasantly surprised to see that my simple patch to enable bridging was applied in the midst of the rewrite.

Early registration for the 2007 SELinux Symposium ends in a few days. The WIP program is now quite full — I’ll be interested to see KaiGai Kohei’s talk on what’s happening with SELinux in Japan.

Rustyvisor Network Performance

I thought it’d be useful to have a look at the performance of lguest networking, to establish a baseline for potential future improvements.

The I/O model uses a DMA-like scheme for transferring data between the host and the guest, which is clean and simple, but currently involves extra transitions across ring boundaries. For example, when an external packet for the guest arrives at the host, it is passed to host userland via the TAP driver. An I/O monitor detects that the TAP fd is readable via select(2), and sends a USR1 signal to the guest launcher, which causes the guest to switch out of ring 1 (where it was executing as the guest kernel) and back into a normal ring 3 process in the host. The guest process then prepares a DMA buffer with the packet and triggers a virtual IRQ, which fires after it switches back to the guest kernel in its main processing loop. The guest kernel then processes the packet via a paravirt network driver and its own network stack, after which, a userland process of the guest is switched in to receive the data.

So, in summary, the packet traverses two network stacks across several ring boundaries and context switches. Oh, and NAT or bridging would have been used in the host to get the packet into the TAP interface.
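To make the host-side step concrete, here’s a rough userspace sketch of the I/O monitor described above (io_monitor and its arguments are hypothetical names, not the actual lguest launcher code):

    #include <sys/select.h>
    #include <sys/types.h>
    #include <signal.h>
    #include <unistd.h>

    /* Hypothetical sketch: wait for the TAP fd to become readable, then
     * signal the guest launcher so the guest drops out of ring 1 and the
     * launcher (back in ring 3) can queue the packet into a DMA buffer
     * and raise a virtual IRQ. */
    static void io_monitor(int tap_fd, pid_t launcher_pid)
    {
        for (;;) {
            fd_set rfds;

            FD_ZERO(&rfds);
            FD_SET(tap_fd, &rfds);

            if (select(tap_fd + 1, &rfds, NULL, NULL, NULL) < 0)
                continue;   /* e.g. interrupted by a signal */

            if (FD_ISSET(tap_fd, &rfds))
                kill(launcher_pid, SIGUSR1);
        }
    }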

How does this impact performance?

I ran some very simple TCP bandwidth tests with iperf.

First, a vanilla kernel, without paravirt_ops configured, to help ultimately isolate the impact of lguest. The figures here are in MByte/s for a local 2.4GHz 4-way Xeon system talking over switched gigabit ethernet to a uniprocessor 2.4GHz P4.

loopback   client   server
--------------------------
     113     78.3     48.6

‘client’ indicates the bandwidth obtained with the local system acting as an iperf client, while ‘server’ means the local system was acting as an iperf server.

Next, I configured a bridge device as the external interface, to get an idea of the impact of the bridging code alone, as it will be used to get packets to the guest.

loopback   client   server
--------------------------
       -     30.2     44.3

Looks like bridging imposes quite a hit, particularly when the local system is sending (the client case).

Then, the kernel was recompiled with paravirt_ops, to see if that alone would affect performance:

loopback   client   server
--------------------------
   113.6     77.1      49

Not really.

Bridging and paravirt_ops together had more of an impact than just bridging:

loopback   client   server
--------------------------
       -      26.4      40

This is the baseline from which we measure the impact of lguest.

One thing I wanted to check before running a guest was the impact of simply loading the lguest module, as it disables PGE in the host, so that the TLBs covering kernel memory are flushed when switching between the guest and host kernels. This has the side-effect of causing global page TLB flushes on all context switches.

loopback   client   server
--------------------------
     110     26.1     39.9

Seems like there was only a slight performance hit in this case.
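For reference, the PGE disable itself boils down to clearing a bit in CR4, along these lines (a hedged sketch, not the actual lguest code; host_disable_pge is a hypothetical name, and the read_cr4()/write_cr4() helpers have moved between headers over the years):

    #include <asm/processor.h>

    /* Hedged sketch, not the actual lguest code: clearing CR4.PGE
     * removes the 'global' attribute from kernel mappings, and changing
     * the PGE bit also flushes all TLB entries, global ones included.
     * From then on, kernel TLB entries are dropped on every CR3 reload,
     * i.e. on every context switch. */
    static void host_disable_pge(void)
    {
        write_cr4(read_cr4() & ~X86_CR4_PGE);
    }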

Now, a guest instance:

loopback   client   server
--------------------------
    42.6      8.3     10.5

So, it seems the network performance of lguest over the link is around 25-30% of the equivalent host figures, allowing for the overhead of the bridging code (8.3 vs 26.4 MB/s as a client, 10.5 vs 40 MB/s as a server). This is about an order of magnitude faster than what I’ve seen with qemu (which is not really a fair comparison, as it’s an emulator, but a useful smoke-test), and competitive with VMware and UML according to some Xen performance figures. Obviously, it’d be nice to get up to Xen’s near-native performance at some stage.

I also ran the same test using NAT instead of bridging, with similar results:

loopback   client   server
--------------------------
       -      8.5      9.9

Here are the figures for guest to host networking:

loopback   client   server
--------------------------
       -        8     11.8

while guest to guest networking (via --sharenet) ran at 20.1 MB/s.

Rustyvisor Adventures

Hacking continues on lguest, with changes now committed for host SMP support. This is obviously a precursor to guest SMP support (coming next), but also scratches a major itch in being able to move away from a UP kernel in the development environment and get back to 21st century kernel compile times.

Adding SMP host support involved the removal of a bunch of globals in the assembly hypervisor shim (a tiny chunk of code which performs guest/host switching and interrupt para-virtualization), which is thus now even smaller and better encapsulated. The other notable change was converting the hypervisor’s pte page from global to per-cpu, another form of cleanup.
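As a rough sketch of what that conversion looks like (hv_pte_page and this_cpu_hv_pte_page are hypothetical names, not the actual lguest code, and __get_cpu_var() is the 2.6.20-era per-cpu accessor, since removed from the kernel):

    #include <linux/percpu.h>
    #include <asm/pgtable.h>

    /* Hypothetical sketch of the change: instead of a single pte page
     * shared by every CPU,
     *
     *     static pte_t *hv_pte_page;
     *
     * each CPU gets its own, so guests running concurrently on different
     * CPUs each have their own mapping for the switching code: */
    static DEFINE_PER_CPU(pte_t *, hv_pte_page);

    static pte_t *this_cpu_hv_pte_page(void)
    {
        /* caller is expected to have preemption disabled */
        return __get_cpu_var(hv_pte_page);
    }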

It’s nice when you can add a feature which actually simplifies and reduces the code. One of the great things about lguest is its simplicity, and a key challenge will be avoiding complexity and bloat as it matures.

These patches could do with some more testing. There was a bug in the PGE-disable code which caused a bit of head scratching for a while. The PGE feature of PPro+ i386 chips allows pages to be marked as ‘global’, and not be included in TLB flushes. This is used for marking things like the kernel text as global, so that kernel TLBs are not flushed on every context-switch, which is handy because the kernel doesn’t get out much (like kernel hackers), and always sits at PAGE_OFFSET after the end of process address space (unlike kernel hackers, who sit at desks). PGE becomes a problem when a guest is running, as each guest involves having a real Linux kernel running (in ring 1), so the host kernel’s address space is no longer the only one in play and its mappings can no longer be left global.

This bug, which caused all kinds of ugly oopses in the host, exhibited some odd behavior: things would work perfectly under qemu, and also if you used taskset on a real machine to bind the lguest VMM (and thus the guest) to a single CPU. It seems that qemu’s SMP model also binds processes to a single CPU (as far as I observed, at least), which meant that debugging the problem under qemu[1] wasn’t going to be much help. It was also a big clue. What was happening is that PGE was only being disabled on the current CPU, and when a guest kernel was run on a different CPU, it would collide with global TLB entries for the host kernel previously running on that CPU. Ahh, fun!
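The shape of the fix looks roughly like this (a hedged sketch, not the actual patch; clear_pge and disable_pge_everywhere are hypothetical names, and the on_each_cpu() call uses the 2.6.20-era four-argument signature, which later kernels trimmed):

    #include <linux/smp.h>
    #include <asm/processor.h>

    /* Hedged sketch of the fix: clear CR4.PGE on *every* CPU at module
     * load, not just on whichever CPU happens to run the init code, so
     * no CPU is left holding global TLB entries for the host kernel when
     * a guest is later scheduled onto it. */
    static void clear_pge(void *unused)
    {
        /* changing CR4.PGE also flushes all TLB entries */
        write_cr4(read_cr4() & ~X86_CR4_PGE);
    }

    static void disable_pge_everywhere(void)
    {
        on_each_cpu(clear_pge, NULL, 0, 1);   /* retry=0, wait=1 */
    }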

Btw, for anyone who wants to help out, there’s a great overview of lguest by Jon Corbet of LWN. Between that, and Rusty’s documentation, lguest could be one of the best documented projects of its type, making it even easier to hack on.

[1] I should mention how easy it is to debug the lguest host with qemu (via Rusty):

  1. Compile the kernel with CONFIG_DEBUG_INFO=y
  2. Run it under $ qemu -s (which starts a gdb stub listening on TCP port 1234)
  3. $ gdb vmlinux
  4. (gdb) target remote localhost:1234

Linux virtualization advances

lhype is now lguest, avoiding the scandalous implication that virtualization is in any way associated with hype. Of course, its real name is still the rustyvisor.

Ingo’s paravirtualization work on KVM now includes a reported 5000% performance improvement in host-guest networking (further demonstrating that paravirtualization is beneficial). Hopefully, this will become part of a general kernel paravirt infrastructure which can also be used by the rustyvisor.

Bletchley Park Photos

I’ve spent the last week in London, including a visit to Bletchley Park, the site where the British conducted codebreaking efforts during WWII. The park is a Victorian estate which now houses a sprawling museum of cryptographic and other wartime exhibits. I’ve uploaded some photos to a flickr photoset.


I’d highly recommend visiting this place if you’re in the London area. It’s an easy 30-minute train ride from the city, and there’s a lot of unique and historically fascinating stuff there. Thankfully, the facility has not been significantly renovated or destroyed, and many of the buildings even seem as if they’ve been sealed up since the end of the war.

2007 SELinux Symposium Agenda

The agenda for the 2007 SELinux Symposium has been published. Should be a solid couple of days for learning about the latest engineering & research efforts in the SELinux community. I’ll be especially interested in the SEDarwin talk, and also seeing what Karl has been doing with his new automated policy generation tool, Madison.

Speaking of whom, a sample chapter of SELinux by Example has been made available online.

We got jwz’d. Awesome.

SELinux kernel git repo

There’s been some flux recently in the git repos I maintain, also complicated by the ongoing kernel.org git issues.

I’ve now consolidated all of the SELinux-related repos into one repo, and use branches instead of separate repos. The public git URL is now:

      git://git.infradead.org/~jmorris/selinux-2.6

Web interface:
http://git.infradead.org/?p=users/jmorris/selinux-2.6;a=summary

Branches of interest:

    fixes      - patches which probably belong in the current Linus kernel
    for-2.6.20 - patches being queued for the 2.6.20 merge window
    for-davem  - network related SELinux patches heading for a DaveM repo

Patches in the first two branches will typically also be found in the current -mm tree.

This repo is also available via kernel.org:

  git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6

(which is probably only of use to people with kernel.org accounts while the performance problems on the public repo system persist).

Thanks to David Woodhouse for providing the resources for this.

Update #2:
Thanks to some help from Sean Neakums (see comments), the branch for feeding into net-2.6.20 now works, i.e.:

    git://git.infradead.org/~jmorris/selinux-2.6#net-2.6.20