Rustyvisor Adventures

Hacking continues on lguest, with changes now committed for host SMP support. This is obviously a precursor to guest SMP support (coming next), but also scratches a major itch in being able to move away from a UP kernel in the development environment and get back to 21st century kernel compile times.

Adding SMP host support involved the removal of a bunch of globals in the assembly hypervisor shim (a tiny chunk of code which performs guest/host switching and interrupt para-virtualization), which is thus now even smaller and better encapsulated. The other notable change was converting the hypervisor’s pte page from global to per-cpu, another form of cleanup.

It’s nice when you can add a feature which actually simplifies and reduces the code. One of the great things about lguest is its simplicity, and a key challenge will be avoiding complexity and bloat as it matures.

These patches could do with some more testing. There was a bug in the PGE-disable code which caused a bit of head scratching for a while. The PGE feature of PPro+ i386 chips allows pages to be marked as ‘global’, excluding them from TLB flushes. This is used to mark things like the kernel text as global, so that kernel TLB entries are not flushed on every context switch, which is handy because the kernel doesn’t get out much (like kernel hackers), and always sits at PAGE_OFFSET after the end of the process address space (unlike kernel hackers, who sit at desks). PGE becomes a problem once a guest is running, as each guest involves a real Linux kernel running in ring 1; the kernel address space is then shared between host and guest kernels and can no longer be marked global.

This bug, which caused all kinds of ugly oopses in the host, exhibited some odd behavior: things would work perfectly under qemu, and also if you used taskset on a real machine to bind the lguest VMM (and thus the guest) to a single CPU. It seems that qemu’s SMP model also binds processes to a single CPU (as far as I observed, at least), which meant that debugging the problem under qemu[1] wasn’t going to be much help. It was also a big clue. What was happening was that PGE was being disabled only on the current CPU, so when a guest kernel was run on a different CPU, it would collide with global TLB entries left behind by the host kernel previously running on that CPU. Ahh, fun!

Btw, for anyone who wants to help out, there’s a great overview of lguest by Jon Corbet of LWN. Between that, and Rusty’s documentation, lguest could be one of the best documented projects of its type, making it even easier to hack on.

[1] I should mention how easy it is to debug the lguest host with qemu (via Rusty):

  1. Compile kernel with CONFIG_DEBUG_INFO=y
  2. Run under $ qemu -s
  3. $ gdb vmlinux
  4. (gdb) target remote localhost:1234

Linux virtualization advances

lhype is now lguest, avoiding the scandalous implication that virtualization is in any way associated with hype. Of course, its real name is still the rustyvisor.

Ingo’s paravirtualization work on KVM now includes a reported 5000% performance improvement in host-guest networking (further demonstrating that paravirtualization is beneficial). Hopefully, this will become part of a general kernel paravirt infrastructure which can also be used by the rustyvisor.

Bletchley Park Photos

I’ve spent the last week in London, including a visit to Bletchley Park, the site where the British conducted codebreaking efforts during WWII. The park is a Victorian estate, including a sprawling museum of cryptographic and other wartime exhibits. I’ve uploaded some photos to a flickr photoset.

Bletchley Park Photos

I’d highly recommend visiting this place if in the London area. It’s an easy 30-minute train ride from the city, and there’s a lot of unique and historically fascinating stuff there. Thankfully, the facility has not been significantly renovated or destroyed, and many of the buildings even seem as if they’ve been sealed up since the end of the war.

2007 SELinux Symposium Agenda

The agenda for the 2007 SELinux Symposium has been published. Should be a solid couple of days for learning about the latest engineering & research efforts in the SELinux community. I’ll be especially interested in the SEDarwin talk, and also seeing what Karl has been doing with his new automated policy generation tool, Madison.

Speaking of whom, a sample chapter of SELinux by Example has been made available online.

We got jwz’d. Awesome.

SELinux kernel git repo

There’s been some flux recently in the git repos I maintain, also complicated by the ongoing kernel.org git issues.

I’ve now consolidated all of the SELinux-related repos into one repo, and use branches instead of separate repos. The public git URL is now:

      git://git.infradead.org/~jmorris/selinux-2.6

Web interface:
http://git.infradead.org/?p=users/jmorris/selinux-2.6;a=summary

Branches of interest:

    fixes      - patches which probably belong in the current Linus kernel
    for-2.6.20 - patches being queued for the 2.6.20 merge window
    for-davem  - network related SELinux patches heading for a DaveM repo

Patches in the first two branches will typically also be found in the current -mm tree.

This repo is also available via kernel.org:

  git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6

(which is probably only of use to people with kernel.org accounts while the performance problems on the public repo system persist).

Thanks to David Woodhouse for providing the resources for this.

Update #2:
Thanks to some help from Sean Neakums (see comments), the branch for feeding into net-2.6.20 now works. i.e.

    git://git.infradead.org/~jmorris/selinux-2.6#net-2.6.20

Rustyvisor Quickstart

I’ve been poking at Rusty’s hypervisor, lhype, which he’s developing as example code for the upcoming paravirt ops framework. The lhype patch applies cleanly to recent -mm kernels, and I’m queuing up some patches for Rusty here (or browse the git repo). The latest patch there adds support for Thread Local Storage, allowing things like Fedora binaries to run without segfaulting.

[ NOTE: an updated version of this procedure is now published here ]

If anyone wants a very quick and simple recipe to get up and running with lhype, try this:

1. Obtain an appropriate kernel and lhype patches[1]:

    $ git clone git://git.infradead.org/~jmorris/linux-2.6-lhype

2. Check out the ‘patches’ branch:

    $ cd linux-2.6-lhype
    $ git-checkout patches

3. Configure the kernel:

    $ make menuconfig

Keep it simple initially, to get up and running. Also keep in mind that lhype currently works only on i386, with no high memory and no SMP. Enable the paravirt ops framework and select lhype as a module:

    CONFIG_PARAVIRT=y
    CONFIG_LHYPE=m

4. Build and install the kernel:

    $ make -j12 && sudo make modules_install && sudo make -j12 install

Or whatever usually works for you. Note that the kernel you’re building will serve as both the guest and the host kernel.

5. Boot into host kernel (dom0)

This should just work. Now the fun really begins.

6. Grab a nice Linux OS image[2]:

    $ wget http://fabrice.bellard.free.fr/qemu/linux-test-0.5.1.tar.gz
    $ tar -xzf linux-test-0.5.1.tar.gz linux-test/linux.img

7. Create a file for shared network i/o:

    $ dd if=/dev/zero of=netfile bs=1024 count=4

8. Launch a guest!

    $ sudo modprobe lhype
    $ sudo linux-2.6-lhype/drivers/lhype/lhype_add 32M 1 linux-2.6-lhype/vmlinux \
            linux-test/linux.img netfile root=/dev/lhba

If all went well, you should see the kernel boot messages fly past and then something like:

Linux version 2.6.19-rc5-mm2-gf808425d (jmorris@xeon.namei)
(gcc version 4.1.1 20060928 (Red Hat 4.1.1-28)) #1 PREEMPT Tue Nov 28 00:53:39 EST 2006

QEMU Linux test distribution (based on Redhat 9)

Type 'exit' to halt the system

sh: no job control in this shell
sh-2.05b#

It’s already useful for kernel hacking. Local networking works. It’d also likely be useful for teaching purposes, being relatively simple yet quite concrete.

‘lhype_add’ is an app included with the kernel which launches and monitors guest domains. It’s actually a simple ELF loader, which maps the guest kernel image into the host’s memory, then opens /proc/lhype and writes some config info about the guest. This kicks the hypervisor into action to initialize and launch the guest, while the open procfile fd is used for control, console i/o, and DMA-like i/o via shared memory (using ideas from Rusty’s earlier XenShare work). The hypervisor is simply a loadable kernel module. Cool stuff.

It’s a little different to Xen, in that the host domain (dom0) is simply a normal kernel running in ring 0 with userspace in ring 3. The hypervisor is a small ELF object loaded into the top of memory (when the lhype module is loaded), which contains some simple domain switching code, interrupt handlers, a few low-level objects which need to be virtualized, and finally an array of structs to maintain information for each guest domain (drivers/lhype/hypervisor.S).

The hypervisor runs in ring 0, with the guest domains running as host domain tasks in ring 1, trapping into the hypervisor for virtualized operations via paravirt ops hooks (arch/i386/kernel/lhype.c) and subsequent hypercalls (drivers/lhype/hypercalls.c). Thus, the hypervisor and host kernel run in the same ring, rather than, say, the hypervisor in ring 0 with the host kernel in ring 1, as is the case with Xen. The advantage for lhype is simplicity: the hypervisor can be kept extremely small and simple, because it only needs to handle tasks related solely to virtualization. It’s just 463 lines of assembler, with comments. Of course, from an isolation point of view the host kernel is effectively part of the hypervisor, because they share the same hardware privilege level. It has also been noted that in practice, a typical dom0 has so much privileged access to the hypervisor that it’s not necessarily meaningful to run them in separate rings. Probably a good beer @ OLS discussion topic.

Overall, it seems like a very clean and elegant design. I’ll see if I can write up some more detailed notes on what I’ve learned about it soon.

Note that Rusty will be giving a presumably canonical talk on lhype at LCA 2007.

The OSDL Virtualization list is probably the best place to keep up with development at this stage, as well as Rusty’s accurately titled bleeding edge page.


[1] You can also roll your own via the paravirt patch queue, where the core development takes place.
[2] Qemu image suggested for simplicity. e.g. an FC6 image will now work, although it won’t get past single user mode due to lack of support for initrd.

***

On a hypervisor-related note, IBM researcher Reiner Sailer has posted an ACM/sHype HOWTO for Fedora Core 6, which explains how to enable the current Xen security extensions.

***

While I was doing some ELF reading, I found this great document, A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux.

***

Looks like FOSS.IN/2006 was another big success.
o Photos
o Google news query

selinuxnews.org stats

About nine months ago, I set up selinuxnews.org to host a blog aggregator and news blog for the SELinux project. Today, I thought it’d be interesting to see how many hits the site was getting and ran the logs through webalizer:

webalizer stats for selinuxnews.org
(more detail …)

It’s been fairly consistent, with about 2300 hits and 600 visits per day, not including syndicated access, which seems difficult to estimate. Also, it’s interesting to discover that most hits on the site were for RSS and similar feeds. So much for HTML.

Another quick update

SELinux

Dan Walsh has been giving some RHEL security presentations recently, and I thought the slides were pretty cool, so here they are in PDF form. They include a nice overview of the comprehensive security features available in RHEL, including SELinux, ExecShield and PIE. The slides are obviously written from a Red Hat point of view, but also apply to Fedora and any of the distros incorporating the same features.

I’ve been reviewing papers for the SELinux Symposium, and am very impressed with the quality and scope of the activity within and around the SELinux community. Much progress continues to be made in many areas, including usability (as can also be seen by tracking SELinux News).

The release of Fedora Core 6 has unleashed many of the SELinux infrastructure and usability improvements, such as modular policy and setroubleshoot. There’ve been some nice reviews, like Fedora Core 6: Innovations Continue from eweek:

Review: The fast-moving Red Hat distribution polishes SELinux, adds new tools and improves performance.

In its first five releases, Red Hat’s Fedora Core has represented the Linux technology vanguard. And so it is with Fedora Core 6.

During tests, Fedora Core 6 impressed eWEEK Labs with the progress it has made toward making Security-Enhanced Linux—and the dramatically improved security protections that SELinux helps afford—more palatable.

Of course, it’s not perfect yet, but it’s reassuring to see that people will come back and re-evaluate things you’ve been working to improve.

One analogy I like to use is to think of SELinux as a major, fundamental change in computer security, with the integration of Mandatory Access Control into general purpose computing. Like any significant change, it can be a long and difficult road, and really a matter of effort and persistence once you have the fundamentals in place. So, the progress of SELinux is perhaps like the development of programming languages, where you had a progression from machine code to assembler, then to higher level languages like C, Perl and Ajax. I’d say, with SELinux, that we’re past the assembly and 3GL phase and are moving to scripting and graphical IDEs. Just like you wouldn’t say computers are too complicated because Joe Random can’t understand the Intel® 64 and IA-32 Architectures Software Developer’s Manual, it is nonsensical to say that SELinux is fundamentally too complicated because of its underlying architecture. In fact, SELinux is merely revealing the true extent of the complexity of the security interactions in the OS, and providing a flexible mechanism to control it. Usability is entirely a matter of developing the right abstractions. You can’t pretend complexity isn’t there, but you can hide it. On the other hand, if your fundamentals are wrong, no amount of eye candy can help.

Virtualization

FC6 also features a bunch of improvements in virtualization (e.g. virt-manager and Cobbler), and I hope that we can make Fedora the place to go for the latest and greatest in this area, too (if it’s not already?) I’ve created a virtualization category on the Fedora Project wiki, to collect together the various related pages there. One area that I’m directly involved in is hypervisor research & development. We’ve moved some of our internal (RH) project planning & tracking stuff out to the wiki: see here. If you’re interested in this area, the Fedora project will warmly welcome any contributions.

Conferences

I’m sad to say I won’t be able to make it to FOSS.IN this year, due to family commitments. It’s one of a few top-tier technical conferences, alongside the likes of LCA and Linux-Kongress, and is especially exciting because of the stunning rate of growth of the community in India. (btw, these global Internet growth stats are very interesting, where India has had over 1000% Internet growth since 2000, which I suspect correlates strongly with FOSS growth). The Final List of Talks looks fantastic.

I wonder which country will be the next to establish a major Linux technical conference. (Japan?)

Rusty’s Moustache

Is reportedly making a comeback.