Rustyvisor Network Performance

I thought it’d be useful to have a look at the performance of lguest networking, to establish a baseline for potential future improvements.

The I/O model uses a DMA-like scheme for transferring data between the host and the guest, which is clean and simple, but currently involves extra transitions across ring boundaries. For example, when an external packet for the guest arrives at the host, it is passed to host userland via the TAP driver. An I/O monitor detects that the TAP fd is readable via select(2), and sends a USR1 signal to the guest launcher, which causes the guest to switch out of ring 1 (where it was executing as the guest kernel) and back into a normal ring 3 process in the host. The guest process then prepares a DMA buffer with the packet and triggers a virtual IRQ, which fires after it switches back to the guest kernel in its main processing loop. The guest kernel then processes the packet via a paravirt network driver and its own network stack, after which, a userland process of the guest is switched in to receive the data.
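
To make the host-side hand-off concrete, here is a rough sketch of the I/O monitor step described above; the function and variable names (io_monitor, tap_fd, launcher_pid) are placeholders of mine, not lguest's actual code:

/*
 * Sketch: block until the TAP fd has a packet queued, then signal the
 * guest launcher so it drops back to ring 3 in the host and can set up
 * the DMA buffer and virtual IRQ for the guest.
 */
#include <sys/select.h>
#include <signal.h>
#include <unistd.h>

static void io_monitor(int tap_fd, pid_t launcher_pid)
{
    for (;;) {
        fd_set rfds;

        FD_ZERO(&rfds);
        FD_SET(tap_fd, &rfds);

        /* Wait until a packet for the guest arrives on the TAP interface. */
        if (select(tap_fd + 1, &rfds, NULL, NULL, NULL) < 0)
            continue;

        /* Kick the launcher out of the guest with SIGUSR1. */
        kill(launcher_pid, SIGUSR1);
    }
}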

So, in summary, the packet traverses two network stacks across several ring boundaries and context switches. Oh, and NAT or bridging would have been used in the host to get the packet into the TAP interface.

How does this impact performance?

I ran some very simple TCP bandwidth tests with iperf.

First, a vanilla kernel without paravirt_ops configured, to help ultimately isolate the impact of lguest. The figures here are in MByte/s for a local 2.4GHz 4-way Xeon system talking over switched gigabit Ethernet to a uniprocessor 2.4GHz P4.

loopback   client   server
--------------------------
     113     78.3     48.6

‘client’ indicates the bandwidth obtained with the local system acting as an iperf client, while ‘server’ means the local system was acting as an iperf server.
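
In other words, the client numbers measure how fast the local box can push a bulk TCP stream out, and the server numbers how fast it can sink one. Conceptually, the client side of such a test boils down to the following (a sketch of the idea, not iperf itself; the server address and the 10-second duration are arbitrary placeholders):

/*
 * Minimal bulk-TCP sender: connect, blast a buffer for ~10 seconds,
 * report MByte/s. A sketch of what the "client" figures measure.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[64 * 1024];
    unsigned long long total = 0;
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(5001) };
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    time_t start;

    memset(buf, 0, sizeof(buf));
    inet_pton(AF_INET, "192.168.0.2", &addr.sin_addr); /* placeholder server */
    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    start = time(NULL);
    while (time(NULL) - start < 10) {
        ssize_t n = write(sock, buf, sizeof(buf));
        if (n <= 0)
            break;
        total += n;
    }
    printf("%.1f MByte/s\n", total / (1024.0 * 1024.0) / 10);
    close(sock);
    return 0;
}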

Next, I configured a bridge device as the external interface, to get an idea of the impact of the bridging code alone, as it will be used to get packets to the guest.

loopback   client   server
--------------------------
       -     30.2     44.3

It looks like bridging imposes quite a hit on the send path.
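
For reference, the bridge setup itself is just a couple of ioctls on a control socket; this is roughly the equivalent of "brctl addbr br0; brctl addif br0 eth0" (the device names are placeholders, and bringing the bridge up and moving the IP address across are left out):

/*
 * Rough equivalent of "brctl addbr" plus "brctl addif", sketched with
 * the bridge ioctls; error handling omitted.
 */
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/sockios.h>
#include <unistd.h>

static int add_bridge(const char *bridge, const char *port)
{
    struct ifreq ifr;
    int ret, sock = socket(AF_LOCAL, SOCK_STREAM, 0);

    /* Create the bridge device. */
    ioctl(sock, SIOCBRADDBR, bridge);

    /* Enslave the physical interface to the bridge. */
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, bridge, IFNAMSIZ - 1);
    ifr.ifr_ifindex = if_nametoindex(port);
    ret = ioctl(sock, SIOCBRADDIF, &ifr);

    close(sock);
    return ret;
}

In this setup, the guest's TAP interface would be added to the same bridge in the same way.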

Then, the kernel was recompiled with paravirt_ops, to see if that alone would affect performance:

loopback   client   server
--------------------------
   113.6     77.1      49

Not really.

Bridging and paravirt_ops together had more of an impact than just bridging:

loopback   client   server
--------------------------
       -      26.4      40

This is the figure from which we measure the impact of lguest.

One thing I wanted to check before running a guest was the impact of simply loading the lguest module, as it disables PGE in the host so that the TLB entries covering kernel memory are flushed when switching between the guest and host kernels. This has the side effect of flushing those formerly global entries on all context switches.

loopback   client   server
--------------------------
     110     26.1     39.9

Seems like there was only a slight performance hit in this case.
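
For the curious, "disabling PGE" boils down to clearing the global-page-enable bit in CR4 on each CPU, after which kernel TLB entries are no longer marked global and so get flushed on every CR3 reload. A minimal ring 0 sketch of the idea (not the actual lguest code):

/*
 * Sketch only, must run in ring 0: clear CR4.PGE so kernel mappings are
 * no longer marked global in the TLB, which means they get flushed on
 * every CR3 reload (i.e. on every address space switch).
 */
#define X86_CR4_PGE (1UL << 7)  /* page global enable bit in CR4 */

static inline unsigned long read_cr4(void)
{
    unsigned long cr4;
    asm volatile("mov %%cr4, %0" : "=r" (cr4));
    return cr4;
}

static inline void write_cr4(unsigned long cr4)
{
    asm volatile("mov %0, %%cr4" : : "r" (cr4));
}

static void disable_global_pages(void)
{
    unsigned long cr4 = read_cr4();

    if (cr4 & X86_CR4_PGE)
        write_cr4(cr4 & ~X86_CR4_PGE);  /* also flushes global TLB entries */
}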

Now, a guest instance:

loopback   client   server
--------------------------
    42.6      8.3     10.5

So, it seems the network performance of lguest over the link is around 25-30% of native, allowing for the overhead of the bridging code. This is about an order of magnitude faster than what I’ve seen with QEMU (which is not really a fair comparison, as it’s an emulator, but a useful smoke test), and competitive with VMware and UML according to some Xen performance figures. Obviously, it’d be nice to get up to Xen’s near-native performance at some stage.

I also ran the same test using NAT instead of bridging, with similar results:

loopback   client   server
--------------------------
       -      8.5      9.9

Here are the figures for guest to host networking:

loopback   client   server
--------------------------
       -        8     11.8

Guest to guest networking (via --sharenet) ran at 20.1 MByte/s.