IPsec performance

I’ve been playing with Jari Ruusu’s Pentium-optimized MD5 code (from loop-aes), looking to integrate it into the kernel Crypto API. The performance improvement for IPsec looks impressive so far. For loopback AH, throughput on my Xeon quadruples, from 10MB/s to 40MB/s. Jari’s code uses only i386 instructions, scheduled for the Pentium II. I wonder how much further this could be optimized with instructions and techniques for newer CPUs.

The improvement for loopback ESP with null encryption and MD5 authentication is much smaller: about 18%, from 2.7MB/s to 3.2MB/s. I wonder what’s making ESP so slow to start with, though. Clearly, ESP header processing is going to add overhead, but even with the faster MD5, ESP is still more than ten times slower than AH, which seems pretty bad. Hmmm.
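One way to put these numbers in context is to measure raw per-packet authentication cost in isolation. The sketch below is only a userspace approximation, not the kernel Crypto API and not Jari's assembly: it times HMAC-MD5 truncated to 96 bits (the form of MD5 authentication used by AH and ESP) over fixed-size buffers using Python's standard `hashlib` and `hmac` modules. The function names, the 1400-byte packet size, the packet count and the key are all assumptions chosen for illustration. If the figure it reports is far above the ESP throughput, the bottleneck is probably somewhere other than the hash.

```python
import hashlib
import hmac
import time


def hmac_md5_96(key: bytes, packet: bytes) -> bytes:
    """Return HMAC-MD5 over the packet, truncated to 96 bits (12 bytes)."""
    return hmac.new(key, packet, hashlib.md5).digest()[:12]


def throughput_mb_s(packet_size: int = 1400, packets: int = 20000) -> float:
    """Authenticate `packets` buffers of `packet_size` bytes and return MB/s.

    The packet size and count are illustrative assumptions, not values
    taken from any real IPsec configuration.
    """
    key = b"\x0b" * 16          # arbitrary 128-bit key, for timing only
    packet = bytes(packet_size)  # zero-filled stand-in for packet data
    start = time.perf_counter()
    for _ in range(packets):
        hmac_md5_96(key, packet)
    elapsed = time.perf_counter() - start
    return (packet_size * packets) / elapsed / 1e6


if __name__ == "__main__":
    print(f"HMAC-MD5-96: {throughput_mb_s():.1f} MB/s")
```

The result depends heavily on the machine and the Python build, and it includes interpreter overhead per packet, so treat it as a rough comparison point for the hash alone rather than a prediction of IPsec throughput.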