Category Archives: Linux

Recent Developments in SELinux Kernel Performance

Introduction
This article covers some recent changes in the SELinux kernel code, including a performance patch from Kaigai Kohei of NEC and related updates to the selinuxfs API. Currently, these changes are waiting in the -mm tree for merging into Linus’ kernel in 2.6.10, and are also available in the development Fedora and RHEL4 Beta kernels. This article should be useful to sysadmins looking to understand SELinux performance, and more generally to the curious.


Background
Until now, SELinux kernel development has focused primarily on getting the code merged into the upstream kernel and on preparing for integration into mainstream Linux distributions. This has taken several years: from the initial release in 2000, when SELinux was a patch against the kernel, through the port to the LSM framework and the upstream merge, to the integration work performed by various distributions including Debian, Red Hat and Gentoo.

During this phase of the project, performance tuning had not received much attention from developers. While the general base performance hit of SELinux is around 7% — which is not too bad — testing on large systems revealed serious scalability issues.

AVC operation and analysis
To understand this, here’s a brief explanation of how the core of the SELinux code works. SELinux uses a series of hooks (via LSM) at strategic points within the kernel. These hooks divert the normal flow of the kernel into a security checking mechanism called the Access Vector Cache (AVC). The AVC contains cached permission checks performed against the system’s security policy. As the working set of permission checks for a given workload is typically quite small, having them cached allows SELinux to scale virtually independently of the size of the security policy. For the vast majority of cases, an SELinux permission check is a quick cache lookup rather than a search of the entire security policy database.
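
To make this concrete, here is a minimal sketch of the idea (illustrative only, not the actual SELinux code; avc_lookup() and avc_compute_and_insert() are hypothetical stand-ins for the real cache lookup and the slow-path policy computation):

#include <linux/types.h>
#include <linux/errno.h>

/* Stand-ins for the real cache lookup and slow-path policy computation;
   each fills in the set of permissions allowed for (ssid, tsid, tclass). */
int avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 *allowed);
int avc_compute_and_insert(u32 ssid, u32 tsid, u16 tclass, u32 *allowed);

static int avc_check_permission(u32 ssid, u32 tsid, u16 tclass, u32 requested)
{
	u32 allowed;

	/* Fast path: the decision is almost always already cached. */
	if (avc_lookup(ssid, tsid, tclass, &allowed) < 0) {
		/* Miss: search the security policy and cache the result. */
		if (avc_compute_and_insert(ssid, tsid, tclass, &allowed) < 0)
			return -ENOMEM;
	}

	return (allowed & requested) == requested ? 0 : -EACCES;
}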

This sounds great, except that the AVC was protected by a global spinlock. Additionally, because SELinux hooks can be called from hard interrupt context, interrupts had to be disabled and the interrupt state saved before taking the lock, then restored on release (i.e. the spin_lock_irqsave/restore() variant was used). This is quite expensive, and the lock showed up clearly in profiling runs.
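
Roughly, the pre-RCU read path looked like the sketch below (simplified, not the exact SELinux code; the node layout, hash function and bucket count are illustrative): every single lookup paid for an interrupt-disabling lock acquisition and release around the hash chain walk.

#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/types.h>

#define AVC_CACHE_SLOTS	512			/* hash buckets (illustrative) */

struct avc_node {
	u32 ssid, tsid;				/* source and target SIDs */
	u16 tclass;				/* object class (file, socket, ...) */
	u32 allowed;				/* cached access vector */
	struct list_head list;			/* hash chain linkage */
};

static DEFINE_SPINLOCK(avc_lock);		/* the single global AVC lock */
static struct list_head avc_slots[AVC_CACHE_SLOTS]; /* initialized at boot */

/* Simplified pre-RCU lookup: disable interrupts, lock, walk the chain. */
static struct avc_node *avc_lookup_locked(u32 ssid, u32 tsid, u16 tclass)
{
	struct avc_node *node, *ret = NULL;
	unsigned long flags;
	u32 hvalue = (ssid ^ tsid ^ tclass) & (AVC_CACHE_SLOTS - 1);

	/* irqsave variant: the hooks may be called from interrupt context. */
	spin_lock_irqsave(&avc_lock, flags);
	list_for_each_entry(node, &avc_slots[hvalue], list) {
		if (node->ssid == ssid && node->tsid == tsid &&
		    node->tclass == tclass) {
			ret = node;
			break;
		}
	}
	spin_unlock_irqrestore(&avc_lock, flags);

	return ret;
}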

Further analysis revealed that it was trivially possible to generate hundreds of thousands of AVC lookups a second on reasonably modest hardware (e.g. running ls -lR /usr twice), and the atomic operations associated with spinlocks alone were clearly hurting SMP performance.

Solution: RCU
A few attempts had been made at “fixing” or getting rid of the AVC lock, using reader/writer lock variants and per-cpu techniques. Neither worked very well. Kaigai’s patch, however, utilized RCU, a mechanism which allows lockless reads and the elimination of atomic operations in read paths. As the AVC is overwhelmingly read-only on a typical system, RCU turned out to be an ideal method for removing the AVC lock completely.
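 
Using the same simplified structures as the sketch above, the RCU-converted read path looks roughly like this (again illustrative rather than the actual patch): the lock and the atomic operations disappear from the lookup entirely, while the rare updates still serialize on a spinlock and defer frees until a grace period has elapsed.

#include <linux/rcupdate.h>
#include <linux/rculist.h>

/* Simplified RCU read path: no lock, no atomic ops, no IRQ disabling. */
static struct avc_node *avc_lookup_rcu(u32 ssid, u32 tsid, u16 tclass)
{
	struct avc_node *node, *ret = NULL;
	u32 hvalue = (ssid ^ tsid ^ tclass) & (AVC_CACHE_SLOTS - 1);

	rcu_read_lock();
	list_for_each_entry_rcu(node, &avc_slots[hvalue], list) {
		if (node->ssid == ssid && node->tsid == tsid &&
		    node->tclass == tclass) {
			ret = node;
			break;
		}
	}
	rcu_read_unlock();

	return ret;
}

/* Writers (inserts and reclaims) remain comparatively rare: they take a
   spinlock, publish new nodes with list_add_rcu(), unlink with
   list_del_rcu(), and free old nodes only after a grace period,
   e.g. via call_rcu(). */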

With the RCU patch applied, SELinux scalability problems noticed in several benchmarks were eliminated. The benchmark results can be found in this post to lkml. For example, Kaigai measured linear scalability for the dbench benchmark on a 32-way system.

Some minor tradeoffs in baseline performance showed up in some benchmarks, although the scalability win definitely outweighs these (if they are even noticeable to anyone in the real world).

The RCU patch does not solve all of the SELinux performance issues. For example, some networking hooks do not use the AVC and effectively always fall through to security policy lookups, which involves the use of a global spinlock. Work is ongoing in these cases and they should be similarly resolved with caching and RCU reasonably soon.

Further improvements?
During development and analysis, I instrumented the AVC code and enabled some of the internal debugging code to get a better look at what was really going on and to see if there were any further improvements to be made. (Much of this instrumentation is now available via selinuxfs, and is discussed below.) Firstly, I was surprised to see that the AVC seemed to be thrashing. The cache had a maximum size of 410 entries (I don’t know where this specific number comes from :-). When more than this number of entries was needed, the AVC code used an LRU policy to reclaim cache entries in chunks. It seemed that this would surely be bad for performance, given that I was seeing thousands of AVC entries being used in total. Additionally, there were 512 hash buckets for the cache, which meant that at least 102 buckets would not be used. My thought was to increase the maximum cache size to reduce the need to reclaim entries, and then implement a periodic garbage collector (GC) to keep the hash chains short. Shortening the hash chains was potentially important, as some workloads (such as bootup) can create thousands of AVC entries which are then never used again, or are used only very infrequently from then on.

The first step in implementing the GC was to add a last-used timestamp to each AVC entry. This was accomplished by setting a field in the AVC node structure to the current value of jiffies, the global tick count. Then, from a timer, the GC cleanup code would traverse the cache, looking for AVC entries that had not been used for a certain amount of time (really, a number of ticks). As the GC would run extremely infrequently compared to AVC lookups (by many orders of magnitude), the GC run itself should not cause any undue performance problems. And it didn’t, although it turned out that the GC scheme was still somehow hurting performance. The problem turned out to be reading the global jiffies variable on each AVC lookup. It is declared volatile (which kills some possible compiler optimizations), and as a constantly updated global value, reading it very frequently incurs significant overhead on a heavily loaded SMP system.
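
For the record, the abandoned approach looked roughly like the sketch below (building on the earlier structures, with a hypothetical unsigned long last_used field added to the node and an illustrative timeout): each cache hit stamps the node with jiffies, and an infrequent timer-driven scan reaps nodes that have not been touched for a while. The scan itself was cheap; the jiffies read on every single lookup was not.

#include <linux/jiffies.h>
#include <linux/slab.h>

/* Assumes an extra "unsigned long last_used" field in struct avc_node. */

#define AVC_GC_TIMEOUT	(10 * HZ)		/* illustrative: ~10 seconds */

/* On every cache hit, record the current tick count... */
static inline void avc_touch(struct avc_node *node)
{
	node->last_used = jiffies;	/* this per-lookup read was the hidden cost */
}

/* ...and from an infrequent timer, reap nodes that look stale. */
static void avc_gc_scan(void)
{
	struct avc_node *node, *tmp;
	unsigned long flags;
	int i;

	spin_lock_irqsave(&avc_lock, flags);
	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
		list_for_each_entry_safe(node, tmp, &avc_slots[i], list) {
			if (time_after(jiffies, node->last_used + AVC_GC_TIMEOUT)) {
				list_del(&node->list);
				kfree(node);
			}
		}
	}
	spin_unlock_irqrestore(&avc_lock, flags);
}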

So, the GC idea was dropped, and the default value for the maximum size of the AVC cache was increased to allow it to fill up all of its hash buckets. Extensive benchmarking with various values showed 512 to be a good choice.

But what about the cache thrashing? After further analysis, this concern also turned out to be unfounded. Using the instrumentation, I was able to observe that workloads tend to use a surprisingly small working set of AVC entries. Most of the cache entries were stale or infrequently used, so reclaiming them was not generally a problem. For an idle system with SSH and a monitoring script active, as few as 20 or 30 AVC entries might be in the working set, and perhaps that number again when running a simple server workload such as Apache serving static content. It was difficult to find a real workload which used more than a hundred AVC entries in its working set. With the original reclamation code, if the cache was populated with mostly stale entries (e.g. those created during a login or daemon startup) and a new workload was started, there would be an initial hit while the LRU-based system reclaimed enough entries for the new workload, but this cost would then be amortized very quickly over the life of the workload. So, the original design was fine all along. Lesson: be careful about second-guessing NSA code :-)

The new AVC API
One thing that did seem useful, though, was to allow the sysadmin to control the maximum size of the AVC, so that complex, multiple or fast-changing workloads could be accommodated more readily if needed. This is now a tunable exposed via the selinuxfs API.

To view the current value:

# cat /selinux/avc/cache_threshold
512

To set a new value:

# echo 1024 > /selinux/avc/cache_threshold

I would suggest always verifying that this succeeds, as the new setsecparam permission is needed to write to the file.
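
A minimal user-space sketch of setting the threshold and checking the result (hypothetical code, using only the file shown above; the write will fail unless the caller’s domain is granted the setsecparam permission):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/selinux/avc/cache_threshold";
	const char value[] = "1024";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror(path);
		return 1;
	}
	/* Check the write: it is denied unless policy grants setsecparam. */
	if (write(fd, value, strlen(value)) < 0) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	printf("cache_threshold set to %s\n", value);
	return 0;
}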

It’s not much use providing a way to modify the maximum cache size without some way to monitor the operation of the AVC. Two further files have been added under the avc directory:

# cat /selinux/avc/hash_stats
entries: 503
buckets used: 269/512
longest chain: 5

This is really intended to help debug complicated performance problems, although it is worth explaining. The output shows the total number of AVC entries, the hash bucket distribution and the longest hash chain length. Maximum chain lengths should generally be quite short (less than 10 is typical on my systems). Significantly larger values indicate possible performance issues, as the AVC may be spending a lot of time traversing long hash chains on each lookup. In this case, performance may improve by reducing the size of the cache, although it might be worth posting a report to one of the SELinux mailing lists if you see anything odd here.

# cat /selinux/avc/cache_stats
lookups     hits  misses  allocations  reclaims  frees
 933871   928520    5351         5486      1216   5124
 723798   719270    4528         4569       608   4483
1677626  1675464    2162         2180       352   2051
1390072  1388430    1642         1671       576   1744

These are per-cpu statistics for various AVC operations, and hopefully fairly self-explanatory. The lookups field is a good indicator of the load being placed on the AVC. The field to watch is probably reclaims, a cumulative count of reclaimed cache entries. If it keeps climbing, you may need to increase the cache size. Generally, it is normal to see light reclaim activity from time to time.

The avcstat utility
Note that you probably should not just cat this file directly. Use the avcstat utility, available here or in the latest libselinux package (Fedora users can just update to the latest development packages via yum and it will all be there). This utility is similar to tools such as vmstat: it adds up the per-cpu values and displays them in tabular format. You can display continuous updates by giving the number of seconds between updates on the command line:

# avcstat 1
   lookups       hits     misses     allocs   reclaims      frees
   4782948    4769221      13727      14013       2752      13465
        81         81          0          0          0          0
        47         47          0          0          0          0
        53         53          0          0          0          0
^C

The first line always shows the current totals; by default, subsequent lines show values relative to the previous interval. Use the -c switch to show cumulative values instead, if desired.

Conclusions
The use of RCU solved a serious scalability problem with the AVC, thanks to the work of Kaigai and the RCU developers. It seems likely to be a useful approach for dealing with similar problems, and specifically with some of the SELinux networking code as mentioned above. The new AVC API and avcstat utility should prove useful to sysadmins and SELinux developers who need to analyze and tune performance. Feedback from users is welcome. These are new features which have not had wide exposure, and there are obviously going to be many workloads which I’ve not tested. If you have any interesting performance situations, please post some information and output from the AVC API to one of the SELinux mailing lists.

Quantum Money

I’ve been trying to read Introduction to Quantum Computation and Information, a collection of papers intended as an introduction to the field. I can’t claim to understand it, but there are some interesting parts nonetheless.

One is an example of quantum cryptography in the form of Quantum Money, a theoretical idea which could prevent counterfeiting by storing sequences of photons within banknotes, along with a serial number. The photons would be randomly polarized during printing, and the bank would keep a secret record of photon polarizations for each banknote. It would be infeasible to copy the banknotes, as measuring unknown quantum states disturbs the states irreversibly. As the bank knows the quantum states within each banknote, it can determine whether the banknote is a copy. This is a little simplified, but you get the idea.

Anyway, such an idea is not amazing today, but the theory for this was developed in 1970 or so by Stephen Wiesner, who mysteriously did not publish it until 1983. Apparently his ideas were way ahead of their time and only taken up later, for example by the scientists who published the first known Quantum Key Distribution (QKD) scheme in 1984. The first working QKD system was demonstrated in 1989, and the first network protected by quantum cryptography was deployed only this year.

Devil’s Advocate

Arnaldo is one of the core kernel networking developers who has been heroically modernizing implementations of legacy protocols within the networking stack. Where else can you find fully threaded implementations of DDP, LLC, X.25 and IPX?

So, today, I was a little perturbed to see him connecting to IRC via IPv6, at last. Does this mean that IPv6 is now a legacy protocol? :-)

Blogging directions

Here’s an interesting blog entry: Are you afraid to blog? by Robert Scoble (found via Jim Grisanzio of Sun). It’s a must read for corporate bloggers, although it doesn’t really cover the area of personal blogging, unless perhaps you are trying to market yourself.

I’ve embarked on a few public personal blogs in the past and abandoned them all after a short time. I guess I felt as if I’d shared too much of my own life that was ultimately of no real interest to anyone else anyway. So yes, I guess I am afraid to blog to some extent, on a personal level.

On more general topics such as politics, it’s not that I don’t want to express myself (and I have), it’s more an issue of time and focus. Writing does not come easily to me, and I’m not able to breezily whip up sharp paragraphs on whatever thoughts happen to be lurking in my mind. To write about an issue, I would need to first be sure of my research, then spend significant amounts of time and energy trying to produce an articulate, well-constructed essay. The end result is never great, but it’s usually something I can live with.

But this kind of writing takes away both time and energy that I’d rather currently spend on more technical tasks relating to kernel development and security. I’m continually amazed at how much smarter and more experienced the people around me are, both at Red Hat and in the FOSS communities. Dealing with people with such giant brains all day is a truly humbling experience. To a large extent, for the foreseeable future, I feel that I need to continue to focus on expanding technical skills and knowledge, gaining more experience. This means not doing some other things that also take up a lot of energy, including any really concerted attempt at writing.

So I’m not exactly sure what to expect from this blog. It won’t be too personal, nor will it likely be a vehicle for deep observations and pontifications. (Probably much to the relief of anyone reading this). I don’t really feel like being part of Red Hat marketing, although I can’t guarantee that I won’t mention work specific things that could be construed as such, perhaps correctly.

So what is left to discuss? It’s kind of like Amateur Radio, where the rules in some countries exclude discussions of any real interest. For example, in the US, the FCC rules include:

§97.117 International communications.

Transmissions to a different country, where permitted, shall be made in plain language and shall be limited to messages of a technical nature relating to tests, and, to remarks of a personal character for which, by reason of their unimportance, recourse to the public telecommunications service is not justified.

Well, hopefully it won’t be that bad.

913

I’ve just finished reading Secure Coding: Principles & Practices, by Graff and van Wyk. I thought it started out a little slowly, but by the end of the book they did a really good job of pulling important high-level ideas together into a practical, systematic approach to development from a security point of view. The recommended reading list at the end of the book is good, and I was impressed that they listed my favourite security book first: Ross Anderson’s Security Engineering.

A Linux Journal article I wrote on SELinux and filesystems has been published online. Hopefully this will be useful to people looking for more information on SELinux, which I don’t think is really documented very well yet. One of the issues, I think, is that the people currently working on getting SELinux into shape for general consumption are too busy to do much in the way of documentation, general presentations etc. The article by Faye Coker still seems to be the best general introduction to SELinux.

IPsec performance

I’ve been playing with Jari Ruusu’s Pentium-optimized MD5 code (from loop-aes), looking to integrate it into the kernel Crypto API. The performance improvement looks impressive so far for IPsec. For loopback AH, the performance on my Xeon quadruples, from 10MB/s to 40MB/s. Jari’s code uses only i386 instructions, optimized for the Pentium II. I wonder how much more this could be optimized with instructions & techniques for newer CPUs. The performance improvement for loopback ESP with null encryption and MD5 authentication is about 18%, from 2.7MB/s to 3.2MB/s. I wonder what’s making ESP so slow to start with though. Clearly, ESP header processing is going to add overhead, but this still seems pretty bad. Hmmm.

Better kernel crypto

Looks like we’ll finally have some optimized crypto in the kernel, having just received GPL licensing signoff from the original author of the i586 optimized AES code, Brian Gladman.

In addition to optimized ASM code (of which we need more), we also need support for hardware crypto. This requires a new asynchronous API and an overhaul of much of the calling code (e.g. IPsec). I presented on the topic at the recent Kernel Networking Summit in Oregon (slides).

For anyone interested in helping with development, there’s a new Crypto API mailing list. Some collected requirements can be found in this aging document.

Kernel disk encryption has been problematic: the cryptoloop code is buggy on journaled filesystems, has multiple security problems, and is unmaintained. The newer dm-crypt solution is technically superior, but as it is compatible with cryptoloop encryption, it also has the same security issues. Thankfully, noted cryptographer David Wagner has stepped in to help review the security of our existing code, and has offered to help review future development. Discussion of improved security for dm-crypt is happening on the dm-crypt mailing list.

Michael Halcrow of IBM gave a talk last week at OLS on disk encryption, reviewing existing schemes and then explaining how he plans to extend Cryptfs for use as a Linux desktop disk encryption system. The OLS proceedings don’t seem to be officially online yet, but they should be soon (or ask around for a copy). The Cryptfs work looks promising; hopefully we’ll see it upstream soon.