Locks at s2k11

State of affairs

Most of OpenBSD's kernel still runs under a single lock, ze KERNEL_LOCK(). That includes most of the syscalls, most of the interrupt handlers and most of the fault handlers. Most of them, not all of them, meaning we have collected & fixed bugs while setting up infrastructure and examples. Still, this lock remains the main culprit behind the spin % you can observe in top(1) and systat(1).

top.jpg

I believe that we opted for a difficult hike when we decided to start removing this lock from the bottom. As a result many SCSI & Network interrupt handlers, as well as all Audio & USB ones, can be executed without the big lock. On the other hand very few syscalls are already, or almost, ready to be "unlocked", as we incorrectly say. This explains why basic primitives like tsleep(9), csignal() and selwakeup() are only receiving attention now that the top of the Network Stack is running (mostly) without the big lock.

Next steps

Over the past years, most of our efforts have been invested in the Network Stack. As I already mentioned, it should be ready to be parallelized. However I think we should now concentrate on removing the KERNEL_LOCK(), even if the code paths aren't performance critical. The question is: will that happen as predicted by George?

Dilemma

Taking parts of the kernel out of ze big lock doesn't improve performance. In the short term it might even decrease it, depending on the mechanism used to replace the lock. On the other hand it will improve the overall latency of your system. For example ratchov@ recently figured out that, due to ze big lock, firefox might prevent USB transfers from completing. Crazy, no?

The performance dilemma and our egos led us to believe we could go lock-free from the start. We tried to skip the step of breaking the big lock into smaller locks, at least in the Network Stack. It worked well enough until we reached the insertion of an ARP entry. I am sure this approach can work but I doubt it is the easiest or the fastest.

socket_go_away.jpg

Splitting, splitting, splitting

So if splitting ze big lock is the way to go, where do we start? Here are some ideas.

Basic blocks

tsleep(9) and friends are looking for some love in order to guarantee that no wakeup(9) is lost without ze big lock. The FUTEX_WAIT operation as well as VFS lf_setlock() and lf_purgelocks() are places that could directly benefit from this change.
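
To illustrate what "no lost wakeup" means, here is a minimal userland sketch using POSIX threads. It is only an analogy, not kernel code: struct waitchan and the waitchan_*() helpers are hypothetical names invented for this example. The point it tries to show is that the sleeper checks its condition and goes to sleep while holding the same small lock the waker takes, so the wakeup cannot slip in between the check and the sleep.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical wait channel: one small lock instead of ze big lock. */
struct waitchan {
	pthread_mutex_t	mtx;	/* protects `ready` and the sleep itself */
	pthread_cond_t	cv;
	int		ready;	/* the condition the sleeper waits for */
};

void
waitchan_init(struct waitchan *wc)
{
	pthread_mutex_init(&wc->mtx, NULL);
	pthread_cond_init(&wc->cv, NULL);
	wc->ready = 0;
}

/* Sleeper: the check and the sleep happen under the same mutex. */
void
waitchan_sleep(struct waitchan *wc)
{
	pthread_mutex_lock(&wc->mtx);
	while (!wc->ready)
		pthread_cond_wait(&wc->cv, &wc->mtx);	/* drops mtx while asleep */
	assert(wc->ready);
	pthread_mutex_unlock(&wc->mtx);
}

/* Waker: changes the condition under the mutex, then wakes sleepers. */
void
waitchan_wakeup(struct waitchan *wc)
{
	pthread_mutex_lock(&wc->mtx);
	wc->ready = 1;
	pthread_cond_broadcast(&wc->cv);
	pthread_mutex_unlock(&wc->mtx);
}

static void *
sleeper_thread(void *arg)
{
	waitchan_sleep(arg);
	return arg;
}

/* Runs one sleeper against one waker; returns 1 if the sleeper woke up. */
int
run_sleep_wakeup_once(void)
{
	struct waitchan wc;
	pthread_t t;
	void *ret = NULL;

	waitchan_init(&wc);
	if (pthread_create(&t, NULL, sleeper_thread, &wc) != 0)
		return 0;
	waitchan_wakeup(&wc);	/* may run before or after the sleeper blocks */
	pthread_join(t, &ret);
	return ret == &wc;
}
```

Whether the waker runs before or after the sleeper blocks, the sleeper always terminates, because it never sleeps on a condition it has not re-checked under the lock.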

selwakeup() and csignal() are two beasts that generated a lot of workarounds while unlocking the Network Stack. Recently we found two new problems related to the fact that they still need ze big lock. Dealing with selwakeup() should be relatively easy but csignal() involves playing with many more data structures. In my opinion it's a great starting point for splitting the work that has been started as the proctree lock.

Syscalls

adjfreq(2) and adjtime(2) are ready to be unlocked, just do it!

The KERNEL_LOCK() should be pushed down into read(2) and write(2). This would allow the socket code to run without it. The functions are ready, it's a matter of grabbing the lock on a per-fileops basis.
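
As a rough sketch of what "on a per-fileops basis" could look like, here is a userland model in which a pthread mutex stands in for ze big lock. Everything in it is an assumption made for illustration: struct fileops_sketch, the FOPS_MPSAFE flag and sys_read_sketch() are hypothetical names, not the kernel's actual interfaces.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Userland stand-in for ze big lock; not the real KERNEL_LOCK(). */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
int big_lock_held;		/* only meaningful while big_lock is held */
int big_lock_acquisitions;	/* how many times the lock was taken */

#define FOPS_MPSAFE	0x01	/* hypothetical: handler needs no big lock */

struct fileops_sketch {
	int	flags;
	long	(*fo_read)(void *buf, size_t len);
};

/* Pretend socket read: safe to run without the big lock. */
static long
socket_read(void *buf, size_t len)
{
	(void)buf;
	return (long)len;
}

/* Pretend vnode read: still relies on the big lock being held. */
static long
vnode_read(void *buf, size_t len)
{
	(void)buf;
	assert(big_lock_held);
	return (long)len;
}

const struct fileops_sketch socketops = { FOPS_MPSAFE, socket_read };
const struct fileops_sketch vnops = { 0, vnode_read };

/* The lock is pushed down here and taken only for non-MP-safe fileops. */
long
sys_read_sketch(const struct fileops_sketch *ops, void *buf, size_t len)
{
	long rv;

	if (ops->flags & FOPS_MPSAFE)
		return ops->fo_read(buf, len);

	pthread_mutex_lock(&big_lock);
	big_lock_held = 1;
	big_lock_acquisitions++;
	rv = ops->fo_read(buf, len);
	big_lock_held = 0;
	pthread_mutex_unlock(&big_lock);
	return rv;
}
```

With this shape, each file type can be converted on its own: flipping one flag is enough once its handlers are ready, and the unconverted ones keep working as before.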

12 network syscalls could be unlocked. They should be ready.

pipe(2) and pipe2(2) are ready to be unlocked. That pushes the KERNEL_LOCK() around km_alloc(9). That's a great place to start moving it down into the realm of UVM. Once ze big lock is also taken on a per-fileops basis, it should be easy to remove it when writing & reading on pipes :)

mmap(2) has received some love such that ze big lock shouldn't be required when mapping anons. The underlying code is shared by sigaltstack(2) so debugging one or the other should hopefully help make progress.

Thanks to the work done to unlock sockets, it should be easy to push the KERNEL_LOCK() into ioctl(2). That should allow developers to test the parallelism of their drivers. This would directly benefit network and SCSI devices as well as vmm(4).

Subsystems

The tty(4) subsystem is certainly the next interesting middle-size item. Like the Network Stack it contains both hard and soft interrupts. However it has fewer drivers. I would suggest following the Network Stack design and moving the softtty handler to a thread. That would keep some consistency and let us share fixes and improvements amongst subsystems.
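
For readers unfamiliar with the pattern, here is a small userland sketch of handing work from a "hard interrupt" to a dedicated thread. It is only a model built on POSIX threads: struct softq, hardintr_enqueue() and soft_thread() are hypothetical names, and the real tty(4) code looks nothing like this.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define RING_SIZE	64

/* Hypothetical queue shared by the "hard interrupt" and the soft thread. */
struct softq {
	pthread_mutex_t	mtx;		/* small lock, not ze big lock */
	pthread_cond_t	cv;
	unsigned char	ring[RING_SIZE];
	size_t		head, tail;	/* head: next write, tail: next read */
	int		done;		/* set when no more input will come */
	size_t		processed;	/* characters handled by the thread */
};

/* "Hard interrupt" side: queue one character, never block for long. */
int
hardintr_enqueue(struct softq *q, unsigned char c)
{
	size_t next;
	int ok = 0;

	pthread_mutex_lock(&q->mtx);
	next = (q->head + 1) % RING_SIZE;
	if (next != q->tail) {		/* drop the character if the ring is full */
		q->ring[q->head] = c;
		q->head = next;
		ok = 1;
		pthread_cond_signal(&q->cv);
	}
	pthread_mutex_unlock(&q->mtx);
	return ok;
}

/* "Soft interrupt" side, now a thread: drain the ring until told to stop. */
static void *
soft_thread(void *arg)
{
	struct softq *q = arg;

	pthread_mutex_lock(&q->mtx);
	for (;;) {
		while (q->head == q->tail && !q->done)
			pthread_cond_wait(&q->cv, &q->mtx);
		if (q->head == q->tail)
			break;		/* drained and told to stop */
		/* A real handler would process the character outside the lock. */
		q->tail = (q->tail + 1) % RING_SIZE;
		q->processed++;
	}
	pthread_mutex_unlock(&q->mtx);
	return NULL;
}

/* Queue n characters and return how many the thread processed. */
size_t
run_softq_demo(size_t n)
{
	struct softq q;
	pthread_t t;
	size_t i, queued = 0;

	memset(&q, 0, sizeof(q));
	pthread_mutex_init(&q.mtx, NULL);
	pthread_cond_init(&q.cv, NULL);
	if (pthread_create(&t, NULL, soft_thread, &q) != 0)
		return 0;
	for (i = 0; i < n; i++)
		queued += hardintr_enqueue(&q, (unsigned char)i);
	pthread_mutex_lock(&q.mtx);
	q.done = 1;
	pthread_cond_broadcast(&q.cv);
	pthread_mutex_unlock(&q.mtx);
	pthread_join(t, NULL);
	assert(q.processed == queued);
	return q.processed;
}
```

The design choice being illustrated is that the thread can sleep and take ordinary locks, which a soft interrupt handler cannot do as freely.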

The usb(4) Stack is another subsystem with similar challenges. It shares the properties of the tty(4) subsystem and Network Stack, however it might be trickier to deal with since multiple devices have a foot in other stacks.

Pushing ze big lock into vfs(9) is also a nice next step now that the file descriptor layer has received some love. The starting place is similar to pipe(2) and ioctl(2) described above.

Fault handlers

The trivial part is uvm_grow().

The easy part is related to signals. It should be an easier starting point than csignal() because trapsignal() already has a reference to a live process.

Then comes uvm_fault(9) which could be done fairly gradually. It is in my opinion a nice way to push ze lock into UVM and should be a funky alternative to mmap(2) to expose remaining bugs.

Other roads?

The ideas presented above are just examples of what could be done. There are of course many other possibilities. In my opinion it is better to get started with small changes, but it's up to whoever does the work. Every function out of the KERNEL_LOCK() helps. So, let's get started?