A story of if_get(9)
During the l2k15 hackathon, we enjoyed sprinkling if_get(9) calls all over the network stack. More recently we started removing them, and I even briefly explained why we should try to avoid using this API. All of this can be confusing, so let me tell you a story. A story of garbage collection inside the kernel.
We have a problem!
When you detach a USB network interface or destroy a pseudo-interface on a machine running OpenBSD, the kernel realizes the hardware is gone and frees all the memory previously allocated for the device to work. This might include some in-flight network packets. As you might know, in 4.4BSD, and in OpenBSD until 2014, mbuf(9) packet headers carried a pointer to the network interface descriptor they were received on.
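A rough sketch of that historical layout, with stub types rather than the real sys/mbuf.h definitions:

```c
#include <stddef.h>

/* Stub: the network interface descriptor (struct ifnet in the kernel). */
struct ifnet;

/* Simplified packet header: pre-5.9, it carried a raw pointer to the
 * interface the packet was received on.  If that interface was detached
 * and freed, this pointer became dangling. */
struct pkthdr {
	struct ifnet *rcvif;	/* receive interface */
};

/* Trivial check that the sketch compiles and behaves as described. */
static int
demo(void)
{
	struct pkthdr ph = { .rcvif = NULL };
	return ph.rcvif == NULL;
}
```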
But since OpenBSD 5.9 some parts of the network stack run in parallel with the rest of the kernel. So if we had kept that pointer, the kernel would have had to garbage collect in-flight packets when an interface is detached. That would imply not processing traffic for a short period of time. This is totally acceptable but, at that time, it was complicated to implement: part of the network stack was still running in interrupt context, so we couldn't rely on primitives needing a process context, and we weren't ready to pay the cost of grabbing locks per packet.
We also wanted to build a solution that could be used for any data structure holding an interface pointer. Nowadays we're also using this solution for route entries, some pseudo-drivers and many multicast data structures.
A classical reference problem
Since we didn't want to serialize multiple contexts, we could have gone for reference counters. But modifying all the code manipulating interface pointers to add reference counting is a huge task, and making it correct is even harder. My experience debugging such problems with route entries is that you never get useful information in the bug reports, as the leak or double free always manifests somewhere else.
The way we solved this problem is with another level of indirection. In other words, we store the interface pointers in an array and keep the corresponding array index in all items with a different lifetime.
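A hypothetical miniature of this indirection, not the real kernel implementation: a table maps small integer indexes to interface descriptors, consumers store only the index, and a lookup returns NULL once the interface has been detached.

```c
#include <stddef.h>

/* Stub interface descriptor. */
struct ifnet {
	unsigned int if_index;
};

#define IF_TABLE_SIZE 16

/* The extra level of indirection: index -> descriptor.
 * Detaching an interface clears its slot. */
static struct ifnet *if_table[IF_TABLE_SIZE];

/* Resolve an index; returns NULL if the interface is gone, so stale
 * indexes stored in packets or routes can never be dangling pointers. */
static struct ifnet *
lookup(unsigned int index)
{
	if (index >= IF_TABLE_SIZE)
		return NULL;
	return if_table[index];
}

static int
demo(void)
{
	static struct ifnet ifp = { .if_index = 3 };

	if_table[3] = &ifp;
	/* A valid index resolves; an empty or out-of-range one does not. */
	return lookup(3) == &ifp && lookup(5) == NULL && lookup(99) == NULL;
}
```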
The array index for a given interface is known as the interface index and can be seen with ifconfig(8):
$ ifconfig ix0
ix0: flags=8843&lt;UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST&gt; mtu 1500
        lladdr dc:f0:39:00:49:ad
        index 3 priority 0 llprio 3
        ...
When I started to implement this solution, the kernel already had an array mapping indexes to interface pointers. It had been introduced for SNMP, and even the KAME scope hack uses it. So I decided to consolidate all of this into a common API, and if_get(9) was born.
Converting the code manipulating mbuf(9) from using a pointer to an index wasn't a small task. So I used a small trick to be able to do the conversion in small, reviewable steps:
#define ph_ifidx rcvif->if_index
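To see why this macro enables an incremental conversion, here is a tiny self-contained illustration (stub structs, not the real sys/mbuf.h): code already rewritten to use the new ph_ifidx name keeps compiling while the struct still holds the old rcvif pointer.

```c
/* Stub types standing in for the real kernel headers. */
struct ifnet {
	unsigned int if_index;
};

struct pkthdr {
	struct ifnet *rcvif;	/* old field: raw interface pointer */
};

/* The trick: during the conversion, the new field name expands to a
 * path through the old pointer, so converted call sites compile
 * unchanged against the not-yet-converted struct. */
#define ph_ifidx rcvif->if_index

static int
demo(void)
{
	struct ifnet ifp = { .if_index = 3 };
	struct pkthdr ph = { .rcvif = &ifp };

	/* "New style" access, transparently reading through the pointer. */
	return ph.ph_ifidx == 3;
}
```

Once every call site used ph_ifidx, the macro and the rcvif pointer could be replaced by a real index field in one final, mechanical step.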
At l2k15 dlg@ introduced if_put(9), a function to release the reference obtained with if_get(9). We then converted all the relevant code paths to this new API. To keep the conversion simple I insisted that every function doing an if_get(9) be responsible for doing the corresponding if_put(9). This allowed jsg@ and his static-analysis tools to check whether a code path was missing a call.
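The discipline looks roughly like this. Note this is a hypothetical userland mock using a plain counter; the real if_get(9)/if_put(9) are built on srp(9) and per-CPU references, and mock_if_get/mock_if_put are invented names for illustration.

```c
#include <stddef.h>

struct ifnet {
	unsigned int if_index;
	unsigned int if_refcnt;
};

#define IF_TABLE_SIZE 8
static struct ifnet *if_table[IF_TABLE_SIZE];

/* Resolve an index and take a reference, or return NULL if the
 * interface has been detached. */
static struct ifnet *
mock_if_get(unsigned int index)
{
	struct ifnet *ifp;

	if (index >= IF_TABLE_SIZE || (ifp = if_table[index]) == NULL)
		return NULL;
	ifp->if_refcnt++;
	return ifp;
}

/* Release the reference; accepting NULL keeps callers simple. */
static void
mock_if_put(struct ifnet *ifp)
{
	if (ifp != NULL)
		ifp->if_refcnt--;
}

/* The pattern reviewers and static analysis look for:
 * one put per get, in the same function. */
static int
demo(void)
{
	static struct ifnet ifp = { .if_index = 2, .if_refcnt = 0 };
	struct ifnet *p;

	if_table[2] = &ifp;
	p = mock_if_get(2);
	if (p == NULL)
		return 0;
	/* ... use p safely: it cannot be freed while we hold it ... */
	mock_if_put(p);
	return ifp.if_refcnt == 0;
}
```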
Even if this solution seemed easier to us than the alternatives, that doesn't mean it was easy. Some bugs and races are still waiting for you! For example at g2k16 awolk@ found, by extending delays, some cases we didn't get right.
We also need to keep in mind that, being based on srp(9) (shared reference pointers), if_get(9) has a cost. It's a small cost, but it is still noticeable. When every CPU cycle matters, we should avoid calling this function more than necessary. Since we aimed for an easy transition, we did not change the original code much; that's why we introduced almost as many if_get(9) calls as there were pointer dereferences before. But now we can be clever and remove many of them.
Finally, more network data structures will benefit from this solution. Not now, but as soon as we move more parts of the stack out of the NET_LOCK().