ART single thread performances
By mpi on Friday 17 June 2016, - Network - Permalink
ART has been the default routing table backend in OpenBSD for some months now. That means that OpenBSD 6.0 will no longer consult the 4.3 BSD reduced radix tree to perform route lookups.
The principal motivation for adopting a new tree implementation can be explained in three letters: SMP.
I'll describe in a different context why and how ART is a good fit in our revamp of OpenBSD network stack. For the moment, let's have a look at the single-thread performances of this algorithm in OpenBSD -current.
Performances
ART is a free multibit trie based routing table. To keep it simple, it can be seen as using more memory for fewer CPU cycles. In other words, we get a faster lookup by wasting memory. The original paper presents some performance comparisons between two ART configurations and the BSD Radix. But how does this apply to OpenBSD?
I asked Hrvoje Popovski to run his packet forwarding test on his Xeon box (E5-2620 v2 @ 2.10GHz, 2400.34 MHz) with ix(4) (82599) interfaces. The test setup consist of three machines with the OpenBSD box in the middle:
The machine is configured with:
kern.pool_debug=0
kern.maxclusters=32768
net.inet.ip.forwarding=1
net.inet.ip.ifq.maxlen=8192
The simulations have been performed with an OpenBSD -current from June 9th. The machine is configured with pf(4) disabled in order to force a single route lookup for every IPv4 packet. Based on the result of the lookup the kernel decide if it should forward, deliver or drop the packet:
In order to measure the impact of a different routing table configurations we need to send enough packets to make sure the machine is CPU bound but not live locked. In this case it happens when sending 800 Kpps. During a test the sender generates packets for all addresses of a given subnet to invalidate any possible cache. Plus to see the difference of various subnets, the test has been reproduced on a /24, a /20 and a /16:
ART 8/4/4/4/4/4/4 is the configuration currently enabled in -current. It has been choosen because it has a memory footprint similar to the BSD Radix for a fullfeed BGP server. The other configurations waste more memory by using bigger stride lengths but, as shown on the graphic, perform faster.
Memory consumption
Sadly this test is not enough to see how much memory can be wasted by different ART configurations because it does not create fragmentation:
So I asked benno@ and phessler@ if I could measure once again the memory consumption of a BGP fullfeed on their test setup:
We can see that using bigger stride lengths result in more memory being wasted.
Conclusion
As expected ART performs faster than the BSD Radix on OpenBSD because it uses more memory. That's hopefully not new to me, so what did I learn during these experiments?
First of all I validated the choice I made two years ago to integrate an implementation supporting variable stride lengths. It is nice to be able to choose at compile time, depending on your usage, how much memory you want to dedicate to improve your route lookup performances. This is important for experimenting but it also matches the fact OpenBSD is a generic purpose operating system and its users have different requirements.
Secondly I've seen that the 8/8/8/8 configuration doesn't offer any advantage compared to the 16/8/8 one. The only problem is that it doesn't seem to be currently possible to allocate the 16-heap in a MP-safe way on OpenBSD, something to look at in the future?
Finally the in-kernel memory consumed by the routing table of a BGP server is around 130M in -current and would grow to 180M with a 16/8/8 configuration. Even if route lookup performances are not critical for such setups, I'd argue that consuming 180M is already acceptable today, in case we decide to crank the defaults in the upcoming years.