Observation

Hrvoje Popovski was kind enough to help me with some tests. He first confirmed that on his Xeon box (E5-2643 v2 @ 3.50GHz, 3500.44 MHz), forwarding performance without pf(4) dropped from 1.42Mpps to 880Kpps when using vlan(4) on both interfaces.

Setup

So I asked him to profile the kernel while running such tests. With the data he collected, I generated the graphs below using gprof2dot. Each graph corresponds to what a CPU was executing while the machine was forwarding packets, from left to right CPU0 to CPU5.

Plain ix(4)

[Per-CPU profiling graphs, CPU0 to CPU5]

As you can see, in the current architecture the Network Stack already uses more than one CPU. CPU0 deals with interrupts, empties the network interface rings and passes packets to CPU1 via if_input(). CPU1 then processes the packets in a process context and finishes by enqueuing them on the network interface rings via if_enqueue().
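To make this handoff more concrete, here is a minimal sketch of the interrupt side of that architecture, loosely modeled on how OpenBSD drivers batch received packets. Only if_input() and the mbuf_list helpers are the real kernel API; mydrv_rxeof() and mydrv_next_rx_packet() are names made up for the example.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <net/if.h>
#include <net/if_var.h>

struct mbuf *mydrv_next_rx_packet(void *);	/* hypothetical ring helper */

void
mydrv_rxeof(void *ring, struct ifnet *ifp)
{
	struct mbuf_list ml = MBUF_LIST_INITIALIZER();
	struct mbuf *m;

	/* interrupt context, CPU0: drain the hardware receive ring */
	while ((m = mydrv_next_rx_packet(ring)) != NULL)
		ml_enqueue(&ml, m);

	/*
	 * Hand the whole batch to the stack.  if_input() queues the
	 * packets and schedules the processing task, which later runs
	 * in process context (CPU1 in the graphs above) and ends up in
	 * ether_input(), the IP code and finally if_enqueue().
	 */
	if_input(ifp, &ml);
}

Note that in this path the queue operation between the two contexts happens once per batch; the vlan(4) discussion further down is about paths where it happens once per packet.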

With vlan(4)

[Per-CPU profiling graphs, CPU0 to CPU5]

From a userland point of view

Let's start with some low-hanging fruit. First of all, we can see the effect of the recent preempt change: only CPU4 in the first case and CPU5 in the second case are executing syscalls. If you have already seen similar graphs from me, you certainly remember that userland processes used to jump between CPUs.

We also see that my favorite scheduler hack is working as intended: CPU0 is not executing any syscall, even though it has enough free time to do so.

The lost percents

When Hrvoje did his first measurements, he noticed a drop of ~50% in forwarded packets. However, when doing some profiling, he noticed that without vlan(4) the machine could only forward 730Kpps. So GPROF profiling has a huge overhead of its own, as you can see on CPU1.

[Profiling graph showing the GPROF overhead on CPU1]

With vlan(4), the machine could only forward 490Kpps. That's roughly 2/3 of the measured plain performance.

Now let's assume that the number of packets forwarded per second is mostly bound by the amount of work CPU1 can do to process those packets. In that case we are looking for functions where CPU1 spends between 1/3 and 1/2 of its processing time in the vlan(4) case: 1/3 corresponds to the drop measured while profiling, 1/2 to the drop Hrvoje measured without it. Unsurprisingly, we can guess that they are the following.

[Profiling graphs for vlan_input() and vlan_start()]

Together, vlan_input() and vlan_start() represent 25% of the time CPU1 spends processing packets. This is not exactly between 33% and 50%, but it is close enough; the assumption we made earlier is certainly too simple. If we compare the amount of work done in process context, represented by if_input_process(), we clearly see that half of the CPU time is not spent in ether_input().

[Profiling graph for if_input_process()]

I'm not sure how this is related to the measured performance drop. It is actually hard to tell, since packets are currently being processed in three different contexts. One of the arguments mikeb@ raised when we discussed moving everything into a single context is that it would be simpler to analyse and, hopefully, to make scale. To give you more information about the complexity of the current architecture, I asked Hrvoje for the values of some counters. For the duration of the measurements we got the following.

Plain ix(4)

kern.netlivelocks=28
net.inet.ip.ifq.drops=42366

With vlan(4)

kern.netlivelocks=256
net.inet.ip.ifq.drops=1

It is not surprising that the netlivelocks counter is higher in the vlan(4) case. Since the kernel is spending more time processing packets, maybe too much time, we can expect the livelock mitigation technique to reduce the size of the network interface rings. This has a domino effect and reduces the number of forwarded packets even further.
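To give an idea of how that mitigation plugs into a driver, here is a rough sketch of a receive-ring refill loop using the if_rxr accounting API. The mydrv_* names and the watermark values are invented for the example; only the if_rxr_* calls are the real interface, and the point is simply that the stack hands out fewer ring slots once livelocks are detected.

#include <sys/param.h>
#include <net/if.h>
#include <net/if_var.h>

struct mydrv_softc {
	struct if_rxring	 sc_rxring;	/* rx ring accounting */
};

int	mydrv_post_rx_buffer(struct mydrv_softc *);	/* hypothetical */

void
mydrv_ring_init(struct mydrv_softc *sc)
{
	/* let the stack hand out between 4 and 256 rx descriptors */
	if_rxr_init(&sc->sc_rxring, 4, 256);
}

void
mydrv_rxrefill(struct mydrv_softc *sc)
{
	u_int slots;

	/*
	 * Ask how many descriptors we may give to the chip.  Once
	 * livelocks have been detected this number shrinks, so excess
	 * packets are dropped by the hardware instead of eating CPU
	 * time, at the cost of forwarding fewer packets.
	 */
	for (slots = if_rxr_get(&sc->sc_rxring, 256); slots > 0; slots--) {
		if (mydrv_post_rx_buffer(sc) != 0)
			break;
	}

	/* give back the slots we could not fill */
	if_rxr_put(&sc->sc_rxring, slots);
}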

The fact that the number of ifq drops is higher in the plain ix(4) case tells us where packets are dropped. And indeed, both counters fit into the global picture.

[Diagram of the packet path showing where packets are dropped in each setup]

Conclusion

With some measurements, a couple of nice pictures, a bit of analysis and some educated guesses, we are now in a position to say that the performance impact observed with vlan(4) is certainly due to the pseudo-driver itself. A decrease of 30% to 50% is not what I would expect from such a pseudo-driver.

I originally heard that the reason for this regression was the use of SRP, but looking at the profiling data, it seems to me that the queuing API is the problem. In the graphs above, the CPU time spent in if_input() and if_enqueue() from vlan(4) is impressive. Remember, in the case of vlan(4) these operations are done per packet!

When if_input() was introduced, the queuing API did not exist and putting/taking a single packet on/from an interface queue was cheap. Now it requires a mutex per operation, which in the case of packets received and sent on vlan(4) means grabbing three mutexes per packet.
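To illustrate that cost, here is a rough sketch of the pattern, with _sketch names to make clear it is not the kernel's actual mbuf_queue implementation: every single enqueue and dequeue is serialized by a mutex.

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/mutex.h>

struct mq_sketch {
	struct mutex		 mq_mtx;	/* taken for every operation */
	struct mbuf_list	 mq_list;
	u_int			 mq_maxlen;
	u_int			 mq_drops;
};

int
mq_sketch_enqueue(struct mq_sketch *mq, struct mbuf *m)
{
	int dropped = 0;

	mtx_enter(&mq->mq_mtx);		/* one mutex per packet in... */
	if (ml_len(&mq->mq_list) < mq->mq_maxlen) {
		ml_enqueue(&mq->mq_list, m);
	} else {
		mq->mq_drops++;
		dropped = 1;
	}
	mtx_leave(&mq->mq_mtx);

	if (dropped)
		m_freem(m);

	return (dropped);
}

struct mbuf *
mq_sketch_dequeue(struct mq_sketch *mq)
{
	struct mbuf *m;

	mtx_enter(&mq->mq_mtx);		/* ...and one per packet out */
	m = ml_dequeue(&mq->mq_list);
	mtx_leave(&mq->mq_mtx);

	return (m);
}

A packet received and sent on vlan(4) goes through such an operation three times, hence the three mutexes per packet mentioned above.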

I still can't say if my analysis is correct or not, but at least it could explain the decrease observed by Hrvoje when testing multiple vlan(4) configurations.

[Graph of forwarded packet rates for the tested vlan(4) configurations]

vlan_input() takes one mutex per packet and decreases the number of forwarded packets by ~100Kpps on this machine, while vlan_start(), taking two mutexes, decreases it by ~200Kpps.

So, what's the age of the captain?