Observation

Hrvoje Popovski repeated the measurements he did two years ago, using a profiling kernel to see where CPU cycles are spent. I'm publishing here data gathered on the same Xeon box (E5-2643 v2 @ 3.50GHz, 3500.44 MHz):

[Figure: throughput while profiling]

In the graph above, the number of forwarded packets per second is relatively small because almost 2/3 of the cycles are spent on the profiling itself. However, it is interesting to see that this number dropped from 880Kpps to 535Kpps for plain forwarding compared to the measurements from two years ago. This is consistent with the other performance degradations observed after applying the CPU workarounds.

So yes, dlg@'s change did considerably improve the number of forwarded packets per second when the vlan(4) driver is used. But where are the CPU cycles saved? And what explains the difference with plain ix(4) forwarding?

On a single CPU

From left to right, plain ix(4), then vlan(4) before and after the change:

[Figures: single-CPU profiles of plain ix(4), vlan(4) before the change, and vlan(4) after the change]

First of all, the shape of the profiled data when vlan(4) is used is now similar to the plain ix(4) one. The difference appears below ether_output(), which matches what we expected:

[Figures: ether_output() call tree before and after the change]

Another indicator is the number of cycles spent processing packets vs. receiving packets. In this single-CPU configuration, it is easier to correlate the number of forwarded packets per second with the time spent in the interrupt handler, modulo the drops.

[Figures: network taskq CPU time and interrupt handler CPU time]

Current bottleneck

So the if_enqueue() change did improve the situation; what's next? The first difference that appears in the graph is vlan_input(). It consumes 2.52% of CPU time, roughly the amount of time we are not spending in the interrupt handler compared to the plain scenario. Most of this time, 1.71%, is spent enqueueing the packet again, with mutexes and SPLs, in ifiq_input()... Déjà vu?
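
To picture the déjà vu, here is a toy user-space model of the pattern. It is not the actual sys/net/if_vlan.c code and the structures are made up; it only illustrates that each packet pays for one mutex-protected enqueue in the driver's receive path and a second one when the vlan layer feeds it to the pseudo-interface's input queue. Compile it with cc -pthread.

/*
 * Toy model of the double enqueue seen in the profile.  This is NOT
 * OpenBSD code: "rxq" stands in for the mutex-protected input queue
 * that ifiq_input() manipulates in the real kernel.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

struct pkt {
	struct pkt	*next;
	int		 vlan_tag;	/* 0 once the tag has been stripped */
};

struct rxq {
	pthread_mutex_t	 mtx;
	struct pkt	*head;
	unsigned long	 locked_ops;	/* lock round-trips paid so far */
};

static void
rxq_enqueue(struct rxq *q, struct pkt *p)
{
	pthread_mutex_lock(&q->mtx);
	p->next = q->head;
	q->head = p;
	q->locked_ops++;
	pthread_mutex_unlock(&q->mtx);
}

static struct pkt *
rxq_dequeue(struct rxq *q)
{
	struct pkt *p;

	pthread_mutex_lock(&q->mtx);
	p = q->head;
	if (p != NULL)
		q->head = p->next;
	q->locked_ops++;
	pthread_mutex_unlock(&q->mtx);
	return p;
}

int
main(void)
{
	static struct rxq parent = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };
	static struct rxq vlan = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };
	static struct pkt pkts[1000];
	struct pkt *p;
	int i;

	for (i = 0; i < 1000; i++) {
		pkts[i].vlan_tag = 10;
		rxq_enqueue(&parent, &pkts[i]);	/* paid in the interrupt handler */
	}
	while ((p = rxq_dequeue(&parent)) != NULL) {
		p->vlan_tag = 0;		/* the vlan layer strips the tag... */
		rxq_enqueue(&vlan, p);		/* ...and enqueues the packet again */
	}
	printf("locked queue operations for 1000 packets: %lu\n",
	    parent.locked_ops + vlan.locked_ops);
	return 0;
}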

Another noticeable difference is if_get(9). It seemed weird to me that both vlan_enqueue() and vlan_transmit() needed to call this function, accounting for 1/3 of the calls each. The reason for that was a hack that has already been removed:

CVSROOT:        /cvs
Module name:    src
Changes by:     dlg@cvs.openbsd.org     2019/01/23 16:17:25

Modified files:
        sys/net        : if_vlan.c 

Log message:
remove special casing for IFT_MPLSTUNNEL now mpw is IFT_ETHER.

So, more performance by unifying drivers? That's the spirit!

Plain forwarding on a 6 CPU machine

Analysing global profiling data gathered on multiple CPUs is a challenge. Let's start without vlan(4):

[Figures: per-CPU profiles, CPU0 through CPU5, plain forwarding]

CPU0 is handling interrupts full time, and CPU3 is processing packets full time as well. The others are either idle or doing some userland fun. Both busy CPUs are executing code for 2/3 of their time, so a simple multiplication gives us the average cost, in percent, of a specific operation. We already know that we are bounded by the amount of work done on CPU3, so that's where we want to look.

In this configuration, with pf(4) disabled, we can see that the route lookup has a cost of ~7.5% of the whole packet processing. That cost is also present with pf(4) enabled, but we could get it back by caching the route entry in the PF state.
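
One way to picture that caching idea is the toy sketch below. Every name in it (pf_state_model, rtable_gen, route_lookup) is made up and this is not a pf(4) diff; the real change would also have to hold a proper reference on the rtentry and perform a rtisvalid()-style check before reusing it. The point is only that a state which remembers its route, together with a generation number bumped on routing changes, turns the per-packet lookup into a one-time cost.

/*
 * Toy sketch of caching the route in the state; every name here is
 * made up and this is not a pf(4) diff.
 */
#include <stddef.h>
#include <stdio.h>

struct rt_entry {			/* stand-in for struct rtentry */
	const char	*ifname;
};

static struct rt_entry the_route = { "ix1" };
static unsigned int rtable_gen = 1;	/* bumped on every routing change */
static unsigned long lookups;

static struct rt_entry *
route_lookup(void)
{
	lookups++;			/* the ~7.5% we would like to avoid */
	return &the_route;
}

struct pf_state_model {
	struct rt_entry	*cached_rt;
	unsigned int	 cached_gen;
};

static struct rt_entry *
state_route(struct pf_state_model *st)
{
	/* redo the lookup only if we never did it or a route changed */
	if (st->cached_rt == NULL || st->cached_gen != rtable_gen) {
		st->cached_rt = route_lookup();
		st->cached_gen = rtable_gen;
	}
	return st->cached_rt;
}

int
main(void)
{
	struct pf_state_model st = { NULL, 0 };
	int i;

	for (i = 0; i < 1000; i++)
		(void)state_route(&st);	/* 1 lookup, 999 cache hits */

	rtable_gen++;			/* a route changed: cache invalidated */
	(void)state_route(&st);

	printf("route lookups for 1001 packets: %lu\n", lookups);
	return 0;
}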

Another question related to ix(4) is why ixgbe_encap() itself has a similar cost. Is it because of the atomic operations?
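
To get a feel for that speculation, here is a small user-space microbenchmark; nothing in it is ix(4) code, it simply compares a plain decrement with an atomic one, which on amd64 is a lock-prefixed read-modify-write.

/*
 * Minimal user-space microbenchmark, nothing ix(4)-specific: compare
 * a plain decrement with an atomic decrement to get an order of
 * magnitude for the cost of the atomic operations.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define N 100000000UL

static double
elapsed(struct timespec a, struct timespec b)
{
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int
main(void)
{
	static volatile long plain = N;	/* volatile keeps the loop honest */
	static atomic_long shared = N;
	struct timespec t0, t1, t2;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < N; i++)
		plain--;			/* regular load/decrement/store */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (i = 0; i < N; i++)
		atomic_fetch_sub(&shared, 1);	/* lock-prefixed on amd64 */
	clock_gettime(CLOCK_MONOTONIC, &t2);

	printf("plain decrement : %.2f ns/op\n", elapsed(t0, t1) / N * 1e9);
	printf("atomic decrement: %.2f ns/op\n", elapsed(t1, t2) / N * 1e9);
	return 0;
}

It does not prove anything about ixgbe_encap(), but it gives a ballpark figure to weigh against the percentages in the graph.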

Finally, it would be great to see how our stack scales in that scenario by using multiple threads to process packets. This should already work in this configuration when pf(4) is disabled.

Vlan forwarding on a 6 CPU machine

Let's look at what happens with a vlan(4) after the if_enqueue() change:

[Figures: per-CPU profiles, CPU0 through CPU5, vlan(4) forwarding after the change]

Once again CPU0 is handling interrupts and, thankfully, the same happens as in the previous scenario. For some reason the network thread moved between CPU2 and CPU3; this happens when another thread gets scheduled on the same CPU. It only makes our analysis harder, because CPU2 executed and profiled the thread for 10% of its time and CPU3 for 90%.

We already spoke about vlan_transmit() calling if_get(9), but what isn't obvious is that refcnt_rele_wake(9) is in fact an if_put(9). If you look at the numbers, they match :) So we can speculate that dlg@'s change related to the conversion of mpw(4) improved performance by ~2.5%. It is also interesting to see that most of the cost related to if_get(9) on amd64 comes from the reference counting and not from the SRPs.
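
A toy user-space model makes the relationship visible. The model_if_get()/model_if_put() names below are invented and this is not the kernel implementation, but the shape is the same: the lookup hands out a reference on the interface found by index, and releasing that reference is essentially a refcnt_rele_wake(9) on the interface's reference counter, which is why the two show up together in the profile. The atomic operations on that shared counter are also where the reference-counting cost mentioned above comes from.

/*
 * Toy model of the if_get(9)/if_put(9) pattern, not the kernel code:
 * the lookup returns a referenced interface and the release is an
 * atomic decrement plus a wakeup for the last reference, i.e. what
 * shows up as refcnt_rele_wake() in the profile.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct ifnet_model {
	unsigned int	index;
	atomic_uint	refcnt;
};

static struct ifnet_model ifaces[4];	/* stand-in for the interface index map */

static struct ifnet_model *
model_if_get(unsigned int index)
{
	struct ifnet_model *ifp;

	if (index >= 4)			/* the kernel checks the SRP'd map here */
		return NULL;
	ifp = &ifaces[index];
	atomic_fetch_add(&ifp->refcnt, 1);	/* the atomic op if_get() pays */
	return ifp;
}

static void
model_if_put(struct ifnet_model *ifp)
{
	if (ifp == NULL)
		return;
	/* atomic decrement; the real code also wakes up anyone
	 * sleeping until the last reference goes away */
	if (atomic_fetch_sub(&ifp->refcnt, 1) == 1)
		printf("last reference on interface %u dropped\n", ifp->index);
}

int
main(void)
{
	struct ifnet_model *ifp;

	ifaces[2].index = 2;
	atomic_init(&ifaces[2].refcnt, 1);	/* "existence" reference */

	ifp = model_if_get(2);		/* what vlan_transmit() pays for */
	/* ... hand the packet to the parent interface here ... */
	model_if_put(ifp);		/* the refcnt_rele_wake() in the graph */

	printf("refcnt on interface 2 is back to %u\n",
	    atomic_load(&ifaces[2].refcnt));
	return 0;
}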

The ix(4) issue raised previously is also present here. We might speculate a bit more, because both ixgbe_encap() and ixgbe_tx_ctx_setup() appear on the graph with a similar amount of CPU time and they both do an atomic decrement on the same variable. Coincidence?

Conclusion

Data analysis is hard. Why are the proportions of forwarded packets different between SP and MP with and without vlan(4)? How relevant or correct are these measurements? Hard to say.

At least we know that the previous speculations were not completely wrong. We're happy to see that dlg@'s fix is improving the situation. We have an idea about the next bottleneck: the same pattern on the input side? We have good guesses about further micro-optimizations, but we are all waiting to see how this stack behaves on more than one CPU first!

On top of that, I want to thank Hrvoje Popovski for his great job collecting data.

Plus, we found an old scheduling bug thanks to the profiling info... What else?