Tuning suggestions for high-core-count Linux servers

Fri Jun 2 03:59:30 UTC 2017

<lots of stuff snipped out>

> -----Original Message-----
> From: Mathew Ian Eis [mailto:Mathew.Eis at nau.edu]
>
<snip>
> 
> Basically the math here is “large enough that you can queue up the
> 9X.XXXth percentile of traffic bursts without dropping them, but not so
> large that you waste processing time fiddling with the queue”. Since that
> percentile varies widely across environments it’s not easy to provide a
> specific formula. And on that note:

Yup. Experimentation seems to be the order of the day.
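
For the archives, the rule of thumb quoted above could be sketched roughly like this. It is only an illustration: it assumes you have somehow collected samples of receive-queue depth in bytes, and the function name, the sample numbers, the percentile and the headroom factor are all made up for the example rather than recommendations.

```python
import math


def suggest_rmem(queue_depth_samples, percentile=99.9, headroom=1.25):
    """Return a candidate receive buffer size in bytes.

    queue_depth_samples: observed receive-queue depths in bytes (hypothetical data).
    percentile: the share of bursts we want to absorb without dropping.
    headroom: safety multiplier applied on top of that percentile.
    """
    if not queue_depth_samples:
        raise ValueError("need at least one sample")
    ordered = sorted(queue_depth_samples)
    # Nearest-rank percentile: the smallest sample that has at least
    # `percentile` percent of all samples at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return int(ordered[rank - 1] * headroom)


if __name__ == "__main__":
    # Made-up samples: mostly quiet, with a couple of large bursts.
    samples = [64_000] * 990 + [512_000] * 8 + [4_000_000] * 2
    print(suggest_rmem(samples))
```

Anything above the chosen percentile still gets dropped, which is the trade-off being described: a bigger number absorbs more bursts but costs more memory and queue handling.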

> > Will keep spinning test but using smaller increments to the wmem/rmem
> > values
>
> Tightening is nice for finding some theoretical limits but in practice
> not so much. Be careful about making them too tight, lest under your
> “bursty” production loads you drop all sorts of queries without intending
> to.

Yup.

> dropwatch is an easy indicator of whether the throughput issue is on or
> off the system. Seeing packets being dropped in the system combined with
> apparently low CPU usage suggests you might be able to increase
> throughput. `dropwatch -l kas` should tell you the methods that are
> dropping the packets, which can help you understand where in the kernel
> they are being dropped and why. For anything beyond that, I expect your
> Google-fu is as good as mine ;-)

Like the '-l kas':

830 drops at udp_queue_rcv_skb+374 (0xffffffff815e1c64)
 15 drops at __udp_queue_rcv_skb+91 (0xffffffff815df171)

Well and truly buried in the code.
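
To compare runs, the output can be totalled per symbol with something like the sketch below. It assumes every line has the same shape as the two lines above (a count, "drops at", symbol+offset, then an address in brackets); the function name is just for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   830 drops at udp_queue_rcv_skb+374 (0xffffffff815e1c64)
DROP_LINE = re.compile(
    r"^\s*(\d+) drops at (\S+?)\+([0-9a-fA-F]+) \((0x[0-9a-fA-F]+)\)"
)


def tally_drops(lines):
    """Sum the drop counts per kernel symbol, ignoring lines that do not match."""
    totals = Counter()
    for line in lines:
        match = DROP_LINE.match(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


if __name__ == "__main__":
    captured = [
        "830 drops at udp_queue_rcv_skb+374 (0xffffffff815e1c64)",
        " 15 drops at __udp_queue_rcv_skb+91 (0xffffffff815df171)",
    ]
    for symbol, count in tally_drops(captured).most_common():
        print(symbol, count)
```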

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#udpqueuercvskb

This seems like a nice explanation as to what's going on. Still reading through it all.
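
Drops in that part of the UDP receive path often (not always) show up in the Udp counters in /proc/net/snmp, in particular RcvbufErrors and InErrors, so those are worth watching alongside dropwatch. A minimal sketch for pulling them out, assuming the usual layout of one "Udp:" header line followed by one "Udp:" value line (the function name is hypothetical):

```python
import os


def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp style text as a dict."""
    # "UdpLite:" lines are skipped because they do not start with "Udp:".
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) != 2:
        raise ValueError("expected one Udp header line and one Udp value line")
    header, values = rows
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


if __name__ == "__main__":
    path = "/proc/net/snmp"
    if os.path.exists(path):
        with open(path) as handle:
            counters = parse_udp_counters(handle.read())
        for name in ("InDatagrams", "InErrors", "RcvbufErrors"):
            print(name, counters.get(name))
```

Sampling this before and after a test run and looking at the difference is usually more telling than the absolute numbers.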

> If your CPU utilization is still apparently low, you might be onto
> something with taskset/numa… Related things I have toyed with but don’t
> currently have in production:
> 
> increasing kernel.sched_migration_cost a couple of orders of magnitude
> 
> setting kernel.sched_autogroup_enabled=0
> 
> systemctl stop irqbalance

I've had irqbalance stopped previously, and sched_autogroup_enabled is already set to 0. Some initial mucking about with sched_migration_cost gets a few more QPS through, so I will run more tests.

Thanks for this one, hadn't used it before.
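
For anyone wanting to repeat the sched_migration_cost experiment, here is a rough sketch that prints candidate settings a couple of orders of magnitude above the current value. It only prints sysctl commands and changes nothing. The two /proc paths are assumptions: the knob is called sched_migration_cost on some kernels and sched_migration_cost_ns on others, and on some it is not under /proc/sys/kernel at all.

```python
import os

# Assumed locations of the knob; neither may exist on a given kernel.
CANDIDATE_PATHS = (
    "/proc/sys/kernel/sched_migration_cost_ns",
    "/proc/sys/kernel/sched_migration_cost",
)


def sweep_values(base, orders=2):
    """Return base, base*10, ... up to `orders` orders of magnitude above base."""
    return [base * 10 ** k for k in range(orders + 1)]


def find_knob(paths=CANDIDATE_PATHS):
    """Return the first candidate path that exists, or None."""
    for path in paths:
        if os.path.exists(path):
            return path
    return None


if __name__ == "__main__":
    knob = find_knob()
    if knob is None:
        print("no sched_migration_cost knob found under /proc/sys/kernel")
    else:
        with open(knob) as handle:
            current = int(handle.read().strip())
        name = "kernel." + os.path.basename(knob)
        # Print the commands to try by hand, one test run per value.
        for value in sweep_values(current, orders=2):
            print("sysctl -w {}={}".format(name, value))
```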

> Lastly (mostly for posterity for the list, please don’t take this as
> “rtfm” if you’ve seen them already) here are some very useful in-depth
> (but generalized) performance tuning guides:

Will give them a read. I do like manuals :P

<urls stripped due to stupid spam filtering corrupting their readability>

> … and for one last really crazy idea, you could try running a pair of
> named instances on the machine and fronting them with nginx’s supposedly
> scalable UDP load balancer. (As long as you don’t get a performance hit,
> it also opens up other interesting possibilities like being able to shift
> production load for maintenance on the named backends).

Yeah, I've had this thought.

I'm fairly sure I've reached the limit of what BIND can do in a single NUMA node for the moment.
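
In case it helps anyone repeating the single-node tests, this is roughly how the CPUs of one NUMA node can be found and a process pinned to them. It is a sketch, assuming a Linux box that exposes /sys/devices/system/node/nodeN/cpulist; the function names are made up, and pin_to_node is defined here but not called.

```python
import os


def parse_cpulist(text):
    """Turn a kernel cpulist string such as "0-7,16-23" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node):
    """Read the CPU set of one NUMA node from sysfs (path is an assumption)."""
    path = "/sys/devices/system/node/node{}/cpulist".format(node)
    with open(path) as handle:
        return parse_cpulist(handle.read())


def pin_to_node(node, pid=0):
    """Restrict a process (0 means the caller) to the CPUs of one NUMA node."""
    os.sched_setaffinity(pid, node_cpus(node))


if __name__ == "__main__":
    if os.path.exists("/sys/devices/system/node/node0/cpulist"):
        print("node0 CPUs:", sorted(node_cpus(0)))
```

This only constrains which CPUs the process may run on; it does not by itself control where its memory is allocated.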

I will report back if any great inspiration or successful increases in throughput occur.

Stuart