Tuning suggestions for high-core-count Linux servers

Mathew Ian Eis Mathew.Eis at nau.edu
Thu Jun 1 22:26:56 UTC 2017


Howdy Stuart,

>  Re: net.core.rmem - I'd love to figure out what the math here should be. 'X number of simultaneous connections multiplied by Y socket memory size = rmem' or some such.

Basically the math here is “large enough that you can queue up the 9X.XXXth percentile of traffic bursts without dropping them, but not so large that you waste processing time fiddling with the queue”. Since that percentile varies widely across environments it’s not easy to provide a specific formula. And on that note:
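To make that concrete with purely invented numbers: if bursts peak around 400k qps, each queued query costs on the order of 1 KB of socket buffer once the kernel's per-packet overhead is counted, and named needs ~10 ms to catch up on a backlog, then roughly

    400,000 pkt/s * 0.010 s * ~1 KB/pkt ≈ 4 MB

of rmem for the listening socket would absorb the burst - i.e. single-digit megabytes rather than the 64 MB end of the scale. The only honest formula is still "measure your own burst sizes and drain times".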

> Will keep spinning the test, but using smaller increments to the wmem/rmem values

Tightening is nice for finding theoretical limits, but less useful in practice. Be careful about making the buffers too tight, lest your “bursty” production loads end up dropping all sorts of queries you never intended to lose.

> Re: dropwatch - Oo! new tool! More google-fu to figure out how to use that information for good

dropwatch is an easy way to tell whether the throughput problem is on the system or off it. Seeing packets dropped on the system, combined with apparently low CPU usage, suggests there is still throughput to be gained. `dropwatch -l kas` should tell you which kernel functions are dropping the packets, which helps you understand where in the kernel they are being dropped and why. For anything beyond that, I expect your Google-fu is as good as mine ;-)
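In case a concrete session helps (the output lines here are illustrative and from memory, so the exact format may differ by version):

    # dropwatch -l kas
    dropwatch> start
    ... let the test run for a while ...
    1029 drops at udp_queue_rcv_skb+147 (0xffffffff815e1c64)
    dropwatch> stop
    dropwatch> exit

With -l kas the drop locations are resolved to kernel symbols, so e.g. drops attributed to udp_queue_rcv_skb generally mean the socket receive buffer was full.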

If your CPU utilization is still apparently low, you might be onto something with taskset/numa… Related things I have toyed with but don’t currently have in production (rough commands for trying them follow the list):

- increasing kernel.sched_migration_cost a couple of orders of magnitude
- setting kernel.sched_autogroup_enabled=0
- systemctl stop irqbalance
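If anyone wants to experiment, the runtime knobs look roughly like this (values are illustrative; on EL7 the first one is spelled kernel.sched_migration_cost_ns, and its default is typically 500000, i.e. 0.5 ms):

    sysctl -w kernel.sched_migration_cost_ns=50000000
    sysctl -w kernel.sched_autogroup_enabled=0
    systemctl stop irqbalance

None of these persist across a reboot (or, for irqbalance, a systemctl start), so they are easy to back out if they don't help.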

Lastly (mostly for posterity for the list, please don’t take this as “rtfm” if you’ve seen them already) here are some very useful in-depth (but generalized) performance tuning guides:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Performance_Tuning_Guide/
https://access.redhat.com/sites/default/files/attachments/201501-perf-brief-low-latency-tuning-rhel7-v1.1.pdf

… and for one last really crazy idea, you could try running a pair of named instances on the machine and fronting them with nginx’s supposedly scalable UDP load balancer. (As long as you don’t get a performance hit, it also opens up other interesting possibilities like being able to shift production load for maintenance on the named backends).
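A minimal sketch of that idea, assuming nginx is built with the stream module and the two named instances are configured to listen-on alternate loopback ports (the ports here are just placeholders):

    stream {
        upstream named_backends {
            server 127.0.0.1:5301;   # named instance 1
            server 127.0.0.1:5302;   # named instance 2
        }
        server {
            listen 53 udp reuseport;
            proxy_pass named_backends;
            proxy_responses 1;       # one reply per DNS query, so sessions are recycled promptly
            proxy_timeout 2s;
        }
    }

Marking one backend's server line "down" and reloading nginx would then be the maintenance lever mentioned above.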

Best of luck! Let us know where you cap out!

Regards,

Mathew Eis
Northern Arizona University
Information Technology Services

-----Original Message-----
From: "Browne, Stuart" <Stuart.Browne at neustar.biz>
Date: Thursday, June 1, 2017 at 12:27 AM
To: Mathew Ian Eis <Mathew.Eis at nau.edu>, "bind-users at lists.isc.org" <bind-users at lists.isc.org>
Subject: RE: Tuning suggestions for high-core-count Linux servers

    Cheers Mathew.
    
    1)  Not seeing that error, seeing this one instead:
    
    01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 (x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: would block
    
    Only seeing a few of them per run (out of ~70 million requests).
    
    Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c in doio_send), I don't understand the underlying reason for it being set (errno == EWOULDBLOCK || errno == EAGAIN).
    
    I've not bumped wmem/rmem up as much as the link suggested (only to 16MB, not 40MB), but saw no real difference after the tweaks. I did another run with stupidly large core.{rmem,wmem}_{max,default} (64MB); that actually degraded performance a bit, so over-tuning isn't good either. Need to figure out a good balance here.
    
    I'd love to figure out what the math here should be.  'X number of simultaneous connections multiplied by Y socket memory size = rmem' or some such.
    
    2) I am still seeing some udp receive errors and receive buffer errors; about 1.3% of received packets.
    
    From a 'netstat' point of view, I see:
    
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State
    udp   382976  17664 192.168.1.21:53         0.0.0.0:*
    
    The numbers in the receive queue stay in the 200-300k range whilst the send-queue floats around the 20-40k range. wmem already bumped.
    
    3) Huh, didn't know about this one. Bumped up the backlog; small increase in throughput for my tests. Still need to figure out how to read softnet_stat. More google-fu in my future.
    
    After a reboot and the wmem/rmem/backlog increases, no longer any non-zero in the 2nd column.
    
    4) Yes, max_dgram_qlen is already set to 512.
    
    5) Oo! new tool! :)
    
    --
    ...
    11 drops at location 0xffffffff815df171
    854 drops at location 0xffffffff815e1c64
    12 drops at location 0xffffffff815df171
    822 drops at location 0xffffffff815e1c64
    ...
    --
    
    I'm pretty sure it's just showing more details of the 'netstat -u -s'. More google-fu to figure out how to use that information for good rather than, well, .. frustration? .. :)
    
    Will keep spinning the test, but using smaller increments to the wmem/rmem values, to see if I can eke anything more than 360k out of it.
    
    Thanks for your suggestions Mathew!
    
    Stuart
    
    
    -----Original Message-----
    From: Mathew Ian Eis [mailto:Mathew.Eis at nau.edu] 
    Sent: Thursday, 1 June 2017 10:30 AM
    To: bind-users at lists.isc.org
    Cc: Browne, Stuart
    Subject: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers
    
    360k qps is actually quite good… the best I have heard of until now on EL was 180k [1]. There, it was recommended to manually tune the number of subthreads with the -U parameter.
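    For example (the counts here are made up - they would be tuned to your core count):

        named -n 24 -U 16

    where -n sets the number of worker threads and -U the number of UDP listeners per interface.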
    
    
    
    Since you’ve mentioned rmem/wmem changes, specifically you want to:
    
    
    
    1. check for send buffer overflow; as indicated in named logs:
    
    31-Mar-2017 12:30:55.521 client: warning: client 10.0.0.5#51342 (test.com): error sending response: unset
    
    
    
    fix: increase wmem (the send buffer) via sysctl:

    net.core.wmem_max

    net.core.wmem_default
    
    
    
    2. check for receive buffer overflow; as indicated by netstat:
    
    # netstat -u -s
    
    Udp:
    
        34772479 packet receive errors
    
    
    
    fix: increase rmem (the receive buffer) and, if needed, the backlog (see 3 below) via sysctl:

    net.core.rmem_max

    net.core.rmem_default
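    For example, to try 16MB in both directions (the size is purely illustrative - too large just wastes time managing the queues):

        sysctl -w net.core.wmem_max=16777216 net.core.wmem_default=16777216
        sysctl -w net.core.rmem_max=16777216 net.core.rmem_default=16777216

    and put the same keys in /etc/sysctl.d/ to make them persistent.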
    
    
    
    … and other ideas:
    
    
    
    3. check 2nd column in /proc/net/softnet_stat for any non-zero numbers (indicating dropped packets).
    
    If any are non-zero, increase net.core.netdev_max_backlog
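    The counters in that file are hex; a quick way to print the drop column per CPU (uses strtonum, a gawk extension - gawk is the default awk on EL):

        awk '{ printf "cpu%d: %d\n", NR-1, strtonum("0x" $2) }' /proc/net/softnet_stat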
    
    
    
    4. You may also want to increase net.unix.max_dgram_qlen (although since EL7 defaults this to 512 it is not much of an issue - just double-check that it is 512).
    
    
    
    5. Try running dropwatch to see where packets are being lost. If it shows nothing then you need to look outside the system. If it shows something you may have a hint where to tune next.
    
    
    
    Please post your outcomes in any case, since you are already having some excellent results.
    
    
    
    [1] https://lists.dns-oarc.net/pipermail/dns-operations/2014-April/011543.html
    
    
    Regards,
    
    
    
    Mathew Eis
    
    Northern Arizona University
    
    Information Technology Services
    
    
    
    -----Original Message-----
    
    From: bind-users <bind-users-bounces at lists.isc.org> on behalf of "Browne, Stuart" <Stuart.Browne at neustar.biz>
    
    Date: Wednesday, May 31, 2017 at 12:25 AM
    
    To: "bind-users at lists.isc.org" <bind-users at lists.isc.org>
    
    Subject: Tuning suggestions for high-core-count Linux servers
    
    
    
        Hi,
    
        
    
        I've been able to get my hands on some rather nice servers with 2 x 12-core Intel CPUs and was wondering if anybody had any decent tuning tips to get BIND to respond at a faster rate.
    
        
    
        I'm seeing that adding CPU cores beyond a single die gets pretty much no real improvement. I understand the NUMA boundaries etc., but this hasn't been my experience on previous iterations of Intel CPUs, at least not this dramatically. When I use more than a single die, CPU utilization continues to match the core count, but throughput doesn't increase to match.
    
        
    
        All the testing I've been doing (dnsperf from multiple sources for now) seems to be plateauing around 340k qps per BIND host.
    
        
    
        Some notes:
    
        - Primarily looking at UDP throughput here
    
        - Intention is for high-throughput, authoritative only
    
        - The zone files used for testing are fairly small and reside completely in-memory; no disk IO involved
    
        - RHEL7, bind 9.10 series, iptables 'NOTRACK' firmly in place
    
        - Current configure:
    
        
    
        built by make with '--build=x86_64-redhat-linux-gnu' '--host=x86_64-redhat-linux-gnu' '--program-prefix=' '--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' '--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' '--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--localstatedir=/var' '--with-libtool' '--enable-threads' '--enable-ipv6' '--with-pic' '--enable-shared' '--disable-static' '--disable-openssl-version-check' '--with-tuning=large' '--with-libxml2' '--with-libjson' 'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mtune=generic -fPIC' 'LDFLAGS=-Wl,-z,relro ' 'CPPFLAGS= -DDIG_SIGCHASE -fPIC'
    
        
    
        Things tried:
    
        - Using 'taskset' to bind to a single CPU die and limiting BIND to '-n' CPUs doesn't improve much beyond letting BIND make its own decision (rough invocation sketched below this list)
    
        - NIC interfaces are set for TOE
    
        - rmem & wmem changes (beyond a point) seem to do little to improve performance, mainly just make throughput more consistent
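        (i.e. something along the lines of the following - the core list and thread count here are generic placeholders rather than our exact values:

            taskset -c 0-11 /usr/sbin/named -u named -n 12

        where the core IDs for one die come from lscpu.)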
    
        
    
        I've yet to investigate the switch throughput or tweaking (don't yet have access to it).
    
        
    
        So, any thoughts?
    
        
    
        Stuart
    
    
    
    
    
    


