Tuning suggestions for high-core-count Linux servers

Paul Kosinski bind at iment.com
Sat Jun 3 00:45:51 UTC 2017


It's been some years now, but I once worked on developing code for a
high-throughput network server (not BIND). We found that on
multi-socket NUMA machines we could hit similar contention problems,
and it was quite important to make sure that threads which needed
access to the same memory areas weren't split across sockets. Luckily,
the various services being run were sufficiently separate that we
could assign the service processes to different sockets and avoid a
lot of contention.
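
On Linux, that kind of assignment can be done with numactl (or with
taskset if you only care about CPU placement). A rough sketch, with
made-up service names, binding both the threads and the memory
allocations of each service to one socket:

    # keep service A entirely on node 0 and service B on node 1,
    # so neither has to reach across the socket interconnect
    numactl --cpunodebind=0 --membind=0 /usr/sbin/service-a &
    numactl --cpunodebind=1 --membind=1 /usr/sbin/service-b &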

With BIND, it's basically all one service, so this is not directly
possible. 

It might be possible, however, to run two (or more) *separate*
instances of BIND and do some strictly internal routing of the IP
traffic to those separate instances, or even to have separate NICs
feeding the separate processes. In other words, have several BIND
servers in one chassis, each with its own NUMA memory area.
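
A sketch of what that could look like (the addresses, paths and config
names below are placeholders, not something I've tested with BIND):

    # one named per NUMA node, each with its own config, listen
    # address and pid-file so the two instances don't collide
    numactl --cpunodebind=0 --membind=0 \
        /usr/sbin/named -u named -c /etc/named-node0.conf
    numactl --cpunodebind=1 --membind=1 \
        /usr/sbin/named -u named -c /etc/named-node1.conf

with each named-nodeN.conf using a distinct listen-on address (or a
distinct NIC), and the front end or the clients split across the two
addresses.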



On Fri, 2 Jun 2017 07:12:09 +0000
"Browne, Stuart" <Stuart.Browne at neustar.biz> wrote:

> Just some interesting investigation results. One of the URLs Matthew
> Ian Eis linked to talked about using a tool called 'perf'. For the
> hell of it, I gave it a shot.
> 
> Sure enough it tells some very interesting things.
> 
> When BIND was restricted to using a single NUMA node, the biggest
> call (to _raw_spin_lock) showed 7.05% overhead.
> 
> When BIND was allowed to use both NUMA nodes, the same call showed
> 49.74% overhead; an astonishing difference.
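> 
> (For anyone who wants to reproduce the comparison: the general shape
> is numactl to restrict named to one node, plus perf attached to the
> running process. Roughly, with the exact options likely differing:
> 
>     numactl --cpunodebind=0 --membind=0 /usr/sbin/named -u named
>     perf record -g -p $(pidof named) -o perf.data.24 -- sleep 180
>     perf report -i perf.data.24
> )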
> 
> As it was running unrestricted, memory from both nodes was in use:
> 
> [root@kr20s2601 ~]# numastat -p 22441
> 
> Per-node process memory usage (in MBs) for PID 22441 (named)
>                            Node 0          Node 1           Total
>                   --------------- --------------- ---------------
> Huge                         0.00            0.00            0.00
> Heap                         0.45            0.12            0.57
> Stack                        0.71            0.64            1.35
> Private                      5.28         9415.30         9420.57
> ----------------  --------------- --------------- ---------------
> Total                        6.43         9416.07         9422.50
> 
> Given the numbers here, you wouldn't think it should make much of a
> difference.
> 
> Sadly, I didn't get which CPU the UDP listener was attached to.
> 
> Anyway, what I've changed so far:
> 
>     vm.swappiness = 0
>     vm.dirty_ratio = 1
>     vm.dirty_background_ratio = 1
>     kernel.sched_min_granularity_ns = 10000000
>     kernel.sched_migration_cost_ns = 5000000
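> 
> (Those take effect at runtime with "sysctl -w", e.g.
> 
>     sysctl -w vm.swappiness=0
> 
> and can be persisted in a file under /etc/sysctl.d/ followed by
> "sysctl --system".)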
> 
> Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
> Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps
> 
> Stuart
> 
> 'perf' data collected during a 3-minute test run:
> 
> [root@kr20s2601 ~]# ls -al perf.data*
> -rw-------. 1 root root  717350012 Jun  2 08:36 perf.data.24
> -rw-------. 1 root root 1366620296 Jun  2 08:53 perf.data.48
> 
> 'perf' top 5 (24 cores, numa restricted):
> 
> Overhead  Command  Shared Object         Symbol
>    7.05%  named    [kernel.kallsyms]     [k] _raw_spin_lock
>    6.96%  named    libpthread-2.17.so    [.] pthread_mutex_lock
>    3.84%  named    libc-2.17.so          [.] vfprintf
>    2.36%  named    libdns.so.165.0.7     [.] dns_name_fullcompare
>    2.02%  named    libisc.so.160.1.2     [.] isc_log_wouldlog
> 
> 'perf' top 5 (48 cores):
> 
> Overhead  Command  Shared Object         Symbol
>   49.74%  named    [kernel.kallsyms]     [k] _raw_spin_lock
>    4.52%  named    libpthread-2.17.so    [.] pthread_mutex_lock
>    3.09%  named    libisc.so.160.1.2     [.] isc_log_wouldlog
>    1.84%  named    [kernel.kallsyms]     [k] _raw_spin_lock_bh
>    1.56%  named    libc-2.17.so          [.] vfprintf