Recursive bind becomes unresponsive with high load
Mathew Ian Eis
Mathew.Eis at nau.edu
Sat Apr 2 00:02:38 UTC 2016
A few thoughts:
* You can check for dropped packets on the receive path with # netstat -u -s
High numbers on "packet receive errors" can indicate an overflow in the receive buffer - this is fixable by network stack tuning as Mike Mitchell suggests.
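If netstat is not available, the same counters can be read directly from /proc/net/snmp, where InErrors is what netstat prints as "packet receive errors" and RcvbufErrors is "receive buffer errors". A rough sketch follows; the function name udp_counters is made up for illustration, and the optional file argument is only there so it can be pointed at a saved copy:

```shell
# udp_counters: print "name value" pairs from the two "Udp:" lines of an
# snmp-format file (defaults to /proc/net/snmp).
# The first "Udp:" line holds the column names, the second the values.
udp_counters() {
    awk '
        $1 == "Udp:" && !seen { for (i = 2; i <= NF; i++) name[i] = $i; seen = 1; next }
        $1 == "Udp:"          { for (i = 2; i <= NF; i++) print name[i], $i }
    ' "${1:-/proc/net/snmp}"
}
```

Running `udp_counters | grep -E 'InErrors|RcvbufErrors'` twice, a few seconds apart under load, shows whether the counters are still climbing.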
* You can check for dropped packets on the send path by looking for "error sending response: unset" in the named logs
...similarly fixable with sysctl tuning.
We changed the following:
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_max = 16777216
net.core.wmem_default = 16777216
net.core.netdev_max_backlog = 5000
net.unix.max_dgram_qlen = 100
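To confirm that values like these are actually in effect, each key can be compared with its file under /proc/sys. This is a minimal sketch, not a definitive tool: check_sysctl is a hypothetical helper name, and the optional directory argument exists only so the function can be tried against a copy of the tree.

```shell
# check_sysctl: read "key = value" lines on stdin and compare each with the
# live value under /proc/sys (or under the directory given as $1).
check_sysctl() {
    root="${1:-/proc/sys}"
    while IFS='=' read -r key want; do
        key=$(echo "$key" | tr -d ' \t')      # strip spaces around the key
        want=$(echo "$want" | tr -d ' \t')    # strip spaces around the value
        [ -z "$key" ] && continue             # skip blank lines
        path="$root/$(echo "$key" | tr . /)"  # net.core.rmem_max -> net/core/rmem_max
        have=$(cat "$path" 2>/dev/null || echo missing)
        if [ "$have" = "$want" ]; then status=ok; else status=DIFFERS; fi
        echo "$key want=$want have=$have $status"
    done
}
```

For example, `printf 'net.core.rmem_max = 16777216\n' | check_sysctl` prints one line ending in ok or DIFFERS. Values set with `sysctl -w` do not survive a reboot; on systems that have /etc/sysctl.conf, putting the lines there and running `sysctl -p` is the usual way to make them persistent.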
* Try watching your incoming UDP packet buffers in tight intervals at the same time as top
# watch -n 0.1 'cat /proc/net/udp | grep ":0035 00000000:0000 "'
# watch -n 0.1 'cat /proc/net/udp | grep -v "00000000:00000000 00:00000000 00000000"'
# top -d 0.1 -p $PID_OF_NAMED # where $PID_OF_NAMED is the named pid
Does the named unresponsiveness coincide with the UDP rx_queue filling up and named dropping to 0% CPU usage?
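The raw /proc/net/udp lines are hard to read at a glance because the queue sizes are in hex. As a sketch under stated assumptions, the following pulls out the rx_queue (converted to decimal bytes) and the drops column for sockets bound to port 53. The name udp_rx_queue and its two optional arguments (file, port) are hypothetical, and it assumes the kernel's table has drops as its last column.

```shell
# udp_rx_queue: for every socket bound to the given local port (default 53),
# print the local address, the receive queue size in bytes, and the drop count.
# $1 = table to read (default /proc/net/udp), $2 = port in decimal.
udp_rx_queue() {
    port_hex=$(printf '%04X' "${2:-53}")      # 53 -> 0035, as shown in /proc
    awk -v port=":$port_hex" '
        NR > 1 && substr($2, length($2) - 4) == port {
            split($5, q, ":")                 # field 5 is tx_queue:rx_queue (hex)
            print $2, q[2], $NF               # last field is the drops counter
        }
    ' "${1:-/proc/net/udp}" |
    while read -r addr rx_hex drops; do
        echo "$addr rx_queue=$((0x$rx_hex)) drops=$drops"
    done
}
```

Something like `while :; do udp_rx_queue; sleep 1; done` next to top makes it easier to see whether rx_queue sits near the receive buffer limit while named is idle.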
Northern Arizona University
Information Technology Services
mathew.eis at nau.edu
From: Michael Brunnbauer <brunni at netestate.de>
Date: Friday, April 1, 2016 at 9:29 AM
To: Mathew Eis <Mathew.Eis at nau.edu>
Cc: "bind-users at lists.isc.org" <bind-users at lists.isc.org>, <dot at dotat.at>
Subject: Re: Recursive bind becomes unresponsive with high load
>On Fri, Apr 01, 2016 at 04:01:04PM +0000, Mathew Ian Eis wrote:
>> What OS are you running your BIND server on? Is it virtualized?
>Linux Kernel 3.4.111 with glibc 2.22, 32bit, not virtualized. No distribution -
>everything was compiled by hand.
>> Is it fully unresponsive, or could it be simply taking longer to respond than your client timeout?
>Assuming that bind would report dropped queries, I guess it is the latter.
>Regarding the suggestion made by Tony Finch about too many TCP connections
>in the TIME_WAIT status: That would have been a good explanation. But I do not
>see more than 200 TCP connections in TIME_WAIT status when the problem occurs
>and not more than 5000 TCP/UDP connections with port 53.
>++ Michael Brunnbauer
>++ netEstate GmbH
>++ Geisenhausener Straße 11a
>++ 81379 München
>++ Tel +49 89 32 19 77 80
>++ Fax +49 89 32 19 77 89
>++ E-Mail brunni at netestate.de
>++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
>++ USt-IdNr. DE221033342
>++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
>++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel