Strange recursor response time pattern
he at uninett.no
Wed Sep 6 12:40:28 UTC 2017
>> Is that pulling the old-style stats file, or the HTTP-based stats channel?
As should be evident from my other message, this is using the
HTTP-based stats channel.
> If the latter... the zone list (and by extension the root
> document) seems to take a long time to process, and involves
> some sort of locking that blocks all query processing while the
> list is being generated. We encountered this on a 3+ million
> zone instance.. BIND would stop answering queries for several
> minutes if anyone requested the root stats document or the zone
Since this name server is almost purely a recursive
resolver, the list of authoritative zones is short: only
3 configured zones ("localhost", "127.in-addr.arpa" and the
corresponding zone for the IPv6 loopback), plus the
"automatic" zones. Even so, the halting of query
processing while the list of zones is processed should not be an
That said, I'm also rather baffled that BIND would have to stop
processing all queries while traversing the zone instances; that
certainly seems to have an excessive effect on normal operations.
> As Ray says, you may be better off individually querying each
> of the other documents and processing those rather than polling
> the root doc to get them all in one shot.
It's not "me" doing the querying; it's the collectd
software. The syscall trace does indeed show it asking
for the root document:
GET / HTTP/1.1
However, your advice to query the separate documents in
individual requests would:
* require a rewrite of the BIND module in collectd
* still not entirely get rid of the problem that some queries
are put on hold while the stats channel data is processed and
Looking at the system call trace shows me that other BIND threads
do process DNS queries while this single thread which does the
HTTP handling does not. Hence my suggestion to instead use a
dedicated thread for the stats / HTTP handling.
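
To illustrate the general idea (this is not BIND's code, and all
names here are hypothetical), a minimal Python sketch of serving
statistics from a dedicated thread could look like the following.
It assumes the workers only need to bump counters, and that the
stats thread can copy them under a short-lived lock and do the
slow rendering outside it:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counters:
    """Counters shared by query workers and the stats thread (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {"queries": 0}

    def increment(self, name):
        # Called by worker threads; the lock is held only for the update.
        with self._lock:
            self._values[name] = self._values.get(name, 0) + 1

    def snapshot(self):
        # Hold the lock just long enough to copy; render outside the lock.
        with self._lock:
            return dict(self._values)


def start_stats_thread(counters, host="127.0.0.1", port=0):
    """Serve the counters over HTTP from a dedicated daemon thread."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = json.dumps(counters.snapshot()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the example quiet

    server = HTTPServer((host, port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_stats(server):
    """Fetch the stats document from the server started above."""
    port = server.server_address[1]
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/") as resp:
        return json.load(resp)
```

With this layout, a slow stats request only delays the stats
thread itself; the workers are held up for no longer than it
takes to copy the counters.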
Oh, BTW, it also seems that BIND in my case wastes 15ms on
needless getsockname() syscalls on invalid FDs during the
early stages of stats processing:
5645 17 named 1504698577.991440645 CALL getsockname(0xffffffff,0x7f7fef1f06e0,0x7f7fef1f069c)
5645 17 named 1504698577.991446511 RET getsockname -1 errno 9 Bad file descriptor
(repeated lots of times).
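
For anyone wanting to total up the time from such a trace, here
is a rough Python sketch. It assumes the column layout seen in
the two lines above (pid, thread id, process name, timestamp,
CALL or RET, details); the function name is my own invention.
Decimal is used because the timestamps carry nanosecond
precision, which a float would lose:

```python
from decimal import Decimal


def syscall_durations(lines, name="getsockname"):
    """Return the duration of each CALL/RET pair for one syscall.

    Assumed columns: pid, thread id, process name, timestamp,
    CALL|RET, details.
    """
    pending = {}  # (pid, thread) -> timestamp of the outstanding CALL
    durations = []
    for line in lines:
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # not a trace line in the assumed layout
        pid, thread, _proc, stamp, kind, rest = fields
        key = (pid, thread)
        if kind == "CALL" and rest.startswith(name + "("):
            pending[key] = Decimal(stamp)
        elif kind == "RET" and rest.split()[0] == name and key in pending:
            durations.append(Decimal(stamp) - pending.pop(key))
    return durations
```

Feeding it the whole trace and calling sum() on the result gives
the total time spent in those calls.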