bind-9.3.1 stops answering queries for nearly a minute/hour
Danny Thomas
d.thomas at its.uq.edu.au
Thu Oct 13 07:47:45 UTC 2005
Hello,
we've been noticing for a while periods when our heaviest-loaded name-
server stops answering queries. It is running 9.3.1 and handles around
20 million queries per day, i.e. a few hundred per second. It's running
on a dual-processor FreeBSD 4 system and compiled without threads.
What we see is a period of 30-60 seconds when the name-server stops
answering queries. Monitoring with top shows a sudden spike in CPU activity
for named while this is happening. While this basically happens every hour,
it is just slightly longer, as if a one-hour timer is started when the
activity underlying this finishes.
The results can been seen from the output of trace-nameserver, a local tool
which sends a dns query every second for a record the name-server is author-
itative for, and plots the results; a 'T' when response took > 800ms
otherwise a '.'. Each line corresponds to a minute with the average/maximum
response time in msecs seen during that minute listed at the end.
NB queries marked with a 'T' are not counted in these numbers, which means
quite a few of those marked with a '.' around 13:09 were slow thereby
increasing the average from 2-3ms to 119ms.
12:58: ............................................................ 2/ 9
12:59: ............................................................ 2/ 6
13:00: ............................................................ 2/ 5
13:01: ............................................................ 3/ 15
13:02: ............................................................ 2/ 18
13:03: ............................................................ 2/ 6
13:04: ............................................................ 2/ 7
13:05: ............................................................ 2/ 4
13:06: ............................................................ 3/ 8
13:07: ............................................................ 3/ 6
13:08: ............................................................ 3/ 7
13:09: ....T.TTTTTT.TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT........T. 119/787
13:10: ............................................................ 3/ 28
13:11: ............................................................ 2/ 5
13:12: ............................................................ 2/ 10
13:13: ............................................................ 2/ 8
13:14: ............................................................ 3/ 9
13:15: ............................................................ 2/ 5
13:16: ............................................................ 2/ 13
13:17: ............................................................ 2/ 8
Because the period is not exactly hourly, we knew it was not cron-related.
Nor did it seem to be related to zone transfers nor was any noteworthy
activity apparent in the query logs.
I finally enabled tracing at level 10 a few minutes before an event
was due. A 100+Meg named.run was produced in around 5 minutes.
A quick scroll through identified large numbers of lines of the form:
13:06:20.663 no_references: delete from rbt: 0xc1d6d00 ns1.gpl.com.au
A short perl script counted the number of such lines in each second
and produced the following, i.e. the time axis is rotated by 90 degrees
compared to the output of trace-nameserver.
There is a clear correlation between the figure above and below so I suspect
the unresponsiveness to be coming during house-cleaning (deleting from
RedBlack trees ?).
We'd like to know if there is something that could be done to prevent
such long periods of unresponsiveness. Mind you, the effects would be
less important if UNIX resolvers re-ordered the list of name-servers
specified in resolv.conf so the fastest is consulted first. This dynamic
behaviour is what the Windows resolver apparently does, Win2K and later.
Could smaller cleanups be done every minute ?
Is this a case when threading would help, or would it just mean
extra CPUs would be working on garbage-collection ?
Are there any build options which might help, e.g. I've seen the metion of
ISC_MEM_USE_INTERNAL_MALLOC for performance, which is slightly different.
I suspect the timeouts we're seeing are a different issue to
1740. [bug] Replace rbt's hash algorithm as it performed badly
with certain zones. [RT #12729]
or several others mentioning bugs fixed wrt RBTs.
cheers,
Danny Thomas
each column in this tables corresponds to a minute, starting from 13:06:00
through 13:10:59. The numbers suddenly increase at 13:09:10, i.e. a few
seconds after trace-nameserver starts recording time-outs
06:00 - 07:00 - 08:00 - 09:00 - 10:00 -
06:01 - 07:01 - 08:01 - 09:01 - 10:01 -
06:02 - 07:02 - 08:02 - 09:02 - 10:02 -
06:03 - 07:03 - 08:03 - 09:03 - 10:03 -
06:04 - 07:04 - 08:04 - 09:04 - 10:04 -
06:05 - 07:05 - 08:05 - 09:05 - 10:05 -
06:06 - 07:06 - 08:06 - 09:06 - 10:06 -
06:07 - 07:07 - 08:07 - 09:07 - 10:07 -
06:08 - 07:08 - 08:08 - 09:08 - 10:08 -
06:09 - 07:09 - 08:09 - 09:09 - 10:09 -
06:10 - 07:10 - 08:10 - 09:10 1,141 10:10 4
06:11 - 07:11 - 08:11 1 09:11 1,112 10:11 14
06:12 - 07:12 1 08:12 - 09:12 1,043 10:12 -
06:13 1 07:13 - 08:13 - 09:13 884 10:13 2
06:14 - 07:14 2 08:14 - 09:14 1,369 10:14 5
06:15 - 07:15 1 08:15 - 09:15 1,452 10:15 2
06:16 - 07:16 - 08:16 - 09:16 1,434 10:16 1
06:17 - 07:17 - 08:17 9 09:17 1,756 10:17 3
06:18 2 07:18 - 08:18 6 09:18 1,861 10:18 1
06:19 - 07:19 6 08:19 2 09:19 1,711 10:19 6
06:20 4 07:20 7 08:20 2 09:20 1,543 10:20 -
06:21 - 07:21 - 08:21 6 09:21 1,301 10:21 1
06:22 6 07:22 - 08:22 - 09:22 1,323 10:22 6
06:23 - 07:23 1 08:23 - 09:23 1,441 10:23 1
06:24 6 07:24 1 08:24 6 09:24 1,405 10:24 -
06:25 3 07:25 - 08:25 - 09:25 1,500 10:25 1
06:26 3 07:26 - 08:26 1 09:26 1,535 10:26 2
06:27 - 07:27 - 08:27 12 09:27 1,659 10:27 1
06:28 1 07:28 - 08:28 18 09:28 1,709 10:28 -
06:29 6 07:29 1 08:29 6 09:29 1,456 10:29 1
06:30 18 07:30 - 08:30 - 09:30 1,863 10:30 2
06:31 - 07:31 12 08:31 - 09:31 1,764 10:31 -
06:32 - 07:32 - 08:32 12 09:32 2,031 10:32 1
06:33 - 07:33 1 08:33 - 09:33 1,960 10:33 2
06:34 - 07:34 8 08:34 25 09:34 2,330 10:34 18
06:35 - 07:35 2 08:35 - 09:35 2,199 10:35 -
06:36 - 07:36 - 08:36 7 09:36 1,988 10:36 2
06:37 6 07:37 - 08:37 6 09:37 2,346 10:37 -
06:38 6 07:38 - 08:38 1 09:38 2,685 10:38 -
06:39 - 07:39 - 08:39 3 09:39 2,368 10:39 -
06:40 - 07:40 - 08:40 - 09:40 3,021 10:40 -
06:41 - 07:41 - 08:41 13 09:41 4,766 10:41 -
06:42 3 07:42 - 08:42 - 09:42 5,256 10:42 -
06:43 - 07:43 - 08:43 5 09:43 5,594 10:43 -
06:44 2 07:44 2 08:44 1 09:44 5,824 10:44 -
06:45 1 07:45 - 08:45 - 09:45 4,422 10:45 -
06:46 6 07:46 1 08:46 1 09:46 5,084 10:46 -
06:47 - 07:47 - 08:47 - 09:47 7,487 10:47 -
06:48 - 07:48 1 08:48 - 09:48 6,626 10:48 -
06:49 - 07:49 1 08:49 - 09:49 2,931 10:49 -
06:50 - 07:50 1 08:50 - 09:50 - 10:50 -
06:51 - 07:51 1 08:51 - 09:51 - 10:51 -
06:52 7 07:52 1 08:52 - 09:52 3 10:52 -
06:53 - 07:53 6 08:53 8 09:53 4 10:53 -
06:54 - 07:54 - 08:54 - 09:54 - 10:54 -
06:55 2 07:55 - 08:55 - 09:55 5 10:55 -
06:56 1 07:56 3 08:56 - 09:56 - 10:56 -
06:57 2 07:57 2 08:57 3 09:57 - 10:57 -
06:58 2 07:58 11 08:58 - 09:58 - 10:58 -
06:59 - 07:59 7 08:59 - 09:59 1 10:59 -
--
d.thomas at its.uq.edu.au Danny Thomas,
+61-7-3365-8221 Software Infrastructure,
http://www.its.uq.edu.au ITS, The University of Queensland
More information about the bind-users
mailing list