bind-9.3.1 stops answering queries for nearly a minute/hour

Thu Oct 13 07:47:45 UTC 2005

Hello,
      we've been noticing for a while periods when our heaviest-loaded name-
server stops answering queries. It is running 9.3.1 and handles around
20 million queries per day, i.e. a few hundred per second. It's running
on a dual-processor FreeBSD 4 system and compiled without threads.

What we see is a period of 30-60 seconds when the name-server stops
answering queries. Monitoring with top shows a sudden spike in CPU activity
for named while this is happening. While this basically happens every hour,
it is just slightly longer, as if a one-hour timer is started when the
activity underlying this finishes.

The results can been seen from the output of trace-nameserver, a local tool
which sends a dns query every second for a record the name-server is author-
itative for, and plots the results; a 'T' when response took > 800ms
otherwise a '.'. Each line corresponds to a minute with the average/maximum
response time in msecs seen during that minute listed at the end.
NB queries marked with a 'T' are not counted in these numbers, which means
quite a few of those marked with a '.' around 13:09 were slow thereby
increasing the average from 2-3ms to 119ms.

12:58: ............................................................    2/  9
12:59: ............................................................    2/  6
13:00: ............................................................    2/  5
13:01: ............................................................    3/ 15
13:02: ............................................................    2/ 18
13:03: ............................................................    2/  6
13:04: ............................................................    2/  7
13:05: ............................................................    2/  4
13:06: ............................................................    3/  8
13:07: ............................................................    3/  6
13:08: ............................................................    3/  7
13:09: ....T.TTTTTT.TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT........T.  119/787
13:10: ............................................................    3/ 28
13:11: ............................................................    2/  5
13:12: ............................................................    2/ 10
13:13: ............................................................    2/  8
13:14: ............................................................    3/  9
13:15: ............................................................    2/  5
13:16: ............................................................    2/ 13
13:17: ............................................................    2/  8

Because the period is not exactly hourly, we knew it was not cron-related.
Nor did it seem to be related to zone transfers nor was any noteworthy
activity apparent in the query logs.

I finally enabled tracing at level 10 a few minutes before an event
was due. A 100+Meg named.run was produced in around 5 minutes.
A quick scroll through identified large numbers of lines of the form:
  13:06:20.663 no_references: delete from rbt: 0xc1d6d00 ns1.gpl.com.au
A short perl script counted the number of such lines in each second
and produced the following, i.e. the time axis is rotated by 90 degrees
compared to the output of trace-nameserver.

There is a clear correlation between the figure above and below so I suspect
the unresponsiveness to be coming during house-cleaning (deleting from
RedBlack trees ?).

We'd like to know if there is something that could be done to prevent
such long periods of unresponsiveness. Mind you, the effects would be
less important if UNIX resolvers re-ordered the list of name-servers
specified in resolv.conf so the fastest is consulted first. This dynamic
behaviour is what the Windows resolver apparently does, Win2K and later.

Could smaller cleanups be done every minute ?
Is this a case when threading would help, or would it just mean
extra CPUs would be working on garbage-collection ?
Are there any build options which might help, e.g. I've seen the metion of
ISC_MEM_USE_INTERNAL_MALLOC for performance, which is slightly different.

I suspect the timeouts we're seeing are a different issue to
1740.	[bug]		Replace rbt's hash algorithm as it performed badly
			with certain zones. [RT #12729]
or several others mentioning bugs fixed wrt RBTs.

cheers,
Danny Thomas

each column in this tables corresponds to a minute, starting from 13:06:00
through 13:10:59. The numbers suddenly increase at 13:09:10, i.e. a few 
seconds after trace-nameserver starts recording time-outs
06:00      -    07:00      -    08:00      -    09:00      -    10:00      -
06:01      -    07:01      -    08:01      -    09:01      -    10:01      -
06:02      -    07:02      -    08:02      -    09:02      -    10:02      -
06:03      -    07:03      -    08:03      -    09:03      -    10:03      -
06:04      -    07:04      -    08:04      -    09:04      -    10:04      -
06:05      -    07:05      -    08:05      -    09:05      -    10:05      -
06:06      -    07:06      -    08:06      -    09:06      -    10:06      -
06:07      -    07:07      -    08:07      -    09:07      -    10:07      -
06:08      -    07:08      -    08:08      -    09:08      -    10:08      -
06:09      -    07:09      -    08:09      -    09:09      -    10:09      -
06:10      -    07:10      -    08:10      -    09:10  1,141    10:10      4
06:11      -    07:11      -    08:11      1    09:11  1,112    10:11     14
06:12      -    07:12      1    08:12      -    09:12  1,043    10:12      -
06:13      1    07:13      -    08:13      -    09:13    884    10:13      2
06:14      -    07:14      2    08:14      -    09:14  1,369    10:14      5
06:15      -    07:15      1    08:15      -    09:15  1,452    10:15      2
06:16      -    07:16      -    08:16      -    09:16  1,434    10:16      1
06:17      -    07:17      -    08:17      9    09:17  1,756    10:17      3
06:18      2    07:18      -    08:18      6    09:18  1,861    10:18      1
06:19      -    07:19      6    08:19      2    09:19  1,711    10:19      6
06:20      4    07:20      7    08:20      2    09:20  1,543    10:20      -
06:21      -    07:21      -    08:21      6    09:21  1,301    10:21      1
06:22      6    07:22      -    08:22      -    09:22  1,323    10:22      6
06:23      -    07:23      1    08:23      -    09:23  1,441    10:23      1
06:24      6    07:24      1    08:24      6    09:24  1,405    10:24      -
06:25      3    07:25      -    08:25      -    09:25  1,500    10:25      1
06:26      3    07:26      -    08:26      1    09:26  1,535    10:26      2
06:27      -    07:27      -    08:27     12    09:27  1,659    10:27      1
06:28      1    07:28      -    08:28     18    09:28  1,709    10:28      -
06:29      6    07:29      1    08:29      6    09:29  1,456    10:29      1
06:30     18    07:30      -    08:30      -    09:30  1,863    10:30      2
06:31      -    07:31     12    08:31      -    09:31  1,764    10:31      -
06:32      -    07:32      -    08:32     12    09:32  2,031    10:32      1
06:33      -    07:33      1    08:33      -    09:33  1,960    10:33      2
06:34      -    07:34      8    08:34     25    09:34  2,330    10:34     18
06:35      -    07:35      2    08:35      -    09:35  2,199    10:35      -
06:36      -    07:36      -    08:36      7    09:36  1,988    10:36      2
06:37      6    07:37      -    08:37      6    09:37  2,346    10:37      -
06:38      6    07:38      -    08:38      1    09:38  2,685    10:38      -
06:39      -    07:39      -    08:39      3    09:39  2,368    10:39      -
06:40      -    07:40      -    08:40      -    09:40  3,021    10:40      -
06:41      -    07:41      -    08:41     13    09:41  4,766    10:41      -
06:42      3    07:42      -    08:42      -    09:42  5,256    10:42      -
06:43      -    07:43      -    08:43      5    09:43  5,594    10:43      -
06:44      2    07:44      2    08:44      1    09:44  5,824    10:44      -
06:45      1    07:45      -    08:45      -    09:45  4,422    10:45      -
06:46      6    07:46      1    08:46      1    09:46  5,084    10:46      -
06:47      -    07:47      -    08:47      -    09:47  7,487    10:47      -
06:48      -    07:48      1    08:48      -    09:48  6,626    10:48      -
06:49      -    07:49      1    08:49      -    09:49  2,931    10:49      -
06:50      -    07:50      1    08:50      -    09:50      -    10:50      -
06:51      -    07:51      1    08:51      -    09:51      -    10:51      -
06:52      7    07:52      1    08:52      -    09:52      3    10:52      -
06:53      -    07:53      6    08:53      8    09:53      4    10:53      -
06:54      -    07:54      -    08:54      -    09:54      -    10:54      -
06:55      2    07:55      -    08:55      -    09:55      5    10:55      -
06:56      1    07:56      3    08:56      -    09:56      -    10:56      -
06:57      2    07:57      2    08:57      3    09:57      -    10:57      -
06:58      2    07:58     11    08:58      -    09:58      -    10:58      -
06:59      -    07:59      7    08:59      -    09:59      1    10:59      -

-- 
   d.thomas at its.uq.edu.au    Danny Thomas,                                    
          +61-7-3365-8221    Software Infrastructure,
 http://www.its.uq.edu.au    ITS, The University of Queensland