bind-9.3.1 stops answering queries for nearly a minute/hour

Thu Oct 13 18:34:57 UTC 2005

>>>>> On Thu, 13 Oct 2005 17:47:45 +1000, 
>>>>> Danny Thomas <d.thomas at its.uq.edu.au> said:

> What we see is a period of 30-60 seconds when the name-server stops
> answering queries. Monitoring with top shows a sudden spike in CPU activity
> for named while this is happening. While this basically happens every hour,
> it is just slightly longer, as if a one-hour timer is started when the
> activity underlying this finishes.

Let me check one thing first.  Is that nameserver acting as a caching
(recursive) server (perhaps in addition to acting as an authoritative
server for some zones)?  If so, the most likely cause is the overhead
of the periodic cache cleaning, as you speculated.

> Could smaller cleanups be done every minute ?
> Is this a case when threading would help, or would it just mean
> extra CPUs would be working on garbage-collection ?
> Are there any build options which might help, e.g. I've seen the mention of
> ISC_MEM_USE_INTERNAL_MALLOC for performance, which is slightly different.

> I suspect the timeouts we're seeing are a different issue to
> 1740.	[bug]		Replace rbt's hash algorithm as it performed badly
> 			with certain zones. [RT #12729]
> or several others mentioning bugs fixed wrt RBTs.

Change #1740 may have some positive effect, since it is generally
effective when deleting entries from an RBT DB, but I agree that it
will not help much with this particular symptom.

ISC_MEM_USE_INTERNAL_MALLOC should help at least to some extent in
terms of both memory footprint and response performance.

If it's still not good enough for you, consider trying the following:

- decrease cleaning-interval.  This reduces the total amount of work
  in each cleaning session, so it may mitigate the problem somewhat.
  However, it may still not be a good solution, because named will
  still consume CPU during each cleaning session, and responses will
  still be delayed while it runs.  It may even result in worse
  performance, since the periodic cleaning occurs more frequently.
  So, whether it helps or not depends on the details of your
  environment (query pattern, TTLs of cached records, etc).

- decrease DNS_CACHE_CLEANERINCREMENT defined in lib/dns/cache.c (at
  line 48 for 9.3.1).  It's currently 1000, which means in the worst
  case the response to a query can be delayed until named examines
  1000 entries in the cache DB.  You can improve the response time
  during the cleaning session by decreasing this parameter.  Of
  course, it does not solve the high CPU usage during the session,
  which is an inevitable cost at least in the current implementation,
  and the cleaning period itself will be longer accordingly.
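
To make the trade-off in the second item concrete, here is a minimal
sketch in C.  It is not the code from lib/dns/cache.c; the names
cleaning_stats and simulate_cleaning are hypothetical, and it assumes
only what is described above: the cleaner examines at most a fixed
number of entries per step, and queries get a turn between steps.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of one simulated cleaning session (hypothetical type). */
struct cleaning_stats {
    size_t passes;          /* number of cleaning steps needed */
    size_t worst_case_wait; /* most entries examined before a query gets a turn */
};

/*
 * Simulate walking total_entries cache entries, examining at most
 * `increment` entries per step.  `increment` plays the role of
 * DNS_CACHE_CLEANERINCREMENT; this is an illustration, not named's code.
 */
struct cleaning_stats
simulate_cleaning(size_t total_entries, size_t increment)
{
    struct cleaning_stats stats = { 0, 0 };
    size_t remaining = total_entries;

    if (increment == 0)
        return stats; /* a zero increment would never make progress */

    while (remaining > 0) {
        size_t batch = remaining < increment ? remaining : increment;

        /* A query arriving at the start of a step waits for the whole batch. */
        if (batch > stats.worst_case_wait)
            stats.worst_case_wait = batch;

        remaining -= batch;
        stats.passes++;
        /* Assumption: queries are served here, between steps. */
    }

    assert(remaining == 0);
    return stats;
}
```

For a cache of 100000 entries, simulate_cleaning(100000, 1000) gives
100 passes with up to 1000 entries examined per pass, while
simulate_cleaning(100000, 100) gives 1000 passes with up to 100 per
pass.  The total work is the same; only the longest uninterrupted
stretch shrinks, which is why the session as a whole takes longer.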

Finally, threading may help mitigate this situation, but you need at
least two CPUs (which you seem to have) and an SMP-capable OS (which
you don't seem to have).  Also, as you suspected, it would simply mean
that the extra CPUs are used for cache cleaning (garbage collection),
so you'll still see high CPU usage even with threads.

					JINMEI, Tatuya
					Communication Platform Lab.
					Corporate R&D Center, Toshiba Corp.
					jinmei at isl.rdc.toshiba.co.jp