BIND Load Problem

Sat Apr 14 15:57:55 UTC 2007

Hi Folks,
We're running into some issues with load on an instance of BIND 9.4.0
running on FreeBSD 6.2-p1. Some background:

The machine is a dual Xeon 3.06 GHz with 4 GB of RAM. Drives are 10K RPM
SCSI disks with a mirror for the OS and another mirror dedicated to
/etc/namedb for speed. BIND was compiled from source with no configure
options other than --sysconfdir=/etc/namedb. BIND is logging via
syslog. Query logging is not enabled. Recursion is disabled. The box
is not providing any other services. All sysctls for the OS are
default.

The instance is the master for approximately 300 zones which range
from very small (<10KB) to very large (70-80MB). All of these zones
have dynamic updates with NOTIFY enabled (6 slaves). We are using
incremental zone transfers.

According to top, named is using these resources:
  PID USERNAME   THR PRI NICE   SIZE    RES    STATE  C   TIME   WCPU COMMAND
 1982 named        1  97    0  1108M  1108M    select 0  84:44  8.36% named

A quick iostat shows a fairly heavy amount of disk I/O on the namedb mirror:
      tty             da4             cpu
 tin tout  KB/t tps  MB/s  us ni sy in id
   0   29 17.68   2  0.03   0  0  0  0 100
   0 6797 16.00  93  1.45   7  0  1  0 91
   0 7220 16.00  70  1.09   8  0  1  0 91
   0 6337 16.00  70  1.10   6  0  2  0 92
   0 6898 16.00  97  1.52   7  0  1  1 91
   0 6492 16.00  56  0.88   7  0  2  1 90
   0 6454 16.00  81  1.26   6  0  1  1 92
   0 6260 16.00 100  1.56   6  0  1  1 92
   0 6992 18.02  81  1.43   8  0  1  1 91
   0 5585 15.93  84  1.30   6  0  1  1 93
   0 7211 15.75  77  1.19   7  0  1  1 91
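
For a rough sense of scale, the da4 samples above can be averaged with
a short Python sketch. The numbers are copied from the iostat output;
the first row is left out on the assumption that, as usual for iostat,
it is the average since boot rather than a live sample. The names
SAMPLES and summarize are only illustrative.

```python
# (KB/t, tps, MB/s) for da4, copied from the iostat rows above.
# The first row is skipped: iostat's first report is averaged over uptime.
SAMPLES = [
    (16.00, 93, 1.45),
    (16.00, 70, 1.09),
    (16.00, 70, 1.10),
    (16.00, 97, 1.52),
    (16.00, 56, 0.88),
    (16.00, 81, 1.26),
    (16.00, 100, 1.56),
    (18.02, 81, 1.43),
    (15.93, 84, 1.30),
    (15.75, 77, 1.19),
]


def summarize(samples):
    """Return (mean tps, mean MB/s, largest KB/t*tps vs MB/s mismatch)."""
    n = len(samples)
    mean_tps = sum(tps for _, tps, _ in samples) / n
    mean_mbs = sum(mbs for _, _, mbs in samples) / n
    # Sanity check: KB per transfer * transfers per second / 1024
    # should reproduce the reported MB/s column, up to rounding.
    max_err = max(abs(kbt * tps / 1024 - mbs) for kbt, tps, mbs in samples)
    return mean_tps, mean_mbs, max_err


if __name__ == "__main__":
    mean_tps, mean_mbs, max_err = summarize(SAMPLES)
    print(f"mean tps:  {mean_tps:.1f}")
    print(f"mean MB/s: {mean_mbs:.2f}")
    print(f"largest column mismatch: {max_err:.3f} MB/s")
```

Over these ten samples that works out to about 81 transfers per second
and about 1.3 MB/s, in transfers of roughly 16 KB each.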

We've checked all of our zone files for consistency using
named-checkzone; only a handful of minor errors have been reported (all
bad owner name). We're monitoring both dynamic update completion time
and DNS query response time using smokeping from a box on the same
local network.

Recently, we've seen update completion time rise to 1-3 seconds
per update. DNS query response times have risen roughly 20 times, from
< 1 ms to > 20 ms. Shutting down the process generating the dynamic
updates causes named's CPU utilization to rise to approx 20-30%, but
query response time does not drop much, possibly 1-2 ms at most. Removing
query load from the box does not seem to help speed up dynamic
updates.

In comparison, we have another master with O(100,000) zones loaded
ranging from <5KB to 200KB in size, processing 3-5 updates/second,
answering 500 queries per second. CPU utilization is around 20% and
disk IO is minimal.

If anyone has any thoughts on what might be going on to cause our
first master to have so much IO activity, causing BIND to slow down,
it would be appreciated. If you need more background, please let me
know as well.

Thanks,
Tom Daly

-- 
Tom Daly
tomdaly at gmail.com