Mysterious load increase, BIND 8.2.2-P5

Fri Mar 24 14:17:38 UTC 2000

I'm investigating a bizarre load problem that we had yesterday with two
of our name servers.  Both of them are running BIND 8.2.2, patchlevel
5.  The machines are a Sun Sparc 20 running Solaris 2.6 and an IBM
RS/6000 370 running AIX 4.3.3.  Inexplicably, and without warning, the
CPU usage of the named process on both machines jumped to over 95% of
the machine's capacity.  Because it was swamping the machines so badly,
we sent named a TERM signal (kill -TERM) and restarted it.  On the Sun
(which is our primary DNS server), the process would not shut down, so
I had to force it to quit with kill -KILL.  As a result, I could not
get the NSTATS/XSTATS logs on that machine that could help diagnose the
problem.  On our secondary name server (the RS/6000), however, I was
able to pull out these:

Mar 23 13:15:09 juno named[24810]: USAGE 953835309 953806548
CPU=868.71u/608.16s CHILDCPU=0u/0s
Mar 23 13:15:09 juno named[24810]: NSTATS 953835309 953806548 0=589
A=220248 NS=12 MD=1 CNAME=120 SOA=223 MG=98 PTR=169690 HINFO=3
MX=90035 AAAA=117 SRV=186 AXFR=9 MAILB=1 ANY=9091
Mar 23 13:15:09 juno named[24810]: XSTATS 953835309 953806548 RR=92791
RNXD=10510 RFwdR=52440 RDupR=363 RFail=6532 RFErr=0 RErr=371 RAXFR=9
RLame=6839 ROpts=0 SSysQ=21386 SAns=412916 SFwdQ=65313 SDupQ=100696
SErr=0 RQ=490480 RIQ=191 RFwdQ=0 RDupQ=14675
RTCP=42 SFwdR=52440 SFail=74 SFErr=0 SNaAns=230104 SNXD=84839

Mar 23 13:44:27 juno named[24810]: USAGE 953837067 953806548
CPU=920.52u/646.61s CHILDCPU=0u/0s
Mar 23 13:44:27 juno named[24810]: NSTATS 953837067 953806548 0=631
A=232939 NS=12 MD=1 CNAME=133 SOA=233 MG=102 PTR=184943 HINFO=3
MX=91908 AAAA=125 SRV=194 AXFR=9 MAILB=1 ANY=9670
Mar 23 13:44:27 juno named[24810]: XSTATS 953837067 953806548 RR=98667
RNXD=11429 RFwdR=55911 RDupR=379 RFail=7090 RFErr=0 RErr=448 RAXFR=9
RLame=7353 ROpts=0 SSysQ=22601 SAns=437933 SFwdQ=69818 SDupQ=107761
SErr=0 RQ=520976 RIQ=191 RFwdQ=0 RDupQ=15738
RTCP=44 SFwdR=55911 SFail=75 SFErr=0 SNaAns=240409 SNXD=90767

The first set is from the normal hourly stats dump; the second is from
when we shut down the server.  I don't see anything here that can
explain the sudden jump in CPU load.
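
One sanity check on these dumps is to difference them.  The sketch
below is a minimal Python example, assuming that the first number after
USAGE is the dump time and the second is the server start time (both in
epoch seconds), and that the CPU and RQ counters are cumulative since
start.  The helper names are made up for illustration, and the dump
text is abbreviated to the fields actually used.

```python
import re

# Abbreviated copies of the two dumps above: only the USAGE line and the
# RQ counter from the matching XSTATS line are kept.
FIRST = """USAGE 953835309 953806548 CPU=868.71u/608.16s
XSTATS 953835309 953806548 RQ=490480"""
SECOND = """USAGE 953837067 953806548 CPU=920.52u/646.61s
XSTATS 953837067 953806548 RQ=520976"""

USAGE_RE = re.compile(r"USAGE (\d+) (\d+)\s+CPU=([\d.]+)u/([\d.]+)s")
RQ_RE = re.compile(r"\bRQ=(\d+)")


def parse_dump(text):
    """Pull the dump time, total CPU seconds and RQ counter out of a dump."""
    usage = USAGE_RE.search(text)
    rq = RQ_RE.search(text)
    if usage is None or rq is None:
        raise ValueError("dump is missing a USAGE line or an RQ counter")
    return {
        "time": int(usage.group(1)),
        # user + system CPU seconds (assumed cumulative since start)
        "cpu": float(usage.group(3)) + float(usage.group(4)),
        "rq": int(rq.group(1)),
    }


def interval_summary(earlier, later):
    """Average CPU fraction and query rate between two dumps."""
    a, b = parse_dump(earlier), parse_dump(later)
    elapsed = b["time"] - a["time"]
    return {
        "elapsed_s": elapsed,
        "cpu_fraction": (b["cpu"] - a["cpu"]) / elapsed,
        "queries_per_s": (b["rq"] - a["rq"]) / elapsed,
    }


summary = interval_summary(FIRST, SECOND)
print(summary)
```

For these two dumps that works out to about 5% of one CPU and about 17
queries per second, averaged over the 1758 seconds between them.  An
average like this would smooth over any short spike, so it cannot rule
one out.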

The only somewhat plausible explanation that I have is that we were
having problems with another Internet site that had set the TTL values
on its A records to zero.  They called and complained to us about our
name servers beating up theirs for answers.  After we explained to them
what a TTL does, they corrected the problem.  Magically, we have not
had another load spike since (we experienced four separate sets of them
over the course of the day, making our name servers almost unusable).
The O'Reilly DNS and BIND book has some mention of the behavior with
TTLs of 0 for BIND 4, but none for BIND 8.  Could that be the culprit?
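
To illustrate why a zero TTL could generate that kind of traffic toward
the remote site, here is a toy Python model of a caching resolver.  It
is only a sketch, not BIND's actual cache code: the class, the function
and the example name and address are all hypothetical, and it assumes a
record with a TTL of 0 is never stored in the cache.

```python
class ToyCache:
    """Toy resolver cache: answers are reused until their TTL expires."""

    def __init__(self):
        self._store = {}            # name -> (answer, expiry time)
        self.upstream_queries = 0   # queries sent to the remote server

    def lookup(self, name, now, ttl):
        entry = self._store.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]         # still fresh: answered from cache
        # Cache miss or expired entry: ask the remote server again.
        self.upstream_queries += 1
        answer = "192.0.2.1"        # placeholder address for illustration
        if ttl > 0:
            # Assumption: a TTL of 0 means the answer is never cached.
            self._store[name] = (answer, now + ttl)
        return answer


def count_upstream(ttl, lookups=100):
    """Send one client lookup per second and count the upstream queries."""
    cache = ToyCache()
    for second in range(lookups):
        cache.lookup("www.example.com", now=second, ttl=ttl)
    return cache.upstream_queries


print(count_upstream(ttl=0))      # every lookup goes upstream
print(count_upstream(ttl=60))     # one upstream query per minute
print(count_upstream(ttl=3600))   # a single upstream query
```

In this model, 100 client lookups cost 100 upstream queries with a TTL
of 0, but only 2 with a TTL of 60 and 1 with a TTL of 3600.  It says
nothing about how much CPU named itself spends on each lookup.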

Thanks,

MM, ITC/Network Systems
University of Virginia