Robustness Issues

Martin McCormick martin at
Fri Mar 30 16:41:36 UTC 2001

	I recently posted a message describing how we lost our
bind-9.1.1rc3 primary after intermittent ISP outages cut us
off from all root name servers.

	Someone suggested that I upgrade to bind-9.1.1rc5, which
had already been out for a while, and we did.  I am happy to
report that during our most recent ISP glitch, our master stumbled
a bit but seemed to recover fully once the ISP did.  Our
greatest thanks to all who are working on bind and who have
made it what it is today.

	What it did was time out sporadically on local
lookups, then resume normal response times once the network
started working properly again.

	I just finished a Korn shell script, and it was what
actually let me know something had been wrong.  It randomly
picks one of the A records and uses dig to do a forward
lookup on it.  The query time is logged.  At the moment the
number itself doesn't mean much, but an empty value means the
lookup failed, and that means a lot.  Here is what I did in case
anybody else is interested.  The script does not care whether
the process is alive, since it can be alive but not working.  It just
tries the forward lookup and is happy if it gets a response that
includes the query time.

	The script counts the A records in your current zone
data, so it needs to run somewhere that can read the zone file.
It works fine on Suns with the Korn shell, but I need
to beat on it a little before it will run under bash on a Linux
system.  Also, be reasonable about what you do with the script.
It does 5 lookups, 4 seconds apart, and can be run under
cron, but one would not want to run so many tests that it might
actually contribute to the problem.

	Here is the script.  Modify to suit your site.

#! /bin/ksh
export PATH
x=5  #number of lookups per run
while [ $x -gt 0 ]; do #loop top
#How many A records are there?
count=`grep "	A	" /var/named/db.hosts | wc -l`
#Randomly pick one to look up.
testhost=$( grep "	A	" /var/named/db.hosts \
| tail +` echo $count | nawk '
#Pick a line number from 1 to count.
{srand(); chosen = int(rand() * $1) + 1;  print chosen}'` \
 | head -1 | nawk '{print $1""}')
#Save number of milliseconds it took to get answer.
qtime=`dig @your.dns.IP.address $testhost | grep ";; Query time: " \
| nawk '{print $4}'`
#Save the log but remember to trim it frequently.
echo  "$qtime `date +%h%d%H%M%S `" >>/var/log/dnsaccess
#If the variable is empty, then it isn't looking up anything.
if test -z "$qtime"; then
#Make lots of noise.  This is not good.
echo "ns failed a lookup at `date`"
#Users will get the time in the message.
echo "ns failed a lookup just now." |wall
fi
sleep 4
x=$((x - 1))
done #loop bottom
exit 0
#Run from cron, this sends root the cron output and
#also blasts the failure message to everyone logged in, so
#use with utmost care.  It can get very obnoxious
#if lookups start to fail.  Normally, it is very quiet.
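
	For anyone who wants to try this under bash on Linux, the
random pick is the part most likely to need changes, since
"tail +N" and nawk are not available everywhere.  What follows
is only a minimal sketch of that one step, not a tested
replacement.  The function name pick_host is made up for
illustration, and it assumes an awk that accepts \t in a
pattern and that each A record line starts with the host name,
with tabs around the "A" as in the grep pattern above.  It reads
the zone file once and lets awk do both the counting and the
choosing.

```shell
#!/bin/sh
# pick_host (hypothetical helper): print the owner name of one
# randomly chosen A record from the zone file named in $1.
# Assumes tab-separated records whose first field is the host name.
pick_host() {
    awk '
        BEGIN { srand() }               # seed from the time of day
        /\tA\t/ { hosts[++n] = $1 }     # remember each A record owner name
        END {
            if (n == 0) exit 1          # no A records found
            print hosts[int(rand() * n) + 1]
        }' "$1"
}
```

	In the script it would stand in for the count= and
testhost= lines, along the lines of
testhost=$(pick_host /var/named/db.hosts) || exit 1.  On a Sun,
use nawk in place of awk inside the function.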

Martin McCormick 405 744-7572   Stillwater, OK
OSU Center for Computing and Information services Data Communications Group
