Remote hosts retrying downed DNS server

Marc.Thach at Marc.Thach at
Wed Oct 31 10:01:10 UTC 2001

As far as I can ascertain, in some scenarios a resolving BIND server will
always try a server which is down first.  The algorithm used in BIND 8 and
earlier looks suspect to me; I haven't looked at the BIND 9 code but I
suspect it's similar.  This, BTW, is just theory: although I do a lot of
work with non-existent servers (which are therefore always down), I have
never seen this happen, but then I haven't looked for it either.

When BIND first picks up a set of NS records, it seems to allocate a
random RTT to each, between 1 and 25-ish milliseconds.  If a server is down
when queried, the stored RTT is multiplied by 1.2 using integer
arithmetic.  With my compiler, this means that if the original random
RTT is 5 or less (a 20% chance), then that server's RTT will never be
incremented, usually ensuring that it always has the lowest RTT of the
servers for the domain.

I have seen mention of some subtleties in the way the servers are
selected: I understand that the calculated RTTs are then put into RTT
bands, so that in some cases different RTTs are considered equidistant,
but I haven't looked at that bit.
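
The arithmetic described above can be sketched in a few lines of Python.
This is not BIND's code: the two penalty functions, their names, and the
1..25 range are assumptions made purely for illustration.  One variant
uses pure integer math; the other multiplies by the floating-point
constant 1.2 (which is stored as slightly less than 1.2) and truncates
without any intermediate rounding, which may be what some compilers do.

```python
import math
from fractions import Fraction


def penalize_integer(rtt: int) -> int:
    # Hypothetical variant: pure integer math, rtt * 12 / 10, truncated.
    return rtt * 12 // 10


def penalize_wide_float(rtt: int) -> int:
    # Hypothetical variant: multiply by the double constant 1.2, keep the
    # product exact (as a wider floating-point register would for these
    # small values), then truncate toward zero.
    return math.trunc(rtt * Fraction(1.2))


def stuck_rtts(penalize, low=1, high=25):
    # Starting RTTs that the penalty leaves unchanged, so a server that is
    # down never looks any slower than it did at the start.
    return [rtt for rtt in range(low, high + 1) if penalize(rtt) == rtt]


print(stuck_rtts(penalize_integer))
print(stuck_rtts(penalize_wide_float))
```

Under these assumptions the integer variant leaves RTTs 1 through 4
unchanged and the floating-point variant leaves 1 through 5 unchanged,
so the exact threshold depends on how the multiplication is evaluated.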
Since this is only my evaluation of the situation, I'd be very interested
in comments from ISC or Nominum.
Marc TXK

From:    Chuck Sterling <csterlin at ziane>
To:      comp-protocols-dns-bind at
cc:
Sent by: ce at
Subject: Remote hosts retrying downed DNS server

When one of my three DNS servers at work goes down, there is a small
percentage of external hosts that retry that address over and over, to
the exclusion of the other two servers that are still up. Most external
hosts will try once and then move on to a working server, not retrying
the down one. All three are listed in the "whois" record and all three
handle reverse DNS lookups.

I'm interested in understanding what causes these retries. At least some
of them must be DNS servers, judging by the ns1... and ns2... style
domain names their addresses resolve to. And at least two are in what I
would have to think of as "trusted" domains. I could understand a client
workstation having only one domain name server in its search list, but
doubt that external hosts would be using our DNS machines rather than
their own, or that these external hosts are simple clients. Last time
this happened, we were having trouble sending and receiving e-mail to at
least two external sites, to the extent that some mail was not delivered
at all and others were delayed several hours (where we expect delays of
only a minute or so). This sorta pissed off a few users that missed
important calls...

I pulled the plug on one machine this morning and captured about half
an hour of snoop output at the victimlan side of our firewall to get a
small pile of addresses for hosts hitting our DNS machines, and some
were doing the retry bit, as expected.

Now that I have a list of suspects, is there a program, or some command
in dig, etc., that I can run to retrieve DNS server info, such as what
version of BIND or whatever substitutes for it, and so forth, from these
remote machines? Seems like I've read about this sort of thing but have
not had occasion to pay attention before... I'm wondering if a
particular DNS implementation is at fault. I would suspect a problem
on my end, except that most of the remote hosts do not retry
excessively; I'm guessing they are working right and the others
are munged.

Chuck Sterling
csterlin at
