DNS resolver problems when one nameserver is down

James Pearson j.pearson at ge.ucl.ac.uk
Wed Oct 1 11:45:43 UTC 2003


Kevin Darcy <kcd at daimlerchrysler.com> wrote in message news:<blcont$2h3e$1 at sf1.isc.org>...
> James Pearson wrote:
> 
> >I've recently had a major problem when one of my internal DNS servers
> >went down and I'm trying to work out a way of improving the situation.
> >
> >I have a network of mainly RedHat 7.2 based machines that each have
> >a /etc/resolv.conf like:
> >
> >domain my.domain
> >nameserver 1.2.3.4
> >nameserver 1.2.3.5
> >options rotate
> >
> >The 2nd listed nameserver above crashed and _all_ my linux clients had
> >problems resolving hostnames - which has a massive knock-on effect,
> >grinding everything to a halt.
> >
> >I'm now trying to get a better understanding of how the resolver works
> >and how I can improve matters if this happens again.
> >
> >According to the resolv.conf man page, the 'options rotate' should
> >spread the load amongst the nameservers - but in my subsequent tests,
> >this doesn't happen - all it does is force the resolver to use the 2nd
> >nameserver first for _every_ lookup - so when the 2nd nameserver
> >crashed, every lookup times out after 5 seconds before using the 1st
> >nameserver. It appears that if I hadn't used the rotate option, I
> >would have been OK when the 2nd nameserver went down (but not if the
> >1st did!).
> >
> >Should the rotate option work with RH7.2 (glibc 2.2.4)?
> >
> >I can improve matters if I reduce the timeout to 1 second, but it
> >appears the resolver code is not intelligent enough to realize that it
> >keeps timing out on the same nameserver with subsequent lookups.
> >
> >I guess I could use something like nscd - but that again still uses
> >the same nameserver for subsequent lookups of hostnames that are not
> >cached.
> >
> >Is there something analogous to the NIS 'ypbind' for DNS lookups? i.e.
> >something like nscd that instead of caching hostnames, caches the good
> >nameserver to use?
> >
> >Sorry if this is in a FAQ somewhere, but as it has always appeared to
> >work OK, I've never really had to think about this before ...
> >
> If the unavailability of your *second*-listed nameserver caused 
> problems, I think it's reasonable to assume that "rotate" is working on 
> your platform -- without "rotate", the second-listed nameserver would 
> only be consulted if the first-listed nameserver wasn't answering queries.

But the rotate option _always_ (as far as I can tell) uses the second
nameserver first. The man page states that this option "causes round
robin selection of nameservers from among those listed". If I give
'options rotate', the first nameserver is never tried first, which, as
I read it, is not what the man page says.
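
To make concrete what I expected from the man page wording, here is a
minimal Python sketch of round-robin selection. It is not the glibc
code; the function name round_robin_orders is made up for illustration,
and the addresses are the placeholder ones from my resolv.conf.

```python
def round_robin_orders(servers, lookups):
    """Return the order in which servers would be tried for each lookup
    if the starting server advanced by one on every lookup."""
    orders = []
    for i in range(lookups):
        start = i % len(servers)
        # Try the servers starting at `start`, wrapping around the list.
        orders.append(servers[start:] + servers[:start])
    return orders


servers = ["1.2.3.4", "1.2.3.5"]
for order in round_robin_orders(servers, 4):
    print(order)
```

With two servers, the server tried first alternates from one lookup to
the next, so under this reading a dead server would delay only about
half of the lookups. What I see instead is the second ordering on every
lookup.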
 
> Sounds like your root problem is that you don't have enough nameserver 
> resources to handle a single nameserver failure, given the way your 
> clients are configured. Possible solutions:
> 
> 1) Add another nameserver
> 2) Beef up your existing nameservers
> 3) Reconfigure your clients (you implied that your clients were 
> Unix/Linux) with their own nameserver instances in order to reap the 
> benefits of local caching.

If I don't use 'options rotate', then lookups will use the first
listed nameserver, if that fails, it uses the second listed, if that
fails, the third...

The problem is that if the first nameserver goes away, then _all_
lookups will time out after 5 seconds before trying the second
nameserver. It doesn't matter how many other nameservers you have.

This five second timeout is what crippled us.
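
A rough way to see why extra nameservers don't help is to add up the
time spent waiting before a live server is reached. The Python sketch
below is a simplified model, not the resolver's actual logic: it
assumes a dead server costs one full timeout, a live server answers
instantly, and retries are ignored. The name lookup_delay and the third
address are made up for illustration.

```python
def lookup_delay(order, dead, timeout=5.0):
    """Seconds spent waiting on timeouts before a live server is reached.

    Simplified model: each dead server costs one full timeout, a live
    server answers instantly, and retries are ignored.
    """
    delay = 0.0
    for server in order:
        if server not in dead:
            return delay
        delay += timeout
    return None  # every listed server is down


order = ["1.2.3.4", "1.2.3.5", "1.2.3.6"]  # 1.2.3.6 is a made-up third server
print(lookup_delay(order, dead={"1.2.3.4"}))               # 5.0
print(lookup_delay(order, dead={"1.2.3.5"}))               # 0.0
print(lookup_delay(order, dead={"1.2.3.4"}, timeout=1.0))  # 1.0
```

In this model the delay depends only on where the dead server sits in
the list and on the timeout, not on how many servers follow it.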

I have thought of using a local caching nameserver, but I don't know
if that will suffer from the same timeout problem. OK, it will have
known hosts cached, but will the local nameserver hit the same
timeout for unknown hosts if the first upstream nameserver it uses
has gone away?

My current workaround is to hack the libresolv source to re-order the
list of nameservers to make sure the last successfully used nameserver
is used first on subsequent lookups. This appears to work OK.
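
The idea, sketched in Python rather than as the actual libresolv patch
(which is C and not shown here), looks roughly like this. The class
name LastGoodFirst, the query callback, and the fake_query stand-in
are all made up for illustration; a real version would send a DNS
query with a timeout where fake_query is called.

```python
class LastGoodFirst:
    """Keep a nameserver list ordered so the last server that answered is tried first."""

    def __init__(self, servers):
        self.servers = list(servers)

    def lookup(self, name, query):
        """Try each server in turn; `query(server, name)` must return an
        answer or raise TimeoutError. Raises TimeoutError if none answer."""
        for i, server in enumerate(self.servers):
            try:
                answer = query(server, name)
            except TimeoutError:
                continue  # this server did not answer, try the next one
            # Move the server that answered to the front for next time.
            self.servers.insert(0, self.servers.pop(i))
            return answer
        raise TimeoutError("no nameserver answered for " + name)


def fake_query(server, name):
    # Stand-in for a real DNS query: pretend 1.2.3.4 is down.
    if server == "1.2.3.4":
        raise TimeoutError(server)
    return "10.0.0.1"  # made-up answer


resolver = LastGoodFirst(["1.2.3.4", "1.2.3.5"])
resolver.lookup("host.my.domain", fake_query)
print(resolver.servers)  # ['1.2.3.5', '1.2.3.4']
```

Only the first lookup pays for the dead server; later lookups start
with the one that last answered. One thing this sketch does not do is
move a recovered server back to the front: the order only changes
again when the current first server stops answering.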

If I use this modified libresolv with nscd, then it appears I can
quite happily resolve hostnames etc. without problems - nscd gets an
initial 5 second time out on its first lookup, but all subsequent
lookups use the next working nameserver first.

However, I'm not sure if doing this is likely to cause other
problems...

James Pearson

