DNS resolver problems when one nameserver is down

Tue Oct 7 02:21:18 UTC 2003

James Pearson wrote:

>Kevin Darcy <kcd at daimlerchrysler.com> wrote in message news:<blcont$2h3e$1 at sf1.isc.org>...
>
>>James Pearson wrote:
>>
>>>I've recently had a major problem when one of my internal DNS servers
>>>went down and I'm trying to work out a way of improving the situation.
>>>
>>>I have a network of mainly RedHat 7.2 based machines that each have
>>>a /etc/resolv.conf like:
>>>
>>>domain my.domain
>>>nameserver 1.2.3.4
>>>nameserver 1.2.3.5
>>>options rotate
>>>
>>>The 2nd listed nameserver above crashed and _all_ my Linux clients had
>>>problems resolving hostnames - which has a massive knock-on effect,
>>>grinding everything to a halt.
>>>
>>>I'm now trying to get a better understanding of how the resolver works
>>>and how I can improve matters if this happens again.
>>>
>>>According to the resolv.conf man page, the 'options rotate' should
>>>spread the load amongst the nameservers - but in my subsequent tests,
>>>this doesn't happen - all it does is force the resolver to use the 2nd
>>>nameserver first for _every_ lookup - so when the 2nd nameserver
>>>crashed, every lookup times out after 5 seconds before using the 1st
>>>nameserver. It appears that if I hadn't used the rotate option, I
>>>would have been OK when the 2nd nameserver went down (but not if the
>>>1st did!).
>>>
>>>Should the rotate option work with RH7.2 (glibc 2.2.4)?
>>>
>>>I can improve matters if I reduce the timeout to 1 second, but it
>>>appears the resolver code is not intelligent enough to realize that it
>>>keeps timing out on the same nameserver with subsequent lookups.
>>>
>>>I guess I could use something like nscd - but that again still uses
>>>the same nameserver for subsequent lookups of hostnames that are not
>>>cached.
>>>
>>>Is there something analogous to the NIS 'ypbind' for DNS lookups? i.e.
>>>something like nscd that instead of caching hostnames, caches the good
>>>nameserver to use?
>>>
>>>Sorry if this is in a FAQ somewhere, but as it has always appeared to
>>>work OK, I've never really had to think about this before ...
>>>
>>If the unavailability of your *second*-listed nameserver caused 
>>problems, I think it's reasonable to assume that "rotate" is working on 
>>your platform -- without "rotate", the second-listed nameserver would 
>>only be consulted if the first-listed nameserver wasn't answering queries.
>
>But the rotate option _always_ (as far as I can tell) uses the second
>nameserver. From what the man page states, this option "causes round
>robin selection of nameservers from among those listed". If I give
>'options rotate' the first nameserver is never used, which, as I
>read it, is not what the man page says.
>
Right, and it doesn't make any sense either. Why would "rotate" just 
alter the nameserver list in some deterministic way? That would be no 
different than just having the list in a different order. Sounds like 
you have a bad implementation of a resolver library, or you're 
misinterpreting the "rotate" results (for the latter, have you tried 
comparing query logs between the two nameservers?)
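
One hypothesis that would reconcile both readings, sketched below as a toy model: suppose the rotation state lives inside each process and the list is rotated once before every query. That is an assumption, not something verified against the glibc 2.2.4 source, and `ToyResolver` is a hypothetical class rather than any real resolver API. Under that assumption, many short-lived processes doing one lookup each would all land on the second-listed server, while a single long-lived process would alternate.

```python
from collections import Counter

# The two nameservers from the resolv.conf quoted above.
SERVERS = ["1.2.3.4", "1.2.3.5"]


class ToyResolver:
    """Hypothetical per-process resolver state; not glibc's actual code."""

    def __init__(self, servers):
        self.servers = list(servers)

    def lookup(self):
        # Assumed behaviour: rotate the list by one *before* each query,
        # then send the query to whichever server is now first.
        self.servers.append(self.servers.pop(0))
        return self.servers[0]


def short_lived_processes(n):
    """n separate processes, each with fresh state, doing one lookup each."""
    return Counter(ToyResolver(SERVERS).lookup() for _ in range(n))


def long_lived_process(n):
    """One process doing n lookups against the same resolver state."""
    resolver = ToyResolver(SERVERS)
    return Counter(resolver.lookup() for _ in range(n))


if __name__ == "__main__":
    print("short-lived:", dict(short_lived_processes(10)))
    print("long-lived: ", dict(long_lived_process(10)))
```

If the model holds, the query logs on the two nameservers would show almost all traffic from command-line tools going to 1.2.3.5, and a more even split from long-running daemons.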

>
>>Sounds like your root problem is that you don't have enough nameserver 
>>resources to handle a single nameserver failure, given the way your 
>>clients are configured. Possible solutions:
>>
>>1) Add another nameserver
>>2) Beef up your existing nameservers
>>3) Reconfigure your clients (you implied that your clients were 
>>Unix/Linux) with their own nameserver instances in order to reap the 
>>benefits of local caching.
>
>If I don't use 'options rotate', then lookups will use the first
>listed nameserver; if that fails, the second listed; if that fails,
>the third...
>
>The problem is that if the first nameserver goes away, then _all_
>lookups will time out after 5 seconds before trying the second
>nameserver. It doesn't matter how many other nameservers you have.
>
>This five second timeout is what crippled us.
>
>I have thought of using a local caching nameserver - but I don't know
>if that will suffer from the same time out problem - OK it will have
>known hosts cached - but will the local nameserver have the same
>timeout problem for unknown hosts if the first used upstream
>nameserver has gone away?
>
I'm assuming by your use of the word "upstream" that you would configure 
your boxes as forwarders, presumably because your "main" nameservers are 
the only ones which are allowed to query the Internet for Internet 
names. If that assumption is correct, then your question is about 
forwarder failover, and the answer is: it depends on the version of BIND 
you're running. In some versions, forwarders are selected according to 
accumulated RTT (round-trip time) statistics, similar to the way caching 
resolvers choose among NS records. In other versions of BIND, the 
forwarders are always tried sequentially, which can result in resolution 
delays if the 1st through nth forwarders are unavailable.
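
The difference between the two selection strategies can be illustrated with a small simulation. This is a simplified sketch, not BIND's actual algorithm: the function names, the scoring rule, and the 0.05 second round-trip time are all assumptions made for illustration, and the 5 second timeout is the figure quoted in this thread.

```python
TIMEOUT = 5.0  # seconds lost waiting on a dead server (figure from this thread)


def sequential(servers, rtt, n_queries):
    """Always walk the list in order; return total seconds spent."""
    total = 0.0
    for _ in range(n_queries):
        for server in servers:
            if rtt[server] is None:      # dead server: wait out the timeout
                total += TIMEOUT
            else:                        # live server: pay its RTT and stop
                total += rtt[server]
                break
    return total


def rtt_ranked(servers, rtt, n_queries):
    """Try servers in order of recorded RTT; a timeout counts as a large RTT."""
    score = {server: 0.0 for server in servers}   # no history yet
    total = 0.0
    for _ in range(n_queries):
        for server in sorted(servers, key=score.get):
            if rtt[server] is None:
                total += TIMEOUT
                score[server] = TIMEOUT  # penalise the server that timed out
            else:
                total += rtt[server]
                score[server] = rtt[server]
                break
    return total


if __name__ == "__main__":
    servers = ["1.2.3.4", "1.2.3.5"]
    rtt = {"1.2.3.4": None, "1.2.3.5": 0.05}   # first forwarder is down
    print("sequential:", sequential(servers, rtt, 100), "seconds")
    print("rtt-ranked:", rtt_ranked(servers, rtt, 100), "seconds")
```

In this model the sequential strategy pays the timeout on every query, while the ranked strategy pays it once. A real implementation would also need to age the penalty so that a recovered server is eventually retried; the sketch leaves that out.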

If, on the other hand, your individual client boxes have the ability to 
query Internet names directly, then one option to consider is to 
configure them all as *pure* caching resolvers (with nothing but a hints 
zone and a master zone for the loopback address). NS failover is 
*always* based on RTT, so recovery is fairly rapid. However, depending 
on your routing/security paradigm, this configuration may not be an 
option for you, and, even if it is, it could arguably be viewed as 
anti-social, inasmuch as you'll have a whole bunch of boxes 
independently deluging Internet nameservers with probably the same 
queries over and over.

>My current workaround is to hack the libresolv source to re-order the
>list of nameservers to make sure the last successfully used nameserver
>is used first on subsequent lookups. This appears to work OK.
>
Hacked libresolv? I'm glad it works for you, but I'd consider that a 
last-resort workaround...
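
For readers who want to see the idea being discussed, here is a minimal sketch of "last successful nameserver first" ordering. It is not the actual libresolv patch, which is not shown in this thread; `NameserverList` and the `send` callable are hypothetical names introduced for illustration.

```python
class NameserverList:
    """Keeps the most recently successful nameserver at the front."""

    def __init__(self, servers):
        self.servers = list(servers)

    def resolve(self, query, send):
        """Try servers in order and promote the first one that answers.

        `send(server, query)` is a hypothetical callable that returns an
        answer or raises TimeoutError when the server does not respond.
        """
        for server in list(self.servers):   # iterate over a copy; we reorder below
            try:
                answer = send(server, query)
            except TimeoutError:
                continue                     # dead server: move on to the next
            # Promote the server that worked so the next lookup tries it first.
            self.servers.remove(server)
            self.servers.insert(0, server)
            return answer
        raise TimeoutError("no nameserver answered")


if __name__ == "__main__":
    dead = {"1.2.3.4"}

    def fake_send(server, query):
        # Stand-in for a real DNS query; only simulates success or timeout.
        if server in dead:
            raise TimeoutError(server)
        return f"{query} answered by {server}"

    ns = NameserverList(["1.2.3.4", "1.2.3.5"])
    print(ns.resolve("host.my.domain", fake_send))
    print(ns.servers)
```

The promoted order only lasts as long as the object holding it, so the benefit depends on lookups going through a single long-lived process rather than many short-lived ones.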

>
>If I use this modified libresolv with nscd, then it appears I can
>quite happily resolve hostnames etc. without problems - nscd gets an
>initial 5 second time out on its first lookup, but all subsequent
>lookups use the next working nameserver first.
>
>However, I'm not sure if doing this is likely to cause other
>problems...
>
Frankly, I don't know either. But it wouldn't surprise me...

                        - Kevin