Timeouts and retries on high speed LANs

Tue Sep 14 14:23:03 UTC 2010

So the cache servers are HA behind something (F5 LTM, Cisco LocalDirector,
something else). Are the authoritative servers HA as well? It would seem
sensible to do the same with them. That way a timeout only occurs if the
whole HA cluster is unavailable.
You can alleviate even that situation by seeding the cache servers every
(TTL - some margin) minutes, so that the records are re-fetched before they
expire, or by slaving the domain on the cache servers.
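
As a rough illustration of the (TTL - margin) arithmetic, here is a minimal
Python sketch. The function names and the example numbers are made up for
illustration, and it works in seconds because that is how TTLs are expressed
in zone data. It only computes when to re-query; whatever actually sends the
queries to the caches is left out.

```python
def seed_interval(ttl_seconds, margin_seconds):
    """Return how often (in seconds) to re-query a record so the cached
    copy is refreshed shortly before its TTL runs out."""
    if margin_seconds <= 0:
        raise ValueError("margin must be positive")
    if ttl_seconds <= margin_seconds:
        raise ValueError("TTL must be larger than the margin")
    return ttl_seconds - margin_seconds


def seed_times(ttl_seconds, margin_seconds, horizon_seconds):
    """List the offsets (seconds from now) at which to re-query,
    up to and including horizon_seconds."""
    step = seed_interval(ttl_seconds, margin_seconds)
    return list(range(step, horizon_seconds + 1, step))
```

For a one hour TTL and a five minute margin, seed_interval(3600, 300) gives
3300, i.e. a re-query every 55 minutes.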

On 14/09/10 11:34 AM, "Howard Wilkinson" <howard at cohtech.com> wrote:

> I have been working on building out a couple of large data centres and
> have been struggling with how to set up the systems so that we get a high
> resilience, highly responsive DNS service in the presence of failing
> equipment.
> 
> The configuration we have adopted includes a layer of BIND 9.6.x servers
> that act as pure name server caches. We have six of these servers in each
> data centre paired to provide service on VIPs so that if one of the pair
> fails the other cache takes over.
> 
> Our resolv.conf is of the following form.
> 
> search xxx.com yyy.com
> nameserver 10.1.1.1
> nameserver 10.1.2.1
> nameserver 10.1.3.1
> options timeout:1 attempts:15 no-check-names rotate
> 
> The name servers are thus on different networks within the DCs.
> 
> Our first problem arises because the timeouts seem to be taken serially on
> each server, rather than the rotate applying between each name server
> request. Is this what I should have expected, i.e. a 15 second timeout
> before the next server is tried in sequence?
> 
> The second problem we face is that even if we could get a one second
> timeout, this is orders of magnitude too slow for names that should be
> resolved within our local name space. In other words, for lookups within
> the xxx.com and yyy.com domains I would like to see timeouts in the
> microsecond range.
> 
> Thinking further about this problem, I have been considering whether the
> resolver should be multi-threaded or parallelised in some way so that it
> tries all of the servers at once and accepts the first to respond. I have
> come to the conclusion that this would be too difficult to make resilient
> in the general use of the resolver code, but would make sense if the
> lwresd layer is added to the equation.
> 
> Which brings me on to the use of lwresd. This would reduce the incidence
> of problems with non-responsive servers in that it would detect and switch
> to an alternative server on the first failed attempt. However, this still
> means that if lwresd has not yet detected the down server then we get a
> stall in response within the data centre.
> 
> So my questions are:
> 
> 1. Does anybody have any experience in building such systems and
> suggestions on how we should tune the clients and servers to make the
> system less fragile in the presence of hardware, software and network
> failures?
> 
> 2. Is it possible with lwresd as it is written today to get the effect of
> precognition, i.e. can I get lwresd to notice that a server has gone down
> or has come back up without it needing to be triggered by a resolver
> request?
> 
> 3. Does anybody know if I can configure lwresd to expect particular zones
> to be resolved within very small windows and use this to fail over to the
> next server?
> 
> And for discussion, I wonder if there would be room to add to the resolver
> code and/or lwresd additional options of the form:
> 
> options zone-timeout: xxx.com:1usec
> 
> or something similar, whereby the resolver could be told that if the cache
> does not respond within this time about that particular zone then it can
> be assumed that the server is misbehaving.
> 
> Thank you for your attention.
> 
> Regards, Howard.

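For what it is worth, the "try all of the servers at once and accept the
first to respond" idea from the quoted message can be sketched in a few
lines of Python. This is only an illustration of the pattern, not how the
stock resolver or lwresd behaves. The query callable is hypothetical (a real
one would send a DNS query to the given server), and simulated_query with
its delays is a stand-in so that the sketch runs without a network.

```python
import concurrent.futures
import time

# Hypothetical behaviour of the three cache VIPs: two answer after a delay
# (in seconds), and None marks a server that is down.
SIMULATED_DELAYS = {"10.1.1.1": 0.5, "10.1.2.1": 0.05, "10.1.3.1": None}


def simulated_query(server, name):
    """Stand-in for a real DNS query; sleeps, then answers or fails."""
    delay = SIMULATED_DELAYS[server]
    if delay is None:
        raise ConnectionError(f"{server} is down")
    time.sleep(delay)
    return f"{name} answered by {server}"


def first_answer(name, servers, query, timeout):
    """Ask every server at once and return (server, answer) for the first
    successful reply, or None if nobody answers within `timeout` seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(servers))
    futures = {pool.submit(query, s, name): s for s in servers}
    try:
        for fut in concurrent.futures.as_completed(futures, timeout=timeout):
            # Skip servers that failed; keep waiting for a good reply.
            if fut.exception() is None:
                return futures[fut], fut.result()
    except concurrent.futures.TimeoutError:
        pass
    finally:
        # Do not block on slow or dead servers that are still in flight.
        pool.shutdown(wait=False)
    return None
```

Called as first_answer("host.xxx.com", list(SIMULATED_DELAYS),
simulated_query, 1.0), the failed server is skipped and the fastest healthy
one wins. The cost is that every lookup is sent to every cache, so the query
load on the caches is multiplied by the number of servers.
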
-- 
Kal Feher