BIND BOTTLENECK: internal 90 seconds query timeout & recursive-clients limit

Tue May 18 07:32:16 UTC 2004

On Tue, May 18, 2004 at 01:27:35AM -0400, Barry Margolin wrote:
> In article <c8btnt$73u$1 at sf1.isc.org>,
>  Ladislav Vobr <lvobr at ies.etisalat.ae> wrote:
> 
*snip*
> > When all the nameservers for a certain domain are unreachable, BIND
> > doesn't log such servers or domains or mark them as bogus, even if
> > they stay unreachable for hours/days/weeks/years. The administrator
> > has no idea how many such servers are being permanently retried in
> > the background from his server. He can discover it only by chance,
> > or by waiting for a customer complaint to trigger the troubleshooting.
> > 
> > Worse, imho, if the internal timeout of each such query is 90
> > seconds, 11 such queries per second to unreachable domains are enough
> > to fill the default queue of 1000 concurrent recursive queries after
> > those 90 seconds, using only this type of request.
> 
> What's the chance that so many queries for unreachable domains will 
> happen simultaneously?  I don't think I've ever seen a server get stuck 
> like this, and our caching servers at Genuity were very heavily used.
>
> I think this will actually only be a problem if *all* the servers for a 
> domain are down.  BIND keeps track of past response times for servers, 
> and chooses the one with the best previous response time when selecting 
> which NS record for a domain to use.

We've seen this happen often.  There are two types of situations.  One
is log resolving.  Lots of people do this at night, after log rotation,
so a lot of those lookups run at the same time.  And there are a lot of
wrongly configured reverse zones, which are not resolvable because
their nameservers do not respond or do not exist.

The other is broken clients.  They ask to resolve a certain hostname,
which is not possible because its nameservers are not reachable.  Up
to this point there's no problem, but those broken clients are not
happy with the reply they got.  Result?  They keep querying for the
same hostname until they get an answer.  True, this is a customer
problem, but you can't change 'em all (and they still use your
resolvers) :/

There are some rarer situations that also bring your resolvers down.
It has happened once that Akamai's nameservers were unreachable.  The
result of that is very nice :/  It has also happened that the
transatlantic links went down.  It was as if the internet stopped
working (it was 'only' the resolvers that got dog slow).

So I would not say this is a rare problem.  But of course having
heavily overspec'd hardware makes this problem mostly invisible.
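
For reference, here is a rough sketch of the arithmetic behind the
"11 queries per second" figure quoted above.  It assumes that every
query to an unreachable domain holds one recursive-client slot for the
full 90 seconds, and that the limit is 1000, both as stated in this
thread.  The function names are made up for illustration, and this is
not a description of how BIND accounts for clients internally.

```python
def rate_to_fill(limit: int, timeout_s: float) -> float:
    """Steady rate (queries/second) of never-answered queries at which
    all `limit` slots are occupied once the first queries time out."""
    return limit / timeout_s


def outstanding(rate_qps: float, timeout_s: float, elapsed_s: float) -> float:
    """Queries still pending after `elapsed_s` seconds, assuming each one
    holds its slot for exactly `timeout_s` seconds and then goes away."""
    return rate_qps * min(elapsed_s, timeout_s)


LIMIT = 1000     # recursive-clients limit cited in the thread
TIMEOUT = 90.0   # internal query timeout cited in the thread (seconds)

rate = rate_to_fill(LIMIT, TIMEOUT)
print(f"rate needed to fill the queue: {rate:.1f} queries/second")
print(f"pending after 90 s at 11 q/s: {outstanding(11, TIMEOUT, 90):.0f}")
```

Under these assumptions the pending count stops growing after one
timeout period, so a sustained rate slightly above 11 queries per
second keeps the queue full for as long as the domains stay
unreachable.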

Jan Gyselinck