Name resolution failure on a caching server -- many '; pending-answer' records in the cache

Sat Jan 30 04:41:22 UTC 2016

Thanks for the followup.

> 
> NXDOMAIN is not a "failure" response. Are you *sure* you're getting NXDOMAIN? 

Yes. Pretty sure. With hindsight I should have run the tests inside a 'script' session.

> If you're using nslookup to test, be aware that it will do suffix searching by default, so if the original query, e.g. www.bbc.co.uk, fails, it'll quietly (unless debug-mode is in effect) start appending suffixes. Looking up those suffixed names, e.g. www.bbc.co.uk.example.com, most likely gets an NXDOMAIN, so nslookup reports NXDOMAIN as the overall result of the query. So, it's basically a misreporting of the error by nslookup.

Yes. I was mostly using nslookup.  I'll try dig too next time this occurs.
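
For reference, the suffix searching described above can be sketched as follows. This is only an illustration of the conventional resolv.conf-style rules (a search list plus an 'ndots' threshold), not nslookup's actual code; the function name and the 'example.com' search list are assumptions for the example, and nslookup's exact ordering may differ.

```python
def search_candidates(name, search_list, ndots=1):
    """Return the names a searching resolver would try, in order (illustrative only)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute, so no suffix is appended.
        return [name]
    suffixed = [f"{name}.{suffix.strip('.')}." for suffix in search_list]
    as_is = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then fall back to the suffixes.
        return [as_is] + suffixed
    # Too few dots: try the suffixed names first, then the name as given.
    return suffixed + [as_is]


for candidate in search_candidates("www.bbc.co.uk", ["example.com"]):
    print(candidate)
```

If the first candidate fails for any reason, the last answer seen comes from the suffixed name, which is where a misleading NXDOMAIN can come from. Querying with a trailing dot (www.bbc.co.uk.) or using dig, which does not apply the search list by default, avoids the ambiguity.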

> 
> Note that only 1 of the records in your cache dump is actually relevant -- the CNAME from www.bbc.co.uk to www.bbc.net.uk -- and the others are for a different part of the namespace (thdow.bbc.co.uk).

I'll contact you privately with a link to the whole cache.  Every 'pending-*' entry in the cache that I tried querying failed to resolve, many hours after the network congestion had ended.
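
To pick out such entries from a dump without reading it by eye, something like the following might do. It is a rough sketch that assumes the layout seen in my excerpt (a one-word '; <tag>' comment line followed by the records it applies to); the function name is made up, and the sample text is cut down from the excerpt quoted further down in this message.

```python
import re

# A trust marker is assumed to be a comment holding a single word, e.g. "; pending-answer".
TAG_LINE = re.compile(r"^;\s*([a-z]+(?:-[a-z]+)*)\s*$")


def pending_owners(dump_text):
    """Return (owner, tag) pairs for records that follow a 'pending-*' marker."""
    tag = None
    found = []
    for line in dump_text.splitlines():
        if line.startswith(";"):
            match = TAG_LINE.match(line)
            if match:
                tag = match.group(1)
            continue  # other comments leave the current marker unchanged
        if not line.strip() or line[0].isspace():
            continue  # blank lines and continuation lines carry no owner name
        owner = line.split()[0]
        if tag and tag.startswith("pending-") and (owner, tag) not in found:
            found.append((owner, tag))
    return found


SAMPLE = r"""; pending-additional
thdow.bbc.co.uk.        76632   NS      ns3.bbc.net.uk.
                        76632   NS      ns4.bbc.co.uk.
; pending-answer
ns0.thdow.bbc.co.uk.    2082    \-AAAA  ;-$NXRRSET
; pending-answer
www.bbc.co.uk.          30      CNAME   www.bbc.net.uk.
; glue
"""

for owner, tag in pending_owners(SAMPLE):
    print(owner, tag)
```

Each name it prints could then be queried directly (with dig rather than nslookup) to see which ones really fail.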

> 
> If you do an explicit query of the CNAME, when the problem is occurring, does it resolve? I would expect, even though the cache entry is marked "pending-answer", it will still resolve. But, without the target of the CNAME also resolving, the lookup as a whole cannot succeed.

I'll try that next time.

Regards
Tom.

> 
> 													- Kevin
> 
> -----Original Message-----
> From: bind-users-bounces at lists.isc.org [mailto:bind-users-bounces at lists.isc.org] On Behalf Of TPCbind at mklab.ph.rhul.ac.uk
> Sent: Tuesday, January 26, 2016 8:02 PM
> To: bind-users at lists.isc.org
> Subject: Name resolution failure on a caching server -- many '; pending-answer' records in the cache
> 
> Dear All,
>      I run a caching server on a section of the departmental LAN.
> Occasionally network congestion results in timeouts & name resolution failures.  Lookups performed on name servers outside my LAN section fail with NXDOMAIN.  Querying my name server for items not in its cache gets the same result.
> 
> My problem is that long after the congestion has subsided, queries to my name server still result in NXDOMAIN failure.  AFAICT this situation remains indefinitely, until the cache is flushed with 'rndc flush' or BIND is restarted.  When it is in this state, dumping the cache with 'rndc dumpdb' shows numerous entries like this:
> 
> --------------------------------------------------------------------------------------------
> ; pending-additional
> thdow.bbc.co.uk.        76632   NS      ns3.bbc.net.uk.
>                         76632   NS      ns4.bbc.co.uk.
>                         76632   NS      ns4.bbc.net.uk.
>                         76632   NS      ns3.bbc.co.uk.
> ; pending-answer
> ns0.thdow.bbc.co.uk.    2082    \-AAAA  ;-$NXRRSET
> ; thdow.bbc.co.uk. SOA ns.bbc.co.uk. hostmaster.bbc.co.uk. 2015122100 1800 600 864000 86400 ; pending-answer
>                         76632   A       212.58.240.162
> ; pending-answer
> www.bbc.co.uk.          30      CNAME   www.bbc.net.uk.
> ; glue
> --------------------------------------------------------------------------------------------
> 
> and attempts to look up e.g. www.bbc.co.uk result in NXDOMAIN.
> 
> Browsing the documentation I noticed the parameter 'max-ncache-ttl',
> which is unset in my named.conf and apparently defaults to 3 hours.
> However, the problem persists long after 3 hours have elapsed following incidents of network congestion.
> 
> I could set up a cron job to check name resolution on external domains and flush the cache when it fails.  I am assuming there must be a better solution!  Should I set max-ncache-ttl to something fairly short in my named.conf and hope that the default value is for some reason actually much greater than 3 hours?
> 
> BTW, is there a way to dump out all the parameters from a running named,
> just to see all their values?
> 
> 
> Any ideas on how to solve or further diagnose the problem?
> 
> Many thanks
> Tom Crane
> 
> System details:
> OS:    Scientific Linux CERN SLC release 6.7 (Carbon) [NB: SLC is a derivative of RHEL]
> BIND:  bind-9.8.2-0.37.rc1.el6_7.5.x86_64
> 
> PS. I originally posted in Usenet NG comp.protocols.dns.bind but got no followups and then noticed all messages in that NG had this ML's fields 'NNTP-Posting-Host: lists.isc.org' and 'X-Original-To: bind-users at lists.isc.org' etc. in their headers.  Is c.p.d.b actually a moderated group now or exclusively tied to this ML via a mail2news gateway?
> 
> -- 
> Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
> Egham, Surrey, TW20 0EX, England.
> Email:  T dot Crane at rhul dot ac dot uk
> 
> _______________________________________________
> Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list
> 
> bind-users mailing list
> bind-users at lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users
> 

-- 
Tom Crane, Dept. Physics, Royal Holloway, University of London, Egham Hill,
Egham, Surrey, TW20 0EX, England. 
Email:  T.Crane at rhul.ac.uk
Fax:    +44 (0) 1784 472794