Problem with BIND 9.10.1-P1 recursion limits

David A. Evans Evans_David_A at cat.com
Wed Dec 10 15:07:41 UTC 2014


How does the max-recursion-queries counter interact with DNSSEC validation
and RPZ processing? Are the queries for these checks included in the
max-recursion-queries count, or are they in a separate queue?


Why I am asking:
I've been running through my test of the new code and getting a few hits
on domains that I think resolve without these limits. I have more
work to do to determine whether these domains failed to resolve because of
some momentary network or external DNS resolution issue, or because of
these new thresholds (or, in the case of 4842b.y.dotnxdomain.net, Cricket
out registering new domains just to mess with us).

My test is ~5 million unique A record lookups I'm pushing to a test server 
at ~300 q/s.
It has a lengthy RPZ enabled and DNSSEC validation on.
After reading this thread, I'm flushing the cache every 30 mins.

I'm getting a handful of these messages. Some are just broken domains, but
a handful of them seem to resolve on DNS servers running older BIND code.
They do not seem to coincide with the cache clearing.

Dec  9 13:59:11 198.206.x.x named[13525]: exceeded max queries resolving 'growthcentre.org/A'
Dec  9 14:15:05 198.206.x.x named[13525]: exceeded max queries resolving 'megadeth.rockmetal.art.pl/A'
Dec  9 14:22:33 198.206.x.x named[13525]: exceeded max queries resolving 'ns3.iplay.net/A'
Dec 10 03:18:54 198.206.x.x named[13525]: exceeded max queries resolving '4842b.y.dotnxdomain.net/DNSKEY'
Dec 10 03:59:02 198.206.x.x named[13525]: exceeded max queries resolving 'dsl-188-34-202-200.asretelecom.net/A'
Dec 10 03:59:03 198.206.x.x named[13525]: exceeded max queries resolving 'ns1.asretelecom.com/A'
Dec 10 08:19:15 198.206.x.x named[13525]: exceeded max queries resolving 'knurow.eu.org/A'
Dec 10 08:27:36 198.206.x.x named[13525]: exceeded max queries resolving 'lb.z.optimix.asia/NS'
Dec 10 08:31:04 198.206.x.x named[13525]: exceeded max queries resolving 'NS4-AUTH.ALLTEL.NET/A'
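
To tally which names and record types trip the limit, something like the
following sketch could pull them out of each message. It assumes every
message has been rejoined onto a single line and follows the exact
"exceeded max queries resolving 'NAME/TYPE'" form shown above.
parse_exceeded is a hypothetical helper name, not anything shipped with BIND.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of BIND): extract the query name and type
 * from an "exceeded max queries resolving 'NAME/TYPE'" log message.
 * Returns 1 on success, 0 if the line does not match or a buffer is too small.
 */
static int parse_exceeded(const char *line, char *name, size_t name_len,
                          char *type, size_t type_len) {
    static const char marker[] = "exceeded max queries resolving '";
    const char *start = strstr(line, marker);
    if (start == NULL)
        return 0;
    start += sizeof(marker) - 1;          /* skip past the opening quote */

    const char *end = strchr(start, '\''); /* closing quote */
    if (end == NULL)
        return 0;

    /* The type follows the last '/' inside the quotes. */
    const char *slash = NULL;
    for (const char *p = start; p < end; p++) {
        if (*p == '/')
            slash = p;
    }
    if (slash == NULL)
        return 0;

    size_t nlen = (size_t)(slash - start);
    size_t tlen = (size_t)(end - slash - 1);
    if (nlen == 0 || tlen == 0 || nlen >= name_len || tlen >= type_len)
        return 0;

    memcpy(name, start, nlen);
    name[nlen] = '\0';
    memcpy(type, slash + 1, tlen);
    type[tlen] = '\0';
    return 1;
}
```

Fed the first message above, this would yield the name growthcentre.org and
the type A. Lines that do not contain the marker text are rejected.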



David A. Evans
Enterprise IP/DNS Management
Network Infrastructure Tools and Services




From:   Evan Hunt <each at isc.org>
To:     Stuart Henderson <stu at spacehopper.org>
Cc:     Tony Finch <dot at dotat.at>, bind-users at lists.isc.org
Date:   12/09/2014 01:41 PM
Subject:        Re: Problem with BIND 9.10.1-P1 recursion limits
Sent by:        bind-users-bounces at lists.isc.org



On Tue, Dec 09, 2014 at 05:51:58PM +0000, Evan Hunt wrote:
> That's unexpected. I'll see if I can reproduce it.

Okay, I can.

Part of the problem is the somewhat crazypants DNS configuration
of www.ibm.com:

  $ dig +noall +answer www.ibm.com
  www.ibm.com.            3600    IN      CNAME   www.ibm.com.cs186.net.
  www.ibm.com.cs186.net.  60      IN      CNAME   china-cdn.san.ibm.com.edgekey.net.
  china-cdn.san.ibm.com.edgekey.net. 21600 IN CNAME china-cdn.san.ibm.com.edgekey.net.globalredir.akadns.net.
  china-cdn.san.ibm.com.edgekey.net.globalredir.akadns.net. 900 IN CNAME e7826.x.akamaiedge.net.
  e7826.x.akamaiedge.net. 20      IN      A       23.59.201.136

... like, *wow*.  A chain of five aliases with TTLs ranging from 20
seconds to 6 hours, passing through five different zones (ibm.com,
cs186.net, edgekey.net, akadns.net, akamaiedge.net), hosted by
servers in three *more* zones (ihost.com, akam.net, and akadns.org,
in addition to akadns.net and akamaiedge.net).  I had to almost
double the maximum recursion queries to 99 to get this to work on
an empty cache.  Yikes.

Almost any non-empty cache will dodge the bullet. Preceding the
lookup of www.ibm.com with "dig @::1 ns com" causes the query to
succeed.  Also, as previously noted, on 9.9 it will succeed without
a five-minute delay if you just issue the query a second time.

So, possible workarounds if this issue is causing problems for you:

  - Ensure that the first query sent to a newly-primed recursive
    resolver isn't quite as spectacular as this one;
  - Add "max-recursion-queries 100;" to your options statement;
  - Run 9.9.6-P1 instead of 9.10.1-P1

The five-minute delay is still a bit of a puzzle. It happens because
of this code in adb.c:

        /* XXXMLG Don't pound on bad servers. */
        if (address_type == DNS_ADBFIND_INET) {
                name->expire_v4 = ISC_MIN(name->expire_v4, now + 300);
                name->fetch_err = FIND_ERR_FAILURE;
                inc_stats(adb, dns_resstatscounter_gluefetchv4fail);
        } else {
                name->expire_v6 = ISC_MIN(name->expire_v6, now + 300);
                name->fetch6_err = FIND_ERR_FAILURE;
                inc_stats(adb, dns_resstatscounter_gluefetchv6fail);
        }

The "now + 300" bit is where the five minutes comes from.  That's code
that's been around for years, and it is in 9.9, but apparently it's
reached more easily in 9.10.  I'm looking into the reasons for this.
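
As a rough illustration of that cap (a standalone sketch, not the adb.c
code itself): MIN_OF stands in for ISC_MIN, and cap_expiry is a
hypothetical helper name. An entry that would otherwise live longer is
shortened to five minutes after a failure, while one already due to expire
sooner is left alone.

```c
#include <stdio.h>

/* Stand-in for BIND's ISC_MIN macro so this sketch compiles on its own. */
#define MIN_OF(a, b) ((a) < (b) ? (a) : (b))

/*
 * Hypothetical helper: given an entry's current expiry time and the current
 * time (both in seconds), return the expiry after a failed fetch. The
 * result is never later than now + 300.
 */
static long cap_expiry(long expire, long now) {
    return MIN_OF(expire, now + 300);
}
```

With now = 1000, cap_expiry(4600, 1000) gives 1300 (capped to five
minutes), and cap_expiry(1060, 1000) gives 1060 (left unchanged).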

The problem should be addressed in 9.10.2, which is likely to be
released next month.

-- 
Evan Hunt -- each at isc.org
Internet Systems Consortium, Inc.
