hhs.gov resolvers broken, or BIND misconfigured?

James Ralston ralston at pobox.com
Tue Mar 1 18:46:04 UTC 2016


We have a mystery.

We're running a recursive resolver on RHEL6, using the latest
RHEL-provided BIND package, bind-9.8.2-0.37.rc1.el6_7.6.  The
recursive resolver only has an IPv4 interface; it does not have an
IPv6 interface.  DNSSEC is enabled (by default).

Our recursive resolver periodically returns SERVFAIL for lookups of
hhs.gov records, which are served by these nameservers:

rh202ns1.355.dhhs.gov.  168     IN      A       158.74.30.98
rh202ns1.355.dhhs.gov.  14260   IN      AAAA    2607:f220:0:1::2a
rh202ns2.355.dhhs.gov.  168     IN      A       158.74.30.99
rh202ns2.355.dhhs.gov.  14260   IN      AAAA    2607:f220:0:1::2b
rh120ns2.368.dhhs.gov.  81      IN      A       158.74.30.103
rh120ns2.368.dhhs.gov.  81      IN      AAAA    2607:f220:0:1::2d
rh120ns1.368.dhhs.gov.  168     IN      A       158.74.30.102
rh120ns1.368.dhhs.gov.  14260   IN      AAAA    2607:f220:0:1::2c

When this happens, BIND logs the following:

01-Mar-2016 09:10:02.064 lame-servers: info: error (network
unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.064 lame-servers: info: error (network
unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.064 lame-servers: info: error (network
unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network
unreachable) resolving 'hhs.gov/MX/IN': 2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network
unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN':
2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.065 lame-servers: info: error (network
unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN':
2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN':
2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN':
2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN':
2607:f220:0:1::2c#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN':
2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN':
2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN':
2607:f220:0:1::2a#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN':
2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.066 lame-servers: info: error (network
unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN':
2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN':
2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh120ns2.368.dhhs.gov/A/IN':
2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN':
2607:f220:0:1::2d#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh202ns2.355.dhhs.gov/A/IN':
2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh202ns1.355.dhhs.gov/A/IN':
2607:f220:0:1::2b#53
01-Mar-2016 09:10:02.067 lame-servers: info: error (network
unreachable) resolving 'rh120ns1.368.dhhs.gov/A/IN':
2607:f220:0:1::2b#53
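
A minimal sketch of how log entries like the ones above could be
summarized per remote address, to confirm that every failing attempt
went to an IPv6 address. It assumes each entry occupies a single line
in the actual log file and follows the format shown above. The names
LINE_RE and summarize are hypothetical, not part of BIND.

```python
import re
from collections import defaultdict

# Assumed format: one complete lame-servers entry per line, as in the
# excerpt above once any mail-client line wrapping is undone.
LINE_RE = re.compile(
    r"lame-servers: info: error \((?P<reason>[^)]+)\) "
    r"resolving '(?P<query>[^']+)': (?P<server>\S+)#(?P<port>\d+)"
)


def summarize(lines):
    """Map each remote server address to the queries that failed against it."""
    failures = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        # Only count the "network unreachable" errors discussed here.
        if match and match.group("reason") == "network unreachable":
            failures[match.group("server")].add(match.group("query"))
    return dict(failures)
```

If every key in the result contains a colon (that is, every key is an
IPv6 address), the resolver never tried an IPv4 address for these
queries.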

If I dump the cache, the only records in the cache for the
nameservers in question are the AAAA records:

rh202ns1.355.dhhs.gov.  56878   AAAA    2607:f220:0:1::2a
rh202ns2.355.dhhs.gov.  56878   AAAA    2607:f220:0:1::2b
rh120ns1.368.dhhs.gov.  56878   AAAA    2607:f220:0:1::2c
rh120ns2.368.dhhs.gov.  56878   AAAA    2607:f220:0:1::2d
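
A rough sketch of how the "AAAA only" condition could be detected
automatically in dump text shaped like the excerpts in this message.
It is loosely modeled on those excerpts and is not a full parser for
the file that "rndc dumpdb" writes. The function names are
hypothetical.

```python
def addresses_by_name(dump_text):
    """Return {owner name: set of address record types seen for it}."""
    types = {}
    owner = None
    for raw in dump_text.splitlines():
        # Skip blank lines and comment lines.
        if not raw.strip() or raw.lstrip().startswith(";"):
            continue
        fields = raw.split()
        # A line starting in column 0 names a new owner; an indented
        # line continues the previous owner.
        if not raw[0].isspace():
            owner = fields[0].lower()
            fields = fields[1:]
        if owner is None or len(fields) < 3:
            continue
        rest = fields[1:]            # drop the TTL
        if rest[0].upper() == "IN":  # optional class field
            rest = rest[1:]
        rtype = rest[0].upper()
        if rtype in ("A", "AAAA"):
            types.setdefault(owner, set()).add(rtype)
    return types


def aaaa_only(dump_text):
    """Names that have AAAA records cached but no A records."""
    found = addresses_by_name(dump_text)
    return sorted(name for name, seen in found.items() if seen == {"AAAA"})
```

On an IPv4-only resolver, any nameserver name returned by aaaa_only()
is effectively unreachable until its A record is cached again.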

If I look at the queries the recursive resolver issued at the same
time as this failure (which I captured via ngrep), I see it attempt to
refresh the A records for the dhhs.gov nameservers by performing
recursive resolution from the root servers.  Based on the capture,
everything appears to be legitimate.  And indeed, I can successfully
recursively resolve the A records for all 4 nameservers with
"dig +trace +dnssec".

If I flush these records from the cache, then retry the hhs.gov query,
it succeeds, and then the cache contains:

rh202ns1.355.dhhs.gov.  86114   A       158.74.30.98
                        86114   AAAA    2607:f220:0:1::2a
rh202ns2.355.dhhs.gov.  86114   A       158.74.30.99
                        86114   AAAA    2607:f220:0:1::2b
rh120ns1.368.dhhs.gov.  86114   A       158.74.30.102
                        86356   AAAA    2607:f220:0:1::2c
rh120ns2.368.dhhs.gov.  86114   A       158.74.30.103
                        86114   AAAA    2607:f220:0:1::2d

So: it seems like something goes wrong when BIND attempts to refresh
the A records for the above nameservers, and as a result, BIND thinks
that these nameservers only have AAAA addresses.  Because our
recursive resolver does not have an IPv6 interface, all queries for
all zones served by the above nameservers (and there are a bunch more
than just hhs.gov, alas) return SERVFAIL.

We can work around this with a cron job that calls "rndc flushname"
for each of the above nameserver names whenever queries for hhs.gov
return SERVFAIL.
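
A minimal sketch of such a cron job, under several assumptions: it
runs on the resolver host, dig queries the local resolver by default,
rndc is already configured to control named, and Python 3.7 or later
is available (not what RHEL6 ships by default). The names NAMESERVERS,
dig_status, and check_and_flush are hypothetical.

```python
import re
import subprocess

# The four nameserver names listed earlier in this message.
NAMESERVERS = [
    "rh202ns1.355.dhhs.gov",
    "rh202ns2.355.dhhs.gov",
    "rh120ns1.368.dhhs.gov",
    "rh120ns2.368.dhhs.gov",
]

# dig prints the response code in its header line, for example
# ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 12345".
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def dig_status(dig_output):
    """Extract the response code from dig output, or None if absent."""
    match = STATUS_RE.search(dig_output)
    return match.group(1) if match else None


def check_and_flush(run=subprocess.run):
    """Query hhs.gov MX; on SERVFAIL, flush each nameserver name."""
    result = run(["dig", "hhs.gov", "mx"], capture_output=True, text=True)
    if dig_status(result.stdout) != "SERVFAIL":
        return []
    flushed = []
    for name in NAMESERVERS:
        # "rndc flushname" removes the cached records for one name.
        run(["rndc", "flushname", name], check=True)
        flushed.append(name)
    return flushed
```

A cron entry would call check_and_flush() from a small wrapper script.
The run parameter exists only so the logic can be exercised without
touching a live resolver.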

But we'd really love to know why this happens in the first place.

Can anyone else reproduce this?  (E.g., set up a cron job on an
IPv4-only host to run "dig hhs.gov mx" every 5 minutes or so, and see
when/if the dig starts returning SERVFAIL.)

Is something subtly broken with the DNS resolution path for these
nameservers?

Have we misconfigured our recursive resolver in some way?

Is there a bug in the version of BIND we're running?

Something else?

Any thoughts/guesses appreciated.


More information about the bind-users mailing list