hhs.gov resolvers broken, or BIND misconfigured?

Wed May 11 14:32:20 UTC 2016

On 09/03/2016 07:59, Petr Spacek wrote:
> On 8.3.2016 21:04, Daisuke HIGASHI wrote:
>> Hi James,
>>
>> ISC might not handle BIND 9.8's issue because it has been
>> End of Life. You should ask Redhat for help or try latest BIND9.
> 
> Oh yes, open a support case about it.
> 
>> But that sounds like same issue described in http://pastebin.com/j84451Nz .
>> Possible workaround is to run named in IPv4-only mode (e.g. named -4).
>>
>> Several people in (Japanese) local community pointed out this issue and
>> reported it to ISC two years ago. I don't know whether the issue
>> has been fixed in latest BIND9...
> 
> I don't know either. In any case, you might try to disable IPv6 on the host to
> avoid attempts to contact servers over IPv6 altogether as it might cause
> various issues with other software, too.
> 
> For RHEL 6 please see
> https://access.redhat.com/solutions/8709#rhel6disable
> 
> Recommended approach is to edit /etc/sysctl.conf and set following parameters:
> # IPv6 support in the kernel, set to 0 by default
> net.ipv6.conf.all.disable_ipv6 = 1
> net.ipv6.conf.default.disable_ipv6 = 1
> 
> Have a nice day.
> 

Late to the party on this one - but I do recall the bug ticket.

There were two contributory issues, but the main one was that the
resolver that was getting 'stuck' with only IPv6 addresses for the
authoritative server in cache was set up in such a way that queries to
IPv6 addresses timed out rather than failing or returning an error.

That means (to named) that the responsiveness of those IPv6 servers was
unknown because we didn't get a response back from them.  They were not
marked as unreachable because there were no errors encountered, so the
reason for the non-response was unknown and potentially transient.

Of course, the servers were penalized with large SRTT values, so while
IPv4 addresses for the same zones were still available in cache, those
were preferred.

The explanation provided to the submitter on the bug ticket (RT #33327)
includes that BIND doesn't care whether the addresses for the
authoritative servers are IPv4 or IPv6 - it is picking the 'nearest' one
based on the round trip times when they respond.  They're all pooled
together for this purpose.
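
To illustrate the pooling, here is a minimal Python sketch of SRTT-based
selection.  It is not BIND's actual code: the names ServerAddr,
pick_server, record_response and record_timeout, the smoothing factor,
the penalty value and the documentation-range addresses are all
assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ServerAddr:
    address: str      # IPv4 or IPv6, the selection does not care which
    srtt_ms: float    # smoothed round-trip time estimate


def pick_server(candidates):
    """Return the candidate with the lowest SRTT, ignoring address family."""
    return min(candidates, key=lambda s: s.srtt_ms)


def record_response(server, rtt_ms, alpha=0.3):
    """Blend a measured RTT into the smoothed estimate (alpha is assumed)."""
    server.srtt_ms = (1 - alpha) * server.srtt_ms + alpha * rtt_ms


def record_timeout(server, penalty_ms=1000.0):
    """Penalize a server that did not answer; it stays in the pool."""
    server.srtt_ms += penalty_ms


# One IPv4 and one IPv6 address for the same zone, in a single pool.
pool = [
    ServerAddr("192.0.2.53", 40.0),
    ServerAddr("2001:db8::53", 25.0),
]

first = pick_server(pool)     # the IPv6 address, it has the lower SRTT
record_timeout(first)         # no answer: penalized, not removed
second = pick_server(pool)    # now the IPv4 address is preferred
```

The point of the sketch is that a timeout only raises the SRTT.  The
address is still a candidate, which is harmless as long as a faster
address remains in the pool.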

Then also, while there are still 'reachable' servers available in cache
for a zone, named won't go back to refresh the NS RRset from the
authoritative server delegation path until the last one has expired.

So what we have here is an edge case:

- The IPv4 A records expired from cache earlier than the IPv6 AAAA
records, so the list of available nameservers for the zone in cache was
a reduced set.  Ordinarily this wouldn't be a problem, but...

- The IPv6 servers are 'apparently' reachable, because there are no
errors being passed back up the stack to named when it tries to send to
them.  We don't know why there was no response back - perhaps there
might be if we tried again... i.e. this is not a hard failure.
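
The two conditions above can be walked through with a toy model of the
cache.  This is only a sketch under stated assumptions, not how named
stores its data: CachedAddr, usable, needs_ns_refresh, the expiry times
and the addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedAddr:
    address: str
    expires_at: int            # seconds after the records were cached
    unreachable: bool = False  # set only on a hard error, never on a timeout


def usable(cache, now):
    """Addresses that are unexpired and not marked unreachable."""
    return [a for a in cache if a.expires_at > now and not a.unreachable]


def needs_ns_refresh(cache, now):
    """Re-walk the delegation only when no usable address is left."""
    return not usable(cache, now)


cache = [
    CachedAddr("192.0.2.53", expires_at=300),     # A record, expires first
    CachedAddr("2001:db8::53", expires_at=3600),  # AAAA record, expires later
]

now = 600  # the A record has expired, the AAAA record has not
remaining = usable(cache, now)

# Timeouts are not hard failures, so the IPv6 address still counts as
# usable and no refresh of the NS RRset is triggered.
stuck = not needs_ns_refresh(cache, now)

# If the stack reported an error instead, the address would be marked
# unreachable and the resolver would go back to the delegation path.
for addr in remaining:
    addr.unreachable = True
recovers = needs_ns_refresh(cache, now)
```

In this model the resolver is stuck between t=300 and t=3600 unless
something marks the IPv6 address as unreachable.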

The solution from a resolver's point of view is to fix the
production environment that is giving BIND the impression that it is
operating in a fully IPv6-capable server/network.  Doing this should
(unless it's a network stack problem that can't be worked around) cause
IPv6 authoritative server addresses to be tagged as unreachable from
BIND's perspective.  The complete lack of any usable nameservers for the
zone will then cause named to go back and refresh the NS RRset.  That
picks up both IPv4 and IPv6 addresses for those servers once again, but
this time including IPv4 server addresses that *can* be used.

Disabling IPv6 in the kernel is one way to achieve this.
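
One rough way to see which situation a host is in is to check whether
the OS rejects an IPv6 destination immediately or accepts it silently.
The Python sketch below is an assumption-laden probe, not something
named itself does: the function name ipv6_send_fails_fast and the
documentation-range address are made up for the example, and a result
of False only means that no local error was raised, not that IPv6
actually works end to end.

```python
import socket


def ipv6_send_fails_fast(address="2001:db8::53", port=53):
    """Return True if the OS rejects an IPv6 UDP destination immediately."""
    if not socket.has_ipv6:
        # This Python build has no IPv6 support at all.
        return True
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as s:
            # connect() on a UDP socket sends no packet; it only asks the
            # kernel to pick a route, so a missing route shows up here.
            s.connect((address, port))
    except OSError:
        # Hard error from the stack (no route, IPv6 disabled, and so on).
        return True
    # No error: from the application's view the address looks reachable,
    # and any failure would only show up later as a timeout.
    return False


result = ipv6_send_fails_fast()
```

A host that returns False here but cannot actually reach IPv6 servers
matches the environment described above, where queries time out
instead of failing.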

It's unusual to encounter authoritative servers whose AAAA RRs have a
smaller TTL than the A RRs, but it does happen from time to time - and
this is also contributing to what is happening in this edge case.

I should probably write a KB article about this...

Cathy