Intermittent "failure trying master... operation canceled" on zone refresh

Fri May 25 23:23:42 UTC 2018

I apologize for being so slow to respond; it has been "one of those weeks", but I do very much appreciate everyone's comments.

The machines are on the same LAN, and there is no evidence of unusual load or dropped packets on the network.

We do not have any firewall rules restricting DNS traffic.  We _do_ have state-aware firewalld rules on the machines that should not apply to DNS traffic - the default baseline firewalld rules that come to us from redhat include rules that check state.

The reason that I mention rules that should not apply to DNS traffic is that I found one of my colleagues who worked on this problem a few years back.  They found at the time that it was the fact of loading the conntrack module by _any_ rule that caused the fault, regardless of which rule actually used the data.  We're talking back a major release or two of redhat and everything else, so we are not assuming that this is necessarily the exact same problem, but to test it on our current systems I'll have to temporarily pull a server out of prod (the problem is not reproduceable under the load we get in test.)  So next week, at this point.

(They solved the problem last time by disabling _all_ state-aware rules from iptables, but my sysadmins are resisting a similar approach this time, so I am attempting to find an alternate solution...)

Thanks,

     - rob.

From: Matthew Pounsett <matt at conundrum.com>
Sent: Friday, May 18, 2018 11:08 AM
To: Rob Moser
Cc: bind-users at lists.isc.org
Subject: Re: Intermittent "failure trying master... operation canceled" on zone refresh

On 17 May 2018 at 17:05, Rob Moser <Rob.Moser at nau.edu> wrote:

We're running a series of RHEL 7.4 machines (kernel version 3.10.0-693.1.1.el7.x86_64) running bind version 9.9.4-RedHat-9.9.4-51.el7.  Our configuration consists of a hidden master and three hidden slave/recursive resolvers.  I'm getting a LOT of errors  on the slaves that look like:

17-May-2018 13:27:28.421 general: info: zone 34.22.10.in-addr.arpa/IN/internal-view: refresh: failure trying master 10.20.30.3#53 (source 0.0.0.0#0): operation canceled

In addition to checking for firewalls and other stateful network devices as Tony mentions, you should also have a look at the condition of the network in between the hosts.  That feels a lot like moderate packet loss, or extreme latency, to me.  

Are these machines all on the same LAN?  Are there multiple networks in between them?