Intermittent "failure trying master... operation canceled" on zone refresh

Thu May 17 21:05:32 UTC 2018

Hi All,

We're running several RHEL 7.4 machines (kernel 3.10.0-693.1.1.el7.x86_64) with BIND 9.9.4-RedHat-9.9.4-51.el7.  Our configuration consists of a hidden master and three hidden slave/recursive resolvers.  I'm getting a LOT of errors on the slaves that look like:

17-May-2018 13:27:28.421 general: info: zone 34.22.10.in-addr.arpa/IN/internal-view: refresh: failure trying master 10.20.30.3#53 (source 0.0.0.0#0): operation canceled

A little digging shows that this is often the result of network connectivity problems, firewall misconfigurations, etc.  But in our case the failures are intermittent, though frequent: roughly 40% of our zone refreshes seem to end this way.  Every one of our zones is affected, forward and reverse, and a single zone's refreshes will succeed, then fail, then succeed again, so it's not keyed to a particular zone config.  It happens on all three slaves intermittently, so a given refresh will make it to 2 of the 3 slaves, and the next attempt will make it to a different set of 2...  It's definitely affecting my users at this point; they'll add a new record and get results from an nslookup 2 tries out of 3 for an hour or so, until the last server finally gets the message.
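
One way to check whether the failures really are spread evenly across zones and masters is to tally them from the slave's log.  The following is only a rough sketch in Python: the regular expression is modeled on the single log line quoted above, so other BIND versions or logging configurations may need a different pattern, and the log path passed on the command line is hypothetical.  It counts failures only; turning that into a failure rate would also require counting successful refreshes, whose log format is not shown here.

```python
import re
import sys
from collections import Counter

# Pattern modeled on the one log line quoted in this post (an assumption);
# adjust it if your named logging format differs.
FAILURE_RE = re.compile(
    r"zone (?P<zone>\S+?)/IN(?:/(?P<view>[^:\s]+))?: "
    r"refresh: failure trying master (?P<master>\S+?)#\d+ .*: (?P<reason>.+)$"
)


def tally_failures(lines):
    """Count refresh failures per (zone, view, master, reason)."""
    counts = Counter()
    for line in lines:
        m = FAILURE_RE.search(line.rstrip("\n"))
        if m:
            key = (m["zone"], m["view"], m["master"], m["reason"].strip())
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Usage: python tally_failures.py /path/to/named.log  (path is hypothetical)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for (zone, view, master, reason), n in tally_failures(fh).most_common():
            print(f"{n:6d}  {zone}  view={view}  master={master}  {reason}")
```

If the counts cluster on particular zones, times, or a single master address, that would point away from a purely random packet-drop problem.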

Any thoughts on what I can do to debug this issue?

I turned up this link from back in 2014:

https://kb.isc.org/article/AA-01213/0/What-causes-refresh%3A-failure-trying-master-...%3A-operation-canceled-error-messages.html

along with a couple of references to what appears to be the same issue in the archives of this list.  All describe a problem with the netfilter kernel modules, but the links to the associated bug report(s) are long gone, and I wasn't able to find any signs of the issue being solved.  Does anyone know if this problem was ever fixed or had a workaround (besides disabling netfilter, which my sysadmins would prefer not to do)?

Appreciate any advice you can give me,

    - rob.
