Intermittent "failure trying master... operation canceled" on zone refresh
Rob.Moser at nau.edu
Thu May 17 21:05:32 UTC 2018
We run several RHEL 7.4 machines (kernel version 3.10.0-693.1.1.el7.x86_64) with bind version 9.9.4-RedHat-9.9.4-51.el7. Our configuration consists of a hidden master and three hidden slave/recursive resolvers. I'm getting a LOT of errors on the slaves that look like:
17-May-2018 13:27:28.421 general: info: zone 34.22.10.in-addr.arpa/IN/internal-view: refresh: failure trying master 10.20.30.3#53 (source 0.0.0.0#0): operation canceled
A little digging shows me that this is often the result of network connectivity problems, firewall misconfigurations, etc. But in our case the failures are intermittent, though frequent: roughly 40% of our zone refreshes seem to end this way. The failures hit every one of our zones, forward and reverse, and refreshes of a single zone will succeed, then fail, then succeed again, so it's not keyed to a particular zone config. It happens on all three slaves intermittently, so a given refresh will make it to 2/3 slaves, and the next attempt will make it to a different set of 2... It's definitely affecting my users at this point; they'll add a new record and get results from an nslookup 2 tries out of 3 for an hour or so, until the last server finally gets the message.
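To put numbers on how the failures are spread across zones and masters, something like the following Python could tally them from the slave logs. This is only a minimal sketch: the regular expression is modelled on the single log line quoted above, the function name tally_failures is made up for illustration, and other logging configurations would need the pattern adjusted.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted above; a different
# logging configuration may need a different expression.
FAILURE_RE = re.compile(
    r"zone (?P<zone>\S+?)/IN(?:/(?P<view>\S+))?: refresh: "
    r"failure trying master (?P<master>\S+?)#\d+ .*: (?P<reason>.+)$"
)

def tally_failures(lines):
    """Count refresh failures per zone, per master and per reason.

    `lines` is any iterable of log lines (an open file works).
    The function name is hypothetical, chosen for this sketch.
    """
    by_zone = Counter()
    by_master = Counter()
    by_reason = Counter()
    for line in lines:
        match = FAILURE_RE.search(line.rstrip("\n"))
        if match is None:
            continue  # not a refresh-failure line
        by_zone[match.group("zone")] += 1
        by_master[match.group("master")] += 1
        by_reason[match.group("reason").strip()] += 1
    return by_zone, by_master, by_reason
```

Running this over each slave's log separately and comparing the counters would show whether the failures really are evenly spread, or whether one slave or one time window stands out.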
Any thoughts on what I can do to debug this issue?
I turned up this link from back in 2014:
along with a couple of references to what appears to be the same issue in the archives of this list. All describe a problem with the netfilter kernel modules, but the links to the associated bug report(s) are long gone, and I wasn't able to find any sign that the issue was ever solved. Does anyone know if this problem was ever fixed or had a workaround (besides disabling netfilter, which my sysadmins would prefer not to do)?
Appreciate any advice you can give me,