We're running a set of RHEL 7.4 machines (kernel version 3.10.0-693.1.1.el7.x86_64) with bind version 9.9.4-RedHat-9.9.4-51.el7.  Our configuration consists of a hidden master and three hidden slave/recursive resolvers.  I'm getting a LOT of errors on the slaves that look like:

17-May-2018 13:27:28.421 general: info: zone refresh: failure trying master (source operation canceled

A little digging shows me that this is often the result of network connectivity problems, firewall misconfigurations, etc.  But in our case the failures are intermittent, though frequent: roughly 40% of our zone refreshes seem to end this way.  Every one of our zones, forward and reverse, is affected, and refreshes of a single zone will succeed, then fail, then succeed again, so it's not keyed to a particular zone config.  It happens on all three slaves intermittently: a given refresh will make it to two of the three slaves, and the next attempt will make it to a different two.  It's definitely affecting my users at this point; they'll add a new record and get results from an nslookup two tries out of three for an hour or so, until the last server finally gets the message.
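
In case it helps anyone reproduce the numbers, here is a minimal Python sketch of how the failure lines could be tallied per hour to see whether they cluster in time.  It assumes the log lines start with a date field and a time field, as in the sample above, and that every failure line contains the phrase "failure trying master".  The names failures_per_hour and FAILURE_MARKER are purely illustrative; nothing here is part of bind.

```python
from collections import Counter

# Assumption: every refresh failure line contains this phrase,
# as in the sample log line quoted in this message.
FAILURE_MARKER = "failure trying master"


def failures_per_hour(lines):
    """Count refresh-failure log lines, bucketed by date and hour.

    Assumes the first two whitespace-separated fields of each line are
    the date and the time, e.g. "17-May-2018 13:27:28.421".
    """
    counts = Counter()
    for line in lines:
        if FAILURE_MARKER not in line:
            continue  # not a refresh failure
        parts = line.split()
        if len(parts) < 2:
            continue  # malformed line, skip it
        date, time = parts[0], parts[1]
        hour = time.split(":")[0]
        counts[f"{date} {hour}:00"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical log path): python tally.py < /path/to/named.log
    for bucket, n in sorted(failures_per_hour(sys.stdin).items()):
        print(bucket, n)
```

Bucketing by hour is an arbitrary choice; a finer bucket might be more useful if the failures line up with something periodic on the network.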

Any thoughts on what I can do to debug this issue?

I turned up this link from back in 2014:

along with a couple of references to what appears to be the same issue in the archives of this list.  All describe a problem with the netfilter kernel modules, but the links to the associated bug report(s) are long gone, and I wasn't able to find any sign that the issue was ever solved.  Does anyone know if this problem was ever fixed or had a workaround (besides disabling netfilter, which my sysadmins would prefer not to do)?

Appreciate any advice you can give me,

    - rob.

