Stalling slave transfers

Thu May 9 09:36:46 UTC 2013

On 08/05/13 19:15, Tom Sommer wrote:
> 
> On 5/8/13 12:25 PM, Cathy Almond wrote:
>> On 08/05/13 08:26, Tom Sommer wrote:
>>> Hi,
>>>
>>> I have a problem with one of 3 slave servers, all set up the exact same
>>> way, with the exact same bind version and configuration.
>>>
>>> One slave has a problem transferring zones from the master.
>>>
>>> The logfiles are flooded with "received notify for zone" .. "refresh in
>>> progress, refresh check queued" lines and "rndc status" returns a
>>> constant high number of "soa queries in progress".
>>> After a few hours the zones are transferred, so the connection to the
>>> master is working, but there is a major delay. I tried resetting the
>>> slave and transferring ALL slave zones again, which worked fine
>>> instantly. The problem still appeared again after a few hours though.
>>>
>>> The master has three network-paths, one on external IP, one on internal
>>> IP and one on IPv6. All 3 paths work fine, because the transfers happen
>>> after an hour or so.
>>>
>>> There are no hints in the master's log.
>>> The other two slaves are running perfectly, no errors or delays
>>> whatsoever.
>>>
>>> BIND version 9.9.2-P2 (recently upgraded to).
>>>
>>> Any hints would be appreciated, as I feel like I've exhausted most
>>> options.
>>>
>>> Thank you.
>> Have a look at this KB article (you'll need to register to view - but
>> registration is open to all):
>>
>> https://kb.isc.org/article/AA-00726/30/Tuning-your-BIND-configuration-effectively-for-zone-transfers-particularly-with-many-frequently-updated-zones.html
>>
>>
>> Also - and this isn't covered in that article (yet) - if you're using
>> views, then use-alt-transfer-source defaults to 'yes'.  You might want
>> to set it explicitly to 'no' or to define alt-transfer-source
>> and/or alt-transfer-source-v6.
>>
> Thank you, great resource. I think I solved it by raising
> serial-query-rate, it's just odd that it's not required on the other
> two servers.
> 
> Another issue has arisen now though, the logfile is filled with lots of
> named[5596]: zone example.com/IN: refresh: failure trying master
> 1.2.3.4#53 (source 0.0.0.0#0): operation canceled
> 
> But if I do a "dig example.com @1.2.3.4" it's working just fine. Same
> server as with the previous issue.
> 
> Any thoughts? Thank you.
> 
> // Tom

I don't think you solved the problem - I think you moved it (or made it
happen faster...)

The refresh errors indicate that the master isn't responding to your
slave for some reason.  That's what you'll need to investigate.  I would
suggest auditing the differences between this slave and the others in
their named configurations as well as their configured IP interfaces and
routing tables.
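
As a rough illustration of that audit, here is a minimal Python sketch that
diffs the options from two slaves after dropping blank lines and whole-line
comments. The two configuration excerpts, their values, and the function
names are made up for the example; paste in the real text from each server.
It does not handle /* ... */ or trailing comments.

```python
import difflib

# Hypothetical excerpts with invented values; replace with the real
# named.conf content from a working slave and from the stalling one.
WORKING_SLAVE = """\
options {
    // slave that transfers promptly
    transfer-source 192.0.2.10;
    serial-query-rate 20;
    transfers-in 10;
};
"""

STALLING_SLAVE = """\
options {
    serial-query-rate 100;
    transfers-in 10;
};
"""


def normalize(conf_text):
    """Return stripped, non-empty lines, skipping whole-line // and # comments."""
    lines = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//") or line.startswith("#"):
            continue
        lines.append(line)
    return lines


def config_diff(good_text, bad_text,
                good_name="working-slave", bad_name="stalling-slave"):
    """Return a unified diff (list of lines) between two normalized configs."""
    return list(difflib.unified_diff(
        normalize(good_text), normalize(bad_text),
        fromfile=good_name, tofile=bad_name, lineterm=""))


if __name__ == "__main__":
    for line in config_diff(WORKING_SLAVE, STALLING_SLAVE):
        print(line)
```

Lines starting with "-" exist only on the working slave and lines starting
with "+" only on the stalling one. The same comparison is worth repeating
on the interface and routing table listings from each host.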

A pair of network packet traces (slave and the non-responding auth
server) might also point you in the right direction.
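
Before capturing, it may help to know which master address and source
address the failures cluster on, since that tells you where to point the
traces. A minimal Python sketch that tallies the refresh failure lines is
below. The pattern is modelled on the single log line quoted above; the
sample lines (timestamps, host name, the 2001:db8::1 address and the
"timed out" entry) are invented for illustration, so adjust the pattern if
your log format differs.

```python
import re
from collections import Counter

# Pattern modelled on the quoted log line; treat it as an assumption.
REFRESH_FAILURE = re.compile(
    r"zone (?P<zone>\S+)/IN: refresh: failure trying master "
    r"(?P<master>\S+)#(?P<port>\d+) "
    r"\(source (?P<source>\S+)#\d+\): (?P<reason>.+)$"
)

# Invented sample lines; in practice read these from the slave's log file.
SAMPLE_LOG = [
    "May  9 09:00:01 ns3 named[5596]: zone example.com/IN: refresh: "
    "failure trying master 1.2.3.4#53 (source 0.0.0.0#0): operation canceled",
    "May  9 09:00:02 ns3 named[5596]: zone example.org/IN: refresh: "
    "failure trying master 1.2.3.4#53 (source 0.0.0.0#0): operation canceled",
    "May  9 09:00:03 ns3 named[5596]: zone example.net/IN: refresh: "
    "failure trying master 2001:db8::1#53 (source ::#0): timed out",
    "May  9 09:00:04 ns3 named[5596]: received notify for zone 'example.com'",
]


def tally_failures(lines):
    """Count refresh failures keyed by (master, source, reason)."""
    counts = Counter()
    for line in lines:
        match = REFRESH_FAILURE.search(line)
        if match:
            key = (match["master"], match["source"], match["reason"].strip())
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for (master, source, reason), n in tally_failures(SAMPLE_LOG).most_common():
        print(f"{n:6d}  master={master}  source={source}  reason={reason}")
```

If nearly all failures point at one master address, that is the path to
trace first.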

Cathy