occasional jnl out-of-sync of slaved zones kills bind's response; re-sync .jnl removal is needed.  why?
    grantksupport at operamail.com 
    Sat Sep  6 17:13:00 UTC 2014
    
    
  
Hi,
In my BIND 9.10.0-P2 instance on linux/64, I use slaved rpz zones.
Usually, all works as expected.
On an infrequent basis, I start getting errors related to that zone, e.g.:
	main.log:Sep  4 07:58:24 dns named[7285]: 04-Sep-2014 07:58:24.509 general: error: zone rpz.spamhaus.org/IN/internal: journal rollforward failed: journal out of sync with zone
	main.log:Sep  4 07:58:24 dns named[7285]: 04-Sep-2014 07:58:24.509 general: warning: zone rpz.spamhaus.org/IN/internal: unable to load from '/namedb/slave/rpz.spamhaus.org.zone.jnl'; renaming file to '/namedb/slave/jn-NI0GTceo' for failure analysis and retransferring.
	main.log:Sep  4 07:58:24 dns named[7285]: 04-Sep-2014 07:58:24.509 general: warning: zone rpz.spamhaus.org/IN/internal: unable to load from '/namedb/slave/rpz.spamhaus.org.zone'; renaming file to '/namedb/slave/db-9mSfiPuk' for failure analysis and retransferring.
Once one (or more?) of these errors occurs, named stops responding to clients' queries/lookups/etc.; it's still running, just not responding.
Brute-force restarting bind fixes it, until the next transfer attempt from master->slave -- then it freezes up again.
Noting
	journal rollforward failed: journal out of sync with zone
if I
	rndc sync -clean
	rndc reload
or manually
	'stop' bind
	rm -f slave/*jnl
	'start' bind
it seems to fix all the issues.  The next N transfers are all ok, and bind keeps responding.
Until -- at some "random" point -- the journal gets out of sync again.
Is there a way -- from 'inside' bind -- to monitor for those errors/fails, and automatically sync & reload?  Or should I simply run a sync/reload job externally with cron on a 'fairly frequent' basis?  That seems wasteful ...
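For what it's worth, the external option could be made less wasteful by only acting when the error actually shows up in the log. A rough sketch, assuming the log format shown above; the function names and the RNDC override are made up for illustration, and the recovery step is just the same sync/reload pair I run by hand:

```shell
#!/bin/sh
# Sketch only: the function names and the RNDC override are assumptions for
# illustration, not anything that ships with BIND.

# Read named log lines on stdin and print the unique names of zones that
# logged "journal rollforward failed: journal out of sync with zone".
failed_zones() {
    sed -n 's|.*zone \([^/ ]*\)/IN[^:]*: journal rollforward failed: journal out of sync with zone.*|\1|p' | sort -u
}

# Read log lines on stdin; if any zone reported the error, run the same
# recovery that is done manually above. Set RNDC=echo for a dry run.
recover_if_needed() {
    zones=$(failed_zones)
    [ -n "$zones" ] || return 0
    echo "journal out of sync for: $zones"
    ${RNDC:-rndc} sync -clean && ${RNDC:-rndc} reload
}
```

From cron this might be fed something like the last few hundred lines of main.log on stdin. As written it would fire again on every run while the old error lines are still in that window, so a real job would need to remember what it has already handled.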
This ONLY happens with the slaved rpz zone (I do NOT have any other non-rpz slaves) -- I'm not clear what's causing the out-of-sync (corruption?) in the 1st place.
Suggestions?
Grant
    
    