occasional jnl out-of-sync of slaved zones kills bind's response; re-sync or .jnl removal is needed. why?

grantksupport at operamail.com
Sat Sep 6 17:13:00 UTC 2014


Hi,

On my BIND 9.10.0-P2 instance (linux/64), I use slaved rpz zones.

Usually, all works as expected.

On an infrequent basis, I start getting errors related to that zone, e.g.:

	main.log:Sep  4 07:58:24 dns named[7285]: 04-Sep-2014 07:58:24.509 general: error: zone rpz.spamhaus.org/IN/internal: journal rollforward failed: journal out of sync with zone
	main.log:Sep  4 07:58:24 dns named[7285]: 04-Sep-2014 07:58:24.509 general: warning: zone rpz.spamhaus.org/IN/internal: unable to load from '/namedb/slave/rpz.spamhaus.org.zone.jnl'; renaming file to '/namedb/slave/jn-NI0GTceo' for failure analysis and retransferring.
	main.log:Sep  4 07:58:24 dns named[7285]: 04-Sep-2014 07:58:24.509 general: warning: zone rpz.spamhaus.org/IN/internal: unable to load from '/namedb/slave/rpz.spamhaus.org.zone'; renaming file to '/namedb/slave/db-9mSfiPuk' for failure analysis and retransferring.

Once one (or more?) of these errors occurs, named stops responding to clients' queries/lookups/etc.; it's still running, just not responding.

Brute-force restarting bind fixes it, until the next transfer attempt from master->slave -- then it freezes up again.

Noting

	journal rollforward failed: journal out of sync with zone

if I

	rndc sync -clean
	rndc reload

or manually

	'stop' bind
	rm -f slave/*jnl
	'start' bind

it seems to fix all the issues.  The next N transfers are all ok, and bind keeps responding.

Until -- at some "random" point -- the journal gets out of sync again.

Is there a way -- from 'inside' bind -- to monitor for those errors/fails, and automatically sync & reload?  Or should I simply run a sync/reload job externally with cron on a 'fairly frequent' basis?  That seems wasteful ...
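
For illustration only, the external option could look roughly like the sketch below. It scans log text for the "journal rollforward failed" message quoted above, extracts the zone/class/view triplet, and runs the same two rndc commands per affected zone. The function names failed_zones and resync_zones and the RNDC override variable are made up for this sketch, and it assumes the log line format shown in the excerpt and that rndc still answers while named is wedged, which may not hold.

```shell
#!/bin/sh
# Sketch only: helper names and the RNDC variable are assumptions,
# not part of BIND. The log format is taken from the excerpt in this post.

# Read log text on stdin; print "zone class view" once for each zone
# that logged a journal rollforward failure.
failed_zones() {
    sed -n 's|.*zone \([^/ ]*\)/\([^/ ]*\)/\([^: ]*\): journal rollforward failed.*|\1 \2 \3|p' | sort -u
}

# Read "zone class view" lines on stdin and run the two commands
# from this post for each one. Set RNDC=echo for a dry run.
resync_zones() {
    while read -r zone class view; do
        "${RNDC:-rndc}" sync -clean "$zone" "$class" "$view"
        "${RNDC:-rndc}" reload "$zone" "$class" "$view"
    done
}
```

A cron job could then run something like `tail -n 200 main.log | failed_zones | resync_zones`, where the log path and the line count are placeholders. Feeding it only recent lines matters: old failures left in the log would otherwise trigger a re-sync on every run.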

This ONLY happens with the slaved rpz zone (I do NOT have any other non-rpz slaves) -- I'm not clear what's causing the out-of-sync (corruption?) in the first place.

Suggestions?

Grant

