IXFR journal dump making 9.2.4 server non-responsive

Thu Dec 16 19:19:40 UTC 2004

In article <cpsk8r$22ej$1 at sf1.isc.org>, Derek D. wrote:
> We subscribe to an e-mail DNS RBL that we zone transfer via IXFR and
> have noticed what we believe to be a correlation between BIND ceasing
> to answer queries and the dumping of the journal file to disk.
> 
> The server is a Sun v120 with 2GB of RAM running Solaris 8 and BIND
> 9.2.4.
> 
> I noticed the BIND 9 ARM mentions that the default time for dumping the
> journal file to disk is 15 minutes, but we seem to be seeing it at
> about 20 minutes.  For example the end of transfer log entry for the
> zone is at 00:56:25 and all is well until 01:16:51 when log entries
> stopped.  Then at 01:20:45 queries start getting logged again.  During
> this outage the machine is running pretty close to 100% CPU and a truss
> shows that a new zone file is being dumped to disk.  Normally the
> machine is running with a load average of about 0.2.  A fresh start of
> BIND takes about 8 to 10 minutes to load this zone plus the others that
> it has.  The RBL zone file is about 102MB.

Presumably you mean the zone file is being dumped, rather than the
journal - the journal is constantly updated as updates to the zone come
in.

I've seen this problem before on a large zone slaved using IXFR. The
problem appears to be that, 15 minutes after an update, BIND writes
out the zone file. While it's doing this, the in-memory copy is locked,
which prevents access to it. Any thread which attempts to read this
copy blocks until the lock is released, and while it is blocked it
can't do any other work. With a zone of ordinary size the file would
be written out and unlocked in a few milliseconds, so this wouldn't
be an issue; with a 102MB zone the dump takes minutes.

If sufficient queries are made against the zone in question, all the
threads on your server will be taken up waiting for the zone to finish
writing, and you'll stop responding to all queries.
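
To make the starvation concrete, here is a toy model in Python. It is
not BIND's code, just a sketch of the behaviour described above: a
fixed pool of worker threads, one lock standing in for the lock on the
big zone, and a "dump" that holds that lock for a while. All of the
names (answered_during_dump, zone_lock and so on) are made up for the
illustration, and the half-second sleep stands in for minutes of disk
I/O.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def answered_during_dump(workers, queries, dump_seconds=0.5):
    """Count queries answered while a simulated zone dump holds the lock."""
    zone_lock = threading.Lock()       # stands in for the lock on the big zone
    dump_started = threading.Event()
    dump_finished = threading.Event()

    def dump_zone():
        with zone_lock:
            dump_started.set()
            time.sleep(dump_seconds)   # stands in for minutes of disk I/O
            dump_finished.set()        # set before the lock is released

    def answer(zone):
        if zone == "big":
            with zone_lock:            # blocks for the whole dump
                pass
        # True if this answer went out before the dump completed.
        return not dump_finished.is_set()

    dumper = threading.Thread(target=dump_zone)
    dumper.start()
    dump_started.wait()                # submit queries only once the lock is held
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(answer, queries))
    dumper.join()
    return sum(results)


# Two queries for the big zone arrive first, then two for other zones.
queries = ["big", "big", "small", "small"]
starved = answered_during_dump(workers=2, queries=queries)
spare = answered_during_dump(workers=4, queries=queries)
print(starved, spare)
```

With two workers, both end up parked on the lock and the queries for
the other zones sit in the queue until the dump finishes, so nothing is
answered during the dump. With four workers, two are still free and the
other zones get answered, so on an otherwise idle machine this prints
"0 2". Send four queries for the big zone first and the four-worker
case starves too, which is why extra threads only delay the problem.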

> Does the above make any sense?
> Would a dual CPU box help this?

Not really. You'll be able to have more threads, but if your server
is busy enough, they'll still all eventually block. Disk IO speed is
probably the real limiting factor.

> Any ideas or suggestions?

There are three options: increase the number of worker threads (the -n
option to named; beware of overloading the server if it's busy,
though); remove the "file" directive from the zone config (if you can
live with having to refetch the entire zone every time you start the
nameserver); or put the file on a memory filesystem, syncing it to
disk every 15 minutes or so and copying it back after a reboot.
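
For the memory filesystem option, the periodic sync could be something
along these lines. This is only a sketch: the function names are made
up, and it does not coordinate with named in any way, so whether a copy
taken while named is part-way through writing the file is usable is
something you'd have to check for yourself. It writes the snapshot to a
temporary file and renames it into place, so the copy on disk is never
left half-written.

```python
import os
import shutil
import tempfile


def snapshot(src, dst):
    """Copy src to dst so that dst is never seen half-written."""
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # The temporary file must live on the same filesystem as dst
    # for the final rename to be atomic.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".zone-snapshot-")
    try:
        with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
            shutil.copyfileobj(inp, out)
            out.flush()
            os.fsync(out.fileno())     # make sure the data is on disk
        os.replace(tmp, dst)           # atomic rename within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def restore_if_missing(src, dst):
    """After a reboot, repopulate the memory filesystem from the disk copy."""
    if os.path.exists(src) and not os.path.exists(dst):
        shutil.copyfile(src, dst)
        return True
    return False
```

You'd call snapshot() from cron every 15 minutes or so with the zone
file on the memory filesystem as src and a path on real disk as dst,
and call restore_if_missing() with the arguments the other way round
from whatever starts named at boot, before named itself runs.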

None of these are ideal solutions. I wish I could tell you how I solved
the problem when I saw it, but I ended up not having to slave the
huge zone, so the issue went away.

Brian