Recursion ceases for 5-10 minutes at random intervals throughout the day

Sat Feb 23 18:56:09 UTC 2008

 > Okay (the graph helps).
 >
 >> The server itself has been relatively flat when it comes to memory
 >> usage.  It sits at about 750M.   I can set up a process memory graph if
 >> needed.
 >
 > Yes, a memory graph would also help.
 > Okay, some additional questions:
 > - One common reason for SERVFAIL caused internally is memory
 >   allocation failure.  are you sure that named does not hit any
 >   (possibly implicit) limitation of memory usage?  For example, (at
 >   least some older versions of) FreeBSD has a relatively small upper
 >   limit of datasize.  When this occurs, you should normally see log
 >   messages like this:
 >   error: could not mark server as lame: out of memory
 >   (and you don't have to raise the log level to see them because these
 >   are generally categorized as a pretty high-level error).

I checked our logging server and haven't seen any references to
"memory" on any of the machines.
     I have added a graph per machine to monitor the memory usage of bind
over time.  It has almost a day of soak time.  I have put it up with a 
few other graphs at:

http://home.fuse.net/springall/bind-022108-022208.html
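
For anyone repeating the log check described above, here is a minimal Python sketch. It assumes the named logs are available as plain-text lines. The function name and the second search pattern are illustrative guesses; only the "out of memory" wording comes from the sample message quoted in this thread.

```python
import re

# "out of memory" matches the sample message quoted in this thread; the
# second alternative is an assumed extra wording, included only as a guess.
MEMORY_RE = re.compile(r"out of memory|memory allocation fail", re.IGNORECASE)


def find_memory_errors(lines):
    """Return (line_number, text) for every line that mentions a memory failure."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if MEMORY_RE.search(line):
            hits.append((number, line.rstrip()))
    return hits
```

It can be fed an open file object, for example `find_memory_errors(open(path))`, where `path` is wherever your syslog or named log file happens to live.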

 > - Memory related troubles of BIND9 caching server are often due to
 >   overhead of cache cleaning.  Can you identify whether cleaning is
 >   performed while you see the problem?  To see this, you may want to
 >   apply the patch attached to this message and add (something like)
 >   the following to the logging statement of named.conf:
 > <snip>

It could be, and I wonder if it is choking on something it is cleaning.
     I'm not sure how to determine when the servers are cleaning their caches.
The server whose graph I posted uses the default cleaning-interval but has
max-cache-ttl and max-ncache-ttl set to 60 seconds.  Three servers have this
configuration, one has a cleaning-interval of 30 minutes and no
max-(n)cache-ttl settings, and one has no cache limiting/cleaning settings
(defaults).
     The problem seems to move around from one group of 
primary/secondary servers to another - with different frequency - very 
strange.   After staring at these graphs for weeks, something makes me 
believe it is a specific record or packet, or non-standard upstream 
response/query, that is making it hiccup.
     I do want to apply the patch you sent.   I will work it into an
upcoming night maintenance on these servers to see what we can find.
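
Once the patch is in place, the cleaning windows could be pulled out of db.log and compared against the SERVFAIL spikes. The Python sketch below assumes the log lines look exactly like the two sample lines quoted further down in this message ("begin cache cleaning" / "end cache cleaning"). The function name and the returned field names are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
#   20-Feb-2008 15:54:46.145 info: begin cache cleaning, mem inuse 33347457
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?:\w+: )?(?P<event>begin|end) cache cleaning, mem inuse (?P<mem>\d+)"
)


def cleaning_runs(lines):
    """Pair each 'begin' line with the next 'end' line and report the duration."""
    runs = []
    start = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a cache cleaning message
        stamp = datetime.strptime(match.group("ts"), "%d-%b-%Y %H:%M:%S.%f")
        mem = int(match.group("mem"))
        if match.group("event") == "begin":
            start = (stamp, mem)
        elif start is not None:
            runs.append({
                "begin": start[0],
                "end": stamp,
                "seconds": (stamp - start[0]).total_seconds(),
                "mem_delta": mem - start[1],
            })
            start = None
    return runs
```

Any run whose begin/end window overlaps a failure period on the graphs would be worth a closer look. Note that `%b` parsing depends on the locale, so this assumes English month abbreviations.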

 > - It would also be helpful if you can periodically keep track of the
 >   number of recursive clients by executing 'rndc status', and
 >   summarize the result in a graph.  Failure of recursion due to

I have added a client connection graph for all hosts (recursive and TCP
clients) to the web page above.  So far the counts are hovering anywhere
between 130 and ~500 across all 6.
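
For reference, a rough Python sketch of how the client counts could be collected for graphing. It assumes 'rndc status' prints lines of the form "recursive clients: 130/1000" or "recursive clients: 130/900/1000" (the number of fields differs between BIND versions), with the current count first and the hard limit last. The function names are hypothetical.

```python
import re
import subprocess

# Assumed output format: "<name>: <current>/<limit>" or
# "<name>: <current>/<soft limit>/<hard limit>", one entry per line.
CLIENT_RE = re.compile(
    r"^(recursive clients|tcp clients): (\d+(?:/\d+)*)[ \t]*$", re.MULTILINE
)


def parse_client_counts(status_text):
    """Extract the current count and the last (hard) limit for each client type."""
    counts = {}
    for name, value in CLIENT_RE.findall(status_text):
        parts = [int(part) for part in value.split("/")]
        counts[name] = {"current": parts[0], "limit": parts[-1]}
    return counts


def sample_rndc_status(rndc="rndc"):
    """Run 'rndc status' once and return the parsed client counts."""
    result = subprocess.run(
        [rndc, "status"], capture_output=True, text=True, check=True
    )
    return parse_client_counts(result.stdout)
```

Calling `sample_rndc_status()` from cron every few minutes and appending the result to a file would give the data points for a graph like the one on the page above.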

 > If none of the above provides any useful hint, I'd like to identify
 > detailed cause of SERVFAIL by applying a simple patch (if your
 > operational environment allows that).

That would be great.  Let me know if the graphs provide any ideas and, if
not, I would be more than willing to apply this patch to find the
exact cause.   In the meantime, I will work on getting the patch you
sent into a running machine during an upcoming maintenance window.

Thanks for your help!

- Bill

--
Bill Springall
Systems Engineer/UNIX Administrator
Email: springall at fuse.net

JINMEI Tatuya wrote:
> At Fri, 15 Feb 2008 14:48:24 -0500,
> Bill Springall <springall at fuse.net> wrote:
> 
>> Correct, the requests themselves were answered but just with, "Server 
>> Failure", messages.   (always seemed to respond quickly)  When it has 
>> happened to me, I was unable to get anything but the error message, 
>> although the graphs indicate ~100qps getting success (perhaps cache?)
>>
>> (Graph: http://home.fuse.net/springall/dns-3.png - 5 min poll)
> 
> Okay (the graph helps).
> 
>> The server itself has been relatively flat when it comes to memory 
>> usage.  It sits at about 750M.   I can set up a process memory graph if 
>> needed.
> 
> Yes, a memory graph would also help.
> 
>> The CPU does jump up to 25% CPU load from 10%, during the last spike I 
>> checked.
>>
>> Unfortunately, I haven't tried Bind without thread support.  We have had 
>> good luck with threads in testing and prod (especially with 2xdual 
>> Opterons), so I haven't tried it.
> 
> Okay, some additional questions:
> - One common reason for SERVFAIL caused internally is memory
>   allocation failure.  are you sure that named does not hit any
>   (possibly implicit) limitation of memory usage?  For example, (at
>   least some older versions of) FreeBSD has a relatively small upper
>   limit of datasize.  When this occurs, you should normally see log
>   messages like this:
>   error: could not mark server as lame: out of memory
>   (and you don't have to raise the log level to see them because these
>   are generally categorized as a pretty high-level error).
> 
> - Memory related troubles of BIND9 caching server are often due to
>   overhead of cache cleaning.  Can you identify whether cleaning is
>   performed while you see the problem?  To see this, you may want to
>   apply the patch attached to this message and add (something like)
>   the following to the logging statement of named.conf:
> 
>         channel dblog {
>                 file "db.log" versions 5 size 10M;
>                 severity debug 1;
>                 print-severity yes;
>                 print-time yes;
>         };
>         category database { dblog; };
> 
>   Then you'll see something like this in the "db.log" file (under the
>   appropriate directory):
> 
>   20-Feb-2008 15:54:46.145 info: begin cache cleaning, mem inuse 33347457
>   20-Feb-2008 15:54:46.607 info: end cache cleaning, mem inuse 33380881
> 
>   The attached patch raises the required log level for the cleaning
>   related log messages in order to keep the entire log output
>   reasonably quiet.
> 
>   Frankly, however, I don't think the cleaning overhead is the main
>   reason for the SERVFAIL since the overhead normally doesn't result
>   in the error; it would rather cause query drop.
> 
> - It would also be helpful if you can periodically keep track of the
>   number of recursive clients by executing 'rndc status', and
>   summarize the result in a graph.  Failure of recursion due to
>   recursive-clients quota would cause SERVFAIL errors, although I
>   doubt this is the case for you as you seem to specify a pretty high
>   value for this variable.
> 
> If none of the above provides any useful hint, I'd like to identify
> detailed cause of SERVFAIL by applying a simple patch (if your
> operational environment allows that).
> 
> Thanks,
> 
> ---
> JINMEI, Tatuya
> Internet Systems Consortium, Inc.