2010/3/22 Cathy Almond <span dir="ltr"><<a href="mailto:cathya@isc.org" target="_blank">cathya@isc.org</a>></span><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>Fabien Seisen wrote: </div></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>
> yes, max-cache-size 512M but named process takes ~900MB<br>
<br>
</div>The extra memory is for keeping track of recursive clients (i.e.<br>
in-progress client queries).<br></blockquote><div><br></div><div>ok </div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
This doesn't sound like a hugely loaded server, </blockquote><div><br></div><div>Exactly. In my own tests (with "real life" queries), the server can handle</div><div> ~70000 queries/s with a response time of ~1 ms at 70% CPU and no </div>
<div>packet loss.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
else it's somewhat throttled (not particularly large cache and probably default </blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> limit on recursive clients). What kind of query rates do you have? Do you get<br>
any logging that suggests resource problems? If so, you might need to<br>
increase some of the limits.<br></blockquote><div><br></div><div>We have a pool of several more or less identical servers with a load balancer in front.</div><div><br></div><div>On average, each server gets 1800 queries/s, and 4000 at peak.</div>
<div><br></div><div>The problem occurs every few weeks and never on all servers at the same time.</div><div><br></div><div>The recursive clients configuration is unmodified (rndc status: recursive clients: 188/2900/3000) and we have</div>
<div>- on avg: 200 recursive clients</div><div>- at peak 600</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
It's intriguing that you're seeing the same issues on two bind versions<br>
and two OS (and that other people's experience is different from yours)<br></blockquote><div><br></div><div>only Solaris 10</div><div>- Solaris 10 U6 with bind 9.5.1-P3 with threads compiled with SUNSpro 12</div><div>
- Solaris 10 U6 with bind 9.6.2 with threads compiled with gcc</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
- it suggests to me that it's specific to your configuration or client<br>
base/queries or your environment.<br></blockquote><div><br></div><div>We get real-life queries from customers (evil?).</div><div><br></div><div>A simple "rndc flush" revives named.</div><div><br></div><div>Perhaps a badly formatted packet freezes named or creates a cache deadlock. </div>
<div><br></div><div>Can something go wrong in the cache?</div><div><br></div><div>I am not fluent with core files, but I have got one in my pocket.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
For troubleshooting I'd start by looking at the logging output - if<br>
you've got any categories going to null, un-suppress them temporarily;<br>
and add query-errors (see 9.6.2 ARM). Then perhaps do some sampling of<br>
network traffic (perhaps there's a UDP message size/fragmentation issue)<br>
to see what's happening (or not).<br></blockquote><div><br></div><div>All categories go to non-null channels, and we do not use any 9.6.2-specific configuration.</div><div>I did not notice any weird log messages (besides the regular: shutting down due to TCP receive error: 202.96.209.6#53: connection reset).</div>
<div><br></div><div>here is our log config:</div><div> category client { client.log; };</div><div> category config { config.log; default_syslog; };</div><div> category database { database.log; default_syslog; };</div>
<div> category default { default.log; default_syslog; };</div><div> category delegation-only { delegation-only.log; };</div><div> category dispatch { dispatch.log; };</div><div> category general { default.log; };</div>
<div> category lame-servers { lamers.log; };</div><div> category network { network.log; };</div><div> category notify { notify.log; default_syslog; };</div><div> category queries { queries.log; };</div><div> category resolver { resolver.log; };</div>
<div> category security { security; };</div><div> category unmatched { unmatched.log; };</div><div> category update { update.log; };</div><div> category xfer-in { xfer-in.log; default_syslog; };</div><div> category xfer-out { xfer-out.log; default_syslog; };</div>
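<div><br></div><div>For reference, adding the query-errors category suggested above might look roughly like this inside our existing logging block. This is only a sketch: the channel name, file name, rotation settings and debug level are assumptions, not something we run today.</div>
<pre>
// Sketch only: these lines would go inside the existing logging { ... } block.
// The channel name "query-errors.log" and its file settings are hypothetical.
channel query-errors.log {
    file "query-errors.log" versions 3 size 10m;  // assumed rotation limits
    severity debug 2;                             // assumed level; details appear at debug 1 and above
    print-time yes;
};
category query-errors { query-errors.log; };
</pre>
<div>Since 9.5.1-P3 may not know this category, it would only apply to the 9.6.2 servers.</div>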
<div><br></div></div>-- <br>Fabien<br>