recursive-clients queue size & clean-up

Tue Aug 17 08:10:58 UTC 2004

Are there any guidelines on how to size the recursive-clients queue?

I have a public recursive server handling around 2000 req/sec.

Is each slot in the recursive-clients queue cleaned up when the timeout 
expires, if there is no response? Or do some slots stay occupied longer? 
It seems to me that once I reach this limit there is really no way back 
to a stable bind. All the cpu gets used, and even if I leave it 
overnight, when the traffic sometimes drops as low as 300-400 req/sec, 
it does not recover. The messages still keep coming from time to time, 
cpu is very high (out of proportion to the number of incoming requests), 
and the number of requests logged to the query.log file is only about 
half of what the box is really supposed to receive (it looks like bind 
or the os is dropping the traffic).
There is no weird traffic. Maybe there was a weird spike, but it should 
recover from that. When I stop and start bind, service resumes, cpu 
drops, traffic returns to its normal rate rather than the roughly half 
rate seen during the problem, and the recursive-clients queue no longer 
overflows.

I recently moved to Solaris 9 with the latest patches (I had Solaris 8 
before). I have tried several ways of compiling bind and several bind 
versions: single-threaded, multithreaded, 32-bit code, 64-bit code. I 
still face this problem from time to time. I managed to trace it back a 
little: it looks like the problem is always preceded by a spike in the 
traffic, like a temporary flood (let's say a few seconds or minutes of 
500-600 req/sec for an unreachable domain). The recursive-clients queue 
fills up and doesn't really recover afterwards...

The server is an e280 with 2 CPUs (Sparc3), running bind 9.2.1 and 9.2.3.

Does rndc flush also flush the recursive-clients queue?

If I assume a 90-second timeout for each slot in the queue, it basically 
means that 11 unreachable req/sec will fill a 1000-slot queue in 90 
seconds. Once it is full, how will it recover? Unless the traffic 
contains fewer than 11 unreachable req/sec, it cannot recover. How many 
such requests are there in public traffic arriving at 2000 req/sec? 
Definitely at least 11, and IMHO more like 100 (one hundred), maybe 200 
or 300... What should the queue size be? 
300/11*1000 = 27272 (twenty-seven thousand)???
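
To make the arithmetic above easier to check, here is a rough sketch in 
Python. It assumes the simplest possible model: every unreachable request 
holds one slot for the full timeout and is then released. The 90-second 
timeout is only my assumption, not a confirmed bind value, and the 
function names are made up for illustration.

```python
import math


def slots_needed(unreachable_per_sec: float, hold_seconds: float) -> int:
    """Steady-state occupied slots: arrival rate times hold time."""
    return math.ceil(unreachable_per_sec * hold_seconds)


def fill_rate(queue_size: int, hold_seconds: float) -> float:
    """Sustained unanswered-request rate that keeps queue_size slots busy."""
    return queue_size / hold_seconds


HOLD_SECONDS = 90   # assumed per-slot timeout, not a confirmed bind value
QUEUE_SIZE = 1000   # queue size used in the estimate above

print(f"break-even rate: {fill_rate(QUEUE_SIZE, HOLD_SECONDS):.1f} req/sec")
for rate in (11, 100, 200, 300):
    print(f"{rate:>3} req/sec -> {slots_needed(rate, HOLD_SECONDS)} slots")
```

Under this model 300 unreachable req/sec would need about 27000 slots. 
The 27272 figure comes out slightly higher only because it rounds the 
break-even rate down to 11. Whether bind really holds a slot this long is 
exactly what I am asking.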

I posted a similar issue some time back, but couldn't draw any conclusion 
from the answers. Does this really seem to be such a minor thing, or is 
there really no clue how to set the queue size because it is not clear 
how it is being used?

Do we need commercial support to get somebody to answer: yes, this is 
how the queue is managed, these are the guidelines for setting it, and 
this is how to recover if it becomes full? The queue size doesn't 
depend purely on the number of users or requests, but also on the 
weirdness of the traffic, which is becoming increasingly weird, 
especially in a public environment. If there were guidelines and a 
general understanding of the queue management, each of us could tune it 
to his own traffic characteristics.

Ladislav