recursive queries fail with high load?
tsimbonis at forthnet.gr
Mon Feb 26 15:00:23 UTC 2007
Chris Michels wrote:
> I have 3 DNS servers running bind 9.3.2. Two of them are failing to resolve
> recursive queries. Both of these servers have a higher load because they
> are used by our spam filtering software. I have increased the
> recursive-clients option on both servers. It seems like recursive queries
> are just taking a long time and timing out. What is going on here or where
> should I be looking for what is wrong?
> A dig of a random name returns:
> [root@ruby named]# dig www.websudoku.com @ns2.nau.edu
> ; <<>>DiG 9.2.4 <<>>www.websudoku.com @ns2.nau.edu
> ; (1 server found)
> ;; global options: printcmd
> ;; connection timed out; no servers could be reached
> But if I set the timeout high it returns:
> [root@ruby named]# dig +time=240 www.websudoku.com @ns2.nau.edu
Unfortunately, we seem to face the same problem with BIND 9.3.3. After
2-3 days of uptime, for no apparent reason, all answers take too long
and usually time out.
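
To put a number on "too long" without relying on dig's default timeout, a single recursive query can be timed from a script. The following is only a minimal sketch: the helper names `build_query` and `time_query` are made up for illustration, it sends one plain A query over UDP with the recursion-desired bit set, and it treats any reply at all as an answer without checking its contents.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record (class IN)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length, terminated by a zero byte.
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    # QTYPE = 1 (A), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(server, name, timeout=5.0):
    """Return seconds until any UDP reply arrives, or None if the wait times out."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(build_query(name), (server, 53))
        try:
            sock.recvfrom(4096)
        except socket.timeout:
            return None
        return time.monotonic() - start


# Example call (needs network access to the server under test):
#   elapsed = time_query("ns2.nau.edu", "www.websudoku.com", timeout=240)
```

Run periodically against the affected server, this would show whether latency climbs gradually over the 2-3 days or jumps all at once.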
When this happens, we notice a drop in successful queries in
named.stats, the machine load jumps above 1 (normally around 0.50), the
named process starts consuming 100% of CPU (normally under 30%), and
memory usage stays the same.
You can see relevant graphs at
The problem started on Sun at 12:00 and I restarted the server on Mon at
16:30. See the load graph, and all bind9 related graphs in category "Other".
A tail -f of the query log shows queries being processed successfully, but
they are probably the ones from clients with a long timeout value. We use
two views, and this happens in both of them.
I tried increasing the debug level to 99, but have found nothing useful so far.
# rndc status
number of zones: 20
debug level: 99
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 25/10000
tcp clients: 0/100
server is up and running
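
Note that recursive clients is at 25/10000, so the recursive-clients limit does not look like the bottleneck here. To track these counters over time instead of checking by hand, the status text could be parsed with something like the sketch below. It assumes the "name: used/limit" layout seen in the rndc status output above, and `parse_client_usage` is a hypothetical helper name, not part of BIND.

```python
import re


def parse_client_usage(status_text):
    """Return {counter name: (used, limit)} for lines shaped like 'name: used/limit'."""
    usage = {}
    for line in status_text.splitlines():
        m = re.match(r"\s*([a-z ]+):\s*(\d+)/(\d+)\s*$", line)
        if m:
            usage[m.group(1).strip()] = (int(m.group(2)), int(m.group(3)))
    return usage


# Sample lines copied from the status output in this message.
sample = """recursive clients: 25/10000
tcp clients: 0/100
server is up and running"""

usage = parse_client_usage(sample)
for name, (used, limit) in usage.items():
    print(f"{name}: {100.0 * used / limit:.2f}% of limit")
```

Feeding it the captured output of rndc status at regular intervals would show whether the client count ever approaches the limit before the slowdown starts.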
This is the top output at the time of the problem:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16380 named 23 0 870m 838m 2040 S 100 41.4 2502:39 named
The only workaround so far is to restart BIND.
Thoughts/suggestions on how to debug this further are more than welcome.