Strange: My Bind (8.4.6) freezes randomly
sromero at servicom2000.com
Wed Jan 10 16:21:38 UTC 2007
I have a strange problem with BIND 8.4.6 on a Compaq ProLiant ML350
running Debian Sarge Linux as a primary DNS server. The server had been
running with no problems until a couple of days ago, when it started to
"freeze" (only the DNS service) and stop answering queries.
# named -v
named 8.4.6-REL-NOESW Wed Aug 2 14:56:29 CEST 2006
English is not my native language, so I will try to explain the problem
and give all the information and facts I can.
I hope some of you find my problem challenging enough to give me a
hand and help me find where the problem is.
The machine (brand-new hardware) was installed 5 months ago and
runs as a DNS+DHCP+RADIUS server. I have not had any incident since
we installed it. We keep it updated (kernel, security updates) and
the machine has not needed a reboot since we installed it.
Suddenly, since 4th January 2007, we have had strange problems with the
named/bind daemon. It works perfectly until, at random, it freezes.
DNS queries continue arriving at the machine but bind doesn't resolve
them. I can't even resolve queries launched from the machine itself:
(on the DNS server machine):
# dig www.google.com @My_IP
; <<>> DiG 9.2.4 <<>> www.google.com @My_IP
;; global options: printcmd
;; connection timed out; no servers could be reached
All I can do is "kill -9" the named process and start it again, and then
it works correctly again. I have about 7 DNS machines, I've been
maintaining them since 1999, and I have never seen such a "freeze".
Bind on this machine never had any problem UNTIL 4th January, so
I'm starting to think that maybe it's some kind of new (or old)
DoS bug. Bind doesn't log anything strange to daemon.log or
named.run (started with -d 4), it just stops logging (and working).
As the server has its own IP address configured as the primary ns in
resolv.conf, when this happens I can't log in on the console or
execute any su or sudo command, so I keep a permanent root console
session open to fix it (kill -9, /etc/init.d/bind start).
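The manual check above (a dig against the server's own address) can be wrapped in a small probe so the freeze is noticed without waiting for a login to hang. This is only a minimal sketch: the function names dns_alive and report_status are made up for illustration, the 3-second limit is an arbitrary choice, and My_IP is the same placeholder used in the dig example. It treats any reply that carries a "status:" header line as proof that named is still answering.

```shell
#!/bin/sh
# Hypothetical helpers (not part of bind): probe a name server with dig.

# Return 0 if the server at $1 sends back any DNS reply for name $2.
# +time=3 +tries=1 makes dig give up after a single 3-second attempt.
dns_alive() {
    server=$1
    name=${2:-www.google.com}
    # A received reply always prints a header containing "status:";
    # a timeout prints only "connection timed out".
    dig +time=3 +tries=1 "@$server" "$name" 2>/dev/null | grep -q 'status:'
}

# Print a timestamped line describing whether server $1 is answering.
# Returns 1 when it is not, so callers can react (for example, restart named).
report_status() {
    if dns_alive "$1"; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') named on $1: answering"
    else
        echo "$(date '+%Y-%m-%d %H:%M:%S') named on $1: NOT answering"
        return 1
    fi
}
```

It could be called from the permanent root session as "report_status My_IP", for example in a loop with a sleep between probes.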
How I can reproduce it
1.- While trying to track down the problem I also found the following:
The SIGWINCH signal is supposed to switch on "query logging" to
syslog. If I do a kill -s SIGWINCH `pidof named`, a few seconds/minutes
later I get the same "named freeze" that I otherwise get randomly.
2.- I wrote a script that does about 200 DNS queries to DNS blacklists
using the rblcheck command. Something like:
for i in `seq 1 $j`; do
    timeout -s TERM 10 rblcheck -t -c -s sbl-xbl.spamhaus.org \
        -s bl.spamcop.net -s list.dsbl.org -s dnsbl.ahbl.org \
        -s combined.njabl.org -s dnsbl.sorbs.net My_IP
done
Sometimes, when executing the script, bind freezes and the "timeout"
command kills the queries because they get no answer:
rbl.sh: line 19: 20709 Terminated timeout -s TERM 10 rblcheck -t \
-c -s sbl-xbl.spamhaus.org -s bl.spamcop.net -s list.dsbl.org -s \
dnsbl.ahbl.org -s combined.njabl.org -s dnsbl.sorbs.net My_IP
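To put a number on "a few seconds/minutes later" in step 1 (or on how far the rblcheck loop in step 2 gets), the server can be polled until it stops answering. The following is a rough sketch, not a tested tool: seconds_until_freeze is a hypothetical name, the 5-second interval and 120-poll limit are arbitrary defaults, and the reported time is approximate because it ignores how long each probe itself takes.

```shell
#!/bin/sh
# Hypothetical helper: poll server $1 every $2 seconds, at most $3 times.
# Prints the approximate number of seconds elapsed when the first probe
# gets no reply and returns 0; returns 1 if every probe was answered.
seconds_until_freeze() {
    server=$1
    interval=${2:-5}
    max_polls=${3:-120}
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        # A reply of any kind contains a "status:" header line.
        if ! dig +time=3 +tries=1 "@$server" www.google.com 2>/dev/null \
                | grep -q 'status:'; then
            echo $((polls * interval))
            return 0
        fi
        polls=$((polls + 1))
        sleep "$interval"
    done
    return 1
}
```

A run for step 1 would be: send the signal with kill -s SIGWINCH `pidof named`, then immediately call "seconds_until_freeze My_IP" and note the value it prints.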
Things I tried before contacting this list
- I compiled the latest bind version (8.4.7) and the freeze happens with it too.
- I copied the "named" binary from another machine (8.4.6 compiled on
a Woody machine, now running on my Sarge machine) and the freeze happens
much less often. ***But the SIGWINCH signal also freezes the DNS***.
- The SIGWINCH signal works perfectly on all my other DNS servers,
it just starts logging queries until I disable it.
- People from our networking staff put a sniffer on the switch
port the machine is connected to, and it showed only that when the
freeze happens, the machine continues receiving DNS queries but
doesn't answer any of them.
- I can't find any kind of error message in the system or kernel
logs, and all other services are working; it's just the DNS
service that fails.
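The sniffer observation (queries keep arriving, no answers leave) can also be checked on the server itself by counting both directions in a packet capture. This is a sketch under assumptions: count_dns_traffic is a hypothetical helper, it expects the text output of "tcpdump -n" on its standard input, and it relies on tcpdump printing addresses as "address.port" with a numeric port 53. The interface name eth0 in the usage line is also an assumption.

```shell
#!/bin/sh
# Hypothetical helper: read "tcpdump -n" text output on stdin and count
# packets sent to port 53 of address $1 (queries in) and packets sent
# from port 53 of address $1 (responses out).
count_dns_traffic() {
    ip=$1
    awk -v ip="$ip" '
        # index() does a literal substring match, so the dots in the
        # address are not treated as regex wildcards.
        index($0, " > " ip ".53:") { q++ }
        index($0, " " ip ".53 > ") { r++ }
        END { printf "queries_in=%d responses_out=%d\n", q, r }
    '
}
```

For example: tcpdump -n -l -i eth0 -c 200 udp port 53 | count_dns_traffic My_IP. During a freeze, one would expect queries_in to keep growing while responses_out stays at 0.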