Bind Timeout on External Names (8.2.3)

Thu Oct 4 12:28:50 UTC 2001

We have been running two nameservers for a long time, using Bind 8.2.3
under Red Hat Linux 6.2.  These machines have run 8.2.3 since about
January and have run Bind for over 3.5 years with *NO* problems.

Starting around noon on Tuesday, BOTH of these machines developed a timeout
problem which manifested itself in people being unable to resolve outside
names.  Names for which the machines are authoritative resolve properly,
but we get timeouts on the external names.

I did some nslookup and dig tests on ns1 (132.223.4.1).

I had 'nameserver 0.0.0.0' and 'nameserver 132.223.4.7' lines in
resolv.conf.  When the problem occurred on ns1, queries to 0.0.0.0
always failed and 132.223.4.7 would respond, at least if *it* hadn't
already developed the problem.  I tried adding 'nameserver 132.223.4.1'
explicitly and, not surprisingly, it timed out just as 0.0.0.0 did.

Here's an example (I left just the 0.0.0.0 nameserver line in resolv.conf
at this point to shorten the messages):
***************************************************************************
[root@ns1 /etc]# dig @ns.cadence.com ftp.cadence.com.

; <<>> DiG 8.3 <<>> @ns.cadence.com ftp.cadence.com. 
; Bad server: ns.cadence.com -- using default server and timer opts
; (1 server found)
;; res options: init recurs defnam dnsrch
;; res_nsend to server default -- 0.0.0.0: Connection timed out
------

If I stop and restart named, the problem goes away for a while:

------
[root@ns1 /etc]# service named restart
Shutting down named: [  OK  ]
Starting named: [  OK  ]

[root@ns1 /etc]# dig @ns.cadence.com ftp.cadence.com.

; <<>> DiG 8.3 <<>> @ns.cadence.com ftp.cadence.com. 
; (1 server found)
;; res options: init recurs defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 2
;; QUERY SECTION:
;;      ftp.cadence.com, type = A, class = IN

;; ANSWER SECTION:
ftp.cadence.com.        15M IN A        158.140.2.2

;; AUTHORITY SECTION:
cadence.com.            15M IN NS       ns.cadence.com.
cadence.com.            15M IN NS       auth00.ns.uu.NET.

;; ADDITIONAL SECTION:
ns.cadence.com.         15M IN A        158.140.1.253
auth00.ns.uu.NET.       1d17h43m3s IN A  198.6.1.65

;; Total query time: 118 msec
;; FROM: ns1.genrad.com to SERVER: ns.cadence.com  158.140.1.253
;; WHEN: Wed Oct  3 09:18:43 2001
;; MSG SIZE  sent: 33  rcvd: 139

***************************************************************************
These machines are behind a firewall which allows only port 25 and port 53
connections to them.  Temporarily we have blocked *all* outside connections
to ns2, but the problems continue on BOTH machines, so the problem doesn't
seem to result from some outside attack.

We have the following forwarder info in named.conf as recommended by
Genuity, our connectivity provider:
----------------------------------------
options {
        directory "/var/named"; // running directory for named
        forwarders {
                4.2.2.1;
                4.2.2.2;
                4.2.2.3;
        };
};
----------------------------------------
Here are some observations:

When the problem occurs, pointing nslookup, dig, or host at the forwarders
resolves the name fine.

As the problem develops, external name resolution seems to start taking
longer and then fails completely.
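
One way to put numbers on that slowdown is to time the same query against
the local named and against a forwarder.  The following is only a minimal
sketch, assuming Python 3 is available on the box; the function names
build_query and time_query are made up for illustration, and the probe
sends a plain UDP A-record query without going through resolv.conf.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet: one A/IN question, recursion desired."""
    # Header: id, flags (RD bit set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = [part for part in name.rstrip(".").split(".") if part]
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def time_query(server, name, timeout=5.0, port=53):
    """Return seconds until a matching reply arrives, or None on timeout/error.

    Only the reply ID is checked; the answer itself is not parsed.
    """
    query = build_query(name)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        start = time.monotonic()
        sock.sendto(query, (server, port))
        reply, _ = sock.recvfrom(512)
        if reply[:2] != query[:2]:
            return None
        return time.monotonic() - start
    except OSError:  # includes socket timeouts
        return None
    finally:
        sock.close()
```

Calling time_query("132.223.4.1", "ftp.cadence.com.") and
time_query("4.2.2.1", "ftp.cadence.com.") at intervals and logging both
results would show whether the local server degrades while the forwarder
stays fast.  For ftp.cadence.com. the query built here is 33 bytes, the
same size dig reports as "MSG SIZE  sent: 33" above.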

Sometimes dig and/or host will resolve the name, but nslookup will not,
or "dig ftp.cadence.com." or "host ftp.cadence.com." will work, but
"dig @ns1.genrad.com. ftp.cadence.com." or 
"host ftp.cadence.com. ns1.genrad.com." will fail on ns1 (similar result
for ns2).

Stopping and restarting named "fixes" the problem for a while.  "A while"
might be anywhere from less than 1/2 hour to several hours.  Since the
problem occurs on two separate machines, I suspected it was related to
some problem in our network, but if that's the case, why does restarting
named make the problem go away?  Our network people didn't see anything
unusual that would explain the problem.

It appeared that when the problem occurred, it would happen on both
machines at once, but I currently have a crontab job that tries every 5
minutes to resolve an outside address and restarts named when it fails ---
the resulting log from overnight last night shows that both machines
didn't always fail at once.  Of course the different levels of failure
mentioned above may have made the test fail on one machine and not on
the other.
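
For reference, a check of this kind can be sketched roughly as below.  This
is not the actual crontab job, only an illustration assuming Python 3.7 or
later, with dig and the "service" command on the PATH; the names resolves
and check_once are hypothetical.  It treats the presence of an
"ANSWER SECTION" in dig's output as success, and it timestamps every result
so the logs from ns1 and ns2 can be compared line by line.

```python
import subprocess
import time


def resolves(name, server, run=subprocess.run, timeout=15):
    """Return True if dig prints an answer section for name when asking server."""
    try:
        result = run(
            ["dig", "@" + server, name],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return "ANSWER SECTION" in result.stdout


def check_once(name, server, run=subprocess.run, log=print):
    """Test one lookup; log the outcome and restart named if it failed."""
    ok = resolves(name, server, run=run)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if ok:
        log(f"{stamp} {server} OK")
    else:
        log(f"{stamp} {server} FAILED, restarting named")
        run(["service", "named", "restart"])
    return ok
```

Run from cron every 5 minutes as, for example,
check_once("ftp.cadence.com.", "127.0.0.1"), this asks the local named
directly rather than whatever resolv.conf lists, which avoids the case
where a second nameserver line masks the failure.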

We've been unable to correlate the appearance of this problem with any
other change or event.

***************************************************************************
Questions:

    Has anybody seen a problem like this?

    Any suggestions about what I should look at or ask others to check?

    Are there any further data I could provide to help the diagnosis?

***************************************************************************

I'd appreciate any help I could get with this problem.  It's causing
serious pain! :-(

        pete peterson
        GenRad, Inc.
        7 Technology Park Drive
        Westford, MA 01886-0033

        petersonp at genrad.com or rep at genrad.com
        +1-978-589-7478 (GenRad);
        +1-978-589-2088 (Closest FAX); +1-978-589-7007 (Main GenRad FAX)