Caching only nameserver fails to resolve external zones periodically

Curtis Rempel curtis at telus.net
Mon May 17 18:59:44 UTC 2004


On Mon, 17 May 2004 18:07:17 +0000, phn wrote:

> Curtis Rempel <curtis at telus.net> wrote:
>> Hi,
> 
>> I've got a caching name server which also handles a zone (.lan) on an
>> internal 192.168.1.0/24 network.   Both internal and external lookups work
>> fine as I have a forwarder entry defined in 
>> /var/named/chroot/etc/named.conf
> 
>> That is, until "something" happens which causes the external lookups to
>> fail.  The internal zone resolution still works.  As far as I can tell,
>> the forwarder entry does not respond, and named then starts crawling
>> through the root name servers and eventually gives up.
> 
>> Here's some sample output (from Fedora Core 1 Linux and bind 9.2.2.P3-9):
> 
>> When everything is working (i.e. immediately after a 'service named
>> restart' command), the following 'host' command works.  However, when
>> things aren't working, I get the following output:
> 
>> [root at vault root]# host www.telus.net
>> ;; connection timed out; no servers could be reached
> 
>> This can be rectified by restarting the name server as above, but only
>> for a while (how long seems to vary), and then external lookups hang
>> again.  The internal zone information can still be resolved.
> 
>> When the system is not responding to external zone lookups, a tcpdump
>> looks like this with the above 'host' command:
> 
>> 15:51:01.996338 vault.lan.33305 > ns7so.cg.shawcable.net.domain:  35946+ [1au] A? www.telus.net. (42) (DF)
>> 15:51:03.728476 vault.lan.33305 > f.root-servers.net.domain:  50741 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:06.008121 vault.lan.33305 > 198.41.0.4.domain:  14024 [1au] A? www.telus.net. (42) (DF)
>> 15:51:07.747854 vault.lan.33305 > G.ROOT-SERVERS.NET.domain:  52631 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:10.027489 vault.lan.33305 > 128.9.0.107.domain:  65124 [1au] A? www.telus.net. (42) (DF)
>> 15:51:11.767237 vault.lan.33305 > 128.63.2.53.domain:  65468 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:14.046919 vault.lan.33305 > 192.33.4.12.domain:  65502 A? www.telus.net. (31) (DF)
>> 15:51:15.786573 vault.lan.33305 > 192.36.148.17.domain:  32751 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:18.066210 vault.lan.33305 > d.root-servers.net.domain:  55260 A? www.telus.net. (31) (DF)
>> 15:51:19.038994 laser.lan.1024 > vault.lan.domain:  27316 A? fsa.cpsc.ucalgary.ca. (50)
>> 15:51:19.805969 vault.lan.33305 > k.root-servers.net.domain:  13778 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:22.085587 vault.lan.33305 > E.ROOT-SERVERS.NET.domain:  3376 A? www.telus.net. (31) (DF)
>> 15:51:23.825310 vault.lan.33305 > 202.12.27.33.domain:  1688 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:26.104947 vault.lan.33305 > f.root-servers.net.domain:  844 A? www.telus.net. (31) (DF)
>> 15:51:27.844754 vault.lan.33305 > j.root-servers.net.domain:  33190 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:30.124317 vault.lan.33305 > G.ROOT-SERVERS.NET.domain:  49363 A? www.telus.net. (31) (DF)
>> 15:51:31.864043 vault.lan.33305 > l.root-servers.net.domain:  18756 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:34.143694 vault.lan.33305 > 128.63.2.53.domain:  4724 A? www.telus.net. (31) (DF)
>> 15:51:35.883596 vault.lan.33305 > ns7so.cg.shawcable.net.domain:  2362+ PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:38.163051 vault.lan.33305 > 192.36.148.17.domain:  1181 A? www.telus.net. (31) (DF)
>> 15:51:40.902620 vault.lan.33305 > 198.41.0.4.domain:  24263 PTR? 182.181.179.142.in-addr.arpa. (46) (DF)
>> 15:51:42.182418 vault.lan.33305 > k.root-servers.net.domain:  22529 A? www.telus.net. (31) (DF)
> 
>> The first entry above (15:51:01) indicates that the request is being
>> forwarded to the "forwarders" entry which resolves to
>> ns7so.cg.shawcable.net
> 
>> When external resolution is working, this is the last entry as
>> ns7so.cg.shawcable.net provides the answer.
> 
>> In a "hung" lookup, the output is as shown above: the first stop is the
>> forwarder entry, then the root servers, and finally failure.
> 
>> Does anybody have any idea why this external name resolution is
>> periodically failing like this?  Any suggestions for debugging info?
> 
>> It seems that external lookups can function fine for days before they
>> quit, or sometimes for only minutes.
> 
>> Thanks!
> 
>> curtis at telus dot net (which the smarter spambots can likely figure out
>> anyway...)
> 
> I see three issues here :
> 
> 1/ the zone "telus.net" is badly configured in a number of ways (a mismatch
> between the nameservers delegated to and the list of nameservers the servers
> themselves report, very short TTL on NS records, etc.).
> 
> 2/ you are running a beta-version of bind. Why ? 9.2.3 has been available for
> a long time.
> 
> 3/ you state that you use forwarders. Why ? Failure of the forwarders might
> give the behaviour you observe.

Thanks for your reply.

1/ - telus.net is only one zone I happened to use for the example.  As it
turns out, any external zone lookup fails.

2/ - 9.2.2.P3-9 is what came "out of the box" on Fedora Core 1 and is the
latest RPM according to yum and rpmfind.net.  I am a little hesitant to
download and compile the latest bind from source; I would rather keep
everything RPM-based if possible.

3/ - This is my prime suspicion - that the forwarder IP is failing.
However, here is some additional information I've since discovered: once
named gets into this failed state where the host command does not respond
correctly (i.e. it returns 'no servers could be reached'), I can specify
the IP of the forwarder on the 'host' command as follows and query the
forwarder directly and all works:

# host www.telus.net 64.59.135.133
Using domain server:
Name: 64.59.135.133
Address: 64.59.135.133#53
Aliases:
 
www.telus.net is an alias for cityweb.telus.net.
cityweb.telus.net has address 198.161.157.214

If I then use 'host telus.net' immediately after, it still fails.  The
only way to get the caching name server working again is a 'service named
restart'
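
The comparison above (ask the local named, then ask the forwarder directly)
could be scripted so it can be run repeatedly while waiting for the failure to
recur.  What follows is only a minimal sketch in Python, not part of the
actual setup: the function names build_query, probe and compare are
hypothetical, and 127.0.0.1 is an assumption about where the local named
listens.  It sends one plain UDP "A" query with the recursion-desired bit set
to each server and reports whether any reply came back within the timeout.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query: one question, class IN, recursion desired."""
    # Header: ID, flags (0x0100 = RD), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server, name, timeout=3.0, port=53):
    """Return the RCODE of the server's reply, or None if nothing usable arrived."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket.timeout
            return None
    # Ignore short packets and replies whose ID does not match the query
    if len(reply) < 12 or reply[:2] != packet[:2]:
        return None
    return reply[3] & 0x0F  # low four bits of the second flags byte


def compare(name, servers):
    """Probe each (label, address) pair and print a one-line result for it."""
    for label, address in servers:
        rcode = probe(address, name)
        status = "no reply" if rcode is None else "rcode=%d" % rcode
        print("%-12s %-16s %s" % (label, address, status))
```

A call such as compare("www.telus.net", [("local named", "127.0.0.1"),
("forwarder", "64.59.135.133")]) would then show, in the failed state,
whether the local server stays silent while the forwarder still answers.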

So, perhaps, the forwarder is not faulty at all but maybe bind is?
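
One way to check that suspicion against a tcpdump capture like the one quoted
above would be to tally where named's queries go and whether anything at all
comes back.  The following is a rough Python sketch under some assumptions:
the name summarize is hypothetical, the regular expression only covers query
lines in the format shown in the quoted trace, and a "reply" is taken to be
any line whose source port prints as "domain" and whose destination is the
resolver host.

```python
import re
from collections import Counter

# Matches tcpdump query lines such as:
#   15:51:01.996338 vault.lan.33305 > ns7so.cg.shawcable.net.domain:  35946+ [1au] A? www.telus.net. (42) (DF)
QUERY = re.compile(
    r"^(?P<time>\S+) (?P<src>\S+) > (?P<dst>\S+):\s+(?P<id>\d+)(?P<rd>\+?)"
    r"(?: \[[^\]]*\])? (?P<qtype>[A-Z]+)\? (?P<qname>\S+)"
)


def summarize(lines, resolver="vault.lan"):
    """Count queries sent by `resolver` per (server, recursion desired),
    plus the number of lines that look like replies coming back to it."""
    queries = Counter()
    replies = 0
    reply_re = re.compile(r"^\S+ \S+\.domain > " + re.escape(resolver) + r"\.")
    for raw in lines:
        line = raw.lstrip("> ")  # drop mail quote markers, if any
        m = QUERY.match(line)
        if m and m["src"].startswith(resolver + "."):
            server = m["dst"].rsplit(".", 1)[0]  # strip the ".domain" port suffix
            queries[(server, m["rd"] == "+")] += 1
        elif reply_re.match(line):
            replies += 1
    return queries, replies
```

Feeding it the saved capture, for example summarize(open("trace.txt")) with a
hypothetical file name, gives a Counter keyed by server and by whether the
"+" (recursion desired) flag was set, along with the reply count.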

If that suspicion is true, can you suggest some sort of logging I might
enable to see if in fact bind is falling over?

Thanks!

