No such Name, and 5 second DNS delay.

Tory M Blue tmblue at gmail.com
Sun Feb 28 05:33:01 UTC 2010


I'm running into some issues and trying to diagnose them, so maybe folks
on here can help me with steps to troubleshoot.

Bind 9.6.1-P1
Fedora Core

What I am experiencing, and what led to my investigation, is a random 5
second delay in name resolution. Now I know that the nslookup/dig resolver
has a default 5 second retry: if it doesn't get an answer, it will try
the second server listed in resolv.conf. So I could sort of
explain the 5 second delay. I didn't understand why it was happening,
but felt I was getting closer.
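
As a rough model of the failover described above (not the actual resolver code), the sketch below walks a server list in order and charges a full timeout for every server that stays silent. The `ask` callback, the `fake_ask` helper, and the 10.0.0.x addresses are hypothetical stand-ins for illustration; only the 5 second figure and the answer address come from this post.

```python
from typing import Callable, Optional, Sequence, Tuple

TIMEOUT_S = 5.0  # assumed per-server timeout, matching the 5 second default described above


def resolve_with_failover(
    servers: Sequence[str],
    ask: Callable[[str], Optional[Tuple[str, float]]],
    timeout: float = TIMEOUT_S,
) -> Tuple[Optional[str], float]:
    """Try each server in order; return (answer, total seconds spent).

    `ask(server)` is a hypothetical stand-in for a real query. It returns
    (answer, seconds_taken), or None when the server never replies.
    """
    elapsed = 0.0
    for server in servers:
        reply = ask(server)
        if reply is None or reply[1] > timeout:
            # No usable reply: the client waits out the full timeout, then moves on.
            elapsed += timeout
            continue
        answer, took = reply
        return answer, elapsed + took
    return None, elapsed


def fake_ask(server: str) -> Optional[Tuple[str, float]]:
    # Hypothetical behaviour: the first server drops the query, the second answers in 1 ms.
    replies = {"10.0.0.1": None, "10.0.0.2": ("216.249.24.50", 0.001)}
    return replies[server]


answer, seconds = resolve_with_failover(["10.0.0.1", "10.0.0.2"], fake_ask)
print(answer, round(seconds, 3))  # 216.249.24.50 5.001
```

Under this model a single lost or ignored query to the first server is enough to produce a total of just over 5 seconds, even when the second server answers in about a millisecond.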

So then I started running some network traces (which takes some time,
as the 5 second delay is very random). Being patient and
running enough "time dig host +trace" revealed a few 5 second delays.
For the most part the runs are all low ms (as I expect), but a couple took
5 seconds.
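
Catching a rare slow run by hand is tedious, so a small loop can do the waiting. This is only a sketch: `time_command` and `slow_runs` are names made up here, the 1 second threshold is an arbitrary assumption, and the real command would be the dig invocation used in this post.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd once, discard its output, and return the wall-clock seconds it took."""
    start = time.monotonic()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.monotonic() - start


def slow_runs(cmd, runs, threshold=1.0, timer=time_command):
    """Return [(run index, seconds)] for every run slower than threshold seconds."""
    hits = []
    for i in range(runs):
        took = timer(cmd)
        if took > threshold:
            hits.append((i, took))
    return hits


# Example call (assumes dig is on PATH; the host name is the placeholder from this post):
# slow_runs(["dig", "apps.domain.com", "+trace"], runs=200)
```

The `timer` argument exists only so the loop can be exercised without touching the network; in normal use the default is left alone.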

The delay occurs in the upper part of the dig output (although, interestingly
enough, no single section ever shows more than about 175 ms).

[tblue at w05 ~]$ time dig apps.domain.com +trace +stats

; <<>> DiG 9.3.2 <<>> apps.domain.com +trace +stats
;; global options:  printcmd
.                       317993  IN      NS      C.ROOT-SERVERS.NET.
.                       317993  IN      NS      J.ROOT-SERVERS.NET.
.                       317993  IN      NS      B.ROOT-SERVERS.NET.
.                       317993  IN      NS      L.ROOT-SERVERS.NET.
.                       317993  IN      NS      D.ROOT-SERVERS.NET.
.                       317993  IN      NS      I.ROOT-SERVERS.NET.
.                       317993  IN      NS      F.ROOT-SERVERS.NET.
.                       317993  IN      NS      G.ROOT-SERVERS.NET.
.                       317993  IN      NS      M.ROOT-SERVERS.NET.
.                       317993  IN      NS      K.ROOT-SERVERS.NET.
.                       317993  IN      NS      A.ROOT-SERVERS.NET.
.                       317993  IN      NS      H.ROOT-SERVERS.NET.
.                       317993  IN      NS      E.ROOT-SERVERS.NET.

<<<<PAUSES HERE>>>>>
;; Query time: 1 msec
;; SERVER: 0.0.0.15#53(216.249.24.15)
;; WHEN: Sat Feb 27 21:25:21 2010
;; MSG SIZE  rcvd: 500

net.                    172800  IN      NS      H.GTLD-SERVERS.net.
net.                    172800  IN      NS      M.GTLD-SERVERS.net.
net.                    172800  IN      NS      I.GTLD-SERVERS.net.
net.                    172800  IN      NS      F.GTLD-SERVERS.net.
net.                    172800  IN      NS      K.GTLD-SERVERS.net.
net.                    172800  IN      NS      L.GTLD-SERVERS.net.
net.                    172800  IN      NS      E.GTLD-SERVERS.net.
net.                    172800  IN      NS      J.GTLD-SERVERS.net.
net.                    172800  IN      NS      D.GTLD-SERVERS.net.
net.                    172800  IN      NS      G.GTLD-SERVERS.net.
net.                    172800  IN      NS      B.GTLD-SERVERS.net.
net.                    172800  IN      NS      A.GTLD-SERVERS.net.
net.                    172800  IN      NS      C.GTLD-SERVERS.net.
;; Query time: 14 msec
;; SERVER: 192.33.4.12#53(C.ROOT-SERVERS.NET)
;; WHEN: Sat Feb 27 21:25:21 2010
;; MSG SIZE  rcvd: 505

domain.com.              172800  IN      NS      ns1.domain.com.
domain.com.          172800  IN      NS      ns2.domain.com.
;; Query time: 54 msec
;; SERVER: 192.55.83.30#53(M.GTLD-SERVERS.net)
;; WHEN: Sat Feb 27 21:25:26 2010
;; MSG SIZE  rcvd: 104

apps.domain.com.     300     IN      A       216.249.24.50
domain.com.          86400   IN      NS      ns2.domain.com.
domain.com.          86400   IN      NS      ns1.domain.com.
;; Query time: 0 msec
;; SERVER: 0.0.0.15#53(ns1.domain.com)
;; WHEN: Sat Feb 27 21:25:26 2010
;; MSG SIZE  rcvd: 120


real    0m5.090s
user    0m0.004s
sys     0m0.004s
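
The ";; WHEN:" stamps in the output above give a second, coarse view of where the wall-clock time went. The sketch below pairs each stamp with its ";; SERVER:" line and prints the gap since the previous reply; `when_gaps` is a name invented here, and the stamps only have one second resolution, so treat the result as a hint rather than a measurement.

```python
import re
from datetime import datetime

# The four ";; SERVER:" / ";; WHEN:" pairs copied from the dig output above.
TRACE = """\
;; SERVER: 0.0.0.15#53(216.249.24.15)
;; WHEN: Sat Feb 27 21:25:21 2010
;; SERVER: 192.33.4.12#53(C.ROOT-SERVERS.NET)
;; WHEN: Sat Feb 27 21:25:21 2010
;; SERVER: 192.55.83.30#53(M.GTLD-SERVERS.net)
;; WHEN: Sat Feb 27 21:25:26 2010
;; SERVER: 0.0.0.15#53(ns1.domain.com)
;; WHEN: Sat Feb 27 21:25:26 2010
"""


def when_gaps(text):
    """Return [(server, seconds since the previous reply), ...], one entry per reply."""
    servers = re.findall(r"^;; SERVER: (\S+)", text, flags=re.M)
    stamps = [
        datetime.strptime(s, "%a %b %d %H:%M:%S %Y")
        for s in re.findall(r"^;; WHEN: (.+)$", text, flags=re.M)
    ]
    # The first reply has no predecessor, so its gap is reported as 0.
    gaps = [0.0] + [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return list(zip(servers, gaps))


for server, gap in when_gaps(TRACE):
    print(f"{gap:4.0f}s  {server}")
```

For this capture the only non-zero gap is the 5 seconds before the M.GTLD-SERVERS.net reply, even though that reply itself reports a 54 msec query time. That may or may not line up with where the pause shows on screen.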

So since I finally caught one of these in the wild, I could look at
the network trace. I was caught off guard when I saw "No such name":
"Flags: 0x8483 (Standard query response, No such name)"

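For reference, the flags word 0x8483 from the capture can be split into the standard DNS header bits (layout per RFC 1035). This is a minimal sketch; `decode_dns_flags` and the `RCODES` table are names made up here, and only the most common response codes are listed.

```python
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def decode_dns_flags(flags: int) -> dict:
    """Split the 16-bit DNS header flags word into its RFC 1035 fields."""
    return {
        "qr": (flags >> 15) & 0x1,      # 1 = response, 0 = query
        "opcode": (flags >> 11) & 0xF,  # 0 = standard query
        "aa": (flags >> 10) & 0x1,      # authoritative answer
        "tc": (flags >> 9) & 0x1,       # truncated
        "rd": (flags >> 8) & 0x1,       # recursion desired
        "ra": (flags >> 7) & 0x1,       # recursion available
        "rcode": RCODES.get(flags & 0xF, "OTHER"),  # low 4 bits: response code
    }


print(decode_dns_flags(0x8483))
# {'qr': 1, 'opcode': 0, 'aa': 1, 'tc': 0, 'rd': 0, 'ra': 1, 'rcode': 'NXDOMAIN'}
```

So the packet is a response with rcode 3 (NXDOMAIN, shown by the sniffer as "No such name"), with the authoritative-answer bit set and recursion-desired clear.
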
What? I can query my 4 servers (which sit behind a load balancer) directly or
through the LB and they resolve fine. All are running, and all have current
zone files (they are slaves), so I don't understand the "No such name". I have
no idea why this server is giving this response, and since it's so
infrequent it makes no sense at all. The servers are not busy: very low
load, gig network, no saturation, no retransmissions. Everything seems
healthy.

So now I'm wondering if that's the 5 second delay: it sends out a
request, one server sends back "No such name", so it queries my other
set of DNS servers and gets an immediate response. However, all 4
servers seem fine.

So I've sniffed the traffic and I've looked at what logs I have. Are there
other logs I can enable to catch or watch for this? Is there possibly a
configuration setting that is outdated or wrong?

I'm new to the bind list, so I'm not clear on what information will allow me to
help you help me.

I even thought that maybe I had a bad hint file (named.cache), but
it appears to be current (well, the last major update seems to have been
Dec 08).

Thanks
Tory


