Frequent timeout

Tue Sep 11 21:19:08 UTC 2018

I will walk back my previous comments and just say that bandwidth may be in play because anytime you soak a circuit it is not good.

Take a look at this query sequence:

dns.qry.type == 28 && dns.qry.name == concured.co

Packet 42356 shows a AAAA query for concurred.co.
Packets 42357/8 show 68.195.193.45 relaying the query to 62.138.132.21.
Packets 43015/16 show 62.138.132.21 replying with its query response to 68.195.193.45.

And that's it.  Nothing is seen being sent back to 127.0.0.1.  At least on the wire.  By way of comparison, packet 161 shows 127.0.0.1 answering itself so I would consider the previous no response a clue.
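If you would rather not eyeball this, the "query with no reply on localhost" check can be scripted. What follows is only a minimal sketch: it assumes the DNS packets have already been exported from the capture into simple records (for instance with tshark's field output), and the `DnsPacket` layout, the `unanswered_queries` name and the sample rows are all made up for illustration, not taken from the pcap.

```python
from collections import namedtuple

# Hypothetical record layout; the field names are illustrative only.
DnsPacket = namedtuple("DnsPacket", "frame src dst txid qname is_response")


def unanswered_queries(packets, client="127.0.0.1"):
    """Return queries sent by `client` that never received a response."""
    pending = {}
    for p in packets:
        key = (p.txid, p.qname)
        if not p.is_response and p.src == client:
            # A query leaving the client: remember it until a reply shows up.
            pending[key] = p
        elif p.is_response and p.dst == client:
            # A reply delivered to the client: the matching query is answered.
            pending.pop(key, None)
    return sorted(pending.values(), key=lambda p: p.frame)


# Made-up sample rows, not frames from the capture discussed above.
sample = [
    DnsPacket(1, "127.0.0.1", "127.0.0.1", 0x1A2B, "example.com", False),
    DnsPacket(2, "127.0.0.1", "127.0.0.1", 0x1A2B, "example.com", True),
    DnsPacket(3, "127.0.0.1", "127.0.0.1", 0x3C4D, "example.org", False),
]

missing = unanswered_queries(sample)
print([p.frame for p in missing])
```

Matching on transaction ID plus query name is a simplification; IDs can repeat over a long capture, so treat any hit as something to go look at in wireshark rather than a verdict.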

Moving on:

Packet 48874 shows 127.0.0.1 asking for a AAAA record again.
This time we don’t see any external communication.
Packet 87174 shows 127.0.0.1 replying with server failure.

It took nearly 25 seconds to decide upon a SERVFAIL and that is another clue.
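To see how often that kind of delay shows up, you could measure the gap between each query and its SERVFAIL. This is a rough sketch under assumptions: the `(time, txid, is_response, rcode)` tuples are a hypothetical export format, `servfail_delays` is a name I made up, and the sample numbers are invented rather than read from the capture. The only fixed fact it relies on is that rcode 2 means SERVFAIL.

```python
def servfail_delays(events, min_seconds=0.0):
    """Return (txid, delay) for SERVFAIL replies at least `min_seconds` late.

    `events` is an iterable of (time_seconds, txid, is_response, rcode)
    tuples in capture order; this layout is an assumption for illustration.
    """
    sent = {}
    delays = []
    for t, txid, is_response, rcode in events:
        if not is_response:
            # Keep the first time this transaction ID was seen as a query.
            sent.setdefault(txid, t)
        elif rcode == 2 and txid in sent:  # rcode 2 is SERVFAIL
            delay = t - sent.pop(txid)
            if delay >= min_seconds:
                delays.append((txid, delay))
    return delays


# Invented sample: one normal answer and one slow SERVFAIL.
events = [
    (0.0, 101, False, None),
    (0.5, 101, True, 0),
    (1.0, 202, False, None),
    (25.5, 202, True, 2),
]

slow = servfail_delays(events, min_seconds=5.0)
print(slow)
```

A pile of failures that all land at about the same delay would point toward a timeout being hit rather than random packet loss.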

That said, there are heaps of queries where DNS worked as expected.  I really had to dig for the above examples because it seems like the vast majority of the server failure messages either do not get a reply on the localhost or we don’t see the routable adapter on the server attempting to reach out to get the answer.  concurred.co is unique in that we see both an attempt to reach out and, on the later query, no attempt.

If the traffic that 127.0.0.1 is putting on the wire does not go out, I am thinking firewall, but you may be dealing with bandwidth exhaustion exclusively and it is presenting itself in this manner.  Or you may have a server configuration issue or a server that is underpowered.

Sometimes pcaps are black and white and give you a "here is your problem" answer, and other times it is like this, where the capture does not give us anything conclusive to work with.  Since this server is sputtering around, I would set about first stabilizing traffic from 127.0.0.1 going out.  You need to see outbound traffic hit 127.0.0.1 and then hit your external adapter without missing.  Boom, boom, boom on down the line.
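One way to check that "localhost, then external adapter" sequence in bulk is to list the names that were asked on 127.0.0.1 but never seen leaving the routable address. Again, this is just a sketch with assumptions: the `(src, qname, is_response)` tuples and the `not_forwarded` helper are hypothetical, and 192.0.2.10 is a documentation address standing in for your real adapter. It matches on query name because the resolver uses its own transaction IDs upstream.

```python
def not_forwarded(packets, external_ip, client="127.0.0.1"):
    """Return names queried by `client` that never left `external_ip`.

    `packets` is an iterable of (src, qname, is_response) tuples; this
    layout is an assumption for illustration.
    """
    asked = set()
    forwarded = set()
    for src, qname, is_response in packets:
        if is_response:
            continue  # only queries matter for this check
        if src == client:
            asked.add(qname)
        elif src == external_ip:
            forwarded.add(qname)
    return sorted(asked - forwarded)


# Invented sample: example.com is relayed upstream, example.org is not.
packets = [
    ("127.0.0.1", "example.com", False),
    ("192.0.2.10", "example.com", False),
    ("127.0.0.1", "example.com", True),
    ("127.0.0.1", "example.org", False),
]

stuck = not_forwarded(packets, external_ip="192.0.2.10")
print(stuck)
```

Keep in mind that a name answered from cache will also show no upstream query, so a name on this list is a lead to chase, not proof that something blocked it.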

Hopefully others will have better, more insightful suggestions.

Good hunting!

John

-----Original Message-----
From: Alex [mailto:mysqlstudent at gmail.com] 
Sent: Tuesday, September 11, 2018 1:57 PM
To: John W. Blue; bind-users at lists.isc.org
Subject: Re: Frequent timeout

Hi,

On Tue, Sep 11, 2018 at 2:47 PM John W. Blue <john.blue at rrcic.com> wrote:
>
> If you use wireshark to slice n dice the pcap .. "dns.flags.rcode == 2" shows all of your SERVFAIL happens on localhost.
>
> If you switch to "dns.qry.name == storage.pardot.com" every single query is localhost.
>
> Unless you have another NIC that you are sending traffic over this does not look like a bandwidth issue at this particular point in time.

Thanks so much. I think I also may have confused things by suggesting it was related to bandwidth or utilization. I see it also happen now more regularly too.

Can you ascertain why it is reporting these SERVFAILs?

The queries are on localhost because /etc/resolv.conf lists localhost as the nameserver. Is that why we can't diagnose this? This most recent packet trace was started with "-i any". Why would the ones on localhost be the ones which are failing? I'm assuming postfix and/or some other process is querying bind on localhost to cause these errors?