Bind9 stops responding for some clients

Thu Jun 6 02:02:33 UTC 2019

I just randomly spotted this post, and thought I would toss in my 2¢.

How many NICs and how many IPs are on the servers?  Are the failing
clients on the same subnet as the server?
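
If it helps to answer the subnet question for a batch of clients, here is a
rough sketch using Python's standard ipaddress module. The interface
addresses and client IPs below are made up for illustration (they are not
from the original post), and same_subnet is just a helper name I picked;
substitute the real values from the server and from the clients that time
out.

```python
import ipaddress


def same_subnet(server_if: str, client_ip: str) -> bool:
    """Return True if client_ip lies inside the network of a server interface.

    server_if is an interface address with prefix length, e.g. "192.168.10.5/24".
    """
    network = ipaddress.ip_interface(server_if).network
    return ipaddress.ip_address(client_ip) in network


# Hypothetical values: one entry per IP configured on the DNS server,
# and a mix of working and failing client addresses.
server_interfaces = ["192.168.10.5/24", "10.0.0.5/24"]
clients = ["192.168.10.77", "172.16.4.20"]

for client in clients:
    matches = [iface for iface in server_interfaces if same_subnet(iface, client)]
    if matches:
        print(f"{client}: on-link via {', '.join(matches)}")
    else:
        print(f"{client}: not on any local subnet, replies must be routed")
```

If the failing clients all turn out to be off-link, or to match a different
interface than the one they send queries to, that would at least point
toward the network path rather than toward named itself.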

--
Gordon A. Lang

On Thu, May 30, 2019, 8:10 PM Gregory Sloop <gregs at sloop.net> wrote:

> So, this is a very odd situation and I'm kind of grasping at straws here.
> So, I've come to see if any of you have any good straws!
>
> The setup.
> ---
> Ubuntu 18.04 LTS is the distro we're running on.
> All software is packaged [from the distro] - not compiled from sources.
> Bind9 acting as a recursive resolver for a smallish network. 150 seats.
> They're also handling DHCP and Chrony/NTP requests.
> [I actually have a pair of these handling DNS/DHCP/NTP; this is the master.]
>
> They are running on a Xen/XCP VM.
>
> The one I'm having problems with is the master for several internal zones -
> the one that's working fine is the slave for those same zones. None of the
> zones are large.
>
> Intermittently, Bind9 simply stops handling queries from *some* hosts.
> Meaning, it simply times out for responses for those hosts.
> Yet BIND *is* working fine for lots of other machines on the same
> networks. It's working fine doing dig queries locally on the server, and
> handles dns queries fine for lots of other machines. Yet, again, some
> machines simply get time-outs. I can't find any pattern to which machines
> get timeouts and which don't.
>
> I've checked - no firewalls, fail2ban or the like that might be causing
> this.
> No SELinux/AppArmor.
> Hosts that can't do dns queries can ping the dns server fine.
> [So, there's at least some network pathway to the DNS machine.]
>
> Review of the logs for BIND doesn't show anything that looks like a problem
> to me.
> [But I'm not sure what keywords I ought to be looking for, in an effort to
> find symptoms/problems.]
>
> Finally, the two bind/dhcp/ntp servers are currently running on the same
> Xen host, so if it's somehow host related, I'd expect both to have
> problems, but they don't.
>
> Top doesn't show any CPU distress.
> Processes look fine.
> Memory in use is far below what's allocated to the machine. [1G allocated,
> like <400M used.]
> Restart of BIND doesn't do anything, at least in the cases I've seen -
> which aren't all that many yet.
> A restart of the whole VM does appear to fix the issue immediately.
> These appear to occur every 3-5 days.
> Oh, and if you simply wait, it eventually starts handling queries for all
> hosts again - but it might be a couple+ hours.
>
> Any suggestions on things I might hunt for in the logs in an attempt to
> figure out what's happening?
> Other suggestions for things to look for/consider?
> TIA
> -Greg