random latency in named

Thu May 21 22:01:07 UTC 2015

Hi BIND,

I’ve been trying to track down the source of random latency in our production servers, without much luck. At random intervals - several times an hour - named appears to suddenly stop processing queries for anywhere up to about 2500 ms, only to resume moments later. This of course introduces latency in response times.

When the glitch occurs, we can watch the rx_queue in /proc/net/udp fill up, so packets are still arriving with no issues. That more or less rules out the kernel network stack up to named’s socket buffer (packet traces corroborate this as well).
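
For anyone who wants to watch the same counter, here is a minimal sketch of polling rx_queue. It assumes Python 3 and the usual Linux /proc/net/udp layout (local address as hex IP:port in the second column, hex tx_queue:rx_queue in the fifth). The function names, the 50 ms polling interval and the choice of port 53 are illustrative assumptions, not the exact tooling we used.

```python
import time


def rx_queue_bytes(proc_text, port=53):
    """Parse /proc/net/udp text and return {local_address: rx_queue_bytes}
    for every UDP socket bound to the given local port."""
    queues = {}
    for line in proc_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        local = fields[1]  # e.g. "0100007F:0035" (hex IP : hex port)
        local_port = int(local.rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        _tx_hex, rx_hex = fields[4].split(":")  # "tx_queue:rx_queue", both hex
        # Several sockets can share one local address, so sum their backlogs.
        queues[local] = queues.get(local, 0) + int(rx_hex, 16)
    return queues


def watch(port=53, interval=0.05, threshold=0):
    """Poll /proc/net/udp forever and print a timestamped line whenever the
    total receive backlog for `port` exceeds `threshold` bytes."""
    while True:
        with open("/proc/net/udp") as f:
            queues = rx_queue_bytes(f.read(), port)
        backlog = sum(queues.values())
        if backlog > threshold:
            print(f"{time.time():.3f} rx_queue backlog: {backlog} bytes {queues}")
        time.sleep(interval)
```

Calling watch() on the affected server prints a line each time the backlog is non-zero, which makes it easier to line the stalls up against packet traces and named’s logs. IPv6 sockets are listed separately in /proc/net/udp6 and would need the same treatment.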

Simultaneously, named’s CPU usage drops to 0%, and a stack trace captured at that moment looks identical to an idle server. This seems to suggest that the issue is likely not inside of named. It’s as if named isn’t getting notified about the new packets, but I’m not able to find any known issues with epoll, and this could be a “red herring” anyway.

Other bits of info that might be relevant:
* We’ve updated to BIND 9.9.7 with no effect.
* The OS is RHEL 6.6; we just updated the kernel to 2.6.32-504.16.2.el6.x86_64, also with no effect.
* The issue is vaguely load dependent, although it’s not clear what kind of load, as we haven’t yet been able to reproduce it in a dev environment.
* That being said, our load does not seem at all high. Generally < 5000 QPS, load average < 0.1, > 90% idle CPU.
* Nothing stands out in logs from trace 3 / querylog, except, perhaps, the fact that there are never any logs at all during the glitch.
* Here is a typical stack trace during the glitch: http://pastebin.com/raw.php?i=JZhrPSFv

Anyone have any thoughts about what to look at next?

Thanks in advance,

Mathew Eis
