random latency in named

Thu May 21 22:54:31 UTC 2015

Hi, Mathew--

On May 21, 2015, at 3:01 PM, Mathew Ian Eis <Mathew.Eis at nau.edu> wrote:
> Hi BIND,
> 
> I’ve been trying to track down the source of random latency in our production servers, without much luck. At random intervals - several times an hour - named appears to suddenly stop processing queries for around 0-2500ms, only to resume moments later. This of course introduces latency in response times.
> 
> When the glitch occurs, we can watch the rx_queue in /proc/net/udp fill up, so the kernel network stack up to named’s socket buffer has more or less been ruled out: packets are coming in with no issues (packet traces corroborate this as well).
> 
> Simultaneously, named’s CPU usage drops to 0%, and a stack trace captured at that moment looks identical to an idle server. This seems to suggest that the issue is likely not inside of named. It’s as if named isn’t getting notified about the new packets, but I’m not able to find any known issues with epoll, and this could be a “red herring” anyway.
> 
> Other bits of info that might be relevant:
> * We’ve updated to BIND 9.9.7 with no effect.
> * The OS is RHEL 6.6; we just updated the kernel to 2.6.32-504.16.2.el6.x86_64, also with no effect.
> * The issue is vaguely load dependent, although it’s not clear what kind of load, as we haven’t yet been able to reproduce it in a dev environment.
> * That being said, our load does not seem at all high. Generally < 5000 QPS, load average < 0.1, > 90% idle CPU.
> * Nothing stands out in logs from trace 3 / querylog, except, perhaps, the fact that there are never any logs at all during the glitch.
> * Here is a typical stack trace during the glitch: http://pastebin.com/raw.php?i=JZhrPSFv
> 
> Anyone have any thoughts about what to look at next?

I've seen something similar on RHEL 5.x & 6.x derived platforms running Linux 2.6.x kernels under high packet loads.
It wasn't visible under low traffic volume (less than 1000 packets per second), but at higher volumes both TCP and UDP
traffic would show periodic hangs of anywhere up to ~2-3 seconds.  UDP would be enqueued but the process wouldn't seem
to notice that it had stuff to do; TCP would also show queued received traffic, but again that would not drain for
a few seconds.
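
To catch the enqueued-but-not-drained state as it happens, something like the
following rough sketch could poll the UDP socket table. It assumes the usual
/proc/net/udp column layout (hex tx_queue:rx_queue in the fifth field, and a
trailing decimal drops column where the kernel provides one). The function
names and the 100 ms polling interval are made up for illustration.

```python
import time


def parse_proc_net_udp(text):
    """Parse the text of /proc/net/udp into a list of per-socket dicts."""
    rows = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue  # blank or truncated line
        # local_address is HEXADDR:HEXPORT; only the port is needed here
        port = int(fields[1].rsplit(":", 1)[1], 16)
        # fifth field is tx_queue:rx_queue, both in hex (bytes queued)
        tx_hex, rx_hex = fields[4].split(":")
        # assumption: drops is the 13th column; absent on some kernels
        drops = int(fields[12]) if len(fields) > 12 else None
        rows.append({
            "port": port,
            "tx_queue": int(tx_hex, 16),
            "rx_queue": int(rx_hex, 16),
            "drops": drops,
        })
    return rows


def watch(port=53, interval=0.1, path="/proc/net/udp"):
    """Print a timestamped line whenever a socket on `port` has queued data."""
    while True:
        with open(path) as f:
            rows = parse_proc_net_udp(f.read())
        for row in rows:
            if row["port"] == port and row["rx_queue"] > 0:
                print("%.3f port=%d rx_queue=%d drops=%s"
                      % (time.time(), row["port"], row["rx_queue"], row["drops"]))
        time.sleep(interval)
```

Calling watch(53) on the server and lining the timestamps up against the
packet traces should show how long each stall lasts. Pointing path at
/proc/net/udp6 should cover the IPv6 sockets, assuming that file uses the same
column layout.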

Oddly enough, the issue was most noticeable with IPv6 and 4-in-6 wrapped traffic; pure IPv4 sockets didn't seem to
experience the issue.  If you have the option of testing under IPv4 only (for example, by starting named with -4),
that might be interesting.  Alternatively, if you can run another platform like FreeBSD instead of RHEL, perhaps try that.
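
For comparing the two transports, a rough sketch of a timing probe that sends
the same query over IPv4 and IPv6 might look like this. It hand-builds a
minimal DNS query rather than using a resolver library. The function names,
the query name, and the 3 second timeout are placeholders, and it does not
check that the reply matches the query ID, so treat it as a starting point
only.

```python
import socket
import struct
import time


def build_query(name, txid=0x1234, qtype=1):
    """Build a minimal DNS query packet (class IN, recursion desired)."""
    # header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def time_query(server, name, timeout=3.0):
    """Return the round-trip time in ms for one UDP query, or None on timeout."""
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        start = time.monotonic()
        s.sendto(build_query(name), (server, 53))
        try:
            s.recvfrom(4096)  # simplification: the reply is not validated
        except socket.timeout:
            return None
        return (time.monotonic() - start) * 1000.0
```

Running time_query("127.0.0.1", "example.com") and time_query("::1",
"example.com") in a loop on the server itself, and logging anything over a few
hundred milliseconds, would show whether the stalls hit both address families
or only one.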

Regards,
-- 
-Chuck