High recursive client counts

Cathy Almond cathya at isc.org
Tue Mar 25 17:34:39 UTC 2014

On 25/03/2014 16:14, Jason Brandt wrote:
> Mike,
>   I appreciate your insight here.  We are indeed on virtual systems,
> using enterprise grade hardware as well.  I will be doing more
> investigation today, to see if I can duplicate the behavior, which I
> have been able to do recently.
> Your VM vs Physical point is the thing that got me head scratching.  As
> I stated, this is a new system, replacing our old resolvers; however,
> even though I've had 2 different types of software doing resolution on
> our old servers, they were actual physical machines.  Load in VMWare
> monitoring shows what you'd normally expect, that the system isn't being
> taxed heavily, network usage is fairly low.  To us, it seems like an
> application configuration issue.  I could definitely see it being a VM
> issue of some sort too though, with the strange way it's behaving.
> I'll keep digging and debugging, to see if I can come up with more
> detail and correlate results to try and come up with a common theme/cause.
> Thank you for your help.
> On Tue, Mar 25, 2014 at 10:52 AM, Mike Hoskins (michoski)
> <michoski at cisco.com> wrote:
>     Hi Jason,
>     I've experienced similar things in the past on 9.8.  Since then we've
>     moved to the latest 9.9, but don't think this is at all version specific
>     (that said, you could obviously try upgrading).  I don't have an exact
>     solution for you, but some ideas of things to check and personal
>     experiences which might help you.
>     Are the servers in question VM or bare metal?  Several years back we made
>     a big push to virtualize everything, and after migrating recursive DNS it
>     worked great for awhile...as sites grew we hit a tipping point where
>     VM-based resolvers seemed to introduce additional query latency.  These
>     servers were running far below BIND's capabilities, not taxing virtual
>     resources, optimized per all available BIND/OS/virtualization knobs, and
>     using enterprise (read: not just the latest free bits slapped together and
>     expected to work) network, server and hypervisor tech.  I spent several
>     months trying to improve the situation and find a real root cause, but on
>     a whim I set up an identical cluster on bare metal...no more problems.  I
>     didn't have time to dig further, so we avoid virtualization on busy
>     resolvers (for now at least).
>     As your client count has grown...are there any bottlenecks on your network
>     that might be unaccounted for?  Beyond bandwidth I'm thinking of things
>     like resource constrained firewalls (are the resolvers in a DMZ?) which
>     could cause queries to be dropped/timed out/retried, etc?  I've seen
>     issues where overworked NetOps teams got behind in capacity
>     planning/upgrades and as clients/#DMZs grew firewalls couldn't keep up and
>     created all sorts of issues not related to BIND itself.
>     When the recursive client count backs up, you know more queries than usual
>     are taking longer than expected to get answers...if this is not related to
>     BIND itself, your servers, or the network...a bit of spelunking is in
>     order.  Capture some packets with tcpdump, and take a look at rndc
>     recursing output.  Take a look at the queries causing delays, dig them
>     manually from various locations, and try to find a common theme.  If there
>     is no common theme to the query destinations, then look even closer at
>     your network.  :-)
>     hth
>     -----Original Message-----
>     From: Jason Brandt <jbrandt at fsmail.bradley.edu>
>     Date: Tuesday, March 25, 2014 at 10:31 AM
>     To: "bind-users at lists.isc.org" <bind-users at lists.isc.org>
>     Subject: High recursive client counts
>     >We recently migrated to BIND for our internal resolvers, and since the
>     >migration, we are experiencing periods of high recursive client counts,
>     >which will at times cause the BIND server to quit responding.  As a
>     >workaround, I've been able to point
>     > the BIND server to a forwarder, bypassing the root hints, to restore
>     >stability, but this morning even with the forwarder, our count spiked.
>     >
>     >
>     >We are using Ubuntu 12.04 LTS, BIND version 9.8.1-P1.  The server is
>     >configured strictly as a resolver, and is not authoritative for any
>     >domains.
>     >
>     >
>     >We have approximately 15-20k client devices on campus.  Our average
>     >recursive client count is between 10 and 50.  When the spikes occur,
>     >counts will get upwards of 3-4k (this morning: recursive clients:
>     >2358/9900/10000).
>     >
>     >
>     >What are possible causes of high recursive client count?  What can be
>     >done to prevent this or tune around it?  Obviously raising the max
>     >clients doesn't solve the problem, and the forwarder seemed to help, but
>     >apparently is still susceptible to the issue.
>     >
>     >
>     >Any suggestions would be greatly appreciated.
>     >
>     >
>     >--
>     >Jason K. Brandt
>     >Systems Administrator
>     >
>     >
>     >
>     >

Packet tracing and/or looking at rndc recursing is good - then you'll
see which client queries are waiting for answers from authoritative servers.
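
As a rough illustration of watching the client count between spikes, here
is a minimal Python sketch that pulls the "recursive clients:" line out of
rndc status output (the same current/soft-limit/hard-limit triple quoted
above as 2358/9900/10000).  The function names and the 20% alert threshold
are hypothetical choices for illustration, not anything BIND defines.

```python
import re
import subprocess
from typing import NamedTuple, Optional


class RecursiveClients(NamedTuple):
    current: int      # clients currently waiting on recursion
    soft_limit: int   # soft quota
    hard_limit: int   # the configured recursive-clients maximum


# Matches a line such as "recursive clients: 2358/9900/10000".
_PATTERN = re.compile(r"recursive clients:\s*(\d+)/(\d+)/(\d+)")


def parse_recursive_clients(status_text: str) -> Optional[RecursiveClients]:
    """Return the three counters from `rndc status` text, or None if absent."""
    match = _PATTERN.search(status_text)
    if match is None:
        return None
    return RecursiveClients(*(int(group) for group in match.groups()))


def is_backing_up(counts: RecursiveClients, fraction: float = 0.2) -> bool:
    """Hypothetical alert rule: current count at or above a fraction of the soft limit."""
    return counts.current >= fraction * counts.soft_limit


if __name__ == "__main__":
    # Assumes rndc is on PATH and configured to talk to the local named.
    output = subprocess.run(
        ["rndc", "status"], capture_output=True, text=True, check=True
    ).stdout
    counts = parse_recursive_clients(output)
    if counts is not None and is_backing_up(counts):
        print(f"recursive clients backing up: {counts}")
```

Run from cron or a monitoring agent, something like this can timestamp
when a spike starts, which makes it easier to line up with packet captures
and rndc recursing dumps taken at the same moment.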

Depending on what you've upgraded from, this might be a problem with
whether or not your infrastructure can handle EDNS0 and large packet
sizes.  Newer versions of BIND set the DO bit by default on iterative
queries, so perhaps some servers are sending back larger responses than
you were receiving before.  It's worth checking that your network
infrastructure can handle both EDNS0 and large UDP packet sizes (and DNS
queries via TCP of course too).  See

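One way to probe the EDNS0 / large UDP / TCP path by hand is sketched
below in Python, using only the standard library.  It builds a query
carrying an OPT record with the DO bit set, and provides helpers to send it
over UDP and over TCP.  The function names, the 4096-byte advertised size,
and the choice of server and query name are assumptions for illustration;
this is a test aid, not a description of how BIND builds its own queries.

```python
import socket
import struct


def build_query(name: str, qtype: int = 48, udp_size: int = 4096,
                do_bit: bool = True, query_id: int = 0x1234) -> bytes:
    """Build a DNS query with an EDNS0 OPT record (qtype 48 = DNSKEY)."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    labels = [label for label in name.rstrip(".").split(".") if label]
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)  # class IN
    # OPT pseudo-record: root name, TYPE 41, CLASS = advertised UDP size,
    # TTL field = extended rcode/version/flags (DO is 0x8000), RDLENGTH 0.
    flags = 0x8000 if do_bit else 0
    opt = b"\x00" + struct.pack("!HHIH", 41, udp_size, flags, 0)
    return header + question + opt


def is_truncated(response: bytes) -> bool:
    """True if the TC bit is set, meaning the answer did not fit over UDP."""
    return bool(struct.unpack("!H", response[2:4])[0] & 0x0200)


def query_udp(server: str, packet: bytes, timeout: float = 3.0) -> bytes:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 53))
        data, _ = sock.recvfrom(65535)
        return data


def _recv_exact(sock: socket.socket, count: int) -> bytes:
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before full message arrived")
        buf += chunk
    return buf


def query_tcp(server: str, packet: bytes, timeout: float = 3.0) -> bytes:
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        # DNS over TCP prefixes each message with a two-byte length.
        sock.sendall(struct.pack("!H", len(packet)) + packet)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)
```

If query_udp() times out for a name with a large signed response while
query_tcp() succeeds, or a truncated UDP answer is never followed by a
working TCP retry, that points at a firewall or middlebox on the path
rather than at BIND itself.
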
I should also comment that the distro BIND 9.8 that you're using isn't
the current ISC version, so you're missing out on recent fixes - you
might be better off with a self-build of 9.8.7-W1 or 9.8.5-W1:

These also might be helpful:


