High recursive client counts

Jason Brandt jbrandt at fsmail.bradley.edu
Tue Mar 25 16:14:01 UTC 2014


Mike,

  I appreciate your insight here.  We are indeed on virtual systems, using
enterprise grade hardware as well.  I will be doing more investigation
today, to see if I can duplicate the behavior, which I have been able to do
recently.
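
To catch the next spike as it happens, I may log the counter from rndc status at intervals.  A minimal sketch of the parsing, assuming the status output contains a line shaped like the one I quoted in my original message ("recursive clients: 2358/9900/10000").  As I understand it, the three fields are the current count, the soft limit, and the hard limit.  The function names and the 1000-client threshold are my own placeholders, not anything BIND provides:

```python
import re
from typing import NamedTuple, Optional


class RecursiveClients(NamedTuple):
    current: int
    soft_limit: int
    hard_limit: int


# Matches a line such as "recursive clients: 2358/9900/10000".
_PATTERN = re.compile(r"recursive clients:\s*(\d+)/(\d+)/(\d+)")


def parse_recursive_clients(status_text: str) -> Optional[RecursiveClients]:
    """Return the three counters from status text, or None if the line is absent."""
    match = _PATTERN.search(status_text)
    if match is None:
        return None
    return RecursiveClients(*(int(group) for group in match.groups()))


def is_spiking(counts: RecursiveClients, threshold: int = 1000) -> bool:
    """Hypothetical check: our normal count is 10-50, so 1000 is an arbitrary alarm level."""
    return counts.current >= threshold
```

Feeding it the text captured from rndc status once a minute and writing a timestamp whenever is_spiking() returns True should at least tell me whether the spikes line up with anything else on campus.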

Your VM vs Physical point is the thing that got me head scratching.  As I
stated, this is a new system, replacing our old resolvers; however, even
though I've had 2 different types of software doing resolution on our old
servers, they were actual physical machines.  Load in VMware monitoring
shows what you'd normally expect, that the system isn't being taxed
heavily, network usage is fairly low.  To us, it seems like an application
configuration issue.  I could definitely see it being a VM issue of some
sort too though, with the strange way it's behaving.

I'll keep digging and debugging, to see if I can come up with more detail
and correlate results to try and come up with a common theme/cause.
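
For the correlation step, something like the following could help spot a common theme in the slow queries.  It is only a sketch: it assumes I have already pulled the query names out of a packet capture or the rndc recursing dump by some other means, and it groups them by their last two labels, which is a rough stand-in for "same destination" and will be wrong for suffixes like co.uk.  The function names are hypothetical:

```python
from collections import Counter
from typing import Iterable


def suffix_of(name: str, labels: int = 2) -> str:
    """Return the last `labels` labels of a DNS name, lowercased, without the trailing dot."""
    parts = name.rstrip(".").lower().split(".")
    return ".".join(parts[-labels:])


def tally_suffixes(names: Iterable[str], labels: int = 2) -> Counter:
    """Count how many query names fall under each suffix, skipping empty and root names."""
    return Counter(suffix_of(n, labels) for n in names if n.strip("."))


if __name__ == "__main__":
    # Example names only; real input would come from the capture or dump.
    sample = ["www.example.com.", "mail.example.com", "a.b.example.org"]
    for suffix, count in tally_suffixes(sample).most_common():
        print(f"{count:6d}  {suffix}")
```

If one or two suffixes dominate the tally during a spike, that points at a slow or unreachable set of authoritative servers; if the names are spread evenly, the network path is the more likely suspect.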

Thank you for your help.


On Tue, Mar 25, 2014 at 10:52 AM, Mike Hoskins (michoski) <
michoski at cisco.com> wrote:

> Hi Jason,
>
> I've experienced similar things in the past on 9.8.  Since then we've
> moved to the latest 9.9, but don't think this is at all version specific
> (that said, you could obviously try upgrading).  I don't have an exact
> solution for you, but some ideas of things to check and personal
> experiences which might help you.
>
> Are the servers in question VM or bare metal?  Several years back we made
> a big push to virtualize everything, and after migrating recursive DNS it
> worked great for a while...as sites grew we hit a tipping point where
> VM-based resolvers seemed to introduce additional query latency.  These
> servers were running far below BIND's capabilities, not taxing virtual
> resources, optimized per all available BIND/OS/virtualization knobs, and
> using enterprise (read: not just the latest free bits slapped together and
> expected to work) network, server and hypervisor tech.  I spent several
> months trying to improve the situation and find a real root cause, but on
> a whim I set up an identical cluster on bare metal...no more problems.  I
> didn't have time to dig further, so we avoid virtualization on busy
> resolvers (for now at least).
>
> As your client count has grown...are there any bottlenecks on your network
> that might be unaccounted for?  Beyond bandwidth I'm thinking of things
> like resource constrained firewalls (are the resolvers in a DMZ?) which
> could cause queries to be dropped/timed out/retried, etc?  I've seen
> issues where overworked NetOps teams got behind in capacity
> planning/upgrades and as clients/#DMZs grew firewalls couldn't keep up and
> created all sorts of issues not related to BIND itself.
>
> When the recursive client count backs up, you know more queries than usual
> are taking longer than expected to get answers...if this is not related to
> BIND itself, your servers, or the network...a bit of spelunking is in
> order.  Capture some packets with tcpdump, and take a look at rndc
> recursing output.  Take a look at the queries causing delays, dig them
> manually from various locations, and try to find a common theme.  If there
> is no common theme to the query destinations, then look even closer at
> your network.  :-)
>
> hth
>
> -----Original Message-----
> From: Jason Brandt <jbrandt at fsmail.bradley.edu>
> Date: Tuesday, March 25, 2014 at 10:31 AM
> To: "bind-users at lists.isc.org" <bind-users at lists.isc.org>
> Subject: High recursive client counts
>
> >We recently migrated to BIND for our internal resolvers, and since the
> >migration, we are experiencing periods of high recursive client counts,
> >which will at times cause the BIND server to quit responding.  As a
> >workaround, I've been able to point
> >the BIND server to a forwarder, bypassing the root hints, to restore
> >stability, but this morning even with the forwarder, our count spiked.
> >
> >
> >We are using Ubuntu 12.04 LTS, BIND version 9.8.1-P1.  The server is
> >configured strictly as a resolver, and is not authoritative for any
> >domains.
> >
> >
> >We have approximately 15-20k client devices on campus.  Our average
> >recursive client count is between 10 and 50.  When the spikes occur,
> >counts will get upwards of 3-4k (this morning: recursive clients:
> >2358/9900/10000).
> >
> >
> >What are possible causes of high recursive client count?  What can be
> >done to prevent this or tune around it?  Obviously raising the max
> >clients doesn't solve the problem, and the forwarder seemed to help, but
> >apparently is still susceptible to
> >the issue.
> >
> >
> >Any suggestions would be greatly appreciated.
> >
> >
> >--
> >Jason K. Brandt
> >Systems Administrator
> >
> >
> >
> >
>
>


-- 
Jason K. Brandt
Systems Administrator