file descriptor exceeds limit

Thu Jun 18 23:39:30 UTC 2015

Inline...

On 6/18/15, 9:22 AM, "Cathy Almond" <cathya at isc.org> wrote:

>On 18/06/2015 12:00, Matus UHLAR - fantomas wrote:
>> On 17.06.15 22:39, Shawn Zhou wrote:
>>> BIND on my resolvers reaches the max open file limit and I am getting
>>> lots of SERVFAILs
>>> http://pastebin.com/SxRsHLff
>>> http://pastebin.com/SxRsHLff
>> 
>>> After I increased the max-socks (-s 8192) to 8192, I no longer saw the
>>> file limit error in the log; however, I am still seeing many SERVFAILs.
>> 
>> no other errors?
>> 
>>> Our resolvers were doing about 15k queries per second when this was
>>> happening, and that was legitimate traffic.  I am aware that I am
>>> setting recursive clients to a very high number.  Those resolvers are
>>> running on 12-core CPU and 24G RAM hardware.  CPU utilization was at
>>> about 20% and there was plenty of RAM left.
>> 
>>> I am wondering if I've reached the limit of BIND for the amount of
>>> recursive queries it can serve.  Any other tunings I should try?
>> 
>> maybe changing number of recursive-clients, max-clients-per-query.
>> 
>> Does EDNS work for you? EDNS problems often result in an increased number
>> of TCP queries, which slows down resolution ...
>> 
>>> By the way, the resolvers are running RHEL 6.x.
>> 
>> precise BIND version would help a bit more... seems RH6.6 contains 9.8.2
>> but that may be different for older RH6 versions.
>> 
>> 
>
>Unless you're running a build with --with-tuning=large (for which there
>are a number of caveats around the capacity of the machine etc..), then
>you don't really want to have a backlog of recursive clients that
>exceeds 3000-3500.  If you're getting that many in your backlog, then as
>already highlighted to you, there is Something Wrong going on.

We're running --with-tuning=large, but I think we are OK (128GB RAM, 32
cores).  If there are other caveats to be aware of, please share.

For years I kept recursive clients conservatively set (based on some of
your docs, and community comments).  I finally raised it much higher just
to see what would happen (after having to repeatedly explain why blindly
increasing that number wasn't a good thing), and it had no effect one way
or another.  Still got the servfails.

We are in a somewhat unique situation, because we have batch type jobs
generating rules/etc which often purposefully crawl the "bad" parts of the
'Net and in turn generate DNS requests for things which legitimately
return servfail.  However, we were getting increasingly consistent
complaints from users about seeing servfails where they weren't expected.
The biggest thing which helped for us was increasing
DISC_SOCKET_MAXEVENTS.  We're still digging to see if the remaining
servfail reports are genuinely something we can tune around, or a symptom
of the use case.

>You're probably running into other resource limits that will be what are
>causing the SERVFAIL responses you're still seeing despite increasing
>the maximum number of sockets that named can use.  I would tune down the
>limit to 3000 and allow named to drop the oldest outstanding client
>queries when new ones need to be processed.

I'm going to crank this back down in our environments.
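
To make "drop the oldest outstanding client queries" concrete, here is a
toy Python model of a bounded backlog.  It is only an illustration of the
idea, not how named implements its recursive-clients quota; the Backlog
class and its counters are made up for this sketch, and 3000 is simply the
limit suggested above.

```python
from collections import deque


class Backlog:
    """Toy bounded backlog: when full, the oldest entry is dropped."""

    def __init__(self, limit: int) -> None:
        self.pending = deque()
        self.limit = limit
        self.dropped = 0

    def add(self, query_id: int) -> None:
        if len(self.pending) >= self.limit:
            self.pending.popleft()  # discard the oldest outstanding query
            self.dropped += 1
        self.pending.append(query_id)


backlog = Backlog(limit=3000)
for qid in range(3500):  # 3500 arrivals, none of them answered
    backlog.add(qid)
```

With a cap in place the backlog cannot grow without bound; the cost is
that the oldest waiting clients are the ones that get dropped.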

>There is another logging category you can use (query-errors) that can
>tell you more, but it's probably not worth it in this instance.
>
>And I have another suggestion for what might be causing your backlog
>(apart from problems in the network path between your servers and the
>Internet authoritative servers), for which we have some
>soon-to-be-released new mitigation features (in 9.10.3):
>
>https://kb.isc.org/article/AA-01178
>
>(this will be updated to reflect the features we will actually include
>in the upcoming release - but they're essentially going to be
>fetches-per-server and fetches-per-zone along with improved
>logging/stats for both of those)
>
>There's going to be a webinar about both the problem and the mitigations
>on July 8th:
>
>https://www.facebook.com/events/100311766979499/
>
>http://goo.gl/Z8idQf

Looking forward to this.  We've been sticking to 9.9.x (currently running
9.9.7) as an ESV release, but maybe 9.10 makes sense.  Not sure how the
community feels about that?

For the record, I've spent a lot of time with our network team looking at
firewall logs, getting packet traces, etc., and haven't found any smoking
guns.  We have a perhaps not so unique setup where the caches are in a DMZ,
so clients talk through a firewall, and the DNS servers talk through a
firewall.  I've identified and fixed a number of issues along the way;
enumerating them here in case it helps anyone else.

The internal firewall was oversubscribed, and at peak times would reset
connections, causing clients to retry, which quickly drove up the number
of recursive clients.  Replaced those firewalls, and that specific behavior
got a lot better.

The external firewall was sharing a PAT for all caches, which eventually
exhausted its 65k ports.  Can't drop these direct on the 'Net for security
reasons, but now have 1-to-1 NAT per cache and haven't seen this exact
behavior since.
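
A back-of-the-envelope way to see why a shared PAT runs out of ports: the
number of translations held open is roughly the upstream query rate times
how long the firewall keeps each UDP translation alive.  The Python sketch
below uses made-up numbers (2000 and 500 upstream queries per second per
cache, a 30 second translation timeout) and assumes the firewall only hands
out ports above 1023; real values depend on the firewall and the traffic.

```python
# Rough estimate of source-port demand behind one shared PAT address.
# All inputs are illustrative assumptions, not measurements.

USABLE_PORTS = 65535 - 1023  # assumes ports 1024-65535 are available


def ports_in_use(upstream_qps: float, hold_seconds: float) -> float:
    """Concurrent translations = arrival rate * time each one is held."""
    return upstream_qps * hold_seconds


def max_caches_per_pat(upstream_qps_per_cache: float, hold_seconds: float) -> int:
    """How many caches fit behind one PAT address before ports run out."""
    per_cache = ports_in_use(upstream_qps_per_cache, hold_seconds)
    return int(USABLE_PORTS // per_cache)


busy = max_caches_per_pat(2000, 30)   # hypothetical busy cache
quiet = max_caches_per_pat(500, 30)   # hypothetical quieter cache
```

Under these assumptions a single busy cache can nearly fill one address on
its own, which is consistent with 1-to-1 NAT per cache making the problem
go away.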

We do still routinely see that at least some of the failing names also
don't resolve manually from other networks when we dig in ourselves...so at
least some of it is just "bad Internet" (whether network, firewall, DNS
server, whatever).  However, some of them resolve just fine on other
networks or through things like Google DNS when we're returning servfail.
This is what we're trying to debug...the tweaking we've done so far has
reduced but not eliminated our servfail problem.
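
The triage described above can be sketched as a small Python helper that
compares the response code from our cache with the one from a reference
resolver for the same name.  The function name, the labels, and the sample
data are all hypothetical; collecting the rcodes (with dig or otherwise) is
left out.

```python
from collections import Counter


def classify(local_rcode: str, reference_rcode: str) -> str:
    """Label one name by where the SERVFAIL shows up."""
    local_fail = local_rcode == "SERVFAIL"
    reference_fail = reference_rcode == "SERVFAIL"
    if local_fail and reference_fail:
        return "upstream"        # fails everywhere: likely "bad Internet"
    if local_fail:
        return "local"           # only our cache fails: worth debugging
    if reference_fail:
        return "reference-only"  # only the reference resolver fails
    return "ok"


# Hypothetical sample: (name, rcode from our cache, rcode from reference)
samples = [
    ("a.example", "SERVFAIL", "SERVFAIL"),
    ("b.example", "SERVFAIL", "NOERROR"),
    ("c.example", "NOERROR", "NOERROR"),
    ("d.example", "SERVFAIL", "NXDOMAIN"),
]
summary = Counter(classify(local, ref) for _, local, ref in samples)
```

Only the names labelled "local" are candidates for tuning on our side; the
"upstream" ones would fail no matter what we change.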

>Hoping that this is useful?

Very, thanks for the ideas.