File Descriptor limit and malfunction bind

Imri Zvik imriz at inter.net.il
Fri Jan 8 12:38:17 UTC 2010


Hi :)

While I agree with you that 4096 should be sufficient (what is your definition of a highly loaded server?), there are a couple of situations where a server might use more sockets than it would normally use:

1. A DoS attack.
2. Higher latency while resolving recursive queries.
3. A server with a flushed/unprimed cache.

I think the main issue here is why BIND freezes when it runs out of sockets.

Bottom line: even if some other, transient issue is causing the higher socket usage, raising the limits will at least help avoid the hang.
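
To see how close a process is to its descriptor limit before it hangs, something like the rough sketch below could be run periodically. It is only an illustration: it assumes Linux (it reads /proc/<pid>/fd), and the helper names and the 90% threshold are my own, not anything BIND provides.

```python
import os

def open_fd_count(pid="self"):
    """Count the entries in /proc/<pid>/fd (Linux only).

    For pid="self" the count includes the descriptor that is
    temporarily opened to list the directory itself.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))

def near_limit(used, limit, threshold=0.9):
    """Return True when `used` descriptors reach `threshold` of `limit`.

    The 0.9 default is an arbitrary illustrative value.
    """
    return used >= limit * threshold
```

For named you would pass its PID instead of "self" (reading another user's /proc entry usually needs root) and compare the count against the socket limit in effect, e.g. 4096.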

Regarding epoll: I already mentioned that epoll is the immediate suspect for this and some other issues in 9.4.3 (see my "9.4.3 oddities" thread).

Please note that I've tried that myself (recompiling with --disable-epoll) on 9.4.3-P*, and ran into this error:
05-Jan-2010 20:54:33.798 general: critical: socket.c:3138: fatal error:
05-Jan-2010 20:54:33.806 general: critical: exiting (due to fatal error in library)

Also, my server returned a lot of SERVFAIL errors.

At the time I was more interested in getting my service back to acceptable levels than in debugging this issue, so I downgraded to 9.4.2, which worked flawlessly.



-----Original Message-----
From: JINMEI Tatuya / 神明達哉 [mailto:jinmei at isc.org] 
Sent: Friday, January 08, 2010 8:55 AM
To: Imri Zvik
Cc: bind-users at lists.isc.org
Subject: Re: File Descriptor limit and malfunction bind

At Tue, 05 Jan 2010 10:36:27 +0200,
Imri Zvik <imriz at inter.net.il> wrote:

> > I have a high-load DNS server running BIND 9.4.3 on RH.
> > Yesterday we experienced a problem with BIND (it froze), and
> > when looking at the logs I saw the following error:
> > named error: socket: file descriptor exceeds limit (4096/4096)
> > I looked at my OS file descriptor limit using ulimit -n: 1024.
> > Where does the number 4096 come from?

It's the hard-coded default maximum number of file descriptors (which
is nearly equal to the maximum allowable number of open sockets).

> If I'm not mistaken, you should either recompile with a higher value for 
> ISC_SOCKET_MAXSOCKETS or restart named with the -S <maxsockets> argument.

I'm afraid it's yes and no.  Yes, you can raise the hard coded default
value by the -S command line option.  (I'm afraid) no, I suspect it
won't solve the problem.  From my past experiences, 4096 should be
sufficient even for a very busy server.  If it still consumes all
available sockets, it's more likely to mean there's some unexpected
serious error (bug) which can't be mitigated by raising that limit.
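
Before picking a value for -S, it may help to check whether the operating system's own descriptor limits would even allow it. The following is only a minimal sketch: it uses Python's standard resource module to read RLIMIT_NOFILE (the limit that ulimit -n reports), and the function names and result labels are made up for illustration.

```python
import resource

def fd_limits():
    """Return the (soft, hard) RLIMIT_NOFILE pair for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def can_hold(max_sockets, soft, hard):
    """Classify whether `max_sockets` descriptors fit under the limits.

    The returned labels are illustrative, not anything named reports.
    """
    def fits(limit):
        return limit == resource.RLIM_INFINITY or max_sockets <= limit

    if fits(soft):
        return "ok"
    if fits(hard):
        return "raise-soft-limit"
    return "raise-hard-limit"

soft, hard = fd_limits()
print(can_hold(4096, soft, hard))
```

With the numbers from the original report (a soft limit of 1024 and, hypothetically, a hard limit of 4096), can_hold(4096, 1024, 4096) gives "raise-soft-limit". Raising a hard limit normally requires privileges.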

I've heard of similar reports (seemingly consuming all available
sockets and named "freezes"), but unfortunately I couldn't reproduce
it myself and since it seems to be quite rare I've not figured out the
problem.

One possible workaround you may want to try is to *disable* epoll, the
efficient I/O API on Linux:
./configure --disable-epoll

This means named will use the less efficient select API, but depending
on the machine's power and the server load, it may provide acceptable
performance and more stable behavior, as select is (seemingly) the
more stable API.
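
As a rough illustration of what "select instead of epoll" means, the sketch below waits for the same readiness event through both backends using Python's standard selectors module. This is not how named is implemented; it only shows that the two APIs answer the same question ("which descriptors are readable?"), so swapping one for the other changes efficiency rather than behavior. The wait_readable helper is a name invented for this example.

```python
import selectors
import socket

def wait_readable(selector_cls, timeout=1.0):
    """Register one end of a socket pair and wait until it is readable."""
    a, b = socket.socketpair()
    sel = selector_cls()
    try:
        sel.register(a, selectors.EVENT_READ)
        b.send(b"ping")  # makes `a` readable
        events = sel.select(timeout)
        # One entry per ready descriptor; True if it is our socket.
        return [key.fileobj is a for key, _mask in events]
    finally:
        sel.close()
        a.close()
        b.close()

# select() is available everywhere; epoll only on Linux.
backends = [selectors.SelectSelector]
if hasattr(selectors, "EpollSelector"):
    backends.append(selectors.EpollSelector)

results = {cls.__name__: wait_readable(cls) for cls in backends}
```

Both backends should report the same single ready socket. The practical difference is cost: select scans the whole descriptor set on every call, while epoll only returns the descriptors that are ready.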

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.



