BIND 10 #1705: attempt to run multiple auth servers causes FATAL [b10-auth.server_common] SRVCOMM_EXCEPTION_ALLOC exception when allocating a socket: File exists
BIND 10 Development
do-not-reply at isc.org
Thu Feb 23 15:08:34 UTC 2012
#1705: attempt to run multiple auth servers causes FATAL [b10-auth.server_common]
SRVCOMM_EXCEPTION_ALLOC exception when allocating a socket: File exists
-------------------------------------+-------------------------------------
Reporter: jreed | Owner: UnAssigned
Type: defect | Status: reviewing
Priority: major | Milestone: Sprint-20120306
Component: Inter-module communication | Resolution:
Keywords: | Sensitive: 0
Defect Severity: N/A | Sub-Project: Core
Feature Depending on Ticket: | Estimated Difficulty: 0
Add Hours to Ticket: 0 | Total Hours: 0
Internal?: 0 |
-------------------------------------+-------------------------------------
Changes (by vorner):
* owner: vorner => UnAssigned
* status: accepted => reviewing
* subproject: DNS => Core
* component: b10-auth => Inter-module communication
* milestone: New Tasks => Sprint-20120306
Comment:
The problem seems to fall into the category "You probably won't believe me,
but I'm telling the truth, even if it sounds incredible…".
So, the "File exists" error comes from `epoll_ctl` when a new socket is added
from inside an asio TCP acceptor. It means the file descriptor being added
to the watcher (or whatever it is) is already registered there. That was
strange, because the file descriptor had only just been received from the
boss when it was added. However, it turned out that recvmsg really does
return the same file descriptor multiple times (maybe something inside the
kernel gets confused when we send the same file descriptor to multiple
applications).
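
For illustration, here is a minimal sketch of how this error arises at the
epoll level. It is not BIND 10 code: it is Linux-only Python, the helper name
`register_twice` is made up for this example, and a pipe stands in for the
socket. Registering a descriptor that the epoll instance already watches
fails with EEXIST, which is reported as "File exists".

```python
import errno
import os
import select


def register_twice():
    """Register one descriptor with epoll twice (hypothetical helper).

    Returns the errno raised by the second registration, or 0 if it
    unexpectedly succeeded.
    """
    read_end, write_end = os.pipe()   # a pipe stands in for the socket
    ep = select.epoll()               # Linux-only
    try:
        ep.register(read_end, select.EPOLLIN)
        try:
            # Same descriptor number again: epoll_ctl(EPOLL_CTL_ADD) refuses it.
            ep.register(read_end, select.EPOLLIN)
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        ep.close()
        os.close(read_end)
        os.close(write_end)


if __name__ == "__main__":
    code = register_twice()
    print(errno.errorcode.get(code, code), os.strerror(code))
```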
I found a workaround that seems to help: I dup the file descriptor after
I receive it and close the original one. Now the descriptors are no longer
duplicated, but I fear two things:
* There's a bug in the Linux kernel and we should report it.
* If it can create a duplicate of an FD that recvmsg already returned, it
might just as well hit a different one. I don't know how to check this, but
it could be quite a disaster if it did.
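
The workaround described above can be sketched roughly as follows. This is
an illustration in Python, not the actual patch: the function names `send_fd`
and `recv_fd_and_dup` are hypothetical, and the descriptor is passed over a
Unix-domain socket with SCM_RIGHTS, which is assumed to be how the boss hands
it over. The receiver dups the descriptor it got and closes the original.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Pass one file descriptor over a Unix-domain socket (hypothetical helper)."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    sock.sendmsg([b"x"], ancillary)   # one payload byte carries the ancillary data


def recv_fd_and_dup(sock):
    """Receive one descriptor, dup it, and close the one recvmsg returned."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any truncated trailing bytes.
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    if not fds:
        raise RuntimeError("no file descriptor received")
    received = fds[0]
    fresh = os.dup(received)   # new descriptor number for the same open file
    os.close(received)         # drop the number recvmsg handed us
    return fresh


if __name__ == "__main__":
    parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    send_fd(parent, write_end)
    fd = recv_fd_and_dup(child)
    os.write(fd, b"hello")     # the dup'ed descriptor refers to the same pipe
    print(os.read(read_end, 5))
```

Note that this only changes which descriptor number the process ends up
using; it does not show why recvmsg returned a repeated number in the first
place.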
Also, I'm not sure how to write any kind of test for this. I propose we
include this fix now and create a ticket to investigate the kernel or
something. I'd be very interested to know where this comes from.
As the feature was not in the previous release, I don't think it needs a
changelog entry (and I wouldn't like to describe the error there in a few
sentences).
--
Ticket URL: <http://bind10.isc.org/ticket/1705#comment:3>
BIND 10 Development <http://bind10.isc.org>