BIND 10 #1522: Implementation of isc::server_common::SocketRequestor
BIND 10 Development
do-not-reply at isc.org
Wed Jan 4 09:21:29 UTC 2012
#1522: Implementation of isc::server_common::SocketRequestor
-------------------------------------+-------------------------------------
 Reporter:  vorner                   |  Owner:  jinmei
 Type:  task                         |  Status:  reviewing
 Priority:  blocker                  |  Milestone:  Sprint-20120110
 Component:  Unclassified            |  Resolution:
 Keywords:                           |  Sensitive:  0
 Defect Severity:  N/A               |  Sub-Project:  Core
 Feature Depending on Ticket:        |  Estimated Difficulty:  0
   Socket creator                    |  Total Hours:  10
 Add Hours to Ticket:  0             |
 Internal?:  0                       |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:17 vorner]:
> > This caused an additional, unexpected issue: if the remote end (in
> > practice, boss) closes the connection first, requestSocket() would
> > result in SIGPIPE (and the application would die). Just like we did
> > for `SocketSession`, we need to at least ignore that signal. I've
> > added that code too (btw I suspect this is also the case for the boss -
> > an app could die after requestSocket(), which would automatically
> > close the connection). The fact that we did the same trick again is
> > another reason why we should unify this and `SocketSession` (but
> > that's beyond the scope of our current work).
>
> Well, this wouldn't be such a problem, because boss is one of the
> components that aren't allowed to crash. And it doesn't close the
> socket anywhere in its code. But maybe better to have it there.
I'm afraid I was misunderstood. Assuming your comment is about
"btw I suspect this is also the case for the boss", what I wanted to
point out is that a bug in a client component could make the boss
crash, too.
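
For illustration, here is a minimal sketch of the kind of SIGPIPE
suppression discussed above (plain POSIX; the helper name is made up and
this is not the actual BIND 10 code):

{{{
#!cpp
#include <signal.h>
#include <stdio.h>

// Ignore SIGPIPE process-wide: writing to a connection whose peer has
// already closed then fails with EPIPE instead of killing the process.
void
ignoreSigPipe() {
    struct sigaction sa;
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGPIPE, &sa, NULL) == -1) {
        perror("sigaction(SIGPIPE)");
    }
}
}}}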
> > > Anyway, it would be bad to close the connection. That would signal
> > > the boss to reclaim all the sockets transferred previously.
> >
> > Hmm, on further considering this and other bullet points below, I now
> > think we should probably manage the states of open sockets at the boss
> > and applications separately from the channel between them
>
> Well, they are bound to the socket they were sent over for a simple
> reason: we need to know when an application crashed, to release its
> sockets. And it seems
I understood that, but the boss can detect that from the fact that the
application itself dies. In any case, as long as a socket could be in
a broken state, I don't think it is acceptable to pretend that it's
safe to keep using it. The choices I can see are: drop everything and
create all necessary sockets from scratch; similar to the previous
one, but just abort the application and restart it; or separate the
socket state from the connection state.
> an overkill to do so much to handle a possible bug, at least for this
> stage of the project, it seems to me.
>
> And anyway, I consider this code to be mostly temporary, for two
> reasons:
> * If/when we adopt dbus instead of msgq, we can register at the bus to
>   tell the boss when an application that sent a given request
>   disconnects from the boss.
> * If/when we adopt dbus, we might want to get rid of all this
>   low-level error-prone code and send the sockets over the dbus (it
>   doesn't allow sending them on Windows, but our current code doesn't
>   either). And the socket could go directly in the response to the
>   request, avoiding this two-phase exchange.
>
> So, having a note in the code or a low-priority ticket for it is OK,
> but I'd be against trying to solve it now or in the next sprint or
> the like.
As a bottom line, I'm not requesting this be addressed within the
branch or by the release. But I'm opposed to doing nothing (beyond
making some comments) about this even in the near future. While I
admit the probability of this failure mode should be quite small, the
underlying system is complicated enough that we cannot be so sure.
Also, since we don't have a concrete plan for when and how to replace
msgq yet (and in any case I'm not so sure how the replacement would
magically solve all the problems), I'm not comfortable using the plan
as an excuse for not doing anything. Leaving the possibility of
undefined behavior in the code for an indeterminate period seems
irresponsible.
Maybe my practical suggestion would be to just abort the application,
with an assert or a non-catchable exception, if such a "should not
happen" case actually happens. If it really doesn't happen, that's
just fine; if it can happen in a practical scenario, it's better to
know that fact sooner. And I insist we should do it sooner rather
than later - if not for the coming release, then in the next sprint
or so.
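
As a sketch of that fail-fast suggestion (the function name and message
are hypothetical, not the project's actual conventions):

{{{
#!cpp
#include <cstdlib>
#include <iostream>

// If the channel to the boss is ever observed in a "should not happen"
// state, report and abort immediately rather than continuing to use a
// possibly broken socket.
void
impossibleChannelState(const char* detail) {
    std::cerr << "FATAL: socket requestor channel in impossible state: "
              << detail << std::endl;
    std::abort();  // fail fast: a core dump now beats undefined behavior later
}
}}}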
> > > Hmm, what do you mean? Like if the boss sent a "2\n" and the
> > > socket? Yes, then it would leak, but can we really avoid that in
> > > any way?
> >
> > See the third bullet, for example. If we pass the pair of status (1
> > byte) and FD via a single sendmsg()/recvmsg() (the status will be in
> > the 1-byte "normal" data; the FD is passed as ancillary data),
> > requestSocket() could examine both the FD and status and can close the
> > socket (e.g.) if FD is bogus.
>
> But if the "status" byte were bogus, the fd could leak anyway,
> couldn't it? If there was one?
The difference is that in the current style recv_fd() can be called
only after we examine the status, and it just wouldn't make sense to
call it if we know the status code is bogus. If we piggy-back the
status on the recvmsg() call, we can detect that the code is bogus
while still retrieving the FD from the ancillary data, and then close
the FD (all within recv_fd()).
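
To make the idea concrete, a sketch of such a combined receive (the
function name and the "0 means success" status convention are assumed
for the sketch; this is not the actual recv_fd() implementation):

{{{
#!cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>
#include <string.h>

// Receive a 1-byte status and (possibly) a file descriptor in a single
// recvmsg() call.  Returns the FD, or -1 on failure; if the status
// turns out to be bogus, the FD that arrived with it is closed right
// here, so it cannot leak.
int
recvStatusAndFd(int sock, unsigned char* status) {
    char data = 0;
    struct iovec iov;
    iov.iov_base = &data;
    iov.iov_len = 1;

    // Properly aligned buffer for one int of SCM_RIGHTS ancillary data.
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return (-1);            // EOF or error on the channel
    }
    *status = static_cast<unsigned char>(data);

    int fd = -1;
    for (struct cmsghdr* c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
            memcpy(&fd, CMSG_DATA(c), sizeof(fd));
        }
    }

    if (*status != 0 && fd >= 0) {
        close(fd);    // bogus status: drop the FD instead of leaking it
        return (-1);
    }
    return (fd);
}
}}}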
> > > > - I personally don't think it the best idea to signal the case
> > > >   of "CREATOR_SOCKET_UNAVAILABLE" by exception. At the very
> > > >   least it should be specifically separated from other
> > > >   exceptions such as the one resulting from bogus status data;
> > > >   otherwise, assuming the app should generally drop the
> > > >   connection in the latter case, we could drop and re-establish
> > > >   the connection unnecessarily.
> > >
> > > I'd like to point out that such a situation shouldn't happen in
> > > practice and would be the same kind of bug as wrong data being
> > > sent, etc. By the time the client picks up the socket, it was
> > > already prepared and reserved. It can be unavailable only in case
> > > the token is wrong.
> >
> > I thought "unavailable" can be possible for a reason we cannot
> > control, such as unrelated third-party application already opens the
> > port. Am I mistaken?
>
> We might have misunderstood each other. The CREATOR_SOCKET_UNAVAILABLE
> constant is for the case I described. If the socket can't be created
> for reasons like the port being taken already, the result would be
> sent over the msgq. But you're right that this case isn't signalled by
> a different exception now. The problem lies within boss, as it sends
> the same error status for all errors, which should be improved.
Right, I realized after making this comment that I had misunderstood
CREATOR_SOCKET_UNAVAILABLE, and I also understand that the main point
still stands, as you noted.
> Another cleanup ticket, maybe?
I think so, and I think it is quite important, because such an error
can easily happen in practice.
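
Purely as an illustration of the kind of separation that cleanup ticket
could introduce (these constants are invented, not BIND 10's actual
protocol values):

{{{
#!cpp
// Distinct error categories let the client react appropriately: log and
// report an ordinary allocation failure, but treat a protocol-level
// problem as a bug (e.g. drop and re-establish the connection).
enum SocketRequestResult {
    SOCKET_OK = 0,          // socket created and transferred
    SOCKET_ALLOCATE_ERROR,  // e.g. the port was already taken: expected,
                            // recoverable, worth its own exception
    SOCKET_PROTOCOL_ERROR   // malformed request/response: a bug somewhere
};
}}}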
> A few notes for the new code:
> * Is there no logging when the request fails for some reason (error
>   response over msgq, CREATOR_SOCKET_UNAVAILABLE, other failures…)?
My intent was that in these cases we throw an exception and it will be
logged in an appropriate context at a higher layer. (But when we
separate the error case of failing to create a socket from other rare
failures, it would probably make sense to log an event about the
former.)
> * The protocolString function could be used inside
>   createRequestSocketMessage as well, to reuse some code.
Okay, done, and while doing so I noticed that the current code could
accept bogus 'protocol' and 'share mode' values. Even though they may
be rejected after exchanging messages with the boss, it would be
better to reject such cases at an earlier point. I added the check
and corresponding tests.
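
A sketch of that early rejection (the enum is assumed to mirror
SocketRequestor's protocol values; the checking code itself is
illustrative, not the committed change):

{{{
#!cpp
#include <stdexcept>

enum Protocol { UDP, TCP };     // assumed to mirror SocketRequestor's enum

// Convert the protocol to its message string, rejecting bogus values in
// the client immediately instead of after a msgq round trip to the boss.
const char*
protocolString(Protocol protocol) {
    switch (protocol) {
    case UDP:
        return ("UDP");
    case TCP:
        return ("TCP");
    default:
        throw std::invalid_argument("invalid protocol");
    }
}
}}}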
> * Is it possible to make gtest produce a warning (similar to the "You
>   have disabled tests" line at the bottom of the tests) if some of
>   the tests are skipped because it is impossible to do the timeout?
As far as I know, no. I quickly looked through the Google Test
documentation but couldn't find anything usable for this purpose.
> * When reading the tokens in the child, the hardcoded length is 4.
>   Maybe add a note to the function that it shouldn't be used with
>   tokens of different lengths?
Good idea, and I actually thought about something like that. Added
some comments.
> * What is wrong with ssize_t in
>   85bf91d5836306a39b471567952ad1e0af91d948? ssize_t is signed, so the
>   error handling should work, and it seems cleaner than int, as it is
>   64-bit on 64-bit systems (therefore it allows sending more than 2GB
>   of data at once, though it is unlikely anyone would do that).
I simply tried to use the same type as the return value of read() or
write(), but I didn't necessarily have to change it. In any case,
using a type larger than int doesn't help anyway, because in practice
read()/write() can only return int-sized values. So in either case the
behavior of the code should be exactly the same. If you still want to
revert this part, however, please do so.
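
For reference, the error check in question behaves the same either way;
a minimal sketch using ssize_t, as vorner suggests (the helper name is
made up):

{{{
#!cpp
#include <unistd.h>

// Write the whole buffer, retrying on short writes.  ssize_t is signed,
// so the n == -1 error check is reliable; with int the behavior would
// be identical for any realistic buffer size.
bool
writeAll(int fd, const void* data, size_t len) {
    const char* p = static_cast<const char*>(data);
    while (len > 0) {
        const ssize_t n = write(fd, p, len);
        if (n == -1) {
            return (false);
        }
        p += n;
        len -= static_cast<size_t>(n);
    }
    return (true);
}
}}}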
--
Ticket URL: <http://bind10.isc.org/ticket/1522#comment:18>
BIND 10 Development <http://bind10.isc.org>