BIND 10 #1522: Implementation of isc::server_common::SocketRequestor
BIND 10 Development
do-not-reply at isc.org
Wed Jan 4 09:21:29 UTC 2012
#1522: Implementation of isc::server_common::SocketRequestor
-------------------------------------+-------------------------------------
 Reporter:  vorner                   |  Owner:  jinmei
 Type:  task                         |  Status:  reviewing
 Priority:  blocker                  |  Milestone:  Sprint-20120110
 Component:  Unclassified            |  Resolution:
 Keywords:                           |  Sensitive:  0
 Defect Severity:  N/A               |  Sub-Project:  Core
 Feature Depending on Ticket:        |  Estimated Difficulty:  0
   Socket creator                    |  Total Hours:  10
 Add Hours to Ticket:  0             |
 Internal?:  0                       |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:17 vorner]:
> > This caused an additional, unexpected issue: if the remote end (in
> > practice, boss) closes the connection first, requestSocket() would
> > result in SIGPIPE (and the application would die). Just like we did
> > for `SocketSession`, we need to at least ignore that signal. I've
> > added that code too (btw I suspect this is also the case for the boss -
> > an app could die after requestSocket(), which would automatically
> > close the connection). The fact that we did the same trick again is
> > another reason why we should unify this and `SocketSession` (but
> > that's beyond the scope of our current work).
>
> Well, this wouldn't be such a problem, because boss is one of the
> components that aren't allowed to crash. And it doesn't close the
> socket anywhere in its code. But maybe better to have it there.
I'm afraid I was misunderstood. Assuming your comment is about
"btw I suspect this is also the case for the boss", what I wanted to
point out is that a bug in a client component could make the boss
crash, too.
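
For illustration, here is a minimal sketch of the kind of SIGPIPE
suppression discussed above (plain POSIX; the helper name is made up and
this is not the actual BIND 10 code):

{{{
#!cpp
#include <signal.h>
#include <stdio.h>

// Ignore SIGPIPE process-wide: writing to a connection whose peer has
// already closed then fails with EPIPE instead of killing the process.
void
ignoreSigPipe() {
    struct sigaction sa;
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGPIPE, &sa, NULL) == -1) {
        perror("sigaction(SIGPIPE)");
    }
}
}}}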
> > > Anyway, it would be bad to close the connection. That would signal
> > > the boss to reclaim all the sockets transferred previously.
> >
> > Hmm, on further considering this and other bullet points below, I now
> > think we should probably manage the states of open sockets at the boss
> > and applications separately from the channel between them
>
> Well, they are bound to the socket they were sent over for a simple
> reason: we need to know when an application crashed, to release its
> sockets. And it seems
I understood that, but the boss can detect that from the fact that the
application itself dies. In any case, as long as a socket could be in
a broken state, I don't think it is acceptable to pretend that it's
safe to keep using it. The choices I can see are: drop everything and
create all necessary sockets from scratch; similar to the previous
one, but just abort the application and restart it; or separate the
socket state from the connection state.
> an overkill to do so much to handle a possible bug, at least for this
> stage of the project, it seems to me.
>
> And anyway, I consider this code to be mostly temporary, for two
> reasons:
> * If/when we adopt dbus instead of msgq, we can register at the bus to
>   tell the boss when an application that sent a given request
>   disconnects from the boss.
> * If/when we adopt dbus, we might want to get rid of all this
>   low-level error-prone code and send the sockets over the dbus (it
>   doesn't allow sending them on Windows, but our current code doesn't
>   either). And the socket could go directly in the response to the
>   request, avoiding this two-phase exchange.
>
> So, having a note in the code or a low-priority ticket for it is OK,
> but I'd be against trying to solve it now or in the next sprint or
> the like.
As a bottom line, I'm not requesting this be addressed within the
branch or by the release. But I'm opposed to doing nothing (beyond
making some comments) about this even in the near future. While I
admit the probability of this failure mode should be quite small, the
underlying system is complicated enough that we cannot be so sure.
Also, since we don't have a concrete plan for when and how to replace
msgq yet (and in any case I'm not so sure how the replacement would
magically solve all the problems), I'm not comfortable using the plan
as an excuse for not doing anything. Leaving the possibility of
undefined behavior in the code for an indeterminate period seems
irresponsible.
Maybe my practical suggestion would be to just abort the application,
with an assert or a non-catchable exception, if such a "should not
happen" case actually happens. If it really doesn't happen, that's
just fine; if it can happen in a practical scenario, it's better to
know that fact sooner. And I insist we should do it sooner rather
than later - if not for the coming release, then in the next sprint
or so.
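
As a sketch of that fail-fast suggestion (the function name and message
are hypothetical, not the project's actual conventions):

{{{
#!cpp
#include <cstdlib>
#include <iostream>

// If the channel to the boss is ever observed in a "should not happen"
// state, report and abort immediately rather than continuing to use a
// possibly broken socket.
void
impossibleChannelState(const char* detail) {
    std::cerr << "FATAL: socket requestor channel in impossible state: "
              << detail << std::endl;
    std::abort();  // fail fast: a core dump now beats undefined behavior later
}
}}}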
> > > Hmm, what do you mean? Like if the boss sent a "2\n" and the
> > > socket? Yes, then it would leak, but can we really avoid that in
> > > any way?
> >
> > See the third bullet, for example. If we pass the pair of status (1
> > byte) and FD via a single sendmsg()/recvmsg() (the status will be in
> > the 1-byte "normal" data; the FD is passed as ancillary data),
> > requestSocket() could examine both the FD and status and can close the
> > socket (e.g.) if FD is bogus.
>
> But if the "status" byte were bogus, the fd could leak anyway,
> couldn't it? If there was one?
The difference is that in the current style recv_fd() can be called
only after we examine the status, and it just wouldn't make sense to
call it if we know the status code is bogus. If we piggy-back the
status on the recvmsg() call, we can detect that the code is bogus
while still retrieving the FD from the ancillary data, and then close
the FD (all within recv_fd()).
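
To make the idea concrete, a sketch of such a combined receive (the
function name and the "0 means success" status convention are assumed
for the sketch; this is not the actual recv_fd() implementation):

{{{
#!cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>
#include <string.h>

// Receive a 1-byte status and (possibly) a file descriptor in a single
// recvmsg() call.  Returns the FD, or -1 on failure; if the status
// turns out to be bogus, the FD that arrived with it is closed right
// here, so it cannot leak.
int
recvStatusAndFd(int sock, unsigned char* status) {
    char data = 0;
    struct iovec iov;
    iov.iov_base = &data;
    iov.iov_len = 1;

    // Properly aligned buffer for one int of SCM_RIGHTS ancillary data.
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return (-1);            // EOF or error on the channel
    }
    *status = static_cast<unsigned char>(data);

    int fd = -1;
    for (struct cmsghdr* c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
            memcpy(&fd, CMSG_DATA(c), sizeof(fd));
        }
    }

    if (*status != 0 && fd >= 0) {
        close(fd);    // bogus status: drop the FD instead of leaking it
        return (-1);
    }
    return (fd);
}
}}}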
> > > > - I personally don't think it the best idea to signal the case
> > > >   of "CREATOR_SOCKET_UNAVAILABLE" by exception. At the very
> > > >   least it should be specifically separated from other
> > > >   exceptions such as the one resulting from bogus status data;
> > > >   otherwise, assuming the app should generally drop the
> > > >   connection in the latter case, we could drop and re-establish
> > > >   the connection unnecessarily.
> > >
> > > I'd like to point out that such a situation shouldn't happen in
> > > practice and would be the same kind of bug as wrong data being
> > > sent, etc. By the time the client picks up the socket, it was
> > > already prepared and reserved. It can be unavailable only in case
> > > the token is wrong.
> >
> > I thought "unavailable" can be possible for a reason we cannot
> > control, such as unrelated third-party application already opens the
> > port. Am I mistaken?
>
> We might have misunderstood each other. The CREATOR_SOCKET_UNAVAILABLE
> constant is for the case I described. If the socket can't be created
> for reasons like the port being taken already, the result would be
> sent over the msgq. But you're right that this case isn't signalled by
> a different exception now. The problem lies within boss, as it sends
> the same error status for all errors, which should be improved.
Right, I realized after making this comment that I had misunderstood
CREATOR_SOCKET_UNAVAILABLE, and I also understand that the main point
still stands, as you noted.
> Another cleanup ticket, maybe?
I think so, and I think it is quite important, because such an error
can easily happen in practice.
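
Purely as an illustration of the kind of separation that cleanup ticket
could introduce (these constants are invented, not BIND 10's actual
protocol values):

{{{
#!cpp
// Distinct error categories let the client react appropriately: log and
// report an ordinary allocation failure, but treat a protocol-level
// problem as a bug (e.g. drop and re-establish the connection).
enum SocketRequestResult {
    SOCKET_OK = 0,          // socket created and transferred
    SOCKET_ALLOCATE_ERROR,  // e.g. the port was already taken: expected,
                            // recoverable, worth its own exception
    SOCKET_PROTOCOL_ERROR   // malformed request/response: a bug somewhere
};
}}}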
> A few notes for the new code:
> * Is there no logging when the request fails for some reason (error
>   response over msgq, CREATOR_SOCKET_UNAVAILABLE, other failures…)?
My intent was that in these cases we throw an exception and it will be
logged in an appropriate context at a higher layer. (But when we
separate the error case of failing to create a socket from other rare
failures, it would probably make sense to log an event about the
former.)
> * The protocolString function could be used inside
>   createRequestSocketMessage as well, to reuse some code.
Okay, done, and while doing so I noticed that the current code could
accept bogus 'protocol' and 'share mode' values. Even though they may
be rejected after exchanging messages with the boss, it would be
better to reject such cases at an earlier point. I added the check
and corresponding tests.
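
A sketch of that early rejection (the enum is assumed to mirror
SocketRequestor's protocol values; the checking code itself is
illustrative, not the committed change):

{{{
#!cpp
#include <stdexcept>

enum Protocol { UDP, TCP };     // assumed to mirror SocketRequestor's enum

// Convert the protocol to its message string, rejecting bogus values in
// the client immediately instead of after a msgq round trip to the boss.
const char*
protocolString(Protocol protocol) {
    switch (protocol) {
    case UDP:
        return ("UDP");
    case TCP:
        return ("TCP");
    default:
        throw std::invalid_argument("invalid protocol");
    }
}
}}}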
> * Is it possible to make gtest produce a warning (similar to the "You
>   have disabled tests" line at the bottom of the tests) if some of
>   the tests are skipped because it is impossible to do the timeout?
As far as I know, no. I quickly looked through the Google Test
documentation but couldn't find anything usable for this purpose.
> * When reading the tokens in the child, the hardcoded length is 4.
>   Maybe add a note to the function that it shouldn't be used with
>   tokens of different lengths?
Good idea, and I actually thought about something like that. Added
some comments.
> * What is wrong with ssize_t in
>   85bf91d5836306a39b471567952ad1e0af91d948? ssize_t is signed, so the
>   error handling should work, and it seems cleaner than int, as it is
>   64-bit on 64-bit systems (therefore it allows sending more than 2GB
>   of data at once, though it is unlikely anyone would do that).
I simply tried to use the same type as the return value of read() or
write(), but I didn't necessarily have to change it. In any case,
using a type larger than int doesn't help anyway, because in practice
read()/write() can only return int-sized values. So in either case the
behavior of the code should be exactly the same. If you still want to
revert this part, however, please do so.
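
For reference, the error check in question behaves the same either way;
a minimal sketch using ssize_t, as vorner suggests (the helper name is
made up):

{{{
#!cpp
#include <unistd.h>

// Write the whole buffer, retrying on short writes.  ssize_t is signed,
// so the n == -1 error check is reliable; with int the behavior would
// be identical for any realistic buffer size.
bool
writeAll(int fd, const void* data, size_t len) {
    const char* p = static_cast<const char*>(data);
    while (len > 0) {
        const ssize_t n = write(fd, p, len);
        if (n == -1) {
            return (false);
        }
        p += n;
        len -= static_cast<size_t>(n);
    }
    return (true);
}
}}}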
--
Ticket URL: <http://bind10.isc.org/ticket/1522#comment:18>
BIND 10 Development <http://bind10.isc.org>