BIND 10 #657: listen_on for b10-auth leads to 100% CPU usage
BIND 10 Development
do-not-reply at isc.org
Mon Mar 7 21:41:21 UTC 2011
#657: listen_on for b10-auth leads to 100% CPU usage
-------------------------------------+-------------------------------------
 Reporter: jinmei                    | Owner: jinmei
 Type: defect                        | Status: reviewing
 Priority: blocker                   | Milestone: A-Team-Sprint-20110309
 Component: b10-auth                 | Resolution:
 Keywords:                           | Sensitive: 0
 Estimated Number of Hours: 0.0      | Add Hours to Ticket: 0
 Billable?: 1                        | Total Hours: 0
 Internal?: 0                        |
-------------------------------------+-------------------------------------
Changes (by vorner):
* owner: vorner => jinmei
* status: assigned => reviewing
Comment:
Hello
Replying to [comment:7 jinmei]:
> In fact, #388 does this for UDPServer (so the loop doesn't occur even
> without the patch of this thread), but the TCPServer version is buggy.
> TCPServer::stopped_by_hand_ is a real value (instead of a pointer or
> reference that can be shared by multiple objects), so even if
> TCPServer::stop() sets it to true, once the TCPServer object is copied
> via the coroutine framework the effect is canceled.
Well, the problem is that if we share it, we might accidentally close an
already accepted connection (or rather, not close it, but stop handling
it). A solution that doesn't have that problem would be much more
complicated and less readable.
Actually, I don't like stopped_by_hand_ much anyway. If we close the
socket, the OS knows about it, so nothing more will come in.
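To illustrate the difference (a made-up minimal example, not the real
TCPServer; only stopped_by_hand_ is a real name):
{{{
#include <boost/shared_ptr.hpp>

// Minimal illustration of why a plain bool member doesn't survive the
// copy the coroutine framework makes, while a shared flag would -- and
// why sharing it also stops every copy, including one that is already
// handling an accepted connection.
class Server {
public:
    Server() : stopped_by_hand_(false),
               shared_stop_flag_(new bool(false)) {}

    void stop() {
        stopped_by_hand_ = true;    // affects only this particular copy
        *shared_stop_flag_ = true;  // visible to every copy
    }

    bool stoppedPlain() const { return (stopped_by_hand_); }
    bool stoppedShared() const { return (*shared_stop_flag_); }

private:
    bool stopped_by_hand_;                      // copied by value
    boost::shared_ptr<bool> shared_stop_flag_;  // copies share one bool
};

// Server a;
// Server b(a);               // the copy made by the coroutine framework
// a.stop();
// b.stoppedPlain() == false  // the effect is "canceled" by the copy
// b.stoppedShared() == true  // but this also stops b, which may already
//                            // be serving an accepted connection
}}}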
> Moving one step forward, we should (probably) have been able to avoid
> having this bug in #388 if we really wrote tests. I understand tests
> that would rely on real sockets and network I/O like this one may not
> be trivial, but according to the ticket log of #388 it's even
> questionable whether the implementor even thought about tests. As a
> hindsight, we should have not allowed that quality of code to be
> merged in the first place.
About the tests, I agree we should have them. But that ticket would have
needed a test framework around the whole sockets stuff, and I believe
that would have been too big for it. I think a ticket was opened for such
a test framework back then, but I can't find it right now.
Anyway, I can't think of a test that could actually have caught this.
Because all the functionality was preserved, nothing stopped working. Even
the socket got closed, so testing that we can't connect wouldn't help. And
at the end of the test the io_service would be destroyed, eliminating even
the 100% CPU problem.
> Assuming we fix this bug of #388, I'm also not sure if this
> implementation is reasonable. It introduces another state,
> stopped_by_hand_, which has subtle relationship with other state of
> the class (i.e. sockets), making the code difficult to understand and
> more error prone. Closing sockets at this point doesn't seem to be a
> good idea, either; it will make subsequent socket operations fail
> abruptly, and since it cannot be distinguished from real errors in the
> network layer, e.g., we cannot be sure whether we should log the event
> or not (as commented in the patch). I'd rather "cancel" the socket
> here, and handle the cancel event inside the handler. That way, and
> with my understanding that it can be called multiple times, we can
> also eliminate the stopped_by_hand_ member variable. I also note
> "socket_->close()" in TCPServer::stop() doesn't do what the code
> intended to do (whether it's close() or cancel()) for the same reason
> as its buggy stopped_by_hand_.
Hmm. That sounds like a nice refactoring of the code. Should it go on a
separate ticket, this one, or #388 (reopened)?
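For reference, the cancel-based shape I read from your suggestion would be
roughly this (only a sketch; the signatures and the logging are from
memory, not a proposed final form):
{{{
void
TCPServer::stop() {
    // Cancel the outstanding asynchronous operations instead of closing
    // the socket; their handlers then run with
    // asio::error::operation_aborted, which we can tell apart from real
    // network errors.
    acceptor_->cancel();
}

void
TCPServer::operator()(asio::error_code ec, size_t length) {
    if (ec == asio::error::operation_aborted) {
        // Intentional cancel: just drop out of the coroutine quietly,
        // no error logging.
        return;
    }
    CORO_REENTER (this) {
        // ... existing accept/read logic ...
    }
}
}}}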
> I have another concern about how we handle updating "listen_on".
> Currently it clears all existing sockets and listens on new ones from
> the scratch. If the old and new sets of interfaces share the same
> address/port, this will (possibly) lead to service disruption. We
> should actually do this in more sophisticated way: compare the old and
> new sets, and stop only disappearing ones and open only really new
> ones.
Yes, and I believe there's a TODO about it. When I wrote it into the
resolver I wanted something that worked, with a note that this is
fine-tuning. I agree we want it in the long term, but it's not part of
this ticket. Should I just open a new one for it?
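If/when we get to it, I imagine something along these lines (a purely
hypothetical sketch; AddressPair, openListener and closeListener are
made-up names, not anything in the tree):
{{{
#include <set>
#include <string>
#include <utility>
#include <stdint.h>

// An (address, port) pair describing one listening socket.
typedef std::pair<std::string, uint16_t> AddressPair;

// Hypothetical helpers; the real code would create or destroy the
// corresponding server objects.
void openListener(const AddressPair&) { /* ... */ }
void closeListener(const AddressPair&) { /* ... */ }

void
updateListeners(const std::set<AddressPair>& current,
                const std::set<AddressPair>& wanted) {
    // Close only the listeners that disappeared from the configuration.
    for (std::set<AddressPair>::const_iterator it = current.begin();
         it != current.end(); ++it) {
        if (wanted.count(*it) == 0) {
            closeListener(*it);
        }
    }
    // Open only the genuinely new ones; an address/port present in both
    // sets keeps serving without interruption.
    for (std::set<AddressPair>::const_iterator it = wanted.begin();
         it != wanted.end(); ++it) {
        if (current.count(*it) == 0) {
            openListener(*it);
        }
    }
}
}}}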
> With all these considered, most substantially about the code/design
> quality of #388 without tests, I'd suggest we should step back first
> and restart: revert #388, #575, and #576, restart #388 with tests,
> then re-add #575 and #576.
Hmm, I'm a little bit afraid of that. I'm not sure what reverting these
branches would do to the code and the history; they are already quite
tangled into the rest. It might work, though.
But then we would either have to promise to have it done before the
release, or revert even the parts with the documentation updates.
Anyway, do you have an idea how to do the tests in a reasonable way?
> Finally, comments about the patch itself follow:
>
> - [comment] handling socket errors in UDP/TCPServer::operator() would
> be necessary anyway, even if we fix the various bugs of #388 (there
> may be some system level errors that could lead to infinite loop).
ACK. I was quite surprised at whoever wrote that code and why they did it
that way; it gives a kind of loopish impression.
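Roughly what I'd aim for in the receive loop is this (a sketch only; the
names besides socket_ and the exact set of errors worth retrying are
guesses):
{{{
CORO_REENTER (this) {
    do {
        CORO_YIELD socket_->async_receive_from(
            asio::buffer(data_, MAX_LENGTH), sender_, *this);
        if (ec) {
            // Retry only on errors a healthy socket can legitimately
            // produce; return on anything else so a persistent
            // system-level error cannot degenerate into a busy loop.
            if (ec != asio::error::would_block &&
                ec != asio::error::try_again &&
                ec != asio::error::interrupted) {
                return;
            }
        }
    } while (ec || length == 0);
    // ... hand the received packet over to the lookup code ...
}
}}}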
> - I'd avoid using UNIX-specific error code such as EWOULDBLOCK. ASIO
> has a higher level object constants (see asio/error.hpp)
Good ☺. The asio online documentation is a little bit unfriendly to me,
but I'll have a look there soon.
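For the record, I assume these are the constants you mean, declared in
asio/error.hpp:
{{{
#include <asio/error.hpp>

// Instead of comparing against the raw errno value EWOULDBLOCK:
if (ec == asio::error::would_block || ec == asio::error::try_again) {
    // transient condition, safe to retry
}
}}}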
> - [comment] you actually fixed another bug in TCPServer::operator().
> In the following original code:
> {{{
> do {
> CORO_YIELD acceptor_->async_accept(*socket_, *this);
> } while (!ec);
> }}}
> !ec should actually be just ec.
Yes, I know. I even had a theory about why the original code worked
anyway.
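For clarity, with the condition fixed as you describe, the inner accept
loop would read (sketch):
{{{
do {
    // Retry the accept only while it failed; on success fall through
    // and let the coroutine fork for the new connection.
    CORO_YIELD acceptor_->async_accept(*socket_, *this);
} while (ec);
}}}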
> - as for logging, as I suggested I'd separate intentional cancel and
> real error, and handle them separately in terms of logging.
Should that be part of this ticket?
ACK about the false nop; I had already noticed the hidden code there.
(Switching the ticket state back to reviewing; accepting it kind of takes
it out of the review state.)
--
Ticket URL: <https://bind10.isc.org/ticket/657#comment:9>
BIND 10 Development <http://bind10.isc.org>