BIND 10 #657: listen_on for b10-auth leads to 100% CPU usage

Mon Mar 7 20:19:47 UTC 2011

#657: listen_on for b10-auth leads to 100% CPU usage
-------------------------------------+-------------------------------------
                 Reporter:  jinmei   |                Owner:  jinmei
                     Type:  defect   |               Status:  accepted
                 Priority:  blocker  |            Milestone:  A-Team-
                Component:           |  Sprint-20110309
  b10-auth                           |           Resolution:
                 Keywords:           |            Sensitive:  0
Estimated Number of Hours:  0.0      |  Add Hours to Ticket:  0
                Billable?:  1        |          Total Hours:  0
                Internal?:  0        |
-------------------------------------+-------------------------------------

Comment (by jinmei):

 Replying to [comment:4 vorner]:

 > The problem was caused by a loop in the tcp and udp server which retried
 accept/recv on error. That caused a busy loop, because when the original
 socket was closed in update, it generated error on it's accept and it kept
 retrying on the closed socket.
 >
 > I added a condition to allow only some errors, which seemed to me to be
 non-fatal. However, I have no idea how to test this (because even with the
 bug, the server did work, it only kept being busy) and the asio
 documentation is quite sparse to the point how to recognize the errors
 (Stephen suggested that it probably contains errno, which makes sense, but
 how to test that too?).

 I confirmed the change stopped the symptom, but as I closely looked at
 the implementation, multi-fold concerns arose (and so the review took
 much longer than I originally expected).

 First, we should have been able to avoid having this problem if we had
 implemented #388 correctly.  "when the original socket was closed in
 update", the stop interface introduced in #388 is called, and since
 the "stop state" is checked in TCP/UDPServer::operator(), it should be
 able to avoid the infinite loop.

 In fact, #388 does this for UDPServer (so the loop doesn't occur even
 without the patch of this thread), but the TCPServer version is buggy.
 TCPServer::stopped_by_hand_ is a real value (instead of a pointer or
 reference that can be shared by multiple objects), so even if
 TCPServer::stop() sets it to true, once the TCPServer object is copied
 via the coroutine framework the effect is canceled.

 Moving one step forward, we should (probably) have been able to avoid
 having this bug in #388 if we really wrote tests.  I understand tests
 that would rely on real sockets and network I/O like this one may not
 be trivial, but according to the ticket log of #388 it's even
 questionable whether the implementor even thought about tests.  As a
 hindsight, we should have not allowed that quality of code to be
 merged in the first place.

 Assuming we fix this bug of #388, I'm also not sure if this
 implementation is reasonable.  It introduces another state,
 stopped_by_hand_, which has subtle relationship with other state of
 the class (i.e. sockets), making the code difficult to understand and
 more error prone.  Closing sockets at this point doesn't seem to be a
 good idea, either; it will make subsequent socket operations fail
 abruptly, and since it cannot be distinguished from real errors in the
 network layer, e.g., we cannot be sure whether we should log the event
 or not (as commented in the patch).  I'd rather "cancel" the socket
 here, and handle the cancel event inside the handler.  That way, and
 with my understanding that it can be called multiple times, we can
 also eliminate the stopped_by_hand_ member variable.  I also note
 "socket_->close()" in TCPServer::stop() doesn't do what the code
 intended to do (whether it's close() or cancel()) for the same reason
 as its buggy stopped_by_hand_.

 I have another concern about how we handle updating "listen_on".
 Currently it clears all existing sockets and listens on new ones from
 the scratch.  If the old and new sets of interfaces share the same
 address/port, this will (possibly) lead to service disruption.  We
 should actually do this in more sophisticated way: compare the old and
 new sets, and stop only disappearing ones and open only really new
 ones.

 With all these considered, most substantially abut the code/design
 quality of #388 without tests, I'd suggest we should step back first
 and restart: revert #388, #575, and #576, restart #388 with tests,
 then re-add #575 and #576.

 Finally, comments about the patch itself follow:

 - [comment] handling socket errors in UDP/TCPServer::operator() would
   be necessary anyway, even if we fix the various bugs of #388 (there
   may be some system level errors that could lead to infinite loop).
 - I'd avoid using UNIX-specific error code such as EWOULDBLOCK.  ASIO
   has a higher level object constants (see asio/error.hpp)
 - similarly, I'm not sure if we have to explicitly check the error
   category (it may not do harm actually, but the intent isn't clear
   and IMO reduces code readability)
 - [comment] you actually fixed another bug in TCPServer::operator().
   In the following original code:
 {{{
             do {
                 CORO_YIELD acceptor_->async_accept(*socket_, *this);
             } while (!ec);
 }}}
   !ec should actually be just ec.
 - as for logging, as I suggested I'd separate intentional cancel and
   real error, and handle them separately in terms of logging.

-- 
Ticket URL: <http://bind10.isc.org/ticket/657#comment:7>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development