BIND 10 #657: listen_on for b10-auth leads to 100% CPU usage
BIND 10 Development
do-not-reply at isc.org
Mon Mar 7 20:19:47 UTC 2011
#657: listen_on for b10-auth leads to 100% CPU usage
-------------------------------------+-------------------------------------
Reporter: jinmei | Owner: jinmei
Type: defect | Status: accepted
Priority: blocker | Milestone: A-Team-
Component: | Sprint-20110309
b10-auth | Resolution:
Keywords: | Sensitive: 0
Estimated Number of Hours: 0.0 | Add Hours to Ticket: 0
Billable?: 1 | Total Hours: 0
Internal?: 0 |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:4 vorner]:
> The problem was caused by a loop in the tcp and udp server which retried
accept/recv on error. That caused a busy loop, because when the original
socket was closed in update, it generated error on it's accept and it kept
retrying on the closed socket.
>
> I added a condition to allow only some errors, which seemed to me to be
non-fatal. However, I have no idea how to test this (because even with the
bug, the server did work, it only kept being busy) and the asio
documentation is quite sparse to the point how to recognize the errors
(Stephen suggested that it probably contains errno, which makes sense, but
how to test that too?).
I confirmed the change stopped the symptom, but as I closely looked at
the implementation, multi-fold concerns arose (and so the review took
much longer than I originally expected).
First, we should have been able to avoid having this problem if we had
implemented #388 correctly. "when the original socket was closed in
update", the stop interface introduced in #388 is called, and since
the "stop state" is checked in TCP/UDPServer::operator(), it should be
able to avoid the infinite loop.
In fact, #388 does this for UDPServer (so the loop doesn't occur even
without the patch of this thread), but the TCPServer version is buggy.
TCPServer::stopped_by_hand_ is a real value (instead of a pointer or
reference that can be shared by multiple objects), so even if
TCPServer::stop() sets it to true, once the TCPServer object is copied
via the coroutine framework the effect is canceled.
Moving one step forward, we should (probably) have been able to avoid
having this bug in #388 if we really wrote tests. I understand tests
that would rely on real sockets and network I/O like this one may not
be trivial, but according to the ticket log of #388 it's even
questionable whether the implementor even thought about tests. As a
hindsight, we should have not allowed that quality of code to be
merged in the first place.
Assuming we fix this bug of #388, I'm also not sure if this
implementation is reasonable. It introduces another state,
stopped_by_hand_, which has subtle relationship with other state of
the class (i.e. sockets), making the code difficult to understand and
more error prone. Closing sockets at this point doesn't seem to be a
good idea, either; it will make subsequent socket operations fail
abruptly, and since it cannot be distinguished from real errors in the
network layer, e.g., we cannot be sure whether we should log the event
or not (as commented in the patch). I'd rather "cancel" the socket
here, and handle the cancel event inside the handler. That way, and
with my understanding that it can be called multiple times, we can
also eliminate the stopped_by_hand_ member variable. I also note
"socket_->close()" in TCPServer::stop() doesn't do what the code
intended to do (whether it's close() or cancel()) for the same reason
as its buggy stopped_by_hand_.
I have another concern about how we handle updating "listen_on".
Currently it clears all existing sockets and listens on new ones from
the scratch. If the old and new sets of interfaces share the same
address/port, this will (possibly) lead to service disruption. We
should actually do this in more sophisticated way: compare the old and
new sets, and stop only disappearing ones and open only really new
ones.
With all these considered, most substantially abut the code/design
quality of #388 without tests, I'd suggest we should step back first
and restart: revert #388, #575, and #576, restart #388 with tests,
then re-add #575 and #576.
Finally, comments about the patch itself follow:
- [comment] handling socket errors in UDP/TCPServer::operator() would
be necessary anyway, even if we fix the various bugs of #388 (there
may be some system level errors that could lead to infinite loop).
- I'd avoid using UNIX-specific error code such as EWOULDBLOCK. ASIO
has a higher level object constants (see asio/error.hpp)
- similarly, I'm not sure if we have to explicitly check the error
category (it may not do harm actually, but the intent isn't clear
and IMO reduces code readability)
- [comment] you actually fixed another bug in TCPServer::operator().
In the following original code:
{{{
do {
CORO_YIELD acceptor_->async_accept(*socket_, *this);
} while (!ec);
}}}
!ec should actually be just ec.
- as for logging, as I suggested I'd separate intentional cancel and
real error, and handle them separately in terms of logging.
--
Ticket URL: <http://bind10.isc.org/ticket/657#comment:7>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development
More information about the bind10-tickets
mailing list