BIND 10 #657: listen_on for b10-auth leads to 100% CPU usage
BIND 10 Development
do-not-reply at isc.org
Tue Mar 8 03:37:57 UTC 2011
#657: listen_on for b10-auth leads to 100% CPU usage
-------------------------------------+-------------------------------------
Reporter: jinmei | Owner: jinmei
Type: defect | Status: reviewing
Priority: blocker | Milestone: A-Team-
Component: | Sprint-20110309
b10-auth | Resolution:
Keywords: | Sensitive: 0
Estimated Number of Hours: 0.0 | Add Hours to Ticket: 0
Billable?: 1 | Total Hours: 0
Internal?: 0 |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:9 vorner]:
> Well, the problem is, if we share it, we might close an already accepted
connection by accident by it (or, not close it, but stop handling it). A
solution that doesn't have such problem would be much more complicated and
unreadable.
>
> Actually, I don't like the stopped_by_hand_ much anyway. If we close the
socket, the OS knows. So nothing more will come.
[snip]
> About the tests, I agree we should have them. But in the ticket it would
need to implement test framework around the whole sockets stuff. I believe
it would be too big for that ticket. I think there was a ticket opened for
the test framework back then, but I can't find it right now.
>
> Anyway, I can't think of a test that could actually catch this. Because
all functionality was preserved, nothing stopped working. Even the socket
got closed, so testing we can't connect wouldn't help. And at the end of
the test, the io service would be destroyed, eliminating even the 100%CPU
problem.
Right now I don't have a specific idea about how to test it, but the
difficulty you mentioned above seemed to highlight another problem
about how it was developed (although I understand you were not
responsible for that): we should have began with how to test it and
then implement it. Clearly #388 wasn't developed that way, and now
the code is shown we are puzzled about how to test that given code.
This is another reason why I'd suggest we should first step back.
Aynchronous behavior like this one is generally hard to understand,
so IMO we shouldn't have allowed it to be merged without tests. Even
if we didn't have this bug, we'll eventually want to update the code,
but without tests we won't be able to do it with confidence.
Another reason why I'd rather prefer stepping back is to keep
ourselves responsible for writing tests. As I noted in my previous
comment, #388 was asked for review without tests and even without
explaining why the tests were missing. IMO this is not sufficiently
responsible, and it would be an effective way to keep ourselves
motivated if the cost of not doing so is being forced to revert the
irresponsible commit.
If you don't agree with reverting these features, however, one
possible compromise might be:
- merge the error check proposed in this ticket anyway. This would be
necessary regardless of the problem of #388 as discussed separately,
so we could say this is not to give an immunity to #388 but an
independent fix that happens to solve another bug due to #388.
- in parallel with that create a test ticket for #388 and give it a
highest priority (someone who is most responsible for #388 should do
it at a highest priority). This process should be done a test
driven manner, so that we won't be trapped into a particular style
of implementation, saying "we cannot test this implementation".
> > Assuming we fix this bug of #388, I'm also not sure if this
> > implementation is reasonable.
[snip]
> Hmm. That sounds like a nice refactoring of the code. Should it be on a
separate ticket, this one or #388? (repoened)
I'd say it will go to a separate ticket.
> > I have another concern about how we handle updating "listen_on".
[snip]
> Yes, and I believe there's a TODO about it. I wanted something working
at the time of writing it into resolver with a note that this is fine-
tuning. I agree we want it in the long term. But it's not part of this
ticket. Should I just open a new one for it?
Yes, please.
> > With all these considered, most substantially abut the code/design
> > quality of #388 without tests, I'd suggest we should step back first
> > and restart: revert #388, #575, and #576, restart #388 with tests,
> > then re-add #575 and #576.
>
> Hmm, I'm little bit afraid about it. I'm not sure what will reverting
these branches do with the code and history, they are already quite
tangled into the rest. It might work, though.
>
> But we might either promise to have it done before release, or revert
even the parts with documentation updates.
>
> Anyway, do you have an idea how to do the tests in reasonable way?
See above.
> > - I'd avoid using UNIX-specific error code such as EWOULDBLOCK. ASIO
> > has a higher level object constants (see asio/error.hpp)
>
> Good. The asio online documentation is little bit unfriendly to me, but
I'll have a look there soon.
Making documentation friendly is a difficult task:-) I think the ASIO
documentation is quite decent, but I agree it could be much improved.
It will also apply to our documentation:-)
> > - as for logging, as I suggested I'd separate intentional cancel and
> > real error, and handle them separately in terms of logging.
>
> Should that be part of this ticket?
Not necessarily. It depends on how we deal with this general issue:
if we step back, there will be multiple follow up tasks, and this
would be included part of them. If we choose to merge the error check
fix, it wouldn't contain the larger change such as using cancel
anyway.
--
Ticket URL: <http://bind10.isc.org/ticket/657#comment:10>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development
More information about the bind10-tickets
mailing list