BIND 10 #657: listen_on for b10-auth leads to 100% CPU usage

Tue Mar 8 03:37:57 UTC 2011

#657: listen_on for b10-auth leads to 100% CPU usage
-------------------------------------+-------------------------------------
                 Reporter:  jinmei   |                Owner:  jinmei
                     Type:  defect   |               Status:  reviewing
                 Priority:  blocker  |            Milestone:  A-Team-
                Component:           |  Sprint-20110309
  b10-auth                           |           Resolution:
                 Keywords:           |            Sensitive:  0
Estimated Number of Hours:  0.0      |  Add Hours to Ticket:  0
                Billable?:  1        |          Total Hours:  0
                Internal?:  0        |
-------------------------------------+-------------------------------------

Comment (by jinmei):

 Replying to [comment:9 vorner]:

 > Well, the problem is, if we share it, we might close an already accepted
 connection by accident by it (or, not close it, but stop handling it). A
 solution that doesn't have such problem would be much more complicated and
 unreadable.
 >
 > Actually, I don't like the stopped_by_hand_ much anyway. If we close the
 socket, the OS knows. So nothing more will come.

 [snip]

 > About the tests, I agree we should have them. But in the ticket it would
 need to implement test framework around the whole sockets stuff. I believe
 it would be too big for that ticket. I think there was a ticket opened for
 the test framework back then, but I can't find it right now.
 >
 > Anyway, I can't think of a test that could actually catch this. Because
 all functionality was preserved, nothing stopped working. Even the socket
 got closed, so testing we can't connect wouldn't help. And at the end of
 the test, the io service would be destroyed, eliminating even the 100%CPU
 problem.

 Right now I don't have a specific idea about how to test it, but the
 difficulty you mentioned above seemed to highlight another problem
 about how it was developed (although I understand you were not
 responsible for that): we should have began with how to test it and
 then implement it.  Clearly #388 wasn't developed that way, and now
 the code is shown we are puzzled about how to test that given code.
 This is another reason why I'd suggest we should first step back.

 Aynchronous behavior like this one is generally hard to understand,
 so IMO we shouldn't have allowed it to be merged without tests.  Even
 if we didn't have this bug, we'll eventually want to update the code,
 but without tests we won't be able to do it with confidence.

 Another reason why I'd rather prefer stepping back is to keep
 ourselves responsible for writing tests.  As I noted in my previous
 comment, #388 was asked for review without tests and even without
 explaining why the tests were missing.  IMO this is not sufficiently
 responsible, and it would be an effective way to keep ourselves
 motivated if the cost of not doing so is being forced to revert the
 irresponsible commit.

 If you don't agree with reverting these features, however, one
 possible compromise might be:
 - merge the error check proposed in this ticket anyway.  This would be
   necessary regardless of the problem of #388 as discussed separately,
   so we could say this is not to give an immunity to #388 but an
   independent fix that happens to solve another bug due to #388.
 - in parallel with that create a test ticket for #388 and give it a
   highest priority (someone who is most responsible for #388 should do
   it at a highest priority).  This process should be done a test
   driven manner, so that we won't be trapped into a particular style
   of implementation, saying "we cannot test this implementation".

 > > Assuming we fix this bug of #388, I'm also not sure if this
 > > implementation is reasonable.

 [snip]

 > Hmm. That sounds like a nice refactoring of the code. Should it be on a
 separate ticket, this one or #388? (repoened)

 I'd say it will go to a separate ticket.

 > > I have another concern about how we handle updating "listen_on".
 [snip]
 > Yes, and I believe there's a TODO about it. I wanted something working
 at the time of writing it into resolver with a note that this is fine-
 tuning. I agree we want it in the long term. But it's not part of this
 ticket. Should I just open a new one for it?

 Yes, please.

 > > With all these considered, most substantially abut the code/design
 > > quality of #388 without tests, I'd suggest we should step back first
 > > and restart: revert #388, #575, and #576, restart #388 with tests,
 > > then re-add #575 and #576.
 >
 > Hmm, I'm little bit afraid about it. I'm not sure what will reverting
 these branches do with the code and history, they are already quite
 tangled into the rest. It might work, though.
 >
 > But we might either promise to have it done before release, or revert
 even the parts with documentation updates.
 >
 > Anyway, do you have an idea how to do the tests in reasonable way?

 See above.

 > > - I'd avoid using UNIX-specific error code such as EWOULDBLOCK.  ASIO
 > >   has a higher level object constants (see asio/error.hpp)
 >
 > Good. The asio online documentation is little bit unfriendly to me, but
 I'll have a look there soon.

 Making documentation friendly is a difficult task:-)  I think the ASIO
 documentation is quite decent, but I agree it could be much improved.
 It will also apply to our documentation:-)

 > > - as for logging, as I suggested I'd separate intentional cancel and
 > >   real error, and handle them separately in terms of logging.
 >
 > Should that be part of this ticket?

 Not necessarily.  It depends on how we deal with this general issue:
 if we step back, there will be multiple follow up tasks, and this
 would be included part of them.  If we choose to merge the error check
 fix, it wouldn't contain the larger change such as using cancel
 anyway.

-- 
Ticket URL: <http://bind10.isc.org/ticket/657#comment:10>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development