BIND 10 #851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy

Wed Apr 20 16:39:01 UTC 2011

#851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy
-------------------------------------+-------------------------------------
                   Reporter:         |                 Owner:  jinmei
  jinmei                             |                Status:  reviewing
                       Type:         |             Milestone:
  defect                             |  Sprint-20110503
                   Priority:         |            Resolution:
  critical                           |             Sensitive:  0
                  Component:  data   |           Sub-Project:  DNS
  source                             |  Estimated Difficulty:  0.0
                   Keywords:         |           Total Hours:  0
            Defect Severity:  N/A    |
Feature Depending on Ticket:         |
        Add Hours to Ticket:  0      |
                  Internal?:  0      |
-------------------------------------+-------------------------------------

Comment (by jinmei):

 Replying to [comment:6 vorner]:

 > Looking at the surrounding code, it is indeed quite hairy. Why, if the
 > data really lives in cache, did it touch the database in the first place?
 > But that's not the point here, I guess.

 Do you mean, for example, zoneinfo (or some part of it that is
 necessary for query processing) should also be cached?  If so, I tend
 to agree, but as you probably understand that's a separate issue.  We
 could solve this symptom along with that point, but I'm afraid that
 will require larger, more fundamental changes and longer
 development/review cycles.  I also believe the current architecture of
 query processing and its interaction with the hot spot cache are already
 too complicated, and I'm reluctant to introduce more band-aids without
 revisiting the architecture (which we've been calling "refactoring").
 Hopefully we'll have time for this architecture-level work this year.

 > I have one question. Is the whole thing wrapped in some kind of
 > transaction? If yes, then I believe the code can be merged. If not, then
 > this only makes the probability of occurrence smaller ‒ because the
 > following code still does expect that if it got there, it can get the data
 > from DB and be happy.

 I'm not sure I understand your concern.  First, there's no transaction
 (in any sense) around this code.  But I don't think
 the revised code has any possibility of triggering the same bug, even
 with a smaller probability.  Any query process begins with doQuery(),
 and it always calls doQueryTask().  If the timing issue that triggered
 this crash bug happens in our first attempt at getting the zoneinfo in
 doQueryTask(), it will result in 'task.flags' having the NO_SUCH_ZONE
 flag.  Then, back in doQuery(), any further processing is canceled
 (either resulting in a REFUSED answer or in omitting a particular part
 of the response):

 {{{
         if (task->flags == NO_SUCH_ZONE) {
             if (task->state == QueryTask::GETANSWER) {
                 m.setRcode(Rcode::REFUSED());
                 return;
             }
             continue;
         }
 }}}
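
 To make that flow concrete, here is a simplified, self-contained sketch
 of the control path described above.  It is not the actual data source
 code: the 'zone_found' and 'data_used' members, the RCODE_* enum, and
 the reduced signatures of doQuery() and doQueryTask() are hypothetical
 stand-ins used only for illustration.  The point is that a failed
 zoneinfo lookup is recorded in the task's flags, and the caller then
 never reaches the code that expects data from the DB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the real flag and rcode types.
enum Flags { FLAG_NONE = 0, NO_SUCH_ZONE = 1 };
enum SketchRcode { RCODE_NOERROR, RCODE_REFUSED };

struct QueryTask {
    enum State { GETANSWER, GETADDITIONAL };
    State state;
    bool zone_found;  // assumption: models whether the zoneinfo lookup succeeds
    int flags;
    bool data_used;   // assumption: marks tasks whose DB data was consumed

    QueryTask(State s, bool found)
        : state(s), zone_found(found), flags(FLAG_NONE), data_used(false) {}
};

// First step of every task: try to get the zoneinfo.  If that fails
// (e.g. because the DB is busy), only record the failure in the flags.
void doQueryTask(QueryTask& task) {
    if (!task.zone_found) {
        task.flags = NO_SUCH_ZONE;
    }
}

// Every query starts here and always goes through doQueryTask().
SketchRcode doQuery(std::vector<QueryTask>& tasks) {
    for (std::size_t i = 0; i < tasks.size(); ++i) {
        QueryTask& task = tasks[i];
        doQueryTask(task);

        if (task.flags == NO_SUCH_ZONE) {
            if (task.state == QueryTask::GETANSWER) {
                return RCODE_REFUSED;  // original query: answer REFUSED
            }
            continue;                  // secondary task: skip this part
        }

        // Only reached when the zoneinfo lookup succeeded.
        task.data_used = true;
    }
    return RCODE_NOERROR;
}
```

 In this sketch, a GETANSWER task whose lookup fails makes doQuery()
 return REFUSED immediately, and any other task whose lookup fails is
 skipped, so the code after the flag check never runs for it.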

 If you think this observation is not correct or you meant something
 different, could you explain your concern along with the code?

-- 
Ticket URL: <http://bind10.isc.org/ticket/851#comment:7>
BIND 10 Development <http://bind10.isc.org>