BIND 10 #851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy
BIND 10 Development
do-not-reply at isc.org
Wed Apr 20 16:39:01 UTC 2011
-------------------------------------+-------------------------------------
Reporter: jinmei                     | Owner: jinmei
Type: defect                         | Status: reviewing
Priority: critical                   | Milestone: Sprint-20110503
Component: data source               | Resolution:
Keywords:                            | Sensitive: 0
Defect Severity: N/A                 | Sub-Project: DNS
Feature Depending on Ticket:         | Estimated Difficulty: 0.0
Add Hours to Ticket: 0               | Total Hours: 0
Internal?: 0                         |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:6 vorner]:
> Looking at the surrounding code, it is indeed quite hairy. Why, if the
> data really lives in cache, did it touch the database in the first place?
> But that's not the point here, I guess.
Do you mean that, for example, the zoneinfo (or whatever part of it is
necessary for query processing) should also be cached? If so, I tend
to agree, but as you probably understand, that's a separate issue. We
could solve this symptom along with that point, but I'm afraid that
would require larger, more fundamental changes and longer
development/review cycles. I also believe the current architecture of
query processing and its interaction with the hot spot cache are already
too complicated, and I'm reluctant to introduce more band-aids without
revisiting the architecture (which we've been calling "refactoring").
Hopefully we'll have time for this architecture-level work this year.
> I have one question. Is the whole thing wrapped in some kind of
> transaction? If yes, then I believe the code can be merged. If not, then
> this only makes the probability of occurrence smaller ‒ because the
> following code still does expect that if it got there, it can get the data
> from DB and be happy.
I'm not sure I understand your concern. First, there's no transaction
(in any sense) around this code. But I don't think the revised code
has any possibility of triggering the same bug, even with a smaller
probability. Every query begins with doQuery(), which always calls
doQueryTask(). If the timing issue that triggered this crash happens
in our first attempt to get the zoneinfo in doQueryTask(), it will
result in 'task.flags' having the NO_SUCH_ZONE flag. Then in
doQueryTask() any further processing is canceled (either resulting in
a REFUSED answer or in omitting a particular part of the response):
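{{{
// The zone lookup failed for this task.
if (task->flags == NO_SUCH_ZONE) {
    if (task->state == QueryTask::GETANSWER) {
        // Main answer task: refuse the whole query.
        m.setRcode(Rcode::REFUSED());
        return;
    }
    // Any other task: skip it, leaving that part out of the response.
    continue;
}
}}}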
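
To make the flow concrete, here is a minimal, self-contained sketch of
the control path described above. It is not the actual data source
code: the Task and Response structs, the doQuerySketch() and
doQueryTaskSketch() functions, the GETADDITIONAL state and the
'zone_available' parameter are all hypothetical stand-ins, and the
flag value is arbitrary. It only illustrates that, once the zone
lookup fails, no later step is reached that assumes the zone data
exists.

```cpp
#include <cstdint>
#include <queue>
#include <string>

// Hypothetical stand-ins for illustration; not the real BIND 10 types.
enum class Rcode { NOERROR, REFUSED };
enum class State { GETANSWER, GETADDITIONAL };
constexpr std::uint32_t NO_SUCH_ZONE = 0x01;  // arbitrary value (assumption)

struct Task {
    std::string qname;
    State state;
    std::uint32_t flags;
};

struct Response {
    Rcode rcode = Rcode::NOERROR;
    int sections_filled = 0;  // counts tasks that contributed data
};

// Models the first step of doQueryTask(): the zone lookup.
// 'zone_available' simulates whether the lookup succeeded (e.g. false
// when the DB was busy at that moment).
void doQueryTaskSketch(Task& task, bool zone_available) {
    if (!zone_available) {
        task.flags = NO_SUCH_ZONE;
        return;  // nothing below may assume zone data exists
    }
    // ... the real code would fetch data from the cache or DB here ...
}

// Models the task loop: every task goes through the lookup, and a
// failed lookup stops processing of that task.
Response doQuerySketch(std::queue<Task> tasks, bool zone_available) {
    Response m;
    while (!tasks.empty()) {
        Task task = tasks.front();
        tasks.pop();
        doQueryTaskSketch(task, zone_available);
        if (task.flags == NO_SUCH_ZONE) {
            if (task.state == State::GETANSWER) {
                m.rcode = Rcode::REFUSED;  // refuse the whole query
                return m;
            }
            continue;  // skip only this part of the response
        }
        ++m.sections_filled;
    }
    return m;
}
```

In this sketch, a GETANSWER task with a failed lookup yields REFUSED
right away, while a failed auxiliary task is skipped and the rest of
the response is still built.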
If you think this observation is not correct or you meant something
different, could you explain your concern along with the code?
--
Ticket URL: <http://bind10.isc.org/ticket/851#comment:7>
BIND 10 Development <http://bind10.isc.org>