BIND 10 #851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy
BIND 10 Development
do-not-reply at isc.org
Thu Apr 21 18:57:27 UTC 2011
#851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy
-------------------------------------+-------------------------------------
Reporter: | Owner: jinmei
jinmei | Status: reviewing
Type: | Milestone:
defect | Sprint-20110503
Priority: | Resolution:
critical | Sensitive: 0
Component: data | Sub-Project: DNS
source | Estimated Difficulty: 0.0
Keywords: | Total Hours: 0
Defect Severity: N/A |
Feature Depending on Ticket: |
Add Hours to Ticket: 0 |
Internal?: 0 |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:9 vorner]:
> Well, I didn't really read the code to depth. But as I understand it, it
worked like this before the change:
> * Someone asked a query, the answer got cached.
> * Some time passes.
> * The DB becomes busy (by xfrin maybe).
> * Someone asks the same query.
> * The cache says it's valid, so we continue processing.
> * We get part of the answer from cache.
> * We ask the DB something more (eg. the `zoneinfo.getEnclosingZone()`
on line 947), it fails because of being busy, and returns NULL. We don't
expect NULL here, so we crash.
This is correct (or at least matches my understanding).
> What I think can happen now is this:
> * Someone asked a query, the answer got cached.
> * Some time passes.
> * Someone asks the same query.
> * We first check the DB (in `doQueryTask`) if it is valid. The DB
confirms.
And the 'zoneinfo' variable stores the positive result. From this
point, zoneinfo.getDataSource() or zoneinfo.getEnclosingZone() won't
require DB search, and both always return non NULL.
> * We get part of the answer from cache.
> * The DB becomes busy right now (by xfrin). It's a tiny time window,
but it can happen, we didn't lock the DB in transaction.
> * We ask the DB something more (again, the
`zoneinfo.getEnclosingZone()` on line 947), it fails, crash.
This doesn't happen, because zoneinfo remembers the positive result.
And, from my check before submitting this patch, there shouldn't be
any other form of unintentional DB access within the same while loop
of DataSrc::doQuery().
> Anyway, asking the DB several times without having a transaction is IMO
wrong, because we can get inconsistent information (someone can modify the
DB between our queries to it).
Whether or not solving it using a transaction, yes, it's bad that we
can have (and subsequently return) inconsistent data in terms of 100%
accuracy. This argument applies even without the hot spot cache, and
IMO is far beyond the scope of this bug fix ticket. So, in any case,
I'd propose separating this issue, and concentrate on fixing the crash
bug by making the code as good/bad as the one without the hot spot
cache.
With noting that, for the food for the separate point, we previously
discussed how accurate we want to be in terms of this type of timing
issue. BIND 9 tries to be very accurate so that, e.g., all records
from the same zone for a single response should belong to the same
version even if the zone is being modified by a dynamic update running
in parallel with the query processing. But to ensure this level of
accuracy its code became super complicated. We discussed whether we
should provide the same level of accuracy at all cost and complexity
in BIND 10, and, although I don't think there was a clear consensus,
we generally though there should be a better middle ground (i.e.,
preferring implementation simplicity by sacrificing some level of
accuracy).
I have no problem with revisiting this discussion. But please trigger
it in a different thread.
--
Ticket URL: <http://bind10.isc.org/ticket/851#comment:10>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development
More information about the bind10-tickets
mailing list