BIND 10 #851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy

Thu Apr 21 18:57:27 UTC 2011

#851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy
-------------------------------------+-------------------------------------
                   Reporter:         |                 Owner:  jinmei
  jinmei                             |                Status:  reviewing
                       Type:         |             Milestone:
  defect                             |  Sprint-20110503
                   Priority:         |            Resolution:
  critical                           |             Sensitive:  0
                  Component:  data   |           Sub-Project:  DNS
  source                             |  Estimated Difficulty:  0.0
                   Keywords:         |           Total Hours:  0
            Defect Severity:  N/A    |
Feature Depending on Ticket:         |
        Add Hours to Ticket:  0      |
                  Internal?:  0      |
-------------------------------------+-------------------------------------

Comment (by jinmei):

 Replying to [comment:9 vorner]:

 > Well, I didn't really read the code to depth. But as I understand it, it
 worked like this before the change:
 >  * Someone asked a query, the answer got cached.
 >  * Some time passes.
 >  * The DB becomes busy (by xfrin maybe).
 >  * Someone asks the same query.
 >  * The cache says it's valid, so we continue processing.
 >  * We get part of the answer from cache.
 >  * We ask the DB something more (eg. the `zoneinfo.getEnclosingZone()`
 on line 947), it fails because of being busy, and returns NULL. We don't
 expect NULL here, so we crash.

 This is correct (or at least matches my understanding).

 > What I think can happen now is this:
 >  * Someone asked a query, the answer got cached.
 >  * Some time passes.
 >  * Someone asks the same query.
 >  * We first check the DB (in `doQueryTask`) if it is valid. The DB
 confirms.

 And the 'zoneinfo' variable stores the positive result.  From this
 point, zoneinfo.getDataSource() or zoneinfo.getEnclosingZone() won't
 require DB search, and both always return non NULL.

 >  * We get part of the answer from cache.
 >  * The DB becomes busy right now (by xfrin). It's a tiny time window,
 but it can happen, we didn't lock the DB in transaction.
 >  * We ask the DB something more (again, the
 `zoneinfo.getEnclosingZone()` on line 947), it fails, crash.

 This doesn't happen, because zoneinfo remembers the positive result.

 And, from my check before submitting this patch, there shouldn't be
 any other form of unintentional DB access within the same while loop
 of DataSrc::doQuery().

 > Anyway, asking the DB several times without having a transaction is IMO
 wrong, because we can get inconsistent information (someone can modify the
 DB between our queries to it).

 Whether or not solving it using a transaction, yes, it's bad that we
 can have (and subsequently return) inconsistent data in terms of 100%
 accuracy.  This argument applies even without the hot spot cache, and
 IMO is far beyond the scope of this bug fix ticket.  So, in any case,
 I'd propose separating this issue, and concentrate on fixing the crash
 bug by making the code as good/bad as the one without the hot spot
 cache.

 With noting that, for the food for the separate point, we previously
 discussed how accurate we want to be in terms of this type of timing
 issue.  BIND 9 tries to be very accurate so that, e.g., all records
 from the same zone for a single response should belong to the same
 version even if the zone is being modified by a dynamic update running
 in parallel with the query processing.  But to ensure this level of
 accuracy its code became super complicated.  We discussed whether we
 should provide the same level of accuracy at all cost and complexity
 in BIND 10, and, although I don't think there was a clear consensus,
 we generally though there should be a better middle ground (i.e.,
 preferring implementation simplicity by sacrificing some level of
 accuracy).

 I have no problem with revisiting this discussion.  But please trigger
 it in a different thread.

-- 
Ticket URL: <http://bind10.isc.org/ticket/851#comment:10>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development