BIND 10 #851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy

Thu Apr 14 00:25:37 UTC 2011

#851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy
-------------------------------------+-------------------------------------
           Reporter:  jinmei         |                      Owner:
               Type:  defect         |                     Status:  new
           Priority:  critical       |                  Milestone:  Year 3
          Component:  data source    |  Task Backlog
          Sensitive:  0              |                   Keywords:
Add Hours to Ticket:  0              |  Estimated Number of Hours:  0
        Total Hours:  0              |                  Billable?:  1
                                     |                  Internal?:  0
-------------------------------------+-------------------------------------
 When b10-auth uses the sqlite3 data source with the hot spot cache,
 the cache content can differ substantially from the actual database.
 In the worst case this can crash b10-auth (and apparently it actually
 happened on my personal "production" server).

 Specifically, DataSrc::doQuery() can reach the following point after
 retrieving the answer from the cache successfully:

 {{{
         const Name* const zonename = zoneinfo.getEnclosingZone();
         if ((task->flags & REFERRAL) != 0 &&
             (zonename->getLabelCount() == task->qname.getLabelCount() ||
 }}}

 But if the zone itself has been deleted from the DB by the time
 doQuery() is called, getEnclosingZone() can return NULL, which (when
 the task has the REFERRAL flag on) will subsequently cause a crash due
 to a NULL pointer dereference.

 Such inconsistency between the cache and the DB can always happen
 under normal operation, but another scenario can cause the same effect
 more realistically.  sqlite3_step() seems to fail immediately with a
 return value of SQLITE_BUSY when the DB is locked.  But our sqlite3
 data source code treats every non-successful return code as if no
 record were found (the case in which SQLITE_DONE would be returned).
 So, for example, if b10-auth handles a query while xfrin is updating
 the DB, some call to sqlite3_step() inside the sqlite3 data source can
 fail with SQLITE_BUSY and have the same effect as the previous
 scenario.

 For an initial fix, we should at least stop assuming that
 getEnclosingZone() always returns a valid (non-NULL) pointer, even
 after a successful initial lookup.  If we find an inconsistency, it
 should result in SERVFAIL for now.
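
 A minimal sketch of such a check follows.  Name, ZoneInfo, QueryTask,
 the Result codes and the REFERRAL value are hypothetical stand-ins
 (not the real BIND 10 definitions), reduced to what the check needs:

```cpp
#include <cstddef>

// Hypothetical stand-ins for the real BIND 10 types; only the parts
// needed to illustrate the NULL check are modeled here.
struct Name {
    unsigned int labels;
    unsigned int getLabelCount() const { return labels; }
};

struct ZoneInfo {
    const Name* enclosing;  // NULL when no enclosing zone is found
    const Name* getEnclosingZone() const { return enclosing; }
};

struct QueryTask {
    unsigned int flags;
    Name qname;
};

enum Result { CONTINUE, SERVFAIL };  // hypothetical outcome codes
const unsigned int REFERRAL = 0x01;  // hypothetical flag value

// Returns SERVFAIL instead of dereferencing a NULL zone name.
// at_zone_apex is set only when the referral task's qname has the same
// label count as the enclosing zone.
Result checkEnclosingZone(const ZoneInfo& zoneinfo, const QueryTask& task,
                          bool& at_zone_apex) {
    at_zone_apex = false;
    const Name* const zonename = zoneinfo.getEnclosingZone();
    if (zonename == NULL) {
        // The cache answered, but the zone is gone (or the DB lookup
        // failed), so the cache and the DB are inconsistent.
        return SERVFAIL;
    }
    if ((task.flags & REFERRAL) != 0 &&
        zonename->getLabelCount() == task.qname.getLabelCount()) {
        at_zone_apex = true;
    }
    return CONTINUE;
}
```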

 We should then distinguish SQLITE_DONE from the failure return values
 of sqlite3_step().  In the failure case, I suspect we should return
 SERVFAIL for now.
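
 One possible shape for that separation, as a sketch only: the return
 code of sqlite3_step() is mapped to three outcomes instead of two.
 The StepOutcome enum and classifyStep() are hypothetical names; the
 numeric constants mirror SQLITE_BUSY, SQLITE_ROW and SQLITE_DONE from
 sqlite3.h and are redefined locally only so the sketch compiles on its
 own:

```cpp
// Local copies of the sqlite3.h result codes (same numeric values),
// renamed so this sketch does not depend on the sqlite3 headers.
const int RC_BUSY = 5;    // SQLITE_BUSY: the database file is locked
const int RC_ROW = 100;   // SQLITE_ROW: a row of data is ready
const int RC_DONE = 101;  // SQLITE_DONE: no (more) rows

// Hypothetical three-way outcome for the data source code.
enum StepOutcome { FOUND_ROW, NO_MORE_ROWS, LOOKUP_FAILED };

// Classify a sqlite3_step() return code.  Only SQLITE_DONE means
// "no record found"; anything else that is not a row is a failure
// that the caller can turn into SERVFAIL.
StepOutcome classifyStep(int rc) {
    switch (rc) {
    case RC_ROW:
        return FOUND_ROW;
    case RC_DONE:
        return NO_MORE_ROWS;
    default:
        return LOOKUP_FAILED;  // SQLITE_BUSY and every other code
    }
}
```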

 For a longer term solution, we should ensure (if possible at all)
 that queries don't result in SERVFAIL just because a zone is being
 updated after xfrin, etc. (I believe SERVFAIL is okay in rare
 operational cases, such as when the zone is really deleted while some
 RRs of the zone still remain in the cache).  But that will be
 difficult to implement.  One simple approach is to impose a short
 timeout on sqlite3_step() or to retry after SQLITE_BUSY, but it's not
 100% reliable.  Also, we probably don't want to introduce a blocking
 operation in query processing, even a short one.  So this should be
 deferred to a separate future task.
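
 For reference, the retry idea could look roughly like the sketch
 below.  stepWithRetry() and the callable it takes are hypothetical;
 the callable stands in for a call to sqlite3_step(), and the constants
 mirror SQLITE_BUSY and SQLITE_DONE from sqlite3.h.  The loop is
 bounded and does not sleep, so it can still give up with the busy
 code, which is why it is not 100% reliable:

```cpp
// Local copies of the sqlite3.h result codes (same numeric values).
const int kBusy = 5;    // SQLITE_BUSY
const int kDone = 101;  // SQLITE_DONE

// Call step() up to max_attempts times, stopping as soon as it returns
// something other than the busy code.  Returns the last code seen; if
// every attempt was busy (or max_attempts <= 0), that is still kBusy
// and the caller has to treat it as a failure.
template <typename StepFn>
int stepWithRetry(StepFn step, int max_attempts) {
    int rc = kBusy;
    for (int i = 0; i < max_attempts && rc == kBusy; ++i) {
        rc = step();
    }
    return rc;
}
```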

-- 
Ticket URL: <http://bind10.isc.org/ticket/851>
BIND 10 Development <http://bind10.isc.org>