BIND 10 #851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy
BIND 10 Development
do-not-reply at isc.org
Thu Apr 14 00:25:37 UTC 2011
#851: b10-auth (+ hotspot cache) can crash in handling query when DB is busy
-------------------------------------+-------------------------------------
Reporter: jinmei | Owner:
Type: defect | Status: new
Priority: critical | Milestone: Year 3 Task Backlog
Component: data source |
Sensitive: 0 | Keywords:
Add Hours to Ticket: 0 | Estimated Number of Hours: 0
Total Hours: 0 | Billable?: 1
| Internal?: 0
-------------------------------------+-------------------------------------
When b10-auth uses the sqlite3 data source with the hot spot cache,
the cache content can differ substantially from the actual database.
In the worst case this can crash b10-auth (and apparently it actually
happened on my personal "production" server).
Specifically, DataSrc::doQuery() can reach the following point after
retrieving the answer from the cache successfully:
{{{
const Name* const zonename = zoneinfo.getEnclosingZone();
if ((task->flags & REFERRAL) != 0 &&
(zonename->getLabelCount() == task->qname.getLabelCount() ||
}}}
But if the zone itself has been deleted from the DB by the time
doQuery() is called, getEnclosingZone() can return NULL, which (when
the task has the REFERRAL flag on) will subsequently cause a crash due
to a NULL pointer dereference.
Such an inconsistency between the cache and the DB can always happen
under normal operation, but there's another scenario that can cause
the same effect more realistically. sqlite3_step() seems to fail
immediately with a return value of SQLITE_BUSY when the DB is locked.
But our sqlite3 data source code treats every non-successful return
code as if no record were found (the case in which SQLITE_DONE is
returned). So, for example, if b10-auth handles a query while
xfrin is updating the DB, some call to sqlite3_step() inside the
sqlite3 data source can fail with SQLITE_BUSY and have the same effect
as the previous scenario.
For an initial fix, we should at least stop assuming that the pointer
returned by getEnclosingZone() is always valid (even after a
successful initial lookup). If we find an inconsistency, it should
result in SERVFAIL for now.
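A minimal sketch of that check follows. Name, QueryTask, Outcome, checkEnclosingZone() and the REFERRAL value are simplified, hypothetical stand-ins rather than the real BIND 10 definitions, and only the part of the condition quoted above is modeled:

```cpp
// Hypothetical stand-in for the real Name class; only the label count matters here.
struct Name {
    unsigned int labels;
    unsigned int getLabelCount() const { return labels; }
};

// Hypothetical flag value; the real constant is defined in the data source code.
const unsigned int REFERRAL = 0x01;

// Hypothetical stand-in for the query task.
struct QueryTask {
    Name qname;
    unsigned int flags;
};

// What the caller should do after looking at the enclosing zone.
enum class Outcome { ServFail, SameLabelCount, Other };

// Check the pointer before dereferencing it. A null zone name after a cache
// hit means the cache and the DB disagree, so the caller answers SERVFAIL.
Outcome checkEnclosingZone(const Name* zonename, const QueryTask& task) {
    if (zonename == nullptr) {
        return Outcome::ServFail;
    }
    // Only the first half of the quoted condition; the rest is not shown above.
    if ((task.flags & REFERRAL) != 0 &&
        zonename->getLabelCount() == task.qname.getLabelCount()) {
        return Outcome::SameLabelCount;
    }
    return Outcome::Other;
}
```

The point is only that the NULL test comes before any use of zonename, regardless of whether the REFERRAL flag is set.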
We should then distinguish SQLITE_DONE from real failures of
sqlite3_step(). In the latter case, I suspect we should return
SERVFAIL for now.
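As a rough illustration of that separation, the return code could be classified three ways instead of two. StepResult and classifyStep() are hypothetical names, and the constants are copied from sqlite3.h so the sketch compiles on its own; real code would include <sqlite3.h> and use SQLITE_ROW, SQLITE_DONE and SQLITE_BUSY directly:

```cpp
// Values mirror the macros in sqlite3.h (assumption: real code uses the macros).
const int kSqliteBusy = 5;    // SQLITE_BUSY
const int kSqliteRow = 100;   // SQLITE_ROW
const int kSqliteDone = 101;  // SQLITE_DONE

// Hypothetical three-way result for one call to sqlite3_step().
enum class StepResult { HaveRow, NoMoreRows, Failure };

// Map a sqlite3_step() return code to a result. Only SQLITE_DONE means
// "no record found"; everything else that is not a row is a failure.
StepResult classifyStep(int rc) {
    switch (rc) {
    case kSqliteRow:
        return StepResult::HaveRow;
    case kSqliteDone:
        return StepResult::NoMoreRows;
    default:
        // Includes SQLITE_BUSY; the caller would answer SERVFAIL for now.
        return StepResult::Failure;
    }
}
```

With this split, a locked DB during an xfrin update shows up as Failure rather than being mistaken for an empty result.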
For a longer term solution, we should ensure (if possible at all)
that queries don't result in SERVFAIL just because a zone is being
updated after xfrin, etc. (I believe SERVFAIL is okay in rare
operational cases, such as when the zone is really deleted while some
RRs of the zone still remain in the cache). But it will be difficult
to implement. One simple approach is to impose a short timeout on
sqlite3_step() or to retry after SQLITE_BUSY, but it's not 100%
reliable. Also, we probably don't want to introduce a blocking
operation in query processing even if it's short. So this should be
deferred to a separate future task.
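For reference, the retry idea could look roughly like the sketch below. stepWithRetry() and kBusyCode are hypothetical, the step is passed in as a callable so that the sketch does not depend on the sqlite3 library, and the number of attempts is an arbitrary parameter:

```cpp
#include <functional>

// Mirrors SQLITE_BUSY from sqlite3.h (assumption: real code uses the macro).
const int kBusyCode = 5;

// Call step() until it returns something other than "busy" or until
// max_attempts calls have been made, and return the last code seen.
// There is no sleep between attempts, so this does not wait for the lock;
// it only gives the writer a few chances to finish.
int stepWithRetry(const std::function<int()>& step, int max_attempts) {
    int rc = kBusyCode;
    for (int i = 0; i < max_attempts && rc == kBusyCode; ++i) {
        rc = step();
    }
    return rc;
}
```

If the DB stays locked for all attempts, the caller still sees the busy code and has to fall back to SERVFAIL, which is why this is not 100% reliable.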
--
Ticket URL: <http://bind10.isc.org/ticket/851>
BIND 10 Development <http://bind10.isc.org>