Are failures cached?

Thu May 1 17:30:39 UTC 2008

Here I'm rehashing and expanding upon what has been said.

The top level domain server (for the .com domain, which delegates
water.com) answered queries for the water.com nameserver
with the wrong nameserver and said, furthermore, that the data
was valid for two days.  This caused caching nameservers to
hold on to the wrong delegation data (NS record) for two days.  (We
know it was the wrong nameserver because that's how you described
the problem.)

If that wrong nameserver didn't give answers for lookups within
water.com (e.g. no SOA, no authority flag (or no answer at all?)),
then caching nameservers running BIND9 would cache, for 10 minutes,
the fact that that nameserver is not answering, which is a type of
negative caching.  This helps a caching nameserver's efficiency,
because it does not send every single query it receives to that
nameserver.  However, this negative caching, which says the
nameserver is lame/missing, doesn't affect the fact that the caching
nameserver still has an NS record pointing at that wrong nameserver,
i.e., it had been told that that is where to go to resolve those
names.  While it will wait ten minutes before sending queries to that
nameserver, after the ten minutes it goes and asks the same
nameserver again.  Hence problems remain for two days.
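
To make the timing concrete, here is a rough sketch in Python of the
cycle described above.  It is not how BIND9 is implemented; it only
models the two timers from this thread (the two-day NS TTL and the
ten-minute lame-server cache) and assumes clients keep asking for
water.com names the whole time.  The function name is made up for
illustration.

```python
NS_TTL = 2 * 24 * 60 * 60   # TTL on the wrong NS record: two days, in seconds
LAME_TTL = 10 * 60          # lame-server cache time described above: ten minutes

def retries_before_ns_expiry(ns_ttl=NS_TTL, lame_ttl=LAME_TTL):
    """Return the times (seconds after the NS record was cached) at which
    the caching nameserver queries the wrong nameserver again.
    Assumes there is always a client query waiting when the lame entry expires."""
    attempts = []
    now = 0
    while now < ns_ttl:        # the wrong NS record is still in the cache
        attempts.append(now)   # ask the wrong nameserver, get no usable answer
        now += lame_ttl        # back off until the lame-server entry expires
    return attempts

attempts = retries_before_ns_expiry()
print(len(attempts), "attempts; last one at", attempts[-1], "seconds")
# prints: 288 attempts; last one at 172200 seconds
```

Under those assumptions the wrong nameserver gets asked 288 times,
and none of those failures shortens the two days.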

Since we know the specifics of your problem, we know that, in
this particular case, it would be helpful to delete the water.com NS
records in that caching nameserver (even though the records still had
time left on their TTLs), and go back to the top level domain server
and get the NS records again.  We know, from your description, that
the top level domain nameservers' data was corrected long before 2
days had passed.  However, BIND9 does not assume that the problem you
describe is what triggered its negative caching.  In the general
case, I expect tracing what records to throw out might include some
repeated backtracking, and even then, might require human judgment.
The nameserver software was told by the authoritative server for
".com" that that water.com NS record was good for 2 days, and keeps
using it.

On another note, what you learned here about negative caching was the
behavior of BIND9 as a caching nameserver.  Folks trying to reach you
would be using lots of nameservers around the world, many of which
would not be BIND9.  They'd likely follow the rules of caching NS
records, but we can't say what their negative caching behavior was.

Also, if this wrong nameserver for water.com answered with the
authority flag but with a wrong (positive) answer, then that record's
TTL comes into play.  Furthermore, if it said the name does not
exist, then the SOA's negative caching time comes into play.  Neither
of these cases overrides the TTL on the NS record that delegated
water.com to that nameserver.
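
The three cases discussed in this post can be summarized in a small
Python sketch.  This is only a model of the reasoning, not BIND9
code: the function name and the "kind" labels are invented for
illustration, and the 3600 and 900 second values in the usage lines
are made-up examples.

```python
NS_TTL = 2 * 24 * 60 * 60   # TTL on the delegating NS record: two days

def answer_cache_time(kind, record_ttl=None, soa_negative_ttl=None, lame_ttl=600):
    """Hypothetical helper: how long a caching nameserver holds on to one
    response from the wrong nameserver, depending on the kind of response."""
    if kind == "lame":       # no usable answer: lame-server cache (ten minutes)
        return lame_ttl
    if kind == "positive":   # authoritative but wrong answer: that record's TTL
        return record_ttl
    if kind == "nxdomain":   # "name does not exist": the SOA's negative caching time
        return soa_negative_ttl
    raise ValueError(f"unknown response kind: {kind}")

# Example values (assumed, not taken from the water.com incident).
for kind, ttls in [("lame", {}),
                   ("positive", {"record_ttl": 3600}),
                   ("nxdomain", {"soa_negative_ttl": 900})]:
    held = answer_cache_time(kind, **ttls)
    # Whatever the answer's own cache time is, the delegation stays for NS_TTL.
    print(kind, "answer cached for", held, "s; NS record still good for", NS_TTL, "s")
```

In every branch the cache time of the answer is computed separately
from NS_TTL, which is the point: nothing the wrong nameserver says
changes how long the delegation to it is kept.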

John Wobus