Odd failure - poisoning?

Sun Jan 8 23:32:12 UTC 2006

Hi there.

We just had a most peculiar failure at ETHZ in Switzerland. For about 24 hours, many (VERY many) external addresses were not being resolved; we were seeing SERVFAIL (and some timeouts). It wasn't ALL addresses, but over the course of the period, more and more addresses failed. To most users it felt like nothing was resolving, though we know the failure wasn't actually that complete. Some big names, like Google, never failed to resolve.

The thing is, there were no changes on the nameservers before or during the period, and no changes in our network (with one exception, described below). The whole thing resolved itself, leaving us with no clue as to why it happened or what fixed it. There was no significant load on our nameservers, no significant load on our network, and people outside our network didn't see any problem (so it wasn't a global problem or even just a Swiss problem).

Towards the end of the period, we moved the two nameservers from behind a firewall to in front of the firewall - no change, so we moved them back. The problem resolved itself in the 30 minutes to one hour after that. If it had suddenly come good, I'd suspect some firewall problem, but as it is I can't see that the firewall was involved.

During the problem period, some names would fail to resolve twenty or thirty times in a row, then suddenly resolve. Some never did. Some, as noted, never failed to resolve.
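A rough sketch of how this pattern could be recorded rather than observed by hand, should it happen again. It repeatedly looks up a list of names and sorts them into the three groups described above. The function names (`try_resolve`, `watch`, `classify`) and the default round count and delay are made up for illustration. It goes through the system resolver via `socket.getaddrinfo`, so it cannot tell SERVFAIL from NXDOMAIN or a timeout, and any local cache on the client may mask what the nameserver is doing.

```python
import socket
import time


def try_resolve(name):
    """Return True if the system resolver returns at least one address."""
    try:
        return bool(socket.getaddrinfo(name, None))
    except socket.gaierror:
        # Any resolver failure (SERVFAIL, NXDOMAIN, timeout) lands here;
        # getaddrinfo does not let us tell them apart reliably.
        return False


def watch(names, rounds=30, delay=10.0, resolver=try_resolve, sleep=time.sleep):
    """Look up every name once per round; return {name: [True/False, ...]}."""
    history = {name: [] for name in names}
    for i in range(rounds):
        for name in names:
            history[name].append(resolver(name))
        if i < rounds - 1:
            sleep(delay)
    return history


def classify(results):
    """Sort one name's history into the groups seen during the outage."""
    if not results:
        return "no data"
    if all(results):
        return "never failed"
    if not any(results):
        return "never resolved"
    failed = results.count(False)
    return "intermittent (%d of %d attempts failed)" % (failed, len(results))
```

Something like `watch(["www.example.com"], rounds=30, delay=10.0)` followed by `classify()` on each entry would give a per-name summary; the resolver is passed in as a parameter so a more precise lookup could be substituted.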

Internal addresses resolved correctly at all times.

So we are completely stumped. I thought it might be cache poisoning (things going bad slowly and coming good "on their own" is usually cache-related), but the range of names involved seems too large and too random.

The root hints are present and correct. I checked connectivity (including port 53 connectivity) to all of the root servers; all were reachable. We checked fresh names, and many of them resolved, so it doesn't seem to have been a network problem as such. General network connectivity to the outside world was fine. We have received no reports of people being unable to resolve *us*, and no reports of connectivity problems to us (and we would expect such reports from people like the supercomputer users and the thousands of students using our services from outside). There were no interesting log messages. Clients of every stripe were affected (WinXX, Linux, Mac etc.).
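For reference, the kind of port 53 check described above could be scripted roughly as follows. This is only a sketch under assumptions, not the procedure actually used: it sends a hand-built, non-recursive NS query for the root zone over UDP and reports the response code, and separately tests whether a TCP connection to port 53 can be opened. All function names are hypothetical, the server addresses must be supplied by the caller (for example from the hints file), and the response is not validated beyond reading the RCODE bits.

```python
import socket
import struct

# Response codes defined in RFC 1035 (subset).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name=".", qtype=2, query_id=0x1234, recurse=False):
    """Build a minimal DNS query packet (default QTYPE 2 = NS, class IN)."""
    flags = 0x0100 if recurse else 0x0000  # RD bit only
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    labels = [part for part in name.rstrip(".").split(".") if part]
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def rcode_of(response):
    """Return the RCODE name from the low four bits of the flags word."""
    if len(response) < 12:
        return "SHORT"
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, "RCODE%d" % code)


def probe_udp(server, name=".", timeout=3.0):
    """Send one UDP query to server:53; return an RCODE name or a failure tag."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
        except OSError:
            return "ERROR"
    return rcode_of(data)


def probe_tcp(server, timeout=3.0):
    """Return True if a TCP connection to server:53 can be opened."""
    try:
        with socket.create_connection((server, 53), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `probe_udp("198.41.0.4")` and `probe_tcp("198.41.0.4")` would exercise a.root-servers.net; the address should be checked against the local hints file rather than taken from here. Running the same probes from the nameserver host itself, rather than from a client, is what would make the result comparable to what the resolver sees.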

Does anyone have any ideas as to what might have caused this behaviour? And what diagnostics we could have used during the period to give us some more clues?

Regards, K.

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Karl Auer (kauer at biplane.com.au)                   +61-2-64957160 (w/h)
http://www.biplane.com.au/~kauer/                  +61-428-957160 (mob)