dns resiliency prompted by apnic.net problems over last 24 hours
d.thomas at its.uq.edu.au
Wed Jun 25 01:34:20 UTC 2008
I noticed about 10am AEST yesterday that apnic.net was no longer
resolvable, although it appears the problem began about 3 hours earlier,
but I'm not sure that description conveys the extent of the problem,
at least from a dns point of view.
The delegation comprises 4 name-servers (glue from c.gtld-servers.net):
cumin.apnic.net. 172800 IN A 220.127.116.11
ns-sec.ripe.net. 172800 IN A 18.104.22.168
ns-sec.ripe.net. 172800 IN AAAA 2001:610:240:0:53::4
tinnie.apnic.net. 172800 IN A 22.214.171.124
tinnie.arin.net. 172800 IN A 126.96.36.199
so they also use ARIN and RIPE name-servers, which is good, although
all 4 belong to the same registry.
When I noticed the problem yesterday, the 2 APNIC name-servers
seemed unresponsive, probably just unreachability from our
location, but unfortunately both the ARIN and RIPE ones returned SERVFAIL.
Even if the APNIC servers were reachable from some places, I'm not sure
they were healthy, as I soon noticed them either returning NORESPONSE
or SOA, NS and MX records but no additional section for the NS A RR's.
I don't believe the APNIC problems significantly affected reverse
lookups, as those zones are spread across several RIRs (but so were
those for apnic.net itself).
188.8.131.52 returns SOA, NS and MX records; no additional RR's
184.108.40.206 seems OK (additional includes cached ARIN/RIPE RR)
220.127.116.11 seems OK
18.104.22.168 returns SERVFAIL (oops just came good)
Moving on from the apnic.net situation, some general points for
improving authoritative dns resiliency:
* reasonable number of delegated name-servers
* with diverse connectivity
* not all named under the zone itself
all of which APNIC had (but see below for a fourth point)
It's possible the ARIN/RIPE name-server SERVFAILs resulted from the
connectivity problems, but I suspect they were pre-existing. We run a
fairly basic script checking that, for each of our zones, the
authoritative name-servers, stealth masters and our official stealth
secondaries at various sites:
* have the same serial number (currently it's not smart enough for
zones that change a lot)
My main point is that apart from designing for resiliency, it's
also important to properly monitor the situation, not only to detect
failures (hopefully before reports start appearing) but also to detect
problems leading to reduced resiliency.
I don't know whether it's specific to the edu.au registrar or to the .au
registry, but there have been cases where name-servers listed in the
registrar's management interface did not end up as glue records in the
registry. For example, the ns1.adelaide.edu.au glue record for
adelaide.edu.au is again missing, having reappeared a month ago after
being absent for a year. The tiny-teddy.aarnet.edu.au glue for
aarnet.edu.au was missing for several years.
That's one reason I'd recommend not having all authoritative name-
servers under one registry.
It might even be worth considering extending the TTL for authoritative
NS records and their associated A records to partially protect against
glue lossage. This would only help sites which have cached the records,
but those are the ones likely to be most affected. Doing so doesn't help
much unless there's monitoring to detect the problem so it can hopefully
be fixed before the records with longer TTLs start expiring.
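As a sketch of that suggestion (a hypothetical zone with addresses from
the documentation ranges, not APNIC's actual data), raising the TTLs on
the child zone's own NS RRset and in-zone glue from the common 2 days
to a week might look like:

```
; example.net zone file fragment -- TTL raised from 172800 (2 days)
; to 604800 (7 days) on the NS RRset and the in-zone glue
example.net.      604800  IN  NS    ns1.example.net.
example.net.      604800  IN  NS    ns-sec.ripe.net.
ns1.example.net.  604800  IN  A     192.0.2.53
ns1.example.net.  604800  IN  AAAA  2001:db8::53
```

Note the parent's copy of the glue has its TTL set by the registry, so
this mainly helps resolvers that have refreshed from the child's own
authoritative data.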