dns resiliency prompted by apnic.net problems over last 24 hours
d.thomas at its.uq.edu.au
Wed Jun 25 01:34:20 UTC 2008
I noticed about 10am AEST yesterday that apnic.net was no longer
resolvable, although it appears the problem began about 3 hours earlier,
but I'm not sure that description conveys the extent of the problem,
at least from a dns point of view.
The delegation comprises 4 name-servers (glue from c.gtld-servers.net):
cumin.apnic.net. 172800 IN A 220.127.116.11
ns-sec.ripe.net. 172800 IN A 18.104.22.168
ns-sec.ripe.net. 172800 IN AAAA 2001:610:240:0:53::4
tinnie.apnic.net. 172800 IN A 22.214.171.124
tinnie.arin.net. 172800 IN A 126.96.36.199
so they also use ARIN and RIPE name-servers, which is good, although
all 4 belong to the same registry.
When I noticed the problem yesterday, the 2 APNIC name-servers
seemed unresponsive, probably just unreachability from our
location, but unfortunately both the ARIN and RIPE ones returned SERVFAIL.
Even if the APNIC servers were reachable from some places, I'm not sure
they were healthy, as I soon noticed them either returning NORESPONSE
or SOA, NS and MX records but no additional section for the NS A RR's.
I don't believe the APNIC problems significantly affected reverse
lookups, as those zones are spread across several RIRs (but so were
those for apnic.net itself).
188.8.131.52 returns SOA, NS and MX records; no additional RR's
184.108.40.206 seems OK (additional includes cached ARIN/RIPE RR)
220.127.116.11 seems OK
18.104.22.168 returns SERVFAIL (oops just came good)
Moving on from the apnic.net situation, some general points for
improving authoritative dns resiliency:
* reasonable number of delegated name-servers
* with diverse connectivity
* not all named under the zone itself
all of which APNIC had (but see below for a fourth point)
It's possible the ARIN/RIPE name-server SERVFAILs resulted from the
connectivity problems, but I suspect they were pre-existing. We run a
fairly basic script checking that, for each of our zones, the
authoritative name-servers, stealth masters and our official stealth
secondaries at various sites:
* have the same serial number (currently it's not smart enough for
zones that change a lot)
My main point is that apart from designing for resiliency, it's
also important to properly monitor the situation, not only to detect
failures (hopefully before reports start appearing) but also to detect
problems leading to reduced resiliency.
I don't know whether it's specific to the edu.au registrar or to the .au
registry, but there have been cases where name-servers listed in the
registrar's management interface did not end up as glue records in the
registry. For example, the ns1.adelaide.edu.au glue record for
adelaide.edu.au is again missing, having reappeared a month ago after
being absent for a year. The tiny-teddy.aarnet.edu.au glue for
aarnet.edu.au was missing for several years.
That's one reason I'd recommend not having all authoritative name-
servers under one registry.
It might even be worth considering extending the TTL for authoritative
NS records and their associated A records to partially protect against
glue lossage. This would only help sites which have cached the records,
but those are the ones likely to be most affected. Doing so doesn't help
much unless there's monitoring to detect the problem so it can hopefully
be fixed before the records with longer TTLs start expiring.
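As a sketch of that suggestion (a hypothetical zone with addresses from
the documentation ranges, not APNIC's actual data), raising the TTLs on
the child zone's own NS RRset and in-zone glue from the common 2 days
to a week might look like:

```
; example.net zone file fragment -- TTL raised from 172800 (2 days)
; to 604800 (7 days) on the NS RRset and the in-zone glue
example.net.      604800  IN  NS    ns1.example.net.
example.net.      604800  IN  NS    ns-sec.ripe.net.
ns1.example.net.  604800  IN  A     192.0.2.53
ns1.example.net.  604800  IN  AAAA  2001:db8::53
```

Note the parent's copy of the glue has its TTL set by the registry, so
this mainly helps resolvers that have refreshed from the child's own
authoritative data.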