What happens when one out of three NSs are down?

Tue Jun 11 23:30:53 UTC 2013

On Jun 11, 2013, at 4:12 PM, Gary Wallis <wgg1970 at gmail.com> wrote:
> DNS experts:
> 
> What really happens in the real world when 1 out of three authoritative NSs are down for 30 minutes due to a datacenter outage?

Properly functioning resolvers (the recursive nameservers doing the lookups) will notice that queries sent to the NS which is down aren't getting replies, and will direct the vast majority of future requests for the zone to the other NS boxes which are up and responsive.  They will occasionally retry the NS which is down in case it has come back up.
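
To make that behaviour concrete, here is a toy Python model of the selection logic.  It is only a sketch under assumptions of my own: the ServerSelector class, the ns1/ns2/ns3 names, the round-trip times and the decay/penalty constants are all made up for illustration, and real resolvers differ in the details.

```python
class ServerSelector:
    """Toy model of how a resolver might rank a zone's nameservers.

    Hypothetical illustration only; not the algorithm of any real resolver.
    """

    def __init__(self, servers, initial_srtt=0.001, timeout_cost=1.0,
                 decay=0.98, weight=0.3):
        # A low initial score means every server gets tried at least once.
        self.srtt = {server: initial_srtt for server in servers}
        self.timeout_cost = timeout_cost  # score assigned after a timeout
        self.decay = decay                # slowly forgives unused servers
        self.weight = weight              # weight of the newest RTT sample

    def pick(self):
        """Return the server with the lowest smoothed RTT."""
        best = min(self.srtt, key=self.srtt.get)
        # Decay the scores of the servers we did not pick, so that a
        # penalised server is eventually retried.
        for server in self.srtt:
            if server != best:
                self.srtt[server] *= self.decay
        return best

    def report(self, server, rtt):
        """Record a reply time in seconds, or None for a timeout."""
        if rtt is None:
            self.srtt[server] = self.timeout_cost
        else:
            old = self.srtt[server]
            self.srtt[server] = (1 - self.weight) * old + self.weight * rtt


def simulate(queries=1000):
    """Send queries to three servers, one of which never answers."""
    rtts = {"ns1": 0.020, "ns2": 0.030, "ns3": None}  # ns3 is down
    selector = ServerSelector(rtts)
    counts = dict.fromkeys(rtts, 0)
    for _ in range(queries):
        server = selector.pick()
        counts[server] += 1
        selector.report(server, rtts[server])
    return counts


if __name__ == "__main__":
    print(simulate())
```

With these constants ns3 should see only a handful of the queries: one when it is first tried, then an occasional retry each time its penalty has decayed far enough.  A client whose resolver behaves this way pays the timeout only on those few queries.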

[ ... ]
> I think I have a grasp on the basic theory here, but in practice, the unreachable ns3 nameserver creates problems for a small group of customers trying to reach web sites with zones hosted by these three authoritative NSs.

OK.  I suspect you could correlate data about such unfortunate folks and perhaps identify why they are having issues.  Getting packet captures via tcpdump/wireshark/ethereal/etc. would be revealing.

> Will round robin glue NS records help?

No, not really.

> Can quick or automated changes at the registrar of the NS3 IP help? For example to change to a hot spare in some other datacenter? In this case would the running NSs have to have the changed NS A record also match?

Lowering the TTLs and SOA timer values, so that resolvers cache the records for less time, is more likely to be helpful.  However, trying to change the A record for an NS for only a 30-minute interval strikes me as more churn than it would be worth.

> Any comments and best practice solution info very welcome.

Folks with significant high-availability requirements are likely to put a hardware load balancer in front of their nameservers, running a VIP which receives DNS requests and balances them across a pool of reals (aka the boxes running nameservers), with liveness checks so the LB will transparently route around a nameserver which is down.
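
As a rough sketch of what such liveness checks amount to, here is a small Python model.  The Pool class, the round-robin policy, the example.com probe name and the 192.0.2.x addresses are assumptions for illustration only; a real load balancer does this in its own configuration, not in a script like this.

```python
import itertools
import socket
import struct


def build_query(qname, qid=0x1234):
    """Build a minimal DNS query for the SOA record of qname (RFC 1035)."""
    # Header: ID, flags (plain query, no recursion), QDCOUNT=1, rest zero.
    header = struct.pack("!HHHHHH", qid, 0x0000, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", 6, 1)  # SOA, class IN
    return header + question


def udp_probe(address, qname="example.com", timeout=1.0):
    """Return True if the server at address answers a DNS query in time."""
    qid = 0x1234
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(qname, qid), (address, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack("!HH", reply[:4])
    return reply_id == qid and bool(flags & 0x8000)  # QR bit marks a response


class Pool:
    """Hypothetical pool of reals behind a VIP, with liveness checks."""

    def __init__(self, backends, probe=udp_probe):
        self.backends = list(backends)
        self.probe = probe
        self.live = set(self.backends)
        self._counter = itertools.count()

    def run_checks(self):
        """Probe every backend and remember which ones answered."""
        self.live = {b for b in self.backends if self.probe(b)}

    def pick(self):
        """Round-robin over the backends that passed the last check."""
        candidates = [b for b in self.backends if b in self.live]
        if not candidates:
            raise RuntimeError("no live backends")
        return candidates[next(self._counter) % len(candidates)]
```

The idea would be to call run_checks() every few seconds and pick() for each incoming request, so a real that stops answering drops out of rotation after the next check and rejoins once it answers again.  The probe is passed in as a parameter so the pool logic can be exercised without touching the network.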

Regards,
-- 
-Chuck