Managing an Internet outage
dawn.connelly at gmail.com
Sun May 11 20:46:52 UTC 2008
Best practice is to always put your authoritative DNS servers on physically
different networks, so your boss is right in thinking this needs to happen.
A couple of things to consider.

If your master DNS server is down, you'll need to reconfigure the offsite
machine as the primary so you can change the DNS resolution. Not a big deal,
but make sure to include that step in your DR plan.

You have control over your TTLs. You can give the records in your outage zone
a short TTL, such as 10 minutes (or whatever your SLAs dictate), so you
recover faster once the outage is over without carrying the extra query load
of short TTLs all the time.

Mail will queue on the servers that are trying to send it for a while if they
can't contact your mail server, so that buys you some time too. You might
want to leave your MX record pointing at the correct mail server even in your
failure state, so that mail queues on the remote end and is delivered as soon
as the network is back up. From a best-practice standpoint, though, it would
be better to have a backup mail server at your DR site, listed as a second MX
with a higher preference value so it is only used when the primary can't be
reached.

Also, some ISPs' resolvers tend to latch onto just one of your authoritative
DNS servers and keep querying it over and over even when it's down. The only
thing you can do about that is ask the ISP to clear their cache. Road Runner
has burned me with that multiple times.
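To put rough numbers on the TTL trade-off, here is a small sketch. It assumes resolvers honor TTLs (which, as noted above, not all of them do) and uses the 1-hour normal TTL from the question and a 10-minute outage TTL; the one-minute caching offsets are made-up illustrative values.

```python
# Illustrative TTL arithmetic; times are in seconds.
SECONDS_PER_DAY = 86_400


def visible_at(cached_at: int, ttl: int, changed_at: int) -> int:
    """Earliest time a TTL-honoring resolver can return a changed record.

    A resolver that cached the old answer at `cached_at` keeps serving it until
    cached_at + ttl; if that has already passed, its next lookup sees the change.
    """
    return max(changed_at, cached_at + ttl)


def max_refetches_per_day(ttl: int) -> float:
    """Upper bound on re-fetches of one record by one constantly busy resolver."""
    return SECONDS_PER_DAY / ttl


NORMAL_TTL = 3600  # 1 hour, the TTL from the original question
OUTAGE_TTL = 600   # 10 minutes, used only in the outage zone
FAILOVER_AT, FAILBACK_AT = 0, 7200  # hypothetical change times

# Each resolver below cached the previous record one minute before the change.
failover_delay = visible_at(FAILOVER_AT - 60, NORMAL_TTL, FAILOVER_AT) - FAILOVER_AT
failback_delay = visible_at(FAILBACK_AT - 60, OUTAGE_TTL, FAILBACK_AT) - FAILBACK_AT

print(failover_delay, failback_delay)  # 3540 540
print(max_refetches_per_day(NORMAL_TTL), max_refetches_per_day(OUTAGE_TTL))  # 24.0 144.0
```

The asymmetry is the point: the short outage TTL makes failback quick, but failover is still bounded by the normal TTL resolvers already cached, and the 6x query rate of a 10-minute TTL is only paid while the outage zone is live.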
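So your DR plan would look something like this:

1. A network outage is detected.
2. A stand-by named.conf file is swapped in on the offsite machine; it
   references the outage zone files and configures the machine as master,
   and named is restarted. The outage zone files include the following
   records (an MX record has to name a host, not an IP address, and that
   host needs its own A record so the wildcard doesn't catch it):

     @                           600  IN A   <IP address of "We are broken" webserver>
     @                           3600 IN MX  10 <hostname of email server>
     <hostname of email server>  3600 IN A   <IP address of email server>
     *                           600  IN A   <IP address of "We are broken" webserver>

3. Once the failure has been cleared, the stand-by named.conf is swapped back
   for the original file and named is restarted.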
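The swap-and-restart steps above could be scripted roughly like this. It is a minimal sketch, not a tested tool: the file names (named.conf.outage, named.conf.normal), the relative mail label, and `restart_cmd` (whatever restarts named on your system) are all assumptions. SOA and NS records are left out of the rendered zone, and the MX record names a host rather than an IP address, as MX records require.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def outage_zone_records(broken_web_ip: str, mail_label: str, mail_ip: str) -> str:
    """Render the outage-only records from the plan (SOA and NS records omitted).

    mail_label is a hypothetical relative name inside the zone, such as "mail".
    """
    records = [
        f"@\t600\tIN\tA\t{broken_web_ip}",
        f"@\t3600\tIN\tMX\t10 {mail_label}",
        # Without this explicit record, the wildcard below would answer for the
        # mail host too and send mail to the "We are broken" web server.
        f"{mail_label}\t3600\tIN\tA\t{mail_ip}",
        f"*\t600\tIN\tA\t{broken_web_ip}",
    ]
    return "\n".join(records) + "\n"


def install_conf(source: Path, active: Path, restart_cmd: list[str]) -> None:
    """Copy `source` over the active named.conf, then restart named."""
    shutil.copy2(source, active)
    subprocess.run(restart_cmd, check=True)


# Hypothetical file names; adjust to your own layout.
ACTIVE, STANDBY, SAVED = "named.conf", "named.conf.outage", "named.conf.normal"


def fail_over(conf_dir: Path, restart_cmd: list[str]) -> None:
    """Save the normal config, then switch named to the stand-by (outage) config."""
    saved = conf_dir / SAVED
    if saved.exists():
        # Refuse to overwrite the saved normal config with the outage one.
        raise RuntimeError(f"{saved} exists; already failed over?")
    shutil.copy2(conf_dir / ACTIVE, saved)
    install_conf(conf_dir / STANDBY, conf_dir / ACTIVE, restart_cmd)


def fail_back(conf_dir: Path, restart_cmd: list[str]) -> None:
    """Restore the saved normal config once the outage is over."""
    install_conf(conf_dir / SAVED, conf_dir / ACTIVE, restart_cmd)
    (conf_dir / SAVED).unlink()
```

Before restarting, BIND's `named-checkconf` and `named-checkzone` utilities can be run against the swapped files to catch syntax errors while you still have a working config.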
You can script this to happen automatically with some Perl scripts if you
have a monitoring system in place, or you can do it manually. You can also
look at products that do all of this for you automagically. F5's BIG-IP
Global Traffic Manager (GTM) is the one I'm most familiar with, but I'm sure
others on this list could give more examples. The GTM box continually tests
access to your resources, and as soon as they become unavailable it hands
out whatever you have configured as your fallback IP address.
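A home-grown version of that continual testing might look something like the sketch below. It assumes it runs on the offsite DNS machine and probes hosts at the main site over TCP; the target hosts and ports, the thresholds, and the `on_fail_over`/`on_fail_back` callbacks (which would perform the config swap) are all hypothetical. It only flips state after several consecutive probe rounds agree, so one dropped packet doesn't trigger a DNS change.

```python
from __future__ import annotations

import socket
import time
from dataclasses import dataclass
from typing import Callable


def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


@dataclass
class OutageDetector:
    """Change state only after several consecutive probe rounds disagree with it."""

    fail_threshold: int = 3     # failed rounds before failing over (assumed value)
    recover_threshold: int = 3  # good rounds before failing back (assumed value)
    in_outage: bool = False
    streak: int = 0

    def update(self, site_up: bool) -> str | None:
        """Feed one probe round; return "fail_over", "fail_back", or None."""
        # Normally a failed round contradicts the state; during an outage a good one does.
        contradicts = site_up == self.in_outage
        self.streak = self.streak + 1 if contradicts else 0
        limit = self.recover_threshold if self.in_outage else self.fail_threshold
        if self.streak < limit:
            return None
        self.in_outage = not self.in_outage
        self.streak = 0
        return "fail_over" if self.in_outage else "fail_back"


def monitor(targets: list[tuple[str, int]], detector: OutageDetector,
            on_fail_over: Callable[[], None], on_fail_back: Callable[[], None],
            interval: float = 60.0) -> None:
    """Probe the main site forever and run the matching action on each state change."""
    while True:
        # The site counts as up if any one of the targets answers.
        site_up = any(reachable(host, port) for host, port in targets)
        action = detector.update(site_up)
        if action == "fail_over":
            on_fail_over()
        elif action == "fail_back":
            on_fail_back()
        time.sleep(interval)
```

With a 60-second interval and a threshold of 3, failover starts about three minutes after the site drops; that delay is the price of not flapping your DNS on a brief glitch.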
On Sun, May 11, 2008 at 12:31 PM, Mike Diggins <diggins at mcmaster.ca> wrote:
> We occasionally have a situation where our Internet access is completely
> down. My Manager has asked about the viability of locating a DNS server
> off site, and during a situation when we're down, modifying it so that it
> resolves my entire domain to a single IP address. Web users would be
> redirected to that address, and a web page would explain we're off line.
> Our DNS TTL is set to 1 hour, however, I'm concerned that sites might
> cache that address for longer than the TTL, and affect things such as mail
> delivery beyond the outage. Does anyone have an opinion on this plan?
> Obviously improving our redundancy is a better solution, and that will
> come in time. Right now this seems like a quick and easy (dirty) solution.
More information about the bind-users