Site Failover

Sat Jul 9 02:19:06 UTC 2005

Greg Zill wrote:

>I am just putting pencil to paper for the first time on planning a
>remote failover site for our co-lo production facility. As I just took
>over authoritative DNS and setup of a master and slave, I am wondering
>what would be the preferred configuration for another site with
>equivalent services to tide us over in the case of tornado or
>significant natural or unnatural disaster.
> 
>At first I thought two more slaves back to master to keep everything up
>to date, but I do not know the impact of repeated slaving errors on
>performance once the current master falls off the face of the earth. 
>
Zone transfer failures you mean? They're bad, but not terribly so. How 
many zones are we talking about here? Hundreds? Thousands? Tens of 
thousands? More? BIND 9 seems to do a fairly good job of controlling 
this type of workload, although I've never had occasion to watch a 
"disconnected slave" behave with more than a thousand or so zones. It 
should be fairly easy to test this scenario, of course: copy your 
production slave-server config to some spare box, let it transfer the 
zones, then yank the network cable and watch it (on the console, 
presumably :-) to see how well or how badly it deals with the cascade of 
zone-transfer failures.

>Do I
>assume the manual task of switching one of the slaves to a temporary
>master in the event of failover? 
>
You're going to have to do something like that anyway if you want to 
change anything in your DNS data during the outage.

I'm not sure why you say "manually", though: you can automate the 
slave-to-temporary-master switchover as much as you wish and your 
scripting/programming skills allow.
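
As a rough illustration of how little the switchover itself involves, 
here is a minimal Python sketch that rewrites slave zone stanzas into 
master stanzas. The function name promote() and the sample zone are 
made up for the example, and it assumes simple, un-nested zone 
statements whose transferred zone files are in a format the server can 
load as a master; a real named.conf may well need a proper parser.

```python
import re

# Matches "type slave;" inside a zone statement.
SLAVE_TYPE = re.compile(r"\btype\s+slave\s*;")
# Matches an inline "masters { ... };" clause. Named top-level lists
# ("masters somename { ... };") are deliberately left alone.
MASTERS = re.compile(r"\bmasters\s*\{[^{}]*\}\s*;\s*")


def promote(conf_text: str) -> str:
    """Return conf_text with every slave zone turned into a master zone.

    Hypothetical helper: drops each inline masters clause, then flips
    "type slave;" to "type master;". Everything else is left untouched.
    """
    without_masters = MASTERS.sub("", conf_text)
    return SLAVE_TYPE.sub("type master;", without_masters)


if __name__ == "__main__":
    sample = (
        'zone "example.com" { type slave; '
        'masters { 192.0.2.1; 192.0.2.2; }; '
        'file "slave/example.com"; };'
    )
    print(promote(sample))
```

After writing the result out, the server would still need to be told to 
re-read its configuration (e.g. with "rndc reconfig"), and "demoting" 
is simply a matter of restoring the original file.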

Note that you can configure all of your slaves to pull zones from 
each other, in addition to pulling from the primary master -- this way, 
you won't have to reconfigure any of them if you decide to "promote" one 
of the slaves to primary master temporarily, or when you "demote" it 
again. It also has the benefit of ensuring that all of your slaves will 
automatically synchronize to the latest-available version of the zone, 
if the primary master stays down for an extended period of time. The 
downside, however, is that it will increase your serial-checking volume, 
which could be a problem if you have a huge number of zones and/or small 
(i.e. rapid) REFRESH settings. Alternatively, you can trade off the 
serial-checking volume against the synchronization time by arranging 
the slaves in a ring (e.g. slave A pulls from the primary master and 
slave B, slave B pulls from the primary master and slave C, and slave C 
pulls from the primary master and slave A) or in any other topology less 
connected than any-to-any, e.g. tree, star, daisy-chain or hybrid.
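
To put rough numbers on that tradeoff, the Python sketch below builds 
the "pulls from" list for each slave under a ring or an any-to-any 
("full") arrangement and estimates the resulting serial-check rate. 
The function names and server labels are made up, and the estimate 
assumes one SOA query per zone, per listed master, per REFRESH 
interval, which is a simplification of what a real server does.

```python
def masters_for(slaves, primary, topology="ring"):
    """Return {slave: [servers it pulls from]} for the given topology.

    "full" means every slave pulls from every other slave; "ring" means
    each slave pulls from the next one in the list, wrapping around.
    The primary master is always listed first.
    """
    if topology not in ("ring", "full"):
        raise ValueError("unknown topology: %r" % topology)
    plan = {}
    for i, slave in enumerate(slaves):
        if topology == "full":
            peers = [p for p in slaves if p != slave]
        else:
            nxt = slaves[(i + 1) % len(slaves)]
            peers = [nxt] if nxt != slave else []
        plan[slave] = [primary] + peers
    return plan


def soa_queries_per_second(plan, zones, refresh_seconds):
    """Rough estimate of the total SOA query rate across all slaves."""
    checks_per_zone = sum(len(masters) for masters in plan.values())
    return checks_per_zone * zones / refresh_seconds


if __name__ == "__main__":
    slaves = ["A", "B", "C", "D"]
    for topology in ("ring", "full"):
        plan = masters_for(slaves, "primary", topology)
        rate = soa_queries_per_second(plan, zones=10000, refresh_seconds=900)
        print(topology, plan, round(rate, 1))
```

With only three slaves the two arrangements differ little (two masters 
per slave versus three); the gap widens as slaves are added, since the 
any-to-any list grows with the number of slaves while the ring list 
stays at two entries.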

                                          - Kevin