failover A records

Mon Jul 31 23:00:16 UTC 2000

Here's my mini-FAQ-in-progress about failover and DNS:

To do failover with DNS, there's no really *good* way until all clients know
about "SRV" records. These relatively-new resource records add a layer of
indirection to the resource-location process, and have "preference" values
associated with them, so clients using them know exactly in what order servers
should be tried. Unfortunately, SRV-aware clients for the more common
protocols like HTTP and FTP are probably years away. (SMTP doesn't really need
SRV records, since MX records already have preference values.)
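
To make the "preference" idea concrete, here is a minimal Python sketch of how
an SRV-aware client might order the servers it was handed. The field names
follow the SRV record layout (priority, weight, port, target); the host names
and the SrvRecord/srv_try_order helpers are made up for illustration, and ties
are broken by plain descending weight rather than the weighted random
selection a real client would use.

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int  # lower value means "try this one first"
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def srv_try_order(records: List[SrvRecord]) -> List[str]:
    """Return "host:port" strings in the order a client would try them.

    Simplification (an assumption of this sketch): equal-priority records
    are ordered by descending weight instead of weighted random choice.
    """
    ordered = sorted(records, key=lambda r: (r.priority, -r.weight))
    return [f"{r.target}:{r.port}" for r in ordered]


if __name__ == "__main__":
    # Hypothetical records: the main box has the lower (preferred) priority.
    records = [
        SrvRecord(20, 0, 80, "backup.example.com"),
        SrvRecord(10, 0, 80, "main.example.com"),
    ]
    print(srv_try_order(records))
```

The point is that the ordering lives in the DNS data itself, so the client
never has to guess which box is the backup.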

The simple-minded DNS-based approach to failover is to just have a single
A record which is changed when the main box goes down (possibly using some
sort of automated script, possibly making the change via Dynamic Update). The
main problem with the simple-minded approach is caching: for some interval of
time (up to the record's TTL) after the record has been changed on the master
and all slaves, anyone getting a stale record cached in some intermediate
nameserver will still go to the "dead" server. Unlike NOTIFY in the
master/slave context, there's
currently no way to tell arbitrary sets of caching nameservers on the 'Net
that a particular A record they cached has changed, and that they should come
and fetch the new value.
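
As a rough illustration of the "automated script" idea, the Python sketch
below probes the main box over TCP and, if it is down, builds the text of a
Dynamic Update that swaps the A record to the backup. Everything named here is
an assumption for the example: the host name www.example.com, the 300-second
TTL, and the helper functions. The update/send lines use the command syntax of
the nsupdate tool shipped with later BIND releases, so check them against the
version you actually run before feeding them to it.

```python
import socket


def is_reachable(addr: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to addr:port succeeds within timeout."""
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_update_script(name: str, new_addr: str, ttl: int = 300) -> str:
    """Build nsupdate-style input that replaces all A records for `name`."""
    lines = [
        f"update delete {name} A",                # drop the existing A RRset
        f"update add {name} {ttl} A {new_addr}",  # publish the new address
        "send",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    main_box, backup_box = "192.1.1.1", "192.1.1.2"
    if not is_reachable(main_box):
        # In a real deployment this text would be piped to the update tool.
        print(build_update_script("www.example.com.", backup_box))
```

Note that this only changes what the authoritative servers hand out; it does
nothing about copies already sitting in caches.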

A refinement of the simple-minded approach is to have the name of the site
resolve to *both* addresses, and arrange for them to always be given out in
the appropriate order, i.e. main box first, backup second. This can be
achieved by specifying a "fixed" rrset-order on the master and all slaves for
the zone (you need at least BIND 8.2 for this, and for security reasons that
means you'd be wanting to run BIND 8.2.2 patchlevel 5). When the main box
fails, remove its A record from the RRset. The advantage of this "refined"
approach is that, even if they get a "stale" RRset in the interim, some
clients are smart enough to fail over automatically when the first address
in an address list is unreachable, so for those clients this means more
availability (after a short failover delay) for your site. Then again, some
clients are *not* smart enough to do this failover, so it's a partial solution
at best. Moreover, caching complicates things here as well: for multi-valued A
records, intermediate caching servers will tend to randomize/round-robin the
order of the answers they give out. This re-ordering effect actually *helps*
you in failure mode -- with two addresses, it means that approximately 50% of
the clients getting a stale RRset will be handed the live address first and
so can still connect without any failover delay -- but
the flip side is that under normal circumstances, when the main box is up, it
means that there will be a certain amount of "leakage" to the backup server.
For web servers, one straightforward way of dealing with this "leakage" is to
configure a web redirect on the backup webserver, but that will add latency to
client accesses, and you'll have to make arrangements to somehow automatically
turn this redirect off when the primary box goes down (this can be facilitated
somewhat by having the master DNS server run on the same box as the backup
server -- at least then all of the changes can be made on a single box).
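
For what it's worth, the "smart" client behavior described above amounts to
something like the following Python sketch: walk the address list in the order
it was received and use the first address that accepts a connection. The
function names, the 3-second timeout, and the pluggable `connect` argument
(there only so the logic can be exercised without a network) are assumptions
of this sketch, not a description of any particular browser or resolver.

```python
import socket
from typing import Any, Callable, Iterable, Optional, Tuple


def default_connect(addr: str, port: int, timeout: float) -> Any:
    """Open a TCP connection; raises OSError on refusal or timeout."""
    return socket.create_connection((addr, port), timeout=timeout)


def connect_first_reachable(
    addresses: Iterable[str],
    port: int,
    timeout: float = 3.0,
    connect: Callable[[str, int, float], Any] = default_connect,
) -> Optional[Tuple[str, Any]]:
    """Try each address in order; return (address, connection) or None."""
    for addr in addresses:
        try:
            return addr, connect(addr, port, timeout)
        except OSError:
            # This address is dead or unreachable: fall through to the next.
            continue
    return None


if __name__ == "__main__":
    # Address order as handed out with a "fixed" rrset-order: main box first.
    result = connect_first_reachable(["192.1.1.1", "192.1.1.2"], 80)
    print("connected to", result[0] if result else "nothing")
```

The timeout is the "short failover delay": a client that gets the dead address
first pays it once before moving on, and a client without this loop simply
fails.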

Note that with either approach, you can mitigate -- but not completely
eliminate -- the effects of intermediate caching servers by making the address
records volatile (by reducing their TTL values). But this will greatly
increase the traffic to your nameservers for that name, not to mention the
extra work you'll cause for all other nameservers on the 'Net to constantly
re-query the name. Overall, it's a Bad Thing, but many folks resort to it
nonetheless.
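
A back-of-the-envelope Python sketch of that tradeoff: the TTL bounds how long
a cache can keep handing out the old address, while the number of re-queries
grows roughly as 1/TTL. The figures (1000 caching servers, TTLs of one day and
one minute) are invented for illustration, and the query count is a crude
upper bound that assumes every cache re-fetches the record the moment it
expires.

```python
SECONDS_PER_DAY = 86400


def max_stale_seconds(ttl: int) -> int:
    """Longest time a cache may keep serving the old record after a change."""
    return ttl


def max_queries_per_day(ttl: int, caches: int) -> float:
    """Upper bound on daily queries if every cache re-fetches at each expiry."""
    return caches * (SECONDS_PER_DAY / ttl)


if __name__ == "__main__":
    for ttl in (86400, 60):
        print(
            f"TTL {ttl:>5}s: stale for up to {max_stale_seconds(ttl)}s, "
            f"at most {max_queries_per_day(ttl, 1000):.0f} queries/day"
        )
```

Going from a one-day TTL to a one-minute TTL cuts the worst-case stale window
by a factor of 1440 and raises the worst-case query load by the same factor.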

There are non-DNS solutions to this problem, of course. The obvious one is to
just have lots of redundant and/or diverse network paths so that access is
"never" (never say never) lost to the main server, and of course the server
itself should be clustered or high-availability so that it "never" goes down,
but this "brute force" approach can get expensive. And then there are
specialized hardware/software failover solutions, which also tend to be
expensive. Some of these make *both* servers look like the same IP address,
and the failover happens transparently. Some of them also integrate dynamic
load-balancing between servers, which is probably something you want anyway if
you outgrow your current server capacity.

                                                                - Kevin

John Banas wrote:

> Is there a way I can fail over an A record, so that if one webserver goes
> down, traffic will automatically go to the other webserver?
>
> I set up a "round robin" (load sharing), which works great, but if one
> server goes down it will not "forward" to the other IP. Can I even have an A
> record fail over?
>
> Current Configuration:
>                 IN      A       192.1.1.1
>                 IN      A       192.1.1.2
>
> Thanks for the help...