Round-robin for high availability?

Kevin Darcy kcd at daimlerchrysler.com
Mon Jul 17 22:52:11 UTC 2006


Dave Henderson wrote:
> Kevin Darcy <kcd at daimlerchrysler.com> wrote:
>
>     cdevidal wrote:
>     > ==== My real address is Chris (AT) deVidal (DOT) tv ====
>     >
>     > I've been experimenting with multiple A records for both
>     > load-distributing AND high availability.
>     >
>     > Up until this point I was always told that round-robin is for
>     > load-distributing ONLY and should not be used for high availability
>     > failover. But in practice this is not proving to be true. I'm
>     > beginning to think that was just FUD.
>     >
>     > Do a lookup on roundrobintest8.strangled.net and
>     > roundrobintest9.strangled.net. Notice the A records:
>     > roundrobintest8.strangled.net. 3600 IN A 127.0.0.1
>     > roundrobintest8.strangled.net. 3600 IN A 63.95.68.129 # Real server
>     >
>     > roundrobintest9.strangled.net. 3600 IN A 10.69.96.69 # Bogus IP
>     > roundrobintest9.strangled.net. 3600 IN A 63.95.68.129 # Real server
>     >
>     > Now, disable anything running on localhost:443 and make sure you do
>     > *not* have a host at 10.69.96.69.
>     >
>     > Browse https://roundrobintest8.strangled.net/ and
>     > https://roundrobintest9.strangled.net/ You should never get a DNS
>     > error. It should always give you first an SSL warning (hostname
>     > mismatch) and login prompt. Oh it'll pause while it tries the bad IP
>     > but after about 5 seconds it flips to the real server.
>     >
>     > Now load up an SSL web server on localhost. I used Apache+mod_ssl on
>     > Linux and TinySSL on Windows. Set up an index page with links to
>     > several other pages.
>     >
>     > (Sorry to require SSL, it was the only web server I have control
>     > over that no one is using at the moment, so I can kill the web
>     > service any time I want... You could also load up an FTP or SSH
>     > server on localhost instead of SSL. My server has all three.)
>     >
>     > Flush your cache (e.g. ipconfig /flushdns) and reload the website.
>     > Sometimes you will get localhost, sometimes my server. That's the
>     > load-distributing action we all know and love.
>     >
>     > If you don't get localhost, keep flushing your cache until you
>     > get it. Then kill your server and click on a link in the web page
>     > that is still up on your screen. It will fail back to my server
>     > and generate a 404. That's high availability! Even though it
>     > generates an error, it's coming from my server nonetheless!
>     >
>     > -No- client I've tried (browser, FTP client, MySQL, SSH etc.)
>     > fails on the bad IP (10.69.96.69). It thinks for a few seconds
>     > and then tries the good IP.
>     >
>     > Nor does it fail when the IP is good, as in the case of
>     > localhost, but no service is listening on that port.
>     >
>     >
>     > I've tried this on:
>     > Windows 95
>     > Windows 98
>     > Windows 2000
>     > Windows XP
>     > Ubuntu 6.06
>     > Debian 3.1
>     > CentOS 3
>     > CentOS 4
>     >
>     > With these clients:
>     > Netscape 4.5 (Nice and old!!!)
>     > IE 5.5
>     > IE 6
>     > Firefox 1.0
>     > Firefox 1.5
>     > DOS FTP
>     > Linux FTP
>     > Linux NcFTP
>     > MySQL client
>     > OpenSSH client
>     >
>     >
>     > My idea is to set up a live server running web/mail/DNS/DB/FTP and a
>     > warm standby, such as:
>     > www.example.com. 3600 IN A 1.1.1.1
>     > www.example.com. 3600 IN A 2.2.2.2
>     >
>     > The warm standby is powered on but no services are started. Live
>     > is synchronized to warm standby. If the live fails I bring up the
>     > standby. Bing bang boom, the client automatically goes to the
>     > standby.
>     >
>     > It'll be just web/POP/SSH/FTP because DNS and SMTP already have
>     > built-in load-distributing and high availability capabilities. No
>     > database ports will be exposed to the outside world but if I do they
>     > should work.
>     >
>     >
>     > If this works, so cool! Replacement for expen$ive and complicated HA
>     > solutions :-)
>     >
>     > Was clued into this by Mr. Tenereillo:
>     > http://www.tenereillo.com/GSLBPageOfShame.htm
>     >
>     >
>     > What am I missing? Do I need to do more testing?
>     >
>     > Am I crazy? Or crazy like a fox? ;-)
>     >
>     > Someone check me on this because I'm not sure I'm testing it
>     > right...
>     >
>     >
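[Editor's note: the client-side fallback described in the quoted test is,
in essence, a loop over the addresses the resolver returns, with a connect
timeout per address. A minimal Python sketch of that behavior follows; the
function name is invented for illustration, and real browsers and FTP
clients each implement their own variant of this.]

```python
import socket

def connect_first_working(host, port, per_addr_timeout=5.0):
    """Try each resolved address in turn and return a socket to the
    first one that accepts the connection. A dead or filtered IP costs
    at most per_addr_timeout seconds before we move on -- the roughly
    5-second pause observed in the quoted test."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(per_addr_timeout)
        try:
            sock.connect(addr)       # bad IP: times out or is refused
            sock.settimeout(None)    # restore blocking mode for caller
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err if last_err else OSError("no addresses for " + host)
```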
>     A 5-second delay on half of the accesses is not acceptable to
>     most folks in the market for a "high-availability" solution.
>
>
>     - Kevin
>
>
>
> Wouldn't high-availability mean that you get what you're looking for 99% 
> of the time versus getting it less (5 seconds of time loss or not)?  
> In a situation like the OP has described, the user would get their 
> answer 100% of the time.  Shouldn't that count for high-availability?
Dave,
          That's why I phrased things the way I did. I could have just 
said "5-second delay on 50% of the accesses != high-availability". But I 
didn't say that. My point is that the delay, while it may not 
technically negate "high-availability", wouldn't be acceptable to most 
folks *in*the*market* for high-availability. This market segment 
consists of large customers (generally) who care a lot about serving 
their web content to various customer audiences and are willing to pay 
significant sums of money to make sure that the "web experience" of 
those customers is as pleasant and productive as possible.

It's certainly possible to deliver a "high-availability" solution "on 
the cheap" which features 5-second delays. But I don't think most 
potential customers who start shopping for high-availability would 
accept it. Hence the market for dedicated load-balancing devices, or 
services like Akamai.

Once one steps out of the web-centric world, of course, then things get 
even murkier. For one thing, apps running on other protocols may take 
much *longer* than 5 seconds to fail over to the second address, in 
which case the user waiting for their transaction to finish assumes that 
it's stuck and cancels it. Or, in a multi-layered application design, 
the lookup subroutine of a given app may be operating with different 
thresholds than other levels of the app, with the result that the upper 
level may time out before the lookup subroutine even tries the second 
address in the list. So, in a *multi-protocol* high-availability 
environment it's often necessary to invest in hardware load-balancing 
anyway, and if you're doing that, why not also use that subsystem for 
web access?
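
[Editor's note: one way an application can keep total failover delay
predictable, whatever the upper layers expect, is to spread a single
overall deadline across all the addresses the resolver returns. A hedged
sketch in Python; the function name and the 3-second deadline are
illustrative assumptions, not anything from the thread.]

```python
import socket

def connect_within_deadline(host, port, total_deadline=3.0):
    """Divide one overall connect deadline evenly across every
    resolved address, so the caller never waits longer than
    total_deadline for failover -- avoiding the mismatch where an
    upper layer gives up before the second address is even tried."""
    addrs = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    per_addr_budget = total_deadline / len(addrs)
    last_err = None
    for family, socktype, proto, _, addr in addrs:
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(per_addr_budget)
        try:
            sock.connect(addr)
            sock.settimeout(None)    # hand back a blocking socket
            return sock
        except OSError as err:
            last_err = err
            sock.close()
    raise last_err if last_err else OSError("all addresses failed")
```

[In practice many clients instead use one fixed per-address timeout with
no overall budget, which is exactly how the layered-timeout mismatch
described above arises.]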

                                                                         
                              - Kevin



More information about the bind-users mailing list