Failover + PXE

Sat Nov 24 13:14:36 UTC 2007

>Subject: RE: Failover + PXE
>Date: Fri, 23 Nov 2007 11:01:13 -0500
>From: "Todd Snyder" <tsnyder at rim.com>
>To: <dhcp-users at isc.org>
>
>Thanks for the information!  Much appreciated, esp the tip on TFTP, I
>hadn't thought of mirroring that source yet.
>
>A quick question regarding the failover methodology.  From what I've
>read on the subject, it would appear that this set is more about load
>balancing than HA (in my understanding).  Since it splits the pool in 2
>and each server hands out a limited range, this seems to be less about
>HA.  Does each server keep track of what the other server has handed
>out, so when the other fails it keeps things organized?  Is it possible
>to run one as a 'cold' standby, that keeps a copy of the leases but
>doesn't actively respond to anything unless the other one is down for XX
>seconds?
>
>Just curious - we're setting this up in an HA environment, and I would
>like to understand the behaviour a little more before I write the
>document on how it behaves and what our options are.
>
>(wow, that appears to be a lot less quick than I'd hoped for, but the
>answer should be easy, I hope)
>
>Cheers,
>
>Todd. 

The two servers divide each pool in half (the exact proportion depends
on the split value) and both offer addresses from their portion of the
total pool to clients. If one server stops, then the other continues to
offer addresses from its share of the pool.
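
As a rough illustration of how a split value can decide which server
answers a given client, here is a minimal sketch in Python. It follows
the description in the dhcpd.conf man page: each client hashes to a
bucket from 0 to 255, and the first `split` buckets belong to the
primary. The hash below is only a stand-in I picked for the example
(dhcpd uses its own hash), and all the function names are made up.

```python
import hashlib

PRIMARY, SECONDARY = "primary", "secondary"


def bucket_for(client_id: bytes) -> int:
    """Map a client identifier to a bucket in 0..255.

    Stand-in hash for illustration only; this is NOT the hash dhcpd uses.
    """
    return hashlib.sha256(client_id).digest()[0]


def responsible_server(bucket: int, split: int) -> str:
    """Buckets below `split` go to the primary, the rest to the secondary."""
    if not 0 <= bucket <= 255:
        raise ValueError("bucket must be in 0..255")
    return PRIMARY if bucket < split else SECONDARY


def primary_share(split: int) -> float:
    """Fraction of the 256 buckets that the primary answers for."""
    return split / 256


if __name__ == "__main__":
    mac = bytes.fromhex("001122334455")  # hypothetical client identifier
    print(responsible_server(bucket_for(mac), split=128))
    print(primary_share(128))
```

With a split of 128 each server answers for half of the buckets, which
is where the "in half" above comes from.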

The dhcp servers communicate with each other and so keep the dhcp lease
information synchronised between the two servers.

When a client accepts an address from a dhcp server, it remembers the
server's address and periodically (usually halfway through the lease)
requests a renewal of that address from the server. When a dhcp server
failure occurs, clients that were requesting address renewal stop
receiving replies. Remember we're only halfway through the lease,
so the client keeps using the existing IP address. The client continues
attempting renewal until near the end of the lease. The client then
broadcasts for a new dhcp server and receives a reply (hopefully) from
the partner dhcp server. The new dhcp server assigns an address from
within its pool.
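
To put numbers on that timeline, here is a small Python sketch of the
client side. It assumes the RFC 2131 default timers (renewal at 50% of
the lease, rebinding by broadcast at 87.5%); a server can hand out
different timer values, so treat the fractions as defaults rather than
guarantees. The function name is just for this example.

```python
def client_state(elapsed: float, lease: float) -> str:
    """Return the DHCP client state `elapsed` seconds into a lease.

    Assumes the RFC 2131 default timers: T1 = 0.5 * lease (start unicast
    renewal) and T2 = 0.875 * lease (start broadcasting to any server).
    """
    t1 = 0.5 * lease
    t2 = 0.875 * lease
    if elapsed < t1:
        return "BOUND"        # using the address, not talking to the server
    if elapsed < t2:
        return "RENEWING"     # unicast renewals to the original server
    if elapsed < lease:
        return "REBINDING"    # broadcasting, so the partner can answer
    return "EXPIRED"          # lease is over, client must start again


if __name__ == "__main__":
    one_day = 24 * 3600
    for hours in (0, 12, 21, 24):
        print(hours, client_state(hours * 3600, one_day))
```

For a one-day lease this puts the first renewal attempt at 12 hours and
the switch to broadcasting at 21 hours.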

There is a mode, the so-called 'partner down' mode, where you tell the
dhcp server that its partner has failed. The surviving dhcp server
will now allocate IP addresses from the whole pool of IP addresses. Any
clients that broadcast for a new DHCP server will be offered their old
IP address. This is usually only required for extended outages, e.g. the
server crashes on Friday arvo and you won't be able to fix it until
Monday.

Once upon a time I wrote a simple script that polled the other dhcp
server and switched to partner down if it failed to contact it for 30
minutes. Search the archives; I'm sure it has been posted. It uses
OMAPI to communicate with the dhcp server.
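
This is not that script, but the polling logic can be sketched in a few
lines of Python. Everything here is hypothetical: `probe` stands for
whatever check you use to contact the partner, and `on_partner_down`
stands for whatever tells the local dhcpd to enter partner-down (for
instance an OMAPI call). Neither is implemented below, and the 30
minute threshold is the one mentioned above.

```python
import time
from typing import Callable


class PartnerWatchdog:
    """Fire a callback once the partner has been unreachable too long."""

    def __init__(
        self,
        probe: Callable[[], bool],            # hypothetical: True if partner answers
        on_partner_down: Callable[[], None],  # hypothetical: switch local server
        threshold_seconds: float = 30 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._on_partner_down = on_partner_down
        self._threshold = threshold_seconds
        self._clock = clock
        self._last_ok = clock()   # treat start-up as the last good contact
        self._tripped = False

    def poll_once(self) -> bool:
        """Poll the partner once; return True if the callback fired."""
        now = self._clock()
        if self._probe():
            # Partner answered: reset the timer and re-arm the watchdog.
            self._last_ok = now
            self._tripped = False
            return False
        if not self._tripped and now - self._last_ok >= self._threshold:
            # Unreachable for the whole threshold: fire exactly once.
            self._tripped = True
            self._on_partner_down()
            return True
        return False
```

In use you would call poll_once() from a loop or from cron every minute
or so. The clock is passed in so the logic can be tested without
waiting half an hour.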

Remember the client is only impacted if one dhcp server is down for
more than approximately 50% of the lease time. I typically use at least
one day as the lease time, so any downtime needs to be >12 hours to
really bother the dhcp clients. If you use longer leases then the
corresponding failure period can be greater. If you have a stable
source of clients or plenty of spare IP addresses there is no reason to
not have much longer leases. For IP phones, which don't move around, I
use a week.
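
The arithmetic behind that rule of thumb, as a tiny Python sketch. It
assumes the worst case is a client whose renewal fell due just as the
server failed, and that renewal starts at 50% of the lease (the usual
default). The function name is made up for the example.

```python
def outage_budget_hours(lease_hours: float, renew_fraction: float = 0.5) -> float:
    """Hours of lease left for a client whose renewal fell due at the failure.

    Assumes renewal starts at `renew_fraction` of the lease (50% by default).
    """
    return lease_hours * (1 - renew_fraction)


if __name__ == "__main__":
    print(outage_budget_hours(24))       # one-day lease
    print(outage_budget_hours(7 * 24))   # one-week lease
```

A one-day lease gives 12 hours and a one-week lease gives 84 hours, or
three and a half days.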

No special action needs to be taken when a failure occurs or when the
failed server returns to service. The server returning to service
communicates with the running dhcp server and they synchronise lease
databases and return to the normal mode of operation.

I've been running DHCP failover since 3.0.1rc6 and on the whole find
that it works really well. The load is shared between the two servers,
which can be in completely different locations. There is no downtime
during failover; the partner server just keeps on running. For extended
outages the remaining server can be switched to partner-down mode where
it will hand out addresses from the full range.

There is a section in the dhcpd.conf man page titled DHCP FAILOVER that
covers this area. There is also a document distributed with dhcp-3.0.5
and earlier titled "draft-ietf-dhc-failover-07.txt" in the doc
directory of the source distribution which covers the failover mode in
RFC-like detail. :) In later versions there is a reference to the IETF
document draft-ietf-dhc-failover-12.txt. A web search will locate a copy.

If any of the above is not clear please post questions to the list.

regards,
-glenn
--
Glenn Satchell     mailto:glenn.satchell at uniq.com.au | It's a dog  eat dog
Uniq Advances Pty Ltd         http://www.uniq.com.au | world, and by golly,
PO Box 70 Paddington NSW Australia 2021              | we better make sure
tel:0409-458-580  tel:02-9380-6360  fax:02-9380-6416 | we're the dog.