Failover fails to re-establish normal communication after interrupted

Thu May 17 03:55:15 UTC 2007

> Dear Folks,
> On 07/05/07 11:14 +0800, jgomez at infoweapons.com wrote:
>>> Dear Folks,
>>> We are experiencing occasional problems in re-establishing normal
>>> state after the failover pair enter communications-interrupted
>>> state.  We are using ISC DHCP server 3.0.4.
>>>
>>> The failover pair enter communications interrupted when one of the
>>> pair begins to re-write its dhcpd.leases file.
>>>
>>> I was woken at midnight last night again to return them to normal,
>>> which I did by restarting the primary member of the pair.
>>>
>>> 1. Have others experienced this problem?
>>> 2. Is there any known patch or fix?
>>> 3. Any suggestions on how to go about fixing it?
>>> 4. Can I use omapi to help restore communication without restarting
>>>    the dhcp server?
>>
>>hello,
>>
>>i suggest that you use the latest version of ISC DHCP which is
>>dhcp-3.0.5 because some of the older versions won't support DHCP
>>Failover.
>
> Yes, I agree that this may help; there is one change that *may* have
> some impact.  The changes from 3.0.4 -> 3.0.5 are not great, and I am
> arguing for this with some members of my team.  But this is a
> production system that looks after more than half a million customers,
> so changes are not to be made lightly.  And I was shouted at very
> abusively for suggesting this.  I continue to argue for this.
>
>>yes, u can use omapi protocol to connect to the ISC DHCP server and change
>>its state without stopping it.
>
> But how can I use it to change from "Communications-Interrupted" to
> "Normal"?  The failover pair should enter the "Normal" state from the
> "Communications-Interrupted" state automatically.  The problem is that
> from time to time, there is a failure to return to normal
> communication, resulting in a night-time call to the on-call phone
> under my pillow.

Here is some output from running the failover servers in the
communications-interrupted state:

Primary server side
root:/home/admin>dhcpd -cf /usr/local/etc/dhcpd.conf em0
Internet Systems Consortium DHCP Server V3.0.4
Copyright 2004-2006 Internet Systems Consortium.
All rights reserved.
For info, please visit http://www.isc.org/sw/dhcp/
Wrote 25 leases to leases file.
Listening on BPF/em0/00:0c:29:93:2e:60/128.1.2/24
Sending on BPF/em0/00:0c:29:93:2e:60/128.1.2/24
Sending on Socket/fallback/fallback-net
failover peer Primary: I move from communications-interrupted to normal

Secondary server side
# dhcpd -cf /usr/local//etc/dhcpd.conf em0
Internet Systems Consortium DHCP Server V3.0.5
Copyright 2004-2006 Internet Systems Consortium.
All rights reserved.
For info, please visit http://www.isc.org/sw/dhcp/
Wrote 25 leases to leases file.
Listening on BPF/em0/00:0c:29:36:cd:45/128.1.2/24
Sending on BPF/em0/00:0c:29:36:cd:45/128.1.2/24
Sending on Socket/fallback/fallback-net
failover peer Secondary: I move from communications-interrupted to normal

Note: The system logs show "failover peer Primary: I move from
communications-interrupted to startup" because both servers are configured
and working. But when the primary server goes down, the system logs will
show "failover peer Secondary: I move from normal to
communications-interrupted". In this case, the secondary server acts as a
backup and continues to provide service.

When the primary server in a failover pair goes down because it is being
stopped, the secondary server continues to provide service, but in a
limited mode called the communications-interrupted state. In this state, a
failover peer can’t tell whether the other server is providing DHCP
service, because the servers are not communicating with one another.
Because of these limitations, it’s good to put the secondary server into
the partner-down state if the primary server goes down and isn’t expected
to come back up quickly. In the partner-down state, the secondary server
can completely take over DHCP service on the network after waiting for the
MCLT, including reclaiming all of the down server’s IP addresses.

To put an ISC DHCP server into the partner-down state, use the OMAPI
protocol. OMAPI is a control protocol that allows an administrator to
connect to the ISC DHCP server and change its state without stopping it.

I'm reading up on omshell to put my failover/secondary into the
partner-down state, so I can move my secondary to different
hardware.  I followed the example in the DHCP Handbook,
2nd edition.  On my primary server (Primary Server), I do this
with omshell:

% omshell
> server Primary
> port [myport]
> key itsomkey [mykey]
> connect
obj: <null>
> new failover-state
obj: failover-state
> set name = "jonna"
obj: failover-state
name = "jonna"
> open
obj: failover-state
name = "jonna"
partner-address = 00:10:1e:58
partner-port = 00:00:03:22
local-address = 00:10:1e:28
local-port = 00:00:03:21
max-outstanding-updates = 00:00:00:0a
mclt = 00:00:02:58
load-balance-max-secs = 00:00:00:03
load-balance-hba =
ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:00:00:00:00
partner-state = 00:00:00:02
local-state = 00:00:00:02
partner-stos = 43:a9:b3:1e
local-stos = 45:7f:12:82
hierarchy = 00:00:00:00
last-packet-sent = 00:00:00:00
last-timestamp-received = 00:00:00:00
skew = 00:00:00:00
max-response-delay = 00:00:00:3c
cur-unacked-updates = 00:00:00:00
> set local-state = 1
obj: failover-state
name = "jonna"
partner-address = 00:10:1e:58
partner-port = 00:00:03:22
local-address = 00:10:1e:28
local-port = 00:00:03:21
max-outstanding-updates = 00:00:00:0a
mclt = 00:00:02:58
load-balance-max-secs = 00:00:00:03
load-balance-hba =
ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:00:00:00:00
partner-state = 00:00:00:02
local-state = 1
partner-stos = 43:a9:b3:1e
local-stos = 45:7f:12:82
hierarchy = 00:00:00:00
last-packet-sent = 00:00:00:00
last-timestamp-received = 00:00:00:00
skew = 00:00:00:00
max-response-delay = 00:00:00:3c
cur-unacked-updates = 00:00:00:00
> update
obj: failover-state
name = "jonna"
partner-address = 00:10:1e:58
partner-port = 00:00:03:22
local-address = 00:10:1e:28
local-port = 00:00:03:21
max-outstanding-updates = 00:00:00:0a
mclt = 00:00:02:58
load-balance-max-secs = 00:00:00:03
load-balance-hba =
ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:ff:00:00:00:00
partner-state = 00:00:00:02
local-state = 1
partner-stos = 43:a9:b3:1e
local-stos = 45:7f:12:82
hierarchy = 00:00:00:00
last-packet-sent = 00:00:00:00
last-timestamp-received = 00:00:00:00
skew = 00:00:00:00
max-response-delay = 00:00:00:3c
cur-unacked-updates = 00:00:00:00
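
The values omshell prints in the session above are colon-separated
hexadecimal bytes. Here is a minimal Python sketch that decodes a few of
them, assuming each value is a big-endian unsigned integer. The helper
name decode_omapi_int is made up for illustration and is not part of
omshell or dhcpd.

```python
def decode_omapi_int(value: str) -> int:
    """Decode a colon-separated hex string such as '00:00:02:58'.

    Assumption: the bytes form one big-endian unsigned integer.
    """
    return int.from_bytes(bytes.fromhex(value.replace(":", "")), "big")


# Raw values copied from the omshell session above.
fields = {
    "mclt": "00:00:02:58",
    "max-response-delay": "00:00:00:3c",
    "partner-port": "00:00:03:22",
    "local-port": "00:00:03:21",
    "max-outstanding-updates": "00:00:00:0a",
    "local-state": "00:00:00:02",
}

decoded = {name: decode_omapi_int(raw) for name, raw in fields.items()}
for name, number in decoded.items():
    print(f"{name} = {number}")
```

Read this way, mclt is 600 seconds and max-response-delay is 60 seconds.
What each state number means depends on the dhcpd version, so check the
omshell documentation for your release before setting local-state.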

So here is what I see happen in the syslog on my primary:

Dec 12 15:39:25 primary dhcpd: [ID 702911 daemon.info] failover peer jonna: I move from normal to partner-down
Dec 12 15:39:25 primary dhcpd: [ID 702911 daemon.info] failover peer jonna: peer moves from normal to potential-conflict
Dec 12 15:39:25 primary dhcpd: [ID 702911 daemon.info] failover peer jonna: I move from partner-down to potential-conflict
Dec 12 15:39:25 primary dhcpd: [ID 702911 daemon.info] failover peer jonna: peer update completed.
Dec 12 15:39:25 primary dhcpd: [ID 702911 daemon.info] failover peer jonna: I move from potential-conflict to normal
Dec 12 15:39:27 primary dhcpd: [ID 702911 daemon.info] failover peer jonna: peer moves from potential-conflict to normal

Likewise on the secondary:

Dec 12 15:39:25 secondary dhcpd: [ID 702911 daemon.info] failover peer jonna: peer moves from normal to partner-down
Dec 12 15:39:25 secondary dhcpd: [ID 702911 daemon.info] failover peer jonna: I move from normal to potential-conflict
Dec 12 15:39:25 secondary dhcpd: [ID 702911 daemon.info] failover peer jonna: peer moves from partner-down to potential-conflict
Dec 12 15:39:25 secondary dhcpd: [ID 702911 daemon.info] failover peer jonna: peer moves from potential-conflict to normal
Dec 12 15:39:27 secondary dhcpd: [ID 702911 daemon.info] failover peer jonna: peer update completed.
Dec 12 15:39:27 secondary dhcpd: [ID 702911 daemon.info] failover peer jonna: I move from potential-conflict to normal

Note:
As soon as the peer detects that its partner is back, it will try to get
back to normal automatically.
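
If you want to notice when a peer does not get back to normal by itself,
you could watch the syslog for the transition messages shown above. Below
is a rough Python sketch, assuming each syslog entry is on a single
(unwrapped) line and uses the "I move from ... to ..." and "peer moves
from ... to ..." wording seen in these logs. The function name
track_states and the sample list are only for illustration.

```python
import re

# Matches the failover transition messages seen in the syslog excerpts.
TRANSITION = re.compile(
    r"failover peer (?P<peer>[^:\s]+): (?P<who>I move|peer moves) "
    r"from (?P<old>[\w-]+) to (?P<new>[\w-]+)"
)


def track_states(lines):
    """Return {(peer_name, 'local' or 'partner'): most recent state}."""
    states = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            side = "local" if match.group("who") == "I move" else "partner"
            states[(match.group("peer"), side)] = match.group("new")
    return states


PREFIX = "Dec 12 15:39:25 primary dhcpd: [ID 702911 daemon.info] "
sample = [
    PREFIX + "failover peer jonna: I move from normal to partner-down",
    PREFIX + "failover peer jonna: peer moves from normal to potential-conflict",
    PREFIX + "failover peer jonna: I move from partner-down to potential-conflict",
    PREFIX + "failover peer jonna: peer update completed.",
    PREFIX + "failover peer jonna: I move from potential-conflict to normal",
    PREFIX + "failover peer jonna: peer moves from potential-conflict to normal",
]

states = track_states(sample)
for (peer, side), state in sorted(states.items()):
    print(f"{peer} {side}: {state}")
```

A cron job or monitoring check could run something like this against the
real log and alert when the local state stays at
communications-interrupted for longer than you are comfortable with.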

Use partner-down to tell a peer that its partner really is down! That
is, remove the partner, shut it down or whatever, and *then* tell the
other to go into partner-down.

Theoretically, shutting down ISC DHCP in an orderly fashion should
result in the partner being told about it, so that it goes into
partner-down automatically.

When you start the server that was down, it automatically connects to the
running server, synchronizes with it, waits for the MCLT to expire, and
begins serving clients.

The mclt statement defines the maximum client lead time: the longest time
by which either server can extend a lease without contacting the other
server. This value has to be a compromise between client lease time and
recovery time. The value should be reasonably long, so that clients that
get a lease that is mclt seconds long have a useful lease that won’t lead
to instability for them. The value should not be too long, because the mclt
is also the recovery interval for the server. That is, the longer the mclt
is, the longer it takes to return to normal failover operations after a
server failure, and the longer it takes the running server to recover IP
addresses after moving into the partner-down state. The shorter it is, the
more load your servers will experience when they are not communicating. A
value of something like 1800 seconds is reasonable. The mclt is configured
only on the primary server, in order to avoid disagreements between the
primary and secondary servers about its value.

Hope this will help you..;)

Jonna
