Failover strangeness

Mon Oct 9 19:12:52 UTC 2006

Here's the situation.  I've got two servers, both running ISC dhcpd 
3.1.0a1, configured as failover peers.  Here's the sequence of events:

1.  Both dhcp servers are running.
2.  Client gets a lease from the primary.
3.  Primary fails.
4.  Client gets renewal from secondary.
5.  Primary starts again.
6.  Client continues to get renewals from secondary.
7.  Secondary fails.
8.  Primary dhcp server refuses to hand out lease.
9.  Secondary comes back up.
10. Primary dhcp server hands out lease!
11. Secondary fails.
12. Primary continues to hand out lease.

Here are the messages logged by the primary dhcpd after the secondary 
fails at step 8:

# /usr/sbin/dhcpd -d eri0
pool 1560a0 PROGPL_LAB  total 1  free 0  backup 0  lts 0
pool 1560a0 PROGPL_LAB: lts <= max-lease-misbalance (0), pool rebalance event skipped.
peer ontdhcp: disconnected
failover peer ontdhcp: I move from normal to communications-interrupted
DHCPREQUEST for 10.55.255.18 from 00:13:ba:20:00:fa (ON2342) via 10.50.255.1: lease in transition state expired
DHCPDISCOVER from 00:13:ba:20:00:fa via 10.50.255.1: network PROGPL_LAB: no free leases
DHCPDISCOVER from 00:13:ba:20:00:fa via 10.50.255.1: network PROGPL_LAB: no free leases
DHCPDISCOVER from 00:13:ba:20:00:fa via 10.50.255.1: network PROGPL_LAB: no free leases
DHCPDISCOVER from 00:13:ba:20:00:fa via 10.50.255.1: network PROGPL_LAB: no free leases
DHCPDISCOVER from 00:13:ba:20:00:fa via 10.50.255.1: network PROGPL_LAB: no free leases
DHCPDISCOVER from 00:13:ba:20:00:fa via 10.50.255.1: network PROGPL_LAB: no free leases
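
For anyone comparing output like this across the different steps, it may 
help to tally the DHCP messages by type, client, and refusal reason.  The 
following is only a rough Python sketch and not part of dhcpd; the names 
LINE_RE and tally are made up for illustration, and it assumes each log 
entry sits on a single line (not wrapped by the mail client).

```python
import re
from collections import Counter

# Matches dhcpd debug lines that start with a DHCP message type, e.g.
#   DHCPDISCOVER from <mac> via <relay>: network <name>: no free leases
# The "reason" group is the text after the last ": " on the line.
LINE_RE = re.compile(
    r"^(?P<msg>DHCP[A-Z]+)\b.*?\bfrom (?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2})\b"
    r".*?: (?P<reason>[^:]+)$"
)


def tally(lines):
    """Count (message type, client MAC, reason) triples in dhcpd log lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:  # lines such as "peer ontdhcp: disconnected" are skipped
            counts[(m["msg"], m["mac"], m["reason"])] += 1
    return counts


# Sample input copied from the log above, one entry per line.
sample = [
    "peer ontdhcp: disconnected",
    "failover peer ontdhcp: I move from normal to communications-interrupted",
    "DHCPREQUEST for 10.55.255.18 from 00:13:ba:20:00:fa (ON2342) via "
    "10.50.255.1: lease in transition state expired",
    "DHCPDISCOVER from 00:13:ba:20:00:fa via 10.50.255.1: network "
    "PROGPL_LAB: no free leases",
    "DHCPDISCOVER from 00:13:ba:20:00:fa via 10.50.255.1: network "
    "PROGPL_LAB: no free leases",
]

for (msg, mac, reason), n in tally(sample).items():
    print(f"{n:3d}  {msg:<13} {mac}  {reason}")
```

Running the same tally on the output captured at each step would make it 
easier to see exactly when "no free leases" starts and stops.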

I'm not sure what's happening here.  I've got the lease times set to 60 
seconds, and I waited at least 10 minutes between steps.

Here are my failover statements from the dhcpd.conf files:

failover peer "ontdhcp" {
  primary;
  address 10.50.1.245;
  port 519;
  peer address 10.50.1.241;
  peer port 520;
  max-response-delay 60;
  max-unacked-updates 10;
  mclt 30;   # based on the lease time!  Max to 3600?
  split 255;  # really!  One server will handle all the requests, the
              # secondary is there just for failover
  load balance max seconds 5;
  min-balance 3600;    #Rebalancing doesn't need to happen very often
  max-balance 86400;
}

failover peer "ontdhcp" {
  secondary;
  address 10.50.1.241;
  port 520;
  peer address 10.50.1.245;
  peer port 519;
  max-response-delay 60;
  max-unacked-updates 10;
  load balance max seconds 5;
  min-balance 3600;
  max-balance 84000;
}

Any ideas or things I should look at?

-Greg G