Failure of dhcp server failover

Sat Apr 30 21:41:37 UTC 2016

Hi!

Is it possible to run a pair of ISC DHCP servers in failover mode
and reliably supply static allocations for some DHCP clients,
e.g. using pools containing a single IPv4 address?

I run ISC DHCP Server version 4.3.3P1 built with the --enable-secs-byteorder configure option.

This version runs just fine for "ordinary" IPv4 pools containing multiple addresses.
It also runs just fine for single-address pools when not in failover mode.

In failover mode, ordinary pools work fine, but single-address pools work
only intermittently. When they fail:

- first DHCP server logs:

Apr 18 00:00:02 k-45-monitor dhcpd: DHCPDISCOVER from 44:d9:e7:58:fd:e1 via 31.220.160.2: load balance to peer default

- second one logs:

Apr 18 00:00:02 m-19-monitor dhcpd: DHCPDISCOVER from 44:d9:e7:58:fd:e1 via 31.220.160.2: peer holds all free leases

And the DHCP client obtains no address.
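
To find every client affected this way, the syslog output of both servers can be correlated. The following is only a minimal sketch in Python, assuming the log lines have exactly the shape quoted above; the function name find_unserved and the rule of pairing messages by identical timestamp, MAC address and relay address are illustrative assumptions, not part of ISC DHCP.

```python
import re
from collections import defaultdict

# Matches only the DHCPDISCOVER line shape quoted above; anything else is ignored.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+dhcpd(?:\[\d+\])?:\s+"
    r"DHCPDISCOVER from (?P<mac>[0-9a-f:]{17}) via (?P<relay>\S+?):\s+(?P<reason>.+)$"
)


def declined(reason):
    """True for the two failover-related refusal messages seen in the logs above."""
    return (reason.startswith("load balance to peer")
            or reason == "peer holds all free leases")


def find_unserved(lines):
    """Return {(mac, relay, timestamp): {host: reason}} for DISCOVERs that were
    logged by at least two servers and refused by all of them.

    Assumption: messages about the same DISCOVER carry the same timestamp on
    both servers, which only holds if the clocks are closely synchronized.
    """
    seen = defaultdict(dict)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            key = (m["mac"], m["relay"], m["ts"])
            seen[key][m["host"]] = m["reason"]
    return {key: hosts for key, hosts in seen.items()
            if len(hosts) >= 2 and all(declined(r) for r in hosts.values())}
```

Feeding it the merged logs of both servers lists the DISCOVERs that neither server answered; lines that do not match the pattern are skipped silently.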

Some more details: I have several UniFi wireless access points (APs) controlled by the UniFi Controller software.
These access points act as transparent L2 bridges supporting several distinct WLANs and VLANs
for their wireless clients, plus an extra management VLAN. They have a wired uplink connected
to managed L2 switches that insert DHCP option 82 into all requests from the APs themselves and from their clients.
These VLANs are routed by Cisco routers acting as DHCP relays. The routers relay DHCP requests
to the pair of ISC DHCP servers. There is an ordinary IP pool for wireless clients and it works just fine.

The UniFi access points themselves obtain their IP addresses and additional DHCP vendor options from the DHCP servers.
Each AP uses at least two IP addresses, one per VLAN.
All requests from a single AP come from that AP's single MAC address, but each carries a distinct option 82
value corresponding to its VLAN and so falls into a distinct class configured on the DHCP servers.
The following part of the DHCP server configuration describes a single UniFi AP:

   # vlan 1600 = 0x640
   class "ward2-vlan1600" {
     match if binary-to-ascii(16, 8, ":", hardware) = "1:44:d9:e7:58:fd:e1"
     and substring(binary-to-ascii(16, 8, ".", option agent.circuit-id), 0, 8) = "0.4.6.40";
   }
   # vlan 2065 = 0x811
   class "ward2-vlan2065" {
     match if binary-to-ascii(16, 8, ":", hardware) = "1:44:d9:e7:58:fd:e1"
     and substring(binary-to-ascii(16, 8, ".", option agent.circuit-id), 0, 8) = "0.4.8.11";
   }
   pool { failover peer "default"; allow members of "ward2-vlan1600"; range 10.19.50.13; }
   pool { failover peer "default"; allow members of "ward2-vlan2065"; range 62.231.174.13; }
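
The circuit-id strings in the two classes can be cross-checked against the VLAN numbers. Below is a minimal sketch in Python that mimics what binary-to-ascii(16, 8, ".", ...) yields for the first four circuit-id bytes. The byte layout (0x00, 0x04, then the VLAN ID as two big-endian bytes) is an assumption inferred from the two classes above, not taken from the switch documentation, and the function name circuit_id_prefix is purely illustrative.

```python
def circuit_id_prefix(vlan_id):
    """Render the assumed first four circuit-id bytes as dot-separated hex.

    Assumed layout (inferred from the classes above):
    byte 0 = 0x00, byte 1 = 0x04, bytes 2-3 = VLAN ID, big-endian.
    """
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID out of range")
    raw = bytes([0x00, 0x04]) + vlan_id.to_bytes(2, "big")
    # Each byte is printed in lowercase hex without zero padding, matching
    # the style of the "1:44:d9:e7:58:fd:e1" hardware string above.
    return ".".join(format(b, "x") for b in raw)


print(circuit_id_prefix(1600))  # 0.4.6.40
print(circuit_id_prefix(2065))  # 0.4.8.11
```

Under this assumption the prefix is 8 characters long only when the low byte of the VLAN ID is 0x10 or greater. For a VLAN such as 10 it would be "0.4.0.a" (7 characters), so the length passed to substring() would need to be adjusted for such classes.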

All access points run the same hardware and software versions and their DHCP server
configuration is uniform, but only some of them are served by the DHCP servers just fine.
The others get no answer, with the logs cited above, until I stop the second DHCP server.
There seems to be a race condition in failover mode.
If I stop the second (or first) DHCP server, the remaining one serves all APs just fine.

Here is the head of the DHCP server configuration:

# default ports tcp/647
failover peer "default" {
         secondary;
         address 62.231.191.174;
         peer address 62.231.191.161;
         max-response-delay 60;
         max-unacked-updates 10;
         mclt 60;
         auto-partner-down 60;
         load balance max seconds 3;
}
subnet 62.231.191.172 netmask 255.255.255.252 {}
include "/usr/local/etc/dhcpd.master";
# EOF

The second server resides in another network segment and IP network
and has the same configuration except for its IP address.
Both servers share a single dhcpd.master containing the main configuration;
the part describing classes and pools is cited above.

Please help me debug this. I am ready to change the configuration, test patches, etc.
For the time being, I'm forced to stop using failover mode and keep the second server stopped.

Eugene Grosbein