excessive failover pool balancing, leases files getting out of sync

Wed Jun 15 03:54:15 UTC 2011

We run into the same issue, although not daily. It's more of a rare event
for us (maybe once a month). I figured it was just a bug that is hopefully
fixed in a newer version, but I could be wrong.

Our setup is very similar to yours. We run 3.1.0 on Solaris 10 (SPARC) in a
non-global zone with an exclusive IP stack. We also use "USE_SOCKET"
(although we don't need to; this setting carried over from when we didn't
run DHCP in its own zone).

What we do is put one server (our primary) into partner-down mode, blow
away the lease files on the secondary, touch a new lease file, and then
start the secondary back up. That 'resolves' the issue for us for a while.
Sounds like you do a very similar thing.

We're slated to upgrade DHCP versions here somewhat soon. Hoping the issue
vanishes.

Jack Kielsmeier
netINS Systems Administrator
systems at netins.net

-----Original Message-----
From: dhcp-users-bounces+jackkiel=netins.net at lists.isc.org
[mailto:dhcp-users-bounces+jackkiel=netins.net at lists.isc.org] On Behalf Of
Gordon A. Lang
Sent: Tuesday, June 14, 2011 9:26 PM
To: Users of ISC DHCP
Subject: excessive failover pool balancing, leases files getting out of sync

I need help.

My log files show that the failover peers have very different ideas
about the number of "free" and "backup" leases.  There is very
frequent pool balancing going on.  My leases files are growing from
10 meg after a restart up to 120 meg within a couple of hours.
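
One way to see where the growth comes from: dhcpd appends a new record to
the leases file whenever a lease changes, so counting records per address
shows which leases are being rewritten most, and tallying the latest
"binding state" per address gives each server's own view of "free" versus
"backup". The following is a minimal Python sketch, assuming the usual
dhcpd.leases layout ("lease <ip> {" ... "}"). The function name and the
sample records are illustrative and are not taken from the real servers.

```python
import re
from collections import Counter

# A lease record opens with "lease <ipv4> {" at the start of a line.
LEASE_START = re.compile(r"^lease\s+(\d{1,3}(?:\.\d{1,3}){3})\s*\{")
# Matches "binding state X;" but not "next binding state X;".
BINDING = re.compile(r"^\s*binding state\s+(\S+);")


def summarize_leases(text):
    """Return (records_per_ip, latest_state_counts) for leases-file text."""
    records = Counter()
    latest_state = {}
    current_ip = None
    for line in text.splitlines():
        m = LEASE_START.match(line)
        if m:
            current_ip = m.group(1)
            records[current_ip] += 1
            continue
        if current_ip is None:
            continue
        m = BINDING.match(line)
        if m:
            # Later records override earlier ones for the same address.
            latest_state[current_ip] = m.group(1)
        elif line.startswith("}"):
            current_ip = None
    return records, Counter(latest_state.values())


# Illustrative sample only (hypothetical records, not real data).
SAMPLE = """\
lease 10.110.1.80 {
  binding state free;
}
lease 10.110.1.81 {
  binding state free;
}
lease 10.110.1.80 {
  binding state backup;
  next binding state free;
}
"""

if __name__ == "__main__":
    records, states = summarize_leases(SAMPLE)
    print("most rewritten:", records.most_common(3))
    print("latest binding states:", dict(states))
```

Running the same function over the leases file from each peer (read with
open(path).read()) and comparing the two state tallies would show how far
apart the servers really are.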

We have version 3.1.1 compiled with USE_SOCKET defined, configured
with failover, running on Sun servers with Solaris 10.  It had been
working flawlessly for years until May 26, when things suddenly
became quite rough.

The continuing symptom is this: at some point in the middle of the
"morning rush," each work day between 7 am and 9 am, most users
get leases but many do not.  Users who do not get a lease usually
do get one on their second attempt (via reboot), but not always.
Restarting dhcpd at that time completely stops all DHCP-related
help desk calls for the remainder of the day.

The first time this occurred, on May 26, it was after a power glitch
that sharpened the "morning rush" into a "boot storm."  Restarting
dhcpd did not stop the help desk calls on that day.  In fact, we found
that the problems did not stop until we shut down the primary and
put the failover into "partner down" state.

After that, things were fine for several days, until the symptoms
resurfaced on a daily basis, and we got used to waiting for the
first help desk call each morning.

Then I rebooted both servers (with the primary dhcpd inhibited), took
the failover dhcpd out of the "partner down" state, deleted the leases
file on the primary, and brought the primary dhcpd back to life.
The leases file was repopulated very quickly.  Everything worked
flawlessly, and the leases files stayed in sync for 6 days.  But
today we are seeing that the leases files are out of sync.

And now I am getting a new message in the logs that I didn't see before:

dhcpd: bind update on 10.110.1.80 got ack from nsti1-nsti2: xid mismatch.
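
To judge whether this message is tied to a few addresses or spread across
the pools, the log can be tallied per address. This is a rough Python
sketch that matches only the message text shown above. The syslog prefix
(timestamp and host name) in the sample lines is an assumption, and the
function name is illustrative.

```python
import re
from collections import Counter

# Matches the message body exactly as it appears in the log line above.
XID_MISMATCH = re.compile(
    r"bind update on (\d{1,3}(?:\.\d{1,3}){3}) got ack from (\S+): xid mismatch"
)


def count_xid_mismatches(lines):
    """Count xid-mismatch messages keyed by (failover peer name, address)."""
    counts = Counter()
    for line in lines:
        m = XID_MISMATCH.search(line)
        if m:
            counts[(m.group(2), m.group(1))] += 1
    return counts


# Hypothetical syslog lines; only the message text comes from the real log.
SAMPLE_LOG = [
    "Jun 14 08:01:02 nsti1 dhcpd: bind update on 10.110.1.80 got ack from nsti1-nsti2: xid mismatch.",
    "Jun 14 08:01:05 nsti1 dhcpd: bind update on 10.110.1.80 got ack from nsti1-nsti2: xid mismatch.",
    "Jun 14 08:01:06 nsti1 dhcpd: some unrelated message",
]

if __name__ == "__main__":
    for (peer, addr), n in count_xid_mismatches(SAMPLE_LOG).most_common():
        print(peer, addr, n)
```

For a real log, passing an open file object works as well, since the
function only iterates over lines.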

If anyone can give any suggestions, I would very much appreciate it.

More information: We use 2 day leases, with pool sizes typically 3 to 4
times larger than the expected number of users.

I don't want to post my whole config file, but in case it is a useful
reference, here is a drastically trimmed, sanitized version of the primary
(nsti1) conf file:

local-address 192.168.104.11;
subnet 192.168.104.11 netmask 255.255.255.255 { }

option option-62 code 62 = string;
option option-98 code 98 = string;
option option-116 code 116 = boolean;
option option-117 code 117 = unsigned integer 16;
option option-119 code 119 = string;
option option-120 code 120 = string;
option option-121 code 121 = string;
option option-150 code 150 = array of ip-address;
option option-176 code 176 = string;

ddns-update-style interim;
authoritative ;
ddns-updates True ;
default-lease-time 172800 ;
log-facility local4 ;
max-lease-time 172800 ;
min-lease-time 172800 ;
omapi-port 7911 ;
pid-file-name "/export/local/etc/dhcpd.pid" ;
ping-check True ;
ping-timeout 1 ;
server-identifier 192.168.104.11 ;
update-optimization False ;
update-static-leases True ;
allow unknown-clients ;
allow duplicates ;

failover peer "nsti1-nsti2" {
        primary;
        address 192.168.104.11;
        port 647;
        peer address 192.168.104.21;
        peer port 647;
        max-response-delay 60;
        max-unacked-updates 20;
        mclt 3600;
        split 255;
        load balance max seconds 5;
}

class "Cisco IP Phones" {
        match if (substring (option vendor-class-identifier, 0, 28) = "Cisco Systems, Inc. IP Phone");
}

subnet 127.0.0.1 netmask 255.255.255.255 {
}

subnet 10.3.0.0 netmask 255.255.0.0 {
        option subnet-mask 255.255.0.0 ;
        option routers 10.3.1.1 ;
        option time-servers 10.106.1.30 ;
        option domain-name-servers 192.168.53.10 , 192.168.53.20 ;
        option domain-name "domain.name" ;
        option vendor-encapsulated-options 06:01:0B:08:07:AA:AA:01:0A:06:01:14:00 ;
        option netbios-name-servers 10.105.3.39 , 10.105.3.41 ;
        option netbios-node-type 8 ;
        option bootfile-name "BStrap/x86pc/BStrap.0" ;
        option slp-directory-agent true 10.105.6.241 , 10.151.209.100 ;
        option slp-service-scope true "slp-fs" ;
        next-server fsta4.ad.domain.name ;
        deny bootp ;
        pool {
                range 10.3.4.64 10.3.7.255;
                failover peer "nsti1-nsti2";
                deny dynamic bootp clients;
        }
        host OC0001L1464807.ad.domain.name.-68B599EC9E7B-10-3-4-7 {
                hardware ethernet 68:B5:99:EC:9E:7B;
                fixed-address 10.3.4.7;
        }
}

subnet 10.110.1.0 netmask 255.255.255.0 {
        option subnet-mask 255.255.255.0 ;
        option routers 10.110.1.1 ;
        option time-servers 10.106.1.30 ;
        option domain-name-servers 192.168.53.10 , 192.168.53.20 ;
        option domain-name "domain.name" ;
        option netbios-name-servers 10.105.3.39 , 10.105.3.41 ;
        option netbios-node-type 8 ;
        option slp-directory-agent true 10.105.6.241 , 10.151.209.100 ;
        option slp-service-scope true "slp-fs" ;
        option option-150 10.104.1.7 ;
        pool {
                range 10.110.1.76 10.110.1.78;
                failover peer "nsti1-nsti2";
                deny dynamic bootp clients;
        }
        pool {
                range 10.110.1.80 10.110.1.239;
                failover peer "nsti1-nsti2";
                deny dynamic bootp clients;
        }
        pool {
                range 10.110.1.251 10.110.1.254;
                failover peer "nsti1-nsti2";
                deny dynamic bootp clients;
        }
        host null-0006296C03CB-10-110-1-16 {
                hardware ethernet 00:06:29:6C:03:CB;
                fixed-address 10.110.1.16;
        }

        host null-0003BA2448D1-10-110-1-17 {
                hardware ethernet 00:03:BA:24:48:D1;
                fixed-address 10.110.1.17;
        }
}
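
As a cross-check on the "3 to 4 times larger" sizing mentioned earlier, the
number of addresses in each range above can be computed directly. This is a
small Python sketch using the ranges from this trimmed config. The implied
client counts are simply the pool size divided by 4 and by 3, not measured
figures, and the helper name is illustrative.

```python
import ipaddress


def range_size(first, last):
    """Number of addresses in an inclusive dhcpd 'range first last;'."""
    return int(ipaddress.IPv4Address(last)) - int(ipaddress.IPv4Address(first)) + 1


# Ranges copied from the trimmed config above.
RANGES = [
    ("10.3.4.64", "10.3.7.255"),
    ("10.110.1.76", "10.110.1.78"),
    ("10.110.1.80", "10.110.1.239"),
    ("10.110.1.251", "10.110.1.254"),
]

if __name__ == "__main__":
    for first, last in RANGES:
        size = range_size(first, last)
        # With pools 3x to 4x the client count, implied clients = size/4 .. size/3.
        print(f"{first} - {last}: {size} addresses, "
              f"implied clients {size // 4} to {size // 3}")
```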

--
Gordon A. Lang 
