Failover Partners mysteriously disconnect and never reconnect

Wed Oct 22 15:08:29 UTC 2008

I've been chasing a problem with our DHCP failover setup for some time
now, and we seem to have tracked it down to this: the partners get into
the "communications-interrupted" state and never try to connect to each
other again.

Both servers are stable and run FreeBSD 6.3-RELEASE.

They are both connected via Gig-E to a Cisco 6509.

There are no errors on the switch ports or on the NICs, and all other
applications on the servers run fine.  The ports are 1000M/full and I
can transfer data between the servers with no problem.  About 200
devices are connected to the same switch, all with no connectivity
issues.  We've tried moving ports and NICs, but we still see this issue
pop up at random.

Running ISC-DHCPD 3.0.7 from FreeBSD ports.

dhcp0 - primary
dhcp1 - secondary peer

Both servers sync their time via broadcast NTP.

What we see in the logs on both servers:

Oct 22 01:04:50 dhcp1 dhcpd: timeout waiting for failover peer dhcp-failover
Oct 22 01:04:50 dhcp1 dhcpd: peer dhcp-failover: disconnected
Oct 22 01:04:50 dhcp1 dhcpd: failover peer dhcp-failover: I move from normal to communications-interrupted

Oct 22 01:05:06 dhcp0 dhcpd: timeout waiting for failover peer dhcp-failover
Oct 22 01:05:06 dhcp0 dhcpd: peer dhcp-failover: disconnected
Oct 22 01:05:06 dhcp0 dhcpd: failover peer dhcp-failover: I move from normal to communications-interrupted

When this happens the peers seem to just go about their business.
Why the connection gets interrupted in the first place is a mystery, but
even more interesting is that they never try to connect again (we never
see a move back to normal).  We can log in to the servers shortly after
this happens and everything seems fine: no link flaps, no errors in
dmesg, etc.  We can ping between the servers with no loss.
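
A rough way to confirm from the logs that neither side ever moves back
to normal is to print the last failover state each host logged.  This is
only a sketch: the function name failover_states is made up for
illustration, it assumes the syslog line layout shown in the excerpts
above (host name in the fourth field, new state as the last word), and
the log path in the usage line is an assumption.

```shell
# Print the most recent failover state transition logged by each host.
# Reads syslog lines from the files given as arguments, or from stdin.
failover_states() {
    awk '/failover peer .*: I move from / {
             # $4 is the host name, $NF is the state being entered
             state[$4] = $NF
             when[$4]  = $1 " " $2 " " $3
         }
         END { for (h in state) print h, when[h], state[h] }' "$@" | sort
}

# Usage (log path is an assumption):
#   failover_states /var/log/messages
```

If the last state printed for a host is communications-interrupted and
the timestamp is hours or days old, that host never logged a return to
normal.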

Looking at sockstat and netstat, it appears they are listening on the  
TCP socket, but there is no ESTABLISHED connection between them:

[dhcp0:~] sockstat | grep dhcp
dhcpd    dhcpd      19728 3  dgram  -> /var/run/logpriv
dhcpd    dhcpd      19728 6  udp4   *:67                  *:*
dhcpd    dhcpd      19728 10 tcp4   10.0.0.18:520         *:*
root     syslogd    691   8  dgram  /var/db/dhcpd/var/run/log
[auth0:~] netstat -an | grep dhcpd
ffffff00c29d0c80 dgram       0      0 ffffff00c17dc7c0        0        0        0 /var/db/dhcpd/var/run/log
[dhcp0:~] netstat -an | grep 520
tcp4       0      0  10.0.0.18.520          *.*                    LISTEN

It looks similar on the peer.
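
The same check could be scripted so it can run from cron and flag the
moment the failover connection goes away.  A minimal sketch, assuming
BSD-style netstat output where the port is the last dot-separated part
of the address (as in the listing above); the function name
failover_established is hypothetical.

```shell
# Exit 0 if `netstat -an` output on stdin shows an ESTABLISHED TCP
# connection whose local or remote port equals $1, otherwise exit 1.
failover_established() {
    awk -v port="$1" '
        # BSD netstat prints addresses as a.b.c.d.port
        function last_part(addr,    parts, n) {
            n = split(addr, parts, ".")
            return parts[n]
        }
        $1 ~ /^tcp/ && $NF == "ESTABLISHED" &&
            (last_part($4) == port || last_part($5) == port) { found = 1 }
        END { exit !found }'
}

# Usage:
#   netstat -an | failover_established 520 || echo "no failover connection"
```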

Config info on the primary:

failover peer "dhcp-failover" {
   primary;
   address dhcp0;
   port 520;
   peer address dhcp1;
   peer port 520;
   max-response-delay 30;
   max-unacked-updates 10;
   load balance max seconds 3;
   mclt 600;
   split 128;
}

Config info on the secondary:

failover peer "dhcp-failover" {
   secondary;
   address dhcp1;
   port 520;
   peer address dhcp0;
   peer port 520;
   max-response-delay 30;
   max-unacked-updates 10;
   load balance max seconds 3;
}

Our default lease time is 3600 seconds and the max is 86400.

Restarting dhcpd on both servers fixes it, and all is fine for X days
until it happens again.  The errors in the logs don't give any more
information on what may be going on or why the peers were disconnected
(i.e., timeout vs. socket reset, etc.).

Any suggestions?  I know that running tcpdump on both servers might
help, but running a capture over a period of a few days to catch this
may not be possible due to the size of the file produced.
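
One option I'm considering is a capture limited to the failover TCP
port and written to a fixed-size ring of files, so disk use stays
bounded however long it runs.  A sketch only: it assumes the installed
tcpdump supports -C (rotate after N million bytes) together with -W
(cap the number of files and overwrite the oldest), and the interface
name, output path, and sizes are placeholders.  The helper name
failover_capture_cmd is made up; it only prints the command.

```shell
# Print a tcpdump command that captures only the failover TCP port into
# a ring of at most 20 files of roughly 10 MB each (about 200 MB total).
# $1: interface name (placeholder), $2: failover TCP port
failover_capture_cmd() {
    echo "tcpdump -n -i $1 -s 0 -C 10 -W 20 -w /var/tmp/failover.pcap tcp port $2"
}

# Usage (em0 is an assumed interface name); run the printed command
# as root on both peers:
#   failover_capture_cmd em0 520
```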

-- 
Robert Blayzor, BOFH
INOC, LLC
rblayzor at inoc.net
http://www.inoc.net/~rblayzor/