DHCP failover problems - still

sthaug at nethelp.no
Wed May 6 20:18:02 UTC 2009

> We've been having a problem for quite some time with failover and  
> DHCPD server.  For weeks at a time the servers will run absolutely  
> great... then suddenly they just "lose connection" to each other and  
> NEVER try to reconnect.
> The servers are sitting right next to each other on the same Cisco
> Gig-E switch, both servers are identical software run diskless via NFS...
> no other network service problems, no errors, nothing.

Serving DHCP is (disk) I/O intensive, since the server has to record each
lease in its lease file before answering, and I would be rather sceptical
about running this over NFS.
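
One rough way to check whether the lease storage is slow enough to matter
is to time synchronous appends on the same filesystem. The sketch below is
only an illustration: the function name, sample count and payload are made
up here, and it assumes that append-plus-fsync on a scratch file is a fair
stand-in for what the server does to its lease file.

```python
import os
import sys
import tempfile
import time


def fsync_latencies(directory, samples=20, payload=b"lease\n"):
    """Return the seconds each append+fsync took on a scratch file in `directory`."""
    latencies = []
    # Scratch file only; the real lease file is never touched.
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(samples):
            start = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # force the write out to the (NFS) server
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


if __name__ == "__main__":
    # Hypothetical usage: pass the directory that holds the lease file.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    results = fsync_latencies(target)
    print("max %.1f ms, mean %.1f ms"
          % (max(results) * 1000, sum(results) / len(results) * 1000))
```

If the worst case on the NFS mount comes anywhere near the failover
timeouts you have configured, that would at least be consistent with the
I/O explanation. It would not prove it.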

> Suddenly, one day all of our leases are consumed and the servers stop  
> handing out new leases.
> After more research we found that the failover connection between the  
> two servers has been "interrupted".  Even though the logs claim that  
> the connection was interrupted, both servers are running perfectly  
> independent of each other on the same LAN.

We have seen this too.
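
When the logs say the connection is interrupted but the network looks
fine, it can help to confirm that each server can still open a TCP
connection to the failover port on the other one. A minimal sketch,
assuming you fill in the peer address and port from your own failover
declaration (the function name and the values in the usage comment are
placeholders, not taken from your setup):

```python
import socket


def peer_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False


# Hypothetical usage, run on one server against the other:
#   peer_port_reachable("192.0.2.2", 647)
```

A True result only shows that the path and the listener are there. It says
nothing about the failover state, and the probe itself may show up in the
peer's log as a connection attempt.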

> So question #1 is I'm not sure why connections are interrupted in the  
> first place...  The LAN never lost carrier, the servers sit on a  
> private low traffic network.  According to the syslog....

My own theory is that the server gets so busy with disk I/O (or in your
case NFS I/O) that it doesn't process the failover keepalive messages in
time, and the peer then treats the connection as dead. See for instance


> The second question is, why don't they attempt to "reconnect"?

This is a bug that I and several others have seen. I can reproduce it,
and I have tried to give ISC enough info to reproduce it (offering the
use of my lab if necessary). But so far no luck. See


Steinar Haug, Nethelp consulting, sthaug at nethelp.no

More information about the dhcp-users mailing list