Failover Partners mysteriously disconnect and never reconnect
Robert Blayzor
rblayzor.bulk at inoc.net
Wed Oct 22 15:08:29 UTC 2008
I've been tracking a problem with our DHCP failover setup for some
time now, and we've narrowed it down to the fact that the
partners get into a "communications-interrupted" state and never try to
connect to each other again.
Both servers are stable and run FreeBSD 6.3-RELEASE.
They are both connected via Gig-E to a Cisco 6509.
There are no errors on the ports, no errors on the NICs, and all other
applications on the servers run fine. The ports are 1000M/full and I
can transfer data between them with no problem. There are about 200
devices connected to the same switch, all with no connectivity
issues. We've tried moving ports and NICs, but we still see this
issue pop up at random.
Running ISC-DHCPD 3.0.7 from FreeBSD ports.
dhcp0 - primary
dhcp1 - secondary peer
Both servers sync time via broadcast NTP.
What we see in the logs on both servers:
Oct 22 01:04:50 dhcp1 dhcpd: timeout waiting for failover peer dhcp-failover
Oct 22 01:04:50 dhcp1 dhcpd: peer dhcp-failover: disconnected
Oct 22 01:04:50 dhcp1 dhcpd: failover peer dhcp-failover: I move from normal to communications-interrupted
Oct 22 01:05:06 dhcp0 dhcpd: timeout waiting for failover peer dhcp-failover
Oct 22 01:05:06 dhcp0 dhcpd: peer dhcp-failover: disconnected
Oct 22 01:05:06 dhcp0 dhcpd: failover peer dhcp-failover: I move from normal to communications-interrupted
When this happens the peers seem to just go about their business.
First off, why the connection gets interrupted is a mystery, but even
more interesting is that they never try to connect again (we never
see a move back to normal). We can log on to the servers shortly after
this happens and everything seems fine. No link
flaps, no errors from dmesg, etc. We can ping between the servers
with no loss...
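As a rough sketch of how a stuck peer could be spotted from syslog, the
helper below prints the last state named in an "I move from ... to ..."
line. The function name last_failover_state is made up for illustration,
and it assumes each log message sits on a single line, as it does in the
log file itself.

```shell
#!/bin/sh
# Hypothetical helper: read dhcpd syslog lines on stdin and print the state
# named in the most recent "I move from A to B" message (empty if none).
last_failover_state() {
    sed -n 's/.*failover peer .*: I move from .* to \([a-z-]*\).*/\1/p' | tail -n 1
}
```

Feeding it the dhcpd lines from the system log (the log path depends on
the local syslog configuration) and comparing the result against "normal"
would show whether a peer is still sitting in communications-interrupted.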
Looking at sockstat and netstat, it appears they are listening on the
TCP socket, but there is no ESTABLISHED connection between them:
[dhcp0:~] sockstat | grep dhcp
dhcpd dhcpd 19728 3 dgram -> /var/run/logpriv
dhcpd dhcpd 19728 6 udp4 *:67 *:*
dhcpd dhcpd 19728 10 tcp4 10.0.0.18:520 *:*
root syslogd 691 8 dgram /var/db/dhcpd/var/run/log
[auth0:~] netstat -an | grep dhcpd
ffffff00c29d0c80 dgram 0 0 ffffff00c17dc7c0 0 0 0 /var/db/dhcpd/var/run/log
[dhcp0:~] netstat -an | grep 520
tcp4 0 0 10.0.0.18.520 *.* LISTEN
The output looks similar on the peer.
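The manual check above could be scripted along these lines. This is only
a sketch: has_failover_session is a made-up name, and it assumes the
BSD-style `netstat -an` layout shown above, where the port is the last
dot-separated field of the address and the state is the last column.

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -an` output on stdin and succeed only if
# some TCP socket using the given port is in the ESTABLISHED state.
has_failover_session() {
    port="$1"
    awk -v p="$port" '
        $1 ~ /^tcp/ && $NF == "ESTABLISHED" {
            # Local address is field 4, foreign address is field 5.
            n = split($4, l, ".")
            m = split($5, r, ".")
            if (l[n] == p || r[m] == p) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

Run from cron as something like
`netstat -an | has_failover_session 520 || logger "no failover session"`,
it would at least timestamp when the TCP session between the peers goes
missing.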
Config info on the primary:
failover peer "dhcp-failover" {
    primary;
    address dhcp0;
    port 520;
    peer address dhcp1;
    peer port 520;
    max-response-delay 30;
    max-unacked-updates 10;
    load balance max seconds 3;
    mclt 600;
    split 128;
}
Config info on the secondary:
failover peer "dhcp-failover" {
    secondary;
    address dhcp1;
    port 520;
    peer address dhcp0;
    peer port 520;
    max-response-delay 30;
    max-unacked-updates 10;
    load balance max seconds 3;
}
Our default lease times are 3600 and max is 86400.
Restarting dhcpd on both servers fixes it, and all is fine for X days until
this happens again. The errors in the logs don't give any more
information on what may be going on or why things were disconnected
(e.g. timeout vs. socket reset, etc.).
Any suggestions? I know that running tcpdump on both servers might
help, but running a capture over a period of a few days to
catch this may not be possible due to the size of the file produced.
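One possible way to bound the capture size, sketched here under some
assumptions: filter on the failover port only and let tcpdump rotate
through a fixed number of files. The interface name em0 and the output
path are placeholders, and this assumes the installed tcpdump supports
the -C and -W options.

```shell
#!/bin/sh
# Placeholder values; adjust the interface and output path for the system.
IFACE="em0"
OUT="/var/tmp/dhcp-failover.pcap"

# Hypothetical helper: print tcpdump arguments for a bounded ring-buffer capture.
#   -n       do not resolve names
#   -s 0     capture full packets
#   -C 50    start a new file after roughly 50 million bytes
#   -W 10    keep at most 10 files, overwriting the oldest
# The filter limits the capture to the failover TCP port (520 in this config).
failover_capture_args() {
    echo "-n -i $IFACE -s 0 -C 50 -W 10 -w $OUT tcp port 520"
}

# Run as root on each peer, for example:
#   tcpdump $(failover_capture_args)
```

With these numbers the capture would top out at about 500 MB per server
no matter how long it runs, and only failover traffic is recorded.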
--
Robert Blayzor, BOFH
INOC, LLC
rblayzor at inoc.net
http://www.inoc.net/~rblayzor/