DHCP 4.0.1 fails to recover from communications-interrupted after network outage

Thu May 29 01:29:22 UTC 2014

All,

I have two 4.0.1 DHCP servers running as failover peers. I have included
the failover peer definitions at the bottom of this message, but for now,
understand that both servers appear to function normally prior to my
failure testing. My greatest wish is for somebody to tell me that this is
fixed in a later release, but for now I will present the scenario.

If I am physically on the primary node, I can set link-down on the serviced
interface(s) (e.g. "ip link set eth0 down") and cause a seemingly
irreversible state change to "communications-interrupted". That is to say
that, short of manually restarting the dhcpd daemon, there is seemingly no
automated recovery from the "communications-interrupted" state. Is this by
design?

My understanding (correct me if I'm wrong) is that when the primary's
network goes down, both nodes should complete the transition to
"communications-interrupted" once "max-response-delay" has elapsed. Once
the primary's network is restored, and after at most "mclt" + 1, there
should be no need to remain in "communications-interrupted"... that is, IF
the daemon is aware that the network came back and is able to
re-establish communication with its peer...
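
To put numbers on that expectation, here is a minimal sketch of the
arithmetic using the values from the configuration at the bottom of this
message. The function name and the 120-second outage are hypothetical, and
the "mclt" + 1 bound is only the understanding stated above, not confirmed
dhcpd behaviour.

```python
# Values copied from the failover peer declaration in this message.
MAX_RESPONSE_DELAY = 18  # seconds, from "max-response-delay 18;"
MCLT = 30                # seconds, from "mclt 30;"


def expected_timeline(link_down_at, link_up_at):
    """Return (interrupted_at, recovered_by) in seconds.

    Assumes the peers enter communications-interrupted once
    max-response-delay has elapsed after link-down, and that recovery
    takes at most mclt + 1 seconds after link-up (an unconfirmed
    assumption taken from the post).
    """
    interrupted_at = link_down_at + MAX_RESPONSE_DELAY
    recovered_by = link_up_at + MCLT + 1
    return interrupted_at, recovered_by


# Hypothetical test run: link goes down at t=0 and comes back at t=120.
interrupted_at, recovered_by = expected_timeline(link_down_at=0, link_up_at=120)
print(interrupted_at, recovered_by)
```

If that understanding holds, anything still reporting
"communications-interrupted" well past the second value would point at the
daemon never noticing that the link returned.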

Is there a potential failure to reopen a socket? Perhaps a race between the
connection code and the kernel making the interface truly available?

I will note that I see the following in /var/log/messages on the primary
node:
    May 28 18:13:43 cld kernel: bnx2: eth0: using MSI
    May 28 18:13:43 cld kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
    May 28 18:13:43 cld kernel: bnx2: eth1: using MSI
    May 28 18:13:43 cld kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
    May 28 18:13:43 cld dhcpd: receive_packet failed on eth0: Network is down
    May 28 18:13:43 cld kernel: bnx2: eth0: using MSI
    May 28 18:13:43 cld kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
    May 28 18:13:43 cld dhcpd: receive_packet failed on eth1: Network is down

This seems to indicate that dhcpd is attempting to use an interface that
is not yet ready. Either way, these are the last messages that dhcpd ever
logs on the primary unit until I restart the daemon.
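
For anyone trying to reproduce this, a minimal sketch of spotting the
condition from the log: it scans syslog lines for the dhcpd message quoted
above and reports which interfaces were affected. The function name is
hypothetical and the pattern is derived only from the two log lines shown
here, so other dhcpd versions may word the message differently.

```python
import re

# Matches the dhcpd message quoted in this post, for example:
#   May 28 18:13:43 cld dhcpd: receive_packet failed on eth0: Network is down
PATTERN = re.compile(r"dhcpd: receive_packet failed on (\S+): Network is down")


def interfaces_reported_down(lines):
    """Return interface names from matching lines, in order of first appearance."""
    seen = []
    for line in lines:
        match = PATTERN.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample = [
    "May 28 18:13:43 cld kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready",
    "May 28 18:13:43 cld dhcpd: receive_packet failed on eth0: Network is down",
    "May 28 18:13:43 cld dhcpd: receive_packet failed on eth1: Network is down",
]
print(interfaces_reported_down(sample))
```

A non-empty result could be used to trigger a manual check of the failover
state, since a daemon restart is the only recovery found so far.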

As always, any pointers or help are immensely appreciated.

 - Jeremiah D. Jinno

-------------------------------------
# Primary node configuration snippet
max-lease-time 60;
default-lease-time 60;
failover peer "Public" {
    primary;
    address 10.4.175.8;
    port 847;
    peer address 10.4.175.10;
    peer port 847;
    max-response-delay 18;
    mclt 30;
    split 255;
    load balance max seconds 3;
}

-------------------------------------
# Secondary node configuration snippet
max-lease-time 60;
default-lease-time 60;
failover peer "Public" {
    secondary;
    address 10.4.175.10;
    port 847;
    peer address 10.4.175.8;
    peer port 847;
    max-response-delay 18;
    load balance max seconds 3;
}