<div dir="ltr">All,<div><br></div><div>I have two DHCP servers running version 4.0.1 as failover peers. The failover peer definitions are included at the bottom of this message; both servers appear to function correctly before my failure testing. My greatest wish is for somebody to tell me that this is fixed in a later release, but for now I will present the scenario.</div>

<div><br></div><div>If I am physically on the primary node, I can take the serviced interface(s) down (e.g. ip link set eth0 down) and cause a seemingly irreversible state change to "communications-interrupted". That is, short of manually restarting the dhcpd daemon, there appears to be no automatic recovery from "communications-interrupted". Is this by design?</div>

<div><br></div><div>My understanding (correct me if I'm wrong) is that, when the primary's network goes down, both nodes should have completed the transition to "communications-interrupted" once "max-response-delay" has elapsed. Once the primary's network is restored, within at most "mclt" + 1 seconds there should be no need to remain in "communications-interrupted"... that is, IF the daemon is aware that the network came back and is able to re-establish communication with its peer...</div>
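<div><br></div><div>To make the expectation above concrete, here is a minimal sketch of the timeline as I understand it. It is not taken from the dhcpd source; the names FailoverTimers and expected_timeline are made up for illustration, and the "mclt" + 1 bound is my assumption rather than documented behaviour. The values 18 and 30 come from the configuration at the bottom of this message.</div>

```python
from dataclasses import dataclass


@dataclass
class FailoverTimers:
    # Both values are in seconds and come from the "failover peer" block.
    max_response_delay: int
    mclt: int


def expected_timeline(link_down_at, link_up_at, timers):
    """Return (interrupted_by, recovered_by) in seconds.

    This encodes my own reading of the expected behaviour, not the
    actual dhcpd state machine.
    """
    # Both peers should notice the loss of contact by this time.
    interrupted_by = link_down_at + timers.max_response_delay
    # Assumed upper bound for leaving communications-interrupted.
    recovered_by = link_up_at + timers.mclt + 1
    return interrupted_by, recovered_by


timers = FailoverTimers(max_response_delay=18, mclt=30)
# Link taken down at t=0 and brought back at t=60.
print(expected_timeline(0, 60, timers))  # (18, 91)
```

<div>So with the link down at t=0 and back at t=60, I would expect both peers to be in "communications-interrupted" by t=18 and to have left it again by about t=91. What I actually observe is that the primary never leaves it.</div>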

<div><br></div><div>Is there a potential failure to reopen a socket? Perhaps a race between the connection code and the kernel making the interface truly available?</div><div><br></div><div>I will note that I see the following in /var/log/messages on the primary node:</div>

<div><div>    May 28 18:13:43 cld kernel: bnx2: eth0: using MSI</div><div>    May 28 18:13:43 cld kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready</div><div>    May 28 18:13:43 cld kernel: bnx2: eth1: using MSI</div><div>

    May 28 18:13:43 cld kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready</div><div>    May 28 18:13:43 cld dhcpd: receive_packet failed on eth0: Network is down</div><div>    May 28 18:13:43 cld kernel: bnx2: eth0: using MSI</div>

<div>    May 28 18:13:43 cld kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready</div><div>    May 28 18:13:43 cld dhcpd: receive_packet failed on eth1: Network is down</div></div><div><br></div><div>This seems to indicate that dhcpd is attempting to use an interface that is not yet ready. Either way, these are the last messages dhcpd ever logs on the primary unit until I restart the daemon.</div>
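<div><br></div><div>In case it helps anyone reproduce this, below is a small sketch for pulling the dhcpd receive failures out of a syslog file. It assumes the line format shown above (timestamp, host, "dhcpd:", message); the helper name receive_failures is made up for illustration, and the pattern may need adjusting for other syslog layouts.</div>

```python
import re

# Matches syslog lines such as:
#   May 28 18:13:43 cld dhcpd: receive_packet failed on eth0: Network is down
# Groups: 1 = timestamp, 2 = interface, 3 = reason.
RECV_FAILED = re.compile(
    r"^\s*(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+dhcpd(?:\[\d+\])?:\s+"
    r"receive_packet failed on (\S+): (.+)$"
)


def receive_failures(lines):
    """Return (timestamp, interface, reason) for each receive_packet failure."""
    failures = []
    for line in lines:
        match = RECV_FAILED.match(line)
        if match:
            failures.append((match.group(1), match.group(2), match.group(3).strip()))
    return failures


sample = [
    "May 28 18:13:43 cld kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready",
    "May 28 18:13:43 cld dhcpd: receive_packet failed on eth0: Network is down",
    "May 28 18:13:43 cld dhcpd: receive_packet failed on eth1: Network is down",
]
for ts, iface, reason in receive_failures(sample):
    print(ts, iface, reason)
```

<div>To run it against a real log, pass an open file instead of the sample list, e.g. receive_failures(open("/var/log/messages")).</div>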

<div><br></div><div>As always, any pointers or help are immensely appreciated.</div><div><br></div><div> - Jeremiah D. Jinno</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>

<br></div><div><div>-------------------------------------</div><div># Primary node configuration snippet</div></div><div><div>max-lease-time 60;</div><div>default-lease-time 60;</div></div><div><div>failover peer "Public" {</div>

<div>    primary;</div><div>    address 10.4.175.8;</div><div>    port 847;</div><div>    peer address 10.4.175.10;</div><div>    peer port 847;</div><div>    max-response-delay 18;</div><div>    mclt 30;</div><div>    split 255;</div>

<div>    load balance max seconds 3;</div><div>}</div></div><div><br></div><div>-------------------------------------</div><div># Secondary node configuration snippet</div><div><div>max-lease-time 60;</div><div>default-lease-time 60;</div>

</div><div><div>failover peer "Public" {</div><div>    secondary;</div><div>    address 10.4.175.10;</div><div>    port 847;</div><div>    peer address 10.4.175.8;</div><div>    peer port 847;</div><div>    max-response-delay 18;</div>

<div>    load balance max seconds 3;</div><div>}</div></div></div>