4.2.1-P1 and failover communications problem

Carlos Vicente cvicente.lists at gmail.com
Mon Apr 25 19:53:25 UTC 2011


Hi list,

We have been running a failover pair for years. After upgrading from
4.1.1-P1 to 4.2.1-P1, I noticed the following message in the logs:

    dhcpd: failover: link startup timeout

This happened every 20 seconds, on both servers. Inspecting the leases files
showed that both servers were in the "communications-interrupted" state.

We had always run both peers on port 520 without issues. I decided to switch
to the standard port, 647, just in case. That didn't fix it. I also tried
assigning different ports on the primary (647) and the secondary (847). This
had no effect either.
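Changing ports only helps if the problem is reachability, so it is worth confirming that each server can actually open a TCP connection to the other's failover port. Below is a minimal Python 3 sketch. `peer_port_reachable` is a hypothetical helper, not part of dhcpd. It only shows that the TCP handshake completes; it says nothing about whether the failover protocol messages that should follow are exchanged.

```python
import socket


def peer_port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake with host:port completes within timeout.

    Hypothetical helper: it tests plain TCP reachability only, not the
    failover protocol itself.
    """
    try:
        # create_connection resolves host and performs the three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Run it from the primary against the secondary's real address and port 847, and from the secondary against the primary's address and port 647 (the addresses in this post are masked).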

A packet capture shows the following TCP sequences:

  7   5.000940 x.x.60.22 -> x.x.32.35 TCP 47015 > dhcp-failover [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=1574604576 TSER=0 WS=7
  8   5.001111 x.x.32.35 -> x.x.60.22 TCP dhcp-failover > 47015 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=105374765 TSER=1574604576 WS=7
  9   5.001130 x.x.60.22 -> x.x.32.35 TCP 47015 > dhcp-failover [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=1574604577 TSER=105374765
 10   5.001674 x.x.32.35 -> x.x.60.22 TCP 38331 > dhcp-failover2 [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=105374765 TSER=0 WS=7
 11   5.001689 x.x.60.22 -> x.x.32.35 TCP dhcp-failover2 > 38331 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=1574604577 TSER=105374765 WS=7
 12   5.001842 x.x.32.35 -> x.x.60.22 TCP 38331 > dhcp-failover2 [ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=105374765 TSER=1574604577
 13  20.001583 x.x.60.22 -> x.x.32.35 TCP 47015 > dhcp-failover [FIN, ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=1574619577 TSER=105374765
 14  20.001817 x.x.32.35 -> x.x.60.22 TCP dhcp-failover > 47015 [FIN, ACK] Seq=1 Ack=2 Win=5888 Len=0 TSV=105389765 TSER=1574619577
 15  20.001833 x.x.60.22 -> x.x.32.35 TCP 47015 > dhcp-failover [ACK] Seq=2 Ack=2 Win=5888 Len=0 TSV=1574619577 TSER=105389765
 16  20.002561 x.x.60.22 -> x.x.32.35 TCP dhcp-failover2 > 38331 [FIN, ACK] Seq=1 Ack=1 Win=5888 Len=0 TSV=1574619578 TSER=105374765
 17  20.002751 x.x.32.35 -> x.x.60.22 TCP 38331 > dhcp-failover2 [FIN, ACK] Seq=1 Ack=2 Win=5888 Len=0 TSV=105389766 TSER=1574619578
 18  20.002760 x.x.60.22 -> x.x.32.35 TCP dhcp-failover2 > 38331 [ACK] Seq=2 Ack=2 Win=5888 Len=0 TSV=1574619578 TSER=105389766
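The capture can also be summarized per connection. The rough Python sketch below parses tshark's one-line frame summaries (frames 7-18 copied from the capture, with wrapped lines joined and the Win/MSS/TSV/TSER/WS fields trimmed) and reports how long each TCP connection stayed open and how many payload bytes it carried. `summarize` is a hypothetical helper, and its regular expression only understands this exact summary layout.

```python
import re

# Frames 7-18 from the capture, wrapped lines joined, window/timestamp options trimmed.
CAPTURE = """\
  7   5.000940 x.x.60.22 -> x.x.32.35 TCP 47015 > dhcp-failover [SYN] Seq=0 Len=0
  8   5.001111 x.x.32.35 -> x.x.60.22 TCP dhcp-failover > 47015 [SYN, ACK] Seq=0 Ack=1 Len=0
  9   5.001130 x.x.60.22 -> x.x.32.35 TCP 47015 > dhcp-failover [ACK] Seq=1 Ack=1 Len=0
 10   5.001674 x.x.32.35 -> x.x.60.22 TCP 38331 > dhcp-failover2 [SYN] Seq=0 Len=0
 11   5.001689 x.x.60.22 -> x.x.32.35 TCP dhcp-failover2 > 38331 [SYN, ACK] Seq=0 Ack=1 Len=0
 12   5.001842 x.x.32.35 -> x.x.60.22 TCP 38331 > dhcp-failover2 [ACK] Seq=1 Ack=1 Len=0
 13  20.001583 x.x.60.22 -> x.x.32.35 TCP 47015 > dhcp-failover [FIN, ACK] Seq=1 Ack=1 Len=0
 14  20.001817 x.x.32.35 -> x.x.60.22 TCP dhcp-failover > 47015 [FIN, ACK] Seq=1 Ack=2 Len=0
 15  20.001833 x.x.60.22 -> x.x.32.35 TCP 47015 > dhcp-failover [ACK] Seq=2 Ack=2 Len=0
 16  20.002561 x.x.60.22 -> x.x.32.35 TCP dhcp-failover2 > 38331 [FIN, ACK] Seq=1 Ack=1 Len=0
 17  20.002751 x.x.32.35 -> x.x.60.22 TCP 38331 > dhcp-failover2 [FIN, ACK] Seq=1 Ack=2 Len=0
 18  20.002760 x.x.60.22 -> x.x.32.35 TCP dhcp-failover2 > 38331 [ACK] Seq=2 Ack=2 Len=0
"""

# One tshark summary line: frame, time, src -> dst, TCP ports, [flags], ..., Len=N
FRAME_RE = re.compile(
    r"^\s*\d+\s+(?P<time>[\d.]+)\s+(?P<src>\S+) -> (?P<dst>\S+) TCP "
    r"(?P<sport>\S+) > (?P<dport>\S+) \[(?P<flags>[^\]]+)\].*?Len=(?P<len>\d+)"
)


def summarize(capture: str) -> dict:
    """Group frames by TCP connection; record first SYN, first FIN and payload bytes."""
    conns = {}
    for line in capture.splitlines():
        m = FRAME_RE.match(line)
        if m is None:
            continue
        # Key on both endpoints so both directions of a connection share one entry.
        key = frozenset({(m["src"], m["sport"]), (m["dst"], m["dport"])})
        conn = conns.setdefault(key, {"syn": None, "fin": None, "payload": 0})
        t = float(m["time"])
        if "SYN" in m["flags"] and conn["syn"] is None:
            conn["syn"] = t
        if "FIN" in m["flags"] and conn["fin"] is None:
            conn["fin"] = t
        conn["payload"] += int(m["len"])
    return conns


for key, conn in summarize(CAPTURE).items():
    ports = " <-> ".join(sorted(port for _, port in key))
    print(f"{ports}: open {conn['fin'] - conn['syn']:.1f} s, {conn['payload']} payload bytes")
```

For these frames both connections come out at 15.0 s open with 0 payload bytes: each handshake completes, nothing is sent, and the connection is closed 15 seconds later.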


I double-checked that both servers were running the same dhcpd version
(4.2.1-P1) and that the peer "stanzas" in the two configurations were
consistent with each other.

After running out of ideas, I decided to downgrade back to 4.1.1-P1. After
that, the problem was gone. Obviously, something changed between those two
versions that is affecting our setup. The release notes mention a few
improvements related to failover, but nothing that could explain what we're
seeing.

Here are the current peer configuration files for reference:

failover peer "dhcp-peer" {
  primary;                      # This is the primary server
  address x.x.32.35;
  port 647;
  peer address x.x.60.22;
  peer port 847;
  max-response-delay 60;
  max-unacked-updates 10;
  split 128;
  mclt 900;
  load balance max seconds 3;
}

failover peer "dhcp-peer" {
  secondary;                    # This is the secondary server
  address x.x.60.22;
  port 847;
  peer address x.x.32.35;
  peer port 647;
  max-response-delay 60;
  max-unacked-updates 10;
  load balance max seconds 3;
}
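Because the two stanzas are meant to mirror each other, that relationship can be checked mechanically. This is a rough Python sketch: `parse_peer_stanza` and `mirror_problems` are hypothetical helpers that only understand simple one-statement-per-line blocks like the ones above, not the full dhcpd.conf grammar. Which parameters are compared is an assumption of the sketch: the addresses and ports must cross-match, and the shared timing values must be equal.

```python
PRIMARY = """
failover peer "dhcp-peer" {
  primary;                      # This is the primary server
  address x.x.32.35;
  port 647;
  peer address x.x.60.22;
  peer port 847;
  max-response-delay 60;
  max-unacked-updates 10;
  split 128;
  mclt 900;
  load balance max seconds 3;
}
"""

SECONDARY = """
failover peer "dhcp-peer" {
  secondary;                    # This is the secondary server
  address x.x.60.22;
  port 847;
  peer address x.x.32.35;
  peer port 647;
  max-response-delay 60;
  max-unacked-updates 10;
  load balance max seconds 3;
}
"""

# Parameters compared between peers (an assumption for this sketch).
SHARED = ("max-response-delay", "max-unacked-updates", "load balance max seconds")


def parse_peer_stanza(text: str) -> dict:
    """Turn 'keyword ... value;' lines into {keyword: value}; a lone word is the role."""
    settings = {}
    for raw in text.splitlines():
        # Drop comments and the trailing semicolon.
        line = raw.split("#", 1)[0].strip().rstrip(";").strip()
        if not line or line.endswith("{") or line == "}":
            continue
        words = line.split()
        if len(words) == 1:
            settings["role"] = words[0]  # 'primary' or 'secondary'
        else:
            settings[" ".join(words[:-1])] = words[-1]
    return settings


def mirror_problems(a: dict, b: dict) -> list:
    """List the ways two parsed stanzas fail to mirror each other."""
    problems = []
    if {a.get("role"), b.get("role")} != {"primary", "secondary"}:
        problems.append("need exactly one primary and one secondary")
    for me, partner in ((a, b), (b, a)):
        # My 'peer address'/'peer port' must be the partner's 'address'/'port'.
        for mine, theirs in (("peer address", "address"), ("peer port", "port")):
            if me.get(mine) != partner.get(theirs):
                problems.append(f"{me.get('role')}: '{mine}' != partner's '{theirs}'")
    for key in SHARED:
        if a.get(key) != b.get(key):
            problems.append(f"'{key}' differs between peers")
    return problems


print(mirror_problems(parse_peer_stanza(PRIMARY), parse_peer_stanza(SECONDARY)))
```

For the two stanzas in this post it prints `[]`.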

I also captured a trace file with the -tf option. When it is replayed with
-play, the only failover-related lines are:

failover peer dhcp-peer: I move from communications-interrupted to startup
failover: listener: no matching state
failover peer dhcp-peer: I move from startup to communications-interrupted
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
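With a longer trace, tallying the failover lines makes the pattern easier to see. Below is a rough Python sketch; `summarize_failover_log` is hypothetical and only recognizes the two message shapes visible in the output above ("I move from X to Y" transitions and other "failover:" messages).

```python
import re
from collections import Counter

# The replayed trace output shown above.
TRACE = """\
failover peer dhcp-peer: I move from communications-interrupted to startup
failover: listener: no matching state
failover peer dhcp-peer: I move from startup to communications-interrupted
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
failover: link startup timeout
failover: listener: no matching state
"""

TRANSITION_RE = re.compile(r"failover peer (\S+): I move from (\S+) to (\S+)")


def summarize_failover_log(lines):
    """Return (list of (peer, old, new) transitions, Counter of other failover messages)."""
    transitions = []
    messages = Counter()
    for line in lines:
        m = TRANSITION_RE.search(line)
        if m:
            transitions.append((m[1], m[2], m[3]))
        elif "failover:" in line:
            messages[line.split("failover:", 1)[1].strip()] += 1
    return transitions, messages


transitions, messages = summarize_failover_log(TRACE.splitlines())
for peer, old, new in transitions:
    print(f"{peer}: {old} -> {new}")
for text, count in messages.most_common():
    print(f"{count:3d}  {text}")
```

Here it reports a single round trip from communications-interrupted to startup and back, followed by repeated "link startup timeout" and "listener: no matching state" messages.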

I can provide the trace file and packet captures to ISC if that helps.

Thank you in advance for any hints.

Regards,

Carlos Vicente
University of Oregon


More information about the dhcp-users mailing list