Failover peer separation revisited
rblayzor.bulk at inoc.net
Mon Nov 17 10:47:24 UTC 2008
On Nov 15, 2008, at 3:45 PM, David Pick wrote:
> For what it's worth, I've been hunting a problem between pairs
> of (FreeBSD 6) machines on a backbone LAN, but nothing to do
> with DHCP traffic. So far, I've found that under some yet-to-be-
> defined circumstances one machine gets into a state where it
> issues an ARP request, receives a reply (according to "tcpdump"),
> but does not put the MAC address in that received packet into the
> ARP tables. At the same time (more-or-less) using the "arp" user-
> level program to try and delete an entry took 15-20 seconds to
> complete, but with normal very small processor time. I'm starting
> to suspect some sort of lock problem in the kernel, but can't pin
> it down yet. The problem eventually clears itself (for a while)...
> I'd be interested in hearing anything you find to either confirm
> or refute the possibility that it's the same problem.
I don't believe that's the problem in our situation. If you look at
the packet traces, each and every packet the servers send to each other
arrives, including the TCP ACKs. The timeout comes quickly, right at a
time when the servers are actually talking back and forth. If it were a
loss of ARP or any other network problem, I think you'd see one server
send packets and the other not receive them. Though YOUR problem on
FreeBSD sounds interesting; I've not seen that one yet.
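For anyone wanting to run the same check on their own captures, here is a
minimal sketch of the comparison described above: list the packets one peer
sent that never show up in the other peer's trace. The tuple layout, the
function name undelivered() and the sample data are all assumptions made for
illustration; parsing the tcpdump output into tuples is not shown.

```python
from collections import Counter

def undelivered(sent, received):
    """Return packets in the sender's trace that are missing from the receiver's.

    Each packet is a hypothetical (src, dst, seq, ack, flags) tuple; this
    layout is an assumption, not something taken from the real traces.
    """
    remaining = Counter(received)
    missing = []
    for pkt in sent:
        if remaining[pkt] > 0:
            remaining[pkt] -= 1   # matched one copy on the receiving side
        else:
            missing.append(pkt)   # sent, but never seen by the peer
    return missing

# Made-up sample data, not taken from the real traces.
trace_a_sent = [
    ("10.0.0.1", "10.0.0.2", 1000, 500, "PA"),
    ("10.0.0.1", "10.0.0.2", 1100, 600, "A"),
]
trace_b_received = list(trace_a_sent)

print(undelivered(trace_a_sent, trace_b_received))
```

An empty result in both directions is what "every packet makes it" looks like;
with an ARP or other network-level problem you would expect a non-empty list
on at least one side.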
So why they actually disconnect is still a mystery. In our traces the
network level looks fine; it looks like the application just decides
the other side has timed out and closes the connection, within a
second or two (so that'd be a pretty fast timeout!). It definitely
looks application level.
The second part of the question also remains... why do they never try
to reconnect? I think there is some internal bug that's breaking the
session and making failover go away altogether.
Robert Blayzor, BOFH
rblayzor at inoc.net