DHCP failover problems - still

Sun Nov 22 20:02:16 UTC 2009

Hrm....crickets.  No ideas?

Digging more, I noticed a large spike of DHCP requests/acks at the
beginning of one of the comms failures.  We saw roughly 60
requests/acks per second for a brief period, plus maybe 4-6 offers
per second during that time.
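
For anyone wanting to check their own logs for a similar spike, here is
a rough sketch that counts dhcpd messages per second from syslog lines.
It assumes the syslog format shown in the quoted log excerpt further
down (timestamp, host, "dhcpd:", then a DHCP message type); the function
names are made up for illustration and are not part of dhcpd.

```python
import re
from collections import Counter

# Matches lines such as:
#   Nov 20 00:10:51 blah-101 dhcpd: DHCPACK on blah to ... via blah
# The optional "[pid]" after "dhcpd" is an assumption about other syslog setups.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+dhcpd(?:\[\d+\])?:\s+"
    r"(?P<kind>DHCP[A-Z]+)\b"
)


def per_second_counts(lines):
    """Return a Counter keyed by (timestamp, message_type)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m.group("ts"), m.group("kind"))] += 1
    return counts


def busiest_seconds(counts, top=5):
    """Sum all message types per timestamp and return the busiest seconds."""
    totals = Counter()
    for (ts, _kind), n in counts.items():
        totals[ts] += n
    return totals.most_common(top)
```

Feeding it the lines of the syslog file and printing busiest_seconds()
should show whether the request/ack rate really peaks right before a
failover timeout.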

From the capture, I see that the secondary sends CONTACT messages and
binding updates to the primary.  At the TCP level those frames are
ACK'd, which is good news because it means the network is healthy.
But during this time the primary dhcp daemon isn't sending the
failover-level acknowledgements for those CONTACT messages and binding
updates.

Is it possible/plausible that the dhcp daemon could be 'too busy' at
65 transactions/second to respond to the failover messages?  I mean,
maybe we are blocking on disk or CPU... But I've reviewed the
performance stats for these systems and they are not busy at all.
They're modern whitebox servers, 8G RAM, quad core whirligig with all
the animal options...

Thanks,

--
Matt

On Nov 19, 2009, at 19:38, Matt Causey <matt.causey at gmail.com> wrote:

> On Thu, Nov 19, 2009 at 6:41 PM, Matt Causey <matt.causey at gmail.com>  
> wrote:
>>> Unfortunately there is not a great deal of useful logging around
>>> these events.  A socket suddenly closing or resetting I think only
>>> results in the state changes, whereas a CONTACT message timeout
>>> causes a message with the word "timeout" in it to be logged before
>>> the state change is logged.
>>>
>> Ok, so we are seeing this:
>>
>> Nov 20 00:10:51 blah-101 dhcpd: DHCPACK on blah to 00:15:70:89:6d:00 via blah
>> Nov 20 00:11:32 blah-101 dhcpd: timeout waiting for failover peer failover
>> Nov 20 00:11:32 blah-101 dhcpd: peer failover: disconnected
>> Nov 20 00:11:33 blah-101 dhcpd: failover peer failover: I move from normal to communications-interrupted
>>
> In case anyone is interested, I've attached a small capture...like 108
> frames.  The ones before the first RST are normal traffic...seems that
> the socket drops and gets re-started every so often (you'll note the
> 3-way handshake just before the RST....not sure if that is normal or
> not.).
>
> After that first RST-ACK, things seem to recover.  And then we see the
> contacts/binding updates stop.  Then a string of attempts to re-start
> the socket (3-way handshake/RST-ACK combo).  And it just carries on
> that way until I bounce one of the dhcp daemons.
>
> We're running an identical configuration in 30+ pairs in different
> locations, and this is the only site with this problem - so I'm
> certain it's not a dhcpd design flaw as such.   But I'd like to learn
> more about what's happening, so we can either fix it in the software,
> or I can add some automation around it to detect and repair this
> condition.
>
> Thanks!
>
> --
> Matt
> <failover_xaction_failure.pcap>
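
On the "add some automation around it" idea from the quoted message, a
minimal sketch of the detection half might look like the following.  It
only relies on the "I move from ... to ..." log line shown above; the
function names are hypothetical, and what to do once the condition is
detected (alert, restart a daemon) is deliberately left out.

```python
import re

# Matches the state-change line from the log excerpt, e.g.
#   dhcpd: failover peer failover: I move from normal to communications-interrupted
STATE_RE = re.compile(
    r"failover peer (?P<peer>\S+): I move from (?P<old>\S+) to (?P<new>\S+)"
)


def failover_transitions(lines):
    """Return a list of (peer, old_state, new_state) in log order."""
    found = []
    for line in lines:
        m = STATE_RE.search(line)
        if m:
            found.append((m.group("peer"), m.group("old"), m.group("new")))
    return found


def needs_attention(lines):
    """True if the most recent logged transition left the 'normal' state."""
    transitions = failover_transitions(lines)
    return bool(transitions) and transitions[-1][2] != "normal"
```

Run against the tail of the syslog from cron or a monitoring check,
needs_attention() would flag a pair that has dropped to
communications-interrupted and not yet logged a return to normal.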