Failover synchronisation problem

Fri Dec 16 10:42:39 UTC 2011

Over the last few days I had to struggle with a failover synchronisation
problem. Although the problem is gone now, I'm still interested in
understanding the behaviour of the failover partners.

The two systems are running version 4.1.1 on Solaris 10 x86 in a
production environment serving around 180k active leases. The DHCP
service runs very reliably in general.

Due to other circumstances it was necessary to restart one failover
partner with an empty lease file. Normally this is no problem, since it
sends an "update request all" message to its partner, receives the whole
lease db and switches to normal mode. But this time the "update request
all" message was received and logged by the working system, which then
made no attempt to send any data, so the restarted system got stuck in
"recover" mode. I restarted it several times and watched the failover
dialog with snoop: the running member got the "update request all"
message but simply did not answer in any way.
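
To make the expected exchange concrete, here is a toy Python model of
what I understand the recovering side to be waiting for. It is only a
sketch of the behaviour described above, not the server's actual code:
the function name run_recovery is made up for illustration, the message
names BNDUPD and UPDDONE are my assumption of how the lease updates and
the final "update done" are labelled, and any intermediate states
between "recover" and "normal" are left out.

```python
from enum import Enum


class State(Enum):
    RECOVER = "recover"
    NORMAL = "normal"


def run_recovery(peer_messages):
    """Walk a recovering server through the messages its partner sends.

    peer_messages: iterable of strings, each either "BNDUPD" (one lease
    update) or "UPDDONE" (partner has finished sending its lease db).
    These names are illustrative assumptions.
    Returns (final_state, leases_received).
    """
    # The server has just sent "update request all" and waits in recover.
    state = State.RECOVER
    leases = 0
    for msg in peer_messages:
        if state is not State.RECOVER:
            break  # ignore anything that arrives after the update finished
        if msg == "BNDUPD":
            leases += 1  # one lease learned from the partner
        elif msg == "UPDDONE":
            # Simplification: intermediate states are skipped here.
            state = State.NORMAL
    return state, leases
```

With a partner that answers, run_recovery(["BNDUPD", "BNDUPD",
"UPDDONE"]) ends in State.NORMAL with two leases. With a partner that
stays silent, run_recovery([]) never leaves State.RECOVER, which is the
situation I was seeing.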

We finally resolved the situation with what I think is a rather crude
trick: copying the whole lease file from the running system to the
unsuccessfully recovering one and starting it with this lease db. After
spitting out a bunch of "conflict resolution" messages it switched to
"normal" mode and started working.

Can anybody think of a reason why the running system made no attempt
to react to the "update request all" message from its partner? I'm sure
that there were no network issues and the clocks were in sync on both
of them. Could this behaviour be triggered by the DHCP request load on
the running partner? It was hit by more than 150 req/sec. Is there any
functionality implemented to postpone failover requests in favour of
answering client requests? The usual system resources (CPU, RAM, disk
I/O) showed no bottleneck.

Thanks for any help.

nils-henner