4.1.0 does not receive requests while balancing pool

Mon Mar 30 18:50:02 UTC 2009

On Mon, Mar 30, 2009 at 05:16:11PM +0200, Luk Claes wrote:
> We upgraded to the latest dhcp version (4.1.0) as failover was not
> working correctly as far as we could see in 3.x versions. Now failover
> is working correctly, but the dhcp server seems to lock all pools when
> balancing pools with its peer (by default every minute). As this
> balancing takes about 10 seconds in our setup and we get about 20
> requests per second, this is not tolerable, so I changed
> it to only balance the pools every 10 to 20 minutes, which makes it
> almost bearable. The fundamental problem is not solved, though.

The new failover code estimates a time in the future when it believes
a pool will become misbalanced (the nearest such point across all
pools is selected).  If it's coming up with "one minute", then you have
a woefully understocked pool, as 60 seconds is the default
minimum time between events.  I'd guess you have some pool with no
free leases, or with only leases that expired some seconds prior.
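
As a rough illustration of where that 60-second floor can be adjusted:
a minimal sketch, assuming the min-balance and max-balance statements
described in dhcpd.conf(5) for 4.x are what bound the scheduling
window.  The peer name, addresses and timer values here are
hypothetical and are not taken from the poster's configuration.

```
# Hypothetical failover declaration; names and addresses are placeholders.
failover peer "dhcp-failover" {
  primary;
  address 192.0.2.1;
  port 647;
  peer address 192.0.2.2;
  peer port 647;
  max-response-delay 60;
  max-unacked-updates 10;
  mclt 3600;
  split 128;
  load balance max seconds 3;
  min-balance 600;     # soonest a rebalance may be scheduled (default 60)
  max-balance 1200;    # latest a rebalance may be scheduled (default 3600)
}
```

Raising min-balance only spaces the runs further apart; it does not
make an individual run any shorter.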

10 seconds for a single balance run is rather long.  I'd presume you'd
already checked, but it bears mentioning that the pool scan run logs
to syslog twice - if syslog is fsync()'ing every write, that will slow
it down considerably.

> Is this a known problem? If it's indeed a global lock on the pools,
> could it be changed to local locks per pool or having the balancing
> only happen per pool instead of all at once?

There are no locks; the server is single-threaded, so it cannot answer
requests while a rebalance run is in progress.  I've contemplated
splitting the rebalance into one event per pool, each scheduled
independently of the others.  To be complete, the POOLREQ message
would also need to indicate a single pool to rebalance, and there is
no mechanism for that yet.

For workarounds, you can reduce the number of pools (consolidation
helps because the rebalance events just calculate sums stored
on the pool structures), or you can subdivide the pools amongst a
number of individual failover states (one rebalance event is
scheduled for each failover state; rebalance runs still must
traverse the entire list of pools, but they only process pools that
are covered by the currently selected failover state).
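
The second workaround might look roughly like the following: a sketch
only, with hypothetical peer names, subnets and ranges.  Giving each
failover state its own port pair is a cautious assumption made here,
not something required by the text above.

```
# Two failover states between the same pair of servers (primary side shown).
failover peer "fo-a" {
  primary;
  address 192.0.2.1;
  port 647;
  peer address 192.0.2.2;
  peer port 647;
  max-response-delay 60;
  max-unacked-updates 10;
  mclt 3600;
  split 128;
  load balance max seconds 3;
}

failover peer "fo-b" {
  primary;
  address 192.0.2.1;
  port 648;            # assumption: separate port pair per failover state
  peer address 192.0.2.2;
  peer port 648;
  max-response-delay 60;
  max-unacked-updates 10;
  mclt 3600;
  split 128;
  load balance max seconds 3;
}

# Each pool names the failover state that covers it.
subnet 10.1.0.0 netmask 255.255.255.0 {
  pool {
    failover peer "fo-a";
    deny dynamic bootp clients;
    range 10.1.0.10 10.1.0.200;
  }
}

subnet 10.2.0.0 netmask 255.255.255.0 {
  pool {
    failover peer "fo-b";
    deny dynamic bootp clients;
    range 10.2.0.10 10.2.0.200;
  }
}
```

With this layout each rebalance event only processes the pools of its
own failover state, so the work done per event is smaller, although
every run still walks the full pool list.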

-- 
David W. Hankins	"If you don't do it right the first time,
Software Engineer		     you'll just have to do it again."
Internet Systems Consortium, Inc.		-- Jack T. Hankins