Watching performance on a DHCP Server

Wed Feb 13 06:13:10 UTC 2008

On Fri, 2008-02-08 at 17:44 -0600, Blake Hudson wrote:
> -------- Original Message  --------
> Subject: Re: Watching performance on a DHCP Server
> From: David W. Hankins <David_Hankins at isc.org>
> To: dhcp-users at isc.org
> Date: Friday, February 08, 2008 4:53:15 PM
> > On Fri, Feb 08, 2008 at 03:27:23PM -0600, Blake Hudson wrote:
> >   
> > > Seems like it would make sense that this I/O could be non-blocking.... 
> > >     
> > 
> > That's just it, we feel it can't be, or else you run the risk of
> > duplicate allocations on a restart.  The simple way of saying it is
> > the database isn't consistent.

There is another solution:

For each address pool, create an in-memory list of e.g., 20 free
addresses.  Create a "pre-commit" log entry with these addresses. The
next 20 requests get served from this smaller pool, without commit to
disk - except if there is a timeout. When all 20 have been given out (or
on timeout) commit the leases to disk, and pre-commit a new list of 20
free addresses.  The timeout may well be the one in the o.s.
buffercache, i.e., no coding in the server (on suitable platforms).

When the server terminates normally, add a "close" record. Upon reboot,
rebuild memory and the state of the pool. If there is no close record,
assume there was a crash. It is not known if the addresses in the latest
pre-commit entry have been given out, so the server creates dummy leases
with these addresses, as a way of marking them not available.

This does not work well with many small pools. If pools are too small,
the pre-commit lists must be smaller. Is that a problem? 

A variation of this method without changes to the log file format: 

Split the pool in blocks of 32 addresses, e.g. using a "netmask" of 27
bits. When the server commits to disk a lease with an address within a
particular such block, then all subsequent lease assignments should go
from that block until it is exhausted. When that happens, the next lease
will be given out from another block, and that lease, and all earlier
uncommitted leases, are committed to disk before the lease is given to
the client. Further leases from the same block are given to the client
without commitment to disk. 

With either method, after a crash there will be a small number of leases
out there that you do not know about, but they are all in the same
block, and you are not giving out leases with conflicting addresses. If
clients show up with renewal requests, using addresses corresponding to
dummy leases, refuse the renewal offering a different address, or give
them the address they ask for deleting the dummy lease. When dummy
leases expire, reclaim the addresses.

Idea sketch:

  #define BLOCK_SHIFT 5
  #define BLOCK_TOP ((1 << BLOCK_SHIFT) - 1)
  #define BLOCK_MASK (~BLOCK_TOP)

  /* Search the current block for a free address */
  for (n=0; n<=BLOCK_TOP; n++)
    if (address_is_free(pool, pool->current_block | n)) {
      lease.address = pool->current_block | n;
      take_address(pool, address);
      write_log_no_fsync(lease);
      break;
    }

  if (n > BLOCK_TOP) { /* No free address in current block */
    lease.address = next_free_address(pool);
    take_address(pool, address);
    pool->current_block = lease->address & BLOCK_MASK;
    write_log_with_fsync(lease);
  }

  offer_to_client(lease);

Upon reboot, if there is a stale lock file (or some other way of
detecting a previous crash), replay the log file to rebuild the state of
the pool, but retain the address of the last lease given out from each
pool, and

  pool->current_block = last_lease_address_from_log(pool) & BLOCK_MASK;
  dummy_lease_time = max_lease_time(pool) - time_since_crash + 
     max_sync_delay;
  if (dummy_lease_time > 0) {
    for (n=0; n<=BLOCK_TOP; n++) {
      if (address_is_free(pool, pool->current_block & n)) {
        lease = create_lease(pool, dummy_client, dummy_lease_time);
        lease->address = pool->current_block | n;
        take_address(pool, lease->address);
        mark_as_dummy(lease); /* eg. by hw_address = 0 */
        write_log_no_fsync(lease);
      }
    }
    fsync_log();
  }

Occasionally, the next address given out is in a block where there is no
other free addresses. With many such addresses, performance will degrade
toward today's level. This should be avoidable unless the pool is nearly
exhausted. In some not-quite-exhausted situations, it may be desirable
to maintain a list of blocks sorted by decreasing number of free
addresses. Or, keep a list of blocks with N free, another with blocks
with N-1 free, etc, down to blocks with 1 free. When an address is
freed, the block is moved to the appropriate list. The current block is
kept unlinked, as are the ones without free addresses.

Without linked lists of blocks:

A bad strategy: addresses freed are placed at the top of a list, and the
latest freed is reused first.

A better strategy: Prefer always the next free address larger (modulo
pool size) than the latest address given out.  In other words, walk
cyclically down the pool, picking free addresses as you go. If the
address you find happens to be in a block with a free/non-free ratio
much worse than the pool totals, don't use it. Keep searching down the
pool.

The pre-commit variation does not have this problem.

Also, if the present-day code uses fsync(), consider using fdatasync()
instead, for architectures that support it. That avoids updating the
inode mtime on each fsync(). Preextend the log file to the nearest n'th
multiple of kilobytes, to avoid updating the inode with file size on
each append.

Regards