DHCP performance testing

Sat Jul 14 10:46:24 UTC 2007

Puter ami wrote:

>Perhaps I did not make myself clear. I am not asking if DHCP is often an
>issue. I have perused the archives. I am asking how evaluators set up a
>common set of definitions and criteria for the setting of performance
>specifications for a DHCP server.

I am not aware of any specific testing tools or standards. There are 
some tools for stress testing DHCP servers - I don't know what they 
are, so you'll have to go back through the archives, I'm afraid.

>For instance, with DNS, the tool and performance specifications commonly
>used in the industry is "queryperf" and "queries per second (qps)".
>
>My question is what tool and what performance specifications are commonly
>used with DHCP? What is a good number to expect in an Enterprise-class
>network?
>
>Perhaps if I can be so bold to say that a "lease per second (lps)" is the
>DORA process, what is a good "lps" value for a network with ~60K nodes in a
>worst case scenario?

The problem is that there really is no 'standard' setup - there are 
just too many variables to give any meaningful figures.

Variables :
number of clients, geographical spread of those clients, nature of 
clients (do they have built in RTC and persistent storage, can they 
use a non-expired lease after a power outage), what is the spread of 
startup times, is a widespread simultaneous power-up a reasonably 
expectable situation, will the devices continue to attempt 
configuration indefinitely or will they time out and give up.

Only after this lot can we come to some sort of requirement for 
leases per second ! And having done that, we then have further 
variables :
what proportion of leases will be renewals of unexpired leases, how 
many leases will require DNS updates, what is the latency of our DNS 
update process, what load can other services (such as TFTP boot 
servers) manage.

Let's try filling in a few possible answers :
suppose you have 60,000 clients, all identical with identical 
software revisions, you have a statewide power outage (uncommon) and 
a statewide power resumption (very unlikely), all your leases have 
expired, and every new lease needs a DNS update. Let's assume no 
spread of bootup times etc, so you suddenly get 60,000 DHCP Discovers 
in one second - 59,000+ of those will go in the bit bucket whatever 
server you run ! A few seconds later the devices will re-transmit 
another 60,000 packets and 59,000+ of them will go in the bit bucket.

Now let's assume that you still have the same outage, but the 
restoration is spread over ten minutes - power is restored in areas, 
devices have a spread of startup times, variations in line sync 
speeds, etc, etc. You now have 60,000/600 which is 100 requests/s 
(actually 200/s when you consider that the discover-offer and 
request-ack exchanges are two transactions), which you might be able 
to handle on high end hardware with suitable settings - the main 
constraints will be disk write performance and DNS update performance.
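
The arithmetic above can be sketched as a small Python helper. This is only an illustration of the calculation, not a sizing tool: the function name and the "two transactions per new lease" default are my own labels for the discover-offer and request-ack exchanges, and it assumes the load is spread perfectly evenly over the window.

```python
def required_transactions_per_second(clients, window_seconds,
                                     transactions_per_lease=2):
    """Average server transactions/s needed to lease every client in the window.

    Assumes an even spread of requests over the window (hypothetical best case).
    """
    return clients * transactions_per_lease / window_seconds


# 60,000 clients restored over ten minutes (600 seconds)
ten_minute_rate = required_transactions_per_second(60_000, 600)
print(ten_minute_rate)  # 200.0
```

Real traffic will be burstier than this average, so treat the result as a lower bound on what the server has to sustain.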

Let's say you have some control over the end devices and can configure 
them to a) spread their startup times, and b) have a reasonably long 
initial retry interval. Let's also say that you would consider an 
hour to be reasonable for restoring service after such a huge outage. 
You've now reduced the required transaction rate to 33/s 
(60,000/3600*2) which will be a lot cheaper to support !
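
To see why an average such as 33/s is not the whole story, here is a rough Python sketch of clients spreading their startup times. It assumes each client picks a uniformly random start second within the hour, which is a simplification of my own and not how any particular device behaves. It counts Discovers per second; each one leads to two transactions, so double the figures for a transaction rate.

```python
import random
from collections import Counter


def discovers_per_second(clients, window_seconds, seed=0):
    """Count DHCP Discovers arriving in each second of the window.

    Assumption: every client starts at a uniformly random second.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = Counter(rng.randrange(window_seconds) for _ in range(clients))
    return [counts.get(second, 0) for second in range(window_seconds)]


per_second = discovers_per_second(60_000, 3600)
average = sum(per_second) / len(per_second)
peak = max(per_second)
print(f"average {average:.1f} discovers/s, peak {peak} discovers/s")
```

The peak second will be noticeably above the average even with a perfectly random spread, so some headroom over the calculated figure is worth specifying.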

Now let's assume you use longer lease times AND the clients have 
persistence - they start up, do a DHCP-Discover, fail to get an 
answer, check that they still have the same gateway and continue 
using their existing lease pending renewal (this is what Macs and 
Windows PCs do). You now have potentially weeks to get the leases out.

Now it's your turn. You know what devices you are using, what 
characteristics they have, what control over them you have, how 
geographically spread they are, etc, etc. We know none of those. Only 
you can work out what number you are likely to need to support.

Of course there are still some factors not covered. If such an event 
did happen, could you (for example) disable the majority of your 
access points so that you only have to boot a limited number of 
clients at any time - turning on more access as the load drops ? This 
is, in effect, what I believe the electric utilities would do if they 
had a nationwide outage because the generating & transmission system 
just couldn't cope with the switch on surge of turning on a whole 
country at once.

If you can do that then you can negate the requirement to be able to 
handle the worst case - by re-defining the worst case to be merely an 
abnormally high load.
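
A phased re-enable of this kind can be roughed out in Python as below. Everything here is hypothetical: the function name, the figure of 50 clients per access point, the server rate of 40 transactions/s and the 60 second batch interval are made-up inputs to show the shape of the calculation, and it assumes every client behind an enabled access point gets its lease within the batch interval.

```python
import math


def phased_enable_plan(total_clients, clients_per_access_point,
                       server_transactions_per_second, batch_seconds,
                       transactions_per_lease=2):
    """Return (access points per batch, number of batches, total seconds)."""
    # New leases the server can complete in one batch interval
    leases_per_batch = int(server_transactions_per_second * batch_seconds
                           / transactions_per_lease)
    # Enable at least one access point per batch, even on a slow server
    aps_per_batch = max(1, leases_per_batch // clients_per_access_point)
    total_aps = math.ceil(total_clients / clients_per_access_point)
    batches = math.ceil(total_aps / aps_per_batch)
    return aps_per_batch, batches, batches * batch_seconds


plan = phased_enable_plan(60_000, 50, 40, 60)
print(plan)  # (24, 50, 3000)
```

With those made-up inputs, enabling 24 access points a minute gets everyone back in 50 minutes on a server that could never survive the all-at-once case.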

>>The limiting
>>factors tend to be disk write time (updating the lease file) and reading in
>>configs (for very large operations).  This can be somewhat mitigated by
>>using a flash drive rather than a traditional magnetic hard disk drive.
>
>I do understand general concepts and how to tune/upgrade a DHCP server. I
>am needing numbers (and definitions of those numbers) to give to prospective
>vendors so they may recommend the hardware and software, then apply those
>numbers in a laboratory environment and test the claim.

As I wrote above, you will have to sit down and work out what YOUR 
numbers are, and what compromises you can make to mitigate the 
requirement. I suspect that your initial calculations will come up 
with some very high numbers - in which case I would advise that you 
then consider mitigation strategies (such as phased resumption of 
connectivity to clients). Longer lease times will also help as that 
would minimise the number of issues/renewals requiring DNS updates.
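
As a rough illustration of the lease time effect, the Python sketch below estimates the steady-state renewal rate. It assumes every client is online and renews at half the lease time (the default T1 in RFC 2131), with renewals spread evenly. The function name and the lease lengths tried are just examples of mine.

```python
def steady_state_renewals_per_second(clients, lease_seconds,
                                     renew_fraction=0.5):
    """Average renewals/s if each client renews once per renew interval.

    Assumption: renewals are spread evenly and all clients stay online.
    """
    return clients / (lease_seconds * renew_fraction)


for days in (1, 7, 30):
    rate = steady_state_renewals_per_second(60_000, days * 86_400)
    print(f"{days:>2}-day lease: {rate:.2f} renewals/s")
```

Going from a one day lease to a seven day lease cuts the background renewal load by a factor of seven, which leaves more of the server's capacity for the abnormal events.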

You may want to set up some lab experiments to see what some of the 
changes might do to your requirements.

Finally, when you do come to lab testing proposed solutions, make 
sure you test for some of the hidden things. For example, in an 'out 
of the box' test between the MS DHCP server and the ISC DHCP server 
the MS one wins hands down - largely because it isn't RFC compliant 
and does NOT write the leases to disk as it should. If you stress 
test it and pull the power plug part way through, it could have many 
thousands of issued leases that aren't recorded on disk when it 
comes back up. Someone did that test a while ago and reported his 
findings to this list.

Sorry that's not the answer you were looking for, but I hope it shows 
you how to get to where you want to be.