DHCP performance testing
dhcp1 at thehobsons.co.uk
Sat Jul 14 10:46:24 UTC 2007
Puter ami wrote:
>Perhaps I did not make myself clear. I am not asking if DHCP is often an
>issue. I have perused the archives. I am asking how evaluators set up a
>common set of definitions and criteria for the setting of performance
>specifications for a DHCP server.
I am not aware of any specific testing tools or standards. There are
some tools for stress testing DHCP servers - I don't know what they
are so you'll have to go back into the archives I'm afraid.
>For instance, with DNS, the tool and performance specifications commonly
>used in the industry are "queryperf" and "queries per second (qps)".
>My question is what tool and what performance specifications are commonly
>used with DHCP? What is a good number to expect in an Enterprise-class
>Perhaps if I can be so bold to say that a "lease per second (lps)" is the
>DORA process, what is a good "lps" value for a network with ~60K nodes in a
>worst case scenario?
The problem is that there really is no 'standard' setup - there are
just too many variables to give any meaningful figures. For a start:
number of clients, geographical spread of those clients, nature of
clients (do they have a built-in RTC and persistent storage, can they
use a non-expired lease after a power outage), what is the spread of
startup times, is a widespread simultaneous power-up a realistic
possibility, will the devices continue to attempt
configuration indefinitely or will they time out and give up.
Only after this lot can we come to some sort of requirement for
leases per second ! And having done that, we then have further questions:
what proportion of leases will be renewals of unexpired leases, how
many leases will require DNS updates, what is the latency of our DNS
update process, what load can other services (such as TFTP boot
Let's try filling in a few possible answers :
Suppose you have 60,000 clients, all identical with identical
software revisions, you have a statewide power outage (uncommon) and
a statewide power resumption (very unlikely), all your leases have
expired, and every new lease needs a DNS update. Let's assume no
spread of bootup times etc, so you suddenly get 60,000 DHCP Discovers
in one second - 59,000+ of those will go in the bit bucket whatever
server you run ! A few seconds later the devices will re-transmit
another 60,000 packets and 59,000+ of them will go in the bit bucket.
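To put rough numbers on that burst scenario, here is a minimal Python sketch. It assumes every unserved client retransmits at the same instant in each round and that the server completes a fixed number of leases per burst; the function name and the figure of 1,000 leases per burst are made up for illustration, not measurements of any real server. Real clients randomise and back off their retransmissions, so treat this as a pessimistic toy model only.

```python
def burst_rounds(clients, served_per_burst):
    """Count retry rounds and dropped packets for fully synchronised clients.

    served_per_burst is a hypothetical figure: how many clients the server
    manages to serve out of each simultaneous burst.
    """
    if served_per_burst <= 0:
        raise ValueError("served_per_burst must be positive")
    remaining = clients
    rounds = 0
    dropped = 0
    while remaining > 0:
        served = min(remaining, served_per_burst)
        dropped += remaining - served   # everything else goes in the bit bucket
        remaining -= served
        rounds += 1
    return rounds, dropped


if __name__ == "__main__":
    rounds, dropped = burst_rounds(60_000, 1_000)
    print(f"rounds needed: {rounds}, packets dropped: {dropped}")
```

Under those assumptions the first burst alone drops 59,000 packets, and the clients need 60 synchronised retry rounds before the last one is served.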
Now let's assume that you still have the same outage, but the
restoration is spread over ten minutes - power is restored in areas,
devices have a spread of startup times, variations in line sync
speeds, etc, etc. You now have 60,000/600 which is 100 requests/s
(actually 200/s when you consider the discover-offer and request-ack
are two transactions) which you might be able to handle on high end
hardware with suitable settings - the main constraints will be disk
write performance and DNS update performance.
Let's say you have some control over the end devices and can configure
them to a) spread their startup times, and b) have a reasonably long
retry starting interval. Let's also say that you would consider an
hour to be reasonable for restoring service after such a huge outage.
You've now reduced the required transaction rate to 33/s
(60,000/3600*2) which will be a lot cheaper to support !
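The arithmetic for the ten-minute and one-hour cases is easy to redo for your own figures. A small Python sketch, assuming (as above) two transactions per lease, discover-offer and request-ack, and an even spread of clients over the recovery window; the function name is just for illustration:

```python
def required_tps(clients, window_s, transactions_per_lease=2):
    """Average transactions per second needed to lease every client
    within window_s seconds, assuming an even spread of requests."""
    return clients * transactions_per_lease / window_s


if __name__ == "__main__":
    for label, window_s in (("10 minutes", 600), ("1 hour", 3600)):
        rate = required_tps(60_000, window_s)
        print(f"{label}: about {rate:.0f} transactions/s")
```

Plug in your own client count and the recovery time you can tolerate; the result is an average, so a real server needs some headroom for peaks.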
Now let's assume you use longer lease times AND the clients have
persistence - they start up, do a DHCP-Discover, fail to get an
answer, check that they still have the same gateway and continue
using their existing lease pending renewal (this is what Macs and
Windows PCs do). You now have potentially weeks to get the leases out.
Now it's your turn. You know what devices you are using, what
characteristics they have, what control over them you have, how
geographically spread they are, etc, etc. We know none of those. Only
you can work out what number you are likely to need to support.
Of course there are still some factors not covered. If such an event
did happen, could you (for example) disable the majority of your
access points so that you only have to boot a limited number of
clients at any time - turning on more access as the load drops ? This
is, in effect, what I believe the electric utilities would do if they
had a nationwide outage because the generating & transmission system
just couldn't cope with the switch on surge of turning on a whole
country at once.
If you can do that then you can negate the requirement to be able to
handle the worst case - by re-defining the worst case to be merely an
abnormally high load.
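A phased switch-on like that can be sketched in a few lines of Python. This assumes you can enable groups of access points one at a time, that you know roughly how many clients sit behind each group, and that the server sustains a known transaction rate; the group sizes and the 50 transactions/s below are invented for the example.

```python
def phased_schedule(group_sizes, capacity_tps, transactions_per_lease=2):
    """Return ([(start_second, clients), ...], total_seconds).

    Each group is enabled only once the previous group could have been
    fully served at capacity_tps (a hypothetical sustained server rate).
    """
    schedule = []
    t = 0.0
    for size in group_sizes:
        schedule.append((t, size))
        t += size * transactions_per_lease / capacity_tps
    return schedule, t


if __name__ == "__main__":
    # Six hypothetical groups of 10,000 clients, server good for 50 transactions/s
    schedule, total = phased_schedule([10_000] * 6, 50)
    for start, size in schedule:
        print(f"t={start:6.0f}s  enable group with {size} clients")
    print(f"all clients served after about {total / 60:.0f} minutes")
```

The total time is the same as for an evenly spread restart; what the phasing buys you is that the server never sees more than one group's worth of clients competing at once.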
>>The limiting
>>factors tend to be disk write time (updating the lease file) and reading in
>>configs (for very large operations). This can be somewhat mitigated by
>>using a flash drive rather than a traditional magnetic hard disk drive.
>I do understand general concepts and how to tune/upgrade a DHCP server. I
>am needing numbers (and definitions of those numbers) to give to prospective
>vendors so they may recommend the hardware and software, then apply those
>numbers in a laboratory environment and test the claim.
As I wrote above, you will have to sit down and work out what YOUR
numbers are, and what compromises you can make to mitigate the
requirement. I suspect that your initial calculations will come up
with some very high numbers - in which case I would advise that you
then consider mitigation strategies (such as phased resumption of
connectivity to clients). Longer lease times will also help as that
would minimise the number of issues/renewals requiring DNS updates.
You may want to set up some lab experiments to see what some of the
changes might do to your requirements.
Finally, when you do come to lab testing proposed solutions, make
sure you test for some of the hidden things. For example, in an 'out
of the box' test between the MS DHCP server and the ISC DHCP server
the MS one wins hands down - largely because it isn't RFC compliant
and does NOT write the leases to disk as it should. If you stress
test it and pull the power plug part way through, it could have many
thousands of issued leases that aren't recorded on disk when it
comes back up. Someone did that test a while ago and reported his
findings to this list.
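To get a feel for how much committing each lease to disk costs on a given machine, a rough Python sketch follows. It does not model either server: it simply appends a made-up lease-like record to a temporary file, once with an fsync after every record and once leaving the writes buffered, and reports records per second. The record contents and function name are placeholders.

```python
import os
import tempfile
import time

# Placeholder record; not the format of any real lease file.
RECORD = b"lease 192.0.2.1 { hypothetical record }\n"


def append_rate(n_records, sync_each):
    """Append n_records to a temp file; return (records_per_second, file_size).

    With sync_each=True every record is flushed and fsync'ed before the
    next one is written, so it should survive a power cut.
    """
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(n_records):
                f.write(RECORD)
                if sync_each:
                    f.flush()
                    os.fsync(f.fileno())
        elapsed = max(time.perf_counter() - start, 1e-9)
        size = os.path.getsize(path)
    finally:
        os.remove(path)
    return n_records / elapsed, size


if __name__ == "__main__":
    for sync_each in (True, False):
        rate, _ = append_rate(200, sync_each)
        label = "fsync per record" if sync_each else "buffered only"
        print(f"{label}: about {rate:,.0f} records/s")
```

On most storage the buffered figure will be far higher, which is why a server that skips the sync can look so good in a quick benchmark. The results depend heavily on the disk and filesystem, so run it on the hardware you are evaluating.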
Sorry that's not the answer you were looking for, but I hope it shows
you how to get to where you want to be.