Performance Problem Postmortem
rch17 at duke.edu
Sat Aug 15 22:19:14 UTC 2009
Good day all,
Where I work we have 2 new sets of failover dhcp servers (4 computers),
all running CentOS and dhcp version 3.1.1. They are newer dual-core
machines, each with 2GB of RAM and a gigabit Ethernet connection.
I noticed that although both sets seemed to be answering requests and
handing out leases, they were not doing so gracefully. That is,
they would answer only if a client waited long enough (made enough requests).
Looking at the logs, each DISCOVER was OFFERed within a second, and each
REQUEST was followed by an ACK within a second. However, it seemed that
sometimes a client would have to make several requests before the
servers would log a DISCOVER or REQUEST, and sometimes it appeared that
the client would not receive the OFFER or ACK.
We went round and round over what could be happening to the missing
packets. Finally we backtracked through all the changes we had made since
we first noticed the problem.
It seems to have come down to one or more of the following:
1) Overuse of ip-helpers
2) Adding more client machines
3) ddns being misconfigured
When we installed the new failover servers, we slowly transitioned to
them from our old servers. We were a bit careless with our ip-helpers and
ended up with a helper address for all 6 servers on all routers. So
essentially all servers were getting all requests.
We added a few more subnets resulting in about 3000 more clients.
And we started running ddns, which we seem not to have configured correctly,
because our logs were full of messages like this (about 360K a day):
Aug 9 06:04:13 ns-dhcp-07 dhcpd: Unable to add forward map from
[hostname] to [IP]: timed out
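For anyone wanting to put a number on messages like the one above, here is a minimal Python sketch that counts the DDNS timeout lines per day and converts a daily total into an average rate. It assumes each message sits on a single syslog line in the layout of the one sample quoted here. The host name and address in the sample data are hypothetical placeholders, not values from our logs.

```python
import re
from collections import Counter

# Pattern for the dhcpd DDNS timeout message. The syslog layout
# (month, day, time, host, "dhcpd:") is assumed from the single quoted sample.
TIMEOUT_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+dhcpd: "
    r"Unable to add forward map from .* to .*: timed out$"
)

def count_ddns_timeouts(lines):
    """Return a Counter mapping (month, day) to the number of timeout lines."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m:
            counts[(m.group("month"), int(m.group("day")))] += 1
    return counts

def per_second(daily_count):
    """Average events per second for a given daily total."""
    return daily_count / 86400.0

# Hypothetical sample data (host names and addresses are placeholders).
SAMPLE_LOG = """
Aug  9 06:04:13 ns-dhcp-07 dhcpd: Unable to add forward map from host1.example.org to 192.0.2.10: timed out
Aug  9 06:04:14 ns-dhcp-07 dhcpd: DHCPDISCOVER from 00:11:22:33:44:55 via eth0
Aug  9 06:04:19 ns-dhcp-07 dhcpd: Unable to add forward map from host2.example.org to 192.0.2.11: timed out
"""

if __name__ == "__main__":
    print(count_ddns_timeouts(SAMPLE_LOG.splitlines()))
    print(round(per_second(360000), 2), "timeouts per second on average")
```

At roughly 360K messages a day, the average works out to a little over 4 timeouts per second.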
What we discovered when we started running tcpdump on the server's
interface is that the packets were getting to the server, but the dhcpd
daemon did not log or respond to some of them. Likewise, there were cases
where the dhcpd daemon would issue an OFFER or ACK but we would not see it
in the tcpdump. There was also a noticeable lag of 10-15 secs between
a packet hitting the interface and the dhcpd daemon logging it.
CPU and load levels on the servers were low to normal, and memory
usage was also pretty low. Cleaning up the ip-helpers and
disabling ddns seems to have halted the problem, and all servers are now
acting as expected.
My question now is: what exactly was causing the packets to go missing?
The going theory is that there is a buffer somewhere between the NIC and
the dhcpd daemon that stores the packets, and that it was filling up and
dropping packets. Is this reasonable? If not, what else could have
been occurring? If it is right, what can I monitor to warn me the next
time this happens?
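One candidate thing to watch, offered only as a sketch: on Linux the kernel's UDP counters in /proc/net/snmp include InErrors and, on kernels that provide it, RcvbufErrors, which increase when datagrams are dropped because a socket's receive buffer is full. The Python below parses two snapshots of that file and reports the increase. It assumes dhcpd receives through an ordinary UDP socket. If the daemon is built to read from a raw packet socket instead, these counters would not reflect its drops. The sample snapshots are hypothetical.

```python
def parse_udp_counters(snmp_text):
    """Parse the 'Udp:' header/value line pair of /proc/net/snmp into a dict."""
    udp_lines = [l.split() for l in snmp_text.splitlines() if l.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(names, (int(v) for v in values)))

def new_drops(before, after, keys=("InErrors", "RcvbufErrors")):
    """Increase in each drop-related counter between two snapshots.

    A counter missing from a snapshot (older kernels) is treated as 0.
    """
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

# Hypothetical snapshots in the /proc/net/snmp layout.
SAMPLE_BEFORE = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1000 2 5 900 5 0\n"
    "UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "UdpLite: 0 0 0 0 0 0\n"
)
SAMPLE_AFTER = (
    "Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "Udp: 1500 2 12 1300 12 0\n"
    "UdpLite: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors\n"
    "UdpLite: 0 0 0 0 0 0\n"
)

if __name__ == "__main__":
    before = parse_udp_counters(SAMPLE_BEFORE)
    after = parse_udp_counters(SAMPLE_AFTER)
    print(new_drops(before, after))
```

On a live server the two snapshots would come from reading /proc/net/snmp a few seconds apart. A steadily growing difference would be consistent with the full-buffer theory, though it would not by itself prove it.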