Performance Problem Postmortem

Sat Aug 15 22:19:14 UTC 2009

Good day all,

Where I work we have 2 new sets of failover dhcp servers (4 computers),
both running on CentOS and dhcp version 3.1.1. They are newer dual core
machines, each sporting 2GB of RAM and gigabit Ethernet connections.

I had noticed that though both sets seemed to be answering requests and
handing out leases, they did not seem to be doing so gracefully. That is,
they would answer if one waited long enough (made enough requests).
Looking at the logs, each DISCOVER was OFFERed within a second, and each
REQUEST was followed by an ACK within a second. However, it seemed that
sometimes the clients would have to make several requests before the
servers would log a DISCOVER or REQUEST, and sometimes it appeared that
the client would not receive the OFFER or ACK.

We went round and round with what could be happening to the missing
packets. Finally we backtracked through all the changes that we had made
since we noticed the problem.

It seems to have come down to 1 or more of the following:
1) overuse of ip-helpers
2) adding more client machines
3) ddns being misconfigured

When we installed the new failover servers we slowly transitioned to
them from our old servers. We were a bit careless with our ip-helpers and
we ended up with a helper address for all 6 servers on all routers. So
essentially all servers were getting all requests.

We added a few more subnets resulting in about 3000 more clients.

And we started running ddns, which we seem not to have done correctly,
because our logs were full of (about 360K a day):

Aug  9 06:04:13 ns-dhcp-07 dhcpd: Unable to add forward map from 
[hostname] to [IP]: timed out
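
For anyone wanting to put a number on this, here is a minimal sketch of
how the timeout lines could be counted per day and turned into an average
rate. It assumes each entry sits on a single line in the syslog file in
the format quoted above; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the dhcpd syslog line quoted above. The hostname and IP fields
# are matched loosely, since only the message type matters here.
PATTERN = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+dhcpd: "
    r"Unable to add forward map from .* timed out"
)

def count_ddns_timeouts(lines):
    """Return a Counter mapping (month, day) to the number of timeout lines."""
    counts = Counter()
    for line in lines:
        m = PATTERN.match(line)
        if m:
            counts[(m.group("month"), int(m.group("day")))] += 1
    return counts

def per_second(count_per_day):
    """Average rate over a 24-hour day."""
    return count_per_day / 86400.0
```

Feeding it the open log file (for example `count_ddns_timeouts(open(path))`,
where `path` is wherever syslog writes dhcpd messages on your system)
gives the per-day totals. At about 360K a day, `per_second(360000)` works
out to a little over 4 timed-out updates per second on average.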

What we discovered when we started running a tcpdump on the server's
interface is that the packets were getting to the server, but the dhcpd
daemon did not log or respond to some of them. Likewise, there were cases
where the dhcpd daemon would log an OFFER or ACK but we'd not see it in
the tcpdump. There was also a noticeable lag of 10-15 secs between a
packet hitting the interface and the dhcpd daemon logging it.

CPU and load levels on the servers were very low to normal, and memory
usage was also pretty low. Cleaning up the ip-helpers and disabling ddns
seems to have halted the problem, and all servers are now acting as
expected.

My question now is: what exactly was causing the packets to go missing?

The going theory is that there is a buffer somewhere between the NIC and
the dhcpd daemon that stores the packets, and that it was filling up and
dropping packets. Is this reasonable? If not, what else could have been
occurring? If it is right, what can I monitor to warn me the next time
this happens?
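
To make that last question concrete, here is a rough sketch of the kind
of check I have in mind. It assumes the drops would show up in the
kernel's UDP counters in /proc/net/snmp on Linux (InErrors, RcvbufErrors).
That is only an assumption: if dhcpd receives through a raw packet socket
rather than an ordinary UDP socket, these counters may not reflect its
drops at all, and older kernels may not list every counter. The function
names are made up for illustration.

```python
def parse_udp_counters(snmp_text):
    """Parse the two 'Udp:' lines (header, then values) into a dict of ints."""
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))

def drop_delta(before, after, keys=("InErrors", "RcvbufErrors")):
    """Difference in the drop-related counters between two samples."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the current counters (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())
```

Sampling `read_udp_counters()` every minute or so and alerting when
`drop_delta` is non-zero would be one way to get a warning, if these
turn out to be the right counters to watch.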

Thanks!