FW: Performance issue ( maybe ) corrected

Fri Sep 3 15:23:13 UTC 2010

Correction to the sequence.
Sorry for the mix-up.

Best regards,
Bjarne

From: Bjarne Blichfeldt
Sent: 3 September 2010 17:18
To: 'Users of ISC DHCP'
Subject: Performance issue ( maybe )

We have developed what seems to be a periodic performance problem in our setup.
I am trying to figure out what we have changed lately; it might be load related.
During client startup, the servers take a very long time (~20-30 seconds) to answer a DISCOVER.
The server then answers with an OFFER, but since the server took so long to answer, the client has timed out and sent a new DISCOVER.
When the first OFFER reaches the client, the DHCP transaction ID (xid) does not match the client's latest DISCOVER, and the client drops that OFFER.

Example :
Client                              Server
 DISCOVER (transaction ID 1) -->
                                    : time passes
 timeout
 DISCOVER (transaction ID 2) -->
                                <-- OFFER (transaction ID 1)
 ignored by client

 DISCOVER -->
 and so on.
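
The sequence above can be reproduced with a small toy model. This is only a sketch under stated assumptions: the client uses a fixed retry interval and a new transaction ID for every retry (as in the trace), and the 4-second timeout is a hypothetical value, not something measured here. The names `ToyClient` and `simulate` are made up for illustration.

```python
class ToyClient:
    """Toy DHCP client that only remembers the xid of its latest DISCOVER."""

    def __init__(self):
        self.current_xid = 0

    def send_discover(self):
        # Assumption: every retry uses a fresh transaction ID, as in the trace.
        self.current_xid += 1
        return self.current_xid

    def accepts(self, offer_xid):
        # An OFFER is only accepted if it answers the latest DISCOVER.
        return offer_xid == self.current_xid


def simulate(server_delay, client_timeout, attempts):
    """Return a list of (offer_xid, accepted) for OFFERs seen by the client.

    The client sends a DISCOVER every `client_timeout` seconds (fixed interval,
    a simplification) and the server answers each one `server_delay` seconds
    after it was sent. Arrived OFFERs are checked just before each retry.
    """
    client = ToyClient()
    pending = []  # (arrival_time, xid) of OFFERs still in flight
    log = []
    for k in range(attempts):
        now = k * client_timeout
        arrived = [p for p in pending if p[0] <= now]
        pending = [p for p in pending if p[0] > now]
        for _, xid in arrived:
            ok = client.accepts(xid)
            log.append((xid, ok))
            if ok:
                return log  # client got a usable OFFER and stops retrying
        xid = client.send_discover()
        pending.append((now + server_delay, xid))
    return log


if __name__ == "__main__":
    print(simulate(server_delay=25, client_timeout=4, attempts=8))
    print(simulate(server_delay=1, client_timeout=4, attempts=8))
```

With a server delay longer than the retry interval, every OFFER that arrives answers an older DISCOVER and is dropped; with a delay shorter than the interval, the first OFFER is accepted.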

This was found by capturing traffic on the switch port the server is connected to, that is, directly at the server port.

Looking at the servers with top when everything works:
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.3%sy,  0.0%ni, 95.3%id,  4.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2075208k total,   714656k used,  1360552k free,   189160k buffers
Swap:  4128760k total,      100k used,  4128660k free,   318604k cached

During the problem I see something like this (numbers from memory):
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
--> Cpu2  :  0.0%us,  0.3%sy,  0.0%ni, 1.3%id,  98%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2075208k total,   714656k used,  1360552k free,   189160k buffers
Swap:  4128760k total,      100k used,  4128660k free,   318604k cached

The only process really working is kjournald.

I am a little in the dark as to what to do next. The first step, of course, is to offload everything but dhcpd from the servers to
reduce any load I haven't noticed. There were a few Java programs and a MySQL server on dhcp2; those are now shut down. There is nothing extra on dhcp1.

The next thing I have done is to increase the default lease time from 8 hours to 5 days.
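
As a rough back-of-the-envelope check of what the longer lease time buys, the sketch below estimates the steady-state renewal rate. It assumes clients renew at half the lease time (the default T1 in RFC 2131), that renewals are spread evenly over time, and it uses the roughly 7500 active leases listed in the installation details. The function name is made up for illustration, and the estimate ignores startup bursts, which are what actually trigger the problem.

```python
def renewals_per_second(active_leases, lease_seconds, t1_fraction=0.5):
    """Estimate steady-state renewals per second.

    Assumes each client renews once every lease_seconds * t1_fraction seconds
    and that renewals are evenly spread (a simplification).
    """
    return active_leases / (lease_seconds * t1_fraction)


old_rate = renewals_per_second(7500, 8 * 3600)       # 8-hour leases
new_rate = renewals_per_second(7500, 5 * 24 * 3600)  # 5-day leases (432000 s)

if __name__ == "__main__":
    print(f"8 h leases:  {old_rate:.3f} renewals/s")
    print(f"5 d leases:  {new_rate:.3f} renewals/s")
    print(f"reduction:   {old_rate / new_rate:.0f}x")
```

Under these assumptions the background renewal load drops by a factor of 15. Since dhcpd records lease changes in its lease file, fewer renewals should also mean fewer journal writes, if disk I/O really is the bottleneck.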

So my questions are:

1) Has anybody seen something similar?

2) Any good ideas for further investigation? What about the network topology? Any gotchas when sending DISCOVERs through two Cisco routers?

3) During the problem, everything starts to go wrong. What would be some good values in the failover section to ease the load on the system?

I think I will have to increase mclt, but not too much, since that would create problems in partner-down.

I see a lot of load balancing messages in the logfile. Is there any way of changing the load balancing to reduce load?

Also, what is the consensus on disabling ping-check?

ping-check false;

The ping adds at least one second to every DISCOVER/OFFER exchange, and that could contribute to our problem, since we have a large network with
many network devices between the servers and the clients.

And last, when this happens, I could use some good ideas on how to handle it in the shortest possible time.
The last time we had this, I shut one server down, we stopped all incoming DHCP requests on one server, put the other server in
partner-down, and opened up for DHCP requests step by step from the different subnets. After a few nets were online, we started the second
DHCP server and waited for recovery to finish.
That took a long time, about 2 hours. The users were not happy.

The installation details are:

Topology: Windows clients --> 2 Cisco routers --> 2 Linux DHCP servers

RHEL 5, two interfaces eth0+eth1 bonded into bond0
Linux version 2.6.18-194.11.1.el5

dhcp1: 4 x 3.2 GHz Xeon CPU, 2 GB RAM, 72 GB disk
dhcp2: 4 x 3.2 GHz Xeon CPU, 4 GB RAM, 72 GB disk

isc-dhcpd  4.1.1-P1
failover protocol

1352 subnets
792 pools
around 7500 active leases

Extract from dhcpd.conf:

ddns-update-style none;
authoritative ;
default-lease-time 432000 ;  # 5 days (was 8 hours)
max-lease-time 604800 ;      # 7 days
omapi-port 7911 ;

# Failover configuration.

failover peer "ipc-dhcp1-ipc-dhcp2" {
        primary;
        address 10.11.90.73;
        port 647;
        peer address 10.11.90.74;
        peer port 647;
        max-response-delay 90;
        max-unacked-updates 20;
        mclt 1800;
        split 128;
        load balance max seconds 5;
}

# typical subnet :
subnet 10.2.2.0 netmask 255.255.255.0 {
        option subnet-mask 255.255.255.0 ;
        option routers 10.2.2.254 ;
        option domain-name "name.local" ;
        option option-150 10.11.75.10 ;
        filename "\\mboot.0" ;
        next-server 10.2.2.240 ;
        pool {
                range 10.2.2.1 10.2.2.200;
                failover peer "ipc-dhcp1-ipc-dhcp2";
                deny dynamic bootp clients;
        }
}

Regards,
Bjarne Blichfeldt
