BIND 10 #2701: DHCPv6 Performance Testing and Enhancement

Tue Feb 19 15:09:10 UTC 2013

#2701: DHCPv6 Performance Testing and Enhancement
-------------------------------------+-------------------------------------
            Reporter:  marcin        |                        Owner:
                Type:  task          |  UnAssigned
            Priority:  medium        |                       Status:
           Component:  dhcp6         |  reviewing
            Keywords:                |                    Milestone:
           Sensitive:  0             |  Sprint-DHCP-20130228
         Sub-Project:  DHCP          |                   Resolution:
Estimated Difficulty:  0             |                 CVSS Scoring:
         Total Hours:  0             |              Defect Severity:  N/A
                                     |  Feature Depending on Ticket:
                                     |          Add Hours to Ticket:  0
                                     |                    Internal?:  0
-------------------------------------+-------------------------------------
Changes (by marcin):

 * owner:  marcin => UnAssigned
 * status:  accepted => reviewing

Comment:

 As a result of profiling tests I generated a chart that '''roughly'''
 shows where the server spends its time (see callgraph.v6.png). Due to
 time constraints I did not spend much time on code analysis, assuming
 that the work on performance improvements is planned for later in 2013.
 However, there is at least one obvious place in the code where the
 server spent too much time for no reason: OptionDefinition::validate(),
 which was called for each incoming packet to check that the definition
 used to create an option instance is valid. In fact, a definition is
 created when it is added through the BIND10 configuration manager and
 persists until someone changes it manually through the configuration
 manager again, so it does not need to be re-validated for every packet.
 I created a patch for the code (trac2701) and reran the test to confirm
 that validate() no longer shows up (see callgraph.v6.2.png). The other
 parts of the code are TBD.
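
 The idea behind the validate() change can be sketched as follows. This is
 not the code from the trac2701 branch: OptionDef, OptionDefRegistry and
 makeOption() are hypothetical, simplified names used only to illustrate
 validating a definition once, when it is configured, instead of on every
 packet.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for an option definition.
struct OptionDef {
    std::string name;
    std::uint16_t code;

    // Counts validate() calls so the effect of the change is observable.
    static int validate_calls;

    // Throws if the definition is malformed (toy rule: non-empty, no spaces).
    void validate() const {
        ++validate_calls;
        if (name.empty() || name.find(' ') != std::string::npos) {
            throw std::invalid_argument("invalid option name: '" + name + "'");
        }
    }
};
int OptionDef::validate_calls = 0;

// Hypothetical registry that is filled at configuration time.
class OptionDefRegistry {
public:
    // Configuration path: validate once, then store the definition.
    void add(const OptionDef& def) {
        def.validate();
        defs_[def.code] = def;
    }

    // Per-packet path: looks up the already-validated definition and
    // builds the option without calling validate() again.
    std::vector<std::uint8_t> makeOption(
            std::uint16_t code,
            const std::vector<std::uint8_t>& payload) const {
        const OptionDef& def = defs_.at(code);  // throws if code is unknown
        (void)def;       // a real server would use it to build a typed option
        return payload;
    }

private:
    std::map<std::uint16_t, OptionDef> defs_;
};
```

 With this split, the cost of validation is paid once per configuration
 change rather than once per incoming packet.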

 I was curious whether the change described above had any effect on the
 server's throughput. I thought that the Memfile backend (the one that
 does not store leases on disk) would be the best one to try. That
 revealed another issue in the Memfile backend: it performs a full scan of
 the existing leases whenever a new lease is to be acquired. This leads to
 a steady degradation of performance when the server runs at
 1500 leases/sec for longer than 10s.

 Note that 1500 leases/sec was the rate that I reported in my previous
 email as the theoretical highest rate that the server may achieve. Due to
 the full lease scans this obviously was not the highest achievable rate.

 I created another patch that applies composite indexes to the
 multi-index container that the Memfile backend uses to store leases. The
 result was pretty good: the server running at a rate of 8000 leases/sec
 consumes only slightly over 60% of the CPU.
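
 The difference between a full scan and a lookup through a composite key
 can be sketched as below. This is only an illustration: it uses a
 std::map keyed on a pair instead of the multi-index container from the
 patch, and the Lease fields as well as the (DUID, IAID) key are
 assumptions, not the actual index definitions used by the Memfile
 backend.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified lease record.
struct Lease {
    std::string duid;      // client identifier
    std::uint32_t iaid;    // identity association id
    std::string address;   // leased IPv6 address
};

// Full scan: every lookup walks all existing leases, O(n) per lookup.
const Lease* findByScan(const std::vector<Lease>& leases,
                        const std::string& duid, std::uint32_t iaid) {
    for (const Lease& lease : leases) {
        if (lease.duid == duid && lease.iaid == iaid) {
            return &lease;
        }
    }
    return nullptr;
}

// Indexed store: leases are ordered by the composite key (duid, iaid),
// so a lookup costs O(log n) regardless of how many leases exist.
class IndexedLeaseStore {
public:
    // Returns false if a lease with the same (duid, iaid) already exists.
    bool add(const Lease& lease) {
        return by_client_
            .emplace(std::make_pair(lease.duid, lease.iaid), lease)
            .second;
    }

    // Returns the matching lease, or nullptr if there is none.
    const Lease* find(const std::string& duid, std::uint32_t iaid) const {
        auto it = by_client_.find(std::make_pair(duid, iaid));
        return it == by_client_.end() ? nullptr : &it->second;
    }

private:
    std::map<std::pair<std::string, std::uint32_t>, Lease> by_client_;
};
```

 With the scan, the cost of each allocation grows with the number of
 leases already handed out, which matches the degradation seen over time;
 with the keyed lookup it stays nearly flat.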

 The latter patch gave me the chance to check whether we had any gain as a
 result of the former patch. The gain was around 10% in relative terms
 (CPU utilization dropped from 44% before the patch to 40% after it).
 However, it does not change the actual performance of the server running
 with the MySQL backend, since MySQL is the bottleneck there.

 The attached spreadsheet contains some data that I collected for the code
 with patches applied and when using MySQL to store leases.

 On the first tab, there are results from two runs of the same test. For
 the first run I set the innodb_flush_log_at_trx_commit MySQL variable to
 1; for the second run I used innodb_flush_log_at_trx_commit=2. With the
 latter setting we got a much better rate (~1000 leases/sec). Note,
 however, that this rate is lower by more than 100 leases/sec than the
 rate of the V4 server.
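
 For reference, the setting used in the second run would typically be
 placed in the [mysqld] section of the MySQL option file. The fragment
 below is only a sketch; the location of the option file depends on the
 installation.

```ini
[mysqld]
# 1 (default): write and flush the InnoDB log to disk at every commit.
# 2: write the log at every commit, but flush it to disk only about
#    once per second.
innodb_flush_log_at_trx_commit = 2
```

 Note that with the value of 2 an operating system crash or power failure
 may lose up to about a second of committed transactions, so the higher
 rate comes at the cost of durability.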

 On the second tab I collected the data showing CPU utilization by the
 MySQL and b10-dhcp6 processes when running the test at 1000 leases/sec.
 The chart shows the utilization for both processes as a function of time.
 I used "top" captures for both processes collected every quarter of a
 second. To my surprise the total CPU utilization by both processes went
 above 100% and sometimes the second CPU core was used.
 The graph shows that MySQL consistently consumed much more CPU than
 DHCP.

 I also tried some other mysqld tweaks, using this blog post as a
 reference:
 http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/
 Unfortunately, none of the MySQL settings I tried resulted in a
 performance improvement.

 ----

 The following items should be reviewed for this ticket:
 - the new code on trac2701 branch
 - results from the testing attached to this email
 - changes to the dhcp-val/doc/performance-test-setup.txt
 - new tests and configuration files in
   dhcp-val/tests/common/v6.performance/

-- 
Ticket URL: <http://bind10.isc.org/ticket/2701#comment:2>
BIND 10 Development <http://bind10.isc.org>