BIND 10 #2934: xfrout session can be broken due to EAGAIN

Mon Apr 29 16:41:50 UTC 2013

#2934: xfrout session can be broken due to EAGAIN
-------------------------------------+-------------------------------------
                   Reporter:         |                 Owner:
  jinmei                             |                Status:  new
                       Type:         |             Milestone:  Next-Sprint-
  defect                             |  Proposed
                   Priority:         |              Keywords:
  medium                             |             Sensitive:  0
                  Component:         |           Sub-Project:  DNS
  xfrout                             |  Estimated Difficulty:  0
               CVSS Scoring:         |           Total Hours:  0
            Defect Severity:  N/A    |
Feature Depending on Ticket:         |
        Add Hours to Ticket:  0      |
                  Internal?:  0      |
-------------------------------------+-------------------------------------
 I noticed that transferring a large zone out of b10-xfrout can be
 abruptly terminated if I dump the transferred records to a terminal
 using 'dig axfr'.  On a closer look, it seems
 `XfroutSession._send_data()` raises an exception due to EAGAIN:
 {{{#!python
         while total_count < size:
             count = os.write(sock_fd, data[total_count:])
             total_count += count
 }}}
 (It should be reproducible even more easily by, e.g., starting axfr
 with dig and suspending it before it completes.)

 It might be system dependent, but on my system sock_fd is
 non-blocking (probably inherited from the original TCP socket on which
 b10-auth received the AXFR query), which is the reason for the error.
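
 The failure mode can be illustrated outside of BIND 10 with a minimal
 sketch.  It is not the xfrout code: `write_until_error` is a
 hypothetical helper that mirrors the loop above, and a local socket
 pair whose reading end never reads stands in for the suspended dig
 client.  The payload size is an arbitrary value assumed to exceed the
 socket buffers.

```python
import os
import socket


def write_until_error(sock_fd, data):
    """Write data with the same loop shape as the snippet above.

    Returns (bytes_written, hit_eagain).  Hypothetical helper for
    illustration only.
    """
    total_count = 0
    try:
        while total_count < len(data):
            count = os.write(sock_fd, data[total_count:])
            total_count += count
    except BlockingIOError:
        # Python 3 maps EAGAIN/EWOULDBLOCK to BlockingIOError.
        return total_count, True
    return total_count, False


# Stand-in for the TCP connection: the peer never reads, like a
# suspended 'dig axfr'.
writer, reader = socket.socketpair()
writer.setblocking(False)

# Assumed to be far larger than the kernel socket buffers.
payload = b"x" * (16 * 1024 * 1024)
written, hit_eagain = write_until_error(writer.fileno(), payload)

writer.close()
reader.close()
```

 Once the kernel buffer fills up, the next os.write() fails instead of
 waiting, so only part of the payload is sent.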

 While this might be relatively minor, it could easily happen in the
 real world too, due to a slow link, packet loss, etc.  So I think we
 should fix it sooner rather than later.

 The cleanest solution would be to do the asynchronous write correctly,
 communicating with the parent thread so it can gracefully terminate on
 shutdown.  But, assuming we'll redesign xfr* fundamentally, an easier
 workaround is sufficient: making the FD (socket) blocking.  I'm
 attaching a patch to do this.  I confirmed it solved the problem.
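
 For reference, putting a descriptor into blocking mode can be sketched
 as follows.  This is not the attached patch, only an illustration of
 the technique; `make_blocking` and `send_all` are hypothetical names,
 and the write loop simply repeats the one quoted above.

```python
import fcntl
import os
import socket


def make_blocking(fd):
    """Clear O_NONBLOCK on fd so os.write() waits instead of failing
    with EAGAIN.  Hypothetical helper, not the name used in the patch."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)


def send_all(fd, data):
    """Write all of data to fd, looping over short writes."""
    total_count = 0
    while total_count < len(data):
        count = os.write(fd, data[total_count:])
        total_count += count
    return total_count
```

 With the flag cleared, a write to a full socket buffer blocks until
 the peer drains it, so the loop no longer needs to handle EAGAIN.  The
 trade-off is that the writing thread can stall on a slow client, which
 is why the asynchronous approach remains the cleaner long-term design.
 (On Python 3.5 and later, os.set_blocking(fd, True) has the same
 effect.)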

-- 
Ticket URL: <http://bind10.isc.org/ticket/2934>
BIND 10 Development <http://bind10.isc.org>