BIND 10 #1537: Handle error receiving file descriptors

Mon Jan 2 16:32:51 UTC 2012

#1537: Handle error receiving file descriptors
-------------------------------------+-------------------------------------
            Reporter:  shane         |                        Owner:
                Type:  defect        |                       Status:  new
            Priority:  major         |                    Milestone:  New Tasks
           Component:  Unclassified  |                     Keywords:
           Sensitive:  0             |              Defect Severity:  Medium
         Sub-Project:  Core          |  Feature Depending on Ticket:
Estimated Difficulty:  0             |          Add Hours to Ticket:  0
         Total Hours:  0             |                    Internal?:  0
-------------------------------------+-------------------------------------
 I was seeing this in my log file:

 {{{
 2012-01-02 16:02:53.168 ERROR [b10-xfrout.xfrout]
 XFROUT_RECEIVE_FILE_DESCRIPTOR_ERROR error receiving the file descriptor
 for an XFR connection
 2012-01-02 16:02:53.168 ERROR [b10-xfrout.xfrout]
 XFROUT_RECEIVE_FILE_DESCRIPTOR_ERROR error receiving the file descriptor
 for an XFR connection
 2012-01-02 16:02:53.168 ERROR [b10-xfrout.xfrout]
 XFROUT_RECEIVE_FILE_DESCRIPTOR_ERROR error receiving the file descriptor
 for an XFR connection
 }}}

 Also, b10-xfrout uses 100% of CPU once this condition occurs.

 I discovered that this ultimately comes from fd_share.cc:

 {{{
     const int cc = recvmsg(sock, &msghdr, 0);
     if (cc <= 0) {
         free(msghdr.msg_control);
         if (cc == 0) {
             errno = ECONNRESET;
         }
         return (FD_SYSTEM_ERROR);
     }
 }}}

 Looking via strace I find:

 {{{
 select(13, [9 12], [], [], NULL)        = 1 (in [12])
 recvmsg(12, {msg_name(0)=NULL, msg_iov(1)=[{"\0", 1}], msg_controllen=0,
 msg_flags=0}, 0) = 0
 }}}

 My guess is that the process on the other side of the Unix domain socket
 has closed the connection (perhaps because it died), and that xfrout then
 gets stuck in a loop: recvmsg() returns 0 at end-of-file, and select()
 keeps reporting the socket as readable.
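
 The suspected behaviour can be reproduced outside of BIND 10. The sketch
 below is only an illustration, not xfrout code: pollAndReceive() and
 countEofWakeups() are hypothetical names, and a socketpair() stands in for
 the real connection. It closes one end of a Unix domain socket pair and
 counts how often select() followed by recvmsg() sees end-of-file on the
 other end.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Waits (up to one second) for `sock` to become readable, then calls
// recvmsg() once. Returns the recvmsg() result, or -2 if select() did not
// report the socket as readable.
ssize_t pollAndReceive(int sock) {
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(sock, &readable);
    timeval timeout = {1, 0};
    if (select(sock + 1, &readable, nullptr, nullptr, &timeout) != 1) {
        return -2;
    }

    char byte = 0;
    iovec iov = {&byte, 1};
    msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    return recvmsg(sock, &msg, 0);
}

// Creates a connected pair of Unix domain sockets, closes one end, and
// polls the other end `rounds` times. Returns how many rounds ended with
// recvmsg() returning 0 (end-of-file), or -1 if the pair cannot be created.
int countEofWakeups(int rounds) {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }
    close(fds[1]);  // the peer goes away

    int eof_wakeups = 0;
    for (int i = 0; i < rounds; ++i) {
        if (pollAndReceive(fds[0]) == 0) {
            ++eof_wakeups;
        }
    }
    close(fds[0]);
    return eof_wakeups;
}
```

 If every round ends in end-of-file, a loop that keeps the descriptor in its
 select() set will never block again, which would match the 100% CPU usage.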

 What I think we should do is:

   1. Check for this condition everywhere in the code and re-connect (or
 report an error in some meaningful way) when we discover it.
   2. Update the documentation to specify that this check is necessary.
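
 As a rough sketch of what item 1 could look like in a caller: the code
 below relies on the convention from the fd_share.cc excerpt above, where
 end-of-file is reported as a system error with errno set to ECONNRESET.
 Everything else is an assumption made for illustration: kSystemError,
 receiveByte(), LoopResult and serveConnection() are hypothetical names,
 receiveByte() reads a plain byte rather than a file descriptor, and the
 re-connect step is left to the caller.

```cpp
#include <cerrno>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical stand-in for the FD_SYSTEM_ERROR value in the excerpt above.
const int kSystemError = -2;

// Stand-in for the receiving function: reads one byte with recvmsg() and
// follows the excerpt's convention of reporting end-of-file as a system
// error with errno set to ECONNRESET. Returns the byte on success.
int receiveByte(int sock) {
    unsigned char byte = 0;
    iovec iov = {&byte, 1};
    msghdr msg = {};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    const ssize_t cc = recvmsg(sock, &msg, 0);
    if (cc <= 0) {
        if (cc == 0) {
            errno = ECONNRESET;
        }
        return kSystemError;
    }
    return byte;
}

enum class LoopResult { PeerClosed, Failed };

// Serves `sock` until the peer closes it or an unrecoverable error occurs.
// `received` counts the bytes read. The socket is closed before returning,
// so the caller cannot keep calling select() on a dead descriptor.
LoopResult serveConnection(int sock, int& received) {
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(sock, &readable);
        if (select(sock + 1, &readable, nullptr, nullptr, nullptr) < 0) {
            if (errno == EINTR) {
                continue;  // interrupted by a signal, wait again
            }
            close(sock);
            return LoopResult::Failed;
        }
        if (receiveByte(sock) == kSystemError) {
            // Inspect errno before close() has a chance to change it.
            const bool peer_closed = (errno == ECONNRESET);
            close(sock);
            return peer_closed ? LoopResult::PeerClosed : LoopResult::Failed;
        }
        ++received;
    }
}
```

 A caller that gets PeerClosed back can then decide whether to re-connect
 or to log the event once and stop, instead of retrying in a tight loop.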

-- 
Ticket URL: <http://bind10.isc.org/ticket/1537>
BIND 10 Development <http://bind10.isc.org>