BIND 10 #2636: stats crash

BIND 10 Development do-not-reply at isc.org
Tue Jan 15 19:45:14 UTC 2013


#2636: stats crash
-------------------------------------+-------------------------------------
                   Reporter:  jreed  |                 Owner:
                       Type:         |                Status:  new
  defect                             |             Milestone:  New Tasks
                   Priority:         |              Keywords:
  medium                             |             Sensitive:  0
                  Component:         |           Sub-Project:  Core
  statistics                         |  Estimated Difficulty:  0
               CVSS Scoring:         |           Total Hours:  0
            Defect Severity:  N/A    |
Feature Depending on Ticket:         |
        Add Hours to Ticket:  0      |
                  Internal?:  0      |
-------------------------------------+-------------------------------------
 OPS told me that the as112 server nagios check showed memory usage went
 up. (But looking at this, it happened after the crash.)

 I logged in and saw in my logs that stats server crashed:

 {{{
 2013-01-15 19:00:00.791 ERROR [b10-auth.auth/95428] AUTH_ZONEMGR_COMMS
 error communicating with zone manager: ASIO read: data length is not ready
 2013-01-15 19:00:01.152 ERROR [b10-auth.cc/95426] CC_TIMEOUT timeout
 reading data from command channel
 2013-01-15 19:00:01.154 ERROR [b10-auth.auth/95426] AUTH_ZONEMGR_COMMS
 error communicating with zone manager: Timeout while reading data from cc
 session
 2013-01-15 19:00:01.154 ERROR [b10-auth.cc/95426] CC_LENGTH_NOT_READY
 length not ready
 2013-01-15 19:00:01.154 ERROR [b10-auth.auth/95426] AUTH_ZONEMGR_COMMS
 error communicating with zone manager: ASIO read: data length is not ready
 2013-01-15 19:01:06.404 ERROR [b10-auth.cc/95422] CC_TIMEOUT timeout
 reading data from command channel
 2013-01-15 19:01:07.169 ERROR [b10-auth.cc/95424] CC_TIMEOUT timeout
 reading data from command channel
 2013-01-15 19:01:07.269 ERROR [b10-auth.auth/95422] AUTH_ZONEMGR_COMMS
 error communicating with zone manager: Timeout while reading data from cc
 session
 2013-01-15 19:01:07.269 ERROR [b10-auth.cc/95422] CC_LENGTH_NOT_READY
 length not ready
 2013-01-15 19:01:07.269 ERROR [b10-auth.auth/95422] AUTH_ZONEMGR_COMMS
 error communicating with zone manager: ASIO read: data length is not ready
 2013-01-15 19:01:08.103 ERROR [b10-auth.auth/95424] AUTH_ZONEMGR_COMMS
 error communicating with zone manager: Timeout while reading data from cc
 session
 2013-01-15 19:01:08.104 ERROR [b10-auth.cc/95424] CC_LENGTH_NOT_READY
 length not ready
 2013-01-15 19:01:08.104 ERROR [b10-auth.auth/95424] AUTH_ZONEMGR_COMMS
 error communicating with zone manager: ASIO read: data length is not ready
 2013-01-15 19:01:12.717 INFO  [b10-boss.boss/95413] BIND10_PROCESS_ENDED
 process 95423 of b10-stats ended with status 256
 2013-01-15 19:01:12.780 ERROR [b10-boss.boss/95413]
 BIND10_COMPONENT_FAILED component b10-stats (pid 95423) failed: process
 exited normally with exit status 256
 2013-01-15 19:01:12.785 INFO  [b10-boss.boss/95413] BIND10_COMPONENT_START
 component b10-stats is starting
 2013-01-15 19:01:12.792 INFO  [b10-boss.boss/95413]
 BIND10_STARTING_PROCESS starting process b10-stats
 2013-01-15 19:01:13.876 ERROR [b10-auth.cc/95429] CC_TIMEOUT timeout
 reading data from command channel
 2013-01-15 19:01:14.339 ERROR [b10-auth.auth/95429] AUTH_ZONEMGR_COMMS
 error communicating with zone manager: Timeout while reading data from cc
 session
 2013-01-15 19:01:14.340 ERROR [b10-auth.cc/95429] CC_LENGTH_NOT_READY
 length not ready
 2013-01-15 19:01:14.340 ERROR [b10-auth.auth/95429] AUTH_ZONEMGR_COMMS
 error communicating with zone manager: ASIO read: data length is not ready
 2013-01-15 19:01:15.084 INFO  [b10-stats.stats/78714] STATS_STARTING
 starting
 }}}

 The redirected bind10 output showed:
 {{{
 Traceback (most recent call last):
   File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
 212, in _receive_full_buffer
     self._receive_len_data()
   File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
 172, in _receive_len_data
     new_data = self._receive_bytes(self._recv_len_size)
   File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
 158, in _receive_bytes
     data = self._socket.recv(size)
 socket.timeout: timed out

 During handling of the above exception, another exception occurred:

 Traceback (most recent call last):
   File "/usr/local/libexec/bind10/b10-stats", line 687, in <module>
     stats.start()
   File "/usr/local/libexec/bind10/b10-stats", line 373, in start
     self.do_polling()
   File "/usr/local/libexec/bind10/b10-stats", line 256, in do_polling
     (answer, env) = self.cc_session.group_recvmsg(False, seq)
   File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
 275, in group_recvmsg
     env, msg  = self.recvmsg(nonblock, seq)
   File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
 130, in recvmsg
     data = self._receive_full_buffer(nonblock)
   File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
 227, in _receive_full_buffer
     raise SessionTimeout("recv() on cc session timed out")
 isc.cc.session.SessionTimeout: recv() on cc session timed out
 }}}

 Now Stats show has:

 {{{
 {
     "Auth": {
         "opcode.iquery": 4,
         "opcode.notify": 1542,
         "opcode.other": 0,
         "opcode.query": 11446454195,
         "opcode.status": 0,
         "opcode.update": 906151246,
         "queries.tcp": 4529597,
         "queries.udp": 12348077390,
         "rcode.badalg": 0,
         "rcode.badkey": 0,
         "rcode.badmode": 0,
         "rcode.badname": 0,
         "rcode.badsigvers": 0,
         "rcode.badtime": 0,
         "rcode.badtrunc": 0,
         "rcode.formerr": 469435,
         "rcode.noerror": 35305553,
         "rcode.notauth": 198,
         "rcode.notimp": 906151250,
         "rcode.notzone": 0,
         "rcode.nxdomain": 11319210406,
         "rcode.nxrrset": 0,
         "rcode.other": 0,
         "rcode.refused": 91402768,
         "rcode.servfail": 57451,
         "rcode.yxdomain": 0,
         "rcode.yxrrset": 0
     },
     "Boss": {
         "boot_time": "2012-12-19T21:41:21Z"
     },
     "Stats": {
         "boot_time": "2013-01-15T19:01:14Z",
         "last_update_time": "2013-01-15T19:32:27Z",
         "lname": "50f5a77a_1c at ethel.isc.org",
         "report_time": "2013-01-15T19:32:27Z",
         "timestamp": 1358278347.888214
     }
 }
 }}}

 The end of the dmesg output has:

 {{{
 Limiting icmp unreach response from 25027 to 200 packets/sec
 Limiting icmp unreach response from 25088 to 200 packets/sec
 Limiting icmp unreach response from 26449 to 200 packets/sec
 Limiting icmp unreach response from 28994 to 200 packets/sec
 Limiting icmp unreach response from 28885 to 200 packets/sec
 Limiting icmp unreach response from 30465 to 200 packets/sec
 Limiting icmp unreach response from 3801 to 200 packets/sec
 swap_pager_getswapspace(16): failed
 swap_pager_getswapspace(8): failed
 swap_pager_getswapspace(16): failed
 swap_pager_getswapspace(12): failed
 swap_pager_getswapspace(16): failed
 swap_pager_getswapspace(12): failed
 swap_pager_getswapspace(9): failed
 swap_pager_getswapspace(5): failed
 swap_pager_getswapspace(16): failed
 swap_pager_getswapspace(12): failed
 swap_pager_getswapspace(9): failed
 swap_pager_getswapspace(5): failed
 swap_pager_getswapspace(16): failed
 swap_pager_getswapspace(12): failed
 swap_pager_getswapspace(9): failed
 swap_pager_getswapspace(5): failed
 }}}

 /var/log/messages around time of crash:
 {{{
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(8): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(9): failed
 Jan 15 19:01:06 ethel kernel:
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(5): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(9): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(5): failed
 Jan 15 19:01:06 ethel kernel:
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(9): failed
 Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(5): failed
 }}}

 Maybe something grew to huge?

-- 
Ticket URL: <http://bind10.isc.org/ticket/2636>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development


More information about the bind10-tickets mailing list