BIND 10 #2636: stats crash
BIND 10 Development
do-not-reply at isc.org
Tue Jan 15 19:45:14 UTC 2013
#2636: stats crash
-------------------------------------+-------------------------------------
Reporter: jreed | Owner:
Type: | Status: new
defect | Milestone: New Tasks
Priority: | Keywords:
medium | Sensitive: 0
Component: | Sub-Project: Core
statistics | Estimated Difficulty: 0
CVSS Scoring: | Total Hours: 0
Defect Severity: N/A |
Feature Depending on Ticket: |
Add Hours to Ticket: 0 |
Internal?: 0 |
-------------------------------------+-------------------------------------
OPS told me that the as112 server nagios check showed memory usage went
up. (But looking at this, it happened after the crash.)
I logged in and saw in my logs that stats server crashed:
{{{
2013-01-15 19:00:00.791 ERROR [b10-auth.auth/95428] AUTH_ZONEMGR_COMMS
error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:00:01.152 ERROR [b10-auth.cc/95426] CC_TIMEOUT timeout
reading data from command channel
2013-01-15 19:00:01.154 ERROR [b10-auth.auth/95426] AUTH_ZONEMGR_COMMS
error communicating with zone manager: Timeout while reading data from cc
session
2013-01-15 19:00:01.154 ERROR [b10-auth.cc/95426] CC_LENGTH_NOT_READY
length not ready
2013-01-15 19:00:01.154 ERROR [b10-auth.auth/95426] AUTH_ZONEMGR_COMMS
error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:01:06.404 ERROR [b10-auth.cc/95422] CC_TIMEOUT timeout
reading data from command channel
2013-01-15 19:01:07.169 ERROR [b10-auth.cc/95424] CC_TIMEOUT timeout
reading data from command channel
2013-01-15 19:01:07.269 ERROR [b10-auth.auth/95422] AUTH_ZONEMGR_COMMS
error communicating with zone manager: Timeout while reading data from cc
session
2013-01-15 19:01:07.269 ERROR [b10-auth.cc/95422] CC_LENGTH_NOT_READY
length not ready
2013-01-15 19:01:07.269 ERROR [b10-auth.auth/95422] AUTH_ZONEMGR_COMMS
error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:01:08.103 ERROR [b10-auth.auth/95424] AUTH_ZONEMGR_COMMS
error communicating with zone manager: Timeout while reading data from cc
session
2013-01-15 19:01:08.104 ERROR [b10-auth.cc/95424] CC_LENGTH_NOT_READY
length not ready
2013-01-15 19:01:08.104 ERROR [b10-auth.auth/95424] AUTH_ZONEMGR_COMMS
error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:01:12.717 INFO [b10-boss.boss/95413] BIND10_PROCESS_ENDED
process 95423 of b10-stats ended with status 256
2013-01-15 19:01:12.780 ERROR [b10-boss.boss/95413]
BIND10_COMPONENT_FAILED component b10-stats (pid 95423) failed: process
exited normally with exit status 256
2013-01-15 19:01:12.785 INFO [b10-boss.boss/95413] BIND10_COMPONENT_START
component b10-stats is starting
2013-01-15 19:01:12.792 INFO [b10-boss.boss/95413]
BIND10_STARTING_PROCESS starting process b10-stats
2013-01-15 19:01:13.876 ERROR [b10-auth.cc/95429] CC_TIMEOUT timeout
reading data from command channel
2013-01-15 19:01:14.339 ERROR [b10-auth.auth/95429] AUTH_ZONEMGR_COMMS
error communicating with zone manager: Timeout while reading data from cc
session
2013-01-15 19:01:14.340 ERROR [b10-auth.cc/95429] CC_LENGTH_NOT_READY
length not ready
2013-01-15 19:01:14.340 ERROR [b10-auth.auth/95429] AUTH_ZONEMGR_COMMS
error communicating with zone manager: ASIO read: data length is not ready
2013-01-15 19:01:15.084 INFO [b10-stats.stats/78714] STATS_STARTING
starting
}}}
The redirected bind10 output showed:
{{{
Traceback (most recent call last):
File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
212, in _receive_full_buffer
self._receive_len_data()
File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
172, in _receive_len_data
new_data = self._receive_bytes(self._recv_len_size)
File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
158, in _receive_bytes
data = self._socket.recv(size)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/libexec/bind10/b10-stats", line 687, in <module>
stats.start()
File "/usr/local/libexec/bind10/b10-stats", line 373, in start
self.do_polling()
File "/usr/local/libexec/bind10/b10-stats", line 256, in do_polling
(answer, env) = self.cc_session.group_recvmsg(False, seq)
File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
275, in group_recvmsg
env, msg = self.recvmsg(nonblock, seq)
File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
130, in recvmsg
data = self._receive_full_buffer(nonblock)
File "/usr/local/lib/python3.1/site-packages/isc/cc/session.py", line
227, in _receive_full_buffer
raise SessionTimeout("recv() on cc session timed out")
isc.cc.session.SessionTimeout: recv() on cc session timed out
}}}
Now Stats show has:
{{{
{
"Auth": {
"opcode.iquery": 4,
"opcode.notify": 1542,
"opcode.other": 0,
"opcode.query": 11446454195,
"opcode.status": 0,
"opcode.update": 906151246,
"queries.tcp": 4529597,
"queries.udp": 12348077390,
"rcode.badalg": 0,
"rcode.badkey": 0,
"rcode.badmode": 0,
"rcode.badname": 0,
"rcode.badsigvers": 0,
"rcode.badtime": 0,
"rcode.badtrunc": 0,
"rcode.formerr": 469435,
"rcode.noerror": 35305553,
"rcode.notauth": 198,
"rcode.notimp": 906151250,
"rcode.notzone": 0,
"rcode.nxdomain": 11319210406,
"rcode.nxrrset": 0,
"rcode.other": 0,
"rcode.refused": 91402768,
"rcode.servfail": 57451,
"rcode.yxdomain": 0,
"rcode.yxrrset": 0
},
"Boss": {
"boot_time": "2012-12-19T21:41:21Z"
},
"Stats": {
"boot_time": "2013-01-15T19:01:14Z",
"last_update_time": "2013-01-15T19:32:27Z",
"lname": "50f5a77a_1c at ethel.isc.org",
"report_time": "2013-01-15T19:32:27Z",
"timestamp": 1358278347.888214
}
}
}}}
The end of the dmesg output has:
{{{
Limiting icmp unreach response from 25027 to 200 packets/sec
Limiting icmp unreach response from 25088 to 200 packets/sec
Limiting icmp unreach response from 26449 to 200 packets/sec
Limiting icmp unreach response from 28994 to 200 packets/sec
Limiting icmp unreach response from 28885 to 200 packets/sec
Limiting icmp unreach response from 30465 to 200 packets/sec
Limiting icmp unreach response from 3801 to 200 packets/sec
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(8): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(9): failed
swap_pager_getswapspace(5): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(9): failed
swap_pager_getswapspace(5): failed
swap_pager_getswapspace(16): failed
swap_pager_getswapspace(12): failed
swap_pager_getswapspace(9): failed
swap_pager_getswapspace(5): failed
}}}
/var/log/messages around time of crash:
{{{
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(8): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(9): failed
Jan 15 19:01:06 ethel kernel:
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(5): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(9): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(5): failed
Jan 15 19:01:06 ethel kernel:
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(16): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(12): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(9): failed
Jan 15 19:01:06 ethel kernel: swap_pager_getswapspace(5): failed
}}}
Maybe something grew to huge?
--
Ticket URL: <http://bind10.isc.org/ticket/2636>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development
More information about the bind10-tickets
mailing list