bind crashes with assertion, maybe due to many ephemeral network devices?
Ondřej Surý
ondrej at isc.org
Mon Mar 10 20:45:26 UTC 2025
> bind crashes with assertion, maybe due to many ephemeral network devices?
Looking at the symptoms and your description, I actually think this is a problem
of interfaces appearing during the network interface scan and then disappearing
before named can process them.
I would suggest disabling automatic-interface-scan and setting up named to
listen on fixed addresses, so it doesn't have to deal with the mayhem that
Docker is creating.
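A minimal sketch of that change in named.conf, assuming the host's fixed addresses are 127.0.0.1 and one address in the 192.168.188.0/24 range seen in the allow-recursion list below (192.168.188.1 is a placeholder, not a known address of this host):

```
options {
        // Stop tracking interfaces as they come and go, and disable
        // the periodic interface rescan entirely.
        automatic-interface-scan no;
        interface-interval 0;

        // Listen only on fixed addresses instead of every interface,
        // so the ephemeral Docker veth devices are never touched.
        listen-on { 127.0.0.1; 192.168.188.1; };
        listen-on-v6 { ::1; };
};
```

With this in place the veth interfaces created and destroyed by the containers never enter named's listening set, so the ENODEV race during the interface scan cannot occur.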
I've unblocked and "trusted" your account, so it should not get blocked again.
If you set up 2FA on the account, it also acts as a permanent marker that this
is not a spam account.
Feel free to file the issue, but I can't promise it will be looked at soon,
as this is in the "doctor, it hurts when I do this" territory.
Ondrej
--
Ondřej Surý (He/Him)
ondrej at isc.org
My working hours and your working hours may be different. Please do not feel obligated to reply outside your normal working hours.
> On 10. 3. 2025, at 21:19, Erich Eckner <bind at eckner.net> wrote:
>
> Hi,
>
> I'm running bind version 9.20.6 on artix linux (an arch linux derivative without systemd) with a pretty standard config:
>
> # named -V
> BIND 9.20.6 (Stable Release) <id:72cbad0>
> running on Linux x86_64 6.13.5-artix1-1 #1 SMP PREEMPT_DYNAMIC Fri, 28 Feb 2025 10:18:15 +0000
> built by make with '--prefix=/usr' '--sysconfdir=/etc' '--sbindir=/usr/bin' '--localstatedir=/var' '--disable-static' '--enable-fixed-rrset' '--enable-full-report' '--with-maxminddb' '--with-openssl' '--with-libidn2' '--with-json-c' '--with-libxml2' '--with-lmdb' 'CFLAGS=-march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=3 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -flto=auto -DDIG_SIGCHASE' 'LDFLAGS=-Wl,-O1 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,-z,pack-relative-relocs -flto=auto'
> compiled by GCC 14.2.1 20250207
> compiled with OpenSSL version: OpenSSL 3.4.1 11 Feb 2025
> linked to OpenSSL version: OpenSSL 3.4.1 11 Feb 2025
> compiled with libuv version: 1.50.0
> linked to libuv version: 1.50.0
> compiled with liburcu version: 0.15.0
> compiled with jemalloc version: 5.3.0
> compiled with libnghttp2 version: 1.64.0
> linked to libnghttp2 version: 1.65.0
> compiled with libxml2 version: 2.13.5
> linked to libxml2 version: 21306-GITv2.13.6
> compiled with json-c version: 0.18
> linked to json-c version: 0.18
> compiled with zlib version: 1.3.1
> linked to zlib version: 1.3.1
> linked to maxminddb version: 1.12.2
> threads support is enabled
> DNSSEC algorithms: RSASHA1 NSEC3RSASHA1 RSASHA256 RSASHA512 ECDSAP256SHA256 ECDSAP384SHA384 ED25519 ED448
> DS algorithms: SHA-1 SHA-256 SHA-384
> HMAC algorithms: HMAC-MD5 HMAC-SHA1 HMAC-SHA224 HMAC-SHA256 HMAC-SHA384 HMAC-SHA512
> TKEY mode 2 support (Diffie-Hellman): no
> TKEY mode 3 support (GSS-API): yes
>
> default paths:
> named configuration: /etc/named.conf
> rndc configuration: /etc/rndc.conf
> nsupdate session key: /var/run/named/session.key
> named PID file: /var/run/named/named.pid
> geoip-directory: /usr/share/GeoIP
>
>
> # grep '^\s*[^[:space:]#/]' /etc/named.conf
> options {
> directory "/var/named";
> pid-file "/run/named/named.pid";
> allow-recursion { 127.0.0.1; 192.168.188.0/24; };
> allow-transfer { none; };
> allow-update { none; };
> version none;
> hostname none;
> server-id none;
> };
> zone "localhost" IN {
> type master;
> file "localhost.zone";
> };
> zone "0.0.127.in-addr.arpa" IN {
> type master;
> file "127.0.0.zone";
> };
> zone "1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa" {
> type master;
> file "localhost.ip6.zone";
> };
>
> # pgrep -af named
> 22958 /usr/bin/named -u named -L /var/log/named.log
>
> For a few days (or weeks?) now, it has been acting up. Every few tens of minutes, it crashes with:
>
> 10-Mar-2025 20:33:36.995 general: error: uv.c:95:isc__uverr2result(): unexpected error:
> 10-Mar-2025 20:33:36.995 general: error: unable to convert libuv error code in start_udp_child_job (netmgr/udp.c:172) to isc_result: -19: no such device
> [the same pair of lines repeats roughly twenty more times within the same two milliseconds]
> 10-Mar-2025 20:33:36.996 network: error: creating IPv6 interface veth731351f failed; interface ignored
> 10-Mar-2025 20:33:36.996 network: info: listening on IPv6 interface vetha808625, fe80::d0cf:5fff:fe3a:1e50%954915#53
> 10-Mar-2025 20:33:36.998 network: info: listening on IPv6 interface veth92035bc, fe80::58f0:c5ff:fecf:4a8d%954971#53
> 10-Mar-2025 20:33:37.000 network: info: listening on IPv6 interface vethb1ef26b, fe80::58e2:d2ff:fe3f:c77f%955141#53
> 10-Mar-2025 20:33:37.003 network: info: listening on IPv6 interface veth0ee3ea4, fe80::44be:c7ff:fefd:83fb%955153#53
> 10-Mar-2025 20:33:37.005 network: info: listening on IPv6 interface veth39e879e, fe80::34fb:98ff:fe9e:d49f%955162#53
> 10-Mar-2025 20:33:37.007 network: info: listening on IPv6 interface veth2f2d6df, fe80::2c2b:e8ff:fe8e:2339%955167#53
> 10-Mar-2025 20:33:37.010 network: info: listening on IPv6 interface vetha0e2b2b, fe80::84fd:7aff:fe72:9c82%955207#53
> 10-Mar-2025 20:33:37.012 network: info: listening on IPv6 interface vethb633142, fe80::58a5:32ff:feaf:bdb2%955208#53
> 10-Mar-2025 20:33:37.014 network: info: listening on IPv6 interface veth232d291, fe80::f442:a2ff:fe0d:18f8%955383#53
> 10-Mar-2025 20:33:37.017 network: info: listening on IPv6 interface vetha87c0e9, fe80::2431:26ff:fe1e:adac%955384#53
> 10-Mar-2025 20:33:37.021 network: info: listening on IPv6 interface vethadab24f, fe80::7d:44ff:fe11:7284%955606#53
> 10-Mar-2025 20:33:37.024 network: info: listening on IPv6 interface vethe9c8381, fe80::1847:42ff:fe98:cd5c%955655#53
> 10-Mar-2025 20:33:37.026 network: info: listening on IPv6 interface veth5f5869a, fe80::ec06:66ff:fe5d:ef74%955668#53
> 10-Mar-2025 20:33:37.029 network: info: listening on IPv6 interface vethe46d2e1, fe80::f48e:14ff:fe94:2efd%955683#53
> 10-Mar-2025 20:33:37.032 network: info: listening on IPv6 interface vethf87bbe4, fe80::6c0b:47ff:fed2:404d%955686#53
> 10-Mar-2025 20:33:37.035 network: info: listening on IPv6 interface veth207c7ca, fe80::f019:b8ff:feda:517d%955692#53
> 10-Mar-2025 20:33:37.038 network: info: listening on IPv6 interface veth1654fa8, fe80::fc83:fcff:fe79:8f01%955718#53
> 10-Mar-2025 20:33:37.041 network: info: listening on IPv6 interface vethe4e528f, fe80::901d:7fff:fe58:ed2%955719#53
> 10-Mar-2025 20:33:37.041 general: critical: netmgr/udp.c:77:isc__nm_udp_lb_socket(): fatal error:
> 10-Mar-2025 20:33:37.041 general: critical: RUNTIME_CHECK(result == ISC_R_SUCCESS) failed
> 10-Mar-2025 20:33:37.041 general: critical: exiting (due to fatal error in library)
>
> As a first aid, I added a script that simply restarts the nameserver if it crashes. This showed me two things:
>
> 1. If the server crashes, a restart will also fail for the next one or two minutes.
>
> 2. The crashes seem to correlate with the other main load that I have on this machine: a couple hundred Docker containers (each of which apparently sets up a network device on the host system) that are started every ten minutes and run for a few minutes (in rare cases longer). Looking at the minutes of the assertion logs, there is a clear emphasis on the minutes when many containers start(?)/run/stop:
>
> $ grep -F 'RUNTIME_CHECK(result == ISC_R_SUCCESS)' /var/log/named.log | cut -d' ' -f2 | cut -d: -f2 | cut -c2 | sort | uniq -c
> 5976 0
> 14767 1
> 42850 2
> 31292 3
> 693 4
> 204 5
> 199 6
> 211 7
> 226 8
> 198 9
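For clarity: the pipeline above isolates the units digit of the minute from each log timestamp, which is why counts cluster on digits 0-3 with a cron job firing at minutes 0, 10, 20, and so on. The same extraction on a single sample log line:

```shell
# Take one RUNTIME_CHECK line from named.log and apply the same pipeline:
# field 2 of the line is the time, field 2 of the time is the minute,
# and character 2 of the minute is its units digit.
line='10-Mar-2025 20:33:37.041 general: critical: RUNTIME_CHECK(result == ISC_R_SUCCESS) failed'
echo "$line" | cut -d' ' -f2 | cut -d: -f2 | cut -c2
# → 3
```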
>
> The containers are started via a cronjob:
> */10 * * * * /home/erich/git/archlinuxewe/build-all-with-docker
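The restart script itself isn't shown; a hypothetical sketch of such a watchdog (not the author's actual script; the PID file and invocation are taken from the `named -V` output and `pgrep` line above):

```shell
#!/bin/sh
# Hypothetical watchdog sketch: restart named if its PID file is missing
# or points at a process that is no longer alive. Intended to run from
# cron every minute; harmless when named is healthy.
PIDFILE=/var/run/named/named.pid

pid="$(cat "$PIDFILE" 2>/dev/null)"
# kill -0 sends no signal; it only checks that the process exists.
if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
    /usr/bin/named -u named -L /var/log/named.log
fi
```

Note that, per observation 1 above, such a watchdog will spin for a minute or two after each crash before a restart succeeds.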
>
> In between the crashes, the nameserver seems to run as expected. Also, the Docker containers (which require working name resolution on the host system) do not always fail, so at least some of the time, named seems to successfully process the containers' requests.
>
> I hope someone has an idea of where I should look. It feels strange that such a "reference" product as bind should be crashable simply by having a large number of fluctuating network devices.
>
> Some side notes, maybe less related to the issue at hand, but I still want to write them here in case they are relevant:
>
> The system seems to be somewhat under load while the containers run, but I would be astonished if this caused bind to crash: RAM usage goes up to 16GB of 128GB available, though CPU goes up to 100%.
>
> I have a second, similar machine (same distribution, similar setup regarding bind), but without the "pulsed" load of Docker containers, where named has been running for *looks*up*the*numbers* more than 8 days without crashes (which matches the uptime of that machine).
>
> I wanted to open a bug at gitlab.isc.org, but my account ("deep42thought", under which I reported something a few years ago) got blocked after being reactivated, because I did not notice the big warning on the login page stating exactly this behaviour and took >1 day to gather the information for the bug. :-( Maybe someone can unblock me, then I could add 2FA to persist the account?
>
> Some time ago I tried to get the stats channel working through
>
> options {
> zone-statistics full;
> };
> statistics-channels {
> inet 127.0.0.1 port 8053;
> };
>
> but this seemed to crash the server back then. And since it was just a toy project, I didn't pursue it any further and removed it from the config quite some time ago.
>
> regards,
> Erich
> --
> Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from this list
>
> ISC funds the development of this software with paid support subscriptions. Contact us at https://www.isc.org/contact/ for more information.
>
>
> bind-users mailing list
> bind-users at lists.isc.org
> https://lists.isc.org/mailman/listinfo/bind-users