BIND and UDP tuning

Thu Sep 27 15:04:05 UTC 2018


On 27/09/2018 16.53, Alex wrote:
> Hi,
>
>>> I reported a few weeks ago that I was experiencing a really high
>>> number of "SERVFAIL" messages in my bind-9.11.4-P1 system running on
>>> fedora28, and I haven't yet found a solution. This is all now running
>>> on a 165/35 cable system.
>>>
>>> I found a program named dropwatch which is showing a significant
>>> number of dropped UDP packets, particularly when there are bursts of
>>> email traffic:
>>>
>>> 12 drops at skb_queue_purge+13 (0xffffffff9f79a0c3)
>>> 1 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
>>> 4 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
>>> 5 drops at nf_hook_slow+a7 (0xffffffff9f7faff7)
>>> 3 drops at sk_stream_kill_queues+48 (0xffffffff9f7a1158)
>>> 3 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)
>>> ...
>>>
>>> # netstat -us
>>> ...
>>> Udp:
>>>     23449482 packets received
>>>     1724269 packets to unknown port received
>>>     8248 packet receive errors
>>>     31394909 packets sent
>>>     8243 receive buffer errors
>>>     0 send buffer errors
>>>     InCsumErrors: 5
>>>     IgnoredMulti: 43247
>>>
>>> The SERVFAIL messages don't necessarily correspond to the UDP packet
>>> errors shown by netstat, but the dropwatch output is continuous. The
>>> netstat packet receive errors also don't seem to correspond to
>>> "SERVFAIL" or "Name service" errors:
>>>
>>> 26-Sep-2018 12:42:49.743 query-errors: info: client @0x7fb3c41634d0
>>> 127.0.0.1#44104 (46.36.47.104.wl.mailspike.net): query failed
>>> (SERVFAIL) for 46.36.47.104.wl.mailspike.net/IN/A at
>>> ../../../bin/named/query.c:8580
>>>
>>> Sep 26 12:47:11 mail03 postfix/dnsblog[22821]: warning: dnsblog_query:
>>> lookup error for DNS query 196.91.107.80.bl.spameatingmonkey.net: Host
>>> or domain name not found. Name service error for
>>> name=196.91.107.80.bl.spameatingmonkey.net type=A: Host not found, try
>>> again
>>>
>>> I've been following this thread from some time ago, but nothing I've
>>> done has made a difference. I really don't know what the buffer sizes
>>> should be.
>>> http://bind-users-forum.2342410.n4.nabble.com/Tuning-suggestions-for-high-core-count-Linux-servers-td3899.html
>>>
>>> Are there specific bind tunables you might recommend? edns-udp-size,
>>> perhaps?
>>>
>>> Any ideas on other tunables such as net.core.*mem_default etc?
>> *chuckles to self*
>>
>> I was just referring back to that thread myself to try to remember what I did.
>>
>> I ended up tuning the following items:
>>
>>   - name: SYSCTL system tuning, basics
>>     sysctl:
>>       name: "{{ item.name }}"
>>       value: "{{ item.value }}"
>>       sysctl_set: yes
>>       state: present
>>     with_items:
>>       - { name: 'vm.swappiness', value: 0 }
>>       - { name: 'net.core.netdev_max_backlog', value: 32768 }
>>       - { name: 'net.core.netdev_budget', value: 2700 }
>>       - { name: 'net.ipv4.tcp_sack', value: 0 }
>>       - { name: 'net.core.somaxconn', value: 2048 }
>>       - { name: 'net.core.rmem_default', value: 16777216 }
>>       - { name: 'net.core.rmem_max', value: 16777216 }
>>       - { name: 'net.core.wmem_default', value: 16777216 }
>>       - { name: 'net.core.wmem_max', value: 16777216 }
> Were you troubleshooting the same problems as I'm experiencing?
>
> I've already tweaked many of these values, and they have had no effect on my
> SERVFAIL issues :-(
>
> I've also been following the performance tuning variables in this RH document:
> https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf
>
> These errors appear to occur in spurts - there are typically ten or
> more in a row, then anywhere from seconds to minutes before the
> next one.
>
> It looks like there are periods of as many as 500 queries per second,
> although the usual amount is closer to 200 per second.
>
> I don't believe this is a bind configuration problem, as the "Name
> service error" errors from postfix also occur when testing with
> unbound.
>
> This is also only happening on the two identical systems connected to
> the 165/35mbit cable modem. I've verified with Oponline, and they've
> emphatically asserted there are no problems with the circuit. The
> systems are 8-core Xeon E31240 with 16GB RAM. I've also tried other
> systems, including a 12-core i7 with 32GB.
>
> We have several other systems connected to a 10mbit DIA ethernet
> circuit where these errors don't generally occur. They are also
> similarly configured fedora systems with the same version of bind.
>
> I'm really at a loss as to what the problem(s) are, but feel like it's
> really impacting our ability to query RBLs for processing mail.
>
>> Whilst it was only mentioned in passing on that thread, I also poked around
>> with the TOE, pause, adaptive coalescing and ring size settings (look at
>> ethtool -K, ethtool -A, ethtool -C and ethtool -G), but sadly I have lost
>> the specific commands.
> I've also tried configuring the NIC with ethtool according to the
> variables defined in the RH document listed above and have had no
> success.
>
> This really is just a stock system. I can't believe these problems
> would be so elusive or uncommon. Could it have to do with some
> characteristic of the cable circuit itself?
Just a wild thought:
It works on the lower-speed line (at least that is how I read it) but has
problems on the faster one.
Could it be that the line is so fast that it "overtakes" the host in
question?

A faster incoming line leaves less time between packets for the host to
process each one.
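
To put rough numbers on that, here is a small back-of-the-envelope sketch.
The 100-byte packet size and the fully saturated link are my own
assumptions for illustration, not measurements from these systems, and
framing overhead is ignored. With those assumptions, back-to-back packets
arrive 80 microseconds apart at 10 Mbit/s but only about 4.8 microseconds
apart at 165 Mbit/s.

```python
# Minimum spacing between back-to-back packets arriving at line rate.
# Assumptions (not from the thread): 100-byte packets, no framing
# overhead, and a link that is completely saturated during a burst.

def packet_gap_us(link_mbit: float, packet_bytes: int = 100) -> float:
    """Microseconds between packet starts on a saturated link."""
    bits = packet_bytes * 8
    # bits divided by (megabits per second) yields microseconds
    return bits / link_mbit

for mbit in (10, 165):
    gap = packet_gap_us(mbit)
    print(f"{mbit:>4} Mbit/s: {gap:6.2f} us between packets, "
          f"up to {1e6 / gap:,.0f} packets/s")
```

At 200 to 500 queries per second the link is of course nowhere near
saturated on average, so this would only matter if the answers arrive in
short, dense bursts.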
>
> I've also experimented with QoS, using tc to prioritize interactive
> traffic, including tcp and udp port 53, with plenty of bandwidth.
>
> I really hope there is someone with some additional ideas.
> Thanks,
> Alex
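
One way to check whether the host really falls behind during the bursts
might be to sample the kernel's UDP counters once a second and see whether
the receive buffer errors jump at the same moments as the SERVFAILs. Below
is a minimal sketch of that idea. It assumes a Linux host where
/proc/net/snmp carries the same counters that netstat -us reports; the
function names and the one-second interval are just my choices.

```python
import time

def parse_udp_counters(snmp_text: str) -> dict:
    """Return the Udp: counters from /proc/net/snmp-formatted text."""
    # The file holds a header line and a value line, both prefixed "Udp:".
    rows = [line.split() for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    header, values = rows[0], rows[1]
    return dict(zip(header[1:], map(int, values[1:])))

def watch(interval_s: float = 1.0, path: str = "/proc/net/snmp") -> None:
    """Print a timestamped line whenever the UDP error counters grow."""
    with open(path) as f:
        prev = parse_udp_counters(f.read())
    while True:
        time.sleep(interval_s)
        with open(path) as f:
            cur = parse_udp_counters(f.read())
        delta = {k: cur[k] - prev[k]
                 for k in ("InDatagrams", "InErrors", "RcvbufErrors")}
        if delta["InErrors"] or delta["RcvbufErrors"]:
            print(time.strftime("%H:%M:%S"), delta)
        prev = cur

# watch()  # uncomment to run on the affected host; stop with Ctrl-C
```

If the timestamps it prints line up with the SERVFAIL entries in the query
log, that would point at the host side rather than at the circuit.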
