<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFCC">

    <br>

    <br>

    <div class="moz-cite-prefix">On 27/09/2018 16.53, Alex wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAB1R3sjC9nyt4=ex-43cqrFC2ED-_UW=eYf13-tS5DtAwbTK_A@mail.gmail.com">

      <pre wrap="">Hi,

</pre>

      <blockquote type="cite">

        <blockquote type="cite">

          <pre wrap="">I reported a few weeks ago that I was experiencing a really high

number of "SERVFAIL" messages in my bind-9.11.4-P1 system running on

fedora28, and I haven't yet found a solution. This is all now running

on a 165/35 cable system.

I found a program named dropwatch which is showing a significant

number of dropped UDP packets, particularly when there are bursts of

email traffic:

12 drops at skb_queue_purge+13 (0xffffffff9f79a0c3)

1 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)

4 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)

5 drops at nf_hook_slow+a7 (0xffffffff9f7faff7)

3 drops at sk_stream_kill_queues+48 (0xffffffff9f7a1158)

3 drops at __udp4_lib_rcv+1e6 (0xffffffff9f83bdf6)

...

# netstat -us

...

Udp:

    23449482 packets received

    1724269 packets to unknown port received

    8248 packet receive errors

    31394909 packets sent

    8243 receive buffer errors

    0 send buffer errors

    InCsumErrors: 5

    IgnoredMulti: 43247

The SERVFAIL messages don't necessarily correspond to the UDP packet

errors shown by netstat, but the dropwatch output is continuous. The

netstat packet receive errors also don't seem to correspond to

"SERVFAIL" or "Name service" errors:

26-Sep-2018 12:42:49.743 query-errors: info: client @0x7fb3c41634d0

127.0.0.1#44104 (46.36.47.104.wl.mailspike.net): query failed

(SERVFAIL) for 46.36.47.104.wl.mailspike.net/IN/A at

../../../bin/named/query.c:8580

Sep 26 12:47:11 mail03 postfix/dnsblog[22821]: warning: dnsblog_query:

lookup error for DNS query 196.91.107.80.bl.spameatingmonkey.net: Host

or domain name not found. Name service error for

name=196.91.107.80.bl.spameatingmonkey.net type=A: Host not found, try

again

I've been following this thread from some time ago, but nothing I've

done has made a difference. I really don't know what the buffer sizes

should be.

<a class="moz-txt-link-freetext" href="http://bind-users-forum.2342410.n4.nabble.com/Tuning-suggestions-for-high-core-count-Linux-servers-td3899.html">http://bind-users-forum.2342410.n4.nabble.com/Tuning-suggestions-for-high-core-count-Linux-servers-td3899.html</a>

Are there specific bind tunables you might recommend? edns-udp-size,

perhaps?

Any ideas on other tunables such as net.core.*mem_default etc?

</pre>

        </blockquote>

        <pre wrap="">

*chuckles to self*

I was just referring back to that thread myself to try to remember what I did.

I ended up tuning the following items:

  - name: SYSCTL system tuning, basics

    sysctl:

      name: "{{ item.name }}"

      value: "{{ item.value }}"

      sysctl_set: yes

      state: present

    with_items:

      - { name: 'vm.swappiness', value: 0 }

      - { name: 'net.core.netdev_max_backlog', value: 32768 }

      - { name: 'net.core.netdev_budget', value: 2700 }

      - { name: 'net.ipv4.tcp_sack', value: 0 }

      - { name: 'net.core.somaxconn', value: 2048 }

      - { name: 'net.core.rmem_default', value: 16777216 }

      - { name: 'net.core.rmem_max', value: 16777216 }

      - { name: 'net.core.wmem_default', value: 16777216 }

      - { name: 'net.core.wmem_max', value: 16777216 }

</pre>

      </blockquote>

      <pre wrap="">

Were you troubleshooting the same problems as I'm experiencing?

I've already tweaked many of these values, and it has had no effect on my

SERVFAIL issues :-(

I've also been following the performance tuning variables in this RH document:

<a class="moz-txt-link-freetext" href="https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf">https://access.redhat.com/sites/default/files/attachments/20150325_network_performance_tuning.pdf</a>

These errors appear to occur in spurts - there are typically ten or

more in a row, then anywhere from seconds to minutes before the

next burst.

It looks like there are periods of as many as 500 queries per second,

although the usual amount is closer to 200 per second.

I don't believe this is a bind configuration problem, as the "Name

service error" errors from postfix also occur when testing with

unbound.

This is also only happening on the two identical systems connected to

the 165/35mbit cable modem. I've verified with Oponline, and they've

emphatically asserted there are no problems with the circuit. The

systems are 8-core Xeon E31240 with 16GB RAM. I've also tried other

systems, including a 12-core i7 with 32GB.

We have several other systems connected to a 10mbit DIA ethernet

circuit where these errors don't generally occur. They are also

similarly configured fedora systems with the same version of bind.

I'm really at a loss as to what the problem(s) are, but feel like it's

really impacting our ability to query RBLs for processing mail.

</pre>

      <blockquote type="cite">

        <pre wrap="">Although it was only mentioned in passing on that thread, I also poked around with the TOE, pause, adaptive coalescing and ring size settings (look at ethtool -K, ethtool -A, ethtool -C and ethtool -G), but sadly I have lost the specific commands.

</pre>

      </blockquote>

      <pre wrap="">

I've also tried configuring the NIC with ethtool according to the

variables defined in the RH document listed above and have had no

success.

This really is just a stock system. I can't believe these problems

would be so elusive or uncommon. Could it have to do with some

characteristic of the cable circuit itself?</pre>

    </blockquote>

    Just a wild thought:<br>

    It works on the lower-speed line (at least that is how I read it) but

    has problems on the higher-speed one.<br>

    Could it be that the line is so fast that it "overtakes" the host in

    question?<br>
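    One way to test that would be to watch whether the "receive buffer
    errors" counter shown by netstat -us climbs during the mail bursts.
    What follows is only a minimal sketch under assumptions: it expects a
    Linux host where /proc/net/snmp carries a "Udp:" line of field names
    followed by a "Udp:" line of values, and the function names and the
    5 second interval are illustrative choices, not anything taken from
    this thread.<br>
    <pre wrap="">
```python
import time


def parse_udp_counters(snmp_text):
    """Return the Udp counters from /proc/net/snmp-style text as a dict."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    # The first "Udp:" line holds the field names, the second the values.
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (assumes a Linux /proc)."""
    with open(path) as f:
        return parse_udp_counters(f.read())


def watch(interval=5.0):
    """Print per-interval growth of received and buffer-dropped datagrams."""
    previous = read_udp_counters()
    while True:
        time.sleep(interval)
        current = read_udp_counters()
        print(time.strftime("%H:%M:%S"),
              "InDatagrams +%d" % (current["InDatagrams"] - previous["InDatagrams"]),
              "RcvbufErrors +%d" % (current["RcvbufErrors"] - previous["RcvbufErrors"]))
        previous = current
```
</pre>
    Calling watch() while mail is arriving and comparing the timestamps
    with the SERVFAIL entries in the query-errors log should show whether
    the drops and the failures line up. If RcvbufErrors stays flat during
    a burst of SERVFAILs, the socket buffers are probably not the
    culprit.<br>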

    <br>

    A faster incoming line leaves the host less time between packets to

    process each one.<br>
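    To put rough numbers on that, here is a small worked calculation. It
    assumes packets arrive back to back at the full downstream rate, and
    it takes 512 bytes as an illustrative DNS response size. Both are
    assumptions for the sake of the arithmetic, not measurements from
    these systems, and packet_gap_us is just a made-up helper name.<br>
    <pre wrap="">
```python
def packet_gap_us(packet_bytes, line_mbit):
    """Microseconds needed to deliver one packet at the given line rate."""
    # Bits divided by Mbit/s comes out directly in microseconds.
    return packet_bytes * 8 / line_mbit


# 512 bytes is an assumed, illustrative DNS response size.
for line_mbit in (165, 10):
    gap = packet_gap_us(512, line_mbit)
    print("%3d Mbit/s: %.1f us between back-to-back packets" % (line_mbit, gap))
```
</pre>
    Under those assumptions the host has roughly 25 microseconds per
    packet on the 165 Mbit line against roughly 410 microseconds on the
    10 Mbit line, a factor of 16.5. A burst that the slower line spreads
    out could therefore arrive almost at once on the faster one.<br>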

    <blockquote type="cite"

cite="mid:CAB1R3sjC9nyt4=ex-43cqrFC2ED-_UW=eYf13-tS5DtAwbTK_A@mail.gmail.com">

      <pre wrap="">

I've also experimented with QoS, using tc to prioritize interactive

traffic, including tcp and udp port 53, with plenty of bandwidth.

I really hope there is someone with some additional ideas.

Thanks,

Alex

</pre>

    </blockquote>

    <br>

  </body>

</html>