100% CPU / wedge with 9.8.3-P4 & RPZ?
Phil Mayers
p.mayers at imperial.ac.uk
Sat Mar 16 13:04:27 UTC 2013
On 03/16/2013 12:43 PM, Matus UHLAR - fantomas wrote:
> On 16.03.13 11:39, Phil Mayers wrote:
>> In the last 12 hours, we've had repeated instances of named getting
>> wedged. The symptoms are:
>>
>> * named consuming nearly 100% CPU, all in user-time
>> * lots of queries apparently not processed, and based on query
>> logging, a sharp drop in the rate of queries that are
>
> do you provide recursion only for your clients?
Yes, only our clients. The nameserver is ACLed to our networks only, i.e.
it is *not* an open resolver.
>
>> * a very sharp drop (almost a complete halt, in fact) in the rate of
>> RPZ "hits" in the logs at the exact time this happens
>> * no other interesting logs, as far as I can see
>>
>> Re-starting the named process clears it.
>>
>> I can't see anything in the release notes for 9.8.4/9.8.5 - any ideas?
>>
>> This is with the Spamhaus DBL, in case it matters.
>
> do you have local copy of spamhaus DBL?
Yes, the RPZ version.
Further investigation suggests the trigger was a *massive* update to the
zone; xfer.log has entries like:
16-Mar-2013 11:02:48.332 transfer of 'rpz.spamhaus.org/IN/main' from
x.x.x.x#53: Transfer completed: 489 messages, 701633 records, 15709867
bytes, 53.944 secs (291225 bytes/sec)
...and these immediately precede (by a handful of seconds) the loss of
service.
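
A rough sketch for spotting such transfers in one's own xfer.log, in case
it helps others compare. It assumes the "Transfer completed" summary has
the same shape as the entry quoted above; the function names and the
10000-record threshold are hypothetical and should be tuned to the zone's
normal IXFR size.

```python
import re

# Matches the "Transfer completed" summary in the shape quoted above.
# Assumption: other BIND versions may word this line differently.
XFER_RE = re.compile(
    r"transfer of '(?P<zone>[^']+)' from (?P<source>\S+): "
    r"Transfer completed: (?P<messages>\d+) messages, "
    r"(?P<records>\d+) records, (?P<bytes>\d+) bytes, "
    r"(?P<secs>[\d.]+) secs"
)


def parse_transfers(text):
    """Return one dict per completed transfer found in the log text."""
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(text.split())
    transfers = []
    for m in XFER_RE.finditer(flat):
        transfers.append({
            "zone": m.group("zone"),
            "source": m.group("source"),
            "messages": int(m.group("messages")),
            "records": int(m.group("records")),
            "bytes": int(m.group("bytes")),
            "secs": float(m.group("secs")),
        })
    return transfers


def large_transfers(text, threshold=10000):
    """Return transfers carrying more records than the (hypothetical) threshold."""
    return [t for t in parse_transfers(text) if t["records"] > threshold]
```

Feeding it the contents of xfer.log and comparing the timestamps of any
hits against the times named wedged would confirm or rule out the
correlation.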
I'm assuming that such a huge update to an RPZ triggered some sort of
bug and one thread ended up spinning; certainly the xfr is way larger
than usual, since it covers about 95% of the total zone size (typical
updates arrive several times an hour, in the 10-200 record range, via
IXFR).
Examination of the journal suggests they deleted and re-added more or
less every record in the zone (presumably an error at their side).
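
A minimal sketch of how that kind of churn could be measured. It assumes
the journal has first been dumped to text with named-journalprint, and
that each line of the dump starts with "add" or "del" followed by the
record; that output shape is an assumption and should be checked against
the local BIND version. The function name is hypothetical.

```python
def journal_churn(lines):
    """Count deleted, added, and deleted-then-re-added records.

    Assumes each relevant line looks like "add <record>" or "del <record>".
    Lines that do not start with either keyword are ignored.
    """
    added, deleted = set(), set()
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] not in ("add", "del"):
            continue
        # Normalise whitespace so identical records compare equal.
        record = " ".join(fields[1:])
        if fields[0] == "add":
            added.add(record)
        else:
            deleted.add(record)
    # Records present in both sets were removed and put back unchanged.
    return {
        "deleted": len(deleted),
        "added": len(added),
        "readded": len(deleted & added),
    }
```

A "readded" count close to the size of the zone would be consistent with
a wholesale delete and re-add rather than a genuine change in content.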
Has anyone else who slaves the Spamhaus RPZ seen this? It seems like
there might be a BIND bug here with large updates to an RPZ.