100% CPU / wedge with 9.8.3-P4 & RPZ?

Sat Mar 16 13:04:27 UTC 2013

On 03/16/2013 12:43 PM, Matus UHLAR - fantomas wrote:
> On 16.03.13 11:39, Phil Mayers wrote:
>> In the last 12 hours, we've had repeated instances of named getting
>> wedged. The symptoms are:
>>
>> * named consuming nearly 100% CPU, all in user-time
>> * lots of queries apparently not processed, and based on query
>> logging, a sharp drop in the rate of queries that are
>
> do you provide recursion only for your clients?

Yes, only our clients. The nameserver is ACLed to our networks only,
i.e. it is *not* open.

>
>> * a very sharp drop (almost a complete halt, in fact) in the rate of
>> RPZ "hits" in the logs at the exact time this happens
>> * no other interesting logs, as far as I can see
>>
>> Re-starting the named process clears it.
>>
>> I can't see anything in the release notes for 9.8.4/9.8.5 - any ideas?
>>
>> This is with the Spamhaus DBL, in case it matters.
>
> do you have local copy of spamhaus DBL?

Yes, the RPZ version.

Further investigation suggests the trigger was a *massive* update to the 
zone; xfer.log has entries like:

16-Mar-2013 11:02:48.332 transfer of 'rpz.spamhaus.org/IN/main' from 
x.x.x.x#53: Transfer completed: 489 messages, 701633 records, 15709867 
bytes, 53.944 secs (291225 bytes/sec)

...and these immediately precede (by a handful of seconds) the loss of 
service.
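
For anyone who wants to check their own xfer.log for the same pattern,
here is a rough Python sketch that pulls out the "Transfer completed"
summaries and flags unusually large ones. It assumes the log format is
exactly that of the entry quoted above. The function names and the
10000-record threshold are made up for illustration and are not
anything BIND provides.

```python
import re

# Pattern for the "Transfer completed" summary, based only on the single
# xfer.log entry quoted above (assumption: other entries share this shape).
XFER_RE = re.compile(
    r"transfer of '(?P<zone>[^']+)' from \S+: Transfer completed: "
    r"(?P<messages>\d+) messages, (?P<records>\d+) records, "
    r"(?P<bytes>\d+) bytes, (?P<secs>[\d.]+) secs"
)


def parse_transfers(text):
    """Yield one dict per completed transfer found in the log text."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(text.split())
    for m in XFER_RE.finditer(flat):
        yield {
            "zone": m.group("zone"),
            "messages": int(m.group("messages")),
            "records": int(m.group("records")),
            "bytes": int(m.group("bytes")),
            "secs": float(m.group("secs")),
        }


def large_transfers(text, record_threshold=10000):
    """Return transfers with at least record_threshold records.

    The default threshold is a hypothetical value, chosen only because it
    sits well above the usual 10-200 record updates.
    """
    return [t for t in parse_transfers(text) if t["records"] >= record_threshold]


# The entry from this post, wrapped the same way it appears in the mail.
SAMPLE = """16-Mar-2013 11:02:48.332 transfer of 'rpz.spamhaus.org/IN/main' from
x.x.x.x#53: Transfer completed: 489 messages, 701633 records, 15709867
bytes, 53.944 secs (291225 bytes/sec)"""

if __name__ == "__main__":
    for t in large_transfers(SAMPLE):
        print(t["zone"], t["records"], "records,", t["bytes"], "bytes")
```

Comparing the timestamps of whatever this flags against the point where
the query rate drops should show whether the two line up on other
servers as well.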

I'm assuming that such a huge update to an RPZ triggered some sort of
bug and one thread ended up spinning. The xfr is certainly far larger
than usual, since it's about 95% of the total zone size (typical
updates arrive several times an hour, in the 10-200 record range, via
IXFR).

Examination of the journal suggests they deleted and re-added more or
less every record in the zone (presumably an error on their side).

Does anyone else slave the Spamhaus RPZ, and has anyone seen this? It
seems like there might be a BIND bug here with large updates to RPZ.