100% CPU / wedge with 9.8.3-P4 & RPZ?

Sat Mar 16 14:51:11 UTC 2013

On 03/16/2013 02:21 PM, Vernon Schryver wrote:
>> From: Phil Mayers <p.mayers at imperial.ac.uk>
>
>>>> In the last 12 hours, we've had repeated instances of named getting
>>>> wedged. The symptoms are:
>>>>
>>>> * named consuming nearly 100% CPU, all in user-time
>>>> * lots of queries apparently not processed, and based on query
>>>> logging, a sharp drop in the rate of queries that are
>
>>>> * a very sharp drop (almost a complete halt, in fact) in the rate of
>>>> RPZ "hits" in the logs at the exact time this happens
>>>> * no other interesting logs, as far as I can see
>
> How can the rate of RPZ hits not drop along with a sharp drop in the
> rate of queries?

Let me rephrase:

Successful queries dropped by 3-6x. RPZ matches dropped by a far, far 
larger fraction - essentially to zero.

In addition, I have no reason to believe the actual rate of queries 
dropped - instead, that named started to answer a subset, due to CPU load.

>> Examination of the journal suggests they deleted and re-added more or
>> less every record in the zone (presumably an error at their side).
>
> Wouldn't deleting more or less every record in the response policy zone
> tend to reduce the rate of RPZ hits?

Yes. But I would not expect named CPU to rise to 100% and stay there.

>> Does anyone else slave the Spamhaus RPZ and saw this? It seems like
>> there might be a bind bug here with large updates to RPZ.
>
> Not to defend RPZ, but what is the evidence that links RPZ to the
> problem problem, even if it is related to large updates of a zone
> instead of, for example, NSEC chains?

I see the large XFR of the RPZ zone, then seconds later the service 
falls off-net. This happened on 4 separate servers.

Conversely, incremental re-signing of ic.ac.uk/imperial.ac.uk takes 
places regularly throughout the day (we're using "auto-dnssec 
maintain"). Nothing special happened there.

I have pretty much all logging going to one file or another, and I 
couldn't find any events for about 10 minutes prior to that which seem a 
likely candidate.

> How many times did named get wedged? According to the theory that
> the problem is related to large updates of policy zones, there
> should have been at most 3 instances of wedged named processes per
> computer and they should have happened during or soon after the end
> of large rpz.spamhaus.org transfers.

We saw 4 large RPZ transfers:

22:44 - service unaffected
00:02 - named goes to 100% CPU
07:19 - service unaffected
11:02 - named goes to 100% CPU

It is difficult to tell, with the info I have to hand (logs, journal) 
but I am assuming that the pattern was either:

bulk delete
bulk add
bulk delete
bulk add

...or

bulk add
bulk delete
bulk add
bulk delete

The number of records in the XFR would tend to support that, but I 
cannot be certain since bind doesn't give me the info to know that.

>
> My logs have these instances of transfers of rpz.spamhaus.org involving
> at least 100 messages during March (NTP disciplined UTC timestamps):
>
>    02-Mar-2013 21:45:42.511 07-Mar-2013 22:47:56.423 08-Mar-2013 03:19:46.419
>    08-Mar-2013 03:26:50.262 08-Mar-2013 07:27:13.176 08-Mar-2013 07:33:29.203
>    08-Mar-2013 10:07:05.829 08-Mar-2013 11:18:09.837 15-Mar-2013 22:52:02.969
>    16-Mar-2013 00:04:14.447 16-Mar-2013 07:21:07.576 16-Mar-2013 11:06:46.515

The last 4 you have correspond to what I see.

Maybe it's not RPZ, but I don't see how it could be DNSSEC; our signed 
zones are slaved from a hidden master, and the XFRs were the usual 
trickle of resignings.

If it's not RPZ, those xfr timings are very coincidental. But of course, 
I'm just guessing. It could be phase of the moon for all I know at this 
stage.