[bind10-dev] recursor cache requirements - input required

Stephen Morris stephen at isc.org
Wed Dec 8 23:04:12 UTC 2010


Comments on the correspondence on this subject so far:

On 7 Dec 2010, at 10:32, Michal 'vorner' Vaner wrote:

> How about an interface where we would put a DNS message into the cache; it would
> split it and store only the RRsets, and we could then ask for individual RRsets?

It may not be as easy as that.  If the data in the DNS message were compressed, it would be likely that some decompression would need to occur first.  The compression pointers might point to parts of the message that are not stored.
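To illustrate the problem: compression pointers are absolute offsets into the whole message, so decoding an RRset torn out of its message can chase a pointer into data that was never stored. A minimal Python sketch of name decompression (illustrative only, not a proposed implementation):

```python
def read_name(msg: bytes, offset: int) -> str:
    """Decode a (possibly compressed) domain name from a DNS message.

    Compression pointers (two bytes whose top bits are 11) refer back to
    absolute offsets in the *whole* message, so decoding an RRset torn
    out of its message fails if a pointer target was not kept.
    """
    labels = []
    seen = set()
    while True:
        if offset >= len(msg):
            raise ValueError("pointer target outside stored data")
        length = msg[offset]
        if length & 0xC0 == 0xC0:           # compression pointer
            if offset + 1 >= len(msg):
                raise ValueError("truncated pointer")
            target = ((length & 0x3F) << 8) | msg[offset + 1]
            if target in seen:              # guard against pointer loops
                raise ValueError("pointer loop")
            seen.add(target)
            offset = target
        elif length == 0:                   # root label: end of name
            return ".".join(labels) + "."
        else:
            labels.append(msg[offset + 1:offset + 1 + length].decode("ascii"))
            offset += 1 + length
```

Decoding a name whose pointer target lies inside the stored bytes works; slicing the RRset out of its message and losing the target makes the same name undecodable.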


>> ==== Updating one rrset ====
>> The interface for updating one rrset should be provided. Note: some rrsets
>> in the cache may be updated before they expire.
> 
> We could use it to update both the data and the expiration time.
> 
> We should not update if we have authoritative data and what we got is not
> authoritative.

The rules for how to trust the different types of data are in RFC 2181 (section 5.4.1 "Ranking data")
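As a rough sketch of how that ranking might be applied at update time (the enum below is a simplified, hypothetical condensation of the RFC 2181 section 5.4.1 levels, not a complete implementation):

```python
from enum import IntEnum

class Trust(IntEnum):
    """Simplified trust ranking after RFC 2181 section 5.4.1 (higher wins)."""
    ADDITIONAL = 1          # additional section, or authority of non-auth answer
    NONAUTH_ANSWER = 2      # answer section of a non-authoritative reply
    GLUE = 3                # glue from a primary zone or zone transfer
    AUTH_AUTHORITY = 4      # authority section of an authoritative reply
    AUTH_ANSWER = 5         # answer section of an authoritative reply
    ZONE = 6                # primary zone file / zone transfer, excluding glue

def should_replace(cached: Trust, incoming: Trust) -> bool:
    # RFC 2181: cached data must not be replaced by data of lower rank;
    # data of equal or higher rank may replace it.
    return incoming >= cached
```

So an authoritative answer may overwrite cached glue, but a non-authoritative answer never displaces cached authoritative data.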


> And there should be a mechanism by which the cache can inform the NSAS that data
> in some already existing RRset has changed (the NSAS needs that; it does a little
> bit of its own caching. If its data times out, that is OK, it just asks the cache
> again, but it does not know when the cache gets different data than it had and
> needs to update).

The only updates the NSAS needs to know about are when

(a) The zone's NS RRset is updated and the update involves (i) the addition or deletion of nameservers, or (ii) modification of a nameserver address.  (i) could occur if the NS RRset in the parent did not match that in the zone; (ii) would occur if the glue in the parent did not match that in the zone.  Either way, the simplest approach would seem to be for the NSAS to mark the nameservers and associated address records as expired and to re-fetch them.
(b) Something is explicitly deleted from the cache.
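A minimal sketch of how such a notification could look (all class and method names here are hypothetical, not the actual NSAS interface):

```python
class RRsetCache:
    """Sketch: the cache notifies registered observers (e.g. the NSAS)
    when an NS RRset or glue address record changes, so the observer can
    mark its own copy expired and re-fetch.  Names are hypothetical."""
    def __init__(self):
        self._store = {}
        self._observers = []

    def register_observer(self, callback):
        self._observers.append(callback)

    def update(self, name, rrtype, rdata):
        key = (name, rrtype)
        old = self._store.get(key)
        self._store[key] = rdata
        # Only NS and address records matter to the NSAS.
        if old is not None and old != rdata and rrtype in ("NS", "A", "AAAA"):
            for notify in self._observers:
                notify(name, rrtype)

# Usage: the NSAS registers a callback that expires its cached entry.
expired = []
cache = RRsetCache()
cache.register_observer(lambda name, rrtype: expired.append((name, rrtype)))
cache.update("example.org.", "NS", ["ns1.example.org."])
cache.update("example.org.", "NS", ["ns2.example.org."])   # changed -> notify
```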


On 8 December 2010 at 07:53:13 GMT Likun Zhang wrote:

> I agree with the feature, but I would like to search the small cookie cache first; the idea is the same as in Unbound.
> Unbound has two types of rrset cache, L1 and L2. The L1 cache only includes the rrsets in the "local zone", and its entries are fixed and never removed. The L2 cache includes the rrsets that may be added or removed frequently. The L1 cache is always searched first.

We'll want some local zones to cope with Mark Andrews's draft-ietf-dnsop-default-local-zones draft.

And if we are writing a way to load a cache from some form of dump file, we could always load it with the contents of any authoritative zones we want to serve on the same system.  So we could have L1 and L2 caches, L1 containing local zones and any authoritative zones we want to serve and L2 containing the transient data.


On 8 Dec 2010, at 13:27, Shane Kerr wrote:
> On Tue, 2010-12-07 at 16:55 +0800, Likun Zhang wrote:
>> The key for one rrset in the cache should be "Domain_name + Type + Class". In
>> the value part, besides the rdata of each RR in the rrset, there should
>> be the rrset's signature (RRSIG record), if it has one, and the security
>> status (DNSSEC validation result) of the rrset.
> 
> It is very unlikely that the cache will ever be used for any class other
> than IN. I suggest that we restrict the cache to a single class, in the
> interest of saving 16-bits per entry plus associated processing time.

If we make the cache a single C++ class, we could always create other instances of it for different DNS classes - although we want to be aware of a malicious attack whereby an attacker could force the creation of 64k instances of the cache (see below).  Should we go this way, it will have an impact on the NSAS.  At present the NSAS stores all classes in one store; if we explicitly take DNS class out of the cache, we should do the same for the NSAS.
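A sketch of the per-class-instance idea with a cap to blunt the 64k-instance attack (names and the cap value are hypothetical):

```python
class ClassCacheManager:
    """Hypothetical sketch: one cache instance per DNS class, created
    lazily, with a hard cap so a flood of bogus QCLASS values cannot
    force the creation of up to 64k cache objects."""
    def __init__(self, max_classes=4):
        self._caches = {}           # class number -> cache instance
        self._max = max_classes

    def cache_for(self, dns_class: int):
        if dns_class not in self._caches:
            if len(self._caches) >= self._max:
                raise LookupError("class not served: %d" % dns_class)
            self._caches[dns_class] = {}    # a real cache object in practice
        return self._caches[dns_class]
```

With this shape the NSAS could likewise be instantiated per class rather than storing all classes in one store.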


>> ==== dumping/loading to/from one document/database ====
>> The content of the rrset cache should be dumpable to a document/database, so
>> that the rrset cache can be reused when the recursor is restarted.
>> Extra rrset expire time should be added when dumping, so that expired rrsets
>> can be ignored when loading the rrset cache from the document/database.

I'm not quite clear what "Extra rrset expire time should be added" means.  Certainly when we dump the cache we want to exclude all expired entries, and when we read the result back in we want to exclude entries that have expired since the cache was written.


> Because the cache can be quite large, we need to define the behavior
> when the cache is being dumped. I suggest that the cache should not
> block add/remove operations when this is going on.
> 
> Actually, it might be possible to act in two ways: allow add/remove
> operations for when dumping or loading during runtime, and disallow for
> faster, lock-free operation when starting up or shutting down.

For dumping, could we not fork the process and have the child write the copy of the cache at our leisure?  That would work on Unix-based systems at least.
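A sketch of the fork-based dump, assuming a Unix host where fork gives the child a copy-on-write snapshot (the JSON dump format here is illustrative, not a proposal):

```python
import json
import os

def dump_cache_forked(cache: dict, path: str) -> None:
    """Sketch of the fork-and-dump idea (Unix only): the child inherits a
    copy-on-write snapshot of the cache and writes it out while the parent
    keeps serving queries and mutating its own copy."""
    pid = os.fork()
    if pid == 0:
        # Child: write the snapshot and exit without running parent cleanup.
        with open(path, "w") as f:
            json.dump(cache, f)
        os._exit(0)
    # Parent: in a real server we would reap the child asynchronously;
    # here we simply wait for it.
    os.waitpid(pid, 0)
```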

As to loading, it depends on how long loading will typically take.  If a short time, we can construct the cache before we begin operations; if a long time, a separate component that works through the dump in the background and does an "add if not already here and if the stored data has not yet expired" should work.
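The background-load rule ("add if not here and not expired") might look like this sketch, assuming a hypothetical dump format of absolute expiry times keyed by RRset:

```python
def load_dump(cache: dict, dump: dict, now: float) -> None:
    """Merge a dump into a live cache, adding an entry only if it is not
    already present and its stored absolute expiry time has not passed.
    The dump format is hypothetical: {key: (expiry_unix_time, rdata)}."""
    for key, (expiry, rdata) in dump.items():
        if key in cache:
            continue            # live data wins over the dump
        if expiry <= now:
            continue            # expired while the server was down
        cache[key] = (expiry, rdata)
```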


> We may also want a way to completely empty the cache. If nothing else
> this can be useful in debugging. :)

If the cache is a single object, how about deleting it and creating a new instance?


> Regarding the expiration time... we may need to be careful about this.
> For example, if we have the following messages:
> 
> -[ message 1 ]------------------
> FOO.EXAMPLE CNAME BAR.EXAMPLE
> BAR.EXAMPLE A 1.2.3.4
> --------------------------------
> 
> -[ message 2 ]------------------
> BAR.EXAMPLE A 1.2.3.4
> --------------------------------
> 
> Then an update to the RRSET message 2 might effectively change the TTL
> of message 1, if the TTL is made longer.

We will have to know where all the TTL fields in the message are because we will need to change them for every message that gets sent out.  So rather than deleting the message when the minimum TTL is reached, assume the message is OK and work through it, updating the TTLs.  If any have dropped below zero, discard the message and re-query to get a new one.
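A sketch of that TTL pass, assuming the cache records the original TTL values alongside the time the message was stored:

```python
def refresh_ttls(ttls, stored_at: float, now: float):
    """Recompute the TTL fields of a cached message before sending it.

    Subtract the elapsed time from each stored TTL.  If any TTL would
    drop below zero the message is stale: return None so the caller
    discards it and re-queries instead of sending it."""
    elapsed = int(now - stored_at)
    updated = [ttl - elapsed for ttl in ttls]
    if any(ttl < 0 for ttl in updated):
        return None
    return updated
```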


>> the value for one message should include message header, index information
>> for each rrset in different message sections. For the structure, see the
>> following sketch. The security status(dnssec validation result) of the
>> message should also be noted.
> 
> It might be nice to include versions of this data with name compression
> too, right? To avoid having to perform this processing again.

The compression technique involves pointers to absolute offsets in the message.  If an RRset is found in more than one message, we will (potentially) have to keep one compressed version of the RRset for each message it is in.  It may well be simpler to cache the wire format of each message as-is and create associated index structures mapping it to the constituent RRsets (and vice versa).  Then if any of the RRsets is updated (other than a TTL update that we can explicitly write back into the message), drop the message and re-query.
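A sketch of that cross-indexing, with hypothetical names; a real implementation would patch TTL-only changes into the wire form rather than dropping the message:

```python
class MessageCache:
    """Sketch: store each message's wire form as-is and keep two maps,
    message -> RRset keys and RRset key -> messages.  When an RRset
    changes, every message containing it is dropped."""
    def __init__(self):
        self._messages = {}         # msg_id -> wire bytes
        self._msg_rrsets = {}       # msg_id -> set of rrset keys
        self._rrset_msgs = {}       # rrset key -> set of msg_ids

    def add(self, msg_id, wire, rrset_keys):
        self._messages[msg_id] = wire
        self._msg_rrsets[msg_id] = set(rrset_keys)
        for key in rrset_keys:
            self._rrset_msgs.setdefault(key, set()).add(msg_id)

    def rrset_updated(self, key):
        # Drop every message that contains the changed RRset.
        for msg_id in self._rrset_msgs.pop(key, set()):
            del self._messages[msg_id]
            for other in self._msg_rrsets.pop(msg_id):
                if other != key:
                    self._rrset_msgs[other].discard(msg_id)

    def get(self, msg_id):
        return self._messages.get(msg_id)
```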


On 8 Dec 2010, at 13:39, Michal 'vorner' Vaner wrote:
> The expiration in NSAS might be only sooner, in which case it will just update
> its data from cache when needed.
> 
> But there are two things. First, it seems a waste not to use newer data if we
> have them. Second, what happens if there's misconfigured glue somewhere. The
> NSAS stores the (wrong) glue, but if it is not wrong too much, we reach some of
> the nameservers eventually. It returns answer with authority, so we fix the glue
> in cache, but NSAS still has the wrong glue.

See above.  If there is an update to a NS RRset or a record marked as glue, the NSAS should be informed.  In this way the NSAS is kept synchronized with the main cache.


>> I agree with the feature, but I would like to search the small cookie cache first; the idea is the same as in Unbound.
>> Unbound has two types of rrset cache, L1 and L2. The L1 cache only includes the rrsets in the "local zone", and its entries are fixed and never removed. The L2 cache includes the rrsets that may be added or removed frequently. The L1 cache is always searched first.
> 
> I don't see why. If the cache is a hash table, then we need to compute the hash
> and then look into a given position in an array. It takes the same amount of
> time whether the array is small or large.
> 
> And, "local zone" seems to be some kind of zone served by this server. I need to
> put remote data there as well.
> 
> The reason I'd like to search the large one first is that the cache might have
> newer (better/authoritative) data than the small one. I want to use the better
> data from the large cache.

I think the idea is that the sets of data in each cache are disjoint.  The L1 cache contains local zones (and authoritative data).  L2 contains everything else.


> What is the advantage of searching the small one first?

The L1 cache is relatively static; as updates are infrequent, you could have a single lock for the entire cache instead of locks at a finer granularity.  This would make access faster.
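A sketch of the disjoint two-tier lookup with a single coarse lock on the read-mostly L1 (names hypothetical; a real L2 would use finer-grained locking):

```python
import threading

class TwoTierCache:
    """Sketch of the L1/L2 split: L1 holds local and authoritative zones
    and is read-mostly, so one coarse lock suffices; L2 holds transient
    data.  The two sets are assumed disjoint, and L1 is searched first."""
    def __init__(self):
        self._l1 = {}
        self._l1_lock = threading.Lock()   # single lock: updates are rare
        self._l2 = {}                      # finer locking in a real server

    def add_local(self, key, rdata):
        with self._l1_lock:
            self._l1[key] = rdata

    def add(self, key, rdata):
        self._l2[key] = rdata

    def find(self, key):
        with self._l1_lock:
            if key in self._l1:
                return self._l1[key]
        return self._l2.get(key)
```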


On 8 Dec 2010, at 19:31, Jerry Scharf wrote:
> I like Shane's idea of one class per cache. Since the cache is not a once-per-server entity (each view requires its own cache), if someone wants Hesiod queries (the only other class I know of), build a separate cache for that. It will decrease the size and increase the speed of each class in this case.

The actual IANA DNS Class registry is:

Registry:     
Decimal      Hexadecimal    Name                            Reference
-----------  -----------    ------------------------------  ---------
0            0x0000         Reserved                        [RFC5395]
1            0x0001         Internet (IN)                   [RFC1035]
2            0x0002         Unassigned                      
3            0x0003         Chaos (CH)                      [Moon1981]
4            0x0004         Hesiod (HS)                     [Dyer1987]
5-253        0x0005-0x00FD  Unassigned                    
254          0x00FE         QCLASS NONE                     [RFC2136]
255          0x00FF         QCLASS * (ANY)                  [RFC1035]
256-65279    0x0100-0xFEFF  Unassigned                      
65280-65534  0xFF00-0xFFFE  Reserved for Private Use        [RFC5395]
65535        0xFFFF         Reserved                        [RFC5395] 

Unless we filter the class request, we have to cope with the possibility of creating up to 65,536 caches - hence my comment above.


> I do not think you should keep any message in the cache. I have been involved in
> more than one recursive server and have never seen the response message from the
> upstream cache fill, or the answer created for the client, kept. You always have
> to regenerate the answer from the validated data in the cache to prevent cache
> poisoning attacks, so keeping the upstream message is a bad idea. You also have
> situations where the answer from before is no longer correct. (A classic example
> of this is that the first answer used the glue NS rrset for the authority data
> and that rrset was later replaced by the authoritative zone NS rrset, or a new
> query within the zone included a new NS rrset in its authority section.) You
> need to build everything you need from the current cache information.

I appreciate that, yet I would think that in some situations - queries for www.google.com for example - a pre-formatted reply that can be sent immediately could be a big performance improvement.

Stephen



