[bind10-dev] NSAS Using Authority/Additional Information?

Michal 'vorner' Vaner michal.vaner at nic.cz
Tue Nov 30 11:24:03 UTC 2010


Hello

On Mon, Nov 29, 2010 at 05:27:02PM +0000, Stephen Morris wrote:
> > The usual situation is that the one asking us to provide an IP address is the resolver. So it tells us to give it an IP address of some nameserver of example.net. Because we do not know example.net yet, we ask the resolver to give us NS records of example.net. And the resolver will ask us to provide the IP address of example.net, to ask what its nameservers are. This does not loop infinitely, as the example.net entry exists and is marked as IN_PROGRESS (waiting for data), so the callback is just stored. But we do not get any data and, worse, we do not timeout, because timeouts are on network operations, not on running code, and nothing here communicates by a network, so it does not create timeouts.
> 
> As I understand the asynchronous I/O we're using, there is no reason why you can't use separate timers.  So all requests could have a timeout associated with them.

It is possible. However, I do not think it makes much sense to put a timer on
something that is expected to be resolved locally only. The network really needs
a timeout, but having a timer running for practically everything in the system
seems like both an overhead and an extra complication (when the timer fires, I do
not know what state the request is in, e.g. whether it has already been answered,
which can actually happen if the callback queue is long).

So it seems simpler not to need them. Or is there any real reason to have
timeouts on everything?
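For illustration, if we did put timers on local requests, the "answered already" race would need an explicit per-request state check when the timer fires. A minimal sketch of that bookkeeping (all names here are illustrative, not the real NSAS or asio interfaces):

```cpp
#include <functional>
#include <memory>

// Hypothetical sketch: a per-request timeout must cope with the request
// having already been answered by the time the timer fires.
enum class State { IN_PROGRESS, ANSWERED, TIMED_OUT };

struct Request {
    State state = State::IN_PROGRESS;
    std::function<void(bool)> callback;   // true = success, false = timeout
};

// Called when the answer arrives through the callback queue.
inline void deliverAnswer(const std::shared_ptr<Request>& req) {
    if (req->state != State::IN_PROGRESS) {
        return;                 // timer already fired, drop the late answer
    }
    req->state = State::ANSWERED;
    req->callback(true);
}

// Called when the (hypothetical) timer expires.
inline void onTimeout(const std::shared_ptr<Request>& req) {
    if (req->state != State::IN_PROGRESS) {
        return;                 // answered first; nothing to do
    }
    req->state = State::TIMED_OUT;
    req->callback(false);
}
```

This is exactly the extra complexity the paragraph above refers to: every request needs a state field and every completion path needs to re-check it.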

> > Still, it is not bullet-proof. There might be a zone with single nameserver which does not have any IP address. In that case it is unreachable, but the cache can not assume anything from seeing empty additional section. So it does not know there are no IP addresses, so it will not provide them and the resolver will try to fetch them, asking us for IP address.
> 
> This should not be an issue - the system is designed to cope with it.  Let's assume this situation: a misconfigured parent zone with a single NS record:

My expectation was that I would not have a timeout on a response from the
resolver; if the resolver needs to communicate over the network, it would apply
one itself. In that case I would need to tell it "never ask anything outside",
which means it would never ask the NSAS.

It seems to me this is easier to code, and the negative response would be
returned faster (without waiting for a timeout).

> 9) The resolver receives the request for the A record of ns.example.net.  For this it requires the address of a nameserver in example.net. so sends another call to the NSAS for this information passing another "resolver callback" object.
> 
> [Note - this should be detectable.  A requirement on the resolver is that it does not issue multiple outstanding queries for the same information.  If the same logic is applied to NSAS callbacks, the duplicate call should be detected.  Such an optimisation affects the detail of what follows but not the outcome.]

That is not exactly a duplicate: one query was for A www.example.net, the other
for A ns.example.net. And I think we do want to allow the resolver to register as
many callbacks as it wants (otherwise it would need to implement its own
multiplexing of callbacks). Real duplicates will need to be detected by the
resolver, not by us (I guess the NSAS should not inspect the callbacks; they
might not be resolver callbacks anyway, at least in tests).
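The resolver-side duplicate detection could be as simple as keying outstanding queries by (qname, qtype) and attaching further callbacks to an in-progress entry instead of issuing a second query. A sketch, with hypothetical names (this is not the real resolver interface):

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// (qname, qtype) identifies a query; two requests with the same key are
// duplicates, ones differing in either field (e.g. A www.example.net vs.
// A ns.example.net) are not.
using QueryKey = std::pair<std::string, uint16_t>;
using Callback = std::function<void()>;

class OutstandingQueries {
public:
    // Returns true if this is a new query that must actually be sent;
    // false if it was merely attached to an already outstanding one.
    bool addQuery(const QueryKey& key, Callback cb) {
        auto it = pending_.find(key);
        if (it != pending_.end()) {
            it->second.push_back(std::move(cb));
            return false;
        }
        pending_[key].push_back(std::move(cb));
        return true;
    }

    // Run and discard all attached callbacks when the answer arrives.
    void answer(const QueryKey& key) {
        for (auto& cb : pending_[key]) { cb(); }
        pending_.erase(key);
    }

private:
    std::map<QueryKey, std::vector<Callback>> pending_;
};
```

This keeps the multiplexing in one place, so the NSAS never has to inspect or compare callbacks.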

> > This can be solved by providing a CACHE_ONLY flag to the resolver (assuming it will have one), forcing it not asking anything remote and provide fail right away if the cache does not have the data.
> > 
> > Such flag would allow us to do a first-round over the nameservers and fill the IP addresses we already have right in the initialization, then start fetching at most 2 IP NSs at once externally.
> 
> This won't work if the nameservers for a zone are in different zones.  Suppose the NS records are:

Hmm, you are right. And it still wouldn't solve the problem of two
cross-referencing zones (each with a nameserver in the other one and no glue
provided).

We must think of something else then. But it still seems like an unclean idea to
leave a callback there. Maybe we will be able to detect some of the problems
described here sooner. Or maybe they do not happen much in real life, so we
should only take care not to leak memory in such cases.

But CACHE_ONLY could still be used to fetch all the glue information at creation
time, failing for the nameservers that have none. Then a second round could get
the rest remotely. Without that flag, we just pick 2 nameservers, ask for their
IP addresses and wait. If we are unlucky, we might have chosen the two that are
out of zone, so they trigger external queries, and we wait for them to answer
before we ask for the glue in the cache. But it is probably too soon to think
about optimising this (anything real has the glue anyway).
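The two-round idea could look roughly like this; the `Mode` flag and the lookup hook are assumptions standing in for the eventual resolver interface:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// CACHE_ONLY: answer from cache/glue only, fail fast otherwise.
// FULL: allowed to go to the network.  (Hypothetical flag names.)
enum class Mode { CACHE_ONLY, FULL };

// Hypothetical resolver hook: returns true if an address was obtained.
using Lookup = std::function<bool(const std::string& ns, Mode)>;

// First round: fill every nameserver's address from cache/glue only.
// Second round: fetch at most two of the remaining ones remotely.
std::vector<std::string>
fillAddresses(const std::vector<std::string>& nameservers, const Lookup& lookup) {
    std::vector<std::string> missing;
    for (const auto& ns : nameservers) {
        if (!lookup(ns, Mode::CACHE_ONLY)) {
            missing.push_back(ns);
        }
    }
    const std::size_t limit = std::min<std::size_t>(2, missing.size());
    for (std::size_t i = 0; i < limit; ++i) {
        lookup(missing[i], Mode::FULL);   // may trigger an external query
    }
    return missing;   // returned here only so the behaviour is observable
}
```

The point of the first round is that glue we already hold is filled in immediately, so the remote round only ever touches nameservers that genuinely lack glue.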

> This is I think the bit with the biggest uncertainty.  I presume that the cache will be able to detect when an NS RRset has changed so presumably it will only update the NSAS when this occurs.

The detection will be no problem, I think. I'm a little bit worried about the
notifications, since the nameserver might already be out of the LRU list and
hash table yet still used by a zone. So we either need to ignore this case, do
something on TTL expiration, or have some linking there.

> The simplest thing to do would be to delete the zone entry and recreate it anew.  When not pointed to by any zone entries, nameserver entries in the NSAS will remain in existence until they fall off the end of the LRU table.  So if we delete a zone entry and recreate it, there is a good chance that we will reacquire the same nameserver entries and by implication the same address entries with the up to date RTT information.  (This chance is improved - but not certain - if we create the new zone entry first and replace it.  The reason it is not certain is because the nameserver entry could have been removed from both the hash table and LRU list and is only being kept in existence by the pointer from the zone entry object.)  If not, then the RTT information will have to be rebuilt from scratch.

I do not think we need to recreate it. Just renewing seems better, as it
preserves information. Sure, that's more code, and we could do the recreation
for now.

Would it be a problem to have an LRU list only for zones, with nameservers
removed from the hash when no zone points to them? That way we would have fewer
nameserver entries than now (there would be no duplicates), we could always
locate everything we have, and we wouldn't have problems with losing information
too soon.

The problem there would be removing them from the hash correctly in the
destructor, but some clever locking should solve that.
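The destructor-removal race could be handled by keeping the hash as weak references and re-checking liveness under the table lock before erasing, since another thread may look the entry up between the reference count reaching zero and the lock being taken. A sketch of that idea, with `std::string` standing in for the real NameserverEntry class (all names hypothetical):

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Zones hold shared_ptrs to the entries; the hash table only holds
// weak_ptrs, so the last zone dropping its pointer destroys the entry.
class NameserverTable {
public:
    std::shared_ptr<std::string> lookup(const std::string& name) {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = table_.find(name);
        return (it == table_.end()) ? nullptr : it->second.lock();
    }

    void insert(const std::string& name,
                const std::shared_ptr<std::string>& entry) {
        std::lock_guard<std::mutex> guard(lock_);
        table_[name] = entry;
    }

    // Called from the entry's destruction path: erase only if nobody
    // re-inserted a live entry under the same name in the meantime.
    void removeIfDead(const std::string& name) {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = table_.find(name);
        if (it != table_.end() && it->second.expired()) {
            table_.erase(it);
        }
    }

private:
    std::mutex lock_;
    std::unordered_map<std::string, std::weak_ptr<std::string>> table_;
};
```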

> At this point though, it is perhaps worth considering one (related) change to the NSAS data store.  At present zones and nameservers are accessed via a hashtable, but addresses are pointed to solely by the nameserver entries.  So if an address is referenced by two or more nameservers there will be multiple (independent) entries for it in the NSAS.  (On reflection I see that there was an implicit assumption that multiple names pointing to an address are unlikely. This is because the most usual case with multiple names is that the names are CNAMEs for a single name that points to a single address.)  If we were to add addresses to their own hash table and LRU list (this should be a minimal change to the code - the LRU list and hash table classes are templates so should adapt easily to the Address Entry class) and check the hash table when adding an address, there will only ever be one entry in the NSAS for any given address.

Does it bring anything other than not having duplicate addresses? I do not see
much benefit there.

Just a guess, but I think the assumption is not far from the truth. Some
statistics would be better, but if only a few nameservers have multiple names,
is it worth adding overhead to the usual case? Currently, the addresses are in
one array, which is consecutive in memory (a vector is just an array inside).
If we do another lookup in the table for each address, we get at least 2 extra
memory fetches at some random address per address (one for the address object,
one for the reference count; shared_ptr templates must store it somewhere
else), each one possibly with a TLB miss, therefore each taking approximately
100 processor ticks. That would be a speed penalty of at least 1+n times the
current implementation, where n is the number of addresses.

Not to mention that it would be more complicated code (passing the hash tables
and lists one level deeper through the structure, for one).


If it were common for a nameserver to have multiple names, I would see the
benefit. But then, would it be possible to link multiple names to a single
nameserver entry? I know it is hard to guess that, and the current
implementation of the hash table does not allow it.
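To make the locality argument above concrete, here are the two layouts side by side. `AddressEntry` is a stand-in struct, not the real class; the point is only that a scan over the by-value vector touches consecutive memory, while the shared_ptr variant adds one indirection per element plus the separately allocated control block holding the reference count:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Stand-in for the real address entry (illustrative only).
struct AddressEntry {
    uint32_t address;   // the IP address
    uint32_t rtt;       // round-trip time estimate
};

// Current layout: one contiguous array; a scan streams through memory.
inline uint64_t totalRttInline(const std::vector<AddressEntry>& v) {
    uint64_t sum = 0;
    for (const auto& e : v) { sum += e.rtt; }
    return sum;
}

// Deduplicated layout: each element dereference may land on a random
// heap address, and each shared_ptr also drags in its control block.
inline uint64_t
totalRttShared(const std::vector<std::shared_ptr<AddressEntry>>& v) {
    uint64_t sum = 0;
    for (const auto& p : v) { sum += p->rtt; }
    return sum;
}
```

Both compute the same result; the difference is purely in how many distinct cache lines (and TLB entries) the traversal touches.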

> > Another problem is, we assume that resolver is willing to provide data that it knows is unauthoritative. But this can be solved simply by adding UNAUTHORITATIVE_OK flag.
> 
> I think that most resolvers will do that anyway; if it does not have authoritative data, it will return any data is has.

Even without trying to reach an authoritative source? I think they shouldn't.

> > This might be solved for example by providing some kind of cache cookies. When data are put into the cache, it would return a cookie and having such a cookie would guarantee that the cache is able to provide at least the data passed to it. (Technically, the easiest way to do this functionality is to put a shared pointer to the data into the cookie, and the cache would look first into itself, then into the cookie if not found.)
> 
> An interesting case.
> 
> Assuming glue were given but that the cache time is set to 0 no information is cached.  Therefore when the NSAS queries for the NS records for the zone in question, the resolver make an explicit query for the NS RRset. But after that we would be in the same situation as described above and queries would ultimately timeout.

Yes, that's why I want to propose the cookie. It would hold a small piece of the
cache, and when passed with the request, the cache would fall back to it. The
cookie would not be subject to eviction, so 0-TTL data would survive; it would
be used just for the request that got the cookie, and when the cookie is
dropped, the held data's reference count would drop to 0.
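A minimal sketch of that cookie mechanism, with hypothetical names and a plain string standing in for an RRset:

```cpp
#include <map>
#include <memory>
#include <string>

using RRsetPtr = std::shared_ptr<std::string>;   // stand-in for an RRset

// The cookie pins a shared pointer to the data, so even 0-TTL records
// survive eviction for the one request holding the cookie; dropping the
// cookie drops the last reference.
struct Cookie {
    std::map<std::string, RRsetPtr> pinned;      // private slice of cache
};

class Cache {
public:
    // Store data; 0-TTL data is not cached at all, but the returned
    // cookie still guarantees it can be served back for this request.
    Cookie store(const std::string& name, const RRsetPtr& data,
                 unsigned ttl) {
        Cookie cookie;
        cookie.pinned[name] = data;
        if (ttl > 0) {
            cache_[name] = data;
        }
        return cookie;
    }

    // Look in the cache first, then fall back to the cookie (if any).
    RRsetPtr lookup(const std::string& name, const Cookie* cookie) const {
        auto it = cache_.find(name);
        if (it != cache_.end()) {
            return it->second;
        }
        if (cookie != nullptr) {
            auto cit = cookie->pinned.find(name);
            if (cit != cookie->pinned.end()) {
                return cit->second;
            }
        }
        return nullptr;
    }

private:
    std::map<std::string, RRsetPtr> cache_;
};
```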

> > 	• Cache needs to be able to provide a way to store TTL and make it available to one exact NSAS query only. For example by cookies.
> 
> I don't think it needs to do this.  As you pointed out elsewhere, if we do store the TTL in the NSAS, the overhead is a single comparison for expiry time.  So the TTL can be passed with any RRset data.

Maybe I wrote this wrong, sorry. It should have said data with a 0 TTL.

> > 	• Resolver interface should provide a way to ask for an RRset. It is not really required, but passing the RRset is probably better than constructing a response and then parsing it again.
> 
> Absolutely!  The resolver interface should also allow for asynchronous I/O.

That one will be easy to change. I'll start with that :-).

Thanks for the discussion. I hope it will improve the design and make it simpler
:-).

Have a nice day

-- 
"It can be done in C++" says nothing. "It can be done in C++ easily" says nobody.

Michal 'vorner' Vaner