[bind10-dev] More ideas about high-performance data-source backend (warning: long text)

Jerry Scharf scharf at isc.org
Wed Oct 13 07:17:56 UTC 2010


Michal,

A couple of comments about this that I hope are useful. Some are 
considered and others are just random thoughts.

First, cache coherence becomes even more critical with multiprocessor 
systems, where execution units contend for the same memory bandwidth and 
cache data.

Second, if I do some counting with your scheme, I get that you need 1 
byte for flags and 8 bytes for a pointer (assuming most high-performance 
systems will be 64-bit), which leaves only 7 bytes in your small vertex 
for label text. Past experience has shown that people like long, shallow 
names rather than deep names with short labels. How many names begin with 
batman and the like? Also remember that we have the misspellings that 
either hit alternate sites trying to steal views or create negative cache 
entries. Does the vertex text start at the beginning of the label or at 
the disambiguation point of the parent vertex?
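
To make the counting concrete, here is roughly what I am picturing for one of
your small vertices; the field names are just placeholders, not anything from
your proposal:

    #include <stdint.h>

    // One small vertex packed into a single 16-byte cache line.
    struct SmallVertex {
        uint8_t  flags;    // 1 byte of flags
        char     text[7];  // only 7 bytes left for label text
        uint64_t child;    // 8-byte native pointer on a 64-bit host
    };                     // 1 + 7 + 8 = 16 bytes, one cache line

Seven bytes of text per vertex fills up very quickly with the long, shallow
names mentioned above.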

Finally, I don't have numbers that are current, but in a recursive 
server a significant part of the cache is filled with names that will be 
looked up and used once and expire before ever being queried again. 
Perhaps someone has some real numbers for that percentage on a large 
recursive server. This means that the ratio of fetches to inserts is 
lower than one might first expect.

This raises a slightly different question: what would be the overall cost 
of having a smaller cache of busy labels and a larger cache of 
infrequently used labels? I realize this may add massive complexity, but 
it could also allow one to have different data models for the names that 
are queried once and never used again versus the 5-10% of names that will 
produce the bulk of the cache hits. Another thing for the active labels is 
that you can deal with the fact that they often have short TTLs: leave 
them in the cache even after the TTL expires, so they only get dropped by 
LRU. This would keep these busy names from being popped out of the cache 
accidentally during a garbage collect.
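
To sketch the policy I mean (all names here are made up, this is only an
illustration, not a proposed implementation): a small cache of busy names in
front of the big store, where the busy entries are dropped only by LRU and
never because their TTL ran out.

    #include <ctime>
    #include <list>
    #include <map>
    #include <string>
    #include <utility>

    struct Entry {
        std::string rdata;
        std::time_t expires;   // kept so a stale busy entry can be refreshed,
                               // but never used to evict it
    };

    // Small cache of busy names, evicted purely by LRU.
    class BusyCache {
    public:
        explicit BusyCache(std::size_t capacity) : capacity_(capacity) {}

        Entry* find(const std::string& name) {
            Map::iterator it = map_.find(name);
            if (it == map_.end()) {
                return 0;                                  // fall back to the big cache
            }
            lru_.splice(lru_.begin(), lru_, it->second);   // mark as recently used
            return &it->second->second;                    // may be past its TTL
        }

        void insert(const std::string& name, const Entry& entry) {
            Map::iterator it = map_.find(name);
            if (it != map_.end()) {                        // refresh an existing busy name
                it->second->second = entry;
                lru_.splice(lru_.begin(), lru_, it->second);
                return;
            }
            lru_.push_front(std::make_pair(name, entry));
            map_[name] = lru_.begin();
            if (lru_.size() > capacity_) {                 // LRU is the only eviction path
                map_.erase(lru_.back().first);
                lru_.pop_back();
            }
        }

    private:
        typedef std::list<std::pair<std::string, Entry> > Lru;
        typedef std::map<std::string, Lru::iterator> Map;
        std::size_t capacity_;
        Lru lru_;
        Map map_;
    };

The big cache behind it can keep the existing TTL/garbage-collection
behaviour; only names that keep getting hit need to be promoted into the busy
cache.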

The dual cache would also add something useful for keeping a persistent 
cache store across reboots and the like. You would only want to dump the 
active store for reload.

jerry

On 10/07/2010 06:43 AM, Michal 'vorner' Vaner wrote:
> Hello
>
> I did some thinking (and some thinking aloud in the company of my teacher) about the
> high-performance data backend and how to speed up the tree on real hardware.
> There are some more ideas since the last email.
>
> But before I start, I would like to ask a few questions, because the rest is quite
> long and I do not expect everyone to keep reading until the very end.
> . In which state is the data-source API now? Is it possible to start a new
>    backend?
> . Jinmei, do you think I could use some of the code from the experiment you
>    showed at the face-to-face meeting? Or could we cooperate on it somehow?
> . With the team split, is it the right time to start this, or should I wait?
> . Would it be OK if I write this? I have two reasons for wanting to do that. One is
>    that explaining the ideas in email probably still loses some information, and I
>    have the best idea of what I mean. The second is that I'd like to use it to
>    measure some speed improvements for my thesis, which requires me to write it
>    myself (but as the code wouldn't actually be part of the thesis, just the numbers
>    I get from it, I hope ISC holding the copyright shouldn't be a problem, while
>    someone else writing it would be).
> . Do you think it makes sense to write this at all? It is a little bit more
>    complicated than a usual tree, but I think it should be faster and allow realtime
>    updates without much speed penalty on the lookup side. Is lookup the bottleneck
>    of authoritative server speed, or should we be optimising somewhere else?
>
> Assumptions (if you do not believe them, I can find some sources to support them):
> . Memory access is a lot slower (an order of magnitude) than cache access.
> . If we access a byte of memory, the whole cache-line is pulled into the cache,
>    therefore the rest of the cache-line comes "for free".
> . A TLB miss is expensive too (looking up a physical address may involve several
>    memory accesses, depending on what is in the cache).
>
> Because it is easier to think with concrete numbers, I'll assume that the cache-line
> size is 16 bytes and the page size is 4096 bytes, but it should, in principle, work
> with different numbers too.
>
> So, what I want is: when I access a byte, I'd like to have as much useful
> information in the same cache-line as possible, because I get it for free.
> Furthermore, when I access a page, I want to make as many accesses within that
> page as possible before I go to another (while its address is still in the TLB).
>
> I'll leave the real data out for now, and just keep the keys (names) and
> pointers to the data, for simplicity.
>
> The idea is a dual-level something-like-a-b-tree. There are small vertices,
> each of them aligned to the start of a cache-line and taking up to 16 bytes. These
> form trees, each of which is a vertex of the outer tree (the bigger vertex would
> have a large fan-out). Each bigger vertex would similarly live inside a single page.
>
> You may notice that there are up to 256 small vertices inside the big one
> (4096 / 16), which leads to the idea that addressing inside the big vertex can be
> done with a single-byte index, to save space in the small vertices.
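>
> To make that concrete, the shape I have in mind is roughly this (illustrative
> only, the names are placeholders):
>
>     enum { CACHE_LINE = 16, PAGE_SIZE = 4096,
>            SMALL_PER_BIG = PAGE_SIZE / CACHE_LINE };   // 256
>
>     // One small vertex: everything packed into a single cache line.
>     struct SmallVertex {
>         unsigned char bytes[CACHE_LINE];   // flags, key strings and pointers
>     };
>
>     // One big vertex: a page-sized, page-aligned array of small vertices.
>     struct BigVertex {
>         SmallVertex small[SMALL_PER_BIG];  // small[0] is the root of the inner tree
>     };
>
>     // A single byte is enough to address any small vertex inside a big one.
>     typedef unsigned char SmallIndex;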
>
> Another observation is that if I put the strings I compare (bytestrings actually,
> but let's call them strings) outside of the vertices, I get additional memory
> accesses and possibly TLB lookups. So we put the strings (or parts of them)
> into the vertices. Each key in the vertex would be the smallest prefix that divides
> the sons (therefore, if there are many different strings, the keys are short; if
> the strings have common prefixes, the keys get longer).
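>
> To illustrate the "smallest dividing prefix": assuming the usual b-tree situation
> where `left' is the largest string in the left son and `right' the smallest one in
> the right son (so left < right), something like this would compute the key:
>
>     #include <string>
>
>     // The shortest prefix of `right' that still divides the two sons,
>     // i.e. left < prefix <= right.
>     std::string dividingPrefix(const std::string& left, const std::string& right) {
>         std::string::size_type i = 0;
>         while (i < left.size() && left[i] == right[i]) {
>             ++i;                           // the shared prefix divides nothing
>         }
>         return right.substr(0, i + 1);     // one more byte is enough to divide them
>     }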
>
> So, what does the small vertex look like? It needs some flags (like whether there
> are any data or it is just a vertex on the path). Then, if I store the length of a
> string, 4 bits is enough (a maximum length of 16). If I need a pointer, I need to
> distinguish whether it is internal (therefore a single byte) or external, pointing
> to another big vertex or to data. That's a single bit. Or, because pointers on the
> target machines will probably be 64 bits long, we could add another bit and have
> one- and two-byte relative pointers counted in pages (not bytes) to save more space.
>
> There's a little problem with the alignment of pointers: some architectures do not
> like unaligned data at all, and some are just incredibly slow working with it. So
> let's put all native pointers at the end of the vertex, the bit fields with
> lengths and flags and such at the beginning, and the strings in the middle.
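>
> Spelled out, one possible encoding of the per-key metadata byte (again just an
> illustration of the idea, not a fixed format):
>
>     // bit 7     : this key has data attached (vs. a path-only entry)
>     // bit 6     : the pointer is internal (a 1-byte index into this big vertex)
>     // bit 5     : an external pointer is a 1- or 2-byte page-relative offset
>     // bits 0..3 : length of the key string (stored as length - 1, giving 1..16)
>     enum SmallVertexFlags {
>         KEY_HAS_DATA    = 0x80,
>         PTR_INTERNAL    = 0x40,
>         PTR_SHORT_PAGES = 0x20,
>         KEY_LENGTH_MASK = 0x0f
>     };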
>
> If the data are dense (many distinct strings) and they differ at the first
> character, we get a nice large fan-out (up to something like 6 or 7) for a single
> cache-line and memory access. If they have long common prefixes, then the
> fan-out is lowered to make the strings fit.
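>
> As a back-of-the-envelope check of that fan-out (illustrative numbers only):
>
>     // Dense case: per key, 1 flag/length byte + 1 string byte = 2 bytes,
>     // plus 1 byte per child for an internal index.  With k keys and k+1 children:
>     //     2*k + (k + 1) <= 16   =>   k <= 5,
>     // i.e. a fan-out of about 6; squeezing out 7 needs slightly tighter packing,
>     // and longer common prefixes lower the fan-out accordingly.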
>
> The big vertex would be just a bunch of small vertices; the one with index 0
> would be the root.
>
> Balancing would be done in a manner similar to a b-tree. Some nice optimisations
> of the tree are possible, like choosing the right place to split
> left/right to minimise the string lengths, or better packing/use of the space
> in a small vertex.
>
>
> If we decide to have it multi-threaded with live updates (which I think should
> still be possible), I guess it would be better to use a read-write lock (as
> there would probably be more concurrent reads and fewer writes anyway) in each
> big vertex (somewhere at the beginning of the page).
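>
> A minimal sketch of what I mean, using a plain pthreads lock just for
> illustration:
>
>     #include <pthread.h>
>
>     // Header at the start of each page-sized big vertex: lookups take the
>     // lock shared, updates take it exclusive, so an update blocks readers
>     // of this one big vertex only.
>     struct BigVertexHeader {
>         pthread_rwlock_t lock;
>         // ... the small vertices occupy the rest of the page ...
>     };
>
>     void lookupIn(BigVertexHeader* big) {
>         pthread_rwlock_rdlock(&big->lock);
>         // ... walk the small vertices inside this big vertex ...
>         pthread_rwlock_unlock(&big->lock);
>     }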
>
> Thanks for reading such a long email, and for any suggestions and opinions.
>
> With regards
>