[bind10-dev] More ideas about high-performance data-source backend (warning: long text)

Michal 'vorner' Vaner michal.vaner at nic.cz
Thu Oct 7 13:43:01 UTC 2010


Hello

I did some thinking (and thinking aloud in the company of my teacher) about the
high-performance data backend and how to speed up the tree on real hardware.
There are a few more ideas since the last email.

But before I start, I would like to ask a few questions, because the rest is
quite long and I do not expect everyone to keep reading until the very end.
• In which state is the data-source API now? Is it possible to start a new
  backend?
• Jinmei, do you think I could use some of your code from the experiment you
  showed at the face-to-face meeting? Or could we cooperate on it somehow?
• With the team split, is it the right time to start it, or should I wait?
• Would it be OK if I wrote this? I have two reasons for wanting to do that. One
  is that explaining the ideas in email probably still loses some information,
  and I have the best idea of what I mean. The second is that I'd like to use it
  to measure some speed improvements for my thesis, which requires me to write
  it myself (but as the code wouldn't actually be part of the thesis, just the
  numbers I get from it, I hope ISC having the copyright shouldn't be a problem,
  whereas someone else writing it would be).
• Do you think it makes sense to write this at all? It is a little bit more
  complicated than the usual tree, but I think it should be faster and allow
  realtime updates without much speed penalty on the lookup side. Is the lookup
  the bottleneck of authoritative server speed, or should we be optimising
  somewhere else?

Assumptions (if you do not believe them, I can find some sources to support them):
• Memory access is a lot slower (by an order of magnitude) than cache access.
• If we access a byte of memory, the whole cache-line is pulled into the cache,
  therefore the rest of the cache-line is „for free“.
• A TLB miss is expensive too (looking up a real address may involve several
  memory accesses, depending on what is in the cache).

Because it is easier to think with concrete numbers, I'll assume that the
cache-line size is 16 bytes and the page size is 4096 bytes, which is roughly
what today's hardware has, but the scheme should, in principle, work with
different numbers too.
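
For what it's worth, the actual numbers on a concrete machine can be checked
with a tiny program like the following (a glibc/Linux-specific sketch; on other
systems the second sysconf constant may be missing or report 0):

  #include <unistd.h>
  #include <cstdio>

  int main() {
      // Page size and L1 data cache-line size as reported by the C library.
      long page = sysconf(_SC_PAGESIZE);
      long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
      std::printf("page: %ld bytes, L1 cache-line: %ld bytes\n", page, line);
      return 0;
  }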

So, what I want is this: when I access a byte, I'd like to have as much useful
information in the same cache-line as possible, because I get it for free.
Furthermore, when I access a page, I want to make as many accesses within that
page as possible before I go to another one (while the address is still in the
TLB).

I'll leave the real data out for now, and just keep the keys (names) and
pointers to the data, for simplicity.

The idea is that it is a dual-level something-like-a-B-tree. There are small
vertices, each of them aligned to a cache-line boundary and taking up to 16
bytes. These form small trees, each of which acts as a vertex of the outer tree
(so the bigger vertex has a large fan-out). Each bigger vertex would similarly
live inside a single page.

You may notice that there are up to 256 small vertices inside the big one
(4096 / 16). This leads to the idea that addressing inside the big vertex can
be done with a single-byte index, to save space in the small vertices.
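
Just to make the arithmetic explicit, a sketch of the constants (the names and
exact form are made up, nothing here is final):

  #include <cstddef>
  #include <cstdint>

  // The concrete numbers assumed above: 4096 / 16 = 256 small vertices per
  // big vertex, so a single byte can address any of them.
  constexpr std::size_t CACHE_LINE     = 16;
  constexpr std::size_t PAGE_SIZE      = 4096;
  constexpr std::size_t SLOTS_PER_PAGE = PAGE_SIZE / CACHE_LINE;  // 256
  static_assert(SLOTS_PER_PAGE <= 256, "must stay addressable by one byte");

  // Index of a small vertex within its big vertex.
  typedef std::uint8_t SlotIndex;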

Another observation is that if I put the strings I compare (bytestrings
actually, but let's call them strings) outside of the vertices, I get
additional memory accesses and possibly TLB lookups. So we put the strings (or
parts of them) into the vertices. Each key in the vertex would be the smallest
prefix dividing the sons (therefore, if there are many different strings, the
keys are short; if the strings have common prefixes, the keys get longer).
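
To illustrate what I mean by the smallest dividing prefix, a little helper
(just a sketch, the name is made up): for two neighbouring sons in sorted
order, the common prefix plus one more byte is enough to separate them, so
"alpha"/"beta" gives the key "b", while "mail"/"mx" gives "mx".

  #include <algorithm>
  #include <cstddef>
  #include <string>

  // Smallest prefix of `right` that separates it from `left`, assuming
  // left < right in byte order.
  std::string dividing_prefix(const std::string& left, const std::string& right) {
      std::size_t i = 0;
      while (i < left.size() && i < right.size() && left[i] == right[i]) {
          ++i;
      }
      return right.substr(0, std::min(right.size(), i + 1));
  }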

So, what does a small vertex look like? It needs some flags (like whether it
holds any data or is just a vertex on the path). Then, if I store the length of
a string, 4 bits are enough (a maximum length of 16). If I need a pointer, I
need to distinguish whether it is internal (therefore a single byte) or
external, to another big vertex or to data. That's a single bit. Or, because
pointers on the target machines will probably be 64 bits long, we could add
another bit and have one- and two-byte relative pointers counted in pages (not
bytes) to save more space.
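
A possible encoding of that per-key bookkeeping, packed into one byte (the
field widths are only my guess at what is described above, and bit-field
packing is of course compiler-dependent):

  #include <cstdint>

  // One byte of metadata per key/child in a small vertex.
  struct ChildMeta {
      std::uint8_t key_len  : 4;  // length of the stored key prefix
      std::uint8_t ptr_kind : 2;  // 0 = internal 1-byte index,
                                  // 1 = 1-byte page-relative, 2 = 2-byte page-relative,
                                  // 3 = full native pointer
      std::uint8_t has_data : 1;  // the child carries data, not just a path
      std::uint8_t unused   : 1;
  };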

There's a little problem with the alignment of pointers: some architectures do
not like unaligned data at all, and some are just incredibly slow when working
with it. So let's put all native pointers at the end of the vertex, the bit
fields with lengths, flags and such at the beginning, and the strings in the
middle.
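
So a single small vertex could look roughly like this: metadata first, the key
bytes in the middle, and the naturally aligned native pointer last. The exact
byte counts are only there to illustrate the ordering:

  #include <cstdint>

  struct alignas(16) SmallVertex {
      std::uint8_t  flags;     // vertex-level flags (path-only, pointer widths, ...)
      std::uint8_t  used;      // number of keys/children actually in use
      std::uint8_t  keys[6];   // packed key prefixes (and/or ChildMeta bytes)
      std::uint64_t ptr;       // one native pointer, 8-byte aligned at the end
  };
  static_assert(sizeof(SmallVertex) == 16, "must fit one 16-byte cache line");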

If the data are dense (i.e. many distinct strings) and they differ at the first
character, we get a nice large fan-out (up to something like 6 or 7) for a
single cache-line and memory access (roughly two bytes of flags plus seven
one-byte keys and seven one-byte internal indices just about fill the 16
bytes). If the strings have long common prefixes, the fan-out is lowered so
that they still fit.

The big vertex would be just a bunch of small vertices; the one with index 0
would be the root.
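
With the earlier constants, the big vertex is then just a page-sized array of
those slots, with slot 0 acting as the root (again only a sketch, reusing
SLOTS_PER_PAGE and SmallVertex from above):

  struct alignas(4096) BigVertex {
      SmallVertex slots[SLOTS_PER_PAGE];  // slots[0] is the root of the inner tree
  };
  static_assert(sizeof(BigVertex) == 4096, "exactly one page per big vertex");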

Balancing would be done in a similar manner to a B-tree. Some nice
optimisations of the tree are possible, like choosing the right place to split
left/right so as to minimise the string lengths, or better packing/usage of the
small vertex space.
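
One such optimisation could be as simple as picking, among positions near the
middle of an overfull vertex, the split point whose separating prefix is the
shortest, so the key pushed up into the parent stays small (a sketch only,
reusing dividing_prefix from above; it assumes at least two keys):

  #include <algorithm>
  #include <cstddef>
  #include <string>
  #include <vector>

  // Choose where to split a sorted run of keys: stay close to the middle, but
  // prefer the position with the shortest dividing prefix.
  std::size_t choose_split(const std::vector<std::string>& keys) {
      const std::size_t mid   = keys.size() / 2;
      const std::size_t slack = keys.size() / 8 + 1;  // allowed drift from the middle
      const std::size_t lo    = mid > slack ? mid - slack : 1;
      const std::size_t hi    = std::min(keys.size() - 1, mid + slack);
      std::size_t best     = mid;
      std::size_t best_len = dividing_prefix(keys[mid - 1], keys[mid]).size();
      for (std::size_t i = lo; i <= hi; ++i) {
          const std::size_t len = dividing_prefix(keys[i - 1], keys[i]).size();
          if (len < best_len) {
              best_len = len;
              best     = i;
          }
      }
      return best;  // split into [0, best) and [best, keys.size())
  }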


If we decide to have it multi-threaded with live updates (which I think should
still be possible), I guess it would be better to use a read-write lock (as
there would probably be more concurrent reads than writes anyway) in each big
vertex (somewhere at the beginning of the page).
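
A minimal sketch of that, with a pthread read-write lock living in the first
bytes of the page (the lock itself consumes a few of the 16-byte slots, so the
slot count in such a page would shrink accordingly; the layout is only
indicative):

  #include <pthread.h>
  #include <cstdint>

  // Header at the beginning of each page-sized big vertex.  Lookups take the
  // lock shared (pthread_rwlock_rdlock) while walking the small vertices of
  // this page and release it before following an external pointer; an update
  // takes it exclusive (pthread_rwlock_wrlock) only for the page it rewrites.
  struct BigVertexHeader {
      pthread_rwlock_t lock;  // size is platform-dependent (e.g. 56 bytes on glibc/x86-64)
      std::uint8_t     root;  // index of the root small vertex, normally 0
  };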

Thanks for reading such a long email, and thanks for any suggestions and opinions.

With regards

-- 
Look! Behind you!

Michal 'vorner' Vaner