[bind10-dev] Profiling & performance suggestions

Shane Kerr shane at isc.org
Mon Jan 24 12:30:07 UTC 2011


Michal,

On Sun, 2011-01-23 at 20:25 +0100, Michal 'vorner' Vaner wrote:
> First, I mentioned the system spent a lot of time in the kernel. It turned out
> most of the time attributed to the kernel was caused by dnsperf, not bind10.
> Approximately 12% of the time bind10 is running is spent in the kernel (i.e.
> the kernel handling bind10's requests); most of that comes from some debug
> functions I enabled and from networking. Therefore we probably don't have much
> chance of improving that (we need to receive and send the packets, and I guess
> asio does some kind of epoll or other higher-performance thing by itself).

Good to know. :)

> I have no clue what the ZNK3isc… function is. I didn't find it in our code
> (OutputBuffer does not seem to have a clone method). Does anyone have an idea
> what it might be? We spend a quarter of the time in it. From the name I would
> guess it is copying some buffer, but I don't think we need it; it would be
> nice to get rid of it completely.

$ echo _ZNK3isc3dns12OutputBufferixEm | c++filt -n
isc::dns::OutputBuffer::operator[](unsigned long) const

There you go. :)

> The Name::compare is called mostly from the RBTree (I found this out before the
> F2F) and is slow because of cache misses (it accounts for nearly 80% of the
> cache misses in the program). This is taken from the function; the first two
> columns are cache misses, the second two are cache hits.
> 
>    429  1.6208  1185  0.2135   :    while (l > 0) {
>                                :        --l;
>                                :        --l1;
>                                :        --l2;
>      5  0.0189  1486  0.2678   :        size_t pos1 = offsets_[l1];
>   3937 14.8746  1039  0.1872   :        size_t pos2 = other.offsets_[l2];
>     11  0.0416  8942  1.6114   :        unsigned int count1 = ndata_[pos1++];
>   5475 20.6854   863  0.1555   :        unsigned int count2 = other.ndata_[pos2++];
>                                :
>                                :        // We don't support any extended label types including now-obsolete
>                                :        // bitstring labels.
>                                :        assert(count1 <= MAX_LABELLEN && count2 <= MAX_LABELLEN);
>                                :
>                                :        int cdiff = (int)count1 - (int)count2;
>      8  0.0302  1743  0.3141   :        unsigned int count = (cdiff < 0) ? count1 : count2;
>                                :
>                                :        while (count > 0) {
>                                :            unsigned char label1 = ndata_[pos1];
>                                :            unsigned char label2 = other.ndata_[pos2];
>                                :
>  10739 40.5735 21129  3.8076   :            int chdiff = (int)maptolower[label1] - (int)maptolower[label2];
>      0       0  2186  0.3939   :            if (chdiff != 0) {
>                                :                return (NameComparisonResult(chdiff, nlabels,
>                                :                                         NameComparisonResult::COMMONANCESTOR));
>                                :            }
>                                :            --count;
>                                :            ++pos1;
>                                :            ++pos2;
>                                :        }
> 
> It also contains 13% of mispredicted branches in the program.

I don't think either of these is surprising. We're accessing a lot of
memory, and if we could predict the branches we wouldn't need to
search... there may be ways to structure this code without branching, or
with a fixed-size string in the data structure to avoid dereferencing,
or things like that. Of course, the best solution is always not to call
the function at all, so perhaps memoization might be worthwhile too.
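
For example (just a sketch, not BIND 10 code; "lower" stands in for the
maptolower table above), the inner loop could record the first difference
without an early-exit branch:

#include <cstddef>

// Sketch only: compare 'count' bytes of two labels case-insensitively,
// keeping the first nonzero difference instead of branching out of the
// loop. Whether this actually reduces mispredictions would need measuring.
int
compareLabels(const unsigned char* l1, const unsigned char* l2,
              std::size_t count, const unsigned char* lower)
{
    int first_diff = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const int chdiff = static_cast<int>(lower[l1[i]]) -
                           static_cast<int>(lower[l2[i]]);
        // add chdiff only while no difference has been seen yet; this
        // usually compiles to a conditional move rather than a branch
        first_diff += (first_diff == 0) * chdiff;
    }
    return (first_diff);
}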

> It would be nice to find a way to store the Name in a more compact way, in a
> consecutive block of memory. Right now we have a vector and a string there,
> which means it is scattered over 3 places in memory. If it weren't C++, it
> would be OK to just call a malloc with the sum of the lengths of all three of
> them. Do you know of a way to do that in C++?

Google points me to this syntax:

#include <new>
Timer* ptimer = new (raw_mem) Timer;

That (raw_mem) is the pre-allocated memory.

http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=160
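
For the Name layout you describe, something along these lines might work (a
rough sketch with invented names and no error handling; a real class would
need proper copy and lifetime handling):

#include <new>      // placement new
#include <cstdlib>  // malloc/free
#include <cstring>  // memcpy

// Hypothetical: one malloc() holds the header, the label offsets, and the
// name data back to back, so a comparison touches a single block of memory.
struct CompactName {
    unsigned char labelcount_;
    unsigned char length_;
    // the offsets and name data follow immediately after the struct
    unsigned char* offsets() { return (reinterpret_cast<unsigned char*>(this + 1)); }
    unsigned char* ndata() { return (offsets() + labelcount_); }
};

CompactName*
makeCompactName(const unsigned char* offsets, unsigned char labelcount,
                const unsigned char* ndata, unsigned char length)
{
    void* raw = std::malloc(sizeof(CompactName) + labelcount + length);
    CompactName* name = new (raw) CompactName;   // construct into the raw block
    name->labelcount_ = labelcount;
    name->length_ = length;
    std::memcpy(name->offsets(), offsets, labelcount);
    std::memcpy(name->ndata(), ndata, length);
    return (name);
}
// to release: name->~CompactName(); std::free(name);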

> Another one is, we traverse the tree more times than we need to. For example,
> each time we answer, we traverse it once to get the actual answer, then once
> more to get the NS records of the zone into the authority section, and several
> more times to get the IP addresses of those into the additional section. If
> the record asked for is an MX, for example, then we traverse it again multiple
> times. As Jinmei presented at the previous F2F, it would be nice to have some
> kind of interlinking/hints about where to find things. The hints could be
> updated lazily (when one is found to be wrong or missing).

This seems like an excellent idea.
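
Roughly how I picture the hints (purely illustrative, not the actual RBTree
node; the names and sizes are made up):

struct RBNode;   // the existing tree node type

// Per-node hints caching where the "next" lookups for a typical answer
// would land; filled in lazily, and checked against the zone's generation
// counter so a stale hint is simply recomputed.
struct RBNodeHints {
    RBNode* ns_node;            // node holding the zone apex NS RRset
    RBNode* additional[4];      // nodes with addresses of the NS (or MX) targets
    unsigned int generation;    // compare with the zone's generation before use
};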

> The writeName part is the name compression. Jinmei, you had some trick for that,
> right? It seems that one is worth implementing.

And this. :)
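
I don't know the details of Jinmei's trick, but for reference the classic
table-based approach looks roughly like this (a sketch only, working on
dot-separated text labels rather than wire format):

#include <map>
#include <string>
#include <vector>
#include <stdint.h>

typedef std::map<std::string, uint16_t> CompressTable;

// Remember the offset at which each name suffix was written; when a suffix
// repeats, emit a two-byte compression pointer (0xC000 | offset) instead of
// writing the labels again.
void
writeNameCompressed(std::vector<uint8_t>& msg, const std::string& name,
                    CompressTable& table)
{
    std::string rest = name;
    while (!rest.empty()) {
        CompressTable::const_iterator found = table.find(rest);
        if (found != table.end()) {
            msg.push_back(static_cast<uint8_t>(0xc0 | (found->second >> 8)));
            msg.push_back(static_cast<uint8_t>(found->second & 0xff));
            return;
        }
        if (msg.size() <= 0x3fff) {     // pointers can only reach 14 bits back
            table[rest] = static_cast<uint16_t>(msg.size());
        }
        const std::string::size_type dot = rest.find('.');
        const std::string label = rest.substr(0, dot);
        msg.push_back(static_cast<uint8_t>(label.size()));
        msg.insert(msg.end(), label.begin(), label.end());
        rest = (dot == std::string::npos) ? std::string() : rest.substr(dot + 1);
    }
    msg.push_back(0);   // terminating root label
}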

> I might point out as well that the zone is not that big (because we don't
> really save much memory and I have only 4GB, so I couldn't fit a bigger one),
> and if it grows even more, then the impact on lookup performance will be
> bigger (since the tree or trees will be deeper).

Luckily, tree depth grows with the log of the tree's size, so larger trees
should result in only small performance hits: doubling the number of names
adds roughly one more level to a balanced tree.

> Then there is the destructor of the shared pointer. It accounts for 12% of the
> mispredicted branches. It would be nice if we found the fast path and got rid
> of the shared pointers there. Most of the variables are needed only until
> processing of the current query ends. If we wrote it in C, I would suggest
> using a memory pool, so we would just take bits of memory, wouldn't care about
> deallocation, and would start taking from the beginning again with the next
> query. I'm not sure if anything like that is possible with C++ (because it
> would never call destructors, which is bad with the STL, I guess). But maybe
> we could recycle some kinds of objects as well.

Anything is possible with C++. :) We could use a custom allocator that
works in the manner you describe - allocating at the beginning of a
query and then dumping the entire block at the end. It is also possible
to explicitly invoke destructors if you use this technique. (That link
above includes an example.)
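
A minimal sketch of that (invented for illustration; no alignment, overflow,
or error handling):

#include <new>      // placement new
#include <cstdlib>  // malloc/free
#include <cstddef>  // size_t

// Per-query arena: objects are placement-new'ed into one block, destructors
// are invoked explicitly at the end of the query, and the block is reused.
class QueryArena {
public:
    explicit QueryArena(std::size_t size) :
        buf_(static_cast<char*>(std::malloc(size))), size_(size), used_(0)
    {}
    ~QueryArena() { std::free(buf_); }

    void* allocate(std::size_t n) {
        if (used_ + n > size_) {
            return (NULL);              // a real version would fall back to new
        }
        void* p = buf_ + used_;
        used_ += n;
        return (p);
    }

    void reset() { used_ = 0; }         // recycle the block for the next query

private:
    char* buf_;
    std::size_t size_;
    std::size_t used_;
};

// usage:
//   QueryArena arena(64 * 1024);
//   SomeType* obj = new (arena.allocate(sizeof(SomeType))) SomeType();
//   ...
//   obj->~SomeType();                  // the explicit destructor call
//   arena.reset();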

> And there are the copy constructor and destructor of UDPServer. That sounds
> like the place with the coroutines, which copy something a lot. It would be
> nice to do it without the copying (both together take over 8% as well).

This seems strange to me. Possibly a place for easy optimization, if we
can figure out *why* this is being done.

> So, here is the list of proposed tasks [the ones in square brackets are just notes,
> I don't think we have enough time or courage to do them now]:
> • Identify what the clone function is and whether we can get rid of it.

See above.

> • Find a more compact representation of Name. Or, don't have Names in the tree,
>   just some kind of „naked“ representation that would be more compact.

Seems reasonable.

> • See if we can find a way to compare names faster (maybe with a different
>   representation, precomputed hashes or something).

Seems reasonable.

> • Implement the trick with name compression (might reuse the precomputed
>   hashes as well, possibly).
> • Do some interlinking inside the tree.

You mean to avoid extra traversals?

> • [Find a better data structure than red-black tree.]

Not unreasonable, although perhaps risky and time-consuming?

> • Look through the fast path and see if we can get rid of any of the shared
>   pointers there. A std::auto_ptr might be enough for most cases, and it does
>   not involve conditions, access to another place in memory, or any kind of
>   locking (shared_ptr is supposed to be thread-safe).

Agreed. shared_ptr is nice but if it causes us performance problems in a
specific use case we can reconsider it.
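
For illustration (the type here is invented), the difference on the fast
path would be something like:

#include <boost/shared_ptr.hpp>
#include <boost/scoped_ptr.hpp>

struct RendererScratch { /* per-query buffers, etc. */ };

void
processQueryShared() {
    // every copy and the final destruction touch the shared reference count
    boost::shared_ptr<RendererScratch> scratch(new RendererScratch);
}

void
processQueryScoped() {
    // single owner, no reference count; freed when the scope ends --
    // a plain stack object would be cheaper still if the size allows
    boost::scoped_ptr<RendererScratch> scratch(new RendererScratch);
}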

> • [Explore a way to bring memory pools into C++ or how to recycle objects.]

See above?

> • Look at why UDPServer is copied so much. As we have only one thread and
>   handle only one query at a time, it would be great if we used just a single
>   one.

Definitely.

> Do we want to explore multiprocessor utilization right now, or do we leave it
> out until the release?

I'd prefer to explore techniques for improving the single-processor
performance for now, unless you think there are multiprocessor
implications.

I hope we have some time on Wednesday's call to discuss this issue. If
not, perhaps it would make sense to set up a separate meeting for this
topic (either in voice or chat, I'm fine with anything)?

--
Shane



