[bind10-dev] Profiling & performance sugestions
Shane Kerr
shane at isc.org
Mon Jan 24 12:30:07 UTC 2011
Michal,
On Sun, 2011-01-23 at 20:25 +0100, Michal 'vorner' Vaner wrote:
> First, I mentioned the system spent a lot of time in kernel. It turned out most
> of the time accounted for kernel were caused by dnsperf, not bind10. Approx. 12%
> of time bind10 is running is in kernel (eg. kernel handling bind10's requests),
> must of them are some mine enabled debug functions and networking stuff.
> Therefore we probably don't have any chance of getting that better (we need to
> receive and send the packets and I guess asio does some kind of epoll or other
> higher-performance think by itself).
Good to know. :)
> I have no clue what the ZNK3isc… function is. I didn't find it in our code
> (OutputBuffer does not seem to have a clone method). Does anyone idea what it
> might be? Because we spend quarter of the time in it. From the name I would
> guess it is copying of some buffer, but I guess we don't need it, it would be
> nice to get rid of it completely.
$ echo _ZNK3isc3dns12OutputBufferixEm | c++filt -n
isc::dns::OutputBuffer::operator[](unsigned long) const
There you go. :)
> The Name::compare is called mostly from the RBTree (I found this out before the
> F2F) and is slow because of cache misses (it contains nearly 80% of cache misses
> in the program). This is taken from the function, first two columns are cache
> misses, second two are cache hits.
>
> 429 1.6208 1185 0.2135 : while (l > 0) {
> : --l;
> : --l1;
> : --l2;
> 5 0.0189 1486 0.2678 : size_t pos1 = offsets_[l1];
> 3937 14.8746 1039 0.1872 : size_t pos2 = other.offsets_[l2];
> 11 0.0416 8942 1.6114 : unsigned int count1 = ndata_[pos1++];
> 5475 20.6854 863 0.1555 : unsigned int count2 = other.ndata_[pos2++];
> :
> : // We don't support any extended label types including now-obsolete
> : // bitstring labels.
> : assert(count1 <= MAX_LABELLEN && count2 <= MAX_LABELLEN);
> :
> : int cdiff = (int)count1 - (int)count2;
> 8 0.0302 1743 0.3141 : unsigned int count = (cdiff < 0) ? count1 : count2;
> :
> : while (count > 0) {
> : unsigned char label1 = ndata_[pos1];
> : unsigned char label2 = other.ndata_[pos2];
> :
> 10739 40.5735 21129 3.8076 : int chdiff = (int)maptolower[label1] - (int)maptolower[label2];
> 0 0 2186 0.3939 : if (chdiff != 0) {
> : return (NameComparisonResult(chdiff, nlabels,
> : NameComparisonResult::COMMONANCESTOR));
> : }
> : --count;
> : ++pos1;
> : ++pos2;
> : }
>
> It also contains 13% of mispredicted branches in the program.
I don't think either of these is surprising. We're accessing a lot of
memory, and if we could predict the branches we wouldn't need to
search... there may be ways to structure this code without branching, or
with a fixed-sized string in the data structure to avoid de-referencing,
or things like that. Of course, the best solution is always not to call
the function at all, so perhaps memoization might be worthwhile too.
> It would be nice to find a way to store the Name in more compact way, in a
> consecutive block of memory. Right now we have vector and string there, which
> means it is scattered over 3 places in the memory. If it wasn't C++, it would be
> OK to just call a malloc with sum of the lenghts for all three of them. Do you
> know of a way how to do it with C++?
Google points me to this syntax:
#include <new>
Timer * ptimer = new (raw_mem) Timer; // #1
That (raw_mem) is the pre-allocated memory.
http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=160
> Another one is, we traverse the tree more times than we need to. For example,
> each time we answer, we traverse it once to get the actual answer, then once
> more to get the NS records of zone into the authority section and several more
> times to get IP addresses of those to the additional section. If the record
> asked is MX, for example, then we traverse it again multiple times. As Jinmei
> presented on the previous F2F, it would be nice to have some kind of
> interlinking/hints where to find something. The hints could be updated lazily
> (when it is found it is wrong or missing).
This seems like an excellent idea.
> The writeName part is the name compression. Jinmei, you had some trick for that,
> right? It seems that one is worth implementing.
And this. :)
> I might point out as well that the zone is not that big (because we don't really
> save much memory and I have only 4GB, so I couldn't fit in a bigger one), if it
> grows even more, then the impact of lookup performance will be bigger (since the
> tree or trees will be deeper).
Luckily trees grow in depth with the log of their size, so larger trees
should result in small performance hits.
> Then there is the destructor of shared pointer. There are 12% of mispredicted
> branches. It would be nice if we found the fast path and got rid of the shared
> pointers there. Most of the variables are needed only until processing of the
> current query ends. If we wrote it in C, I would suggest using a memorypool,
> so we would just take bits of memory, wouldn't care about deallocation and
> started taking from the beginning with the next query. I'm not sure if anything
> like that is possible with C++ (because it would never call destructors, which
> is bad with stl I guess). But maybe we could recycle some kind of objects as
> well.
Anything is possible with C++. :) We could use a custom allocator that
works in the manner you describe - allocating at the beginning of a
query and then dumping the entire block at the end. It is also possible
to explicitly invoke destructors if you use this technique. (That link
above includes an example.)
> And there are copy constructor and destructor of UDPServer. That sounds like a
> place with the coroutines, which copy something a lot. It would be nice to get
> it without the copying (because, both together take over 8% as well).
This seems strange to me. Possibly a place for easy optimization, if we
can figure out *why* this is being done.
> So, here is the list of proposed tasks [the ones in square brackets are just notes,
> I don't think we have enough time or courage to do them now]:
> • Identify, what is the clone function and if we can get rid of it.
See above.
> • Find a more compact representation of Name. Or, don't have Names in the tree,
> just some kind of „naked“ representation that would be more compact.
Seems reasonable.
> • Find if we can find a way to compare names faster (maybe with different
> representation, precomputed hashes or something).
Seems reasonable.
> • Represent the trick with name compression (might reuse precomputed hashes as
> well, possibly.)
> • Do some interlinking inside the tree.
You mean to avoid extra traversals?
> • [Find a better data structure than red-black tree.]
Not unreasonable although perhaps risky and time consuming?
> • Look trough the fast path and find if we can get rid of any of the shared
> pointers there. An stl::auto_ptr might be enough for most cases and it does
> not contain conditions, access to another place in memory and some kind of
> locking (shared_ptr is supposed to be thread safe).
Agreed. shared_ptr is nice but if it causes us performance problems in a
specific use case we can reconsider it.
> • [Explore a way to bring memory pools into C++ or how to recycle objects.]
See above?
> • Look why is UDPServer copied so much. As we have only one thread and handle
> only one query at a time, it would be great if we used just single one.
Definitely.
> Do we want to explore multiprocessor utilization right now, or we leave it out
> until the release?
I'd prefer to explore the techniques in improving the single-processor
performance for now, unless you think there are multiprocessor
implications.
I hope we have some time on Wednesday's call to discuss this issue. If
not, perhaps it would make sense to set up a separate meeting for this
topic (either in voice or chat, I'm fine with anything)?
--
Shane
More information about the bind10-dev
mailing list