[bind10-dev] Profiling & performance sugestions

Michal 'vorner' Vaner michal.vaner at nic.cz
Sun Jan 23 19:25:48 UTC 2011


Hello

I tried to get little bit more info out of the profiling and propose what we do
about it. There is proposed set of tasks (subject to discussion, splitting,
modification, etc).

I did it with one largish zone, there surely are some other scenarios. I
generated it with the attached script.

First, I mentioned the system spent a lot of time in kernel. It turned out most
of the time accounted for kernel were caused by dnsperf, not bind10. Approx. 12%
of time bind10 is running is in kernel (eg. kernel handling bind10's requests),
must of them are some mine enabled debug functions and networking stuff.
Therefore we probably don't have any chance of getting that better (we need to
receive and send the packets and I guess asio does some kind of epoll or other
higher-performance think by itself).

Now, to the slow parts of our code. Here is the list of functions executing over
1% of time (excluding children):

CPU: AMD64 family10, speed 0 MHz (estimated)
Counted IBS_OP_ALL events (All IBS op samples) with a unit mask of 0x00 (Using IBS OP cycle count mode) count 50000
samples  %        symbol name
70215    25.1633  _ZNK3isc3dns12OutputBufferixEm.clone.87
47336    16.9641  isc::dns::Name::compare(isc::dns::Name const&) const
44651    16.0018  isc::dns::MessageRenderer::writeName(isc::dns::Name const&, bool)
22399     8.0273  boost::detail::shared_count::~shared_count()
13080     4.6876  asiolink::UDPServer::UDPServer(asiolink::UDPServer const&)
8695      3.1161  asiolink::UDPServer::~UDPServer()
4962      1.7783  isc::dns::Name::Name(isc::dns::InputBuffer&, bool)
4923      1.7643  isc::datasrc::RBTree<std::map<isc::dns::RRType, boost::shared_ptr<isc::dns::RRset const>, std::less<isc::dns::RRType>, std::allocator<std::pair<isc::dns::RRType const, boost::shared_ptr<isc::dns::RRset const> > > >, false>::Result isc::datasrc::RBTree<std::map<isc::dns::RRType, boost::shared_ptr<isc::dns::RRset const>, std::less<isc::dns::RRType>, std::allocator<std::pair<isc::dns::RRType const, boost::shared_ptr<isc::dns::RRset const> > > >, false>::findHelper<isc::datasrc::MemoryZone::MemoryZoneImpl::FindState*>(isc::dns::Name const&, isc::datasrc::RBNode<std::map<isc::dns::RRType, boost::shared_ptr<isc::dns::RRset const>, std::less<isc::dns::RRType>, std::allocator<std::pair<isc::dns::RRType const, boost::shared_ptr<isc::dns::RRset const> > > > > const**, isc::datasrc::RBNode<std::map<isc::dns::RRType, boost::shared_ptr<isc::dns::RRset const>, std::less<isc::dns::RRType>, std::allocator<std::pair<isc::dns::RRType const, boost::shared_ptr<isc::dns::RRset const> > > > >**, bool (*)(isc::datasrc::RBNode<std::map<isc::dns::RRType, boost::shared_ptr<isc::dns::RRset const>, std::less<isc::dns::RRType>, std::allocator<std::pair<isc::dns::RRType const, boost::shared_ptr<isc::dns::RRset const> > > > > const&, isc::datasrc::MemoryZone::MemoryZoneImpl::FindState*), isc::datasrc::MemoryZone::MemoryZoneImpl::FindState*) const
2946      1.0558  void std::vector<unsigned char, std::allocator<unsigned char> >::_M_range_insert<unsigned char const*>(__gnu_cxx::__normal_iterator<unsigned char*, std::vector<unsigned char, std::allocator<unsigned char> > >, unsigned char const*, unsigned char const*, std::forward_iterator_tag)
2818      1.0099  asiolink::UDPServer::operator()(asio::error_code, unsigned long)

I have no clue what the ZNK3isc… function is. I didn't find it in our code
(OutputBuffer does not seem to have a clone method). Does anyone idea what it
might be? Because we spend quarter of the time in it. From the name I would
guess it is copying of some buffer, but I guess we don't need it, it would be
nice to get rid of it completely.

The Name::compare is called mostly from the RBTree (I found this out before the
F2F) and is slow because of cache misses (it contains nearly 80% of cache misses
in the program). This is taken from the function, first two columns are cache
misses, second two are cache hits.

   429  1.6208  1185  0.2135   :    while (l > 0) {
                               :        --l;
                               :        --l1;
                               :        --l2;
     5  0.0189  1486  0.2678   :        size_t pos1 = offsets_[l1];
  3937 14.8746  1039  0.1872   :        size_t pos2 = other.offsets_[l2];
    11  0.0416  8942  1.6114   :        unsigned int count1 = ndata_[pos1++];
  5475 20.6854   863  0.1555   :        unsigned int count2 = other.ndata_[pos2++];
                               :
                               :        // We don't support any extended label types including now-obsolete
                               :        // bitstring labels.
                               :        assert(count1 <= MAX_LABELLEN && count2 <= MAX_LABELLEN);
                               :
                               :        int cdiff = (int)count1 - (int)count2;
     8  0.0302  1743  0.3141   :        unsigned int count = (cdiff < 0) ? count1 : count2;
                               :
                               :        while (count > 0) {
                               :            unsigned char label1 = ndata_[pos1];
                               :            unsigned char label2 = other.ndata_[pos2];
                               :
 10739 40.5735 21129  3.8076   :            int chdiff = (int)maptolower[label1] - (int)maptolower[label2];
     0       0  2186  0.3939   :            if (chdiff != 0) {
                               :                return (NameComparisonResult(chdiff, nlabels,
                               :                                         NameComparisonResult::COMMONANCESTOR));
                               :            }
                               :            --count;
                               :            ++pos1;
                               :            ++pos2;
                               :        }

It also contains 13% of mispredicted branches in the program.

The MessageRenderer::writeName is OK regarding data placement (it has under a
percent of misses and over 20% of cache hits), but it about 20% of
misspredicted branches. Most of it is not directly in the function, but in
NameComper::operator() (on the for cycle) and NameCompare::nextPosition (it's
first line), which are inlined into it.

It would be nice to find a way to store the Name in more compact way, in a
consecutive block of memory. Right now we have vector and string there, which
means it is scattered over 3 places in the memory. If it wasn't C++, it would be
OK to just call a malloc with sum of the lenghts for all three of them. Do you
know of a way how to do it with C++?

Another one is, we traverse the tree more times than we need to. For example,
each time we answer, we traverse it once to get the actual answer, then once
more to get the NS records of zone into the authority section and several more
times to get IP addresses of those to the additional section. If the record
asked is MX, for example, then we traverse it again multiple times. As Jinmei
presented on the previous F2F, it would be nice to have some kind of
interlinking/hints where to find something. The hints could be updated lazily
(when it is found it is wrong or missing).

The writeName part is the name compression. Jinmei, you had some trick for that,
right? It seems that one is worth implementing.

I might point out as well that the zone is not that big (because we don't really
save much memory and I have only 4GB, so I couldn't fit in a bigger one), if it
grows even more, then the impact of lookup performance will be bigger (since the
tree or trees will be deeper).

Then there is the destructor of shared pointer. There are 12% of mispredicted
branches. It would be nice if we found the fast path and got rid of the shared
pointers there. Most of the variables are needed only until processing of the
current query ends. If we wrote it in C, I would suggest using a memorypool,
so we would just take bits of memory, wouldn't care about deallocation and
started taking from the beginning with the next query. I'm not sure if anything
like that is possible with C++ (because it would never call destructors, which
is bad with stl I guess). But maybe we could recycle some kind of objects as
well.

And there are copy constructor and destructor of UDPServer. That sounds like a
place with the coroutines, which copy something a lot. It would be nice to get
it without the copying (because, both together take over 8% as well).


So, here is the list of proposed tasks [the ones in square brackets are just notes,
I don't think we have enough time or courage to do them now]:
• Identify, what is the clone function and if we can get rid of it.
• Find a more compact representation of Name. Or, don't have Names in the tree,
  just some kind of „naked“ representation that would be more compact.
• Find if we can find a way to compare names faster (maybe with different
  representation, precomputed hashes or something).
• Represent the trick with name compression (might reuse precomputed hashes as
  well, possibly.)
• Do some interlinking inside the tree.
• [Find a better data structure than red-black tree.]
• Look trough the fast path and find if we can get rid of any of the shared
  pointers there. An stl::auto_ptr might be enough for most cases and it does
  not contain conditions, access to another place in memory and some kind of
  locking (shared_ptr is supposed to be thread safe).
• [Explore a way to bring memory pools into C++ or how to recycle objects.]
• Look why is UDPServer copied so much. As we have only one thread and handle
  only one query at a time, it would be great if we used just single one.

Any other ideas how to address the above is welcome.

Do we want to explore multiprocessor utilization right now, or we leave it out
until the release?

Have a nice day

-- 
I still miss Windows, but my aim is getting better.

Michal 'vorner' Vaner
-------------- next part --------------
A non-text attachment was scrubbed...
Name: generator.pl
Type: text/x-perl
Size: 1252 bytes
Desc: not available
URL: <https://lists.isc.org/pipermail/bind10-dev/attachments/20110123/67c6e3a8/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <https://lists.isc.org/pipermail/bind10-dev/attachments/20110123/67c6e3a8/attachment-0001.bin>


More information about the bind10-dev mailing list