BIND 10 #397: Port generic red-black tree (RBT) data structure from BIND-9

Wed Nov 24 14:43:30 UTC 2010

#397: Port generic red-black tree (RBT) data structure from BIND-9
------------------------------+---------------------------------------------
      Reporter:  zzchen_pku   |        Owner:  hanfeng  
          Type:  enhancement  |       Status:  reviewing
      Priority:  major        |    Milestone:           
     Component:  data source  |   Resolution:           
      Keywords:               |    Sensitive:  0        
Estimatedhours:  0.0          |        Hours:  0        
      Billable:  1            |   Totalhours:  0        
      Internal:  0            |  
------------------------------+---------------------------------------------
Changes (by jinmei):

  * owner:  jinmei => hanfeng

Comment:

 Okay, finally completed review.  Here are comments.  Giving the ticket
 back to feng.

 First, I made some (more) minor cleanups I noted directly to the branch
 (r3633).  Please check them.

 '''Meta Comments'''

 Was it developed test driven?  From the ordering of commit messages
 I'm afraid it was actually "code driven".  I'm even afraid the code
 and tests were written by different developers.  If it went that
 way, please seriously consider revisiting the development style.
 Without practice we will never be able to master TDD.  If this is done
 by a pair, it shouldn't be a pair of "main coder" and "test coder".
 Each person should do both, in the order of test then code.

 '''General Comments'''
  - I still don't understand why we need to rewrite the whole logic
    from the scratch.  This is complicated stuff by nature, so doing it
    from the scratch only seems to increase the risk of introducing
    bugs.  In fact, I've already found some strange behaviors and
    points that don't meet our requirements (see below for the
    specifics).  I couldn't also be sure about the correctness of one
    part of the code (see the comment for eraseNode() below).  Redblack
    tree algorithm itself is very complicated and difficult to write
    correctly or difficult to be sure if it's correct, so why don't we
    simply port the existing, known to work code?  That way, we can at
    least be sure that it should be as correct as the original code.
    Code written from the scratch requires more load for review and is
    normally more buggy.  "The original code is complicated" is true,
    but that is not a convincing reason to me because this code is also
    complicated.  So, I'd like to ask revisiting the approach again.
    If, you still think the scratch code is better, that's a decision,
    but please describe why we can't use the original code in the class
    description in detail.
  - A related point: the current design seems to be much less efficient
    in many points.  For example, it stores name objects, which
    contain std::string and std::vector objects, and therefore heavy in
    terms of memory footprint.  Likewise, having a separate tree object
    for each subtree can be heavy wrt memory footprint (consider, e.g.,
    a large TLD that contains many <subdomain>.<tld> names with "in
    bailiwick" NS names such as nsX.<subdomain>.<tld>.  Then we'll have
    a large number of sub trees).  find() involves possibly many
    recursive function calls, which will be slower than a single loop.
    These things may be acceptable for initial implementation if it has
    other advantage such as code simplicity (I'm personally okay with
    having name objects for now, for example), but these are important
    design decision and should be explicitly documented.
  - As we discussed before, I'm personally against using
    boost::noncopyable in a public header file.  At the very least we
    should be consistent about the use of it (right now this is the
    only place we use it).  If you still believe increasing the
    reliance on boost is the way to go, please raise it as a general
    matter and get consensus, then introduce it throughout our code
    tree.
  - Admitting this may be a matter of taste, this type of syntax looks
    awkward, and reduces code readability (by confusing the reader) to
    me:
 {{{
                             if (-1 == ret) {
 }}}
    I guess this tries to proactively avoid a common error of mistyping
    '==' with '='.  On one hand, this might be considered a good
    practice; on the other hand, where I stand, this is an outdated
    hack that just makes the code awkward.  Modern compilers warn about
    this type mistypes, and by treating warnings as an error we can
    avoid this type of bugs.  For example, if I modify this code to:
 {{{
                             if (ret = -1) {
 }}}
    my gcc warns as follows:
 {{{
 rbt_datasrc.h:525: warning: suggest parentheses around assignment used as
 truth value
 }}}
    In addition, particularly in C++, we can declare many variables
    const, which also help trigger a compiler error due to an
    accidental assignment.

    Besides, the style isn't consistent even in this file.  In some
    other parts a "more natural" style is used:
 {{{
             if (common_label_count == 1) {
 }}}
   The inconsistency is a source of confusion and reduces readability.

   If, we agree with adopting this as a coding guideline and using it
   throughout our source tree consistently, I could live with that.
   Until/unless that happens, I suggest we keep the "natural" style
   consistently.
  - About doxygen comment: comments on private methods will be ignored
    by doxygen.  it's still sometimes useful to provide detailed
    comments on them, but if it's related to a general design matter,
    it might be better to describe the points in, e.g., the class
    description.
  - I'm afraid it's not a good idea to store raw "type-T" data in
    RBNode.  We need to represent a "node without data", and the notion
    of "shadow" isn't sufficient for this purpose (see also discussion
    about "NXRRSET" below).  Also, depending on the implementation of
    the type, raw data may be heavyweight in terms of memory footprint,
    even if it's a default-constructed one.  So we should probably
    store the data as a pointer(-like) type object that can be
    comparable with "NULL".
  - class naming: it may be better to introduce one level of
    abstraction than "red black tree", because from user's point of
    view the internal structure doesn't (shouldn't) matter.  Also, we
    may actually want to change the internal structure to something
    different (maybe a hash table, or other tree-like structure as
    Michal proposed).
  - Likewise, I suspect the module and filename are not appropriate.
    This stuff should probably go to lib/datasrc.  And, "rbt_datasrc.h"
    is not a good name in two reasons: this is a generic backend and is
    not a data source per se; it would be better to raise the
    abstraction level as discussed above.
  - documentation in general: we need two types of documentation: one
    for API users, and the other for developers who try to read/modify
    the code.  For the former (or both), we need to describe "what this
    class is", "how to use it", "when this method throws an exception",
    "what this parameter should be", "what this method returns in which
    case", etc.  Especially for the latter, we need to describe "why we
    design it that way", "how this is different from other
    implementation (e.g. BIND 9) and why", "what is missing in the
    current implementation", etc.  These things are largely missing in
    this implementation.

 '''operator- for Name'''
  - technically, an unnamed namespace shouldn't be used in a public
    header file because it would easily cause breaking the C++ one
    definition rule.  I suspect there's even a compiler that rejects
    this code.  Assuming we'll eventually (actually soon) use more
    efficient data structure than Name objects, it may be okay for
    now.  But we should at least note these in comments.

 '''RBNode class'''
  - RBTreeColor: a minor point, but why does it start with 1?
  - RBTreeColor: it seems this can (should) be non public
  - getName(): we'll actually need the full name of the node (not
    relative)
  - successor()
   - why is it public (other than for testing purpose)?  it's not
     always useful for applications because it only provides the
     "successor" of a single subtree.  What the application would really
     need is the "next" node of the entire tree-of-tree (= zone).  and,
     it can even be harmful in that it returns the internal null node,
     whose property is completely hidden in the implementation and
     will confuse the application.
   - documentation needs to be improved:
    - define "big"
    - what if this is the "biggest" node?
    - note also that "whose name is bigger" is ambiguous.  It normally
      doesn't specify a single node because there are normally more
      than one "bigger" nodes.
   - cloneDNSData(): "rbt" is now non DNS specific thing, so this
     method should be renamed.
   - setDownTree(): doc: it's not a complete sentence.  also, what it tries
     to say isn't clear.
   - not clear why we need is_shadow_ (unless you read other part of
     the code carefully).  This should be fully documented (probably in
     the class description of RBTree).
   - name_: unclear description.
  - setDownTree(): I didn't understand the description.

 '''RBTree class'''
  - documentation: this is insufficient.  see above for the general
    matter.  this class should specifically describe overall data
    structure, complexity, overall user interfaces, footprint
    consideration, ownership property (whether the application should
    keep the ownership of the data stored in the "tree"), etc.
  - FindResult: we may want to make the naming consistent with
    ZoneTable, where we use SUCCESS instead of EXACTMATCH.
  - find()
    - the documentation should clarify when to use which version of
      this method.
    - semantically, the "mutable" version shouldn't be a const member
      function, because the caller can effectively modify the tree
      through the returned node pointer.  IMO, it's safer to remove
      the const and warn that (in doc) this is a dangerous version of
      find() and should be used only when necessary and with caution.
    - in BIND 9 we add argument validation for this type of API:
     requiring inserted_node != NULL && *inserted_node = NULL.
     we may want to do the same check.  same comment applies to insert().
  - for what purpose do we need getNodeCount()?  for debug?  the intent
    should be documented, and if we don't need it in practice, we
    should probably remove it.  I'd also note that "node created common
    suffix node" is completely implementation specific, and isn't
    appropriate for a public interface (it's susceptible to
    implementation changes)
  - same comment applies to getNameCount()?
  - printTree(): our usual practice is to define toText() and (if
    necessary) define operator<<() separately (but for this class
    toText() may not always make sense because the resulting text may
    become very large).
    Note also that the naming of "printTree()" is too specific to the
    current implementation, and "Tree" is actually redundant because
    it's a member function of "RBTree".  maybe we should just call it
    "print()".
  - insert()
   - I don't like it to return magic numbers such as 0/1/-1.  Why don't
     we generalize the result code (EXACTMATCH, etc) and use it here?
     Note: ZoneTable adopts that approach.
   - In any case, I don't think it a good idea to handle the 'out of
     memory' error via the return value.  I'll discuss it more with
     exception safety below.
   - we may eventually want to introduce an iterator, and want to have
     insert() return an iterator.  for now I'm okay without this
     extension, but it would be better to note this possibility in the
     class documentation.
  - erase(): like insert(), I don't like it to return magic numbers.
  - The diagram after the class definition: this should be included in
    the doxygen document.

 '''~RBTree()'''
  - This seems to be quite inefficient (consider how many nodes we need
    to visit before deleting the entire tree).  This seems to me to be
    another bad reason for redeveloping the wheel.

 '''findHelper()'''
  - I don't think it correct to return NOTFOUND when the search finds
    an exact match with "is_shadow_" because there's a case like
    "NXRRSET", where the node is empty but should match the search key.
    Note that handling this case wouldn't be that trivial, because
    there will be indirect ways to change the node property, e.g.
    an application first finds a "shadow" node in an "NXRRSET mode", and
    then perform setData() on it, which effectively makes that node
    "non shadow".
    This design should be substantially revisited.
  - if we specify 'const' for 'tree', why not do that for 'ret', too?
    note that 'tree' is also subject to a subsequent change, where we
    need const_cast (see erase()).  ideally we should avoid such
    casting, but I can live with it in this case, understanding that
    it's used in mutable and immutable contexts and that it's a private
    helper function.  however, such a tricky property should be
    consistent so that the intent is clearer to code readers.  the
    intent should also be documented in detail, so that when someone
    needs to modify the code the interface won't confuse the developer.

 '''getNodeCount/Helper()'''
  - this requires traversing the entire tree just to count the total
    number of nodes.  depending on the purpose, it seems too expensive.
    Note: in BIND 9 version of rbt this is a constant operation.

 '''getNameCounter()'''
  - same comment applies.

 '''insert()'''
  - this is still not exception safe, and this usage of try-catch is
    not consistent with our convention, and makes the code unreadable.

    First, it's still not safe.  For example, operator- for names or
    cloneDNSData() will require resource allocation for new name
    objects and can throw.  If this happens the memory allocated for
    new_sub_tree will leak.  In general, it's very error prone to try
    to address exception safety this way unless the code is very short
    and only uses very primitive interfaces, because we can easily miss
    a function that can throw an exception.  It's also fragile to
    future changes for the same reason.  In fact, this is one of the
    reasons "why we shouldn't use exceptions" often claimed by
    exception opponents.

    Second, in a (C++) package relying on exceptions, handling
    exceptions locally doesn't make sense in general.  This is first
    because how to handle the situation that an exception occurs is
    often unclear at a lower level like this (e.g. at a top application
    level, we might be able to drop some unnecessary data such as
    caches to make more space for getting mandatory resources).  It's
    also because adding try-catch in a lower level decreases code
    readability with other disadvantages of exceptions (such as higher
    costs and higher possibility of resource leaks).  In fact, the
    resulting code is quite unreadable due to the handling of rare
    exceptional cases.  If we were to handle these error cases at this
    point, it's much better to use no-throw new and write the C-style
    if-fail-then-(goto)-cleanup code.

    I'd personally prefer allocating resources in an RAII manner and
    simply let exceptions be propagated.  We could use a shared
    pointer, but in this case it's a bit overkilling, and the standard
    auto_ptr should suffice.  For example, the new_sub_tree handling
    would look like this:
 {{{
                     std::auto_ptr<RBTree<T> > new_sub_tree =
                         std::auto_ptr<RBTree<T> >(new RBTree());
                     RBNode<T>* sub_root;
                     new_sub_tree->insert(sub_name, &sub_root);
                     int ret = 0;
                     if (name.getLabelCount() != common_label_count) {
                         ret = new_sub_tree->insert(name - common_ancestor,
                                                    new_node);
                     }
                     RBTree<T>* down_old = current->down_;
                     current->setDownTree(new_sub_tree.get());
                     ...
                     current->is_shadow_ = true;
                     new_sub_tree.release(); // there's no worry for
 exception
 }}}

  - in the common_ancestor case, shouldn't we leave the original data
    in the new "current"?
  - I suspect we shouldn't do this when new_node is NULL:
 {{{
                         if (name.getLabelCount() == common_label_count) {
                             *new_node = current;
 }}}
   (or we might require new_node must always be non NULL)

 '''insertRebalance()'''
  - shouldn't we care about the case of 'node == root_' in the while
 condition?

 '''leftRotate()'''
  - largely a matter of preference, but I'd use more descriptive
    variable names like 'child', 'parent', instead of 'c', 'p', except
    commonly used idioms such as 'i' for a loop counter.
    Same comment applies to rightRotate().
  - I suggest adding const to p and c as follows:
 {{{
 RBTree<T>::leftRotate(RBNode<T>* const p) {
     RBNode<T>* const c = p->right_;
 }}}
   that way we can be sure p or c will be never replaced in the
   function, which helps read the code. Same comment applies to
   rightRotate().
  - why does this function return a value?  it doesn't seem to be used.
    Same comment applies to rightRotate().

 '''erase()'''
  - in the case of 'node->down_ != NULL', I don't think it a good idea to
 leave
    the data in the node.  what if it's a 100MB of object?
  - in the "merge down to up" case, it assumes the "tree->root" node is
    not "shadow".  It's not obvious to me if this is met.  Isn't it
    possible to have the following scenario?
 {{{
 level L: two nodes=N1, N2
   N1: shadow, with down
   N2: (any type of node)
 level L+1 (from N1): two nodes=N3, N4
   N3: shadow, with down
   N4: (any type of node)
 }}}
   Then suppose removing N4.  Tree L+1 has now only one node (N3), which
   has an upper tree (L) and the upper node (N1) is shadow.  N3, a shadow
   node, will be merged to N1, which now becomes a non shadow node.
   Even if such case shouldn't happen due to some non obvious
   constraints, it's far from clear, and we at least need comments why
   it cannot happen (we should probably add an assert() here, too).
   It would even be better if the whole code logic is more easy to
   understand.
  - In any case I see a more fundamental issue in this trick: this
    part can throw an exception (from Name::concatenate() or
    cloneDNSData(), which can throw in copying the name, or depending
    on the implementation of type T, in its reassignment, or from
    the merge_name copy).  In general, methods like "erase" or
    destructors should not throw, so we need careful reconsideration
    here.

 '''eraseNode()'''
  - like left/rightRotate, I'd use more descriptive variable names than
    'x', 'y'.  It's also awkward that 'y' is declared before 'x'.
    Meaningless initialization of these to NULLNODE is also a bad
    practice.  These variables are essentially const (i.e. initialized
    once and never replaced), so I'd revise the first part of the code
    as follows:
 {{{
 template <typename T>
 void
 RBTree<T>::eraseNode(RBNode<T>* const node) {
     RBNode<T>* const to_delete =
         (node->left_ == NULLNODE || node->right_ == NULLNODE) ?
         node : const_cast<RBNode<T>*>(node->successor());

     RBNode<T>* const child = (y->left_ != NULLNODE) ?
         to_delete->left_ : to_delete->right_;
 }}}
  - and, after reading the code of eraseNode() and deleteRebalance()
    for over an hour, I finally gave up confirming this is really
    correct in terms of the red black tree algorithm.  It's mostly
    impossible to be sure about such non trivial code without any
    comment.  Please do either:
    - clarify each chunk of the code with comment about what it intends
      to do with any non trivial remarks, or
    - fall back to a more straightforward port of BIND 9's rbt (which I
      still recommend).  This way we'll be able to be sure at least
      that it's as correct as the original code that has been proved to
      be working.

 '''deleteRebalance()'''
  - I cannot be sure this code is correct, like eraseNode().  same
    suggestion applies.
  - variable name 'w' is bad.  please use a more descriptive one.

 '''INDNET()'''
  - it should be INDENT (I fixed it in the branch).
  - We should avoid using a macro this way, especially in a (possibly)
    public header file.  It should be a function (ideally in an unnamed
    namespace, but also note the issue with an unnamed namespace in a
    header file).
  - in either case, I'd replace the for loop as follows:
 {{{
     os << std::string(depth * 5, ' ');
 }}}
  - I'm a magic number hater:-) Why is 5 chosen (not 4 for example)?

 '''printTree()'''
  - Just checking: is it intentional that \n is used, not std::endl?
    same question applies to printTreeHelper().

 '''printTreeHelper()'''
  - at the end of the function, it seems we can avoid the if-else by
    handling null node case at the beginning of the function:
 {{{
 {
     INDENT(os, depth);
     if (node == NULLNODE) {
         os << "NULL\n";
         return;
     }
 }}}
   then we can simply the end of the function:
 {{{
     printTreeHelper(os, node->left_, depth + 1);
     printTreeHelper(os, node->right_, depth + 1);
 }}}
  - maybe a matter of preference, but I'd avoid inserting a blank line
  for "visible" nodes, and avoid consuming a line just for
  'invisible'.  so, I prefer
 {{{
      d.e.f. (black) [invisible]
 }}}
   instead of
 {{{
      d.e.f. (black)
 [invisible]
 }}}
  - I'd add ' ' to 'end down from' (hey, this type of error could be
    avoided if you strictly do it test driven:-)
  - "tree has node XX" doesn't parse well.  I'd rephrase it "tree has
    XX node(s)"

 '''rbt_datasrc_unittest'''
  - insertNames: this comment is not correct any more:
 {{{
     // a node is considered to "formally" exist only if it has data
     // associated with it
 }}}
    (I guess it was based on the first implementation?)
    In the current implementation, a node will be considered "existent"
    once it's explicitly inserted whether or not the data is
    associated. so for example the following test would pass:
 {{{
     EXPECT_EQ(0, rbtree.insert(Name("d.e.f"), &rbtnode));
     EXPECT_EQ(1, rbtree.insert(Name("d.e.f"), &rbtnode));
 }}}
  - successor
   - as commented in the main code, it may not be a good idea to have
     it as a public method.
   - the last successor() operation after "g.h" seems to be no-op
 {{{
     EXPECT_EQ(Name("g.h"), successor_node->getName());
     successor_node = successor_node->successor();
 }}}
   (although I know there's probably no meaningful test for this)
  - eraseName
   - the diagrams explaining the tests are helpful:-)
   - this comment doesn't seem to match what's tested:
 {{{
     // o would not be rejoined with w.y if w.y had data
     // associated with the key
 }}}
   - Strange thing happens here:
 {{{
     // z will rejoin with d.e.f since d.e.f has no data
     EXPECT_EQ(0, rbtree.erase(Name("s.d.e.f")));
 }}}
   until this point, z.d.e.f is "shadow", but once we erase s.d.e.f it
   becomes "visible".  so, the following test would pass
 {{{
     EXPECT_EQ(RBTree<int>::NOTFOUND, rbtree.find(Name("z.d.e.f"),
 &crbtnode));
     EXPECT_EQ(0, rbtree.erase(Name("s.d.e.f")));
     EXPECT_EQ(RBTree<int>::EXACTMATCH, rbtree.find(Name("z.d.e.f"),
 &crbtnode));
 }}}
   I guess the issue I pointed out for erase() is realized here.  (This
   is more about the implementation rather than for the test itself)

-- 
Ticket URL: <http://bind10.isc.org/ticket/397#comment:20>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development