BIND 10 #397: Port generic red-black tree (RBT) data structure from BIND-9
BIND 10 Development
do-not-reply at isc.org
Wed Nov 24 14:43:30 UTC 2010
#397: Port generic red-black tree (RBT) data structure from BIND-9
------------------------------+---------------------------------------------
Reporter: zzchen_pku | Owner: hanfeng
Type: enhancement | Status: reviewing
Priority: major | Milestone:
Component: data source | Resolution:
Keywords: | Sensitive: 0
Estimatedhours: 0.0 | Hours: 0
Billable: 1 | Totalhours: 0
Internal: 0 |
------------------------------+---------------------------------------------
Changes (by jinmei):
* owner: jinmei => hanfeng
Comment:
Okay, finally completed review. Here are comments. Giving the ticket
back to feng.
First, I made some (more) minor cleanups I noted directly to the branch
(r3633). Please check them.
'''Meta Comments'''
Was it developed test driven? From the ordering of commit messages
I'm afraid it was actually "code driven". I'm even afraid the code
and tests were written by different developers. If it went that
way, please seriously consider revisiting the development style.
Without practice we will never be able to master TDD. If this is done
by a pair, it shouldn't be a pair of "main coder" and "test coder".
Each person should do both, in the order of test then code.
'''General Comments'''
- I still don't understand why we need to rewrite the whole logic
from the scratch. This is complicated stuff by nature, so doing it
from the scratch only seems to increase the risk of introducing
bugs. In fact, I've already found some strange behaviors and
points that don't meet our requirements (see below for the
specifics). I couldn't also be sure about the correctness of one
part of the code (see the comment for eraseNode() below). Redblack
tree algorithm itself is very complicated and difficult to write
correctly or difficult to be sure if it's correct, so why don't we
simply port the existing, known to work code? That way, we can at
least be sure that it should be as correct as the original code.
Code written from the scratch requires more load for review and is
normally more buggy. "The original code is complicated" is true,
but that is not a convincing reason to me because this code is also
complicated. So, I'd like to ask revisiting the approach again.
If, you still think the scratch code is better, that's a decision,
but please describe why we can't use the original code in the class
description in detail.
- A related point: the current design seems to be much less efficient
in many points. For example, it stores name objects, which
contain std::string and std::vector objects, and therefore heavy in
terms of memory footprint. Likewise, having a separate tree object
for each subtree can be heavy wrt memory footprint (consider, e.g.,
a large TLD that contains many <subdomain>.<tld> names with "in
bailiwick" NS names such as nsX.<subdomain>.<tld>. Then we'll have
a large number of sub trees). find() involves possibly many
recursive function calls, which will be slower than a single loop.
These things may be acceptable for initial implementation if it has
other advantage such as code simplicity (I'm personally okay with
having name objects for now, for example), but these are important
design decision and should be explicitly documented.
- As we discussed before, I'm personally against using
boost::noncopyable in a public header file. At the very least we
should be consistent about the use of it (right now this is the
only place we use it). If you still believe increasing the
reliance on boost is the way to go, please raise it as a general
matter and get consensus, then introduce it throughout our code
tree.
- Admitting this may be a matter of taste, this type of syntax looks
awkward, and reduces code readability (by confusing the reader) to
me:
{{{
if (-1 == ret) {
}}}
I guess this tries to proactively avoid a common error of mistyping
'==' with '='. On one hand, this might be considered a good
practice; on the other hand, where I stand, this is an outdated
hack that just makes the code awkward. Modern compilers warn about
this type mistypes, and by treating warnings as an error we can
avoid this type of bugs. For example, if I modify this code to:
{{{
if (ret = -1) {
}}}
my gcc warns as follows:
{{{
rbt_datasrc.h:525: warning: suggest parentheses around assignment used as
truth value
}}}
In addition, particularly in C++, we can declare many variables
const, which also help trigger a compiler error due to an
accidental assignment.
Besides, the style isn't consistent even in this file. In some
other parts a "more natural" style is used:
{{{
if (common_label_count == 1) {
}}}
The inconsistency is a source of confusion and reduces readability.
If, we agree with adopting this as a coding guideline and using it
throughout our source tree consistently, I could live with that.
Until/unless that happens, I suggest we keep the "natural" style
consistently.
- About doxygen comment: comments on private methods will be ignored
by doxygen. it's still sometimes useful to provide detailed
comments on them, but if it's related to a general design matter,
it might be better to describe the points in, e.g., the class
description.
- I'm afraid it's not a good idea to store raw "type-T" data in
RBNode. We need to represent a "node without data", and the notion
of "shadow" isn't sufficient for this purpose (see also discussion
about "NXRRSET" below). Also, depending on the implementation of
the type, raw data may be heavyweight in terms of memory footprint,
even if it's a default-constructed one. So we should probably
store the data as a pointer(-like) type object that can be
comparable with "NULL".
- class naming: it may be better to introduce one level of
abstraction than "red black tree", because from user's point of
view the internal structure doesn't (shouldn't) matter. Also, we
may actually want to change the internal structure to something
different (maybe a hash table, or other tree-like structure as
Michal proposed).
- Likewise, I suspect the module and filename are not appropriate.
This stuff should probably go to lib/datasrc. And, "rbt_datasrc.h"
is not a good name in two reasons: this is a generic backend and is
not a data source per se; it would be better to raise the
abstraction level as discussed above.
- documentation in general: we need two types of documentation: one
for API users, and the other for developers who try to read/modify
the code. For the former (or both), we need to describe "what this
class is", "how to use it", "when this method throws an exception",
"what this parameter should be", "what this method returns in which
case", etc. Especially for the latter, we need to describe "why we
design it that way", "how this is different from other
implementation (e.g. BIND 9) and why", "what is missing in the
current implementation", etc. These things are largely missing in
this implementation.
'''operator- for Name'''
- technically, an unnamed namespace shouldn't be used in a public
header file because it would easily cause breaking the C++ one
definition rule. I suspect there's even a compiler that rejects
this code. Assuming we'll eventually (actually soon) use more
efficient data structure than Name objects, it may be okay for
now. But we should at least note these in comments.
'''RBNode class'''
- RBTreeColor: a minor point, but why does it start with 1?
- RBTreeColor: it seems this can (should) be non public
- getName(): we'll actually need the full name of the node (not
relative)
- successor()
- why is it public (other than for testing purpose)? it's not
always useful for applications because it only provides the
"successor" of a single subtree. What the application would really
need is the "next" node of the entire tree-of-tree (= zone). and,
it can even be harmful in that it returns the internal null node,
whose property is completely hidden in the implementation and
will confuse the application.
- documentation needs to be improved:
- define "big"
- what if this is the "biggest" node?
- note also that "whose name is bigger" is ambiguous. It normally
doesn't specify a single node because there are normally more
than one "bigger" nodes.
- cloneDNSData(): "rbt" is now non DNS specific thing, so this
method should be renamed.
- setDownTree(): doc: it's not a complete sentence. also, what it tries
to say isn't clear.
- not clear why we need is_shadow_ (unless you read other part of
the code carefully). This should be fully documented (probably in
the class description of RBTree).
- name_: unclear description.
- setDownTree(): I didn't understand the description.
'''RBTree class'''
- documentation: this is insufficient. see above for the general
matter. this class should specifically describe overall data
structure, complexity, overall user interfaces, footprint
consideration, ownership property (whether the application should
keep the ownership of the data stored in the "tree"), etc.
- FindResult: we may want to make the naming consistent with
ZoneTable, where we use SUCCESS instead of EXACTMATCH.
- find()
- the documentation should clarify when to use which version of
this method.
- semantically, the "mutable" version shouldn't be a const member
function, because the caller can effectively modify the tree
through the returned node pointer. IMO, it's safer to remove
the const and warn that (in doc) this is a dangerous version of
find() and should be used only when necessary and with caution.
- in BIND 9 we add argument validation for this type of API:
requiring inserted_node != NULL && *inserted_node = NULL.
we may want to do the same check. same comment applies to insert().
- for what purpose do we need getNodeCount()? for debug? the intent
should be documented, and if we don't need it in practice, we
should probably remove it. I'd also note that "node created common
suffix node" is completely implementation specific, and isn't
appropriate for a public interface (it's susceptible to
implementation changes)
- same comment applies to getNameCount()?
- printTree(): our usual practice is to define toText() and (if
necessary) define operator<<() separately (but for this class
toText() may not always make sense because the resulting text may
become very large).
Note also that the naming of "printTree()" is too specific to the
current implementation, and "Tree" is actually redundant because
it's a member function of "RBTree". maybe we should just call it
"print()".
- insert()
- I don't like it to return magic numbers such as 0/1/-1. Why don't
we generalize the result code (EXACTMATCH, etc) and use it here?
Note: ZoneTable adopts that approach.
- In any case, I don't think it a good idea to handle the 'out of
memory' error via the return value. I'll discuss it more with
exception safety below.
- we may eventually want to introduce an iterator, and want to have
insert() return an iterator. for now I'm okay without this
extension, but it would be better to note this possibility in the
class documentation.
- erase(): like insert(), I don't like it to return magic numbers.
- The diagram after the class definition: this should be included in
the doxygen document.
'''~RBTree()'''
- This seems to be quite inefficient (consider how many nodes we need
to visit before deleting the entire tree). This seems to me to be
another bad reason for redeveloping the wheel.
'''findHelper()'''
- I don't think it correct to return NOTFOUND when the search finds
an exact match with "is_shadow_" because there's a case like
"NXRRSET", where the node is empty but should match the search key.
Note that handling this case wouldn't be that trivial, because
there will be indirect ways to change the node property, e.g.
an application first finds a "shadow" node in an "NXRRSET mode", and
then perform setData() on it, which effectively makes that node
"non shadow".
This design should be substantially revisited.
- if we specify 'const' for 'tree', why not do that for 'ret', too?
note that 'tree' is also subject to a subsequent change, where we
need const_cast (see erase()). ideally we should avoid such
casting, but I can live with it in this case, understanding that
it's used in mutable and immutable contexts and that it's a private
helper function. however, such a tricky property should be
consistent so that the intent is clearer to code readers. the
intent should also be documented in detail, so that when someone
needs to modify the code the interface won't confuse the developer.
'''getNodeCount/Helper()'''
- this requires traversing the entire tree just to count the total
number of nodes. depending on the purpose, it seems too expensive.
Note: in BIND 9 version of rbt this is a constant operation.
'''getNameCounter()'''
- same comment applies.
'''insert()'''
- this is still not exception safe, and this usage of try-catch is
not consistent with our convention, and makes the code unreadable.
First, it's still not safe. For example, operator- for names or
cloneDNSData() will require resource allocation for new name
objects and can throw. If this happens the memory allocated for
new_sub_tree will leak. In general, it's very error prone to try
to address exception safety this way unless the code is very short
and only uses very primitive interfaces, because we can easily miss
a function that can throw an exception. It's also fragile to
future changes for the same reason. In fact, this is one of the
reasons "why we shouldn't use exceptions" often claimed by
exception opponents.
Second, in a (C++) package relying on exceptions, handling
exceptions locally doesn't make sense in general. This is first
because how to handle the situation that an exception occurs is
often unclear at a lower level like this (e.g. at a top application
level, we might be able to drop some unnecessary data such as
caches to make more space for getting mandatory resources). It's
also because adding try-catch in a lower level decreases code
readability with other disadvantages of exceptions (such as higher
costs and higher possibility of resource leaks). In fact, the
resulting code is quite unreadable due to the handling of rare
exceptional cases. If we were to handle these error cases at this
point, it's much better to use no-throw new and write the C-style
if-fail-then-(goto)-cleanup code.
I'd personally prefer allocating resources in an RAII manner and
simply let exceptions be propagated. We could use a shared
pointer, but in this case it's a bit overkilling, and the standard
auto_ptr should suffice. For example, the new_sub_tree handling
would look like this:
{{{
std::auto_ptr<RBTree<T> > new_sub_tree =
std::auto_ptr<RBTree<T> >(new RBTree());
RBNode<T>* sub_root;
new_sub_tree->insert(sub_name, &sub_root);
int ret = 0;
if (name.getLabelCount() != common_label_count) {
ret = new_sub_tree->insert(name - common_ancestor,
new_node);
}
RBTree<T>* down_old = current->down_;
current->setDownTree(new_sub_tree.get());
...
current->is_shadow_ = true;
new_sub_tree.release(); // there's no worry for
exception
}}}
- in the common_ancestor case, shouldn't we leave the original data
in the new "current"?
- I suspect we shouldn't do this when new_node is NULL:
{{{
if (name.getLabelCount() == common_label_count) {
*new_node = current;
}}}
(or we might require new_node must always be non NULL)
'''insertRebalance()'''
- shouldn't we care about the case of 'node == root_' in the while
condition?
'''leftRotate()'''
- largely a matter of preference, but I'd use more descriptive
variable names like 'child', 'parent', instead of 'c', 'p', except
commonly used idioms such as 'i' for a loop counter.
Same comment applies to rightRotate().
- I suggest adding const to p and c as follows:
{{{
RBTree<T>::leftRotate(RBNode<T>* const p) {
RBNode<T>* const c = p->right_;
}}}
that way we can be sure p or c will be never replaced in the
function, which helps read the code. Same comment applies to
rightRotate().
- why does this function return a value? it doesn't seem to be used.
Same comment applies to rightRotate().
'''erase()'''
- in the case of 'node->down_ != NULL', I don't think it a good idea to
leave
the data in the node. what if it's a 100MB of object?
- in the "merge down to up" case, it assumes the "tree->root" node is
not "shadow". It's not obvious to me if this is met. Isn't it
possible to have the following scenario?
{{{
level L: two nodes=N1, N2
N1: shadow, with down
N2: (any type of node)
level L+1 (from N1): two nodes=N3, N4
N3: shadow, with down
N4: (any type of node)
}}}
Then suppose removing N4. Tree L+1 has now only one node (N3), which
has an upper tree (L) and the upper node (N1) is shadow. N3, a shadow
node, will be merged to N1, which now becomes a non shadow node.
Even if such case shouldn't happen due to some non obvious
constraints, it's far from clear, and we at least need comments why
it cannot happen (we should probably add an assert() here, too).
It would even be better if the whole code logic is more easy to
understand.
- In any case I see a more fundamental issue in this trick: this
part can throw an exception (from Name::concatenate() or
cloneDNSData(), which can throw in copying the name, or depending
on the implementation of type T, in its reassignment, or from
the merge_name copy). In general, methods like "erase" or
destructors should not throw, so we need careful reconsideration
here.
'''eraseNode()'''
- like left/rightRotate, I'd use more descriptive variable names than
'x', 'y'. It's also awkward that 'y' is declared before 'x'.
Meaningless initialization of these to NULLNODE is also a bad
practice. These variables are essentially const (i.e. initialized
once and never replaced), so I'd revise the first part of the code
as follows:
{{{
template <typename T>
void
RBTree<T>::eraseNode(RBNode<T>* const node) {
RBNode<T>* const to_delete =
(node->left_ == NULLNODE || node->right_ == NULLNODE) ?
node : const_cast<RBNode<T>*>(node->successor());
RBNode<T>* const child = (y->left_ != NULLNODE) ?
to_delete->left_ : to_delete->right_;
}}}
- and, after reading the code of eraseNode() and deleteRebalance()
for over an hour, I finally gave up confirming this is really
correct in terms of the red black tree algorithm. It's mostly
impossible to be sure about such non trivial code without any
comment. Please do either:
- clarify each chunk of the code with comment about what it intends
to do with any non trivial remarks, or
- fall back to a more straightforward port of BIND 9's rbt (which I
still recommend). This way we'll be able to be sure at least
that it's as correct as the original code that has been proved to
be working.
'''deleteRebalance()'''
- I cannot be sure this code is correct, like eraseNode(). same
suggestion applies.
- variable name 'w' is bad. please use a more descriptive one.
'''INDNET()'''
- it should be INDENT (I fixed it in the branch).
- We should avoid using a macro this way, especially in a (possibly)
public header file. It should be a function (ideally in an unnamed
namespace, but also note the issue with an unnamed namespace in a
header file).
- in either case, I'd replace the for loop as follows:
{{{
os << std::string(depth * 5, ' ');
}}}
- I'm a magic number hater:-) Why is 5 chosen (not 4 for example)?
'''printTree()'''
- Just checking: is it intentional that \n is used, not std::endl?
same question applies to printTreeHelper().
'''printTreeHelper()'''
- at the end of the function, it seems we can avoid the if-else by
handling null node case at the beginning of the function:
{{{
{
INDENT(os, depth);
if (node == NULLNODE) {
os << "NULL\n";
return;
}
}}}
then we can simply the end of the function:
{{{
printTreeHelper(os, node->left_, depth + 1);
printTreeHelper(os, node->right_, depth + 1);
}}}
- maybe a matter of preference, but I'd avoid inserting a blank line
for "visible" nodes, and avoid consuming a line just for
'invisible'. so, I prefer
{{{
d.e.f. (black) [invisible]
}}}
instead of
{{{
d.e.f. (black)
[invisible]
}}}
- I'd add ' ' to 'end down from' (hey, this type of error could be
avoided if you strictly do it test driven:-)
- "tree has node XX" doesn't parse well. I'd rephrase it "tree has
XX node(s)"
'''rbt_datasrc_unittest'''
- insertNames: this comment is not correct any more:
{{{
// a node is considered to "formally" exist only if it has data
// associated with it
}}}
(I guess it was based on the first implementation?)
In the current implementation, a node will be considered "existent"
once it's explicitly inserted whether or not the data is
associated. so for example the following test would pass:
{{{
EXPECT_EQ(0, rbtree.insert(Name("d.e.f"), &rbtnode));
EXPECT_EQ(1, rbtree.insert(Name("d.e.f"), &rbtnode));
}}}
- successor
- as commented in the main code, it may not be a good idea to have
it as a public method.
- the last successor() operation after "g.h" seems to be no-op
{{{
EXPECT_EQ(Name("g.h"), successor_node->getName());
successor_node = successor_node->successor();
}}}
(although I know there's probably no meaningful test for this)
- eraseName
- the diagrams explaining the tests are helpful:-)
- this comment doesn't seem to match what's tested:
{{{
// o would not be rejoined with w.y if w.y had data
// associated with the key
}}}
- Strange thing happens here:
{{{
// z will rejoin with d.e.f since d.e.f has no data
EXPECT_EQ(0, rbtree.erase(Name("s.d.e.f")));
}}}
until this point, z.d.e.f is "shadow", but once we erase s.d.e.f it
becomes "visible". so, the following test would pass
{{{
EXPECT_EQ(RBTree<int>::NOTFOUND, rbtree.find(Name("z.d.e.f"),
&crbtnode));
EXPECT_EQ(0, rbtree.erase(Name("s.d.e.f")));
EXPECT_EQ(RBTree<int>::EXACTMATCH, rbtree.find(Name("z.d.e.f"),
&crbtnode));
}}}
I guess the issue I pointed out for erase() is realized here. (This
is more about the implementation rather than for the test itself)
--
Ticket URL: <http://bind10.isc.org/ticket/397#comment:20>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development
More information about the bind10-tickets
mailing list