BIND 10 #2096: Define and implement "RdataReader" class
BIND 10 Development
do-not-reply at isc.org
Thu Aug 16 07:19:19 UTC 2012
#2096: Define and implement "RdataReader" class
-------------------------------------+-------------------------------------
Reporter: | Owner: jinmei
jinmei | Status: reviewing
Type: task | Milestone:
Priority: | Sprint-20120821
medium | Resolution:
Component: data | Sensitive: 0
source | Sub-Project: DNS
Keywords: | Estimated Difficulty: 5
Defect Severity: N/A | Total Hours: 0
Feature Depending on Ticket: |
scalable inmemory |
Add Hours to Ticket: 0 |
Internal?: 0 |
-------------------------------------+-------------------------------------
Comment (by jinmei):
'''initial notes'''
First, I made a few editorial/style changes (with one real bug fix)
and committed them.
Second, please check the revised version of getSize(). The behavior
seems to be reasonably covered by existing tests so I didn't add new
one, but maybe it's better to have some explicit tests.
'''substantial design issue'''
If I understand the implementation correctly, it lacks one important
functionality: ability to specify RDATA boundaries. Without this, we
cannot use this class to implement RRset::toWire.
We need something like this:
{{{#!cpp
/// \brief Iterate over a single RDATA.
///
/// This method acts like iterate(), but returns once all data fields
of
/// on RDATA are examined. The RdataReader object remembers the
position,
/// and the next call to this method works on the next (if any) RDATA.
/// It returns true if there's more RDATA and false otherwise.
bool iterateRdata();
}}}
Same for RRSIG (actually, in practice, we'll only need the "RDATA
version" of it rather than the "whole set version").
'''other design level discussions'''
I appreciate the effort of providing two ways of usage, but I'd like
to provide a single unified way. At least for now, `RdataReader` is
mostly private to the in-memory data source implementation, and its
expected usage is pretty limited. It will be just awkward and
confusing if we end up having different ways for different cases
depending on mere preference of the developer who happens to work on
that particular part. On the other hand, if we use only one of the
two ways, the other version will be just a waste, leaving the
implementation and test cases unnecessarily complicated (in fact, the
current unit tests looked quite convoluted - I suspect it's pretty
difficult to understand unless you already knew the original tests).
Besides, due to the hybrid nature the current implementation has
unnecessary overhead for both cases: for the "next" mode, we have call
overhead for the callbacks even if it's less likely to use non default
callbacks in this mode; for the "iterator" mode, constructing the
`Result` data could be omitted otherwise. I'll talk about more
details on performance impact below.
If, in future, we decide to make this class generic (with the encoder
class) and public (which might actually happen, as the framework
itself is quite general), it may make sense to provide different ways
of handling data, but unless and until it happens, I'd like to keep it
simpler and faster.
If we can agree on that, the next question is which one we should choose.
I'm afraid we have different opinions on this point, but I'll try to
explain why I prefer the iterator approach: it's for safety and
performance.
IMO, case analysis like the one in NextDecoder::decode() is pretty
error prone. Even if noted in the documentation, it's always possible
to incorrectly use invalid combination such as data() for the NAME
type as long as the interface isn't prohibited. We also need to
explicitly think about basically impossible case like the "default" in
the switch.
The "next" version is also less efficient. It's inherently so,
because we need to examine the type within the class implementation
and its application. Also, at least in the current implementation the
construction overhead of the `Result` object is not negligible (for
example, it involves the construction of `LabelSequence`, which,
although designed to be lightweight, still requires some amount of
work). This happens not only for usual data/name, but also for
indicating the end of data. The suggested usage using the type
conversion operator makes it worse:
{{{#!cpp
/// RdataReader::Result data;
/// while (data = reader.next()) {
/// switch (data.type()) {
}}}
this involves a call to the default constructor, and then to the
assignment operator.
To confirm this concern, I've written a micro benchmark test and
committed it to the branch. In my environment (with the hardcoded
test scenario), the "next" version is more than 10% slower than the
"iterator" version. If we try to avoid the overhead of duplicate
construction like the quoted part above, the "next" version was
a bit faster, but it's still slower than 5-8% than the "iterator"
version. (I guess the benchmark program will be useful even after
completing this branch, so we should probably keep it with necessary
cleanup).
Granted, the difference might end up marginal in higher level query
performance. But this class is supposed to be used in major
performance bottlenecks, I'd like to make it as fast as possible,
especially if other factors are mainly about style preference.
I admit the indirect usage via the callback is a drawback of the
"iterator" version, but after seeing code examples in the unit tests
and the benchmark program, the "next" version with a loop and case
analysis doesn't really look so readable either (although you may
simply disagree on this).
So, my suggestion is:
- solely use the iterator version, or
- if we cannot agree, at least make the "next" version reasonably
faster so the performance difference will be marginal
This point may make some of my comments to the current specific code
moot, but I'm dumping them anyway, below.
'''rdata_field.h'''
- I'm not really sure about the meaning of 'private':
{{{#!cpp
/// This header should be considered private to the implementation and
should
/// not be used from outside.
}}}
if it means it's private to `RdataEncoder` and `RdataReader` only, we
should
at least extract `RdataNameAttributes` to somewhere else because
it's expected to be used by other classes via the reader's name
action. Or, if it's about naming (rdata_encoder.h looks like
encoder specific), maybe we should introduce more generic name like
"rdata_serialization.h/cc" and define related things there?
(and still hide implementation details in .cc). On a related point,
we might want to rename `RdataEncoder` to `RdataSerializer`?
- if we keep separating the header file, the top note in
rdata_encoder.h (for file rdata_encoder.h) should probably be moved
to the common header (the one that is supposed to be referenced from
other classes).
'''rdata_reader.h/cc'''
- I think we should clarify the meaning of 'data field' or 'piece of
data' (and should probably unify the terminology) in the class
description, because this concept is specific to this implementation.
For example:
{{{#!cpp
/// This class allows you to read the data encoded by RdataEncoder.
/// It is rather low-level -- it provides a sequence of data fields,
/// where each field is either:
/// - opaque data, in the form of a non NULL (void*/uint8_t*) pointer
/// and its length, or
/// - a domain name, in the form of dns::LabelSequence (which must be
/// absolute) and its attributes in the form of a logical OR of
/// `RdataNameAttributes` values
///
/// Conceptually, these fields correspond to consecutive regions in
/// wire-format representation of the RDATA, varying the type of above
/// two cases depending on whether the region corresponds to a domain
/// name or other data. For example, for an MX RDATA the field
/// sequence will be
/// - 2 bytes of opaque data (which corresponds to the MX preference)
/// - a domain name (which corresponds to the MX name)
/// If the encoded data contain multiple MX RDATAs, the same type of
/// sequence continues for the number of RDATAs. Note that the opaque
/// data field does not always correspond to a specific RDATA field
/// as is the 2-byte preference field of MX. For example, the field
/// sequence for an SOA RDATA in terms of `RdataEncoder` will be:
/// - a domain name (which corresponds to the SOA MNAME)
/// - a domain name (which corresponds to the SOA RNAME)
/// - 20 bytes of opaque data (for the rest of fields)
///
/// So, if you want to construct a general purpose dns::Rdata object
/// from the field sequence, you'll need to build the complete
/// wire-format data, and then construct a dns::Rdata object from it
}}}
With this type of clarification, I guess we should consistently use
"(data) field" in other part of the doc, rather than mixing it with
some other terms like "piece of data".
- what does "it's not unfinished" mean?
{{{#!cpp
// Do nothing here. On purpose, it's not unfinished.
}}}
- maybe it's better to change the type of the second parameter of
`NameAction` to `RdataNameAttributes` to make it clear what's
actually expected to be passed there.
- I'd suggest changing the type of `data` parameter of `RdataReader`
constructor to make it clear that this region of memory is opaque
data for the user of the encoder/reader classes.
- Likewise, it's probably better to change the type of `data`
parameter of `DataAction` to `const void*`. In fact, we'd generally
expect the passed data (with the size) to be treated as opaque data
(e.g., transparently passed to writeData() or write(2) system call
variants) rather than be examined byte-by-byte.
- is 'unstructuder' a typo?
{{{#!cpp
DATA, ///< This iteration returns unstructuder data
}}}
- The `Result` class: by convention, I think it's better to name
getter methods "getXXX()" (e.g., getType() instead of type()).
- Aside from the previous point, I think 'label()' should be renamed
to something like 'labels' or '(get)LabelSequence' because 'label'
sounds like a single label.
- `Result::operator bool()`: admittedly this might be a matter of
taste (though in my understanding it's generally supported by
literatures), I personally would like to avoid using type
conversion operator, if its only purpose is to allow a shortcut
notation like:
{{{#!cpp
RdataReader::Result field;
while ((field = reader.next())) {
}}}
This could be written without relying on the type conversion
operator, e.g., as follows:
{{{#!cpp
while (true) {
const RdataReader::Result field = reader.next();
if (field.type() == RdataReader::END) {
break;
}
}}}
(and we could even integrate it into the switch if a switch of
field type follows). While the latter may seem to be less concise,
it has other advantages like using 'field' as a const and avoiding
calling the default constructor or explicit assignment (which is
faster as I showed above, and will prevent making some member
variables const which could be so otherwise). And, even if we
don't consider it an advantage, I'd personally be happy if we can
avoid pitfalls with implicit type conversion.
- I'd replace this:
{{{#!cpp
/// \brief The domain label.
///
/// This holds the domain label. It is only valid if type() ==
NAME.
}}}
as follows:
{{{#!cpp
/// \brief The domain name.
///
/// This holds the domain name in the form of \c
dns::LabelSequence.
/// It is only valid if type() == NAME.
}}}
mainly because "label" sounds like a single label.
- `RdataReader` constructor: as suggested above, I'd change the type
of 'data' to `const void*`. Then, I'd use reinterpret_cast for
lengths_:
{{{#!cpp
lengths_(reinterpret_cast<const uint16_t*>(data)),
}}}
In this case, we are in fact doing dangerous conversion from
an arbitrary pointer to uint16_t* (when it's used in `RdataSet`, we
use the region immediately after the object for the encoded data
by reinterpreting `RdataSet*` to, e.g., uint16_t*). The caller may
even be careless about data alignment and the `data` pointer can
indeed be bogus as a uint16_t pointer depending on the architecture.
We know all this, and still deliberately decided to use the risky
conversion trading the safety with efficiency. For such cases, IMO
we should be honest about what we are doing, and by doing so, we can
indicate readers of the code that it should be handled with an extra
care.
- Same for `data_`.
- nextSig(): maybe marginal or the compiler internally does it, but we
might define a variable for the length_ index to avoid the
computation twice:
{{{#!cpp
const size_t length_idx = var_count_total_ + sig_pos_;
const Result result(sigs_ + sig_data_pos_, lengths_[length_idx]);
// Move the position of iterator.
sig_data_pos_ += lengths_[length_idx];
}}}
'''rdata_serialization_unittest'''
- It's not really clear to me why we need to keep
`ManualDecoderStyle`. Is there anything essentially missing that
can only be tested here?
- It's safer to avoid statically initialize dummy_name2:
{{{#!cpp
// Used across more classes and scopes. But it's just uninteresting
// constant.
const Name dummy_name2("example.com");
}}}
(it's not just a theoretical concern; we in fact previously had a
static initialization fiasco with a static name object in a unit
test)
- this comment (for additionalRequired) is not correct any more:
{{{#!cpp
// We'll pass it to renderNameField to check the stored attribute
matches
// our expectation.
}}}
- `RdataEncodeDecodeTest`: I'd clarify the requirement to
`DecoderStyle`. At least it seems to be expected to have a
'decode()' method, so I'd note that fact and what properties
this method should meet. If there are other requirements, this
should be documented too.
- do you actually mean 'rewound' by 'rewund'?
{{{#!cpp
// Decode using reader with the return value of next, but after the reader
// was used and rewund.
class RewundDecoder {
}}}
- NextDecoder::decode() and RewundDecoder::decode() are mostly
identical. I'd combine them.
- addSIGRdataOnly test should use the reader (whether or not we keep
the brute force version). maybe also for encodeLargeRdata.
- readerResult: we cannot pass a temporary `Name` object to
`LabelSequence` constructor (for 'compressible'). I've fixed it
with other trivial fixes.
--
Ticket URL: <http://bind10.isc.org/ticket/2096#comment:15>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development
More information about the bind10-tickets
mailing list