BIND 10 #2096: Define and implement "RdataReader" class

Thu Aug 16 07:19:19 UTC 2012

#2096: Define and implement "RdataReader" class
-------------------------------------+-------------------------------------
                   Reporter:         |                 Owner:  jinmei
  jinmei                             |                Status:  reviewing
                       Type:  task   |             Milestone:
                   Priority:         |  Sprint-20120821
  medium                             |            Resolution:
                  Component:  data   |             Sensitive:  0
  source                             |           Sub-Project:  DNS
                   Keywords:         |  Estimated Difficulty:  5
            Defect Severity:  N/A    |           Total Hours:  0
Feature Depending on Ticket:         |
  scalable inmemory                  |
        Add Hours to Ticket:  0      |
                  Internal?:  0      |
-------------------------------------+-------------------------------------

Comment (by jinmei):

 '''initial notes'''

 First, I made a few editorial/style changes (with one real bug fix)
 and committed them.

 Second, please check the revised version of getSize().  The behavior
 seems to be reasonably covered by existing tests so I didn't add new
 one, but maybe it's better to have some explicit tests.

 '''substantial design issue'''

 If I understand the implementation correctly, it lacks one important
 functionality: ability to specify RDATA boundaries.  Without this, we
 cannot use this class to implement RRset::toWire.

 We need something like this:

 {{{#!cpp
     /// \brief Iterate over a single RDATA.
     ///
     /// This method acts like iterate(), but returns once all data fields
 of
     /// on RDATA are examined.  The RdataReader object remembers the
 position,
     /// and the next call to this method works on the next (if any) RDATA.
     /// It returns true if there's more RDATA and false otherwise.
     bool iterateRdata();
 }}}

 Same for RRSIG (actually, in practice, we'll only need the "RDATA
 version" of it rather than the "whole set version").

 '''other design level discussions'''

 I appreciate the effort of providing two ways of usage, but I'd like
 to provide a single unified way.  At least for now, `RdataReader` is
 mostly private to the in-memory data source implementation, and its
 expected usage is pretty limited.  It will be just awkward and
 confusing if we end up having different ways for different cases
 depending on mere preference of the developer who happens to work on
 that particular part.  On the other hand, if we use only one of the
 two ways, the other version will be just a waste, leaving the
 implementation and test cases unnecessarily complicated (in fact, the
 current unit tests looked quite convoluted - I suspect it's pretty
 difficult to understand unless you already knew the original tests).

 Besides, due to the hybrid nature the current implementation has
 unnecessary overhead for both cases: for the "next" mode, we have call
 overhead for the callbacks even if it's less likely to use non default
 callbacks in this mode; for the "iterator" mode, constructing the
 `Result` data could be omitted otherwise.  I'll talk about more
 details on performance impact below.

 If, in future, we decide to make this class generic (with the encoder
 class) and public (which might actually happen, as the framework
 itself is quite general), it may make sense to provide different ways
 of handling data, but unless and until it happens, I'd like to keep it
 simpler and faster.

 If we can agree on that, the next question is which one we should choose.
 I'm afraid we have different opinions on this point, but I'll try to
 explain why I prefer the iterator approach: it's for safety and
 performance.

 IMO, case analysis like the one in NextDecoder::decode() is pretty
 error prone.  Even if noted in the documentation, it's always possible
 to incorrectly use invalid combination such as data() for the NAME
 type as long as the interface isn't prohibited.  We also need to
 explicitly think about basically impossible case like the "default" in
 the switch.

 The "next" version is also less efficient.  It's inherently so,
 because we need to examine the type within the class implementation
 and its application.  Also, at least in the current implementation the
 construction overhead of the `Result` object is not negligible (for
 example, it involves the construction of `LabelSequence`, which,
 although designed to be lightweight, still requires some amount of
 work).  This happens not only for usual data/name, but also for
 indicating the end of data.  The suggested usage using the type
 conversion operator makes it worse:
 {{{#!cpp
 /// RdataReader::Result data;
 /// while (data = reader.next()) {
 ///     switch (data.type()) {
 }}}
 this involves a call to the default constructor, and then to the
 assignment operator.

 To confirm this concern, I've written a micro benchmark test and
 committed it to the branch.  In my environment (with the hardcoded
 test scenario), the "next" version is more than 10% slower than the
 "iterator" version.  If we try to avoid the overhead of duplicate
 construction like the quoted part above, the "next" version was
 a bit faster, but it's still slower than 5-8% than the "iterator"
 version.  (I guess the benchmark program will be useful even after
 completing this branch, so we should probably keep it with necessary
 cleanup).

 Granted, the difference might end up marginal in higher level query
 performance.  But this class is supposed to be used in major
 performance bottlenecks, I'd like to make it as fast as possible,
 especially if other factors are mainly about style preference.

 I admit the indirect usage via the callback is a drawback of the
 "iterator" version, but after seeing code examples in the unit tests
 and the benchmark program, the "next" version with a loop and case
 analysis doesn't really look so readable either (although you may
 simply disagree on this).

 So, my suggestion is:
 - solely use the iterator version, or
 - if we cannot agree, at least make the "next" version reasonably
    faster so the performance difference will be marginal

 This point may make some of my comments to the current specific code
 moot, but I'm dumping them anyway, below.

 '''rdata_field.h'''

 - I'm not really sure about the  meaning of 'private':
 {{{#!cpp
 /// This header should be considered private to the implementation and
 should
 /// not be used from outside.
 }}}
   if it means it's private to `RdataEncoder` and `RdataReader` only, we
 should
   at least extract `RdataNameAttributes` to somewhere else because
   it's expected to be used by other classes via the reader's name
   action.  Or, if it's about naming (rdata_encoder.h looks like
   encoder specific), maybe we should introduce more generic name like
   "rdata_serialization.h/cc" and define related things there?
   (and still hide implementation details in .cc).  On a related point,
   we might want to rename `RdataEncoder` to `RdataSerializer`?

 - if we keep separating the header file, the top note in
   rdata_encoder.h (for file rdata_encoder.h) should probably be moved
   to the common header (the one that is supposed to be referenced from
   other classes).

 '''rdata_reader.h/cc'''

 - I think we should clarify the meaning of 'data field' or 'piece of
   data' (and should probably unify the terminology) in the class
   description, because this concept is specific to this implementation.
   For example:
 {{{#!cpp
 /// This class allows you to read the data encoded by RdataEncoder.
 /// It is rather low-level -- it provides a sequence of data fields,
 /// where each field is either:
 /// - opaque data, in the form of a non NULL (void*/uint8_t*) pointer
 ///   and its length, or
 /// - a domain name, in the form of dns::LabelSequence (which must be
 ///   absolute) and its attributes in the form of a logical OR of
 ///   `RdataNameAttributes` values
 ///
 /// Conceptually, these fields correspond to consecutive regions in
 /// wire-format representation of the RDATA, varying the type of above
 /// two cases depending on whether the region corresponds to a domain
 /// name or other data.  For example, for an MX RDATA the field
 /// sequence will be
 /// - 2 bytes of opaque data (which corresponds to the MX preference)
 /// - a domain name (which corresponds to the MX name)
 /// If the encoded data contain multiple MX RDATAs, the same type of
 /// sequence continues for the number of RDATAs.  Note that the opaque
 /// data field does not always correspond to a specific RDATA field
 /// as is the 2-byte preference field of MX.  For example, the field
 /// sequence for an SOA RDATA in terms of `RdataEncoder` will be:
 /// - a domain name (which corresponds to the SOA MNAME)
 /// - a domain name (which corresponds to the SOA RNAME)
 /// - 20 bytes of opaque data (for the rest of fields)
 ///
 /// So, if you want to construct a general purpose dns::Rdata object
 /// from the field sequence, you'll need to build the complete
 /// wire-format data, and then construct a dns::Rdata object from it
 }}}
   With this type of clarification, I guess we should consistently use
   "(data) field" in other part of the doc, rather than mixing it with
   some other terms like "piece of data".

 - what does "it's not unfinished" mean?
 {{{#!cpp
     // Do nothing here. On purpose, it's not unfinished.
 }}}

 - maybe it's better to change the type of the second parameter of
   `NameAction` to `RdataNameAttributes` to make it clear what's
   actually expected to be passed there.

 - I'd suggest changing the type of `data` parameter of `RdataReader`
   constructor to make it clear that this region of memory is opaque
   data for the user of the encoder/reader classes.

 - Likewise, it's probably better to change the type of `data`
   parameter of `DataAction` to `const void*`.  In fact, we'd generally
   expect the passed data (with the size) to be treated as opaque data
   (e.g., transparently passed to writeData() or write(2) system call
   variants) rather than be examined byte-by-byte.

 - is 'unstructuder' a typo?
 {{{#!cpp
         DATA, ///< This iteration returns unstructuder data
 }}}

 - The `Result` class: by convention, I think it's better to name
   getter methods "getXXX()" (e.g., getType() instead of type()).

 - Aside from the previous point, I think 'label()' should be renamed
   to something like 'labels' or '(get)LabelSequence' because 'label'
   sounds like a single label.

 - `Result::operator bool()`: admittedly this might be a matter of
    taste (though in my understanding it's generally supported by
    literatures), I personally would like to avoid using type
    conversion operator, if its only purpose is to allow a shortcut
    notation like:
 {{{#!cpp
         RdataReader::Result field;
         while ((field = reader.next())) {
 }}}
    This could be written without relying on the type conversion
    operator, e.g., as follows:
 {{{#!cpp
         while (true) {
             const RdataReader::Result field = reader.next();
             if (field.type() == RdataReader::END) {
                 break;
             }
 }}}
    (and we could even integrate it into the switch if a switch of
    field type follows).  While the latter may seem to be less concise,
    it has other advantages like using 'field' as a const and avoiding
    calling the default constructor or explicit assignment (which is
    faster as I showed above, and will prevent making some member
    variables const which could be so otherwise).  And, even if we
    don't consider it an advantage, I'd personally be happy if we can
    avoid pitfalls with implicit type conversion.

 - I'd replace this:
 {{{#!cpp
         /// \brief The domain label.
         ///
         /// This holds the domain label. It is only valid if type() ==
 NAME.
 }}}
   as follows:
 {{{#!cpp
         /// \brief The domain name.
         ///
         /// This holds the domain name in the form of \c
 dns::LabelSequence.
         /// It is only valid if type() == NAME.
 }}}
   mainly because "label" sounds like a single label.

 - `RdataReader` constructor: as suggested above, I'd change the type
   of 'data' to `const void*`.  Then, I'd use reinterpret_cast for
   lengths_:
 {{{#!cpp
     lengths_(reinterpret_cast<const uint16_t*>(data)),
 }}}
   In this case, we are in fact doing dangerous conversion from
   an arbitrary pointer to uint16_t* (when it's used in `RdataSet`, we
   use the region immediately after the object for the encoded data
   by reinterpreting `RdataSet*` to, e.g., uint16_t*).  The caller may
   even be careless about data alignment and the `data` pointer can
   indeed be bogus as a uint16_t pointer depending on the architecture.
   We know all this, and still deliberately decided to use the risky
   conversion trading the safety with efficiency.  For such cases, IMO
   we should be honest about what we are doing, and by doing so, we can
   indicate readers of the code that it should be handled with an extra
   care.
 - Same for `data_`.
 - nextSig(): maybe marginal or the compiler internally does it, but we
   might define a variable for the length_ index to avoid the
   computation twice:
 {{{#!cpp
         const size_t length_idx = var_count_total_ + sig_pos_;
         const Result result(sigs_ + sig_data_pos_, lengths_[length_idx]);
         // Move the position of iterator.
         sig_data_pos_ += lengths_[length_idx];
 }}}

 '''rdata_serialization_unittest'''

 - It's not really clear to me why we need to keep
   `ManualDecoderStyle`.  Is there anything essentially missing that
   can only be tested here?

 - It's safer to avoid statically initialize dummy_name2:
 {{{#!cpp
 // Used across more classes and scopes. But it's just uninteresting
 // constant.
 const Name dummy_name2("example.com");
 }}}
   (it's not just a theoretical concern; we in fact previously had a
   static initialization fiasco with a static name object in a unit
   test)

 - this comment (for additionalRequired) is not correct any more:
 {{{#!cpp
     // We'll pass it to renderNameField to check the stored attribute
 matches
     // our expectation.
 }}}

 - `RdataEncodeDecodeTest`: I'd clarify the requirement to
   `DecoderStyle`.  At least it seems to be expected to have a
   'decode()' method, so I'd note that fact and what properties
    this method should meet.  If there are other requirements, this
   should be documented too.

 - do you actually mean 'rewound' by 'rewund'?
 {{{#!cpp
 // Decode using reader with the return value of next, but after the reader
 // was used and rewund.
 class RewundDecoder {
 }}}

 - NextDecoder::decode() and RewundDecoder::decode() are mostly
   identical.  I'd combine them.

 - addSIGRdataOnly test should use the reader (whether or not we keep
   the brute force version).  maybe also for encodeLargeRdata.

 - readerResult: we cannot pass a temporary `Name` object to
   `LabelSequence` constructor (for 'compressible').  I've fixed it
   with other trivial fixes.

-- 
Ticket URL: <http://bind10.isc.org/ticket/2096#comment:15>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development