BIND 10 #2375: implement MasterLexer getNextToken() and ungetToken()
BIND 10 Development
do-not-reply at isc.org
Fri Nov 16 14:27:41 UTC 2012
#2375: implement MasterLexer getNextToken() and ungetToken()
------------------------------------------+---------------------------------------
                   Reporter:  jinmei      |                Owner:  jelte
                       Type:  task        |               Status:  reviewing
                   Priority:  medium      |            Milestone:  Sprint-20121120
                  Component:  libdns++    |           Resolution:
                   Keywords:              |            Sensitive:  0
            Defect Severity:  N/A         |          Sub-Project:  DNS
Feature Depending on Ticket:  loadzone-ng |  Estimated Difficulty:  5
        Add Hours to Ticket:  0           |          Total Hours:  0
                  Internal?:  0           |
------------------------------------------+---------------------------------------
Comment (by jinmei):
Hmm, I just noticed Jelte picked up the review task, but I've
completed my own review (I'm not sure why I didn't own it for the
review...), so I'm dumping it:
I have a few major issues and some relatively minor comments on the
actual implementation.
'''missing other signature of getNextToken'''
It's probably not clear from the ticket's description, but it was
actually intended to implement another version of getNextToken
(I thought it was mentioned in the main ticket of `MasterLexer`).
It's the equivalent of BIND 9's isc_lex_getmastertoken(). This can wait,
however, until we implement the rdata extensions to support the generic
lexer; in that case we need to create a ticket for it.
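For reference, a possible shape for that other variant, modeled on
isc_lex_getmastertoken() (a sketch only; the token class name and the exact
parameter list are assumptions and would be settled in that follow-up ticket):
{{{#!cpp
// Sketch: get the next token and require it to be of the given type,
// signaling an error otherwise.  eol_ok would control whether
// end-of-line/end-of-file is acceptable in place of the expected type.
const Token& getNextToken(Token::Type expect, bool eol_ok = false);
}}}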
'''about error/exception handling'''
First, I suspect we cannot feasibly revert to the previous token. In
the case of a string-variant token, the token refers to an internal
placeholder vector of `MasterLexer` (see #2373). That vector is always
cleared when a new getNextToken() session (for the string variants)
starts, so the content of `previous_token_` may already be invalid by
the time it's used in ungetToken().
We might even be able to make a deep copy in `previous_token_`, but I
personally don't see the need for it. At least BIND 9's ungettoken()
doesn't provide this feature, so we won't need it, at least for our
work on `MasterLoader`. Personally, I would simply say "the returned
token is only valid until the next call to `getNextToken()`; if the
caller needs to refer to older token values, it must make its own copy
of the token value".
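In that case, the caller-side contract would look something like this (a
sketch only; the getString() accessor and the exact getNextToken() signature
are assumptions based on the interface from #2373):
{{{#!cpp
#include <dns/master_lexer.h>
#include <sstream>
#include <string>

// Sketch: copy whatever you need out of the token before the next
// getNextToken() call; the token object itself may then become invalid.
std::stringstream ss("example.org 3600");
isc::dns::MasterLexer lexer;
lexer.pushSource(ss);

const std::string saved_text = lexer.getNextToken().getString(); // own copy
lexer.getNextToken();   // may invalidate the previously returned token,
                        // but 'saved_text' remains usable
}}}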
Secondly, I don't think we need to catch exceptions (and recover from
them) in `getNextToken`. The only exception of our own that this method
can throw is `ReadError` (from the input source); the other possible
exceptions come from the standard library (and possibly from Boost),
such as std::bad_alloc, if that's not the only one.
I suspect these are sufficiently critical that it wouldn't make sense
to try to continue parsing after seeing them. In fact, BIND 9's
equivalent of `ReadError` is treated as a fatal error, and it
terminates the entire parsing procedure even in the "lenient" mode.
As far as I can see, this still provides the basic ("weak") exception
guarantee in that it won't leak any resources or cause catastrophic
disruption such as a segfault in subsequent use (including a call to
the destructor) of the class.
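If we go that way, the contract could simply be documented on getNextToken()
itself; a sketch only (the exact signature, the default option value, and the
wording are assumptions, not a proposed final text):
{{{#!cpp
/// \brief Get the next token from the input source.
///
/// \throw ReadError when reading from the input source fails.  Only the
///     basic ("weak") guarantee is provided: nothing leaks and the object
///     can still be destroyed safely, but parsing should not be continued.
/// \throw std::bad_alloc when internal memory allocation fails.
const Token& getNextToken(Options options = NONE);
}}}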
Thirdly (probably a minor point), the BIND 9 version of ungettoken
doesn't reset last_was_eol. I'm not sure whether that is intentional,
some kind of bug in BIND 9, or whether it simply doesn't matter, but
please check whether the difference causes any trouble.
You may also want to ask about this on bind10-dev.
'''on the use of fake state class'''
First off, I don't see the need for such a tricky setup to test
`getNextToken` and `ungetToken`. In particular, `getNextToken` is
basically a simple method; we only need to check that it runs the state
class objects from start to finish and returns the token that is set
in `impl_->token_` when the method ends. Its interaction with the input
source is hidden within the state classes, and I don't see the need for
testing it here (in fact, such separation is one major reason for
extracting the state transitions into separate classes).
In the case of `ungetToken`, it may be a bit trickier due to the call
to `impl_->source_->ungetAll()`, but we should be able to gain
reasonable confidence by checking that we get the same token again
after doing `ungetToken` and then `getNextToken`. And, actually, the
current tests using the fake states don't seem to benefit from anything
specific to the fakes on this point anyway.
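For example, something along these lines should give reasonable confidence
(a sketch only; the exact getNextToken() signature and the token's
getString() accessor are assumptions):
{{{#!cpp
#include <dns/master_lexer.h>
#include <gtest/gtest.h>
#include <sstream>
#include <string>

TEST(MasterLexerTest, ungetThenGetAgain) {
    isc::dns::MasterLexer lexer;
    std::stringstream ss("example.org 3600 IN A 192.0.2.1");
    lexer.pushSource(ss);

    // Read one token, push it back, and confirm we get the same one again,
    // which also implies the input source was rewound by ungetAll().
    const std::string first = lexer.getNextToken().getString();
    lexer.ungetToken();
    EXPECT_EQ(first, lexer.getNextToken().getString());
}
}}}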
We provide a way for tests to peek into some of the internal states of
`MasterLexer`, so even without a fake state class we should be able to
test the higher level behavior of `MasterLexer` sufficiently.
I'm also concerned about the way we introduce the fake state classes:
adding them to the production implementation file (master_lexer.cc) and
the public header file (master_lexer.h). Admittedly, there are already
some interfaces that are only supposed to be used by tests, but those
are basically simple read-only getters, and their existence is hidden
in master_lexer_state.h (which is not supposed to be installed or
visible to library users). Defining a protected method on a public
class could send the wrong (or at least not originally intended) message
that the class is conceptually meant to be inherited from, and by
defining the fakes in a production .cc we have to build a non-trivial
amount of code that is never used in production.
So, I'd first like to suggest considering testing the lexer class using
the existing state classes and the already available inspection hooks. If
you still want to fully control the state transitions in tests, I'd
like to make that happen without introducing a derived class of the lexer
or putting fake definitions in the production code. For example, we could
extend the constructor of `MasterLexer` so it can take a pointer to the
"start" function:
{{{#!cpp
// A pointer to the "start" function; if non-NULL the lexer uses it,
// otherwise it uses the built-in state classes.  Clarify in the
// documentation that the non-NULL usage is intended for tests only.
typedef const master_lexer_internal::State* (*start_fn_t)(Options);
MasterLexer(start_fn_t start_fn = NULL);
}}}
and define the necessary fake classes in the test code (which would only
use read-only accessors of the lexer).
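As a rough illustration of that idea (a sketch only; the State interface
details, such as the exact handle() signature and the header layout, are
assumptions), the test file could then define something like:
{{{#!cpp
#include <dns/master_lexer.h>
#include <dns/master_lexer_state.h>   // not installed, but visible to tests

using isc::dns::MasterLexer;
using isc::dns::master_lexer_internal::State;

namespace {
// A fake "start" state defined entirely in the unit test file.
class FakeStart : public State {
public:
    virtual void handle(MasterLexer& /*lexer*/) const {
        // drive the lexer to a predetermined token here, e.g. via the
        // read-only test hooks
    }
};
const FakeStart fake_start;

// The "start" function passed to the MasterLexer constructor in tests.
const State*
fakeStart(MasterLexer::Options /*options*/) {
    return (&fake_start);
}
}

// In a test body:
//     MasterLexer lexer(fakeStart);
//     lexer.getNextToken();   // would run FakeStart::handle() instead of
//                             // the real states
}}}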
Below are specific comments on the actual implementation. Some may
become irrelevant depending on how we deal with the design issues, but
I'm dumping them anyway.
'''master_lexer.h/cc'''
- we should probably unify NOT_STARTED and NO_TOKEN_PRODUCED. The former
is only used as an initial placeholder value.
- getNextToken(): the behavior on atEOF() is different from that of
the BIND 9 version:
{{{#!c
if (isc_buffer_remaininglength(source->pushback) == 0 &&
    source->at_eof)
{
        if ((options & ISC_LEXOPT_DNSMULTILINE) != 0 &&
            lex->paren_count != 0) {
                lex->paren_count = 0;
                return (ISC_R_UNBALANCED);
        }
        if ((options & ISC_LEXOPT_EOF) != 0) {
                tokenp->type = isc_tokentype_eof;
                return (ISC_R_SUCCESS);
        }
        return (ISC_R_EOF);
}
}}}
(Note: in our implementation we can assume LEXOPT_DNSMULTILINE and
LEXOPT_EOF are effectively always enabled.)
- what if (or what should happen if) we call ungetToken() after
reaching EOF? Is this case tested?
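For instance, a small test could pin that behavior down, whatever we decide
it should be (a sketch only; the token type and accessor names are
assumptions, and the final expectation is a placeholder):
{{{#!cpp
#include <dns/master_lexer.h>
#include <gtest/gtest.h>
#include <sstream>

TEST(MasterLexerTest, ungetAfterEOF) {
    isc::dns::MasterLexer lexer;
    std::stringstream ss("example.org");
    lexer.pushSource(ss);

    // Consume everything up to and including the EOF token.
    while (lexer.getNextToken().getType() !=
           isc::dns::MasterLexer::Token::END_OF_FILE) {
    }
    // Assert whatever behavior we choose here, e.g.
    // EXPECT_NO_THROW(lexer.ungetToken()); or an expected exception.
    lexer.ungetToken();
}
}}}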
'''master_lexer_state.h'''
- this doesn't parse:
{{{#!cpp
/// parentheses count is changed accordingly a the last EOL condition
/// set if provided.
}}}
maybe "and the last EOL condition is set if provided"?
'''master_lexer_unittest.cc'''
- this might be safe depending on what happens in handle(), but I'd
protect `state` with a `scoped_ptr`:
{{{#!cpp
const State* state(State::getFakeState(NULL, 0, &token_));
state->handle(*this);
delete state;
}}}
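That is, something along these lines (assuming getFakeState() allocates the
state with new, which the existing delete implies):
{{{#!cpp
#include <boost/scoped_ptr.hpp>

// The scoped_ptr releases the fake state even if handle() throws.
boost::scoped_ptr<const State> state(State::getFakeState(NULL, 0, &token_));
state->handle(*this);
}}}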
- ungetSimple: it's not clear to me how this verifies "even the source
got back to original":
{{{#!cpp
// By calling getToken again, we verify even the source got back to
// original. We must push it as a fake start again so it is picked.
}}}
--
Ticket URL: <http://bind10.isc.org/ticket/2375#comment:7>
BIND 10 Development <http://bind10.isc.org>