BIND 10 #2375: implement MasterLexer getNextToken() and ungetToken()
BIND 10 Development
do-not-reply at isc.org
Tue Nov 20 10:19:38 UTC 2012
#2375: implement MasterLexer getNextToken() and ungetToken()
-------------------------------------+-------------------------------------
            Reporter:  jinmei        |                 Owner:  jinmei
                Type:  task          |                Status:  reviewing
            Priority:  medium        |             Milestone:  Sprint-20121120
           Component:  libdns++      |            Resolution:
            Keywords:                |             Sensitive:  0
     Defect Severity:  N/A           |           Sub-Project:  DNS
 Feature Depending on Ticket:        |  Estimated Difficulty:  5
   loadzone-ng                       |           Total Hours:  0
 Add Hours to Ticket:  0             |
           Internal?:  0             |
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:12 vorner]:
> Sometimes, I wonder if you write more code or more trac comments :-P
In this particular case I had a lot of time to kill on my flight (and
writing code wasn't the best choice as I wanted to keep the battery
alive longer :-)
> Replying to [comment:7 jinmei]:
> > '''missing other signature of getNextToken'''
> >
> > It's probably not clear from the ticket's description, but it was
> > actually intended to implement another version of getNextToken
> > (I thought it was mentioned in the main ticket of `MasterLexer`).
> > It's the equivalent of BIND 9's isc_lex_getmastertoken(). This can
> > wait, however, until we implement the rdata extensions to support
> > the generic lexer. In that case we'll need to create a ticket for it.
>
> To clarify, so it should be a thin wrapper around the current
> getNextToken?
Yes (depending on the definition of "thin", now that the body of our
getNextToken() is also pretty small). The other version of
`getNextToken` would call the currently defined one with some
particular set of options, and handle the cases of an unexpected
EOL or an unexpected string when a number is requested.
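As a rough illustration, the wrapper could look something like this
(the signature, the option mapping and the error type below are my
assumptions for the sketch, not a committed design):
{{{#!cpp
MasterLexer::Token
MasterLexer::getNextToken(Token::Type expect, bool eol_ok) {
    // Map the expected token type to lexer options (assumption: QSTRING
    // is the only type that needs a special option).
    Options options = NONE;
    if (expect == Token::QSTRING) {
        options = QSTRING;
    }
    const Token token = getNextToken(options);
    if ((token.getType() == Token::END_OF_LINE ||
         token.getType() == Token::END_OF_FILE) && !eol_ok) {
        isc_throw(SomeLexerError, "unexpected end of input"); // hypothetical
    }
    if (expect == Token::NUMBER && token.getType() == Token::STRING) {
        isc_throw(SomeLexerError, "expected a number"); // hypothetical
    }
    return (token);
}
}}}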
> > Secondly, I don't think we need to catch exceptions (and recover from
> > it) in `getNextToken`. [...]
>
> I'm not sure. I could remove it, but I think it is better this way for
> two reasons:
> * The general approach of trying to be safe.
> * In case a big master file includes another one and the other one is
>   not readable for some reason, it might be good to continue parsing
>   (and reporting errors) at least for the big one. It is probably not
>   that important, because it would be a rather rare occasion, but I
>   don't think it would be that hard to implement either.
>
> So, I'd ask. Is there a disadvantage to handling the exceptions?
At least an explicit try-catch can be expensive (even in the normal
case where no exception is thrown), depending on how exception
handling is implemented in that environment (compiler/std C++ lib).
Secondly, try-catch complicates the code logic (just like any explicit
error handling does). If we need an explicit try-catch immediately
above its helper class, that seems to indicate a design problem to me,
because we've lost one big advantage of exceptions: deferring the
explicit error handling to a sufficiently high layer that has enough
context to tell how to deal with the error, keeping the rest of the
code simpler.
So, to me, the question to be asked is the opposite: whether there's
any advantage to handling the exception here.
And I don't see much in this case; it's rethrown anyway, so the upper
layer still needs to handle it properly, and I suspect it's not only
rare (as you noted above) but also quite hopeless to recover from. If
we encounter an IO error after successfully opening the file, it's
very unlikely that the next read from the stream/file will succeed.
So the only meaningful recovery action the upper layer
(`MasterLoader`) can take is to pop this source and try to continue.
But just calling ungetToken() doesn't ensure we get successfully back
to the state at the time we pushed this source.
As such, I'd still remove the try-catch. But as I wrote these
comments I realized that the main purpose of catching the exception
is to ensure `ungetToken` is called when an exception happens. As
already stated I doubt its effectiveness, but if we still want to
ensure it, we can do it with a guard object:
{{{#!cpp
namespace {
// RAII guard: calls ungetToken() on destruction unless release()d first.
class NextTokenGuard {
public:
    NextTokenGuard(MasterLexer* lexer) : lexer_(lexer) {}
    ~NextTokenGuard() {
        if (lexer_ != NULL) {
            lexer_->ungetToken();
        }
    }
    void release() { lexer_ = NULL; }
private:
    MasterLexer* lexer_;
};
}

MasterLexer::Token
MasterLexer::getNextToken(Options options) {
    ...
    NextTokenGuard guard(this);
    impl_->token_ = Token(Token::NO_TOKEN_PRODUCED);
    ...
    guard.release();
    return (impl_->token_);
}
}}}
Maybe you won't like it because the total amount of code in this
version is even bigger than just using try-catch. But at least it
simplifies the main code of `getNextToken()` and avoids the
performance issue under any try-catch implementation unless an
exception is actually thrown.
> > Thirdly (probably a minor point), the BIND 9 version of ungettoken
> > doesn't reset last_was_eol. [...]
>
> I don't know if it is a bug on the bind9 side, or this case was just
> never needed. But from the logical point of view, it should be reset,
> at least if the job description for ungetToken is to return to the
> state before the last token. I think that "peeking" at a token should
> be a safe and side-effect-free operation. If there's something that
> depends on the fact that ungetToken does not return the last-eol
> condition back in bind9, I think it is time to rethink the approach,
> because that's hard to understand and read. Such code is expected in
> perl, not in C++ ;-).
I agree that your version is more logical. But I don't think it wise
to just ignore the fact that the BIND 9 version has been working for
quite a long time against many very unusual or even broken inputs.
So, while I'm okay with resetting last_was_eol in the end, I suggest
you really think about any problematic cases that the difference
could cause, rather than just dismissing it "because it's more
logical". I'd also ask this simple question (why it doesn't
preserve/restore last_was_eol) at bind10-dev (some core BIND 9
developers subscribe to it). Actually, I suspect it's unlikely you'll
get an answer, but then we'd at least know there's no simple reason
for it, if any.
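Just to make the alternative concrete, the preserve-and-restore
variant we are discussing would be along these lines (the member names
here are hypothetical, not the actual internals of the branch):
{{{#!cpp
void
MasterLexer::ungetToken() {
    if (!impl_->has_previous_token_) {      // hypothetical flag
        isc_throw(isc::InvalidOperation, "No token to unget");
    }
    impl_->has_previous_token_ = false;
    impl_->token_restored_ = true;
    // The point of contention: BIND 9's isc_lex_ungettoken() leaves
    // last_was_eol untouched, while restoring the saved value here
    // makes ungetToken() a true undo of the last getNextToken().
    impl_->last_was_eol_ = impl_->previous_was_eol_;
}
}}}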
> I looked at the current master, which already contains the string
> states. There doesn't seem to be any state that returns something
> other than NULL from its handle. The first question therefore is, is
> this OK for tests or is it incomplete then? The second, if there's no
> such state, will there be? If not, could we simplify the design by a
> lot?
Hmm, I'm not sure how it relates to this discussion, but that's a good
question anyway. After clarifying and simplifying the state
transitions for our limited needs, we basically have only two cases:
- everything is completed within State::start(): start() sets the
token and returns NULL
- one transition from start() to another derived state: each such
state terminates the transition, and its handle() always sets the
token and returns NULL
This won't change when we complete the transitions with the NUMBER
state.
I'm not sure it simplifies the code, though. Instead of the for
loop, we could do this:
{{{#!cpp
const State* state = start(options); // or State::start(lexer, options)
if (state != NULL) {
    state->handle(*this);
}
}}}
or, we might introduce a null state class and write as follows:
{{{#!cpp
start(options)->handle(*this);
// or State::start(*this, options)->handle(*this);
}}}
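For completeness, the null state could be a trivial singleton whose
handle() is a no-op (a sketch; the names are made up):
{{{#!cpp
namespace {
class NullState : public State {
public:
    // Reaching this state means the token was already set in start(),
    // so there is nothing left to do.
    virtual void handle(MasterLexer&) const {}
};
const NullState NULL_STATE;
}
// start() would then return &NULL_STATE instead of NULL.
}}}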
> I'm not sure it can be reasonably tested by read-only accessors. The
> goal of the fake state is to change the internal state and mess
> around with it.
I know, but since `getNextToken()` itself doesn't rely on the internal
states much, I don't see the need for arbitrarily modifying the
states.
> > - we should probably unify NOT_STARTED and NO_TOKEN_PRODUCED. The
> >   former is only used as an initial placeholder value.
>
> I'm not sure. I was thinking about that, but then I decided not to.
> The situations are really different. The NOT_STARTED is part of the
> normal lifetime of the lexer. The NO_TOKEN_PRODUCED should actually
> never ever happen except for the short time in which the new token is
> being computed. Considering the enum values are cheap (the enum
> already takes a whole byte, so adding a few more changes nothing), I
> don't see a compelling reason to have them unified. I think that
> could be more confusing than having them separated.
But NOT_STARTED is also never expected to be visible outside
`MasterLexer`, and in that sense it seems quite similar to
NO_TOKEN_PRODUCED. But I don't insist on it strongly.
> > - getNextToken(): the behavior on atEOF() is different from that of
> >   the BIND 9 version:
> > {{{#!c
> > if (isc_buffer_remaininglength(source->pushback) == 0 &&
> >     source->at_eof)
> > {
> >     if ((options & ISC_LEXOPT_DNSMULTILINE) != 0 &&
> >         lex->paren_count != 0) {
> >         lex->paren_count = 0;
> >         return (ISC_R_UNBALANCED);
> >     }
> >     if ((options & ISC_LEXOPT_EOF) != 0) {
> >         tokenp->type = isc_tokentype_eof;
> >         return (ISC_R_SUCCESS);
> >     }
> >     return (ISC_R_EOF);
> > }
> > }}}
> > (Note: in our implementation we can assume LEXOPT_DNSMULTILINE and
> > LEXOPT_EOF).
>
> I'm probably lost here. I added a test to see the unbalanced
> parenthesis is reported (and it is). So, what other difference is
> there? I'm not really confident in bind9 code, I didn't write any
> bit of it O:-).
I didn't write any bit of lex.c and am not particularly familiar with
it either; I've read this implementation for the purpose of this
porting mostly as a first-time reader. So, in this context we
shouldn't be so different. And the difference between the two versions
on this point was pretty clear to me. This check takes place before
doing any state transition:
{{{#!c
if (isc_buffer_remaininglength(source->pushback) == 0 &&
    source->at_eof)
{
    if ((options & ISC_LEXOPT_DNSMULTILINE) != 0 &&
        lex->paren_count != 0) {
        lex->paren_count = 0;
        return (ISC_R_UNBALANCED);
    }
    if ((options & ISC_LEXOPT_EOF) != 0) {
        tokenp->type = isc_tokentype_eof;
        return (ISC_R_SUCCESS);
    }
    return (ISC_R_EOF);
}
}}}
and the equivalent part of your version is:
{{{#!cpp
if (impl_->source_ == NULL || impl_->source_->atEOF()) {
    isc_throw(isc::InvalidOperation, "No source to read tokens from");
}
}}}
so, in the BIND 9 version, if the caller calls "get (next) token"
after the input source reaches EOF, it returns an EOF token
(or an unbalanced error), while your version throws an exception
(which would indicate a bug in the caller).
When I made this comment I didn't know whether this causes any real
disruption - I assumed that was beyond the responsibility of a
reviewer. But now that we've had this discussion, I've taken a closer
look at it. Assume we are parsing the following (broken) RR in
"lenient" mode.
{{{
foo.example.com. IN TXT "incomplete qstring<EOF>
}}}
What we'd expect in this case is that the unbalanced qstring is
detected (resulting in an ERROR token), and that the next call to
getNextToken() returns EOF. If I read it correctly, BIND 9's
implementation works that way, but your version doesn't. It can be
confirmed by the following test:
{{{#!cpp
TEST_F(MasterLexerTest, eofAfterUnbalancedQstring) {
    ss << "\"unbalanced then EOF";
    lexer.pushSource(ss);
    EXPECT_EQ(MasterLexer::Token::ERROR,
              lexer.getNextToken(MasterLexer::QSTRING).getType());
    // We've reached EOF, and the next call to getNextToken() should
    // return it.
    EXPECT_EQ(MasterLexer::Token::END_OF_FILE,
              lexer.getNextToken().getType());
}
}}}
Another example is an unbalanced '(' followed by EOF. We first need
to detect the error and then detect EOF; see the test sketch below.
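Something like this (a sketch following the pattern of the test above;
I haven't run it):
{{{#!cpp
TEST_F(MasterLexerTest, eofAfterUnbalancedParen) {
    ss << "(";
    lexer.pushSource(ss);
    // The open parenthesis followed by EOF should first produce an
    // error token (UNBALANCED_PAREN), and only the next call should
    // report EOF.
    EXPECT_EQ(MasterLexer::Token::ERROR, lexer.getNextToken().getType());
    EXPECT_EQ(MasterLexer::Token::END_OF_FILE,
              lexer.getNextToken().getType());
}
}}}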
So, we should clearly do something, at least if we want to be
compatible with BIND 9.
One obvious way would be to add the same code logic as BIND 9's to
`getNextToken()`. But from a closer look, I guess we can probably
just stop throwing the exception; the EOF will be handled in the
start() phase. This way we can handle the above cases as intended
and avoid duplicate code (although we then cannot detect a truly
redundant call to `getNextToken()` after EOF). I also noticed that
if we fix the code that way, we probably won't need the atEOF()
method.
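In other words, the entry check of `getNextToken()` would shrink to
something like this (a sketch of the suggested fix, not tested code):
{{{#!cpp
MasterLexer::Token
MasterLexer::getNextToken(Options options) {
    if (impl_->source_ == NULL) {
        isc_throw(isc::InvalidOperation, "No source to read tokens from");
    }
    // No atEOF() check here: once the source is exhausted, the start()
    // phase itself produces the END_OF_FILE token (or UNBALANCED_PAREN
    // if we are still within parentheses).
    ...
}
}}}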
> > - this might be safe depending on what happens in handle(), but I'd
> >   protect state in scoped_ptr:
> > {{{#!cpp
> > const State* state(State::getFakeState(NULL, 0, &token_));
> > state->handle(*this);
> > delete state;
> > }}}
>
> I addressed this. But I didn't notice having `const` in my code. That
could
> suggest you did your usual constify run through the code, but forgot to
push.
> If it is the case, could you push to the original branch and I'd cherry-
pick it
> onto the current branch?
Ah, right. I've just pushed it. I also made another minor editorial
fix on the old branch, which has been pushed too.
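For reference, the scoped_ptr protection suggested in the quoted
comment would be along these lines (assuming getFakeState() returns a
heap-allocated state, as the original snippet implies):
{{{#!cpp
#include <boost/scoped_ptr.hpp>

const boost::scoped_ptr<const State> state(
    State::getFakeState(NULL, 0, &token_));
state->handle(*this); // the state is deleted even if handle() throws
}}}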
'''comments on the revised version of code'''
- `MasterLexerTest::getUnbalanced`: what we really need to test is
whether we can get an EOF after detecting `UNBALANCED_PAREN`;
see the discussion above.
- master_lexer_unittest.cc: in the following comment "getToken" should
be "getNextToken"?
{{{#!cpp
// By calling getToken again, we verify even the source got back to
// original (as the second fake state checks it gets "234"). We must
// push it as a fake start again so it is picked.
}}}
also, maybe "fake start" should be "fake state"? (It still doesn't
quite parse for me, although I think I can understand what it's
trying to say.)
'''other general things'''
- maybe I forgot to comment on this: I wonder if it would be better to
return a `const MasterLexer::Token&` referring to the internal
`token_` from `getNextToken()`. While copying a `Token` is probably
not very expensive in this case, I don't see any need to make a copy
here anyway (in the case of a string, the copied `Token` object holds
a reference to a placeholder within `MasterLexer`, which will be
quite ephemeral anyway).
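Concretely, the signature change I have in mind is just (a sketch):
{{{#!cpp
const MasterLexer::Token&
MasterLexer::getNextToken(Options options) {
    ...
    return (impl_->token_); // a reference to the lexer's internal token
}
}}}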
--
Ticket URL: <http://bind10.isc.org/ticket/2375#comment:13>
BIND 10 Development <http://bind10.isc.org>