BIND 10 #2252: Implement counters into Xfrin (1/3)

Fri Apr 5 04:46:05 UTC 2013

#2252: Implement counters into Xfrin (1/3)
-------------------------------------+-------------------------------------
            Reporter:  naokikambe    |                        Owner:
                Type:  enhancement   |  jinmei
            Priority:  medium        |                       Status:
           Component:  xfrin         |  reviewing
            Keywords:                |                    Milestone:
           Sensitive:  0             |  Sprint-20130423
         Sub-Project:  DNS           |                   Resolution:
Estimated Difficulty:  5             |                 CVSS Scoring:
         Total Hours:  0             |              Defect Severity:  N/A
                                     |  Feature Depending on Ticket:
                                     |          Add Hours to Ticket:  0
                                     |                    Internal?:  0
-------------------------------------+-------------------------------------
Changes (by naokikambe):

 * owner:  naokikambe => jinmei

Comment:

 Hello, sorry for my late response.

 Replying to [comment:17 jinmei]:
 > Replying to [comment:16 naokikambe]:
 >
 > > > It's more about design matter so the harm may not be so obvious or
 > > > even a subjective matter.  I'm okay with excluding it for this
 ticket,
 > > > but I'd argue this should be at least discussed (and if we agree)
 > > > resolved before using the current design further.  So I'd suggest
 > > > creating a ticket to discuss and fix it, and do it before the
 > > > subsequent counter tickets for xfrin or others.
 > >
 > > I'm okay with redesigning if the current library has some problem. But
 what ticket should I create? What contents should I write in the ticket?
 Sorry, I couldn't understand well about problems. Could you create it
 yourself?
 >
 > I've created one: #2883.  We might also discuss it in the next team
 > call.

 Thank you for creating. Yes, I'd like to understand well what problem is
 and how we should improve it somewhere. I think it might be appropriate
 for a topic of the face-to-face meeting to discuss.

 >
 > > > '''b10-xfrin.xml'''
 > > >
 > > > - about latest_*xfr_duration: I missed the currently ongoing one
 > > >   doesn't show up in the statistics.  In that case
 > > >   "last_*xfr_duration" would be better than latest.
 >
 > Note: I had agreed last_*xfr_duration would be better, but they are
 > not changed.

 Sorry, I missed to read it before. I've revised. Please see this.
 {{{
 678965b [2252] change to better names; s/lastest/last/i
 }}}

 >
 > > > > > - as for time_to_ixfr, what if there's never been an xfr for the
 zone?
 > > > > It shows still the default value, 0.0.
 > > >
 > > > It seems to be suboptimal in that using a magic number (especially
 in
 > > > the valid range - 0.0 should still be possible if everything is done
 > > > super quickly) to present a kind of "None" is always a source of
 > > > trouble.  I'd consider something that can explicitly mean "no
 > > > completed xfrs".  And, especially if we use a magic value, document
 it
 > > > somewhere.
 > >
 > > I updated descriptions about the default value `0.0`. Could you see
 the following commit?
 > > {{{
 > > 26a9ee9 [2252] update descriptions of latest_ixfr_duration and
 latest_axfr_duration about their default values
 > > }}}
 >
 > In that case this description isn't accurate.  It should be something
 > like:
 >
 > {{{
 >                     <listitem><simpara>
 >                       Duration of the latest IXFR in seconds. 0.0 means
 it was
 >                       shorter than 0.05 seconds or there was no
 >                       successful IXFR.
 >                     </simpara></listitem>
 > }}}
 >
 > and, as this description suggests, it's generally a bad idea to
 > overload a specific value that way; such ambiguity will eventually
 > bite us(ers).  Better approach is to actually revise the code to
 > eliminate the ambiguity.  Do you still think the overloading is the
 > way to go?

 I believe the timer has microsecond accuracy on the most platforms. So I
 think it is not 0.05 but 0.000001. I've update description at the
 following commit according to this.
 {{{
 ff8d95f [2252] update descriptions
 }}}
 However I don't think this could be so confusing for users. Because I
 think users could see whether a real XFR is started from other counters
 like axfrreqv4 or ixfrreqv4.
 >
 > > > > > '''xfrin.py.in'''
 > > > > >
 > > > > > - get_ipver_str: I suggest making it "protected" by renaming
 > > > By "clarify" I meant update the docstring too.
 > >
 > > I updated the docstring. Could you see this?
 > > {{{
 > > ad8e7ea [2252] update docstring of _get_ipver_str() about accessing
 > > }}}
 >
 > I further clarified it can also be used in tests.

 Thank you. It looks good.
 >
 > > > > > - `_check_soa_serial`: shouldn't this be distinguished per name
 and
 > > > > >   class?
 > [...]
 > > > I'm okay with differing it, but it seems to be a quite fundamental
 > > > design error.  So I'd argue that should be fixed before we rely on
 the
 > > > current format further.  The longer we postpone it, the more painful
 > > > it is to fix.
 > >
 > > The current data structure is already implemented in Xfrout. I think
 we should repair altogether at a time in future.
 >
 > These should be all fixed, yes, but if you mean by "in future", we
 > still keep using it, I'd be opposed to it.  IMO this level of design
 > defect should be corrected sooner than later.  Just as the fact of
 > xfrout indicates, if we rely on the flawed design more, fixing them
 > will simply be more and more painful.  For this particular ticket, I
 > wouldn't block due to this, but I don't like to pick up any other
 > counter tickets until this issue is fixed.
 >
 > I created a ticket: #2884.

 Thank you for creating the ticket. I meant that "in future" was "in near
 future" at that time. Sorry for confusion. I also think this issue is
 serious. So shall we start #2884 after this ticket? Anyway, according to
 this, I've updated description for avoiding confusion from users. Could
 you see the following commit?
 {{{
 20df0b0 [2252] update descriptions of zonename
 }}}

 >
 > > > > > - do_xfrin: stop_timer wouldn't be called if exception is raised
 > > > > >   during the process.
 > > > > I think it is okay that the timer count would be discarded if
 exception is raised.
 > > >
 > > > Is it really discarded?  Shouldn't it be left until next time we set
 > > > the timer for the same zone?   If it's left, that doesn't seem to be
 > > > clear.  Also, if the intent is to only count the duration of
 > > > successfully completed transfers (even excluding the case of
 > > > `XfrinZoneUptodate`), I believe that should be documented.
 > >
 > > Strictly speaking, that would be discarded when the next start time is
 set. I guess the timer wouldn't probably started yet in case of
 `XfrinZoneUptodate`. I described about discarding at the following commit.
 > > {{{
 > > dc2960e [2252] add notes about the case that the timer is already
 started but not yet stopped
 > > }}}
 >
 > By "document", I (probably ... not quite remember after some period
 > since last comment cycle) meant the man page or something, not (only)
 > the code comment.

 I also wrote a comment on the man page. Could you see the commit?
 {{{
 4d5c20e [2252] update descriptions of last_*xfr_duration
 }}}

 >
 > I'm not sure if I don't understand this: "I guess the timer wouldn't
 > probably started yet in case of `XfrinZoneUptodate`".  You mean it
 > should have come from `_check_soa_serial`?  But that can also happen
 > in IXFR.

 Yes, I misread the code. The condition can be changed in case of IXFR as
 you mentioned.

 >
 > As for dc2960e, shouldn't this be "not yet stopped"?
 >
 > {{{
 > +        support multi-threaded use. If the specified timer is already
 > +        started but not yet started, the last start time is
 > +        overwritten."""
 > }}}

 Yes, thank you for correcting. I've updated.
 {{{
 2e97db1 [2252] correct a wrong word
 }}}

 >
 > > > > > - command_handler: why do we bother to log it here?
 > [...]
 > > I generalized the log id not to be specific to the `getstats` command.
 Could you see the following commit?
 > > {{{
 > > b0f2aca [2252] replace the log message XFRIN_RECEIVED_COMMAND to the
 more general id XFRIN_RECEIVED_COMMAND with the com
 > > }}}
 >
 > This change itself looks good.  But please reorder the log messages
 > (use the tool under bind10/tools).

 Thank you. I've reordered the messages. Please see.
 {{{
 307e3c7 [2252] reorder by tools/reorder_message_file.py
 }}}
 >
 > But, while working on #1938 I (happen to) noticed it had a more
 > fundamental problem in the context of using it in lettuce.  The
 > corresponding lettuce test reads:
 >
 > {{{
 >     When I query statistics zones of bind10 module Xfrin with cmdctl
 >     wait for new bind10 stderr message STATS_SEND_STATISTICS_REQUEST
 >     wait for new bind10 stderr message XFRIN_RECEIVED_COMMAND
 >     last bindctl output should not contain "error"
 >
 >     When I query statistics zones of bind10 module Xfrin with cmdctl
 > }}}
 >
 > This is not reliable, because just seeing XFRIN_RECEIVED_COMMAND
 > doesn't guarantee the stats daemon received the answer (on the other
 > hand, checking STATS_SEND_... should be unnecessary because if it
 > didn't happen we wouldn't see XFRIN_RECEIVED_COMMAND; further, I
 > suspect it could even cause failure, because log message order from
 > different processes is not predictable).
 >
 > What we actually need is something like
 > "STATS_STATISTICS_REQUEST_DONE" and check that.
 >
 > Note also that #1938 has been merged, and the meaning of "wait for
 > new" was clarified in a possibly non compatible way.

 That's right. I've removed waiting for 'STATS_SEND_STATISTICS_REQUEST'.
 But I didn't introduce 'STATS_STATISTICS_REQUEST_DONE' as you mentioned.
 Because even if we introduced it, we cannot identify whether b10-stats
 received statistics from b10-xfrin. b10-stats also asks other modules like
 b10-init and b10-auth. I've already pulled the newer master onto trac2252
 at 836fabb. That must include the changes which you pushed on #1938.

 >
 > > > > > '''xfrin_test.py'''
 >
 > I've reviewed the entire change in the branch from the branch point.
 > The latest version looks much better, but it still seems to have some
 > issues:
 >
 Thank you. Anyway first of all,

 > - the case of test_do_soacheck_uptodate makes me wonder why xfrsuccess
 >   is counted even if there's no xfr transaction.  Is that okay?  (What
 >   BIND 9 does in this case?)  If that's by design, I suspect its
 >   description has to be updated:
 > {{{
 >                     <listitem><simpara>
 >                       Number of zone transfer requests succeeded
 >                     </simpara></listitem>
 > }}}
 >  (since there's actually not even a transfer request in this case).

 I found I misunderstood these counters by your comment. We should count
 axfrreqv4/6 or ixfrreqv4/6 before sending out an XFR request. After that
 we should count up xfrsuccess or xfrfail depending on the result of the
 query. I've applied this correction to the code. Please see this for
 details.

 {{{
 5c0c1cb [2252] change positions for counters to be counted
 4a8316d [2252] update the test cases due to the previous change
 }}}

 > - the tests for the durations do not always succeed (for me).  in
 >   general, it's not really good to rely on the real world clock in
 >   unit tests because it's too fragile.  (see also the next point)

 That's right. I've updated the code to check a duration time by a fixed
 value. Please see this.

 {{{
 e346959 [2252] update datetime to always return fixed seconds
 }}}

 > - AXFR+xfrfail case doesn't seem to be tested.  Combining the first
 >   and this point, I think it's easier to test if we mock many of the
 >   methods used in do_xfrin, and for durations only check
 >   start_timer/stop_timer are called at the correct point with the
 >   correct parameters (not checking the resulting values).

 As for this, I've included the above commits. I think a case for both AXFR
 and xfrfail is included in the updated code.

 > - some class attributes (ipver, name_to_counter, master_addrinfo),
 >   are introduced while they are essentially instance attributes.
 >   using class attributes when they don't have to be so is generally
 >   not a good practice because change in one test instance could affect
 >   another test.  (This is also related to the issue that led to
 >   #2883).  I suggest making them instance attributes.

 Hmm..., sorry, I cannot still understand what you worry about so much. For
 example, if we intentionally change the class attribute 'ipver' to 'v7'
 inside of the test case like the following:
 {{{#!patch
 diff --git a/src/bin/xfrin/tests/xfrin_test.py
 b/src/bin/xfrin/tests/xfrin_test.py
 index ecd2bf5..c17741f 100644
 --- a/src/bin/xfrin/tests/xfrin_test.py
 +++ b/src/bin/xfrin/tests/xfrin_test.py
 @@ -2186,6 +2186,7 @@ class
 TestStatisticsXfrinAXFRv4(TestStatisticsXfrinConn):
      def test_soaout(self):
          '''tests that an soaoutv4 or soaoutv6 counter is incremented
          when an soa query succeeds'''
 +        self.ipver = 'v7'
          self.conn.response_generator = self._create_soa_response_data
          self._check_init_statistics()
          self.assertEqual(self.conn._check_soa_serial(), XFRIN_OK)

 }}}
 Then could this change unexpectedly affect to other test cases? In this
 case I can see it affects only the two related cases:
 `TestStatisticsXfrinAXFRv4.test_soaout()` and
 `TestStatisticsXfrinAXFRv6.test_soaout()`. But other cases definitely
 refer to the changed attribute. Can you crash a test code by changing a
 attribute of the test case class? Maybe I have less ability to code
 python. Please correct me if I misunderstand this.

 > - the intent of each test (such as `test_do_soacheck_uptodate`) is not
 >   so obvious.  Please add docstring or comment.

 That's right, the case names were inappropriate. I've revised them and
 their docstring. Please see this.
 {{{
 a4b1c8d [2252] update case names and their docstrings according to the
 tested counters
 }}}

 >
 > > > '''lettuce tests'''
 >
 > - Just wonder: are the changes in test_spec3.spec necessary?

 Not necessary. Keeping same content makes no sense, but I leave a change I
 added before.

 >
 > > > - First off, test failed.  I thought it might be related to the
 issue
 > > >   of #2790, but at least this one doesn't seem to be related to it:
 > [...]
 >
 > > Even if I didn't apply the Jelte's patch, the related tests passed on
 my environment.
 >
 > That one is not relevant to Jelte's patch.
 >
 > > And I haven't changed these thresholds since the first time the codes
 could be reviewed. Did that always fail on your environment? Can I see the
 full logs of lettuce tests and stderr? Depending on the results, we might
 need to reconsider the thresholds.
 >
 > I've attached the stderr log files and the output from lettuce when it
 > failed.  Note that there are two issues:

 Thank you. By checking that, I've adjusted soaoutv4 and soaoutv6. There
 seems to be a problem related to #2879 as you mentioned below.

 >
 > - statistics values are not the expected ones.  I suspect it's related
 >   to the "STATS_STATISTICS_REQUEST_DONE" issue I explained above.  And
 >   this is not related to Jelte's patch.

 I didn't introduced it as I mentioned above.

 > - this synchronization failed
 > {{{
 >     Then wait for new bind10 stderr message XFRIN_XFR_TRANSFER_STARTED
 > }}}
 >   even with Jelte's patch.  On master, I've never seen this happen if
 >   that patch is applied.

 Hmm..., sorry, I'm not sure about this. I have merged the newer master on
 to the branch. Could you try lettuce test again? If you still have same
 problem, please let me know.

 >
 > > > - check_statistics_items: it seems that the docstring has to be
 > > >   updated.
 > >
 > > I updated the docstring. Could you see the following commit?
 > > {{{
 > > be426ec [2252] update description of check_statistics_items() about
 min_value and max_value
 > > }}}
 >
 > It says 5 columns but only mention 4.
 >
 > {{{
 >     the scenario. The multiline part has at most 5 columns: item_name,
 >     item_value, min_value, and max_value. item_name is a relative name
 >     to category. item_value is an expected value for
 > }}}

 Thank you for correcting. I've revised the wrong number.

 {{{
 4f63e23 [2252] correct the wrong number of columns to be mentioned
 }}}

 >
 > > > - check_init_statistics: if it's xfr specific, it seems it should be
 > > >   named accordingly.  Also, it seems to have been somewhere other
 than
 > > >   bind10_control.py (from a quick look it's for more generic
 modules;
 > > >   xfrs are too specific in that sense).
 > >
 > > I generalized it not to be specific to Xfrin nor Xfrout. Could you see
 the following commits?
 > > {{{
 > > c37fd56 [2252] generalize the module name as an argument of
 check_init_statistics()
 > > 0eec3b5 [2252] add a port number of cmdctl as an option of
 check_init_statistics()
 > > }}}
 >
 > This part still seems to be DNS specific:
 >
 > {{{
 >     step.given(notcontain_str % 'example.org.')
 > }}}
 >
 > It should probably be parameterized.

 I've parametrized it. Please see this.
 {{{
 1f578c9 [2252] parameterize for the domain name not to be contained in
 queried statistics
 }}}

 >
 > > > - it's not obvious why these are specified with "min".  please add
 > > >   comments on why.
 >
 > > > - same for this:
 > > > {{{
 > > >       | item_name              | min_value | max_value |
 > > >       | _SERVER_.soaoutv4      |         1 |         3 |
 > > >       | _SERVER_.axfrreqv4     |         1 |         3 |
 > > > ...
 > > > }}}
 > >
 > > On my environment, they cannot be expected to a specific value. They
 seem to depend on timing and environment. I added notes about this. Please
 see the following commit?
 > > {{{
 > > 14ee53d [2252] add notes about counters of Xfrin: soaoutv4, axfrreqv4,
 soaoutv6, axfrreqv6, and xfrfail
 > > }}}
 >
 > I suspect this is due to #2879.  Please confirm that from logs, and
 > try modifying `isc.notify.notify_out._MAX_NOTIFY_TRY_NUM` to 1 to see
 > if the number is now predictable (I believe it should be exactly 1).
 > If these are confirmed, please make more accurate comment in the
 > lettuce feature and add comments to #2879 so when it's done we should
 > be able to clean up this test.

 I think so too. If I set _MAX_NOTIFY_TRY_NUM to 1, then we can expect
 these counters to be exactly 1. I suspect that's why checking soaoutv6
 failed on your environment. According to your uploaded log(lettuce.log),
 soaoutv6 was expected to count 1 but actually count 2. I've added notes on
 both the code and #2879. For the notes, please see this.

 {{{
 9d00667 [2252] update the note to more accurate one due to bug #2879
 }}}

 BTW on my environment, checking initial statistics of Xfrout sometimes
 failed. I suspect this is because a socket of Xfrout is already opened
 before b10-stats asks to b10-xfrout. So I added exceptional cases to
 checking. Could you see this change?

 {{{
 d7885ec [2252] add an exceptional case when checking initial statistics of
 Xfrout
 }}}

 Regards,

-- 
Ticket URL: <http://bind10.isc.org/ticket/2252#comment:19>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development