BIND 10 #2252: Implement counters into Xfrin (1/3)

BIND 10 Development do-not-reply at isc.org
Wed Mar 27 03:46:44 UTC 2013


#2252: Implement counters into Xfrin (1/3)
-------------------------------------+-------------------------------------
            Reporter:  naokikambe    |                        Owner:
                Type:  enhancement   |  jinmei
            Priority:  medium        |                       Status:
           Component:  xfrin         |  reviewing
            Keywords:                |                    Milestone:
           Sensitive:  0             |  Sprint-20130402
         Sub-Project:  DNS           |                   Resolution:
Estimated Difficulty:  5             |                 CVSS Scoring:
         Total Hours:  0             |              Defect Severity:  N/A
                                     |  Feature Depending on Ticket:
                                     |          Add Hours to Ticket:  0
                                     |                    Internal?:  0
-------------------------------------+-------------------------------------

Comment (by jinmei):

 Replying to [comment:16 naokikambe]:

 > > It's more about design matter so the harm may not be so obvious or
 > > even a subjective matter.  I'm okay with excluding it for this ticket,
 > > but I'd argue this should be at least discussed (and if we agree)
 > > resolved before using the current design further.  So I'd suggest
 > > creating a ticket to discuss and fix it, and do it before the
 > > subsequent counter tickets for xfrin or others.
 >
 > I'm okay with redesigning if the current library has some problem. But
 what ticket should I create? What contents should I write in the ticket?
 Sorry, I couldn't understand well about problems. Could you create it
 yourself?

 I've created one: #2883.  We might also discuss it in the next team
 call.

 > > '''b10-xfrin.xml'''
 > >
 > > - about latest_*xfr_duration: I missed the currently ongoing one
 > >   doesn't show up in the statistics.  In that case
 > >   "last_*xfr_duration" would be better than latest.

 Note: I had agreed last_*xfr_duration would be better, but they are
 not changed.

 > > > > - as for time_to_ixfr, what if there's never been an xfr for the
 zone?
 > > > It shows still the default value, 0.0.
 > >
 > > It seems to be suboptimal in that using a magic number (especially in
 > > the valid range - 0.0 should still be possible if everything is done
 > > super quickly) to present a kind of "None" is always a source of
 > > trouble.  I'd consider something that can explicitly mean "no
 > > completed xfrs".  And, especially if we use a magic value, document it
 > > somewhere.
 >
 > I updated descriptions about the default value `0.0`. Could you see the
 following commit?
 > {{{
 > 26a9ee9 [2252] update descriptions of latest_ixfr_duration and
 latest_axfr_duration about their default values
 > }}}

 In that case this description isn't accurate.  It should be something
 like:

 {{{
                     <listitem><simpara>
                       Duration of the latest IXFR in seconds. 0.0 means it
 was
                       shorter than 0.05 seconds or there was no
                       successful IXFR.
                     </simpara></listitem>
 }}}

 and, as this description suggests, it's generally a bad idea to
 overload a specific value that way; such ambiguity will eventually
 bite us(ers).  Better approach is to actually revise the code to
 eliminate the ambiguity.  Do you still think the overloading is the
 way to go?

 > > > > '''xfrin.py.in'''
 > > > >
 > > > > - get_ipver_str: I suggest making it "protected" by renaming
 > > By "clarify" I meant update the docstring too.
 >
 > I updated the docstring. Could you see this?
 > {{{
 > ad8e7ea [2252] update docstring of _get_ipver_str() about accessing
 > }}}

 I further clarified it can also be used in tests.

 > > > > - `_check_soa_serial`: shouldn't this be distinguished per name
 and
 > > > >   class?
 [...]
 > > I'm okay with differing it, but it seems to be a quite fundamental
 > > design error.  So I'd argue that should be fixed before we rely on the
 > > current format further.  The longer we postpone it, the more painful
 > > it is to fix.
 >
 > The current data structure is already implemented in Xfrout. I think we
 should repair altogether at a time in future.

 These should be all fixed, yes, but if you mean by "in future", we
 still keep using it, I'd be opposed to it.  IMO this level of design
 defect should be corrected sooner than later.  Just as the fact of
 xfrout indicates, if we rely on the flawed design more, fixing them
 will simply be more and more painful.  For this particular ticket, I
 wouldn't block due to this, but I don't like to pick up any other
 counter tickets until this issue is fixed.

 I created a ticket: #2884.

 > > > > - do_xfrin: stop_timer wouldn't be called if exception is raised
 > > > >   during the process.
 > > > I think it is okay that the timer count would be discarded if
 exception is raised.
 > >
 > > Is it really discarded?  Shouldn't it be left until next time we set
 > > the timer for the same zone?   If it's left, that doesn't seem to be
 > > clear.  Also, if the intent is to only count the duration of
 > > successfully completed transfers (even excluding the case of
 > > `XfrinZoneUptodate`), I believe that should be documented.
 >
 > Strictly speaking, that would be discarded when the next start time is
 set. I guess the timer wouldn't probably started yet in case of
 `XfrinZoneUptodate`. I described about discarding at the following commit.
 > {{{
 > dc2960e [2252] add notes about the case that the timer is already
 started but not yet stopped
 > }}}

 By "document", I (probably ... not quite remember after some period
 since last comment cycle) meant the man page or something, not (only)
 the code comment.

 I'm not sure if I don't understand this: "I guess the timer wouldn't
 probably started yet in case of `XfrinZoneUptodate`".  You mean it
 should have come from `_check_soa_serial`?  But that can also happen
 in IXFR.

 As for dc2960e, shouldn't this be "not yet stopped"?

 {{{
 +        support multi-threaded use. If the specified timer is already
 +        started but not yet started, the last start time is
 +        overwritten."""
 }}}

 > > > > - command_handler: why do we bother to log it here?
 [...]
 > I generalized the log id not to be specific to the `getstats` command.
 Could you see the following commit?
 > {{{
 > b0f2aca [2252] replace the log message XFRIN_RECEIVED_COMMAND to the
 more general id XFRIN_RECEIVED_COMMAND with the com
 > }}}

 This change itself looks good.  But please reorder the log messages
 (use the tool under bind10/tools).

 But, while working on #1938 I (happen to) noticed it had a more
 fundamental problem in the context of using it in lettuce.  The
 corresponding lettuce test reads:

 {{{
     When I query statistics zones of bind10 module Xfrin with cmdctl
     wait for new bind10 stderr message STATS_SEND_STATISTICS_REQUEST
     wait for new bind10 stderr message XFRIN_RECEIVED_COMMAND
     last bindctl output should not contain "error"

     When I query statistics zones of bind10 module Xfrin with cmdctl
 }}}

 This is not reliable, because just seeing XFRIN_RECEIVED_COMMAND
 doesn't guarantee the stats daemon received the answer (on the other
 hand, checking STATS_SEND_... should be unnecessary because if it
 didn't happen we wouldn't see XFRIN_RECEIVED_COMMAND; further, I
 suspect it could even cause failure, because log message order from
 different processes is not predictable).

 What we actually need is something like
 "STATS_STATISTICS_REQUEST_DONE" and check that.

 Note also that #1938 has been merged, and the meaning of "wait for
 new" was clarified in a possibly non compatible way.

 > > > > '''xfrin_test.py'''

 I've reviewed the entire change in the branch from the branch point.
 The latest version looks much better, but it still seems to have some
 issues:

 - the tests for the durations do not always succeed (for me).  in
   general, it's not really good to rely on the real world clock in
   unit tests because it's too fragile.  (see also the next point)
 - AXFR+xfrfail case doesn't seem to be tested.  Combining the first
   and this point, I think it's easier to test if we mock many of the
   methods used in do_xfrin, and for durations only check
   start_timer/stop_timer are called at the correct point with the
   correct parameters (not checking the resulting values).
 - some class attributes (ipver, name_to_counter, master_addrinfo),
   are introduced while they are essentially instance attributes.
   using class attributes when they don't have to be so is generally
   not a good practice because change in one test instance could affect
   another test.  (This is also related to the issue that led to
   #2883).  I suggest making them instance attributes.
 - the intent of each test (such as `test_do_soacheck_uptodate`) is not
   so obvious.  Please add docstring or comment.
 - the case of test_do_soacheck_uptodate makes me wonder why xfrsuccess
   is counted even if there's no xfr transaction.  Is that okay?  (What
   BIND 9 does in this case?)  If that's by design, I suspect its
   description has to be updated:
 {{{
                     <listitem><simpara>
                       Number of zone transfer requests succeeded
                     </simpara></listitem>
 }}}
  (since there's actually not even a transfer request in this case).

 > > '''lettuce tests'''

 - Just wonder: are the changes in test_spec3.spec necessary?

 > > - First off, test failed.  I thought it might be related to the issue
 > >   of #2790, but at least this one doesn't seem to be related to it:
 [...]

 > Even if I didn't apply the Jelte's patch, the related tests passed on my
 environment.

 That one is not relevant to Jelte's patch.

 > And I haven't changed these thresholds since the first time the codes
 could be reviewed. Did that always fail on your environment? Can I see the
 full logs of lettuce tests and stderr? Depending on the results, we might
 need to reconsider the thresholds.

 I've attached the stderr log files and the output from lettuce when it
 failed.  Note that there are two issues:

 - statistics values are not the expected ones.  I suspect it's related
   to the "STATS_STATISTICS_REQUEST_DONE" issue I explained above.  And
   this is not related to Jelte's patch.
 - this synchronization failed
 {{{
     Then wait for new bind10 stderr message XFRIN_XFR_TRANSFER_STARTED
 }}}
   even with Jelte's patch.  On master, I've never seen this happen if
   that patch is applied.

 > > - check_statistics_items: it seems that the docstring has to be
 > >   updated.
 >
 > I updated the docstring. Could you see the following commit?
 > {{{
 > be426ec [2252] update description of check_statistics_items() about
 min_value and max_value
 > }}}

 It says 5 columns but only mention 4.

 {{{
     the scenario. The multiline part has at most 5 columns: item_name,
     item_value, min_value, and max_value. item_name is a relative name
     to category. item_value is an expected value for
 }}}

 > > - check_init_statistics: if it's xfr specific, it seems it should be
 > >   named accordingly.  Also, it seems to have been somewhere other than
 > >   bind10_control.py (from a quick look it's for more generic modules;
 > >   xfrs are too specific in that sense).
 >
 > I generalized it not to be specific to Xfrin nor Xfrout. Could you see
 the following commits?
 > {{{
 > c37fd56 [2252] generalize the module name as an argument of
 check_init_statistics()
 > 0eec3b5 [2252] add a port number of cmdctl as an option of
 check_init_statistics()
 > }}}

 This part still seems to be DNS specific:

 {{{
     step.given(notcontain_str % 'example.org.')
 }}}

 It should probably be parameterized.

 > > - it's not obvious why these are specified with "min".  please add
 > >   comments on why.

 > > - same for this:
 > > {{{
 > >       | item_name              | min_value | max_value |
 > >       | _SERVER_.soaoutv4      |         1 |         3 |
 > >       | _SERVER_.axfrreqv4     |         1 |         3 |
 > > ...
 > > }}}
 >
 > On my environment, they cannot be expected to a specific value. They
 seem to depend on timing and environment. I added notes about this. Please
 see the following commit?
 > {{{
 > 14ee53d [2252] add notes about counters of Xfrin: soaoutv4, axfrreqv4,
 soaoutv6, axfrreqv6, and xfrfail
 > }}}

 I suspect this is due to #2879.  Please confirm that from logs, and
 try modifying `isc.notify.notify_out._MAX_NOTIFY_TRY_NUM` to 1 to see
 if the number is now predictable (I believe it should be exactly 1).
 If these are confirmed, please make more accurate comment in the
 lettuce feature and add comments to #2879 so when it's done we should
 be able to clean up this test.

-- 
Ticket URL: <http://bind10.isc.org/ticket/2252#comment:17>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development


More information about the bind10-tickets mailing list