BIND 10 #2252: Implement counters into Xfrin (1/3)
BIND 10 Development
do-not-reply at isc.org
Wed Mar 27 03:46:44 UTC 2013
#2252: Implement counters into Xfrin (1/3)
-------------------------------------+-------------------------------------
Reporter: naokikambe | Owner:
Type: enhancement | jinmei
Priority: medium | Status:
Component: xfrin | reviewing
Keywords: | Milestone:
Sensitive: 0 | Sprint-20130402
Sub-Project: DNS | Resolution:
Estimated Difficulty: 5 | CVSS Scoring:
Total Hours: 0 | Defect Severity: N/A
| Feature Depending on Ticket:
| Add Hours to Ticket: 0
| Internal?: 0
-------------------------------------+-------------------------------------
Comment (by jinmei):
Replying to [comment:16 naokikambe]:
> > It's more about design matter so the harm may not be so obvious or
> > even a subjective matter. I'm okay with excluding it for this ticket,
> > but I'd argue this should be at least discussed (and if we agree)
> > resolved before using the current design further. So I'd suggest
> > creating a ticket to discuss and fix it, and do it before the
> > subsequent counter tickets for xfrin or others.
>
> I'm okay with redesigning if the current library has some problem. But
what ticket should I create? What contents should I write in the ticket?
Sorry, I couldn't understand well about problems. Could you create it
yourself?
I've created one: #2883. We might also discuss it in the next team
call.
> > '''b10-xfrin.xml'''
> >
> > - about latest_*xfr_duration: I missed the currently ongoing one
> > doesn't show up in the statistics. In that case
> > "last_*xfr_duration" would be better than latest.
Note: I had agreed last_*xfr_duration would be better, but they are
not changed.
> > > > - as for time_to_ixfr, what if there's never been an xfr for the
zone?
> > > It shows still the default value, 0.0.
> >
> > It seems to be suboptimal in that using a magic number (especially in
> > the valid range - 0.0 should still be possible if everything is done
> > super quickly) to present a kind of "None" is always a source of
> > trouble. I'd consider something that can explicitly mean "no
> > completed xfrs". And, especially if we use a magic value, document it
> > somewhere.
>
> I updated descriptions about the default value `0.0`. Could you see the
following commit?
> {{{
> 26a9ee9 [2252] update descriptions of latest_ixfr_duration and
latest_axfr_duration about their default values
> }}}
In that case this description isn't accurate. It should be something
like:
{{{
<listitem><simpara>
Duration of the latest IXFR in seconds. 0.0 means it
was
shorter than 0.05 seconds or there was no
successful IXFR.
</simpara></listitem>
}}}
and, as this description suggests, it's generally a bad idea to
overload a specific value that way; such ambiguity will eventually
bite us(ers). Better approach is to actually revise the code to
eliminate the ambiguity. Do you still think the overloading is the
way to go?
> > > > '''xfrin.py.in'''
> > > >
> > > > - get_ipver_str: I suggest making it "protected" by renaming
> > By "clarify" I meant update the docstring too.
>
> I updated the docstring. Could you see this?
> {{{
> ad8e7ea [2252] update docstring of _get_ipver_str() about accessing
> }}}
I further clarified it can also be used in tests.
> > > > - `_check_soa_serial`: shouldn't this be distinguished per name
and
> > > > class?
[...]
> > I'm okay with differing it, but it seems to be a quite fundamental
> > design error. So I'd argue that should be fixed before we rely on the
> > current format further. The longer we postpone it, the more painful
> > it is to fix.
>
> The current data structure is already implemented in Xfrout. I think we
should repair altogether at a time in future.
These should be all fixed, yes, but if you mean by "in future", we
still keep using it, I'd be opposed to it. IMO this level of design
defect should be corrected sooner than later. Just as the fact of
xfrout indicates, if we rely on the flawed design more, fixing them
will simply be more and more painful. For this particular ticket, I
wouldn't block due to this, but I don't like to pick up any other
counter tickets until this issue is fixed.
I created a ticket: #2884.
> > > > - do_xfrin: stop_timer wouldn't be called if exception is raised
> > > > during the process.
> > > I think it is okay that the timer count would be discarded if
exception is raised.
> >
> > Is it really discarded? Shouldn't it be left until next time we set
> > the timer for the same zone? If it's left, that doesn't seem to be
> > clear. Also, if the intent is to only count the duration of
> > successfully completed transfers (even excluding the case of
> > `XfrinZoneUptodate`), I believe that should be documented.
>
> Strictly speaking, that would be discarded when the next start time is
set. I guess the timer wouldn't probably started yet in case of
`XfrinZoneUptodate`. I described about discarding at the following commit.
> {{{
> dc2960e [2252] add notes about the case that the timer is already
started but not yet stopped
> }}}
By "document", I (probably ... not quite remember after some period
since last comment cycle) meant the man page or something, not (only)
the code comment.
I'm not sure if I don't understand this: "I guess the timer wouldn't
probably started yet in case of `XfrinZoneUptodate`". You mean it
should have come from `_check_soa_serial`? But that can also happen
in IXFR.
As for dc2960e, shouldn't this be "not yet stopped"?
{{{
+ support multi-threaded use. If the specified timer is already
+ started but not yet started, the last start time is
+ overwritten."""
}}}
> > > > - command_handler: why do we bother to log it here?
[...]
> I generalized the log id not to be specific to the `getstats` command.
Could you see the following commit?
> {{{
> b0f2aca [2252] replace the log message XFRIN_RECEIVED_COMMAND to the
more general id XFRIN_RECEIVED_COMMAND with the com
> }}}
This change itself looks good. But please reorder the log messages
(use the tool under bind10/tools).
But, while working on #1938 I (happen to) noticed it had a more
fundamental problem in the context of using it in lettuce. The
corresponding lettuce test reads:
{{{
When I query statistics zones of bind10 module Xfrin with cmdctl
wait for new bind10 stderr message STATS_SEND_STATISTICS_REQUEST
wait for new bind10 stderr message XFRIN_RECEIVED_COMMAND
last bindctl output should not contain "error"
When I query statistics zones of bind10 module Xfrin with cmdctl
}}}
This is not reliable, because just seeing XFRIN_RECEIVED_COMMAND
doesn't guarantee the stats daemon received the answer (on the other
hand, checking STATS_SEND_... should be unnecessary because if it
didn't happen we wouldn't see XFRIN_RECEIVED_COMMAND; further, I
suspect it could even cause failure, because log message order from
different processes is not predictable).
What we actually need is something like
"STATS_STATISTICS_REQUEST_DONE" and check that.
Note also that #1938 has been merged, and the meaning of "wait for
new" was clarified in a possibly non compatible way.
> > > > '''xfrin_test.py'''
I've reviewed the entire change in the branch from the branch point.
The latest version looks much better, but it still seems to have some
issues:
- the tests for the durations do not always succeed (for me). in
general, it's not really good to rely on the real world clock in
unit tests because it's too fragile. (see also the next point)
- AXFR+xfrfail case doesn't seem to be tested. Combining the first
and this point, I think it's easier to test if we mock many of the
methods used in do_xfrin, and for durations only check
start_timer/stop_timer are called at the correct point with the
correct parameters (not checking the resulting values).
- some class attributes (ipver, name_to_counter, master_addrinfo),
are introduced while they are essentially instance attributes.
using class attributes when they don't have to be so is generally
not a good practice because change in one test instance could affect
another test. (This is also related to the issue that led to
#2883). I suggest making them instance attributes.
- the intent of each test (such as `test_do_soacheck_uptodate`) is not
so obvious. Please add docstring or comment.
- the case of test_do_soacheck_uptodate makes me wonder why xfrsuccess
is counted even if there's no xfr transaction. Is that okay? (What
BIND 9 does in this case?) If that's by design, I suspect its
description has to be updated:
{{{
<listitem><simpara>
Number of zone transfer requests succeeded
</simpara></listitem>
}}}
(since there's actually not even a transfer request in this case).
> > '''lettuce tests'''
- Just wonder: are the changes in test_spec3.spec necessary?
> > - First off, test failed. I thought it might be related to the issue
> > of #2790, but at least this one doesn't seem to be related to it:
[...]
> Even if I didn't apply the Jelte's patch, the related tests passed on my
environment.
That one is not relevant to Jelte's patch.
> And I haven't changed these thresholds since the first time the codes
could be reviewed. Did that always fail on your environment? Can I see the
full logs of lettuce tests and stderr? Depending on the results, we might
need to reconsider the thresholds.
I've attached the stderr log files and the output from lettuce when it
failed. Note that there are two issues:
- statistics values are not the expected ones. I suspect it's related
to the "STATS_STATISTICS_REQUEST_DONE" issue I explained above. And
this is not related to Jelte's patch.
- this synchronization failed
{{{
Then wait for new bind10 stderr message XFRIN_XFR_TRANSFER_STARTED
}}}
even with Jelte's patch. On master, I've never seen this happen if
that patch is applied.
> > - check_statistics_items: it seems that the docstring has to be
> > updated.
>
> I updated the docstring. Could you see the following commit?
> {{{
> be426ec [2252] update description of check_statistics_items() about
min_value and max_value
> }}}
It says 5 columns but only mention 4.
{{{
the scenario. The multiline part has at most 5 columns: item_name,
item_value, min_value, and max_value. item_name is a relative name
to category. item_value is an expected value for
}}}
> > - check_init_statistics: if it's xfr specific, it seems it should be
> > named accordingly. Also, it seems to have been somewhere other than
> > bind10_control.py (from a quick look it's for more generic modules;
> > xfrs are too specific in that sense).
>
> I generalized it not to be specific to Xfrin nor Xfrout. Could you see
the following commits?
> {{{
> c37fd56 [2252] generalize the module name as an argument of
check_init_statistics()
> 0eec3b5 [2252] add a port number of cmdctl as an option of
check_init_statistics()
> }}}
This part still seems to be DNS specific:
{{{
step.given(notcontain_str % 'example.org.')
}}}
It should probably be parameterized.
> > - it's not obvious why these are specified with "min". please add
> > comments on why.
> > - same for this:
> > {{{
> > | item_name | min_value | max_value |
> > | _SERVER_.soaoutv4 | 1 | 3 |
> > | _SERVER_.axfrreqv4 | 1 | 3 |
> > ...
> > }}}
>
> On my environment, they cannot be expected to a specific value. They
seem to depend on timing and environment. I added notes about this. Please
see the following commit?
> {{{
> 14ee53d [2252] add notes about counters of Xfrin: soaoutv4, axfrreqv4,
soaoutv6, axfrreqv6, and xfrfail
> }}}
I suspect this is due to #2879. Please confirm that from logs, and
try modifying `isc.notify.notify_out._MAX_NOTIFY_TRY_NUM` to 1 to see
if the number is now predictable (I believe it should be exactly 1).
If these are confirmed, please make more accurate comment in the
lettuce feature and add comments to #2879 so when it's done we should
be able to clean up this test.
--
Ticket URL: <http://bind10.isc.org/ticket/2252#comment:17>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development
More information about the bind10-tickets
mailing list