[bind10-dev] comments on the statistics design

Fri Jul 27 00:57:55 UTC 2012

I've quickly read the revised version of the statistics design:
http://bind10.isc.org/wiki/StatsModule

Some random comments (absolutely not comprehensive):

- should each module really do validation on the statistics data it's
  sending?  It doesn't seem to make much sense to me; normally the
  sender should know what it's sending is valid (otherwise it's a
  bug), and in any case the receiver needs to validate it.
  It could even be a non-negligible overhead for a busy module like
  b10-auth.
- Regarding how to identify "how many...", I don't think we can always
  rely on 'special'.  Not all modules would have it.  Maybe beyond the
  scope of this discussion, but my gut feeling is that we should solve
  this type of thing as part of more fundamental restructuring of
  inter-module relationships (how to know which modules are running
  and how many instances of each, when something stops/starts, etc.).
- what should happen if stats dies and restarts?  e.g. is it okay to
  drop all statistics collected up to that point?
- what should other modules do if stats doesn't send requests or isn't
  even running at all?  Should they keep maintaining local statistics
  anyway?  What if the amount of the statistics is huge (like the case
  of per-zone statistics with billions of zones)?
- is each module expected to reset its statistics to 0 (if they are
  resettable) every time it responds to a request from stats?  I
  guess so because otherwise the accumulated data at stats won't make
  sense, but it's not clear from the document.  What if it's not
  "resettable" (such as number of RRs of some zone/cache, etc,
  assuming we consider it "statistics" and want to maintain it)?
- keeping statistics even if some (instance of a) module stops would
  generally make sense, but I'm not sure if it's always the case.
  If someone completely stops authoritative service once and then
  restarts it with a completely new set of config weeks later, it
  would be preferable to start from fresh statistics.
- does this make sense? "If the target module caught an error when
  returning, the module should return (1, 'error message'), the
  return code is 1 and the error message is included detail of
  the error. Error message is text type."  If responding to a stat
  request fails, it's also pretty likely that sending an error message
  fails too.
- related to the previous point, what should the sending module do
  about the statistics if sending statistics fails?  Keep it until the
  next request?  Clear it?  As asked above, what if the amount could
  be huge?
- I guess we need to revisit the representation of statistics
  (counters), especially in terms of the spec.  Things like per-RR-type
  counters aren't easily represented in this style.
- Now that we are switching to the "request model", I think it also
  makes sense to consider more synchronous updates on user request
  (via bindctl or http).
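
To make the reset/resend questions above a bit more concrete, here is a
rough Python sketch of one possible alternative.  It is not what the
design document describes, and every class and counter name in it is
made up for illustration.  The idea is that a module never resets
anything: it always reports cumulative counters plus current values
("gauges", such as the number of RRs in a zone), and the receiver works
out the differences.  A lost or repeated response then costs nothing,
and non-resettable values need no special handling.

```python
# Hypothetical sketch; none of these names come from the BIND 10 code base.

class ModuleCounters:
    """Statistics kept locally by a module such as b10-auth."""

    def __init__(self):
        self.counters = {}  # only ever increase, e.g. queries received
        self.gauges = {}    # current values, e.g. number of RRs in a zone

    def increment(self, name, step=1):
        self.counters[name] = self.counters.get(name, 0) + step

    def set_gauge(self, name, value):
        self.gauges[name] = value

    def snapshot(self):
        """Answer a request from stats without resetting anything."""
        return {"counters": dict(self.counters), "gauges": dict(self.gauges)}


class StatsAggregator:
    """Receiver side: turns cumulative snapshots into running totals."""

    def __init__(self):
        self.last_seen = {}  # last cumulative value reported per counter
        self.totals = {}     # totals that survive a module restart
        self.gauges = {}

    def update(self, snapshot):
        for name, value in snapshot["counters"].items():
            previous = self.last_seen.get(name, 0)
            # A value lower than the previous one is taken to mean that
            # the module restarted and began counting from zero again.
            delta = value - previous if value >= previous else value
            self.totals[name] = self.totals.get(name, 0) + delta
            self.last_seen[name] = value
        # Gauges are overwritten; accumulating them would be meaningless.
        self.gauges.update(snapshot["gauges"])


auth = ModuleCounters()
stats = StatsAggregator()

auth.increment("queries", 10)
auth.set_gauge("zone_rrs", 500)
stats.update(auth.snapshot())
stats.update(auth.snapshot())   # a repeated request adds nothing

auth = ModuleCounters()         # simulate a restart of the module
auth.increment("queries", 3)
stats.update(auth.snapshot())
print(stats.totals, stats.gauges)
```

The restart check is only a heuristic: if a module restarts and passes
its previous value before the next request, the difference is
undercounted, so a real design would probably need an explicit
instance or start-time identifier.  This sketch also does nothing
about the memory cost of keeping very large per-zone statistics in the
module.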

---
JINMEI, Tatuya