[kea-dev] Statistics design proposal for 0.9.2

Wed Apr 15 14:56:16 UTC 2015

Thanks a lot for those comments. See my comments below.

On 15.04.2015 10:43, Marcin Siodelski wrote:
> I would like to clarify the comment I have made at some point about the
> use of concurrency when gathering the statistical information. I didn't
> really mean that the statistics manager should run in a separate
> process. I was rather thinking that it should run in a separate thread.
> This thread could create a socket and listen on the fd belonging to this
> socket. This would allow for better responsiveness of the stats manager
> in the presence of many DHCP packets being received on possibly many
> interfaces. This would also allow to perform certain independent tasks
> like, reception of a command, unparsing JSON, creating and sending the
> response concurrently with the main thread which handles DHCP stream.
> There is a problem with the concurrent access to the StatsMgr such that
> certain values have to be locked for write when second thread is reading
> them. But, that is not something that can't be solved with the design of
> the StatsMgr.
We had a lengthy chat about this. It can be summarized as follows:
The simple approach that the main thread is data producer and the stats
thread is data consumer is not sustainable. It would work during most of
the time while collecting data, but would not work when receiving
external commands to set statistic or flush it. Therefore even a simple
case (e.g. log number of packets) would require thread synchronization.
This could dramatically lower statistics performance, rather than
improve it.

Having said that, this problem is solvable. It just requires non-trivial
amount of research, preferably with prototype implementations and
benchmarks. This is something we don't have time for in 0.9.2, so we
agreed to go with the simple approach of logging everything in the main
thread.

Now, if we get more concrete data (as opposed to opinions that a said
solution may be slow), we may rework the statistics system. Note that it
would require updating a handful of methods in StatsMgr to move the
whole processing to a separate thread.

> The stats manager's operation is going to be based on time intervals.
> For example: keep statistics collected for the last 5 minutes. The use
> of threads would probably make it much easier to use asio-based timers
> which are asynchronous, i.e. based on callbacks invoked when specific
> timers expire. You can't do it easily when you are hanging on the call
> to select() in the main thread.
Not exactly. Yes, it will be time oriented, but there won't be a single
timer. Let me give you an example. Let's assume the user specifies that
observations of an event (e.g. broken Discover) that typically happens
every 30 seconds should be kept for 5 minutes. After 5 minutes we'll
have 10 observations. After another 30 seconds, the event occurred again
and the code attempts to log another observation. New observation is
added, but at the same time the StatsMgr check that the last observed
sample is too old, so it is discarded.

This approach works well, even if the expected event doesn't happen for
a certain period of time. Let's assume that the next broken Discover
appears after 3 minutes. The StatsMgr will record new observation and
will discard several old observations that are more than 5 minutes old.

This approach also means that the StatsMgr does not do anything when
there are no observations recorded. No background tasks, no timers to
manage, no timeouts to act on.

> I wonder how statistics is going to be configured. I understand that
> you're planning to add invocations to the StatsMgr in multiple places in
> the code where you're going to bump the counters. But, is it going to be
> possible to enable/disable specific counters so as they are not bumped
> if not needed? Or, it is assumed that the counter bumping operation is
> fast enough that such optimization would not bring a lot of benefit?
> From the "Performance Optimization" section however it seems that it has
> been of your concern.
That's a good point. In principle, I think that by default a statistic
is kept a single integer/double. Increasing an integer or double is
sufficiently fast that in my opinion we don't need to worry about it.
But I see your point. There's a setStorage(name, max_samples) method.
It can be trivially extended for case max_samples = 0, which means to
not collect this particular stat at all.

> I am iffy about the naming for statistics per subnet You say,
> "subnet[0].packets-received". But, what if I remove the subnet with
> index 0 from the configuration? The subnets will get renumbered and the
> statistics will now apply to wrong subnet. Wouldn't it be better to
> identify subnets using SubnetID which is supposed to be unique?
That was perhaps a poor example. The intention was that "0" in this
example would be SubnetID. So yes - sysadmins are expected to be
familiar with SubnetID as they are used in many places, e.g. lease
database. They will be able to use it to retrieve specific subnet
information.

> On the related note. Does this also account for the statistics per
> interface?
Not in the basic scope, but extending the code to handle per interface
statistics is trivial. Barring unit-tests, it's a one-liner:

StatsMgr::instance().addValue("interface[eth0].packets-received");

Note that the lazy initialization design pattern applies also here. For
the first statistic that starts with "interface[eth0]." a new context
will be created. You can use interface index instead (e.g.
"interface[6]") if you prefer that.

> In the data extraction section we should keep in mind that the
> communication over the unix socket requires two sockets: one for the
> client and one for the server. So I guess, you'll need to extend the
> "control-socket" parameter to specify two names? I am also not so sure
> that choosing the string as a parameter for control-socket configuration
> is a right choice. If you want to use the same parameter for future
> sockets: TCP, UDP or whatever else, it may quickly occur that you need
> more parameters. If I am correct about the two names for socket files
> you already have three parameters that describe the socket communication.
There are two aspects here. First, as you posted in the follow-up
message, one unix socket can be used for two-way communication. I get
your second point, though. So far there are two parameters defined for
now (socket-type and socket-file-location), but the number is likely to
grow over time (tcp-port, access-control and possibly many others), so
this will be defined as a map, the same way as we define lease-database.

> It would be useful if the design included some sample JSON requests and
> responses, including responses which report errors in statistics
> gathering. The organization of the JSON query and response should be a
> subject for review because it will be troublesome to modify it once
> people start implementing proprietary clients.
Good point. Added new section. I tried to recycle what is still there
from BIND10 days, but the syntax was a bit too complex for my taste. I
proposed a simplified version.

> I also wonder if this "protocol" shouldn't be the base for the remote
> management API, in which case we should take into account use cases for
> the management API here? Not that I want to start implementing
> management API right now, but just make sure that it will be compatible
> when we implement it.
Yes. I was afraid to say it out out and being accused that I try to
sneak in remote management API :) Your observation is correct. I think
about this statistic retrieval API as an early prototype that will one
day evolve into full featured management API.

> On the class diagram, I still think that it may be useful to make it
> generic and allow for some additional types apart from the ones you
> listed. In particular, string value. Suppose someone writes a hook and
> wants to store some textual information in it like last error found.
Personally I doubt statistics are the appropriate place for collecting
and reporting strings, but I don't object to it. I'll add strings to the
list of supported types.

> Doesn't Observation require setValue modifiers?
It does. I updated the diagram. Make sure you refresh your page. For
some reason, chrome was showing cached version until I forced it to
refresh. I think trac is set to encourage aggressive caching.

Thanks for your comments,
Tomek