BIND 10 #762: Conversion of xfrout to use the new logging interface

Mon Jun 27 18:27:48 UTC 2011

#762: Conversion of xfrout to use the new logging interface
-------------------------------------+-------------------------------------
                   Reporter:         |                 Owner:  jinmei
  stephen                            |                Status:  reviewing
                       Type:         |             Milestone:
  enhancement                        |  Sprint-20110628
                   Priority:  major  |            Resolution:
                  Component:         |             Sensitive:  0
  logging                            |           Sub-Project:  DNS
                   Keywords:         |  Estimated Difficulty:  2.0
            Defect Severity:  N/A    |           Total Hours:  0
Feature Depending on Ticket:         |
        Add Hours to Ticket:  0      |
                  Internal?:  0      |
-------------------------------------+-------------------------------------

Comment (by jinmei):

 Replying to [comment:7 jelte]:

 The revised version looks mostly okay except two things:

 - there was one minor grammar error in the description.  I directly
   fixed it and pushed it.
 - The XFROUT_REMOVE_UNIX_SOCKET_FILE issue doesn't seem to be solved
   (see the end of this comment message)

 Also, I suggest performing manual checks (for now) about the
 consistency of .mes and xfrout.py(.in), i.e., all defined ID is used
 and all used IDs are defined.  I've not done that comprehensive check
 myself.

 Other points are minor, so please feel free to address or ignore them
 as you like, and, when done, feel free to merge.

 > > '''General'''
 > >
 > > I've noticed some IDs in .mes are not used in .py and some other IDs
 > > in .py are not defined in .mes.  Some specific cases are noted below,
 > > but that's not a comprehensive check.  I guess we need a generic
 > > checker script to identify these types of errors.  Also, ideally (as
 > > also commented below), we should have test cases for log messages to
 > > check whether intended log messages are really produced.
 >
 > Yeah, you're not the only one to think we need such a checking script,
 see #1055.
 [...]
 > Or something (it should probably include severity and levels etc, but
 this is just an example)
 >
 > If we then want to make it exhaustive, perhaps we can even try to get
 the tests to error if anything was logged that was not checked. This all
 is indeed outside the scope of this ticket, but definitely worth
 considering IMO.

 These seem to be good ideas (aside from how exactly we implement them,
 and on further thought we may thing these are overkilling).  In any
 case that's beyond the scope of this ticket.  I see some (or all?)
 points are covered in #1055.  If #1055 would cover everything, we are
 done with that.  If there are some other things, I'd suggest adding
 them as a comment to #1055 or creating a new ticket.

 > > - XFROUT_PARSE_QUERY

 > >   - "in principle" this should not happen except for really unexpected
 > >     events like memory allocation failure because b10-auth should have
 > >     handled bogus packets.
 >
 > or if the python library somehow manages to disagree with the c++
 library of what constitutes a correct query ;)
 >
 > Added this text:
 > In general, this should only occur for unexpected problems like
 > memory allocation failures, as the query should already have been
 > parsed by the b10-auth daemon, before it was passed here.
 >
 > (note that should we even come up with an 'optimization' where only the
 header is parsed before it is sent on this text would not be true anymore)

 Right.  I'm not sure whether we should prepare for that in this
 context.  We might implicitly clarify that by saying "in the current
 implementation" or something.  I'd leave it to you.

 > > - XFROUT_AXFR_TRANSFER_FAILED
 > >  - maybe we should print RR class, too, as the commented-out original
 > >    log tried to do (but if we do, don't hardcode it please:-)?
 >
 > hmm, now that you mention it, the code only looks at the name, not the
 class... I've extracted the requested class as well, but made checking the
 class in the code an added TODO (there are more of those :p)

 OK.

 > > - XFROUT_AXFR_TRANSFER_STARTED
 > >  - print RR class?
 > >  - should we want to log the "END" event as the original
 > >    (commented-out) log tried?
 >
 > ok, ok

 As for XFROUT_AXFR_TRANSFER_DONE, it seems this could be logged when
 the transfer was incompletely terminated due to a shutdown while the
 log description says:
 {{{
 +The transfer of the given zone has been completed successfully.
 }}}
 I'd say, e.g., "...has been successfully completed or aborted
 halfway due to a shutdown event".

 > > - XFROUT_STOPPING: this is a specific event happening in the middle of
 > >   transfer.  I'd clarify that fact.
 >
 > eek. Good point, updated text. But I think we should actually complete
 ongoing transfers before shutting down (but not accept new ones)

 OK about the change.  As for whether to complete it, I personally
 think aborting it is better because it could take longer (e.g. for
 several minutes in some bad cases), and when I perform "shutdown", I'd
 expect the system is terminated (e.g. so I can restart it again) quite
 soon.  But in any case, that's a different topic (if you want to
 change the behavior, please file a new ticket).

 > > - XFROUT_SOCKET_SELECT

 > >   - does this really indicate "an error in the internal communication
 with b10-auth"?
 >
 > is my understanding that this is what this part of the code does (read
 fd's from b10-auth) wrong?

 Or a "socketpair" within xfrout to signal a shutdown.  But my question
 is different: if python's select.select() inherits the semantics of
 the standard select(2) system call, an error from the call should be
 very local, such as passing an invalid FD or temporary memory
 allocation failure, rather than a communication error, which should be
 signaled via a "successful" select() call followed by a error result
 in subsequent read/write operation.  If my understanding is correct,
 I'd replace "This most likely points to an error in the internal
 communication with b10-auth." to, e.g. "This should be a result of
 rare local error such as memory allocation failure and shouldn't
 happen under normal conditions."

 > >   - better name to indicate it's an error?..hmm...this point seems to
 > >     be common for many others, so I won't repeat.  it at least looks
 > >     awkward that some error logs are named xxx_ERROR and others
 > >     aren't.
 >
 > yes, at some point i guess i decided that we have more than enough
 redundancy in log output, and adding _ERROR would usually mean that the
 word 'error' is printed 3 times in one log message;
 > <time> ERROR [Foo.bar] FOO_BAR_ERROR error in foo while trying bar:
 <reason>
 >
 > But as it would seem i'm the only one who doesn't like this redundancy
 i'll just add ERROR to the message identifiers where it isn't clear.

 I don't like redundancy (in general:-).  My main point in this case
 was the inconsistency (I actually suggested adding _ERROR, but at that
 time I didn't consider the redundancy that much).  As for redundancy,
 I see the point on one hand.  I've been also a bit concerned about
 possible inconsistency between the defined level and description due
 to future changes to the level.  For example, in the above example, we
 might change our mind and make the level to "info", by changing
 "logger.error" to "logger.info".  That would implicitly request
 changes to the log description (and ideally the ID also for
 consistency), but I can easily imagine the situation we forget it.

 So, it would be nicer if we only have a single definition point on the
 log level, while still naming log IDs reasonably.  But, like other
 points in this ticket, it would be beyond the scope of this ticket.
 So I'm okay with the current code for this specific case.

 > >  - I don't understand this: "This xfrout daemon will now shut down."
 > >    Is "this xfrout daemon" the other process?  The daemon that emits
 > >    this message actual exits (if that means 'shut down').
 >
 > oh, i used 'this' to make clear that it is the one printing the message
 that gives up (as opposed to 'that', which would be the already-running
 one, or 'the', which would be ambiguous). What about this?
 >
 > While starting up, the xfrout daemon tried to clear the unix domain
 > socket needed for contacting the b10-auth daemon to pass requests
 > on, but the file is in use. The most likely cause is that another
 > xfrout daemon process is still running. This xfrout daemon (the one
 > printing this message) will not start.

 OK.

 > > - Is the log ID correct here?
 > > {{{
 > >         try:
 > >             os.unlink(self._sock_file)
 > >         except Exception as e:
 > >             logger.error(XFROUT_REMOVE_UNIX_SOCKET_FILE, str(e))
 > > }}}
 > >   Note that XFROUT_REMOVE_UNIX_SOCKET_FILE should take two args.
 > >
 > > - XFROUT_REMOVE_OLD_UNIX_SOCKET_FILE: doesn't seem to be used in .py.
 >
 > nope, the first should indeed have _OLD_ in it. Yet another reason for
 that checker.

 This doesn't seem to be fixed.

-- 
Ticket URL: <http://bind10.isc.org/ticket/762#comment:9>
BIND 10 Development <http://bind10.isc.org>
BIND 10 Development