BIND 10 #2676: Provide ccsession::rpc_call and replace usual calls with it
BIND 10 Development
do-not-reply at isc.org
Tue Feb 19 10:04:41 UTC 2013
#2676: Provide ccsession::rpc_call and replace usual calls with it
-------------------------------------+-------------------------------------
            Reporter:  vorner        |                Owner:  jinmei
                Type:  enhancement   |               Status:  reviewing
            Priority:  medium        |            Milestone:  Sprint-20130219
           Component:  Inter-module  |           Resolution:
                       communication |         CVSS Scoring:
            Keywords:                |      Defect Severity:  N/A
           Sensitive:  0             |  Feature Depending on Ticket:
         Sub-Project:  Core          |  Add Hours to Ticket:  0
Estimated Difficulty:  2.5           |            Internal?:  0
         Total Hours:  0             |
-------------------------------------+-------------------------------------
Changes (by vorner):
* owner: vorner => jinmei
Comment:
Hello
As always, what is not commented has been handled.
Replying to [comment:7 jinmei]:
> I didn't yet understand why we needed two sessions in these cases, but
> if they are about xfrin and zonemgr, I guess it's just another
> incident that xfrs and zonemgr only have experimental quality,
> especially about the use of threads. I'm not sure when we can make
> them production ready, but I'd generally defer tricky cases related to
> the buggy use of threads in these modules to fundamental redesign
> work.
It seems the ModuleCCSession is busy-locked in one thread in the loop
checking for commands, and it doesn't cope with another request waiting
for messages there. We may want to refactor the libraries around there a
lot, because we had similar problems in the C++ library as well (not
with threads, though, but during startup).
Anyway, thinking about this, it might be solved by some kind of
asynchronous wait on the session, or just by creating a ModuleCCSession
instead of a plain CCSession as the second one. That could be easy.
Should I try whether it helps?
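For illustration, a minimal sketch of the two-session setup in Python,
assuming the isc.config/isc.cc interfaces; the constructor arguments are
from memory and this is not a tested patch:
{{{
import isc.cc
import isc.config

def make_sessions(spec_file, config_handler, command_handler):
    # Session driven by the main thread's command-checking loop.
    module_session = isc.config.ModuleCCSession(spec_file, config_handler,
                                                command_handler)
    # Independent session for the other thread's blocking waits, so the
    # two threads do not contend over one socket.  Whether this second
    # one should stay a plain Session or become a ModuleCCSession is
    # exactly the question above.
    second_session = isc.cc.Session()
    return module_session, second_session
}}}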
> - a lettuce test (xfrin_notify_handling.feature) often, but not
> always, fails. I'm not sure if it's related to this branch.
I don't know. When I was running the lettuce tests, they failed several
times for me, in various unrelated parts. When I tried to investigate,
the log messages the tests timed out waiting for were actually present
in the logs. So I guess it is some strange kind of race condition in the
lettuce framework, possibly happening when many messages are output at
once (maybe a full 4kB read happens and splits a message in the middle,
or something like that). It might be related to this branch in the sense
that the timing of sends and logs is slightly different, but I don't see
how it could directly cause a new race there.
I'd propose not trying to hunt down the lettuce timing issue here, but
creating a ticket for it.
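To illustrate the suspected mechanism: if the framework matches each
fixed-size read separately, a message straddling a 4kB boundary is never
seen whole, and line buffering would avoid that. This is a plain-Python
sketch of the technique, not the actual lettuce reader:
{{{
def read_lines(fileobj, chunk_size=4096):
    """Yield complete log lines even when a fixed-size read splits one.

    Matching raw chunks instead would miss any message that crosses a
    chunk boundary, which is the race suspected above.
    """
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            if buffer:           # trailing partial line at EOF
                yield buffer
            return
        buffer += chunk
        lines = buffer.split("\n")
        buffer = lines.pop()     # keep the incomplete tail for later
        for line in lines:
            yield line
}}}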
> - totally a nit, but did you intentionally use "ourself" instead of
> "ourselves"?
No, I just don't speak English well enough to notice it's wrong.
> - some log messages seem to have to be updated. For example, the
> description of DDNS_START_FORWARDER_ERROR is not very accurate any
> more because it includes the case that auth doesn't exist (or at
> least isn't listening to the commands). I suggest you go through
> all relevant log messages (including detailed description) and
> revise them appropriately.
Actually, I changed the behaviour here back to the original. But I did
change one message, removing the several lines about how bad the
timeouts are, since they should no longer happen.
> - `__notify_start_forwarder`: also not related to this branch, but
> this case is one example where just getting a response from one
> instance (of possibly multiple ones) of the component isn't
> sufficient. We'll have to clarify this in the protocol doc (and
> refinement) task and revise the implementation so it'll work
> robustly.
That is true, but just logging here doesn't seem like the way to go
either. I believe it should crash the whole DDNS in such a case. Anyway,
it is not related to this ticket.
> see the detailed description of DDNS_UPDATE_NOTIFY_FAIL. And, in
> fact, IMO if we want to make production quality system, not a toy of
> a student project, we shouldn't just "notify and forget" in many
> more cases. IMO our implementation is rather too loose in this
> sense. In this particular case, the discussion is quite subtle
> though... first, if the error is `RPCRecipientMissing`, it can be
> less serious: while the fact that auth/xfrout has likely terminated
> may indicate something wrong in auth/xfrout themselves, the entire
> system can recover from that situation, as they will be restarted by
> the init process. So we could handle this specific case separately,
> with a lower level of log. But then we face the fundamental
> question of the inter-module communication model. This is another
> case where, if not all recipients receive the message, the system
> can be in an inconsistent state.
Actually, the current behaviour of waiting for one (random) answer is
wrong too, because:
 * There may be no recipients on purpose. We can have no XfrOut running
   (because the server is not a primary). We can have a hidden master
   updated by DDNS, not running Auth. We can have neither; for example,
   a user might have added their own tool that automatically writes blog
   articles about the changes being made, and this server just gets
   copies of the updates sent to the main server.
 * What if the first of the Auths succeeds and the others fail?
What I think should be done is:
 * Make sure the sender (DDNS in this case) does not need to know who
   the recipients are. Then we can run without Auth without any problem,
   and a customer can also run a custom module interested in updates
   that the developers are not aware of.
 * Leave the responsibility for guaranteed delivery to all recipients to
   the msgq or the libraries. Modules should not babysit the
   communication system and check whether a message was delivered.
 * If something fails, we currently only log. Let the logging be done on
   the recipient's side.
 * Maybe introduce advanced response tracking and collate the responses
   into one answer of the form "all OK" or "these modules failed" (see
   the sketch below).
However, you worry about not all of the recipients receiving the
message. I think we can consider a module that is not connected to msgq
as not existing. But we should worry about the case where a receiving
module fails to do whatever is needed.
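A sketch of that collation idea, with a hypothetical reply shape (the
real wire format would be defined by the msgq protocol work; none of
these names come from it):
{{{
def collate_replies(replies):
    """Collapse per-recipient replies into one summary answer.

    `replies` is assumed to map a module name to an (rcode, message)
    pair, purely for illustration.
    """
    failed = {name: msg for name, (rcode, msg) in replies.items()
              if rcode != 0}
    if not failed:
        return {"result": "all OK"}
    return {"result": "failed", "modules": failed}

# Example: one Auth succeeded, another failed.
print(collate_replies({"Auth-1": (0, None),
                       "Auth-2": (1, "disk full")}))
}}}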
> - do_polling: doesn't it change the behavior? Previously if the
> recipient doesn't exist it should have resulted in a timeout and an
> exception. It's now reported as `RPCError` and ignored.
> Furthermore, there now seems to be a possibility that `modules` can
> be used without set (even if detailed conditions prevent that from
> happening, it's not obvious from the code, and it means the quality
> of this code is low - although it's not a fault of your change).
I changed it to mimic the original. But I don't understand what you mean
by the „without set“ part.
> - update_modules: likewise, the behavior seems to be slightly changed.
It is changed only slightly: if the module doesn't exist, it raises
StatsError instead of SessionTimeout. But I don't think either is
handled at the levels above. I can restore the original behaviour, but I
don't know whether it is important enough to complicate the code for.
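If preserving the original exception type ever turns out to matter, the
translation would be small. A sketch assuming the rpc_call interface
introduced by this ticket; the command name and the exact exception
locations are from memory, not verified:
{{{
import isc.config
import isc.cc.session

def poll_module(ccs, module_name):
    # Map the new-style "recipient missing" error back to the
    # SessionTimeout the callers used to see; whether that is worth the
    # extra code is exactly the question above.
    try:
        return ccs.rpc_call("getstats", module_name)
    except isc.config.RPCRecipientMissing:
        raise isc.cc.session.SessionTimeout(
            "no " + module_name + " to answer the stats poll")
}}}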
> - get_stats_data: raising `StatsHttpdDataError` in case of
> `RPCRecipientMissing` doesn't seem to be correct (at least it
> doesn't preserve the previous behavior).
Ah, I misread the original code and thought both cases raised
`StatsHttpdError`, while one of them is actually a different exception.
Fixed.
> - about the FIXME comment: as stated above, IMO it's another
> incident that xfr's are a student's hobby project, not a
> production-quality system, that it simply ignores the response or
> timeout. (But this specific case may not have to be an "RPC" anyway
> - all xfrin should know and handle if the message was surely
> delivered to zonemgr).
Do you propose fixing it in this branch? I don't think it is in scope;
the errors were ignored before, too.
--
Ticket URL: <http://bind10.isc.org/ticket/2676#comment:9>
BIND 10 Development <http://bind10.isc.org>