"the problem with threads"
mayer at gis.net
Mon Sep 11 22:39:27 UTC 2006
Shane Kerr wrote:
> Danny Mayer wrote:
>>> Paul Vixie wrote:
>>>>> - All concurrent programs are provably bad
>>>> that's not the way i read this article. but speaking from bind9's experience,
>>>> actually getting useful work out of N processors is very much more difficult
>>>> than "use threads".
>>> Agreed, but any method required to chop up an application into
>>> manageable pieces that can run in parallel (for some meaning of
>>> parallel) is not simple. Threads is just one attempt at doing that and
>>> up-to-now has been the one with the most support. There are always
>>> points in the application that one part has to wait for the results of
>>> another before it can proceed and that's where you begin to lose. But
>>> you cannot blame threads for that.
> I tend to agree. The *real* benefit of threaded programming is that it was
> *easy* compared to other ways to do concurrent processing. Using shared memory
> and semaphores was a pain in the ass, and using event driven methods beyond a
> certain complexity is tricky. And it was faster too, when CPU time mattered more
> than coder time (wh00t).
And that's still true today with applications like BIND, which need to
be able to return results as fast as possible.
> Maybe "the problem with threads" is the same as "the problem with 80x86 CPUs",
> it's ugly and everybody agrees there are problems, but it sure seems to work
> good enough in most cases. :)
I think that one of the biggest problems with threads is that they are
not easy to get right. On the other hand, whatever method you choose is
hard, so trying to do it "the easy way" means that you don't understand
the difficulties involved. As the person who implemented the
multithreading for BIND 9 on Windows, I am all too familiar with the
work it took to get it running properly. And when I was done, some bugs
still showed up later. But probably none of what was implemented would
have been easier, simpler or faster if some other method had been
chosen.
>>>>> - We don't really know what else to do
>>>> i think the VLIW people believe that they know what else we should do :-). i
>>>> do not myself know what else we should do, unless it's the apache fork() model.
>>> And how would that help?
> VLIW attacks parallelism at a level too low to be interesting for anyone other
> than compiler authors, or those crazy people who hand-craft assembly, I think.
> But, the fork() model at least isolates failure.
> One of the problems that we see in BIND is that a critical failure of one
> component takes down the entire application. This is certainly not unique to
> BIND... being both "safe" and multithreaded tends to make applications brittle.
> When your code finds itself in a state that it shouldn't be in, the usual course
> of action is for the program to end abnormally.
Whether it is implemented as a multiprocess or a multithreaded
application would make little difference if it weren't for the
"safeness" issue. That's what really breaks an application, not the
implementation strategy.
> One way to avoid this is to isolate problems in each thread as much as you
> can... but at that point, you're starting to look a lot like a multi-process
> application. So, with a more modern language than C or C++, you can raise an
> exception and stay with a threaded model (that still makes me kind of nervous).
> Or you can just use a multi-process (that is, fork()) model, in which case a
> failure of one component means that the specific operation it was doing fails,
> but the rest of the software proceeds on.
This can be done in the same way with threads, but what have we bought?
The article indicates that we should use a different kind of model to do
this, but I see none of them as much different from either the
multiprocess model or the multithreaded model.
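
For anyone who wants to see the fork() isolation Shane describes in
concrete terms, here is a minimal sketch. It assumes a POSIX system and
uses Python rather than C for brevity. The names run_isolated,
good_operation and crashing_operation are made up for illustration and
have nothing to do with BIND's code. Each operation runs in its own
child process, so an abort() in one of them fails only that operation.

```python
import os

def run_isolated(operation):
    """Run operation() in a forked child; return True if it exited cleanly."""
    pid = os.fork()
    if pid == 0:
        # Child process: do the work, then exit immediately without
        # running any of the parent's cleanup handlers.
        exit_code = 1
        try:
            operation()
            exit_code = 0
        finally:
            os._exit(exit_code)
    # Parent process: wait for the child and inspect how it ended.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0

def good_operation():
    # Hypothetical stand-in for an operation that succeeds.
    pass

def crashing_operation():
    # Hypothetical stand-in for a component that hits a bad state
    # and ends abnormally (the child dies from SIGABRT).
    os.abort()

# The crash in the middle does not stop the operations around it.
results = [run_isolated(op)
           for op in (good_operation, crashing_operation, good_operation)]
```

The cost is that any state the operations need to share has to be passed
between processes explicitly, which is where the locking problems come
back in.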
> Even if you *are* using a multi-process model, coding errors that affect your
> shared structures are going to cause you pain. But it's less pain than in a
> threaded model, and if you really want robustness you can do things like
> checkpointing so you can revert to known-good previous states. Or actually use
> full transactions for shared operations.
I'm not sure that it makes much difference, but I'm willing to test
alternatives. Checkpointing and transactions are really not much
different from locking, unlocking and waiting for locks. All lose in the
same kind of way.
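
To illustrate what I mean, here is a rough sketch of checkpoint-and-revert
on a shared structure, again in Python and with made-up names
(CheckpointedState, transact, snapshot). It is only one possible way to
do checkpointing, not a description of any existing code. Note that the
checkpoint still has to be taken and restored under a lock, so other
writers wait exactly as they would with plain locking.

```python
import copy
import threading

class CheckpointedState:
    """Shared data that reverts to a known-good copy if an update fails."""

    def __init__(self, data):
        self._data = data
        self._lock = threading.Lock()

    def transact(self, update):
        """Apply update(data) in place; restore the checkpoint if it raises."""
        with self._lock:  # other writers block here, as with plain locking
            checkpoint = copy.deepcopy(self._data)
            try:
                update(self._data)
                return True
            except Exception:
                # Revert to the known-good previous state.
                self._data = checkpoint
                return False

    def snapshot(self):
        """Return a private copy of the current data."""
        with self._lock:
            return copy.deepcopy(self._data)
```

A failed update leaves the data as it was, but every caller still
serializes on the same lock, and the deep copy adds work inside the
critical section.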