Race condition in shlock.c (time window between ValidLock and unlink)

Tue Sep 17 19:56:53 UTC 2002

Katsuhiro Kondou <Katsuhiro_Kondou at isc.org> writes:
> Berend Reitsma <berend at asset-control.com> wrote:

> } As far as I can tell there is a time window between the ValidLock
> } returning FALSE and the unlink after that.
> } There is no guarantee that you are unlinking the same file you just
> } checked. This means that there is a possibility that two (or more) shlocks
> } will succeed when they should not ...
> } 
> } In fact I can reliably reproduce it with the following script while
> } forcing the system into swap.
> } 
> } If this is already a known fact, it would be nice to have this at least in
> } the documentation.

> I don't think shlock is used so heavily that the above case
> is likely to happen.  So, I think it should be noted in the man
> pages rather than fixed at the moment.  Any comments?

I looked at this some last night, and it would be pretty hard to fix.
Basically, the problem is that shlock has code to remove stale locks when
the process that created the lock no longer exists.  However, if two runs
of shlock start at about the same time in the presence of a stale lock, one
of them may remove the lock the other has just created, thinking that it's
deleting the stale lock, and both can end up succeeding.

The problem can only trigger in the presence of stale locks, though, so
far as I can tell.
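
To make the interleaving concrete, here is a rough sketch that replays it
deterministically inside a single process.  It is not taken from shlock.c:
try_create(), lock_is_valid() (a stand-in for ValidLock) and replay_race()
are hypothetical names, and creating the lock with O_CREAT | O_EXCL is an
assumption made only to keep the example short.  The stale lock is faked by
recording the PID of a child that has already exited.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: atomically create the lock file and record `pid`
 * in it.  Returns 1 on success, 0 if the file exists or on error. */
static int try_create(const char *path, long pid)
{
    char buf[32];
    int n, ok;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd < 0)
        return 0;
    n = snprintf(buf, sizeof buf, "%ld\n", pid);
    ok = write(fd, buf, (size_t) n) == n;
    close(fd);
    return ok;
}

/* Hypothetical stand-in for ValidLock: returns 1 if the lock file names a
 * process that still exists, 0 otherwise. */
static int lock_is_valid(const char *path)
{
    long pid = 0;
    int got;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return 0;
    got = fscanf(f, "%ld", &pid);
    fclose(f);
    if (got != 1 || pid <= 0)
        return 0;
    return kill((pid_t) pid, 0) == 0 || errno == EPERM;
}

/* Replay the bad ordering for two claimants, A and B.  Returns how many of
 * them believe they hold the lock, or -1 if the setup failed. */
int replay_race(const char *path)
{
    int a_saw_stale, b_saw_stale, holders = 0;
    pid_t dead = fork();

    if (dead == 0)
        _exit(0);                 /* child exits at once */
    if (dead < 0)
        return -1;
    waitpid(dead, NULL, 0);       /* reap it, so `dead` names no process */

    unlink(path);
    if (!try_create(path, (long) dead))
        return -1;                /* could not plant the stale lock */

    /* Step 1: A and B both fail to create and both judge the lock stale. */
    a_saw_stale = !try_create(path, (long) getpid()) && !lock_is_valid(path);
    b_saw_stale = !try_create(path, (long) getpid()) && !lock_is_valid(path);

    /* Step 2: A removes the stale lock and creates its own, valid one. */
    if (a_saw_stale) {
        unlink(path);
        holders += try_create(path, (long) getpid());
    }

    /* Step 3: B acts on its earlier check, so the file it unlinks is now
     * A's valid lock rather than the stale one, and B succeeds as well. */
    if (b_saw_stale) {
        unlink(path);
        holders += try_create(path, (long) getpid());
    }

    unlink(path);
    return holders;
}
```

With that ordering, replay_race() reports two holders of a lock that should
only ever have one.  If the lock file is not stale to begin with, neither
claimant reaches the unlink step, which matches the observation above.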

I can't think of a good way to fix this.  Locks based on the atomicity of
file creation are only safe if you remove stale locks under controlled
circumstances.  So yeah, I vote for just documenting the limitation.
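
As an illustration of what "controlled circumstances" could look like, here
is a minimal sketch of a lock routine that relies only on atomic file
creation and never removes a stale lock itself.  This is not how shlock
works and is not proposed as a patch: acquire_lock() and the lock_status
values are hypothetical names, and the sketch assumes that whoever sees
LOCK_STALE leaves the cleanup to a single operator or a single cleanup job.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum lock_status { LOCK_ACQUIRED, LOCK_HELD, LOCK_STALE, LOCK_ERROR };

/* Try to take the lock at `path` for the calling process.  A stale lock is
 * reported to the caller but deliberately left in place. */
enum lock_status acquire_lock(const char *path)
{
    char buf[32];
    long pid = 0;
    int n, ok, got;
    FILE *f;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd >= 0) {
        /* We created the file, so the lock is ours; record our PID. */
        n = snprintf(buf, sizeof buf, "%ld\n", (long) getpid());
        ok = write(fd, buf, (size_t) n) == n;
        close(fd);
        if (!ok) {
            unlink(path);         /* our own half-written file */
            return LOCK_ERROR;
        }
        return LOCK_ACQUIRED;
    }
    if (errno != EEXIST)
        return LOCK_ERROR;

    /* Someone else created the file.  Check whether its owner is alive. */
    f = fopen(path, "r");
    if (f == NULL)
        return LOCK_ERROR;
    got = fscanf(f, "%ld", &pid);
    fclose(f);
    if (got == 1 && pid > 0
        && (kill((pid_t) pid, 0) == 0 || errno == EPERM))
        return LOCK_HELD;

    /* No unlink here.  A file whose PID has not been written yet is also
     * reported as stale, which is harmless only because nothing is
     * removed on this path. */
    return LOCK_STALE;
}
```

The trade-off is that a stale lock blocks everyone until it is removed by
hand, which is less convenient than automatic cleanup but avoids the window
between the validity check and the unlink.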

-- 
Russ Allbery (rra at stanford.edu)             <http://www.eyrie.org/~eagle/>

    Please send questions to the list rather than mailing me directly.
     <http://www.eyrie.org/~eagle/faqs/questions.html> explains why.