buffindexed: could not open overview

Tue Jan 21 08:51:23 UTC 2003

Sang-yong Suh <sysuh at kigam.re.kr> writes:

> On Sat, Dec 14, 2002 at 08:11:03PM +0900, Katsuhiro Kondou wrote:
> > Assuming you've never had system crash, there are two
> > possibilities of the occurrence.  One is unified buffer
> > cache of your OS may not work correctly.  Another is
> > there is a bug in buffindexed.  In the case, it's
> > necessary to investigate by defining OV_DEBUG which
> > examines occupancy of the BLOCK by another program.
> 
> Actually, I was playing with a few RedHat Linux machines.
> And I now suspect that it may be related with OS/version.
> The followings describe my Linux servers
> 
> 1. RedHat-7.1 4-CPU: Not rebuilt overview after a system crash.
>    The uptime is 102 days now.  Overview error occurs a few times a week.
> 
>    Now I'm using a tool which refreshes broken newsgroup by calling
>    buffindexed_groupdel() and buffindexed_groupadd(). This works
>    pretty good...
> 
> 2. RedHat-7.3 1-CPU: Uptime 33 days.  No single overview error occurred yet.
> 
> 3. RedHat-7.3 2-CPU: This is my main test box. Overview error occurs

I'd missed this message (and I spent a lot of time thinking about this
problem over Christmas), but you've just given me the data point I
needed to convince myself that I understand why buffindexed is broken
with the file-based locking code (and, FWIW, I suspect tradindexed is
too).

With apologies to anyone I'm teaching to suck eggs (and to anyone who
knows this stuff better than me and the mistakes I'll doubtless make
explaining the problem - my excuse is the last CPU I knew /really/
well was the 68030).

I'm almost certain that all of our concurrent file-based locking where
we're using mmap is broken on SMP (and it'll depend on your
architecture whether it's really broken or just theoretically
broken). On SMP, the locking primitives for concurrent code use memory
barrier operations to ensure that all CPUs have consistent caches. I
suspect the lock file stuff we're using doesn't generate the right
memory barriers, so we manage to exit critical regions with
inconsistent cache views of the mmap()ed regions on different CPUs,
which means we get all confused about the state of the bitmaps.
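
To make the pattern concrete, here is a minimal sketch in C of the kind
of critical region I mean: a bitmap in a MAP_SHARED mapping, updated
under a whole-file fcntl() lock, with explicit full memory fences
around the update. This is not the buffindexed code. The names
bitmap_set(), bitmap_test(), and BITMAP_BYTES are hypothetical, and
whether the explicit fences are actually needed (or whether the lock
calls already imply them on a given OS and architecture) is exactly the
part I'm only guessing at.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BITMAP_BYTES 4096   /* hypothetical bitmap size, for illustration */

/* Take (F_RDLCK/F_WRLCK) or release (F_UNLCK) a whole-file lock. */
static int lock_file(int fd, short type)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "to end of file" */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Open (creating if needed) and map the bitmap file; NULL on failure. */
static unsigned char *map_bitmap(const char *path, int *fd)
{
    unsigned char *map;

    *fd = open(path, O_RDWR | O_CREAT, 0600);
    if (*fd < 0)
        return NULL;
    if (ftruncate(*fd, BITMAP_BYTES) != 0) {
        close(*fd);
        return NULL;
    }
    map = mmap(NULL, BITMAP_BYTES, PROT_READ | PROT_WRITE, MAP_SHARED, *fd, 0);
    if (map == MAP_FAILED) {
        close(*fd);
        return NULL;
    }
    return map;
}

/* Set one bit under a write lock.  Returns 0 on success, -1 on error. */
int bitmap_set(const char *path, size_t bit)
{
    int fd, rc = -1;
    unsigned char *map;

    assert(path != NULL);
    if (bit >= (size_t) BITMAP_BYTES * 8)
        return -1;
    if ((map = map_bitmap(path, &fd)) == NULL)
        return -1;
    if (lock_file(fd, F_WRLCK) == 0) {
        /* Assumption: fence so we see other CPUs' earlier stores. */
        atomic_thread_fence(memory_order_seq_cst);
        map[bit / 8] |= (unsigned char) (1u << (bit % 8));
        /* Assumption: fence so our store is visible before we unlock. */
        atomic_thread_fence(memory_order_seq_cst);
        rc = lock_file(fd, F_UNLCK);
    }
    munmap(map, BITMAP_BYTES);
    close(fd);
    return rc;
}

/* Test one bit under a read lock.  Returns 1, 0, or -1 on error. */
int bitmap_test(const char *path, size_t bit)
{
    int fd, rc = -1;
    unsigned char *map;

    assert(path != NULL);
    if (bit >= (size_t) BITMAP_BYTES * 8)
        return -1;
    if ((map = map_bitmap(path, &fd)) == NULL)
        return -1;
    if (lock_file(fd, F_RDLCK) == 0) {
        atomic_thread_fence(memory_order_seq_cst);
        rc = (map[bit / 8] >> (bit % 8)) & 1;
        if (lock_file(fd, F_UNLCK) != 0)
            rc = -1;
    }
    munmap(map, BITMAP_BYTES);
    close(fd);
    return rc;
}
```

If the failure really is a missing barrier, a loop of bitmap_set() and
bitmap_test() calls from several processes pinned to different CPUs
would be one way to look for it, but treat that as a suggestion for an
experiment rather than a known reproducer.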

On UP (uniprocessor) we're OK, as we have a single view of the cache
(how HT on Intel fits in, I've no idea).

So... I think we should add some warnings somewhere about UP/SMP in
the documentation (FAQ?) and really deal with the problem for 2.5.

-- 
Alex Kiernan, Principal Engineer, Development, THUS plc