Location of SetupDaemon in nnrpd.c

Heiko Schlichting inn-workers at CIS.FU-Berlin.DE
Thu Jun 13 09:33:12 UTC 2002


> Anyone remember the detail of this change to nnrpd.c:

Sure.

> Revision 1.81 / (download) - annotate - [select for diffs], Mon May 8 22:24:54 2000 UTC (2 years, 1 month ago) by kondou 
> Branch: MAIN 
> Changes since 1.80: +2 -3 lines
> Diff to previous 1.80 (unified)
> - From: Heiko Schlichting <inn-bugs at fu-berlin.de>
> - If nnrpd is started in daemon mode (and only then) and two or more nnrpd
>   processes try to access articles in the same CNFS buffer simultaneously
>   there are conflicts which cause article loss for the reader. The problem
>   seems to be the opening of the CNFS buffer, which are done in SetupDaemon()
>   *before* the daemon forks.
> 
> Basically I'd really, really like to move SetupDaemon() back to where
> it was, as its causing a mammoth amount of I/O every time someone
> connects to one of my reader boxes (during some testing, a connect
> every 15s was causing 2Mb/s of traffic to the NFS mounted file store -
> all CNFS bitmap traffic).

Moving it back to its old place will break nnrpd. It can (and will)
happen that you request an article and get no answer, although a retry
succeeds. That is not acceptable in production.

If the master opens the buffers, there is only one set of file handles. The
forked children use a sequence of seek() and read() to get the article.
With one child this always succeeds. The problem occurs when there is more
than one child: child-1 does a seek(), but before child-1's read() happens,
child-2 does its own seek(). Then child-2 can read its article successfully
but child-1 cannot. In reality the situation is more complex because nnrpd
seeks and reads the CNFS article header and the article itself in two
separate steps. Therefore child-1 can get an article header when article
content is expected.
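
The shared file pointer can be seen in isolation with a small test program.
The following is only a sketch and not INN code: the function name and the
scratch file are made up for illustration, and only POSIX calls are used.
A descriptor is opened before fork(), the child seeks on it, and the parent
then asks for its own current offset.

```c
#define _XOPEN_SOURCE 700   /* for mkstemp() under strict -std=c99/c11 */
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not part of INN): open a scratch file, fork,
 * let the child seek on the inherited descriptor, and return the offset
 * the parent sees afterwards. Returns -1 on error.
 */
off_t offset_seen_by_parent_after_child_seek(off_t child_target)
{
    char path[] = "/tmp/shared-offset-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* the file disappears once fd is closed */

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: move the file pointer of the inherited descriptor. */
        _exit(lseek(fd, child_target, SEEK_SET) == child_target ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        close(fd);
        return -1;
    }

    /* The parent never moved the pointer itself; it only queries it. */
    off_t seen = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return seen;
}
```

Called with 4096, the function should return 4096: the child's seek is
visible in the parent because both descriptors refer to the same open file.
With many children, any seek() can land between another child's seek() and
read() in the same way.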

In practice, the seek and read are very close together and you might never
notice problems with only a few nnrpd children (although things can go wrong
there too). We noticed significant failures on our main production server
once there were many nnrpd children (today there are up to 2000). With that
many children, a huge number of article requests will fail.

Therefore: Never open the CNFS buffers in the daemon!
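
Why opening after the fork helps can be shown with another small sketch.
Again this is not INN code: the function name and the scratch file are
hypothetical. The parent holds a descriptor at a known offset, and the
child calls open() itself after fork() instead of using the inherited
descriptor. A separate open() gives the child its own file pointer.

```c
#define _XOPEN_SOURCE 700   /* for mkstemp() under strict -std=c99/c11 */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not part of INN): the parent positions its
 * descriptor at parent_pos, the child opens the same file on its own and
 * seeks to child_target. Returns the parent's offset afterwards, or -1
 * on error.
 */
off_t parent_offset_after_child_private_seek(off_t parent_pos,
                                             off_t child_target)
{
    char path[] = "/tmp/private-offset-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    if (lseek(fd, parent_pos, SEEK_SET) != parent_pos) {
        close(fd);
        unlink(path);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    if (pid == 0) {
        /* Child: open a private descriptor instead of using fd. */
        int own = open(path, O_RDONLY);
        if (own < 0)
            _exit(1);
        _exit(lseek(own, child_target, SEEK_SET) == child_target ? 0 : 1);
    }

    int status;
    off_t seen = -1;
    if (waitpid(pid, &status, 0) >= 0 &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        /* The child's seek happened on its own file pointer. */
        seen = lseek(fd, 0, SEEK_CUR);
    }
    close(fd);
    unlink(path);
    return seen;
}
```

Called with (100, 4096), the function should return 100: the child's seek
no longer disturbs anyone else. Calling SetupDaemon() in each child after
the fork has the same effect for the CNFS buffers, at the price of every
child opening the buffers itself.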

Below the signature you will find my report from 2000.

Heiko

Heiko Schlichting        | Freie Universität Berlin
heiko at FU-Berlin.DE       | Zentraleinrichtung für Datenverarbeitung (ZEDAT)
Telefon +49 30 838-54327 | Fabeckstraße 32
Telefax +49 30 838454327 | 14195 Berlin
--------------------------------------------------------------------------
| Date: Mon, 8 May 2000 18:24:27 +0200
| From: Heiko Schlichting <inn-bugs at fu-berlin.de>
| To: inn-bugs at isc.org
| Subject: shared file pointer problem with nnrpd in daemon mode
| Message-ID: <20000508182427.A157840 at CIS.FU-Berlin.DE>

Hi,

For several months we have been using an older INN 2.3 snapshot on our
reader server News.CIS.DFN.DE with CNFS and nnrpd in daemon mode. Before we
went live with our 30,000 registered users, I ran a lot of tests and made
some changes until I noticed no more problems.

Right after going into production, I noticed error messages in syslog:
"...could not match article size token..." produced by cnfs_retrieve(). The
number of messages seems to be directly related to the number of clients on
our server. With few (50 or less) clients no error messages appear at all,
which might be the reason why my tests never hit this condition. On our
production server with about 1,000 simultaneous clients we noticed more than
50,000 error messages per day.

One big problem in finding the cause was that the tokens in the error
messages always differ and the errors, which result in 'article not
available' responses to the user, are not reproducible at all. Requesting a
specific article usually succeeded (>99%) but sometimes did not. The latter
case appears more often with many active nnrpd processes. It never appears
when I start nnrpd in my debugging environment (SGI CaseVision).

After a huge amount of bug tracking, I noticed the following:
The sequence of seek and read in cnfs_retrieve()...

    if (CNFSseek(cycbuff->fd, offset, SEEK_SET) < 0) { [...]
    }
    if (read(cycbuff->fd, &cah, sizeof(cah)) != sizeof(cah)) { [...]
    }

...does not work properly in all cases. The read() just gets data from a
wrong position in the correct CNFS buffer. So I put a loop around the
seek+read and tried seeking to the same position more than once whenever
the mentioned error condition appeared. Against my expectations this had an
effect: the articles could be read on the second or third try.

As I'm sure that seek() and read() aren't broken on my operating system
(IRIX 6.5), I continued debugging:

If nnrpd is started in daemon mode (and only then) and two or more nnrpd
processes try to access articles in the same CNFS buffer simultaneously
there are conflicts which cause article loss for the reader. The problem
seems to be the opening of the CNFS buffers, which is done in SetupDaemon()
*before* the daemon forks.

Marc J. Rochkind, "Advanced Unix Programming", 1985:
|
| 5.4 fork SYSTEM CALL
| [...]
| - The child gets copies of the parent's open file descriptors. Each is
|   opened to the same file, and the file pointer has the same value. The
|   file pointer is shared. If the child changes it with lseek, then the
|   parent's next read or write will be at the new location. The file
|   descriptor itself, however, is distinct: If the child closes it, the
|   parent's copy is undisturbed.

Having shared file pointers for the CNFS buffers of all nnrpd processes
is of course a major problem and I'm surprised that I never noticed any
bug report by anyone else.

The patch below fixed all problems on our server and if someone can
confirm it, it should be applied before releasing INN 2.3. The patch
is against inn-BETA-20000507 and is very small compared to the debugging
effort which was necessary to create it.

Heiko

Heiko Schlichting        | Freie Universität Berlin
heiko at FU-Berlin.DE       | Zentraleinrichtung für Datenverarbeitung (ZEDAT)
Telefon +49 30 838-54327 | Fabeckstraße 32
Telefax +49 30 838-56721 | 14195 Berlin
---------------------------------------------------------------------------

--- nnrpd/nnrpd.c.org	Sun May  7 12:06:10 2000
+++ nnrpd/nnrpd.c	Mon May  8 16:55:52 2000
@@ -880,7 +880,6 @@
 
 	/* Set signal handle to care for dead children */
 	(void)xsignal(SIGCHLD, WaitChild);
-	SetupDaemon();
  
 	TITLEset("nnrpd: accepting connections");
  	
@@ -895,7 +894,6 @@
 	    for (i = 0; (pid = fork()) < 0; i++) {
 		if (i == MAX_FORKS) {
 		    syslog(L_FATAL, "cant fork %m -- giving up");
-		    OVclose();
 		    exit(1);
 		}
 		syslog(L_NOTICE, "cant fork %m -- waiting");
@@ -912,6 +910,7 @@
 	close(fd);
 	dup2(0, 1);
 	dup2(0, 2);
+	SetupDaemon();
 
 	/* if we are a daemon innd didn't make us nice, so be nice kids */
 	if (innconf->nicekids) {


