INN commit: trunk (4 files)

INN Commit rra at isc.org
Mon Aug 25 17:13:12 UTC 2014


    Date: Monday, August 25, 2014 @ 10:13:12
  Author: iulius
Revision: 9657

pullnews:  new -a flag (hashfeed ability)

Add a new feature to pullnews:  hashfeed to split feeds.  It uses MD5
and is Diablo-compatible.

Thanks to Geraint Edwards for the patch.
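
As an illustration of the hashfeed test described in this commit, here is a
minimal sketch in Perl. It mirrors the expression used in the pullnews.in
hunk below (MD5 of the Message-ID, a 32-bit big-endian word taken at a byte
offset, modulo plus one). The helper names hashfeed_bucket and
hashfeed_matches and the sample Message-ID are hypothetical and are not part
of pullnews.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical helper (not part of pullnews): return the hashfeed bucket,
# an integer from 1 to $modulus, for a given Message-ID.
sub hashfeed_bucket {
    my ($msgid, $modulus, $offset) = @_;
    $offset ||= 0;
    # md5() returns the 16-byte binary digest.  Offset 0 selects the last
    # four bytes, which are read as a big-endian unsigned 32-bit integer.
    my $word = unpack('N', substr(md5($msgid), 12 - $offset, 4));
    return $word % $modulus + 1;
}

# Hypothetical helper: true if the article falls in the range low..high.
sub hashfeed_matches {
    my ($msgid, $low, $high, $modulus, $offset) = @_;
    my $bucket = hashfeed_bucket($msgid, $modulus, $offset);
    return ($bucket >= $low && $bucket <= $high) ? 1 : 0;
}

# Sample Message-ID, made up for this example.
my $msgid = '<example@news.example.org>';
for my $spec ([1, 1, 2], [2, 2, 2]) {
    my ($low, $high, $mod) = @$spec;
    printf "%d-%d/%d: %s\n", $low, $high, $mod,
        hashfeed_matches($msgid, $low, $high, $mod, 0) ? 'feed' : 'skip';
}
```

Because every Message-ID lands in exactly one bucket, two pullnews runs
using 1/2 and 2/2 should each see a disjoint half of the articles.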

Modified:
  trunk/doc/pod/news.pod
  trunk/doc/pod/newsfeeds.pod
  trunk/doc/pod/pullnews.pod
  trunk/frontends/pullnews.in

-----------------------+
 doc/pod/news.pod      |    3 +-
 doc/pod/newsfeeds.pod |    3 +-
 doc/pod/pullnews.pod  |   40 +++++++++++++++++++++++++++++++++++-
 frontends/pullnews.in |   53 ++++++++++++++++++++++++++++++++++++++++++------
 4 files changed, 90 insertions(+), 9 deletions(-)

Modified: doc/pod/news.pod
===================================================================
--- doc/pod/news.pod	2014-08-24 13:25:28 UTC (rev 9656)
+++ doc/pod/news.pod	2014-08-25 17:13:12 UTC (rev 9657)
@@ -186,7 +186,8 @@
 =item *
 
 Several improvements have been contributed to B<pullnews> by Geraint
-Edwards:  the new B<-B> flag triggers header-only feeding, the B<-m>
+Edwards:  the new B<-a> flag adds the Diablo-compatible hashfeed
+ability, the new B<-B> flag triggers header-only feeding, the B<-m>
 flag now permits to remove headers matching (or not) a given regexp,
 and B<rnews> reporting is improved.
 

Modified: doc/pod/newsfeeds.pod
===================================================================
--- doc/pod/newsfeeds.pod	2014-08-24 13:25:28 UTC (rev 9656)
+++ doc/pod/newsfeeds.pod	2014-08-25 17:13:12 UTC (rev 9657)
@@ -440,7 +440,8 @@
 
 Therefore, it allows to a generate a second level of deterministic
 distribution.  Indeed, if a news server is fed C<Q1/2>, it can go on
-splitting thanks to C<Q1-3/9_4> for instance.
+splitting thanks to C<Q1-3/9_4> for instance.  Up to four levels of
+deterministic distribution can be used.
 
 The algorithm is compatible with the one used by S<Diablo 5.1> and up.
 If you want to use the legacy quickhashing method used by Diablo

Modified: doc/pod/pullnews.pod
===================================================================
--- doc/pod/pullnews.pod	2014-08-24 13:25:28 UTC (rev 9656)
+++ doc/pod/pullnews.pod	2014-08-25 17:13:12 UTC (rev 9657)
@@ -4,7 +4,8 @@
 
 =head1 SYNOPSIS
 
-B<pullnews> [B<-BhnOqRx>] [B<-b> I<fraction>] [B<-c> I<config>] [B<-C> I<width>]
+B<pullnews> [B<-BhnOqRx>] [B<-a> I<hashfeed>] [B<-b> I<fraction>]
+[B<-c> I<config>] [B<-C> I<width>]
 [B<-d> I<level>] [B<-f> I<fraction>] [B<-F> I<fakehop>] [B<-g> I<groups>]
 [B<-G> I<newsgroups>] [B<-H> I<headers>] [B<-k> I<checkpt>] [B<-l> I<logfile>]
 [B<-m> I<header_pats>] [B<-M> I<num>] [B<-N> I<timeout>] [B<-p> I<port>]
@@ -41,6 +42,43 @@
 
 =over 4
 
+=item B<-a> I<hashfeed>
+
+This option is a deterministic way to control the flow of articles and to
+split a feed.  The I<hashfeed> parameter must be in the form C<value/mod>
+or C<start-end/mod>.  The Message-ID of each article is hashed using MD5,
+which results in a 128-bit hash.  The lowest S<32 bits> are then taken
+by default as the hashfeed value (which is an integer).  If the hashfeed
+value modulo C<mod> plus one equals C<value> or is between C<start>
+and C<end>, B<pullnews> will feed the article.  All these numbers must
+be integers.
+
+For instance:
+
+    pullnews -a 1/2      Feeds about 50% of all articles.
+    pullnews -a 2/2      Feeds the other 50% of all articles.
+
+Another example:
+
+    pullnews -a 1-3/10   Feeds about 30% of all articles.
+    pullnews -a 4-5/10   Feeds about 20% of all articles.
+    pullnews -a 6-10/10  Feeds about 50% of all articles.
+
+You can use an extended syntax of the form C<value/mod:offset> or
+C<start-end/mod:offset> (using an underscore C<_> instead of a colon
+C<:> is also recognized).  As MD5 generates a 128-bit return value,
+it is possible to specify from which byte-offset the 32-bit integer
+used by hashfeed starts.  The default value for C<offset> is C<:0> and
+thirteen overlapping values from C<:0> to C<:12> can be used.  Only up to
+four totally independent values exist:  C<:0>, C<:4>, C<:8> and C<:12>.
+
+Therefore, it allows generating a second level of deterministic
+distribution.  Indeed, if B<pullnews> feeds C<1/2>, it can go on
+splitting thanks to C<1-3/9:4> for instance.  Up to four levels of
+deterministic distribution can be used.
+
+The algorithm is compatible with the one used by S<Diablo 5.1> and up.
+
 =item B<-b> I<fraction>
 
 Backtrack on server numbering reset.  Specify the proportion (C<0.0> to C<1.0>)

Modified: frontends/pullnews.in
===================================================================
--- frontends/pullnews.in	2014-08-24 13:25:28 UTC (rev 9656)
+++ frontends/pullnews.in	2014-08-25 17:13:12 UTC (rev 9657)
@@ -13,6 +13,7 @@
 #               INN project.  Major changes are:
 #
 #               January 2010:  Geraint A. Edwards added header-only feeding (-B);
+#               added ability to hashfeed (-a) - uses MD5 - Diablo-compatible;
 #               enabled -m to remove headers matching (or not) a given regexp;
 #               minor bug fix to rnews when -O; improved rnews reporting.
 #
@@ -121,13 +122,19 @@
 }
 
 $usage =~ s!.*/!!;
-$usage .= " [ -BhnOqRx -b fraction -c config -C width -d level
+$usage .= " [ -BhnOqRx -a hashfeed -b fraction -c config -C width -d level
         -f fraction -F fakehop -g groups -G newsgroups -H headers
         -k checkpt -l logfile -m header_pats -M num -N num
         -p port -P hop_limit -Q level -r file -s host[:port] -S num
         -t retries -T seconds -w num -z num -Z num ]
         [ upstream_host ... ]
 
+  -a hashfeed   only feed article if the MD5 hash of the Message-ID
+                matches hashfeed (where hashfeed is of the form value/mod,
+                value/mod:offset, start-end/mod, or start-end/mod:offset).
+                The algorithm used is compatible with the one used by Diablo;
+                see the pullnews man page for more details.
+
   -b fraction   backtrack on server numbering reset.  The proportion
                 (0.0 to 1.0) of a group's articles to pull when the
                 server's article number is less than our high for that
@@ -231,11 +238,11 @@
 ";
 
 
-use vars qw($opt_b $opt_B $opt_c $opt_C $opt_d $opt_f $opt_F $opt_g $opt_G
-            $opt_h $opt_H $opt_k $opt_l $opt_m $opt_M $opt_n
+use vars qw($opt_a $opt_b $opt_B $opt_c $opt_C $opt_d $opt_f $opt_F
+            $opt_g $opt_G $opt_h $opt_H $opt_k $opt_l $opt_m $opt_M $opt_n
             $opt_N $opt_O $opt_p $opt_P $opt_q $opt_Q $opt_r $opt_R $opt_s
             $opt_S $opt_t $opt_T $opt_w $opt_x $opt_z $opt_Z);
-getopts("b:Bc:C:d:f:F:g:G:hH:k:l:m:M:nN:Op:P:qQ:r:Rs:S:t:T:w:xz:Z:") || die $usage;
+getopts("a:b:Bc:C:d:f:F:g:G:hH:k:l:m:M:nN:Op:P:qQ:r:Rs:S:t:T:w:xz:Z:") || die $usage;
 
 die $usage if $opt_h;
 
@@ -246,6 +253,7 @@
 my $localServer         = $opt_s || $defaultHost;
 my $localPort           = $opt_p || $defaultPort;
 my $quiet               = $opt_q;
+my $hashfeed            = $opt_a || '';
 my $header_only         = $opt_B;
 my $watermark           = $opt_w;
 my $retries             = $opt_t || $defaultRetries;
@@ -288,6 +296,26 @@
 die "``-z'' value not an integer: $opt_z\n" if defined $opt_z and $opt_z !~ /^\d+$/;
 die "``-Z'' value not an integer: $opt_Z\n" if defined $opt_Z and $opt_Z !~ /^\d+$/;
 
+if ($hashfeed ne '') {
+    my $a_err = "``-a'' value not in format ``start[-end]/mod[:offset]'': $opt_a\n";
+    die $a_err if $opt_a !~ m!^(\d+)(?:-(\d+))?/(\d+)(?:[:_](\d+))?$!;
+    $hashfeed = {
+        'low'       => $1,
+        'high'      => $2 || $1,
+        'modulus'   => $3,
+        'offset'    => $4 || 0,
+    };
+    die $a_err if $hashfeed->{'low'} > $hashfeed->{'high'}
+                  or $hashfeed->{'modulus'} == 0
+                  or $hashfeed->{'offset'} > 12;
+    if ($hashfeed->{'low'} == 1 and $hashfeed->{'high'} == $hashfeed->{'modulus'}) {
+        $hashfeed = '';
+    } else {
+        require Digest::MD5;
+        Digest::MD5->import(qw/md5/);
+    }
+}
+
 $quiet = 1 if $quietness > 1;
 my %NNTP_Args = ();
 $NNTP_Args{'Timeout'} = $opt_N if defined $opt_N;
@@ -409,7 +437,7 @@
     print LOG "        ``+'' is an article the downstream server accepted\n";
     print LOG "        ``x'' is an article the upstream server couldn't ";
     print LOG "give out\n";
-    print LOG "        ``m'' is an article skipped due to headers (-m or -P)\n";
+    print LOG "        ``m'' is an article skipped due to headers (-a, -m or -P)\n";
     print LOG "\n";
     print LOG "Writing to rnews-format output: $rnews\n\n" if $rnews;
 }
@@ -743,7 +771,7 @@
             my $tx_len = 0;              # Transmitted article length (bytes) (for rnews, Bytes:).
             my @header_nums_to_go = ();
             my $match_all_hdrs = 1;      # Assume no headers to match.
-            my $skip_due_to_hdrs = 0;
+            my $skip_due_to_hdrs = 0;    # Set to 1 if triggered by -P, 2 if by -m, 3 if by -a.
             my %m_found_hdrs = ();
             my $curr_hdr = '';
 
@@ -894,9 +922,22 @@
                 }
             }
 
+            if (not $skip_due_to_hdrs and ref $hashfeed) {
+                my $hash_val = unpack('N', substr(md5($msgid), 12-$hashfeed->{'offset'}, 4)) % $hashfeed->{'modulus'} + 1;
+                $skip_due_to_hdrs = 3 if $hash_val < $hashfeed->{'low'} or $hash_val > $hashfeed->{'high'};
+            }
+
             $pulled->{$server}->{$group}++;
 
             if ($skip_due_to_hdrs) {
+                if ($debug >= 2) {
+                    print LOG "\tDEBUGGING $i\tskip_art: " .
+                              ($skip_due_to_hdrs == 1 ? 'hopsPath'
+                                  : ($skip_due_to_hdrs == 2 ? 'hdr'
+                                      : ($skip_due_to_hdrs == 3 ? 'hashfeed'
+                                          : 'unknown'))) .
+                              "\n";
+                }
                 print LOG "m" unless $quiet;
             } elsif ($rnews) {
                 printf RNEWS "#! rnews %d\n", $tx_len;

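The argument validation added to pullnews.in above can be sketched as a
standalone function. This is an illustrative variant, not the committed
code: parse_hashfeed is a hypothetical name, and it uses defined-ness tests
where the commit uses "||", so an explicit end value of 0 (as in "5-0/10")
is rejected here, whereas the committed code would fall back to the start
value.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "value/mod" or "start-end/mod", optionally
# followed by ":offset" or "_offset".  Returns a hash reference, or
# nothing when the specification is invalid.
sub parse_hashfeed {
    my ($spec) = @_;
    return unless $spec =~ m!^(\d+)(?:-(\d+))?/(\d+)(?:[:_](\d+))?$!;
    my %hf = (
        low     => $1,
        high    => defined $2 ? $2 : $1,
        modulus => $3,
        offset  => defined $4 ? $4 : 0,
    );
    # Same sanity checks as in the diff: ordered range, non-zero modulus,
    # and a byte offset that leaves four bytes inside the 16-byte digest.
    return if $hf{low} > $hf{high}
              or $hf{modulus} == 0
              or $hf{offset} > 12;
    return \%hf;
}

for my $spec ('1/2', '1-3/9_4', '6-10/10', '3-1/10', '1/0', '1/2:13') {
    my $hf = parse_hashfeed($spec);
    print "$spec: ", ($hf ? "ok (offset $hf->{offset})" : 'invalid'), "\n";
}
```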