Uniform distribution of truncated MD5?

Can we say that a truncated md5 hash is still uniformly distributed?
To avoid misinterpretations: I'm aware the chance of collisions is much greater the moment you start hacking off parts of the MD5 result; my use case is actually interested in deliberate collisions. I'm also aware there are other hash methods better suited to use cases needing a shorter hash (including, in fact, my own), and I'm definitely looking into those.
But I'd also really like to know whether md5's uniform distribution also applies to chunks of it. (Consider it a burning curiosity.)
Since MediaWiki uses it (specifically, the left-most two hex digits of the result) to generate file paths for images (e.g. /4/42/The-image-name-here.png), and they're presumably also interested in an at least near-uniform distribution, I imagine the answer is 'yes', but I don't actually know.

Yes: not exhibiting any bias is a design requirement for a cryptographic hash. MD5 is broken from a cryptographic point of view; however, the distribution of its results was never in question.
If you still need to be convinced, it's not a huge undertaking to hash a bunch of inputs, truncate the output, and use ent (http://www.fourmilab.ch/random/) to analyze the result.
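For example, a minimal sketch of that experiment (assuming ent is installed and on your PATH; the filename and sample count are arbitrary):
use Digest::MD5 qw(md5);
# Hash the natural numbers and keep only the first byte of each 16-byte binary digest:
open my $out, '>:raw', 'truncated.bin' or die $!;
print {$out} substr(md5($_), 0, 1) for 1 .. 1_000_000;
close $out;
# ent should report an entropy very close to 8 bits per byte:
system('ent', 'truncated.bin');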

I wrote a little PHP program to answer this question. It's not very scientific, but it shows the distribution of the first and the last 8 bits of the hash values, using the natural numbers as hash input. After about 40,000,000 hashes the difference between the highest and the lowest counts drops to 1%, so I'd say the distribution is OK. I hope the code is more precise in explaining what was computed :-)
Btw, with a similar program I found that the last 8 bits seem to be distributed slightly better than the first.
<?php
// Setup count-array:
for ($y = 0; $y < 16; $y++) {
    for ($x = 0; $x < 16; $x++) {
        $count[dechex($x).dechex($y)] = 0;
    }
}
$text = 1; // The text we will hash.
$hashCount = 0;
$steps = 10000;
while (1) {
    // Calculate & count a bunch of hashes:
    for ($i = 0; $i < $steps; $i++) {
        $hash = md5($text);
        $count[substr($hash, 0, 2)]++;
        $count[substr($hash, -2)]++;
        $text++;
    }
    $hashCount += $steps;
    // Output result so far:
    system("clear");
    $min = PHP_INT_MAX; $max = 0;
    for ($y = 0; $y < 16; $y++) {
        for ($x = 0; $x < 16; $x++) {
            $n = $count[dechex($x).dechex($y)];
            if ($n < $min) $min = $n;
            if ($n > $max) $max = $n;
            print $n."\t";
        }
        print "\n";
    }
    print "Hashes: $hashCount, Min: $min, Max: $max, Delta: ".((($max - $min) * 100) / $max)."%\n";
}
?>
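Run it from a terminal: it loops forever, redrawing the counts every 10,000 hashes, so interrupt it with Ctrl-C once the delta has stabilized.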

Related

Suggestions for optimizing sieve of Eratosthenes in Perl

use warnings;
use strict;
my $in = <STDIN>;
my @array = (1 .. $in);
foreach my $j (2 .. sqrt($in)) {
    for (my $i = $j * 2; $i <= $in; $i += $j) {
        delete($array[$i - 1]);
    }
}
delete($array[0]);
open FILE, ">", "I:\\Perl_tests\\primes.dat";
foreach my $i (@array) {
    if ($i) {
        print FILE $i, "\n";
    }
}
I'm sure there is a better way to do the array of all numbers however I don't really know of a way to do it in perl. I'm pretty inexperienced in perl. Any recommendations for speeding it up are greatly appreciated. Thanks a ton!
The fastest solution is using a module (e.g. ntheory) if you just want primes and don't care how you get them. That will be massively faster and use much less memory.
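For example, a minimal sketch with ntheory (assuming the module is installed; the limit is a placeholder):
use ntheory qw(primes);
my $limit = 1_000_000;
my $primes = primes($limit);   # arrayref of all primes <= $limit
print "$_\n" for @$primes;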
I suggest looking at the RosettaCode Sieve task. There are a number of simple sieves shown, a fast string version, some weird examples, and a couple extensible sieves.
Even the basic one uses less memory than your example, and the vector and string versions shown there use significantly less memory.
Sieve of Atkin is rarely a good method unless you are Dan Bernstein. Even then, it's slower than fast SoE implementations.
That's a fairly inefficient way to look for primes, so unless you absolutely have to use the Sieve of Eratosthenes, the Sieve of Atkin is probably a faster algorithm for finding primes.
With regard to the memory usage, here's a perlified version of this python answer. Instead of striking out the numbers from a big up-front array of all the integers, it tracks which primes divide the upcoming non-prime numbers, and then masks out the next non-prime number after each iteration. This means you can generate as many primes as you have precision for without using all the RAM, or any more memory than you have to. More code and more indirection make this version slower than what you have already, though.
#!/usr/bin/perl
use warnings;
use strict;

sub get {
    my $this = shift;
    if ($this->{next} == 2) {
        $this->{next} = 3;
        return 2;
    }
    while (1) {
        my $next = $this->{next};
        $this->{next} += 2;
        if (not exists $this->{used}{$next}) {
            # $next is prime; the first composite it strikes out is its square
            $this->{used}{$next * $next} = [$next];
            return $next;
        } else {
            # $next is composite; push each recorded divisor forward
            # (by 2*$x, so the composites we mark stay odd)
            foreach my $x (@{$this->{used}{$next}}) {
                push(@{$this->{used}{$next + 2 * $x}}, $x);
            }
            delete $this->{used}{$next};
        }
    }
}

sub buildSieve {
    my $this = {
        "used" => {},
        "next" => 2,
    };
    bless $this;
    return $this;
}

my $sieve = buildSieve();
foreach (1..100) {
    print $sieve->get()."\n";
}
You should be able to combine the better algorithm with the more memory-efficient generative version above to come up with a good solution, though.
If you want to see how the algorithm works in detail, it's quite instructive to use Data::Dumper and print out $sieve in between each call to get().
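Something along these lines (a sketch; the loop bound is arbitrary):
use Data::Dumper;
my $sieve = buildSieve();
foreach (1..10) {
    print $sieve->get()."\n";
    # {used} maps each upcoming composite to the primes that divide it
    print Dumper($sieve);
}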

Perl: increment 2d array cell?

I have a set of numerical data for which it is important to me to know which pairs of numbers occurred together, and how many times. Each set of data contains 7 numbers between 1 and 20. There are several hundred sets of data.
Essentially, by parsing each set of my data, I want to create a 20 x 20 array that I can use to keep a count of when pairs of numbers occurred together.
I have done a lot of searching, but maybe I've used the wrong keywords. I've seen loads of examples of how to create a "2D array" (I know Perl doesn't actually do that, and that it's really an array of references) and how to print the values contained therein, but nothing really on how to work with one particular cell by number and alter it.
Below is my conceptual code. The commented lines don't work, but illustrate what I want to achieve. I'm reasonably new to coding Perl, and this just seems too advanced for me to understand the examples I've seen and translate them into something I can actually use.
my @datapairs;
while (<DATAFILE>)
{
    chomp;
    my @data = split(",", $_);
    for ($prcount = 0; $prcount <= 5; $prcount++)
    {
        for ($othcount = ($prcount + 1); $othcount <= 6; $othcount++)
        {
            @data[$prcount] = @data[$prcount] + 1;
            @data[$othcount] = @data[$othcount] + 1;
            @data[$prcount] = @data[$prcount] - 1;
            @data[$othcount] = @data[$othcount] - 1;
            print @data[$prcount]." ".@data[$othcount]."; ";
            #@datapairs[@data[$prcount]][@data[$othcount]]++;
            #@datapairs[@data[$othcount]][@data[$prcount]]++;
        }
    }
}
Any input or suggestions would be much appreciated.
To access a "cell" in a "2-D array" in Perl (as you already figured out, it's an array of arrayrefs) is simple:
my @datapairs;
# Add 1 for the pair with indexes $i and $j
$datapairs[$i]->[$j]++;
To print that value:
print "$datapairs[$i]->[$j]\n";
It's not clear what you mean by "occur together" - if you mean "in the same length-7 array", it's easy:
my @datapairs;
while (<DATAFILE>) {
    chomp;
    my @data = split(",", $_);
    for (my $prcount = 0; $prcount <= 5; $prcount++) {
        for (my $othcount = $prcount + 1; $othcount <= 6; $othcount++) {
            $datapairs[ $data[$prcount] ]->[ $data[$othcount] ]++;
        }
    }
}
# Print
for (my $i = 0; $i < 20; $i++) {
    for (my $j = 0; $j < 20; $j++) {
        print "$datapairs[$i]->[$j], ";
    }
    print "\n";
}
As a side note: personally, just for stylistic reasons, I strongly prefer to reference EVERYTHING, e.g. use an arrayref of arrayrefs instead of an array of arrays. E.g.
my $datapairs;
# Add 1 for the pair with indexes $i and $j
$datapairs->[$i]->[$j]++;
To print that value:
print "$datapairs->[$i]->[$j]\n";
The second (and third...) arrow dereference operator is optional in Perl, but I personally find it significantly more readable to enforce its usage; it spaces out the index expressions.
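For what it's worth, the inner arrow really is optional; these two lines increment the same counter:
$datapairs->[$i]->[$j]++;   # fully arrowed
$datapairs->[$i][$j]++;     # equivalent; only the first arrow is required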

Mysterious "uninitialized value in an array" in algorithm for Perl challenge

Currently learning Perl, and trying to solve a little challenge to find the sum of the even terms among the first 4,000,000 Fibonacci terms. I have created a Fibonacci array that seems to work, and have tried different methods to throw out the odd-valued terms, but I continually run into an error when trying to sum the resulting array, getting reports of:
Use of uninitialized value in addition (+) at prob2_3.plx line 23
Here is what I have:
#!/usr/bin/perl -w
# prob2_2.plx
use warnings;
use strict;
my @fib; my $i; my $t; my $n;
@fib = (1, 2);
for ($i = 2; $i < 4000000; $i++) {
    my $new = ( $fib[$i-1] + $fib[$i-2] );
    push @fib, $new;
}
for ($t = 3; $t < 4000000; $t++) {
    if (($fib[$t] % 2) != 0 ) {
        delete $fib[$t];
    }
}
my $total = 0;
for ($n = 1; $n < $#fib; $n++) {
    $total += $fib[($n+1)];
}
print $total;
The warning means you are adding undef to something. delete $fib[$t]; is a bad way of doing $fib[$t] = undef;, which you later add to $total.
You have at least one other error:
The first two Fibonacci numbers are 0 and 1, not 1 and 2.
You have a major problem:
The 4,000,000th Fib number is going to be extremely large, much too large to fit in a double.
For reference purposes,
the 10,000th has 2090 digits: 2079360823713349807211264898864283682508...
the 20,000th has 4180 digits: 1564344347109763849734765364072743458162...
Even the 10,000th is too large for a double, and the 20,000th is double the size of the 10,000th, so imagine how large the 4,000,000th will be!
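Incidentally, you don't need to compute the number to see this coming: the digit count grows linearly in n, since F(n) ~ phi**n / sqrt(5). A quick back-of-the-envelope sketch (log() is the natural log in Perl):
my $phi = (1 + sqrt 5) / 2;
my $digits = int( 4_000_000 * log($phi) / log(10) );
print "roughly $digits digits\n";   # roughly 835950 digits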
Stylistic issues:
my $i; for ($i=2; $i<4000000; $i++)
is much harder to read than
for my $i (2..$N-1)
with the following at the top to avoid having to repeat the number everywhere:
my $N = 4_000_000;
As if the fact that the 4 millionth Fibonacci number is more than 10^835950 isn't a big enough problem, this is not very good:
for ($t=3; $t<4000000; $t++) {
    if (($fib[$t] % 2) != 0 ) {
        delete $fib[$t];
    }
}
my $total = 0;
for ($n=1; $n<$#fib; $n++) {
    $total += $fib[($n+1)];
}
Why are you walking through the list twice here? Much better would be to combine the two loops into one. You want the sum of the even terms, so sum the even terms. Don't delete the odd terms (stylistically very bad) and then walk over the list again, relying on the fact that undef has a numerical value of 0 (but only with a warning).
And man, the formatting of that code is very, very ugly. Eventually you will write code that someone else needs to read or maintain. My motto: Imagine that the person who will maintain your code is a psychopath who knows where you live.
As ikegami points out, your uninitialized problem is assuming delete removes elements from an array, when in fact it just sets them to undef (unless they are at the end of the array).
Given the storage requirements of the larger Fibonacci numbers, you don't want them in an array at all; fortunately, there's no need to keep them around for this problem.
I would do it like this (takes many minutes to run):
use strict;
use warnings;
use Math::BigInt 'lib' => 'GMP';
my $fib_A = Math::BigInt->new(0);
my $fib_B = Math::BigInt->new(1);
my $sum   = Math::BigInt->new(0);
# get the next 3999998
for (1 .. (4000000 - 2)) {
    my $next = $fib_A + $fib_B;
    $sum += $next if $next % 2 == 0;
    ($fib_A, $fib_B) = ($fib_B, $next);
}
print "The sum is $sum\n";

How fast is Perl's smartmatch operator when searching for a scalar in an array?

I want to repeatedly search for values in an array that does not change.
So far, I have been doing it this way: I put the values in a hash (so I have an array and a hash with essentially the same contents) and I search the hash using exists.
I don't like having two different variables (the array and the hash) that both store the same thing; however, the hash is much faster for searching.
I found out that there is a ~~ (smartmatch) operator in Perl 5.10. How efficient is it when searching for a scalar in an array?
If you want to search for a single scalar in an array, you can use List::Util's first subroutine. It stops as soon as it knows the answer. I don't expect this to be faster than a hash lookup if you already have the hash, but when you consider creating the hash and having it in memory, it might be more convenient for you to just search the array you already have.
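A minimal sketch (the list and the target value are placeholders):
use List::Util qw(first);
my @values = qw(apple banana cherry);
my $wanted = 'banana';
# first() returns the first element for which the block is true, or undef:
my $found = first { $_ eq $wanted } @values;
print "found $found\n" if defined $found;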
As for the smarts of the smart-match operator, if you want to see how smart it is, test it. :)
There are at least three cases you want to examine. The worst case is that every element you want to find is at the end. The best case is that every element you want to find is at the beginning. The likely case is that the elements you want to find average out to being in the middle.
Now, before I start this benchmark, I expect that if the smart match can short-circuit (and it can; it's documented in perlsyn), the best-case times will stay the same despite the array size, while the other ones get increasingly worse. If it can't short-circuit and has to scan the entire array every time, there should be no difference in the times because every case involves the same amount of work.
Here's a benchmark:
#!perl
use 5.12.2;
use strict;
use warnings;
use Benchmark qw(cmpthese);
my @hits = qw(A B C);
my @base = qw(one two three four five six) x ( $ARGV[0] || 1 );
my @at_end       = ( @base, @hits );
my @at_beginning = ( @hits, @base );
my @in_middle = @base;
splice @in_middle, int( @in_middle / 2 ), 0, @hits;
my @random = @base;
foreach my $item ( @hits ) {
    my $index = int rand @random;
    splice @random, $index, 0, $item;
}
sub count {
    my( $hits, $candidates ) = @_;
    my $count;
    foreach ( @$hits ) { when( $candidates ) { $count++ } }
    $count;
}
cmpthese(-5, {
    hits_beginning => sub { my $count = count( \@hits, \@at_beginning ) },
    hits_end       => sub { my $count = count( \@hits, \@at_end ) },
    hits_middle    => sub { my $count = count( \@hits, \@in_middle ) },
    hits_random    => sub { my $count = count( \@hits, \@random ) },
    control        => sub { my $count = count( [], [] ) },
}
);
Here's how the various parts did. Note that this is a logarithmic plot on both axes, so the slopes of the plunging lines aren't as close as they look:
So, it looks like the smart match operator is a bit smart, but that doesn't really help you because you still might have to scan the entire array. You probably don't know ahead of time where you'll find your elements. I expect a hash will perform the same as the best case smart match, even if you have to give up some memory for it.
Okay, so the smart match being smart times two is great, but the real question is "Should I use it?". The alternative is a hash lookup, and it's been bugging me that I haven't considered that case.
As with any benchmark, I start off thinking about what the results might be before I actually test them. I expect that if I already have the hash, looking up a value is going to be lightning fast. That case isn't a problem. I'm more interested in the case where I don't have the hash yet. How quickly can I make the hash and lookup a key? I expect that to perform not so well, but is it still better than the worst case smart match?
Before you see the benchmark, though, remember that there's almost never enough information about which technique you should use just by looking at the numbers. The context of the problem selects the best technique, not the fastest, contextless micro-benchmark. Consider a couple of cases that would select different techniques:
You have one array you will search repeatedly
You always get a new array that you only need to search once
You get very large arrays but have limited memory
Now, keeping those in mind, I add to my previous program:
my %old_hash = map {$_,1} @in_middle;
cmpthese(-5, {
    ...,
    new_hash => sub {
        my %h = map {$_,1} @in_middle;
        my $count = 0;
        foreach ( @hits ) { $count++ if exists $h{$_} }
        $count;
    },
    old_hash => sub {
        my $count = 0;
        foreach ( @hits ) { $count++ if exists $old_hash{$_} }
        $count;
    },
    control_hash => sub {
        my $count = 0;
        foreach ( @hits ) { $count++ }
        $count;
    },
}
);
Here's the plot. The colors are a bit difficult to distinguish. The lowest line there is the case where you have to create the hash any time you want to search it. That's pretty poor. The highest two (green) lines are the control for the hash (no hash actually there) and the existing hash lookup. This is a log/log plot; those two cases are faster than even the smart match control (which just calls a subroutine).
There are a few other things to note. The lines for the "random" case are a bit different. That's understandable because each benchmark (so, once per array-scale run) randomly places the hit elements in the candidate array. Some runs put them a bit earlier and some a bit later, but since I only make the @random array once per run of the entire program, they move around a bit. That means that the bumps in the line aren't significant. If I tried all positions and averaged, I expect that "random" line to be the same as the "middle" line.
Now, looking at these results, I'd say that a smart-match is much faster in its worst case than the hash lookup is in its worst case. That makes sense. To create a hash, I have to visit every element of the array and also make the hash, which is a lot of copying. There's no copying with the smart match.
Here's a further case I won't examine though. When does the hash become better than the smart match? That is, when does the overhead of creating the hash spread out enough over repeated searches that the hash is the better choice?
Fast for small numbers of potential matches, but not faster than the hash. Hashes are really the right tool for testing set membership. Since hash access is O(1) while smartmatch on an array is still an O(n) linear scan (albeit short-circuiting, unlike grep), smartmatch gets relatively worse as the number of allowed matches grows.
Benchmark code (matching against 3 values):
#!perl
use 5.12.0;
use Benchmark qw(cmpthese);
my @hits = qw(one two three);
my @candidates = qw(one two three four five six); # 50% hit rate
my %hash;
@hash{@hits} = ();
sub count_hits_hash {
    my $count = 0;
    for (@_) {
        $count++ if exists $hash{$_};
    }
    $count;
}
sub count_hits_smartmatch {
    my $count = 0;
    for (@_) {
        $count++ when @hits;
    }
    $count;
}
say count_hits_hash(@candidates);
say count_hits_smartmatch(@candidates);
cmpthese(-5, {
    hash       => sub { count_hits_hash((@candidates) x 1000) },
    smartmatch => sub { count_hits_smartmatch((@candidates) x 1000) },
}
);
Benchmark results:
             Rate smartmatch  hash
smartmatch  404/s         --  -65%
hash       1144/s       183%    --
The "smart" in "smart match" isn't about the searching. It's about doing the right thing at the right time based on context.
The question of whether it's faster to loop through an array or index into a hash is something you'd have to benchmark, but in general, it'd have to be a pretty small array to be quicker to skim through than indexing into a hash.

How is this Perl code selecting two different elements from an array?

I have inherited some code from a guy whose favorite pastime was to shorten every line to its absolute minimum (and sometimes only to make it look cool). His code is hard to understand, but I managed to understand (and rewrite) most of it.
Now I have stumbled on a piece of code which, no matter how hard I try, I cannot understand.
my @heads = grep {s/\.txt$//} OSA::Fast::IO::Ls->ls($SysKey,'fo','osr/tiparlo',qr{^\d+\.txt$}) || ();
my @selected_heads = ();
for my $i (0..1) {
    $selected_heads[$i] = int rand scalar @heads;
    for my $j (0..@heads-1) {
        last if (!grep $j eq $_, @selected_heads[0..$i-1]);
        $selected_heads[$i] = ($selected_heads[$i] + 1) % @heads; #WTF?
    }
    my $head_nr = sprintf "%04d", $i;
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$heads[$selected_heads[$i]].txt","$recdir/heads/$head_nr.txt");
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$heads[$selected_heads[$i]].cache","$recdir/heads/$head_nr.cache");
}
From what I can understand, this is supposed to be some kind of randomizer, but I never saw a more complex way to achieve randomness. Or are my assumptions wrong? At least, that's what this code is supposed to do. Select 2 random files and copy them.
=== NOTES ===
The OSA Framework is a Framework of our own. They are named after their UNIX counterparts and do some basic testing so that the application does not need to bother with that.
This looks like some C code with Perl syntax. Sometimes knowing the language the person is thinking in helps you figure out what's going on. In this case, the person's brain is infected with the inner workings of memory management, pointer arithmetic, and other low level concerns, so he wants to minutely control everything:
my @selected_heads = ();
# a tricky way to make a two-element array
for my $i (0..1) {
    # choose a random file
    $selected_heads[$i] = int rand @heads;
    # for all the files (could use $#heads instead)
    for my $j (0..@heads-1) {
        # stop if the chosen file is not already in @selected_heads
        # it's that damned ! in front of the grep that's mind-warping
        last if (!grep $j eq $_, @selected_heads[0..$i-1]);
        # if we get this far, the file we chose was already selected,
        # so try the next one instead
        $selected_heads[$i] = ($selected_heads[$i] + 1) % @heads; #WTF?
    }
    ...
}
This is a lot of work because the original programmer either doesn't understand hashes or doesn't like them.
my %selected_heads;
until( keys %selected_heads == 2 )
{
    my $try = int rand @heads;
    redo if exists $selected_heads{$try};
    $selected_heads{$try}++;
}
my @selected_heads = keys %selected_heads;
If you still hate hashes and have Perl 5.10 or later, you can use smart-matching to check if a value is in an array:
my @selected_heads;
until( @selected_heads == 2 )
{
    my $try = int rand @heads;
    redo if $try ~~ @selected_heads;
    push @selected_heads, $try;
}
However, you have a special constraint on this problem. Since you know there are only two elements, you just have to check whether the element you want to add is the same as the prior element. In the first case, $try won't be undef, so the comparison with the (nonexistent) prior element fails and the first addition always works. In the second case, it just can't be the last element in the array:
my @selected_heads;
until( @selected_heads == 2 )
{
    my $try = int rand @heads;
    redo if $try eq $selected_heads[-1];
    push @selected_heads, $try;
}
Huh. I can't remember the last time I used until when it actually fit the problem. :)
Note that all of these solutions have the problem that they can cause an infinite loop if the number of original files is less than 2. I'd add a guard condition higher up so the zero- and single-file cases throw an error, and perhaps the two-file case doesn't bother to order them.
Another way you might do this is to shuffle (say, with List::Util) the entire list of original files and just take off the first two files:
use List::Util qw(shuffle);
my @input = 'a' .. 'z';
my @two = ( shuffle( @input ) )[0,1];
print "selected: @two\n";
It selects a random element from @heads.
Then it adds another random but different element from @heads (if it is the element previously selected, it scrolls through @heads till it finds an element not previously selected).
In summary, it selects N (in your case N=2) different random indexes into the @heads array and then copies the files corresponding to those indexes.
Personally I would write it a bit differently:
# ...
my %selected_previously = ();
foreach my $i (0..$N-1) { # Generalized for N random files instead of 2
    my $random_head_index = int rand scalar @heads;
    while ($selected_previously{$random_head_index}++) {
        $random_head_index = ($random_head_index + 1) % @heads; # Cache me!!!
    }
    # NOTE: "++" in the while() might be considered a bit of a hack
    # More readable version: $selected_previously{$random_head_index} = 1; here.
    # ... copy the files for $heads[$random_head_index] here ...
}
The part you labeled "WTF" isn't so troubling; it's simply making sure that $selected_heads[$i] remains a valid subscript of @heads. The really troubling part is that it is a pretty inefficient way of making sure he's not selecting the same file.
Then again, if the size of @heads is small, stepping from 0..$#heads is probably more efficient than repeatedly generating int rand values and testing if they are the same.
But basically it copies two files at random (why?) as a '.txt' file and a '.cache' file.
How about just
for my $i (0..1) {
    my $selected = splice( @heads, rand @heads, 1 );
    my $head_nr = sprintf "%04d", $i;
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$selected.txt","$recdir/heads/$head_nr.txt");
    OSA::Fast::IO::Cp->cp($SysKey,'',"osr/tiparlo/$selected.cache","$recdir/heads/$head_nr.cache");
}
unless @heads or @selected_heads are used later.
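Because splice removes the chosen element from @heads as it picks it, the second draw can never repeat the first, so no retry loop is needed.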
Here's yet another way to select 2 unique random indices:
my @selected_heads = ();
my @indices = 0..$#heads;
for my $i (0..1) {
    my $j = int rand (@heads - $i);
    push @selected_heads, $indices[$j];
    $indices[$j] = $indices[@heads - $i - 1];
}