How to optimize two-dimensional hash traversing in Perl?

I have a hash of hashes %signal_db. A typical element is: $signal_db{$cycle}{$key}. There are 10,000s of signals, and 10,000s of keys.
Is there any way to optimize (timewise) this piece of code:
foreach my $cycle (sort numerically keys %signal_db) {
    foreach my $key (sort keys %{$signal_db{$cycle}}) {
        print $signal_db{$cycle}{$key}.$key."\n";
    }
}
The elements have to be printed in the same order as in my code.

Two micro-optimizations: map the inner hash once instead of constantly dereferencing, and build a buffer instead of calling print repeatedly. It's also possible to get rid of the sorting by using alternative storage formats; I tested two variants. Results:
               Rate  original  try3  alternative  alternative2
original     46.1/s        --  -12%         -21%          -32%
try3         52.6/s       14%    --         -10%          -22%
alternative  58.6/s       27%   11%           --          -13%
alternative2 67.5/s       46%   28%          15%            --
Conclusion:
It's better to use a presorted storage format, but without dropping to C the win would probably stay within 100% (on my test dataset). The information provided about the data suggests that the keys of the outer hash are almost sequential numbers, so this cries out for an array; a rough sketch follows.
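An untested sketch of that idea (assuming the cycle numbers are small non-negative integers, so they can be used directly as array indices; the names are made up for illustration):
# Build once: @by_cycle is indexed by cycle number; each slot holds the
# inner lines preformatted and presorted by key, so printing needs no sort.
my @by_cycle;
while (my ($cycle, $inner) = each %signal_db) {
    $by_cycle[$cycle] = [ map { $inner->{$_} . $_ . "\n" } sort keys %$inner ];
}
# Printing is then a plain scan in cycle order:
for my $lines (grep { defined } @by_cycle) {
    print @$lines;
}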
Script:
#!/usr/bin/env perl
use strict; use warnings;
use Benchmark qw/timethese cmpthese/;

my %signal_db = map { $_ => {} } 1..1000;
%$_ = map { $_ => $_ } 'a'..'z' foreach values %signal_db;

my @signal_db = map { { cycle => $_ } } 1..1000;
$_->{'samples'} = { map { $_ => $_ } 'a'..'z' } foreach @signal_db;

my @signal_db1 = map { $_ => [] } 1..1000;
@$_ = map { $_ => $_ } 'a'..'z' foreach grep ref $_, @signal_db1;

use Sort::Key qw(nsort);
sub numerically { $a <=> $b }

my $result = cmpthese( -2, {
    'original' => sub {
        open my $out, '>', 'tmp.out';
        foreach my $cycle (sort numerically keys %signal_db) {
            foreach my $key (sort keys %{$signal_db{$cycle}}) {
                print $out $signal_db{$cycle}{$key}.$key."\n";
            }
        }
    },
    'try3' => sub {
        open my $out, '>', 'tmp.out';
        foreach my $cycle (map $signal_db{$_}, sort numerically keys %signal_db) {
            my $tmp = '';
            foreach my $key (sort keys %$cycle) {
                $tmp .= $cycle->{$key}.$key."\n";
            }
            print $out $tmp;
        }
    },
    'alternative' => sub {
        open my $out, '>', 'tmp.out';
        foreach my $cycle (map $_->{'samples'}, @signal_db) {
            my $tmp = '';
            foreach my $key (sort keys %$cycle) {
                $tmp .= $cycle->{$key}.$key."\n";
            }
            print $out $tmp;
        }
    },
    'alternative2' => sub {
        open my $out, '>', 'tmp.out';
        foreach my $cycle (grep ref $_, @signal_db1) {
            my $tmp = '';
            foreach (my $i = 0; $i < @$cycle; $i += 2) {
                $tmp .= $cycle->[$i+1].$cycle->[$i]."\n";
            }
            print $out $tmp;
        }
    },
} );

my %signal_db = map {$_ => {1 .. 1000}} 1 .. 1000;

sub numerically {$a <=> $b}

sub orig {
    my $x;
    foreach my $cycle (sort numerically keys %signal_db) {
        foreach my $key (sort keys %{$signal_db{$cycle}}) {
            $x += length $signal_db{$cycle}{$key}.$key."\n";
        }
    }
}

sub faster {
    my $x;
    our ($cycle, $key, %hash);  # move allocation out of the loop
    local *hash;                # and use package variables which are faster to alias into
    foreach $cycle (sort {$a <=> $b}   # the {$a <=> $b} literal is optimized
                    keys %signal_db) {
        *hash = $signal_db{$cycle};    # alias into %hash
        foreach $key (sort keys %hash) {
            $x += length $hash{$key}.$key."\n";  # simplify the lookup
        }
    }
}

use Benchmark 'cmpthese';
cmpthese -5 => {
    orig   => \&orig,
    faster => \&faster,
};
which gets:
         Rate   orig faster
orig   2.56/s     --   -15%
faster 3.03/s    18%     --
Not a huge gain, but it is something. There isn't much more you can optimize without changing your data structure to use presorted arrays (or writing the whole thing in XS).
Switching the foreach loops to use external package variables saves a little bit of time since perl does not have to create lexicals in the loop. Also package variables seem to be a bit faster to alias into. Reducing the inner lookup to a single level also helps.
I assume you are printing to STDOUT and then redirecting the output to a file? If so, using Perl to open the output file directly and then printing to that handle may allow for improvements in file IO performance. Another micro-optimization could be to experiment with different record sizes. For example, does it save any time to build an array in the inner loop, then join / print it at the bottom of the outer loop? But that is something that is fairly device dependent (and possibly pointless due to other IO caching layers), so I will leave that test up to you.
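A sketch of that record-size experiment, under the same assumptions as the code above ($out is assumed to be a handle opened to the output file; whether this actually helps is device dependent):
foreach my $cycle (sort { $a <=> $b } keys %signal_db) {
    my @lines;
    foreach my $key (sort keys %{ $signal_db{$cycle} }) {
        # collect one record per key instead of printing immediately
        push @lines, $signal_db{$cycle}{$key} . $key;
    }
    # one print per cycle; join keeps the output byte-for-byte identical
    print {$out} join("\n", @lines), "\n";
}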

I'd first experiment with the Sort::Key module, because sorting takes longer than simple looping and printing. Also, if the inner hashes' keys are (mostly) identical, then you should simply presort them, but I'll assume this isn't the case or else you'd be doing that already.
You should obviously try saving $signal_db{$cycle} into a lexical reference too. You might find that each is faster than keys plus retrieval as well, especially if used with Sort::Key. I'd check whether map runs faster than foreach too; probably the same, but who knows. You might find print runs faster if you pass it a list or call it multiple times.
I haven't tried this code, but throwing together all these ideas except each gives:
foreach my $cycle (nsort keys %signal_db) {
    my $r = $signal_db{$cycle};
    map { print($r->{$_}, $_, "\n"); } (nsort keys %$r);
}
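For completeness, a rough sketch of the each idea as well, collecting the inner pairs with each and sorting them with Sort::Key's nkeysort (this assumes the inner keys are numeric, as the nsort call above already does; untested for speed):
use Sort::Key qw(nsort nkeysort);
foreach my $cycle (nsort keys %signal_db) {
    my $r = $signal_db{$cycle};
    my @pairs;
    # each avoids a separate keys pass plus a per-key hash lookup
    while (my ($k, $v) = each %$r) {
        push @pairs, [ $k, $v ];
    }
    print $_->[1], $_->[0], "\n" for nkeysort { $_->[0] } @pairs;
}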
There is an article about sorting in Perl here; check out the Schwartzian Transform if you wish to see how one might use each.
If your code need not be security conscious, then you could conceivably disable Perl's protection against algorithmic complexity attacks by setting PERL_HASH_SEED or related variables and/or recompiling Perl with altered settings, so that perl's keys and values commands already return the keys or values in sorted order, saving you considerable time sorting them. But please watch this 28C3 talk before doing so. I don't know if this will even work; you'd need to read that part of Perl's source code. It may be easier to just implement your loop in C.

Related

Conditions in Perl loops and performance

There are many idioms in Perl that use operators/functions/subroutines/methods in loop conditions. Books advise using them!
But as I understand it, these conditions are recalculated on each iteration. Am I right?
Perl 5:
foreach my $key (keys %hash) { ... }
for my $value (values %hash) { ... }
Perl 6:
for 'words.txt'.IO.lines -> $line { ... }
while $index < $fruit.chars { ... }
Why don't programmers assign the condition to some variable before the loop and use this variable in the loop? It would increase speed. So the first example would look like this:
my @array = keys %hash;
foreach my $keys (@array) { ... }
The condition is only calculated initially (before the loop starts), so I do not think it would increase the speed to precalculate the array before the loop. Example:
for my $key (get_keys()) {
    say $key;
}
sub get_keys {
    say "Calculating keys..";
    return qw(a b c d);
}
Output:
Calculating keys..
a
b
c
d
foreach my $key (keys %hash) { ... }
for my $value (values %hash) { ... }
The for and the foreach are synonymous in Perl, so aside from the fact that your two example snippets are operating on different parts of the hash, they're the same thing.
Ok, so here's what happens internally: In each case all keys, or all values are calculated as a list, and then the looping construct iterates on that calculated list. There is an internal check, but that check is only to see if the loop has reached the offset of the last element in the list yet. That is a cheap operation in the underlying C code. To be clear, keys and values are not called on each iteration. The list of things iterated over is computed only once at the beginning of the loop.
Also, $key and $value are aliases to the actual key or the actual value, not copies. So there is no per-iteration copy made.
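The aliasing of values is easy to demonstrate (a throwaway example; modifying the loop variable modifies the hash itself):
my %hash = (a => 1, b => 2);
$_ *= 10 for values %hash;     # each $_ is an alias to the stored value
# %hash now contains (a => 10, b => 20)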
The nuance that is often missed is the fact that the iteration list is precomputed upon entering the loop. That is why it's considered a terrible idea to do this:
foreach my $line (<$file_handle>) {...}
...because the entire file must be read and held in memory at once before the first line can be processed. The fact that a list must be available internally first is typically an acceptable memory trade-off for things that are already held in memory to begin with. But for external sources such as a file there's no guarantee that available memory can hold the whole thing -- particularly if it's some endless stream. Consider this code:
open my $fh, '<', '/dev/urandom';
say ord while <$fh>;
It will never terminate, but will emit a constant stream of ordinal values. However, it does not grow in memory usage.
Now change the second line to this:
say ord for <$fh>;
This will appear to hang while it consumes all of the system's memory attempting to retrieve the entire contents of /dev/urandom (an endless stream). It must do this before it can start iterating, because that's how a range-based foreach loop works in Perl, and in some other languages.
So a range based foreach loop is inexpensive in its computational overhead, but in some cases potentially expensive in its memory footprint.
Speaking to your final example:
my @array = keys %hash;
foreach my $keys (@array) { ... }
It doesn't make an appreciable difference, and may actually be slower or consume more memory. When I compare the two approaches with a hash of 100000 elements the difference between the two is only 2%, or within the margin of error:
         Rate   copy direct
copy   35.9/s     --    -2%
direct 36.7/s     2%     --
Here's the code:
use Benchmark qw(cmpthese);

my %hash;
@hash{1..100000} = (1..100000);

sub copy {
    my @array = keys %hash;
    my $b = 0;
    $b += $_ foreach @array;
    return $b;
}

sub direct {
    my $b = 0;
    $b += $_ foreach keys %hash;
    return $b;
}

cmpthese(-5, {
    copy   => \&copy,
    direct => \&direct,
});

Printing specific number of key-value pairs in a hash

I have a hash which stores the count of each string from an array of strings taken from an input document; the code then sorts the entries and prints them.
%count = ();
foreach $string (@strings) {
    $count{$string}++;
}
foreach my $key (sort { $count{$b} <=> $count{$a} } keys %count) {
    print $key, ": ", $count{$key};
}
So I am wondering: is there a way to print only a certain number of key-value pairs from the hash instead of all of them, i.e. the top 5 based on the value?
edit: would a for loop solve this?
%count = ();
foreach $string (@strings) {
    $count{$string}++;
}
my $n = 0; # variable to keep count of processed keys
foreach my $key (sort { $count{$b} <=> $count{$a} } keys %count) {
    # count processed keys (++$n)
    # and terminate the loop after processing 5 keys
    last if ++$n > 5;
    print $key, ": ", $count{$key};
}
You can take a slice of the list returned by sort:
use strict;
use warnings;
use feature 'say';
....
my %count;
foreach my $string (@strings) {
    ++$count{$string};
}

say "$_: $count{$_}"
    for ( sort { $count{$b} <=> $count{$a} } keys %count )[0..4];
(This expects that the hash indeed has at least five keys; if that may not be the case, you'd get hit by warnings, so add a test, for instance $_ and say "..." for ...)
The code in the question is clearly not using strict; I recommend always using it.
The %count = () makes sense if the hash has been populated before and now needs to be emptied. If you are creating it, then just declare it (and without = (), which does nothing).
Note, thanks to Grinnz: the very recent List::Util 1.50 adds head (and tail) functions.
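A sketch using head for the top-five case above (requires List::Util 1.50 or newer; with fewer than five keys it simply returns what is there, sidestepping the warnings caveat mentioned earlier):
use List::Util 1.50 qw(head);

say "$_: $count{$_}"
    for head 5, sort { $count{$b} <=> $count{$a} } keys %count;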

Perl Hash of Hash Output

I'm reading a file. I want a hash that maps the first number of each line to a hash which maps all the remaining numbers on that line to 1.
I believe I'm adding to the hash correctly, because Dumper prints correctly.
However, print "$first $secondID\n" is not giving me any output.
while (<FILE>) {
    chomp $_;
    if (/(\d+)\t(.+)/) {
        $firstNum = $1;
        @seconds = split(/\,/, $2);
        foreach $following (@seconds) {
            $Pairs->{$firstNum}{$following} = 1;
        }
        foreach $first (sort {$a <=> $b} keys %Pairs) {
            print "$first\n";
            %second = {$Pairs{$first}};
            foreach $secondID (sort {$a <=> $b} keys %second) {
                print "$first $secondID\n";
            }
        }
        print Dumper($Pairs);
    }
    else {
        print "ERROR\n";
    }
}
Later on, given a pair of numbers, I would like to look up whether $Pairs{$num1}{$num2} is defined. Would I write
if (defined $Pairs{$num1}{$num2})
Or should I check the first key first, then check the second key:
if (defined $Pairs{$num1}) {
    $temp = $Pairs{$num1};
    if (defined $temp{$num2}) {
        print "true\n";
    }
}
You have a couple of errors. Firstly you seem to be unsure whether you are using %Pairs or $Pairs to store your hash, and secondly you have %second = {$Pairs{$first}}, which tries to assign a hash reference to the hash %second. Presumably you want my %second = %{ $Pairs{$first} }.
You should always use strict and use warnings at the start of all your Perl programs, and declare all variables at the point of first use using my. This will alert you to simple mistakes you could otherwise easily overlook, and would have shown up your use of both %Pairs and $Pairs in this program, as well as your attempt to assign a single value (a hash reference) to a hash.
Rather than copying the entire hash, you should save a reference to it in $second. Then you can dereference it in the following for loop.
Experienced Perl programmers would also thank you for using lower-case plus underscore for local (my) variables, and reserving capitals for package and class names.
This program works as you intended, and expects the file name as a command-line parameter:
use strict;
use warnings;

my %pairs;

while (<>) {
    unless ( /(\d+)\s+(.+)/ ) {
        print "ERROR\n";
        next;
    }
    my $first_num = $1;
    my @seconds = split /,/, $2;
    foreach my $following (@seconds) {
        $pairs{$first_num}{$following} = 1;
    }
    foreach my $first (sort { $a <=> $b } keys %pairs) {
        print "$first\n";
        my $second = $pairs{$first};
        foreach my $second_id (sort { $a <=> $b } keys %$second) {
            print "$first $second_id\n";
        }
    }
}
my %hash;
while ( <> ) {
    my @numbers = split /\D+/;
    my $key = shift @numbers;
    @{$hash{$key}}{ @numbers } = ( 1 ) x @numbers;
}

# test it this way...
if ( $hash{ $num1 }{ $num2 } ) {
}
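One caveat worth adding to that test (a general Perl gotcha, not specific to this answer): evaluating $hash{ $num1 }{ $num2 } autovivifies $hash{ $num1 } as an empty hash ref when it did not exist. If that matters, check the levels separately:
# avoids autovivifying $hash{$num1} on a failed lookup
if ( exists $hash{$num1} and exists $hash{$num1}{$num2} ) {
    print "true\n";
}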
Use:
%second = %{$Pairs->{$first}};

Is there a simple way to validate a hash of hash element comparison?

Is there a simple way to validate a hash of hash element comparison?
I need to compare a Perl hash-of-hash element $Table{$key1}{$key2}{K1}{Value} against all the other such elements in the hash.
The third key will be K1 to Kn, and I want to compare those elements while the other keys stay the same.
if ($Table{$key1}{$key2}{K1}{Value} eq $Table{$key1}{$key2}{K2}{Value}
    eq $Table{$key1}{$key2}{K3}{Value} )
{
    # do whatever
}
Something like this may work:
use List::MoreUtils 'all';

my @keys = map "K$_", 1..10;
print "All keys equal"
    if all { $Table{$key1}{$key2}{$keys[1]}{Value} eq $Table{$key1}{$key2}{$_}{Value} } @keys;
I would use Data::Dumper to help with a task like this, especially for a more general problem (where the third key is more arbitrary than 'K1'...'Kn'). Use Data::Dumper to stringify the data structures and then compare the strings.
use Data::Dumper;

# this line is needed to assure that hashes with the same keys output
# those keys in the same order.
$Data::Dumper::Sortkeys = 1;

my $string1 = Data::Dumper->Dump( [ $Table{$key1}{$key2}{K1} ] );
for (my $n = 2; exists $Table{$key1}{$key2}{"K$n"}; $n++) {
    my $string_n = Data::Dumper->Dump( [ $Table{$key1}{$key2}{"K$n"} ] );
    if ($string1 ne $string_n) {
        warn "key 'K$n' is different from 'K1'";
    }
}
This can be used for the more general case where $Table{$key1}{$key2}{k7}{value} itself contains a complex data structure. When a difference is detected, though, it doesn't give you much help figuring out where that difference is.
A fairly complex structure. You should be looking into using object oriented programming techniques. That would greatly simplify your programming and the handling of these complex structures.
First of all, let's simplify a bit. When you say:
$Table{$key1}{$key2}{k1}{value}
Do you really mean:
my $value = $Table{$key1}->{$key2}->{k1};
or
my $actual_value = $Table{$key1}->{$key2}->{k1}->{Value};
I'm going to assume the first one. If I'm wrong, let me know, and I'll update my answer.
Let's simplify:
my %hash = %{$Table{$key1}->{$key2}};
Now, we're just dealing with a hash. There are two techniques you can use:
1. Sort the keys of this hash by value; then if two keys have the same value, they will be next to each other in the sorted list, making it easy to detect duplicates. The advantage is that all the duplicate keys are printed together. The disadvantage is that the sort takes time and resources.
2. Reverse the hash, so it's keyed by value and the value of that key is the key. If a key already exists, we know the other key has a duplicate value. This is faster than the first technique because no sorting is involved. However, duplicates will be detected, but not printed together.
Here's the first technique:
my %hash = %{$Table{$key1}->{$key2}};
my $previous_value;
my $previous_key;
foreach my $key (sort {$hash{$a} cmp $hash{$b}} keys %hash) {
    if (defined $previous_key and $previous_value eq $hash{$key}) {
        print "\$hash{$key} is a duplicate of \$hash{$previous_key}\n";
    }
    $previous_value = $hash{$key};
    $previous_key = $key;
}
And the second:
my %hash = %{$Table{$key1}->{$key2}};
my %reverse_hash;
foreach my $key (keys %hash) {
    my $value = $hash{$key};
    if (exists $reverse_hash{$value}) {
        print "\$hash{$reverse_hash{$value}} has the same value as \$hash{$key}\n";
    }
    else {
        $reverse_hash{$value} = $key;
    }
}
An alternative approach to the problem is to make a utility function which checks whether some function returns the same value for every key:
sub AllSame (&\%) {
    my ($c, $h) = @_;
    my @k = keys %$h;
    my $ref;
    $ref = $c->() for $h->{shift @k};
    $ref ne $c->() and return for @$h{@k};
    return 1
}

print "OK\n" if AllSame {$_->{Value}} %{$Table{$key1}{$key2}};
But if you start thinking this way, you may find this more generic approach better (the recommended way):
sub AllSame (@) {
    my $ref = shift;
    $ref ne $_ and return for @_;
    return 1
}

print "OK\n" if AllSame map {$_->{Value}} values %{$Table{$key1}{$key2}};
If the mapping operation is expensive, you can make a lazy counterpart of the same:
sub AllSameMap (&@) {
    my $c = shift;
    my $ref;
    $ref = $c->() for shift;
    $ref ne $c->() and return for @_;
    return 1
}

print "OK\n" if AllSameMap {$_->{Value}} values %{$Table{$key1}{$key2}};
If you want only some subset of the keys, you can use hash slice syntax, e.g.:
print "OK\n" if AllSame map {$_->{Value}} @{$Table{$key1}{$key2}}{map "K$_", 1..10};

Is perl's each function worth using?

From perldoc -f each we read:
There is a single iterator for each hash, shared by all each, keys, and values function calls in the program; it can be reset by reading all the elements from the hash, or by evaluating keys HASH or values HASH.
The iterator is not reset when you leave the scope containing the each(), and this can lead to bugs:
my %h = map { $_, 1 } qw(1 2 3);
while (my $k = each %h) { print "1: $k\n"; last }
while (my $k = each %h) { print "2: $k\n" }
Output:
1: 1
2: 3
2: 2
What are the common workarounds for this behavior? And is it worth using each in general?
I think it is worth using as long as you are aware of this. It's ideal when you need both key and value in iteration:
while (my ($k, $v) = each %h) {
    say "$k = $v";
}
In your example you can reset the iterator by adding keys %h; like so:
my %h = map { $_ => 1 } qw/1 2 3/;
while (my $k = each %h) { print "1: $k\n"; last }
keys %h; # reset %h
while (my $k = each %h) { print "2: $k\n" }
From Perl 5.12 each will also allow iteration on an array.
I find each to be very handy for idioms like this:
my $hashref = some_really_complicated_method_that_builds_a_large_and_deep_structure();
while (my ($key, $value) = each %$hashref)
{
    # code that does stuff with both $key and $value
}
Contrast that code to this:
my $hashref = ...same call as above
foreach my $key (keys %$hashref)
{
    my $value = $hashref->{$key};
    # more code here...
}
In the first case, both $key and $value are immediately available to the body of the loop. In the second case, $value must be fetched first. Additionally, the list of keys of $hashref may be really huge, which takes up memory. This is occasionally an issue. each does not incur such overhead.
However, the drawbacks of each are not instantly apparent: if aborting from the loop early, the hash's iterator is not reset. Additionally (and I find this one more serious and even less visible): you cannot call keys(), values() or another each() from within this loop. To do so would reset the iterator, and you would lose your place in the while loop. The while loop would continue forever, which is definitely a serious bug.
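A tiny illustration of that last point (a contrived example with a guard so it actually terminates; without the guard it would loop forever):
my %h = (a => 1, b => 2, c => 3);
my $iterations = 0;
while (my ($k, $v) = each %h) {
    keys %h;                      # resets the shared iterator
    last if ++$iterations > 5;    # guard: without it this never ends
    print "$k => $v\n";           # prints the same first pair every time
}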
each is too dangerous to ever use, and many style guides prohibit its use completely. The danger is that if a cycle of each is aborted before the end of the hash, the next cycle will start there. This can cause very hard-to-reproduce bugs; the behavior of one part of the program will depend on a completely unrelated other part of the program. You might use each right, but what about every module ever written that might use your hash (or hashref; it's the same)?
keys and values are always safe, so just use those. keys makes it easier to traverse the hash in deterministic order, anyway, which is almost always more useful. (for my $key (sort keys %hash) { ... })
each is not only worth using, it's pretty much mandatory if you want to loop over all of a tied hash too big for memory.
A void-context keys() (or values, but consistency is nice) before beginning the loop is the only "workaround" necessary; is there some reason you are looking for some other workaround?
Use the keys() function to reset the iterator. See the FAQ for more info.
each has a built-in, hidden global variable that can hurt you. Unless you need this behavior, it's safer to just use keys.
Consider this example where we want to group our k/v pairs (yes, I know printf would do this better):
#!perl
use strict;
use warnings;
use Test::More 'no_plan';

{
    my %foo = map { ($_) x 2 } (1..15);
    is( one( \%foo ), one( \%foo ), 'Calling one twice works with 15 keys' );
    is( two( \%foo ), two( \%foo ), 'Calling two twice works with 15 keys' );
}
{
    my %foo = map { ($_) x 2 } (1..105);
    is( one( \%foo ), one( \%foo ), 'Calling one twice works with 105 keys' );
    is( two( \%foo ), two( \%foo ), 'Calling two twice works with 105 keys' );
}

sub one {
    my $foo = shift;
    my $r = '';
    for ( 1..9 ) {
        last unless my ($k, $v) = each %$foo;
        $r .= " $_: $k -> $v\n";
    }
    for ( 10..99 ) {
        last unless my ($k, $v) = each %$foo;
        $r .= " $_: $k -> $v\n";
    }
    return $r;
}

sub two {
    my $foo = shift;
    my $r = '';
    my @k = keys %$foo;
    for ( 1..9 ) {
        last unless @k;
        my $k = shift @k;
        $r .= " $_: $k -> $foo->{$k}\n";
    }
    for ( 10..99 ) {
        last unless @k;
        my $k = shift @k;
        $r .= " $_: $k -> $foo->{$k}\n";
    }
    return $r;
}
Debugging the error shown in the tests above in a real application would be horribly painful. (For better output use Test::Differences eq_or_diff instead of is.)
Of course one() can be fixed by using keys to clear the iterator at the start and end of the subroutine. If you remember. If all your coworkers remember. It's perfectly safe as long as no one forgets.
I don't know about you, but I'll just stick with using keys and values.
It's best if used as its name implies: each. It's probably the wrong thing to use if you mean "give me the first key-value pair," or "give me the first two pairs" or whatever. Just keep in mind that the idea is flexible enough that each time you call it, you get the next pair (or key in a scalar context).
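For instance (a throwaway example of the two calling contexts):
my %h = (a => 1, b => 2);
my ($k, $v) = each %h;   # list context: the next key/value pair
my $k2      = each %h;   # scalar context: just the next key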
each() can be more efficient if you are iterating through a tied hash, for example a database that contains millions of keys; that way you don't have to load all the keys in memory.
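For example, a sketch with a DB_File-tied hash (assuming Perl was built with Berkeley DB support; the file name is made up for illustration):
use DB_File;
use Fcntl;    # for O_RDONLY

tie my %db, 'DB_File', 'huge.db', O_RDONLY, 0644, $DB_HASH
    or die "Cannot tie huge.db: $!";

# each() pulls one key/value pair from disk at a time;
# keys %db would load every key into memory first
while (my ($key, $value) = each %db) {
    print "$key => $value\n";
}

untie %db;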