Can I copy a hash without resetting its "each" iterator? - perl

I am using each to iterate through a Perl hash:
while (my ($key,$val) = each %hash) {
...
}
Then something interesting happens and I want to print out the hash. At first I consider something like:
while (my ($key,$val) = each %hash) {
if (something_interesting_happens()) {
foreach my $k (keys %hash) { print "$k => $hash{$k}\n" }
}
}
But that won't work, because everyone knows that calling keys (or values) on a hash resets the internal iterator used for each, and we may get an infinite loop. For example, these scripts will run forever:
perl -e '%a=(foo=>1); while(each %a){keys %a}'
perl -e '%a=(foo=>1); while(each %a){values %a}'
No problem, I thought. I could make a copy of the hash, and print out the copy.
if (something_interesting_happens()) {
%hash2 = %hash;
foreach my $k (keys %hash2) { print "$k => $hash2{$k}\n" }
}
But that doesn't work, either. This also resets the each iterator. In fact, any use of %hash in a list context seems to reset its each iterator. So these run forever, too:
perl -e '%a=(foo=>1); while(each %a){%b = %a}'
perl -e '%a=(foo=>1); while(each %a){@b = %a}'
perl -e '%a=(foo=>1); while(each %a){print %a}'
Is this documented anywhere? It makes sense that perl might need to use the same internal iterator to push a hash's contents onto a return stack, but I can also imagine hash implementations that didn't need to do that.
More importantly, is there any way to do what I want? To get to all the elements of a hash without resetting the each iterator?
This also suggests you can't debug a hash inside an each iteration, either. Consider running the debugger on:
%a = (foo => 123, bar => 456);
while ( ($k,$v) = each %a ) {
$DB::single = 1;
$o .= "$k,$v;";
}
print $o;
Just by inspecting the hash where the debugger stops (say, typing p %a or x %a), you will change the output of the program.
Update: I uploaded Hash::SafeKeys as a general solution to this problem. Thanks @gpojd for pointing me in the right direction and @cjm for a suggestion that made the solution much simpler.

Have you tried Storable's dclone to copy it? It would probably be something like this:
use Storable qw(dclone);
my %hash_copy = %{ dclone( \%hash ) };

How big is this hash? How long does it take to iterate through it, such that you care about the timing of the access?
Just set a flag and do the action after the end of the iteration:
my $print_it;
while (my ($key,$val) = each %hash) {
$print_it = 1 if something_interesting_happens();
...
}
if ($print_it) {
foreach my $k (keys %hash) { print "$k => $hash{$k}\n" }
}
Although there's no reason not to use each in the printout code, too, unless you were planning on sorting by key or something.

Let's not forget that keys %hash is already defined when you enter the while loop. One could have simply saved the keys into an array for later use:
my @keys = keys %hash;
while (my ($key,$val) = each %hash) {
if (something_interesting_happens()) {
print "$_ => $hash{$_}\n" for @keys;
}
}
Downside:
It's less elegant (subjective)
It won't work if %hash is modified (but then why would one use each in the first place?)
Upside:
It uses less memory by avoiding hash-copying

Not really. each is incredibly fragile. It stores iteration state on the iterated hash itself, state which is reused by other parts of perl when they need it. Far safer is to forget that it exists, and always iterate your own list from the result of keys %hash instead, because the iteration state over a list is stored lexically as part of the for loop itself, so is immune from corruption by other things.
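A bounded sketch of the difference described above, assuming a small sample hash (the `%hash` contents and the counter names are made up for illustration). The `each` loop would never end without the safety valve, while the `for` loop over `keys` is unaffected by further `keys` calls in its body:

```perl
use strict;
use warnings;

my %hash = (foo => 1, bar => 2, baz => 3);   # hypothetical sample data

# each: the iterator lives on the hash, so keys inside the loop rewinds it.
my $calls = 0;
while (my ($k, $v) = each %hash) {
    last if ++$calls >= 10;   # safety valve; without it this loop never ends
    keys %hash;               # resets the iterator used by each
}
keys %hash;                   # leave the iterator clean after the early exit

# for over keys: the list is built once, so keys inside the body is harmless.
my $visits = 0;
for my $k (keys %hash) {
    my @all = keys %hash;
    $visits++;
}

print "each loop ran $calls times, for loop ran $visits times\n";
# prints "each loop ran 10 times, for loop ran 3 times"
```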

Related

Conditions in Perl loops and performance

There are many idioms in Perl that use operators, functions, subroutines, or methods in loop conditions, and books advise using them.
But as I understand it, these conditions are evaluated on every iteration. Am I right?
Perl 5:
foreach my $key (keys %hash) { ... }
for my $value (values %hash) { ... }
Perl 6:
for 'words.txt'.IO.lines -> $line { ... }
while $index < $fruit.chars { ... }
Why don't programmers assign the condition to a variable before the loop and use that variable in the loop? It would increase speed. So the first example would look like this:
my @array = keys %hash;
foreach my $keys (@array) { ... }
The condition is only calculated once, before the loop starts, so I do not think precalculating the array before the loop would increase the speed. Example:
for my $key (get_keys()) {
say $key;
}
sub get_keys {
say "Calculating keys..";
return qw(a b c d);
}
Output:
Calculating keys..
a
b
c
d
foreach my $key (keys %hash) { ... }
for my $value (values %hash) { ... }
The for and the foreach are synonymous in Perl, so aside from the fact that your two example snippets are operating on different parts of the hash, they're the same thing.
Ok, so here's what happens internally: In each case all keys, or all values are calculated as a list, and then the looping construct iterates on that calculated list. There is an internal check, but that check is only to see if the loop has reached the offset of the last element in the list yet. That is a cheap operation in the underlying C code. To be clear, keys and values are not called on each iteration. The list of things iterated over is computed only once at the beginning of the loop.
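Also, the loop variable is an alias to the current element of that list, not a copy, so there is no per-iteration copy made. For values, the list elements are the hash's actual values; keys returns copies of the keys, so assigning to $key does not change the hash.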
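A small sketch of what aliasing means in practice, using a made-up `%price` hash. As perldoc -f values and perldoc -f keys document, the elements returned by `values` are the hash's own values, while `keys` returns copies:

```perl
use strict;
use warnings;

my %price = (apple => 2, pear => 3);   # hypothetical sample data

$_ *= 10   for values %price;   # aliases: this changes the stored values
$_ = uc $_ for keys %price;     # copies: the hash keys stay as they were

print join(', ', map { "$_=$price{$_}" } sort keys %price), "\n";
# prints "apple=20, pear=30"
```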
The nuance that is often missed is the fact that the iteration list is precomputed upon entering the loop. That is why it's considered a terrible idea to do this:
foreach my $line (<$file_handle>) {...}
...because the entire file must be read and held in memory at once before the first line can be processed. The fact that a list must be available internally first is typically an acceptable memory trade-off for things that are already held in memory to begin with. But for external sources such as a file there's no guarantee that available memory can hold the whole thing -- particularly if it's some endless stream. Consider this code:
open my $fh, '<', '/dev/urandom';
say ord while <$fh>;
It will never terminate, but will emit a constant stream of ordinal values. However, it does not grow in memory usage.
Now change the second line to this:
say ord for <$fh>;
This will appear to hang while it consumes all of the system's memory attempting to retrieve the entire contents of /dev/urandom (an endless stream). It must do this before it can start iterating, because that's how a range-based foreach loop works in Perl, and some other languages.
So a range based foreach loop is inexpensive in its computational overhead, but in some cases potentially expensive in its memory footprint.
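The same contrast can be seen safely with an in-memory filehandle instead of /dev/urandom. This is only a sketch; the five-line `$text` buffer and the variable names are invented for the demonstration:

```perl
use strict;
use warnings;

my $text = join '', map { "line $_\n" } 1 .. 5;   # hypothetical input

# while: reads one line per iteration, so leaving early leaves the rest unread.
open my $lazy, '<', \$text or die "Cannot open in-memory file: $!";
while (my $line = <$lazy>) { last }
my @left_after_while = <$lazy>;

# for: the whole list of lines is built before the first iteration runs.
open my $eager, '<', \$text or die "Cannot open in-memory file: $!";
for my $line (<$eager>) { last }
my @left_after_for = <$eager>;

printf "unread after while: %d, unread after for: %d\n",
    scalar @left_after_while, scalar @left_after_for;
# prints "unread after while: 4, unread after for: 0"
```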
Speaking to your final example:
my @array = keys %hash;
foreach my $keys (@array) { ... }
It doesn't make an appreciable difference, and may actually be slower or consume more memory. When I compare the two approaches with a hash of 100000 elements the difference between the two is only 2%, or within the margin of error:
         Rate   copy direct
copy   35.9/s     --    -2%
direct 36.7/s     2%     --
Here's the code:
use Benchmark qw(cmpthese);
my %hash;
@hash{1..100000} = (1..100000);
sub copy {
my @array = keys %hash;
my $b = 0;
$b += $_ foreach @array;
return $b;
}
sub direct {
my $b = 0;
$b += $_ foreach keys %hash;
return $b;
}
cmpthese(-5, {
copy => \&copy,
direct => \&direct,
});

Memory/performance tradeoff when determining the size of a Perl hash

I was browsing through some Perl code in a popular repository on GitHub and ran across this method to calculate the size of a hash:
while ( my ($a, undef ) = each %h ) { $num++; }
I thought why would one go through the trouble of writing all that code when it could more simply be written as
$num = scalar keys %h;
So, I compared both methods with Benchmark.
use Benchmark qw(cmpthese);
my %h = (1 .. 1000);
cmpthese(-10, {
keys => sub {
my $num = 0;
$num = scalar keys %h;
},
whileloop => sub {
my $num = 0;
while ( my ($a, undef ) = each %h ) {
$num++;
}
},
});
RESULTS
               Rate whileloop      keys
whileloop    5090/s        --     -100%
keys      7234884/s   142047%        --
The results show that using keys is MUCH faster than the while loop. My question is this: why would the original coder use such a slow method? Is there something that I'm missing? Also, is there a faster way?
I cannot read the mind of whoever wrote that piece of code, but he/she likely thought:
my $n = keys %hash;
used more memory than iterating through everything using each.
Note that the scalar on the left hand side of the assignment creates scalar context: There is no need for scalar unless you want to create a scalar context in what would otherwise have been list context.
Because he didn't know about keys's ability to return the number of elements in the hash.
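A short sketch of the two contexts mentioned in these answers, reusing the 500-pair hash from the benchmark. Assigning to a scalar already gives `keys` scalar context; an explicit `scalar` is only needed where the surrounding context would otherwise be a list, such as the argument list of `print`:

```perl
use strict;
use warnings;

my %h = (1 .. 1000);     # 500 key/value pairs

my $n = keys %h;         # scalar assignment: keys returns the count
print "count: $n\n";     # prints "count: 500"

# print supplies list context, so scalar() is needed to get the count here.
print "count: ", scalar(keys %h), "\n";   # prints "count: 500"
```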

How to get the top keys from a hash by value

I have a hash that I sorted by values greatest to least. How would I go about getting the top 5? There was a post on here that talked about getting only one value.
What is the easiest way to get a key with the highest value from a hash in Perl?
As I understand it, I would take the highest value, add it to an array, delete that element from the hash, and then repeat the process?
Seems like there should be an easier way to do this than that, though.
My hash is called %words.
Edited Took out code as the question answered without really needing it.
Your question is how to get the five highest values from your hash. You have this code:
my @keys = sort {
$words{$b} <=> $words{$a}
or
"\L$a" cmp "\L$b"
} keys %words;
Where you have your sorted hash keys. Take the five top keys from there?
my @highest = splice @keys, 0, 5; # also deletes the keys from the array
my @highest = @keys[0..4]; # non-destructive solution
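One caveat with the slice: if the hash holds fewer than five keys, `@keys[0..4]` pads the result with undef. A minimal sketch that clamps the upper index, assuming a made-up `%words` and a hypothetical `$n` for the number of keys wanted:

```perl
use strict;
use warnings;

my %words = (the => 9, cat => 4, sat => 4, mat => 1);   # hypothetical sample data
my $n     = 5;                                          # how many top keys we want

# Highest count first; ties are broken case-insensitively by key.
my @keys = sort {
    $words{$b} <=> $words{$a}
        or
    "\L$a" cmp "\L$b"
} keys %words;

# Clamp the slice so a short list does not yield undef elements.
my $last    = $#keys < $n - 1 ? $#keys : $n - 1;
my @highest = @keys[0 .. $last];

print "@highest\n";   # prints "the cat sat mat"
```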
Also some comments on your code:
open( my $filehandle0, '<', $file0 ) || die "Could not open $file0\n";
It is a good idea to include the error message $! in your die statement to get valuable information for why the open failed.
for (@words) {
s/[\,|\.|\!|\?|\:|\;|\"]//g;
}
Like I said in the comment, you do not need to escape characters or use alternations in a character class bracket. Use either:
s/[,.!?:;"]//g for @words; #or
tr/,.!?:;"//d for @words;
This next part is a bit odd.
my @stopwords;
while ( my $line = <$filehandle1> ) {
chomp $line;
my @linearray = split( " ", $line );
push( @stopwords, @linearray );
}
for my $w ( my @stopwords ) {
s/\b\Q$w\E\B//ig;
}
You read in the stopwords from a file... and then you delete the stopwords from $_? Are you even using $_ at this point? Moreover, you are redeclaring the @stopwords array in the loop header, which will effectively mean your new array will be empty, and your loop will never run. This error is silent, it seems, so you might never notice.
my %words = %words_count;
Here you make a copy of %words_count, which seems to be redundant, since you never use it again. If you have a big hash, this can decrease performance.
my $key_count = 0;
$key_count = keys %words;
This can be done in one line: my $key_count = keys %words. More readable, in my opinion.
$value_count = $words{$key} + $value_count;
Can also be abbreviated with the += operator: $value_count += $words{$key}
It is very good that you use strict and warnings.
If performance isn't a big deal
(sort { $words{$b} <=> $words{$a} } keys %words)[0..4]
If you absolutely need killer speed, a selection sort which terminates after 5 iterations is probably the best thing for you.
my @results;
for (0..4) {
my $maxkey;
my $max = 0;
for my $key (keys %words){
if ($max < $words{$key}){
$maxkey = $key;
$max = $words{$key};
}
}
push @results, $maxkey;
delete $words{$maxkey};
}
say join(","=>@results);
There's a CPAN module for that, Sort::Key::Top.
It has a straight-forward interface and an efficient XS implementation:
use Sort::Key::Top qw(rnkeytop);
my @results = rnkeytop { $words{$_} } 5 => keys %words;

Is perl's each function worth using?

From perldoc -f each we read:
There is a single iterator for each hash, shared by all each, keys, and values function calls in the program; it can be reset by reading all the elements from the hash, or by evaluating keys HASH or values HASH.
The iterator is not reset when you leave the scope containing the each(), and this can lead to bugs:
my %h = map { $_, 1 } qw(1 2 3);
while (my $k = each %h) { print "1: $k\n"; last }
while (my $k = each %h) { print "2: $k\n" }
Output:
1: 1
2: 3
2: 2
What are the common workarounds for this behavior? And is it worth using each in general?
I think it is worth using as long as you are aware of this. It's ideal when you need both key and value in iteration:
while (my ($k,$v) = each %h) {
say "$k = $v";
}
In your example you can reset the iterator by adding keys %h; like so:
my %h = map { $_ => 1 } qw/1 2 3/;
while (my $k = each %h) { print "1: $k\n"; last }
keys %h; # reset %h
while (my $k = each %h) { print "2: $k\n" }
From Perl 5.12 each will also allow iteration on an array.
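A brief sketch of `each` on an array, which yields index/value pairs. It assumes Perl 5.12 or later; the `@fruit` array and `$out` buffer are made up for the example:

```perl
use strict;
use warnings;
use 5.012;   # each on arrays needs Perl 5.12 or later

my @fruit = qw(apple pear plum);   # hypothetical sample data
my $out   = '';

# In list context, each returns (index, value) for the next array element.
while (my ($i, $name) = each @fruit) {
    $out .= "$i: $name\n";
}

print $out;
# prints:
# 0: apple
# 1: pear
# 2: plum
```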
I find each to be very handy for idioms like this:
my $hashref = some_really_complicated_method_that_builds_a_large_and_deep_structure();
while (my ($key, $value) = each %$hashref)
{
# code that does stuff with both $key and $value
}
Contrast that code to this:
my $hashref = ...same call as above
foreach my $key (keys %$hashref)
{
my $value = $hashref->{$key};
# more code here...
}
In the first case, both $key and $value are immediately available to the body of the loop. In the second case, $value must be fetched first. Additionally, the list of keys of $hashref may be really huge, which takes up memory. This is occasionally an issue. each does not incur such overhead.
However, the drawbacks of each are not instantly apparent: if aborting from the loop early, the hash's iterator is not reset. Additionally (and I find this one more serious and even less visible): you cannot call keys(), values() or another each() from within this loop. To do so would reset the iterator, and you would lose your place in the while loop. The while loop would continue forever, which is definitely a serious bug.
each is too dangerous to ever use, and many style guides prohibit its use completely. The danger is that if a cycle of each is aborted before the end of the hash, the next cycle will start there. This can cause very hard-to-reproduce bugs; the behavior of one part of the program will depend on a completely unrelated other part of the program. You might use each right, but what about every module ever written that might use your hash (or hashref; it's the same)?
keys and values are always safe, so just use those. keys makes it easier to traverse the hash in deterministic order, anyway, which is almost always more useful. (for my $key (sort keys %hash) { ... })
each is not only worth using, it's pretty much mandatory if you want to loop over all of a tied hash too big for memory.
A void-context keys() (or values, but consistency is nice) before beginning the loop is the only "workaround" necessary; is there some reason you are looking for some other workaround?
use the keys() function to reset the iterator. See the faq for more info
each has a built-in, hidden global variable that can hurt you. Unless you need this behavior, it's safer to just use keys.
Consider this example where we want to group our k/v pairs (yes, I know printf would do this better):
#!perl
use strict;
use warnings;
use Test::More 'no_plan';
{ my %foo = map { ($_) x 2 } (1..15);
is( one( \%foo ), one( \%foo ), 'Calling one twice works with 15 keys' );
is( two( \%foo ), two( \%foo ), 'Calling two twice works with 15 keys' );
}
{ my %foo = map { ($_) x 2 } (1..105);
is( one( \%foo ), one( \%foo ), 'Calling one twice works with 105 keys' );
is( two( \%foo ), two( \%foo ), 'Calling two twice works with 105 keys' );
}
sub one {
my $foo = shift;
my $r = '';
for( 1..9 ) {
last unless my ($k, $v) = each %$foo;
$r .= " $_: $k -> $v\n";
}
for( 10..99 ) {
last unless my ($k, $v) = each %$foo;
$r .= " $_: $k -> $v\n";
}
return $r;
}
sub two {
my $foo = shift;
my $r = '';
my @k = keys %$foo;
for( 1..9 ) {
last unless @k;
my $k = shift @k;
$r .= " $_: $k -> $foo->{$k}\n";
}
for( 10..99 ) {
last unless @k;
my $k = shift @k;
$r .= " $_: $k -> $foo->{$k}\n";
}
return $r;
}
Debugging the error shown in the tests above in a real application would be horribly painful. (For better output use Test::Differences eq_or_diff instead of is.)
Of course one() can be fixed by using keys to clear the iterator at the start and end of the subroutine. If you remember. If all your coworkers remember. It's perfectly safe as long as no one forgets.
I don't know about you, but I'll just stick with using keys and values.
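For readers who do keep `each`, the fix mentioned above for one() could look roughly like this. It is a sketch only; `first_pairs` is a hypothetical helper, not part of the test script, and it resets the iterator on entry and again before returning:

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to $n "key -> value" lines using each.
sub first_pairs {
    my ($hash, $n) = @_;
    keys %$hash;                            # discard iterator state left by earlier code
    my @lines;
    for (1 .. $n) {
        my ($k, $v) = each %$hash or last;  # stop when the hash is exhausted
        push @lines, "$k -> $v";
    }
    keys %$hash;                            # leave the iterator clean for the next caller
    return join "\n", @lines;
}

my %foo = map { ($_) x 2 } (1 .. 20);
print first_pairs(\%foo, 5) eq first_pairs(\%foo, 5)
    ? "same result on both calls\n"
    : "results differ\n";
# prints "same result on both calls"
```

It still depends on every caller remembering the reset, which is the point the answer makes.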
It's best if used as its name suggests: each. It's probably the wrong thing to use if you mean "give me the first key-value pair," or "give me the first two pairs" or whatever. Just keep in mind that the idea is flexible enough that each time you call it, you get the next pair (or key in a scalar context).
each() can be more efficient if you are iterating through a tied hash, for example a database that contains millions of keys; that way you don't have to load all the keys in memory.

Why is Perl foreach variable assignment modifying the values in the array?

OK, I have the following code:
use strict;
my @ar = (1, 2, 3);
foreach my $a (@ar)
{
$a = $a + 1;
}
print join ", ", @ar;
and the output?
2, 3, 4
What the heck? Why does it do that? Will this always happen? Is $a not really a local variable? What were they thinking?
Perl has lots of these almost-odd syntax things which greatly simplify common tasks (like iterating over a list and changing the contents in some way), but can trip you up if you're not aware of them.
$a is aliased to the value in the array - this allows you to modify the array inside the loop. If you don't want to do that, don't modify $a.
See perldoc perlsyn:
If any element of LIST is an lvalue, you can modify it by modifying VAR inside the loop. Conversely, if any element of LIST is NOT an lvalue, any attempt to modify that element will fail. In other words, the foreach loop index variable is an implicit alias for each item in the list that you're looping over.
There is nothing weird or odd about a documented language feature, although I do find it odd how many people refuse to check the docs upon encountering behavior they do not understand.
$a in this case is an alias to the array element. Just don't have $a = in your code and you won't modify the array. :-)
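If the goal is a modified copy rather than an in-place change, two common ways to leave the array alone are sketched below. The variable names are illustrative only:

```perl
use strict;
use warnings;

my @ar = (1, 2, 3);

# map builds a new list from an expression; $_ is only read, never assigned.
my @plus_one = map { $_ + 1 } @ar;

# Or copy the element into a fresh lexical before changing it.
for my $elem (@ar) {
    my $copy = $elem;   # $copy is a real copy, not an alias
    $copy++;
}

print join(', ', @ar), "\n";         # prints "1, 2, 3"
print join(', ', @plus_one), "\n";   # prints "2, 3, 4"
```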
If I remember correctly, map, grep, etc. all have the same aliasing behaviour.
As others have said, this is documented.
My understanding is that the aliasing behavior of @_, for, map and grep provides a speed and memory optimization as well as providing interesting possibilities for the creative. What happens is essentially a pass-by-reference invocation of the construct's block. This saves time and memory by avoiding unnecessary data copying.
use strict;
use warnings;
use List::MoreUtils qw(apply);
my @array = qw( cat dog horse kanagaroo );
foo(@array);
print join "\n", '', 'foo()', @array;
my @mapped = map { s/oo/ee/g } @array;
print join "\n", '', 'map-array', @array;
print join "\n", '', 'map-mapped', @mapped;
my @applied = apply { s/fee//g } @array;
print join "\n", '', 'apply-array', @array;
print join "\n", '', 'apply-applied', @applied;
sub foo {
$_ .= 'foo' for @_;
}
Note the use of List::MoreUtils apply function. It works like map but makes a copy of the topic variable, rather than using a reference. If you hate writing code like:
my @foo = map { my $f = $_; $f =~ s/foo/bar/; $f } @bar;
you'll love apply, which makes it into:
my @foo = apply { s/foo/bar/ } @bar;
Something to watch out for: if you pass read only values into one of these constructs that modifies its input values, you will get a "Modification of a read-only value attempted" error.
perl -e '$_++ for "o"'
the important distinction here is that when you declare a my variable in the initialization section of a for loop, it seems to share some properties of both locals and lexicals (someone with more knowledge of the internals care to clarify?)
my @src = 1 .. 10;
for my $x (@src) {
# $x is an alias to elements of @src
}
for (@src) {
my $x = $_;
# $_ is an alias but $x is not an alias
}
the interesting side effect of this is that in the first case, a sub{} defined within the for loop is a closure around whatever element of the list $x was aliased to. knowing this, it is possible (although a bit odd) to close around an aliased value which could even be a global, which I don't think is possible with any other construct.
our @global = 1 .. 10;
my @subs;
for my $x (@global) {
push @subs, sub {++$x}
}
$subs[5](); # modifies the @global array
Your $a is simply being used as an alias for each element of the list as you loop over it. It's being used in place of $_. You can tell that $a is not a local variable because it is declared outside of the block.
It's more obvious why assigning to $a changes the contents of the list if you think about it as being a stand-in for $_ (which is what it is). In fact, $_ is not set by the loop if you name your own iterator like that.
foreach my $a (1..10) {
print $_; # $_ is not set by this loop, so this does not print the current element
}
If you're wondering what the point is, consider the case:
my #row = (1..10);
my #col = (1..10);
foreach (#row){
print $_;
foreach(#col){
print $_;
}
}
In this case it is more readable to provide a friendlier name for $_
foreach my $x (#row){
print $x;
foreach my $y (#col){
print $y;
}
}
Try
foreach my $a (@_ = @ar)
now modifying $a does not modify @ar.
Works for me on v5.20.2