How to get the top keys from a hash by value - perl

I have a hash that I sorted by values greatest to least. How would I go about getting the top 5? There was a post on here that talked about getting only one value.
What is the easiest way to get a key with the highest value from a hash in Perl?
I understand that post, so would I, let's say, get the highest value, add it to an array, delete that element from the hash, and then do the process again?
Seems like there should be an easier way to do this than that, though.
My hash is called %words.
Edit: Took out the code, as the question was answered without really needing it.

Your question is how to get the five highest values from your hash. You have this code:
my @keys = sort {
$words{$b} <=> $words{$a}
or
"\L$a" cmp "\L$b"
} keys %words;
That gives you the sorted hash keys. Just take the top five keys from there:
my @highest = splice @keys, 0, 5; # also deletes the keys from the array
my @highest = @keys[0..4]; # non-destructive solution
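One caveat with the slice is that it yields undef entries when the hash has fewer than five keys. A minimal sketch that clamps the slice, assuming made-up counts in place of the asker's %words:

```perl
use strict;
use warnings;

# Hypothetical word counts standing in for the asker's %words.
my %words = (the => 12, cat => 3, sat => 3, on => 7, mat => 1, a => 9);

# Sort keys by count (descending), breaking ties case-insensitively by name.
my @keys = sort {
    $words{$b} <=> $words{$a}
        or
    lc($a) cmp lc($b)
} keys %words;

# Clamp the slice so a hash with fewer than $n keys yields no undef entries.
my $n    = 5;
my $last = $#keys < $n - 1 ? $#keys : $n - 1;
my @highest = @keys[0 .. $last];

print "$_ => $words{$_}\n" for @highest;
```

With an empty hash, $last becomes -1 and the range 0 .. -1 is empty, so @highest is simply empty.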
Also some comments on your code:
open( my $filehandle0, '<', $file0 ) || die "Could not open $file0\n";
It is a good idea to include the error message $! in your die statement to get valuable information for why the open failed.
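A short sketch of what that looks like, assuming a path that does not exist; the eval wrapper is only there so the demo prints the message instead of exiting:

```perl
use strict;
use warnings;

# Hypothetical path that is assumed not to exist.
my $file0 = '/nonexistent/dir/input.txt';

# eval is used here only so the demo can show the message instead of dying.
my $ok = eval {
    open( my $filehandle0, '<', $file0 )
        or die "Could not open '$file0': $!\n";
    1;
};
my $error = $ok ? '' : $@;
print "Error: $error" if $error;
```

The text after the colon comes from $! and names the operating-system reason for the failure, such as a missing file or a permissions problem.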
for (@words) {
s/[\,|\.|\!|\?|\:|\;|\"]//g;
}
Like I said in the comment, you do not need to escape characters or use alternations in a character class bracket. Use either:
s/[,.!?:;"]//g for @words; # or
tr/,.!?:;"//d for @words;
This next part is a bit odd.
my @stopwords;
while ( my $line = <$filehandle1> ) {
    chomp $line;
    my @linearray = split( " ", $line );
    push( @stopwords, @linearray );
}
for my $w ( my @stopwords ) {
    s/\b\Q$w\E\B//ig;
}
You read in the stopwords from a file... and then you delete the stopwords from $_? Are you even using $_ at this point? Moreover, you are redeclaring the @stopwords array in the loop header, which effectively means the loop iterates over a new, empty array and never runs. This error is silent, it seems, so you might never notice.
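If the intent was to drop stop words from the word list, one possible shape is a hash lookup combined with grep. This is only a sketch under that assumption; the two arrays below are invented stand-ins for the data read from the asker's files:

```perl
use strict;
use warnings;

# Hypothetical data: in the original these come from two input files.
my @words     = qw(The quick brown fox jumps over the lazy dog);
my @stopwords = qw(the over);

# Build a lookup hash once so each membership test is a single hash access.
my %is_stopword = map { lc($_) => 1 } @stopwords;

# Keep only the words that are not stop words (case-insensitive).
my @kept = grep { !$is_stopword{ lc $_ } } @words;

print "@kept\n";
```

This avoids running one substitution per stop word over every word, and it compares whole words rather than substrings.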
my %words = %words_count;
Here you make a copy of %words_count, which seems to be redundant, since you never use it again. If you have a big hash, this can decrease performance.
my $key_count = 0;
$key_count = keys %words;
This can be done in one line: my $key_count = keys %words. More readable, in my opinion.
$value_count = $words{$key} + $value_count;
Can also be abbreviated with the += operator: $value_count += $words{$key}
It is very good that you use strict and warnings.

If performance isn't a big deal:
(sort { $words{$b} <=> $words{$a} } keys %words)[0..4]
If you absolutely need killer speed, a selection sort which terminates after 5 iterations is probably the best thing for you. Note that the loop below deletes the selected keys from %words as it goes.
my @results;
for (0..4) {
    my $maxkey;
    my $max = 0;
    for my $key (keys %words){
        if ($max < $words{$key}){
            $maxkey = $key;
            $max = $words{$key};
        }
    }
    push @results, $maxkey;
    delete $words{$maxkey};
}
say join("," => @results);
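If emptying the hash is undesirable, a single pass that keeps a small ordered list of candidates is one alternative. Note that this swaps in a bounded insertion approach rather than the selection loop above; the counts and the name @top are invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical counts with distinct values so the result is unambiguous.
my %words = (the => 12, cat => 3, sat => 4, on => 7, mat => 1, a => 9, dog => 5);
my $n = 5;

my @top;    # keys of the current best candidates, highest count first
for my $key (keys %words) {
    # Find the first slot whose count is lower than this key's count.
    my $i = 0;
    $i++ while $i < @top && $words{ $top[$i] } >= $words{$key};
    next if $i >= $n;            # not good enough to make the list
    splice @top, $i, 0, $key;    # insert in order
    pop @top if @top > $n;       # drop the overflow
}

print join(',', @top), "\n";
```

The hash is read once and never modified, and @top never grows beyond $n entries.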

There's a CPAN module for that: Sort::Key::Top.
It has a straight-forward interface and an efficient XS implementation:
use Sort::Key::Top qw(rnkeytop);
my @results = rnkeytop { $words{$_} } 5 => keys %words;

Related

Selecting highest count of element except when...

So I have been working on this Perl script that will analyze and count the same letters in different line spaces. I have implemented the count with a hash but am having trouble excluding the "-" character from the output of this hash. I tried using the delete command or next if, but I am not getting rid of the - count in the output.
So with this input:
@extract = ------------------------------------------------------------------MGG-------------------------------------------------------------------------------------
And following code:
#Count selected amino acids.
my %counter = ();
foreach my $extract(@extract) {
#next if $_ =~ /\-/; #This line code does not function correctly.
$counter{$_}++;
}
sub largest_value_mem (\%) {
my $counter = shift;
my ($key, @keys) = keys %$counter;
my ($big, @vals) = values %$counter;
for (0 .. $#keys) {
if ($vals[$_] > $big) {
$big = $vals[$_];
$key = $keys[$_];
}
}
$key
}
I expect the most common element to be G, same as the output. If there is a tie between elements, say G = M, a way to display both would be great but is not necessary. Any tips on how to delete or remove the '-' are much appreciated. I am slowly learning the Perl language.
Please let me know if what I am asking is not clear or if more information is needed, thanks again kindly for all the comments.
Your data doesn't entirely make sense, since it's not actually working perl code. I'm guessing that it's a string divided into characters. After that it sounds like you just want to be able to find the highest frequency character, which is essentially just a sort by descending count.
Therefore the following demonstrates how to count your characters and then sort the results:
use strict;
use warnings;
my $str = '------------------------------------------------------------------MGG-------------------------------------------------------------------------------------';
my @chars = split '', $str;
# Count characters
my %count;
$count{$_}++ for @chars;
delete $count{'-'}; # Don't count -
# Sort keys by count descending
my @keys = sort {$count{$b} <=> $count{$a}} keys %count;
for my $key (@keys) {
    print "$key $count{$key}\n";
}
Outputs:
G 2
M 1
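The question also asked about ties. One way to report every character that shares the highest count is sketched below, using max from the core List::Util module; the input string is made up so that G and M tie:

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical input in which G and M tie.
my $str = '---MMGG---A-';

# Count every character except the '-' filler.
my %count;
$count{$_}++ for grep { $_ ne '-' } split //, $str;

# Find the highest count, then collect every key that reaches it.
my $highest = max( values %count );
my @winners = grep { $count{$_} == $highest } sort keys %count;

print "Most common: @winners ($highest each)\n";
```

Filtering out '-' before counting has the same effect as deleting the key afterwards.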
foreach my $extract(@extract) {
#next if $_ =~ /\-/
Because the loop names its own variable, $extract, $_ is not set here.
(In this case, $_ keeps whatever value it had before the loop, e.g. from a routine argument list, a previous match, etc.)
Also, you can use a character class for better readability:
next if $extract=~/[-]/;
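Put together, a corrected version of the counting loop might look like the sketch below. The short @extract list is a made-up stand-in for the asker's data:

```perl
use strict;
use warnings;

# Hypothetical short input; the asker's real list is much longer.
my @extract = split //, '--MGG--';

my %counter;
for my $extract (@extract) {
    next if $extract eq '-';    # test the named loop variable, not $_
    $counter{$extract}++;       # and count by it as well
}

print "$_ $counter{$_}\n" for sort keys %counter;
```

Both the skip test and the hash key use $extract, which is the variable the loop actually sets.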

Is there a simple way to validate a hash of hash element comparison?

Is there a simple way to validate a hash of hash element comparison?
I need to validate a Perl hash of hashes by comparing the element $Table{$key1}{$key2}{K1}{Value} to all the other elements in the hash.
The third key runs from K1 to Kn, and I want to compare those elements; the other keys are the same.
if ($Table{$key1}{$key2}{K1}{Value} eq $Table{$key1}{$key2}{K2}{Value}
eq $Table{$key1}{$key2}{K3}{Value} )
{
#do whatever
}
Something like this may work:
use List::MoreUtils 'all';
my @keys = map "K$_", 1..10;
print "All keys equal"
if all { $Table{$key1}{$key2}{$keys[1]}{Value} eq $Table{$key1}{$key2}{$_}{Value} } @keys;
I would use Data::Dumper to help with a task like this, especially for a more general problem (where the third key is more arbitrary than 'K1'...'Kn'). Use Data::Dumper to stringify the data structures and then compare the strings.
use Data::Dumper;
# this line is needed to assure that hashes with the same keys output
# those keys in the same order.
$Data::Dumper::Sortkeys = 1;
# Dump() expects a reference to an array of the values to dump.
my $string1 = Data::Dumper->Dump([ $Table{$key1}{$key2}{k1} ]);
for (my $n = 2; exists $Table{$key1}{$key2}{"k$n"}; $n++) {
    my $string_n = Data::Dumper->Dump([ $Table{$key1}{$key2}{"k$n"} ]);
    if ($string1 ne $string_n) {
        warn "key 'k$n' is different from 'k1'";
    }
}
This can be used for the more general case where $Table{$key1}{$key2}{k7}{value} itself contains a complex data structure. When a difference is detected, though, it doesn't give you much help figuring out where that difference is.
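A self-contained sketch of the stringify-and-compare idea, using the exported Dumper function and an invented %entries hash in place of $Table{$key1}{$key2}:

```perl
use strict;
use warnings;
use Data::Dumper;

# Sort hash keys so equal structures always serialize to the same string.
$Data::Dumper::Sortkeys = 1;

# Hypothetical stand-in for $Table{$key1}{$key2}.
my %entries = (
    k1 => { Value => 'x', tags => [ 1, 2 ] },
    k2 => { tags => [ 1, 2 ], Value => 'x' },
    k3 => { Value => 'y', tags => [ 1, 2 ] },
);

# Serialize k1 once, then compare every entry's serialization against it.
my $reference = Dumper( $entries{k1} );
my @different = grep { Dumper( $entries{$_} ) ne $reference } sort keys %entries;

print "Differs from k1: @different\n";
```

One thing to keep in mind is that Data::Dumper may render a number and the equivalent numeric string differently, so this comparison is stricter than eq on the individual values.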
A fairly complex structure. You should be looking into using object oriented programming techniques. That would greatly simplify your programming and the handling of these complex structures.
First of all, let's simplify a bit. When you say:
$Table{$key1}{$key2}{k1}{value}
Do you really mean:
my $value = $Table{$key1}->{$key2}->{k1};
or
my $actual_value = $Table{$key1}->{$key2}->{k1}->{Value};
I'm going to assume the first one. If I'm wrong, let me know, and I'll update my answer.
Let's simplify:
my %hash = %{$Table{$key1}->{$key2}};
Now, we're just dealing with a hash. There are two techniques you can use:
Sort the keys of this hash by value, then if two keys have the same value, they will be next to each other in the sorted list, making it easy to detect duplicates. The advantage is that all the duplicate keys would be printed together. The disadvantage is that this is a sort which takes time and resources.
Reverse the hash, so it's keyed by value and the value of that key is the key. If a key already exists, we know the other key has a duplicate value. This is faster than the first technique because no sorting is involved. However, duplicates will be detected, but not printed together.
Here's the first technique:
my %hash = %{$Table{$key1}->{$key2}};
my $previous_value;
my $previous_key;
foreach my $key (sort {$hash{$a} cmp $hash{$b}} keys %hash) {
if (defined $previous_key and $previous_value eq $hash{$key}) {
print "\$hash{$key} is a duplicate of \$hash{$previous_key}\n";
}
$previous_value = $hash{$key};
$previous_key = $key;
}
And the second:
my %hash = %{$Table{$key1}->{$key2}};
my %reverse_hash;
foreach my $key (keys %hash) {
my $value = $hash{$key};
if (exists $reverse_hash{$value}) {
print "\$hash{$reverse_hash{$value}} has the same value as \$hash{$key}\n";
}
else {
$reverse_hash{$value} = $key;
}
}
An alternative approach is to write a utility function that checks whether some function returns the same value for every key:
sub AllSame (&\%) {
    my ($c, $h) = @_;
    my @k = keys %$h;
    my $ref;
    $ref = $c->() for $h->{shift @k};
    $ref ne $c->() and return for @$h{@k};
    return 1
}
print "OK\n" if AllSame {$_->{Value}} %{$Table{$key1}{$key2}};
But if you start thinking this way, you may find the following approach much more generic (the recommended way):
sub AllSame (@) {
    my $ref = shift;
    $ref ne $_ and return for @_;
    return 1
}
print "OK\n" if AllSame map {$_->{Value}} values %{$Table{$key1}{$key2}};
If the mapping operation is expensive, you can make a lazy counterpart of the same:
sub AllSameMap (&@) {
    my $c = shift;
    my $ref;
    $ref = $c->() for shift;
    $ref ne $c->() and return for @_;
    return 1
}
print "OK\n" if AllSameMap {$_->{Value}} values %{$Table{$key1}{$key2}};
If you want only some subset of keys, you can use hash slice syntax, e.g.:
print "OK\n" if AllSame map {$_->{Value}} @{$Table{$key1}{$key2}}{map "K$_", 1..10};

Why am I getting "ARRAY(0x145030c)" trying to write a Perl hash to a file?

I have a hash in which I store the products a customer buys (%orders). It uses the product code as key and has a reference to an array with the other info as value.
At the end of the program, I have to rewrite the inventory to the updated version (i.e. subtract the quantity of the bought items)
This is how I do rewrite the inventory:
sub rewriteInventory{
open(FILE,'>inv.txt');
foreach $key(%inventory){
print FILE "$key\|$inventory{$key}[0]\|$inventory{$key}[1]\|$inventory{$key}[2]\n"
}
close(FILE);
}
where $inventory{$key}[x] is 0 → Title, 1 → price, 2 → quantity.
The problem here is that when I look at inv.txt afterwards, I see things like this:
CD-911|Lady Gaga - The Fame|15.99|21
ARRAY(0x145030c)|||
BOOK-1453|The Da Vinci Code - Dan Brown|14.75|12
ARRAY(0x145bee4)|||
Where do these ARRAY(0x145030c)||| entries come from? Or more important, how do I get rid of them?
You want to iterate over
keys %inventory
and not
%inventory
which, as you see, causes you to iterate over key, value pairs.
You're using your hash in list context, so you're getting all your values thrown in with your keys. I think what you actually want to do is:
foreach $key (keys %inventory) {
print FILE "...";
}
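The difference is easy to see by flattening the hash by hand. A small sketch, assuming a two-item inventory shaped like the one in the question:

```perl
use strict;
use warnings;

# Hypothetical two-item inventory shaped like the one in the question.
my %inventory = (
    'CD-911'    => [ 'Lady Gaga - The Fame',          15.99, 21 ],
    'BOOK-1453' => [ 'The Da Vinci Code - Dan Brown', 14.75, 12 ],
);

my @flattened = %inventory;         # key, value, key, value, ...
my @only_keys = keys %inventory;    # just the keys

printf "list context: %d items, keys: %d items\n",
    scalar @flattened, scalar @only_keys;

# Every second item of the flattened list is an array reference,
# which stringifies as ARRAY(0x...) when used as a hash key or printed.
print "$_\n" for grep { ref $_ eq 'ARRAY' } @flattened;
```

Looping over %inventory therefore visits each array reference as if it were a key, which is where the ARRAY(0x...)||| lines come from.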
EDIT: I was wrong about having to use an explicit dereferencing arrow; this is inferred between brackets when necessary, even if the first brackets do NOT require a dereference. That said, I will leave the remainder of the answer as posted since it was accepted, but merely note that if you choose not to use join, you needn't actually use $inventory{$key}->[0] but can in fact use $inventory{$key}[0] as originally posted.
Just be aware that the first (hash) brackets do not imply a dereference but the second (array) brackets do. Your errant array refs in the output were coming from looping over not only keys but also values of the hash.
ORIGINAL ANSWER:
In addition to using keys, you also need to dereference the references-to-array (this is why you're seeing each value output as ARRAY with an address---you're printing the references, not the values of the dereferenced array) when you print, so your loop becomes something like:
foreach my $key (sort keys %inventory) {
print FILE "$key\|$inventory{$key}->[0]\|$inventory{$key}->[1]\|$inventory{$key}->[2]\n";
}
I'd probably rewrite it a little more idiomatically as:
foreach my $key (sort keys %inventory) {
print FILE join('|', $key, @{$inventory{$key}}), "\n";
}
Hope that helps!
Here is one way to write that routine:
#!/usr/bin/perl
use strict; use warnings;
my %inventory;
while ( my $line = <DATA> ) {
chomp $line;
my ($key, @values) = split qr{\|}, $line;
last unless @values;
$inventory{$key} = \@values;
}
write_inventory(\%inventory, 'test.txt');
sub write_inventory {
    my ($inventory, $output_file) = @_;
    open my $output, '>', $output_file
        or die "Cannot open '$output_file': $!";
    for my $item ( keys %$inventory ) {
        # Go through the reference that was passed in, not the global hash.
        unless ( 'ARRAY' eq ref $inventory->{$item} ) {
            warn "Invalid item '$item' in inventory\n";
            next;
        }
        print $output join('|', $item, @{ $inventory->{$item} }), "\n";
    }
    close $output
        or die "Cannot close '$output_file': $!";
}
__DATA__
CD-911|Lady Gaga - The Fame|15.99|21
BOOK-1453|The Da Vinci Code - Dan Brown|14.75|12
The rules are:
Don't use global variables: Pass a reference to %inventory to write_inventory instead of having it operate on the global %inventory.
Don't use global variables: Instead of using the bareword file handle FILE which has package scope, use a lexical file handle whose scope is limited to write_inventory.
Check for errors on file operations: Make sure open succeeded before plowing ahead and trying to write. Make sure close succeeded before assuming all data you printed actually got saved.
You MUST use strict and warnings because, at this point in your learning process, you do not know enough to know what you do not know.

Is perl's each function worth using?

From perldoc -f each we read:
There is a single iterator for each hash, shared by all each, keys, and values function calls in the program; it can be reset by reading all the elements from the hash, or by evaluating keys HASH or values HASH.
The iterator is not reset when you leave the scope containing the each(), and this can lead to bugs:
my %h = map { $_, 1 } qw(1 2 3);
while (my $k = each %h) { print "1: $k\n"; last }
while (my $k = each %h) { print "2: $k\n" }
Output:
1: 1
2: 3
2: 2
What are the common workarounds for this behavior? And is it worth using each in general?
I think it is worth using as long as you are aware of this. It's ideal when you need both key and value in iteration:
while (my ($k,$v) = each %h) {
say "$k = $v";
}
In your example you can reset the iterator by adding keys %h; like so:
my %h = map { $_ => 1 } qw/1 2 3/;
while (my $k = each %h) { print "1: $k\n"; last }
keys %h; # reset %h
while (my $k = each %h) { print "2: $k\n" }
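The effect of the reset can also be checked without relying on hash order, by counting how many keys a second pass sees. This is a sketch; the helper name count_remaining is made up for the demo:

```perl
use strict;
use warnings;

# Hypothetical helper: count how many keys each() still has to hand out.
sub count_remaining {
    my ($hash) = @_;
    my $seen = 0;
    while ( defined( my $k = each %$hash ) ) { $seen++ }
    return $seen;
}

my %h = map { $_ => 1 } qw(a b c);

# Abandon an each() loop after the first key; the iterator stays advanced.
while ( my ($k) = each %h ) { last }
my $without_reset = count_remaining( \%h );

# Same again, but reset the iterator before the second pass.
while ( my ($k) = each %h ) { last }
keys %h;    # void-context keys resets the iterator
my $with_reset = count_remaining( \%h );

print "without reset: $without_reset, with reset: $with_reset\n";
```

Running count_remaining to the end also resets the iterator, which is why the second experiment starts from a clean state.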
From Perl 5.12 each will also allow iteration on an array.
I find each to be very handy for idioms like this:
my $hashref = some_really_complicated_method_that_builds_a_large_and_deep_structure();
while (my ($key, $value) = each %$hashref)
{
# code that does stuff with both $key and $value
}
Contrast that code to this:
my $hashref = ...same call as above
foreach my $key (keys %$hashref)
{
my $value = $hashref->{$key};
# more code here...
}
In the first case, both $key and $value are immediately available to the body of the loop. In the second case, $value must be fetched first. Additionally, the list of keys of $hashref may be really huge, which takes up memory. This is occasionally an issue. each does not incur such overhead.
However, the drawbacks of each are not instantly apparent: if aborting from the loop early, the hash's iterator is not reset. Additionally (and I find this one more serious and even less visible): you cannot call keys(), values() or another each() from within this loop. To do so would reset the iterator, and you would lose your place in the while loop. The while loop would continue forever, which is definitely a serious bug.
each is too dangerous to ever use, and many style guides prohibit its use completely. The danger is that if a cycle of each is aborted before the end of the hash, the next cycle will start there. This can cause very hard-to-reproduce bugs; the behavior of one part of the program will depend on a completely unrelated other part of the program. You might use each right, but what about every module ever written that might use your hash (or hashref; it's the same)?
keys and values are always safe, so just use those. keys makes it easier to traverse the hash in deterministic order, anyway, which is almost always more useful. (for my $key (sort keys %hash) { ... })
each is not only worth using, it's pretty much mandatory if you want to loop over all of a tied hash too big for memory.
A void-context keys() (or values, but consistency is nice) before beginning the loop is the only "workaround" necessary; is there some reason you are looking for some other workaround?
Use the keys() function to reset the iterator. See the FAQ for more info.
each has a built-in, hidden global variable that can hurt you. Unless you need this behavior, it's safer to just use keys.
Consider this example where we want to group our k/v pairs (yes, I know printf would do this better):
#!perl
use strict;
use warnings;
use Test::More 'no_plan';
{ my %foo = map { ($_) x 2 } (1..15);
is( one( \%foo ), one( \%foo ), 'Calling one twice works with 15 keys' );
is( two( \%foo ), two( \%foo ), 'Calling two twice works with 15 keys' );
}
{ my %foo = map { ($_) x 2 } (1..105);
is( one( \%foo ), one( \%foo ), 'Calling one twice works with 105 keys' );
is( two( \%foo ), two( \%foo ), 'Calling two twice works with 105 keys' );
}
sub one {
my $foo = shift;
my $r = '';
for( 1..9 ) {
last unless my ($k, $v) = each %$foo;
$r .= " $_: $k -> $v\n";
}
for( 10..99 ) {
last unless my ($k, $v) = each %$foo;
$r .= " $_: $k -> $v\n";
}
return $r;
}
sub two {
my $foo = shift;
my $r = '';
my @k = keys %$foo;
for( 1..9 ) {
last unless @k;
my $k = shift @k;
$r .= " $_: $k -> $foo->{$k}\n";
}
for( 10..99 ) {
last unless @k;
my $k = shift @k;
$r .= " $_: $k -> $foo->{$k}\n";
}
return $r;
}
Debugging the error shown in the tests above in a real application would be horribly painful. (For better output use Test::Differences eq_or_diff instead of is.)
Of course one() can be fixed by using keys to clear the iterator at the start and end of the subroutine. If you remember. If all your coworkers remember. It's perfectly safe as long as no one forgets.
I don't know about you, but I'll just stick with using keys and values.
It's best if used as its name suggests: each. It's probably the wrong thing to use if you mean "give me the first key-value pair," or "give me the first two pairs" or whatever. Just keep in mind that the idea is flexible enough that each time you call it, you get the next pair (or key in a scalar context).
each() can be more efficient if you are iterating through a tied hash, for example a database that contains millions of keys; that way you don't have to load all the keys in memory.

Selectively counting delimited field values and creating a hash using map

I have a pipe-delimited text file containing, among other things, a date and a number indicating the line's sequence elsewhere in the program. What I'm hoping to do is create a hash from that file using the year as the key and the maximum sequence for that year as the value (I essentially need to implement an auto-incremented key per year), e.g. from
2000|1
2003|9
2000|5
2000|21
2003|4
I would end with a hash like:
%hash = {
2000 => 21,
2003 => 9
}
I've managed to split the file into the year and sequence parts (not very well I think) like so:
my @dates = map {
my @temp = split /\|/;
join "|", (split /\//, $temp[1])[-1], $temp[4] || 0; #0 because some records
#mightn't have a sequence
} @info
Is there something succinct I could do to create a hash using that data?
Thanks
If I understand you, you were almost there. All you needed to do was return the key and value from map and sort by sequence instead of joining them.
my %hash =
    map @$_,
    sort { $a->[1] <=> $b->[1] }
    map {
        my @temp = split /\|/;
        my $date = (split /\//, $temp[1])[-1];
        my $seq = $temp[4] || 0; #0 because some records mightn't have a sequence
        [ $date, $seq ]
    } @info;
But just iterating through with for and setting hash only if the current sequence
is higher than the previous maximum for that date is probably a better idea.
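A minimal sketch of that loop, assuming the records have already been reduced to the "year|sequence" form shown in the question (the real records have more fields):

```perl
use strict;
use warnings;

# The sample records from the question, already reduced to "year|sequence".
my @info = ('2000|1', '2003|9', '2000|5', '2000|21', '2003|4');

my %max_seq;
for my $record (@info) {
    my ($year, $seq) = split /\|/, $record;
    $seq ||= 0;    # treat a missing sequence as 0
    # Store the sequence only if it beats the maximum seen so far for the year.
    $max_seq{$year} = $seq
        if !defined $max_seq{$year} || $seq > $max_seq{$year};
}

print "$_ => $max_seq{$_}\n" for sort keys %max_seq;
```

Unlike the sort-then-map version, this needs no intermediate list and makes a single pass over the data.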
Be careful with those {}; where you said
%hash = {
2000 => 21,
2003 => 9
}
you meant () instead (or to be assigning to a reference $hash), since the {} there create an anonymous hash and return a reference to it.
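Side by side, the two forms look like this. It is only an illustrative sketch, and the variable names are arbitrary:

```perl
use strict;
use warnings;

# Parentheses build a list, which is what a hash variable expects.
my %hash = (
    2000 => 21,
    2003 => 9,
);

# Braces build an anonymous hash and return a reference to it,
# so the result belongs in a scalar.
my $hash_ref = {
    2000 => 21,
    2003 => 9,
};

print "hash: $hash{2000}, reference: $hash_ref->{2000}\n";
print "keys in each: ", scalar( keys %hash ), " and ", scalar( keys %$hash_ref ), "\n";
```

Assigning the braces form to %hash would instead store a single stringified reference as a key, and Perl warns about it when warnings are enabled.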
Here's how you could write that .. not too sure why you want/need to use map (please explain)
#!/usr/bin/perl -w
use strict;
use warnings;
my %hash;
while(<DATA>) {
chomp();
my ($year,$sequence)=split('\|');
$sequence = 0 unless (defined ($sequence));
next if (exists $hash{$year} and $sequence < $hash{$year});
$hash{$year}=$sequence;
}
__DATA__
2000|1
2003|9
2000|5
2000|21
2003|4
I added the $sequence = 0 unless defined ($sequence); because of that comment in your snippet. I believe I might understand your intent there.. (either the input format is valid/consistent, or it is not ..)
map operates on each item in a list and builds a list of results to pass on. So, you can't really do the sort of checks you want (keep the maximum sequence value) as you go unless you build a scratch hash that winds up containing exactly the data you are trying to build as the return value of the map.
my %results = map {
my( $y, $s ) = split '[|]', $_;
seq_is_gt_year_seq( $y, $s )
? ( $y, $s )
: ();
} @year_pipe_seq;
To implement seq_is_gt_year_seq() we wind up having to build a temporary hash that stores each year and its max sequence value for lookup.
You should use an approach that builds the lookup incrementally, like a for or while loop.
map { BLOCK } LIST usually returns a list that is at least as large as LIST (unless BLOCK sometimes evaluates to an empty list), so it may not be the way to go if you simply want to overwrite duplicate keys with the latest data. Something like:
my %hash;
for (@info) {
my @temp = split /\|/;
my $key = (split /\//, $temp[1])[-1];
my $value = $temp[4] || 0;
$hash{$key} = $value unless defined $hash{$key} && $hash{$key}>=$value;
}
will work. The last line conditionally updates the hash table, which is something you can't do (or at least can't do very conveniently) inside a map statement.
If there's any chance you can perform this processing as the file is read, then I'd do it. Something like this:
my %year_count;
while (my $line = <$fh>){
chomp $line;
my ($year, $num) = split /\|/, $line;
if (!defined $year_count{$year} || $num > $year_count{$year}) {
$year_count{$year} = $num;
}
}
If you want to use an array, map isn't really the best choice (since you're not transforming the list, you're processing it down to something different). To be honest, the most sensible array processing would probably be the same as the above, but in a foreach instead:
my %year_count;
foreach my $line (@info){
my ($year, $num) = split /\|/, $line;
if (!defined $year_count{$year} || $num > $year_count{$year}) {
$year_count{$year} = $num;
}
}