Perl Sorting Multiple Keys Hash - perl

I have a hash
$grades{$subject}{$student} = $score
I am trying to extract the top 30 scores for each subject, with students, from the hash. This requires sorting, but I'm not sure how to sort a hash with multiple keys.
So far I have this, but it gives me every single score instead of the top 30 that I need for each subject. Also, is there a faster way to perform the query, since I have almost 200K students?
foreach my $subject(sort keys %grades) {
foreach my $student(keys %{ $grades{$subject} }) {
print "$subject, $student: $grades{$subject}{$student}\n";
}
}

This sorts the top 2 scores for each subject (just for illustrative purposes). You should change 0 .. 1 to 0 .. 29 for the top 30:
use warnings;
use strict;
my %grades = (
math => {bill=>55, joe=>66, mike=>77},
hist => {bill=>72, joe=>33, mike=>99},
read => {bill=>95, joe=>24, mike=>22},
);
for my $subject (sort keys %grades) {
my %gr = %{ $grades{$subject} };
for my $student ((reverse sort { $gr{$a} <=> $gr{$b} } keys %gr)[0 .. 1]) {
print "$subject $student $gr{$student}\n";
}
}
__END__
hist mike 99
hist bill 72
math mike 77
math joe 66
read bill 95
read joe 24
Refer to perldoc perldsc and How do I sort a hash (optionally by value instead of key)?

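Count them. Put this at the top of the loop over the sorted students, before the print, and reset $count for each subject: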
$count++;
last if $count > 30;
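A minimal sketch of how the counter and the sort might fit together. The helper name top_n and the tie-break on student name are my own assumptions, not part of either answer. It sorts through the hash reference directly, so the inner hash is not copied for each subject.

```perl
use strict;
use warnings;

# Sample data in the same shape as the question:
# $grades{$subject}{$student} = $score
my %grades = (
    math => { bill => 55, joe => 66, mike => 77 },
    hist => { bill => 72, joe => 33, mike => 99 },
);

# top_n is a hypothetical helper: it returns up to $n [student, score]
# pairs for one subject's hash reference, highest score first.
sub top_n {
    my ($scores, $n) = @_;
    my @top;
    my $count = 0;
    # Sort descending by score; ties are broken by student name.
    for my $student (sort { $scores->{$b} <=> $scores->{$a} or $a cmp $b }
                     keys %$scores) {
        last if ++$count > $n;    # stop once $n students have been taken
        push @top, [ $student, $scores->{$student} ];
    }
    return @top;
}

for my $subject (sort keys %grades) {
    for my $pair (top_n($grades{$subject}, 2)) {    # use 30 for the top 30
        print "$subject $pair->[0] $pair->[1]\n";
    }
}
```

If a subject has fewer than $n students, the loop simply ends early, so no undefined entries are returned.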

Related

Perl 5.26: How to get keys from 2d hash?

System: Perl 5.26.1 on Ubuntu 18.04.
We upgraded our Ubuntu to 18.04 and Perl to 5.26 and now the experimental way I was using 2d hashes in Perl 5.18 is no longer supported. That's my fault for using something experimental.
My goal: I'm trying to use the new way to do hash of hashes, or a hash of 2 keys, or a 2d hash. I.e. a hash would have 2 keys, a department, and each department would have one or more employee ids. There can also be many departments in the hash as the first key. I'm reading data from the results of an SQL statement so I have to add hours to each employee id. Part of the problem is I read the department first, then I get the empid.
Problem: I'm getting an error in second loop to get the empid. Runtime error is: "Can't use string ("6") as a HASH ref while "strict refs" in use".
I've been doing research and I can't find a specific example for my case, the examples are different than what I'm doing, or the examples I try do not work and give me an error.
At least half the pages I've read have no date on them so I can't even guess at what version of perl they are for.
The perldocs on this topic are the same as other pages I've found on the internet.
I can't just use a plain SQL statement for this report because I have to do much more processing after I get the data for each employee.
I've tried several permutations of code with no luck. But I exclude those permutations here.
Most examples I've found on the internet do not work with Perl 5.26 for one reason or another.
Here is my code. I've tried several permutations from different sources to get this to work.
use strict;
use warnings;
use Data::Dumper;
####################
# Variables
my $i=0;
my $s='';
my $t='';
my $k='';
my $k2='';
my $empid='';
my $z='';
my @k=(); # Key list
my @k2=(); # Key list
my %hashs=(); # Scalar hash, works.
my %empees=();
my $pos=0;
my $val="Myval";
my $dept='';
####################
$s="Data::Dumper $Data::Dumper::VERSION";
print "$s\n";
print "\n";
$empees{'JSMITH'}=1.0; # WORKS
$empees{'RGREEN'}=2.0;
$empees{'KJONES'}=3.0;
$hashs{950}=%empees;
$empees{'WSMIT'}=1.5;
$empees{'AMCBE'}=2.5;
$empees{'SCHWAR'}=3.5;
$hashs{800}=%empees;
# Now print out values in 2d hash.
@k=keys(%hashs);
print "Keys and values are:\n";
foreach $dept (sort keys %hashs)
{
for $empid (sort keys %{$hashs{$dept}} ) # ERROR is here
{
$val=$hashs{$dept}{$empid};
$t="$dept $empid $val";
print "$t\n";
} # foreach $empid
} # foreach $dept
Can someone help me out here?
Is there another way to get this done than a hash of hashes?
Thank you for all your help! I really appreciate your time!
This snippet works for me
$hashs{950}=\%empees;
$empees{'WSMIT'}=1.5;
$empees{'AMCBE'}=2.5;
$empees{'SCHWAR'}=3.5;
$hashs{800}=\%empees;
# Now print out values in 2d hash.
@k=keys(%hashs);
print "Keys and values are:\n";
for my $dept (sort keys %hashs) {
for my $empid (sort keys %{$hashs{$dept}}) {
print "$empid\n";
$val=$hashs{$dept}{$empid};
$t="$dept $empid $val";
print "$t\n";
} # foreach $empid
} # foreach $dept
Value of a hash is always a scalar.
So, instead of assigning a hash
$hashs{950}=%empees;
assign a hash reference:
$hashs{950} = { %empees };
This only works if the %empees is one-level hash, otherwise you need to make a deep copy, e.g.
use Storable qw{ dclone };
# ...
$hashs{950} = dclone(\%empees);
and similarly for 800 (you probably need to empty %empees first, otherwise you'd assign the values already assigned to 950, too).
Or, assign it directly:
$hashs{950} = { JSMITH => 1.0,
RGREEN => 2.0,
KJONES => 3.0 };
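To make the copy-then-reuse point concrete, here is a small sketch using the employee ids from the question. It is only an illustration under the assumption that %empees is a flat, one-level hash. As far as I know, since Perl 5.26 a hash in scalar context evaluates to its key count, which would explain the string "6" in the error message.

```perl
use strict;
use warnings;

my (%hashs, %empees);

%empees = (JSMITH => 1.0, RGREEN => 2.0, KJONES => 3.0);
$hashs{950} = { %empees };    # store a reference to a shallow copy

# Assigning a new list replaces the old contents, so department 800
# does not inherit the employees of department 950.
%empees = (WSMIT => 1.5, AMCBE => 2.5, SCHWAR => 3.5);
$hashs{800} = { %empees };

for my $dept (sort keys %hashs) {
    for my $empid (sort keys %{ $hashs{$dept} }) {
        print "$dept $empid $hashs{$dept}{$empid}\n";
    }
}
```

Because each department gets its own anonymous hash, later changes to %empees do not affect what is already stored in %hashs.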
OP's question does not cover the problem to its full extent.
Please see the following piece of code, which demonstrates how the hash %departments can be filled with data and how that data can be retrieved for later use.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $debug = 1;
my %departments;
my $department;
while( <DATA> ) {
$department = $1 if /^(\d+)/;
$departments{$department}{$1}=$2 if /^\s+(\w+):([\d\.]+)/;
}
say Dumper(\%departments) if $debug;
foreach $department (sort keys %departments) {
say $department;
foreach my $employee (sort keys %{$departments{$department}}) {
say "\t $employee $departments{$department}{$employee}";
}
}
__DATA__
950
    JSMITH:1.0
    RGREEN:2.0
    KJONES:3.0
800
    WSMIT:1.5
    AMCBE:2.5
    SCHWAR:3.5
Output
$VAR1 = {
'950' => {
'JSMITH' => '1.0',
'KJONES' => '3.0',
'RGREEN' => '2.0'
},
'800' => {
'WSMIT' => '1.5',
'SCHWAR' => '3.5',
'AMCBE' => '2.5'
}
};
800
AMCBE 2.5
SCHWAR 3.5
WSMIT 1.5
950
JSMITH 1.0
KJONES 3.0
RGREEN 2.0

Sort the column values and search the value

Input to my script is this file which contains data as below.
A food 75
B car 136
A car 69
A house 179
B food 75
C car 136
C food 85
For each distinct value of the second column, I want to print any line where the number in the third column is different.
Example output
C food 85
A car 69
Here is my Perl code.
#! /usr/local/bin/perl
use strict;
use warnings;
my %data = ();
open FILE, '<', 'data.txt' or die $!;
while ( <FILE> ) {
chomp;
$data{$1} = $2 while /\s*(\S+),(\S+)/g;
}
close FILE;
print $_, '-', $data{$_}, $/ for keys %data;
I am able to print the hash keys and values, but not able to get the desired output.
Any pointers on how to do that using Perl?
As far as I can tell from your question, you want a list of all the lines where there is an "odd one out": a line with the same item type as others but a different number in the third column from all the rest.
I think this is what you need.
It reads all the data into hash %data, so that $data{$type}{$n} is a (reference to an) array of all the data lines that use that object type and number.
Then the hash is scanned again, looking for and printing all instances that have only a single line with the given type/number and where there are other values for the same object type (otherwise it would be the only entry and not an "odd one out").
use strict;
use warnings 'all';
use autodie;
my %data;
open my $fh, '<', 'data.txt';
while ( <$fh> ) {
my ( $label, $type, $n) = split;
push @{ $data{$type}{$n} }, $_;
}
for my $type ( keys %data ) {
my $items = $data{$type};
next unless keys %$items > 1;
for my $n ( keys %$items ) {
print $items->{$n}[0] if @{ $items->{$n} } == 1;
}
}
output
C food 85
A car 69
Note that this may print multiple lines for a given object type if the input looks like, say
B car 22
A car 33
B car 136
C car 136
This has two "odd ones out" that appear only once for the given object type, so both B car 22 and A car 33 will be printed
Here are the pointers:
First, you need to remember lines somewhere before outputting them.
Second, you need to discard previously remembered line for the object according to the rules you set.
In your case, the rule is to discard when the number for the object differs from the previous remembered.
Both tasks can be accomplished with the hash.
For each line:
my ($letter, $object, $number)=split /\s+/, $line;
if (!defined($hash{$object}) || $hash{$object}[0]!=$number) {
$hash{$object}=[$number, $line];
}
Third, you need to output the hash:
for my $object(keys %hash) {
print $hash{$object}[1];
}
But there is the problem: a hash is an unordered structure, it won't return its keys in the order you put them into the hash.
So, the fourth: you need to add the ordering to your hash data, which can be accomplished like this:
$hash{$object}=[$number,$line,$.]; # $. is the row number over all the input files or STDIN, we use it for sorting
And in the output part you sort with the stored row number
(see sort for details about $a, $b variables):
for my $object(sort { $hash{$a}[2]<=>$hash{$b}[2] } keys %hash) {
print $hash{$object}[1];
}
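Put together, the fragments above might look like the sketch below. The helper name last_changed and the explicit $row counter (standing in for $.) are my own choices for illustration. The rule is the one described in this answer: for each object, keep the most recent line whose number differs from the one remembered. Note that an object seen only once is also kept, which may or may not be what the asker wants.

```perl
use strict;
use warnings;

# last_changed is a hypothetical helper. It takes the input lines and
# returns the remembered line for each object, in input order.
sub last_changed {
    my @lines = @_;
    my %hash;
    my $row = 0;    # plays the role of $. when reading from a file
    for my $line (@lines) {
        $row++;
        my ($letter, $object, $number) = split /\s+/, $line;
        # Remember the line if the object is new or its number changed.
        if (!defined($hash{$object}) || $hash{$object}[0] != $number) {
            $hash{$object} = [ $number, $line, $row ];
        }
    }
    # Output in the order of the stored row numbers.
    return map  { $hash{$_}[1] }
           sort { $hash{$a}[2] <=> $hash{$b}[2] } keys %hash;
}

print last_changed(
    "A food 75\n",
    "B food 75\n",
    "C food 85\n",
);
```

With these three lines only the last one is kept, because it is the only point at which the number for food changes.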
Regarding the comments
I am certain that my code does not contain any errors.
If we look at the question before it was edited by some high rep users, it states:
[cite]
Now where if Numeric column(Third) has different value (Where in 2nd column matches) ...Then print only the mismatched number line. example..
A food 75
B car 136
A car 69
A house 179
B food 75
B car 136
C food 85
Example output (As number columns are not matching)
C food 85
[/cite]
I can only interpret that print only the mismatched number line as: to print the last line for the object where the number changed. That clearly matches the example the OP provided.
Even so, in my answer I addressed the possibility of misinterpretation, by stating that line omitting is done according to whatever rules the OP wants.
And below that I indicated what was the rule by that time in my opinion.
I think it well addressed the OP problem, because, after all, the OP wanted the pointers.
And now my answer is critiqued because it does not match the edited (long after and not by OP) requirements.
I disagree.
Regarding the whitespace: specifying /\s+/ for split is not an error here, despite some comments asserting otherwise.
While I agree that " " is common for split, I disagree that there are many cases where you must use " " instead of /\s+/.
/\s+/ is a regular expression, which is the conventional argument for split, while " " is a shorthand that actually masks the meaning.
So I decided to use an explicit split /\s+/, $line in my example, instead of split " ", $line or a bare split, specifically to show the inner workings of Perl.
I think that is important to anyone new to Perl.
It is perfectly OK to use /\s+/, but be careful if you expect leading whitespace in your data: consult perldoc -f split and decide whether /\s+/ suits your needs.
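For anyone unsure what the leading-whitespace caveat means in practice, this short sketch shows the difference. The sample line with two leading spaces is made up for illustration.

```perl
use strict;
use warnings;

my $line = "  A food 75";    # note the two leading spaces

# With a regex, the empty string before the first separator is kept
# as a leading field.
my @by_regex = split /\s+/, $line;

# With the special " " pattern, leading whitespace is skipped first.
my @by_space = split ' ', $line;

print scalar(@by_regex), " fields with /\\s+/\n";
print scalar(@by_space), " fields with ' '\n";
```

So with /\s+/ the letter ends up in the second field rather than the first whenever a line starts with whitespace.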

Perl variable not assigned in foreach: scope issues

I am trying to normalize some scores from a .txt file by dividing each score for each possible sense (e.g. take#v#2; referred to as $tokpossense in my code) by the sum of all scores for a wordtype (e.g. take#v; referred to as $tokpos). The difficulty is in grouping the wordtypes together when processing each line of the file, so that the normalized scores are printed upon finding a new wordtype/$tokpos. I used two hashes and an if block to achieve this.
Currently, the problem seems to be that $tokpos is undefined as a key in $SumHash{$tokpos} at line 20, resulting in a division by zero. However, I believe $tokpos is properly defined within the scope of this block. What is the problem exactly and how would I best solve it? I would also gladly hear alternative approaches to this problem.
Here's an example inputfile:
i#CL take#v#17 my#CL checks#n#1 to#CL the#CL bank#n#2 .#IT
Context: i#CL <target>take#v</target> my#CL checks#n to#CL the#CL bank#n
Scores for take#v
    take#v#1: 17
    take#v#10: 158
    take#v#17: 174
Winning score: 174
Context: i#CL take#v my#CL <target>checks#n</target> to#CL the#CL bank#n .#IT
Scores for checks#n
    check#n#1: 198
    check#n#2: 117
    check#n#3: 42
Winning score: 198
Context: take#v my#CL checks#n to#CL the#CL <target>bank#n</target> .#IT
Scores for bank#n
    bank#n#1: 81
    bank#n#2: 202
    bank#n#3: 68
    bank#n#4: 37
Winning score: 202
My erroneous Code:
@files = @ARGV;
foreach $file(@files){
open(IN, $file);
@lines=<IN>;
foreach (@lines){
chomp;
#store tokpossense (eg. "take#v#1") and rawscore (eg. 4)
if (($tokpossense,$rawscore)= /^\s{4}(.+): (\d+)/) {
#split tokpossense for recombination
($tok,$pos,$sensenr)=split(/#/,$tokpossense);
#tokpos (eg. take#v) will be a unique identifier when calculating normalized score
$tokpos="$tok\#$pos";
#block for when new tokpos(word) is found in inputfile
if (defined($prevtokpos) and
($tokpos ne $prevtokpos)) {
# normalize hash: THE PROBLEM LIES IN $SumHash{$tokpos} which is returned as zero > WHY?
foreach (keys %ScoreHash) {
$normscore=$ScoreHash{$_}/$SumHash{$tokpos};
#print the results to a file
print "$_\t$ScoreHash{$_}\t$normscore\n";
}
#empty hashes
undef %ScoreHash;
undef %SumHash;
}
#prevtokpos is assigned to tokpos for condition above
$prevtokpos = $tokpos;
#store the sum of scores for a tokpos identifier for normalization
$SumHash{$tokpos}+=$rawscore;
#store the scores for a tokpossense identifier for normalization
$ScoreHash{$tokpossense}=$rawscore;
}
#skip the irrelevant lines of inputfile
else {next;}
}
}
Extra info: I am doing Word Sense Disambiguation using Pedersen's Wordnet WSD tool which uses Wordnet::Similarity::AllWords. The output file is generated by this package and the found scores have to be normalized for implementation in our toolset.
You don't assign anything to $tokpos. The assignment is part of a comment - syntax highlighting in your editor should've told you. strict would've told you, too.
Also, you should probably use $prevtokpos in the division: $tokpos is the new value that you haven't met before. To get the output for the last token, you have to process it outside the loop, as there's no $tokpos to replace it. To avoid code repetition, use a subroutine to do that:
#!/usr/bin/perl
use warnings;
use strict;
my %SumHash;
my %ScoreHash;
sub output {
my $token = shift;
for (keys %ScoreHash) {
my $normscore = $ScoreHash{$_} / $SumHash{$token};
print "$_\t$ScoreHash{$_}\t$normscore\n";
}
undef %ScoreHash;
undef %SumHash;
}
my $prevtokpos;
while (<DATA>){
chomp;
if (my ($tokpossense,$rawscore) = /^\s{4}(.+): (\d+)/) {
my ($tok, $pos, $sensenr) = split /#/, $tokpossense;
my $tokpos = "$tok\#$pos";
if (defined $prevtokpos && $tokpos ne $prevtokpos) {
output($prevtokpos);
}
$prevtokpos = $tokpos;
$SumHash{$tokpos} += $rawscore;
$ScoreHash{$tokpossense} = $rawscore;
}
}
output($prevtokpos);
__DATA__
i#CL take#v#17 my#CL checks#n#1 to#CL the#CL bank#n#2 .#IT
Context: i#CL <target>take#v</target> my#CL checks#n to#CL the#CL bank#n
Scores for take#v
    take#v#1: 17
    take#v#10: 158
    take#v#17: 174
Winning score: 174
Context: i#CL take#v my#CL <target>checks#n</target> to#CL the#CL bank#n .#IT
Scores for checks#n
    check#n#1: 198
    check#n#2: 117
    check#n#3: 42
Winning score: 198
Context: take#v my#CL checks#n to#CL the#CL <target>bank#n</target> .#IT
Scores for bank#n
    bank#n#1: 81
    bank#n#2: 202
    bank#n#3: 68
    bank#n#4: 37
Winning score: 202
You're confusing yourself by trying to print the results as soon as $tokpos changes. For one thing, it's the values for $prevtokpos that are complete, but you're trying to output the data for $tokpos; and also you're never going to display the last block of data, because you require a change in $tokpos to trigger the output.
It's far easier to accumulate all the data for a given file and then print it when the end of file is reached. This program works by keeping the three values $tokpos, $sense, and $rawscore for each line of the output in array @results, together with the total score for each value of $tokpos in %totals. Then it's simply a matter of dumping the contents of @results with an extra column that divides each value by the corresponding total.
use strict;
use warnings;
use 5.014; # For non-destructive substitution
for my $file ( @ARGV ) {
open my $fh, '<', $file or die $!;
my (@results, %totals);
while ( <$fh> ) {
chomp;
next unless my ($tokpos, $sense, $rawscore) = / ^ \s{4} ( [^#]+ \# [^#]+ ) \# (\d+) : \s+ (\d+) /x;
push @results, [ $tokpos, $sense, $rawscore ];
$totals{$tokpos} += $rawscore;
}
print "** $file **\n";
for my $item ( @results ) {
my ($tokpos, $sense, $rawscore) = @$item;
# Rejoin with '#' so the label matches the input, e.g. take#v#1
printf "%s\t%s\t%6.4f\n", "$tokpos#$sense", $rawscore, $rawscore / $totals{$tokpos};
}
print "\n";
}
output
** tokpos.txt **
take#v#1 17 0.0487
take#v#10 158 0.4527
take#v#17 174 0.4986
check#n#1 198 0.5546
check#n#2 117 0.3277
check#n#3 42 0.1176
bank#n#1 81 0.2088
bank#n#2 202 0.5206
bank#n#3 68 0.1753
bank#n#4 37 0.0954

fast way to compare rows in a dataset

I asked this question in R and got a lot of answers, but all of them crash my 4Gb Ram computer after a few hours running or they take a very long time to finish.
faster way to compare rows in a data frame
Some people said that it's not a job to be done in R. As I don't know C and I'm a little bit fluent in Perl, I'll ask here.
I'd like to know if there is a fast way to compare each row of a large dataset with the other rows, identifying the rows with a specific degree of homology. Let's say for the simple example below that I want homology >= 3.
data:
sample_1,10,11,10,13
sample_2,10,11,10,14
sample_3,10,10,8,12
sample_4,10,11,10,13
sample_5,13,13,10,13
The output should be something like:
output
sample duplicate matches
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3
A match is counted when both lines have the same number in the same position:
perl -F',' -lane'
$k = shift @F;
for my $kk (@o) {
$m = grep { $h{$kk}[$_] == $F[$_] } 0 .. $#F;
$m >=3 or next;
print ++$i, " $kk $k $m";
}
push @o, $k;
$h{$k} = [ @F ];
' file
output,
1 sample_1 sample_2 3
2 sample_1 sample_4 4
3 sample_2 sample_4 3
This solution provides an alternative to direct comparison, which will be slow for large data amounts.
Basic idea is to build an inverted index while reading the data.
This makes comparison faster if there are a lot of different values per column.
For each row, you look up the index and count the matches - this way you only consider the samples where this value actually occurs.
You might still have a memory problem because the index gets as large as your data.
To overcome that, you can shorten the sample name and use a persistent index (using DB_File, for example).
use strict;
use warnings;
use 5.010;
my @h;
my $LIMIT_HOMOLOGY = 3;
while(my $line = <>) {
my @arr = split /,/, $line;
my $sample_no = shift @arr;
my %sim;
foreach my $i (0..$#arr) {
my $value = $arr[$i];
our $l;
*l = \$h[$i]->{$value};
foreach my $s (@$l) {
$sim{$s}++;
}
push @$l, $sample_no;
}
foreach my $s (keys %sim) {
if ($sim{$s}>=$LIMIT_HOMOLOGY) {
say "$sample_no: $s. Matches: $sim{$s}";
}
}
}
For 25000 rows with 26 columns with random integer values between 1 and 100, the program took 69 seconds on my mac book air to finish.

Merge lines and do operations if a condition is satisfied

I'm new to Perl and I would like to read a table and sum some values from specific lines. This is a simplified example of my input file:
INPUT :
Gene Size Feature
GeneA 1200 Intron 1
GeneB 100 Intron 1
GeneB 200 Intron 1
GeneB 150 Intron 2
GeneC 300 Intron 5
OUTPUT :
GeneA 1200 Intron 1
GeneB 300 Intron 1 <-- the size values are summed
GeneB 150 Intron 2
GeneC 300 Intron 5
Because GeneB is present for intron 1 with two different sizes, I would like to sum these two values and print only one line per intron number.
This is an example of the code I am trying to write. I would like to make it more complicated once I understand how to handle this kind of data.
#!/usr/bin/perl
use strict;
use warnings;
my $sum;
my @GAP_list;
my $prevline = 'na';
open INFILE,"Table.csv";
while (my $ligne = <INFILE>)
{
chomp ($ligne);
my @list = split /\t/, $ligne;
my $gene= $list[0];
my $GAP_size= $list[2];
my $intron= $list[3];
my $intron_number=$list[4];
if($prevline eq 'na'){
push @GAP_list, $GAP_size;
}
elsif($prevline ne 'na') {
my @list_p = split /\t/,$prevline;
my $gene_p= $list_p[0];
my $GAP_size_p= $list_p[2];
my $intron_p= $list_p[3];
my $intron_number_p=$list_p[4];
if (($gene eq $gene_p) && ($intron eq $intron_p) && ($intron_number eq $intron_number_p)){
push @GAP_list, $GAP_size;
}
}
else{
$sum = doSum(@GAP_list);
print "$gene\tGAP\t$GAP_size\t$intron\t$intron_number\t$sum\n";
$prevline=$ligne;
}
}
# Subroutine
sub doSum {
my $sum = 0;
foreach my $x (@_) {
$sum += $x;
}
return $sum;
}
Assuming the fields are separated by tabs, the following strategy would work. It buffers the last line, either adding to its sum if the other fields are equal, or printing the old data and then replacing the buffer with the current line.
After the whole input was processed, we must not forget to print out the contents that are still in the buffer.
my $first_line = do { my $l = <>; chomp $l; $l };
my ($last_gene, $last_tow, $last_intron) = split /\t/, $first_line;
while(<>) {
chomp;
my ($gene, $tow, $intron) = split /\t/;
if ($gene eq $last_gene and $intron eq $last_intron) {
$last_tow += $tow;
} else {
print join("\t", $last_gene, $last_tow, $last_intron), "\n";
($last_gene, $last_tow, $last_intron) = ($gene, $tow, $intron);
}
}
print join("\t", $last_gene, $last_tow, $last_intron), "\n";
This works fine as long as genes that may be folded together are always consecutive. If the joinable records are spread all over the file, we have to keep a data structure of all records. After the whole file is parsed, we can emit nicely sorted sums.
We will use a multilevel hash that uses the gene as first level key, and the intron as 2nd level key. The value is the count/tow/whatever:
my %records;
# parse the file
while (<>) {
chomp;
my ($gene, $tow, $intron) = split /\t/;
$records{$gene}{$intron} += $tow;
}
# emit the data:
for my $gene (sort keys %records) {
for my $intron (sort keys %{ $records{$gene} }) {
print join("\t", $gene, $records{$gene}{$intron}, $intron), "\n";
}
}
This seems more like something that can be done easily using a simple SQL Query. Especially as you get your files in a database table format. I couldn't comment on your question, to ask you more about it as I don't have enough reputation to do so.
So I'm assuming that you get your data from a table. Not that you can't solve this problem in Perl. But I strongly recommend using the database to do such calculation when fetching the data file, as that seems much easier. And I am not sure why you chose to do it in Perl, especially when you have lots of such fields in a file and you wanted to do such operations on all of them. And you could still use Perl to interact with your database when solving your problem via an SQL Query.
So my proposed solution in SQL, if the data is collected from a database is:
Write an SQL statement involving a GROUP BY on the GENE and feature field and aggregate the size column.
If your table looks exactly like what you described (let us call it the GeneInformation table) and you load your data file into an SQL database (SQLite, maybe), then your select query would be:
SELECT gene, feature, SUM(size) FROM GeneInformation
GROUP BY gene, feature;
That should give you a list of genes, features and their corresponding total sizes.
If SQL solution is completely impossible for you then I will talk about the Perl solution.
I noticed that the Perl solutions are based on the assumption that a particular gene's values would appear consecutively in the file. If that is the case then I would like to up vote amon's answer (which I can't do at the moment).