Perl: Generating an array of hashes from a file

I'm trying to create an array of hashes that uses student names as the keys and each student's grades as the values, so that I can compute the average for each student, sort the averages in descending order, and print "lastname, firstname: grade average" for each student in that sorted order.
The issue I'm having is with the generation of the array of hashes. I'm 90% sure the problem lies in the split on line 10, but I can't seem to find the solution myself.
students.txt:
chipper jones 29 80 70
hank aaron 99 85 81 75
beth allen 64 84 71 5x9 38 68 53
andruw jones 100 100 100 100 100
ty cobb 75 75 100
code:
#!/usr/bin/perl
# Program 5
my @Aoh;
open(FILEIN, $ARGV[0]);
while (<FILEIN>) {
    chomp;
    push @Aoh, { split / / };
}
for $i (0 .. $#Aoh) {
    print "$i is { ";
    for $role (keys %{ $Aoh[$i] }) {
        print "$role=$Aoh[$i]{$role} ";
    }
    print "}\n";
}
output I'm getting:
0 is { chipper=jones 70= 29=80 }
1 is { 81=75 hank=aaron 99=85 }
2 is { 38=68 53= beth=allen 64=84 71=5x9 }
3 is { 100= andruw=jones }
4 is { ty=cobb 75=75 100= }

push @Aoh, { split / / };
Creates the following hash from the first line:
{ chipper => 'jones',
29 => 80,
70 => undef,
}
That's not what you wanted, right?
I'd use a hash of numbers instead of the array of hashes. You can use "lastname, firstname" directly as the hash key and you can store the averages directly as the values:
#!/usr/bin/perl
use warnings;
use strict;

use List::Util qw{ sum };

my %average;
while (<>) {
    my ($name, $surname, @grades) = split;
    $average{"$surname, $name"} = sum(@grades) / @grades;
}

for my $student (sort { $average{$b} <=> $average{$a} } keys %average) {
    print $student, ': ', $average{$student}, "\n";
}
Note that I'm getting a warning:
Argument "5x9" isn't numeric in subroutine entry at ./1.pl line 11, <> line 3.
How should one treat the 5x9 grade?
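If malformed grades like 5x9 should simply be skipped, one option (my suggestion, not part of the original answer) is to filter them out with Scalar::Util's looks_like_number before averaging, replacing the while loop above:
use Scalar::Util qw(looks_like_number);

while (<>) {
    my ($name, $surname, @grades) = split;
    # keep only values Perl would treat as numbers; "5x9" is dropped
    my @numeric = grep { looks_like_number($_) } @grades;
    next unless @numeric;    # skip students with no valid grades at all
    $average{"$surname, $name"} = sum(@numeric) / @numeric;
}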

Related

sort array and remove duplicates in specific columns Perl

I would like to remove duplicate rows in col 0 of an array in such a way that only the rows with the max value in col 1 remain. The data is tab-delimited, with 16 columns.
sample1_EGFR_19 53 exon19 ...
sample1_EGFR_19 12 exon20 ...
sample2_EGFR_19 20 exon19 ...
sample3_EGFR_20 65 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample1_EGFR_20 12 exon20 ...
sample3_EGFR_20 125 exon20 ...
Desired output:
sample1_EGFR_19 53 exon19 ...
sample1_EGFR_20 12 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample3_EGFR_20 125 exon20 ...
I've started with tab-delimited text files that I split to populate an array. Then I use a hash and sort by keys. In the final output the data is correctly sorted; however, the duplicates are not removed. How do I remove lines that now have a blank first column? Thanks
sample1_EGFR_19 53 exon19 ...
12 exon20 ...
sample2_EGFR_19 25 exon12 ...
20 exon19 ...
sample3 EGFR_20 125 exon20 ...
65 exon20 ...
sample1 EGFR_20 12 exon20 ...
Please suggest a straightforward method to accomplish this. Thanks
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw(uniq);
use List::Util 'first';
use Data::Dumper;

my $filename        = "/data/Test/output.txt";
my $output_filename = "/data/Test/output_changed.txt";
my @resultarray;
my @sorted;
my %result_hash;
my @final_array;

open( TXT2, $filename );
while (<TXT2>) {
    push( @resultarray, $_ );
}
close(TXT2);

foreach (@resultarray) {
    chop($_);
}
foreach (@resultarray) {
    print($_);
    chomp($_);
    my ( $key, $val ) = split /\t/, $_, 2;
    push @{ $result_hash{$key} }, $val;
}
foreach ( sort { $result_hash{$a} <=> $result_hash{$b} } keys %result_hash ) {
    push( @final_array, $_ . "\t" . join "\t", @{ $result_hash{$_} } );
}
undef %result_hash;
foreach (@final_array) {
    chomp($_);
    print($_);
}
for ( 0 .. @final_array - 1 ) {
    my $myuniquearray = $final_array[$_];
    open( MYFILE, ">>$output_filename" );    # opens the output file and appends the lines
    print MYFILE $myuniquearray . "\n";
    close(MYFILE);
}
This is a fairly straightforward UNIX one-liner. Why the requirement to write it in Perl?
$ sort -k1,1 -k2,2rn /data/Test/output.txt | awk '!seen[$1]++' | tee /data/Test/output_changed.txt
sample1_EGFR_19 53 exon19 ...
sample1_EGFR_20 12 exon20 ...
sample2_EGFR_19 25 exon12 ...
sample3_EGFR_20 125 exon20 ...
This sorts it by the first column ascending and by the second column descending and numeric, then uses awk to select the first line from each group. If that awk statement is too confusing, it has the same function as awk 'x != $1 { print; x = $1 }'. (tee writes the lines to the file and displays the output to the terminal.)
If you really must use Perl, here's a simple solution to the described problem:
#!/usr/bin/perl
use strict;
use warnings;
sub sort_func {
# sort by the first col asc and then by the second col desc and numeric
$a->[0] cmp $b->[0] || $b->[1] <=> $a->[1]
}
my %seen;
print
map join("\t", #$_), # re-join the fields with tabs into the original line
grep !$seen{$_->[0]}++, # select the first line of each sorted group
sort sort_func # sort lines using the above sort function
map [split /\t/, $_, 3], # split by tabs so we can sort by the first two fields
<>; # read lines from stdin or the filename given by ARGV[0]
Mark the file executable and use it like so:
./sortlines.pl /data/Test/output.txt >/data/Test/output_changed.txt

finding highest value in hash

I have a hash with 5 keys, and each key has 5 values:
foreach $a (@mass) {
    if ($a =~ /some regex/) {
        @value = ($1, $2, $3, $4, $5);
        $hash{"keysname$c"} = "@value";
        $c++;
    }
}
Each scalar is the value of a different parameter. I have to determine the highest first value across all keys in the hash.
Edit:
The code must compare the first value of key1 with the first values of key2, key3 ... key5 and print the highest one.
This will print the max value at each position for a structure like
my %hash = ( k1 => [6,4,1], k2 => [16,14,11] );

use List::Util qw(max);

# index of the last element of the longest array
my $n = max map $#$_, values %hash;
for my $i (0 .. $n) {
    my $max = max map $_->[$i], values %hash;
    print "max value on position $i is $max\n";
}
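For the sample %hash above, this prints:
max value on position 0 is 16
max value on position 1 is 14
max value on position 2 is 11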
and for strings,
my %hash = ( k1 => "6 4 1", k2 => "16 14 11" );

use List::Util qw(max);

# index of the last element of the longest list after splitting
my $n = max map $#{[ split ]}, values %hash;
for my $i (0 .. $n) {
    my $max = max map [split]->[$i], values %hash;
    print "max value on position $i is $max\n";
}
If I understand your question correctly (and it's a little unclear) then I think you want something like this:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

use List::Util 'max';

my (@data, @max);

while (<DATA>) {
    chomp;
    push @data, [split];
}

for my $i (0 .. $#{$data[0]}) {
    push @max, max map { $_->[$i] } @data;
}

say "@max";

__DATA__
93 3 26 87 7
66 96 46 77 42
26 3 71 64 91
31 27 14 40 86
82 72 71 34 7
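For the rows under __DATA__, this prints the per-column maxima:
93 96 71 87 91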
try this
map { push @temp, @{$_} } values %hash;
@desc_sorted = sort { $b <=> $a } @temp;
print $desc_sorted[0], "\n";
map consolidates all the lists into a single list, and sort sorts that consolidated array in descending order, so the first element is the overall maximum.

Perl subroutine assistance

Using Data::Dumper and List::Util, I'm able to sum each row of my array with a subroutine. This part is correct.
With a simpler approach I attempted to print the grand total of all the numbers with a separate subroutine called get_grandtotal. This returns an incorrect number.
My question is: how do I print the correct grand total?
And what modifications would I use to print the column totals (instead of the row totals) using a structure similar to get_row(@values)?
#!/usr/bin/perl
use 5.10.1;
use warnings;
use strict;

use List::Util qw(sum);
use Data::Dumper;

my @values = (
    [ 6,  5,   13 ],
    [ 35, 9,   6 ],
    [ 65, 255, 54 ]
);

get_row(@values);

sub get_row {
    my @total;
    foreach my $row (@_) {
        say join ' ', @$row;
        push @total, sum @$row;
    }
    say Data::Dumper->Dump( [ \@total ], [ qw(*Row_Total) ] );
}

my $sum = 0;

sub get_grandtotal() {
    foreach (@values) {
        $sum += $_;
    }
    print "Grand Total = $sum\n";
}

get_grandtotal();
Output
6 5 13
35 9 6
65 255 54
@Row_Total = (
'24',
'50',
'374'
);
Grand Total = 61899232
You are trying to add together array references in
$sum += $_;
change this to
$sum += sum @$_;
and your code will work.
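For reference, here is the whole subroutine with that one-line fix applied (a sketch; it relies on the @values array and List::Util's sum already in the script above):
sub get_grandtotal {
    my $sum = 0;
    foreach (@values) {
        $sum += sum @$_;    # dereference each row and sum its elements
    }
    print "Grand Total = $sum\n";    # prints 448 for the sample data
}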
This subroutine uses map to extract the columns from the array, and prints the totals
sub get_column {
    my @total;
    foreach my $i (0 .. $#{ $values[0] }) {
        my @column = map $_->[$i], @values;
        say join ' ', @column;
        push @total, sum @column;
    }
    say Data::Dumper->Dump( [ \@total ], [ qw(*Column_Total) ] );
}
output
6 35 65
5 9 255
13 6 54
@Column_Total = (
'106',
'269',
'73'
);

Perl count the sum of one column aggregating by another

I have a dataset with a lot of columns. What I need to do is sum one column, aggregating by another. As an example,
ID Volume
A 20
D 60
B 10
A 50
K 30
B 100
D 80
So I want the aggregated volume sum for each distinct ID (A, B, D, ...), sorted by that sum.
The result would be like
D 140
B 110
A 70
K 30
How would I accomplish this in Perl?
#!/usr/bin/perl
use strict;
use warnings;

my %ids_and_sums;
while (<>) {
    # The regex will only consider a single uppercase letter as
    # an ID; in case your IDs look different, you could prepend
    # your 'ID Volume' lines with a character which will never be part
    # of an ID, and modify the regex below to meet your needs
    my ($id, $volume) = m/^([A-Z])\s+(\d+)/;
    if ($id and $volume) {
        $ids_and_sums{$id} += $volume;
    }
}

foreach my $key (sort { $ids_and_sums{$b} <=> $ids_and_sums{$a} } keys %ids_and_sums) {
    print "$key: $ids_and_sums{$key}\n";
}
This prints:
D: 140
B: 110
A: 70
K: 30
EDIT: I have modified the code so that the sorting will be in descending order of the sums.
You can do it as:
perl -lnae '$H{$F[0]} += $F[1];END { print $_." ".$H{$_} for(keys %H) }'
passing it all but the first line of your input file as standard input.
You can make Perl discard the heading line as:
perl -lnae 'BEGIN{$i=1}if($i){$i=0;next}$H{$F[0]} += $F[1];END{ print $_." ".$H{$_} for (keys %H) }' file
$, = ' ';     # set output field separator
$\ = "\n";    # set output record separator

while (<>) {
    ($Fld1, $Fld2) = split(' ', $_, -1);
    $map{$Fld1} += $Fld2;
}
foreach $i (keys %map) {
    print $i, $map{$i};
}
something like this
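Note that neither the one-liners nor the last snippet sorts the output. A variant of the first one-liner (my addition) that also sorts descending by the summed volume, matching the desired output, assuming the heading line has already been stripped:
perl -lnae '$H{$F[0]} += $F[1]; END { print "$_ $H{$_}" for sort { $H{$b} <=> $H{$a} } keys %H }' file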

perl text::csv - filtering specific columns in a csv document and discarding others

I would like to keep the particular columns that match a regex and discard the others. For example, if I had the following column names:
date
mem_total
cpu.usagemhz.average_0
cpu.usagemhz.average_1
cpu.usagemhz.average_2
I would like to capture only the columns whose names begin with "cpu.usagemhz.average".
Is there a particular function of Text::CSV that will help me do a quick check of the column names?
Thanks!
JD
Update:
I tried jimtut's answer and it is extremely close to what I am looking for. Thanks again, everyone!
Here is the code from jimtut with one small edit to the print statement at the bottom. I added the print of $colCount just to see what was going on with the data.
use Text::CSV;

my $file    = "foo.csv";
my $pattern = ".*In";

open( F, $file ) or warn "Warning! Unable to open $file\n";

my $lineCount = 0;
my %desiredColumns;
while (<F>) {
    $lineCount++;
    my $csv    = Text::CSV->new();
    my $status = $csv->parse($_);    # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ( $lineCount == 1 ) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ( $field =~ m/$pattern/ ) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if ( exists $desiredColumns{$colCount} ) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                print "$colCount\t$field\n";
            }
        }
    }
}
close(F);
Here are the results:
colCount | $field
12 565
13 73
14 36
15 32
16 127
17 40
18 32
19 42
20 171
12 464
13 62
14 32
15 24
16 109
17 21
18 19
19 39
20 150
12 515
13 76
14 28
15 30
16 119
17 15
18 25
19 46
20 169
12 500
13 71
14 30
15 28
16 111
17 20
18 18
19 40
20 167
I would like to add this data to individual arrays or hashes. What do you think? Something like...
foreach column {
    check to see if a hash already exists with that column number. If not, then create one.
}
Then go through each field and add the field data to the appropriate hash.
Do you think this is the right way to go about solving this?
No, there is no specific function in Text::CSV for that. I would do something like this:
use Text::CSV;

my $file    = "foo.csv";
my $pattern = "cpu.usagemhz.average.*";

open( F, $file ) or die "Unable to open $file: $!\n";

my $lineCount = 0;
my %desiredColumns;
my %columnContents;
while (<F>) {
    $lineCount++;
    my $csv    = Text::CSV->new();
    my $status = $csv->parse($_);    # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ( $lineCount == 1 ) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ( $field =~ m/$pattern/ ) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if ( exists $desiredColumns{$colCount} ) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                push( @{ $columnContents{$colCount} }, $field );
            }
        }
    }
}
close(F);

foreach my $key ( sort keys %columnContents ) {
    print "Column $key: " . join( ",", @{ $columnContents{$key} } ) . "\n\n";
}
Hope that helps! I'm sure someone can write that in a Perl one-liner, but that's easier (for me) to read...
Since your fields of interest are at indexes 2-4, we'll just pluck those out of the field array returned by getline(). This sample code prints them, but you can do whatever you like with them.
use Text::CSV;                                 # load the module
my $csv = Text::CSV->new();                    # instantiate
open my $fh, '<', 'somefile';                  # open the input
while ( my $fields = $csv->getline($fh) ) {    # read a line, and parse it into fields
    print "I got @{$fields}[2..4]\n";          # print the fields of interest
}
close($fh);                                    # close when done
WHY are you trying to do this? Is it to minimize storage? To eliminate the processing cost of parsing many un-needed columns?
If the latter, you can't avoid that processing cost: any solution you come up with would still read and parse 100% of the file.
If the former, there are many methods, some more efficient than others.
Also, what exactly do you mean by "help me do a quick check of the column names"? If you want to get the column names, there's a column_names() method, provided you previously set the column names using $csv->column_names($csv->getline($fh)).
If you want to return only specific columns in a hash, to avoid wasting memory on un-needed columns, there's no clear-cut API for that. You can roll your own, or abuse a "bug/feature" of the getline_hr() method.
For the former (roll your own), you can do something like:
my $headers = $csv->getline( $fh );    # first line is headers
my @headers_keep = map { /^cpu\.usagemhz\.average/ ? 1 : 0 } @$headers;
my @rows;
while ( my $row = $csv->getline( $fh ) ) {
    my $i = 0;
    my @row_new = grep { $headers_keep[$i++] } @$row;
    push @rows, \@row_new;
}
You can either roll your own like that, or use the getline_hr() approach below.
You can also use a "feature" of getline_hr(): when a column name is duplicated, only the LAST column with that name gets assigned into the hash.
In your case, for the column names date, mem_total, cpu.usagemhz.average_0, cpu.usagemhz.average_1, cpu.usagemhz.average_2, merely set the column_names array to contain the value "cpu.usagemhz.average_0" in its first 2 elements; those two columns will then NOT be saved by getline_hr().
More generally, you can go over the list of columns, find each consecutive range of un-needed columns, and replace their names with the name of the first needed column following that range. The only sticking point is when an un-needed range sits at the very end of the columns; replace those names with "JUNK" or something similar.
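Here is a minimal sketch of that getline_hr() trick (my own illustration, reusing the foo.csv name from above). Rather than pointing un-needed ranges at a following needed column, it renames every un-needed column to the same dummy name, so they all collapse into one key that can simply be deleted:
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1 } );
open my $fh, '<', 'foo.csv' or die "Unable to open foo.csv: $!";

# Read the header row and rename every unwanted column to "JUNK";
# getline_hr() keeps only the last value seen for a duplicated name.
my $headers = $csv->getline($fh);
my @names = map { /^cpu\.usagemhz\.average/ ? $_ : 'JUNK' } @$headers;
$csv->column_names(@names);

while ( my $row = $csv->getline_hr($fh) ) {
    delete $row->{JUNK};    # drop the one surviving unwanted value
    # $row now holds only the cpu.usagemhz.average_* columns
    print join( ',', map { "$_=$row->{$_}" } sort keys %$row ), "\n";
}
close $fh;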