Perl count the sum of one column aggregating by another - perl

I have a dataset will a lot of columns. What I need to do is to sum a aggregate a certain column in terms of another. As an example,
ID Volume
A 20
D 60
B 10
A 50
K 30
B 100
D 80
So I want an aggregated sum of all the different IDs (A, B, C...) in terms of volumes and sorted by that sum
The result would be like
D 140
B 110
A 70
K 30
how would I accomplish this in perl?

#!/usr/bin/perl
use strict;
use warnings;
my %ids_and_sums;
while (<>) {
# The regex will only consider one single uppercase letter as
# an ID; in case your IDs may look different, you could prepend
# your 'ID Volume' line with a character which will never be part
# of an ID, and modify below regex to meet your needs
my ($id, $volume) = m/^([A-Z])\s+(\d+)/;
if ($id and $volume) {
$ids_and_sums{$id} += $volume;
}
}
foreach my $key (sort {$ids_and_sums{$b} <=> $ids_and_sums{$a}} keys %ids_and_sums) {
print "$key: $ids_and_sums{$key}\n";
}
This prints:
D: 140
B: 110
A: 70
K: 30
EDIT: I have modified the code so that the sorting will be in descending order of the sums.

You can do it as:
perl -lnae '$H{$F[0]} += $F[1];END { print $_." ".$H{$_} for(keys %H) }'
passing it all but the first line of your input file as standard input.
Ideone Link
You can make Perl discard the heading line as:
perl -lnae 'BEGIN{$i=1;}if($i){$i=0;next;}$H{$F[0]} += $F[1];END { print $_." ".$H{$_ } for(keys %H) }' file
Ideone Link

$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
while (<>) {
($Fld1,$Fld2) = split(' ', $_, -1);
$map{$Fld1} += $Fld2;
}
foreach $i (keys %map) {
print $i, $map{$i};
}
something like this

Related

Compare 2 sets of numbers contained in strings

I have 2 scalars as below:
$a = '100 105 010';
$b = '010 105 100';
How do I compare if both has same set of values? order doesn't matter.
one way is to arrange it in ascending order and compare, is there a better way?
You can split each string into an array and sort and compare arrays. By hand:
use warnings;
use strict;
my $x = '100 105 1 010';
my $y = '010 105 100 2';
my #xs = sort { $a <=> $b } split ' ', $x;
my #ys = sort { $a <=> $b } split ' ', $y;
if (#xs != #ys) {
print "Differ in number of elements.\n";
}
else {
for (0..$#xs) {
print "Differ: $xs[$_] vs $ys[$_]\n" if $xs[$_] != $ys[$_];
}
}
# For boolean comparison put it in a sub
print arr_num_eq(\#xs, \#ys), "\n";
sub arr_num_eq {
my ($ra, $rb) = #_;
return 0 if #$ra != #$rb;
$ra->[$_] != $rb->[$_] && return 0 for 0..$#$ra;
return 1;
}
The sorting can be moved to the sub as well, which would then take strings. The way it stands it can be used for comparison of existing arrays as well. Please add argument checking.
There is a number of modules that have this capability. The perm from Array::Compare hides the sorting above, but internally joins sorted arrays into strings thus duplicating the work here since we started with strings. The List::AllUtils certainly offers this as well with its long list of utilities.
See this post, for example, for a few methods (just not the smart match ~~), and for benchmarks if efficiency is a concern.
Using the mentioned implementation idea from Array::Compare, per comment by ysth
sub str_num_eq {
return join(' ', sort { $a <=> $b } split / /, $_[0])
eq join(' ', sort { $a <=> $b } split / /, $_[1])
}
What the most suitable method is depends on what this is for and how it is used. Is it only a boolean comparison, or will more be done if they are found to differ? How does it come about in your program flow? What are typical sizes of strings, how often is it run? Are the strings most often the same or different, do they typically differ a lot or a little? Etc.
Without modules, you can use hashes:
#!/usr/bin/perl
use warnings;
use strict;
my $x = '100 105 010 2';
my $y = '010 105 100 100 1';
my (%hx, %hy);
$hx{$_}++ for split ' ', $x;
$hy{$_}++ for split ' ', $y;
for my $k (keys %hx) {
if (! exists $hy{$k}) {
print "$k missing in y\n";
} elsif ($hy{$k} != $hx{$k}) {
print "$k has different number of occurences\n";
}
delete $hy{$k};
}
print "$_ missing in x\n" for keys %hy;
$a and $b are special variables used in sort, so I renamed them to $x and $y.
split turns the strings into lists. Each hash counts how many times a member occurs in the list.
See also Perl FAQ 4.
Something else try with pattern matching,
This is not a straight forward but it will work.
Construct the pattern by anyone of your scalar value. Then check the another string by the constructed pattern.
my $a = '100 100 105';
my $b = '100 105 100';
my #b_ary = split(" ",$b);
my $regex = join'\b|\b', #b_ary;
my $word_length = #b_ary * 2 - 1; #Count the number of words and space.
my $rgx = qr"^(?:\b$regex\b|\s){$word_length}$"; #`{n}` match word exact n times
if($a=~m/$rgx/)
{
print "same values\n";
}
else
{
print "Not a same values\n";
}
The answer is already posted above. This is just in case you want to remove the white spaces and compare each number.
$x = '100 105 010';
$y = '010 105 100';
join("",sort split "",join("",split " ",$x)) eq join("",sort split "",join("",split " ",$y));

Perl sort on numbers

I have some perl script to process a file (contains lots of numbers) line by line.
File content (sample data, first 3 numbers are seperated by space, then the seperate is tab between 3rd and 4th numbers):
1 2 3 15
2 9 8 30
100 106 321 92
9 8 2 59
300 302 69 88
....
Script content:
# snippet of script
open(INF, "$infile") || die "Unable to open file $infile: $!\n";
#content = <INF>;
close(INF);
foreach $line (#content) {
# blah blah, script to handle math here
# Now the numbers are stored in separate variables
# $n1 stores the 1st number, i.e.: 1
# $n2 stores the 2nd number, i.e.: 2
# $n3 stores the 3rd number, i.e.: 3
# $n4 stores the 4th number, i.e.: 15
# Solution code to be inserted here
}
I would like to:
Sort the variables $n1, $n2, $n3 and output them in ascending order.
At the end of foreach, get rid of duplicates
My approach:
# Insert below code to foreach
$numbers{$n1} = 1;
$numbers{$n2} = 1;
$numbers{$n3} = 1;
#keys = sort { $numbers{$b} <=> $numbers{$a} } keys %numbers;
#push #numbers, "$keys[0] $keys[1] $keys[2]";
$numbers2{"$keys[0] $keys[1] $keys[2]"} = 1;
This defines two hashes: 1st hash is for sorting, 2nd hash is for removing duplicates after sorting.
Is there any better approach?
Thanks,
Updated with another solution -- it is the lines that may be duplicate, not numbers on a line.
In order to remove duplicate lines it is easiest if we have all sorted lines of three numbers in an array. Then we post-process that by running them through uniq. There are (at least) two ways to do this.
Store lines in an array, each being a reference to a sorted array with the three numbers. Then for comparison construct a string of each on the fly. This is better if there is yet other processing of numbers somewhere along, as they are in arrays.
Build a string out of each sorted line and store those in an array. Then it's easier to compare.
Below I use the first approach, assuming that there is other processing of numbers.
use warnings;
use strict;
use feature wq(say);
use List::MoreUtils qw(uniq);
my $file = 'sort_nums.txt';
my #content = do {
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>;
};
my #linerefs_all;
foreach my $line (#content) {
# Calculations ... numbers stored in ($n1, $n2, $n3, $n4)
my ($n1, $n2, $n3) = split '\s+' $line; # FOR TESTING
# Add to #rlines a reference to the sorted array with first three
push #linerefs, [ sort { $a <=> $b } ($n1, $n2, $n3) ];
}
# Remove dupes by comparing line-arrays as strings, then remake arrayrefs
my #linerefs = map { [ split ] } uniq map { join ' ', #$_ } #linerefs_all;
say "#$_" for #linerefs;
Using the posted lines in the file sort_nums.txt, the code above prints
1 2 3
2 8 9
100 106 321
69 300 302
Explanation of the post-processing line, read from the right.
The map on the right processes a list of arrayrefs. It dereferences each and joins its elements with a space, forming a string for the line. It returns a list of such strings, one per line.
That list is pruned of duplicates by uniq, which itself returns a list, fed into the map on the left.
In that map's block each string is split by (the default) white space into a list (of numbers on the line), and then a reference of that is taken by [ ]. This map thus returns a list of references to arrays, one for each line, what is assigned to #linerefs.
This is then printed.
If this is too much to stomach for one statement break the process into steps, generating intermediate arrays. Or switch to the second method above.
Initial post, assuming that numbers on each line may be duplicates
I take the objective to be: sort three variables and keep only unique ones, for each line.
use List::MoreUtils qw(uniq);
foreach my $line (#content) {
# Calculations, numbers stored in ($n1, $n2, $n3, $n4)
my #nums = uniq sort { $a <=> $b } ($n1, $n2, $n3);
say "#nums";
}
Remember that after this you don't know which one(s) of $n1, $n2, $n3 may have been dropped.
If, for some reason, a non-core module is not suitable, see this in perlfaq4 For example,
my %seen = ();
my #nums = sort { $a <=> $b } grep { ! $seen{$_}++ } ($n1, $n2, $n3);
or, if you need it without an extra hash around
my #nums = do { my %seen; sort { $a <=> $b } grep { !$seen{$_}++ } ($n1, $n2, $n3) };

Splitting and tallying substrings within mixed integer-string data

Input Data (example):
40A3B35A3C
30A5B28A2C2B
Desired output (per-line) is a single number determined by the composition of the code 40A3B35A3C and the following rules:
if A - add the proceeding number to the running total
if B - add the proceeding number to the running total
if C - subtract the proceeding number from the running total
40A 3B 35A 3C would thus produce 40 + 3 + 35 - 3 = 75.
Output from both lines:
75
63
Is there an efficient way to achieve this for a particular column (such as $F[2]) in a tab-delimited .txt file using a one-liner? I have considered splitting the entire code into individual characters, then performing if statement checks to detect A/B/C, but my Perl knowledge is limited and I am unsure how to go about this.
When you use split with a capture, the captured group is returned from split, too.
perl -lane '
#ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * ("C" eq $op ? -1 : 1) while ($n, $op) = splice #ar, 0, 2;
print $s
' < input
Or maybe more declarative:
BEGIN { %one = ( A => 1,
B => 1,
C => -1 ) }
#ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * $one{$op} while ($n, $op) = splice #ar, 0, 2;
print $s
When working through a string like this, it's useful to know that regular expressions can return a list of results.
E.g.
my #matches = $str =~ m/(\d+[A-C])/g; #will catch repeated instances
So you can do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
my $total;
#break the string into digit+letter groups.
for (m/(\d+[A-C])/g) {
#separate out this group into num and code.
my ( $num, $code ) = m/(\d+)([A-C])/;
print "\t",$num, " => ", $code, "\n";
if ( $code eq "C" ) {
$total -= $num;
}
else {
$total += $num;
}
}
print $total, " => ", $_;
}
__DATA__
40A3B35A3C
30A5B28A2C2B
perl -lne 'push #a,/([\d]+)[AB]/g;
push #b,/([\d]+)[C]/g;
$sum+=$_ for(#a);$sum-=$_ for(#b);
print $sum;#a=#b=();undef $sum' Your_file
how it works
use the command line arg as the input
set the hash "%op" to the
operations per letter
substitute the letters for operators in the
input evaluate the substituted input as an expression
use strict;
use warnings;
my %op=qw(A + B + C -);
$ARGV[0] =~ s/(\d+)(A|B|C)/$op{$2} $1/g;
print eval($ARGV[0]);

Selecting highest count of element except when...

So i have been working on this perl script that will analyze and count the same letters in different line spaces. I have implemented the count to a hash but am having trouble excluding a " - " character from the output results of this hash. I tried using delete command or next if, but am not getting rid of the - count in the output.
So with this input:
#extract = ------------------------------------------------------------------MGG-------------------------------------------------------------------------------------
And following code:
#Count selected amino acids.
my %counter = ();
foreach my $extract(#extract) {
#next if $_ =~ /\-/; #This line code does not function correctly.
$counter{$_}++;
}
sub largest_value_mem (\%) {
my $counter = shift;
my ($key, #keys) = keys %$counter;
my ($big, #vals) = values %$counter;
for (0 .. $#keys) {
if ($vals[$_] > $big) {
$big = $vals[$_];
$key = $keys[$_];
}
}
$key
}
I expect the most common element to be G, same as the output. If there is a tie in the elements, say G = M, if there is a way to display both in that would be great but not necessary. Any tips on how to delete or remove the '-' is much appreciated. I am slowly learning perl language.
Please let me know if what I am asking is not clear or if more information is needed, thanks again kindly for all the comments.
Your data doesn't entirely make sense, since it's not actually working perl code. I'm guessing that it's a string divided into characters. After that it sounds like you just want to be able to find the highest frequency character, which is essentially just a sort by descending count.
Therefore the following demonstrates how to count your characters and then sort the results:
use strict;
use warnings;
my $str = '------------------------------------------------------------------MGG-------------------------------------------------------------------------------------';
my #chars = split '', $str;
#Count Characteres
my %count;
$count{$_}++ for #chars;
delete $count{'-'}; # Don't count -
# Sort keys by count descending
my #keys = sort {$count{$b} <=> $count{$a}} keys %count;
for my $key (#keys) {
print "$key $count{$key}\n";
}
Outputs:
G 2
M 1
foreach my $extract(#extract) {
#next if $_ =~ /\-/
$_ setting is suppressed by $extract here.
(In this case, $_ keeps value from above, e.g. routine argument list, previous match, etc.)
Also, you can use character class for better readability:
next if $extract=~/[-]/;

How do I get the lowest matched captureno?

How do I get the lowest matched group no in a regular expression?
Suppose there is a regular expression
/(a(b))|(b(1))|(c(4))/...
e.g input string is "b1" , the lowest matched group no 2. ($2)
e.g input string is "c4" , the lowest matched group no 5. ($5)
e.g input string is "ab" , the lowest matched group no 1. ($1)
I have one solution but it is not very efficient.
Thakns all for trying.
The real problem is efficiency. Many have provided similar solutions I found.
The problem is linear time searching for the lowest group. O(N) where n is number of capture groups.
I wondered if there is a faster way. O(1) That was the aim of this question.
I expected the Perl has a hidden feature to get that value. I guess there is not.
Meanwhile I found the solution myself, here it is..
/(a(b)(??{ $first=1;"" }))|(b(1)(??{ $first=2;"" }))|(c(4)(??{ $first=5;"" }))/
The time to find out $first is O(1).
if (#matches = $conv::content =~/$conv::trtree{convertsourceregqr}[$conversionno]/)
{
my $firstno;
my $c = 0;
for my $m (#matches)
{
if (defined $m)
{
$firstno=$c;
last;
}
$c++;
}**strong text****strong text**
This doesn't specifically match your question, but it might address your actual problem (or else a future reader's).
Edit (12/10/12):
One more option, the special construct (?|) will reorganize the numbering in alternations, so that the numbers will be consistent. This won't help identify which group matched, but will assure you that the matches are in $1 and $2. If you need to know which matched, named captures (below) are the way to go.
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $v ('ab', 'b1', 'c4') {
print "Input: $v\n";
next unless $v =~ /(?|(a(b))|(b(1))|(c(4)))/;
print "$1 => $2\n";
}
Original
Perhaps you want to use named captures to ease the burden of understanding what matched. The named capture results are placed in the %+ hash and are thus much easier to introspect.
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $v ('ab', 'b1', 'c4') {
print "Input: $v\n";
next unless $v =~ /(?<a>a(?<ab>b))|(?<b>b(?<b1>1))|(?<c>c(?<c4>4))/;
foreach my $key (sort keys %+) {
next unless defined $+{$key};
print "\t$key => $+{$key}\n";
}
}
prints
Input: ab
a => ab
ab => b
Input: b1
b => b1
b1 => 1
Input: c4
c => c4
c4 => 4
EDIT
In fact, for alternations like this, perhaps you want to simply use recurring names!
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $v ('ab', 'b1', 'c4') {
print "Input: $v\n";
next unless $v =~ /(?<outer>a(?<inner>b))|(?<outer>b(?<inner>1))|(?<outer>c(?<inner>4))/;
print "\touter => $+{outer}\n";
print "\tinner => $+{inner}\n";
}
prints
Input: ab
outer => ab
inner => b
Input: b1
outer => b1
inner => 1
Input: c4
outer => c4
inner => 4
Store the matches in an array, and find the index of the first defined value:
my $str = 'c4';
my #matches = ( $str =~ m/(a(b))|(b(1))|(c(4))/ );
for my $i ( 0..$#matches ) {
if ( defined $matches[$i] ) {
printf "First matching group: %d\n", $i+1;
last;
}
}
# output: 5
Note that this will never output 2, 4 or 6 since groups 1, 3 or 5 must match for one of them to match.
If you only want the content of the first matching group:
use List::Util 'first';
my $str = 'c4';
print first { defined } $str =~ m/(a(b))|(b(1))|(c(4))/;
The special variables #- and #+ hold the starting and ending positions of successful matches. The practical application to your question is that if $<n> holds some value (for $<n> in $1, $2, etc.), then $+[<n>] will be larger than $-[<n>].
for ('b1', 'c4', 'ab') {
/(a(b))|(b(1))|(c(4))/;
my #i = grep { $+[$_] > $-[$_] } 1..$#+;
# #i contains list of successful matches,
# i.e., if #i == (3,4), then $3 and $4 contain values
if (#i > 0) {
print "Earliest match for '$_' is: \$$i[0]\n";
} else {
print "No match for '$_'\n";
}
}
First off, using parentheses that way is confusing. The simplest solution to this particular problem is to just use one:
/(ab|b1|c4)/
Since the other parentheses do not serve a purpose in this particular case, this will work.
However, there may be times when grouping is needed, in which case you can use non-capturing parentheses and just use one to capture, (?: ... ). In your case it would look like this:
/((?:a(?:b))|(?:b(?:1))|(?:c(?:4)))/
The group numbering in a regular expression is the number of parens.
1 2 3 4 5 6
/(a(b))|(b(1))|(c(4))/
A quick script to demonstrate this:
#!/usr/bin/perl
foreach my $v ('ab', 'b1', 'c4') {
$v =~ /(a(b))|(b(1))|(c(4))/;
if(defined $1) { print "One!\n"; }
if(defined $3) { print "Three!\n"; }
if(defined $5) { print "Five!\n"; }
print << "--EOB--";
$v
1 $1
2 $2
3 $3
4 $4
5 $5
6 $6
--EOB--
}
Which produces the output:
One!
ab
1 ab
2 b
3
4
5
6
Three!
b1
1
2
3 b1
4 1
5
6
Five!
c4
1
2
3
4
5 c4
6 4
At this point, one should be able to easily modify the code to do whatever for whichever group is matched.