Combining duplicated lines in txt file with perl

I am trying to combine duplicate lines using Perl with little luck. My tab-delimited text file is structured as follows (spaces added for readability):
Pentamer Probability Observed Length
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
I would like to combine duplicated lines by adding the three numeric columns; therefore, the line containing "ATGCA" would now look like this:
ATGCA 0.024 3 39
Any ideas/help/suggestions would be greatly appreciated! Thanks!

#!/usr/bin/perl
use warnings;
use strict;
my %hash;
while (<>) {
    my @v = split(/\s+/);
    if (defined $hash{$v[0]}) {
        my $arr = $hash{$v[0]};
        $hash{$v[0]} = [$v[0], $arr->[1] + $v[1],
                        $arr->[2] + $v[2], $arr->[3] + $v[3]];
    } else {
        $hash{$v[0]} = [@v];
    }
}
foreach my $key (keys %hash) {
    print join(" ", @{$hash{$key}}), "\n";
}

Here's another option:
use Modern::Perl;
my %hash;
while ( my $line = <DATA> ) {
    my @vals = split /\s+/, $line;
    $hash{ $vals[0] }->[$_] += $vals[ $_ + 1 ] for 0 .. 2;
}
say join "\t", $_, @{ $hash{$_} } for sort keys %hash;
__DATA__
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
Output:
ATGCA 0.024 3 39
CAGTG 0.004 1 16
GGCAT 0.008 1 16
TGTAC 0.028 2 39
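If the original row order matters, a variation along the following lines might help. This is only a minimal sketch: the sub name combine_rows is made up for illustration, and it assumes the header line has already been removed and that every column after the first is numeric.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answers above): sums the numeric columns
# of all rows that share the same first column, keeping first-seen order.
sub combine_rows {
    my @lines = @_;
    my (%sum, @order);
    for my $line (@lines) {
        my ($key, @nums) = split ' ', $line;
        next unless defined $key;                      # skip blank lines
        push @order, $key unless exists $sum{$key};    # remember first appearance
        $sum{$key} //= [];
        $sum{$key}[$_] += $nums[$_] for 0 .. $#nums;   # add column by column
    }
    return map { join "\t", $_, @{ $sum{$_} } } @order;
}

# Data rows only; the header line is assumed to have been removed already.
my @rows = (
    "ATGCA\t0.008\t1\t16",
    "TGTAC\t0.021\t1\t16",
    "GGCAT\t0.008\t1\t16",
    "CAGTG\t0.004\t1\t16",
    "ATGCA\t0.016\t2\t23",
    "TGTAC\t0.007\t1\t23",
);
print "$_\n" for combine_rows(@rows);
```

Keeping a separate @order array is what preserves the input order; iterating over keys %hash directly returns the keys in no particular order.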

Related

Looping through a perl array

I am trying to:
Populate 10 elements of the array with the numbers 1 through 10.
Add all of the numbers contained in the array by looping through the values contained in the array.
For example,
it would start off as 1, then the second number would be 3 (1 plus 2), and then the next would be 6 (the existing 3 plus the new 3)
This is my current code
#!/usr/bin/perl
use warnings;
use strict;
my @b = (1..10);
for (@b) {
    $_ = $_ * $_;
}
print("The total is: @b\n");
and this is the result
The total is: 1 4 9 16 25 36 49 64 81 100
What I'm looking for is:
The total is: 1 3 6 10 etc..
Each element of the sequence shown is its index + 1 plus the value at the previous index:
perl -wE'@b = 1..10; @r = 1; $r[$_] = $_+1 + $r[$_-1] for 1..$#b; say "@r"'
The syntax $#name gives the last index of the array @name.
If the array is changed in place, as in the next one-liner, there is no need to initialize a second array:
perl -wE'@b = 1..10; $b[$_] = $_+1 + $b[$_-1] for 1..$#b; say "@b"'
Both print
1 3 6 10 15 21 28 36 45 55
As a script
use warnings;
use strict;
use feature 'say';
my @seq = 1..10;
for my $i (1..$#seq) {
    $seq[$i] = $i+1 + $seq[$i-1];
}
say "@seq";
$ perl -E'say "The total is: ",join" ",map$sum+=$_,1..10'
The total is: 1 3 6 10 15 21 28 36 45 55

Generate all combinations of up to N digits, including repeating digits, in Perl

What's the best way to generate all combinations of 1 to N digits, where digits could be repeated in the combination? E.g, given array 0..2, the result should be:
0
1
2
00
01
02
10
11
12
20
21
22
000
001
002
010
011
etc.
I've played with Algorithm::Permute, but it looks like it can only generate unique combinations of N numbers:
for ( my $a = 0; $a < 3; $a++ ) {
    for ( my $b = 0; $b < 3; $b++ ) {
        my @array = $a..$b;
        Algorithm::Permute::permute {
            my $Num = join("", @array);
            print $Num;
            sleep 1;
        } @array;
    }
}
Thank you.
As its name suggests, Algorithm::Permute offers permutations. There are many mathematical variations on selecting k items from a population of N: with and without replacement, with and without repetition, ignoring order or not.
It's hard to be certain, but you probably want Algorithm::Combinatorics.
Here's some example code that reproduces at least the part of your expected data that you have shown. It's pretty much the same as zdim's solution, but there may be something extra useful to you here.
use strict;
use warnings 'all';
use feature 'say';
use Algorithm::Combinatorics 'variations_with_repetition';
my @data = 0 .. 2;
for my $k ( 1 .. @data ) {
    say @$_ for variations_with_repetition(\@data, $k);
}
Output:
0
1
2
00
01
02
10
11
12
20
21
22
000
001
002
010
011
012
020
021
022
100
101
102
110
111
112
120
121
122
200
201
202
210
211
212
220
221
222
my @digits = 0..2;
my $len = 3;
my @combinations = map glob("{@{[join ',', @digits]}}" x $_), 1..$len;

Perl: read an array and calculate corresponding percentile

I am trying to write Perl code that reads a text file containing a series of numbers, then calculates and prints the numbers that correspond to given percentiles. I do not have access to additional statistics modules, so I'd like to stick with pure Perl. Thanks in advance!
The input text file looks like:
197
98
251
82
51
272
154
167
38
280
157
212
188
88
40
229
228
125
292
235
67
70
127
26
279
.... (and so on)
The code I have is:
#!/usr/bin/perl
use strict;
use warnings;
my @data;
open (my $fh, "<", "testing2.txt")
or die "Cannot open: $!\n";
while (<$fh>){
push @data, $_;
}
close $fh;
my %count;
foreach my $datum (@data) {
++$count{$datum};
}
my %percentile;
my $total = 0;
foreach my $datum (sort { $a <=> $b } keys %count) {
$total += $count{$datum};
$percentile{$datum} = $total / @data;
# percentile subject to change
if ($percentile{$datum} <= 0.10) {
print "$datum : $percentile{$datum}\n\n";
}
}
My desired output:
2 : 0.01
3 : 0.01333
4 : 0.01666
6 : 0.02
8 : 0.03
10 : 0.037
12 : 0.04
14 : 0.05
15 : 0.05333
16 : 0.06
18 : 0.06333
21 : 0.07333
22 : 0.08
25 : 0.09
26 : 0.09666
Where the format is #number from the list : #corresponding percentile
To store the number without a newline in @data, just add chomp; before pushing it, or chomp @data; after you've read them all.
If your input file has MSWin style newlines, convert it to *nix style using dos2unix or fromdos.
Also, try to learn how to indent your code, it boosts readability. And consider renaming $total to $running_total, as you use the value as it changes.
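Putting those suggestions together, the calculation might look roughly like the sketch below. The sub name cumulative_fraction and the four sample values are placeholders, not part of the question; in the real script the lines would come from the file.

```perl
use strict;
use warnings;

# Hypothetical helper: maps each distinct value to the fraction of
# data points that are less than or equal to it.
sub cumulative_fraction {
    my @data = @_;
    my %count;
    $count{$_}++ for @data;

    my %fraction;
    my $running_total = 0;
    for my $value (sort { $a <=> $b } keys %count) {
        $running_total += $count{$value};
        $fraction{$value} = $running_total / @data;   # @data in scalar context is its size
    }
    return %fraction;
}

my @raw = ("197\n", "98\n", "251\n", "82\n");   # stand-in for lines read from a file
chomp @raw;                                       # strip the trailing newlines
my %fraction = cumulative_fraction(@raw);
printf "%s : %.3f\n", $_, $fraction{$_}
    for sort { $a <=> $b } keys %fraction;
```

Without the chomp, the hash keys would still contain "\n", which is why the printed values would appear to break across lines.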

Perl: find sum and average of specific columns

I want to calculate the average over all itemsX (where X is a digit) for each row in Perl on windows.
I have file in format:
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
Want to find sum of item blocks and their average.
Any help on this?
Here's what I've tried so far:
use strict;
use warnings;
open my $fh, '<', "files.txt" or die $!;
my $total = 0;
my $count = 0;
while (<$fh>) {
my ($item1, $item2, ) = split;
$total += $numbers;
$count += 1;
}
For the first line of input (the column names), we store the indices of the columns that start with item. For each subsequent line, we sum the columns referenced by the array slice derived from @indices.
use strict;
use warnings;
use List::Util qw(sum);
my @indices;
while (<DATA>) {
    my @fields = split;
    if ($. == 1) {
        @indices = grep { $fields[$_] =~ /^item/ } 0 .. $#fields;
        next;
    }
    my $sum = sum(@fields[@indices]);
    my $avg = $sum / scalar(@indices);
    printf("Row %d stats: sum=%d, avg=%.2f\n", $., $sum, $avg);
}
__DATA__
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
Output:
Row 2 stats: sum=77, avg=25.67
Row 3 stats: sum=165, avg=55.00

How to print common values from two different overlapping ranges without repetition

I have the following code. I am trying to print all common values from @arr2 and @arr4 without repetition. The expected output should be 5,6,7,8,9,13,14,15,16,17,18. I don't understand how to put a condition in the loop to avoid repetition, or why $i is not printed by this code.
#!/usr/bin/perl
my @arr2 = ( 1 .. 10, 5 .. 15, 10 .. 20 );
my @arr4 = ( 5 .. 9, 13 .. 18 );
foreach my $line1 (@arr2) {
    my ( $from1, $to1 ) = split( /\.\./, $line1 );
    #print "$to1\n";
    foreach my $line2 (@arr4) {
        my ( $from2, $to2 ) = split( /\.\./, $line2 );
        for ( my $i = $from1; $i <= $to1; $i++ ) {
            for ( my $j = $from2; $j <= $to2; $j++ ) {
                if ( $i == $j ) {
                    print "$i \n";
                }
            }
        }
    }
}
As Jonathan has pointed out, you appear to misunderstand the nature of your data because you don't recognize the range operator .. used to construct lists.
my @array = (1 .. 10);
print "@array\n";
Outputs
1 2 3 4 5 6 7 8 9 10
Once you recognize that, then you just need to be pointed to the following:
perlfaq4 - How can I remove duplicate elements from a list or array?
perlfaq4 - How do I compute the difference of two arrays? How do I compute the intersection of two arrays?
Combined to form:
#!/usr/bin/perl
use strict;
use warnings;
my @arr2 = ( 1 .. 10, 5 .. 15, 10 .. 20 );
my @arr4 = ( 5 .. 9, 13 .. 18 );
my %seen;
$seen{$_}++ for uniq(@arr2), uniq(@arr4);
my @intersection = sort { $a <=> $b } grep { $seen{$_} == 2 } keys %seen;
print "@intersection\n";
sub uniq {
    my %seen;
    $seen{$_}++ for @_;
    return keys %seen;
}
Outputs:
5 6 7 8 9 13 14 15 16 17 18
The first step to understanding your problem is to understand your data — the arrays do not hold what you think they hold.
#!/usr/bin/perl
my @arr2=(1..10,5..15,10..20);
my @arr4=(5..9,13..18);
print "arr2: @arr2\n";
print "arr4: @arr4\n";
The output from this is:
arr2: 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15 10 11 12 13 14 15 16 17 18 19 20
arr4: 5 6 7 8 9 13 14 15 16 17 18
This shows that your code trying to split a string on .. is going to fail horribly.
One of the most basic debugging techniques is printing out the data you've actually got to ensure it matches what you think you should have. Here, that basic printing would have shown that the input data is not in the format you expected.