Perl: find sum and average of specific columns

I want to calculate the average over all itemX columns (where X is a digit) for each row, using Perl on Windows.
I have a file in this format:
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
I want to find the sum of the item columns in each row and their average.
Any help with this?
Here's what I've tried so far:
use strict;
use warnings;
open my $fh, '<', "files.txt" or die $!;
my $total = 0;
my $count = 0;
while (<$fh>) {
    my ($item1, $item2, ) = split;
    $total += $numbers;
    $count += 1;
}

For the first line of input (the column names), we store the indices of the columns that start with item. For each subsequent line, we sum the columns referenced by the array slice derived from @indices.
use strict;
use warnings;
use List::Util qw(sum);
my @indices;
while (<DATA>) {
    my @fields = split;
    if ($. == 1) {
        @indices = grep { $fields[$_] =~ /^item/ } 0 .. $#fields;
        next;
    }
    my $sum = sum(@fields[@indices]);
    my $avg = $sum / scalar(@indices);
    printf("Row %d stats: sum=%d, avg=%.2f\n", $., $sum, $avg);
}
__DATA__
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
Output:
Row 2 stats: sum=77, avg=25.67
Row 3 stats: sum=165, avg=55.00
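Since the question reads from a file rather than from __DATA__, the same idea can be sketched as a subroutine that takes any filehandle. The name item_stats is hypothetical, and an in-memory filehandle stands in for the real file here; this is a minimal sketch, assuming the first non-blank line is the header.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: reads a header line plus data lines from $fh and
# returns one [sum, average] pair per data line.
sub item_stats {
    my ($fh) = @_;
    my (@indices, @stats);
    while (my $line = <$fh>) {
        my @fields = split ' ', $line;
        next unless @fields;                # skip blank lines
        if (!@indices) {                    # first non-blank line is the header
            @indices = grep { $fields[$_] =~ /^item/ } 0 .. $#fields;
            die "no item columns found\n" unless @indices;
            next;
        }
        my $sum = sum(@fields[@indices]);
        push @stats, [ $sum, $sum / @indices ];
    }
    return @stats;
}

# An in-memory filehandle stands in for open(my $fh, '<', 'files.txt').
my $data = "id1 item1 cart1 id2 item2 cart2 id3 item3 cart3\n"
         . "0 11 34 1 22 44 2 44 44\n"
         . "1 44 44 55 66 34 45 55 33\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @stats = item_stats($fh);
close $fh;
printf "sum=%d, avg=%.2f\n", @$_ for @stats;
```

To use it on a real file, open the file as in the question and pass that handle to item_stats instead.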

Related

Looping through a perl array

I am trying to:
Populate 10 elements of the array with the numbers 1 through 10.
Add all of the numbers contained in the array by looping through the values contained in the array.
For example,
it would start off as 1, then the second number would be 3 (1 plus 2), and then the next would be 6 (the existing 3 plus the new 3)
This is my current code
#!/usr/bin/perl
use warnings;
use strict;
my @b = (1..10);
for(@b){
    $_ = $_ *$_ ;
}
print ("The total is: @b\n")
and this is the result
The total is: 1 4 9 16 25 36 49 64 81 100
What I'm looking for is:
The total is: 1 3 6 10 etc..

In the sequence shown, each element is its index + 1 plus the value at the previous index.
perl -wE'@b = 1..10; @r = 1; $r[$_] = $_+1 + $r[$_-1] for 1..$#b; say "@r"'
The syntax $#name gives the last index of the array @name.
If the array is changed in place, as in the next command, there is no need to initialize a separate result array:
perl -wE'@b = 1..10; $b[$_] = $_+1 + $b[$_-1] for 1..$#b; say "@b"'
Both print
1 3 6 10 15 21 28 36 45 55
As a script
use warnings;
use strict;
use feature 'say';
my @seq = 1..10;
for my $i (1..$#seq) {
    $seq[$i] = $i+1 + $seq[$i-1];
}
say "@seq";

$ perl -E'say "The total is: ",join" ",map$sum+=$_,1..10'
The total is: 1 3 6 10 15 21 28 36 45 55
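The running total behind these answers can also be written as a small reusable subroutine that works for any list of numbers, not only 1..10. The name running_totals is hypothetical; this is a sketch, not the only way to do it.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the running (cumulative) totals of a list.
sub running_totals {
    my @totals;
    my $sum = 0;
    for my $n (@_) {
        $sum += $n;             # add the current element to the total so far
        push @totals, $sum;     # record the total after this element
    }
    return @totals;
}

my @totals = running_totals(1 .. 10);
print "The total is: @totals\n";
```

Keeping the total in a separate variable leaves the input array unchanged, unlike the in-place versions.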

How to resolve this warning in Perl

I asked this type of question previously but didn't provide the full code.
I am reading the file below, finding the maximum word width in each column, and then writing the data to another file with proper alignment.
id0 id1 id2 batch
0 34 56 70
2 3647 58 72 566
4 39 616 75 98 78 78987 9876 7899 776
89 40 62 76
8 42 64 78
34 455 544 565
My code:
unlink "temp1.log";
use warnings;
use strict;
use feature 'say';
my $log1_file = "log1.log";
my $temp1 = "temp1.log";
open(IN1, "<$log1_file" ) or die "Could not open file $log1_file: $!";
my @col_lens;
while (my $line = <IN1>) {
    my @fs = split " ", $line;
    my @rows = @fs ;
    @col_lens = map (length, @rows) if $.==1;
    for my $col_idx (0..$#rows) {
        my $col_len = length $rows[$col_idx];
        if ($col_lens[$col_idx] < $col_len) {
            $col_lens[$col_idx] = $col_len;
        }
    };
};
close IN1;
open(IN1, "<$log1_file" ) or die "Could not open file $log1_file: $!";
open(tempp1,"+>>$temp1") or die "Could not open file $temp1: $!";
while (my $line = <IN1>) {
    my @fs = split " ", $line;
    my @az;
    for my $h (0..$#fs) {
        my $len = length $fs[$h];
        my $blk_len = $col_lens[$h]+1;
        my $right = $blk_len - $len;
        $az[$h] = (" ") . $fs[$h] . ( " " x $right );
    }
    say tempp1 (join "|",@az);
};
My warning
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 3.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
I am getting the correct output but don't know how to get rid of these warnings.

$col_idx can be up to the number of fields on a line, minus one. For the third line, this is more than the highest index of @col_lens, which at that point holds only the 4 lengths taken from the first line. So doing the following makes no sense:
if ($col_lens[$col_idx] < $col_len) {
$col_lens[$col_idx] = $col_len;
}
Replace it with
if (!defined($col_lens[$col_idx]) || $col_lens[$col_idx] < $col_len) {
$col_lens[$col_idx] = $col_len;
}
With this, there's really no point checking for $. == 1 anymore.
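As a minimal sketch of that point, the width pass can be written with no special case for the first line at all. The subroutine name column_widths is hypothetical, and the input is passed as a list of lines rather than read from log1.log.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the maximum field width seen in each column.
sub column_widths {
    my @widths;
    for my $line (@_) {
        my @fields = split ' ', $line;
        for my $i (0 .. $#fields) {
            my $len = length $fields[$i];
            # A column not seen before has no entry yet, so test definedness first.
            $widths[$i] = $len
                if !defined $widths[$i] || $widths[$i] < $len;
        }
    }
    return @widths;
}

my @widths = column_widths(
    'id0 id1 id2 batch',
    '0 34 56 70',
    '2 3647 58 72 566',
);
print "@widths\n";    # prints "3 4 3 5 3"
```

The fifth width comes from a column that only exists on the third line, which is exactly the case that triggered the warnings.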

You're getting the uninitialized warning because, when the condition $col_lens[$col_idx] < $col_len is checked, $col_lens[$col_idx] is undef for any column that did not appear on the first line.
Solution 1:
Default the missing entry to 0 before comparing, so that columns absent from the first line are still measured.
for my $col_idx (0..$#rows) {
    my $col_len = length $rows[$col_idx];
    $col_lens[$col_idx] //= 0;
    if ($col_lens[$col_idx] < $col_len) {
        $col_lens[$col_idx] = $col_len;
    }
}
Solution 2 (not advised):
You can simply silence the "Use of uninitialized value" warnings by putting this line at the top of your script. The pragma is lexically scoped, so inside a block it disables these warnings only for that block.
no warnings 'uninitialized';
For more info, please refer this link

The following code demonstrates one of many possible solutions to this task:
read line by line
get length of each field
compare with stored earlier
adjust to max length
form $format string for print
print formatted data
use strict;
use warnings;
use feature 'say';
my(@data,@length,$format);
while ( <DATA> ) {
    my @e = split ' ';
    my @l = map{ length } @e;
    $length[$_] = ($length[$_] // 0) < $l[$_] ? $l[$_] : $length[$_] for 0..$#e;
    push @data,\@e;
}
$format = join ' ', map{ '%'.$_.'s' } @length;
$format .= "\n";
for my $row ( @data ) {
    printf $format, map { $row->[$_] // '' } 0..$#length;
}
__DATA__
id0 id1 id2 batch
0 34 56 70
2 3647 58 72 566
4 39 616 75 98 78 78987 9876 7899 776
89 40 62 76
8 42 64 78
34 455 544 565
Output
id0  id1 id2 batch
  0   34  56    70
  2 3647  58    72 566
  4   39 616    75  98 78 78987 9876 7899 776
 89   40  62    76
  8   42  64    78
 34  455 544   565

Generate all combinations of up to N digits, including repeating digits, in Perl

What's the best way to generate all combinations of 1 to N digits, where digits could be repeated in the combination? E.g, given array 0..2, the result should be:
0
1
2
00
01
02
10
11
12
20
21
22
000
001
002
010
011
etc.
I've played with Algorithm::Permute, but it looks like it can only generate unique combinations of N numbers:
for( my $a = 0; $a < 3; $a++ ) {
    for( my $b = 0; $b < 3; $b++ ) {
        my @array = $a..$b;
        Algorithm::Permute::permute {
            my $Num = join("", @array);
            print $Num;
            sleep 1;
        } @array;
    }
}
Thank you.

As its name suggests, Algorithm::Permute offers permutations. There are many mathematical variations on selecting k items from a population of N: with and without replacement, with and without repetition, ignoring order or not.
It's hard to be certain, but you probably want Algorithm::Combinatorics.
Here's some example code that reproduces at least the part of your expected data that you have shown. It's pretty much the same as zdim's solution, but there may be something extra useful to you here.
use strict;
use warnings 'all';
use feature 'say';
use Algorithm::Combinatorics 'variations_with_repetition';
my @data = 0 .. 2;
for my $k ( 1 .. @data ) {
    say @$_ for variations_with_repetition(\@data, $k);
}
output
0
1
2
00
01
02
10
11
12
20
21
22
000
001
002
010
011
012
020
021
022
100
101
102
110
111
112
120
121
122
200
201
202
210
211
212
220
221
222

my @digits = 0..2;
my $len = 3;
my @combinations = map glob("{@{[join ',', @digits]}}" x $_), 1..$len;
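For comparison, the same list can be built with plain loops and no modules or glob tricks. This is only a sketch; the name strings_up_to is hypothetical. Each pass extends every string of the previous length by every symbol, so the results come out shortest first, in the order the question lists them.

```perl
use strict;
use warnings;

# Hypothetical helper: all strings of length 1..$max_len over @symbols.
sub strings_up_to {
    my ($max_len, @symbols) = @_;
    my @all;
    my @current = ('');                 # the single string of length 0
    for (1 .. $max_len) {
        my @next;
        for my $prefix (@current) {
            # Extend this prefix by each symbol in turn.
            push @next, map { $prefix . $_ } @symbols;
        }
        @current = @next;
        push @all, @current;
    }
    return @all;
}

my @result = strings_up_to(3, 0 .. 2);
print scalar(@result), " strings, first few: @result[0 .. 4]\n";
```

The list grows as the sum of n**k for k from 1 to the maximum length, so for large inputs an iterator such as the one in Algorithm::Combinatorics is likely the better choice.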

Consolidation of intervals

I'm working with biological data (copy number variations) which is shown as intervals (tab separated file):
File 1
Columns: Chromosome, Start, End, Annotation
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
I intersected them to consolidate the overlapping events (my condition is 50% overlap), using intersectBed from Bedtools (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html):
$ intersectBed -a File1 -b File1 -loj -f 0.50 -r > File2
The result is this:
File 2
Columns: Chromosome, Start, End, Annotation , Chromosome, Start, End, Annotation
1 1 10 A 1 1 10 A
1 1 10 A 1 3 12 B
1 3 12 B 1 1 10 A
1 3 12 B 1 3 12 B
1 3 12 B 1 7 15 C
1 7 15 C 1 3 12 B
1 7 15 C 1 7 15 C
1 20 30 D 1 20 30 D
1 35 45 E 1 35 45 E
1 35 45 E 1 37 45 F
1 37 45 F 1 35 45 E
1 37 45 F 1 37 45 F
1 50 60 G 1 50 60 G
1 50 60 G 1 50 65 H
1 50 65 H 1 50 60 G
1 50 65 H 1 50 65 H
Events A and C overlap with event B, events E and F overlap with each other, as do G and H, and event D has no overlapping partners. Knowing this, the list of consolidated CNVs should be:
File 3
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
I tried the merge option of the HDCNV Java software (http://daleylab.org/lab/?page_id=125), but the output is not what I need. I also tried writing Perl code, but I'm a beginner, so this problem is currently beyond me.
I would appreciate help with Perl or awk code that takes File 2 as input and outputs File 3.
Thanks in advance.

I'm assuming that the columns have the following meanings:
col 1: chromosome number
col 2: start position of genomic region
col 3: end position of genomic region
col 4: text identifier
This script looks for the areas of overlap between the named regions. It assumes that the input text is sorted by col 1 then col 2. I have put the input text in a string, but you will probably be reading it in from a file (and outputting your data to a file, too). I will leave you to work out how to do that--it is pretty easy, and there is lots of documentation on the perl website.
#!/usr/bin/perl
use strict;
use warnings;
use feature ":5.10";
use Data::Dumper;
my $text = '1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
';
# we have tab-delimited data.
# split on line breaks, remove line ending, split on tabs
my @lines = map { chomp; [ split(/\t/, $_) ]; } split("\n", $text);
my $col_0 = 1;
my $min = 0;
my $max = 0;
my @range;
foreach (@lines) {
    # the chromosome number has changed or
    # minimum is greater than current maximum:
    # start a new interval
    if ($col_0 != $_->[0] || $_->[1] > $max) {
        if (@range) {
            # print out the range, and restart the stack
            say join("\t",
                $col_0,
                ( $min || $_->[1] ),
                ( $max || $_->[2] ),
                join(", ", @range)
            );
        }
        @range = ( $_->[3] );
        # set the min and max
        $col_0 = $_->[0];
        $min = $_->[1];
        $max = $_->[2];
    }
    else {
        # the minimum is lower than our current maximum.
        # check whether the max is greater than our current
        # max and increase it if so. Add the letter to the
        # current range.
        if ($_->[2] > $max) {
            $max = $_->[2];
        }
        push @range, $_->[3];
    }
}
# print out the last line
say join("\t", $col_0, $min, $max, join(", ", @range) );
Output:
1 1 15 A, B, C
1 20 30 D
1 35 45 E, F
1 50 65 G, H
2 1 15 I, J, K
2 20 30 L
2 35 45 M, N
2 50 65 O, P
I have just calculated simple overlap - this doesn't do 50% overlap. Using this script as a start, you can figure out how to do that. We're not doing your PhD for you! ;)
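As a starting point for the 50% condition, here is a hedged sketch of a reciprocal-overlap test in the spirit of the -f 0.50 -r options used in the question. The name reciprocal_overlap is hypothetical, and the sketch assumes interval length is end minus start; adjust that if your coordinates are inclusive.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: returns 1 if the two intervals overlap by at least
# $frac of the length of BOTH intervals, and 0 otherwise.
# Assumption: interval length is end - start.
sub reciprocal_overlap {
    my ($s1, $e1, $s2, $e2, $frac) = @_;
    my $overlap = min($e1, $e2) - max($s1, $s2);
    return 0 if $overlap <= 0;          # disjoint or merely touching
    return $overlap >= $frac * ($e1 - $s1)
        && $overlap >= $frac * ($e2 - $s2) ? 1 : 0;
}

printf "A/B: %d\n", reciprocal_overlap(1, 10, 3, 12, 0.5);   # pair listed in File 2
printf "A/C: %d\n", reciprocal_overlap(1, 10, 7, 15, 0.5);   # pair absent from File 2
```

Under these assumptions the test agrees with the pairs shown in File 2 of the question. It could replace the simple comparison against $max in the script above.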

awk '
$2 > end && NR>1 {
    print "1", start, end, pair;
    start=end=pair=0
}
{
    if (!start) { start = $2 };
    end = $3;
    pair = (pair ? pair "," $4 : $4)
}
END {
    print "1", start, end, pair
}' file
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H

Assuming ordered data, the following stub should handle merging the records.
You would just have to modify it to read from and write to a file.
use strict;
use warnings;
use List::Util qw(min max);
my $last;
while (<DATA>) {
    my @fields = split;
    if ( !$last ) {
        $last = \@fields;
    } elsif ( $last->[0] == $fields[0] && $last->[2] > $fields[1] ) {
        $last->[1] = min( $last->[1], $fields[1] );
        $last->[2] = max( $last->[2], $fields[2] );
        $last->[3] .= ",$fields[3]";
    } else {
        print join( "\t", @$last ), "\n";
        $last = \@fields;
    }
}
print join( "\t", @$last ), "\n";
__DATA__
1 1 10 A
1 3 12 B
1 7 15 C
1 20 30 D
1 35 45 E
1 37 45 F
1 50 60 G
1 50 65 H
2 1 10 I
2 3 12 J
2 7 15 K
2 20 30 L
2 35 45 M
2 37 45 N
2 50 60 O
2 50 65 P
Outputs:
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H
2 1 15 I,J,K
2 20 30 L
2 35 45 M,N
2 50 65 O,P

My take:
awk -F "\t" -v OFS="\t" '
    function emit() {print chrom, start, end, annot}
    $1 == chrom && ((start<=$2 && $2<=end) || (start<=$3 && $3<=end)) {
        annot = annot "," $4
        if ($2 < start) start = $2
        if ($3 > end) end = $3
        next
    }
    chrom {emit()}
    {chrom=$1; start=$2; end=$3; annot=$4}
    END {emit()}
' file1
1 1 15 A,B,C
1 20 30 D
1 35 45 E,F
1 50 65 G,H

Combining duplicated lines in txt file with perl

I am trying to combine duplicate lines using Perl with little luck. My tab-delimited text file is structured as follows (spaces added for readability):
Pentamer Probability Observed Length
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
I would like to combine duplicated lines by adding the three numeric columns, so the line containing "ATGCA" would now look like this:
ATGCA 0.024 3 39
Any ideas/help/suggestions would be greatly appreciated! Thanks!

#!/usr/bin/perl
use warnings;
use strict;
my %hash;
while(<>) {
    my @v = split(/\s+/);
    if (defined $hash{$v[0]}) {
        my $arr = $hash{$v[0]};
        $hash{$v[0]} = [$v[0], $arr->[1] + $v[1],
                        $arr->[2] + $v[2], $arr->[3] + $v[3]];
    } else {
        $hash{$v[0]} = [@v];
    }
}
foreach my $key (keys %hash) {
    print join(" ", @{$hash{$key}}), "\n";
}

Here's another option:
use Modern::Perl;
my %hash;
while ( my $line = <DATA> ) {
    my @vals = split /\s+/, $line;
    $hash{ $vals[0] }->[$_] += $vals[ $_ + 1 ] for 0 .. 2;
}
say join "\t", $_, @{ $hash{$_} } for sort keys %hash;
__DATA__
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
Output:
ATGCA 0.024 3 39
CAGTG 0.004 1 16
GGCAT 0.008 1 16
TGTAC 0.028 2 39
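Both answers either treat the header line as data or assume it is absent, and neither keeps the original row order. A sketch that handles both points follows. The name combine_rows is hypothetical, and it assumes the header can be recognized by its first field, Pentamer, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: sums the numeric columns per key, skipping the
# header line and keeping keys in first-seen order.
sub combine_rows {
    my (%sum, @order);
    for my $line (@_) {
        my ($key, @nums) = split ' ', $line;
        next unless defined $key;           # blank line
        next if $key eq 'Pentamer';         # header line from the question
        push @order, $key unless exists $sum{$key};
        $sum{$key}[$_] += $nums[$_] for 0 .. $#nums;
    }
    my @combined;
    push @combined, [ $_, @{ $sum{$_} } ] for @order;
    return @combined;
}

my @rows = combine_rows(
    'Pentamer Probability Observed Length',
    'ATGCA 0.008 1 16',
    'TGTAC 0.021 1 16',
    'ATGCA 0.016 2 23',
);
print join("\t", @$_), "\n" for @rows;
```

The @order array records each key the first time it is seen, which avoids both the random order of keys %hash and the alphabetical order of sort keys.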
