Perl: read an array and calculate corresponding percentile

I am trying to write a Perl script that reads a text file containing a series of numbers, then calculates and prints each number's corresponding percentile. I do not have access to the statistical modules, so I'd like to stick with pure Perl. Thanks in advance!
The input text file looks like:
197
98
251
82
51
272
154
167
38
280
157
212
188
88
40
229
228
125
292
235
67
70
127
26
279
.... (and so on)
The code I have is:
#!/usr/bin/perl
use strict;
use warnings;

my @data;
open (my $fh, "<", "testing2.txt")
    or die "Cannot open: $!\n";
while (<$fh>) {
    push @data, $_;
}
close $fh;

my %count;
foreach my $datum (@data) {
    ++$count{$datum};
}

my %percentile;
my $total = 0;
foreach my $datum (sort { $a <=> $b } keys %count) {
    $total += $count{$datum};
    $percentile{$datum} = $total / @data;
    # percentile subject to change
    if ($percentile{$datum} <= 0.10) {
        print "$datum : $percentile{$datum}\n\n";
    }
}
My desired output:
2 : 0.01
3 : 0.01333
4 : 0.01666
6 : 0.02
8 : 0.03
10 : 0.037
12 : 0.04
14 : 0.05
15 : 0.05333
16 : 0.06
18 : 0.06333
21 : 0.07333
22 : 0.08
25 : 0.09
26 : 0.09666
Where the format is "number from the list : corresponding percentile".

To store the numbers without a newline in @data, just add chomp; before pushing each one, or chomp @data; after you've read them all.
If your input file has MSWin style newlines, convert it to *nix style using dos2unix or fromdos.
Also, try to learn how to indent your code; it boosts readability. And consider renaming $total to $running_total, since you use the value as it changes.
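Putting those suggestions together, a cleaned-up sketch might look like this (testing2.txt and the 0.10 cutoff come from the question; the %.4g formatting is only an approximation of the truncated percentiles shown in the desired output):

#!/usr/bin/perl
use strict;
use warnings;

# Read the numbers, stripping newlines as suggested above.
my @data;
open my $fh, '<', 'testing2.txt' or die "Cannot open: $!\n";
while (my $line = <$fh>) {
    chomp $line;
    push @data, $line;
}
close $fh;

# Count occurrences of each distinct value.
my %count;
$count{$_}++ for @data;

# Walk the distinct values in ascending order, keeping a running
# total so each value's percentile is (values seen so far) / (all values).
my $running_total = 0;
for my $datum (sort { $a <=> $b } keys %count) {
    $running_total += $count{$datum};
    my $percentile = $running_total / @data;
    printf "%s : %.4g\n", $datum, $percentile if $percentile <= 0.10;
}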

Related

How to resolve this warning in Perl

I asked this type of question previously but didn't provide the full code.
I am reading the file below, finding the maximum word width present in each column, and then writing the columns to another file with proper alignment.
id0 id1 id2 batch
0 34 56 70
2 3647 58 72 566
4 39 616 75 98 78 78987 9876 7899 776
89 40 62 76
8 42 64 78
34 455 544 565
My code:
unlink "temp1.log";
use warnings;
use strict;
use feature 'say';

my $log1_file = "log1.log";
my $temp1 = "temp1.log";

open(IN1, "<$log1_file") or die "Could not open file $log1_file: $!";
my @col_lens;
while (my $line = <IN1>) {
    my @fs = split " ", $line;
    my @rows = @fs;
    @col_lens = map(length, @rows) if $. == 1;
    for my $col_idx (0..$#rows) {
        my $col_len = length $rows[$col_idx];
        if ($col_lens[$col_idx] < $col_len) {
            $col_lens[$col_idx] = $col_len;
        }
    }
}
close IN1;

open(IN1, "<$log1_file") or die "Could not open file $log1_file: $!";
open(tempp1, "+>>$temp1") or die "Could not open file $temp1: $!";
while (my $line = <IN1>) {
    my @fs = split " ", $line;
    my @az;
    for my $h (0..$#fs) {
        my $len = length $fs[$h];
        my $blk_len = $col_lens[$h] + 1;
        my $right = $blk_len - $len;
        $az[$h] = (" ") . $fs[$h] . (" " x $right);
    }
    say tempp1 (join "|", @az);
}
My warnings:
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 3.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
Use of uninitialized value in numeric lt (<) at new.pl line 25, <IN1> line 4.
I am getting the output correctly but don't know how to remove these warnings.
$col_idx can be as large as the number of fields on a line, minus one. For the third line, this is more than the highest index of @col_lens, which holds only the lengths of the four header fields (indexes 0 through 3). So doing the following makes no sense:
if ($col_lens[$col_idx] < $col_len) {
    $col_lens[$col_idx] = $col_len;
}
Replace it with
if (!defined($col_lens[$col_idx]) || $col_lens[$col_idx] < $col_len) {
    $col_lens[$col_idx] = $col_len;
}
With this, there's really no point checking for $. == 1 anymore.
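For example, the first loop then simplifies to this sketch (same variable names as the question):

my @col_lens;
while (my $line = <IN1>) {
    my @fs = split " ", $line;
    for my $col_idx (0 .. $#fs) {
        my $col_len = length $fs[$col_idx];
        # The defined-or guard covers columns that earlier lines never had.
        if (!defined($col_lens[$col_idx]) || $col_lens[$col_idx] < $col_len) {
            $col_lens[$col_idx] = $col_len;
        }
    }
}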
You're getting the uninitialized warnings because, when the $col_lens[$col_idx] < $col_len condition is checked, $col_lens[$col_idx] is undef.
Solution 1:
You can skip the comparison when there is nothing to compare against yet, using a next statement (note that the first value seen for a new column must still be stored):
for my $col_idx (0..$#rows) {
    my $col_len = length $rows[$col_idx];
    if (!defined $col_lens[$col_idx]) {
        $col_lens[$col_idx] = $col_len;   # first time this column appears
        next;                             # nothing to compare against yet
    }
    if ($col_lens[$col_idx] < $col_len) {
        $col_lens[$col_idx] = $col_len;
    }
}
Solution 2 (not advised):
You can simply suppress the Use of uninitialized value.. warnings by putting the following line at the top of your script, or at the top of a block to disable them only within that block.
no warnings 'uninitialized';
For more info, see perldoc warnings.
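If you do go this route, it is safer to confine the pragma to the smallest enclosing block rather than the whole script; a minimal sketch using the loop from the question:

for my $col_idx (0..$#rows) {
    my $col_len = length $rows[$col_idx];
    no warnings 'uninitialized';   # silenced only until the end of this block
    # undef now compares as 0, so a new column's first length still wins
    if ($col_lens[$col_idx] < $col_len) {
        $col_lens[$col_idx] = $col_len;
    }
}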
The following code demonstrates one of many possible solutions to this task:
read line by line
get length of each field
compare with stored earlier
adjust to max length
form $format string for print
print formatted data
use strict;
use warnings;
use feature 'say';

my (@data, @length, $format);

while ( <DATA> ) {
    my @e = split ' ';
    my @l = map { length } @e;
    # keep the running maximum width seen for each column
    $length[$_] = ($length[$_] // 0) < $l[$_] ? $l[$_] : $length[$_] for 0..$#e;
    push @data, \@e;
}

$format = join ' ', map { '%' . $_ . 's' } @length;
$format .= "\n";

for my $row ( @data ) {
    printf $format, map { $row->[$_] // '' } 0..$#length;
}
__DATA__
id0 id1 id2 batch
0 34 56 70
2 3647 58 72 566
4 39 616 75 98 78 78987 9876 7899 776
89 40 62 76
8 42 64 78
34 455 544 565
Output (trailing padding trimmed):

id0  id1 id2 batch
  0   34  56    70
  2 3647  58    72 566
  4   39 616    75  98 78 78987 9876 7899 776
 89   40  62    76
  8   42  64    78
 34  455 544   565

Generate all combinations of up to N digits, including repeating digits, in Perl

What's the best way to generate all combinations of 1 to N digits, where digits may be repeated within a combination? E.g., given the array 0..2, the result should be:
0
1
2
00
01
02
10
11
12
20
21
22
000
001
002
010
011
etc.
I've played with Algorithm::Permute, but it looks like it can only generate unique permutations of N numbers:
for ( my $a = 0; $a < 3; $a++ ) {
    for ( my $b = 0; $b < 3; $b++ ) {
        my @array = $a..$b;
        Algorithm::Permute::permute {
            my $Num = join("", @array);
            print $Num;
            sleep 1;
        } @array;
    }
}
Thank you.
As its name suggests, Algorithm::Permute offers permutations. There are many mathematical variations on selecting k items from a population of N: with and without replacement, with and without repetition, ordered or not.
It's hard to be certain, but you probably want Algorithm::Combinatorics.
Here's some example code that reproduces at least the part of your expected data that you have shown. It's pretty much the same as zdim's solution, but there may be something extra useful to you here.
use strict;
use warnings 'all';
use feature 'say';

use Algorithm::Combinatorics 'variations_with_repetition';

my @data = 0 .. 2;

for my $k ( 1 .. @data ) {
    say @$_ for variations_with_repetition(\@data, $k);
}
Output:
0
1
2
00
01
02
10
11
12
20
21
22
000
001
002
010
011
012
020
021
022
100
101
102
110
111
112
120
121
122
200
201
202
210
211
212
220
221
222
my @digits = 0..2;
my $len = 3;
my @combinations = map glob("{@{[join ',', @digits]}}" x $_), 1..$len;
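If modules and the glob trick are both off the table, a pure-Perl "odometer" enumerates the same variations with repetition (a minimal sketch; the digits 0..2 and maximum length 3 mirror the question):

use strict;
use warnings;

my @digits = 0 .. 2;

# Treat each combination of length $k as a $k-wheel odometer over @digits.
for my $k (1 .. 3) {
    my @idx = (0) x $k;                     # current wheel positions
    while (1) {
        print join('', @digits[@idx]), "\n";
        my $pos = $k - 1;
        while ($pos >= 0 && ++$idx[$pos] == @digits) {
            $idx[$pos--] = 0;               # wheel rolled over; carry left
        }
        last if $pos < 0;                   # every wheel rolled over: done
    }
}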

Perl: find sum and average of specific columns

I want to calculate, for each row, the sum and average over all itemX columns (where X is a digit), in Perl on Windows.
I have a file in this format:
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
I want to find the sum of the item columns in each row, and their average.
Any help on this?
Here's what I've tried so far:
use strict;
use warnings;

open my $fh, '<', "files.txt" or die $!;

my $total = 0;
my $count = 0;
while (<$fh>) {
    my ($item1, $item2, ) = split;
    $total += $numbers;
    $count += 1;
}
For the first line of input (the column names), we store the indices of the columns whose names start with item. For each subsequent line, we sum the columns referenced by the array slice @fields[@indices].
use strict;
use warnings;
use List::Util qw(sum);

my @indices;

while (<DATA>) {
    my @fields = split;
    if ($. == 1) {
        # remember which columns are item columns
        @indices = grep { $fields[$_] =~ /^item/ } 0 .. $#fields;
        next;
    }
    my $sum = sum(@fields[@indices]);
    my $avg = $sum / scalar(@indices);
    printf("Row %d stats: sum=%d, avg=%.2f\n", $., $sum, $avg);
}
__DATA__
id1 item1 cart1 id2 item2 cart2 id3 item3 cart3
0 11 34 1 22 44 2 44 44
1 44 44 55 66 34 45 55 33
Output:
Row 2 stats: sum=77, avg=25.67
Row 3 stats: sum=165, avg=55.00
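As a quick illustration of the array slice used above (the values are row 2 of the sample data, where the item columns sit at indices 1, 4, and 7):

my @fields  = qw(0 11 34 1 22 44 2 44 44);
my @indices = (1, 4, 7);                   # positions of item1, item2, item3
print join(',', @fields[@indices]), "\n";  # prints 11,22,44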

Fastest way to index and query huge tab delimited file

I have a 30 GB tab-delimited text file of numbers, and I need the fastest way to index it and query it by the first and second columns. I've tried MongoDB, but uploading the data to the database takes a huge amount of time, even via mongoimport with a JSON file.
mongoimport --upsert --upsertFields A,B,S1,E1,S2,E2 -d DBName -c TableName data.json
Data file fragment:
504 246 91.92007 93 0 4657 5631 5911 0 39 1061 1162
813 469 92.14697 109 0 2057 2665 7252 1 363 961 1399
2388 987 92.20945 61 0 1183 1575 1824 0 66 560 5088
2388 2323 92.88472 129 0 75 1161 1824 1 2516 3592 12488
2729 1008 95.29058 47 0 435 1166 1193 1 76 654 1055
2757 76 94.25837 12 0 0 44 1946 0 51 68 247
2757 2089 92.63158 14 0 12 30 1946 0 14 30 211
What is the most time-efficient way to do this? Any hints on the best database for it, or on optimising the Mongo upload speed?
Query examples:
objs = db.TableName.find({'A':2757})
objs = db.TableName.find({'B':76})
For each number in columns A and B there are up to 1000 hits, with a mean of 20.
Databases often have complex work to do in order to be more robust.
If you use straight B-tree indexes, it is normally faster.
Below you'll find an upload script in Perl:
#!/usr/bin/perl
use DB_File;
use Fcntl;

# $DB_BTREE->{'cachesize'} = 1000000;
$DB_BTREE->{'flags'} = R_DUP;          # allow duplicate keys in the indexes

my (%h, %h1, %h2, $n);
my $x  = tie %h,  'DB_File', "bf.db", O_RDWR|O_CREAT|O_TRUNC, 0640, $DB_BTREE;
my $x1 = tie %h1, 'DB_File', "i1.db", O_RDWR|O_CREAT|O_TRUNC, 0640, $DB_BTREE;
my $x2 = tie %h2, 'DB_File', "i2.db", O_RDWR|O_CREAT|O_TRUNC, 0640, $DB_BTREE;

while (<>) {
    chomp;
    if (/(\d+)\s+(\d+)/) {
        $h{++$n} = $_;    # add the tuple, keyed by line number
        $h1{$1}  = $n;    # add to index 1 (column A)
        $h2{$2}  = $n;    # add to index 2 (column B)
    }
}

untie %h;
untie %h1;
untie %h2;
and a query:
#!/usr/bin/perl
use DB_File;
use Fcntl;

$DB_BTREE->{'flags'} = R_DUP;

my (%h, %h1, %h2, $n, @list);
my $x  = tie %h,  'DB_File', "bf.db", O_RDWR|O_CREAT, 0640, $DB_BTREE;
my $x1 = tie %h1, 'DB_File', "i1.db", O_RDWR|O_CREAT, 0640, $DB_BTREE;
my $x2 = tie %h2, 'DB_File', "i2.db", O_RDWR|O_CREAT, 0640, $DB_BTREE;

# Queries input format: A:number or B:number
while (<>) {
    chomp;
    if (/A:(\d+)/) {
        @list = sort $x1->get_dup($1);
        print $h{$_}, "\n" for @list;
    }
    if (/B:(\d+)/) {
        @list = sort $x2->get_dup($1);
        print $h{$_}, "\n" for @list;
    }
}
Querying is very fast, but the upload took 20 s (user time) per 1,000,000 lines...
(If you run experiments with your own data, please show us the times.)
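As a side note, get_dup called in scalar context returns the number of records stored under a key, which makes a quick sanity check after loading (a minimal sketch; the i1.db file name and key 2757 come from the scripts and data above):

use DB_File;
use Fcntl;

$DB_BTREE->{'flags'} = R_DUP;

my %h1;
my $x1 = tie %h1, 'DB_File', "i1.db", O_RDONLY, 0640, $DB_BTREE;

# In scalar context, get_dup returns how many duplicate entries
# the B-tree holds for this key.
my $hits = $x1->get_dup(2757);
print "key 2757 has $hits hits\n";

untie %h1;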

Combining duplicated lines in txt file with perl

I am trying to combine duplicate lines using Perl, with little luck. My tab-delimited text file is structured as follows (spaces added for readability):
Pentamer Probability Observed Length
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
I would like to combine duplicated lines by adding the three numeric columns; therefore the line containing "ATGCA" would now look like this:
ATGCA 0.024 3 39
Any ideas/help/suggestions would be greatly appreciated! Thanks!
#!/usr/bin/perl
use warnings;
use strict;

my %hash;
while (<>) {
    my @v = split(/\s+/);
    if (defined $hash{$v[0]}) {
        my $arr = $hash{$v[0]};
        $hash{$v[0]} = [ $v[0],
                         $arr->[1] + $v[1],
                         $arr->[2] + $v[2],
                         $arr->[3] + $v[3] ];
    } else {
        $hash{$v[0]} = [@v];
    }
}

foreach my $key (keys %hash) {
    print join(" ", @{$hash{$key}}), "\n";
}
Here's another option:
use Modern::Perl;

my %hash;
while ( my $line = <DATA> ) {
    my @vals = split /\s+/, $line;
    $hash{ $vals[0] }->[$_] += $vals[ $_ + 1 ] for 0 .. 2;
}

say join "\t", $_, @{ $hash{$_} } for sort keys %hash;
__DATA__
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
Output:
ATGCA 0.024 3 39
CAGTG 0.004 1 16
GGCAT 0.008 1 16
TGTAC 0.028 2 39
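If the original file order matters more than sorted keys, one can record the order in which pentamers first appear (a minimal sketch under that assumption; it expects the same __DATA__ block as the previous example):

use strict;
use warnings;
use feature 'say';

my (%hash, @order);
while ( my $line = <DATA> ) {
    my ($key, @nums) = split /\s+/, $line;
    push @order, $key unless exists $hash{$key};   # remember first appearance
    $hash{$key}[$_] += $nums[$_] for 0 .. $#nums;
}
say join "\t", $_, @{ $hash{$_} } for @order;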