Need Help in Perl looping - perl

I have a file with the following data. I want to write the sum of every 9 rows to an external file. Here is my data.
file.xyz
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
My desired output looks like
-2.121332105 -5.405270886 -2.234333276 6.221675693
-2.121332105 -5.405270886 -2.234333276 6.221675693
Where the first line is the sum of rows 1-9 and the second line is the sum of rows 10-18. Because the first 9 and the last 9 rows hold the same data here, both sums are identical. I want to print the sum of EVERY NINE lines for big files with thousands of lines.
Here is my code. It calculates the total sum, but I need to split it into parts as above.
Thank you for your help.
my @sums;
open FILE, "file.xyz" or die "Can't find";
while( <FILE> ) { # there is FILE written within angular brackets
    my @summands = split / /;
    foreach my $i ( 0 .. $#summands ) {
        $sums[$i] += $summands[$i];
    }
}
$total = sqrt($sums[0]*$sums[0]+$sums[1]*$sums[1]+$sums[2]*$sums[2]);
print "$sums[0], $sums[1], $sums[2], $total\n";

Something like the following:
use strict;
use warnings;
use autodie;
open my $fh, '<', 'file.xyz';
my @sums;
my $n = 0;
while (<$fh>) {
    my @summands = split /\s+/;
    foreach my $i ( 0 .. $#summands ) {
        $sums[$i] += $summands[$i];
    }
    unless ( ++$n % 9 ) {
        my $total = sqrt( $sums[0] * $sums[0] + $sums[1] * $sums[1] + $sums[2] * $sums[2] );
        print "$sums[0], $sums[1], $sums[2], $total\n";
        @sums = ();
    }
}

Add a line counter and move the $total and print code into the loop under a conditional. You'll also need to clear the sums there.
if ($lines % 9 == 0) {
...
}

Use the input line number, $., to determine when you're on every 9th line:
use strict;
use warnings;
use autodie;
use List::Util qw(sum);
#open my $fh, '<', "file.xyz";
my $fh = \*DATA;
my @sums;
while (<$fh>) {
    my @cols = split ' ';
    for my $i ( 0 .. $#cols ) {
        $sums[$i] += $cols[$i];
    }
    if ( ( $. % 9 ) == 0 or eof ) {
        my $weighted_total = sqrt sum map $_**2, @sums;
        print join( ', ', @sums, $weighted_total ), "\n";
        @sums = ();
    }
}
warn "FH did not end on an even multiple of 9" if $. % 9;
__DATA__
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
-0.485718003092488 3.25568455554021 -0.60544991716881
-1.01253068155602 -2.49251542491767 0.713923699625837
0.791137982988487 -2.56492609246597 -0.853251541212567
Outputs:
-2.12133210498006, -5.40527088553029, -2.23433327626662, 6.22167569349391
-2.12133210498006, -5.40527088553029, -2.23433327626662, 6.22167569349391

The whole thing in Perl-golf style, watered down with a bit of whitespace and indentation:
while (<>) {
    @l = split;
    @s = map { $s[$_] + $l[$_] } 0..$#l;
    next if ++$n % 9;
    $, = ' ';
    print @s, sqrt(eval join '+', (map { $_*$_ } @s)), "\n";
    @s = ();
}
And some real Perl-golf, inspired by @Nemo, but even shorter:
perl -lanE '$s[$_]+=$F[$_]for 0..$#F;$.%9&&next;print"@s @{[sqrt eval join qq(+),map $_**2,@s]}";@s=()'
Finding the used tricks is left as an exercise to the readers. :-)

Related

SPARQL: slow processing of result of subquery

I'm currently learning SPARQL, and I can't wrap my head around why what seems to me like a very straightforward query takes a large amount of time. I'm trying to count the number of articles per author in a journal, using the OpenCitations project (SPARQL endpoint https://opencitations.net/sparql; I also downloaded a dump of an earlier version).
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX pro: <http://purl.org/spar/pro/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX fabio: <http://purl.org/spar/fabio/>
SELECT $author (COUNT($author) as $cnt) WHERE {
# narrowing down journal down to articles (over issues and volumes)
$jnl a fabio:Journal;
dcterms:title "Nature" .
$volm frbr:partOf $jnl .
$issue frbr:partOf $volm .
$artcl frbr:partOf $issue .
# selecting author
?artcl pro:isDocumentContextFor ?artcl_atrbts .
?artcl_atrbts pro:isHeldBy ?author.
# making sure that author is a person
$author foaf:familyName $y .
}
GROUP BY $author
ORDER BY DESC($cnt)
LIMIT 10
This works as expected, and takes around 3 seconds on the dump, and maybe 5 on the OpenCitations endpoint.
However now I also want to get the actual names of the authors, so my idea was to use the previous query as a subquery:
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX pro: <http://purl.org/spar/pro/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX fabio: <http://purl.org/spar/fabio/>
SELECT $author $last_name $cnt WHERE {
$author foaf:familyName $last_name.
{
SELECT $author (COUNT($author) as $cnt) WHERE {
$jnl a fabio:Journal;
dcterms:title "Nature" .
$volm frbr:partOf $jnl .
$issue frbr:partOf $volm .
$artcl frbr:partOf $issue .
?artcl pro:isDocumentContextFor ?artcl_trbts .
?artcl_trbts pro:isHeldBy ?author.
$author foaf:familyName $y .
}
GROUP BY $author
ORDER BY DESC($cnt)
LIMIT 10
}
}
ORDER BY DESC($cnt)
This now takes around 15 seconds on the dump (more than a minute on the online endpoint), even though it seems to me all it is doing is looking up the familyName for the 10 authors. If I include the first name (foaf:givenName) as well, the query can take even longer. Furthermore, when I select names without grouping by author, it executes within a split second:
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX pro: <http://purl.org/spar/pro/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
PREFIX fabio: <http://purl.org/spar/fabio/>
SELECT $author $first_name $last_name WHERE {
$jnl a fabio:Journal;
dcterms:title "Nature" .
$volm frbr:partOf $jnl .
$issue frbr:partOf $volm .
$artcl frbr:partOf $issue .
?artcl pro:isDocumentContextFor ?artcl_trbts .
?artcl_trbts pro:isHeldBy ?author.
$author foaf:familyName $last_name .
$author foaf:givenName $first_name .
}
LIMIT 10
Can somebody tell me what am I doing wrong here? Thanks in advance!

Different output from same code parsing similar tab-delimited files

The Perl script below is run as a one-liner from the shell.
If I use the tab-delimited file numeric as input, I get the desired result: every line is parsed. However, if I use the file alpha as input, only the first line is parsed.
The only difference between alpha and numeric is that numeric has NC_000023
NC_000023.11:g.41747805_41747806delinsTT
NC_000023.11:g.41750615C>A
while alpha has NC_0000X
NC_0000X.11:g.41747805_41747806delinsTT
NC_0000X.11:g.41750615C>A
What am I missing?
numeric
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA NC_000023.11:g.41747805_41747806delinsTT LRG_513t1:c.*18_*19delinsAA NM
NM_003924.3:c.013G>T NC_000023.11:g.41750615C>A LRG_513t1:c.13G>T
alpha
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA NC_0000X.11:g.41747805_41747806delinsTT LRG_513t1:c.*18_*19delinsAA NM_003924.3:c.*18_*19delinsAA
NM_003924.3:c.013G>T NC_0000X.11:g.41750615C>A LRG_513t1:c.13G>T NM_003924.3:c.13G>T
Perl
perl -ne '
next if $. == 1;
if ( /.*del([A-Z]+)ins([A-Z]+).*NC_0+([^.]+)\..*g\.([0-9]+)_([0-9]+)/ ) { # indel
print join( "\t", $3, $4, $5, $1, $2 ), "\n";
}
else {
while ( /\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g ) {
# conditional parse
( $num1, $num2, $common ) = ( $1, $2, $3 );
$num3 = $num2;
if ( $common =~ /^([A-Z])>([A-Z])$/ ) { # SNP
( $ch1, $ch2 ) = ( $1, $2 );
}
elsif ( $common =~ /^del([A-Z])$/ ) { # deletion
( $ch1, $ch2 ) = ( $1, "-" );
}
elsif ( $common =~ /^ins([A-Z])$/ ) { # insertion
( $ch1, $ch2 ) = ( "-", $1 );
}
elsif ( $common =~ /^_(\d+)del([A-Z]+)$/ ) { # multi deletion
( $num3, $ch1, $ch2 ) = ( $1, $2, "-" );
}
elsif ( $common =~ /^_(\d+)ins([A-Z]+)$/ ) { # multi insertion
( $num3, $ch1, $ch2 ) = ( "-", $1, $2 );
}
printf( "%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2 ); # output
map { undef } ( $num1, $num2, $num3, $common, $ch1, $ch2 );
}
}' numeric
output
23 41747805 41747806 GC AA
23 41750615 41750615 C A
output using alpha:
X 41747805 41747806 GC AA
If I use \w instead of \d in the while condition, like this
while ( /\t*NC_(\w+)\.\S+g\.(\d+)(\S+)/g ) { ... }
I get this result
X 41747805 41747806 GC AA
0 41750615 41750615 C A
Why is there a zero in $1?
The pattern in while ( /\t*NC_(\d+)\. ... /g ) will not match 'NC_0000X.11': (\d+) accepts digits only, and the 'X' is not a digit.
After the change you made, NC_(\w+) will match 'NC_0000X' and $num1 is set to '0000X'.
printf with %d converts its argument to a number. A string such as '0000X' converts to 0, because Perl uses the leading digits and ignores the rest, so $num1 prints as 0.
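The numeric conversion described above is easy to reproduce in isolation. The following is a minimal sketch, assuming you want the chromosome label with its leading zeros stripped; the helper name chrom_label is hypothetical and not part of the original one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the leading zeros from an accession number
# such as '000023' or '0000X' and keep the result as a string.
sub chrom_label {
    my ($raw) = @_;
    ( my $label = $raw ) =~ s/^0+(?=.)//;    # always keep at least one character
    return $label;
}

{
    # '%d' forces numeric context, so '0000X' becomes 0. Under
    # 'use warnings' this also emits a "isn't numeric" warning,
    # which is silenced here for the demonstration.
    no warnings 'numeric';
    printf "%%d gives: %d\n", '0000X';
}

# '%s' prints the label as a string, so 'X' survives.
printf "%%s gives: %s\n", chrom_label('0000X');
printf "%%s gives: %s\n", chrom_label('000023');
```

Printing the chromosome with %s instead of %d is the smallest change that keeps both 23 and X intact.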
The input example suggests that each line consists of fields separated by whitespace. Some fields are of interest, others are not. Each field holds recognizable information.
Your program should follow this structure.
Read the file line by line
split the line into fields
skip fields, which are not of interest, e.g. non NC.*
extract the necessary information from the field
do whatever necessary, sum it up, collect it
print the information at the level needed. Per field, line, file or after all files
It is much easier to work on smaller chunks than to find one regex that works for the whole line, and the result is easier to read, understand and maintain.
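The steps above can be sketched as follows. This is an illustration under stated assumptions, not a drop-in replacement for the one-liner: the helper names parse_chromosomal_field and parse_line are hypothetical, fields are assumed to be whitespace-separated, and only the SNP and delins forms from the sample input are handled. For delins the chromosomal field names only the inserted bases, so the reference column is left as "-" here, whereas the original one-liner reads both alleles from the first column.

```perl
use strict;
use warnings;

# Hypothetical helper: pull chromosome, start, end and the trailing
# change description out of one 'NC_...' field. Returns an empty list
# if the field does not look like a chromosomal variant.
sub parse_chromosomal_field {
    my ($field) = @_;
    my ( $chrom, $start, $end, $change ) =
        $field =~ /^NC_0*([^.]+)\.\d+:g\.(\d+)(?:_(\d+))?(\S*)$/
        or return;
    $end //= $start;    # single-position variants have no '_end' part
    return ( $chrom, $start, $end, $change );
}

# Hypothetical helper: turn one input line into one tab-separated record.
sub parse_line {
    my ($line) = @_;
    for my $field ( split ' ', $line ) {
        next unless $field =~ /^NC_/;    # skip fields of no interest
        my ( $chrom, $start, $end, $change ) = parse_chromosomal_field($field)
            or next;
        my ( $ref, $alt ) = ( '-', '-' );
        if ( $change =~ /^([A-Z])>([A-Z])$/ ) {    # SNP
            ( $ref, $alt ) = ( $1, $2 );
        }
        elsif ( $change =~ /^delins([A-Z]+)$/ ) {    # deletion-insertion
            $alt = $1;
        }
        return join "\t", $chrom, $start, $end, $ref, $alt;
    }
    return;    # no chromosomal field on this line
}

# Two lines taken from the 'alpha' sample in the question.
for my $line (
    'NM_003924.3:c.*18_*19delGCinsAA NC_0000X.11:g.41747805_41747806delinsTT LRG_513t1:c.*18_*19delinsAA',
    'NM_003924.3:c.013G>T NC_0000X.11:g.41750615C>A LRG_513t1:c.13G>T',
  )
{
    my $record = parse_line($line);
    print "$record\n" if defined $record;
}
```

Because the chromosome is captured with ([^.]+) and joined as a string, the same code handles NC_000023 and NC_0000X without a separate branch.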

Script merging two files

I'm fairly inexperienced with coding, but I often use Perl to merge files and match IDs and information between two files. I have just tried matching two files using a program I have used many times previously, but this time it's not working and I don't understand why.
Here is the code:
use strict;
use warnings;
use vars qw($damID $damF $damAHC $prog $hash1 %hash1 $info1 $ID $sire $dam $F $FB $AHC $FA $hash2 %hash2 $info2);
open (FILE1, "<damF.txt") || die "$!\n Couldn't open damF.txt\n";
my $N = 1;
while (<FILE1>){
chomp (my $line=$_);
next if 1..$N==$.;
my ($damID, $damF, $damAHC, $prog) = split (/\t/, $line);
if ($prog){
$hash1 -> {$prog} -> {info1} = "$damID\t$damF\t$damAHC";
}
open (FILE2, "<whole pedigree_F.txt") || die "$!\n whole pedigree_F.txt \n";
open (Output, ">Output.txt")||die "Can't Open Output file";
while (<FILE2>){
chomp (my $line=$_);
next if 1..$N==$.;
my ($ID, $sire, $dam, $F, $FB, $AHC, $FA) = split (/\t/, $line);
if ($ID){
$hash2 -> {$ID} -> {info2} = "$F\t$AHC";
}
if ($ID && ($hash1->{$prog})){
$info1 = $hash1 -> {$prog} -> {info1};
$info2 = $hash2 -> {$ID} -> {info2};
print "$ID\t$info2\t$info1\n";
}
}
}
close(FILE1);
close FILE2;
close Output;
print "Done!\n";
and these snippets from the two input file formats:
File 1:
501093 0 0 3162
2958 0 0 3163
1895 0 0 3164
1382 0 0 3165
2869 0 0 3166
2361 0 0 3167
754 0 0 3168
3163 0 0 3169
File 2:
49327 20543 49325 0.077 0.4899 0.808 0.0484
49328 15247 49326 0.0755 0.5232 0.8972 0.0499
49329 27823 49327 0.0834 0.5138 0.8738 0.0541
I want to match the values from column 4 in file 1, with column 1 in file 2.
Then I also want to print the matching values from columns 2 and 3 in file 1 and columns 3 and 5 in file 2.
Also, it is probably worth mentioning there are about 500000 entries on each file.
This is the output I am getting:
11476 0.0362 0.3237 501093 0 0
11477 0.0673 0.4768 501093 0 0
11478 0.0443 0.2619 501093 0 0
Note that it isn’t looping through the first hash that I created.
Create two tables in SQLite. Load the TSVs into them. Do a SQL join. It will be simpler and faster.
Refer to this answer about how to load data into SQLite. In your case you want .mode tabs.
sqlite> create table file1 ( col1 int, col2 int, col3 int, col4 int );
sqlite> create table file2 ( col1 int, col2 int, col3 int, col4 numeric, col5 numeric, col6 numeric, col7 numeric );
sqlite> .mode tabs
sqlite> .import /path/to/file1 file1
sqlite> .import /path/to/file2 file2
There are any number of ways to improve those tables, but I don't know what your data is. Use more descriptive names in your own schema. You'll also want to declare primary and foreign keys, as well as indexes, to speed things up.
Now you have your data in an easy to manipulate format using a well known query language, not a bunch of custom code.
I want to match the values from column 4 in file 1, with column 1 in file 2.
Then I also want to print the matching values from columns 2 and 3 in file 1 and columns 3 and 5 in file 2.
You can do this with a SQL join between the two tables.
select file1.col2, file1.col3, file2.col3, file2.col5
from file1
join file2 on file1.col4 = file2.col1;
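For comparison, the same join can be sketched in Perl as a single hash lookup. This is only an illustration under stated assumptions: both files are tab-separated with one header line (the question's script skips the first line too), file 1 fits in memory, and the helper name join_files is hypothetical. Note that the two loops run one after the other; in the question's script the second loop sits inside the first.

```perl
use strict;
use warnings;

# Hypothetical helper: join file 1 (column 4) against file 2 (column 1)
# and return the matching rows as tab-separated strings.
sub join_files {
    my ( $fh1, $fh2 ) = @_;

    # Pass 1: index file 1 by its fourth column.
    my %by_prog;
    while ( my $line = <$fh1> ) {
        next if $. == 1;    # assumed header line
        chomp $line;
        my ( $dam_id, $dam_f, $dam_ahc, $prog ) = split /\t/, $line;
        next unless defined $prog && length $prog;
        $by_prog{$prog} = [ $dam_f, $dam_ahc ];    # columns 2 and 3 of file 1
    }

    # Pass 2: stream file 2 and look each ID up in the index.
    my @rows;
    while ( my $line = <$fh2> ) {
        next if $. == 1;    # assumed header line
        chomp $line;
        my @cols = split /\t/, $line;
        next unless @cols >= 5;
        my $match = $by_prog{ $cols[0] } or next;
        # ID, then columns 2 and 3 of file 1, then columns 3 and 5 of file 2.
        push @rows, join "\t", $cols[0], @$match, @cols[ 2, 4 ];
    }
    return @rows;
}
```

It would be called with two read handles, for example ones opened on damF.txt and "whole pedigree_F.txt", and the returned rows printed to the output file.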

Untainting a blessed hash member with or without the delete

I saw this line of code in some sources
( $self->{arg} ) = ( ( delete $self->{arg} ) =~ /(.*)/s ) if ${^TAINT};
I understand the untainting, and I also know what delete does.
My question is, in what circumstances is it necessary or preferred to use the delete, and isn't it enough to use the simpler
( $self->{arg} ) = ( ( $self->{arg} ) =~ /(.*)/s ) if ${^TAINT};
For example
#!/usr/bin/env perl -T
use 5.014;
use warnings;
package Some {
use Moose;
has 'arg' => (is => 'rw', isa => 'Str');
sub doit {
my $self = shift;
#( $self->{arg} ) = ( ( delete $self->{arg} ) =~ /(.*)/s ) if ${^TAINT};
( $self->{arg} ) = ( ( $self->{arg} ) =~ /(.*)/s ) if ${^TAINT};
}
};
my $some = Some->new( arg => 'some text' );
$some->doit();
say $some->arg;
With a normal hash deleting the value and reinserting will give the same result as modifying it in place.
The commit does not say why the author deletes it, only that the functionality is copied from Mason 1. But if you look at the source of HTML::Mason::Lexer, you will find this comment:
# We need to untaint the component or else the regexes will fail
# to a Perl bug. The delete is important because we need to
# create an entirely new scalar, not just modify the existing one.
($current->{comp_source}) = (delete $current->{comp_source}) =~ /(.*)/s if taint_is_on;
So the reason for doing it this way is to get a new scalar. The author does not do that in the other place where he untaints, Mason::Interp, so my guess is that it works around an earlier Perl bug in untainting.
So the difference is that using delete gives you a new scalar, although this will seldom matter in practice. (Deleting and reinserting is also a slower operation, of course.)
use strict;
my $hash->{test} = 'test';
print \($hash->{test}),"\n";
( $hash->{test} ) = ( ( $hash->{test} ) =~ /(.*)/s );
print \($hash->{test}),"\n";
( $hash->{test} ) = ( ( delete $hash->{test} ) =~ /(.*)/s );
print \($hash->{test}),"\n";
gives
SCALAR(0x7f84d10047e8)
SCALAR(0x7f84d10047e8)
SCALAR(0x7f84d1029230)

use multiple values as a key in a Perl hash

I have two tables. The first one is $sample and looks like this:
col1 col2
A 1
A 3
A 4
B 7
... ...
The second one is $exon and looks like this:
col1 col2 col3 col4 col5
name1 A 1 100 200
name2 A 2 300 400
name3 A 3 500 600
name4 A 4 700 800
I want to check if there is a match between col1 and col2 from $sample and col2 and col3 from $exon.
I normally use hashes for this in Perl. I know how it works when you are just looking for a match between two columns. But I'm stuck now because the values from two columns must both match. This is what I have for now:
my %hash = ();
while(<$sample>){
    chomp;
    my @cols = split(/\t/);
    my $keyfield = $cols[0]; #col1
    my $keyfield2 = $cols[1]; #col2
    push @{ $hash{$keyfield} }, $keyfield2; #this is probably not correct
}
seek $exon,0,0; #cursor resetting
while(<$exon>){
    chomp;
    my @cols = split(/\t/);
    my $keyfield = $cols[1]; #col2
    my $keyfield2 = $cols[2]; #col3
    if (exists($hash{$keyfield}) && exists($hash{$keyfield2})) {
        print $output $cols[0], "\t", $cols[3], "\t", $cols[4], "\n";
    }
}
You should use a concatenation of the two column values as the key for your hash table:
my %hash = ();
while(<$sample>){
    chomp;
    my @cols = split(/\t/);
    my $keyfield = $cols[0]; #col1
    my $keyfield2 = $cols[1]; #col2
    my $key = "$keyfield - $keyfield2";
    $hash{$key}=1;
}
seek $exon,0,0; #cursor resetting
while(<$exon>){
    chomp;
    my @cols = split(/\t/);
    my $keyfield = $cols[1]; #col2
    my $keyfield2 = $cols[2]; #col3
    my $key = "$keyfield - $keyfield2";
    if (exists($hash{$key})) {
        print $output $cols[0], "\t", $cols[3], "\t", $cols[4], "\n";
    }
}
You can put both fields into the hash key, separated by a delimiter:
my @cols = split(/\t/);
my $keyfield = $cols[0]."--".$cols[1];
push @{ $hash{$keyfield} }, $value; # $value stands for whatever you want to store under this key
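If the column values could themselves contain the delimiter, a nested hash sidesteps the question of which separator to pick. The following is a minimal sketch, assuming tab-separated input with the column layout shown in the question; the helper names index_sample and matching_exons are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: build a two-level lookup, $seen{col1}{col2} = 1,
# from the lines of the sample table.
sub index_sample {
    my (@lines) = @_;
    my %seen;
    for my $line (@lines) {
        chomp $line;
        my ( $col1, $col2 ) = split /\t/, $line;
        next unless defined $col2;
        $seen{$col1}{$col2} = 1;
    }
    return \%seen;
}

# Hypothetical helper: return col1, col4 and col5 of every exon line
# whose col2/col3 pair occurs in the sample index.
sub matching_exons {
    my ( $seen, @lines ) = @_;
    my @hits;
    for my $line (@lines) {
        chomp $line;
        my @cols = split /\t/, $line;
        next unless @cols >= 5;
        # Test the outer key first so the lookup does not autovivify it.
        next unless exists $seen->{ $cols[1] }
            && exists $seen->{ $cols[1] }{ $cols[2] };
        push @hits, join "\t", @cols[ 0, 3, 4 ];
    }
    return @hits;
}
```

With the sample rows from the question, the pairs (A, 1), (A, 3) and (A, 4) match, so name1, name3 and name4 would be returned while name2 is skipped.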