Optimizing Large Data Intersect

Optimizing Large Data Intersect - perl

I have two files from which a subset looks like this:
regions
chr1 150547262 150547338 v2MCL1_29.1.122 . GENE_ID=MCL1;Pool=2;PURPOSE=CNV
chr1 150547417 150547537 v2MCL1_29.1.283 . GENE_ID=MCL1;Pool=1;PURPOSE=CNV
chr1 150547679 150547797 v2MCL1_29.2.32 . GENE_ID=MCL1;Pool=2;PURPOSE=CNV
chr1 150547866 150547951 v2MCL1_29.2.574 . GENE_ID=MCL1;Pool=1;PURPOSE=CNV
chr1 150548008 150548096 v2MCL1_29.2.229 . GENE_ID=MCL1;Pool=2;PURPOSE=CNV
chr4 1801108 1801235 v2FGFR3_3.11.182 . GENE_ID=FGFR3;Pool=2;PURPOSE=CNV
chr4 1801486 1801615 v2FGFR3_3.11.202 . GENE_ID=FGFR3;Pool=1;PURPOSE=CNV
chrX 66833436 66833513 v2AR_region.70.118 . GENE_ID=AR;Pool=1;PURPOSE=CNV
chrX 66866117 66866228 v2AR_region.103.68 . GENE_ID=AR;Pool=2;PURPOSE=CNV
chrX 66871579 66871692 v2AR_region.108.32 . GENE_ID=AR;Pool=1;PURPOSE=CNV
Note: field 1 goes from chr1..chrX
query (a somewhat standard VCF file)
1 760912 . C T 21408 PASS . GT:DP:GQ:PL 1/1:623:99:21408,1673,0
1 766105 . T A 11865 PASS . GT:DP:GQ:PL 1/1:618:99:11865,1025,0
1 767780 . G A 15278 PASS . GT:DP:GQ:PL 1/1:512:99:15352,1274,74
1 150547747 . G A 9840 PASS . GT:DP:GQ:PL 0/1:645:99:9840,0,9051
1 204506107 . C T 22929 PASS . GT:DP:GQ:PL 1/1:636:99:22929,1801,0
1 204508549 . T G 22125 PASS . GT:DP:GQ:PL 1/1:638:99:22125,1757,0
2 2765262 . A G 22308 PASS . GT:DP:GQ:PL 1/1:678:99:22308,1854,0
2 2765887 . C T 9355 PASS . GT:DP:GQ:PL 0/1:649:99:9355,0,9235
2 25463483 . G A 31041 PASS . GT:DP:GQ:PL 1/1:936:99:31041,2422,0
2 212578379 . TA T 5355 PASS . GT:DP:GQ:PL 0/1:500:99:5355,0,3249
3 178881270 . T G 10012 PASS . GT:DP:GQ:PL 0/1:632:99:10012,0,7852
3 182673196 . C T 31170 PASS . GT:DP:GQ:PL 1/1:896:99:31170,2483,0
4 1801511 . C T 12218 PASS . GT:DP:GQ:PL 0/1:885:99:12218,0,11568
4 55097835 . G C 7259 PASS . GT:DP:GQ:PL 0/1:512:99:7259,0,7099
4 55152040 . C T 15866 PASS . GT:DP:GQ:PL 0/1:1060:99:15866,0,14953
X 152017752 . G A 9786 PASS . GT:DP:GQ:PL 0/1:735:99:9786,0,11870
X 152018832 . T G 12281 PASS . GT:DP:GQ:PL 0/1:924:99:12281,0,13971
X 152019715 . A G 10128 PASS . GT:DP:GQ:PL 0/1:689:99:10128,0,9802
Note: there are several leading lines that comprise the header and start with a '#' char.
I'm trying to write a script that will use the first two fields of the query file to see if the coordinates fall between the second and third fields of the regions file. I've coded it like this:
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dump;
my $bed = shift;
my $query_file = shift;
my %regions;
open( my $region_fh, "<", $bed ) || die "Can not open the input regions BED file: $!";
while (<$region_fh>) {
next if /track/;
my #line = split;
$line[0] =~ s/chr//; # need to strip of 'chr' or won't match query file
my ($gene, $pool, $purpose) = $line[5] =~ /GENE_ID=(\w+);(Pool=\d+);PURPOSE=(.*)$/;
#{$regions{$line[3]}} = (#line[0..4],$gene,$pool,$purpose);
}
close $region_fh;
my ( #header, #results );
open( my $query_fh, "<", $query_file ) || die "Can not open the query file: $!";
while (<$query_fh>) {
if ( /^#/ ) {
push( #header, $_ );
next;
}
my #fields = split;
for my $amp ( keys %regions ) {
if ( $fields[0] eq $regions{$amp}->[0] && $fields[1] >= $regions{$amp}->[1] && $fields[1] <= $regions{$amp}->[2] ) {
$fields[2] = $regions{$amp}->[5]; # add gene name to VCF file
push( #results, join( "\t", #fields ) );
}
}
}
close $query_fh;
The issue is that the query file is ~3.25 million lines long, and the regions file is about 2500 lines long. So, running this takes a very long time (I quit after about 20 minutes of waiting).
I think my overall logic is correct (hopefully!), and I'm wondering if there is a way to optimize how the data is processed to speed up the time it takes to process. I think the problem is that I need to traverse the array within regions 2500*3.25 million times. Can anyone offer any advice on how to revise my algorithm to process these data more efficiently?
Edit: Added a larger sample dataset, which should show some positives this time.

There are two approaches that I can think of. The first is to change the keys of %regions to the chromosome names, with the values being a list of all the start, end, and gene ID values corresponding to this chromosome, sort by the start value.
With your new data the hash would look like this
(
chr1 => [
[150547262, 150547338, "MCL1"],
[150547417, 150547537, "MCL1"],
[150547679, 150547797, "MCL1"],
[150547866, 150547951, "MCL1"],
[150548008, 150548096, "MCL1"],
],
chr4 => [
[1801108, 1801235, "FGFR3"],
[1801486, 1801615, "FGFR3"]
],
chrX => [
[66833436, 66833513, "AR"],
[66866117, 66866228, "AR"],
[66871579, 66871692, "AR"],
],
)
This way the chromosome name would give instant acccess to the right part of the hash instead of having to search through every entry each time, and the sorted start value allows a binary search.
The other possibility is to write the whole of the regions file to an SQLite temporary in-memory database. Once the data is stored and indexed, looking up a gene ID for a given chromosome and position will be pretty fast.

Related

Add new hash keys and then print in a new file

Previously, I post a question to search for an answer to using regex to match specifics sequence identification (ID).
Now I´m looking for some recommendations to print the data that I looking for.
If you want to see the complete file, here's a GitHub link.
This script takes two files to work. The first file is something like this (this is only a part of the file):
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 2 2 0.0804934 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 4 4 0.0925522 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 13 13 0.0250116 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 23 23 0.565981 . .
...
This file tells me when there is a value >= 0.5, this information is in the sixth column. When this happens my script takes the first column (this is an ID, to match in with the second file) and the fourth column (this is a position of a letter in the second file).
Here my second file (this is only a part):
>AGY29650.2|NA spike protein
MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*
Like I said previously, the script takes the ID in the first file to match with the ID in the second file when these are the same and then searches for the position (fourth column) in the contents of the data.
Here an example, in file one the fourth row is a positive value (>=0.5) and the position in the fourth column is 23.
Then the script searches for position 23 in the data contents of the second file, here position 23 is a letter T:
MTYSVFPLMCLLTFIGANAKIV T LP
When the script match with the letter, the looking for 2 letters right and 2 letters left to the position of interest:
IVTLP
In the previous post, thank the help of some people in Stack I could solve the problem because of a difference between ID in each file (difference like this: AGY29650_2_NA (file one) and AGY29650.2 (file two)).
Now I looking for help to obtain the output that I need to complete the script.
The script is incomplete because I couldn't found the way to print the output of interest, in this case, the 5 letters in the second file (one letter of the position that appears in file one) 2 letters right, and 2 left.
I have thousands of files like the one and two, now I need some help to complete the script with any idea that you recommend.
Here is the script:
use strict;
use warnings;
use Bio::SeqIO;

my $file = $ARGV[0];
my $in = $ARGV[1];
my %fastadata = ();
my #array_residues = ();
my $seqio_obj = Bio::SeqIO->new(-file => $in,
-format => "fasta" );
while (my $seq_obj = $seqio_obj->next_seq ) {
my $dd = $seq_obj->id;
my $ss = $seq_obj->seq;
###my $ee = $seq_obj->desc;
$fastadata{$dd} = "$ss";
}

my $thres = 0.5; ### Selection of values in column N°5 with the following condition: >=0.5

# Open file
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
$one =~ s/\n//g;
$one =~ s/\r//g;
my #cols = split(/\s+/, $one); ### split columns
next unless (scalar (#cols) == 7); ### the line must have 7 columns to add to the array
my $val = $cols[5];

if ($val >= 0.5) {
my $position = $cols[3];
my $id_list = $cols[0];
$id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
if (exists($fastadata{$id_list})) {
my $new_seq = $fastadata{$id_list};
my $subresidues = substr($new_seq, $position -3, 6);
}
}
}
close F;
I´m thinking in add a push function to generate the new data and then print in a new file.
My expected output is to print the position of a positive value (>=0.5), in this case, T (position 23) and the 2 letters right and 2 letters left.
In this case, with the data example in GitHub (link above) the expected output is:
IVTLP
Any recommendation or help is welcome.
Thank!

Main problem seems to be that the line has 8 columns not 7 as assumed in the script. Another small issue is that the extracted substring should have 5 characters not 6 as assumed by the script. Here is a modified version of the loop that works for me:
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
chomp $one;
my #cols = split(/\s+/, $one); ### split columns
next unless (scalar #cols) == 8; ### the line must have 8 columns to add to the array
my $val = $cols[5];
if ($val >= 0.5) {
my $position = $cols[3];
my $id_list = $cols[0];
$id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
if (exists($fastadata{$id_list})) {
my $new_seq = $fastadata{$id_list};
my $subresidues = substr($new_seq, $position -3, 5);
print $subresidues, "\n";
}
}
}

Mapping SNP coordinates to gene coordinates is too slow

I have two tab-delimited files like these
File 1 (these are Single Nucleotide Polymorphism (SNP) positions)
Chr1 26690
Chr1 33667
Chr1 75049
.
.
Chr2 12342
Chr2 32642
Chr2 424421
.
.
File 2 (these are gene start and end coordinates)
Chr1 2903 10817 LOC_Os01g01010
Chr1 2984 10562 LOC_Os01g01010
Chr1 11218 12435 LOC_Os01g01019
Chr1 12648 15915 LOC_Os01g01030
Chr1 16292 18304 LOC_Os01g01040
Chr1 16292 20323 LOC_Os01g01040
Chr1 16321 20323 LOC_Os01g01040
Chr1 16321 20323 LOC_Os01g01040
Chr1 22841 26971 LOC_Os01g01050
Chr1 22841 26971 LOC_Os01g01050
.
.
What I want is to match SNPs in file 1 to genes in file 2. The script should match the string in the first column of the files, and if they match it should then find which gene in the file 2 contains the corresponding SNP and return the locus ID from the fourth column of File 2.
Here's the script I have written
use strict;
my $i1 = $ARGV[0]; # SNP
my $i2 = $ARGV[1]; # gene coordinate
open(I1, $i1);
open(I2, $i2);
my #snp = ();
my #coor = ();
while( <I1> ) {
push(#snp, $_);
}
while ( <I2> ) {
push(#coor, $_);
}
for ( my $i = 0; $i <= $#snp; $i++ ) {
my #snp_line = split "\t", $snp[$i];
for ( my $j = 0; $j <= $#coor; $j++ ) {
my #coor_line = split "\t", $coor[$i];
if ( $snp_line[0] eq $coor_line[0] ) {
if ( $snp_line[1] >= $coor_line[1] && $snp_line[1] <= $coor_line[2] ) {
print "$snp_line[0]\t$snp_line[1]\t$coor_line[3]\n";
goto a;
}
}
}
a:
}
The problem is that obviously this is not the best way to do it as it iterates over all the ~60,000 lines in file 2 for each SNP in line 1. Also, it ran overnight and did not go past Chr1; we have upto Chr12.

You could work with these files when reformatted as UCSC BED format, using a toolkit like BEDOPS that does efficient set operations on sorted BED files.
Convert your first file of SNPs to a sorted BED file:
$ awk -v OFS="\t" '{ print $1, $2, ($2+1); }' snps.txt | sort-bed - > snps.bed
Sort the genes ("file 2"):
$ sort-bed genes.unsorted.txt > genes.bed
Map SNPs to genes:
$ bedmap --echo --echo-map-id-uniq --delim '\t' snps.bed genes.bed > answer.bed
If you need to, you can strip the end position of the SNP from the answer:
$ cut -f1,2,4 answer.bed > answer.txt
These tools will run very fast, usually within a few moments.
I would not use Perl or Python to do these kinds of set operations, unless I was doing some kind of academic exercise.

Here is a working script, the one posted above had bugs
use strict;
my $i1=$ARGV[0]; # SNP
my $i2=$ARGV[1]; # gene coordinate
open(I1,$i1);
open(I2,$i2);
my #snp=();
my #coor=();
while(<I1>)
{
push(#snp,$_);
}
while(<I2>)
{
push(#coor,$_);
}
for(my $i=0;$i<=$#snp;$i++)
{
my #snp_line = split "\t",$snp[$i];
for(my $j=0;$j<=$#coor;$j++)
{
my #coor_line = split "\t",$coor[$j];
if ($snp_line[0] eq $coor_line[0])
{
if ($snp_line[1] >= $coor_line[1] && $snp_line[1] <= $coor_line[2])
{
print "$snp_line[0]\t$snp_line[1]\t$coor_line[3]\n";
}
}
}
}
This one does the job.

How to calculate inverse log2 ratio of a UCSC wiggle file using perl?

I have 2 separate files namely A & B containing same header lines but 2 and 1 column respectively. I want to take inverse log2 of the 2nd column or 1st column in separate files but keep the other description intact. I am having some thing like this.. values in file A $1 and $2 are separated by delimiter tab
file A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 0.781985
16 0.810993
20 0.769601
24 0.733831
file B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
0.721985
0.610993
0.760123
0.573831
I expect an output like this. file A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.7194950944
16 1.754418585
20 1.7047982296
24 1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr2
for file B (in this file values are just copy paste of file A)
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.7194950944
1.754418585
1.7047982296
1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig rep1.bar.wig graphType=bar
variableStep chrom=chr2

This awk script does the calculation that you want:
awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' file
This matches lines that contain only digits, periods and any space characters, substituting the value of the last field $NF for 2 raised to the power of $NF. The format specifier %.12f can be modified to give you the required number of decimal places. The 1 at the end is shorthand for {print}.
Testing it out on your new files:
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.719495094445
16 1.754418584953
20 1.704798229573
24 1.663049372620
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.649449947457
1.527310087388
1.693635012985
1.488470882686

So here's the Perl version:
use strict;
open IN, $ARGV[0];
while (<IN>) {
chomp;
if (/^(.*)[\t ]*(-?\d\.\d*)/) { # format "nn m.mmmmm"
my $power = 2 ** $2;
print("$1\t" . $power . "\n");
} elsif (/^(-?\d\.\d*)/) { # format "m.mmmmm"
my $power = 2 ** $1;
print($power . "\n");
} else { # echo all other stuff
print;
print ("\n");
}
}
close IN;
If you run <file>.pl <datafile> (replace with appropriate names) it will convert one file so the lines have 2**<2nd value>). It simply echoes the lines that do not match the number pattern.

This is the modified little script of #ThomasKilian
Thanks to him for providing the framework.
use strict;
open IN, $ARGV[0];
while (<IN>) {
chomp;
if (/^(\d*)[\t ]*(-?\d\.\d*)/) { # format "nn m.mmmmm"
my $power = 2 ** $2;
$power= sprintf("%.12f", $power);
print("$1\t" . $power . "\n");
} elsif (/^(-?\d\.\d*)/) { # format "m.mmmmm"
my $power = 2 ** $1;
$power= sprintf("%.12f", $power);
print($power . "\n");
} else { # echo all other stuff
print;
print ("\n");
}
}
close IN;

calculating velocity from massive simulation data

I have simulation data for the velocity of water molecules. The format of the data is as below. I would like to describe the format of the data for clarity purposes, and it easily would lead to what I want to calculate.
A water molecule is made of three atoms: Oxygen(O) and two Hydrogen (H). Here I would name them O, H1, and H2.
The data below starts with line title 0 and the number 4335, saying it contains 4335 atoms (4335/3 = 1445 water molecules).
The first three numbers starting from the third row ( 0.0923365 0.0341984 -0.1248516 ) representing velocity for oxygen (O) atom at three Cartesian directions Ox, Oy, Oz. The next three numbers, in the same row representing velocities for hydrogen (H1) ==> H1x, H1y, H1z. And finally the first three numbers in fourth row representing velocities for hydrogen (H2) ==> H2x,H2y,H2z. finally, the following three numbers in the same fourth row representing velocities for oxygen atom.
These sequence is goes on for all 4335 atoms in 2170 lines including the top two lines in the data file and it repeats for the following section starting from title 1.
title 0
4335 2.0001000e+04
0.0923365 0.0341984 -0.1248516 -0.8946258 1.6688854 0.8259304
0.2890579 0.8051153 -1.5612963 0.0625492 -0.1361579 0.2869132
0.2343408 -0.0665305 1.0745378 -0.8375892 0.6953992 0.5149021
-0.1628550 0.0131844 0.0688080 0.2429340 0.2168210 -0.0289806
-0.3677613 0.2054004 -0.1511643 -0.3487551 -0.1454157 0.0801884
-0.9039297 -0.0682939 -0.2337404 -0.5605327 -0.0369157 0.2243892
-0.3100274 -0.2673132 -0.2093299 0.1975043 -0.4572202 -0.8410826
-0.6995287 -0.4123909 0.0649209 -0.1910519 0.2289656 0.2443295
-0.0279093 0.5790939 -0.0104249 -1.1961776 -0.5387340 0.1445187
-0.3188485 0.3789352 -0.0112114 0.7831523 0.6043882 -0.7131590
-0.7214440 -0.5358508 -0.3035673 -0.1549275 -0.1402387 -0.0101964
-0.2027608 1.5107149 0.2963312 -1.5104872 -0.1554981 -1.3323215
0.1097982 -0.1553742 0.3803437 0.0816858 0.0265007 0.4215823
0.1157368 0.2100116 0.4712551 0.1799426 -0.1260255 -0.2131755
0.1811777 -0.9442581 -0.6036636 0.9681703 -0.1523646 -0.3502441
0.0976771 0.0019619 -0.1832204 -0.0055989 0.2701100 -0.4416720
0.8496723 0.4070951 -0.0819204 0.1156806 -0.1619873 -0.0016126
-0.4051959 0.4263505 -0.9460036 0.4412067 0.1002270 0.5864405
-0.3831136 0.3240860 -0.0005143 -0.5667163 0.2618876 0.0103317
-0.6442209 0.3965833 -0.0778050 -0.2404238 -0.1339887 -0.1662417
0.3421198 0.7480828 -1.8316993 -0.4454920 -0.0499657 -0.1951254
-0.2895359 -0.1934811 -0.2674928 0.1255802 1.3522828 -0.2829485
-0.4129106 -0.6842645 -1.0147657 -0.1278501 -0.0597648 -0.1478294
-0.2519974 0.0665314 -0.0690079 -0.0480210 -0.1179547 -0.2091919
-0.1942484 0.2583650 -0.0734658 -0.1216313 0.5158040 -0.0676843
-0.3063602 0.8148463 -0.1959571 -0.1009838 -0.3394633 -0.0866587
.
. (goes on until line 2170)
.
0.1028815 -0.0844088 -0.2156557 -0.1698745 -0.2018967 -0.3863209
0.1793070 -0.1005802 0.1800752 -0.1404713 0.2216020 0.2236271
0.5192825 -0.7398186 0.0418758 0.0347715 -0.3457840 -0.1300237
-0.3089482 1.1125441 -0.4020403 0.2739744 -0.9062766 0.0012294
0.1498538 0.0883857 -0.0094638 0.0963565 -1.1027019 0.0115313
-0.0432824 0.3330713 0.0304943
title 1
4335 2.0002000e+04
-0.2082078 0.1774843 -0.1023302 -0.1100437 0.5973607 1.0627041
-0.2216015 0.0448885 -0.8415924 0.1691296 0.6008261 -0.0373434
0.9387534 -0.3642305 0.6756270 -0.6000357 0.6632088 1.0567899
-0.3234407 -0.1781680 -0.1936070 -0.4799916 -0.1522612 -0.2347461
0.1045985 0.1999704 -0.1482928 -0.0439331 0.0413923 0.1605458
0.3403952 -0.2012104 0.4851457 -0.9665228 0.2202362 0.0046218
.
. (goes on until line 2170)
.
What I want to calculate is the resultant velocity for each molecule and I would like to do this using Perl. The algorithm goes in this way.
First store the velocities for oxygen (O) and hydrogens (H1 & H2) in Ox,Oy,Oz, H1x,H1y,H1z and H2x,H2y,H2z respectively.
Next we define:
velocity_x = Ox + Hx + Hx
velocity_y = Oy + Hy + Hy
velocity_z = Oz + Hz + Hz
Finally calculate
resultant_velocity = sqrt(velocity_x**2 + velocity_y**2 + velocity_z**2)
and store the "resultant_velocity" into new file (the file should be title_0.dat). And the program shall calculate the velocities starting from title 1 until title 200 in the file.
I am a newbie at Perl, but I would like to do this operation in Perl since I find that it is very interesting. I can write simple "read and write" operations in Perl but found no idea how to split and assign the values to the variables and carryout the calculation though the calculation is high school standard.
#!/usr/bin/perl -w
$data_file="malto.dat";
open(DAT, $data_file) || die("Could not open file!");
#raw_data=<DAT>;
close(DAT);
while(<#raw_data>){
#columns=split /\s+/,$_;
if($columns[0]=~ m/ATOM/){
print "$columns[5], $columns[6], $columns[7]\n";
}
}
I would like to get some guidance from experts so that I can enhance my understanding of Perl while working on the code.
Appreciate any help.
Regards

Perhaps the following will assist you:
use strict;
use warnings;
use Math::Complex;
my $dataFile = 'malto.dat';
{
local $/ = 'title ';
open my $fh, '<', $dataFile or die $!;
while (<$fh>) {
chomp;
my #data = split or next;
my $titleNum = 'Title ' . shift #data;
my $atom = shift(#data) . ' ' . shift #data;
my $resultantVel = calcResultantVel( \#data );
print $titleNum, "\n";
print $atom, "\n";
print 'ResultantVel: ' . $resultantVel, "\n\n";
}
close $fh;
}
sub calcResultantVel {
my ($dataRef) = #_;
my ($velocity_x, $velocity_y, $velocity_z);
while ( my #nums = splice( #$dataRef, 0, 9 ) ) {
$velocity_x += $nums[0] + $nums[3] + $nums[6];
$velocity_y += $nums[1] + $nums[4] + $nums[7];
$velocity_z += $nums[2] + $nums[5] + $nums[8];
}
return sqrt( $velocity_x**2 + $velocity_y**2 + $velocity_z**2 );
}
The word and space title is used as the record separator, so each read takes in a chunk of data that's delimited by title. The chomp removes the record separator, and then the record is split on whitespace.
The zeroth element is the title number, and that's shifted off #data. The first and second elements of #data are the atom count, and they're shifted off, too. The remaining array elements are the Cartesian directions, and a reference to that array is send to the subroutine calcResultantVel.
The subroutine takes a chunk of nine elements at a time: three for O atom, three for the first H atom, and three for the second H atom, and a running sum is kept based upon the definition you've provided. Finally, the resultant velocity is returned.
Here's some sample output:
Title 0
4335 2.0001000e+04
ResultantVel: 13.2945751170603
Title 1
4335 2.0001000e+04
ResultantVel: 12.7696611061461
You can visually verify that it's working correctly. Since you "...can write simple 'read and write' operations in Perl...," the next step is to have it write the desired results to a file.
Hope this helps!

Here's my advice: break the job down into small components, and write a method for each meaningful part of the work. To wit:
use strict;
use warnings;
main(#ARGV); # Pass data file name on command line. Don't hard-code it.
sub main {
my $data_f = shift;
open(my $data_h, '<', $data_f) or die "$!: $data_f";
while (my $section = get_section($data_h)){
# Also write methods that can be called here to make
# desired computations, print output, etc.
}
}
sub get_section {
# Takes a file handle.
# Returns a hash reference containing all of the data
# for an entire section of the file.
my $h = shift;
return if eof($h);
chomp (my $title = <$h>);
my ($n_atoms) = <$h> =~ /^(\d+)/;
return {
'title' => $title,
'n_atoms' => $n_atoms,
'molecules' => get_molecules($h, $n_atoms / 3),
};
}
sub get_molecules {
my #molecules;
return \#molecules;
}
I have not written the get_molecules() method. It takes a file handle and an integer (N of molecules). It could return a reference to an array-of-arrays or maybe an array-of-hashes, with each inner array/hash holding the info for a single molecule.

Thanks for your help and guide. I have tried to modify your code as below. It works at least for my need.
#!/usr/bin/perl
###############
#use strict;
#use warnings;
use Math::Complex;
open OUTPUT, '>', "velocityOnly.dat" or die "Can't create filehandle: $!";
my $dataFile = 'velF1F2.vel';
{
local $/ = 'title ';
open my $FH, '<', $dataFile or die $!;
while (<$FH>) {
chomp;
my #data = split or next;
my $titleNum = 'Title ' . shift(#data);
my $atom = shift(#data) . ' ' . shift(#data);
#my $resultantVel = calcResultantVel( \#data );
#print OUTPUT "$titleNum", "\n";
print "$titleNum", "\n";
for my $i (1..1445)
{
$j=(9*($i-1));
$velocity_x = $data[($j+0)] + $data[($j+3)] + $data[($j+6)];
$velocity_y = $data[($j+1)] + $data[($j+4)] + $data[($j+7)];
$velocity_z = $data[($j+2)] + $data[($j+5)] + $data[($j+8)];
$velo = sprintf '%.3f',sqrt( $velocity_x**2 + $velocity_y**2 + $velocity_z**2 );
chomp $velo;
print "$velo","\n";
print OUTPUT "$velo\n";
}
#print 'ResultantVel: ' . $resultantVel, "\n\n";
}
close $FH;
}
But I would like to extend further by adding some other functionality for doing some complex calculations. The code
Before that, I need some guide on making the below code into subroutine. I am bit lost here. Your CODE actually add all the X, Y and Z and finally find the velocity. But what I want is not that. Each 9 values subsequently represent coordinate for a water molecule which contain three atoms.
(The number 1445 is number of molecules. Each molecule contain three atoms and each atom has three coordinates.So for a water molecule has 9 Cartesian coordinates.)
the i here represent number of water molecule
for my $i (1..1445)
{
$j=(9*($i-1));
$velocity_x = $data[($j+0)] + $data[($j+3)] + $data[($j+6)];
$velocity_y = $data[($j+1)] + $data[($j+4)] + $data[($j+7)];
$velocity_z = $data[($j+2)] + $data[($j+5)] + $data[($j+8)];
$velo = sprintf '%.3f',sqrt( $velocity_x**2 + $velocity_y**2 + $velocity_z**2 );
chomp $velo;
print "$velo","\n";
print OUTPUT "$velo\n";
}

How do I turn a table into a matrix?

If I got a table in a text file such like
A B 1
A C 2
A D 1
B A 3
C D 2
A E 1
E D 2
C B 2
. . .
. . .
. . .
And I got another symbol list in another text file. I want to transform this table into a Perl data structure like:
_ A D E . . .
A 0 1 1 . . .
D 1 0 2 . . .
E 1 2 0 . . .
. . . . . . .
But I only need some selected symbol, for example A, D and E are selected in the symbol text but B and C are not.

Use an array for the first one and a 2-dimentional hash for the second one. The first one should look roughly like:
$list[0] # row 1 - the value is "A B 1"
And the hash like:
$hash{A}{A} # the intersection of A and A - the value is 0
Figuring out how to implement a problem is about 75% of the mental battle for me. I'm not going to go into specifics about how to print the hash or the array, because that's easy and I'm also not entirely clear on how you want it printed or how much you want printed. But converting the array to the hash should look a bit like this:
foreach (#list) {
my ($letter1, $letter2, $value) = split(/ /);
$hash{$letter1}{$letter2} = $value;
}
At least, I think that's what you're looking for. If you really want you could use a regular expression, but that's probably overkill for just extracting 3 values out of a string.
EDIT: Of course, you could forgo the #list and just assemble the hash straight from the file. But that's your job to figure out, not mine.

you can try this with awk:
awk -f matrix.awk yourfile.txt > newfile.matrix.txt
where matrix.awk is :
BEGIN {
OFS="\t"
}
{
row[$1,$2]=$3
if (!($2 in f2)) { header=(header)?header OFS $2:$2;f2[$2]}
if (col1[c]!=$1)
col1[++c]=$1
}
END {
printf("%*s%s\n", length(col1[1])+2, " ",header)
ncol=split(header,colA,OFS)
for(i=1;i<=c;i++) {
printf("%s", col1[i])
for(j=1;j<=ncol;j++)
printf("%s%s%c", OFS, row[col1[i],colA[j]], (j==ncol)?ORS:"")
}
}

Another way to do this would be to make a two-dimensional array -
my #fArray = ();
## Set the 0,0th element to "_"
push #{$fArray[0]}, '_';
## Assuming that the first line is the range of characters to skip, e.g. BC
chomp(my $skipExpr = <>);
while(<>) {
my ($xVar, $yVar, $val) = split;
## Skip this line if expression matches
next if (/$skipExpr/);
## Check if these elements have already been added in your array
checkExists($xVar);
checkExists($yVar);
## Find their position
for my $i (1..$#fArray) {
$xPos = $i if ($fArray[0][$i] eq $xVar);
$yPos = $i if ($fArray[0][$i] eq $yVar);
}
## Set the value
$fArray[$xPos][$yPos] = $fArray[$yPos][$xPos] = $val;
}
## Print array
for my $i (0..$#fArray) {
for my $j (0..$#{$fArray[$i]}) {
print "$fArray[$i][$j]", " ";
}
print "\n";
}
sub checkExists {
## Checks if the corresponding array element exists,
## else creates and initialises it.
my $nElem = shift;
my $found;
$found = ($_ eq $nElem ? 1 : 0) for ( #{fArray[0]} );
if( $found == 0 ) {
## Create its corresponding column
push #{fArray[0]}, $nElem;
## and row entry.
push #fArray, [$nElem];
## Get its array index
my $newIndex = $#fArray;
## Initialise its corresponding column and rows with '_'
## this is done to enable easy output when printing the array
for my $i (1..$#fArray) {
$fArray[$newIndex][$i] = $fArray[$i][$newIndex] = '_';
}
## Set the intersection cell value to 0
$fArray[$newIndex][$newIndex] = 0;
}
}
I am not too proud regarding the way I have handled references but bear with a beginner here (please leave your suggestions/changes in comments). The above mentioned hash method by Chris sounds a lot easier (not to mention a lot less typing).

CPAN has many potentially useful suff. I use Data::Table for many purposes. Data::Pivot also looks promising, but I have never used it.

Categories

HOME

semantic-segmentation

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Optimizing Large Data Intersect - perl

Related

Add new hash keys and then print in a new file

Mapping SNP coordinates to gene coordinates is too slow

How to calculate inverse log2 ratio of a UCSC wiggle file using perl?

calculating velocity from massive simulation data

How do I turn a table into a matrix?

Categories

Resources