Remove Line from File Based on Column Value in Perl - perl

I wish to loop through multiple files, and through the lines of each file. I have done this successfully already. What I want to do now is remove lines from a file based on a numeric value in one of the columns.
If I have an input such as this:
XP.sta1 -41.5166 0.0513 0.6842 0.1794 0 CPHI.BHZ 300.2458 -42.2436
XP.sta2 3.5972 0.0500 0.7699 0.1213 0 E000.BHZ 300.5616 2.5545
XP.sta3 3.7112 0.0267 0.7813 0.1457 0 E002.BHZ 300.6140 2.6160
XP.sta4 4.2891 0.0214 0.6870 0.1308 0 E004.BHZ 301.2073 2.6006
where the ninth column is the one I wish to look at. Let's assign the value in column 9 to a variable $time; if $time is greater than 10 or less than -10, the entire line should be removed. Thus far I have tried this:
unless (($time < -10) || ($time > 10)) {
print OUT2 ($stlat," ",$stlon," ",$eqlat," ",$eqlong," ",$eqdepth," ",$time,"\n");
}}
However I get the following output:
XP.sta1 -41.5166 0.0513 0.6842 0.1794 0 CPHI.BHZ 300.2458 2.5545
XP.sta2 3.5972 0.0500 0.7699 0.1213 0 E000.BHZ 300.5616 2.6160
XP.sta3 3.7112 0.0267 0.7813 0.1457 0 E002.BHZ 300.6140 2.6006
XP.sta4 4.2891 0.0214 0.6870 0.1308 0 E004.BHZ 301.2073
As you can see, the entire line isn't deleted -- just the offending value is removed, and the other values shift up into the 9th column. How do I delete the entire line, rather than just the ninth-column value?
Here's where I wish to edit my script:
open(TABLEC,$File);
@tablec = <TABLEC>;
for ($j = 2; $j < $stop; $j++) {
chomp ($tablec[$j]);
($netSta,$delayTime) = (split /\s+/,$tablec[$j])[1,9] ;
}
In this for loop, I'm looping through each file, reading the lines from 2 to 'stop' and chomping the trailing newline. I assign the 9th column to the delay-time variable. So I'm looping through each line, but I don't want to print anything yet (that comes later in my script). I would just like to remove the entire line, so that later on, when I print the lines, any line whose 9th-column value is greater than 10 in absolute value no longer exists.

I'd just skip the line:
use warnings;
use strict;
while (<DATA>) {
chomp;
my @split = split;
next if $split[8] > 10 or $split[8] < -10;
print "$_\n";
}
XP.sta2 3.5972 0.0500 0.7699 0.1213 0 E000.BHZ 300.5616 2.5545
XP.sta3 3.7112 0.0267 0.7813 0.1457 0 E002.BHZ 300.6140 2.6160
XP.sta4 4.2891 0.0214 0.6870 0.1308 0 E004.BHZ 301.2073 2.6006

You haven't shown enough of your code to diagnose the problem, but what you ask is very simply done like this
use strict;
use warnings;
while ( <DATA> ) {
print unless abs((split)[8]) > 10;
}
__DATA__
XP.sta1 -41.5166 0.0513 0.6842 0.1794 0 CPHI.BHZ 300.2458 -42.2436
XP.sta2 3.5972 0.0500 0.7699 0.1213 0 E000.BHZ 300.5616 2.5545
XP.sta3 3.7112 0.0267 0.7813 0.1457 0 E002.BHZ 300.6140 2.6160
XP.sta4 4.2891 0.0214 0.6870 0.1308 0 E004.BHZ 301.2073 2.6006
output
XP.sta2 3.5972 0.0500 0.7699 0.1213 0 E000.BHZ 300.5616 2.5545
XP.sta3 3.7112 0.0267 0.7813 0.1457 0 E002.BHZ 300.6140 2.6160
XP.sta4 4.2891 0.0214 0.6870 0.1308 0 E004.BHZ 301.2073 2.6006

I thought your question had been answered, but here's something that should help you with the contents of your edit.
Some points on your code
Identifiers for lexical variables should contain only lower-case letters, decimal digits, and underscore. Capital letters are reserved for global variables such as constants and package names
You should use lexical file handles with the three-parameter form of open
You should always verify that an open succeeded. In the case of a failure your program should die and include the value of $! in the die string to reveal why the operation failed
Together, those points mean that
open(TABLEC, $File);
becomes
open my $tablec_fh, '<', $File or die qq{Unable to open "$File" for input: $!};
You can chomp an entire array at once with chomp @tablec
You should avoid the C-style for loop as it is rarely a good choice. Perl allows you to iterate over a range, and you should make use of that. So
for ($j = 2; $j < $stop; $j++) { ... }
becomes
for my $j ( 2 .. $stop-1 ) { ... }
split /\s+/ should almost always be split ' '. The latter is a special case of the operator which prevents it from returning an initial empty field if the parameter string has leading spaces. If you call split without any parameters then it defaults to split ' ', $_
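Here is a small demonstration of the difference, using a made-up line with leading spaces:
use strict;
use warnings;

my $line = "  XP.sta1 -41.5166 0.0513";   # note the two leading spaces

my @with_regex = split /\s+/, $line;      # yields an initial empty field
my @with_space = split ' ',   $line;      # leading whitespace is discarded

print scalar @with_regex, " fields\n";    # prints "4 fields"
print scalar @with_space, " fields\n";    # prints "3 fields"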
Here's a rewrite of your sample code that takes these points into account. I hope it's a better fit than my previous answer
open my $tablec_fh, '<', $File or die qq{Unable to open "$File" for input: $!};
my @tablec = <$tablec_fh>;
chomp @tablec;
close $tablec_fh;
for my $i ( 2 .. $stop-1 ) {
my $row = $tablec[$i];
my ($net_sta, $delay_time) = (split ' ', $row)[0,8];
next unless abs($delay_time) <= 10;
# Do stuff with $row
}

Related

Randomly pick a region and process it, a number of times

I have data like this
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
I want to randomly pick a region of 10 letters from it and then count the number of F's in it. I want to do that a certain number of times, for example 1000 times or even more.
as an example, I randomly pick
LVPSLTRYLT 0
then
ITNLRSFIHK 1
then again randomly pick 10 consecutive letters
AHSRIRKERP 0
This continues until it reaches the requested number of runs. I want to store all the randomly selected regions with their counts, because then I want to calculate how many times F is seen.
So I do the following
# first I remove the header
grep -v ">" data.txt > out.txt
Then, to randomly get one region of 10 letters, I tried to use shuf with no success:
shuf -n1000 data.txt
then I tried to use awk and was not successful either
awk 'BEGIN {srand()} !/^$/ { if (rand() == 10) print $0}'
then calculate the number of F and save it in the file
grep -i -e [F] |wc -l
Note, we should not pick up the same region twice
I've got to assume some things here, and leave some restrictions
Random regions to pick don't depend in any way on specific lines
Order doesn't matter; there just need to be N regions spread out through the file
The file can be a gigabyte in size, so we can't read it whole (which would be much easier!)
There are unhandled (edge or unlikely) cases, discussed after code
First build a sorted list of random numbers; these are positions in the file at which regions start. Then, as each line is read, compute its range of characters in the file, and check whether our numbers fall within it. If some do, they mark the start of each random region: pick substrings of desired length starting at those characters. Check whether substrings fit on the line.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::MoreUtils qw(uniq);
my ($region_len, $num_regions) = (10, 10);
my $count_freq_for = 'F';
#srand(10);
GetOptions(
'num-regions|n=i' => \$num_regions,
'region-len|l=i' => \$region_len,
'char|c=s' => \$count_freq_for,
) or usage();
my $file = shift || usage();
# List of (up to) $num_regions random numbers, spanning the file size
# However, we skip all '>sp' lines so take more numbers (estimate)
open my $fh, '<', $file or die "Can't open $file: $!";
$num_regions += int $num_regions * fraction_skipped($fh);
my @rand = uniq sort { $a <=> $b }
map { int(rand (-s $file)-$region_len) } 1..$num_regions;
say "Starting positions for regions: @rand";
my ($nchars_prev, $nchars, $chars_left) = (0, 0, 0);
my $region;
while (my $line = <$fh>) {
chomp $line;
# Total number of characters so far, up to this line and with this line
$nchars_prev = $nchars;
$nchars += length $line;
next if $line =~ /^\s*>sp/;
# Complete the region if there wasn't enough chars on the previous line
if ($chars_left > 0) {
$region .= substr $line, 0, $chars_left;
my $cnt = () = $region =~ /$count_freq_for/g;
say "$region $cnt";
$chars_left = -1;
};
# Random positions that happen to be on this line
my @pos = grep { $_ > $nchars_prev and $_ < $nchars } @rand;
# say "\tPositions on ($nchars_prev -- $nchars) line: @pos" if @pos;
for (@pos) {
my $pos_in_line = $_ - $nchars_prev;
$region = substr $line, $pos_in_line, $region_len;
# Don't print if there aren't enough chars left on this line
last if ( $chars_left =
($region_len - (length($line) - $pos_in_line)) ) > 0;
my $cnt = () = $region =~ /$count_freq_for/g;
say "$region $cnt";
}
}
sub fraction_skipped {
my ($fh) = @_;
my ($skip_len, $data_len);
my $curr_pos = tell $fh;
seek $fh, 0, 0 if $curr_pos != 0;
while (<$fh>) {
chomp;
if (/^\s*>sp/) { $skip_len += length }
else { $data_len += length }
}
seek $fh, $curr_pos, 0; # leave it as we found it
return $skip_len / ($skip_len+$data_len);
}
sub usage {
say STDERR "Usage: $0 [options] file", "\n\toptions: ...";
exit;
}
Uncomment the srand line to get the same run every time, for testing.
Notes follow.
Some corner cases
If the 10-long window doesn't fit on the line from its random position it is completed on the next line -- but any (possible) further random positions on this line are left out. So if our random list has 1120 and 1122 while a line ends at 1125, then the window starting at 1122 is skipped. Unlikely, but possible, and of no consequence (other than yielding one region fewer).
When an incomplete region is filled up in the next line (the first if in the while loop), it is possible that that line is shorter than the remaining needed characters ($chars_left). This is very unlikely and needs an additional check there, which is left out.
Random numbers are pruned of dupes. This skews the sequence, but only minutely, which should not matter here; and we may end up with fewer numbers than asked for, but only by very little
Handling of issues regarding randomness
"Randomness" here is pretty basic, what seems suitable. We also need to consider the following.
Random numbers are drawn over the interval spanning the file size, int(rand -s $file) (minus the region size). But lines starting with >sp are skipped, and any of our numbers that fall within those lines won't be used, so we may end up with fewer regions than drawn numbers. Those lines are shorter, thus with a lesser chance of having numbers fall on them, so not many numbers are lost; but in some runs I saw even 3 out of 10 numbers skipped, ending up with a random sample 70% of the desired size.
If this is a bother, there are ways to approach it. To avoid skewing the distribution even further, they should all involve pre-processing the file.
The code above makes an initial run over the file, to compute the fraction of chars that will be skipped. That is then used to increase the number of random points drawn. This is of course an "average" measure, but which should still produce the number of regions close to desired for large enough files.
More detailed measures would need to see which random points of a (much larger) distribution are going to be lost to skipped lines, and then re-sample to account for that. This may still mess with the distribution, which arguably isn't an issue here, but more to the point it may simply be unneeded.
In all this you read the big file twice. The extra processing time should only be in the seconds, but if that is unacceptable, change the function fraction_skipped to read through only 10-20% of the file. With large files this should still provide a reasonable estimate.
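Here is one way that change could look -- a sketch only, with the 10% cutoff chosen arbitrarily (the variable names mirror the original subroutine):
sub fraction_skipped {
    my ($fh) = @_;
    my ($skip_len, $data_len) = (0, 0);
    my $sample_limit = int( 0.10 * -s $fh );   # look at roughly the first 10% of the file
    my $curr_pos = tell $fh;
    seek $fh, 0, 0 if $curr_pos != 0;
    while (<$fh>) {
        chomp;
        if (/^\s*>sp/) { $skip_len += length }
        else           { $data_len += length }
        last if $skip_len + $data_len >= $sample_limit;   # stop early
    }
    seek $fh, $curr_pos, 0;    # leave the handle as we found it
    return $skip_len / ($skip_len + $data_len);
}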
Note on a particular test case
With srand(10) (commented-out line near the beginning) we get the random numbers such that on one line the region starts 8 characters before the end of the line! So that case does test the code to complete the region on the next line.
Here is a simple driver to run the above a given number of times, for statistics.
Doing it using builtin tools (system, qx) is altogether harder and libraries (modules) help. I use IPC::Run here. There are quite a few other options.†
Adjust and add code to process as needed for statistics; output is in files.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use IPC::Run qw(run);
my $outdir = 'rr_output'; # pick a directory name
mkdir $outdir if not -d $outdir;
my $prog = 'random_regions.pl'; # your name for the program
my $input = 'data_file.txt'; # your name for input file
my $ch = 'F';
my ($runs, $regions, $len) = (10, 10, 10);
GetOptions(
'runs|n=i' => \$runs,
'regions=i' => \$regions,
'length=i' => \$len,
'char=s' => \$ch,
'input=s' => \$input
) or usage();
my @cmd = ( $prog, $input,
'--num-regions', $regions,
'--region-len', $len,
'--char', $ch
);
say "Run: #cmd, $runs times.";
for my $n (1..$runs) {
my $outfile = "$outdir/regions_r$n.txt";
say "Run #$n, output in: $outdir/$outfile";
run \#cmd, '>', $outfile or die "Error with #cmd: $!";
}
sub usage {
say STDERR "Usage: $0 [options]", "\n\toptions: ...";
exit;
}
Please expand on the error checking. See for instance this post and its links for details.
Simplest use: driver_random.pl -n 4, but you can pass all of the main program's parameters.
The called program (random_regions.pl above) must be executable.
†   Some, from simple to more capable: IPC::System::Simple, Capture::Tiny, IPC::Run3. (Then comes IPC::Run used here.) Also see String::ShellQuote, to prepare commands without quoting issues, shell injection bugs, and other problems. See links (examples) assembled in this post, for example.
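For illustration, here is a sketch of the same run using one of the simpler options, IPC::System::Simple; capturex runs a command without a shell and returns its output, dying with a useful message on failure. It reuses the @cmd and $outfile variables from the driver above.
use IPC::System::Simple qw(capturex);

# Run the program and collect its STDOUT; capturex dies if the command fails
my $output = capturex(@cmd);

open my $out_fh, '>', $outfile or die "Can't write $outfile: $!";
print {$out_fh} $output;
close $out_fh;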
awk to the rescue!
You didn't specify, but there are two random actions going on; I treated them independently, which may not be what you want: first picking a line, and second picking a random 10-letter substring from that line.
This assumes the file (or actually half of it) can fit in memory. Otherwise, split the file into equal chunks and run this on the chunks. Doing so will reduce some of the clustering, though I'm not sure how important that is in this case. (If you have one big file, it's possible that all samples are drawn from the first half; with splitting you eliminate this possibility.) For certain cases this is a desired property; I don't know your case.
$ awk 'BEGIN {srand()}
!/^>/ {a[++n]=$0}
END {while(i++<1000)
{line=a[int(rand()*n)+1];
s=int(rand()*(length(line)-9));
print ss=substr(line,s,10), gsub(/F/,"",ss)}}' file
GERISPKDRC 0
QDEPRCLHEW 0
LLYQLFRNLF 2
GTHGAGAMES 0
TKALQDVQIR 0
FCVHTKALQD 1
SNKAQVKPGQ 0
CMECQGHGER 0
TRRFVGHTKD 1
...
Here is one solution using Perl
It slurps the entire file to memory. Then the lines starting with > are removed.
Here I'm looping 10 times ($i<10); you can increase the count here.
Then the rand function is called with the length of the file, and using that random value a substr of 10 characters is taken. The $s!~/\n/ guard makes sure we don't choose a substring that crosses a newline.
$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f\n" ;$i++} else { $i-- } } '
random10.txt
ENTQLLETKN 0
LSEGALSPDG 0
LRKARAEAED 0
RLWDLTTGTT 0
KWSGRCGLGY 0
TRRFVGHTKD 1
PVKRPIPHPA 0
GMVQQIQSVC 0
LTHPVLSFGI 1
KVNFPENGFL 2
$
To see the random numbers that were generated:
$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f $x\n" ;$i++} else { $i-- } }
' random10.txt
QLDGSLTMSS 0 1378.61409368207
DLIAKVDELT 0 1703.46689004765
SGGGANGTSF 1 900.269562152326
PEELTLSPKL 0 1368.55540468164
TCLSEGALSP 0 1016.50744004085
NRTWNSSAVP 0 23.7868578293154
VNFPENGFLS 2 363.527933104776
NSGLTWSGND 0 48.656607650744
MILSASRDKT 0 422.67705815168
RRGEDLFMCM 1 290.828530365
AGDGLLTPDA 0 1481.78080339531
$
Since your input file is huge I'd do it in these steps:
select random 10-char strings from each line of your input file
shuffle those to get the number of samples you want in random order
count the Fs
e.g.
$ cat tst.sh
#!/bin/env bash
infile="$1"
sampleSize=10
numSamples=15
awk -v sampleSize="$sampleSize" '
BEGIN { srand() }
!/^>/ {
begPos = int((rand() * sampleSize) + 1)
endPos = length($0) - sampleSize
for (i=begPos; i<=endPos; i+=sampleSize) {
print substr($0,i,sampleSize)
}
}
' "$infile" |
shuf -n "$numSamples"
$ ./tst.sh file
HGDIKCVLNE
QDEPRCLHEW
SEVQAIIEST
THDLRVSLEE
SEWVSCVRFS
LTRYLTLNAS
KDGQKITFHG
SNSPEPQKAV
QGGSKATTPA
QLLETKNALN
LLFCDNHKKQ
DETNYGIPQR
IRFQPQLNPD
LQTIRFSPDI
SLKRCGGFLI
$ ./tst.sh file | awk '{print $0, gsub(/F/,"")}'
SPKLQLDGSL 0
IKLFCVHTKA 1
VVSRCRLRHT 0
SPEPQKAVEQ 0
AYNPKNFSND 1
FGESRPELGS 1
AGDGLLTPDA 0
VGHTKDVLSV 0
VTHDLRVSLE 0
PISLGIFPLP 1
ASQITNLRSF 1
LTRPPEELTL 0
FDRYGEEGLK 1
IYIEGQDEPR 0
WNTLGVCKYT 0
Just change the numSamples from 15 to 1000 or whatever you like when run against your real data.
The above relies on shuf -n being able to handle however much input we throw at it, presumably much like sort does, by using paging. If it fails in that regard then obviously you'd have to choose/implement a different tool for that part. FWIW I tried seq 100000000 | shuf -n 10000 (i.e. 10 times as many input lines as the OP's posted max file length of 10000000, to account for the awk part generating N lines of output per 1 line of input, and 10 times as many output lines as the OP's posted 1000) and it worked fine, only taking a few seconds to complete.

how to make a difference between two data files more efficient (run time)

I have code which compares values on some specific terms between two files. The main time-consuming part of the code is as follows:
my @ENTIRE_FILE;
my %NETS;
my %COORDINATES;
my $INT=1;
my %IR_VALUES;
################################# READING
foreach my $IR_REPORT_FILE_1(@IR_REPORT_FILES){
{
open (FHIN, "<", $IR_REPORT_FILE_1) or die("Could not open $! for reading\n");
# chomp(my @ENTIRE_FILE = <FHIN>); # READS THE ENTIRE FILE
local undef $/;
@ENTIRE_FILE = split(/\n(.*NET.*)/,<FHIN>);
close (FHIN);
}
############################### BUILDING HASH
for my $i(1..$#ENTIRE_FILE/2){
if($ENTIRE_FILE[$i*2-1]=~ /^----.*\s+"(\w+)"\s+/){
my $net =$1;
my @ir_values_of_net = split(/\n/,$ENTIRE_FILE[$i*2]);
for my $val (@ir_values_of_net){
push ( @{ $NETS{$INT}{$net} }, $val ) if ($val =~ /^r.*\s+m1|v1_viadg|v1_viabar|m2|ay_viabar|ay_viadg|c1\s+/); # NETS{1}{VDD}=array of values, NETS{1}{VSS}, NETS{1}{AVDD}
}
}
}
$INT++; # For the next file: NETS{2}{VDD}, NETS{2}{VSS}, NETS{2}{AVDD}
}
############################### COMPARISON
my $loop_count=0;
foreach my $net(keys %{ $NETS{1} }){
print "net is $net\n";
foreach my $file_1_net( @{ $NETS{1}{$net} }){
my @sub_str_1 = split (' ', $file_1_net);
foreach my $file_2_net ( @{ $NETS{2}{$net} } ){
$loop_count++;
# my @sub_str_1 = split (' ', $file_1_net);
my @sub_str_2 = split (' ', $file_2_net);
if(($sub_str_1[2] eq $sub_str_2[2])&&(($sub_str_1[3].$sub_str_1[4].$sub_str_1[5].$sub_str_1[6] eq $sub_str_2[3].$sub_str_2[4].$sub_str_2[5].$sub_str_2[6]) || ($sub_str_1[3].$sub_str_1[4].$sub_str_1[5].$sub_str_1[6] eq $sub_str_2[5].$sub_str_2[6].$sub_str_2[3].$sub_str_2[4]))){
push (@{ $COORDINATES{$net}{X} },$sub_str_1[3],$sub_str_1[5]) if ($sub_str_1[3] && $sub_str_1[5]);
push (@{ $COORDINATES{$net}{Y} },$sub_str_1[4],$sub_str_1[6]) if ($sub_str_1[4] && $sub_str_1[6]);
my $difference=$sub_str_1[1]-$sub_str_2[1];
if($sub_str_1[3]=~/^-/){
push (@{ $MATCHED_RESISTORS{$net}{$sub_str_1[2].$sub_str_1[3].$sub_str_1[4].$sub_str_1[5].$sub_str_1[6]} }, $file_1_net,$file_2_net,$difference);
}else{
push (@{ $MATCHED_RESISTORS{$net}{$sub_str_1[2]."-".$sub_str_1[3].$sub_str_1[4].$sub_str_1[5].$sub_str_1[6]} }, $file_1_net,$file_2_net,$difference);
}
push (@{ $IR_VALUES{$net} }, $sub_str_2[1]);
last;
}
}
}
print max @{ $IR_VALUES{$net} };
print "\nloop count is $loop_count\n";
$loop_count = 0;
# <>;
}
I ran a profiler on the code.
Some statistics:
For my testcase, the outer-most foreach has 3 elements. Below is the
number of matched elements for each iteration:
element_1: 14
element_2: 316
element_3: 8
The file sizes are 8.3 MB and 518.3KB.
Run time for the entire code is: 220s
My main concern is that when both files are around 8.3 MB and there are more matches between the two files, the run-time is humongous, e.g. 3 hours.
My question is really simple: How do I make my code run faster?
Sample Data File_1:
r6_2389 1.29029e-05 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0 3.68449e-06 -5.99170336965613
r6_2397 1.29029e-05 ay_viabar 22.948 57.755 22.628 57.755 4.5 0 0 3.68449e-06 -5.99170336965613
r6_2400 1.29029e-05 ay_viabar 22.82 57.755 22.628 57.755 4.5 0 0 3.68449e-06 -5.99170336965613
r6_2403 1.29029e-05 ay_viabar 22.692 57.755 22.628 57.755 4.5 0 0 3.68449e-06 -5.99170336965613
r6_971 1.3279e-05 c1 9.492 45.742 -0.011 46.779 0.001 9.5589 10 0.0508653
Sample Data File_2:
r6_9261 0.00206167 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0 0.0207546
r6_9258 0.00206167 ay_viabar 22.948 57.755 22.628 57.755 4.5 0 0 0.0161057
r6_9399 0.00206167 ay_viabar 22.82 57.755 22.628 57.755 4.5 0 0 0.0127128
r6_9486 0.00206167 ay_viabar 22.692 57.755 22.628 57.755 4.5 0 0 0.0103186
r6_1061 1.3279e-05 cb_pc_viadg -6.696 44.157 -0.159 44.847 0.001 0 0 0
Sample Output:
r6_9261 0.00206167 ay_viabar 23.076 57.755 22.628 57.755 4.5 0 0 0.0207546
r6_9258 0.00206167 ay_viabar 22.948 57.755 22.628 57.755 4.5 0 0 0.0161057
r6_9399 0.00206167 ay_viabar 22.82 57.755 22.628 57.755 4.5 0 0 0.0127128
r6_9486 0.00206167 ay_viabar 22.692 57.755 22.628 57.755 4.5 0 0 0.0103186
The sample output is basically pushed into another hash which is further processed. But building up this hash consumes about 90% of the total run-time as per the profiler.
OK, so my first thought is: you have a three-deep loop, and that will always be inefficient. We can probably trade memory for a lot of speed there.
Assuming the 'bigger' file is 'sample_1', otherwise swap them.
In this example - sample_2 will consume memory proportionate to the number of rows - so we ideally want that to be the smaller file. You may need to swap the match/test around, depending on whether file1 cols 5,6,3,4 matches file2 or vice versa.
But hopefully this illustrates a useful concept for solving your problem, if not entirely solving it?
Something like this will do the trick:
#!/usr/bin/env perl
use strict;
use warnings;
my %is_match;
open ( my $sample_1, '<', 'sample1.txt' ) or die $!;
open ( my $sample_2, '<', 'sample2.txt' ) or die $!;
# first of all, column 2 , 3,4,5,6 should match between 2 files.
# and then print out both matching lines from two files.
# column 3,4,5,6 from one file can match with column 5,6,3,4.
while ( <$sample_2> ) {
my @row = split;
#insert into hash
#this would be much clearer if the fields were named rather than numbered
#I think.
$is_match{$row[3]}{$row[4]}{$row[5]}{$row[6]}++;
$is_match{$row[5]}{$row[6]}{$row[3]}{$row[4]}++;
}
while ( <$sample_1> ) {
my @row = split;
#print the current line if it matches from the hash above.
print if $is_match{$row[3]}{$row[4]}{$row[5]}{$row[6]};
}
Because this iterates over each file only once, it should be a lot faster. And because one of your files is small, that's the one you should read into memory first.
With your sample data as provided, this gives you the desired output.
The first loop reads through the file, selects the fields of interest and inserts them into a hash, based on your 4 keys.
And then it does so again for the other set of valid matching keys.
The second loop reads the other file, selects the keys and just checks if either combination exists in the hash. And prints the current line if it does.
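The comments in the code note that column 2 should also match between the two files; here is a sketch of how that could be folded into the same approach (the file names and zero-based field indices are the same assumptions as above):
#!/usr/bin/env perl
use strict;
use warnings;

my %is_match;

open my $sample_2, '<', 'sample2.txt' or die $!;
while ( <$sample_2> ) {
    my @row = split;
    # Key on column 2 plus both orderings of columns 3,4,5,6
    $is_match{ join ',', @row[ 2, 3, 4, 5, 6 ] }++;
    $is_match{ join ',', @row[ 2, 5, 6, 3, 4 ] }++;
}

open my $sample_1, '<', 'sample1.txt' or die $!;
while ( <$sample_1> ) {
    my @row = split;
    # Print the line only if columns 2..6 match in either orientation
    print if $is_match{ join ',', @row[ 2, 3, 4, 5, 6 ] };
}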

Perl: perl regex for extracting values from complex lines

Input log file:
Nservdrx_cycle 4 servdrx4_cycle
HCS_cellinfo_st[10] (type = (LTE { 2}),cell_param_id = (28)
freq_info = (10560),band_ind = (rsrp_rsrq{ -1}),Qoffset1 = (0)
Pcompensation = (0),Qrxlevmin = (-20),cell_id = (7),
agcreserved{3} = ({ 0, 0, 0 }))
channelisation_code1 16/5 { 4} channelisation_code1
sync_ul_info_st_ (availiable_sync_ul_code = (15),uppch_desired_power =
(20),power_ramping_step = (3),max_sync_ul_trans = (8),uppch_position_info =
(0))
trch_type PCH { 7} trch_type8
last_report 0 zeroth bit
I was trying to extract only the integers from the above input, but I am facing an issue when a string contains an integer at its beginning or end
(e.g. agcreserved{3}, HCS_cellinfo_st[10], Qoffset1).
I want to ignore numbers attached to identifiers, like the {3}, [10] and the 1 in Qoffset1, but my code captures them,
since it extracts every integer.
Here is the simple regex I have written for extracting only the integers.
MY SIMPLE CODE:
use strict;
use warnings;
my $Ipfile = 'data.txt';
open my $FILE, "<", $Ipfile or die "Couldn't open input file: $!";
my @array;
while(<$FILE>)
{
while ($_ =~ m/( [+-]?\d+ )/xg)
{
push @array, ($1);
}
}
print "#array \n";
The output I am getting for the above inputs:
4 4 10 2 28 10560 -1 1 0 0 -20 7 3 0 0 0 1 16 5 4 1 15 20 3 8 0 7 8 0
expected output:
4 2 28 10560 -1 0 0 -20 7 0 0 0 4 15 20 3 8 0 7 0
Could somebody help me, with an explanation?
You are catching every integer because your regex has no restrictions on which characters can (or can not) come before/after the integer. Remember that the /x modifier only serves to allow whitespace/comments inside your pattern for readability.
Without knowing more about the possible structure of your data, this modification achieves the desired output:
while ( $_ =~ m! [^[{/\w] ( [+-]?\d+ ) [^/\w]!xg ) {
push @array, ($1);
}
I have added rules before and after the integer to exclude certain characters. So now, we will only capture if:
There is no [, {, /, or word character immediately before the number
There is no / or word character immediately after the number
If your data could have 2-digit numbers in the { N} blocks (e.g. PCH {12}) then this will not capture those and the pattern will need to become much more complex. This solution is therefore quite brittle, without knowing more of the rules about your target data.

cosine similarity between strings perl

I have a file containing, for example, this text:
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
I found a module which calculates cosine similarity, http://search.cpan.org/~wollmers/Bag-Similarity-0.019/lib/Bag/Similarity/Cosine.pm
I did a simple test to begin with:
my $cosine = Bag::Similarity::Cosine->new;
my $similarity = $cosine->similarity(['perl','java','python','php','scala'],['java','pascal','perl','ruby','ada']);
print $similarity;
The result was 0.4.
The problem is that when I read from the file and calculate the cosine between each pair of lines, the results are different. This is the code:
open(F,"/home/ahmed/FILE.txt") or die " Pb pour ouvrir";
my @data; # contains each line of the FILE, one per element
while(<F>) {
chomp;
push @data, $_;
}
#print join " ", #data;
my $cosine = Bag::Similarity::Cosine->new;
for my $i ( 0 .. $#data-1 ) {
for my $j ( $i + 1 .. $#data ) {
my $similarity = $cosine->similarity($data[$i],$data[$j]);
print "line $i a une similarite de $similarity avec line $j\n";
$i + 1,
$j + 1;
}
}
The results:
line 0 has a similarity of 0.933424735647156 with line 1
line 0 has a similarity of 0.953945734121021 with line 2
line 0 has a similarity of 0.939759036144578 with line 3
line 1 has a similarity of 0.917585834612093 with line 2
line 1 has a similarity of 0.945092544842746 with line 3
line 2 has a similarity of 0.908826679128811 with line 3
The similarity must be 0.4 between lines 1 and 2.
I changed the FILE like this:
['perl','java','python','php','scala']
['java','pascal','perl','ruby','ada']
['ASP','awk','php','java','perl']
['C#','ada','python','java','scala']
but I get the same result.
Thank you.
There is a syntax error in your program. Were you trying to use printf and used print mistakenly? Not sure about your code, but the below works fine for me.
#!/usr/bin/perl
use strict;
use warnings;
use Bag::Similarity::Cosine;
my $cosine = Bag::Similarity::Cosine->new;
my @data;
while ( <DATA> ) {
push @data, { map { $_ => 1 } split };
}
for my $i ( 0 .. $#data-1 ) {
for my $j ( $i + 1 .. $#data ) {
my $similarity = $cosine->similarity($data[$i],$data[$j]);
print "line $i has a similarity of $similarity with line $j\n";
}
}
__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
Output:
line 0 has a similarity of 0.4 with line 1
line 0 has a similarity of 0.6 with line 2
line 0 has a similarity of 0.6 with line 3
line 1 has a similarity of 0.4 with line 2
line 1 has a similarity of 0.4 with line 3
line 2 has a similarity of 0.2 with line 3
I know nothing at all about this module. But I can read the documentation.
It looks to me like the module has two methods. similarity() is used for comparing two strings and from_bags() is used to compare two references to arrays containing strings. I expect that when you call similarity passing it two array references, then what gets compared is actually the stringification of the two references.
Try switching to from_bags() and see if that's any better.
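For example, something like this -- a sketch based on the from_bags() interface described above, reusing @data and the loops from your code, and not tested:
my $cosine = Bag::Similarity::Cosine->new;

for my $i ( 0 .. $#data - 1 ) {
    for my $j ( $i + 1 .. $#data ) {
        # Pass each line as a reference to an array of its words
        my $similarity = $cosine->from_bags(
            [ split ' ', $data[$i] ],
            [ split ' ', $data[$j] ],
        );
        print "line $i has a similarity of $similarity with line $j\n";
    }
}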
Update: On investigating further, I see that similarity() will compare any kind of input (strings, array refs or hash refs).
This demonstrates using similarity() to compare the lines as text and as arrays of words.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Bag::Similarity::Cosine;
chomp(my @data = <DATA>);
my $cos = Bag::Similarity::Cosine->new;
for my $i (0 .. $#data - 1) {
for my $j (1 .. $#data) {
next if $i == $j;
say "$i -> $j: strings ", $cos->similarity($data[$i], $data[$j]);
say "$i -> $j: array refs ", $cos->similarity([split /\s+/, $data[$i]], [split /\s+/, $data[$j]]);
}
}
__DATA__
perl java python php scala
java pascal perl ruby ada
ASP awk php java perl
C# ada python java scala
And it gives this output:
$ perl similar
0 -> 1: strings 0.88602000346543
0 -> 1: array refs 0.4
0 -> 2: strings 0.89566858950296
0 -> 2: array refs 0.6
0 -> 3: strings 0.852802865422442
0 -> 3: array refs 0.6
1 -> 2: strings 0.872356744289958
1 -> 2: array refs 0.4
1 -> 3: strings 0.884721984738799
1 -> 3: array refs 0.4
2 -> 1: strings 0.872356744289958
2 -> 1: array refs 0.4
2 -> 3: strings 0.753778361444409
2 -> 3: array refs 0.2
I don't know which version gives you the information you want. I suspect it might be the array reference version.

Insert the highest value among the number of times it occurs

I have two files:
1) Tab file with the following content. Let's call this reference file:
V$HMGIY_01_rc Ncor=0.405
V$CACD_01 Ncor=0.405
V$GKLF_02 Ncor=0.650
V$AML2_Q3 Ncor=0.792
V$WT1_Q6 Ncor=0.607
V$KID3_01 Ncor=0.668
V$CNOT3_01 Ncor=0.491
V$KROX_Q6 Ncor=0.423
V$ETF_Q6_rc Ncor=0.547
V$E2F_Q2_rc Ncor=0.653
V$SP1_Q6_01_rc Ncor=0.650
V$SP4_Q5 Ncor=0.660
2) The second tab file contains the search string X as shown below. Let's call this file search_string:
A X
NF-E2_SC-22827 NF-E2
NRSF NRSF
NFATC1_SC-17834 NFATC1
NFKB NFKB
TCF3_SC-349 TCF3
MEF2A MEF2A
What I have already done is: take the first search term (from the search_string file, column X) and check whether it occurs in the first column of the reference file. Example: the first search term is NF-E2. I check whether this string occurs in the first column of the reference file. If it occurs, then give a score of 1, else give 0. I have also counted the number of times it matches the pattern. Now my output is of the format:
Keyword Keyword in file? Number of times keyword occurs in file
NF-E2 1 3
NRSF 0 0
NFATC1 0 0
NFKB 1 7
TCF3 0 0
Now, in addition to this, what I would like to add is the highest Ncor value for each string in each file. Say for example: while I search for NF-E2 in NF-E2.txt, the Ncor values present are: 3.02, 2.87 and 4.59. Then I want the value 4.59 to be printed in the next column. So now my output should look like:
Keyword Keyword in file? Number of times keyword occurs in file Ncor
NF-E2 1 3 4.59
NRSF 0 0
NFATC1 0 0
NFKB 1 7 1.66
TCF3 0 0
Please note: I need to search each string in a different file, i.e. the first string (NF-E2) should be searched in file NF-E2.tab, the second string (NRSF) in file NRSF.tab, and so on.
Here is my code:
perl -lanE '$str=$F[1]; $f="/home/$str/list/$str.txt"; $c=`grep -c "$str" "$f"`;chomp($c);$x=0;$x++ if $c;say "$str\t$x\t$c"' file2
Please help!
This should work:
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
chomp;
my $keyword = (split /\s+/)[1];
my $file = "/home/$keyword/list/${keyword}.txt";
open my $reference, '<', "$file" or die "Cannot open $file: $!";
my $key_cnt = 0;
my $max_ncor = 0;
while (my $line = <$reference>) {
my ($string, undef, $ncor) = split /\s+|=/, $line;
if ($string =~ $keyword) {
$key_cnt++;
$max_ncor = $ncor if ($max_ncor < $ncor);
}
}
print join("\t", $keyword, $key_cnt ? 1 : 0, $key_cnt, $key_cnt ? $max_ncor : ''), "\n";
}
Run it like this:
perl t.pl search_string.txt