Why does it take so long to print "\n" in Perl? - perl

Why does it take so long to print a newline? Is this just my machine, or do others see the same effect?
With the newline:
#!/usr/bin/perl
use strict;
use Benchmark;
timethis(100000,'main();');
sub main {
print "you are the bomb. \n";
}
# outputs:
# timethis 100000: 8 wallclock secs ( 0.15 usr + 0.45 sys = 0.60 CPU) # 166666.67/s (n=100000)
W/o the newline:
#!/usr/bin/perl
use strict;
use Benchmark;
timethis(100000,'main();');
sub main {
print "you are the bomb. ";
}
# outputs:
# timethis 100000: 0 wallclock secs ( 0.09 usr + 0.04 sys = 0.13 CPU) # 769230.77/s (n=100000)
# (warning: too few iterations for a reliable count)
Edit: I'd like to add that placing two "\n" causes the execution to take
twice as long, at least for wallclock seconds.
timethis 100000: 16 wallclock secs ( 0.15 usr + 0.52 sys = 0.67 CPU) # 149253.73/s (n=100000)

I don't think buffering has much to do with it. I'm guessing it's because the terminal needs to scroll when you print a newline to it (or print enough characters to fill a line). When I benchmark these functions writing to a file or to /dev/null, there is not much of a difference.
use Benchmark;
timethis(1000000, 'main');
timethis(1000000, 'main2');
select STDERR; $| = 0; select STDOUT; # enable buffering on STDERR
sub main { print STDERR "you are the bomb. \n" }
sub main2 { print STDERR "you are the bomb. " }
$ perl benchmark.pl 2> a_file
timethis 1000000: 21 wallclock secs ( 4.67 usr + 13.38 sys = 18.05 CPU) # 55410.87/s
timethis 1000000: 21 wallclock secs ( 4.91 usr + 13.34 sys = 18.25 CPU) # 54797.52/s
$ perl benchmark.pl 2> /dev/null
timethis 1000000: 26 wallclock secs ( 2.86 usr + 10.36 sys = 13.22 CPU) # 75648.69/s
timethis 1000000: 27 wallclock secs ( 2.86 usr + 10.30 sys = 13.16 CPU) # 76010.95/s
$ perl benchmark.pl 2> a_file (without buffering)
timethis 1000000: 29 wallclock secs ( 3.78 usr + 12.14 sys = 15.92 CPU) # 62806.18/s
timethis 1000000: 29 wallclock secs ( 3.27 usr + 12.51 sys = 15.78 CPU) # 63367.34/s
$ perl benchmark.pl 2> /dev/tty (window has 35 lines and buffers 10000, YMMV)
[ 200000 declarations of how you are a bomb deleted ]
timethis 100000: 53 wallclock secs ( 0.98 usr + 3.73 sys = 4.72 CPU) # 21190.93/s
timethis 100000: 9 wallclock secs ( 0.36 usr + 1.94 sys = 2.30 CPU) # 43535.05/s
Summary: extra flushing reduces performance by about 10%. Extra scrolling on the terminal reduces performance by about 50%.

It's not the \n per se that causes this problem. Rather, output from successive calls to print is held in a buffer by Perl's I/O layer (not by the OS). When STDOUT is attached to a terminal, that buffer is typically line-buffered, so it is flushed whenever a \n is written or the buffer fills up. Flushing the output to the screen is a (relatively) expensive operation, so the loop in which you flush the output buffer many times has much slower performance than the loop in which you only flush the buffer once at the end (which happens implicitly when your program exits).

Newline flushes output.
In most stdio implementations, buffering varies with the type of output device ... Serial devices, including terminals, modems, mice, and joysticks, are normally line-buffered; stdio sends the entire line out only when it gets the newline
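A minimal sketch of how one might test the explanations above, assuming the cost comes from sending many separate lines to a terminal. `-t STDOUT` reports whether standard output is attached to a terminal, and joining the lines first means `print` is called once for the whole batch. The helper name `build_output` is hypothetical, and whether batching actually helps on a given terminal is something to measure, not a guarantee.

```perl
use strict;
use warnings;

# Hypothetical helper: repeat one line $count times and return a single string.
sub build_output {
    my ($line, $count) = @_;
    return $line x $count;
}

# -t is true when the filehandle is attached to a terminal (tty).
my $where = (-t STDOUT) ? "a terminal" : "a file or pipe";
print STDERR "STDOUT is $where\n";

# One print call for the whole batch instead of one call per line.
print build_output("you are the bomb.\n", 3);
```

Running it once normally and once with `> /dev/null` shows which case applies; the same `-t` check can be dropped into the benchmark scripts above.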

Related

Having problems getting specific column in file to be printed to output file in a hash

I am working on an assignment and I am having some issues getting portions of a certain file to be printed to the output file in a hash. I was given a large file containing a list of different species (along with some other variables that aren't too important) and I am getting stuck on how to isolate that specific column and put it into a hash that can be printed to the output while counting how many times each species is mentioned.
#!/usr/bin/perl
use strict; use warnings;
open IN, $ARGV[0]; ## open input file given as argument 1
open OUT, ">", $ARGV[1]; ## open output file given as argument 2
my @cols; ## creates variable to hold column of data
print "\nWorking on file: $ARGV[0]\n\n"; ## while data exists in input file, read
## line by line
while (my $file = <IN>) {
chomp $file; ## remove trailing newline
print "$file\n";
@cols = split /\t/, $file; ## split data into columns on tab
print "@cols[9]\n";
my %hits;
$hits{species} += 1;
print "$hits{species}\n";
print OUT "@cols[9]\n"; ##write species column to output file
}
print "File has been read and output written!\n";
close IN;
close OUT;
This is currently what I have for my code and any suggestions or tips would be greatly appreciated. Thanks!
Sample of input data (Bacteria_firmicutes is the 11th column)
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
I had to change the split function's delimiter from tab to whitespace, because the sample data as posted is space-separated. I shortened the code a bit to make it a little easier to understand. On Perl 5.16.3 this works just fine.
use strict;
use warnings;
my @cols; ## creates variable to hold column of data
print "\nWorking on __DATA__\n\n";
while (my $file = <DATA>) {
chomp $file; ## remove trailing newline
print "$file\n";
@cols = split /\s/, $file; ## split data into columns on SPACE. OR CHANGE TO TAB DELIMITED SAMPLE DATA
print "$cols[10]\n";
my %hits;
$hits{species} += 1;
}
__DATA__
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
EDIT: I would also like to submit this for you to study. You can capture the last string before the newline character. This would be useful if you wanted to always get the very last column in the sample data.
Also, using this method, you can modify the regex and search for specific patterns in sample data.
while (my $file = <DATA>) {
$file =~ /.*\s(.*)\n$/; #or $file =~ /.*(SOME_PATTERN).*\n$/
print $1;
}
__DATA__
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
EDIT 2: Here is a good way to keep track of which and how many patterns were matched in the sample data (EDIT: code shortened some more):
use strict;
use warnings;
my %hits; #always declare hash/arrays outside of loops unless its intentional.
# ADD and/or INCREMENT hash foreach pattern found
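# reading into $line avoids an "uninitialized value" warning at end of input
while (my $line = <DATA>) {
$hits{$1}++ if $line =~ /.*\s(.*)\n$/;
}
print "$_: $hits{$_}\n" for keys %hits;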
__DATA__
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_firmicutes
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test2
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test3
Query dbj|BAI87270.2| 1 456 98.048 461 911 0.0 645657 Bacillus_subtilis Bacteria_test3
The following code should work, assuming your actual input is tab separated values instead of space separated like the sample input in the question seems to be.
Here's what I changed:
Added mode to first open() call
Assigned $cols[9] to a variable $species then used that as the key for %hits.
Removed the print calls in the while loop
Moved declaration of %hits above the loop
Added foreach loop that loops through %hits and prints the number of times that species name was seen and the species name to *OUT filehandle
#!/usr/bin/perl
use strict; use warnings;
open IN, "<", $ARGV[0]; ## open input file given as argument 1
open OUT, ">", $ARGV[1]; ## open output file given as argument 2
my @cols; ## creates variable to hold column of data
print "\nWorking on file: $ARGV[0]\n\n"; ## while data exists in input file, read
## line by line
my %hits;
while (my $file = <IN>) {
chomp $file; ## remove trailing newline
print "$file\n";
@cols = split /\t/, $file; ## split data into columns on tab
my $species = $cols[9];
$hits{$species} += 1;
}
foreach my $species (keys %hits) {
print OUT "$hits{$species} $species\n";
}
print "File has been read and output written!\n";
close IN;
close OUT;
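For comparison, here is a self-contained sketch of the same counting approach using lexical filehandles and a checked `open`. The sub name `count_column` and the made-up rows are assumptions for illustration; it reads from an in-memory string so it can be run without the real input file.

```perl
use strict;
use warnings;

# Count how often each value appears in one tab-separated column.
# $index is zero-based, so 9 selects the tenth column.
sub count_column {
    my ($fh, $index) = @_;
    my %hits;
    while (my $line = <$fh>) {
        chomp $line;
        my @cols = split /\t/, $line;
        next unless defined $cols[$index];   # skip short or blank lines
        $hits{ $cols[$index] }++;
    }
    return \%hits;
}

# Made-up rows: nine filler fields followed by a species name.
my $sample = "";
$sample .= join("\t", ("x") x 9, $_) . "\n"
    for qw(Bacillus_subtilis Bacillus_subtilis Bacillus_cereus);

# Read from an in-memory filehandle instead of $ARGV[0].
open my $in, "<", \$sample or die "Cannot open in-memory data: $!";
my $hits = count_column($in, 9);
close $in;

print "$hits->{$_}\t$_\n" for sort keys %$hits;
```

Swapping the in-memory handle for `open my $in, "<", $ARGV[0] or die ...` should give the behaviour the question asks for, provided the real file is tab-separated.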

Perl: read an array and calculate corresponding percentile

I am trying to write a Perl script that reads a text file containing a series of numbers, calculates the percentile of each, and prints each number with its corresponding percentile. I do not have access to the statistical modules, so I'd like to stick with pure Perl. Thanks in advance!
The input text file looks like:
197
98
251
82
51
272
154
167
38
280
157
212
188
88
40
229
228
125
292
235
67
70
127
26
279
.... (and so on)
The code I have is:
#!/usr/bin/perl
use strict;
use warnings;
my @data;
open (my $fh, "<", "testing2.txt")
or die "Cannot open: $!\n";
while (<$fh>){
push @data, $_;
}
close $fh;
my %count;
foreach my $datum (@data) {
++$count{$datum};
}
my %percentile;
my $total = 0;
foreach my $datum (sort { $a <=> $b } keys %count) {
$total += $count{$datum};
$percentile{$datum} = $total / @data;
# percentile subject to change
if ($percentile{$datum} <= 0.10) {
print "$datum : $percentile{$datum}\n\n";
}
}
My desired output:
2 : 0.01
3 : 0.01333
4 : 0.01666
6 : 0.02
8 : 0.03
10 : 0.037
12 : 0.04
14 : 0.05
15 : 0.05333
16 : 0.06
18 : 0.06333
21 : 0.07333
22 : 0.08
25 : 0.09
26 : 0.09666
Where the format is #number from the list : #corresponding percentile
To store each number without a newline in @data, add chomp; before pushing it, or call chomp @data; after you've read them all. Otherwise every hash key keeps its trailing newline, and the output shows a line break between the number and the colon.
If your input file has MS Windows style line endings, convert it to *nix style using dos2unix or fromdos.
Also, try to indent your code; it improves readability. Consider renaming $total to $running_total, since the value accumulates as the loop proceeds.
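Putting that advice together, a minimal sketch might look like the following. It assumes "percentile" means the cumulative fraction of values less than or equal to each distinct number, as in the question's code; the sub name `cumulative_fractions` and the four sample values are made up.

```perl
use strict;
use warnings;

# Return [value, cumulative fraction] pairs for the distinct values,
# in ascending numeric order.
sub cumulative_fractions {
    my @data = @_;
    chomp @data;                      # strip trailing newlines
    s/\r\z// for @data;               # drop a stray carriage return, if any
    @data = grep { length } @data;    # ignore empty lines

    my %count;
    $count{$_}++ for @data;

    my ($running_total, @result) = (0);
    for my $value (sort { $a <=> $b } keys %count) {
        $running_total += $count{$value};
        push @result, [ $value, $running_total / @data ];
    }
    return @result;
}

my @lines = ("10\n", "20\n", "20\n", "40\n");
printf "%s : %.4f\n", @$_ for cumulative_fractions(@lines);
```

To print only the lowest 10%, filter the returned pairs with something like `grep { $_->[1] <= 0.10 }` before printing.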

Modifying perl script to not print duplicates and extract sequences of a certain length

I want to first apologize for the biological nature of this post. I thought I should post some background first. I have a set of gene files that contain anywhere from one to five DNA sequences from different species. I used a bash shell script to perform blastn with each gene file as a query and a file of all transcriptome sequences (all_transcriptome_seq.fasta) from the five species as the subject. I now want to process these output files (and there are many) so that I can get all subject sequences that hit into one file per gene, with duplicate sequences removed (except to keep one), and ensure I'm getting the length of the sequences that actually hit the query.
Here is what the blastn output looks like for one gene file (columns: qseqid qlen sseqid slen qframe qstart qend sframe sstart send evalue bitscore pident nident length)
Acur_01000750.1_OFAS014956-RA-EXON04 248 Apil_comp17195_c0_seq1 1184 1 1 248 1 824 1072 2e-73 259 85.60 214 250
Acur_01000750.1_OFAS014956-RA-EXON04 248 Atri_comp5613_c0_seq1 1067 1 2 248 1 344 96 8e-97 337 91.16 227 249
Acur_01000750.1_OFAS014956-RA-EXON04 248 Acur_01000750.1 992 1 1 248 1 655 902 1e-133 459 100.00 248 248
Acur_01000750.1_OFAS014956-RA-EXON04 248 Btri_comp17734_c0_seq1 1001 1 1 248 1 656 905 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Atri_comp5613_c0_seq1 1067 1 2 250 1 344 96 1e-60 217 82.33 205 249
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Acur_01000750.1 992 1 1 250 1 655 902 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Btri_comp17734_c0_seq1 1001 1 1 250 1 656 905 1e-134 462 100.00 250 250
I've been working on a perl script that would, in short, take the sseqid column to pull out the corresponding sequences from the all_transcriptome_seq.fasta file, place these into a new file, and trim the transcripts to the sstart and send positions. Here is the script, so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
############################################################################
# blastn_post-processing.pl v. 1.0 by Michael F., XXXXXX
############################################################################
my($progname) = $0;
############################################################################
# Initialize variables
############################################################################
my($jter);
my($com);
my($t1);
if ( @ARGV != 2 ) {
print "Usage:\n \$ $progname <infile> <transcriptomes>\n";
print " infile = tab-delimited blastn text file\n";
print " transcriptomes = fasta file of all transcriptomes\n";
print "exiting...\n";
exit;
}
my($infile)=$ARGV[0];
my($transcriptomes)=$ARGV[1];
############################################################################
# Read the input file
############################################################################
print "Reading the input file... ";
open (my $INF, $infile) or die "Unable to open file";
my @data = <$INF>;
print @data;
close($INF) or die "Could not close file $infile.\n";
my($nlines) = $#data + 1;
my($inlines) = $nlines - 1;
print "$nlines blastn hits read\n\n";
############################################################################
# Extract hits and place sequences into new file
############################################################################
my @temparray;
my @templine;
my($seqfname);
open ($INF, $infile) or die "Could not open file $infile for input.\n";
@temparray = <$INF>;
close($INF) or die "Could not close file $infile.\n";
$t1 = $#temparray + 1;
print "$infile\t$t1\n";
$seqfname = "$infile" . ".fasta";
if ( -e $seqfname ) {
print " --> $seqfname exists. overwriting\n";
unlink($seqfname);
}
# iterate through the individual hits
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
} # end for ($jter=0; $jter<$t1...
# Arguments for "extract_from_genome2"
# // argv[1] = name of genome file
# // argv[2] = gi number for contig
# // argv[3] = start of subsequence
# // argv[4] = end of subsequence
# // argv[5] = name of output sequence
Using this script, here is the output I'm getting:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
As you can see, it's pretty close to what I'm wanting. Here are the two issues I have and cannot seem to figure out how to resolve with my script. The first is that a sequence may occur more than once in the sseqid column, and with the script in its current form, it will print out duplicates of these sequences. I only need one. How can I modify my script to not duplicate sequences (i.e., how do I only retain one but remove the other duplicates)? Expected output:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
The second is that the script is not extracting quite the right base pairs. It's very close, off by one or two, but it's not exact.
For example, take the first subject hit Apil_comp17195_c0_seq1. The sstart and send values are 824 and 1072, respectively. When I go to the all_transcriptome_seq.fasta, I get
AAGATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAAC
at that base pair range, not
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
as outputted by my script, which is what I'm expecting. You will also notice that the sequence outputted by my script is slightly shorter than it should be. Does anyone know how I can fix these issues in my script?
Thanks, and sorry for the lengthy post!
Edit 1: A solution was offered that works for some of the infiles. However, some cause the script to output fewer sequences than expected. Here is one such infile with 9 hits, from which I was expecting only 4 sequences.
Note: this issue has been largely resolved based on the solution provided below the answer section
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Apil_comp16418_c0_seq1 2079 1 1 1587 1 416 2002 0.0 2931 100.00 1587 1587
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Atri_comp13712_c0_seq1 1938 1 1 1587 1 1651 75 0.0 1221 80.73 1286 1593
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Ctom_01003023.1 2162 1 1 1406 1 1403 1 0.0 1430 85.07 1197 1407
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Apil_comp16418_c0_seq1 2079 1 1 1437 1 1866 430 0.0 1170 81.43 1175 1443
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Atri_comp13712_c0_seq1 1938 1 1 1441 1 201 1641 0.0 2662 100.00 1441 1441
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Acur_01000228.1 2415 1 1 1440 1 2231 797 0.0 1906 90.62 1305 1440
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Apil_comp16418_c0_seq1 2079 1 3 1284 1 1714 430 0.0 1351 85.69 1102 1286
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Acur_01000228.1 2415 1 1 1287 1 2084 797 0.0 1219 83.81 1082 1291
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Ctom_01003023.1 2162 1 1 1289 1 106 1394 0.0 2381 100.00 1289 1289
Edit 2: There is still an occasional output with fewer sequences than expected, although not as many since I incorporated the modifications suggested in response to Edit 1 (i.e., accounting for reverse direction). I cannot figure out why the script outputs fewer sequences in these other cases. Below is the infile in question. The output is lacking Btri_comp15171_c0_seq1:
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Apil_comp19456_c0_seq1 3549 1 1 2464 1 761 3224 0.0 4551 100.00 2464 2464
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Btri_comp15171_c0_seq1 3766 1 1 2456 1 3046 591 0.0 1877 80.53 1985 2465
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Apil_comp19456_c0_seq1 3549 1 1 2457 1 3214 758 0.0 1879 80.54 1986 2466
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Atri_comp28646_c0_seq1 1403 1 1256 2454 1 1401 203 0.0 990 81.60 980 1201
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Btri_comp15171_c0_seq1 3766 1 1 2457 1 593 3049 0.0 4538 100.00 2457 2457
You can use a hash to remove duplicates.
The code below removes duplicates based on subject length, keeping the row with the largest subject length for each subject.
Replace the "# iterate through the individual hits" part of your script with:
# iterate through the individual hits
my %filterhash;
my $subject_length;
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] -$templine[8];
if(exists $filterhash{$templine[2]} ){
if($filterhash{$templine[2]} < $subject_length){
$filterhash{$templine[2]}= $subject_length;
}
}
else{
$filterhash{$templine[2]}= $subject_length;
}
}
my %printhash;
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] -$templine[8];
# skip rows that are not the longest hit for this subject
next unless $filterhash{$templine[2]} == $subject_length;
# skip subjects that have already been written (this also covers ties)
next if $printhash{$templine[2]}++;
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
} # end for ($jter=0; $jter<$t1...
Hope this will help you.
Edit part update
For the negative strand you need to replace
$subject_length = $templine[9] -$templine[8];
with
if($templine[8] > $templine[9]){
$subject_length = $templine[8] -$templine[9];
}else{
$subject_length = $templine[9] -$templine[8];
}
You also need to update your extract_from_genome2 code for negative strand sequences.
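As a rough pure-Perl sketch of both points (one hit per subject, and strand-aware extraction), assuming the column order given in the question and that sstart/send are 1-based and inclusive, as BLAST reports them. The sub names `longest_hits` and `extract_region` are hypothetical, the ids are shortened made-up versions of the sample rows, and this does not reproduce whatever `extract_from_genome2` does internally.

```perl
use strict;
use warnings;

# Keep the longest hit per subject id, in order of first appearance.
# Fields: sseqid = 2, sstart = 8, send = 9 (zero-based).
sub longest_hits {
    my @lines = @_;
    my (%best, @order);
    for my $line (@lines) {
        my @f = split /\s+/, $line;
        my ($id, $sstart, $send) = @f[2, 8, 9];
        my $len = abs($send - $sstart) + 1;   # inclusive coordinates
        push @order, $id unless exists $best{$id};
        $best{$id} = [ $sstart, $send, $len ]
            if !exists $best{$id} || $len > $best{$id}[2];
    }
    return map { [ $_, @{ $best{$_} } ] } @order;
}

# Cut positions sstart..send out of $seq; reverse-complement if sstart > send.
sub extract_region {
    my ($seq, $sstart, $send) = @_;
    my ($lo, $hi) = $sstart <= $send ? ($sstart, $send) : ($send, $sstart);
    my $sub = substr $seq, $lo - 1, $hi - $lo + 1;
    if ($sstart > $send) {
        $sub = reverse $sub;
        $sub =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $sub;
}

my @hits = longest_hits(
    "q1 248 Atri 1067 1 2 248 1 344 96 8e-97 337 91.16 227 249",
    "q2 250 Atri 1067 1 2 250 1 344 96 1e-60 217 82.33 205 249",
    "q1 248 Acur 992 1 1 248 1 655 902 1e-133 459 100.00 248 248",
);
print join("\t", @$_), "\n" for @hits;
print extract_region("AACCGGTT", 5, 2), "\n";
```

The `- 1` in `substr` and the `+ 1` in the length are where off-by-one or off-by-two errors of the kind described in the question usually come from, so they are worth checking in the external extractor as well.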

Fastest way to index and query huge tab delimited file

I have a 30 GB tab-delimited text file of numbers, and I need the fastest way to index it and query it by the first and second columns. I've tried MongoDB, but loading the data into the database takes a huge amount of time; mongoimport from a JSON file is just as slow.
mongoimport --upsert --upsertFields A,B,S1,E1,S2,E2 -d DBName -c TableName data.json
Data file fragment:
504 246 91.92007 93 0 4657 5631 5911 0 39 1061 1162
813 469 92.14697 109 0 2057 2665 7252 1 363 961 1399
2388 987 92.20945 61 0 1183 1575 1824 0 66 560 5088
2388 2323 92.88472 129 0 75 1161 1824 1 2516 3592 12488
2729 1008 95.29058 47 0 435 1166 1193 1 76 654 1055
2757 76 94.25837 12 0 0 44 1946 0 51 68 247
2757 2089 92.63158 14 0 12 30 1946 0 14 30 211
What is the right efficient way to do it with minimum time? Any hints about the best database for it? Or about mongo upload speed optimisation?
Query examples:
objs = db.TableName.find({'A':2757})
objs = db.TableName.find({'B':76})
For each number in column A and B there are up to 1000 hits with the mean 20.
Databases often have to do complex work in order to be more robust.
If you use plain B-tree indexes, it is normally faster.
Below is an upload script in Perl.
#!/usr/bin/perl
use DB_File;
use Fcntl ;
# $DB_BTREE->{'cachesize'} = 1000000;
$DB_BTREE->{'flags'} = R_DUP ;
my (%h, %h1, %h2,$n);
my $x = tie %h, 'DB_File', "bf.db", O_RDWR|O_CREAT|O_TRUNC , 0640, $DB_BTREE;
my $x1= tie %h1, 'DB_File', "i1.db", O_RDWR|O_CREAT|O_TRUNC , 0640, $DB_BTREE;
my $x2= tie %h2, 'DB_File', "i2.db", O_RDWR|O_CREAT|O_TRUNC , 0640, $DB_BTREE;
while(<>){ chomp;
if(/(\d+)\s+(\d+)/){
$h{++$n}=$_; ## add the tup
$h1{$1} = $n; ## add to index1
$h2{$2} = $n; ## add to index2
}
}
untie %h;
untie %h1;
untie %h2;
and a query:
#!/usr/bin/perl
use DB_File;
use Fcntl ;
$DB_BTREE->{'flags'} = R_DUP ;
my (%h, %h1, %h2, $n, @list);
my $x = tie %h, 'DB_File', "bf.db", O_RDWR|O_CREAT , 0640, $DB_BTREE;
my $x1= tie %h1, 'DB_File', "i1.db", O_RDWR|O_CREAT , 0640, $DB_BTREE;
my $x2= tie %h2, 'DB_File', "i2.db", O_RDWR|O_CREAT , 0640, $DB_BTREE;
while(<>){ chomp; # Queries input format: A:number or B:number
if(/A:(\d+)/){
@list = sort { $a <=> $b } $x1->get_dup($1) ; # row numbers, in numeric order
for(@list){print $h{$_},"\n"; }
}
if(/B:(\d+)/){
@list = sort { $a <=> $b } $x2->get_dup($1) ; # row numbers, in numeric order
for(@list){print $h{$_},"\n"; }
}
}
Query is very fast.
But upload took 20s (user time) for 1 000 000 lines...
(please if you do experiments with your data, show us the times)
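To make the layout of the three files easier to see, here is an in-memory stand-in for the same idea: one store of rows plus two secondary indexes that map a column value to row numbers. This is only an illustration of the structure, using ordinary hashes instead of DB_File, so it would not fit a 30 GB file in RAM. The names `add_row`, `query`, `%by_a` and `%by_b` are hypothetical.

```perl
use strict;
use warnings;

# @rows holds the lines; %by_a and %by_b map a value in the first or
# second column to the list of row numbers that contain it.
my (@rows, %by_a, %by_b);

sub add_row {
    my ($line) = @_;
    my ($a_val, $b_val) = $line =~ /^(\d+)\s+(\d+)/ or return;
    push @rows, $line;
    push @{ $by_a{$a_val} }, $#rows;   # add to index on column A
    push @{ $by_b{$b_val} }, $#rows;   # add to index on column B
    return;
}

# Return every row whose indexed column equals $key.
sub query {
    my ($index, $key) = @_;
    return map { $rows[$_] } @{ $index->{$key} || [] };
}

add_row($_) for (
    "2757\t76\t94.25837",
    "2757\t2089\t92.63158",
    "2388\t987\t92.20945",
);
print "$_\n" for query(\%by_a, 2757);
```

In the DB_File version above, the `R_DUP` flag plays the role of the array in each index entry: it lets one key hold several row numbers.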

Combining duplicated lines in txt file with perl

I am trying to combine duplicate lines using Perl with little luck. My tab-delimited text file is structured as follows (spaces added for readability):
Pentamer Probability Observed Length
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
I would like to combine duplicated lines by adding up the three numeric columns; the line containing "ATGCA" would therefore look like this:
ATGCA 0.024 3 39
Any ideas/help/suggestions would be greatly appreciated! Thanks!
#!/usr/bin/perl
use warnings;
use strict;
my %hash;
while(<>) {
my @v = split(/\s+/);
if (defined $hash{$v[0]}) {
my $arr = $hash{$v[0]};
$hash{$v[0]} = [$v[0], $arr->[1] + $v[1],
$arr->[2] + $v[2], $arr->[3] + $v[3]];
} else {
$hash{$v[0]} = [@v];
}
}
foreach my $key (keys %hash) {
print join(" ", @{$hash{$key}}), "\n";
}
Here's another option:
use Modern::Perl;
my %hash;
while ( my $line = <DATA> ) {
my @vals = split /\s+/, $line;
$hash{ $vals[0] }->[$_] += $vals[ $_ + 1 ] for 0 .. 2;
}
say join "\t", $_, @{ $hash{$_} } for sort keys %hash;
__DATA__
ATGCA 0.008 1 16
TGTAC 0.021 1 16
GGCAT 0.008 1 16
CAGTG 0.004 1 16
ATGCA 0.016 2 23
TGTAC 0.007 1 23
Output:
ATGCA 0.024 3 39
CAGTG 0.004 1 16
GGCAT 0.008 1 16
TGTAC 0.028 2 39