Removing lines in one file that are present in another file - perl

I have 2 .vcf files with genomic data and I want to remove lines in the 1st file that are also present in the second file. The code I have so far seems to iterate only once, removing the first hit and then stopping the search. Any help would be much appreciated, since I cannot figure out where the problem is. Sorry for any mis-code...
I opted for arrays of arrays instead of hashes because the initial order of the file must be maintained, and I think that this is better achieved with arrays...
Code:
#!/usr/bin/perl
use strict;
use warnings;
## bring files to program
MAIN: {
my ($chrC, $posC, $junkC, $baserefC, $allrestC);
my (@ref_arrayH, @ref_arrayC);
my ($chrH, $posH, $baserefH);
my $entriesH;
# open the het file first
open (HET, $het) or die "Could not open file $het - $!";
while (<HET>) {
if (defined) {
chomp;
if (/^#/) { next; }
if ( /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
my @line_arrayH = split(/\t/, $_);
push (@ref_arrayH, \@line_arrayH); # the "reference" of each line_array is now in each element of @ref_array
$entriesH = scalar @ref_arrayH; # count the number of entries in the het file
}
}
}
close HET;
# print $entriesH,"\n";
open (CNS, $cns) or die "Could not open file $cns - $!";
foreach my $refH ( @ref_arrayH ) {
$chrH = $refH -> [0];
$posH = $refH -> [1];
$baserefH = $refH -> [3];
foreach my $line (<CNS>) {
chomp $line;
if ($line =~ /^#/) { next; }
if ($line =~ /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
($chrC, $posC, $junkC, $baserefC, $allrestC) = split(/\t/,$line);
if ( $chrC eq $chrH and $posC == $posH and $baserefC eq $baserefH ) { next }
else { print "$line\n"; }
}
}
}
# close CNS;
}
CNS file:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
HET file:
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
I would like the output to be like this
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
but is giving me this instead:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
Why is this nested loop not working properly? If I want to keep this array-of-arrays structure, why does it iterate only the first time?
Would it be better to change the foreach loop
foreach my $refH ( @ref_arrayH ) {
with a for loop
for (my $i = 0; $i <= $entriesH; $i++) {
?
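For the record, the immediate bug in the question's code: <CNS> is a filehandle, not a container, and the inner loop drains it to EOF during the first pass of the outer loop, so every later pass reads nothing. A minimal sketch of the effect (using an in-memory file so it runs standalone):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# A filehandle is an iterator: reading it to EOF leaves nothing behind.
my $text = "a\nb\n";
open my $fh, '<', \$text or die $!;

my @first  = <$fh>;   # drains the handle: ("a\n", "b\n")
my @second = <$fh>;   # already at EOF: empty list

printf "first pass: %d lines, second pass: %d lines\n",
    scalar @first, scalar @second;

# Fix: slurp the CNS lines into an array once (my @cns = <CNS>;) and loop
# over @cns in the inner loop, or seek(CNS, 0, 0) to rewind between passes.
```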

#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
open my $het, '<', 't.het' or die $!;
$seen{ $_ } = undef while <$het>;
close $het or die $!;
open my $cns, '<', 't.cns' or die $!;
while (my $line = <$cns>) {
next if exists $seen{ $line };
print $line;
}
close $cns or die $!;
If you want to match something other than entire lines, extract the field(s) and use it (or their combination) to key into the %seen hash.
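For instance, a VCF line could be keyed on its CHROM, POS and REF columns (a sketch; the tab-separated column layout matches the question's data, and the in-memory strings below stand in for the real t.het and t.cns files):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build a key from the CHROM, POS and REF columns (fields 0, 1 and 3).
sub site_key {
    my ($line) = @_;
    my ($chr, $pos, undef, $ref) = split /\t/, $line;
    return join "\t", $chr, $pos, $ref;
}

my $het_data = "chrI\t1087\t.\tC\tT\t3.55\t.\tDP=1\n";
my $cns_data = "chrI\t1086\t.\tA\t.\t33\t.\tDP=1\n"
             . "chrI\t1087\t.\tC\tT\t3.55\t.\tDP=1\n";

my %seen;
open my $het, '<', \$het_data or die $!;
$seen{ site_key($_) } = undef while <$het>;
close $het;

open my $cns, '<', \$cns_data or die $!;
while (my $line = <$cns>) {
    print $line unless exists $seen{ site_key($line) };   # prints the 1086 line only
}
close $cns;
```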
This will use memory in proportion to the number of lines in the HET file.
To reduce memory usage, you can tie %seen to a DBM_File on disk.

If you are concerned about memory usage, you can read both files one line at a time while doing the comparison. The following is an alternative approach.
Note: because of the way filehandles work, we have to reopen the file every time we are to read from it in the second nested loop.
#!/usr/bin/env perl
use strict;
use warnings;
open my $cns, '<', 't.cns' or die $!;
CNS:
while (my $line = <$cns>) { #read one line at a time from t.cns file.
open my $het, '<', 't.het' or die $!;
while (my $reference = <$het>){
if ($line eq $reference) { # test whether the current t.cns line equals any line in the t.het file.
close $het; # reset connection to the t.het file.
next CNS; # skip the current t.cns line.
}
}
print $line;
close $het; # reset connection to the t.het file.
}
close $cns or die $!;

Related

Why is the output the way it is? -Splitting and chop

I'm having trouble understanding the output of the below code.
1. Why is the output Jo Al Ch and Sa? Doesn't chop remove the last character of a string and return that character, so shouldn't the output be n i n and y?
2. What is the purpose of the $firstline=0; line in the code?
3. What exactly is happening at the lines
foreach(@data)
($name,$age)=split(//,$_);
print "$name $age \n";
The output of the following code is
Data in file is:
J o
A l
C h
S a
The file contents are:
NAME AGE
John 26
Ali 21
Chen 22
Sally 25
The code:
#!/usr/bin/perl
my ($firstline,
@data,
$data);
open (INFILE,"heading.txt") or die $.;
while (<INFILE>)
{
if ($firstline)
{
$firstline=0;
}
else
{
chop(@data=<INFILE>);
}
print "Data in file is: \n";
foreach (@data)
{
($name,$age)=split(//,$_);
print "$name $age\n";
}
}
There are a few issues with this script, but first I will answer your points.
chop removes the last character of a string and returns the character chopped. In your data file "heading.txt" every line most likely ends with \n, so chop is removing that \n. It is generally recommended to use chomp instead.
You can verify what the last character of each line is by running the command below:
od -bc heading.txt
0000000 116 101 115 105 040 101 107 105 012 112 157 150 156 040 062 066
N A M E A G E \n J o h n 2 6
0000020 012 101 154 151 040 062 061 012 103 150 145 156 040 062 062 012
\n A l i 2 1 \n C h e n 2 2 \n
0000040 123 141 154 154 171 040 062 065 012
S a l l y 2 5 \n
0000051
You can see \n
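A quick demonstration of the difference: chomp only strips a trailing newline, while chop strips the last character unconditionally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $with_nl = "John 26\n";
chomp $with_nl;          # removes the trailing "\n" only
print "[$with_nl]\n";    # [John 26]

my $without_nl = "John 26";
chop $without_nl;        # removes the last character, newline or not
print "[$without_nl]\n"; # [John 2]
```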
There is no use for $firstline, because it is never set to 1. So you can remove the if/else block.
In the first line it reads the elements of array @data one by one. In the second line it splits the contents of each element into characters, capturing the first two, assigning them to the $name and $age variables, and discarding the rest. In the last line we print those captured characters.
IMO, in line 2 we should split on spaces to actually capture the name and age.
So the final script should look like:
#!/usr/bin/perl
use strict;
use warnings;
my @data;
open (INFILE,"heading.txt") or die "Can't open heading.txt: $!";
while (<INFILE>) {
chomp(@data = <INFILE>);
}
close(INFILE);
print "Data in file is: \n";
foreach (@data) {
my ($name,$age)=split(/ /,$_);
print "$name $age\n";
}
Output:
Data in file is:
John 26
Ali 21
Chen 22
Sally 25

Print and use the data from two files simultaneously

pdb1.pdb
ATOM 709 CA THR 25 -29.789 33.001 72.164 1.00 0.00
ATOM 711 CB THR 25 -29.013 31.703 72.370 1.00 0.00
ATOM 734 CG THR 25 -29.838 30.458 72.573 1.00 0.00
ATOM 768 CE THR 25 -28.541 28.330 71.361 1.00 0.00
pdb2.pdb
ATOM 765 N ALA 25 -30.838 33.150 73.195 1.00 0.00
ATOM 764 N LEU 26 -29.457 33.193 69.767 1.00 0.00
ATOM 783 N VAL 27 -30.286 31.938 66.438 1.00 0.00
ATOM 798 N GLY 28 -28.076 30.044 64.519 1.00 0.00
output desired
709 CA 765 N 1.477 -29.789 33.001 72.164 -30.838 33.150 73.195
709 CA 764 N 2.427 -29.789 33.001 72.164 -29.457 33.193 69.767
709 CA 783 N 5.844 -29.789 33.001 72.164 -30.286 31.938 66.438
and so on.
From pdb1.pdb and pdb2.pdb I want to read the values in columns 2, 3, 6, 7 and 8, and then use columns 6, 7 and 8 to do distance calculations.
I tried this, but the output is not getting printed.
Perl
open( f1, "pdb1.pdb" or die $! );
open( f2, "pdb2.pdb" or die $! );
while ( ( $line1 = <$f1> ) and ( $line2 = <$f2> ) ) {
@splitted = split( ' ', $line1 );
my #fields = split / /, $line1;
print $fields[1], "\n";
my $atom1 = @{ [ $line1 =~ m/\S+/g ] }[2];
my $no1 = @{ [ $line1 =~ m/\w+/g ] }[3];
my $x1 = @{ [ $line1 =~ m/\w+/g ] }[6];
my $y1 = @{ [ $line1 =~ m/\w+/g ] }[7];
my $z1 = @{ [ $line1 =~ m/\w+/g ] }[8];
my $atom2 = @{ [ $line2 =~ m/\w+/g ] }[2];
my $no2 = @{ [ $line2 =~ m/\w+/g ] }[3];
my $x2 = @{ [ $line2 =~ m/\w+/g ] }[6];
my $y2 = @{ [ $line2 =~ m/\w+/g ] }[7];
my $z2 = @{ [ $line2 =~ m/\w+/g ] }[8];
print $atom1;
for ( $f1, $f2 ) {
print $atom1 $no1 $x1 $y1 $z1 $atom2 $no2 $x2 $y2 $z2 "\n";
}
}
close( $f1 );
close( $f2 );
It's probably simplest to read both files into memory unless they're enormous.
This solution calls subroutine read_file to build an array of hashes of all five fields of interest from each file. It then calculates the delta and reformats the data for output.
use strict;
use warnings 'all';
my $f1 = read_file('file1.txt');
my $f2 = read_file('file2.txt');
for my $r1 ( @$f1 ) {
for my $r2 ( @$f2 ) {
my ($dx, $dy, $dz) = map { $r1->{$_} - $r2->{$_} } qw/ x y z /;
my $delta = sqrt( $dx * $dx + $dy * $dy + $dz * $dz );
my @rec = (
@{$r1}{qw/ id name /},
@{$r2}{qw/ id name /},
sprintf('%5.3f', $delta),
@{$r1}{qw/ x y z /},
@{$r2}{qw/ x y z /},
);
print "@rec\n";
}
}
sub read_file {
my ($file_name) = @_;
open my $fh, '<', $file_name or die qq{Unable to open "$file_name" for input: $!};
my @records;
while ( <$fh> ) {
next unless /\S/;
my %record;
@record{qw/ id name x y z /} = (split)[1,2,5,6,7];
push @records, \%record;
}
\@records;
}
output
709 CA 765 N 1.478 -29.789 33.001 72.164 -30.838 33.150 73.195
709 CA 764 N 2.427 -29.789 33.001 72.164 -29.457 33.193 69.767
709 CA 783 N 5.845 -29.789 33.001 72.164 -30.286 31.938 66.438
709 CA 798 N 8.374 -29.789 33.001 72.164 -28.076 30.044 64.519
711 CB 765 N 2.471 -29.013 31.703 72.370 -30.838 33.150 73.195
711 CB 764 N 3.032 -29.013 31.703 72.370 -29.457 33.193 69.767
711 CB 783 N 6.072 -29.013 31.703 72.370 -30.286 31.938 66.438
711 CB 798 N 8.079 -29.013 31.703 72.370 -28.076 30.044 64.519
734 CG 765 N 2.938 -29.838 30.458 72.573 -30.838 33.150 73.195
734 CG 764 N 3.937 -29.838 30.458 72.573 -29.457 33.193 69.767
734 CG 783 N 6.327 -29.838 30.458 72.573 -30.286 31.938 66.438
734 CG 798 N 8.255 -29.838 30.458 72.573 -28.076 30.044 64.519
768 CE 765 N 5.646 -28.541 28.330 71.361 -30.838 33.150 73.195
768 CE 764 N 5.199 -28.541 28.330 71.361 -29.457 33.193 69.767
768 CE 783 N 6.348 -28.541 28.330 71.361 -30.286 31.938 66.438
768 CE 798 N 7.069 -28.541 28.330 71.361 -28.076 30.044 64.519
Your code has a lot of syntax errors. I have made some changes to your code, and this will get you started toward what you want.
First of all, use strict and use warnings; that way a lot of the noise would already have been flagged.
use strict;
use warnings;
open(my $f1, "pdb1.pdb") or die $!;
open(my $f2, "pdb2.pdb") or die $!;
while(defined(my $line1 = <$f1>) and defined(my $line2 = <$f2>))
{
# print "Iam here";
my @splitted = split(' ',$line1);
my #fields = split / /, $line1;
#print $fields[1], "\n";
my $atom1 = @{[$line1 =~ m/\S+/g]}[2];
my $no1 = @{[$line1 =~ m/\w+/g]}[3];
my $x1 = @{[$line1 =~ m/\w+/g]}[6];
my $y1 = @{[$line1 =~ m/\w+/g]}[7];
my $z1 = @{[$line1 =~ m/\w+/g]}[8];
my $atom2 = @{[$line2 =~ m/\w+/g]}[2];
my $no2 = @{[$line2 =~ m/\w+/g]}[3];
my $x2 = @{[$line2 =~ m/\w+/g]}[6];
my $y2 = @{[$line2 =~ m/\w+/g]}[7];
my $z2 = @{[$line2 =~ m/\w+/g]}[8];
#print $atom1;
for ($f1, $f2) {
print "$atom1 $no1 $x1 $y1 $z1 $atom2 $no2 $x2 $y2 $z2 \n";
}
}
close ($f1);
close ($f2);
Now coming to your question: your expected output seems to be different from what your logic does. You are looping over two files simultaneously, which gives a one-to-one iteration rather than pairing each line from file1 with all lines in file2. So I think you need to look at the looping part.
And the next thing you need to know about is column splitting.
@splitted = split(' ',$line1);
If you split a line in the above-mentioned way, you get all the columns in the array. So now column 1 is at the zeroth index, column 2 at the first index, and so on.
So to get the first column you should do
my $col1 = $splitted[0];
If you are using those regexes just for getting columns, they are not needed, as you have already split the line and have each column independently in the array.
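For example, using the first line of pdb1.pdb (note that split ' ' treats any run of whitespace as one separator):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line1 = "ATOM  709  CA  THR 25  -29.789 33.001 72.164 1.00 0.00";
my @splitted = split ' ', $line1;

# Column N of the file is at index N-1 of the array.
print "$splitted[1] $splitted[2]\n";                 # 709 CA
print "$splitted[5] $splitted[6] $splitted[7]\n";    # -29.789 33.001 72.164
```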
Update:
The problem you were getting was that you were iterating over the filehandles directly, which get exhausted after the first pass.
use strict;
use warnings;
open(my $f1, "<pdb1.pdb") or die "$!" ;
open(my $f2, "<pdb2.pdb") or die "$!" ;
my #in1 = <$f1>;
my #in2 = <$f2>;
foreach my $file1 (@in1) { # use array to iterate
chomp($file1);
#print "File1 $file1\n";
my $atomno1=(split " ", $file1)[1];
my $atomname1=(split " ", $file1)[2];
my $xx=(split " ", $file1)[5];
my $yy=(split " ", $file1)[6];
foreach my $file2 (@in2) {
chomp($file2);
#print "File2 $file2\n";
my $atomno2=(split " ", $file2)[1];
my $atomname2=(split " ", $file2)[2];
my $x=(split " ", $file2)[5];
my $y=(split " ", $file2)[6];
my $dis=sqrt((($x-$xx)*($x-$xx))+ (($y-$yy)*($y-$yy)));
print "$atomno1 $atomname1 $atomno2 $atomname2 $dis $xx $yy $x $y\n" ;
}
#$file1++;
}
close ($f1);

Perl: read an array and calculate corresponding percentile

I am trying to write Perl code that reads a text file with a series of numbers, calculates percentiles, and prints out the numbers that correspond to them. I do not have access to the usual statistical modules, so I'd like to stick with pure Perl. Thanks in advance!
The input text file looks like:
197
98
251
82
51
272
154
167
38
280
157
212
188
88
40
229
228
125
292
235
67
70
127
26
279
.... (and so on)
The code I have is:
#!/usr/bin/perl
use strict;
use warnings;
my @data;
open (my $fh, "<", "testing2.txt")
or die "Cannot open: $!\n";
while (<$fh>){
push @data, $_;
}
close $fh;
my %count;
foreach my $datum (@data) {
++$count{$datum};
}
my %percentile;
my $total = 0;
foreach my $datum (sort { $a <=> $b } keys %count) {
$total += $count{$datum};
$percentile{$datum} = $total / @data;
# percentile subject to change
if ($percentile{$datum} <= 0.10) {
print "$datum : $percentile{$datum}\n\n";
}
}
My desired output:
2 : 0.01
3 : 0.01333
4 : 0.01666
6 : 0.02
8 : 0.03
10 : 0.037
12 : 0.04
14 : 0.05
15 : 0.05333
16 : 0.06
18 : 0.06333
21 : 0.07333
22 : 0.08
25 : 0.09
26 : 0.09666
Where the format is #number from the list : #corresponding percentile
To store the numbers without a newline in @data, just add chomp; before pushing them, or chomp @data; after you've read them all.
If your input file has MSWin-style newlines, convert it to *nix style using dos2unix or fromdos.
Also, try to learn how to indent your code; it boosts readability. And consider renaming $total to $running_total, as you use the value as it changes.
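Putting those suggestions together, the core logic might look like this (a sketch; @data is hard-coded with the first ten numbers from the question in place of reading and chomping testing2.txt):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return "value : percentile" lines for values at or below $cutoff.
sub low_percentiles {
    my ($cutoff, @data) = @_;
    my %count;
    ++$count{$_} for @data;
    my $running_total = 0;   # renamed from $total: it changes as we go
    my @out;
    for my $datum (sort { $a <=> $b } keys %count) {
        $running_total += $count{$datum};
        my $percentile = $running_total / @data;
        push @out, "$datum : $percentile" if $percentile <= $cutoff;
    }
    return @out;
}

my @data = (197, 98, 251, 82, 51, 272, 154, 167, 38, 280);
print "$_\n" for low_percentiles(0.10, @data);   # prints "38 : 0.1"
```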

How can I match values in one file to ranges from another?

There are two input files, as the following lines show.
Columns 3 and 4 in input1 hold a range (such as 1 to 78 in the first row)
Column 2 in input2 holds a single position value (such as 32 in the first row) which falls within one of the ranges in columns 3 and 4 of input1; the corresponding value in column 2 of input1 is, in this case, B100002.
I want to generate a file that contains the position, relative to the start of the range, for every value in column 2 of file input1.
For example, 358 - 344 + 1 = 15 is the relative position value for B100043.
input1:
Scaffold_1 B100002 1 78
Scaffold_1 B100041 179 243
Scaffold_1 B100043 344 418
Scaffold_1 B100045 519 583
Scaffold_1 B100058 684 751
Scaffold_1 B100059 852 915
Scaffold_1 B100066 1016 1079
Scaffold_1 B100080 1180 1246
Scaffold_1 B100111 1347 1413
Scaffold_1 B100118 1514 1585
Scaffold_2 B123465 31531 31595
input2:
Scaffold_1 32
Scaffold_1 358
Scaffold_2 31533
Required output:
B100002 32
B100043 15
B123465 2
This is my current workaround:
Change the format from input1 to input_1 and input2 to input_2 (tab-separated).
Use the bedtools software and awk to generate the output file that I want.
input_1:
Scaffold_1 . B100002 1 78 . . . .
Scaffold_1 . B100041 179 243 . . . .
Scaffold_1 . B100043 344 418 . . . .
Scaffold_1 . B100045 519 583 . . . .
Scaffold_1 . B100058 684 751 . . . .
Scaffold_1 . B100059 852 915 . . . .
Scaffold_1 . B100066 1016 1079 . . . .
Scaffold_1 . B100080 1180 1246 . . . .
Scaffold_1 . B100111 1347 1413 . . . .
Scaffold_1 . B100118 1514 1585 . . . .
Scaffold_1 . B101068 9218 9290 . . . .
Scaffold_2 . B123465 31531 31595 . . . .
input_2:
Scaffold_1 . . 31 33 . . . .
Scaffold_1 . . 357 359 . . . .
Scaffold_2 . . 31532 31534 . . . .
bedtools intersect -wb -a test2 -b test1 | awk '{print $12,($5-$13)}'
B100002 32
B100043 15
B123465 3
How can I use awk or perl to achieve this directly? (I have to change the file format when I use bedtools.)
If the data files are not huge, there is a simpler way:
$ join input1 input2 | awk '$5<$4 && $3<$5 {print $2, $5-$3+1}'
B100002 32
B100043 15
B123465 3
This Perl code seems to solve your problem
It is a common idiom to load the entire dictionary in input1.txt into an in-memory data structure -- here %data, which is indexed by the scaffold ID -- and then process the query data against that dictionary.
Assuming your input1 isn't enormous this should work fine. It's impossible to key data structures on a range, so every candidate range must be checked to see if the index falls above the start and below the end
If there is a match then the ID is printed together with the result of the arithmetic to calculate the one-based relative index
Note that your required result for the entry Scaffold_2 31533 should be 3 and not 2
use strict;
use warnings 'all';
use autodie;
use Data::Dump;
my %data;
{
open my $fh, '<', 'input1.txt';
while ( <$fh> ) {
next unless /\S/;
my ($scaff, $code, $start, $end) = split;
push @{ $data{$scaff} }, { start => $start, end => $end, code => $code };
}
}
open my $fh, '<', 'input2.txt';
while ( <$fh> ) {
my ($scaff, $index) = split;
my $items = $data{$scaff} or die qq{No such scaffold "$scaff"};
for my $item ( @$items ) {
next unless $index >= $item->{start} and $index <= $item->{end};
printf "%s\t%d\n",
$item->{code},
$index - $item->{start} + 1;
last
}
}
output
B100002 32
B100043 15
B123465 3

Fastest way to index and query huge tab delimited file

I have a 30 GB tab-delimited text file with numbers, and I need the fastest way to index it and to query it by the first and second columns. I've tried MongoDB, but it takes a huge amount of time to upload the data; I've tried mongoimport via a JSON file, but that also takes a huge amount of time.
mongoimport --upsert --upsertFields A,B,S1,E1,S2,E2 -d DBName -c
TableName data.json
Data file fragment:
504 246 91.92007 93 0 4657 5631 5911 0 39 1061 1162
813 469 92.14697 109 0 2057 2665 7252 1 363 961 1399
2388 987 92.20945 61 0 1183 1575 1824 0 66 560 5088
2388 2323 92.88472 129 0 75 1161 1824 1 2516 3592 12488
2729 1008 95.29058 47 0 435 1166 1193 1 76 654 1055
2757 76 94.25837 12 0 0 44 1946 0 51 68 247
2757 2089 92.63158 14 0 12 30 1946 0 14 30 211
What is the most efficient way to do this with minimum time? Any hints about the best database for it, or about Mongo upload speed optimisation?
Query examples:
objs = db.TableName.find({'A':2757})
objs = db.TableName.find({'B':76})
For each number in columns A and B there are up to 1000 hits, with a mean of 20.
Databases often have complex work to do in order to be more robust.
If you use straight B-tree indexes, it is normally faster.
Below you'll find an upload script in Perl.
#!/usr/bin/perl
use DB_File;
use Fcntl ;
# $DB_BTREE->{'cachesize'} = 1000000;
$DB_BTREE->{'flags'} = R_DUP ;
my (%h, %h1, %h2,$n);
my $x = tie %h, 'DB_File', "bf.db", O_RDWR|O_CREAT|O_TRUNC , 0640, $DB_BTREE;
my $x1= tie %h1, 'DB_File', "i1.db", O_RDWR|O_CREAT|O_TRUNC , 0640, $DB_BTREE;
my $x2= tie %h2, 'DB_File', "i2.db", O_RDWR|O_CREAT|O_TRUNC , 0640, $DB_BTREE;
while(<>){ chomp;
if(/(\d+)\s+(\d+)/){
$h{++$n}=$_; ## add the tup
$h1{$1} = $n; ## add to index1
$h2{$2} = $n; ## add to index2
}
}
untie %h;
untie %h1;
untie %h2;
and a query:
#!/usr/bin/perl
use DB_File;
use Fcntl ;
$DB_BTREE->{'flags'} = R_DUP ;
my (%h, %h1, %h2, $n, @list);
my $x = tie %h, 'DB_File', "bf.db", O_RDWR|O_CREAT , 0640, $DB_BTREE;
my $x1= tie %h1, 'DB_File', "i1.db", O_RDWR|O_CREAT , 0640, $DB_BTREE;
my $x2= tie %h2, 'DB_File', "i2.db", O_RDWR|O_CREAT , 0640, $DB_BTREE;
while(<>){ chomp; # Queries input format: A:number or B:number
if(/A:(\d+)/){
@list = sort $x1->get_dup($1) ;
for(@list){print $h{$_},"\n"; }
}
if(/B:(\d+)/){
@list = sort $x2->get_dup($1) ;
for(@list){print $h{$_},"\n"; }
}
}
Querying is very fast.
But the upload took 20 s (user time) for 1,000,000 lines...
(If you do experiments with your data, please show us the times.)