VLOOKUP like 1 liner using awk - text-processing

Tons of threads regarding using awk as a VLOOKUP, and yet none seem to work when I try them out.
I have 2 files:
#BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ head GAGA_all_merged_k125_VS_Danio.blastp_results
Sequence name Hit desc. E-Value Similarity
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223 gnl|BL_ORD_ID|19336gi|50540432|ref|NP_001002682.1| calsequestrin-2 precursor [Danio rerio] 0.0 89
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240 gnl|BL_ORD_ID|42660gi|688610863|ref|XP_009294955.1| PREDICTED: band 4.1-like protein 1 isoform X1 [Danio rerio] 0.0 97
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901 gnl|BL_ORD_ID|39369gi|59858543|ref|NP_001012312.1| gelsolin [Danio rerio] 0.0 92
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023 gnl|BL_ORD_ID|30731gi|528504026|ref|XP_001345885.4| PREDICTED: protein Jumonji [Danio rerio] 0.0 91
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005 gnl|BL_ORD_ID|28851gi|688587725|ref|XP_009289915.1| PREDICTED: phosphatidylinositol binding clathrin assembly protein b isoform X6 [Danio rerio] 0.0 98
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179 gnl|BL_ORD_ID|45364gi|52219062|ref|NP_001004604.1| BCSC-1 [Danio rerio] 0.0 86
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266 gnl|BL_ORD_ID|10854gi|528479736|ref|XP_005165325.1| PREDICTED: cathepsin L1 isoform X1 [Danio rerio] 0.0 97
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912 gnl|BL_ORD_ID|39467gi|116004513|ref|NP_001070618.1| 3-oxoacid CoA transferase 1b [Danio rerio] 0.0 97
Locus_11_Transcript_7/7_Confidence_0.647_Length_1989 gnl|BL_ORD_ID|6732gi|528475412|ref|XP_005164328.1| PREDICTED: cerebellar degeneration-related protein 2-like isoform X2 [Danio rerio] 0.0 96
#BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ head GAGA_all_merged_k125.LocusList
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240
Locus_3_Transcript_1/1_Confidence_1.000_Length_417
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912
Notice how the second file counts all Loci from 1 onwards, while the first file skips a few (3 and 7).
What I need is the output of file 2 plus a column (let's say column 2) from file 1 whenever the Locus is present in file 1. If a Locus isn't present in file 1, I want to see NA.
So far this is the closest I've got, but it doesn't show the columns from file 1:
#BioPower3-IBM ~/Goldfish/Assemblies/HighLength/blastx $ awk 'FNR == NR {keys[$1]; next} {if ($1 in keys) {print $1, $2} else {print $1, "NA"} }' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList | head
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240
Locus_3_Transcript_1/1_Confidence_1.000_Length_417 NA
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222 NA
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912
Notice that 3 and 7 have the needed NA; however, how do I make the others display what's in file 1? Thanks, Adrian

You were nearly there. What are the problems? You do:
FNR == NR {keys[$1]; next}
which creates the key but saves no value in the associative array. Replace it with:
FNR == NR {keys[$1] = $1; next}
And when printing, $2 does not exist:
if ($1 in keys) {print $1, $2}
Instead, print the content that was saved in the associative array earlier:
if ($1 in keys) {print $1, keys[$1]}
So it becomes:
awk '
FNR == NR {keys[$1] = $1; next}
{ if ($1 in keys) { print $1, keys[$1] }
else {print $1, "NA"}
}
' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList
UPDATE based on comments: it's similar to the previous one. Just remove the first field and then save the whole remaining line in the array.
awk '
FNR == NR {f1 = $1; $1 = ""; keys[f1] = $0; next}
{ if ($1 in keys) { print $1, keys[$1] }
else {print $1, "NA"}
}
' GAGA_all_merged_k125_VS_Danio.blastp_results GAGA_all_merged_k125.LocusList
It yields:
Locus_1_Transcript_1/1_Confidence_1.000_Length_2223 gnl|BL_ORD_ID|19336gi|50540432|ref|NP_001002682.1| calsequestrin-2 precursor [Danio rerio] 0.0 89
Locus_2_Transcript_11/19_Confidence_0.580_Length_7240 gnl|BL_ORD_ID|42660gi|688610863|ref|XP_009294955.1| PREDICTED: band 4.1-like protein 1 isoform X1 [Danio rerio] 0.0 97
Locus_3_Transcript_1/1_Confidence_1.000_Length_417 NA
Locus_4_Transcript_46/49_Confidence_0.453_Length_5901 gnl|BL_ORD_ID|39369gi|59858543|ref|NP_001012312.1| gelsolin [Danio rerio] 0.0 92
Locus_5_Transcript_115/115_Confidence_0.452_Length_8023 gnl|BL_ORD_ID|30731gi|528504026|ref|XP_001345885.4| PREDICTED: protein Jumonji [Danio rerio] 0.0 91
Locus_6_Transcript_18/27_Confidence_0.299_Length_3005 gnl|BL_ORD_ID|28851gi|688587725|ref|XP_009289915.1| PREDICTED: phosphatidylinositol binding clathrin assembly protein b isoform X6 [Danio rerio] 0.0 98
Locus_7_Transcript_2/7_Confidence_0.611_Length_2222 NA
Locus_8_Transcript_198/200_Confidence_0.159_Length_4179 gnl|BL_ORD_ID|45364gi|52219062|ref|NP_001004604.1| BCSC-1 [Danio rerio] 0.0 86
Locus_9_Transcript_1/6_Confidence_0.600_Length_1266 gnl|BL_ORD_ID|10854gi|528479736|ref|XP_005165325.1| PREDICTED: cathepsin L1 isoform X1 [Danio rerio] 0.0 97
Locus_10_Transcript_2635/2635_Confidence_0.015_Length_11912 gnl|BL_ORD_ID|39467gi|116004513|ref|NP_001070618.1| 3-oxoacid CoA transferase 1b [Danio rerio] 0.0 97
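The same left-join-with-NA pattern can be checked on a toy pair of files; lookup.txt and keys.txt are made-up names, and a sub() is added here to strip the leading blank that clearing $1 leaves in $0 (a small improvement over the version above):

```shell
# file1: key plus payload columns; file2: the full key list
printf 'a\tfoo 1\nb\tbar 2\n' > lookup.txt
printf 'a\nb\nc\n' > keys.txt

awk '
FNR == NR { f1 = $1; $1 = ""; sub(/^ /, ""); keys[f1] = $0; next }
{ if ($1 in keys) print $1, keys[$1]; else print $1, "NA" }
' lookup.txt keys.txt
```

This prints `a foo 1`, `b bar 2`, then `c NA` for the key missing from lookup.txt.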

Related

Why is the output the way it is? -Splitting and chop

I'm having trouble understanding the output of the code below.
1. Why is the output Jo Al Ch and Sa? Doesn't chop remove the last character of a string and return that character, so shouldn't the output be n i n and y?
2. What is the purpose of the $firstline=0; line in the code?
3. What exactly is happening at the lines
foreach (@data)
{
($name,$age)=split(//,$_);
print "$name $age \n";
}
The output of the following code is
Data in file is:
J o
A l
C h
S a
The file contents are:
NAME AGE
John 26
Ali 21
Chen 22
Sally 25
The code:
#!/usr/bin/perl
my ($firstline,
@data,
$data);
open (INFILE,"heading.txt") or die $.;
while (<INFILE>)
{
if ($firstline)
{
$firstline=0;
}
else
{
chop(@data=<INFILE>);
}
print "Data in file is: \n";
foreach (@data)
{
($name,$age)=split(//,$_);
print "$name $age\n";
}
}
There are a few issues with this script, but first I will answer your points.
chop will remove the last character of a string and return the character chopped. In your data file "heading.txt" every line ends with \n, so chop is removing the \n. It is generally recommended to use chomp instead, since chomp only removes a trailing newline.
You can verify what is the last character of the line by running the command below:
od -bc heading.txt
0000000 116 101 115 105 040 101 107 105 012 112 157 150 156 040 062 066
N A M E A G E \n J o h n 2 6
0000020 012 101 154 151 040 062 061 012 103 150 145 156 040 062 062 012
\n A l i 2 1 \n C h e n 2 2 \n
0000040 123 141 154 154 171 040 062 065 012
S a l l y 2 5 \n
0000051
You can see \n
There is no use for $firstline because it is never set to 1, so you can remove the if/else block.
In the first line it is reading the elements of array @data one by one. In the second line it is splitting the contents of each element into characters, capturing the first two characters into the $name and $age variables, and discarding the rest. In the last line we print those captured characters.
IMO, in that second line we should split on a space to actually capture the name and age.
So the final script should look like:
#!/usr/bin/perl
use strict;
use warnings;
my @data;
open (INFILE,"heading.txt") or die "Can't open heading.txt: $!";
while (<INFILE>) {
chomp(@data = <INFILE>);
}
close(INFILE);
print "Data in file is: \n";
foreach (@data) {
my ($name,$age)=split(/ /,$_);
print "$name $age\n";
}
Output:
Data in file is:
John 26
Ali 21
Chen 22
Sally 25

Modifying perl script to not print duplicates and extract sequences of a certain length

I want to first apologize for the biological nature of this post. I thought I should post some background first. I have a set of gene files that contain anywhere from one to five DNA sequences from different species. I used a bash shell script to perform blastn with each gene file as a query and a file of all transcriptome sequences (all_transcriptome_seq.fasta) from the five species as the subject. I now want to process these output files (and there are many) so that I can get all subject sequences that hit into one file per gene, with duplicate sequences removed (except to keep one), and ensure I'm getting the length of the sequences that actually hit the query.
Here is what the blastn output looks like for one gene file (columns: qseqid qlen sseqid slen qframe qstart qend sframe sstart send evalue bitscore pident nident length)
Acur_01000750.1_OFAS014956-RA-EXON04 248 Apil_comp17195_c0_seq1 1184 1 1 248 1 824 1072 2e-73 259 85.60 214 250
Acur_01000750.1_OFAS014956-RA-EXON04 248 Atri_comp5613_c0_seq1 1067 1 2 248 1 344 96 8e-97 337 91.16 227 249
Acur_01000750.1_OFAS014956-RA-EXON04 248 Acur_01000750.1 992 1 1 248 1 655 902 1e-133 459 100.00 248 248
Acur_01000750.1_OFAS014956-RA-EXON04 248 Btri_comp17734_c0_seq1 1001 1 1 248 1 656 905 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Atri_comp5613_c0_seq1 1067 1 2 250 1 344 96 1e-60 217 82.33 205 249
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Acur_01000750.1 992 1 1 250 1 655 902 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Btri_comp17734_c0_seq1 1001 1 1 250 1 656 905 1e-134 462 100.00 250 250
I've been working on a perl script that would, in short, take the sseqid column to pull out the corresponding sequences from the all_transcriptome_seq.fasta file, place these into a new file, and trim the transcripts to the sstart and send positions. Here is the script, so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
############################################################################
# blastn_post-processing.pl v. 1.0 by Michael F., XXXXXX
############################################################################
my($progname) = $0;
############################################################################
# Initialize variables
############################################################################
my($jter);
my($com);
my($t1);
if ( @ARGV != 2 ) {
print "Usage:\n \$ $progname <infile> <transcriptomes>\n";
print " infile = tab-delimited blastn text file\n";
print " transcriptomes = fasta file of all transcriptomes\n";
print "exiting...\n";
exit;
}
my($infile)=$ARGV[0];
my($transcriptomes)=$ARGV[1];
############################################################################
# Read the input file
############################################################################
print "Reading the input file... ";
open (my $INF, $infile) or die "Unable to open file";
my @data = <$INF>;
print @data;
close($INF) or die "Could not close file $infile.\n";
my($nlines) = $#data + 1;
my($inlines) = $nlines - 1;
print "$nlines blastn hits read\n\n";
############################################################################
# Extract hits and place sequences into new file
############################################################################
my @temparray;
my @templine;
my($seqfname);
open ($INF, $infile) or die "Could not open file $infile for input.\n";
@temparray = <$INF>;
close($INF) or die "Could not close file $infile.\n";
$t1 = $#temparray + 1;
print "$infile\t$t1\n";
$seqfname = "$infile" . ".fasta";
if ( -e $seqfname ) {
print " --> $seqfname exists. overwriting\n";
unlink($seqfname);
}
# iterate through the individual hits
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
} # end for ($jter=0; $jter<$t1...
# Arguments for "extract_from_genome2"
# // argv[1] = name of genome file
# // argv[2] = gi number for contig
# // argv[3] = start of subsequence
# // argv[4] = end of subsequence
# // argv[5] = name of output sequence
Using this script, here is the output I'm getting:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
As you can see, it's pretty close to what I'm wanting. Here are the two issues I have and cannot seem to figure out how to resolve with my script. The first is that a sequence may occur more than once in the sseqid column, and with the script in its current form, it will print out duplicates of these sequences. I only need one. How can I modify my script to not duplicate sequences (i.e., how do I only retain one but remove the other duplicates)? Expected output:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
The second issue is that the script is not quite extracting the right base pairs. It's super close, off by one or two, but it's not exact.
For example, take the first subject hit Apil_comp17195_c0_seq1. The sstart and send values are 824 and 1072, respectively. When I go to the all_transcriptome_seq.fasta, I get
AAGATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAAC
at that base pair range, not
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
as outputted by my script, which is what I'm expecting. You will also notice that the sequence outputted by my script is slightly shorter than it should be. Does anyone know how I can fix these issues in my script?
Thanks, and sorry for the lengthy post!
Edit 1: a solution was offered that works for some of the infiles. However, for some it caused the script to output fewer sequences than expected. Here is one such infile with 9 hits, from which I was expecting only 4 sequences.
Note: this issue has been largely resolved based on the solution provided below in the answer section.
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Apil_comp16418_c0_seq1 2079 1 1 1587 1 416 2002 0.0 2931 100.00 1587 1587
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Atri_comp13712_c0_seq1 1938 1 1 1587 1 1651 75 0.0 1221 80.73 1286 1593
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Ctom_01003023.1 2162 1 1 1406 1 1403 1 0.0 1430 85.07 1197 1407
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Apil_comp16418_c0_seq1 2079 1 1 1437 1 1866 430 0.0 1170 81.43 1175 1443
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Atri_comp13712_c0_seq1 1938 1 1 1441 1 201 1641 0.0 2662 100.00 1441 1441
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Acur_01000228.1 2415 1 1 1440 1 2231 797 0.0 1906 90.62 1305 1440
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Apil_comp16418_c0_seq1 2079 1 3 1284 1 1714 430 0.0 1351 85.69 1102 1286
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Acur_01000228.1 2415 1 1 1287 1 2084 797 0.0 1219 83.81 1082 1291
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Ctom_01003023.1 2162 1 1 1289 1 106 1394 0.0 2381 100.00 1289 1289
Edit 2: There is still the occasional output with fewer sequences than expected, although fewer of them after incorporating the Edit 1 modifications into my script (i.e., accounting for the reverse direction). I cannot figure out why the script outputs fewer sequences in these other cases. Below is the infile in question; the output is lacking Btri_comp15171_c0_seq1:
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Apil_comp19456_c0_seq1 3549 1 1 2464 1 761 3224 0.0 4551 100.00 2464 2464
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Btri_comp15171_c0_seq1 3766 1 1 2456 1 3046 591 0.0 1877 80.53 1985 2465
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Apil_comp19456_c0_seq1 3549 1 1 2457 1 3214 758 0.0 1879 80.54 1986 2466
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Atri_comp28646_c0_seq1 1403 1 1256 2454 1 1401 203 0.0 990 81.60 980 1201
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Btri_comp15171_c0_seq1 3766 1 1 2457 1 593 3049 0.0 4538 100.00 2457 2457
You can use a hash to remove duplicates.
The code below removes duplicates based on their subject length (keeping the rows with the larger subject length).
Just update your # iterate through the individual hits part with:
# iterate through the individual hits
my %filterhash;
my $subject_length;
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] - $templine[8];
if(exists $filterhash{$templine[2]} ){
if($filterhash{$templine[2]} < $subject_length){
$filterhash{$templine[2]} = $subject_length;
}
}
else{
$filterhash{$templine[2]} = $subject_length;
}
}
my %printhash;
for ($jter=0; $jter<$t1; $jter++) {
(@templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] - $templine[8];
# extract each subject sequence only once, at the row carrying its
# largest subject length; otherwise rows tied at the maximum would be
# extracted more than once
if( not exists $printhash{$templine[2]}
and $filterhash{$templine[2]} == $subject_length ){
$printhash{$templine[2]} = 1;
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
}
} # end for ($jter=0; $jter<$t1...
Hope this helps.
Edit part update:
For hits on the negative strand you need to replace
$subject_length = $templine[9] -$templine[8];
with
if($templine[8] > $templine[9]){
$subject_length = $templine[8] -$templine[9];
}else{
$subject_length = $templine[9] -$templine[8];
}
You also need to update your extract_from_genome2 code for negative strand sequences.

Deleting lines with sed or awk

I have a file data.txt like this.
>1BN5.txt
207
208
211
>1B24.txt
88
92
I have a folder F1 that contains text files.
1BN5.txt file in F1 folder is shown below.
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 422 C SER A 248 70.124 -29.955 8.226 1.00 55.81 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
ATOM 626 N MET B 87 1.054 -3.071 -5.633 1.00 10.00 N
ATOM 627 CA MET B 87 -0.213 -2.354 -5.826 1.00 10.00 C
1B24.txt file in F1 folder is shown below.
ATOM 630 CB MET B 87 -0.476 -2.140 -7.318 1.00 10.00 C
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
ATOM 644 CA ALA B 94 -2.560 -5.149 -4.675 1.00 10.00 C
I need only the lines containing 207,208,211(6th column)in 1BN5.txt file. I want to delete other lines in 1BN5.txt file. Like this, I need only the lines containing 88,92 in 1B24.txt file.
Desired output
1BN5.txt file
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
1B24.txt file
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Here's one way using GNU awk. Run like:
awk -f script.awk data.txt
Contents of script.awk:
/^>/ {
file = substr($1,2)
next
}
{
a[file][$1]
}
END {
for (i in a) {
while ( ( getline line < ("./F1/" i) ) > 0 ) {
split(line,b)
for (j in a[i]) {
if (b[6]==j) {
print line > "./F1/" i ".new"
}
}
}
system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
}
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file][$1] } END { for (i in a) { while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,b); for (j in a[i]) if (b[6]==j) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
If you have an older version of awk, older than GNU Awk 4.0.0, you could try the following. Run like:
awk -f script.awk data.txt
Contents of script.awk:
/^>/ {
file = substr($1,2)
next
}
{
a[file]=( a[file] ? a[file] SUBSEP : "") $1
}
END {
for (i in a) {
split(a[i],b,SUBSEP)
while ( ( getline line < ("./F1/" i) ) > 0 ) {
split(line,c)
for (j in b) {
if (c[6]==b[j]) {
print line > "./F1/" i ".new"
}
}
}
system(sprintf("mv ./F1/%s.new ./F1/%s", i, i))
}
}
Alternatively, here's the one-liner:
awk '/^>/ { file = substr($1,2); next } { a[file]=( a[file] ? a[file] SUBSEP : "") $1 } END { for (i in a) { split(a[i],b,SUBSEP); while ( ( getline line < ("./F1/" i) ) > 0 ) { split(line,c); for (j in b) if (c[6]==b[j]) print line > "./F1/" i ".new" } system(sprintf("mv ./F1/%s.new ./F1/%s", i, i)) } }' data.txt
Please note that this script does exactly as you describe. It expects files like 1BN5.txt and 1B24.txt to reside in the folder F1 in the present working directory. It will also overwrite your original files. If this is not the desired behavior, drop the system() call. HTH.
Results:
Contents of F1/1BN5.txt:
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
Contents of F1/1B24.txt:
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
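The core mechanism of the script above, building (filename, residue) keys from data.txt and then testing membership against column 6, can be exercised on throwaway files. This cut-down sketch uses a flat `want[file, $1]` key so it also runs on non-GNU awks, and prints matches instead of rewriting the files:

```shell
mkdir -p F1
printf 'ATOM 1 CA SER A 207 x\nATOM 2 C SER A 248 x\n' > F1/demo.txt
printf '>demo.txt\n207\n' > data.txt

awk '
NR == FNR {                            # first pass: data.txt
    if (sub(/^>/, "")) { file = "F1/" $1; next }
    want[file, $1]                     # remember (path, residue) pairs
    next
}
(FILENAME, $6) in want                 # later passes: print matching records
' data.txt F1/demo.txt
```

Only the residue-207 line survives; the 248 line is dropped.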
Don't try to delete lines from the existing file, try to create a new file with only the lines you want to have:
awk '$6 == 207 || $6 == 208 || $6 == 211 { print }' 1bn5.txt > output.txt
assuming gnu awk, run this command from the directory containing data.txt:
awk -F">" '{if($2 != ""){fname=$2}if($2 == ""){term=$1;system("grep "term" F1/"fname" >>F1/"fname"_results");}}' data.txt
this parses data.txt for filenames and search terms, then calls grep from inside awk to append the matches from each file and term listed in data.txt to a new file in F1 called originalfilename.txt_results.
if you want to replace the original files completely, you could then run this command:
grep "^>.*$" data.txt | sed 's/>//' | xargs -I{} find F1 -name {}_results -exec mv F1/{}_results F1/{} \;
This will move all of the files in F1 to a tmp dir named "backup" and then re-create just the resultant non-empty files under F1
mv F1 backup &&
mkdir F1 &&
awk '
NR==FNR {
if (sub(/>/,"")) {
file = "backup/" $0
out = "F1/" $0
ARGV[ARGC++] = file
}
else {
tgt[file,$0] = out
}
next
}
(FILENAME,$6) in tgt {
print > tgt[FILENAME,$6]
}
' data.txt &&
rm -rf backup
If you want the empty files too it's a trivial tweak and if you want to keep the backup dir just get rid of the "&& rm.." at the end (do that during testing anyway).
EDIT: FYI this is one case where you could argue that getline is not completely incorrect, since it's parsing a first file that's totally unlike the rest of the files in structure and intent, so parsing that one file differently from the rest isn't going to cause any maintenance headaches later:
mv F1 backup &&
mkdir F1 &&
awk -v data="data.txt" '
BEGIN {
while ( (getline line < data) > 0 ) {
if (sub(/>/,"",line)) {
file = "backup/" line
out = "F1/" line
ARGV[ARGC++] = file
}
else {
tgt[file,line] = out
}
}
}
(FILENAME,$6) in tgt {
print > tgt[FILENAME,$6]
}
' &&
rm -rf backup
but as you can see it makes the script a bit more complicated (though slightly more efficient as there's now no test for FNR==NR in the main body).
This solution plays some tricks with the record separator: "data.txt" uses > as the record separator, while the other files use newline.
awk '
BEGIN {RS=">"}
FNR == 1 {
# since the first char in data.txt is the record separator,
# there is an empty record before the real data starts
next
}
{
n = split($0, a, "\n")
file = "F1/" a[1]
newfile = file ".new"
RS="\n"
while (getline < file) {
for (i=2; i<n; i++) {
if ($6 == a[i]) {
print > newfile
break
}
}
}
RS=">"
system(sprintf("mv \"%s\" \"%s.bak\" && mv \"%s\" \"%s\"", file, file, newfile, file))
}
' data.txt
Definitely a job for awk:
$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 421 CA SER A 207 68.627 -29.819 8.533 1.00 50.79 C
ATOM 615 H LEU B 208 3.361 -5.394 -6.021 1.00 10.00 H
ATOM 616 HA LEU B 211 2.930 -4.494 -3.302 1.00 10.00 H
$ awk '$6==92||$6==88 { print }' 1B24.txt
ATOM 631 CG MET B 88 -0.828 -0.688 -7.575 1.00 10.00 C
ATOM 632 SD MET B 88 -2.380 -0.156 -6.830 1.00 10.00 S
ATOM 643 N ALA B 92 -1.541 -4.371 -5.366 1.00 10.00 N
Redirect to save the output:
$ awk '$6==207||$6==208||$6==211 { print }' 1bn5.txt > output.txt
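If the residue numbers change per run, the same filter can take them as a variable instead of hard-coding the comparisons (a small generalization, not part of the answers above; file contents here are shortened toy lines):

```shell
printf '%s\n' \
  'ATOM 421 CA SER A 207 68.6' \
  'ATOM 422 C SER A 248 70.1' \
  'ATOM 615 H LEU B 208 3.3' > 1bn5.txt

awk -v keep='207 208 211' '
BEGIN { n = split(keep, k, " "); for (i = 1; i <= n; i++) want[k[i]] }
$6 in want
' 1bn5.txt
```

The 207 and 208 lines print; the 248 line does not.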
I don't think you can do this with sed alone. You need a loop to read your file data.txt. For example, using a bash script:
#!/bin/bash
# First remove all possible "problematic" characters from data.txt, storing result
# in data.clean.txt. This removes everything except A-Z, a-z, 0-9, leading >, and ..
sed 's/[^A-Za-z0-9>\.]//g;s/\(.\)>/\1/g;/^$/d' data.txt >| data.clean.txt
# Next determine which lines to keep:
cat data.clean.txt | while read line; do
if [[ "${line:0:1}" == ">" ]]; then
# If input starts with ">", set remainder to be the current file
file="${line:1}"
else
# If value is in sixth column, add "keep" to end of line
# Columns assumed separated by one or more spaces
# "+" is a GNU extension, so we need the -r switch
sed -i -r "/^[^ ]+ +[^ ]+ +[^ ]+ +[^ ]+ +$line +/s/$/keep/" $file
fi
done
# Finally delete the unwanted lines, i.e. those without "keep":
# (assumes each file appears only once in data.txt)
cat data.clean.txt | while read line; do
if [[ "${line:0:1}" == ">" ]]; then
sed -i -n "/keep/{s/keep//g;p;}" ${line:1}
fi
done

Taking only values which form a continuous range

I have a file with 3 columns ->
A1 0 9
A1 4 14
A1 16 24
A1 25 54
A1 64 84
A1 74 84
A2 15 20
A2 19 50
I want to check whether each line's range (the values in col2 and col3) is already present in, or overlaps, the previous line's range, whenever the col1 values are equal.
The desired output is ->
A1 0 14
A1 16 54
A1 64 84
A2 15 50
I have tried ->
@ARGV or die "No input file specified";
open $first, '<',$ARGV[0] or die "Unable to open input file: $!";
#open $second,'<', $ARGV[1] or die "Unable to open input file: $!";
$k=0;
while (<$first>)
{
if($k==0)
{
@cols = split /\s+/;
$p0=$cols[0];
$p1=$cols[1];
$p2=$cols[2];
$p3=$cols[2]+1;
}
else{
@new = split /\s+/;
if ($new[0] eq $p0){
if ($new[1]>$p3)
{
print join("\t", @new),"\n";
$p0=$new[0];
$p1=$new[1];
$p2=$new[2];
$p3=$new[2]+1;
}
elsif ($new[2]>=$p2)
{
print $p0,"\t",$p1,"\t",$new[2],"\n";
$p2=$new[2];
$p3=$new[2]+1;
}
else
{
$p5=1;
}
}
else
{
print join("\t", @new),"\n";
$p0=$new[0];
$p1=$new[1];
$p2=$new[2];
$p3=$new[2]+1;
}}
$k=1;
}
and output I am getting is ->
A1 0 14
A1 16 24
A1 16 54
A1 64 84
A1 64 84
A2 15 20
A2 22 50
I am not able to understand why I am getting this wrong output. Also, if there is any way to erase (or overwrite) the last printed line, that would make this very easy.
First of all, it would be much simpler to help you if you
used strict and warnings, and declared all your variables close to first use with my
indented your code properly to show the structure
The reason your code fails is that you are printing data under too many conditions. For example you output A1 16 24 when you find it cannot be joined with the previous range A1 4 14 without waiting for it to be extended by the subsequent A1 25 54 (when you correctly extend the range and print it again). A1 64 84 is output twice for the same reason: first because it cannot be merged with A1 25 54, and again because it has been "extended" with A1 74 84. Finally A2 15 20 is output straight away because it has a new first column, even though it is merged with the next line and output again.
You need to output a range only when you have found that it cannot be extended again. That happens when
a new record is found that doesn't overlap the existing data
the end of the file is reached
This code prints output only in those cases and appears to do what you need.
use strict;
use warnings;
my @data;
while (<DATA>) {
if (not @data) {
@data = split;
next;
}
my @new = split;
if ($new[0] eq $data[0] and $new[1] <= $data[2] + 1) {
$data[2] = $new[2];
}
else {
print join("\t", @data), "\n";
@data = @new;
}
print join("\t", @data), "\n" if eof DATA;
}
__DATA__
A1 0 9
A1 4 14
A1 16 24
A1 25 54
A1 52 57
A1 59 62
A1 64 84
A1 74 84
A2 15 20
A2 19 50
OUTPUT
A1 0 14
A1 16 57
A1 59 62
A1 64 84
A2 15 50
You need to have some variables describing the currently-accumulated contiguous region. For each line of input, flush the previously-accumulated region if the new input has a new column-1 label, or has the same label but is non-contiguous, or end-of-file is reached. If it has the same label and is contiguous, you update the min and max values.
This assumes that columns 1 and 2 are sorted.
The rest is left as an exercise for the reader.
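As one possible sketch of that accumulate-and-flush approach, here it is as an awk one-liner (this assumes the input is sorted on columns 1 and 2, as stated above; the inline sample data simply reuses the ranges from the question):

```shell
# Merge overlapping/adjacent ranges, flushing a range only when it can no
# longer be extended: a new label, a gap, or end of input.
printf '%s\n' 'A1 0 9' 'A1 4 14' 'A1 16 24' 'A1 25 54' 'A1 52 57' \
              'A1 59 62' 'A1 64 84' 'A1 74 84' 'A2 15 20' 'A2 19 50' |
awk '
$1 != key || $2 > hi + 1 {                  # new label or non-contiguous: flush
    if (key != "") print key "\t" lo "\t" hi
    key = $1; lo = $2; hi = $3
    next
}
$3 > hi { hi = $3 }                         # contiguous: extend the right edge
END { if (key != "") print key "\t" lo "\t" hi }'
```

Each merged range is printed exactly once, matching the OUTPUT block shown above.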

Using perl hash for comparing columns of 2 files

I have asked this question before (sorry for asking again; this time it is different and more difficult), but I have tried a lot and did not achieve the result.
I have 2 big files (tab delimited).
first file ->
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
101_#2 1 H F0 263 278 2 1.5
102_#1 1 6 F1 766 781 1 1.0
103_#1 2 15 V1 526 581 1 0.0
103_#1 2 9 V2 124 134 1 1.3
104_#1 1 12 V3 137 172 1 1.0
105_#1 1 17 F2 766 771 1 1.0
second file ->
Col1 Col2 Col3 Col4
97486 H 262 279
67486 9 118 119
87486 9 183 185
248233 9 124 134
If the col3 value/character (of file1) and the col2 value/character (of file2) are the same, then compare col5 and col6 of file1 (as a range) with col3 and col4 of file2; if the range from file1 falls within the range from file2, return that row (from file1) and also append col1 from file2 to the output.
Expected output ->
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9
101_#2 1 H F0 263 278 2 1.5 97486
103_#1 2 9 V2 124 134 1 1.3 248233
So far I have tried something with hashes->
@ARGV or die "No input file specified";
open my $first, '<',$ARGV[0] or die "Unable to open input file: $!";
open my $second,'<', $ARGV[1] or die "Unable to open input file: $!";
print scalar (<$first>);
while(<$second>){
chomp;
@line=split /\s+/;
$hash{$line[2]}=$line[3];
}
while (<$first>) {
@cols = split /\s+/;
$p1 = $cols[4];
$p2 = $cols[5];
foreach $key (sort keys %hash){
if ($p1>= "$key"){
if ($p2<=$hash{$key})
{
print join("\t",@cols),"\n";
}
}
else{ next; }
}
}
But there is no comparison of the col3 value/character (of file1) with the col2 value/character (of file2) in the above code.
It is also taking a lot of time and memory. Can anybody suggest how I can make it fast using hashes or hashes of hashes? Thanks a lot.
Hello everyone,
Thanks a lot for your help. I figured out an efficient way to solve my own question.
@ARGV or die "No input file specified";
open $first, '<',$ARGV[0] or die "Unable to open input file: $!";
open $second,'<', $ARGV[1] or die "Unable to open input file: $!";
print scalar (<$first>);
while(<$second>){
chomp;
@line=split /\s+/;
$hash{$line[1]}{$line[2]}{$line[3]}= $line[0];
}
while (<$first>) {
@cols = split /\s+/;
foreach $key1 (sort keys %hash) {
foreach $key2 (sort keys %{$hash{$key1}}) {
foreach $key3 (sort keys %{$hash{$key1}{$key2}}) {
if (($cols[2] eq $key1) && ($cols[4]>=$key2) && ($cols[5]<=$key3)){
print join("\t",@cols),"\t",$hash{$key1}{$key2}{$key3},"\n";
}
last;
}
}
}
}
Is it right?
You don't need two hash tables. You just need one hash table built from entries in the first file, and when you loop through the second file, check if there's a key in the first-file hash table using defined.
If there is a key, do your comparisons on the values of other columns (we store values from the first file in the hash table for the third column's key).
If there's no key, then either warn, die, or have the script just keep going without saying anything, if that's what you want:
#!/usr/bin/perl -w
use strict;
use warnings;
my ($firstFile, $secondFile) = @ARGV; # declare the file names so strict is satisfied
my $firstHashRef;
open FIRST, "< $firstFile" or die "could not open first file...\n";
while (<FIRST>) {
chomp $_;
my @elements = split "\t", $_;
my $col3Val = $elements[2]; # Perl arrays are zero-indexed
my $col5Val = $elements[4];
my $col6Val = $elements[5];
# keep the fifth and sixth column values on hand, for
# when we loop through the second file...
if (! defined $firstHashRef->{$col3Val}) {
$firstHashRef->{$col3Val}->{Col5} = $col5Val;
$firstHashRef->{$col3Val}->{Col6} = $col6Val;
}
}
close FIRST;
open SECOND, "< $secondFile" or die "could not open second file...\n";
while (<SECOND>) {
chomp $_;
my @elements = split "\t", $_;
my $col2ValFromSecondFile = $elements[1];
my $col3ValFromSecondFile = $elements[2];
my $col4ValFromSecondFile = $elements[3];
if (defined $firstHashRef->{$col2ValFromSecondFile}) {
# we found a matching key
# 1. Compare $firstHashRef->{$col2ValFromSecondFile}->{Col5} with $col3ValFromSecondFile
# 2. Compare $firstHashRef->{$col2ValFromSecondFile}->{Col6} with $col4ValFromSecondFile
# 3. Do something interesting, based on comparison results... (this is left to you to fill in)
}
else {
warn "We did not locate entry in hash table for second file's Col2 value...\n";
}
}
close SECOND;
How about using just awk for this -
awk '
NR==FNR && NR>1{a[$3]=$0;b[$3]=$5;c[$3]=$6;next}
($2 in a) && ($3<=b[$2] && $4>=c[$2]) {print a[$2],$1}' file1 file2
Input Data:
[jaypal:~/Temp] cat file1
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
101_#2 1 H F0 263 278 2 1.5
109_#2 1 H F0 263 278 2 1.5
102_#1 1 6 F1 766 781 1 1.0
103_#1 2 15 V1 526 581 1 0.0
103_#1 2 9 V2 124 134 1 1.3
104_#1 1 12 V3 137 172 1 1.0
105_#1 1 17 F2 766 771 1 1.0
[jaypal:~/Temp] cat file2
Col1 Col2 Col3 Col4
97486 H 262 279
67486 9 118 119
87486 9 183 185
248233 9 124 134
Test:
[jaypal:~/Temp] awk '
NR==FNR && NR>1{a[$3]=$0;b[$3]=$5;c[$3]=$6;next}
($2 in a) && ($3<=b[$2] && $4>=c[$2]) {print a[$2],$1}' file1 file2
101_#2 1 H F0 263 278 2 1.5 97486
103_#1 2 9 V2 124 134 1 1.3 248233
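One thing to note about the awk above: a[$3]=$0 keeps only one row per third-column key, so when file1 has several rows sharing a key (the two H rows in the test data) only the last one read is remembered. A possible variant that stores every file1 row per key is sketched below (the inline file1/file2 contents and names are my own, written by the sketch itself so it is self-contained):

```shell
# Build the two input files used by the sketch.
cat > file1 <<'EOF'
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
101_#2 1 H F0 263 278 2 1.5
109_#2 1 H F0 263 278 2 1.5
103_#1 2 9 V2 124 134 1 1.3
EOF
cat > file2 <<'EOF'
Col1 Col2 Col3 Col4
97486 H 262 279
248233 9 124 134
EOF
awk '
NR==FNR && FNR>1 { rows[$3] = rows[$3] $0 "\n"; next }  # keep every file1 row per key
FNR>1 && ($2 in rows) {
    n = split(rows[$2], r, "\n")
    for (i = 1; i < n; i++) {                           # test each stored row
        split(r[i], f, " ")
        if ($3 <= f[5] && $4 >= f[6]) print r[i], $1    # range contained: print row + file2 col1
    }
}' file1 file2
```

With the data above this prints both H rows (each tagged with 97486) as well as the 103_#1 row tagged with 248233, rather than only the last H row stored.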