Perl FASTA sequence extraction has a problem (only) with the first sequence

I am using a function/subroutine, extract_seq, found on the internet to extract sequences from FASTA files. Briefly:
A sequence begins with a header line starting with '>', followed by an ID and other information separated by spaces.
Subsequent lines (those not beginning with '>') contain the sequence data.
A FASTA file can have one or more sequences.
The bug is that the output has an additional '>' character for the first sequence (only), causing consistency problems.
The program extracts sequences by ID correctly except for the additional '>' in the case of the first sequence. Could you please suggest a solution as well as the reason for the bug? A simple regex would fix the problem, but I do not feel good about fixing bugs that I cannot understand.
The Perl script is:
#!/usr/bin/perl -w
use strict;

my $seq_all = "seq_all.fa"; # all proteins in FASTA format

foreach my $q_seq ("A0A1D8PC43", "A0A1D8PC38") {
    print "Querying $q_seq\n";
    &extract_seq($seq_all, $q_seq);
}
exit 0;

sub extract_seq
{
    open(my $fh, ">query.seq");
    my $seq_all = $_[0];
    my $lookup  = $_[1];
    local $/ = "\n>";
    @ARGV = ($seq_all);
    while (my $seq = <>) {
        chomp $seq;
        my ($id) = $seq =~ /^>*(\S+)/;
        if ($id eq $lookup) {
            print "$seq\n";
            last;
        }
    }
}
The FASTA file is:
>A0A1D8PC43 A0A1D8PC43_CANAL Diphosphomevalonate decarboxylase
MYSASVTAPVNIATLKYWGKRDKSLNLPTNSSISVTLSQDDLRTLTTASASESFEKDQLW
LNGKLESLDTPRTQACLADLRKLRASIEQSPDTPKLSQMKLHIVSENNFPTAAGLASSAA
GFAALVSAIAKLYELPQDMSELSKIARKGSGSACRSLFGGFVAWEMGTLPDGQDSKAVEI
APLEHWPSLRAVILVVSDDKKDTPSTTGMQSTVATSDLFAHRIAEVVPQRFEAMKKAILD
KDFPKFAELTMKDSNSFHAVCLDSYPPIFYLNDTSKKIIKMVETINQQEVVAAYTFDAGP
NAVIYYDEANQDKVLSLLYKHFGHVPGWKTHYTAETPVAGVSRIIQTSIGPGPQETSESL
TK
>A0A1D8PC56 A0A1D8PC56_CANAL Uncharacterized protein OS=Candida
MSDTKKTTETDSEVGYLDIYLRFNDDMEKDYCFQVKTTTVFKDLYKVFRTLPISLRPSVF
YHAQPIGFKKSVSPGYLTQDGNFIFDEDSQKQAVPVNDNDLINETVWPGQLILPVWQFND
FGFYSFLAFLACWLYTDLPDFISPTPGICLTNQMTKLMAWVLVQFGKDRFAETLLADLYD
TVGVGAQCVFFGFHIIKCLFIFGFLYTGVFNPMRVFRLTPRSVKLDVTKEELVKLGWTGT
RKATIDEYKEYYREFKINQHGGMIQAHRAGLFNTLRNLGVQLESGEGYNTPLTEENKLRT
MRQIVEDAKKPDFKLKLSYEYFAELGYVFATNAENKEGSELAQLIKQYRRYGLLVSDQRI
KTVVRARKGETDEEKPKVEEVVEE
>A0A1D8PC67 A0A1D8PC67_CANAL Bfa1p OS=Candida albicans (strain
MVSDKLTLLRQFSEEDELFGDIEGIDYHDGETLKINKFSFPSSASSPSFAITGQSPNMRS
INGKRITRETLSEYSEENETDLTSEFSDQEFEWDGFNKNQSIYQQMNQRLIATKVAKQRE
AEREQRELMQKRHKDYDPNQTLRLKDFNKLTNENLTLLDQLDDEKTVNYEYVRDDVEDFA
QGFDKDFETKLRIQPSMPTLRSNAPTLKKYKSYGEFKCDNRVKQKLDRIPSFYNKNQLLS
KFKETKSYHPHHKKMGTVRCLNNNSEVPVTYPSISNMKLNKEKNRWEGNDIDLIRFEKPS
LITHKENKTKKRQGNMVYDEQNLRWINIESEHDVFDDIPDLAVKQLQSPVRGLSQFTQRT
TSTTATATAPSKNNETQHSDFEISRKLVDKFQKEQAKIEKKINHWFIDTTSEFNTDHYWE
IRKMIIEE
>A0A1D8PC38 A0A1D8PC38_CANAL Cta2p OS=Candida albicans (strain
MPENLQTRLHNSLDEILKSSGYIFEVIDQNRKQSNVITSPNNELIQKSITQSLNGEIQNF
HAILDQTVSKLNDAEWCLGVMVEKKKKHDELKVKEEAARKKREEEAKKKEEEAKKKAEEA
KKKEEEAKKAEEAKKAEEAKKVEEAAKKAEEAKKAEEEARKKAETAPQKFDNFDDFIGFD
INDNTNDEDMLSNMDYEDLKLDDKVPATTDNNLDMNNILENDESILDGLNMTLLDNGDHV
NEEFDVDSFLNQFGN
Edit:
As explained above, the problem I face is that the output has an additional '>' character for the first sequence (only). I do not see the reason for this, and it is causing a lot of trouble. The output is:
Querying A0A1D8PC43
>A0A1D8PC43 A0A1D8PC43_CANAL Diphosphomevalonate decarboxylase
MYSASVTAPVNIATLKYWGKRDKSLNLPTNSSISVTLSQDDLRTLTTASASESFEKDQLW
LNGKLESLDTPRTQACLADLRKLRASIEQSPDTPKLSQMKLHIVSENNFPTAAGLASSAA
GFAALVSAIAKLYELPQDMSELSKIARKGSGSACRSLFGGFVAWEMGTLPDGQDSKAVEI
APLEHWPSLRAVILVVSDDKKDTPSTTGMQSTVATSDLFAHRIAEVVPQRFEAMKKAILD
KDFPKFAELTMKDSNSFHAVCLDSYPPIFYLNDTSKKIIKMVETINQQEVVAAYTFDAGP
NAVIYYDEANQDKVLSLLYKHFGHVPGWKTHYTAETPVAGVSRIIQTSIGPGPQETSESL
TK
Querying A0A1D8PC38
A0A1D8PC38 A0A1D8PC38_CANAL Cta2p OS=Candida albicans (strain
MPENLQTRLHNSLDEILKSSGYIFEVIDQNRKQSNVITSPNNELIQKSITQSLNGEIQNF
HAILDQTVSKLNDAEWCLGVMVEKKKKHDELKVKEEAARKKREEEAKKKEEEAKKKAEEA
KKKEEEAKKAEEAKKAEEAKKVEEAAKKAEEAKKAEEEARKKAETAPQKFDNFDDFIGFD
INDNTNDEDMLSNMDYEDLKLDDKVPATTDNNLDMNNILENDESILDGLNMTLLDNGDHV
NEEFDVDSFLNQFGN

$/ is the input record separator. Setting local $/ = "\n>"; means each read returns a record that ends with "\n>", and chomp removes that ending. The catch is the asymmetry this creates: the '>' that opens every record after the first has already been consumed as the tail of the previous record's separator, while the first record keeps its leading '>' untouched. The pattern />*(\S+)/ still extracts the ID either way (the >* is optional), but the record text you print differs.
Per the FASTA description on Wikipedia, a line beginning with '>' is a description line and is not always just an ID. If in your data it always is, the following may fix it:
my ($id, $seq) = $seq =~ /^>*(.*)\n(\S+)/;
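To see the asymmetry directly, here is a minimal sketch (file name as in the question) that prints the first line of each record produced by this separator:

local $/ = "\n>";
open my $fh, '<', 'seq_all.fa' or die $!;
while (my $rec = <$fh>) {
    chomp $rec;                      # removes the trailing "\n>"
    my ($first) = $rec =~ /^(.*)/;   # first line of the record
    print "record starts: $first\n";
}

The output would show ">A0A1D8PC43 ..." for the first record but "A0A1D8PC56 ..." (no '>') for all the others, which is exactly the inconsistency observed.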

You set the record separator to "\n>". That separator never precedes the first sequence, so the first record keeps its leading '>' while every later record has its '>' consumed as part of the previous record's separator.
Fixed code sequence:
...
chomp $seq;
$seq =~ s/^>//;    # strip the leading '>' (present on the first record only)
my ($id) = $seq =~ /^(\S+)/;
if ($id eq $lookup) {
...
Please note that your implementation is extremely inefficient, because it reads and parses the whole file for every query. How about splitting loading/parsing and querying into separate functions, as sketched below?
An alternative solution: give the full list of lookup values to the loader. It would then fill an answer array as it encounters the matches while reading the file.
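A minimal sketch of that split (the sub name load_seqs and the hash %seqs are illustrative, not from the original code):

sub load_seqs {
    my ($file) = @_;
    my %seqs;
    local $/ = "\n>";
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (my $rec = <$fh>) {
        chomp $rec;
        $rec =~ s/^>//;               # leading '>' is present on the first record only
        my ($id) = $rec =~ /^(\S+)/;
        $seqs{$id} = $rec;
    }
    return \%seqs;
}

my $seqs = load_seqs($seq_all);       # parse the file exactly once
for my $q_seq ("A0A1D8PC43", "A0A1D8PC38") {
    print "Querying $q_seq\n";
    print "$seqs->{$q_seq}\n" if exists $seqs->{$q_seq};
}

Each query then becomes a hash lookup instead of another pass over the file.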

Why is my last line always output twice?

I have a UniProt document with a protein sequence as well as some metadata. I need to use Perl to match the sequence and print it out, but for some reason the last line always comes out twice. The code I wrote is here:
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if($_=~m /^\s+(\D+)/) { #this is the pattern I used to match the sequence in the document
$seq=$1;
$seq=~s/\s//g;} #removing the spaces from the sequence
print $seq;
}
I instead tried $seq .= $1; but it printed out the sequence 4.5 times. I'm sure I have made a mistake here but am not sure what. Here is the input file: https://www.uniprot.org/uniprot/P30988.txt
Here is your code, reformatted with extra whitespace added between operators, to make it clearer what scope the statements are running in.
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
    if ($_ =~ m /^\s+(\D+)/) {
        $seq = $1;
        $seq =~ s/\s//g;
    }
    print $seq;
}
The placement of the print command means that $seq will be printed for every line from the input file -- even those that don't match the regex.
I suspect you want this
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
    if ($_ =~ m /^\s+(\D+)/) {
        $seq = $1;
        $seq =~ s/\s//g;
        # only print $seq for lines that match /^\s+(\D+)/
        # Also - added a newline to make it easier to debug
        print $seq . "\n";
    }
}
When I run that I get this
MRFTFTSRCLALFLLLNHPTPILPAFSNQTYPTIEPKPFLYVVGRKKMMDAQYKCYDRMQ
QLPAYQGEGPYCNRTWDGWLCWDDTPAGVLSYQFCPDYFPDFDPSEKVTKYCDEKGVWFK
HPENNRTWSNYTMCNAFTPEKLKNAYVLYYLAIVGHSLSIFTLVISLGIFVFFRSLGCQR
VTLHKNMFLTYILNSMIIIIHLVEVVPNGELVRRDPVSCKILHFFHQYMMACNYFWMLCE
GIYLHTLIVVAVFTEKQRLRWYYLLGWGFPLVPTTIHAITRAVYFNDNCWLSVETHLLYI
IHGPVMAALVVNFFFLLNIVRVLVTKMRETHEAESHMYLKAVKATMILVPLLGIQFVVFP
WRPSNKMLGKIYDYVMHSLIHFQGFFVATIYCFCNNEVQTTVKRQWAQFKIQWNQRWGRR
PSNRSARAAAAAAEAGDIPIYICHQELRNEPANNQGEESAEIIPLNIIEQESSA
You can simplify this a bit:
while (<IN>) {
    next unless m/^\s/;
    s/\s+//g;
    print;
}
You want the lines that begin with whitespace, so immediately skip those that don't. Said another way, quickly reject the things you don't want, which is different from accepting the things you do want. Everything after the next then knows it is dealing with a good line, and the if disappears.
You don't need a capture ($1) to get the interesting text, because the only other text in the line is the leading whitespace, and that disappears when you remove all the whitespace. This gets rid of the if condition and the extra variable.
Finally, print what's left. Without an argument, print uses the value in the topic variable $_.
Now that's much more manageable. You also escape the scoping issue with if that caused the extra output, because there's no scope to worry about.

Replace comma with space in just one field - from a .CSV file

I have happened upon a problem with a program that parses a CSV file with a few million records: two fields in each line hold comments that users have put in, and sometimes they use commas within their comments. If a comment contains commas, that field is enclosed in double quotes. I need to replace any commas found in those fields with a space. Here is one such line from the file to give you an idea -
1925,47365,2,650187016,1,1,"MADE FOR DRAWDOWNS, NEVER P/U",16,IFC 8112NP,Standalone-6,,,44,10/22/2015,91607,,B24W02651,,"PA-3, PURE",4/28/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,C48A0D001EF449E3AB97F0B98C811B1B,POS.MISTINT.V0000.UP.Q,PROD_SMISA_BK,414D512050524F445F504F5331393235906F28561D2F0020,10/22/2015 9:29,10/22/2015 9:30
NOTE - I do not have the Text::CSV module available to me, nor will it be made available on the server I am using.
Here is part of my code for parsing this file. The first thing I do is concatenate the very first three fields and prepend the result to each line. Then I want to clear out the commas in @fields[7,19], then format the DATE in three fields and the DATETIME in two fields. The only line I can't figure out is the one clearing out those commas -
my @data;

# Read the lines one by one.
while ( $line = <$FH> ) {
    # split the fields, concatenate the first three fields,
    # and add it to the beginning of each line in the file
    chomp($line);
    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];

    # remove user input commas in @fields[7,19]
    $_ = for @fields[7,19];   # <-- the line I can't figure out

    # format DATE and DATETIME fields for MySQL/sqlbatch60
    $_ = join '-', (split /\//)[2,0,1] for @fields[14,20,23];
    $_ = Time::Piece->strptime($_, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];

    # write the parsed record back to the file
    push @data, \@fields;
}
If it is ONLY the seventh field that is troubling AND you know exactly how many fields there should be, you can do it this way.
Suppose the total number of fields is always N:
Split the line on commas ,
Separate and store the first six fields
Separate and store the last n fields, where n is N-7
Rejoin what remains with commas ,. This now forms field 7
Then do whatever you like with it; for example, write it to a proper CSV file. A sketch of this recipe follows.
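A minimal sketch of that recipe (the expected field count of 40 is an assumption for illustration; the quoted comment is taken as the seventh field, index 6, as in the sample line):

my $total = 40;                        # assumed expected number of fields
my @parts = split /,/, $line, -1;      # -1 keeps trailing empty fields
my $extra = @parts - $total;           # how many commas the comment added
if ($extra > 0) {
    my $field7 = join ',', @parts[6 .. 6 + $extra];   # rejoin: this is field 7
    $field7 =~ tr/,/ /;                               # now replace its commas with spaces
    @parts = (@parts[0 .. 5], $field7, @parts[6 + $extra + 1 .. $#parts]);
}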
Text::CSV_XS handles quoted commas just fine:
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
my $aoa = csv(in => 'file.csv'); # The file contains the sample line.
print $aoa->[0][6];
Note: the two main versions below clean up one field. The most recent change to the question states that there are, in fact, two such fields. The third version, at the end, works with any number of bad fields.
All code has been tested with the supplied example and its variations.
Following clarifications, this deals with the case where the file has to be processed by hand. A module is the easy recommendation for parsing .csv, but there is a problem here: reliance on the users to have entered the double quotes. If they end up not being there, we have a malformed file.
I take it that the number of fields in the file is known with certainty and ahead of time.
The two independent solutions below use either array or string processing.
(1) The file is being processed line by line anyway, the line being split already. If there are more fields than expected, join the extra array elements with a space and then overwrite the array with the correct fields. This is similar to what is outlined in the answer by vanHoesel.
use strict;
use warnings;

my $num_fields = 39;  # what it should be, using the example
my $ibad       = 6;   # index of the malformed field-to-be
my @last = (-($num_fields - $ibad - 1) .. -1);  # index range for the rest of the fields

my $file = "file.csv";
open my $fh, '<', $file;

while (my $line = <$fh>) {  # chomp it if needed
    my @fields = split ',', $line;
    if (@fields != $num_fields) {
        # join extra elements by space
        my $fixed = join ' ', @fields[$ibad .. $ibad + @fields - $num_fields];
        # overwrite array with good fields
        @fields = (@fields[0..$ibad-1], $fixed, @fields[@last]);
    }
    # Process @fields normally
    print "@fields";
}
close $fh;
(2) Preprocess the file, only checking for malformed lines and fixing them as needed, using string manipulation. (Alternatively, the method above can be used.) $num_fields and $ibad are the same.
while (my $line = <$fh>) {
    # Number of fields: commas + 1 (tr|,|| counts the ","s)
    my $have_fields = $line =~ tr|,|| + 1;
    if ($have_fields != $num_fields) {
        # Get indices of the commas delimiting the bad field
        my ($beg, $end) = map {
            my $p = '[^,]*,' x $_;
            $line =~ /^$p/ and $+[0] - 1;
        } ($ibad, $ibad + $have_fields - $num_fields);
        # Replace extra commas and overwrite that part of the string
        my $bad_field = substr($line, $beg+1, $end-$beg-1);
        (my $fixed = $bad_field) =~ tr/,/ /;
        substr($line, $beg+1, $end-$beg-1) = $fixed;
    }
    # Perhaps write the line out, for a corrected .csv file
    print $line;
}
In the last line, the bad part of $line is overwritten by assigning to substr, which this function allows (it can be used as an lvalue). The new substring $fixed is built with its commas changed (or removed, if desired) and is used to overwrite the bad part of $line. See the docs.
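For example, assigning to substr overwrites that part of the string in place:

my $s = "hello world";
substr($s, 0, 5) = "HELLO";   # $s is now "HELLO world"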
If the quotes are known to be there, a regex can be used. This works with any number of bad fields.
while (my $line = <$fh>) {
    $line =~ s/"([^"]+)"/join ' ', split(',', $1)/eg;
    # process the line. note that double quotes are removed
}
If the quotes are to be kept, move them inside the parentheses so that they are captured as well.
This one line is all that needs to be done after while (...) { to clean up the data.
The /e modifier makes the replacement side be evaluated as code instead of being used as a double-quoted string. There, the matched part of the line (the text between the double quotes) is split on commas and joined back with spaces, which fixes the field. See the last item under "Search and replace" in perlretut.
All code has been tested with multiple lines and multiple commas in the bad field.

replace NULL token in Hive CLI output

I am trying to deliver the output of the Hive CLI to my (closed-source) application, and want to replace all "NULL" tokens with the empty string. This is because Hive returns NULL even for numeric fields, which the application raises exceptions for. I thought this should be a simple sed or perl regex, but I can't solve the problem so far.
Here's an example of the data record -
NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>2015-02-08
The perl code I tried is
my %replace = (
"\tNULL\t" => "b",
"^NULL\t" => "a",
"\tNULL\$" => "c"
);
my $regex = join "|", keys %replace;
#$regex = qr/$regex/;
my $filename = hout;
open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
chomp $row;
$row =~ s/($regex)/$replace{$1}/g;
print "$row\n";
}
This is the output I get -
NULLbNULLbNULLbNULL<TAB>2015-02-08
In other words, in a stream of 'fields' delimited by a 'character', I want to replace any field that is equal to the string "NULL" with an empty string, so the delimiters surrounding the field (or start of line + delimiter, or delimiter + end of line) become adjacent.
Any guidance would be much appreciated! Thanks!
P.S. I don't need a perl solution per se; just any terse solution would be awesome (I tried sed as well, with similar results).
The root of your problem here is that your patterns overlap. You have a delimiter on either side of each 'NULL', and the substitution consumes both before moving on.
So something like this:
my $string = "NULL\tNULL\tNULL\tsome_value\tNULL\n";
print $string;
$string =~ s/(\A|\t)NULL(\t|\Z)/$1$2/g;
print $string;
$string =~ s/(\A|\t)NULL(\t|\Z)/$1$2/g;
print $string;
You need two passes to process it, because each match 'grabs' the delimiter that the next match would need.
So with reference to: Matching two overlapping patterns with Perl
What you probably need is:
$string =~ s/(\A|\t)NULL(?=(\t|\Z))/$1/g;
You can use the same model if you want to apply it to separate patterns.
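As a usage sketch, the same substitution can run as a filter on the Hive output (the hive invocation and file names are illustrative):

hive -f test.hql | perl -pe 's/(^|\t)NULL(?=\t|$)/$1/g' > test.out

With -p, each line is read, edited, and printed; the look-ahead leaves the trailing tab in place, so adjacent NULL fields are all caught in a single pass.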
I do this a lot when creating output files from the Hive CLI. It may be "simple", but I just pipe my output through this sed string:
hive -f test.hql | sed 's/NULL//g' > test.out
Works fine for me.

How to read large files with different line delimiters?

I have two very large XML files that have different kinds of line endings.
File A has CR LF at the end of each XML record. File B has only CR at the end of each XML record.
In order to read File B properly, I need to set the built-in Perl variable $/ to "\r".
But if I use the same script with File A, the script does not read the file line by line and instead reads it as a single line.
How can I make the script compatible with text files that have various line-ending delimiters? In the code below, the script reads XML data and then uses a regex to split records on a specific record-ending XML tag like </record>. Finally, it writes the requested records to a file.
open my $file_handle, '+<', $inputFile or die $!;
local $/ = "\r";

while (my $line = <$file_handle>) {  # read file line by line; does not load the whole file into memory
    $current_line = $line;
    if ($spliceAmount > $recordCounter) {     # if the splice amount hasn't been reached yet
        push (@setofRecords, $current_line);  # start adding each line to the set-of-records array
        if ($current_line =~ m|$recordSeparator|) {  # check for the node to splice on
            $recordCounter++;  # the record separator was found (end of that record), so increment the record counter
        }
    }
    # don't close the file because we need to read the last line
}

$current_line =~ /(\<\/\w+\>$)/;
$endTag = $1;
print "\n\n";
print "End Tag: $endTag \n\n";
close $file_handle;
While you may not need it for this, in theory, to parse .xml you should use an XML parser. I'd recommend XML::LibXML, or perhaps XML::Simple to start off with.
If the file isn't too big to hold in memory, you can slurp the whole thing into a scalar and split it into the correct lines yourself with a suitably flexible regular expression. For example,
local $/ = undef;
my $data = <$file_handle>;

my @lines = split /(?>\r\n)|(?>\r)|(?>\n)/, $data;

foreach my $line (@lines) {
    ...
}
The (?>...) construct is an atomic group, which stops the alternation from backtracking (so "\r\n" can never be re-split into two separate delimiters). Note that split discards the delimiters here, so the lines come back without their end-of-line characters, as if already chomped; since that is usually what you want anyway, the pattern can be simplified to /\r\n|\r|\n/. If you did need to keep the terminators, you would capture them instead, e.g. split /(\r\n|\r|\n)/, $data, which interleaves the delimiters with the lines.
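Another approach, since the file is slurped anyway: normalize the line endings first, and then a single delimiter handles both files. A minimal sketch, reusing $data from above:

$data =~ s/\r\n?/\n/g;          # turn CRLF and lone CR into LF
my @lines = split /\n/, $data;  # one delimiter now covers File A and File B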

Perl: How to add a line to sorted text file

I want to add a line to a text file in Perl which has its data in sorted order. I have seen examples which show how to append data at the end of the file, but I want the data to stay sorted.
Please guide me on how this can be done.
Basically, from what I have tried so far:
I open the file and grep its contents to see if the line I want to add already exists. If it does, exit; else add it to the file (such that the data remains in sorted order).
open(my $FH, $file) or die "Failed to open file $file \n";
@file_data = <$FH>;
close($FH);

my $line = grep (/$string1/, @file_data);
if ($line) {
    print "Found\n";
    exit(1);
}
else {
    # add the line to the file
    print "Not found!\n";
}
Here's an approach using Tie::File so that you can easily treat the file as an array, and List::BinarySearch's bsearch_str_pos function to quickly find the insert point. Once you've found the insert point, you check to see if the element at that point is equal to your insert string. If it's not, splice it into the array. If it is equal, don't splice it in. And finish up with untie so that the file gets closed cleanly.
use strict;
use warnings;

use Tie::File;
use List::BinarySearch qw(bsearch_str_pos);

my $insert_string = 'Whatever!';
my $file          = 'something.txt';

my @array;
tie @array, 'Tie::File', $file or die $!;

my $idx = bsearch_str_pos $insert_string, @array;

splice @array, $idx, 0, $insert_string
    if $array[$idx] ne $insert_string;

untie @array;
The bsearch_str_pos function from List::BinarySearch is an adaptation of a binary search implementation from Mastering Algorithms with Perl. Its convenient characteristic is that if the search string isn't found, it returns the index of the point where it could be inserted while maintaining the sort order.
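For instance (values illustrative, using the same bsearch_str_pos call as above):

my @sorted = qw(apple banana pear);
my $idx = bsearch_str_pos 'lemon', @sorted;   # 2: 'lemon' would be inserted before 'pear'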
Since you have to read the contents of the text file anyway, how about a different approach?
Read the lines in the file one-by-one, comparing against your target string. If you read a line equal to the target string, then you don't have to do anything.
Otherwise, you eventually read a line 'greater' than your current line according to your sort criteria, or you hit the end of the file. In the former case, you insert the string at that position and then copy the rest of the lines; in the latter case, you append the string to the end.
If you don't want to do it that way, you can do a binary search in @file_data to find the spot to add the line without having to examine all of the entries, then insert it into the array before writing the array out to the file.
Here's a simple version that reads from stdin (or filename(s) specified on the command line) and appends 'string to append' to the output if it's not found in the input. Output is printed on stdout.
#! /usr/bin/perl
$found  = 0;
$append = 'string to append';
while (<>) {
    $found = 1 if (m/$append/o);
    print;
}
print "$append\n" unless ($found);
Modifying it to edit a file in-place (with perl -i) and taking the append string from the command line would be quite simple.
A 'simple' one-liner to insert a line without using any module could be (note the $found++ in the elsif branch, so the line is only ever inserted once):
perl -ni -le '$insert="lemon"; $eq=($insert cmp $_); if ($eq == 0){$found++} elsif ($eq == -1 && !$found){print $insert; $found++} print'
Given a list.txt whose content is:
ananas
apple
banana
pear
the output is:
ananas
apple
banana
lemon
pear
{
    local ($^I, @ARGV) = ("", $file);  # Enable in-place editing of $file
    while (<>) {
        # If we found the line exactly, bail out without printing it twice
        last if $_ eq $insert;
        # If we found the place where the line should be, insert it
        if ($_ gt $insert) {
            print $insert;
            print;
            last;
        }
        print;
    }
    # We've passed the insertion point, now output the rest of the file
    print while <>;
}
Essentially the same answer as pavel's, except with a lot of readability added. Note that $insert should already contain a trailing newline.
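As a small usage note (values illustrative, echoing the one-liner's example above), $file and $insert would be set up beforehand, e.g.:

my $file   = 'list.txt';
my $insert = "lemon\n";   # trailing newline required, as noted above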