Why is my last line is always output twice? - perl

I have a uniprot document with a protein sequence as well as some metadata. I need to use perl to match the sequence and print it out but for some reason the last line always comes out two times. The code I wrote is here
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if($_=~m /^\s+(\D+)/) { #this is the pattern I used to match the sequence in the document
$seq=$1;
$seq=~s/\s//g;} #removing the spaces from the sequence
print $seq;
}
I instead tried $seq.=$1; but it printed out the sequence 4.5 times. Im sure i have made a mistake here but not sure what. Here is the input file https://www.uniprot.org/uniprot/P30988.txt

Here is your code reformatted and extra whitespace added between operators to make it clearer what scope the statements are running in.
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
}
print $seq;
}
The placement of the print command means that $seq will be printed for every line from the input file -- even those that don't match the regex.
I suspect you want this
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
# only print $seq for lines that match with /^\s+(\D+)/
# Also - added a newline to make it easier to debug
print $seq . "\n";
}
}
When I run that I get this
MRFTFTSRCLALFLLLNHPTPILPAFSNQTYPTIEPKPFLYVVGRKKMMDAQYKCYDRMQ
QLPAYQGEGPYCNRTWDGWLCWDDTPAGVLSYQFCPDYFPDFDPSEKVTKYCDEKGVWFK
HPENNRTWSNYTMCNAFTPEKLKNAYVLYYLAIVGHSLSIFTLVISLGIFVFFRSLGCQR
VTLHKNMFLTYILNSMIIIIHLVEVVPNGELVRRDPVSCKILHFFHQYMMACNYFWMLCE
GIYLHTLIVVAVFTEKQRLRWYYLLGWGFPLVPTTIHAITRAVYFNDNCWLSVETHLLYI
IHGPVMAALVVNFFFLLNIVRVLVTKMRETHEAESHMYLKAVKATMILVPLLGIQFVVFP
WRPSNKMLGKIYDYVMHSLIHFQGFFVATIYCFCNNEVQTTVKRQWAQFKIQWNQRWGRR
PSNRSARAAAAAAEAGDIPIYICHQELRNEPANNQGEESAEIIPLNIIEQESSA

You can simplify this a bit:
while (<IN>) {
next unless m/^\s/;
s/\s+//g;
print;
}
You want the lines that begin with whitespace, so immediately skip those that don't. Said another way, quickly reject things you don't want, which is different than accepting things you do want. This means that everything after the next knows it's dealing with a good line. Now the if disappears.
You don't need to get a capture ($1) to get the interesting text because the only other text in the line is the leading whitespace. That leading whitespace disappears when you remove all the whitespace. This gets rid of the if and the extra variable.
Finally, print what's left. Without an argument, print uses the value in the topic variable $_.
Now that's much more manageable. You escape that scoping issue with if causing the extra output because there's no scope to worry about.

Related

Perl with FASTA sequence extraction has problems (only) with first sequence

I am using a function/subroutine extract_seq available on internet to extract sequences in FASTA files. Briefly:
A sequence begins with first line identified by '>', followed by ID and other information separated by spaces
Subsequent lines (not beginning with '>' have multiple strings
A FASTA file can have 1 or more sequences
Bug is that the output has additional '>' character for first sequence (only) causing consistency problems.
Program works fine in extracting sequences based on ID except for additional '>' in case of first sequence. Could you please suggest a solution as well as reason for the bug? A simple regex would fix the problem but I do not feel good about fixing bugs that I cannot understand.
The Perl script is:
#!/usr/bin/perl -w
use strict;
my $seq_all = "seq_all.fa"; # all proteins in fasta format
foreach my $q_seq ("A0A1D8PC43","A0A1D8PC38") {
print "Querying $q_seq\n";
&extract_seq($seq_all, $q_seq);
}
exit 0;
sub extract_seq
{
open(my $fh, ">query.seq");
my $seq_all = $_[0];
my $lookup = $_[1];
local $/ = "\n>";
#ARGV = ($seq_all);
while (my $seq = <>) {
chomp $seq;
my ($id) = $seq =~ /^>*(\S+)/;
if ($id eq $lookup) {
print "$seq\n";
last;
}
}
}
The FASTA file is:
>A0A1D8PC43 A0A1D8PC43_CANAL Diphosphomevalonate decarboxylase
MYSASVTAPVNIATLKYWGKRDKSLNLPTNSSISVTLSQDDLRTLTTASASESFEKDQLW
LNGKLESLDTPRTQACLADLRKLRASIEQSPDTPKLSQMKLHIVSENNFPTAAGLASSAA
GFAALVSAIAKLYELPQDMSELSKIARKGSGSACRSLFGGFVAWEMGTLPDGQDSKAVEI
APLEHWPSLRAVILVVSDDKKDTPSTTGMQSTVATSDLFAHRIAEVVPQRFEAMKKAILD
KDFPKFAELTMKDSNSFHAVCLDSYPPIFYLNDTSKKIIKMVETINQQEVVAAYTFDAGP
NAVIYYDEANQDKVLSLLYKHFGHVPGWKTHYTAETPVAGVSRIIQTSIGPGPQETSESL
TK
>A0A1D8PC56 A0A1D8PC56_CANAL Uncharacterized protein OS=Candida
MSDTKKTTETDSEVGYLDIYLRFNDDMEKDYCFQVKTTTVFKDLYKVFRTLPISLRPSVF
YHAQPIGFKKSVSPGYLTQDGNFIFDEDSQKQAVPVNDNDLINETVWPGQLILPVWQFND
FGFYSFLAFLACWLYTDLPDFISPTPGICLTNQMTKLMAWVLVQFGKDRFAETLLADLYD
TVGVGAQCVFFGFHIIKCLFIFGFLYTGVFNPMRVFRLTPRSVKLDVTKEELVKLGWTGT
RKATIDEYKEYYREFKINQHGGMIQAHRAGLFNTLRNLGVQLESGEGYNTPLTEENKLRT
MRQIVEDAKKPDFKLKLSYEYFAELGYVFATNAENKEGSELAQLIKQYRRYGLLVSDQRI
KTVVRARKGETDEEKPKVEEVVEE
>A0A1D8PC67 A0A1D8PC67_CANAL Bfa1p OS=Candida albicans (strain
MVSDKLTLLRQFSEEDELFGDIEGIDYHDGETLKINKFSFPSSASSPSFAITGQSPNMRS
INGKRITRETLSEYSEENETDLTSEFSDQEFEWDGFNKNQSIYQQMNQRLIATKVAKQRE
AEREQRELMQKRHKDYDPNQTLRLKDFNKLTNENLTLLDQLDDEKTVNYEYVRDDVEDFA
QGFDKDFETKLRIQPSMPTLRSNAPTLKKYKSYGEFKCDNRVKQKLDRIPSFYNKNQLLS
KFKETKSYHPHHKKMGTVRCLNNNSEVPVTYPSISNMKLNKEKNRWEGNDIDLIRFEKPS
LITHKENKTKKRQGNMVYDEQNLRWINIESEHDVFDDIPDLAVKQLQSPVRGLSQFTQRT
TSTTATATAPSKNNETQHSDFEISRKLVDKFQKEQAKIEKKINHWFIDTTSEFNTDHYWE
IRKMIIEE
>A0A1D8PC38 A0A1D8PC38_CANAL Cta2p OS=Candida albicans (strain
MPENLQTRLHNSLDEILKSSGYIFEVIDQNRKQSNVITSPNNELIQKSITQSLNGEIQNF
HAILDQTVSKLNDAEWCLGVMVEKKKKHDELKVKEEAARKKREEEAKKKEEEAKKKAEEA
KKKEEEAKKAEEAKKAEEAKKVEEAAKKAEEAKKAEEEARKKAETAPQKFDNFDDFIGFD
INDNTNDEDMLSNMDYEDLKLDDKVPATTDNNLDMNNILENDESILDGLNMTLLDNGDHV
NEEFDVDSFLNQFGN
Edit:
The problem, as explained above, I face is that the output has additional '>' character for first sequence (only). I do not see the reason for the same and this is causing a lot of trouble. Output is:
Querying A0A1D8PC43
>A0A1D8PC43 A0A1D8PC43_CANAL Diphosphomevalonate decarboxylase
MYSASVTAPVNIATLKYWGKRDKSLNLPTNSSISVTLSQDDLRTLTTASASESFEKDQLW
LNGKLESLDTPRTQACLADLRKLRASIEQSPDTPKLSQMKLHIVSENNFPTAAGLASSAA
GFAALVSAIAKLYELPQDMSELSKIARKGSGSACRSLFGGFVAWEMGTLPDGQDSKAVEI
APLEHWPSLRAVILVVSDDKKDTPSTTGMQSTVATSDLFAHRIAEVVPQRFEAMKKAILD
KDFPKFAELTMKDSNSFHAVCLDSYPPIFYLNDTSKKIIKMVETINQQEVVAAYTFDAGP
NAVIYYDEANQDKVLSLLYKHFGHVPGWKTHYTAETPVAGVSRIIQTSIGPGPQETSESL
TK
Querying A0A1D8PC38
A0A1D8PC38 A0A1D8PC38_CANAL Cta2p OS=Candida albicans (strain
MPENLQTRLHNSLDEILKSSGYIFEVIDQNRKQSNVITSPNNELIQKSITQSLNGEIQNF
HAILDQTVSKLNDAEWCLGVMVEKKKKHDELKVKEEAARKKREEEAKKKEEEAKKKAEEA
KKKEEEAKKAEEAKKAEEAKKVEEAAKKAEEAKKAEEEARKKAETAPQKFDNFDDFIGFD
INDNTNDEDMLSNMDYEDLKLDDKVPATTDNNLDMNNILENDESILDGLNMTLLDNGDHV
NEEFDVDSFLNQFGN
$/ is the input record separator, setting local $/="\n>"; effect is that input is split into record ending with \n>, after chomp, the ending is removed however />*(\S+)/ may not match because > is consumed from previous record.
from FASTA wikipedia a line beginning by > is a comment and may not always be an id. However in case it is always the case, following may fix.
my ($id,$seq) = $seq =~ /^>*(.*)\n(\S+)/;
You set the record separator to \n>. This does not apply to the first sequence.
Fixed code sequence:
...
chomp $seq;
# for first sequence
$seq =~ s/^>//;
my ($id) = $seq =~ /^(\S+)/;
if ($id eq $lookup) {
...
Please note that your implementation is extremely inefficient, because it reads & parses the file contents for each query. How about splitting loading/parsing and querying into separate functions?
Alternative solution: give the full list of lookup values to the loader. It would then fill an answer array as it encounters the matches during reading the file.

Removing lines containing a string from a file w/ perl

I'm trying to take a file INPUT and, if a line in that file contains a string, replace the line with something else (the entire line, including line breaks), or nothing at all (remove the line like it wasn't there). Writing all this to a new file .
Here's that section of code...
while(<INPUT>){
if ($_ =~ / <openTag>/){
chomp;
print OUTPUT "Some_Replacement_String";
} elsif ($_ =~ / <\/closeTag>/) {
chomp;
print OUTPUT ""; #remove the line
} else {
chomp;
print OUTPUT "$_\r\n"; #print the original line
}
}
while(<INPUT>) should read one line at a time (if my understanding is correct) and store each line in the special variable $_
However, when I run the above code I get only the very first if statement condition returned Some_Replacement_String, and only once. (1 line, out of a file with 1.3m, and expecting 600,000 replacements). This obviously isn't the behavior I expect. If I do something like while(<INPUT>){print OUTPUT $_;) I get a copy of the entire file, every line, so I know the entire file is being read (expected behavior).
What I'm trying to do is get a line, test it, do something with it, and move on to the next one.
If it helps with troubleshooting at all, if I use print $.; anywhere in that while statement (or after it), I get 1 returned. I expected this to be the "Current line number for the last filehandle accessed.". So by the time my while statement loops through the entire file, it should be equal to the number of lines in the file, not 1.
I've tried a few other variations of this code, but I think this is the closest I've come. I assume there's a good reason I'm not getting the behavior I expect, can anyone tell me what it is?
The problem you are describing indicates that your input file only contains one line. This may be because of a great many different things, such as:
You have changed the input record separator $/
Your input file does not contain the correct line endings
You are running your script with -0777 switch
Some notes on your code:
if ($_ =~ / <openTag>/){
chomp;
print OUTPUT "Some_Replacement_String";
No need to chomp a line you are not using.
} elsif ($_ =~ / <\/closeTag>/) {
chomp;
print OUTPUT "";
This is quite redundant. You don't need to print an empty string (ever, really), and chomp a value you're not using.
} else {
chomp;
print OUTPUT "$_\r\n"; #print the original line
No need to remove newlines and then put them back. Also, normally you would use \n as your line ending, even on windows.
And, since you are chomping in every if-else clause, you might as well move that outside the entire if-block.
chomp;
if (....) {
But since you are never relying on line endings not being there, why bother using chomp at all?
When using the $_ variable, you can abbreviate some commands, such as you are doing with chomp. For example, a lone regex will be applied to $_:
} elsif (/ <\/closeTag>/) { # works splendidly
When, like above, you have a regex that contains slashes, you can choose another delimiter for your regex, so that you do not need to escape the slashes:
} elsif (m# </closeTag>#) {
But then you need to use the full notation of the m// operator, with the m in front.
So, in short
while(<INPUT>){
if (/ <openTag>/){
print OUTPUT "Some_Replacement_String";
} elsif (m# </closeTag>#) {
# do nothing
} else {
print OUTPUT $_; # print the original line
}
}
And of course, the last two can be combined into one, with some negation logic:
} elsif (not m# </closeTag>#) {
print OUTPUT $_;
}

grep keyword with if-condition

I have an input file with
words;
yadda yadda;
keyword 123;
yadda;
and I want to simply get the value 123 saved as a variable. I tried a solution from here:
my $var;
open(FILE,$data.dat) or die "error on opening $data: $!\n";
while (my $line = <FILE>) {
if (/^keyword/) {
$var = $1;
print $line;
last;
}
}
close(FILE);
This isn't working and gives me following error: Use of uninitialized value $_ in pattern match (m//) at ./script.pl line 91, <FILE> line 384. (this occurs for all lines of <FILE>)
I found another solution without the if-condition which just states #string = sort grep /^keyword/,<FILE>; and works. Can you please explain to me what is happening here?
/edit
Thx for the answers and explanations! What do you think is the better/more elegant way? The grep or the if-condition?
You need the following change:
if ($line =~ m/^keyword\s+(\d+)/)
Explanation: You read into $line, hence $_ which is the default target for match is undefined.
In addition, you'd get another error with $1, because your pattern did not specify a capturing group.
$1 refers to the first capture group, but your regex doesn't contain any capture groups, so it's undefined. Try
if ($line =~ /^keyword\s+(-?(?:\d+|\d*\.\d*)(?:[Ee]-?(?:\d+|\d*\.\d*))?)/) {
Notice also that the regex is being applied to the variable containing the line you just read.
Edit: Updated to cope with numbers in scientific notation. This is a significant additional requirement which you should have specified explicitly in the first place.

Why the digits are being displayed using perl regular expressions?

I'm using \D to not display digits but why the digits are being displayed using perl regular expressions?
Here's the content of the text2.tx file
1. Hello Brue this is a test.
2. Hello Lisa this is a test.
This is a test 1.
This is a test 2.
Here is the perl program.
#!/usr/bin/perl
use strict;
use warnings;
open READFILE,"<", "test2.txt" or die "Unable to open file";
while(<READFILE>)
{
if(/\D/)
{
print;
}
}
/\D/ just checks that the line has at least one non-digit character (including the newline...). Can you explain what you wanted to check? What output you were expecting?
If you want to only print lines that don't have a digit, you want to do:
if ( ! /\d/ )
(does the line not have a digit), not
if ( /\D/ )
(does the line have a non-digit).
Lets take a look at what is going on behind the scenes. Your while loop is equivalent to:
while(defined($_ = <READFILE>))
{
if($_ =~ /\D/)
{
print $_;
}
}
So, you are checking if the line contains a non-digit character (which it does) and then printing that line.
If you want to print Hello Brue this is a test. instead of 1. Hello Brue this is a test., then you would have to use something like:
while(<READFILE>) {
s/^\d+\. //;
print;
}
Also, it would make for more readable code if you used a variable rather than $_.
What you want is to reject lines that have a digit rather than match lines that have a non-digit (as you're doing)
while (<READFILE>) {
print unless /\d/;
}
This will print each line unless it has a digit on it.

How can i detect symbols using regular expression in perl?

Please how can i use regular expression to check if word starts or ends with a symbol character, also how to can i process the text within the symbol.
Example:
(text) or te-xt, or tex't. or text?
change it to
(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?
help me out?
Thanks
I assume that "word" means alphanumeric characters from your example? If you have a list of permitted characters which constitute a valid word, then this is enough:
my $string = "x1 .text1; 'text2 \"text3;\"";
$string =~ s/([a-zA-Z0-9]+)/<t>$1<\/t>/g;
# Add more to character class [a-zA-Z0-9] if needed
print "$string\n";
# OUTPUT: <t>x1</t> .<t>text1</t>; '<t>text2</t> "<t>text3</t>;"
UPDATE
Based on your example you seem to want to DELETE dashes and apostrophes, if you want to delete them globally (e.g. whether they are inside the word or not), before the first regex, you do
$string =~ s/['-]//g;
I am using DVK's approach here, but with a slight modification. The difference is that her/his code would also put the tags around all words that don't contain/are next to a symbol, which (according to the example given in the question) is not desired.
#!/usr/bin/perl
use strict;
use warnings;
sub modify {
my $input = shift;
my $text_char = 'a-zA-Z0-9\-\''; # characters that are considered text
# if there is no symbol, don't change anything
if ($input =~ /^[a-zA-Z0-9]+$/) {
return $input;
}
else {
$input =~ s/([$text_char]+)/<t>$1<\/t>/g;
return $input;
}
}
my $initial_string = "(text) or te-xt, or tex't. or text?";
my $expected_string = "(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?";
# version BEFORE edit 1:
#my #aux;
# take the initial string apart and process it one word at a time
#my #string_list = split/\s+/, $initial_string;
#
#foreach my $string (#string_list) {
# $string = modify($string);
# push #aux, $string;
#}
#
# put the string together again
#my $final_string = join(' ', #aux);
# ************ EDIT 1 version ************
my $final_string = join ' ', map { modify($_) } split/\s+/, $initial_string;
if ($final_string eq $expected_string) {
print "it worked\n";
}
This strikes me as a somewhat long-winded way of doing it, but it seemed quicker than drawing up a more sophisticated regex...
EDIT 1: I have incorporated the changes suggested by DVK (using map instead of foreach). Now the syntax highlighting is looking even worse than before; I hope it doesn't obscure anything...
This takes standard input and processes it to and prints on Standard output.
while (<>) {
s {
( [a-zA-z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;
print ;
}
You might need to change the bit to match the concept of word.
I have use the x modifeid to allow the regexx to be spaced over more than one line.
If the input is in a Perl variable, try
$string =~ s{
( [a-zA-z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;