Adding sequence from FASTA file using Perl

Adding sequence from FASTA file using Perl - perl

I'm still learning Perl and I have a program which is able to take a FASTA file sequence header and print only the species name within square brackets. I want to add to this code to have it also print the entire sequence associated with the species.
Here is my code:
#!/usr/bin/perl
use warnings;
my $file = 'seqs.fasta';
my $tmp = 'newseqs.fasta';
open(OUT, '>', $tmp) or die "Can't open $tmp: $!";
open(IN, '<', $file) or die "Can't open $file: $!";
while(<IN>) {
chomp;
if ( $_ =~ /\[([^]]+)\]/ ) {
print OUT "$1\n";
}
}
close(IN);
close(OUT);
Here is a sample of the original FASTA file I had:
>gi|334187971|ref|NP_001190408.1| Cam-binding protein 60-like G [Arabidopsis thaliana] >gi|332006244|gb|AED93627.1| Cam-binding protein 60-like G [Arabidopsis thaliana]
MKIRNSPSFHGGSGYSVFRARNLTFKKVVKKVMRDQSNNQFMIQMENMIRRIVREEIQRSLQPFLSSSCVSMERSRSETP
SSRSRLKLCFINSPPSSIFTGSKIEAEDGSPLVIELVDATTNTLVSTGPFSSSRVELVPLNADFTEESWTVEGFNRNILT
QREGKRPLLTGDLTVMLKNGVGVITGDIAFSDNSSWTRSRKFRLGAKLTGDGAVEARSEAFGCRDQRGESYKKHHPPCPS
DEVWRLEKIAKDGVSATRLAERKILTVKDFRRLYTIIGAGVSKKTWNTIVSHAMDCVLDETECYIYNANTPGVTLLFNSV
YELIRVSFNGNDIQNLDQPILDQLKAEAYQNLNRITAVNDRTFVGHPQRSLQCPQDPGFVVTCSGSQHIDFQGSLDPSSS
SMALCHKASSSTVHPDVLMSFDNSSTARFHIDKKFLPTFGNSFKVSELDQVHGKSQTVVTKGCIENNEEDENAFSYHHHD
DMTSSWSPGTHQAVETMFLTVSETEEAGMFDVHFANVNLGSPRARWCKVKAAFKVRAAFKEVRRHTTARNPREGL
Currently, the output only pulls the species name Arabidopsis thaliana
However, I want it to print properly in a fasta file as such:
>Arabidopsis thaliana
MKIRNSPSFHGGSGYSVFRARNLTFKKVVKKVMRDQSNNQFMIQMENMIRRIVREEIQRSLQPFLSSSCVSMERSRSETP
SSRSRLKLCFINSPPSSIFTGSKIEAEDGSPLVIELVDATTNTLVSTGPFSSSRVELVPLNADFTEESWTVEGFNRNILT
QREGKRPLLTGDLTVMLKNGVGVITGDIAFSDNSSWTRSRKFRLGAKLTGDGAVEARSEAFGCRDQRGESYKKHHPPCPS
DEVWRLEKIAKDGVSATRLAERKILTVKDFRRLYTIIGAGVSKKTWNTIVSHAMDCVLDETECYIYNANTPGVTLLFNSV
YELIRVSFNGNDIQNLDQPILDQLKAEAYQNLNRITAVNDRTFVGHPQRSLQCPQDPGFVVTCSGSQHIDFQGSLDPSSS
SMALCHKASSSTVHPDVLMSFDNSSTARFHIDKKFLPTFGNSFKVSELDQVHGKSQTVVTKGCIENNEEDENAFSYHHHD
DMTSSWSPGTHQAVETMFLTVSETEEAGMFDVHFANVNLGSPRARWCKVKAAFKVRAAFKEVRRHTTARNPREGL
Could you suggest ways to modify the code to achieve this?

That's because what this does:
if ( $_ =~ /\[([^]]+)\]/ ) {
print OUT "$1\n";
}
Is find and capture any text in []. But if that pattern doesn't match, you don't do anything else with the line - like print it.
Adding:
else {
print OUT $_;
}
Will mean if a line doesn't contain [] it'll get printed by default.
I will also suggest:
turn on use strict;.
lexical filehandles are good practice: open ( my $input, '<', $file ) or die $!;
a pattern match implicitly applies to $_ by default. So you can write that 'if' as if ( /\[([^]]+)\]/ )

A couple of general points about your program
You must always use strict as well as use warnings 'all' at the top of every Perl program you write. It will reveal many simple mistakes that you could otherwise easily overlook
You have done well to choose the three-parameter form of open, but you should also use lexical file handles. So this line
open(OUT, '>', $tmp) or die "Can't open $tmp: $!";
should be written as
open my $out_fh, '>', $tmp or die "Can't open $tmp: $!";
It's probably best to supply the input and output file names on the command line, so you don't have to edit your program to run it against different files
I would solve your problem like this. It checks to see if each line is a header that contains a string enclosed in square brackets. The first test is that the line starts with a close angle bracket >, and the second test is the same as you wrote in your own program that captures the bracketed string — the species name
If these checks are passed then the species name is printed with an closing angle bracket and a newline, otherwise the line is printed as it is
This program should be run like this
$ fasta_species.pl seqs.fasta > newseqs.fasta
The dollar is just the Linux prompt character, and it assumes you have put the program in a file names fasta_species.pl. You can omit the > newseqs.fasta to display the output directly to the screen so that you can see what is being produced without creating an output file and editing it
use strict;
use warnings 'all';
while ( <> ) {
if ( /^>/ and / \[ ( [^\[\]]+ ) \] /x ) {
print ">$1\n";
}
else {
print;
}
}
output
>Arabidopsis thaliana
MKIRNSPSFHGGSGYSVFRARNLTFKKVVKKVMRDQSNNQFMIQMENMIRRIVREEIQRSLQPFLSSSCVSMERSRSETP
SSRSRLKLCFINSPPSSIFTGSKIEAEDGSPLVIELVDATTNTLVSTGPFSSSRVELVPLNADFTEESWTVEGFNRNILT
QREGKRPLLTGDLTVMLKNGVGVITGDIAFSDNSSWTRSRKFRLGAKLTGDGAVEARSEAFGCRDQRGESYKKHHPPCPS
DEVWRLEKIAKDGVSATRLAERKILTVKDFRRLYTIIGAGVSKKTWNTIVSHAMDCVLDETECYIYNANTPGVTLLFNSV
YELIRVSFNGNDIQNLDQPILDQLKAEAYQNLNRITAVNDRTFVGHPQRSLQCPQDPGFVVTCSGSQHIDFQGSLDPSSS
SMALCHKASSSTVHPDVLMSFDNSSTARFHIDKKFLPTFGNSFKVSELDQVHGKSQTVVTKGCIENNEEDENAFSYHHHD
DMTSSWSPGTHQAVETMFLTVSETEEAGMFDVHFANVNLGSPRARWCKVKAAFKVRAAFKEVRRHTTARNPREGL

Related

Perl - Compare two large txt files and return the required lines from the first

So I am quite new to perl programming. I have two txt files, combined_gff.txt and pegs.txt.
I would like to check if each line of pegs.txt is a substring for any of the lines in combined_gff.txt and output only those lines from combined_gff.txt in a separate text file called output.txt
However my code returns empty. Any help please ?
P.S. I should have mentioned this. Both the contents of the combined_gff and pegs.txt are present as rows. One row has a string. second row has another string. I just wish to pickup the rows from combined_gff whose substrings are present in pegs.txt
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my #gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my #ext = <DATA>;
close DATA;
my $str = ''; #final string
foreach my $gffline (#gff) {
foreach my $extline (#ext) {
if ( index($gffline, $extline) != -1) {
$str=$str.$gffline;
$str=$str."\n";
exit;
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);

The first problem is exit. The output file is never created if a substring is found.
The second problem is chomp: you don't remove newlines from the lines, so the only way how a substring can be found is when a string from pegs.txt is a suffix of a string from combined_gff.txt.
Even after fixing these two problems, the algorithm will be very slow, as you're comparing each line from one file to each line of the second file. It will also print a line multiple times if it contains several different substrings (not sure if that's what you want).
Here's a different approach: First, read all the lines from pegs.txt and assemble them into a regex (quotemeta is needed so that special characters in substrings are interpreted literally in the regex). Then, read combined_gff.txt line by line, if the regex matches the line, print it.
#!/usr/bin/perl
use warnings;
use strict;
open my $data, '<', 'pegs.txt' or die $!;
chomp( my #ext = <$data> );
my $regex = join '|', map quotemeta, #ext;
open my $file, '<', 'combined_gff.txt' or die $!;
open my $out, '>', 'output.txt' or die $!;
while (<$file>) {
print {$out} $_ if /$regex/;
}
close $out;
I also switched to 3 argument version of open with lexical filehandles as it's the canonical way (3 argument version is safe even for files named >file or rm *| and lexical filehandles aren't global and are easier to pass as arguments to subroutines). Also, showing the actual error is more helpful than just dying with "error".

As choroba says you don't need the "exit" inside the loop since it ends the complete execution of the script and you must remove the line forwards (LF you do it by chomp lines) to find the matches.
Following the logic of your script I made one with the corrections and it worked fine.
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my #gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my #ext = <DATA>;
close DATA;
my $str = ''; #final string
foreach my $gffline (#gff) {
chomp($gffline);
foreach my $extline (#ext) {
chomp($extline);
print $extline;
if ( index($gffline, $extline) > -1) {
$str .= $gffline ."\n";
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);
Hope it works for you.
Welcho

Search for multiple terms in perl

I have a file with more than hundred single column entries. I need to search for each of these entries into a file of multiple column and more than thousand entries and need a output file. I tried these codes:
#!/usr/bin/perl -w
use strict;
use warnings;
print "Enter the input file name:";
my $inputfile = <STDIN>;
chomp($inputfile);
print "\nEnter the search file name:";
my $searchfile=<STDIN>;
chomp($searchfile);
open (INPUTFILE, $inputfile) || die;
open (SEARCHFILE, $searchfile) || die;
open (OUT, ">write.txt") || die;
while (my $line=<SEARCHFILE>){
while (<INPUTFILE>) {
if (/$line/){
print OUT $_;
}
}
}
close (INPUTFILE) || die;
close (SEARCHFILE) || die;
close (OUT) || die;
The output file has only one line. It has searched the term from the search file into input file, but only for the first term, not for all. Please help!

When you read INPUTFILE in the inner loop, it's read to the end during the first round of SEARCHFILE. Because it's not reset, the filehandle is used up and will always return eof.
If there are hundreds of lines, but not several 100,000 you can easily read it into an array first and then use that for the lookup. The fact that it's single-column makes that very easy. Note that this is less efficient then the alternative solution below.
chomp( my #needles = <SEARCHFILE> );
while (<INPUTFILE>) {
foreach my $needle (#needles) {
print OUT $_ if m/\Q$needle\E/; # \Q end \E quote regex meta chars
}
}
Alternatively you can also build one large lookup regex that matches all the strings in one go. That is probably faster than iterating the array for each line.
# open ...
chomp( my #needles = <SEARCHFILE> );
my $lookup = join '|', map quotemeta, #needles;
my $lookup_regex = qr/$lookup/; # possibly with /i?
while (my $line = <INPUTFILE>) {
print OUT $line if $line =~ $lookup_regex;
}
The quotemeta takes care of strings that contain regex meta characters like / or | or even .. It's the same as using \Q and \E as above.
Please also use three-argument-open and named filehandles.
open my $fh_searchfile, '<', $searchfile or die $!;
open my $fh_inputfile, '<', $inputfile or die $!;
open my $fh_out, '>', 'write.txt' or die $!;
chomp( my #needles = <$fh_searchfile> );
# ...
The three-argument-open is important because you are taking user input and using it as the filename directly. A malicious user could enter something like | rm -rf *, which would open a pipe to a delete all my files without asking program. Oops. But if you specify the '<' read open method explicitly in its own parameter, the method characters are ignored in the third param.
The lexical filehandle $fh is, as the name says, lexical, while INPUTFILE is a GLOB, which makes it global. That's not so bad if you only have this one script and no modules, but as soon as you deal with different packages it becomes problematic because those are super-global and every part of the program sees them. That can lead to name collisions and weird stuff happening.

How to find position of a word by using a counter?

I am currently working on a code that changes certain words to Shakespearean words. I have to extract the sentences that contain the words and print them out into another file. I had to remove .START from the beginning of each file.
First I split the files with the text by spaces, so now I have the words. Next, I iterated the words through a hash. The hash keys and values are from a tab delimited file that is structured as so, OldEng/ModernEng (lc_Shakespeare_lexicon.txt). Right now, I'm trying to figure out how to find the exact position of each modern English word that is found, change it to the Shakespearean; then find the sentences with the change words and printing them out to a different file. Most of the code is finished except for this last part. Here is my code so far:
#!/usr/bin/perl -w
use diagnostics;
use strict;
#Declare variables
my $counter=();
my %hash=();
my $conv1=();
my $conv2=();
my $ssph=();
my #text=();
my $key=();
my $value=();
my $conversion=();
my #rmv=();
my $splits=();
my $words=();
my #word=();
my $vals=();
my $existingdir='/home/nelly/Desktop';
my #file='Sentences.txt';
my $eng_words=();
my $results=();
my $storage=();
#Open file to tab delimited words
open (FILE,"<", "lc_shakespeare_lexicon.txt") or die "could not open lc_shakespeare_lexicon.txt\n";
#split words by tabs
while (<FILE>){
chomp($_);
($value, $key)= (split(/\t/), $_);
$hash{$value}=$key;
}
#open directory to Shakespearean files
my $dir="/home/nelly/Desktop/input";
opendir(DIR,$dir) or die "can't opendir Shakespeare_input.tar.gz";
#Use grep to get WSJ file and store into an array
my #array= grep {/WSJ/} readdir(DIR);
#store file in a scalar
foreach my $file(#array){
#open files inside of input
open (DATA,"<", "/home/nelly/Desktop/input/$file") or die "could not open $file\n";
#loop through each file
while (<DATA>){
#text=$_;
chomp(#text);
#Remove .START
#rmv=grep(!/.START/, #text);
foreach $splits(#rmv){
#split data into separate words
#word=(split(/ /, $splits));
#Loop through each word and replace with Shakespearean word that exists
$counter=0;
foreach $words(#word){
if (exists $hash{$words}){
$eng_words= $hash{$words};
$results=$counter;
print "$counter\n";
$counter++;
#create a new directory and store senteces with Shakespearean words in new file called "Sentences.txt"
mkdir $existingdir unless -d $existingdir;
open my $FILE, ">>", "$existingdir/#file", or die "Can't open $existingdir/conversion.txt'\n";
#print $FILE "#words\n";
close ($FILE);
}
}
}
}
}
close (FILE);
close (DIR);

Natural language processing is very hard to get right except in trivial cases, for instance it is difficult to define exactly what is meant by a word or a sentence, and it is awkward to distinguish between a single quote and an apostrophe when they are both represented using the U+0027 "apostrophe" character '
Without any example data it is difficult to write a reliable solution, but the program below should be reasonably close
Please note the following
use warnings is preferable to -w on the shebang line
A program should contain as few comments as possible as long as it is comprehensible. Too many comments just make the program bigger and harder to grasp without adding any new information. The choice of identifiers should make the code mostly self documenting
I believe use diagnostics to be unnecessary. Most messages are fairly self-explanatory, and diagnostics can produce large amounts of unnecessary output
Because you are opening multiple files it is more concise to use autodie which will avoid the need to explicitly test every open call for success
It is much better to use lexical file handles, such as open my $fh ... instead of global ones, like open FH .... For one thing a lexical file handle will be implicitly closed when it goes out of scope, which helps to tidy up the program a lot by making explicit close calls unnecessary
I have removed all of the variable declarations from the top of the program except those that are non-empty. This approach is considered to be best practice as it aids debugging and assists the writing of clean code
The program lower-cases the original word using lc before checking to see if there is a matching entry in the hash. If a translation is found, then the new word is capitalised using ucfirst if the original word started with a capital letter
I have written a regular expression that will take the next sentence from the beginning of the string $content. But this is one of the things that I can't get right without sample data, and there may well be problems, for instance, with sentences that end with a closing quotation mark or a closing parenthesis
use strict;
use warnings;
use autodie;
my $lexicon = 'lc_shakespeare_lexicon.txt';
my $dir = '/home/nelly/Desktop/input';
my $existing_dir = '/home/nelly/Desktop';
my $sentences = 'Sentences.txt';
my %lexicon = do {
open my ($fh), '<', $lexicon;
local $/;
reverse(<$fh> =~ /[^\t\n\r]+/g);
};
my #files = do {
opendir my ($dh), $dir;
grep /WSJ/, readdir $dh;
};
for my $file (#files) {
my $contents = do {
open my $fh, '<', "$dir/$file";
join '', grep { not /\A\.START/ } <$fh>;
};
# Change any CR or LF to a space, and reduce multiple spaces to single spaces
$contents =~ tr/\r\n/ /;
$contents =~ s/ {2,}/ /g;
# Find and process each sentence
while ( $contents =~ / \s* (.+?[.?!]) (?= \s+ [A-Z] | \s* \z ) /gx ) {
my $sentence = $1;
my #words = split ' ', $sentence;
my $changed;
for my $word (#words) {
my $eng_word = $lexicon{lc $word};
$eng_word = ucfirst $eng_word if $word =~ /\A[A-Z]/;
if ($eng_word) {
$word = $eng_word;
++$changed;
}
}
if ($changed) {
mkdir $existing_dir unless -d $existing_dir;
open my $out_fh, '>>', "$existing_dir/$sentences";
print "#words\n";
}
}
}

Match file names in if condition of perl

I've files with filenames such as lin.txt and lin1.txt along with other .txt files. I need to find only these files and print its content only by one. I've the below code, but its somehow not matching the files starting with lin*. What is the issue?
$te_dir= "/projects/xxx/";
opendir (DIR, $te_dir) or die $!;
while (my $file = readdir(DIR))
{
if ($file=~/\.txt/)
{
#// Doing some tasks.
if($file ~= 'lin*.txt')
{
$linfile=$te_dir/$file;
open(LINFILE, $linfile) or die "Couldn't open file $file:$!";
while(my $line = <LINFILE>)
{
print $line;
}
close LINFILE;
}
}
}

You are mixing globs (shell wildcards) with regular expressions. These are two different formalisms with different syntax and semantics. In regular expressions (which is what Perl matching uses), n* matches zero or more occurrences of the character n. You probably mean
if ($file =~ /lin.*\.txt/)
Notice also the syntax error in the operator. You correctly have =~ in the first conditional, but you misspelled it as ~= where you do this comparison. (Maybe it's just a transcription error; for me, this creates a clear syntax error, so the script would not run in the first place.)
As noted in #brianadams' answer, the proper regular expression for this is
if ($file =~ /^lin.*\.txt$/)
with beginning of line ^ and end of line $ anchors to prevent e.g. feline.txt.html from matching. The default behavior of Perl's regular expressions is to find a match anywhere in the input string.

Here's a quick (and minimal) rewrite of your code that might help:
use strict;
use warnings;
my $te_dir = "/projects/xxx/";
opendir( my $dirh, $te_dir ) or die "Could not open '$te_dir': $!";
while ( my $file = readdir($dirh) ) {
next unless $file =~ /\.txt$/;
#// Doing some tasks.
if ( $file =~ /^ lin \d* \.txt $/x ) {
my $linfile = "$te_dir/$file";
open( my $fh, $linfile ) or die "Couldn't open file $linfile: $!";
while ( my $line = <$fh> ) {
print $line;
}
close $fh or die "Could not close $linfile: $!";
}
}
First, note that we've put strict and warnings at the top of the code. That will tell you about all sorts of interesting issues, including misspelled variable names.
Next, we've switch to lexical handles (e.g., my $dirh instead of DIR). The "bareword" version of the handles you're using (DIR and LINFILE have been discouraged for a long time because those are effectively global constructs and generally global data is bad because when it gets broken, it's awfully hard to tell what broke it, so we much, much prefer the lexical versions (the handles declared with the my builtin).
Also, this line you had probably doesn't do what you're thinking:
$linfile=$te_dir/$file;
You're trying to smash together a directory and filename with a forward slash, but since you didn't use string interpolation, you're actually using division. Both your director and filename will, in this numeric context, probably evaluate to zero, giving you a divide by zero error when you're trying to open a file!
However, if you're willing to use a CPAN module, you can make this even easier:
use strict;
use warnings;
use File::Find::Rule;
my $te_dir = "/projects/xxx/";
my #files = File::Find::Rule->file->name('lin*.txt')->in($te_dir);
foreach my $linfile (#files) {
#// Doing some tasks.
open my $fh, $linfile or die "Couldn't open file $linfile: $!";
while ( my $line = <$fh> ) {
print $line;
}
}
No muss, no fuss. Get only the files you want in the first pass and already have the correct file names (note that I didn't close the filehandle because it will close automatically when $fh goes out of scope at the end of the foreach loop.)

To match files starting with lin
if ( $file =~ /^lin.*\.txt$/ )

Try changing your 2nd if condition from this,
if($file ~= 'lin*.txt')
to this,
if($file =~ /lin*\.txt/)
You could also try: if($file =~ /^lin*\.txt/) , as already pointed out in other answers, but you'll need to make sure that the file names stored in the $file variable contain only the file name and not the entire path as well.

What can be wrong with word count program?

I've got a question in my test:
What is wrong with program that counts number of lines and words in file?
open F, $ARGV[0] || die $!;
my #lines = <F>;
my #words = map {split /\s/} #lines;
printf "%8d %8d\n", scalar(#lines), scalar(#words);
close(F);
My conjectures are:
If file does not exist, program won't tell us about that.
If there are punctuation signs in file, program will count them, for example, in
abc cba
, , ,dce
will be five word, but on the other hand wc outputs the same result, so it might be considered as correct behavior.
If F is a large file, it might be better to iterate over lines and not to dump it into lines array.
Do you have any less trivial ideas?

On the first line, you have a precedence problem:
open F, $ARGV[0] || die $!;
is the same as
open F, ($ARGV[0] || die $!);
which means the die is executed if the filename is false, not if the open fails. You wanted to say
open(F, $ARGV[0]) || die $!;
or
open F, $ARGV[0] or die $!;
Also, you should be using the 3 argument form of open, in case $ARGV[0] contains characters that mean something to open.
open F, '<', $ARGV[0] or die $!;
On a different note, splitting on /\s/ means that you get a "word" between consecutive whitespace characters. You probably meant /\s+/, or as amphetamachine suggested, /\W+/, depending on how you want to define a "word".
That still leaves the problem of the empty "word" you get if the line begins with whitespace. You could split on ' ' to suppress that (it's a special case), or you could trim leading whitespace first, or insert a grep { length $_ } to weed out empty "words", or abandon split and use a different method for counting words.
Processing line by line instead of reading the whole file at once would also be a good improvement, but it's not as important as those first two items.

Your conjecture #1 is incorrect: your program will die if the open fails. (see cjm's answer re order of operations.)
you're using a global filehandle, rather than a lexical variable.
you're not using the three-argument form of open.
you could just read from stdin, which gives more flexibility as to input - the user can provide a file, or pipe the input into stdin.
lastly, I wouldn't write my own code to parse words; I'd reach for CPAN, say something like Lingua::EN::Splitter.
use strict; use warnings;
use Lingua::EN::Splitter qw(words);
my ($wordcount, $lines);
while (<>)
{
my $line = $_;
$lines++;
$wordcount += scalar(words $line);
}
printf "%8d %8d\n", $lines, $wordcount;

When you open F, $ARGV[0] || die $! that will effectively exit if the file doesn't exist.
There are some improvements to be made here:
{local $/; $lines = <F>;} # read all lines at once
my #words = split /\W+/, $lines;

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Adding sequence from FASTA file using Perl - perl

Related

Perl - Compare two large txt files and return the required lines from the first

Search for multiple terms in perl

How to find position of a word by using a counter?

Match file names in if condition of perl

What can be wrong with word count program?

Categories

Resources