Extract specific FASTA sequences from a file using Perl

I have written a Perl script to retrieve a hypothetical protein list from a FASTA file. I'm able to get only the header line with all hypothetical proteins, but I want to have all the sequences along with protein IDs.
The script is as follows.
#!/usr/bin/perl
use strict;
use warnings;
my $line;
open $fh, '<', '/home/Desktop/hypo_proteins/testprotein.fasta' or die "Cannot open file $fh, $!";
open OUT, ">output.txt";
while ( $line = <$fh> ) {
chomp $line;
if ( $line =~ /hypothetical protein/ ) {
print OUT "$line\n";
}
}
close $fh;
The output that I got from the above script is as follows:
>gi|113461928|ref|YP_718205.1| hypothetical protein HS_1792 [Haemophilus somnus 129PT]
>gi|113460158|ref|YP_718214.1| hypothetical protein HS_0009 [Haemophilus somnus 129PT]
>gi|113460165|ref|YP_718221.1| hypothetical protein HS_0016 [Haemophilus somnus 129PT]
But I need the output as follows:
>gi|113461928|ref|YP_718205.1| hypothetical protein HS_1792 [Haemophilus somnus 129PT]
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLILALRYLISQK
FKISSKGPGDGWLTEDGLWLMSKTTADQIRAYLMGQGISVPSDNRKLFDEMQAHRVIESTSEGNAIWYCQ
LSADAGWKPKDKFSLLRIKPEVIWDNIDDRPELFAGTICVVEKENEAEEKISNTVNEVQDTVPINKKENI
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKDKLFKKQLTFNDRTAKVHIVNDCLFIVSPSSFEL
YLQEKGESYDEECINNLQYEFQALGLHRKRIIKNDTINFWRCKVIGPKKESFLVGYLVPNTRLFFGDKIL
INNRHLLLEE

This will do as you ask
#!/usr/bin/perl
use strict;
use warnings;
use constant INPUT => '/home/Desktop/hypo_proteins/testprotein.fasta';
use constant OUTPUT => 'output.txt';
open my $in_fh, '<', INPUT or die "Cannot open input file: $!";
open my $out_fh, '>', OUTPUT or die "Cannot open output file: $!";
select $out_fh;
my $print;
while ( <$in_fh> ) {
if ( /^>/ ) {
$print = /hypothetical protein/;
}
print if $print;
}
Regarding your (deleted) question about this solution, it uses the implicit variable $_ in several places. It is equivalent to this program
#!/usr/bin/perl
use strict;
use warnings;
use constant INPUT => '/home/Desktop/hypo_proteins/testprotein.fasta';
use constant OUTPUT => 'output.txt';
open my $in_fh, '<', INPUT or die "Cannot open input file: $!";
open my $out_fh, '>', OUTPUT or die "Cannot open output file: $!";
select $out_fh;
my $print;
while ( defined( $_ = <$in_fh>) ) {
if ( $_ =~ /^>/ ) {
$print = ( $_ =~ /hypothetical protein/ );
}
print $_ if $print;
}
So I hope you can see that $print = $_ =~ /hypothetical protein/ checks whether the current line (in $_) contains the string hypothetical protein and sets $print to a true value if so.
Because $print is declared outside the loop, it keeps its value across iterations. It is changed only on header lines (those beginning with >), so it stays true until the next header line. As a result, print if $print outputs each header containing hypothetical protein together with all the lines that follow it, up to the next header.
I hope that helps
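The same selection can also be done record by record instead of line by line. The following is only a minimal sketch and not part of the answer above: the helper name select_records and the sample data are made up for illustration, and it assumes the whole FASTA text fits in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: split FASTA text into records (each starting at a
# line that begins with '>') and keep those whose header matches $pattern.
sub select_records {
    my ($fasta_text, $pattern) = @_;
    my @records = split /^(?=>)/m, $fasta_text;
    return grep {
        my ($header) = /\A(>.*)/;    # header line only; '.' stops at the newline
        defined $header && $header =~ $pattern;
    } @records;
}

# Made-up sample data for illustration
my $sample = join '',
    ">id1 hypothetical protein HS_0001\n", "MFKSLIQF\n", "LYQKPIER\n",
    ">id2 DNA polymerase III\n",           "MKVAVLGA\n",
    ">id3 hypothetical protein HS_0002\n", "MSTNPKPQ\n";

# Prints the id1 and id3 records, headers and sequence lines included
print select_records( $sample, qr/hypothetical protein/ );
```

For very large files the line-by-line flag approach shown above is likely the better choice, because it never holds more than one line in memory.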

Related

Writing a file and Reading it in Perl

I'm trying to build a primary key into a new file from an original File which has the following structure (tbl_20180615.txt):
573103150033,0664,54,MSS02VEN*',INT,zxzc,,,,,
573103150033,0665,54,MSS02VEN,INT,zxzc,,,,,
573103150080,0659,29,MSS05ARA',INT,zxzc,,,,,
573103150080,0660,29,MSS05ARA ,INT,zxzc,,,,,
573103154377,1240,72,MSSTRI01,INT,zxzc,,,,,
573103154377,1240,72,MSSTRI01,INT,zxzc,,,,,
I launch my Perl script Verify.pl with two arguments: the first is the number of columns used to build the primary key in the new file, and the second is the name of the original file.
(Verify.pl)
#!/usr/bin/perl
use strict;
use warnings;
my $n1 = $ARGV[0];
my $name = $ARGV[1];
$n1 =~ s/"//g;
my $n2 = $n1 + 1;
my %seen;
my ( $file3 ) = qw(log.txt);
open my $fh3, '>', $file3 or die "Can't open $file3: $!";
print "Loading file ...\n";
open( my $file, "<", "$name" ) || die "Can't read file somefile.txt: $!";
while ( <$file> ) {
chomp;
my @rec = split( /,/, $_, $n2 ); # $n2 is used to build the primary key: split only into the desired fields
for ( my $i = 0; $i < $n1; $i++ ) {
print $fh3 "$rec[$i],";
}
print $fh3 "\n";
}
close( $file );
print "Done!\n";
#########to check duplicates
my ($file4) = qw(log.txt);
print "Checking duplicates records...\n\n";
open (my $file4, "<", "log.txt") || die "Can't read file log.txt: $!";
while ( <$file4> ) {
print if $seen{$_}++;
}
close($file4);
If I send the following instruction
perl Verify.pl 2 tbl_20180615.txt
this code builds a new file called "log.txt" with the following structure, keeping from each line of the original file the number of columns given by the first argument (two here):
(log.txt)
573103150033,0664,
573103150033,0665,
573103150080,0659,
573103150080,0660,
573103154377,1240,
573103154377,1240,
That works OK, but when the script then reads the new file log.txt to check for duplicates, it doesn't work. However, if I comment out the lines that generate log.txt (everything before the "to check duplicates" comment) and run only the next part of the code, it works, giving me the duplicate lines, which look like this:
(Result in command line)
573103154377,1240
573103154377,1240
How can I solve this issue?
I think this does what you're asking for. It builds a unique list of derived keys before printing any of them, using a hash to check whether a key has already been generated.
Note that I have assigned values to @ARGV to emulate input values. You must remove that statement before running the program with input from the command line.
#!/usr/bin/perl
use strict;
use warnings;
use autodie; # Handle bad IO statuses automatically
local @ARGV = qw/ 2 tbl_20180615.txt /; # For testing only
tr/"//d for @ARGV; # "
my ($key_fields, $input_file) = @ARGV;
my $output_file = 'log.txt';
my (@keys, %seen);
print "Loading input ... ";
open my $in_fh, '<', $input_file;
while ( <$in_fh> ) {
chomp;
my @rec = split /,/;
my $key = join ',', @rec[0..$key_fields-1];
push @keys, $key unless $seen{$key}++;
}
print "Done\n";
open my $out_fh, '>', $output_file;
print $out_fh "$_\n" for @keys;
close $out_fh;
output log.txt
573103150033,0664
573103150033,0665
573103150080,0659
573103150080,0660
573103154377,1240
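A likely reason the original second pass found nothing is that the handle writing log.txt was never closed, so the buffered output had probably not reached the file when it was read back. If you do want to keep the write-then-read structure, closing the output handle first should be enough. The following is a minimal sketch under that assumption; the helper name write_keys_then_find_duplicates and the temporary file are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write the first $n fields of each record to $key_file,
# close it so the data is flushed, then re-read it and return repeated lines.
sub write_keys_then_find_duplicates {
    my ($records, $n, $key_file) = @_;

    open my $out_fh, '>', $key_file or die "Can't write $key_file: $!";
    for my $record (@$records) {
        my @fields = split /,/, $record;
        print {$out_fh} join( ',', @fields[ 0 .. $n - 1 ] ), "\n";
    }
    close $out_fh or die "Can't close $key_file: $!";    # flush before re-reading

    open my $in_fh, '<', $key_file or die "Can't read $key_file: $!";
    my ( %seen, @duplicates );
    while ( my $line = <$in_fh> ) {
        chomp $line;
        push @duplicates, $line if $seen{$line}++;
    }
    close $in_fh;
    return @duplicates;
}

# Usage with three of the sample records, written to a temporary file
my ( $tmp_fh, $key_file ) = tempfile( UNLINK => 1 );
close $tmp_fh;
my @records = (
    '573103150080,0660,29,MSS05ARA ,INT,zxzc,,,,,',
    '573103154377,1240,72,MSSTRI01,INT,zxzc,,,,,',
    '573103154377,1240,72,MSSTRI01,INT,zxzc,,,,,',
);
print "$_\n" for write_keys_then_find_duplicates( \@records, 2, $key_file );    # prints 573103154377,1240
```

The key line is the close on the output handle: until the handle is closed (or flushed), a second handle opened on the same file may see an empty or partial file.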

Alter a file using information from another file

I want to alter the names in a phylip file using information from another file. The phylip is just one continuous string of information, and the names I want to alter (e.g. aaaaaaabyd) are embedded in it. Like so
((aaaaaaabyd:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,aaaaaaaafj:0.47991503739434709930):0.06859184769990583908,((aaaaaaaabk:0.09244297511609228524,aaaaaaaete:0.12568841555837687030):0.28431
(there are no new lines)
The names within are like aaaaaaaabk.
The other file holds the names to change them to, like so:
aaaaaaaabk;Ciona savignyi
aaaaaaaete;Homo sapiens
aaaaaaaafj;Cryptosporidium hominis
aaaaaaaaad;Strongylocentrotus purpuratus
aaaaaaabyd;Theileria parva
aaaaaaaaaf;Plasmodium vivax
I have tried numerous things but this is the closest I got. The problem is it does it for one and doesn't print out the rest of the phylip file. I need to get to ((Theileria parva:0.23400159127856412500, etc.
open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $csv, "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $new_tree, "> raxml_tree.phy");
# Declare variables
my $find;
my $replace;
my $digest;
# put the file of the tree into string variable
my $string = <$tree>;
# open csv file
while (my $line = <$csv>) {
# aaaaaaaaaa;Ciona savignyi
if ($line =~ m/(\w+)\;+(\w+\s+\w*)/) {
$find = $1;
$replace = $2;
$string =~ s/$find/$replace/g;
}
}
print $new_tree "$string";
close $tree;
close $csv;
close $new_tree;
Some guidelines on your own code
The problem is almost certainly that you are opening the same file $ARGV[0] twice. Presumably one of them should be $ARGV[1]
You must always use strict and use warnings at the top of every Perl program you write (there is very little point in declaring your variables unless use strict is in place) and declare all your variables with my as close as possible to their first point of use. It is bad form to declare all your variables in a block at the start, because it makes them all effectively global, and you lose most of the advantages of declaring lexical variables
You should use the three-parameter form of open, and it is a good idea to put the name of the file in the die string so that you can see which one failed. So
open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
becomes
open my $tree, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
You should look for simpler solutions rather than apply regex methods every time. $line =~ m/(\w+)\;+(\w+\s+\w*)/ is much tidier as chomp, split /;/
You shouldn't use double-quotes around variables when you want just the value of the variable, so print $new_tree "$string" should be print $new_tree $string
Rather than trying to use the data from the other file (please try to use useful names for items in your question, as it's tough to know what to call them when writing a solution) it is best to build a hash that contains all the translations
This program will do as you ask. It builds a regex consisting of an alternation of all the hash keys, and then converts all occurrences of that pattern into the corresponding name. Only those names that are in your sample other file are translated: the others are left as they are
use strict;
use warnings;
use 5.014; # For non-destructive substitution
use autodie;
my %names;
open my $fh, '<', 'other_file.txt';
while ( <$fh> ) {
my ($k, $v) = split /;/, s/\s+\z//r;
$names{$k} = $v;
}
open $fh, '<', 'phylip.txt';
my $data = <$fh>;
close $fh;
my $re = join '|', sort { length $b <=> length $a } keys %names;
$re = qr/(?:$re)/;
$data =~ s/\b($re)\b/$names{$1}/g;
print $data;
output
((Theileria parva:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,Cryptosporidium hominis:0.47991503739434709930):0.06859184769990583908,((Ciona savignyi:0.09244297511609228524,Homo sapiens:0.12568841555837687030):0.28431
Update
Here is a revised version of your own program with the above points accounted for and the bugs fixed
use strict;
use warnings;
open my $tree_fh, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
my $string = <$tree_fh>;
close $tree_fh;
open my $csv_fh, '<', $ARGV[1] or die qq{Failed to open "$ARGV[1]" for input: $!\n};
while ( <$csv_fh> ) {
chomp;
my ($find, $replace) = split /;/;
$string =~ s/$find/$replace/g;
}
close $csv_fh;
open my $new_tree_fh, '>', 'raxml_tree.phy' or die qq{Failed to open "raxml_tree.phy" for output: $!\n};
print $new_tree_fh $string;
close $new_tree_fh;
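One caveat with s/$find/$replace/g is that $find is interpreted as a regex and can also match inside a longer name. The following sketch combines the hash-based alternation from the first program with quotemeta and word boundaries to guard against both. It is an illustration only: the helper name rename_labels and the shortened sample tree are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every label in $tree that is a key of %$names.
sub rename_labels {
    my ($tree, $names) = @_;
    return $tree unless %$names;    # nothing to translate

    # Longest keys first so a short key cannot pre-empt a longer one;
    # quotemeta makes each key match literally.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length $b <=> length $a } keys %$names;

    $tree =~ s/\b($alternation)\b/$names->{$1}/g;
    return $tree;
}

my %names = ( aaaaaaabyd => 'Theileria parva', aaaaaaaabk => 'Ciona savignyi' );
print rename_labels( '((aaaaaaabyd:0.234,(aaaaaaaaxv:0.449,aaaaaaaabk:0.092)', \%names ), "\n";
# prints ((Theileria parva:0.234,(aaaaaaaaxv:0.449,Ciona savignyi:0.092)
```

Doing all replacements in a single pass also means a replacement value is never re-scanned and altered by a later substitution.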

How to replace string dynamically using perl script

I am trying to solve below issues.
I have 2 files. Address.txt and File.txt. I want to replace all A/B/C/D (File.txt) with corresponding string value (Read from Address.txt file) using perl script. It's not replacing in my output file. I am getting same content of File.txt.
I tried below codes.
Here is Address.txt file
A,APPLE
B,BAL
C,CAT
D,DOG
E,ELEPHANT
F,FROG
G,GOD
H,HORCE
Here is File.txt
A B C
X Y X
M N O
D E F
F G H
Here is my code :
use strict;
use warnings;
open (MYFILE, 'Address.txt');
foreach (<MYFILE>){
chomp;
my @data_new = split/,/sm;
open INPUTFILE, "<", $ARGV[0] or die $!;
open OUT, '>ariout.txt' or die $!;
my $src = $data_new[0];
my $des = $data_new[1];
while (<INPUTFILE>) {
# print "In while :$src \t$des\n";
$_ =~ s/$src/$des/g;
print OUT $_;
}
close INPUTFILE;
close OUT;
# /usr/bin/perl -p -i -e "s/A/APPLE/g" ARGV[0];
}
close (MYFILE);
If I write $_ =~ s/A/Apple/g;
then the output file is fine and A is replaced with "Apple". But when the values come in dynamically, nothing gets replaced.
Thanks in advance. I am new to the Perl scripting language. Correct me if I am wrong anywhere.
Update 1: I updated the code as shown below and it's working fine now. My question now: what is the Big O of this algorithm?
Code :
#!/usr/bin/perl
use warnings;
use strict;
open( my $out_fh, ">", "output.txt" ) || die "Can't open the output file for writing: $!";
open( my $address_fh, "<", "Address.txt" ) || die "Can't open the address file: $!";
my %lookup = map { chomp; split( /,/, $_, 2 ) } <$address_fh>;
open( my $file_fh, "<", "File1.txt" ) || die "Can't open the file.txt file: $!";
while (<$file_fh>) {
my @line = split;
for my $char ( @line ) {
( exists $lookup{$char} ) ? print $out_fh " $lookup{$char} " : print $out_fh " $char ";
}
print $out_fh "\n";
}
Not entirely sure how you want your output formatted. Do you want to keep the rows and columns as is?
I took a similar approach as above but kept the formatting the same as in your 'file.txt' file:
#!/usr/bin/perl
use warnings;
use strict;
open( my $out_fh, ">", "output.txt" ) || die "Can't open the output file for writing: $!";
open( my $address_fh, "<", "address.txt" ) || die "Can't open the address file: $!";
my %lookup = map { chomp; split( /,/, $_, 2 ) } <$address_fh>;
open( my $file_fh, "<", "file.txt" ) || die "Can't open the file.txt file: $!";
while (<$file_fh>) {
my @line = split;
for my $char ( @line ) {
( exists $lookup{$char} ) ? print $out_fh " $lookup{$char} " : print $out_fh " $char ";
}
print $out_fh "\n";
}
That will give you the output:
APPLE BAL CAT
X Y X
M N O
DOG ELEPHANT FROG
FROG GOD HORCE
Here's another option that lets Perl handle the opening and closing of files:
use strict;
use warnings;
my $file_txt = pop;
my %hash = map { /(.+?),(.+)/ ? ( $1 => $2 ) : () } <>;
push @ARGV, $file_txt;
while (<>) {
my @array;
push @array, $hash{$_} // $_ for split;
print "@array\n";
}
Usage: perl script.pl Address.txt File.txt [>outFile.txt]
The last, optional parameter directs output to a file.
Output on your dataset:
APPLE BAL CAT
X Y X
M N O
DOG ELEPHANT FROG
FROG GOD HORCE
The name of the last file on the command line (the file to be translated) is implicitly popped off of @ARGV for use later. Then, a hash is built, using the key/value pairs in the remaining file, the address file.
Next, the saved file is read, splitting each line into its single elements, and the defined-or (//) operator is used to return the defined hash item or the single element itself, which is then pushed onto @array. Finally, the array is interpolated in a print statement.
Hope this helps!
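The lookup step at the heart of the answers above can be isolated into a small function. This is a minimal sketch with a made-up helper name (translate_line) and a shortened lookup table. Because each hash lookup takes constant time on average, the work per line should grow linearly with the number of tokens on it.

```perl
use strict;
use warnings;

# Hypothetical helper: translate each whitespace-separated token through
# %$lookup, leaving tokens without an entry unchanged.
sub translate_line {
    my ($line, $lookup) = @_;
    return join ' ', map { $lookup->{$_} // $_ } split ' ', $line;
}

my %lookup = ( A => 'APPLE', B => 'BAL', C => 'CAT' );
print translate_line( $_, \%lookup ), "\n" for 'A B C', 'X Y X';
# prints:
# APPLE BAL CAT
# X Y X
```

Note that the defined-or operator (//) needs Perl 5.10 or later.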
First, here is your existing program, rewritten slightly
open the address file
convert the address file to a hash so that the letters are the keys and the strings the values
open the other file
read in the single line in it
split the line into single letters
use the letters to lookup in the hash
use strict;
use warnings;
open(my $a,"Address.txt")||die $!;
my %address=map {split(/,/) } map {split(' ')} <$a>;
open(my $f,"File.txt")||die $!;
my $list=<$f>;
for my $letter (split(' ',$list)) {
print $address{$letter}."\n" if (exists $address{$letter});
}
To make another file with the substitutions in place, alter the loop that processes $list:
my @output;
for my $letter (split(' ',$list)) {
if (exists $address{$letter}) {
push @output, $address{$letter};
}
else {
push @output, $letter;
}
}
open(my $o,">newFile.txt")||die $!;
print $o "@output";
Your problem is that in every iteration of your foreach loop you overwrite any changes made earlier to output file.
My solution:
use strict;
use warnings;
open my $replacements, 'Address.txt' or die $!;
my %r;
foreach (<$replacements>) {
chomp;
my ($k, $v) = split/,/sm;
$r{$k} = $v;
}
my $re = '(' . join('|', keys %r) . ')';
open my $input, "<", $ARGV[0] or die $!;
while (<$input>) {
s/$re/$r{$1}/g;
print;
}
#!/usr/bin/perl -w
# to replace multiple text strings in a file with text from another file
#select text from 1st file, replace in 2nd file
$file1 = 'Address.txt'; $file2 = 'File.txt';
# save the strings by which to replace
%replacement = ();
open IN,"$file1" or die "cant open $file1\n";
while(<IN>)
{chomp $_;
@a = split ',',$_;
$replacement{$a[0]} = $a[1];}
close IN;
open OUT,">replaced_file";
open REPL,"$file2" or die "cant open $file2\n";
while(<REPL>)
{chomp $_;
@a = split ' ',$_; @replaced_data = ();
# replace strings wherever possible
foreach $i(@a)
{if(exists $replacement{$i}) {push @replaced_data,$replacement{$i};}
else {push @replaced_data,$i;}
}
print OUT trim(join " ",@replaced_data),"\n";
}
close REPL; close OUT;
########################################
sub trim
{
my $str = $_[0];
$str=~s/^\s*(.*)/$1/;
$str=~s/\s*$//;
return $str;
}

copy text after a specific string from a file and append to another in perl

I want to extract the desired information from a file and append it to another. The first file starts with some header lines that follow no specific pattern and end with the "END OF HEADER" string. I wrote the following code to find the line that marks the end of the header:
$find = "END OF HEADER";
open FILEHANDLE, $filename_path;
while (<FILEHANDLE>) {
my $line = $_;
if ($line =~ /$find/) {
#??? what shall I do here???
}
}
but I don't know how I can get the rest of the file and append it to the other file.
Thank you for any help
I guess if the content of the file isn't enormous you can just load the whole file in a scalar and just split it with the "END OF HEADER" then print the output of the right side of the split in the new file (appending)
open READHANDLE, 'readfile.txt' or die $!;
my $content = do { local $/; <READHANDLE> };
close READHANDLE;
my (undef,$restcontent) = split(/END OF HEADER/,$content);
open WRITEHANDLE, '>>writefile.txt' or die $!;
print WRITEHANDLE $restcontent;
close WRITEHANDLE;
This code will take the filenames from the command line, print all lines up to END OF HEADER from the first file, followed by all lines from the second file. Note that the output is sent to STDOUT, so you will have to redirect the output, like this:
perl program.pl headfile.txt mainfile.txt > newfile.txt
Update Now modified to print all of the first file after the line END OF HEADER followed by all of the second file
use strict;
use warnings;
my ($header_file, $main_file) = @ARGV;
open my $fh, '<', $header_file or die $!;
my $print;
while (<$fh>) {
print if $print;
$print ||= /END OF HEADER/;
}
open $fh, '<', $main_file or die $!;
print while <$fh>;
use strict;
use warnings;
use File::Slurp;
my @lines = read_file('readfile.txt');
while ( my $line = shift @lines) {
next unless ($line =~ m/END OF HEADER/);
last;
}
append_file('writefile.txt', @lines);
I believe this will do what you need:
use strict;
use warnings;
my $find = 'END OF HEADER';
my $fileContents;
{
local $/;
open my $fh_read, '<', 'theFile.txt' or die $!;
$fileContents = <$fh_read>;
}
my ($restOfFile) = $fileContents =~ /$find(.+)/s;
open my $fh_write, '>>', 'theFileToAppend.txt' or die $!;
print $fh_write $restOfFile;
close $fh_write;
my $status = 0;
my $find = "END OF HEADER";
open my $fh_write, '>', $file_write
or die "Can't open file $file_write $!";
open my $fh_read, '<', $file_read
or die "Can't open file $file_read $!";
LINE:
while (my $line = <$fh_read>) {
if ($line =~ /$find/) {
$status = 1;
next LINE;
}
print $fh_write $line if $status;
}
close $fh_read;
close $fh_write;
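The line-by-line answers above share one idea: skip everything up to and including the marker line, then keep the rest. Here is a minimal sketch of that idea as a reusable function. The helper name lines_after_marker and the in-memory sample input are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every line that follows the first line
# containing $marker (the marker line itself is not returned).
sub lines_after_marker {
    my ($in_fh, $marker) = @_;
    my ( $found, @rest );
    while ( my $line = <$in_fh> ) {
        if    ($found)                         { push @rest, $line }
        elsif ( index( $line, $marker ) >= 0 ) { $found = 1 }
    }
    return @rest;
}

# Made-up input, read through an in-memory filehandle for illustration
my $text = "title\nEND OF HEADER\ndata 1\ndata 2\n";
open my $demo_fh, '<', \$text or die "Cannot open in-memory file: $!";
print lines_after_marker( $demo_fh, 'END OF HEADER' );    # prints the two data lines
close $demo_fh;
```

To append the result to another file, open that file with the '>>' mode and print the returned lines to its handle.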

Perl problems printing output to a new file

I want to remove all lines in a text file that start with HPL_. I have achieved this and can print to the screen, but when I try to write to a file, I just get the last line of the amended text printed in the new file. Any help please!
open(FILE,"<myfile.txt");
@LINES = <FILE>;
close(FILE);
open(FILE,">myfile.txt");
foreach $LINE (@LINES) {
@array = split(/\:/,$LINE);
my $file = "changed";
open OUTFILE, ">$file" or die "unable to open $file $!";
print OUTFILE $LINE unless ($array[0] eq "HPL_");
}
close(FILE);
close (OUTFILE);
exit;
You just want to remove all lines that start with HPL_? That's easy!
perl -pi -e 's/^HPL_.*//s' myfile.txt
Yes, it really is just a one-liner. :-)
If you don't want to use the one-liner, re-write the "write to file" portion as follows:
my $file = "changed";
open( my $outfh, '>', $file ) or die "Could not open file $file: $!\n";
foreach my $LINE (@LINES) {
my @array = split(/:/,$LINE);
next if $array[0] eq 'HPL_';
print $outfh $LINE;
}
close( $outfh );
Note how you are open()ing the file each time through the loop. This is causing the file to only contain the last line, as using open() with > means "overwrite what's in the file". That's the major problem with your code as it stands.
Edit: As an aside, you want to clean up your code. Use lexical filehandles as I've shown. Always add the three lines that tchrist posted at the top of every one of your Perl programs. Use the three-argument version of open(). Don't slurp the entire file into an array: if you try to read a huge file, it could cause your computer to run out of memory. Your program could be re-written as:
#!perl
use strict;
use autodie;
use warnings FATAL => "all";
my $infile = "myfile.txt";
my $outfile = "changed.txt";
open( my $infh, '<', $infile );
open( my $outfh, '>', $outfile );
while( my $line = <$infh> ) {
next if $line =~ /^HPL_/;
print $outfh $line;
}
close( $outfh );
close( $infh );
Note how with use autodie you don't need to add or die ... to the open() function, as the autodie pragma handles that for you.
The issue with your code is that you open the file for output within your line-processing loop which, due to your use of the '>' form of open, opens the file each time for write, obliterating any previous content.
Move the invocation of open() to the top of your file, above the loop, and it should work.
Also, I'm not sure of your intent but at line 4 of your example, you reopen your input file for write (using '>'), which also clobbers anything it contains.
As a side note, you might try reading up on Perl's grep() function, which is designed to do exactly what you need, as in:
#!/usr/bin/perl
use strict;
use warnings;
open(my $in, '<', 'myfile.txt') or die "failed to open input for read: $!";
my @lines = <$in> or die 'no lines to read from input';
close($in);
# collect all lines that do not begin with HPL_ into @result
my @result = grep ! /^HPL_/, @lines;
open(my $out, '>', 'changed.txt') or die "failed to open output for write: $!";
print { $out } @result;
close($out);