How to get the output for this code? - perl

I have a file t_code.txt in which I want to replace all occurrences of the strings PIOMUX_UART1_TXD and PIOMUX_UART1_RXD with the strings in @array1 containing TXD and RXD respectively, and then print the result to a new file c_code2.txt, but it's not working.
open my $f6, '<', 't_code.txt' or die $!;
my @lines = <$f6>;
my @newlines;
foreach (@lines) {
$_ =~ s/PIOMUX_UART1_TXD/ grep ( / TXD / )(@array1)/g;
$_ =~ s/PIOMUX_UART1_RXD/ grep ( / RXD / )(@array1)/g;
push(@newlines, $_);
}
close($f6);
open my $output, '>', 'c_code2.txt' or die "Can't open the output file!";
print $output @newlines;
close($output);

Since @array1 (a dreadful choice of identifier, by the way) doesn't change inside the loop, it is best to build the replacement strings once, outside the loop, instead of every time you make a replacement.
It isn't clear exactly what string you want to replace PIOMUX_UART1_TXD and PIOMUX_UART1_RXD with, but this code joins all the matching elements of the array with commas and uses that. I hope it's clear how to do something different if you need to.
I've also used a while loop, as there's no need to read the whole file into an array beforehand.
my ($in_file, $out_file) = qw/ t_code.txt c_code2.txt /;
open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for reading: $!};
open my $out_fh, '>', $out_file or die qq{Unable to open "$out_file" for writing: $!};
my $txd = join ',', grep /TXD/, @array1;
my $rxd = join ',', grep /RXD/, @array1;
while ( <$in_fh> ) {
s/PIOMUX_UART1_TXD/$txd/g;
s/PIOMUX_UART1_RXD/$rxd/g;
print $out_fh $_;
}
close $out_fh or die $!;

Several problems in the code:
To be able to use code in the replacement part of a substitution, you must use the /e modifier.
In a s/// construct, you can't use / unquoted. Either change the separator, or backslash it.
The replacement part in a substitution is a string. In case of code, it's evaluated in scalar context. grep in scalar context returns the number of matches.
Cf:
#! /usr/bin/perl
use warnings;
use strict;
my @array1 = qw( aTXDb cRXDd );
while (<DATA>) {
s/PIOMUX_UART1_TXD/join q(), grep m=TXD=, @array1/eg;
s/PIOMUX_UART1_RXD/join q(), grep m=RXD=, @array1/eg;
print;
}
__DATA__
PIOMUX_UART1_TXD
PIOMUX_UART1_RXD
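The point about context can be checked in isolation. Here is a minimal sketch, assuming a made-up @array1 with two TXD entries (the sample values are hypothetical, not from the question), that shows what grep hands back in scalar context versus list context:

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the question.
my @array1 = qw( aTXDb cRXDd eTXDf );

# Scalar context: grep returns how many elements matched.
my $count = grep { /TXD/ } @array1;

# List context: grep returns the matching elements themselves.
my ($first) = grep { /TXD/ } @array1;
my $joined  = join ',', grep { /TXD/ } @array1;

print "count=$count first=$first joined=$joined\n";
```

This is why the join is wrapped around grep in the replacement: join supplies list context to grep and hands a single string back to the substitution.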

Related

perl array prints as GLOB(#x#########)

I have a file which contains a list of email addresses which are separated by a semi-colon which is configured much like this (but much larger) :
$ cat email_casper.txt
casper1@foo.com; casper2@foo.com; casper3@foo.com; casper.casper4@foo.com;
#these throw outlook error :
#casper101@foo.com ; casper100@foo.com
#cat /tmp/emailist.txt | tr '\n' '; '
#cat /tmp/emallist.txt | perl -nle 'print /\<(.*)\>/' | sort
I want to break it up on the semicolon, so I suck the whole file into an array, where the contents are supposedly split on the semicolon.
#!/usr/bin/perl
use strict;
use warnings;
my $filename = shift @ARGV ;
open(my $fh, '<', $filename) or die "Could not open file $filename $!";
my @values = split(';', $fh);
foreach my $val (@values) {
print "$val\n";
}
exit 0 ;
But the file awards me with a glob. I just don't know what is going on.
$ ./spliton_semi.pl email_casper.txt
GLOB(0x80070b90)
If I use Data::Dumper I get
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper ;
my $filename = shift @ARGV ;
open(my $fh, '<', $filename) or die "Could not open file $filename $!";
my @values = split(';', $fh);
print Dumper \@values ;
This is what the Dumper returns :
$ ./spliton_semi.pl email_casper.txt
$VAR1 = [
'GLOB(0x80070b90)'
];
You do not "suck the whole file into an array". You don't even attempt to read from the file handle. Instead, you pass the file handle to split. Expecting a string, it stringifies the file handle into GLOB(0x80070b90).
You could read the file into an array of lines as follows:
my @lines = <$fh>;
for my $line (@lines) {
...
}
But it's far simpler to read one line at a time.
while ( my $line = <$fh> ) {
...
}
In fact, there is no reason not to use ARGV here, simplifying your program to the following:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );
while (<>) {
chomp;
say for split /\s*;\s*/, $_;
}
This line
my @values = split(';', $fh);
is not reading from the filehandle like you think it is. You're actually calling split on the filehandle object itself.
You want this:
my $line = <$fh>;
my @values = split(';', $line);
Starting point:
#!/usr/bin/perl
use strict;
use warnings;
open(my $fh, '<', 'dummy.txt')
or die "$!";
my @values;
while (<$fh>) {
chomp;
# original
# push(@values, split(';', $_));
# handle white space
push(@values, split(/\s*;\s*/, $_));
}
close($fh);
foreach my $val (@values) {
print "$val\n";
}
exit 0;
Output for your example:
$ perl dummy.pl
casper1@foo.com
casper2@foo.com
casper3@foo.com
casper.casper4@foo.com

perl: make script fast to use big file

My problem is how to make my script faster (I work with big files).
The script below adds "bbb" between words when the words appear in another file that contains sequences of words.
For example, file2.txt: i eat big pizza .my big pizza ...
file1.txt (sequences):
eat big pizza
big pizza
The result, Newfile:
i eatbbbbigbbbpizza.my bigbbbpizza ...
my script:
use strict;
use warnings;
use autodie;
open Newfile ,">./newfile.txt" or die "Cannot create Newfile.txt";
my %replacement;
my ($f1, $f2) = ('file1.txt', 'file2.txt');
open(my $fh, $f1);
my @seq;
foreach (<$fh> )
{
chomp;
s/^\s+|\s+$//g;
push @seq, $_;
}
close $fh;
@seq = sort bylen @seq;
open($fh, $f2);
foreach (<$fh> ) {
foreach my $r (@seq) {
my $t = $r;
$t =~ s/\h+/bbb/g;
s/$r/$t/g;
}
print Newfile ;
}
close $fh;
close Newfile ;
exit 0;
sub bylen {
length($b) <=> length($a);
}
Instead of an array
my @seq;
define your words as a hash.
my %seq;
Instead of pushing the words
push @seq, $_;
store the words in the hash. Precalculate the replacement and move it out of the loop.
my $t = $_;
$t =~ s/\h+/bbb/g;
$seq{$_} = $t;
Precalculate the words in front of the outer loop:
my @seq = keys %seq;
And use hash look-ups to find the replacement in the inner loop:
my $t = $seq{$r};
This might be a bit faster, but do not expect too much.
In most cases it is better to reduce the problem by preparing the input in a way that makes the solution easier. For example, grep -f is much faster than your Perl loops. Use grep to find the lines that need a replacement, and do the replacement with Perl or sed.
Another way is to parallelize the job. You can divide your input into n parts and run n processes on n CPUs in parallel. See the GNU parallel tutorial.
What about a regexp like this (beware that this approach can cause security concerns)?
use strict;
use warnings;
open (my $Newfile, '>', 'newfile.txt') or die "Cannot create Newfile.txt: $!";
my ($f1, $f2) = qw(file1.txt file2.txt);
open (my $fh, $f1) or die "Can't open $f1 for reading: $!";
my @seq = map {split ' ', $_ } <$fh>;
close $fh;
# an improvement would be to use a hash to avoid duplicates
my $regexp = '(' . join('|', @seq) . ')';
open($fh, $f2) or die "Can't open $f2 for reading: $!";
foreach my $line (<$fh> ) {
$line =~ s/$regexp/$1bbb/g;
print $Newfile $line;
}
close $fh;
close $Newfile ;
exit 0;
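A variation that may be worth trying: keep each sequence whole, precompute its "bbb" form once, and do a single substitution pass per line instead of one pass per sequence. This is only a sketch under assumptions: build_joiner is a hypothetical helper name, the sequence list is assumed to be non-empty, and the sequences are hard-coded here instead of being read from file1.txt. quotemeta is there so that sequences containing regex metacharacters are matched literally.

```perl
use strict;
use warnings;

# Build one regex from whole sequences (longest first) plus a lookup
# table of precomputed replacements, and return a closure that applies
# them to a line in a single pass.
sub build_joiner {
    my @sequences = @_;
    my %joined;
    for my $seq (@sequences) {
        (my $t = $seq) =~ s/\h+/bbb/g;   # "big pizza" -> "bigbbbpizza"
        $joined{$seq} = $t;
    }
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length $b <=> length $a } keys %joined;
    my $re = qr/($alternation)/;
    return sub {
        my ($line) = @_;
        $line =~ s/$re/$joined{$1}/g;
        return $line;
    };
}

my $join_words = build_joiner('eat big pizza', 'big pizza');
print $join_words->('i eat big pizza .my big pizza ...'), "\n";
# prints "i eatbbbbigbbbpizza .my bigbbbpizza ..."
```

Whether this is actually faster on your files depends on how many sequences there are, so it is worth timing against the nested loops before committing to it.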

Extract specific FASTA sequences from a file using Perl

I have written a Perl script to retrieve a hypothetical protein list from a FASTA file. I'm able to get only the header line with all hypothetical proteins, but I want to have all the sequences along with protein IDs.
The script is as follows.
#!/usr/bin/perl
use strict;
use warnings;
my $line;
open $fh, '<', '/home/Desktop/hypo_proteins/testprotein.fasta' or die "Cannot open file $fh, $!";
open OUT, ">output.txt";
while ( $line = <$fh> ) {
chomp $line;
if ( $line =~ /hypothetical protein/ ) {
print OUT "$line\n";
}
}
close $fh;
The output that I got from the above script is as follows:
>gi|113461928|ref|YP_718205.1| hypothetical protein HS_1792 [Haemophilus somnus 129PT]
>gi|113460158|ref|YP_718214.1| hypothetical protein HS_0009 [Haemophilus somnus 129PT]
>gi|113460165|ref|YP_718221.1| hypothetical protein HS_0016 [Haemophilus somnus 129PT]
But I need the output as follows:
>gi|113461928|ref|YP_718205.1| hypothetical protein HS_1792 [Haemophilus somnus 129PT]
MFKSLIQFFKSKSNTSNIKKENAVQRQERQDIEGWITPYSGQELLNTELRQHHLGLLWQQVSMTREMFEH
LYQKPIERYAEMVQLLPASESHHHSHLGGMLDHGLEVISFAAKLRQNYVLPLNAAPEDQAKQKDAWTAAV
IYLALVHDIGKSIVDIEIQLQDGKRWLAWHGIPTLPYKFRYIKQRDYELHPVLGGFIANQLIAKETFDWL
ATYPEVFSALMYAMAGHYDKANVLAEIVQKADQNSVALALGGDITKLVQKPVISFAKQLILALRYLISQK
FKISSKGPGDGWLTEDGLWLMSKTTADQIRAYLMGQGISVPSDNRKLFDEMQAHRVIESTSEGNAIWYCQ
LSADAGWKPKDKFSLLRIKPEVIWDNIDDRPELFAGTICVVEKENEAEEKISNTVNEVQDTVPINKKENI
ELTSNLQEENTALQSLNPSQNPEVVVENCDNNSVDFLLNMFSDNNEQQVMNIPSADAEAGTTMILKSEPE
NLNTHIEVEANAIPKLPTNDDTHLKSEGQKFVDWLKDKLFKKQLTFNDRTAKVHIVNDCLFIVSPSSFEL
YLQEKGESYDEECINNLQYEFQALGLHRKRIIKNDTINFWRCKVIGPKKESFLVGYLVPNTRLFFGDKIL
INNRHLLLEE
This will do as you ask
#!/usr/bin/perl
use strict;
use warnings;
use constant INPUT => '/home/Desktop/hypo_proteins/testprotein.fasta';
use constant OUTPUT => 'output.txt';
open my $in_fh, '<', INPUT or die "Cannot open input file: $!";
open my $out_fh, '>', OUTPUT or die "Cannot open output file: $!";
select $out_fh;
my $print;
while ( <$in_fh> ) {
if ( /^>/ ) {
$print = /hypothetical protein/;
}
print if $print;
}
Regarding your (deleted) question about this solution, it uses the implicit variable $_ in several places. It is equivalent to this program
#!/usr/bin/perl
use strict;
use warnings;
use constant INPUT => '/home/Desktop/hypo_proteins/testprotein.fasta';
use constant OUTPUT => 'output.txt';
open my $in_fh, '<', INPUT or die "Cannot open input file: $!";
open my $out_fh, '>', OUTPUT or die "Cannot open output file: $!";
select $out_fh;
my $print;
while ( defined( $_ = <$in_fh>) ) {
if ( $_ =~ /^>/ ) {
$print = ( $_ =~ /hypothetical protein/ );
}
print $_ if $print;
}
So I hope you can see that $print = $_ =~ /hypothetical protein/ checks whether the current line (in $_) contains the string hypothetical protein and sets $print to a true value if so.
Because $print is declared outside the loop, it keeps its value across iterations. It is changed only on header lines, when the current line begins with >, and it keeps that value until the next header line. So print if $print outputs each header containing hypothetical protein and all the lines that follow it, up to the next header.
I hope that helps.
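If it helps to see the flag at work without touching any files, here is a small sketch of the same idea as a function over a list of lines. The name filter_fasta and the two-record sample are made up for illustration; the logic is meant to mirror the loop above.

```perl
use strict;
use warnings;

# Return only the records whose header line matches $pattern.
# $print is updated on header lines only and keeps its value for the
# sequence lines that follow, exactly like the flag in the loop above.
sub filter_fasta {
    my ($pattern, @lines) = @_;
    my $print = 0;
    my @kept;
    for my $line (@lines) {
        $print = ( $line =~ $pattern ) ? 1 : 0 if $line =~ /^>/;
        push @kept, $line if $print;
    }
    return @kept;
}

# Hypothetical two-record sample, not the asker's data.
my @sample = (
    ">id1 hypothetical protein A\n", "MFKSLIQ\n", "INNRH\n",
    ">id2 ribosomal protein B\n",    "MKTAYIA\n",
);
print filter_fasta( qr/hypothetical protein/, @sample );
```

With this sample, the first header and its two sequence lines are kept and the second record is dropped.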

Alter a file using information from another file

I want to alter the names in a phylip file using information from another file. The phylip is just one continuous string of information, and the names I want to alter (e.g. aaaaaaabyd) are embedded in it. Like so
((aaaaaaabyd:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,aaaaaaaafj:0.47991503739434709930):0.06859184769990583908,((aaaaaaaabk:0.09244297511609228524,aaaaaaaete:0.12568841555837687030):0.28431
(there are no new lines)
The names within are like aaaaaaaabk.
The other file has the information to change the names to, like so:
aaaaaaaabk;Ciona savignyi
aaaaaaaete;Homo sapiens
aaaaaaaafj;Cryptosporidium hominis
aaaaaaaaad;Strongylocentrotus purpuratus
aaaaaaabyd;Theileria parva
aaaaaaaaaf;Plasmodium vivax
I have tried numerous things but this is the closest I got. The problem is it does it for one and doesn't print out the rest of the phylip file. I need to get to ((Theileria parva:0.23400159127856412500, etc.
open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $csv, "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $new_tree, "> raxml_tree.phy");
# Declare variables
my $find;
my $replace;
my $digest;
# put the file of the tree into string variable
my $string = <$tree>;
# open csv file
while (my $line = <$csv>) {
# aaaaaaaaaa;Ciona savignyi
if ($line =~ m/(\w+)\;+(\w+\s+\w*)/) {
$find = $1;
$replace = $2;
$string =~ s/$find/$replace/g;
}
}
print $new_tree "$string";
close $tree;
close $csv;
close $new_tree;
Some guidelines on your own code
The problem is almost certainly that you are opening the same file $ARGV[0] twice. Presumably one should be $ARGV[1].
You must always use strict and use warnings at the top of every Perl program you write (there is very little point in declaring your variables unless use strict is in place) and declare all your variables with my as close as possible to their first point of use. It is bad form to declare all your variables in a block at the start, because it makes them all effectively global, and you lose most of the advantages of declaring lexical variables
You should use the three-parameter form of open, and it is a good idea to put the name of the file in the die string so that you can see which one failed. So
open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
becomes
open my $tree, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
You should look for simpler solutions rather than apply regex methods every time. $line =~ m/(\w+)\;+(\w+\s+\w*)/ is much tidier as chomp, split /;/
You shouldn't use double-quotes around variables when you want just the value of the variable, so print $new_tree "$string" should be print $new_tree $string
Rather than trying to use the data from the other file (please try to use useful names for items in your question, as it's tough to know what to call them when writing a solution) it is best to build a hash that contains all the translations
This program will do as you ask. It builds a regex consisting of an alternation of all the hash keys, and then converts all occurrences of that pattern into the corresponding name. Only those names that are in your sample other file are translated: the others are left as they are.
use strict;
use warnings;
use 5.014; # For non-destructive substitution
use autodie;
my %names;
open my $fh, '<', 'other_file.txt';
while ( <$fh> ) {
my ($k, $v) = split /;/, s/\s+\z//r;
$names{$k} = $v;
}
open $fh, '<', 'phylip.txt';
my $data = <$fh>;
close $fh;
my $re = join '|', sort { length $b <=> length $a } keys %names;
$re = qr/(?:$re)/;
$data =~ s/\b($re)\b/$names{$1}/g;
print $data;
output
((Theileria parva:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,Cryptosporidium hominis:0.47991503739434709930):0.06859184769990583908,((Ciona savignyi:0.09244297511609228524,Homo sapiens:0.12568841555837687030):0.28431
Update
Here is a revised version of your own program with the above points accounted for and the bugs fixed
use strict;
use warnings;
open my $tree_fh, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
my $string = <$tree_fh>;
close $tree_fh;
open my $csv_fh, '<', $ARGV[1] or die qq{Failed to open "$ARGV[1]" for input: $!\n};
while ( <$csv_fh> ) {
chomp;
my ($find, $replace) = split /;/;
$string =~ s/$find/$replace/g;
}
close $csv_fh;
open my $new_tree_fh, '>', 'raxml_tree.phy' or die qq{Failed to open "raxml_tree.phy" for output: $!\n};
print $new_tree_fh $string;
close $new_tree_fh;

How to search and replace using hash with Perl

I'm new to Perl and I'm afraid I am stuck and wanted to ask if someone might be able to help me.
I have a file with two columns (tab separated) of oldname and newname.
I would like to use the oldname as key and newname as value and store it as a hash.
Then I would like to open a different file (gff file) and replace all the oldnames in there with the newnames and write it to another file.
I have given it my best try but am getting a lot of errors.
If you could let me know what I am doing wrong, I would greatly appreciate it.
Here are how the two files look:
oldname newname(SFXXXX) file:
genemark-scaffold00013-abinit-gene-0.18 SF130001
augustus-scaffold00013-abinit-gene-1.24 SF130002
genemark-scaffold00013-abinit-gene-1.65 SF130003
file to search and replace in (an example of one of the lines):
scaffold00013 maker gene 258253 258759 . - . ID=maker-scaffold00013-augustus-gene-2.187;Name=maker-scaffold00013-augustus-gene-2.187;
Here is my attempt:
#!/usr/local/bin/perl
use warnings;
use strict;
my $hashfile = $ARGV[0];
my $gfffile = $ARGV[1];
my %names;
my $oldname;
my $newname;
if (!defined $hashfile) {
die "Usage: $0 hash_file gff_file\n";
}
if (!defined $gfffile) {
die "Usage: $0 hash_file gff_file\n";
}
###save hashfile with two columns, oldname and newname, into a hash with oldname as key and newname as value.
open(HFILE, $hashfile) or die "Cannot open $hashfile\n";
while (my $line = <HFILE>) {
chomp($line);
my ($oldname, $newname) = split /\t/;
$names{$oldname} = $newname;
}
close HFILE;
###open gff file and replace all oldnames with newnames from %names.
open(GFILE, $gfffile) or die "Cannot open $gfffile\n";
while (my $line2 = <GFILE>) {
chomp($line2);
eval "$line2 =~ s/$oldname/$names{oldname}/g";
open(OUT, ">SFrenamed.gff") or die "Cannot open SFrenamed.gff: $!";
print OUT "$line2\n";
close OUT;
}
close GFILE;
Thank you!
Your main problem is that you aren't splitting the $line variable. split /\t/ splits $_ by default, and you haven't put anything in there.
This program builds the hash, and then constructs a regex from all the keys by sorting them in descending order of length and joining them with the | regex alternation operator. The sorting is necessary so that the longest of all possible choices is selected if there are any alternatives.
Every occurrence of the regex is replaced by the corresponding new name in each line of the input file, and the output written to the new file.
use strict;
use warnings;
die "Usage: $0 hash_file gff_file\n" if @ARGV < 2;
my ($hashfile, $gfffile) = @ARGV;
open(my $hfile, '<', $hashfile) or die "Cannot open $hashfile: $!";
my %names;
while (my $line = <$hfile>) {
chomp($line);
my ($oldname, $newname) = split /\t/, $line;
$names{$oldname} = $newname;
}
close $hfile;
my $regex = join '|', sort { length $b <=> length $a } keys %names;
$regex = qr/$regex/;
open(my $gfile, '<', $gfffile) or die "Cannot open $gfffile: $!";
open(my $out, '>', 'SFrenamed.gff') or die "Cannot open SFrenamed.gff: $!";
while (my $line = <$gfile>) {
chomp($line);
$line =~ s/($regex)/$names{$1}/g;
print $out $line, "\n";
}
close $out;
close $gfile;
Why are you using an eval? And $oldname is going to be undefined in the second while loop, because you redeclare both variables inside the first while loop, in that loop's scope (even if you used the outer variables, they would hold only the very last value that you processed, which wouldn't be helpful).
Take out the my $oldname and my $newname at the top of your script; they are useless.
Take out the entire eval line. You need to repeat the regex for each thing you want to replace. Try something like:
$line2 =~ s/$_/$names{$_}/g for keys %names;
Also see Borodin's answer. He made one big regex instead of a loop, and caught your lack of the second argument to split.
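One thing to watch with either approach: the old names contain . and -, and an unescaped . in a regex matches any character. A hedged sketch of the same alternation with quotemeta applied to each key follows; the second hash entry and the rename_ids helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical mapping in the style of the question's hash file.
my %names = (
    'genemark-scaffold00013-abinit-gene-0.18' => 'SF130001',
    'genemark-scaffold00013-abinit-gene-0.1'  => 'SF130009',
);

# Longest keys first so that ...-0.18 is tried before ...-0.1;
# quotemeta makes "." and "-" match literally.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a } keys %names;
my $regex = qr/($alternation)/;

sub rename_ids {
    my ($line) = @_;
    $line =~ s/$regex/$names{$1}/g;
    return $line;
}

print rename_ids('ID=genemark-scaffold00013-abinit-gene-0.18;Name=genemark-scaffold00013-abinit-gene-0x18;'), "\n";
# prints "ID=SF130001;Name=genemark-scaffold00013-abinit-gene-0x18;"
```

Without quotemeta, the 0x18 variant would also match the pattern built from the 0.18 key.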