Perl while loop is repeating itself - perl

I am 100% new to Perl but do have some PHP knowledge. I'm trying to create a quick script that will take the #url vars and save it to a .txt file. The problem that I'm having is that it's saving the url again everytime it runs through the loop which is super annoying. So when the loop runs, it'll look like this.
url1.com
url1.com url2.com
url1.com url2.com url3.com
What I would like it to look like is just a plain and simple:
url1.com
url2.com
url3.com
Here is my code. If anyone can help, I would appreciate it SO SO much!
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.rdf.u8";
my #urls;
open(my $fh, "<", $file) or die "Unable to open $file\n";
while (my $line = <$fh>) {
if ($line =~ m/<(?:ExternalPage about|link r:resource)="([^\"]+)"\/?>/) {
push #urls, $1;
}
open (FH, ">>my_urls.txt") or die "$!";
print FH "#urls ";
close(FH);
}
close $fh;

Your print is inside your while loop. It sounds like you want to move your print outside of the loop.
Or if you want to print each url as you go through each line, move the declaration of "my #urls" down into the loop, then it will get reset each line

Shouldn't this part:
open (FH, ">>my_urls.txt") or die "$!";
print FH "#urls ";
close(FH);
...be placed outside of while loop? It makes no sense within while, as #urls are apparently incomplete there.
And two regex-related sidenotes: first, with m operator you may choose another set of delimiters so you don't have to escape / sign; second, it's not necessary to escape " sign within character class definition. In fact, it's not required to escape it in regex at all - unless you choose this character as a delimiter. )
So your regex may look like this:
$line =~ m#<(?:ExternalPage about|link r:resource)="([^"]+)"/?>#

do you need the #urls array elsewhere? because else, you could simply:
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.rdf.u8";
my #urls;
open(my $fh, "<", $file) or die "Unable to open $file\n";
open (FH, ">>my_urls.txt") or die "$!";
while (my $line = <$fh>) {
if ($line =~ m/<(?:ExternalPage about|link r:resource)="([^\"]+)"\/?>/) {
print FH $1;
}
}
close(FH);
close $fh;

Related

Write old fasta header and new to file

I want to extract the old fasta names which looks something like this:
>Bartonella bibbi
AUUCCGGUUGAUCCUGCCGGAGGCCACUGCUAUCGGGGUCCG
The new headers should look like this:
>Seq1
AUUCCGGUUGAUCCUGCCGGAGGCCACUGCUAUCGGGGUCCG
and so on...
The Bartonella Bibbi should be saved together with the new name Seq1 in a new file an so on. So I've started a bit, by looking for lines with >, and then I split to get an array to get the old name. I don't know how to continue, because I want two things here, first to put the new name in there, but also extracting the old name together with the new in a file, and ALSO get an output file with my sequence and my new names. Please, any input from you will help!
#!/usr/bin/perl
use warnings;
use strict;
my $infile = $ARGV[0];
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
while (my $line = <$IN>) {
if ($line =~ /^>/) {
my #header = split (/\>/, $line);
my $oldfasta = "$header[1]";
}
}
So after some edits, this is the current script:
#!/usr/bin/perl
use warnings;
use strict;
my $infile = $ARGV[0];
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $seqid = 1;
my %id;
while (my $line = <$IN>) {
if ($line =~ /^>/) {
$id{"Seq$seqid "} = $line;
print ">Seq$seqid\n";
$seqid++
} else {
print $line;
}
}
my $outfile = 'output';
open my $OUT, '>', $outfile or die "Could not open $outfile: $!, $?"; # overwrites the file $outfile;
print $OUT %id;
This gives me a file that looks like this:
Seq29 >Sulfophobococcus_zilligii
Seq20 >Pyrococcus_shinkaii
and so on.
They are not in order, how do I sort them and get rid of the > in the species name?
You’re simply not printing anything. Once you add a print statement, it should work.
In addition, it’s unclear what you’re using split for. Just increase a counter for the sequence:
#!/usr/bin/perl
use warnings;
use strict;
my $infile = $ARGV[0];
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $seqid = 1;
while (my $line = <$IN>) {
if ($line =~ /^>/) {
print ">Seq$seqid\n";
$seqid++;
} else {
print $line;
}
}
Simply write the new entries as you create them.
#!/usr/bin/perl
use warnings;
use strict;
my $infile = $ARGV[0];
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $outfile = 'output';
open my $OUT, '>', $outfile or die "Could not open $outfile: $!, $?"; # overwrites the file $outfile;
my $seqid = 1;
while (my $line = <$IN>) {
if ($line =~ /^>(.+)/) {
print $OUT "Seq$seqid\t$1\n"
print ">Seq$seqid\n";
$seqid++
} else {
print $line;
}
}
I tried to fix the indentation but left the gratutious variable for the $OUT file name.
If you want to keep the mapping in memory for other reasons (maybe to develop this into a much more complex script) using an array instead of a hash would seem like a natural way to keep the entries sorted; the new label is trivially derivable from the array index.

Alter a file using information from another file

I want to alter the names in a phylip file using information from another file. The phylip is just one continuous string of information, and the names I want to alter (e.g. aaaaaaabyd) are embedded in it. Like so
((aaaaaaabyd:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,aaaaaaaafj:0.47991503739434709930):0.06859184769990583908,((aaaaaaaabk:0.09244297511609228524,aaaaaaaete:0.12568841555837687030):0.28431
(there are no new lines)
The names within are like aaaaaaaabk.
The other file has the information change to, like so in the other file,
aaaaaaaabk;Ciona savignyi
aaaaaaaete;Homo sapiens
aaaaaaaafj;Cryptosporidium hominis
aaaaaaaaad;Strongylocentrotus purpuratus
aaaaaaabyd;Theileria parva
aaaaaaaaaf;Plasmodium vivax
I have tried numerous things but this is the closest I got. The problem is it does it for one and doesn't print out the rest of the phylip file. I need to get to ((Theileria parva:0.23400159127856412500, etc.
open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $csv, "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $new_tree, "> raxml_tree.phy");
# Declare variables
my $find;
my $replace;
my $digest;
# put the file of the tree into string variable
my $string = <$tree>;
# open csv file
while (my $line = <$csv>) {
# aaaaaaaaaa;Ciona savignyi
if ($line =~ m/(\w+)\;+(\w+\s+\w*)/) {
$find = $1;
$replace = $2;
$string =~ s/$find/$replace/g;
}
}
print $new_tree "$string";
close $tree;
close $csv;
close $new_tree;
Some guidelines on your own code
The problem is almost certainly that you are opening the same file $ARGV[0] twice. Presumably one should be `$ARGV[1]
You must always use strict and use warnings at the top of every Perl program you write (there is very little point in declaring your variables unless use strict is in place) and declare all your variables with my as close as possible to their first point of use. It is bad form to declare all your variables in a block at the start, because it makes them all effectively global, and you lose most of the advantages of declaring lexical variables
You should use the three-parameter form of open, and it is a good idea to put the name of the file in the die string so that you can see which one failed. So
open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
becomes
open my $tree, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
You should look for simpler solutions rather than apply regex methods every time. $line =~ m/(\w+)\;+(\w+\s+\w*)/ is much tidier as chomp, split /;/
You shouldn't use double-quotes around variables when you want just the value of the variable, so print $new_tree "$string" should be print $new_tree $string
Rather than trying to use the data from the other file (please try to use useful names for items in your question, as it's tough to know what to call them when writing a solution) it is best to build a hash that contains all the translations
This program will do as you ask. It builds a regex consisting of an alternation of all the hash keys, and then converts all ocurrences of that pattern into its corresponding name. Only those names that are in your sample other file are translated: the others are left as they are
use strict;
use warnings;
use 5.014; # For non-destructive substitution
use autodie;
my %names;
open my $fh, '<', 'other_file.txt';
while ( <$fh> ) {
my ($k, $v) = split /;/, s/\s+\z//r;
$names{$k} = $v;
}
open $fh, '<', 'phylip.txt';
my $data = <$fh>;
close $fh;
my $re = join '|', sort { length $b <=> length $a } keys %names;
$re = qr/(?:$re)/;
$data =~ s/\b($re)\b/$names{$1}/g;
print $data;
output
((Theileria parva:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,Cryptosporidium hominis:0.47991503739434709930):0.06859184769990583908,((Ciona savignyi:0.09244297511609228524,Homo sapiens:0.12568841555837687030):0.28431
Update
Here is a revised version of your own program with the above points accounted for and the bugs fixed
use strict;
use warnings;
open my $tree_fh, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
my $string = <$tree_fh>;
close $tree_fh;
open my $csv_fh, '<', $ARGV[1] or die qq{Failed to open "$ARGV[1]" for input: $!\n};
while ( <$csv_fh> ) {
chomp;
my ($find, $replace) = split /;/;
$string =~ s/$find/$replace/g;
}
close $csv_fh;
open my $new_tree_fh, '>', 'raxml_tree.phy' or die qq{Failed to open "raxml_tree.phy" for output: $!\n};
print $new_tree_fh $string;
close $new_tree_fh;

How to replace ^M character in perl code in same file

I am looking to delete control-M character file in perl code (not perl one liner).
I tried this, but it will write to new file. Whats the way to do it in same file?
#!/usr/bin/perl
open (IN, '<', "FILE.dat") or die "$!";
open (OUT, '>', "FILE.dat.cpy") or die "$!";
while(<IN>)
{
$line = $_;
$line=~ tr/\015//d;
print OUT "$line";
}
close (IN);
close (OUT);
store the file internal in a String.
#!/usr/bin/perl
my $content = ''
open (IN, '<', "FILE.dat") or die "$!";
while(<IN>)
{
$line = $_;
$line=~ tr/\015//d;
$content .$line
}
close (IN);
open (OUT, '>', "FILE.dat") or die "$!";
print OUT $line;
close (OUT);
It could be done using $^I variable.
use autodie;
local $^I = "";
local #ARGV = "FILE.dat";
while (my $line = <>) {
$line=~ tr/\r//d;
print $line;
}
From perlvar
$^I - The current value of the inplace-edit extension. Use undef to disable inplace editing.
So it is undef by default, empty string is used to edit in-place, and non-empty string will be added as suffix to backup file name.
You needn't be afraid of coping the edited content to a new file, and then renaming. This is the standard way to edit content unless you're dealing with a file with fixed with records.
Check out How do I change, delete, or insert a line in a file, or append to the beginning of a file? to observe most of the ways to edit content, and the majority of them will ultimately be copying to a new file.
My preferred advice is to use in-place edit as demonstrated by mpapec. The only addition is that Windows forces you to specify a backup, so just need to add the unlink line after the processing.
use strict;
use warnings;
use autodie;
my $file = 'FILE.dat';
local #ARGV = $file;
local $^I = '.bak';
while (<>) {
tr/\r//d;
print;
}
unlink "$file$^I";
If you insist on being able to open the file only once, then perhaps you can take a look at I still don't get locking. I just want to increment the number in the file. How can I do this?
use strict;
use warnings;
use autodie;
use Fcntl qw(:seek);
my $file = 'afile.dat';
open my $fh, '+<:raw', $file;
my $data = do {local $/; <$fh>};
seek $fh, SEEK_SET, 0;
truncate $fh, 0;
$data =~ tr/\r//d;
print $fh $data;
close $fh;

How to search and replace using hash with Perl

I'm new to Perl and I'm afraid I am stuck and wanted to ask if someone might be able to help me.
I have a file with two columns (tab separated) of oldname and newname.
I would like to use the oldname as key and newname as value and store it as a hash.
Then I would like to open a different file (gff file) and replace all the oldnames in there with the newnames and write it to another file.
I have given it my best try but am getting a lot of errors.
If you could let me know what I am doing wrong, I would greatly appreciate it.
Here are how the two files look:
oldname newname(SFXXXX) file:
genemark-scaffold00013-abinit-gene-0.18 SF130001
augustus-scaffold00013-abinit-gene-1.24 SF130002
genemark-scaffold00013-abinit-gene-1.65 SF130003
file to search and replace in (an example of one of the lines):
scaffold00013 maker gene 258253 258759 . - . ID=maker-scaffold00013-augustus-gene-2.187;Name=maker-scaffold00013-augustus-gene-2.187;
Here is my attempt:
#!/usr/local/bin/perl
use warnings;
use strict;
my $hashfile = $ARGV[0];
my $gfffile = $ARGV[1];
my %names;
my $oldname;
my $newname;
if (!defined $hashfile) {
die "Usage: $0 hash_file gff_file\n";
}
if (!defined $gfffile) {
die "Usage: $0 hash_file gff_file\n";
}
###save hashfile with two columns, oldname and newname, into a hash with oldname as key and newname as value.
open(HFILE, $hashfile) or die "Cannot open $hashfile\n";
while (my $line = <HFILE>) {
chomp($line);
my ($oldname, $newname) = split /\t/;
$names{$oldname} = $newname;
}
close HFILE;
###open gff file and replace all oldnames with newnames from %names.
open(GFILE, $gfffile) or die "Cannot open $gfffile\n";
while (my $line2 = <GFILE>) {
chomp($line2);
eval "$line2 =~ s/$oldname/$names{oldname}/g";
open(OUT, ">SFrenamed.gff") or die "Cannot open SFrenamed.gff: $!";
print OUT "$line2\n";
close OUT;
}
close GFILE;
Thank you!
Your main problem is that you aren't splitting the $line variable. split /\t/ splits $_ by default, and you haven't put anything in there.
This program builds the hash, and then constructs a regex from all the keys by sorting them in descending order of length and joining them with the | regex alternation operator. The sorting is necessary so that the longest of all possible choices is selected if there are any alternatives.
Every occurrence of the regex is replaced by the corresponding new name in each line of the input file, and the output written to the new file.
use strict;
use warnings;
die "Usage: $0 hash_file gff_file\n" if #ARGV < 2;
my ($hashfile, $gfffile) = #ARGV;
open(my $hfile, '<', $hashfile) or die "Cannot open $hashfile: $!";
my %names;
while (my $line = <$hfile>) {
chomp($line);
my ($oldname, $newname) = split /\t/, $line;
$names{$oldname} = $newname;
}
close $hfile;
my $regex = join '|', sort { length $b <=> length $a } keys %names;
$regex = qr/$regex/;
open(my $gfile, '<', $gfffile) or die "Cannot open $gfffile: $!";
open(my $out, '>', 'SFrenamed.gff') or die "Cannot open SFrenamed.gff: $!";
while (my $line = <$gfile>) {
chomp($line);
$line =~ s/($regex)/$names{$1}/g;
print $out $line, "\n";
}
close $out;
close $gfile;
Why are you using an eval? And $oldname is going to be undefined in the second while loop, because the first while loop you redeclare them in that scope (even if you used the outer scope, it would store the very last value that you processed, which wouldn't be helpful).
Take out the my $oldname and my $newname at the top of your script, it is useless.
Take out the entire eval line. You need to repeat the regex for each thing you want to replace. Try something like:
$line2 =~ s/$_/$names{$_}/g for keys %names;
Also see Borodin's answer. He made one big regex instead of a loop, and caught your lack of the second argument to split.

Perl problems printing output to a new file

I want to remove all lines in a text file that start with HPL_ I have acheived this and can print to screen, but when I try to write to a file, I just get the last line of the amended text printed in the new file. Any help please!
open(FILE,"<myfile.txt");
#LINES = <FILE>;
close(FILE);
open(FILE,">myfile.txt");
foreach $LINE (#LINES) {
#array = split(/\:/,$LINE);
my $file = "changed";
open OUTFILE, ">$file" or die "unable to open $file $!";
print OUTFILE $LINE unless ($array[0] eq "HPL_");
}
close(FILE);
close (OUTFILE);
exit;
You just want to remove all lines that start with HPL_? That's easy!
perl -pi -e 's/^HPL_.*//s' myfile.txt
Yes, it really is just a one-liner. :-)
If you don't want to use the one-liner, re-write the "write to file" portion as follows:
my $file = "changed";
open( my $outfh, '>', $file ) or die "Could not open file $file: $!\n";
foreach my $LINE (#LINES) {
my #array = split(/:/,$LINE);
next if $array[0] eq 'HPL_';
print $outfh $LINE;
}
close( $outfh );
Note how you are open()ing the file each time through the loop. This is causing the file to only contain the last line, as using open() with > means "overwrite what's in the file". That's the major problem with your code as it stands.
Edit: As an aside, you want to clean up your code. Use lexical filehandles as I've shown. Always add the three lines that tchrist posted at the top of every one of your Perl programs. Use the three-operator version of open(). Don't slurp the entire file into an array, as if you try to read a huge file it could cause your computer to run out of memory. Your program could be re-written as:
#!perl
use strict;
use autodie;
use warnings FATAL => "all";
my $infile = "myfile.txt";
my $outfile = "changed.txt";
open( my $infh, '<', $infile );
open( my $outfh, '>', $outfile );
while( my $line = <$infh> ) {
next if $line =~ /^HPL_/;
print $outfh $line;
}
close( $outfh );
close( $infh );
Note how with use autodie you don't need to add or die ... to the open() function, as the autodie pragma handles that for you.
The issue with your code is that you open the file for output within your line-processing loop which, due to your use of the '>' form of open, opens the file each time for write, obliterating any previous content.
Move the invocation of open() to the top of your file, above the loop, and it should work.
Also, I'm not sure of your intent but at line 4 of your example, you reopen your input file for write (using '>'), which also clobbers anything it contains.
As a side note, you might try reading up on Perl's grep() command which is designed to do exactly what you need, as in:
#!/usr/bin/perl
use strict;
use warnings;
open(my $in, '<', 'myfile.txt') or die "failed to open input for read: $!";
my #lines = <$in> or die 'no lines to read from input';
close($in);
# collect all lines that do not begin with HPL_ into #result
my #result = grep ! /^HPL_/, #lines;
open(my $out, '>', 'changed.txt') or die "failed to open output for write: $!";
print { $out } #result;
close($out);