Alter a file using information from another file - perl

I want to alter the names in a phylip file using information from another file. The phylip is just one continuous string of information, and the names I want to alter (e.g. aaaaaaabyd) are embedded in it. Like so
((aaaaaaabyd:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,aaaaaaaafj:0.47991503739434709930):0.06859184769990583908,((aaaaaaaabk:0.09244297511609228524,aaaaaaaete:0.12568841555837687030):0.28431
(there are no new lines)
The names within are like aaaaaaaabk.
The other file has the names to change them to, like so:
aaaaaaaabk;Ciona savignyi
aaaaaaaete;Homo sapiens
aaaaaaaafj;Cryptosporidium hominis
aaaaaaaaad;Strongylocentrotus purpuratus
aaaaaaabyd;Theileria parva
aaaaaaaaaf;Plasmodium vivax
I have tried numerous things but this is the closest I got. The problem is it does it for one and doesn't print out the rest of the phylip file. I need to get to ((Theileria parva:0.23400159127856412500, etc.
open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $csv, "$ARGV[0]") or die "Failed to open file: $!\n";
open(my $new_tree, "> raxml_tree.phy");
# Declare variables
my $find;
my $replace;
my $digest;
# put the file of the tree into string variable
my $string = <$tree>;
# open csv file
while (my $line = <$csv>) {
# aaaaaaaaaa;Ciona savignyi
if ($line =~ m/(\w+)\;+(\w+\s+\w*)/) {
$find = $1;
$replace = $2;
$string =~ s/$find/$replace/g;
}
}
print $new_tree "$string";
close $tree;
close $csv;
close $new_tree;

Some guidelines on your own code
The problem is almost certainly that you are opening the same file $ARGV[0] twice. Presumably one should be $ARGV[1]
You must always use strict and use warnings at the top of every Perl program you write (there is very little point in declaring your variables unless use strict is in place) and declare all your variables with my as close as possible to their first point of use. It is bad form to declare all your variables in a block at the start, because it makes them all effectively global, and you lose most of the advantages of declaring lexical variables
You should use the three-parameter form of open, and it is a good idea to put the name of the file in the die string so that you can see which one failed. So
open(my $tree, "$ARGV[0]") or die "Failed to open file: $!\n";
becomes
open my $tree, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
You should look for simpler solutions rather than apply regex methods every time. $line =~ m/(\w+)\;+(\w+\s+\w*)/ is much tidier as chomp, split /;/
You shouldn't use double-quotes around variables when you want just the value of the variable, so print $new_tree "$string" should be print $new_tree $string
Rather than trying to use the data from the other file (please try to use useful names for items in your question, as it's tough to know what to call them when writing a solution) it is best to build a hash that contains all the translations
This program will do as you ask. It builds a regex consisting of an alternation of all the hash keys, and then converts all occurrences of that pattern into its corresponding name. Only those names that are in your sample other file are translated: the others are left as they are
use strict;
use warnings;
use 5.014; # For non-destructive substitution
use autodie;
my %names;
open my $fh, '<', 'other_file.txt';
while ( <$fh> ) {
my ($k, $v) = split /;/, s/\s+\z//r;
$names{$k} = $v;
}
open $fh, '<', 'phylip.txt';
my $data = <$fh>;
close $fh;
my $re = join '|', sort { length $b <=> length $a } keys %names;
$re = qr/(?:$re)/;
$data =~ s/\b($re)\b/$names{$1}/g;
print $data;
output
((Theileria parva:0.23400159127856412500,(((aaaaaaaaxv:0.44910864993667892753,aaaaaaaagf:0.51328033054009691849):0.06090419044604544752,((aaaaaaabyc:0.11709094683204501752,aaaaaaafzz:0.04488198976629347720):0.09529995111708353117,((aaaaaaadbn:0.34408087090010841536,Cryptosporidium hominis:0.47991503739434709930):0.06859184769990583908,((Ciona savignyi:0.09244297511609228524,Homo sapiens:0.12568841555837687030):0.28431
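One property of the single-alternation approach is that every name is replaced in one pass over the string, so text that has already been substituted is never scanned again. Here is a minimal sketch of the same idea with made-up inline data (the tree string and the two-entry mapping are illustrative, not taken from your files). It also runs each key through quotemeta, which is an extra precaution I am assuming you might want in case a key ever contains regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical data for illustration only
my %names = (
    aaaaaaaabk => 'Ciona savignyi',
    aaaaaaaete => 'Homo sapiens',
);
my $tree = '((aaaaaaaabk:0.09,aaaaaaaete:0.12):0.28,aaaaaaafzz:0.04);';

# Longest keys first, each one quoted so that it is matched literally
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a } keys %names;

# One pass over a copy of the string; unknown names are left untouched
( my $renamed = $tree ) =~ s/\b($alternation)\b/$names{$1}/g;

print "$renamed\n";
```

The copy-then-substitute form is used here so that the sketch does not depend on the /r modifier.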
Update
Here is a revised version of your own program with the above points accounted for and the bugs fixed
use strict;
use warnings;
open my $tree_fh, '<', $ARGV[0] or die qq{Failed to open "$ARGV[0]" for input: $!\n};
my $string = <$tree_fh>;
close $tree_fh;
open my $csv_fh, '<', $ARGV[1] or die qq{Failed to open "$ARGV[1]" for input: $!\n};
while ( <$csv_fh> ) {
chomp;
my ($find, $replace) = split /;/;
$string =~ s/$find/$replace/g;
}
close $csv_fh;
open my $new_tree_fh, '>', 'raxml_tree.phy' or die qq{Failed to open "raxml_tree.phy" for output: $!\n};
print $new_tree_fh $string;
close $new_tree_fh;

Related

perl write variables to a file

Here's my code to parse a configuration file, write the retrieved data to another file and send it to a MySQL database.
The database connection and writing data to a table works fine, however I can't get it to write data to the mentioned file mongoData.txt.
I'm quite new to Perl, so any help will be highly appreciated.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $line;
# Retrieving data
open( my $FILE, "<", "/etc/mongod.conf" )
or die "Cannot find file! : $!\n";
while ( $line = <$FILE> ) {
chomp($line);
my ( $KEY, $VALUE ) = split /\:/, $line;
# Ignoring commented lines
$_ = $line;
unless ( $_ = ~/^#/ ) {
# Write to file
open my $FILE2, ">", "/home/sierra/Documents/mongoData.txt"
or die "Cannot create file $!\n";
print $FILE2 "$KEY", "$VALUE\n";
}
# Connection to SQL database
my $db = DBI->connect(( "dbi:mysql:dbname=mongodconf;
host = localhost;", "root", "sqladmin"
)) or die "can't connect to mysql";
# Inserting into database
$db->do("insert into data values ('$KEY', '$VALUE')")
or die "query error\n";
}
close($FILE);
Every time you open a file for output, you create a new file and delete any pre-existing file with the same name. That means you're going to be left with only the last line you wrote to the file
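To see the effect in isolation, here is a small sketch that writes three lines to a scratch file, first re-opening it with '>' inside the loop and then opening it once before the loop. The file name demo.txt and the helper read_all are made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory, removed automatically when the program exits
my $dir  = tempdir( CLEANUP => 1 );
my $path = "$dir/demo.txt";    # hypothetical file name

# Re-opening with '>' inside the loop truncates the file every time
for my $word (qw(one two three)) {
    open my $fh, '>', $path or die "Cannot open $path: $!";
    print $fh "$word\n";
    close $fh or die "Cannot close $path: $!";
}
my $truncated = read_all($path);    # only the last line survives

# Opening once before the loop keeps every line
open my $fh, '>', $path or die "Cannot open $path: $!";
print $fh "$_\n" for qw(one two three);
close $fh or die "Cannot close $path: $!";
my $kept = read_all($path);

print "inside the loop: $truncated";
print "before the loop: $kept";

# Helper: return the whole contents of a file as one string
sub read_all {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file: $!";
    local $/;
    return scalar <$in>;
}
```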
Here are some more pointers
Variable identifiers should in general be all in digits, lower case letters, and underscores. Capital letters are reserved for global identifiers such as package names
If you are running Perl v5.10.1 or later then you can use autodie, which checks all IO operations for you and removes the need to test the return status by hand
If you use a die string that has no newline at the end, then Perl will add information about the source file name and line number where it occurred, which can be useful for debugging
It is unnecessary to name your loop control variables. Programs can be made much more concise and readable by using Perl's pronoun variable $_ which is the default for many built-in operators
It is wasteful to reconnect to your database every time you need to make changes. You should connect once at the top of your program and use that static connection throughout your code
You should use placeholders when passing parameter expressions to an SQL operation. Interpolating values directly into the SQL string can be dangerous, whereas with placeholders DBI will quote them correctly for you
There is no need to close input files explicitly. Everything will be closed automatically at the end of the program. But if you are worried about the integrity of your output data, you may want to do an explicit close on output file handles so that you can check that they succeeded
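The point about die strings is easy to check for yourself. This sketch traps two die calls with eval and prints what each one produced; the message text is arbitrary.

```perl
use strict;
use warnings;

# With a trailing newline the message is used exactly as written
eval { die "Cannot open file\n" };
my $with_newline = $@;

# Without one, Perl appends " at <file> line <n>." to the message
eval { die "Cannot open file" };
my $without_newline = $@;

print $with_newline;
print $without_newline;
```

The second line of output carries the script name and line number, which is usually what you want while debugging.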
Here's what I would write. Rather than testing whether each line of the input begins with a hash, it removes everything from the first hash character onwards and then checks to see if there are any non-blank characters in what remains. That allows for trailing comments in the data
#!/usr/bin/perl
use strict;
use warnings 'all';
use autodie;
use DBI;
my ($input, $output, $dsn) = qw{
/etc/mongod.conf
/home/sierra/Documents/mongoData.txt
dbi:mysql:dbname=mongodconf;host=localhost;
};
open my $fh, '<', $input;
open my $out_fh, '>', $output;
my $dbh = DBI->connect($dsn, qw/ root sqladmin /)
or die "Can't connect to MySQL: $DBI::errstr";
while ( <$fh> ) {
chomp;
s/#.*//;
next unless /\S/;
my ( $key, $val ) = split /\:/;
print $out_fh "$key $val\n";
$dbh->do('insert into data values (?, ?)', undef, $key, $val);
}
close $out_fh or die $!;
$dbh->disconnect or warn $dbh->errstr;
You need to append the text to the file mongoData.txt instead of recreating it on every iteration
while ($line=<$FILE>)
{
chomp ($line);
my ($KEY, $VALUE) = split /\:/,$line;
# Ignoring commented lines
unless ($line =~ /^#/)
{
# Open for appending, and close again while the handle is still in scope
open my $FILE2, ">>", "/home/sierra/Documents/mongoData.txt" or die "Cannot open file $!\n";
print $FILE2 "$KEY","$VALUE\n";
close($FILE2);
}
}
or else
Open the text file once, before the while loop
open my $FILE2, ">", "/home/sierra/Documents/mongoData.txt"
or die "Cannot create file $!\n";
while ($line=<$FILE>)
{
chomp ($line);
my ($KEY, $VALUE) = split /\:/,$line;
# Ignoring commented lines
unless ($line =~ /^#/)
{
print $FILE2 "$KEY","$VALUE\n";
}
}
close($FILE2);
Maybe this will help you.

My perl script isn't working, I have a feeling it's the grep command

I'm trying to search one file for instances of a
number and print the line if the other file contains that number
#!/usr/bin/perl
open(file, "textIds.txt"); #
@file = <file>; #file looking into
# close file; #
while(<>){
$temp = $_;
$temp =~ tr/|/\t/; #puts tab between name and id
@arrayTemp = split("\t", $temp);
@found=grep{/$arrayTemp[1]/} <file>;
if (defined $found[0]){
#if (grep{/$arrayTemp[1]/} <file>){
print $_;
}
@found=();
}
print "\n";
close file;
#the input file lines have the format of
#John|7791 154
#Smith|5432 290
#Conor|6590 897
#And in the file the format is
#5432
#7791
#6590
#23140
There are some issues in your script.
Always include use strict; and use warnings;.
This would have told you about odd things in your script in advance.
Never use barewords as filehandles as they are global identifiers. Use three-parameter-open
instead: open( my $fh, '<', 'textIds.txt');
use autodie; or check whether the opening worked.
You read and store textIds.txt into the array @file but later on (in your grep) you are
again trying to read from that file(handle) (with <file>). As @PaulL said, this will always
give undef (false) because the file was already read.
Replacing | with tabs and then splitting at tabs is not necessary. You can split at the
tabs and pipes at the same time as well (assuming "John|7791 154" is really "John|7791\t154").
You're talking about "input file" and "in file" without exactly telling which is which.
I assume your "textIds.txt" is the one with only the numbers and the other input file is the
one read from STDIN (with the |'s in it).
With this in mind your script could be written as:
#!/usr/bin/perl
use strict;
use warnings;
# Open 'textIds.txt' and slurp it into the array @file:
open( my $fh, '<', 'textIds.txt') or die "cannot open file: $!\n";
my @file = <$fh>;
close($fh);
# iterate over STDIN and compare with lines from 'textIds.txt':
while( my $line = <>) {
# split "John|7791\t154" into ("John", "7791", "154"):
my ($name, $number1, $number2) = split(/\||\t/, $line);
# compare $number1 to each member of @file and print if found:
if ( grep( /$number1/, @file) ) {
print $line;
}
}
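A pattern match inside grep does substring matching (an ID of 5432 would also match a line containing 154320) and rescans the whole list for every input line. If exact matches are what you want, a hash lookup may be a better fit. Here is a rough sketch using inline sample data in place of the two files; it assumes the name and the numbers are separated by a pipe and whitespace, and the variable names are my own.

```perl
use strict;
use warnings;

# Sample data standing in for textIds.txt and the piped input (assumed formats)
my @id_lines    = ( "5432\n", "7791\n", "23140\n" );
my @input_lines = ( "John|7791 154\n", "Smith|5432 290\n", "Conor|6590 897\n" );

# Build a lookup table of the known IDs, without their trailing newlines
chomp( my @ids = @id_lines );
my %is_known = map { $_ => 1 } @ids;

# Keep only the input lines whose first number is a known ID
my @matches;
for my $line (@input_lines) {
    my ( $name, $number1 ) = split /[|\s]+/, $line;
    push @matches, $line if defined $number1 && $is_known{$number1};
}

print @matches;
```

With real files you would fill @id_lines from the ID file and loop over <> instead of @input_lines.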

How to get the output for this code?

I have a file t_code.txt in which I want to replace all occurrences of the strings PIOMUX_UART1_TXD and PIOMUX_UART1_RXD with the strings in @array1 containing TXD and RXD respectively, and then print the result to a new file c_code2.txt, but it's not working
open my $f6, '<', 't_code.txt' or die $!;
my @lines = <$f6>;
my @newlines;
foreach (@lines) {
$_ =~ s/PIOMUX_UART1_TXD/ grep ( / TXD / )(@array1)/g;
$_ =~ s/PIOMUX_UART1_RXD/ grep ( / RXD / )(@array1)/g;
push(@newlines, $_);
}
close($f6);
open my $output, '>', 'c_code2.txt' or die "Can't open the output file!";
print $output @newlines;
close($output);
Since @array1 (a dreadful choice of identifier, by the way) doesn't change inside the loop, it is best to build the replacement strings outside instead of every time you make a replacement.
It isn't clear exactly what string you want to replace PIOMUX_UART1_TXD and PIOMUX_UART1_RXD with, but this code joins all the matching elements of the array with commas and uses that. I hope it's clear how to do something different if you need to.
I've also used a while loop, as there's no need to read the whole file into an array beforehand.
my ($in_file, $out_file) = qw/ t_code.txt c_code2.txt /;
open my $in_fh, '<', $in_file or die qq{Unable to open "$in_file" for reading: $!};
open my $out_fh, '>', $out_file or die qq{Unable to open "$out_file" for writing: $!};
my $txd = join ',', grep /TXD/, @array1;
my $rxd = join ',', grep /RXD/, @array1;
while ( <$in_fh> ) {
s/PIOMUX_UART1_TXD/$txd/g;
s/PIOMUX_UART1_RXD/$rxd/g;
print $out_fh $_;
}
close $out_fh or die $!;
Several problems in the code:
To be able to use code in the replacement part of a substitution, you must use the /e modifier.
In a s/// construct, you can't use / unquoted. Either change the separator, or backslash it.
The replacement part in a substitution is a string. In case of code, it's evaluated in scalar context. grep in scalar context returns the number of matches.
Cf:
#! /usr/bin/perl
use warnings;
use strict;
my @array1 = qw( aTXDb cRXDd );
while (<DATA>) {
s/PIOMUX_UART1_TXD/join q(), grep m=TXD=, @array1/eg;
s/PIOMUX_UART1_RXD/join q(), grep m=RXD=, @array1/eg;
print;
}
__DATA__
PIOMUX_UART1_TXD
PIOMUX_UART1_RXD
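The difference between grep in scalar and list context is the crux here, so this short sketch shows both side by side, followed by the s///e form. The array contents and the sample line are made up for the illustration.

```perl
use strict;
use warnings;

my @array1 = qw( aTXDb cRXDd eTXDf );    # made-up sample values

# Scalar context: grep returns how many elements matched
my $count = grep { /TXD/ } @array1;

# List context: grep returns the matching elements themselves
my @found = grep { /TXD/ } @array1;

# join imposes list context on grep, so this works inside s///e as well
( my $line = "pin = PIOMUX_UART1_TXD;" ) =~
    s/PIOMUX_UART1_TXD/join ',', grep { m{TXD} } @array1/e;

print "$count\n";    # 2
print "@found\n";    # aTXDb eTXDf
print "$line\n";     # pin = aTXDb,eTXDf;
```

Note that the inner match uses m{...} so that no unescaped / appears in the replacement part.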

splitting a large file into small files based on column value in perl

I am trying to split up a large file (around 17.6 million rows) into 6-7 small files based on a column value. Currently, I am using the SQL bcp utility to load all the data into one table and creating separate files using bcp out.
But someone suggested that I use Perl, as it would be faster and would not need a table. As I am not a Perl guy, I am not sure how to do it in Perl.
Any help..
INPUT file :
inputfile.txt
0010|name|address|city|.........
0020|name|number|address|......
0030|phone no|state|street|...
output files:
0010.txt
0010|name|address|city|.........
0020.txt
0020|name|number|address|......
0030.txt
0030|phone no|state|street|...
It is simplest to keep a hash of output file handles, keyed by the file name. This program shows the idea. The number at the start of each record is used to create the name of the file where it belongs, and a file of that name is opened unless we already have a file handle for it.
All of the handles are closed once all of the data has been processed. Any errors are caught by use autodie, so explicit checking of the open, print and close calls is unnecessary.
use strict;
use warnings;
use autodie;
open my $in_fh, '<', 'inputfile.txt';
my %out_fh;
while (<$in_fh>) {
next unless /^(\d+)/;
my $filename = "$1.txt";
open $out_fh{$filename}, '>', $filename unless $out_fh{$filename};
print { $out_fh{$filename} } $_;
}
close $_ for values %out_fh;
Note that close caught me out here because, unlike most operators that work on $_ if you pass no parameters, a bare close will close the currently selected file handle. That is a bad choice IMO, but it's way too late to change it now
17.6 million rows is going to be a pretty large file, I'd imagine. It'll still be slow with perl to process.
That said, you're going to want something like the below:
use strict;
use warnings;
my $input = 'FILENAMEHERE.txt';
my %results;
open(my $fh, '<', $input) or die "cannot open input file: $!";
while (<$fh>) {
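# The pipe must be escaped, otherwise split treats it as regex alternation
my ($key) = split /\|/, $_;
# Collect each line under its key (the array is created on first use)
push @{ $results{$key} }, $_;
}
for my $filename (keys %results) {
open(my $out, '>', "$filename.txt") or die "Cannot open output file $filename.txt: $!";
# The stored lines still end in newlines, so print them as they are
print $out @{ $results{$filename} };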
close($out);
}
I haven't explicitly tested this, but it should get you going in the right direction.
$ perl -F'\|' -lane '
$key = $F[0];
$fh{$key} or open $fh{$key}, ">", "$key.txt" or die $!;
print { $fh{$key} } $_
' inputfile.txt
perl -Mautodie -ne'
sub out { $h{$_[0]} ||= open(my $f, ">", "$_[0].txt") && $f }
print { out($1) } $_ if /^(\d+)/;
' file

Perl while loop is repeating itself

I am 100% new to Perl but do have some PHP knowledge. I'm trying to create a quick script that will take the @urls vars and save them to a .txt file. The problem that I'm having is that it saves the URLs again every time it runs through the loop, which is super annoying. So when the loop runs, it'll look like this.
url1.com
url1.com url2.com
url1.com url2.com url3.com
What I would like it to look like is just a plain and simple:
url1.com
url2.com
url3.com
Here is my code. If anyone can help, I would appreciate it SO SO much!
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.rdf.u8";
my @urls;
open(my $fh, "<", $file) or die "Unable to open $file\n";
while (my $line = <$fh>) {
if ($line =~ m/<(?:ExternalPage about|link r:resource)="([^\"]+)"\/?>/) {
push @urls, $1;
}
open (FH, ">>my_urls.txt") or die "$!";
print FH "@urls ";
close(FH);
}
close $fh;
Your print is inside your while loop. It sounds like you want to move your print outside of the loop.
Or if you want to print each url as you go through each line, move the declaration of "my @urls" down into the loop, then it will get reset each line
Shouldn't this part:
open (FH, ">>my_urls.txt") or die "$!";
print FH "@urls ";
close(FH);
...be placed outside of the while loop? It makes no sense within the loop, as @urls is apparently incomplete there.
And two regex-related side notes: first, with the m operator you may choose another set of delimiters so you don't have to escape the / sign; second, it's not necessary to escape the " sign within a character class definition. In fact, it's not required to escape it in a regex at all, unless you choose this character as a delimiter.
So your regex may look like this:
$line =~ m#<(?:ExternalPage about|link r:resource)="([^"]+)"/?>#
Do you need the @urls array elsewhere? If not, you could simply do this:
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.rdf.u8";
open(my $fh, "<", $file) or die "Unable to open $file\n";
open (FH, ">>my_urls.txt") or die "$!";
while (my $line = <$fh>) {
if ($line =~ m/<(?:ExternalPage about|link r:resource)="([^\"]+)"\/?>/) {
# Write each URL on its own line as soon as it is found
print FH "$1\n";
}
}
close(FH);
close $fh;