Perl Split File Into Lines and Variables - perl

I am currently splitting a Perl file which holds some user/password information and am doing it successfully, but am not satisfied with my code. I am sure there is a better way to do it in Perl (I am a beginner). If someone could come up with a slicker way that would be great!
my $i = 1;
my $DB;
my $DBHOST;
my $DBUSER;
my $DBPASS;
my $filename = "some_file";
open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
chomp $row;
if ($i == 1) {
$DB = (split /=/, $row)[1];
}
if ($i == 2) {
$DBHOST = (split /=/, $row)[1];
}
if ($i == 3) {
$DBUSER = (split /=/, $row)[1];
}
if ($i == 4) {
$DBPASS = (split /=/, $row)[1];
}
$i++;
}

map() is pretty handy for things like this:
my ($DB, $DBHOST, $DBUSER, $DBPASS) = map {$_ =~ /.*?=(.*)/} <$fh>;
What's happening:
map() operates on lists, so it treats <$fh> as one
for each element in the list (a file line in this case), assign it into the default variable ($_)
then, capture the part of the line we want using regex and return it; map collects the captures into a list, and the list assignment on the left-hand side hands them to the variables in order (first capture to $DB, second to $DBHOST, and so on)
the regex operates as follows:
/
.*? # ignore everything, non greedy until we match a
= # our delimiter
( # begin capture
.* # capture everything until end of line (less the newline char)
) # end capture
/
Note that this solution will iterate all the way through the whole file, even after all four of the variables are populated (as does the way you've done it in your OP).
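Also note that there's no error checking here: a line without an = contributes nothing to the list, so the later values shift into the wrong variables and the last ones stay undefined, which produces "uninitialized value" warnings when they are used.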
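To make the error-checking caveat concrete, here is a minimal sketch that reads from an in-memory filehandle instead of a real file and dies if it does not capture exactly four values. The sample data and the parse_values name are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text after the first '=' on each line.
sub parse_values {
    my ($fh) = @_;
    # In list context a failed match returns an empty list, so lines
    # without '=' simply contribute nothing.
    my @values = map { /.*?=(.*)/ } <$fh>;
    return @values;
}

# Sample data standing in for the real file (an assumption).
my $sample = "DB=mydb\nDBHOST=localhost\nDBUSER=me\nDBPASS=secret\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @values = parse_values($fh);
close $fh;
die "Expected 4 values, got " . scalar(@values) . "\n" unless @values == 4;

my ($DB, $DBHOST, $DBUSER, $DBPASS) = @values;
print "$DB $DBHOST $DBUSER $DBPASS\n";
```

Checking the count before the list assignment catches a missing or malformed line early, rather than letting values land in the wrong variables.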

I think I would do away with your individual variables and store the connection information in a hash.
my %db_conn;
while (<$fh>) {
    chomp; # otherwise each value keeps its trailing newline
    my ($key, $val) = split /=/, $_, 2;
    $db_conn{$key} = $val;
}
Of course, this assumes that whatever is to the left of the = on each line is a unique identifier for the value.
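A self-contained sketch of the hash approach, assuming key=value lines like those in the question. The read_config name and the sample data are hypothetical, and the input comes from an in-memory filehandle so the snippet runs without a file.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "key=value" lines into a hash reference.
sub read_config {
    my ($fh) = @_;
    my %conf;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /=/;              # skip blank or malformed lines
        my ($key, $val) = split /=/, $line, 2; # limit 2 keeps '=' inside values
        $conf{$key} = $val;
    }
    return \%conf;
}

# Sample data standing in for the real file (an assumption).
my $sample = "DB=mydb\nDBHOST=localhost\nDBUSER=me\nDBPASS=p=ss\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $db_conn = read_config($fh);
close $fh;

print "$_ => $db_conn->{$_}\n" for sort keys %$db_conn;
```

The third argument to split matters for passwords: without it, a value containing = would be cut at the second =.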

Related

Perl question. Need a way to replace two variables on the same line

I am trying to write a replace script in Perl, and I have it working halfway, but it seems that I cannot replace two strings in the same line.
I have a json file named foo.json that contains the following line: "title": "%CLIENT% Dashboard Web Map %WEBMAPENV%",
Now, I have a second file named env.txt that contains all the variables that I wish to use. In this file, there is an entry called: %WEBMAPENV%=(Test-Dev). My goal is to have Perl read the file env.txt and replace both %CLIENT% and %WEBMAPENV% simultaneously.
Here is my code so far:
my $envFilePath = "$dirScripts/env/env.txt";
# Reading Firebase variables from Test environment file.
open($fh, "<", $envFilePath);
while (my $line=<$fh>) {
if ($line eq "\n") {
next;
}
if ($line =~ m/^(%\w+%)=/) {
$cur_key = $1;
$line =~ s/$cur_key=//;
$replacements{$cur_key} = $line;
} else {
$replacements{$cur_key} .= $line;
}
}
...
my $targetFilePath3 = "$dirHome/foo.json";
tie my @v_lines, 'Tie::File', $targetFilePath3, autochomp => 0 or die $!;
replaceEnvVars(@v_lines);
# Replace the environment variables as part of the setup.
sub replaceEnvVars {
for my $line (@_) {
if ($line =~ m/(%\w+%)/) {
my $key = $1;
if (defined($replacements{$key})) {
my $value = $replacements{$key};
chomp $value;
$line =~ s/$key/$value/g;
}
}
}
untie @_;
}
I am only able to substitute one variable per line, but I need to be able to handle 2.
Can anyone offer some help?
Derek
You only check for one.
if ($line =~ m/(%\w+%)/) { ... }
Solution:
# Clean up %replacements before using it.
chomp for values %replacements;
for my $line (@_) {
$line =~ s{(%\w+%)}{ $replacements{$1} // $1 }eg;
}
By adding a loop inside of s/// (through the use of /g) rather than a loop around s///, this one doesn't mess up if the values contain %.
/e means the replacement will be run as Perl code.
// is the "defined-or" operator. It works like || but looks for defined rather than truth.
See the Perl Regex Tutorial for more.
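As a runnable illustration of the s///eg approach, here is a small sketch. The expand_placeholders name and the Acme value are made up; the title line and the %WEBMAPENV% value come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: expand every %NAME% placeholder found in $text.
# Placeholders with no entry in the hash are left untouched.
sub expand_placeholders {
    my ($text, $replacements) = @_;
    $text =~ s{(%\w+%)}{ $replacements->{$1} // $1 }eg;
    return $text;
}

my %replacements = (
    '%CLIENT%'    => 'Acme',        # made-up value for illustration
    '%WEBMAPENV%' => '(Test-Dev)',
);

my $line = '"title": "%CLIENT% Dashboard Web Map %WEBMAPENV%",';
print expand_placeholders($line, \%replacements), "\n";
```

Because the substitution scans the original text once, a replacement value that itself contains % is not scanned again.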

How to fix the error of "Use of uninitialized value in addition..." in perl script?

Here is the script of user Suic for calculating molecular weight of fasta sequences (calculating molecular weight in perl),
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
for my $file (@ARGV) {
open my $fh, '<:encoding(UTF-8)', $file;
my $input = join q{}, <$fh>;
close $fh;
while ( $input =~ /^(>.*?)$([^>]*)/smxg ) {
my $name = $1;
my $seq = $2;
$seq =~ s/\n//smxg;
my $mass = calc_mass($seq);
print "$name has mass $mass\n";
}
}
sub calc_mass {
my $a = shift;
my @a = ();
my $x = length $a;
@a = split q{}, $a;
my $b = 0;
my %data = (
A=>71.09, R=>16.19, D=>114.11, N=>115.09,
C=>103.15, E=>129.12, Q=>128.14, G=>57.05,
H=>137.14, I=>113.16, L=>113.16, K=>128.17,
M=>131.19, F=>147.18, P=>97.12, S=>87.08,
T=>101.11, W=>186.12, Y=>163.18, V=>99.14
);
for my $i( @a ) {
$b += $data{$i};
}
my $c = $b - (18 * ($x - 1));
return $c;
}
and the protein.fasta file with n (here is 2) sequences:
>seq_ID_1 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASGSDGASDGDSAHSHAS
SFASGDASGDSSDFDSFSDFSD
>seq_ID_2 descriptions etc
ASDGDSAHSAHASDFRHGSDHSDGEWTSHSDHDSHFSDGSGASGADGHHAH
ASDSADGDASHDASHSAREWAWGDASHASGASGASG
When using: perl molecular_weight.pl protein.fasta > output.txt
in terminal, it generates the correct results, but it also prints the warning "Use of uninitialized value in addition (+) at molecular_weight.pl line 36", which points at the line "$b += $data{$i};". How can I fix this bug? Thanks in advance!
You probably have an errant SPACE somewhere in your data file. Just change
$seq =~ s/\n//smxg;
into
$seq =~ s/\s//smxg;
EDIT:
Besides whitespace, there may be some non-whitespace invisible characters in the data, like WORD JOINER (U+2060).
If you want to be sure to be thorough and you know all the legal symbols, you can delete everything apart from them:
$seq =~ s/[^ARDNCEQGHILKMFPSTWYV]//smxg;
Or, to make sure you won't miss any (even if you later change the symbols), you can populate a filter regex dynamically from the hash keys.
You'd need to make %Data and the filter regex global, so the filter is available in the main loop. As a beneficial side effect, you don't need to re-initialize the data hash every time you enter calc_mass().
use strict;
use warnings;
my %Data = (A=>71.09,...);
my $Filter_regex = eval { my $x = '[^' . join('', keys %Data) . ']'; qr/$x/; };
...
$seq =~ s/$Filter_regex//smxg;
(This filter works as long as the symbols are single character. For more complicated ones, it may be preferable to match for the symbols and collect them from the sequence, instead of removing unwanted characters.)
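A minimal sketch of the dynamic filter, assuming single-character symbols. The %Data hash here is deliberately cut down to three entries from the question's table, and clean_sequence is a hypothetical name.

```perl
use strict;
use warnings;

# Three entries copied from the question's table; the real hash is longer.
my %Data = (A => 71.09, G => 57.05, S => 87.08);

# Build a negated character class from the hash keys. quotemeta guards
# against keys that would be special inside a character class.
my $class        = join '', map { quotemeta } sort keys %Data;
my $Filter_regex = qr/[^$class]/;

# Hypothetical helper: drop every character that is not a known symbol.
sub clean_sequence {
    my ($seq) = @_;
    $seq =~ s/$Filter_regex//g;
    return $seq;
}

# Input with a space, a WORD JOINER (U+2060) and a newline mixed in.
print clean_sequence("AG S\x{2060}\nGA"), "\n";
```

Using plain qr// avoids the eval block; the pattern is compiled once, and it follows the hash automatically if symbols are added later.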

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate I am very new to the Perl process and I apologize in advance for making silly mistakes
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counters and can read each line of DNA sequence, but at the moment it is doing a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence is read, then move on to the next one, giving individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the initial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encountered and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my @array = split //, $line;
my $array= (@array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (@array){
#if statements assess which base is present and make a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.
Several things:
1. use hash instead of declaring each element.
2. An assignment such as $G = (0); does work, but it is not the usual way to assign a scalar. The parentheses are only grouping, not an array, so $G still receives 0. Note that $G = (); as written in the question assigns undef rather than 0. The clear way is $G = 0.
my %seen;
for my $line (grep { !/^>/ } <INFILE>) { # skip the header lines
    chomp $line;
    $seen{$_}++ for split //, $line; # count every base
}
foreach my $base (sort keys %seen) {
    print "$base: $seen{$base}\n";
}
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = @_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
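The skewed percentages in the question come from the += operators inside the print statements, which modify the counters while summing them. Here is a sketch of a GC percentage computed with plain + from a counts hash like the one above. The gc_percent name and the sample sequence are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: GC percentage from a hash ref of base counts.
# Nothing is modified, so it can be called repeatedly on the same counts.
sub gc_percent {
    my ($counts) = @_;
    my $total = 0;
    $total += $counts->{$_} // 0 for qw/A C G T/;
    return 0 unless $total;                  # avoid division by zero
    my $gc = ($counts->{G} // 0) + ($counts->{C} // 0);
    return 100 * $gc / $total;
}

my %counts;
$counts{$_}++ for split //, 'ATGCTTCTCATC';  # made-up sample sequence
printf "GC content: %.1f%%\n", gc_percent(\%counts);
```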
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line with another statement like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (@array) { ... }.

How to look at the next line of a file in Perl

I have a piece of code which opens up a file and parses it. This text document has a redundant structure and has multiple entries. I need to peek ahead within my loop to see if there is a new entry, if there is, I will be able to parse all of the data my program extracts. Let me first show my implementation so far
use strict;
my $doc = open(my $fileHandler, "<", "test.txt");
while(my $line = <$fileHandler>) {
## right here I want to look at the next line to see if
## $line =~ m/>/ where > denotes a new entry
}
Try handling the iteration yourself:
my $line = <$fileHandler>;
while(1) { # keep looping until I say so
my $nextLine = <$fileHandler>;
if ($line =~ m/>/ || !defined $nextLine) {
### Do the stuff
}
### Do any other stuff;
last unless defined $nextLine;
$line = $nextLine;
}
I added the extra check in the if statement under the assumption that you will also want to process what you have when you reach the end of the file.
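A self-contained variation on the same idea, as a sketch: it tests the peeked line rather than the current one, and groups lines into entries. The read_entries name and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: group lines into entries, where a line containing
# '>' starts a new entry. It looks one line ahead to detect the entry's end.
sub read_entries {
    my ($fh) = @_;
    my (@entries, @current);
    my $line = <$fh>;
    while (defined $line) {
        my $next = <$fh>;                 # peek at the following line
        chomp $line;
        push @current, $line;
        if (!defined $next || $next =~ />/) {
            push @entries, [@current];    # entry complete
            @current = ();
        }
        $line = $next;
    }
    return @entries;
}

# Sample data standing in for test.txt (an assumption).
my $sample = ">one\nAAA\nCCC\n>two\nGGG\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print join(' ', @$_), "\n" for read_entries($fh);
close $fh;
```

Only two lines are held at a time, so this works for files too large to load into an array.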
Alternatively, as suggested by friedo, if the file can fit into memory, you can load the whole thing into an array at once:
my @lines = <$fileHandler>;
for (my $i = 0; $i <= $#lines; $i++) {
if ($i == $#lines || $lines[$i+1] =~ />/) {
### Do the stuff
}
}
This is more flexible in that you can access any arbitrary line of the file, in any order, but as mentioned the file does have to be small enough to fit into memory.
A nice way to handle these problems is using Tie::File, which allows you to treat a file like an array, without the performance penalty of actually loading the file into memory. It is also a core module since perl v5.7.3.
use Tie::File;
tie my @file, 'Tie::File', "test.txt" or die $!;
for my $linenr (0 .. $#file) { # loop over line numbers
    if ($file[$linenr] =~ /foo/) { # this is the current line
        if ($linenr < $#file && # don't go past end of file
            $file[$linenr + 1] =~ /^>/) { # this is the next line
            # do stuff
        }
    }
}
untie @file; # all done

Skip the problematic DAT, proceed with the next DATs, and output the list of errored or missed DATs to a separate text file in Perl

I have another question here. I have several DATs and want to merge them, but the script first checks the header of every DAT and, if one does not match, throws an error and stops. Now I want the script to skip the problematic DAT and write the errors to a separate text file, listing the errored DATs and the reason. Could anyone please help with this? Here is what I have so far:
use strict;
my $rootdir = $ARGV[0];
die "usage: perl mergetxtfiles.pl <folder>" if ($#ARGV != 0);
#$rootdir =~ s/\\/\\\\/g;
print "\nFolder = $rootdir\n\n";
opendir(DIR, $rootdir)
or die "failed opening the directory $rootdir";
open(OF,">:utf8",'combined_'.time.'.dat')
or die "failed opening the file";
my $icr = 0;
my $cnt = 0;
my $header = '';
my $header_flag = 0;
while(my $fname = readdir(DIR)) {
# add extensions if needed
if ($fname =~ m/(\.txt)|(\.dat)|(\.csv)$/i) {
$icr++;
my $fnamepath = $rootdir.'\\'.$fname;
print "\($icr\) $fname\n";
open(IF, "<:utf8", $fnamepath)
or die "ERROR: cannot open the file\n$fnamepath ";
my $sep_icr = 0;
while(<IF>) {
my $line = $_;
chomp $line;
next if (/^$/);
$sep_icr++;
$cnt++;
my @ar = split(/\t/,$line);
if ($cnt == 1) {
$header_flag = 1;
$header = $line;
}
if ($sep_icr == 1 and $header_flag == 1) {
#print "$line \n $header\n";
if ($line ne $header) {
die "Headers are not same\n";
}
elsif (($line eq $header) and ($cnt >1)) {
print "INFO\: ignoring the same header for $fname \n";
$cnt--;
next;
}
}
print OF $line."\n";
}
print "\--Line count= $sep_icr\n\n";
close IF;
#print OF "\n";
}
}
print "\-\-\> Total line count= $cnt\n";
Named Loops
In your loop, we have to change your if-clause and the outer loop a bit:
FILE:
while(my $fname = readdir(DIR)) {
...;
if ($line ne $header) {
logger($fname, "Headers not matching");
next FILE;
}
...;
}
In Perl, loops can be labeled, so we can specify which loop we mean with next, instead of setting and checking flags. I used an example logging function, logger, as given below, but you can substitute it with an appropriate print statement.
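A compact, runnable sketch of the labeled-loop idea. Files are simulated as array refs keyed by name so the snippet needs no directory; merge_files and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: merge the lines of several "files" (array refs keyed
# by name), skipping any file whose first line differs from the first
# header seen. Returns the merged lines and the names of skipped files.
sub merge_files {
    my (%files) = @_;
    my (@merged, @skipped, $header);
  FILE:
    for my $name (sort keys %files) {
        my @lines = @{ $files{$name} };
        for my $i (0 .. $#lines) {
            if ($i == 0) {
                $header //= $lines[0];    # first header seen wins
                if ($lines[0] ne $header) {
                    push @skipped, $name;
                    next FILE;            # abandon this file entirely
                }
                next if @merged;          # keep the header only once
            }
            push @merged, $lines[$i];
        }
    }
    return (\@merged, \@skipped);
}

my ($merged, $skipped) = merge_files(
    'a.dat' => [ "id\tname", "1\tx" ],
    'b.dat' => [ "wrong header", "2\ty" ],
    'c.dat' => [ "id\tname", "3\tz" ],
);
print scalar(@$merged), " lines merged; skipped: @$skipped\n";
```

The next FILE statement leaves the inner line loop and moves straight to the next file, which is what replaces the die in the original script.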
Logging
This is probably a bit more than asked, but here is a little logging function for flexibility. Arguments are a filename, a reason, and an optional severity. You can remove the severity code if it isn't needed. The severity is optional anyway and defaults to debug.
open my $logfile, ">>", "FILENAME" or die "..."; # open for append
sub logger {
my ($file, $reason, $severity) = (@_, 'debug');
$severity = {
debug => '',
info => 'INFO',
warn => '!WARN!',
fatal => '!!!ERROR!!!',
}->{$severity} // $severity; # transform the severity if it is a name we know
$severity .= ' ' if length $severity; # append space if we have a severity
print {$logfile} $severity . qq{$reason while processing "$file"\n};
}
If called with logger("./foo/bar", "Headers not matching", 'warn') it will output:
!WARN! Headers not matching while processing "./foo/bar"
Change the printed error message to something more machine-readable if needed.
Style tips and tricks:
I find these lines more elegant:
die "usage: ...\n" unless @ARGV;
my ($rootdir) = @ARGV;
Note the newline at the end (it suppresses the "at line 3" etc.). In scalar context, an array returns its length, so the die fires when no arguments were given. In the second line we can avoid array subscripting by assigning in list context. Surplus elements are ignored.
Instead of
if ($fname =~ m/(\.txt)|(\.dat)|(\.csv)$/i) { ...; }
we can say
next unless $fname =~ m/(?: \.txt | \.dat | \.csv )$/xi;
and avoid unnecessary indentation, therefore improving readability.
I modified the regex so that all suffixes must come at the end, not only the .csv suffix, and added the /x modifier so that I can use non-semantic whitespace inside the regex.
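The difference between the two patterns can be checked with a short sketch. The file names are made up, and $loose and $strict are just labels for the original and the grouped pattern.

```perl
use strict;
use warnings;

# Original pattern: '$' binds only to the last alternative (.csv).
my $loose  = qr/(\.txt)|(\.dat)|(\.csv)$/i;
# Grouped pattern: '$' applies to every alternative.
my $strict = qr/(?: \.txt | \.dat | \.csv )$/xi;

for my $name ('data.csv', 'notes.txt.bak', 'REPORT.DAT') {
    printf "%-14s loose=%d strict=%d\n", $name,
        ($name =~ $loose  ? 1 : 0),
        ($name =~ $strict ? 1 : 0);
}
```

Only the loose pattern accepts notes.txt.bak, because .txt may match anywhere in the name when the anchor does not apply to it.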
Windows, and pretty much any OS, understand forward slashes in path names. So instead of
my $fnamepath = $rootdir.'\\'.$fname;
we can write
my $fnamepath = "$rootdir/$fname";
I find that easier to write and understand.
The
while(<IF>) {
my $line = $_;
construct can be simplified to
while(my $line = <IF>) {...}
Last but not least, consider starting a habit of using filehandles with my. Often, global filehandles are not needed and can cause some bugs.