Populate an array by splitting a string - perl

I am trying to convert a string into an array based on space delimiter.
My input file looks like this:
>Reference
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnctcACCATGGTGTCGACTC
TTCTATGGAAACAGCGTGGATGGCGTCTCCAGGCGATCTGACGGTTCACTAAACGAGCTC
Ignoring the line starting with >, the length of rest of the string is 360.
I am trying to convert this into an array.
Here's my code so far:
#!/usr/bin/perl
use strict;
use warnings;
#### To to change bases with less than 10X coverage to N #####
#### Take depth file and consensus fasta file as input arguments ####
my ($in2) = #ARGV;
my $args = $#ARGV + 1;
if ( $args != 1 ) {
print "Error!!! Insufficient Number of Argumrnts\n";
print "Usage: $0 <consensus fasta file> \n";
}
#### Open a filehandle to read in consensus fasta file ####
my $FH2;
my $line;
my #consensus;
my $char;
open($FH2, '<', $in2) || die "Could not open file $in2\n";
while ( <$FH2> ) {
$line = $_;
chomp $line;
next if $line =~ />/; # skip header line
$line =~ s/\s+//g;
my $len = length($line);
print "$len\n";
#print "$line";
#consensus = split(// , $line);
print "$#consensus\n";
#print "#consensus\n";
#for $char (0 .. $#consensus){
# print "$char: $consensus[$char]\n";
# }
}
The problem is the $len variable returns a value of 60 instead of 360 and $#consensus returns a value of 59 instead of 360 which is the length of the string.
I have removed the whitespace after each line with code $line =~ s/\s+//g;but it still is not working.

It looks like your code is essentially working. It's just your checking logic that makes no sense. I'd do the following:
use strict;
use warnings;
if (#ARGV != 1) {
print STDERR "Usage: $0 <consensus fasta file>\n";
exit 1;
}
open my $fh, '<', $ARGV[0] or die "$0: cannot open $ARGV[0]: $!\n";
my #consensus;
while (my $line = readline $fh) {
next if $line =~ /^>/;
$line =~ s/\s+//g;
push #consensus, split //, $line;
}
print "N = ", scalar #consensus, "\n";
Main things to note:
Error messages should go to STDERR, not STDOUT.
If an error occurs, the program should exit with an error code, not keep running.
Error messages should include the name of the program and the reason for the error.
chomp is redundant if you're going to remove all whitespace anyway.
As you're processing the input line by line, you can just keep pushing elements to the end of #consensus. At the end of the loop it'll have accumulated all characters across all lines.
Examining #consensus within the loop makes little sense as it hasn't finished building yet. Only after the loop do we have all characters we're interested in.

Related

Perl - parse file - write out to two different files

I have written a Perl script to parse through a file, scrub it, and put it in a new file. Was using test data that I was originally given to work with, but now I've gotten all the actual data and it turns out there are a good deal of records I will NOT want in the newly scrubbed file (mainly because too many of the fields in those records are empty).
So I now need to check if a particular field in a record is empty and if so, write it out to an "error" file and not write it out to the scrubbed data file. Below is my script (and before people bring it up, I do not have the Text::CSV module nor will I ever have it available)
NOTE - until I tried putting the IF/ELSE statement in there, the code was working with the data I had prior to being given the actual data with these problem records.
#!/usr/bin/perl/
use strict;
use warnings;
use Data::Dumper;
use Time::Piece;
my $filename = 'uncleanData.csv';
open my $FH, $filename
or die "Could not read from $filename <$!>, program halting.";
# Read the header line.
chomp(my $line = <$FH>);
my #fields = split(/,/, $line);
print Dumper(#fields), $/;
my #data;
# Read the lines one by one.
while($line = <$FH>) {
chomp($line);
Here is the new IF statement I put in with the code below the ELSE having not changed from my prior working script -
# Check if the storeNbr field is empty. If so, write record to error file.
if (!length $fields[28]) {
open ( my $ERR_FH, '>', "errorFiles.csv" ) or die $!;
print $ERR_FH join(',', #$_), $/ for #data;
close $ERR_FH;
}
else
{
# Scrub data of characters that cause scripting problems down the line.
$line =~ s/[\'\\]/ /g;
# split the fields, concatenate fields 28-30, and add the
# concatenated field to the beginning of each line in the file
my #fields = split(/,/, $line);
unshift #fields, join '_', #fields[28..30];
# Format the DATE fields for MySQL
$_ = join '-', (split /\//)[2,0,1] for #fields[10,14,24,26];
# Scrub colons from the data
$line =~ s/:/ /g;
# If Spectro_Model is "UNKNOWN", change
if($fields[22] eq "UNKNOWN"){
$_ = 'UNKNOW' for $fields[22];
}
# If tran_date is blank, insert 0000-00-00
if(!length $fields[10]){
$_ = '0000-00-00' for $fields[10];
}
# If init_tran_date is blank, insert 0000-00-00
if(!length $fields[14]){
$_ = '0000-00-00' for $fields[14];
}
# If update_tran_date is blank, insert 0000-00-00
if(!length $fields[24]){
$_ = '0000-00-00' for $fields[24];
}
# If cancel_date is blank, insert 0000-00-00
if(!length $fields[26]){
$_ = '0000-00-00' for $fields[26];
}
# Format the PROD_NBR field by deleting any leading zeros before decimals.
$fields[12] =~ s/^\s*0\././;
# put the records back
push #data, \#fields;
}
}
close $FH;
print "Unsorted:\n", Dumper(#data); #, $/;
#Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
#data = sort {
$a->[0] cmp $b->[0] ||
$a->[14] cmp $b->[14] ||
$a->[26] cmp $b->[26] ||
$a->[27] cmp $b-> [27]
} #data;
#open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/cleaned1502.csv';
print $OFH join(',', #$_), $/ for #data;
close $OFH;
exit;
I'm guessing my problem is where I am putting the closing brace } for the ELSE part of the statement. Here are some sample records from the file with the last file being one of the "problem" records -
650096571,1,1,used as store paint,14,IFC 8012NP,Standalone-9,3596,56,1/31/2015,80813,A97W01251,,1/16/2015,0.25,0.25,,SW,CUSTOM MATCH,TRUE,O,xts,,,,,,,1568,61006,1,FALSE
650368376,1,3,Tinted Wrong Color,16,IFC 8012NP,01DX8015206,,6,1/31/2015,160720,A87W01151,MATCH,1/31/2015,1,1,ENG,CUST,CUSTOM MATCH,TRUE,O,Ci52,,,,,,,1584,137252,1,FALSE
650175433,3,1,not tinted - e.w.,16,COROB MODULA HF,Standalone-7,,2,1/31/2015,95555,B20W02651,,1/29/2015,3,3,,COMP,CUSTOM MATCH,TRUE,P,xts,,,,,,,1627,68092,5,FALSE
650187016,2,1,checked out under cash ,,,,,,,,,,,,,,,,,,,,,,,,,,,,
When I run this script, it's still processing the "error records" and throwing up all kinds of "unitialized value" warnings.
Text::CSV is useful if you need to handle quotes or embedded linefeeds. Text::ParseWords can do as a substitute instead if you need that capability.
But as long as you don't have quoting to worry about, split works just fine.
You can do something like:
#!/usr/bin/env perl
use strict;
use warnings;
open ( my $normal_fh, '>', "output.txt" ) or die $!;
open ( my $err_fh, '>', "errors.txt" ) or die $!;
while ( <> ) {
if ( ( split /,/ ) [27] =~ /\w/ ) {
select $normal_fh;
}
else {
select $err_fh;
}
print;
}

Perl script - Confusing error

When I run this code, I am purely trying to get all the lines containing the word "that" in them. Sounds easy enough. But when I run it, I get a list of matches that contain the word "that" but only at the end of the line. I don't know why it's coming out like this and I have been going crazy trying to solve it. I am currently getting an output of 268 total matches, and the output I need is only 13. Please advise!
#!/usr/bin/perl -w
#Usage: conc.shift.pl textfile word
open (FH, "$ARGV[0]") || die "cannot open";
#array = (1,2,3,4,5);
$count = 0;
while($line = <FH>) {
chomp $line;
shift #array;
push(#array, $line);
$count++;
if ($line =~ /that/)
{
$output = join(" ",#array);
print "$output \n";
}
}
print "Total matches: $count\n";
Don't you want to increment your $count variable only if the line contains "that", i.e.:
if ($line =~ /that/) {
$count++;
instead of incrementing the counter before checking if $line contains "that", as you have it:
$count++;
if ($line =~ /that/) {
Similarly, I suspect that your push() and join() calls, for stashing a matching line in #array, should also be within the if block, only executed if the line contains "that".
Hope this helps!

How do I find the line a word is on when the user enters text in Perl?

I have a simple text file that includes all 50 states. I want the user to enter a word and have the program return the line the specific state is on in the file or otherwise display a "word not found" message. I do not know how to use find. Can someone assist with this? This is what I have so far.
#!/bin/perl -w
open(FILENAME,"<WordList.txt"); #opens WordList.txt
my(#list) = <FILENAME>; #read file into list
my($state); #create private "state" variable
print "Enter a US state to search for: \n"; #Print statement
$line = <STDIN>; #use of STDIN to read input from user
close (FILENAME);
An alternative solution that reads only the parts of the file until a result is found, or the file is exhausted:
use strict;
use warnings;
print "Enter a US state to search for: \n";
my $line = <STDIN>;
chomp($line);
# open file with 3 argument open (safer)
open my $fh, '<', 'WordList.txt'
or die "Unable to open 'WordList.txt' for reading: $!";
# read the file until result is found or the file is exhausted
my $found = 0;
while ( my $row = <$fh> ) {
chomp($row);
next unless $row eq $line;
# $. is a special variable representing the line number
# of the currently(most recently) accessed filehandle
print "Found '$line' on line# $.\n";
$found = 1; # indicate that you found a result
last; # stop searching
}
close($fh);
unless ( $found ) {
print "'$line' was not found\n";
}
General notes:
always use strict; and use warnings; they will save you from a wide range of bugs
3 argument open is generally preferred, as well as the or die ... statement. If you are unable to open the file, reading from the filehandle will fail
$. documentation can be found in perldoc perlvar
Tool for the job is grep.
chomp ( $line ); #remove linefeeds
print "$line is in list\n" if grep { m/^\Q$line\E$/g } #list;
You could also transform your #list into a hash, and test that, using map:
my %cities = map { $_ => 1 } #list;
if ( $cities{$line} ) { print "$line is in list\n";}
Note - the above, because of the presence of ^ and $ is an exact match (and case sensitive). You can easily adjust it to support fuzzier scenarios.

Ignore lines in a file till match and process lines after that

I am looping over lines in a file and when matched a particular line, i want to process the lines after the current (matched) line. I can do it :-
open my $fh, '<', "abc" or die "Cannot open!!";
while (my $line = <$fh>){
next if($line !~ m/Important Lines below this Line/);
last;
}
while (my $line = <$fh>){
print $line;
}
Is there a better way to do this (code needs to be a part of a bigger perl script) ?
I'd use flip-flop operator:
while(<DATA>) {
next if 1 .. /Important/;
print $_;
}
__DATA__
skip
skip
Important Lines below this Line
keep
keep
output:
keep
keep

help merging perl code routines together for file processing

I need some perl help in putting these (2) processes/code to work together. I was able to get them working individually to test, but I need help bringing them together especially with using the loop constructs. I'm not sure if I should go with foreach..anyways the code is below.
Also, any best practices would be great too as I'm learning this language. Thanks for your help.
Here's the process flow I am looking for:
read a directory
look for a particular file
use the file name to strip out some key information to create a newly processed file
process the input file
create the newly processed file for each input file read (if i read in 10, I create 10 new files)
Part 1:
my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
next if ($file =~ /^\.+$/);
#Get filename attributes
if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
print "$1\n";
print "$2\n";
print "$3\n";
}
print "$file\n";
}
Part 2:
use strict;
use Digest::MD5 qw(md5_hex);
#Create new file
open (NEWFILE, ">/backups/processed/foo$1.name.$2-foo_p$3.out") || die "cannot create file";
my $data = '';
my $line1 = <>;
chomp $line1;
my #heading = split /,/, $line1;
my ($sep1, $sep2, $eorec) = ( "^A", "^E", "^D");
while (<>)
{
my $digest = md5_hex($data);
chomp;
my (#values) = split /,/;
my $extra = "__mykey__$sep1$digest$sep2" ;
$extra .= "$heading[$_]$sep1$values[$_]$sep2" for (0..scalar(#values));
$data .= "$extra$eorec";
print NEWFILE "$data";
}
#print $data;
close (NEWFILE);
You are using an old-style of Perl programming. I recommend you to use functions and CPAN modules (http://search.cpan.org). Perl pseudocode:
use Modern::Perl;
# use...
sub get_input_files {
# return an array of files (#)
}
sub extract_file_info {
# takes the file name and returs an array of values (filename attrs)
}
sub process_file {
# reads the input file, takes the previous attribs and build the output file
}
my #ifiles = get_input_files;
foreach my $ifile(#ifiles) {
my #attrs = extract_file_info($ifile);
process_file($ifile, #attrs);
}
Hope it helps
I've bashed your two code fragments together (making the second a sub that the first calls for each matching file) and, if I understood your description of the objective correctly, this should do what you want. Comments on style and syntax are inline:
#!/usr/bin/env perl
# - Never forget these!
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
# Parens on postfix "if" are optional; I prefer to omit them
next if $file =~ /^\.+$/;
if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
process_file($file, $1, $2, $3);
}
print "$file\n";
}
sub process_file {
my ($orig_name, $foo_x, $name_x, $p_x) = #_;
my $new_name = "/backups/processed/foo$foo_x.name.$name_x-foo_p$p_x.out";
# - From your description of the task, it sounds like we actually want to
# read from the found file, not from <>, so opening it here to read
# - Better to use lexical ("my") filehandle and three-arg form of open
# - "or" has lower operator precedence than "||", so less chance of
# things being grouped in the wrong order (though either works here)
# - Including $! in the error will tell why the file open failed
open my $in_fh, '<', $orig_name or die "cannot read $orig_name: $!";
open(my $out_fh, '>', $new_name) or die "cannot create $new_name: $!";
my $data = '';
my $line1 = <$in_fh>;
chomp $line1;
my #heading = split /,/, $line1;
my ($sep1, $sep2, $eorec) = ("^A", "^E", "^D");
while (<$in_fh>) {
chomp;
my $digest = md5_hex($data);
my (#values) = split /,/;
my $extra = "__mykey__$sep1$digest$sep2";
$extra .= "$heading[$_]$sep1$values[$_]$sep2"
for (0 .. scalar(#values));
# - Useless use of double quotes removed on next two lines
$data .= $extra . $eorec;
#print $out_fh $data;
}
# - Moved print to output file to here (where it will print the complete
# output all at once) rather than within the loop (where it will print
# all previous lines each time a new line is read in) to prevent
# duplicate output records. This could also be achieved by printing
# $extra inside the loop. Printing $data at the end will be slightly
# faster, but requires more memory; printing $extra within the loop and
# getting rid of $data entirely would require less memory, so that may
# be the better option if you find yourself needing to read huge input
# files.
print $out_fh $data;
# - $in_fh and $out_fh will be closed automatically when it goes out of
# scope at the end of the block/sub, so there's no real point to
# explicitly closing it unless you're going to check whether the close
# succeeded or failed (which can happen in odd cases usually involving
# full or failing disks when writing; I'm not aware of any way that
# closing a file open for reading can fail, so that's just being left
# implicit)
close $out_fh or die "Failed to close file: $!";
}
Disclaimer: perl -c reports that this code is syntactically valid, but it is otherwise untested.