I have written a Perl script to parse through a file, scrub it, and put it in a new file. Was using test data that I was originally given to work with, but now I've gotten all the actual data and it turns out there are a good deal of records I will NOT want in the newly scrubbed file (mainly because too many of the fields in those records are empty).
So I now need to check if a particular field in a record is empty and if so, write it out to an "error" file and not write it out to the scrubbed data file. Below is my script (and before people bring it up, I do not have the Text::CSV module nor will I ever have it available)
NOTE - until I tried putting the IF/ELSE statement in there, the code was working with the data I had prior to being given the actual data with these problem records.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Time::Piece;

my $filename = 'uncleanData.csv';
open my $FH, $filename
    or die "Could not read from $filename <$!>, program halting.";

# Read the header line.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
print Dumper(@fields), $/;

my @data;
# Read the lines one by one.
while($line = <$FH>) {
    chomp($line);
Here is the new IF statement I put in; the code below the ELSE is unchanged from my previously working script -
    # Check if the storeNbr field is empty. If so, write record to error file.
    if (!length $fields[28]) {
        open ( my $ERR_FH, '>', "errorFiles.csv" ) or die $!;
        print $ERR_FH join(',', @$_), $/ for @data;
        close $ERR_FH;
    }
    else
    {
        # Scrub data of characters that cause scripting problems down the line.
        $line =~ s/[\'\\]/ /g;
        # split the fields, concatenate fields 28-30, and add the
        # concatenated field to the beginning of each line in the file
        my @fields = split(/,/, $line);
        unshift @fields, join '_', @fields[28..30];
        # Format the DATE fields for MySQL
        $_ = join '-', (split /\//)[2,0,1] for @fields[10,14,24,26];
        # Scrub colons from the data
        $line =~ s/:/ /g;
        # If Spectro_Model is "UNKNOWN", change
        if($fields[22] eq "UNKNOWN"){
            $_ = 'UNKNOW' for $fields[22];
        }
        # If tran_date is blank, insert 0000-00-00
        if(!length $fields[10]){
            $_ = '0000-00-00' for $fields[10];
        }
        # If init_tran_date is blank, insert 0000-00-00
        if(!length $fields[14]){
            $_ = '0000-00-00' for $fields[14];
        }
        # If update_tran_date is blank, insert 0000-00-00
        if(!length $fields[24]){
            $_ = '0000-00-00' for $fields[24];
        }
        # If cancel_date is blank, insert 0000-00-00
        if(!length $fields[26]){
            $_ = '0000-00-00' for $fields[26];
        }
        # Format the PROD_NBR field by deleting any leading zeros before decimals.
        $fields[12] =~ s/^\s*0\././;
        # put the records back
        push @data, \@fields;
    }
}
close $FH;

print "Unsorted:\n", Dumper(@data); #, $/;

# Sort the clean files on Primary Key, initTranDate, updateTranDate, and updateTranTime
@data = sort {
    $a->[0]  cmp $b->[0]  ||
    $a->[14] cmp $b->[14] ||
    $a->[26] cmp $b->[26] ||
    $a->[27] cmp $b->[27]
} @data;

#open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/cleaned1502.csv';
print $OFH join(',', @$_), $/ for @data;
close $OFH;
exit;
I'm guessing my problem is where I am putting the closing brace } for the ELSE part of the statement. Here are some sample records from the file, with the last line being one of the "problem" records -
650096571,1,1,used as store paint,14,IFC 8012NP,Standalone-9,3596,56,1/31/2015,80813,A97W01251,,1/16/2015,0.25,0.25,,SW,CUSTOM MATCH,TRUE,O,xts,,,,,,,1568,61006,1,FALSE
650368376,1,3,Tinted Wrong Color,16,IFC 8012NP,01DX8015206,,6,1/31/2015,160720,A87W01151,MATCH,1/31/2015,1,1,ENG,CUST,CUSTOM MATCH,TRUE,O,Ci52,,,,,,,1584,137252,1,FALSE
650175433,3,1,not tinted - e.w.,16,COROB MODULA HF,Standalone-7,,2,1/31/2015,95555,B20W02651,,1/29/2015,3,3,,COMP,CUSTOM MATCH,TRUE,P,xts,,,,,,,1627,68092,5,FALSE
650187016,2,1,checked out under cash ,,,,,,,,,,,,,,,,,,,,,,,,,,,,
When I run this script, it's still processing the "error records" and throwing up all kinds of "uninitialized value" warnings.
Text::CSV is useful if you need to handle quotes or embedded linefeeds. Text::ParseWords can serve as a substitute if you need that capability.
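For example, a minimal sketch with Text::ParseWords (a core module); the sample line here is hypothetical, with a quoted field containing an embedded comma:

use strict;
use warnings;
use Text::ParseWords;

my $line = q{650096571,1,1,"used, as store paint",14,IFC 8012NP};

# parse_line(delimiter, keep_quotes, line): with keep_quotes = 0 the
# surrounding quotes are stripped and the embedded comma is preserved
my @fields = parse_line(q{,}, 0, $line);
print "$_\n" for @fields;   # fourth field comes out as: used, as store paint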
But as long as you don't have quoting to worry about, split works just fine.
You can do something like:
#!/usr/bin/env perl
use strict;
use warnings;
open ( my $normal_fh, '>', "output.txt" ) or die $!;
open ( my $err_fh, '>', "errors.txt" ) or die $!;
while ( <> ) {
    if ( ( split /,/ )[27] =~ /\w/ ) {
        select $normal_fh;
    }
    else {
        select $err_fh;
    }
    print;
}
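Here select swaps the default output filehandle, so the bare print writes the unchanged line to whichever handle was chosen for that record. If you would rather not change the default handle, a sketch of the same routing with an explicit filehandle (same assumptions, untested):

while ( <> ) {
    # index 27 is the field being checked for content
    my $fh = ( split /,/ )[27] =~ /\w/ ? $normal_fh : $err_fh;
    print {$fh} $_;
}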
With this code I am simply trying to get all the lines that contain the word "that". Sounds easy enough. But when I run it, I get a list of matches containing the word "that" only at the end of the line. I don't know why it's coming out like this, and I have been going crazy trying to solve it. I am currently getting 268 total matches, and the output I need is only 13. Please advise!
#!/usr/bin/perl -w
# Usage: conc.shift.pl textfile word
open (FH, "$ARGV[0]") || die "cannot open";
@array = (1,2,3,4,5);
$count = 0;
while($line = <FH>) {
    chomp $line;
    shift @array;
    push(@array, $line);
    $count++;
    if ($line =~ /that/)
    {
        $output = join(" ",@array);
        print "$output \n";
    }
}
print "Total matches: $count\n";
Don't you want to increment your $count variable only if the line contains "that", i.e.:
if ($line =~ /that/) {
    $count++;
instead of incrementing the counter before checking if $line contains "that", as you have it:
$count++;
if ($line =~ /that/) {
Similarly, I suspect that your push() and join() calls, for stashing a matching line in @array, should also be within the if block, only executed if the line contains "that".
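Putting those changes together, a minimal sketch of the revised loop (keeping your filehandle and variable names; untested against your data):

$count = 0;
while ($line = <FH>) {
    chomp $line;
    if ($line =~ /that/) {   # only matching lines are counted and stashed
        $count++;
        shift @array;
        push(@array, $line);
        $output = join(" ", @array);
        print "$output\n";
    }
}
print "Total matches: $count\n";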
Hope this helps!
I have a simple text file that includes all 50 states. I want the user to enter a word and have the program return the line the specific state is on in the file or otherwise display a "word not found" message. I do not know how to use find. Can someone assist with this? This is what I have so far.
#!/bin/perl -w
open(FILENAME, "<WordList.txt");  # opens WordList.txt
my(@list) = <FILENAME>;           # read file into list
my($state);                       # create private "state" variable
print "Enter a US state to search for: \n";  # Print statement
$line = <STDIN>;                  # use of STDIN to read input from user
close (FILENAME);
An alternative solution that reads the file only until a result is found or the file is exhausted:
use strict;
use warnings;
print "Enter a US state to search for: \n";
my $line = <STDIN>;
chomp($line);
# open file with 3 argument open (safer)
open my $fh, '<', 'WordList.txt'
or die "Unable to open 'WordList.txt' for reading: $!";
# read the file until result is found or the file is exhausted
my $found = 0;
while ( my $row = <$fh> ) {
    chomp($row);
    next unless $row eq $line;

    # $. is a special variable representing the line number
    # of the currently (most recently) accessed filehandle
    print "Found '$line' on line# $.\n";
    $found = 1;  # indicate that you found a result
    last;        # stop searching
}
close($fh);
unless ( $found ) {
    print "'$line' was not found\n";
}
General notes:
always use strict; and use warnings; they will save you from a wide range of bugs
3 argument open is generally preferred, as is the or die ... statement; if you are unable to open the file, reading from the filehandle will fail.
$. documentation can be found in perldoc perlvar
The tool for this job is grep.
chomp ( $line ); # remove linefeeds
print "$line is in list\n" if grep { m/^\Q$line\E$/g } @list;

You could also transform your @list into a hash, and test that, using map:

my %cities = map { $_ => 1 } @list;
if ( $cities{$line} ) { print "$line is in list\n"; }
Note - the above, because of the presence of ^ and $, is an exact match (and case sensitive). You can easily adjust it to support fuzzier scenarios.
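For example, a minimal sketch of a case-insensitive membership test. One caveat: the lines read from the file still end in newlines (the regex's $ tolerates a trailing newline, but a hash lookup does not), so chomp the list before building the hash:

chomp(@list);                             # hash keys must not carry newlines
my %states = map { lc($_) => 1 } @list;   # normalise case when building keys
print "$line is in list\n" if $states{ lc $line };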
I need some Perl help in putting these two processes/pieces of code together. I was able to get them working individually for testing, but I need help bringing them together, especially with the loop constructs. I'm not sure if I should go with foreach... anyway, the code is below.
Also, any best practices would be great, as I'm learning this language. Thanks for your help.
Here's the process flow I am looking for:
read a directory
look for a particular file
use the file name to strip out some key information to create a newly processed file
process the input file
create the newly processed file for each input file read (if I read in 10, I create 10 new files)
Part 1:
my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
    next if ($file =~ /^\.+$/);
    # Get filename attributes
    if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
        print "$1\n";
        print "$2\n";
        print "$3\n";
    }
    print "$file\n";
}
Part 2:
use strict;
use Digest::MD5 qw(md5_hex);

# Create new file
open (NEWFILE, ">/backups/processed/foo$1.name.$2-foo_p$3.out") || die "cannot create file";
my $data = '';
my $line1 = <>;
chomp $line1;

my @heading = split /,/, $line1;
my ($sep1, $sep2, $eorec) = ( "^A", "^E", "^D");

while (<>)
{
    my $digest = md5_hex($data);
    chomp;
    my (@values) = split /,/;
    my $extra = "__mykey__$sep1$digest$sep2";
    $extra .= "$heading[$_]$sep1$values[$_]$sep2" for (0..scalar(@values));
    $data .= "$extra$eorec";
    print NEWFILE "$data";
}
#print $data;
close (NEWFILE);
You are using an old style of Perl programming. I recommend using functions and CPAN modules (http://search.cpan.org). Perl pseudocode:
use Modern::Perl;
# use...
sub get_input_files {
    # return an array of files (@)
}

sub extract_file_info {
    # takes the file name and returns an array of values (filename attrs)
}

sub process_file {
    # reads the input file, takes the previous attribs, and builds the output file
}

my @ifiles = get_input_files;
foreach my $ifile (@ifiles) {
    my @attrs = extract_file_info($ifile);
    process_file($ifile, @attrs);
}
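For instance, a minimal sketch of get_input_files, with the directory and filename pattern assumed from your Part 1 (adjust as needed):

sub get_input_files {
    my $target_dir = "/backups/test/";
    opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
    # keep only the files matching the expected naming pattern
    my @files = grep { /^foo\d{3}\.name\.\w{3}-foo_p\d{1,4}\.\d+\.csv$/ } readdir $dh;
    closedir $dh;
    return map { "$target_dir$_" } @files;
}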
Hope it helps
I've bashed your two code fragments together (making the second a sub that the first calls for each matching file) and, if I understood your description of the objective correctly, this should do what you want. Comments on style and syntax are inline:
#!/usr/bin/env perl

# - Never forget these!
use strict;
use warnings;

use Digest::MD5 qw(md5_hex);

my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";

while (defined(my $file = readdir($dh))) {
    # Parens on postfix "if" are optional; I prefer to omit them
    next if $file =~ /^\.+$/;
    if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
        process_file($file, $1, $2, $3);
    }
    print "$file\n";
}

sub process_file {
    my ($orig_name, $foo_x, $name_x, $p_x) = @_;
    my $new_name = "/backups/processed/foo$foo_x.name.$name_x-foo_p$p_x.out";

    # - From your description of the task, it sounds like we actually want to
    #   read from the found file, not from <>, so opening it here to read
    # - Better to use lexical ("my") filehandle and three-arg form of open
    # - "or" has lower operator precedence than "||", so less chance of
    #   things being grouped in the wrong order (though either works here)
    # - Including $! in the error will tell why the file open failed
    open my $in_fh, '<', $orig_name or die "cannot read $orig_name: $!";
    open(my $out_fh, '>', $new_name) or die "cannot create $new_name: $!";

    my $data = '';
    my $line1 = <$in_fh>;
    chomp $line1;

    my @heading = split /,/, $line1;
    my ($sep1, $sep2, $eorec) = ("^A", "^E", "^D");

    while (<$in_fh>) {
        chomp;
        my $digest = md5_hex($data);
        my (@values) = split /,/;
        my $extra = "__mykey__$sep1$digest$sep2";
        $extra .= "$heading[$_]$sep1$values[$_]$sep2"
            for (0 .. scalar(@values));

        # - Useless use of double quotes removed on next two lines
        $data .= $extra . $eorec;
        #print $out_fh $data;
    }

    # - Moved print to output file to here (where it will print the complete
    #   output all at once) rather than within the loop (where it would print
    #   all previous lines each time a new line is read in) to prevent
    #   duplicate output records. This could also be achieved by printing
    #   $extra inside the loop. Printing $data at the end will be slightly
    #   faster, but requires more memory; printing $extra within the loop and
    #   getting rid of $data entirely would require less memory, so that may
    #   be the better option if you find yourself needing to read huge input
    #   files.
    print $out_fh $data;

    # - $in_fh and $out_fh will be closed automatically when they go out of
    #   scope at the end of the block/sub, so there's no real point to
    #   explicitly closing them unless you're going to check whether the close
    #   succeeded or failed (which can happen in odd cases, usually involving
    #   full or failing disks when writing; I'm not aware of any way that
    #   closing a file open for reading can fail, so that's just being left
    #   implicit)
    close $out_fh or die "Failed to close file: $!";
}
Disclaimer: perl -c reports that this code is syntactically valid, but it is otherwise untested.
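For reference, a sketch of the lower-memory variant mentioned in the comments above, printing each record as soon as it is built (equally untested). It bounds the loop with $#values, the last valid index, since the original's 0 .. scalar(@values) steps one element past the end of the array; $data is still accumulated here only because each record's digest is computed from everything built so far:

while (<$in_fh>) {
    chomp;
    my $digest = md5_hex($data);   # digest of all records built so far
    my @values = split /,/;
    my $extra  = "__mykey__$sep1$digest$sep2";
    $extra .= "$heading[$_]$sep1$values[$_]$sep2" for 0 .. $#values;
    $data  .= "$extra$eorec";      # kept because the next digest depends on it
    print $out_fh "$extra$eorec";  # write the record immediately
}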