I'm having an issue with Perl and I'm hoping someone here can help me figure out what's going on. I have about 130,000 .txt files in a directory called RawData, and a Perl program that loads their names into an array and then loops through the array, reading each .txt file. For simplicity, suppose I have four text files I'm looping through:
File1.txt
File2.txt
File3.txt
File4.txt
The contents of each .txt file look something like this:
007 C03XXYY ZZZZ
008 A01XXYY ZZZZ
009 A02XXYY ZZZZ
where X,Y,Z are digits. In my simplified code below, the program then pulls out just line 007 in each .txt file, saves XX as ID, ignores YY and grabs the variable data ZZZZ that I've called VarVal. Then it writes everything to a file with a header specified in the code below:
#!/usr/bin/perl
use warnings;
use strict;
open(OUTFILE, "> ../Data/OutputFile.csv") or die $!;
opendir(MYDIR,"../RawData")||die $!;
my @txtfiles = grep { /\.txt$/ } readdir(MYDIR);
closedir(MYDIR);
print OUTFILE "ID,VarName,VarVal\n";
foreach my $txtfile (@txtfiles){
#Prints to the screen so I can see where I am in the loop.
print "$txtfile\n";
open(INFILE, "< ../RawData/$txtfile") or die $!;
while(<INFILE>){
if(m{^007 C03(\d{2})(\d+)(\s+)(.+)}){
print OUTFILE "$1,VarName,$4\n";
}
}
}
The issue I'm having is that the contents of, for example, File3.txt don't show up in OutputFile.csv. However, it's not an issue of Perl failing to find a match, because I checked that the if statement is being executed by deleting OUTFILE and looking at what the code prints to the terminal screen. What shows up is exactly what should be there.
Furthermore, if I just run the problematic file (File3.txt) through the loop by itself, commenting out the opendir and closedir stuff and doing something like my @txtfiles = ("File3.txt");, then the only data that shows up in OutputFile.csv IS what's in File3.txt. But when it goes through the full loop, it won't show up in OutputFile.csv. Plus, I know that File3.txt is being sent into the loop because I can see it being printed on the screen by print "$txtfile\n";. I'm at a loss as to what is going on here.
The other issue is that I don't think the problem is specific to this one particular file (though maybe it is), but I can't troubleshoot just this one file: with 130,000 files, I only happened to stumble across the fact that this one wasn't being written to the output file. There may be other files that also aren't getting written, even though, as with File3.txt, there is no obvious reason they shouldn't be.
Perhaps processing so many files in rapid succession - looping through 130,000 of them - causes some sort of I/O issue that every so often fails to write the contents of memory to the output file? That's my best guess, but I have no idea how to diagnose or fix it.
This is kind of a difficult question to debug, but I'm hoping someone on here has some insight or has seen similar problems that would provide me with a solution.
Thanks
There's nothing obviously wrong that I can see in your code, although it is a little outdated: using autodie and lexical filehandles would be better.
However, I would recommend that you make your regex slightly less restrictive, by allowing variable-length spacing after the first value and letting the last capture be of zero length. I'd also output the filename. Then you can see which other files aren't being caught, for whatever reason:
if (m{^007\s+C03(\d{2})\d+\s+(.*)}){
print OUTFILE "$txtfile $1,VarName,$2\n";
last;
}
Finally, assuming there is only a single 007 C03 line in each file, you can throw in a last call once one is found, as in the snippet above.
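For reference, here's what a modernized sketch of the whole program might look like, with lexical filehandles and autodie (the paths, the regex, and the VarName placeholder are carried over from your code; untested):

#!/usr/bin/perl
use strict;
use warnings;
use autodie;    # open, opendir, etc. now die on failure automatically

open my $outfh, '>', '../Data/OutputFile.csv';
opendir my $dh, '../RawData';
my @txtfiles = grep { /\.txt$/ } readdir $dh;
closedir $dh;

print $outfh "ID,VarName,VarVal\n";
for my $txtfile (@txtfiles) {
    print "$txtfile\n";    # progress indicator
    open my $infh, '<', "../RawData/$txtfile";
    while (<$infh>) {
        if (m{^007\s+C03(\d{2})\d+\s+(.*)}) {
            print $outfh "$txtfile $1,VarName,$2\n";
            last;          # assumes a single 007 C03 line per file
        }
    }
    close $infh;
}
close $outfh;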
You may want to try sorting the @txtfiles list, then systematically looking through the output to see what is or isn't there. With 130k files in random order, it would be pretty difficult to be certain that you hadn't missed one. Perl gives you the files in the order they appear in the directory, which is different from what user-level commands like ls show, so the order may not be what you expect.
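With a lexical dir handle, that's a one-word change where the list is built (sketch):

my @txtfiles = sort grep { /\.txt$/ } readdir $dh;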
I have a process that reads a CSV file, and I want to make sure it's correct before I start parsing it.
I get a file name, check whether it exists, then check its integrity. If it's not there, or isn't a proper CSV file, I try the file from the previous day instead.
Is there a way to check that the file is proper CSV file? I am using Text::CSV_XS to parse it.
Googling a bit I found this csv-check example code on the Text::CSV_XS Git repo. It looks like something I could use.
As others have noted, you have to parse the entire file to determine if it's valid. You may as well kill two birds with one stone and do your data processing and error checking at the same time.
Detecting errors
getline() returns undef when it reaches EOF or if it fails to parse a line. You can use this to parse a file, halting if there are any parse errors:
while ( my $row = $csv->getline($io) ) {
# Process row
}
$csv->eof or do_something();
You can also set the auto_diag option in Text::CSV_XS->new() to die on parse errors:
$csv = Text::CSV_XS->new({ auto_diag => 2 });
You can handle the errors by wrapping your parsing code in an eval block. This method will automatically call error_diag() before dying, printing the error to stderr; this may not be what you want.
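A sketch of that pattern, assuming $csv was created with auto_diag => 2 and $io is an open filehandle:

my $ok = eval {
    while ( my $row = $csv->getline($io) ) {
        # process $row here
    }
    1;
};
unless ($ok) {
    warn "Giving up on this file: $@";
    # fall back to the previous day's file here
}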
Reverting invalid files
How do you "revert" the processing you did for previous rows if you detect an error? One possibility, if your database engine supports them, is database transactions. When you start processing a file, start a transaction. If you get a parse error, simply roll back the transaction and move on to the next file; otherwise, commit the transaction.
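A rough sketch of that with DBI (assuming RaiseError is enabled and $sth is a prepared insert statement; both are placeholders for your own code):

$dbh->begin_work;
my $ok = eval {
    while ( my $row = $csv->getline($io) ) {
        $sth->execute(@$row);              # insert one row
    }
    $csv->eof or die "parse error\n";      # stopped before EOF => bad line
    1;
};
if ($ok) {
    $dbh->commit;                          # file was clean, keep the rows
}
else {
    $dbh->rollback;                        # undo everything from this file
}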
As an aside, I haven't seen your code for inserting database records so I'm not sure if this applies, but it's not very efficient to have a separate insert statement for each row. Instead, consider either constructing a compound insert statement as you parse the file; or, for very large files, let the database do the parsing with something like MySQL's LOAD DATA INFILE (just an example since I don't know what DBMS you're using).
To use a compound insert, build the query statement in memory like Borodin suggested. If you get to the end of the file without any parse errors, execute the statement; otherwise, throw it out and move on to the next file.
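A sketch of that approach (the table and column names are made up, and this assumes three fields per row):

my @rows;
while ( my $row = $csv->getline($io) ) {
    push @rows, $row;                      # collect rows as we parse
}
if ( $csv->eof and @rows ) {               # reached EOF cleanly: file is valid
    my $placeholders = join ',', ('(?,?,?)') x @rows;
    my $sql = "INSERT INTO mytable (a, b, c) VALUES $placeholders";
    $dbh->do( $sql, undef, map { @$_ } @rows );
}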
For very large files, it might be fastest to let the database do the parsing, especially if you're doing minimal processing before inserting the data. MySQL's LOAD DATA INFILE, for example, will halt if it detects data interpretation or duplicate key errors. If you wrap the statement in a transaction, you can roll back if there are errors and try to load the next file. The advantage of this approach is that loading valid files will be extremely fast, much faster than if you had to parse them with Perl first.
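For example (a MySQL-specific sketch; the path and table name are placeholders, LOCAL must be enabled on both client and server, and rolling back requires a transactional engine like InnoDB):

$dbh->begin_work;
my $ok = eval {                            # assumes RaiseError => 1
    $dbh->do(q{
        LOAD DATA LOCAL INFILE '/path/to/file.csv'
        INTO TABLE mytable
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\n'
        IGNORE 1 LINES
    });
    1;
};
$ok ? $dbh->commit : $dbh->rollback;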
You can't test the validity of a file without reading and parsing every record in it anyway.
I suggest the way to go is to process each file that you find, building in memory the data that you want to end up in the database; if you find an error, just discard it and try the next file.
Once you reach the end of the file and know that it is valid and complete, then you can just save your prepared data to the database, and go on to the next file.
This will work fine unless your CSV files are enormous and too large to fit into memory sensibly. In that case you should simply take two passes.
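A sketch of the two-pass variant (insert_row and $file are placeholders for your own code):

my $csv = Text::CSV_XS->new({ binary => 1 });

# Pass 1: validate only
open my $fh, '<', $file or die $!;
1 while $csv->getline($fh);
my $valid = $csv->eof;                     # true only if we stopped at EOF
close $fh;

# Pass 2: insert, but only if pass 1 was clean
if ($valid) {
    open my $fh2, '<', $file or die $!;
    while ( my $row = $csv->getline($fh2) ) {
        insert_row($row);                  # your own insert logic
    }
    close $fh2;
}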
Here is what I did, the sub returns 1 if the file is ok and 0 if it's not ok:
sub CheckCSVFile {
my ($fileName) = @_;
my $csv = Text::CSV_XS->new();
open my $in_fh, '<:encoding(ISO-8859-1)', $fileName or return 0;
while ( <$in_fh> ) {
my $status = $csv->parse($_);
if ($status != 1){
return $status;
}
}
$csv->eof;
close $in_fh;
return 1;
}
I check for file existence beforehand, so the open shouldn't fail, and I don't want to exit if something goes wrong. It's a bit crude, but it worked for me.
I've just started dabbling in Perl, to try and gain some exposure to different programming languages - so forgive me if some of the following code is horrendous.
I needed a quick and dirty CSV parser that could receive a CSV file, and split it into file batches containing "X" number of CSV lines (taking into account that entries could contain embedded newlines).
I came up with a working solution, and it was going along just fine. However, among the CSV files that I'm trying to split, I came across one that contains serialized PHP code.
This seems to break the CSV parsing. As soon as I remove the serialization - the CSV file is parsed correctly.
Are there any tricks I need to know when it comes to parsing serialized data in CSV files?
Here is a shortened sample of the code:
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ eol => $/, always_quote => 1, binary => 1 });
my $lines = 0;

open my $in, "<:encoding(utf8)", "infile.csv" or die "cannot open input file infile.csv: $!";
open my $out, ">", "outfile.000" or die "cannot open output file outfile.000: $!";
binmode($out, ":utf8");

while (my $line = $csv->getline($in)) {
    $lines++;
    $csv->print($out, $line);
}
I'm never able to get into the while loop shown above. As soon as I remove the serialized data, I suddenly am able to get into the loop.
Edit:
An example of a line that is causing me trouble (taken straight from Vim - hence the ^M):
"26","other","1","20,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","37","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","6","1"
"27","other","1","35,000 Subscriber Plan","Some test here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","38","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","7","1"
"28","other","1","50,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","39","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","8","1""73","other","8","10,000,000","","","","0","","0","","0","0","recurring","0","","payment","","0","","","75","0","10000000","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:17:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","14","0"
The CSV you are trying to read escapes embedded quotes with backslash, but the default for Text::CSV_XS is to escape by doubling them. Try adding escape_char => '\\' to the Text::CSV_XS constructor.
You may also need allow_loose_escapes => 1 if it uses backslash to quote other things that don't strictly need it like newlines.
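Combined, the constructor might look like this (a sketch; keep binary => 1 since your records contain embedded newlines):

my $csv = Text::CSV_XS->new({
    binary              => 1,
    escape_char         => '\\',
    allow_loose_escapes => 1,
});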
The other option is to change the writer to use doubled quotes instead of backslashes for escaping, which might or might not be possible. Doubling the quotes is the more common flavour of CSV, and while programmatic parsers can generally read both (if told how), you won't be able to read the backslash variant in, for example, Excel.
I've been writing output from perl scripts to files for some time using code as below:
open( OUTPUT, ">:utf8", $output_file ) or die "Can't write new file: $!";
print OUTPUT "First line I want printed\n";
print OUTPUT "Another line I want printing\n";
close(OUTPUT);
This works, and is faster than my initial approach, which used "say" instead of print (thank you, NYTProf, for enlightening me on that!).
However, my current script is looping over hundreds of thousands of lines and taking many hours to run with this method, and NYTProf is pointing the finger at my thousands of print calls. So the question is: is there a faster way of doing this?
Other Info that's possibly relevant...
Perl Version: 5.14.2 (On Ubuntu)
Background of the script in question...
A number of '|'-delimited flat files are being read into hashes, each file having some sort of primary key matching entries from one file to another. I'm manipulating this data and then combining it into one file for import into another system.
The output file is around 3 million lines, and the program starts to noticeably slow down after writing around 30,000 lines to it. (A little reading around seemed to point towards running out of write buffer in other languages, but I couldn't find anything about this with regard to Perl.)
EDIT: I've now tried adding the line below, just after the open() statement, to disable print buffering, but the program still slows around the 30,000th line.
OUTPUT->autoflush(1);
I think you need to redesign the algorithm your program uses. File output speed isn't influenced by the amount of data that has already been written, and it is far more likely that your program is reading and processing data but not releasing it.
Check the amount of memory used by your process to see if it increases inexorably.
Beware of for (<$filehandle>) loops, which read whole files into memory at once; see the sketch below.
As I said in my comment, disable the relevant print statements to see how performance changes.
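To illustrate the second point, the difference is between these two read patterns (sketch):

# Reads the entire file into a list first - memory grows with file size
for my $line (<$fh>) {
    # process $line
}

# Reads one line at a time - memory use stays flat
while ( my $line = <$fh> ) {
    # process $line
}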
Have you tried concatenating all the individual prints into a single scalar and then printing the scalar all at once? I have a script that outputs an average of 20 lines of text for each input line. Using individual print statements, even with the output sent to /dev/null, took a long time; but when I packed all the output (for a single input line) together, using things like:
$output .= "...";
$output .= sprintf("%s...", $var);
Then, just before leaving the line-processing subroutine, I print $output, emitting all the lines at once. The number of calls to print went from ~7.7M to about 386K - equal to the number of lines in the input data file. This shaved about 10% off of my total execution time.
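The overall pattern looks roughly like this (process_line and the filehandles are placeholders, not my actual code):

sub process_line {
    my ($in) = @_;
    my $output = '';
    $output .= "some header for this record\n";
    $output .= sprintf "%s ...\n", $in;    # builds up ~20 lines per input line
    return $output;
}

while ( my $line = <$infh> ) {
    chomp $line;
    print $outfh process_line($line);      # one print call per input line
}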