I have a process that reads a CSV file, and I want to make sure it's correct before I start parsing it.
I get a file name, check if it exists, then check its integrity. If it's not there or not a proper CSV file, I try the file from the previous day instead.
Is there a way to check that the file is a proper CSV file? I am using Text::CSV_XS to parse it.
Googling a bit I found this csv-check example code on the Text::CSV_XS Git repo. It looks like something I could use.
As others have noted, you have to parse the entire file to determine if it's valid. You may as well kill two birds with one stone and do your data processing and error checking at the same time.
Detecting errors
getline() returns undef when it reaches EOF or if it fails to parse a line. You can use this to parse a file, halting if there are any parse errors:
while ( my $row = $csv->getline($io) ) {
    # Process row
}
$csv->eof or do_something();
You can also use autodie, or set the auto_diag option in Text::CSV_XS->new() to die on errors:
$csv = Text::CSV_XS->new({ auto_diag => 2 });
You can handle the errors by wrapping your parsing code in an eval block. Note that auto_diag calls error_diag() automatically before dying, printing the error to STDERR; this may not be what you want.
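For instance, here is a rough sketch combining auto_diag with an eval wrapper (the sub name process_file and the fallback behaviour are only illustrative):

use strict;
use warnings;
use Text::CSV_XS;

# Illustrative helper: returns 1 if $file parsed cleanly, 0 otherwise
sub process_file {
    my ($file) = @_;
    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });
    my $ok  = eval {
        open my $io, '<', $file or die "open $file: $!";
        while (my $row = $csv->getline($io)) {
            # ... process @$row and stage it for the database here ...
        }
        # auto_diag => 2 dies on parse errors; a clean EOF just ends the loop
        $csv->eof or die "parse stopped before EOF\n";
        close $io;
        1;
    };
    return $ok ? 1 : 0;    # caller can fall back to the previous day's file
}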
Reverting invalid files
How do you "revert" the processing you did for previous rows if you detect an error? One possibility, if your database engine supports them, are database transactions. When you start processing a file, start a transaction. If you get a parse error, simply roll back the transaction and move on to the next file; otherwise, commit the transaction.
As an aside, I haven't seen your code for inserting database records, so I'm not sure if this applies, but it's not very efficient to have a separate insert statement for each row. Instead, consider either constructing a compound insert statement as you parse the file or, for very large files, letting the database do the parsing with something like MySQL's LOAD DATA INFILE (just an example, since I don't know what DBMS you're using).
To use a compound insert, build the query statement in memory like Borodin suggested. If you get to the end of the file without any parse errors, execute the statement; otherwise, throw it out and move on to the next file.
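For example, reusing the $dbh, $csv and $io handles from the sketch above (my_table and the columns a, b, c are still made up):

# Rough sketch of a compound insert built while parsing
my (@groups, @values);
while (my $row = $csv->getline($io)) {
    push @groups, '(?, ?, ?)';     # one placeholder group per row
    push @values, @$row;
}
if ($csv->eof && @groups) {
    my $sql = 'INSERT INTO my_table (a, b, c) VALUES ' . join(', ', @groups);
    $dbh->do($sql, undef, @values);
}
else {
    # parse error (or empty file): throw the statement away and try the next file
}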
For very large files, it might be fastest to let the database do the parsing, especially if you're doing minimal processing before inserting the data. MySQL's LOAD DATA INFILE, for example, will halt if it detects data interpretation or duplicate key errors. If you wrap the statement in a transaction, you can roll back if there are errors and try to load the next file. The advantage of this approach is that loading valid files will be extremely fast, much faster than if you had to parse them with Perl first.
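A rough sketch of that, again via DBI/DBD::mysql and assuming both client and server permit LOCAL INFILE (the table name and field options are illustrative):

# Sketch only: let MySQL parse the CSV inside a transaction
$dbh->begin_work;
my $loaded = eval {
    my $sql = sprintf(
        q{LOAD DATA LOCAL INFILE %s INTO TABLE my_table
          FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
          LINES TERMINATED BY '\n'},
        $dbh->quote($file)             # $file holds the CSV path
    );
    $dbh->do($sql);
    1;
};
$loaded ? $dbh->commit : $dbh->rollback;   # roll back and try the next file on errors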
You can't test the validity of a file without reading and parsing every record in it anyway.
I suggest the way to go is to process each file that you find, building in memory the data that you want to end up in the database; if you find an error, just discard that data and try the next file.
Once you reach the end of the file and know that it is valid and complete, then you can just save your prepared data to the database, and go on to the next file.
This will work fine unless your CSV files are enormous and too large to fit into memory sensibly. In that case you should simply take two passes.
Here is what I did; the sub returns 1 if the file is OK and 0 if it's not:
sub CheckCSVFile {
    my ($fileName) = @_;
    my $csv = Text::CSV_XS->new();
    open my $in_fh, '<:encoding(ISO-8859-1)', $fileName;
    while (<$in_fh>) {
        my $status = $csv->parse($_);
        if (!$status) {
            return 0;
        }
    }
    $csv->eof;
    close $in_fh;
    return 1;
}
I check for file existence beforehand, so it shouldn't error out. I also don't want to exit if something goes wrong. It's a bit crude, but it worked for me.
I'm having an issue with Perl and I'm hoping someone here can help me figure out what's going on. I have about 130,000 .txt files in a directory called RawData, and I have a Perl program that loads their names into an array and then loops through it, loading each .txt file. For simplicity, suppose I have four text files I'm looping through:
File1.txt
File2.txt
File3.txt
File4.txt
The contents of each .txt file look something like this:
007 C03XXYY ZZZZ
008 A01XXYY ZZZZ
009 A02XXYY ZZZZ
where X, Y, Z are digits. In my simplified code below, the program pulls out just line 007 in each .txt file, saves XX as ID, ignores YY, and grabs the variable data ZZZZ that I've called VarVal. Then it writes everything to a file with the header specified in the code below:
#!/usr/bin/perl
use warnings;
use strict;

open(OUTFILE, "> ../Data/OutputFile.csv") or die $!;
opendir(MYDIR, "../RawData") || die $!;
my @txtfiles = grep {/\.txt$/} readdir(MYDIR);
closedir(MYDIR);

print OUTFILE "ID,VarName,VarVal\n";

foreach my $txtfile (@txtfiles) {
    # Prints to the screen so I can see where I am in the loop.
    print "$txtfile\n";
    open(INFILE, "< ../RawData/$txtfile") or die $!;
    while (<INFILE>) {
        if (m{^007 C03(\d{2})(\d+)(\s+)(.+)}) {
            print OUTFILE "$1,VarName,$4\n";
        }
    }
}
The issue I'm having is that the contents of, for example, File3.txt don't show up in OutputFile.csv. However, it's not an issue with Perl not finding a match, because I checked that the if statement is being executed by deleting OUTFILE and looking at what the code prints to the terminal screen. What shows up is exactly what should be there.
Furthermore, if I just run the problematic file (File3.txt) through the loop by itself, by commenting out the opendir and closedir stuff and doing something like my @txtfiles = ("File3.txt");, then the only data that shows up in OutputFile.csv IS what's in File3.txt. But when it goes through the loop, it won't show up in OutputFile.csv. Plus, I know that File3.txt is being sent into the loop because I can see it being printed on the screen with print "$txtfile\n";. I'm at a loss as to what is going on here.
The other issue is that I don't think it's something specific to this one particular file (maybe it is), but I can't just troubleshoot this one file because I have 130,000 files, and I just happened to stumble across the fact that this one wasn't being written to the output file. So there may be other files that also aren't getting written, even though there is no obvious reason they shouldn't be, just as with File3.txt.
Perhaps processing so many files in rapid succession, looping over 130,000 of them, causes some sort of I/O issue that randomly fails every so often to write the contents in memory to the output file? That's my best guess, but I have no idea how to diagnose or fix this.
This is kind of a difficult question to debug, but I'm hoping someone on here has some insight or has seen similar problems that would provide me with a solution.
Thanks
There's nothing obviously wrong that I can see in your code, though it is a little outdated; using autodie and lexical filehandles would be better.
However, I would recommend that you make your regex slightly less restrictive by allowing variable-length spacing after the first value and letting the last capture be zero length. I'd also output the filename. Then you can see which other files aren't being caught, for whatever reason:
if (m{^007\s+C03(\d{2})\d+\s+(.*)}) {
    print OUTFILE "$txtfile $1,VarName,$2\n";
    last;
}
Finally, assuming there is only a single 007 C03 in each file, you could throw in a last call after one is found.
You may want to try sorting the @txtfiles list and then systematically looking through the output to see what is or isn't there. With 130k files in random order, it would be pretty difficult to be certain that you missed one. Perl should be giving you the files in the order they actually appear in the directory, which is different from user-level commands like ls, so it may be different from what you'd expect.
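For instance (keeping the rest of the loop exactly as in the question):

# Visit the files in sorted order so the output is easier to audit
foreach my $txtfile (sort @txtfiles) {
    # ... same loop body as in the question ...
}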
In my script I am dealing with opening files and writing to files. I found that there is something wrong with a file I try to open: the file exists, it is not empty, and I am passing the right path to the file handle.
I know that my question might sound weird, but while I was debugging my code I put the following command in my script to check some files:
system ("ls");
Then my script worked well; when it's removed, it does not work correctly anymore.
my @unique = ("test1","test2");
open(unique_fh,">orfs");
print unique_fh @unique ;
open(ORF,"orfs") or die ("file doesnot exist");
system ("ls");
while (<ORF>) {
    split ;
}
@neworfs = @_ ;
print @neworfs ;
Perl buffers the output when you print to a file. In other words, it doesn't actually write to the file every time you say print; it saves up a bunch of data and writes it all at once. This is faster.
In your case, you couldn't see anything you had written to the file, because Perl hadn't written anything yet. Adding the system("ls") call, however, caused Perl to write your output first (the interpreter is smart enough to do this, because it thinks you might want to use the system() call to do something with the file you just created).
How do you get around this? You can close the file before you open it again to read it, as choroba suggested. Or you can disable buffering for that file. Put this code just after you open the file:
my $fh = select (unique_fh);
$|=1;
select ($fh);
Then anytime you print to the file, it will get written immediately ($| is a special variable that sets the output buffering behavior).
Closing the file first is probably a better idea, although it is possible to have a filehandle for reading and writing open at the same time.
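For completeness, here is a minimal sketch of the close-first approach applied to the code above, with lexical filehandles and error checks (the data handling is kept as in the question):

my @unique = ("test1", "test2");

open my $out, '>', 'orfs' or die "Cannot write orfs: $!";
print {$out} @unique;
close $out or die "Cannot close orfs: $!";   # closing flushes the buffer to disk

open my $in, '<', 'orfs' or die "Cannot read orfs: $!";
my @neworfs;
while (<$in>) {
    push @neworfs, split;                    # split the line on whitespace
}
close $in;
print @neworfs;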
You did not close the filehandle before trying to read from the same file.
We have a script on an FTP endpoint that monitors the FTP logs spewed out by our FTP daemon.
Currently what we do is have a Perl script that essentially runs a tail -F on the file and sends every single line into a remote MySQL database, with slightly different column content based on the record type.
This database has tables for both the tarball names/content and the FTP user actions on said packages: downloads, deletes, and everything else VSFTPd logs.
I see this as particularly bad, but I'm not sure what's better.
The goal of a replacement is to still get log file content into a database as quickly as possible. I'm thinking of doing something like making a FIFO/pipe file in place of the FTP log file, so I can read it in periodically, ensuring I never read the same thing twice. That assumes VSFTPd will play nice with that (I'm thinking it won't; insight welcome!).
The FTP daemon is VSFTPd, and I'm at least fairly sure the extent of its logging capabilities is: xfer-style log, vsftpd-style log, both, or no logging at all.
The question is, what's better than what we're already doing, if anything?
Honestly, I don't see much wrong with what you're doing now. tail -f is very efficient. The only real problem with it is that it loses state if your watcher script ever dies (which is a semi-hard problem to solve with rotating logfiles). There's also a nice File::Tail module on CPAN that saves you from the shellout and has some nice customization available.
Using a FIFO as a log can work (as long as vsftpd doesn't try to unlink and recreate the logfile at any point, which it may do) but I see one major problem with it. If no one is reading from the other end of the FIFO (for instance if your script crashes, or was never started), then a short time later all of the writes to the FIFO will start blocking. And I haven't tested this, but it's pretty likely that having logfile writes block will cause the entire server to hang. Not a very pretty scenario.
Your problem with reading a continually updated file is that you want to keep reading, even after the end of file is reached. The solution to this is to re-seek to your current position in the file:
seek FILEHANDLE, 0, 1;
Here is my code for doing this sort of thing:
open(FILEHANDLE, '<', '/var/log/file') || die 'Could not open log file';
seek(FILEHANDLE, 0, 2) || die 'Could not seek to end of log file';
for (;;) {
    while (<FILEHANDLE>) {
        if ( $_ =~ /monitor status down/ ) {
            print "Gone down\n";
        }
    }
    sleep 1;
    seek FILEHANDLE, 0, 1; # clear eof
}
You should look into inotify (assuming you are on a nice, POSIX-based OS) so you can run your Perl script whenever the logfile is updated. If this level of I/O causes problems, you could always keep the logfile on a RAM disk so I/O is very fast.
This should help you set this up:
http://www.cyberciti.biz/faq/linux-inotify-examples-to-replicate-directories/
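If you would rather stay in Perl than shell out, a rough sketch with the Linux::Inotify2 module from CPAN might look like this (the log path is illustrative):

use strict;
use warnings;
use Linux::Inotify2;

my $log = '/var/log/vsftpd.log';    # illustrative path

my $inotify = Linux::Inotify2->new
    or die "Unable to create inotify object: $!";

$inotify->watch($log, IN_MODIFY, sub {
    my $event = shift;
    # The file grew: read the newly appended lines here,
    # e.g. with the seek/read loop shown elsewhere in this thread.
});

1 while $inotify->poll;             # block, dispatching callbacks as events arrive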
You can open the file as an input pipe.
open(my $file, '-|', '/ftp/file/path');
while (<$file>) {
    # you know the rest
}
File::Tail does this, plus heuristic sleeping and nice error handling and recovery.
Edit: On second thought, a real system pipe is better if you can manage it. If not, whenever your process starts you need to find the last thing you put in the database and spin through the file until you reach it. Not that easy to accomplish, and potentially impossible if you have no way of identifying where you left off.
I have an application generating logs every 5 seconds. The logs are in the format below:
11:13:49.250,interface,0,RX,0
11:13:49.250,interface,0,TX,0
11:13:49.250,interface,1,close,0
11:13:49.250,interface,4,error,593
11:13:49.250,interface,4,idle,2994215
and so on for other interfaces...
I am working to convert these into the CSV format below:
Time,interface.RX,interface.TX,interface.close....
11:13:49,0,0,0,....
Simple so far, but the problem is that I have to get the data into CSV format online, i.e. as soon as the log file is updated, the CSV should also be updated.
What I have tried, to read the output and build the header, is:
#!/usr/bin/perl -w
use strict;
use File::Tail;

my $head = ["Time"];
my $pos = {};
my $last_pos = 0;
my $current_event = [];
my $events = [];

my $file = shift;
$file = File::Tail->new($file);

while (defined($_ = $file->read)) {
    next if $_ =~ /some filters/;    # placeholder for the lines I skip
    my ($time, $interface, $count, $eve, $value) = split /[,\n]/, $_;
    my $key = $interface . "." . $eve;
    if (not defined $pos->{$key}) {
        $last_pos += 1;
        $pos->{$key} = $last_pos;
        push @$head, $key;
    }
    print join(",", @$head) . "\n";
}
Is there any way to do this using Perl?
Module Text::CSV will allow you to both read and write CSV format files. Text::CSV will internally use Text::CSV_XS if it's installed, or it will fall back to using Text::CSV_PP (thanks to Brad Gilbert for improving this explanation).
Grouping the related rows together is something you will have to do; it is not clear from your example where the source date goes to.
Making sure that the CSV output is updated is primarily a question of ensuring that you have the output file line buffered.
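For example, one way to make sure every print reaches the output file immediately (the file name here is illustrative):

use strict;
use warnings;
use IO::Handle;    # provides autoflush() on older perls

open my $out_fh, '>', 'output.csv' or die "Cannot open output.csv: $!";
$out_fh->autoflush(1);    # flush after every print, so the CSV is always current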
As David M suggested, perhaps you should look at the File::Tail module to deal with the continuous reading aspect of the problem. That should allow you to continually read from the input log file.
You can then use the 'parse' method in Text::CSV to split up the read line, and the 'print' method to format the output. How you combine the information from the various input lines to create an output line is a mystery to me - I cannot see how the logic works from the example you give. However, I assume you know what you need to do, and these tools will give you the mechanisms you need to handle the data.
No-one can do much more to spoon-feed you the answer. You are going to have to do some thinking for yourself. You will have a file handle that can be continuously read via File::Tail; you will have a CSV structure for reading the data lines; you will probably have another CSV structure for the written output; you will have an output file handle that you ensure is flushed every time you write. Connecting these dots is now your problem.
I'm learning Perl and building an application that gets a random line from a file using this code:
open(my $random_name, "<", "out.txt");
my @array = shuffle(<$random_name>);
chomp @array;
close($random_name) or die "Error when trying to close $random_name: $!";
print shift @array;
But now I want to delete this random name from the file. How can I do this?
shift already deletes a name from the array.
So does pop (one removes from the beginning, the other from the end). I would suggest using pop, as it may be more efficient, and since it's a random name anyway, you don't care which one you use.
Or do you need to delete it from a file?
If that's the case, you need to:
A. Get a count of names inside the file (if small, read it all into memory using File::Slurp; if large, either read it line by line and count, or simply execute the wc -l $filename command via backticks).
B. Generate a random number from 1 to the number of lines (say, $random_line_number).
C. Read the file line by line. For every line read, WRITE it to a temp file (use File::Temp to generate temp files), except do NOT write the line numbered $random_line_number to the temp file.
D. Close the temp file and move it into place over your original file (see the sketch below).
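A rough sketch of steps A through D for a plain text file (the file name and variable names are only illustrative):

use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Copy qw(move);

my $filename = 'out.txt';

# A: count the lines (reading line by line is fine for a large file)
open my $in, '<', $filename or die "Cannot read $filename: $!";
my $count = 0;
$count++ while <$in>;
close $in;

# B: pick a random line number (1-based)
my $random_line_number = 1 + int rand $count;

# C: copy every line except the chosen one to a temp file
my ($tmp_fh, $tmp_name) = tempfile();
open $in, '<', $filename or die "Cannot reread $filename: $!";
while (my $line = <$in>) {
    print {$tmp_fh} $line unless $. == $random_line_number;
}
close $in;
close $tmp_fh or die "Cannot close $tmp_name: $!";

# D: move the temp file into place over the original
move($tmp_name, $filename) or die "Cannot replace $filename: $!";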
If the list contains filenames and you need to delete the file itself (the random file), use the unlink() function. Don't forget to check the return code from unlink() and, as with any IO operation, print an error message containing $!, which will be the text of the system error on failure.
Done.
When you say "delete this … from the list" do you mean delete it from the file? If you simply mean remove it from #array then you've already done that by using shift. If you want it removed from the file, and the order doesn't matter, simply write the remaining names in #array back into the file. If the file order does matter, you're going to have to do something slightly more complicated, such as reopen the file, read the items in in order, except for the one you don't want, and then write them all back out again. Either that, or take more notice of the order when you read the file.
If you need to delete a line from a file (it's not entirely clear from your question), one of the simplest and most efficient ways is to use Tie::File to manipulate the file as if it were an array. Otherwise, perlfaq5 explains how to do it the long way.
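For example, a minimal sketch with Tie::File, reusing the out.txt file from the question:

use strict;
use warnings;
use Tie::File;

# Treat the file as an array of lines; changes are written straight back to disk
tie my @lines, 'Tie::File', 'out.txt' or die "Cannot tie out.txt: $!";

my $index  = int rand @lines;     # pick a random line
my $chosen = $lines[$index];
splice @lines, $index, 1;         # remove that line from the file itself

untie @lines;
print "$chosen\n";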