How can I read a continuously updating log file in Perl? - perl

I have a application generating logs in every 5 sec. The logs are in below format.
11:13:49.250,interface,0,RX,0
11:13:49.250,interface,0,TX,0
11:13:49.250,interface,1,close,0
11:13:49.250,interface,4,error,593
11:13:49.250,interface,4,idle,2994215
and so on for other interfaces...
I am working to convert these into below CSV format:
Time,interface.RX,interface.TX,interface.close....
11:13:49,0,0,0,....
Simple as of now but the problem is, I have to get the data in CSV format online, i.e as soon the log file updated the CSV should also be updated.
What I have tried to read the output and make the header is:
#!/usr/bin/perl -w
use strict;
use File::Tail;
my $head=["Time"];
my $pos={};
my $last_pos=0;
my $current_event=[];
my $events=[];
my $file = shift;
$file = File::Tail->new($file);
while(defined($_=$file->read)) {
next if $_ =~ some filters;
my ($time,$interface,$count,$eve,$value) = split /[,\n]/, $_;
my $key = $interface.".".$eve;
if (not defined $pos->{$eve_key}) {
$last_pos+=1;
$pos->{$eve_key}=$last_pos;
push #$head,$eve;
}
print join(",", #$head) . "\n";
}
Is there any way to do this using Perl?

Module Text::CSV will allow you to both read and write CSV format files. Text::CSV will internally use Text::CSV_XS if it's installed, or it will fall back to using Text::CSV_PP (thanks to Brad Gilbert for improving this explanation).
Grouping the related rows together is something you will have to do; it is not clear from your example where the source date goes to.
Making sure that the CSV output is updated is primarily a question of ensuring that you have the output file line buffered.
As David M suggested, perhaps you should look at the File::Tail module to deal with the continuous reading aspect of the problem. That should allow you to continually read from the input log file.
You can then use the 'parse' method in Text::CSV to split up the read line, and the 'print' method to format the output. How you combine the information from the various input lines to create an output line is a mystery to me - I cannot see how the logic works from the example you give. However, I assume you know what you need to do, and these tools will give you the mechanisms you need to handle the data.
No-one can do much more to spoon-feed you the answer. You are going to have to do some thinking for yourself. You will have a file handle that can be continuously read via File::Tail; you will have a CSV structure for reading the data lines; you will probably have another CSV structure for the written output; you will have an output file handle that you ensure is flushed every time you write. Connecting these dots is now your problem.

Related

Perl Skipping Some Files When Looping Across Files and Writing Their Contents to Output File

I'm having an issue with Perl and I'm hoping someone here can help me figure out what's going on. I have about 130,000 .txt files in a directory called RawData and I have a Perl program that loads them into an array, then loops through this array, loading each .txt file. For simplicity, suppose I have four text files I'm looping through
File1.txt
File2.txt
File3.txt
File4.txt
The contents of each .txt file look something like this:
007 C03XXYY ZZZZ
008 A01XXYY ZZZZ
009 A02XXYY ZZZZ
where X,Y,Z are digits. In my simplified code below, the program then pulls out just line 007 in each .txt file, saves XX as ID, ignores YY and grabs the variable data ZZZZ that I've called VarVal. Then it writes everything to a file with a header specified in the code below:
#!/usr/bin/perl
use warnings;
use strict;
open(OUTFILE, "> ../Data/OutputFile.csv") or die $!;
opendir(MYDIR,"../RawData")||die $!;
my #txtfiles=grep {/\.txt$/} readdir(MYDIR);
closedir(MYDIR);
print OUTFILE "ID,VarName,VarVal\n";
foreach my $txtfile (#txtfiles){
#Prints to the screen so I can see where I am in the loop.
print $txtfile","\n";
open(INFILE, "< ../RawData/$txtfile") or die $!;
while(<INFILE>){
if(m{^007 C03(\d{2})(\d+)(\s+)(.+)}){
print OUTFILE "$1,VarName,$4\n"
}
}
}
The issue I'm having is that the contents of, for example File3.txt, don't show up in OutputFile.csv. However, it's not an issue with Perl not finding a match because I checked that the if statement is being executed by deleting OUTFILE and looking at what the code prints to the terminal screen. What shows up is exactly what should be there.
Furthermore, If I just run the problematic file (File3.txt) through the loop itself by commenting out the opendir and closedir stuff and doing something like my #textfile = "File3.txt";. Then when I run the code, the only data that shows up in the OutputFile.csv IS what's in File3.txt. But when it goes through the loop, it won't show up in OutputFile.csv. Plus, I know that File3.txt is being sent to into the loop because I can see it being printed on the screen with print $txtfile","\n";. I'm at a loss as to what is going on here.
The other issue is that I don't think it's something specific to this one particular file (maybe it is) but I can't just troubleshoot this one file because I have 130,000 files and I just happened to stumble across the fact that this one wasn't being written to the output file. So there may be other files that also aren't getting written, even though there is no obvious reason they shouldn't be just like the case of File3.txt.
Perhaps because I'm doing so many files in rapid succession, looping 130,000 files, causes some sort of I/O issues that randomly fails every so often to write the contents in memory to the output file? That's my best guess but I have not idea how to diagnose or fix this.
This is kind of a difficult question to debug, but I'm hoping someone on here has some insight or has seen similar problems that would provide me with a solution.
Thanks
There's nothing obviously wrong that I can see in your code. It is a little outdated as using autodie and lexical filehandles would be better.
However, I would recommend that you make your regex slightly less restrictive by making the spacing variable length after the first value and making the last variable optionally of 0 length. I'd also output the filename as well. Then you can see which other files aren't being caught for whatever reason:
if (m{^007\s+C03(\d{2})\d+\s+(.*)}){
print OUTFILE "$txtfile $1,VarName,$2\n";
last;
}
Finally, assuming there is only a single 007 C03 in each file, you could throw in a last call after one is found.
You may want to try sorting the #txtfiles list, then trying to systematically look through the output to see what is or isn't there. With 130k files in random order, it would be pretty difficult to be certain that you missed one. Perl should be giving you the files in the actual order they appear in the directory, which is different that user level commands like ls, so it may be different that you'd expect.

Check for CSV file correctness in Perl

I have a process that reads a CSV file, and I want to make sure it's correct before I start parsing it.
I get a file name, check if it exists, then check its integrity. If it's not there or not a proper CSV file then I try the file from the previous day instead
Is there a way to check that the file is proper CSV file? I am using Text::CSV_XS to parse it.
Googling a bit I found this csv-check example code on the Text::CSV_XS Git repo. It looks like something I could use.
As others have noted, you have to parse the entire file to determine if it's valid. You may as well kill two birds with one stone and do your data processing and error checking at the same time.
Detecting errors
getline() returns undef when it reaches EOF or if it fails to parse a line. You can use this to parse a file, halting if there are any parse errors:
while ( my $row = $csv->getline($io) ) {
# Process row
}
$csv->eof or do_something();
You can also
use autodie;
or set the auto_diag option in Text::CSV_XS->new() to die on errors:
$csv = Text::CSV_XS->new({ auto_diag => 2 });
You can handle the errors by wrapping your parsing code in an eval block. This method will automatically call error_diag() before dieing, printing the error to stderr; this may not be what you want.
Reverting invalid files
How do you "revert" the processing you did for previous rows if you detect an error? One possibility, if your database engine supports them, are database transactions. When you start processing a file, start a transaction. If you get a parse error, simply roll back the transaction and move on to the next file; otherwise, commit the transaction.
As an aside, I haven't seen your code for inserting database records so I'm not sure if this applies, but it's not very efficient to have a separate insert statement for each row. Instead, consider either constructing a compound insert statement as you parse the file; or, for very large files, let the database do the parsing with something like MySQL's LOAD DATA INFILE (just an example since I don't know what DBMS you're using).
To use a compound insert, build the query statement in memory like Borodin suggested. If you get to the end of the file without any parse errors, execute the statement; otherwise, throw it out and move on to the next file.
For very large files, it might be fastest to let the database do the parsing, especially if you're doing minimal processing before inserting the data. MySQL's LOAD DATA INFILE, for example, will halt if it detects data interpretation or duplicate key errors. If you wrap the statement in a transaction, you can roll back if there are errors and try to load the next file. The advantage of this approach is that loading valid files will be extremely fast, much faster than if you had to parse them with Perl first.
You can't test the validity of a file without reading and parsing every record in it anyway.
I suggest the way to go is to process each file that you find, building in memory the data that you want to end up in the database, and if you find an error then just discard it and try with the next file.
Once you reach the end of the file and know that it is valid and complete, then you can just save your prepared data to the database, and go on to the next file.
This will work fine unless your CSV files are enormous and too large to fit into memory sensibly. In that case you should simply take two passes.
Here is what I did, the sub returns 1 if the file is ok and 0 if it's not ok:
sub CheckCSVFile {
my ($fileName) =#_;
my $csv = Text::CSV_XS->new();
open my $in_fh, '<:encoding(ISO-8859-1)', $fileName;
while ( <$in_fh> ) {
my $status = $csv->parse($_);
if ($status != 1){
return $status;
}
}
$csv->eof;
close $in_fh;
return 1;
}
I check for file existence beforehand, so it shouldn't error out. I also don't want to exit if something goes wrong. It's a bit crude, but worked for me.

How to parse a CSV file containing serialized PHP?

I've just started dabbling in Perl, to try and gain some exposure to different programming languages - so forgive me if some of the following code is horrendous.
I needed a quick and dirty CSV parser that could receive a CSV file, and split it into file batches containing "X" number of CSV lines (taking into account that entries could contain embedded newlines).
I came up with a working solution, and it was going along just fine. However, as one of the CSV files that I'm trying to split, I came across one that contains serialized PHP code.
This seems to break the CSV parsing. As soon as I remove the serialization - the CSV file is parsed correctly.
Are there any tricks I need to know when it comes to parsing serialized data in CSV files?
Here is a shortened sample of the code:
use strict;
use warnings;
my $csv = Text::CSV_XS->new({ eol => $/, always_quote => 1, binary => 1 });
my $out;
my $in;
open $in, "<:encoding(utf8)", "infile.csv" or die("cannot open input file $inputfile");
open $out, ">outfile.000";
binmode($out, ":utf8");
while (my $line = $csv->getline($in)) {
$lines++;
$csv->print($out, $line);
}
I'm never able to get into the while loop shown above. As soon as I remove the serialized data, I suddenly am able to get into the loop.
Edit:
An example of a line that is causing me trouble (taken straight from Vim - hence the ^M):
"26","other","1","20,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","37","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","6","1"
"27","other","1","35,000 Subscriber Plan","Some test here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","38","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","7","1"
"28","other","1","50,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","39","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","8","1""73","other","8","10,000,000","","","","0","","0","","0","0","recurring","0","","payment","","0","","","75","0","10000000","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:17:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","14","0"
The CSV you are trying to read escapes embedded quotes with backslash, but the default for Text::CSV_XS is to escape by doubling them. Try adding escape_char => '\\' to the Text::CSV_XS constructor.
You may also need allow_loose_escapes => 1 if it uses backslash to quote other things that don't strictly need it like newlines.
The other option is to change the writer to use doubled quotes instead of backslashes for escaping. Might or might not be possible. Doubling the quotes is the more common flavour of CSV and while programmatic parsers can generally read both (if told), you won't be able to read the variant with backslash e.g. in Excel.

What is the fastest way to 'print' to file in perl?

I've been writing output from perl scripts to files for some time using code as below:
open( OUTPUT, ">:utf8", $output_file ) or die "Can't write new file: $!";
print OUTPUT "First line I want printed\n";
print OUTPUT "Another line I want printing\n";
close(OUTPUT);
This works, and is faster than my initial approach which used "say" instead of print (Thank you NYTProf for enlightening my to that!)
However, my current script is looping over hundreds of thousands of lines and is taking many hours to run using this method and NYTProf is pointing the finger at my thousands of 'print' commands. So, the question is... Is there a faster way of doing this?
Other Info that's possibly relevant...
Perl Version: 5.14.2 (On Ubuntu)
Background of the script in question...
A number of '|' delimited flat files are being read into hashes, each file has some sort of primary key matching entries from one to another. I'm manipulating this data and them combining them into one file for import into another system.
The output file is around 3 Million lines, and the program starts to noticeably slow down after writing around 30,000 lines to said file. (A little reading around seemed to point towards running out of write buffer in other languages but I couldn't find anything about this with regard to perl?)
EDIT: I've now tried adding the line below, just after the open() statement, to disable print buffering, but the program still slows around the 30,000th line.
OUTPUT->autoflush(1);
I think you need to redesign the algorithm your program uses. File output speed isn't influenced by the amount of data that has been output, and it is far more likely that your program is reading and processing data but not releasing it.
Check the amount of memory used by your process to see if it increases inexorably
Beware of for (<$filehandle>) loops, which read whole files into memory at once
As I said in my comment, disable the relevant print statements to see how performance changes
Have you tried to concat all the single print's into a single scalar and then print scalar all at once? I have a script that outputs an average of 20 lines of text for each input line. When using individual print statements, even sending the output to /dev/null, took a long time. But when I packed all the output (for a single input line) together, using things like:
$output .= "...";
$output .= sprintf("%s...", $var);
Then just before leaving the line processing sub-routine, I 'print $output'. Printing all the lines at once. The number of calls to print went from ~7.7M to about 386K - equal to the number of lines in the input date file. This shaved about 10% off of my total execution time.

How do I properly format plain text data for a simple Perl dictionary app?

I have a very simple dictionary application that does search and display. It's built with the Win32::GUI module. I put all the plain text data needed for the dictionary under the __DATA__ section. The script itself is very small but with everything under the __DATA__ section, its size reaches 30 MB. In order to share the work with my friends, I've then packed the script into a stand-alone executable using the PP utility of the PAR::Packer module with the highest compression level 9 and now I have a single-file dictionary app of about the size of 17MB.
But although I'm very comfortable with the idea of a single-file script, placing such huge amount of text data under the script's DATA section does not feel right. For one thing, when I try opening the script in Padre (Notepad ++ is okay), I'm receiving the error that is like:
Can't open my script as the script is over the arbitrary file size limit which is currently 500000.
My questions:
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
If I do so, What should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
How do people normally format the text data needed for a dictionary application?
Any comments, ideas or suggestions? Thanks like always :)
If I do so, What should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
Well, it depends on WHY you want to reduce the size. If it is to minimize disk space usage (rather weird goal most of the time these days), then the zip/unzip is the way to go.
However if the goal is to minimize memory usage, then a better approach is to split up the dictionary data into smaller chunks (for example indexed by a first letter), and only load needed chunks.
How do people normally format the text data needed for a dictionary application?
IMHO the usual approach is what you get as the logical end of an approach mentioned above (partitioned and indexed data): using a back-end database, which allows you to only retrieve the data which is actually needed.
In your case probably something simple like SQLite or Berkley DB/DBM files should be OK.
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
This depends somewhat on your usage... if it's a never-changing script used by 3 people, may be no tangible benefits.
In general, it will make maintenance much easier (you can change the dictionary and the code logic independently - think virus definitions file vs. antivirus executable for real world example).
It will also decrease the process memory consumption if you go with the approaches I mentioned above.
Since you are using PAR::Packer already, why not move it to a separate file or module and include it in the PAR file?
The easy way (no extra commandline options to pp, it will see the use statement and do the right thing):
words.pl
#!/usr/bin/perl
use strict;
use warnings;
use Words;
for my $i (1 .. 2) {
print "Run $i\n";
while (defined(my $word = Words->next_word)) {
print "\t$word\n";
}
}
Words.pm
package Words;
use strict;
use warnings;
my $start = tell DATA
or die "could not find current position: $!";
sub next_word {
if (eof DATA) {
seek DATA, $start, 0
or die "could not seek: $!";
return undef;
}
chomp(my $word = scalar <DATA>);
return $word;
}
1;
__DATA__
a
b
c