perl File::Tail not reading lines from file after certain period

I am having trouble understanding why File::Tail fails to read lines from a file that is constantly being updated with thousands of transactions and rolls over automatically.
It reads correctly up to a point, but then slows down, and for long stretches it fails to read any lines from the log at all. I can confirm that the log is still being written to while File::Tail shows nothing.
my $file = File::Tail->new(
    name               => $name,
    tail               => 1000,
    maxinterval        => 1,
    interval           => 1,
    adjustafter        => 5,
    resetafter         => 1,
    ignore_nonexistant => 1,
    maxbuf             => 32768,
);

while (defined(my $line = $file->read)) {
    my $xml_string = "";
    #### Read only one event per line and deal with the XML.
    #### If the log entry is not a SLIM log, I will ignore it.
    if ($line =~ /(\<slim\-log\>.*\<\/slim\-log\>)/) {
        # do_something
    }
    else {
        # not working, for some reason
    }
}
Can someone please help me understand this? For reference, the log file grows at roughly 10 MB per second, or approximately 1000 events per second.
Should I be handling the filehandle, or the File::Tail results, in some more efficient way?

It seems there are limitations in File::Tail. Some more direct options (a pipe, a fork, a thread, seeking to the end of the file in Perl) are discussed in http://www.perlmonks.org/?node_id=387706.
My favorite pick is the blocking read from a pipe:
open(TAIL, "tail -F $name|") or die "TAIL : $!";
while (<TAIL>) {
test_and_do_something
}

Related

Perl - Piping gunzip Output to File::ReadBackwards

I have a Perl project (a CGI script running on Apache) that has always used gunzip and tac (piping gunzip to tac, and piping that to a filehandle) to accomplish its workload, which is to process large, flat text files, sometimes 10GB or more each. More specifically, in my use case these files need to be decompressed on the fly AND, at times, read backwards as well (both are sometimes required, mainly for speed).
When I started this project, I looked at using File::ReadBackwards but decided on tac instead for performance reasons. After some discussion on a slightly-related topic last night and several suggestions to try and keep the processing entirely within Perl, I decided to give File::ReadBackwards another shot to see how it stands up under this workload.
Some preliminary tests indicate that it may in fact be comparable, and possibly even better, than tac. However, so far I've only been able to test this on uncompressed files. But it now has grabbed my interest so I'd like to see if I could make it work with compressed files as well.
Now, I'm pretty sure I could unzip a file to another file and then read that backwards, but I think the performance would be terrible. In particular, the user has the option to limit results to X lines precisely to help performance, so I do not want to have to decompress and process the entirety of a file every single time I pull any lines out of it. Ideally I would like to do what I do now, which is to decompress and read backwards on the fly, with the ability to bail out as soon as I hit my quota if needed.
So, my dilemma is that I need to find a way to pipe output from gunzip, into File::ReadBackwards, if possible.
On a side note, I would be willing to give IO::Uncompress::Gunzip a chance as well (compare the decompression performance against a plain, piped gunzip process), either for performance gain (which would surprise me) or for convenience/the ability to pipe output to File::ReadBackwards (which I find slightly more likely).
Does anyone have any ideas here? Any advice is much appreciated.
You can't. File::ReadBackwards requires a seekable handle (i.e. a plain file and not a pipe or socket).
To use File::ReadBackwards, you'd have to first send the output to a named temporary file (which you could create using File::Temp).
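A minimal sketch of that temp-file route (the input filename and the gunzip invocation here are assumptions for illustration, not the asker's actual setup):
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::ReadBackwards;

my $gz = 'data.txt.gz';    # hypothetical compressed input

# Decompress into a named temporary file that File::ReadBackwards can seek in.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
open my $gunzip, '-|', 'gunzip', '-c', '--', $gz
    or die "Can't run gunzip: $!";
while (my $chunk = <$gunzip>) {
    print {$tmp_fh} $chunk;
}
close $gunzip or die "gunzip failed: $!";
close $tmp_fh or die "Can't write $tmp_name: $!";

# Now read the decompressed copy backwards.
my $bw = File::ReadBackwards->new($tmp_name)
    or die "Can't open $tmp_name: $!";
while (defined(my $line = $bw->readline)) {
    print $line;
}
Note that this still decompresses the whole file up front, which is exactly the cost the question is trying to avoid.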
While File::ReadBackwards won't work as desired, here is another take.
In the original approach you gunzip before tac-ing, so the whole file is read just to get to its end; tac is there only for convenience. (For a plain uncompressed file one can get the file size from its metadata and then seek toward the end, so the whole thing never has to be read; see the sketch below.)
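For that plain-file case, a rough sketch (the filename and the 4KB window are illustrative assumptions):
use strict;
use warnings;

my $file = 'big.log';    # hypothetical uncompressed file
my $size = -s $file // die "Can't stat $file: $!";

open my $fh, '<', $file or die "Can't open $file: $!";
seek $fh, ($size > 4096 ? $size - 4096 : 0), 0;    # jump to the last ~4KB
<$fh> if $size > 4096;                             # discard the likely partial first line
my @tail = <$fh>;                                  # the last few lines
print for reverse @tail;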
Then try the same, or something similar, in Perl. The IO::Uncompress::Gunzip module also has a seek method. It does have to uncompress the data up to the seek point, though:
Note that the implementation of seek in this module does not provide true random access to a compressed file/buffer
but with it we still avoid copying uncompressed data (into variables), so the only price we pay is uncompressing data in order to seek. In my timings this saves upwards of an order of magnitude, bringing it far closer to the system's gunzip (competitive at file sizes on the order of 10MB).
For that we also need the uncompressed size, which the module's seek uses; I get it with the system's gzip -l. So I still need to parse the output of an external tool, and there's that issue.†
use warnings;
use strict;
use feature 'say';

use IO::Uncompress::Gunzip qw($GunzipError);

my $file = shift;
die "Usage: $0 file\n" if not $file or not -f $file;

my $z = IO::Uncompress::Gunzip->new($file) or die "Error: $GunzipError";

my $us = (split ' ', (`gunzip -l $file`)[1])[1];    # CHECK gunzip's output
say "Uncompressed size: $us";

# Go to 1024 bytes before the uncompressed end (should really be more careful,
# since we aren't guaranteed that size estimate)
$z->seek($us - 1024, 0);

while (my $line = $z->getline) {
    print $line if $z->eof;
}
(Note: docs advertise SEEK_END but it didn't work for me, neither as a constant nor as 2. Also note that the constructor does not fail for non-existent files so the program doesn't die there.)
This only prints the last line. Collect those lines into an array instead, for more involved work.
For compressed text files on the order of 10MB in size this runs as fast as gunzip | tac. For files around 100MB in size it is slower by a factor of two. This is a rather rudimentary estimate, and it depends on all manner of detail, but I am comfortable saying that it will be noticeably slower for larger files.
However, the code above has a particular problem with the file sizes possible in this case, in the tens of GB. The good old gzip format has a limitation, nicely stated in the gzip manual:
The gzip format represents the input size modulo 2^32 [...]
So sizes obtained with --list for files larger than 4GB undermine the above optimization: we seek to a spot early in the file instead of near its end (for a 17GB file, -l reports the size as 1GB, so we seek there) and then in fact read the bulk of the file with getline.
The best solution would be to use the known value of the uncompressed data size, if it is known. Otherwise, if the compressed file size exceeds 4GB, seek to the compressed size (as far as we can safely go) and after that use read with very large chunks:
my $buf;
my $len = 10 * 1_024_000;    # only hundreds of reads, but a large buffer
$z->read($buf, $len) while not $z->eof;

my @last_lines = split /\n/, $buf;
The last step depends on what actually needs to be done. If it is indeed to read lines backwards, then you can do while (my $el = pop @last_lines) { ... } for example, or reverse the array and work through it. Note that the last read will likely be far smaller than $len.
On the other hand, it may turn out that the last read buffer is too small for what's needed, so you may want to always copy the needed number of lines and keep them across reads.
The buffer size to read ($len) clearly depends on the specifics of the problem.
Finally, if this is too much bother you can pipe gunzip and keep a buffer of lines.
use String::ShellQuote qw(shell_quote);

my $num_lines = ...;    # user supplied
my @last_lines;

my $cmd = shell_quote('gunzip', '-c', '--', $file);
my $pid = open my $fh, '-|', $cmd or die "Can't open $cmd: $!";

push @last_lines, scalar <$fh> for 0 .. $num_lines;    # seed the buffer so we don't have to check its size

while (<$fh>) {
    push @last_lines, $_;
    shift @last_lines;
}
close $fh;

while (my $line = pop @last_lines) {
    print $line;    # process backwards
}
I put $num_lines entries on the array right away so that I don't have to test the size of @last_lines against $num_lines on every shift, that is, on every read. (This improves the runtime by nearly 30%.)
Any hint of the number of lines (of uncompressed data) is helpful, so that we can skip ahead and avoid copying data into variables as much as possible:
# Stash $num_lines entries on the array, as above
<$fh> for 0 .. $num_to_skip;    # skip over an estimated number of lines
# Now push+shift while reading
This can help quite a bit, depending on how well we can estimate the number of lines. Altogether, in my tests this is still slower than gunzip | tac | head, by around 50% in the very favorable case where I skip 90% of the file.
† The uncompressed size can be found without going to external tools as
my $us = do {
    my $d;
    open my $fh, '<', $file or die "Can't open $file: $!";
    seek($fh, -4, 2) and read($fh, $d, 4) >= 4 and unpack('V', $d)
        or die "Can't get uncompressed size: $!";
};
Thanks to mosvy for a comment with this.
If we do stick with the system's gunzip, then the safety of running an external command with user input (a filename), practically sidestepped here by checking that the file exists, needs to be taken into account, by using String::ShellQuote to compose the command:
use String::ShellQuote qw(shell_quote);
my $cmd = shell_quote('gunzip', '-l', '--', $file);
# my $us = ... qx($cmd) ...;
Thanks to ikegami for the comment.

How does while work with a filehandle when reading a gigantic file in Perl

I have a very large file to read, and when I use while to read it line by line, the script takes longer and longer to read each line the deeper I get into the file; the increase appears to be exponential.
while (<$fh>) {
    # do something
}
Does while have to parse through all the lines it has already read to get to the next unread line, or something like that?
How can I overcome this situation?
EDIT 1:
My code:
$line=0;
%values;
open my $fh1, '<', "file.xml" or die $!;
while (<$fh1>) {
    $line++;
    if ($_ =~ s/foo//gi) {
        chomp $_;
        $values{'id'} = $_;
    }
    elsif ($_ =~ s/foo//gi) {
        chomp $_;
        $values{'type'} = $_;
    }
    elsif ($_ =~ s/foo//gi) {
        chomp $_;
        $values{'pattern'} = $_;
    }
    if (keys(%values) == 3) {
        open FILE, ">>temp.txt" or die $!;
        print FILE "$values{'id'}\t$values{'type'}\t$values{'pattern'}\n";
        close FILE;
        %values = ();
    }
    if ($line == ($line1 + 1000000)) {
        $line1 = $line;
        $read_time = time();
        $processing_time = $read_time - $start_time - $processing_time;
        print "xml file parsed till line $line, time taken $processing_time sec\n";
    }
}
EDIT 2
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NLM//DTD NCBI-Entrezgene, 21st January 2005//EN" "http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<Entrezgene_track-info>
<Gene-track>
<Gene-track_geneid>816394</Gene-track_geneid>
<Gene-track_create-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2003</Date-std_year>
<Date-std_month>7</Date-std_month>
<Date-std_day>30</Date-std_day>
<Date-std_hour>19</Date-std_hour>
<Date-std_minute>53</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_create-date>
<Gene-track_update-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2015</Date-std_year>
<Date-std_month>1</Date-std_month>
<Date-std_day>8</Date-std_day>
<Date-std_hour>15</Date-std_hour>
<Date-std_minute>41</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_update-date>
</Gene-track>
</Entrezgene_track-info>
<Entrezgene_type value="protein-coding">6</Entrezgene_type>
<Entrezgene_source>
<BioSource>
<BioSource_genome value="chromosome">21</BioSource_genome>
<BioSource_org>
<Org-ref>
<Org-ref_taxname>Arabidopsis thaliana</Org-ref_taxname>
<Org-ref_common>thale cress</Org-ref_common>
<Org-ref_db>
<Dbtag>
<Dbtag_db>taxon</Dbtag_db>
This is just the gist of the original xml file; if you like, you can check the whole xml file from Here. Select any one entry and send it to a file as xml.
EDIT 3
Many have suggested that I avoid using substitution, but I feel it is essential in my code, because from a line in the xml file such as:
<Gene-track_geneid>816394</Gene-track_geneid>
I want to take only the id, which here is 816394 but can be any number (of any number of digits) for other entries; so how can I avoid using substitution?
Thanks in advance
ANSWER:
First, I would like to apologize for taking so long to reply; I started Perl again from the ground up, and this time with use strict, which helped me keep the run time linear. Using an XML parser is also the right thing to do when handling large XML files.
Thanks all for the help and suggestions.
Further to my comment above, you should get into the habit of using the strict and warnings pragmas at the start of every script. warnings picks up mistakes that might otherwise not be found until runtime. strict enforces a number of good rules, including declaring all variables with my. A variable then exists only in the scope (typically the code block) it was declared in.
Try something like this and see if you get any improvement.
use strict;
use warnings;

my %values;
my $line = 0;

open my $XML,  '<',  "file.xml" or die $!;
open my $TEMP, '>>', "temp.txt" or die $!;

while (<$XML>) {
    chomp;
    $line++;
    if    (s/foo//gi) { $values{id}      = $_ }
    elsif (s/foo//gi) { $values{type}    = $_ }
    elsif (s/foo//gi) { $values{pattern} = $_ }

    if (keys(%values) == 3) {
        print $TEMP "$values{id}\t$values{type}\t$values{pattern}\n";
        undef %values;
    }
    # if ($line = ...
}
close $TEMP;
Ignore my one-line-if formatting; I did that for brevity. Format it however you like.
The main thing I've done, which I hope helps, is to declare the %values hash with my so it doesn't have "global" scope, and to undef it after each record is written, which should release the memory it was using. Also, opening and closing your output file only once cuts out a lot of unnecessary operations.
I also cleaned up a few other things. Since you are acting on the topical $_ variable, you can leave it out of operations like chomp (which now occurs only once, at the beginning of the loop) and your regex substitutions.
EDIT
It just occurred to me that you might be waiting multiple loop iterations until %values reaches 3 keys, in which case that would not work, so I moved the undef back inside the if.
MORE EDIT
As has been commented below, you should look into installing and using an XML parser from CPAN. If for whatever reason you are unable to use a module, a capturing regex may work better than a replacement, e.g. ($var) = m{^</(\w+)>} would capture this from a closing tag </this>.
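Applied to the geneid line from the sample XML (see EDIT 3 above), a capture rather than a substitution might look like this sketch; the tag name comes from the question, the rest is illustrative:
my $line = '<Gene-track_geneid>816394</Gene-track_geneid>';

# Capture the numeric id between the tags instead of stripping the markup.
if (my ($id) = $line =~ m{<Gene-track_geneid>(\d+)</Gene-track_geneid>}) {
    print "id: $id\n";    # prints "id: 816394"
}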
There's no reason I see why that code would take exponentially more time. I don't see any memory leaks. %values will not grow. Looping over each line in a file does not depend on the file size only the line size. I even made an XML file with 4 million lines in it from your linked XML data to test it.
My thoughts are...
There's something you're not showing us (those regexes aren't real, $start_time is not initialized).
You're on a wacky filesystem, perhaps a network filesystem. (OP is on NTFS)
You're using a very old version of Perl with a bug. (OP is using Perl 5.20.1)
A poorly implemented network filesystem could slow down while reading an enormous file. It could also misbehave because of how you're opening and closing temp.txt rapidly; you could be chewing through file handles. temp.txt should be opened once, before the loop. @Joshua's improvement suggestions are good (though the concern about %values is a red herring).
As also noted, you should not be parsing XML by hand. For a file this large, use a SAX parser which works on the XML a piece at a time keeping the memory costs down, as opposed to a DOM parser which reads the whole file. There are many to choose from.
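As one concrete possibility (my own sketch, not something this answer prescribes), XML::Twig can hand you one completed Entrezgene element at a time and let you purge it afterwards. The element names come from the sample XML above; only two of the question's three fields are identifiable from that sample, so the sketch extracts those and writes them tab-separated to temp.txt:
use strict;
use warnings;
use XML::Twig;

open my $out, '>>', 'temp.txt' or die "Can't open temp.txt: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        Entrezgene => sub {
            my ($t, $gene) = @_;
            my $id_el   = $gene->first_descendant('Gene-track_geneid');
            my $type_el = $gene->first_descendant('Entrezgene_type');
            printf {$out} "%s\t%s\n",
                ($id_el   ? $id_el->text   : ''),
                ($type_el ? $type_el->text : '');
            $t->purge;    # free the memory used by this element
        },
    },
);

$twig->parsefile('file.xml');
close $out;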
while (<$fh>) {...} doesn't reread the file from the start on each iteration, no
The most likely cause of your problem is that you're keeping data in memory on each iteration, causing memory usage to grow as you work your way through the file. The slowdown comes in when physical memory is exhausted and the computer has to start paging out to virtual memory, ultimately producing a situation where you could be spending more time just moving memory pages back and forth between RAM and disk than on actual work.
If you can produce a brief, runnable test case which demonstrates your problem, I'm sure we can give more specific advice to fix it. If that's not possible, just a description of your {do something} process could give us enough to go on.
Edit after Edit 1 to question:
Looking at the code posted, I suspect that your slowdown may be caused by how you're handling your output. Closing and reopening the output file each time you add a line would definitely slow things down compared to just keeping it open and, depending on your OS/filesystem combination, it may need to seek through the entire file to find the end before appending.
Nothing else stands out to me as potentially causing performance issues, but a couple other minor points:
After your regex substitutions, $_ will never contain line ends (unless you explicitly include them in the foo patterns), so you can probably skip the chomp $_; lines.
You should open the output file the same way as you open the input file (lexical filehandle, three-argument open) instead of doing it the old way.
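For the output handle, that means something along these lines (a two-line sketch reusing the filename and fields from the question):
open my $temp_fh, '>>', 'temp.txt' or die "Can't open temp.txt: $!";
print {$temp_fh} "$values{'id'}\t$values{'type'}\t$values{'pattern'}\n";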

Efficient way to continually read in data from a text file

We have a script on an FTP endpoint that monitors the FTP logs spewed out by our FTP daemon.
Currently what we do is have a Perl script that essentially runs a tail -F on the file and sends every single line to a remote MySQL database, with slightly different column content based on the record type.
This database has tables both for the tarball names/contents and for FTP user actions on those packages: downloads, deletes, and everything else VSFTPd logs.
I see this as particularly bad, but I'm not sure what's better.
The goal of a replacement is still to get log file content into a database as quickly as possible. I'm thinking of doing something like putting a FIFO/pipe file where the FTP log file is, so I can read from it periodically and ensure I never read the same thing twice; that assumes VSFTPd will play nice with it (I'm thinking it won't, insight welcome!).
The FTP daemon is VSFTPd, and I'm fairly sure the extent of its logging capabilities is: xfer-style log, vsftpd-style log, both, or no logging at all.
The question is, what's better than what we're already doing, if anything?
Honestly, I don't see much wrong with what you're doing now. tail -f is very efficient. The only real problem with it is that it loses state if your watcher script ever dies (which is a semi-hard problem to solve with rotating logfiles). There's also a nice File::Tail module on CPAN that saves you from the shellout and has some nice customization available.
Using a FIFO as a log can work (as long as vsftpd doesn't try to unlink and recreate the logfile at any point, which it may do) but I see one major problem with it. If no one is reading from the other end of the FIFO (for instance if your script crashes, or was never started), then a short time later all of the writes to the FIFO will start blocking. And I haven't tested this, but it's pretty likely that having logfile writes block will cause the entire server to hang. Not a very pretty scenario.
Your problem with reading a continually updated file is that you want to keep reading, even after end of file is reached. The solution to this is to re-seek to your current position in the file:
seek FILEHANDLE, 0, 1;
Here is my code for doing this sort of thing:
open(FILEHANDLE, '<', '/var/log/file') || die 'Could not open log file';
seek(FILEHANDLE, 0, 2) || die 'Could not seek to end of log file';
for (;;) {
    while (<FILEHANDLE>) {
        if ( $_ =~ /monitor status down/ ) {
            print "Gone down\n";
        }
    }
    sleep 1;
    seek FILEHANDLE, 0, 1;    # clear eof
}
You should look into inotify (assuming you are on a nice, POSIX-based OS) so you can run your Perl script whenever the logfile is updated. If this level of IO causes problems, you could always keep the logfile on a RAM disk so IO is very fast.
This should help you set this up:
http://www.cyberciti.biz/faq/linux-inotify-examples-to-replicate-directories/
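If you want to stay inside Perl rather than shelling out to inotifywait, the Linux::Inotify2 module exposes the same mechanism; this is a hedged sketch (the log path is hypothetical, and it assumes a Linux kernel with inotify support):
use strict;
use warnings;
use Linux::Inotify2;

my $log = '/var/log/vsftpd.log';    # hypothetical log path

my $inotify = Linux::Inotify2->new
    or die "Can't create inotify object: $!";

open my $fh, '<', $log or die "Can't open $log: $!";
seek $fh, 0, 2;    # start at the current end of the log

$inotify->watch($log, IN_MODIFY, sub {
    # Read whatever has been appended since the last event.
    while (my $line = <$fh>) {
        print "new: $line";
    }
    seek $fh, 0, 1;    # clear EOF so the next event can read again
});

1 while $inotify->poll;    # block and dispatch events forever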
You can open the file as an in pipe.
open(my $file, '-|', '/ftp/file/path');
while (<$file>) {
    # you know the rest
}
File::Tail does this, plus heuristic sleeping and nice error handling and recovery.
Edit: On second thought, a real system pipe is better, if you can manage it. If not, whenever your process starts you need to find the last thing you put in the database and spin through the file until you reach it. That is not easy to accomplish, and potentially impossible if you have no way of identifying where you left off.
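One hedged way to make restarts resumable without a pipe (my own sketch, not part of this answer) is to persist the byte offset with tell and seek back to it on startup; the paths are hypothetical, and log rotation or truncation would need extra handling:
use strict;
use warnings;

my $log   = '/var/log/vsftpd.log';     # hypothetical log path
my $state = '/var/tmp/ftplog.offset';  # hypothetical state file

# Restore the last known position, if any.
my $offset = 0;
if (open my $s, '<', $state) {
    $offset = <$s> + 0;
    close $s;
}

open my $fh, '<', $log or die "Can't open $log: $!";
seek $fh, $offset, 0;

while (1) {
    while (my $line = <$fh>) {
        # ... insert $line into the database here ...
    }
    # Remember how far we got, so a crash or restart resumes from here.
    open my $s, '>', $state or die "Can't write $state: $!";
    print {$s} tell($fh);
    close $s;

    sleep 1;
    seek $fh, 0, 1;    # clear EOF
}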

Will data in a pipe queue up for reading by Perl?

I have a Perl script that executes a long running process and observes its command line output (log messages), some of which are multiple lines long. Once it has a full log message, it sends it off to be processed and grabs the next log message.
open(PS_F, "run.bat |") or die $!;
$logMessage = "";
while (<PS_F>) {
$lineRead = $_;
if ($lineRead =~ m!(\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2})!) {
#process the previous log message
$logMessage = $lineRead;
}
else {
$logMessage = $logMessage.$_;
}
}
close(PS_F);
In its current form, do I have to worry about the line reading and processing "backing up"? For example, if I get a new log message every 1 second and it takes 5 seconds to do all the processing (random numbers I pulled out), do I have to worry that I will miss log messages or have memory issues?
In general, data output on the pipeline by one application will be buffered if the next cannot consume it fast enough. If the buffer fills up, the outputting application is blocked (i.e. calls to write to the output file handle just stall) until the consumer catches up. I believe the buffer on Linux is (or was) 65536 bytes.
In this fashion, you can never run out of memory, but you can seriously stall the producer application in the pipeline.
No you will not lose messages. The writing end of the pipe will block if the pipe buffer is full.
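A quick, self-contained way to see that blocking behavior (a sketch of my own, not from the answers): a child writes quickly into a pipe while the parent deliberately reads slowly; the child's writes stall once the pipe buffer is full, and nothing is lost.
use strict;
use warnings;

pipe my $reader, my $writer or die "pipe: $!";

if (my $pid = fork()) {                    # parent: slow consumer
    close $writer;
    sleep 3;                               # let the child fill the pipe buffer
    my $n = 0;
    $n++ while <$reader>;                  # drain; the blocked child resumes
    print "read $n lines, none lost\n";
    waitpid $pid, 0;
}
else {                                     # child: fast producer
    defined $pid or die "fork: $!";
    close $reader;
    print {$writer} "message $_\n" for 1 .. 50_000;    # blocks when the buffer is full
    exit 0;
}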
Strictly speaking, this should be a comment: Please consider re-writing your code as
# use lexical filehandle and 3 arg open
open my $PS_F, '-|', 'run.bat'
    or die "Cannot open pipe to 'run.bat': $!";

# explicit initialization not needed
# limit scope
my $logMessage;

while (<$PS_F>) {
    # you probably meant to anchor the pattern,
    # and there is no need to capture if you are not going to use
    # the captured matches;
    # there is no need to escape a space, although you might
    # want to use [ ] for clarity
    $logMessage = '' if m!^\d{4}-\d{2}-\d{2}[ ]\d{2}:\d{2}:\d{2}!;
    $logMessage .= $_;
}

close $PS_F
    or die "Cannot close pipe: $!";

How do I read a file which is constantly updating?

I am getting a stream of data (text format) from an external server and would like to pass it on to a script line by line. The file is being appended to continuously. What is the ideal way to perform this operation? Would IO::Socket in Perl do it? Eventually this data has to pass through a PHP program (reusable) and land in a MySQL database.
The question is: how do I open a file which is continuously being updated?
In Perl, you can make use of seek and tell to read from a continuously growing file. It might look something like this (borrowed liberally from perldoc -f seek)
open(FH, '<', $the_file) || handle_error();    # typical open call
for (;;) {
    while (<FH>) {
        # ... process $_ and do something with it ...
    }
    # eof reached on FH, but wait a second and maybe there will be more output
    sleep 1;
    seek FH, 0, 1;    # this clears the eof flag on FH
}
In Perl there are a couple of modules that make tailing a file easier: IO::Tail and File::Tail. One uses a callback, the other uses a blocking read, so it just depends on which suits your needs better. There are likely other tailing modules as well, but these are the two that came to mind.
IO::Tail - follow the tail of files/stream
use IO::Tail;
my $tail = IO::Tail->new();
$tail->add('test.log', \&callback);
$tail->check();
$tail->loop();
File::Tail - Perl extension for reading from continously updated files
use File::Tail;
my $file = File::Tail->new("/some/log/file");
while (defined(my $line = $file->read)) {
    print $line;
}
Perhaps a named pipe would help you?
You talk about opening a file, and ask about IO::Socket. These aren't quite the same things, even if deep down you're going to be reading data off a file descriptor.
If you can access the remote stream from a named pipe or FIFO, then you can just open it as an ordinary file. It will block when nothing is available, and return whenever there is data that needs to be drained. You may, or may not, need to bring File::Tail to bear on the problem of not losing data if the sender runs too far ahead of you.
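For completeness, creating and reading such a FIFO from Perl might look like this (the path and mode are illustrative; POSIX::mkfifo is an assumption about how the FIFO gets created):
use strict;
use warnings;
use POSIX qw(mkfifo);

my $fifo = '/tmp/stream.fifo';    # hypothetical FIFO path

unless (-p $fifo) {
    mkfifo($fifo, 0600) or die "Can't create $fifo: $!";
}

# open() blocks here until the sender opens the other end for writing.
open my $fh, '<', $fifo or die "Can't open $fifo: $!";
while (my $line = <$fh>) {
    # ... hand $line to the PHP/MySQL side here ...
    print $line;
}
close $fh;
When the writer closes its end the reader sees EOF, so a long-running consumer would reopen the FIFO (or wrap the open/read in a loop).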
On the other hand, if you're opening a socket directly to the other server (which seems more likely), IO::Socket is not going to work out of the box as there is no getline method available. You would have to read and buffer block-by-block and then dole it out line by line through an intermediate holding pen.
You could pull out the socket descriptor into an IO::Handle, and use getline() on that. Something like:
use IO::Socket::INET;
use IO::Handle;

my $sock = IO::Socket::INET->new(
    PeerAddr => '172.0.0.1',
    PeerPort => 1337,
    Proto    => 'tcp',
) or die $!;

my $io = IO::Handle->new;
$io->fdopen(fileno($sock), "r") or die $!;

while (defined( my $data = $io->getline() )) {
    chomp $data;
    # do something
}
You may have to perform a handshake in order to start receiving packets, but that's another matter.
In Python it is pretty straightforward:
import time

f = open('teste.txt', 'r')

# read all lines already in the file
for line in f:
    print line.strip()

# keep waiting forever for more lines
while True:
    line = f.readline()          # just read more
    if line:                     # if you got something...
        print 'got data:', line.strip()
    time.sleep(1)                # wait a second to not fry the CPU needlessly
The solutions that read the whole file just to seek to the end perform poorly. If this is on Linux, I would suggest simply renaming the log file. You can then scan all the entries in the renamed file while the original file fills up again. After scanning the renamed file, delete it, or move it wherever you like. This way you get something like logrotate, but for scanning newly arriving data.
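A sketch of that rename-and-scan idea (the paths are illustrative; note that a daemon which already holds the log open keeps writing to the renamed inode until it reopens its log, so it usually needs to be signalled, as logrotate does, before new entries start landing in a fresh file):
use strict;
use warnings;

my $log    = '/var/log/app.log';         # hypothetical paths
my $staged = '/var/log/app.log.scan';

# Take the data written so far; scanning happens on the renamed copy.
rename $log, $staged or die "Can't rename $log: $!";

open my $fh, '<', $staged or die "Can't open $staged: $!";
while (my $line = <$fh>) {
    # ... process each already-written entry ...
}
close $fh;
unlink $staged or warn "Can't remove $staged: $!";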