Efficient way to continually read in data from a text file - perl

We have a script on an FTP endpoint that monitors the FTP logs spewed out by our FTP daemon.
Currently, a Perl script essentially runs tail -F on the file and sends every single line to a remote MySQL database, with slightly different column content depending on the record type.
This database has tables for the tarball names and contents, as well as for FTP user actions on those packages: downloads, deletes, and everything else VSFTPd logs.
I see this as particularly bad, but I'm not sure what's better.
The goal of a replacement is still to get log file content into a database as quickly as possible. I'm thinking of something like making a FIFO/pipe file in place of the FTP log file, so I can read it periodically and be sure I never read the same thing twice. That assumes VSFTPd will play nice with it (I'm thinking it won't, insight welcome!).
The FTP daemon is VSFTPd. I'm fairly sure the extent of its logging capabilities is: xfer style log, vsftpd style log, both, or no logging at all.
The question is, what's better than what we're already doing, if anything?

Honestly, I don't see much wrong with what you're doing now. tail -f is very efficient. The only real problem with it is that it loses state if your watcher script ever dies (which is a semi-hard problem to solve with rotating logfiles). There's also a nice File::Tail module on CPAN that saves you from the shellout and has some nice customization available.
Using a FIFO as a log can work (as long as vsftpd doesn't try to unlink and recreate the logfile at any point, which it may do) but I see one major problem with it. If no one is reading from the other end of the FIFO (for instance if your script crashes, or was never started), then a short time later all of the writes to the FIFO will start blocking. And I haven't tested this, but it's pretty likely that having logfile writes block will cause the entire server to hang. Not a very pretty scenario.
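The lost-state problem mentioned above can be reduced by saving the read position between runs. The following is only a minimal sketch, not what tail or File::Tail do internally: read_new_lines, the state file and the callback are hypothetical names, and it assumes the log is rotated by renaming (so the inode changes) or by truncation.

```perl
use strict;
use warnings;

# Read lines appended to $log since the last call, remembering the
# position in $state_file so a restarted watcher resumes where it left off.
# Both paths and the callback are hypothetical placeholders.
sub read_new_lines {
    my ($log, $state_file, $on_line) = @_;

    # Load the saved inode and byte offset, if any.
    my ($saved_inode, $offset) = (0, 0);
    if (open my $st, '<', $state_file) {
        my $saved = <$st>;
        ($saved_inode, $offset) = ($1, $2)
            if defined $saved && $saved =~ /^(\d+) (\d+)$/;
        close $st;
    }

    open my $fh, '<', $log or die "Cannot open $log: $!";
    my ($inode, $size) = (stat $fh)[1, 7];

    # A different inode or a shrunken file means the log was rotated
    # or truncated, so start again from the beginning.
    $offset = 0 if $inode != $saved_inode || $size < $offset;

    seek $fh, $offset, 0 or die "Cannot seek in $log: $!";
    while (my $line = <$fh>) {
        last unless $line =~ /\n\z/;   # leave a partial last line for next time
        $on_line->($line);
        $offset = tell $fh;
    }
    close $fh;

    # Record the new position only after the lines were handled.
    open my $st, '>', $state_file or die "Cannot write $state_file: $!";
    print {$st} "$inode $offset\n";
    close $st or die "Cannot close $state_file: $!";
    return $offset;
}
```

Calling this from a loop with a short sleep gives roughly the behaviour of the current script. If the watcher dies between handling a line and saving the state, that line is delivered again on restart, so the database insert would still need to tolerate duplicates.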

Your problem with reading a continually updated file is you want to keep reading, even after the end of file is reached. The solution to this is to re-seek to your current position in the file:
seek FILEHANDLE, 0, 1;
Here is my code for doing this sort of thing:
open(FILEHANDLE, '<', '/var/log/file') || die 'Could not open log file';
seek(FILEHANDLE, 0, 2) || die 'Could not seek to end of log file';
for (;;) {
    while (<FILEHANDLE>) {
        if ( $_ =~ /monitor status down/ ) {
            print "Gone down\n";
        }
    }
    sleep 1;
    seek FILEHANDLE, 0, 1; # clear eof
}

You should look into inotify (assuming you are on Linux, since inotify is a Linux kernel facility rather than part of POSIX) so you can run your Perl script whenever the logfile is updated. If this level of IO causes problems you could always keep the logfile on a RAM disk so IO is very fast.
This should help you set this up:
http://www.cyberciti.biz/faq/linux-inotify-examples-to-replicate-directories/

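You can read the file through an input pipe from tail. Note that the '-|' mode runs a command, so it needs tail in front of the path rather than the path alone.
open(my $file, '-|', 'tail', '-F', '/ftp/file/path') or die "Could not start tail: $!";
while (<$file>) {
    # you know the rest
}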
File::Tail does this, plus heuristic sleeping and nice error handling and recovery.
Edit: On second thought, a real system pipe is better if you can manage it. If not, you need to find the last thing you put in the database, and spin through the file until you get to the last thing you put in the database whenever your process starts. Not that easy to accomplish, and potentially impossible if you have no way of identifying where you left off.

Related

A daemon to tail a log and fork multiple external (perl) script

I'm trying to write a program, actually a daemon, which stays in memory and performs something like tail -F on a rapidly updated log file. When the program detects a new line in the file, it has to launch another compiled Perl script, which performs some operations on the log line and then sends it with a POST.
To explain clearly, I will refer to these two programs as "prgTAIL" and "prgPROCESS". So prgTAIL tails the log and launches prgPROCESS, passing the new line to it.
Obviously prgTAIL must not wait for prgPROCESS to finish, because prgTAIL has to stay in memory and keep detecting new lines in the log. Also, the rate at which the file is updated requires launching multiple parallel prgPROCESS instances. For this reason I'm using two programs: the first, small and fast, just passes the data to the second, which may be heavier because it can be launched in multiple instances.
On the prgTAIL I used:
a pipe to tail the log file
a while loop to launch prgPROCESS on new log line
a fork(); to continue without waiting prgPROCESS ends
my $log_csv = "/log/csv.csv";
open (my $pipe, "-|", "tail", "-n0", "-F", $log_csv) or die "error";
while (<$pipe>) {
    $line = $_ ;
    my $pid = fork();
    if (defined $pid && $pid == 0) {
        exec("/bin/prgPROCESS ".$line) ; # I tried system() too.
        exit 0;
    }
}
The prgPROCESS operations are not so important; anyway, it parses the $line passed as an argument, constructs an XML document and then posts it via HTTPS.
This actually runs, but I think I messed something up with the processes, because when the number of new lines and prgPROCESS calls reaches around 550, prgTAIL keeps running but can't call prgPROCESS anymore because there are too many processes. I get this error in bash:
-bash: fork: Resource temporarily unavailable
What's wrong? Any ideas? Maybe the prgPROCESS processes don't end and stay stuck without making room for other processes?
PS: I'm using Mac OS X now, but this will run on Linux.
Your problem is this:
while () {
doesn't have any constraint condition, so it's just spinning as fast as it can. You're never actually reading from your pipe, you're just forking as fast as you can and spawning that new script.
You might be wanting:
while ( my $line = <$pipe> ) {
#....
}
But really - it's arguable that you don't actually need to fork at all, because a read/process/read loop would probably do just fine - fork() and exec() is basically what system already does anyway.
You should also - if forking - clean up child processes. It doesn't matter too much for short running things, but things that sit in a loop will leave a lot of zombie processes. Either via setting $SIG{CHLD} or using waitpid.
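As a rough illustration of the clean-up step, here is a sketch that reaps finished children with a non-blocking waitpid. The names spawn and reap_children are made up for this example, and the command passed to spawn stands in for whatever prgPROCESS really is.

```perl
use strict;
use warnings;
use POSIX ':sys_wait_h';

# Reap every child that has already exited, without blocking.
# Returns the number of children collected.
sub reap_children {
    my $reaped = 0;
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $reaped++;
    }
    return $reaped;
}

# Fork a child that runs @cmd in list form (no shell involved)
# and return the child's PID to the parent.
sub spawn {
    my @cmd = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        exec { $cmd[0] } @cmd;
        warn "exec failed: $!";
        POSIX::_exit(127);      # leave the child without running parent cleanup
    }
    return $pid;
}
```

Inside the read loop this could look like spawn('/bin/prgPROCESS', $line); reap_children(); so zombies are collected as new lines arrive. Be aware that waitpid(-1, ...) collects any child, including the tail process started by the piped open if it ever exits.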

How to read, write, append the input file using perl

How to read and overwrite the same input file using perl?
My input:
GCGCCACTGCACTCCAGCCTGGGCGACAGAGC (873 TO 904) GCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGC (3033 TO 3064)
CAAAAAAAAAAAAAAAAAAA (917 TO 936) TTTTTTTTTTTTTTTTTTTG (2998 TO 3017)
AAAAAAAAAAAAAAAAAAAG (922 TO 941) CTTTTTTTTTTTTTTTTTTT (2997 TO 3016)
I tried the below code:
#!/usr/local/bin/perl
open($in,'<',"/home/httpd/cgi-bin/exa.txt") || die("error");
open($out,'>>',"/home/httpd/cgi-bin/exa.txt")||die("error");
while(<$in>)
{
    print $out;
}
close $in;
close $out;
Bear in mind what you're looking at doing here - you're opening a file for reading, reading it one line at a time.
What do you think is going to happen when you modify that file in the process?
There's also some constraints - Windows doesn't support concurrent opening for read/write anyway.
However take a look at open specifically:
You can put a + in front of the > or < to indicate that you want both read and write access to the file; thus +< is almost always preferred for read/write updates--the +> mode would clobber the file first. You can't usually use either read-write mode for updating textfiles, since they have variable-length records. See the -i switch in perlrun for a better approach. The file is created with permissions of 0666 modified by the process's umask value.
What I would suggest instead though - don't read and write from the same file at all. Rename one, execute your process, verify that it worked properly, and then tidy up afterwards.
That way a partial success won't mean corrupted data.
You can use the -i flag - see perlrun - this allows you to in place edit as you might be used to with sed. (Can be used within program via $^I - see perlvar )
There's a couple of constraints on doing this though - specifically it only works if you're using the while ( <> ) { construct. Practically speaking, I think this wouldn't be a good choice outside more simplistic programs - it's doing something implicitly, so might not be entirely clear to future readers, and it's doing essentially the same thing as opening and renaming anyway.
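The "don't read and write the same file" suggestion could be sketched like this. It is one possible arrangement under assumptions, not the only way to do it: rewrite_file and the $transform callback are hypothetical names, and the temporary file is created next to the original so that the final rename stays on one filesystem.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use File::Basename qw(dirname);

# Rewrite $path line by line through $transform, writing to a temporary
# file in the same directory and renaming it over the original at the end.
sub rewrite_file {
    my ($path, $transform) = @_;

    open my $in, '<', $path or die "Cannot read $path: $!";
    # Same directory, so the final rename stays on one filesystem.
    my ($out, $tmp) = tempfile('rewriteXXXXXX', DIR => dirname($path));

    while (my $line = <$in>) {
        print {$out} $transform->($line) or die "Write to $tmp failed: $!";
    }
    close $in;
    close $out or die "Cannot close $tmp: $!";

    # The original is replaced only after every write has succeeded.
    rename $tmp, $path or die "Cannot rename $tmp to $path: $!";
    return 1;
}
```

If anything dies before the rename, the original file is untouched and only the temporary file is left behind. Note that File::Temp creates the file with restrictive permissions, so a chmod may be wanted before the rename.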

Perl - can't flush STDOUT or STDERR

Perl 5.14 from stock Ubuntu Precise repos. Trying to write a simple wrapper to monitor progress on copying from one stream to another:
use IO::Handle;
while ($bufsize = read (SOURCE, $buffer, 1048576)) {
    STDERR->printflush ("Transferred $xferred of $sendsize bytes\n");
    $xferred += $bufsize;
    print TARGET $buffer;
}
This does not perform as expected (writing a line each time the 1M buffer is read). I end up seeing the first line (with a blank value of $xferred), and then nothing until everything flushes on the 7th and 8th lines (on an 8MB transfer). Been pounding my brains out on this for hours - I've read the perldocs, I've read the classic "Suffering from Buffering" article, I've tried everything from select and $|++ to IO::Handle to binmode (STDERR, ":unix") to you name it. I've also tried flushing TARGET with each line using IO::Handle (TARGET->flush). No dice.
Has anybody else ever encountered this? I don't have any ideas left. Sleeping one second "fixes" the problem, but obviously I don't want to sleep a second every time I read a buffer just so my progress will output on the screen!
FWIW, the problem is exactly the same whether I'm outputting to STDERR or STDOUT.
Perl read calls fread(3), not read(2).
This means that it goes through libc and may be using an internal buffer larger than yours; i.e., it gets all the data there is to be received and then quickly throws it at you in 1MB increments.
If this conjecture is correct, the solution might be to use sysread, which calls read(2), instead of read.
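To try that out, a copy loop built on sysread and syswrite might look roughly like the sketch below. copy_with_progress and its callback are names invented for this example. Whether it fixes the symptom depends on the conjecture above being right.

```perl
use strict;
use warnings;

# Copy everything from $src to $dst using unbuffered sysread/syswrite,
# calling $progress->($bytes_so_far) after each chunk is written.
sub copy_with_progress {
    my ($src, $dst, $progress, $chunk_size) = @_;
    $chunk_size ||= 1_048_576;
    my $total = 0;

    while (1) {
        my $got = sysread($src, my $buffer, $chunk_size);
        die "read failed: $!" unless defined $got;
        last if $got == 0;                      # end of input

        # syswrite may write less than asked, so loop until the chunk is out.
        my $written = 0;
        while ($written < $got) {
            my $n = syswrite($dst, $buffer, $got - $written, $written);
            die "write failed: $!" unless defined $n;
            $written += $n;
        }

        $total += $got;
        $progress->($total);
    }
    return $total;
}
```

The callback could simply print "Transferred $_[0] of $sendsize bytes\n" to STDERR. Do not mix sysread/syswrite with read or print on the same handle, since the buffered and unbuffered calls do not share a buffer.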

How do I flush a file in Perl?

I have Perl script which appends a new line to the existing file every 3 seconds. Also, there is a C++ application which reads from that file.
The problem is that the application begins to read the file after the script is done and file handle is closed. To avoid this I want to flush after each line append. How can I do that?
Try:
use IO::Handle;
$fh->autoflush;
This was actually posted as a way of auto-flushing in an early question of mine, which asked about the universally accepted bad way of achieving this :-)
TL/DR: use IO::Handle and the flush method, eg:
use IO::Handle;
$myfile->flush();
First, you need to decide how "flushed" you want it. There can be quite a few layers of buffering:
Perl's internal buffer on the file handle. Other programs can't see data until it's left this buffer.
File-system level buffering of "dirty" file blocks. Other programs can still see these changes, they seem "written", but they'll be lost if the OS or machine crashes.
Disk-level write-back buffering of writes. The OS thinks these are written to disk, but the disk is actually just storing them in volatile memory on the drive. If the OS crashes the data won't be lost, but if power fails it might be unless the disk can write it out first. This is a big problem with cheap consumer SSDs.
It gets even more complicated when SANs, remote file systems, RAID controllers, etc get involved. If you're writing via pipes there's also the pipe buffer to consider.
If you just want to flush the Perl buffer, you can close the file or use IO::Handle's flush method. Printing a string containing "\n" only flushes when the handle is line-buffered, which Perl normally reserves for output going to a terminal, so don't rely on it for regular files.
You can also, per the perl faq use binmode or play with $| to make the file handle unbuffered. This is not the same thing as flushing a buffered handle, since queuing up a bunch of buffered writes then doing a single flush has a much lower performance cost than writing to an unbuffered handle.
If you want to flush the file system write back buffer you need to use a system call like fsync(), open your file in O_DSYNC mode, or use one of the numerous other options. It's painfully complicated, as evidenced by the fact that PostgreSQL has its own tool just to test file syncing methods.
If you want to make sure it's really, truly, honestly on the hard drive in permanent storage you must flush it to the file system in your program. You also need to configure the hard drive/SSD/RAID controller/SAN/whatever to really flush when the OS asks it to. This can be surprisingly complicated to do and is quite OS/hardware specific. "plug-pull" testing is strongly recommended to make sure you've really got it right.
From 'man perlfaq5':
$old_fh = select(OUTPUT_HANDLE);
$| = 1;
select($old_fh);
If you just want to flush stdout, you can probably just do:
$| = 1;
But check the FAQ for details on a module that gives you a nicer-to-use abstraction, like IO::Handle.
Here's the answer - the real answer.
Stop maintaining an open file handle for this file for the life of the process.
Start abstracting your file-append operation into a sub that opens the file in append mode, writes to it, and closes it.
# Appends a new line to the existing file
sub append_new_line{
    my $linedata = shift;
    open my $fh, '>>', $fnm or die $!; # $fnm is file-lexical or something
    print $fh $linedata,"\n"; # Flavor to taste
    close $fh;
}
The process observing the file will encounter a closed file that gets modified whenever the function is called.
All of the solutions suggesting setting autoflush are ignoring the basic fact that most modern OSes buffer file I/O irrespective of what Perl is doing.
Your only possibility to force the commitment of the data to disk is by closing the file.
I'm trapped with the same dilemma atm where we have an issue with rotation of the log being written.
To automatically flush the output, you can set autoflush/$| as described by others before you output to the filehandle.
If you've already output to the filehandle and need to ensure that it gets to the physical file, you need to use the IO::Handle flush and sync methods.
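A small sketch of those two calls together, assuming $fh is an ordinary file opened for appending (append_durably is just a name chosen for this example):

```perl
use strict;
use warnings;
use IO::Handle;

# Append one line, push it out of Perl's buffer (flush), and then
# ask the OS to commit it to the storage device (sync, i.e. fsync).
sub append_durably {
    my ($fh, $line) = @_;
    print {$fh} $line, "\n" or die "print failed: $!";
    $fh->flush or die "flush failed: $!";
    $fh->sync  or die "sync failed: $!";
    return 1;
}
```

For the reader in the question, the flush alone should be enough to make the new line visible to another process. The sync only matters if the data must survive a crash, and it is considerably slower.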
There is an article about this in PerlDoc: How do I flush/unbuffer an output filehandle? Why must I do this?
Two solutions:
Unbuffer the output filehandler with : $|
Call the autoflush method if you are using IO::Handle or one of its subclasses.
An alternative approach would be to use a named pipe between your Perl script and C++ program, in lieu of the file you're currently using.
For those who are searching a solution to flush output line by line to a file in Ansys CFD Post using a Session File (*.cse), this is the only solution that worked for me:
! $file="Test.csv";
! open(OUT,"+>>$file");
! select(OUT);$|=1; # This is the important line
! for($i=0;$i<=10;$i++)
! {
!     print OUT "$i\n";
!     sleep(3);
! }
Note that you need an exclamation mark at the beginning of every line that contains Perl script. sleep(3); is only there for demonstration purposes. use IO::Handle; is not needed.
The genuine correct answer is to use:-
$|=1; # Make STDOUT immediate (non-buffered)
and although that is one cause of your problem, the other cause of the same problem is this: "Also, there is a C++ application which reads from that file."
It is EXTREMELY NON-TRIVIAL to write C++ code which can properly read from a file that is growing, because your "C++" program will encounter an EOF when it gets to the end... (you cannot read past the end of a file without serious extra trickery) - you have to do a pile of complicated stuff with IO blocking and flags to properly monitor a file this way (like how the linux "tail" command works).
I had the same problem with the only difference of writing the same file over and over again with new content. This association of "$| = 1" and autoflush worked for me:
open (MYFILE, '>', '/internet/web-sites/trot/templates/xml_queries/test.xml');
$| = 1; # Before writing!
print MYFILE "$thisCardReadingContentTemplate\n\n";
close (MYFILE);
MYFILE->autoflush(1); # After writing!
Best of luck.
H

What's the best way to make sure only one instance of a Perl program is running?

There are several ways to do this, but I'm not sure which one of them is the best.
Here's what I can think of:
Look for the process using pgrep.
Have the script lock itself using flock, and then check if it is locked each time it runs.
Create a pid file in /var/run/program_name.pid and check for existence, and compare pids if needed.
There are probably more ways to do this. What do you think is the best approach?
There are many ways to do it. PID files are the traditional way to do it. You could also hold a lock on a file, for example the program itself. This small piece of code will do the trick:
use Fcntl ':flock';
open my $self, '<', $0 or die "Couldn't open self: $!";
flock $self, LOCK_EX | LOCK_NB or die "This script is already running";
One advantage over PID files is that files automatically get unlocked when the program exits. It's much easier to implement in a reliable way.
Do the old PID file trick.
start process
see if there is a file called "myprog.PID"
check for existence of running proc. with matching PID using kill 0, $pid
if prog name of PID proc. matches, complain loudly and exit
if not, clean up stale "myprog.PID"
create a file called "myprog.PID" and then continue
HTH
cheers,
Rob
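The PID file steps above might be sketched as follows. This is a simplified version under stated assumptions: claim_pidfile is a made-up name, it skips the "does the program name match" check, and there is a small race between checking and writing the file that the flock approach does not have.

```perl
use strict;
use warnings;

# Try to claim $pidfile for this process. Returns 1 on success, or 0 if
# the PID recorded in the file belongs to a process that is still alive.
sub claim_pidfile {
    my ($pidfile) = @_;

    if (open my $in, '<', $pidfile) {
        my $line = <$in>;
        close $in;
        my ($pid) = defined $line ? $line =~ /^(\d+)/ : ();
        # kill 0 sends no signal; it only reports whether the PID exists.
        # EPERM means the process exists but belongs to another user.
        return 0 if $pid && (kill(0, $pid) || $!{EPERM});
        unlink $pidfile;                        # stale file, clean it up
    }

    open my $out, '>', $pidfile or die "Cannot write $pidfile: $!";
    print {$out} "$$\n";
    close $out or die "Cannot close $pidfile: $!";
    return 1;
}
```

A script would call something like claim_pidfile('myprog.PID') or die "already running"; near the top, and remove the file on a clean exit.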
All of the options that you list are fine. One thing with this though, is to be aware that in rare cases, you can end up with a process that runs for a very long time (i.e., stuck waiting on something). You might want to think about keeping an eye on how long the other running instance has been running and possibly send yourself an alert if it exceeds a certain amount of time (such as a day perhaps).