How do I read a file which is constantly updating? - perl

I am getting a stream of data (text format) from an external server and like to pass it on to a script line-by-line. The file is getting appended in a continuous manner. Which is the ideal method to perform this operation. Is IO::Socket method using Perl will do? Eventually this data has to pass through a PHP program (reusable) and eventually land onto a MySQL database.
The question is how to open the file, which is continuously getting updated?

In Perl, you can make use of seek and tell to read from a continuously growing file. It might look something like this (borrowed liberally from perldoc -f seek)
open(FH,'<',$the_file) || handle_error(); # typical open call
for (;;) {
while (<FH>) {
# ... process $_ and do something with it ...
}
# eof reached on FH, but wait a second and maybe there will be more output
sleep 1;
seek FH, 0, 1; # this clears the eof flag on FH
}

In perl there are a couple of modules that make tailing a file easier. IO::Tail and
File::Tail one uses a callback the other uses a blocking read so it just depends on which suits your needs better. There are likely other tailing modules as well but these are the two that came to mind.
IO::Tail - follow the tail of files/stream
use IO::Tail;
my $tail = IO::Tail->new();
$tail->add('test.log', \&callback);
$tail->check();
$tail->loop();
File::Tail - Perl extension for reading from continously updated files
use File::Tail;
my $file = File::Tail->new("/some/log/file");
while (defined(my $line= $file->read)) {
print $line;
}

Perhaps a named pipe would help you?

You talk about opening a file, and ask about IO::Socket. These aren't quite the same things, even if deep down you're going to be reading data of a file descriptor.
If you can access the remote stream from a named pipe or FIFO, then you can just open it as an ordinary file. It will block when nothing is available, and return whenever there is data that needs to be drained. You may, or may not, need to bring File::Tail to bear on the problem of not losing data if the sender runs too far ahead of you.
On the other hand, if you're opening a socket directly to the other server (which seems more likely), IO::Socket is not going to work out of the box as there is no getline method available. You would have to read and buffer block-by-block and then dole it out line by line through an intermediate holding pen.
You could pull out the socket descriptor into an IO::Handle, and use getline() on that. Something like:
my $sock = IO::Socket::INET->new(
PeerAddr => '172.0.0.1',
PeerPort => 1337,
Proto => 'tcp'
) or die $!;
my $io = new IO::Handle;
$io->fdopen(fileno($sock),"r") or die $!;
while (defined( my $data = $io->getline() )) {
chomp $data;
# do something
}
You may have to perform a handshake in order to start receiving packets, but that's another matter.

In python it is pretty straight-forward:
f = open('teste.txt', 'r')
for line in f: # read all lines already in the file
print line.strip()
# keep waiting forever for more lines.
while True:
line = f.readline() # just read more
if line: # if you got something...
print 'got data:', line.strip()
time.sleep(1) # wait a second to not fry the CPU needlessy

The solutions to read the whole fine to seek to the end are perfomance-unwise. If that happens under Linux, I would suggest just to rename the log file. Then, you can scan all the entites in the renamed file, while those in original file will be filled again. After scanning all the renamed file - delete it. Or move whereever you like. This way you get something like logrotate but for scanning newly arriving data.

Related

How to pipe to and read from the same tempfile handle without race conditions?

Was debugging a perl script for the first time in my life and came over this:
$my_temp_file = File::Temp->tmpnam();
system("cmd $blah | cmd2 > $my_temp_file");
open(FIL, "$my_temp_file");
...
unlink $my_temp_file;
This works pretty much like I want, except the obvious race conditions in lines 1-3. Even if using proper tempfile() there is no way (I can think of) to ensure that the file streamed to at line 2 is the same opened at line 3. One solution might be pipes, but the errors during cmd might occur late because of limited pipe buffering, and that would complicate my error handling (I think).
How do I:
Write all output from cmd $blah | cmd2 into a tempfile opened file handle?
Read the output without re-opening the file (risking race condition)?
You can open a pipe to a command and read its contents directly with no intermediate file:
open my $fh, '-|', 'cmd', $blah;
while( <$fh> ) {
...
}
With short output, backticks might do the job, although in this case you have to be more careful to scrub the inputs so they aren't misinterpreted by the shell:
my $output = `cmd $blah`;
There are various modules on CPAN that handle this sort of thing, too.
Some comments on temporary files
The comments mentioned race conditions, so I thought I'd write a few things for those wondering what people are talking about.
In the original code, Andreas uses File::Temp, a module from the Perl Standard Library. However, they use the tmpnam POSIX-like call, which has this caveat in the docs:
Implementations of mktemp(), tmpnam(), and tempnam() are provided, but should be used with caution since they return only a filename that was valid when function was called, so cannot guarantee that the file will not exist by the time the caller opens the filename.
This is discouraged and was removed for Perl v5.22's POSIX.
That is, you get back the name of a file that does not exist yet. After you get the name, you don't know if that filename was made by another program. And, that unlink later can cause problems for one of the programs.
The "race condition" comes in when two programs that probably don't know about each other try to do the same thing as roughly the same time. Your program tries to make a temporary file named "foo", and so does some other program. They both might see at the same time that a file named "foo" does not exist, then try to create it. They both might succeed, and as they both write to it, they might interleave or overwrite the other's output. Then, one of those programs think it is done and calls unlink. Now the other program wonders what happened.
In the malicious exploit case, some bad actor knows a temporary file will show up, so it recognizes a new file and gets in there to read or write data.
But this can also happen within the same program. Two or more versions of the same program run at the same time and try to do the same thing. With randomized filenames, it is probably exceedingly rare that two running programs will choose the same name at the same time. However, we don't care how rare something is; we care how devastating the consequences are should it happen. And, rare is much more frequent than never.
File::Temp
Knowing all that, File::Temp handles the details of ensuring that you get a filehandle:
my( $fh, $name ) = File::Temp->tempfile;
This uses a default template to create the name. When the filehandle goes out of scope, File::Temp also cleans up the mess.
{
my( $fh, $name ) = File::Temp->tempfile;
print $fh ...;
...;
} # file cleaned up
Some systems might automatically clean up temp files, although I haven't care about that in years. Typically is was a batch thing (say once a week).
I often go one step further by giving my temporary filenames a template, where the Xs are literal characters the module recognizes and fills in with randomized characters:
my( $name, $fh ) = File::Temp->tempfile(
sprintf "$0-%d-XXXXXX", time );
I'm often doing this while I'm developing things so I can watch the program make the files (and in which order) and see what's in them. In production I probably want to obscure the source program name ($0) and the time; I don't want to make it easier to guess who's making which file.
A scratchpad
I can also open a temporary file with open by not giving it a filename. This is useful when you want to collect outside the program. Opening it read-write means you can output some stuff then move around that file (we show a fixed-length record example in Learning Perl):
open(my $tmp, "+>", undef) or die ...
print $tmp "Some stuff\n";
seek $tmp, 0, 0;
my $line = <$tmp>;
File::Temp opens the temp file in O_RDWR mode so all you have to do is use that one file handle for both reading and writing, even from external programs. The returned file handle is overloaded so that it stringifies to the temp file name so you can pass that to the external program. If that is dangerous for your purpose you can get the fileno() and redirect to /dev/fd/<fileno> instead.
All you have to do is mind your seeks and tells. :-) Just remember to always set autoflush!
use File::Temp;
use Data::Dump;
$fh = File::Temp->new;
$fh->autoflush;
system "ls /tmp/*.txt >> $fh" and die $!;
#lines = <$fh>;
printf "%s\n\n", Data::Dump::pp(\#lines);
print $fh "How now brown cow\n";
seek $fh, 0, 0 or die $!;
#lines2 = <$fh>;
printf "%s\n", Data::Dump::pp(\#lines2);
Which prints
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
]
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
"How now brown cow\n",
]
HTH

Code hangs when writing log message to text file by multiple source using Perl Script

I am using below code to write log message to text file, the program is getting hanged when different source calls this method in parallel. Is there a way to grant /control parallel writing without breaking the program.
sub sLog {
my $self = 'currentServerDirectory';
my $logMsg = "";
my $fileName = join '_', $self->{LogFilePrefix}, &sTimeStamp("Log");
my $absFileName = "$self->{LogFolder}/$fileName.txt";
open APPHDLER, ">>$absFileName" or &exitErr("Cannot append message to file, $absFileName");
print APPHDLER scalar(localtime(time))." - $logMsg\n";
close APPHDLER;
}
Try using flock -- here is a simple example you can try to understand its behavior:
use strict;
use warnings;
use Fcntl qw/:flock/;
chomp(my $id = `date`);
my $cnt = 0;
while (1) {
open LOG, ">>shared" or die "Couldn't open shared";
flock (LOG, LOCK_EX) or die "Couldn't flock";
print LOG "$id: $cnt\n";
$cnt++;
sleep 1;
close LOG;
}
Say this is saved in flock.pl, then you can run flock.pl& to run one or more instances in the background. Then do tail -f shared to see what happens. Since you are sleeping 1 second between obtaining the lock and releasing it via close LOG , you'll see an update once a second if you have one process. However, if you have N processes, you'll see each one taking N seconds.
In your existing example, you can try adding the use Fcntl and flock lines.
When you open a file for writing, a lock on that file is granted to the process that opened it. This is done to prevent corruption of data, by processes overwriting each other. Possible solutions would be to feed the output data to a single process that handles writing to the log file, making sure that processes close the file and release their locks when they are finished writing, or using a library or file format that is designed for parallel access of files. The first two of those methods would be the easiest and preferred for writing log files like this. There are also probably perl modules (check CPAN) that handle log files.

Perl: Pass one byte plus STDIN to another command

I would like to do this efficiently:
my $buf;
my $len = read(STDIN,$buf,1);
if($len) {
# Not empty
open(OUT,"|-", "wc") || die;
print OUT $buf;
# This is the line I want to do faster
print OUT <STDIN>;
exit;
}
The task is to start wc only if there is any input. If there is no input, the program should just exit.
wc is just an example here. It will be substituted with a much more complex command.
The input can be of several TB of data, so I would really like to not touch that data at all (not even with a sysread). I tried doing:
pipe(STDIN,OUT);
But that doesn't work. Is there some other way that I can tell OUT that after it has gotten the first byte, it should just read from STDIN? Maybe some open(">=&2") gymnastics combined with exec?
The FIONREAD ioctl, mentioned in the Perl Cookbook, can tell you how many bytes are pending on a file descriptor without consuming them. In perlish terms:
use strict;
use warnings;
use IO::Select qw( );
BEGIN { require 'sys/ioctl.ph'; }
sub fionread {
my $sz = pack('L', 0);
return unless ioctl($_[0], FIONREAD, $sz);
return unpack('L', $sz);
}
# Wait until it's known whether the handle has data to read or has reached EOF.
IO::Select->new(\*STDIN)->can_read();
if (fionread(\*STDIN)) {
system('wc');
# Check for errors
}
This should be very widely portable to UNIX and UNIX-like platforms.
A child process is always given duplicates of its parent's file handles, so simply starting wc - either with backticks or with a call to system or exec - will cause it to read from the same place as the Perl process's STDIN.
As for starting wc only when there is something to read, it looks like you need IO::Select, which will allow you either to check whether a file handle has something to read, or to block until it does have something.
This program will check whether STDIN has any data waiting, and run wc and print its output if so.
use strict;
use warnings;
use IO::Select;
my $select = IO::Select->new(\*STDIN);
if ( $select->can_read(0) ) {
print `wc`;
}
The parameter to can_read is a timeout in seconds. Passing a value of zero makes it return immediately, reporting true (actually it returns the file handle itself) if there is data waiting, or false (undef) if not.
If you don't pass a parameter then can_read will wait forever until there is something to read, so you can suspend your program and wait for data for wc by writing just
$select->can_read;
print `wc`;
or you could combine the construction of the object to make it even more concise
IO::Select->new(\*STDOUT)->can_read;
print `wc`;
Note also that IO::Select works fine with file descriptors too, and as the fileno for STDIN is zero, you could write
my $select = IO::Select(0)
but that isn't very descriptive and would need a comment to make sense
The specific solution in which you're interested is impossible.
As you surely discovered already, you can't determine if a file handle has reached EOF without reading from it. [Apparently, you can] select(2) will get you close. It will tell you that a handle has reached EOF or has data waiting, but it won't tell you which. This is why you're looking into alternate solutions. Unfortunately, the one you're looking into is just as impossible.
Is there some other way that I can tell OUT that after it has gotten the first byte, it should just read from STDIN?
No. OUT isn't code; it doesn't read anything. It's a variable. Furthermore, it's a variable in the parent. Changing a variable in the parent isn't going to affect the child.
Maybe you meant to ask: Can one tell the child program to start reading from a second handle?
No, generally speaking. You can't go and edit another program's variables. The program would have to be specifically written to accept two file handles and read from one after the other.
Then again, it's possible to obtain a file name for an arbitrary file handle, so all we need is a program that is specifically written to accept two file names and read from one after the other, and that's quite common.
$ echo abcdef | perl -MFcntl -e'
if (sysread(STDIN, $buf, 1)) {
pipe(my $r, my $w);
my $pid = fork();
if (!$pid) {
close($w);
# Clear close-on-exec flag.
my $flags = fcntl($r, Fcntl::F_GETFD, 0);
fcntl($r, Fcntl::F_SETFD, $flags & ~Fcntl::FD_CLOEXEC);
exec("cat", "/proc/$$/fd/".fileno($r), "/proc/$$/fd/".fileno(STDIN));
die $!;
}
close($r);
print($w $buf);
close($w);
waitpid($pid, 0);
}
'
abcdef
(Lots of error checking needed.)
Above, cat was used an example where your program would be used, but that presents another solution: Why not just use cat? The overhead of cat should be quite minor for an IO-bound program.
use String::ShellQuote qw( shell_quote );
my $cmd1 = shell_quote("cat", "/proc/$$/fd/".fileno($r), "/proc/$$/fd/".fileno(STDIN));
my $cmd2 = ...
exec("$cmd1 | $cmd2");

perl File::Tail not reading lines from file after certain period

I am having issues understanding why File::Tail FAILS to read the lines from an always updating file with 1000s of transactions, and automatic rollover .
It read to a certain extend correctly, but then later slows down, and for a long period even not able to read the lines in the logs. I can confirm that the actual log are being populated as when file::tail shows nothing.
my $file = File::Tail->new(name=>"$name",
tail => 1000,
maxinterval=>1,
interval=>1,
adjustafter=>5,resetafter=>1,
ignore_nonexistant=>1,
maxbuf=>32768);
while (defined(my $line=$file->read)) {
#my $line=$file->read;
my $xml_string = "";
#### Read only one event per line and deal with the XML.
#### If the log entry is not a SLIM log, I will ignore it.
if ($line =~ /(\<slim\-log\>.*\<\/slim\-log\>)/) {
do_something
} else {
not_working for some reason
}
Can someone please help me understand this. Know that this log file is updated at almost 10MB per second or 1000 events per second for an approximation.
Should I be handing the filehandle or the File::Tail results some other more efficient way?
Seems like there's limitations in File::Tail. There's some suggestions around other more direct options (a pipe, a fork, a thread, seeking to the end of the file in perl) discussed in http://www.perlmonks.org/?node_id=387706.
My favorite pick is the blocking read from a pipe:
open(TAIL, "tail -F $name|") or die "TAIL : $!";
while (<TAIL>) {
test_and_do_something
}

How do I flush a file in Perl?

I have Perl script which appends a new line to the existing file every 3 seconds. Also, there is a C++ application which reads from that file.
The problem is that the application begins to read the file after the script is done and file handle is closed. To avoid this I want to flush after each line append. How can I do that?
Try:
use IO::Handle;
$fh->autoflush;
This was actually posted as a way of auto-flushing in an early question of mine, which asked about the universally accepted bad way of achieving this :-)
TL/DR: use IO::Handle and the flush method, eg:
use IO::Handle;
$myfile->flush();
First, you need to decide how "flushed" you want it. There can be quite a few layers of buffering:
Perl's internal buffer on the file handle. Other programs can't see data until it's left this buffer.
File-system level buffering of "dirty" file blocks. Other programs can still see these changes, they seem "written", but they'll be lost if the OS or machine crashes.
Disk-level write-back buffering of writes. The OS thinks these are written to disk, but the disk is actually just storing them in volatile memory on the drive. If the OS crashes the data won't be lost, but if power fails it might be unless the disk can write it out first. This is a big problem with cheap consumer SSDs.
It gets even more complicated when SANs, remote file systems, RAID controllers, etc get involved. If you're writing via pipes there's also the pipe buffer to consider.
If you just want to flush the Perl buffer, you can close the file, print a string containing "\n" (since it appears that Perl flushes on newlines), or use IO::Handle's flush method.
You can also, per the perl faq use binmode or play with $| to make the file handle unbuffered. This is not the same thing as flushing a buffered handle, since queuing up a bunch of buffered writes then doing a single flush has a much lower performance cost than writing to an unbuffered handle.
If you want to flush the file system write back buffer you need to use a system call like fsync(), open your file in O_DATASYNC mode, or use one of the numerous other options. It's painfully complicated, as evidenced by the fact that PostgreSQL has its own tool just to test file syncing methods.
If you want to make sure it's really, truly, honestly on the hard drive in permanent storage you must flush it to the file system in your program. You also need to configure the hard drive/SSD/RAID controller/SAN/whatever to really flush when the OS asks it to. This can be surprisingly complicated to do and is quite OS/hardware specific. "plug-pull" testing is strongly recommended to make sure you've really got it right.
From 'man perlfaq5':
$old_fh = select(OUTPUT_HANDLE);
$| = 1;
select($old_fh);
If you just want to flush stdout, you can probably just do:
$| = 1;
But check the FAQ for details on a module that gives you a nicer-to-use abstraction, like IO::Handle.
Here's the answer - the real answer.
Stop maintaining an open file handle for this file for the life of the process.
Start abstracting your file-append operation into a sub that opens the file in append mode, writes to it, and closes it.
# Appends a new line to the existing file
sub append_new_line{
my $linedata = shift;
open my $fh, '>>', $fnm or die $!; # $fnm is file-lexical or something
print $fh $linedata,"\n"; # Flavor to taste
close $fh;
}
The process observing the file will encounter a closed file that gets modified whenever the function is called.
All of the solutions suggesting setting autoflush are ignoring the basic fact that most modern OS's are buffering file I/O irrespective of what Perl is doing.
You only possibility to force the commitment of the data to disk is by closing the file.
I'm trapped with the same dilemma atm where we have an issue with rotation of the log being written.
To automatically flush the output, you can set autoflush/$| as described by others before you output to the filehandle.
If you've already output to the filehandle and need to ensure that it gets to the physical file, you need to use the IO::Handle flush and sync methods.
There an article about this in PerlDoc: How do I flush/unbuffer an output filehandle? Why must I do this?
Two solutions:
Unbuffer the output filehandler with : $|
Call the autoflush method if you are using IO::Handle or one of its subclasses.
An alternative approach would be to use a named pipe between your Perl script and C++ program, in lieu of the file you're currently using.
For those who are searching a solution to flush output line by line to a file in Ansys CFD Post using a Session File (*.cse), this is the only solution that worked for me:
! $file="Test.csv";
! open(OUT,"+>>$file");
! select(OUT);$|=1; # This is the important line
! for($i=0;$i<=10;$i++)
! {
! print out "$i\n";
! sleep(3);
! }
Note that you need the exclamation marks at every begin of every line that contains Perl script. sleep(3); is only applied for demonstration reasons. use IO::Handle; is not needed.
The genuine correct answer is to use:-
$|=1; # Make STDOUT immediate (non-buffered)
and although that is one cause of your problem, the other cause of the same problem is this: "Also, there is a C++ application which reads from that file."
It is EXTREMELY NON-TRIVIAL to write C++ code which can properly read from a file that is growing, because your "C++" program will encounter an EOF when it gets to the end... (you cannot read past the end of a file without serious extra trickery) - you have to do a pile of complicated stuff with IO blocking and flags to properly monitor a file this way (like how the linux "tail" command works).
I had the same problem with the only difference of writing the same file over and over again with new content. This association of "$| = 1" and autoflush worked for me:
open (MYFILE, '>', '/internet/web-sites/trot/templates/xml_queries/test.xml');
$| = 1; # Before writing!
print MYFILE "$thisCardReadingContentTemplate\n\n";
close (MYFILE);
MYFILE->autoflush(1); # After writing!
Best of luck.
H