Perl STDIN without buffering or line buffering - perl

I have a Perl script that received input piped from another program. It's buffering with an 8k (Ubuntu default) input buffer, which is causing problems. I'd like to use line buffering or disable buffering completely. It doesn't look like there's a good way to do this. Any suggestions?
use IO::Handle;
use IO::Poll qw[ POLLIN POLLHUP POLLERR ];
use Text::CSV;
my $stdin = new IO::Handle;
$stdin->fdopen(fileno(STDIN), 'r');
$stdin->setbuf(undef);
my $poll = IO::Poll->new() or die "cannot create IO::Poll object";
$poll->mask($stdin => POLLIN);
STDIN->blocking(0);
my $halt = 0;
for(;;) {
$poll->poll($config{poll_timout});
for my $handle ($poll->handles(POLLIN | POLLHUP | POLLERR)) {
next unless($handle eq $stdin);
if(eof) {
$halt = 1;
last;
}
my #row = $csv->getline($stdin);
# Do more stuff here
}
last if($halt);
}
Polling STDIN kind of throws a wrench into things since IO::Poll uses buffering and direct calls like sysread do not (and they can't mix). I don't want to infinitely call sysread without no blocking. I require the use of select or poll since I don't want to hammer the CPU.
PLEASE NOTE: I'm talking about STDIN, NOT STDOUT. $|++ is not the solution.
[EDIT]
Updating my question to clarify based on the comments and other answers.
The program that is writing to STDOUT (on the other side of the pipe) is line buffered and flushed after every write. Every write contains a newline, so in effect, buffering is not an issue for STDOUT of the first program.
To verify this is true, I wrote a small C program that reads piped input from the same program with STDIN buffering disabled (setvbuf with _IONBF). The input appears in STDIN of the test program immediately. Sadly, it does not appear to be an issue with the output from the first program.
[/EDIT]
Thanks for any insight!
PS. I have done a fair amount of Googling. This link is the closest I've found to an answer, but it certainly doesn't satisfy all my needs.

Say there are two short lines in the pipe's buffer.
IO::Poll notifies you there's data to read, which you proceed to read (indirectly) using readline.
Reading one character at a time from a file handle is very inefficient. As such, readline (aka <>) reads a block of data from the file handle at a time. The two lines ends up in a buffer and the first of the two lines is returned.
Then you wait for IO::Poll to notify you that there is more data. It doesn't know about Perl's buffer; it just knows the pipe is empty. As such, it blocks.
This post demonstrates the problem. It uses IO::Select, but the principle (and solution) is the same.

You're actually talking about the other program's STDOUT. The solution is $|=1; (or equivalent) in the other program.
If you can't, you might be able to convince the other program use line-buffering instead of block buffering by connecting its STDOUT to a pseudo-tty instead of a pipe (like Expect.pm does, for example).
The unix program expect has a tool called unbuffer which does that exactly that. (It's part of the expect-dev package on Ubuntu.) Just prefix the command name with unbuffer.

Related

Line buffered reading in Perl

I have a perl script, say "process_output.pl" which is used in the following context:
long_running_command | "process_output.pl"
The process_output script, needs to be like the unix "tee" command, which dumps output of "long_running_command" to the terminal as it gets generated, and in addition captures output to a text file, and at the end of "long_running_command", forks another process with the text file as an input.
The behavior I am currently seeing is that, the output of "long_running_command" gets dumped to the terminal, only when it gets completed instead of, dumping output as it gets generated. Do I need to do something special to fix this?
Based on my reading in a few other stackexchange posts, i tried the following in "process_output.pl", without much help:
select(STDOUT); $| =1;
select(STDIN); $| =1; # Not sure even if this is needed
use FileHandle; STDOUT->autoflush(1);
stdbuf -oL -eL long_running_command | "process_output.pl"
Any pointers on how to proceed further.
Thanks
AB
This is more likely an issue with the output of the first process being buffered, rather than the input of your script. The easiest solution would be to try using the unbuffer command (I believe it's part of the expect package), something like
unbuffer long_running_command | "process_output.pl"
The unbuffer command will disable the buffering that happens normally when output is directed to a non-interactive place.
This will be the output processing of long_running_processing. More than likely it is using stdio - which will look to see what the output file descriptor is connected to before it does outputing. If it is a terminal (tty), then it will generally output line based, but in the above case - it will notice it is writing to a pipe and will therefore buffer the output into larger chunks.
You can control the buffering in your own process by using, as you showed
select(STDOUT); $| =1;
This means that things that your process prints to STDIO, are not buffered - it makes no sense doing this for input, as you control how much buffering is done - if you use sysread() then you are reading unbuffered, if you use a construct like <$fh> then perl will await until it has a "whole line" (it actually reads up to the next input line separator (as defined in variable $/ which is newline by default)) before it returns data to you.
unbuffer can be used to "disable" the output buffering, what it actually does is make the outputing process think that it is talking to a tty (by using a pseudo tty) so the output process does not buffer.

Disabling Inappropriate Buffering Perl

I was working with a file parser in perl that prints the name of every file it processes. But i noticed that these print outputs appeared out of order which got my attention. After further digging, i found out that this is because, Perl is using Buffering and releases these print statements to the output only when the buffer is full. I also learned that there is a work around by "making the filehandle hot". Whenever you print to a hot filehandle, Perl flushes the buffer immediately. So my question is :
Are there any consequences of "making the filehandle hot" ?
Does leaving the buffer to get filled up before flushing vs flushing immediately have any effect on performance ?
Perl uses different output buffering modes depending on context: Writing to files etc. buffers in chunks (this is important for performance), while a handle is flushed after each line if perl has reason to believe that the output goes to a terminal. STDERR is unbuffered by default.
You can deactivate buffering for the currently selected file handle by setting the special $| variable to a true value. However, this is better expressed as:
use IO::File; # on older perls
...
$some_file_handle->autoflush(1);
print { $some_file_handle } "this isn't buffered";
which has the advantage that you don't have to use the annoying select function for handles other than STDOUT. Why is this method called autoflush? The file handle is still buffered, but the buffer is automatically flushed after each print or say call.
Careful: The autoflush method won't work on truly ancient perls where file handles aren't objects yet. In that case, do the select dance:
my $old_fh = select $my_$fh;
$| = 1;
select $old_fh;
print { $my_fh } "this isn't buffered";
(select returns the currently selected file handle).

What is the preferred cross-platform IPC Perl module?

I want to create a simple IO object that represents a pipe opened to another program to that I can periodically write to another program's STDIN as my app runs. I want it to be bullet-proof (in that it catches all errors) and cross-platform. The best options I can find are:
open
sub io_read {
local $SIG{__WARN__} = sub { }; # Silence warning.
open my $pipe, '|-', #_ or die "Cannot exec $_[0]: $!\n";
return $pipe;
}
Advantages:
Cross-platform
Simple
Disadvantages
No $SIG{PIPE} to catch errors from the piped program
Are other errors caught?
IO::Pipe
sub io_read {
IO::Pipe->reader(#_);
}
Advantages:
Simple
Returns an IO::Handle object for OO interface
Supported by the Perl core.
Disadvantages
Still No $SIG{PIPE} to catch errors from the piped program
Not supported on Win32 (or, at least, its tests are skipped)
IPC::Run
There is no interface for writing to a file handle in IPC::Run, only appending to a scalar. This seems…weird.
IPC::Run3
No file handle interface here, either. I could use a code reference, which would be called repeatedly to spool to the child, but looking at the source code, it appears that it actually writes to a temporary file, and then opens it and spools its contents to the pipe'd command's STDIN. Wha?
IPC::Cmd
Still no file handle interface.
What am I missing here? It seems as if this should be a solved problem, and I'm kind of stunned that it's not. IO::Pipe comes closest to what I want, but the lack of $SIG{PIPE} error handling and the lack of support for Windows is distressing. Where is the piping module that will JDWIM?
Thanks to guidance from #ikegami, I have found that the best choice for interactively reading from and writing to another process in Perl is IPC::Run. However, it requires that the program you are reading from and writing to have a known output when it is done writing to its STDOUT, such as a prompt. Here's an example that executes bash, has it run ls -l, and then prints that output:
use v5.14;
use IPC::Run qw(start timeout new_appender new_chunker);
my #command = qw(bash);
# Connect to the other program.
my ($in, #out);
my $ipc = start \#command,
'<' => new_appender("echo __END__\n"), \$in,
'>' => new_chunker, sub { push #out, #_ },
timeout(10) or die "Error: $?\n";
# Send it a command and wait until it has received it.
$in .= "ls -l\n";
$ipc->pump while length $in;
# Wait until our end-of-output string appears.
$ipc->pump until #out && #out[-1] =~ /__END__\n/m;
pop #out;
say #out;
Because it is running as an IPC (I assume), bash does not emit a prompt when it is done writing to its STDOUT. So I use the new_appender() function to have it emit something I can match to find the end of the output (by calling echo __END__). I've also used an anonymous subroutine after a call to new_chunker to collect the output into an array, rather than a scalar (just pass a reference to a scalar to '>' if you want that).
So this works, but it sucks for a whole host of reasons, in my opinion:
There is no generally useful way to know that an IPC-controlled program is done printing to its STDOUT. Instead, you have to use a regular expression on its output to search for a string that usually means it's done.
If it doesn't emit one, you have to trick it into emitting one (as I have done here—god forbid if I should have a file named __END__, though). If I was controlling a database client, I might have to send something like SELECT 'IM OUTTA HERE';. Different applications would require different new_appender hacks.
The writing to the magic $in and $out scalars feels weird and action-at-a-distance-y. I dislike it.
One cannot do line-oriented processing on the scalars as one could if they were file handles. They are therefore less efficient.
The ability to use new_chunker to get line-oriented output is nice, if still a bit weird. That regains a bit of the efficiency on reading output from a program, though, assuming it is buffered efficiently by IPC::Run.
I now realize that, although the interface for IPC::Run could potentially be a bit nicer, overall the weaknesses of the IPC model in particular makes it tricky to deal with at all. There is no generally-useful IPC interface, because one has to know too much about the specifics of the particular program being run to get it to work. This is okay, maybe, if you know exactly how it will react to inputs, and can reliably recognize when it is done emitting output, and don't need to worry much about cross-platform compatibility. But that was far from sufficient for my need for a generally useful way to interact with various database command-line clients in a CPAN module that could be distributed to a whole host of operating systems.
In the end, thanks to packaging suggestions in comments on a blog post, I decided to abandon the use of IPC for controlling those clients, and to use the DBI, instead. It provides an excellent API, robust, stable, and mature, and suffers none of the drawbacks of IPC.
My recommendation for those who come after me is this:
If you just need to execute another program and wait for it to finish, or collect its output when it is done running, use IPC::System::Simple. Otherwise, if what you need to do is to interactively interface with something else, use an API whenever possible. And if it's not possible, then use something like IPC::Run and try to make the best of it—and be prepared to give up quite a bit of your time to get it "just right."
I've done something similar to this. Although it depends on the parent program and what you are trying to pipe. My solution was to spawn a child process (leaving $SIG{PIPE} to function) and writing that to the log, or handling the error in what way you see fit. I use POSIX to handle my child process and am able to utilize all the functionality of the parent. However if you're trying to have the child communicate back to the parent - then things get difficult. Do you have an example of the main program and what you're trying to PIPE?

What can be the possible situations where one should prefer the unbuffered output?

By the discussion in my previous question I came to know that Perl gives line buffer output by default.
$| = 0; # for buffered output (by default)
If you want to get unbuffered output then set the special variable $| to 1 i.e.
$| = 1; # for unbuffered output
Now I want to know that what can be the possible situations where one should prefer the unbuffered output?
You want unbuffered output for interactive tasks. By that, I mean you don't want output stuck in some buffer when you expect someone or something else to respond to the output.
For example, you wouldn't want user prompts sent to STDOUT to be buffered. (That's why STDOUT is never fully buffered when attached to a terminal. It is only line buffered, and the buffer is flushed by attempts to read from STDIN.)
For example, you'd want requests sent over pipes and sockets to not get stuck in some buffer, as the other end of the connection would never see it.
The only other reason I can think of is when you don't want important data to be stuck in a buffer in the event of a unrecoverable error such as a panic or death by signal.
For example, you might want to keep a log file unbuffered in order to be able to diagnose serious problems. (This is why STDERR isn't buffered by default.)
Here's a small sample of Perl users from StackOverflow who have benefited from learning to set $| = 1:
STDOUT redirected externally and no output seen at the console
Perl Print function does not work properly when Sleep() is used
can't write to file using print
perl appending issues
Unclear perl script execution
Perl: Running a "Daemon" and printing
Redirecting STDOUT of a pipe in Perl
Why doesn't my parent process see the child's output until it exits?
Why does adding or removing a newline change the way this perl for loop functions?
Perl not printing properly
In Perl, how do I process input as soon as it arrives, instead of waiting for newline?
What is the simple way to keep the output stream exactly as it shown out on the screen (while interactive data used)?
Is it possible to print from a perl CGI before the process exits?
Why doesn't print output anything on each iteration of a loop when I use sleep?
Perl Daemon Not Working with Sleep()
It can be useful when writing to another program over a socket or pipe. It can also be useful when you are writing debugging information to STDOUT to watch the state of your program live.

How do I flush a file in Perl?

I have Perl script which appends a new line to the existing file every 3 seconds. Also, there is a C++ application which reads from that file.
The problem is that the application begins to read the file after the script is done and file handle is closed. To avoid this I want to flush after each line append. How can I do that?
Try:
use IO::Handle;
$fh->autoflush;
This was actually posted as a way of auto-flushing in an early question of mine, which asked about the universally accepted bad way of achieving this :-)
TL/DR: use IO::Handle and the flush method, eg:
use IO::Handle;
$myfile->flush();
First, you need to decide how "flushed" you want it. There can be quite a few layers of buffering:
Perl's internal buffer on the file handle. Other programs can't see data until it's left this buffer.
File-system level buffering of "dirty" file blocks. Other programs can still see these changes, they seem "written", but they'll be lost if the OS or machine crashes.
Disk-level write-back buffering of writes. The OS thinks these are written to disk, but the disk is actually just storing them in volatile memory on the drive. If the OS crashes the data won't be lost, but if power fails it might be unless the disk can write it out first. This is a big problem with cheap consumer SSDs.
It gets even more complicated when SANs, remote file systems, RAID controllers, etc get involved. If you're writing via pipes there's also the pipe buffer to consider.
If you just want to flush the Perl buffer, you can close the file, print a string containing "\n" (since it appears that Perl flushes on newlines), or use IO::Handle's flush method.
You can also, per the perl faq use binmode or play with $| to make the file handle unbuffered. This is not the same thing as flushing a buffered handle, since queuing up a bunch of buffered writes then doing a single flush has a much lower performance cost than writing to an unbuffered handle.
If you want to flush the file system write back buffer you need to use a system call like fsync(), open your file in O_DATASYNC mode, or use one of the numerous other options. It's painfully complicated, as evidenced by the fact that PostgreSQL has its own tool just to test file syncing methods.
If you want to make sure it's really, truly, honestly on the hard drive in permanent storage you must flush it to the file system in your program. You also need to configure the hard drive/SSD/RAID controller/SAN/whatever to really flush when the OS asks it to. This can be surprisingly complicated to do and is quite OS/hardware specific. "plug-pull" testing is strongly recommended to make sure you've really got it right.
From 'man perlfaq5':
$old_fh = select(OUTPUT_HANDLE);
$| = 1;
select($old_fh);
If you just want to flush stdout, you can probably just do:
$| = 1;
But check the FAQ for details on a module that gives you a nicer-to-use abstraction, like IO::Handle.
Here's the answer - the real answer.
Stop maintaining an open file handle for this file for the life of the process.
Start abstracting your file-append operation into a sub that opens the file in append mode, writes to it, and closes it.
# Appends a new line to the existing file
sub append_new_line{
my $linedata = shift;
open my $fh, '>>', $fnm or die $!; # $fnm is file-lexical or something
print $fh $linedata,"\n"; # Flavor to taste
close $fh;
}
The process observing the file will encounter a closed file that gets modified whenever the function is called.
All of the solutions suggesting setting autoflush are ignoring the basic fact that most modern OS's are buffering file I/O irrespective of what Perl is doing.
You only possibility to force the commitment of the data to disk is by closing the file.
I'm trapped with the same dilemma atm where we have an issue with rotation of the log being written.
To automatically flush the output, you can set autoflush/$| as described by others before you output to the filehandle.
If you've already output to the filehandle and need to ensure that it gets to the physical file, you need to use the IO::Handle flush and sync methods.
There an article about this in PerlDoc: How do I flush/unbuffer an output filehandle? Why must I do this?
Two solutions:
Unbuffer the output filehandler with : $|
Call the autoflush method if you are using IO::Handle or one of its subclasses.
An alternative approach would be to use a named pipe between your Perl script and C++ program, in lieu of the file you're currently using.
For those who are searching a solution to flush output line by line to a file in Ansys CFD Post using a Session File (*.cse), this is the only solution that worked for me:
! $file="Test.csv";
! open(OUT,"+>>$file");
! select(OUT);$|=1; # This is the important line
! for($i=0;$i<=10;$i++)
! {
! print out "$i\n";
! sleep(3);
! }
Note that you need the exclamation marks at every begin of every line that contains Perl script. sleep(3); is only applied for demonstration reasons. use IO::Handle; is not needed.
The genuine correct answer is to use:-
$|=1; # Make STDOUT immediate (non-buffered)
and although that is one cause of your problem, the other cause of the same problem is this: "Also, there is a C++ application which reads from that file."
It is EXTREMELY NON-TRIVIAL to write C++ code which can properly read from a file that is growing, because your "C++" program will encounter an EOF when it gets to the end... (you cannot read past the end of a file without serious extra trickery) - you have to do a pile of complicated stuff with IO blocking and flags to properly monitor a file this way (like how the linux "tail" command works).
I had the same problem with the only difference of writing the same file over and over again with new content. This association of "$| = 1" and autoflush worked for me:
open (MYFILE, '>', '/internet/web-sites/trot/templates/xml_queries/test.xml');
$| = 1; # Before writing!
print MYFILE "$thisCardReadingContentTemplate\n\n";
close (MYFILE);
MYFILE->autoflush(1); # After writing!
Best of luck.
H