I have a program in Perl that reads one line at a time from a data file and computes certain statistics for each line of data. Every now and then, while the program reads through my dataset, I get a warning about an ...uninitialized value... and I would like to know which line of data generates this warning.
Is there any way I can tell Perl to print (to screen or file) the data point that is generating the error?
If your script prints one line for each input line, it is easier to see where the error occurs if you flush the output as you go, so that warnings on STDERR appear at the "right" point relative to your normal output:
$| = 1;
That is, turn on the Perl autoflush feature, as discussed in How to flush output to the console?
What (auto)flushing does:
error messages are written to the predefined STDERR stream, normal printf's go to the (default) predefined STDOUT.
data written on these streams is saved up by the system (under Perl's control) to write out in chunks (called "buffers") to improve efficiency.
the STDERR and STDOUT buffers are independent, and each may be flushed line by line or in larger blocks (many characters, not necessarily whole lines).
using autoflush tells Perl to modify its scheme for writing buffers so that their content is written via the operating system at the end of a print/printf call.
STDERR is normally unbuffered, so its messages appear immediately; the $| = 1 statement enables the same immediate flushing for the currently selected (default) output stream, i.e. STDOUT.
doing that makes output on both streams appear promptly, so messages sent close together in time via either stream show up close together in the output of your script, as the sketch below illustrates.
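A minimal sketch of what that looks like in practice (the explicit IO::Handle load is only needed on older perls):

use IO::Handle;          # supplies the autoflush() method on older perls

STDOUT->autoflush(1);    # same effect as: select(STDOUT); $| = 1;
STDERR->autoflush(1);    # usually a no-op, since STDERR is unbuffered by default

print "processing input line $.\n";   # now appears immediately, in step with any warnings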
Perl usually includes the file handle and line number in warnings by default, e.g.:
>echo hello | perl -lnwe 'print $x'
Name "main::x" used only once: possible typo at -e line 1.
Use of uninitialized value $x in print at -e line 1, <> line 1.
So if you're doing the computation while reading, you get the appropriate warning.
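If you also want to see the offending data itself, not just its line number, one approach (an illustrative sketch, not part of the original answer) is to install a $SIG{__WARN__} handler that echoes the current input record along with the warning:

use strict;
use warnings;

my $current_line;                       # the data record being processed right now

$SIG{__WARN__} = sub {
    print STDERR $_[0];                 # the normal warning text, with its line numbers
    print STDERR "  offending data: $current_line"
        if defined $current_line;
};

while (my $line = <>) {
    $current_line = $line;
    # ... compute statistics on $line; any warning now reports the data too ...
}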
Related
I have a Perl script that received input piped from another program. It's buffering with an 8k (Ubuntu default) input buffer, which is causing problems. I'd like to use line buffering or disable buffering completely. It doesn't look like there's a good way to do this. Any suggestions?
use IO::Handle;
use IO::Poll qw[ POLLIN POLLHUP POLLERR ];
use Text::CSV;

my $csv    = Text::CSV->new();             # missing from the snippet as posted
my %config = (poll_timeout => 1);          # placeholder value; the snippet referenced $config{poll_timeout}

my $stdin = IO::Handle->new;
$stdin->fdopen(fileno(STDIN), 'r');
$stdin->setbuf(undef);

my $poll = IO::Poll->new() or die "cannot create IO::Poll object";
$poll->mask($stdin => POLLIN);
STDIN->blocking(0);

my $halt = 0;
for (;;) {
    $poll->poll($config{poll_timeout});
    for my $handle ($poll->handles(POLLIN | POLLHUP | POLLERR)) {
        next unless $handle eq $stdin;
        if (eof) {
            $halt = 1;
            last;
        }
        my $row = $csv->getline($stdin);
        # Do more stuff here
    }
    last if $halt;
}
Polling STDIN throws a wrench into things, since buffered reads and direct calls like sysread do not mix. I don't want to spin in a tight loop calling sysread on a non-blocking handle; I need select or poll so I don't hammer the CPU.
PLEASE NOTE: I'm talking about STDIN, NOT STDOUT. $|++ is not the solution.
[EDIT]
Updating my question to clarify based on the comments and other answers.
The program that is writing to STDOUT (on the other side of the pipe) is line buffered and flushed after every write. Every write contains a newline, so in effect, buffering is not an issue for STDOUT of the first program.
To verify this is true, I wrote a small C program that reads piped input from the same program with STDIN buffering disabled (setvbuf with _IONBF). The input appears in STDIN of the test program immediately. Sadly, it does not appear to be an issue with the output from the first program.
[/EDIT]
Thanks for any insight!
PS. I have done a fair amount of Googling. This link is the closest I've found to an answer, but it certainly doesn't satisfy all my needs.
Say there are two short lines in the pipe's buffer.
IO::Poll notifies you there's data to read, which you proceed to read (indirectly) using readline.
Reading one character at a time from a file handle is very inefficient. As such, readline (aka <>) reads a block of data from the file handle at a time. The two lines end up in a buffer and the first of the two lines is returned.
Then you wait for IO::Poll to notify you that there is more data. It doesn't know about Perl's buffer; it just knows the pipe is empty. As such, it blocks.
This post demonstrates the problem. It uses IO::Select, but the principle (and solution) is the same.
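One common way around it (a sketch of the usual workaround rather than the exact code from that post) is to bypass Perl's read buffer entirely: let select/poll tell you the pipe is readable, sysread whatever is there, and split lines out of a buffer you manage yourself:

use IO::Select;

my $sel = IO::Select->new(\*STDIN);
my $buf = '';

while (1) {
    # Wait until the OS says STDIN is readable; no hidden Perl buffer is involved.
    next unless $sel->can_read(1);              # 1-second timeout, picked arbitrarily

    my $n = sysread(STDIN, $buf, 8192, length $buf);
    last unless $n;                             # undef (error) or 0 (EOF): stop

    # Pull complete lines out of our own buffer; a partial line waits for the next read.
    while ($buf =~ s/^(.*\n)//) {
        my $line = $1;
        # ... parse $line here, e.g. with Text::CSV's parse() ...
    }
}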
You're actually talking about the other program's STDOUT. The solution is $|=1; (or equivalent) in the other program.
If you can't, you might be able to convince the other program use line-buffering instead of block buffering by connecting its STDOUT to a pseudo-tty instead of a pipe (like Expect.pm does, for example).
The Unix program expect has a tool called unbuffer which does exactly that. (It's part of the expect-dev package on Ubuntu.) Just prefix the command name with unbuffer.
I have a perl script, say "process_output.pl" which is used in the following context:
long_running_command | "process_output.pl"
The process_output script needs to behave like the Unix "tee" command: it should dump the output of "long_running_command" to the terminal as it is generated, capture that output to a text file as well, and, once "long_running_command" finishes, fork another process with the text file as its input.
The behavior I am currently seeing is that the output of "long_running_command" gets dumped to the terminal only when it completes, instead of being dumped as it is generated. Do I need to do something special to fix this?
Based on my reading of a few other Stack Exchange posts, I tried the following in "process_output.pl", without much help:
select(STDOUT); $| =1;
select(STDIN); $| =1; # Not sure even if this is needed
use FileHandle; STDOUT->autoflush(1);
stdbuf -oL -eL long_running_command | "process_output.pl"
Any pointers on how to proceed further?
Thanks
AB
This is more likely an issue with the output of the first process being buffered, rather than the input of your script. The easiest solution would be to try using the unbuffer command (I believe it's part of the expect package), something like
unbuffer long_running_command | "process_output.pl"
The unbuffer command will disable the buffering that happens normally when output is directed to a non-interactive place.
This will be down to the output buffering of long_running_command. More than likely it is using stdio, which checks what the output file descriptor is connected to before doing any output. If it is a terminal (tty), it will generally output line by line, but in the case above it will notice it is writing to a pipe and will therefore buffer the output into larger chunks.
You can control the buffering in your own process by using, as you showed
select(STDOUT); $| =1;
This means that what your process prints to STDOUT is not buffered. It makes no sense to do this for input, as you already control how much buffering is done when reading: if you use sysread() you are reading unbuffered, whereas if you use a construct like <$fh>, Perl will wait until it has a whole line (it actually reads up to the next input record separator, as defined in the variable $/, which is a newline by default) before it returns data to you.
unbuffer can be used to "disable" the output buffering. What it actually does is make the writing process think it is talking to a tty (by using a pseudo-tty), so that process does not buffer its output.
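To tie this back to the question, here is a rough sketch of what "process_output.pl" could look like, assuming the upstream command has been unbuffered (via unbuffer or stdbuf); the capture file name and the post-processing command are placeholders:

use strict;
use warnings;
use IO::Handle;

STDOUT->autoflush(1);                        # echo each line as soon as we see it

my $capture = 'captured_output.txt';         # placeholder file name
open my $log, '>', $capture or die "Can't write $capture: $!";
$log->autoflush(1);                          # optional: keep the capture file current too

while (my $line = <STDIN>) {
    print $line;                             # to the terminal, like tee
    print {$log} $line;                      # and to the capture file
}
close $log;

# long_running_command has finished; hand the captured file to the next stage.
system('post_process', $capture) == 0        # 'post_process' is a placeholder command
    or warn "post_process failed: $?";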
I was working with a file parser in Perl that prints the name of every file it processes, but I noticed that these print outputs appeared out of order, which got my attention. After further digging, I found out that this is because Perl buffers output and releases these print statements only when the buffer is full. I also learned that there is a workaround: "making the filehandle hot". Whenever you print to a hot filehandle, Perl flushes the buffer immediately. So my questions are:
Are there any consequences of "making the filehandle hot" ?
Does letting the buffer fill up before flushing, versus flushing immediately, have any effect on performance?
Perl uses different output buffering modes depending on context: Writing to files etc. buffers in chunks (this is important for performance), while a handle is flushed after each line if perl has reason to believe that the output goes to a terminal. STDERR is unbuffered by default.
You can deactivate buffering for the currently selected file handle by setting the special $| variable to a true value. However, this is better expressed as:
use IO::File; # on older perls
...
$some_file_handle->autoflush(1);
print { $some_file_handle } "this isn't buffered";
which has the advantage that you don't have to use the annoying select function for handles other than STDOUT. Why is this method called autoflush? The file handle is still buffered, but the buffer is automatically flushed after each print or say call.
Careful: The autoflush method won't work on truly ancient perls where file handles aren't objects yet. In that case, do the select dance:
my $old_fh = select $my_fh;
$| = 1;
select $old_fh;
print { $my_fh } "this isn't buffered";
(select returns the currently selected file handle).
I've been writing output from perl scripts to files for some time using code as below:
open( OUTPUT, ">:utf8", $output_file ) or die "Can't write new file: $!";
print OUTPUT "First line I want printed\n";
print OUTPUT "Another line I want printing\n";
close(OUTPUT);
This works, and is faster than my initial approach, which used "say" instead of print (Thank you NYTProf for enlightening me to that!)
However, my current script is looping over hundreds of thousands of lines and is taking many hours to run using this method and NYTProf is pointing the finger at my thousands of 'print' commands. So, the question is... Is there a faster way of doing this?
Other Info that's possibly relevant...
Perl Version: 5.14.2 (On Ubuntu)
Background of the script in question...
A number of '|' delimited flat files are being read into hashes; each file has some sort of primary key matching entries from one to another. I'm manipulating this data and then combining them into one file for import into another system.
The output file is around 3 Million lines, and the program starts to noticeably slow down after writing around 30,000 lines to said file. (A little reading around seemed to point towards running out of write buffer in other languages but I couldn't find anything about this with regard to perl?)
EDIT: I've now tried adding the line below, just after the open() statement, to disable print buffering, but the program still slows around the 30,000th line.
OUTPUT->autoflush(1);
I think you need to redesign the algorithm your program uses. File output speed isn't influenced by the amount of data that has been output, and it is far more likely that your program is reading and processing data but not releasing it.
Check the amount of memory used by your process to see if it increases inexorably
Beware of for (<$filehandle>) loops, which read whole files into memory at once (see the sketch below)
As I said in my comment, disable the relevant print statements to see how performance changes
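To illustrate the point about for loops over a filehandle (a sketch with a made-up file name):

# Slurps the whole file into a list before looping - memory grows with file size:
open my $fh, '<', 'input.dat' or die "Can't open input.dat: $!";
for my $line (<$fh>) {
    # process $line
}
close $fh;

# Reads one record per iteration instead - memory use stays flat:
open my $in, '<', 'input.dat' or die "Can't open input.dat: $!";
while (my $line = <$in>) {
    # process $line
}
close $in;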
Have you tried concatenating all the individual prints into a single scalar and then printing that scalar all at once? I have a script that outputs an average of 20 lines of text for each input line. Using individual print statements, even with the output sent to /dev/null, took a long time. But when I packed all the output (for a single input line) together, using things like:
$output .= "...";
$output .= sprintf("%s...", $var);
Then, just before leaving the line-processing subroutine, I print $output, writing all the lines at once. The number of calls to print went from ~7.7M to about 386K, equal to the number of lines in the input data file. This shaved about 10% off my total execution time.
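A minimal sketch of that pattern, reusing the OUTPUT handle from the question (the field handling is made up for illustration):

sub process_line {
    my ($line) = @_;
    my $output = '';

    # Build all of this input line's output in memory first.
    for my $field (split /\|/, $line) {
        $output .= sprintf "field: %s\n", $field;
    }

    # One print call per input line instead of one per output line.
    print OUTPUT $output;
}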
From the discussion in my previous question, I came to know that Perl buffers its output by default.
$| = 0; # for buffered output (by default)
If you want unbuffered output, set the special variable $| to 1, i.e.
$| = 1; # for unbuffered output
Now I want to know: what are the possible situations where one should prefer unbuffered output?
You want unbuffered output for interactive tasks. By that, I mean you don't want output stuck in some buffer when you expect someone or something else to respond to the output.
For example, you wouldn't want user prompts sent to STDOUT to be buffered. (That's why STDOUT is never fully buffered when attached to a terminal. It is only line buffered, and the buffer is flushed by attempts to read from STDIN.)
For example, you'd want requests sent over pipes and sockets to not get stuck in some buffer, as the other end of the connection would never see it.
The only other reason I can think of is when you don't want important data to be stuck in a buffer in the event of an unrecoverable error such as a panic or death by signal.
For example, you might want to keep a log file unbuffered in order to be able to diagnose serious problems. (This is why STDERR isn't buffered by default.)
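For instance, an interactive prompt is the classic case (a small sketch; it matters most when STDOUT is a pipe or socket rather than a terminal):

$| = 1;                       # flush STDOUT after every print
print "Enter your name: ";    # no trailing newline, so line buffering alone would not flush this
chomp(my $name = <STDIN>);
print "Hello, $name\n";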
Here's a small sample of Perl users from StackOverflow who have benefited from learning to set $| = 1:
STDOUT redirected externally and no output seen at the console
Perl Print function does not work properly when Sleep() is used
can't write to file using print
perl appending issues
Unclear perl script execution
Perl: Running a "Daemon" and printing
Redirecting STDOUT of a pipe in Perl
Why doesn't my parent process see the child's output until it exits?
Why does adding or removing a newline change the way this perl for loop functions?
Perl not printing properly
In Perl, how do I process input as soon as it arrives, instead of waiting for newline?
What is the simple way to keep the output stream exactly as it shown out on the screen (while interactive data used)?
Is it possible to print from a perl CGI before the process exits?
Why doesn't print output anything on each iteration of a loop when I use sleep?
Perl Daemon Not Working with Sleep()
It can be useful when writing to another program over a socket or pipe. It can also be useful when you are writing debugging information to STDOUT to watch the state of your program live.
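For the debugging case, a tiny sketch of the kind of loop where this usually bites (the sleep stands in for slow work):

$| = 1;                 # without this, the dots may sit in the buffer until the loop finishes
for my $i (1 .. 10) {
    print ".";          # progress marker with no newline, so nothing else would flush it
    sleep 1;            # simulate a slow step
}
print " done\n";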