How can I debug a Perl program that suddenly exits?

I have a Perl program based on IO::Async, and it sometimes just exits after a few hours/days without printing any error message whatsoever. There's nothing in dmesg or /var/log either. STDOUT/STDERR both have autoflush(1) set, so data shouldn't be lost in buffers. It doesn't actually return from IO::Async::Loop->loop_forever; a print I put there just to make sure of that never gets triggered.
Now, one way would be to keep peppering the program with more and more prints and hope one of them gives me some clue. Is there a better way to find out what was going on in a program that made it exit or silently crash?

One trick I've used is to run the program under strace or ltrace (or attach to the process using strace). Naturally that was under Linux. Under other operating systems you'd use ktrace or dtrace or whatever is appropriate.
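For example (the script name, PID and output path here are placeholders):

strace -f -tt -o /tmp/trace.out perl myprog.pl
strace -f -tt -o /tmp/trace.out -p <PID>

-f follows child processes and -tt adds timestamps; the tail of the trace usually shows the final exit_group() call or the fatal signal that ended the process.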
A trick I've used for programs which only exhibit sparse issues over days or weeks, and then only on a handful among hundreds of systems, is to direct the output from my tracer to a FIFO, and have a custom program keep only the last 10K lines in a ring buffer (with a handler on SIGPIPE and SIGHUP to dump the current buffer contents into a file). It's a simple program, but I don't have a copy handy and I'm not going to re-write it tonight; my copy was written for internal use and is owned by a former employer.
The ring buffer allows the program to run indefinitely without fear of running systems out of disk space ... we usually only need a few hundred, or at most a couple thousand, lines of the trace in such matters.
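Since the original program isn't available, here is a minimal sketch of such a ring-buffer reader, assuming the tracer writes to a FIFO connected to the script's stdin (the dump path is hypothetical):

#!/usr/bin/perl
use strict;
use warnings;

my $MAX = 10_000;              # keep only the last 10K lines
my @ring;

sub dump_ring {
    open my $out, '>', "/tmp/trace-ring.$$" or return;   # hypothetical dump path
    print {$out} @ring;
    close $out;
}

$SIG{HUP}  = \&dump_ring;      # ask for a dump without stopping the tracer
$SIG{PIPE} = \&dump_ring;

while (my $line = <STDIN>) {
    push @ring, $line;
    shift @ring while @ring > $MAX;
}
dump_ring();                   # tracer closed the FIFO; save what we have

You would then point the tracer at a FIFO, e.g. mkfifo /tmp/trace.fifo; strace -f -o /tmp/trace.fifo -p <PID> & ./ringbuffer.pl < /tmp/trace.fifo (names hypothetical).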

If you are capturing STDERR, you could start the program as perl -MCarp::Always foo_prog. Carp::Always forces a stack trace on all errors.

A sudden exit without any error message is possibly a SIGPIPE. Traditionally SIGPIPE is used to stop things like the cat command in the following pipeline:
cat file | head -10
It doesn't usually result in anything being printed either by libc or perl to indicate what happened.
Since in an IO::Async-based program you would not want to exit silently on SIGPIPE, my suggestion would be to put a line like the following somewhere in the main file of the program:
$SIG{PIPE} = sub { die "Aborting on SIGPIPE\n" };
which will at least alert you to this fact. If you instead use Carp::croak without the \n, you might even be lucky enough to get the file/line number of the syswrite, etc. that caused the SIGPIPE.
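For example, a sketch of that variant:

use Carp;
# croak without a trailing "\n" appends file/line information, so the
# message points near the failed write; Carp::confess would give a full
# stack backtrace instead.
$SIG{PIPE} = sub { Carp::croak "Aborting on SIGPIPE" };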

Related

Fortran and Eclipse: Displaying text in console

I'm having a small difficulty with Fortran 90 and Eclipse. I installed the "Photran" plugin for Eclipse, and have managed to compile everything perfectly, and overall the program does what it has to do. The problem comes when displaying text in the Eclipse console. The code itself is not that important, since it does what it has to do; the problem is more with the output generation.
The piece of the code I'm having trouble with is the following:
subroutine main_program
write(*,*) "Program begins!"
<Program that takes ~5mins to run>
write(*,*) "Program ends!"
end subroutine main_program
Specifically, the problem is that the first message, "Program begins!", should be shown in the console immediately, and after ~5 minutes it should show "Program ends!". Instead, both of these messages get displayed only after the program is done running, not while the program is executing.
I have used:
subroutine main_program
print*, "Program begins!"
<Program that takes ~5mins to run>
print*, "Program ends!"
end subroutine main_program
but it keeps on doing the same thing. I saw a "similar" post earlier (can't find the link though, sorry about that) but it was not really what I was looking for.
OK, here's the answer. Insert the statement
flush 6
after the first write statement to have its output sent immediately to the console. Insert it anywhere else you wish once you understand what it is doing.
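Applied to OP's code, that looks like this (a sketch, assuming unit 6 is connected to the console, as discussed below):

subroutine main_program
    write(*,*) "Program begins!"
    flush 6    ! push the buffered message out to the console now
    ! <Program that takes ~5mins to run>
    write(*,*) "Program ends!"
end subroutine main_program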
It is obvious (to me) from the situation OP describes that the output is being buffered: the program issues a write statement and passes the output off to the operating system, which does as it damn well pleases -- here it waits until the program ends before writing anything to the console. I guess that its buffering capabilities have some limits, and if the program exceeded them the o/s would empty its buffers prior to program end.
Fortran now (since 2003 I think) provides a standard way of telling the o/s to actually flush the buffer to the output device -- the flush statement. In its simplest form flush takes only one argument, the unit number of the output channel to be flushed. I guessed that OP had unit 6 connected to stdout (aka *), since this is a near-universal default configuration, though not one guaranteed by the Fortran language standard.
I don't think that flush * is correct.
If you have a pre-2003 compiler then (a) for Backus' sake update and (b) it is likely that it supports a non-standard way to flush buffers; if memory serves gfortran used to provide a subroutine which would be called something like call flush(6).
There are other ways, outside Fortran, to tell the o/s to write to disk (or console or what have you) immediately. Look at the documentation for your o/s if you are interested in them.

back ticks not working in perl

I'm stuck on one problem on our live server.
We have a Perl script which runs almost 15 to 18 hrs a day and creates 100+ subprocesses every day. In one place it has a command (a product command which we run on the command line of a Solaris box) which is triggered with backticks inside the Perl code.
It looks like the backticks command gets skipped or fails randomly.
For example, if I need to run it for 50 customers, 2 or 3 fail randomly.
I do not see any evidence anywhere that the command was triggered.
Since it's a live server, we can't try making many code changes until we are sure about the problem.
Here is the code:
my $comm = "inventory -noX customer1"; # sample command I have given here
my $newLogFile = "/path/to/output.log"; # to capture command output; here we have the path where the file gets created
my $piddy = `$comm 2>&1 > $newLogFile`;
Is it happening because of the backticks? I am really not sure :(.
I also tried various kinds of analysis, like memory/CPU/disk space/adding librtld_db.so to LD_LIBRARY_PATH, etc., but no luck. Also, the Perl is 64-bit. What else can I do? :(
I suspect you are not checking for errors (and perl doesn't make that easy to do correctly for backticks).
Consider using IPC::System::Simple's capture in place of your backticks/qx.
As its doc says, "If there's an error, it will die with a detailed description of what went wrong."
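A minimal sketch, reusing the sample command from the question:

use IPC::System::Simple qw(capture);

# capture() works like backticks, but dies with a detailed message if the
# command cannot be started, is killed by a signal, or exits with a
# non-zero status.
my $output = capture("inventory -noX customer1");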
It shouldn't fail just because of backticks; however, because it is spawning a new process, that process may periodically be subject to failure due to system conditions (e.g. system load). Backticks are really a "fire and forget" method and should never be used for anything critical in a production environment. As previously suggested, there are far more robust ways to manage spawning external processes.
If the command's output is being lost due to buffering, you might try turning off buffering, but keep an eye on it for performance degradation (it's usually not significant).
Buffering can be turned off for an entire script by adding this near the top:
$|=1;
When calling external commands, I use system from IPC::System::Simple or open3 from IPC::Open3.
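For instance, a sketch with IPC::Open3, reusing the command from the question (for large outputs you would multiplex the two pipes with IO::Select to avoid deadlock):

use IPC::Open3;
use Symbol qw(gensym);

my $err = gensym;                # open3 needs a pre-made handle to keep stderr separate
my $pid = open3(my $in, my $out, $err,
                'inventory', '-noX', 'customer1');
close $in;                       # nothing to send on stdin
my @stdout = <$out>;             # the command's output
my @stderr = <$err>;             # its diagnostics, separately
waitpid $pid, 0;                 # reap the child and collect $?
die "command failed (exit ", $? >> 8, "): @stderr" if $?;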

what might cause a print error in perl?

I have a long-running script that every hour opens a file, prints to it and closes the file. I've recently found that, very rarely, the print fails; I know this not because I'm testing the status of the print itself, but because entries are missing from the file, until the system is actually rebooted!
I do trap file open failures and write a message to syslog when that happens, and I'm not seeing any open failures, so I'm now guessing it may be the print that is failing. I'm not trapping print failures (which I suspect most people don't), but I am now going to update that one print.
Meanwhile, my question is does anyone know what types of situations could cause a print statement to fail when there is plenty of disk storage and no contention for a file which has been successfully opened in append mode?
You could be out of memory (ENOMEM) or over a filesize limit (E2BIG or SIGXFSZ). You could have an old-fashioned I/O error (EIO). You could have a race condition if the script is run concurrently or if the file is accessed over NFS. And, of course, you could have an error in the expression whose value you would print.
An exotic cause that I once saw is that a CPU heatsink failure can lead to sprintf spuriously failing, causing some surprising results including writing garbage to file descriptors.
Finally, I remind you that print will often write its stuff in an I/O buffer. This means two things. (1) You need to check the result of close() as well. (2) If you print but you don't immediately close() or flush() then your data can be buffered and not actually written until much later (or not at all if the process dies horribly).
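A minimal sketch of checking every step (the path and entry text are placeholders):

use strict;
use warnings;

my $logfile = '/var/log/myapp.log';               # hypothetical path
open my $fh, '>>', $logfile  or die "open $logfile: $!";
print {$fh} "hourly entry\n" or die "print to $logfile: $!";
close $fh                    or die "close $logfile: $!";  # catches errors from flushing buffered data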

When do you need to `END { close STDOUT}` in Perl?

In tchrist's boilerplate I found this explicit closing of STDOUT in the END block.
END { close STDOUT }
I know END and close, but I don't see why it is needed.
When I started searching, I found the following in perlfaq8:
For example, you can use this to make sure your filter program managed to finish its output without filling up the disk:
END {
    close(STDOUT) || die "stdout close failed: $!";
}
and I still don't understand it. :(
Can someone explain (maybe with some code examples):
why and when it is needed
how and in what cases my Perl filter can fill up the disk, and so on
when things go wrong without it...
etc.?
A lot of systems implement "optimistic" file operations. By this I mean that a call to, for instance, print, which should add some data to a file, can return successfully before the data is actually written to the file, or even before enough space is reserved on disk for the write to succeed.
In these cases, if your disk is nearly full, all your prints can appear successful, but when it is time to close the file and flush it out to disk, the system realizes that there is no room left. You then get an error when closing the file.
This error means that all the output you thought you saved might actually not have been saved at all (or partially saved). If that was important, your program needs to report an error (or try to correct the situation, or ...).
All this can happen on the STDOUT filehandle if it is connected to a file, e.g. if your script is run as:
perl script.pl > output.txt
If the data you're outputting is important, and you need to know if all of it was indeed written correctly, then you can use the statement you quoted to detect a problem. For example, in your second snippet, the script explicitly calls die if close reports an error; tchrist's boilerplate runs under use autodie, which automatically invokes die if close fails.
(This will not guarantee that the data is stored persistently on disk though, other factors come into play there as well, but it's a good error indication. i.e. if that close fails, you know you have a problem.)
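On Linux you can provoke exactly this situation with the /dev/full device, which fails every write with ENOSPC (a sketch, not from the original thread): the prints land in the buffer and appear to succeed, and only the close reports the failure.

perl -e 'print "data\n" for 1 .. 5; END { close STDOUT or die "stdout close failed: $!" }' > /dev/full

This dies with something like "stdout close failed: No space left on device".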
I believe Mat is mistaken.
Both Perl and the system have buffers. close causes Perl's buffers to be flushed to the system. It does not necessarily cause the system's buffers to be written to disk as Mat claimed. That's what fsync does.
Now, this would happen anyway on exit, but calling close gives you a chance to handle any error it encountered flushing the buffers.
The other thing close does is report earlier errors in attempts by the system to flush its buffers to disk.
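To make the distinction concrete, a sketch (the file name is a placeholder; sync is IO::Handle's wrapper around fsync, and is not available on every platform):

use strict;
use warnings;
use IO::Handle;

open my $fh, '>', '/tmp/out.dat' or die "open: $!";
print {$fh} "important data\n"   or die "print: $!";
$fh->flush or die "flush: $!";   # Perl's buffers -> the system's buffers
$fh->sync  or die "fsync: $!";   # the system's buffers -> the disk
close $fh  or die "close: $!";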

Why does writing to an unconnected socket send SIGPIPE first?

There are so many possible errors in the POSIX environment. Why do some of them (like writing to an unconnected socket in particular) get special treatment in the form of signals?
This is by design, so that simple programs producing text (e.g. find, grep, cat) used in a pipeline die when their consumer dies. That is, if you're running a chain like find | grep | sed | head, head will exit as soon as it reads enough lines. That will kill sed with SIGPIPE, which will kill grep with SIGPIPE, which will kill find with SIGPIPE. If there were no SIGPIPE, naively written programs would continue running and producing content that nobody needs.
If you don't want to get SIGPIPE in your program, just ignore it with a call to signal(). After that, syscalls like write() that hit a broken pipe will fail with errno set to EPIPE instead.
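A self-contained sketch of the same idea in Perl, using a pipe whose read end has already been closed:

use strict;
use warnings;
use Errno qw(EPIPE);

$SIG{PIPE} = 'IGNORE';           # Perl's spelling of signal(SIGPIPE, SIG_IGN)

pipe(my $r, my $w) or die "pipe: $!";
close $r;                        # nobody is reading any more

my $n = syswrite($w, "data\n");  # would normally kill us with SIGPIPE
if (!defined $n && $! == EPIPE) {
    print "write failed with EPIPE instead of a fatal signal\n";
}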
See this SO answer for a detailed explanation of why writing a closed descriptor / socket generates SIGPIPE.
Why is writing a closed TCP socket worse than reading one?
SIGPIPE isn't specific to sockets; as the name suggests, it is also sent when you try to write to a pipe (anonymous or named). I guess the reason for having separate error-handling behaviour is that broken pipes shouldn't always be treated as an error (whereas, for example, trying to write to a file that doesn't exist should always be treated as an error).
Consider the program less. This program reads input from stdin (unless a filename is specified) and only shows part of it at a time. If the user scrolls down, it will try to read more input from stdin, and display that. Since it doesn't read all the input at once, the pipe will be broken if the user quits (e.g. by pressing q) before the input has all been read. This isn't really a problem, though, so the program that's writing down the pipe should handle it gracefully.
It's a design decision.
In the beginning, signals were the mechanism for notifying user space of events. Later this became less necessary, because more popular patterns such as polling emerged, which don't require the caller to install a signal handler.