Within a loop construct of a perl script, I have written the following lines to parse text files using system tools, generating temporary text files in the process, and subsequently read the temporary output into an array for processing within the perl script:
system("awk '(NR+2)%4==0' $infile[$i+$j] | tre-agrep -ns -E $dist[$a][$b] -k $query[$a][$b] | awk 'BEGIN{FS=\":\";OFS=\":\"}{print \$1,\$2}' > $outfile");
open(my $FH, "<", $outfile) || die "Can't open $outfile: $!";
while(<$FH>) {
...
}
close($FH);
These commands are repeated twice nearly verbatim (with modification of some of the parameters, but recycling of the file handle) within a loop construct that itself is iterated numerous times. Unexpectedly and seemingly arbitrarily, the program sometimes fails to complete the system call, causing the subsequent lines, which depend on the output generated by the system call, to fail in turn, triggering abortion of the script and display of the rather unhelpful error message "No such file or directory" (with reference to the open statement). Executing the same system call directly from a console rather than within the context of the perl script shows that the command produces the expected output. I refer to this behavior as arbitrary because sometimes my script will variously complete 1 to 3 iterations before failing at the open line, and the basis for the varying success is not clear. When the script is working properly, the system call takes quite some time (around 2 minutes), whereas when it fails, the program moves to the following open line in less than a second. I would thus like to figure out why the system call is sometimes skipped.
The script is run in a bash shell session and the following are included in the script header:
#! /usr/bin/perl
use warnings;
use strict;
I fundamentally agree with #ThisSuitIsBlackNot. However, not knowing what tre-agrep is, it's difficult to translate that part into straight Perl.
That said, at the very least, why not skip the generation of the output file and just read the Unix output directly from Perl?
open my $FH, '-|', "awk '(NR+2)%4==0' $infile[$i+$j] | " .
"tre-agrep -ns -E $dist[$a][$b] -k $query[$a][$b]" or die "$!";
while (<$FH>) {
chomp;
my ($field1, $field2) = split /:/, $_, 2;
}
close $FH;
At worst, the standard output from your system call would be blank, but this wouldn't impact Perl's ability to read nothing (and therefore do nothing).
Of course, it wouldn't hurt to first execute an -e (exists) call to be sure that infile[$i + $j] isn't a ghost.
Related
Was debugging a perl script for the first time in my life and came over this:
$my_temp_file = File::Temp->tmpnam();
system("cmd $blah | cmd2 > $my_temp_file");
open(FIL, "$my_temp_file");
...
unlink $my_temp_file;
This works pretty much like I want, except the obvious race conditions in lines 1-3. Even if using proper tempfile() there is no way (I can think of) to ensure that the file streamed to at line 2 is the same opened at line 3. One solution might be pipes, but the errors during cmd might occur late because of limited pipe buffering, and that would complicate my error handling (I think).
How do I:
Write all output from cmd $blah | cmd2 into a tempfile opened file handle?
Read the output without re-opening the file (risking race condition)?
You can open a pipe to a command and read its contents directly with no intermediate file:
open my $fh, '-|', 'cmd', $blah;
while( <$fh> ) {
...
}
With short output, backticks might do the job, although in this case you have to be more careful to scrub the inputs so they aren't misinterpreted by the shell:
my $output = `cmd $blah`;
There are various modules on CPAN that handle this sort of thing, too.
Some comments on temporary files
The comments mentioned race conditions, so I thought I'd write a few things for those wondering what people are talking about.
In the original code, Andreas uses File::Temp, a module from the Perl Standard Library. However, they use the tmpnam POSIX-like call, which has this caveat in the docs:
Implementations of mktemp(), tmpnam(), and tempnam() are provided, but should be used with caution since they return only a filename that was valid when function was called, so cannot guarantee that the file will not exist by the time the caller opens the filename.
This is discouraged and was removed for Perl v5.22's POSIX.
That is, you get back the name of a file that does not exist yet. After you get the name, you don't know if that filename was made by another program. And, that unlink later can cause problems for one of the programs.
The "race condition" comes in when two programs that probably don't know about each other try to do the same thing as roughly the same time. Your program tries to make a temporary file named "foo", and so does some other program. They both might see at the same time that a file named "foo" does not exist, then try to create it. They both might succeed, and as they both write to it, they might interleave or overwrite the other's output. Then, one of those programs think it is done and calls unlink. Now the other program wonders what happened.
In the malicious exploit case, some bad actor knows a temporary file will show up, so it recognizes a new file and gets in there to read or write data.
But this can also happen within the same program. Two or more versions of the same program run at the same time and try to do the same thing. With randomized filenames, it is probably exceedingly rare that two running programs will choose the same name at the same time. However, we don't care how rare something is; we care how devastating the consequences are should it happen. And, rare is much more frequent than never.
File::Temp
Knowing all that, File::Temp handles the details of ensuring that you get a filehandle:
my( $fh, $name ) = File::Temp->tempfile;
This uses a default template to create the name. When the filehandle goes out of scope, File::Temp also cleans up the mess.
{
my( $fh, $name ) = File::Temp->tempfile;
print $fh ...;
...;
} # file cleaned up
Some systems might automatically clean up temp files, although I haven't care about that in years. Typically is was a batch thing (say once a week).
I often go one step further by giving my temporary filenames a template, where the Xs are literal characters the module recognizes and fills in with randomized characters:
my( $name, $fh ) = File::Temp->tempfile(
sprintf "$0-%d-XXXXXX", time );
I'm often doing this while I'm developing things so I can watch the program make the files (and in which order) and see what's in them. In production I probably want to obscure the source program name ($0) and the time; I don't want to make it easier to guess who's making which file.
A scratchpad
I can also open a temporary file with open by not giving it a filename. This is useful when you want to collect outside the program. Opening it read-write means you can output some stuff then move around that file (we show a fixed-length record example in Learning Perl):
open(my $tmp, "+>", undef) or die ...
print $tmp "Some stuff\n";
seek $tmp, 0, 0;
my $line = <$tmp>;
File::Temp opens the temp file in O_RDWR mode so all you have to do is use that one file handle for both reading and writing, even from external programs. The returned file handle is overloaded so that it stringifies to the temp file name so you can pass that to the external program. If that is dangerous for your purpose you can get the fileno() and redirect to /dev/fd/<fileno> instead.
All you have to do is mind your seeks and tells. :-) Just remember to always set autoflush!
use File::Temp;
use Data::Dump;
$fh = File::Temp->new;
$fh->autoflush;
system "ls /tmp/*.txt >> $fh" and die $!;
#lines = <$fh>;
printf "%s\n\n", Data::Dump::pp(\#lines);
print $fh "How now brown cow\n";
seek $fh, 0, 0 or die $!;
#lines2 = <$fh>;
printf "%s\n", Data::Dump::pp(\#lines2);
Which prints
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
]
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
"How now brown cow\n",
]
HTH
The following script creates a gziped file named "input.gz". Then the script attempts to open "input.gz" using gzip -dc. Intuitively, die should be triggered if a wrong input file name is provided. However, as in the following script, the program will not die even if a wrong input file name is provided ("inputx.gz"):
use warnings;
use strict;
system("echo PASS | gzip -c > input.gz");
open(IN,"-|","gzip -dc inputx.gz") || die "can't open input.gz!";
print STDOUT "die statment was not triggered!\n";
close IN;
The output of the script above was
die statment was not triggered!
gzip: inputx.gz: No such file or directory
My questions is: why wasn't die statement triggered even though gzip quit with error? And how can I make die statement triggered when a wrong file name is given?
It's buried in perlipc, but this seems relevant (emphasis added):
Be careful to check the return values from both open() and close(). If you're writing to a pipe, you should also trap SIGPIPE. Otherwise, think of what happens when you start up a pipe to a command that doesn't exist: the open() will in all likelihood succeed (it only reflects the fork()'s success), but then your output will fail--spectacularly. Perl can't know whether the command worked, because your command is actually running in a separate process whose exec() might have failed. Therefore, while readers of bogus commands return just a quick EOF, writers to bogus commands will get hit with a signal, which they'd best be prepared to handle.
Use IO::Uncompress::Gunzip to read gzipped files instead.
The open documentation is explicit about open-ing a process since that is indeed different
If you open a pipe on the command - (that is, specify either |- or -| with the one- or two-argument forms of open), an implicit fork is done, so open returns twice: in the parent process it returns the pid of the child process, and in the child process it returns (a defined) 0. Use defined($pid) or // to determine whether the open was successful.
For example, use either
my $child_pid = open(my $from_kid, "-|") // die "Can't fork: $!";
or
my $child_pid = open(my $to_kid, "|-") // die "Can't fork: $!";
(with code following that shows one use of this, which you don't need) The main point is to check for defined -- by design we get undef if open for a process fails, not just any "false."
While this should be corrected, keep in mind that the open call fails if fork itself fails, what is rare; in most cases when a "command fails" the fork was successful but something later wasn't. So in such cases we just cannot get the // die message, but end up seeing messages from the shell or command or OS, hopefully.
This is alright though, if informative messages indeed get emitted by some part of the process. Wrap the whole thing in eval and you'll have manageable error reporting.
But it is in general difficult to ensure to get all the right messages, and in some cases not possible. One good approach is to use a module for running and managing external commands. Among the many other advantages they also usually handle errors much more nicely. If you need to handle process's output right as it is emitted I recommend IPC::Run (which i'd recommend otherwise as well).
Read on what linked docs say, for specific examples on what you need and for much useful insight.
In your case
# Check input, depending on how it is given,
# consider String::ShellQuote if needed
my $file = ...;
my #cmd = ('gzip', '-dc', $file);
my $child_pid = open my $in, '-|', #cmd
// die "Can't fork for '#cmd': $!";
while (<$in>) {
...
}
close $in or die "Error closing pipe: $!";
Note a few other points
the "list form" of the command bypasses the shell
lexical filehandle (my $fh) is much better than typeglobs (IN)
print the actual error in the die statement, in $! variable
check close for a good final check on how it all went
Consider these 2 snippets :
#!/bin/bash/perl
open(DATA,"<input.txt");
while(<DATA>)
{
print($_) ;
}
and
$abcd = `cat input.txt`;
print $abcd;
Both will print the content of file input.txt as output
Question : Is there any standard, as to which one (backticks or native-method) should be preferred over the other, in any particular case or both are equal always??
Reason i am asking this is because i find cat method to be easier than opening a file in native perl method, so, this puts me in doubt that if i can achieve something through backtick way, shall i go with it or prefer other native ways of doing it!!
I checked this thread too : What's the difference between Perl's backticks, system, and exec? but it went a different route than my doubt!!
Use builtin functions wherever possible:
They are more portable: open works on Windows, while `cat input.txt` will not.
They have less overhead: Using backticks will fork, exec a shell which parses the command, which execs the cat program. This unnecessarily loads two programs. This is in contrast to open which is a builtin Perl function.
They make error handling easier. The open function will return a false value on error, which allows you to take different actions, e.g. like terminating the program with an error message:
open my $fh, "<", "input.txt" or die "Couldn't open input.txt: $!";
They are more flexible. For example, you can add encoding layers if your data isn't Latin-1 text:
open my $fh, "<:utf8", "input.txt" or die "Couldn't open input.txt: $!";
open my $fh, "<:raw", "input.bin" or die "Couldn't open input.bin: $!";
If you want a “just read this file into a scalar” function, look at the File::Slurp module:
use File::Slurp;
my $data = read_file "input.txt";
Using the back tick operators to call cat is highly inefficient, because:
It spawns a separate process (or maybe more than one if a shell is used) which does nothing more than read the file, which perl could do itself.
You are reading the whole file into memory instead of processing it one line at a time. OK for a small file, not so good for a large one.
The back tick method is ok for a quick and dirty script but I would not use it for anything serious.
I basically want to reopen STDERR/STDOUT so they write to one logfile with both the stream and the timestamp included on every line. So print STDERR "Hello World" prints STDERR: 20130215123456: Hello World. I don't want to rewrite all my print statements into function calls, also some of the output will be coming from external processes via system() calls anyway which I won't be able to rewrite.
I also need for the output to be placed in the file "live", i.e. not only written when the process completes.
(p.s. I'm not asking particularly for details of how to generate timestamps, just how to redirect to a file and prepend a string)
I've worked out the following code, but it's messy:
my $mode = ">>";
my $file = "outerr.txt";
open(STDOUT, "|-", qq(perl -e 'open(FILE, "$mode", "$file"); while (<>) { print FILE "STDOUT: \$\_"; }'));
open(STDERR, "|-", qq(perl -e 'open(FILE, "$mode", "$file"); while (<>) { print FILE "STDERR: \$\_"; }'));
(The above doesn't add dates, but that should be trivial to add)
I'm looking for a cleaner solution, one that doesn't require quoting perl code and passing it on the command line, or at least module that hides some of the complexity. Looking at the code for Capture::Tiny it doesn't look like it can handle writing a part of output, though I'm not sure about that. annotate-output only works on an external command sadly, I need this to work on both external commands and ordinary perl printing.
The child launched via system doesn't write to STDOUT because it does not have access to variables in your program. Therefore, means having code run on a Perl file handle write (e.g. tie) won't work.
Write another script that runs your script with STDOUT and STDERR replaced with pipes. Read from those pipes and print out the modified output. I suggest using IPC::Run to do this, because it'll save you from using select. You can get away without it if you combine STDOUT and STDERR in one stream.
I have the code:
open(FILE, "<$new_file") or die "Cant't open file \n";
#lines=<FILE>;
close FILE;
open(STDOUT, ">$new_file") or die "Can't open file\n";
$old_fh = select(OUTPUT_HANDLE);
$| = 1;
select($old_fh);
for(#lines){
s/(.*?xsl.*?)xsl/$1xslt/;
print;
}
close(STDOUT);
STDOUT -> autoflush(1);
print "file changed";
After closing STDOUT closing the program does not write the last print print "file changed". Why is this?
*Edited* Print message I want to write on Console no to file
I suppose it is because print default filehandle is STDOUT, which at that point it is already closed. You could reopen it, or print to other filehandle, for example, STDERR.
print STDERR "file changed";
It's because you've closed the filehandle stored in STDOUT, so print can't use it anymore. Generally speaking opening a new filehandle into one of the predefined handle names isn't a very good idea because it's bound to lead to confusion. It's much clearer to use lexical filehandles, or just a different name for your output file. Yes you then have to specify the filehandle in your print call, but then you don't have any confusion over what's happened to STDOUT.
A print statement will output the string in the STDOUT, which is the default output file handle.
So the statement
print "This is a message";
is same as
print STDOUT "This is a message";
In your code, you have closed STDOUT and then printing the message, which will not work. Reopen the STDOUT filehandle or do not close it. As the script ends, the file handles will be automatically closed
open OLDOUT, ">&", STDOUT;
close STDOUT;
open(STDOUT, ">$new_file") or die "Can't open file\n";
...
close(STDOUT);
open (STDOUT, ">&",OLDOUT);
print "file changed";
You seem to be confused about how file IO operations are done in perl, so I would recommend you read up on that.
What went wrong?
What you are doing is:
Open a file for reading
Read the entire file and close it
Open the same file for overwrite (org file is truncated), using the STDOUT file handle.
Juggle around the default print handle in order to set autoflush on a file handle which is not even opened in the code you show.
Perform a substitution on all lines and print them
Close STDOUT then print a message when everything is done.
Your main biggest mistake is trying to reopen the default output file handle STDOUT. I assume this is because you do not know how print works, i.e. that you can supply a file handle to print to print FILEHANDLE "text". Or that you did not know that STDOUT was a pre-defined file handle.
Your other errors:
You did not use use strict; use warnings;. No program you write should be without these. They will prevent you from doing bad things, and give you information on errors, and will save you hours of debugging.
You should never "slurp" a file (read the entire file to a variable) unless you really need to, because this is ineffective and slow and for huge files will cause your program to crash due to lack of memory.
Never reassign the default file handles STDIN, STDOUT, STDERR, unless A) you really need to, B) you know what you are doing.
select sets the default file handle for print, read the documentation. This is rarely something that you need to concern yourself with. The variable $| sets autoflush on (if set to a true value) for the currently selected file handle. So what you did actually accomplished nothing, because OUTPUT_HANDLE is a non-existent file handle. If you had skipped the select statements, it would have set autoflush for STDOUT. (But you wouldn't have noticed any difference)
print uses print buffers because it is efficient. I assume you are trying to autoflush because you think your prints get caught in the buffer, which is not true. Generally speaking, this is not something you need to worry about. All the print buffers are automatically flushed when a program ends.
For the most part, you do not need to explicitly close file handles. File handles are automatically closed when they go out of scope, or when the program ends.
Using lexical file handles, e.g. open my $fh, ... instead of global, e.g. open FILE, .. is recommended, because of the previous statement, and because it is always a good idea to avoid global variables.
Using three-argument open is recommended: open FILEHANDLE, MODE, FILENAME. This is because you otherwise risk meta-characters in your file names to corrupt your open statement.
The quick fix:
Now, as I said in the comments, this -- or rather, what you intended, because this code is wrong -- is pretty much identical to the idiomatic usage of the -p command line switch:
perl -pi.bak -e 's/(.*?xsl.*?)xsl/$1xslt/' file.txt
This short little snippet actually does all that your program does, but does it much better. Explanation:
-p switch automatically assumes that the code you provide is inside a while (<>) { } loop, and prints each line, after your code is executed.
-i switch tells perl to do inplace-edit on the file, saving a backup copy in "file.txt.bak".
So, that one-liner is equivalent to a program such as this:
$^I = ".bak"; # turns inplace-edit on
while (<>) { # diamond operator automatically uses STDIN or files from #ARGV
s/(.*?xsl.*?)xsl/$1xslt/;
print;
}
Which is equivalent to this:
my $file = shift; # first argument from #ARGV -- arguments
open my $fh, "<", $file or die $!;
open my $tmp, ">", "/tmp/foo.bar" or die $!; # not sure where tmpfile is
while (<$fh>) { # read lines from org file
s/(.*?xsl.*?)xsl/$1xslt/;
print $tmp $_; # print line to tmp file
}
rename($file, "$file.bak") or die $!; # save backup
rename("/tmp/foo.bar", $file) or die $!; # overwrite original file
The inplace-edit option actually creates a separate file, then copies it over the original. If you use the backup option, the original file is first backed up. You don't need to know this information, just know that using the -i switch will cause the -p (and -n) option to actually perform changes on your original file.
Using the -i switch with the backup option activated is not required (except on Windows), but recommended. A good idea is to run the one-liner without the option first, so the output is printed to screen instead, and then adding it once you see the output is ok.
The regex
s/(.*?xsl.*?)xsl/$1xslt/;
You search for a string that contains "xsl" twice. The usage of .*? is good in the second case, but not in the first. Any time you find yourself starting a regex with a wildcard string, you're probably doing something wrong. Unless you are trying to capture that part.
In this case, though, you capture it and remove it, only to put it back, which is completely useless. So the first order of business is to take that part out:
s/(xsl.*?)xsl/$1xslt/;
Now, removing something and putting it back is really just a magic trick for not removing it at all. We don't need magic tricks like that, when we can just not remove it in the first place. Using look-around assertions, you can achieve this.
In this case, since you have a variable length expression and need a look-behind assertion, we have to use the \K (mnemonic: Keep) option instead, because variable length look-behinds are not implemented.
s/xsl.*?\Kxsl/xslt/;
So, since we didn't take anything out, we don't need to put anything back using $1. Now, you may notice, "Hey, if I replace 'xsl' with 'xslt', I don't need to remove 'xsl' at all." Which is true:
s/xsl.*?xsl\K/t/;
You may consider using options for this regex, such as /i, which causes it to ignore case and thus also match strings such as "XSL FOO XSL". Or the /g option which will allow it to perform all possible matches per line, and not just the first match. Read more in perlop.
Conclusion
The finished one-liner is:
perl -pi.bak -e 's/xsl.*?xsl\K/t/' file.txt