How to read and overwrite the same input file using perl?
My input:
GCGCCACTGCACTCCAGCCTGGGCGACAGAGC (873 TO 904) GCTCTGTCGCCCAGGCTGGAGTGCAGTGGCGC (3033 TO 3064)
CAAAAAAAAAAAAAAAAAAA (917 TO 936) TTTTTTTTTTTTTTTTTTTG (2998 TO 3017)
AAAAAAAAAAAAAAAAAAAG (922 TO 941) CTTTTTTTTTTTTTTTTTTT (2997 TO 3016)
I tried the below code:
#!/usr/local/bin/perl
open($in,'<',"/home/httpd/cgi-bin/exa.txt") || die("error");
open($out,'>>',"/home/httpd/cgi-bin/exa.txt")||die("error");
while(<$in>)
{
print $out;
}
close $in;
close $out;
Bear in mind what you're doing here: you're opening a file for reading and reading it one line at a time.
What do you think is going to happen when you modify that file while you're still reading it?
There are also some constraints: Windows, for one, doesn't support this kind of concurrent open for read and write anyway.
However take a look at open specifically:
You can put a + in front of the > or < to indicate that you want both read and write access to the file; thus +< is almost always preferred for read/write updates--the +> mode would clobber the file first. You can't usually use either read-write mode for updating textfiles, since they have variable-length records. See the -i switch in perlrun for a better approach. The file is created with permissions of 0666 modified by the process's umask value.
What I would suggest instead, though: don't read and write the same file at all. Rename the original, run your process writing to the original name, verify that it worked properly, and then tidy up afterwards.
That way a partial success won't mean corrupted data.
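A rough sketch of that approach (the file name just reuses the one from the question, and the transform step is a placeholder):
use strict;
use warnings;
use File::Copy qw(move);

my $file   = '/home/httpd/cgi-bin/exa.txt';
my $backup = "$file.bak";

# Rename the original out of the way, then write a fresh copy under the old name.
move($file, $backup) or die "Cannot rename $file: $!";

open my $in,  '<', $backup or die "Cannot read $backup: $!";
open my $out, '>', $file   or die "Cannot write $file: $!";

while (my $line = <$in>) {
    # ... transform $line here ...
    print {$out} $line;
}

close $in;
close $out or die "Error while writing $file: $!";

# Once you have verified the new file looks right, tidy up:
# unlink $backup or warn "Could not remove $backup: $!";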
You can use the -i flag (see perlrun), which lets you edit in place much as you might be used to with sed. (It can also be used within a program via $^I; see perlvar.)
There are a couple of constraints on doing this, though: it only works with the while (<>) { ... } construct. Practically speaking, I think this wouldn't be a good choice outside more simplistic programs: it does its work implicitly, so it might not be entirely clear to future readers, and under the hood it's doing essentially the same open-and-rename dance anyway.
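For completeness, here's a minimal sketch of the in-program variant, using $^I and the magic <> loop; the substitution is only a placeholder for whatever edit you actually need:
use strict;
use warnings;

{
    local $^I   = '.bak';                            # keep a backup, like sed -i.bak
    local @ARGV = ('/home/httpd/cgi-bin/exa.txt');   # files to edit in place
    while (<>) {
        s/foo/bar/;   # placeholder edit
        print;        # goes back into the file, not to STDOUT
    }
}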
Related
I was debugging a Perl script for the first time in my life and came across this:
$my_temp_file = File::Temp->tmpnam();
system("cmd $blah | cmd2 > $my_temp_file");
open(FIL, "$my_temp_file");
...
unlink $my_temp_file;
This works pretty much like I want, except for the obvious race conditions in lines 1-3. Even when using a proper tempfile() there is no way (that I can think of) to ensure that the file streamed to at line 2 is the same one opened at line 3. One solution might be pipes, but errors during cmd might surface late because of limited pipe buffering, and that would complicate my error handling (I think).
How do I:
Write all output from cmd $blah | cmd2 into a tempfile opened file handle?
Read the output without re-opening the file (risking race condition)?
You can open a pipe to a command and read its contents directly with no intermediate file:
open my $fh, '-|', 'cmd', $blah;
while( <$fh> ) {
...
}
With short output, backticks might do the job, although in this case you have to be more careful to scrub the inputs so they aren't misinterpreted by the shell:
my $output = `cmd $blah`;
There are various modules on CPAN that handle this sort of thing, too.
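For example, IPC::Run3 (one of several CPAN options, assuming you have it installed) captures a command's output into a variable with no intermediate file and no shell quoting worries:
use IPC::Run3 qw(run3);

my ($output, $errors);
# Run the command with its arguments as a list so nothing goes through the shell;
# $blah is the same untrusted argument as above.
run3 [ 'cmd', $blah ], undef, \$output, \$errors;
die "cmd failed: $errors" if $?;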
Some comments on temporary files
The comments mentioned race conditions, so I thought I'd write a few things for those wondering what people are talking about.
In the original code, Andreas uses File::Temp, a module from the Perl Standard Library. However, they use the tmpnam POSIX-like call, which has this caveat in the docs:
Implementations of mktemp(), tmpnam(), and tempnam() are provided, but should be used with caution since they return only a filename that was valid when function was called, so cannot guarantee that the file will not exist by the time the caller opens the filename.
Using these is discouraged, and tmpnam was removed from the POSIX module in Perl v5.22.
That is, you get back the name of a file that does not exist yet. After you get the name, you don't know if that filename was made by another program. And, that unlink later can cause problems for one of the programs.
The "race condition" comes in when two programs that probably don't know about each other try to do the same thing as roughly the same time. Your program tries to make a temporary file named "foo", and so does some other program. They both might see at the same time that a file named "foo" does not exist, then try to create it. They both might succeed, and as they both write to it, they might interleave or overwrite the other's output. Then, one of those programs think it is done and calls unlink. Now the other program wonders what happened.
In the malicious exploit case, some bad actor knows a temporary file will show up, so it recognizes a new file and gets in there to read or write data.
But this can also happen within the same program. Two or more versions of the same program run at the same time and try to do the same thing. With randomized filenames, it is probably exceedingly rare that two running programs will choose the same name at the same time. However, we don't care how rare something is; we care how devastating the consequences are should it happen. And, rare is much more frequent than never.
File::Temp
Knowing all that, File::Temp handles the details of ensuring that you get a filehandle:
my $fh = File::Temp->new;   # $fh stringifies to the temporary file's name
This uses a default template to create the name. When the filehandle goes out of scope, File::Temp also cleans up the mess.
{
my $fh = File::Temp->new;
print $fh ...;
...;
} # file cleaned up when $fh goes out of scope
Some systems might automatically clean up temp files, although I haven't cared about that in years. Typically it was a batch thing (say, once a week).
I often go one step further by giving my temporary filenames a template, where the Xs are literal characters the module recognizes and fills in with randomized characters:
my $fh = File::Temp->new(
    TEMPLATE => sprintf "$0-%d-XXXXXX", time );
I'm often doing this while I'm developing things so I can watch the program make the files (and in which order) and see what's in them. In production I probably want to obscure the source program name ($0) and the time; I don't want to make it easier to guess who's making which file.
A scratchpad
I can also open a temporary file with open by not giving it a filename. This is useful when you just need scratch space and never need the file outside the program. Opening it read-write means you can output some stuff and then move around in that file (we show a fixed-length record example in Learning Perl):
open(my $tmp, "+>", undef) or die ...
print $tmp "Some stuff\n";
seek $tmp, 0, 0;
my $line = <$tmp>;
File::Temp opens the temp file in O_RDWR mode so all you have to do is use that one file handle for both reading and writing, even from external programs. The returned file handle is overloaded so that it stringifies to the temp file name so you can pass that to the external program. If that is dangerous for your purpose you can get the fileno() and redirect to /dev/fd/<fileno> instead.
All you have to do is mind your seeks and tells. :-) Just remember to always set autoflush!
use File::Temp;
use Data::Dump;
$fh = File::Temp->new;
$fh->autoflush;
system "ls /tmp/*.txt >> $fh" and die $!;
@lines = <$fh>;
printf "%s\n\n", Data::Dump::pp(\@lines);
print $fh "How now brown cow\n";
seek $fh, 0, 0 or die $!;
@lines2 = <$fh>;
printf "%s\n", Data::Dump::pp(\@lines2);
Which prints
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
]
[
"/tmp/cpan_htmlconvert_DPzx.txt\n",
"/tmp/cpan_htmlconvert_DunL.txt\n",
"/tmp/cpan_install_HfUe.txt\n",
"/tmp/cpan_install_XbD6.txt\n",
"/tmp/cpan_install_yzs9.txt\n",
"How now brown cow\n",
]
HTH
In my script I am dealing with opening files and writing to files. I found that there is something wrong with a file I try to open: the file exists, it is not empty, and I am passing the right path to the file handle.
I know that my question might sound weird, but while I was debugging my code I put the following command in my script to check some files:
system ("ls");
Then my script worked well; when it's removed, it does not work correctly anymore.
my @unique = ("test1","test2");
open(unique_fh,">orfs");
print unique_fh @unique ;
open(ORF,"orfs")or die ("file doesnot exist");
system ("ls");
while(<ORF>){
split ;
}
@neworfs=@_ ;
print @neworfs ;
Perl buffers the output when you print to a file. In other words, it doesn't actually write to the file every time you say print; it saves up a bunch of data and writes it all at once. This is faster.
In your case, you couldn't see anything you had written to the file, because Perl hadn't written anything yet. Adding the system("ls") call, however, caused Perl to write your output first (the interpreter is smart enough to do this, because it thinks you might want to use the system() call to do something with the file you just created).
How do you get around this? You can close the file before you open it again to read it, as choroba suggested. Or you can disable buffering for that file. Put this code just after you open the file:
my $fh = select (unique_fh);
$|=1;
select ($fh);
Then anytime you print to the file, it will get written immediately ($| is a special variable that sets the output buffering behavior).
Closing the file first is probably a better idea, although it is possible to have a filehandle for reading and writing open at the same time.
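For reference, here's roughly what the corrected flow looks like with lexical filehandles, closing the file before reopening it (splitting on whitespace is just a guess at what the original split was meant to do):
use strict;
use warnings;

my @unique = ("test1", "test2");

open my $out, '>', 'orfs' or die "Cannot write orfs: $!";
print {$out} "$_\n" for @unique;
close $out;                       # flushed and released before we read it back

open my $in, '<', 'orfs' or die "Cannot read orfs: $!";
my @neworfs;
while (my $line = <$in>) {
    push @neworfs, split ' ', $line;
}
close $in;

print "@neworfs\n";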
You did not close the filehandle before trying to read from the same file.
I have a Perl script which reads three files and writes new files after reading each one of them. Everything runs in one thread.
In this script, I open and work with three text files and store the contents in a hash. The files are large (close to 3 MB).
I am using a loop to go through each of the files (open -> read -> Do some action (hash table) -> close)
I am noticing that whenever I am scanning through the first file, the Perl terminal window in my Cygwin shell gets stuck. The moment I hit the Enter key, the script processes the rest of the files without any issues.
It's very odd as there is no read from STDIN in my script. Moreover, the same logic applies to all the three files as everything is in the same loop.
Has anyone here faced a similar issue? Does this usually happen when dealing with large files or big hashes?
I can't post the script here, but there is not much in it to post anyway.
Could this just be a problem in my Cygwin shell?
If this problem does not go away, how can I circumvent it? Like providing the enter input when the script is in progress? More importantly, how can I debug such a problem?
sub read_set
{
@lines_in_set = ();
push @lines_in_set , $_[0];
while (<INPUT_FILE>)
{ $line = $_;
chomp($line);
if ($line=~ /ENDNEWTYPE/i or $line =~ /ENDSYNTYPE/ or eof())
{
push @lines_in_set , $line;
last;
}
else
{
push @lines_in_set , $line;
}
}
return @lines_in_set;
}
--------> I think I found the problem: the eof() call was making the script get stuck!! Somehow it happens only the first time. I have no idea why, though.
The eof() call is the problem. See perldoc -f eof.
eof with empty parentheses refers to the pseudo file accessed via while (<>), which consists of either all the files named in @ARGV, or to STDIN if there are none.
And in particular:
Note that this function actually reads a character and then "ungetc"s it, so isn't useful in an interactive context.
But your loop reads from another handle, one called INPUT_FILE.
It would make more sense to call eof(INPUT_FILE). But even that probably isn't necessary; your outer loop will terminate when it reaches the end of INPUT_FILE.
Some more suggestions, not related to the symptoms you're seeing:
Add
use strict;
use warnings;
near the top of your script, and correct any error messages this produces (perl -cw script-name does a compile-only check). You'll need to declare your variables using my (perldoc -f my). And use consistent indentation; I recommend the same style you'll find in most Perl documentation.
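Putting those suggestions together, the sub might end up looking something like this (just a sketch; passing in a lexical filehandle is my own assumption, since the original relies on the global INPUT_FILE, and the eof() test is dropped because the loop already stops at end of file):
use strict;
use warnings;

sub read_set {
    my ($fh, $first_line) = @_;
    my @lines_in_set = ($first_line);

    while (my $line = <$fh>) {
        chomp $line;
        push @lines_in_set, $line;
        last if $line =~ /ENDNEWTYPE/i or $line =~ /ENDSYNTYPE/;
    }

    return @lines_in_set;
}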
I have a Perl script which appends a new line to an existing file every 3 seconds. Also, there is a C++ application which reads from that file.
The problem is that the application only begins to read the file after the script is done and the file handle is closed. To avoid this I want to flush after each appended line. How can I do that?
Try:
use IO::Handle;
$fh->autoflush;
This was actually posted as a way of auto-flushing in an early question of mine, which asked about the universally accepted bad way of achieving this :-)
TL/DR: use IO::Handle and the flush method, eg:
use IO::Handle;
$myfile->flush();
First, you need to decide how "flushed" you want it. There can be quite a few layers of buffering:
Perl's internal buffer on the file handle. Other programs can't see data until it's left this buffer.
File-system level buffering of "dirty" file blocks. Other programs can still see these changes, they seem "written", but they'll be lost if the OS or machine crashes.
Disk-level write-back buffering of writes. The OS thinks these are written to disk, but the disk is actually just storing them in volatile memory on the drive. If the OS crashes the data won't be lost, but if power fails it might be unless the disk can write it out first. This is a big problem with cheap consumer SSDs.
It gets even more complicated when SANs, remote file systems, RAID controllers, etc get involved. If you're writing via pipes there's also the pipe buffer to consider.
If you just want to flush the Perl buffer, you can close the file, print a string containing "\n" (since it appears that Perl flushes on newlines), or use IO::Handle's flush method.
You can also, per the perl faq use binmode or play with $| to make the file handle unbuffered. This is not the same thing as flushing a buffered handle, since queuing up a bunch of buffered writes then doing a single flush has a much lower performance cost than writing to an unbuffered handle.
If you want to flush the file system write back buffer you need to use a system call like fsync(), open your file in O_DATASYNC mode, or use one of the numerous other options. It's painfully complicated, as evidenced by the fact that PostgreSQL has its own tool just to test file syncing methods.
If you want to make sure it's really, truly, honestly on the hard drive in permanent storage you must flush it to the file system in your program. You also need to configure the hard drive/SSD/RAID controller/SAN/whatever to really flush when the OS asks it to. This can be surprisingly complicated to do and is quite OS/hardware specific. "plug-pull" testing is strongly recommended to make sure you've really got it right.
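As a small sketch of the first two layers (the file name here is just a placeholder): flush pushes Perl's buffer to the OS, and IO::Handle's sync asks the OS to write its dirty blocks out via fsync:
use strict;
use warnings;
use IO::Handle;

open my $fh, '>>', 'output.log' or die "Cannot open output.log: $!";

print {$fh} "a line of data\n";
$fh->flush;   # layer 1: empty Perl's buffer into the OS
$fh->sync;    # layer 2: ask the OS to push its dirty blocks to disk (fsync)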
From 'man perlfaq5':
$old_fh = select(OUTPUT_HANDLE);
$| = 1;
select($old_fh);
If you just want to flush stdout, you can probably just do:
$| = 1;
But check the FAQ for details on a module that gives you a nicer-to-use abstraction, like IO::Handle.
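With IO::Handle loaded, for instance, that nicer abstraction looks like this (the second handle is only there to show the per-handle form):
use IO::Handle;

STDOUT->autoflush(1);              # same effect as $| = 1 for STDOUT

open my $log, '>', 'out.txt' or die $!;
$log->autoflush(1);                # per-handle, no select() juggling needed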
Here's the answer - the real answer.
Stop maintaining an open file handle for this file for the life of the process.
Start abstracting your file-append operation into a sub that opens the file in append mode, writes to it, and closes it.
# Appends a new line to the existing file
sub append_new_line{
my $linedata = shift;
open my $fh, '>>', $fnm or die $!; # $fnm is file-lexical or something
print $fh $linedata,"\n"; # Flavor to taste
close $fh;
}
The process observing the file will encounter a closed file that gets modified whenever the function is called.
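Usage is then a plain call wherever you previously printed to the long-lived handle ($status here is only a stand-in for your data):
append_new_line("status: $status");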
All of the solutions suggesting autoflush are ignoring the basic fact that most modern OSes are buffering file I/O irrespective of what Perl is doing.
Your only possibility to force the commitment of the data to disk is by closing the file.
I'm stuck with the same dilemma at the moment, where we have an issue with rotating a log while it is being written.
To automatically flush the output, you can set autoflush/$| as described by others before you output to the filehandle.
If you've already output to the filehandle and need to ensure that it gets to the physical file, you need to use the IO::Handle flush and sync methods.
There's an article about this in the Perl FAQ (perlfaq5): How do I flush/unbuffer an output filehandle? Why must I do this?
Two solutions:
Unbuffer the output filehandle with $| = 1.
Call the autoflush method if you are using IO::Handle or one of its subclasses.
An alternative approach would be to use a named pipe between your Perl script and C++ program, in lieu of the file you're currently using.
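A rough sketch of the writer side of that, using a FIFO path of your choosing (the path here is a placeholder; the C++ program would simply open and read the same path):
use strict;
use warnings;
use IO::Handle;
use POSIX qw(mkfifo);

my $pipe = '/tmp/myapp.pipe';   # placeholder path, agreed on with the C++ reader

unless (-p $pipe) {
    mkfifo($pipe, 0600) or die "mkfifo $pipe failed: $!";
}

# Opening a FIFO for writing blocks until the reader opens its end.
open my $fh, '>', $pipe or die "Cannot open $pipe: $!";
$fh->autoflush(1);

while (1) {
    print {$fh} "a new line of data\n";
    sleep 3;
}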
For those who are searching for a solution to flush output line by line to a file in Ansys CFD Post using a Session File (*.cse), this is the only solution that worked for me:
! $file="Test.csv";
! open(OUT,"+>>$file");
! select(OUT);$|=1; # This is the important line
! for($i=0;$i<=10;$i++)
! {
! print OUT "$i\n";
! sleep(3);
! }
Note that you need the exclamation mark at the beginning of every line that contains Perl code. sleep(3); is only there for demonstration purposes. use IO::Handle; is not needed.
The genuine correct answer is to use:
$|=1; # Make STDOUT immediate (non-buffered)
and although that is one cause of your problem, the other cause of the same problem is this: "Also, there is a C++ application which reads from that file."
It is EXTREMELY NON-TRIVIAL to write C++ code which can properly read from a file that is still growing, because your C++ program will encounter an EOF when it gets to the end (you cannot read past the end of a file without serious extra trickery); you have to do a pile of complicated stuff with I/O blocking and flags to properly monitor a file this way (like the Linux "tail" command does).
I had the same problem, the only difference being that I was writing the same file over and over again with new content. This combination of $| = 1 and autoflush worked for me:
use IO::Handle;
open (MYFILE, '>', '/internet/web-sites/trot/templates/xml_queries/test.xml');
MYFILE->autoflush(1); # Turn autoflush on before writing, not after closing!
$| = 1; # Before writing!
print MYFILE "$thisCardReadingContentTemplate\n\n";
close (MYFILE);
Best of luck.
H
We have a script on an FTP endpoint that monitors the FTP logs spewed out by our FTP daemon.
Currently what we do is have a Perl script that essentially runs tail -F on the file and sends every single line to a remote MySQL database, with slightly different column content based on the record type.
This database has tables both for the tarball names/contents and for FTP user actions on those packages: downloads, deletes, and everything else VSFTPd logs.
I see this as particularly bad, but I'm not sure what's better.
The goal of a replacement is to still get the log file content into a database as quickly as possible. I'm thinking of doing something like putting a FIFO/pipe file in place of the FTP log file, so I can read from it periodically, ensuring I never read the same thing twice. That assumes VSFTPd will play nice with it (I'm thinking it won't; insight welcome!).
The FTP daemon is VSFTPd; I'm at least fairly sure the extent of its logging capabilities is: xfer-style log, vsftpd-style log, both, or no logging at all.
The question is, what's better than what we're already doing, if anything?
Honestly, I don't see much wrong with what you're doing now. tail -f is very efficient. The only real problem with it is that it loses state if your watcher script ever dies (which is a semi-hard problem to solve with rotating logfiles). There's also a nice File::Tail module on CPAN that saves you from the shellout and has some nice customization available.
Using a FIFO as a log can work (as long as vsftpd doesn't try to unlink and recreate the logfile at any point, which it may do) but I see one major problem with it. If no one is reading from the other end of the FIFO (for instance if your script crashes, or was never started), then a short time later all of the writes to the FIFO will start blocking. And I haven't tested this, but it's pretty likely that having logfile writes block will cause the entire server to hang. Not a very pretty scenario.
Your problem with reading a continually updated file is that you want to keep reading even after the end of file is reached. The solution to this is to re-seek to your current position in the file:
seek FILEHANDLE, 0, 1;
Here is my code for doing this sort of thing:
open(FILEHANDLE,'<', '/var/log/file') || die 'Could not open log file';
seek(FILEHANDLE, 0, 2) || die 'Could not seek to end of log file';
for (;;) {
while (<FILEHANDLE>) {
if ( $_ =~ /monitor status down/ ) {
print "Gone down\n";
}
}
sleep 1;
seek FILEHANDLE, 0, 1; # clear eof
}
You should look into inotify (assuming you are on a nice, POSIX-based OS) so you can run your Perl script whenever the logfile is updated. If this level of I/O causes problems you could always keep the logfile on a RAM disk so I/O is very fast.
This should help you set this up:
http://www.cyberciti.biz/faq/linux-inotify-examples-to-replicate-directories/
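On Linux, for instance, the Linux::Inotify2 module from CPAN (assuming it's available on your box) lets you block until the log changes instead of polling; the log path below is a placeholder:
use strict;
use warnings;
use Linux::Inotify2;

my $inotify = Linux::Inotify2->new
    or die "Unable to create inotify object: $!";

$inotify->watch('/var/log/vsftpd.log', IN_MODIFY, sub {
    my $event = shift;
    # read the newly appended lines here and push them to the database
});

1 while $inotify->poll;   # blocks until an event arrives, then runs the callback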
You can open a pipe that follows the file as it grows, for example by running tail -f on it:
open(my $file, '-|', 'tail', '-f', '/ftp/file/path');
while (<$file>) {
# you know the rest
}
File::Tail does this, plus heuristic sleeping and nice error handling and recovery.
Edit: On second thought, a real system pipe is better if you can manage it. If not, you need to find the last thing you put in the database, and spin through the file until you get to the last thing you put in the database whenever your process starts. Not that easy to accomplish, and potentially impossible if you have no way of identifying where you left off.
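For reference, the File::Tail approach mentioned above looks roughly like this (the path and the timing knobs are placeholders):
use strict;
use warnings;
use File::Tail;

my $tail = File::Tail->new(
    name        => '/var/log/vsftpd.log',
    interval    => 1,      # start checking this often
    maxinterval => 5,      # back off to at most this
);

while (defined(my $line = $tail->read)) {
    # parse $line and insert it into the database here
}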