Line-by-line file I/O not working as expected on Windows - Perl

I'm using Perl 5.16.1 from Strawberry Perl in a Windows environment. I have a Perl script that reads very large text files; the smallest is 30 MB. When reading files that do not have a line feed at the end of the very last line, I get very peculiar results. It doesn't happen every time, but when it does it's as though the script is reading cached data from the I/O system for another file I previously opened with the Perl script. If I manually edit the file and add a line feed, it's fine. I added a line counter and some inline diagnostics to display what happens near the end of the file, just to make sure I wasn't going nuts.

To try to fix it, I added this to my script:
open (SS_LOG, ">>", $SSFile) or die "Can't open $SSFile\r\n $!\r\n";
print SS_LOG "\r\n";
close SS_LOG;
but it does nothing. The file stays the same size. I'm also storing data in large arrays.
Has anyone else seen anything like this?

Try unbuffering your output:
SS_LOG->autoflush(1);
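A minimal sketch of how that fits with the append code from the question ($SSFile is the variable from the question; the explicit IO::Handle import isn't strictly needed on Perl 5.14+, but it doesn't hurt; note that Windows' default :crlf layer already turns \n into CRLF on output):

use IO::Handle;                       # enables method calls like autoflush() on filehandles

open(SS_LOG, ">>", $SSFile) or die "Can't open $SSFile: $!";
SS_LOG->autoflush(1);                 # unbuffered: each print reaches the file immediately
print SS_LOG "\n";                    # the :crlf layer writes this as \r\n on Windows
close(SS_LOG) or die "Error closing $SSFile: $!";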

Related

How to locate code causing corrupt binary output in Perl

I have a relatively complex Perl program that manages various pages and resources for my sites. Somewhere along the line I messed up something in a library of several thousand lines that provides essential services to most of the different scripts in the system, so that scripts in my codebase that output PDF or PNG files can no longer output those files correctly. If I rewrite the scripts that do the output to avoid using that library, they work, but I'd like to figure out what I broke in my library that is hurting binary output.
For example, in one snippet of code, I open a PDF file (or any sort of file -- it detects the mime type automatically) and then print it directly:
#Figure out MIME type.
use File::MimeInfo::Magic;
$mimeType = mimetype($filename);
my $fileData;
open (resource, $filename);
foreach my $self (<resource>) { $fileData .= $self; }
close (resource);
print "Content-type: " . $mimeType . "\n\n";
print $fileData;
exit;
This worked great, but at some point while editing the suspect library I mentioned, I did something that broke it and I'm stumped as to what I did. I had been playing with different utf8 encoding functions, but as far as I can tell, I removed all of my experimental code and the problem remains. Merely loading that library, without calling any of its functions, breaks the script's ability to output the file.
The output is visibly corrupted if I open it in a text editor. If I compare the source file opened by the code above with what gets output, the two have many differences, despite there being no processing in the code before output (I compared a sample PDF against the result of running it through the broken code).
I've tried retracing my steps for days and cannot find what is wrong in the problematic library. I hadn't used this function in a while and have written a lot of new code since I last tested it, so it is hard to know precisely where the problem is. My hope is that someone can look at the corrupted output in comparison to the source file and at least point me toward what could cause such a result. I feel like I'm looking for a needle in a haystack.
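One hedged guess, given the utf8 experiments mentioned above: if the suspect library (or anything it pulls in) installs a global encoding layer on STDOUT, for example via use open ':std', ':encoding(UTF-8)', then raw PDF/PNG bytes get re-encoded on the way out. A minimal sketch of how to check for and neutralize that; the PerlIO::get_layers diagnostic and the binmode calls are suggestions, not the original code:

use File::MimeInfo::Magic;

# Diagnostic: list the layers active on STDOUT after loading the suspect library.
# Something like "unix crlf encoding(utf-8-strict)" would explain corrupted binary output.
print STDERR join(' ', PerlIO::get_layers(STDOUT)), "\n";

my $mimeType = mimetype($filename);

open my $resource, '<', $filename or die "Can't open $filename: $!";
binmode $resource;                              # raw bytes: no CRLF or encoding translation
my $fileData = do { local $/; <$resource> };    # slurp the whole file
close $resource;

binmode STDOUT;                                 # strip any :utf8/:encoding layer before binary output
print "Content-type: $mimeType\n\n";
print $fileData;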

Perl: Is it better to clobber a file or remove it and open a new one?

For example,
#!/usr/bin/perl
open FILE1, '>out/existing_file1.txt';
open FILE2, '>out/existing_file2.txt';
open FILE3, '>out/existing_file3.txt';
versus
#!/usr/bin/perl
if (-d 'out') {
    system('rm -f out/*');
}
open FILE1, '>out/new_file1.txt';
open FILE2, '>out/new_file2.txt';
open FILE3, '>out/new_file3.txt';
In the first example, we clobber the files (truncate them to zero length). In the second, we clean the directory and then create new files.
The second method (where we clean the directory) seems redundant and unnecessary. The only advantage to doing this (in my mind) is that it resets permissions, as well as the change date.
Which is considered the best practice? (I suspect the question is pedantic, and the first example is more common.)
Edit: The reason I ask is that I have a script that parses data and writes output files to a directory, each time with the same filenames and paths. This script will be run many times, and I'm curious whether at the start of the script I should partially clean the directory (of the files I am writing to) or just let opening with '>' clobber the files for me and take no extra measures myself.
Other than the permissions issue you mentioned, the only significant difference between the two methods is if another process has one of the output files open while you do this. If you remove the file and then recreate it, the other process will continue to see the data in the original file. If you clobber the file, the other process will see the file contents change immediately (although if it's using buffered I/O, it may not notice it until it needs to refill the buffer).
Removing the files will also update the modification time of the containing directory.
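For reference, a sketch of the two approaches using three-argument open and lexical filehandles, with Perl's own unlink and glob instead of shelling out to rm (the file and directory names are the ones from the question):

# Approach 1: let '>' truncate the existing file in place.
open my $fh1, '>', 'out/existing_file1.txt' or die "Can't open: $!";

# Approach 2: remove the old files first, then create fresh ones.
# unlink is portable and avoids spawning a shell for 'rm -f'.
unlink glob 'out/*' if -d 'out';
open my $fh2, '>', 'out/new_file1.txt' or die "Can't open: $!";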

Check progress of silent Terminal command writing a file?

Not really sure if this is possible, but I am running this on Terminal:
script -q \/my\/directory\/\/$outfile \.\/lexparser.csh $file
Explanation
I am running this through a Perl script. The first path, /my/directory/$outfile, is where I am saving the output of the Terminal command; ./lexparser.csh $file just calls that script to work on the input file, $file.
Problem
However, I passed -q because I didn't want the unnecessary chatter saved to the file. The file is big, roughly 30 thousand lines of text, and the command has been running for some time now, which was expected.
Question
I would like to check and ensure everything is going smoothly. The output file shows up in Finder, but I'm afraid that if I click on it I'll ruin the output. How can I check the progress (ideally the current contents of the text file) without disrupting the process?
Thanks for your time, let me know if the question is unclear.
Open a new Terminal, navigate to the output directory, and:
tail -f <output_file>
You will continue to see new appends to the file without interruption to any writing process. Just leave the Terminal open with the tail going, and you can watch it all day long. Grab some popcorn.
In addition to tail, you can also learn about tee. The point of tee is to write to a file while also writing to STDOUT in your terminal. Best of both worlds! Well, some good aspects of two possible worlds.
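For a future run, a sketch of how the driving Perl script could use tee so the output both lands in the file and scrolls by in the terminal (the paths, $outfile, and $file are the ones from the question; the quoting is simplified):

# Pipe the parser's stdout through tee: it gets written to the output file
# and echoed to the terminal at the same time.
my $cmd = qq{./lexparser.csh "$file" | tee "/my/directory/$outfile"};
system($cmd) == 0 or die "lexparser pipeline failed: $?";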
You could tail the file via the command line which shouldn't cause problems.
Additionally, you could have the program print to stderr as well as stdout, redirect stdout to the file, and let stderr through so it can tell you its progress. Though that is more of a 20/20 hindsight solution.

Perl IO::Handle: appending to a file from two scripts at the same time

I have two scripts. Each opens a file for appending via IO::Handle (">>filename") and then calls $io->autoflush(1);
Will it work fine if both scripts do this at the same time, or could some lines be lost while appending?
You'll want to use syswrite, as the Log::Log4perl docs suggest for this sort of situation. syswrite writes each message in a single unbuffered call, and when the file is opened for append the OS positions every write at the current end of file, so messages from concurrent writers don't get interleaved.
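A minimal sketch of that approach, assuming each message fits comfortably in a single write (the file name shared.log is just a placeholder):

open my $log, '>>', 'shared.log' or die "Can't open shared.log: $!";
my $msg = "worker $$ finished a step\n";

# One unbuffered write per message; with '>>' the OS appends it at the
# current end of file, so two scripts can do this concurrently.
my $written = syswrite($log, $msg);
warn "partial or failed write: $!" unless defined $written && $written == length $msg;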
Appending with ordinary buffered prints will not work reliably, as append mode is more like a shortcut for "open the file, don't truncate it, and seek to the end of the file after opening". So yes, you will lose lines.

How can I print "Done" or "Fail" at the end of the line to stdout in Perl?

I have just begun with Perl, and I want to write my own script to scan a document and convert the resulting TIFF file to a PDF file. If the conversion succeeds (using tiff2pdf), I want to print "Done" at the end of the line, but I can't seem to find a hint on how to do this on the Web.
My guess is that I have to get the geometry of the terminal and count the letters I have already printed, but that seems too complicated. Do you have any advice?
You're right about having to inspect the size of the terminal you're printing to. There are many ways to do that, but the most portable and reliable way I'm aware of is Term::Size::Any.
With that, you can get the width of the terminal you're running in:
use Term::Size::Any qw(chars);
my $cols = chars *STDOUT{IO};
You can then print whatever you want, padded with the right amount of whitespace, e.g.:
printf "% ${cols}s", "Done\n";
Also be aware that programs don't always output to terminals. Output could, for example, be redirected to a file, so you might want to have an appropriate fallback if determining the terminal size fails.
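Putting it together, a small sketch with a fallback width for when STDOUT isn't a terminal (the 80-column default and the $ok flag holding the tiff2pdf result are arbitrary choices for illustration):

use Term::Size::Any qw(chars);

# Use the real terminal width when we have a terminal, otherwise fall back.
my $cols = -t STDOUT ? scalar chars(*STDOUT{IO}) : 0;
$cols ||= 80;

# $ok is a hypothetical flag: true if the tiff2pdf conversion succeeded.
printf "%${cols}s\n", $ok ? "Done" : "Fail";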