Perl printing binary to files - CR LF

I am not a regular Perl programmer and I could not find anything about this in the forum or the few books I have.
I am trying to write binary data to a file using the construct:
print filehandle $record
I note that all of my records truncate when an x'0A' is encountered, so apparently Perl uses the LF as an end-of-record indicator. How can I write the complete records, using, for example, a length specifier? I am worried about Perl tampering with other binary "non-printables" as well.
thanks
Fritz

You want to use
open(my $fh, '<', $qfn) or die $!;
binmode($fh);
or
open(my $fh, '<:raw', $qfn) or die $!;
to prevent modifications. Same goes for output handles.
This "truncation at 0A" talk makes it sound like you're using readline and expect to do something other than read a line.
Well, actually, it can! You just need to tell readline you want it to read fix width records.
local $/ = \4096;
while (my $rec = <$fh>) {
...
}
The other alternative would be to use read.
while (1) {
my $rv = read($fh, my $rec, 4096);
die $! if !defined($rv);
last if !$rv;
...
}
See the documentation for binmode, open, read, readline (aka <> and <$fh>), and $/.
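Since the question also asks about a length specifier: a minimal sketch of length-prefixed records using pack and unpack might look like the following (the 4-byte 'N' prefix and the file name records.bin are arbitrary choices for illustration):
use strict;
use warnings;
# Write each record preceded by a 4-byte big-endian length prefix.
open my $out, '>:raw', 'records.bin' or die $!;
for my $record ("alpha\x0Abeta", "gamma\x00delta") {
    print $out pack('N', length $record), $record;
}
close $out or die $!;
# Read the records back: first the 4-byte length, then that many bytes.
open my $in, '<:raw', 'records.bin' or die $!;
while (read($in, my $len_buf, 4)) {
    my $len = unpack 'N', $len_buf;
    read($in, my $record, $len) == $len or die "short read";
    # ... process $record, embedded line feeds and all ...
}
close $in;
Because the handles are opened with :raw, the embedded x'0A' and x'00' bytes pass through untouched in both directions.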

Perl is not "tampering" with your writes. If your records are being truncated when they encounter a line feed, then that's a problem with the code that reads them, not the code that writes them. (Unless the format specifies that line feeds must be escaped, in which case the "problem" with the code writing the file is that it doesn't tamper with the data (by escaping line feeds) and instead writes exactly what you tell it to.)
Please provide a small (but runnable) code sample demonstrating your issue, ideally including both reading and writing, along with the actual result and the desired result, and we'll be able to give more specific help.
Note, however, that \n does not map directly to a single data byte (ASCII character) unless you're in binary mode. If the file is being read or written in text mode, \n could be just a CR, just an LF, or a CRLF, depending on the operating system it's being run under.
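For instance, on Windows this hedged sketch (file names invented for the example) writes two bytes in text mode but only one after binmode:
open my $fh, '>', 'demo.txt' or die $!;
print $fh "\n"; # text mode on Windows: writes the two bytes 0x0D 0x0A
close $fh;
open $fh, '>', 'demo.bin' or die $!;
binmode $fh; # binary mode: no translation
print $fh "\n"; # writes the single byte 0x0A
close $fh;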

Out of memory when serving a very big binary file over HTTP

The code below is the original code of a Perl CGI script we are using. Even for very big files it seems to be working, but not for really huge files.
The current code is:
$files_location = $c->{target_dir}.'/'.$ID;
open(DLFILE, "<$files_location") || Error('open', 'file');
@fileholder = <DLFILE>;
close (DLFILE) || Error ('close', 'file');
print "Content-Type:application/x-download\n";
print "Content-Disposition:attachment;filename=$name\n\n";
print @fileholder;
binmode $DLFILE;
If I understand the code correctly, it is loading the whole file into memory before "printing" it. Of course I suppose it would be a lot better to load and display it in chunks? But after having read many forums and tutorials I am still not sure how to do it best, with standard Perl libraries...
Last question, why is "binmode" specified at the end ?
Thanks a lot for any hint or advice,
I have no idea what binmode $DLFILE is for. $DLFILE has nothing to do with the file handle DLFILE, and it's a bit late to set the binmode of the file now that it has been read to the end. It's probably just a mistake.
You can use this instead. It uses modern Perl best practices and reads and sends the file in 8K chunks.
The file name seems to be made from $ID, so I'm not sure that $name would be correct, but I can't tell.
Make sure to keep the braces, as the block makes Perl restore the old value of $/ and close the open file handle.
my $files_location = "$c->{target_dir}/$ID";
{
print "Content-Type: application/x-download\n";
print "Content-Disposition: attachment; filename=$name\n\n";
open my $fh, '<:raw', $files_location or Error('open', "file $files_location");
local $/ = \( 8 * 1024 );
print while <$fh>;
}
You're pulling the entire file at once into memory. Best to loop over the file line-by-line, which eliminates this problem.
Note also that I've modified the code to use the proper 3-arg open, and to use a lexical file handle instead of a global bareword one.
open my $fh, '<', $files_location or die $!;
print "Content-Type:application/x-download\n";
print "Content-Disposition:attachment;filename=$name\n\n";
while (my $line = <$fh>){
print $line;
}
The binmode call appears to be useless in the context of what you've shown here, as $DLFILE doesn't appear to be a valid, in-use variable (add use strict; and use warnings; at the top of your script...)
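If you'd rather not localize $/ at all, an equivalent sketch using read (the 8K buffer size is an arbitrary choice) would be:
open my $fh, '<:raw', $files_location or Error('open', "file $files_location");
print "Content-Type: application/x-download\n";
print "Content-Disposition: attachment; filename=$name\n\n";
while (1) {
    my $rv = read($fh, my $buf, 8 * 1024);
    die "read failed: $!" if !defined $rv;
    last if !$rv; # zero bytes read means end of file
    print $buf;
}
close $fh;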

Perl incorrectly adding newline characters?

This is my tab delimited input file
Name<tab>Street<tab>Address
This is how I want my output file to look like
Street<tab>Address<tab>Address
(yes duplicate the next two columns) My output file looks like this instead
Street<tab>Address
<tab>Address
What is going on with Perl? This is my code.
open (IN, $ARGV[0]);
open (OUT, ">output.txt");
while ($line = <IN>){
chomp $line;
@line = split /\t/, $line;
$line[2]=~s/\n//g;
print OUT $line[1]."\t".$line[2]."\t".$line[2]."\n";
}
close( OUT);
First of all, you should always
use strict and use warnings for even the most trivial programs. You will also need to declare each of your variables using my as close as possible to their first use
use lexical file handles and the three-parameter form of open
check the success of every open call, and die with a string that includes $! to show the reason for the failure
Note also that there is no need to explicitly open files named on the command line that appear in @ARGV: you can just read from them using <>.
As others have said, it looks like you are reading a file of DOS or Windows origin on a Linux system. Instead of using chomp, you can remove all trailing whitespace characters from each line using s/\s+\z//. Since CR and LF both count as "whitespace", this will remove all line terminators from each record. Beware, however, that, if trailing space is significant or if the last field may be blank, then this will also remove spaces and tabs. In that case, s/[\r\n]+\z// is more appropriate.
This version of your program works fine.
use strict;
use warnings;
@ARGV = 'addr.txt';
open my $out, '>', 'output.txt' or die $!;
while (<>) {
s/\s+\z//;
my @fields = split /\t/;
print $out join("\t", @fields[1, 2, 2]), "\n";
}
close $out or die $!;
If you know beforehand the origin of your data file, and know it to be a DOS-like file that terminates records with CR LF, you can use the PerlIO crlf layer when you open the file. Like this
open my $in, '<:crlf', $ARGV[0] or die $!;
then all records will appear to end in just "\n" when they are read on a Linux system.
A general solution to this problem is to install PerlIO::eol. Then you can write
open my $in, '<:raw:eol(LF)', $ARGV[0] or die $!;
and the line ending will always be "\n" regardless of the origin of the file, and regardless of the platform where Perl is running.
Did you try to eliminate not only the "\n" but also the "\r"?
$line[2] =~ s/\r//g; # strip the carriage return that chomp leaves behind
It could work: DOS line endings are "\r\n" (not just "\n"), so after chomp removes the "\n", a "\r" is still left at the end of the last field.
Another way to avoid end of line problems is to only capture the characters you're interested in:
open (IN, $ARGV[0]);
open (OUT, ">output.txt");
while (<IN>) {
print OUT "$1\t$2\t$2\n" if /^(\w+)\t\w+\t(\w+)\s*/;
}
close( OUT);

In Perl, why does print not generate any output after I close STDOUT?

I have the code:
open(FILE, "<$new_file") or die "Can't open file \n";
@lines = <FILE>;
close FILE;
open(STDOUT, ">$new_file") or die "Can't open file\n";
$old_fh = select(OUTPUT_HANDLE);
$| = 1;
select($old_fh);
for (@lines) {
s/(.*?xsl.*?)xsl/$1xslt/;
print;
}
close(STDOUT);
STDOUT -> autoflush(1);
print "file changed";
After closing STDOUT, the program does not write the last print, print "file changed". Why is this?
Edited: I want to print the message to the console, not to the file.
I suppose it is because print's default filehandle is STDOUT, which at that point is already closed. You could reopen it, or print to another filehandle, for example STDERR.
print STDERR "file changed";
It's because you've closed the filehandle stored in STDOUT, so print can't use it anymore. Generally speaking opening a new filehandle into one of the predefined handle names isn't a very good idea because it's bound to lead to confusion. It's much clearer to use lexical filehandles, or just a different name for your output file. Yes you then have to specify the filehandle in your print call, but then you don't have any confusion over what's happened to STDOUT.
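A sketch of that approach, using a lexical handle for the output file and leaving STDOUT alone so the final message still reaches the console:
open my $in, '<', $new_file or die "Can't open file: $!";
my @lines = <$in>;
close $in;
open my $out, '>', $new_file or die "Can't open file: $!";
for (@lines) {
    s/(.*?xsl.*?)xsl/$1xslt/;
    print $out $_; # explicit handle; STDOUT is untouched
}
close $out;
print "file changed"; # goes to the console as usual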
A print statement will output the string to STDOUT, which is the default output file handle.
So the statement
print "This is a message";
is the same as
print STDOUT "This is a message";
In your code, you have closed STDOUT and are then printing the message, which will not work. Reopen the STDOUT filehandle or do not close it; as the script ends, the file handles will be closed automatically.
open OLDOUT, '>&', \*STDOUT or die "Can't dup STDOUT: $!";
close STDOUT;
open(STDOUT, ">$new_file") or die "Can't open file\n";
...
close(STDOUT);
open STDOUT, '>&', \*OLDOUT or die "Can't restore STDOUT: $!";
print "file changed";
You seem to be confused about how file IO operations are done in Perl, so I would recommend you read up on that.
What went wrong?
What you are doing is:
Open a file for reading
Read the entire file and close it
Open the same file for overwrite (the original file is truncated), using the STDOUT file handle.
Juggle around the default print handle in order to set autoflush on a file handle which is not even opened in the code you show.
Perform a substitution on all lines and print them
Close STDOUT, then print a message when everything is done.
Your biggest mistake is trying to reopen the default output file handle STDOUT. I assume this is because you did not know how print works, i.e. that you can supply a file handle to print to: print FILEHANDLE "text". Or that you did not know that STDOUT is a pre-defined file handle.
Your other errors:
You did not use use strict; use warnings;. No program you write should be without these. They will prevent you from doing bad things, and give you information on errors, and will save you hours of debugging.
You should never "slurp" a file (read the entire file into a variable) unless you really need to, because it is inefficient and slow, and for huge files it will cause your program to crash due to lack of memory.
Never reassign the default file handles STDIN, STDOUT, STDERR, unless A) you really need to, B) you know what you are doing.
select sets the default file handle for print; read the documentation. This is rarely something that you need to concern yourself with. The variable $| sets autoflush on (if set to a true value) for the currently selected file handle. So what you did actually accomplished nothing, because OUTPUT_HANDLE is a non-existent file handle. If you had skipped the select statements, it would have set autoflush for STDOUT. (But you wouldn't have noticed any difference.)
print uses print buffers because it is efficient. I assume you are trying to autoflush because you think your prints get caught in the buffer, which is not true. Generally speaking, this is not something you need to worry about. All the print buffers are automatically flushed when a program ends.
For the most part, you do not need to explicitly close file handles. File handles are automatically closed when they go out of scope, or when the program ends.
Using lexical file handles, e.g. open my $fh, ... instead of global, e.g. open FILE, .. is recommended, because of the previous statement, and because it is always a good idea to avoid global variables.
Using three-argument open is recommended: open FILEHANDLE, MODE, FILENAME. This is because you otherwise risk meta-characters in your file names corrupting your open statement.
The quick fix:
Now, as I said in the comments, this (or rather, what you intended, because this code is wrong) is pretty much identical to the idiomatic usage of the -p command line switch:
perl -pi.bak -e 's/(.*?xsl.*?)xsl/$1xslt/' file.txt
This short little snippet actually does all that your program does, but does it much better. Explanation:
-p switch automatically assumes that the code you provide is inside a while (<>) { } loop, and prints each line, after your code is executed.
-i switch tells perl to do inplace-edit on the file, saving a backup copy in "file.txt.bak".
So, that one-liner is equivalent to a program such as this:
$^I = ".bak"; # turns inplace-edit on
while (<>) { # diamond operator automatically uses STDIN or files from @ARGV
s/(.*?xsl.*?)xsl/$1xslt/;
print;
}
Which is equivalent to this:
my $file = shift; # first argument from @ARGV
open my $fh, "<", $file or die $!;
open my $tmp, ">", "/tmp/foo.bar" or die $!; # not sure where tmpfile is
while (<$fh>) { # read lines from org file
s/(.*?xsl.*?)xsl/$1xslt/;
print $tmp $_; # print line to tmp file
}
rename($file, "$file.bak") or die $!; # save backup
rename("/tmp/foo.bar", $file) or die $!; # overwrite original file
The inplace-edit option actually creates a separate file, then copies it over the original. If you use the backup option, the original file is first backed up. You don't need to know this information; just know that using the -i switch will cause the -p (and -n) option to actually perform changes on your original file.
Using the -i switch with the backup option activated is not required (except on Windows), but recommended. A good idea is to run the one-liner without the option first, so the output is printed to screen instead, and then adding it once you see the output is ok.
The regex
s/(.*?xsl.*?)xsl/$1xslt/;
You search for a string that contains "xsl" twice. The use of .*? is good in the second case, but not in the first. Any time you find yourself starting a regex with a wildcard string, you're probably doing something wrong, unless you are trying to capture that part.
In this case, though, you capture it and remove it, only to put it back, which is completely useless. So the first order of business is to take that part out:
s/(xsl.*?)xsl/$1xslt/;
Now, removing something and putting it back is really just a magic trick for not removing it at all. We don't need magic tricks like that, when we can just not remove it in the first place. Using look-around assertions, you can achieve this.
In this case, since you have a variable length expression and need a look-behind assertion, we have to use the \K (mnemonic: Keep) option instead, because variable length look-behinds are not implemented.
s/xsl.*?\Kxsl/xslt/;
So, since we didn't take anything out, we don't need to put anything back using $1. Now, you may notice, "Hey, if I replace 'xsl' with 'xslt', I don't need to remove 'xsl' at all." Which is true:
s/xsl.*?xsl\K/t/;
You may consider using options for this regex, such as /i, which causes it to ignore case and thus also match strings such as "XSL FOO XSL". Or the /g option which will allow it to perform all possible matches per line, and not just the first match. Read more in perlop.
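As a quick illustration of the \K version (the sample string is invented for the example):
my $str = 'stylesheet.xsl uses transform.xsl';
(my $new = $str) =~ s/xsl.*?xsl\K/t/;
print "$new\n"; # stylesheet.xsl uses transform.xslt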
Conclusion
The finished one-liner is:
perl -pi.bak -e 's/xsl.*?xsl\K/t/' file.txt

Unicode in Perl not working

I have some text files which I am trying to transform with a Perl script on Windows. The text files look normal in Notepad+, but all the regexes in my script were failing to match. Then I noticed that when I open the text files in Notepad+, the status bar says "UCS-2 Little Endia" (sic). I am assuming this corresponds to the encoding UCS-2LE. So I created "readFile" and "writeFile" subs in Perl, like so:
use PerlIO::encoding;
my $enc = ':encoding(UCS-2LE)';
sub readFile {
my ($fName) = @_;
open my $f, "<$enc", $fName or die "can't read $fName\n";
local $/;
my $txt = <$f>;
close $f;
return $txt;
}
sub writeFile {
my ($fName, $txt) = @_;
open my $f, ">$enc", $fName or die "can't write $fName\n";
print $f $txt;
close $f;
}
my $fName = 'someFile.txt';
my $txt = readFile $fName;
# ... transform $txt using s/// ...
writeFile $fName, $txt;
Now the regexes match (although less often than I expect), but the output contains long strings of Asian-looking characters interspersed with long strings of the correct text. Is my code wrong? Or perhaps Notepad+ is wrong about the encoding? How should I proceed?
OK, I figured it out. The problem was being caused by a disconnect between the encoding translation done by the "encoding..." parameter of the "open" call and the default CRLF translation done by Perl on Windows. What appeared to be happening was that LF was being translated to CRLF on output after the encoding had already been done, which threw off the "parity" of the 16-bit encoding for the following line. Once the next line was reached, the "parity" got put back. That would explain the "long strings of Asian-looking characters interspersed with long strings of the correct text"... every other line was being messed up.
To correct it, I took out the encoding parameter in my "open" call and added a "binmode" call, as follows:
open my $f, $fName or die "can't read $fName\n";
binmode $f, ':raw:encoding(UCS-2LE)';
binmode apparently has a concept of "layered" I/O handling that is somewhat complicated.
One thing I can't figure out is how to get my CRLF translation back. If I leave out :raw or add :crlf, the "parity" problem returns. I've tried re-ordering as well and can't get it to work.
(I added this as a separate question: CRLF translation with Unicode in Perl)
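One layer stack that may do the trick, following the same reasoning, is to put :crlf above the encoding layer, so the CRLF translation happens on the decoded characters rather than on the raw byte stream; treat this as an untested sketch:
binmode $f, ':raw:encoding(UCS-2LE):crlf';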
I don't have the Notepad+ editor to check but it may be a BOM problem with your output encoding not containing a BOM.
http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM
Maybe you need to encode $txt using a byte order mark as described above.
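If a missing BOM is indeed the problem, a hedged sketch of a fix is to print U+FEFF yourself as the first character of the output, before any other text:
open my $f, ">$enc", $fName or die "can't write $fName\n";
print $f "\x{FEFF}"; # BOM; the UCS-2LE layer encodes it as the bytes FF FE
print $f $txt;
close $f;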

What's the best way to open and read a file in Perl?

Please note - I am not looking for the "right" way to open/read a file, or the way I should open/read a file every single time. I am just interested to find out what way most people use, and maybe learn a few new methods at the same time :)
A very common block of code in my Perl programs is opening a file and reading or writing to it. I have seen so many ways of doing this, and my style on performing this task has changed over the years a few times. I'm just wondering what the best (if there is a best way) method is to do this?
I used to open a file like this:
my $input_file = "/path/to/my/file";
open INPUT_FILE, "<$input_file" || die "Can't open $input_file: $!\n";
But I think that has problems with error trapping.
Adding a parenthesis seems to fix the error trapping:
open (INPUT_FILE, "<$input_file") || die "Can't open $input_file: $!\n";
I know you can also assign a filehandle to a variable, so instead of using "INPUT_FILE" like I did above, I could have used $input_filehandle - is that way better?
For reading a file, if it is small, is there anything wrong with globbing, like this?
my @array = <INPUT_FILE>;
or
my $file_contents = join( "\n", <INPUT_FILE> );
or should you always loop through, like this:
my @array;
while (<INPUT_FILE>) {
push(@array, $_);
}
I know there are so many ways to accomplish things in Perl; I'm just wondering if there are preferred/standard methods of opening and reading in a file?
There are no universal standards, but there are reasons to prefer one or another. My preferred form is this:
open( my $input_fh, "<", $input_file ) || die "Can't open $input_file: $!";
The reasons are:
You report errors immediately. (Replace "die" with "warn" if that's what you want.)
Your filehandle is now reference-counted, so once you're no longer using it, it will be closed automatically. If you use the global name INPUT_FILEHANDLE, then you have to close the file manually, or it will stay open until the program exits.
The read-mode indicator "<" is separated from the $input_file, increasing readability.
The following is great if the file is small and you know you want all lines:
my #lines = <$input_fh>;
You can even do this, if you need to process all lines as a single string:
my $text = join('', <$input_fh>);
For long files you will want to iterate over lines with while, or use read.
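For example, a minimal line-by-line sketch:
while ( my $line = <$input_fh> ) {
    chomp $line;
    # ... process $line ...
}
close $input_fh;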
If you want the entire file as a single string, there's no need to iterate through it.
use strict;
use warnings;
use Carp;
use English qw( -no_match_vars );
my $data = q{};
{
local $RS = undef; # This makes it just read the whole thing.
my $fh;
croak "Can't open $input_file: $!\n" if not open $fh, '<', $input_file;
$data = <$fh>;
croak 'Some Error During Close :/ ' if not close $fh;
}
The above satisfies perlcritic --brutal, which is a good way to test for 'best practices' :). $input_file is still undefined here, but the rest is kosher.
Having to write 'or die' everywhere drives me nuts. My preferred way to open a file looks like this:
use autodie;
open(my $image_fh, '<', $filename);
While that's very little typing, there are a lot of important things to note which are going on:
We're using the autodie pragma, which means that all of Perl's built-ins will throw an exception if something goes wrong. It eliminates the need for writing or die ... in your code, it produces friendly, human-readable error messages, and has lexical scope. It's available from the CPAN.
We're using the three-argument version of open. It means that even if we have a funny filename containing characters such as <, > or |, Perl will still do the right thing. In my Perl Security tutorial at OSCON I showed a number of ways to get 2-argument open to misbehave. The notes for this tutorial are available for free download from Perl Training Australia.
We're using a scalar file handle. This means that we're not going to be coincidently closing someone else's file handle of the same name, which can happen if we use package file handles. It also means strict can spot typos, and that our file handle will be cleaned up automatically if it goes out of scope.
We're using a meaningful file handle. In this case it looks like we're going to read an image.
The file handle ends with _fh. If we see it being used like a regular scalar, then we know that it's probably a mistake.
If your files are small enough that reading the whole thing into memory is feasible, use File::Slurp. It reads and writes full files with a very simple API, plus it does all the error checking so you don't have to.
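A minimal sketch of that (the file names are placeholders):
use File::Slurp;
my $text  = read_file('input.txt');  # whole file as one string
my @lines = read_file('input.txt');  # or as a list of lines
write_file('output.txt', $text);     # croaks on error by default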
There is no best way to open and read a file. It's the wrong question to ask. What's in the file? How much data do you need at any point? Do you need all of the data at once? What do you need to do with the data? You need to figure those out before you think about how you need to open and read the file.
Is anything that you are doing now causing you problems? If not, don't you have better problems to solve? :)
Most of your question is merely syntax, and that's all answered in the Perl documentation (especially perlopentut). You might also like to pick up Learning Perl, which answers most of the problems you have in your question.
Good luck, :)
It's true that there are as many best ways to open a file in Perl as there are
$files_in_the_known_universe * $perl_programmers
...but it's still interesting to see who usually does it which way. My preferred form of slurping (reading the whole file at once) is:
use strict;
use warnings;
use IO::File;
my $file = shift @ARGV or die "what file?";
my $fh = IO::File->new( $file, '<' ) or die "$file: $!";
my $data = do { local $/; <$fh> };
$fh->close();
# If you didn't just run out of memory, you have:
printf "%d characters (possibly bytes)\n", length($data);
And when going line-by-line:
my $fh = IO::File->new( $file, '<' ) or die "$file: $!";
while ( my $line = <$fh> ) {
print "Better than cat: $line";
}
$fh->close();
Caveat lector of course: these are just the approaches I've committed to muscle memory for everyday work, and they may be radically unsuited to the problem you're trying to solve.
I once used the
open (FILEIN, "<", $inputfile) or die "...";
my @FileContents = <FILEIN>;
close FILEIN;
boilerplate regularly. Nowadays, I use File::Slurp for small files that I want to hold completely in memory, and Tie::File for big files that I want to scalably address and/or files that I want to change in place.
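A hedged Tie::File sketch (the file name is a placeholder):
use Tie::File;
# Each element of @lines maps to a line of the file; assignments are
# written back to disk, so the whole file never has to fit in memory.
tie my @lines, 'Tie::File', 'big.log' or die "can't tie: $!";
$lines[0] = '# edited in place';
untie @lines;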
For OO, I like:
use FileHandle;
...
my $handle = FileHandle->new( "< $file_to_read" );
croak( "Could not open '$file_to_read'" ) unless $handle;
...
my $line1 = <$handle>;
my $line2 = $handle->getline;
my @lines = $handle->getlines;
$handle->close;
Read the entire file $file into variable $text with a single line
$text = do { local (@ARGV, $/) = $file; <> };
or as a function
$text = load_file($file);
sub load_file { local (@ARGV, $/) = @_; <> }
If these programs are just for your productivity, whatever works! Build in as much error handling as you think you need.
Reading in a whole file if it's large may not be the best way long-term to do things, so you may want to process lines as they come in rather than load them up in an array.
One tip I got from one of the chapters in The Pragmatic Programmer (Hunt & Thomas) is that you might want to have the script save a backup of the file for you before it goes to work slicing and dicing.
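A minimal sketch of that tip, using the core File::Copy module (the .bak suffix is an arbitrary choice):
use File::Copy;
copy($file, "$file.bak") or die "backup failed: $!";
# ... now slice and dice $file ...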
The || operator has higher precedence than the comma, so it binds to "<$input_file" (which is always true) before the argument list is passed to open, meaning the die can never be triggered. In the code you've mentioned, use the low-precedence or operator instead, and you wouldn't have that problem.
open INPUT_FILE, "<$input_file"
or die "Can't open $input_file: $!\n";
Damian Conway does it this way:
$data = readline!open(!((*{!$_},$/)=\$_)) for "filename";
But I don't recommend that to you.