Setting diamond operator to work in binmode? - perl

The diamond operator (<>) works in text mode by default; is it possible to switch it to binmode? The binmode function seems to accept a handle only.

See perldoc perlopentut:
Binary Files
On certain legacy systems with what could charitably be called terminally
convoluted (some would say broken) I/O models, a file isn't a file--at
least, not with respect to the C standard I/O library. On these old
systems whose libraries (but not kernels) distinguish between text and
binary streams, to get files to behave properly you'll have to bend over
backwards to avoid nasty problems. On such infelicitous systems, sockets
and pipes are already opened in binary mode, and there is currently no way
to turn that off. With files, you have more options.
Another option is to use the "binmode" function on the appropriate handles
before doing regular I/O on them:
binmode(STDIN);
binmode(STDOUT);
while (<STDIN>) { print }
Passing "sysopen" a non-standard flag option will also open the file in
binary mode on those systems that support it. This is the equivalent of
opening the file normally, then calling "binmode" on the handle.
sysopen(BINDAT, "records.data", O_RDWR | O_BINARY)
    || die "can't open records.data: $!";
Now you can use "read" and "print" on that handle without worrying about
the non-standard system I/O library breaking your data. It's not a pretty
picture, but then, legacy systems seldom are. CP/M will be with us until
the end of days, and after.
On systems with exotic I/O systems, it turns out that, astonishingly
enough, even unbuffered I/O using "sysread" and "syswrite" might do sneaky
data mutilation behind your back.
while (sysread(WHENCE, $buf, 1024)) {
    syswrite(WHITHER, $buf, length($buf));
}
Depending on the vicissitudes of your runtime system, even these calls may
need "binmode" or "O_BINARY" first. Systems known to be free of such
difficulties include Unix, the Mac OS, Plan 9, and Inferno.

<> is a convenience. If it only iterated through filenames specified on the command line, you could use $ARGV within while (<>) to detect when a new file was opened, binmode it, and then seek back to the beginning. Of course, this does not work in the presence of redirection (console input is a whole other story).
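For illustration, a rough sketch of that idea (it assumes plain file arguments and breaks down as soon as input arrives via a redirect or pipe):

binmode STDOUT;
my $current = '';
while (<>) {
    if ($ARGV ne $current) {
        $current = $ARGV;
        binmode ARGV;      # ARGV is the handle <> just opened
        seek ARGV, 0, 0;   # rewind; this first line was read in text mode
        next;              # the next <> re-reads that line in binary mode
    }
    print;
}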
One solution is to detect whether @ARGV contains anything, open each file individually, and default to reading from STDIN. A rudimentary implementation of this using an iterator could be:
#!/usr/bin/env perl

use strict;
use warnings;
use Carp qw( croak );

my $argv = sub {
    @_ or return sub {
        my $done;
        sub {
            $done and return;
            $done = 1;
            binmode STDIN;
            \*STDIN;
        }
    }->();

    my @argv = @_;

    sub {
        @argv or return;
        my $file = shift @argv;
        open my $fh, '<', $file
            or croak "Cannot open '$file': $!";
        binmode $fh;
        $fh;
    };
}->(@ARGV);

binmode STDOUT;

while (my $fh = $argv->()) {
    while (my $line = <$fh>) {
        print $line;
    }
}
Note:
C:\...\Temp> xxd test.txt
00000000: 7468 6973 2069 7320 6120 7465 7374 0a0a this is a test..
Without binmode:
C:\...\Temp> perl -e "print <>" test.txt | xxd
00000000: 7468 6973 2069 7320 6120 7465 7374 0d0a this is a test..
00000010: 0d0a ..
With the script above:
C:\...\Temp> perl argv.pl test.txt | xxd
00000000: 7468 6973 2069 7320 6120 7465 7374 0a0a this is a test..
Same results using perl ... < test.txt | xxd, or piping text through perl ...


simply tee in Perl without fork, File::Tee, or piping to tee

Is there a simple way in Perl to send STDOUT or STDERR to multiple places without forking, using File::Tee, or opening a pipe to /usr/bin/tee?
Surely there is a way to do this in pure perl without writing 20+ lines of code, right? What am I missing? Similar questions have been asked, both here on SO and elsewhere, but none of the answers satisfy the requirements that I not have to
fork
use File::Tee / IO::Tee / some other module+dependencies
whose code footprint is 1000x larger than my actual script
open a pipe to the actual tee command
I can see the use of a Core module as a tradeoff here, but really is that needed?
It looks like I can simply do this:
BEGIN {
    open my $log, '>>', 'error.log' or die $!;
    $SIG{__WARN__} = sub { print $log @_ and print STDERR @_ };
    $SIG{__DIE__}  = sub { warn @_ and exit 1 };
}
This simply and effectively sends most error messages both to the original STDERR and to a log file (apparently stuff trapped in an eval doesn't show up, I'm told). So there are downsides to this, mentioned in the comments. But as mentioned in the original question, the need was specific. This isn't meant for reuse. It's for a simple, small script that will never be more than 100 lines long.
If you are looking for a way to do this that isn't a "hack", the following was adapted from http://grokbase.com/t/perl/beginners/096pcz62bk/redirecting-stderr-with-io-tee
use IO::Tee;
open my $save_stderr, '>&STDERR' or die $!;
close STDERR;
open my $error_log, '>>', 'error.log' or die $!;
*STDERR = IO::Tee->new( $save_stderr, $error_log ) or die $!;

How reliable is the -B file test?

When I open a SQLite database file there is a lot of readable text at the beginning of the file - how big is the chance that a SQLite file is wrongly filtered away due to the -B file test?
#!/usr/bin/env perl
use warnings;
use strict;
use 5.10.1;
use File::Find;

my $dir = shift;
my $databases;

find( {
    wanted => sub {
        my $file = $File::Find::name;
        return if not -B $file;
        return if not -s $file;
        return if not -r $file;
        say $file;
        open my $fh, '<', $file or die "$file: $!";
        my $firstline = readline( $fh ) // '';
        close $fh or die $!;
        push @$databases, $file if $firstline =~ /\ASQLite\sformat/;
    },
    no_chdir => 1,
}, $dir );

say scalar @$databases;
The perlfunc man page has the following to say about -T and -B:
The -T and -B switches work as follows. The first block or so of the file is
examined for odd characters such as strange control codes or characters with
the high bit set. If too many strange characters (>30%) are found, it's a -B
file; otherwise it's a -T file. Also, any file containing a zero byte in the
first block is considered a binary file.
Of course you could now do a statistical analysis of a number of sqlite files, parse their "first block or so" for "odd characters", calculate the probability of their occurrence, and that would give you an idea of how likely it is that -B fails for sqlite files.
However, you could also go the easy route. Can it fail? Yes, it's a heuristic. And a bad one at that. So don't use it.
File type recognition on Unix is usually done by evaluating the file's content. And yes, there are people who've done all the work for you already: it's called libmagic (the thingy that yields the file command line tool). You can use it from Perl with e.g. File::MMagic.
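As a quick illustration, here is a minimal sketch using File::MMagic (assuming the module is installed); checktype_filename guesses a MIME-ish type from the file's contents rather than its name, and the script name is only a placeholder:

use strict;
use warnings;
use File::MMagic;

my $mm = File::MMagic->new;
for my $file (@ARGV) {
    # Report the content-based type guess for each file given on the command line
    printf "%s: %s\n", $file, $mm->checktype_filename($file);
}

You could run it as perl whatis.pl *.db and compare its answers with what -B tells you.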
Well, all files are technically a collection of bytes, and thus binary. Beyond that, there is no accepted definition of binary, so it's impossible to evaluate -B's reliability unless you care to posit a definition by which it is to be evaluated.

Is there an issue with opening filenames provided on the command line through $_?

I'm having trouble modifying a script that processes files passed as command line arguments, merely for copying those files, so that it additionally modifies those files. The following perl script worked just fine for copying files:
use strict;
use warnings;
use File::Copy;

foreach $_ (@ARGV) {
    my $orig = $_;
    (my $copy = $orig) =~ s/\.js$/_extjs4\.js/;
    copy($orig, $copy) or die(qq{failed to copy $orig -> $copy});
}
Now that I have files named "*_extjs4.js", I would like to pass those into a script that similarly takes file names from the command line, and further processes the lines within those files. So far I am able to get a file handle successfully, as the following script and its output show:
use strict;
use warnings;

foreach $_ (@ARGV) {
    print "$_\n";
    open(my $fh, "+>", $_) or die $!;
    print $fh;
    #while (my $line = <$fh>) {
    #    print $line;
    #}
    close $fh;
}
Which outputs (in part):
./filetree_extjs4.js
GLOB(0x1a457de8)
./async_submit_extjs4.js
GLOB(0x1a457de8)
What I really want to do, though, rather than printing a representation of the file handle, is to work with the contents of the files themselves. A start would be to print the files' lines, which I've tried to do with the commented out code above.
But that code has no effect, the files' lines do not get printed. What am I doing wrong? Is there a conflict between the $_ used to process command line arguments, and the one used to process file contents?
It looks like there are a couple of questions here.
What I really want to do, though, rather than printing a representation of the file handle, is to work with the contents of the files themselves.
The reason why print $fh is returning GLOB(0x1a457de8) is because the scalar $fh is a filehandle and not the contents of the file itself. To access the contents of the file itself, use <$fh>. For example:
while (my $line = <$fh>) {
    print $line;
}
# or simply: print while <$fh>;
will print the contents of the entire file.
This is documented in perldoc perlop:
If what the angle brackets contain is a simple scalar variable (e.g.,
<$foo>), then that variable contains the name of the filehandle to
input from, or its typeglob, or a reference to the same.
But it has already been tried!
I can see that. Try it after changing the open mode to +<.
According to perldoc perlfaq5:
How come when I open a file read-write it wipes it out?
Because you're using something like this, which truncates the file
then gives you read-write access:
open my $fh, '+>', '/path/name'; # WRONG (almost always)
Whoops. You should instead use this, which will fail if the file
doesn't exist:
open my $fh, '+<', '/path/name'; # open for update
Using ">" always clobbers or creates. Using "<" never does either. The
"+" doesn't change this.
It goes without saying that the or die $! after the open is highly recommended.
But take a step back.
There is a more Perlish way to back up the original file and subsequently manipulate it. In fact, it is doable via the command line itself (!) using the -i flag:
$ perl -p -i._extjs4 -e 's/foo/bar/g' *.js
See perldoc perlrun for more details.
I can't fit my needs into the command-line.
If the manipulation is too much for the command-line to handle, the Tie::File module is worth a try.
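For instance, here is a minimal Tie::File sketch (Tie::File ships with Perl): every element of the tied array is one line of the file, and edits are written straight back to disk. The file name and substitution below are only placeholders:

use strict;
use warnings;
use Tie::File;

# Tie the file's lines to @lines; changes to elements update the file in place
tie my @lines, 'Tie::File', 'filetree_extjs4.js'
    or die "Cannot tie file: $!";
s/foo/bar/g for @lines;   # hypothetical edit, applied line by line
untie @lines;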
To read the contents of a filehandle you have to call readline or read, or place the filehandle in angle brackets <>.
my $line = readline $fh;
my $actually_read = read $fh, $text, $bytes;
my $line = <$fh>; # similar to readline
To print to a filehandle other than STDOUT you have to pass it as the first argument to print, followed by what you want to print, without a comma between them.
print $fh 'something';
To prevent someone from accidentally adding a comma, I prefer to put the filehandle in a block.
print {$fh} 'something';
You could also select your new handle.
{
    my $oldfh = select $fh;
    print 'something';
    select $oldfh; # reset it back to the previous handle
}
Also, your mode argument to open causes it to clobber the contents of the file, at which point there is nothing left to read.
Try this instead:
open my $fh, '+<', $_ or die;
I'd like to add something to Zaid's excellent suggestion of using a one-liner.
When you are new to perl, and trying some tricky regexes, it can be nice to use a source file for them, as the command line may get rather crowded. For example:
The file:
#!/usr/bin/perl
use warnings;
use strict;
s/complicated/regex/g;
While tweaking the regex, use the source file like so:
perl -p script.pl input.js
perl -p script.pl input.js > testfile
perl -p script.pl input.js | less
Note that you don't use the -i flag here while testing. These commands will not change the input files, only print the changes to stdout.
When you're ready to execute the (permanent!) changes, just add the in-place edit -i flag, and if you wish (recommended), supply an extension for backups, e.g. ".bak".
perl -pi.bak script.pl *.js

CRLF translation with Unicode in Perl

I'm trying to write to a Unicode (UCS-2 Little Endian) file in Perl on Windows, like this.
open my $f, ">$fName" or die "can't write $fName\n";
binmode $f, ':raw:encoding(UCS-2LE)';
print $f "ohai\ni can haz unicodez?\nkthxbye\n";
close $f;
It basically works except I no longer get the automatic LF -> CR/LF translation on output that I get on regular text files. (The output files just have LF) If I leave out :raw or add :crlf in the "binmode" call, then the output file is garbled. I've tried re-ordering the "directives" (i.e. :encoding before :raw) and can't get it to work. The same problem exists for reading.
This works for me on windows:
open my $f, ">:encoding(UCS-2LE):crlf", "test.txt";
print $f "ohai\ni can haz unicodez?\nkthxbye\n";
close $f;
Yielding UCS-2 LE output in test.txt of
ohai
i can haz unicodez?
kthxbye
Here is what I have found to work, at least with perl 5.10.1:
Input:
open(my $f_in, '<:raw:perlio:via(File::BOM):crlf', $file);
Output:
open(my $f_out, '>:raw:perlio:encoding(UTF-16LE):crlf:via(File::BOM)', $file);
These handle BOM, CRLF translation, and UTF-16LE encoding/decoding transparently.
Note that according to the perlmonks post below, if trying to specify with binmode() instead of open(), an extra ":pop" is required:
binmode $f_out, ':raw:pop:perlio:encoding(UTF-16LE):crlf';
which my experience corroborates. I was not able to get this to work with the ":via(File::BOM)" layer, however.
References:
http://www.perlmonks.org/?node_id=608532
http://metacpan.org/pod/File::BOM
The :crlf layer does a simple byte mapping of 0x0A -> 0x0D 0x0A (\n --> \r\n) in the output stream, but for the most part this isn't valid in any wide character encoding.
How about using a raw mode but explicitly print the CR?
print $f "ohai\r\ni can haz unicodez?\r\nkthxbye\r\n";
Or if portability is a concern, discover and explicitly use the correct line ending:
## never mind - $/ doesn't work
# print $f "ohai$/i can haz unicodez?$/kthxbye$/";
open DUMMY, '>', 'dummy'; print DUMMY "\n"; close DUMMY;
open DUMMY, '<:raw', 'dummy'; $EOL = <DUMMY>; close DUMMY;
unlink 'dummy';
...
print $f "ohai${EOL}i can haz unicodez?${EOL}kthxbye${EOL}";
Unrelated to the question, but Ωmega asked in a comment about the difference between :raw and :bytes. As documented in perldoc perlio, you can think of :raw as removing all I/O layers, and :bytes as removing a :utf8 layer. Compare the output of these two commands:
$ perl -E 'binmode *STDOUT,":crlf:raw"; say' | od -c
0000000 \n
0000001
$ perl -E 'binmode *STDOUT,":crlf:bytes";say' | od -c
0000000 \r \n
0000002

How do I read UTF-8 with diamond operator (<>)?

I want to read UTF-8 input in Perl, no matter if it comes from the standard input or from a file, using the diamond operator: while(<>){...}.
So my script should be callable in these two ways, as usual, giving the same output:
./script.pl utf8.txt
cat utf8.txt | ./script.pl
But the outputs differ! Only the second call (using cat) seems to work as designed, reading UTF-8 properly. Here is the script:
#!/usr/bin/perl -w
binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';
while (<>) {
    my @chars = split //, $_;
    print "$_\n" foreach (@chars);
}
How can I make it read UTF-8 correctly in both cases? I would like to keep using the diamond operator <> for reading, if possible.
EDIT:
I realized I should probably describe the different outputs. My input file contains this sequence: a\xCA\xA7b. The method with cat correctly outputs:
a
\xCA\xA7
b
But the other method gives me this:
a
\xC3\x8A
\xC2\xA7
b
Try to use the pragma open instead:
use strict;
use warnings;
use open qw(:std :utf8);

while (<>) {
    my @chars = split //, $_;
    print "$_" foreach (@chars);
}
You need to do this because the <> operator is magical. As you know, it will read from STDIN or from the files in @ARGV. Reading from STDIN causes no problem, since STDIN is already open and binmode works well on it. The problem is reading from the files in @ARGV: when your script starts and calls binmode, those files are not open yet. That call sets STDIN to UTF-8, but that IO channel is not used when @ARGV has files. In that case the <> operator opens a new file handle for each file in @ARGV, and each of those handles starts out without the UTF-8 layer. By using the open pragma you force every newly opened handle to be in UTF-8.
Your script works if you do this:
#!/usr/bin/perl -w
binmode STDOUT, ':utf8';
while (<>) {
    binmode ARGV, ':utf8';
    my @chars = split //, $_;
    print "$_\n" foreach (@chars);
}
The magic filehandle that <> reads from is called *ARGV, and it is
opened when you call readline.
But really, I am a fan of explicitly using Encode::decode and
Encode::encode when appropriate.
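For completeness, a minimal sketch of that explicit style, using the core Encode module; input is read as bytes, decoded to characters by hand, and encoded again on output (it mirrors the question's script):

use strict;
use warnings;
use Encode qw(decode encode);

while (my $bytes = <>) {
    my $line  = decode('UTF-8', $bytes);         # bytes -> characters
    my @chars = split //, $line;
    print encode('UTF-8', "$_\n") for @chars;    # characters -> bytes
}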
You can switch on UTF8 by default with the -C flag:
perl -CSD -ne 'print join("\n",split //);' utf8.txt
The switch -CSD turns on UTF8 unconditionally; if you use simply -C it will turn on UTF8 only if the relevant environment variables (LC_ALL, LC_CTYPE and LANG) indicate so. See perlrun for details.
This is not recommended if you don't invoke perl directly (in particular, it might not work reliably if you pass options to perl from the shebang line). See the other answers in that case.
If you put a call to binmode inside of the while loop, then it will switch the handle to utf8 mode AFTER the first line is read in. That is probably not what you want to do.
Something like the following might work better:
#!/usr/bin/env perl -w
binmode STDOUT, ':utf8';
eof() ? exit : binmode ARGV, ':utf8';
while (<>) {
    my @chars = split //, $_;
    print "$_\n" foreach (@chars);
} continue {
    binmode ARGV, ':utf8' if eof && !eof();
}
The call to eof() with parens is magical, as it checks for end of file on the pseudo-filehandle used by <>. It will, if necessary, open the next handle that needs to be read, which typically has the effect of making *ARGV valid, but without reading anything out of it. This allows us to binmode the first file that's read from, before anything is read from it.
Later, eof (without parens) is used; this checks the last handle that was read from for end of file. It will be true after we process the last line of each file from the command line (or when stdin reaches its end).
Obviously, if we've just processed the last line of one file, calling eof() (with parens) opens the next file (if there is one), makes *ARGV valid (if it can), and tests for end of file on that next file. If that next file is present, and isn't at end of file, then we can safely use binmode on ARGV.