"Turn Off" binmode(STDOUT, ":utf8") Locally - perl

I have the following block at the beginning of my script:
#!/usr/bin/perl5 -w
use strict;
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
In some subroutines, when data arrives in a different encoding (from a distant subroutine), it does not display correctly, for example when receiving Cyrillic or other characters. It is the binmode setting that causes the problem.
Can I "turn off" the :utf8 binmode locally, for the subroutine only?
I can't remove the global binmode setting, and I can't change the distant encoding.

One way to achieve this is to "dup" the STD handle, set the duplicated filehandle to use the :raw layer, and assign it to a local version of the STD handle. For example, the following code
binmode(STDOUT, ':utf8');
print(join(', ', PerlIO::get_layers(STDOUT)), "\n");
{
    open(my $duped, '>&', STDOUT);
    # The ':raw' argument could also be omitted.
    binmode($duped, ':raw');
    local *STDOUT = $duped;
    print(join(', ', PerlIO::get_layers(STDOUT)), "\n");
    close($duped);
}
print(join(', ', PerlIO::get_layers(STDOUT)), "\n");
prints
unix, perlio, utf8
unix, perlio
unix, perlio, utf8
on my system.
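If you need this in more than one place, the pattern wraps naturally in a helper. A minimal sketch (the name with_raw_stdout is mine, not from the code above):
sub with_raw_stdout {
    my ($code) = @_;
    open(my $duped, '>&', STDOUT) or die "Can't dup STDOUT: $!";
    binmode($duped, ':raw');
    local *STDOUT = $duped; # restored automatically when the sub returns
    my @result = $code->();
    close($duped);
    return @result;
}
# Anything printed inside the coderef goes through the :raw handle:
with_raw_stdout(sub { print "raw bytes, no :utf8 layer\n" });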

I like @nwellnhof's approach. Dealing only with Unicode and ASCII (a luxury few enjoy), my instinct would be to leave the bytes as they are and selectively use Encode's decode()/encode() when needed. If you can determine which of your data sources are problematic, you could insert a decode step when dealing with them.
% file koi8r.txt
koi8r.txt: ISO-8859 text
% cat koi8r.txt
������ �� ����� � ������� ���. ���
���� ����� ������ ����� �����.
% perl -CO -MEncode="encode,decode" -E 'say decode("koi8-r", <>) ;' koi8r.txt
Американские суда находятся в международных водах. Япония
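The same selective decoding inside a script might look like this (a minimal sketch; the koi8r.txt file name and koi8-r encoding are taken from the example above):
use Encode qw(decode);

binmode(STDOUT, ':utf8'); # output layer as in the question
open(my $in, '<:raw', 'koi8r.txt') or die "open: $!";
while (my $line = <$in>) {
    # Decode only this known-problematic source; UTF-8 sources
    # would be passed through to print untouched.
    print decode('koi8-r', $line);
}
close($in);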

You could use something like Scope::Guard (lexically-scoped resource management) to ensure the layer gets set back to :utf8 when you leave the scope, regardless of how you leave it (return, die, whatever):
#!/usr/bin/perl -w
use strict;
use Scope::Guard qw(guard);
binmode(STDOUT, ':utf8');
print(join(', ', PerlIO::get_layers(STDOUT)), "\n");
{
    # When $guard goes out of scope, this sub is guaranteed to be called:
    my $guard = guard {
        binmode(STDOUT, ':utf8');
    };
    binmode(STDOUT, ':raw');
    print(join(', ', PerlIO::get_layers(STDOUT)), "\n");
}
print(join(', ', PerlIO::get_layers(STDOUT)), "\n");
Or, if you don't want to pull in a new dependency like Scope::Guard (which is awesome for exactly this kind of localizing), you can roll your own:
#!/usr/bin/perl -w
use strict;
binmode(STDOUT, ':utf8');
print(join(', ', PerlIO::get_layers(STDOUT)), "\n");
{
    my $guard = PoorMansGuard->new(sub {
        binmode(STDOUT, ':utf8');
    });
    binmode(STDOUT, ':raw');
    print(join(', ', PerlIO::get_layers(STDOUT)), "\n");
}
print(join(', ', PerlIO::get_layers(STDOUT)), "\n");

package PoorMansGuard;

sub new {
    my ($class, $sub) = @_;
    bless { sub => $sub }, $class;
}

sub DESTROY {
    my ($self) = @_;
    $self->{sub}->();
}
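A quick way to convince yourself the restore also happens on an exception (a sketch using the Scope::Guard variant from above):
eval {
    my $guard = guard { binmode(STDOUT, ':utf8') };
    binmode(STDOUT, ':raw');
    die "something went wrong\n";
};
# $guard was destroyed during stack unwinding, so the :utf8
# layer is already back by the time we get here:
print(join(', ', PerlIO::get_layers(STDOUT)), "\n"); # unix, perlio, utf8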

Related

Why is the utf8 IO layer stripped when spawning a daemon process?

The following code works as expected:
use feature qw(say);
use strict;
use warnings;
use open qw/:std IN :encoding(utf-8) OUT :utf8/;
say join ' ', (PerlIO::get_layers(\*STDOUT));
my $pid = fork();
die "fork() failed: $!" unless defined $pid;
if ($pid == 0) {
    say join ' ', (PerlIO::get_layers(\*STDOUT));
}
Output:
unix perlio utf8
unix perlio utf8
But if I use a daemon process instead of a regular fork:
use feature qw(say);
use strict;
use warnings;
use open qw/:std IN :encoding(utf-8) OUT :utf8/;
use Cwd qw(getcwd);
use Proc::Daemon;
my $work_dir = getcwd;
my $daemon = Proc::Daemon->new(
    work_dir     => $work_dir,
    child_STDOUT => 'stdout.txt',
    child_STDERR => 'stderr.txt',
    pid_file     => 'pid.txt',
);
my $pid = $daemon->Init();
if ( $pid == 0 ) {
    say join ' ', (PerlIO::get_layers(\*STDOUT));
}
The output to the file stdout.txt is:
unix perlio
so the utf8 IO layer has been stripped off.
Consider this example:
Foo.pm:
package Foo;
use warnings;
use strict;
sub test {
    close STDIN;
    open \*STDIN, "<", "/dev/null";
}
1;
test.pl:
#!/usr/bin/perl
use strict;
use warnings;
use feature qw/say/;
use open qw/:std IN :encoding(UTF-8) OUT :encoding(UTF-8)/;
use lib qw/./;
use Foo;
$, = ' ';
say "Original STDIN:", PerlIO::get_layers(\*STDIN);
close STDIN;
open \*STDIN, "<", "/dev/null";
say "Reopened STDIN:", PerlIO::get_layers(\*STDIN);
Foo::test();
say "Reopened STDIN in different package:", PerlIO::get_layers(\*STDIN);
Results:
$ perl test.pl
Original STDIN: unix perlio encoding(utf-8-strict) utf8
Reopened STDIN: unix perlio encoding(utf-8-strict) utf8
Reopened STDIN in different package: unix perlio
Looks like use open behaves like other pragmas: it is lexically scoped and only applies to the file it's in. So when Proc::Daemon closes standard input, output, and error and then reopens them from its own file, it naturally doesn't see the extra layers.
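If the child does need the layers, one workaround (my own sketch, not a Proc::Daemon feature) is to re-apply them after Init() returns in the child:
my $pid = $daemon->Init();
if ( $pid == 0 ) {
    # Proc::Daemon reopened STDOUT onto stdout.txt without the layers
    # from `use open`, so push the encoding layer back on by hand.
    binmode(STDOUT, ':encoding(UTF-8)') or die "binmode failed: $!";
    say join ' ', (PerlIO::get_layers(\*STDOUT));
}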

binmode(STDOUT, ":utf8"); and Unix line endings in Strawberry perl

With Strawberry Perl v5.28.1 on Windows 10, I am trying to achieve the same result as on Linux, namely to get a UTF-8 encoded file with Unix line endings.
Here is my Perl script:
#!perl -w
use strict;
use utf8;
use Encode qw(encode_utf8);
use Digest::MD5 qw(md5_hex);
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
my %words;
while(<>) {
    # change yo to ye
    tr/ёЁ/еЕ/;
    # extract russian word and its optional explanation
    next unless /^([А-Я]{2,})\|?([А-Я ,-]*)/i;
    my ($word, $expl) = (uc $1, $2);
    if (length($word) <= 3) {
        print $word;
        # if explanation is missing, omit the pipe
        print (length($expl) > 3 ? "|$expl\x0A" : "\x0A");
    } else {
        # print the md5 hash and omit the pipe and explanation
        print md5_hex(encode_utf8('my secret' . $word)) . "\x0A";
    }
}
Here is my input file:
ААК|Плоскодонное речное судно
ААРОНОВЕЦ|
ААРОНОВЩИНА|
ААТ|Драгоценный красный камень в Японии
АБА|Толстое и редкое белое сукно
АБАЖУР|
АБАЖУРОДЕРЖАТЕЛЬ|
АБАЗ|Грузинская серебряная монета
АБАЗА|
Here is how I run it (I use type instead of < because I have numerous input files in my real use case):
type input.txt | perl encode-words-ru.pl > output.txt
Regardless of what I try in the Perl source code above, the lines in output.txt are terminated by \x0D\x0A.
Please help me to stop Perl from "helping" me!
There is probably a better way, but you could make STDOUT a :raw filehandle and then encode the output yourself:
binmode STDOUT; # or binmode STDOUT, ":raw";
...
print (length($expl) > 3 ? encode_utf8("|$expl\n") : "\n"); # $expl is already decoded
...
print md5_hex(encode_utf8('my secret' . $word)) . "\n";
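Alternatively, if you would rather let PerlIO keep doing the encoding, you can pop the default layers and push the encoding layer back on without :crlf (a sketch; I haven't verified this on Strawberry Perl specifically):
# :raw removes the :crlf layer Windows installs by default;
# :encoding(UTF-8) then goes on top, so print still takes decoded
# text and the "\x0A" you print reaches the file unmodified.
binmode(STDOUT, ':raw:encoding(UTF-8)');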

Filehandle Quirk Perl

In the following code, if there is a space between FILE and ( in the printf statement, like
printf FILE ("Test string inline\n");
Perl will treat FILE as a filehandle; otherwise,
printf FILE("Test string inline\n");
will be treated as a subroutine call (if no subroutine named FILE is defined, Perl will throw the error Undefined subroutine &main::FILE called at ./test.pl line xx). Isn't there a better way Perl could handle this? (Maybe this is why bareword filehandles are considered outdated?)
#!/usr/bin/perl
use warnings;
open(FILE,">test.txt");
printf FILE ("Test string inline\n");
close(FILE);
sub FILE
{
    return("Test string subroutine\n");
}
Are you asking how to avoid hitting that error accidentally? You could wrap the handle in curlies:
printf({ HANDLE } $pattern, @args);
print({ HANDLE } @args);
say({ HANDLE } @args);
Or, since parens are often omitted for say, print, and printf:
printf { HANDLE } $pattern, @args;
print { HANDLE } @args;
say { HANDLE } @args;
Or you could use a method call:
HANDLE->printf($pattern, @args);
HANDLE->print(@args);
HANDLE->say(@args);
Try:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
use charnames qw( :full :short );
use English qw( -no_match_vars ); # Avoids regex performance penalty
my $test_file = 'test.txt';
open my $test_fh, '>', $test_file or die "could not open $test_file: $OS_ERROR\n";
printf {$test_fh} "Test string inline" or die "could not print $test_file: $OS_ERROR\n";
close $test_fh or die "could not close $test_file: $OS_ERROR\n";
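A lexical filehandle also sidesteps the original quirk entirely: with the handle stored in a scalar like $test_fh, there is no bareword left for the parser to mistake for a subroutine call.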

Perl - File Encoding and Word Comparison

I have a file with one phrase/term per line, which I read into Perl from STDIN. I have a list of stopwords (like "á", "são", "é") and I want to compare each of them with each term, removing the term if they are equal. The problem is that I'm not certain of the file's encoding format.
I get this from the file command:
words.txt: Non-ISO extended-ASCII English text
My Linux terminal is in UTF-8 and it shows the right content for some words but not for others. Here is the output for some of them:
condi<E3>
conte<FA>dos
ajuda, mas não resolve
mo<E7>ambique
pedagógico são fenómenos
You can see that the 3rd and 5th lines correctly show the words with accents and special characters while the others don't. The correct output for the other lines should be: condiã, conteúdos and moçambique.
If I use binmode(STDOUT, ":utf8"), the "incorrect" lines now output correctly while the other ones don't. For example, the 3rd line:
ajuda, mas não resolve
What should I do, guys?
I strongly suggest you create a filter that takes a file with lines in mixed encodings and translates them to pure UTF-8. Then, instead of
open(INPUT, "< badstuff.txt") || die "open failed: $!";
you would open either the fixed version, or a pipe from the fixer, like:
open(INPUT, "fixit < badstuff.txt |") || die "open failed: $!"
In either event, you would then
binmode(INPUT, ":encoding(UTF-8)") || die "binmode failed";
Then the fixit program could just do this:
use strict;
use warnings;
use Encode qw(decode FB_CROAK);
binmode(STDIN, ":raw") || die "can't binmode STDIN";
binmode(STDOUT, ":utf8") || die "can't binmode STDOUT";
while (my $line = <STDIN>) {
    my $copy = $line; # decode() with FB_CROAK can clobber its input
    my $decoded = eval { decode("UTF-8", $copy, FB_CROAK()) };
    if ($@) {
        $decoded = decode("CP1252", $line, FB_CROAK()); # no eval{}!
    }
    $decoded =~ s/\R\z/\n/; # fix raw mode reads
    print STDOUT $decoded;
}
close(STDIN) || die "can't close STDIN: $!";
close(STDOUT) || die "can't close STDOUT: $!";
exit 0;
See how that works? Of course, you could change it to default to some other encoding, or have multiple fallbacks. Probably it would be best to take a list of them in @ARGV.
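That variation might look like this (my own sketch, replacing the while loop above; the fallback list comes from the command line, defaulting to UTF-8 then CP1252):
my @encodings = @ARGV ? @ARGV : ('UTF-8', 'CP1252');
while (my $line = <STDIN>) {
    my $decoded;
    for my $enc (@encodings) {
        my $copy = $line; # decode() with FB_CROAK can clobber its input
        $decoded = eval { decode($enc, $copy, FB_CROAK()) };
        last if defined $decoded;
    }
    die "no listed encoding fits line $.\n" unless defined $decoded;
    $decoded =~ s/\R\z/\n/; # fix raw mode reads
    print STDOUT $decoded;
}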
It works like this:
C:\Dev\Perl :: chcp
Active code page: 1252.
C:\Dev\Perl :: type mixed-encoding.txt
eins zwei drei Käse vier fünf Wurst
eins zwei drei Käse vier fünf Wurst
C:\Dev\Perl :: perl mixed-encoding.pl < mixed-encoding.txt
eins zwei drei vier fünf
eins zwei drei vier fünf
Where mixed-encoding.pl goes like this:
use strict;
use warnings;
use utf8; # source in UTF-8
use Encode 'decode_utf8';
use List::MoreUtils 'any';
my @stopwords = qw( Käse Wurst );
while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume legacy Latin-1.
        my $token = eval { decode_utf8 $_, Encode::FB_CROAK };
        $token = $_ if $@;
        push @tokens, $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}
Note that the script doesn't have to be encoded in UTF-8. It's just that if you have funky character data in your script, you have to make sure the encoding matches: use utf8 if your source encoding is UTF-8, and don't if it isn't.
Update based on tchrist's sound advice:
use strict;
use warnings;
# source in Latin1
use Encode 'decode';
use List::MoreUtils 'any';
my @stopwords = qw( Käse Wurst );
while ( <> ) { # read octets
    chomp;
    my @tokens;
    for ( split /\s+/ ) {
        # Try UTF-8 first. If that fails, assume 8-bit encoding.
        my $token = eval { decode utf8 => $_, Encode::FB_CROAK };
        $token = decode Windows1252 => $_, Encode::FB_CROAK if $@;
        push @tokens, uc $token unless any { $token eq $_ } @stopwords;
    }
    print "@tokens\n";
}

Convert UTF8 string into numeric values in Perl

For example,
my $str = '中國c'; # Chinese language of china
I want to print out the numeric values
20013,22283,99
unpack will be more efficient than split and ord, because it doesn't have to make a bunch of temporary 1-character strings:
use utf8;
my $str = '中國c'; # Chinese language of china
my @codepoints = unpack 'U*', $str;
print join(',', @codepoints) . "\n"; # prints 20013,22283,99
A quick benchmark shows it's about 3 times faster than split+ord:
use utf8;
use Benchmark 'cmpthese';
my $str = '中國中國中國中國中國中國中國中國中國中國中國中國中國中國c';
cmpthese(0, {
    'unpack'     => sub { my @codepoints = unpack 'U*', $str; },
    'split-map'  => sub { my @codepoints = map { ord } split //, $str },
    'split-for'  => sub { my @cp; for my $c (split(//, $str)) { push @cp, ord($c) } },
    'split-for2' => sub { my $cp; for my $c (split(//, $str)) { $cp = ord($c) } },
});
Results:
                Rate  split-map  split-for  split-for2  unpack
split-map    85423/s         --        -7%        -32%    -67%
split-for    91950/s         8%         --        -27%    -64%
split-for2  125550/s        47%        37%          --    -51%
unpack      256941/s       201%       179%        105%      --
The difference is less pronounced with a shorter string, but unpack is still more than twice as fast. (split-for2 is a bit faster than the other splits because it doesn't build a list of codepoints.)
See perldoc -f ord:
foreach my $c (split(//, $str))
{
    print ord($c), "\n";
}
Or compressed into a single line: my @chars = map { ord } split //, $str;
Data::Dumpered, this produces:
$VAR1 = [
          20013,
          22283,
          99
        ];
To have utf8 in your source code recognized as such, you must use utf8; beforehand:
$ perl
use utf8;
my $str = '中國c'; # Chinese language of china
foreach my $c (split(//, $str))
{
    print ord($c), "\n";
}
__END__
20013
22283
99
or more tersely,
print join ',', map ord, split //, $str;
http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html
#!/usr/bin/env perl
use utf8; # so literals and identifiers can be in UTF-8
use v5.12; # or later to get "unicode_strings" feature
use strict; # quote strings, declare variables
use warnings; # on by default
use warnings qw(FATAL utf8); # fatalize encoding glitches
use open qw(:std :utf8); # undeclared streams in UTF-8
# use charnames qw(:full :short); # unneeded in v5.16
# http://perldoc.perl.org/functions/sprintf.html
# vector flag
# This flag tells Perl to interpret the supplied string as a vector of integers, one for each character in the string.
my $str = '中國c';
printf "%*vd\n", ",", $str;