Perl UTF8 output to a variable - perl

I have the following Perl code, in which I am opening a handle to a scalar variable and writing some utf8 text to it:
use warnings;
use strict;
use 5.010;
use utf8;
use open qw( :std :encoding(utf8) );
my $output;
open my $oh, ">", \$output;
say $oh "Žluťoučký kůň.";
close $oh;
say "Žluťoučký kůň.";
print $output;
and when I run it I get the following output:
Žluťoučký kůň.
ŽluÅ¥ouÄký kůÅ.
(without any warnings or errors). So, obviously, writing a UTF-8 string into a variable via a handle does not work correctly here, as the string seems to be double-encoded. I have tried opening $oh with >:raw, >:bytes, >:encoding(ascii), but none of it helped.
I might be doing something stupid but I cannot figure how to fix this. Any ideas?

First of all, :encoding(utf8) should be :encoding(utf-8).
UTF-8 is the well known encoding standard.
utf8 is a Perl-specific extension to UTF-8.
Reference
(Encoding names are case-insensitive.)
use open qw( :std :encoding(utf8) ); has two effects:
It adds :encoding(utf8) to STDIN, STDOUT and STDERR.
It sets the default layer for open in the lexical scope of the use to :encoding(utf8).
So,
use utf8;
use open qw( :std :encoding(UTF-8) );
# String of decoded text aka string of Unicode Code Points, thanks to `use utf8`.
my $text_ucp = "Žluťoučký kůň.";
# $text_utf8 will contain text encoded using UTF-8 thanks to `use open`.
open my $oh, ">", \my $text_utf8;
say $oh $text_ucp;
close $oh;
# ok. Will encode the decoded text using UTF-8 thanks to `use open`.
say $text_ucp;
# XXX. Will encode the already-encoded text using UTF-8 thanks to `use open`.
print $text_utf8;
You tried to override the second effect of use open to obtain a file of Unicode Code Points, but that's futile since files can only contain bytes. Some kind of encoding or failure must occur if you try to store something other than bytes in a file.
So live with it, and decode the "file" before using it.
use utf8;
use open qw( :std :encoding(UTF-8) );
use Encode qw( decode_utf8 );
my $text_ucp = "Žluťoučký kůň.";
open my $oh, ">", \my $text_utf8;
say $oh $text_ucp;
close $oh;
my $text2_ucp = decode_utf8($text_utf8);
# ... Do stuff with $text_ucp and/or $text2_ucp ...
say $text_ucp;
say $text2_ucp;
It is possible to avoid the decode by working directly with UTF-8 in the second half of the program.
use utf8;
BEGIN { binmode(STDERR, ":encoding(UTF-8)"); } # We'll handle STDOUT manually.
use open qw( :encoding(UTF-8) );
use Encode qw( encode_utf8 );
my $text_ucp = "Žluťoučký kůň.";
open my $oh, ">", \my $text_utf8;
say $oh $text_ucp;
close $oh;
say encode_utf8($text_ucp);
say $text_utf8;
Of course, that means you can't use $text_utf8 anywhere that expects decoded text.
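The double encoding from the question can also be reproduced without any file handles, which may make the mechanism easier to see. This is a minimal sketch using Encode directly; the names $twice_encoded and $round_trip are illustrative and do not appear in the code above.

```perl
use strict;
use warnings;
use utf8;                           # The source code itself is UTF-8.
use Encode qw( encode decode );

my $text_ucp  = "Žluťoučký kůň.";            # Decoded text (Unicode Code Points).
my $text_utf8 = encode('UTF-8', $text_ucp);  # Encoded text (a string of bytes).

# Encoding the bytes a second time mimics printing already-encoded text
# to a handle that has an :encoding(UTF-8) layer.
my $twice_encoded = encode('UTF-8', $text_utf8);

# Decoding once recovers the original text.
my $round_trip = decode('UTF-8', $text_utf8);

binmode(STDOUT, ':encoding(UTF-8)');
printf("%d characters, %d bytes, %d bytes when encoded twice\n",
    length($text_ucp), length($text_utf8), length($twice_encoded));
print("Round trip ok\n") if $round_trip eq $text_ucp;
```

The growing lengths show that each encoding pass turns every non-ASCII character or byte into a multi-byte sequence, which is why the doubly encoded text prints as garbage.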

Related

Why does File::Slurp get UTF8 characters wrong when I use open ':std', ':encoding(UTF-8)';?

I have a Perl 5.30.0 program on Ubuntu where the combination of File::Slurp and open ':std', ':encoding(UTF-8)' results in UTF8 not getting read correctly:
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use File::Slurp;
my $text = File::Slurp::slurp('input.txt');
print "$text\n";
with "input.txt" being a UTF-8 encoded text file with this content (no BOM):
ö
When I run this, the ö gets displayed as Ã¶. Only when I remove the use open... line does it work as expected, and the ö is printed as an ö.
When I manually read the file like below, everything works as expected and I do get the ö:
$text = '';
open my $F, '<', "input.txt" or die "Cannot open file: $!";
while (<$F>) {
$text .= $_;
}
close $F;
print "$text\n";
Why is that and what is the best way to go here? Is the open pragma outdated or am I missing something else?
As with many pragmas,[1] the effect of use open is lexically-scoped.[2] This means it only affects the remainder of the block or file in which it's found. Such a pragma doesn't affect code in functions outside of its scope, even if they are called from within its scope.
You need to communicate the desire to decode the stream to File::Slurp. This can't be done using slurp, but it can be done using read_file via its binmode parameter.
use open ':std', ':encoding(UTF-8)'; # Still want for effect on STDOUT.
use File::Slurp qw( read_file );
my $text = read_file('input.txt', { binmode => ':encoding(UTF-8)' });
A better module is File::Slurper.
use open ':std', ':encoding(UTF-8)'; # Still want for effect on STDOUT.
use File::Slurper qw( read_text );
my $text = read_text('input.txt');
File::Slurper's read_text defaults to decoding using UTF-8.
Without modules, you could use
use open ':std', ':encoding(UTF-8)';
my $text = do {
    my $qfn = "input.txt";
    open(my $fh, '<', $qfn)
        or die("Can't open file \"$qfn\": $!\n");
    local $/;    # Slurp mode: read the whole file at once.
    <$fh>
};
Of course, that's not as clear as the earlier solutions.
[1] Other notable examples include use VERSION, use strict, use warnings, use feature and use utf8.
[2] The effect on STDIN, STDOUT and STDERR from :std is global.
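The lexical scoping described in this answer can be observed by inspecting the layers of freshly opened handles. This is a minimal sketch, assuming in-memory "files" behave like real ones with respect to default layers; layers_outside and layers_inside are hypothetical helper names.

```perl
use strict;
use warnings;

# Defined OUTSIDE the scope of `use open`, so no default layer applies here,
# no matter where the function is called from.
sub layers_outside {
    my $buf = '';
    open(my $fh, '<', \$buf) or die("Can't open in-memory file: $!\n");
    return join(':', PerlIO::get_layers($fh));
}

{
    use open ':encoding(UTF-8)';   # Lexically scoped: affects this block only.

    sub layers_inside {
        my $buf = '';
        open(my $fh, '<', \$buf) or die("Can't open in-memory file: $!\n");
        return join(':', PerlIO::get_layers($fh));
    }

    # Calling layers_outside() from within the scope changes nothing.
    print("outside (called from inside): ", layers_outside(), "\n");
}

print("inside: ", layers_inside(), "\n");
```

This mirrors the File::Slurp situation: the module's own open calls are compiled outside the scope of the caller's use open, so they never see the default layer.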
Not really an answer to your question, but my favourite file I/O module these days is Path::Tiny.
use Path::Tiny;
my $text = path('input.txt')->slurp_utf8;

Sorting UTF-8 input

I need to sort lines from a file saved as UTF-8. These lines can start with Cyrillic or Latin characters. My code sorts the Cyrillic ones incorrectly.
sub sort_by_default {
    my @sorted_lines = sort {
        $a <=> $b
        ||
        fc($a) cmp fc($b)
    } @_;
}
The cmp used with sort can't help with this; it has no notion of encodings and merely compares by codepoint, character by character, with surprises in many languages. Use Unicode::Collate.† See this post for a bit more, and this post by tchrist and this perl.com article for far more.
The other issue is of reading (decoding) input and writing (encoding) output in utf8 correctly. One way to ensure that data on standard streams is handled is via the open pragma, with which you can set "layers" so that input and output is decoded/encoded as data is read/written.
Altogether, an example
use warnings;
use strict;
use feature 'say';
use Unicode::Collate;
use open ":std", ":encoding(UTF-8)";
my $file = ...;
open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
chomp @lines;
my $uc = Unicode::Collate->new();
my @sorted = $uc->sort(@lines);
say for @sorted;
The module's cmp method can be used for individual comparisons (if data is in a complex data structure and not just a flat list of lines, for instance)
my @sorted = sort { $uc->cmp($a, $b) } @data;
where $a and $b need to be used suitably to extract what to compare from @data.
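As an illustration of extracting the comparison key inside the sort block, here is a minimal sketch. The records and their name and id fields are hypothetical; only the use of Unicode::Collate's cmp method comes from the text above.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

my $uc = Unicode::Collate->new();

# Hypothetical records; only the `name` field takes part in the comparison.
my @data = (
    { name => 'b', id => 1 },
    { name => 'ä', id => 2 },
    { name => 'a', id => 3 },
);

# $a and $b are the records; pull out the field to compare from each.
my @sorted = sort { $uc->cmp($a->{name}, $b->{name}) } @data;
my @names  = map { $_->{name} } @sorted;

binmode(STDOUT, ':encoding(UTF-8)');
print(join(' ', @names), "\n");
```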
If you have utf8 data right in the source you need use utf8, while if you receive utf8 via yet other channels (from @ARGV included) you may need to manually Encode::decode those strings.
Please see the linked post (and links in it) and documentation for more detail. See this perlmonks post for far more rounded information. See this Effective Perler article on custom sorting.
† Example: by codepoint comparison ä > b while the accepted order in German is ä < b
perl -MUnicode::Collate -wE'use utf8; binmode STDOUT, ":encoding(UTF-8)";
@s = qw(ä b);
say join " ", sort { $a cmp $b } @s; #--> b ä
say join " ", Unicode::Collate->new->sort(@s); #--> ä b
'
so we need to use Unicode::Collate (or a custom sort routine).
To open a file saved as UTF-8, use the appropriate layer:
open my $FH, '<:encoding(UTF-8)', 'filename' or die $!;
Don't forget to set the same layer for the output.
#! /usr/bin/perl
use warnings;
use strict;
binmode *DATA, ':encoding(UTF-8)';
binmode *STDOUT, ':encoding(UTF-8)';
print for sort <DATA>;
__DATA__
Борис
Peter
John
Владимир
The key to handle UTF-8 correctly in Perl is to make sure that Perl knows that a certain source or destination of information is in UTF-8. This is done differently depending on the way you get info in or out. If the UTF-8 is coming from an input file, the way to open the file is:
open( my $fh, '<:encoding(UTF-8)', "filename" ) or die "Cannot open file: $!\n";
If you are going to have UTF-8 inside the source of your script, then make sure you have:
use utf8;
At the beginning of the script.
If you are going to get UTF-8 characters from STDIN, use this at the beginning of the script:
binmode(STDIN, ':encoding(UTF-8)');
For STDOUT use:
binmode(STDOUT, ':encoding(UTF-8)');
Also, make sure you read UTF-8 vs. utf8 vs. UTF8 to know the difference between each encoding name. utf8 or UTF8 will allow valid UTF-8 and also non-valid UTF-8 (according to the first UTF-8 proposed standard) and will not complain about non-valid codepoints. UTF-8 will allow valid UTF-8 but will not allow non-valid codepoint combinations; it is a short name for utf-8-strict. You may also read the question How do I sanitize invalid UTF-8 in Perl?
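The difference between the encoding names can be checked by asking Encode what each spelling resolves to. A minimal sketch; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( find_encoding );

# Look up the canonical name behind each spelling.
my $strict      = find_encoding('UTF-8')->name;  # Hyphenated: the strict encoding.
my $strict_low  = find_encoding('utf-8')->name;  # Lookup is case-insensitive.
my $lax         = find_encoding('utf8')->name;   # No hyphen: Perl's lax variant.
my $lax_upper   = find_encoding('UTF8')->name;

print("UTF-8 => $strict\n");
print("utf-8 => $strict_low\n");
print("utf8  => $lax\n");
print("UTF8  => $lax_upper\n");
```

So it is the hyphen, not the letter case, that selects the strict encoding.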
Finally, following @zdim's advice, you may use at the beginning of the script:
use open ':encoding(UTF-8)';
And other variants as described here. That will set the encoding layer for all open instructions that do not specify a layer explicitly.

Perl: substitute .* to §

I have to substitute multiple substrings from expressions like $fn = '1./(4.*z.^2-1)' in Perl (v5.24.1@Win10):
$fn =~ s/.\//_/g;
$fn =~ s/.\*/§/g;
$fn =~ s/.\^/^/g;
but § is not working; I get a ┬º in the expression result (1_(4┬ºz^2-1)). I need this for folder and file names and it worked fine in Matlab@Win10 with fn = strrep(fn, '.*', '§').
How can I get the § in the Perl substitution result?
It works for me:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;
use open IO => ':encoding(UTF-8)', ':std';
my $fn = '1./(4.*z.^2-1)';
s/.\//_/g,
s/.\*/§/g,
s/.\^/^/g
for $fn;
say $fn;
Output:
1_(4§z^2-1)
You can see use utf8, it tells Perl the source code is in UTF-8. Make sure you save the source as UTF-8, then.
The use open sets the UTF-8 encoding for standard input and output. Make sure the terminal to which you print is configured to work in UTF-8, too.
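A side note on the patterns themselves: an unescaped . in a regex matches any character, so s/.\*/§/g replaces any character followed by *, not only a literal dot. For the input shown it makes no difference, but a stricter variant is sketched below, assuming only the three operators ./, .* and .^ need replacing. The %replacement table and $name are illustrative names.

```perl
use strict;
use warnings;
use utf8;    # The § in the source is UTF-8.

# Map each dotted operator to its file-name-safe replacement.
my %replacement = (
    './' => '_',
    '.*' => '§',
    '.^' => '^',
);

my $fn = '1./(4.*z.^2-1)';

# \. matches a literal dot; the character class lists the three operators.
(my $name = $fn) =~ s{(\.[/*^])}{$replacement{$1}}g;

binmode(STDOUT, ':encoding(UTF-8)');
print("$name\n");
```

Doing all three replacements in one pass also avoids one substitution accidentally acting on the output of an earlier one.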

Understanding lexical scoping of "use open ..." of Perl

use open qw( :encoding(UTF-8) :std );
The above statement seems to be effective in its lexical scope only and should not affect anything outside of its scope. But I have observed the following.
$ cat data
€
#1
$ perl -e '
open (my $fh, "<:encoding(UTF-8)", "data");
print($_) while <$fh>;'
Wide character in print at -e line 1, <$fh> line 1.
€
The Wide character ... warning is perfect here. But
#2
$ perl
my ($fh, $row);
{
use open qw( :encoding(UTF-8) :std );
open ($fh, "<", "data");
}
$row = <$fh>;
chomp($row);
printf("%s (0x%X)", $row, ord($row));
€ (0x20AC)
This does not show the wide character warning! Here is what's going on, in my opinion:
We are using open pragma to set the IO streams to UTF-8, including STDOUT.
Opening the file inside the same scope. It reads the character as multibyte char.
But printing outside the scope. The print statement should show "Wide character" warning, but it is not. Why?
Now look at the following, a little variation
#3
my ($fh, $row);
{
use open qw( :encoding(UTF-8) :std );
}
open ($fh, "<", "data");
$row = <$fh>;
chomp($row);
printf("%s (0x%X)", $row, ord($row));
⬠(0xE2)
Now this time since the open statement is out of the lexical scope, the open opened the file in non utf-8 mode.
Does this mean use open qw( :encoding(UTF-8) :std ); statement changes the STDOUT globally but STDIN within lexical scope?
You aren't using STDIN. You're opening a file with an explicit encoding (except for your last example) and reading from that.
The use open qw(:std ...) affects the standard file handles, but you're only using standard output. When you don't use that and print UTF-8 data to standard output, you get the warning.
In your last example, you don't read the data with an explicit encoding, so when you print it to standard output, it's already corrupted.
That's the trick of encodings no matter what they are. Every part of the process has to be correct.
If you want use open to affect all file handles, you have to import it differently. There are several examples in the top of the documentation.
Unfortunately, the open qw(:std) pragma does not seem to behave as a lexical pragma, since it changes the IO layers associated with the standard handles STDIN, STDOUT and STDERR globally. Even code earlier in the source file is affected, since the use statement happens at compile time. So the following
say join ":", PerlIO::get_layers(\*STDIN);
{
use open qw( :encoding(UTF-8) :std );
}
prints ( on my linux platform ) :
unix:perlio:encoding(utf-8-strict):utf8
whereas without the use open qw( :encoding(UTF-8) :std ) it would just print
unix:perlio.
A way to not affect the global STDOUT for example is to duplicate the handle within a lexical scope and then add IO layers to the duplicate handle within that scope:
use feature qw(say);
use strict;
use warnings;
use utf8;
my $str = "€";
say join ":", PerlIO::get_layers(\*STDOUT);
{
open ( my $out, '>&STDOUT' ) or die "Could not duplicate stdout: $!";
binmode $out, ':encoding(UTF-8)';
say $out $str;
}
say join ":", PerlIO::get_layers(\*STDOUT);
say $str;
with output:
unix:perlio
€
unix:perlio
Wide character in say at ./p.pl line 16.
€

Perl code Save ANSI encoding format xml file into UTF-8 encoding

I need to change the encoding of a file from ANSI to UTF-8. Please suggest how to do this. I have tried some methods, but they didn't work. Here is the code I have written:
use utf8;
use File::Slurp;
$File_Name="c:\\test.xml";
$file_con=read_file($File_Name);
open (OUT, ">c:\\b.xml");
binmode(OUT, ":utf8");
print OUT $file_con;
close OUT;
Assuming you have a valid XML file, this would do it:
use XML::LibXML qw( );
my $doc = XML::LibXML->new()->parse_file('text.xml');
$doc->setEncoding('UTF-8');
open(my $fh, '>:raw', 'test.utf8.xml')
or die("Can't create test.utf8.xml: $!\n");
print($fh $doc->toString());
This handles both converting the encoding and adjusting the <?xml?> directive. The previous answers left the wrong encoding in the <?xml?> directive.
If you just want to make a filter, try this:
perl -MEncode -pwe "s/(.*)/encode('utf8', $1)/e;"
For example (using double quotes around the program, as required by the Windows command prompt):
type c:\text.xml |perl -MEncode -pwe "s/(.*)/encode('utf8', $1)/e;" >c:\b.xml
Or modifying your code:
use File::Slurp;
use Encode;
$File_Name="c:\\test.xml";
$file_con=read_file($File_Name);
open (OUT, ">c:\\b.xml");
print OUT encode('utf8', $file_con);
close OUT;
Use Text::Iconv:
use Text::Iconv;
$converter = Text::Iconv->new("cp1252", "utf-8");
$converted = $converter->convert($file_con);
(assuming you are using codepage 1252 as your default codepage).
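If neither XML::LibXML nor Text::Iconv is available, the core Encode module can do the same byte-level conversion. This is a minimal sketch, assuming (as above) that "ANSI" means code page 1252; cp1252_to_utf8 is a hypothetical helper name. Like the other non-XML approaches, it does not adjust the encoding named in the <?xml?> declaration.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: takes cp1252-encoded bytes, returns UTF-8-encoded bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);   # Bytes -> decoded text.
    return encode('UTF-8', $text);         # Decoded text -> UTF-8 bytes.
}

# "\xE9" is e-acute and "\x80" is the euro sign in cp1252.
my $utf8 = cp1252_to_utf8("caf\xE9 \x80");

# Print the resulting bytes in hex, separated by dots.
printf("%vX\n", $utf8);
```

When applying this to a file, both the input and output handles would need to be in binary mode (for example via binmode) so that no additional layer re-encodes the bytes.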