Win32::Console::ANSI and uri_unescape - perl

When I run this script in a WinXP terminal with CP850, the Ü and ö are displayed correctly. When I uncomment the use Win32::Console::ANSI; line, the output is broken.
Is this behavior expected, or is it a bug?
#!perl
use warnings;
use strict;
use 5.10.0;
binmode STDOUT, ':encoding(cp850)';
use Encode qw(decode_utf8);
use URI::Escape qw(uri_unescape);
#use Win32::Console::ANSI;
my $uri_escaped = '%C3%9Cberraschungsei+R%C3%B6ntgen';
say $uri_escaped;
my $uri_unescaped = uri_unescape( $uri_escaped );
say $uri_unescaped;
my $utf8_decoded = decode_utf8( $uri_unescaped );
say "Result: $utf8_decoded";
%C3%9Cberraschungsei+R%C3%B6ntgen
"\x{009c}" does not map to cp850 at C:perl.pl line 15.
Ã\x{009c}berraschungsei+Röntgen
Result: Überraschungsei+Röntgen
With Win32::Console::ANSI enabled:
%C3%9Cberraschungsei+R%C3%B6ntgen
"\x{009c}" does not map to cp850 at C:perl.pl line 15.
Ç\x{009c}berraschungsei+RÇôntgen
Result: sberraschungsei+R"ntgen

Use the ANSI code page (cp1252) rather than the OEM one.
>chcp
Active code page: 437
>perl a.pl cp437
%C3%9Cberraschungsei+R%C3%B6ntgen
Überraschungsei+Röntgen
>perl -MWin32::Console::ANSI a.pl cp1252
%C3%9Cberraschungsei+R%C3%B6ntgen
Überraschungsei+Röntgen
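The a.pl used above is not shown. As a rough sketch of the idea (escaped_to_codepage is a hypothetical helper name, not taken from the answer): unescape to UTF-8 bytes, decode those bytes to characters, then encode for whichever code page the console really uses. The warning in the question comes from printing the still-undecoded bytes through :encoding(cp850), where the byte 0x9C is treated as the character U+009C, which cp850 cannot represent.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use URI::Escape qw(uri_unescape);

# Hypothetical helper: percent-escaped UTF-8 in, bytes for the given code page out.
sub escaped_to_codepage {
    my ($escaped, $codepage) = @_;
    my $bytes = uri_unescape($escaped);      # raw UTF-8 bytes, not characters
    my $chars = decode('UTF-8', $bytes);     # Perl character string
    return encode($codepage, $chars);        # bytes in the target code page
}

my $escaped = '%C3%9Cberraschungsei+R%C3%B6ntgen';
for my $cp (qw(cp850 cp1252)) {
    my $out = escaped_to_codepage($escaped, $cp);
    # Show the bytes as hex so the result does not depend on the terminal.
    printf "%-6s %s\n", $cp, join ' ', map { sprintf '%02X', ord } split //, $out;
}
```

The two lines differ only in the bytes for Ü and ö (9A and 94 in cp850, DC and F6 in cp1252), so which one displays correctly depends on the code page the console is using.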

Related

Using Term::ReadLine with Unicode input

I am trying to figure out how to read Unicode input from the terminal using Term::ReadLine. It turns out, if I enter a Unicode character at the prompt, the returned string varies depending on various settings. (I am running Ubuntu 14.10, and have installed Term::ReadLine::Gnu). For example (p.pl):
use open qw( :std :utf8 );
use strict;
use warnings;
use Devel::Peek;
use Term::ReadLine;
my $term = Term::ReadLine->new('ProgramName');
$term->ornaments( 0 );
my $ans = $term->readline("Enter message: ");
Dump ( $ans );
Running p.pl and typing å at the prompt gives output:
Enter message: å
SV = PV(0x83a5a0) at 0x87c080
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x917500 "\303\245"\0
CUR = 2
LEN = 10
So the returned string $ans does not have the UTF-8 flag set. However, if I run the program using perl -CS p.pl, the output is:
Enter message: å
SV = PVMG(0x24c12e0) at 0x23050a0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x248faf0 "\303\245"\0 [UTF8 "\x{e5}"]
CUR = 2
LEN = 10
Now the UTF-8 flag is correctly set on $ans. So the first question is: why is the command-line option -CS different from using the pragma use open qw( :std :utf8 )?
Next, I tested Term::ReadLine::Stub with the -CS option:
$ PERL_RL=Stub perl -CS p.pl
The output is now:
Enter message: å
SV = PV(0xf97260) at 0xfd90c8
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x10746e0 "\303\203\302\245"\0 [UTF8 "\x{c3}\x{a5}"]
CUR = 4
LEN = 10
and the output string $ans has been doubly encoded, so the output is corrupted. Is this a bug, or is it expected behavior?
As explained by Denis Ibaev in his answer, the problem is that Term::ReadLine does not read STDIN; it opens a new input filehandle. As an alternative to calling binmode($term->IN, ':utf8'), it turns out that either the command-line option -CS or use open qw( :std :utf8 ) works out of the box with Term::ReadLine if you supply STDIN as an argument to Term::ReadLine->new(), as explained in the answer to this question: Term::Readline: encoding-question.
For example:
use strict;
use utf8;
use open qw( :std :utf8 );
use warnings;
use Term::ReadLine;
my $term = Term::ReadLine->new('Test', \*STDIN, \*STDOUT);
my $answer = $term->readline( 'Enter input: ' );
Term::ReadLine does not read STDIN; it opens a new filehandle. So use open qw(:std :utf8); has no effect on it.
You need to do something like this:
my $term = Term::ReadLine->new('name');
binmode($term->IN, ':utf8');
Update about -CS:
The -C option sets the magic variable ${^UNICODE}. With -CS (or -CI), the expression ${^UNICODE} & 0x0001 is true, and Term::ReadLine sets the UTF-8 flag on the input string when ${^UNICODE} & 0x0001 is true.
Note that -CS is different from binmode($term->IN, ':utf8'). The former only sets the UTF-8 flag on the returned string, while the latter adds a :utf8 layer that decodes the input as it is read.
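To see what -C does to ${^UNICODE}, a small check along these lines may help. It is only a sketch: stdin_bit_set is a hypothetical helper name, and the script just inspects the bit mask, it does not involve Term::ReadLine at all.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the "I" (STDIN) bit of a -C bit mask is set.
sub stdin_bit_set {
    my ($mask) = @_;
    return ($mask & 0x0001) ? 1 : 0;
}

# ${^UNICODE} reflects the -C switch (or the PERL_UNICODE environment variable).
printf "%s = %d, STDIN bit %s\n",
    '${^UNICODE}', ${^UNICODE},
    stdin_bit_set(${^UNICODE}) ? 'set' : 'not set';
```

Running it once as perl check.pl and once as perl -CS check.pl (check.pl being whatever name you save it under) shows the difference. -CS is shorthand for -CIOE, so the mask then has the bits 1, 2 and 4 set.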

Creating filenames with unicode characters

I am looking for some guidelines for how to create filenames with Unicode characters. Consider:
use open qw( :std :utf8 );
use strict;
use utf8;
use warnings;
use Data::Dump;
use Encode qw(encode);
my $utf8_file_name1 = encode('UTF-8', 'æ1', Encode::FB_CROAK | Encode::LEAVE_SRC);
my $utf8_file_name2 = 'æ2';
dd $utf8_file_name1;
dd $utf8_file_name2;
qx{touch $utf8_file_name1};
qx{touch $utf8_file_name2};
print (qx{ls æ*});
The output is:
"\xC3\xA61"
"\xE62"
æ1
æ2
Why doesn't it matter if I encode the filename in UTF8 or not? (The filename still becomes valid UTF8 either way.)
Because of a bug called "The Unicode Bug". The equivalent of the following is happening:
use Encode qw( encode_utf8 is_utf8 );
my $bytes = is_utf8($str) ? encode_utf8($str) : $str;
is_utf8 checks which of two string storage formats is used by the scalar. This is an internal implementation detail you should never have to worry about, except for The Unicode Bug.
Your program works because encode always returns a string for which is_utf8 returns false, and use utf8; always returns a string for which is_utf8 returns true if the string contains non-ASCII characters.
If you don't encode as you should, you will sometimes get the wrong result. For example, if you had used "\x{E6}2" instead of 'æ2', you would have gotten a different file name even though the strings have the same length and the same characters.
$ dir
total 0
$ perl -wE'
use utf8;
$fu="æ";
$fd="\x{E6}";
say sprintf "%vX", $_ for $fu, $fd;
say $fu eq $fd ? "eq" : "ne";
system("touch", $_) for "u".$fu, "d".$fd
'
E6
E6
eq
$ dir
total 0
-rw------- 1 ikegami ikegami 0 Jul 12 12:18 uæ
-rw------- 1 ikegami ikegami 0 Jul 12 12:18 d?
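A minimal sketch of the "encode as you should" advice, assuming the file system expects UTF-8 names (as on the Linux box above). fs_name is a hypothetical helper name. Once the name is encoded explicitly, the internal storage format of the original string no longer matters:

```perl
use strict;
use warnings;
use utf8;                      # this source file is assumed to be saved as UTF-8
use Encode qw(encode);

# Hypothetical helper: always hand the OS explicit UTF-8 bytes.
sub fs_name {
    my ($name) = @_;
    return encode('UTF-8', $name);
}

my $upgraded   = "æ";          # literal under "use utf8": usually the upgraded format
my $downgraded = "\x{E6}";     # same character, usually the downgraded format

for my $name ($upgraded, $downgraded) {
    my $bytes = fs_name($name);
    # Both iterations print the same two bytes.
    print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
}
```

With this in place, something like system("touch", fs_name("d\x{E6}")) should create dæ rather than a file whose name is a lone 0xE6 byte.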

How to use utf8 encode with open pragma

I have a problem with utf8::encode when I use the pragma use open qw(:std :utf8);
Example
#!/usr/bin/env perl
use v5.16;
use utf8;
use open qw(:std :utf8);
use Data::Dumper;
my $word = "+банк";
say Dumper($word);
say utf8::is_utf8($word) ? 1 : 0;
utf8::encode($word);
say Dumper($word);
say utf8::is_utf8($word) ? 1 : 0;
Output
$VAR1 = "+\x{431}\x{430}\x{43d}\x{43a}";
1
$VAR1 = '+банк';
0
When I remove this pragma use open qw(:std :utf8);, everything is OK.
$VAR1 = "+\x{431}\x{430}\x{43d}\x{43a}";
1
$VAR1 = '+банк';
0
Thank you in advance!
If you're going to replace utf8::encode($word); with use open qw(:std :utf8);, you'll actually need to remove the utf8::encode($word);. In the version that doesn't work, you're encoding twice.
utf8::encode is not what you want if you are going to print to a filehandle upon which perl expects to output utf8.
utf8::encode says take this string and give me a string where each character is a byte of the utf8 encoding of the input string. This would normally be only done if you are then going to use that string in some way where perl won't be automatically converting to utf8 if necessary.
If you add a say length($word); after the encode, you will see that $word is 9 characters, not the original 5.
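The length change is easy to check in isolation. A small sketch, assuming the source file is saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;

my $word  = "+банк";
my $chars = length $word;      # counted in characters

utf8::encode($word);           # in place: characters become UTF-8 bytes
my $bytes = length $word;      # now counted in bytes

# ASCII-only output, so no output layer is needed here.
print "$chars characters, $bytes bytes\n";   # prints "5 characters, 9 bytes"
```

Each Cyrillic letter takes two bytes in UTF-8 and the plus sign takes one. If a :utf8 output layer then encodes those 9 bytes again, you get the double encoding described above.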

searching words in Greek in Unix and Perl

I have txt files that are in Greek, and now I want to search for specific words in them using Perl and bash ... the words are like καί, τό, εἰς
I was searching for words in English and now want to replace them with Greek ones, but all I get is ??? mostly... For Perl:
my %word = map { $_ => 1 } qw/name date birth/;
and for bash
for X in name date birth
do
can someone please help me?
#!/usr/bin/perl
use strict;
use warnings;
# Tell Perl your code is encoded using UTF-8.
use utf8;
# Tell Perl input and output is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';
my @words = qw( καί τό εἰς );
my %words = map { $_ => 1 } @words;
my $pat = join '|', map quotemeta, keys %words;
while (<>) {
if (/$pat/) {
print;
}
}
Usage:
script.pl file.in >file.out
Notes:
Make sure the source code is encoded using UTF-8 and that you use use utf8;.
Make sure you use the use open line and specify the appropriate encoding for your data file. (If it's not UTF-8, change it.)

Perl's YAML::XS and unicode

I am trying to use Perl's YAML::XS module on Unicode letters and it doesn't seem to be working the way it should.
I write this in the script (which is saved in utf-8)
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Instead of something sane, -: Å is printed. According to this link, though, it should be working fine.
Yes, when I YAML::XS::Load it back, I got the correct strings again, but I don't like the fact the dumped string seems to be in some wrong encoding.
Am I doing something wrong? I am always unsure about unicode in perl, to be frank...
clarification: my console supports UTF-8. Also, when I print it to file, opened with utf8 handle with open $file, ">:utf8" instead of STDOUT, it still doesn't print correct utf-8 letters.
Yes, you're doing something wrong. You've misunderstood what the link you mentioned means. Dump & Load work with raw UTF-8 bytes; i.e. strings containing UTF-8 but with the UTF-8 flag off.
When you print those bytes to a filehandle with the :utf8 layer, they get interpreted as Latin-1 and converted to UTF-8, producing double-encoded output (which can be read back successfully as long as you double-decode it). You want to binmode STDOUT, ':raw' instead.
Another option is to call utf8::decode on the string returned by Dump. This will convert the raw UTF-8 bytes to a character string (with the UTF-8 flag on). You can then print the string to a :utf8 filehandle.
So, either
use utf8;
binmode STDOUT, ":raw";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Or
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
utf8::decode($s);
print $s;
Likewise, when reading from a file, you want to read in :raw mode or use utf8::encode on the string before passing it to Load.
When possible, you should just use DumpFile & LoadFile, letting YAML::XS deal with opening the file correctly. But if you want to use STDIN/STDOUT, you'll have to deal with Dump & Load.
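The double encoding can be reproduced without YAML::XS. In the sketch below, $dumped is only a stand-in for what Dump returns (UTF-8 bytes with the flag off), built with Encode so the example needs core modules only. hex_dump is a hypothetical helper name.

```perl
use strict;
use warnings;
use utf8;                              # source assumed to be saved as UTF-8
use Encode qw(encode);

# Hypothetical helper: show a byte string as hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# Stand-in for the return value of YAML::XS::Dump: UTF-8 bytes, flag off.
my $dumped = encode('UTF-8', "č: ř\n");

# What a :utf8 layer effectively does to such bytes: each byte is treated
# as a Latin-1 character and encoded again.
my $double = encode('UTF-8', $dumped);

# The fix from the answer: turn the bytes into characters first.
my $chars = $dumped;
utf8::decode($chars);
my $single = encode('UTF-8', $chars);  # what the :utf8 layer then writes

print "dumped: ", hex_dump($dumped), "\n";
print "double: ", hex_dump($double), "\n";
print "single: ", hex_dump($single), "\n";
```

The "single" line matches the "dumped" line byte for byte, while the "double" line is longer because every non-ASCII byte has been expanded into two.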
It works if you don't use binmode STDOUT, ":utf8";. Just don't ask me why.
I use the following for UTF-8 JSON and YAML. There is no error handling, but it shows how it can be done.
The code below gives me the following:
it applies NFC normalisation on input and no NFD on output, so everything is simply kept in NFC
I can edit the YAML/JSON files with UTF-8-enabled vim and bash tools
"inside" Perl, things like \w regexes, lc, uc and so on work (at least for my needs)
the source code is UTF-8, so I can write regexes like /á/
My "boilerplate"...
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use YAML::XS;
use JSON::XS;
run();
exit;
sub run {
my $yfilein = "./in.yaml"; #input yaml
my $jfilein = "./in.json"; #input json
my $yfileout = "./out.yaml"; #output yaml
my $jfileout = "./out.json"; #output json
my $ydata = load_utf8_yaml($yfilein);
my $jdata = load_utf8_json($jfilein);
#the "uc" is not "fully correct" but works for my needs
$ydata->{$_} = uc($ydata->{$_}) for keys %$ydata;
$jdata->{$_} = uc($jdata->{$_}) for keys %$jdata;
save_utf8_yaml($yfileout, $ydata);
save_utf8_json($jfileout, $jdata);
}
#using File::Slurp to read/write files
#NFC only on input, and no NFD on output (change this if you want)
#this ensures that I can edit and copy/paste filenames without problems
sub load_utf8_yaml { return YAML::XS::Load(encode_nfc_read(shift)) }
sub load_utf8_json { return decode_json(encode_nfc_read(shift)) }
sub encode_nfc_read { return encode 'utf8', NFC read_file shift, { binmode => ':utf8' } }
#more efficient
sub rawsave_utf8_yaml { return write_file shift, {binmode=>':raw'}, YAML::XS::Dump shift }
#similar to the json version
sub save_utf8_yaml { return write_file shift, {binmode=>':utf8'}, decode 'utf8', YAML::XS::Dump shift }
sub save_utf8_json { return write_file shift, {binmode=>':utf8'}, JSON::XS->new->pretty(1)->encode(shift) }
You can try the following in.yaml:
---
á: ä
č: ď
é: ě
í: ĺ
ľ: ň
ó: ô
ö: ő
ŕ: ř
š: ť
ú: ů
ü: ű
ý: ž