Using Term::ReadLine with Unicode input - perl

I am trying to figure out how to read Unicode input from the terminal using Term::ReadLine. It turns out that if I enter a non-ASCII character at the prompt, the returned string depends on how the program is run. (I am running Ubuntu 14.10 and have installed Term::ReadLine::Gnu.) For example (p.pl):
use open qw( :std :utf8 );
use strict;
use warnings;
use Devel::Peek;
use Term::ReadLine;
my $term = Term::ReadLine->new('ProgramName');
$term->ornaments( 0 );
my $ans = $term->readline("Enter message: ");
Dump ( $ans );
Running p.pl and typing å at the prompt gives output:
Enter message: å
SV = PV(0x83a5a0) at 0x87c080
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x917500 "\303\245"\0
CUR = 2
LEN = 10
So the returned string $ans does not have the UTF-8 flag set. However, if I run the program using perl -CS p.pl, the output is:
Enter message: å
SV = PVMG(0x24c12e0) at 0x23050a0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x248faf0 "\303\245"\0 [UTF8 "\x{e5}"]
CUR = 2
LEN = 10
Now the UTF-8 flag is correctly set on $ans. So the first question is: why does the command line option -CS behave differently from the pragma use open qw( :std :utf8 )?
Next, I tested Term::ReadLine::Stub with the -CS option:
$ PERL_RL=Stub perl -CS p.pl
The output is now:
Enter message: å
SV = PV(0xf97260) at 0xfd90c8
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x10746e0 "\303\203\302\245"\0 [UTF8 "\x{c3}\x{a5}"]
CUR = 4
LEN = 10
Now the string $ans has been doubly encoded, so the output is corrupted. Is this a bug, or is it expected behavior?

As explained by Denis Ibaev in his answer, the problem is that Term::ReadLine does not read STDIN, it opens a new input filehandle. As an alternative to calling binmode($term->IN, ':utf8'), it turns out one can make either of command line option -CS or use open qw( :std :utf8) work out of the box with Term::ReadLine by supplying STDIN as an argument to Term::ReadLine->new(), as explained in the answer to this question: Term::Readline: encoding-question.
For example:
use strict;
use utf8;
use open qw( :std :utf8 );
use warnings;
use Term::ReadLine;
my $term = Term::ReadLine->new('Test', \*STDIN, \*STDOUT);
my $answer = $term->readline( 'Enter input: ' );

Term::ReadLine does not read from STDIN; it opens a new filehandle. That is why use open qw(:std :utf8); has no effect.
You need to do something like this:
my $term = Term::ReadLine->new('name');
binmode($term->IN, ':utf8');
Update about -CS:
The -C option sets the magic variable ${^UNICODE}. With -CS (or -CI), the expression ${^UNICODE} & 0x0001 is true, and Term::ReadLine turns the UTF-8 flag on for the input string when ${^UNICODE} & 0x0001 is true.
Note that -CS is not the same as binmode($term->IN, ':utf8'). The former only sets the UTF-8 flag on the string that was read, while the latter applies an I/O layer to the input filehandle, so the data is transformed as it is read.
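The doubly encoded result from the question can be reproduced without a terminal. The following is a minimal sketch using the core Encode module; the variable names are illustrative, and the octets are hard-coded to stand in for what an undecoded input handle would return for å.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets for U+00E5 (å), as an undecoded handle would deliver them.
my $bytes = "\xC3\xA5";

# Decoding interprets the octets and yields a single character.
my $chars = decode('UTF-8', $bytes);
printf "decoded: length=%d, code point=U+%04X\n", length $chars, ord $chars;
# prints "decoded: length=1, code point=U+00E5"

# Encoding the still-undecoded octets treats each octet as a Latin-1
# character, which gives the doubled form "\303\203\302\245" from the question.
my $double = encode('UTF-8', $bytes);
printf "double-encoded: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $double;
# prints "double-encoded: C3 83 C2 A5"
```

If a dump shows four octets where two were expected, this kind of second encoding pass is a likely cause.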

Related

Why is my Perl script printing incorrect values when not using hard-coded values?

I use an A/D converter to read an electrical conductivity probe through a miniEC interface from Sparky's Widgets. We ran a calibration and obtained the slope and intercept values. When I test these values in a static script, the result is correct.
Here is that script. It is not much, but it shows that the calibration works.
#!/usr/bin/perl
my $slope = "0.048684077307972626";
my $intercept = "24.831896523430906";
$ECdec = 62.5;
print "$ECdec \n";
###lin
$EC1 = ( ( $ECdec - $intercept ) / $slope );
print "Electric Conductivity $EC1 µS/m \n";
Output is:
62.5
Electric Conductivity 773.725323749752 �S/m
When I replace the static value $ECdec with the output of the A/D converter, the result is totally wrong. Can anyone see my mistake?
Here is the Perl script that reads the probe's value from the converter, swaps the bytes, converts the result to decimal and then applies the linear regression. What did I do wrong?
#!/usr/bin/perl
my $dir = '/var/www/motion';
my $slope = "0.048684077307972626";
my $intercept = "24.831896523430906";
###get value
my $EC = `sudo i2cget -y 1 0x4a 0x00 w` ;
print "$EC \n";
###swap
my $ECswap = $EC;
substr $ECswap, 4, 0, substr $ECswap, 2, 2, q();
print "$ECswap \n";
###convert to decimal
$ECdec = hex($ECswap);
print "$ECdec \n";
$ECvalue = ($ECdec - $ECintercept)/$slope);
print "$ECvalue"
#$rrd = `/usr/bin/rrdtool update $dir/homeec.rrd N:$ECdec`;
####system ("clear");
print "Electric Conductivity $ECdec µS/m \n";
Output here is:
0x5303
0x0353
851
Electric Conductivity 16969.9858590372 �S/m
You are printing $ECdec in your output instead of $ECvalue.
Also, please always post your real code. The program you have shown won't compile and is clearly not the one that is giving you problems.
This is how your program should look
You must always use strict and use warnings 'all' at the top of even the most trivial Perl programs, and declare all of your variables with my
You should always use utf8 if your code contains non-ASCII characters like the Greek mu µ in microSiemens. Perl doesn't support source code encoded other than in 7-bit ASCII or UTF-8. I don't know whether your terminal expects UTF-8 characters, and you may need to alter the use open statement
I have commented out your call to i2cget to retrieve a real value and substituted a constant string instead
I have also converted the hex string to binary before swapping the bytes for speed, but it's far from critical and you should retain the character swap if you find it more readable. I would use a regular expression and write it like this
die unless $EChex =~ /0x(\p{hex}{2})(\p{hex}{2})/;
my $EC = hex($2.$1);
#!/usr/bin/perl
use utf8;
use strict;
use warnings 'all';
use open qw/ :std :encoding(utf8) /;
use constant DIR => '/var/www/motion';
use constant SLOPE => 0.048684077307972626;
use constant INTERCEPT => 24.831896523430906;
# my $EChex = `sudo i2cget -y 1 0x4a 0x00 w` ;
my $EChex = '0x5303';
printf "\$EChex = %s\n", $EChex;
my $EC = hex $EChex;
printf "\$EC = %s\n", $EC;
$EC = (($EC & 0xFF00) >> 8) | (($EC & 0xFF) << 8); # swap bytes
my $ECvalue = ($EC - INTERCEPT) / SLOPE;
printf "Electric Conductivity %.3fµS/m \n", $ECvalue;
Output:
$EChex = 0x5303
$EC = 21251
Electric Conductivity 16969.986µS/m
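As a third way to swap the bytes, besides the substr and bit-mask versions above, pack and unpack can do it in one step. This is only a sketch: swap16 is a hypothetical helper name, and the sample reading is the 0x5303 value from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: swap the two bytes of a 16-bit value.
# 'v' packs the number as little-endian, 'n' reads it back as big-endian.
sub swap16 {
    my ($word) = @_;
    return unpack 'n', pack 'v', $word;
}

my $raw     = hex '0x5303';    # sample reading from the question
my $swapped = swap16($raw);    # 0x0353

printf "0x%04X -> 0x%04X (%d)\n", $raw, $swapped, $swapped;
# prints "0x5303 -> 0x0353 (851)"
```

The result, 851, matches the decimal value the question's script prints after its own swap.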

Win32::Console::ANSI and uri_unescape

When I run this script in a WinXP terminal with CP850, the Ü and ö are displayed correctly. When I uncomment the use Win32::Console::ANSI; line, the output is broken.
Is this behavior expected or is this a bug?
#!perl
use warnings;
use strict;
use 5.10.0;
binmode STDOUT, ':encoding(cp850)';
use Encode qw(decode_utf8);
use URI::Escape qw(uri_unescape);
#use Win32::Console::ANSI;
my $uri_escaped = '%C3%9Cberraschungsei+R%C3%B6ntgen';
say $uri_escaped;
my $uri_unescaped = uri_unescape( $uri_escaped );
say $uri_unescaped;
my $utf8_decoded = decode_utf8( $uri_unescaped );
say "Result: $utf8_decoded";
%C3%9Cberraschungsei+R%C3%B6ntgen
"\x{009c}" does not map to cp850 at C:perl.pl line 15.
Ã\x{009c}berraschungsei+Röntgen
Result: Überraschungsei+Röntgen
With Win32::Console::ANSI enabled:
%C3%9Cberraschungsei+R%C3%B6ntgen
"\x{009c}" does not map to cp850 at C:perl.pl line 15.
Ç\x{009c}berraschungsei+RÇôntgen
Result: sberraschungsei+R"ntgen
Use the ANSI code page (cp1252) rather than the OEM one.
>chcp
Active code page: 437
>perl a.pl cp437
%C3%9Cberraschungsei+R%C3%B6ntgen
Überraschungsei+Röntgen
>perl -MWin32::Console::ANSI a.pl cp1252
%C3%9Cberraschungsei+R%C3%B6ntgen
Überraschungsei+Röntgen
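To see why the choice of code page matters, it can help to look at the octets each one produces for the same characters. The sketch below uses only the core Encode module and hard-codes the UTF-8 octets for Ü and ö; it illustrates the encoding difference and is not a reconstruction of the a.pl used above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xC3\x9C\xC3\xB6";           # UTF-8 octets for "Üö"
my $text   = decode('UTF-8', $octets);     # two characters: U+00DC, U+00F6

# Encode the same two characters under the OEM and the ANSI code page.
my %hex;
for my $cp (qw(cp850 cp1252)) {
    $hex{$cp} = unpack 'H*', encode($cp, $text);
    print "$cp: $hex{$cp}\n";
}
# prints "cp850: 9a94"
# prints "cp1252: dcf6"
```

Octets written for one code page and interpreted under the other show up as different characters, which is consistent with the broken output in the question.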

How to use utf8 encode with open pragma

I have a problem with utf8::encode when I use the pragma use open qw(:std :utf8);
Example
#!/usr/bin/env perl
use v5.16;
use utf8;
use open qw(:std :utf8);
use Data::Dumper;
my $word = "+банк";
say Dumper($word);
say utf8::is_utf8($word) ? 1 : 0;
utf8::encode($word);
say Dumper($word);
say utf8::is_utf8($word) ? 1 : 0;
Output
$VAR1 = "+\x{431}\x{430}\x{43d}\x{43a}";
1
$VAR1 = '+банк';
0
When I remove the pragma use open qw(:std :utf8);, everything is OK.
$VAR1 = "+\x{431}\x{430}\x{43d}\x{43a}";
1
$VAR1 = '+банк';
0
Thank you in advance!
If you're going to replace utf8::encode($word); with use open qw(:std :utf8);, you'll actually need to remove the utf8::encode($word);. In the version that doesn't work, you're encoding twice.
utf8::encode is not what you want if you are going to print to a filehandle upon which perl expects to output utf8.
utf8::encode says take this string and give me a string where each character is a byte of the utf8 encoding of the input string. This would normally be only done if you are then going to use that string in some way where perl won't be automatically converting to utf8 if necessary.
If you add a say length($word); after the encode, you will see that $word is 9 characters, not the original 5.
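The length change can be checked directly. In this sketch the word is written with \x{...} escapes so that it does not depend on the encoding of the source file, and the second utf8::encode call stands in for what a :utf8 output layer does to a string that has already been encoded.

```perl
use strict;
use warnings;

my $word  = "+\x{431}\x{430}\x{43d}\x{43a}";   # "+банк" written with escapes
my $chars = length $word;                       # 5 characters

my $once = $word;
utf8::encode($once);                            # the explicit call in the script
my $octets = length $once;                      # 9 octets

my $twice = $once;
utf8::encode($twice);                           # a second pass, as an output layer would do
my $doubled = length $twice;                    # 17 octets

print "$chars $octets $doubled\n";              # prints "5 9 17"
```

Each of the four Cyrillic letters takes two octets after the first pass, and each of those eight high octets is expanded to two again by the second pass.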

How can I dump a string in perl to see if there are any character differences?

I've occasionally had problems with strings being subtly different. In some cases utf8::all changed the behavior, so I assume the subtle differences are Unicode-related. I'd like to dump strings in such a way that the differences are visible to me. What are my options for doing this?
I recommend the Dump function in the Devel::Peek module in the Perl core:
$ perl -MDevel::Peek -e 'Dump "abc"'
SV = PV(0x10441500) at 0x10491680
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x10442224 "abc"\0
CUR = 3
LEN = 4
$ perl -MDevel::Peek -e 'Dump "\x{FEFF}abc"'
SV = PV(0x10441050) at 0x10443be0
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x10449bc0 "\357\273\277abc"\0 [UTF8 "\x{feff}abc"]
CUR = 6
LEN = 8
(You see how FLAGS contains UTF8 in the second example, because of the wide character, but not in the first?)
For most uses, Data::Dumper with Useqq will do.
use utf8;
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
print(Dumper("foo–bar"));
print(Dumper("foo-bar"));
Output:
$VAR1 = "foo\x{2013}bar";
$VAR1 = "foo-bar";
If you want internal details (such as the UTF8 flag), use Devel::Peek.
use utf8;
use Devel::Peek;
Dump("foo–bar");
Dump("foo-bar");
Output:
SV = PV(0x328ccc) at 0x1d6a0c4
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK,UTF8)
PV = 0x1d6d52c "foo\342\200\223bar"\0 [UTF8 "foo\x{2013}bar"]
CUR = 9
LEN = 12
SV = PV(0x328dcc) at 0x32b594
REFCNT = 1
FLAGS = (PADTMP,POK,READONLY,pPOK)
PV = 0x1d6d50c "foo-bar"\0
CUR = 7
LEN = 12
Have you tried Test::LongString? Even though it's really a test module, it is handy for showing you where the differences in a string occur. It focuses on the parts that are different instead of showing you the whole string, and it makes \x{} escapes for special characters.
I'd like to see an example where utf8::all changed the behavior, even if just to see an interesting edge case.
All you need to dump out any string is:
printf "U+%v04X\n", $string;
You could use this to format a string:
($print_string = $string) =~ s/([^\x20-\x7E])/sprintf "\\x{%x}", ord $1/ge;
or even
use charnames ();
($print_string = $string) =~ s/([^\x20-\x7E])/sprintf "\\N{%s}", charnames::viacode(ord $1)/ge;
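The escaping substitution can be wrapped in a small helper and paired with a search for the first position where two strings differ. Both function names here are hypothetical, and the sample strings are the en-dash and hyphen pair used earlier in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: show everything outside printable ASCII as \x{...}.
sub visible {
    my ($s) = @_;
    (my $out = $s) =~ s/([^\x20-\x7E])/sprintf '\\x{%x}', ord $1/ge;
    return $out;
}

# Hypothetical helper: index of the first differing character, or -1 if equal.
sub first_diff {
    my ($x, $y) = @_;
    return -1 if $x eq $y;
    my $n = length($x) < length($y) ? length($x) : length($y);
    for my $i (0 .. $n - 1) {
        return $i if substr($x, $i, 1) ne substr($y, $i, 1);
    }
    return $n;    # one string is a prefix of the other
}

my ($dash, $hyphen) = ("foo\x{2013}bar", "foo-bar");
print visible($dash), "\n";                # prints "foo\x{2013}bar"
print first_diff($dash, $hyphen), "\n";    # prints "3"
```

Comparing character by character avoids string XOR, which does not work well with code points above 0xFF.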
I have no idea why in the world you would use the misleadingly named utf8::all. It’s not a core module, and you seem to be having some sort of trouble with knowing what it is really doing. If you explicitly used the individual core pieces that go into it, maybe you would understand it all better.

how to convert from gbk encoding to utf-8 encoding in Perl

I have a simple question which I do not know how to solve in Perl. I know how to convert from utf-8 to GBK, for example, from e4b8ad to d6d0. But I am not sure how to go backward, i.e. given d6d0, how do I know e4b8ad.
Please enlighten me! Many thanks.
When you have hex digits, pack is your friend. Following is a REPL session. Notes:
To reverse the direction, pack the hex digits into octets, decode from GB octets to character string, encode character string to UTF-8 octets, unpack octets into hex digits.
GBK is superseded. Use of GB18030 (provided by Encode::HanExtra in Perl) has been mandatory for five years already.
$ use Encode qw(decode encode); use Encode::HanExtra; use Devel::Peek qw(Dump);
$ 'e4b8ad'
e4b8ad # hex digits
$ pack('H*', 'e4b8ad')
中
$ Dump(pack('H*', 'e4b8ad'))
SV = PV(0x3657680) at 0x36b7188
REFCNT = 1
FLAGS = (PADTMP,POK,pPOK)
PV = 0x36c0768 "\344\270\255"\0 # octets of UTF-8 encoded data
CUR = 3
LEN = 8
$ decode('UTF-8', pack('H*', 'e4b8ad'))
中
$ Dump(decode('UTF-8', pack('H*', 'e4b8ad')))
SV = PV(0x326c3a0) at 0x36a50c8
REFCNT = 1
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x3698a48 "\344\270\255"\0 [UTF8 "\x{4e2d}"] # character string
CUR = 3
LEN = 8
$ encode('GB18030', decode('UTF-8', pack('H*', 'e4b8ad')))
"\xd6\xd0"
$ Dump(encode('GB18030', decode('UTF-8', pack('H*', 'e4b8ad'))))
SV = PV(0x36a2da0) at 0x36b6d98
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x36db3e8 "\326\320"\0 # octets of GB18030 encoded data
CUR = 2
LEN = 8
$ unpack('H*', encode('GB18030', decode('UTF-8', pack('H*', 'e4b8ad'))))
d6d0 # hex digits
The answer to the question asked:
use Encode qw( from_to );
my $gbk = "\xD6\xD0";
from_to(my $utf8 = $gbk, 'GB18030', 'UTF-8'); # E4 B8 AD
or
use Encode qw( decode encode );
my $gbk = "\xD6\xD0";
my $utf8 = encode('UTF-8', decode('GB18030', $gbk)); # E4 B8 AD
However, a more normal flow looks like the following:
open(my $fh_in, '<:encoding(GB18030)', ...) or die ...;
open(my $fh_out, '>:encoding(UTF-8)', ...) or die ...;
while (<$fh_in>) {
...
print $fh_out ...;
...
}
Encode::HanExtra must be installed for Encode to find the encoding.
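Since the question starts from hex digits, the pack/unpack steps and the decode/encode steps can be combined into one function for the reverse direction. This is a sketch under one assumption: it uses 'cp936', the GBK encoding that ships with core Encode, so that it runs without Encode::HanExtra; with that module installed, 'GB18030' can be substituted. The function name is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: hex digits of GBK octets in, hex digits of UTF-8 octets out.
sub gbk_hex_to_utf8_hex {
    my ($hex) = @_;
    my $chars = decode('cp936', pack('H*', $hex));   # GBK octets -> character string
    return unpack 'H*', encode('UTF-8', $chars);     # characters -> UTF-8 octets -> hex
}

print gbk_hex_to_utf8_hex('d6d0'), "\n";   # prints "e4b8ad"
```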
use Encode qw/encode decode/;
$utf8 = decode("euc-cn", $euc_cn); # ditto
You can also normally specify the encoding when you open a filehandle (or afterwards with binmode), and it will perform the necessary conversions.
Works like a charm:
perl -e 'open(X,">","/tmp/x"); print X chr(0xd6).chr(0xd0);close(X)'
perl -mEncode -e 'open(X,"<","/tmp/x"); $x=<X>; print Encode::decode("euc-cn",$x);' > /tmp/xx