how to decode_entities in utf8 - perl

In perl, I am working with the following utf-8 text:
my $string = 'a 3.9 k&Omega; resistor and a 5 µF capacitor';
However, when I run the following:
decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
I get
a 3.9 kΩ resistor and a 5 µF capacitor
The Ω symbol was decoded successfully, but the µ symbol now has gibberish before it.
How can I use decode_entities while making sure non-encoded utf-8 symbols (such as µ) are not converted to gibberish?

This isn't a very well-phrased question. You didn't tell us where your decode_entities() function comes from and you didn't give a simple example that we could just run to reproduce your problem.
But I was able to reproduce your problem with this code:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
The problem here is that by default, Perl will interpret your source code (and, therefore, any strings included in it) as ISO-8859-1. As your string is in UTF8, you just need to tell Perl to interpret your source code as UTF8 by adding use utf8 to your code.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use utf8; # Added this line
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
Running this will give you the correct string, but you'll also get a warning.
Wide character in say
This is because Perl's IO layer expects single-byte characters by default and any attempt to send a multi-byte character through it is seen as a potential problem. You can fix that by telling Perl that STDOUT should accept UTF8 characters. There are many ways to do that. The easiest is probably to add -CS to the shebang line.
#!/usr/bin/perl -CS
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
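Note that -C on the #! line is only honoured when the script is invoked directly (./script.pl); on modern Perls, running it as perl script.pl with -CS only in the shebang dies with a "Too late for -CS option" error. A sketch of the same fix using an explicit I/O layer instead, which works either way:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::Entities;

# Make STDOUT encode characters as UTF-8 on output,
# instead of relying on the -CS command-line switch.
binmode STDOUT, ':encoding(UTF-8)';

say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
```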
Perl has great support for Unicode, but it can be hard to get started with it. I recommend reading perlunitut to see how it all works.

I assume you are using the Encode CPAN module. If that is true, you can try this...
my $string = "...";
$string = decode_entities(decode('utf-8', $string));
This may seem illogical: if Perl handles UTF-8 natively, why would you need to decode a UTF-8 string? The decode call is simply how you tell Perl that the bytes you are holding are UTF-8 and should be treated as a native character string.
The corruption you are seeing happens when UTF-8 bytes are not recognized as such: µ (U+00B5) is stored as the two bytes 0xC2 0xB5, and when each byte is displayed as a separate Latin-1 character you get mojibake in place of the µ. After the change above, the string holds the single character U+00B5 again.
There are a lot of settings that can affect this in perl. The above is most likely the right combination of changes for your setup; otherwise, some variation of it (e.g. decode('latin1', ...) instead of decode('utf-8', ...)) should solve the problem.
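Putting that together, a runnable sketch of this approach (the byte string below stands in for whatever undecoded UTF-8 your code actually receives):

```perl
use strict;
use warnings;
use Encode qw( decode );
use HTML::Entities qw( decode_entities );

binmode STDOUT, ':encoding(UTF-8)';

# Raw UTF-8 octets: \xC2\xB5 is the UTF-8 encoding of µ (U+00B5).
my $bytes = "a 3.9 k&Omega; resistor and a 5 \xC2\xB5F capacitor";

# First turn the octets into a character string, then expand the entities.
my $string = decode_entities(decode('UTF-8', $bytes));
print $string, "\n";
```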

Related

perl prints 3 wrong characters instead of unicode character

I've been having trouble with the print function; I know I'm missing something small. I've been looking everywhere and trying things out, but can't seem to find the solution.
I'm trying to print braille characters in Perl. I got the value 2881 from a table and converted it to hex. When I try to print the hexadecimal character, Perl prints 3 characters instead.
Code:
#!/usr/local/bin/perl
use utf8;
print "\x{AF1}";
Output:
C:\Users\ElizabethTosh\Desktop>perl testff.pl
Wide character in print at testff.pl line 3.
૱
Issue #1: You need to tell Perl to encode the output for your terminal.
Add the following to your program.
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';
use utf8; merely specifies that the source file is encoded using UTF-8 instead of ASCII.
Issue #2: Your terminal probably can't handle that character.
The console of a US-English machine likely expects cp437. Its character set doesn't include any braille characters.
You could try switching to code page 65001 (UTF-8) using chcp 65001. You may also need to switch the console's font to one that includes braille characters. (MS Gothic worked for me, although it does weird things to the backslashes.)
Issue #3: You have the wrong character code.
U+0AF1 GUJARATI RUPEE SIGN (૱): "\x{AF1}" or "\N{U+0AF1}" or chr(2801)
U+0B41 ORIYA VOWEL SIGN U (ୁ): "\x{B41}" or "\N{U+0B41}" or chr(2881)
U+2801 BRAILLE PATTERN DOTS-1 (⠁): "\x{2801}" or "\N{U+2801}" or chr(10241)
U+2881 BRAILLE PATTERN DOTS-18 (⢁): "\x{2881}" or "\N{U+2881}" or chr(10369)
All together,
use strict;
use warnings;
use feature qw( say );
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';
say(chr($_)) for 0x2801, 0x2881;
Output:
>chcp 65001
Active code page: 65001
>perl a.pl
⠁
⢁
If you save a character with UTF-8, and it's displayed as 3 strange characters instead of 1, it means that the character is in the range U+0800 to U+FFFF, and that you decode it with some single-byte encoding instead of UTF-8.
So, change the encoding of your terminal to UTF-8. If you can't do this, redirect the output to a file:
perl testff.pl >file
And open the file with a text editor that supports UTF-8, to see if the character is displayed correctly.
You want to print the character U+2881 (⢁), and not U+0AF1. 2881 is already in hexadecimal.
To get rid of the Wide character in print warning, set the input and output of your Perl program to UTF-8:
use open ':std', ':encoding(UTF-8)';
Instead of use utf8;, which only enables the interpretation of the program text as UTF-8.
Summary
Source file (testff.pl):
#!/usr/local/bin/perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
print "\x{2881}";
Run:
> perl testff.pl
⢁

Perl HTML Encoding Named Entities

I would like to encode 'special chars' to their named entity.
My code:
use HTML::Entities;
print encode_entities('“');
Desired output:
&ldquo;
And not:
&acirc;&euro;&oelig;
Does anyone have an idea? Greetings
If you don't use use utf8;, the file is expected to be encoded using iso-8859-1 (or subset US-ASCII).
«“» is not found in iso-8859-1's charset.
If you use use utf8;, the file is expected to be encoded using UTF-8.
«“» is found in UTF-8's charset, Unicode.
You indicated your file isn't saved as UTF-8, so as far as Perl is concerned, your source file cannot possibly contain «“».
Odds are that you encoded your file using cp1252, an extension of iso-8859-1 that adds «“». That's not a valid choice.
Options:
[Best option] Save the file as UTF-8 and use the following:
use utf8;
use HTML::Entities;
print encode_entities('“');
Save the file as cp1252, but only use US-ASCII characters.
use charnames ':full';
use HTML::Entities;
print encode_entities("\N{LEFT DOUBLE QUOTATION MARK}");
or
use HTML::Entities;
print encode_entities("\N{U+201C}");
or
use HTML::Entities;
print encode_entities("\x{201C}");
[Unrecommended] Save the file as cp1252 and decode literals explicitly
use Encode qw( decode );
use HTML::Entities;
print encode_entities(decode('cp1252', '“'));
Perl sees:
use Encode qw( decode );
use HTML::Entities;
print encode_entities(decode('cp1252', "\x93"));
Perl doesn't know the encoding of your source file. If you include any special characters, you should always save it with UTF-8-encoding and put
use utf8;
at the top of your code. This will make sure your string literals contain codepoints, not just bytes.
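A quick sketch of that difference, assuming the file is actually saved as UTF-8: with use utf8 the literal is one codepoint, while its encoded form is three bytes.

```perl
use strict;
use warnings;
use utf8;                 # source is UTF-8, so '“' below is one character
use Encode qw( encode );

my $char  = '“';                     # one codepoint, U+201C
my $bytes = encode('UTF-8', $char);  # its UTF-8 encoding: 0xE2 0x80 0x9C

print length($char), "\n";    # 1 - a character string
print length($bytes), "\n";   # 3 - a byte string
```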
I had the same problem and applied all of the above hints. It worked from within my perl script (CGI): for example, encode_entities("ä") produced the correct result, &auml;. Yet applying encode_entities(param("test")) would encode the individual bytes instead.
I found this advice: http://blog.endpoint.com/2010/12/character-encoding-in-perl-decodeutf8.html
Putting it together this is my solution which finally works:
use CGI qw/:standard/;
use utf8;
use HTML::Entities;
use Encode;
print encode_entities(decode_utf8(param("test")));
It is not clear to me why that was required, but it works. HTH

UTF-8 in a Perl module name

How can I write a Perl module with UTF-8 in its name and filename? My current try yields "Can't locate Täst.pm in @INC", but the file does exist. I'm on Windows, and haven't tried this on Linux yet.
test.pl:
use strict;
use warnings;
use utf8;
use Täst;
Täst.pm:
package Täst;
use utf8;
Update: My current work-around is to use Tast (ASCII) and put package Täst (Unicode) in Tast.pm (ASCII). It's confusing, though.
Unfortunately, Perl, Windows, and Unicode filenames really don't go together at the moment. My advice is to save yourself a lot of hassle and stick with plain ASCII for your module names. This blog post mentions a few of the problems.
The use utf8 needs to appear before the package Täst, so that the latter can be correctly interpreted. On my Mac:
test.pl:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Tëst;
# 'use utf8' only indicates the code's encoding, but we also want stdout to be utf8
use encoding "utf8";
Tëst::hëllö();
Tëst.pm:
use utf8;
package Tëst;
sub Tëst::hëllö() {
print "Hëllö, wörld!\n";
}
1;
Output:
Macintosh:Desktop sherm$ ./test.pl
Hëllö, wörld!
As I said though - I ran this on my Mac. As cjm said above, your mileage may vary on Windows.
Unicode support often fails at the boundaries. Package and subroutine names need to map cleanly onto filenames, which is problematic on some operating systems. Not only does the OS have to create the filename that you expect, but you also have to be able to find it later as the same name.
We talked a little about the filename issue in Effective Perl Programming, but I also summarized much more in "How do I create then use long Windows paths from Perl?". Jeff Atwood touches on this in his post Filesystem Paths: How Long is Too Long?.
I wouldn't recommend this approach if this is software you plan to release, to be honest. Even if you get it working fine for you, it's likely to be somewhat fragile on machines where UTF-8 isn't configured quite right, and/or filenames may not contain UTF-8 characters, etc.

Unicode string mess in perl

I have an external module that is returning some strings to me. I am not sure how exactly the strings are returned. I don't really know how Unicode strings work or why.
The module should return, for example, the Czech word "být", meaning "to be". (If you cannot see the second letter: it is a y with an acute accent, U+00FD.) If I display the string returned by the module with Data::Dumper, I see it as b\x{fd}t.
However, if I try to print it with print $s, I get a "Wide character in print" warning, and ? instead of ý.
If I try Encode::decode(whatever, $s);, the resulting string cannot be printed anyway (always with the "Wide character" warning, sometimes with mangled characters, sometimes right), no matter what I put in whatever.
If I try Encode::encode("utf-8", $s);, the resulting string CAN be printed without the problems or error message.
If I use use encoding 'utf8';, printing works without any need of encoding/decoding. However, if I use IO::CaptureOutput or Capture::Tiny module, it starts shouting "Wide character" again.
I have a few questions, mostly about what exactly happens. (I tried to read the perldocs, but did not come away much the wiser.)
Why can't I print the string right after getting it from the module?
Why can't I print the string, decoded by "decode"? What exactly "decode" did?
What exactly "encode" did, and why there was no problem in printing it after encoding?
What exactly use encoding do? Why is the default encoding different from utf-8?
What do I have to do, if I want to print the scalars without any problems, even when I want to use one of the capturing modules?
edit: Some people tell me to use -C or binmode or PERL_UNICODE. That is a great advice. However, somehow, both the capturing modules magically destroy the UTF8-ness of STDOUT. That seems to be more a bug of the modules, but I am not really sure.
edit2: OK, the best solution was to dump the modules and write the "capturing" myself (with much less flexibility).
Because you output a string in perl's internal form (utf8) to a non-unicode filehandle.
The decode function decodes a sequence of bytes assumed to be in ENCODING into Perl's internal form (utf8). Your input seems to be decoded already.
The encode() function encodes a string from Perl's internal form into ENCODING.
The encoding pragma allows you to write your script in any encoding you like. String literals are automatically converted to perl's internal form.
Make sure perl knows which encoding your data comes in and goes out.
See also perluniintro, perlunicode, Encode module, binmode() function.
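As a small illustration of what decode and encode each do, using the same Czech word (the octets are spelled out so the snippet doesn't depend on the source file's encoding):

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Raw UTF-8 octets as they might arrive from a file or an external module:
# \xC3\xBD is the UTF-8 encoding of ý (U+00FD).
my $octets = "b\xC3\xBDt";           # 4 bytes

# decode: octets in a known encoding -> Perl's internal character string.
my $chars = decode('UTF-8', $octets);
print length($chars), "\n";          # 3 characters

# encode: internal character string -> octets, safe for a raw filehandle.
my $back = encode('UTF-8', $chars);
print length($back), "\n";           # 4 bytes again
```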
I recommend reading the Unicode chapter of my book Effective Perl Programming. We put together all the docs we could find and explained Unicode in Perl much more coherently than I've seen anywhere else.
This program works fine for me:
#!perl
use utf8;
use 5.010;
binmode STDOUT, ':utf8';
my $string = return_string();
say $string;
sub return_string { 'být' }
Additionally, Capture::Tiny works just fine for me:
#!perl
use utf8;
use 5.010;
use Capture::Tiny qw(capture);
binmode STDOUT, ':utf8';
my( $stdout, $stderr ) = capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
};
say "STDOUT is [$stdout]";
IO::CaptureOutput seems to have some problems though:
#!perl
use utf8;
use 5.010;
use IO::CaptureOutput qw(capture);
binmode STDOUT, ':utf8';
capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
} \my $stdout, \my $stderr;
say "STDOUT is [$stdout]";
For this I get:
STDOUT is [být
]
However, that's easy to fix. Don't use that module. :)
You should also look at the PERL_UNICODE environment variable, which is the same as using the -C option. That allows you to set STDIN/STDOUT/STDERR (and @ARGV) to be UTF-8 without having to alter your scripts.
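For example (a sketch; czech.pl stands for the script being run):

```shell
# S = treat STDIN, STDOUT and STDERR as UTF-8,
# the same as passing -CS on the command line:
PERL_UNICODE=S perl czech.pl

# Equivalent one-off invocation:
perl -CS czech.pl
```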

Why do my Perl tests fail with use encoding 'utf8'?

I'm puzzled with this test script:
#!perl
use strict;
use warnings;
use encoding 'utf8';
use Test::More 'no_plan';
ok('áá' =~ m/á/, 'ok direct match');
my $re = qr{á};
ok('áá' =~ m/$re/, 'ok qr-based match');
like('áá', $re, 'like qr-based match');
All three tests fail, but I was expecting that use encoding 'utf8' would upgrade both the literal áá and the qr-based regexp to utf8 strings, and thus the tests would pass.
If I remove the use encoding line the tests pass as expected, but I can't figure it out why would they fail in utf8 mode.
I'm using perl 5.8.8 on Mac OS X (system version).
Do not use the encoding pragma. It’s broken. (Juerd Waalboer gave a great talk where he mentioned this at YAPC::EU 2k8.)
It does at least two things at once that do not belong together:
It specifies an encoding for your source file.
It specifies an encoding for your file input/output.
And to add insult to injury, it also does #1 in a broken fashion: it reinterprets \xNN sequences as undecoded octets rather than treating them as codepoints, and decodes them, preventing you from expressing characters outside the encoding you specified and making your source code mean different things depending on the encoding. That's just astonishingly wrong.
Write your source code in ASCII or UTF-8 only. In the latter case, the utf8 pragma is the correct thing to use. If you don't want to use UTF-8 but do want to include non-ASCII characters, escape or decode them explicitly.
And use I/O layers explicitly or set them using the open pragma to have I/O automatically transcoded properly.
It works fine on my computer (on perl 5.10). Maybe you should try replacing that use encoding 'utf8' with use utf8.
What version of perl are you using? I think older versions had bugs with UTF-8 in regexps.
The Test::More documentation contains a fix for this issue, which I just found today (and this entry shows higher in the googles).
utf8 / "Wide character in print"
If you use utf8 or other non-ASCII characters with Test::More you might get a "Wide character in print" warning. Using binmode STDOUT, ":utf8" will not fix it. Test::Builder (which powers Test::More) duplicates STDOUT and STDERR, so any changes to them, including changing their output disciplines, will not be seen by Test::More. The workaround is to change the filehandles used by Test::Builder directly.
my $builder = Test::More->builder;
binmode $builder->output, ":utf8";
binmode $builder->failure_output, ":utf8";
binmode $builder->todo_output, ":utf8";
I added this bit of boilerplate to my testing code and it works a charm.