Perl prints 3 wrong characters instead of a Unicode character

I've been having trouble with the print function, and I suspect I'm missing something small. I've searched and experimented but can't find the solution.
I'm trying to print braille characters in Perl. I got the value 2881 from a table and converted it to hexadecimal. When I try to print the character with that hexadecimal code, Perl prints 3 characters instead.
Code:
#!/usr/local/bin/perl
use utf8;
print "\x{AF1}";
Output:
C:\Users\ElizabethTosh\Desktop>perl testff.pl
Wide character in print at testff.pl line 3.
૱

Issue #1: You need to tell Perl to encode the output for your terminal.
Add the following to your program.
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';
use utf8; merely specifies that the source file is encoded using UTF-8 instead of ASCII.
Issue #2: Your terminal probably can't handle that character.
The console of a US-English machine likely expects cp437. Its character set doesn't include any braille characters.
You could try switching to code page 65001 (UTF-8) using chcp 65001. You may also need to switch the console's font to one that includes braille characters. (MS Gothic worked for me, although it does weird things to the backslashes.)
Issue #3: You have the wrong character code.
U+0AF1 GUJARATI RUPEE SIGN (૱): "\x{AF1}" or "\N{U+0AF1}" or chr(2801)
U+0B41 ORIYA VOWEL SIGN U (ୁ): "\x{B41}" or "\N{U+0B41}" or chr(2881)
U+2801 BRAILLE PATTERN DOTS-1 (⠁): "\x{2801}" or "\N{U+2801}" or chr(10241)
U+2881 BRAILLE PATTERN DOTS-18 (⢁): "\x{2881}" or "\N{U+2881}" or chr(10369)
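The decimal/hexadecimal mix-up behind this list can be checked directly in Perl. This is a minimal sketch using only core features (sprintf, hex and charnames::viacode); the variable names are illustrative and not part of the original program.

```perl
use strict;
use warnings;
use charnames ();   # gives access to charnames::viacode()

# 2881 read as a decimal number is code point U+0B41, not a braille pattern.
my $as_decimal = sprintf 'U+%04X', 2881;

# 2881 read as hexadecimal is the braille code point U+2881.
my $as_hex = hex '2881';
my $name   = charnames::viacode($as_hex);

print "$as_decimal\n";   # U+0B41
print "$as_hex\n";       # 10369
print "$name\n";         # BRAILLE PATTERN DOTS-18
```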
All together,
use strict;
use warnings;
use feature qw( say );
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';
say(chr($_)) for 0x2801, 0x2881;
Output:
>chcp 65001
Active code page: 65001
>perl a.pl
⠁
⢁

If you save a character with UTF-8, and it's displayed as 3 strange characters instead of 1, it means that the character is in the range U+0800 to U+FFFF, and that you decode it with some single-byte encoding instead of UTF-8.
So, change the encoding of your terminal to UTF-8. If you can't do this, redirect the output to a file:
perl testff.pl >file
And open the file with a text editor that supports UTF-8, to see if the character is displayed correctly.
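The "3 strange characters" effect described above can be reproduced without a terminal. A minimal sketch, assuming the core Encode module and using iso-8859-1 as a stand-in for whatever single-byte encoding the console actually uses:

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $char  = "\x{0AF1}";               # one character in the range U+0800..U+FFFF
my $bytes = encode('UTF-8', $char);   # its UTF-8 encoding is three bytes

# A single-byte encoding turns each of those bytes into its own character.
my $misread = decode('iso-8859-1', $bytes);

printf "%d character -> %d bytes -> %d characters when misread\n",
    length($char), length($bytes), length($misread);
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";   # E0 AB B1
```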
You want to print the character U+2881 (⢁), and not U+0AF1. 2881 is already in hexadecimal.
To get rid of the Wide character in print warning, set the input and output of your Perl program to UTF-8:
use open ':std', ':encoding(UTF-8)';
Use this instead of use utf8;, which only tells Perl that the program text itself is encoded in UTF-8.
Summary
Source file (testff.pl):
#!/usr/local/bin/perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
print "\x{2881}";
Run:
> perl testff.pl
⢁

Related

how to decode_entities in utf8

In perl, I am working with the following utf-8 text:
my $string = 'a 3.9 k&Omega; resistor and a 5 µF capacitor';
However, when I run the following:
decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
I get
a 3.9 kΩ resistor and a 5 µF capacitor
The Ω symbol has successfully decoded, but the µ symbol now has gibberish before it.
How can I use decode_entities while making sure non-encoded utf-8 symbols (such as µ) are not converted to gibberish?
This isn't a very well-phrased question. You didn't tell us where your decode_entities() function comes from and you didn't give a simple example that we could just run to reproduce your problem.
But I was able to reproduce your problem with this code:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
The problem here is that by default, Perl will interpret your source code (and, therefore, any strings included in it) as ISO-8859-1. As your string is in UTF-8, you just need to tell Perl to interpret your source code as UTF-8 by adding use utf8; to your code.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use utf8; # Added this line
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
Running this will give you the correct string, but you'll also get a warning.
Wide character in say
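This is because Perl's IO layer expects single-byte characters by default and any attempt to send a multi-byte character through it is seen as a potential problem. You can fix that by telling Perl that STDOUT should accept UTF-8 characters. There are many ways to do that. The easiest is probably to add -CS to the shebang line. (Since Perl 5.10.1, -C on the shebang line only takes effect if the script is executed directly, e.g. ./script.pl, or if the same option is also given on the perl command line; otherwise Perl dies with a Too late for "-CS" option error.)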
#!/usr/bin/perl -CS
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
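One of the other ways to get the same effect is an :encoding(UTF-8) layer set with binmode. The sketch below applies the layer to STDOUT and, purely for illustration, to an in-memory filehandle so that the encoded bytes can be inspected; the $buffer and $fh names are made up for this example.

```perl
use strict;
use warnings;

# Alternative to -CS: put an encoding layer on the real STDOUT.
binmode STDOUT, ':encoding(UTF-8)';

# The same layer on an in-memory handle, so the resulting bytes are visible.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$fh} "k\x{3A9}";   # "k" followed by GREEK CAPITAL LETTER OMEGA; no warning
close $fh;

print join(' ', map { sprintf '%02X', ord } split //, $buffer), "\n";   # 6B CE A9
```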
Perl has great support for Unicode, but it can be hard to get started with it. I recommend reading perlunitut to see how it all works.
Are you using the Encode CPAN library? If so, you can try this...
use Encode qw( decode );
use HTML::Entities qw( decode_entities );
my $string = "...";
$string = decode_entities(decode('utf-8', $string));
This may seem illogical. If Perl is natively UTF-8 itself, why should you need to decode a UTF-8 string? It is simply another way of telling Perl that you have a UTF-8 value that it needs to interpret as natively UTF-8.
The corruption you are seeing happens when a UTF-8 value doesn't have the right bytes recognized (it shows "0xC1 0xAF" when Dumpered; after the above change, it ought to show "0x1503", or some similar concat'ed bytes).
There are a ton of settings that can affect this in perl. The above is most likely the right combination of changes that you need for your given settings. Otherwise, some variation (swap encode with decode('latin1', ...), etc.) of the above should solve the problem.
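What decode does to such a string can be seen by comparing lengths before and after. This is a sketch under the assumption that the input arrives as raw UTF-8 bytes (as it does for a literal in a source file without use utf8;); it uses only the core Encode module, and the sample text is taken from the question.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "5 \xC2\xB5F";             # UTF-8 bytes of "5 µF", not yet decoded
my $text  = decode('UTF-8', $bytes);   # the same text as a string of characters

printf "before: %d, after: %d\n", length($bytes), length($text);   # before: 5, after: 4
printf "U+%04X\n", ord substr($text, 2, 1);                        # U+00B5
```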

Why do I get garbled output when I decode some HTML entities but not others?

In Perl, I am trying to decode strings which contain numeric HTML entities using HTML::Entities. Some entities work, while "newer" entities don't. For example:
decode_entities('&#174;'); # returns ® as expected
decode_entities('&#937;'); # returns Î© instead of Ω
decode_entities('&#9733;'); # returns â˜… instead of ★
Is there a way to decode these "newer" HTML entities in Perl? In PHP, the html_entity_decode function seems to decode all of these entities without any problem.
The decoding works fine. It's how you're outputting them that's wrong. For example, you may have sent the strings to a terminal without encoding them for that terminal first. This is achieved through the open pragma in the following program:
$ perl -e'
use open ":std", ":encoding(UTF-8)";
use HTML::Entities qw( decode_entities );
CORE::say decode_entities($_)
for "&#174;", "&#937;", "&#9733;";
'
®
Ω
★
Make sure your terminal can handle UTF-8 encoding. It looks like it's having problems with multibyte characters. You can also try to set UTF-8 for STDOUT in case you get wide character warnings.
use strict;
use warnings;
use HTML::Entities;
binmode STDOUT, ':encoding(UTF-8)';
print decode_entities('&#174;'); # returns ®
print decode_entities('&#937;'); # returns Ω
print decode_entities('&#9733;'); # returns ★
This gives me the correct/expected results.

Why does decoding "&euro;" to "€" also turn "é" into "Ã©" in output?

I'm new to Perl scripting, and I'm facing some issues in decoding a string:
use HTML::Entities;
my $string='Rémunération &euro;';
$string=decode_entitie($string);
print "$string";
The output I get looks like RÃ©munÃ©ration €, when it should look like Rémunération €.
Can anyone please help me with this?
If you run this version of your code (with the typo in decode_entities fixed, strict mode and warnings enabled, and an extra print added) at a terminal:
use strict;
use warnings;
use HTML::Entities;
my $string='Rémunération &euro;';
print "$string\n";
$string=decode_entities($string);
print "$string\n";
you should see the following output:
Rémunération &euro;
Wide character in print at test.pl line 7.
RÃ©munÃ©ration €
What happens is the following chain of events:
Your code is written in UTF-8, but doesn't have use utf8; in it, so Perl is parsing your source code (and, in particular, any string literals in it) byte by byte. Thus, the string literal 'é' is parsed as a two-character string, because the UTF-8 encoding of é takes up two bytes.
Normally, this doesn't matter (much), because your STDOUT is also not in UTF-8 mode, and so it just takes any byte string you give it and spits it out byte by byte, and your terminal then interprets the resulting output as UTF-8 (or tries to).
So, when you do print 'é'; Perl thinks you're printing a two-character string in byte mode, and writes out two bytes, which just happen to make up the UTF-8 encoding of the single character é.
However, when you run your string through decode_entities(), it decodes the &euro; into an actual Unicode € character, which does not fit inside a single byte.
When you try to print the resulting string, Perl notices the "wide" € character. It can't print it as a single byte, so instead, it falls back to encoding the whole string as UTF-8 (and emitting a warning, if you have those enabled, as you should). But that causes the és (which were already encoded, since Perl never decoded them while parsing your code) to get double-UTF8-encoded, producing the mojibake output you see.
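The chain of events above can be reproduced at the byte level. A minimal sketch, assuming the core Encode module; the string is shortened to "Ré" plus the euro sign, and hex_dump is a helper written only for this example.

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $euro      = "\x{20AC}";                    # the character decode_entities() yields
my $undecoded = "R\xC3\xA9";                   # "Ré" as raw UTF-8 bytes (no "use utf8;")
my $decoded   = decode('UTF-8', $undecoded);   # "Ré" as characters (what "use utf8;" gives)

# Encoding the whole string as UTF-8 is what print falls back to for wide characters.
my $garbled = encode('UTF-8', $undecoded . $euro);   # é bytes get encoded a second time
my $correct = encode('UTF-8', $decoded . $euro);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

print hex_dump($garbled), "\n";   # 52 C3 83 C2 A9 E2 82 AC
print hex_dump($correct), "\n";   # 52 C3 A9 E2 82 AC
```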
A simple fix is to add use utf8; to your code, and also set all your filehandles (including STDIN / STDOUT / STDERR) to UTF-8 mode by default, e.g. like this:
use utf8;
use open qw(:std :utf8);
With those lines prepended to the test script above, the output you get should be:
Rémunération &euro;
Rémunération €

Perl HTML Encoding Named Entities

I would like to encode 'special chars' to their named entity.
My code:
use HTML::Entities;
print encode_entities('“');
Desired output:
&ldquo;
And not:
“
Does anyone have an idea? Greetings
If you don't use use utf8;, the file is expected to be encoded using iso-8859-1 (or subset US-ASCII).
«“» is not found in iso-8859-1's charset.
If you use use utf8;, the file is expected to be encoded using UTF-8.
«“» is found in UTF-8's charset, Unicode.
You indicated your file isn't saved as UTF-8, so as far as Perl is concerned, your source file cannot possibly contain «“».
Odds are that you encoded your file using cp1252, an extension of iso-8859-1 that adds «“». That's not a valid choice.
Options:
[Best option] Save the file as UTF-8 and use the following:
use utf8;
use HTML::Entities;
print encode_entities('“');
Save the file as cp1252, but only use US-ASCII characters.
use charnames ':full';
use HTML::Entities;
print encode_entities("\N{LEFT DOUBLE QUOTATION MARK}");
or
use HTML::Entities;
print encode_entities("\N{U+201C}");
or
use HTML::Entities;
print encode_entities("\x{201C}");
[Unrecommended] Save the file as cp1252 and decode literals explicitly
use Encode qw( decode );
use HTML::Entities;
print encode_entities(decode('cp1252', '“'));
Perl sees:
use Encode qw( decode );
use HTML::Entities;
print encode_entities(decode('cp1252', "\x93"));
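The explicit-decoding option can be verified in isolation. A small sketch using only the core Encode module, showing that the cp1252 byte 0x93 decodes to the same code point as the \x{201C} and \N{U+201C} forms above:

```perl
use strict;
use warnings;
use Encode qw( decode );

my $byte = "\x93";                    # what a literal saved as cp1252 contains
my $char = decode('cp1252', $byte);   # LEFT DOUBLE QUOTATION MARK

printf "U+%04X\n", ord $char;         # U+201C
```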
Perl doesn't know the encoding of your source file. If you include any special characters, you should always save it with UTF-8-encoding and put
use utf8;
at the top of your code. This will make sure your string literals contain codepoints, not just bytes.
I had the same problem and applied all of the above hints. It worked from within my perl script (CGI), e.g. &auml; = encode_entities("ä") produced the correct result. Yet applying encode_entities(param("test")) would encode the single bytes.
I found this advice: http://blog.endpoint.com/2010/12/character-encoding-in-perl-decodeutf8.html
Putting it together this is my solution which finally works:
use CGI qw/:standard/;
use utf8;
use HTML::Entities;
use Encode;
print encode_entities(decode_utf8(param("test")));
It is not clear to me why that was required, but it works. HTH

Perl Using Foreign Characters in Windows

I'm trying to print Turkish characters such as ş, ı, ö, ç on Windows using Perl, but I couldn't get it to work. My main goal is to create folders whose names contain these characters.
This is my code:
use Text::Iconv;
use strict;
use warnings;
$conve = Text::Iconv->new("windows-1254","UTF-16");
$converted = $conve->convert("ş");
print $converted;
system("mkdir $converted");
I get a malformed utf-8 character (byte 0xfe) aa.pl at line 7
Save the following as UTF-8:
use utf8;
use strict;
use warnings;
use open ":std", ":encoding(cp1254)"; # Set encoding for STD*
use Encode qw( encode );
my $file_name = "ş";
print "$file_name\n";
system(encode('cp1254', qq{mkdir "$file_name"}));
use utf8 tells Perl the source is UTF-8.
use open ":std", ":encoding(cp1254)"; causes text sent to STDOUT and STDERR to be encoded using cp1254, and it causes text read from STDIN to be decoded from cp1254.
It doesn't affect what is sent to system calls like system, so you need to encode those explicitly.
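To see what the explicit encode step hands to system, the command string can be built and inspected without running it. A minimal sketch, assuming the core Encode module; "\x{15F}" is ş written as an escape so the example does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $file_name = "\x{15F}";                             # "ş" as a single character
my $command   = encode('cp1254', qq{mkdir "$file_name"});

# In cp1254 the character ş is the single byte FE.
print join(' ', map { sprintf '%02X', ord } split //, $command), "\n";
# 6D 6B 64 69 72 20 22 FE 22
```

The byte FE is the same one named in the "malformed utf-8 character (byte 0xfe)" message from the question.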