Printing UTF8 in Perl?

Why does adding the use utf8 pragma produce garbled output (see below), compared with when I don't use this pragma?
The code:
use strict;
use v5.10;
use Data::Dumper;
# if I comment this line out, then the results print fine
use utf8;
my $s = {
'data' => 'The size is 200 μg'
};
say Dumper( $s );
Results without use utf8:
$VAR1 = {
'data' => 'The size is 200 μg'
};
Results WITH using use utf8:
$VAR1 = {
'data' => "The size is 200 \x{3bc}g"
};
Thanks for any insights

It is not garbled, but a standard Data::Dumper escape, in the same double-quoted style that its "Useqq" configuration option produces. Data::Dumper is designed for debugging, so this escaping lets you see exactly which characters a string contains even when they may not be printable.
Without use utf8;, your string actually contains the UTF-8 encoded bytes of that character rather than the character itself, since that is what the file contains. You can verify this by checking the length of the string. use utf8; causes the interpreter to decode the source code from UTF-8, including your literal string.
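The length check mentioned above can be sketched roughly as follows. To keep the example independent of how the file itself is saved, it spells out the UTF-8 bytes of "μg" with escapes (an assumption made for illustration) and uses Encode's decode to stand in for what use utf8; does to the literal.

```perl
use strict;
use warnings;
use Encode qw( decode );

# The UTF-8 bytes of "μg": what the literal holds WITHOUT use utf8;
my $bytes = "\xCE\xBCg";

# The decoded characters: what the literal holds WITH use utf8;
my $chars = decode('UTF-8', $bytes);

print length($bytes), "\n";   # 3 (two bytes for μ, one for g)
print length($chars), "\n";   # 2 (one character for μ, one for g)
```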
In order to print such characters, they need to be encoded back to UTF-8 bytes. You can either do this directly:
use strict;
use warnings;
use utf8;
use Encode 'encode';
print encode 'UTF-8', 'The size is 200 μg';
Or you can set an encoding layer on STDOUT, so that all printed text will be encoded to UTF-8:
use strict;
use warnings;
use utf8;
binmode *STDOUT, ':encoding(UTF-8)';
print 'The size is 200 μg';
Encoding to UTF-8 for Data::Dumper debugging is generally unnecessary, because it will escape such characters for your view already.

Related

Is it possible to print 'é' as '%C3%A9' in Perl?

I have a string with an accented character like "é", and the goal is to put the string into a URL, so I need to convert "é" to "%C3%A9".
I have tested modules such as HTML::Entities, Encode and URI::Encode without any success.
Actual Result:
%C3%83%C2%A9
Expected Result:
%C3%A9
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;
use feature 'say';
use URI::Encode qw( uri_encode );
my $var = "é";
say $var;
$var = uri_encode( $var );
say $var;
You are missing use utf8. From the documentation of the utf8 pragma:
The use utf8 pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope. The no utf8 pragma tells
Perl to switch back to treating the source text as literal bytes in
the current lexical scope. (On EBCDIC platforms, technically it is
allowing UTF-EBCDIC, and not UTF-8, but this distinction is academic,
so in this document the term UTF-8 is used to mean both).
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are
directly usable without use utf8;.
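A minimal sketch of why the doubled result appears, using only the core Encode module. The percent_encode helper is hypothetical and written just for this illustration; it is not part of URI::Encode. The file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                       # the "é" literal below is decoded to one character
use Encode qw( encode );

# Hypothetical helper: percent-encode the UTF-8 bytes of a character string.
sub percent_encode {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    # Escape every byte that is not an unreserved URL character.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

my $decoded = "é";           # one character, U+00E9
my $raw     = "\xC3\xA9";    # what the literal would hold without use utf8

print percent_encode($decoded), "\n";   # %C3%A9
print percent_encode($raw), "\n";       # %C3%83%C2%A9
```

Without use utf8 the two source bytes are treated as two separate characters, and each one is encoded to UTF-8 again, which gives the doubled form from the question.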

Convert mangled unicode in perl

I have a listing of filenames where - due to bad conversion from unicode - some of the names are mangled.
This:
Naturalis_Biodiversity_Center_-_RMNH.ART.40_-_Gymnothorax_hepaticus_(RÃ¼ppell)_-_Kawahara_Keiga_-_1823_-_1829_-_Siebold_Collection_-_pencil_drawing_-_water_colour.jpeg
should read like this (notice the umlaut about halfway):
Naturalis_Biodiversity_Center_-_RMNH.ART.40_-_Gymnothorax_hepaticus_(Rüppell)_-_Kawahara_Keiga_-_1823_-_1829_-_Siebold_Collection_-_pencil_drawing_-_water_colour.jpeg
other cases are
GÃ¼nther => Günther
ForsskÃ¥l => Forsskål
Is there a way to find and correct these cases with perl, apart from manual search and replace?
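The UTF-8 bytes of the names were decoded as ISO-8859-1 instead of as UTF-8.
You could reverse that by encoding the string back to ISO-8859-1 bytes and then decoding those bytes as UTF-8:
use strict;
use warnings;
use utf8; # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8
use Encode qw(decode encode);
my $str = 'GÃ¼nther';
# Recover the original bytes, then decode them with the right encoding
my $newStr = decode('UTF-8', encode('iso-8859-1', $str));
print "$newStr\n";
Output:
Günther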
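To find and correct such names in bulk without touching names that are already fine, one possible sketch is a helper that attempts the round trip strictly and keeps the original on failure. The name fix_mojibake is hypothetical, and the sketch assumes the damage came from reading UTF-8 bytes as ISO-8859-1. The test strings use escapes so the example does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: undo "UTF-8 read as ISO-8859-1" damage when possible.
sub fix_mojibake {
    my ($name) = @_;
    my $fixed = eval {
        # Recover the original bytes; croaks on characters outside ISO-8859-1.
        my $bytes = encode('ISO-8859-1', $name,
                           Encode::FB_CROAK() | Encode::LEAVE_SRC());
        # Strict decode; croaks if the bytes are not valid UTF-8.
        decode('UTF-8', $bytes, Encode::FB_CROAK());
    };
    return defined $fixed ? $fixed : $name;   # keep the original on failure
}

binmode STDOUT, ':encoding(UTF-8)';
print fix_mojibake("G\xC3\xBCnther"), "\n";    # mangled: prints Günther
print fix_mojibake("Forssk\xC3\xA5l"), "\n";   # mangled: prints Forsskål
print fix_mojibake("G\xFCnther"), "\n";        # already correct: unchanged
```

A name that is plain ASCII passes through unchanged, since ASCII is valid UTF-8. A correct name that happens to form a valid UTF-8 byte sequence would be altered, so reviewing the changes before renaming files seems prudent.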

PERL: how to detect string encoding so I can use the right charset

I have these 2 example strings:
$a = "點看";
$b = "pøp";
The first one is displayed correctly using charset UTF-8, but the second is not.
The second is displayed correctly if the charset is changed to iso-8859-1.
I don't know how to display latin1 characters with charset utf-8.
Or at least, I need a way to detect the string's encoding (e.g. this is "utf-8" or this is "iso-8859-1"), so I can use the appropriate charset to display it.
Decode inputs. Encode outputs.
use strict;
use warnings qw( all );
use feature qw( say );
use utf8; # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8
my $s1 = "點看";
my $s2 = "pøp";
say for $s1, $s2;
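When the input arrives as bytes of unknown encoding, reliable detection is not possible in general, but a common heuristic is to try a strict UTF-8 decode first and fall back to ISO-8859-1. The sketch below assumes those are the only two candidates; bytes_to_text is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: decode as strict UTF-8, else assume ISO-8859-1.
sub bytes_to_text {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

my $utf8_input   = "\xE9\xBB\x9E\xE7\x9C\x8B";   # "點看" as UTF-8 bytes
my $latin1_input = "p\xF8p";                       # "pøp" as ISO-8859-1 bytes

for my $input ($utf8_input, $latin1_input) {
    my $text = bytes_to_text($input);              # decode the input
    print encode('UTF-8', $text), "\n";            # encode the output
}
```

Both strings end up as character strings internally, so a single UTF-8 output charset can display them.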

Why do I get garbled output when I decode some HTML entities but not others?

In Perl, I am trying to decode strings which contain numeric HTML entities using HTML::Entities. Some entities work, while "newer" entities don't. For example:
decode_entities('&#174;'); # returns ® as expected
decode_entities('&#937;'); # returns Î© instead of Ω
decode_entities('&#9733;'); # returns â˜… instead of ★
Is there a way to decode these "newer" HTML entities in Perl? In PHP, the html_entity_decode function seems to decode all of these entities without any problem.
The decoding works fine. It's how you're outputting them that's wrong. For example, you may have sent the strings to a terminal without encoding them for that terminal first. This is achieved through the open pragma in the following program:
$ perl -e'
use open ":std", ":encoding(UTF-8)";
use HTML::Entities qw( decode_entities );
CORE::say decode_entities($_)
for "&#174;", "&#937;", "&#9733;";
'
®
Ω
★
Make sure your terminal can handle UTF-8 encoding. It looks like it's having problems with multibyte characters. You can also try to set UTF-8 for STDOUT in case you get wide character warnings.
use strict;
use warnings;
use HTML::Entities;
binmode STDOUT, ':encoding(UTF-8)';
print decode_entities('&#174;'); # returns ®
print decode_entities('&#937;'); # returns Ω
print decode_entities('&#9733;'); # returns ★
This gives me the correct/expected results.

Perl Using Foreign Characters in Windows

I'm trying to print Turkish characters like ş,ı,ö,ç in Windows using Perl, but I haven't been able to. My main purpose is to create folders whose names contain these special characters in Windows.
This is my code:
use Text::Iconv;
use strict;
use warnings;
$conve = Text::Iconv->new("windows-1254","UTF-16");
$converted = $conve->convert("ş");
print $converted;
system("mkdir $converted");
I get a malformed utf-8 character (byte 0xfe) aa.pl at line 7
Save the following as UTF-8:
use utf8;
use strict;
use warnings;
use open ":std", ":encoding(cp1254)"; # Set encoding for STD*
use Encode qw( encode );
my $file_name = "ş";
print "$file_name\n";
system(encode('cp1254', qq{mkdir "$file_name"}));
use utf8 tells Perl the source is UTF-8.
use open ":std", ":encoding(cp1254)"; causes text sent to STDOUT and STDERR to be encoded using cp1254, and it causes text read from STDIN to be decoded from cp1254.
It doesn't affect what is sent to system calls like system, so you need to encode those explicitly.