Convert mangled unicode in perl

I have a listing of filenames where - due to bad conversion from unicode - some of the names are mangled.
This:
Naturalis_Biodiversity_Center_-_RMNH.ART.40_-_Gymnothorax_hepaticus_(RÃ¼ppell)_-_Kawahara_Keiga_-_1823_-_1829_-_Siebold_Collection_-_pencil_drawing_-_water_colour.jpeg
should read like this (notice the umlaut about halfway):
Naturalis_Biodiversity_Center_-_RMNH.ART.40_-_Gymnothorax_hepaticus_(Rüppell)_-_Kawahara_Keiga_-_1823_-_1829_-_Siebold_Collection_-_pencil_drawing_-_water_colour.jpeg
Other cases are:
GÃ¼nther => Günther
ForsskÃ¥l => Forsskål
Is there a way to find and correct these cases with perl, apart from manual search and replace?

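The string contains UTF-8 bytes that were read as ISO-8859-1, so every byte became a separate character. That is why "ü" (bytes C3 BC) shows up as "Ã¼".
You can turn the characters back into the original bytes and decode those as UTF-8:
use strict;
use warnings;
use utf8;                              # this script itself is saved as UTF-8
use Encode qw(encode decode);
binmode STDOUT, ':encoding(UTF-8)';    # encode output for a UTF-8 terminal
my $str = 'GÃ¼nther';
# Map each character back to one byte, then decode the bytes as UTF-8.
my $newStr = decode("UTF-8", encode("iso-8859-1", $str));
print "$newStr\n";
Output:
Günther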
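To apply this kind of repair to a whole listing, something like the following could work. It is only a minimal sketch: it assumes the damage is exactly one round of UTF-8 bytes read as ISO-8859-1, and fix_mojibake is a hypothetical helper name, not part of any module. Names that do not fit that pattern are returned unchanged.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK LEAVE_SRC);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: undo one round of "UTF-8 bytes read as ISO-8859-1".
sub fix_mojibake {
    my ($name) = @_;
    # Only strings made of characters 0x00-0xFF that contain at least one
    # non-ASCII character can carry this kind of damage.
    return $name if $name =~ /[^\x00-\xFF]/ || $name !~ /[\x80-\xFF]/;
    my $bytes = encode('iso-8859-1', $name);    # characters back to bytes
    # Strict decode: dies if the bytes are not valid UTF-8.
    my $fixed = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $fixed ? $fixed : $name;
}

# "\xC3\xBC" is the mangled form of "ü", "\xC3\xA5" that of "å".
print fix_mojibake($_), "\n"
    for "G\xC3\xBCnther", "Forssk\xC3\xA5l", "already_fine.jpeg";
```

The strict decode is what keeps already-correct names safe: a correct "Günther" contains the single byte FC, which is not valid UTF-8, so the helper leaves it alone.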

Related

Is it possible to print 'é' as '%C3%A9' in Perl?

I have some strings with accents like "é", and the goal is to put such a string into a URL, so I need to convert "é" to "%C3%A9".
I have tested some modules such as HTML::Entities, Encode and URI::Encode without any success.
Actual Result:
%C3%83%C2%A9
Expected Result:
%C3%A9
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;
use feature 'say';
use URI::Encode qw( uri_encode );
my $var = "é";
say $var;
$var = uri_encode( $var );
say $var;
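You are missing use utf8. Without it, Perl reads the two UTF-8 bytes of "é" in your source as two separate characters, and each of them is then encoded again. From the documentation of the utf8 pragma: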
The use utf8 pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope. The no utf8 pragma tells
Perl to switch back to treating the source text as literal bytes in
the current lexical scope. (On EBCDIC platforms, technically it is
allowing UTF-EBCDIC, and not UTF-8, but this distinction is academic,
so in this document the term UTF-8 is used to mean both).
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are
directly usable without use utf8;.
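The difference between the expected and the actual result can be reproduced with core modules only. This is a rough sketch rather than what URI::Encode does internally; percent_encode_utf8 is a hypothetical helper, and the strings are written with escapes so the example does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string as UTF-8,
# then percent-encode every byte that is not an unreserved URL character.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# One character U+00E9: what the literal "é" is under `use utf8`.
print percent_encode_utf8("\x{E9}"), "\n";      # %C3%A9

# Two characters U+00C3 U+00A9: what the same literal is without `use utf8`.
print percent_encode_utf8("\xC3\xA9"), "\n";    # %C3%83%C2%A9
```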

Printing UTF8 in Perl?

Why does adding the use utf8 pragma produce garbled output (see below) compared to when I don't use this pragma?
The code:
use strict;
use v5.10;
use Data::Dumper;
# if I comment this line out, then the results print fine
use utf8;
my $s = {
'data' => 'The size is 200 μg'
};
say Dumper( $s );
Results without use utf8:
$VAR1 = {
'data' => 'The size is 200 μg'
};
Results WITH using use utf8:
$VAR1 = {
'data' => "The size is 200 \x{3bc}g"
};
Thanks for any insights
It is not garbled; it is a standard Data::Dumper escape, in the same style that the "Useqq" configuration option produces. Data::Dumper is designed for debugging, so this escaping lets you see exactly which characters a string contains even when they may not be printable.
Without use utf8;, your string actually contains the UTF-8 encoded bytes of that character rather than the character itself, since that is what the file contains. You can verify this by checking the length of the string. use utf8; causes the interpreter to decode the source code from UTF-8, including your literal string.
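The length check can be sketched as follows. To keep the example independent of the script's own encoding, the UTF-8 bytes of "μ" (CE BC) are written out explicitly, and decode stands in for what use utf8; does to a literal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What the literal '200 μg' holds without `use utf8;` in a UTF-8 file: raw bytes.
my $bytes = "200 \xCE\xBCg";

# What the same literal holds with `use utf8;`: decoded characters.
my $chars = decode('UTF-8', $bytes);

print length($bytes), "\n";    # 7, because "μ" counts as two bytes
print length($chars), "\n";    # 6, because "μ" is one character
```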
In order to print such characters, they need to be encoded back to UTF-8 bytes. You can either do this directly:
use strict;
use warnings;
use utf8;
use Encode 'encode';
print encode 'UTF-8', 'The size is 200 μg';
Or you can set an encoding layer on STDOUT, so that all printed text will be encoded to UTF-8:
use strict;
use warnings;
use utf8;
binmode *STDOUT, ':encoding(UTF-8)';
print 'The size is 200 μg';
Encoding to UTF-8 for Data::Dumper debugging is generally unnecessary, because it will escape such characters for your view already.

Why do I get garbled output when I decode some HTML entities but not others?

In Perl, I am trying to decode strings which contain numeric HTML entities using HTML::Entities. Some entities work, while "newer" entities don't. For example:
decode_entities('&#174;'); # returns ® as expected
decode_entities('&#937;'); # returns Î© instead of Ω
decode_entities('&#9733;'); # returns â˜… instead of ★
Is there a way to decode these "newer" HTML entities in Perl? In PHP, the html_entity_decode function seems to decode all of these entities without any problem.
The decoding works fine. It's how you're outputting them that's wrong. For example, you may have sent the strings to a terminal without encoding them for that terminal first. This is achieved through the open pragma in the following program:
$ perl -e'
use open ":std", ":encoding(UTF-8)";
use HTML::Entities qw( decode_entities );
CORE::say decode_entities($_)
for "&#174;", "&#937;", "&#9733;";
'
®
Ω
★
Make sure your terminal can handle UTF-8 encoding. It looks like it's having problems with multibyte characters. You can also try to set UTF-8 for STDOUT in case you get wide character warnings.
use strict;
use warnings;
use HTML::Entities;
binmode STDOUT, ':encoding(UTF-8)';
print decode_entities('&#174;'); # prints ®
print decode_entities('&#937;'); # prints Ω
print decode_entities('&#9733;'); # prints ★
This gives me the correct/expected results.
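Why the output layer matters can be seen without HTML::Entities at all. The sketch below assumes only that the decoded entity is the single character with code point 937 (U+03A9, the Ω from the question) and shows the bytes a UTF-8 terminal has to receive for it.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $omega = chr(937);                   # one character, U+03A9
my $bytes = encode('UTF-8', $omega);    # its UTF-8 encoding

# Show the lengths and the encoded bytes in hex.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
printf "%d character, %d bytes: %s\n", length($omega), length($bytes), $hex;
# prints: 1 character, 2 bytes: CE A9
```

A terminal or viewer that shows those two bytes as two separate Latin-1 characters produces exactly the kind of garbage described in the question.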

Perl HTML Encoding Named Entities

I would like to encode 'special chars' to their named entity.
My code:
use HTML::Entities;
print encode_entities('“');
Desired output:
&ldquo;
And not:
&#147;
Does anyone have an idea? Greetings
If you don't use use utf8;, the file is expected to be encoded using iso-8859-1 (or subset US-ASCII).
«“» is not found in iso-8859-1's charset.
If you use use utf8;, the file is expected to be encoded using UTF-8.
«“» is found in UTF-8's charset, Unicode.
You indicated your file isn't saved as UTF-8, so as far as Perl is concerned, your source file cannot possibly contain «“».
Odds are that you encoded your file using cp1252, an extension of iso-8859-1 that adds «“». That's not a valid choice.
Options:
[Best option] Save the file as UTF-8 and use the following:
use utf8;
use HTML::Entities;
print encode_entities('“');
Save the file as cp1252, but only use US-ASCII characters.
use charnames ':full';
use HTML::Entities;
print encode_entities("\N{LEFT DOUBLE QUOTATION MARK}");
or
use HTML::Entities;
print encode_entities("\N{U+201C}");
or
use HTML::Entities;
print encode_entities("\x{201C}");
[Unrecommended] Save the file as cp1252 and decode literals explicitly:
use Encode qw( decode );
use HTML::Entities;
print encode_entities(decode('cp1252', '“'));
Perl sees:
use Encode qw( decode );
use HTML::Entities;
print encode_entities(decode('cp1252', "\x93"));
Perl doesn't know the encoding of your source file. If you include any special characters, you should always save it with UTF-8-encoding and put
use utf8;
at the top of your code. This will make sure your string literals contain codepoints, not just bytes.
I had the same problem and applied all of the above hints. It worked from within my Perl script (CGI), e.g. &auml; = encode_entities("ä") produced the correct result. Yet encode_entities(param("test")) would still encode the individual bytes.
I found this advice: http://blog.endpoint.com/2010/12/character-encoding-in-perl-decodeutf8.html
Putting it together this is my solution which finally works:
use CGI qw/:standard/;
use utf8;
use HTML::Entities;
use Encode;
print encode_entities(decode_utf8(param("test")));
It is not clear to me why that was required, but it works. HTH
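A likely explanation, offered here as an assumption and not something confirmed by the CGI documentation in this thread, is that the parameter arrives as raw UTF-8 bytes instead of decoded characters, and use utf8; only affects literals in the source. The sketch below uses a hard-coded byte string as a stand-in for param("test").

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical stand-in for param("test") when the browser sends "ä" as UTF-8.
my $raw = "\xC3\xA4";

# Decode the bytes into characters before passing them to encode_entities.
my $text = decode_utf8($raw);

printf "raw length: %d, decoded length: %d, code point: U+%04X\n",
    length($raw), length($text), ord($text);
# prints: raw length: 2, decoded length: 1, code point: U+00E4
```

With the raw string, an entity encoder sees two characters (U+00C3 and U+00A4) and encodes each one; after decoding it sees the single character U+00E4.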

Uppercase accented characters in perl

Is there a way to uppercase accented characters in perl,
my $string = "éléphant";
print uc($string);
So that it actually prints ÉLÉPHANT ?
My perl script is encoded in ISO-8859-1 and $string is printed in an xml file with the same encoding.
For source code, perl only understands US-ASCII and UTF-8, and the latter requires
use utf8;
If you want to keep the file as iso-8859-1, you'll need to decode the text explicitly.
use open ':std', ':encoding(locale)';
use Encode qw( decode );
# Source is encoded using iso-8859-1, so we need to decode ourselves.
my $string = decode("iso-8859-1", "éléphant");
print uc($string);
But it's probably better to convert the script to UTF-8.
use utf8; # Source is encoded using UTF-8
use open ':std', ':encoding(locale)';
my $string = "éléphant";
print uc($string);
If you're printing to a file, make sure you use :encoding(iso-8859-1) when you open the file (no matter which alternative you use).
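Writing through an :encoding(iso-8859-1) layer might look like the sketch below. The in-memory filehandle is only a stand-in for the real XML file, the input is written with byte escapes so the example does not depend on how the script is saved, and the unicode_strings feature is enabled as an extra precaution so that uc applies Unicode rules to every string.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # uc() uses Unicode rules for all strings
use Encode qw(decode);

# "éléphant" as it is stored in an ISO-8859-1 source file (é is byte E9).
my $string = decode('iso-8859-1', "\xE9l\xE9phant");
my $upper  = uc $string;

# Write through an encoding layer; $buffer stands in for the XML file.
open my $fh, '>:encoding(iso-8859-1)', \my $buffer
    or die "Cannot open in-memory file: $!";
print {$fh} $upper;
close $fh;

# Show the bytes that were written.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $buffer;
print "$hex\n";    # C9 4C C9 50 48 41 4E 54
```

Byte C9 is "É" in ISO-8859-1, so the file receives ÉLÉPHANT in the encoding the XML expects.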
Try this:
use Encode qw/encode decode/;
my $enc = 'utf-8'; # This script is stored as UTF-8
my $str = "éléphant\n";
my $text_str = decode($enc, $str);
$text_str = uc $text_str;
print encode($enc, $text_str);
Output:
ÉLÉPHANT