Uppercase accented characters in perl

Is there a way to uppercase accented characters in perl,
my $string = "éléphant";
print uc($string);
So that it actually prints ÉLÉPHANT ?
My perl script is encoded in ISO-8859-1 and $string is printed in an xml file with the same encoding.

perl only understands US-ASCII and UTF-8, and the latter requires
use utf8;
If you want to keep the file as iso-8859-1, you'll need to decode the text explicitly.
use open ':std', ':encoding(locale)';
use Encode qw( decode );
# Source is encoded using iso-8859-1, so we need to decode ourselves.
my $string = decode("iso-8859-1", "éléphant");
print uc($string);
But it's probably better to convert the script to UTF-8.
use utf8; # Source is encoded using UTF-8
use open ':std', ':encoding(locale)';
my $string = "éléphant";
print uc($string);
If you're printing to a file, make sure you use :encoding(iso-8859-1) when you open the file (no matter which alternative you use).
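For the file case, here is a minimal sketch that uses only the core Encode module. The helper name write_upper_latin1 and the file name elephant.xml are made up for illustration. The literal is written with \xE9 escapes so the example doesn't depend on how the snippet itself is saved:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: uppercase decoded text and write it out as iso-8859-1.
sub write_upper_latin1 {
    my ($path, $text) = @_;
    # The :encoding layer encodes the characters on output.
    open(my $fh, '>:encoding(iso-8859-1)', $path)
        or die "Can't open $path: $!";
    print {$fh} uc($text), "\n";
    close($fh) or die "Can't close $path: $!";
    return;
}

# Bytes as they would appear in an iso-8859-1 source file ("\xE9" is é).
my $string = decode('iso-8859-1', "\xE9l\xE9phant");
write_upper_latin1('elephant.xml', $string);
```

The file should then hold ÉLÉPHANT as iso-8859-1 bytes (\xC9 for each É). Decoding first matters: uc only applies Unicode case rules to the é once the string is decoded text.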

Try doing this :
use Encode qw/encode decode/;
my $enc = 'utf-8'; # This script is stored as UTF-8
my $str = "éléphant\n";
my $text_str = decode($enc, $str);
$text_str = uc $text_str;
print encode($enc, $text_str);
OUTPUT :
ÉLÉPHANT

Related

Convert mangled unicode in perl

I have a listing of filenames where - due to bad conversion from unicode - some of the names are mangled.
This:
Naturalis_Biodiversity_Center_-_RMNH.ART.40_-_Gymnothorax_hepaticus_(RÃ¼ppell)_-_Kawahara_Keiga_-_1823_-_1829_-_Siebold_Collection_-_pencil_drawing_-_water_colour.jpeg
should read like this (notice the umlaut about halfway):
Naturalis_Biodiversity_Center_-_RMNH.ART.40_-_Gymnothorax_hepaticus_(Rüppell)_-_Kawahara_Keiga_-_1823_-_1829_-_Siebold_Collection_-_pencil_drawing_-_water_colour.jpeg
other cases are
GÃ¼nther => Günther
ForsskÃ¥l => Forsskål
Is there a way to find and correct these cases with perl, apart from manual search and replace?
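The string is UTF-8 that was wrongly decoded as ISO-8859-1, so each multi-byte character shows up as two Latin-1 characters.
You can reverse the bad conversion by encoding the text back to ISO-8859-1 bytes and then decoding those bytes as UTF-8:
use strict;
use warnings;
use utf8; # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8
use Encode qw(encode decode);
my $str = 'GÃ¼nther';
my $newStr = decode('UTF-8', encode('iso-8859-1', $str));
print "$newStr\n";
Output:
Günther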
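For a whole listing of names, one approach is to try the reverse conversion on each name and keep the result only when it succeeds. This is a minimal sketch, assuming the damage came from UTF-8 bytes being read as ISO-8859-1 (names mangled through cp1252 would need a different encoding name). fix_mojibake is a hypothetical helper, and the test names are written with escapes to stay independent of the source encoding:

```perl
use strict;
use warnings;
use Encode qw( encode decode );

# Hypothetical helper: return the repaired name, or the original name
# when it does not look like UTF-8 misread as ISO-8859-1.
sub fix_mojibake {
    my ($name) = @_;
    my $fixed = eval {
        # Fails for characters above U+00FF (not ISO-8859-1 mojibake).
        my $bytes = encode('iso-8859-1', $name,
                           Encode::FB_CROAK() | Encode::LEAVE_SRC());
        # Fails when the bytes are not valid UTF-8 (name was already fine).
        decode('UTF-8', $bytes, Encode::FB_CROAK());
    };
    return defined $fixed ? $fixed : $name;
}

binmode(STDOUT, ':encoding(UTF-8)');
# "\x{C3}\x{BC}" is the mangled form of ü; the last name is already correct.
print fix_mojibake($_), "\n"
    for "G\x{C3}\x{BC}nther", "Forssk\x{C3}\x{A5}l", "G\x{FC}nther";
```

Names that are pure ASCII or already correct come back unchanged, so the helper can be run over the full listing.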

PERL: how to detect string encoding so I can use the right charset

I have these 2 example strings:
$a = "點看";
$b = "pøp";
The first one displays correctly with charset UTF-8, but the second does not.
The second displays correctly if the charset is changed to iso-8859-1.
I don't know how to display latin1 characters with charset utf-8.
Or at least I need a way to detect the string's encoding (e.g. "this is utf-8" or "this is iso-8859-1"), so I can use the appropriate charset to display it.
Decode inputs. Encode outputs.
use strict;
use warnings qw( all );
use feature qw( say );
use utf8; # Source code is encoded using UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8
my $s1 = "點看";
my $s2 = "pøp";
say for $s1, $s2;
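When the inputs arrive as raw bytes in an unknown mix of the two encodings, a common heuristic is to try a strict UTF-8 decode first and fall back to iso-8859-1. This is a sketch of that heuristic, not something the answer above prescribes. decode_guess is a hypothetical helper, and a short iso-8859-1 string can occasionally be valid UTF-8 by coincidence:

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode as UTF-8 when the bytes are valid UTF-8,
# otherwise assume iso-8859-1 (every byte sequence is valid there).
sub decode_guess {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('iso-8859-1', $bytes);
}

binmode(STDOUT, ':encoding(UTF-8)');         # Encode outputs
print decode_guess("p\xF8p"), "\n";          # iso-8859-1 bytes for pøp
print decode_guess("p\xC3\xB8p"), "\n";      # UTF-8 bytes for pøp
```

Both calls yield the same decoded text, which can then be sent out under a single charset.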

perl uri_escape_utf8 with arabic

I am trying to escape some Arabic to LWP::UserAgent. I am testing this with a script below:
my $files = "/home/root/temp.txt";
unlink ($files);
open (OUTFILE, '>>', $files);
my $text = "ضثصثضصثشس";
print OUTFILE uri_escape_utf8($text)."\n";
close (OUTFILE);
However, this seems to cause the following:
%C3%96%C3%8B%C3%95%C3%8B%C3%96%C3%95%C3%8B%C3%94%C3%93
which is not correct. Any pointers to what I need to do in order to escape this correctly?
Thank you for your help in advance.
Regards,
Olli
Perl considers your source file to be encoded as Latin-1 until you tell it to use utf8. If we do that, the string "ضثصثضصثشس" does not contain some jumbled bytes, but is rather a string of codepoints.
The uri_escape_utf8 expects a string of codepoints (not bytes!), encodes them, and then URI-escapes them. Ergo, the correct thing to do is
use utf8;
use URI::Escape;
print uri_escape_utf8("ضثصثضصثشس"), "\n";
Output: %D8%B6%D8%AB%D8%B5%D8%AB%D8%B6%D8%B5%D8%AB%D8%B4%D8%B3
If we fail to use utf8, then uri_escape_utf8 gets a string of bytes (which are accidentally encoded in UTF8), so we should have used uri_escape:
die "This is the wrong way to do it";
use URI::Escape;
print uri_escape("ضثصثضصثشس"), "\n";
which produces the same output as above – but only by accident.
Using uri_escape_utf8 with a bytestring (that would decode to Arabic characters) produces the totally wrong
%C3%98%C2%B6%C3%98%C2%AB%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B6%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B4%C3%98%C2%B3
because this effectively double-encodes the data. It is the same as
use utf8;
use URI::Escape;
use Encode;
print uri_escape(encode "utf8", encode "utf8", "ضثصثضصثشس"), "\n";
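The difference between the correct and the double-encoded result can be reproduced with core modules only. In this sketch, escape_utf8_sketch is a hypothetical stand-in that mimics the "encode as UTF-8, then percent-escape" behaviour described above. It is not the URI::Escape implementation, and the set of characters it leaves unescaped is an assumption:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical stand-in for uri_escape_utf8: UTF-8 encode, then percent-escape.
sub escape_utf8_sketch {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# The first two letters of the Arabic string, as code points ...
my $chars  = "\x{636}\x{62B}";
# ... and the same two letters as UTF-8 bytes.
my $octets = encode('UTF-8', $chars);

print escape_utf8_sketch($chars),  "\n";   # %D8%B6%D8%AB
print escape_utf8_sketch($octets), "\n";   # %C3%98%C2%B6%C3%98%C2%AB
```

Passing code points gives the expected escapes. Passing bytes makes each byte get encoded a second time, which matches the start of the wrong output shown above.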
Edit: So you used CP-1256, which is a non-portable single byte encoding. It is unable to encode arbitrary Unicode characters, and should therefore be avoided along with other pre-Unicode encodings. You didn't declare your encoding, so perl thinks you meant Latin-1. This means that what you saw as "ضثصثضصثشس" was actually the byte stream D6 CB D5 CB D6 D5 CB D4 D3, which decodes to some unprintable junk in Latin-1.
Edit: So you want to decode command line arguments. The Encode::Locale module should manage this. Before accessing any parameters from @ARGV, do
use Encode::Locale;
decode_argv(Encode::FB_CROAK); # possibly: BEGIN { decode_argv(...) }
or use the locale pseudoencoding which it provides:
use Encode qw( decode );
my $decoded_string = decode("locale", $some_binary_data);
Use this as a part in the overall strategy of decoding all input, and always encoding your output.

Perl HTML Encoding Named Entities

I would like to encode 'special chars' to their named entity.
My code:
use HTML::Entities;
print encode_entities('“');
Desired output:
&ldquo;
And not:
“
Does anyone have an idea? Greetings
If you don't use use utf8;, the file is expected to be encoded using iso-8859-1 (or subset US-ASCII).
«“» is not found in iso-8859-1's charset.
If you use use utf8;, the file is expected to be encoded using UTF-8.
«“» is found in UTF-8's charset, Unicode.
You indicated your file isn't saved as UTF-8, so as far as Perl is concerned, your source file cannot possibly contain «“».
Odds are that you encoded your file using cp1252, an extension of iso-8859-1 that adds «“». That's not a valid choice.
Options:
[Best option] Save the file as UTF-8 and use the following:
use utf8;
use HTML::Entities;
print encode_entities('“');
Save the file as cp1252, but only use US-ASCII characters.
use charnames ':full';
use HTML::Entities;
print encode_entities("\N{LEFT DOUBLE QUOTATION MARK}");
or
use HTML::Entities;
print encode_entities("\N{U+201C}");
or
use HTML::Entities;
print encode_entities("\x{201C}");
[Unrecommended] Save the file as cp1252 and decode literals explicitly
use Encode qw( decode );
use HTML::Entities;
print encode_entities(decode('cp1252', '“'));
Perl sees:
use Encode qw( decode );
use HTML::Entities;
print encode_entities(decode('cp1252', "\x93"));
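The decoding step in that last option can be checked on its own with the core Encode module. A small sketch, with the byte written as an escape so it does not depend on how the snippet is saved:

```perl
use strict;
use warnings;
use Encode qw( decode );

# What a cp1252-encoded source file holds for the left double quotation mark.
my $byte = "\x93";
# Decoding turns that byte into the Unicode character.
my $char = decode('cp1252', $byte);

printf "U+%04X\n", ord($char);   # prints U+201C
```

U+201C is the code point that the `\N{U+201C}` and `\x{201C}` alternatives above name directly.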
Perl doesn't know the encoding of your source file. If you include any special characters, you should always save it with UTF-8 encoding and put
use utf8;
at the top of your code. This will make sure your string literals contain codepoints, not just bytes.
I had the same problem and applied all of the above hints. It worked from within my perl script (CGI), e.g. encode_entities("ä") produced the correct &auml;. Yet encode_entities(param("test")) encoded the individual bytes instead.
I found this advice: http://blog.endpoint.com/2010/12/character-encoding-in-perl-decodeutf8.html
Putting it together this is my solution which finally works:
use CGI qw/:standard/;
use utf8;
use HTML::Entities;
use Encode;
print encode_entities(decode_utf8(param("test")));
It is not clear to me why that was required, but it works. HTH
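A likely explanation, offered as an assumption rather than something the post confirms: use utf8 only affects literals in the source file, while param() by default returns the raw bytes that the browser sent. The sketch below uses a hard-coded byte string as a stand-in for such a parameter value:

```perl
use strict;
use warnings;
use Encode qw( decode_utf8 );

# Stand-in for param("test") when the form field holds "ä":
# two raw UTF-8 bytes (assumed default behaviour of param()).
my $raw  = "\xC3\xA4";
# After decoding, the same data is a single character.
my $text = decode_utf8($raw);

printf "raw: %d characters, decoded: %d character\n",
    length($raw), length($text);
# raw: 2 characters, decoded: 1 character
```

encode_entities works on characters, so the undecoded value gets one entity per byte, and the decoded value gets the single entity for ä.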

Perl Using Foreign Characters in Windows

I'm trying to print Turkish characters like ş, ı, ö, ç in Windows using perl, but I couldn't do it. My main purpose is to create folders with special characters in their names in Windows.
This is my code:
use Text::Iconv;
use strict;
use warnings;
$conve = Text::Iconv->new("windows-1254","UTF-16");
$converted = $conve->convert("ş");
print $converted;
system("mkdir $converted");
I get a malformed utf-8 character (byte 0xfe) aa.pl at line 7
Save the following as UTF-8:
use utf8;
use strict;
use warnings;
use open ":std", ":encoding(cp1254)"; # Set encoding for STD*
use Encode qw( encode );
my $file_name = "ş";
print "$file_name\n";
system(encode('cp1254', qq{mkdir "$file_name"}));
use utf8 tells Perl the source is UTF-8.
use open ":std", ":encoding(cp1254)"; causes text sent to STDOUT and STDERR to be encoded using cp1254, and it causes text read from STDIN to be decoded from cp1254.
It doesn't affect what is sent to system calls like system, so you need to encode those explicitly.