How to convert special characters to UTF-8? - perl

When I extract data from a MySQL database, some of the output have special characters,
when opened in e.g. emacs it decodes to \240 and \346.
When shown in an UTF-8 terminal, the special characters is shown as �
So the used encoding seams to only use 1 byte per character.
I can e.g. see that \346 should be æ.
Question
Does Perl have a module that can encode these special characters to UTF-8?

Use Encode::decode to decode your data from whatever encoding it's in to Perl's internal format.
Then, when writing the data out to a file, set the 'utf8' layer to make the data be written in UTF-8.
use Encode;
my $data_from_database = ...;
my $perl_data = decode('ISO-8859-1', $data_from_database);
binmode STDOUT, ':utf8';
print $perl_data;

Related

How to fix UTF-8 encoding error with Russian words

My Perl script reads from an text file which contains mainly English ANSI words.
But there are Russian words sometimes, which I can not convert back to UTF-8.
See same example (the words in brackets are the English translations):
Êîìïîíåíò (Component)
Àâòîð (Author)
Ãýíäàëüô (Gandalf)
Äàòà ñîçäàíèÿ (Create date): 20-ìàé(may)-2003
Äàòà êîððåêöèè (Last correction Date): 25-ìàð(mar)-2003
Âåðñèÿ (Version): 0.92
Áëàãîäàðíîñòè (Thanks):
Íîâîå â (New in):
Ïîääåðæêà (Support)
Î÷åíü ìíîãî (Very much)
I've read the UTF-8 Encoding Debugging Chart and tried also the following
$s='Àâòîð';
from_to($s, "iso-8859-5","utf-8");
print "$s\n";
my $s = Encode::decode( 'iso-8859-5', 'Àâòîð' );
from_to($s, "iso-8859-5","utf-8");
print "$s\n";
I've tried also cp1252 instead of iso-8859-5, but nothing helps.
I've tried also Encode::Guess, but the result is not helpful: iso-8859-5 or cp1251 or koi8-r or iso-8859-1.
Any idea how to convert 'Àâòîð' back to the Cyrillic text 'автор'?
After some tries, I get the expected output Автор when switching the (Windows) console code page to 65001 (UTF-8) and decoding the input data from Windows-1251:
perl -MEncode -wle "print encode('UTF-8',decode('Windows-1251',shift))" "Àâòîð"
This suggests that the input data is encoded as Windows-1251 and decoding from that should give you the cyrrillic letters in Unicode. To output the data to a file, make sure you either set the encoding when opening the file (easiest) or encode each string to the target encoding on output (hard to keep track of):
my $octets = <$input_file>;
my $data = decode('Windows-1251', $octets;
open my $fh, '>:encoding(UTF-8)', $filename
or die "Couldn't write to $filename: $!";
print $fh decode('Windows-1251', $data);
Your bytes sequence is 0xc0 0xe2 0xf2 0xee 0xf0. This is russian word 'author' in cp1251. Representation given by you can be get if your application assumes that this is cp1252 encoding. Now the question is here what codepage do you like to have? Or, what codepage needed to your application?
To read file in cp1251 in correct way you have to use construction like this:
open (my $tmp_h,"<:encoding(cp-1251)", $ARGV[0]) or die $!;
That allows perl to know what codepage do you use in your file. And then when you will read file into string it allows perl to correctly convert values from cp1251 to Perl's internal form (UTF-8) and use these string as you want without any problems.
For internal form perl set UTF8 flag you can check using Devel::Peek module.
I think, that using internal form also will give you chance to use any string operation correctly and will help avoid mistakes.
I would recommend to use "use utf8" pragma in our source code. Now, all literals in the source code will be threated as utf8 and automatically converted into internal form correctly. Now, we know that our source code is in UTF8 (and it would also better if with BOM, because detecting BOM usualy is the first thing different IDE and editor will typical do). Later, we can open other files in any encoding using "<:encoding(....)" construction get data from the web, from the databases and again make sure that data were converted into internal form correctly checking utf8 flag. Having all this, we would be able to work with all this data in one manner, correcly compare string, use regular expression and so on.

In what encoding does readpipe return the result of an executed command?

Here's a simple perl script that is supposed to write a utf-8 encoded file:
use warnings;
use strict;
open (my $out, '>:encoding(utf-8)', 'tree.out') or die;
print $out readpipe ('tree ~');
close $out;
I have expected readpipe to return a utf-8 encoded string since LANG is set toen_US.UTF-8. However, looking at tree.out (while making sure the editor recognizes it a as utf-8 encoded) shows me all garbled text.
If I change the >:encoding(utf-8) in the open statement to >:encoding(latin-1), the script creates a utf-8 file with the expected
text.
This is all a bit strange to me. What is the explanation for this behavior?
readpipe is returning to perl a string of undecoded bytes. We know that that string is UTF-8 encoded, but you've not told Perl.
The IO layer on your output handle is taking that string, assuming it is Unicode code-points and re-encoding them as UTF-8 bytes.
The reason that the latin-1 IO layer appears to be functioning correctly is that it is writing out each undecoded byte unmolested because the 1st 256 unicode code-points correspond nicely with latin-1.
The proper thing to do would be to decode the byte-string returned by readpipe into a code-point-string, before feeding it to an IO-layer. The statement use open ':utf8', as mentioned by Borodin, should be a viable solution as readpipe is specifically mentioned in the open manual page.

perl uri_escape_utf8 with arabic

I am trying to escape some Arabic to LWP::UserAgent. I am testing this with a script below:
my $files = "/home/root/temp.txt";
unlink ($files);
open (OUTFILE, '>>', $files);
my $text = "ضثصثضصثشس";
print OUTFILE uri_escape_utf8($text)."\n";
close (OUTFILE);
However, this seems to cause the following:
%C3%96%C3%8B%C3%95%C3%8B%C3%96%C3%95%C3%8B%C3%94%C3%93
which is not correct. Any pointers to what I need to do in order to escape this correctly?
Thank you for your help in advance.
Regards,
Olli
Perl consideres your source file to be encoded as Latin-1 until you tell it to use utf8. If we do that, the string "ضثصثضصثشس" does not contain some jumbled bytes, but is rather a string of codepoints.
The uri_escape_utf8 expects a string of codepoints (not bytes!), encodes them, and then URI-escapes them. Ergo, the correct thing to do is
use utf8;
use URI::Escape;
print uri_escape_utf8("ضثصثضصثشس"), "\n";
Output: %D8%B6%D8%AB%D8%B5%D8%AB%D8%B6%D8%B5%D8%AB%D8%B4%D8%B3
If we fail to use utf8, then uri_escape_utf8 gets a string of bytes (which are accidentally encoded in UTF8), so we should have used uri_escape:
die "This is the wrong way to do it";
use URI::Escape;
print uri_escape("ضثصثضصثشس"), "\n";
which produces the same output as above – but only by accident.
Using uri_escape_utf8 whith a bytestring (that would decode to arabic characters) produces the totally wrong
%C3%98%C2%B6%C3%98%C2%AB%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B6%C3%98%C2%B5%C3%98%C2%AB%C3%98%C2%B4%C3%98%C2%B3
because this effectively double-encodes the data. It is the same as
use utf8;
use URI::Escape;
use Encode;
print uri_escape(encode "utf8", encode "utf8", "ضثصثضصثشس"), "\n";
Edit: So you used CP-1256, which is a non-portable single byte encoding. It is unable to encode arbitrary Unicode characters, and should therefore be avoided along with other pre-Unicode encodings. You didn't declare your encoding, so perl thinks you meant Latin-1. This means that what you saw as "ضثصثضصثشس" was actually the byte stream D6 CB D5 CB D6 D5 CB D4 D3, which decodes to some unprintable junk in Latin-1.
Edit: So you want to decode command line arguments. The Encode::Locale module should manage this. Before accessing any parameters from #ARGV, do
use Encode::Locale;
decode_argv(Encode::FB_CROAK); # possibly: BEGIN { decode_argv(...) }
or use the locale pseudoencoding which it provides:
my $decoded_string = decode "locale" $some_binary_data;
Use this as a part in the overall strategy of decoding all input, and always encoding your output.

Perl HTML Encoding Named Entities

I would like to encode 'special chars' to their named entity.
My code:
use HTML::Entities;
print encode_entities('“');
Desired output:
“
And not:
“
Does anyone have an idea? Greetings
If you don't use use utf8;, the file is expected to be encoded using iso-8859-1 (or subset US-ASCII).
«“» is not found in iso-8859-1's charset.
If you use use utf8;, the file is expected to be encoded using UTF-8.
«“» is found in UTF-8's charset, Unicode.
You indicated your file isn't saved as UTF-8, so as far as Perl is concerned, your source file cannot possibly contain «“».
Odds are that you encoded your file using cp1252, an extension of iso-8859-1 that adds «“». That's not a valid choice.
Options:
[Best option] Save the file as UTF-8 and use the following:
use utf8;
use HTML::Entities;
print encode_entities('“');
Save the file as cp1252, but only use US-ASCII characters.
use charnames ':full';
use HTML::Entities;
print encode_entities("\N{LEFT DOUBLE QUOTATION MARK}");
or
use HTML::Entities;
print encode_entities("\N{U+201C}");
or
use HTML::Entities;
print encode_entities("\x{201C}");
[Unrecommended] Save the file as cp1252 and decode literals explicitly
use HTML::Entities;
print encode_entities(decode('cp1252', '“'));
Perl sees:
use HTML::Entities;
print encode_entities(decode('cp1252', "\x93"));
Perl doesn't know the encoding of your source file. If you include any special characters, you should always save it with UTF-8-encoding and put
use utf8;
at the top of your code. This will make sure your string literals contain codepoints, not just bytes.
I had the same problem and applied all of the above hints. It worked from within my perl script (CGI), e.g. ä = encode_entities("ä") produced the correct result. Yet applying encode_entities(param("test")) would encode the single bytes.
I found this advice: http://blog.endpoint.com/2010/12/character-encoding-in-perl-decodeutf8.html
Putting it together this is my solution which finally works:
use CGI qw/:standard/;
use utf8;
use HTML::Entities;
use Encode;
print encode_entities(decode_utf8(param("test")));
It is not clear to me why that was required, but it works. HTH

Perl: String literal in module in latin1 - I want utf8

In the Date::Holidays::DK module, the names of certain Danish holidays are written in Latin1 encoding. For example, January 1st is 'Nytårsdag'. What should I do to $x below in order to get a proper utf8-encoded string?
use Date::Holidays::DK;
my $x = is_dk_holiday(2011,1,1);
I tried various combinations of use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect. I also triede to use Encode's decode, with no luck. More specifically,
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print "January 1st is '$x'\n";
gives the output
SV = PV(0x15eabe8) at 0x1492a10
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1593710 "Nyt\303\245rsdag"\0 [UTF8 "Nyt\x{e5}rsdag"]
CUR = 10
LEN = 16
January 1st is 'Nyt sdag'
(with an invalid character between t and s).
use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect.
Correct. The utf8 pragma only indicates that the source code of the program is written in UTF-8.
I also tried to use Encode's decode, with no luck.
You did not perceive this correctly, you in fact did the right thing. You now have a string of Perl characters and can manipulate it.
with an invalid character between t and s
You also interpret this wrong, it is in fact the å character.
You want to output UTF-8, so you are lacking the encoding step.
my $octets = encode 'UTF-8', $x;
print $octets;
Please read http://p3rl.org/UNI for the introduction to the topic of encoding. You always must decode and encode, either explicitely or implicitely.
use utf8 only is a hint to the perl interpreter/compiler that your file is UTF-8 encoded. If you have strings with high-bit set, it will automatically encode them to unicode.
If you have a variable that is encoded in iso-8859-1 you must decode it. Then your variable is in the internal unicode format. That's utf8 but you shouldn't care which encoding perl uses internaly.
Now if you want to print such a string you need to convert the unicode string back to a byte string. You need to do a encode on this string. If you don't do an encode manually perl itself will encode it back to iso-8859-1. This is the default encoding.
Before you print your variable $x, you need to do a $x = encode('UTF-8', $x) on it.
For correct handling of UTF-8 you always need to decode() every external input over I/O. And you always need to encode() everything that leaves your program.
To change the default input/output encoding you can use something like this.
use utf8;
use open ':encoding(UTF-8)';
use open ':std';
The first line says that your source code is encoded in utf8. The second line says that every input/ouput should automatically encode in utf8. It is important to notice that a open() also open a file in utf8 mode. If you work with binary files you need to call a binmode() on the handle.
But the second line does not change handling of STDIN,STDOUT or STDERR. The third line will change that.
You can probably use the modul utf8:all that makes this process easier. But it is always good to understand how all this works behind the scenes.
To correct your example. One possible way is this:
#!/usr/bin/env perl
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print encode("UTF-8", "January 1st is '$x'\n");