Perl: converting from cp1251 to utf8

I am trying to convert a string to UTF-8.
#!/usr/bin/perl -w
use Encode qw(encode decode is_utf8);
$str = "\320\300\304\310\323\321 \316\320\300\312\313";
Encode::from_to($str, 'windows-1251', 'utf-8');
print "converted:\n$str\n";
And in this case I get what I need:
# ./convert.pl
converted:
РАДИУС ОРАКЛ
But if I use external variable:
#!/usr/bin/perl -w
use Encode qw(encode decode is_utf8);
$str = $ARGV[0];
Encode::from_to($str, 'windows-1251', 'utf-8');
print "converted:\n$str\n";
Nothing happens.
# ./convert.pl "\320\300\304\310\323\321 \316\320\300\312\313"
converted:
\320\300\304\310\323\321 \316\320\300\312\313
This is the dump of the first example:
SV = PV(0x1dceb78) at 0x1ded120
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1de7970 "\320\300\304\310\323\321 \316\320\300\312\313"\0
CUR = 12
LEN = 16
And the second:
SV = PV(0x1c1db78) at 0x1c3c110
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1c5e7e0 "\\320\\300\\304\\310\\323\\321 \\316\\320\\300\\312\\313"\0
CUR = 45
LEN = 48
I've tried this method:
#!/usr/bin/perl -w
use Devel::Peek;
$str = pack 'C*', map oct, $ARGV[0] =~ /\\(\d{3})/g;
print Dump ($str);
# ./convert.pl "\320\300\304\310\323\321 \316\320\300\312\313"
SV = PV(0x1c1db78) at 0x1c3c110
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1c5e7e0 "\320\300\304\310\323\321\316\320\300\312\313"\0
CUR = 11
LEN = 48
But again it's not what I need. Could you help me to get the result like in the first script?
After using this
($str = shift) =~ s/\\([0-7]+)/chr oct $1/eg
as suggested by Borodin, I get this
SV = PVMG(0x13fa7f0) at 0x134d0f0
REFCNT =
FLAGS = (SMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x1347970 "\320\300\304\310\323\321 \316\320\300\312\313"\0
CUR = 12
LEN = 16
MAGIC = 0x1358290
MG_VIRTUAL = &PL_vtbl_mglob
MG_TYPE = PERL_MAGIC_regex_global(g)
MG_LEN = -1

It's not clear exactly what input you're getting or where from, or what you want your output to be, but you shouldn't be encoding your data into UTF-8 for use within the program, because you want to deal with characters and not encoded bytes. You should just decode it from whatever external encoding is being sent to the program and work with it like that
It sounds like the input is Windows-1251 and the output is UTF-8 (?) and I assume the backslashes are a distraction. There are no backslashes in the file or typed on the keyboard, are there? So, changing the base to hex for clarity, your input string is like this
"\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB"
and you want to convert it to a Perl character string, do some stuff with it, and print it to the output. If you're on a Linux machine and you want to explicitly decode it from raw input bytes, then you need to write something like this
use utf8;
use strict;
use warnings;
use feature 'say';
use open qw/ :std OUT :encoding(UTF-8) /;
use Encode qw/ decode /;
my $str = "\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB";
$str = decode('Windows-1251', $str);
say $str;
output
РАДИУС ОРАКЛ
But that's a contrived situation. The string is actually coming from an input stream, so it's better to set the encoding of the stream and forget about manual decoding. You can use binmode if you're reading from STDIN, like this
binmode STDIN, ':encoding(Windows-1251)';
and then text input from STDIN will be converted implicitly from Windows-1251-encoded bytes to a character string. Alternatively, if you're opening a file on your own handle, you can put the encoding in the open call
open my $fh, '<:encoding(Windows-1251)', $file or die $!;
and then you don't need to add a binmode either
As I said, I've assumed your output is UTF-8, and in the program above the line
use open qw/ :std OUT :encoding(UTF-8) /;
sets all output file handles to have a default of UTF-8 encoding. The :std also sets the built-in handles STDOUT and STDERR to UTF-8. If this isn't what you want and you can't figure out how to set it up as you need it then please do ask
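The stream-level approach described above can be tried without any real files. The following is only a minimal sketch: it uses in-memory file handles (a scalar reference in place of a file name) to stand in for the real input and output streams, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Raw Windows-1251 bytes for the sample string, standing in for real input.
my $raw = "\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB\n";

# Reading through an :encoding layer turns the bytes into characters.
open my $in, '<:encoding(Windows-1251)', \$raw or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# Writing through an :encoding layer turns characters back into bytes.
my $out_bytes = '';
open my $out, '>:encoding(UTF-8)', \$out_bytes or die "open: $!";
print {$out} $line;
close $out;

printf "%d characters in, %d UTF-8 bytes out\n",
    length($line), length($out_bytes);
```

Between the two handles the program only ever sees characters, which is the point of putting the encodings on the streams.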

think about this:
$ perl -le 'print length("\320\300\304\310\323\321 \316\320\300\312\313")'
12
$ perl -le 'print length($ARGV[0])' "\320\300\304\310\323\321 \316\320\300\312\313"
45
Here we receive the number of characters in the given string.
Note that when the string is inside a Perl script, Perl interprets the backslash escapes as character codes. But when the escapes are given outside the script, on the command line, they are just ordinary characters to the shell, which does not interpret them, so Perl gets exactly what you typed.
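To make the difference concrete, here is a small sketch that starts from the 45-character literal text the shell delivers, turns the escapes into bytes itself, and only then decodes. The helper name unescape_octal is hypothetical, and the sketch assumes every escape is exactly three octal digits.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn literal "\320"-style text into raw bytes.
sub unescape_octal {
    my ($text) = @_;
    (my $bytes = $text) =~ s/\\([0-7]{3})/chr oct $1/ge;
    return $bytes;
}

# Single quotes: this is what $ARGV[0] holds, backslashes and all.
my $arg = '\320\300\304\310\323\321 \316\320\300\312\313';

my $bytes = unescape_octal($arg);             # 12 Windows-1251 bytes
my $chars = decode('Windows-1251', $bytes);   # 12 characters
my $utf8  = encode('UTF-8', $chars);          # UTF-8 bytes for output

print join(' ', map { length } $arg, $bytes, $chars, $utf8), "\n";
```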

A couple of simple methods to convert backslash-octal escapes typed in a UTF-8 terminal from cp1251:
$str = `perl -e 'print "$ARGV[0]"' | iconv -f windows-1251`;
print $str;
or
$str = pack "C*", map oct()? oct : 32, $ARGV[0] =~ / \d{3} | \s /gx;
print $str;

Related

How to make the output from Text::CSV utf8?

I have a CSV file, say win.csv, whose text is encoded in windows-1252. First I use iconv to make it in utf8.
$iconv -o test.csv -f windows-1252 -t utf-8 win.csv
Then I read the converted CSV file with the following Perl script (utfcsv.pl).
#!/usr/bin/perl
use utf8;
use Text::CSV;
use Encode::Detect::Detector;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';',});
open my $fh, "<encoding(utf8)", "test.csv";
while (my $row = $csv->getline($fh)) {
my $line = join " ", @$row;
my $enc = Encode::Detect::Detector::detect($line);
print "($enc) $line\n";
}
$csv->eof || $csv->error_diag();
close $fh;
$csv->eol("\r\n");
exit;
Then the output is like the following.
(UTF-8) .........
() .....
Namely, the encoding of every line is detected as UTF-8 (or ASCII). But the actual output does not seem to be UTF-8. In fact, if I save the output to a file
$./utfcsv.pl > output.txt
then the encoding of output.txt is detected as windows-1252.
Question: How can I get the output text in UTF-8?
Notes:
Environment: openSUSE 13.2 x86_64, perl 5.20.1
I do not use Text::CSV::Encoded because the installation fails. (Because test.csv is converted in UTF-8, so it is strange to use Text::CSV::Encoded.)
I use the following script to check the encoding. (I also use it to find out the encoding of the initial CSV file win.csv.)
#!/usr/bin/perl
use Encode::Detect::Detector;
open my $in, "<", $ARGV[0] or die "open failed: $!";
while (my $line = <$in>) {
my $enc = Encode::Detect::Detector::detect($line);
chomp $enc;
if ($enc) {
print "$enc\n";
}
}
You have set the encoding of the input file handle (which, by the way, should be <:encoding(utf8) -- note the colon) but you haven't specified the encoding of the output channel, so Perl will send unencoded character values to the output
The Unicode values for characters that will fit in a single byte -- Basic Latin (ASCII) between 0 and 0x7F, and Latin-1 Supplement between 0x80 and 0xFF -- are very similar to Windows code page 1252. In particular a small letter u with a diaeresis is 0xFC in both Unicode and CP1252, so the text will look like CP1252 if it is output unencoded, instead of the two-byte sequence 0xC3 0xBC which is the same code point encoded in UTF-8
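A quick way to see the two byte sequences side by side is the sketch below. The hex_bytes helper is made up for display purposes and is not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{FC}";    # LATIN SMALL LETTER U WITH DIAERESIS

# Hypothetical display helper: hex dump of a byte string.
sub hex_bytes {
    return join ' ', map { sprintf '%02X', ord } split //, $_[0];
}

my $as_cp1252 = hex_bytes( encode('cp1252', $char) );
my $as_utf8   = hex_bytes( encode('UTF-8',  $char) );

print "cp1252: $as_cp1252\n";    # one byte
print "UTF-8:  $as_utf8\n";      # two bytes
```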
If you use binmode on STDOUT to set the encoding then the data will be output correctly, but it is simplest to use the open pragma like this
use open qw/ :std :encoding(utf-8) /;
which will set the encoding for STDIN, STDOUT and STDERR, as well as any newly-opened file handles. That means you don't have to specify it when you open the CSV file.
Note that I have also added use strict and use warnings, which are essential in any Perl program. I have also used autodie to remove the need for checks on the status of all IO operations, and I have taken advantage of the way Perl interpolates arrays inside double quotes, putting a space between the elements, which avoids the need for a join call. With those changes your code will look like this
#!/usr/bin/perl
use utf8;
use strict;
use warnings 'all';
use open qw/ :std :encoding(utf-8) /;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1, sep_char => ';' });
open my $fh, '<', 'test.csv';
while ( my $row = $csv->getline($fh) ) {
print "@$row\n";
}
close $fh;

Converting to unicode characters in Perl?

I want to convert the text ( Hindi ) to Unicode in Perl. I have searched in CPAN. But, I could not find the exact module/way which I am looking for. Basically, I am looking for something like this.
My Input is:
इस परीक्षण के लिए है
My expected output is:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
How to achieve this in Perl?
Give me some suggestions.
Try this
use utf8;
my $str = 'इस परीक्षण के लिए है';
for my $c (split //, $str) {
printf("\\u%04x", ord($c));
}
print "\n";
You don't really need any module to do that. ord for extracting the character code and sprintf for formatting it as four-digit zero-padded hex is more than enough:
use utf8;
my $str = 'इस परीक्षण के लिए है';
(my $u_encoded = $str) =~ s/(.)/sprintf "\\u%04x", ord($1)/sge;
# \u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
Because I left a few comments on how the other answers might fall short of the expectations of various tools, I'd like to share a solution that encodes characters outside of the Basic Multilingual Plane as pairs of two escapes: "😃" would become \ud83d\ude03.
This is done by:
1. Encoding the string as UTF-16, without a byte order mark. We explicitly choose an endianness; here, we arbitrarily use the big-endian form. This produces a string of octets (“bytes”), where two octets form one UTF-16 code unit, and two or four octets represent a Unicode code point.
This is done for convenience and performance; we could just as well determine the numeric values of the UTF-16 code units ourselves.
2. Unpacking the resulting binary string into 16-bit integers which represent each UTF-16 code unit. We have to respect the correct endianness, so we use the n* pattern for unpack (i.e. 16-bit big-endian unsigned integer).
3. Formatting each code unit as an \uxxxx escape.
As a Perl subroutine, this would look like
use strict;
use warnings;
use Encode ();
sub unicode_escape {
my ($str) = @_;
my $UTF_16BE_octets = Encode::encode("UTF-16BE", $str);
my @code_units = unpack "n*", $UTF_16BE_octets;
return join '', map { sprintf "\\u%04x", $_ } @code_units;
}
Test cases:
use Test::More tests => 3;
use utf8;
is unicode_escape(''), '',
'empty string is empty string';
is unicode_escape("\N{SMILING FACE WITH OPEN MOUTH}"), '\ud83d\ude03',
'non-BMP code points are escaped as surrogate halves';
my $input = 'इस परीक्षण के लिए है';
my $output = '\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948';
is unicode_escape($input), $output,
'ordinary BMP code points each have a single escape';
If you only want a simple converter, you can use the following filter
perl -CSDA -nle 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -nlE 'printf "\\u%04x",$_ for unpack "U*"'
like:
echo "इस परीक्षण के लिए है" | perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_' <<< "इस परीक्षण के लिए है"
prints:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948\u000a
Unicode with surrogate pairs.
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
my $str = "if( \N{U+1F42A}+\N{U+1F410} == \N{U+1F41B} ){ \N{U+1F602} = \N{U+1F52B} } # ορισμός ";
print "$str\n";
for my $ch (unpack "U*", $str) {
if( $ch > 0xffff ) {
my $h = ($ch - 0x10000) / 0x400 + 0xD800;
my $l = ($ch - 0x10000) % 0x400 + 0xDC00;
printf "\\u%04x\\u%04x", $h, $l;
}
else {
printf "\\u%04x", $ch;
}
}
print "\n";
prints
if( 🐪+🐐 == 🐛 ){ 😂 = 🔫 } # ορισμός
\u0069\u0066\u0028\u0020\ud83d\udc2a\u002b\ud83d\udc10\u0020\u003d\u003d\u0020\ud83d\udc1b\u0020\u0029\u007b\u0020\ud83d\ude02\u0020\u003d\u0020\ud83d\udd2b\u0020\u007d\u0020\u0023\u0020\u03bf\u03c1\u03b9\u03c3\u03bc\u03cc\u03c2\u0020

Proper handling of UTF-8 in Perl

I have been given a file, (probably) encoded in Latin-1 (ISO 8859-1), and there are some conversions and data mining to be done with it. The output is supposed to be in UTF-8, and I have tried just about everything I could find about encoding conversion in Perl; none of it produced any usable output.
I know that use utf8; does nothing to begin with. I have tried the Encode package, which looked promising:
open FILE, '<', $ARGV[0] or die $!;
my %tmp = ();
my $last_num = 0;
while (<FILE>) {
$_ = decode('ISO-8859-1', encode('UTF-8', $_));
chomp;
next unless length;
process($_);
}
I tried that in every combination I could think of, and also threw in a binmode(STDOUT, ":utf8");, open FILE, '<:encoding(ISO-8859-1)', $ARGV[0] or die $!; and much more. The results were either scrambled umlauts, or an error message like \xC3 is not a valid UTF-8 character, or even mixed text (some in UTF-8, some in Latin-1).
All I wanna have is a simple way to read in a Latin-1 text file and produce UTF-8 output on the console via print. Is there any simple way to do that in Perl?
See Perl encoding introduction and the Unicode cookbook.
Easiest with piconv:
$ piconv -f Latin1 -t UTF-8 < input.file > output.file
Easy, with encoding layers:
use autodie qw(:all);
open my $input, '<:encoding(Latin1)', $ARGV[0];
binmode STDOUT, ':encoding(UTF-8)';
Moderately, with manual de-/encoding:
use Encode qw(decode encode);
use autodie qw(:all);
open my $input, '<:raw', $ARGV[0];
binmode STDOUT, ':raw';
while (my $raw = <$input>) {
my $line = decode 'Latin1', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC;
my $result = process($line);
print {STDOUT} encode 'UTF-8', $result, Encode::FB_CROAK | Encode::LEAVE_SRC;
}
Maybe like this:
$_ = encode('utf-8', decode('ISO-8859-1', $_));
The data is GB2312-encoded, so this converts it to UTF-8:
#!/usr/bin/env perl
use Encode qw(encode decode);
while (<DATA>) {
$_ = encode('utf-8', decode('gb2312', $_));
print;
}
__DATA__
Â׶ذÂÔË»á
$_ = decode('ISO-8859-1', encode('UTF-8', $_));
This line has two problems with it. Firstly you are encoding your input to UTF-8 and then decoding it from ISO-8859-1. These two operations are the wrong way round.
Secondly, you almost certainly don't want to decode and encode at the same time. The Golden Rule of handling character encodings in Perl is to follow this process:
Decode data as soon as you get it from the outside world. This takes your input bytestream and converts it into Perl's internal representation for character strings.
Process the data according to your requirements.
Encode the data just before sending it to the outside world. This takes Perl's internal representation for character strings and converts it to a correctly-encoded bytestream for your required output encoding.
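The three steps above can be sketched as follows. This is only an illustration: the input is a hard-coded Latin-1 byte string standing in for real data, and the "processing" step is a placeholder.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $input_bytes = "caf\xE9\n";    # Latin-1 bytes, as read from outside

# 1. Decode as soon as the data arrives.
my $text = decode('ISO-8859-1', $input_bytes);
chomp $text;

# 2. Process character data (placeholder processing step).
my $result = "$text has " . length($text) . " characters";

# 3. Encode just before the data leaves the program.
my $output_bytes = encode('UTF-8', "$result\n");

binmode STDOUT, ':raw';           # the bytes are already encoded
print $output_bytes;
```

Because the length is taken after decoding, it counts characters rather than bytes.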

Question about pathname encoding

What have I done to get such a strange encoding in this path-name?
In my file manager (Dolphin) the path-name looks good.
#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Find;
use Devel::Peek;
use Encode qw(decode);
my $string;
find( sub { $string = $File::Find::name }, 'Delibes, Léo' );
$string =~ s|Delibes,\ ||;
$string =~ s|\..*\z||;
my ( $s1, $s2 ) = split m|/|, $string, 2;
say Dump $s1;
say Dump $s2;
# SV = PV(0x824b50) at 0x9346d8
# REFCNT = 1
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0x93da30 "L\303\251o"\0 [UTF8 "L\x{e9}o"]
# CUR = 4
# LEN = 16
# SV = PV(0x7a7150) at 0x934c30
# REFCNT = 1
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0x7781e0 "Lakm\303\203\302\251"\0 [UTF8 "Lakm\x{c3}\x{a9}"]
# CUR = 8
# LEN = 16
say $s1;
say $s2;
# Léo
# Lakmé
$s1 = decode( 'utf-8', $s1 );
$s2 = decode( 'utf-8', $s2 );
say $s1;
say $s2;
# L�o
# Lakmé
Unfortunately your operating system's pathname API is another "binary interface" where you will have to use Encode::encode and Encode::decode to get predictable results.
Most operating systems treat pathnames as a sequence of octets (i.e. bytes). Whether that sequence should be interpreted as latin-1, UTF-8 or other character encoding is an application decision. Consequently the value returned by readdir() is simply a sequence of octets, and File::Find doesn't know that you want the path name as Unicode code points. It forms $File::Find::name by simply concatenating the directory path (which you supplied) with the value returned by your OS via readdir(), and that's how you got code points mashed with octets.
Rule of thumb: Whenever passing path names to the OS, Encode::encode() it to make sure it is a sequence of octets. When getting a path name from the OS, Encode::decode() it to the character set that your application wants it in.
You can make your program work by calling find this way:
find( sub { ... }, Encode::encode('utf8', 'Delibes, Léo') );
And then calling Encode::decode() when using the value of $File::Find::name:
my $path = Encode::decode('utf8', $File::Find::name);
To be more clear, this is how $File::Find::name was formed:
use Encode;
# This is a way to get $dir to be represented as a UTF-8 string
my $dir = 'L' .chr(233).'o'.chr(256);
chop $dir;
say "dir: ", d($dir); # length = 3
# This is what readdir() is returning:
my $leaf = encode('utf8', 'Lakem' . chr(233));
say "leaf: ", d($leaf); # length = 7
$File::Find::name = $dir . '/' . $leaf;
say "File::Find::name: ", d($File::Find::name);
sub d {
join(' ', map { sprintf("%02X", ord($_)) } split('', $_[0]))
}
The POSIX filesystem API is broken as no encoding is enforced. Period.
Many problems can happen. For example a pathname can even contain both latin1 and UTF-8 depending on how various filesystems on a path handle encoding (and if they do).

shift jis decoding/encoding in perl

When I try decode a shift-jis encoded string and encode it back, some of the characters get garbled:
I have following code:
use Encode qw(decode encode);
$val = <STDIN>;
print "\nbefore decoding: $val";
my $ustr = Encode::decode("shiftjis",$val);
print "\nafter decoding: $ustr";
print "\nbefore encoding: $ustr";
$val = Encode::encode("shiftjis",$ustr);
print "\nafter encoding: $val";
When I use the string helloソworld as input, it is properly decoded and encoded back, i.e. the "before decoding" and "after encoding" prints in the code above show the same value.
But when I tried another string, such as ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ,
the end output got garbled.
Is it a perl library specific problem or it is a general shift jis mapping problem?
Is there any solution for it?
You should simply replace the shiftjis with cp932.
http://en.wikipedia.org/wiki/Code_page_932
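As a rough check of that suggestion, the sketch below tries both encoding names on the Roman numerals from the question. It assumes the stock Encode distribution, where cp932 includes the vendor extension rows that plain shiftjis lacks. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The ten Roman numeral characters, U+2160 .. U+2169.
my $numerals = join '', map { chr $_ } 0x2160 .. 0x2169;

# Strict Shift_JIS: encoding croaks on unmappable characters.
my $strict_ok = eval {
    encode('shiftjis', $numerals, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};

# cp932: encode and decode again, croaking on any failure.
my $bytes      = encode('cp932', $numerals, Encode::FB_CROAK() | Encode::LEAVE_SRC());
my $round_trip = decode('cp932', $bytes,    Encode::FB_CROAK() | Encode::LEAVE_SRC());

print "shiftjis: ", ( $strict_ok ? "encoded" : "rejected" ), "\n";
print "cp932 round trip: ", ( $round_trip eq $numerals ? "ok" : "changed" ), "\n";
```

LEAVE_SRC is passed so that the source string is not modified by the check mode.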
You lack error-checking.
use utf8;
use Devel::Peek qw(Dump);
use Encode qw(encode);
sub as_shiftjis {
my ($string) = @_;
return encode(
'Shift_JIS', # http://www.iana.org/assignments/character-sets
$string,
Encode::FB_CROAK
);
}
Dump as_shiftjis 'helloソworld';
Dump as_shiftjis 'ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ';
Output:
SV = PV(0x9148a0) at 0x9dd490
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x930e80 "hello\203\\world"\0
CUR = 12
LEN = 16
"\x{2160}" does not map to shiftjis at …