Windows-1252 to Unicode conversion in Perl

I have the Cyrillic character ef (ф) as a hex value in Windows-1251.
The value is 0xF4. I want to convert it and print the character in Perl.
The way I can print it is via its Unicode code point, 0x0444.
So I am looking for a way to convert 0xF4 to 0x0444.
My eventual plan: given the hex value of any character in any encoding, I should be able to convert it to the Unicode hex value and finally print it.
But it's not working.
Below is the code I am using:
#!/usr/bin/perl
use strict;
use utf8;
use Encode qw(decode encode);
binmode(STDOUT, ":utf8");
my $runtime = chr(0x0444);
print "theta || ".$runtime." ||";
my $smiley = "\x{0444}";
print "theta || ".$smiley." ||";
my $georgian_an = pack("U", 0x0444);
print "theta || ".$georgian_an." ||";
my $hexstr = "0xF4";
my $num = hex $hexstr;
print $num; # printing the hex value
my $be_num = pack("N", $num);
$runtime = decode( "cp1252",$be_num);
print "\n".$runtime."\n"; # i should have got ф here
Output
perl mychar_new.pl
theta || ф ||theta || ф ||theta || ф ||244
ô

The output is correct – in CP-1252, 0xF4 is indeed ô (Wikipedia).
You wanted to specify CP-1251 instead!
use Encode 'decode';
my $cp1251 = "\xF4";
my $decoded = decode "cp1251", $cp1251;
print "$decoded\n";
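For the more general goal in the question (a hex value plus an encoding name in, a Unicode code point out), a minimal sketch could look like the following. The helper name hex_to_codepoint is made up for illustration, and it assumes a single-byte encoding such as cp1251 or cp1252.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): map a hex byte value in a
# single-byte encoding to the corresponding Unicode code point.
sub hex_to_codepoint {
    my ($hexstr, $encoding) = @_;
    my $byte = chr(hex $hexstr);    # hex() accepts the "0x" prefix
    my $char = decode($encoding, $byte, Encode::FB_CROAK);
    return ord($char);
}

binmode STDOUT, ':encoding(UTF-8)';
my $cp = hex_to_codepoint('0xF4', 'cp1251');
printf "U+%04X %s\n", $cp, chr($cp);    # U+0444 ф
```

Note that chr(hex ...) yields exactly one byte here, whereas pack("N", ...) in the question produces four bytes, three of them NUL.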

Related

Displaying CP437 ('extended ascii') in Perl/Gtk

Is there any way in which I can display old-fashioned extended ASCII (cp437) in a Gtk2::TextView? (Google suggests no answers.)
If there is some way of changing the charset used by a GTK widget, I can't find it.
Or maybe it's necessary to use Perl's Encode module, as I tried in the script below, but that doesn't work either.
#!/usr/bin/perl
# Display ASCII
use strict;
use diagnostics;
use warnings;
use Encode;
use Glib qw(TRUE FALSE);
use Gtk2 '-init';
# Open a Gtk2 window, with a Gtk2::TextView to display text
my $window = Gtk2::Window->new('toplevel');
$window->set_title('Extended ASCII viewer');
$window->set_position('center');
$window->set_default_size(600, 400);
$window->signal_connect('delete-event' => sub {
Gtk2->main_quit();
exit;
});
my $scrollWin = Gtk2::ScrolledWindow->new(undef, undef);
$window->add($scrollWin);
$scrollWin->set_policy('automatic', 'automatic');
$scrollWin->set_border_width(0);
my $textView = Gtk2::TextView->new;
$scrollWin->add_with_viewport($textView);
$textView->can_focus(FALSE);
$textView->set_wrap_mode('word-char');
$textView->set_justification('left');
my $buffer = $textView->get_buffer();
$window->show_all();
# In cp437, this is a series of accented A characters
my $string = chr (131) . chr (132) . chr (133) . chr (134);
# Display plain text
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $string . "\n");
# Display UTF-8 text
my $utfString = encode('utf8', $string);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $utfString . "\n");
# Display cp437
my $cpString = decode ('cp437', $string);
my $utfString2 = encode('utf-8', $cpString);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $utfString2 . "\n");
# Other suggestion
my $otherString = encode("utf-8", decode ("cp437", $string));
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $otherString . "\n");
# Directly decode a hex character (as suggested)
my $hexString = encode("utf-8", decode("cp437", "\xBA"));
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $hexString . "\n");
Gtk2->main();
Gtk wants to receive UTF-8 encoded strings, so anything you pass to a Gtk widget should be UTF-8 encoded.
If your input is cp437, then you'll want to decode it first and reencode it as UTF-8.
my $cp437_string = chr(153) x 10; # cp437 encoded
my $string = decode('cp437', $cp437_string); # Unicode code point encoded
my $utf8_string = encode('utf-8', $string); # utf-8 encoded
$buffer->insert_with_tags_by_name(
$buffer->get_end_iter(), $utf8_string . "\n");
Gtk2.pm expects Perl character strings (Encode::decode(...)), which are internally stored as UTF-8.
If you feed it a byte string (Encode::encode(...)), it will try to display it as latin1.
# In cp437, this is a series of accented A characters
my $string = chr (131) . chr (132) . chr (133) . chr (134);
my $perlString = decode ('cp437', $string);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $perlString . "\n\n");
my $charmap = join("", map chr, 128..255);
$charmap =~ s!.{16}\K!\n!g;
$perlString = decode ('cp437', $charmap);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $perlString . "\n");
Gtk2->main();
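To check the decoding step on its own, without any Gtk widgets, a small sketch like this prints the code points that come out of decode. It only assumes the Encode module; the resulting character string is what, per the answer above, should be handed to Gtk2.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw cp437 bytes 0x83 .. 0x86 (the accented "a" characters from the question).
my $bytes = join '', map { chr } 0x83 .. 0x86;

# Decode into a Perl character string.
my $chars = decode('cp437', $bytes);

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', map { sprintf 'U+%04X', ord } split //, $chars), "\n";
print "$chars\n";
```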

Chinese character doesn't always display correctly

I've been trying to pass a Chinese character to a JSON hash, but it always comes out as "å¥³".
#!/usr/bin/perl
use JSON;
# variable declaration
my $gender = "Female";
# turning english selection to Chinese character
if ($gender eq 'Female') {
$gender = "女";
} elsif ($gender eq 'Male') {
$gender = "男";
} elsif ($gender eq 'Decline to state') {
$gender = "";
}
my $hash_ref = {};
$hash_ref->{'detail_sex'} = $gender;
print JSON->new->utf8(1)->pretty(1)->encode($hash_ref);
This is the result I get:
{
"detail_sex" : "å¥³"
}
However, when I test another script the chinese character comes out perfectly.
#!/usr/bin/perl
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Encode qw(encode_utf8);
use JSON;
my $userid = 1616589;
my $time = 2015811;
my $ejob_id = 1908063;
# md5 encryption without chinese characters
my $md5_hex_sign = md5_hex($userid,$time,$job_id);
print "$md5_hex_sign\n";
# seeing if character will print
print "let's try encoding and decrypting \n";
print "the character to encrypt.\n";
print "女\n";
print "unicode print out\n";
print "\x{5973}\n";
my $char = "\x{5973}";
my $sign_char = "女";
print "unicode stored in \$char variable \n";
print $char, "\n";
print "md5 encryption of said chinese character from \$char with utf8 encoding\n";
print md5_hex(encode_utf8($char)), "\n";
print "md5 encryption of wide character with utf8 encoding\n";
print md5_hex(encode_utf8("女")), "\n";
my $sign_gender = md5_hex(encode_utf8($sign_char));
#JSON
print "JSON print out\n";
my $hash_ref = {};
$hash_ref->{'gender'} = $char;
$hash_ref->{'md5_gender'} = md5_hex(encode_utf8($char));
$hash_ref->{'char_gender'} = md5_hex(encode_utf8("女"));
$hash_ref->{'sign_gender'} = $sign_gender;
print JSON->new->utf8(1)->pretty(1)->encode($hash_ref);
Here is the result:
160a6f4bf9aec1c2d102330716ca8f4e
let's try encoding and decrypting
the character to encrypt.
女
unicode print out
Wide character in print at md5check.pl line 18.
女
unicode stored in $char variable
Wide character in print at md5check.pl line 22.
女
md5 encryption of said chinese character from $char with utf8 encoding
87c835a6b1749374a7524a596087b296
md5 encryption of wide character with utf8 encoding
06c82a10da7e297180d696ed92f524c1
JSON print out
{
"char_gender" : "06c82a10da7e297180d696ed92f524c1",
"md5_gender" : "87c835a6b1749374a7524a596087b296",
"sign_gender" : "06c82a10da7e297180d696ed92f524c1",
"gender" : "女"
}
Would someone kindly explain to me what is going on?
Things I've tried:
use utf8;
print JSON->new->ascii(1)->pretty(1)->encode($hash_ref);
But I still get this as a result:
{
"detail_sex" : "å¥³"
}
I'm mostly concerned about getting the Chinese character (女) md5 encrypted instead of "å¥³" being encrypted.
If it isn't already, save your source code in UTF-8.
Tell Perl that your script contains UTF-8 with the utf8 pragma:
use utf8;
Here's a very short test case for you to try:
use strict;
use warnings;
use utf8;
use JSON;
print encode_json({detail_sex => '女'});
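To see why the pragma matters, here is a rough sketch of what happens to the literal without it. The bytes are written out as hex escapes so the example does not depend on how the file is saved; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three UTF-8 bytes of U+5973, which is what Perl sees for the
# literal when the script does not say "use utf8".
my $bytes = "\xE5\xA5\xB3";

# With "use utf8" (or an explicit decode) the literal is one character.
my $char = decode('UTF-8', $bytes);

# Encoding the undecoded bytes treats each byte as a Latin-1 character,
# so every byte becomes two bytes: the data is now double-encoded.
my $double = encode('UTF-8', $bytes);

printf "bytes: %d, characters: %d, double-encoded bytes: %d\n",
    length($bytes), length($char), length($double);
# bytes: 3, characters: 1, double-encoded bytes: 6
```

The double-encoded form is what a UTF-8 terminal shows as "å¥³", and hashing it gives a different MD5 than hashing the properly encoded character.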

Converting to unicode characters in Perl?

I want to convert text (Hindi) to Unicode escape sequences in Perl. I have searched CPAN, but I could not find the exact module or approach I am looking for. Basically, I am looking for something like this.
My Input is:
इस परीक्षण के लिए है
My expected output is:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
How to achieve this in Perl?
Give me some suggestions.
Try this
use utf8;
my $str = 'इस परीक्षण के लिए है';
for my $c (split //, $str) {
printf("\\u%04x", ord($c));
}
print "\n";
You don't really need any module to do that. ord for extracting the character code and sprintf for formatting it as four-digit zero-padded hex are more than enough:
use utf8;
my $str = 'इस परीक्षण के लिए है';
(my $u_encoded = $str) =~ s/(.)/sprintf "\\u%04x", ord($1)/sge;
# \u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
Because I left a few comments on how the other answers might fall short of the expectations of various tools, I'd like to share a solution that encodes characters outside of the Basic Multilingual Plane as a pair of escapes: "😃" would become \ud83d\ude03.
This is done by:
1. Encoding the string as UTF-16, without a byte order mark. We explicitly choose an endianness; here, we arbitrarily use the big-endian form. This produces a string of octets ("bytes"), where two octets form one UTF-16 code unit, and two or four octets represent a Unicode code point. This is done for convenience and performance; we could just as well determine the numeric values of the UTF-16 code units ourselves.
2. Unpacking the resulting binary string into 16-bit integers which represent each UTF-16 code unit. We have to respect the correct endianness, so we use the n* pattern for unpack (i.e. 16-bit big-endian unsigned integer).
3. Formatting each code unit as an \uxxxx escape.
As a Perl subroutine, this would look like
use strict;
use warnings;
use Encode ();
sub unicode_escape {
my ($str) = @_;
my $UTF_16BE_octets = Encode::encode("UTF-16BE", $str);
my @code_units = unpack "n*", $UTF_16BE_octets;
return join '', map { sprintf "\\u%04x", $_ } @code_units;
}
Test cases:
use Test::More tests => 3;
use utf8;
is unicode_escape(''), '',
'empty string is empty string';
is unicode_escape("\N{SMILING FACE WITH OPEN MOUTH}"), '\ud83d\ude03',
'non-BMP code points are escaped as surrogate halves';
my $input = 'इस परीक्षण के लिए है';
my $output = '\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948';
is unicode_escape($input), $output,
'ordinary BMP code points each have a single escape';
If you only want a simple converter, you can use the following filter
perl -CSDA -nle 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -nlE 'printf "\\u%04x",$_ for unpack "U*"'
like:
echo "इस परीक्षण के लिए है" | perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_' <<< "इस परीक्षण के लिए है"
prints:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948\u000a
Unicode with surrogate pairs.
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
my $str = "if( \N{U+1F42A}+\N{U+1F410} == \N{U+1F41B} ){ \N{U+1F602} = \N{U+1F52B} } # ορισμός ";
print "$str\n";
for my $ch (unpack "U*", $str) {
if( $ch > 0xffff ) {
my $h = ($ch - 0x10000) / 0x400 + 0xD800;
my $l = ($ch - 0x10000) % 0x400 + 0xDC00;
printf "\\u%04x\\u%04x", $h, $l;
}
else {
printf "\\u%04x", $ch;
}
}
print "\n";
prints
if( 🐪+🐐 == 🐛 ){ 😂 = 🔫 } # ορισμός
\u0069\u0066\u0028\u0020\ud83d\udc2a\u002b\ud83d\udc10\u0020\u003d\u003d\u0020\ud83d\udc1b\u0020\u0029\u007b\u0020\ud83d\ude02\u0020\u003d\u0020\ud83d\udd2b\u0020\u007d\u0020\u0023\u0020\u03bf\u03c1\u03b9\u03c3\u03bc\u03cc\u03c2\u0020

Proper handling of UTF-8 in Perl

I have been given a file, (probably) encoded in Latin-1 (ISO 8859-1), and there are some conversions and data mining to be done with it. The output is supposed to be in UTF-8. I have tried just about everything I could find about encoding conversion in Perl, but none of it produced any usable output.
I know that use utf8; does nothing to begin with. I have tried the Encode package, which looked promising:
open FILE, '<', $ARGV[0] or die $!;
my %tmp = ();
my $last_num = 0;
while (<FILE>) {
$_ = decode('ISO-8859-1', encode('UTF-8', $_));
chomp;
next unless length;
process($_);
}
I tried that in every combination I could think of, and also threw in binmode(STDOUT, ":utf8");, open FILE, '<:encoding(ISO-8859-1)', $ARGV[0] or die $!; and much more. The results were either scrambled umlauts, an error message like \xC3 is not a valid UTF-8 character, or even mixed text (some in UTF-8, some in Latin-1).
All I want is a simple way to read in a Latin-1 text file and produce UTF-8 output on the console via print. Is there any simple way to do that in Perl?
See Perl encoding introduction and the Unicode cookbook.
Easiest with piconv:
$ piconv -f Latin1 -t UTF-8 < input.file > output.file
Easy, with encoding layers:
use autodie qw(:all);
open my $input, '<:encoding(Latin1)', $ARGV[0];
binmode STDOUT, ':encoding(UTF-8)';
Moderately, with manual de-/encoding:
use Encode qw(decode encode);
use autodie qw(:all);
open my $input, '<:raw', $ARGV[0];
binmode STDOUT, ':raw';
while (my $raw = <$input>) {
my $line = decode 'Latin1', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC;
my $result = process($line);
print {STDOUT} encode 'UTF-8', $result, Encode::FB_CROAK | Encode::LEAVE_SRC;
}
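For completeness, the encoding-layer variant could be fleshed out roughly as follows. This is only a sketch: the helper name read_latin1_lines and the temporary sample file are made up so that the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a Latin-1 file and return its non-empty
# lines as Perl character strings (the layer does the decoding).
sub read_latin1_lines {
    my ($path) = @_;
    open my $in, '<:encoding(ISO-8859-1)', $path
        or die "Cannot open $path: $!";
    my @lines;
    while (my $line = <$in>) {
        chomp $line;
        push @lines, $line if length $line;
    }
    close $in;
    return @lines;
}

# Build a small Latin-1 sample file so the sketch is self-contained.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw';
print {$tmp} "Gr\xFC\xDFe\n\nM\xFCnchen\n";
close $tmp;

# The output layer encodes to UTF-8 on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for read_latin1_lines($path);
```

With both layers in place, the code in between never has to call encode or decode itself.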
Maybe as :
$_ = encode('utf-8', decode('ISO-8859-1', $_));
The data here is gb2312-encoded, so this converts it to UTF-8:
#!/usr/bin/env perl
use Encode qw(encode decode);
while (<DATA>) {
$_ = encode('utf-8', decode('gb2312', $_));
print;
}
__DATA__
Â׶ذÂÔË»á
$_ = decode('ISO-8859-1', encode('UTF-8', $_));
This line has two problems with it. Firstly, you are encoding your input to UTF-8 and then decoding it from ISO-8859-1. These two operations are the wrong way round.
Secondly, you almost certainly don't want to decode and encode at the same time. The Golden Rule of handling character encodings in Perl is to follow this process:
1. Decode data as soon as you get it from the outside world. This takes your input bytestream and converts it into Perl's internal representation for character strings.
2. Process the data according to your requirements.
3. Encode the data just before sending it to the outside world. This takes Perl's internal representation for character strings and converts it to a correctly-encoded bytestream for your required output encoding.
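As a small illustration of the first point, this sketch compares the two orders on a single Latin-1 byte (0xFC, "ü"). The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $latin1 = "\xFC";    # the byte for "ü" in ISO-8859-1

# Wrong way round (as in the question): encode first, then decode.
# The result is a two-character string, displayed as "Ã¼".
my $wrong = decode('ISO-8859-1', encode('UTF-8', $latin1));

# Right way round: decode the input, then encode for output.
my $right = encode('UTF-8', decode('ISO-8859-1', $latin1));

printf "wrong: %d characters (%s)\n", length($wrong),
    join(' ', map { sprintf 'U+%04X', ord } split //, $wrong);
printf "right: %d bytes (%s)\n", length($right),
    join(' ', map { sprintf '%02X', ord } split //, $right);
# wrong: 2 characters (U+00C3 U+00BC)
# right: 2 bytes (C3 BC)
```

In a real program the decode and encode calls would sit at the input and output boundaries, with the processing step between them.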

shift jis decoding/encoding in perl

When I try to decode a Shift JIS encoded string and encode it back, some of the characters get garbled.
I have the following code:
use Encode qw(decode encode);
$val = <STDIN>;
print "\nbefore decoding: $val";
my $ustr = Encode::decode("shiftjis",$val);
print "\nafter decoding: $ustr";
print "\nbefore encoding: $ustr";
$val = Encode::encode("shiftjis",$ustr);
print "\nafter encoding: $val";
When I use a string like helloソworld as input, it gets properly decoded and encoded back, i.e. the "before decoding" and "after encoding" prints in the code above show the same value.
But when I tried another string, such as ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ, the end output got garbled.
Is it a Perl library specific problem, or is it a general Shift JIS mapping problem?
Is there any solution for it?
You should simply replace the shiftjis with cp932.
http://en.wikipedia.org/wiki/Code_page_932
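A quick way to compare the two encodings is a round-trip check like the sketch below. It builds the Roman numerals from their code points (U+2160 to U+2169) so it does not depend on the source file's encoding, and it uses FB_CROAK so that an unmappable character raises an error instead of being silently substituted.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The ten Roman numerals U+2160 .. U+2169 from the question.
my $numerals = join '', map { chr } 0x2160 .. 0x2169;
my $check    = Encode::FB_CROAK | Encode::LEAVE_SRC;

# cp932 can represent them, so the round trip should be lossless.
my $cp932_bytes = encode('cp932', $numerals, $check);
my $round_trip  = decode('cp932', $cp932_bytes, $check);
print "cp932 round trip: ", ($round_trip eq $numerals ? "ok" : "FAILED"), "\n";

# Plain shiftjis has no mapping for them, so the strict encode dies.
my $ok = eval { encode('shiftjis', $numerals, $check); 1 };
print "shiftjis encode: ", ($ok ? "succeeded\n" : "failed: $@");
```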
You lack error-checking.
use utf8;
use Devel::Peek qw(Dump);
use Encode qw(encode);
sub as_shiftjis {
my ($string) = @_;
return encode(
'Shift_JIS', # http://www.iana.org/assignments/character-sets
$string,
Encode::FB_CROAK
);
}
Dump as_shiftjis 'helloソworld';
Dump as_shiftjis 'ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ';
Output:
SV = PV(0x9148a0) at 0x9dd490
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x930e80 "hello\203\\world"\0
CUR = 12
LEN = 16
"\x{2160}" does not map to shiftjis at …