Displaying CP437 ('extended ascii') in Perl/Gtk - perl

Is there any way in which I can display old-fashioned extended ASCII (cp437) in a Gtk2::TextView? (Google suggests no answers.)
If there is some way of changing the charset used by a GTK widget, I can't find it.
Or maybe it's necessary to use Perl's Encode module, as I tried in the script below, but that doesn't work either.
#!/usr/bin/perl
# Display ASCII
use strict;
use diagnostics;
use warnings;
use Encode;
use Glib qw(TRUE FALSE);
use Gtk2 '-init';
# Open a Gtk2 window, with a Gtk2::TextView to display text
my $window = Gtk2::Window->new('toplevel');
$window->set_title('Extended ASCII viewer');
$window->set_position('center');
$window->set_default_size(600, 400);
$window->signal_connect('delete-event' => sub {
Gtk2->main_quit();
exit;
});
my $scrollWin = Gtk2::ScrolledWindow->new(undef, undef);
$window->add($scrollWin);
$scrollWin->set_policy('automatic', 'automatic');
$scrollWin->set_border_width(0);
my $textView = Gtk2::TextView->new;
$scrollWin->add_with_viewport($textView);
$textView->can_focus(FALSE);
$textView->set_wrap_mode('word-char');
$textView->set_justification('left');
my $buffer = $textView->get_buffer();
$window->show_all();
# In cp437, this is a series of accented A characters
my $string = chr (131) . chr (132) . chr (133) . chr (134);
# Display plain text
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $string . "\n");
# Display UTF-8 text
my $utfString = encode('utf8', $string);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $utfString . "\n");
# Display cp437
my $cpString = decode ('cp437', $string);
my $utfString2 = encode('utf-8', $cpString);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $utfString2 . "\n");
# Other suggestion
my $otherString = encode("utf-8", decode ("cp437", $string));
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $otherString . "\n");
# Directly decode a hex character (as suggested)
my $hexString = encode("utf-8", decode("cp437", "\xBA"));
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $hexString . "\n");
Gtk2->main();

Gtk wants to receive UTF-8 encoded strings, so anything you pass to a Gtk widget should be UTF-8 encoded.
If your input is cp437, then you'll want to decode it first and reencode it as UTF-8.
my $cp437_string = chr(153) x 10; # cp437 encoded
my $string = decode('cp437', $cp437_string); # Unicode code point encoded
my $utf8_string = encode('utf-8', $string); # utf-8 encoded
$buffer->insert_with_tags_by_name(
$buffer->get_end_iter(), $utf8_string . "\n");

Gtk2.pm expects Perl character strings (Encode::decode(...)), which are internally stored as UTF-8.
If you feed it a byte string (Encode::encode(...)), it will try to display it as latin1.
# In cp437, this is a series of accented A characters
my $string = chr (131) . chr (132) . chr (133) . chr (134);
my $perlString = decode ('cp437', $string);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $perlString . "\n\n");
my $charmap = join("", map chr, 128..255);
$charmap =~ s!.{16}\K!\n!g;
$perlString = decode ('cp437', $charmap);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $perlString . "\n");
Gtk2->main();
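To check the decoding step on its own, without opening a window, a minimal sketch like the following prints the code points that decode produces for the four bytes from the question. The helper name cp437_to_codepoints is made up for illustration; only Encode::decode is taken from the answer above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode CP437 bytes and list the resulting code points.
sub cp437_to_codepoints {
    my ($bytes) = @_;
    my $chars = decode('cp437', $bytes);    # byte string -> character string
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

# Bytes 131..134 are the accented "a" characters from the question.
print cp437_to_codepoints(join '', map { chr($_) } 131 .. 134), "\n";
# prints: U+00E2 U+00E4 U+00E0 U+00E5
```

If the code points look right here, the remaining step is to hand the decoded character string to the buffer unchanged.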

Related

Convert text to value hexadecimal

I'm trying to print the Arabic phrase for "for sale" ("عربي"), but my terminal displays it reversed, left to right, even though Arabic is written right to left. Transliterated, the word is "llbye", but the terminal shows "eybll" (ﻊﻴﺒﻠﻟ).
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':utf8' );
use Encode qw< encode decode >;
my $str = 'ﻟﻠﺒﻴﻊ'; # "for sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );
my $decoded = pack 'U0W*', map +ord, split //, $enc;
print "Original string : $str\n"; # ل ل ب ي ع
print "Decoded string 1: $dec\n"; # ل ل ب ي ع
print "Decoded string 2: $decoded\n"; # ل ل ب ي ع
my $k = reverse($decoded);
print "Decode reverse : $k\n";
print "0x$_" for unpack "H*", scalar reverse "$decoded\n";
On line 21, I'm trying to better visualize converting these characters to hexdump, but I receive:
Character in 'H' format wrapped in unpack at line 21.
Term[Perl]:# perl schreib.pl
Original string : ﻟﻠﺒﻴﻊ
Decoded string 1: ﻟﻠﺒﻴﻊ
Decoded string 2: ﻟﻠﺒﻴﻊ
Decode reverse : ﻊﻴﺒﻠﻟ
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21
As the image shows, the first frame is what I copy and paste, and the terminal reverses it on its own. I have to use reverse to print it right to left, as in the second frame, which is how it should have looked when pasted.
How do I transform these characters into hexadecimal?
unpack H* expects a string of bytes (characters with value 00..FF), but you have a string of Unicode Code Points (characters with value 000000..10FFFF).
You can use
sprintf "%vX", $str
which is effectively the same as
join ".", map sprintf( "%X", ord( $_ ) ), split //, $str
and
join ".", map sprintf( "%X", $_ ), unpack "W*", $str
All three work for any string (bytes, UCP, whatever).
For $str, $dec and $decoded, the above produces
FEDF.FEE0.FE92.FEF4.FECA
For $enc, the above produces
EF.BB.9F.EF.BB.A0.EF.BA.92.EF.BB.B4.EF.BB.8A
(You may get something different since our files might not be the same.)
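If the goal is a hex dump of the encoded bytes, as in the original unpack "H*" attempt, one option is to encode the character string first so that every element fits in a byte. This is only a sketch; the helper name utf8_hex is an assumption made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 bytes of a character string.
sub utf8_hex {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);    # every element is now 0x00..0xFF
    return unpack 'H*', $bytes;             # no "wrapped" warning on bytes
}

print utf8_hex("\x{FEDF}\x{FEE0}"), "\n";    # prints: efbb9fefbba0
```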
With Unicode Code Points, we can use charnames (and/or Unicode::UCD) for more info.
use charnames qw( :full );
use feature qw( say );
for my $ucp ( unpack "W*", $str ) {
my $ch = chr( $ucp );
if ( $ch =~ /(?[ \p{Print} - \p{Mark} ])/ ) { # Not sure if good enough.
printf "‹%s› ", $ch;
} else {
print "--- ";
}
printf "U+%X ", $ucp;
say charnames::viacode( $ucp );
}
For $str, $dec and $decoded, the above produces
‹ﻟ› U+FEDF ARABIC LETTER LAM INITIAL FORM
‹ﻠ› U+FEE0 ARABIC LETTER LAM MEDIAL FORM
‹ﺒ› U+FE92 ARABIC LETTER BEH MEDIAL FORM
‹ﻴ› U+FEF4 ARABIC LETTER YEH MEDIAL FORM
‹ﻊ› U+FECA ARABIC LETTER AIN FINAL FORM
Data::Dumper with local $Data::Dumper::Useqq = 1; will produce ASCII output as well.
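A minimal sketch of the Data::Dumper approach follows. The helper name ascii_dump and the use of $Data::Dumper::Terse (to drop the "$VAR1 = " prefix) are additions for the example, not part of the answer above.

```perl
use strict;
use warnings;
use Data::Dumper ();

# Hypothetical helper: dump a string using only ASCII characters.
sub ascii_dump {
    my ($value) = @_;
    local $Data::Dumper::Useqq = 1;    # double-quoted output with escapes
    local $Data::Dumper::Terse = 1;    # omit the "$VAR1 = " prefix
    return Data::Dumper::Dumper($value);
}

# The five code points of the string from the question.
print ascii_dump("\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}");
```

Because the output is plain ASCII, the terminal's right-to-left handling no longer affects what is displayed.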

Chinese character doesn't always display correctly

I've been trying to pass a Chinese character to a JSON hash, but it always comes out as "å¥³" instead of "女".
#!/usr/bin/perl
use JSON;
# variable declaration
my $gender = "Female";
# turning english selection to Chinese character
if ($gender eq 'Female') {
$gender = "女";
} elsif ($gender eq 'Male') {
$gender = "男";
} elsif ($gender eq 'Decline to state') {
$gender = "";
}
my $hash_ref = {};
$hash_ref->{'detail_sex'} = $gender;
print JSON->new->utf8(1)->pretty(1)->encode($hash_ref);
This is the result I get:
{
"detail_sex" : "å¥³"
}
However, when I test another script, the Chinese character comes out perfectly.
#!/usr/bin/perl
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Encode qw(encode_utf8);
use JSON;
my $userid = 1616589;
my $time = 2015811;
my $ejob_id = 1908063;
# md5 encryption without chinese characters
my $md5_hex_sign = md5_hex($userid,$time,$job_id);
print "$md5_hex_sign\n";
# seeing if character will print
print "let's try encoding and decrypting \n";
print "the character to encrypt.\n";
print "女\n";
print "unicode print out\n";
print "\x{5973}\n";
my $char = "\x{5973}";
my $sign_char = "女";
print "unicode stored in \$char variable \n";
print $char, "\n";
print "md5 encryption of said chinese character from \$char with utf8 encoding\n";
print md5_hex(encode_utf8($char)), "\n";
print "md5 encryption of wide character with utf8 encoding\n";
print md5_hex(encode_utf8("女")), "\n";
my $sign_gender = md5_hex(encode_utf8($sign_char));
#JSON
print "JSON print out\n";
my $hash_ref = {};
$hash_ref->{'gender'} = $char;
$hash_ref->{'md5_gender'} = md5_hex(encode_utf8($char));
$hash_ref->{'char_gender'} = md5_hex(encode_utf8("女"));
$hash_ref->{'sign_gender'} = $sign_gender;
print JSON->new->utf8(1)->pretty(1)->encode($hash_ref);
Here is the result:
160a6f4bf9aec1c2d102330716ca8f4e
let's try encoding and decrypting
the character to encrypt.
女
unicode print out
Wide character in print at md5check.pl line 18.
女
unicode stored in $char variable
Wide character in print at md5check.pl line 22.
女
md5 encryption of said chinese character from $char with utf8 encoding
87c835a6b1749374a7524a596087b296
md5 encryption of wide character with utf8 encoding
06c82a10da7e297180d696ed92f524c1
JSON print out
{
"char_gender" : "06c82a10da7e297180d696ed92f524c1",
"md5_gender" : "87c835a6b1749374a7524a596087b296",
"sign_gender" : "06c82a10da7e297180d696ed92f524c1",
"gender" : "女"
}
Would someone kindly explain to me what is going on?
Things I've tried:
use utf8;
print JSON->new->ascii(1)->pretty(1)->encode($hash_ref);
But I still get this as a result:
{
"detail_sex" : "å¥³"
}
I'm mostly concerned about the Chinese character (女) being md5 encrypted instead of "å¥³" being encrypted.
If it isn't already, save your source code in UTF-8.
Tell Perl that your script contains UTF-8 with the utf8 pragma:
use utf8;
Here's a very short test case for you to try:
use strict;
use warnings;
use utf8;
use JSON;
print encode_json({detail_sex => '女'});
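To see why the pragma matters, the sketch below feeds the same character to a JSON encoder in two forms: as one character (what the literal is under use utf8) and as three raw bytes (what the literal is without it). It uses the core module JSON::PP in place of JSON so that it runs without extra installs; that substitution and the helper name json_hex are assumptions made for the example.

```perl
use strict;
use warnings;
use JSON::PP ();

my $json = JSON::PP->new->utf8;

# Hypothetical helper: hex dump of the JSON bytes produced for one string.
sub json_hex {
    my ($string) = @_;
    return unpack 'H*', $json->encode([$string]);
}

my $chars = "\x{5973}";        # one character: the literal under "use utf8"
my $bytes = "\xE5\xA5\xB3";    # three bytes: the same literal without it

printf "%d character(s): %s\n", length($chars), json_hex($chars);
# prints: 1 character(s): 5b22e5a5b3225d
printf "%d character(s): %s\n", length($bytes), json_hex($bytes);
# prints: 3 character(s): 5b22c3a5c2a5c2b3225d
```

In the second case each byte is treated as a separate Latin-1 character and encoded again, which is the double encoding that shows up as three garbage characters.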

Converting to unicode characters in Perl?

I want to convert the text ( Hindi ) to Unicode in Perl. I have searched in CPAN. But, I could not find the exact module/way which I am looking for. Basically, I am looking for something like this.
My Input is:
इस परीक्षण के लिए है
My expected output is:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
How to achieve this in Perl?
Give me some suggestions.
Try this
use utf8;
my $str = 'इस परीक्षण के लिए है';
for my $c (split //, $str) {
printf("\\u%04x", ord($c));
}
print "\n";
You don't really need any module to do that. ord to extract the character code and sprintf to format it as four-digit, zero-padded hex are more than enough:
use utf8;
my $str = 'इस परीक्षण के लिए है';
(my $u_encoded = $str) =~ s/(.)/sprintf "\\u%04x", ord($1)/sge;
# \u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
Because I left a few comments on how the other answers might fall short of the expectations of various tools, I'd like to share a solution that encodes characters outside of the Basic Multilingual Plane as pairs of two escapes: "😃" would become \ud83d\ude03.
This is done by:
1. Encoding the string as UTF-16, without a byte order mark. We explicitly choose an endianness. Here, we arbitrarily use the big-endian form. This produces a string of octets (“bytes”), where two octets form one UTF-16 code unit, and two or four octets represent a Unicode code point.
   This is done for convenience and performance; we could just as well determine the numeric values of the UTF-16 code units ourselves.
2. Unpacking the resulting binary string into 16-bit integers which represent each UTF-16 code unit. We have to respect the correct endianness, so we use the n* pattern for unpack (i.e. 16-bit big-endian unsigned integer).
3. Formatting each code unit as an \uxxxx escape.
As a Perl subroutine, this would look like
use strict;
use warnings;
use Encode ();
sub unicode_escape {
my ($str) = @_;
my $UTF_16BE_octets = Encode::encode("UTF-16BE", $str);
my @code_units = unpack "n*", $UTF_16BE_octets;
return join '', map { sprintf "\\u%04x", $_ } @code_units;
}
Test cases:
use Test::More tests => 3;
use utf8;
is unicode_escape(''), '',
'empty string is empty string';
is unicode_escape("\N{SMILING FACE WITH OPEN MOUTH}"), '\ud83d\ude03',
'non-BMP code points are escaped as surrogate halves';
my $input = 'इस परीक्षण के लिए है';
my $output = '\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948';
is unicode_escape($input), $output,
'ordinary BMP code points each have a single escape';
If you only want a simple converter, you can use the following filter:
perl -CSDA -nle 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -nlE 'printf "\\u%04x",$_ for unpack "U*"'
like:
echo "इस परीक्षण के लिए है" | perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_' <<< "इस परीक्षण के लिए है"
prints:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948\u000a
Unicode with surrogate pairs.
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
my $str = "if( \N{U+1F42A}+\N{U+1F410} == \N{U+1F41B} ){ \N{U+1F602} = \N{U+1F52B} } # ορισμός ";
print "$str\n";
for my $ch (unpack "U*", $str) {
if( $ch > 0xffff ) {
my $h = ($ch - 0x10000) / 0x400 + 0xD800;
my $l = ($ch - 0x10000) % 0x400 + 0xDC00;
printf "\\u%04x\\u%04x", $h, $l;
}
else {
printf "\\u%04x", $ch;
}
}
print "\n";
prints
if( 🐪+🐐 == 🐛 ){ 😂 = 🔫 } # ορισμός
\u0069\u0066\u0028\u0020\ud83d\udc2a\u002b\ud83d\udc10\u0020\u003d\u003d\u0020\ud83d\udc1b\u0020\u0029\u007b\u0020\ud83d\ude02\u0020\u003d\u0020\ud83d\udd2b\u0020\u007d\u0020\u0023\u0020\u03bf\u03c1\u03b9\u03c3\u03bc\u03cc\u03c2\u0020
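The surrogate arithmetic can be checked for a single code point. The sketch below splits U+1F603 by hand, using integer shifts instead of division, and compares the result with what Encode produces for UTF-16BE. The helper name surrogate_pair is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: split a code point above U+FFFF into its
# UTF-16 high and low surrogates.
sub surrogate_pair {
    my ($cp) = @_;
    my $offset = $cp - 0x10000;
    my $high   = 0xD800 + ($offset >> 10);      # top 10 bits
    my $low    = 0xDC00 + ($offset & 0x3FF);    # bottom 10 bits
    return ($high, $low);
}

my @pair = surrogate_pair(0x1F603);
printf "%04x %04x\n", @pair;     # prints: d83d de03

# Cross-check against Encode's UTF-16BE encoder.
my @units = unpack 'n*', encode('UTF-16BE', chr(0x1F603));
printf "%04x %04x\n", @units;    # prints: d83d de03
```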

Windows-1252 to unicode conversion in perl

I have the ef(cyrillic) character in hex format of Windows-1251.
The value is 0xF4. I want to convert and print the character in perl.
And the way i can do it is via unicode 0x0444.
I am looking for a way to convert 0xF4 to 0x0444.
My eventual plan is that, given the hex value of any character in any encoding, I should be able to convert it to the Unicode hex value and finally print it.
But it's not working.
Below is the code I am using:
#!/usr/bin/perl
use strict;
use utf8;
use Encode qw(decode encode);
binmode(STDOUT, ":utf8");
my $runtime = chr(0x0444);
print "theta || ".$runtime." ||";
my $smiley = "\x{0444}";
print "theta || ".$smiley." ||";
my $georgian_an = pack("U", 0x0444);
print "theta || ".$georgian_an." ||";
my $hexstr = "0xF4";
my $num = hex $hexstr;
print $num; # printing the hex value
my $be_num = pack("N", $num);
$runtime = decode( "cp1252",$be_num);
print "\n".$runtime."\n"; # i should have got ф here
Output
perl mychar_new.pl
theta || ф ||theta || ф ||theta || ф ||244
ô
The output is correct – in CP-1252, 0xF4 is indeed ô (Wikipedia).
You wanted to specify CP-1251 instead!
use Encode 'decode';
my $cp1251 = "\xF4";
my $decoded = decode "cp1251", $cp1251;
print "$decoded\n";
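For the more general plan in the question (a hex value in some named encoding converted to a Unicode code point), a sketch along these lines may help. The helper name hex_to_codepoints is an assumption; it relies only on pack and Encode::decode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: hex byte string in a named encoding -> code points.
sub hex_to_codepoints {
    my ($encoding, $hex) = @_;
    $hex =~ s/^0x//i;                       # accept an optional 0x prefix
    my $bytes = pack 'H*', $hex;            # "F4" -> the single byte 0xF4
    my $chars = decode($encoding, $bytes, Encode::FB_CROAK);
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

print hex_to_codepoints('cp1251', '0xF4'), "\n";    # prints: U+0444
print hex_to_codepoints('cp1252', '0xF4'), "\n";    # prints: U+00F4
```

The same byte gives two different code points, which is why the encoding name has to match the data.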

shift jis decoding/encoding in perl

When I try decode a shift-jis encoded string and encode it back, some of the characters get garbled:
I have following code:
use Encode qw(decode encode);
$val=<STDIN>;
print "\nbefore decoding: $val";
my $ustr = Encode::decode("shiftjis",$val);
print "\nafter decoding: $ustr";
print "\nbefore encoding: $ustr";
$val = Encode::encode("shiftjis",$ustr);
print "\nafter encoding: $val";
When I use the string helloソworld as input, it is decoded and encoded back correctly, i.e. the "before decoding" and "after encoding" prints in the code above show the same value.
But when I tried another string like : ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ
The end output got garbled.
Is it a perl library specific problem or it is a general shift jis mapping problem?
Is there any solution for it?
You should simply replace the shiftjis with cp932.
http://en.wikipedia.org/wiki/Code_page_932
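The difference between the two encoding names can be checked with a round trip. This is a sketch under the assumption that both shiftjis and cp932 are available through Encode on the installed Perl; the helper name round_trips is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: true if $string survives an encode/decode round trip.
sub round_trips {
    my ($encoding, $string) = @_;
    # Die on unmappable characters, and leave the input string untouched.
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $back = eval {
        my $bytes = encode($encoding, $string, $check);
        decode($encoding, $bytes, $check);
    };
    return defined $back && $back eq $string;
}

# ROMAN NUMERAL ONE .. ROMAN NUMERAL TEN, the string from the question.
my $numerals = join '', map { chr($_) } 0x2160 .. 0x2169;

for my $encoding (qw(shiftjis cp932)) {
    printf "%-8s %s\n", $encoding,
        round_trips($encoding, $numerals) ? 'ok' : 'fails';
}
```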
You lack error-checking.
use utf8;
use Devel::Peek qw(Dump);
use Encode qw(encode);
sub as_shiftjis {
my ($string) = @_;
return encode(
'Shift_JIS', # http://www.iana.org/assignments/character-sets
$string,
Encode::FB_CROAK
);
}
Dump as_shiftjis 'helloソworld';
Dump as_shiftjis 'ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ';
Output:
SV = PV(0x9148a0) at 0x9dd490
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x930e80 "hello\203\\world"\0
CUR = 12
LEN = 16
"\x{2160}" does not map to shiftjis at …