I'm trying to print the word "for sale" in Arabic ("عربي"), but my terminal lays it out from left to right, even though Arabic is written from right to left. The word is equivalent to "llbye", but the terminal writes "eybll" (ﻊﻴﺒﻠﻟ).
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':utf8' );
use Encode qw< encode decode >;
my $str = 'ﻟﻠﺒﻴﻊ'; # "for sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );
my $decoded = pack 'U0W*', map +ord, split //, $enc;
print "Original string : $str\n"; # ل ل ب ي ع
print "Decoded string 1: $dec\n"; # ل ل ب ي ع
print "Decoded string 2: $decoded\n"; # ل ل ب ي ع
my $k = reverse($decoded);
print "Decode reverse : $k\n";
print "0x$_" for unpack "H*", scalar reverse "$decoded\n";
On line 21, I'm trying to better visualize these characters by converting them to a hex dump, but I get:
Character in 'H' format wrapped in unpack at line 21.
Term[Perl]:# perl schreib.pl
Original string : ﻟﻠﺒﻴﻊ
Decoded string 1: ﻟﻠﺒﻴﻊ
Decoded string 2: ﻟﻠﺒﻴﻊ
Decode reverse : ﻊﻴﺒﻠﻟ
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21.
Character in 'H' format wrapped in unpack at line 21
As in the image, the first blank frame is what I copy and paste, and the terminal inverts it without my permission. I have to use reverse to print it from right to left, as in the second frame, which is how it should have looked when pasted.
How do I transform these characters into hexadecimal?
unpack H* expects a string of bytes (characters with value 00..FF), but you have a string of Unicode Code Points (characters with value 000000..10FFFF).
You can use
sprintf "%vX", $str
which is effectively the same as
join ".", map sprintf( "%X", ord( $_ ) ), split //, $str
and
join ".", map sprintf( "%X", $_ ), unpack "W*", $str
All three work for any string (bytes, UCP, whatever).
For $str, $dec and $decoded, the above produces
FEDF.FEE0.FE92.FEF4.FECA
For $enc, the above produces
EF.BB.9F.EF.BB.A0.EF.BA.92.EF.BB.B4.EF.BB.8A
(You may get something different since our files might not be the same.)
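If the byte-level dump from unpack "H*" is what you are after, one option is to encode the string first, so that unpack only ever sees bytes. The following is a minimal sketch, assuming UTF-8 is the encoding you want to inspect; the five code points are written as escapes so the snippet does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode qw( encode );

# The five code points of the string, written as escapes.
my $str = "\x{FEDF}\x{FEE0}\x{FE92}\x{FEF4}\x{FECA}";

# Encode to UTF-8 bytes first; unpack "H*" is then well defined.
my $bytes = encode( 'UTF-8', $str );
my $hex   = unpack 'H*', $bytes;

print "$hex\n";                       # efbb9fefbba0efba92efbbb4efbb8a
print sprintf( '%vX', $str ), "\n";   # FEDF.FEE0.FE92.FEF4.FECA
```

The first line is the same data as the EF.BB.9F... output above, just without separators.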
With Unicode Code Points, we can use charnames (and/or Unicode::UCD) for more info.
use charnames qw( :full );
use feature qw( say );
for my $ucp ( unpack "W*", $str ) {
    my $ch = chr( $ucp );
    if ( $ch =~ /(?[ \p{Print} - \p{Mark} ])/ ) { # Not sure if good enough.
        printf "‹%s› ", $ch;
    } else {
        print "--- ";
    }
    printf "U+%X ", $ucp;
    say charnames::viacode( $ucp );
}
For $str, $dec and $decoded, the above produces
‹ﻟ› U+FEDF ARABIC LETTER LAM INITIAL FORM
‹ﻠ› U+FEE0 ARABIC LETTER LAM MEDIAL FORM
‹ﺒ› U+FE92 ARABIC LETTER BEH MEDIAL FORM
‹ﻴ› U+FEF4 ARABIC LETTER YEH MEDIAL FORM
‹ﻊ› U+FECA ARABIC LETTER AIN FINAL FORM
Data::Dumper with local $Data::Dumper::Useqq = 1; will produce ASCII output as well.
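A minimal sketch of the Data::Dumper route, assuming you only need an ASCII-safe view of a single string. The Terse setting is an extra not mentioned above; it just drops the "$VAR1 = " prefix.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq = 1;   # escape non-printable and non-ASCII characters
$Data::Dumper::Terse = 1;   # omit the "$VAR1 = " prefix (optional)

my $str    = "\x{FEDF}\x{FEE0}";
my $dumped = Dumper( $str );

# Prints a double-quoted string in which each character above ASCII
# appears as a \x{...} escape.
print $dumped;
```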
Is there any way in which I can display old-fashioned extended ASCII (cp437) in a Gtk2::TextView? (Google suggests no answers.)
If there is some way of changing the charset used by a GTK widget, I can't find it.
Or maybe it's necessary to use Perl's Encode module, as I tried in the script below, but that doesn't work either.
#!/usr/bin/perl
# Display ASCII
use strict;
use diagnostics;
use warnings;
use Encode;
use Glib qw(TRUE FALSE);
use Gtk2 '-init';
# Open a Gtk2 window, with a Gtk2::TextView to display text
my $window = Gtk2::Window->new('toplevel');
$window->set_title('Extended ASCII viewer');
$window->set_position('center');
$window->set_default_size(600, 400);
$window->signal_connect('delete-event' => sub {
Gtk2->main_quit();
exit;
});
my $scrollWin = Gtk2::ScrolledWindow->new(undef, undef);
$window->add($scrollWin);
$scrollWin->set_policy('automatic', 'automatic');
$scrollWin->set_border_width(0);
my $textView = Gtk2::TextView->new;
$scrollWin->add_with_viewport($textView);
$textView->can_focus(FALSE);
$textView->set_wrap_mode('word-char');
$textView->set_justification('left');
my $buffer = $textView->get_buffer();
$window->show_all();
# In cp437, this is a series of accented A characters
my $string = chr (131) . chr (132) . chr (133) . chr (134);
# Display plain text
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $string . "\n");
# Display UTF-8 text
my $utfString = encode('utf8', $string);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $utfString . "\n");
# Display cp437
my $cpString = decode ('cp437', $string);
my $utfString2 = encode('utf-8', $cpString);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $utfString2 . "\n");
# Other suggestion
my $otherString = encode("utf-8", decode ("cp437", $string));
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $otherString . "\n");
# Directly decode a hex character (as suggested)
my $hexString = encode("utf-8", decode("cp437", "\xBA"));
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $hexString . "\n");
Gtk2->main();
Gtk wants to receive UTF-8 encoded strings, so anything you pass to a Gtk widget should be UTF-8 encoded.
If your input is cp437, then you'll want to decode it first and reencode it as UTF-8.
my $cp437_string = chr(153) x 10; # cp437 encoded
my $string = decode('cp437', $cp437_string); # Unicode code point encoded
my $utf8_string = encode('utf-8', $string); # utf-8 encoded
$buffer->insert_with_tags_by_name(
$buffer->get_end_iter(), $utf8_string . "\n");
Gtk2.pm expects Perl character strings (Encode::decode(...)), which are internally stored as UTF-8.
If you feed it a byte string (Encode::encode(...)), it will try to display it as latin1.
# In cp437, this is a series of accented A characters
my $string = chr (131) . chr (132) . chr (133) . chr (134);
my $perlString = decode ('cp437', $string);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $perlString . "\n\n");
my $charmap = join("", map chr, 128..255);
$charmap =~ s!.{16}\K!\n!g;
$perlString = decode ('cp437', $charmap);
$buffer->insert_with_tags_by_name($buffer->get_end_iter(), $perlString . "\n");
Gtk2->main();
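To see the difference between the byte string and the decoded character string without opening a window, here is a small sketch that uses only Encode. It assumes the same four cp437 bytes as above and simply reports what decode and encode produce.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Four cp437 bytes (accented "a" characters in that code page).
my $cp437_bytes = join '', map { chr($_) } 131 .. 134;

# decode() turns bytes into a Perl character string ...
my $chars = decode( 'cp437', $cp437_bytes );

# ... and encode() turns characters back into bytes, here UTF-8.
my $utf8_bytes = encode( 'UTF-8', $chars );

printf "U+%04X\n", ord($_) for split //, $chars;
printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8_bytes);
```

This prints U+00E2, U+00E4, U+00E0 and U+00E5, followed by "4 characters, 8 UTF-8 bytes".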
I am attempting to create a hashmap from a text file. The way the text file is set up is as follows.
(integer)<-- varying white space --> (string value)
. . .
. . .
. . .
(integer)<-- varying white space --> (string value)
eg:
5 this is a test
23 this is another test
123 this is the final test
What I want to do is assign the key to the integer, and then the entire string following to the value. I was trying something along the lines of
my %myHashMap;
while(my $info = <$fh>){
chomp($info);
my ($int, $string) = split/ /,$info;
$myHashMap{$int} = $string;
}
This doesn't work though because I have spaces in the string. Is there a way to clear the initial white space, grab the integer, assign it to $int, then clear white space till you get to the string, then take the remainder of the text on that line and place it in my $string value?
You could replace
split / /, $info # Fields are separated by a space.
with
split / +/, $info # Fields are separated by spaces.
or the more general
split /\s+/, $info # Fields are separated by whitespace.
but you'd still be faced with the problem of the leading spaces. To ignore those, use
split ' ', $info
This special case splits on whitespace, ignoring leading whitespace.
Don't forget to tell Perl that you expect at most two fields!
$ perl -E'say "[$_]" for split(" ", " 1 abc def ghi", 2)'
[1]
[abc def ghi]
The other option would be to use the following:
$info =~ /^\s*(\S+)\s+(\S.*)/
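As a rough sketch of how that pattern could be used to fill the hash: the sample lines below are hard-coded for illustration (in the real script they would come from the file handle), and lines that do not match are simply skipped.

```perl
use strict;
use warnings;

# Sample input with leading whitespace, runs of spaces and a tab.
my @lines = (
    "5 this is a test",
    "   23      this is another test",
    "123\tthis is the final test",
);

my %myHashMap;
for my $info (@lines) {
    # The match returns the two captures in list context,
    # or an empty list if the line does not have two fields.
    if ( my ( $int, $string ) = $info =~ /^\s*(\S+)\s+(\S.*)/ ) {
        $myHashMap{$int} = $string;
    }
}

print "$_ => $myHashMap{$_}\n" for sort { $a <=> $b } keys %myHashMap;
```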
You just need to split each line of text on whitespace into two fields.
This example program assumes that the input file is passed as a parameter on the command line. I have used Data::Dump only to show the resulting hash structure.
use strict;
use warnings 'all';
my %data;
while ( <> ) {
s/\s*\z//;
my ($key, $val) = split ' ', $_, 2;
next unless defined $val; # Ensure that there were two fields
$data{$key} = $val;
}
use Data::Dump;
dd \%data;
output
{
5 => "this is a test",
23 => "this is another test",
123 => "this is the final test",
}
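First, strip the leading whitespace:
$info =~ s/^\s+//;
Second, there can be more than one space between the integer and the string, so split on one or more spaces. Limit the result to two fields so that the spaces inside the string are kept:
split / +/, $info, 2;
The code is
use strict;
use warnings;
# Assumption: the input file name is given as the first command-line argument.
open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my %myHashMap;
while ( my $info = <$fh> ) {
    chomp($info);
    $info =~ s/^\s+//;                           # drop leading whitespace
    my ($int, $string) = split / +/, $info, 2;   # at most two fields
    $myHashMap{$int} = $string;
}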
I've been trying to pass a Chinese character to a JSON hash, but it always comes out garbled (for example as "å¥³") instead of "女".
#!/usr/bin/perl
use JSON;
# variable declaration
my $gender = "Female";
# turning english selection to Chinese character
if ($gender eq 'Female') {
$gender = "女";
} elsif ($gender eq 'Male') {
$gender = "男";
} elsif ($gender eq 'Decline to state') {
$gender = "";
}
my $hash_ref = {};
$hash_ref->{'detail_sex'} = $gender;
print JSON->new->utf8(1)->pretty(1)->encode($hash_ref);
This is the result I get:
{
"detail_sex" : "女"
}
However, when I test another script the chinese character comes out perfectly.
#!/usr/bin/perl
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Encode qw(encode_utf8);
use JSON;
my $userid = 1616589;
my $time = 2015811;
my $ejob_id = 1908063;
# md5 encryption without chinese characters
my $md5_hex_sign = md5_hex($userid,$time,$job_id);
print "$md5_hex_sign\n";
# seeing if character will print
print "let's try encoding and decrypting \n";
print "the character to encrypt.\n";
print "女\n";
print "unicode print out\n";
print "\x{5973}\n";
my $char = "\x{5973}";
my $sign_char = "女";
print "unicode stored in \$char variable \n";
print $char, "\n";
print "md5 encryption of said chinese character from \$char with utf8 encoding\n";
print md5_hex(encode_utf8($char)), "\n";
print "md5 encryption of wide character with utf8 encoding\n";
print md5_hex(encode_utf8("女")), "\n";
my $sign_gender = md5_hex(encode_utf8($sign_char));
#JSON
print "JSON print out\n";
my $hash_ref = {};
$hash_ref->{'gender'} = $char;
$hash_ref->{'md5_gender'} = md5_hex(encode_utf8($char));
$hash_ref->{'char_gender'} = md5_hex(encode_utf8("女"));
$hash_ref->{'sign_gender'} = $sign_gender;
print JSON->new->utf8(1)->pretty(1)->encode($hash_ref);
Here is the result:
160a6f4bf9aec1c2d102330716ca8f4e
let's try encoding and decrypting
the character to encrypt.
女
unicode print out
Wide character in print at md5check.pl line 18.
女
unicode stored in $char variable
Wide character in print at md5check.pl line 22.
女
md5 encryption of said chinese character from $char with utf8 encoding
87c835a6b1749374a7524a596087b296
md5 encryption of wide character with utf8 encoding
06c82a10da7e297180d696ed92f524c1
JSON print out
{
"char_gender" : "06c82a10da7e297180d696ed92f524c1",
"md5_gender" : "87c835a6b1749374a7524a596087b296",
"sign_gender" : "06c82a10da7e297180d696ed92f524c1",
"gender" : "女"
}
Would someone kindly explain to me what is going on?
Things I've tried:
use utf8;
print JSON->new->ascii(1)->pretty(1)->encode($hash_ref);
But I still get this as a result:
{
"detail_sex" : "女"
}
My main concern is that the Chinese character (女) itself gets MD5-hashed, rather than a garbled version of it such as "å¥³".
If it isn't already, save your source code in UTF-8.
Tell Perl that your script contains UTF-8 with the utf8 pragma:
use utf8;
Here's a very short test case for you to try:
use strict;
use warnings;
use utf8;
use JSON;
print encode_json({detail_sex => '女'});
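To check that the pragma has the intended effect, and that the MD5 digest is then taken over the real character, something like the following sketch may help. It uses the core JSON::PP module instead of JSON so that it runs without extra installs; that substitution, and the variable names, are assumptions made for illustration.

```perl
use strict;
use warnings;
use utf8;                        # the source file is saved as UTF-8
use JSON::PP;                    # core module, used here in place of JSON
use Digest::MD5 qw( md5_hex );
use Encode qw( encode_utf8 );

my $literal = '女';              # one character, thanks to "use utf8"
my $escaped = "\x{5973}";        # the same character as a code point escape

# ->utf8 makes encode() return UTF-8 bytes, which can be printed as-is.
my $json = JSON::PP->new->utf8->encode( { detail_sex => $literal } );
print "$json\n";

# With "use utf8" the literal and the escape are the same string,
# so their digests agree as well.
my $same = md5_hex( encode_utf8($literal) ) eq md5_hex( encode_utf8($escaped) );
print length($literal), "\n";    # 1
print $same ? "same digest\n" : "different digest\n";
```

Without use utf8, the literal would be three separate byte-valued characters, which is why the two digests in the question differ.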
I want to convert the text ( Hindi ) to Unicode in Perl. I have searched in CPAN. But, I could not find the exact module/way which I am looking for. Basically, I am looking for something like this.
My Input is:
इस परीक्षण के लिए है
My expected output is:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
How to achieve this in Perl?
Give me some suggestions.
Try this
use utf8;
my $str = 'इस परीक्षण के लिए है';
for my $c (split //, $str) {
printf("\\u%04x", ord($c));
}
print "\n";
You don't really need any module to do that. ord for extracting the character code and sprintf for formatting it as four-digit, zero-padded hex are more than enough:
use utf8;
my $str = 'इस परीक्षण के लिए है';
(my $u_encoded = $str) =~ s/(.)/sprintf "\\u%04x", ord($1)/sge;
# \u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
Because I left a few comments on how the other answers might fall short of the expectations of various tools, I'd like to share a solution that encodes characters outside of the Basic Multilingual Plane as pairs of two escapes: "😃" would become \ud83d\ude03.
This is done by:
1. Encoding the string as UTF-16, without a byte order mark. We explicitly choose an endianness; here, we arbitrarily use the big-endian form. This produces a string of octets (“bytes”), where two octets form one UTF-16 code unit, and two or four octets represent a Unicode code point. This is done for convenience and performance; we could just as well determine the numeric values of the UTF-16 code units ourselves.
2. Unpacking the resulting binary string into 16-bit integers which represent each UTF-16 code unit. We have to respect the correct endianness, so we use the n* pattern for unpack (i.e. 16-bit big-endian unsigned integer).
3. Formatting each code unit as an \uxxxx escape.
As a Perl subroutine, this would look like
use strict;
use warnings;
use Encode ();
sub unicode_escape {
    my ($str) = @_;
    my $UTF_16BE_octets = Encode::encode("UTF-16BE", $str);
    my @code_units = unpack "n*", $UTF_16BE_octets;
    return join '', map { sprintf "\\u%04x", $_ } @code_units;
}
Test cases:
use Test::More tests => 3;
use utf8;
is unicode_escape(''), '',
'empty string is empty string';
is unicode_escape("\N{SMILING FACE WITH OPEN MOUTH}"), '\ud83d\ude03',
'non-BMP code points are escaped as surrogate halves';
my $input = 'इस परीक्षण के लिए है';
my $output = '\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948';
is unicode_escape($input), $output,
'ordinary BMP code points each have a single escape';
If you only want a simple converter, you can use the following filter
perl -CSDA -nle 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -nlE 'printf "\\u%04x",$_ for unpack "U*"'
like:
echo "इस परीक्षण के लिए है" | perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_' <<< "इस परीक्षण के लिए है"
prints:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948\u000a
Unicode with surrogate pairs.
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
my $str = "if( \N{U+1F42A}+\N{U+1F410} == \N{U+1F41B} ){ \N{U+1F602} = \N{U+1F52B} } # ορισμός ";
print "$str\n";
for my $ch (unpack "U*", $str) {
if( $ch > 0xffff ) {
my $h = int( ($ch - 0x10000) / 0x400 ) + 0xD800; # int() makes the truncation explicit
my $l = ($ch - 0x10000) % 0x400 + 0xDC00;
printf "\\u%04x\\u%04x", $h, $l;
}
else {
printf "\\u%04x", $ch;
}
}
print "\n";
prints
if( 🐪+🐐 == 🐛 ){ 😂 = 🔫 } # ορισμός
\u0069\u0066\u0028\u0020\ud83d\udc2a\u002b\ud83d\udc10\u0020\u003d\u003d\u0020\ud83d\udc1b\u0020\u0029\u007b\u0020\ud83d\ude02\u0020\u003d\u0020\ud83d\udd2b\u0020\u007d\u0020\u0023\u0020\u03bf\u03c1\u03b9\u03c3\u03bc\u03cc\u03c2\u0020
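The surrogate arithmetic can be checked on a single code point. The sketch below uses a hypothetical helper, surrogate_pair, which does the same split with shifts and masks instead of division, and then compares the result with what Encode produces for UTF-16BE.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: split a code point above U+FFFF into its
# UTF-16 high and low surrogates using integer operations only.
sub surrogate_pair {
    my ($cp) = @_;
    my $offset = $cp - 0x10000;
    return ( 0xD800 + ( $offset >> 10 ), 0xDC00 + ( $offset & 0x3FF ) );
}

my ( $high, $low ) = surrogate_pair(0x1F42A);
printf "\\u%04x\\u%04x\n", $high, $low;    # \ud83d\udc2a

# Cross-check against Encode's UTF-16BE encoder.
my @units = unpack 'n*', encode( 'UTF-16BE', chr 0x1F42A );
print "matches Encode\n" if "@units" eq "$high $low";
```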
I need to handle URI (i.e. percent) encoding and decoding in my Perl script. How do I do that?
This is a question from the official perlfaq. We're importing the perlfaq to Stack Overflow.
This is the official FAQ answer minus subsequent edits.
Those % encodings handle reserved characters in URIs, as described in RFC 2396, Section 2. This encoding replaces the reserved character with the hexadecimal representation of the character's number from the US-ASCII table. For instance, a colon, :, becomes %3A.
In CGI scripts, you don't have to worry about decoding URIs if you are using CGI.pm. You shouldn't have to process the URI yourself, either on the way in or the way out.
If you have to encode a string yourself, remember that you should never try to encode an already-composed URI. You need to escape the components separately then put them together. To encode a string, you can use the URI::Escape module. The uri_escape function returns the escaped string:
use URI::Escape qw( uri_escape uri_unescape );
my $original = "Colon : Hash # Percent %";
my $escaped = uri_escape( $original );
print "$escaped\n"; # 'Colon%20%3A%20Hash%20%23%20Percent%20%25'
To decode the string, use the uri_unescape function:
my $unescaped = uri_unescape( $escaped );
print $unescaped; # back to original
If you wanted to do it yourself, you simply need to replace the reserved characters with their encodings. A global substitution is one way to do it:
# encode
$string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%0x", ord $1 /eg;
#decode
$string =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;
DIY encode (improving above version):
$string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%02x", ord $1 /eg;
(note the '%02x' rather than only '%0x')
DIY decode (adding '+' -> ' '):
$string =~ s/\+/ /g; $string =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;
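A small round trip shows why the zero-padded width matters. This is only a sketch: the helper names diy_escape and diy_unescape are made up for illustration, the patterns are the ones given above, and the input is assumed to be a byte string (non-ASCII text would need to be encoded to UTF-8 first).

```perl
use strict;
use warnings;

# Hypothetical helper names; the patterns are the ones shown above.
sub diy_escape {
    my ($string) = @_;
    $string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%02x", ord $1 /eg;
    return $string;
}

sub diy_unescape {
    my ($string) = @_;
    $string =~ s/\+/ /g;                           # "+" stands for a space
    $string =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;   # %XX back to a byte
    return $string;
}

my $original = "tab\there : 100%";
my $escaped  = diy_escape($original);

print "$escaped\n";                         # tab%09here%20%3a%20100%25
print sprintf( "%%%0x", ord "\t" ), "\n";   # %9, the unpadded form
print diy_unescape($escaped) eq $original
    ? "round trip ok\n"
    : "round trip failed\n";
```

With "%0x" the tab would be written as %9, which the decoder's two-hex-digit pattern does not turn back into a tab.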
Maybe this will help deciding which method to choose.
Benchmarks on Perl 5.32. Every function returns the same result for the given $input.
Code:
#!/usr/bin/env perl
my $input = "ala ma 0,5 litra 40%'owej vodki :)";
use Net::Curl::Easy;
my $easy = Net::Curl::Easy->new();
use URI::Encode qw( uri_encode );
use URI::Escape qw( uri_escape );
use Benchmark qw( cmpthese );
cmpthese(-3, {
'a' => sub {
my $string = $input;
$string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%0x", ord $1 /eg;
},
'b' => sub {
my $string = $input;
$string = $easy->escape( $string );
},
'c' => sub {
my $string = $input;
$string = uri_encode( $string, {encode_reserved => 1} );
},
'd' => sub {
my $string = $input;
$string = uri_escape( $string );
},
});
And results:
Rate c d a b
c 5618/s -- -98% -99% -100%
d 270517/s 4716% -- -31% -80%
a 393480/s 6905% 45% -- -71%
b 1354747/s 24016% 401% 244% --
Not surprising. A specialized C solution is the fastest. An in-place regex with no sub calls is quite fast, followed closely by a copying regex with a sub call. I didn't look into why uri_encode was so much worse than uri_escape.
Use the URI module and it will build URLs that just work.