Removing accents from 2 byte characters in ASCII string - perl

I am reading data from a server, which is returned as JSON and parsed using JSON::Parse. The JSON data includes accented characters such as é. These are encoded in the output strings as 2-byte sequences such as \xc3\xa9. The rest of each string is standard 1-byte ASCII.
How can I remove the accents from these characters? I have tried all of the following methods without success:
use Unicode::Normalize;
use utf8;
use Text::Iconv;
use Encode qw(from_to);
use Text::Unidecode;
sub normalise_text {
my $text = shift;
my $decomposed = NFKD( $text );
$decomposed =~ s/\p{NonspacingMark}//g;
return $decomposed;
}
sub convert {
my $converter = Text::Iconv->new("utf16", "utf8");
return $converter->convert($_);
}
sub fromto {
return from_to($_, 'UTF-16LE', 'UTF-8');
}
These libraries tend to convert each character on a byte by byte basis which is no good. For the short term, I am doing the conversion as follows:
sub mine {
my $text = $_;
$text =~ s/\xc3\xa9/e/g;
$text =~ s/\xc3\xa1/a/g;
return $text;
}
There must be a better way! Any suggestions?
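One possible approach, sketched under the assumption that the strings returned by the parser are UTF-8 encoded octets (as the \xc3\xa9 example suggests): decode the octets into Perl characters first, and only then normalise and strip the combining marks. The sub name strip_accents is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFKD);

# Assumption: $bytes holds UTF-8 encoded octets, e.g. "\xc3\xa9" for é.
sub strip_accents {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);        # octets -> Perl characters
    my $decomposed = NFKD($text);              # é -> e followed by U+0301
    $decomposed =~ s/\p{NonspacingMark}//g;    # drop the combining marks
    return $decomposed;
}

print strip_accents("caf\xc3\xa9 ol\xc3\xa1"), "\n";   # prints "cafe ola"
```

Without the decode step, NFKD sees \xc3 and \xa9 as two separate Latin-1 characters, which would explain why the normalisation appeared to work byte by byte.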

Related

Chinese character doesn't always display correctly

I've been trying to pass a Chinese character to a JSON hash but it always comes out as "å¥³"
#!/usr/bin/perl
use JSON;
# variable declaration
my $gender = "Female";
# turning english selection to Chinese character
if ($gender eq 'Female') {
$gender = "女";
} elsif ($gender eq 'Male') {
$gender = "男";
} elsif ($gender eq 'Decline to state') {
$gender = "";
}
my $hash_ref = {};
$hash_ref->{'detail_sex'} = $gender;
print JSON->new->utf8(1)->pretty(1)->encode($hash_ref);
This is the result I get:
{
"detail_sex" : "å¥³"
}
However, when I test another script the Chinese character comes out perfectly.
#!/usr/bin/perl
use Digest::MD5 qw(md5 md5_hex md5_base64);
use Encode qw(encode_utf8);
use JSON;
my $userid = 1616589;
my $time = 2015811;
my $ejob_id = 1908063;
# md5 encryption without chinese characters
my $md5_hex_sign = md5_hex($userid,$time,$job_id);
print "$md5_hex_sign\n";
# seeing if character will print
print "let's try encoding and decrypting \n";
print "the character to encrypt.\n";
print "女\n";
print "unicode print out\n";
print "\x{5973}\n";
my $char = "\x{5973}";
my $sign_char = "女";
print "unicode stored in \$char variable \n";
print $char, "\n";
print "md5 encryption of said chinese character from \$char with utf8 encoding\n";
print md5_hex(encode_utf8($char)), "\n";
print "md5 encryption of wide character with utf8 encoding\n";
print md5_hex(encode_utf8("女")), "\n";
my $sign_gender = md5_hex(encode_utf8($sign_char));
#JSON
print "JSON print out\n";
my $hash_ref = {};
$hash_ref->{'gender'} = $char;
$hash_ref->{'md5_gender'} = md5_hex(encode_utf8($char));
$hash_ref->{'char_gender'} = md5_hex(encode_utf8("女"));
$hash_ref->{'sign_gender'} = $sign_gender;
print JSON->new->utf8(1)->pretty(1)->encode($hash_ref);
Here is the result:
160a6f4bf9aec1c2d102330716ca8f4e
let's try encoding and decrypting
the character to encrypt.
女
unicode print out
Wide character in print at md5check.pl line 18.
女
unicode stored in $char variable
Wide character in print at md5check.pl line 22.
女
md5 encryption of said chinese character from $char with utf8 encoding
87c835a6b1749374a7524a596087b296
md5 encryption of wide character with utf8 encoding
06c82a10da7e297180d696ed92f524c1
JSON print out
{
"char_gender" : "06c82a10da7e297180d696ed92f524c1",
"md5_gender" : "87c835a6b1749374a7524a596087b296",
"sign_gender" : "06c82a10da7e297180d696ed92f524c1",
"gender" : "女"
}
Would someone kindly explain to me what is going on?
Things I've tried:
use utf8;
print JSON->new->ascii(1)->pretty(1)->encode($hash_ref);
But I still get this as a result:
{
"detail_sex" : "å¥³"
}
I'm mostly concerned about the Chinese character (女) being md5 encrypted instead of "å¥³" being encrypted.
If it isn't already, save your source code in UTF-8.
Tell Perl that your script contains UTF-8 with the utf8 pragma:
use utf8;
Here's a very short test case for you to try:
use strict;
use warnings;
use utf8;
use JSON;
print encode_json({detail_sex => '女'});
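To illustrate why the pragma matters, here is a small sketch that uses only the core Encode module. It assumes the source file is saved as UTF-8, so that without use utf8 the literal 女 reaches Perl as the three octets E5 A5 B3. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# What the literal "女" looks like to Perl WITHOUT `use utf8`: three octets.
my $bytes = "\xe5\xa5\xb3";

# What it looks like WITH `use utf8`: one character, U+5973.
my $char = decode('UTF-8', $bytes);

# Encoding the undecoded octets again treats each octet as a Latin-1
# character, so every one of them becomes a two-byte sequence.
my $double = encode('UTF-8', $bytes);

printf "%d %d %d\n", length($bytes), length($char), length($double);   # prints "3 1 6"
```

The six-octet result is the double-encoded form, which a UTF-8 terminal displays as three accented Latin characters instead of one Chinese character.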

Converting to unicode characters in Perl?

I want to convert text (Hindi) to Unicode escapes in Perl. I have searched CPAN, but I could not find the exact module or approach I am looking for. Basically, I am looking for something like this.
My Input is:
इस परीक्षण के लिए है
My expected output is:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
How to achieve this in Perl?
Give me some suggestions.
Try this
use utf8;
my $str = 'इस परीक्षण के लिए है';
for my $c (split //, $str) {
printf("\\u%04x", ord($c));
}
print "\n";
You don't really need any module to do that. ord for extracting the character code and sprintf for formatting it as four-digit, zero-padded hex is more than enough:
use utf8;
my $str = 'इस परीक्षण के लिए है';
(my $u_encoded = $str) =~ s/(.)/sprintf "\\u%04x", ord($1)/sge;
# \u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
Because I left a few comments on how the other answers might fall short of the expectations of various tools, I'd like to share a solution that encodes characters outside of the Basic Multilingual Plane as pairs of two escapes: "😃" would become \ud83d\ude03.
This is done by:
Encoding the string as UTF-16, without a byte order mark. We explicitly choose an endianness. Here, we arbitrarily use the big-endian form. This produces a string of octets (“bytes”), where two octets form one UTF-16 code unit, and two or four octets represent a Unicode code point.
This is done for convenience and performance; we could just as well determine the numeric values of the UTF-16 code units ourselves.
Unpacking the resulting binary string into 16-bit integers which represent each UTF-16 code unit. We have to respect the correct endianness, so we use the n* pattern for unpack (i.e. 16-bit big-endian unsigned integer).
Formatting each code unit as an \uxxxx escape.
As a Perl subroutine, this would look like
use strict;
use warnings;
use Encode ();
sub unicode_escape {
my ($str) = @_;
my $UTF_16BE_octets = Encode::encode("UTF-16BE", $str);
my @code_units = unpack "n*", $UTF_16BE_octets;
return join '', map { sprintf "\\u%04x", $_ } @code_units;
}
Test cases:
use Test::More tests => 3;
use utf8;
is unicode_escape(''), '',
'empty string is empty string';
is unicode_escape("\N{SMILING FACE WITH OPEN MOUTH}"), '\ud83d\ude03',
'non-BMP code points are escaped as surrogate halves';
my $input = 'इस परीक्षण के लिए है';
my $output = '\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948';
is unicode_escape($input), $output,
'ordinary BMP code points each have a single escape';
If you only want a simple converter, you can use the following filter
perl -CSDA -nle 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -nlE 'printf "\\u%04x",$_ for unpack "U*"'
like:
echo "इस परीक्षण के लिए है" | perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_' <<< "इस परीक्षण के लिए है"
prints:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948\u000a
Unicode with surrogate pairs.
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
my $str = "if( \N{U+1F42A}+\N{U+1F410} == \N{U+1F41B} ){ \N{U+1F602} = \N{U+1F52B} } # ορισμός ";
print "$str\n";
for my $ch (unpack "U*", $str) {
if( $ch > 0xffff ) {
my $h = ($ch - 0x10000) / 0x400 + 0xD800;
my $l = ($ch - 0x10000) % 0x400 + 0xDC00;
printf "\\u%04x\\u%04x", $h, $l;
}
else {
printf "\\u%04x", $ch;
}
}
print "\n";
prints
if( 🐪+🐐 == 🐛 ){ 😂 = 🔫 } # ορισμός
\u0069\u0066\u0028\u0020\ud83d\udc2a\u002b\ud83d\udc10\u0020\u003d\u003d\u0020\ud83d\udc1b\u0020\u0029\u007b\u0020\ud83d\ude02\u0020\u003d\u0020\ud83d\udd2b\u0020\u007d\u0020\u0023\u0020\u03bf\u03c1\u03b9\u03c3\u03bc\u03cc\u03c2\u0020
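As a worked check of the surrogate arithmetic, the following sketch computes the pair for U+1F603, the smiley mentioned earlier. The helper name surrogate_pair is made up for this example, and it uses integer shift and mask operations in place of the division and modulo above.

```perl
use strict;
use warnings;

# Split a code point above U+FFFF into its UTF-16 surrogate pair.
sub surrogate_pair {
    my ($cp) = @_;
    my $v = $cp - 0x10000;              # 20-bit value
    my $high = 0xD800 + ($v >> 10);     # top 10 bits
    my $low  = 0xDC00 + ($v & 0x3FF);   # bottom 10 bits
    return ($high, $low);
}

# 0x1F603 - 0x10000 = 0xF603; 0xF603 >> 10 = 0x3D; 0xF603 & 0x3FF = 0x203
printf "\\u%04x\\u%04x\n", surrogate_pair(0x1F603);   # prints \ud83d\ude03
```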

Perl + Word: Degree sign shows up preceded by A-circumflex

I'm generating a Word document in Perl, and I'd like to include the degree symbol (°) in the text I generate. If I generate the code like so:
$cell .= qq/\xB0/;
This works, and generates (for a value of $cell of 55): 55°
However, perlcritic complains at me when I do this and suggests I use this construction instead:
$cell .= qq/\N{DEGREE SIGN}/;
This does not work; it generates: 55Â°
Looking through my code in perl -d, I see that running the following code:
my $cell = 55;
$cell .= qq/\N{DEGREE SIGN}/; # the PBP way
print sprintf("%x\n", ord($_)) for split //, $cell;
$cell = 55;
$cell .= qq/\xB0/; # the non-PBP way
print sprintf("%x\n", ord($_)) for split //, $cell;
results in:
35
35
b0
I'm outputting text to the Word document using Win32::OLE:
my @column_headings = @{ shift @{ $args->{'data'} } };
my @rows = @{ $args->{'data'} };
my $word = Win32::OLE->new( 'Word.Application', 'Quit' );
my $doc = $word->Documents->Add();
my $select = $word->Selection;
$csv->combine(@column_headings);
$select->InsertAfter( $csv->string );
$select->InsertParagraphAfter;
for my $row (@rows) {
$csv->combine( @{$row} );
$select->InsertAfter( $csv->string );
$select->InsertParagraphAfter;
}
my $table =
$select->ConvertToTable( { 'Separator' => wdSeparateByCommas } );
$table->Rows->First->Range->Font->{'Bold'} = 1;
$table->Rows->First->Range->ParagraphFormat->{'Alignment'} =
wdAlignParagraphCenter;
@{ $table->Rows->First->Borders(wdBorderBottom) }{qw/LineStyle LineWidth/}
= ( wdLineStyleDouble, wdLineWidth100pt );
$doc->SaveAs( { 'Filename' => Cwd::getcwd . '/test.doc' } );
What can I do to get rid of the extraneous Â?
Of course, you are suffering from encoding issues. The degree sign is U+00B0, but this serializes in UTF-8 to the bytes C2 B0, which render as ° if the multi-byte sequence is correctly decoded as UTF-8. If the bytes are instead decoded with a single-byte encoding (say, cp1252), they are treated as separate characters and display as Â°.
Now clearly, the solution is either to tell Perl to transform the unicode string to a byte string of cp1252 chars (the horror!). You will find the my $bytestring = Encode::encode("cp1252", $string) function interesting here.
Or you tell the document that it will consider itself UTF-8. I don't know how you would do that, but there has to be an option somewhere. This would actually be preferable, as there are thousands of characters that (unlike the °) don't fit into cp1252. Like the degree Celsius ℃ (U+2103) or degree Fahrenheit ℉ (U+2109) characters ;-)
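The byte-level difference between the two encodings can be seen with a short sketch. It uses only the core Encode module; the helper hex_bytes is made up for the illustration, and whether passing the cp1252 bytes to Win32::OLE fixes the document is an assumption that would need testing.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $cell = "55\N{U+00B0}";                         # 55 followed by DEGREE SIGN

print hex_bytes(encode('UTF-8',  $cell)), "\n";    # prints "35 35 c2 b0"
print hex_bytes(encode('cp1252', $cell)), "\n";    # prints "35 35 b0"
```

The extra c2 byte in the UTF-8 form is what shows up as Â when the receiving side reads the bytes as cp1252.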

Perl: how to replace extended characters by their corresponding entity in an XML file?

In an XML file, I need to convert all characters above character code 127 to their corresponding numeric entity (for example, convert é into &#x00e9;).
Here is what I wrote, but it doesn't work.
sub as_entity{
my $char = shift;
return sprintf("&#x%.4x;", ord($char));
}
sub entitify{
my $str = shift;
$str =~ s/([\x7f-\x{ffffff}])/(?{as_entity($1)})/g;
return $str;
}
It seems I can't use the (?{...}) in the replacement part...
What would be the best way to achieve this?
$str =~ s/([\x7f-\x{ffffff}])/as_entity($1)/ge;
should be enough. (Note the extra /e modifier.)
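Put together as a runnable sketch, the fix might look like this. Note one deliberate change that is not in the answer above: the character class here is [^\x00-\x7f], so that DEL (\x7f) is left alone and only characters above 127 are replaced.

```perl
use strict;
use warnings;

sub as_entity {
    my $char = shift;
    return sprintf("&#x%.4x;", ord($char));
}

sub entitify {
    my $str = shift;
    # /e evaluates the replacement as Perl code for every match
    $str =~ s/([^\x00-\x7f])/as_entity($1)/ge;
    return $str;
}

print entitify("caf\x{e9}"), "\n";   # prints "caf&#x00e9;"
```

For file input this assumes the text has already been decoded into Perl characters; run on raw UTF-8 octets, each byte of a multi-byte sequence would be escaped separately.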

shift jis decoding/encoding in perl

When I try to decode a shift-jis encoded string and encode it back, some of the characters get garbled:
I have following code:
use Encode qw(decode encode);
$val=<STDIN>;
print "\nbefore decoding: $val";
my $ustr = Encode::decode("shiftjis",$val);
print "\nafter decoding: $ustr";
print "\nbefore encoding: $ustr";
$val = Encode::encode("shiftjis",$ustr);
print "\nafter encoding: $val";
When I use a string such as helloソworld as input, it gets properly decoded and encoded back, i.e. the "before decoding" and "after encoding" prints in the above code show the same value.
But when I tried another string like : ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ
The end output got garbled.
Is it a Perl library specific problem or is it a general shift jis mapping problem?
Is there any solution for it?
You should simply replace the shiftjis with cp932.
http://en.wikipedia.org/wiki/Code_page_932
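A quick way to compare the two encodings is a round-trip check, sketched below with the core Encode module. The helper name round_trips is made up for this example; the Roman numerals are written as \x{...} escapes so the result does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return 1 if $text survives encode + decode in $encoding, else 0.
sub round_trips {
    my ($encoding, $text) = @_;
    my $copy = $text;   # encode() may modify its argument when a CHECK value is given
    my $bytes = eval { encode($encoding, $copy, Encode::FB_CROAK) };
    return 0 unless defined $bytes;   # some character has no mapping
    return decode($encoding, $bytes) eq $text ? 1 : 0;
}

my $numerals = "\x{2160}\x{2161}\x{2162}";   # Roman numerals one to three

printf "shiftjis: %d\n", round_trips('shiftjis', $numerals);   # prints "shiftjis: 0"
printf "cp932: %d\n",    round_trips('cp932',    $numerals);   # prints "cp932: 1"
```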
You lack error-checking.
use utf8;
use Devel::Peek qw(Dump);
use Encode qw(encode);
sub as_shiftjis {
my ($string) = @_;
return encode(
'Shift_JIS', # http://www.iana.org/assignments/character-sets
$string,
Encode::FB_CROAK
);
}
Dump as_shiftjis 'helloソworld';
Dump as_shiftjis 'ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ';
Output:
SV = PV(0x9148a0) at 0x9dd490
REFCNT = 1
FLAGS = (TEMP,POK,pPOK)
PV = 0x930e80 "hello\203\\world"\0
CUR = 12
LEN = 16
"\x{2160}" does not map to shiftjis at …