Perl unicode conversion - perl

I'm using this code:
use Unicode::UTF8 qw[decode_utf8 encode_utf8];
my $d = "opposite Spencer\u2019s Aliganj, Lucknow";
my $string = decode_utf8($d);
my $octets = encode_utf8($d);
print "\nSTRING :: $string";
I want output like
opposite Spencer's Aliganj, Lucknow
What should I do?

If you just want Unicode code point U+2019 to become ’, you can use one of these ways:
use strict;
use warnings;
use open ':std', ':encoding(utf-8)';
print chr(0x2019);
print "\x{2019}"; # for characters 0x100 and above
print "\N{U+2019}";
In Perl, \u and \U inside interpolated strings are case-translation escapes, not Unicode escapes. The Perl documentation says:
Case translation operators use the Unicode case translation tables
when character input is provided. Note that uc(), or \U in
interpolated strings, translates to uppercase, while ucfirst, or \u in
interpolated strings, translates to titlecase in languages that make
the distinction (which is equivalent to uppercase in languages without
the distinction).
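To see the difference concretely, here is a small sketch (the variable names are illustrative only) that compares the \u escape from the question with the three ways of writing U+2019 listed above:

```perl
use strict;
use warnings;

# \u title-cases the next character; "2" has no case, so the digits stay as plain text.
my $case_escape = "Spencer\u2019s";

# Each of these produces the single character U+2019.
my $by_chr  = "Spencer" . chr(0x2019) . "s";
my $by_hex  = "Spencer\x{2019}s";
my $by_name = "Spencer\N{U+2019}s";

printf "%-10s has %d characters\n", '\u form',   length $case_escape;   # 12
printf "%-10s has %d characters\n", '\x{} form', length $by_hex;        # 9
```

The length difference shows that the \u form contains the four literal digits 2019 rather than one apostrophe character.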

You're trying to parse butchered JSON.
You could parse it yourself.
use Encode qw( decode );
my $incomplete_json = 'opposite Spencer\u2019s Aliganj, Lucknow'; # single quotes keep the literal \u
my $string = $incomplete_json;
$string =~ s{\\u([dD][89aAbB]..)\\u([dD][cCdDeEfF]..)|\\u(....)}
{ $1 ? decode('UTF-16be', pack('H*', $1.$2)) : chr(hex($3)) }eg;
Or you could fix it up then use an existing parser
use JSON::XS qw( decode_json );
my $incomplete_json = 'opposite Spencer\u2019s Aliganj, Lucknow'; # single quotes keep the literal \u
my $json = $incomplete_json;
$json =~ s/"/\\"/g;
$json = qq{["$json"]};
my $string = decode_json($json)->[0];
Untested. You may have to handle other slashes. Which solution is simpler depends on how you have to handle the other slashes.
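As a self-contained variant of the first approach, the sketch below decodes \uXXXX escapes with plain arithmetic instead of Encode. The helper name decode_u_escapes is hypothetical, and the input is assumed to contain a literal backslash followed by u (hence the single-quoted string):

```perl
use strict;
use warnings;

# Hypothetical helper: decode JSON-style \uXXXX escapes in a plain string.
sub decode_u_escapes {
    my ($text) = @_;
    $text =~ s{
        \\u([dD][89abAB][0-9a-fA-F]{2})   # high surrogate ...
        \\u([dD][c-fC-F][0-9a-fA-F]{2})   # ... followed by a low surrogate
      |
        \\u([0-9a-fA-F]{4})               # any other escape
    }{
        defined $1
            ? chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))
            : chr(hex($3))
    }gex;
    return $text;
}

my $string = decode_u_escapes('opposite Spencer\u2019s Aliganj, Lucknow');
```

It does not handle other JSON escapes such as \n or \", so it only fits input where \uXXXX is the sole escape form.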

Related

Perl - Convert integer to text Char(1,2,3,4,5,6)

I am after some help converting the following log to plain text.
This is a URL, so there may be %20 (a space) and other escapes, but the main bit I am trying to convert is the char(1,2,3,4,5,6) to text.
Below is an example of what I am trying to convert.
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
What I have tried so far is the following, attempting to pass whatever is inside char(...) to chr($2):
perl -pe "s/(char())/chr($2)/ge"
All this has managed to do is remove the char, but I am now trying to convert the numbers to text and remove the commas and brackets.
I may be way off with how I am doing this, as I am fairly new to Perl.
perl -pe "s/word to remove/word to change it to/ge"
"s/(char(what goes in here))/chr($2)/ge"
Output try to achieve is
select -x1-Q-,-x2-Q-,-x3-Q-
Or
select%20-x1-Q-,-x2-Q-,-x3-Q-
Thanks for any help
There's too much to do here for a reasonable one-liner. Also, a script is easier to adjust later.
use warnings;
use strict;
use feature 'say';
use URI::Escape 'uri_unescape';
my $string = q{select%20}
. q{char(45,120,49,45,81,45),char(45,120,50,45,81,45),}
. q{char(45,120,51,45,81,45)};
my $new_string = uri_unescape($string); # convert %20 and such
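my @parts = $new_string =~ /(.*?)(char.*)/;
# Split after each closing paren so that every char(...) group is converted on its own
$parts[1] = join ',', map { join '', map { chr($_) } /([0-9]+)/g } split /(?<=\)),/, $parts[1];
$new_string = join '', @parts;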
say $new_string;
this prints
select -x1-Q-,-x2-Q-,-x3-Q-
Comments
Module URI::Escape is used to convert percent-encoded characters, per RFC 3986
It is unspecified whether anything can follow the part with char(...)s, and what that might be. If there can be more after the last char(...), adjust the splitting into @parts, or clarify
In the part with char(...)s only the numbers are needed, which is what the regex in map extracts
If you are going to use regex you should read up on it. See
perlretut, a tutorial
perlrequick, a quick-start introduction
perlre, the full account of syntax
perlreref, a quick reference (its See Also section is useful on its own)
Alright, this is going to be a messy "one-liner". Assuming your text is in a variable called $text.
$text =~ s{char\( ( (?: (?:\d+,)* \d+ )? ) \)}{
my @arr = split /,/, $1;
my $temp = join('', map { chr($_) } @arr);
$temp =~ s/^|$/"/g;
$temp
}xeg;
The regular expression matches char(, followed by a comma-separated list of sequences of digits, followed by ). We capture the digits in capture group $1. In the substitution, we split $1 on the comma (since chr only works on one character, not a whole list of them). Then we map chr over each number and concatenate the result into a string. The next line simply puts quotation marks at the start and end of the string (presumably you want the output quoted) and then returns the new string.
Input:
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
Output:
select%20"-x1-Q-","-x2-Q-","-x3-Q-"
If you want to replace the % escape sequences as well, I suggest doing that in a separate line. Trying to integrate both substitutions into one statement is going to get very hairy.
This will do as you ask. It performs the decoding in two stages: first the URI-encoding is decoded using chr hex $1, and then each char() function is translated to the string corresponding to the character equivalents of its decimal parameters
use strict;
use warnings 'all';
use feature 'say';
my $s = 'select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)';
$s =~ s/%([0-9A-Fa-f]{2})/ chr hex $1 /eg; # a percent escape is exactly two hex digits
$s =~ s{ char \s* \( ( [^()]+ ) \) }{ join '', map chr, $1 =~ /\d+/g }xge;
say $s;
output
select -x1-Q-,-x2-Q-,-x3-Q-

Perl | Print ASCII, but backslashed other

I want to print the 95 printable ASCII symbols unchanged, but print all other characters as their codes.
How can I do it in pure Perl? With the unpack function? Any module?
print BackSlashed('test folder'); # expected test\040folder
print BackSlashed('test тестовая folder');
# expected test\040\321\202\320\265\321\201\321\202\320\276\320\262\320\260\321\217\040folder
print BackSlashed('НОВАЯ ПАПКА');
# expected \320\235\320\236\320\222\320\220\320\257\040\320\237\320\220\320\237\320\232\320\220
sub BackSlashed() {
my $str = shift;
.. backslashed code here...
return $str
}
You can use a regular expression substitution with an evaled substitution part. In there, you need to convert each character to its numeric value first, and then output it in octal notation. There's a good explanation for it in this answer. Prepend an escaped backslash \ to get it to show up in the output.
$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
I limited the capture group to basic ASCII letters and numbers. If you want something else, just change the character group.
Since your sample output has octets but you said your code has the use utf8 pragma, you need to convert Perl's representation of the string to the corresponding octet sequence before you run the substitution.
use utf8;
my $str = 'НОВАЯ ПАПКА';
print foo($str);
sub foo { # note that there are no () here!
my $str = shift;
utf8::encode($str);
$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
return $str;
}
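If the character group should follow the question's expected output more closely (every visible ASCII character kept, the space and all non-ASCII octets escaped), a minimal sketch could look like this. The sub name backslash_non_graph is hypothetical, and the \x21-\x7E range is an assumption read off the sample output:

```perl
use strict;
use warnings;

# Hypothetical variant: escape everything except the visible ASCII characters.
sub backslash_non_graph {
    my ($str) = @_;
    utf8::encode($str);                                    # characters -> UTF-8 octets
    $str =~ s/([^\x21-\x7E])/sprintf "\\%03o", ord $1/eg;  # octal escape per octet
    return $str;
}

print backslash_non_graph('test folder'), "\n";   # test\040folder
```

A literal backslash in the input is left unchanged by this range, so the output is not unambiguously reversible unless \x5C is escaped as well.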

How do I decode a double backslashed PERLQQ escaped string into Perl characters?

I read lines from a file which contains semi-utf8 encoding and I wish to convert it to Perl-internal representation for further operations.
file.in (plain ASCII):
MO\\xc5\\xbdN\\xc3\\x81
NOV\\xc3\\x81
These should translate to MOŽNÁ and NOVÁ.
I load the lines and upgrade them to proper utf8 notation, ie. \\xc5\\xbd -> \x{00c5}\x{00bd}. Then I would like to take this upgraded $line and make perl to represent it internally:
for my $line (@lines) {
$line =~ s/x(..)/x{00$1}/g;
eval { $l = "$line"; };
}
Unfortunately, without success.
use File::Slurp qw(read_file);
use Encode qw(decode);
use Encode::Escape qw();
my $string =
decode 'UTF-8', # octets → characters
decode 'unicode-escape', # \x → octets
decode 'ascii-escape', # \\x → \x
read_file 'file.in';
Read from the bottom upwards.
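If installing Encode::Escape is not an option, the same two steps can be sketched with the core Encode module alone. The helper name decode_double_escaped is hypothetical, and the input is assumed to contain only \\xNN sequences with two hex digits, as in file.in:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn each literal \\xNN sequence into the octet NN,
# then decode the resulting octet string as UTF-8.
sub decode_double_escaped {
    my ($line) = @_;
    $line =~ s/\\\\x([0-9A-Fa-f]{2})/chr hex $1/ge;
    return decode('UTF-8', $line);
}

# In a single-quoted literal, '\\\\' stands for two backslashes.
my $word = decode_double_escaped('MO\\\\xc5\\\\xbdN\\\\xc3\\\\x81');
```

The result is a five-character Perl string (MOŽNÁ), ready for further character-level operations.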

Convert utf-8 into html &...;

In Perl, how can I convert string containing utf-8 characters to HTML where such characters will be converted into &...; ?
First, split on an empty pattern to get a list of single characters. Then, map each character to itself, if it is ASCII, or its code, if it is not:
use Encode qw( decode_utf8 );
my $utf8_string = "\xE2\x80\x9C\x68\x6F\x6D\x65\xE2\x80\x9D";
my $unicode_string = decode_utf8($utf8_string);
my $html = join q(),
map { ord > 127 ? "&#" . ord . ";"
: $_
} split //, $unicode_string;
Just replace every character that is not printable low ASCII (that is, anything outside the \x20-\x7F range) with its ord wrapped in the HTML numeric entity syntax. Perl's substitution operator has the /e flag to indicate that the replacement should be evaluated as code.
use utf8;
my $str = "testТест"; # This is correct UTF-8 string right in the code
$str =~ s/([^\x20-\x7F])/"&#" . ord($1) . ";"/eg;
print $str;
# test&#1058;&#1077;&#1089;&#1090;

How can I reverse a string that contains combining characters in Perl?

I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters, so I wind up getting "\x{0301}emus\x{0301}er" ( ́emuśer). How can I reverse the string, but still respect the combining characters?
You can use the \X special escape (match a non-combining character and all of the following combining characters) with split to make a list of graphemes (with empty strings between them), reverse the list of graphemes, then join them back together:
#!/usr/bin/perl
use strict;
use warnings;
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";
The best answer is to use Unicode::GCString, as Sinan points out.
I modified Chas's example a bit:
Set the encoding on STDOUT to avoid "wide character in print" warnings;
Use a positive lookahead assertion (and no separator retention mode) in split (doesn't work after 5.10, apparently, so I removed it)
It's basically the same thing with a couple of tweaks.
use strict;
use warnings;
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print <<HERE;
original: [$original]
wrong: [$wrong]
right: [$right]
HERE
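The split with a capturing group returns empty strings between the graphemes. As a sketch of an alternative (not taken from the answers here), a global match on \X in list context returns only the grapheme clusters:

```perl
use strict;
use warnings;

my $original  = "re\x{0301}sume\x{0301}";
my @graphemes = $original =~ /\X/g;            # one element per grapheme cluster
my $right     = join '', reverse @graphemes;

binmode STDOUT, ':encoding(UTF-8)';
print scalar(@graphemes), " graphemes, reversed: $right\n";
```

For this input the list has six elements, and each combining accent stays attached to its base letter after the reverse.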
You can use Unicode::GCString:
Unicode::GCString treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use feature 'say';
use open qw(:std :utf8);
use Unicode::GCString;
my $x = "re\x{0301}sume\x{0301}";
my $y = Unicode::GCString->new($x);
my $wrong = reverse $x;
my $correct = join '', reverse @{ $y->as_arrayref };
say "$x -> $wrong";
say "$y -> $correct";
Output:
résumé -> ́emuśer
résumé -> émusér
Perl6::Str->reverse also works.
In the case of the string résumé, you can also use the Unicode::Normalize core module to change the string to a fully composed form (NFC or NFKC) before reversing; however, this is not a general solution, because some combinations of base character and modifier have no precomposed Unicode codepoint.
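A minimal sketch of the normalization approach and its limitation follows. The q with an acute accent is an example chosen here because it has no precomposed code point:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $original = "re\x{0301}sume\x{0301}";

# After NFC each e + U+0301 pair is the single code point U+00E9,
# so a plain scalar reverse keeps the accents in place.
my $reversed = reverse NFC($original);

# No precomposed form exists for q + U+0301, so NFC leaves two code points
# and a plain reverse would still detach the accent.
my $no_precomposed = NFC("q\x{0301}");
```

Here $reversed holds émusér in composed form, while $no_precomposed still has length 2.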
Some of the other answers contain elements that don't work well. Here is a working example tested on Perl 5.12 and 5.14. Failing to specify the binmode will cause the output to generate error messages. Using a positive lookahead assertion (and no separator retention mode) in split will cause the output to be incorrect on my Macbook.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'unicode_strings';
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";