Converting to unicode characters in Perl? - perl

I want to convert Hindi text to Unicode escape sequences in Perl. I have searched CPAN, but I could not find a module or technique that does exactly this. Basically, I am looking for something like this.
My Input is:
इस परीक्षण के लिए है
My expected output is:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
How to achieve this in Perl?
Give me some suggestions.

Try this
use utf8;
my $str = 'इस परीक्षण के लिए है';
for my $c (split //, $str) {
printf("\\u%04x", ord($c));
}
print "\n";

You don't really need any module to do that. ord for extracting the character code and sprintf for formatting it as four-digit zero-padded hex are more than enough:
use utf8;
my $str = 'इस परीक्षण के लिए है';
(my $u_encoded = $str) =~ s/(.)/sprintf "\\u%04x", ord($1)/sge;
# \u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948
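One limitation of the ord-based approach is worth seeing concretely: for code points above U+FFFF, %04x yields five hex digits, which many consumers of \uXXXX escapes will not accept. The following is a minimal sketch, not part of the original answer; naive_escape is a hypothetical name wrapping the same substitution.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
sub naive_escape {
    my ($str) = @_;
    (my $out = $str) =~ s/(.)/sprintf "\\u%04x", ord($1)/sge;
    return $out;
}

print naive_escape("\x{907}\x{938}"), "\n";  # prints \u0907\u0938
print naive_escape("\x{1F603}"), "\n";       # prints \u1f603 (five digits, not a surrogate pair)
```

If the output is meant for a format that expects UTF-16 code units, such as JSON or JavaScript string literals, non-BMP characters need to be written as surrogate pairs instead.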

Because I left a few comments on how the other answers might fall short of the expectations of various tools, I'd like to share a solution that encodes characters outside of the Basic Multilingual Plane as pairs of two escapes: "😃" would become \ud83d\ude03.
This is done by:
Encoding the string as UTF-16, without a byte order mark. We explicitly choose an endianness; here, we arbitrarily use the big-endian form. This produces a string of octets (“bytes”), where two octets form one UTF-16 code unit, and two or four octets represent a Unicode code point.
This is done for convenience and performance; we could just as well determine the numeric values of the UTF-16 code units ourselves.
Unpacking the resulting binary string into 16-bit integers which represent each UTF-16 code unit. We have to respect the correct endianness, so we use the n* pattern for unpack (i.e. 16-bit big-endian unsigned integer).
Formatting each code unit as an \uxxxx escape.
As a Perl subroutine, this would look like
use strict;
use warnings;
use Encode ();
sub unicode_escape {
    my ($str) = @_;
    my $UTF_16BE_octets = Encode::encode("UTF-16BE", $str);
    my @code_units = unpack "n*", $UTF_16BE_octets;
    return join '', map { sprintf "\\u%04x", $_ } @code_units;
}
Test cases:
use Test::More tests => 3;
use utf8;
is unicode_escape(''), '',
'empty string is empty string';
is unicode_escape("\N{SMILING FACE WITH OPEN MOUTH}"), '\ud83d\ude03',
'non-BMP code points are escaped as surrogate halves';
my $input = 'इस परीक्षण के लिए है';
my $output = '\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948';
is unicode_escape($input), $output,
'ordinary BMP code points each have a single escape';

If you only want a simple converter, you can use the following filter:
perl -CSDA -nle 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -nlE 'printf "\\u%04x",$_ for unpack "U*"'
like:
echo "इस परीक्षण के लिए है" | perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_'
#or
perl -CSDA -ne 'printf "\\u%*v04x\n", "\\u",$_' <<< "इस परीक्षण के लिए है"
prints:
\u0907\u0938\u0020\u092a\u0930\u0940\u0915\u094d\u0937\u0923\u0020\u0915\u0947\u0020\u0932\u093f\u090f\u0020\u0939\u0948\u000a
Unicode with surrogate pairs.
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
my $str = "if( \N{U+1F42A}+\N{U+1F410} == \N{U+1F41B} ){ \N{U+1F602} = \N{U+1F52B} } # ορισμός ";
print "$str\n";
for my $ch (unpack "U*", $str) {
    if( $ch > 0xffff ) {
        # split a non-BMP code point into a UTF-16 surrogate pair
        my $h = int(($ch - 0x10000) / 0x400) + 0xD800;
        my $l = ($ch - 0x10000) % 0x400 + 0xDC00;
        printf "\\u%04x\\u%04x", $h, $l;
    }
    else {
        printf "\\u%04x", $ch;
    }
}
print "\n";
prints
if( 🐪+🐐 == 🐛 ){ 😂 = 🔫 } # ορισμός
\u0069\u0066\u0028\u0020\ud83d\udc2a\u002b\ud83d\udc10\u0020\u003d\u003d\u0020\ud83d\udc1b\u0020\u0029\u007b\u0020\ud83d\ude02\u0020\u003d\u0020\ud83d\udd2b\u0020\u007d\u0020\u0023\u0020\u03bf\u03c1\u03b9\u03c3\u03bc\u03cc\u03c2\u0020

Related

how to extract the subset from a special character string using perl

I need to extract the part of a string that starts at a specific start word and ends before a specified end word, and store it in a string variable.
Example: pre-wrap">test-for??maths/camp
I need to fetch the subset.
Expected output: test-for??maths
After: pre-wrap"> or may be starting with: test
and before: /camp
I have no clue how to achieve this in Perl.
Here is the code I tried. The output is not coming as expected:
#!/usr/bin/perl
use warnings;
use strict;
my $string = 'pre-wrap">test-for??maths/camp';
my $quoted_substring = quotemeta($string);
my ($quoted_substring1) = split('/camp*', $quoted_substring);
my (undef, $substring2) = split('>\s*', $quoted_substring1);
print $string, "\n";
print $substring2, "\n";
Output:
$ perl test.pl
pre-wrap">test-for??maths/camp
test\-for\?\?maths\ # but why this \ is coming
The following code extracts the part between $before and $after (which may contain regex metacharacters, they are treated as pure characters inside the \Q...\E expressions):
my $string = 'pre-wrap">test-for??maths/camp';
my $before = 'pre-wrap">';
my $after = '/camp';
if ($string =~ /\Q$before\E(.*?)\Q$after\E/) {
print $1; # prints 'test-for??maths'
}
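As a rough sketch of how this could be packaged, and of where the stray backslash in the question comes from: quotemeta escapes the / in the input, so splitting the quoted string on /camp leaves the escaping backslash behind. The helper name extract_between is hypothetical and not from the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between two literal markers,
# or undef when the markers are not found in that order.
sub extract_between {
    my ($string, $before, $after) = @_;
    return $string =~ /\Q$before\E(.*?)\Q$after\E/s ? $1 : undef;
}

print extract_between('pre-wrap">test-for??maths/camp', 'pre-wrap">', '/camp'), "\n";
# prints test-for??maths

# quotemeta puts a backslash in front of the slash, which is what the
# split in the question left at the end of its output.
print quotemeta('maths/camp'), "\n";   # prints maths\/camp
```

Quoting the markers inside the pattern with \Q...\E, rather than quoting the whole input string, keeps the extracted text free of escape characters.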
Assuming pre-wrap">test-for??maths/camp is stored in a file named d:
perl -ne '/((?<=pre-wrap">)|(?<=>)(?=test))\S+(?=\/camp)/ ; print $&' d

Is there any way to make printf/sprintf handle combining characters correctly?

Combining characters appear to count as whole characters in printf and sprintf's calculations:
[    é]
[   é]
The text above was created by the following code:
#!/usr/bin/perl
use strict;
use warnings;
binmode STDOUT, ":utf8";
for my $s ("\x{e9}", "e\x{301}") {
printf "[%5s]\n", $s;
}
I expected the code to print:
[    é]
[    é]
I don't see any discussion of Unicode, let alone combining characters, in the function descriptions. Are printf and sprintf useless in the face of Unicode? Is this just a bug in Perl 5.20.1 that could be fixed? Is there a replacement someone has written?
It looks like the answer is to use Unicode::GCString
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::GCString;
binmode STDOUT, ":utf8";
for my $s ("\x{e9}", "e\x{301}", "e\x{301}\x{302}") {
printf "[%s]\n", pad($s, 5);
}
sub pad {
    my ($s, $length) = @_;
    my $gcs = Unicode::GCString->new($s);
    return((" " x ($length - $gcs->columns)) . $s);
}
You should probably be aware of the Perl Unicode Cookbook. In particular ℞ #34, which deals with this very issue. As a bonus, Perl v5.20.2 has it available as perldoc perlunicook.
In any case: The code included in that article is as follows:
use utf8;                     # the source contains literal accented characters
use feature 'say';
use Unicode::GCString;
use Unicode::Normalize;
binmode STDOUT, ":utf8";
my @words = qw/crème brûlée/;
@words = map { NFC($_), NFD($_) } @words;
for my $str (@words) {
    my $gcs = Unicode::GCString->new($str);
    my $cols = $gcs->columns;
    my $pad = " " x (10 - $cols);
    say $str, $pad, " |";
}
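If installing Unicode::GCString is not an option, a rougher alternative is to count grapheme clusters with the core regex escape \X. This is a different technique from the one in the answers above: it counts user-perceived characters rather than display columns, so it will be off for double-width (for example East Asian) characters. The name pad_left is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: left-pad $s to $width grapheme clusters.
# Assumes every cluster occupies one display column.
sub pad_left {
    my ($s, $width) = @_;
    my $clusters = () = $s =~ /\X/g;      # count grapheme clusters
    my $pad = $width - $clusters;
    $pad = 0 if $pad < 0;
    return (" " x $pad) . $s;
}

binmode STDOUT, ":encoding(UTF-8)";
for my $s ("\x{e9}", "e\x{301}", "e\x{301}\x{302}") {
    printf "[%s]\n", pad_left($s, 5);
}
```

Each of the three strings is a single grapheme cluster, so each receives four spaces of padding.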

How to truncate a string to a specific length in perl?

I am just unable to find "truncate a string to a specific length" in Perl.
Is there any built in way?
UPDATE:
input: $str = "abcd";
output (truncate for 3 characters): $str is abc
You want to use the substr() function.
$shortened = substr( $long, 0, 50 ); # 50 characters long, starting at the beginning.
For more, use perldoc
perldoc -f substr
In your case, it would be:
$str = 'abcd';
$short = substr( $str, 0, 3 );
For a string of arbitrary length, where truncate length can be longer than string length, I would opt for a substitution
$str =~ s/.{3}\K.*//s;
For shorter strings, the substitution will not match and the string will be unchanged. The convenient \K escape can be replaced with a lookbehind assertion, or a simple capture:
s/(?<=.{3}).*//s # lookbehind
s/(.{3}).*/$1/s # capture
It's probably useful to also mention that, instead of substr() or regular expressions, you could use printf or sprintf.
See perldoc -f sprintf :
For string conversions, specifying a precision truncates the string to
fit the specified width:
printf '<%.5s>', "truncated"; # prints "<trunc>"
printf '<%10.5s>', "truncated"; # prints "<     trunc>"
As long as your original string is at least 3 characters long, you can use a call to substr as an lvalue.
my $str = "abcd";
substr($str, 3) = "";
print "$str\n"; # prints "abc"
The initial length of the string may need to be checked, as if it is shorter than 3 characters, the return value of this call to substr cannot be assigned to (see perldoc -f substr for more information) and attempting to do so will cause an error.
If I understand correctly, you want to wrap a string the way PHP's wordwrap() does, so:
use Text::Format;
print Text::Format->new({columns => 50})->format($string);
If you just need the first N characters :
print substr $string, 0, 50;
Or you can use regexp to do the same.
#!/usr/bin/perl -w
use strict;
my $str = "abcd";
$str =~ /(\w{0,3})/;
print $1;
The most natural way is to use substr to extract the part you want:
$first_n = substr($string, 0, $n);
If you only want to modify the string and you are certain it is at least the desired length:
substr($string, $n) = '';
If you are not certain, you can do:
use List::Util "min";
substr($string, min($n, length($string))) = '';
or catch the exception:
eval { substr($string, $n) = '' };
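The substr variants discussed in this thread can be summarised in a small sketch. The helper names truncated_copy and truncate_in_place are hypothetical; the point is that the three-argument rvalue form is safe for short strings, while the lvalue form needs a length guard.

```perl
use strict;
use warnings;

# Hypothetical helper: return a truncated copy, leaving the original untouched.
sub truncated_copy {
    my ($string, $n) = @_;
    return substr($string, 0, $n);   # safe even when length($string) < $n
}

# Hypothetical helper: truncate the caller's variable in place.
sub truncate_in_place {
    my ($string_ref, $n) = @_;
    # Guard the lvalue substr so short strings are left alone.
    substr($$string_ref, $n) = '' if length($$string_ref) > $n;
    return;
}

my $str = "abcd";
print truncated_copy($str, 3), "\n";   # prints "abc"
print truncated_copy("ab", 3), "\n";   # prints "ab"
truncate_in_place(\$str, 3);
print "$str\n";                        # prints "abc"
```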

Parse scientific integer representation in perl

What is the most elegant way to parse an integer given in scientific representation, i.e. I have an input file with lines like
value=1.04738e+06
Sure, I can match all the components (leading digit, decimal positions, exponent) and calculate the result, but it seems to me there should be a more straightforward way.
% perl -e 'print "1.04738e+06" + 0'
1047380
You just need to coerce it to a number and Perl will DWIM.
FYI: looks_like_number() from Scalar::Util might come in handy.
#!/usr/bin/env perl
use strict;
use warnings;
use Scalar::Util qw( looks_like_number );
my $line = "value=1.04738e+06";
my ( $tag, $value ) = split /\s*=\s*/, $line, 2;
if( looks_like_number( $value ) ){
$value = 0 + $value;
}
print "$tag=$value\n";
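If the value is known to be an integer and should be stored in plain integer form, one possible follow-up is to round after the numeric check. This is only a sketch under that assumption; parse_integer_value is a hypothetical name, and the caller is expected to chomp the line first.

```perl
use strict;
use warnings;
use Scalar::Util qw( looks_like_number );

# Hypothetical helper: take a "tag=value" line and return the value
# rounded to the nearest integer (as a string), or nothing if the
# value is missing or not numeric.
sub parse_integer_value {
    my ($line) = @_;
    my (undef, $raw) = split /\s*=\s*/, $line, 2;
    return unless defined $raw && looks_like_number($raw);
    return sprintf "%.0f", $raw;
}

print parse_integer_value("value=1.04738e+06"), "\n";   # prints 1047380
```

Using sprintf "%.0f" avoids depending on how the floating-point value happens to stringify for very large exponents.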

Using Perl, how do I decode or create those %-encodings on the web?

I need to handle URI (i.e. percent) encoding and decoding in my Perl script. How do I do that?
This is a question from the official perlfaq. We're importing the perlfaq to Stack Overflow.
This is the official FAQ answer minus subsequent edits.
Those % encodings handle reserved characters in URIs, as described in RFC 2396, Section 2. This encoding replaces the reserved character with the hexadecimal representation of the character's number from the US-ASCII table. For instance, a colon, :, becomes %3A.
In CGI scripts, you don't have to worry about decoding URIs if you are using CGI.pm. You shouldn't have to process the URI yourself, either on the way in or the way out.
If you have to encode a string yourself, remember that you should never try to encode an already-composed URI. You need to escape the components separately then put them together. To encode a string, you can use the URI::Escape module. The uri_escape function returns the escaped string:
use URI::Escape;
my $original = "Colon : Hash # Percent %";
my $escaped = uri_escape( $original );
print "$escaped\n"; # 'Colon%20%3A%20Hash%20%23%20Percent%20%25'
To decode the string, use the uri_unescape function:
my $unescaped = uri_unescape( $escaped );
print $unescaped; # back to original
If you wanted to do it yourself, you simply need to replace the reserved characters with their encodings. A global substitution is one way to do it:
# encode
$string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%0x", ord $1 /eg;
#decode
$string =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;
DIY encode (improving above version):
$string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%02x", ord $1 /eg;
(note the '%02x' rather than only '%0x')
DIY decode (adding '+' -> ' '):
$string =~ s/\+/ /g; $string =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;
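Putting the DIY pieces together, a minimal round-trip sketch might look like the following. The function names are hypothetical, the input is assumed to be a byte string (encode to UTF-8 first if it holds characters above 0xFF), and unlike the FAQ pattern this version also escapes a literal ^.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode everything outside the set of
# characters the FAQ pattern leaves alone. Assumes $bytes is a byte string.
sub percent_encode {
    my ($bytes) = @_;
    (my $out = $bytes) =~ s/([^A-Za-z0-9\-_.!~*'()])/sprintf "%%%02X", ord $1/eg;
    return $out;
}

# Hypothetical helper: reverse of percent_encode.
sub percent_decode {
    my ($string) = @_;
    (my $out = $string) =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;
    return $out;
}

my $original = "Colon : Hash # Percent %";
my $escaped  = percent_encode($original);
print "$escaped\n";                       # prints Colon%20%3A%20Hash%20%23%20Percent%20%25
print percent_decode($escaped), "\n";     # prints Colon : Hash # Percent %
print percent_encode("\n"), "\n";         # prints %0A (two digits thanks to %02X)
```

The last line shows why the width matters: with a bare %0x, a newline would come out as %a, which the decoder's two-digit pattern would not match.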
Maybe this will help deciding which method to choose.
Benchmarks on perl 5.32. Every function returns the same result for the given $input.
Code:
#!/usr/bin/env perl
my $input = "ala ma 0,5 litra 40%'owej vodki :)";
use Net::Curl::Easy;
my $easy = Net::Curl::Easy->new();
use URI::Encode qw( uri_encode );
use URI::Escape qw( uri_escape );
use Benchmark qw( cmpthese );
cmpthese(-3, {
'a' => sub {
my $string = $input;
$string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%0x", ord $1 /eg;
},
'b' => sub {
my $string = $input;
$string = $easy->escape( $string );
},
'c' => sub {
my $string = $input;
$string = uri_encode( $string, {encode_reserved => 1} );
},
'd' => sub {
my $string = $input;
$string = uri_escape( $string );
},
});
And results:
        Rate      c     d     a     b
c     5618/s     --  -98%  -99% -100%
d   270517/s  4716%    --  -31%  -80%
a   393480/s  6905%   45%    --  -71%
b  1354747/s 24016%  401%  244%    --
Not surprising. A specialized C solution is the fastest. An in-place regex with no sub calls is quite fast, followed closely by a copying regex with a sub call. I didn't look into why uri_encode was so much worse than uri_escape.
Use the URI module and it will make URLs that just work.