Perl - trim UTF-8 bytes to 'length' and sanitize the data

I have a UTF-8 sequence of bytes and need to trim it to, say, 30 bytes. This may result in an incomplete sequence at the end. I need to figure out how to remove that incomplete sequence.
e.g.
$b="\x{263a}\x{263b}\x{263c}";
my $sstr;
print STDERR "length in utf8 bytes =" . length(Encode::encode_utf8($b)) . "\n";
{
use bytes;
$sstr= substr($b,0,29);
}
#After this $sstr contains "\342\230\272\342"\0
# How to remove \342 from the end

UTF-8 has some neat properties that allow us to do what you want while dealing with UTF-8 rather than characters. So first, you need UTF-8.
use Encode qw( encode_utf8 );
my $bytes = encode_utf8($str);
Now, to split between codepoints. The UTF-8 encoding of every code point will start with a byte matching 0b0xxxxxxx or 0b11xxxxxx, and you will never find those bytes in the middle of a code point. That means you want to truncate before
[\x00-\x7F\xC0-\xFF]
Together, we get:
use Encode qw( encode_utf8 );
my $max_bytes = 8;
my $str = "\x{263a}\x{263b}\x{263c}"; # ☺☻☼
my $bytes = encode_utf8($str);
$bytes =~ s/^.{0,$max_bytes}(?![^\x00-\x7F\xC0-\xFF])\K.*//s;
# $bytes contains encode_utf8("\x{263a}\x{263b}")
# instead of encode_utf8("\x{263a}\x{263b}") . "\xE2\x98"
Great, yes? Nope. The above can truncate in the middle of a grapheme. A grapheme (specifically, an "extended grapheme cluster") is what someone would perceive as a single visual unit. For example, "é" is a grapheme, but it can be encoded using two codepoints ("\x{0065}\x{0301}"). If you cut between the two code points, it would be valid UTF-8, but the "é" would become a "e"! If that's not acceptable, neither is the above solution. (Oleg's solution suffers from the same problem too.)
Unfortunately, UTF-8's properties are no longer sufficient to help us here. We'll need to grab one grapheme at a time, and add it to the output until we can't fit one.
use Encode qw( encode_utf8 );

my $max_bytes = 6;
my $str = "abcd\x{0065}\x{0301}fg"; # abcdéfg
my $bytes = '';
my $bytes_left = $max_bytes;
while ($str =~ /(\X)/g) {
    my $grapheme = $1;
    my $grapheme_bytes = encode_utf8($grapheme);
    $bytes_left -= length($grapheme_bytes);
    last if $bytes_left < 0;
    $bytes .= $grapheme_bytes;
}
# $bytes contains encode_utf8("abcd")
# instead of encode_utf8("abcde")
# or encode_utf8("abcde") . "\xCC"

First, please don't use bytes (and never assume any particular internal encoding of strings in Perl). As the documentation says: "This pragma reflects early attempts to incorporate Unicode into perl and has since been superseded <...> use of this module for anything other than debugging purposes is strongly discouraged."
To strip an incomplete sequence at the end of a line, assuming it contains octets, use Encode::decode with the Encode::FB_QUIET handling mode to stop processing once you hit an invalid sequence, and then just encode the result back:
my $valid = Encode::decode('utf8', $sstr, Encode::FB_QUIET);
$sstr = Encode::encode('utf8', $valid);
Note that if you plan to use this with other encodings in the future, not all encodings may support this handling method.
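For illustration, here is a self-contained sketch applying this to the truncated value from the question (the byte values are taken from the question's comment; note that with FB_QUIET the source variable itself is modified, keeping only the unprocessed tail):
use Encode ();

my $sstr  = "\342\230\272\342";   # "\x{263a}" followed by a dangling lead byte
my $valid = Encode::decode('utf8', $sstr, Encode::FB_QUIET);
# $valid is now "\x{263a}"; $sstr is left holding the unprocessed "\342"
my $clean = Encode::encode('utf8', $valid);   # "\342\230\272"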

Related

How can I escape a string in Perl for LDAP searching?

I want to escape a string per RFC 4515. So, the string "u1" would be transformed to "\75\31", that is, the ordinal value of each character, in hex, preceded by a backslash.
It has to be done in Perl. I already know how to do it in Python, C++, Java, etc., but Perl is baffling.
Also, I cannot use Net::LDAP and I may not be able to add any new modules, so I want to do it with basic Perl features.
Skimming through RFC 4515, this encoding escapes the individual octets of multi-byte UTF-8 characters, not codepoints. So, something that works with non-ASCII text too:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;

sub valueencode ($) {
    # Unpack format returns octets of UTF-8 encoded text
    my @bytes = unpack "U0C*", $_[0];
    sprintf '\%02x' x @bytes, @bytes;
}
say valueencode 'u1';
say valueencode "Lu\N{U+010D}i\N{U+0107}"; # Lučić, from the RFC 4515 examples
Example:
$ perl demo.pl
\75\31
\4c\75\c4\8d\69\c4\87
Or an alternative using the vector flag:
use Encode qw/encode/;

sub valueencode ($) {
    sprintf '\%*vx', "\\", encode('UTF-8', $_[0]);
}
Finally, a smarter version that only escapes ASCII characters when it has to (and multi-byte characters, even though, on a closer read of the RFC, they don't actually need to be escaped if they're valid UTF-8):
# Encode according to RFC 4515 valueencoding grammar rules:
#
# Text is UTF-8 encoded. Bytes can be escaped with the sequence
# \XX, where the X's are hex digits.
#
# The characters NUL, LPAREN, RPAREN, ASTERISK and BACKSLASH all MUST
# be escaped.
#
# Bytes > 0x7F that aren't part of a valid UTF-8 sequence MUST be
# escaped. This version assumes there are no such bytes and that the
# input is an ASCII or Unicode string.
#
# Single bytes and valid multibyte UTF-8 sequences CAN be escaped,
# with each byte escaped separately. This version escapes multibyte
# sequences, to give ASCII results.
sub valueencode ($) {
    my $encoded = "";
    for my $byte (unpack 'U0C*', $_[0]) {
        if (($byte >= 0x01 && $byte <= 0x27) ||
            ($byte >= 0x2B && $byte <= 0x5B) ||
            ($byte >= 0x5D && $byte <= 0x7F)) {
            $encoded .= chr $byte;
        } else {
            $encoded .= sprintf '\%02x', $byte;
        }
    }
    return $encoded;
}
This version returns the strings 'u1' and 'Lu\c4\8di\c4\87' from the above examples.
In short, one way is just as the question says: split the string into characters, get their ordinals, then convert them to hex; then put it all back together. I don't know of a built-in way to get the \nn format, so I'd make it 'by hand'. For instance
my $s = join '', map { sprintf '\%x', ord } split //, 'u1';
Or use the vector flag %v to treat the string as a "vector" of integers:
my $s = sprintf '\%*vx', '\\', 'u1';
With %v the string is broken up into the numerical representations of its characters, each is converted (%x), and they're joined back together with . between them. The (optional) * allows us to specify our own string to join them with instead, here \ (escaped).
This can also be done with pack + unpack; see the link below. Also see that page if there is a wide range of input characters.†
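For example, one possible sketch of that pack/unpack route (an illustration, not the code from the linked page):
# Unpack the string into octets as pairs of hex digits, then prefix each with a backslash
my $s = join '', map { '\\' . $_ } unpack '(H2)*', 'u1';   # gives '\75\31'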
See ord and sprintf, and, for more, pages like this one.
† If there is non-ASCII input then you may need to encode it so as to get octets, if those are what need escaping (and not whole codepoints):
use Encode qw(encode);
my $s = sprintf '\%*vx', '\\', encode('UTF-8', $input);
See the linked page for more.

Encode::Guess::guess_encoding gives different results in different contexts

I have the following sub that opens a text file and attempts to ensure its encoding is one of UTF-8, ISO-8859-15 or ASCII.
The problem I have with it is different behaviours in interactive vs. non-interactive use.
When I run interactively with a file that contains a UTF-8 line, $decoder is, as expected, a reference object whose name method returns utf8 for that line.
Non-interactively (as it runs as part of a Subversion commit hook), guess_encoding returns a scalar string of value "utf8 or iso-8859-15" for the utf8 check line, and "iso-8859-15 or utf8" for the other two lines.
I can't for the life of me, work out where the difference in behaviour is coming from. If I force the encoding of the open to say <:encoding(utf8), it accepts every line as UTF-8 without question.
The problem is I can't assume that every file it receives will be UTF-8, so I don't want to force the encoding as a work-around. Another potential workaround is to parse the scalar text, but that just seems messy, especially when it seems to work correctly in an interactive context.
From the shell, I've tried overriding $LANG (as non-interactively that isn't set, nor are any of the LC_ variables), however the interactive version still runs correctly.
The commented out line that reports $Encode::Guess::NoUTFAutoGuess returns 0 in both interactive and non-interactive use when commented in.
Ultimately, the one thing we're trying to prevent is having UTF-16 or other wide-char encodings in our repository (as some of our tooling doesn't play well with it): I thought that looking for a white-list of encodings is an easier job than looking for a black-list of encodings.
sub checkEncoding
{
    my ($file) = @_;
    my ($b1, $b2, $b3);
    my $encoding = "";
    my $retval = 1;
    my $line = 0;
    say("Checking encoding of $file");
    #say($Encode::Guess::NoUTFAutoGuess);
    open(GREPFILE, "<", $file);
    while (<GREPFILE>) {
        chomp($_);
        $line++;
        my $decoder = Encode::Guess::guess_encoding($_, 'utf8');
        say("A: $decoder");
        $decoder = Encode::Guess::guess_encoding($_, 'iso-8859-15') unless ref $decoder;
        say("B: $decoder");
        $decoder = Encode::Guess::guess_encoding($_, 'ascii') unless ref $decoder;
        say("C: $decoder");
        if (ref $decoder) {
            $encoding = $decoder->name;
        } else {
            say "Mis-identified encoding '$decoder' on line $line: [$_]";
            my $z = unpack('H*', $_);
            say $z;
            $encoding = $decoder;
            $retval = 0;
        }
        last if ($retval == 0);
    }
    close GREPFILE;
    return $retval;
}
No need to guess. For the specific options of UTF-8, ISO-8859-1 and US-ASCII, you can use Encoding::FixLatin's fix_latin. It's virtually guaranteed to succeed.
That said, the OP asks about ISO-8859-15, while fix_latin targets ISO-8859-1; the two differ in only a handful of code points.
The method used by fix_latin would work just as well for ISO-8859-15 as it does for ISO-8859-1. It's simply a question of replacing _init_byte_map with the following:
sub _init_byte_map {
    foreach my $i (0x80..0xFF) {
        my $byte = chr($i);
        my $utf8 = $byte;
        Encode::from_to($utf8, 'iso-8859-15', 'UTF-8');  # from_to converts its first argument in place
        $byte_map->{$byte} = $utf8;
    }
}
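For reference, the module's basic interface (roughly as in its synopsis) is a single function: it takes a byte string that may mix UTF-8, ISO-8859-1 and US-ASCII, and returns the text uniformly as UTF-8:
use Encoding::FixLatin qw( fix_latin );

# $mixed_bytes: octets in UTF-8, ISO-8859-1 or US-ASCII (possibly mixed)
my $utf8_text = fix_latin($mixed_bytes);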
Alternatively, if you're willing to assume the data is all of one encoding or another (rather than a mix), you could also use the following approach:
use Encode qw( decode );

my $text;
if (!eval {
    $text = decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1  # No exception
}) {
    $text = decode("ISO-8859-15", $bytes);
}
Keep in mind that US-ASCII is a proper subset of both UTF-8 and ISO-8859-15, so it doesn't need to be handled specially.
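In the context of the OP's commit hook, $bytes above would simply be the file content slurped in raw mode; a minimal sketch (the file handling is illustrative, not part of the answer's code):
open my $fh, '<:raw', $file or die "Can't open '$file': $!";
my $bytes = do { local $/; <$fh> };   # slurp the raw octets, no implicit decoding
close $fh;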

performance issue with substr on a very long UTF-8 string

I am using substr on a very long UTF-8 string (~250,000,000 characters).
The thing is, my program almost freezes around the 200,000,000th character.
Does somebody know about this issue? What are my options?
As I am indexing a document using a suffix array, I need:
to keep my string in one piece;
to access variable length substrings using an index.
As for an MWE:
use strict;
use warnings;
use utf8;

my $text = 'あいうえお' x 50000000;

for ( my $i = 0; $i < length($text); $i++ ) {
    print "\r$i";
    my $char = substr($text, $i, 1);
}
print "\n";
Perl has two string storage formats: one that's capable of storing 8-bit characters, and one capable of storing 72-bit characters (practically limited to 32 or 64 bits). Your string necessarily uses the latter format. This wide-character format uses a variable number of bytes per character, just as UTF-8 does.
Finding the ith element of a string in the first format is trivial: Add the offset to the string pointer. With the second format, finding the ith character requires scanning the string from the beginning, just like you would have to scan a file from the beginning to find the nth line. There is a mechanism that caches information about the string as it's discovered, but it's not perfect.
The problem goes away if you use a fixed number of bytes per character.
use utf8;
use Encode qw( encode );

my $text = 'あいうえお' x 50000000;
my $packed = encode('UCS-4le', $text);

for my $i (0 .. length($packed)/4 - 1) {
    print "\r$i";
    my $char = chr(unpack('V', substr($packed, $i*4, 4)));
}
print "\n";
Note that the string will use 33% more memory for hiragana characters. Or maybe not, since there's no cache anymore.
I suggest that you use a regular expression instead of substr.
Benchmarking these two methods shows that a regex is nearly 100 times faster:
use strict;
use warnings;
use utf8;

my $text = 'あいうえお' x 50_000;

sub mysubstr {
    for ( my $i = 0; $i < length($text); $i++ ) {
        my $char = substr($text, $i, 1);
    }
}

sub myregex {
    while ($text =~ /(.)/g) {
        my $char = $1;
    }
}

use Benchmark qw(:all);

timethese(10, {
    'substr' => \&mysubstr,
    'regex'  => \&myregex,
});
Outputs:
Benchmark: timing 10 iterations of regex, substr...
regex: 2 wallclock secs ( 2.18 usr + 0.00 sys = 2.18 CPU) @ 4.58/s (n=10)
substr: 198 wallclock secs (184.66 usr + 0.16 sys = 184.81 CPU) @ 0.05/s (n=10)
It is a known issue, documented under Speed in perlunicode (as of Perl 5.20.0):
http://perldoc.perl.org/perlunicode.html#Speed
The most important part is the first paragraph of my quote:
Speed
Some functions are slower when working on UTF-8 encoded strings than on byte encoded strings. All functions that need to hop over characters such as length(), substr() or index(), or matching regular expressions can work much faster when the underlying data are byte-encoded.
In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1 a caching scheme was introduced which will hopefully make the slowness somewhat less spectacular, at least for some operations. In general, operations with UTF-8 encoded strings are still slower. As an example, the Unicode properties (character classes) like \p{Nd} are known to be quite a bit slower (5-20 times) than their simpler counterparts like \d (then again, there are hundreds of Unicode characters matching Nd compared with the 10 ASCII characters matching d ).
The easiest way to avoid it is to use byte strings instead of Unicode strings.
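As a sketch of what that means in practice (the values here are only illustrative, and any byte offsets used are assumed to fall on character boundaries):
use utf8;
use Encode qw( encode_utf8 );

my $text  = 'あいうえお' x 3;
my $bytes = encode_utf8($text);    # a byte string: substr on it is O(1) again

# Offsets and lengths are now byte counts (each hiragana character is
# 3 bytes in UTF-8), so they must land on character boundaries.
my $chunk = substr($bytes, 3, 6);  # the UTF-8 octets of "いう"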
In your particular sample, you can just remove characters from the beginning of the $text string as they are processed in order to avoid the linear lookup:
use utf8;
use Encode qw( encode );

$| = 1;

my $text = 'あいうえお' x 50000000;
while ($text ne '') {
    print ".";
    my $char = substr($text, 0, 1, '');  # four-argument substr removes the first character
}
print "\n";

Perl substr based on bytes

I'm using SimpleDB for my application. Everything goes well except that the limit for one attribute is 1024 bytes. So for a long string I have to chop it into chunks and save them.
My problem is that sometimes my string contains Unicode characters (Chinese, Japanese, Greek) and the substr() function is based on character count, not bytes.
I tried to use the bytes pragma for byte semantics, and later
substr(encode_utf8($str), $start, $length), but it does not help at all.
Any help would be appreciated.
UTF-8 was engineered so that character boundaries are easy to detect. To split the string into chunks of valid UTF-8, you can simply use the following:
my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;
Then either
# The saving code expects bytes.
store($_) for @utf8_chunks;
or
# The saving code expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;
Demonstration:
$ perl -e'
    use Encode qw( encode_utf8 );

    # This character encodes to three bytes using UTF-8.
    my $text = "\N{U+2660}" x 342;

    my $utf8 = encode_utf8($text);
    my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;

    CORE::say(length($_)) for @utf8_chunks;
'
1023
3
substr operates on 1-byte characters unless the string has the UTF-8 flag on. So this will give you the first 1024 bytes of a decoded string:
substr encode_utf8($str), 0, 1024;
although, not necessarily splitting the string on character boundaries. To discard any split characters at the end, you can use:
$str = decode_utf8($str, Encode::FB_QUIET);
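Putting the two steps together, a small sketch of a chunking helper along those lines (the name first_chunk is made up for illustration):
use Encode qw( encode_utf8 decode_utf8 );

# Return the longest leading piece of $str whose UTF-8 encoding fits in $max
# bytes, dropping any partial character introduced by the byte-level substr.
sub first_chunk {
    my ($str, $max) = @_;
    my $bytes = substr(encode_utf8($str), 0, $max);
    return decode_utf8($bytes, Encode::FB_QUIET);
}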

Perl - Unicode::String sub need to add/convert for Latin-9 support

Part 3 (Part 2 is here) (Part 1 is here)
Here is the Perl module I'm using: Unicode::String
How I'm calling it:
print "Euro: ";
print unicode_encode("€")."\n";
print "Pound: ";
print unicode_encode("£")."\n";
I would like it to return this format:
&#x20ac; # Euro
&#x00a3; # Pound
The function is below:
sub unicode_encode {
    shift() if ref( $_[0] );
    my $toencode = shift();
    return undef unless defined($toencode);

    print "Passed: " . $toencode . "\n";

    Unicode::String->stringify_as("utf8");
    my $unicode_str = Unicode::String->new();
    my $text_str = "";
    my $pack_str = "";

    # encode Perl UTF-8 string into latin1 Unicode::String
    # - currently only Basic Latin and Latin 1 Supplement
    #   are supported here due to issues with Unicode::String .
    $unicode_str->latin1($toencode);
    print "Latin 1: " . $unicode_str . "\n";

    # Convert to hex format ("U+XXXX U+XXXX ")
    $text_str = $unicode_str->hex;

    # Now, the interesting part.
    # We must search for the (now hex-encoded)
    # Unicode escape sequence.
    my $pattern =
        'U\+005[C|c] U\+0058 U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f])';

    # Replace escapes with entities (beginning of string)
    $_ = $text_str;
    if (/^$pattern/) {
        $pack_str = pack "H8", "$1$2$3$4$5$6$7$8";
        $text_str =~ s/^$pattern/\&#x$pack_str/;
    }

    # Replace escapes with entities (middle of string)
    $_ = $text_str;
    while (/ $pattern/) {
        $pack_str = pack "H8", "$1$2$3$4$5$6$7$8";
        $text_str =~ s/ $pattern/\;\&#x$pack_str/;
        $_ = $text_str;
    }

    # Replace "U+" with "&#x" (beginning of string)
    $text_str =~ s/^U\+/&#x/;

    # Replace " U+" with ";&#x" (middle of string)
    $text_str =~ s/ U\+/;&#x/g;

    # Append ";" to end of string to close last entity.
    # This last ";" at the end of the string isn't necessary in most parsers.
    # However, it is included anyways to ensure full compatibility.
    if ( $text_str ne "" ) {
        $text_str .= ';';
    }

    return $text_str;
}
I need to get the same output but need to support Latin-9 characters as well; however, Unicode::String is limited to Latin-1. Any thoughts on how I can get around this?
I have a couple of other questions and think I have some understanding of Unicode and encodings, but I'm pressed for time as well.
Thanks to anyone who helps me out!
As you have been told already, Unicode::String is not an appropriate choice of module. Perl ships with a module called 'Encode' which can do everything you need.
If you have a character string in Perl like this:
my $euro = "\x{20ac}";
You can convert it to a string of bytes in Latin-9 like this:
use Encode qw(encode);
my $bytes = encode("iso8859-15", $euro);
The $bytes variable will now contain \xA4.
Or you can have Perl automatically convert it on output to a filehandle like this:
binmode(STDOUT, ":encoding(iso8859-15)");
You can refer to the documentation for the Encode module. Also, PerlIO describes the encoding layer.
I know you are determined to ignore this final piece of advice, but I'll offer it one last time: Latin-9 is a legacy encoding. Perl can quite happily read Latin-9 data and convert it to UTF-8 on the fly (using binmode). You should not be writing more software that generates Latin-9 data; you should be migrating away from it.
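A minimal sketch of that migration path, reading legacy Latin-9 data and writing it back out as UTF-8 (file names are placeholders):
open my $in,  '<:encoding(iso-8859-15)', 'legacy.txt'   or die $!;
open my $out, '>:encoding(UTF-8)',       'utf8-out.txt' or die $!;
print {$out} $_ while <$in>;
close $_ for $in, $out;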