How do I find the length of a Unicode string in Perl? - perl

The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes pragma's page echoes this.
use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';
print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";
The output of this script, however, disagrees with the manpage:
ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35
It seems to me length() and bytes::length() return the same for both ASCII & Unicode strings. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode—does that mean length() automatically handles Unicode strings properly?
Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example - it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) of an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.
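For reference, the robust way to get a byte count for something like a Content-Length header is to encode the character string explicitly and measure the result, rather than relying on how Perl happens to store it. A minimal sketch (the use of Encode here is illustrative, not part of the original question):
use utf8;                                      # this source file is UTF-8
use Encode qw( encode );
my $body   = 'Lørëm ípsüm dölör sît åmét';     # a character string
my $octets = encode('UTF-8', $body);           # the bytes that go on the wire
print 'Content-Length: ', length($octets), "\r\n";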

If your scripts are encoded in UTF-8, then please use the utf8 pragma. The bytes pragma on the other hand will force byte semantics on length, even if the string is UTF-8. Both work in the current lexical scope.
$ascii = 'Lorem ipsum dolor sit amet';
{
    use utf8;
    $unicode = 'Lørëm ípsüm dölör sît åmét';
}
$not_unicode = 'Lørëm ípsüm dölör sît åmét';
no bytes; # default, can be omitted
print "Character semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
print "----\n";
use bytes;
print "Byte semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
This outputs:
Character semantics:
ASCII: 26
Unicode: 26
Not-Unicode: 35
----
Byte semantics:
ASCII: 26
Unicode: 35
Not-Unicode: 35

The purpose of the bytes pragma is to replace the length function (and several other string related functions) in the current scope. So every call to length in your program is a call to the length that bytes provides. This is more in line with what you were trying to do:
#!/usr/bin/perl
use strict;
use warnings;
sub bytes($) {
    use bytes;
    return length shift;
}
my $ascii = "foo"; #really UTF-8, but everything is in the ASCII range
my $utf8 = "\x{24d5}\x{24de}\x{24de}";
print "[$ascii] characters: ", length $ascii, "\n",
"[$ascii] bytes : ", bytes $ascii, "\n",
"[$utf8] characters: ", length $utf8, "\n",
"[$utf8] bytes : ", bytes $utf8, "\n";
Another subtle flaw in your reasoning is the idea that there is such a thing as "Unicode bytes". Unicode is an enumeration of characters. It says, for instance, that U+24D5 is ⓕ (CIRCLED LATIN SMALL LETTER F). What Unicode does not specify is how many bytes a character takes up; that is left to the encodings. UTF-8 says it takes up 3 bytes, UTF-16 says 2 bytes, UTF-32 says 4 bytes, and so on. Here is a comparison of Unicode encodings. Perl uses UTF-8 internally for its strings. UTF-8 has the benefit of being identical to ASCII for the first 128 characters.
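To see that concretely, here is a small sketch (the Encode calls are illustrative, not part of the answer above) that measures the same character under several encodings:
use Encode qw( encode );
my $char = "\x{24d5}";                                              # CIRCLED LATIN SMALL LETTER F
printf "UTF-8:    %d bytes\n", length encode('UTF-8',    $char);    # 3
printf "UTF-16BE: %d bytes\n", length encode('UTF-16BE', $char);    # 2
printf "UTF-32BE: %d bytes\n", length encode('UTF-32BE', $char);    # 4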

I found that it is possible to use the Encode module to influence how length works.
If $string is a UTF-8 encoded string:
Encode::_utf8_on($string); # the length function will show number of code points after this.
Encode::_utf8_off($string); # the length function will show number of bytes in the string after this.
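Note that _utf8_on and _utf8_off only flip Perl's internal flag without converting any data, and the Encode documentation marks them as internal, so they are best kept to debugging. A rough sketch of the effect, together with the supported Encode::decode route (the example string is not from the answer above):
use Encode ();
my $string = "\xC3\xA9";                        # the two UTF-8 octets for "é"
print length($string), "\n";                    # 2 - counted as bytes
Encode::_utf8_on($string);                      # flag flipped, data unchanged (internal API)
print length($string), "\n";                    # 1 - counted as one code point
Encode::_utf8_off($string);
print length($string), "\n";                    # 2 again
my $chars = Encode::decode('UTF-8', $string);   # the supported way to get characters
print length($chars), "\n";                     # 1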

There’s a fair bit of problematic commentary here.
Perl doesn’t know—and doesn’t care—which strings are “Unicode” and which aren’t. All it knows is the code points that make up the string.
Peeking at Perl’s internal UTF8 flag indicates you likely have the wrong idea about Perl strings. A “UTF-8 encoded string”—that is, the result of an encode operation like utf8::encode—usually does NOT have that flag set, for example.
There are some interfaces where that abstraction leaks, and strings with the internal UTF8 flag set DO behave differently from the same set of code points without that flag (that is, after utf8::downgrade). It’s unwise to rely on these behaviours since Perl’s own maintainers regard them as bugs. Most are fixed by the “unicode_strings” and “unicode_eval” features, and the rest by Sys::Binmode from CPAN.
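A classic example of that leaky abstraction (a sketch, not taken from the commentary above): without the unicode_strings feature, case-mapping a string that happens to lack the internal UTF8 flag falls back to byte semantics.
my $s = "\xE9";                                        # U+00E9, stored without the UTF8 flag
{
    no feature 'unicode_strings';
    printf "byte semantics:    U+%04X\n", ord uc $s;   # U+00E9 - uc() changed nothing
}
{
    use feature 'unicode_strings';
    printf "unicode semantics: U+%04X\n", ord uc $s;   # U+00C9 - proper uppercase
}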

Related

how to print the result from pack function?

I'd like to verify what pack does. I have the following code to give it a try.
$bits = pack 'N','134744072';
How do I print the bits?
I did the following:
printf ("bits = %032b \n", $bits);
but it does not work.
Thanks !!
If you want the binary representation of a number, use
my $num = 134744072;
printf("bits = %032b\n", $num);
If you want the binary representation of a string of bytes, use
my $bytes = pack('N', 134744072);
printf("bits = %s\n", unpack('B*', $bytes));
The Devel::Peek module (which comes with Perl) allows you to examine Perl's representation of the variable. This is probably more useful than just a raw print when you're dealing with binary data rather than printable character strings.
#!/usr/bin/perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
my $bits = pack 'N','134744072';
Dump($bits);
Which produces output like this:
SV = PV(0xaedb20) at 0xb15650
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0xb06630 "\10\10\10\10"\0
CUR = 4
LEN = 10
The 'SV' at the beginning indicates that this is a dump of a 'scalar value' (as opposed to say an array or a hash value).
The 'SV = PV' indicates that this scalar contains a string of bytes (as opposed to say an integer or floating point value).
The 'PV = 0xb06630' is the pointer to where those bytes are located.
The "\10\10\10\10"\0 is probably the bit you're interested in. The double quoted string represents the bytes making up the contents of this string.
Inside the string, you would typically see the bytes interpreted as if they were ASCII, so the byte 65 decimal would appear as 'A'. All non-printable characters are displayed in octal with a preceding \.
So your $bits variable contains 4 bytes, each octal '10' which is hex 0x08.
The LEN and CUR are telling you that Perl allocated 10 bytes of storage and is currently using 4 of them (so length($bits) would return 4).
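A quick way to confirm those numbers from ordinary Perl (a small sketch using the $bits variable from the snippet above):
printf "length: %d\n", length($bits);        # 4, matching CUR
printf "hex:    %s\n", unpack('H*', $bits);  # 08080808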

Reading unicode chars on the byte level

Suppose I wanted to detect Unicode characters and encode them using \u notation. If I had to use a byte array, are there simple rules I can follow to detect groups of bytes that belong to a single character?
I am referring to UTF-8 bytes that need to be encoded for an ASCII-only receiver. At the moment, non-printable-ASCII characters are simply stripped: s/[^\x20-\x7e\r\n\t]//g.
I want to improve this functionality to write \u0000 notation instead.
You need to have Unicode characters, so start by decoding your byte array.
use Encode qw( decode );
my $decoded_text = decode("UTF-8", $encoded_text);
Only then can you escape Unicode characters.
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
For example,
$ perl -CSDA -MEncode=decode -E'
my $encoded_text = "\xC3\x89\x72\x69\x63\x20\xE2\x99\xA5\x20\x50\x65\x72\x6c";
my $decoded_text = decode("UTF-8", $encoded_text);
say $decoded_text;
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
say $escaped_text;
'
Éric ♥ Perl
\u00C9ric \u2665 Perl

perl- trim utf8 bytes to 'length' and sanitize the data

I have a UTF-8 sequence of bytes and need to trim it to, say, 30 bytes. This may result in an incomplete sequence at the end, and I need to figure out how to remove that incomplete sequence.
e.g.
$b="\x{263a}\x{263b}\x{263c}";
my $sstr;
print STDERR "length in utf8 bytes =" . length(Encode::encode_utf8($b)) . "\n";
{
    use bytes;
    $sstr = substr($b, 0, 29);
}
#After this $sstr contains "\342\230\272\342"\0
# How to remove \342 from the end
UTF-8 has some neat properties that allow us to do what you want while dealing with UTF-8 rather than characters. So first, you need UTF-8.
use Encode qw( encode_utf8 );
my $bytes = encode_utf8($str);
Now, to split between codepoints. The UTF-8 encoding of every code point will start with a byte matching 0b0xxxxxxx or 0b11xxxxxx, and you will never find those bytes in the middle of a code point. That means you want to truncate before
[\x00-\x7F\xC0-\xFF]
Together, we get:
use Encode qw( encode_utf8 );
my $max_bytes = 8;
my $str = "\x{263a}\x{263b}\x{263c}"; # ☺☻☼
my $bytes = encode_utf8($str);
$bytes =~ s/^.{0,$max_bytes}(?![^\x00-\x7F\xC0-\xFF])\K.*//s;
# $bytes contains encode_utf8("\x{263a}\x{263b}")
# instead of encode_utf8("\x{263a}\x{263b}") . "\xE2\x98"
Great, yes? Nope. The above can truncate in the middle of a grapheme. A grapheme (specifically, an "extended grapheme cluster") is what someone would perceive as a single visual unit. For example, "é" is a grapheme, but it can be encoded using two code points ("\x{0065}\x{0301}"). If you cut between the two code points, it would be valid UTF-8, but the "é" would become an "e"! If that's not acceptable, neither is the above solution. (Oleg's solution suffers from the same problem.)
Unfortunately, UTF-8's properties are no longer sufficient to help us here. We'll need to grab one grapheme at a time, and add it to the output until we can't fit one.
use Encode qw( encode_utf8 );
my $max_bytes = 6;
my $str = "abcd\x{0065}\x{0301}fg"; # abcdéfg
my $bytes = '';
my $bytes_left = $max_bytes;
while ($str =~ /(\X)/g) {
    my $grapheme = $1;
    my $grapheme_bytes = encode_utf8($grapheme);
    $bytes_left -= length($grapheme_bytes);
    last if $bytes_left < 0;
    $bytes .= $grapheme_bytes;
}
# $bytes contains encode_utf8("abcd")
# instead of encode_utf8("abcde")
# or encode_utf8("abcde") . "\xCC"
First, please don't use bytes (and never make assumptions about Perl's internal string encoding). As the documentation says: "This pragma reflects early attempts to incorporate Unicode into perl and has since been superseded <...> use of this module for anything other than debugging purposes is strongly discouraged."
To strip incomplete sequence at end of line, assuming it contains octets, use Encode::decode's Encode::FB_QUIET handling mode to stop processing once you hit invalid sequence and then just encode result back:
my $valid = Encode::decode('utf8', $sstr, Encode::FB_QUIET);
$sstr = Encode::encode('utf8', $valid);
Note that if you plan to use it with another encoding in future, not all of encodings may support this handling method.
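Putting that together with the question's example, a runnable sketch might look like this (the variable names and the 8-byte cut-off are just illustrative):
use strict;
use warnings;
use Encode qw( encode decode );
my $str     = "\x{263a}\x{263b}\x{263c}";                   # three characters, 3 UTF-8 bytes each
my $bytes   = encode('utf8', $str);
my $chopped = substr($bytes, 0, 8);                         # cuts into the middle of the last character
my $valid   = decode('utf8', $chopped, Encode::FB_QUIET);   # stops at the broken tail
my $clean   = encode('utf8', $valid);
printf "%d bytes kept: %s\n", length($clean), unpack('H*', $clean);
# prints: 6 bytes kept: e298bae298bb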

Perl substr based on bytes

I'm using SimpleDB for my application. Everything goes well except that one attribute is limited to 1024 bytes, so for a long string I have to chop it into chunks and save them.
My problem is that sometimes my string contains Unicode characters (Chinese, Japanese, Greek) and the substr() function counts characters, not bytes.
I tried use bytes for byte semantics, and later
substr(encode_utf8($str), $start, $length) but it does not help at all.
Any help would be appreciated.
UTF-8 was engineered so that character boundaries are easy to detect. To split the string into chunks of valid UTF-8, you can simply use the following:
my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;
Then either
# The saving code expects bytes.
store($_) for @utf8_chunks;
or
# The saving code expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;
Demonstration:
$ perl -e'
use Encode qw( encode_utf8 );
# This character encodes to three bytes using UTF-8.
my $text = "\N{U+2660}" x 342;
my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;
CORE::say(length($_)) for @utf8_chunks;
'
1023
3
substr operates on one-byte characters unless the string has the UTF-8 flag on, so this will give you the first 1024 bytes of the UTF-8 encoding of a decoded string:
substr encode_utf8($str), 0, 1024;
although it will not necessarily split the string on a character boundary. To discard any partial character at the end, you can use:
$str = decode_utf8($str, Encode::FB_QUIET);
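Combining the two fragments into something runnable (a sketch; the 1024-byte limit comes from the question, the sample string is just illustrative):
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );
my $str   = "\x{4e2d}\x{6587}" x 200;                  # 400 characters, 1200 UTF-8 bytes
my $chunk = substr(encode_utf8($str), 0, 1024);        # first 1024 bytes, may split a character
my $chars = decode_utf8($chunk, Encode::FB_QUIET);     # drops the trailing partial character
my $clean = encode_utf8($chars);                       # complete characters only
print length($clean), " bytes\n";                      # 1023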

Perl: utf8::decode vs. Encode::decode

I am having some interesting results trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in an error "Cannot decode string with wide characters at..." whereas the latter method will happily run as many times as you want, simply returning false.
What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" utf8 text from an outside file. To demonstrate this issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00e8, U+00ab, U+0086, U+000a. These Unicode characters are the double-encoding of the Unicode character U+8acb, along with a newline character. The file was encoded to disk in UTF8. I then run the following perl script:
#!/usr/bin/perl
use strict;
use warnings;
require "Encode.pm";
require "utf8.pm";
open FILE, "test.txt" or die $!;
my @lines = <FILE>;
my $test = $lines[0];
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";
This gives the following output:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
This is what I would expect. The length is originally 7 because perl thinks that $test is just a series of bytes. After decoding once, perl knows that $test is a series of characters that are utf8-encoded (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect since Encode::decode took the 4 code points and interpreted them as utf8-encoded bytes, resulting in 2 characters. The strange thing is when I modify the code to call utf8::decode instead (replace all $test = Encode::decode("utf8", $test); with utf8::decode($test))
This gives almost identical output, only the result of length differs:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?
Thanks, Matt
You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.
Always use the Encode module, and also see the question Checklist for going the Unicode way with Perl. unpack is too low-level, it does not even give you error-checking.
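To illustrate the error-checking point (a sketch, not part of the original answer): Encode::decode with FB_CROAK rejects malformed input, while unpack 'U*' quietly hands back whatever values it finds.
use Encode qw( decode );
my $octets = "\xE8\xAB";                            # an incomplete UTF-8 sequence
my @cps = unpack('U*', $octets);                    # no validation: just 232 171
print "unpack gave: @cps\n";
my $chars = eval { decode('UTF-8', $octets, Encode::FB_CROAK) };
print $@ ? "decode croaked\n" : "decoded ok\n";     # croaks: the sequence is malformed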
You are going wrong with the assumption that the octets E8 AB 86 0A are the result of UTF-8 double-encoding the characters 諆 and newline. This is the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.
length is inappropriately overloaded; at certain times it determines the length in characters, at others the length in octets. Use better tools such as Devel::Peek.
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);
my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets without implicit decoding from a file, does not matter
Dump $test;
# FLAGS = (PADMY,POK,pPOK)
# PV = 0x8d8520 "\350\253\206\n"\0
$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
Turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.