I'm using SimpleDB for my application. Everything works well except that a single attribute is limited to 1024 bytes, so I have to chop a long string into chunks before saving it.
My problem is that my strings sometimes contain Unicode characters (Chinese, Japanese, Greek), and the substr() function counts characters, not bytes.
I tried use bytes for byte semantics, and later
substr(encode_utf8($str), $start, $length), but neither helped.
Any help would be appreciated.
UTF-8 was engineered so that character boundaries are easy to detect. To split the string into chunks of valid UTF-8, you can simply use the following:
my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;
Then either
# The saving code expects bytes.
store($_) for @utf8_chunks;
or
# The saving code expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;
Demonstration:
$ perl -e'
use Encode qw( encode_utf8 );
# This character encodes to three bytes using UTF-8.
my $text = "\N{U+2660}" x 342;
my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;
CORE::say(length($_)) for @utf8_chunks;
'
1023
3
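As a quick check that nothing is lost, the chunks can be joined and decoded again. This is a minimal sketch, not part of the answer above: the helper name utf8_chunks and the sample text are made up for illustration, and the chunk size is assumed to be at least 4 bytes (the longest UTF-8 sequence) so that the pattern can always make progress.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical helper: split decoded text into UTF-8 byte chunks of at
# most $max bytes, never cutting inside a multi-byte character.
sub utf8_chunks {
    my ($text, $max) = @_;
    my $utf8 = encode_utf8($text);
    return $utf8 =~ /\G(.{1,$max})(?![\x80-\xBF])/sg;
}

# Greek letters, two bytes each in UTF-8: 1500 characters, 3000 bytes.
my $text   = "\x{3B1}\x{3B2}\x{3B3}" x 500;
my @chunks = utf8_chunks($text, 1024);

# Joining the byte chunks and decoding gives back the original text.
my $joined = decode_utf8(join '', @chunks);

print join(',', map { length } @chunks), "\n";   # 1024,1024,952
print $joined eq $text ? "round trip ok\n" : "round trip FAILED\n";
```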
substr operates on 1-byte characters unless the string has the UTF-8 flag on, and the result of encode_utf8 never has it. So this will give you the first 1024 bytes of the UTF-8 encoding of a decoded string:
substr encode_utf8($str), 0, 1024;
although the cut will not necessarily fall on a character boundary. To discard a partial character at the end, decode the bytes with FB_QUIET:
$str = decode_utf8($str, Encode::FB_QUIET);
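Put together, the two steps might look like the following sketch. The variable names and the sample text are assumptions for illustration. Note that with a CHECK value such as FB_QUIET, Encode overwrites the source variable with whatever it could not process, so the bytes are kept in their own variable here.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# 400 three-byte characters: 1200 bytes once encoded.
my $str = "\x{2660}" x 400;

# First 1024 bytes: 341 whole characters plus one dangling lead byte.
my $bytes = substr(encode_utf8($str), 0, 1024);

# FB_QUIET makes decoding stop quietly at the incomplete sequence and
# return what was decoded so far; $bytes is left holding the remainder.
my $head = decode_utf8($bytes, Encode::FB_QUIET);

print length($head), " characters kept\n";   # 341 characters kept
```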
I want to escape a string, per RFC 4515. So, the string "u1" would be transformed to "\75\31", that is, the ordinal value of each character, in hex, preceded by backslash.
It has to be done in Perl. I already know how to do it in Python, C++, Java, etc., but Perl is baffling.
Also, I cannot use Net::LDAP and I may not be able to add any new modules, so I want to do it with basic Perl features.
Skimming through RFC 4515, this encoding escapes the individual octets of multi-byte UTF-8 characters, not codepoints. So, something that works with non-ASCII text too:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
sub valueencode ($) {
# Unpack format returns octets of UTF-8 encoded text
my @bytes = unpack "U0C*", $_[0];
sprintf '\%02x' x @bytes, @bytes;
}
say valueencode 'u1';
say valueencode "Lu\N{U+010D}i\N{U+0107}"; # Lučić, from the RFC 4515 examples
Example:
$ perl demo.pl
\75\31
\4c\75\c4\8d\69\c4\87
Or an alternative using the vector flag:
use Encode qw/encode/;
sub valueencode ($) {
# The 0 flag and width 2 pad each octet to two hex digits (e.g. \0a, not \a).
sprintf '\%0*v2x', "\\", encode('UTF-8', $_[0]);
}
Finally, a smarter version that only escapes ASCII characters when it has to (and still escapes multi-byte characters, even though on a closer read of the RFC they don't actually need to be escaped if they're valid UTF-8):
# Encode according to RFC 4515 valueencoding grammar rules:
#
# Text is UTF-8 encoded. Bytes can be escaped with the sequence
# \XX, where the X's are hex digits.
#
# The characters NUL, LPAREN, RPAREN, ASTERISK and BACKSLASH all MUST
# be escaped.
#
# Bytes > 0x7F that aren't part of a valid UTF-8 sequence MUST be
# escaped. This version assumes there are no such bytes and that input
# is an ASCII or Unicode string.
#
# Single bytes and valid multibyte UTF-8 sequences CAN be escaped,
# with each byte escaped separately. This version escapes multibyte
# sequences, to give ASCII results.
sub valueencode ($) {
my $encoded = "";
for my $byte (unpack 'U0C*', $_[0]) {
if (($byte >= 0x01 && $byte <= 0x27) ||
($byte >= 0x2B && $byte <= 0x5B) ||
($byte >= 0x5D && $byte <= 0x7F)) {
$encoded .= chr $byte;
} else {
$encoded .= sprintf '\%02x', $byte;
}
}
return $encoded;
}
This version returns the strings 'u1' and 'Lu\c4\8di\c4\87' from the above examples.
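The same rules can also be written as a single substitution over the encoded octets. This is only a sketch of an equivalent approach, not code from the answer: the name escape_filter_value is made up, and it assumes the input is a decoded (or plain ASCII) string.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical variant: escape NUL, "(", ")", "*", "\" and every octet
# above 0x7F as \XX; all other ASCII octets pass through unchanged.
sub escape_filter_value {
    my $bytes = encode('UTF-8', $_[0]);
    $bytes =~ s/([\x00()*\\\x80-\xFF])/sprintf '\\%02x', ord $1/ge;
    return $bytes;
}

print escape_filter_value('u1'), "\n";                    # u1
print escape_filter_value("Lu\x{10D}i\x{107}"), "\n";     # Lu\c4\8di\c4\87
print escape_filter_value('a*(b)\\'), "\n";               # a\2a\28b\29\5c
```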
In short, one way is just as the question says: split the string into characters, get their ordinals, convert those to hex, then put it all back together. I don't know of a built-in way to get the \nn format so I'd make it 'by hand'. For instance
my $s = join '', map { sprintf '\%02x', ord } split //, 'u1';
Or use the vector flag %v to treat the string as a "vector" of integers
my $s = sprintf '\%*vx', '\\', 'u1';
With %v the string is broken up into the numeric values of its characters, each is converted (%x), and they're joined back together with . between them. The (optional) * lets us specify the string used to join them instead, here \ (escaped).
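To see what each piece of the vector format contributes, here is a small sketch (the variable names are only for illustration) that builds it up step by step for the string 'u1':

```perl
use strict;
use warnings;

my $dec    = sprintf '%vd',    'u1';         # ordinals joined by "."
my $hex    = sprintf '%vx',    'u1';         # same, in hex
my $colon  = sprintf '%*vx',   ':',  'u1';   # custom join string
my $escape = sprintf '\%*vx',  '\\', 'u1';   # literal "\" first, then "\" as join

print "$dec\n";      # 117.49
print "$hex\n";      # 75.31
print "$colon\n";    # 75:31
print "$escape\n";   # \75\31
```

The leading backslash in the last format is a literal character; the join string only appears between elements, which is why both are needed.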
This can also be done with pack + unpack, see the link below. Also see that page if there is a wide range of input characters.†
See ord and sprintf for more.
† If there is non-ASCII input then you may need to encode it first to get octets, if it is the octets that are to be escaped (and not whole code points)
use Encode qw(encode);
my $s = sprintf '\%0*v2x', '\\', encode('UTF-8', $input);  # 0 flag and width 2 pad to two hex digits
See the linked page for more.
I would like to verify what pack does. I have the following code to try it out.
$bits = pack 'N','134744072';
How do I print the bits?
I did the following:
printf ("bits = %032b \n", $bits);
but it does not work.
Thanks !!
If you want the binary representation of a number, use
my $num = 134744072;
printf("bits = %032b\n", $num);
If you want the binary representation of a string of bytes, use
my $bytes = pack('N', 134744072);
printf("bits = %s\n", unpack('B*', $bytes));
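The two approaches can be compared side by side. In this sketch (variable names are illustrative) both give the same 32 bits, because pack 'N' stores the number as four big-endian bytes. The original printf failed because $bits held that four-byte string, which is not a numeric string and so is treated as 0 by %b.

```perl
use strict;
use warnings;

my $num   = 134744072;            # 0x08080808
my $bytes = pack('N', $num);      # four bytes, each 0x08

my $from_num   = sprintf('%032b', $num);   # format the number itself
my $from_bytes = unpack('B*', $bytes);     # bit string of the packed bytes
my $as_hex     = unpack('H*', $bytes);     # same bytes as hex digits

print "$from_num\n";     # 00001000000010000000100000001000
print "$from_bytes\n";   # 00001000000010000000100000001000
print "$as_hex\n";       # 08080808
```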
The Devel::Peek module (which comes with Perl) allows you to examine Perl's representation of the variable. This is probably more useful than just a raw print when you're dealing with binary data rather than printable character strings.
#!/usr/bin/perl
use strict;
use warnings;
use Devel::Peek qw(Dump);
my $bits = pack 'N','134744072';
Dump($bits);
Which produces output like this:
SV = PV(0xaedb20) at 0xb15650
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0xb06630 "\10\10\10\10"\0
CUR = 4
LEN = 10
The 'SV' at the beginning indicates that this is a dump of a 'scalar value' (as opposed to say an array or a hash value).
The 'SV = PV' indicates that this scalar contains a string of bytes (as opposed to say an integer or floating point value).
The 'PV = 0xb06630' is the pointer to where those bytes are located.
The "\10\10\10\10"\0 is probably the bit you're interested in. The double quoted string represents the bytes making up the contents of this string.
Inside the string, you would typically see the bytes interpreted as if they were ASCII, so the byte 65 decimal would appear as 'A'. All non-printable characters are displayed in octal with a preceding \.
So your $bits variable contains 4 bytes, each octal '10' which is hex 0x08.
The LEN and CUR are telling you that Perl allocated 10 bytes of storage and is currently using 4 of them (so length($bits) would return 4).
Suppose I wanted to detect Unicode characters and encode them using \u notation. If I had to use a byte array, are there simple rules I can follow to detect groups of bytes that belong to a single character?
I am referring to UTF-8 bytes that need to be encoded for an ASCII-only receiver. At the moment, characters outside printable ASCII are stripped with s/[^\x20-\x7e\r\n\t]//g.
I want to improve this functionality to write \u0000 notation instead.
You need to have Unicode characters, so start by decoding your byte array.
use Encode qw( decode );
my $decoded_text = decode("UTF-8", $encoded_text);
Only then can you escape Unicode characters.
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
For example,
$ perl -CSDA -MEncode=decode -E'
my $encoded_text = "\xC3\x89\x72\x69\x63\x20\xE2\x99\xA5\x20\x50\x65\x72\x6c";
my $decoded_text = decode("UTF-8", $encoded_text);
say $decoded_text;
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
say $escaped_text;
'
Éric ♥ Perl
\u00C9ric \u2665 Perl
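One caveat: for code points above U+FFFF, %04X produces five or six hex digits. If the receiver expects JavaScript/JSON-style escapes (an assumption; the question does not say), such characters are usually written as a UTF-16 surrogate pair. A hedged sketch, with a made-up helper name, operating on already-decoded text:

```perl
use strict;
use warnings;

# Hypothetical helper: like the substitution above, but code points
# beyond U+FFFF become two \uXXXX escapes (a UTF-16 surrogate pair).
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s{([^\x0A\x20-\x5B\x5D-\x7E])}{
        my $cp = ord $1;
        $cp > 0xFFFF
            ? sprintf('\\u%04X\\u%04X',
                  0xD800 + (($cp - 0x10000) >> 10),
                  0xDC00 + (($cp - 0x10000) & 0x3FF))
            : sprintf('\\u%04X', $cp);
    }ge;
    return $text;
}

print escape_non_ascii("\x{C9}ric \x{2665} Perl"), "\n";   # \u00C9ric \u2665 Perl
print escape_non_ascii("\x{1F600}"), "\n";                 # \uD83D\uDE00
```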
I have a UTF-8 sequence of bytes and need to trim it to, say, 30 bytes. This may leave an incomplete sequence at the end. I need to figure out how to remove the incomplete sequence.
e.g.
$b="\x{263a}\x{263b}\x{263c}";
my $sstr;
print STDERR "length in utf8 bytes =" . length(Encode::encode_utf8($b)) . "\n";
{
use bytes;
$sstr= substr($b,0,29);
}
#After this $sstr contains "\342\230\272\342"\0
# How to remove \342 from the end
UTF-8 has some neat properties that allow us to do what you want while dealing with UTF-8 rather than characters. So first, you need UTF-8.
use Encode qw( encode_utf8 );
my $bytes = encode_utf8($str);
Now, to split between codepoints. The UTF-8 encoding of every code point will start with a byte matching 0b0xxxxxxx or 0b11xxxxxx, and you will never find those bytes in the middle of a code point. That means you want to truncate before
[\x00-\x7F\xC0-\xFF]
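The start-byte rule is easy to check on a small sample. The sketch below (sample string and variable names are assumptions) counts start bytes and continuation bytes in an encoded string; the number of start bytes equals the number of code points.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

# Four code points that encode to 1, 2, 3 and 4 bytes respectively.
my $str   = "a\x{E9}\x{263A}\x{1F600}";
my $bytes = encode_utf8($str);

# Count matches by assigning the match list to an empty list.
my $starts        = () = $bytes =~ /[\x00-\x7F\xC0-\xFF]/g;
my $continuations = () = $bytes =~ /[\x80-\xBF]/g;

printf "%d bytes, %d start bytes, %d continuation bytes\n",
    length($bytes), $starts, $continuations;
# 10 bytes, 4 start bytes, 6 continuation bytes
```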
Together, we get:
use Encode qw( encode_utf8 );
my $max_bytes = 8;
my $str = "\x{263a}\x{263b}\x{263c}"; # ☺☻☼
my $bytes = encode_utf8($str);
$bytes =~ s/^.{0,$max_bytes}(?![^\x00-\x7F\xC0-\xFF])\K.*//s;
# $bytes contains encode_utf8("\x{263a}\x{263b}")
# instead of encode_utf8("\x{263a}\x{263b}") . "\xE2\x98"
Great, yes? Nope. The above can truncate in the middle of a grapheme. A grapheme (specifically, an "extended grapheme cluster") is what someone would perceive as a single visual unit. For example, "é" is a grapheme, but it can be encoded using two code points ("\x{0065}\x{0301}"). If you cut between the two code points, the result would be valid UTF-8, but the "é" would become an "e"! If that's not acceptable, neither is the above solution. (Oleg's solution suffers from the same problem.)
Unfortunately, UTF-8's properties are no longer sufficient to help us here. We'll need to grab one grapheme at a time, and add it to the output until the next one no longer fits.
use Encode qw( encode_utf8 );
my $max_bytes = 6;
my $str = "abcd\x{0065}\x{0301}fg"; # abcdéfg
my $bytes = '';
my $bytes_left = $max_bytes;
while ($str =~ /(\X)/g) {
my $grapheme = $1;
my $grapheme_bytes = encode_utf8($grapheme);
$bytes_left -= length($grapheme_bytes);
last if $bytes_left < 0;
$bytes .= $grapheme_bytes;
}
# $bytes contains encode_utf8("abcd")
# instead of encode_utf8("abcde")
# or encode_utf8("abcde") . "\xCC"
First, please don't use bytes (and never assume anything about Perl's internal encoding). As the documentation says: "This pragma reflects early attempts to incorporate Unicode into perl and has since been superseded <...> use of this module for anything other than debugging purposes is strongly discouraged."
To strip an incomplete sequence at the end of the string, assuming it contains octets, use the Encode::FB_QUIET handling mode of Encode::decode to stop processing at the first invalid sequence, and then encode the result back:
my $valid = Encode::decode('utf8', $sstr, Encode::FB_QUIET);
$sstr = Encode::encode('utf8', $valid);
Note that if you plan to use this with another encoding in the future, not all encodings may support this handling mode.
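Applied to the bytes from the question, the decode/encode round trip could be sketched like this. The sample value mirrors the "\342\230\272\342" string mentioned in the question; the variable names are illustrative. Keep in mind that with FB_QUIET, Encode overwrites the source variable with the unprocessed remainder.

```perl
use strict;
use warnings;
use Encode ();

# U+263A (three bytes) followed by a dangling lead byte.
my $sstr = "\xE2\x98\xBA\xE2";

# Decoding stops quietly at the incomplete sequence.
my $valid = Encode::decode('UTF-8', $sstr, Encode::FB_QUIET);

# Encoding the decoded text again yields only the complete characters.
my $clean = Encode::encode('UTF-8', $valid);

printf "%d character, %d bytes after cleanup\n", length($valid), length($clean);
# 1 character, 3 bytes after cleanup
```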
The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes page echoes this.
use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';
print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";
The output of this script, however, disagrees with the manpage:
ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35
It seems to me length() and bytes::length() return the same for both ASCII & Unicode strings. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode—does that mean length() automatically handles Unicode strings properly?
Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example: it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) in an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.
If your scripts are encoded in UTF-8, then please use the utf8 pragma. The bytes pragma on the other hand will force byte semantics on length, even if the string is UTF-8. Both work in the current lexical scope.
$ascii = 'Lorem ipsum dolor sit amet';
{
use utf8;
$unicode = 'Lørëm ípsüm dölör sît åmét';
}
$not_unicode = 'Lørëm ípsüm dölör sît åmét';
no bytes; # default, can be omitted
print "Character semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
print "----\n";
use bytes;
print "Byte semantics:\n";
print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";
This outputs:
Character semantics:
ASCII: 26
Unicode: 26
Not-Unicode: 35
----
Byte semantics:
ASCII: 26
Unicode: 35
Not-Unicode: 35
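For the Content-Length case mentioned in the question's edit, one way to get a byte count without the bytes pragma is to encode the text explicitly and take the length of the result. A minimal sketch, assuming the body is sent as UTF-8; the sample text and variable names are made up, and \x escapes are used so the example does not depend on the source file's encoding:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Decoded text "Lørëm ípsüm": 11 characters, 4 of them non-ASCII.
my $body = "L\x{F8}r\x{EB}m \x{ED}ps\x{FC}m";

# The octets that would actually go over the wire.
my $octets = encode('UTF-8', $body);

my $char_count     = length $body;     # characters
my $content_length = length $octets;   # bytes

print "characters: $char_count\n";          # characters: 11
print "Content-Length: $content_length\n";  # Content-Length: 15
```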
The purpose of the bytes pragma is to replace the length function (and several other string related functions) in the current scope. So every call to length in your program is a call to the length that bytes provides. This is more in line with what you were trying to do:
#!/usr/bin/perl
use strict;
use warnings;
sub bytes($) {
use bytes;
return length shift;
}
my $ascii = "foo"; #really UTF-8, but everything is in the ASCII range
my $utf8 = "\x{24d5}\x{24de}\x{24de}";
print "[$ascii] characters: ", length $ascii, "\n",
"[$ascii] bytes : ", bytes $ascii, "\n",
"[$utf8] characters: ", length $utf8, "\n",
"[$utf8] bytes : ", bytes $utf8, "\n";
Another subtle flaw in your reasoning is the assumption that there is such a thing as Unicode bytes. Unicode is an enumeration of characters. It says, for instance, that U+24D5 is ⓕ (CIRCLED LATIN SMALL LETTER F). What Unicode does not specify is how many bytes a character takes up; that is left to the encodings. UTF-8 says it takes up 3 bytes, UTF-16 says it takes up 2 bytes, UTF-32 says it takes 4 bytes, etc. Internally, Perl stores strings that contain code points above 255 as UTF-8. UTF-8 has the benefit of being identical to ASCII for the first 128 code points (U+0000 to U+007F).
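The byte counts quoted above are easy to confirm with Encode. This sketch uses the big-endian variants of UTF-16 and UTF-32 so that no byte order mark is added to the output; the hash name is illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode );

my $char = "\x{24D5}";   # CIRCLED LATIN SMALL LETTER F

# Encode the same single character three ways and record the byte counts.
my %size;
for my $encoding (qw( UTF-8 UTF-16BE UTF-32BE )) {
    $size{$encoding} = length encode($encoding, $char);
    printf "%-8s %d bytes\n", $encoding, $size{$encoding};
}
# UTF-8    3 bytes
# UTF-16BE 2 bytes
# UTF-32BE 4 bytes
```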
I found that it is possible to use the Encode module to influence how length works,
if $string is a UTF-8 encoded string:
Encode::_utf8_on($string); # the length function will show number of code points after this.
Encode::_utf8_off($string); # the length function will show number of bytes in the string after this.
There’s a fair bit of problematic commentary here.
Perl doesn’t know—and doesn’t care—which strings are “Unicode” and which aren’t. All it knows is the code points that make up the string.
Peeking at Perl’s internal UTF8 flag indicates you likely have the wrong idea about Perl strings. A “UTF-8 encoded string”—that is, the result of an encode operation like utf8::encode—usually does NOT have that flag set, for example.
There are some interfaces where that abstraction leaks, and strings with the internal UTF8 flag set DO behave differently from the same set of code points without that flag (that is, after utf8::downgrade). It’s unwise to rely on these behaviours since Perl’s own maintainers regard them as bugs. Most are fixed by the “unicode_strings” and “unicode_eval” features, and the rest by Sys::Binmode from CPAN.
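A small sketch may make the distinction concrete (variable names are illustrative). Changing only the internal storage with utf8::upgrade leaves the string's code points and length untouched, while utf8::encode really does change the string into its UTF-8 octets, and the result does not carry the internal flag.

```perl
use strict;
use warnings;

my $native   = "\xE9";       # one code point, U+00E9
my $upgraded = $native;
utf8::upgrade($upgraded);    # same code point, different internal storage

my $encoded = "\x{263A}";    # one code point, U+263A
utf8::encode($encoded);      # now three code points: 0xE2 0x98 0xBA

print "same string\n" if $native eq $upgraded;
printf "lengths: %d %d %d\n",
    length($native), length($upgraded), length($encoded);
# lengths: 1 1 3
```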