Perl Encode.pm cannot decode string with wide character - perl

I am running a Perl app that uses /opt/local/lib/perl5/5.12.4/darwin-thread-multi-2level/Encode.pm
and it issues this error:
Cannot decode string with wide characters at /opt/local/lib/perl5/5.12.4/darwin-thread-multi-2level/Encode.pm line 174.
Line 174 of Encode.pm reads
sub decode($$;$) {
    my ( $name, $octets, $check ) = @_;
    return undef unless defined $octets;
    $octets .= '' if ref $octets;
    $check ||= 0;
    my $enc = find_encoding($name);
    unless ( defined $enc ) {
        require Carp;
        Carp::croak("Unknown encoding '$name'");
    }
    my $string = $enc->decode( $octets, $check ); # line 174
    $_[1] = $octets if $check and !ref $check and !( $check & LEAVE_SRC() );
    return $string;
}
Any workaround?

encode takes a string of Unicode code points and serialises them into a string of bytes.
decode takes a string of bytes and deserialises them into Unicode code points.
That message means you passed a string containing one or more characters above 255 (non-bytes) to decode, which is obviously an incorrect argument.
>perl -MEncode -E"for (254..257) { say; decode('iso-8859-1', chr($_)); }"
254
255
256
Wide character in subroutine entry at .../Encode.pm line 176.
You ask for a workaround, but the bug is yours. Perhaps you are accidentally trying to decode something you already decoded?
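A minimal sketch of guarding against an accidental second decode, assuming the input is either a byte string or text that has already been decoded. The helper name `decode_if_bytes` is hypothetical and not part of Encode; it only illustrates the "characters above 255 are not bytes" rule described above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode only when the input can still be a byte string.
sub decode_if_bytes {
    my ($encoding, $input) = @_;
    # A character above 255 cannot be a byte, so the string is already text.
    return $input if $input =~ /[^\x00-\xFF]/;
    return decode($encoding, $input);
}

my $bytes = encode('UTF-8', "\x{20ac}");          # three octets: E2 82 AC
my $text  = decode_if_bytes('UTF-8', $bytes);     # one character, U+20AC
my $again = decode_if_bytes('UTF-8', $text);      # returned unchanged, no error
```

This check is only a heuristic: already-decoded text whose characters all fall below 256 would still be decoded a second time, so the more reliable fix is to find where the string gets decoded twice.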

I had a similar problem.
$enc->decode( $octets, $check ); expects octets.
So put Encode::_utf8_off($octets) before. It made it work for me.

That error message is saying that you have passed in a string that has already been decoded (and contains characters above codepoint 255). You can't decode it again.

Related

How can I convert a hex string in perl to ascii?

I have an old legacy system that is pulling postgres bytea data from a database (can't change the code as it's compiled). How do I go about converting bytea data that is in this format to an ascii string?
Here is a sample of the data:
my($value) = "\x46726f6d3a20224d";
I found this example online for decoding a hex string, but its input has to start with 0x, not \x. If I change the test string from \x to 0x, the example works. With \x, however, Perl treats the following hex digits 46 as a capital F before the regex match runs, and the rest then fails to decode.
Here is the regex I found; it works with a string starting with 0x but not \x. Is it possible to decode this type of hex string somehow?
$value =~ s/0x(([0-9a-f][0-9a-f])+)/pack('H*', $1)/ie;
print $value, "\n";
Correct output when you use 0x on the input string:
From: "M
Incorrect output (not decoded) when using \x on the input string:
F726f6d3a20224d
Cheers, Mike
Assuming you actually have the string
\x46726f6d3a20224d
as produced by
my $value = "\\x46726f6d3a20224d";
Then all you need to do is replace the 0 with \\.
$value =~ s/\\x(([0-9a-f][0-9a-f])+)/pack('H*', $1)/ie;
Better (less repetition and it avoids slowdowns related to ß):
$value =~ s/\\x((?:[0-9a-fA-F]{2})+)/ pack( 'H*', $1 ) /e;
If you expect this to be the whole string, then I'd use
$value = pack( 'H*', $value =~ s/^\\x//r );
The following is faster, but loses some validation:
$value = pack( 'H*', substr( $value, 2 ) );
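Putting the whole-string case together, here is a small sketch that keeps the validation and also runs on Perl versions older than 5.14, where the /r modifier is not available. The helper name `bytea_hex_to_bytes` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the decoded bytes, or undef if the input
# is not "\x" followed by an even number of hex digits.
sub bytea_hex_to_bytes {
    my ($value) = @_;
    return undef unless $value =~ /\A\\x((?:[0-9a-fA-F]{2})*)\z/;
    return pack( 'H*', $1 );
}

my $value = "\\x46726f6d3a20224d";
print bytea_hex_to_bytes($value), "\n";    # From: "M
```

Returning undef on malformed input lets the caller decide whether to skip the row or report it, instead of silently packing garbage.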

How can I escape a string in Perl for LDAP searching?

I want to escape a string, per RFC 4515. So, the string "u1" would be transformed to "\75\31", that is, the ordinal value of each character, in hex, preceded by backslash.
Has to be done in Perl. I already know how to do it in Python, C++, Java, etc., but Perl is baffling.
Also, I cannot use Net::LDAP and I may not be able to add any new modules, so, I want to do it with basic Perl features.
Skimming through RFC 4515, this encoding escapes the individual octets of multi-byte UTF-8 characters, not codepoints. So, something that works with non-ASCII text too:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
sub valueencode ($) {
    # Unpack format returns octets of UTF-8 encoded text
    my @bytes = unpack "U0C*", $_[0];
    sprintf '\%02x' x @bytes, @bytes;
}
say valueencode 'u1';
say valueencode "Lu\N{U+010D}i\N{U+0107}"; # Lučić, from the RFC 4515 examples
Example:
$ perl demo.pl
\75\31
\4c\75\c4\8d\69\c4\87
Or an alternative using the vector flag:
use Encode qw/encode/;
sub valueencode ($) {
sprintf '\%*vx', "\\", encode('UTF-8', $_[0]);
}
Finally, a smarter version that only escapes ASCII characters when it has to (And multi-byte characters, even though upon a closer read of the RFC they don't actually need to be if they're valid UTF-8):
# Encode according to RFC 4515 valueencoding grammar rules:
#
# Text is UTF-8 encoded. Bytes can be escaped with the sequence
# \XX, where the X's are hex digits.
#
# The characters NUL, LPAREN, RPAREN, ASTERISK and BACKSLASH all MUST
# be escaped.
#
# Bytes > 0x7F that aren't part of a valid UTF-8 sequence MUST be
# escaped. This version assumes there are no such bytes and that input
# is an ASCII or Unicode string.
#
# Single bytes and valid multibyte UTF-8 sequences CAN be escaped,
# with each byte escaped separately. This version escapes multibyte
# sequences, to give ASCII results.
sub valueencode ($) {
my $encoded = "";
for my $byte (unpack 'U0C*', $_[0]) {
if (($byte >= 0x01 && $byte <= 0x27) ||
($byte >= 0x2B && $byte <= 0x5B) ||
($byte >= 0x5D && $byte <= 0x7F)) {
$encoded .= chr $byte;
} else {
$encoded .= sprintf '\%02x', $byte;
}
}
return $encoded;
}
This version returns the strings 'u1' and 'Lu\c4\8di\c4\87' from the above examples.
In short, one way is just as the question says: split the string into characters, get their ordinals then convert format to hex; then put it back together. I don't know how to get the \nn format so I'd make it 'by hand'. For instance
my $s = join '', map { sprintf '\%x', ord } split //, 'u1';
Or use vector flag %v to treat the string as a "vector" of integers
my $s = sprintf '\%*vx', '\\', 'u1';
With %v the string is broken up into numerical representation of characters, each is converted (%x), and they're joined back, with . between them. That (optional) * allows us to specify our string by which to join them instead, \ (escaped) here.
This can also be done with pack + unpack, see the link below. Also see that page if there is a wide range of input characters.†
See the documentation for ord and sprintf.
† If there is non-ASCII input then you may need to encode it so to get octets, if they are to escape (and not whole codepoints)
use Encode qw(encode);
my $s = sprintf '\%*vx', '\\', encode('UTF_8', $input);
See the linked page for more.

Perl | Print ASCII, but backslashed other

I want to print the 95 printable ASCII symbols unchanged, but print the codes of all other characters.
How can I do this in pure Perl? With the 'unpack' function? With a module?
print BackSlashed('test folder'); # expected test\040folder
print BackSlashed('test тестовая folder');
# expected test\040\321\202\320\265\321\201\321\202\320\276\320\262\320\260\321\217\040folder
print BackSlashed('НОВАЯ ПАПКА');
# expected \320\235\320\236\320\222\320\220\320\257\040\320\237\320\220\320\237\320\232\320\220
sub BackSlashed() {
my $str = shift;
.. backslashed code here...
return $str
}
You can use a regular expression substitution with an evaluated replacement part (the /e modifier). In there, you need to convert each character to its numeric value first, and then output it in octal notation. There's a good explanation for it in this answer. Prepend an escaped backslash \ to get it to show up in the output.
$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
The character class leaves basic ASCII letters and digits untouched and escapes everything else. If you want a different set, just change the character class.
Since your sample output has octets but you said your code has the use utf8 pragma, you need to convert Perl's representation of the string to the corresponding octet sequence before you run the substitution.
use utf8;
my $str = 'НОВАЯ ПАПКА';
print foo($str);
sub foo { # note that there are no () here!
my $str = shift;
utf8::encode($str);
$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
return $str;
}
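As a variation closer to the expected output in the question, the following sketch keeps every printable ASCII character except the space and the backslash, and escapes all other octets. The sub name `backslash_octets` and the decision to escape the backslash itself are assumptions, not something the question specifies. The Cyrillic input is written with \x{...} escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical variant: keep printable ASCII other than space and backslash,
# and write every other octet as a three-digit octal escape.
sub backslash_octets {
    my ($str) = @_;
    utf8::encode($str);    # characters -> UTF-8 octets (modifies the copy)
    $str =~ s/([^\x21-\x5B\x5D-\x7E])/sprintf '\\%03o', ord $1/ge;
    return $str;
}

print backslash_octets('test folder'), "\n";          # test\040folder
print backslash_octets("\x{41d}\x{41e}"), "\n";       # \320\235\320\236
```

Escaping the backslash as well keeps the output unambiguous, since a literal backslash in the input can then never be mistaken for the start of an escape.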

Perl Regex Error Help

I'm receiving a similar error in two completely unrelated places in our code that we can't seem to figure out how to resolve. The first error occurs when we try to parse XML using XML::Simple:
Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /usr/local/lib/perl5/XML/LibXML/Error.pm line 217.
And the second is when we try to do simple string substitution:
Malformed UTF-8 character (unexpected non-continuation byte 0x78, immediately after start byte 0xe9) in substitution (s///) at /gold/content/var/www/alltrails.com/cgi-bin/API/Log.pm line 365.
The line in question in our Log.pm file is as follows where $message is a string:
$message =~ s/\s+$//g;
Our biggest problem in troubleshooting this is that we haven't found a way to identify the input that is causing it. My hope is that someone else has run into this issue before and can provide advice or sample code that will help us resolve it.
Thanks in advance for your help!
Not sure what the cause is, but if you want to log the message that is causing this, you could always add a __DIE__ signal handler to make sure you capture the error:
$SIG{__DIE__} = sub {
if ($_[0] =~ /Malformed UTF-8 character/) {
print STDERR "message = $message\n";
}
};
That should at least let you know what string is triggering these errors.
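If the message turns out to be emitted as a warning rather than a fatal error, a __DIE__ handler will not see it. A sketch of the same idea with a __WARN__ handler scoped to the one substitution might look like this. The wrapper name `trim_with_diagnostics` is hypothetical, and it assumes the string can be passed in as an argument.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper around the substitution from Log.pm.
sub trim_with_diagnostics {
    my ($message) = @_;
    local $SIG{__WARN__} = sub {
        my ($warning) = @_;
        if ( $warning =~ /Malformed UTF-8 character/ ) {
            # Clear the UTF8 flag on a copy to expose the raw internal
            # bytes, then log them as hex so odd bytes are visible.
            my $copy = $message;
            Encode::_utf8_off($copy);
            print STDERR 'offending message (hex): ',
                unpack( 'H*', $copy ), "\n";
        }
        print STDERR $warning;    # still report the original warning
    };
    $message =~ s/\s+$//;
    return $message;
}

print trim_with_diagnostics("hello  \n"), "\n";
```

Because the handler is installed with local, it only applies while the sub is running and does not change warning handling for the rest of the program.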
Can you do a hex dump of the source data to see what it looks like?
If you're reading this from a file, you can do this with a tool like "od".
Or, you can do this inside the perl script itself by passing the string to a function like this:
sub DumpString {
    my @a = unpack('C*',$_[0]);    # byte values of the string
    my $o = 0;                     # offset of the current row
    while (@a) {
        my @b = splice @a,0,16;    # next 16 bytes
        my @d = map sprintf("%03d",$_), @b;
        my @x = map sprintf("%02x",$_), @b;
        my $c = substr($_[0],$o,16);
        $c =~ s/[[:^print:]]/ /g;
        printf "%6d %s\n",$o,join(' ',@d);
        print " "x8,join(' ',@x),"\n";
        print " "x9,join(' ',split(//,$c)),"\n";
        $o += 16;
    }
}
Sounds like you have an "XML" file that is expected to have UTF-8 encoded characters but doesn't. Try just opening it and looking for high-bit characters (bytes above 0x7F).
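One way to find the offending input programmatically is to attempt a strict decode of each byte string and log the ones that fail. This is a sketch assuming the data arrives as raw octets; the name `is_valid_utf8` is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the octets form well-formed UTF-8.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;    # decode with a CHECK value may modify its argument
    return eval { decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 } ? 1 : 0;
}

print is_valid_utf8("caf\xc3\xa9"), "\n";    # 1
print is_valid_utf8("\xe9x"), "\n";          # 0
```

The second input reproduces the situation in the error message: start byte 0xe9 followed by 0x78, which is not a continuation byte.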

Perl - Unicode::String sub need to add/convert for Latin-9 support

Part 3 (Part 2 is here) (Part 1 is here)
Here is the perl Mod I'm using: Unicode::String
How I'm calling it:
print "Euro: ";
print unicode_encode("€")."\n";
print "Pound: ";
print unicode_encode("£")."\n";
I would like it to return this format:
&#x20ac; # Euro
&#xa3; # Pound
The function is below:
sub unicode_encode {
shift() if ref( $_[0] );
my $toencode = shift();
return undef unless defined($toencode);
print "Passed: ".$toencode."\n";
Unicode::String->stringify_as("utf8");
my $unicode_str = Unicode::String->new();
my $text_str = "";
my $pack_str = "";
# encode Perl UTF-8 string into latin1 Unicode::String
# - currently only Basic Latin and Latin 1 Supplement
# are supported here due to issues with Unicode::String .
$unicode_str->latin1($toencode);
print "Latin 1: ".$unicode_str."\n";
# Convert to hex format ("U+XXXX U+XXXX ")
$text_str = $unicode_str->hex;
# Now, the interesting part.
# We must search for the (now hex-encoded)
# Unicode escape sequence.
my $pattern =
'U\+005[C|c] U\+0058 U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f])';
# Replace escapes with entities (beginning of string)
$_ = $text_str;
if (/^$pattern/) {
$pack_str = pack "H8", "$1$2$3$4$5$6$7$8";
$text_str =~ s/^$pattern/\&#x$pack_str/;
}
# Replace escapes with entities (middle of string)
$_ = $text_str;
while (/ $pattern/) {
$pack_str = pack "H8", "$1$2$3$4$5$6$7$8";
$text_str =~ s/ $pattern/\;\&#x$pack_str/;
$_ = $text_str;
}
# Replace "U+" with "&#x" (beginning of string)
$text_str =~ s/^U\+/&#x/;
# Replace " U+" with ";&#x" (middle of string)
$text_str =~ s/ U\+/;&#x/g;
# Append ";" to end of string to close last entity.
# This last ";" at the end of the string isn't necessary in most parsers.
# However, it is included anyways to ensure full compatibility.
if ( $text_str ne "" ) {
$text_str .= ';';
}
return $text_str;
}
I need to get the same output but also need to support Latin-9 characters, and Unicode::String is limited to Latin-1. Any thoughts on how I can get around this?
I have a couple of other questions, and I think I have some understanding of Unicode and encodings, but I am also short on time.
Thanks to anyone who helps me out!
As you have been told already, Unicode::String is not an appropriate choice of module. Perl ships with a module called 'Encode' which can do everything you need.
If you have a character string in Perl like this:
my $euro = "\x{20ac}";
You can convert it to a string of bytes in Latin-9 like this:
my $bytes = encode("iso8859-15", $euro);
The $bytes variable will now contain \xA4.
Or you can have Perl automatically convert it on output to a filehandle like this:
binmode(STDOUT, ":encoding(iso8859-15)");
You can refer to the documentation for the Encode module. And also, PerlIO describes the encoding layer.
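For the entity format asked for in the question, here is a rough sketch using only Encode, assuming the input arrives as Latin-9 octets. The helper name `latin9_to_entities` is hypothetical. It is one possible approach, not a drop-in replacement for the original unicode_encode sub, and it leaves plain ASCII characters as they are.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: Latin-9 octets in, ASCII text with hex entities out.
sub latin9_to_entities {
    my ($octets) = @_;
    my $text = decode( 'iso-8859-15', $octets );    # octets -> characters
    # Replace every non-ASCII character with a numeric character reference.
    $text =~ s/([^\x00-\x7F])/sprintf '&#x%x;', ord $1/ge;
    return $text;
}

print latin9_to_entities("\xA4"), "\n";    # &#x20ac;
print latin9_to_entities("\xA3"), "\n";    # &#xa3;
```

Once the data is decoded into characters, the code point of each character is available through ord, so no per-encoding lookup table is needed.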
I know you are determined to ignore this final piece of advice but I'll offer it one last time. Latin-9 is a legacy encoding. Perl can quite happily read Latin-9 data and convert it to UTF-8 on the fly (using binmode). You should not be writing more software that generates Latin-9 data; you should be migrating away from it.