How to handle over non-ANSI characters to Crypt::Blowfish in Perl? - perl

How to handle over non-ANSI characters to Crypt::Blowfish in Perl?
The following script was written in charset UTF-8 and fails only on § or ö.
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use utf8;
use Crypt::Blowfish;
my $cipher = Crypt::Blowfish->new( pack 'H16', '12345678' );
my #chars = ( 'a', '§', 'ö', '9' );
printf( "%s: %s",
$_, ( eval { $cipher->encrypt( $_ x 8 ) } ) ? "ok\n" : "fail: $#" )
for ( #chars );

Ciphers work on streams or blocks of bytes, but you aren't providing it with bytes. You are providing it with Unicode cope points.
You need to serialise any text you want to encrypt into bytes before you can encrypt it, which is to say, you need to encode your text.
use Encode qw( encode_utf8 );
my $bytes = encode_utf8($char x 8);
Furthermore, you shouldn't use Crypt::Blowfish directly. That will produce weak encryption. You want to access it through Crypt::CBC. This provides salting, chaining and padding.
use Crypt::CBC qw( );
use Encode qw( encode_utf8 decode_utf8 );
my $cipher = Crypt::CBC->new(
-key => '... key phrase ...',
-cipher => 'Blowfish',
);
my $cipher_bytes = $cipher->encrypt(encode_utf8($plain_text));
my $plain_text = decode_utf8($cipher->decrypt($cipher_bytes));

Many of the Crypt::* modules are block encryption algorithms. So, they can work only with blocks with fixed length. Since '§' is a UTF8 character, it actually contain more than 1 byte, thats why your code is failing. Another issue is that you using use utf8 pragma, which means utf8 constant strings will be created with "utf8 flag". This can lead to big changes in binary algorithms, like encryption.
I'd suggest you to use Crypt::CBC module(check it on the CPAN); and, remove utf8 flag before encryption: utf8::encode($_);

Related

UTF8 is causing error when using Archive::Zip

#!/usr/bin/env perl
use utf8; #this causes error
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES :CONSTANTS );
my $zip = Archive::Zip->new();
my $my_string_with_utf8 = <<'END_UTF8_STRING';
Text with UTF8 open/close 201c/201d “hello”
END_UTF8_STRING
my $zip_pathname = 'myfiles/myfile.txt';
$zip->addString($my_string_with_utf8, $zip_pathname);
unless ( $zip->writeToFileNamed('myZip.zip') == AZ_OK ) {
die 'write error';
}
Error:
Wide character in Compress::Raw::Zlib::crc32
Why does utf8 cause error in these package?
perl5/vendor_perl/Archive/Zip.pm line 303
$my_string_with_utf8 is not encoded using UTF-8 as the name suggests. It's a string of decoded text aka a string of Unicode Code Points.
Files can only contain bytes, so you need to encode those Code Points into bytes, say by using a character encoding such as UTF-8.

Fixing a file consisting of both UTF-8 and Windows-1252

I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 aka iso-latin-1 or cp1252 aka Windows-1252. Is there a way of recovering the original text?
Yes!
Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two solutions.
A line can contain a mix of encodings
Encoding::FixLatin provides a function named fix_latin which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.
$ perl -e'
use Encoding::FixLatin qw( fix_latin );
$bytes = "\xD0 \x92 \xD0\x92\n";
$text = fix_latin($bytes);
printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
Heuristics are employed, but they are fairly reliable. Only the following cases will fail:
One of[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]encoded using iso-8859-1 or cp1252, followed by one of[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]encoded using iso-8859-1 or cp1252.
One of[àáâãäåæçèéêëìíîï]encoded using iso-8859-1 or cp1252, followed by two of[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]encoded using iso-8859-1 or cp1252.
One of[ðñòóôõö÷]encoded using iso-8859-1 or cp1252, followed by two of[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]encoded using iso-8859-1 or cp1252.
The same result can be produced using core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.
$ perl -e'
use Encode qw( decode_utf8 encode_utf8 decode );
$bytes = "\xD0 \x92 \xD0\x92\n";
$text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A
Each line only uses one encoding
fix_latin works on a character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you could make the process even more reliable by check if the line is valid UTF-8.
$ perl -e'
use Encode qw( decode );
for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
if (!eval {
$text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
1 # No exception
}) {
$text = decode("cp1252", $bytes);
}
printf("U+%v04X\n", $text);
}
'
U+00D0.0020.2019.0020.00D0.2019.000A
U+0412.000A
Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:
The line is encoded using iso-8859-1 or cp1252,
At least one of[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷]is present in the line,
All instances of[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]are always followed by exactly one of[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
All instances of[àáâãäåæçèéêëìíîï]are always followed by exactly two of[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
All instances of[ðñòóôõö÷]are always followed by exactly three of[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],
None of[øùúûüýþÿ]are present in the line, and
None of[€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]are present in the line except where previously mentioned.
Notes:
Encoding::FixLatin installs command line tool fix_latin to convert files, and it would be trivial to write one using the second approach.
fix_latin (both the function and the file) can be sped up by installing Encoding::FixLatin::XS.
The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.
This is one of the reasons I wrote Unicode::UTF8. With Unicode::UTF8 this is trivial using the fallback option in Unicode::UTF8::decode_utf8().
use Unicode::UTF8 qw[decode_utf8];
use Encode qw[decode];
print "UTF-8 mixed with Latin-1 (ISO-8859-1):\n";
for my $octets ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
no warnings 'utf8';
printf "U+%v04X\n", decode_utf8($octets, sub { $_[0] });
}
print "\nUTF-8 mixed with CP-1252 (Windows-1252):\n";
for my $octets ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
no warnings 'utf8';
printf "U+%v04X\n", decode_utf8($octets, sub { decode('CP-1252', $_[0]) });
}
Output:
UTF-8 mixed with Latin-1 (ISO-8859-1):
U+00D0.0020.0092.0020.0412.000A
U+0412.000A
UTF-8 mixed with CP-1252 (Windows-1252):
U+00D0.0020.2019.0020.0412.000A
U+0412.000A
Unicode::UTF8 is written in C/XS and only invokes the callback/fallback when encountering an Ill-formed UTF-8 sequence.
Recently I came across files with a severe mix of UTF-8, CP1252, and UTF-8 encoded, then interpreted as CP1252, then that encoded as UTF-8 again, that interpreted as CP1252 again, and so forth.
I wrote the below code, which worked well for me. It looks for typical UTF-8 byte sequences, even if some of the bytes are not UTF-8, but the Unicode representation of the equivalent CP1252 byte.
my %cp1252Encoding = (
# replacing the unicode code with the original CP1252 code
# see e.g. http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
"\x{20ac}" => "\x80",
"\x{201a}" => "\x82",
"\x{0192}" => "\x83",
"\x{201e}" => "\x84",
"\x{2026}" => "\x85",
"\x{2020}" => "\x86",
"\x{2021}" => "\x87",
"\x{02c6}" => "\x88",
"\x{2030}" => "\x89",
"\x{0160}" => "\x8a",
"\x{2039}" => "\x8b",
"\x{0152}" => "\x8c",
"\x{017d}" => "\x8e",
"\x{2018}" => "\x91",
"\x{2019}" => "\x92",
"\x{201c}" => "\x93",
"\x{201d}" => "\x94",
"\x{2022}" => "\x95",
"\x{2013}" => "\x96",
"\x{2014}" => "\x97",
"\x{02dc}" => "\x98",
"\x{2122}" => "\x99",
"\x{0161}" => "\x9a",
"\x{203a}" => "\x9b",
"\x{0153}" => "\x9c",
"\x{017e}" => "\x9e",
"\x{0178}" => "\x9f",
);
my $re = join "|", keys %cp1252Encoding;
$re = qr/$re/;
my %cp1252Decoding = reverse % cp1252Encoding;
my $cp1252Characters = join "|", keys %cp1252Decoding;
sub decodeUtf8
{
my ($str) = #_;
$str =~ s/$re/ $cp1252Encoding{$&} /eg;
utf8::decode($str);
return $str;
}
sub fixString
{
my ($str) = #_;
my $r = qr/[\x80-\xBF]|$re/;
my $current;
do {
$current = $str;
# If this matches, the string is likely double-encoded UTF-8. Try to decode
$str =~ s/[\xF0-\xF7]$r$r$r|[\xE0-\xEF]$r$r|[\xC0-\xDF]$r/ decodeUtf8($&) /eg;
} while ($str ne $current);
# decodes any possible left-over cp1252 codes to Unicode
$str =~ s/$cp1252Characters/ $cp1252Decoding{$&} /eg;
return $str;
}
This has similar limitations as ikegami's answer, except that the same limitations are also applicable to UTF-8 encoded strings.

Perl unicode conversion

I'm using this code:
use Unicode::UTF8 qw[decode_utf8 encode_utf8];
my $d = "opposite Spencer\u2019s Aliganj, Lucknow";
my $string = decode_utf8($d);
my $octets = encode_utf8($d);
print "\nSTRING :: $string";
I want output like
opposite Spencer's Aliganj, Lucknow
what to do ?
If you just want unicode #2019 to become ’ you can use one of this ways:
use strict;
use warnings;
use open ':std', ':encoding(utf-8)';
print chr(0x2019);
print "\x{2019}"; # for characters 0x100 and above
print "\N{U+2019}";
\u \U in perl translates to uppercase in perl:
Case translation operators use the Unicode case translation tables
when character input is provided. Note that uc(), or \U in
interpolated strings, translates to uppercase, while ucfirst, or \u in
interpolated strings, translates to titlecase in languages that make
the distinction (which is equivalent to uppercase in languages without
the distinction).
You're trying to parse butchered JSON.
You could parse it yourself.
use Encode qw( decode );
my $incomplete_json = "opposite Spencer\u2019s Aliganj, Lucknow";
my $string = $incomplete_json;
$string =~ s{\\u([dD][89aAbB]..)\\u([dD][cCdDeEfF]..)|\\u(....)}
{ $1 ? decode('UTF-16be', pack('H*', $1.$2)) : chr(hex($3)) }eg;
Or you could fix it up then use an existing parser
use JSON::XS qw( decode_json );
my $incomplete_json = "opposite Spencer\u2019s Aliganj, Lucknow";
my $json = $incomplete_json;
$json =~ s/"/\\"/g;
$json = qq{["$json"]};
my $string = decode_json($json)->[0];
Untested. You may have to handle other slashes. Which solution is simpler depends on how you have to handle the other slashes.

How to convince SOAP::Lite to return UTF-8 data in responses as UTF-8?

I'm trying to transmit UTF-8 strings in complex data structures with SOAP::Lite. However, as it turns out, SOAP::Lite quietly converts all UTF-8 strings into base-64-encoded octets. The problem with that is that the deserializing does not revert the conversion and only does a straight base64 decode.
This leaves me confused as to how a user is supposed to ensure that they get UTF-8 data from the SOAP::Lite response. Walking the tree and running decode_utf8 on all strings seems wasteful.
Any suggestions?
Edit: In a nutshell, how do i make this test pass without monkey-patching?
I just hit the same problem and found the above discussion useful. As you say in the OP, the problem is that the data is encoded in base64 and the is_utf8 flag get lost. what happens in the serlializer treats any string with a non-ascii character as binary. I got it to do what I wanted by tweaking the serializer as below. It could have odd consequences, but it works in my situation..
use strictures;
use Test::More;
use SOAP::Lite;
use utf8;
use Data::Dumper;
my $data = "mü\x{2013}";
my $ser = SOAP::Serializer->new;
$ser->typelookup->{trick_into_ignoring} = [9, \&utf8::is_utf8 ,'as_utf8_string'];
my $xml = $ser->envelope( freeform => $data );
my ( $cycled ) = values %{ SOAP::Deserializer->deserialize( $xml )->body };
is( length( $data ), length( $cycled ), "UTF-8 string is the same after serializing" );
done_testing;
sub check_utf8 {
my ($val) = #_;
return utf8::is_utf8($val);
}
package SOAP::Serializer;
sub as_utf8_string {
my $self = shift;
my($value, $name, $type, $attr) = #_;
return $self->as_string($value, $name, $type, $attr);
}
1;
The 9 means the utf8 check is performed before the check for non-ascii characters. if the utf8 flag is on then it treats it as a 'normal' string.
Use of is_utf8 (line 278) is evil and wrong. As we can't trust SOAP::Lite with encoding character data properly (to be fair, this code was likely written before word got around in the community how to do this particular kind of string processing), we shall give it octet data only and therefore have to handle encoding/decoding ourself. Pick a single encoding, apply it before handing off data to S::L, reverse it after receiving data.
use utf8;
use strictures;
use Encode qw(decode encode);
use SOAP::Lite qw();
use Test::More;
my $original = 'mü';
my $xml = SOAP::Serializer->envelope(
freeform => encode('UTF-8', $original, Encode::FB_CROAK | Encode::LEAVE_SRC)
);
my ($roundtrip) = map {
decode('UTF-8', $_, Encode::FB_CROAK | Encode::LEAVE_SRC)
} values %{SOAP::Deserializer->deserialize($xml)->body};
is(length($original), length($roundtrip),
'Perl character string round-trips without changing length');
done_testing;

Perl's YAML::XS and unicode

I am trying to use perl's YAML::XS module on unicode letters and it doesn't seem working the way it should.
I write this in the script (which is saved in utf-8)
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Instead of something sane, -: Å is printed. According to this link, though, it should be working fine.
Yes, when I YAML::XS::Load it back, I got the correct strings again, but I don't like the fact the dumped string seems to be in some wrong encoding.
Am I doing something wrong? I am always unsure about unicode in perl, to be frank...
clarification: my console supports UTF-8. Also, when I print it to file, opened with utf8 handle with open $file, ">:utf8" instead of STDOUT, it still doesn't print correct utf-8 letters.
Yes, you're doing something wrong. You've misunderstood what the link you mentioned means. Dump & Load work with raw UTF-8 bytes; i.e. strings containing UTF-8 but with the UTF-8 flag off.
When you print those bytes to a filehandle with the :utf8 layer, they get interpreted as Latin-1 and converted to UTF-8, producing double-encoded output (which can be read back successfully as long as you double-decode it). You want to binmode STDOUT, ':raw' instead.
Another option is to call utf8::decode on the string returned by Dump. This will convert the raw UTF-8 bytes to a character string (with the UTF-8 flag on). You can then print the string to a :utf8 filehandle.
So, either
use utf8;
binmode STDOUT, ":raw";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Or
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
utf8::decode($s);
print $s;
Likewise, when reading from a file, you want to read in :raw mode or use utf8::encode on the string before passing it to Load.
When possible, you should just use DumpFile & LoadFile, letting YAML::XS deal with opening the file correctly. But if you want to use STDIN/STDOUT, you'll have to deal with Dump & Load.
It works if you don't use binmode STDOUT, ":utf8";. Just don't ask me why.
I'm using the next for the utf-8 JSON and YAML. No error handling, but can show how to do.
The bellow allows me:
uses NFC normalisation on input and NO NDF on output. Simply useing everything in NFC
can edit the YAML/JSON files with utf8 enabled vim and bash tools
"inside" the perl works things like \w regexes and lc uc and so on (at least for my needs)
source code is utf8, so can write regexes /á/
My "broilerplate"...
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use YAML::XS;
use JSON::XS;
run();
exit;
sub run {
my $yfilein = "./in.yaml"; #input yaml
my $jfilein = "./in.json"; #input json
my $yfileout = "./out.yaml"; #output yaml
my $jfileout = "./out.json"; #output json
my $ydata = load_utf8_yaml($yfilein);
my $jdata = load_utf8_json($jfilein);
#the "uc" is not "fully correct" but works for my needs
$ydata->{$_} = uc($ydata->{$_}) for keys %$ydata;
$jdata->{$_} = uc($jdata->{$_}) for keys %$jdata;
save_utf8_yaml($yfileout, $ydata);
save_utf8_json($jfileout, $jdata);
}
#using File::Slurp for read/write files
#NFC only on input - and not NFD on output (change this if you want)
#this ensure me than i can edit and copy/paste filenames without problems
sub load_utf8_yaml { return YAML::XS::Load(encode_nfc_read(shift)) }
sub load_utf8_json { return decode_json(encode_nfc_read(shift)) }
sub encode_nfc_read { return encode 'utf8', NFC read_file shift, { binmode => ':utf8' } }
#more effecient
sub rawsave_utf8_yaml { return write_file shift, {binmode=>':raw'}, YAML::XS::Dump shift }
#similar as for json
sub save_utf8_yaml { return write_file shift, {binmode=>':utf8'}, decode 'utf8', YAML::XS::Dump shift }
sub save_utf8_json { return write_file shift, {binmode=>':utf8'}, JSON::XS->new->pretty(1)->encode(shift) }
You can try the next in.yaml
---
á: ä
č: ď
é: ě
í: ĺ
ľ: ň
ó: ô
ö: ő
ŕ: ř
š: ť
ú: ů
ü: ű
ý: ž