Perl's YAML::XS and unicode - perl

I am trying to use perl's YAML::XS module on unicode letters and it doesn't seem working the way it should.
I write this in the script (which is saved in utf-8)
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Instead of something sane, -: Å is printed. According to this link, though, it should be working fine.
Yes, when I YAML::XS::Load it back, I got the correct strings again, but I don't like the fact the dumped string seems to be in some wrong encoding.
Am I doing something wrong? I am always unsure about unicode in perl, to be frank...
clarification: my console supports UTF-8. Also, when I print it to file, opened with utf8 handle with open $file, ">:utf8" instead of STDOUT, it still doesn't print correct utf-8 letters.

Yes, you're doing something wrong. You've misunderstood what the link you mentioned means. Dump & Load work with raw UTF-8 bytes; i.e. strings containing UTF-8 but with the UTF-8 flag off.
When you print those bytes to a filehandle with the :utf8 layer, they get interpreted as Latin-1 and converted to UTF-8, producing double-encoded output (which can be read back successfully as long as you double-decode it). You want to binmode STDOUT, ':raw' instead.
Another option is to call utf8::decode on the string returned by Dump. This will convert the raw UTF-8 bytes to a character string (with the UTF-8 flag on). You can then print the string to a :utf8 filehandle.
So, either
use utf8;
binmode STDOUT, ":raw";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Or
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
utf8::decode($s);
print $s;
Likewise, when reading from a file, you want to read in :raw mode or use utf8::encode on the string before passing it to Load.
When possible, you should just use DumpFile & LoadFile, letting YAML::XS deal with opening the file correctly. But if you want to use STDIN/STDOUT, you'll have to deal with Dump & Load.

It works if you don't use binmode STDOUT, ":utf8";. Just don't ask me why.

I'm using the next for the utf-8 JSON and YAML. No error handling, but can show how to do.
The bellow allows me:
uses NFC normalisation on input and NO NDF on output. Simply useing everything in NFC
can edit the YAML/JSON files with utf8 enabled vim and bash tools
"inside" the perl works things like \w regexes and lc uc and so on (at least for my needs)
source code is utf8, so can write regexes /á/
My "broilerplate"...
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use YAML::XS;
use JSON::XS;
run();
exit;
sub run {
my $yfilein = "./in.yaml"; #input yaml
my $jfilein = "./in.json"; #input json
my $yfileout = "./out.yaml"; #output yaml
my $jfileout = "./out.json"; #output json
my $ydata = load_utf8_yaml($yfilein);
my $jdata = load_utf8_json($jfilein);
#the "uc" is not "fully correct" but works for my needs
$ydata->{$_} = uc($ydata->{$_}) for keys %$ydata;
$jdata->{$_} = uc($jdata->{$_}) for keys %$jdata;
save_utf8_yaml($yfileout, $ydata);
save_utf8_json($jfileout, $jdata);
}
#using File::Slurp for read/write files
#NFC only on input - and not NFD on output (change this if you want)
#this ensure me than i can edit and copy/paste filenames without problems
sub load_utf8_yaml { return YAML::XS::Load(encode_nfc_read(shift)) }
sub load_utf8_json { return decode_json(encode_nfc_read(shift)) }
sub encode_nfc_read { return encode 'utf8', NFC read_file shift, { binmode => ':utf8' } }
#more effecient
sub rawsave_utf8_yaml { return write_file shift, {binmode=>':raw'}, YAML::XS::Dump shift }
#similar as for json
sub save_utf8_yaml { return write_file shift, {binmode=>':utf8'}, decode 'utf8', YAML::XS::Dump shift }
sub save_utf8_json { return write_file shift, {binmode=>':utf8'}, JSON::XS->new->pretty(1)->encode(shift) }
You can try the next in.yaml
---
á: ä
č: ď
é: ě
í: ĺ
ľ: ň
ó: ô
ö: ő
ŕ: ř
š: ť
ú: ů
ü: ű
ý: ž

Related

How do I decode a double backslashed PERLQQ escaped string into Perl characters?

I read lines from a file which contains semi-utf8 encoding and I wish to convert it to Perl-internal representation for further operations.
file.in (plain ASCII):
MO\\xc5\\xbdN\\xc3\\x81
NOV\\xc3\\x81
These should translate to MOŽNÁ and NOVÁ.
I load the lines and upgrade them to proper utf8 notation, ie. \\xc5\\xbd -> \x{00c5}\x{00bd}. Then I would like to take this upgraded $line and make perl to represent it internally:
for my $line (#lines) {
$line =~ s/x(..)/x{00$1}/g;
eval { $l = "$line"; };
}
Unfortunately, without success.
use File::Slurp qw(read_file);
use Encode qw(decode);
use Encode::Escape qw();
my $string =
decode 'UTF-8', # octets → characters
decode 'unicode-escape', # \x → octets
decode 'ascii-escape', # \\x → \x
read_file 'file.in';
Read from the bottom upwards.

searching words in Greek in Unix and Perl

I have txt files that are greek and now I want to search specific words in them using perl and bash ... the words are like ?a?, t?, e??
I was searching for words in english and now want to replace them by greek but all I get is ??? mostly... for Perl:
my %word = map { $_ => 1 } qw/name date birth/;
and for bash
for X in name date birth
do
can someone please help me?
#!/usr/bin/perl
use strict;
use warnings;
# Tell Perl your code is encoded using UTF-8.
use utf8;
# Tell Perl input and output is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';
my #words = qw( καί τό εἰς );
my %words = map { $_ => 1 } #words;
my $pat = join '|', map quotemeta, keys %words;
while (<>) {
if (/$pat/) {
print;
}
}
Usage:
script.pl file.in >file.out
Notes:
Make sure the source code is encoded using UTF-8 and that you use use utf8;.
Make sure you use the use open line and specify the appropriate encoding for your data file. (If it's not UTF-8, change it.)

Comparing two non-ascii strings in perl

I am unable to compare two non-ascii strings, although both the strings appear the same on the console. Below is what I tried. Please let me know what code is missing here, so that the two variables shall be equal.
if($lineContent[7] ne $name) {
/*Control coming to here*/
print "###### Values MIS-MATCHED\n";
} else {
print "###### Values MATCHED\n";
}
$lineContent[7] is from a CSV file
$name is from an XML file
When Putty's console is in the default Characterset
CSV Val: ENB69-åºå°å±
XML Val: ENB69-åºå°å±
When Putty's Console is set to UTF-8
CSV Val: ENB69-基地局
XML Val: ENB69-基地局
#!/usr/bin/perl
use warnings;
use strict;
use Encode;
binmode STDOUT, ":encoding(utf8)";
open F1, "<:utf8", "$ARGV[0]" or die "$!";
open F2, "<", "$ARGV[0]" or die "$!";
my $a1 = <F1>;
chomp $a1;
my $a2 = <F2>;
chomp $a2;
if ($a1 eq $a2) {
print "$a1=$a2 is true\n";
} else {
print "$a1=$a2 is false\n";
}
my $b = decode("utf-8", $a2);
if ($a1 eq $b) {
print "$a1=$b is true\n";
} else {
print "$a1=$b is false\n";
}
I wrote a test program listed above. And create a text file with one line: 基地局.
When you run the program with this text file, you can get a false and a true.
I don't know what's in your program, but I guess the csv file is read as a plain text without any parsers or encode/decode procedures, whereas the xml file must be parsed by some library, so that the internal encoding mechanism is different for the two string variables, including some leading bytes of encoding notation.
Simply put, you can try to encode or decode one of the two string variables, and see if they match.
By the way, this is my first answer here, hope it can be a little bit helpful to you ;-)
From your dump results, it's obvious. The first variable stores 9 characters which constrcut 基地局 in utf-8 encoding in its internal structure. The second variable represents 3 characters in its internal structure. They have same byte stream, and are equal in a byte-stream view but not equal in a character-based comparison.
Use decode/encode can solve your problem.
Your inputs:
"ENB13-\345\237\272\345\234\260\345\261\200"
"ENB13-\x{57fa}\x{5730}\x{5c40}"
As you can see, these are clearly not the same. Specifically, the first is the UTF-8 encoding of the other. Always decode inputs. Always encode outputs.
use strict;
use warnings;
use utf8; # Source code is saved as UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8
my $name = "ENB69-基地局";
while ($line = <STDIN>) {
chomp;
my #lineContent = split /\t/, $line;
print($lineContent[7] eq $name ?1:0, "\n"); # 1
}
Personally I would be a little more careful if you know that you are comparing unicode strings. Unicode::Collate is the module for the job.
Of course you should also read tchrist's now-famous SO post on the topic of enabling unicode in Perl, https://stackoverflow.com/a/6163129/468327, but utf8::all does an admirable job of turning on proper unicode support. Note that better unicode handling was added to the Perl core in version 5.14 so I require that here as well.
Finally here is a quick script that does the comparison, of course you would populate the variables by reading the files as needed:
#!/usr/bin/env perl
use v5.14;
use strict;
use warnings;
use utf8::all;
use Unicode::Collate;
my $collator = Unicode::Collate->new;
my $csv = "ENB69-基地局";
my $xml = "ENB69-基地局";
say $collator->eq($csv, $xml) ? "equal" : "unequal";

How to convince SOAP::Lite to return UTF-8 data in responses as UTF-8?

I'm trying to transmit UTF-8 strings in complex data structures with SOAP::Lite. However, as it turns out, SOAP::Lite quietly converts all UTF-8 strings into base-64-encoded octets. The problem with that is that the deserializing does not revert the conversion and only does a straight base64 decode.
This leaves me confused as to how a user is supposed to ensure that they get UTF-8 data from the SOAP::Lite response. Walking the tree and running decode_utf8 on all strings seems wasteful.
Any suggestions?
Edit: In a nutshell, how do i make this test pass without monkey-patching?
I just hit the same problem and found the above discussion useful. As you say in the OP, the problem is that the data is encoded in base64 and the is_utf8 flag get lost. what happens in the serlializer treats any string with a non-ascii character as binary. I got it to do what I wanted by tweaking the serializer as below. It could have odd consequences, but it works in my situation..
use strictures;
use Test::More;
use SOAP::Lite;
use utf8;
use Data::Dumper;
my $data = "mü\x{2013}";
my $ser = SOAP::Serializer->new;
$ser->typelookup->{trick_into_ignoring} = [9, \&utf8::is_utf8 ,'as_utf8_string'];
my $xml = $ser->envelope( freeform => $data );
my ( $cycled ) = values %{ SOAP::Deserializer->deserialize( $xml )->body };
is( length( $data ), length( $cycled ), "UTF-8 string is the same after serializing" );
done_testing;
sub check_utf8 {
my ($val) = #_;
return utf8::is_utf8($val);
}
package SOAP::Serializer;
sub as_utf8_string {
my $self = shift;
my($value, $name, $type, $attr) = #_;
return $self->as_string($value, $name, $type, $attr);
}
1;
The 9 means the utf8 check is performed before the check for non-ascii characters. if the utf8 flag is on then it treats it as a 'normal' string.
Use of is_utf8 (line 278) is evil and wrong. As we can't trust SOAP::Lite with encoding character data properly (to be fair, this code was likely written before word got around in the community how to do this particular kind of string processing), we shall give it octet data only and therefore have to handle encoding/decoding ourself. Pick a single encoding, apply it before handing off data to S::L, reverse it after receiving data.
use utf8;
use strictures;
use Encode qw(decode encode);
use SOAP::Lite qw();
use Test::More;
my $original = 'mü';
my $xml = SOAP::Serializer->envelope(
freeform => encode('UTF-8', $original, Encode::FB_CROAK | Encode::LEAVE_SRC)
);
my ($roundtrip) = map {
decode('UTF-8', $_, Encode::FB_CROAK | Encode::LEAVE_SRC)
} values %{SOAP::Deserializer->deserialize($xml)->body};
is(length($original), length($roundtrip),
'Perl character string round-trips without changing length');
done_testing;

How can I guess the encoding of a string in Perl?

I have a Unicode string and don't know what its encoding is. When this string is read by a Perl program, is there a default encoding that Perl will use? If so, how can I find out what it is?
I am trying to get rid of non-ASCII characters from the input. I found this on some forum that will do it:
my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});
How will the above work when no input encoding is specified? Should it be specified like the following?
my $line = encode('ascii', normalize('KD', decode($myutf, 'input-encoding'), sub {$_[0] = ''});
To find out in which encoding something unknown uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)
use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
"\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
"\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
"\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030
use Encode;
my $string = decode($encoding_name, $unknown);
I find encode 'ascii' is a lame solution for getting rid of non-ASCII characters. Everything will be substituted with questions marks; this is too lossy to be useful.
# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.
use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing Perl workshop.
However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either one of PERLQQ or XMLCREF.
use utf8;
use Encode qw(encode PERLQQ XMLCREF);
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string, PERLQQ); # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF); # This year I went to 北京 Perl workshop.
The Encode module has a way that you can try to do this. You decode the raw octets with what you think the encoding is. If the octets don't represent a valid encoding, it blows up and you catch it with an eval. Otherwise, you get back a properly encoded string. For example:
use Encode;
my $a_with_ring =
eval { decode( 'UTF-8', "\x6b\xc5", Encode::FB_CROAK ) }
or die "Could not decode string: $#";
This has the drawback that the same octet sequence can be valid in multiple encodings
I have more to say about this in the upcoming Effective Perl Programming, 2nd Edition, which has an entire chapter on dealing with Unicode. I think my publisher would get mad if I posted the whole thing though. :)
You might also want to see Juerd's Unicode Advice, as well as some of the Unicode docs that come with Perl.
I like mscha's solution here, but simplified using Perl's defined-or operator (//):
sub slurp($file)
local $/;
open(my $fh, '<:raw', $file) or return undef();
my $raw = <$fh>;
close($fh);
# return the first successful decoding result
return
eval { Encode::decode('utf-8', $raw, Encode::FB_CROAK); } // # Try UTF-8
eval { Encode::decode('windows-1252', $raw, Encode::FB_CROAK); } // # Try windows-1252 (a superset of iso-8859-1 and ascii)
$raw; # Give up and return the raw bytes
}
The first successful decoding is returned. Plain ASCII content succeeds in the first decoding.
If you are working directly with string variables instead of reading in files, you can use just the successive-eval expression.
You can use the following code also, to encrypt and decrypt the code
sub ENCRYPT_DECRYPT() {
my $Str_Message=$_[0];
my $Len_Str_Message=length($Str_Message);
my $Str_Encrypted_Message="";
for (my $Position = 0;$Position<$Len_Str_Message;$Position++){
my $Key_To_Use = (($Len_Str_Message+$Position)+1);
$Key_To_Use =(255+$Key_To_Use) % 255;
my $Byte_To_Be_Encrypted = substr($Str_Message, $Position, 1);
my $Ascii_Num_Byte_To_Encrypt = ord($Byte_To_Be_Encrypted);
my $Xored_Byte = $Ascii_Num_Byte_To_Encrypt ^ $Key_To_Use;
my $Encrypted_Byte = chr($Xored_Byte);
$Str_Encrypted_Message .= $Encrypted_Byte;
}
return $Str_Encrypted_Message;
}
my $var=&ENCRYPT_DECRYPT("hai");
print &ENCRYPT_DECRYPT($var);