Comparing two non-ascii strings in perl - perl

I am unable to compare two non-ascii strings, although both the strings appear the same on the console. Below is what I tried. Please let me know what code is missing here, so that the two variables shall be equal.
if($lineContent[7] ne $name) {
/*Control coming to here*/
print "###### Values MIS-MATCHED\n";
} else {
print "###### Values MATCHED\n";
}
$lineContent[7] is from a CSV file
$name is from an XML file
When Putty's console is in the default Characterset
CSV Val: ENB69-åºå°å±
XML Val: ENB69-åºå°å±
When Putty's Console is set to UTF-8
CSV Val: ENB69-基地局
XML Val: ENB69-基地局

#!/usr/bin/perl
use warnings;
use strict;
use Encode;
binmode STDOUT, ":encoding(utf8)";
open F1, "<:utf8", "$ARGV[0]" or die "$!";
open F2, "<", "$ARGV[0]" or die "$!";
my $a1 = <F1>;
chomp $a1;
my $a2 = <F2>;
chomp $a2;
if ($a1 eq $a2) {
print "$a1=$a2 is true\n";
} else {
print "$a1=$a2 is false\n";
}
my $b = decode("utf-8", $a2);
if ($a1 eq $b) {
print "$a1=$b is true\n";
} else {
print "$a1=$b is false\n";
}
I wrote a test program listed above. And create a text file with one line: 基地局.
When you run the program with this text file, you can get a false and a true.
I don't know what's in your program, but I guess the csv file is read as a plain text without any parsers or encode/decode procedures, whereas the xml file must be parsed by some library, so that the internal encoding mechanism is different for the two string variables, including some leading bytes of encoding notation.
Simply put, you can try to encode or decode one of the two string variables, and see if they match.
By the way, this is my first answer here, hope it can be a little bit helpful to you ;-)
From your dump results, it's obvious. The first variable stores 9 characters which constrcut 基地局 in utf-8 encoding in its internal structure. The second variable represents 3 characters in its internal structure. They have same byte stream, and are equal in a byte-stream view but not equal in a character-based comparison.
Use decode/encode can solve your problem.

Your inputs:
"ENB13-\345\237\272\345\234\260\345\261\200"
"ENB13-\x{57fa}\x{5730}\x{5c40}"
As you can see, these are clearly not the same. Specifically, the first is the UTF-8 encoding of the other. Always decode inputs. Always encode outputs.
use strict;
use warnings;
use utf8; # Source code is saved as UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8
my $name = "ENB69-基地局";
while ($line = <STDIN>) {
chomp;
my #lineContent = split /\t/, $line;
print($lineContent[7] eq $name ?1:0, "\n"); # 1
}

Personally I would be a little more careful if you know that you are comparing unicode strings. Unicode::Collate is the module for the job.
Of course you should also read tchrist's now-famous SO post on the topic of enabling unicode in Perl, https://stackoverflow.com/a/6163129/468327, but utf8::all does an admirable job of turning on proper unicode support. Note that better unicode handling was added to the Perl core in version 5.14 so I require that here as well.
Finally here is a quick script that does the comparison, of course you would populate the variables by reading the files as needed:
#!/usr/bin/env perl
use v5.14;
use strict;
use warnings;
use utf8::all;
use Unicode::Collate;
my $collator = Unicode::Collate->new;
my $csv = "ENB69-基地局";
my $xml = "ENB69-基地局";
say $collator->eq($csv, $xml) ? "equal" : "unequal";

Related

Split a string based on ASCII value

I need to parse a delimited file.(generated by mainframe job and ftped over to windows).But got few Queries while using the split on delimiter.
As per the documentation, the file is separated by '1D'. But when I open the file in notepad++(when I check the encoding tab, it is set to 'Encode in ANSI'), it seems to me like a 'vertical broken bar'. Q. Not sure what is '1D'?
open my $handle, '<', 'sample.txt';
chomp(my #lines = <$handle>);
close $handle;
my #a = unpack("C*", $lines[0]);
print Dumper \#a;
# $VAR1 = [65,166,66,166,67,166];
From dumper output, we see perl considers the ASCII for vertical broken bar to be 166.
As per link1, 166 is indeed vertical broken bar whereas as per link2, 166 is feminine ordinal indicator.Q. Any suggestion as to why the difference ?
my $str = $lines[0];
print Dumper $str;
# $VAR1 = 'AªBªCª';
We can see that the output contains 'feminine ordinal indicator' not 'vertical broken bar'.Q. Not sure why perl reads a 'bar' but then starts treating it as something else.
# I copied the vertical broken bar from notepad++ for use below
my #b = split(/¦/, $lines[0]);
print Dumper \#b;
# $VAR1 = [ 'AªBªCª' ];
Since perl has started treating bar to be something else, as expected, no split here.I thought to split by giving the ascii code of 166 directly. Seems split() doesn't support ASCII as an argument. Q. Any workaround to pass ASCII code to split() ?
# I copied the vertical broken bar from notepad++ and created A¦B¦C
my #c = split(/¦/, 'A¦B¦C');
print Dumper \#c;
#$VAR1 = [ 'A','B','C']; # works as expected, added here just for completion
Any pointers will be a great help!
Update:
my #a = map {ord $_} split //, $lines[0]; print Dumper \#a;
# $VAR1 = [ 65,166,66,166,67,166];
When you receive an input file from an unknown source, the most important thing to need to know about it is "what character encoding does it use?" Without that information, any processing that you do on the file is based on guesswork.
The problem isn't helped by people who talk about "extended ASCII" as though it's a meaningful term. ASCII only contains 128 characters. There are many definitions of what the next 128 character codes represent, and many of them are contradictory.
It seems that you have a solution to your problem. Splitting on '¦' (copied from Notepad++) does what you want. So I suggest you do that. If you want to use the actual character code, then you can convert 116 to hexadecimal (0xA6) and use that:
split /\xA6/, ... ;
You should always decode your inputs and encode your outputs.
my $acp;
BEGIN {
require Win32;
$acp = "cp".Win32::GetACP();
}
use open ':std', ":encoding($acp)";
Now, #lines will contain strings of Unicode Code Points. As such, you can now use the following:
use utf8; # Source code is encoded using UTF-8.
my #b = split(/¦/, $lines[0]);
Alternatively, every one of the following will also work now:
my #b = split(/\N{BROKEN BAR}/, $lines[0]);
my #b = split(/\N{U+00A6}/, $lines[0]);
my #b = split(/\x{A6}/, $lines[0]);
my #b = split(/\xA6/, $lines[0]);

perl how to detect corrupt data in CSV file?

I download a CSV file from another server using perl script. After download I wish to check whether the file contains any corrupted data or not. I tried to use Encode::Detect::Detector to detect encoding but it returns 'undef' in both cases:
if the string is ASCII or
if the string is corrupted
So using the below program I can't differentiate between ASCII & Corrupted Data.
use strict;
use Text::CSV;
use Encode::Detect::Detector;
use XML::Simple;
use Encode;
require Encode::Detect;
my #rows;
my $init_file = "new-data-jp-2013-8-8.csv";
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, $init_file or die $init_file.": $!";
while ( my $row = $csv->getline( $fh ) ) {
my #fields = #$row; # get line into array
for (my $i=1; $i<=23; $i++){ # I already know that CSV file has 23 columns
if ((Encode::Detect::Detector::detect($fields[$i-1])) eq undef){
print "the encoding is undef in col".$i.
" where field is ".$fields[$i-1].
" and its length is ".length($fields[$i-1])." \n";
}
else {
my $string = decode("Detect", $fields[$i-1]);
print "this is string print ".$string.
" the encoding is ".Encode::Detect::Detector::detect($fields[$i-1]).
" and its length is ".length($fields[$i-1])."\n";
}
}
}
You have some bad assumptions about encodings, and some errors in your script.
foo() eq undef
does not make any sense. You cannot compare to string equality to undef, as undef isn't a string. It does, however, stringify to the empty string. You should use warnings to get error messages when you do such rubbish. To test whether a value is not undef, use defined:
unless(defined foo()) { .... }
The Encode::Detector::Detect module uses an object oriented interface. Therefore,
Encode::Detect::Detector::detect($foo)
is wrong. According to the docs, you should be doing
Encode::Detect::Detector->detect($foo)
You probably cannot do decoding on a field-by-field basis. Usually, one document has one encoding. You need to specify the encoding when opening the file handle, e.g.
use autodie;
open my $fh, "<:utf8", $init_file;
While CSV can support some degree of binary data (like encoded text), it isn't well suited for this purpose, and you may want to choose another data format.
Finally, ASCII data effectively does not need any de- or encoding. The undef result for encoding detection does make sense here. It cannot be asserted with certaincy that a document was encoded to ASCII (as many encodings are a superset of ASCII), but given a certain document it can be asserted that it isn't valid ASCII (i.e. has the 8th bit set) but must rather be a more complex encoding like Latin-1, UTF-8.

searching words in Greek in Unix and Perl

I have txt files that are greek and now I want to search specific words in them using perl and bash ... the words are like ?a?, t?, e??
I was searching for words in english and now want to replace them by greek but all I get is ??? mostly... for Perl:
my %word = map { $_ => 1 } qw/name date birth/;
and for bash
for X in name date birth
do
can someone please help me?
#!/usr/bin/perl
use strict;
use warnings;
# Tell Perl your code is encoded using UTF-8.
use utf8;
# Tell Perl input and output is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';
my #words = qw( καί τό εἰς );
my %words = map { $_ => 1 } #words;
my $pat = join '|', map quotemeta, keys %words;
while (<>) {
if (/$pat/) {
print;
}
}
Usage:
script.pl file.in >file.out
Notes:
Make sure the source code is encoded using UTF-8 and that you use use utf8;.
Make sure you use the use open line and specify the appropriate encoding for your data file. (If it's not UTF-8, change it.)

Perl's YAML::XS and unicode

I am trying to use perl's YAML::XS module on unicode letters and it doesn't seem working the way it should.
I write this in the script (which is saved in utf-8)
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Instead of something sane, -: Å is printed. According to this link, though, it should be working fine.
Yes, when I YAML::XS::Load it back, I got the correct strings again, but I don't like the fact the dumped string seems to be in some wrong encoding.
Am I doing something wrong? I am always unsure about unicode in perl, to be frank...
clarification: my console supports UTF-8. Also, when I print it to file, opened with utf8 handle with open $file, ">:utf8" instead of STDOUT, it still doesn't print correct utf-8 letters.
Yes, you're doing something wrong. You've misunderstood what the link you mentioned means. Dump & Load work with raw UTF-8 bytes; i.e. strings containing UTF-8 but with the UTF-8 flag off.
When you print those bytes to a filehandle with the :utf8 layer, they get interpreted as Latin-1 and converted to UTF-8, producing double-encoded output (which can be read back successfully as long as you double-decode it). You want to binmode STDOUT, ':raw' instead.
Another option is to call utf8::decode on the string returned by Dump. This will convert the raw UTF-8 bytes to a character string (with the UTF-8 flag on). You can then print the string to a :utf8 filehandle.
So, either
use utf8;
binmode STDOUT, ":raw";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Or
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
utf8::decode($s);
print $s;
Likewise, when reading from a file, you want to read in :raw mode or use utf8::encode on the string before passing it to Load.
When possible, you should just use DumpFile & LoadFile, letting YAML::XS deal with opening the file correctly. But if you want to use STDIN/STDOUT, you'll have to deal with Dump & Load.
It works if you don't use binmode STDOUT, ":utf8";. Just don't ask me why.
I'm using the next for the utf-8 JSON and YAML. No error handling, but can show how to do.
The bellow allows me:
uses NFC normalisation on input and NO NDF on output. Simply useing everything in NFC
can edit the YAML/JSON files with utf8 enabled vim and bash tools
"inside" the perl works things like \w regexes and lc uc and so on (at least for my needs)
source code is utf8, so can write regexes /á/
My "broilerplate"...
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use YAML::XS;
use JSON::XS;
run();
exit;
sub run {
my $yfilein = "./in.yaml"; #input yaml
my $jfilein = "./in.json"; #input json
my $yfileout = "./out.yaml"; #output yaml
my $jfileout = "./out.json"; #output json
my $ydata = load_utf8_yaml($yfilein);
my $jdata = load_utf8_json($jfilein);
#the "uc" is not "fully correct" but works for my needs
$ydata->{$_} = uc($ydata->{$_}) for keys %$ydata;
$jdata->{$_} = uc($jdata->{$_}) for keys %$jdata;
save_utf8_yaml($yfileout, $ydata);
save_utf8_json($jfileout, $jdata);
}
#using File::Slurp for read/write files
#NFC only on input - and not NFD on output (change this if you want)
#this ensure me than i can edit and copy/paste filenames without problems
sub load_utf8_yaml { return YAML::XS::Load(encode_nfc_read(shift)) }
sub load_utf8_json { return decode_json(encode_nfc_read(shift)) }
sub encode_nfc_read { return encode 'utf8', NFC read_file shift, { binmode => ':utf8' } }
#more effecient
sub rawsave_utf8_yaml { return write_file shift, {binmode=>':raw'}, YAML::XS::Dump shift }
#similar as for json
sub save_utf8_yaml { return write_file shift, {binmode=>':utf8'}, decode 'utf8', YAML::XS::Dump shift }
sub save_utf8_json { return write_file shift, {binmode=>':utf8'}, JSON::XS->new->pretty(1)->encode(shift) }
You can try the next in.yaml
---
á: ä
č: ď
é: ě
í: ĺ
ľ: ň
ó: ô
ö: ő
ŕ: ř
š: ť
ú: ů
ü: ű
ý: ž

How can I guess the encoding of a string in Perl?

I have a Unicode string and don't know what its encoding is. When this string is read by a Perl program, is there a default encoding that Perl will use? If so, how can I find out what it is?
I am trying to get rid of non-ASCII characters from the input. I found this on some forum that will do it:
my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});
How will the above work when no input encoding is specified? Should it be specified like the following?
my $line = encode('ascii', normalize('KD', decode($myutf, 'input-encoding'), sub {$_[0] = ''});
To find out in which encoding something unknown uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)
use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
"\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
"\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
"\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030
use Encode;
my $string = decode($encoding_name, $unknown);
I find encode 'ascii' is a lame solution for getting rid of non-ASCII characters. Everything will be substituted with questions marks; this is too lossy to be useful.
# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.
use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing Perl workshop.
However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either one of PERLQQ or XMLCREF.
use utf8;
use Encode qw(encode PERLQQ XMLCREF);
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string, PERLQQ); # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF); # This year I went to 北京 Perl workshop.
The Encode module has a way that you can try to do this. You decode the raw octets with what you think the encoding is. If the octets don't represent a valid encoding, it blows up and you catch it with an eval. Otherwise, you get back a properly encoded string. For example:
use Encode;
my $a_with_ring =
eval { decode( 'UTF-8', "\x6b\xc5", Encode::FB_CROAK ) }
or die "Could not decode string: $#";
This has the drawback that the same octet sequence can be valid in multiple encodings
I have more to say about this in the upcoming Effective Perl Programming, 2nd Edition, which has an entire chapter on dealing with Unicode. I think my publisher would get mad if I posted the whole thing though. :)
You might also want to see Juerd's Unicode Advice, as well as some of the Unicode docs that come with Perl.
I like mscha's solution here, but simplified using Perl's defined-or operator (//):
sub slurp($file)
local $/;
open(my $fh, '<:raw', $file) or return undef();
my $raw = <$fh>;
close($fh);
# return the first successful decoding result
return
eval { Encode::decode('utf-8', $raw, Encode::FB_CROAK); } // # Try UTF-8
eval { Encode::decode('windows-1252', $raw, Encode::FB_CROAK); } // # Try windows-1252 (a superset of iso-8859-1 and ascii)
$raw; # Give up and return the raw bytes
}
The first successful decoding is returned. Plain ASCII content succeeds in the first decoding.
If you are working directly with string variables instead of reading in files, you can use just the successive-eval expression.
You can use the following code also, to encrypt and decrypt the code
sub ENCRYPT_DECRYPT() {
my $Str_Message=$_[0];
my $Len_Str_Message=length($Str_Message);
my $Str_Encrypted_Message="";
for (my $Position = 0;$Position<$Len_Str_Message;$Position++){
my $Key_To_Use = (($Len_Str_Message+$Position)+1);
$Key_To_Use =(255+$Key_To_Use) % 255;
my $Byte_To_Be_Encrypted = substr($Str_Message, $Position, 1);
my $Ascii_Num_Byte_To_Encrypt = ord($Byte_To_Be_Encrypted);
my $Xored_Byte = $Ascii_Num_Byte_To_Encrypt ^ $Key_To_Use;
my $Encrypted_Byte = chr($Xored_Byte);
$Str_Encrypted_Message .= $Encrypted_Byte;
}
return $Str_Encrypted_Message;
}
my $var=&ENCRYPT_DECRYPT("hai");
print &ENCRYPT_DECRYPT($var);