Perl Text::CSV_XS Encoding Issues - perl

I'm having issues with Unicode characters in Perl. When I receive data in from the web, I often get characters like “ or €. The first one is a quotation mark and the second is the Euro symbol.
Now I can easily substitute in the correct values in Perl and print to the screen the corrected words, but when I try to output to a .CSV file all the substitutions I have done are for nothing and I get garbage in my .CSV file. (The quotes work, guessing since it's such a general character). Also Numéro will give Numéro. The examples are endless.
I wrote a small program to try and figure this issue out, but am not sure what the problem is. I read on another stack overflow thread that you can import the .CSV in Excel and choose UTF8 encoding, this option does not pop up for me though. I'm wondering if I can just encode it into whatever Excel's native character set is (UTF16BE???), or if there is another solution. I have tried many variations on this short program, and let me say again that its just for testing out Unicode problems, not a part of a legit program. Thanks.
use strict;
use warnings;
require Text::CSV_XS;
use Encode qw/encode decode/;
my $text = 'Numéro Numéro Numéro Orkos Capital SAS (√¢¬Ä¬úOrkos√¢¬Ä¬ù) 325M√¢¬Ç¬¨ in 40 companies headquartered';
print("$text\n\n\n");
$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;
print("$text\n\n\n");
my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV->error_diag();
open my $OUTPUT, ">:encoding(utf8)", "unicode.csv" or die "unicode.csv: $!";
my #row = ($text);
$CSV->print($OUTPUT, \#row);
$OUTPUT->autoflush(1);
I've also tried these two lines to no avail:
$text = decode("Guess", $text);
$text = encode("UTF-16BE", $text);

First, your strings are encoded in MacRoman. When you interpret them as byte sequences the second results in C3 A2 C2 82 C2 AC. This looks like UTF-8, and the decoded form is E2 82 AC. This again looks like UTF-8, and when you decode it you get €. So what you need to do is:
$step1 = decode("MacRoman", $text);
$step2 = decode("UTF-8", $step1);
$step3 = decode("UTF-8", $step2);
Don't ask me on which mysterious ways this encoding has been created in the first place. Your first character decodes as U+201C, which is indeed the LEFT DOUBLE QUOTATION MARK.
Note: If you are on a Mac, the first decoding step may be unnecessary since the encoding is only in the "presentation layer" (when you copied the Perl source into the HTML form and your browser did the encoding-translation for you) and not in the data itself.

So I figured out the answer, the comment from Roland Illig helped me get there (thanks again!). Decoding more than once causes the wide characters error, and therefore should not be done.
The key here is decoding the UTF-8 Text and then encoding it in MacRoman. To send the .CSV files to my Windows friends I have to save it as .XLSX first so that the coding doesn't get all screwy again.
$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;
$text = decode("UTF-8", $text);
print("$text\n\n\n");
my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV->error_diag();
open my $OUTPUT, ">:encoding(MacRoman)", "unicode.csv" or die "unicode.csv: $!";

Related

Split a string based on ASCII value

I need to parse a delimited file.(generated by mainframe job and ftped over to windows).But got few Queries while using the split on delimiter.
As per the documentation, the file is separated by '1D'. But when I open the file in notepad++(when I check the encoding tab, it is set to 'Encode in ANSI'), it seems to me like a 'vertical broken bar'. Q. Not sure what is '1D'?
open my $handle, '<', 'sample.txt';
chomp(my #lines = <$handle>);
close $handle;
my #a = unpack("C*", $lines[0]);
print Dumper \#a;
# $VAR1 = [65,166,66,166,67,166];
From dumper output, we see perl considers the ASCII for vertical broken bar to be 166.
As per link1, 166 is indeed vertical broken bar whereas as per link2, 166 is feminine ordinal indicator.Q. Any suggestion as to why the difference ?
my $str = $lines[0];
print Dumper $str;
# $VAR1 = 'AªBªCª';
We can see that the output contains 'feminine ordinal indicator' not 'vertical broken bar'.Q. Not sure why perl reads a 'bar' but then starts treating it as something else.
# I copied the vertical broken bar from notepad++ for use below
my #b = split(/¦/, $lines[0]);
print Dumper \#b;
# $VAR1 = [ 'AªBªCª' ];
Since perl has started treating bar to be something else, as expected, no split here.I thought to split by giving the ascii code of 166 directly. Seems split() doesn't support ASCII as an argument. Q. Any workaround to pass ASCII code to split() ?
# I copied the vertical broken bar from notepad++ and created A¦B¦C
my #c = split(/¦/, 'A¦B¦C');
print Dumper \#c;
#$VAR1 = [ 'A','B','C']; # works as expected, added here just for completion
Any pointers will be a great help!
Update:
my #a = map {ord $_} split //, $lines[0]; print Dumper \#a;
# $VAR1 = [ 65,166,66,166,67,166];
When you receive an input file from an unknown source, the most important thing to need to know about it is "what character encoding does it use?" Without that information, any processing that you do on the file is based on guesswork.
The problem isn't helped by people who talk about "extended ASCII" as though it's a meaningful term. ASCII only contains 128 characters. There are many definitions of what the next 128 character codes represent, and many of them are contradictory.
It seems that you have a solution to your problem. Splitting on '¦' (copied from Notepad++) does what you want. So I suggest you do that. If you want to use the actual character code, then you can convert 116 to hexadecimal (0xA6) and use that:
split /\xA6/, ... ;
You should always decode your inputs and encode your outputs.
my $acp;
BEGIN {
require Win32;
$acp = "cp".Win32::GetACP();
}
use open ':std', ":encoding($acp)";
Now, #lines will contain strings of Unicode Code Points. As such, you can now use the following:
use utf8; # Source code is encoded using UTF-8.
my #b = split(/¦/, $lines[0]);
Alternatively, every one of the following will also work now:
my #b = split(/\N{BROKEN BAR}/, $lines[0]);
my #b = split(/\N{U+00A6}/, $lines[0]);
my #b = split(/\x{A6}/, $lines[0]);
my #b = split(/\xA6/, $lines[0]);

perl how to detect corrupt data in CSV file?

I download a CSV file from another server using perl script. After download I wish to check whether the file contains any corrupted data or not. I tried to use Encode::Detect::Detector to detect encoding but it returns 'undef' in both cases:
if the string is ASCII or
if the string is corrupted
So using the below program I can't differentiate between ASCII & Corrupted Data.
use strict;
use Text::CSV;
use Encode::Detect::Detector;
use XML::Simple;
use Encode;
require Encode::Detect;
my #rows;
my $init_file = "new-data-jp-2013-8-8.csv";
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, $init_file or die $init_file.": $!";
while ( my $row = $csv->getline( $fh ) ) {
my #fields = #$row; # get line into array
for (my $i=1; $i<=23; $i++){ # I already know that CSV file has 23 columns
if ((Encode::Detect::Detector::detect($fields[$i-1])) eq undef){
print "the encoding is undef in col".$i.
" where field is ".$fields[$i-1].
" and its length is ".length($fields[$i-1])." \n";
}
else {
my $string = decode("Detect", $fields[$i-1]);
print "this is string print ".$string.
" the encoding is ".Encode::Detect::Detector::detect($fields[$i-1]).
" and its length is ".length($fields[$i-1])."\n";
}
}
}
You have some bad assumptions about encodings, and some errors in your script.
foo() eq undef
does not make any sense. You cannot compare to string equality to undef, as undef isn't a string. It does, however, stringify to the empty string. You should use warnings to get error messages when you do such rubbish. To test whether a value is not undef, use defined:
unless(defined foo()) { .... }
The Encode::Detector::Detect module uses an object oriented interface. Therefore,
Encode::Detect::Detector::detect($foo)
is wrong. According to the docs, you should be doing
Encode::Detect::Detector->detect($foo)
You probably cannot do decoding on a field-by-field basis. Usually, one document has one encoding. You need to specify the encoding when opening the file handle, e.g.
use autodie;
open my $fh, "<:utf8", $init_file;
While CSV can support some degree of binary data (like encoded text), it isn't well suited for this purpose, and you may want to choose another data format.
Finally, ASCII data effectively does not need any de- or encoding. The undef result for encoding detection does make sense here. It cannot be asserted with certaincy that a document was encoded to ASCII (as many encodings are a superset of ASCII), but given a certain document it can be asserted that it isn't valid ASCII (i.e. has the 8th bit set) but must rather be a more complex encoding like Latin-1, UTF-8.

Comparing two non-ascii strings in perl

I am unable to compare two non-ascii strings, although both the strings appear the same on the console. Below is what I tried. Please let me know what code is missing here, so that the two variables shall be equal.
if($lineContent[7] ne $name) {
/*Control coming to here*/
print "###### Values MIS-MATCHED\n";
} else {
print "###### Values MATCHED\n";
}
$lineContent[7] is from a CSV file
$name is from an XML file
When Putty's console is in the default Characterset
CSV Val: ENB69-åºå°å±
XML Val: ENB69-åºå°å±
When Putty's Console is set to UTF-8
CSV Val: ENB69-基地局
XML Val: ENB69-基地局
#!/usr/bin/perl
use warnings;
use strict;
use Encode;
binmode STDOUT, ":encoding(utf8)";
open F1, "<:utf8", "$ARGV[0]" or die "$!";
open F2, "<", "$ARGV[0]" or die "$!";
my $a1 = <F1>;
chomp $a1;
my $a2 = <F2>;
chomp $a2;
if ($a1 eq $a2) {
print "$a1=$a2 is true\n";
} else {
print "$a1=$a2 is false\n";
}
my $b = decode("utf-8", $a2);
if ($a1 eq $b) {
print "$a1=$b is true\n";
} else {
print "$a1=$b is false\n";
}
I wrote a test program listed above. And create a text file with one line: 基地局.
When you run the program with this text file, you can get a false and a true.
I don't know what's in your program, but I guess the csv file is read as a plain text without any parsers or encode/decode procedures, whereas the xml file must be parsed by some library, so that the internal encoding mechanism is different for the two string variables, including some leading bytes of encoding notation.
Simply put, you can try to encode or decode one of the two string variables, and see if they match.
By the way, this is my first answer here, hope it can be a little bit helpful to you ;-)
From your dump results, it's obvious. The first variable stores 9 characters which constrcut 基地局 in utf-8 encoding in its internal structure. The second variable represents 3 characters in its internal structure. They have same byte stream, and are equal in a byte-stream view but not equal in a character-based comparison.
Use decode/encode can solve your problem.
Your inputs:
"ENB13-\345\237\272\345\234\260\345\261\200"
"ENB13-\x{57fa}\x{5730}\x{5c40}"
As you can see, these are clearly not the same. Specifically, the first is the UTF-8 encoding of the other. Always decode inputs. Always encode outputs.
use strict;
use warnings;
use utf8; # Source code is saved as UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8
my $name = "ENB69-基地局";
while ($line = <STDIN>) {
chomp;
my #lineContent = split /\t/, $line;
print($lineContent[7] eq $name ?1:0, "\n"); # 1
}
Personally I would be a little more careful if you know that you are comparing unicode strings. Unicode::Collate is the module for the job.
Of course you should also read tchrist's now-famous SO post on the topic of enabling unicode in Perl, https://stackoverflow.com/a/6163129/468327, but utf8::all does an admirable job of turning on proper unicode support. Note that better unicode handling was added to the Perl core in version 5.14 so I require that here as well.
Finally here is a quick script that does the comparison, of course you would populate the variables by reading the files as needed:
#!/usr/bin/env perl
use v5.14;
use strict;
use warnings;
use utf8::all;
use Unicode::Collate;
my $collator = Unicode::Collate->new;
my $csv = "ENB69-基地局";
my $xml = "ENB69-基地局";
say $collator->eq($csv, $xml) ? "equal" : "unequal";

Losing encoding when opening and saving a file

I'm trying to open a file with regular HTML and special Unicode characters such as "ÖÄÅ öäå" (Swedish), format it and then output it to a file.
So far everything works out great, I can open the file, find the parts I need and output into a file.
But here is the point:
I can't save the inputted Unicode data into the file without losing my encoding (eg. an 'ö' becomes 'ö').
Although I can, by manually entering them into the code itself, manage to both perform regex and output them to correct encoding. But not when I'm importing a file, formatting it and then outputting.
Example on working approach when using OCT (eg. this can output to the file without the encoding problem):
my $charsSWE = "öäåÅÄÖ";
# \344 = ä
# \345 = å
# \305 = Å
# \304 = Ä
# \326 = Ö
# \366 = ö
my $SwedishLetters = '\344 \345 \305 \304 \326 \366';
if($charsSWE =~ /([$SwedishLetters]+)/){
print "Output: $1\n";
}
The way below does not work because the encoding is lost (this is a quick illustration of the part of the code but its concept is the same [eg. open file, fetch and output]):
open(FH, 'swedish.htm') or die("File could not be opened");
while(<FH>)
{
my #List = /([$SwedishLetters]+)/g;
message($List[0]) if #List;
}
close(FH);
use Encode;
open FILE1, "<:encoding(UTF-8)", "swedish.htm" or die $!;
#do stuff
open FILE2, ">:encoding(UTF-8)", "output.htm" or die $!;
You may need to use a different encoding.

How can I guess the encoding of a string in Perl?

I have a Unicode string and don't know what its encoding is. When this string is read by a Perl program, is there a default encoding that Perl will use? If so, how can I find out what it is?
I am trying to get rid of non-ASCII characters from the input. I found this on some forum that will do it:
my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});
How will the above work when no input encoding is specified? Should it be specified like the following?
my $line = encode('ascii', normalize('KD', decode($myutf, 'input-encoding'), sub {$_[0] = ''});
To find out in which encoding something unknown uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)
use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
"\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
"\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
"\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030
use Encode;
my $string = decode($encoding_name, $unknown);
I find encode 'ascii' is a lame solution for getting rid of non-ASCII characters. Everything will be substituted with questions marks; this is too lossy to be useful.
# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.
use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing Perl workshop.
However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick either one of PERLQQ or XMLCREF.
use utf8;
use Encode qw(encode PERLQQ XMLCREF);
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string, PERLQQ); # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF); # This year I went to 北京 Perl workshop.
The Encode module has a way that you can try to do this. You decode the raw octets with what you think the encoding is. If the octets don't represent a valid encoding, it blows up and you catch it with an eval. Otherwise, you get back a properly encoded string. For example:
use Encode;
my $a_with_ring =
eval { decode( 'UTF-8', "\x6b\xc5", Encode::FB_CROAK ) }
or die "Could not decode string: $#";
This has the drawback that the same octet sequence can be valid in multiple encodings
I have more to say about this in the upcoming Effective Perl Programming, 2nd Edition, which has an entire chapter on dealing with Unicode. I think my publisher would get mad if I posted the whole thing though. :)
You might also want to see Juerd's Unicode Advice, as well as some of the Unicode docs that come with Perl.
I like mscha's solution here, but simplified using Perl's defined-or operator (//):
sub slurp($file)
local $/;
open(my $fh, '<:raw', $file) or return undef();
my $raw = <$fh>;
close($fh);
# return the first successful decoding result
return
eval { Encode::decode('utf-8', $raw, Encode::FB_CROAK); } // # Try UTF-8
eval { Encode::decode('windows-1252', $raw, Encode::FB_CROAK); } // # Try windows-1252 (a superset of iso-8859-1 and ascii)
$raw; # Give up and return the raw bytes
}
The first successful decoding is returned. Plain ASCII content succeeds in the first decoding.
If you are working directly with string variables instead of reading in files, you can use just the successive-eval expression.
You can use the following code also, to encrypt and decrypt the code
sub ENCRYPT_DECRYPT() {
my $Str_Message=$_[0];
my $Len_Str_Message=length($Str_Message);
my $Str_Encrypted_Message="";
for (my $Position = 0;$Position<$Len_Str_Message;$Position++){
my $Key_To_Use = (($Len_Str_Message+$Position)+1);
$Key_To_Use =(255+$Key_To_Use) % 255;
my $Byte_To_Be_Encrypted = substr($Str_Message, $Position, 1);
my $Ascii_Num_Byte_To_Encrypt = ord($Byte_To_Be_Encrypted);
my $Xored_Byte = $Ascii_Num_Byte_To_Encrypt ^ $Key_To_Use;
my $Encrypted_Byte = chr($Xored_Byte);
$Str_Encrypted_Message .= $Encrypted_Byte;
}
return $Str_Encrypted_Message;
}
my $var=&ENCRYPT_DECRYPT("hai");
print &ENCRYPT_DECRYPT($var);