Encode::Guess::guess_encoding gives different results in different contexts - perl

I have the following sub that opens a text file and attempts to ensure its encoding is one of either UTF-8, ISO-8859-15 or ASCII.
The problem I have with it is different behaviours in interactive vs. non-interactive use.
When I run interactively with a file that contains a UTF-8 line, $decoder is, as expected, a reference object whose name method returns utf8 for that line.
Non-interactively (as it runs as part of a Subversion commit hook), guess_encoding returns a scalar string of value "utf8 or iso-8859-15" for the UTF-8 check line, and "iso-8859-15 or utf8" for the other two lines.
I can't, for the life of me, work out where the difference in behaviour comes from. If I force the encoding on open, say <:encoding(utf8), it accepts every line as UTF-8 without question.
The problem is I can't assume that every file it receives will be UTF-8, so I don't want to force the encoding as a workaround. Another potential workaround is to parse the scalar text returned on failure, but that just seems messy, especially when everything works correctly in an interactive context.
From the shell, I've tried overriding $LANG (non-interactively it isn't set, nor are any of the LC_* variables), yet the interactive version still behaves correctly.
The commented-out line that reports $Encode::Guess::NoUTFAutoGuess prints 0 in both interactive and non-interactive use when enabled.
Ultimately, the one thing we're trying to prevent is UTF-16 or other wide-character encodings getting into our repository (some of our tooling doesn't play well with them); checking against a white-list of encodings seemed an easier job than maintaining a black-list.
sub checkEncoding
{
    my ($file) = @_;
    my ($b1, $b2, $b3);
    my $encoding = "";
    my $retval = 1;
    my $line = 0;
    say("Checking encoding of $file");
    #say($Encode::Guess::NoUTFAutoGuess);
    open(GREPFILE, "<", $file) or die "Cannot open $file: $!";
    while (<GREPFILE>) {
        chomp($_);
        $line++;
        my $decoder = Encode::Guess::guess_encoding($_, 'utf8');
        say("A: $decoder");
        $decoder = Encode::Guess::guess_encoding($_, 'iso-8859-15') unless ref $decoder;
        say("B: $decoder");
        $decoder = Encode::Guess::guess_encoding($_, 'ascii') unless ref $decoder;
        say("C: $decoder");
        if (ref $decoder) {
            $encoding = $decoder->name;
        } else {
            say "Mis-identified encoding '$decoder' on line $line: [$_]";
            my $z = unpack('H*', $_);
            say $z;
            $encoding = $decoder;
            $retval = 0;
        }
        last if ($retval == 0);
    }
    close GREPFILE;
    return $retval;
}
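Note that Encode::Guess can be handed all candidate suspects in a single call, and its return value tested with ref(): on failure or ambiguity, guess_encoding returns a diagnostic string such as "utf8 or iso-8859-15" rather than a decoder object. A minimal sketch of that pattern (the input bytes are illustrative; because ISO-8859-15 can decode nearly any byte string, an ambiguous result is common):
use strict;
use warnings;
use feature 'say';
use Encode::Guess;

my $data = "caf\xC3\xA9";   # hypothetical input bytes
my $decoder = Encode::Guess::guess_encoding($data, qw(utf8 iso-8859-15));
if (ref $decoder) {
    say "guessed: ", $decoder->name;   # a decoder object
} else {
    say "no single match: $decoder";   # e.g. "utf8 or iso-8859-15"
}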

No need to guess. For the specific options of UTF-8, ISO-8859-1 and US-ASCII, you can use Encoding::FixLatin's fix_latin. It's virtually guaranteed to succeed.
That said, I suspect the use of ISO-8859-15 in the OP is a typo for ISO-8859-1.
The method used by fix_latin would work just as well for ISO-8859-15 as it does for ISO-8859-1. It's simply a question of replacing _init_byte_map with the following:
sub _init_byte_map {
    foreach my $i (0x80..0xFF) {
        my $byte = chr($i);
        # from_to() converts in place and returns an octet count,
        # so convert a copy rather than assigning its return value.
        my $utf8 = $byte;
        Encode::from_to($utf8, 'iso-8859-15', 'UTF-8');
        $byte_map->{$byte} = $utf8;
    }
}
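For reference, a minimal usage sketch of fix_latin itself (the mixed input bytes are illustrative):
use Encoding::FixLatin qw(fix_latin);

my $bytes = "caf\xE9 and caf\xC3\xA9";   # ISO-8859-1 bytes and UTF-8 bytes mixed
my $chars = fix_latin($bytes);           # a character string; both are now "café"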
Alternatively, if you're willing to assume the data is all of one encoding or another (rather than a mix), you could also use the following approach:
use Encode qw(decode);

my $text;
if (!eval {
    $text = decode("UTF-8", $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1  # No exception
}) {
    $text = decode("ISO-8859-15", $bytes);
}
Keep in mind that US-ASCII is a proper subset of both UTF-8 and ISO-8859-15, so it doesn't need to be handled specially.

Related

Cannot call pdflatex from perl script (due to encoding?)

When I call pdflatex manually from the Windows command line, it generates the desired PDF.
When I call pdflatex from a Perl script instead, it does not:
system("pdflatex $fileName");
...results in:
Sorry, but pdflatex did not succeed.
You may want to visit the MiKTeX project page, if you need help.
utf8 "\x80" does not map to Unicode at C:/strawberry-perl/perl/site/lib/Encode.pm line 200.
The script was running on Unix before and working fine. Now, after being migrated to a Windows system, it doesn't.
The content of the TeX input file is generated by the script as well. The file command on my Mac tells me that this file is encoded as us-ascii.
So I tried to make Perl encode it as UTF-8, but it did not work:
open(FH, "> :encoding(utf-8)", $fileName);
or
binmode(FH, ":utf8");
Files are still being generated with us-ascii encoding. How can I change that?
So far, the encoding is my only clue.
What else could be the problem?
If this works fine when typed manually into the command line, the problem could be due to the way Perl interpolates the quotation marks before passing the command to the shell. Have you tried printing the call you are making, to check that it passes exactly the same input as when you enter it manually? Otherwise, for passing arguments to a program via the system command in Perl, I always separate them out as follows to avoid any interpolation errors:
#...
my $prog = "Z.*";
my $arg1 = "X";
my $arg2 = "Y";
#...
my $file = "W.*";
system("$prog", ("$arg1", "$arg2", ..., "$file"));
#...
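For the pdflatex case specifically, that might look like the following sketch (the file name and the -interaction flag are illustrative):
my $fileName = "my report.tex";   # a name with spaces would break system("pdflatex $fileName")
system('pdflatex', '-interaction=nonstopmode', $fileName) == 0
    or warn "pdflatex exited with status $?";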
If this doesn't work, another, albeit rather clunky, solution might be to read the file contents into a variable and 'manually' encode them in Perl as follows:
use Encode;
use utf8;
use charnames qw( :full :short );
my $encodedfile = encode("utf8", $filecontents);
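A sketch of then writing those bytes back out through a raw handle, so that no second encoding layer is applied (reusing $fileName and $encodedfile from above):
open(my $out, '>:raw', $fileName) or die "Cannot open $fileName: $!";
print {$out} $encodedfile;
close($out);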
If the file contains any active characters that could influence the way pdflatex handles the final output (for example, in Perl \\ yields a single \ to pdflatex, which changes the final rendering), you can append the following to the encoding step:
my $str = $encodedfile;
my $find = "\\N{U+005C}";
my $replace = "\\textbackslash ";
$str =~ s/$find/$replace/g;
my %special_characters;
$special_characters{"\\N{U+0025}"} = "\\% ";   # escape % for LaTeX
$special_characters{"\\\$"} = "\\\$";
$special_characters{"\\N{U+007B}"} = "\\{";
$special_characters{"\N{U+007D}"} = "\\}";
$special_characters{"\N{U+0026}"} = "\\&";
$special_characters{"\\N{U+005F}"} = "\\textunderscore ";
$special_characters{"\\N{U+002F}"} = "\/";
$special_characters{"\\N{U+005B}"} = "\[";
$special_characters{"\\N{U+005D}"} = "\]";
$special_characters{"\\N{U+005E}"} = "\\textasciicircum ";
$special_characters{"\\N{U+0023}"} = "\\#";
$special_characters{"\\\N{U+007E}"} = "\\textasciitilde ";
$special_characters{"\\\N{U+0021}"} = " \\newline ";
my $string = $str;
foreach my $char (keys %special_characters) {
    $string =~ s/$char/$special_characters{$char}/g;
}
Hope this helps.

perl how to detect corrupt data in CSV file?

I download a CSV file from another server using a Perl script. After downloading, I want to check whether the file contains any corrupted data. I tried to use Encode::Detect::Detector to detect the encoding, but it returns undef in both of these cases:
if the string is ASCII, or
if the string is corrupted
So using the program below I can't differentiate between ASCII and corrupted data.
use strict;
use Text::CSV;
use Encode::Detect::Detector;
use XML::Simple;
use Encode;
require Encode::Detect;
my @rows;
my $init_file = "new-data-jp-2013-8-8.csv";
my $csv = Text::CSV->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();
open my $fh, $init_file or die $init_file . ": $!";
while ( my $row = $csv->getline($fh) ) {
    my @fields = @$row;    # get line into array
    for (my $i = 1; $i <= 23; $i++) {    # I already know that the CSV file has 23 columns
        if ((Encode::Detect::Detector::detect($fields[$i-1])) eq undef) {
            print "the encoding is undef in col" . $i .
                " where field is " . $fields[$i-1] .
                " and its length is " . length($fields[$i-1]) . " \n";
        }
        else {
            my $string = decode("Detect", $fields[$i-1]);
            print "this is string print " . $string .
                " the encoding is " . Encode::Detect::Detector::detect($fields[$i-1]) .
                " and its length is " . length($fields[$i-1]) . "\n";
        }
    }
}
You have some bad assumptions about encodings, and some errors in your script.
foo() eq undef
does not make any sense. You cannot compare for string equality with undef, as undef isn't a string. It does, however, stringify to the empty string. You should use warnings to get error messages when you do such rubbish. To test whether a value is not undef, use defined:
unless(defined foo()) { .... }
The Encode::Detect::Detector module uses an object-oriented interface. Therefore,
Encode::Detect::Detector::detect($foo)
is wrong. According to the docs, you should be doing
Encode::Detect::Detector->detect($foo)
You probably cannot do decoding on a field-by-field basis. Usually, one document has one encoding. You need to specify the encoding when opening the file handle, e.g.
use autodie;
open my $fh, "<:utf8", $init_file;
While CSV can support some degree of binary data (like encoded text), it isn't well suited for this purpose, and you may want to choose another data format.
Finally, ASCII data effectively does not need any de- or encoding. The undef result for encoding detection makes sense here. It cannot be asserted with certainty that a document is encoded in ASCII (as many encodings are supersets of ASCII), but given a certain document it can be asserted that it isn't valid ASCII (i.e. it has the 8th bit set somewhere) and must instead use a more complex encoding like Latin-1 or UTF-8.
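If "corrupted" means "not valid text in the expected encoding", you can test that directly by attempting a strict decode and catching the failure. A sketch, assuming the file is supposed to be UTF-8 and $raw_bytes holds its raw content:
use Encode qw(decode FB_CROAK LEAVE_SRC);

my $ok = eval { decode('UTF-8', $raw_bytes, FB_CROAK | LEAVE_SRC); 1 };
print $ok ? "valid UTF-8\n" : "malformed input: $@";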

perl- trim utf8 bytes to 'length' and sanitize the data

I have a UTF-8 sequence of bytes and need to trim it to, say, 30 bytes. This may result in an incomplete sequence at the end. I need to figure out how to remove that incomplete sequence.
e.g.
$b="\x{263a}\x{263b}\x{263c}";
my $sstr;
print STDERR "length in utf8 bytes =" . length(Encode::encode_utf8($b)) . "\n";
{
use bytes;
$sstr= substr($b,0,29);
}
#After this $sstr contains "\342\230\272\342"\0
# How to remove \342 from the end
UTF-8 has some neat properties that allow us to do what you want while working with the encoded bytes rather than characters. So first, you need the UTF-8 bytes.
use Encode qw( encode_utf8 );
my $bytes = encode_utf8($str);
Now, to split between code points. The UTF-8 encoding of every code point starts with a byte matching 0b0xxxxxxx or 0b11xxxxxx, and those bytes never occur in the middle of a code point. That means you want to truncate immediately before a byte matching
[\x00-\x7F\xC0-\xFF]
Together, we get:
use Encode qw( encode_utf8 );
my $max_bytes = 8;
my $str = "\x{263a}\x{263b}\x{263c}"; # ☺☻☼
my $bytes = encode_utf8($str);
$bytes =~ s/^.{0,$max_bytes}(?![^\x00-\x7F\xC0-\xFF])\K.*//s;
# $bytes contains encode_utf8("\x{263a}\x{263b}")
# instead of encode_utf8("\x{263a}\x{263b}") . "\xE2\x98"
Great, yes? Nope. The above can truncate in the middle of a grapheme. A grapheme (specifically, an "extended grapheme cluster") is what someone would perceive as a single visual unit. For example, "é" is a grapheme, but it can be encoded using two code points ("\x{0065}\x{0301}"). If you cut between those two code points, the result would still be valid UTF-8, but the "é" would become an "e"! If that's not acceptable, neither is the above solution. (Oleg's solution below suffers from the same problem.)
Unfortunately, UTF-8's properties are no longer sufficient to help us here. We'll need to grab one grapheme at a time, and add it to the output until we can't fit one.
my $max_bytes = 6;
my $str = "abcd\x{0065}\x{0301}fg"; # abcdéfg
my $bytes = '';
my $bytes_left = $max_bytes;
while ($str =~ /(\X)/g) {
    my $grapheme = $1;
    my $grapheme_bytes = encode_utf8($grapheme);
    $bytes_left -= length($grapheme_bytes);
    last if $bytes_left < 0;
    $bytes .= $grapheme_bytes;
}
# $bytes contains encode_utf8("abcd")
# instead of encode_utf8("abcde")
# or encode_utf8("abcde") . "\xCC"
First, please don't use bytes (and never make assumptions about Perl's internal encoding). As the documentation says: "This pragma reflects early attempts to incorporate Unicode into perl and has since been superseded <...> use of this module for anything other than debugging purposes is strongly discouraged."
To strip an incomplete sequence from the end of a line, assuming it contains octets, use Encode::decode's Encode::FB_QUIET handling mode to stop processing once you hit an invalid sequence, and then just encode the result back:
my $valid = Encode::decode('utf8', $sstr, Encode::FB_QUIET);
$sstr = Encode::encode('utf8', $valid);
Note that if you plan to use this with another encoding in the future, not all encodings may support this handling method.
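A quick illustration of that approach (the byte string is the "☺" from the question plus one dangling lead byte; note that FB_QUIET consumes $sstr in place, leaving the invalid tail behind in it):
use Encode ();

my $sstr  = "\xE2\x98\xBA\xE2";   # "☺" followed by an incomplete sequence
my $valid = Encode::decode('utf8', $sstr, Encode::FB_QUIET);
my $fixed = Encode::encode('utf8', $valid);   # "\xE2\x98\xBA" only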

Perl - Unicode::String sub need to add/convert for Latin-9 support

Part 3 (Part 2 is here) (Part 1 is here)
Here is the perl Mod I'm using: Unicode::String
How I'm calling it:
print "Euro: ";
print unicode_encode("€")."\n";
print "Pound: ";
print unicode_encode("£")."\n";
I would like it to return this format:
&#x20ac; # Euro
&#x00a3; # Pound
The function is below:
sub unicode_encode {
    shift() if ref( $_[0] );
    my $toencode = shift();
    return undef unless defined($toencode);

    print "Passed: " . $toencode . "\n";

    Unicode::String->stringify_as("utf8");
    my $unicode_str = Unicode::String->new();
    my $text_str = "";
    my $pack_str = "";

    # encode Perl UTF-8 string into latin1 Unicode::String
    # - currently only Basic Latin and Latin 1 Supplement
    #   are supported here due to issues with Unicode::String.
    $unicode_str->latin1($toencode);
    print "Latin 1: " . $unicode_str . "\n";

    # Convert to hex format ("U+XXXX U+XXXX ")
    $text_str = $unicode_str->hex;

    # Now, the interesting part.
    # We must search for the (now hex-encoded)
    # Unicode escape sequence.
    my $pattern =
        'U\+005[Cc] U\+0058 U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f])';

    # Replace escapes with entities (beginning of string)
    $_ = $text_str;
    if (/^$pattern/) {
        $pack_str = pack "H8", "$1$2$3$4$5$6$7$8";
        $text_str =~ s/^$pattern/\&#x$pack_str/;
    }

    # Replace escapes with entities (middle of string)
    $_ = $text_str;
    while (/ $pattern/) {
        $pack_str = pack "H8", "$1$2$3$4$5$6$7$8";
        $text_str =~ s/ $pattern/\;\&#x$pack_str/;
        $_ = $text_str;
    }

    # Replace "U+" with "&#x" (beginning of string)
    $text_str =~ s/^U\+/&#x/;

    # Replace " U+" with ";&#x" (middle of string)
    $text_str =~ s/ U\+/;&#x/g;

    # Append ";" to end of string to close the last entity.
    # This last ";" isn't necessary in most parsers, but it is
    # included anyway to ensure full compatibility.
    if ( $text_str ne "" ) {
        $text_str .= ';';
    }
    return $text_str;
}
I need to get the same output, but with support for Latin-9 characters as well; Unicode::String is limited to Latin-1. Any thoughts on how I can get around this?
I have a couple of other questions and think I have a reasonable understanding of Unicode and encodings, but I'm pressed for time as well.
Thanks to anyone who helps me out!
As you have been told already, Unicode::String is not an appropriate choice of module. Perl ships with a module called 'Encode' which can do everything you need.
If you have a character string in Perl like this:
my $euro = "\x{20ac}";
You can convert it to a string of bytes in Latin-9 like this:
my $bytes = encode("iso8859-15", $euro);
The $bytes variable will now contain \xA4.
Or you can have Perl automatically convert on output to a filehandle, like this:
binmode(STDOUT, ":encoding(iso8859-15)");
You can refer to the documentation for the Encode module. And also, PerlIO describes the encoding layer.
I know you are determined to ignore this final piece of advice, but I'll offer it one last time. Latin-9 is a legacy encoding. Perl can quite happily read Latin-9 data and convert it to UTF-8 on the fly (using binmode). You should not be writing more software that generates Latin-9 data; you should be migrating away from it.
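For the exact entity output shown at the top of the question, Unicode::String isn't needed at all. A minimal sketch using core Perl (the sub name is mine):
use utf8;

sub unicode_entities {
    my ($str) = @_;
    # One &#xxxxx; entity per code point.
    return join '', map { sprintf '&#x%04x;', ord } split //, $str;
}

print unicode_entities("€"), "\n";   # &#x20ac;
print unicode_entities("£"), "\n";   # &#x00a3;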

How can I guess the encoding of a string in Perl?

I have a Unicode string and don't know what its encoding is. When this string is read by a Perl program, is there a default encoding that Perl will use? If so, how can I find out what it is?
I am trying to get rid of non-ASCII characters from the input. I found the following on a forum, which is supposed to do it:
my $line = encode('ascii', normalize('KD', $myutf), sub {$_[0] = ''});
How will the above work when no input encoding is specified? Should it be specified like the following?
my $line = encode('ascii', normalize('KD', decode('input-encoding', $myutf)), sub {$_[0] = ''});
To find out which encoding something unknown uses, you just have to try and look. The modules Encode::Detect and Encode::Guess automate that. (If you have trouble compiling Encode::Detect, try its fork Encode::Detective instead.)
use Encode::Detect::Detector;
my $unknown = "\x{54}\x{68}\x{69}\x{73}\x{20}\x{79}\x{65}\x{61}\x{72}\x{20}".
"\x{49}\x{20}\x{77}\x{65}\x{6e}\x{74}\x{20}\x{74}\x{6f}\x{20}".
"\x{b1}\x{b1}\x{be}\x{a9}\x{20}\x{50}\x{65}\x{72}\x{6c}\x{20}".
"\x{77}\x{6f}\x{72}\x{6b}\x{73}\x{68}\x{6f}\x{70}\x{2e}";
my $encoding_name = Encode::Detect::Detector::detect($unknown);
print $encoding_name; # gb18030
use Encode;
my $string = decode($encoding_name, $unknown);
I find encode 'ascii' a lame solution for getting rid of non-ASCII characters. Everything gets substituted with question marks; this is too lossy to be useful.
# Bad example; don't do this.
use utf8;
use Encode;
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string); # This year I went to ?? Perl workshop.
If you want readable ASCII text, I recommend Text::Unidecode instead. This, too, is a lossy encoding, but not as terrible as plain encode above.
use utf8;
use Text::Unidecode;
my $string = 'This year I went to 北京 Perl workshop.';
print unidecode($string); # This year I went to Bei Jing Perl workshop.
However, avoid those lossy encodings if you can help it. In case you want to reverse the operation later, pick one of PERLQQ or XMLCREF.
use utf8;
use Encode qw(encode PERLQQ XMLCREF);
my $string = 'This year I went to 北京 Perl workshop.';
print encode('ascii', $string, PERLQQ); # This year I went to \x{5317}\x{4eac} Perl workshop.
print encode('ascii', $string, XMLCREF); # This year I went to 北京 Perl workshop.
The Encode module has a way that you can try to do this. You decode the raw octets with what you think the encoding is. If the octets don't represent a valid encoding, it blows up and you catch it with an eval. Otherwise, you get back a properly encoded string. For example:
use Encode;
my $a_with_ring =
    eval { decode( 'UTF-8', "\x6b\xc5", Encode::FB_CROAK ) }
        or die "Could not decode string: $@";
This has the drawback that the same octet sequence can be valid in multiple encodings.
I have more to say about this in the upcoming Effective Perl Programming, 2nd Edition, which has an entire chapter on dealing with Unicode. I think my publisher would get mad if I posted the whole thing though. :)
You might also want to see Juerd's Unicode Advice, as well as some of the Unicode docs that come with Perl.
I like mscha's solution here, but it can be simplified using Perl's defined-or operator (//):
use Encode;
use feature 'signatures';
no warnings 'experimental::signatures';

sub slurp($file) {
    local $/;
    open(my $fh, '<:raw', $file) or return undef();
    my $raw = <$fh>;
    close($fh);
    # Return the first successful decoding result.
    # LEAVE_SRC keeps $raw intact for the next attempt.
    return
        eval { Encode::decode('utf-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC); }        // # Try UTF-8
        eval { Encode::decode('windows-1252', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC); } // # Try windows-1252 (a superset of iso-8859-1 and ascii)
        $raw;                                                                                   # Give up and return the raw bytes
}
The first successful decoding is returned; plain ASCII content succeeds on the first (UTF-8) attempt.
If you are working directly with string variables instead of reading in files, you can use just the successive-eval expression.
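Hypothetical usage (the file name is made up):
my $text = slurp('unknown.txt');
die "Cannot read unknown.txt\n" unless defined $text;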
You can also use the following code to encrypt and decrypt a string:
sub ENCRYPT_DECRYPT {
    my $Str_Message = $_[0];
    my $Len_Str_Message = length($Str_Message);
    my $Str_Encrypted_Message = "";
    for (my $Position = 0; $Position < $Len_Str_Message; $Position++) {
        my $Key_To_Use = (($Len_Str_Message + $Position) + 1);
        $Key_To_Use = (255 + $Key_To_Use) % 255;
        my $Byte_To_Be_Encrypted = substr($Str_Message, $Position, 1);
        my $Ascii_Num_Byte_To_Encrypt = ord($Byte_To_Be_Encrypted);
        my $Xored_Byte = $Ascii_Num_Byte_To_Encrypt ^ $Key_To_Use;
        my $Encrypted_Byte = chr($Xored_Byte);
        $Str_Encrypted_Message .= $Encrypted_Byte;
    }
    return $Str_Encrypted_Message;
}
my $var = &ENCRYPT_DECRYPT("hai");
print &ENCRYPT_DECRYPT($var);