perl utf8 corruption

I am using the Perl module sapnwrfc to connect to SAP and retrieve reports. This module uses UTF-8, and when the data is returned some of it shows a pattern of UTF-8 character corruption. This appears to happen when a line in the SAP report is more than 4096 bytes long, and my current thinking is that Perl's read buffer is splitting UTF-8 characters and causing the corruption.
$abap_lookup = $sap_rfc->function_lookup("REPORT");
$abap_program = $abap_lookup->create_function_call;
# set abap program input variables
$abap_program->REPORT($abap_program_name);
$abap_program->VARIANT($abap_variant_name);
# call the abap program
$abap_program->invoke;
$abap_program->DATA has the corruption in one place in each line that is more than 4Kb
This is the fragment with the corruption; the actual line is a byte or two more than 4Kb.
\x{f8fc}\x{2500} \x{500}/\x{f8fc}\x{2500}
This is what is expected, so I am assuming something is splitting the line and causing the problem.
\x{f8fc}\x{2500}\x{f8fc}\x{2500}\x{f8fc}\x{2500}
I have tried all manner of settings: the open ':utf8' pragma, use utf8, binmode(STDIN, ":utf8"), binmode(STDOUT, ":utf8"), and turning off buffering ($| = 1). I cannot tell whether this is a UTF-8 problem or a buffering problem. Does anyone know why this is happening and how to fix it?

I was not able to figure out where the corruption is happening, but it is repeatable, so I built a filter to repair it after the fact.
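A sketch of such a filter (not necessarily the exact one; the substitution simply maps the corrupt fragment shown above back to the expected one, and @lines stands in for however you iterate over the rows of $abap_program->DATA):
for my $line (@lines) {
    # Hard-coded repair of the one observed corruption pattern.
    $line =~ s{\x{f8fc}\x{2500} \x{500}/\x{f8fc}\x{2500}}
              {\x{f8fc}\x{2500}\x{f8fc}\x{2500}\x{f8fc}\x{2500}}g;
}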


perl memory usage when processing a file inline

I have a CGI script that's used by our employees to fetch logs from servers that they don't have direct access to. For reasons I won't go into, after a recent update to our app some of these logs now have characters like linefeeds, tabs, backslashes, etc. translated into their text equivalents. As such, I've modified the CGI script to invoke the following to convert these back to their original values:
perl -i -pe 's/\\r/\r/g && s/\\n/\n/g && s/\\t/\t/g && s/\\\//\//g' $filename
I was just informed that some people are now getting out of memory errors when they try to fetch logs that are fairly large (a few hundred MB).
My question: How does perl manage memory when an inline command like this is invoked? Is it reading the whole file in, processing it, then writing it out, or is it creating a temporary file, processing the lines from the input file one at a time then replacing the file once complete?
This is using perl 5.10.1 on a 64-bit Amazon linux instance.
The -p switch creates a while(<>){...; print} loop to iterate on each “line” in your input file.
If all of your newlines have been converted into "\\n", then your file would just be a single very long line. Therefore, your command would be loading the entire file into memory to perform your fix.
To avoid that, you'll have to intentionally buffer the file using either sysread or $/.
It would probably be easiest to create an actual script instead of a one-liner to do the work. However, if you know that all of your newlines are converted, then one simple fix would be to use $/ = "\\n".
As a secondary note, your regex chain is flawed. You're currently stringing your s/// translations together with the short-circuit && operator, so if any one of the earlier substitutions doesn't match on a particular line, none of the later ones will be attempted. You should instead use simple semicolons to separate them:
's/\\r/\r/g; s/\\n/\n/g; s/\\t/\t/g; s|\\/|/|g'
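Putting both fixes together, a sketch of the one-liner, assuming every original newline really has been turned into a literal \n (so $/ = "\\n" keeps each chunk small):
perl -i -pe 'BEGIN { $/ = "\\n" } s/\\r/\r/g; s/\\n/\n/g; s/\\t/\t/g; s|\\/|/|g' $filename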

opening utf8 files on perl and double encoding

I have a MySQL db in which every table has COLLATE='utf8_general_ci'.
I connect to the tables with DBI, my $db = DBI->connect($cstring, $user, $password), without
$db->{mysql_enable_utf8} = 1
$db->do(qq{SET NAMES 'utf8';} );
Then I select from the table and copy it to a CSV file using Text::CSV, writing to myFile, where myFile is opened like this:
binmode(Myfile, ":utf8")
The problem: I repeat this process on different tables with different files, all opened as above, but on some files I get double encoding, and only if I remove the binmode for those specific files is the problem solved. The other files are fine and encoded as UTF-8, and if I remove the binmode for them I get a problem with the UTF-8 encoding. What could be the problem?
Worth mentioning: I tried use utf8 in my script and also tried
$db-> {mysql_enable_utf8} = 1
$db->do(qq{SET NAMES 'utf8';} );
but the problem is not solved.
If I understand correctly, you see
Ã©Ã«Ã¨
where you expect
éëè
when using phpMyAdmin. This indicates the data in your database is wrong (double-encoded). You'll need to go back and repopulate your database with the correct data.
If you can't fix your database, it's most likely safe to just add the following:
utf8::decode($str); # Fix double-encoding
It will attempt to decode the already-decoded data from the database. If the data was double-encoded, this will fix it. If the data wasn't double-encoded, it will silently fail, leaving the correct value in $str (assuming your strings aren't very, very weird).
I recommend that you write a small tool that reads the data from the database, uses this trick to fix the data, then puts it back in the database correctly.
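A sketch of such a tool, assuming a hypothetical table items with an id key and a text column name (adjust the names and connection details to your schema):
use strict;
use warnings;
use DBI;

# mysql_enable_utf8 makes DBD::mysql hand back decoded character strings.
my $dbh = DBI->connect('dbi:mysql:dbname=db_name', 'db_user', 'db_pass',
    { RaiseError => 1, mysql_enable_utf8 => 1 });

my $rows   = $dbh->selectall_arrayref('SELECT id, name FROM items');
my $update = $dbh->prepare('UPDATE items SET name = ? WHERE id = ?');

for my $row (@$rows) {
    my ($id, $name) = @$row;
    # Strips one layer of encoding if the value was double-encoded;
    # otherwise it fails harmlessly and $name is left as-is.
    utf8::decode($name);
    $update->execute($name, $id);
}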

Filtering microsoft 1252 characters out of an ASCII text file opened in utf8 mode in Perl

I have a reasonably sized flat-file database of text documents, mostly saved in 8859 format, which have been collected through a web form (using Perl scripts). Until recently I was handling the common 1252 characters (curly quotes, apostrophes, etc.) with a simple set of regexes:
$line=~s/\x91/\&\#8216\;/g; # smart apostrophe left
$line=~s/\x92/\&\#8217\;/g; # smart apostrophe right
... etc.
However, since I decided I ought to be going Unicode and have converted all my scripts to read in and output UTF-8 (which works a treat for all new material), the regexes for these (existing) 1252 characters no longer work, and my Perl HTML output literally emits the four characters '\x92', '\x93', etc. At least that's how it appears in a browser in UTF-8 mode; downloading (FTP, not HTTP) and opening it in a text editor (TextPad) it's different, with a single undefined character remaining, and opening the output file in Firefox's default (no content-type header) 8859 mode renders the correct character.
The new utf8 pragmas at the start of the script are:
use CGI qw(-utf8);
use open IO => ':utf8';
I understand this is due to UTF-8 mode making the characters double-byte instead of single-byte, which applies to the chars in the 0x80 to 0xFF range; having read the Wikibooks article relating to this, I was none the wiser as to how to filter them. Ideally I know I ought to resave all the documents in UTF-8 (since the flat-file database now contains a mixture of 8859 and UTF-8), but I will need some kind of filter in the first place if I'm going to do that anyway.
And I could be wrong as to the 2-byte storage internally, since it did seem to imply that Perl handles stuff very differently according to various circumstances.
If anybody could provide me with a regex solution, or some other method, I would be very grateful. I have been tearing my hair out for weeks on this with various attempts and failed hacking. There are really only about six 1252 characters that commonly need replacing, and with a filter method I could resave the whole flippin' lot in UTF-8 and forget there ever was a 1252...
Encoding::FixLatin was specifically written to help fix data broken in the same manner as yours.
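For example (a minimal sketch; fix_latin is the function the module exports):
use Encoding::FixLatin qw(fix_latin);

# $octets may mix UTF-8 and CP1252/Latin-1 bytes; fix_latin returns
# a consistent Perl character string.
my $clean = fix_latin($octets);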
Ikegami already mentioned the Encoding::FixLatin module.
Another way to do it, if you know that each string will be either UTF-8 or CP1252, but not a mixture of both, is to read it as a binary string and do:
unless ( utf8::decode($string) ) {
    require Encode;
    $string = Encode::decode(cp1252 => $string);
}
Compared to Encoding::FixLatin, this has two small advantages: a slightly lower chance of misinterpreting CP1252 text as UTF-8 (because the entire string must be valid UTF-8) and the possibility of replacing CP1252 with some other fallback encoding. A corresponding disadvantage is that this code could fall back to CP1252 on strings that are not entirely valid UTF-8 for some other reason, such as because they were truncated in the middle of a multi-byte character.
You could also use Encode.pm's support for fallback.
use Encode qw[decode];
my $octets = "\x91 Foo \xE2\x98\xBA \x92";
my $string = decode('UTF-8', $octets, sub {
    my ($ordinal) = @_;
    return decode('Windows-1252', pack 'C', $ordinal);
});
printf "<%s>\n",
    join ' ', map { sprintf 'U+%.4X', ord $_ } split //, $string;
Output:
<U+2018 U+0020 U+0046 U+006F U+006F U+0020 U+263A U+0020 U+2019>
Did you recode the data files? If not, opening them as UTF-8 won't work. You can simply open them as
open $filehandle, '<:encoding(cp1252)', $filename or die ...;
and everything (tm) should work.
If you did recode, something seem to have gone wrong, and you need to analyze what it is, and fix it. I recommend using hexdump to find out what actually is in a file. Text consoles and editors sometimes lie to you, hexdump never lies.

Perl: Problem with changing encoding in the middle of reading a file

I am using Perl to load some 'macro' files. These macros can, however, be encoded in various encodings, so there is a directive defined for users writing their macros (e.g.
#encoding iso-8859-2
at the beginning of the macro).
Every time this directive is encountered in the macro, a function that sets the encoding is called; it looks something like this:
sub change_encoding {
    my ($file_handle, $encoding) = @_;
    $file_handle->flush();
    binmode($file_handle);                        # get rid of IO layers
    binmode($file_handle, ":encoding($encoding)");
}
The problem is that when I read the macro using standard
while ($line = <$file_handle>) {
    process_macro($line);
}
I got messages saying "utf8 "\xXY" does not map to Unicode", but only if characters with diacritics are near the #encoding directive. I tried several examples and was able to get half of the string with \xXY codes and the other half with correctly decoded characters, like here:
sub macro5_fn {
    print "\xBElu\xBBou\xE8k\xFD k\xF9\xF2 úpěl ďábelské ódy\n";
}
If I put more comments before the function, all the characters are OK:
sub macro5_fn {
    print "žluťoučký kůň úpěl ďábelské ódy\n";
}
Simply said, the number of correctly decoded characters depends on the distance of those characters from the #encoding directive; the ones that are close are not decoded correctly.
It seems to me that this is an issue of Perl and PerlIO (not) flushing the buffer. Or am I doing something wrong?
Thank you for your answers.
The problem is that <> reads more than just one line, so the next line or so is being interpreted under the old encoding before you ever see the #encoding directive for the new.
Your best bet is probably to read the file in binary mode and use the Encode module to decode each line from the current encoding.
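A sketch of that approach, assuming the directive always looks like the #encoding line above; $macro_file and the latin-1 default are placeholders:
use Encode qw(decode);

open my $fh, '<:raw', $macro_file or die "Cannot open $macro_file: $!";
my $encoding = 'iso-8859-1';    # placeholder default until a directive is seen

while (my $octets = <$fh>) {
    if ($octets =~ /^#encoding\s+(\S+)/) {
        $encoding = $1;         # applies to the lines that follow
        next;
    }
    process_macro(decode($encoding, $octets));
}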

How can I handle unicode with Perl's DBI?

My delicious-to-wp Perl script works, but for all "weird" characters it gives even weirder output.
So I tried
$description = decode_utf8( $description );
but that doesn't make a difference. I would like e.g. “go live” to become “go live” and not “go live†. How can I handle Unicode in Perl so that this works?
UPDATE: I found the problem; to enable UTF-8 for DBI I had to run, in Perl:
my $sql = qq{SET NAMES 'utf8';};
$dbh->do($sql);
That was the part that I had to set, tricky. Thanks!
It's worth noting that if you're running a version of DBD::mysql new enough (3.0008 on), you can do the following: $dbh->{'mysql_enable_utf8'} = 1; and then everything's decode()ed/encode()ed for you on the way out from/in to DBI.
Enable UTF-8 when you connect to the database, like this:
my $dbh = DBI->connect(
    "dbi:mysql:dbname=db_name",
    "db_user", "db_pass",
    { RaiseError => 0, PrintError => 0, mysql_enable_utf8 => 1 },
) or die "Connect to database failed.";
This should get you character mode strings with the UTF8 flag set as needed.
From DBI General Interface Rules & Caveats:
Perl supports two kinds of strings: Unicode (utf8 internally) and non-Unicode (defaults to iso-8859-1 if forced to assume an encoding). Drivers should accept both kinds of strings and, if required, convert them to the character set of the database being used. Similarly, when fetching from the database character data that isn't iso-8859-1 the driver should convert it into utf8.
And the specifics from DBD::mysql for mysql_enable_utf8
Additionally, turning on this flag tells MySQL that incoming data should be treated as UTF-8. This will only take effect if used as part of the call to connect(). If you turn the flag on after connecting, you will need to issue the command SET NAMES utf8 to get the same effect.
The statement
$dbh->do(qq{SET NAMES 'utf8';});
definitely saves the day for accessing a utf-8 declared database, but take notice: if you are going to do any Perl processing of data obtained from the db, it would be wise to store it in a Perl variable as a decoded utf8 string, as this operation is not implicit:
$utfstring = decode('utf8',$string_from_db);
Of course, for proper I/O handling of utf8 strings (reading, printing, writing to output), remember to set
use open ':utf8';
and
binmode STDOUT, ":utf8";
The latter is essential for printing out utf8 strings. Hope this helps.
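Put together, a rough sketch of that flow (the table and column names are made up for the example):
use Encode qw(decode);

binmode STDOUT, ':utf8';
$dbh->do(q{SET NAMES 'utf8'});

my ($string_from_db) = $dbh->selectrow_array('SELECT name FROM items WHERE id = 1');
my $utfstring = decode('utf8', $string_from_db);   # the decode is not implicit
print "$utfstring\n";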
It may have nothing to do with Perl. Check to make sure you're using UTF encodings in the pertinent MySQL table columns.
Leave this one out:
binmode STDOUT, ":utf8";
when using:
$dbh->do(qq{SET NAMES 'utf8';});
Otherwise your output will have double utf8 encoding, resulting in unreadable double byte characters!
It took me a couple of hours to figure this out..
By default, the Perl MySQL driver handles data as binary (at least that is what I concluded from some experiments with MySQL 5.1 and 5.5).
Without setting mysql_enable_utf8, I encoded/decoded the strings to/from UTF-8 before writing/reading to/from the database.
You should not rely on the Perl-internal string representation as an array of bytes; be aware that the internal 'utf8' is not guaranteed to be standard UTF-8, and conversely the single-byte encoding is not guaranteed to be ISO-8859-1. Really do encode/decode to/from 'UTF-8' (and not 'utf8').
There are also some MySQL settings regarding encodings (like SET NAMES above; as far as I remember there is a client encoding, a connection encoding, and a server encoding, whose interactions are not quite clear to me when they do not all have the same value); setting all of them to UTF-8, plus the recipe above, worked for me.
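As a sketch of that explicit round-trip without mysql_enable_utf8 (table and column names are placeholders):
use Encode qw(encode decode);

# Writing: hand the driver UTF-8 octets rather than Perl character strings.
$dbh->do('INSERT INTO items (name) VALUES (?)', undef,
         encode('UTF-8', $perl_string));

# Reading: decode the octets that come back into a Perl character string.
my ($raw) = $dbh->selectrow_array('SELECT name FROM items WHERE id = ?',
                                  undef, $id);
my $text = decode('UTF-8', $raw);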