I am using Spreadsheet::Read to get data from Excel (xls or xlsx) files and put them in a MySQL database using DBI.
If I print out the data to the console, it displays all special characters properly, but when I insert it into the database, some files end up with corrupted characters. For example, "Möbelwerkstätte" becomes "MÃ¶belwerkstÃ¤tte".
I think that Spreadsheet::Read "knows" which character set is coming out of the file, as it prints properly to the console each time, regardless of the file encoding. How do I make sure that it is going into the database in UTF-8?
The answer you have received (and accepted) already will probably work most of the time, but it's a little fragile and probably only works because Perl's internal character representation is a lot like UTF-8.
For a more robust solution, you should read the Perl Unicode Tutorial and follow the recommendations in there. They boil down to:
Decode any data that you get from outside your program
Encode any data that you send out of your program
In your case, you'll want to decode the data that you read from the spreadsheet and encode the data that you are sending to the database.
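As a rough sketch (the file name, table, and connection details below are made up, and whether you need the explicit decode depends on which parser back-end Spreadsheet::Read ends up using):

use strict;
use warnings;
use Encode qw(decode);
use Spreadsheet::Read;
use DBI;

my $book = ReadData('products.xlsx');

# If the parser hands you raw UTF-8 bytes, decode them into Perl characters.
# Some back-ends already return decoded strings, so inspect your data first.
my $name = decode('UTF-8', $book->[1]{A1});

# mysql_enable_utf8 lets the driver handle the encode step on the way out.
my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'secret',
                       { mysql_enable_utf8 => 1, RaiseError => 1 });
$dbh->do('INSERT INTO products (name) VALUES (?)', undef, $name);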
Both DBI and DBD::mysql default to Latin1 (they are compiled with Latin1).
By sending "SET NAMES utf8" as your first query you will change it for that session.
From the manual:
SET NAMES indicates what character set the client will use to send SQL statements to the server. Thus, SET NAMES 'cp1251' tells the server, “future incoming messages from this client are in character set cp1251.” It also specifies the character set that the server should use for sending results back to the client. (For example, it indicates what character set to use for column values if you use a SELECT statement.)
See http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html for full documentation.
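For example, a minimal sketch (DSN and credentials are placeholders) that issues it right after connecting:

use DBI;

my $dbh = DBI->connect('dbi:mysql:mydb;host=localhost', 'user', 'secret',
                       { RaiseError => 1 });

# First statement of the session: declare that this client talks UTF-8
$dbh->do('SET NAMES utf8');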
I am currently working on a project in my organisation where we are migrating Informatica Powercenter in our application from v8.1 to v9.1.
Informatica PowerCenter is loading data from data files but is not able to preserve certain special characters present in a few of the input data files.
The data was getting loaded correctly in v8.1.
I tried changing the character set settings in Informatica as below:
CodePage movement = Unicode
NLS_LANG = AMERICAN_AMERICA.UTF8 to ENGLISH_UNITEDKINGDOM.UTF8
"DataMovementMode" = Unicode
After making the above settings I am getting the warnings below in the Informatica log:
READER_1_2_1> FR_3015 Warning! Row [2258], field [exDestination]: Data [TO] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [exDestination]: Data [IOMR] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2265], field [parentOID]: Data [O-MS1109ZTRD00:esm4:iomr-2_20040510_0_0] was truncated.
READER_1_2_1> FR_3015 Warning! Row [2268], field [exDestination]: Data [IOMR] was truncated.
The special characters that are being sent in the data and are not being handled correctly are:
Ø
Ù
Ɨ
¿
Á
Can somebody please guide me on how to resolve this issue? What else needs to be changed at the Informatica end?
Do any session parameters need to be set in the database?
I posted this in another thread about special characters. Please check if this is of any help.
Start with the source in the Designer. Are you able to see the data correctly in the Source Qualifier preview? If not, you might want to set the flat file source definition encoding to UTF-8.
The Integration service should be running in Unicode mode and not ASCII mode. You can check this from the Integration service properties in Admin Console.
The target should be UTF-8 encoding.
Check the relational connection (if the target is a database) encoding in Workflow Manager to ensure it is UTF-8
If the problem persists, write the output to a UTF-8 flat file and check if the data is loading properly. If yes, then the issue is with writing to the database.
Check the database settings like NLS_LANG, NLS_CHARACTERSET (for Oracle), etc.
Also set your Integration Service (IS) to run in Unicode mode for best results, apart from configuring the ODBC and relational connections to use Unicode
Details for Unicode & ASCII
a) Unicode - the IS allows 2 bytes for each character and uses an additional byte for each non-ASCII character (such as Japanese/Chinese characters)
b) ASCII - the IS holds all data in a single byte
Make sure that the size of the variable is big enough to hold the data. Sometimes the warnings mentioned above are received when the size is too small to hold the incoming data.
So I have a MySQL database which is serving an old WordPress DB I backed up. I am writing some simple Perl scripts to serve the WordPress articles (I don't want a WordPress install).
WordPress for whatever reason stored all quotes as Unicode chars, all ellipses as Unicode chars, all double dashes, all apostrophes; there are Unicode nbsp characters all over the place -- it is a mess (this is why I do not want a WordPress install).
In my test environment, which is Linux Mint 17.1, Perl 5.18.2, MySQL 5.5, things mostly work fine when the Content-type line is served up with "charset=utf-8" (except apostrophes simply never decode properly, no matter what combination of things I try). Omitting the charset causes all Unicode characters to break (except apostrophes, which now work). This is OK; with the exception of the apostrophes, I understand what is going on and I have a handle on the data.
My production environment, which is a VM, is Linux CentOS 6.5, Perl 5.10.1, MySQL 5.6.22, and here things do not work at all. Whether or not I include the "charset=utf-8" in the Content-type makes no difference; no Unicode characters work correctly (including apostrophes). Maybe it has to do with the lower version of Perl? Does anyone have any insight?
Apart from this very specific case, does anyone know of a fool-proof Perl idiom for handling unicode which comes from the DB? (I'm not sure where in the pipeline things are going wrong, but I have a suspicion it is at the DB-driver level)
One of the problems is that my data is very inconsistent and dirty. I could parse the entire DB and scrub all unicode and re-import it -- the point is I want to avoid that. I want a one-size fits all collection of Perl scripts for reading wordpress databases.
Dealing with Perl and UTF-8 has been a pain for me. After a good amount of time I learned that there is no "fool-proof Unicode handling" in Perl ... but there is a Unicode handling approach that can be of help:
The Encode module.
As the perlunifaq says (http://perldoc.perl.org/perlunifaq.html):
When should I decode or encode?
Whenever you're communicating text with anything that is external to
your perl process, like a database, a text file, a socket, or another
program. Even if the thing you're communicating with is also written
in Perl.
So we do this to every UTF-8 text string sent to our Perl process:
my $perl_str = decode('utf8',$myExt_str);
And this to every text string sent from Perl to anything external to our Perl process:
my $ext_str = encode('utf8',$perl_str);
...
Now that's a lot of encoding/decoding when we retrieve or send data from/to a MySQL or PostgreSQL database. But fear not, because there is a way to tell Perl that EVERY TEXT STRING from/to a database is UTF-8. Additionally we tell the database that every text string should be treated as UTF-8. The only downside is that you need to be sure that every text string is UTF-8 encoded... but that's another story:
# For MySQL:
# This requires DBD::mysql version 4 or greater
use DBI;
my $dbh = DBI->connect('dbi:mysql:test_db',
                       $username,
                       $password,
                       {mysql_enable_utf8 => 1}
                      );
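Since PostgreSQL was mentioned above as well: DBD::Pg has an analogous connection flag. A minimal sketch along the same lines ($username and $password are assumed to be defined as above, and the database name is a placeholder):

# For PostgreSQL:
use DBI;
my $dbh = DBI->connect('dbi:Pg:dbname=test_db',
                       $username,
                       $password,
                       {pg_enable_utf8 => 1}
                      );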
Ok, now we have the text strings from our database in UTF-8 and the database knows all our text strings should be treated as UTF-8... But what about anything else? We need to tell Perl (AND CGI) that EVERY TEXT STRING we write in our process is UTF-8 AND tell other processes to treat our text strings as UTF-8 as well:
use utf8;
use CGI '-utf8';
my $cgi = new CGI;
$cgi->charset('UTF-8');
UPDATED!
What is a "wide character"?
This is a term used both for characters with an ordinal value greater
than 127, characters with an ordinal value greater than 255, or any
character occupying more than one byte, depending on the context. The
Perl warning "Wide character in ..." is caused by a character with an
ordinal value greater than 255.
With no specified encoding layer, Perl
tries to fit things in ISO-8859-1 for backward compatibility reasons.
When it can't, it emits this warning (if warnings are enabled), and
outputs UTF-8 encoded data instead. To avoid this warning and to avoid
having different output encodings in a single stream, always specify
an encoding explicitly, for example with a PerlIO layer:
# The next line is required to avoid the "Wide character in print" warning
# AND to avoid having different output encodings in a single stream.
binmode STDOUT, ":encoding(UTF-8)";
...
Even with all of this, sometimes you need to encode('utf8', $perl_str). That's why I learned there is no fool-proof Unicode handling in Perl. Please read the perlunifaq (http://perldoc.perl.org/perlunifaq.html).
I hope this helps.
I'm trying to understand how ASP classic handles strings internally. I've googled and debugged, but I still don't know how a string is encoded within the ASP script.
See the illustration below.
Is input data transformed so that all string variables have the same encoding no matter what source?
Most ASP pages are saved on disk as UTF-8. They do however #include ASP files that are saved with another encoding. At the top of front-end pages I set the response encoding to Unicode (UTF-8):
Response.CodePage = 65001  ' UTF-8
Response.Charset = "utf-8"
http://www.designerline.se/db/aspclassicencoding.png
First of all it's worth considering that both UTF-8 and Windows-1252 (and ISO-8859-1 and others) are based on US-ASCII. The first 128 characters in all of these codepages are identical, use exactly the same byte values and all occupy just one byte.
In many cases the vast majority of the content is within the US-ASCII range, so it's hard to tell that there is any difference between them. Frequently the whole file uses only US-ASCII characters and hence the files are identical regardless of the chosen encoding (save perhaps the BOM at the start of the file).
Basic Script Processing
First the processor combines an ASP file with all its includes and the includes of those includes. This is done very simply, sequentially replacing the include markers with the content of the include file being referenced. This is done purely at the byte level; no attempt is made to convert files of different encodings.
Next the combined version of the file is parsed, tokenized, even "compiled" into a tight, interpreter-friendly form. It's at this point that chunks of content in the file (the stuff outside of script code blocks) are turned into a special form of Response.Write. It's special in that when script execution reaches these special writes, the processor simply copies the bytes as found in the file verbatim to the output stream; again, no attempt is made to convert any encodings.
Script code and character encoding
The ASP processor just doesn't cope well with anything that isn't ASCII. All your code, and especially the string literals in your code, should only be in ASCII.
What can be a bit confusing is that once a script is executing, all string variables are stored using Unicode encoding.
When code writes content to the response using the proper Response.Write method, this is where Response.CodePage comes into effect. It will encode the Unicode string the script provides into the response code page before adding it to the output stream.
What is the effect of Response.CharSet
It adds the CharSet attribute to the Content-Type HTTP header. That is it; it has no other impact. If you set it to one character set but actually send a different one, either because your Response.CodePage doesn't match it or because the byte content of the files is not in that encoding, then you can expect problems.
Input encoding
Things get really messy here. When form data is posted to the server, there is no provision in the form URL-encoding standard to declare the code page used. Browsers can be told what encoding to use, and they will default to the charset of the HTML page containing the form, but there is no mechanism to communicate that choice to the server.
ASP takes the view that the code page of posted form fields will be the same as the code page of the response it is about to send. Take a moment to absorb that.... This means that, quite counter-intuitively, the Response.CodePage value has an impact on the strings returned by Request.Form. For this reason it is important to get the correct code page set early; doing some form processing and then setting the code page later, just before sending a response, can lead to unexpected results.
The classic "the web page looks fine but the data in the database is corrupt" gotcha
One common gotcha this behaviour results in is where the developer has set CharSet="UTF-8" but left the codepage at something like "Windows-1252".
What ends up happening is the user enters text which is sent to the server in UTF-8 encoding, but the script code reads it as 1252. This corrupt string gets stored in the database. A subsequent web page looks at this data and pulls the corrupt string from the DB. This string is then sent by Response.Write using 1252 encoding, but the destination page is told it is UTF-8. This has the effect of reversing the corruption and everything looks fine to the user.
However when other components, say a report generator, creates content from the database then the data appears corrupt because it is.
The Bottom Line
You are already doing the correct thing: get that CharSet and CodePage set early and consistently. Where other files may not be saved as UTF-8 you will have problems if there is non-ASCII content in them, but otherwise you will be fine.
Many include ASPs are purely code with no content, and since that code ought to be purely ASCII, their encoding doesn't really matter.
We restored from a backup in a different format to a new MySQL structure (which is set up correctly for UTF-8 support). We have weird characters showing in the browser, but we're not sure what they're called, which we'd need in order to find a master list of what they translate to.
I have noticed that they do, in fact, correlate to a specific character. For example:
â„¢ always translates to ™
â€” always translates to —
â€¢ always translates to ·
I referenced this post, which got me started, but this is far from a complete list. Either I'm not searching for the correct name, or the "master list" of these bad-to-good conversions as a reference doesn't exist.
Reference:
Detecting utf8 broken characters in MySQL
Also, when trying to search via a MySQL query, if I search for â, MySQL always treats it as an "a". Is there any way to tweak my MySQL queries so that they are more literal searches? We don't use internationalization much, so I can safely assume any field containing the â character is a problematic entry, which needs to be remedied by the "fixit" script we're building.
Instead of designing a "fixit" script to go through and replace this data, I think it would be better to simply fix the issue directly. It seems like the data was originally stored in a different format than UTF-8 so that when you brought it into the table that was set up for UTF-8, it garbled the text. If you have the opportunity, go back to your original backup to determine the format the data was stored in. If you can't do that, you will probably need to do a bit of trial and error to figure out which format the data is in. However, once you know that, conversion is easy. Read the following article's section on Repairing:
http://www.istognosis.com/en/mysql/35-garbled-data-set-utf8-characters-to-mysql-
Basically you are going to set the column to BINARY and then set it to the original charset. That should make the text appear properly (a good check to know you are using the correct charset). Once that is done, set the column to UTF-8. This will convert the data properly and it will correct the problems you are currently experiencing.
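For illustration only, this is roughly what that column dance could look like when driven through Perl/DBI; the table name, column name, and the guess that the original charset was latin1 are all placeholders you would need to adapt:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:wordpress', 'user', 'secret',
                       { RaiseError => 1 });

# 1. Switch to a binary type so MySQL stops interpreting the bytes
$dbh->do('ALTER TABLE wp_posts MODIFY post_content BLOB');

# 2. Re-label the raw bytes with the charset they were really written in
#    (latin1 is only a guess; this is the trial-and-error step)
$dbh->do('ALTER TABLE wp_posts MODIFY post_content TEXT CHARACTER SET latin1');

# 3. Now a real conversion to UTF-8 produces correct text
$dbh->do('ALTER TABLE wp_posts MODIFY post_content TEXT CHARACTER SET utf8');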
Until now, the project I work in used ASCII only in the source code. Due to several upcoming changes in the I18N area, and also because we need some Unicode strings in our tests, we are thinking about biting the bullet and moving the source code to UTF-8, while using the utf8 pragma (use utf8;).
Since the code is in ASCII now, I don't expect to have any troubles with the code itself. However, I'm not quite aware of any side effects we might be getting, while I think it's quite probable that I will get some, considering our environment (perl5.8.8, Apache2, mod_perl, MSSQL Server with FreeTDS driver).
If you have done such migrations in the past: what problems can I expect? How can I manage them?
The utf8 pragma merely tells Perl that your source code is UTF-8 encoded. If you have only used ASCII in your source, you won't have any problems with Perl understanding the source code. You might want to make a branch in your source control just to be safe. :)
If you need to deal with UTF-8 data from files, or write UTF-8 to files, you'll need to set the encodings on your filehandles and encode your data as external bits expect it. See, for instance, With a utf8-encoded Perl script, can it open a filename encoded as GB2312?.
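For example, a minimal sketch of reading and writing UTF-8 files through encoding layers (the file names are just placeholders):

use strict;
use warnings;

open my $in,  '<:encoding(UTF-8)', 'notes_in.txt'  or die "open: $!";
open my $out, '>:encoding(UTF-8)', 'notes_out.txt' or die "open: $!";

while (my $line = <$in>) {
    # Inside the program you work with characters; the :encoding layers
    # decode on read and encode on write for you.
    print {$out} uc $line;
}

close $in;
close $out;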
Check out the Perl documentation that tells you about Unicode:
perlunicode
perlunifaq
perlunitut
Also see Juerd's Perl Unicode Advice.
A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:
despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
use the Encode module, avoid using perl's built in utf8 functions unless you know exactly what data you're dealing with
when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
remember to convert your input - eg. in a web app, incoming form data may need decoding
ensure all dev staff know which way around the terms encode/decode are - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded
handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=utf_decode($r->uri())) - see the sketch at the end of this answer
one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
One more - this is the golden rule - don't just hack til it works, make sure you fully understand what's happening in a given en/decoding situation!
I'm sure you already had most of these sorted out but hopefully all that helps someone out there avoid the many hours debugging which we went through.
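As an illustration of the URL point above, here is a minimal sketch using Encode and URI::Escape (the module choice and the sample string are just one way to do it):

use strict;
use warnings;
use utf8;                      # this source file itself is saved as UTF-8
use Encode qw(encode decode);
use URI::Escape qw(uri_escape uri_unescape);

# character string -> UTF-8 bytes -> %xx-escaped ASCII, for building a URL
my $name    = 'Möbelwerkstätte';
my $escaped = uri_escape(encode('UTF-8', $name));
# $escaped is now 'M%C3%B6belwerkst%C3%A4tte'
# (URI::Escape also provides uri_escape_utf8() to do both steps at once)

# ...and the reverse when reading a request URI back in, eg. under mod_perl
my $bytes = uri_unescape($escaped);
my $chars = decode('UTF-8', $bytes);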