Perl Catalyst encoding issue - perl

Well, this is strange and it's hard to explain what's wrong, but I'll do my best.
For some reason, values passed to the template change their encoding (I'm pretty sure they do).
Controller file (encoded in UTF-8):
print STDERR "ąęść";
$c->stash->{some_variable} = "ąęść"; # some unicode chars
Template file (encoded in UTF-8):
[% some_variable %]<br>
test: ąęść
As output in browser I'm getting:
ÄÄÅÄ
test: ąęść
Output on console (with UTF-8 encoding enabled):
ąęść

Please take a look at the documentation on the Catalyst Wiki at Using Unicode, and also at Catalyst::View::TT. The Perl Unicode Cookbook may help you get a better understanding of Perl's UTF-8 support, which is usually better than that of most other languages available today.
You may need to save your templates with the UTF-8 BOM from your editor so that the template file is encoded properly on save. If you don't set a BOM, then at least set the file encoding to UTF-8 explicitly every time you save it.

There have been a ton of fixes to Unicode support, and to UTF-8 handling in general, in the most recent stable release of Catalyst (5.90084). Catalyst is now UTF-8 by default, but of course there are always some rough edges. You should review the most recent docs on the subject to see what is going wrong.
If your template contains multibyte characters, then you do indeed need to set the BOM or review the documentation for your template view of choice.
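The garbled first line in the question is consistent with the stash value reaching the view as raw UTF-8 bytes and then being encoded a second time on output. Here is a minimal sketch of that effect in plain Perl, outside Catalyst. The byte string is written with escapes so the example does not depend on how the file itself is saved, and the variable names are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes of "ąęść", as a string literal holds them when the
# source file is UTF-8 but `use utf8;` is not in effect.
my $bytes = "\xC4\x85\xC4\x99\xC5\x9B\xC4\x87";

# Decoding turns the 8 bytes into 4 characters.
my $chars = decode('UTF-8', $bytes);

# Encoding the undecoded bytes treats every byte as a Latin-1 character
# and yields 16 bytes: the kind of garbage that starts with "Ä".
my $double = encode('UTF-8', $bytes);

printf "bytes=%d chars=%d double-encoded=%d\n",
    length($bytes), length($chars), length($double);
```

If this is what is happening, putting `use utf8;` in the controller (so the literal is already a character string) should make the stash value behave like the text typed directly into the template.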

Related

Notepad++ can recognize encoding?

I created file with UTF-8 encoded content (using PHP fputcsv).
When I open this file in Notepad++ - characters are wrong (Notepad++ starts with ANSI encoding).
When I set Format->"Encode in UTF-8" from menu - everything is fine.
I'm worried that Notepad++ should be able to recognize the encoding somehow, and that maybe something is wrong with the file I created with fputcsv. The first byte or something?
Automatically detecting an encoding is not something that can be done accurately. It's pretty much essential that the encoding be specified explicitly. It can be guessed in some cases, but even then not with 100% certainty.
This documentation (Encoding) explains the situation in relation to Notepad++.
They also point out that the difficulty arises especially if the file has not been saved with a Byte Order Mark (BOM).
Given that your file displays correctly once you manually set the encoding, I would say there's nothing wrong with how you are generating and saving the file. The only thing you can check for is whether a BOM is being saved, which might improve the chances of Notepad++ being able to automatically detect the encoding.
It's worth noting that although it may help editors like Notepad++ identify the encoding more accurately, according to The Unicode Standard document, the BOM is not recommended.
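Checking whether a BOM was saved only takes a look at the first three bytes of the file. A small sketch in Perl (the function name and the command-line usage are made up for this example):

```perl
use strict;
use warnings;

# Return true if the file at $path starts with the UTF-8 BOM (EF BB BF).
sub has_utf8_bom {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $read = read($fh, my $head, 3);
    close($fh);
    return defined $read && $read == 3 && $head eq "\xEF\xBB\xBF";
}

# Hypothetical usage: perl check_bom.pl export.csv
if (@ARGV) {
    print has_utf8_bom($ARGV[0]) ? "BOM present\n" : "no BOM\n";
}
```

A file without a BOM is still perfectly valid UTF-8. The check only tells you whether an editor has that extra hint to go on.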
Check the lower right corner of the Notepad++ window to see the encoding that is actually being used. The problem is not specific to Notepad++: guessing the right encoding is a hard problem without any real solution, so it's better to let the user decide which encoding is the most appropriate in each case.
When you want to reflect the encoding of the text file in a Java program, you have to consider two things: encoding and character set. When you open a text file, you see the encoding under the "Encoding" menu. Additionally, look at the character set menu item. Under "Eastern European" you will find "ISO 8859-2", and under "Central European" you will find "Windows-1250". You can set the corresponding encoding in the Java program by looking it up in this table:
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
For example, for the Central European character set "Windows-1250" the table suggests the Java encoding "Cp1250". Set that encoding and you will see the characters properly in your program.

write utf8 with perl bug?

My problem is simple. I want to output UTF-8 with my Perl script.
This code is not working.
use utf8;
open(TROIS,">utf8.out.2.txt");
binmode(TROIS, ":utf8");
print TROIS "Hello\n";
The output file is not in UTF-8. (My file script is coded in UTF-8)
But if I insert an accentuated character in my print, then it's working and my output file is in UTF-8. Example:
print TROIS "é\n";
I use ActivePerl 5.10 under Windows. What might be the problem?
You're writing nothing but ASCII characters with Hello\n. Fortunately ASCII is still perfectly valid UTF-8. However, auto detection by editors will most likely not show UTF-8 as the encoding because they don't have anything to judge the file content's encoding by. I guess you simply don't know how file encodings work.
A file's encoding is a property that in general is not stored in a file or externally alongside a file. A lot of editors simply assume a certain encoding based on the operating system they run on or the environment settings (system language), or they include some kind of semi-intelligent auto-detection (which may still fail because file encodings cannot be auto-detected unambiguously). That's why you have to tell Perl that a file is encoded in UTF-8 when you read it with binmode or the corresponding I/O layer.
Now there is one way of marking a text file's encoding if said encoding is one of the UTF family (UTF-8, UTF-16 LE and BE, UTF-32 LE and BE). That way is called the BOM (byte order mark). However, producing files with a BOM came from a time when UTF-8 had not been spread as widely as it is today. It usually poses more and different problems than it solves, especially due to editors and applications in general not supporting BOMs at all. Therefore BOMs should probably be avoided nowadays.
There are exceptions, of course, in which the file format contains certain instructions that tell the file's encoding. XML comes to mind with the encoding attribute of its XML declaration. However, even for such files you will have to recognize whether a file is encoded in a multi-byte encoding that always uses at least two bytes per character (UTF-16/UTF-32) in order to parse the declaration in the first place. It's simply not simple ;)
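To see the first point for yourself, you can write text through a UTF-8 layer and look at the raw bytes that end up on disk. This is only a sketch: the helper name is invented, it uses a temporary file instead of utf8.out.2.txt, and the accented character is written as an escape so the script itself stays ASCII:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $text through a UTF-8 layer and return the raw bytes on disk.
sub utf8_bytes_on_disk {
    my ($text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close($tmp);
    open(my $out, '>:encoding(UTF-8)', $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close($out);
    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;                               # slurp the whole file
    my $bytes = <$in>;
    close($in);
    return $bytes;
}

my $ascii  = utf8_bytes_on_disk("Hello");
my $accent = utf8_bytes_on_disk("\x{E9}");  # e-acute

print "ASCII text is byte-for-byte unchanged\n" if $ascii  eq "Hello";
print "e-acute became the two bytes C3 A9\n"    if $accent eq "\xC3\xA9";
```

Both files are valid UTF-8. Only the second one contains a byte sequence that lets an editor tell UTF-8 apart from a single-byte encoding.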

What charset to use to store russian text into javascript files as an array

I am creating a ColdFusion page that takes language translation data stored in a table in my database and makes static js files for each language pairing of English to ___ etc...
I am now starting to work on Russian. I was able to get the other languages to work fine.
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????
I have tried writing it via cffile as utf-8 or ISO-8859-1 but neither seems to get it to display properly.
Any suggestions?
Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.
By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (notepad++ handles that nicely, and so does Eclipse or the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. Coldfusion is quite agnostic in that regard, as it will happily use any jdbc driver that you may need.
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself UTF-8? (with or without a BOM it matters not for Russian). In any case UTF-8 is absolutely what you should be using. Make sure you get a UTF-8 compliant editor. Which is most things on Mac. On Windows you could use Scite or GVim.
The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar to Windows code page 1251. cp1251 will be the default encoding when you save in a text editor from a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So, best is to save both the cf template and the js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru= {
launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
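If the files are generated by a script, the escapes don't have to be typed by hand. Here is a rough sketch of producing them in Perl (not ColdFusion). The function name is invented, and it assumes every character is in the Basic Multilingual Plane, which holds for Cyrillic. Characters above U+FFFF would need surrogate pairs, which this does not handle:

```perl
use strict;
use warnings;

# Turn a decoded character string into the body of an ASCII-only,
# single-quoted JavaScript string literal using \uXXXX escapes.
sub js_escape {
    my ($text) = @_;
    return join '', map {
        my $cp = ord;
        $cp == 0x27 || $cp == 0x5C ? '\\' . chr($cp)          # ' and \
      : $cp < 0x20 || $cp > 0x7E   ? sprintf('\\u%04x', $cp)  # non-ASCII, controls
      :                              chr($cp);
    } split //, $text;
}

# The first word of the translation, written with escapes.
my $word = "\x{0417}\x{0430}\x{043f}\x{0443}\x{0441}\x{043a}";
print "launchMyCalendar: '", js_escape($word), "'\n";
```

The output contains only ASCII, so it no longer matters which encoding the .js file is saved or served in.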
when it saves to file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì" so the charset is wrong
Looks like the file was saved in a single-byte Cyrillic encoding and is then being read on a Western server where the default codepage is cp1252. The byte values in your sample match ISO-8859-5 rather than cp1251, the default codepage on a Russian machine.
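One way to check a guess like this is to reproduce the garbling: encode the original word in each candidate encoding, misread the bytes as ISO-8859-1, and compare with what was seen. A sketch, with the helper name and the candidate list chosen just for this example:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Encode $text in $encoding, then misread the bytes as ISO-8859-1.
sub misread_as_latin1 {
    my ($encoding, $text) = @_;
    return decode('iso-8859-1', encode($encoding, $text));
}

# The first word of the translation and the first garbled word from the
# saved file, both written as code points.
my $original = "\x{0417}\x{0430}\x{043f}\x{0443}\x{0441}\x{043a}";
my $garbled  = "\x{B7}\x{D0}\x{DF}\x{E3}\x{E1}\x{DA}";

for my $candidate ('cp1251', 'iso-8859-5', 'UTF-8') {
    printf "%-10s %s\n", $candidate,
        misread_as_latin1($candidate, $original) eq $garbled
            ? 'matches' : 'does not match';
}
```

For this sample only the ISO-8859-5 candidate should match, which would point to ISO-8859-5 rather than cp1251 as the encoding the file was written in.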
I also just found out that my text editor of choice, textpad, doesn't support unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.

How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?
Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Win tools) default to “ANSI”. This is a misleading name as it is nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard, but in fact means the system default code page of the current Windows machine. That default code page can change between machines, or on the same machine, at which point all text files in “ANSI” that have non-ASCII characters like accented letters in will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.
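Both points are easy to see with Perl's Encode module. The sketch below uses cp1252 and cp1251 as two examples of what "ANSI" can mean on different machines:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# One byte, two "ANSI" code pages, two different characters.
my $byte    = "\xE9";
my $western = ord(decode('cp1252', $byte));   # e-acute
my $russian = ord(decode('cp1251', $byte));   # Cyrillic short i
printf "cp1252: U+%04X  cp1251: U+%04X\n", $western, $russian;

# ASCII text is byte-for-byte identical in UTF-8, but in UTF-16LE every
# ASCII character is followed by a zero byte.
my $utf8  = encode('UTF-8',    'Hi');
my $utf16 = encode('UTF-16LE', 'Hi');
printf "UTF-8: %d bytes, UTF-16LE: %d bytes\n", length($utf8), length($utf16);
```

The same file of "ANSI" bytes therefore reads differently depending on the machine, while the UTF-8 bytes mean the same thing everywhere.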
If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake
It is very important, since whatever tool you use will show wrong characters if you use the wrong encoding. Try loading a Cyrillic file in Notepad without using UTF-8 or similar and you'll see a lot of "?" coming up. :)

What problems should I expect when moving legacy Perl code to UTF-8?

Until now, the project I work on has used ASCII only in the source code. Due to several upcoming changes in the I18N area, and also because we need some Unicode strings in our tests, we are thinking about biting the bullet and moving the source code to UTF-8, while using the utf8 pragma (use utf8;).
Since the code is in ASCII now, I don't expect to have any troubles with the code itself. However, I'm not quite aware of any side effects we might be getting, while I think it's quite probable that I will get some, considering our environment (perl5.8.8, Apache2, mod_perl, MSSQL Server with FreeTDS driver).
If you have done such migrations in the past: what problems can I expect? How can I manage them?
The utf8 pragma merely tells Perl that your source code is UTF-8 encoded. If you have only used ASCII in your source, you won't have any problems with Perl understanding the source code. You might want to make a branch in your source control just to be safe. :)
If you need to deal with UTF-8 data from files, or write UTF-8 to files, you'll need to set the encodings on your filehandles and encode your data as external bits expect it. See, for instance, With a utf8-encoded Perl script, can it open a filename encoded as GB2312?.
Check out the Perl documentation that tells you about Unicode:
perlunicode
perlunifaq
perlunitut
Also see Juerd's Perl Unicode Advice.
A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:
despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
use the Encode module, avoid using perl's built in utf8 functions unless you know exactly what data you're dealing with
when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
remember to convert your input - eg. in a web app, incoming form data may need decoding
ensure all dev staff know which way around the terms encode/decode are - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded
handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=utf_decode($r->uri()))
one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
One more - this is the golden rule - don't just hack til it works, make sure you fully understand what's happening in a given en/decoding situation!
I'm sure you already had most of these sorted out but hopefully all that helps someone out there avoid the many hours debugging which we went through.
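To make a few of those points concrete (decode on the way in, encode on the way out, count bytes rather than chars, encode before %xx-escaping), here is a small sketch using only the Encode module. The sample string and variable names are invented for illustration, and a real mod_perl handler would get its input from the request instead:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Incoming data (form fields, file contents, rows from a driver that
# returns bytes) is /en/-coded: decode it once, at the boundary.
my $incoming = "na\xC3\xAFve";             # UTF-8 bytes, i with diaeresis
my $text     = decode('UTF-8', $incoming);

# Inside the program, work with characters.
my $chars = length($text);

# Outgoing data is encoded once, at the boundary. Byte counts such as
# Content-Length must be taken from the encoded form.
my $body           = encode('UTF-8', $text);
my $content_length = length($body);

# URLs: encode to UTF-8 bytes first, then %xx-escape every byte that is
# not an unreserved ASCII character.
(my $escaped = $body) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;

print "chars=$chars bytes=$content_length url=$escaped\n";
```

The character count and the byte count differ by one here, which is exactly the kind of off-by-some error that shows up in length headers when the two are mixed up.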