I'm modifying a mature CGI application written in Perl and the question of content encoding has come up. The browser reports that the content is iso-8859-1 encoded and the application is declaring iso-8859-1 as the charset in the HTTP headers but doesn't ever seem to actually do the encoding. None of the various encoding techniques described in the perldoc tutorials (Encode, Encoding, Open) are used in the code so I'm a little confused as to how the document is actually being encoded.
As mentioned, the application is quite mature and likely predates many of the current encoding methods. Does anyone know of any legacy or deprecated techniques I should be looking for? To what encoding does Perl assume/default to when no direction is provided by the developer?
Thanks
By default Perl handles strings as being byte sequences, so if you read from a file, and print that to STDOUT, it will produce the same byte sequence. If your templates are Latin-1, your output will also be Latin-1.
If you use a string in text string context (like with uc, lc and so on) perl assumes Latin-1 semantics, unless the string has been decoded before.
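The byte-vs-text distinction can be seen in a short sketch (not from the original answer): decode Latin-1 bytes to get character semantics, then encode again before output so the bytes on the wire are Latin-1 once more.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw ISO-8859-1 bytes for "café" -- 0xE9 is "é" in Latin-1.
my $bytes = "caf\xe9";

# Decoded, the string has character semantics: uc() knows
# that "é" uppercases to "É".
my $text  = decode('ISO-8859-1', $bytes);
my $upper = uc($text);    # "CAFÉ"

# Re-encode before printing so the output bytes are Latin-1 again.
print encode('ISO-8859-1', $upper), "\n";
```

Without the decode step, Perl treats the same scalar as an opaque byte sequence and simply passes it through, which is why a Latin-1 template produces Latin-1 output with no explicit encoding anywhere in the code.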
More on Perl, charsets and encodings
Perl will not assume anything; the browser is assuming that encoding, usually based on guesswork. If none of the encoding techniques is used, the documents are output directly, byte for byte, just as they were written.
You can specify the charset in the HTTP Content-Type header.
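For a plain CGI script, that looks something like the following sketch (the charset value here is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The CGI response header: the charset parameter tells the browser
# how to interpret the bytes that follow.
print "Content-Type: text/html; charset=ISO-8859-1\r\n\r\n";

# Body bytes are emitted as-is; 0xEF is "ï" in Latin-1.
print "<html><body>na\xefve</body></html>\n";
```

If no such header is printed by the script, whatever charset the web server or browser guesses is what the user gets.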
The first place I'd look is the server configuration. If you aren't setting the charset in the Content-Type header in the program, you're likely picking up the server's guess.
Run the script separately from the server to see what its actual output is. When the server gets the output from a CGI program (one that's not nph), it fixes up the header, adding anything it thinks is missing, before it sends it to the client.
If the browser reports the content as iso-8859-1, maybe your perl script didn't output the correct headers to specify the charset?
Related
Well, this is strange and hard to explain, but I'll try to do my best.
For some reason the values passed to the template change their encoding (I'm pretty sure they do).
Controller file (encoded in UTF-8):
print STDERR "ąęść";
$c->stash->{some_variable} = "ąęść"; # some unicode chars
Template file (encoded in UTF-8):
[% some_variable %]<br>
test: ąęść
As output in browser I'm getting:
ÄÄÅÄ
test: ąęść
Output on console (with UTF-8 encoding enabled):
ąęść
Please take a look at the good documentation provided by the Catalyst Wiki at Using Unicode and also Catalyst::View::TT. The Perl Unicode Cookbook may help you get a better understanding on Perl support of UTF-8, usually better than most other languages available today.
You may need to save your templates with a UTF-8 BOM using your editor, so the editor encodes your template file properly when saving; or, if not setting a BOM, at least set the file encoding to UTF-8 every time you save it.
There's been a ton of fixes to unicode support and UTF-8 in general with the most recent stable release of Catalyst (5.90084). Catalyst now is UTF-8 by default, but of course there's always some hard edges. You should review the most recent docs on the subject to see what is going wrong.
If your template contains multibyte characters, then you do indeed need to set the BOM or review the documentation for your template view of choice.
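If the templates are saved as UTF-8, one common fix (a sketch, not from the original answers) is to tell the TT view the template file encoding via Catalyst::View::TT's ENCODING option; the view class name here is hypothetical, and this fragment assumes a working Catalyst application:

```perl
package MyApp::View::TT;   # hypothetical view class name
use Moose;
extends 'Catalyst::View::TT';

__PACKAGE__->config(
    # Tell Template Toolkit that the template files on disk are
    # UTF-8, so literal text like "ąęść" in a template and stash
    # values are decoded consistently.
    ENCODING => 'utf-8',
);

1;
```

Without this, Template Toolkit reads the template bytes without decoding them, which is exactly the kind of mismatch that turns "ąęść" into mojibake in one place but not another.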
My problem is simple. I want to output UTF-8 with my Perl script.
This code is not working.
use utf8;
open(TROIS,">utf8.out.2.txt");
binmode(TROIS, ":utf8");
print TROIS "Hello\n";
The output file is not in UTF-8. (My file script is coded in UTF-8)
But if I insert an accented character in my print, then it works and my output file is in UTF-8. Example:
print TROIS "é\n";
I use ActivePerl 5.10 under Windows. What might be the problem?
You're writing nothing but ASCII characters with Hello\n. Fortunately, ASCII is still perfectly valid UTF-8. However, auto-detection by editors will most likely not report UTF-8 as the encoding, because they have nothing in the file's content to judge the encoding by. I guess you simply don't know how file encodings work.
A file's encoding is a property that in general is not stored in a file or externally alongside a file. A lot of editors simply assume a certain encoding based on the operating system they run on or the environment settings (system language), or they include some kind of semi-intelligent auto-detection (which may still fail because file encodings cannot be auto-detected unambiguously). That's why you have to tell Perl that a file is encoded in UTF-8 when you read it with binmode or the corresponding I/O layer.
Now, there is one way of marking a text file's encoding if said encoding is one of the UTF family (UTF-8, UTF-16 LE and BE, UTF-32 LE and BE): the BOM (byte order mark). However, producing files with a BOM dates from a time when UTF-8 was not as widespread as it is today. It usually poses more and different problems than it solves, especially because many editors and applications do not support BOMs at all. Therefore BOMs should probably be avoided nowadays.
There are exceptions, of course, in which the file format itself contains instructions that declare the file's encoding. XML comes to mind with its DOCTYPE declaration. However, even for such files you first have to recognize whether the file uses a multi-byte encoding of at least two bytes per character (UTF-16/UTF-32) in order to parse the DOCTYPE declaration in the first place. It's simply not simple ;)
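The point about ASCII being valid UTF-8 is easy to verify in a few lines (a sketch, not from the original answer): encoding a pure-ASCII string to UTF-8 yields byte-for-byte identical output, so there is nothing for an editor to detect.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ascii   = "Hello\n";
my $as_utf8 = encode('UTF-8', $ascii);

# Pure ASCII encodes to the very same bytes, so an editor has
# nothing to distinguish the "UTF-8" file by.
print $ascii eq $as_utf8 ? "identical\n" : "different\n";   # identical

# A non-ASCII character is where the encodings diverge:
my $e = "\x{e9}";   # "é" as a character
printf "%d byte(s) in UTF-8\n", length(encode('UTF-8', $e));  # 2 byte(s)
```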
I'm using an API that processes my files and presents optimized output, but some special characters are not preserved, for example:
Input: äöü
Output: Ã¤Ã¶Ã¼
How do I fix this? What encoding should I use?
Many thanks for your help!
It really depends on what processing you are doing to your data. But in general, one powerful technique is to convert it to UTF-8 (with iconv, for example) and pass it through ASCII-capable APIs or functions. In general, if those functions don't mess with data they don't understand as ASCII, then the UTF-8 is preserved -- that's a nice property of UTF-8.
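In Perl the same conversion can be done with the core Encode module instead of iconv; a minimal sketch, assuming the input is Latin-1:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $latin1_bytes = "\xe4\xf6\xfc";                   # "äöü" in ISO-8859-1
my $text         = decode('ISO-8859-1', $latin1_bytes);
my $utf8_bytes   = encode('UTF-8', $text);           # 6 bytes now

# An ASCII-only transform leaves the high bytes of the UTF-8
# sequence alone, so the encoded text survives intact:
my $processed = $utf8_bytes;
$processed =~ tr/a-z/A-Z/;   # touches only bytes 0x61-0x7A

print $processed eq $utf8_bytes ? "preserved\n" : "mangled\n";   # preserved
```

This illustrates the "8-bit-clean pass-through" property the answer describes: any function that only acts on ASCII byte values cannot corrupt a UTF-8 multi-byte sequence, because every continuation byte is above 0x7F.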
I am not sure what language you're using, but problems like this occur when there is a mismatch between the encoding used when the content was entered and the encoding used when it is read back.
So you might want to specify exactly which encoding to read the data with. You may have to experiment to find the actual encoding you need, e.g.:
string.getBytes("UTF-8")
string.getBytes("UTF-16")
string.getBytes("UTF-16LE")
string.getBytes("UTF-16BE")
etc...
Also, do some research about the system where this data is coming from. For example, web services from ASP.NET deliver the content as UTF-16LE, but Java uses UTF-16BE encoding. When these two system talk to each other with extended characters, they might not understand each other exactly the same way.
I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.
Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something that would only show up on rare occasions.
This is what I have done so far (forgive me for only including "summary" code examples):
Made sure files are always read and written in UTF-8:
use open ':utf8';
Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):
s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg; # Kept from existing code
s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg; # Added
utf8::decode $_;
Made sure text is printed as UTF-8:
binmode STDOUT, ':utf8';
Made sure browsers interpret my content as UTF-8:
Content-Type: text/html; charset=UTF-8
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):
accept-charset="UTF-8"
Don't think I need the following, since inline text (menus, headings, etc.) is only in ASCII:
use utf8;
Does this look reasonable, or am I missing something?
EDIT: I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.
The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirement of UTF-8 byte sequences, but for good security, you want to reject stuff that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, thus: :encoding(UTF-8).
For the same reason, always Encode::decode('UTF-8', …), not Encode::decode_utf8(…).
Make decoding fail hard with an exception, compare:
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. Your second step (decoding CGI input) is written correctly as:
use Encode qw(decode);
use URI::Escape::XS qw(decodeURIComponent);
$_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
Do not mess around with the functions from the utf8 module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the Encode module.
Add the utf8 pragma anyway in every module. It cannot hurt, and it future-proofs maintenance in case someone later adds non-ASCII string literals. See also CodeLayout::RequireUseUTF8.
Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether the upgrade is intended/needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, this is a hint that you should have had a decoding step earlier. The documents at http://p3rl.org/UNI advise decoding immediately after receiving the data from its source. Go over the places where the code reads or writes data and verify there is a decoding/encoding step, either explicit (decode('UTF-8', …)) or implicit through a layer (the open pragma, binmode, or the 3-argument form of open).
For debugging: if you are not sure which string is in a variable, and in which representation, at a certain time, you cannot just print it; use the tools Devel::StringInfo and Devel::Peek instead.
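A minimal Devel::Peek session (it is a core module) makes the difference visible; whether the UTF8 flag appears in Dump's FLAGS line tells you whether Perl treats the scalar as decoded text. A sketch:

```perl
use strict;
use warnings;
use Encode qw(decode);
use Devel::Peek qw(Dump);

my $bytes = "\xc3\xa9";               # two raw bytes (UTF-8 for "é")
my $text  = decode('UTF-8', $bytes);  # one character

Dump($bytes);   # no UTF8 flag in FLAGS: an undecoded byte string
Dump($text);    # UTF8 flag present: a decoded text string

print length($bytes), " vs ", length($text), "\n";   # 2 vs 1
```

Dump writes to STDERR, so it will not interfere with a CGI script's normal output while you investigate.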
You're always missing something. The problem is usually the unknown unknowns, though. :)
Effective Perl Programming has a Unicode chapter that covers many of the Perl basics. The one Item we didn't cover though, was everything you have to do to ensure your database server and web server do the right thing.
Some other things you'll need to do:
Upgrade to the most recent Perl you can. Unicode stuff got a lot easier in 5.8, and even easier in 5.10.
Ensure that site content is converted to UTF-8. You might write a crawler to hit pages and look for the Unicode replacement character (the thing that looks like a diamond with a question mark in it). Let's see if I can make it in StackOverflow: �
Ensure that your database server supports UTF-8, that you've set up the tables with UTF-8 aware columns, and that you tell DBI to use the UTF-8 support in its driver (some of this is in the book).
Ensure that anything looking at @ARGV translates the items from the locale of the command line to UTF-8 (it's in the book).
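For the @ARGV point, the usual recipe decodes each argument from the locale's codeset; a sketch using core modules (I18N::Langinfo is POSIX-oriented and may behave differently on Windows):

```perl
use strict;
use warnings;
use Encode qw(decode);
use I18N::Langinfo qw(langinfo CODESET);

# Ask the locale what encoding the command line used, then
# decode each argument into a Perl text string.
my $codeset = langinfo(CODESET);
@ARGV = map { decode($codeset, $_) } @ARGV;
```

After this, length(), regexes and case operations on the arguments work on characters rather than on whatever bytes the shell happened to pass in.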
If you find anything else, please let us know by answering your own question with whatever we left out. ;)
Until now, the project I work in used ASCII only in the source code. Due to several upcoming changes in I18N area and also because we need some Unicode strings in our tests, we are thinking about biting the bullet and move the source code to UTF-8, while using the utf8 pragma (use utf8;)
Since the code is in ASCII now, I don't expect to have any troubles with the code itself. However, I'm not quite aware of any side effects we might be getting, while I think it's quite probable that I will get some, considering our environment (perl5.8.8, Apache2, mod_perl, MSSQL Server with FreeTDS driver).
If you have done such migrations in the past: what problems can I expect? How can I manage them?
The utf8 pragma merely tells Perl that your source code is UTF-8 encoded. If you have only used ASCII in your source, you won't have any problems with Perl understanding the source code. You might want to make a branch in your source control just to be safe. :)
If you need to deal with UTF-8 data from files, or write UTF-8 to files, you'll need to set the encodings on your filehandles and encode your data as external bits expect it. See, for instance, With a utf8-encoded Perl script, can it open a filename encoded as GB2312?.
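Setting the encoding on filehandles can be done per-handle or globally; a self-contained sketch (the filename is illustrative):

```perl
use strict;
use warnings;

# Write through the :encoding(UTF-8) layer -- characters are
# encoded to bytes on the way out.
open my $out, '>:encoding(UTF-8)', 'demo.txt' or die "open: $!";
print {$out} "na\x{ef}ve\n";                  # ï = U+00EF
close $out or die "close: $!";

# Read back through the same layer -- bytes are decoded to characters.
open my $in, '<:encoding(UTF-8)', 'demo.txt' or die "open: $!";
my $line = <$in>;
close $in;
unlink 'demo.txt';

chomp $line;
print length($line), " characters\n";         # 5 characters (6 bytes on disk)

# Or globally for all handles opened in this lexical scope:
# use open ':encoding(UTF-8)';
```

Note that :encoding(UTF-8) validates the input, whereas the bare :utf8 layer does not; the stricter form is generally the safer choice.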
Check out the Perl documentation that tells you about Unicode:
perlunicode
perlunifaq
perlunitut
Also see Juerd's Perl Unicode Advice.
A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:
despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
use the Encode module; avoid using Perl's built-in utf8 functions unless you know exactly what data you're dealing with
when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
remember to convert your input - eg. in a web app, incoming form data may need decoding
ensure all dev staff know which way around the terms encode/decode are - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded
handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=utf_decode($r->uri()))
one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
One more - this is the golden rule - don't just hack til it works, make sure you fully understand what's happening in a given en/decoding situation!
I'm sure you already had most of these sorted out but hopefully all that helps someone out there avoid the many hours debugging which we went through.
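The encode/decode terminology and the bytes-versus-chars rule from the list above can be pinned down in a few lines (a sketch, not from the original answer):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $bytes = "\xc3\xa9";               # raw UTF-8 bytes for "é": the /en/-coded form
my $text  = decode('UTF-8', $bytes);  # Perl text string: the /de/-coded form

print length($bytes), "\n";   # 2 -- byte operations (Content-Length!) see two units
print length($text),  "\n";   # 1 -- character operations see one

# Going back out (to a file, a socket, a URL) means encoding again:
my $again = encode('UTF-8', $text);
print $again eq $bytes ? "round-trip ok\n" : "mismatch\n";   # round-trip ok
```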