Perl CGI.pm encoding - wrong encoding for "ě" - perl

I have a simple web page that uses CGI.pm. This is what happens:
whenever I call a Perl CGI.pm function and use the Czech character "ě" as the value of a textfield, the label of a radio_group, or anything else, I get "Ä›" instead of "ě".
This is extremely weird, since the whole page is UTF-8 (<meta name="charset" content="utf-8"/>), and especially since this works:
print '<textfield value="ěěěě" >';
So I am positive that CGI.pm is causing the problem. I tried putting
use utf8;
utf8::decode($textfield_value);
at the beginning of my script, and it fixed the CGI.pm problem but made all the other characters in the script (those that are printed directly) look funny.
Any ideas?

Set the accept-charset attribute in your form fields to UTF-8?
<form action="/..." accept-charset="UTF-8">
This might not be sufficient to solve your problem, but it is often necessary to force the client browser to utf8-encode the form data that gets sent to the server.

Have you tried replacing the ě's with their octal or hex escapes? There doesn't seem to be a named HTML 4 entity for the character, although a numeric character reference (&#283; or &#x11B;) should work.
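A minimal sketch of what is probably going on here, assuming the script ends up mixing decoded character strings with raw UTF-8 bytes. The variable names are illustrative and only the core Encode module is used:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{11B}" is the character "ě"; with `use utf8` a literal "ě" in the
# source file would give the same one-character string.
my $chars = "\x{11B}";

# The same text as raw UTF-8 bytes, which is what a literal "ě" is
# when the script does NOT say `use utf8`.
my $bytes = encode('UTF-8', $chars);

# Concatenating the two makes Perl treat each raw byte as a Latin-1
# character, so the byte pair turns into two wrong characters.
my $mixed = $bytes . $chars;

binmode STDOUT, ':encoding(UTF-8)';    # encode characters once, on output
printf "chars=%d bytes=%d mixed=%d\n",
    length $chars, length $bytes, length $mixed;
```

If this is the cause, the usual remedy is to keep everything as decoded characters inside the script and encode once on output, as the binmode call does. CGI.pm also documents a -utf8 import flag for decoding parameters; check the documentation of your installed version before relying on it.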

Related

Perl Catalyst encoding issue

Well, this is strange and it is hard to explain what's wrong, but I'll do my best.
For some reason, values passed to the template change their encoding (I'm pretty sure they do).
Controller file (encoded in UTF-8):
print STDERR "ąęść";
$c->stash->{some_variable} = "ąęść"; # some unicode chars
Template file (encoded in UTF-8):
[% some_variable %]<br>
test: ąęść
As output in browser I'm getting:
ÄÄÅÄ
test: ąęść
Output on console (with UTF-8 encoding enabled):
ąęść
Please take a look at the good documentation provided by the Catalyst Wiki at Using Unicode and also Catalyst::View::TT. The Perl Unicode Cookbook may help you get a better understanding on Perl support of UTF-8, usually better than most other languages available today.
You may need to save your templates with a UTF-8 BOM using your editor, so that the template file is encoded properly when saved; if you do not set a BOM, then at least set the file encoding to UTF-8 every time you save it.
There's been a ton of fixes to unicode support and UTF-8 in general with the most recent stable release of Catalyst (5.90084). Catalyst now is UTF-8 by default, but of course there's always some hard edges. You should review the most recent docs on the subject to see what is going wrong.
If your template contains multibyte characters, then you do indeed need to set the BOM or review the documentation for your template view of choice.
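A sketch of how the doubled characters in the browser can arise, assuming the controller file has no `use utf8` (so the literal is raw bytes) and the output stage encodes to UTF-8. This uses only the core Encode module and does not touch any Catalyst API; the variable names are made up:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The UTF-8 bytes of "ąęść", which is what the literal is without `use utf8`.
my $literal_bytes = "\xC4\x85\xC4\x99\xC5\x9B\xC4\x87";

# Decode once at the boundary; LEAVE_SRC keeps $literal_bytes intact.
my $text = decode('UTF-8', $literal_bytes,
                  Encode::FB_CROAK | Encode::LEAVE_SRC);

# An output stage that encodes to UTF-8 (as a view might) then produces:
my $double_encoded = encode('UTF-8', $literal_bytes);  # bytes encoded again
my $encoded_once   = encode('UTF-8', $text);           # the intended output

printf "text=%d chars, once=%d bytes, double=%d bytes\n",
    length $text, length $encoded_once, length $double_encoded;
```

The 16-byte result is the kind of output a browser shows as "Ä" followed by stray characters. Either `use utf8` in the controller or an explicit decode before stashing the value should avoid it, if this is indeed the cause.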

UTF-8 incorrectly displayed in Lua/ Corona

In Lua, for an iPad Corona project, I'm requesting a UTF-8 server text file (containing Chinese characters) using network.request, but the result when displayed in the console or in the app shows as "garbage". Google Chrome, for instance, displays the same UTF-8 page fine, as I'm setting the http header when the server sends this (using PHP) to 'Content-Type: text/plain; charset=utf-8' (and there's no BOM, byte order mark either). The "garbage" I'm seeing in Lua looks similar to when I "force" Chrome to render the page as ISO-8859-1 using the options menu.
Does anyone have any help or pointers?
If all else fails, how would I convert the "garbage" string back to its UTF-8 origins within Lua?
Thanks for any help!
Lua doesn't know anything about UTF-8; Lua strings are just sequences of bytes. It sounds like Corona itself is parsing the strings as ISO8859-1. The most likely cause for this is them doing something really stupid and naive like treating each byte of the string as a Unicode code point.
I'm afraid I don't know Corona, so can't provide any specific solutions, but I'd suggest looking to see what functions it's got that involve encodings --- there may be a specific function to render a string with a particular encoding, for example.
Can you show the code for your network.request() call?
If you're downloading an HTML page, you should use network.download().
I had this exact same problem, except with Japanese characters. Although Lua doesn't support UTF-8, Corona acts like it does. What that means is that... if you pass a UTF-8 String to display.newText(...), it should display properly. Now, if you output to the console, it will actually print out the raw bytes of the String. And, if you try to print the length of the string, it will actually print out the number of bytes.
So, in summary, Lua treats all strings as an array of bytes. It knows nothing about UTF-8. Some Corona API methods, when passed UTF-8 strings, will display the strings correctly.
I had issues when I mixed UTF-8 with plain ASCII characters, which I believe confused Corona (what I mean is that I mixed English characters with Japanese characters... still all UTF-8, though). I have a hunch that each character in the string must be of the same length in bytes for Corona to display it properly. Try printing out one character at a time to see if that helps. Please feel free to post comments here if you run into trouble. I'd like to figure this issue out myself, too.

Encoding issue in the web page

There is an encoding issue on the web page: the browser shows some special characters incorrectly (for example in "Cinéma"). The content is in ISO and the web page is rendered as UTF-8. Some articles display properly because they are encoded in UTF-8; others show the encoding issue, as with "Cinéma". This is in Perl 5.
Can anyone help me out with this encoding issue? That would be great!
Thanks in advance.
Ensure your Content-type header, or meta document element, contains correct encoding information.
A quick and easy way to test if this is your issue is to ask the browser to render the page as if it had received a specific encoding directive. In Safari this would be View -> Text Encoding and then selecting something appropriate.
I'd hazard a guess that if you inform the browser to use utf-8 then it will render the page correctly.
The only way to solve this will be to spend some time reading up on Unicode and UTF-8 and how to handle encoding in Perl. (perldoc perluniintro, perldoc perlunitut, perldoc perlunicode, perldoc perlunifaq for example).
UTF-8 encoding is a very different concept to other encodings that programmers encounter (escaping in strings, URL encoding, HTML character entities, etc) - it's about how your code should interpret sequences of bytes as characters.
Without knowing the source of the word containing the special character (an accented 'e'), it's impossible to offer further help. Is it coming from a database? A static HTML page? An HTML template? A string within Perl code?
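If the articles really are a mix of ISO-8859-1 and UTF-8 bytes, one common heuristic is to try a strict UTF-8 decode first and fall back to ISO-8859-1. The sketch below assumes exactly those two encodings; `to_text` is a hypothetical helper name, not part of any module:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback.
sub to_text {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

my $from_iso  = to_text("Cin\xE9ma");       # article stored as ISO-8859-1
my $from_utf8 = to_text("Cin\xC3\xA9ma");   # article stored as UTF-8

# Encode once on output so the page really is UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$from_iso\n$from_utf8\n";
```

The fallback can guess wrong for ISO-8859-1 text that happens to form valid UTF-8, so converting the stored content to one encoding is the more robust long-term fix.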

Checklist for going the Unicode way with Perl

I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.
Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something which would only occur at rare occasions.
This is what I have done so far (forgive me for only including "summary" code examples):
Made sure files are always read and written in UTF-8:
use open ':utf8';
Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):
s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg; # Kept from existing code
s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg; # Added
utf8::decode $_;
Made sure text is printed as UTF-8:
binmode STDOUT, ':utf8';
Made sure browsers interpret my content as UTF-8:
Content-Type: text/html; charset=UTF-8
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):
accept-charset="UTF-8"
Don't think I need the following, since inline text (menus, headings, etc.) is only in ASCII:
use utf8;
Does this look reasonable or am I missing something?
EDIT: I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.
The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirement of UTF-8 byte sequences, but for good security, you want to reject stuff that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, thus: :encoding(UTF-8).
For the same reason, always Encode::decode('UTF-8', …), not Encode::decode_utf8(…).
Make decoding fail hard with an exception, compare:
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. The second item (receiving CGI input as UTF-8) is written correctly as:
use Encode qw(decode);
use URI::Escape::XS qw(decodeURIComponent);
$_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
Do not mess around with the functions from the utf8 module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the Encode module.
Add the utf8 pragma anyway in every module. It cannot hurt, but you will future-proof code maintenance in case someone adds those string literals. See also CodeLayout::RequireUseUTF8.
Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether this is intended/needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, this is a hint that you should have earlier had a decoding step. The documents from http://p3rl.org/UNI give the advice to immediately decode after receiving the data from the source. Go over the places where the code is reading/writing data and verify you have a decoding/encoding step, either explicitly (decode('UTF-8', …)) or implicitly through a layer (use open pragma, binmode, 3 argument form of open).
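To illustrate the last point, here is a small sketch of the two decoding styles side by side: implicitly through a layer in the 3-argument open, and explicitly with decode. The temporary file and its content are made up for the example:

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Write a small UTF-8 file to read back.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "Cin\xC3\xA9ma\n";
close $tmp or die "close: $!";

# Implicit decoding through a layer in the 3-argument open.
open my $in, '<:encoding(UTF-8)', $path or die "open: $!";
my $via_layer = <$in>;
close $in;

# Explicit decoding of raw octets.
open my $raw, '<:raw', $path or die "open: $!";
my $octets = <$raw>;
close $raw;
my $explicit = decode('UTF-8', $octets,
                      Encode::FB_CROAK | Encode::LEAVE_SRC);

print $via_layer eq $explicit ? "same text\n" : "different\n";
```

Note the LEAVE_SRC flag: when a CHECK value such as FB_CROAK is passed without it, decode overwrites the source variable in place.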
For debugging: If you are not sure what string is in a variable in which representation at a certain time, you cannot just print, use the tools Devel::StringInfo and Devel::Peek instead.
You're always missing something. The problem is usually the unknown unknowns, though. :)
Effective Perl Programming has a Unicode chapter that covers many of the Perl basics. The one Item we didn't cover though, was everything you have to do to ensure your database server and web server do the right thing.
Some other things you'll need to do:
Upgrade to the most recent Perl you can. Unicode stuff got a lot easier in 5.8, and even easier in 5.10.
Ensure that site content is converted to UTF-8. You might write a crawler to hit pages and look for the Unicode substitution character (that thing that looks like a diamond with a question mark in it). Let's see if I can make it in StackOverflow: �
Ensure that your database server supports UTF-8, that you've set up the tables with UTF-8 aware columns, and that you tell DBI to use the UTF-8 support in its driver (some of this is in the book).
Ensure that anything looking at @ARGV translates the items from the locale of the command line to UTF-8 (it's in the book).
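A minimal sketch of the @ARGV item, assuming you already know the encoding of the command line (here it is simply passed in as a name; in practice it would come from the locale). `decode_args` is a hypothetical helper and the sample arguments are made up:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every argument from the given encoding,
# dying on malformed input instead of silently substituting characters.
sub decode_args {
    my ($encoding, @args) = @_;
    return map {
        decode($encoding, $_, Encode::FB_CROAK | Encode::LEAVE_SRC)
    } @args;
}

# Stand-in for @ARGV as received from a UTF-8 terminal.
my @raw_args = ("na\xC3\xAFve", "plain");
my @args     = decode_args('UTF-8', @raw_args);

printf "%d bytes became %d characters\n",
    length $raw_args[0], length $args[0];
```

In a real script the last step would be something like `@ARGV = decode_args($encoding, @ARGV);` near the top, before any argument is inspected.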
If you find anything else, please let us know by answering your own question with whatever we left out. ;)

How to avoid malformed URI sequence error?

I'm working with Perl. I have data saved in the database as “
and I want to escape those characters to avoid a "malformed URI sequence" error on the client side. This error seems to happen in Firefox only. The fix I found while googling is to not use decodeURI, yet I need it for other characters to be displayed correctly.
Any help? uri_escape does not seem enough on the server side.
Thanks in advance.
Details:
In Perl I'm doing the following:
print "<div style='display:none;' id='summary_".$note_count."_note'>".uri_escape($summary)."</div>";
and on the JavaScript side I want to read from this div and put the content somewhere else, like this:
getObj('summary_div').innerHTML= unescape(decodeURI(note_obj.innerHTML));
where note_obj is the hidden div in which Perl stored the summary.
When I remove decodeURI the problem is solved and I don't get the malformed URI sequence error in JavaScript. Yet I need to use decodeURI for other characters.
This issue seems to be reproducible in Firefox and IE7.
You can try using the CGI module and performing
$uri = CGI::escape($uri);
Maybe it depends on the context in which you try to escape the URI.
This worked fine for me in a CGI context.
Now that you have added details, I can suggest:
print "<div style='display:none;' id='summary_".$note_count."_note'>".CGI::escape($summary)."</div>";
URL escaping won't help you here -- that's for escaping URLs, not escaping text in HTML. What you really want is to encode the string when you output it. See the Encode.pm built-in library. Make sure that you get your charset statements right in the HTTP headers: "Content-Type: text/html; charset=UTF-8" or something like that.
If you're unlucky, you may also have to decode the string as it comes out of the database. That depends on the database driver and the encoding...
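A sketch of that approach, assuming $summary is already a decoded character string: escape it for HTML rather than for a URL, then encode once on output. `escape_html` is a hypothetical hand-rolled helper and the sample summary and div id are made up:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: escape the five characters that matter in HTML
# text and attribute values.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
                  '"' => '&quot;', "'" => '&#39;');
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Made-up sample with curly quotes (U+201C, U+201D) and markup characters.
my $summary = "\x{201C}Tom & Jerry\x{201D} <draft>";
my $html    = "<div id='summary_1_note'>" . escape_html($summary) . "</div>";

# Encode once, when writing the response body.
print encode('UTF-8', $html), "\n";
```

With the text escaped this way, the client side can read the div's content directly, so decodeURI and unescape are not needed for it.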