I have non-ASCII characters in UTF-8 encoding (Chinese characters), but they're not printed correctly. I have to add decode('utf8', $str) (in the controller or the template file) to get the right output. How can I make the template recognize UTF-8 strings?
Oddly enough, a literal stashed string produces the right output, and I don't know why.
The content is stored in MySQL with a UTF-8 collation. I added $DB->do("SET NAMES 'UTF8'"); after connecting to the database, but it had no effect.
Try setting the DBI option mysql_enable_utf8 to 1.
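For example, a minimal connection sketch (the DSN and credentials are placeholders):

use DBI;

# With mysql_enable_utf8 => 1, DBD::mysql sends statements as UTF-8 and
# decodes results into Perl character strings, so no manual
# decode('utf8', $str) should be needed in the controller or template.
my $dbh = DBI->connect(
    'dbi:mysql:database=mydb;host=localhost',   # placeholder DSN
    'user', 'password',                         # placeholder credentials
    { RaiseError => 1, mysql_enable_utf8 => 1 },
);

Newer versions of DBD::mysql also provide mysql_enable_utf8mb4 if the data needs the full 4-byte UTF-8 range.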
Related
When I upload a badly (or "utf8-ly") named file in a fresh TYPO3 7.6 install, I get underscores instead of spelled out special characters.
E.g. the filename Bräm!.png is sanitized to Bra__m_.png.
I would expect Braem.png.
The server locale looks fine:
LANG=de_CH.UTF-8
LC_CTYPE="de_CH.UTF-8"
LC_NUMERIC="de_CH.UTF-8"
LC_TIME="de_CH.UTF-8"
LC_COLLATE="de_CH.UTF-8"
LC_MONETARY="de_CH.UTF-8"
LC_MESSAGES="de_CH.UTF-8"
LC_PAPER="de_CH.UTF-8"
LC_NAME="de_CH.UTF-8"
LC_ADDRESS="de_CH.UTF-8"
LC_TELEPHONE="de_CH.UTF-8"
LC_MEASUREMENT="de_CH.UTF-8"
LC_IDENTIFICATION="de_CH.UTF-8"
LC_ALL=
In LocalConfiguration.php, we have
'systemLocale' => 'de_CH.UTF-8',
And I even tried, in php.ini:
intl.default_locale = de_CH.UTF-8
Still, there is no "proper" renaming as I'd expect, i.e. renaming the file Bräm!.png to Braem.png or at least Braem_.png.
Where else could I look?
From what you describe, the name of the file is not encoded in UTF-8 but in a single-byte character set (ISO-8859-1, for example).
In \TYPO3\CMS\Core\Resource\Driver\LocalDriver::sanitizeFileName(), UTF-8 is assumed when the file comes in through the backend (the same applies to the old file handling functions).
In that case the "ä" isn't a valid multi-byte UTF-8 character and is thus replaced by underscore characters.
Make sure [SYS][UTF8filesystem] = true in your LocalConfiguration.php.
I am observing strange behaviour with IPC::Open3 arguments as part of a script.
I pass in a string containing ISO-8859-15 text. Just before open3() is called (literally the statement before), the string is correct (verified with print and Data::Dumper).
However, once the subprocess is started, the arguments are UTF-8 encoded. I have verified this using the target executable (freebcp) and a wrapper script, and I ended up writing a wrapper that converts all the arguments back to ISO-8859-15.
What causes this behaviour? LANG is set to en_AU.ISO-8859-15. It works correctly on other hosts. I cannot find any reference to binmode().
Perl 5 by default doesn't do any encoding conversion: strings are treated as dumb byte arrays.
That is, until you tell Perl that a string contains Unicode text, for example by calling decode(), or by reading the string from a file handle that has an encoding layer attached (via binmode(), via open() flags, via use open with :encoding/:locale, or via the -C command-line switch).
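For example, a minimal sketch (the byte string and file name are just illustrations):

use Encode qw(decode);

my $bytes = "br\xE4m";                      # "bräm" as ISO-8859-15 bytes
# decode() turns raw bytes into a Perl character (Unicode) string
my $chars = decode('ISO-8859-15', $bytes);

# or attach an encoding layer, so everything read from the handle is
# decoded automatically
open my $fh, '<:encoding(ISO-8859-15)', 'input.txt' or die $!;
my $line = <$fh>;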
Since you have the string in ISO-8859-15 but it comes out UTF-8 encoded, Perl must be aware of your string's encoding. Somewhere, somehow, you have told Perl the encoding of the string, and it has converted it to Unicode, which is internally represented using UTF-8. That internal UTF-8 is what now seems to be written to the open3() file handles.
As a possible solution, explicitly convert the strings into the desired encoding before handing them to open3().
P.S. Using the utf8::is_utf8() function, you can debug when and how your strings get converted to Unicode, and whether they really are Unicode.
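A minimal sketch of that approach (the argument value and the freebcp invocation are placeholders, not your actual command line):

use Encode qw(encode);
use IPC::Open3 qw(open3);

my $arg = "total: 12\x{20AC}";   # placeholder argument containing a Euro sign

# utf8::is_utf8() reports whether Perl's internal UTF-8 flag is set on the
# string; handy for locating where the upgrade to Unicode happens.
warn "argument is a Unicode (character) string\n" if utf8::is_utf8($arg);

# Explicitly encode to the byte encoding the child process expects.
my $bytes = encode('ISO-8859-15', $arg);

my $pid = open3(my $in, my $out, undef, '/usr/bin/freebcp', $bytes);
waitpid($pid, 0);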
I am reading a .json file from disk using Windows.Storage.FileIO.readTextAsync.
All is fine until I put some non-English letters in the file, like Æ Å Ø.
The error I get is (roughly translated from Danish):
WinRT: No mapping for the Unicode character exists in the target multi-byte code page.
Any idea how to read those characters in WinJS?
I found the problem: when I created the file manually with Notepad, I saved it as ANSI instead of UTF-8.
I reopened the file, chose Save As, changed the encoding, and overwrote it.
You may be able to solve this by changing the encoding from the default (Utf8) to Utf16. The readTextAsync method accepts a second parameter which is a UnicodeEncoding flag:
Windows.Storage.FileIO.readTextAsync(
    file,
    Windows.Storage.Streams.UnicodeEncoding.utf16LE
).done( ... );
Or, if you need to, you can use the utf16BE flag.
I'm trying to understand how ASP classic handles strings internally. I've googled and debugged, but I still don't know how a string is encoded within the ASP script.
See the illustration below.
Is input data transformed so that all string variables have the same encoding no matter what source?
Most ASP pages are saved on disk as UTF-8. They do, however, #include ASP files that are saved in another encoding. At the top of front-end pages I set the response encoding to Unicode:
Response.CodePage = 65001 ' unicode (UTF-8)
Response.Charset = "utf-8"
http://www.designerline.se/db/aspclassicencoding.png
First of all, it's worth considering that both UTF-8 and Windows-1252 (and ISO-8859-1 and others) are based on US-ASCII. The first 128 characters in all of these code pages are identical: they use exactly the same byte values and all occupy just one byte.
In many cases the vast majority of the content is within the US-ASCII range, so it's hard to tell there is any difference between them. Frequently the whole file uses only US-ASCII characters, and hence the files are byte-for-byte identical regardless of the chosen encoding (save perhaps a BOM at the start of the file).
Basic Script Processing
First, the processor combines an ASP file with all its includes, and the includes of those includes. This is done very simply, by sequentially replacing the include markers with the content of the include file being referenced. It happens purely at the byte level; no attempt is made to convert files of different encodings.
Next, the combined version of the file is parsed, tokenized, even "compiled" into a tight, interpreter-friendly form. It's at this point that chunks of content in the file (the stuff outside of script code blocks) are turned into a special form of Response.Write. It's special in that, when script execution reaches these writes, the processor simply copies the bytes as found in the file verbatim to the output stream; again, no attempt is made to convert any encodings.
Script code and character encoding
The ASP processor just doesn't cope well with anything that isn't ASCII. All your code, and especially the string literals in your code, should be ASCII only.
What can be a bit confusing is that once a script is executing, all string variables are stored using Unicode encoding.
When code writes content to the response using the proper Response.Write method, this is where Response.CodePage comes into effect. It encodes the Unicode string the script provides into the response code page before adding it to the output stream.
What is the effect of Response.CharSet
It adds the charset attribute to the Content-Type HTTP header. That is it; it has no other impact. If this is set to one character set but you actually send a different one, either because your Response.CodePage doesn't match it or because the byte content of the files is not in that encoding, then you can expect problems.
Input encoding
Things get really messy here. When form data is posted to the server, there is no provision in the form URL-encoding standard to declare the code page used. Browsers can be told what encoding to use, and they will default to the charset of the HTML page containing the form, but there is no mechanism to communicate that choice to the server.
ASP takes the view that the code page of posted form fields will be the same as the code page of the response it is about to send. Take a moment to absorb that... This means that, quite counter-intuitively, the Response.CodePage value has an impact on the strings returned by Request.Form. For this reason it's important to get the correct code page set early; doing some form processing and then setting the code page later, just before sending a response, can lead to unexpected results.
The classic "the web page looks fine but the data in the database is corrupt" gotcha
One common gotcha this behaviour produces is where the developer has set CharSet="UTF-8" but left the code page at something like Windows-1252.
What ends up happening is that the user enters text which is sent to the server in UTF-8 encoding, but the script code reads it as 1252. This corrupt string gets stored in the database. A subsequent web page looks at this data and pulls the corrupt string from the DB. This string is then sent by Response.Write using 1252 encoding, but the destination page is told it's UTF-8. This has the effect of reversing the corruption, and everything looks fine to the user.
However, when another component, say a report generator, creates content from the database, the data appears corrupt because it is.
The Bottom Line
You are already doing the correct thing: get that CharSet and CodePage set early and consistently. Where other files are not saved as UTF-8 you will have problems if there is non-ASCII content in them, but otherwise you will be fine.
Many include ASPs are purely code with no content, and since that code ought to be purely ASCII, their encoding doesn't really matter.
I am writing a webapp in ZF and am having serious issues with UTF-8. It uses multilingual content through Zend Form, and it seems that ZF heavily escapes all of these characters and basically just won't show a field if there are diacritical characters like 'é'; and if I use the HTML entity equivalent, e.g. &eacute;, it gets escaped so that the user will see the literal text '&eacute;'.
Zend Form allows for non-escaped data, but trying to use this is confusing, and it seems it would need to be used all over the place.
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
And if the answer to the last question is yes, how do I convert the source text to UTF-8? I am comfortable setting up Apache so that it sends a default UTF-8 charset header, and also adding the charset meta tag to the HTML, but doing this I still get messed-up encoding. I have also tried opening the translation CSV file in TextWrangler on OS X as UTF-8, but it has done nothing.
Thanks!
L
'é' and if I use the HTML entity equivalent e.g. &eacute; it gets escaped so that the user will see '&eacute;'.
This I don't understand. Can you show an example of how it is displayed, as opposed to how it should be displayed?
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
Yup. In more detail: If the data you're displaying and the encoding of the HTML page are both UTF-8, the multi-byte special characters will be displayed correctly.
And if the last question is true, then how do I convert the source text to UTF8?
Advanced editors and IDEs enable you to define what encoding the source file is saved in. You would need to open the file in its current encoding (with special characters being displayed correctly) and save it as UTF-8.
If the content is messed up when you have the right content-type header and/or meta tag specified, then the content is not UTF-8 yet. If you don't get it sorted, post an example of what it looks like here.