UTF-8 incorrectly displayed in Lua/Corona

In Lua, for an iPad Corona project, I'm requesting a UTF-8 server text file (containing Chinese characters) using network.request, but the result shows as "garbage" when displayed in the console or in the app. Google Chrome, for instance, displays the same UTF-8 page fine; the server (using PHP) sends the HTTP header 'Content-Type: text/plain; charset=utf-8', and there's no BOM (byte order mark) either. The "garbage" I'm seeing in Lua looks similar to what Chrome renders when I "force" it to interpret the page as ISO-8859-1 via the options menu.
Does anyone have any help or pointers?
If all else fails, how would I convert the "garbage" string back to its UTF-8 origins within Lua?
Thanks for any help!

Lua doesn't know anything about UTF-8; Lua strings are just sequences of bytes. It sounds like Corona itself is parsing the strings as ISO 8859-1. The most likely cause is something naive like treating each byte of the string as a Unicode code point.
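You can see this byte-oriented behaviour in plain Lua (a minimal sketch; it assumes the source file itself is saved as UTF-8):

local s = "中"        -- one Chinese character, three UTF-8 bytes
print(#s)             --> 3: the byte count, not 1
print(s:byte(1, -1))  --> 228 184 173 (0xE4 0xB8 0xAD)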
I'm afraid I don't know Corona, so I can't provide specific solutions, but I'd suggest looking at what functions it has that involve encodings; there may be a function to render a string with a particular encoding, for example.

Can you show the code for your network.request() call?
If you're downloading an HTML page, you should use network.download().
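For reference, a minimal network.request sketch (the URL and listener name are placeholders; event.response arrives as raw bytes, which will be UTF-8 if that's what the server sent):

local function onResponse(event)
    if event.isError then
        print("network error")
    else
        print(event.response)  -- the console prints raw bytes, hence the "garbage"
        -- passing event.response to display.newText should render the UTF-8 text
    end
end

network.request("http://example.com/chinese.txt", "GET", onResponse)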

I had this exact same problem, except with Japanese characters. Although Lua doesn't support UTF-8, Corona acts like it does. What that means is: if you pass a UTF-8 string to display.newText(...), it should display properly. If you print the string to the console, however, you get the raw bytes, and if you print the string's length, you get the number of bytes.
So, in summary: Lua treats all strings as arrays of bytes and knows nothing about UTF-8, but some Corona API methods, when passed UTF-8 strings, will display them correctly.
I had issues when I mixed multi-byte UTF-8 characters with plain ASCII characters, which I believe confused Corona (that is, I mixed English characters with Japanese characters; still all valid UTF-8, though). I have a hunch that each character in the string must be the same length in bytes for Corona to display it properly. Try printing one character at a time to see if that helps. Please feel free to post comments here if you run into trouble; I'd like to figure this issue out myself, too.
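A short sketch of that byte-versus-character distinction (it assumes the legacy Corona display.newText(text, x, y, font, size) signature and a source file saved as UTF-8):

local str = "日本語"   -- three characters, nine UTF-8 bytes
print(#str)            --> 9: the byte count, not 3
-- count characters by skipping UTF-8 continuation bytes (0x80-0xBF)
print(select(2, str:gsub("[^\128-\191]", "")))  --> 3
display.newText(str, 10, 10, native.systemFont, 24)  -- should render the glyphs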

Related

Hebrew characters processed by HTML Tidy turn into gibberish

I'm using HTML Tidy Online (http://infohound.net/tidy/) to tidy up a very old and messy HTML file which contains some Hebrew characters. Whenever the page is processed by Tidy, the Hebrew characters in the output turn into gibberish, even after changing the encoding settings. With different settings, the best I can manage is output with the Hebrew characters as Unicode entities.
I Googled around for a possible solution but found none.
I had a couple ideas in mind, but I'm not sure exactly how to approach them, if at all (maybe someone has a better solution).
I thought maybe I could (after processing the page) scan the page for unicode entities and replace them with the corresponding Hebrew characters (in a systematic way, of course).
Maybe I could take the HTML Tidy source code and modify it to output Hebrew characters appropriately. The problem with this is that I doubt I am knowledgeable enough to even get started on something like this.
I had a similar problem: a document in UTF-8, containing Unicode characters, and HTML Tidy turned them into HTML entities. This in HTMLTIDY.CFG fixed it:
char-encoding: utf8
input-encoding: utf8
output-encoding: utf8
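The same config works with the command-line tool as well; tidy's -config flag points at it (file names here are placeholders):

tidy -config HTMLTIDY.CFG messy.html > clean.html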
Hope it helps.
The website http://infohound.net/tidy/ that you are using has a "Char encoding" option at the bottom right. You need to choose utf-8, but first you need to make sure that the page is encoded in UTF-8 in your text editor. In Notepad++, for example, you can go to Encoding > Convert to UTF-8 without BOM.

Perl encodings question

I need to get a string from <STDIN>, written in latin and russian mixed encodings, and convert it to some url:
$search_url = "http://searchengine.com/search?text=" . uri_escape($query);
But this process goes bad and gives out mojibake (a mixture of weird letters). What can I do in Perl to solve it?
Before you can get started, there's a few things you need to know.
You'll need to know the encoding of your input. "Latin" and "russian" aren't (character) encodings.
If you're dealing with multiple encodings, you'll need to know what is encoded using which encoding. "It's a mix" isn't good enough.
You'll need to know the encoding the site expects the query to use. This should be the same encoding as the page that contains the search form.
Then, it's just a matter of decoding the input using the correct encoding, and encoding the query using the correct encoding. That's the easy part. Encode provides functions decode and encode to do just that.
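Encode's decode and encode do the heavy lifting in Perl. To keep all examples in this thread's main language, here is the same decode-then-encode flow sketched in Lua, where the third-party lua-iconv binding and the CP1251 input encoding are both assumptions:

local iconv = require("iconv")   -- third-party lua-iconv binding (assumed installed)
local raw = io.read("*l")        -- analogous to reading a line from <STDIN>

-- decode: assumed CP1251 input bytes -> UTF-8
local cd = iconv.new("utf-8", "cp1251")
local query = cd:iconv(raw)

-- encode for the URL: percent-escape every byte outside the unreserved set
local escaped = query:gsub("[^%w%-%._~]", function(c)
  return string.format("%%%02X", c:byte())
end)
local search_url = "http://searchengine.com/search?text=" .. escaped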

Encoding issue in the web page

There is an encoding issue in a web page: it is showing garbled special characters in the browser (for example in the word "Cinéma"). The content is stored in an ISO encoding, but the web page is rendered as UTF-8. Some articles display properly because they are UTF-8 encoded; others, such as those containing "Cinéma", show the encoding issue. This is in Perl 5.
Can anyone help me out with this encoding issue? That would be great!
Thanks in advance.
Ensure your Content-type header, or meta document element, contains correct encoding information.
A quick and easy way to test if this is your issue is to ask the browser to render the page as if it had received a specific encoding directive. In Safari this would be View -> Text Encoding and then selecting something appropriate.
I'd hazard a guess that if you inform the browser to use utf-8 then it will render the page correctly.
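For reference, the encoding directive in both forms looks like this (assuming the content really is UTF-8):

Content-Type: text/html; charset=utf-8
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">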
The only way to solve this will be to spend some time reading up on Unicode and UTF-8 and how to handle encoding in Perl. (perldoc perluniintro, perldoc perlunitut, perldoc perlunicode, perldoc perlunifaq for example).
UTF-8 is a very different concept from the other "encodings" programmers encounter (escaping in strings, URL encoding, HTML character entities, etc.); it's about how your code should interpret sequences of bytes as characters.
Without knowing the source of the word containing the special character (an accented 'e'), it's impossible to offer further help. Is it coming from a database? A static HTML page? An HTML template? A string within the Perl code?

Codepages and encodings

Before anyone recommends that I do a Google search on this: I have. I just need a bit more clarity around what code pages and encodings are.
If I use UTF-8 encoding, and use an Italian code page and then a French code page, does this mean I'll get different characters even though the bytes haven't changed?
Joel has a nice summary of this:
http://www.joelonsoftware.com/articles/Unicode.html
And no, if I understand your question correctly, it doesn't mean that.
When you're converting UTF-8 to a specific code page, it is possible that only some of the characters are going to be converted. What happens to the ones that don't get converted depends on how you call the conversion. A possible result is that the characters which could not be mapped to the code page would be converted to question mark characters.
An encoding is simply a mapping between numerical values and "characters".
US-ASCII maps the number 65 to the letter A, 32 to a space, and 49 to the digit "1". (How these things are rendered is another matter.) In fact, UTF-8 maps those values the same way! But there are other values which UTF-8 treats differently from ASCII. It is a variable-length encoding, i.e. a character may be encoded with 1, 2, 3, or 4 bytes; common characters generally consume fewer bytes.
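You can watch the variable-length property directly; a Lua one-off (Lua for consistency with the first question; it assumes this file is saved as UTF-8):

print(#"A")    -- 1 byte, same as in ASCII
print(#"é")    -- 2 bytes (0xC3 0xA9)
print(#"漢")   -- 3 bytes (0xE6 0xBC 0xA2)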
Plain text files, including web pages, are stored and transmitted as sequences of bytes. These bytes are supposed to represent something textual. Software applications (like text editors and web browsers) are responsible for rendering the information within these files on the screen. Usually they make use of library or OS functions.
If the software assumes a different encoding to the software that created the file, the wrong characters may be displayed!
Note that it is possible to convert between different encodings; however if you convert to an encoding that does not contain a certain character, the software must make a choice as to what to use instead. This conversion often happens transparently (when you save a file with a certain encoding, whatever you've typed must be changed into that encoding).
UTF-8 includes all characters from your French and Italian code pages, but the language-specific code pages do not include all of each other's characters.
So you can take input from each language and convert it to UTF-8 for storage, but you cannot be certain that you will get the right characters if you take Italian input and show it as French.
Use UTF-8 all the way if you can.

How to display a non-ascii filename in the file download box in browsers?

There doesn't seem to be an accepted way of sending a header parameter in non-ASCII form.
The header for file download usually looks like
Content-disposition: attachment; filename="theasciifilename.doc"
Except that if you smash a UTF-8-encoded string into the filename parameter, Firefox will handle it fine, whereas IE will throw up.
There is a document on CodeProject that explains a method for encoding the filename.
This document encodes Bản Kiểm Kê.doc to B%e1%ba%a3n%20Ki%e1%bb%83m%20K%c3%aa.doc by hex encoding the bytes.
Problem #1: the first character in that string, ả, has a value of 0x1EA3 -- encode that number in hex and you get %a3%1e. How did this guy get %e1%ba%a3? (I'm obviously missing something simple here.)
Problem #2: While IE acknowledges this encoding, Firefox doesn't! What to do?
The specs basically don't permit anything other than US-ASCII. HTTP headers are US-ASCII. HTTP's payload defaults to ISO 8859-1 but that refers to the content body, not the headers.
Arguably the Right Thing to do would be to use MIME's technique for encoding non-ASCII data in headers, as described in RFC 2047, but I have no idea whether browsers actually support that.
EDIT: Whoops, no, RFC 2047 section 5 explicitly says that the encoded form is not permitted in Content-Disposition. Looks like you're out of luck - there is no standard.
EDIT 2: There is a standard - RFC 2231 defines how this is now supposed to work. It has support from some browsers, but is not supported in IE. I found some test cases which demonstrate how it works and what browser support is available.
Answer to question #1: You are confusing Unicode code points with UTF-8 bytes. The code point of 'ả' is U+1EA3, but %a3%1e is just that number's two bytes in little-endian order; it is not a UTF-8 encoding. In UTF-8 that character requires three bytes: 0xE1 0xBA 0xA3. URL encoding is poorly defined for non-ASCII data, but %e1%ba%a3 is the valid percent-encoding of those UTF-8 bytes.
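To see where %e1%ba%a3 comes from, here is a one-liner in Lua (the language of the top question; it assumes the source file is saved as UTF-8, so the string literal holds the character's UTF-8 bytes):

local s = "ả"   -- U+1EA3, stored as the three UTF-8 bytes 0xE1 0xBA 0xA3
print((s:gsub(".", function(c) return string.format("%%%02x", c:byte()) end)))
--> %e1%ba%a3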
For Problem #2 you need to URL encode the file name for both Internet Explorer and Firefox. The only difference is that you need to use the format of RFC 2231 in Firefox.
This applies to Firefox 3 and Internet Explorer 7.
In the link you've got above, e1 ba a3 is the UTF-8 encoding of the character mentioned, not the character code.
Answer (sort of) to problem #2:
Since you've discovered that the naming scheme in one browser does not work in the other, your only solution is to do it differently for each browser, similar to the example here.
In case the link goes away, the solution is basically:
1. If the browser is IE, URL-encode the filename.
2. Generate the Content-Disposition header (a sketch follows below).
Of course determining if the browser is IE by User-agent (which is about the only way you can do it) is fraught with all sorts of the usual peril.
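As a sketch of that branching, again in Lua for consistency with the top question (content_disposition and escape_bytes are hypothetical helpers, and the MSIE substring sniff is exactly the naive User-agent check being warned about):

local function escape_bytes(s)
  return (s:gsub("[^%w%.%-]", function(c)
    return string.format("%%%02X", c:byte())
  end))
end

local function content_disposition(utf8_name, user_agent)
  if user_agent:find("MSIE", 1, true) then
    -- IE: percent-encoded UTF-8 bytes in the plain filename parameter
    return 'attachment; filename="' .. escape_bytes(utf8_name) .. '"'
  end
  -- Firefox and others: the RFC 2231 extended parameter
  return "attachment; filename*=UTF-8''" .. escape_bytes(utf8_name)
end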
As North-American-centric as this sounds: if it is important that this work in a large number of browsers you do not control, some of which may have the User-agent blocked or modified, then simply avoid UTF-8 characters in the filename and always use something like "Download".
Unfortunately, there currently is no single way that would work in all User Agents.
See http://greenbytes.de/tech/tc2231/ for test cases, then complain to Microsoft, Google and Apple.