Using Tcl encoding command to convert from Traditional Chinese to Simplified Chinese - encoding

I support a website written in Tcl which displays data in Traditional Chinese (big5). We then have a Java servlet, using the translation code from mandarintools.com, to translate a page request into Simplified Chinese. The conversion as specified to the translation code is from UTF-8 to UTF-8S; Java is apparently correctly translating the data to UTF-8 as it comes in.
The Java translation code works but is slow, and since the website is written in Tcl, someone on another list suggested I try using that. Unfortunately, Tcl doesn't support UTF-8S and I have been unable to figure out what translation to use in its place. I've tried gb2312, gb2312-raw, gb1988, euc-cn... all result in gibberish. My assumption is that Tcl is also translating to UTF-8 as it comes in, though I have tried converting from big5 first and it doesn't help.
My test code looks like this:
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 $page_body]
ns_write $translated_page_body
I have also tried
set page_body [ns_httpget http://www.mysite.com]
set translated_page_body [encoding convertto gb2312 [encoding convertfrom big5 $page_body]]
ns_write $translated_page_body
But it didn't change anything.
Does anyone out there have enough experience with this to help me figure it out?

FYI for completeness' sake, I've been told by Tcl experts that you can't do the conversion this way, it has to be done via character replacement.
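To illustrate what "character replacement" means here, a rough Python sketch (Tcl's string map would do the same job) using a tiny hypothetical mapping table; a real converter needs the complete Traditional-to-Simplified table, such as the one that ships with the mandarintools code:
# Rough sketch of Traditional -> Simplified conversion by character replacement.
# The mapping here is a tiny hypothetical sample; a real converter needs the
# complete Traditional-to-Simplified table.
T2S = {
    "體": "体",
    "國": "国",
    "語": "语",
}

def to_simplified(text):
    # Replace each Traditional character that has a Simplified equivalent.
    return "".join(T2S.get(ch, ch) for ch in text)

# A big5 page body would first be decoded to Unicode, converted character by
# character, and then re-encoded as needed.
page_body = "國語".encode("big5")
print(to_simplified(page_body.decode("big5")))  # -> 国语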

By any chance, are you grabbing your data from Oracle?
If so, see if you can use the CONVERT function to convert from "utf8" to "al32utf8", which is the true UTF-8 standard and which Tcl should work with seamlessly.
If not, well, I guess I'll wait for your comment(s).

Related

Converting "болно" to Cyrillic

I have a problem in my database where some of the Cyrillic text looks like this: "болно Ð±Ð°Ñ Ð°Ð¼ÑŒÐ´Ñ€ÑƒÑƒÐ»Ð¶ ч Ð". Is there a way to convert this back to a human-readable format?
I need to read the actual content of this.
Best I could do from your data...it looks Cyrillic but Google Translate didn't make anything of it. It seems it was decoded under the default US Windows codec but was really UTF-8, but the data is not quite right. I'm using Python to attempt to fix it:
>>> s.encode('cp1252').decode('utf8',errors='replace')
'болно ба� амьдруулж ч �'
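For what it's worth, the failure mode and the repair can be reproduced end to end like this (a minimal sketch using one word from the question; errors='replace' above is what leaves the � marks where bytes were already lost in the database):
# The text was UTF-8 that got decoded as cp1252; re-encoding as cp1252
# recovers the original bytes, which can then be decoded as UTF-8.
original = "болно"
mojibake = original.encode("utf8").decode("cp1252")  # what ends up stored
print(mojibake)                                      # -> Ð±Ð¾Ð»Ð½Ð¾
print(mojibake.encode("cp1252").decode("utf8"))      # -> болно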

Erlang and binary with Cyrillic

I need to be able to use binaries with Cyrillic characters in them. I tried just writing <<"абвгд">> but I got a badarg error.
How can I work with Cyrillic (or unicode) strings in Erlang?
If you want to input the above expression in the Erlang shell, please read the unicode module's user manual.
The functions characters_to_binary and characters_to_list are both reversible. The following is an example:
(emacs#yus-iMac.local)37> io:getopts().
[{expand_fun,#Fun<group.0.33302583>},
{echo,true},
{binary,false},
{encoding,unicode}]
(emacs#yus-iMac.local)40> A = unicode:characters_to_binary("上海").
<<228,184,138,230,181,183>>
(emacs#yus-iMac.local)41> unicode:characters_to_list(A).
[19978,28023]
(emacs#yus-iMac.local)45> io:format("~s~n",[ unicode:characters_to_list(A,utf8)]).
** exception error: bad argument
in function io:format/3
called as io:format(<0.30.0>,"~s~n",[[19978,28023]])
(emacs#yus-iMac.local)46> io:format("~ts~n",[ unicode:characters_to_list(A,utf8)]).
上海
ok
If you want to use unicode:characters_to_binary("上海"). directly in the source code, it is a little more complex. You can try it first to see the difference.
The Erlang compiler will interpret the code as ISO-8859-1 encoded text, which limits you to Latin characters. Although you may be able to bang in some ISO characters that may have the same byte representation as you want in Unicode, this is not a very good idea.
You want to make sure your editor reads and writes ISO-8859-1, and you want to avoid using literals as much as possible. Source these strings from files.

UTF-8 incorrectly displayed in Lua/ Corona

In Lua, for an iPad Corona project, I'm requesting a UTF-8 server text file (containing Chinese characters) using network.request, but the result when displayed in the console or in the app shows as "garbage". Google Chrome, for instance, displays the same UTF-8 page fine, as I'm setting the http header when the server sends this (using PHP) to 'Content-Type: text/plain; charset=utf-8' (and there's no BOM, byte order mark either). The "garbage" I'm seeing in Lua looks similar to when I "force" Chrome to render the page as ISO-8859-1 using the options menu.
Does anyone have any help or pointers?
If all else fails, how would I convert the "garbage" string back to its UTF-8 origins within Lua?
Thanks for any help!
Lua doesn't know anything about UTF-8; Lua strings are just sequences of bytes. It sounds like Corona itself is parsing the strings as ISO-8859-1. The most likely cause for this is them doing something really stupid and naive like treating each byte of the string as a Unicode code point.
I'm afraid I don't know Corona, so can't provide any specific solutions, but I'd suggest looking to see what functions it's got that involve encodings --- there may be a specific function to render a string with a particular encoding, for example.
Can you show the code for your network.request() call?
If you're downloading a html page, you should use network.download().
I had this exact same problem, except with Japanese characters. Although Lua doesn't support UTF-8, Corona acts like it does. What that means is that... if you pass a UTF-8 String to display.newText(...), it should display properly. Now, if you output to the console, it will actually print out the raw bytes of the String. And, if you try to print the length of the string, it will actually print out the number of bytes.
So, in summary, Lua treats all strings as an array of bytes. It knows nothing about UTF-8. Some Corona API methods, when passed UTF-8 strings, will display the strings correctly.
I had issues when I mixed UTF-8 with plain ASCII characters, which I believe confused Corona (what I mean is that I mixed English characters with Japanese characters... still all UTF-8, though). I have a hunch that each character in the string must be of the same length in bytes for Corona to display it properly. Try printing out one character at a time to see if that helps. Please feel free to post comments here if you run into trouble. I'd like to figure this issue out myself, too.
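The byte-versus-character distinction described above is easy to see outside Lua; here is a small Python sketch of the two lengths (Lua's # operator on a string reports the byte count, like the second number here):
# The same text counted two ways: code points vs. raw UTF-8 bytes.
# Lua/Corona strings behave like the byte view.
text = "上海abc"                  # mixed Chinese + ASCII
print(len(text))                  # 5 characters (code points)
print(len(text.encode("utf8")))   # 9 bytes: each Chinese character is 3 bytes in UTF-8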

Character Encoding Issue

I'm using an API that processes my files and presents optimized output, but some special characters are not preserved, for example:
Input: äöü
Output: äöü
How do I fix this? What encoding should I use?
Many thanks for your help!
It really depends on what processing you are doing to your data. But in general, one powerful technique is to convert it to UTF-8 (with iconv, for example) and pass it through ASCII-capable APIs or functions. In general, if those functions don't mess with data they don't understand as ASCII, then the UTF-8 is preserved -- that's a nice property of UTF-8.
I am not sure what language you're using, but things like this occur when there is a mismatch between the encoding of the content when it was entered and the encoding used when it is read back.
So you might want to specify exactly which encoding to read the data with. You may have to experiment with which encoding you actually need:
string.getBytes("UTF-8")
string.getBytes("UTF-16")
string.getBytes("UTF-16LE")
string.getBytes("UTF-16BE")
etc...
Also, do some research about the system where this data is coming from. For example, web services from ASP.NET deliver the content as UTF-16LE, but Java uses UTF-16BE encoding. When these two system talk to each other with extended characters, they might not understand each other exactly the same way.
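A quick way to see that byte-order difference (a Python sketch, for illustration only):
# The same character serialized with the two UTF-16 byte orders: the bytes are
# swapped, so a reader expecting the other order sees a different character.
ch = "ä"                             # U+00E4
le = ch.encode("utf-16le")           # b'\xe4\x00'
be = ch.encode("utf-16be")           # b'\x00\xe4'
print(le.hex(), be.hex())            # e400 00e4
print(repr(le.decode("utf-16be")))   # read with the wrong byte order -> '\ue400', not 'ä'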

What charset to use to store russian text into javascript files as an array

I am creating a coldfusion page, that takes language translation data stored in a table in my database, and makes static js files for each language pairing of english to ___ etc...
I am now starting to work on Russian; I was able to get the other languages to work fine.
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????.
I have tried writing it via cffile as UTF-8 or ISO-8859-1, but neither seems to get it to display properly.
Any suggestions?
Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.
By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (notepad++ handles that nicely, and so does Eclipse or the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. Coldfusion is quite agnostic in that regard, as it will happily use any jdbc driver that you may need.
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
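That lossy substitution is exactly what produces rows of question marks from Cyrillic; a small Python sketch of the mechanism (not ColdFusion, just the underlying conversion):
# Cyrillic forced through a character set that cannot represent it: every
# character becomes the substitution character '?'.
text = "Запуск"
print(text.encode("iso-8859-1", errors="replace").decode("iso-8859-1"))  # -> ??????
# The same text survives a UTF-8 round trip untouched.
print(text.encode("utf-8").decode("utf-8"))                              # -> Запуск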
I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself UTF-8? (with or without a BOM it matters not for Russian). In any case UTF-8 is absolutely what you should be using. Make sure you get a UTF-8 compliant editor. Which is most things on Mac. On Windows you could use Scite or GVim.
The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar to Windows code page 1251. cp1251 will be the default encoding when you save in a text editor from a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So, best is to save both the cf template and the js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru = {
    launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
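If you go the escape route, you don't have to write those \uXXXX sequences by hand; most languages can emit them when generating the static file. For example, a Python sketch of the idea (the ColdFusion page would do the equivalent when it writes the .js file):
import json

# json.dumps with its default ensure_ascii=True turns every non-ASCII character
# into a \uXXXX escape, so the generated literal is plain ASCII and works
# regardless of the including page's encoding.
label = "Запуск Мой календарь"
print("var translation_ru = { launchMyCalendar: %s };" % json.dumps(label))
# -> var translation_ru = { launchMyCalendar: "\u0417\u0430\u043f\u0443\u0441\u043a ..." };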
When it saves to the file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì", so the charset is wrong.
Looks like you've saved as cp1251 (ie. default codepage on a Russian machine) and then copied the file to a Western server where the default codepage is cp1252.
I also just found out that my text editor of choice, TextPad, doesn't support Unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.