What charset to use to store Russian text in JavaScript files as an array - Unicode

I am creating a ColdFusion page that takes language translation data stored in a table in my database and makes static .js files for each language pairing of English to ___, etc.
I am now starting to work on Russian; I was able to get the other languages to work fine.
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????
I have tried writing it via cffile as UTF-8 or ISO-8859-1, but neither seems to get it to display properly.
Any suggestions?

Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.

By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (Notepad++ handles that nicely, and so do Eclipse and the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. ColdFusion is quite agnostic in that regard, as it will happily use any JDBC driver that you may need.
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
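To see what that lossy conversion looks like in practice, here is a minimal sketch (Python, purely for illustration): Cyrillic has no equivalents in a Western single-byte encoding, so every character degrades to the substitution character, which is exactly the wall of ????? described in the question.
# Lossy conversion: no Latin-1 equivalent exists for Cyrillic,
# so each character is replaced by '?'
text = 'Запуск'
print(text.encode('iso-8859-1', errors='replace'))  # b'??????'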

I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself UTF-8? (With or without a BOM; it matters not for Russian.) In any case UTF-8 is absolutely what you should be using. Make sure you get a UTF-8-compliant editor, which is most things on Mac. On Windows you could use SciTE or GVim.

The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar in purpose to (but not byte-compatible with) Windows code page 1251; cp1251 will be the default encoding when you save in a text editor on a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So, best is to save both the cf template and the js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru= {
launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
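If you want to generate those escapes mechanically from your translation table, something like the following works (a Python sketch; the js_escape name is mine, and it only handles characters in the Basic Multilingual Plane):
def js_escape(s):
    # escape every non-ASCII character as a \uXXXX JavaScript literal
    return ''.join(ch if ord(ch) < 128 else '\\u%04x' % ord(ch) for ch in s)
print(js_escape('Запуск Мой календарь'))
# \u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 ...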
when it saves to the file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì", so the charset is wrong
Looks like you've saved the file in a Cyrillic single-byte encoding (the bytes match ISO-8859-5) and then copied it to a Western server where the default codepage is cp1252.
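You can reproduce that exact garbage in two lines (a Python illustration; the bytes line up with ISO-8859-5 misread as cp1252):
raw = 'Запуск Мой календарь'.encode('iso-8859-5')
print(raw.decode('cp1252'))  # ·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì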
I also just found out that my text editor of choice, TextPad, doesn't support Unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.

Related

What happens if you set your integration package to Unicode?

I'm importing data from flat files (text files). I do not know which encoding they will use; it may be Unicode, or it may be ASCII. What happens if I just choose "Unicode string [DT_WSTR]" (or Unicode data) in my integration package? Would it be able to read ASCII without issues? I am using SSIS 2012.
What happens if I just choose "Unicode string [DT_WSTR]" (or Unicode data) in my integration package? Would it be able to read ASCII without issues?
The encoding that Microsoft misleadingly call “Unicode” is actually UTF-16LE, an encoding based around two-byte code units.
UTF-16LE is not compatible with ASCII (or any of the locale-specific ANSI code pages), so if you read a file that is actually encoded in an ASCII superset you will get unreadable nonsense.
There's no magic ‘do the right thing’ option for reading characters from files, you have to know what encoding was used to create them. If you can see an encoded Byte Order Mark on the front of the data that usually allows you to make a good guess, but otherwise you're on your own.
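To illustrate the mismatch (Python, purely for demonstration): ASCII text and its UTF-16LE form share no byte layout, so decoding one as the other produces garbage.
print('Hi'.encode('ascii'))       # b'Hi'
print('Hi'.encode('utf-16-le'))   # b'H\x00i\x00' - every other byte is zero
print(b'Hi'.decode('utf-16-le'))  # '楈' - ASCII bytes paired up wrongly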

internal string encoding

I'm trying to understand how ASP classic handles strings internally. I've googled and debugged, but I still don't know how a string is encoded within the ASP script.
See the illustration below.
Is input data transformed so that all string variables have the same encoding no matter what source?
Most ASP pages are saved on disk as UTF-8. They do however #include ASP files that are saved with another encoding. At the top of front-end pages I set the response encoding to UTF-8:
Response.CodePage = 65001 ' UTF-8
Response.CharSet = "utf-8"
http://www.designerline.se/db/aspclassicencoding.png
First of all it's worth considering that both UTF-8 and Windows-1252 (and ISO-8859-1 and others) are based on US-ASCII. The first 128 characters in all of these codepages are identical: they use exactly the same byte value and all occupy just one byte.
In many cases the vast majority of the content is within the US-ASCII range, so it's hard to tell there is any difference between them. Frequently the whole file uses just US-ASCII characters, and hence the files are identical regardless of the chosen encoding (save perhaps the BOM at the start of the file).
Basic Script Processing
First the processor combines an ASP file with all its includes, and the includes of those includes. This is done very simply, sequentially replacing the include markers with the content of the include file being referenced. It is done purely at the byte level; no attempt is made to convert files of different encodings.
Next the combined version of the file is parsed, tokenized, even "compiled" into a tight interpreter-friendly form. It's at this point that chunks of content in the file (the stuff outside of script code blocks) are turned into a special form of Response.Write. It is special in that, when script execution reaches these writes, the processor simply copies the bytes verbatim, as found in the file, directly to the output stream; again no attempt is made to convert any encodings.
Script code and character encoding
The ASP processor just doesn't cope well with anything that isn't ASCII. All your code and especially your string literals in your code should only be in ASCII.
What can be a bit confusing is that once a script is executing, all string variables are stored using Unicode (UTF-16) encoding.
When code writes content to the response using the proper Response.Write method, this is where Response.CodePage comes into effect: it encodes the Unicode string the script provides into the response code page before adding it to the output stream.
What is the effect of Response.CharSet
It adds the charset attribute to the Content-Type HTTP header. That is it; it has no other impact. If you set one character set but actually send a different one (because your Response.CodePage doesn't match it, or because the byte content of the included files is not in that encoding), then you can expect problems.
Input encoding
Things get really messy here. When form data is posted to the server, there is no provision in the form URL-encoding standard to declare the code page used. Browsers can be told what encoding to use, and they will default to the charset of the HTML page containing the form, but there is no mechanism to communicate that choice to the server.
ASP takes the view that the codepage of posted form fields is the same as the codepage of the response it is about to send. Take a moment to absorb that... This means that, quite counter-intuitively, the Response.CodePage value has an impact on the strings returned by Request.Form. For this reason it's important to get the correct codepage set early; doing some form processing and then setting the codepage later, just before sending a response, can lead to unexpected results.
The classic "the web page looks fine but the data in the database is corrupt" gotcha
One common gotcha this behaviour results in is where the developer has set CharSet="UTF-8" but left the codepage at something like "Windows-1252".
What ends up happening is that the user enters text which is sent to the server in UTF-8 encoding, but the script code reads it as 1252. This corrupt string gets stored in the database. A subsequent web page looks at this data: the corrupt string is pulled from the DB, then sent by Response.Write using 1252 encoding, while the destination page is told it's UTF-8. This has the effect of reversing the corruption, and everything looks fine to the user.
However when other components, say a report generator, creates content from the database then the data appears corrupt because it is.
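The whole round trip can be sketched in a few lines (Python, just to illustrate the mechanism; it happens to round-trip here because every UTF-8 byte of the sample exists in cp1252):
original = 'café'
stored = original.encode('utf-8').decode('cp1252')  # what lands in the DB
print(stored)                                   # 'cafÃ©' - corrupt
print(stored.encode('cp1252').decode('utf-8'))  # 'café' - corruption reversed
# any component that reads the database honestly sees 'cafÃ©'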
The Bottom Line
You are already doing the correct thing: get that CharSet and CodePage set early and consistently. Where other files are not saved as UTF-8, you will have problems if there is non-ASCII content in them, but otherwise you will be fine.
Many included ASP files are purely code with no content, and since that code ought to be purely ASCII, its encoding doesn't really matter.

How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?
Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Windows tools) default to “ANSI”. This is a misleading name, as it has nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard; in fact it means the system default code page of the current Windows machine. That default code page differs between machines, or can change on the same machine, at which point all text files in “ANSI” that contain non-ASCII characters, like accented letters, will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
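Here is what that breakage looks like (a Python sketch): the same “ANSI” bytes decode differently depending on each machine's default code page.
data = 'Grüße'.encode('cp1252')  # saved as "ANSI" on a Western-European box
print(data.decode('cp1251'))     # read as "ANSI" on a Russian box: 'GrьЯe'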
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.
If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on that, or you use only the ASCII set of characters and won't be working with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake
It is very important, since your tool of choice will show false characters if you use the wrong encoding. Try to load a Cyrillic file in Notepad without using UTF-8 or so and see a lot of "?" coming up. :)

What encoding does InstallShield expect non-latin-alphabet string table entries to use?

I work on an app that gets distributed via a single installer containing multiple localizations. The build process includes a script that updates the .ism string table with translations for each supported language.
This works fine for languages like French and German. But when testing the installer in, e.g., Japanese, the text shows up as a series of squares. It's unlikely to be a font problem, since the InstallShield-supplied strings show up fine; only the string table entries are mangled. So the problem seems to be that the strings are in the wrong encoding.
The .ism is in XML format, with UTF-8 declared as its encoding, so I assumed the strings needed to be UTF-8 encoded as well. Do they actually need to use the encoding of the target platform? Is there any concern, then, about targets having different encodings, i.e. Chinese systems using one GB-encoding versus another? What is the right thing to do here?
Edit: Using InstallShield 2009, since there is apparently a difference between that and 2010.
In InstallShield 2009 and earlier, the encoding is a base-64 encoding of the binary string in the ANSI encoding specific to the language in question (e.g. CP932 for Japanese). In InstallShield 2010 and later, it will still accept that or use UTF-8, depending on other columns in that table.
Thanks (up-voted his answer) go to Michael Urman, for pointing us in the right direction. But this is the actual working (with InstallShield 2009) algorithm, reverse-engineered by a co-worker:
Start with a unicode (multi-byte-character) string
Write out the length as the encoded-length field in the ism-file
Encode the string as UTF-16-little-endian
Base-64 using the uuencode dictionary, except with ` (back-tick) instead of spaces.
Write the result to the ism-file, escaping XML entities
Be aware that base-64ing using the uuencode dictionary is not the same as using the uuencode algorithm. Standard uuencode produces a set of newline-separated lines, including a header, a footer and one or more data lines, each of which begins with a length character. If you're implementing this using a uuencode codec, you'll need to strip all of that off.
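For reference, here is a rough sketch of the algorithm above (Python; the function names are mine, the zero-padding of trailing bytes is an assumption, and the length field is assumed to be the character count, so verify the output against a known-good .ism entry):
from xml.sax.saxutils import escape

def uu_base64(data):
    # base-64 using the uuencode alphabet chr(32 + value),
    # with '`' substituted for space (value 0)
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3].ljust(3, b'\x00')  # assumed padding
        n = int.from_bytes(chunk, 'big')
        for shift in (18, 12, 6, 0):
            v = (n >> shift) & 0x3F
            out.append('`' if v == 0 else chr(32 + v))
    return ''.join(out)

def encode_ism_string(s):
    # returns the (assumed) length field and the XML-escaped payload
    return len(s), escape(uu_base64(s.encode('utf-16-le')))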
I'm also trying to figure this out...
I've inherited some InstallShield 12 (which is pre-2009) projects with string table entries containing characters outside the range of base64 'target' characters.
For example, one of the Japanese strings is:
4P!H&$9!O'<4!R&\=!E&,=``#$(80!C&L=0!P"00!G`&4`;#!T`)(PI##S,+DPR##\,.LP5S!^,%DP`C
After much searching I happened upon Base85 encoding, which looks much closer to being plausible, but have not yet verified this to be the solution.

How do you troubleshoot character encoding problems?

If all you see is the ugly no-char boxes, what tools or strategies do you use to figure out what went wrong?
(The specific scenario I'm facing is no-char boxes within a <select> when it should be showing Japanese chars.)
Firstly, "ugly no-char boxes" might not be an encoding problem, they might just be a sign you don't have a font installed that can display the glyphs in the page.
Most character encoding problems happen when strings are being passed from one system to another. For webapps, this is usually between the browser and the application, between the application and the filesystem and between the application and the database.
So you need to check where the mis-encoded data is coming from, what character encoding it has at the source, and what encoding it is being received as. The best way is to send through characters you know the system is having problems with, and examine them at each level of the app. What do they look like inside the app? In the database? When you get them back from the database? When they're displayed in the browser?
Sorry to be so general, but the question doesn't give much more to work with.
If the data you send to the browser becomes mangled (mojibake), you will get trash characters. Also, if you specify the wrong character set in your META tags, your browser will render the page incorrectly, causing mojibake again, sometimes in random places on the page.
When handling CJK character sets, you must be sure to use UTF8 character encoding throughout the lifetime of your program (data storage, retrieval, data manipulation in your code, displaying in the browser, etc...).
What is UTF8?
UTF8 is a variable-width encoding: text is represented as a stream of bytes, where ASCII characters occupy a single byte and other characters occupy two to four bytes. Interpreted with the wrong encoding, those byte sequences turn into what the Japanese call "mojibake".
As a coder, from database to codebase to browser, you should try to use UTF8 completely. For email you can use UTF8, but you will probably find most mail servers and clients are still old and use a mishmash of different character sets (e.g. the ISO-2022 family).
Database Settings
If you are a MySQL user, then make sure all connections to the DB use UTF8, and that all tables/fields use UTF8. By default MySQL uses the latin1 character set (with a Swedish collation). Those kooky Swedes love their sense of humour!!
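As a hedged sketch (using the mysql-connector-python package; host, credentials and database name are placeholders), forcing UTF-8 onto the connection looks like this:
import mysql.connector
conn = mysql.connector.connect(
    host='localhost', user='app', password='secret',  # placeholders
    database='mydb',
    charset='utf8mb4',  # make the connection itself speak UTF-8
)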
Checking your Codebase
In my experience editors like Notepad++, Notepad2, UltraEdit, e, etc... all have UTF8 support problems. They mostly work, but since their developers don't use CJK languages themselves, they are not perfected. Issues like turning off BOM (Byte Order Mark), mangled tabs, poor character set conversion, etc ... all present problems.
I highly recommend using a proven UTF8 editor like Maruo. This is made by a Japanese company, but there is an English version (and a trial version) at http://www.hidemaru.interlink.or.jp/software/
Lastly, you may need to convert your source files into UTF8. Especially if the codebase itself has CJK language strings contained therein.
Manipulating Strings
Any string function needs to be multibyte-safe. Notice I didn't say double-byte: UTF8 is not double-byte but multibyte, depending on the total number of bytes used to represent a character. In PHP you need to call the MB string functions specifically. Ruby and other languages have more transparent support, but you need to check the docs for your flavour of application server!
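A three-line illustration of why (Python here, but the point is language-independent): character counts and byte counts diverge for multibyte text, and byte-oriented functions will happily split a character in half.
s = '日本語'
print(len(s))                  # 3 characters
print(len(s.encode('utf-8')))  # 9 bytes
print(s.encode('utf-8')[:4])   # b'\xe6\x97\xa5\xe6' - a broken character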
META Tags
Check out google.co.jp or yahoo.co.jp for their META tags. These are sites that know how to do it properly. Basically, include the following META tag in the document <HEAD>:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
It is usually safe to mix English HTML document-type attributes with the above charset too. So adding the META tag above seems to work in an HTML document that has:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
Email
This is a wholly different can of worms. UTF8 works a lot of the time, but many older Japanese clients use ISO-2022-JP more. This is not worth covering here.
Debugging UTF8 Issues
Once you have a reliable UTF8 editor like Maruo, you can create static pages and resolve your issues.
Hope that helps
Redirect the data to disk and use a hex editor. Most text editors / viewers do their own conversions behind the scenes, so it is difficult to be sure you are seeing the data in its true form.
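If you don't have a hex editor handy, a few lines of Python do the same job (the file name is a placeholder):
with open('suspect.txt', 'rb') as f:  # 'rb' avoids any text-mode conversion
    data = f.read(64)
print(data.hex(' '))  # raw bytes; e.g. 'e3 82 ab' is UTF-8 for katakana 'カ'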