How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?

Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Windows tools) default to “ANSI”. This is a misleading name, as it has nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard; it actually means the system default code page of the current Windows machine. That default code page can differ between machines, or change on the same machine, at which point all text files in “ANSI” that contain non-ASCII characters such as accented letters will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
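Here's a quick Python sketch of that breakage (cp1253, the Greek code page, just stands in for “some other machine's default”):
# "café" written under Western code page 1252, then read back on a machine
# whose default "ANSI" code page happens to be Greek (1253).
data = "café".encode("cp1252")
print(data.decode("cp1252"))   # café -- on the machine that wrote it
print(data.decode("cp1253"))   # cafι -- the 0xE9 byte is a Greek iota there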
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
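A quick Python sketch of both points:
print("Hello".encode("utf-8"))       # b'Hello' -- identical to the ASCII bytes
print("Hello".encode("utf-16-le"))   # b'H\x00e\x00l\x00l\x00o\x00' -- full of zero bytes
print(len("𝄞".encode("utf-16-le")))  # 4 -- one code point, but two UTF-16 code units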
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.

If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake

It is very important, since your tool (whatever it is) will show the wrong characters if you use the wrong encoding. Try to load a Cyrillic file in Notepad without using UTF-8 or so and see a lot of "?" coming up. :)
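The "?" comes from lossy conversion into a code page that has no Cyrillic; in Python terms, roughly:
# Cyrillic text forced into a Western single-byte code page: every character
# with no equivalent gets substituted, which is where all the "?" come from.
print("Привет".encode("cp1252", errors="replace"))   # b'??????'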


What charset to use to store russian text into javascript files as an array

I am creating a ColdFusion page that takes language translation data stored in a table in my database and makes static .js files for each language pairing of English to ___, etc.
I am now starting to work on Russian; I was able to get the other languages to work fine.
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????
I have tried writing it via cffile as UTF-8 or ISO-8859-1, but neither seems to get it to display properly.
Any suggestions?
Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.
By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (notepad++ handles that nicely, and so does Eclipse or the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. ColdFusion is quite agnostic in that regard, as it will happily use any JDBC driver that you may need.
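For the first point, something like this little Python sketch (the is_utf8 helper is just a hypothetical name) can confirm a template really is UTF-8 on disk:
def is_utf8(path):
    # Hypothetical helper: True if the file's bytes decode cleanly as UTF-8.
    try:
        with open(path, "rb") as f:
            f.read().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False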
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself UTF-8? (with or without a BOM it matters not for Russian). In any case UTF-8 is absolutely what you should be using. Make sure you get a UTF-8 compliant editor. Which is most things on Mac. On Windows you could use Scite or GVim.
The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar to Windows code page 1251. cp1251 will be the default encoding when you save in a text editor from a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
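A quick Python sketch of why the choice matters for Cyrillic:
s = "календарь"
s.encode("iso-8859-5")   # works: one byte per letter
s.encode("utf-8")        # works: two bytes per Cyrillic letter
s.encode("iso-8859-1")   # raises UnicodeEncodeError -- Latin-1 has no Cyrillic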
So, best is to save both the cf template and the js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru= {
launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
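If you're generating such files programmatically, a sketch along these lines does the escaping for you (json.dumps escapes non-ASCII by default, and a JSON object is a valid JavaScript literal):
import json
ru = {"launchMyCalendar": "Запуск Мой календарь"}
print("var translation_ru = " + json.dumps(ru, ensure_ascii=True) + ";")
# var translation_ru = {"launchMyCalendar": "\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 ..."};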
when it saves to file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì" so the charset is wrong
Looks like you've saved the file in a single-byte Cyrillic encoding (the garbage above matches ISO-8859-5 bytes) and then copied it to a Western server, where it's being read with the default Western codepage, cp1252.
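You can reproduce, and reverse, that exact garbage with a quick Python sketch:
text = "Запуск Мой календарь"
garbled = text.encode("iso-8859-5").decode("cp1252")
print(garbled)                                        # ·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì
print(garbled.encode("cp1252").decode("iso-8859-5"))  # recovers the original text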
I also just found out that my text editor of choice, textpad, doesn't support unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.

Why haven't ASCII and ISO-8859-1 encoding been relegated to history?

It seems to me if UTF-8 was the only encoding used everywhere ever, there would be a lot less issues with code:
Don't even need to think about encoding issues.
No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.
You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).
More characters can be represented in UTF-8.
Other things I can't think of right now.
So why haven't the inferior encodings been nuked from space?
Don't even need to think about encoding issues.
True. Except for all the data that's still in the old ASCII format.
No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
Incorrect. UTF-8 is variable length, from 1 to 4 bytes (the original design went up to 6).
Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.
Browsers don't generally wait for the full page, they make a guess based on the first part of the page data.
You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).
Except for all those other old web pages that use other non-UTF-8 encodings (the non-English speaking world is pretty big).
More characters can be represented in UTF-8.
True. Your problems of data validation just got harder, too.
Why are EBCDIC, Baudot, and Morse still not nuked from orbit? Why did the buggy-whip manufacturers not close their doors the day after Gottlieb Daimler shipped his first automobile?
Relegating a technology to history takes non-zero time.
No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
Not true at all. UTF-8 is a mixed-width 1, 2, 3, and 4-byte encoding. You may have been thinking of UTF-16, but even that has had 4-byte characters for a while. If you want a “simple” fixed-width encoding, you need UTF-32.
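A quick Python check of the widths:
for ch in ["A", "é", "€", "𝄞"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")   # 1, 2, 3 and 4 bytes respectively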
You would never see ? and other random symbols on old web pages
Even with UTF-8 web pages, you still might not have a font that supports every Unicode character, so this is still a problem.
More characters can be represented in UTF-8.
Sometimes this is a disadvantage. Having more characters means more bits are required to encode the characters. And to keep track of which ones are letters, digits, etc. And to store the fonts for displaying those characters. And to deal with additional Unicode-related complexities like normalization.
This is probably a non-issue for modern computers with gigabytes of RAM, but don't expect your TI-83 to support Unicode any time soon.
But still, if you do need those extra characters, it's way easier to work with UTF-8 than with zillions of different 8-bit character encodings (plus a few non-self-synchronizing East Asian multibyte encodings).
So why haven't the inferior encodings been nuked from space?
In large part, this is because the “inferior” programming languages haven't been nuked from space. Lots of code is still written in languages like C and C++ (and even COBOL!) that predate Unicode and still don't have good support for it.
I badly wish we could get rid of the situation where some libraries use char-based strings encoded in UTF-8 while others think char is for legacy encodings and Unicode should always use wchar_t, and then you have to deal with whether wchar_t is UTF-16 or UTF-32 (or neither).
I don't think UTF-8 uses "2 bytes"; it's variable length. Also, a lot of OS-level code is UTF-16 and UTF-32 respectively, which means the choice for Latin encodings is between ASCII and ISO-8859-1.
Well, your question is a bit of a why-is-the-world-so-bad complaint. It is the way it is. Pages written in encodings other than UTF-8 come from the times when UTF-8 was badly supported by operating systems and was not yet the de facto standard.
These pages will stay in their original encoding until someone changes them, which in many cases is not very likely. Many of them are no longer maintained by anyone.
There are also a lot of documents on the internet in non-Unicode encodings, in many formats. Someone COULD convert them, but, as above, that requires a lot of effort.
So support for non-Unicode encodings must also stay.
And for the current times, keep as a rule that whenever someone uses a non-Unicode encoding, a kitten dies.

How to "force" a file's ISO-8859-1ness?

I remember when I used to develop website in Japan - where there are three different character encodings in currency - the developers had a trick to "force" the encoding of a source file so it would always open in their IDEs in the correct encoding.
What they did was to put a comment at the top of the file containing a Japanese character that only existed in that particular character encoding - it wasn't in any of the others! This worked perfectly.
I remember this because now I have a similar, albeit Anglophone, problem.
I've got some files that MUST be ISO-8859-1 but keep opening in my editor (Bluefish 1.0.7 on Linux) as UTF-8. This isn't normally a problem EXCEPT for pound (£) symbols and whatnot. Don't get me wrong, I can fix the file and save it out again as ISO-8859-1, but I want it to always open as ISO-8859-1 in my editor.
So, are there any sort of character hacks - like I mention above - to do this? Or any other methods?
PS. Unicode advocates / evangelists needn't waste their time trying to convert me because I'm already one of them! This is a rickety older system I've inherited :-(
PPS. Please don't say "use a different editor" because I'm an old fart and set in my ways :-)
Normally, if you have a £ encoded as ISO-8859-1 (ie. a single byte 0xA3), that's not going to form part of a valid UTF-8 byte sequence, unless you're unlucky and it comes right after another top-bit-set character in such a way to make them work together as a UTF-8 sequence. (You could guard against that by putting a £ on its own at the top of the file.)
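A quick Python sketch of that:
b"\xa3".decode("iso-8859-1")   # '£'
b"\xa3".decode("utf-8")        # raises UnicodeDecodeError: invalid start byte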
So no editor should open any such file as UTF-8; if it did, it'd lose the £ completely. If your editor does that, “use a different editor”—seriously! If your problem is that your editor is loading files that don't contain £ or any other non-ASCII character as UTF-8, causing any new £ you add to them to be saved as UTF-8 afterwards, then again, simply adding a £ character on its own to the top of the file should certainly stop that.
What you can't necessarily do is make the editor load it as ISO-8859-1 as opposed to any other character set where all single top-bit-set bytes are valid. It's only multibyte encodings like UTF-8 and Shift-JIS that you can rule out, by including byte sequences that are invalid for that encoding.
What will usually happen on Windows is that the editor will load the file using the system default code page, typically 1252 on a Western machine. (Not actually quite the same as ISO-8859-1, but close.)
Some editors have a feature where you can give them a hint about what encoding to use with a comment in the first line, e.g. for vim:
# vim: set fileencoding=iso-8859-1 :
The syntax will vary from editor to editor/configuration. But it's usually pretty ugly. Other controls may exist to change default encodings on a directory basis, but since we don't know what you're using...
In the long run, files stored as ISO-8859-1 or any other encoding that isn't UTF-8 need to go away and die, of course. :-)
You can put the character ÿ (byte 0xFF in ISO-8859-1) in the file; that byte is invalid in UTF-8. BBEdit on Mac correctly identifies such a file as ISO-8859-1. Not sure how your editor of choice will do.

What are some common character encodings that a text editor should support?

I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.
What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.
As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
I don't know about encodings, but make sure it can support the multiple different line ending standards! (\n vs \r\n)
If you haven't checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/
Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx
There is no reliable way to detect an encoding. The best thing you could do is something like IE does and depend on letter distributions in different languages, as well as on characters standard for a language. But that's a long shot at best.
I would advise getting your hands on some large library of character sets (check out projects like iconv) and make all of those available to the user. But don't bother auto-detecting. Simply allow the user to select his preference of a default charset, which itself would be UTF-8 by default.
Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for Western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB-18030, and remember there are Japanese, Russian, and Greek users too, who all have their own encodings besides UTF-8-encoded Unicode.
As for detection, most encodings are not safely detectable. In some (like Latin-1), certain byte values are just invalid. In UTF-8, any byte value can occur, but not every sequence of byte values. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library, try to decode and catch errors. So why not support all encodings that this library supports?
You could also develop heuristics, like decoding for a specific encoding and then test the result for strange characters or character combinations or frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors. The most clever solution is not always the best, especially if people are used to other programs.
UTF-16 is not very common in plain text files. UTF-8 is much more common because it is backward compatible with ASCII and is specified in standards like XML.
1) Check for BOM of various Unicode encodings. If found, use that encoding.
2) If no BOM, check if file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If valid UTF-8, use UTF-8.
3) If not Unicode it's probably current platform default codepage.
4) Some encodings are easy to detect, for example Japanese Shift-JIS will have heavy use of the prefix bytes 0x82 and 0x83 indicating hiragana and katakana.
5) Give user option to change encoding if program's guess turns out to be wrong.
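Roughly, in Python, steps 1-3 of that order might look like the sketch below (sniff_encoding and the cp1252 fallback are just illustrative choices):
import codecs

def sniff_encoding(data, fallback="cp1252"):
    # Hypothetical sketch; `fallback` stands in for the platform default codepage.
    boms = [(codecs.BOM_UTF8, "utf-8-sig"),
            (codecs.BOM_UTF32_LE, "utf-32-le"), (codecs.BOM_UTF32_BE, "utf-32-be"),
            (codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF16_BE, "utf-16-be")]
    for bom, name in boms:          # step 1: a BOM decides it (UTF-32 before UTF-16,
        if data.startswith(bom):    # since the UTF-32-LE BOM starts with the UTF-16-LE one)
            return name
    try:
        data.decode("utf-8")        # step 2: strict UTF-8 validity check
        return "utf-8"
    except UnicodeDecodeError:
        return fallback             # step 3: fall back to the default codepage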
Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.
Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.

What are the experiences with using unicode in identifiers

These days, more languages are using Unicode, which is a good thing. But it also presents a danger. In the past there were troubles distinguishing between 1 and l and 0 and O. But now we have a completely new range of similar characters.
For example:
ì, î, ï, ı, ι, ί, ׀ ,أ ,آ, ỉ, ﺃ
With these, it is not that difficult to create some very hard to find bugs.
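For illustration, printing the code points shows how different these look-alikes really are (a quick Python sketch):
import unicodedata
for ch in ["i", "ı", "í", "ì", "ι", "і"]:
    print(ch, "U+%04X" % ord(ch), unicodedata.name(ch))
# LATIN SMALL LETTER I, LATIN SMALL LETTER DOTLESS I, LATIN SMALL LETTER I WITH ACUTE,
# LATIN SMALL LETTER I WITH GRAVE, GREEK SMALL LETTER IOTA,
# CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I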
At my work, we have decided to stay with the ANSI characters for identifiers. Is there anybody out there using unicode identifiers and what are the experiences?
Besides the similar-character bugs you mention and the technical issues that might arise when using different editors (with/without BOM, different encodings in the same file through copy-pasting, which is only a problem when there are actually characters that cannot be encoded in ASCII, and so on), I find that it's not worth using Unicode characters in identifiers. English has become the lingua franca of development, and you should stick to it while writing code.
This I find particularly true for code that may be seen anywhere in the world by any developer (open source, or code that is sold along with the product).
My experience with using unicode in C# source files was disastrous, even though it was Japanese (so there was nothing to confuse with an "i"). Source Safe doesn't like unicode, and when you find yourself manually fixing corrupted source files in Word you know something isn't right.
I think your ANSI-only policy is excellent. I can't really see any reason why that would not be viable (as long as most of your developers are English, and even if they're not the world is used to the ANSI character set).
I think it is not a good idea to use the entire ANSI character set for identifiers. No matter which ANSI code page you're working in, your ANSI code page includes characters that some other ANSI code pages don't include. So I recommend sticking to ASCII, no character codes higher than 127.
In experiments I have used a wider range of ANSI characters than just ASCII, even in identifiers. Some compilers accepted it. Some IDEs needed options to be set for fonts that could display the characters. But I don't recommend it for practical use.
Now on to the difference between ANSI code pages and Unicode.
In experiments I have stored source files in Unicode and used Unicode characters in identifiers. Some compilers accepted it. But I still don't recommend it for practical use.
Sometimes I have stored source files in Unicode and used escape sequences in some strings to represent Unicode character values. This is an important practice and I recommend it highly. I especially had to do this when other programmers used ANSI characters in their strings, and their ANSI code pages were different from other ANSI code pages, so the strings were corrupted and caused compilation errors or defective results. The way to solve this is to use Unicode escape sequences.
I would also recommend using ASCII for identifiers. Comments can stay in a non-English language if the editor/IDE/compiler etc. are all locale-aware and set up to use the same encoding.
Additionally, some case-insensitive languages change identifiers to lowercase before using them, and that causes problems if the active system locale is Turkish or Azerbaijani; see here for more info about the Turkish locale problem. I know that PHP does this, and it has a long-standing bug.
Just to point out, this problem is also present in any software that compares strings using Turkish locales, not only the language implementations themselves. It causes many headaches.
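A small Python sketch of the underlying casing quirk (Python's str methods are locale-independent, so this only approximates what locale-aware lowercasing does in a Turkish locale):
# In Turkish, the lowercase of "I" is dotless "ı" and the uppercase of "i" is "İ".
print("I".lower())   # 'i' -- locale-blind; Turkish rules would give 'ı'
print("ı".upper())   # 'I'
print("İ".lower())   # 'i' plus a combining dot above (two code points)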
It depends on the language you're using. In Python, for example, it is easier for me to stick to Unicode, as my applications need to work in several languages. So when I get a file from someone (or something) I don't know, I assume Latin-1 and translate to Unicode.
Works for me, as I'm in Latin America.
Actually, once everything is ironed out, the whole thing becomes a smooth ride.
Of course, this depends on the language of choice.
I haven't ever used unicode for identifier names. But what comes to my mind is that Python allows unicode identifiers in version 3: PEP 3131.
Another language that makes extensive use of unicode is Fortress.
Even if you decide not to use Unicode, the problem resurfaces when you use a library that does. So you have to live with it to a certain extent.