Why haven't ASCII and ISO-8859-1 encoding been relegated to history? - encoding

It seems to me if UTF-8 was the only encoding used everywhere ever, there would be a lot less issues with code:
Don't even need to think about encoding issues.
No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.
You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).
More characters can be represented in UTF-8.
Other things I can't think of right now.
So why haven't the inferior encodings been nuked from space?

Don't even need to think about encoding issues.
True. Except for all the data that's still in the old ASCII format.
No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
Incorrect. UTF-8 is variable length, using 1 to 4 bytes per character (the original design allowed for up to 6).
Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.
Browsers don't generally wait for the full page, they make a guess based on the first part of the page data.
You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).
Except for all those other old web pages that use other non-UTF-8 encodings (the non-English speaking world is pretty big).
More characters can be represented in UTF-8.
True. Your problems of data validation just got harder, too.

Why are EBCDIC, Baudot, and Morse still not nuked from orbit? Why did the buggy-whip manufacturers not close their doors the day after Gottlieb Daimler shipped his first automobile?
Relegating a technology to history takes non-zero time.

No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
Not true at all. UTF-8 is a mixed-width 1, 2, 3, and 4-byte encoding. You may have been thinking of UTF-16, but even that has had 4-byte characters for a while. If you want a “simple” fixed-width encoding, you need UTF-32.
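To make the widths concrete, here is a small Python sketch (illustrative only, nothing from the question itself):

    # Byte lengths of single characters in the three main Unicode encodings.
    for ch in ["A", "é", "€", "😀"]:           # 1-, 2-, 3- and 4-byte cases in UTF-8
        print(ch,
              len(ch.encode("utf-8")),         # 1, 2, 3, 4 bytes
              len(ch.encode("utf-16-le")),     # 2, 2, 2, 4 bytes (surrogate pair for the emoji)
              len(ch.encode("utf-32-le")))     # always 4 bytes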
You would never see ? and other random symbols on old web pages
Even with UTF-8 web pages, you still might not have a font that supports every Unicode character, so this is still a problem.
More characters can be represented in UTF-8.
Sometimes this is a disadvantage. Having more characters means more bits are required to encode the characters. And to keep track of which ones are letters, digits, etc. And to store the fonts for displaying those characters. And to deal with additional Unicode-related complexities like normalization.
This is probably a non-issue for modern computers with gigabytes of RAM, but don't expect your TI-83 to support Unicode any time soon.
But still, if you do need those extra characters, it's way easier to work with UTF-8 than with zillions of different 8-bit character encodings (plus a few non-self-synchronizing East Asian multibyte encodings).
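For the normalization point mentioned above, a minimal Python sketch (illustrative only) of how the same visible text can be two different code point sequences:

    import unicodedata

    # "é" can be one precomposed code point (NFC) or 'e' plus a combining accent (NFD).
    nfc = unicodedata.normalize("NFC", "e\u0301")   # U+00E9
    nfd = unicodedata.normalize("NFD", "\u00e9")    # 'e' + U+0301

    print(nfc == nfd)                                # False: different code point sequences
    print(len(nfc), len(nfd))                        # 1 2
    print(nfc.encode("utf-8"), nfd.encode("utf-8"))  # b'\xc3\xa9' vs b'e\xcc\x81'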
So why haven't the inferior encodings been nuked from space?
In large part, this is because the “inferior” programming languages haven't been nuked from space. Lots of code is still written in languages like C and C++ (and even COBOL!) that predate Unicode and still don't have good support for it.
I badly wish we could get rid of the situation where some libraries use char-based strings encoded in UTF-8 while others think char is for legacy encodings and Unicode should always use wchar_t, and then you have to deal with whether wchar_t is UTF-16 or UTF-32 (or neither).

I don't think UTF-8 uses "2 bytes"; it's variable length. Also, a lot of OS-level code uses UTF-16 and UTF-32 respectively, which means the choice for Latin encodings is between ASCII and ISO-8859-1.

Well, your question is a bit of a why-is-the-world-so-bad complaint. It is the way it is because of history. The pages written in encodings other than UTF-8 come from the times when UTF-8 was poorly supported by operating systems and was not yet the de facto standard.
Those pages will stay in their original encoding as long as nobody changes them, which in many cases is not very likely. Many of them are no longer maintained by anyone.
There are also a lot of documents on the internet in non-Unicode encodings, in many formats. Someone COULD convert them, but, as above, that requires a lot of effort.
So support for non-Unicode encodings must also stay.
And for the present day, keep to the rule that every time someone uses a non-Unicode encoding, a kitten dies.

Related

Why are there different encoding types?

This is a noob question, but I wanna know why there are different encoding types and what their differences are (i.e. ASCII, UTF-8 and UTF-16, Base64, etc.)
Reasons are many, I believe, but the main point is: "How many characters do you need to display (encode)?" If you live in the US, for example, you could get pretty far with ASCII. But in many countries we need characters like ä, å, ü etc. (If SO was ASCII only, or you tried to read this text as ASCII-encoded text, you'd see some weird characters in the places of ä, å and ü.) Think also of China, Japan, Thailand and other "exotic" countries. Those weird figures on photos you may have seen around the world just might be letters, not pretty pictures.
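A small Python sketch of that effect (the sample string is my own):

    # ASCII simply has no codes for these letters...
    try:
        "ä å ü".encode("ascii")
    except UnicodeEncodeError as err:
        print(err)     # 'ascii' codec can't encode character '\xe4' ...

    # ...and reading the UTF-8 bytes as a one-byte encoding shows the "weird characters":
    print("ä å ü".encode("utf-8").decode("latin-1"))   # Ã¤ Ã¥ Ã¼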
As for the differences between different encoding types you need to see their specification. Here's something for UTF-8.
http://www.unicode.org/standard/standard.html
http://www.utf-8.com/
http://en.wikipedia.org/wiki/UTF-8#Compared_to_other_multi-byte_encodings
I'm not familiar with UTF-16. Here's some information about the differences.
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Unicode_plane
Base64 is used when there is a need to encode binary data that needs to be stored and transferred over media that are designed to deal with textual data. If you've ever made some sort of email system with PHP, you've probably encountered Base64.
http://en.wikipedia.org/wiki/Base64
http://www.phpeveryday.com/articles/PHP-Email-Using-Embedded-Images-in-HTML-Email-P113.html
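A minimal Python sketch of the idea (illustrative only): Base64 turns arbitrary bytes into a safe ASCII alphabet and back, losslessly.

    import base64

    raw = bytes([0x00, 0xFF, 0x10, 0x80])      # arbitrary binary data
    encoded = base64.b64encode(raw)            # b'AP8QgA==' -- plain ASCII, safe for text-only channels
    print(encoded)
    print(base64.b64decode(encoded) == raw)    # True: lossless round trip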
In short: to support localization of computer programs' user interfaces into many different languages. (Programming languages themselves still mainly consist of characters found in the ASCII encoding, although it's possible, for example in Java, to use UTF-8-encoded characters in variable names, and the source code file is usually stored as something other than ASCII-encoded text, for example in UTF-8.)
In short vol.2: Always when different people are trying to solve some problem from a specific point of view (or even without a point of view if it's even possible), results may be quite different. Quote from Joel's unicode article (link below): "Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255."
Thanks to Joachim and tchrist for all the info and discussion. Here's two articles I just read. (Both links are on the page I linked to earlier.) I'd forgotten most of the stuff from Joel's article since I last read it a few years back. Good introduction to the subject I hope. Mark Davis goes a little deeper.
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
The real reason why there are so many variants is that the Unicode consortium came along too late.
In The Beginning, memory and storage were expensive, and using more than 8 (or sometimes only 7) bits of memory to store a single character was considered excessive. Thus pretty much all text was stored using 7 or 8 bits per character. Clearly, 8 bits are not enough memory to represent the characters of all human languages. It's barely enough to represent most characters used in a single language (and for some languages even that's not possible). Therefore many different character encodings were designed to allow different languages (English, German, Greek, Russian, ...) to encode their texts in 8 bits per character. After all, a single text file (and usually even a single computer system) will only ever be used in a single language, right?
This led to a situation where there was no single agreed-upon mapping of characters to numbers of any kind. Many different, incompatible solutions were produced, and no real central control existed. Some computer systems used ASCII, others used EBCDIC (or more precisely: one of the many variations of EBCDIC), ISO-8859-* (or one of its many derivatives), or any of a big list of encodings that are hardly heard of now.
Finally, the Unicode Consortium stepped up to the task to produce that single mapping (together with lots of auxiliary data that's useful but outside of the bounds of this answer).
When the Unicode consortium finally produced a fairly comprehensive list of characters that a computer might represent (together with a number of encoding schemes to encode them to binary data, depending on your concrete needs), the other character encoding schemes were already widely used. This slowed down the adoption of Unicode and its encodings (UTF-8, UTF-16) considerably.
These days, if you want to represent text, your best bet is to use one of the few encodings that can represent all Unicode characters. UTF-8 and UTF-16 together should suffice for 99% of all use cases, UTF-32 covers almost all the others. And just to be clear: all the UTF-* encodings can encode all valid Unicode characters. But due to the fact that UTF-8 and UTF-16 are variable-width encodings, they might not be ideal for all use cases. Unless you need to be able to interact with a legacy system that can't handle those encodings, there is rarely a reason to choose anything else these days.
The main reason is to be able to show more characters. When the internet was in its infancy, no one really planned ahead thinking that one day there would be people using it from all countries and all languages around the world. So a small character set was good enough. Gradually it was revealed to be limited and English-centric, hence the demand for bigger character sets.

How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?
Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Windows tools) default to “ANSI”. This is a misleading name, as it has nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard; it in fact means the system default code page of the current Windows machine. That default code page can change between machines, or on the same machine, at which point all text files in “ANSI” that contain non-ASCII characters like accented letters will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.
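The ASCII-compatibility and zero-byte points are easy to see for yourself (a Python sketch, illustrative only):

    s = "Hello"
    print(s.encode("ascii"))      # b'Hello'
    print(s.encode("utf-8"))      # b'Hello' -- byte-for-byte identical to ASCII
    print(s.encode("utf-16-le"))  # b'H\x00e\x00l\x00l\x00o\x00' -- full of zero bytes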
If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake
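Here is a small Python sketch of that kind of damage (the sample text and code pages are my own choices): the same bytes read under a different single-byte code page silently turn into nonsense.

    data = "Übergröße café".encode("cp1252")   # written on a Western-European Windows machine

    print(data.decode("cp1252"))   # Übergröße café  (the right guess)
    print(data.decode("cp1251"))   # ЬbergrцЯe cafй  (same bytes read as Cyrillic: mojibake)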
It is very important, since whatever tool you use will show false chars if you use the wrong encoding. Try to load a Cyrillic file in Notepad without using UTF-8 or so and see a lot of "?" coming up. :)

What are some common character encodings that a text editor should support?

I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.
What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.
As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
I don't know about encodings, but make sure it can support the multiple different line ending standards! (\n vs \r\n)
If you haven't checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/
Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx
There is no reliable way to detect an encoding. The best thing you could do is something like IE does and depend on letter distributions in different languages, as well as standard characters for a language. But that's a long shot at best.
I would advise getting your hands on some large library of character sets (check out projects like iconv) and make all of those available to the user. But don't bother auto-detecting. Simply allow the user to select his preference of a default charset, which itself would be UTF-8 by default.
Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB-18030, and remember there are Japanese, Russian and Greek users too, who all have their own encodings besides UTF-8-encoded Unicode.
As for detection, most encodings are not safely detectable. In some (like Latin-1), certain byte values are just invalid. In UTF-8, any byte value can occur, but not every sequence of byte values. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library, try to decode and catch errors. So why not support all encodings that this library supports?
You could also develop heuristics, like decoding for a specific encoding and then test the result for strange characters or character combinations or frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors. The most clever solution is not always the best, especially if people are used to other programs.
UTF-16 is not very common in plain text files. UTF-8 is much more common because it is backward compatible with ASCII and is specified in standards like XML.
1) Check for BOM of various Unicode encodings. If found, use that encoding.
2) If no BOM, check if file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If valid UTF-8, use UTF-8.
3) If not Unicode it's probably current platform default codepage.
4) Some encodings are easy to detect, for example Japanese Shift-JIS will have heavy use of the prefix bytes 0x82 and 0x83 indicating hiragana and katakana.
5) Give user option to change encoding if program's guess turns out to be wrong.
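A rough Python sketch of steps 1 to 3 (heuristic only; the function name and the BOM handling are my own choices):

    import codecs, locale

    def guess_encoding(data: bytes) -> str:
        # Step 1: trust a BOM if there is one. Check UTF-32 before UTF-16,
        # because the UTF-32-LE BOM starts with the UTF-16-LE BOM (FF FE).
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8-sig"
        if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
            return "utf-32"
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        # Step 2: no BOM -- accept UTF-8 only if the whole file decodes cleanly.
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            pass
        # Step 3: fall back to the current platform default code page.
        return locale.getpreferredencoding(False)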
Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.
Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.

Why does anyone use an encoding other than UTF-8? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I want to know why any developer would need to use an encoding other than UTF-8.
Wikipedia lists advantages and disadvantages of UTF-8 as compared to a variety of other encodings:
http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages
The most important disadvantages are IMHO that UTF-8 might use significantly more space especially in Asian languages such as Chinese, Japanese or Hindi and that not all code points have the same size which makes measurements more difficult and many string operations such as search inefficient.
Well, some do it because their tools are archaic or flawed. Some do it because they don't see a need to support anything other than ASCII. Some do it because they don't know any better.
Those are the usual excuses for not using Unicode.
As for not using UTF-8 specifically there are different reasons. Some systems, like Windows1 (and stemming from that, .NET) and Java came to be in a time where Unicode was a strict 16-bit code. Therefore, there was really only one encoding: UCS-2, encoding code points directly as 16-bit words.
Later Unicode was expanded to 21 bits because 65536 code points weren't enough anymore. This caused encodings such as UTF-32 and UTF-16 to appear. For systems previously working with UCS-2 the transition to UTF-16 was the easiest and most sensible choice. Windows did that transition back in Ye Olde Days of Windows 2000.
So while I think that nearly all application nowadays should support Unicode I don't think it is entirely necessary for them to specifically use UTF-8. There are historic reasons for that and no real benefit in converting existing systems from UTF-16 to UTF-8.
1 NT.
Code points between U+0800 and U+FFFF take up three bytes in UTF-8 but only two in UTF-16. See the Wikipedia comparison for more details, but basically if text heavily uses code points in this range (say, if it's Chinese), UTF-8 files will be larger than UTF-16 files with the same content.
UTF-8 is very efficient at encoding plain English text (same as ASCII). If your user base is likely to be mostly, say, Chinese, you will be much better off using UTF-16.
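A quick check of that claim (the sample strings are my own):

    english = "The quick brown fox jumps over the lazy dog"
    chinese = "汉字的笔画很多"                 # a short all-CJK sample

    for name, text in [("English", english), ("Chinese", chinese)]:
        print(name,
              len(text.encode("utf-8")),     # English: 43, Chinese: 21 (3 bytes per character)
              len(text.encode("utf-16-le"))) # English: 86, Chinese: 14 (2 bytes per character)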
For more information, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Because outside the English-speaking world, people have been using various encodings that predate Unicode and are tailored for their respective languages for decades. These language-specific encodings have become ingrained everywhere and are pretty much a standard. If you want to have any hope of interfacing with legacy systems, you have to use them, so all systems have to support them and usually use them as default even if they by now support UTF-8 as well. There may even be multiple legacy encodings traditionally used for different purposes.
Examples:
ISO-8859-1 in western Europe - actually outdated there as well, as you need ISO-8859-15 for the Euro sign
ISO-2022-JP in Japan for emails, Shift JIS for websites
Big5 in Taiwan
GB2312 in China
The last two examples show that encodings can even be a political issue.
Sometimes they are restricted due to historical/unsupported reasons (I'm developing on Windows using Zend Studio on a Samba share on a Linux box; something in that mix means I keep reverting to Cp1252 instead of UTF-8).
Sometimes you don't need to use UTF-8 (for example when storing an MD5 hash in a database: you only need the hexadecimal characters 0-9 and A-F, so there's no reason to declare the column as UTF-8 when plain ASCII/Latin will do).
Sometimes it's just laziness learning the UTF-8 functions for a particular language.
Because they do not know better.
The only valid criticism of UTF-8 is that encodings of common Asian languages are larger than in other encodings.
UTF-8 is superior because
It is ASCII compatible. Most known and tried string operations do not need adaptation.
It is Unicode. Anything that isn't Unicode shouldn't even be considered in this day and age. If you have important data in encoding X, spend two minutes on Google and write a conversion function. Even if you have to interface with sourceless legacy app Z, you can run your communications through a pipe so that your logic stays in the 21st century.
UTF-16 isn't fixed length either, and assuming it is, as many do, will only cause terrible bugs.
Additionally, Unicode is very complex, and it is almost certain that any fixed-size algorithm adapted from ASCII will yield bad results even in UTF-32.
Say you have this UTF-16 string, written as code units, where [F|3] is a single character outside the BMP stored as a surrogate pair:
[0][1][2][F|3][4][5]
Now you want to insert a character with code 8 between the character [F|3] and [4]. Assuming fixed-width characters, you count four characters before the insertion point and call insert(4, 8). But code-unit index 4 is the middle of the surrogate pair, and since you didn't check for characters outside the BMP (which has to be done serially, as in UTF-8, because you cannot know how many double-sized characters precede the position), you get:
[0][1][2][F|8][3][4][5]
Two new garbage characters: F now pairs with 8, and the orphaned low surrogate 3 is invalid on its own. So much for your fixed-size encoding.
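You can reproduce the same mistake directly by manipulating UTF-16 code units (a Python sketch, illustrative only; the string is my own):

    # "abc" + one non-BMP character (a surrogate pair in UTF-16) + "de"
    s = "abc\U0001F600de"
    raw = s.encode("utf-16-le")
    units = [raw[i:i + 2] for i in range(0, len(raw), 2)]
    print(len(s), len(units))                  # 6 characters, but 7 UTF-16 code units

    # Naively treat code-unit index as character index and insert 'X' after the 4th character:
    units.insert(4, "X".encode("utf-16-le"))   # lands between the two halves of the surrogate pair
    broken = b"".join(units).decode("utf-16-le", errors="replace")
    print(broken)                              # 'abc\ufffdX\ufffdde' -- the emoji is gone, garbage remains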
You can of course disallow such characters altogether, but then when your code interfaces with the real world, you might find your program saves the profile for this user under a mangled name like rm -Rf / in .profile instead of [Classical Chinese Proverb].profile.
Or just an angry user that cannot write his thesis on Classical Chinese Proverbs with your software.
One legitimate reason is when you need to deal with legacy documents, software or hardware that are not Unicode compatible.
Another legitimate reason is that you need to use a programming language / libraries that do not support UTF-8 / Unicode well ... or at all.
Other answers mention that UTF-16 is more compact than UTF-8 for Asian languages / characters.
And of course there are reasons like short-sightedness, ignorance, laziness ... and deadlines.
It's also worth remembering that in some circumstances (where a non-Latin set of characters is needed) UTF-8 can actually bloat larger than the 16-bit Unicode encoding. In those cases UCS-2 or UTF-16 would be a better choice.
The reasons for using non-Unicode 8-bit character sets / encodings are all backward compatibility of some kind, and/or inertia. For that matter, the most frequent reasons for using UTF-8 are compatibility with standards like XML that mandate or prefer UTF-8.
Differences in the number of bytes you think text will take up in different encodings, especially in storage, are mostly theoretical. In real world situations, compatibility requirements are more important. If compression is used, the size differences go away anyway. Even if compression is not used, total text size is hard to predict and is rarely a deciding factor.
When converting legacy code that used non-Unicode 8-bit encodings, using UTF-16 can be a tool for making sure all code has been converted, because mismatches can be caught as compile-time type errors. Many languages, runtimes and libraries, like JavaScript, the JVM, .NET and ICU, use 16-bit strings and UTF-16, even though storage and Internet protocols are usually 8-bit.
Imagine all the files you have to deal with are in GB2312 (the China mainland standard). Then you might choose GB18030 as your Unicode encoding instead. They are compatible in the same way that all ASCII is valid UTF-8. That is useful in mainland China!
You might decide even quicker when you find out that both of those GB standards are required by law for IT products (as far as I have heard) if you want to ship in mainland China.
Another upside is that GB2312, and consequently GB18030 as well, are also ASCII-compatible.
It is algorithmically not as robust, though. So if you have no political reasons or any GB2312 legacy, it makes no sense to use it. But if you do, here's your answer.
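Python's codec support makes those compatibility claims easy to check (illustrative only; the sample text is my own):

    # GB18030 is ASCII-compatible, like UTF-8...
    print("hello".encode("gb18030"))                     # b'hello'

    # ...and unlike GB2312 it can round-trip any Unicode text, just with a different byte layout.
    s = "汉字 and ASCII, plus €"
    print(s.encode("gb18030"))
    print(s.encode("gb18030").decode("gb18030") == s)    # True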
Related to the subject: when using MySQL, as if it wasn't complex enough, you get the option to choose which kind of UTF-8 collation you want to use. So which would you use?
utf8_general_ci
or
utf8_unicode_ci?
(I tend to use the UTF-8 variant that is used for the database connection)
Because you sometimes want to operate easily on code points; then you'd choose e.g. UCS-2 or UCS-4.
Many APIs require other Unicode encodings - mostly UTF-16. For instance, Java, .NET, Win32.
At my previous employer we used iso-8859-1 for some of our ASP pages to match the collation of our SQL Server, which as you can guess was not Unicode. I wanted to change the collation, but the manager said to wait till we upgrade our SQL Server to do it. Needless to say it never happened - I haven't been with them for a little over a year now, so I don't know if they finally did it.
Unicode certainly is a good place to work from in most cases, but a developer should be familiar with many different types of character encoding. Certainly ASCII might be used if the set of characters is limited.
What if you're a developer and receiving data from a source that doesn't send UTF-8? There could be lots of interface issues if you don't understand your input.
Joel's article on the must-knows for character encoding is good and worth reading.

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?
In what way are these helpful for programmers?
Going down your list:
"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" which is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.
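To make the list concrete, here is one short string under several of these encodings (a Python sketch; "ANSI" is represented here by Windows-1252, which is only one of the possible system code pages):

    s = "h€llo😀"   # ASCII letters, a BMP symbol (€), and a non-BMP emoji

    print(s.encode("utf-8"))                     # 1-4 bytes per code point
    print(s.encode("utf-16-le"))                 # 2-byte code units; a surrogate pair for the emoji
    print(s.encode("utf-32-le"))                 # a fixed 4 bytes per code point
    print(s.encode("cp1252", errors="replace"))  # "ANSI" here: the emoji becomes '?'
    print(s.encode("ascii", errors="replace"))   # plain ASCII: both € and the emoji become '?'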
There's more on my Unicode page and tips for debugging Unicode problems.
The other big resource, of course, is unicode.org, which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.
Some reading to get you started on character encodings: Joel on Software:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By the way - ASP.NET has nothing to do with it. Encodings are universal.