Best Resource for Character Encodings - encoding

I'm searching for a document (not printed) that explains the subject of character encoding in detail, but still simply.

A great overview from the Programmer's perspective is:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By Joel Spolsky
http://www.joelonsoftware.com/articles/Unicode.html

Have you tried Wikipedia's Character encoding page and its links?

This perhaps?
http://www.unicode.org/versions/Unicode5.1.0/

See section 2 onwards of this document: http://ahds.ac.uk/creating/guides/linguistic-corpora/chapter4.htm. It has an interesting history of character encoding methods.

Wikipedia is actually as good a source as any to begin with:
Character Encodings: http://en.wikipedia.org/wiki/Character_encoding. As well as the more familiar ASCII, UTF-8, etc., they have good information on older schemes like Fieldata and the various incarnations of EBCDIC.
For in-depth info on UTF-8 and Unicode you cannot do any better than:
Unicode.org: http://www.unicode.org
Various manufacturers' sites, such as Microsoft and IBM, have lots of code page info, but it tends to relate to their own hardware/software products.

There is a French book about this called Fontes et codages by Yannis Haralambous, published by O'Reilly. I was pretty sure it was or would be translated, and indeed it has been:
Fonts and Encodings.

A short explanation of the basic concepts: http://www.mihai-nita.net/article.php?artID=20060806a

What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text is a spiritual successor to the article on joelonsoftware.com (linked to by lkessler).
It is just as good an introduction but is a bit better on the technical details.

Related

Understanding the terms - Character Encodings, Fonts, Glyphs

I am trying to understand this stuff so that I can effectively work on internationalizing a project at work. I have just started and would very much like to know from your expertise whether I've understood these concepts correctly. So far, here is the dumbed-down version (for my understanding) of what I've gathered from the web:
Character Encodings -> Sets of rules that tell the OS how to store characters. E.g., ISO 8859-1, MSWIN1252, UTF-8, UCS-2, UTF-16. These rules are also called Code Pages/Character Sets, which map individual characters to numbers. Apparently Unicode handles this a bit differently than the others, i.e., instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character", which might be represented by different glyphs. [http://www.joelonsoftware.com/articles/Unicode.html]
Fonts -> These are implementations of character encodings. They are files in different formats (TrueType, OpenType, PostScript) that contain a mapping from each character in an encoding to a number.
Glyphs -> These are the visual representations of characters stored in the font files.
And based on the above understanding, I have the questions below:
1) For the OS to understand an encoding, should it be installed separately? Or would installing a font that supports an encoding suffice? Is it okay to use the analogy of a network protocol, say TCP, for an encoding, since it is just a set of rules? (Which of course begs the question: how does the OS understand those network protocols when I do not install them? :-p)
2) Will a font always have the complete implementation of a code page, or just part of it? Is there a tool that I can use to see each character in a font (.TTF file)? [The Windows font viewer shows what a style of the font looks like, but doesn't give information about the list of characters in the font file.]
3) Does a font file support multiple encodings? Is there a way to know which encoding(s) a font supports?
I apologize for asking so many questions, but I have had these on my mind for some time, and I couldn't find any site that is simple enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.
If you want to learn more, of course I can point you to some resources:
Unicode, writing systems, etc.
The best source of information would probably be this book by Jukka:
Unicode Explained
If you were to follow the link, you'd also find these books:
CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail but to me it seems quite hard to read.
Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic.
Internationalization
If you want to learn about i18n, I can mention countless resources. But let's start with a book that will save you a great deal of time (you won't become an i18n expert overnight, you know):
Developing International Software - it might be 8 years old, but it is still worth every cent you're going to spend on it. The programming examples may relate to Windows (C++ and .NET), but the i18n and L10n knowledge is really there. A colleague of mine once said that it saved him about two years of learning. As far as I can tell, he wasn't overstating.
You might be interested in some blogs or web sites on the topic:
Sorting it all out - Michael Kaplan's blog, often on i18n support on Windows platform
Global by design - John Yunker is actively posting bits of i18n knowledge to this site
Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff.
Java Internationalization
I am afraid that I am not aware of many up-to-date resources on that topic (that is, publicly available ones). The only current resource I know of is the Java Internationalization trail. Unfortunately, it is fairly incomplete.
JavaScript Internationalization
If you are developing web applications, you probably also need something related to i18n in JavaScript. Unfortunately, the support is rather poor, but there are a few libraries which help deal with the problem. The most notable examples would be Dojo Toolkit and Globalize.
The former is a bit heavy, although it supports many aspects of i18n; the latter is lightweight, but unfortunately a lot is missing. If you choose to use Globalize, you might be interested in Jukka's latest book:
Going Global with JavaScript & Globalize.js - I have read this and, as far as I can tell, it is great. It doesn't cover the topics you were originally asking about, but it is still worth reading, even just for the hands-on examples of how to use Globalize.
Apparently Unicode handles this a bit differently than the others, i.e., instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character", which might be represented by different glyphs.
In the Unicode Character Encoding Model, there are 4 levels:
Abstract Character Repertoire (ACR) — The set of characters to be encoded.
Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points.
Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units.
Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes.
For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES.
In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code.
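To make these layers concrete, here is a minimal Python sketch (Python 3 assumed) that reproduces the example above; the big-endian encoding is used only because its byte order matches the code-unit order.

```python
# Minimal sketch of the CCS -> CEF -> CES layers for U+1D11E (Python 3 assumed).
ch = "\U0001D11E"                    # MUSICAL SYMBOL G CLEF: code point U+1D11E (CCS)

# CEF: UTF-16 represents this code point as two 16-bit code units (a surrogate pair).
# The big-endian serialization makes the code units visible: D834 DD1E.
print(ch.encode("utf-16-be").hex())  # d834dd1e

# CES: UTF-16LE serializes each code unit least-significant byte first: 34 D8 1E DD.
print(ch.encode("utf-16-le").hex())  # 34d81edd

# For US-ASCII the CEF and CES are trivial: one character, one byte.
print("A".encode("ascii").hex())     # 41
```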
1) For the OS to understand an encoding, should it be installed separately?
The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level.
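As a small illustration of that application-level approach, here is a sketch using Python's built-in codecs (standing in for what ICU or libiconv would do in a C program); the encoding names are just examples.

```python
# Convert between encodings inside the application, regardless of the OS's
# native encoding (Python's codecs stand in for ICU/libiconv here).
legacy_bytes = "Grüße".encode("cp1252")   # sample data in a legacy 8-bit code page
text = legacy_bytes.decode("cp1252")      # decode into the program's text type
utf8_bytes = text.encode("utf-8")         # re-encode to whatever you need
print(utf8_bytes)                         # b'Gr\xc3\xbc\xc3\x9fe'
```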
2) Will a font always have the complete implementation of a code page, or just part of it?
In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.
I'll provide you with short answers to your questions.
It's generally not the OS that supports an encoding but the applications. Encodings are used to convert a stream of bytes into lists of characters. For example, in C#, reading UTF-8 data will automatically turn it into UTF-16 if you tell it to treat the data as a string.
No matter what encoding you use, C# will simply use UTF-16 internally, and when you want to, for example, print a string from a foreign encoding, it will convert it to UTF-16 first, then look up the corresponding characters in the character tables (fonts) and show the glyphs.
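The same idea in a Python sketch (the file name is made up for the example): the bytes on disk are in whatever encoding you declare, and the language decodes them into its internal string type before doing anything else with them.

```python
# Write and read back text declared as UTF-8; Python decodes it into its
# internal str type, much as C# decodes everything into UTF-16 strings.
with open("greeting.txt", "w", encoding="utf-8") as f:   # hypothetical file
    f.write("héllo")

with open("greeting.txt", "rb") as f:
    raw = f.read()
print(raw)           # b'h\xc3\xa9llo' -- 6 bytes on disk

with open("greeting.txt", encoding="utf-8") as f:
    s = f.read()
print(s, len(s))     # héllo 5 -- 5 characters once decoded
```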
I don't recall ever seeing a complete font. I don't have much experience working with fonts either, so I cannot give you an answer for this one.
The answer to this one is in #1, but a short summary: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font's encoding, you'll be fine.
Bonus answer: On "how does the OS understand network protocols it doesn't know?": again it's not the OS that handles them but the application. As long as the OS knows where to redirect the traffic (which application) it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, to allow the OS to know where to send the data.
This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!

Encoding - what is it and why do we need it?

Can someone explain encoding to me and its importance? I understand that we have various encodings and that in each of them the first 128 characters are the same.
Read Joel Spolsky's excellent article on the subject.
An interesting point that was noted in the discussion of another answer (which I didn't really think the author needed to delete) is that there is a difference between a character set, which (in the other author's words - don't remember his username) defines a mapping between integers and characters (e.g. "Capital A is 65"), and an encoding, which defines how those integers are to be represented in a byte stream. Most old character sets, such as ASCII, have only one very simple encoding: each integer becomes exactly one byte. The Unicode character set, on the other hand, has many different encodings, none of which is that simple: UTF-8, UTF-16, UTF-32...
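A quick Python sketch of that distinction: the character set fixes the integer, and the encoding decides how that integer is laid out as bytes.

```python
# The character set assigns the integer ("Capital A is 65") ...
print(ord("A"))                 # 65

# ... and the encoding decides how that integer is written into the byte stream.
print("A".encode("ascii"))      # b'A'                (one byte)
print("A".encode("utf-16-be"))  # b'\x00A'            (two bytes)
print("A".encode("utf-32-be"))  # b'\x00\x00\x00A'    (four bytes)
```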
Apart from the article mentioned above by Aasmund Eldhuset, I find this TEDx Talk really interesting and explanatory on the same topic.
Hope this helps!
I think encoding is the technique to convert your message into a form that is not readable to unauthorized persons so that you can maintain your secrecy.

Complete understanding of encodings and character sets

Can anybody tell me where to find a clear introduction to character sets, encodings, and everything related to these things?
Thanks!
I can think of two articles:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - the legendary article from our Joel Spolsky.
Unicode - How to get characters right? - more practical and targeted at Java web development (as per your question history, you seem to be doing Java web development).

Why isn't everything we do in Unicode?

Given that Unicode has been around for 18 years, why are there still apps that don't have Unicode support? Even my experiences with some operating systems and Unicode have been painful to say the least. As Joel Spolsky pointed out in 2003, it's not that hard. So what's the deal? Why can't we get it together?
Start with a few questions
How often...
do you need to write an application that deals with something other than ASCII?
do you need to write a multi-language application?
do you write an application that has to be multi-language from its first version?
have you heard that Unicode is used to represent non-ASCII characters?
have you read that Unicode is a charset? That Unicode is an encoding?
do you see people confusing UTF-8 encoded bytestrings and Unicode data?
Do you know the difference between a collation and an encoding?
Where did you first hear of Unicode?
At school? (really?)
at work?
on a trendy blog?
Have you ever, in your young days, experienced moving source files from a system in locale A to a system in locale B, edited a typo on system B, saved the files, b0rking all the non-ASCII comments, and... ended up wasting a lot of time trying to understand what happened? (Did your editor mix things up? The compiler? The system? The...?)
Did you end up deciding that never again would you comment your code using non-ASCII characters?
Have a look at what's being done elsewhere
Python
Did I mention on SO that I love Python? No? Well I love Python.
But until Python 3.0, its Unicode support sucked. And there were all those rookie programmers, who at that time barely knew how to write a loop, getting UnicodeDecodeError and UnicodeEncodeError out of nowhere when trying to deal with non-ASCII characters. They basically got life-traumatized by the Unicode monster, and I know a lot of very efficient/experienced Python coders who are still frightened today by the idea of having to deal with Unicode data.
And with Python 3, there is a clear separation between Unicode and bytestrings, but... look at how much trouble it is to port an application from Python 2.x to Python 3.x if you previously did not care much about that separation, or if you don't really understand what Unicode is.
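For readers who haven't seen it, here is a tiny sketch of what that separation looks like in Python 3: text and bytes are distinct types, and the conversion between them is always explicit.

```python
text = "café"                        # str: a sequence of Unicode code points
data = text.encode("utf-8")          # bytes: the UTF-8 serialization
print(data)                          # b'caf\xc3\xa9'
print(data.decode("utf-8") == text)  # True

# Mixing the two types fails loudly, instead of triggering a surprise
# UnicodeDecodeError deep inside the program as Python 2 often did:
try:
    text + data
except TypeError as e:
    print(e)                         # can only concatenate str (not "bytes") to str
```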
Databases, PHP
Do you know a popular commercial website that stores its international text as Unicode?
You will (perhaps) be surprised to learn that the Wikipedia backend does not store its data using the database's native Unicode text types. All text is encoded in UTF-8 and stored as binary data in the database.
One key issue here is how to sort text data if you store it as Unicode code points. This is where Unicode collations come in: they define a sorting order on Unicode code points. But proper support for collations in databases is missing or still in active development. (There are probably a lot of performance issues, too. -- IANADBA) Also, there is no universally accepted standard for collations yet: for some languages, people don't agree on how words/letters/word groups should be sorted.
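To illustrate the difference between code-point order and a collation, here is a Python sketch; the German locale name is an assumption and has to be installed on the system for the second sort to work.

```python
import locale

words = ["zebra", "Äpfel", "apple"]

# Code-point order puts 'Äpfel' after 'zebra', because 'Ä' (U+00C4) sorts
# above every ASCII letter:
print(sorted(words))                      # ['apple', 'zebra', 'Äpfel']

# A locale-aware collation sorts 'Äpfel' in with the A's (assumes the
# de_DE.UTF-8 locale exists on this system):
locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
print(sorted(words, key=locale.strxfrm))  # ['apple', 'Äpfel', 'zebra']
```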
Have you heard of Unicode normalization? (Basically, you should convert your Unicode data to a canonical representation before storing or comparing it.) Of course it's critical for database storage and for local comparisons. But PHP, for example, has only provided support for normalization since 5.2.4, which came out in August 2007.
And in fact, PHP does not completely support Unicode yet. We'll have to wait for PHP 6 to get Unicode-compatible functions everywhere.
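And here is a small sketch of what the normalization mentioned above means in practice, using Python's standard unicodedata module: the same user-visible character can arrive as one code point or as two, and comparisons only behave once both sides are in a canonical form.

```python
import unicodedata

precomposed = "\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"      # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

print(precomposed == decomposed)                                # False: different code points
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True: same canonical form
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```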
So, why isn't everything we do in Unicode?
Some people don't need Unicode.
Some people don't care.
Some people don't understand that they will need Unicode support later.
Some people don't understand Unicode.
For some others, Unicode is a bit like accessibility for web apps: you start without it, and add support for it later.
A lot of popular libraries/languages/applications lack proper, complete Unicode support, not to mention collation & normalization issues. And until all items in your development stack completely support Unicode, you can't write a clean Unicode application.
The Internet clearly helps spread the Unicode trend. And that's a good thing. Initiatives like Python 3's breaking changes help educate people about the issue. But we will have to wait patiently a bit longer to see Unicode everywhere and new programmers instinctively using Unicode instead of plain strings where it matters.
As an anecdote: because FedEx apparently does not support international addresses, the Google Summer of Code '09 students were all asked by Google to provide an ASCII-only name and address for shipping. If you think that most business actors understand the stakes behind Unicode support, you are just wrong. FedEx does not understand, and their clients do not really care. Yet.
Many product developers don't consider their apps being used in Asia or other regions where Unicode is a requirement.
Converting existing apps to Unicode is expensive and usually driven by sales opportunities.
Many companies have products maintained on legacy systems and migrating to Unicode means a totally new development platform.
You'd be surprised how many developers don't understand the full implications of Unicode in a multi-language environment. It's not just a case of using wide strings.
Bottom line - cost.
Probably because people are used to ASCII and a lot of programming is done by native English speakers.
IMO, it's a function of collective habit, rather than conscious choice.
The widespread availability of development tools for working with Unicode may be a more recent event than you suppose. Working with Unicode was, until just a few years ago, a painful task of converting between character formats and dealing with incomplete or buggy implementations. You say it's not that hard, and as the tools improve that is becoming more true, but there are a lot of ways to trip up unless the details are hidden from you by good languages and libraries. Hell, just cutting and pasting Unicode characters was a questionable proposition a few years back. Developer education also took some time, and you still see people make a ton of really basic mistakes.
The Unicode standard weighs probably ten pounds. Even just an overview of it would have to discuss the subtle distinctions between characters, glyphs, codepoints, etc. Now think about ASCII. It's 128 characters. I can explain the entire thing to someone that knows binary in about 5 minutes.
I believe that almost all software should be written with full Unicode support these days, but it's been a long road to achieving a truly international character set with encoding to suit a variety of purposes, and it's not over just yet.
Laziness, ignorance.
One huge factor is programming language support, most of which use a character set that fits in 8 bits (like ASCII) as the default for strings. Java's String class uses UTF-16, and there are others that support variants of Unicode, but many languages opt for simplicity. Space is so trivial of a concern these days that coders who cling to "space efficient" strings should be slapped. Most people simply aren't running on embedded devices, and even devices like cell phones (the big computing wave of the near future) can easily handle 16-bit character sets.
Another factor is that many programs are written only to run in English, and the developers (1) don't plan (or even know how) to localize their code for multiple languages, and (2) they often don't even think about handling input in non-Roman languages. English is the dominant natural language spoken by programmers (at least, to communicate with each other) and to a large extent, that has carried over to the software we produce. However, the apathy and/or ignorance certainly can't last forever... Given the fact that the mobile market in Asia completely dwarfs most of the rest of the world, programmers are going to have to deal with Unicode quite soon, whether they like it or not.
For what it's worth, I don't think the complexity of the Unicode standard is that big of a contributing factor for programmers, but rather for those who must implement language support. When programming in a language where the hard work has already been done, there is even less reason not to use the tools at hand. C'est la vie, old habits die hard.
All operating systems, until very recently, were built on the assumption that a character was a byte. Their APIs were built like that, the tools were built like that, the languages were built like that.
Yes, it would be much better if everything I wrote was already... err... UTF-8? UTF-16? UTF-7? UTF-32? Err... mmm... It seems that whatever you pick, you'll annoy someone. And, in fact, that's the truth.
If you pick UTF-16, then all of your data, as in pretty much the whole Western world's economy, stops being seamlessly readable, as you lose ASCII compatibility. Add to that, a byte ceases to be a character, which seriously breaks the assumptions upon which today's software is built. Furthermore, some countries do not accept UTF-16. Now, if you pick ANY variable-length encoding, you break some basic premises of lots of software, such as not needing to traverse a string to find the nth character, or being able to read a string from any point in it.
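A quick Python sketch of that "a byte is no longer a character" problem with a variable-length encoding such as UTF-8:

```python
s = "naïve"                  # 5 characters
b = s.encode("utf-8")        # 'ï' takes two bytes in UTF-8

print(len(s), len(b))        # 5 6 -- character count and byte count diverge
print(b[3])                  # 175 -- the second byte of 'ï', not a character
print(s[3])                  # v  -- indexing code points still works on the str
```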
And then UTF-32... well, that's four bytes per character. What was the average hard drive or memory size just 10 years ago? UTF-32 was too big!
So, the only solution is to change everything -- software, utilities, operating systems, languages, tools -- at the same time to be i18n-aware. Well. Good luck with "at the same time".
And if we can't do everything at the same time, then we always have to keep an eye out for stuff which hasn't been internationalized. Which causes a vicious cycle.
It's easier for end user applications than for middleware or basic software, and some new languages are being built that way. But... we still use Fortran libraries written in the 60s. That legacy, it isn't going away.
Because UTF-16 became popular before UTF-8 and UTF-16 is a pig to work with. IMHO
Because for 99% of applications, Unicode support is not a checkbox on the customer's product comparison matrix.
Add to the equation:
It takes a conscious effort with almost no readily visible benefit.
Many programmers are afraid of it or don't understand it.
Management REALLY doesn't understand it or care about it, at least not until a customer is screaming about it.
The testing team isn't testing for Unicode compliance.
"We didn't localize the UI, so non-English speakers wouldn't be using it anyway."
Tradition and attitude. ASCII and computers are sadly synonymous to many people.
However, it would be naïve to think that the rôle of Unicode is only a matter of exotic languages from Eurasia and other parts of the world. A rich character encoding has a lot to offer even for "plain" English text. Look in a book sometime.
I would say there are mainly two reasons. The first is simply that the Unicode support in your tools just isn't up to snuff. C++ still doesn't have Unicode support, and won't get it until the next standard revision, which will take maybe a year or two to be finished and then another five or ten years to come into widespread use. Many other languages aren't much better, and even if you finally have Unicode support, it might still be more cumbersome to use than plain ASCII strings.
The second reason is in part what is causing the first issue: Unicode is hard. It's not rocket science, but it gives you a ton of problems that you never had to deal with in ASCII. With ASCII you had a clear one byte == one glyph relationship, could address the Nth character of a string with a simple str[N], could just store all the characters of the whole set in memory, and so on. With Unicode you can no longer do that: you have to deal with the different ways Unicode is encoded (UTF-8, UTF-16, ...), byte order marks, decoding errors, lots of fonts that have only a subset of the characters you would need for full Unicode support, more glyphs than you want to store in memory at a given time, and so on.
ASCII could be understood by just looking at an ASCII table without any further documentation; with Unicode, that is simply no longer the case.
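Two of those new problem classes, byte order marks and decoding errors, in a short Python sketch:

```python
# A byte-order mark tells the decoder how the 16-bit code units were serialized
# (the exact byte order below depends on the platform's endianness):
print("hi".encode("utf-16"))     # b'\xff\xfeh\x00i\x00' on a little-endian machine

# And bytes that are invalid in the declared encoding raise decoding errors,
# a failure mode that ASCII simply never had:
try:
    b"\xc3(".decode("utf-8")     # 0xC3 opens a 2-byte sequence that '(' cannot finish
except UnicodeDecodeError as e:
    print(e)
```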
Because of the inertia caused by C++. It had (and has) horrible Unicode support and dragged developers back.
I personally do not like how certain formats of Unicode break things so that you can no longer do string[3] to get the 3rd character. Sure, it could be abstracted away, but imagine how much slower a big, string-heavy project such as GCC would be if it had to traverse a string to figure out the nth character. The only option is caching where the "useful" positions are, and even then it's slow, and in some formats you're now taking a good 5 bytes per character. To me, that is just ridiculous.
More overhead, space requirements.
I suspect it's because software has such strong roots in the west. UTF-8 is a nice, compact format if you happen to live in America. But it's not so hot if you live in Asia. ;)
Unicode requires more work (thinking); you usually only get paid for what is required, so you go with the fastest, least complicated option.
Well, that's from my point of view. I guess if you expect code to use std::wstring hw(L"hello world"), you have to explain that to print a wstring you need wcout: std::wcout << hw << std::endl; (I think), (but endl seems fine...). So it seems like more work to me -- of course, if I were writing an international app, I would have to invest in figuring it out, but until then I don't (as I suspect most developers don't).
I guess this comes back to money; time is money.
It's simple. Because we only have ASCII characters on our keyboards, why would we ever encounter, or care about, characters other than those? It's not so much an attitude as it is what happens when a programmer has never had to think about this issue, or never encountered it, and perhaps doesn't even know what Unicode is.
edit: Put another way, Unicode is something you have to think about, and thinking is not something most people are interested in doing, even programmers.

What are the experiences with using unicode in identifiers

These days, more languages are using Unicode, which is a good thing. But it also presents a danger. In the past there were troubles distinguishing between 1 and l, and 0 and O. But now we have a completely new range of similar characters.
For example:
ì, î, ï, ı, ι, ί, ׀ ,أ ,آ, ỉ, ﺃ
With these, it is not that difficult to create some very hard-to-find bugs.
At my work, we have decided to stay with the ANSI characters for identifiers. Is there anybody out there using unicode identifiers and what are the experiences?
Besides the similar-character bugs you mention and the technical issues that might arise when using different editors (with BOM, without BOM, different encodings in the same file from copy-pasting, which is only a problem when there are actually characters that cannot be encoded in ASCII, and so on), I find that it's not worth using Unicode characters in identifiers. English has become the lingua franca of development, and you should stick to it while writing code.
This I find particularly true for code that may be seen anywhere in the world by any developer (open source, or code that is sold along with the product).
My experience with using Unicode in C# source files was disastrous, even though it was Japanese (so there was nothing to confuse with an "i"). SourceSafe doesn't like Unicode, and when you find yourself manually fixing corrupted source files in Word, you know something isn't right.
I think your ANSI-only policy is excellent. I can't really see any reason why that would not be viable (as long as most of your developers are English speakers, and even if they're not, the world is used to the ANSI character set).
I think it is not a good idea to use the entire ANSI character set for identifiers. No matter which ANSI code page you're working in, your ANSI code page includes characters that some other ANSI code pages don't include. So I recommend sticking to ASCII, no character codes higher than 127.
In experiments I have used a wider range of ANSI characters than just ASCII, even in identifiers. Some compilers accepted it. Some IDEs needed options to be set for fonts that could display the characters. But I don't recommend it for practical use.
Now on to the difference between ANSI code pages and Unicode.
In experiments I have stored source files in Unicode and used Unicode characters in identifiers. Some compilers accepted it. But I still don't recommend it for practical use.
Sometimes I have stored source files in Unicode and used escape sequences in some strings to represent Unicode character values. This is an important practice and I recommend it highly. I especially had to do this when other programmers used ANSI characters in their strings, and their ANSI code pages were different from other ANSI code pages, so the strings were corrupted and caused compilation errors or defective results. The way to solve this is to use Unicode escape sequences.
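For example (a Python sketch of the same practice; the word is arbitrary), spelling non-ASCII characters as escape sequences keeps the source file pure ASCII, so it survives any code page:

```python
# The literal "Grüße" depends on the source file's encoding; the escaped form
# below is plain ASCII and means the same thing on every system.
greeting = "Gr\u00fc\u00dfe"
print(greeting)    # Grüße
```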
I would also recommend using ASCII for identifiers. Comments can stay in a non-English language if the editor/IDE/compiler etc. are all locale-aware and set up to use the same encoding.
Additionally, some case-insensitive languages change identifiers to lowercase before using them, and that causes problems if the active system locale is Turkish or Azerbaijani. See here for more info about the Turkish locale problem. I know that PHP does this, and it has a long-standing bug.
Just to point out, this problem is also present in any software that compares strings using Turkish locales, not only in the language implementations themselves. It causes many headaches.
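A Python sketch of why the Turkish case rules bite (Python's own str.lower() is locale-independent, so the last comparison only breaks in environments that apply locale-aware folding, e.g. Java's toLowerCase under a tr_TR locale):

```python
# Turkish pairs the dotless ı (U+0131) with I, and the dotted İ (U+0130) with i.
print("ı".upper())            # I   -- dotless ı uppercases to a plain ASCII I
print("İ".lower())            # i̇  -- dotted İ lowercases to 'i' + combining dot above

# Under Turkish-locale case folding, lowercasing "ID" yields "ıd", so a
# case-insensitive comparison against "id" silently stops matching.
print("ID".lower() == "id")   # True here; False with Turkish-locale folding
```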
It depends on the language you're using. In Python, for example, it is easier for me to stick to Unicode, as my applications need to work in several languages. So when I get a file from someone (or something) that I don't know, I assume Latin-1 and translate it to Unicode.
Works for me, as I'm in Latin America.
Actually, once everything is ironed out, the whole thing becomes a smooth ride.
Of course, this depends on the language of choice.
I haven't ever used Unicode for identifier names. But what comes to mind is that Python allows Unicode identifiers in version 3: PEP 3131.
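A tiny sketch of what PEP 3131 permits, which also shows why the confusable-character worry above is real: visually similar identifiers are still distinct names.

```python
# Python 3 accepts Unicode identifiers (PEP 3131); they are NFKC-normalized.
i = 1        # Latin small letter i
ι = 2        # GREEK SMALL LETTER IOTA (U+03B9): a different identifier entirely
print(i, ι)  # 1 2
```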
Another language that makes extensive use of Unicode is Fortress.
Even if you decide not to use Unicode, the problem resurfaces when you use a library that does. So you have to live with it to a certain extent.