Related
Users sometimes use weird ASCII characters in a program, and I was wondering if there was a way to "normalize" it.
So basically, if the input ᴀʙᴄᴅᴇꜰɢ, the output would be ABCDEFG. Is there a dictionary that exists somewhere that does something like this? If not, is there a better method than just doing something like str.replace("ᴀ", "A") for all the different "fonts"?
This isn't a language specific question -- if something doesn't exist like this than I guess the next step is to create a dictionary myself.
Yes.
BTW—The technical terms are: Latin Capital Letters from the C0 Controls and Basic Latin block and the Latin Letter Small Capitals from the Phonetic Extensions block.
Anyway, the general topic for your question is Unicode confusables. The link is for a mapping. Uncode.org has more material on confusables and everything else Unicode.
(Normalization is always something to consider when processing Unicode text, but it doesn't particularly relate to this issue.)
Your example seems to involve unicode characters, not ASCII characters. Unicode normalization (FAQ) is a large and complex subject, with many difference equivalence classes of characters, depending on what you are trying to do.
Turkish has dotted and dotless I as two separate characters, each with their own uppercase and lowercase forms.
Uppercase Lowercase
I U+0049 ı U+0131
İ U+0130 i U+0069
Whereas in other languages using the Latin alphabet, we have
Uppercase Lowercase
I U+0049 i U+0069
Now, The Unicode Consortium could have implemented this as six different characters, each with its own casing rules, but instead decided to use only four, with different casing rules in different locales. This seems rather odd to me. What was the rationale behind that decision?
A possible implementation with six different characters:
Uppercase Lowercase
I U+0049 i U+0069
I NEW ı U+0131
İ U+0130 i NEW
Codepoints currently used:
U+0049 ‹I› \N{LATIN CAPITAL LETTER I}
U+0130 ‹İ› \N{LATIN CAPITAL LETTER I WITH DOT ABOVE}
U+0131 ‹ı› \N{LATIN SMALL LETTER DOTLESS I}
U+0069 ‹i› \N{LATIN SMALL LETTER I}
There is one theoretical and one practical reason.
The theoretical one is that the i of most Latin-script alphabets and the i of the Turkish and Azerbaijani alphabets are the same, and again the I of most Latin-script alphabets and the I of the Turkish and Azerbaijani are the same. The alphabets differ in the relationship between those too. One could easily enough argue that they are in fact different (as your proposed encoding treats them) but that's how the Language Commission considered them in defining the alphabet and orthography in the 1920s in Turkey, and Azerbaijani use in the 1990s copied that.
(In contrast, there are Latin-based scripts for which i should be considered semantically the same as i though never drawn with a dot [just use a different font for differently shaped glyphs], particularly those that date before Carolingian or which derive from one that is, such as how Gaelic script was derived from Insular script. Indeed, it's particularly important never to write Irish in Gaelic script with a dot on the i that could be compared with the sí buailte diacritic of the orthography that was used with it. Sadly many fonts attempting this script make not only add a dot, but make the worse orthographical error of making it a stroke and hence confusable with the fada diacritic, which as it could appear on an i while the sí buailte could not, and so makes the spelling of words appear wrong. There are probably more "Irish" fonts with this error than without).
The practical reason is that existing Turkish character encodings such as ISO/IEC 8859-9, EBCDIC 1026 and IBM 00857 which had common subsets with either ASCII or EBCDIC already treated i and I as the same as those in ASCII or EBCDIC (that is to say, those in most Latin script alphabets) and ı and İ as separate characters which are their case-changed equivalents; exactly as Unicode does now. Compatibility with such scripts requires continuing that practice.
Another practical reason for that implementation is that doing otherwise would create a great confusion and difficulty for Turkish keyboard layout users.
Imagine it was implemented the way you suggested, and pressing the ıI key and the iİ key on Turkish keyboards produced Turkish-specific Unicode characters. Then, even though Turkish keyboard layout otherwise includes all ASCII/Basic Latin characters (e.g. q, w, x are on the keyboard even though they are not in the Turkish alphabet), one character would have become impossible to type. So, for example Turkish users wouldn't be able to visit wikipedia.org, because what they actually typed would be w�k�ped�a.org. Maybe web browsers could implement a workaround specifically for Turkish users, but think of the other use cases and heaps non-localized applications that would become difficult to use. Perhaps Turkish keyboard layout could add an additional key to become ASCII-complete again, so that there are three keys, i.e. ıI, iİ, iI. But it would be a pointless waste of a key in an already crowded layout and would be even more confusing, so Turkish users would need to think which one is appropriate in every context: "I am typing a user name, which tend to expect ASCII characters, so use the iI key here", "When creating my password with the i character, did I use the iI key or the iİ key?"
Due to a myriad of such problems, even if Unicode included Turkish-specific i and I characters, most likely the keyboard layouts would ignore it and continue to use regular ASCII/Basic Latin characters, so the new characters would be completely unused and moot. Except they would still probably occasionally come up in places and create confusion, so it's a good thing that they didn't go that route.
I am tormented by the question concerning the usage of Unicode for a long time. Unicode allows to accelerate and simplify the development of software (in terms of globalization), but I am concerned by the following factors:
increased memory and diskspace usage;
reduction of the text processing performance;
Asian languages treated all alike to the detriment of the national specificities.
With the first paragraph of all it is obvious... But I don't know the true or not the others. Is there anyone who is faced with the need to localize software for Asian countries, and is ready to share the experience?
At the moment I try to use the encoding of a narrow profile (cp1251 - for Russia, cp1254 for Turkey, etc.). Will somebody advice on this issue?
The impact on the size of data in bytes is affected by the choice of the Unicode encoding and by the type of data. For example, using UTF-8 (the only useful Unicode encoding on the web), English text has the same size as in 8-bit encodings, except for typographically correct punctuation marks, which may take two bytes each; for Turkish text, any non-Ascii letter is 2 bytes instead of 1 byte; for Russian text, any Cyrillic letter is 2 bytes. In most cases, this does not matter much.
Text processing performance depends on what you do and how you do that. The reasonable expectation is that there is no problem worth worrying about. If processing is fast enough, it hardly matters whether it would be 10% faster using an 8-bit encoding.
Unicode unification has its impact, but surely Asian languages are not treated all alike. The Unicode standard has a lot to say about specific treatment of characters in Asian scripts and languages. If you are referring to the different shapes of CJK characters in different languages, then the usual solution is to use fonts designed for the language used. (In addition, it can in principle at least also be handled within a font, when OpenType fonts are used.)
Check out the official Unicode FAQ. It has a lot to say about issues like these.
The first two points are very much negligible. You'd need to have a very specific use case where the difference in size and performance make a discernible difference that justifies the headaches of mixed encodings.
Regarding the Unihan characters: They are grouped by meaning of the character, but that character may be written slightly differently in different writing systems. This is a problem of properly marking up the language, it's not really an encoding problem. In HTML documents, you can mark the document with lang attributes and/or set specific fonts using CSS which will alter the appearance of the character for the language appropriately. How to handle this correctly depends on the type of software (HTML, desktop app, etc). I'd advise you open a new, detailed question about that.
Increased text size: Yes. Text size may be increased up to 6 times (for UTF-8). But storage for texts nowadays is nothing a big problem.
Reduction of text processing performance: As per my opinion, no. An UTF-8 character may take up to 6 bytes, but when scanning thru' the text, and right at the first byte of an UTF-8 character we already know how many bytes more for to read for it (the current character in scanning). So most likely the scanning performance stays the same as O(n), where 'n' is the length of the text. To keep the best performance, try not to access the characters in a text by index (yeah, this is a down-point for performance). Java string is not effected by random index access to string character because Java string is a series of 2-byte characters.
Asian languages treated all alike to the detriment of the national specificities: Yeah, human languages when presented in text format are all alike, but a letter 'i' of a single stroke or a letter of '長' of 16 strokes is just a character.
Increased text size, and all of the following are actually untrue.
They may be true, for old-school encodings of unicode, such as UTF-16. UTF-8 is not larger, or slower than ASCII for ASCII-only strings, and yet it allows encoding every Unicode code point. UTF-8 is also a de-facto standard of doing Unicode on the marketplace today.
There is an extensive analysis of performance of different Unicode encodings in http://www.utf8everywhere.org, including for the Asian languages.
This is a noob question, but I wanna know why there are different encoding types and what are their differences (ie. ASCII, utf-8 and 16, base64, etc.)
Reasons are many I believe but the main point is: "How many characters you need to display (encode)?" If you live in US for example, you could go pretty far with ASCII. But in many counties we need characters like ä, å, ü etc. (If SO was ASCII only or you try to read this text as ASCII encoded text, you'd see some weird characters in the places of ä, å and ü.) Think also the China, Japan, Thailand and other "exotic" countires. Those weird figures on photos you may have seen around the world just might be letters, not pretty pictures.
As for the differences between different encoding types you need to see their specification. Here's something for UTF-8.
http://www.unicode.org/standard/standard.html
http://www.utf-8.com/
http://en.wikipedia.org/wiki/UTF-8#Compared_to_other_multi-byte_encodings
I'm not familiar with UTF-16. Here's some information about the differences.
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Unicode_plane
Base64 is used when there is a need to encode binary data that needs to be stored and transferred over media that are designed to deal with textual data. If you've ever made somesort of email system with PHP, you've probably encountered Base64.
http://en.wikipedia.org/wiki/Base64
http://www.phpeveryday.com/articles/PHP-Email-Using-Embedded-Images-in-HTML-Email-P113.html
Is short: To support computer program's user interface localizations to many different languages. (Programming languages still mainly consist of characters found in ASCII encoding, althought it's possible for example in Java to use UTF-8 encoding in variable names, and the source code file is usually stored as something else than ASCII encoded text, for example UTF-8 encoding.)
In short vol.2: Always when different people are trying to solve some problem from a specific point of view (or even without a point of view if it's even possible), results may be quite different. Quote from Joel's unicode article (link below): "Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255."
Thanks to Joachim and tchrist for all the info and discussion. Here's two articles I just read. (Both links are on the page I linked to earlier.) I'd forgotten most of the stuff from Joel's article since I last read it a few years back. Good introduction to the subject I hope. Mark Davis goes a little deeper.
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
The real reason why there are so many variants is that the Unicode consortium came along too late.
In The Beginning memory and storage was expensive and using more than 8 (or sometimes only 7) bit of memory to store a single character was considered excessive. Thus pretty much all text was stored using 7 or 8 bit per character. Clearly, 8 bit are not enough memory to represent the characters of all human languages. It's barely enough to represent most characters used in a single language (and for some languages even that's not possible). Therefore many different character encodings where designed to allow different languages (English, German, Greek, Russian, ...) to encode their texts in 8 bits per characters. After all a single text file (and usually even a single computer system) will only ever used in a single language, right?
This led to a situation where there was no single agreed-upon mapping of characters to numbers of any kind. Many different, incompatible solutions where produced and no real central control existed. Some computer systems used ASCII, others used EBCDIC (or more precisely: one of the many variations of EBCDIC), ISO-8859-* (or one of its many derivatives) or any of a big list of encodings that are hardly heard about now.
Finally, the Unicode Consortium stepped up to the task to produce that single mapping (together with lots of auxiliary data that's useful but outside of the bounds of this answer).
When the Unicode consortium finally produced a fairly comprehensive list of characters that a computer might represent (together with a number of encoding schemes to encode them to binary data, depending on your concrete needs), the other character encoding schemes were already widely used. This slowed down the adoption of Unicode and its encodings (UTF-8, UTF-16) considerably.
These days, if you want to represent text, your best bet is to use one of the few encodings that can represent all Unicode characters. UTF-8 and UTF-16 together should suffice for 99% of all use cases, UTF-32 covers almost all the others. And just to be clear: all the UTF-* encodings can encode all valid Unicode characters. But due to the fact that UTF-8 and UTF-16 are variable-width encodings, they might not be ideal for all use cases. Unless you need to be able to interact with a legacy system that can't handle those encodings, there is rarely a reason to choose anything else these days.
The main reason is to be able to show more characters. When the internet was in it's infancy, noone really planned ahead thinking that one day there would be people using it from all countries and all languages around the world. So a small character set was good enough. Gradually it was revealed to be limited and English-centric, thus the demand for bigger character sets.
I've heard a lot of people talk about how some new version of a language now supports unicode, and how much of an achievement unicode is. What's the big deal about being able to support a new characterset. It seems like something which would rarely if ever be used but people mention it quite often. What's the benefit or reason people use or even care about unicode?
Programming languages are used to produce software.
Software is used to solve problems faced by humans.
Producing software has a cost.
Software that solves problems for humans produces value. This value can be expressed in the form of profit, or the reduction of costs, depending on the business model of the software developer. How the value is expressed is irrelevant for the purposes of this discussion; what is relevant is that net value is produced.
There are seven billion humans in the world. A significant fraction of them are most comfortable reading text that is not written in the Latin alphabet.
Software which purports to solve a problem for some fraction of those seven billion humans who do not use the Latin alphabet does so more effectively if developers can easily manipulate text written in non-Latin alphabets.
Therefore, a programming language which supports non-Latin character sets lowers the costs of software developers, thereby enabling them to solve more problems for more people at lower costs, and thereby produce more value.
Unicode is the de facto standard for manipulation of non-Latin text.
Therefore, Unicode is important to the design and implementation of programming languages.
Our goal as programming language designers is the creation of tools which produce maximum value. Supporting Unicode is an easy way to massively increase the scope and range of real human problems that can be solved in software.
In the beginning, there were 256 possible characters and many different Code pages to represent them. It became a tangled mess. Supporting multiple languages and multiple characters sets became a programmer's nightmare.
Then the Unicode Consortium was formed. It created a standard that would allow a single character set with 256 x 256 = 65536 characters (plus combinations thereof) to include almost all languages of the world.
The biggest advantage is that a single character string may contain multiple languages. That is no small thing.
Unicode is now the native character specification used in Windows ever since Windows 2000. it is also allowed as a character set in HTML and on websites.
If your application does not support Unicode, or is not planning to support it, then it is only a matter of time until your application will be left behind.
What's the big deal about being able
to support a new characterset.
Unicode is not just "a new characterset". It's the character set that removes the need to think about character sets.
How would you rather write a string containing the Euro sign?
"\x80", "\x88", "\x9c", "\x9f", "\xa2\xe3", "\xa2\xe6", "\xa3\xe1", "\xa4", "\xa9\xa1", "\xd9\xe6", "\xdb", or "\xff" depending upon the encoding.
"\u20AC", in every locale, on every OS.
Unicode can support pretty much any language in the world. Without such an encoding you would have to worry about choosing the correct encoding for different languages, which is very bothersome (not to mention mixing multiple languages in the same text block, ugh)
Unicode support in a language means that the language's native character/string type supports all those languages as well, without the user having to worry about character encodings or multibyte characters and such while doing computations. Of course, one still has to acnowledge character encodings when doing I/O, but doing your string processing in one single sensible encoding helps a lot.
Well if you care anything about internationalization (AKA the rest of the world) scientific notations, etc you would care about unicode. Unicode is difficult to deal with because we have been so ingrained just ASCII support. But now that modern systems support Unicode, there is no reason really not to just encode your things UTF-8. I know I work in publishing and for a long time we had to do hack things like insert gif images of formulas etc. Now we can put unicode straight in and people can search and copy and paste etc, and our code can deal with it by using unicode regexes etc.
If you wish to communicate with someone whose native language is not English (either the British or American variants), you care. A lot.
As everyone says - support for all the charactersets and formatting used by every other language and locale in the world. Open source and commercial developers both like that because it increases their potential user base by about 20x fold (and growing).
Unicode is a good thing because it eliminates character set problems and leaves one less thing to worry about. Even if your software never leaves the U.S., you never know when you're going to run into a filename or text field with an odd character in it, and Unicode lets you live in ignorance.
Americans like Daisetsu may not care about Unicode, but the rest of the world uses a bit more than 26 Latin letters, and there Unicode is heavily used.
We had hundreds of messed up charsets in the past solely because American computer scientists thought "why would anyone want to use more than 26 Latin characters like we have in English?"
Narrow-mindedness is a bad thing.