My other question brought up a related question:
Is there a standard table of Unicode to ASCII transcriptions? Think for instance of German ü mapping to ue.
User bobince mentioned in a comment that other languages use the same character in a different way, and I fear they may use not only the same glyph but also the same code point. Hence mapping e.g. "ü" to "u" (mapping by visual similarity) would also be acceptable, and so is mapping ü to "u as done by iconv (see for instance the link posted by Juancho).
The methods shown in the link posted by Juancho are technically working solutions. However, is there a formal standard for such a mapping, or at least a mapping used as a quasi-standard? Ideally it would also include, for instance, phonetics-based transcriptions for non-Latin characters. I remember that such transcriptions exist for Japanese kana and Greek characters, so that shouldn't be a big problem either.
There is no formal standard for such mappings. Mappings that take Latin letters in general (like ü, é and ß) to ASCII are not really transcriptions or transliterations but just, well, mappings, which might be called simplifications or ASCIIfications. They are performed for various purposes, often in an ad hoc way.
Mapping ü to ue is rather common in German and might be called an unofficial or de facto standard for German names when ü cannot be used. But other languages use other rules, and it would be odd to ASCIIfy French or Spanish that way; instead, the diacritic would simply be dropped, mapping ü to u.
People may map e.g. ü to u" when they are forced (or believe they are forced) to use ASCII and yet want to convey that the u has a diaeresis on it.
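As an illustration, here is a minimal Python sketch of such an ad hoc ASCIIfication (the asciify function and its replacement table are made up for this example, not any standard): a language-specific table is applied first, and everything else falls back to dropping combining marks.

    import unicodedata

    # Illustrative German-style replacements; other languages would use
    # other tables, or none at all.
    GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue",
              "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

    def asciify(text, table=GERMAN):
        result = []
        for ch in text:
            if ch in table:
                result.append(table[ch])
            else:
                # Decompose, then strip combining marks (category Mn): é -> e.
                decomposed = unicodedata.normalize("NFD", ch)
                result.append("".join(
                    c for c in decomposed if unicodedata.category(c) != "Mn"))
        return "".join(result)

    print(asciify("Müller"))            # Mueller (German convention)
    print(asciify("Müller", table={}))  # Muller  (just drop the diacritic)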
Turkish has dotted and dotless I as two separate characters, each with their own uppercase and lowercase forms.
Uppercase     Lowercase
I  (U+0049)   ı  (U+0131)
İ  (U+0130)   i  (U+0069)
Whereas in other languages using the Latin alphabet, we have
Uppercase     Lowercase
I  (U+0049)   i  (U+0069)
Now, the Unicode Consortium could have implemented this as six different characters, each with its own casing rules, but instead decided to use only four, with different casing rules in different locales. This seems rather odd to me. What was the rationale behind that decision?
A possible implementation with six different characters:
Uppercase     Lowercase
I  (U+0049)   i  (U+0069)
I  (new)      ı  (U+0131)
İ  (U+0130)   i  (new)
Code points currently used:
U+0049 ‹I› \N{LATIN CAPITAL LETTER I}
U+0130 ‹İ› \N{LATIN CAPITAL LETTER I WITH DOT ABOVE}
U+0131 ‹ı› \N{LATIN SMALL LETTER DOTLESS I}
U+0069 ‹i› \N{LATIN SMALL LETTER I}
There is one theoretical and one practical reason.
The theoretical one is that the i of most Latin-script alphabets and the i of the Turkish and Azerbaijani alphabets are the same, and likewise the I of most Latin-script alphabets and the I of Turkish and Azerbaijani are the same. The alphabets differ only in the relationship between those two. One could easily enough argue that they are in fact different characters (as your proposed encoding treats them), but that is how the Language Commission considered them when it defined the Turkish alphabet and orthography in the 1920s, and Azerbaijani usage in the 1990s copied that.
(In contrast, there are Latin-based scripts in which i should be considered semantically the same as the ordinary i even though it is never drawn with a dot [differently shaped glyphs are a matter of choosing a different font, not of encoding], particularly scripts that predate the Carolingian minuscule or derive from one that does, as Gaelic script was derived from Insular script. Indeed, it is particularly important never to write Irish in Gaelic script with a dot on the i, as the dot could be confused with the sí buailte diacritic of the orthography that was used with that script. Sadly, many fonts attempting this script not only add a dot but make the worse orthographical error of rendering it as a stroke, hence confusable with the fada diacritic; since the fada can appear on an i while the sí buailte cannot, this makes the spelling of words appear wrong. There are probably more "Irish" fonts with this error than without.)
The practical reason is that existing Turkish character encodings such as ISO/IEC 8859-9, EBCDIC code page 1026 and IBM code page 00857, which share common subsets with either ASCII or EBCDIC, already treated i and I as identical to those of ASCII or EBCDIC (that is to say, those of most Latin-script alphabets) and ı and İ as separate characters that are their case-changed equivalents; exactly as Unicode does now. Compatibility with those encodings requires continuing that practice.
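You can see this legacy layout directly, for instance with Python's built-in iso8859_9 codec:

    # i and I occupy the same bytes as in ASCII...
    print("iI".encode("iso8859_9"))  # b'iI'
    # ...while the Turkish-specific letters get bytes of their own.
    print("ıİ".encode("iso8859_9"))  # b'\xfd\xdd'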
Another practical reason for that implementation is that doing otherwise would create great confusion and difficulty for users of the Turkish keyboard layout.
Imagine it had been implemented the way you suggest, with the ıI key and the iİ key on Turkish keyboards producing Turkish-specific Unicode characters. Then, even though the Turkish keyboard layout otherwise includes all ASCII/Basic Latin characters (e.g. q, w and x are on the keyboard even though they are not in the Turkish alphabet), one character would have become impossible to type. So, for example, Turkish users wouldn't be able to visit wikipedia.org, because what they actually typed would be w�k�ped�a.org. Maybe web browsers could implement a workaround specifically for Turkish users, but think of the other use cases and the heaps of non-localized applications that would become difficult to use.

Perhaps the Turkish keyboard layout could add an additional key to become ASCII-complete again, so that there would be three keys: ıI, iİ and iI. But that would be a pointless waste of a key in an already crowded layout, and it would be even more confusing, since Turkish users would need to think about which key is appropriate in every context: "I am typing a user name, which tends to expect ASCII characters, so use the iI key here"; "When creating my password with the i character, did I use the iI key or the iİ key?"
Because of a myriad of such problems, even if Unicode had included Turkish-specific i and I characters, the keyboard layouts would most likely have ignored them and continued to use the regular ASCII/Basic Latin characters, so the new characters would have been completely unused and moot. Except that they would still occasionally turn up in places and create confusion, so it's a good thing that Unicode didn't go that route.
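For completeness, here's what the adopted four-character, locale-sensitive casing looks like in practice, in a small PyICU sketch (this assumes the icu module is installed and that your PyICU version exposes UnicodeString.toUpper/toLower with a Locale argument, as ICU4C does; note that Python's built-in str.upper() ignores locale):

    from icu import Locale, UnicodeString

    print("istanbul".upper())                                      # ISTANBUL, everywhere
    print(str(UnicodeString("istanbul").toUpper(Locale("en"))))    # ISTANBUL
    print(str(UnicodeString("istanbul").toUpper(Locale("tr"))))    # İSTANBUL: i uppercases to İ
    print(str(UnicodeString("DIYARBAKIR").toLower(Locale("tr"))))  # dıyarbakır: I lowercases to ı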
I am new to Unicode and have been given the requirement to look at some translated text, iterate over all of its characters, and determine whether all the characters are valid for the target culture (language and location).

For example, if I am translating a document from English to Greek, I want to detect any English/ASCII "A"s in the Greek translation and report them as errors. This could well happen with corrupted data from a translation memory.
Is there any existing grouping of Unicode characters by culture? Or is there an existing strategy for developing such a grouping? I see that there is some grouping of characters at http://www.unicode.org/charts/, but at first glance it does not seem to be quite what I am looking for.

Does anything exist like "Here are the valid Unicode characters for Spanish - Spain: [some Unicode range(s)]" or "Here are the valid Unicode characters for Russian - Russia: [some Unicode range(s)]"?
Or has anyone developed a strategy to define these?
If this is not the right place to ask this question, I would welcome any direction on where might be a good place to ask the question.
This is something that CLDR (Common Locale Data Repository) deals with. It is not part of the Unicode Standard, but it is an activity and a resource managed by the Unicode Consortium. The LDML specification defines the format of the locale data. The Character Elements define some sets of characters: “main/standard”, “auxiliary”, “index”, and “punctuation”.
The data for Greek includes only Greek letters and some basic punctuation. This, like all such data in CLDR, is largely subjective. And even though the CLDR process is meant to produce well-reviewed data based on consensus, the reality is different. It can be argued that Latin letters are not uncommon in normal Greek texts, especially in technical areas. For example, the international symbol for the ampere is "A", a Latin letter; the symbol for the kilogram is "kg", in Latin letters, even though the Greek word for the unit is written in Greek letters.
Thus, however you run the analysis, the occurrence of a Latin "A" in Greek text can be flagged as potentially suspicious, but not as an outright error.
There are C/C++ and Java libraries that implement access to CLDR data, as part of ICU.
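A rough sketch of such a check in Python via PyICU's wrapper around this CLDR data (the LocaleData class and the getExemplarSet arguments follow ICU's ulocdata API, but the exact signature may differ between PyICU versions):

    from icu import LocaleData

    # CLDR "main/standard" exemplar characters for Greek, as an ICU UnicodeSet.
    exemplar = LocaleData("el").getExemplarSet(0, 0)  # options=0, extype=0 (standard)

    def flag_unexpected(text):
        # Report letters outside the locale's exemplar set; digits,
        # punctuation and whitespace would need checks of their own.
        return [ch for ch in text if ch.isalpha() and not exemplar.contains(ch)]

    print(flag_unexpected("Ελλάδα"))  # []
    print(flag_unexpected("ΕλλAδα"))  # ['A']: a stray Latin capital A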
I have a dataset which mixes the Unicode characters \u0421 ('С') and \u0043 ('C'). Is there some sort of Unicode comparison that considers those two characters the same? So far I've tried several ICU collations, including the Russian one.
There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables” – characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. You would probably need to code your own use of this data, as ICU does not seem to have anything related to the confusable concept.
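To illustrate, here is a sketch of the UTS #39 "skeleton" comparison in Python, built directly from the confusables.txt data file (you would first download it from unicode.org; the parser assumes the file's current format, where each source is a single code point):

    import unicodedata

    def load_confusables(path="confusables.txt"):
        # Each data line has the form: source ; target ; type
        # with fields as hex code points and '#' starting a comment.
        table = {}
        with open(path, encoding="utf-8-sig") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                source, target, _type = [field.strip() for field in line.split(";")]
                table[chr(int(source, 16))] = "".join(
                    chr(int(cp, 16)) for cp in target.split())
        return table

    def skeleton(s, table):
        # UTS #39 skeleton: NFD, map each confusable character, NFD again.
        s = unicodedata.normalize("NFD", s)
        s = "".join(table.get(ch, ch) for ch in s)
        return unicodedata.normalize("NFD", s)

    table = load_confusables()
    print(skeleton("С", table) == skeleton("C", table))  # True: Cyrillic С vs Latin C

Two strings are confusable when their skeletons are equal; note that the skeleton is meant only for this comparison, not as a general-purpose cleaned-up output.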
When you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated as being similar in use to other code points; however, I'm not aware of any extensive list that covers visual similarities across scripts. You might want to search for URL spoofing using intentional misspellings, which was discussed when punycode was devised. Other than that, your best bet might be to search the data for characters outside the expected range using regular expressions, and to compile a series of ad hoc text fixers like text = text.replace(/с/g, 'c').
I have heard that some characters are not present in the Unicode standard despite being written in everyday life by the populations of some areas. In particular, I have heard about recent Chinese given names fabricated by assembling parts of existing characters, but I can't find any reference for this.
For instance, one such character (shown as an image in the original question) is very common for 50 million people, yet it was not in Unicode until October 2009.
Is there a list of such characters (images, or a website listing such characters as images)?
Also: Here's unicode.org's list of unsupported scripts
Well, there's loads of stuff not present in Unicode (though new characters are still being added).
Some examples:
Due to Han Unification, Unicode uses one code point for several similar characters from different languages. People disagree about whether these characters are really "the same"; if you believe they should be represented separately, then these separate representations could be said to be "missing" (though this is something of a philosophical question).
In a similar vein, many languages (especially Asian languages) have several variants of one character/glyph. The distinction between "one character with several representations" (= one code point) and "distinct characters" (= different code points) is somewhat arbitrary, so there are cases (e.g. with kanji characters) where some people feel that alternative variants are "missing".
Many historic and rarely used characters are missing.
Many old/historic scripts are not covered, e.g. Demotic. In fact, there is an initiative specifically for getting more scripts into Unicode, the Script Encoding Initiative (SEI).
There is also a page by the W3C on this topic, Missing characters and glyphs, with more explanations.
There are tons of characters that are annoyingly missing from the symbol part of the standard.
See the "Missing symmetric versions" section of https://web.archive.org/web/20210830121541/http://xahlee.info/comp/unicode_arrows.html for a bunch of arrow symbols that exist, but only in certain directions. Some are just silly. For example, there is ⥂, ⥃, and ⥄, but there isn't a right pointing version of the last one.
And you can see from http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts that the choice of which letters to support in superscript and subscript form looks almost random. For example, the Superscripts and Subscripts block includes the subscript vowels a, e, o, and even schwa (ə), but not i, which would be very useful, as it's a common subscript in mathematical typesetting (a subscript i does exist, but over in the Phonetic Extensions block, at U+1D62). Take a look at the Wikipedia article for more details (you'll need a Unicode font installed, because at least at the time of this writing the regular ASCII equivalents are not explicitly listed), but basically about half of the Latin alphabet was picked, seemingly at random, for each of the upper- and lower-case superscript and subscript sets.
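Since the character names live in Python's standard unicodedata module, you can enumerate which letters made the cut with a few lines of code:

    import unicodedata

    def has_subscript(c):
        # True if a code point named LATIN SUBSCRIPT SMALL LETTER <C> exists.
        try:
            unicodedata.lookup(f"LATIN SUBSCRIPT SMALL LETTER {c.upper()}")
            return True
        except KeyError:
            return False

    print("".join(c for c in "abcdefghijklmnopqrstuvwxyz" if has_subscript(c)))
    # aehijklmnoprstuvx, with a recent Unicode character database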
Also, a lot of symbols that would be convenient for building shapes with Unicode do not exist.
It does not support the bilabial trill letter, turned beta, or reversed k.
Does someone know an easy way to find characters in Unicode that are similar to ASCII characters? An example is CYRILLIC SMALL LETTER DZE (ѕ). I'd like to do a search and replace for similar characters. By similar I mean human-indistinguishable: you can't see a difference just by looking at them.
As noted by other commenters, Unicode normalisation ("compatibility characters") isn't going to help you here, as you aren't looking for official equivalences but for similarities in glyphs (letter shapes). (The linked Unicode Technical Report is still worth reading, though, as it is extremely well written.)
If I were you, to spare yourself the tedious work of assembling a list of characters, I'd search for resources on homograph attacks: a method of maliciously misleading web users by displaying URLs containing domain names in which some letters have been replaced with visually similar letters. Another Unicode Technical Report, on security, contains a section on the problem. There is also a "confusables" table, which may be what you most need. Here's another article, mainly about punctuation marks, some of them ASCII, that have visually similar counterparts in the non-ASCII code tables.
What I do hope is that you aren't asking the question to construct such an attack.
See the Unicode Database: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
Each line describes a Unicode character, for example:
1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;
If there are similar (compatible) characters for a symbol, they appear in the <compat> field of the entry. In this example, 0061 (ASCII a) is compatible with the LATIN SMALL LETTER A WITH RIGHT HALF RING Unicode character.
As for your character, the entry is
0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405
which, as you can see, does not specify a compatibility character.
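In Python you can query this field directly through the standard unicodedata module instead of parsing the file yourself:

    import unicodedata

    print(unicodedata.decomposition("\u1E9A"))  # <compat> 0061 02BE
    print(unicodedata.decomposition("\u0455"))  # '' (DZE has no decomposition)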