Users sometimes use weird ASCII characters in a program, and I was wondering if there is a way to "normalize" them.
So basically, if the input is ᴀʙᴄᴅᴇꜰɢ, the output would be ABCDEFG. Is there a dictionary somewhere that does something like this? If not, is there a better method than just doing something like str.replace("ᴀ", "A") for all the different "fonts"?
This isn't a language-specific question -- if something like this doesn't exist, then I guess the next step is to create a dictionary myself.
Yes.
BTW, the technical terms are: Latin Capital Letters from the C0 Controls and Basic Latin block, and Latin Letter Small Capitals from the Phonetic Extensions block.
Anyway, the general topic for your question is Unicode confusables. The link is for a mapping. Unicode.org has more material on confusables and everything else Unicode.
(Normalization is always something to consider when processing Unicode text, but it doesn't particularly relate to this issue.)
Your example seems to involve Unicode characters, not ASCII characters. Unicode normalization (FAQ) is a large and complex subject, with many different equivalence classes of characters, depending on what you are trying to do.
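To see concretely why normalization alone doesn't solve the small-capitals case: compatibility normalization (NFKC) only folds characters that carry a compatibility decomposition in the Unicode Character Database, and the small capitals carry none. A quick sketch using Python's standard unicodedata module:

    import unicodedata

    # U+2102 DOUBLE-STRUCK CAPITAL C has a <font> compatibility
    # decomposition, so NFKC folds it to a plain "C"...
    print(unicodedata.normalize("NFKC", "\u2102"))  # -> C

    # ...but U+1D00 LATIN LETTER SMALL CAPITAL A has no decomposition
    # at all, so NFKC leaves it untouched.
    print(unicodedata.normalize("NFKC", "\u1d00"))  # -> ᴀ (unchanged)

So for the small capitals you really do need a confusables-style mapping rather than normalization.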
As a C++ developer, supporting Unicode is, putting it mildly, a pain in the butt. Unicode has a few unfortunate properties that make it very hard to determine the case of a letter, convert case, or do pretty much anything beyond identifying a single known codepoint or so (which may or may not be a letter). The only real rescue, it seems, is ICU for those who are unfortunate enough not to have Unicode support built into the language (i.e. C and C++). Support for Unicode in other languages may or may not be good enough.
So, I thought, there must be a real alternative to Unicode! That is, an encoding that does allow easy identification of character classes and of the relationships between characters, without needing a lookup data structure (tree, table, whatever)? I suspect that any such encoding would likely be multi-byte for most text -- that's not a real concern to me, but I accept that it is for others. Providing such an encoding is a lot of work, so I'm not really expecting any such encoding to exist 😞.
Short answer: not that I know of.
As a non-C++ developer, I don't know what specifically is a pain about Unicode, but since you didn't tag the question with C++, I still dare to attempt an answer.
While I'm personally very happy about Unicode in general, I agree that some aspects are cumbersome.
Some of them could arguably be improved if Unicode were redesigned from scratch, e.g. by removing some redundancies like the "Latin Greek" math letters besides the actual Greek ones (but that would also break compatibility with older encodings).
But most of the "pains" just reflect the chaotic usage of writing in the first place.
You mention yourself the problem of uppercase "i", which is "I" in some orthographies and "İ" in others, but there are tons of other difficulties – e.g. German "ß", which is lowercase but has no uppercase equivalent (well, it has one now, but it's rarely used); or letters that look different in final position (Greek "σ"/"ς"); or quotes with inverted meaning («French style» vs. »Swiss style«, “English” vs. „German style“)... I could continue for a while.
I don't see how an encoding could help with that, other than providing tables of character properties, equivalences, and relations, which is what Unicode does.
You say in comments that, by looking at the bytes of an encoded character, you want it to tell you if it's upper or lower case.
To me, this sounds like saying: "When I look at a number, I want it to tell me if it's prime."
I mean, not even ASCII codes tell you whether they are upper or lower case; you've just memorised the properties table, which tells you that 41..5A is upper case and 61..7A is lower case.
But it's hard to memorise or hardcode these ranges for all 120k Unicode codepoints. So the easiest thing is to use a table look-up.
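In most environments that table look-up is already bundled; for example, Python's standard unicodedata module exposes the Unicode general category directly. A minimal sketch:

    import unicodedata

    # The general category encodes case: "Lu" = uppercase letter,
    # "Ll" = lowercase letter, "Lo" = letter without case, and so on.
    for ch in ["A", "a", "\u00df", "\u05d0"]:  # A, a, ß, א (Hebrew alef)
        print(ch, unicodedata.category(ch))
    # A Lu
    # a Ll
    # ß Ll
    # א Lo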
There's also a bit of confusion about what "encoding" means.
Unicode doesn't define any byte representation; it only assigns codepoints, i.e. integers, to character definitions, and it maintains the tables mentioned above.
Encodings in the strict sense ("codecs") are the transformation formats (UTF-8 etc.), which define a mapping between the codepoints and their byte representation.
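A small sketch makes the distinction concrete: the codepoint is a single integer, and each codec maps it to different bytes:

    # One codepoint, several byte representations, depending on the codec.
    print(hex(ord("ß")))            # 0xdf: the codepoint itself
    print("ß".encode("utf-8"))      # b'\xc3\x9f' (two bytes)
    print("ß".encode("utf-16-le"))  # b'\xdf\x00' (two different bytes)
    print("ß".encode("latin-1"))    # b'\xdf'     (one byte)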
Now it would be possible to define a new UTF which maps codepoints to bytes in a way that provides a pattern for upper/lower case.
But what could that be?
Odd for upper, even for lower case?
But what about letters without upper-/lower-case distinction?
And then, characters that aren't letters?
And what about all the other character categories – punctuation, digits, whitespace, symbols, combining diacritics – why not represent those as well?
You could put each in a predefined range, but what happens if too many new characters are added to one of the categories?
To sum it up: I don't think what you ask for is possible.
Turkish has dotted and dotless I as two separate characters, each with their own uppercase and lowercase forms.
Uppercase     Lowercase
I  (U+0049)   ı  (U+0131)
İ  (U+0130)   i  (U+0069)
Whereas in other languages using the Latin alphabet, we have
Uppercase     Lowercase
I  (U+0049)   i  (U+0069)
Now, the Unicode Consortium could have implemented this as six different characters, each with its own casing rules, but instead decided to use only four, with different casing rules in different locales. This seems rather odd to me. What was the rationale behind that decision?
A possible implementation with six different characters:
Uppercase     Lowercase
I  (U+0049)   i  (U+0069)
I  (NEW)      ı  (U+0131)
İ  (U+0130)   i  (NEW)
Codepoints currently used:
U+0049 ‹I› \N{LATIN CAPITAL LETTER I}
U+0130 ‹İ› \N{LATIN CAPITAL LETTER I WITH DOT ABOVE}
U+0131 ‹ı› \N{LATIN SMALL LETTER DOTLESS I}
U+0069 ‹i› \N{LATIN SMALL LETTER I}
There is one theoretical and one practical reason.
The theoretical one is that the i of most Latin-script alphabets and the i of the Turkish and Azerbaijani alphabets are the same, and likewise the I of most Latin-script alphabets and the I of the Turkish and Azerbaijani alphabets are the same. The alphabets differ in the relationship between those two. One could easily enough argue that they are in fact different characters (as your proposed encoding treats them), but that's how the Language Commission considered them when defining the alphabet and orthography in the 1920s in Turkey, and Azerbaijani usage in the 1990s copied that.
(In contrast, there are Latin-based scripts in which i should be considered semantically the same as i though never drawn with a dot [just use a different font for differently shaped glyphs], particularly those that predate Carolingian minuscule or derive from a script that does, as Gaelic script was derived from Insular script. Indeed, it's particularly important never to write Irish in Gaelic script with a dot on the i, because it could be confused with the sí buailte diacritic of the orthography that was used with that script. Sadly, many fonts attempting this script not only add a dot, but make the worse orthographical error of rendering it as a stroke, hence confusable with the fada diacritic (which can appear on an i while the sí buailte cannot), and so make the spelling of words appear wrong. There are probably more "Irish" fonts with this error than without.)
The practical reason is that existing Turkish character encodings such as ISO/IEC 8859-9, EBCDIC 1026 and IBM 00857, which have common subsets with either ASCII or EBCDIC, already treated i and I as the same characters as those in ASCII or EBCDIC (that is to say, as those in most Latin-script alphabets), and ı and İ as separate characters that are their case-changed equivalents; exactly as Unicode does now. Compatibility with those encodings requires continuing that practice.
Another practical reason for that implementation is that doing otherwise would create a great confusion and difficulty for Turkish keyboard layout users.
Imagine it was implemented the way you suggested, and pressing the ıI key and the iİ key on Turkish keyboards produced Turkish-specific Unicode characters. Then, even though the Turkish keyboard layout otherwise includes all ASCII/Basic Latin characters (e.g. q, w, x are on the keyboard even though they are not in the Turkish alphabet), one character would have become impossible to type. So, for example, Turkish users wouldn't be able to visit wikipedia.org, because what they actually typed would be w�k�ped�a.org. Maybe web browsers could implement a workaround specifically for Turkish users, but think of the other use cases and the heaps of non-localized applications that would become difficult to use. Perhaps the Turkish keyboard layout could add an additional key to become ASCII-complete again, so that there are three keys, i.e. ıI, iİ, iI. But that would be a pointless waste of a key in an already crowded layout and would be even more confusing, since Turkish users would need to think about which one is appropriate in every context: "I am typing a user name, which tends to expect ASCII characters, so use the iI key here", "When creating my password with the i character, did I use the iI key or the iİ key?"
Due to a myriad of such problems, even if Unicode included Turkish-specific i and I characters, most likely the keyboard layouts would ignore it and continue to use regular ASCII/Basic Latin characters, so the new characters would be completely unused and moot. Except they would still probably occasionally come up in places and create confusion, so it's a good thing that they didn't go that route.
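A side effect of the four-character design, for anyone hitting it in code, is that case conversion must be locale-aware; the default root-locale rules are wrong for Turkish. A sketch, assuming the PyICU binding is available (the exact binding API may differ in your environment):

    # Root-locale casing: "i" uppercases to "I", which is wrong for Turkish.
    print("istanbul".upper())  # ISTANBUL

    # Locale-aware casing via ICU produces the dotted capital.
    from icu import Locale, UnicodeString
    print(str(UnicodeString("istanbul").toUpper(Locale("tr_TR"))))  # İSTANBUL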
I have a dataset which mixes use of unicode characters \u0421, 'С' and \u0043, 'C'. Is there some sort of unicode comparison which considers those two characters the same? So far I've tried several ICU collations, including the Russian one.
There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables” – characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. You would probably need to code your own use of this data, as ICU does not seem to have anything related to the confusable concept.
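A minimal sketch of coding that yourself, assuming you've downloaded confusables.txt from unicode.org (each data line has the documented layout source ; target ; type, with space-separated hex codepoints, and # starting a comment):

    # Map each character to its UTS #39 "prototype", then compare the
    # resulting skeletons. (The full skeleton algorithm also applies NFD
    # first; this sketch skips that step.)
    def load_confusables(path="confusables.txt"):
        mapping = {}
        with open(path, encoding="utf-8-sig") as f:
            for line in f:
                line = line.split("#")[0].strip()  # drop comments
                if not line:
                    continue
                source, target, _type = (p.strip() for p in line.split(";"))
                src = "".join(chr(int(cp, 16)) for cp in source.split())
                tgt = "".join(chr(int(cp, 16)) for cp in target.split())
                mapping[src] = tgt
        return mapping

    def skeleton(text, mapping):
        return "".join(mapping.get(ch, ch) for ch in text)

    # mapping = load_confusables()
    # skeleton("\u0421", mapping) == skeleton("C", mapping)  # should be True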
When you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated for codepoints that are similar in use; however, I'm not aware of any extensive list that covers visual similarities across scripts. You might want to search for URL spoofing using intentional misspellings, which was discussed when punycode was devised. Other than that, your best bet might be to search the data for characters outside the expected ranges using regular expressions, and compile a series of ad-hoc text fixers like text = text.replace("с", "c").
I found this question, which gives me the ability to check if a string contains a Chinese character. I'm not sure if the Unicode ranges are correct, but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell whether the character is traditional or simplified Chinese. How would you go about finding that out?
Update:
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument is that the characters, regardless of their shape, have the same meaning and therefore should be represented by the same code. Well, the shape is not meaningless to me, because I am analyzing individual characters, and that doesn't work with their suggested solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
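For reference, a rough Python sketch of that whole-text heuristic (the ranges are the standard Hiragana, Katakana and Hangul Syllables blocks; the 10% threshold is an arbitrary assumption):

    # Guess the language of a CJK text from the proportion of
    # script-specific characters: kana suggests Japanese, hangul Korean.
    def guess_cjk_language(text):
        if not text:
            return "unknown"
        kana = sum(1 for ch in text
                   if "\u3040" <= ch <= "\u30ff")    # Hiragana + Katakana
        hangul = sum(1 for ch in text
                     if "\uac00" <= ch <= "\ud7af")  # Hangul Syllables
        if kana / len(text) > 0.1:
            return "Japanese"
        if hangul / len(text) > 0.1:
            return "Korean"
        return "Chinese (or too little evidence)"

    print(guess_cjk_language("日本語のテキストです"))  # Japanese
    print(guess_cjk_language("한국어 텍스트입니다"))    # Korean
    print(guess_cjk_language("这是中文文本"))           # Chinese (or too little evidence)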
It is possible for some characters. The Traditional and Simplified character sets overlap, so you basically have three sets of characters:
1. Characters that are traditional only.
2. Characters that are simplified only.
3. Characters that have been left untouched, and are available in both.
Take the character 面, for instance. It belongs both to #2 and #3... As a simplified character, it stands for both 面 and 麵: face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面, and you can deduce that 麵 is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks down: if you used this data to deduce that 面 is a simplified character only, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
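If you want to experiment with that data anyway, the variant fields live in Unihan_Variants.txt inside the UCD's Unihan.zip. A minimal parsing sketch, with the caveat just described (shared characters like 面 will be misclassified as simplified-only):

    # Classify Han characters from the Unihan variant fields. Data lines
    # look like: U+9EB5<TAB>kSimplifiedVariant<TAB>U+9762
    def load_variants(path="Unihan_Variants.txt"):
        simp, trad = set(), set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue
                cp, field, _value = line.rstrip("\n").split("\t", 2)
                ch = chr(int(cp[2:], 16))  # "U+9EB5" -> 麵
                if field == "kSimplifiedVariant":
                    trad.add(ch)   # has a simplified variant => traditional form
                elif field == "kTraditionalVariant":
                    simp.add(ch)   # has a traditional variant => simplified form
        return simp, trad

    # simp, trad = load_variants()
    # "麵" in trad   # True: traditional-only
    # "面" in simp   # also True -- misleading, since 面 is a shared character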
As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.
Does someone know an easy way to find characters in Unicode that are similar to ASCII characters? An example is the CYRILLIC SMALL LETTER DZE (ѕ). I'd like to do a search and replace for similar characters. By similar I mean indistinguishable to a human reader: you can't see a difference just by looking at them.
As noted by other commenters, Unicode normalisation ("compatibility characters") isn't going to help you here, as you aren't looking for official equivalences but for similarities in glyphs (letter shapes). (The linked Unicode Technical Report is still worth reading, though, as it is extremely well written.)
If I were you, to spare yourself the tedious work of assembling a list of characters, I'd search for resources on homograph attacks: this is a method of maliciously misleading web users by displaying URLs containing domain names in which some letters have been replaced with visually similar letters. Another Unicode Technical Report, on security, contains a section on the problem. There is also -- and that may be what you most need -- a "confusables" table. Here's another article with mainly punctuation marks, some of which are ASCII, that have visually similar counterparts in the non-ASCII code tables.
What I do hope is that you aren't asking the question to construct such an attack.
See the Unicode Database: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
Each line describes a Unicode character, for example:
1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING;Ll;0;L;<compat> 0061 02BE;;;;N;;;;;
If there are any similar (compatible) characters for a symbol, they will appear in the <compat> field of its entry. In this example, 0061 (ASCII a) is compatible with the LATIN SMALL LETTER A WITH RIGHT HALF RING Unicode character.
As for your character, the entry is
0455;CYRILLIC SMALL LETTER DZE;Ll;0;L;;;;;N;;;0405;;0405
which, as you can see, does not specify a compatibility character.
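If you'd rather not parse the file yourself, Python's standard unicodedata module exposes exactly this field; a quick sketch:

    import unicodedata

    # decomposition() returns the raw decomposition mapping from
    # UnicodeData.txt, or an empty string when the entry has none.
    print(unicodedata.decomposition("\u1e9a"))  # '<compat> 0061 02BE'
    print(unicodedata.decomposition("\u0455"))  # ''  (no compatibility character)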