Unicode comparison of Cyrillic 'С' and Latin 'C' - unicode

I have a dataset which mixes use of unicode characters \u0421, 'С' and \u0043, 'C'. Is there some sort of unicode comparison which considers those two characters the same? So far I've tried several ICU collations, including the Russian one.

There is no Unicode comparison that treats characters as the same on the basis of visual identity of glyphs. However, Unicode Technical Standard #39, Unicode Security Mechanisms, deals with “confusables”: characters that may be confused with each other due to visual identity or similarity. It includes a data file of confusables as well as “intentionally confusable” pairs, i.e. “characters whose glyphs in any particular typeface would probably be designed to be identical in shape when using a harmonized typeface design”, which mainly consists of pairs of Latin and Cyrillic or Greek letters, like C and С. Collations will not help here, but ICU does expose this data through its spoof-checking API (USpoofChecker in ICU4C, SpoofChecker in ICU4J); if that is not available to you, you would need to code your own use of the data file.
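A minimal sketch of using the confusables data directly, assuming the data file uses a "source ; target ; type # comment" layout. The two sample lines are illustrative stand-ins rather than a copy of the real file, and the names parse_confusables and skeleton are hypothetical. The skeleton function is a simplified version of the idea in UTS #39 (decompose, map each character to its prototype, decompose again), not a full implementation.

```python
import unicodedata

# Illustrative lines in the "source ; target ; type # comment" layout
# assumed for confusables.txt; a real run would read the downloaded file.
SAMPLE = """\
0421 ;\t0043 ;\tMA\t# CYRILLIC CAPITAL LETTER ES -> LATIN CAPITAL LETTER C
0410 ;\t0041 ;\tMA\t# CYRILLIC CAPITAL LETTER A -> LATIN CAPITAL LETTER A
"""


def parse_confusables(text):
    """Build a {source string: target string} map from confusables-style lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


def skeleton(s, mapping):
    """Simplified skeleton: NFD, map each character, NFD again."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(mapping.get(ch, ch) for ch in s)
    return unicodedata.normalize("NFD", s)


CONFUSABLES = parse_confusables(SAMPLE)
# Cyrillic ES followed by Latin "AT" compares equal to all-Latin "CAT".
same = skeleton("\u0421AT", CONFUSABLES) == skeleton("CAT", CONFUSABLES)
print(same)
```

Two strings are then treated as "the same" when their skeletons are equal, which is a comparison for detecting look-alikes, not a general-purpose collation.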

When you take a look at http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt, you will see that some code positions are annotated for code points that are similar in use; however, I'm not aware of any extensive list that covers visual similarities across scripts. You might want to search for URL spoofing using intentional misspellings, which was discussed when they came up with Punycode. Other than that, your best bet might be to search the data for characters outside the expected range using regular expressions, and compile a series of ad-hoc text fixers like text = text.replace /с/, 'c'.
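The ad-hoc approach might look like the following sketch in Python. It assumes the dataset is expected to be Latin-only, so anything in the basic Cyrillic block counts as "outside the expected"; the mapping is a small hand-picked sample, and find_unexpected and fix_text are hypothetical helper names.

```python
import re

# Hand-picked Cyrillic -> Latin look-alikes; extend as the data requires.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
})

# Anything in the basic Cyrillic block is treated as unexpected here.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def find_unexpected(text):
    """Return (index, character) pairs for characters outside the expected script."""
    return [(m.start(), m.group()) for m in CYRILLIC.finditer(text)]


def fix_text(text):
    """Replace the known look-alikes in one pass."""
    return text.translate(CYRILLIC_TO_LATIN)


print(find_unexpected("\u0421olumn"))
print(fix_text("\u0421olumn"))
```

Running find_unexpected on the output of fix_text shows which Cyrillic characters are not yet covered by the table.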

Related

Is it possible to use unicode combining characters to combine arbitrary characters?

Is it possible to use unicode combining characters to for example make the characters x and y appear to be partially overlapping each other?
I know that in layout systems like CSS there are other ways to achieve this, but I specifically want to know if it's possible with just Unicode so I can, for example, do it in Slack messages.
No, there is no Unicode mechanism to make arbitrary letters overlap each other. You can put an x above a y using the character U+036F COMBINING LATIN SMALL LETTER X like so: yͯ, but that’s about it.
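As a small illustration, the sequence can be built and inspected with Python's standard unicodedata module. This only shows what is stored in the string; how the mark is actually drawn depends on the font and renderer.

```python
import unicodedata

# Base letter followed by U+036F; the result is two code points, not one.
combined = "y" + "\u036f"

for ch in combined:
    # A non-zero combining class marks a character that attaches to its base.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
```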
Latin letters partially overlapping each other serves no semantic function, so it is not part of the Unicode standard. And if it was found to be used to convey actual meaning in some writing system, it would most likely not be encoded as a generalised mechanism but as individual characters representing specific such ligatures.
The Unicode Consortium does not consider styling features like that to be part of plain text. That is also why those bold and italic mathematical letters you sometimes see on Twitter (𝐀, 𝐴, 𝓐 etc.) aren’t implemented as the base letters plus some style modifiers, but as separate character codes entirely. A character that means “display the preceding letter as bold” would have been too general; non-crucial style variation should be dealt with through higher-level protocols (like the CSS you mentioned) which are much more powerful and enjoy more widespread support anyway.
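A quick way to see that these styled letters are separate code points is to look them up with Python's unicodedata module. The sketch below also applies NFKC compatibility normalization, which folds each of them back to the plain letter; that step is added here as an illustration and is not something the answer above relies on.

```python
import unicodedata

styled = ["𝐀", "𝐴", "𝓐"]  # the three styled capitals mentioned above

for ch in styled:
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):05X} {unicodedata.name(ch)} -> {folded}")

# Each styled letter carries a compatibility mapping to plain "A".
all_fold_to_a = all(unicodedata.normalize("NFKC", ch) == "A" for ch in styled)
```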

Visually-identical characters in Unicode

I want to find visually identical characters for a specific character in Unicode.
I know how to find canonical or compatibility decompositions of a character; but they do not give me what I want.
I want to find characters that are visually identical (not similar), and their only difference can be their sizes.
For example, I want (s,S) or (S,S) (whose code points are different).
I do not want (ß, β), or (e, é).
Any suggestions? Thanks.
For a particular character, you could start from annotations in the code charts in the Unicode standard. The annotations often refer to other characters for various reasons, including similarity or identity of shape. But the annotations are not meant to cover everything.
You could also draw your character at http://shapecatcher.com/ and ask it to recognize it. You often get a long list of visually similar alternatives.
As @TedHopp writes in his comment, visual identity is font-dependent. For example, “s” and “S” need not be identical in shape; in most fonts, they are not – the basic form is the same, but there are various differences in stroke width variation, curvature, serifs, etc. However, some characters can be expected to be visually identical in any font that contains them, such as Latin capital A, Greek capital alpha Α, and Cyrillic capital А.
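The following sketch shows why decompositions do not answer the question for such letters. The fullwidth A (U+FF21) is added here purely as an assumed example of a character that differs from its counterpart mainly in size: compatibility normalization folds it to Latin A, while the Greek and Cyrillic look-alikes stay distinct.

```python
import unicodedata

# Latin, Greek, Cyrillic, and (as an added example) fullwidth capital A.
lookalikes = ["A", "\u0391", "\u0410", "\uFF21"]

for ch in lookalikes:
    nfkc = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: NFKC -> U+{ord(nfkc):04X}")
```

Only the fullwidth form maps to U+0041; the cross-script pairs need a separate data source.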
You did not specify the purpose of the study, but you might be doing something that has been carried out to some extent by the Unicode Consortium. See UTR #36, Unicode Security Considerations, which also contains references to related work, including UTS #39, Unicode Security Mechanisms, which contains confusables.txt, Recommended confusable mapping for IDN (i.e., for a particular context, but it may be of interest for other purposes as well).

Subset of Unicode normally used in writing?

What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of all of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
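As a minimal sketch, the sample line can be split on semicolons to pull out the General Category, and the result compared with what Python's built-in unicodedata module reports for the same code point.

```python
import unicodedata

line = "0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;"

fields = line.split(";")
code_point = int(fields[0], 16)   # first field: code point in hexadecimal
name = fields[1]                  # second field: character name
general_category = fields[2]      # third field: General Category

print(hex(code_point), name, general_category)
# The standard library exposes the same property directly.
print(unicodedata.category(chr(code_point)))
```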
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
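Independently of those other files, a First/Last pair can be read as an inclusive range that shares one General Category. A sketch, with parse_ranges as a hypothetical helper name:

```python
def parse_ranges(lines):
    """Yield (first, last, category) for '<..., First>' / '<..., Last>' pairs."""
    start = None
    for line in lines:
        fields = line.split(";")
        code_point, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            start = code_point
        elif name.endswith(", Last>") and start is not None:
            yield start, code_point, category
            start = None


sample = [
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
    "9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;",
]
ranges = list(parse_ranges(sample))
print(ranges)
```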
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
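One possible shape for such a filter is sketched below. The chosen categories (letters, marks, numbers, punctuation, and ordinary spaces) are an assumption made for illustration and would need tuning; is_allowed and rejected_characters are hypothetical names.

```python
import unicodedata

# Illustrative allow-list (an assumption, tune for your application):
# letters, marks, numbers, punctuation, plus ordinary space separators.
ALLOWED_MAJOR_CLASSES = {"L", "M", "N", "P"}
ALLOWED_CATEGORIES = {"Zs"}


def is_allowed(ch):
    """True if the character's General Category is on the allow-list."""
    category = unicodedata.category(ch)
    return category[0] in ALLOWED_MAJOR_CLASSES or category in ALLOWED_CATEGORIES


def rejected_characters(text):
    """Return the characters that fall outside the allow-list, in order."""
    return [ch for ch in text if not is_allowed(ch)]


print(rejected_characters("Zoë, 30!"))
print(rejected_characters("a\u202eb\x07"))
```

A filter like this rejects control and format characters, but because marks are allowed it does nothing about input made entirely of combining characters.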
“I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters”
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the removing characters from MES-1 part was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski

What characters are NOT present in Unicode?

I have heard that some characters are not present in the Unicode standard despite being written in everyday life by populations of some areas. Especially I have heard about recent Chinese first names fabricated by assembling existing characters parts, but I can't find any reference for this.
For instance, the character below is very common for 50 million people, yet it was not in Unicode until October 2009:
Is there a list of such characters? (images, or website listing such characters as images)
Also: Here's unicode.org's list of unsupported scripts
Well, there's loads of stuff not present in Unicode (though new characters are still being added).
Some examples:
Due to Han Unification, Unicode uses one codepoint for several similar characters from different languages. People disagree whether these characters are really "the same"; if you believe they should be represented separately, then these separate representations could be said to be "missing" (though this is something of a philosophical question).
In a similar vein, many languages (especially Asian languages) sometimes have several variants of one character/glyph. The distinction between "one character with several representations" (=one codepoint) and "distinct characters" (=different codepoints) is somewhat arbitrary, thus there are cases (e.g. with Kanji characters) where some people feel alternative variants are "missing".
Many historic and rarely used characters are missing.
Many old/historic scripts are not covered, e.g. Demotic. Actually, there is an initiative specifically for including more scripts in Unicode, the Script Encoding Initiative (SEI).
There is also a page by the W3C on this topic, Missing characters and glyphs, with more explanations.
There are tons of characters from the symbol part of the standard that are annoyingly not included.
See the "Missing symmetric versions" section of https://web.archive.org/web/20210830121541/http://xahlee.info/comp/unicode_arrows.html for a bunch of arrow symbols that exist, but only in certain directions. Some are just silly. For example, there is ⥂, ⥃, and ⥄, but there isn't a right pointing version of the last one.
And you can see from http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts that they picked apparently randomly which letters to support in super- and subscript form. For example, they include the subscript vowels a, e, o, and even schwa (ə), but not i, which would be very useful, as it's a common subscript in mathematical typesetting. Take a look at the Wikipedia article for more details (you'll need a Unicode font installed, because at least at the time of this writing the regular ASCII equivalents are not explicitly listed), but basically they picked about half of the Latin alphabet seemingly at random for each of upper- and lower-case super- and subscript characters.
Also, a lot of symbols that would be convenient for building shapes with unicode do not exist.
It does not support the bilabial trill letter, turned beta, reversed k.

Detect if character is simplified or traditional Chinese character

I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?
update
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument is that the characters, regardless of their shape, have the same meaning and therefore should be represented by the same code. Well, the distinction is not meaningless to me, because I am analyzing individual characters, which doesn't work with their solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
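A rough sketch of that whole-text heuristic in Python follows. The block ranges cover only the main Hiragana/Katakana, Hangul, and CJK Unified Ideographs blocks, and the 10% threshold for "a fair amount" is an arbitrary assumption; count_scripts and guess_language are hypothetical names.

```python
def count_scripts(text):
    """Count kana, hangul, and Han characters using main-block ranges only."""
    counts = {"kana": 0, "hangul": 0, "han": 0}
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:        # Hiragana and Katakana blocks
            counts["kana"] += 1
        elif 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:  # Hangul syllables, jamo
            counts["hangul"] += 1
        elif 0x4E00 <= cp <= 0x9FFF:      # CJK Unified Ideographs block
            counts["han"] += 1
    return counts


def guess_language(text, threshold=0.1):
    """Guess the language from script proportions; threshold is an assumption."""
    counts = count_scripts(text)
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    if counts["kana"] / total >= threshold:
        return "Japanese"
    if counts["hangul"] / total >= threshold:
        return "Korean"
    return "Chinese"


print(guess_language("これは日本語です"))
print(guess_language("한국어입니다"))
print(guess_language("这是中文"))
```

This says nothing about a single ideograph in isolation, and it does not distinguish Simplified from Traditional Chinese.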
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:
Characters that are traditional only.
Characters that are simplified only.
Characters that have been left untouched, and are available in both.
Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduce that it is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduce that 面 is a simplified character only, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
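The limitation can be made concrete with a small sketch. The two dictionaries are hand-entered from the examples in this answer, shaped like Unihan's kSimplifiedVariant and kTraditionalVariant fields rather than loaded from the Unihan files, and classify is a hypothetical helper.

```python
# Hand-entered variant links for the characters discussed above, written in
# the shape of Unihan's kSimplifiedVariant / kTraditionalVariant fields.
SIMPLIFIED_VARIANT = {"麵": "面", "韓": "韩"}
TRADITIONAL_VARIANT = {"面": "麵", "韩": "韓"}


def classify(ch):
    """Classify a character using only the presence of variant links."""
    has_simplified = ch in SIMPLIFIED_VARIANT
    has_traditional = ch in TRADITIONAL_VARIANT
    if has_simplified and not has_traditional:
        return "traditional only"
    if has_traditional and not has_simplified:
        # Cannot tell a simplified-only character from one shared by both sets.
        return "simplified or shared"
    if has_simplified and has_traditional:
        return "ambiguous"
    return "no variant data"


for ch in "麵面韓韩":
    print(ch, classify(ch))
```

Both 面 and 韩 land in the same bucket, which is exactly the ambiguity described: the variant links alone cannot separate a true simplified form from a character that is also used unchanged in traditional text.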
As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.