What is the simplest way to figure out which writing systems (as in, Latin, Hebrew, Arabic, Katakana, Chinese characters) are supported by a given set of Unicode characters?
Inspect the Script and Script_Extensions properties of each character in the set, as documented in UAX #24.
Scripts and Blocks:
Unicode characters are divided into non-overlapping ranges called
blocks [Blocks]. Many of these blocks have a name derived from
a script name, because characters of that script are primarily encoded
in that block. However, blocks and scripts differ in the following
ways:
Blocks are simply ranges, and often contain code points that are unassigned.
Characters from the same script may be encoded in several different blocks.
Characters from different scripts may be encoded in the same block.
As a result, using the block names as simplistic substitute for script
identity generally leads to poor results. For example, see Annex A,
Character Blocks, in Unicode Technical Standard #18, "Unicode Regular
Expressions" [UTS18].
Inside the latter document [UTS18], pay your priority attention to Writing Systems Versus Blocks in Annex A: Character Blocks.
At this point I’m leaning towards testing if enough glyphs from a script appear in the character set.
The approach would require two preparatory steps:
Put together a set of writing systems (scripts) supported by Unicode
For each script, define a character set containing characters of that script
Then I can solve the question “does character set A supports script X” by a test “are enough characters of script X’s character set also members of character set A”. If I do that for every script from step (1), I get a list of supported scripts.
The link provided by 一二三 references a data file that maps Unicode characters to their respective scripts, which would be invaluable in steps (1) and (2).
Related
I have a name input field in an app and would like to prevent users from entering emojis. My idea is to filter for any characters from the general categories "Cs" and "So" in the Unicode specification, as this would prevent the bulk of inappropriate characters but allow most characters for writing natural language.
But after reading the spec, I'm not sure if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points. (My understanding is still rough.)
Would excluding surrogates still leave most Chinese users with the characters they need to enter their names, or is the original Unicode space not big enough for that to be a reasonable expectation?
Your method would be both ineffective and too excessive.
Not all emoji are outside of the Basic Multilingual Plane (and thus don’t require surrogates in the first place), and not all emoji belong to the general category So. Filtering out only these two groups of characters would leave the following emoji intact:
#️⃣ *️⃣ 0️⃣ 1️⃣ 2️⃣ 3️⃣ 4️⃣ 5️⃣ 6️⃣ 7️⃣ 8️⃣ 9️⃣ ‼️ ⁉️ ℹ️ ↔️ ◼️ ◻️ ◾️ ◽️ ⤴️ ⤵️ 〰️ 〽️
At the same time, this approach would also exclude about 79,000 (and counting) non-emoji characters covering several dozen scripts – many of them historic, but some with active user communities. The majority of all Han (Chinese) characters for instance are encoded outside the BMP. While most of these are of scholarly interest only, you will need to support them regardless especially when you are dealing with personal names. You can never know how uncommon your users’ names might be.
This whole ordeal also hinges on the technical details of your app. Removing surrogates would only work if the framework you are using encodes strings in a format that actually employs surrogates (i.e. UTF-16) and if your framework is simultaneously not aware of how UTF-16 really works (as Java or JavaScript are, for example). Surrogates are never treated as actual characters; they are exceptionally reserved codepoints that exist for the sole purpose of allowing UTF-16 to deal with characters in the higher planes. Other Unicode encodings aren’t even allowed to use them at all.
If your app is written in a language that either uses a different encoding like UTF-8 or is smart enough to process surrogates correctly, then removing Cs characters on input is never going to have any effect because no individual surrogates are ever being exposed to your program. How these characters are entered by the user does not matter because all your app gets to see is the finished product (the actual character codepoints).
If your goal is to remove all emoji and only emoji, then you will have to put a lot of effort into designing your code because the Unicode emoji spec is incredibly convoluted. Most emoji nowadays are constructed out of multiple characters, not all of which are categorised as emoji by themselves. There is no easy way to filter out just emoji from a string other than maintaining an explicit list of every single official emoji which would need to be steadily updated.
Will precluding surrogate code points also impede entering Chinese characters? […] if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points.
You cannot intercept how characters are entered, whether via input method editor, copy-paste or dozens of other possibilities. You only get to see a character when it is completed (and an IME's work is done), or depending on the widget toolkit, even only after the text has been submitted. That leaves you with validation. Let's consider a realistic case. From Unihan_Readings.txt 12.0.0 (2018-11-09):
U+20009 ‹𠀉› (the same as U+4E18 丘) a hill; elder; empty; a name
U+22218 ‹𢈘› variant of 鹿 U+9E7F, a deer; surname
U+22489 ‹𢒉› a surname
U+224B9 ‹𢒹› surname
U+25874 ‹𥡴› surname
Assume the user enters 𠀉, then your unnamed – but hopefully Unicode compliant – programming language must consider the text on the grapheme level (1 grapheme cluster) or character level (1 character), not the code unit level (surrogate pair 0xD840 0xDC09). That means that it is okay to exclude characters with the Cs property.
🔖
I am not sure whether everyone can see the above character, but I can see it. I got it when I input "booknote" in Chinese on my iPhone. To my surprise, this character seems "platform-insensative", it can be seen on my phones, chrome on laptop, and even in MacOS terminal.
Is it an ASCII character? I've never seen colorful characters like this before. Since when these have been around? And where I can get a list of similar characters?
Here: http://www.unicode.org/charts/nameslist/index.html
You put the character on an HTML page. All characters on an HTML page are from the Unicode character set. Characters that are not in the Unicode character set either soon will be or are too specialized to be of general use.
The Unicode Consortium occasionally publishes a new version of the character set. Since you ask about the kind of character, the common partitions of the character set are blocks, categories, and—stretching a bit—which version the character was added in. Some characters are in a script (for a language writing system), some are not. You see the block and category of 🔖 at http://www.fileformat.info/info/unicode/char/1f516/index.htm.
The Unicode character set is published in text files called the Unicode Character Database (UCD), as well as many supplementary documents and webpages. The data includes important information about usage and relationships. For example, for applicable characters, which character is considered the uppercase form of another in a particular language.
To see any character, you have to use a font that presents it. This can be a problem for some characters. There is probably no one font that presents every Unicode character as it was meant to be.
You mentioned ASCII. Although it used every day in HTTP headers and other specialized and historical applications, ASCII is such a limited character set that it hasn't generally been used in decades.
I'm curious about the way that in the past it was implemented and I want to get information about how can I implement a character set of my own.
ASCII (American Standard Code for Information Interchange) was the "original" characterset, and remains the basis for most text data. ASCII is actually a 7-bit code (the numeric values range from 0 to 127) with the most significant bit of a byte indicating if the rest of the byte refers to ASCII (if zero) or the current Codepage.
Extra (non-ascii) characters were then added to these codepages, and the user's computer would load a specific codepage to use. Unfortunately this meant that you needed to load the correct codepage before viewing a file or the wrong characters would appear.
We have now moved on, and most systems use Unicode which is a variable character length (rather than the single-byte characters used previously) which can contain thousands upon thousands of characters, allowing for a single encoding to cater for what would have been multiple codepages using the ASCII+Codepage method of old.
That's the brief history; As to how to create your own characterset, I'm not sure what you are trying to achieve - You can create your own fonts, but if you're talking about an actual characterset (i.e. characters that do not already exist) then you'll have to get your characterset added to a standard such as Unicode so that other computers can make use of your new characters, which would be a considerable amount of work (and I have no idea how you'd even go about it) -- It's worth considering, however, that almost every character in existence already exists in Unicode so you may want to review what's already been done before you try and take on a mammoth undertaking such as creating an entirely new characterset.
What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of all of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the removing characters from MES-1 part was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski
I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese.
What it doesn't do is tell if the character is traditional or simplified Chinese. How would you go about finding this out?
update
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?
http://unicode.org/faq/han_cjk.html
Their argument that the characters regardless of their shape have the same meaning and therefore should be represented by the same code. Well, it's not meaningless to me because I am analyzing individual characters which doesn't work with their solution:
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
As already stated, you can't reliably detect the script style from a single character, but it is possible for a sufficiently long sample of text. See https://github.com/jpatokal/script_detector for a Ruby gem that does the job, and Simplified Chinese Unicode table for a general discussion.
It is possible for some characters. The Traditional and Simplified character sets overlap, so you have basically three sets of characters:
Characters that are traditional only.
Characters that are simplified only.
Characters that have been left untouched, and are available in both.
Take the character 面 for instance. It belongs both to #2 and #3... As a simplified character, it stands for 面 and 麵, face and noodles. Whereas 麵 is a traditional character only. So in the Unihan database, 麵 has a kSimplifiedVariant, which points to 面. So you can deduct that it is a traditional character only.
But 面 also has a kTraditionalVariant, which points to 麵. This is where the system breaks: if you use this data to deduct that 面 is a simplified character only, you'd be wrong...
On the other hand, 韩 has a kTraditionalVariant, pointing to 韓, and these two are a "real" Simplified/Traditional pair. But nothing in the Unihan database differentiates cases like 韓/韩 from cases like 麵/面.
As I think you've discovered, you can't. Simplified and traditional are just two styles of writing the same characters - it's like the difference between Roman and Gothic script for European languages.