Differenciate between symbol, number and letter-codepoints in Unicode? - unicode

Unicode has a huge number of codepoints, how can I check wheter a codepoint is a symbol (like "!" or "☭"), a number (like "4" or "৯"), a letter (like "a" or "え") or a control character (are usually not displayed directly)?
Is there any logic behind the position of the character and what kind of character it is (as opposed to just what alphabet it is part of), if not, are there any existing resources which classify which ranges are what?

That would be done through the General Category property of those codepoints. It's part of the canonical UnicodeData.txt dataset, and every serious Unicode-related library should have some way for you to get this property.

Related

Will precluding surrogate code points also impede entering Chinese characters?

I have a name input field in an app and would like to prevent users from entering emojis. My idea is to filter for any characters from the general categories "Cs" and "So" in the Unicode specification, as this would prevent the bulk of inappropriate characters but allow most characters for writing natural language.
But after reading the spec, I'm not sure if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points. (My understanding is still rough.)
Would excluding surrogates still leave most Chinese users with the characters they need to enter their names, or is the original Unicode space not big enough for that to be a reasonable expectation?
Your method would be both ineffective and too excessive.
Not all emoji are outside of the Basic Multilingual Plane (and thus don’t require surrogates in the first place), and not all emoji belong to the general category So. Filtering out only these two groups of characters would leave the following emoji intact:
#️⃣ *️⃣ 0️⃣ 1️⃣ 2️⃣ 3️⃣ 4️⃣ 5️⃣ 6️⃣ 7️⃣ 8️⃣ 9️⃣ ‼️ ⁉️ ℹ️ ↔️ ◼️ ◻️ ◾️ ◽️ ⤴️ ⤵️ 〰️ 〽️
At the same time, this approach would also exclude about 79,000 (and counting) non-emoji characters covering several dozen scripts – many of them historic, but some with active user communities. The majority of all Han (Chinese) characters for instance are encoded outside the BMP. While most of these are of scholarly interest only, you will need to support them regardless especially when you are dealing with personal names. You can never know how uncommon your users’ names might be.
This whole ordeal also hinges on the technical details of your app. Removing surrogates would only work if the framework you are using encodes strings in a format that actually employs surrogates (i.e. UTF-16) and if your framework is simultaneously not aware of how UTF-16 really works (as Java or JavaScript are, for example). Surrogates are never treated as actual characters; they are exceptionally reserved codepoints that exist for the sole purpose of allowing UTF-16 to deal with characters in the higher planes. Other Unicode encodings aren’t even allowed to use them at all.
If your app is written in a language that either uses a different encoding like UTF-8 or is smart enough to process surrogates correctly, then removing Cs characters on input is never going to have any effect because no individual surrogates are ever being exposed to your program. How these characters are entered by the user does not matter because all your app gets to see is the finished product (the actual character codepoints).
If your goal is to remove all emoji and only emoji, then you will have to put a lot of effort into designing your code because the Unicode emoji spec is incredibly convoluted. Most emoji nowadays are constructed out of multiple characters, not all of which are categorised as emoji by themselves. There is no easy way to filter out just emoji from a string other than maintaining an explicit list of every single official emoji which would need to be steadily updated.
Will precluding surrogate code points also impede entering Chinese characters? […] if this would preclude, for example, a Pinyin keyboard from submitting Chinese characters that need supplemental code points.
You cannot intercept how characters are entered, whether via input method editor, copy-paste or dozens of other possibilities. You only get to see a character when it is completed (and an IME's work is done), or depending on the widget toolkit, even only after the text has been submitted. That leaves you with validation. Let's consider a realistic case. From Unihan_Readings.txt 12.0.0 (2018-11-09):
U+20009 ‹𠀉› (the same as U+4E18 丘) a hill; elder; empty; a name
U+22218 ‹𢈘› variant of 鹿 U+9E7F, a deer; surname
U+22489 ‹𢒉› a surname
U+224B9 ‹𢒹› surname
U+25874 ‹𥡴› surname
Assume the user enters 𠀉, then your unnamed – but hopefully Unicode compliant – programming language must consider the text on the grapheme level (1 grapheme cluster) or character level (1 character), not the code unit level (surrogate pair 0xD840 0xDC09). That means that it is okay to exclude characters with the Cs property.

Which nonnegative integers aren't assigned a character in the UCS?

Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS).
Note: There's a difference between characters and abstract characters: the latter term more closely refers to our notion of character, while the first is a concept in the context of coded character sets. Some abstract characters are represented by more than one character. The Unicode article at Wikipedia cites an example:
For example, a Latin small letter "i" with an ogonek, a dot above, and
an acute accent [an abstract character], which is required in
Lithuanian, is represented by the character sequence U+012F, U+0307,
U+0301.
The UCS (Universal Coded Character Set) is a coded character set defined by the International Standard ISO/IEC 10646, which, for reference, may be downloaded through this official link.
The task at hand is to tell whether a given nonnegative integer is mapped to a character by the UCS, the Universal Coded Character Set.
Let us consider first the nonnegative integers that are not assigned a character, even though they are, in fact, reserved by the UCS. The UCS (§ 6.3.1, Classification, Table 1; page 19 of the linked document) lists three possibilities, based on the basic type that corresponds to them:
surrogate (the range D800–DFFF)
noncharacter (the range FDD0–FDEF plus any code point ending in the value FFFE or FFFF)
The Unicode standard defines noncharacters as follows:
Noncharacters are code points that are permanently reserved and will
never have characters assigned to them.
This page lists noncharacters more precisely.
reserved (I haven't found which nonnegative integers belong to this category)
On the other hand, code points whose basic type is any of:
graphic
format
control
private use
are assigned to characters. This is, however, open to discussion. For instance, should private use code points be considered to actually be assigned any characters? The very UCS (§ 6.3.5, Private use characters; page 20 of the linked document) defines them as:
Private use characters are not constrained in any way by this
International Standard. Private use characters can be used to provide
user-defined characters.
Additionally, I would like to know the range of nonnegative integers that the UCS maps or reserves. What is the maximum value? In some pages I have found that the whole range of nonnegative integers that the UCS maps is –presumably– 0–0x10FFFF. Is this true?
Ideally, this information would be publicly offered in a machine-readable format that one could build algorithms upon. Is it, by chance?
For clarity: What I need is a function that takes a nonnegative integer as argument and returns whether it is mapped to a character by the UCS. Additionally, I would prefer that it were based on official, machine-readable information. To answer this question, it would be enough to point to one such resource that I could build the function myself upon.
The Unicode Character Database (UCD) is available on the unicode.org site; it is certainly machine-readable. It contains a list of all of the assigned characters. (Of course, the set of assigned codepoints is larger with every new version of Unicode.) Full documentation on the various files which make up the UCD is also linked from the UCD page.
The range of potential codes is, as you suspect, 0-0x10FFFF. Of those, the non-characters and the surrogate blocks will never be assigned as codepoints to any character. Codes in the private use areas can be assigned to characters only by mutual agreement between applications; they will never be assigned to characters by Unicode itself. Any other code might be.

Subset of Unicode normally used in writing?

What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of all of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the removing characters from MES-1 part was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski

What's the difference between GC=Mark and GC=Punctuation in Unicode general categories?

I'm having trouble understanding some concepts. In the Unicode spec, there's a property called general category.
OK I understood what are each of letters (usual characters; GC=L), numbers (like digits 0–9 and other characters that have numeric values; GC=N) and separators (dividers; GC=Z). But it's really hard to distinguish between symbols (GC=S), punctuation (GC=P), and marks (GC=M).
I looked up a list of them, but I couldn't find conceptual difference. And the document doesn't help me a lot. What's the difference between all these?
Marks aren't standalone characters, but are applied to another character. Non-spacing marks are displayed over the target character, spacing marks are displayed attached to the target character and enclosing marks are displayed surrounding the target character. For example here's an a in a box (the character "a" combined with the enclosing square character):
a⃞
Regarding punctuations versus symbols: As the text you linked explains, some edge cases are classified rather arbitrarily, but in principle the difference is that punctuation is used "to organize and delimit textual units" (i.e. to mark the end of a sentence, separate different parts of a sentence, separate the elements of an enumeration etc.) and symbols "to represent concepts" (like units for example or mathematical notations).

Is there encoding in Unicode where every "character" is just one code point?

Trying to rephrase: Can you map every combining character combination into one code point?
I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Is this true for Basic Multilingual Plane also?
If you mean one char == one number (ie: where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but it's quite wasteful if you don't need any of the higher chars.
If you mean the compatibility sequences (ie: where e + ´ => é): there are single-character representations for most of the combinations in use in existing modern languages. If you're making up your own language, you could run into problems...but if you're sticking to the ones that people actually use, you'll be fine.
Can you map every combining character
combination into one code point?
Every combining character combination? How would your proposed encoding represent the string "à̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚ͅ͏͓͔͕͖͙͚͐͑͒͗͛ͣͤͥͦͧͨͩͪͫͬͭͮͯ͘͜͟͢͝͞͠͡"? (an 'a' with more than a hundred combining marks attached to it?) It's just not practical.
There are, however, a lot of "precomposed" characters in Unicode, like áçñü. Normalization form C will use these instead of the decomposed version whenever possible.
it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Depends on the meaning of the meaning of the word “character.” Unicode has the concepts of abstract character (definition 7 in chapter 3 of the standard: “A unit of information used for the organization, control, or representation of textual data”) and encoded character (definition 11: “An association (or mapping) between an abstract character and a code point”). So a character never is a code point, but for many code points, there exists an abstract character that maps to the code point, this mapping being called “encoded character.” But (definition 11, paragraph 4): “A single abstract character may also be represented by a sequence of code points”
Is this true for Basic Multilingual Plane also?
There is no conceptual difference related to abstract or encoded characters between the BMP and the other planes. The statement above holds for all subsets of the codespace.
Depending on your application, you have to distinguish between the terms glyph, grapheme cluster, grapheme, abstract character, encoded character, code point, scalar value, code unit and byte. All of these concepts are different, and there is no simple mapping between them. In particular, there is almost never a one-to-one mapping between these entities.