The Glossary of Unicode Terms defines coded character set as:
A character set in which each character is assigned a numeric code point.
But from a linguistic point of view, those three words can be parsed in two ways:
(coded character) set = set of coded characters
coded (character set) = character set in which each character is assigned a code
I feel that the first interpretation is invalid and cannot be treated as a synonym for #2. In my opinion, the first describes a set of characters that have already been converted to codes, which is closer to a codespace.
On the other hand, the very same glossary has an entry for Code Set:
Code Set. (See coded character set.)
This has left me thoroughly confused.
Could you please tell me whether both interpretations are legitimate, or whether only #2 is correct? I am asking not only out of curiosity, but because I have a strong feeling that the most widespread translation of this term into my native language is completely wrong.
To this day, Unicode tags are used exclusively to represent subdivision flags. But are they limited to subdivisions? Can I use them, for instance, to represent a language flag such as the Esperanto flag using the mapping:
🏴 + [e] + [o] + [cancel]
You can use any sequence of Unicode characters to represent anything. The trouble is that nobody will be able to understand your message unless you’re following an established protocol.
If you want to conform to the Unicode Standard, or more specifically to Unicode Technical Standard #51 (Unicode Emoji), then the tag sequence you are considering is invalid, has no meaning, and should ideally be displayed with a special “error” glyph to indicate that it is malformed. Annex C of UTS #51 has more information on that matter.
As of the time of writing, there is only one type of valid emoji tag sequence: those representing flags of regions with a region code ultimately derived from ISO 3166-2. The Esperanto language does not possess such a region code (because it is not a region) and thus cannot be represented with this mechanism. I recommend employing private-use characters instead.
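For contrast, a valid emoji tag sequence under UTS #51 is the flag of Scotland (subdivision code GB-SCT): the base U+1F3F4 WAVING BLACK FLAG, the tag characters for "gbsct", and U+E007F CANCEL TAG. Here is a minimal OCaml sketch (standard library only, OCaml ≥ 4.06 for Buffer.add_utf_8_uchar) that builds that sequence as a UTF-8 string:

```ocaml
(* Sketch: build the valid emoji tag sequence for the flag of Scotland
   (GB-SCT) as described in UTS #51. Tag characters live at
   U+E0020..U+E007E; CANCEL TAG is U+E007F. *)
let scotland_flag =
  let b = Buffer.create 32 in
  List.iter
    (fun cp -> Buffer.add_utf_8_uchar b (Uchar.of_int cp))
    [ 0x1F3F4;                    (* WAVING BLACK FLAG *)
      0xE0067; 0xE0062;           (* tag "g", "b" *)
      0xE0073; 0xE0063; 0xE0074;  (* tag "s", "c", "t" *)
      0xE007F ];                  (* CANCEL TAG *)
  Buffer.contents b

let () = print_endline scotland_flag
```

Whether the result actually renders as a Scottish flag depends on the font and platform; an unsupported but valid sequence typically falls back to the plain black flag.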
I want to write a function in OCaml that determines whether an int represents a Unicode code point that is alphabetic. This is the only character class that I care about, because I'm writing a parser for a programming language and that is the only character class I am going to use. Is there a set of simple ranges that can determine whether or not a character is alphabetic? I've tried looking up the implementation of java.lang.Character.isAlphabetic(int), but it ends up using a lookup table, since Java supports many different Unicode properties.
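There is no short, stable list of ranges: the Alphabetic property is derived from the Unicode Character Database and changes with every Unicode version. A common approach is to generate a sorted range table from the "Alphabetic" lines of DerivedCoreProperties.txt and binary-search it. The sketch below assumes such a generated table; the two ASCII ranges shown are placeholders, not the real data. (Alternatively, an existing OCaml library such as uucp already exposes this property, if a dependency is acceptable.)

```ocaml
(* Sketch: binary search over a sorted array of inclusive code point
   ranges. The two ranges below (ASCII A-Z and a-z) are placeholders for
   illustration only; a real table would be generated from the
   "Alphabetic" lines of DerivedCoreProperties.txt in the UCD. *)
let alphabetic_ranges = [| (0x41, 0x5A); (0x61, 0x7A) |]

let is_alphabetic (cp : int) : bool =
  let rec search lo hi =
    if lo > hi then false
    else
      let mid = (lo + hi) / 2 in
      let first, last = alphabetic_ranges.(mid) in
      if cp < first then search lo (mid - 1)
      else if cp > last then search (mid + 1) hi
      else true
  in
  search 0 (Array.length alphabetic_ranges - 1)
```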
Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS).
Note: There's a difference between characters and abstract characters: the latter term corresponds more closely to our intuitive notion of a character, while the former is a concept defined in the context of coded character sets. Some abstract characters are represented by more than one character. The Unicode article on Wikipedia cites an example:
For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent [an abstract character], which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301.
The UCS (Universal Coded Character Set) is a coded character set defined by the International Standard ISO/IEC 10646, which, for reference, may be downloaded through this official link.
The task at hand is to tell whether a given nonnegative integer is mapped to a character by the UCS, the Universal Coded Character Set.
Let us consider first the nonnegative integers that are not assigned a character, even though they are, in fact, reserved by the UCS. The UCS (§ 6.3.1, Classification, Table 1; page 19 of the linked document) lists three possibilities, based on the basic type that corresponds to them:
surrogate (the range D800–DFFF)
noncharacter (the range FDD0–FDEF, plus any code point whose last four hexadecimal digits are FFFE or FFFF)
The Unicode standard defines noncharacters as follows:
Noncharacters are code points that are permanently reserved and will never have characters assigned to them.
This page lists noncharacters more precisely.
reserved (I haven't found which nonnegative integers belong to this category)
On the other hand, code points whose basic type is any of:
graphic
format
control
private use
are assigned to characters. This is, however, open to discussion. For instance, should private-use code points be considered to actually be assigned any characters? The UCS itself (§ 6.3.5, Private use characters; page 20 of the linked document) defines them as follows:
Private use characters are not constrained in any way by this International Standard. Private use characters can be used to provide user-defined characters.
Additionally, I would like to know the range of nonnegative integers that the UCS maps or reserves. What is the maximum value? On some pages I have found that the whole range of nonnegative integers that the UCS maps is, presumably, 0–0x10FFFF. Is this true?
Ideally, this information would be publicly offered in a machine-readable format that one could build algorithms upon. Is it, by chance?
For clarity: what I need is a function that takes a nonnegative integer as its argument and returns whether it is mapped to a character by the UCS. Additionally, I would prefer that it be based on official, machine-readable information. To answer this question, it would be enough to point to one such resource that I could build the function on myself.
The Unicode Character Database (UCD) is available on the unicode.org site; it is certainly machine-readable. It contains a list of all of the assigned characters. (Of course, the set of assigned codepoints is larger with every new version of Unicode.) Full documentation on the various files which make up the UCD is also linked from the UCD page.
The range of potential codes is, as you suspect, 0–0x10FFFF. Of those, the noncharacters and the surrogate blocks will never be assigned to any character. Codes in the private-use areas can be assigned to characters only by mutual agreement between applications; they will never be assigned to characters by Unicode itself. Any other code might be.
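As a minimal sketch of the version-independent part of that check, based on the classification quoted above, the following OCaml functions rule out code points that can never be assigned: values outside the codespace, surrogates, and noncharacters. Whether one of the remaining code points is actually assigned in a given UCS/Unicode version still has to be looked up in the UCD (for example, code points listed with General_Category Cn in extracted/DerivedGeneralCategory.txt are not assigned to any character).

```ocaml
(* Sketch: structural checks that are independent of the Unicode/UCS
   version. A code point that passes them is eligible to be assigned a
   character; whether it actually is assigned in a given version must be
   looked up in the Unicode Character Database. *)
let in_codespace cp = 0 <= cp && cp <= 0x10FFFF
let is_surrogate cp = 0xD800 <= cp && cp <= 0xDFFF

(* Noncharacters: U+FDD0..U+FDEF plus the last two code points of every
   plane (those whose low 16 bits are FFFE or FFFF), 66 in total. *)
let is_noncharacter cp =
  (0xFDD0 <= cp && cp <= 0xFDEF) || (cp land 0xFFFF) >= 0xFFFE

let can_be_assigned cp =
  in_codespace cp && not (is_surrogate cp) && not (is_noncharacter cp)
```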
Unicode has a huge number of code points. How can I check whether a code point is a symbol (like "!" or "☭"), a number (like "4" or "৯"), a letter (like "a" or "え"), or a control character (which is usually not displayed directly)?
Is there any logic connecting the position of a character to what kind of character it is (as opposed to just which alphabet it is part of)? If not, are there any existing resources that classify which ranges are what?
That would be done through the General Category property of those codepoints. It's part of the canonical UnicodeData.txt dataset, and every serious Unicode-related library should have some way for you to get this property.
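As one illustration, here is a sketch in OCaml assuming the third-party uucp library (opam package "uucp"), whose Uucp.Gc.general_category returns the General Category as a polymorphic variant; the coarse grouping below follows the major classes L* (letter), N* (number), S* (symbol), and Cc (control):

```ocaml
(* Sketch, assuming uucp's Uucp.Gc.general_category : Uchar.t -> Uucp.Gc.t. *)
let classify (u : Uchar.t) : string =
  match Uucp.Gc.general_category u with
  | `Lu | `Ll | `Lt | `Lm | `Lo -> "letter"
  | `Nd | `Nl | `No -> "number"
  | `Sm | `Sc | `Sk | `So -> "symbol"
  | `Cc -> "control"
  | _ -> "other"

let () =
  List.iter
    (fun cp -> Printf.printf "U+%04X: %s\n" cp (classify (Uchar.of_int cp)))
    [ 0x0021 (* ! *); 0x262D (* ☭ *); 0x0034 (* 4 *); 0x09EF (* ৯ *);
      0x0061 (* a *); 0x3048 (* え *); 0x0007 (* BEL *) ]
```

Any other library (or a hand-rolled parser of UnicodeData.txt) would work the same way; the General Category values themselves come from the UCD.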
Trying to rephrase: Can you map every combining character combination into one code point?
I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Is this true for Basic Multilingual Plane also?
If you mean one char == one number (i.e., where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but it's quite wasteful if you don't need any of the higher chars.
If you mean the compatibility sequences (i.e., where e + ´ => é): there are single-character representations for most of the combinations in use in existing modern languages. If you're making up your own language, you could run into problems... but if you're sticking to the ones that people actually use, you'll be fine.
Can you map every combining character combination into one code point?
Every combining character combination? How would your proposed encoding represent the string "à̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚ͅ͏͓͔͕͖͙͚͐͑͒͗͛ͣͤͥͦͧͨͩͪͫͬͭͮͯ͘͜͟͢͝͞͠͡"? (an 'a' with more than a hundred combining marks attached to it?) It's just not practical.
There are, however, a lot of "precomposed" characters in Unicode, like áçñü. Normalization form C will use these instead of the decomposed version whenever possible.
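To make the difference concrete, here is a small OCaml sketch (standard library only, OCaml ≥ 4.14 for String.get_utf_8_uchar) that counts the scalar values in the precomposed and the decomposed spelling of "é". Turning the decomposed form into the precomposed one programmatically is what Normalization Form C does and would require a normalization library, which is not shown here.

```ocaml
(* Sketch: count the Unicode scalar values in a UTF-8 string. *)
let count_scalars s =
  let rec go i n =
    if i >= String.length s then n
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

let () =
  (* Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE. *)
  Printf.printf "precomposed: %d code point(s)\n" (count_scalars "\u{00E9}");
  (* Decomposed: U+0065 followed by U+0301 COMBINING ACUTE ACCENT. *)
  Printf.printf "decomposed:  %d code point(s)\n" (count_scalars "e\u{0301}")
```

Both strings denote the same abstract character, yet one is a single code point and the other is two; NFC maps the second form to the first.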
it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Depends on the meaning of the word “character.” Unicode has the concepts of abstract character (definition 7 in chapter 3 of the standard: “A unit of information used for the organization, control, or representation of textual data”) and encoded character (definition 11: “An association (or mapping) between an abstract character and a code point”). So a character is never a code point, but for many code points there exists an abstract character that maps to the code point, this mapping being called an “encoded character.” But (definition 11, paragraph 4): “A single abstract character may also be represented by a sequence of code points.”
Is this true for Basic Multilingual Plane also?
There is no conceptual difference related to abstract or encoded characters between the BMP and the other planes. The statement above holds for all subsets of the codespace.
Depending on your application, you have to distinguish between the terms glyph, grapheme cluster, grapheme, abstract character, encoded character, code point, scalar value, code unit and byte. All of these concepts are different, and there is no simple mapping between them. In particular, there is almost never a one-to-one mapping between these entities.
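To make the last point tangible, here is a short OCaml sketch (standard library only, OCaml ≥ 4.14) that measures a single supplementary-plane character at three of those layers; counting grapheme clusters or rendering glyphs would additionally require a segmentation library or a font engine and is not covered here.

```ocaml
(* Sketch: one user-perceived character, U+1F600 GRINNING FACE, measured
   at three layers. Because it lies outside the BMP, the counts differ. *)
let s = "\u{1F600}"

(* Decode the UTF-8 string into its Unicode scalar values. *)
let scalars s =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  go 0 []

let () =
  let cps = scalars s in
  let utf16_units =
    List.fold_left
      (fun n u -> n + (if Uchar.to_int u <= 0xFFFF then 1 else 2))
      0 cps
  in
  Printf.printf "UTF-8 code units (bytes): %d\n" (String.length s); (* 4 *)
  Printf.printf "UTF-16 code units:        %d\n" utf16_units;       (* 2 *)
  Printf.printf "code points:              %d\n" (List.length cps)  (* 1 *)
```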