Is there an encoding in Unicode where every "character" is just one code point?

Trying to rephrase: Can you map every combining character combination into one code point?
I'm new to Unicode, but it seems to me that there is no encoding, normalization, or representation in Unicode where one character is always exactly one code point. Is this correct?
Is this true for the Basic Multilingual Plane as well?

If you mean one char == one number (i.e. where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each code point is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but it's quite wasteful if you don't need any of the higher code points.
If you mean combining sequences (i.e. where e + ´ => é): there are single-code-point precomposed representations for most of the combinations in use in existing modern languages. If you're making up your own language, you could run into problems... but if you're sticking to the ones that people actually use, you'll be fine.
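For illustration, here is a minimal Java sketch (my own example, not from the original answers) showing the fixed-width nature of UTF-32, the encoding form corresponding to UCS-4; the UTF-32 charsets are looked up by name, which should work on standard JDKs:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class FixedWidthDemo {
        public static void main(String[] args) {
            // Four code points: U+0061, U+00E9, U+20AC, U+1F600
            String s = "a\u00E9\u20AC\uD83D\uDE00";

            // UTF-32BE: every code point takes exactly 4 bytes -> 16 bytes total
            byte[] utf32 = s.getBytes(Charset.forName("UTF-32BE"));
            System.out.println("UTF-32 bytes: " + utf32.length); // 16

            // UTF-8: 1 + 2 + 3 + 4 bytes -> 10 bytes total
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println("UTF-8 bytes:  " + utf8.length); // 10
        }
    }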

Can you map every combining character combination into one code point?
Every combining character combination? How would your proposed encoding represent the string "à̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚ͅ͏͓͔͕͖͙͚͐͑͒͗͛ͣͤͥͦͧͨͩͪͫͬͭͮͯ͘͜͟͢͝͞͠͡"? (an 'a' with more than a hundred combining marks attached to it?) It's just not practical.
There are, however, a lot of "precomposed" characters in Unicode, like áçñü. Normalization form C will use these instead of the decomposed version whenever possible.
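As a quick, minimal Java sketch of what Normalization Form C does (the class name is mine):

    import java.text.Normalizer;

    public class NfcDemo {
        public static void main(String[] args) {
            // 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points
            String decomposed = "e\u0301";

            // NFC replaces the pair with precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE
            String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

            System.out.println(decomposed.codePointCount(0, decomposed.length())); // 2
            System.out.println(composed.codePointCount(0, composed.length()));     // 1
        }
    }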

it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Depends on the meaning of the word “character.” Unicode has the concepts of abstract character (definition 7 in chapter 3 of the standard: “A unit of information used for the organization, control, or representation of textual data”) and encoded character (definition 11: “An association (or mapping) between an abstract character and a code point”). So a character is never a code point, but for many code points there exists an abstract character that maps to the code point, this mapping being called an “encoded character.” But (definition 11, paragraph 4): “A single abstract character may also be represented by a sequence of code points.”
Is this true for Basic Multilingual Plane also?
There is no conceptual difference related to abstract or encoded characters between the BMP and the other planes. The statement above holds for all subsets of the codespace.
Depending on your application, you have to distinguish between the terms glyph, grapheme cluster, grapheme, abstract character, encoded character, code point, scalar value, code unit and byte. All of these concepts are different, and there is no simple mapping between them. In particular, there is almost never a one-to-one mapping between these entities.
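To make the last few distinctions concrete, here is a small Java sketch (the sample string is my own) counting code units, code points, and user-perceived characters for the same text; note that BreakIterator only approximates grapheme cluster boundaries, and results can vary with the JDK's Unicode data version:

    import java.text.BreakIterator;

    public class CountingDemo {
        public static void main(String[] args) {
            // "é" as e + combining acute accent, followed by an emoji outside the BMP
            String s = "e\u0301\uD83D\uDE00";

            // UTF-16 code units (what String.length() counts)
            System.out.println("code units:        " + s.length());                      // 4

            // Code points (Unicode scalar values)
            System.out.println("code points:       " + s.codePointCount(0, s.length())); // 3

            // Grapheme clusters ("user-perceived characters")
            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int graphemes = 0;
            while (it.next() != BreakIterator.DONE) {
                graphemes++;
            }
            System.out.println("grapheme clusters: " + graphemes);                        // 2
        }
    }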

How does UTF-16 achieve self-synchronization?

I know that UTF-16 is a self-synchronizing encoding scheme. I also read the Wikipedia article below, but did not quite get it.
Self-synchronizing code
Can you please explain it to me with an example from UTF-16?
In UTF-16, characters outside of the BMP are represented using a surrogate pair, in which the first code unit (CU) lies in 0xD800—0xDBFF and the second one in 0xDC00—0xDFFF. Each of the CUs carries 10 bits of the code point (offset by 0x10000). Characters in the BMP are encoded as themselves.
Now the synchronization is easy. Given the position of any arbitrary code unit:
If the code unit is in the 0xD800—0xDBFF range, it's the first code unit of two; just read the next one and decode. Voilà, we have a full character outside the BMP.
If the code unit is in the 0xDC00—0xDFFF range, it's the second code unit of two; just go back one unit to read the first part, or advance to the next unit to skip the current character.
If it's in neither of those ranges, then it's a character in the BMP. We don't need to do anything more.
In UTF-16 the CU is the unit, i.e. the smallest element. We work at the CU level and read CUs one by one instead of byte by byte. Because of that, along with historical reasons, UTF-16 is only self-synchronizing at the CU level.
The point of self-synchronization is to know immediately whether we're in the middle of something, instead of having to read again from the start and check. UTF-16 allows us to do that.
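A minimal Java sketch of that resynchronization logic (the method name is mine), working directly on an array of UTF-16 code units:

    public class Utf16Sync {
        // Given an arbitrary index into a well-formed UTF-16 sequence, return
        // the index of the first code unit of the enclosing character.
        static int characterStart(char[] units, int index) {
            // A low surrogate (0xDC00-0xDFFF) is always the second unit of a pair,
            // so step back one unit to the high surrogate.
            if (Character.isLowSurrogate(units[index])) {
                return index - 1;
            }
            // A high surrogate (0xD800-0xDBFF) or an ordinary BMP code unit
            // already starts a character.
            return index;
        }

        public static void main(String[] args) {
            char[] text = "A\uD83D\uDE00B".toCharArray(); // 'A', U+1F600 (two units), 'B'

            System.out.println(characterStart(text, 2)); // 1: index 2 is a low surrogate
            System.out.println(characterStart(text, 1)); // 1: a high surrogate starts the pair
            System.out.println(characterStart(text, 3)); // 3: a plain BMP code unit
        }
    }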
Since the ranges for the high surrogates, low surrogates, and valid BMP characters are disjoint, it is not possible for a surrogate to match a BMP character, or for (parts of) two adjacent characters to look like a legal surrogate pair. This simplifies searches a great deal. It also means that UTF-16 is self-synchronizing on 16-bit words: whether a code unit starts a character can be determined without examining earlier code units. UTF-8 shares these advantages, but many earlier multi-byte encoding schemes (such as Shift JIS and other Asian multi-byte encodings) did not allow unambiguous searching and could only be synchronized by re-parsing from the start of the string (UTF-16 is not self-synchronizing if one byte is lost or if traversal starts at a random byte).
https://en.wikipedia.org/wiki/UTF-16#Description
Of course that means UTF-16 may not be suitable for working over a medium without error correction/detection, like a bare network connection. However, in a proper local environment it's a lot better than working without self-synchronization. For example, in DOS/V for Japanese, every time you press Backspace you must iterate from the start to know which character was deleted, because in the awful Shift-JIS encoding there's no way to know how long the character before the cursor is without a length map.

Are there any real alternatives to Unicode?

As a C++ developer, I find that supporting Unicode is, putting it mildly, a pain in the butt. Unicode has a few unfortunate properties that make it very hard to determine the case of a letter, convert case, or do pretty much anything beyond identifying a single known code point or so (which may or may not be a letter). The only real rescue, it seems, is ICU for those who are unfortunate enough not to have Unicode support built into the language (i.e. C and C++). Support for Unicode in other languages may or may not be good enough.
So, I thought, there must be a real alternative to Unicode! That is, an encoding that allows easy identification of character classes without needing a lookup data structure (tree, table, whatever), and easy identification of the relationships between characters. I suspect that any such encoding would likely be multi-byte for most text -- that's not a real concern to me, but I accept that it is for others. Providing such an encoding is a lot of work, so I'm not really expecting any such encoding to exist 😞.
Short answer: not that I know of.
As a non-C++ developer, I don't know what specifically is a pain about Unicode, but since you didn't tag the question with C++, I still dare to attempt an answer.
While I'm personally very happy about Unicode in general, I agree that some aspects are cumbersome.
Some of them could arguably be improved if Unicode were redesigned from scratch, e.g. by removing some redundancies like the "Latin Greek" math letters besides the actual Greek ones (but that would also break compatibility with older encodings).
But most of the "pains" just reflect the chaotic usage of writing in the first place.
You mention yourself the problem of uppercase "i", which is "I" in some orthographies and "İ" in others, but there are tons of other difficulties – e.g. German "ß", which is lowercase but has no uppercase equivalent (well, it has one now, but it is rarely used); or letters that look different in final position (Greek "σ"/"ς"); or quotes with inverted meaning («French style» vs. »Swiss style«, “English” vs. „German style“)... I could continue for a while.
I don't see how an encoding could help with that, other than providing tables of character properties, equivalences, and relations, which is what Unicode does.
You say in comments that, by looking at the bytes of an encoded character, you want it to tell you if it's upper or lower case.
To me, this sounds like saying: "When I look at a number, I want it to tell me if it's prime."
I mean, not even ASCII codes tell you whether they are upper or lower case; you have just memorised the properties table, which tells you that 41..5A is upper case and 61..7A is lower case.
But it's hard to memorise or hardcode these ranges for all 120k Unicode codepoints. So the easiest thing is to use a table look-up.
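In practice that look-up is usually hidden behind a library call; for instance, Java's Character class (which is backed by the Unicode Character Database) exposes it, as in this small sketch of mine:

    public class CategoryDemo {
        public static void main(String[] args) {
            // U+00DF LATIN SMALL LETTER SHARP S, U+10400 DESERET CAPITAL LETTER LONG I
            int[] codePoints = { 'A', 'a', '5', 0x00DF, 0x10400 };

            for (int cp : codePoints) {
                System.out.printf("U+%04X  upper=%b  lower=%b  category=%d%n",
                        cp,
                        Character.isUpperCase(cp),  // table-driven, not derivable from the bits
                        Character.isLowerCase(cp),
                        Character.getType(cp));     // general category, e.g. Character.LOWERCASE_LETTER
            }
        }
    }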
There's also a bit of confusion about what "encoding" means.
Unicode doesn't define any byte representation; it only assigns codepoints, i.e. integers, to character definitions, and it maintains the tables mentioned above.
Encodings in the strict sense ("codecs") are the transformation formats (UTF-8 etc.), which define a mapping between codepoints and their byte representation.
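A tiny Java sketch of that distinction (my own example): the codepoint stays the same, while each transformation format yields different bytes:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CodecDemo {
        public static void main(String[] args) {
            String euro = "\u20AC"; // one codepoint: U+20AC EURO SIGN

            System.out.println(euro.getBytes(StandardCharsets.UTF_8).length);      // 3 bytes
            System.out.println(euro.getBytes(StandardCharsets.UTF_16BE).length);   // 2 bytes
            System.out.println(euro.getBytes(Charset.forName("UTF-32BE")).length); // 4 bytes
        }
    }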
Now it would be possible to define a new UTF which maps codepoints to bytes in a way that provides a pattern for upper/lower case.
But what could that be?
Odd for upper, even for lower case?
But what about letters without upper-/lower-case distinction?
And then, characters that aren't letters?
And what about all the other character categories – punctuation, digits, whitespace, symbols, combining diacritics – why not represent those as well?
You could put each in a predefined range, but what happens if too many new characters are added to one of the categories?
To sum it up: I don't think what you ask for is possible.

Which nonnegative integers aren't assigned a character in the UCS?

Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS).
Note: There's a difference between characters and abstract characters: the latter term more closely refers to our notion of character, while the former is a concept in the context of coded character sets. Some abstract characters are represented by more than one character. The Unicode article at Wikipedia cites an example:
For example, a Latin small letter "i" with an ogonek, a dot above, and an acute accent [an abstract character], which is required in Lithuanian, is represented by the character sequence U+012F, U+0307, U+0301.
The UCS (Universal Coded Character Set) is a coded character set defined by the International Standard ISO/IEC 10646, which, for reference, may be downloaded through this official link.
The task at hand is to tell whether a given nonnegative integer is mapped to a character by the UCS, the Universal Coded Character Set.
Let us consider first the nonnegative integers that are not assigned a character, even though they are, in fact, reserved by the UCS. The UCS (§ 6.3.1, Classification, Table 1; page 19 of the linked document) lists three possibilities, based on the basic type that corresponds to them:
surrogate (the range D800–DFFF)
noncharacter (the range FDD0–FDEF plus any code point ending in the value FFFE or FFFF)
The Unicode standard defines noncharacters as follows:
Noncharacters are code points that are permanently reserved and will never have characters assigned to them.
This page lists noncharacters more precisely.
reserved (I haven't found which nonnegative integers belong to this category)
On the other hand, code points whose basic type is any of:
graphic
format
control
private use
are assigned to characters. This is, however, open to discussion. For instance, should private use code points be considered to actually be assigned any characters? The very UCS (§ 6.3.5, Private use characters; page 20 of the linked document) defines them as:
Private use characters are not constrained in any way by this International Standard. Private use characters can be used to provide user-defined characters.
Additionally, I would like to know the range of nonnegative integers that the UCS maps or reserves. What is the maximum value? On some pages I have found that the whole range of nonnegative integers that the UCS maps is – presumably – 0–0x10FFFF. Is this true?
Ideally, this information would be publicly offered in a machine-readable format that one could build algorithms upon. Is it, by chance?
For clarity: What I need is a function that takes a nonnegative integer as argument and returns whether it is mapped to a character by the UCS. Additionally, I would prefer that it were based on official, machine-readable information. To answer this question, it would be enough to point to one such resource that I could build the function myself upon.
The Unicode Character Database (UCD) is available on the unicode.org site; it is certainly machine-readable. It contains a list of all of the assigned characters. (Of course, the set of assigned codepoints is larger with every new version of Unicode.) Full documentation on the various files which make up the UCD is also linked from the UCD page.
The range of potential codes is, as you suspect, 0-0x10FFFF. Of those, the non-characters and the surrogate blocks will never be assigned as codepoints to any character. Codes in the private use areas can be assigned to characters only by mutual agreement between applications; they will never be assigned to characters by Unicode itself. Any other code might be.
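If an approximation tied to your runtime's Unicode version is acceptable, a hedged sketch of the requested function in Java could lean on the Character class, which embeds a UCD snapshot (for authoritative, version-pinned results, parse UnicodeData.txt or DerivedGeneralCategory.txt from the UCD yourself):

    public class AssignedCheck {
        // Rough check: is this nonnegative integer a code point to which the
        // UCS/Unicode (as known to this JVM's UCD snapshot) assigns a character?
        static boolean isAssigned(int value) {
            if (value < 0 || value > 0x10FFFF) {
                return false;                    // outside the codespace
            }
            if (value >= 0xD800 && value <= 0xDFFF) {
                return false;                    // surrogate code points
            }
            // UNASSIGNED (general category Cn) covers both reserved code points
            // and noncharacters; private-use code points count as assigned here.
            return Character.getType(value) != Character.UNASSIGNED;
        }

        public static void main(String[] args) {
            System.out.println(isAssigned(97));       // true  (LATIN SMALL LETTER A)
            System.out.println(isAssigned(0xFFFF));   // false (noncharacter)
            System.out.println(isAssigned(0x110000)); // false (beyond the codespace)
        }
    }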

Detecting non-character Unicode characters

I'm working on an application that eventually reads and prints arbitrary and untrusted Unicode characters to the screen.
There are a number of ways to wreak havoc using Unicode strings, and I would like my program to behave correctly for "dangerous" strings. For instance, the RTL override character will make strings look like they're backwards.
Since the audience is mostly programmers, my solution would be to, first, get the type C canonical form of the string, and then replace anything that's not a printable character on its own with the Unicode code point in the form \uXXXXXX. (The intent is not to have a perfectly accurate representation of the string, it is to have a mostly good representation. The full string data is still available.)
My problem, then, is determining what's an actual printable character and what's a non-printable character. Swift has a Character class, but contrary to, say, Java's Character class, the Swift one doesn't seem to have any method to find out the classification of a character.
How could I carry out that plan? Is there anything else I should consider?
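The question asks about Swift, but to illustrate the usual approach of filtering by Unicode general category after normalizing, here is a hedged Java sketch (all names are mine); recent Swift versions expose comparable classification data through the Unicode scalar properties:

    import java.text.Normalizer;

    public class SafePrint {
        // Escape anything that is not a visible character on its own,
        // using the \uXXXXXX form mentioned in the question.
        static String sanitize(String input) {
            String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);
            StringBuilder out = new StringBuilder();
            nfc.codePoints().forEach(cp -> {
                int type = Character.getType(cp);
                boolean dangerous =
                        type == Character.CONTROL              // Cc
                     || type == Character.FORMAT               // Cf, e.g. U+202E RTL override
                     || type == Character.UNASSIGNED           // Cn
                     || type == Character.PRIVATE_USE          // Co
                     || type == Character.SURROGATE            // Cs (unpaired)
                     || type == Character.LINE_SEPARATOR       // Zl
                     || type == Character.PARAGRAPH_SEPARATOR; // Zp
                if (dangerous) {
                    out.append(String.format("\\u%06X", cp));
                } else {
                    out.appendCodePoint(cp);
                }
            });
            return out.toString();
        }

        public static void main(String[] args) {
            // The RTL override U+202E is escaped instead of reordering the output
            System.out.println(sanitize("abc\u202Edef"));
        }
    }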

Unicode alphanumeric character range

I'm looking at the IsCharAlphaNumeric Windows API function. As it only takes a single TCHAR, it obviously can't make any decisions about surrogate pairs in UTF-16 content. Does that mean that there are no alphanumeric characters that are surrogate pairs?
Characters outside the BMP can be letters. (Michael Kaplan recently discussed a bug in the classification of the character U+1F48C.) But IsCharAlphaNumeric cannot see characters outside the BMP (for the reasons you noted), so you cannot obtain classification information for them that way.
If you have a surrogate pair, call GetStringType with cchSrc = 2 and check for C1_ALPHA and C1_DIGIT.
Edit: The second half of this answer is incorrect; GetStringType does not support surrogate pairs.
You can determine for yourself, by looking at the Unicode plane assignments, what you are missing by not being able to inspect non-BMP code points.
For example, you won't be able to identify imperial Aramaic characters as alphanumeric. Shame.
Does that mean that there are no alphanumeric characters that are surrogate pairs?
No; there are supplementary code points that are in the letter category.
Comparing a char to a code-point?
For example, Character.isLetter('\uD840') returns false, even though this specific value, if followed by any low-surrogate value in a string, would represent a letter.
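On the Java side, the usual fix is to classify whole int code points rather than individual chars; a minimal sketch (the sample code point is my own choice):

    public class SupplementaryLetters {
        public static void main(String[] args) {
            // U+20021, a CJK ideograph outside the BMP, is stored as the
            // surrogate pair \uD840\uDC21 in a Java String.
            String s = "\uD840\uDC21";

            // Classifying a single code unit misfires: a lone high surrogate is not a letter.
            System.out.println(Character.isLetter(s.charAt(0))); // false

            // Classifying the whole code point works as expected.
            int cp = s.codePointAt(0);                           // 0x20021
            System.out.println(Character.isLetter(cp));          // true
        }
    }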