Proper handling of UTF-8 string concatenation

I just learned that it's OK for a Unicode string to contain isolated combining characters.
This triggers another question, relative to concatenation of strings starting with such characters.
I'm developing a UTF8String object, to make UTF-8 string handling easier.
This object has a concat() method that concatenates another string to the current one.
If the second string starts with a combining character, should I insert a non-breaking space between the two strings, to prevent the previously isolated first character of the second string from combining with the last character of the first string?
Or would it be expected to have the combination occur?

I'm developing a UTF8String object, to make UTF-8 string handling easier. [...] should I add a non-breaking space between the two strings?
I would say definitely not. Handling byte encodings like UTF-8 is a separate, lower-level concern than handling grapheme boundaries. Mixing the two issues together would be an unexpected, unwelcome layering violation.
If you want to build a string class that treats grapheme clusters as its indivisible units that's fine, but that's a different animal (and quite a lot of work to do consistently).
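To make the point concrete, here is a minimal Python illustration (plain strings, not the asker's UTF8String class) of what ordinary concatenation already does when the second string starts with a combining mark:

```python
# Concatenating a string that begins with a combining character:
base = "cafe"             # ends with a plain LATIN SMALL LETTER E
suffix = "\u0301s"        # starts with COMBINING ACUTE ACCENT, then "s"

combined = base + suffix  # plain concatenation, no separator inserted
print(combined)           # renders as "cafés": the accent attaches to the "e"

# The code points themselves are unchanged; only the rendering combines them.
print([hex(ord(c)) for c in combined])
# ['0x63', '0x61', '0x66', '0x65', '0x301', '0x73']
```

Whether that visual combination is "expected" is a question about graphemes, not about UTF-8, which is exactly why it does not belong in a byte-encoding wrapper.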

Related

Can a combining character be used alone in Unicode?

Let's take COMBINING ACUTE ACCENT, for example. Its browser test page does include it alone in the page, but it reacts in a strange way: I can't select it with my mouse, and if I try to interact with it in the DOM inspector, it feels like it's not part of the text at all (there's nothing before or after this character).
Is a combining character, used alone, still a valid Unicode string?
Or does it have to follow another character?
Yes, a combining character alone is a valid Unicode string (even though its behaviour may be weird without a base character). Section 2.11 of the Unicode Standard emphasises this:
In the Unicode Standard, all sequences of character codes are permitted.
The presentation of such strings is described in D52:
There may be no such base character, such as when a combining character is at the start of text or follows a control or format character [...] In such cases, the combining characters are called isolated combining characters.
With isolated combining characters or when a process is unable to perform graphical combination, a process may present a combining character without graphical combination; that is, it may present it as if it were a base character.
However, if you want to display a combining character by itself, it is recommended that you attach it to a no-break space base character:
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be employed, for example, when talking about the combining mark itself as a mark, rather than using it in its normal way in text (that is, applied as an accent to a base letter or in other combinations).
Also, a dotted circle (U+25CC, ◌) can be used as a base character.
Source: https://en.wikipedia.org/wiki/Dotted_circle
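A small Python demonstration of both conventions (purely illustrative; the rendering depends on your terminal's font support):

```python
# COMBINING ACUTE ACCENT shown "by itself" by attaching it to a base character.
acute = "\u0301"

print("\u00A0" + acute)   # U+00A0 NO-BREAK SPACE as the base
print("\u25CC" + acute)   # U+25CC DOTTED CIRCLE as the base: ◌́

# The combining mark on its own is still a valid one-code-point string:
print(len(acute))         # 1
```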

Detecting non-character Unicode characters

I'm working on an application that eventually reads and prints arbitrary and untrusted Unicode characters to the screen.
There are a number of ways to wreak havoc using Unicode strings, and I would like my program to behave correctly for "dangerous" strings. For instance, the RTL override character will make strings look like they're backwards.
Since the audience is mostly programmers, my solution would be to, first, get the canonical composition (Normalization Form C) of the string, and then replace anything that's not a printable character on its own with the Unicode code point in the form \uXXXXXX. (The intent is not to have a perfectly accurate representation of the string, it is to have a mostly good representation. The full string data is still available.)
My problem, then, is determining what's an actual printable character and what's a non-printable character. Swift has a Character class, but contrary to, say, Java's Character class, the Swift one doesn't seem to have any method to find out the classification of a character.
How could I carry out that plan? Is there anything else I should consider?
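A rough sketch of that plan, in Python rather than Swift (the approach via Unicode general categories is the same; Swift 5 exposes similar data through Unicode.Scalar.Properties). Which categories to escape is an assumption, not a definitive list:

```python
import unicodedata

UNSAFE_CATEGORIES = {
    "Cc",  # control
    "Cf",  # format (includes U+202E RIGHT-TO-LEFT OVERRIDE)
    "Cs",  # surrogate
    "Co",  # private use
    "Cn",  # unassigned
}

def sanitize(text: str) -> str:
    # Step 1: canonical composition (Normalization Form C).
    text = unicodedata.normalize("NFC", text)
    # Step 2: escape anything whose general category looks non-printable.
    out = []
    for ch in text:
        if unicodedata.category(ch) in UNSAFE_CATEGORIES:
            out.append("\\u{:06X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)

print(sanitize("abc\u202Edef"))   # abc\u00202Edef
```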

Fully correct Unicode visual string reversal

[Inspired largely by trying to explain the problems with Character Encoding independent character swap, but also these other questions neither of which contain a complete answer: How to reverse a Unicode string, How to get a reversed String (unicode safe)]
Doing a visual string reversal in Unicode is much harder than it looks. In any storage format other than UTF-32 you have to pay attention to code point boundaries rather than going byte-by-byte. But that's not good enough, because of combining characters; the spec has a concept of "grapheme cluster" that's closer to the basic unit you want to be reversing. But that's still not good enough; there are all sorts of special-case characters, like bidi overrides and final forms, that will have to be fixed up.
This pseudo-algorithm handles all the easy cases I know about:
1. Segment the string into an alternating list of words and word-separators (some word-separators may be the empty string).
2. Reverse the order of this list.
3. For each string in the list:
   a. Segment the string into grapheme clusters.
   b. Reverse the order of the grapheme clusters.
   c. Check the initial and final cluster in the reversed sequence; their base characters may need to be reassigned to the correct form (e.g. if U+05DB HEBREW LETTER KAF is now at the end of the sequence it needs to become U+05DA HEBREW LETTER FINAL KAF, and vice versa).
   d. Join the sequence back into a string.
4. Recombine the list of reversed words to produce the final reversed string.
... But it doesn't handle bidi overrides and I'm sure there's stuff I don't know about, as well. Can anyone fill in the gaps?
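For what it's worth, here is a rough Python sketch of the pseudo-algorithm above. It assumes the third-party regex module for grapheme-cluster segmentation (its \X pattern), approximates word segmentation with whitespace splitting, and only remaps the KAF/FINAL KAF pair from the example; bidi overrides remain unhandled:

```python
import regex  # third-party module; its \X pattern matches grapheme clusters

# Only the pair from the example above; a real implementation would need the
# full table of positional forms.
FINAL_FORMS = {"\u05DB": "\u05DA"}                      # KAF -> FINAL KAF
INITIAL_FORMS = {v: k for k, v in FINAL_FORMS.items()}  # FINAL KAF -> KAF

def fix_forms(clusters):
    # After reversal, the cluster now at the end of a word may need its final
    # form, and the one now at the start its normal form.
    if len(clusters) > 1:
        clusters[0] = INITIAL_FORMS.get(clusters[0], clusters[0])
        clusters[-1] = FINAL_FORMS.get(clusters[-1], clusters[-1])
    return clusters

def visual_reverse(text: str) -> str:
    # 1-2. Alternating list of words and separators (whitespace only, as an
    #      approximation of proper word segmentation), then reversed.
    pieces = regex.split(r"(\s+)", text)[::-1]
    out = []
    for piece in pieces:
        # 3a-3b. Split into grapheme clusters and reverse them.
        clusters = regex.findall(r"\X", piece)[::-1]
        # 3c-3d. Fix up initial/final forms and rejoin.
        out.append("".join(fix_forms(clusters)))
    # 4. Recombine into the final reversed string.
    return "".join(out)

print(visual_reverse("noe\u0301l"))  # "le\u0301on": the e + accent stays one unit
```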

What is a well-formed UTF-16 string?

I need some help understanding the concept of a well-formed UTF-16 string as mentioned in these two paragraphs of Chapter 2: General Structure, 2.7 Unicode Strings:
"Depending on the programming environment, a Unicode string may or may not be required to be in the corresponding Unicode encoding form. For example, strings in Java, C#, or ECMAScript are Unicode 16-bit strings, but are not necessarily well-formed UTF-16 sequences. In normal processing, it can be far more efficient to allow such strings to contain code unit sequences that are not well-formed UTF-16—that is, isolated surrogates. Because strings are such a fundamental component of every program, checking for isolated surrogates in every operation that modifies strings can create significant overhead, especially because supplementary characters are extremely rare as a percentage of overall text in programs worldwide.
Whenever such strings are specified to be in a particular Unicode encoding form—even one with the same code unit size—the string must not violate the requirements of that encoding form. For example, isolated surrogates in a Unicode 16-bit string are not allowed when that string is specified to be well formed UTF-16.
The paragraph explains it for UTF-16: "not well-formed" means the string contains isolated surrogate code units.
That is, there are certain code units which are only valid when they appear in pairs. A code unit in the range [0xD800-0xDFFF] must occur only in pairs where the first must be in the range [0xD800-0xDBFF] and the second must be in the range [0xDC00-0xDFFF]. If a string does not obey this requirement then it is not well-formed.
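A small sketch of that rule as code, operating on a plain list of UTF-16 code unit values:

```python
def is_well_formed_utf16(code_units):
    """Return True if the code unit sequence contains no isolated surrogates."""
    i = 0
    while i < len(code_units):
        u = code_units[i]
        if 0xD800 <= u <= 0xDBFF:      # high (leading) surrogate
            if i + 1 >= len(code_units) or not (0xDC00 <= code_units[i + 1] <= 0xDFFF):
                return False           # not followed by a low surrogate
            i += 2                     # valid surrogate pair, skip both units
        elif 0xDC00 <= u <= 0xDFFF:    # low (trailing) surrogate on its own
            return False
        else:
            i += 1
    return True

print(is_well_formed_utf16([0x0041, 0xD83D, 0xDE00]))  # True: "A" plus U+1F600 as a pair
print(is_well_formed_utf16([0x0041, 0xD83D]))          # False: isolated high surrogate
```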

Is there an encoding in Unicode where every "character" is just one code point?

Trying to rephrase: Can you map every combining character combination into one code point?
I'm new to Unicode, but it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Is this true for Basic Multilingual Plane also?
If you mean one char == one number (i.e., where every char is represented by the same number of bytes/words/what-have-you): in UCS-4, each character is represented by a 4-byte number. That's way more than big enough for every character to be represented by a single value, but it's quite wasteful if you don't need any of the higher chars.
If you mean combining sequences (i.e., where e + ´ => é): there are single-character representations for most of the combinations in use in existing modern languages. If you're making up your own language, you could run into problems...but if you're sticking to the ones that people actually use, you'll be fine.
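A quick Python illustration of the fixed-width point (UCS-4/UTF-32 versus variable-width UTF-8):

```python
# In UTF-32, every code point occupies exactly four bytes, so byte length / 4
# equals the number of code points (still not necessarily the number of
# user-perceived characters).
s = "aé€😀"                      # code points U+0061, U+00E9, U+20AC, U+1F600
print(len(s.encode("utf-32-le")) // 4)  # 4 code points
print(len(s.encode("utf-8")))           # 10 bytes in UTF-8 (1 + 2 + 3 + 4)
```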
Can you map every combining character combination into one code point?
Every combining character combination? How would your proposed encoding represent the string "à̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚ͅ͏͓͔͕͖͙͚͐͑͒͗͛ͣͤͥͦͧͨͩͪͫͬͭͮͯ͘͜͟͢͝͞͠͡"? (an 'a' with more than a hundred combining marks attached to it?) It's just not practical.
There are, however, a lot of "precomposed" characters in Unicode, like áçñü. Normalization Form C will use these instead of the decomposed version whenever possible.
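For example, in Python's unicodedata (illustrative only):

```python
import unicodedata

# NFC composes a combining sequence into a precomposed code point when one
# exists, and leaves it decomposed when none does.
composed = unicodedata.normalize("NFC", "e\u0301")  # e + COMBINING ACUTE ACCENT
print(composed, len(composed))                       # é 1 -> single code point U+00E9

# No precomposed "x with acute" exists, so this stays two code points.
odd = unicodedata.normalize("NFC", "x\u0301")
print(odd, len(odd))                                 # x́ 2
```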
it seems to me that there is no encoding, normalization or representation where one character would be one code point in every case in Unicode. Is this correct?
Depends on the meaning of the word “character.” Unicode has the concepts of abstract character (definition 7 in chapter 3 of the standard: “A unit of information used for the organization, control, or representation of textual data”) and encoded character (definition 11: “An association (or mapping) between an abstract character and a code point”). So a character is never a code point, but for many code points there exists an abstract character that maps to the code point, this mapping being called an “encoded character.” But (definition 11, paragraph 4): “A single abstract character may also be represented by a sequence of code points.”
Is this true for Basic Multilingual Plane also?
There is no conceptual difference related to abstract or encoded characters between the BMP and the other planes. The statement above holds for all subsets of the codespace.
Depending on your application, you have to distinguish between the terms glyph, grapheme cluster, grapheme, abstract character, encoded character, code point, scalar value, code unit and byte. All of these concepts are different, and there is no simple mapping between them. In particular, there is almost never a one-to-one mapping between these entities.
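As a concrete illustration of those layers, here is one user-perceived character counted in different units (Python; grapheme-cluster counting is not in the standard library, so it is only noted in a comment):

```python
# The French flag emoji: one grapheme cluster made of two REGIONAL INDICATOR
# code points (U+1F1EB, U+1F1F7).
flag = "\U0001F1EB\U0001F1F7"

print(len(flag))                           # 2 code points
print(len(flag.encode("utf-16-le")) // 2)  # 4 UTF-16 code units (two surrogate pairs)
print(len(flag.encode("utf-8")))           # 8 UTF-8 bytes
# Counting grapheme clusters (1 here) needs UAX #29 segmentation, e.g. the
# third-party regex module's \X pattern.
```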