Extended Grapheme Clusters stop combining - swift

I have a question about extended grapheme clusters.
For example, consider the following code:
let message = "c\u{0327}a va bien" // => "ça va bien"
How does Swift know that this needs to be combined (i.e. into ç) rather than treated as a small letter c followed by a COMBINING CEDILLA?

Use the unicodeScalars view on the string:
let message1 = "c\u{0327}".decomposedStringWithCanonicalMapping
for scalar in message1.unicodeScalars {
    print(scalar) // prints "c" and COMBINING CEDILLA separately
}

let message2 = "c\u{0327}".precomposedStringWithCanonicalMapping
for scalar in message2.unicodeScalars {
    print(scalar) // prints LATIN SMALL LETTER C WITH CEDILLA
}
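This is also why the two spellings behave identically at the Character level: Swift's Character is an extended grapheme cluster, and String equality uses canonical equivalence. A minimal sketch illustrating this:
let decomposed = "c\u{0327}"   // LATIN SMALL LETTER C + COMBINING CEDILLA
let precomposed = "\u{E7}"     // LATIN SMALL LETTER C WITH CEDILLA (ç)
print(decomposed.count, precomposed.count)  // 1 1 -- each is a single Character
print(decomposed == precomposed)            // true -- String equality uses canonical equivalence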
Note that not all composite characters have a precomposed form, as noted by Apple's Technical Q&A:
Important: Do not convert to precomposed Unicode in an attempt to simplify your text processing. Precomposed Unicode can still contain composite characters. For example, there is no precomposed equivalent of U+0065 U+030A (LATIN SMALL LETTER E followed by COMBINING RING ABOVE)
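That caveat can be observed directly in Swift; a short sketch (assuming Foundation is imported, e.g. in a playground):
import Foundation

let ring = "e\u{030A}"  // LATIN SMALL LETTER E followed by COMBINING RING ABOVE
let stillDecomposed = ring.precomposedStringWithCanonicalMapping
print(stillDecomposed.unicodeScalars.count)  // 2 -- no precomposed code point exists for this pair
print(stillDecomposed.count)                 // 1 -- yet it is still a single Character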

Related

Are NFC normalization boundaries also extended grapheme cluster boundaries?

This question is related to text editing. Say you have a piece of text in normalization form NFC, and a cursor that points to an extended grapheme cluster boundary within this text. You want to insert another piece of text at the cursor location, and make sure that the resulting text is also in NFC. You also want to move the cursor on the first grapheme boundary that immediately follows the inserted text.
Now, since concatenating two strings that are both in NFC doesn't necessarily produce a string that is also in NFC, you might have to emend the text around the insertion point. For instance, if you have a string that contains 4 code points like so:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E
[2] COMBINING MACRON BELOW
--- Cursor location
[3] LATIN SMALL LETTER A
And you want to insert a two-code-point string {COMBINING ACUTE ACCENT, COMBINING DOT ABOVE} at the cursor location. Then the result will be:
[0] LATIN SMALL LETTER B
[1] LATIN SMALL LETTER E WITH ACUTE
[2] COMBINING MACRON BELOW
[3] COMBINING DOT ABOVE
--- Cursor location
[4] LATIN SMALL LETTER A
Now my question is: how do you figure out at which offset you should place the cursor after inserting the string, in such a way that the cursor ends up after the inserted string and also on a grapheme boundary? In this particular case, the text that follows the cursor location cannot possibly interact, during normalization, with what precedes. So the following sample Python code would work:
import unicodedata

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    if new_cursor_pos == 0:
        # grapheme_break_after is a function that
        # returns the offset of the first grapheme
        # boundary after the given index
        new_cursor_pos = grapheme_break_after(new_text, 0)
    return new_text, new_cursor_pos
But does this approach necessarily work? To be more explicit: is it necessarily the case that the text that follows a grapheme boundary doesn't interact with what precedes it during normalization, such that NFC(text[:grapheme_break]) + NFC(text[grapheme_break:]) == NFC(text) is always true?
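For experimenting with this property on concrete strings, here is a brute-force check written in Swift (a sketch: precomposedStringWithCanonicalMapping gives NFC, and String.indices walks extended grapheme cluster boundaries; the helper name is made up):
import Foundation

// At every extended grapheme cluster boundary (String.Index), normalizing the two
// halves separately and concatenating should equal normalizing the whole string.
func splitsCleanlyAtGraphemeBoundaries(_ s: String) -> Bool {
    let whole = s.precomposedStringWithCanonicalMapping  // NFC
    return s.indices.allSatisfy { i in
        let left = String(s[..<i]).precomposedStringWithCanonicalMapping
        let right = String(s[i...]).precomposedStringWithCanonicalMapping
        return left + right == whole
    }
}

print(splitsCleanlyAtGraphemeBoundaries("be\u{0331}a"))  // true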
Update
@nwellnhof's excellent analysis below motivated me to investigate things
further. So I followed the "When in doubt, use brute force" mantra and wrote a
small script that parses grapheme break properties and examines each code point
that can appear at the beginning of a grapheme, to test whether it can
possibly interact with preceding code points during normalization. Here's the
script:
from urllib.request import urlopen
import icu, unicodedata

URL = "http://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt"

break_props = {}
with urlopen(URL) as f:
    for line in f:
        line = line.decode()
        p = line.find("#")
        if p >= 0:
            line = line[:p]
        line = line.strip()
        if not line:
            continue
        fields = [x.strip() for x in line.split(";")]
        codes = [int(x, 16) for x in fields[0].split("..")]
        if len(codes) == 2:
            start, end = codes
        else:
            assert(len(codes) == 1)
            start, end = codes[0], codes[0]
        category = fields[1]
        break_props.setdefault(category, []).extend(range(start, end + 1))

# The only code points that can't appear at the beginning of a grapheme
# are those that appear in the following categories. See the regexps in
# UAX #29 Tables 1b and 1c.
to_ignore = set(c for name in ("Extend", "ZWJ", "SpacingMark") for c in break_props[name])

nfc = icu.Normalizer2.getNFCInstance()
for c in range(0x10FFFF + 1):
    if c in to_ignore:
        continue
    if not nfc.hasBoundaryBefore(chr(c)):
        print("U+%04X %s" % (c, unicodedata.name(chr(c))))
Looking at the output, it appears that there are about 40 code points that are
grapheme starters but still compose with preceding code points in NFC.
Basically, they are non-precomposed Hangul syllables of type V
(U+1161..U+1175) and T (U+11A8..U+11C2). Things make sense when you examine
the regular expressions in UAX #29, Table 1c together with what the standard
says about Jamo composition (section 3.12, p. 147 of version 13 of the
standard).
The gist of it is that Hangul sequences of the form {L, V} can compose to a
Hangul syllable of type LV, and similarly sequences of the form {LV, T} can
compose to a syllable of type LVT.
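The same Jamo composition is easy to observe outside ICU; for example, a small Swift sketch using Foundation's NFC mapping:
import Foundation

// L (U+1100) + V (U+1161) compose to the LV syllable U+AC00 (가)...
let lv = "\u{1100}\u{1161}".precomposedStringWithCanonicalMapping
print(lv.unicodeScalars.map { String($0.value, radix: 16) })  // ["ac00"]

// ...and LV (U+AC00) + T (U+11A8) compose to the LVT syllable U+AC01 (각).
let lvt = "\u{AC00}\u{11A8}".precomposedStringWithCanonicalMapping
print(lvt.unicodeScalars.map { String($0.value, radix: 16) })  // ["ac01"]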
To sum up, and assuming I'm not mistaken, the above Python code could
be corrected as follows:
import unicodedata
import icu  # pip3 install PyICU

def insert(text, cursor_pos, text_to_insert):
    new_text = text[:cursor_pos] + text_to_insert
    new_text = unicodedata.normalize("NFC", new_text)
    new_cursor_pos = len(new_text)
    new_text += text[cursor_pos:]
    new_text = unicodedata.normalize("NFC", new_text)
    break_iter = icu.BreakIterator.createCharacterInstance(icu.Locale())
    break_iter.setText(new_text)
    if new_cursor_pos == 0:
        # Move the cursor to the first grapheme boundary > 0.
        new_cursor_pos = break_iter.nextBoundary()
    elif new_cursor_pos > len(new_text):
        new_cursor_pos = len(new_text)
    elif not break_iter.isBoundary(new_cursor_pos):
        # isBoundary() moves the cursor to the first boundary >= the given
        # position.
        new_cursor_pos = break_iter.current()
    return new_text, new_cursor_pos
The (possibly) pointless test new_cursor_pos > len(new_text) is there to
catch the case len(NFC(x)) > len(NFC(x + y)). I'm not sure whether this can
actually happen with the current Unicode database (more tests would be needed to prove it), but it is theoretically quite possible. If, say, you have
a set of three code points A, B and C and two precomposed forms A+B and
A+B+C (but not A+C), then you could very well have NFC({A, C} + {B}) = {A+B+C}.
If this case doesn't occur in practice (which is very likely, especially with
"real" texts), then the above Python code will necessarily locate the first
grapheme boundary after the end of the inserted text. Otherwise, it will merely
locate some grapheme boundary after the inserted text, but not necessarily the
first one. I don't yet see how it could be possible to improve the second case (assuming it isn't merely theoretical), so I think I'll leave
my investigation at that for now.
As mentioned in my comment, the actual boundaries can differ slightly. But AFAICS, there should be no meaningful interaction. UAX #29 states:
6.1 Normalization
[...] the grapheme cluster boundary specification has the following features:
There is never a break within a sequence of nonspacing marks.
There is never a break between a base character and subsequent nonspacing marks.
This only mentions nonspacing marks. But with extended grapheme clusters (as opposed to legacy ones), I'm pretty sure these statements also apply to "non-starter" spacing marks[1]. This would cover all normalization non-starters (which must be either nonspacing (Mn) or spacing (Mc) marks). So there's never an extended grapheme cluster boundary before a non-starter[2] which should give you the guarantee you need.
Note that it's possible to have multiple runs of starters and non-starters ("normalization boundaries") within a single grapheme cluster, for example with U+034F COMBINING GRAPHEME JOINER.
[1] Some spacing marks are excluded, but these should all be starters.
[2] Except at the start of text.
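To make the COMBINING GRAPHEME JOINER point concrete, a small Swift sketch (CGJ is a normalization starter but has Grapheme_Cluster_Break=Extend, so it blocks composition without starting a new cluster):
import Foundation

let s = "a\u{034F}\u{0301}"  // a + COMBINING GRAPHEME JOINER + COMBINING ACUTE ACCENT
print(s.count)  // 1 -- a single extended grapheme cluster
// The CGJ (ccc = 0) blocks the acute accent from composing with the "a" under NFC,
// so the cluster keeps three scalars even after precomposition.
print(s.precomposedStringWithCanonicalMapping.unicodeScalars.count)  // 3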

How to split a Korean word into its components?

So, for example, the character 김 is made up of ㄱ, ㅣ and ㅁ. I need to split a Korean word into its components to get the resulting 3 characters.
I tried doing the following, but it doesn't output what I expect:
let str = "김"
let utf8 = str.utf8
let first:UInt8 = utf8.first!
let char = Character(UnicodeScalar(first))
The problem is that this code returns ê when it should return ㄱ.

You need to use decomposedStringWithCompatibilityMapping to get the decomposed Unicode scalar values, and then use those scalars to build the characters. Something like the following:
let string = "김"
for scalar in string.decomposedStringWithCompatibilityMapping.unicodeScalars {
    print("\(scalar) ")
}
Output:
ᄀ
ᅵ
ᆷ
You can create a list of single-character strings as follows:
let chars = string.decomposedStringWithCompatibilityMapping.unicodeScalars.map { String($0) }
print(chars)
// ["ᄀ", "ᅵ", "ᆷ"]
Korean-related info from Apple's docs:
Extended grapheme clusters are a flexible way to represent many
complex script characters as a single Character value. For example,
Hangul syllables from the Korean alphabet can be represented as either
a precomposed or decomposed sequence. Both of these representations
qualify as a single Character value in Swift:
let precomposed: Character = "\u{D55C}" // 한
let decomposed: Character = "\u{1112}\u{1161}\u{11AB}" // ᄒ, ᅡ, ᆫ
// precomposed is 한, decomposed is 한

Creating unicode & unicode-without-whitespace generators in ScalaCheck

During testing we want to exercise Unicode characters, sometimes across wide ranges and sometimes narrower ones. I've created a few specific generators:
// Generate a single Latin Unicode character (defined characters only):
val latinUnicodeCharacter = Gen.choose('\u0041', '\u01B5').filter(Character.isDefined)

// Generate Latin Unicode strings with all legal characters (21-40 characters):
val latinUnicodeGenerator: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.sequence[String, Char](List.fill(n)(latinUnicodeCharacter))
}

// Generate Latin Unicode strings without whitespace (21-40 characters): !! COMES UP SHORT...
val latinUnicodeGeneratorNoWhitespace: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.sequence[String, Char](List.fill(n)(latinUnicodeCharacter)).map(_.replaceAll("[\\p{Z}\\p{C}]", ""))
}
The latinUnicodeCharacter generator picks from characters ranging from standard Latin ("A," "B," etc.) up to higher-order Latin characters (Germanic/Nordic and others). This is good for testing Latin-based character input for, say, names.
The latinUnicodeGenerator creates strings of 21-40 characters in length. These strings can include horizontal whitespace (not just the space character but other "horizontal space").
The final example, latinUnicodeGeneratorNoWhitespace, is used for, say, email addresses. We want the Latin characters, but we don't want spaces, control codes, and the like. The problem: because I'm mapping over the final result String and stripping out those characters, the String shrinks, and I sometimes end up with a total length of less than 21 characters.
So the question is: How can I implement latinUnicodeGeneratorNoWhitespace but do it inside the generator in such a way that I always get 21-40 character strings?
You could do this by putting together a sequence of your non-whitespace characters, another of whitespace, and then picking from either only the non-whitespace, or from both together:
import org.scalacheck.Gen

val myChars = ('A' to 'Z') ++ ('a' to 'z')
val ws = Seq(' ', '\t')

val myCharsGenNoWhitespace: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars))
}

val myCharsGen: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
  Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars ++ ws))
}
I would suggest considering what you're really testing for, though—the more you restrict the test cases, the less you're checking about how your program will behave on unexpected inputs.

Is there a clean way to specify character literals in Swift?

Swift seems to be trying to deprecate the notion of a string being composed of an array of atomic characters, which makes sense for many uses, but there's an awful lot of programming that involves picking through data structures that are ASCII for all practical purposes, particularly with file I/O. The absence of a built-in language feature for specifying a character literal seems like a gaping hole, i.e. there is no analog of the C/Java/etc.-esque:
String foo="a"
char bar='a'
This is rather inconvenient, because even if you convert your strings into arrays of characters, you can't do things like:
let ch:unichar = arrayOfCharacters[n]
if ch >= 'a' && ch <= 'z' {...whatever...}
One rather hacky workaround is to do something like this:
let LOWCASE_A = ("a" as NSString).characterAtIndex(0)
let LOWCASE_Z = ("z" as NSString).characterAtIndex(0)
if ch >= LOWCASE_A && ch <= LOWCASE_Z {...whatever...}
This works, but obviously it's pretty ugly. Does anyone have a better way?
Characters can be created from Strings as long as those Strings are only made up of a single character. And, since Character implements ExtendedGraphemeClusterLiteralConvertible, Swift will do this for you automatically on assignment. So, to create a Character in Swift, you can simply do something like:
let ch: Character = "a"
Then, you can use the contains method of an IntervalType (generated with the Range operators) to check if a character is within the range you're looking for:
if ("a"..."z").contains(ch) {
/* ... whatever ... */
}
Example:
let ch: Character = "m"
if ("a"..."z").contains(ch) {
    println("yep")
} else {
    println("nope")
}
Outputs:
yep
Update: As @MartinR pointed out, the ordering of Swift characters is based on Unicode Normalization Form D, which is not the same as the order of ASCII character codes. In your specific case, there are more characters between a and z than in straight ASCII (ä, for example). See @MartinR's answer here for more info.
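For instance, a quick sketch of that point:
let umlautA: Character = "ä"
print(("a"..."z").contains(umlautA))  // true -- "ä" sorts between "a" and "z" in Swift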
If you need to check whether a character is between two ASCII character codes, then you may need to do something like your original workaround. However, you'll also have to convert ch to a unichar and not a Character for it to work (see this question for more info on Character vs. unichar):
let a_code = ("a" as NSString).characterAtIndex(0)
let z_code = ("z" as NSString).characterAtIndex(0)
let ch_code = (String(ch) as NSString).characterAtIndex(0)
if (a_code...z_code).contains(ch_code) {
    println("yep")
} else {
    println("nope")
}
Or, the even more verbose way without using NSString:
let startCharScalars = "a".unicodeScalars
let startCode = startCharScalars[startCharScalars.startIndex]
let endCharScalars = "z".unicodeScalars
let endCode = endCharScalars[endCharScalars.startIndex]
let chScalars = String(ch).unicodeScalars
let chCode = chScalars[chScalars.startIndex]
if (startCode...endCode).contains(chCode) {
    println("yep")
} else {
    println("nope")
}
Note: Both of those examples only work if the character only contains a single code point, but, as long as we're limited to ASCII, that shouldn't be a problem.
If you need C-style ASCII literals, you can just do this:
let chr = UInt8(ascii:"A") // == UInt8( 0x41 )
Or if you need 32-bit Unicode literals you can do this:
let unichr1 = UnicodeScalar("A").value // == UInt32( 0x41 )
let unichr2 = UnicodeScalar("é").value // == UInt32( 0xe9 )
let unichr3 = UnicodeScalar("😀").value // == UInt32( 0x1f600 )
Or 16-bit:
let unichr1 = UInt16(UnicodeScalar("A").value) // == UInt16( 0x41 )
let unichr2 = UInt16(UnicodeScalar("é").value) // == UInt16( 0xe9 )
All of these initializers will be evaluated at compile time, so it really is using an immediate literal at the assembly instruction level.
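As a usage example (a small sketch), the ASCII-literal form combines naturally with a range check over raw bytes:
let bytes: [UInt8] = Array("hello, world!".utf8)
let a = UInt8(ascii: "a"), z = UInt8(ascii: "z")
// Count the bytes that fall in the ASCII lowercase-letter range.
let lowercaseCount = bytes.filter { (a...z).contains($0) }.count
print(lowercaseCount)  // 10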
The feature you want was proposed to be in Swift 5.1, but that proposal was rejected for a few reasons:
Ambiguity
The proposal as written, in the current Swift ecosystem, would have allowed for expressions like 'x' + 'y' == "xy", which was not intended (the proper syntax would be "x" + "y" == "xy").
Amalgamation
The proposal was two in one.
First, it proposed a way to introduce single-quote literals into the language.
Second, it proposed that these would be convertible to numerical types to deal with ASCII values and Unicode codepoints.
These are both good proposals, and it was recommended that this be split into two and re-proposed. Those follow-up proposals have not yet been formalized.
Disagreement
It never reached consensus whether the default type of 'x' would be a Character or a Unicode.Scalar. The proposal went with Character, citing the Principle of Least Surprise, despite this lack of consensus.
You can read the full rejection rationale here.
The syntax might/would look like this:
let myChar = 'f' // Type is Character, value is solely the unicode U+0066 LATIN SMALL LETTER F
let myInt8: Int8 = 'f' // Type is Int8, value is 102 (0x66)
let myUInt8Array: [UInt8] = [ 'a', 'b', '1', '2' ] // Type is [UInt8], value is [ 97, 98, 49, 50 ] ([ 0x61, 0x62, 0x31, 0x32 ])
switch someUInt8 {
case 'a' ... 'f': return "Lowercase hex letter"
case 'A' ... 'F': return "Uppercase hex letter"
case '0' ... '9': return "Hex digit"
default: return "Non-hex character"
}
It also looks like you can use the following syntax:
Character("a")
This will create a Character from the specified single character string.
I have only tested this in Swift 4 and Xcode 10.1
Why do I exhume 7-year-old posts? Fun, I guess? Seriously though, I think I can add to the discussion.
It is not a gaping hole, or rather, it is a deliberate gaping hole that explicitly discourages conflating a string of text with a sequence of ASCII bytes.
You absolutely can pick apart a String. A String implements BidirectionalCollection and has many ways to manipulate the atoms. See: https://developer.apple.com/documentation/swift/string.
But you have to get used to the more generalized notion of a String. It can be picked apart from the User perspective, which is a sequence of grapheme clusters, each (usually) which a visually separable appearance, or from the encoding perspective, which can be one of several (UTF32, UTF16, UTF8).
At the risk of overanalyzing the wording of your question:
A data structure is conceptual, and independent of encoding in storage
A data structure encoded as an ASCII string is just one kind of ASCII string
By design, ASCII values 0-127 have the identical encoding in UTF-8, so loading such a stream with a UTF-8 API is fine
A data structure encoded as a string where fields of the structure have UTF-8 Unicode string values is not an ASCII string, but a UTF-8 string itself
A string is either ASCII-encoded or not; "for practical purposes" isn't a meaningful qualifier. A UTF-8 database field where 99.99% of the text falls in the ASCII range (where encodings will match), but occasionally doesn't, will present some nasty bug opportunities.
Instead of a terse and low-level equivalence of fixed-width integers and English-only text, Swift has a richer API that forces more explicit naming of the involved categories and entities. If you want to deal with ASCII, there's a name (method) for that, and if you want to deal with human sub-categories, there's a name for that, too, and they're totally independent of one another. There is a strong move away from ASCII and the English-centric string handling model of C. This is factual, not evangelizing, and it can present an irksome learning curve.
(This is aimed at new-comers, acknowledging the OP probably has years of experience with this now.)
For what you're trying to do there, consider:
let foo = "abcDeé@¶œŎO!@#"
foo.forEach { c in
    print((c.isASCII ? "\(c) is ascii with value \(c.asciiValue ?? 0); " : "\(c) is not ascii; ")
        + ((c.isLetter ? "\(c) is a letter" : "\(c) is not a letter")))
}
Output:
a is ascii with value 97; a is a letter
b is ascii with value 98; b is a letter
c is ascii with value 99; c is a letter
D is ascii with value 68; D is a letter
e is ascii with value 101; e is a letter
é is not ascii; é is a letter
@ is ascii with value 64; @ is not a letter
¶ is not ascii; ¶ is not a letter
œ is not ascii; œ is a letter
Ŏ is not ascii; Ŏ is a letter
O is ascii with value 79; O is a letter
! is ascii with value 33; ! is not a letter
@ is ascii with value 64; @ is not a letter
# is ascii with value 35; # is not a letter

Unicode characters having asymmetric upper/lower case. Why?

Why do the following three characters not have symmetric toLower/toUpper results?
/**
* Written in the Scala programming language, typed into the Scala REPL.
* Results commented accordingly.
*/
/* Unicode Character 'LATIN CAPITAL LETTER SHARP S' (U+1E9E) */
'\u1e9e'.toHexString == "1e9e" // true
'\u1e9e'.toLower.toHexString == "df" // "df" == "df"
'\u1e9e'.toHexString == '\u1e9e'.toLower.toUpper.toHexString // "1e9e" != "df"
/* Unicode Character 'KELVIN SIGN' (U+212A) */
'\u212a'.toHexString == "212a" // "212a" == "212a"
'\u212a'.toLower.toHexString == "6b" // "6b" == "6b"
'\u212a'.toHexString == '\u212a'.toLower.toUpper.toHexString // "212a" != "4b"
/* Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130) */
'\u0130'.toHexString == "130" // "130" == "130"
'\u0130'.toLower.toHexString == "69" // "69" == "69"
'\u0130'.toHexString == '\u0130'.toLower.toUpper.toHexString // "130" != "49"
For the first one, there is this explanation:
In the German language, the Sharp S ("ß" or U+00df) is a lowercase letter, and it capitalizes to the letters "SS".
In other words, U+1E9E lower-cases to U+00DF, but the upper-case of U+00DF is not U+1E9E.
For the second one, U+212A (KELVIN SIGN) lower-cases to U+006B (LATIN SMALL LETTER K). The upper-case of U+006B is U+004B (LATIN CAPITAL LETTER K). This one seems to make sense to me.
For the third case, U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) is a Turkish/Azerbaijani character that lower-cases to U+0069 (LATIN SMALL LETTER I). I would imagine that if you were somehow in a Turkish/Azerbaijani locale you'd get the proper upper-case version of U+0069, but that might not necessarily be universal.
Characters need not necessarily have symmetric upper- and lower-case transformations.
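The question uses Scala's Char methods, but the same asymmetry shows up with full case mapping elsewhere; for example, a quick Swift sketch:
print("\u{1E9E}".lowercased())   // "ß"  -- U+1E9E lower-cases to U+00DF...
print("ß".uppercased())          // "SS" -- ...but U+00DF upper-cases to "SS", not back to U+1E9E
print("\u{212A}".lowercased())                // "k" -- KELVIN SIGN lower-cases to an ordinary k
print("\u{212A}".lowercased().uppercased())   // "K" -- which upper-cases to an ordinary K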
Edit: To respond to PhiLho's comment below, the Unicode 6.0 spec has this to say about U+212A (KELVIN SIGN):
Three letterlike symbols have been given canonical equivalence to regular letters: U+2126
OHM SIGN, U+212A KELVIN SIGN, and U+212B ANGSTROM SIGN. In all three instances, the regular letter should be used. If text is normalized according to Unicode Standard Annex #15, “Unicode Normalization Forms,” these three characters will be replaced by their regular equivalents.
In other words, you shouldn't really be using U+212A, you should be using U+004B (LATIN CAPITAL LETTER K) instead, and if you normalize your Unicode text, U+212A should be replaced with U+004B.
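A small Swift sketch of that recommendation in action (Foundation's NFC mapping folds the Kelvin sign into a regular K, and canonical String equality already treats them as equal):
import Foundation

let kelvin = "\u{212A}"  // KELVIN SIGN
print(kelvin.precomposedStringWithCanonicalMapping == "K")  // true -- NFC replaces it with U+004B
print(kelvin == "K")                                        // true -- String equality is canonical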
May I refer to another post about Unicode and upper and lower case:
It is a common mistake to think that signs for a language have to be available in upper and lower case!
Unicode-correct title case in Java