Swift string indexing combines "\r\n" as one char instead of two - swift

I am dealing with strings containing \r\n with Swift 4.2. I ran into kind of strange behavior of Swift index, it appears \r\n will be treated as one character instead of two by Swift indexing methods. I wrote a piece of code to present this behavior:
var text = "ABC\r\n\r\nDEF"
func printChar(_ lower: Int, _ upper: Int) {
    let start = text.index(text.startIndex, offsetBy: lower)
    let end = text.index(text.startIndex, offsetBy: upper)
    print("\"" + text[start..<end] + "\"")
}
printChar(0, 1) // "A"
printChar(1, 2) // "B"
printChar(2, 3) // "C"
printChar(3, 4) // new line
printChar(4, 5) // new line (okay, what's going on here?)
printChar(5, 6) // "D"
printChar(6, 7) // "E"
printChar(7, 8) // "F"
The print result will be
"A"
"B"
"C"
"
"
"
"
"D"
"E"
"F"
Any idea why it's like this?

TLDR: \r\n is a single extended grapheme cluster under Unicode's text segmentation rules, so Swift treats it as one Character.
Swift treats \r\n as one Character.
Objective-C NSString treats it as two characters (in terms of the result from length).
On the swift-users forum someone wrote:
- "\r\n" is a single Character. Is this the correct behaviour?
- Yes, a Character corresponds to a Unicode grapheme cluster, and "\r\n" is considered a single grapheme cluster.
The subsequent response posted a link to the Unicode documentation; check out this table, which officially states that CRLF is a grapheme cluster.
Take a look at the Apple documentation on Characters and Grapheme Clusters.
It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.
The Swift documentation on Strings and Characters is also worth reading.
This overview from objc.io is interesting as well.
NSString represents UTF-16-encoded text. Length, indices, and ranges are all based on UTF-16 code units.
Another example of this is an emoji like 👍🏻. This single character is actually %uD83D%uDC4D%uD83C%uDFFB: four UTF-16 code units that encode two Unicode scalars (U+1F44D and U+1F3FB). But if you called count on a string with just that emoji you'd (correctly) get 1.
If you wanted to see the scalars you could iterate them as follows:
for scalar in text.unicodeScalars {
print("\(scalar.value) ", terminator: "")
}
Which for "\r\n" would give you 13 10
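As a rough sketch of the emoji case (assuming Swift 5 and its grapheme breaking rules; the variable names are only illustrative), the same string can be measured through each of its views:

```swift
// Thumbs-up (U+1F44D) followed by a light skin tone modifier (U+1F3FB),
// written with escapes so the source stays ASCII.
let thumbsUp = "\u{1F44D}\u{1F3FB}"

let characterCount = thumbsUp.count              // extended grapheme clusters
let scalarCount = thumbsUp.unicodeScalars.count  // Unicode scalars
let utf16Count = thumbsUp.utf16.count            // UTF-16 code units, the unit NSString.length uses
let utf8Count = thumbsUp.utf8.count              // UTF-8 code units (bytes)

print(characterCount, scalarCount, utf16Count, utf8Count)
```

The gap between the first and third numbers is the same mismatch that shows up between String.count and NSString's length.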
In the Swift documentation you'll find why NSString is different:
The count of the characters returned by the count property isn't always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string's UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
Thus this isn't really "strange" behaviour of Swift string indexing, but rather a result of how Unicode treats these characters and how String in Swift is designed. Swift string indexing goes by Character and \r\n is a single Character.
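If you do need to address \r and \n separately, one option is to index into the unicodeScalars view instead of the Character view. A minimal sketch, reusing the string from the question (the other names are made up for illustration):

```swift
let text = "ABC\r\n\r\nDEF"

// Character view: each "\r\n" pair is one extended grapheme cluster.
let characterCount = text.count
// Scalar view: CR and LF are counted separately.
let scalarCount = text.unicodeScalars.count
// UTF-16 view: the unit that NSString.length is based on.
let utf16Count = text.utf16.count

// Slice the scalar view to pick out the first carriage return on its own.
let scalars = text.unicodeScalars
let start = scalars.index(scalars.startIndex, offsetBy: 3)
let end = scalars.index(start, offsetBy: 1)
let firstCR = String(scalars[start..<end])

print(characterCount, scalarCount, utf16Count, firstCR == "\r")
```

Indices obtained from one view should be used with that same view; the offsets 3 and 4 here are scalar offsets, not Character offsets.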

Related

How to split a Korean word into its components?

So, for example the character 김 is made up of ㄱ, ㅣ and ㅁ. I need to split the Korean word into its components to get the resulting 3 characters.
I tried by doing the following but it doesn't seem to output it correctly:
let str = "김"
let utf8 = str.utf8
let first:UInt8 = utf8.first!
let char = Character(UnicodeScalar(first))
The problem is that this code returns ê, when it should be returning ㄱ.
You need to use the decomposedStringWithCompatibilityMapping property to get the Unicode scalar values, and then use those scalar values to get the characters. Something like this:
let string = "김"
for scalar in string.decomposedStringWithCompatibilityMapping.unicodeScalars {
print("\(scalar) ")
}
Output:
ᄀ
ᅵ
ᆷ
You can create list of character strings as,
let chars = string.decomposedStringWithCompatibilityMapping.unicodeScalars.map { String($0) }
print(chars)
// ["ᄀ", "ᅵ", "ᆷ"]
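For precomposed Hangul syllables the same split can also be computed arithmetically, without Foundation. The sketch below assumes the syllable layout defined by the Unicode Standard (19 leading consonants, 21 vowels, 28 trailing slots starting at U+AC00); decomposeHangulSyllable is a hypothetical helper name, not a standard library API. Note that it yields conjoining jamo such as U+1100, not the compatibility jamo such as ㄱ (U+3131) shown in the question; mapping to those would need an extra lookup table, which is not shown.

```swift
/// Splits a precomposed Hangul syllable (U+AC00...U+D7A3) into its conjoining jamo.
/// Returns nil for any scalar outside that block. (Hypothetical helper for illustration.)
func decomposeHangulSyllable(_ scalar: Unicode.Scalar) -> [Unicode.Scalar]? {
    let sBase: UInt32 = 0xAC00   // first precomposed syllable
    let lBase: UInt32 = 0x1100   // first leading consonant
    let vBase: UInt32 = 0x1161   // first vowel
    let tBase: UInt32 = 0x11A7   // one before the first trailing consonant
    let vCount: UInt32 = 21
    let tCount: UInt32 = 28
    let sCount: UInt32 = 19 * vCount * tCount

    guard scalar.value >= sBase, scalar.value < sBase + sCount else { return nil }

    let sIndex = scalar.value - sBase
    let leading = lBase + sIndex / (vCount * tCount)
    let vowel = vBase + (sIndex % (vCount * tCount)) / tCount
    let trailing = tBase + sIndex % tCount

    var values = [leading, vowel]
    if trailing != tBase { values.append(trailing) }   // trailing index 0 means "no final consonant"
    return values.compactMap { Unicode.Scalar($0) }
}

// U+AE40 is the syllable from the question.
let parts = decomposeHangulSyllable("\u{AE40}")
print(parts?.map { String($0.value, radix: 16) } ?? [])
```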
Korean related info in Apple docs
Extended grapheme clusters are a flexible way to represent many
complex script characters as a single Character value. For example,
Hangul syllables from the Korean alphabet can be represented as either
a precomposed or decomposed sequence. Both of these representations
qualify as a single Character value in Swift:
let precomposed: Character = "\u{D55C}" // 한
let decomposed: Character = "\u{1112}\u{1161}\u{11AB}" // ᄒ, ᅡ, ᆫ
// precomposed is 한, decomposed is 한
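A small sketch of what that means in practice, using the same two sequences as String values (variable names are illustrative): the two representations compare equal and each count as one Character, yet differ at the scalar level.

```swift
let precomposed = "\u{D55C}"                    // one scalar
let decomposed = "\u{1112}\u{1161}\u{11AB}"     // three conjoining jamo scalars

// String equality uses Unicode canonical equivalence, not a scalar-by-scalar comparison.
let sameText = (precomposed == decomposed)

let characterCounts = (precomposed.count, decomposed.count)
let scalarCounts = (precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)

print(sameText, characterCounts, scalarCounts)
```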

String substring with flag emoji in Swift [duplicate]

let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
let str2 = "🇩🇪.🇩🇪.🇩🇪.🇩🇪.🇩🇪."
println("\(countElements(str1)), \(countElements(str2))")
Result: 1, 10
But should not str1 have 5 elements?
The bug seems only occurred when I use the flag emoji.
Update for Swift 4 (Xcode 9)
As of Swift 4 (tested with Xcode 9 beta) grapheme clusters break after every second regional indicator symbol, as mandated by the Unicode 9
standard:
let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
print(str1.count) // 5
print(Array(str1)) // ["🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪"]
Also String is a collection of its characters (again), so one can
obtain the character count with str1.count.
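To make the pairing visible, here is a hedged sketch (assuming Swift 4 or later, as in the update above) that builds the five-flag string from escapes and compares the Character count with the lower-level views:

```swift
// REGIONAL INDICATOR SYMBOL LETTER D (U+1F1E9) + LETTER E (U+1F1EA), repeated five times.
let flags = String(repeating: "\u{1F1E9}\u{1F1EA}", count: 5)

let flagCount = flags.count                  // one Character per pair of regional indicators
let scalarCount = flags.unicodeScalars.count // two scalars per flag
let utf16Count = flags.utf16.count           // each scalar needs a surrogate pair

print(flagCount, scalarCount, utf16Count)
```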
(Old answer for Swift 3 and older:)
From "3 Grapheme Cluster Boundaries"
in the "Standard Annex #29 UNICODE TEXT SEGMENTATION":
(emphasis added):
A legacy grapheme cluster is defined as a base (such as A or カ)
followed by zero or more continuing characters. One way to think of
this is as a sequence of characters that form a "stack".
The base can be single characters, or be any sequence of Hangul Jamo
characters that form a Hangul Syllable, as defined by D133 in The
Unicode Standard, or be any sequence of Regional_Indicator (RI) characters. The RI characters are used in pairs to denote Emoji
national flag symbols corresponding to ISO country codes. Sequences of
more than two RI characters should be separated by other characters,
such as U+200B ZWSP.
(Thanks to @rintaro for the link).
A Swift Character represents an extended grapheme cluster, so it is (according
to this reference) correct that any sequence of regional indicator symbols
is counted as a single character.
You can separate the "flags" by a ZERO WIDTH NON-JOINER:
let str1 = "🇩🇪\u{200C}🇩🇪"
print(str1.characters.count) // 2
or insert a ZERO WIDTH SPACE:
let str2 = "🇩🇪\u{200B}🇩🇪"
print(str2.characters.count) // 3
This also resolves possible ambiguities, e.g. should "🇫​🇷​🇺​🇸"
be "🇫​🇷🇺​🇸" or "🇫🇷​🇺🇸" ?
See also How to know if two emojis will be displayed as one emoji? about a possible method
to count the number of "composed characters" in a Swift string,
which would return 5 for your let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪".
Here's how I solved that problem, for Swift 3:
let str = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪" //or whatever the string of emojis is
let range = str.startIndex..<str.endIndex
var length = 0
str.enumerateSubstrings(in: range, options: NSString.EnumerationOptions.byComposedCharacterSequences) { (substring, substringRange, enclosingRange, stop) -> () in
    length = length + 1
}
print("Character Count: \(length)")
This fixes all the problems with character count and emojis, and is the simplest method I have found.

Is there a clean way to specify character literals in Swift?

Swift seems to be trying to deprecate the notion of a string being composed of an array of atomic characters, which makes sense for many uses, but there's an awful lot of programming that involves picking through datastructures that are ASCII for all practical purposes: particularly with file I/O. The absence of a built in language feature to specify a character literal seems like a gaping hole, i.e. there is no analog of the C/Java/etc-esque:
String foo="a"
char bar='a'
This is rather inconvenient, because even if you convert your strings into arrays of characters, you can't do things like:
let ch:unichar = arrayOfCharacters[n]
if ch >= 'a' && ch <= 'z' {...whatever...}
One rather hacky workaround is to do something like this:
let LOWCASE_A = ("a" as NSString).characterAtIndex(0)
let LOWCASE_Z = ("z" as NSString).characterAtIndex(0)
if ch >= LOWCASE_A && ch <= LOWCASE_Z {...whatever...}
This works, but obviously it's pretty ugly. Does anyone have a better way?
Characters can be created from Strings as long as those Strings are only made up of a single character. And, since Character implements ExtendedGraphemeClusterLiteralConvertible (renamed ExpressibleByExtendedGraphemeClusterLiteral in Swift 3), Swift will do this for you automatically on assignment. So, to create a Character in Swift, you can simply do something like:
let ch: Character = "a"
Then, you can use the contains method of an IntervalType (generated with the Range operators) to check if a character is within the range you're looking for:
if ("a"..."z").contains(ch) {
/* ... whatever ... */
}
Example:
let ch: Character = "m"
if ("a"..."z").contains(ch) {
println("yep")
} else {
println("nope")
}
Outputs:
yep
Update: As @MartinR pointed out, the ordering of Swift characters is based on Unicode Normalization Form D which is not in the same order as ASCII character codes. In your specific case, there are more characters between a and z than in straight ASCII (ä for example). See @MartinR's answer here for more info.
If you need to check if a character is in between two ASCII character codes, then you may need to do something like your original workaround. However, you'll also have to convert ch to an unichar and not a Character for it to work (see this question for more info on Character vs unichar):
let a_code = ("a" as NSString).characterAtIndex(0)
let z_code = ("z" as NSString).characterAtIndex(0)
let ch_code = (String(ch) as NSString).characterAtIndex(0)
if (a_code...z_code).contains(ch_code) {
println("yep")
} else {
println("nope")
}
Or, the even more verbose way without using NSString:
let startCharScalars = "a".unicodeScalars
let startCode = startCharScalars[startCharScalars.startIndex]
let endCharScalars = "z".unicodeScalars
let endCode = endCharScalars[endCharScalars.startIndex]
let chScalars = String(ch).unicodeScalars
let chCode = chScalars[chScalars.startIndex]
if (startCode...endCode).contains(chCode) {
println("yep")
} else {
println("nope")
}
Note: Both of those examples only work if the character only contains a single code point, but, as long as we're limited to ASCII, that shouldn't be a problem.
If you need C-style ASCII literals, you can just do this:
let chr = UInt8(ascii:"A") // == UInt8( 0x41 )
Or if you need 32-bit Unicode literals you can do this:
let unichr1 = UnicodeScalar("A").value // == UInt32( 0x41 )
let unichr2 = UnicodeScalar("é").value // == UInt32( 0xe9 )
let unichr3 = UnicodeScalar("😀").value // == UInt32( 0x1f600 )
Or 16-bit:
let unichr1 = UInt16(UnicodeScalar("A").value) // == UInt16( 0x41 )
let unichr2 = UInt16(UnicodeScalar("é").value) // == UInt16( 0xe9 )
All of these initializers will be evaluated at compile time, so it really is using an immediate literal at the assembly instruction level.
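Putting UInt8(ascii:) together with Character's asciiValue property (available from Swift 5) gives a fairly compact ASCII range check. This is only a sketch, and isLowercaseASCIILetter is a made-up helper name:

```swift
/// Returns true only for the ASCII letters "a" through "z".
/// (Hypothetical helper; not part of the standard library.)
func isLowercaseASCIILetter(_ ch: Character) -> Bool {
    // asciiValue is nil for anything outside ASCII, so accented letters are rejected here.
    guard let value = ch.asciiValue else { return false }
    return (UInt8(ascii: "a")...UInt8(ascii: "z")).contains(value)
}

print(isLowercaseASCIILetter("m"))
print(isLowercaseASCIILetter("M"))
print(isLowercaseASCIILetter("\u{E4}"))   // a with diaeresis, not ASCII
```

Because the comparison happens on UInt8 values, the result does not depend on how Swift orders Character values.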
The feature you want was proposed to be in Swift 5.1, but that proposal was rejected for a few reasons:
Ambiguity
The proposal as written, in the current Swift ecosystem, would have allowed for expressions like 'x' + 'y' == "xy", which was not intended (the proper syntax would be "x" + "y" == "xy").
Amalgamation
The proposal was two in one.
First, it proposed a way to introduce single-quote literals into the language.
Second, it proposed that these would be convertible to numerical types to deal with ASCII values and Unicode codepoints.
These are both good proposals, and it was recommended that this be split into two and re-proposed. Those follow-up proposals have not yet been formalized.
Disagreement
It never reached consensus whether the default type of 'x' would be a Character or a Unicode.Scalar. The proposal went with Character, citing the Principle of Least Surprise, despite this lack of consensus.
You can read the full rejection rationale here.
The syntax might/would look like this:
let myChar = 'f' // Type is Character, value is solely the unicode U+0066 LATIN SMALL LETTER F
let myInt8: Int8 = 'f' // Type is Int8, value is 102 (0x66)
let myUInt8Array: [UInt8] = [ 'a', 'b', '1', '2' ] // Type is [UInt8], value is [ 97, 98, 49, 50 ] ([ 0x61, 0x62, 0x31, 0x32 ])
switch someUInt8 {
case 'a' ... 'f': return "Lowercase hex letter"
case 'A' ... 'F': return "Uppercase hex letter"
case '0' ... '9': return "Hex digit"
default: return "Non-hex character"
}
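Until something like that syntax exists, a similar switch can be approximated in current Swift with UInt8(ascii:) range patterns. A minimal sketch (classifyHexByte is a hypothetical name; the returned strings are copied from the example above):

```swift
/// Classifies a single byte, mirroring the proposed single-quote switch.
/// (Hypothetical helper for illustration.)
func classifyHexByte(_ byte: UInt8) -> String {
    switch byte {
    case UInt8(ascii: "a")...UInt8(ascii: "f"): return "Lowercase hex letter"
    case UInt8(ascii: "A")...UInt8(ascii: "F"): return "Uppercase hex letter"
    case UInt8(ascii: "0")...UInt8(ascii: "9"): return "Hex digit"
    default: return "Non-hex character"
    }
}

// Classify every byte of an ASCII string via its UTF-8 view.
for byte in "c7Bg".utf8 {
    print(byte, classifyHexByte(byte))
}
```

It is more verbose than the proposed literals, but it keeps the byte-level intent explicit.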
It also looks like you can use the following syntax:
Character("a")
This will create a Character from the specified single character string.
I have only tested this in Swift 4 and Xcode 10.1
Why do I exhume 7 year old posts? Fun I guess? Seriously though, I think I can add to the discussion.
It is not a gaping hole, or rather, it is a deliberate gaping hole that explicitly discourages conflating a string of text with a sequence of ASCII bytes.
You absolutely can pick apart a String. A String implements BidirectionalCollection and has many ways to manipulate the atoms. See: https://developer.apple.com/documentation/swift/string.
But you have to get used to the more generalized notion of a String. It can be picked apart from the user's perspective, as a sequence of grapheme clusters, each of which (usually) has a visually separable appearance, or from the encoding perspective, which can be one of several (UTF-32, UTF-16, UTF-8).
At the risk of overanalyzing the wording of your question:
A data structure is conceptual, and independent of encoding in storage
A data structure encoded as an ASCII string is just one kind of ASCII string
By design the encoding of ASCII values 0-127 will have an identical encoding in UTF-8, so loading that stream with a UTF8 API is fine
A data structure encoded as a string where fields of the structure have UTF-8 Unicode string values is not an ASCII string, but a UTF-8 string itself
A string is either ASCII-encoded or not; "for practical purposes" isn't a meaningful qualifier. A UTF-8 database field where 99.99% of the text falls in the ASCII range (where encodings will match), but occasionally doesn't, will present some nasty bug opportunities.
Instead of a terse and low-level equivalence of fixed-width integers and English-only text, Swift has a richer API that forces more explicit naming of the involved categories and entities. If you want to deal with ASCII, there's a name (method) for that, and if you want to deal with human sub-categories, there's a name for that, too, and they're totally independent of one another. There is a strong move away from ASCII and the English-centric string handling model of C. This is factual, not evangelizing, and it can present an irksome learning curve.
(This is aimed at new-comers, acknowledging the OP probably has years of experience with this now.)
For what you're trying to do there, consider:
let foo = "abcDeé@¶œŎO!@#"
foo.forEach { c in
    print((c.isASCII ? "\(c) is ascii with value \(c.asciiValue ?? 0); " : "\(c) is not ascii; ")
        + ((c.isLetter ? "\(c) is a letter" : "\(c) is not a letter")))
}
a is ascii with value 97; a is a letter
b is ascii with value 98; b is a letter
c is ascii with value 99; c is a letter
D is ascii with value 68; D is a letter
e is ascii with value 101; e is a letter
é is not ascii; é is a letter
@ is ascii with value 64; @ is not a letter
¶ is not ascii; ¶ is not a letter
œ is not ascii; œ is a letter
Ŏ is not ascii; Ŏ is a letter
O is ascii with value 79; O is a letter
! is ascii with value 33; ! is not a letter
@ is ascii with value 64; @ is not a letter
# is ascii with value 35; # is not a letter

How can I get the Unicode code point(s) of a Character?

How can I extract the Unicode code point(s) of a given Character without first converting it to a String? I know that I can use the following:
let ch: Character = "A"
let s = String(ch).unicodeScalars
s[s.startIndex].value // returns 65
but it seems like there should be a more direct way to accomplish this using just Swift's standard library. The Language Guide sections "Working with Characters" and "Unicode" only discuss iterating through the characters in a String, not working directly with Characters.
From what I can gather in the documentation, they want you to get Character values from a String because it gives context. Is this Character encoded with UTF8, UTF16, or 21-bit code points (scalars)?
If you look at how a Character is defined in the Swift framework, it is actually an enum value. This is probably done due to the various representations from String.utf8, String.utf16, and String.unicodeScalars.
It seems they do not expect you to work with Character values but rather Strings and you as the programmer decide how to get these from the String itself, allowing encoding to be preserved.
That said, if you need to get the code points in a concise manner, I would recommend an extension like such:
extension Character
{
func unicodeScalarCodePoint() -> UInt32
{
let characterString = String(self)
let scalars = characterString.unicodeScalars
return scalars[scalars.startIndex].value
}
}
Then you can use it like so:
let char : Character = "A"
char.unicodeScalarCodePoint()
In summary, string and character encoding is a tricky thing when you factor in all the possibilities. In order to allow each possibility to be represented, they went with this scheme.
Also remember this is a 1.0 release, I'm sure they will expand Swift's syntactical sugar soon.
I think there are some misunderstandings about Unicode. Unicode itself is NOT an encoding; it does not transform characters into any sort of binary sequence. Unicode is essentially a big table that collects the characters used by the world's writing systems (Klingon is not officially part of it). The entries are organized and indexed by code points, numbers in the range U+0000 to U+10FFFF that fit in 21 bits and are written like U+D55C. A grapheme cluster (a "Character" from the human reader's perspective) is made of one or more of these code points. You can find the character you are looking for in the big Unicode table by using its code point.
Meanwhile, UTF-8, UTF-16, and UTF-32 are the actual encodings. Yes, there is more than one way to encode Unicode characters into binary sequences. Which one to use depends on the project you are working on, but most web pages are encoded in UTF-8 (you can check this one right now).
Concept 1: The Unicode point is called the Unicode Scalar in Swift
A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do not include the Unicode surrogate pair code points, which are the code points in the range U+D800 to U+DFFF inclusive.
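The ranges above can be probed with Unicode.Scalar's failable initializer, which returns nil for code points that are not scalars. A small sketch (the constant names are illustrative; the UInt32 annotations just select the failable initializer):

```swift
// Boundary values around the surrogate range and the end of the code space.
let lastBeforeSurrogates = Unicode.Scalar(0xD7FF as UInt32)
let firstSurrogate = Unicode.Scalar(0xD800 as UInt32)
let lastSurrogate = Unicode.Scalar(0xDFFF as UInt32)
let firstAfterSurrogates = Unicode.Scalar(0xE000 as UInt32)
let lastCodePoint = Unicode.Scalar(0x10FFFF as UInt32)
let beyondCodeSpace = Unicode.Scalar(0x110000 as UInt32)

print(lastBeforeSurrogates != nil, firstSurrogate != nil, lastSurrogate != nil)
print(firstAfterSurrogates != nil, lastCodePoint != nil, beyondCodeSpace != nil)
```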
Concept 2: The Code Unit is the abstract representation of the encoding.
Consider the following code snippet
let theCat = "Cat!🐱"
for char in theCat.utf8 {
    print("\(char) ", terminator: "") // each UTF-8 code unit of the string, in decimal
}
print("")
for char in theCat.utf8 {
    print("\(String(char, radix: 2)) ", terminator: "") // the same UTF-8 code units, in binary
}
print("")
for char in theCat.utf16 {
    print("\(char) ", terminator: "") // each UTF-16 code unit of the string, in decimal
}
print("")
for char in theCat.utf16 {
    print("\(String(char, radix: 2)) ", terminator: "") // the same UTF-16 code units, in binary
}
print("")
for char in theCat.unicodeScalars {
    print("\(char.value) ", terminator: "") // each Unicode scalar value (equal to its UTF-32 code unit), in decimal
}
print("")
for char in theCat.unicodeScalars {
    print("\(String(char.value, radix: 2)) ", terminator: "") // the same scalar values, in binary
}
Abstract representation means: a code unit is written here as a base-10 (decimal) number, which has the same value as the base-2 (binary) sequence of the encoding. The encoding is made for machines; the decimal code unit is more for humans, because it is easier to read than a binary sequence.
Concept 3: A character may be made of different Unicode code point(s). It depends on which scalars the grapheme cluster is constructed from (this is why I said "Characters" from the human reader's perspective at the beginning).
consider the following code snippet
let precomposed: String = "\u{D55C}"
let decomposed: String = "\u{1112}\u{1161}\u{11AB}"
print(precomposed.count) // prints "1"
print(decomposed.count) // prints "1": three scalars, but still one Character (grapheme cluster)
print(precomposed) // prints "한"
print(decomposed) // prints "한"
The strings precomposed and decomposed are visually and linguistically equal, but they are made of different Unicode code points, and so have different code units when encoded with the same encoding (see the following example):
for preCha in precomposed.utf16 {
    print("\(preCha) ", terminator: "") // prints 54620
}
print("")
for deCha in decomposed.utf16 {
    print("\(deCha) ", terminator: "") // prints 4370 4449 4523
}
Extra example
var word = "cafe"
print("the number of characters in \(word) is \(word.count)") // 4
word += "\u{301}" // COMBINING ACUTE ACCENT attaches to the final "e"
print("the number of characters in \(word) is \(word.count)") // still 4
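A slightly extended sketch of the same example (assuming Swift 5; the names are illustrative) shows that the accent changes the scalar count but not the Character count, and that the result still compares equal to the precomposed spelling:

```swift
let precomposedWord = "caf\u{E9}"   // ends in LATIN SMALL LETTER E WITH ACUTE

var word = "cafe"
let countBefore = word.count
word += "\u{301}"                   // COMBINING ACUTE ACCENT
let countAfter = word.count

let scalarCount = word.unicodeScalars.count
let equalToPrecomposed = (word == precomposedWord)   // canonical equivalence

print(countBefore, countAfter, scalarCount, equalToPrecomposed)
```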
Summary: Code Points, A.k.a the position index of the characters in Unicode, has nothing to do with UTF-8, UTF-16 and UTF-32 encoding schemes.
Further Readings:
http://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/
https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html
I think the issue is that Character doesn't represent a Unicode code point. It represents a "Unicode grapheme cluster", which can consist of multiple code points.
Instead, UnicodeScalar represents a Unicode code point.
I agree with you, there should be a way to get the code directly from character. But all I can offer is a shorthand:
let ch: Character = "A"
for code in String(ch).utf8 { println(code) }
#1. Using Unicode.Scalar's value property
With Swift 5, Unicode.Scalar has a value property that has the following declaration:
A numeric representation of the Unicode scalar.
var value: UInt32 { get }
The following Playground sample code shows how to iterate over the unicodeScalars property of a Character and print the value of each Unicode scalar that composes it:
let character: Character = "A"
for scalar in character.unicodeScalars {
print(scalar.value)
}
/*
prints: 65
*/
As an alternative, you can use the sample code below if you only want to print the value of the first unicode scalar of a Character:
let character: Character = "A"
let scalars = character.unicodeScalars
let firstScalar = scalars[scalars.startIndex]
print(firstScalar.value)
/*
prints: 65
*/
#2. Using Character's asciiValue property
If what you really want is to get the ASCII encoding value of a character, you can use Character's asciiValue. asciiValue has the following declaration:
Returns the ASCII encoding value of this Character, if ASCII.
var asciiValue: UInt8? { get }
The Playground sample code below shows how to use asciiValue:
let character: Character = "A"
print(String(describing: character.asciiValue))
/*
prints: Optional(65)
*/
let character: Character = "П"
print(String(describing: character.asciiValue))
/*
prints: nil
*/
Have you tried:
import Foundation
let characterString: String = "abc"
var numbers: [Int] = Array<Int>()
for character in characterString.utf8 {
let stringSegment: String = "\(character)"
let anInt: Int = stringSegment.toInt()!
numbers.append(anInt)
}
numbers
Output:
[97, 98, 99]
It may also be only one Character in the String.

How is the 🇩🇪 character represented in Swift strings?

Like some other emoji characters, the 0x0001F1E9 0x0001F1EA combination (German flag) is represented as a single character on screen although it is really two different Unicode character points combined. Is it represented as one or two different characters in Swift?
let flag = "\u{1f1e9}\u{1f1ea}"
then flag is 🇩🇪 .
For more regional indicator symbols, see:
http://en.wikipedia.org/wiki/Regional_Indicator_Symbol
Support for "extended grapheme clusters" has been added to Swift in the meantime.
Iterating over the characters of a string produces a single character for
the "flags":
let string = "Hi🇩🇪!"
for char in string.characters {
    print(char)
}
Output:
H
i
🇩🇪
!
Swift 3 implements Unicode in its String struct. In Unicode, all flags are pairs of Regional Indicator Symbols. So, 🇩🇪 is actually 🇩 followed by 🇪 (try copying the two and pasting them next to each other!).
In Swift 3, when two or more Regional Indicator Symbols are placed next to each other, they form an "Extended Grapheme Cluster", which means they're treated as one character. This is why "🇪🇺 = 🇫🇷🇪🇸🇩🇪...".characters gives you ["🇪🇺", " ", "=", " ", "🇫🇷🇪🇸🇩🇪", ".", ".", "."].
If you want to see every single Unicode code point (AKA "scalar"), you can use .unicodeScalars, so that "Hi🇩🇪!".unicodeScalars gives you ["H", "i", "🇩", "🇪", "!"]
tl;dr
🇩🇪 is one character (in both Swift and Unicode), which is made up of two code points (AKA scalars). Don't forget these are different! 🙂
See Also
Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?
The Swift Programming Language (Swift 3.1) - Strings and Characters - Unicode
With Swift 5, you can iterate over the unicodeScalars property of a flag emoji character in order to print the Unicode scalar values that compose it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
    print(scalar)
}
/*
prints:
🇮
🇹
*/
If you combine those scalars (that are Regional Indicator Symbols), you get a flag emoji:
let italianFlag = "🇮" + "🇹"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
Each Unicode.Scalar instance also has a property value that you can use in order to display a numeric representation of it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(scalar.value)
}
/*
prints:
127470
127481
*/
You can create Unicode scalars from those numeric representations then associate them into a string:
let scalar1 = Unicode.Scalar(127470)
let scalar2 = Unicode.Scalar(127481)
let italianFlag = String(scalar1!) + String(scalar2!)
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
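The two numeric values above follow a simple pattern: REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6 (127462), and the other letters follow in alphabetical order. Building on that, here is a hedged sketch of a helper that turns a two-letter region code into a flag string. flagEmoji(forRegionCode:) is a hypothetical function, and it does not check that the code names a real region or that a font has a glyph for it.

```swift
/// Maps a two-letter ASCII region code such as "IT" to a pair of regional indicator scalars.
/// Returns nil unless the input is exactly two ASCII letters. (Hypothetical helper.)
func flagEmoji(forRegionCode code: String) -> String? {
    let regionalIndicatorA: UInt32 = 0x1F1E6   // REGIONAL INDICATOR SYMBOL LETTER A
    let asciiA: UInt32 = 65
    let asciiZ: UInt32 = 90

    var scalars = String.UnicodeScalarView()
    for scalar in code.uppercased().unicodeScalars {
        guard scalar.value >= asciiA, scalar.value <= asciiZ,
              let indicator = Unicode.Scalar(regionalIndicatorA + scalar.value - asciiA) else {
            return nil
        }
        scalars.append(indicator)
    }
    return scalars.count == 2 ? String(scalars) : nil
}

let italy = flagEmoji(forRegionCode: "it")
print(italy ?? "invalid code", italy?.count ?? 0)
```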
If needed, you can use Unicode.Scalar's escaped(asASCII:) method in order to display a string representation of the Unicode scalars (using ASCII characters):
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(scalar.escaped(asASCII: true))
}
/*
prints:
\u{0001F1EE}
\u{0001F1F9}
*/
let italianFlag = "\u{0001F1EE}\u{0001F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
String's init(_:radix:uppercase:) may also be relevant to convert the scalar value to an hexadecimal value:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(String(scalar.value, radix: 16, uppercase: true))
}
/*
prints:
1F1EE
1F1F9
*/
let italianFlag = "\u{1F1EE}\u{1F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
Swift doesn't tell you what the internal representation of a String is. You interact with a String as a collection of Character values, each of which is an extended grapheme cluster made of one or more Unicode scalars:
for character in "Dog!🐶" {
    print(character)
}
// prints D, o, g, !, 🐶
If you want to work with a string as a sequence of UTF-8 or UTF-16 code units, use its utf8 or utf16 properties. See Strings and Characters in the docs.
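As a quick illustration of those views (a sketch with illustrative names, using an escape for the dog emoji, U+1F436), the same string reports a different length through each one:

```swift
let dog = "Dog!\u{1F436}"

let characterCount = dog.count               // Characters (extended grapheme clusters)
let scalarCount = dog.unicodeScalars.count   // Unicode scalars
let utf16Count = dog.utf16.count             // the emoji needs a surrogate pair
let utf8Count = dog.utf8.count               // the emoji needs four bytes

print(characterCount, scalarCount, utf16Count, utf8Count)
```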