How String Comparison happens in Swift

How String Comparison happens in Swift - swift

in this example:
var str1 = "hello"
var str2 = "Hello"
if str1 < str2 { print("hello is less than Hello")}
else {print("hello is more than Hello")}
on what basis it is found that str1 is greater than str2?

Swift strings are compared according to the
Unicode Collation Algorithm,
which means that (effectively),
each string is put into "Unicode Normalization Form D",
the unicode scalar values of these "decomposed" strings are compared lexicographically.
In your example, "hello" and "Hello" have the Unicode values
hello: U+0068, U+0065, U+006C, U+006C, U+006F
Hello: U+0048, U+0065, U+006C, U+006C, U+006F
and therefore "Hello" < "hello".
The "normalization" or "decomposing" is relevant e.g. for characters
with diacritical marks. As an example,
a = U+0061
ä = U+00E4
b = U+0062
have the decomposed form
a: U+0061
ä: U+0061, U+0308 // LATIN SMALL LETTER A + COMBINING DIAERESIS
b: U+0062
and therefore "a" < "ä" < "b".
For more details and examples, see What does it mean that string and character comparisons in Swift are not locale-sensitive?

The two strings are compared, character by character, using each character's Unicode value. Since h has a higher code (U+0068) than H (U+0048), str1 is "greater" than str2.
Based on Martin's comment below the question, it's slightly more complex than I stated. Please see What does it mean that string and character comparisons in Swift are not locale-sensitive? for more detail.

I think it is based on the Lexicographical Order.https://en.wikipedia.org/wiki/Lexicographical_order

In Swift 4.2 -
//Unicode Value gives you an idea why "hello" is greater than "Hello" as the length of both Strings are the same.
var str1 = "hello"
var str2 = "Hello"
if (str1 < str2){
print("hello is less than Hello")
}
else {
print("hello is more than Hello")
}
print(str1.unicodeScalars[str1.unicodeScalars.startIndex].value)
print(str2.unicodeScalars[str2.unicodeScalars.startIndex].value)

Related

Getting a random emoji/character from a unicode string

My goal is to get a random emoticon, from a list, in F#.
I started with this:
let pickOne (icons: string) : char = icons.[Helpers.random.Next(icons.Length)]
let happySymbols = "🔥😂😊😁🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😍😀"
let sadSymbols = "😭😔😒😩😢🤦🤷😱👎🤨😑😬🙄🤮😵🤯🧐😕😟😤😡🤬"
that doesn't work because:
"🔥😂😊😁🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😍😀".Length
is returning 44 as length returns the number of chars in a string, which is not working well with unicode characters.
I can't just divide by 2 because I may add some single byte characters in the string at some point.
Indexing doesn't work either:
let a = "🔥😂😊😁🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😍😀"
a.[0]
will not return 🔥 but I get some unknown character symbol.
so, plan B was: let's make this an array instead of a string:
let a = [| '🔥'; '😂'; '😊'; '😁'; '🙏'; '😎'; '💪'; '😋'; '😇'; '🎉'; '🙌'; '🤘'; '👍'; '🤑'; '🤩'; '🤪'; '🤠'; '🥳'; '😌'; '🤤'; '😍'; '😀' |]
this is not compiling, I'm getting:
Parse error Unexpected quote symbol in binding. Expected '|]' or other token.
why is that?
anyhow, I can make a list of strings and get it to work, but I'm curious: is there a "proper" way to make the first one work and take a random unicode character from a unicode string?

Asti's answer works for your purpose, but I wasn't too happy about where we landed on this. I guess I got hung up in the word "proper" in the answer. After a lot of research in various places, I got curious about the method String.EnumerateRunes, which again lead me to the type Rune. The documentation for that type is particularly enlightening about proper string handling, and what's in a Unicode UTF-8 string in .NET. I also experimented in LINQPad, and got this.
let dump x = x.Dump()
let runes = "abcABCæøåÆØÅ😂😊😁₅茨茧茦茥".EnumerateRunes().ToArray()
runes.Length |> dump
// 20
runes |> Array.iter (fun rune -> dump (string rune))
// a b c A B C æ ø å Æ Ø Å 😂 😊 😁 ₅ 茨 茧 茦 茥
dump runes
// see screenshot
let smiley = runes.[13].ToString()
dump smiley
// 😊

All strings in .NET are 16-bit unicode strings.
That's the definition of char:
Represents a character as a UTF-16 code unit.
All characters take up the minimum encoding size (2 bytes for UTF-16), up to as many bytes as required. Emojis don't fit in 2 bytes, so they align to 4 bytes, or 2 chars.
So what's the solution? align(4) all the things! (insert GCC joke here).
First we convert everything into UTF32:
let utf32 (source: string) =
Encoding.Convert(Encoding.Unicode, Encoding.UTF32, Encoding.Unicode.GetBytes(source))
Then we can pick and choose any "character":
let pick (arr: byte[]) index =
Encoding.UTF32.GetString(arr, index * 4, 4)
Test:
let happySymbols = "🔥😂😊😁🙏😎💪😋😇🎉🙌🤘👍🤑🤩🤪🤠🥳😌🤤😍😀YTHO"
pick (utf32 happySymbols) 0;;
val it : string = "🔥"
> pick (utf32 happySymbols) 22;;
val it : string = "Y"
For the actual length, just div by 4.
let surpriseMe arr =
let rnd = Random()
pick arr (rnd.Next(0, arr.Length / 4))
Hmmm
> surpriseMe (utf32 happySymbols);;
val it : string = "😍"

Alphabetic comparison - why is a less than b? [duplicate]

in this example:
var str1 = "hello"
var str2 = "Hello"
if str1 < str2 { print("hello is less than Hello")}
else {print("hello is more than Hello")}
on what basis it is found that str1 is greater than str2?

Swift strings are compared according to the
Unicode Collation Algorithm,
which means that (effectively),
each string is put into "Unicode Normalization Form D",
the unicode scalar values of these "decomposed" strings are compared lexicographically.
In your example, "hello" and "Hello" have the Unicode values
hello: U+0068, U+0065, U+006C, U+006C, U+006F
Hello: U+0048, U+0065, U+006C, U+006C, U+006F
and therefore "Hello" < "hello".
The "normalization" or "decomposing" is relevant e.g. for characters
with diacritical marks. As an example,
a = U+0061
ä = U+00E4
b = U+0062
have the decomposed form
a: U+0061
ä: U+0061, U+0308 // LATIN SMALL LETTER A + COMBINING DIAERESIS
b: U+0062
and therefore "a" < "ä" < "b".
For more details and examples, see What does it mean that string and character comparisons in Swift are not locale-sensitive?

The two strings are compared, character by character, using each character's Unicode value. Since h has a higher code (U+0068) than H (U+0048), str1 is "greater" than str2.
Based on Martin's comment below the question, it's slightly more complex than I stated. Please see What does it mean that string and character comparisons in Swift are not locale-sensitive? for more detail.

I think it is based on the Lexicographical Order.https://en.wikipedia.org/wiki/Lexicographical_order

In Swift 4.2 -
//Unicode Value gives you an idea why "hello" is greater than "Hello" as the length of both Strings are the same.
var str1 = "hello"
var str2 = "Hello"
if (str1 < str2){
print("hello is less than Hello")
}
else {
print("hello is more than Hello")
}
print(str1.unicodeScalars[str1.unicodeScalars.startIndex].value)
print(str2.unicodeScalars[str2.unicodeScalars.startIndex].value)

What does it mean that string and character comparisons in Swift are not locale-sensitive?

I started learning Swift language and I am very curious What does it mean that string and character comparisons in Swift are not locale-sensitive? Does it mean that all the characters are stored in Swift like UTF-8 characters?

(All code examples updated for Swift 3 now.)
Comparing Swift strings with < does a lexicographical comparison
based on the so-called "Unicode Normalization Form D" (which can be computed with
decomposedStringWithCanonicalMapping)
For example, the decomposition of
"ä" = U+00E4 = LATIN SMALL LETTER A WITH DIAERESIS
is the sequence of two Unicode code points
U+0061,U+0308 = LATIN SMALL LETTER A + COMBINING DIAERESIS
For demonstration purposes, I have written a small String extension which dumps the
contents of the String as an array of Unicode code points:
extension String {
var unicodeData : String {
return self.unicodeScalars.map {
String(format: "%04X", $0.value)
}.joined(separator: ",")
}
}
Now lets take some strings, sort them with <:
let someStrings = ["ǟψ", "äψ", "ǟx", "äx"].sorted()
print(someStrings)
// ["a", "ã", "ă", "ä", "ǟ", "b"]
and dump the Unicode code points of each string (in original and decomposed
form) in the sorted array:
for str in someStrings {
print("\(str) \(str.unicodeData) \(str.decomposedStringWithCanonicalMapping.unicodeData)")
}
The output
äx 00E4,0078 0061,0308,0078
ǟx 01DF,0078 0061,0308,0304,0078
ǟψ 01DF,03C8 0061,0308,0304,03C8
äψ 00E4,03C8 0061,0308,03C8
nicely shows that the comparison is done by a lexicographic ordering of the Unicode
code points in the decomposed form.
This is also true for strings of more than one character, as the following example
shows. With
let someStrings = ["ǟψ", "äψ", "ǟx", "äx"].sorted()
the output of above loop is
äx 00E4,0078 0061,0308,0078
ǟx 01DF,0078 0061,0308,0304,0078
ǟψ 01DF,03C8 0061,0308,0304,03C8
äψ 00E4,03C8 0061,0308,03C8
which means that
"äx" < "ǟx", but "äψ" > "ǟψ"
(which was at least unexpected for me).
Finally let's compare this with a locale-sensitive ordering, for example swedish:
let locale = Locale(identifier: "sv") // svenska
var someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"]
someStrings.sort {
$0.compare($1, locale: locale) == .orderedAscending
}
print(someStrings)
// ["a", "ă", "ã", "b", "ä", "ǟ"]
As you see, the result is different from the Swift < sorting.

Changing the locale can change the alphabetical order, e.g. a case-sensitive comparison can appear case-insensitive because of the locale, or more generally, the alphabetical order of two strings is different.

Lexicographical ordering and locale-sensitive ordering can be different. You can see an example of it in this question:
Sorting scala list equivalent to C# without changing C# order
In that specific case the locale-sensitive ordering placed _ before 1, whereas in a lexicographical ordering it's the opposite.
Swift comparison uses lexicographical ordering.

How can I get the Unicode code point(s) of a Character?

How can I extract the Unicode code point(s) of a given Character without first converting it to a String? I know that I can use the following:
let ch: Character = "A"
let s = String(ch).unicodeScalars
s[s.startIndex].value // returns 65
but it seems like there should be a more direct way to accomplish this using just Swift's standard library. The Language Guide sections "Working with Characters" and "Unicode" only discuss iterating through the characters in a String, not working directly with Characters.

From what I can gather in the documentation, they want you to get Character values from a String because it gives context. Is this Character encoded with UTF8, UTF16, or 21-bit code points (scalars)?
If you look at how a Character is defined in the Swift framework, it is actually an enum value. This is probably done due to the various representations from String.utf8, String.utf16, and String.unicodeScalars.
It seems they do not expect you to work with Character values but rather Strings and you as the programmer decide how to get these from the String itself, allowing encoding to be preserved.
That said, if you need to get the code points in a concise manner, I would recommend an extension like such:
extension Character
{
func unicodeScalarCodePoint() -> UInt32
{
let characterString = String(self)
let scalars = characterString.unicodeScalars
return scalars[scalars.startIndex].value
}
}
Then you can use it like so:
let char : Character = "A"
char.unicodeScalarCodePoint()
In summary, string and character encoding is a tricky thing when you factor in all the possibilities. In order to allow each possibility to be represented, they went with this scheme.
Also remember this is a 1.0 release, I'm sure they will expand Swift's syntactical sugar soon.

I think there are some misunderstandings about the Unicode. Unicode itself is NOT an encoding, it does not transform any grapheme clusters (or "Characters" from human reading respect) into any sort of binary sequence. The Unicode is just a big table which collects all the grapheme clusters used by all languages on Earth (unofficially also includes the Klingon). Those grapheme clusters are organized and indexed by the code points (a 21-bit number in swift, and looks like U+D800). You can find where the character you are looking for in the big Unicode table by using the code points
Meanwhile, the protocol called UTF8, UTF16, UTF32 is actually encodings. Yes, there are more than one ways to encode the Unicode characters into binary sequences. Using which protocol depends on the project you are working, but most of the web page is encoded by UTF-8 (you can actually check it now).
Concept 1: The Unicode point is called the Unicode Scalar in Swift
A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do not include the Unicode surrogate pair code points, which are the code points in the range U+D800 to U+DFFF inclusive.
Concept 2: The Code Unit is the abstract representation of the encoding.
Consider the following code snippet
let theCat = "Cat!🐱"
for char in theCat.utf8 {
print("\(char) ", terminator: "") //Code Unit of each grapheme cluster for the UTF-8 encoding
}
print("")
for char in theCat.utf8 {
print("\(String(char, radix: 2)) ", terminator: "") //Encoding of each grapheme cluster for the UTF-8 encoding
}
print("")
for char in theCat.utf16 {
print("\(char) ", terminator: "") //Code Unit of each grapheme cluster for the UTF-16 encoding
}
print("")
for char in theCat.utf16 {
print("\(String(char, radix: 2)) ", terminator: "") //Encoding of each grapheme cluster for the UTF-16 encoding
}
print("")
for char in theCat.unicodeScalars {
print("\(char.value) ", terminator: "") //Code Unit of each grapheme cluster for the UTF-32 encoding
}
print("")
for char in theCat.unicodeScalars {
print("\(String(char.value, radix: 2)) ", terminator: "") //Encoding of each grapheme cluster for the UTF-32 encoding
}
Abstract representation means: Code unit is written by the base-10 number (decimal number) it equals to the base-2 encoding (binary sequence). Encoding is made for the machines, Code Unit is more for humans, it is easy to read than binary sequences.
Concept 3: A character may have different Unicode point(s). It depends on how the character is contracted by what grapheme clusters, (this is why I said "Characters" from human reading respect in the beginning)
consider the following code snippet
let precomposed: String = "\u{D55C}"
let decomposed: String = "\u{1112}\u{1161}\u{11AB}"
print(precomposed.characters.count) // print "1"
print(decomposed.characters.count) // print "1" => Character != grapheme cluster
print(precomposed) //print "한"
print(decomposed) //print "한"
The character precomposed and decomposed is visually and linguistically equal, But they have different Unicode point and different code unit if they encoded by the same encoding protocol (see the following example)
for preCha in precomposed.utf16 {
print("\(preCha) ", terminator: "") //print 55357 56374 128054 54620
}
print("")
for deCha in decomposed.utf16 {
print("\(deCha) ", terminator: "") //print 4370 4449 4523
}
Extra example
var word = "cafe"
print("the number of characters in \(word) is \(word.characters.count)")
word += "\u{301}"
print("the number of characters in \(word) is \(word.characters.count)")
Summary: Code Points, A.k.a the position index of the characters in Unicode, has nothing to do with UTF-8, UTF-16 and UTF-32 encoding schemes.
Further Readings:
http://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/
https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html

I think the issue is that Character doesn't represent a Unicode code point. It represents a "Unicode grapheme cluster", which can consist of multiple code points.
Instead, UnicodeScalar represents a Unicode code point.

I agree with you, there should be a way to get the code directly from character. But all I can offer is a shorthand:
let ch: Character = "A"
for code in String(ch).utf8 { println(code) }

#1. Using Unicode.Scalar's value property
With Swift 5, Unicode.Scalar has a value property that has the following declaration:
A numeric representation of the Unicode scalar.
var value: UInt32 { get }
The following Playground sample code shows how to iterate over the unicodeScalars property of a Character and print the value of each Unicode scalar that composes it:
let character: Character = "A"
for scalar in character.unicodeScalars {
print(scalar.value)
}
/*
prints: 65
*/
As an alternative, you can use the sample code below if you only want to print the value of the first unicode scalar of a Character:
let character: Character = "A"
let scalars = character.unicodeScalars
let firstScalar = scalars[scalars.startIndex]
print(firstScalar.value)
/*
prints: 65
*/
#2. Using Character's asciiValue property
If what you really want is to get the ASCII encoding value of a character, you can use Character's asciiValue. asciiValue has the following declaration:
Returns the ASCII encoding value of this Character, if ASCII.
var asciiValue: UInt8? { get }
The Playground sample code below show how to use asciiValue:
let character: Character = "A"
print(String(describing: character.asciiValue))
/*
prints: Optional(65)
*/
let character: Character = "П"
print(String(describing: character.asciiValue))
/*
prints: nil
*/

Have you tried:
import Foundation
let characterString: String = "abc"
var numbers: [Int] = Array<Int>()
for character in characterString.utf8 {
let stringSegment: String = "\(character)"
let anInt: Int = stringSegment.toInt()!
numbers.append(anInt)
}
numbers
Output:
[97, 98, 99]
It may also be only one Character in the String.

How is the 🇩🇪 character represented in Swift strings?

Like some other emoji characters, the 0x0001F1E9 0x0001F1EA combination (German flag) is represented as a single character on screen although it is really two different Unicode character points combined. Is it represented as one or two different characters in Swift?

let flag = "\u{1f1e9}\u{1f1ea}"
then flag is 🇩🇪 .
For more regional indicator symbols, see:
http://en.wikipedia.org/wiki/Regional_Indicator_Symbol

Support for "extended grapheme clusters" has been added to Swift in the meantime.
Iterating over the characters of a string produces a single character for
the "flags":
let string = "Hi🇩🇪!"
for char in string.characters {
print(char)
}
Output:
H
i
🇩🇪
!

Swift 3 implements Unicode in its String struct. In Unicode, all flags are pairs of Regional Indicator Symbols. So, 🇩🇪 is actually 🇩 followed by 🇪 (try copying the two and pasting them next to eachother!).
When two or more Regional Indicator Symbols are placed next to eachother, they form an "Extended Grapheme Cluster", which means they're treated as one character. This is why "🇪🇺 = 🇫🇷🇪🇸🇩🇪...".characters gives you ["🇪🇺", " ", "=", " ", "🇫🇷🇪🇸🇩🇪", ".", ".", "."].
If you want to see every single Unicode code point (AKA "scalar"), you can use .unicodeScalars, so that "Hi🇩🇪!".unicodeScalars gives you ["H", "i", "🇩", "🇪", "!"]
tl;dr
🇩🇪 is one character (in both Swift and Unicode), which is made up of two code points (AKA scalars). Don't forget these are different! 🙂
See Also
Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?
The Swift Programming Language (Swift 3.1) - Strings and Characters - Unicode

With Swift 5, you can iterate over the unicodeScalars property of a flag emoji character in order to print the Unicode scalar values that compose it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(scalar)
}
/*
prints:
🇮
🇹
*/
If you combine those scalars (that are Regional Indicator Symbols), you get a flag emoji:
let italianFlag = "🇮" + "🇹"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
Each Unicode.Scalar instance also has a property value that you can use in order to display a numeric representation of it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(scalar.value)
}
/*
prints:
127470
127481
*/
You can create Unicode scalars from those numeric representations then associate them into a string:
let scalar1 = Unicode.Scalar(127470)
let scalar2 = Unicode.Scalar(127481)
let italianFlag = String(scalar1!) + String(scalar2!)
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
If needed, you can use Unicode.Scalar's escaped(asASCII:) method in order to display a string representation of the Unicode scalars (using ASCII characters):
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(scalar.escaped(asASCII: true))
}
/*
prints:
\u{0001F1EE}
\u{0001F1F9}
*/
let italianFlag = "\u{0001F1EE}\u{0001F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
String's init(_:radix:uppercase:) may also be relevant to convert the scalar value to an hexadecimal value:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(String(scalar.value, radix: 16, uppercase: true))
}
/*
prints:
1F1EE
1F1F9
*/
let italianFlag = "\u{1F1EE}\u{1F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1

Swift doesn't tell you what the internal representation of a String is. You interact with a String as a list of full-size (32-bit) Unicode code points:
for character in "Dog!🐶" {
println(character)
}
// prints D, o, g, !, 🐶
If you want to work with a string as a sequence of UTF-8 or UTF-16 code points, use its utf8 or utf16 properties. See Strings and Characters in the docs.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How String Comparison happens in Swift - swift

in this example: var str1 = "hello" var str2 = "Hello" if str1 < str2 { print("hello is less than Hello")} else {print("hello is more than Hello")} on what basis it is found that str1 is greater than str2?

I think it is based on the Lexicographical Order.https://en.wikipedia.org/wiki/Lexicographical_order

Related

Getting a random emoji/character from a unicode string

Alphabetic comparison - why is a less than b? [duplicate]

What does it mean that string and character comparisons in Swift are not locale-sensitive?

How can I get the Unicode code point(s) of a Character?

How is the 🇩🇪 character represented in Swift strings?

Categories

Resources