I've reviewed questions such as Get the length of a String and Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings? but neither covers this specific question.
This all started when trying to apply skin tone modifiers to Emoji characters (see Add skin tone modifier to an emoji programmatically). This led to wondering what happens when you apply a skin tone modifier to a regular character such as "A".
Examples:
let tonedThumbsUp = "👍" + "🏻" // 👍🏻
let tonedA = "A" + "🏾" // A🏾
I'm trying to detect that second case. The count of both of those strings is 1. And the unicodeScalars.count is 2 for both.
How do I determine if the resulting string appears as a single character when displayed? In other words, how can I determine if the skin tone modifier was applied to make a single character or not?
I've tried a few ways to dump information about the string but none give the desired result.
func dumpString(_ str: String) {
    print("Raw:", str, str.count)
    print("Scalars:", str.unicodeScalars, str.unicodeScalars.count)
    print("UTF16:", str.utf16, str.utf16.count)
    print("UTF8:", str.utf8, str.utf8.count)
    print("Range:", str.startIndex, str.endIndex)
    print("First/Last:", str.first == str.last, str.first, str.last)
}
dumpString("A🏽")
dumpString("\u{1f469}\u{1f3fe}")
Results:
Raw: A🏽 1
Scalars: A🏽 2
UTF16: A🏽 3
UTF8: A🏽 5
First/Last: true Optional("A🏽") Optional("A🏽")
Raw: 👩🏾 1
Scalars: 👩🏾 2
UTF16: 👩🏾 4
UTF8: 👩🏾 8
First/Last: true Optional("👩🏾") Optional("👩🏾")
What happens if you print 👍🏻 on a system that doesn't support the Fitzpatrick modifiers? You get 👍 followed by whatever the system uses for an unknown character placeholder.
So I think to answer this, you must consult your system's typesetter. For Apple platforms, you can use Core Text to create a CTLine and then count the line's glyph runs. Example:
import Foundation
import CoreText
func test(_ string: String) {
let richText = NSAttributedString(string: string)
let line = CTLineCreateWithAttributedString(richText as CFAttributedString)
let runs = CTLineGetGlyphRuns(line) as! [CTRun]
print(string, runs.count)
}
test("👍" + "🏻")
test("A" + "🏾")
test("B\u{0300}\u{0301}\u{0302}" + "🏾")
Output from a macOS playground in Xcode 10.2.1 on macOS 10.14.6 Beta (18G48f):
👍🏻 1
A🏾 2
B̀́̂🏾 2
I think it might be possible to reason about this by looking to see whether the modifier is present and if so whether it has increased the character count.
So for example:
let tonedThumbsUp = "👍" + "🏻"
let tonedA = "A" + "🏻"
tonedThumbsUp.count // 1
tonedThumbsUp.unicodeScalars.count // 2
tonedA.count // 2
tonedA.unicodeScalars.count // 2
let c = "\u{1F3FB}"
tonedThumbsUp.contains(c) // true
tonedA.contains(c) // true
Okay, so they both contain a modifier character, and they both contain two unicode scalars, but one is count 1 and the other is count 2. Surely that's a useful distinction.
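Because the character count for a letter followed by a modifier can depend on the Unicode version a given Swift release implements, it may be safer to look at the scalar properties directly. The following is a minimal sketch, assuming Swift 5 or later (for `Unicode.Scalar.Properties`); the function name `hasUnattachedSkinToneModifier` is hypothetical. It only checks whether each skin tone modifier directly follows an emoji modifier base, and says nothing about whether the system font can actually render the combined glyph.

```swift
/// Returns true if the string contains an emoji modifier (skin tone) scalar
/// that is not immediately preceded by an emoji modifier base scalar.
func hasUnattachedSkinToneModifier(_ string: String) -> Bool {
    var previous: Unicode.Scalar? = nil
    for scalar in string.unicodeScalars {
        if scalar.properties.isEmojiModifier {
            // The modifier only combines with a scalar that is a modifier base.
            let baseBefore = previous?.properties.isEmojiModifierBase ?? false
            if !baseBefore { return true }
        }
        previous = scalar
    }
    return false
}

let thumbsUpWithTone = "\u{1F44D}\u{1F3FB}" // thumbs up + light skin tone
let letterWithTone = "A\u{1F3FE}"           // "A" + medium-dark skin tone
print(hasUnattachedSkinToneModifier(thumbsUpWithTone)) // false
print(hasUnattachedSkinToneModifier(letterWithTone))   // true
```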
Related
I'm using this code for detecting the cursor's position in a UITextView :
if let selectedRange = textView.selectedTextRange {
let cursorPosition = textView.offset(from: textView.beginningOfDocument, to: selectedRange.start)
print("\(cursorPosition)")
}
I put this in textViewDidChange so that the cursor position is detected each time the text changes.
It works fine, but when I enter emojis, textView.text.count differs from the cursor position. Since Swift 4 each emoji is counted as one character, but the cursor position seems to be counted differently.
So how can I get the exact cursor position that matches the character count of the text?
Long story short: When using Swift with String and NSRange use this extension for Range conversion
extension String {
/// Fixes the problem with `NSRange` to `Range` conversion
var range: NSRange {
// Cover the whole string; NSRange measures it in UTF-16 code units
return NSRange(startIndex..<endIndex, in: self)
}
}
Let's take a deeper look:
let myStr = "WΓ©ll hellΓ³ βοΈ"
myStr.count // 12
myStr.unicodeScalars.count // 13
myStr.utf8.count // 19
myStr.utf16.count // 13
In Swift 4 a string is a collection of characters (a composite character like ö or an emoji counts as one character). The UTF-8 and UTF-16 views are collections of UTF-8 and UTF-16 code units respectively.
Your problem is that textView.text.count counts collection elements (an emoji or a composite character counts as one element), while NSRange counts UTF-16 code units. The difference is illustrated in the snippet above.
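To turn a UTF-16 based cursor offset into a character count, one option is to decode only the code units before the cursor and count the characters in that prefix. This is a minimal sketch using only the standard library; the function name `characterCount(beforeUTF16Offset:in:)` is hypothetical, and it assumes the offset comes from something like `textView.offset(from:to:)`.

```swift
/// Converts a UTF-16 offset (as reported by UITextView or NSRange)
/// into a number of Swift Characters before that offset.
func characterCount(beforeUTF16Offset utf16Offset: Int, in text: String) -> Int {
    // Clamp so an out-of-range offset cannot trap.
    let clamped = max(0, min(utf16Offset, text.utf16.count))
    // Decode only the code units before the cursor, then count Characters.
    let prefix = String(decoding: text.utf16.prefix(clamped), as: UTF16.self)
    return prefix.count
}

let text = "a\u{1F44D}b" // "a", thumbs up (2 UTF-16 code units), "b"
print(text.count)        // 3
print(text.utf16.count)  // 4
print(characterCount(beforeUTF16Offset: 3, in: text)) // 2
```

In the text view delegate this would be called as `characterCount(beforeUTF16Offset: cursorPosition, in: textView.text)`.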
More here:
Strings And Characters
I have apps in Go and Swift which process strings, such as finding substrings and their indices. At first it worked nicely even with multi-byte characters (e.g. emojis), using Go's utf8.RuneCountInString() and Swift's native String.
But there are some UTF8 characters that break the string length and indices for substrings, e.g. a string "Lorem ππβοΈπ€ ipsum":
Go's utf8.RuneCountInString("Lorem ππβοΈπ€ ipsum") returns 17 and the start index of ipsum is 12.
Swift's "Lorem ππβοΈπ€ ipsum".count returns 16 and the start index of ipsum is 11.
Using Swift String's utf8, utf16 or casting to NSString also gives different lengths and indices. There are also other emojis composed from multiple other emojis, like 👨‍👩‍👧‍👦, which give even funnier numbers.
This is with Go 1.8 and Swift 4.1.
Is there any way to get the same string lengths and substring indices in Go and Swift?
EDIT
I created a Swift String extension based on @MartinR's great answer:
extension String {
func runesRangeToNSRange(from: Int, to: Int) -> NSRange {
let length = to - from
let start = unicodeScalars.index(unicodeScalars.startIndex, offsetBy: from)
let end = unicodeScalars.index(start, offsetBy: length)
let range = start..<end
return NSRange(range, in: self)
}
}
In Swift a Character is an βextended grapheme cluster,β and each of "π", "π", "βοΈ", "π€", "π¨βπ©βπ§βπ¦" counts as a single character.
I have no experience with Go, but as I understand it from Strings, bytes, runes and characters in Go,
a “rune” is a Unicode code point, which essentially corresponds to a UnicodeScalar in Swift.
In your example, the difference comes from "βοΈ" which
counts as a single Swift character, but is built from two Unicode scalars:
print("βοΈ".count) // 1
print("βοΈ".unicodeScalars.count) // 2
Here is an example how you can compute the length and offsets in
terms of Unicode scalars:
let s = "Lorem ππβοΈπ€ ipsum"
print(s.unicodeScalars.count) // 17
if let idx = s.range(of: "ipsum") {
print(s.unicodeScalars.distance(from: s.startIndex, to: idx.lowerBound)) // 12
}
As you can see, this gives the same numbers as in your example from Go.
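If Foundation's range(of:) is not available, the same scalar offset can be computed with the standard library alone. This is a minimal sketch; the function name `scalarOffset(of:in:)` is hypothetical, and it compares scalars one by one without any Unicode normalization, which is an assumption about what the Go side does.

```swift
/// Returns the offset, counted in Unicode scalars, of the first occurrence
/// of `needle` in `haystack`, or nil if it does not occur.
func scalarOffset(of needle: String, in haystack: String) -> Int? {
    let h = Array(haystack.unicodeScalars)
    let n = Array(needle.unicodeScalars)
    guard !n.isEmpty, n.count <= h.count else { return nil }
    for start in 0...(h.count - n.count) {
        // Compare the window starting at `start` scalar by scalar.
        if h[start..<(start + n.count)].elementsEqual(n) {
            return start
        }
    }
    return nil
}

let sample = "Lorem \u{270C}\u{FE0F} ipsum" // victory hand + variation selector
print(sample.count)                           // 13
print(sample.unicodeScalars.count)            // 14
print(scalarOffset(of: "ipsum", in: sample)!) // 9
```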
A rune in Go identifies a specific Unicode code point; that does not necessarily mean it maps 1:1 to visually distinct characters. Some characters may be made up of multiple runes/code points, so counting runes may not give you what you'd expect from a visual inspection of the string. I don't know what "some text".count actually counts in Swift, so I can't offer any comparison there.
let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
let str2 = "🇩🇪.🇩🇪.🇩🇪.🇩🇪.🇩🇪."
println("\(countElements(str1)), \(countElements(str2))")
Result: 1, 10
But shouldn't str1 have 5 elements?
The bug only seems to occur when I use the flag emoji.
Update for Swift 4 (Xcode 9)
As of Swift 4 (tested with Xcode 9 beta) grapheme clusters break after every second regional indicator symbol, as mandated by the Unicode 9
standard:
let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
print(str1.count) // 5
print(Array(str1)) // ["🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪"]
Also String is a collection of its characters (again), so one can
obtain the character count with str1.count.
(Old answer for Swift 3 and older:)
From "3 Grapheme Cluster Boundaries"
in the "Standard Annex #29 UNICODE TEXT SEGMENTATION":
(emphasis added):
A legacy grapheme cluster is defined as a base (such as A or カ)
followed by zero or more continuing characters. One way to think of
this is as a sequence of characters that form a “stack”.
The base can be single characters, or be any sequence of Hangul Jamo
characters that form a Hangul Syllable, as defined by D133 in The
Unicode Standard, or be any sequence of Regional_Indicator (RI) characters. The RI characters are used in pairs to denote Emoji
national flag symbols corresponding to ISO country codes. Sequences of
more than two RI characters should be separated by other characters,
such as U+200B ZWSP.
(Thanks to @rintaro for the link).
A Swift Character represents an extended grapheme cluster, so it is (according
to this reference) correct that any sequence of regional indicator symbols
is counted as a single character.
You can separate the "flags" by a ZERO WIDTH NON-JOINER:
let str1 = "🇩🇪\u{200C}🇩🇪"
print(str1.characters.count) // 2
or insert a ZERO WIDTH SPACE:
let str2 = "🇩🇪\u{200B}🇩🇪"
print(str2.characters.count) // 3
This solves also possible ambiguities, e.g. should "π«βπ·βπΊβπΈ"
be "π«βπ·πΊβπΈ" or "π«π·βπΊπΈ" ?
See also How to know if two emojis will be displayed as one emoji? about a possible method
to count the number of "composed characters" in a Swift string,
which would return 5 for your let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪".
Here's how I solved that problem, for Swift 3:
let str = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪" //or whatever the string of emojis is
let range = str.startIndex..<str.endIndex
var length = 0
str.enumerateSubstrings(in: range, options: NSString.EnumerationOptions.byComposedCharacterSequences) { (substring, substringRange, enclosingRange, stop) -> () in
length = length + 1
}
print("Character Count: \(length)")
This fixes all the problems with character count and emojis, and is the simplest method I have found.
This question already has answers here:
Number of words in a Swift String for word count calculation
(7 answers)
Closed 5 years ago.
Edit: there is already a similar question, but it is about numbers separated by a specific character (Get no. Of words in swift for average calculator). This question is instead about getting the number of real words in a text, separated in various ways: a line break, several line breaks, a space, more than one space, etc.
I would like to get the number of words in a string with Swift 3.
I'm using this code, but the result is imprecise because it counts the spaces and newlines instead of the actual words.
let str = "Architects and city planners,are \ndesigning buildings to create a better quality of life in our urban areas."
// 18 words, 21 spaces, 2 lines
let components = str.components(separatedBy: .whitespacesAndNewlines)
let a = components.count
print(a)
// 23 instead of 18
Consecutive spaces and newlines aren't coalesced into one generic whitespace region, so you're simply getting a bunch of empty "words" between successive whitespace characters. Get rid of this by filtering out empty strings:
let components = str.components(separatedBy: .whitespacesAndNewlines)
let words = components.filter { !$0.isEmpty }
print(words.count) // 17
The above will print 17 because you haven't included , as a separation character, so the string "planners,are" is treated as one word.
You can break that string up as well by adding punctuation characters to the set of separators like so:
let characterSet = CharacterSet.whitespacesAndNewlines.union(.punctuationCharacters)
let components = str.components(separatedBy: characterSet)
let words = components.filter { !$0.isEmpty }
print(words.count) // 18
Now you'll see a count of 18 like you expect.
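The same count can also be obtained without Foundation. This is a minimal sketch, assuming Swift 5 or later for the `Character` properties `isWhitespace` and `isPunctuation`. Note that treating all punctuation as a separator also splits words at apostrophes and hyphens, which may or may not be what you want.

```swift
let str = "Architects and city planners,are \ndesigning buildings to create a better quality of life in our urban areas."

// split(whereSeparator:) omits empty subsequences by default,
// so runs of separators do not produce empty "words".
let words = str.split { $0.isWhitespace || $0.isPunctuation }
print(words.count) // 18
```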
I have a word that is being displayed in a label. Could I program it so that it only shows the last 2 characters of the word, or only the first 3? How can I do this?
Swift's string APIs can be a little confusing. You get access to the characters of a string via its characters property, on which you can then use prefix() or suffix() to get the substring you want. That subset of characters needs to be converted back to a String:
let str = "Hello, world!"
// first three characters:
let prefixSubstring = String(str.characters.prefix(3))
// last two characters:
let suffixSubstring = String(str.characters.suffix(2))
I agree it is definitely confusing working with String indexing in Swift and they have changed a little bit from Swift 1 to 2 making googling a bit of a challenge but it can actually be quite simple once you get a hang of the methods. You basically need to make it into a two-step process:
1) Find the index you need
2) Advance from there
For example:
let sampleString = "HelloWorld"
let lastThreeindex = sampleString.endIndex.advancedBy(-3)
sampleString.substringFromIndex(lastThreeindex) //prints rld
let secondIndex = sampleString.startIndex.advancedBy(2)
sampleString.substringToIndex(secondIndex) //prints He
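In Swift 4 and later, String is itself a collection of characters, so the intermediate characters view and the index arithmetic are no longer needed. A minimal sketch, assuming Swift 4 or newer:

```swift
let str = "Hello, world!"

// prefix(_:) and suffix(_:) return a Substring; wrap it in String to store it.
let firstThree = String(str.prefix(3)) // "Hel"
let lastTwo = String(str.suffix(2))    // "d!"
print(firstThree, lastTwo)
```

Both calls are safe when the string is shorter than the requested length; they simply return the whole string.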