let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
let str2 = "🇩🇪.🇩🇪.🇩🇪.🇩🇪.🇩🇪."
println("\(countElements(str1)), \(countElements(str2))")
Result: 1, 10
But shouldn't str1 have 5 elements?
The bug seems to occur only when I use flag emoji.
Update for Swift 4 (Xcode 9)
As of Swift 4 (tested with Xcode 9 beta) grapheme clusters break after every second regional indicator symbol, as mandated by the Unicode 9
standard:
let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
print(str1.count) // 5
print(Array(str1)) // ["🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪"]
Also String is a collection of its characters (again), so one can
obtain the character count with str1.count.
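To make the difference between characters and the underlying code points visible, here is a minimal sketch assuming Swift 4 or later. It builds the string from escaped regional indicator scalars (U+1F1E9 U+1F1EA), so it does not depend on the emoji surviving copy and paste; the names `flag` and `flags` are only illustrative.

```swift
// One flag is a pair of regional indicator scalars: U+1F1E9 (D) + U+1F1EA (E).
let flag = "\u{1F1E9}\u{1F1EA}"
let flags = String(repeating: flag, count: 5)

print(flags.count)                // 5  (grapheme clusters, i.e. Characters)
print(flags.unicodeScalars.count) // 10 (two regional indicators per flag)
print(flags.utf16.count)          // 20 (each scalar needs a surrogate pair)
```

Only the first number depends on the grapheme breaking rules; the scalar and UTF-16 counts are the same in every Swift version.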
(Old answer for Swift 3 and older:)
From "3 Grapheme Cluster Boundaries"
in the "Standard Annex #29 UNICODE TEXT SEGMENTATION":
(emphasis added):
A legacy grapheme cluster is defined as a base (such as A or カ)
followed by zero or more continuing characters. One way to think of
this is as a sequence of characters that form a “stack”.
The base can be single characters, or be any sequence of Hangul Jamo
characters that form a Hangul Syllable, as defined by D133 in The
Unicode Standard, or be any sequence of Regional_Indicator (RI) characters. The RI characters are used in pairs to denote Emoji
national flag symbols corresponding to ISO country codes. Sequences of
more than two RI characters should be separated by other characters,
such as U+200B ZWSP.
(Thanks to @rintaro for the link).
A Swift Character represents an extended grapheme cluster, so it is (according
to this reference) correct that any sequence of regional indicator symbols
is counted as a single character.
You can separate the "flags" by a ZERO WIDTH NON-JOINER:
let str1 = "🇩🇪\u{200C}🇩🇪"
print(str1.characters.count) // 2
or insert a ZERO WIDTH SPACE:
let str2 = "🇩🇪\u{200B}🇩🇪"
print(str2.characters.count) // 3
This also resolves possible ambiguities, e.g. should "🇫​🇷​🇺​🇸"
be "🇫​🇷🇺​🇸" or "🇫🇷​🇺🇸"?
See also How to know if two emojis will be displayed as one emoji? about a possible method
to count the number of "composed characters" in a Swift string,
which would return 5 for your let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪".
Here's how I solved that problem, for Swift 3:
let str = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪" // or whatever the string of emojis is
let range = str.startIndex..<str.endIndex
var length = 0
str.enumerateSubstrings(in: range, options: NSString.EnumerationOptions.byComposedCharacterSequences) { (substring, substringRange, enclosingRange, stop) -> () in
length = length + 1
}
print("Character Count: \(length)")
This fixes all the problems with character count and emojis, and is the simplest method I have found.
I've reviewed questions such as Get the length of a String and Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings? but neither covers this specific question.
This all started when trying to apply skin tone modifiers to Emoji characters (see Add skin tone modifier to an emoji programmatically). This led to wondering what happens when you apply a skin tone modifier to a regular character such as "A".
Examples:
let tonedThumbsUp = "👍" + "🏻" // 👍🏻
let tonedA = "A" + "🏾" // A🏾
I'm trying to detect that second case. The count of both of those strings is 1. And the unicodeScalars.count is 2 for both.
How do I determine if the resulting string appears as a single character when displayed? In other words, how can I determine if the skin tone modifier was applied to make a single character or not?
I've tried a few ways to dump information about the string but none give the desired result.
func dumpString(_ str: String) {
    print("Raw:", str, str.count)
    print("Scalars:", str.unicodeScalars, str.unicodeScalars.count)
    print("UTF16:", str.utf16, str.utf16.count)
    print("UTF8:", str.utf8, str.utf8.count)
    print("Range:", str.startIndex, str.endIndex)
    print("First/Last:", str.first == str.last, str.first, str.last)
}
dumpString("A🏽")
dumpString("\u{1f469}\u{1f3fe}")
Results:
Raw: A🏽 1
Scalars: A🏽 2
UTF16: A🏽 3
UTF8: A🏽 5
First/Last: true Optional("A🏽") Optional("A🏽")
Raw: 👩🏾 1
Scalars: 👩🏾 2
UTF16: 👩🏾 4
UTF8: 👩🏾 8
First/Last: true Optional("👩🏾") Optional("👩🏾")
What happens if you print 👍🏻 on a system that doesn't support the Fitzpatrick modifiers? You get 👍 followed by whatever the system uses for an unknown character placeholder.
So I think to answer this, you must consult your system's typesetter. For Apple platforms, you can use Core Text to create a CTLine and then count the line's glyph runs. Example:
import Foundation
import CoreText
func test(_ string: String) {
let richText = NSAttributedString(string: string)
let line = CTLineCreateWithAttributedString(richText as CFAttributedString)
let runs = CTLineGetGlyphRuns(line) as! [CTRun]
print(string, runs.count)
}
test("👍" + "🏻")
test("A" + "🏾")
test("B\u{0300}\u{0301}\u{0302}" + "🏾")
Output from a macOS playground in Xcode 10.2.1 on macOS 10.14.6 Beta (18G48f):
👍🏻 1
A🏾 2
B̀́̂🏾 2
I think it might be possible to reason about this by looking to see whether the modifier is present and if so whether it has increased the character count.
So for example:
let tonedThumbsUp = "👍" + "🏻"
let tonedA = "A" + "🏻"
tonedThumbsUp.count // 1
tonedThumbsUp.unicodeScalars.count // 2
tonedA.count // 2
tonedA.unicodeScalars.count // 2
let c = "\u{1F3FB}"
tonedThumbsUp.contains(c) // true
tonedA.contains(c) // true
Okay, so they both contain a modifier character, and they both contain two unicode scalars, but one is count 1 and the other is count 2. Surely that's a useful distinction.
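Since the character count for such strings can differ between Swift versions, another way to reason about it is to look at the Unicode properties of the scalars directly. The following is only a sketch, assuming Swift 5 or later (for `Unicode.Scalar.Properties`); the helper name `modifiersAreAttached(in:)` is made up for illustration. It says what the Unicode data allows, not what a particular font will actually render.

```swift
/// Hypothetical helper: returns true if every emoji modifier (skin tone)
/// scalar in `text` directly follows a scalar that Unicode marks as an
/// emoji modifier base. A string without any modifiers also returns true.
func modifiersAreAttached(in text: String) -> Bool {
    let scalars = Array(text.unicodeScalars)
    for (index, scalar) in scalars.enumerated() where scalar.properties.isEmojiModifier {
        // A modifier at the start, or after a non-base scalar such as "A",
        // cannot combine with what precedes it.
        guard index > 0, scalars[index - 1].properties.isEmojiModifierBase else {
            return false
        }
    }
    return true
}

let tonedThumbsUp = "\u{1F44D}\u{1F3FB}" // thumbs up + light skin tone
let tonedA = "A\u{1F3FB}"                // "A" + light skin tone

print(modifiersAreAttached(in: tonedThumbsUp)) // true
print(modifiersAreAttached(in: tonedA))        // false
```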
Why is characters1 not empty?
var characters1 = CharacterSet.decimalDigits
let characters2 = CharacterSet(charactersIn: "01234567890")
characters1.subtract(characters2)
print(characters1.isEmpty)
Here everything is OK
var characters1 = CharacterSet(charactersIn: "9876543210")
let characters2 = CharacterSet(charactersIn: "0123456789")
characters1.subtract(characters2)
print(characters1.isEmpty)
From the docs (emphasis mine)
Informally, this set is the set of all characters used to represent
the decimal values 0 through 9. These characters include, for example,
the decimal digits of the Indic scripts and Arabic.
Therefore, CharacterSet.decimalDigits doesn't contain only "9876543210"; it also contains the numerals of the Indic scripts (and other scripts).
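A quick way to check this is to test a non-ASCII digit for membership. The sketch below uses U+0663 (ARABIC-INDIC DIGIT THREE) as the sample; the variable names are only illustrative.

```swift
import Foundation

let asciiDigits = CharacterSet(charactersIn: "0123456789")
let arabicIndicThree: Unicode.Scalar = "\u{0663}"

// decimalDigits covers digits from many scripts, not just ASCII.
print(CharacterSet.decimalDigits.contains(arabicIndicThree)) // true
print(asciiDigits.contains(arabicIndicThree))                // false

// Subtracting the ASCII digits therefore leaves a non-empty set.
var remaining = CharacterSet.decimalDigits
remaining.subtract(asciiDigits)
print(remaining.isEmpty) // false
```

If only ASCII digits should be accepted, building the set from the literal string, as in the second snippet of the question, is the safer choice.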
I'm trying to reduce the precision of the floats that are embedded in a string.
The example is [93829.38, 1415.45467897]
I'd like to truncate the float numbers to a maximum precision of 2 (I can cut the string directly; there is no need to round the numbers).
The example is [93829.38, 1415.45]
With this regexp on Rubular I can match the float numbers in the string:
(\d+\.\d)
But I can't work out how to port this regexp to Swift and how to replace the float strings with the shortened ones...
You may use
let str = "The example is [93829.38, 1415.45467897, 1.2, 134.34]"
let pattern = "(\\d+\\.\\d{2})\\d+"
let result = str.replacingOccurrences(of: pattern, with: "$1", options: [.regularExpression])
print(result) // => The example is [93829.38, 1415.45, 1.2, 134.34]
A pattern like (\d+\.\d{2})\d+ matches and captures into Group 1 one or more digits, a dot and then two digits, and then matches one or more further digits. The replacement is $1, the backreference to the value stored in Group 1, which truncates the digits matched by the trailing \d+.
See the regex demo here.
If there are any edge cases, they can usually be handled by means of word boundaries (\b) or lookarounds.
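If the number of digits to keep needs to vary, the same pattern can be wrapped in a small function. This is just a sketch; `truncatingDecimals(in:maxDigits:)` is a hypothetical helper name, and it truncates rather than rounds, as the question asks.

```swift
import Foundation

/// Hypothetical helper: cuts every decimal number in `text` down to at most
/// `maxDigits` fractional digits (truncation, no rounding).
func truncatingDecimals(in text: String, maxDigits: Int) -> String {
    precondition(maxDigits >= 1, "maxDigits must be at least 1")
    // Group 1 keeps the integer part, the dot and the first `maxDigits` digits;
    // the trailing \d+ consumes the digits that are dropped.
    let pattern = "(\\d+\\.\\d{\(maxDigits)})\\d+"
    return text.replacingOccurrences(of: pattern, with: "$1", options: [.regularExpression])
}

let input = "The example is [93829.38, 1415.45467897, 1.2, 134.34]"
print(truncatingDecimals(in: input, maxDigits: 2))
// The example is [93829.38, 1415.45, 1.2, 134.34]
```

Numbers that already have `maxDigits` or fewer fractional digits are left untouched, because the trailing `\d+` requires at least one extra digit.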
I have apps in Go and Swift which process strings, such as finding substrings and their indices. At first it worked nicely even with multi-byte characters (e.g. emojis), using Go's utf8.RuneCountInString() and Swift's native String.
But there are some UTF-8 characters that break the string length and indices for substrings, e.g. the string "Lorem 😀😃✌️🤔 ipsum":
Go's utf8.RuneCountInString("Lorem 😀😃✌️🤔 ipsum") returns 17 and the start index of ipsum is 12.
Swift's "Lorem 😀😃✌️🤔 ipsum".count returns 16 and the start index of ipsum is 11.
Using Swift String's utf8, utf16 or casting to NSString also gives different lengths and indices. There are also emojis composed of multiple other emojis, like 👨‍👩‍👧‍👦, which give even funnier numbers.
This is with Go 1.8 and Swift 4.1.
Is there any way to get the same string lengths and substrings' indices with same values with Go and Swift?
EDIT
I created a Swift String extension based on @MartinR's great answer:
extension String {
func runesRangeToNSRange(from: Int, to: Int) -> NSRange {
let length = to - from
let start = unicodeScalars.index(unicodeScalars.startIndex, offsetBy: from)
let end = unicodeScalars.index(start, offsetBy: length)
let range = start..<end
return NSRange(range, in: self)
}
}
In Swift a Character is an “extended grapheme cluster,” and each of "😀", "😃", "✌️", "🤔", "👨‍👩‍👧‍👦" counts as a single character.
I have no experience with Go, but as I understand it from Strings, bytes, runes and characters in Go,
a “rune” is a Unicode code point, which essentially corresponds to a UnicodeScalar in Swift.
In your example, the difference comes from "✌️" which
counts as a single Swift character, but is built from two Unicode scalars:
print("✌️".count) // 1
print("✌️".unicodeScalars.count) // 2
Here is an example of how you can compute the length and offsets in
terms of Unicode scalars:
let s = "Lorem 😀😃✌️🤔 ipsum"
print(s.unicodeScalars.count) // 17
if let idx = s.range(of: "ipsum") {
print(s.unicodeScalars.distance(from: s.startIndex, to: idx.lowerBound)) // 12
}
As you can see, this gives the same numbers as in your example from Go.
A rune in Go identifies a single Unicode code point; that does not necessarily mean it maps 1:1 to visually distinct characters. Some characters are made up of multiple runes/code points, so counting runes may not give you what you'd expect from a visual inspection of the string. I don't know what "some text".count actually counts in Swift, so I can't offer any comparison there.
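For the Swift side of that comparison, here is a small sketch of what each view of a String counts, assuming Swift 5. The sample strings are written with escapes and are my own choice of examples (an emoji followed by a variation selector, and a family emoji joined with zero width joiners); `unicodeScalars.count` is the number that should line up with Go's rune count.

```swift
// U+270C VICTORY HAND followed by U+FE0F VARIATION SELECTOR-16.
let victory = "\u{270C}\u{FE0F}"
print(victory.count)                // 1 (Characters, i.e. grapheme clusters)
print(victory.unicodeScalars.count) // 2 (code points, comparable to Go runes)
print(victory.utf16.count)          // 2 (UTF-16 code units, as used by NSString)
print(victory.utf8.count)           // 6 (bytes, comparable to len() on a Go string)

// Man, woman, girl, boy joined by three U+200D ZERO WIDTH JOINERs.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
print(family.count)                // 1
print(family.unicodeScalars.count) // 7
print(family.utf16.count)          // 11
print(family.utf8.count)           // 25
```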
Edit: there is already a similar question, but it is about numbers separated by a specific character (Get no. Of words in swift for average calculator). This question is instead about getting the number of real words in a text, which may be separated in various ways: a line break, several line breaks, a space, multiple spaces, etc.
I would like to get the number of words in a string with Swift 3.
I'm using this code, but the result is imprecise because it counts the pieces between spaces and newlines rather than the actual number of words.
let str = "Architects and city planners,are \ndesigning buildings to create a better quality of life in our urban areas."
// 18 words, 21 spaces, 2 lines
let components = str.components(separatedBy: .whitespacesAndNewlines)
let a = components.count
print(a)
// 23 instead of 18
Consecutive spaces and newlines aren't coalesced into one generic whitespace region, so you're simply getting a bunch of empty "words" between successive whitespace characters. Get rid of this by filtering out empty strings:
let components = str.components(separatedBy: .whitespacesAndNewlines)
let words = components.filter { !$0.isEmpty }
print(words.count) // 17
The above will print 17 because you haven't included , as a separation character, so the string "planners,are" is treated as one word.
You can break that string up as well by adding punctuation characters to the set of separators like so:
let characterSet = CharacterSet.whitespacesAndNewlines.union(.punctuationCharacters)
let components = str.components(separatedBy: characterSet)
let words = components.filter { !$0.isEmpty }
print(words.count) // 18
Now you'll see a count of 18 like you expect.
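The split-then-filter steps above can also be folded into a single call. This is a sketch assuming Swift 5 (for the `Character` properties `isWhitespace` and `isPunctuation`), not the Swift 3 the question targets; `split` drops empty pieces by default, so no separate filter is needed.

```swift
let str = "Architects and city planners,are \ndesigning buildings to create a better quality of life in our urban areas."

// Treat any whitespace or punctuation character as a separator.
// Empty subsequences between consecutive separators are omitted by default.
let words = str.split { $0.isWhitespace || $0.isPunctuation }
print(words.count) // 18
```

As with the CharacterSet version, treating all punctuation as a separator also splits words such as "don't" or "well-known" in two, so the separator test may need adjusting for real text.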