How to get the number of real words in a text in Swift [duplicate] - swift

This question already has answers here:
Number of words in a Swift String for word count calculation
(7 answers)
Closed 5 years ago.
Edit: there is already a similar question, but it deals with numbers separated by a specific character (Get no. Of words in swift for average calculator). This question is instead about counting the real words in a text, which may be separated in various ways: a line break, several line breaks, a space, more than one space, etc.
I would like to get the number of words in a string with Swift 3.
I'm using this code, but the result is imprecise because it counts the spaces and newlines rather than the actual number of words.
let str = "Architects and city planners,are \ndesigning buildings to create a better quality of life in our urban areas."
// 18 words, 21 spaces, 2 lines
let components = str.components(separatedBy: .whitespacesAndNewlines)
let a = components.count
print(a)
// 23 instead of 18

Consecutive spaces and newlines aren't coalesced into one generic whitespace region, so you're simply getting a bunch of empty "words" between successive whitespace characters. Get rid of this by filtering out empty strings:
let components = str.components(separatedBy: .whitespacesAndNewlines)
let words = components.filter { !$0.isEmpty }
print(words.count) // 17
The above will print 17 because the comma is not included as a separator, so the string "planners,are" is treated as one word.
You can break that string up as well by adding punctuation characters to the set of separators like so:
let characterSet = CharacterSet.whitespacesAndNewlines.union(.punctuationCharacters)
let components = str.components(separatedBy: characterSet)
let words = components.filter { !$0.isEmpty }
print(words.count) // 18
Now you'll see a count of 18 like you expect.
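The two steps above (split on whitespace plus punctuation, then drop the empty pieces) can be wrapped into one function. This is only a minimal sketch: the name wordCount(in:) is a hypothetical helper, not something from the answer. Since the apostrophe is expected to fall in the punctuation set, a contraction such as "don't" would likely be counted as two words with this approach.

```swift
import Foundation

// Hypothetical helper (not part of the original answer): counts words by
// splitting on whitespace, newlines, and punctuation, then discarding the
// empty strings produced by consecutive separators.
func wordCount(in text: String) -> Int {
    let separators = CharacterSet.whitespacesAndNewlines.union(.punctuationCharacters)
    return text.components(separatedBy: separators)
        .filter { !$0.isEmpty }
        .count
}

let sample = "Architects and city planners,are \ndesigning buildings to create a better quality of life in our urban areas."
print(wordCount(in: sample)) // 18
```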

Related

UTF8 String length and indices in Go vs Swift

I have apps in Go and Swift which process strings, such as finding substrings and their indices. At first it worked nicely even with multi-byte characters (e.g. emojis), using Go's utf8.RuneCountInString() and Swift's native String.
But there are some UTF-8 characters that break the string length and indices for substrings, e.g. the string "Lorem 😂😃✌️🤔 ipsum":
Go's utf8.RuneCountInString("Lorem 😂😃✌️🤔 ipsum") returns 17 and the start index of ipsum is 12.
Swift's "Lorem 😂😃✌️🤔 ipsum".count returns 16 and the start index of ipsum is 11.
Using Swift String's utf8 or utf16 views, or casting to NSString, also gives different lengths and indices. There are also emojis composed of multiple other emojis, like 👨‍👩‍👧‍👦, which give even funnier numbers.
This is with Go 1.8 and Swift 4.1.
Is there any way to get the same string lengths and substrings' indices with same values with Go and Swift?
EDIT
I created a Swift String extension based on @MartinR's great answer:
import Foundation

extension String {
    // Converts a range of rune (Unicode scalar) offsets into an NSRange.
    func runesRangeToNSRange(from: Int, to: Int) -> NSRange {
        let length = to - from
        let start = unicodeScalars.index(unicodeScalars.startIndex, offsetBy: from)
        let end = unicodeScalars.index(start, offsetBy: length)
        let range = start..<end
        return NSRange(range, in: self)
    }
}
In Swift a Character is an “extended grapheme cluster,” and each of "😂", "😃", "✌️", "🤔", "👨‍👩‍👧‍👦" counts as a single character.
I have no experience with Go, but as I understand it from "Strings, bytes, runes and characters in Go",
a “rune” is a Unicode code point, which essentially corresponds to a UnicodeScalar in Swift.
In your example, the difference comes from "✌️" which
counts as a single Swift character, but is built from two Unicode scalars:
print("✌️".count) // 1
print("✌️".unicodeScalars.count) // 2
Here is an example of how you can compute the length and offsets in
terms of Unicode scalars:
let s = "Lorem 😂😃✌️🤔 ipsum"
print(s.unicodeScalars.count) // 17
if let idx = s.range(of: "ipsum") {
    print(s.unicodeScalars.distance(from: s.startIndex, to: idx.lowerBound)) // 12
}
As you can see, this gives the same numbers as in your example from Go.
A rune in Go identifies a single Unicode code point; that does not necessarily mean it maps 1:1 to visually distinct characters. Some characters are made up of multiple runes/code points, so counting runes may not give you what you'd expect from a visual inspection of the string. I don't know what "some text".count actually counts in Swift, so I can't offer any comparison there.
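To make the different lengths discussed in this thread concrete, here is a small sketch that prints each view's count for the sample string. The string is spelled with Unicode escapes so the exact scalars are unambiguous, and scalarOffset(of:in:) is a hypothetical helper name, not part of either answer. As far as I can tell, the unicodeScalars numbers are the ones that line up with Go's rune counts, while utf16.count corresponds to the NSString length.

```swift
import Foundation

// The sample string from the question: U+1F602, U+1F603,
// U+270C followed by U+FE0F (variation selector-16), and U+1F914.
let s = "Lorem \u{1F602}\u{1F603}\u{270C}\u{FE0F}\u{1F914} ipsum"

print(s.count)                // 16 (extended grapheme clusters)
print(s.unicodeScalars.count) // 17 (Unicode scalars, comparable to Go runes)
print(s.utf16.count)          // 20 (UTF-16 code units)
print(s.utf8.count)           // 30 (UTF-8 bytes)

// Hypothetical helper: offset of the first match of `needle`,
// measured in Unicode scalars rather than Characters.
func scalarOffset(of needle: String, in haystack: String) -> Int? {
    guard let range = haystack.range(of: needle) else { return nil }
    let scalars = haystack.unicodeScalars
    return scalars.distance(from: scalars.startIndex, to: range.lowerBound)
}

if let offset = scalarOffset(of: "ipsum", in: s) {
    print(offset) // 12
}
```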

Why two flags only form 1 character? [duplicate]

let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
let str2 = "🇩🇪.🇩🇪.🇩🇪.🇩🇪.🇩🇪."
println("\(countElements(str1)), \(countElements(str2))")
Result: 1, 10
But shouldn't str1 have 5 elements?
The bug seems to occur only when I use the flag emoji.
Update for Swift 4 (Xcode 9)
As of Swift 4 (tested with Xcode 9 beta) grapheme clusters break after every second regional indicator symbol, as mandated by the Unicode 9 standard:
let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪"
print(str1.count) // 5
print(Array(str1)) // ["🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪", "🇩🇪"]
Also String is a collection of its characters (again), so one can
obtain the character count with str1.count.
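As a rough illustration of the pairing rule described above, the following sketch builds the flag string from escaped regional indicator scalars, so it does not depend on how the emoji survive copy and paste. It assumes Swift 4 or later; the variable names are only illustrative.

```swift
// One German flag is two regional indicator scalars:
// U+1F1E9 (regional indicator D) and U+1F1EA (regional indicator E).
let flag = "\u{1F1E9}\u{1F1EA}"
let flags = String(repeating: flag, count: 5)

print(flags.count)                // 5  (one Character per pair)
print(flags.unicodeScalars.count) // 10 (two scalars per flag)

// A single trailing indicator has no partner, so it forms its own cluster.
let odd = flags + "\u{1F1E9}"
print(odd.count) // 6
```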
(Old answer for Swift 3 and older:)
From "3 Grapheme Cluster Boundaries"
in the "Standard Annex #29 UNICODE TEXT SEGMENTATION"
(emphasis added):
A legacy grapheme cluster is defined as a base (such as A or カ)
followed by zero or more continuing characters. One way to think of
this is as a sequence of characters that form a “stack”.
The base can be single characters, or be any sequence of Hangul Jamo
characters that form a Hangul Syllable, as defined by D133 in The
Unicode Standard, or be any sequence of Regional_Indicator (RI) characters. The RI characters are used in pairs to denote Emoji
national flag symbols corresponding to ISO country codes. Sequences of
more than two RI characters should be separated by other characters,
such as U+200B ZWSP.
(Thanks to @rintaro for the link).
A Swift Character represents an extended grapheme cluster, so it is (according
to this reference) correct that any sequence of regional indicator symbols
is counted as a single character.
You can separate the "flags" by a ZERO WIDTH NON-JOINER:
let str1 = "🇩🇪\u{200C}🇩🇪"
print(str1.characters.count) // 2
or insert a ZERO WIDTH SPACE:
let str2 = "🇩🇪\u{200B}🇩🇪"
print(str2.characters.count) // 3
This also resolves possible ambiguities, e.g. should "🇫​🇷​🇺​🇸"
be "🇫​🇷🇺​🇸" or "🇫🇷​🇺🇸"?
See also "How to know if two emojis will be displayed as one emoji?" about a possible method
to count the number of "composed characters" in a Swift string,
which would return 5 for your let str1 = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪".
Here's how I solved that problem, for Swift 3:
let str = "🇩🇪🇩🇪🇩🇪🇩🇪🇩🇪" // or whatever the string of emojis is
let range = str.startIndex..<str.endIndex
var length = 0
str.enumerateSubstrings(in: range, options: NSString.EnumerationOptions.byComposedCharacterSequences) { (substring, substringRange, enclosingRange, stop) -> () in
    length = length + 1
}
print("Character Count: \(length)")
This fixes all the problems with character count and emojis, and is the simplest method I have found.

How to split string into only two parts with a given separator in Swift? [duplicate]

This question already has answers here:
Swift split string at first match of a character
(3 answers)
Closed 5 years ago.
I have a requirement to split a string into 2 parts based on the first separator, for example the following source data:
1,Froederick,Frankenstien
2,Ludwig,Van,Beethoven
3,Anne Frank
Each array element above should be separated as follows:
1st Component    2nd Component
1                Froederick,Frankenstien
2                Ludwig,Van,Beethoven
3                Anne Frank
I'm familiar with String.components(separatedBy: String) but I'm not sure how to only split once, as I get 3 components for 1st string, 4 components for 2nd string. Is there a Swifty (elegant) way of doing this?
You can use split on the characters property of a string and set the maxSplits parameter to 1. For example:
let splitString = "1,Froederick,Frankenstien".characters.split(separator: ",", maxSplits: 1)
The result is an array of CharacterView slices, which need to be converted into strings, for example using map along with the init(_ characters:) initializer of String:
let strings = splitString.map { String($0) }
This should produce an array ["1", "Froederick,Frankenstien"].
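In Swift 4 and later, String is itself a collection, so split can be called on the string directly and returns Substring values. Below is a minimal sketch of the same idea under that assumption; splitOnce is a hypothetical helper name, and returning nil for the second part when no separator is present is just one possible design choice.

```swift
// Hypothetical helper: splits a line at the first separator only.
// `tail` is nil when the line contains no separator at all.
func splitOnce(_ line: String, separator: Character = ",") -> (head: String, tail: String?) {
    // omittingEmptySubsequences: false keeps empty pieces, so parts is never empty.
    let parts = line.split(separator: separator, maxSplits: 1, omittingEmptySubsequences: false)
    let head = String(parts[0])
    let tail = parts.count > 1 ? String(parts[1]) : nil
    return (head, tail)
}

let rows = ["1,Froederick,Frankenstien", "2,Ludwig,Van,Beethoven", "3,Anne Frank"]
for row in rows {
    let (head, tail) = splitOnce(row)
    print(head, tail ?? "")
}
```

Because maxSplits is 1, any further commas stay inside the second component.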

Deleting Specific Substrings in Strings [Swift] [duplicate]

This question already has answers here:
Any way to replace characters on Swift String?
(23 answers)
Closed 5 years ago.
I have a string var m = "I random don't like confusing random code." I want to delete all instances of the substring random within string m, returning string parsed with the deletions completed.
The end result would be: parsed = "I don't like confusing code."
How would I go about doing this in Swift 3.0+?
This is simple enough. One of many ways is to replace the string "random" with an empty string:
let parsed = m.replacingOccurrences(of: "random", with: "")
Note that this leaves the space after each removed word in place, so the result contains double spaces. How you handle that depends on how complex you want the replacement to be (for example, whether to remove or keep punctuation marks after random). If you want to remove random and, optionally, the space after it:
var m = "I random don't like confusing random code."
m = m.replacingOccurrences(of: "random ?", with: "", options: [.caseInsensitive, .regularExpression])
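The difference between the two calls is easiest to see side by side. The sketch below wraps the regular-expression variant in a hypothetical helper, removing(_:from:), and escapes the word first so that any regex metacharacters in it are treated literally; that escaping step is my addition, not part of the answer.

```swift
import Foundation

// Hypothetical helper: removes every occurrence of `word`
// together with one optional trailing space.
func removing(_ word: String, from text: String) -> String {
    let pattern = NSRegularExpression.escapedPattern(for: word) + " ?"
    return text.replacingOccurrences(of: pattern, with: "", options: .regularExpression)
}

let original = "I random don't like confusing random code."

// Plain replacement keeps the spaces on both sides of the removed word.
let plain = original.replacingOccurrences(of: "random", with: "")
print(plain)                                // I  don't like confusing  code.

print(removing("random", from: original))   // I don't like confusing code.
```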

How can I take a user input that may contain spaces and convert the spaces to a hyphen in Swift? [duplicate]

This question already has answers here:
Any way to replace characters on Swift String?
(23 answers)
Closed 7 years ago.
I'm trying to create a simple iOS app that takes user input (a city), searches a website for that city, and then displays the forecasts for that city.
What I'm currently stuck on, and unable to find much documentation for that isn't overwhelming, is how I can be sure that the user input will translate well to a URL if there is more than one word in the name of the city.
For example, if a user inputs Salt Lake City into my text field, how can I write an if/else statement that determines the number of spaces and, if that number is greater than 0, converts those spaces to "-"?
So far I've tried creating an array out of the string, but still can't figure out how I can append a - to each element in the array. I don't think it's possible.
Does anyone know how I can do what I'm trying to do? Or am I approaching it the incorrect way?
Here's a poor first attempt. I know this doesn't work, but hopefully it explains what I'm trying to accomplish a bit better than my text above.
var cityText = "Salt Lake City"
let cityArray = cityText.componentsSeparatedByString(" ")
let combineDashUrl = cityArray[0] + "-" + cityArray[1] + "-" + cityArray[2]
print(combineDashUrl)
Assuming there are never multiple spaces in a row you should be able to use stringByReplacingOccurrencesOfString.
let cityText = "Salt Lake City"
let newCityText = cityText.stringByReplacingOccurrencesOfString(
" ",
withString: "-")
Replacing variable numbers of spaces with a dash would be more complicated. I'd probably use regular expressions for that.
You can use map over the array of characters to transform spaces into hyphens.
let city = "Salt Lake City"
let hyphenatedCity = String(city.characters.map{$0 == " " ? "-" : $0})
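For the case of several spaces in a row, one option that avoids regular expressions is to split on whitespace, drop the empty pieces, and join the rest with a hyphen. This is only a sketch in current Swift naming (components(separatedBy:) rather than componentsSeparatedByString), and hyphenated is a hypothetical helper name.

```swift
import Foundation

// Hypothetical helper: collapses any run of whitespace into a single hyphen
// and ignores leading or trailing whitespace.
func hyphenated(_ city: String) -> String {
    return city
        .components(separatedBy: .whitespacesAndNewlines)
        .filter { !$0.isEmpty }
        .joined(separator: "-")
}

print(hyphenated("Salt Lake City"))       // Salt-Lake-City
print(hyphenated("  Salt   Lake City "))  // Salt-Lake-City
print(hyphenated("Boston"))               // Boston
```

This also works for any number of words, unlike indexing the array at fixed positions.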