Swift CharacterSet subtract

Why is characters1 not empty?
var characters1 = CharacterSet.decimalDigits
let characters2 = CharacterSet(charactersIn: "01234567890")
characters1.subtract(characters2)
print(characters1.isEmpty)
Here everything is OK
var characters1 = CharacterSet(charactersIn: "9876543210")
let characters2 = CharacterSet(charactersIn: "0123456789")
characters1.subtract(characters2)
print(characters1.isEmpty)

From the docs (emphasis mine)
Informally, this set is the set of all characters used to represent
the decimal values 0 through 9. These characters include, for example,
the decimal digits of the Indic scripts and Arabic.
Therefore, CharacterSet.decimalDigits doesn't contain only "9876543210"; it also contains numerals from the Indic scripts (and other scripts), so the subtraction leaves those behind and the set is not empty.
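To see this, you can test a member that is not an ASCII digit after the subtraction (a quick sketch; exact membership depends on the Unicode tables shipped with your platform):
import Foundation

var digits = CharacterSet.decimalDigits
digits.subtract(CharacterSet(charactersIn: "0123456789"))
// ARABIC-INDIC DIGIT FOUR (U+0664) is in the Nd category,
// so it should still be a member of the remaining set.
print(digits.contains("\u{0664}")) // expected: true
print(digits.isEmpty)              // expected: false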

Find number in string and return location and length

let myStr = "I have 4.34 apples."
I need the location and the length, because I'm using NSRange(location:length:) to bold the number 4.34.
extension String {
    func findNumbersAndBoldThem() -> NSAttributedString {
        //the code
    }
}
My suggestion is also based on a regular expression, but there is a more convenient way to get an NSRange from a Range<String.Index>:
let myStr = "I have 4.34 apples."
if let range = myStr.range(of: "\\d+\\.\\d+", options: .regularExpression) {
    let nsRange = NSRange(range, in: myStr)
    print(nsRange)
}
If you want to detect both integer and floating point values, use the pattern
"\\d+(\\.\\d+)?"
The parentheses and the trailing question mark indicate that the decimal point and the fractional digits are optional.
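For completeness, here is one possible sketch of the asked-for extension built on that approach. It assumes UIKit and a plain 17-point system font as the base attributes; adjust the fonts (or use NSFont on macOS) as needed:
import UIKit

extension String {
    // Bolds every integer or floating point number found in the string.
    func findNumbersAndBoldThem() -> NSAttributedString {
        let result = NSMutableAttributedString(
            string: self,
            attributes: [.font: UIFont.systemFont(ofSize: 17)])
        var searchRange = startIndex..<endIndex
        // Find one match at a time and continue searching after it.
        while let range = self.range(of: "\\d+(\\.\\d+)?",
                                     options: .regularExpression,
                                     range: searchRange) {
            result.addAttribute(.font,
                                value: UIFont.boldSystemFont(ofSize: 17),
                                range: NSRange(range, in: self))
            searchRange = range.upperBound..<endIndex
        }
        return result
    }
}

// "I have 4.34 apples.".findNumbersAndBoldThem() bolds "4.34".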

UTF8 String length and indices in Go vs Swift

I have apps in Go and Swift which process strings, such as finding substrings and their indices. At first it worked nicely even with multi-byte characters (e.g. emojis), using Go's utf8.RuneCountInString() and Swift's native String.
But there are some UTF8 characters that break the string length and indices for substrings, e.g. a string "Lorem πŸ˜‚πŸ˜ƒβœŒοΈπŸ€” ipsum":
Go's utf8.RuneCountInString("Lorem πŸ˜‚πŸ˜ƒβœŒοΈπŸ€” ipsum") returns 17 and the start index of ipsum is 12.
Swift's "Lorem πŸ˜‚πŸ˜ƒβœŒοΈπŸ€” ipsum".count returns 16 and the start index of ipsum is 11.
Using Swift String's utf8 or utf16 views, or casting to NSString, also gives different lengths and indices. There are also emojis composed of multiple other emojis, like πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦, which give even stranger numbers.
This is with Go 1.8 and Swift 4.1.
Is there any way to get the same string lengths and substring indices, with the same values, in both Go and Swift?
EDIT
I created a Swift String extension based on @MartinR's great answer:
extension String {
    func runesRangeToNSRange(from: Int, to: Int) -> NSRange {
        let length = to - from
        let start = unicodeScalars.index(unicodeScalars.startIndex, offsetBy: from)
        let end = unicodeScalars.index(start, offsetBy: length)
        let range = start..<end
        return NSRange(range, in: self)
    }
}
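A short usage sketch of that extension (the string and offsets here are made up for illustration; with a multi-scalar character like πŸ˜‚ in front, the resulting NSRange location differs from the rune offset, which is exactly what the conversion is for):
let s = "Lorem πŸ˜‚ ipsum"
// Suppose Go reported "ipsum" at rune offsets 8..<13 for this string.
let nsRange = s.runesRangeToNSRange(from: 8, to: 13)
print(nsRange) // expected: {9, 5}, since πŸ˜‚ is one scalar but two UTF-16 code units
if let r = Range(nsRange, in: s) {
    print(s[r]) // "ipsum"
}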
In Swift a Character is an β€œextended grapheme cluster,” and each of "πŸ˜‚", "πŸ˜ƒ", "✌️", "πŸ€”", "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦" counts as a single character.
I have no experience with Go, but as I understand it from Strings, bytes, runes and characters in Go,
a β€œrune” is a Unicode code point, which essentially corresponds to a UnicodeScalar in Swift.
In your example, the difference comes from "✌️" which
counts as a single Swift character, but is built from two Unicode scalars:
print("✌️".count) // 1
print("✌️".unicodeScalars.count) // 2
Here is an example of how you can compute the length and offsets in
terms of Unicode scalars:
let s = "Lorem πŸ˜‚πŸ˜ƒβœŒοΈπŸ€” ipsum"
print(s.unicodeScalars.count) // 17
if let idx = s.range(of: "ipsum") {
    print(s.unicodeScalars.distance(from: s.startIndex, to: idx.lowerBound)) // 12
}
As you can see, this gives the same numbers as in your example from Go.
A rune in Go identifies a specific Unicode code point; that does not necessarily mean it maps 1:1 to visually distinct characters. Some characters may be made up of multiple runes/code points, therefore counting runes may not give you what you'd expect from a visual inspection of the string. I don't know what "some text".count actually counts in Swift, so I can't offer any comparison there.
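To illustrate that last point in Swift terms, here is a small sketch showing how the different views count a ZWJ sequence such as the family emoji from the question (counts as printed with Swift 4.x; earlier Swift versions split the cluster differently):
// The family emoji built from its scalars: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
print(family.count)                // 1  (one extended grapheme cluster)
print(family.unicodeScalars.count) // 7  (4 emoji scalars + 3 zero-width joiners)
print(family.utf16.count)          // 11
print(family.utf8.count)           // 25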

Index distance to FileHandle pointer and characters encoding in Swift 4

I have this function to return (and seek) a FileHandle pointer at a specific word:
func getFilePointerIndex(atWord word: String, inFile file: FileHandle) -> UInt64? {
    let offset = file.offsetInFile
    if let str = String(data: file.readDataToEndOfFile(), encoding: .utf8) {
        if let range = str.range(of: word) {
            let intIndex = str.distance(from: str.startIndex, to: range.lowerBound)
            file.seek(toFileOffset: offset + UInt64(intIndex))
            return UInt64(intIndex) + offset
        }
    }
    return nil
}
When applied to some UTF-8 text files, it yields offset results far from the actual location of the word passed in. I suspect the cause is the character encoding (variable-length characters), since seek(toFileOffset:) works with byte offsets into the underlying Data.
Any idea how to fix it?
let intIndex = str.distance(from: str.startIndex, to: range.lowerBound)
measures the distance in Characters, i.e. β€œextended Unicode grapheme
clusters”. For example, the character "€" would be stored as three
bytes "0xE2 0x82 0xAC" in UTF-8 encoding, but counts as a single
Character.
To measure the distance in UTF-8 code units, use
let intIndex = str.utf8.distance(from: str.utf8.startIndex, to: range.lowerBound)
See also Strings in Swift 2 in the Swift blog for an overview about grapheme clusters and
the different views of a Swift string.
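Putting that into the original function, a sketch with the distance measured in UTF-8 code units (same assumptions as in the question: the file is UTF-8 encoded and is read from the current offset):
import Foundation

func getFilePointerIndex(atWord word: String, inFile file: FileHandle) -> UInt64? {
    let offset = file.offsetInFile
    if let str = String(data: file.readDataToEndOfFile(), encoding: .utf8),
       let range = str.range(of: word) {
        // Count UTF-8 code units (bytes) so that the result matches the
        // byte offsets that seek(toFileOffset:) expects.
        let byteIndex = str.utf8.distance(from: str.utf8.startIndex, to: range.lowerBound)
        file.seek(toFileOffset: offset + UInt64(byteIndex))
        return offset + UInt64(byteIndex)
    }
    return nil
}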

Why two flags only form 1 character? [duplicate]

let str1 = "πŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺ"
let str2 = "πŸ‡©πŸ‡ͺ.πŸ‡©πŸ‡ͺ.πŸ‡©πŸ‡ͺ.πŸ‡©πŸ‡ͺ.πŸ‡©πŸ‡ͺ."
println("\(countElements(str1)), \(countElements(str2))")
Result: 1, 10
But should not str1 have 5 elements?
The bug seems to occur only when I use the flag emoji.
Update for Swift 4 (Xcode 9)
As of Swift 4 (tested with Xcode 9 beta) grapheme clusters break after every second regional indicator symbol, as mandated by the Unicode 9
standard:
let str1 = "πŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺ"
print(str1.count) // 5
print(Array(str1)) // ["πŸ‡©πŸ‡ͺ", "πŸ‡©πŸ‡ͺ", "πŸ‡©πŸ‡ͺ", "πŸ‡©πŸ‡ͺ", "πŸ‡©πŸ‡ͺ"]
Also String is a collection of its characters (again), so one can
obtain the character count with str1.count.
(Old answer for Swift 3 and older:)
From "3 Grapheme Cluster Boundaries"
in the "Standard Annex #29 UNICODE TEXT SEGMENTATION":
(emphasis added):
A legacy grapheme cluster is defined as a base (such as A or γ‚«)
followed by zero or more continuing characters. One way to think of
this is as a sequence of characters that form a β€œstack”.
The base can be single characters, or be any sequence of Hangul Jamo
characters that form a Hangul Syllable, as defined by D133 in The
Unicode Standard, or be any sequence of Regional_Indicator (RI) characters. The RI characters are used in pairs to denote Emoji
national flag symbols corresponding to ISO country codes. Sequences of
more than two RI characters should be separated by other characters,
such as U+200B ZWSP.
(Thanks to @rintaro for the link).
A Swift Character represents an extended grapheme cluster, so it is (according
to this reference) correct that any sequence of regional indicator symbols
is counted as a single character.
You can separate the "flags" by a ZERO WIDTH NON-JOINER:
let str1 = "πŸ‡©πŸ‡ͺ\u{200C}πŸ‡©πŸ‡ͺ"
print(str1.characters.count) // 2
or insert a ZERO WIDTH SPACE:
let str2 = "πŸ‡©πŸ‡ͺ\u{200B}πŸ‡©πŸ‡ͺ"
print(str2.characters.count) // 3
This solves also possible ambiguities, e.g. should "πŸ‡«β€‹πŸ‡·β€‹πŸ‡Ίβ€‹πŸ‡Έ"
be "πŸ‡«β€‹πŸ‡·πŸ‡Ίβ€‹πŸ‡Έ" or "πŸ‡«πŸ‡·β€‹πŸ‡ΊπŸ‡Έ" ?
See also How to know if two emojis will be displayed as one emoji? about a possible method
to count the number of "composed characters" in a Swift string,
which would return 5 for your let str1 = "πŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺ".
Here's how I solved that problem, for Swift 3:
let str = "πŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺ" //or whatever the string of emojis is
let range = str.startIndex..<str.endIndex
var length = 0
str.enumerateSubstrings(in: range, options: NSString.EnumerationOptions.byComposedCharacterSequences) { (substring, substringRange, enclosingRange, stop) -> () in
    length = length + 1
}
print("Character Count: \(length)")
This fixes all the problems with character count and emojis, and is the simplest method I have found.
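If you want that as a reusable helper, here is a sketch (in Swift 4 and later, plain str.count already returns the same number for these flag strings):
import Foundation

extension String {
    // Counts composed character sequences, i.e. user-perceived characters.
    var composedCharacterCount: Int {
        var count = 0
        enumerateSubstrings(in: startIndex..<endIndex,
                            options: .byComposedCharacterSequences) { _, _, _, _ in
            count += 1
        }
        return count
    }
}

// Three German flags built from regional indicator scalars:
let flags = "\u{1F1E9}\u{1F1EA}\u{1F1E9}\u{1F1EA}\u{1F1E9}\u{1F1EA}"
print(flags.composedCharacterCount) // expected: 3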

NSString.rangeOfString returns unusual result with non-latin characters

I need to get the range of two words in a string, for example:
ยัฟิแก ไฟหก
(this is literally me typing PYABCD WASD) - it's a non-sensical test since I don't speak Thai.
//Find all the ranges of each word
var words: [String] = []
var ranges: [NSRange] = []
//Convert to NSString first because otherwise you get stuck with Ranges and Strings.
let nstext = backgroundTextField.stringValue as NSString //contains "ยัฟิแก ไฟหก"
words = nstext.componentsSeparatedByString(" ")
var nstextLessWordsWeHaveRangesFor = nstext //if you have two identical words this prevents just getting the first word's range
for word in words
{
    let range: NSRange = nstextLessWordsWeHaveRangesFor.rangeOfString(word)
    Swift.print(range)
    ranges.append(range)
    //create a string the same length as word
    var fillerString: String = ""
    for i in 0..<word.characters.count {
        //for var i=0;i<word.characters.count;i += 1{
        Swift.print("i: \(i)")
        fillerString = fillerString.stringByAppendingString(" ")
    }
    //remove duplicate words / letters so that we get correct range each time.
    if range.length <= nstextLessWordsWeHaveRangesFor.length
    {
        nstextLessWordsWeHaveRangesFor = nstextLessWordsWeHaveRangesFor.stringByReplacingCharactersInRange(range, withString: fillerString)
    }
}
outputs:
(0,6)
(5,4)
Those ranges are overlapping.
This causes problems down the road where I'm trying to use NSLayoutManager.enumerateEnclosingRectsForGlyphRange since the ranges are inconsistent.
How can I get the correct range (or in this specific case, non-overlapping ranges)?
Swift String characters describe "extended grapheme clusters", whereas NSString
uses UTF-16 code units, therefore the length of a string differs
depending on which representation you use.
For example, the first character "ยั" is actually the combination
of "ย" (U+0E22) with the diacritical mark " ั" (U+0E31).
That counts as one String character, but as two NSString characters.
As a consequence, indices change when you replace the word with
spaces.
The simplest solution is to stick to one, either String or NSString
(if possible). Since you are working with NSString, changing
for i in 0..<word.characters.count {
to
for i in 0..<range.length {
should solve the problem. The creation of the filler string
can be simplified to
//create a string the same length as word
let fillerString = String(count: range.length, repeatedValue: Character(" "))
Removing nstextLessWordsWeHaveRangesFor solves the issue (the block at the bottom starting with range.length <= nstextLessWordsWeHaveRangesFor.length). Modifying that variable changes the ranges found for later words and gives the unexpected output. Here is the result with the duplicate-word handling removed:
var words: [String] = []
let nstext = "ยัฟิแก ไฟหก" as NSString
words = nstext.componentsSeparatedByString(" ")
for word in words {
    let range = nstext.rangeOfString(word)
    print(range)
}
Output is: (0,6) and (7,4)
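For reference, a sketch of the same idea in current Swift syntax that collects non-overlapping NSRanges by continuing each search after the previous match (which also copes with duplicate words) instead of blanking out the found word:
import Foundation

let text = "ยัฟิแก ไฟหก"
var ranges: [NSRange] = []
var searchStart = text.startIndex
for word in text.components(separatedBy: " ") {
    if let range = text.range(of: word, range: searchStart..<text.endIndex) {
        ranges.append(NSRange(range, in: text)) // UTF-16 based, suitable for NSLayoutManager
        searchStart = range.upperBound
    }
}
print(ranges) // expected: [{0, 6}, {7, 4}]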