UTF8 String length and indices in Go vs Swift

I have apps in Go and Swift which process strings, such as finding substrings and their indices. At first this worked nicely, even with multi-byte characters (e.g. emojis), using Go's utf8.RuneCountInString() and Swift's native String.
But there are some UTF-8 characters that break the string length and substring indices, e.g. the string "Lorem πŸ˜‚πŸ˜ƒβœŒοΈπŸ€” ipsum":
Go's utf8.RuneCountInString("Lorem πŸ˜‚πŸ˜ƒβœŒοΈπŸ€” ipsum") returns 17 and the start index of ipsum is 12.
Swift's "Lorem πŸ˜‚πŸ˜ƒβœŒοΈπŸ€” ipsum".count returns 16 and the start index of ipsum is 11.
Using Swift String's utf8 or utf16 views, or casting to NSString, also gives different lengths and indices. There are also emojis composed of multiple other emojis, like πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦, which give even funnier numbers.
This is with Go 1.8 and Swift 4.1.
Is there any way to get the same string lengths and substring indices, with the same values, in both Go and Swift?
EDIT
I created a Swift String extension based on @MartinR's great answer:
import Foundation

extension String {
    // Converts a range given as Unicode scalar ("rune") offsets into an NSRange.
    func runesRangeToNSRange(from: Int, to: Int) -> NSRange {
        let length = to - from
        let start = unicodeScalars.index(unicodeScalars.startIndex, offsetBy: from)
        let end = unicodeScalars.index(start, offsetBy: length)
        let range = start..<end
        return NSRange(range, in: self)
    }
}
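A quick sanity check (my own, using the numbers from the question; from: and to: are the rune offsets Go reports for "ipsum", and the resulting NSRange is in UTF-16 units):

let s = "Lorem πŸ˜‚πŸ˜ƒβœŒοΈπŸ€” ipsum"
let nsRange = s.runesRangeToNSRange(from: 12, to: 17)
print(nsRange.location, nsRange.length)         // 15 5 (UTF-16 units)
print((s as NSString).substring(with: nsRange)) // ipsum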

In Swift a Character is an β€œextended grapheme cluster,” and each of "πŸ˜‚", "πŸ˜ƒ", "✌️", "πŸ€”", "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦" counts as a single character.
I have no experience with Go, but as I understand it from Strings, bytes, runes and characters in Go,
a β€œrune” is a Unicode code point, which essentially corresponds to a UnicodeScalar in Swift.
In your example, the difference comes from "✌️" which
counts as a single Swift character, but is built from two Unicode scalars:
print("✌️".count) // 1
print("✌️".unicodeScalars.count) // 2
Here is an example of how you can compute the length and offsets in terms of Unicode scalars:
let s = "Lorem πŸ˜‚πŸ˜ƒβœŒοΈπŸ€” ipsum"
print(s.unicodeScalars.count) // 17
if let idx = s.range(of: "ipsum") {
    print(s.unicodeScalars.distance(from: s.startIndex, to: idx.lowerBound)) // 12
}
As you can see, this gives the same numbers as in your example from Go.

A rune in Go identifies a specific Unicode code point; that does not necessarily mean it maps 1:1 to visually distinct characters. Some characters may be made up of multiple runes/code points, therefore counting runes may not give you what you'd expect from a visual inspection of the string. I don't know what "some text".count actually counts in Swift so I can't offer any comparison there.
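For example (my own illustration, in Swift terms), the family emoji from the question is a single visual character built from seven runes/scalars, four person code points joined by three U+200D zero-width joiners, so Go's rune count and Swift's unicodeScalars count agree while the visual character count does not:

let family = "πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦"
print(family.count)                // 1  (one extended grapheme cluster)
print(family.unicodeScalars.count) // 7  (4 person scalars + 3 U+200D joiners; Go's rune count)
print(family.utf8.count)           // 25 (byte count; what Go's len() reports)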

Related

Swift Dictionary is slow?

Situation: I was solving LeetCode 3 (Longest Substring Without Repeating Characters). When I used a Dictionary in Swift, the result was Time Limit Exceeded on the last test case, but the same approach in C++ actually passed with a fine runtime. I thought Swift's Dictionary was the same thing as C++'s unordered_map.
Some research: I found some resources that said to use NSDictionary over the regular one, but it requires reference types rather than value types like Int or Character.
Expected result: fast performance when accessing a Dictionary in Swift.
Question: I know there are better answers for this problem, but the main goal here is: is there an efficient way to access and write to a Dictionary, or something we can use as a substitute?
func lengthOfLongestSubstring(_ s: String) -> Int {
    var window: [Character: Int] = [:] // swift dictionary is kind of slow?
    let array = Array(s)
    var res = 0
    var left = 0, right = 0
    while right < s.count {
        let rightChar = array[right]
        right += 1
        window[rightChar, default: 0] += 1
        while window[rightChar]! > 1 {
            let leftChar = array[left]
            window[leftChar, default: 0] -= 1
            left += 1
        }
        res = max(res, right - left)
    }
    return res
}
Because count on a String has O(n) complexity, you should save the count in a variable. You can read about this in the chapter Strings and Characters of the Swift book:
Extended grapheme clusters can be composed of multiple Unicode scalars. This means that different charactersβ€”and different representations of the same characterβ€”can require different amounts of memory to store. Because of this, characters in Swift don’t each take up the same amount of memory within a string’s representation. As a result, the number of characters in a string can’t be calculated without iterating through the string to determine its extended grapheme cluster boundaries. If you are working with particularly long string values, be aware that the count property must iterate over the Unicode scalars in the entire string in order to determine the characters for that string.
The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
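A minimal sketch of that fix applied to the code from the question, hoisting the count out of the loop condition (using the already-materialized array, whose count is O(1), instead of recomputing O(n) s.count on every iteration):

func lengthOfLongestSubstring(_ s: String) -> Int {
    var window: [Character: Int] = [:]
    let array = Array(s)
    let count = array.count // O(1) on Array; s.count would be O(n) per loop check
    var res = 0
    var left = 0, right = 0
    while right < count {
        let rightChar = array[right]
        right += 1
        window[rightChar, default: 0] += 1
        while window[rightChar]! > 1 {
            let leftChar = array[left]
            window[leftChar, default: 0] -= 1
            left += 1
        }
        res = max(res, right - left)
    }
    return res
}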

How to shift a string's Range?

I have the Range of a word and its enclosing sentence within a big long String. After extracting that sentence into its own String, I'd like to know the position of the word within it.
If we were dealing with integer indexes, I would just subtract the sentence's starting index from the word's range and I'd be done. For example, if the word was in characters 10–12 and its sentence started at character 8, then I'd have a new word range of 2–4.
Here's what I've got, ready to copy&paste to a Playground:
// The Setup (this is just to get easy testing values, no need for feedback on this part)
let bigLongString = "A beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows."
let sentenceInString = bigLongString.range(of: "This every sister of the Bene Gesserit knows.")!
let wordInString = bigLongString.range(of: "sister")!
let sentence = String(bigLongString[sentenceInString])
// The Code In Question
let wordInSentence = ??? // Something that shifts the `wordInString` range
// The Test (again, just for testing. it should read "This every *sister* of the Bene Gesserit knows.")
print(sentence.replacingCharacters(in: wordInSentence,
                                   with: "*\(sentence[wordInSentence])*"))
Also, note that wordInString may refer to any instance of a given word, not just the first one. (So, re-finding the word in sentence, i.e., sentence.range(of: "sister"), won't do the trick here unfortunately.) The range needs to be shifted somehow.
Thanks for reading!
EDIT:
Introducing a slightly more complicated bigLongString reveals an issue with the solution I posted. E.g.,
let bigLongString = "Really…? Thought I had it."
let sentenceInString = bigLongString.range(of: "Thought I had it.")!
let wordInString = bigLongString.range(of: "I")!
This can get kinda tricky, depending on precisely what you need to do.
NSRange
Firstly, as you may have noticed, Range<String.Index> and NSRange are different.
Range<String.Index> is how Swift represents ranges of indices in native Swift.Strings. It's an opaque type that's only usable by the String APIs that consume it. It understands Swift strings as collections of Swift.Characters, which represent what Unicode calls "extended grapheme clusters".
NSRange is the older range representation, used by Objective-C to represent ranges in Foundation.NSStrings. It's an open container, containing a "start" location and a length. Importantly, NSRange and NSString understand strings as collections of UTF-16 code units.
Because NSRange and NSString expose so many of their internals, they haven't undergone the same migration from UTF-16 to UTF-8 that Swift.String underwent, a migration most people probably didn't even notice, since Swift.String guards its implementation details much more closely than NSString does.
NSRange is more amenable to the kinds of simple operations you might be looking for. You can offset the start location just like you describe. However, you need to be careful that the resulting range doesn't start/end in the middle of an extended grapheme cluster. In that case, slicing could lead to a substring with invalid Unicode characters (for example, you might accidentally cut an "e" away from its accent; the accent modifier isn't valid on its own without the "e").
Bridging back and forth between NSRange and Range<String.Index> is possible, but can be error-prone if you're not careful. For that reason, I suggest you minimize conversions by using either NSRange or Range<String.Index> exclusively, rather than mixing the two too much.
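For illustration, a minimal sketch of that NSRange route, using the variable names from your setup (my own example; it bridges only at the edges, and assumes the shifted range doesn't split a grapheme cluster):

import Foundation

let wordNSRange = NSRange(wordInString, in: bigLongString)
let sentenceNSRange = NSRange(sentenceInString, in: bigLongString)

// Both locations are UTF-16 offsets into bigLongString, so plain integer
// arithmetic shifts the word's location relative to the sentence's start.
let shifted = NSRange(location: wordNSRange.location - sentenceNSRange.location,
                      length: wordNSRange.length)

let sentence = String(bigLongString[sentenceInString])
if let wordInSentence = Range(shifted, in: sentence) {
    print(sentence[wordInSentence]) // "I" with the edited example strings
}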
replacingCharacters(in:with:)
I suspect you're only using this as an example of consuming wordInSentence, but it's still worth noting that:
[Foundation.NSString.replacingCharacters(in:with:)](https://developer.apple.com/documentation/foundation/nsstring/1412937-replacingoccurrences) is an NSString API that's imported onto Swift.String when Foundation is imported. It accepts an NSRange. If you're dealing with Range<String.Index>, you should use its Swift-native counterpart, Swift.String.replaceSubrange(_:with:).
Substring is your friend
Don't fight it; unless you absolutely need sentence to be a String, keep it as a Substring for the duration of these short-lived processing actions. Not only does this save you a copy of the string's contents, but it also makes it so that the indices can be shared between the slice and the parent string. This is valid:
let sentence = bigLongString[sentenceInString]
print(sentence[wordInString])
or even just: bigLongString[sentenceInString][wordInString] or bigLongString[wordInString]
Shifting around
I couldn't find a native solution for this, so I rolled my own. I could definitely be missing something simpler, but here's what I came up with:
import Foundation

struct SubstringOffset {
    let offset: String.IndexDistance
    let parent: String

    init(of substring: Substring, in parent: String) {
        self.offset = parent.distance(from: parent.startIndex, to: substring.startIndex)
        self.parent = parent
    }

    func convert(indexInParent: String.Index, toIndexIn newString: String) -> String.Index {
        let distance = parent.distance(from: parent.startIndex, to: indexInParent)
        let distanceInNewString = distance - offset
        return newString.index(newString.startIndex, offsetBy: distanceInNewString)
    }

    func convert(rangeInParent: Range<String.Index>, toRangeIn newString: String) -> Range<String.Index> {
        let newLowerBound = self.convert(indexInParent: rangeInParent.lowerBound, toIndexIn: newString)
        let span = self.parent.distance(from: rangeInParent.lowerBound, to: rangeInParent.upperBound)
        let newUpperBound = newString.index(newLowerBound, offsetBy: span)
        return newLowerBound ..< newUpperBound
    }
}
// The Setup (this is just to get easy testing values, no need for feedback on this part)
let bigLongString = "Really…? Thought I had it."
let sentenceInString = bigLongString.range(of: "Thought I had it.")!
let wordInString = bigLongString.range(of: "I")!
var sentence: String = String(bigLongString[sentenceInString])
let offset = SubstringOffset(of: bigLongString[sentenceInString], in: bigLongString)
// The Code In Question
let wordInSentence: Range<String.Index> = offset.convert(rangeInParent: wordInString, toRangeIn: sentence)
sentence.replaceSubrange(wordInSentence, with: "*\(sentence[wordInSentence])*")
print(sentence)
OK, this is what I've come up with. It appears to work OK for both examples in the question.
We use the String instance method distance(from:to:) to get the distance between the bigLongString start and the sentence start. (Analogous to the "8" in the question.) Then the word range is shifted back by this amount by shifting the upper and lower bounds separately, and then reforming them into a Range.
let wordStartInSentence = bigLongString.distance(from: sentenceInString.lowerBound,
                                                 to: wordInString.lowerBound)
let wordEndInSentence = bigLongString.distance(from: sentenceInString.lowerBound,
                                               to: wordInString.upperBound)

let wordStart = sentence.index(sentence.startIndex, offsetBy: wordStartInSentence)
let wordEnd = sentence.index(sentence.startIndex, offsetBy: wordEndInSentence)
let wordInSentence = wordStart..<wordEnd
EDIT: Updated answer to work for the more complicated bigLongString example (and coincidentally also reduce the "string walking," especially when bigLongString is very big).

Reduce float precision using RegExp in swift

I'm trying to reduce the precision of the floats that are embedded in a string.
The example is [93829.38, 1415.45467897]
I'd like to cut the float numbers down to a maximum precision of 2 decimal places (I can cut the string directly; there's no need to round the numbers somehow).
The example is [93829.38, 1415.45]
with this regexp on rubular I can get float numbers in the string:
(\d+\.\d)
But I can't figure out how to port this regexp to Swift, nor how to substitute the float strings with the shorter ones...
You may use
let str = "The example is [93829.38, 1415.45467897, 1.2, 134.34]"
let pattern = "(\\d+\\.\\d{2})\\d+"
let result = str.replacingOccurrences(of: pattern, with: "$1", options: [.regularExpression])
print(result) // => The example is [93829.38, 1415.45, 1.2, 134.34]
A pattern like (\d+\.\d{2})\d+ will match and capture into Group 1 one or more digits, a dot, and then two digits, and will then match one or more digits. The replacement is $1, the backreference to the value stored in Group 1, thus truncating the digits matched by the last \d+.
If there are any edge cases, they can usually be handled by means of word boundaries (\b) or lookarounds.
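If you prefer working with NSRegularExpression directly (for example, to enumerate matches), the same pattern and template carry over; a minimal sketch:

import Foundation

let str = "The example is [93829.38, 1415.45467897, 1.2, 134.34]"
let regex = try! NSRegularExpression(pattern: "(\\d+\\.\\d{2})\\d+", options: [])
let fullRange = NSRange(str.startIndex..., in: str)
let result = regex.stringByReplacingMatches(in: str,
                                            options: [],
                                            range: fullRange,
                                            withTemplate: "$1")
print(result) // => The example is [93829.38, 1415.45, 1.2, 134.34]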

Why two flags only form 1 character? [duplicate]

let str1 = "πŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺ"
let str2 = "πŸ‡©πŸ‡ͺ.πŸ‡©πŸ‡ͺ.πŸ‡©πŸ‡ͺ.πŸ‡©πŸ‡ͺ.πŸ‡©πŸ‡ͺ."
println("\(countElements(str1)), \(countElements(str2))")
Result: 1, 10
But should not str1 have 5 elements?
The bug seems to occur only when I use the flag emoji.
Update for Swift 4 (Xcode 9)
As of Swift 4 (tested with Xcode 9 beta) grapheme clusters break after every second regional indicator symbol, as mandated by the Unicode 9
standard:
let str1 = "πŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺ"
print(str1.count) // 5
print(Array(str1)) // ["πŸ‡©πŸ‡ͺ", "πŸ‡©πŸ‡ͺ", "πŸ‡©πŸ‡ͺ", "πŸ‡©πŸ‡ͺ", "πŸ‡©πŸ‡ͺ"]
Also String is a collection of its characters (again), so one can
obtain the character count with str1.count.
(Old answer for Swift 3 and older:)
From "3 Grapheme Cluster Boundaries"
in the "Standard Annex #29 UNICODE TEXT SEGMENTATION":
(emphasis added):
A legacy grapheme cluster is defined as a base (such as A or γ‚«)
followed by zero or more continuing characters. One way to think of
this is as a sequence of characters that form a β€œstack”.
The base can be single characters, or be any sequence of Hangul Jamo
characters that form a Hangul Syllable, as defined by D133 in The
Unicode Standard, or be any sequence of Regional_Indicator (RI) characters. The RI characters are used in pairs to denote Emoji
national flag symbols corresponding to ISO country codes. Sequences of
more than two RI characters should be separated by other characters,
such as U+200B ZWSP.
(Thanks to @rintaro for the link.)
A Swift Character represents an extended grapheme cluster, so it is (according
to this reference) correct that any sequence of regional indicator symbols
is counted as a single character.
You can separate the "flags" by a ZERO WIDTH NON-JOINER:
let str1 = "πŸ‡©πŸ‡ͺ\u{200C}πŸ‡©πŸ‡ͺ"
print(str1.characters.count) // 2
or insert a ZERO WIDTH SPACE:
let str2 = "πŸ‡©πŸ‡ͺ\u{200B}πŸ‡©πŸ‡ͺ"
print(str2.characters.count) // 3
This also resolves possible ambiguities, e.g. should "πŸ‡«β€‹πŸ‡·β€‹πŸ‡Ίβ€‹πŸ‡Έ" be "πŸ‡«β€‹πŸ‡·πŸ‡Ίβ€‹πŸ‡Έ" or "πŸ‡«πŸ‡·β€‹πŸ‡ΊπŸ‡Έ"?
See also How to know if two emojis will be displayed as one emoji? about a possible method
to count the number of "composed characters" in a Swift string,
which would return 5 for your let str1 = "πŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺ".
Here's how I solved that problem, for Swift 3:
let str = "πŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺπŸ‡©πŸ‡ͺ" // or whatever the string of emojis is
let range = str.startIndex..<str.endIndex
var length = 0
str.enumerateSubstrings(in: range, options: NSString.EnumerationOptions.byComposedCharacterSequences) { (substring, substringRange, enclosingRange, stop) -> () in
    length = length + 1
}
print("Character Count: \(length)")
This fixes all the problems with character count and emojis, and is the simplest method I have found.

Emulating Python's `index(separator, start_index)` in Rust

I'm currently porting a library from Python to Rust and found a line for which I'm having trouble finding the right "translation":
right = s.index(sep, left)
where right is the index of the first instance of sep found in string s that is after index left.
A simple example of this can be seen here:
Python 3
>>> s = "Hello, my name is erip and my favorite color is green."
>>> right = s.index("my", 10) # Find index of first instance of 'my' after index 10
>>> print right
27
>>> print s[27:]
my favorite color is green.
My attempt in Rust is:
// s: &str, sep: &str, left: usize
let right = s[left..].find(sep).unwrap() + left;
This will search the bytes after left for sep. This seems to work when using ASCII characters. There seems to be a problem when using Unicode, though:
Python 3
>>> s = "Hello, mΓΏ name is erip and mΓΏ favorite color is green."
>>> right = s.index("mΓΏ", 10)
>>> print(right)
27
Rust
fn main() {
    let sep: &str = "mΓΏ";
    let left: usize = 10;
    let s: &str = "Hello, mΓΏ name is erip and mΓΏ favorite color is green.";
    let right = s[left..].find(sep).unwrap() + left;
    println!("{}", right); // prints 28
}
I realize that Python 2 would also give 28 because it doesn't support Unicode natively, but I'd like to mimic Python 3's results.
The problem is that usize in Rust refers to a number of bytes into the string: "mΓΏ" actually requires 3 bytes to encode in UTF-8. How can I achieve the desired behavior in Rust?
I'm using rustc 1.4.0.
Let's restate the problem a bit as it's unclear what the unit of index should be. Humans believe that strings are easy because we've been using them for most of our lives. However, things aren't nearly as easy as we'd like.
Rust takes the point-of-view that strings (&str or String) are UTF-8 encoded sequences of bytes. Jumping into a string using a byte offset is O(1), and you really want that level of performance guarantee to build more complicated things upon.
I don't know what Python considers that index to be. It gets hard once you get beyond simple encoding schemes like ASCII where one character is one byte. There are multiple ways to chunk a Unicode string depending on what you want. Two obvious ones are by Unicode codepoint and by grapheme.
Since codepoints can be represented in Rust using a char, that's what I assume you want. However, you are the only one that can figure that out.
Additionally, since you requested that the result be 28, that must be the number of bytes into the string. It's a little odd to skip N codepoints but return bytes, but it is what it is.
Now that we know what we are doing... let's try to do it. (See the next solution, where I read the desired outcome better.)
The key thing you need to use is char_indices. This is an O(n) operation that walks through the string and gives you each codepoint and its corresponding byte offset.
Then, it's just a matter of putting that together and correctly handling cases of walking off the end of the string. This is made obvious by Rust's strong types, hooray!
// `index` is the number of Unicode codepoints to skip.
// The result is the number of **bytes** into the haystack
// at which the needle can be found.
fn python_index(haystack: &str, needle: &str, index: usize) -> Option<usize> {
    haystack.char_indices().nth(index).and_then(|(byte_idx, _)| {
        let leftover = &haystack[byte_idx..];
        leftover.find(needle).map(|inner_idx| inner_idx + byte_idx)
    })
}

fn main() {
    let right = python_index("Hello, mΓΏ name is erip and mΓΏ favorite color is green.", "mΓΏ", 10);
    println!("{:?}", right); // prints Some(28)
}
We apply the same high-level concept as above, but once we have found the needle, we reset and iterate through the codepoints again. When we reach the same byte offset as the found substring, we terminate.
Then it's just a matter of counting the characters we saw and adding the number that we already skipped.
// `index` is the number of Unicode codepoints to skip.
// The result is the number of codepoints into the haystack
// at which the needle can be found.
fn python_index(haystack: &str, needle: &str, index: usize) -> Option<usize> {
    haystack.char_indices().nth(index).and_then(|(byte_idx, _)| {
        let leftover = &haystack[byte_idx..];
        leftover.find(needle).map(|inner_offset| {
            leftover.char_indices().take_while(|&(inner_inner_offset, _)| {
                inner_inner_offset != inner_offset
            }).count() + index
        })
    })
}

fn main() {
    let right = python_index("Hello, mΓΏ name is erip and mΓΏ favorite color is green.", "mΓΏ", 10);
    println!("{:?}", right); // prints Some(27)
}
This certainly doesn't feel super-efficient; you'd want to benchmark it to see how it fares. However, the find implementation is pretty optimized, so I'd rather use it and then make a straight pass through the characters, trusting the cache and prefetching to help me out ^_^