Emulating Python's `index(separator, start_index)` in Rust - unicode

I'm currently porting a library from Python to Rust and found a line for which I'm having trouble finding the right "translation":
right = s.index(sep, left)
where right is the index of the first instance of sep found in string s that is after index left.
A simple example of this can be seen here:
Python 3
>>> s = "Hello, my name is erip and my favorite color is green."
>>> right = s.index("my", 10) # Find index of first instance of 'my' after index 10
>>> print(right)
27
>>> print(s[27:])
my favorite color is green.
My attempt in Rust is:
// s: &str, sep: &str, left: usize
let right = s[left..].find(sep).unwrap() + left;
This will search the bytes after left for sep. This seems to work when using ASCII characters. There seems to be a problem when using Unicode, though:
Python 3
>>> s = "Hello, mÿ name is erip and mÿ favorite color is green."
>>> right = s.index("mÿ", 10)
>>> print(right)
27
Rust
fn main() {
let sep: &str = "mÿ";
let left: usize = 10;
let s: &str = "Hello, mÿ name is erip and mÿ favorite color is green.";
let right = s[left..].find(sep).unwrap() + left;
println!("{}", right); //prints 28
}
I realize that Python 2 would also give 28 because it doesn't support Unicode natively, but I'd like to mimic Python 3's results.
The problem is that string indices (usize) in Rust count bytes, and "mÿ" takes 3 bytes to encode in UTF-8, whereas Python 3 counts codepoints. How can I get the Python 3 behavior in Rust?
I'm using rustc 1.4.0.

Let's restate the problem a bit as it's unclear what the unit of index should be. Humans believe that strings are easy because we've been using them for most of our lives. However, things aren't nearly as easy as we'd like.
Rust takes the point-of-view that strings (&str or String) are UTF-8 encoded sequences of bytes. Jumping into a string using a byte offset is O(1), and you really want that level of performance guarantee to build more complicated things upon.
I don't know what Python considers that index to be. It gets hard once you get beyond simple encoding schemes like ASCII where one character is one byte. There are multiple ways to chunk a Unicode string depending on what you want. Two obvious ones are by Unicode codepoint and by grapheme.
Since codepoints can be represented in Rust using a char, that's what I assume you want. However, you are the only one that can figure that out.
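To make the difference between the units concrete, here is a minimal sketch using only the standard library. The helper names `unit_counts` and `byte_offset_of_char` are made up for illustration. Rust's std has no grapheme segmentation, so only bytes and codepoints are shown.

```rust
/// Returns (length in bytes, length in codepoints) for a string.
/// `unit_counts` is a hypothetical helper, not part of std.
fn unit_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Converts a codepoint index into the byte offset where that codepoint starts.
/// Returns None when `char_idx` is at or past the end of the string.
/// `byte_offset_of_char` is a hypothetical helper, not part of std.
fn byte_offset_of_char(s: &str, char_idx: usize) -> Option<usize> {
    s.char_indices().nth(char_idx).map(|(byte_idx, _)| byte_idx)
}
```

For "mÿ" the two counts differ (3 bytes, 2 codepoints), which is exactly why a codepoint index of 10 and a byte index of 10 can point at different places in the question's string.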
Additionally, since you requested that the result be 28, that must be the number of bytes into the string. It's a little odd to skip N codepoints but return bytes, but it is what it is.
Now that we know what we are doing... let's try and do it. (See next solution where I read the desired outcome better).
The key thing you need to use is char_indices. This is an O(n) operation that walks through the string and gives you each codepoint and its corresponding byte offset.
Then, it's just a matter of putting that together and correctly handling cases of walking off the end of the string. This is made obvious by Rust's strong types, hooray!
// `index` is the number of Unicode codepoints to skip.
// The result is the **byte** offset in the haystack
// at which the needle is found.
fn python_index(haystack: &str, needle: &str, index: usize) -> Option<usize> {
haystack.char_indices().nth(index).and_then(|(byte_idx, _)| {
let leftover = &haystack[byte_idx..];
leftover.find(needle).map(|inner_idx| inner_idx + byte_idx)
})
}
fn main() {
let right = python_index("Hello, mÿ name is erip and mÿ favorite color is green.", "mÿ", 10);
println!("{:?}", right); // prints Some(28)
}
To return a codepoint index instead (27, matching Python 3), we use the same high-level approach as above, but once we have found the needle, we reset and iterate through the codepoints again. When we reach the byte offset of the substring, we terminate.
Then it's just a matter of counting the characters we saw and adding the number that we already skipped.
// `index` is the number of Unicode codepoints to skip.
// The result is the codepoint offset in the haystack
// at which the needle is found.
fn python_index(haystack: &str, needle: &str, index: usize) -> Option<usize> {
haystack.char_indices().nth(index).and_then(|(byte_idx, _)| {
let leftover = &haystack[byte_idx..];
leftover.find(needle).map(|inner_offset| {
leftover.char_indices().take_while(|&(inner_inner_offset, _)| {
inner_inner_offset != inner_offset
}).count() + index
})
})
}
fn main() {
let right = python_index("Hello, mÿ name is erip and mÿ favorite color is green.", "mÿ", 10);
println!("{:?}", right); // prints Some(27)
}
This certainly feels not super-efficient; you'd want to benchmark to see how it fares. However, the find implementation is pretty optimized, so I'd rather use it and then do a straight-shot through the characters and trust in the cache and prefetching to help me out ^_^.
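As a possible variation on the second version above, the `take_while` comparison can be replaced by counting the codepoints in the slice between the starting point and the match. This is only a sketch: the name `python_index_chars` is made up, and it has not been benchmarked against the original.

```rust
/// Both `index` and the result are counted in Unicode codepoints.
/// `python_index_chars` is a hypothetical name used for illustration.
fn python_index_chars(haystack: &str, needle: &str, index: usize) -> Option<usize> {
    // Byte offset at which the `index`-th codepoint starts (None if out of range).
    let start = haystack.char_indices().nth(index).map(|(byte_idx, _)| byte_idx)?;
    // Byte offset of the match, relative to the whole haystack.
    let found = start + haystack[start..].find(needle)?;
    // Count only the codepoints between the starting point and the match.
    Some(index + haystack[start..found].chars().count())
}
```

With the string from the question, `python_index_chars(s, "mÿ", 10)` yields `Some(27)`, the same value Python 3 reports. Unlike Python's `index`, a missing needle gives `None` instead of raising an error.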

Related

Swift Dictionary is slow?

Situation: I was solving LeetCode 3, Longest Substring Without Repeating Characters. When I use a Dictionary in Swift, the result is Time Limit Exceeded on the last test case, but the same approach in C++ actually passes with a fine runtime. I thought a Swift Dictionary was the same thing as std::unordered_map.
Some research: I found some resources that say to use NSDictionary over the regular one, but it requires reference types instead of Int, Character, etc.
Expected result: fast performance in accessing Dictionary in Swift
Question: I know there are better answers to the problem, but the main goal here is: is there an efficient way to read and write a Dictionary, or something we can use as a substitute?
func lengthOfLongestSubstring(_ s: String) -> Int {
var window:[Character:Int] = [:] //swift dictionary is kind of slow?
let array = Array(s)
var res = 0
var left = 0, right = 0
while right < s.count {
let rightChar = array[right]
right += 1
window[rightChar, default: 0] += 1
while window[rightChar]! > 1 {
let leftChar = array[left]
window[leftChar, default: 0] -= 1
left += 1
}
res = max(res, right - left)
}
return res
}
Because count on a String is O(n), you should compute it once and store it in a variable instead of evaluating s.count on every loop iteration. See the chapter
Strings and Characters in the Swift book:
Extended grapheme clusters can be composed of multiple Unicode scalars. This means that different characters—and different representations of the same character—can require different amounts of memory to store. Because of this, characters in Swift don’t each take up the same amount of memory within a string’s representation. As a result, the number of characters in a string can’t be calculated without iterating through the string to determine its extended grapheme cluster boundaries. If you are working with particularly long string values, be aware that the count property must iterate over the Unicode scalars in the entire string in order to determine the characters for that string.
The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
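The same pitfall exists in other languages with variable-width string encodings. As a rough comparison, here is the sliding-window loop sketched in Rust, where `s.chars().count()` is also O(n). The characters are collected into a vector once and the length is read a single time before the loop. Note that this counts Unicode scalar values (`char`), not extended grapheme clusters as Swift's Character does, so it is an analogy rather than a port.

```rust
use std::collections::HashMap;

/// Sliding-window solution; the function name mirrors the question's Swift code.
fn length_of_longest_substring(s: &str) -> usize {
    // Collect once so that indexing and length are O(1) inside the loop.
    let chars: Vec<char> = s.chars().collect();
    let len = chars.len(); // computed once, not per iteration
    let mut window: HashMap<char, usize> = HashMap::new();
    let mut left: usize = 0;
    let mut right: usize = 0;
    let mut best: usize = 0;
    while right < len {
        let right_char = chars[right];
        right += 1;
        *window.entry(right_char).or_insert(0) += 1;
        // Shrink from the left until `right_char` is unique in the window again.
        while window[&right_char] > 1 {
            let left_char = chars[left];
            *window.entry(left_char).or_insert(0) -= 1;
            left += 1;
        }
        best = best.max(right - left);
    }
    best
}
```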

How to shift a string's Range?

I have the Range of a word and its enclosing sentence within a big long String. After extracting that sentence into its own String, I'd like to know the position of the word within it.
If we were dealing with integer indexes, I would just subtract the sentence's starting index from the word's range and I'd be done. For example, if the word was in characters 10–12 and its sentence started at character 8, then I'd have a new word range of 2–4.
Here's what I've got, ready to copy&paste to a Playground:
// The Setup (this is just to get easy testing values, no need for feedback on this part)
let bigLongString = "A beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows."
let sentenceInString = bigLongString.range(of: "This every sister of the Bene Gesserit knows.")!
let wordInString = bigLongString.range(of: "sister")!
let sentence = String(bigLongString[sentenceInString])
// The Code In Question
let wordInSentence = ??? // Something that shifts the `wordInString` range
// The Test (again, just for testing. it should read "This every *sister* of the Bene Gesserit knows.")
print(sentence.replacingCharacters(in: wordInSentence,
with: "*\(sentence[wordInSentence])*"))
Also, note that wordInString may refer to any instance of a given word, not just the first one. (So, re-finding the word in sentence, i.e., sentence.range(of: "sister"), won't do the trick here unfortunately.) The range needs to be shifted somehow.
Thanks for reading!
EDIT:
Introducing a slightly more complicated bigLongString seems to break the solution I posted. E.g.,
let bigLongString = "Really…? Thought I had it."
let sentenceInString = bigLongString.range(of: "Thought I had it.")!
let wordInString = bigLongString.range(of: "I")!
This can get kinda tricky, depending on precisely what you need to do.
NSRange
Firstly, as you may have noticed, Range<String.Index> and NSRange are different.
Range<String.Index> is how Swift represents ranges of indices in native Swift.Strings. It's an opaque type that's only usable by the String APIs that consume it. It understands Swift strings as collections of Swift.Characters, which represent what Unicode calls "extended grapheme clusters".
NSRange is the older range representation, used by Objective C to represent ranges in Foundation.NSStrings. It's an open container, containing a "start" location and a length. Importantly, NSRange and NSString treat strings as collections of UTF-16 code units.
Because NSRange and NSString expose so many of their internals, they haven't undergone the same migration from utf16 to utf8 that Swift.String underwent. A migration that most people probably didn't even notice, since Swift.String guarded its implementation details much more than NSString did.
NSRange is more amenable to the kinds of simple operations you might be looking for. You can offset the start location just like you describe. However, you need to be careful that the resulting range doesn't start/end in the middle of an extended grapheme cluster. In that case, slicing could lead to a substring with invalid Unicode characters (for example, you might accidentally cut an e away from its accent; the accent modifier isn't valid on its own without the e).
Bridging back and forth between NSRange and Range<String.Index> is possible, but can be error prone if you're not careful. For that reason, I suggest you try to minimize conversions, by trying to either exclusively use NSRange, or Range<String.Index>, but not mix the two too much.
replacingCharacters(in:with:)
I suspect you're only using this as example way of consuming wordInSentence, but it's still worth noting that:
Foundation.NSString.replacingCharacters(in:with:) (https://developer.apple.com/documentation/foundation/nsstring/1412937-replacingoccurrences) is an NSString API that's imported onto Swift.String when Foundation is imported. It accepts an NSRange. If you're dealing with Range<String.Index>, you should use its Swift-native counterpart, Swift.String.replaceSubrange(_:with:).
Substring is your friend
Don't fight it; unless you absolutely need sentence to be a String, keep it as a Substring for the duration of these short-lived processing actions. Not only does this save you a copy of the string's contents, but it also makes it so that the indices can be shared between the slice and the parent string. This is valid:
let sentence = bigLongString[sentenceInString]
print(sentence[wordInString])
or even just: bigLongString[sentenceInString][wordInString] or bigLongString[wordInString]
Shifting around
I couldn't find a native solution for this, so I rolled my own. I could definitely be missing something simpler, but here's what I came up with:
import Foundation
struct SubstringOffset {
let offset: String.IndexDistance
let parent: String
init(of substring: Substring, in parent: String) {
self.offset = parent.distance(from: parent.startIndex, to: substring.startIndex)
self.parent = parent
}
func convert(indexInParent: String.Index, toIndexIn newString: String) -> String.Index {
let distance = parent.distance(from: parent.startIndex, to: indexInParent)
let distanceInNewString = distance - offset
return newString.index(newString.startIndex, offsetBy: distanceInNewString)
}
func convert(rangeInParent: Range<String.Index>, toRangeIn newString: String) -> Range<String.Index> {
let newLowerBound = self.convert(indexInParent: rangeInParent.lowerBound, toIndexIn: newString)
let span = self.parent.distance(from: rangeInParent.lowerBound, to: rangeInParent.upperBound)
let newUpperBound = newString.index(newLowerBound, offsetBy: span)
return newLowerBound ..< newUpperBound
}
}
// The Setup (this is just to get easy testing values, no need for feedback on this part)
let bigLongString = "Really…? Thought I had it."
let sentenceInString = bigLongString.range(of: "Thought I had it.")!
let wordInString = bigLongString.range(of: "I")!
var sentence: String = String(bigLongString[sentenceInString])
let offset = SubstringOffset(of: bigLongString[sentenceInString], in: bigLongString)
// The Code In Question
let wordInSentence: Range<String.Index> = offset.convert(rangeInParent: wordInString, toRangeIn: sentence)
sentence.replaceSubrange(wordInSentence, with: "*\(sentence[wordInSentence])*")
print(sentence)
OK, this is what I've come up with. It appears to work OK for both examples in the question.
We use the String instance method distance(from:to:) to get the distance between the bigLongString start and the sentence start. (Analogous to the "8" in the question.) Then the word range is shifted back by this amount by shifting the upper and lower bounds separately, and then reforming them into a Range.
let wordStartInSentence = bigLongString.distance(from: sentenceInString.lowerBound,
to: wordInString.lowerBound)
let wordEndInSentence = bigLongString.distance(from: sentenceInString.lowerBound,
to: wordInString.upperBound)
let wordStart = sentence.index(sentence.startIndex, offsetBy: wordStartInSentence)
let wordEnd = sentence.index(sentence.startIndex, offsetBy: wordEndInSentence)
let wordInSentence = wordStart..<wordEnd
EDIT: Updated answer to work for the more complicated bigLongString example (and coincidentally also reduce the "string walking," especially when bigLongString is very big).
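For contrast, the "just subtract the sentence's start" idea from the question does work directly in a language whose string indices are plain integers. The sketch below uses Rust, where ranges into a `&str` are byte offsets. `shift_range` is a made-up helper name, and this illustrates the arithmetic only; it is not a Swift solution.

```rust
use std::ops::Range;

/// Re-expresses `word` (a byte range in the parent string) relative to
/// `sentence` (also a byte range in the parent string).
/// Returns None if `word` does not lie inside `sentence`.
/// `shift_range` is a hypothetical helper used for illustration.
fn shift_range(word: &Range<usize>, sentence: &Range<usize>) -> Option<Range<usize>> {
    if word.start > word.end || word.start < sentence.start || word.end > sentence.end {
        return None;
    }
    Some((word.start - sentence.start)..(word.end - sentence.start))
}
```

Because both ranges come from the same parent string, the shifted range still falls on valid UTF-8 boundaries inside the extracted sentence, even when the text before the sentence contains multi-byte characters such as "…".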

Swift 5: String prefix with a maximum UTF-8 length

I have a string that can contain arbitrary Unicode characters and I want to get a prefix of that string whose UTF-8 encoded length is as close as possible to 32 bytes, while still being valid UTF-8 and without changing the characters' meaning (i.e. not cutting off an extended grapheme cluster).
Consider this CORRECT example:
let string = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}\u{1F1EA}\u{1F1FA}"
print(string) // 🏴󠁧󠁢󠁳󠁣󠁴󠁿🇪🇺
print(string.count) // 2
print(string.utf8.count) // 36
let prefix = string.utf8Prefix(32) // <-- function I want to implement
print(prefix) // 🏴󠁧󠁢󠁳󠁣󠁴󠁿
print(prefix.count) // 1
print(prefix.utf8.count) // 28
print(string.hasPrefix(prefix)) // true
And this example of a WRONG implementation:
let string = "ar\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}"
print(string) // ar🏴‍☠️🏴‍☠️🏴‍☠️
print(string.count) // 5
print(string.utf8.count) // 41
let prefix = string.wrongUTF8Prefix(32) // <-- wrong implementation
print(prefix) // ar🏴‍☠️🏴‍☠️🏴
print(prefix.count) // 5
print(prefix.utf8.count) // 32
print(string.hasPrefix(prefix)) // false
What's an elegant way to do this? (besides trial&error)
You've shown no attempt at a solution and SO doesn't normally write code for you. So instead here are some algorithm suggestions for you:
What's an elegant way to do this? (besides trial&error)
By what definition of elegant? (like beauty it depends on the eye of the beholder...)
Simple?
Start with String.makeIterator, write a while loop, append Characters to your prefix as long as the byte count ≤ 32.
It's a very simple loop, worst case is 32 iterations and 32 appends.
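The shape of that loop can be sketched as follows. The sketch is in Rust and walks Unicode scalar values (`char`), because Rust's standard library has no extended grapheme cluster iterator. That is an important difference from the Swift Character loop described above: this version can still cut a multi-scalar emoji apart. The function name `utf8_prefix_greedy` is made up for illustration.

```rust
/// Longest prefix of `s` that is at most `max_len` bytes long and ends on a
/// codepoint boundary (NOT necessarily a grapheme cluster boundary).
/// `utf8_prefix_greedy` is a hypothetical name used for illustration.
fn utf8_prefix_greedy(s: &str, max_len: usize) -> &str {
    for (start, ch) in s.char_indices() {
        // Stop before the first codepoint that would exceed the byte budget.
        if start + ch.len_utf8() > max_len {
            return &s[..start];
        }
    }
    s
}
```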
"Smart" Search Strategy?
You could implement a strategy based on the average byte length of each Character in the String and using String.prefix(_:).
E.g. for your first example the character count is 2 and the byte count 36, giving an average of 18 bytes/character, 18 goes into 32 just once (we don't deal in fractional characters or bytes!) so start with prefix(1), which has a byte count of 28 and leaves 1 character and 8 bytes. So the remainder has an average byte length of 8 and you are seeking at most 4 more bytes, 8 goes into 4 zero times and you are done.
The above example shows the case of extending (or not) your prefix guess. If your prefix guess is too long you can just start your algorithm from scratch using the prefix character & byte counts rather than the original string's.
If you have trouble implementing your algorithm ask a new question showing the code you've written, describe the issue, and someone will undoubtedly help you with the next step.
HTH
I discovered that String and String.UTF8View share the same indices, so I managed to create a very simple (and efficient?) solution, I think:
extension String {
func utf8Prefix(_ maxLength: Int) -> Substring {
if self.utf8.count <= maxLength {
return Substring(self)
}
var index = self.utf8.index(self.startIndex, offsetBy: maxLength+1)
self.formIndex(before: &index)
return self.prefix(upTo: index)
}
}
Explanation (assuming maxLength == 32 and startIndex == 0):
The first case (utf8.count <= maxLength) should be clear, that's where no work is needed.
For the second case we first get the utf8-index 33, which is either
A: the endIndex of the string (if it's exactly 33 bytes long),
B: an index at the start of a character (after 33 bytes of previous characters)
C: an index somewhere in the middle of a character (after <33 bytes of previous characters)
So if we now move our index back one character (with formIndex(before:)) this will jump to the first extended grapheme cluster boundary before index which in case A and B is one character before and in C the start of that character.
In any case, the utf8-index will now be guaranteed to be at most 32 and at an extended grapheme cluster boundary, so prefix(upTo: index) will safely create a prefix with length ≤32.
…but it's not perfect.
In theory this should also be always the optimal solution, i.e. the prefix's count is as close as possible to maxLength but sometimes when the string ends with an extended grapheme cluster consisting of more than one Unicode scalar, formIndex(before: &index) goes back one character too many than would be necessary, so the prefix ends up shorter. I'm not exactly sure why that's the case.
EDIT: A not as elegant but in exchange completely "correct" solution would be this (still only O(n)):
extension String {
func utf8Prefix(_ maxLength: Int) -> Substring {
if self.utf8.count <= maxLength {
return Substring(self)
}
let endIndex = self.utf8.index(self.startIndex, offsetBy: maxLength)
var index = self.startIndex
while index <= endIndex {
self.formIndex(after: &index)
}
self.formIndex(before: &index)
return self.prefix(upTo: index)
}
}
I like the first solution you came up with. I've found it works more correctly (and simpler) if you take out the formIndex:
extension String {
func utf8Prefix(_ maxLength: Int) -> Substring {
if self.utf8.count <= maxLength {
return Substring(self)
}
let index = self.utf8.index(self.startIndex, offsetBy: maxLength)
return self.prefix(upTo: index)
}
}
My solution looks like this:
extension String {
func prefix(maxUTF8Length: Int) -> String {
if self.utf8.count <= maxUTF8Length { return self }
var utf8EndIndex = self.utf8.index(self.utf8.startIndex, offsetBy: maxUTF8Length)
while utf8EndIndex > self.utf8.startIndex {
if let stringIndex = utf8EndIndex.samePosition(in: self) {
return String(self[..<stringIndex])
} else {
self.utf8.formIndex(before: &utf8EndIndex)
}
}
return ""
}
}
It takes the highest possible utf8 index, checks if it is a valid character index using the Index.samePosition(in:) method. If not, it reduces the utf8 index one by one until it finds a valid character index.
The advantage is that you could replace utf8 with utf16 and it would also work.
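The "start at the maximum offset and step back until the index is valid" idea carries over to other languages. Below is a minimal Rust sketch of it using `str::is_char_boundary`. The name `utf8_prefix_walk_back` is made up, and the caveat is significant: Rust's standard library only knows codepoint boundaries, so unlike the Swift version this does not protect extended grapheme clusters.

```rust
/// Longest prefix of `s` that is at most `max_len` bytes and valid UTF-8.
/// Only codepoint boundaries are respected, not grapheme clusters.
/// `utf8_prefix_walk_back` is a hypothetical name used for illustration.
fn utf8_prefix_walk_back(s: &str, max_len: usize) -> &str {
    if s.len() <= max_len {
        return s;
    }
    let mut end = max_len;
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

Applied to the 36-byte flag string from the question with a limit of 32, this returns 32 bytes, which splits the second flag between its two regional indicator scalars. That is the same failure mode as the WRONG example in the question, and it shows why the grapheme-aware index check matters.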

Index of word in string 'covering' certain position

Not sure if this is the right place to ask but I couldn't find any related or similar questions.
Anyway: imagine you have a certain string like
val exampleString = "Hello StackOverflow this is my question, cool right?"
If given a position in this string, for example 23, return the index of the word that 'occupies' this position in the string. If we look at the example string, we can see that the character at position 23 (counting from 0) is the letter 's' (the last character of 'this'), so we should return index = 4 (counting from 0, 'this' is the 5th token). In my question spaces are counted as words. If, for example, we were given position 5, we land on the first space and thus we should return index = 1.
I'm implementing this in Scala (but this should be quite language-agnostic and I would love to see implementations in other languages).
Currently I have the following approach (assume exampleString is the given string and charPosition the given position):
exampleString.split("((?<= )|(?= ))").scanLeft(0)((a, b) => a + b.length()).drop(1).zipWithIndex.takeWhile(_._1 <= charPosition).last._2 + 1
This works, but it is way too complex to be honest. Is there a better (more efficient?) way to achieve this. I'm fairly new to functions like fold, scan, map, filter ... but I would love to learn more.
Thanks in advance.
def wordIndex(exampleString: String, index: Int): Int = {
exampleString.take(index + 1).foldLeft((0, exampleString.head.isWhitespace)) {
case ((n, isWhitespace), c) =>
if (isWhitespace == c.isWhitespace) (n, isWhitespace)
else (n + 1, !isWhitespace)
}._1
}
This will fold over the string, keeping track of whether the previous character was a whitespace or not, and if it detects a change, it will flip the boolean and add 1 to the count (n).
This will be able to handle groups of spaces (e.g. in hello world, world would be at position 2), and also spaces at the start of the string would count as index 0 and the first word would be index 1.
Note that this can't handle when the input is an empty string, I'll let you decide what you want to do in that case.
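Since the question invited implementations in other languages, here is a rough Rust sketch of the same fold. It makes one assumption the Scala version leaves open: an empty string returns `None`. Positions are counted in `char`s, and a position past the end simply yields the index of the last token, mirroring what `take(index + 1)` does.

```rust
/// Zero-based index of the token (word or run of whitespace) covering `position`.
/// Returns None for an empty string (an assumption; the original leaves this case open).
fn word_index(text: &str, position: usize) -> Option<usize> {
    let mut chars = text.chars().take(position.saturating_add(1));
    // The first character decides whether token 0 is whitespace or a word.
    let mut in_whitespace = chars.next()?.is_whitespace();
    let mut index = 0;
    for c in chars {
        // Each switch between whitespace and non-whitespace starts a new token.
        if c.is_whitespace() != in_whitespace {
            in_whitespace = !in_whitespace;
            index += 1;
        }
    }
    Some(index)
}
```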

UTF8 String length and indices in Go vs Swift

I have apps in Go and Swift which process strings, such as finding substrings and their indices. At first it worked nicely even with multi-byte characters (e.g. emojis), using Go's utf8.RuneCountInString() and Swift's native String.
But there are some UTF8 characters that break the string length and indices for substrings, e.g. a string "Lorem 😂😃✌️🤔 ipsum":
Go's utf8.RuneCountInString("Lorem 😂😃✌️🤔 ipsum") returns 17 and the start index of ipsum is 12.
Swift's "Lorem 😂😃✌️🤔 ipsum".count returns 16 and the start index of ipsum is 11.
Using Swift String's utf8, utf16 or casting to NSString also gives different lengths and indices. There are also other emojis composed of multiple other emojis, like 👨‍👩‍👧‍👦, which give even funnier numbers.
This is with Go 1.8 and Swift 4.1.
Is there any way to get the same string lengths and substrings' indices with same values with Go and Swift?
EDIT
I created a Swift String extension based on @MartinR's great answer:
extension String {
func runesRangeToNSRange(from: Int, to: Int) -> NSRange {
let length = to - from
let start = unicodeScalars.index(unicodeScalars.startIndex, offsetBy: from)
let end = unicodeScalars.index(start, offsetBy: length)
let range = start..<end
return NSRange(range, in: self)
}
}
In Swift a Character is an “extended grapheme cluster,” and each of "😂", "😃", "✌️", "🤔", "👨‍👩‍👧‍👦" counts as a single character.
I have no experience with Go, but as I understand it from Strings, bytes, runes and characters in Go,
a “rune” is a Unicode code point, which essentially corresponds to a UnicodeScalar in Swift.
In your example, the difference comes from "✌️" which
counts as a single Swift character, but is built from two Unicode scalars:
print("✌️".count) // 1
print("✌️".unicodeScalars.count) // 2
Here is an example how you can compute the length and offsets in
terms of Unicode scalars:
let s = "Lorem 😂😃✌️🤔 ipsum"
print(s.unicodeScalars.count) // 17
if let idx = s.range(of: "ipsum") {
print(s.unicodeScalars.distance(from: s.startIndex, to: idx.lowerBound)) // 12
}
As you can see, this gives the same numbers as in your example from Go.
A rune in Go identifies a specific Unicode code point; that does not necessarily mean it maps 1:1 to visually distinct characters. Some characters may be made up of multiple runes/code points, therefore counting runes may not give you what you'd expect from a visual inspection of the string. I don't know what "some text".count actually counts in Swift so I can't offer any comparison there.
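To see how the different units diverge for the example string, here is a small sketch in Rust, whose `char` is a Unicode scalar value and so lines up with Go's rune count and Swift's unicodeScalars for well-formed text. The helper names `lengths` and `scalar_offset` are made up for illustration, and the emoji are written as escapes so the variation selector U+FE0F after ✌ is visible.

```rust
/// Length of `s` in (UTF-8 bytes, Unicode scalar values, UTF-16 code units).
/// `lengths` is a hypothetical helper used for illustration.
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.chars().count(), s.encode_utf16().count())
}

/// Offset of the first occurrence of `needle`, counted in Unicode scalar values.
/// `scalar_offset` is a hypothetical helper used for illustration.
fn scalar_offset(s: &str, needle: &str) -> Option<usize> {
    s.find(needle).map(|byte_idx| s[..byte_idx].chars().count())
}
```

For "Lorem \u{1F602}\u{1F603}\u{270C}\u{FE0F}\u{1F914} ipsum" this gives 30 bytes, 17 scalar values and 20 UTF-16 code units, with "ipsum" starting at scalar offset 12. The scalar numbers match the Go results in the question; the UTF-16 count is what NSString's length would be based on.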