I have a string that can contain arbitrary Unicode characters and I want to get a prefix of that string whose UTF-8 encoded length is as close as possible to 32 bytes, while still being valid UTF-8 and without changing the characters' meaning (i.e. not cutting off an extended grapheme cluster).
Consider this CORRECT example:
let string = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}\u{1F1EA}\u{1F1FA}"
print(string) // 🏴󠁧󠁢󠁳󠁣󠁴󠁿🇪🇺
print(string.count) // 2
print(string.utf8.count) // 36
let prefix = string.utf8Prefix(32) // <-- function I want to implement
print(prefix) // 🏴󠁧󠁢󠁳󠁣󠁴󠁿
print(prefix.count) // 1
print(prefix.utf8.count) // 28
print(string.hasPrefix(prefix)) // true
And this example of a WRONG implementation:
let string = "ar\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}"
print(string) // ar🏴‍☠️🏴‍☠️🏴‍☠️
print(string.count) // 5
print(string.utf8.count) // 41
let prefix = string.wrongUTF8Prefix(32) // <-- wrong implementation
print(prefix) // ar🏴‍☠️🏴‍☠️🏴
print(prefix.count) // 5
print(prefix.utf8.count) // 32
print(string.hasPrefix(prefix)) // false
What's an elegant way to do this? (besides trial&error)
You've shown no attempt at a solution and SO doesn't normally write code for you. So instead, here are some algorithm suggestions for you:
What's an elegant way to do this? (besides trial&error)
By what definition of elegant? (Like beauty, it depends on the eye of the beholder...)
Simple?
Start with String.makeIterator(), write a while loop, and append Characters to your prefix as long as the byte count stays ≤ 32.
It's a very simple loop; the worst case is 32 iterations and 32 appends.
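For illustration only, a rough sketch of what that loop could look like (utf8PrefixByIteration is just a placeholder name):
extension String {
    // Append Characters one by one while the running UTF-8 byte count stays within the limit.
    func utf8PrefixByIteration(_ maxLength: Int) -> String {
        var result = ""
        var byteCount = 0
        for character in self {
            let characterBytes = String(character).utf8.count
            if byteCount + characterBytes > maxLength { break }
            byteCount += characterBytes
            result.append(character)
        }
        return result
    }
}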
"Smart" Search Strategy?
You could implement a strategy based on the average byte length of each Character in the String, combined with String.prefix(_:).
E.g. for your first example the character count is 2 and the byte count is 36, giving an average of 18 bytes/character. 18 goes into 32 just once (we don't deal in fractional characters or bytes!), so start with prefix(1), which has a byte count of 28 and leaves 1 character and 8 bytes. The remainder therefore has an average byte length of 8, and you are seeking at most 4 more bytes; 8 goes into 4 zero times, so you are done.
The above example shows the case of extending (or not) your prefix guess. If your prefix guess is too long you can just start your algorithm from scratch using the prefix character & byte counts rather than the original string's.
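Purely as an illustration of that estimate-then-adjust idea (a sketch, not a polished implementation; utf8PrefixByEstimate is a made-up name, and it fixes up the first guess linearly instead of re-estimating):
extension String {
    func utf8PrefixByEstimate(_ maxLength: Int) -> Substring {
        if utf8.count <= maxLength { return Substring(self) }
        // First guess: how many average-sized Characters fit into maxLength bytes?
        let averageBytes = max(1, utf8.count / count)
        var candidate = prefix(max(1, maxLength / averageBytes))
        // Shrink if the guess overshot ...
        while !candidate.isEmpty && candidate.utf8.count > maxLength {
            candidate = candidate.dropLast()
        }
        // ... and extend while the next Character still fits.
        var end = candidate.endIndex
        while end < endIndex {
            let next = index(after: end)
            if self[..<next].utf8.count > maxLength { break }
            end = next
        }
        return self[..<end]
    }
}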
If you have trouble implementing your algorithm ask a new question showing the code you've written, describe the issue, and someone will undoubtedly help you with the next step.
HTH
I discovered that String and String.UTF8View share the same indices, so I managed to create a very simple (and efficient?) solution, I think:
extension String {
    func utf8Prefix(_ maxLength: Int) -> Substring {
        if self.utf8.count <= maxLength {
            return Substring(self)
        }
        var index = self.utf8.index(self.startIndex, offsetBy: maxLength + 1)
        self.formIndex(before: &index)
        return self.prefix(upTo: index)
    }
}
Explanation (assuming maxLength == 32 and startIndex == 0):
The first case (utf8.count <= maxLength) should be clear: that's where no work is needed.
For the second case we first get the utf8-index 33, which is either
A: the endIndex of the string (if it's exactly 33 bytes long),
B: an index at the start of a character (after 33 bytes of previous characters)
C: an index somewhere in the middle of a character (after <33 bytes of previous characters)
So if we now move our index back one character (with formIndex(before:)), it will jump to the first extended grapheme cluster boundary before index, which in cases A and B is one character earlier, and in case C is the start of that character.
In any case, the utf8-index is now guaranteed to be at most 32 and to lie at an extended grapheme cluster boundary, so prefix(upTo: index) will safely create a prefix with length ≤ 32.
…but it's not perfect.
In theory this should also always be the optimal solution, i.e. the prefix's count is as close as possible to maxLength. But sometimes, when the string ends with an extended grapheme cluster consisting of more than one Unicode scalar, formIndex(before: &index) goes back one character further than necessary, so the prefix ends up shorter. I'm not exactly sure why that's the case.
EDIT: A less elegant but completely "correct" solution would be this (still only O(n)):
extension String {
    func utf8Prefix(_ maxLength: Int) -> Substring {
        if self.utf8.count <= maxLength {
            return Substring(self)
        }
        let endIndex = self.utf8.index(self.startIndex, offsetBy: maxLength)
        var index = self.startIndex
        while index <= endIndex {
            self.formIndex(after: &index)
        }
        self.formIndex(before: &index)
        return self.prefix(upTo: index)
    }
}
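Checking this version against the first example from the question:
let string = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}\u{1F1EA}\u{1F1FA}"
let prefix = string.utf8Prefix(32)
print(prefix)                   // 🏴󠁧󠁢󠁳󠁣󠁴󠁿
print(prefix.utf8.count)        // 28
print(string.hasPrefix(prefix)) // true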
I like the first solution you came up with. I've found it works more correctly (and is simpler) if you take out the formIndex:
extension String {
    func utf8Prefix(_ maxLength: Int) -> Substring {
        if self.utf8.count <= maxLength {
            return Substring(self)
        }
        let index = self.utf8.index(self.startIndex, offsetBy: maxLength)
        return self.prefix(upTo: index)
    }
}
My solution looks like this:
extension String {
    func prefix(maxUTF8Length: Int) -> String {
        if self.utf8.count <= maxUTF8Length { return self }
        var utf8EndIndex = self.utf8.index(self.utf8.startIndex, offsetBy: maxUTF8Length)
        while utf8EndIndex > self.utf8.startIndex {
            if let stringIndex = utf8EndIndex.samePosition(in: self) {
                return String(self[..<stringIndex])
            } else {
                self.utf8.formIndex(before: &utf8EndIndex)
            }
        }
        return ""
    }
}
It takes the highest possible utf8 index and checks whether it is a valid character index using the Index.samePosition(in:) method. If not, it reduces the utf8 index one by one until it finds a valid character index.
The advantage is that you could replace utf8 with utf16 and it would also work.
Situation: I was solving LeetCode 3. Longest Substring Without Repeating Characters. When I use a Dictionary in Swift the result is Time Limit Exceeded (it fails the last test case), but the same approach in C++ actually passes with a fine runtime. I thought Swift's Dictionary was the same kind of thing as C++'s unordered_map.
Some research: I found some resources saying to use NSDictionary over the regular one, but NSDictionary requires reference types rather than Int or Character etc.
Expected result: fast performance when accessing a Dictionary in Swift.
Question: I know there are better answers for this problem, but the main goal here is: is there an efficient way to access and write to a Dictionary, or something we can use as a substitute?
func lengthOfLongestSubstring(_ s: String) -> Int {
    var window: [Character: Int] = [:] // swift dictionary is kind of slow?
    let array = Array(s)
    var res = 0
    var left = 0, right = 0
    while right < s.count {
        let rightChar = array[right]
        right += 1
        window[rightChar, default: 0] += 1
        while window[rightChar]! > 1 {
            let leftChar = array[left]
            window[leftChar, default: 0] -= 1
            left += 1
        }
        res = max(res, right - left)
    }
    return res
}
Because the complexity of count on a String is O(n), you should save the count in a variable. You can read about this in the Strings and Characters chapter of the Swift book:
Extended grapheme clusters can be composed of multiple Unicode scalars. This means that different characters—and different representations of the same character—can require different amounts of memory to store. Because of this, characters in Swift don't each take up the same amount of memory within a string's representation. As a result, the number of characters in a string can't be calculated without iterating through the string to determine its extended grapheme cluster boundaries. If you are working with particularly long string values, be aware that the count property must iterate over the Unicode scalars in the entire string in order to determine the characters for that string.
The count of the characters returned by the count property isn't always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string's UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
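For illustration, the smallest change to the code above is to stop calling s.count in the loop condition, e.g. by reusing the array that was already built:
func lengthOfLongestSubstring(_ s: String) -> Int {
    var window: [Character: Int] = [:]
    let array = Array(s)
    let count = array.count          // Array.count is O(1); s.count walks the whole String
    var res = 0
    var left = 0, right = 0
    while right < count {
        let rightChar = array[right]
        right += 1
        window[rightChar, default: 0] += 1
        while window[rightChar]! > 1 {
            let leftChar = array[left]
            window[leftChar, default: 0] -= 1
            left += 1
        }
        res = max(res, right - left)
    }
    return res
}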
Look at the program:
let s = "1"
print(s.startIndex)
print(s.index(before: s.endIndex))
print(s.index(before: s.endIndex) == s.startIndex)
It returns:
Index(_rawBits: 0)
Index(_rawBits: 256)
true
So, the same position in the string is represented with rawBits 0 and 256. Why?
The raw bits of the index are an implementation detail. As you see in your example, the two values are equal (they return true for ==).
As to the current implementation, bit 8 is set, which is not part of the position. That's a cached value for the offset to the next grapheme cluster, which is 1 byte away. It's telling you that there's one byte to the next grapheme (which it didn't know until you calculated the endIndex).
Equality between two String.Indexes is defined on the upper 50 bits of the _rawBits, aka orderingValue, as follows:
extension String.Index: Equatable {
    @inlinable @inline(__always)
    public static func == (lhs: String.Index, rhs: String.Index) -> Bool {
        return lhs.orderingValue == rhs.orderingValue
    }
}
And since 0 &>> 14 and 256 &>> 14 both equal 0, the positions are equal, and thus the indices are considered equal.
&>> is the masking shift-right operator; it shifts bits to the right while masking the shift amount to the bit width of the integer type (so for 64-bit values the amount is taken modulo 64).
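You can reproduce that arithmetic with plain integers (the _rawBits values from the example above):
let lhsRawBits: UInt64 = 0
let rhsRawBits: UInt64 = 256
// Dropping the low 14 bits leaves the 50-bit orderingValue; both come out as 0.
print(lhsRawBits &>> 14, rhsRawBits &>> 14)     // 0 0
print(lhsRawBits &>> 14 == rhsRawBits &>> 14)   // true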
I have apps in Go and Swift which process strings, such as finding substrings and their indices. At first it worked nicely even with multi-byte characters (e.g. emojis), using Go's utf8.RuneCountInString() and Swift's native String.
But there are some UTF8 characters that break the string length and indices for substrings, e.g. a string "Lorem 😄😊✌️🤔 ipsum":
Go's utf8.RuneCountInString("Lorem 😄😊✌️🤔 ipsum") returns 17 and the start index of ipsum is 12.
Swift's "Lorem 😄😊✌️🤔 ipsum".count returns 16 and the start index of ipsum is 11.
Using Swift String's utf8, utf16 or casting to NSString also gives different lengths and indices. There are also emojis composed from multiple other emojis, like 👨‍👩‍👧‍👦, which give even funnier numbers.
This is with Go 1.8 and Swift 4.1.
Is there any way to get the same string lengths and substring indices, with the same values, in both Go and Swift?
EDIT
I created a Swift String extension based on @MartinR's great answer:
extension String {
    func runesRangeToNSRange(from: Int, to: Int) -> NSRange {
        let length = to - from
        let start = unicodeScalars.index(unicodeScalars.startIndex, offsetBy: from)
        let end = unicodeScalars.index(start, offsetBy: length)
        let range = start..<end
        return NSRange(range, in: self)
    }
}
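Usage, assuming the same example string as in the question (rune offsets 12..<17 cover "ipsum"):
import Foundation

let s = "Lorem 😄😊✌️🤔 ipsum"
let nsRange = s.runesRangeToNSRange(from: 12, to: 17)
print((s as NSString).substring(with: nsRange)) // ipsum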
In Swift a Character is an "extended grapheme cluster," and each of "😄", "😊", "✌️", "🤔", "👨‍👩‍👧‍👦" counts as a single character.
I have no experience with Go, but as I understand it from Strings, bytes, runes and characters in Go,
a "rune" is a Unicode code point, which essentially corresponds to a UnicodeScalar in Swift.
In your example, the difference comes from "✌️" which
counts as a single Swift character, but is built from two Unicode scalars:
print("✌️".count) // 1
print("✌️".unicodeScalars.count) // 2
Here is an example of how you can compute the length and offsets in
terms of Unicode scalars:
let s = "Lorem ๐๐โ๏ธ๐ค ipsum"
print(s.unicodeScalars.count) // 17
if let idx = s.range(of: "ipsum") {
print(s.unicodeScalars.distance(from: s.startIndex, to: idx.lowerBound)) // 12
}
As you can see, this gives the same numbers as in your example from Go.
A rune in Go identifies a specific Unicode code point; that does not necessarily mean it maps 1:1 to visually distinct characters. Some characters may be made up of multiple runes/code points, therefore counting runes may not give you what you'd expect from a visual inspection of the string. I don't know what "some text".count actually counts in Swift, so I can't offer any comparison there.
I want to format, in real time, the number entered into a UITextField. Depending on the field, the number may be an integer or a double, may be positive or negative.
Integers are easy (see below).
Doubles should be displayed exactly as the user enters them, with three possible exceptions:
If the user begins with a decimal separator, or a negative sign followed by a decimal separator, insert a leading zero:
"." becomes "0."
"-." becomes "-0."
Remove any "excess" leading zeros if the user deletes a decimal point:
If the number is "0.00023" and the decimal point is deleted, the number should become "23".
Do not allow a leading zero if the next character is not a decimal separator:
"03" becomes "3".
Long story short, one and only one leading zero, no trailing zeros.
It seemed like the easiest idea was to convert the (already validated) string to a number then use format specifiers. I've scoured:
https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/Strings/Articles/formatSpecifiers.html
and
http://www.cplusplus.com/reference/cstdio/printf/
and others but can't figure out how to format a double so that it does not add a decimal when there are no digits after it, or any trailing zeros. For example:
x = 23.0
print (String(format: "%f", x))
//output is 23.000000
//I want 23
x = 23.45
print (String(format: "%f", x))
//output is 23.450000
//I want 23.45
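(For comparison, the %g specifier drops trailing zeros, but it is limited to six significant digits by default and can switch to scientific notation, so it doesn't quite fit either:)
print(String(format: "%g", 23.0))           // 23
print(String(format: "%g", 23.45))          // 23.45
print(String(format: "%g", 0.000123456789)) // 0.000123457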
On How to create a string with format?, I found this gem:
var str = "\(INT_VALUE) , \(FLOAT_VALUE) , \(DOUBLE_VALUE), \(STRING_VALUE)"
print(str)
It works perfectly for integers (which is why I said integers are easy above), but for doubles it appends a ".0" onto the first character the user enters. (It does work perfectly in a Playground, but not in my program. Why?)
Will I have to resort to counting the number of digits before and after the decimal separator and inserting them into a format specifier? (And if so, how do I count those? I know how to create the format specifier.) Or is there a really simple way or a quick fix to use that one-liner above?
Thanks!
Turned out to be simple without using NumberFormatter (which I'm not so sure would really have accomplished what I want without a LOT more work).
let decimalSeparator = NSLocale.current.decimalSeparator! as String
var tempStr: String = textField.text ?? ""
var i: Int = tempStr.count
//remove leading zeros for positive numbers (integer or real)
if i > 1 {
    while (tempStr[0] == "0" && tempStr[1] != decimalSeparator[0]) {
        tempStr.remove(at: tempStr.startIndex)
        i = i - 1
        if i < 2 {
            break
        }
    }
}
//remove leading zeros for negative numbers (integer or real)
if i > 2 {
    while (tempStr[0] == "-" && tempStr[1] == "0") && tempStr[2] != decimalSeparator[0] {
        tempStr.remove(at: tempStr.index(tempStr.startIndex, offsetBy: 1))
        i = i - 1
        if i < 3 {
            break
        }
    }
}
Using the following extension to subscript the string:
extension String {
    subscript(i: Int) -> Character {
        return self[index(startIndex, offsetBy: i)]
    }
}
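With that extension, the Int subscripts used above read naturally, though note that each one walks the string via index(_:offsetBy:), so subscripting is O(n):
let sample = "-0.5"
print(sample[0])        // -
print(sample[1])        // 0
print(sample[1] == "0") // true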
s is a native Swift string consisting of ASCII characters only. It could be arbitrarily long. What's the most efficient way to figure out whether s is shorter than a certain length (say, 100k)?
if countElements(s) < 100_000 is not the most efficient, as countElements has O(n) complexity and s could have billions of characters.
If you're sure you don't need to worry about anything other than ASCII, you can use the utf16Count property (which is the length property of the bridged NSString):
let stringLength = superLongString.utf16Count
If you want to be able to handle Unicode you need to walk the string, you just don't want to walk the whole string. Here's a function to count just up to your limit:
func lengthLessThanMax(#string: String, maximum max: Int) -> Bool {
    var idx = string.startIndex
    var count = 0
    while idx < string.endIndex && count < max {
        ++count
        idx = idx.successor()
    }
    return count < max
}
lengthLessThanMax(string: "Hello!", maximum: 10)
// true
lengthLessThanMax(string: "Hello! Nice to meet you!", maximum: 10)
// false
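That answer predates Swift 3; in current Swift the same early-exit idea would look roughly like this:
func lengthLessThanMax(string: String, maximum max: Int) -> Bool {
    // Walk at most `max` characters instead of counting the entire string.
    var count = 0
    var idx = string.startIndex
    while idx < string.endIndex && count < max {
        count += 1
        idx = string.index(after: idx)
    }
    return count < max
}

lengthLessThanMax(string: "Hello!", maximum: 10)                   // true
lengthLessThanMax(string: "Hello! Nice to meet you!", maximum: 10) // false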
Just choose what you want:
var emoji = "😄"
countElements(emoji) //returns 1
emoji.utf16Count //returns 2
emoji.bridgeToObjectiveC().length //returns 2
Found in Get the length of a String.