In Swift, how can I get an estimate of a String's length in constant time?

In Swift 3, you can count the characters in a String with:
str.characters.count
I need to do this frequently, and that line above looks like it could be O(N). Is there a way to get a string length, or a length of something — maybe the underlying unicode buffer — with an operation that is guaranteed to not have to walk the entire string? Maybe:
str.utf16.count
I ask because I'm checking the length of some text every time the user types a character, to limit the size of a UITextView. The call doesn't need to be an exact count of the glyphs, like characters.count.

This is a good question. The answer is... complicated. Converting from UTF-8 to UTF-16, or vice-versa, or converting to or from some other encoding, will all require examining the string, since the characters can be made up of more than one code unit. So if you want to get the count in constant time, it's going to come down to what the internal representation is. If the string is using UTF-16 internally, then it's a reasonable assumption that string.utf16.count would be in constant time, but if the internal representation is UTF-8 or something else, then the string will need to be analyzed to determine what the length in UTF-16 would be. So what's String using internally? Well:
https://github.com/apple/swift/blob/master/stdlib/public/core/StringCore.swift
/// The core implementation of a highly-optimizable String that
/// can store both ASCII and UTF-16, and can wrap native Swift
/// _StringBuffer or NSString instances.
This is discouraging. The internal representation could be ASCII or UTF-16, or it could be wrapping a Foundation NSString. Hrm. We do know that NSString uses UTF-16 internally, since this is actually documented, so that's good. So the main outlier here is when the string stores ASCII. The saving grace is that since the first 128 Unicode code points have the same values as the ASCII character set, any ASCII character 0xXX should correspond to the single UTF-16 code unit 0x00XX, so the UTF-16 length in code units should simply equal the ASCII length (only the size in bytes doubles), and thus be calculable in constant time. Is this the case in the implementation? Let's look.
In the UTF16View source, there is no implementation of count. It appears that count is inherited from Collection's implementation, which is implemented via distance():
public var count: IndexDistance {
    return distance(from: startIndex, to: endIndex)
}
UTF16View's implementation of distance() looks like this:
public func distance(from start: Index, to end: Index) -> IndexDistance {
    // FIXME: swift-3-indexing-model: range check start and end?
    return start.encodedOffset.distance(to: end.encodedOffset)
}
And in the String.Index source, encodedOffset looks like this:
public var encodedOffset : Int {
    return Int(_compoundOffset >> _Self._strideBits)
}
where _compoundOffset appears to be a simple 64-bit integer:
internal var _compoundOffset : UInt64
and _strideBits appears to be a straight integer as well:
internal static var _strideBits : Int { return 2 }
So it... looks... like you should get constant time from string.utf16.count, since unless I'm making a mistake somewhere, you're just bit-shifting a couple of integers and then subtracting one result from the other (I'd probably still run some tests to be sure). The caveat is, of course, that this isn't documented, and thus could change in the future, particularly since the documentation for String does claim that it needs to iterate through the string:
Unlike with isEmpty, calculating a view’s count property requires iterating through the elements of the string.
With all that said, you're using a UITextView, which is implemented in Objective-C via NSAttributedString. If you're willing to incur the Objective-C message-passing overhead (which, let's be honest, is probably occurring behind the scenes anyway to generate the String), you can just call its length property. NSAttributedString is built on top of NSString, which does guarantee that it uses UTF-16 internally, so that call is almost certain to run in constant time.
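As a rough illustration of the difference between the counts discussed above, here is a minimal sketch. It assumes Swift 4 or later (where `str.count` replaces `str.characters.count`), and `exceedsLimit` is a hypothetical helper name, not an API from UIKit or the standard library. It says nothing about the time complexity on any particular Swift version; it only shows that `utf16.count` and NSString's `length` agree, and that both can differ from the character count.

```swift
import Foundation

// Hypothetical helper: true when `text` is longer than `limit` UTF-16 code units.
func exceedsLimit(_ text: String, limit: Int) -> Bool {
    return text.utf16.count > limit
}

// "héllo 👍", written with explicit escapes so the scalars are unambiguous.
let sample = "h\u{E9}llo \u{1F44D}"

let characterCount = sample.count               // user-perceived characters
let utf16Count = sample.utf16.count             // UTF-16 code units
let nsLength = NSString(string: sample).length  // NSString length, also in UTF-16 code units

print(characterCount, utf16Count, nsLength)     // prints "7 8 8"
print(exceedsLimit(sample, limit: 7))           // prints "true"
```

The emoji is one character but two UTF-16 code units, which is why a UTF-16 based limit is only an estimate of the visible length.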

Related

Why does Swift String.Index keep its index value 4 times bigger than the real one?

I was trying to implement the Boyer-Moore algorithm in a Swift Playground and I used Swift's String.Index a lot. Something that started to bother me is why indexes are kept 4 times bigger than what it seems they should be.
For example:
let why = "is s on 4th position not 1st".index(of: "s")
This code in Swift Playground will generate _compoundOffset 4 not 1. I'm sure there is a reason for doing this, but I couldn't find explanation anywhere.
It's not a duplicate of any question that explains how to get index of char in Swift, I know that, I used index(of:) function just to illustrate the question. I wanted to know why value of 2nd char is 4 not 1 when using String.Index.
So I guess the way it keeps indexes is private and I don't need to know the inside implementation, it's probably connected with UTF16 and UTF32 coding.
First of all, don’t ever assume _compoundOffset to be anything else than an implementation detail. _compoundOffset is an internal property of String.Index that uses bit masking to store two values in this one number:
The encodedOffset, which is the index's offset measured in UTF-16 code units (not bytes). This one is public and can be relied on. In your case encodedOffset is 1 because that's the offset for that character, as measured in UTF-16 code units. Note that the encoding of the string in memory doesn't matter! encodedOffset is always UTF-16.
The transcodedOffset, which stores the index's offset inside the current UTF-16 code unit. This is also an internal property that you can't access. The value is usually 0 for most indices, unless you have an index into the string's UTF-8 view that refers to a code unit which doesn't fall on a UTF-16 boundary. In that case, the transcodedOffset will store the offset in bytes from the encodedOffset.
Now why is _compoundOffset == 4? Because it stores the transcodedOffset in the two least significant bits and the encodedOffset in the 62 most significant bits. So the bit pattern for encodedOffset == 1, transcodedOffset == 0 is 0b100, which is 4.
You can verify all this in the source code for String.Index.
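The packing described above can be sketched as follows. This is an illustration only: `strideBits`, `pack` and `unpack` are made-up names that mirror the layout in this answer (two low bits for transcodedOffset, the rest for encodedOffset), and it is not the standard library's actual code.

```swift
// Illustrative constants mirroring the layout described above; not the stdlib's own code.
let strideBits: UInt64 = 2
let transcodedMask: UInt64 = (1 << strideBits) - 1   // 0b11

// Store encodedOffset in the high bits and transcodedOffset in the two low bits.
func pack(encodedOffset: UInt64, transcodedOffset: UInt64) -> UInt64 {
    return (encodedOffset << strideBits) | (transcodedOffset & transcodedMask)
}

// Recover both values from the packed integer.
func unpack(_ compound: UInt64) -> (encodedOffset: UInt64, transcodedOffset: UInt64) {
    return (compound >> strideBits, compound & transcodedMask)
}

let compound = pack(encodedOffset: 1, transcodedOffset: 0)
print(compound)                        // prints "4"
print(unpack(compound).encodedOffset)  // prints "1"
```

Shifting right by two bits is the same as dividing by 4, which is why the raw value looks four times larger than the offset.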

Memory occupied by a string variable in Swift

I want to find the memory occupied by a string variable in bytes.
Let's suppose we have a variable named test:
let test = "abvd"
I want to know how to find the size of test at runtime.
I have checked the details in Calculate the size in bytes of a Swift String
But this question is different.
According to Apple, "Behind the scenes, Swift’s native String type is built from Unicode scalar values. A Unicode scalar is a unique 21-bit number for a character or modifier, such as U+0061 for LATIN SMALL LETTER A ("a"), or U+1F425 for FRONT-FACING BABY CHICK ("🐥")." This can be found in https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
So if that's the case, Apple is actually using a fixed-size representation for Unicode code points instead of the variable-length UTF-8 encoding.
I wanted to verify this claim.
Thanks in advance.
To your real goal: neither understanding is correct. Strings do not promise an internal representation. They can hold a variety of representations, depending on how they're constructed. In principle they can even take zero real memory if they are statically defined in the binary and memory mapped (I can't remember if the StaticString type makes use of this fully yet). The only way you're going to answer this question is to look at the current implementation, starting in String.swift, and then moving to StringCore.swift, and then reading the rest of the string files.
To your particular question, this is probably the beginning of the answer you're looking for (but again, this is current implementation; it's not part of any spec):
/// The core implementation of a highly-optimizable String that
/// can store both ASCII and UTF-16, and can wrap native Swift
/// _StringBuffer or NSString instances.
The end of the answer you're looking for is "it's complicated."
Note that if you ask for MemoryLayout.size(ofValue: test), you're going to get a surprising result (24), because that's just measuring the container. There is a reference type inside the container (which takes one word of storage for a pointer). There's no mechanism to determine "all the storage used by this value" because that's not very well defined when pointers get involved.
String only has one property:
var _core: _StringCore
And _StringCore has the following properties:
public var _baseAddress: UnsafeMutableRawPointer?
var _countAndFlags: UInt
public var _owner: AnyObject?
Each of those takes one word (8 bytes on a 64-bit platform) of storage, so 24 bytes total. It doesn't matter how long the string is.
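A small sketch of that last point, assuming nothing about the exact container size (it depends on the Swift version and platform, so no number is hard-coded here). It only shows that `MemoryLayout.size(ofValue:)` does not grow with the string, while `utf8.count` reports the length of one particular encoding, which is not necessarily the memory actually in use.

```swift
let short = "abvd"
let long = String(repeating: "abvd", count: 10_000)

// Size of the String value itself (the container), independent of the contents.
let shortContainer = MemoryLayout.size(ofValue: short)
let longContainer = MemoryLayout.size(ofValue: long)
print(shortContainer == longContainer)   // prints "true"

// Number of bytes the text would need when encoded as UTF-8.
print(short.utf8.count)                  // prints "4"
print(long.utf8.count)                   // prints "40000"
```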

Direct access to Swift's UTF-16 code units

I need direct access to the underlying UTF-16 code units for a Swift string. While it is possible to get a pointer to the UTF-8 C characters with:
string.utf8CString.withUnsafeBufferPointer { utf8chars in
...
}
there doesn't seem to be any equivalent for UTF-16 code units, which is odd since Swift strings are internally stored as UTF-16.
The closest I can get is to wrap the UTF16 view in an array:
ContiguousArray(string.utf16).withUnsafeBufferPointer { utf16units in
...
}
but this is drastically slower than direct access to the code units.
Is there some hidden method of getting at the underlying UTF-16 code units that I am missing?

NSRange from first occurrence until end of string

I'm pulling my hair out trying to generate a valid NSRange, it doesn't seem like it should be this complicated so I'm guessing I'm using the wrong approach. Here is what I'm trying to do:
I have a string with some unicode character in it:
"The quick brown fox\n❄jumped\n❄over the lazy dog"
I want to create an NSRange from that character until the end of string, and while I can get the corresponding index for the first occurrence of the character:
text.rangeOfString("❄")?.startIndex
I can't seem to get the end of the string in a consistent format (something that I can pass to NSMakeRange) to actually generate the range. This seems like it should be pretty simple, yet I've been stuck for over an hour now trying to figure out how to get it to work, I keep ending up with Index types that I can't cast to integers to convert back to length that NSMakeRange requires for its second element.
Ideally I'd do something like this (which is invalid due to incompatible and non-castable types (Index vs Int)):
let start = text.rangeOfString("❄")?.startIndex
NSMakeRange(start, text.endIndex - start)
I am using Swift, so I have the ability to use Swift's Range<String.Index>, if it will make things easier, although it seems to be yet another range representation different from NSRange and I'm not sure how compatible the two are (don't want to run into another dimension of Index vs Int).
Cast your String as NSString.
You will be able to use Foundation's .rangeOfString instead of Swift's .rangeOfString.
The Foundation's one will return an NSRange.
Be careful though: it doesn't handle Unicode the same way as Swift's method, and NSRange and Range are not compatible (although there are ways to convert between them).
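Here is a minimal sketch of the NSString route. It uses the Swift 3+ spelling `range(of:)` rather than the `rangeOfString` shown in the question, and `rangeToEnd(of:in:)` is a hypothetical helper name. All offsets are in UTF-16 code units, which is what NSRange expects.

```swift
import Foundation

// Hypothetical helper: NSRange from the first occurrence of `needle` to the end of `text`,
// or nil when `needle` does not occur.
func rangeToEnd(of needle: String, in text: String) -> NSRange? {
    let nsText = NSString(string: text)
    let found = nsText.range(of: needle)
    guard found.location != NSNotFound else { return nil }
    return NSRange(location: found.location, length: nsText.length - found.location)
}

// The snowflake is written as an escape (U+2744) to keep the literal unambiguous.
let text = "The quick brown fox\n\u{2744}jumped\n\u{2744}over the lazy dog"
if let tail = rangeToEnd(of: "\u{2744}", in: text) {
    print(tail.location, tail.length)   // prints "20 26"
    print(NSString(string: text).substring(with: tail))
}
```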

How to cast [Int8] to [UInt8] in Swift

I have a buffer that contains just characters
let buffer: [Int8] = ....
Then I need to pass this to a function process that takes [UInt8] as an argument.
func process(buffer: [UInt8]) {
    // some code
}
What would be the best way to pass the [Int8] buffer as [UInt8]? I know the following code would work, but in this case the buffer contains just a bunch of characters, and it seems unnecessary to use functions like map.
process(buffer.map{ x in UInt8(x) }) // OK
process([UInt8](buffer)) // error
process(buffer as! [UInt8]) // error
I am using Xcode7 b3 Swift2.
I broadly agree with the other answers that you should just stick with map, however, if your array were truly huge, and it really was painful to create a whole second buffer just for converting to the same bit pattern, you could do it like this:
// first, change your process logic to be generic on any kind of container
func process<C: CollectionType where C.Generator.Element == UInt8>(chars: C) {
    // just to prove it's working...
    print(String(chars.map { UnicodeScalar($0) }))
}
// sample input
let a: [Int8] = [104, 101, 108, 108, 111] // ascii "hello"
// access the underlying raw buffer as a pointer
a.withUnsafeBufferPointer { buf -> Void in
    process(
        UnsafeBufferPointer(
            // cast the underlying pointer to the type you want
            start: UnsafePointer(buf.baseAddress),
            count: buf.count))
}
// this prints [h, e, l, l, o]
Note withUnsafeBufferPointer means what it says. It’s unsafe and you can corrupt memory if you get this wrong (be especially careful with the count). It works based on your external knowledge that, for example, if any of the integers are negative then your code doesn't mind them becoming corrupt unsigned integers. You might know that, but the Swift type system can't, so it won't allow it without resort to the unsafe types.
That said, the above code is correct and within the rules and these techniques are justifiable if you need the performance edge. You almost certainly won’t unless you’re dealing with gigantic amounts of data or writing a library that you will call a gazillion times.
It’s also worth noting that there are circumstances where an array is not actually backed by a contiguous buffer (for example if it were cast from an NSArray) in which case calling .withUnsafeBufferPointer will first copy all the elements into a contiguous array. Also, Swift arrays are growable so this copy of underlying elements happens often as the array grows. If performance is absolutely critical, you could consider allocating your own memory using UnsafeMutablePointer and using it fixed-size style using UnsafeBufferPointer.
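For readers on newer toolchains, a roughly equivalent sketch is shown below. It assumes Swift 3 or later, where `withUnsafeBytes` hands the closure an `UnsafeRawBufferPointer` that is itself a collection of UInt8, so no pointer cast is needed. The `process` function here is a hypothetical stand-in for the one in the question.

```swift
// Hypothetical stand-in for the question's process function, generic over any byte collection.
func process<C: Collection>(_ bytes: C) -> [UInt8] where C.Element == UInt8 {
    return Array(bytes)
}

let signed: [Int8] = [104, 101, 108, 108, 111, -1]

// withUnsafeBytes exposes the array's storage as raw bytes (a collection of UInt8).
// The pointer is only valid inside the closure, so copy out anything you need.
let reinterpreted = signed.withUnsafeBytes { raw in
    process(raw)
}
print(reinterpreted)   // prints "[104, 101, 108, 108, 111, 255]"
```

The same caveats apply: the bytes are reinterpreted, not converted, so -1 arrives as 255.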
For a humorous but definitely not within the rules example that you shouldn’t actually use, this will also work:
process(unsafeBitCast(a, [UInt8].self))
It's also worth noting that these solutions are not the same as a.map { UInt8($0) } since the latter will trap at runtime if you pass it a negative integer. If this is a possibility you may need to filter them first.
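If you stay with map, the trap on negative values can be avoided with standard initializers. A minimal sketch, assuming Swift 4.1 or later for `compactMap`:

```swift
let buffer: [Int8] = [104, -1, -128]

// UInt8(x) traps for negative x; UInt8(bitPattern:) keeps the same 8 bits instead.
let reinterpretedBits = buffer.map { UInt8(bitPattern: $0) }
print(reinterpretedBits)   // prints "[104, 255, 128]"

// UInt8(exactly:) returns nil rather than trapping, so negative values can be dropped.
let nonNegativeOnly = buffer.compactMap { UInt8(exactly: $0) }
print(nonNegativeOnly)     // prints "[104]"
```

Which of the two is right depends on whether negative values are meaningful bytes in your data or invalid input.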
IMO, the best way to do this would be to stick to the same base type throughout the whole application to avoid the whole need to do casts/coercions. That is, either use Int8 everywhere, or UInt8, but not both.
If you have no choice, e.g. if you use two separate frameworks over which you have no control, and one of them uses Int8 while the other uses UInt8, then you should use map if you really want to use Swift. The last two lines from your examples (process([UInt8](buffer)) and process(buffer as! [UInt8])) look more like the C approach to the problem: we don't care that this area in memory is an array of signed integers, we will now treat it as if it holds unsigned ones. That basically throws the whole Swift idea of strong types out the window.
What I would probably try to do is to use lazy sequences. E.g. check whether it is possible to feed the process() method with something like:
let convertedBuffer = lazy(buffer).map() {
    UInt8($0)
}
process(convertedBuffer)
This would at least save you from extra memory overhead (as otherwise you would have to keep 2 arrays), and possibly save you some performance (thanks to laziness).
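In current Swift the free function `lazy(_:)` has been replaced by the `.lazy` property. A minimal sketch of the same idea follows, assuming `process` can be made generic over any sequence of UInt8. The body shown here (summing the bytes) is made up for the example.

```swift
// Hypothetical generic process: accepts any sequence of bytes and sums them.
func process<S: Sequence>(_ bytes: S) -> Int where S.Element == UInt8 {
    return bytes.reduce(0) { $0 + Int($1) }
}

let buffer: [Int8] = [104, 101, 108, 108, 111]

// No second array is allocated; each element is converted as process iterates.
let convertedBuffer = buffer.lazy.map { UInt8(bitPattern: $0) }
print(process(convertedBuffer))   // prints "532"
```

This only helps if process can take a generic sequence. If it must receive a concrete [UInt8], the lazy view has to be turned into an array anyway.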
You cannot cast arrays in Swift. It looks like you can, but what's really happening is that you are casting all the elements, one by one. Therefore, you can use cast notation with an array only if the elements can be cast.
Well, you cannot cast between numeric types in Swift. You have to coerce, which is a completely different thing - that is, you must make a new object of a different numeric type, based on the original object. The only way to use an Int8 where a UInt8 is expected is to coerce it: UInt8(x).
So what is true for one Int8 is true for an entire array of Int8. You cannot cast from an array of Int8 to an array of UInt8, any more than you could cast one of them. The only way to end up with an array of UInt8 is to coerce all the elements. That is exactly what your map call does. That is the way to do it; saying it is "unnecessary" is meaningless.