I need direct access to the underlying UTF-16 code units for a Swift string. While it is possible to get a pointer to the UTF-8 C characters with:
string.utf8CString.withUnsafeBufferPointer { utf8chars in
...
}
there doesn't seem to be any equivalent for UTF-16 code units, which is odd since Swift strings are internally stored as UTF-16.
The closest I can get is to wrap the UTF16 view in an array:
ContiguousArray(string.utf16).withUnsafeBufferPointer { utf16units in
...
}
but this is drastically slower than direct access to the code units.
Is there some hidden method of getting at the underlying UTF-16 code units that I am missing?
I'm trying to work with this Argon2 implementation. I'm implementing an existing protocol, so I don't have any flexibility in design, and the protocol treats various inputs to the function as byte sequences. However, that implementation treats inputs as Strings. Is there any encoding that I can use which will allow me to convert an arbitrary byte sequence to a String without any validity constraints - that is, such that all possible byte sequences will convert without error?
I was trying to implement the Boyer-Moore algorithm in a Swift Playground and used String.Index a lot. Something that started to bother me is why indexes are kept 4 times bigger than it seems they should be.
For example:
let why = "is s on 4th position not 1st".index(of: "s")
This code in a Swift Playground produces a _compoundOffset of 4, not 1. I'm sure there is a reason for this, but I couldn't find an explanation anywhere.
This is not a duplicate of any question that explains how to get the index of a character in Swift. I know how to do that; I used index(of:) just to illustrate the question. I want to know why the value for the 2nd character is 4, not 1, when using String.Index.
So I guess the way it keeps indexes is private and I don't need to know the inside implementation, it's probably connected with UTF16 and UTF32 coding.
First of all, don’t ever assume _compoundOffset to be anything other than an implementation detail. _compoundOffset is an internal property of String.Index that uses bit masking to store two values in one number:
The encodedOffset, which is the index's offset measured in UTF-16 code units (not bytes). This one is public and can be relied on. In your case encodedOffset is 1 because that's the offset for that character, as measured in UTF-16 code units. Note that the encoding of the string in memory doesn't matter! encodedOffset is always UTF-16.
The transcodedOffset, which stores the index's offset inside the current UTF-16 code unit. This is also an internal property that you can't access. The value is usually 0 for most indices, unless you have an index into the string's UTF-8 view that refers to a code unit which doesn't fall on a UTF-16 boundary. In that case, the transcodedOffset will store the offset in bytes from the encodedOffset.
Now why is _compoundOffset == 4? Because it stores the transcodedOffset in the two least significant bits and the encodedOffset in the 62 most significant bits. So the bit pattern for encodedOffset == 1, transcodedOffset == 0 is 0b100, which is 4.
You can verify all this in the source code for String.Index.
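The packing described above can be modeled in a few lines. This is only a sketch: PackedOffset and its members are hypothetical names made up for illustration, not the standard library's types, and the layout (2 low bits for transcodedOffset, the remaining bits for encodedOffset) is taken from the description in this answer.

```swift
// Hypothetical model of the bit layout described above.
// This is NOT the standard library's implementation.
struct PackedOffset {
    // Number of low bits reserved for the transcoded offset.
    static let strideBits: UInt64 = 2

    var compoundOffset: UInt64

    init(encodedOffset: Int, transcodedOffset: Int = 0) {
        precondition(encodedOffset >= 0, "offset must be non-negative")
        precondition((0..<4).contains(transcodedOffset), "must fit in 2 bits")
        // Shift the UTF-16 offset up and store the sub-offset in the low bits.
        compoundOffset = (UInt64(encodedOffset) << PackedOffset.strideBits)
            | UInt64(transcodedOffset)
    }

    // Upper bits: offset in UTF-16 code units.
    var encodedOffset: Int { Int(compoundOffset >> PackedOffset.strideBits) }

    // Lower two bits: offset inside the current UTF-16 code unit.
    var transcodedOffset: Int { Int(compoundOffset & 0b11) }
}

let secondCharacter = PackedOffset(encodedOffset: 1)
print(secondCharacter.compoundOffset) // 4, i.e. 0b100
```

With encodedOffset == 1 and transcodedOffset == 0, the packed value is 1 << 2, which matches the 4 seen in the Playground.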
In Swift 3, you can count the characters in a String with:
str.characters.count
I need to do this frequently, and that line above looks like it could be O(N). Is there a way to get a string length, or a length of something — maybe the underlying unicode buffer — with an operation that is guaranteed to not have to walk the entire string? Maybe:
str.utf16.count
I ask because I'm checking the length of some text every time the user types a character, to limit the size of a UITextView. The call doesn't need to be an exact count of the glyphs, like characters.count.
This is a good question. The answer is... complicated. Converting from UTF-8 to UTF-16, or vice-versa, or converting to or from some other encoding, will all require examining the string, since the characters can be made up of more than one code unit. So if you want to get the count in constant time, it's going to come down to what the internal representation is. If the string is using UTF-16 internally, then it's a reasonable assumption that string.utf16.count would be in constant time, but if the internal representation is UTF-8 or something else, then the string will need to be analyzed to determine what the length in UTF-16 would be. So what's String using internally? Well:
https://github.com/apple/swift/blob/master/stdlib/public/core/StringCore.swift
/// The core implementation of a highly-optimizable String that
/// can store both ASCII and UTF-16, and can wrap native Swift
/// _StringBuffer or NSString instances.
This is discouraging. The internal representation could be ASCII or UTF-16, or it could be wrapping a Foundation NSString. Hrm. We do know that NSString uses UTF-16 internally, since this is actually documented, so that's good. So the main outlier here is when the string stores ASCII. The saving grace is that since the first 128 Unicode code points have the same values as the ASCII character set, any ASCII character 0xXX should correspond to the single UTF-16 code unit 0x00XX, so the UTF-16 length in code units should simply equal the ASCII length (occupying twice as many bytes), and thus be calculable in constant time. Is this the case in the implementation? Let's look.
In the UTF16View source, there is no implementation of count. It appears that count is inherited from Collection's implementation, which is implemented via distance():
public var count: IndexDistance {
return distance(from: startIndex, to: endIndex)
}
UTF16View's implementation of distance() looks like this:
public func distance(from start: Index, to end: Index) -> IndexDistance {
// FIXME: swift-3-indexing-model: range check start and end?
return start.encodedOffset.distance(to: end.encodedOffset)
}
And in the String.Index source, encodedOffset looks like this:
public var encodedOffset : Int {
return Int(_compoundOffset >> _Self._strideBits)
}
where _compoundOffset appears to be a simple 64-bit integer:
internal var _compoundOffset : UInt64
and _strideBits appears to be a straight integer as well:
internal static var _strideBits : Int { return 2 }
So it... looks... like you should get constant time from string.utf16.count, since unless I'm making a mistake somewhere, you're just bit-shifting a couple of integers and then subtracting one from the other (I'd probably still run some tests to be sure). The caveat is, of course, that this isn't documented, and thus could change in the future, particularly since the documentation for String does claim that it needs to iterate through the string:
Unlike with isEmpty, calculating a view’s count property requires iterating through the elements of the string.
With all that said, you're using a UITextView, which is implemented in Objective-C via NSAttributedString. If you're willing to incur the Objective-C message-passing overhead (which, let's be honest, is probably occurring behind the scenes anyway to generate the String), you can just call its length property. Since NSAttributedString is built on top of NSString, which does guarantee that it uses UTF-16 internally, that call is almost certain to run in constant time.
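For the length check itself, a minimal sketch might look like the following. fitsLimit and its parameter names are hypothetical; the range is assumed to be measured in UTF-16 code units, the same way the NSRange passed to textView(_:shouldChangeTextIn:replacementText:) is.

```swift
/// Returns true if the edited text would stay within `limit` UTF-16 code units.
/// `range` is the span being replaced, measured in UTF-16 code units.
/// (Hypothetical helper, for illustration only.)
func fitsLimit(current: String,
               replacing range: Range<Int>,
               with replacement: String,
               limit: Int) -> Bool {
    let currentLength = current.utf16.count
    precondition(range.lowerBound >= 0 && range.upperBound <= currentLength,
                 "range out of bounds")
    // New length = old length - removed code units + inserted code units.
    let newLength = currentLength - range.count + replacement.utf16.count
    return newLength <= limit
}

// One user-visible character can occupy more than one UTF-16 code unit.
let thumbsUp = "\u{1F44D}"
print(thumbsUp.count)        // 1
print(thumbsUp.utf16.count)  // 2

// Appending the emoji to "hello" gives 7 code units, which exceeds a limit of 6.
print(fitsLimit(current: "hello", replacing: 5..<5, with: thumbsUp, limit: 6)) // false
```

Counting in UTF-16 code units keeps the check consistent with the ranges UIKit reports, at the cost of not matching the number of visible characters exactly.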
Float values are getting changed after parsing with JSONKit. The problem occurs after calling objectFromJSONString or mutableObjectFromJSONString.
The JSON response is fine before this method is triggered in JSONKit.m:
static id _NSStringObjectFromJSONString(NSString *jsonString, JKParseOptionFlags parseOptionFlags, NSError **error, BOOL mutableCollection)
Original response:
"value":"1002.65"
Response after calling objectFromJSONString:
"value":"1002.6500000001" or sometimes "value":"1002.649999999"
Thanks.
This is not an issue.
The value 1002.65 cannot be represented exactly as an IEEE 754 floating-point number.
Floating-point numbers are converted to their decimal representation using the printf format conversion specifier %.17g.
From the Docs:
The C double primitive type, or IEEE 754 Double 64-bit floating-point,
is used to represent floating-point JSON Number values. JSON that
contains floating-point Number values that can not be represented as a
double (i.e., due to over or underflow) will fail to parse and
optionally return a NSError object. The function strtod() is used to
perform the conversion. Note that the JSON standard does not allow for
infinities or NaN (Not a Number). The conversion and manipulation of
floating-point values is non-trivial. Unfortunately, RFC 4627 is
silent on how such details should be handled. You should not depend on
or expect that when a floating-point value is round tripped that it
will have the same textual representation or even compare equal. This
is true even when JSONKit is used as both the parser and creator of
the JSON, let alone when transferring JSON between different systems
and implementations.
Source: See this thread https://github.com/johnezang/JSONKit/issues/110
Solution: specify a precision when converting the float to a string for output. NSNumberFormatter is the better choice; a printf-style format specifier also works.
Use a fixed-point format specifier, like:
NSLog(@"value = %.2f", floatvalue);
Now it will show value = 1002.65
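The difference between a round-trip format and a display format can be sketched as follows. This is an illustration in Swift rather than Objective-C, and the variable names are made up; it assumes Foundation's String(format:) with the same printf-style specifiers mentioned above. The neighbor value stands in for a parsed number that ended up one representable step away from the intended one.

```swift
import Foundation

let value = 1002.65            // the Double nearest to the decimal 1002.65
let neighbor = value.nextUp    // the next representable Double above it

// 17 significant digits are enough to tell any two Doubles apart,
// so the tiny difference becomes visible in the text.
let roundTrip = String(format: "%.17g", neighbor)

// A fixed precision hides the representation noise for display purposes.
let display = String(format: "%.2f", neighbor)

print(roundTrip)
print(display)   // 1002.65
```

The 17-digit form is the right choice when the text must parse back to the same Double; the 2-digit form is the right choice when a person will read it.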
How do you use NSCoding to encode (and decode) an array of ten values of primitive type int? Encode each integer individually (in a for-loop)? But what if my array held one million integers? Is there a more satisfying alternative to using a for-loop here?
Edit (after first answer): And decode? (@Justin: I'll then tick your answer.)
If performance is your concern here: CFData/NSData is NSCoding compliant, so just wrap your serialized representation of the array in an NSData object.
edit to detail encoding/decoding:
your array of ints will need to be converted to a common endian format (a swap may be needed, depending on the machine's endianness) - e.g. always store it as little or big endian. during encoding, convert it to an array of integers in the specified endianness and pass that to the NSData object, then pass the NSData representation to the NSCoder instance. at decode, you'll receive an NSData object for the key, and you conditionally convert it to the native endianness of the machine. one set of byte swapping routines available for OS X and iOS begins with OSSwap*.
alternatively, see -[NSCoder encodeBytes:voidPtr length:numBytes forKey:key]. this routine also requires the client to swap endianness.
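a minimal sketch of the byte-order handling, written in Swift and assuming 32-bit ints stored as little endian. encodeLittleEndian and decodeLittleEndian are hypothetical helper names; the resulting Data would then be handed to the coder as a single object (for example with encode(_:forKey:)) instead of one call per integer.

```swift
import Foundation

/// Packs the integers into one blob, always in little-endian byte order.
/// (Hypothetical helper, for illustration only.)
func encodeLittleEndian(_ values: [Int32]) -> Data {
    var data = Data(capacity: values.count * MemoryLayout<Int32>.size)
    for value in values {
        // littleEndian swaps bytes only on big-endian machines.
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    return data
}

/// Rebuilds the integers from a little-endian blob on any machine.
func decodeLittleEndian(_ data: Data) -> [Int32] {
    let size = MemoryLayout<Int32>.size
    precondition(data.count % size == 0, "blob length must be a multiple of 4")
    var result: [Int32] = []
    result.reserveCapacity(data.count / size)
    var index = data.startIndex
    while index < data.endIndex {
        // Assemble each value byte by byte, least significant byte first.
        var raw: UInt32 = 0
        for i in 0..<size {
            raw |= UInt32(data[index + i]) << UInt32(8 * i)
        }
        result.append(Int32(bitPattern: raw))
        index += size
    }
    return result
}

let original: [Int32] = [1, -2, 300_000]
let blob = encodeLittleEndian(original)
print(decodeLittleEndian(blob) == original) // true
```

the loop here only touches memory; the coder itself sees one data object, which is what avoids the per-integer encoding cost.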