Why does Swift String.Index keep its index value 4 times bigger than expected? - swift

I was trying to implement the Boyer-Moore algorithm in a Swift Playground. I used Swift String.Index a lot, and something started to bother me: why are indexes kept 4 times bigger than what it seems they should be?
For example:
let why = "is s on 4th position not 1st".index(of: "s")
In a Swift Playground, this code produces a _compoundOffset of 4, not 1. I'm sure there is a reason for this, but I couldn't find an explanation anywhere.
It's not a duplicate of any question that explains how to get the index of a character in Swift; I know how to do that, and I used the index(of:) function only to illustrate the question. I wanted to know why the value for the 2nd character is 4, not 1, when using String.Index.
So I guess the way it stores indexes is private and I don't need to know the internal implementation; it's probably connected with UTF-16 and UTF-32 encoding.

First of all, don't ever assume _compoundOffset to be anything other than an implementation detail. _compoundOffset is an internal property of String.Index that uses bit masking to store two values in this one number:
The encodedOffset, which is the index's byte offset in terms of UTF-16 code units. This one is public and can be relied on. In your case encodedOffset is 1 because that's the offset for that character, as measured in UTF-16 code units. Note that the encoding of the string in memory doesn't matter! encodedOffset is always UTF-16.
The transcodedOffset, which stores the index's offset inside the current UTF-16 code unit. This is also an internal property that you can't access. The value is usually 0 for most indices, unless you have an index into the string's UTF-8 view that refers to a code unit which doesn't fall on a UTF-16 boundary. In that case, the transcodedOffset will store the offset in bytes from the encodedOffset.
Now why is _compoundOffset == 4? Because it stores the transcodedOffset in the two least significant bits and the encodedOffset in the 62 most significant bits. So the bit pattern for encodedOffset == 1, transcodedOffset == 0 is 0b100, which is 4.
You can verify all this in the source code for String.Index.
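For illustration, here is a minimal Playground sketch (assuming Swift 4, where encodedOffset is public; in Swift 5 it is deprecated in favor of utf16Offset(in:), and index(of:) becomes firstIndex(of:)):
let str = "is s on 4th position not 1st"

if let idx = str.index(of: "s") {
    // Public API: the index's offset measured in UTF-16 code units.
    print(idx.encodedOffset)  // 1, the "s" in "is"

    // Internally, _compoundOffset packs this as
    // (encodedOffset << 2) | transcodedOffset == (1 << 2) | 0 == 4,
    // which is the 4 shown in the Playground sidebar.
}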

Related

Swift: what's the difference between "maximumLengthOfBytes(using:)" and "lengthOfBytes(using:)"?

The descriptions seem the same to me. "Required" vs. "needed": what does that mean?
// Returns the number of bytes required to store the String in a given encoding.
lengthOfBytes(using:)
// Returns the maximum number of bytes needed to store the receiver in a given encoding.
maximumLengthOfBytes(using:)
lengthOfBytes(using:) returns the exact number, while maximumLengthOfBytes(using:) returns an estimate, which "may be considerably greater than the actual length" (in Apple's own words).
The main difference between these methods is given by the Discussion sections of their documentation.
lengthOfBytes(using:):
The result is exact and is returned in O(n) time.
maximumLengthOfBytes(using:):
The result is an estimate and is returned in O(1) time; the estimate may be considerably greater than the actual length needed.
An example where they may differ: the UTF-8 string encoding requires between 1 and 4 bytes to represent a code point, and the exact representation depends on which code point is being represented. lengthOfBytes(using:) walks the string and calculates the exact number of bytes for every single character, while maximumLengthOfBytes(using:) is allowed to assume the worst case for every code unit without looking at the actual values in the string. In this example, the maximum returned is 3× as much as is actually needed:
import Foundation
let str = "Hello, world!"
print(str.lengthOfBytes(using: .utf8)) // => 13
print(str.maximumLengthOfBytes(using: .utf8)) // => 39
maximumLengthOfBytes(using:) can give you an immediate answer with little to no computation, at the cost of overestimating, sometimes greatly. The tradeoff of which to use depends on your specific use-case.

In Swift, how to get estimate of String length in constant time?

In Swift 3, you can count the characters in a String with:
str.characters.count
I need to do this frequently, and that line above looks like it could be O(N). Is there a way to get a string length, or a length of something — maybe the underlying unicode buffer — with an operation that is guaranteed to not have to walk the entire string? Maybe:
str.utf16.count
I ask because I'm checking the length of some text every time the user types a character, to limit the size of a UITextView. The call doesn't need to be an exact count of the glyphs, like characters.count.
This is a good question. The answer is... complicated.

Converting from UTF-8 to UTF-16, or vice versa, or converting to or from some other encoding, will all require examining the string, since the characters can be made up of more than one code unit. So if you want to get the count in constant time, it's going to come down to what the internal representation is. If the string is using UTF-16 internally, then it's a reasonable assumption that string.utf16.count would be in constant time, but if the internal representation is UTF-8 or something else, then the string will need to be analyzed to determine what the length in UTF-16 would be. So what's String using internally? Well:
https://github.com/apple/swift/blob/master/stdlib/public/core/StringCore.swift
/// The core implementation of a highly-optimizable String that
/// can store both ASCII and UTF-16, and can wrap native Swift
/// _StringBuffer or NSString instances.
This is discouraging. The internal representation could be ASCII or UTF-16, or it could be wrapping a Foundation NSString. Hrm.

We do know that NSString uses UTF-16 internally, since this is actually documented, so that's good. So the main outlier here is when the string stores ASCII. The saving grace is that since the first 128 Unicode code points have the same values as the ASCII character set, any ASCII character 0xXX corresponds to the single UTF-16 code unit 0x00XX, so an ASCII string's UTF-16 code-unit count simply equals its ASCII length, and is thus calculable in constant time. Is this the case in the implementation? Let's look.
In the UTF16View source, there is no implementation of count. It appears that count is inherited from Collection's implementation, which is implemented via distance():
public var count: IndexDistance {
    return distance(from: startIndex, to: endIndex)
}
UTF16View's implementation of distance() looks like this:
public func distance(from start: Index, to end: Index) -> IndexDistance {
    // FIXME: swift-3-indexing-model: range check start and end?
    return start.encodedOffset.distance(to: end.encodedOffset)
}
And in the String.Index source, encodedOffset looks like this:
public var encodedOffset : Int {
    return Int(_compoundOffset >> _Self._strideBits)
}
where _compoundOffset appears to be a simple 64-bit integer:
internal var _compoundOffset : UInt64
and _strideBits appears to be a straight integer as well:
internal static var _strideBits : Int { return 2 }
So it... looks... like you should get constant time from string.utf16.count, since unless I'm making a mistake somewhere, you're just bit-shifting a couple of integers and then subtracting the results (I'd probably still run some tests to be sure). The caveat is, of course, that this isn't documented, and thus could change in the future, particularly since the documentation for String does claim that it needs to iterate through the string:
Unlike with isEmpty, calculating a view’s count property requires iterating through the elements of the string.
With all that said, you're using a UITextView, which is implemented in Objective-C via NSAttributedString. If you're willing to incur the Objective-C message-passing overhead (which, let's be honest, is probably happening behind the scenes anyway to generate the String), you can just call its length property. Since NSAttributedString is built on top of NSString, which does guarantee that it uses UTF-16 internally, that call is almost certain to be constant time.
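For the size-limiting use case itself, here is a hedged sketch of a delegate that counts UTF-16 code units; the 200-unit limit and the class name are assumptions for illustration:
import UIKit

class LimitingTextViewDelegate: NSObject, UITextViewDelegate {
    let maxUTF16Length = 200  // hypothetical limit

    func textView(_ textView: UITextView,
                  shouldChangeTextIn range: NSRange,
                  replacementText text: String) -> Bool {
        // NSRange is already measured in UTF-16 code units, so using
        // utf16.count keeps all the arithmetic in one unit system.
        let currentLength = textView.text.utf16.count
        let newLength = currentLength - range.length + text.utf16.count
        return newLength <= maxUTF16Length
    }
}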

Memory occupied by a string variable in swift [duplicate]

I want to find the memory occupied by a string variable in bytes.
Let's suppose we have a variable named test:
let test = "abvd"
I want to know how to find the size of test at runtime.
I have checked the details in Calculate the size in bytes of a Swift String, but this question is different.
According to Apple, "Behind the scenes, Swift’s native String type is built from Unicode scalar values. A Unicode scalar is a unique 21-bit number for a character or modifier, such as U+0061 for LATIN SMALL LETTER A ("a"), or U+1F425 for FRONT-FACING BABY CHICK ("🐥")." This can be found in https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
So if that's the case, Apple is actually using a fixed-size representation for Unicode code points instead of the variable-width UTF-8 encoding.
I wanted to verify this claim.
Thanks in advance.
To your real goal: neither understanding is correct. Strings do not promise an internal representation. They can hold a variety of representations, depending on how they're constructed. In principle they can even take zero real memory if they are statically defined in the binary and memory mapped (I can't remember if the StaticString type makes use of this fully yet). The only way you're going to answer this question is to look at the current implementation, starting in String.swift, and then moving to StringCore.swift, and then reading the rest of the string files.
To your particular question, this is probably the beginning of the answer you're looking for (but again, this is current implementation; it's not part of any spec):
/// The core implementation of a highly-optimizable String that
/// can store both ASCII and UTF-16, and can wrap native Swift
/// _StringBuffer or NSString instances.
The end of the answer you're looking for is "it's complicated."
Note that if you ask for MemoryLayout.size(ofValue: test), you're going to get a surprising result (24), because that's just measuring the container. There is a reference type inside the container (which takes one word of storage for a pointer). There's no mechanism to determine "all the storage used by this value" because that's not very well defined when pointers get involved.
String only has one property:
var _core: _StringCore
And _StringCore has the following properties:
public var _baseAddress: UnsafeMutableRawPointer?
var _countAndFlags: UInt
public var _owner: AnyObject?
Each of those take one word (8 bytes on a 64-bit platform) of storage, so 24 bytes total. It doesn't matter how long the string is.
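As a quick check of that 24-byte figure, here is a small sketch (the exact number depends on the Swift version; 24 matches the _StringCore-based implementation discussed above):
let test = "abvd"

// Measures only the String container itself, not any heap storage it references.
print(MemoryLayout<String>.size)         // 24 in the implementation described above
print(MemoryLayout.size(ofValue: test))  // same value: the struct, not the characters

// Byte counts for the text content depend on the encoding you choose:
print(test.utf8.count)   // 4
print(test.utf16.count)  // 4 code units, i.e. 8 bytes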

Using signed integers instead of unsigned integers [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
In software development, it's usually a good idea to take advantage of compiler errors. Letting the compiler work for you by checking your code makes sense. In strongly-typed languages, if a variable only has two valid values, you'd make it a Boolean or define an enum for it. Swift takes this further with the Optional type.
In my mind, the same would apply to unsigned integers: if you know a negative value is impossible, program in a way that enforces it. I'm talking about high-level APIs, not low-level APIs where a negative value is usually used as a cryptic error-signalling mechanism.
And yet Apple suggests avoiding unsigned integers:
Use UInt only when you specifically need an unsigned integer type with the same size as the platform’s native word size. If this is not the case, Int is preferred, even when the values to be stored are known to be non-negative. [...]
Here's an example: Swift's Array.count returns an Int. How can one possibly have a negative number of items?!
Why?!
Apple states that:
A consistent use of Int for integer values aids code interoperability, avoids the need to convert between different number types, and matches integer type inference, as described in Type Safety and Type Inference.
But I don't buy it! Using Int wouldn't aid "interoperability" any more than UInt, since Int could resolve to Int32 or Int64 (for 32-bit and 64-bit platforms respectively).
If you care about robustness at all, using signed integers where they make no logical sense essentially forces you to do an additional check (what if the value is negative?).
I can't see the act of casting between signed and unsigned as being anything other than trivial. Wouldn't that simply tell the compiler to emit either signed or unsigned machine instructions?!
Casting back and forth between signed and unsigned integers is extremely bug-prone on one side, while adding little value on the other.
One reason you suggest for having unsigned integers, the implicit guarantee that an index never takes a negative value, is a bit speculative. Where would a negative value potentially come from? From the code, of course: either from a static value or from a computation. But in both cases, for a static or computed value to be able to go negative, it must be handled as a signed integer. Therefore, it is the language implementation's responsibility to introduce all sorts of checks every time you assign a signed value to an unsigned variable (or vice versa). This means we're not really talking about being forced "to do an additional check" or not, but about having this check made implicitly for us by the language every time we don't feel like bothering with corner cases, as in the sketch below.
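A minimal sketch of those checks in Swift (the variable names here are just for illustration):
let signed: Int = -1

// The plain initializer traps at runtime if the value doesn't fit:
// let bad = UInt(signed)   // crash: negative value is not representable

// The failable, checked initializer makes the corner case explicit instead:
if let unsigned = UInt(exactly: signed) {
    print("fits:", unsigned)
} else {
    print("a negative Int cannot become a UInt")
}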
Conceptually, signed and unsigned integers come into the language from the low level (machine code). In other words, unsigned integers are in the language not because the language itself needs them, but because they are directly bridgeable to machine instructions, which allows a performance gain just for being native. There is no other big reason behind it. Therefore, if one has even a glimpse of portability in mind, one would say "Be it Int and that is it. Let developers write clean code; we'll take care of the rest."
As long as we have an opinion-based question...
Basing programming language mathematical operations on machine register size is one of the great travesties of Computer Science. There should be Integer*, Rational, Real and Complex - done and dusted. You need something that maps to a U8 Register for some device driver? Call it a RegisterOfU8Data - or whatever - just not 'Int(eger)'
*Of course, calling it an 'Integer' means it 'rolls over' to an unlimited range, aka BigNum.
Sharing what I've discovered, which indirectly helps me understand... at least in part. Maybe it ends up helping others?!
After a few days of digging and thinking, it seems part of my problem boils down to the usage of the word "casting".
As far back as I can remember, I've been taught that casting was very distinct and different from converting in the following ways:
Converting kept the meaning but changed the data.
Casting kept the data but changed the meaning.
Casting was a mechanism allowing you to inform the compiler how both it and you would be manipulating some piece of data (no changing of data, thus no cost). "Me to the compiler: Okay! Initially I told you this byte was a number because I wanted to perform math on it. Now, let's treat it as an ASCII character."
Converting was a mechanism for transforming the data into a different format. "Me to the compiler: I have this number; please generate an ASCII string that represents that value."
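In Swift terms, the closest thing to that bit-preserving "cast" is the bitPattern family of initializers. A quick sketch of the distinction (the values are just for illustration):
let raw: UInt8 = 0b1111_1111                // 255 as an unsigned byte

// "Casting" in the sense above: keep the bits, change the interpretation.
let reinterpreted = Int8(bitPattern: raw)   // -1, same bit pattern

// "Converting": keep the meaning (255), represent it in another type.
let converted = Int(raw)                    // 255, widened to Int

print(reinterpreted, converted)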
My problem, it seems, is that in Swift (and most likely other languages) the line between casting and converting is blurred...
Case in point, Apple explains that:
Type casting in Swift is implemented with the is and as operators. […]
var x: Int = 5
var y: UInt = x as UInt // Casting... Compiler refuses claiming
// it's not "convertible".
// I don't want to convert it, I want to cast it.
If "casting" is not this clearly defined action it could explain why unsigned integers are to be avoided like the plague...

What is the default hash code that Mathematica uses?

The online documentation says
Hash[expr]
gives an integer hash code for the expression expr.
Hash[expr,"type"]
gives an integer hash code of the specified type for expr.
It also gives "possible hash code types":
"Adler32" Adler 32-bit cyclic redundancy check
"CRC32" 32-bit cyclic redundancy check
"MD2" 128-bit MD2 code
"MD5" 128-bit MD5 code
"SHA" 160-bit SHA-1 code
"SHA256" 256-bit SHA code
"SHA384" 384-bit SHA code
"SHA512" 512-bit SHA code
Yet none of these correspond to the default returned by Hash[expr].
So my questions are:
What method does the default Hash use?
Are there any other hash codes built in?
The default hash algorithm is, more-or-less, a basic 32-bit hash function applied to the underlying expression representation, but the exact code is a proprietary component of the Mathematica kernel. It's subject to (and has) change between Mathematica versions, and lacks a number of desirable cryptographic properties, so I personally recommend you use MD5 or one of the SHA variants for any serious application where security matters. The built-in hash is intended for typical data structure use (e.g. in a hash table).
The named hash algorithms you list from the documentation are the only ones currently available. Are you looking for a different one in particular?
I've been doing some reverse engineering on the 32-bit and 64-bit Windows versions of Mathematica 10.4, and this is what I found:
32 BIT
It uses a Fowler–Noll–Vo hash function (FNV-1, with the multiplication done first) with 16777619 as the FNV prime and 84696351 as the offset basis. This function is applied to the Murmur3-32 hash of the address of the expression's data (Mathematica uses a pointer in order to keep a single instance of each piece of data). The address is eventually resolved to the value: for simple machine integers the value is immediate, for others it is a bit trickier.

The function implementing Murmur3-32 in fact takes an additional parameter (defaulting to 4, the special case that makes it behave as described on Wikipedia) which selects how many bytes of the input expression struct to take. Since a normal expression is internally represented as an array of pointers, one can take the first, the second, etc. by repeatedly adding 4 (bytes = 32 bits) to the base pointer of the expression. So passing 8 to the function will give the second pointer, 12 the third, and so on. Since the internal structs (big integers, machine integers, machine reals, big reals and so on) have different member variables (e.g. a machine integer has only a pointer to int, a complex has 2 pointers to numbers, etc.), for each expression struct there is a "wrapper" that combines its internal members into one single 32-bit hash (basically with FNV-1 rounds). The simplest expression to hash is an integer.
The murmur3_32() function has 1131470165 as seed, n=0 and other params as in Wikipedia.
So we have:
hash_of_number = 16777619 * (84696351 ^ murmur3_32( &number ))
with " ^ " meaning XOR.
I haven't really tried it: pointers are encoded using the WinAPI EncodePointer(), so they can't be exploited at runtime. (It might be worth running it on Linux under Wine with a modified version of EncodePointer?)
64 BIT
It uses an FNV-1 64-bit hash function with 0xAF63BD4C8601B7DF as the offset basis and 0x100000001B3 as the FNV prime, along with a SIP64-24 hash (here's the reference code) with the first 64 bits of 0x0AE3F68FE7126BBF76F98EF7F39DE1521 as k0 and the last 64 bits as k1. The function is applied to the base pointer of the expression and resolved internally. As in the 32-bit murmur3 case, there is an additional parameter (defaulting to 8) that selects how many pointers to take from the input expression struct. For each expression type there is a wrapper that condenses the struct members into a single hash by means of FNV-1 64-bit rounds.
For a machine integer, we have:
hash_number_64bit = 0x100000001B3 * (0xAF63BD4C8601B7DF ^ SIP64_24( &number ))
Again, I didn't really try it. Could anyone try?
Not for the faint-hearted
If you take a look at their notes on internal implementation, they say that "Each expression contains a special form of hash code that is used both in pattern matching and evaluation."
The hash code they're referring to is the one generated by these functions - at some point in the normal expression wrapper function there's an assignment that puts the computed hash inside the expression struct itself.
It would certainly be cool to understand HOW they can make use of these hashes for pattern-matching purposes. So I tried running through the bigInteger wrapper to see what happens; it's the simplest compound expression.
It begins by checking something that returns 1 (I don't know what).
So it executes
var1 = 16777619 * (67918732 ^ hashMachineInteger(1));
where hashMachineInteger() is what we described before, including the values.
Then it reads the length in bytes of the bigInt from the struct (bignum_length) and runs
result = 16777619 * (v10 ^ murmur3_32(v6, 4 * v4));
Note that murmur3_32() is called if 4 * bignum_length is greater than 8 (this may be related to the maximum value of machine integers, $MaxMachineNumber 2^32^32, and conversely to what a bigInt is supposed to be).
So, the final code is:
if (bignum_length > 8) {
    result = 16777619 * (16777619 * (67918732 ^ (16777619 * (84696351 ^ murmur3_32( 1, 4 )))) ^ murmur3_32( &bignum, 4 * bignum_length ));
}
I've made some hypotheses about the properties of this construction. The presence of many XORs and the fact that 16777619 + 67918732 = 84696351 may make one think that some sort of cyclic structure is exploited to check patterns, i.e. subtracting the offset and dividing by the prime, or something like that. The software Cassandra uses the Murmur hash algorithm for token generation; see these images for what I mean by "cyclic structure". Maybe different primes are used for each expression; I still have to check.
Hope it helps
It seems that Hash calls the internal Data`HashCode function, then divides it by 2, takes the first 20 digits of N[..] and then the IntegerPart, plus one, that is:
IntegerPart[N[Data`HashCode[expr]/2, 20]] + 1