Memory occupied by a string variable in Swift [duplicate] - swift

This question already has answers here:
Swift: How to use sizeof?
(5 answers)
Calculate the size in bytes of a Swift String
(1 answer)
Closed 5 years ago.
I want to find the memory occupied by a string variable in bytes.
Let's suppose we have a variable named test
let test = "abvd"
I want to know how to find the size of test at runtime.
I have already checked the details in Calculate the size in bytes of a Swift String, but this question is different.
According to apple, "Behind the scenes, Swift’s native String type is built from Unicode scalar values. A Unicode scalar is a unique 21-bit number for a character or modifier, such as U+0061 for LATIN SMALL LETTER A ("a"), or U+1F425 for FRONT-FACING BABY CHICK ("🐥")." This can be found in https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
So if that's the case, Apple is actually using a fixed-size representation for Unicode code points instead of the variable-length UTF-8 encoding.
I wanted to verify this claim.
Thanks in advance.

To your real goal: neither understanding is correct. Strings do not promise an internal representation. They can hold a variety of representations, depending on how they're constructed. In principle they can even take zero real memory if they are statically defined in the binary and memory mapped (I can't remember if the StaticString type makes use of this fully yet). The only way you're going to answer this question is to look at the current implementation, starting in String.swift, and then moving to StringCore.swift, and then reading the rest of the string files.
To your particular question, this is probably the beginning of the answer you're looking for (but again, this is current implementation; it's not part of any spec):
/// The core implementation of a highly-optimizable String that
/// can store both ASCII and UTF-16, and can wrap native Swift
/// _StringBuffer or NSString instances.
The end of the answer you're looking for is "it's complicated."
Note that if you ask for MemoryLayout.size(ofValue: test), you're going to get a surprising result (24), because that's just measuring the container. There is a reference type inside the container (which takes one word of storage for a pointer). There's no mechanism to determine "all the storage used by this value" because that's not very well defined when pointers get involved.
String only has one property:
var _core: _StringCore
And _StringCore has the following properties:
public var _baseAddress: UnsafeMutableRawPointer?
var _countAndFlags: UInt
public var _owner: AnyObject?
Each of those takes one word (8 bytes on a 64-bit platform) of storage, so 24 bytes total. It doesn't matter how long the string is.
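To see that this measures only the container, here's a quick sketch you can run; note the exact number is version-dependent (24 bytes with the _StringCore layout described above, and a different, smaller layout in later Swift releases):
let short = "a"
let long = String(repeating: "a", count: 10_000)

// Both print the same value, because MemoryLayout only measures the
// container, not any heap buffer it may point to.
print(MemoryLayout.size(ofValue: short))
print(MemoryLayout.size(ofValue: long))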

Related

Why does Swift String.Index keep its index value 4 times bigger than the real one?

I was trying to implement the Boyer-Moore algorithm in a Swift Playground and I used Swift String.Index a lot, and something that started to bother me is why indexes are kept 4 times bigger than what it seems they should be.
For example:
let why = "is s on 4th position not 1st".index(of: "s")
This code in a Swift Playground will produce a _compoundOffset of 4, not 1. I'm sure there is a reason for doing this, but I couldn't find an explanation anywhere.
It's not a duplicate of any question that explains how to get the index of a character in Swift; I know that, and I used the index(of:) function just to illustrate the question. I wanted to know why the value for the 2nd character is 4 and not 1 when using String.Index.
So I guess the way it keeps indexes is private and I don't need to know the internal implementation; it's probably connected with UTF-16 and UTF-32 encoding.
First of all, don't ever assume _compoundOffset to be anything other than an implementation detail. _compoundOffset is an internal property of String.Index that uses bit masking to store two values in this one number:
The encodedOffset, which is the index's offset in terms of UTF-16 code units. This one is public and can be relied on. In your case encodedOffset is 1 because that's the offset for that character, as measured in UTF-16 code units. Note that the encoding of the string in memory doesn't matter! encodedOffset is always UTF-16.
The transcodedOffset, which stores the index's offset inside the current UTF-16 code unit. This is also an internal property that you can't access. The value is usually 0 for most indices, unless you have an index into the string's UTF-8 view that refers to a code unit which doesn't fall on a UTF-16 boundary. In that case, the transcodedOffset will store the offset in bytes from the encodedOffset.
Now why is _compoundOffset == 4? Because it stores the transcodedOffset in the two least significant bits and the encodedOffset in the 62 most significant bits. So the bit pattern for encodedOffset == 1, transcodedOffset == 0 is 0b100, which is 4.
You can verify all this in the source code for String.Index.
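If you just want to poke at the public part of this, here's a small sketch; note that index(of:) was later renamed firstIndex(of:), and encodedOffset has since been deprecated in favor of utf16Offset(in:):
let why = "is s on 4th position not 1st"
if let i = why.firstIndex(of: "s") {
    // Prints 1: the offset of the second character, in UTF-16 code units.
    print(i.encodedOffset)
}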

In Swift, how to get estimate of String length in constant time?

In Swift 3, you can count the characters in a String with:
str.characters.count
I need to do this frequently, and that line above looks like it could be O(N). Is there a way to get a string length, or a length of something — maybe the underlying unicode buffer — with an operation that is guaranteed to not have to walk the entire string? Maybe:
str.utf16.count
I ask because I'm checking the length of some text every time the user types a character, to limit the size of a UITextView. The call doesn't need to be an exact count of the glyphs, like characters.count.
This is a good question. The answer is... complicated. Converting from UTF-8 to UTF-16, or vice-versa, or converting to or from some other encoding, will all require examining the string, since the characters can be made up of more than one code unit. So if you want to get the count in constant time, it's going to come down to what the internal representation is. If the string is using UTF-16 internally, then it's a reasonable assumption that string.utf16.count would be in constant time, but if the internal representation is UTF-8 or something else, then the string will need to be analyzed to determine what the length in UTF-16 would be. So what's String using internally? Well:
https://github.com/apple/swift/blob/master/stdlib/public/core/StringCore.swift
/// The core implementation of a highly-optimizable String that
/// can store both ASCII and UTF-16, and can wrap native Swift
/// _StringBuffer or NSString instances.
This is discouraging. The internal representation could be ASCII or UTF-16, or it could be wrapping a Foundation NSString. Hrm. We do know that NSString uses UTF-16 internally, since this is actually documented, so that's good. So the main outlier here is when the string stores ASCII. The saving grace is that since the first 128 Unicode code points have the same values as the ASCII character set, any ASCII character 0xXX corresponds to the single UTF-16 code unit 0x00XX, so the UTF-16 code unit count is simply the ASCII length (and the byte length is just twice that), and thus calculable in constant time. Is this the case in the implementation? Let's look.
In the UTF16View source, there is no implementation of count. It appears that count is inherited from Collection's implementation, which is implemented via distance():
public var count: IndexDistance {
    return distance(from: startIndex, to: endIndex)
}
UTF16View's implementation of distance() looks like this:
public func distance(from start: Index, to end: Index) -> IndexDistance {
    // FIXME: swift-3-indexing-model: range check start and end?
    return start.encodedOffset.distance(to: end.encodedOffset)
}
And in the String.Index source, encodedOffset looks like this:
public var encodedOffset : Int {
    return Int(_compoundOffset >> _Self._strideBits)
}
where _compoundOffset appears to be a simple 64-bit integer:
internal var _compoundOffset : UInt64
and _strideBits appears to be a straight integer as well:
internal static var _strideBits : Int { return 2 }
So it... looks... like you should get constant time from string.utf16.count, since unless I'm making a mistake somewhere, you're just bit-shifting a couple of integers and taking their difference (I'd probably still run some tests to be sure). The caveat is, of course, that this isn't documented, and thus could change in the future, particularly since the documentation for String does claim that it needs to iterate through the string:
Unlike with isEmpty, calculating a view’s count property requires iterating through the elements of the string.
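To make the ASCII/UTF-16 relationship above concrete, here's a quick check of code unit counts (just an illustration; it says nothing about the internal representation):
let ascii = "hello"
print(ascii.utf8.count)    // 5
print(ascii.utf16.count)   // 5: one UTF-16 code unit per ASCII character

let chick = "🐥"
print(chick.utf8.count)    // 4
print(chick.utf16.count)   // 2: a surrogate pair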
With all that said, you're using a UITextView, which is implemented in Objective-C via NSAttributedString. If you're willing to incur the Objective-C message-passing overhead (which, let's be honest, is probably happening behind the scenes anyway to generate the String), you can just call its length property, which, since NSAttributedString is built on top of NSString, which does guarantee that it uses UTF-16 internally, is almost certain to be in constant time.
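If you go that route, a minimal sketch might look like this (utf16Length(of:) is just a hypothetical helper name; NSAttributedString.length and NSString.length both count UTF-16 code units):
import UIKit

// Hypothetical helper: read the length from the Objective-C side,
// which is documented to be UTF-16 based.
func utf16Length(of textView: UITextView) -> Int {
    if let attributed = textView.attributedText {
        return attributed.length
    }
    return ((textView.text ?? "") as NSString).length
}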

Does string concatenation in Swift make a new copy?

I can concatenate two strings in Swift like this:
var c = "Hello World"
c += "!"
Does this create a new string? (Allocating a new block of memory, copying over the original string, concatenating the "!" string and returning the new memory.) Or, does it update the original string in place (only allocating a new block of memory if the original block can't fit the character).
No, it does not make a new copy. The original string is mutated in place, and its buffer address remains the same.
As the Apple documentation (https://developer.apple.com/documentation/swift/string) says in the section on performance optimizations:
"Although strings in Swift have value semantics, strings use a copy-on-write strategy to store their data in a buffer. This buffer can then be shared by different copies of a string. A string’s data is only copied lazily, upon mutation, when more than one string instance is using the same buffer. Therefore, the first in any sequence of mutating operations may cost O(n) time and space."
Swift uses copy-on-write, so if more than one string instance shares the same buffer, the first mutation makes a copy; but if the buffer is used by only one instance, you can mutate it as you wish without generating copies.
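A small sketch of that documented behavior (the timing of the copy is what copy-on-write promises; the exact buffer management is an implementation detail):
var a = String(repeating: "x", count: 1_000)
var b = a        // O(1): b shares a's buffer, nothing is copied yet

b += "!"         // first mutation: b gets its own buffer (O(n)), a is untouched

print(a.count)   // 1000
print(b.count)   // 1001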

How to declare a variable-length string element inside a user-defined type in QBasic?

I'm learning QBasic and found a user-defined type example in some documentation. In this example there's a string element inside a user-defined type, and that string doesn't have a defined length.
However, my compiler throws the error "Expected STRING * on..." for this example. Here is a test case that defines the string length:
TYPE Person
    name AS STRING * 4
END TYPE
DIM Matheus AS Person:
Matheus.name = "Matheus":
PRINT Matheus.name:
It prints "Math", but I expected "Matheus". Is there a way to allow any length for this string?
Note: I'm using QB64 compiler
No, there is not a way to use a variable-length string in a TYPE, even with QB64. You might look into FreeBASIC if you want this feature, since it offers it.
TYPE creates a record type with the specified fields, and records have a fixed length. Look at the OPEN ... FOR RANDOM specification:
OPEN Filename$ FOR RANDOM AS #1 [LEN = recordlength%]
recordlength% is determined by getting the LEN of a TYPE variable or a FIELD statement.
If no record length is used in the OPEN statement, the default record size is 128 bytes except for the last record.
A record length cannot exceed 32767 or an error will occur!
TYPE was never intended to contain strings that are dynamically sized. This allows a developer to keep record sizes small. If you had an address book, for example, you wouldn't want people's names to be too large, else the address book wouldn't fit in memory.
QB64 didn't remove that restriction, probably because its original goal was to preserve compatibility with older QBASIC code.

Using signed integers instead of unsigned integers [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
In software development, it's usually a good idea to take advantage of compiler errors. Letting the compiler work for you by checking your code makes sense. In strongly typed languages, if a variable only has two valid values, you'd make it a Boolean or define an enum for it. Swift takes this further with the Optional type.
In my mind, the same would apply to unsigned integers: if you know a negative value is impossible, program in a way that enforces it. I'm talking about high-level APIs, not low-level APIs where a negative value is usually used as a cryptic error-signalling mechanism.
And yet Apple suggests avoiding unsigned integers:
Use UInt only when you specifically need an unsigned integer type with the same size as the platform’s native word size. If this is not the case, Int is preferred, even when the values to be stored are known to be non-negative. [...]
Here's an example: Swift's Array.count returns an Int. How can one possibly have a negative number of items?!
Why?!
Apple states that:
A consistent use of Int for integer values aids code interoperability, avoids the need to convert between different number types, and matches integer type inference, as described in Type Safety and Type Inference.
But I don't buy it! Using Int wouldn't aid "interoperability" any more than UInt, since Int could resolve to Int32 or Int64 (for 32-bit and 64-bit platforms respectively).
If you care about robustness at all, using signed integers where it makes no logical sense essentially forces you to do an additional check (what if the value is negative?).
I can't see the act of casting between signed and unsigned as being anything other than trivial. Wouldn't that simply tell the compiler to emit either signed or unsigned instructions in the machine code?!
Casting back and forth between signed and unsigned integers is extremely bug-prone on one side, while adding little value on the other.
One reason you suggest for having unsigned Int, namely the implicit guarantee that an index never takes a negative value... well, it's a bit speculative. Where would the potential for a negative value come from? Of course, from the code, that is, either from a static value or from a computation. But in both cases, for a static or a computed value to be able to go negative, it must be handled as a signed integer. Therefore, it is the language implementation's responsibility to introduce all sorts of checks every time you assign a signed value to an unsigned variable (or vice versa). This means that we are not talking about being forced "to do an additional check" or not, but about having this check implicitly made for us by the language every time we don't feel like bothering with corner cases.
Conceptually, signed and unsigned integers come into the language from the low level (machine code). In other words, the unsigned integer is in the language not because the language has a need for it, but because it is directly bridgeable to machine instructions, and hence allows a performance gain just for being native. There is no other big reason behind it. Therefore, if one has even a glimpse of portability in mind, one would say: "Let it be Int and that's it. Let the developer write clean code; we'll bring the rest."
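As a small illustration of that implicit check (hypothetical values, just to show where the trap or the explicit handling ends up):
let delta = -1                    // some computed, signed value
// let index = UInt(delta)        // would trap at runtime: -1 is not representable as UInt

if let index = UInt(exactly: delta) {
    print("index:", index)
} else {
    print("negative value: handle the corner case explicitly")
}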
As long as we have an opinion-based question...
Basing programming language mathematical operations on machine register size is one of the great travesties of Computer Science. There should be Integer*, Rational, Real and Complex - done and dusted. You need something that maps to a U8 Register for some device driver? Call it a RegisterOfU8Data - or whatever - just not 'Int(eger)'
*Of course, calling it an 'Integer' means it 'rolls over' to an unlimited range, aka BigNum.
Sharing what I've discovered which indirectly helps me understand... at least in part. Maybe it ends up helping others?!
After a few days of digging and thinking, it seems part of my problem boils down to the usage of the word "casting".
As far back as I can remember, I've been taught that casting was very distinct and different from converting in the following ways:
Converting kept the meaning but changed the data.
Casting kept the data but changed the meaning.
Casting was a mechanism allowing you to inform the compiler how both it and you would be manipulating some piece of data (no changing of data, thus no cost). Me to the compiler: "Okay! Initially I told you this byte was a number because I wanted to perform math on it. Now, let's treat it as an ASCII character."
Converting was a mechanism for transforming the data into different formats. Me to the compiler: "I have this number; please generate an ASCII string that represents that value."
My problem, it seems, is that in Swift (and most likely other languages) the line between casting and converting is blurred...
Case in point, Apple explains that:
Type casting in Swift is implemented with the is and as operators. […]
var x: Int = 5
var y: UInt = x as UInt // Casting... Compiler refuses claiming
// it's not "convertible".
// I don't want to convert it, I want to cast it.
If "casting" is not this clearly defined action it could explain why unsigned integers are to be avoided like the plague...