Getting bit-pattern of a Bool in Swift

In ObjC, a bool's bit pattern could be retrieved by casting it to a UInt8.
e.g.
true => 0x01
false => 0x00
This bit pattern could then be used in further bit manipulation operations.
Now I want to do the same in Swift.
What I got working so far is
UInt8(UInt(boolValue))
but this doesn't look like it is the preferred approach.
I also need the conversion in O(1) without data-dependent branching. So, stuff like the following is not allowed.
boolValue ? 1 : 0
Also, is there some documentation about the way the UInt8 and UInt initializers are implemented? e.g. if the UInt initializer to convert from bool uses data-dependent branching, I can't use it either.
Of course, the fallback is always to use further bitwise operations to avoid the bool value altogether (e.g. Check if a number is non zero using bitwise operators in C).
Does Swift offer an elegant way to access the bit pattern of a Bool / convert it to UInt8, in O(1) without data-dependent branching?

When in doubt, have a look at the generated assembly code :)
func foo(someBool : Bool) -> UInt8 {
    let x = UInt8(UInt(someBool))
    return x
}
compiled with -O ("compile with optimizations"):
xcrun -sdk macosx swiftc -emit-assembly -O main.swift
gives
.globl __TF4main3fooFSbVSs5UInt8
.align 4, 0x90
__TF4main3fooFSbVSs5UInt8:
.cfi_startproc
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
callq __TFE10FoundationSb19_bridgeToObjectiveCfSbFT_CSo8NSNumber
movq %rax, %rdi
callq __TFE10FoundationSuCfMSuFCSo8NSNumberSu
movzbl %al, %ecx
cmpq %rcx, %rax
jne LBB0_2
popq %rbp
retq
The function names can be demangled with
$ xcrun -sdk macosx swift-demangle __TFE10FoundationSb19_bridgeToObjectiveCfSbFT_CSo8NSNumber __TFE10FoundationSuCfMSuFCSo8NSNumberSu
_TFE10FoundationSb19_bridgeToObjectiveCfSbFT_CSo8NSNumber ---> ext.Foundation.Swift.Bool._bridgeToObjectiveC (Swift.Bool)() -> ObjectiveC.NSNumber
_TFE10FoundationSuCfMSuFCSo8NSNumberSu ---> ext.Foundation.Swift.UInt.init (Swift.UInt.Type)(ObjectiveC.NSNumber) -> Swift.UInt
There is no UInt initializer that takes a Bool argument.
So the smart compiler has used the automatic conversion between Swift and Foundation types and generated some code like
let x = UInt8(NSNumber(bool: someBool).unsignedLongValue)
Probably not very efficient with two function calls. (And it does not compile if you only import Swift, without Foundation.)
Now the other method where you assumed data-dependent branching:
func bar(someBool : Bool) -> UInt8 {
    let x = UInt8(someBool ? 1 : 0)
    return x
}
The assembly code is
.globl __TF4main3barFSbVSs5UInt8
.align 4, 0x90
__TF4main3barFSbVSs5UInt8:
pushq %rbp
movq %rsp, %rbp
andb $1, %dil
movb %dil, %al
popq %rbp
retq
No branching, just an "AND" operation with 0x01!
Therefore I do not see a reason not to use this "straightforward" conversion.
You can then profile with Instruments to check if it is a bottleneck for your app.

Martin R’s answer is more fun :-), but this can be done in a playground.
// first check this is true or you’ll be sorry...
sizeof(Bool) == sizeof(UInt8)
let t = unsafeBitCast(true, UInt8.self) // = 1
let f = unsafeBitCast(false, UInt8.self) // = 0
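Once you have the bit pattern, one branch-free way to use it in further bit manipulation (my own sketch, not from the answers above; a, b and selected are made-up names) is to widen it into an all-ones/all-zeros mask:
let someBool = true
let a: UInt8 = 0xAA
let b: UInt8 = 0x55
let bit = unsafeBitCast(someBool, UInt8.self) // 0x01 or 0x00
let mask = 0 &- bit                           // 0xFF if true, 0x00 if false
let selected = (a & mask) | (b & ~mask)       // a if someBool, else b, without branching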


Is MemoryLayout<T>.size/stride/alignment compile time?

For reference, in C/C++ the equivalent (the sizeof operator) is compile-time, and can be used with template programming (generics).
I was looking through Swift Algorithms Club for implementations of common data structures and came across their implementation of a Bit Set:
public struct BitSet {
    private(set) public var size: Int
    private let N = 64
    public typealias Word = UInt64
    fileprivate(set) public var words: [Word]

    public init(size: Int) {
        precondition(size > 0)
        self.size = size
        // Round up the count to the next multiple of 64.
        let n = (size + (N-1)) / N
        words = [Word](repeating: 0, count: n)
    }
<clipped>
In a language where sizeof exists as a compile-time constant operator, I would have set N to sizeof(Word) * 8, or rather MemoryLayout<UInt64>.size * 8, rather than the "magic number" 64. I admit it's not very magic here, but the point stands, if only for making it semantically clear what's going on.
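A minimal sketch of that substitution, mirroring the BitSet above (whether MemoryLayout<Word>.size folds to a constant here is exactly the question):
public struct BitSet {
    private(set) public var size: Int
    public typealias Word = UInt64
    // Bits per Word, derived from the type instead of the magic number 64.
    private let N = MemoryLayout<Word>.size * 8
    fileprivate(set) public var words: [Word]

    public init(size: Int) {
        precondition(size > 0)
        self.size = size
        // Round up the count to the next multiple of N.
        words = [Word](repeating: 0, count: (size + (N - 1)) / N)
    }
}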
Further, I noted this family of functions, for which the same question applies (ref).
static func size(ofValue value: T) -> Int
static func stride(ofValue value: T) -> Int
static func alignment(ofValue value: T) -> Int
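For reference, a quick usage sketch of these ofValue variants (the results shown assume UInt64 on a 64-bit platform):
let x: UInt64 = 0
MemoryLayout.size(ofValue: x)      // 8
MemoryLayout.stride(ofValue: x)    // 8
MemoryLayout.alignment(ofValue: x) // 8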
Edit: Adding some disassembly from the generic and non-generic versions of the functions.
Non-generic Swift:
func getSizeOfInt() -> Int {
    return MemoryLayout<UInt64>.size
}
Produces this disassembly:
(lldb) disassemble --frame
MemoryLayout`getSizeOfInt() -> Int:
0x1000013c0 <+0>: pushq %rbp
0x1000013c1 <+1>: movq %rsp, %rbp
0x1000013c4 <+4>: movl $0x8, %eax
-> 0x1000013c9 <+9>: popq %rbp
0x1000013ca <+10>: retq
Judging by the 0x8 constant, this looks like a compile-time constant, consistent with Charles Srstka's answer.
How about the generic Swift implementation?
func getSizeOf<T>(_ t:T) -> Int {
    return MemoryLayout<T>.size
}
Produced this disassembly:
(lldb) disassemble --frame
MemoryLayout`getSizeOf<A> (A) -> Int:
0x100001390 <+0>: pushq %rbp
0x100001391 <+1>: movq %rsp, %rbp
0x100001394 <+4>: subq $0x20, %rsp
0x100001398 <+8>: movq %rsi, -0x8(%rbp)
0x10000139c <+12>: movq %rdi, -0x10(%rbp)
-> 0x1000013a0 <+16>: movq -0x8(%rsi), %rax
0x1000013a4 <+20>: movq 0x88(%rax), %rcx
0x1000013ab <+27>: movq %rcx, -0x18(%rbp)
0x1000013af <+31>: callq *0x20(%rax)
0x1000013b2 <+34>: movq -0x18(%rbp), %rax
0x1000013b6 <+38>: addq $0x20, %rsp
0x1000013ba <+42>: popq %rbp
0x1000013bb <+43>: retq
0x1000013bc <+44>: nopl (%rax)
The above doesn't look compile-time...? I am not yet familiar with assembly on macOS/lldb.
This code:
func getSizeOfInt64() -> Int {
    return MemoryLayout<Int64>.size
}
generates this assembly:
MyApp`getSizeOfInt64():
0x1000015a0 <+0>: pushq %rbp
0x1000015a1 <+1>: movq %rsp, %rbp
0x1000015a4 <+4>: movl $0x8, %eax
0x1000015a9 <+9>: popq %rbp
0x1000015aa <+10>: retq
From this, it appears that MemoryLayout<Int64>.size is indeed compile-time.
Checking the assembly for stride and alignment is left to the reader, but they give similar results (actually identical, in the case of Int64).
EDIT:
If we're talking about generic functions, obviously more work has to be done, since the function will not know at compile time the type whose size it's getting and thus can't just put in a constant. But if you define your function to take a type instead of an instance of the type, it does a little less work than in your example:
func getSizeOf<T>(_: T.Type) -> Int {
    return MemoryLayout<T>.size
}
called like: getSizeOf(UInt64.self)
generates this assembly:
MyApp`getSizeOf<A>(_:):
0x100001590 <+0>: pushq %rbp
0x100001591 <+1>: movq %rsp, %rbp
0x100001594 <+4>: movq %rsi, -0x8(%rbp)
0x100001598 <+8>: movq %rdi, -0x10(%rbp)
-> 0x10000159c <+12>: movq -0x8(%rsi), %rsi
0x1000015a0 <+16>: movq 0x88(%rsi), %rax
0x1000015a7 <+23>: popq %rbp
0x1000015a8 <+24>: retq

When using Emacs Lisp, if the value exceeds most-negative-fixnum

So, if I multiply two values:
emacs -batch -eval '(print (* 1252463 -4400000000000))'
It will exceed most-negative-fixnum and return a mathematically wrong answer. What will be the difference at the instruction level between the -O2 flag, -O2 -fsanitize=undefined, and -O2 -fwrapv?
In Emacs? Probably nothing. The function that is compiled probably looks like this:
int multiply(int x, int y) {
    return x * y;
}
If we compile that and look at the assembly (gcc -S multiply.c && cat multiply.s), we get
multiply:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
imull -8(%rbp), %eax
popq %rbp
ret
See the imull instruction? It's doing a regular multiply. What if we try gcc -O2 -S multiply.c?
multiply:
movl %edi, %eax
imull %esi, %eax
ret
Well that's certainly removed some code, but it's still doing imull, a regular multiplication.
Let's try to get it to not use imull:
int multiply(int x) {
    return x * 2;
}
With gcc -O2 -S multiply.c, we get
multiply:
leal (%rdi,%rdi), %eax
ret
Instead of computing the slower x * 2, it computed x + x (that is what the lea does), because addition is faster than multiplication.
Can we get -fwrapv to produce different code? Yes:
int multiply(int x) {
    return x * 2 < 0;
}
With gcc -O2 -S multiply.c, we get
multiply:
movl %edi, %eax
shrl $31, %eax
ret
So it was simplified into x >> 31, which is the same thing as x < 0. In math, if x * 2 < 0 then x < 0. But in the reality of processors, if x * 2 overflows it may become negative; for example, 2,000,000,000 * 2 = -294,967,296.
If you force gcc to take this into account with gcc -O2 -fwrapv -S multiply.c, we get
multiply:
leal (%rdi,%rdi), %eax
shrl $31, %eax
ret
So it optimized x * 2 < 0 to x + x < 0. It might seem strange to have -fwrapv not be the default, but C was created before it was standard for multiplication to overflow in this predictable manner.

Benchmark for loop and range operator

I read that range-based loops have better performance in some programming languages. Is that the case in Swift? For instance, in a Playground:
func timeDebug(desc: String, function: ()->() )
{
    let start : UInt64 = mach_absolute_time()
    function()
    let duration : UInt64 = mach_absolute_time() - start
    var info : mach_timebase_info = mach_timebase_info(numer: 0, denom: 0)
    mach_timebase_info(&info)
    let total = (duration * UInt64(info.numer) / UInt64(info.denom)) / 1_000
    println("\(desc): \(total) µs.")
}
func loopOne(){
    for i in 0..<4000 {
        println(i);
    }
}
func loopTwo(){
    for var i = 0; i < 4000; i++ {
        println(i);
    }
}
range-based loop
timeDebug("Loop One time"){
    loopOne(); // Loop One time: 2075159 µs.
}
normal for loop
timeDebug("Loop Two time"){
    loopTwo(); // Loop Two time: 1905956 µs.
}
How do I properly benchmark in Swift?
// Update on the device
First run
Loop Two time: 54 µs.
Loop One time: 482 µs.
Second
Loop Two time: 44 µs.
Loop One time: 382 µs.
Third
Loop Two time: 43 µs.
Loop One time: 419 µs.
Fourth
Loop Two time: 44 µs.
Loop One time: 399 µs.
// Update 2
func printTimeElapsedWhenRunningCode(title:String, operation:()->()) {
    let startTime = CFAbsoluteTimeGetCurrent()
    operation()
    let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
    println("Time elapsed for \(title): \(timeElapsed) s")
}
printTimeElapsedWhenRunningCode("Loop Two time") {
    loopTwo(); // Time elapsed for Loop Two time: 4.10079956054688e-05 s
}
printTimeElapsedWhenRunningCode("Loop One time") {
    loopOne(); // Time elapsed for Loop One time: 0.000500023365020752 s.
}
You shouldn’t really benchmark in playgrounds since they’re unoptimized. Unless you’re interested in how long things will take when you’re debugging, you should only ever benchmark optimized builds (swiftc -O).
To understand why a range-based loop can be faster, you can look at the assembly generated for the two options:
Range-based
% echo "for i in 0..<4_000 { println(i) }" | swiftc -O -emit-assembly -
; snip opening boiler plate...
LBB0_1:
movq %rbx, -32(%rbp)
; increment i
incq %rbx
movq %r14, %rdi
movq %r15, %rsi
; print (pre-incremented) i
callq __TFSs7printlnU__FQ_T_
; compare i to 4_000
cmpq $4000, %rbx
; loop if not equal
jne LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.cfi_endproc
C-style for loop
% echo "for var i = 0;i < 4_000;++i { println(i) }" | swiftc -O -emit-assembly -
; snip opening boiler plate...
LBB0_1:
movq %rbx, -32(%rbp)
movq %r14, %rdi
movq %r15, %rsi
; print i
callq __TFSs7printlnU__FQ_T_
; increment i
incq %rbx
; jump if overflow
jo LBB0_4
; compare i to 4_000
cmpq $4000, %rbx
; loop if less than
jl LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
LBB0_4:
; raise illegal instruction due to overflow
ud2
.cfi_endproc
So the reason the C-style loop is slower is because it’s performing an extra operation – checking for overflow. Either Range was written to avoid the overflow check (or do it up front), or the optimizer was more able to eliminate it with the Range version.
If you switch to using the overflow addition operator (&+), which skips the check, you can eliminate it. This produces near-identical code to the range-based version (the only difference being some immaterial ordering of the code):
% echo "for var i = 0;i < 4_000;i = i &+ 1 { println(i) }" | swiftc -O -emit-assembly -
; snip
LBB0_1:
movq %rbx, -32(%rbp)
movq %r14, %rdi
movq %r15, %rsi
callq __TFSs7printlnU__FQ_T_
incq %rbx
cmpq $4000, %rbx
jne LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.cfi_endproc
Never Benchmark Unoptimized Builds
If you want to understand why, try looking at the output for the Range-based version of the above, but with no optimization: echo "for i in 0..<4_000 { println(i) }" | swiftc -Onone -emit-assembly -. You will see it output a lot more code. That's because Range used via for…in is an abstraction: a struct used with custom operators and functions returning generators, which does a lot of safety checks and other helpful things. This makes it a lot easier to write/read code. But when you turn on the optimizer, all this disappears and you're left with very efficient code.
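To make the abstraction concrete, here is roughly what the range-based loop desugars to (a sketch using the Swift-1-era SequenceType API; the generator variable name is mine):
var generator = (0..<4_000).generate()
while let i = generator.next() {
    println(i)
}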
Benchmarking
As to ways to benchmark, this is the code I tend to use, just replacing the array:
import CoreFoundation.CFDate
func timeRun<T>(name: String, f: ()->T) -> String {
    let start = CFAbsoluteTimeGetCurrent()
    let result = f()
    let end = CFAbsoluteTimeGetCurrent()
    let timeStr = toString(Int((end - start) * 1_000_000))
    return "\(name)\t\(timeStr)µs, produced \(result)"
}
let n = 4_000
let runs: [(String,()->Void)] = [
    ("for in range", {
        for i in 0..<n { println(i) }
    }),
    ("plain ol for", {
        for var i = 0;i < n;++i { println(i) }
    }),
    ("w/o overflow", {
        for var i = 0;i < n;i = i &+ 1 { println(i) }
    }),
]
println("\n".join(map(runs, timeRun)))
But the results will probably be meaningless, since jitter during println will likely obscure actual measurement. To really benchmark (assuming you don’t just trust the assembly analysis :) you’d need to replace it with something very lightweight.
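For example, one lightweight replacement (my sketch, in the same era-Swift style; sink and quietRuns are made-up names) is to accumulate a cheap sum and print it once at the end, so the loop bodies cannot be optimized away:
var sink = 0
let quietRuns: [(String,()->Void)] = [
    ("for in range", { for i in 0..<n { sink = sink &+ i } }),
    ("plain ol for", { for var i = 0;i < n;++i { sink = sink &+ i } }),
    ("w/o overflow", { for var i = 0;i < n;i = i &+ 1 { sink = sink &+ i } }),
]
println("\n".join(map(quietRuns, timeRun)))
println(sink) // printing the sum keeps the additions observable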

Is there a performance improvement when using _ to ignore a parameter in Swift?

When creating a UITableViewController, sometimes I only need the indexPath in my function. Is there a performance improvement when using _ to ignore the tableView parameter?
E.g., using this:
override func tableView(_: UITableView, didSelectRowAtIndexPath indexPath: NSIndexPath)
instead of this:
override func tableView(tableView: UITableView, didSelectRowAtIndexPath indexPath: NSIndexPath)
Generally, this falls under the category "micro-optimization". Even if there were a difference, it would probably be negligible compared to the rest of your program. And chances are great that the compiler notices the unused parameter and optimizes the code accordingly. You should decide what parameter name makes the most sense in your situation.
In this particular case, it does not make any difference at all. How you name the (internal) method parameter affects only the compiling phase, but does not change the generated code.
You can verify that easily yourself. Create a source file "main.swift":
// main.swift
import Swift
func foo(str : String) -> Int {
    return 100
}
func bar(_ : String) -> Int {
    return 100
}
println(foo("a"))
println(bar("b"))
Now compile it and inspect the generated assembly code:
swiftc -O -emit-assembly main.swift
The assembly code for both methods is completely identical:
.private_extern __TF4main3fooFSSSi
.globl __TF4main3fooFSSSi
.align 4, 0x90
__TF4main3fooFSSSi:
pushq %rbp
movq %rsp, %rbp
movq %rdx, %rdi
callq _swift_unknownRelease
movl $100, %eax
popq %rbp
retq
.private_extern __TF4main3barFSSSi
.globl __TF4main3barFSSSi
.align 4, 0x90
__TF4main3barFSSSi:
pushq %rbp
movq %rsp, %rbp
movq %rdx, %rdi
callq _swift_unknownRelease
movl $100, %eax
popq %rbp
retq

Carry bit, GAS constraint

I am writing long-integer addition in GAS inline assembly:
template <std::size_t NumBits>
void inline KA_add(vli<NumBits> & x, vli<NumBits> const& y);
If I specialize, I can do:
template <>
void inline KA_add<128>(vli<128> & x, vli<128> const& y){
    asm("addq %2, %0; adcq %3, %1;" :"+r"(x[0]),"+r"(x[1]):"g"(y[0]),"g"(y[1]):"cc");
}
Nice, it works. Now, if I try to generalize it so the template works for any length and let the compiler do the work ...
template <std::size_t NumBits>
void inline KA_add(vli<NumBits> & x, vli<NumBits> const& y){
    asm("addq %1, %0;" :"+r"(x[0]):"g"(y[0]):"cc");
    for(int i(1); i < vli<NumBits>::numwords;++i)
        asm("adcq %1, %0;" :"+r"(x[i]):"g"(y[i]):"cc");
};
Well, it does not work: I have no guarantee that the carry bit (CB) is propagated. It is not preserved between the first asm statement and the second one. That may be logical, because the loop increments i and thereby "deletes" the CB, I think. There should exist a GAS constraint to preserve the CB across the two asm statements, but unfortunately I have not found such information.
Any idea?
Thank you!
PS: I rewrote my function to remove the C++ idioms:
template <std::size_t NumBits>
inline void KA_add_test(boost::uint64_t* x, boost::uint64_t const* y){
    asm ("addq %1, %0;" :"+r"(x[0]):"g"(y[0]):"cc");
    for(int i(1); i < vli<NumBits>::numwords;++i)
        asm ("adcq %1, %0;" :"+r"(x[i]):"g"(y[i]):"cc");
};
The generated asm is (GCC, debug mode):
#APP
addq %rdx, %rax;
#NO_APP
movq -24(%rbp), %rdx
movq %rax, (%rdx)
.LBB94:
.loc 9 55 0
movl $1, -4(%rbp)
jmp .L323
.L324:
.loc 9 56 0
movl -4(%rbp), %eax
cltq
salq $3, %rax
movq %rax, %rdx
addq -24(%rbp), %rdx <----------------- Break the carry bit
movl -4(%rbp), %eax
cltq
salq $3, %rax
addq -32(%rbp), %rax
movq (%rax), %rcx
movq (%rdx), %rax
#APP
adcq %rcx, %rax;
#NO_APP
As we can see, there is an additional addq that destroys the propagation of the CB.
I see no way to explicitly tell the compiler that the loop code must be generated without instructions affecting the C flag.
It is surely possible to write such a loop by hand: use lea to count the array addresses upwards, dec to count the loop downwards, and test Z for the end condition. That way, nothing in the loop except the actual array sum changes the C flag.
You'd have to do a manual thing, like:
long long tmp;                        // scratch register
boost::uint64_t *xp = &x[0];          // destination pointer, advanced in the asm
boost::uint64_t const *yp = &y[0];    // source pointer, advanced in the asm
std::size_t n = vli<NumBits>::numwords;
__asm__("clc\n"                       // start the carry chain with CF clear
        "0:\n\t"
        "movq (%2), %0\n\t"           // tmp = *yp
        "leaq 8(%2), %2\n\t"          // ++yp (lea does not touch flags)
        "adcq %0, (%1)\n\t"           // *xp += tmp + carry
        "leaq 8(%1), %1\n\t"          // ++xp
        "decq %3\n\t"                 // --n (dec leaves the carry flag alone)
        "jnz 0b"                      // loop until n == 0
        : "=&r"(tmp), "+r"(xp), "+r"(yp), "+r"(n)
        :
        : "cc", "memory");
For hot code, a tight loop isn't optimal though; for one, the instructions have dependencies, and there are significantly more instructions per iteration than in inlined / unrolled adc sequences. A better sequence would be something like the following (with %rbp and %rsi holding the start addresses of the source and target arrays):
0:
lea 64(%rbp), %r13
lea 64(%rsi), %r14
movq (%rbp), %rax
movq 8(%rbp), %rdx
adcq (%rsi), %rax
movq 16(%rbp), %rcx
adcq 8(%rsi), %rdx
movq 24(%rbp), %r8
adcq 16(%rsi), %rcx
movq 32(%rbp), %r9
adcq 24(%rsi), %r8
movq 40(%rbp), %r10
adcq 32(%rsi), %r9
movq 48(%rbp), %r11
adcq 40(%rsi), %r10
movq 56(%rbp), %r12
adcq 48(%rsi), %r11
movq %rax, (%rsi)
adcq 56(%rsi), %r12
movq %rdx, 8(%rsi)
movq %rcx, 16(%rsi)
movq %r8, 24(%rsi)
movq %r13, %rbp # next src
movq %r9, 32(%rsi)
movq %r10, 40(%rsi)
movq %r11, 48(%rsi)
movq %r12, 56(%rsi)
movq %r14, %rsi # next tgt
dec %edi # loop counter = number of words / 8 (doing 8 words per iteration)
jnz 0b # loop again if not yet zero
and looping only around such blocks. The advantage would be that the loads are grouped together, and you'd deal with the loop count / termination condition only once per block.
I would, quite honestly, try not to make the general bit width particularly "neat", but rather special-case explicitly unrolled code for, say, bit widths that are powers of two. Rather, add a flag / constructor message to the non-optimized template instantiation telling the user to "use a power of two"?
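For instance, that constraint could be expressed at compile time roughly like this (a sketch, swapping the suggested flag for a C++11 static_assert; the message wording is mine):
template <std::size_t NumBits>
inline void KA_add(vli<NumBits>& x, vli<NumBits> const& y) {
    // Reject widths without an explicitly unrolled specialization.
    static_assert((NumBits & (NumBits - 1)) == 0,
                  "KA_add: use a power-of-two bit width");
    // ... dispatch to the explicitly unrolled specializations ...
}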