Which is more efficient: Creating a "var" and re-using it, or creating several "let"s? - Swift

Just curious which is more efficient/better in Swift:
Creating three temporary constants (using let) and using those constants to define other variables
Creating one temporary variable (using var) and using that variable to hold three different values which will then be used to define other variables
This is perhaps better explained through an example:
var one = Object()
var two = Object()
var three = Object()
func firstFunction() {
let tempVar1 = //calculation1
one = tempVar1
let tempVar2 = //calculation2
two = tempVar2
let tempVar3 = //calculation3
three = tempVar3
}
func secondFunction() {
var tempVar = //calculation1
one = tempVar
tempVar = //calculation2
two = tempVar
tempVar = //calculation3
three = tempVar
}
Which of the two functions is more efficient? Thank you for your time!

Not to be too cute about it, but the most efficient version of your code above is:
var one = Object()
var two = Object()
var three = Object()
That is logically equivalent to all the code you've written since you never use the results of the computations (assuming the computations have no side-effects). It is the job of the optimizer to get down to this simplest form. Technically the simplest form is:
func main() {}
The optimizer isn't quite that smart, but it really is smart enough to get to my first example. Consider this program:
var one = 1
var two = 2
var three = 3
func calculation1() -> Int { return 1 }
func calculation2() -> Int { return 2 }
func calculation3() -> Int { return 3 }
func firstFunction() {
let tempVar1 = calculation1()
one = tempVar1
let tempVar2 = calculation2()
two = tempVar2
let tempVar3 = calculation3()
three = tempVar3
}
func secondFunction() {
var tempVar = calculation1()
one = tempVar
tempVar = calculation2()
two = tempVar
tempVar = calculation3()
three = tempVar
}
func main() {
firstFunction()
secondFunction()
}
Run it through the compiler with optimizations:
$ swiftc -O -wmo -emit-assembly x.swift
Here's the whole output:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 9
.globl _main
.p2align 4, 0x90
_main:
pushq %rbp
movq %rsp, %rbp
movq $1, __Tv1x3oneSi(%rip)
movq $2, __Tv1x3twoSi(%rip)
movq $3, __Tv1x5threeSi(%rip)
xorl %eax, %eax
popq %rbp
retq
.private_extern __Tv1x3oneSi
.globl __Tv1x3oneSi
.zerofill __DATA,__common,__Tv1x3oneSi,8,3
.private_extern __Tv1x3twoSi
.globl __Tv1x3twoSi
.zerofill __DATA,__common,__Tv1x3twoSi,8,3
.private_extern __Tv1x5threeSi
.globl __Tv1x5threeSi
.zerofill __DATA,__common,__Tv1x5threeSi,8,3
.private_extern ___swift_reflection_version
.section __TEXT,__const
.globl ___swift_reflection_version
.weak_definition ___swift_reflection_version
.p2align 1
___swift_reflection_version:
.short 1
.no_dead_strip ___swift_reflection_version
.linker_option "-lswiftCore"
.linker_option "-lobjc"
.section __DATA,__objc_imageinfo,regular,no_dead_strip
L_OBJC_IMAGE_INFO:
.long 0
.long 1088
Your functions aren't even in the output because they don't do anything. main is simplified to:
_main:
pushq %rbp
movq %rsp, %rbp
movq $1, __Tv1x3oneSi(%rip)
movq $2, __Tv1x3twoSi(%rip)
movq $3, __Tv1x5threeSi(%rip)
xorl %eax, %eax
popq %rbp
retq
This sticks the values 1, 2, and 3 into the globals, and then exits.
My point here is that if it's smart enough to do that, don't try to second-guess it with temporary variables. Its job is to figure that out. In fact, let's see how smart it is. We'll turn off Whole Module Optimization (-wmo). Without that, it won't strip the functions, because it doesn't know whether something else will call them. And then we can see how it writes these functions.
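That is, the same command as before, just without -wmo:
$ swiftc -O -emit-assembly x.swift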
Here's firstFunction():
__TF1x13firstFunctionFT_T_:
pushq %rbp
movq %rsp, %rbp
movq $1, __Tv1x3oneSi(%rip)
movq $2, __Tv1x3twoSi(%rip)
movq $3, __Tv1x5threeSi(%rip)
popq %rbp
retq
Since it can see that the calculation methods just return constants, it inlines those results and writes them to the globals.
Now how about secondFunction():
__TF1x14secondFunctionFT_T_:
pushq %rbp
movq %rsp, %rbp
popq %rbp
jmp __TF1x13firstFunctionFT_T_
Yes. It's that smart. It realized that secondFunction() is identical to firstFunction() and it just jumps to it. Your functions literally could not be more identical and the optimizer knows that.
So what's the most efficient? The one that is simplest to reason about. The one with the fewest side-effects. The one that is easiest to read and debug. That's the efficiency you should be focused on. Let the optimizer do its job. It's really quite smart. And the more you write in nice, clear, obvious Swift, the easier it is for the optimizer to do its job. Every time you do something clever "for performance," you're just making the optimizer work harder to figure out what you've done (and probably undo it).
Just to finish the thought: the local variables you create are barely hints to the compiler. The compiler generates its own local variables when it converts your code to its internal representation (IR). IR is in static single assignment form (SSA), in which every variable can only be assigned one time. Because of this, your second function actually creates more local variables than your first function. Here's function one (created using swiftc -emit-ir x.swift):
define hidden void @_TF1x13firstFunctionFT_T_() #0 {
entry:
%0 = call i64 @_TF1x12calculation1FT_Si()
store i64 %0, i64* getelementptr inbounds (%Si, %Si* @_Tv1x3oneSi, i32 0, i32 0), align 8
%1 = call i64 @_TF1x12calculation2FT_Si()
store i64 %1, i64* getelementptr inbounds (%Si, %Si* @_Tv1x3twoSi, i32 0, i32 0), align 8
%2 = call i64 @_TF1x12calculation3FT_Si()
store i64 %2, i64* getelementptr inbounds (%Si, %Si* @_Tv1x5threeSi, i32 0, i32 0), align 8
ret void
}
In this form, variables have a % prefix. As you can see, there are 3.
Here's your second function:
define hidden void @_TF1x14secondFunctionFT_T_() #0 {
entry:
%0 = alloca %Si, align 8
%1 = bitcast %Si* %0 to i8*
call void @llvm.lifetime.start(i64 8, i8* %1)
%2 = call i64 @_TF1x12calculation1FT_Si()
%._value = getelementptr inbounds %Si, %Si* %0, i32 0, i32 0
store i64 %2, i64* %._value, align 8
store i64 %2, i64* getelementptr inbounds (%Si, %Si* @_Tv1x3oneSi, i32 0, i32 0), align 8
%3 = call i64 @_TF1x12calculation2FT_Si()
%._value1 = getelementptr inbounds %Si, %Si* %0, i32 0, i32 0
store i64 %3, i64* %._value1, align 8
store i64 %3, i64* getelementptr inbounds (%Si, %Si* @_Tv1x3twoSi, i32 0, i32 0), align 8
%4 = call i64 @_TF1x12calculation3FT_Si()
%._value2 = getelementptr inbounds %Si, %Si* %0, i32 0, i32 0
store i64 %4, i64* %._value2, align 8
store i64 %4, i64* getelementptr inbounds (%Si, %Si* @_Tv1x5threeSi, i32 0, i32 0), align 8
%5 = bitcast %Si* %0 to i8*
call void @llvm.lifetime.end(i64 8, i8* %5)
ret void
}
This one has 6 local variables! But, just like the local variables in the original source code, this tells us nothing about final performance. The compiler just creates this version because it's easier to reason about (and therefore optimize) than a version where variables can change their values.
(Even more dramatic is this code in SIL (-emit-sil), which creates 16 local variables for function 1 and 17 for function 2! If the compiler is happy to invent 16 local variables just to make it easier for it to reason about 6 lines of code, you certainly shouldn't be worried about the local variables you create. They're not even a minor concern; they're completely free.)
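If you want to inspect the SIL yourself, it can be dumped the same way as the IR:
$ swiftc -emit-sil x.swift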

Unless you're dealing with a VERY specialized use case, this should never make a meaningful performance difference.
It's likely the compiler can easily simplify things to direct assignments in firstFunction; I'm less sure whether secondFunction lends itself to the same optimization. You would either have to be an expert on the compiler or run performance tests to find any difference.
Regardless, unless you're doing this at a scale of hundreds of thousands or millions of iterations, it's not something to worry about.
I personally think re-using a variable the way secondFunction does is unnecessarily confusing, but to each their own.
Note: it looks like you're dealing with classes, but be aware that if Object were a struct, copy semantics would make re-using a variable pointless anyway.
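A minimal sketch of that last point, using a hypothetical struct in place of Object:
struct Point { var x = 0 } // a value type standing in for Object
var temp = Point()
var one = temp // assignment copies the value; one is now independent of temp
temp.x = 42    // mutates only temp; one.x is still 0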

You should really just inline the local variables:
var one: Object
var two: Object
var three: Object
func firstFunction() {
one = //calculation1
two = //calculation2
three = //calculation3
}
One exception to this is if you end up writing something like this:
var someOptional: Foo?
init() {
self.someOptional = Foo()
self.someOptional?.a = a
self.someOptional?.b = b
self.someOptional?.c = c
}
In which case it would be better to do:
init() {
let foo = Foo()
foo.a = a
foo.b = b
foo.c = c
self.someOptional = foo
}
or perhaps:
init() {
self.someOptional = {
let foo = Foo()
foo.a = a
foo.b = b
foo.c = c
return foo
}()
}

Related

Swift compiler LLVM IR optimization

For a project I am currently using the Swift compiler front end to generate LLVM IR. I need to analyze the IR to find run-time dependencies between variable reads/writes, with the end goal of finding parallelism.
For this it is important that I am able to generate LLVM IR that is unoptimized, so that all the load and store instructions are present. This works well with clang using the -O0 flag, which prevents optimization.
However with swiftc I have the following issue on this simple example:
func returnInputSum(test: Int) -> Int {
var a = 5
var b = 10
var c = a + b
return c
}
with swiftc -Onone -emit-ir this becomes (debug info etc. removed):
entry:
%a = alloca i64, align 8
%b = alloca i64, align 8
%c = alloca i64, align 8
store i64 5, i64* %a, align 8
store i64 10, i64* %b, align 8
store i64 15, i64* %c, align 8
ret i64 15
So instead of loading a and b, inserting an add instruction, and storing the result in c, the compiler seems to evaluate 5 + 10 directly and store the 15 in c. The return does not even return c but an immediate value. This is a problem for me because I want to see all load/store instructions.
When I use clang with C/C++ code it works just fine, and I get what I expect:
%a = alloca i32, align 4
%b = alloca i32, align 4
%c = alloca i32, align 4
store i32 5, i32* %a, align 4
store i32 10, i32* %b, align 4
%0 = load i32, i32* %b, align 4
%1 = load i32, i32* %a, align 4
%add = add nsw i32 %0, %1
store i32 %add, i32* %c, align 4
%2 = load i32, i32* %c, align 4
ret i32 %2
Is there any way to make the compiler do no optimization at all on the LLVM IR code? Or am I not understanding something important here? I know that before generating LLVM IR there is a SIL pass with some optimizations, but I am under the assumption that with the -Onone flag these should also be turned off.
thank you!
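Note: a, b, and c above are all initialized from integer literals, which is what allows even the mandatory (diagnostic) SIL passes to fold the arithmetic before IR generation. As a sketch, not from the question, making the values depend on the parameter should prevent that pre-computation:
func returnInputSum(test: Int) -> Int {
    var a = test     // not a compile-time constant, so it cannot be folded
    var b = test + 5
    var c = a + b
    return c
}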

eBPF / XDP map not getting created

I have an implementation in BPF for XDP, wherein I specify five maps to be created as follows:
struct bpf_map_def SEC("maps") servers = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct ip_key),
.value_size = sizeof(struct dest_info),
.max_entries = MAX_SERVERS,
};
struct bpf_map_def SEC("maps") server_ips = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct ip_key),
.value_size = sizeof(struct server_ip_key),
.max_entries = MAX_SERVERS,
};
struct bpf_map_def SEC("maps") client_addrs = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct port_key),
.value_size = sizeof(struct client_port_addr),
.max_entries = MAX_CLIENTS,
};
struct bpf_map_def SEC("maps") stoc_port_maps = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct port_key),
.value_size = sizeof(struct port_map),
.max_entries = MAX_FLOWS,
};
struct bpf_map_def SEC("maps") ctos_port_maps = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(struct port_key),
.value_size = sizeof(struct port_map),
.max_entries = MAX_FLOWS,
};
However, no matter what I do, the servers map is not getting created. When I run bpftool map show, I only get output like the following:
root@balancer:/xdp# bpftool map list
68: hash name client_addrs flags 0x0
key 8B value 16B max_entries 4096 memlock 98304B
69: hash name ctos_port_maps flags 0x0
key 8B value 20B max_entries 4096 memlock 131072B
70: hash name server_ips flags 0x0
key 8B value 8B max_entries 512 memlock 8192B
73: hash name stoc_port_maps flags 0x0
key 8B value 20B max_entries 4096 memlock 131072B
74: array name xdp_lb_k.rodata flags 0x480
key 4B value 50B max_entries 1 memlock 4096B
frozen
root@balancer:/xdp#
It is notable that each of the key and value structs has been padded to the closest multiple of eight bytes, and there are no compile or verifier errors. I am also running the program in Docker containers. So far, I have tried moving the servers map definition around in my code, commenting out the other map definitions leaving only the servers definition active, changing the name to other combinations, and a few other minor changes, but nothing has worked so far.
Please let me know if you would need any other portion of my code or information for a better analysis of the situation.
Appendix 1:
I am compiling the object file using this Makefile rule:
xdp_lb_kern.o: xdp_lb_kern.c
clang -S \
-target bpf \
-D __BPF_TRACING__ \
-I../../libbpf/src \
-I../../custom-headers \
-Wall \
-Wno-unused-value \
-Wno-pointer-sign \
-Wno-compare-distinct-pointer-types \
-O2 -emit-llvm -c -o ${@:.o=.ll} $<
llc -march=bpf -filetype=obj -o $@ ${@:.o=.ll}
Then, in the container's environment, I load the program using this rule:
load_balancer:
bpftool net detach xdpgeneric dev eth0
rm -f /sys/fs/bpf/xdp_lb
bpftool prog load xdp_lb_kern.o /sys/fs/bpf/xdp_lb
bpftool net attach xdpgeneric pinned /sys/fs/bpf/xdp_lb dev eth0
The compilation process generates a .o and a .ll output file. The beginning lines of the .ll output file, where the map definitions are visible, are shown below:
; ModuleID = 'xdp_lb_kern.c'
source_filename = "xdp_lb_kern.c"
target datalayout = "e-m:e-p:64:64-i64:64-n32:64-S128"
target triple = "bpf"
%struct.bpf_map_def = type { i32, i32, i32, i32, i32 }
%struct.xdp_md = type { i32, i32, i32, i32, i32 }
%struct.ip_key = type { i32, i32 }
%struct.port_key = type { i16, [3 x i16] }
%struct.ethhdr = type { [6 x i8], [6 x i8], i16 }
%struct.iphdr = type { i8, i8, i16, i16, i16, i8, i8, i16, i32, i32 }
@servers = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 32, i32 512, i32 0 }, section "maps", align 4
@server_ips = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 8, i32 512, i32 0 }, section "maps", align 4
@client_addrs = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 16, i32 4096, i32 0 }, section "maps", align 4
@stoc_port_maps = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 20, i32 4096, i32 0 }, section "maps", align 4
@ctos_port_maps = dso_local global %struct.bpf_map_def { i32 1, i32 8, i32 20, i32 4096, i32 0 }, section "maps", align 4
@loadbal.____fmt = internal constant [24 x i8] c"balancer got something!\00", align 1
@_license = dso_local global [4 x i8] c"GPL\00", section "license", align 1
@process_packet.____fmt = internal constant [26 x i8] c"it's an ip packet from %x\00", align 1
@llvm.used = appending global [7 x i8*] [i8* getelementptr inbounds ([4 x i8], [4 x i8]* @_license, i32 0, i32 0), i8* bitcast (%struct.bpf_map_def* @client_addrs to i8*), i8* bitcast (%struct.bpf_map_def* @ctos_port_maps to i8*), i8* bitcast (i32 (%struct.xdp_md*)* @loadbal to i8*), i8* bitcast (%struct.bpf_map_def* @server_ips to i8*), i8* bitcast (%struct.bpf_map_def* @servers to i8*), i8* bitcast (%struct.bpf_map_def* @stoc_port_maps to i8*)], section "llvm.metadata"
; Function Attrs: nounwind
define dso_local i32 @loadbal(%struct.xdp_md* nocapture readonly %0) #0 section "xdp" {
%2 = alloca %struct.ip_key, align 4
%3 = alloca %struct.port_key, align 2
%4 = alloca %struct.port_key, align 2
%5 = getelementptr inbounds %struct.xdp_md, %struct.xdp_md* %0, i64 0, i32 1
%6 = load i32, i32* %5, align 4, !tbaa !2
%7 = zext i32 %6 to i64
%8 = inttoptr i64 %7 to i8*
%9 = getelementptr inbounds %struct.xdp_md, %struct.xdp_md* %0, i64 0, i32 0
%10 = load i32, i32* %9, align 4, !tbaa !7

As per the discussion in the comments, the map is not created because it is not actually used in your eBPF code (which was not provided in the question).
As you realised yourself, the branch in your code that was calling the map was in fact unreachable. Based on that, it's likely that clang compiled out this portion of code, and that the map is not used in the resulting eBPF bytecode. When preparing to load your program, bpftool (libbpf) looks at what maps are necessary, and only creates the ones that are needed for your program. It may skip maps that are defined in the ELF file if no program uses them.
One hint here: if the program actually used the map, it could not load successfully with the map missing. Given that your program loads, the map would necessarily be present if it were needed. Note that bpftool prog show will show you the ids of the maps used by a program.
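For illustration, here is a minimal sketch (not the asker's code) of the kind of reachable lookup that marks a map as used, so that libbpf creates it at load time:
/* somewhere reachable in the XDP program */
struct ip_key key = {};
struct dest_info *dest = bpf_map_lookup_elem(&servers, &key);
if (!dest)
    return XDP_PASS; /* hypothetical fallback */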

Benchmark for loop and range operator

I read that range-based loops have better performance in some programming languages. Is that the case in Swift? For instance, in a Playground:
func timeDebug(desc: String, function: ()->() )
{
let start : UInt64 = mach_absolute_time()
function()
let duration : UInt64 = mach_absolute_time() - start
var info : mach_timebase_info = mach_timebase_info(numer: 0, denom: 0)
mach_timebase_info(&info)
let total = (duration * UInt64(info.numer) / UInt64(info.denom)) / 1_000
println("\(desc): \(total) µs.")
}
func loopOne(){
for i in 0..<4000 {
println(i);
}
}
func loopTwo(){
for var i = 0; i < 4000; i++ {
println(i);
}
}
range-based loop
timeDebug("Loop One time"){
loopOne(); // Loop One time: 2075159 µs.
}
normal for loop
timeDebug("Loop Two time"){
loopTwo(); // Loop Two time: 1905956 µs.
}
How do you properly benchmark in Swift?
// Update on the device
First run
Loop Two time: 54 µs.
Loop One time: 482 µs.
Second
Loop Two time: 44 µs.
Loop One time: 382 µs.
Third
Loop Two time: 43 µs.
Loop One time: 419 µs.
Fourth
Loop Two time: 44 µs.
Loop One time: 399 µs.
// Update 2
func printTimeElapsedWhenRunningCode(title:String, operation:()->()) {
let startTime = CFAbsoluteTimeGetCurrent()
operation()
let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
println("Time elapsed for \(title): \(timeElapsed) s")
}
printTimeElapsedWhenRunningCode("Loop Two time") {
loopTwo(); // Time elapsed for Loop Two time: 4.10079956054688e-05 s
}
printTimeElapsedWhenRunningCode("Loop One time") {
loopOne(); // Time elapsed for Loop One time: 0.000500023365020752 s.
}
You shouldn’t really benchmark in playgrounds since they’re unoptimized. Unless you’re interested in how long things will take when you’re debugging, you should only ever benchmark optimized builds (swiftc -O).
To understand why a range-based loop can be faster, you can look at the assembly generated for the two options:
Range-based
% echo "for i in 0..<4_000 { println(i) }" | swiftc -O -emit-assembly -
; snip opening boiler plate...
LBB0_1:
movq %rbx, -32(%rbp)
; increment i
incq %rbx
movq %r14, %rdi
movq %r15, %rsi
; print (pre-incremented) i
callq __TFSs7printlnU__FQ_T_
; compare i to 4_000
cmpq $4000, %rbx
; loop if not equal
jne LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.cfi_endproc
C-style for loop
% echo "for var i = 0;i < 4_000;++i { println(i) }" | swiftc -O -emit-assembly -
; snip opening boiler plate...
LBB0_1:
movq %rbx, -32(%rbp)
movq %r14, %rdi
movq %r15, %rsi
; print i
callq __TFSs7printlnU__FQ_T_
; increment i
incq %rbx
; jump if overflow
jo LBB0_4
; compare i to 4_000
cmpq $4000, %rbx
; loop if less than
jl LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
LBB0_4:
; raise illegal instruction due to overflow
ud2
.cfi_endproc
So the reason the C-style loop is slower is that it's performing an extra operation – checking for overflow. Either Range was written to avoid the overflow check (or to do it up front), or the optimizer was better able to eliminate it in the Range version.
If you switch to using the overflow-ignoring addition operator (&+), you can eliminate this check. This produces near-identical code to the range-based version (the only difference being some immaterial ordering of the code):
% echo "for var i = 0;i < 4_000;i = i &+ 1 { println(i) }" | swiftc -O -emit-assembly -
; snip
LBB0_1:
movq %rbx, -32(%rbp)
movq %r14, %rdi
movq %r15, %rsi
callq __TFSs7printlnU__FQ_T_
incq %rbx
cmpq $4000, %rbx
jne LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.cfi_endproc
Never Benchmark Unoptimized Builds
If you want to understand why, try looking at the output for the Range-based version of the above, but with no optimization: echo "for i in 0..<4_000 { println(i) }" | swiftc -Onone -emit-assembly -. You will see it output a lot more code. That's because Range used via for…in is an abstraction, a struct used with custom operators and functions returning generators, and it does a lot of safety checks and other helpful things. This makes it a lot easier to write/read code. But when you turn on the optimizer, all this disappears and you're left with very efficient code.
Benchmarking
As to ways to benchmark, this is the code I tend to use, just replacing the array:
import CoreFoundation.CFDate
func timeRun<T>(name: String, f: ()->T) -> String {
let start = CFAbsoluteTimeGetCurrent()
let result = f()
let end = CFAbsoluteTimeGetCurrent()
let timeStr = toString(Int((end - start) * 1_000_000))
return "\(name)\t\(timeStr)µs, produced \(result)"
}
let n = 4_000
let runs: [(String,()->Void)] = [
("for in range", {
for i in 0..<n { println(i) }
}),
("plain ol for", {
for var i = 0;i < n;++i { println(i) }
}),
("w/o overflow", {
for var i = 0;i < n;i = i &+ 1 { println(i) }
}),
]
println("\n".join(map(runs, timeRun)))
But the results will probably be meaningless, since jitter during println will likely obscure actual measurement. To really benchmark (assuming you don’t just trust the assembly analysis :) you’d need to replace it with something very lightweight.
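For instance, a sketch of a quieter variant, reusing the timeRun helper above: sum the values instead of printing them, and return the sum so the work can't be optimized away entirely (&+ avoids re-introducing the overflow check):
let quietRuns: [(String,()->Int)] = [
    ("for in range", {
        var sum = 0
        for i in 0..<n { sum = sum &+ i }
        return sum
    }),
    ("plain ol for", {
        var sum = 0
        for var i = 0;i < n;++i { sum = sum &+ i }
        return sum
    }),
]
println("\n".join(map(quietRuns, timeRun)))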

Getting bit-pattern of bool in Swift

In ObjC, a bool's bit pattern could be retrieved by casting it to a UInt8.
e.g.
true => 0x01
false => 0x00
This bit pattern could then be used in further bit manipulation operations.
Now I want to do the same in Swift.
What I got working so far is
UInt8(UInt(boolValue))
but this doesn't look like it is the preferred approach.
I also need the conversion in O(1) without data-dependent branching. So, stuff like the following is not allowed.
boolValue ? 1 : 0
Also, is there some documentation about the way the UInt8 and UInt initializers are implemented? For example, if the UInt initializer that converts from Bool uses data-dependent branching, I can't use it either.
Of course, the fallback is always to use further bitwise operations to avoid the bool value altogether (e.g. Check if a number is non zero using bitwise operators in C).
Does Swift offer an elegant way to access the bit pattern of a Bool / convert it to UInt8, in O(1) without data-dependent branching?
When in doubt, have a look at the generated assembly code :)
func foo(someBool : Bool) -> UInt8 {
let x = UInt8(UInt(someBool))
return x
}
compiled with -O ("compile with optimizations"):
xcrun -sdk macosx swiftc -emit-assembly -O main.swift
gives
.globl __TF4main3fooFSbVSs5UInt8
.align 4, 0x90
__TF4main3fooFSbVSs5UInt8:
.cfi_startproc
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
callq __TFE10FoundationSb19_bridgeToObjectiveCfSbFT_CSo8NSNumber
movq %rax, %rdi
callq __TFE10FoundationSuCfMSuFCSo8NSNumberSu
movzbl %al, %ecx
cmpq %rcx, %rax
jne LBB0_2
popq %rbp
retq
The function names can be demangled with
$ xcrun -sdk macosx swift-demangle __TFE10FoundationSb19_bridgeToObjectiveCfSbFT_CSo8NSNumber __TFE10FoundationSuCfMSuFCSo8NSNumberSu
_TFE10FoundationSb19_bridgeToObjectiveCfSbFT_CSo8NSNumber ---> ext.Foundation.Swift.Bool._bridgeToObjectiveC (Swift.Bool)() -> ObjectiveC.NSNumber
_TFE10FoundationSuCfMSuFCSo8NSNumberSu ---> ext.Foundation.Swift.UInt.init (Swift.UInt.Type)(ObjectiveC.NSNumber) -> Swift.UInt
There is no UInt initializer that takes a Bool argument.
So the smart compiler has used the automatic conversion between Swift
and Foundation types and generated some code like
let x = UInt8(NSNumber(bool: someBool).unsignedLongValue)
Probably not very efficient with two function calls. (And it does not
compile if you only import Swift, without Foundation.)
Now the other method where you assumed data-dependent branching:
func bar(someBool : Bool) -> UInt8 {
let x = UInt8(someBool ? 1 : 0)
return x
}
The assembly code is
.globl __TF4main3barFSbVSs5UInt8
.align 4, 0x90
__TF4main3barFSbVSs5UInt8:
pushq %rbp
movq %rsp, %rbp
andb $1, %dil
movb %dil, %al
popq %rbp
retq
No branching, just an "AND" operation with 0x01!
Therefore I do not see a reason not to use this "straight-forward" conversion.
You can then profile with Instruments to check if it is a bottleneck for
your app.
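As a usage sketch building on that (the function and the choice of bit 3 are hypothetical, not from the question):
func setFlag(flags: UInt8, someBool: Bool) -> UInt8 {
    let bit: UInt8 = someBool ? 1 : 0 // compiles to a single AND, as shown above
    return flags | (bit << 3)         // place the Bool's bit pattern at bit 3
}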
@martin-r’s answer is more fun :-), but this can be done in a playground.
// first check this is true or you’ll be sorry...
sizeof(Bool) == sizeof(UInt8)
let t = unsafeBitCast(true, UInt8.self) // = 1
let f = unsafeBitCast(false, UInt8.self) // = 0

How to extract information from the llvm metadata

I have the following piece of code:
int main(int argc, char *argv[])
{
int a = 2;
int b = 5;
int soma = a + b;
//...
}
The resulting llvm bitcode is:
define i32 @main(i32 %argc, i8** %argv) #0 {
entry:
...
%a = alloca i32, align 4
%b = alloca i32, align 4
%soma = alloca i32, align 4
...
call void @llvm.dbg.declare(metadata !{i32* %a}, metadata !15), !dbg !16
store i32 2, i32* %a, align 4, !dbg !16
call void @llvm.dbg.declare(metadata !{i32* %b}, metadata !17), !dbg !18
store i32 5, i32* %b, align 4, !dbg !18
call void @llvm.dbg.declare(metadata !{i32* %soma}, metadata !19), !dbg !20
%0 = load i32* %a, align 4, !dbg !20
%1 = load i32* %b, align 4, !dbg !20
%add = add nsw i32 %0, %1, !dbg !20
store i32 %add, i32* %soma, align 4, !dbg !20
...
!1 = metadata !{i32 0}
!2 = metadata !{metadata !3}
...
!15 = metadata !{i32 786688, metadata !3, metadata !"a", metadata !4, i32 6, metadata !7, i32 0, i32 0} ; [ DW_TAG_auto_variable ] [a] [line 6]
!16 = metadata !{i32 6, i32 0, metadata !3, null}
!17 = metadata !{i32 786688, metadata !3, metadata !"b", metadata !4, i32 7, metadata !7, i32 0, i32 0} ; [ DW_TAG_auto_variable ] [b] [line 7]
!18 = metadata !{i32 7, i32 0, metadata !3, null}
!19 = metadata !{i32 786688, metadata !3, metadata !"soma", metadata !4, i32 8, metadata !7, i32 0, i32 0} ; [ DW_TAG_auto_variable ] [soma] [line 8]
!20 = metadata !{i32 8, i32 0, metadata !3, null}
From the bitcode I need to get the following text:
a = 2
b = 5
soma = a + b
My question is how to extract the information I need from the metadata (dbg).
Right now I only have the name of the instructions via I->getName() and the name of the operands via Value *valueOp = I->getOperand(i); valueOp->getName().str();
The metadata is very extensive. How do I get this information from the metadata?
Relying on I->getName() for finding the variable name is not a good idea - you have the debug info for that. The proper approach for finding out the names of all the C/C++ local variables is to traverse the IR and look for all calls to @llvm.dbg.declare, then go to their 2nd operand (the debug metadata), and retrieve the variable name from there.
Use the source-level debugging guide to find out how the debug metadata is laid out. In particular, for local variables, the 3rd argument will be a metadata string with the variable name in the C/C++ source.
So the remaining thing is to find out what the variables are initialized to. For that, follow the 1st argument to @llvm.dbg.declare for getting the actual LLVM value used, then locate the 1st store instruction into it, and check what data is used there.
If it's a constant, you now have everything you need to output a = 5-style information. If it's another instruction, you have to follow it yourself and "decode" it - e.g., if it's an "add", then you need to print its two operands with a "+" in-between, etc. And of course the prints have to be recursive... not simple. But it will provide you with the accurate initialization value.
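To make the first two steps concrete, here is a rough sketch against a recent LLVM C++ API (the function name is mine, and the metadata accessors differ across LLVM versions; the question's IR predates DILocalVariable):
#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

// Walk a function, find every @llvm.dbg.declare, and print
// "name = <constant>" when a store into the variable is a constant.
void dumpDeclaredLocals(Function &F) {
  for (BasicBlock &BB : F)
    for (Instruction &I : BB)
      if (auto *DDI = dyn_cast<DbgDeclareInst>(&I)) {
        DILocalVariable *Var = DDI->getVariable(); // variable metadata (name, line, ...)
        Value *Addr = DDI->getAddress();           // the alloca being described
        errs() << Var->getName() << " (line " << Var->getLine() << ")";
        for (User *U : Addr->users())              // look for a store into that alloca
          if (auto *SI = dyn_cast<StoreInst>(U))
            if (SI->getPointerOperand() == Addr) {
              if (auto *CI = dyn_cast<ConstantInt>(SI->getValueOperand()))
                errs() << " = " << CI->getSExtValue();
              break; // note: users() is not in program order; good enough for a sketch
            }
        errs() << "\n";
      }
}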
If you're looking for something a lot more coarse and you have access to the original source, you can just get the line number in which the variable is declared (the 5th operand in the debug metadata, assuming the tag (1st operand) is indeed DW_TAG_auto_variable and not DW_TAG_arg_variable, which indicates a parameter). Then print out that line from the original source. But this will not print all the relevant information (if the initialization value is constructed from multiple lines) and can print irrelevant information (if there are multiple statements in that line, for example).
Finally, remember optimization can seriously screw with debug information. If getting those print-outs is important, be careful with your -O option, maybe stick to -O0.