So, if I multiply two values:
emacs -batch -eval '(print (* 1252463 -4400000000000))'
It will exceed the most-negative-fixnum limit and return a mathematically wrong answer. What will be the difference at the instruction level between the -O2, -O2 -fsanitize=undefined, and -O2 -fwrapv flags?
In Emacs? Probably nothing. The function that is compiled probably looks like this:
int multiply(int x, int y) {
    return x * y;
}
If we compile that and look at the assembly (gcc -S multiply.c && cat multiply.s), we get
multiply:
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
imull -8(%rbp), %eax
popq %rbp
ret
See the imull instruction? It's doing a regular multiply. What if we try gcc -O2 -S multiply.c?
multiply:
movl %edi, %eax
imull %esi, %eax
ret
Well, that's certainly removed some code, but it's still doing imull, a regular multiplication.
Let's try to get it to not use imull:
int multiply(int x) {
    return x * 2;
}
With gcc -O2 -S multiply.c, we get
multiply:
leal (%rdi,%rdi), %eax
ret
Instead of computing the slower x * 2 with a multiply, it computed x + x (via lea), because addition is faster than multiplication.
Can we get -fwrapv to produce different code? Yes:
int multiply(int x) {
    return x * 2 < 0;
}
With gcc -O2 -S multiply.c, we get
multiply:
movl %edi, %eax
shrl $31, %eax
ret
So it was simplified into extracting the sign bit ((unsigned)x >> 31), which gives the same 0/1 result as x < 0. In math, if x * 2 < 0 then x < 0. But in the reality of processors, if x * 2 overflows it may become negative, for example 2,000,000,000 * 2 = -294,967,296.
If you force gcc to take this into account with gcc -O2 -fwrapv -S multiply.c, we get
multiply:
leal (%rdi,%rdi), %eax
shrl $31, %eax
ret
So it optimized x * 2 < 0 to x + x < 0, preserving the wraparound. It might seem strange that -fwrapv is not the default, but C was created before it was standard for signed arithmetic to overflow in this predictable two's-complement manner, which is why the language leaves signed overflow undefined.
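To see the behavioural difference rather than just the assembly, here is a minimal sketch (my own example, not from the original answer); the result without -fwrapv is only "typical", since signed overflow is undefined behaviour there:

#include <stdio.h>

int main(void) {
    volatile int x = 2000000000;   /* volatile so the compiler can't fold it away */
    /* Without -fwrapv the compiler may assume x * 2 never overflows and
       reduce this to x < 0; with -fwrapv it must honour the wraparound. */
    printf("%d\n", x * 2 < 0);
    /* gcc -O2          : typically prints 0                                      */
    /* gcc -O2 -fwrapv  : prints 1, since 2,000,000,000 * 2 wraps to -294,967,296 */
    return 0;
}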
I'm currently building object files using swiftc, with:
swiftc -emit-object bar.swift
where bar.swift is something simple like:
class Bar {
    var value: Int

    init(value: Int) {
        self.value = value
    }

    func plusValue(_ value: Int) -> Int {
        return self.value + value
    }
}
When I then move on to linking this against my main object to create an executable, I get the following error:
$ cc -o foobar foo.o bar.o
duplicate symbol '_main' in:
foo.o
bar.o
ld: 1 duplicate symbol for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
This indicates that swiftc is adding a main implementation to the object file, which can be confirmed with:
$ nm bar.o | grep _main
0000000000000000 T _main
As far as I can tell, this added function does very little:
$ otool -tV bar.o
bar.o:
(__TEXT,__text) section
_main:
0000000000000000 pushq %rbp
0000000000000001 movq %rsp, %rbp
0000000000000004 xorl %eax, %eax
0000000000000006 movl %edi, -0x4(%rbp)
0000000000000009 movq %rsi, -0x10(%rbp)
000000000000000d popq %rbp
000000000000000e retq
000000000000000f nop
....snip...
Is there a way to tell swiftc -emit-object to not add this vestigial implementation of main?
Short answer
The command line argument that I was missing was:
-parse-as-library
Long answer
In my quest to find an answer, I resorted to looking at the Swift source code on GitHub, and with a bit of luck found the following compiler invocation in the test suite:
// RUN: %target-build-swift %S/Inputs/CommandLineStressTest/CommandLineStressTest.swift -parse-as-library -force-single-frontend-invocation -module-name CommandLineStressTestSwift -emit-object -o %t/CommandLineStressTestSwift.o
According to swiftc --help, -parse-as-library causes the compiler to:
Parse the input file(s) as libraries, not scripts
This has proven to work for me, and the only difference in exported symbols is the removal of _main:
$ diff -U1 <(swiftc -emit-object -module-name bar -o a.o bar.swift && nm a.o | cut -c18-) <(swiftc -emit-object -parse-as-library -module-name bar -o b.o bar.swift && nm b.o | cut -c18-)
--- /dev/fd/63 2020-03-28 17:13:08.000000000 +1100
+++ /dev/fd/62 2020-03-28 17:13:08.000000000 +1100
@@ -23,3 +23,2 @@
U __objc_empty_cache
-T _main
s _objc_classes
and in the generated assembly the only change is the removal of the _main code:
$ diff -U1 <(swiftc -emit-object -module-name bar -o a.o bar.swift && objdump -d -no-leading-addr -no-show-raw-insn a.o) <(swiftc -emit-object -parse-as-library -module-name bar -o b.o bar.swift && objdump -d -no-leading-addr -no-show-raw-insn b.o)
--- /dev/fd/63 2020-03-28 17:19:03.000000000 +1100
+++ /dev/fd/62 2020-03-28 17:19:03.000000000 +1100
@@ -1,15 +1,5 @@
-a.o: file format Mach-O 64-bit x86-64
+b.o: file format Mach-O 64-bit x86-64
Disassembly of section __TEXT,__text:
-_main:
- pushq %rbp
- movq %rsp, %rbp
- xorl %eax, %eax
- movl %edi, -4(%rbp)
- movq %rsi, -16(%rbp)
- popq %rbp
- retq
- nop
-
_$S3bar3BarC5valueSivg:
I read that range-based loops have better performance in some programming languages. Is that the case in Swift? For instance, in a Playground:
func timeDebug(desc: String, function: () -> ()) {
    let start: UInt64 = mach_absolute_time()
    function()
    let duration: UInt64 = mach_absolute_time() - start
    var info: mach_timebase_info = mach_timebase_info(numer: 0, denom: 0)
    mach_timebase_info(&info)
    let total = (duration * UInt64(info.numer) / UInt64(info.denom)) / 1_000
    println("\(desc): \(total) µs.")
}
func loopOne() {
    for i in 0..<4000 {
        println(i)
    }
}

func loopTwo() {
    for var i = 0; i < 4000; i++ {
        println(i)
    }
}
range-based loop
timeDebug("Loop One time"){
loopOne(); // Loop One time: 2075159 µs.
}
normal for loop
timeDebug("Loop Two time"){
loopTwo(); // Loop Two time: 1905956 µs.
}
How do I properly benchmark in Swift?
// Update on the device
First run
Loop Two time: 54 µs.
Loop One time: 482 µs.
Second
Loop Two time: 44 µs.
Loop One time: 382 µs.
Third
Loop Two time: 43 µs.
Loop One time: 419 µs.
Fourth
Loop Two time: 44 µs.
Loop One time: 399 µs.
// Update 2
func printTimeElapsedWhenRunningCode(title: String, operation: () -> ()) {
    let startTime = CFAbsoluteTimeGetCurrent()
    operation()
    let timeElapsed = CFAbsoluteTimeGetCurrent() - startTime
    println("Time elapsed for \(title): \(timeElapsed) s")
}
printTimeElapsedWhenRunningCode("Loop Two time") {
loopTwo(); // Time elapsed for Loop Two time: 4.10079956054688e-05 s
}
printTimeElapsedWhenRunningCode("Loop One time") {
loopOne(); // Time elapsed for Loop One time: 0.000500023365020752 s.
}
You shouldn’t really benchmark in playgrounds since they’re unoptimized. Unless you’re interested in how long things will take when you’re debugging, you should only ever benchmark optimized builds (swiftc -O).
To understand why a range-based loop can be faster, you can look at the assembly generated for the two options:
Range-based
% echo "for i in 0..<4_000 { println(i) }" | swiftc -O -emit-assembly -
; snip opening boilerplate...
LBB0_1:
movq %rbx, -32(%rbp)
; increment i
incq %rbx
movq %r14, %rdi
movq %r15, %rsi
; print (pre-incremented) i
callq __TFSs7printlnU__FQ_T_
; compare i to 4_000
cmpq $4000, %rbx
; loop if not equal
jne LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.cfi_endproc
C-style for loop
% echo "for var i = 0;i < 4_000;++i { println(i) }" | swiftc -O -emit-assembly -
; snip opening boilerplate...
LBB0_1:
movq %rbx, -32(%rbp)
movq %r14, %rdi
movq %r15, %rsi
; print i
callq __TFSs7printlnU__FQ_T_
; increment i
incq %rbx
; jump if overflow
jo LBB0_4
; compare i to 4_000
cmpq $4000, %rbx
; loop if less than
jl LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
LBB0_4:
; raise illegal instruction due to overflow
ud2
.cfi_endproc
So the reason the C-style loop is slower is that it’s performing an extra operation – checking for overflow. Either Range was written to avoid the overflow check (or do it up front), or the optimizer was better able to eliminate it in the Range version.
If you switch to the overflow-ignoring addition operator (&+), you can eliminate this check. This produces near-identical code to the range-based version (the only difference being some immaterial ordering of the code):
% echo "for var i = 0;i < 4_000;i = i &+ 1 { println(i) }" | swiftc -O -emit-assembly -
; snip
LBB0_1:
movq %rbx, -32(%rbp)
movq %r14, %rdi
movq %r15, %rsi
callq __TFSs7printlnU__FQ_T_
incq %rbx
cmpq $4000, %rbx
jne LBB0_1
xorl %eax, %eax
addq $8, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
retq
.cfi_endproc
Never Benchmark Unoptimized Builds
If you want to understand why, try looking at the output for the Range-based version of the above, but with no optimization: echo "for i in 0..<4_000 { println(i) }" | swiftc -Onone -emit-assembly -. You will see it output a lot more code. That’s because Range used via for…in is an abstraction, a struct used with custom operators and functions returning generators, and it does a lot of safety checks and other helpful things. This makes the code a lot easier to write and read. But when you turn on the optimizer, all this disappears and you’re left with very efficient code.
Benchmarking
As to ways to benchmark, this is the code I tend to use, just replacing the entries in the runs array:
import CoreFoundation.CFDate
func timeRun<T>(name: String, f: () -> T) -> String {
    let start = CFAbsoluteTimeGetCurrent()
    let result = f()
    let end = CFAbsoluteTimeGetCurrent()
    let timeStr = toString(Int((end - start) * 1_000_000))
    return "\(name)\t\(timeStr)µs, produced \(result)"
}

let n = 4_000

let runs: [(String, () -> Void)] = [
    ("for in range", {
        for i in 0..<n { println(i) }
    }),
    ("plain ol for", {
        for var i = 0; i < n; ++i { println(i) }
    }),
    ("w/o overflow", {
        for var i = 0; i < n; i = i &+ 1 { println(i) }
    }),
]
println("\n".join(map(runs, timeRun)))
But the results will probably be meaningless, since jitter during println will likely obscure actual measurement. To really benchmark (assuming you don’t just trust the assembly analysis :) you’d need to replace it with something very lightweight.
In ObjC, a bool's bit pattern could be retrieved by casting it to a UInt8.
e.g.
true => 0x01
false => 0x00
This bit pattern could then be used in further bit manipulation operations.
Now I want to do the same in Swift.
What I got working so far is
UInt8(UInt(boolValue))
but this doesn't look like it is the preferred approach.
I also need the conversion in O(1) without data-dependent branching. So, stuff like the following is not allowed.
boolValue ? 1 : 0
Also, is there some documentation about the way the UInt8 and UInt initializers are implemented? E.g., if the UInt initializer that converts from Bool uses data-dependent branching, I can't use it either.
Of course, the fallback is always to use further bitwise operations to avoid the bool value altogether (e.g. Check if a number is non zero using bitwise operators in C).
Does Swift offer an elegant way to access the bit pattern of a Bool / convert it to UInt8, in O(1) without data-dependent branching?
When in doubt, have a look at the generated assembly code :)
func foo(someBool: Bool) -> UInt8 {
    let x = UInt8(UInt(someBool))
    return x
}
compiled with ("-O" = "Compile with optimizations")
xcrun -sdk macosx swiftc -emit-assembly -O main.swift
gives
.globl __TF4main3fooFSbVSs5UInt8
.align 4, 0x90
__TF4main3fooFSbVSs5UInt8:
.cfi_startproc
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
callq __TFE10FoundationSb19_bridgeToObjectiveCfSbFT_CSo8NSNumber
movq %rax, %rdi
callq __TFE10FoundationSuCfMSuFCSo8NSNumberSu
movzbl %al, %ecx
cmpq %rcx, %rax
jne LBB0_2
popq %rbp
retq
The function names can be demangled with
$ xcrun -sdk macosx swift-demangle __TFE10FoundationSb19_bridgeToObjectiveCfSbFT_CSo8NSNumber __TFE10FoundationSuCfMSuFCSo8NSNumberSu
_TFE10FoundationSb19_bridgeToObjectiveCfSbFT_CSo8NSNumber ---> ext.Foundation.Swift.Bool._bridgeToObjectiveC (Swift.Bool)() -> ObjectiveC.NSNumber
_TFE10FoundationSuCfMSuFCSo8NSNumberSu ---> ext.Foundation.Swift.UInt.init (Swift.UInt.Type)(ObjectiveC.NSNumber) -> Swift.UInt
There is no UInt initializer that takes a Bool argument.
So the smart compiler has used the automatic conversion between Swift
and Foundation types and generated some code like
let x = UInt8(NSNumber(bool: someBool).unsignedLongValue)
Probably not very efficient with two function calls. (And it does not
compile if you only import Swift, without Foundation.)
Now the other method where you assumed data-dependent branching:
func bar(someBool: Bool) -> UInt8 {
    let x = UInt8(someBool ? 1 : 0)
    return x
}
The assembly code is
.globl __TF4main3barFSbVSs5UInt8
.align 4, 0x90
__TF4main3barFSbVSs5UInt8:
pushq %rbp
movq %rsp, %rbp
andb $1, %dil
movb %dil, %al
popq %rbp
retq
No branching, just an "AND" operation with 0x01!
Therefore I do not see a reason not to use this "straightforward" conversion.
You can then profile with Instruments to check if it is a bottleneck for
your app.
Martin R’s answer is more fun :-), but this can be done in a playground.
// first check this is true or you’ll be sorry...
sizeof(Bool) == sizeof(UInt8)
let t = unsafeBitCast(true, UInt8.self) // = 1
let f = unsafeBitCast(false, UInt8.self) // = 0
I wish to replace the default CFAllocator in my iPhone app with my own implementation. I want to control the memory allocated by UIWebView, since it seems to hold on to so much memory after loading a website, and that memory still lingers around after the UIWebView is released.
After I call CFAllocatorSetDefault I get an EXC_BREAKPOINT exception when the next allocation occurs.
The exception seems to happen inside of a call to CFRetain (done in the simulator but the same thing happens on a device):
CoreFoundation`CFRetain:
0x1c089b0: pushl %ebp
0x1c089b1: movl %esp, %ebp
0x1c089b3: pushl %edi
0x1c089b4: pushl %esi
0x1c089b5: subl $16, %esp
0x1c089b8: calll 0x1c089bd ; CFRetain + 13
0x1c089bd: popl %edi
0x1c089be: movl 8(%ebp), %esi
0x1c089c1: testl %esi, %esi
0x1c089c3: jne 0x1c089db ; CFRetain + 43
0x1c089c5: int3
0x1c089c6: calll 0x1d66a00 ; symbol stub for: getpid <- EXC_BREAKPOINT (code=EXC_I386_BPT subcode=0x0)
0x1c089cb: movl %eax, (%esp)
0x1c089ce: movl $9, 4(%esp)
0x1c089d6: calll 0x1d66a4e ; symbol stub for: kill
0x1c089db: movl (%esi), %eax
0x1c089dd: testl %eax, %eax
0x1c089df: je 0x1c08a17 ; CFRetain + 103
0x1c089e1: cmpl 1838519(%edi), %eax
0x1c089e7: je 0x1c08a17 ; CFRetain + 103
0x1c089e9: movl 4(%esi), %ecx
0x1c089ec: shrl $8, %ecx
0x1c089ef: andl $1023, %ecx
0x1c089f5: cmpl 1834423(%edi,%ecx,4), %eax
0x1c089fc: je 0x1c08a17 ; CFRetain + 103
0x1c089fe: movl 1766575(%edi), %eax
0x1c08a04: movl %eax, 4(%esp)
0x1c08a08: movl %esi, (%esp)
0x1c08a0b: calll 0x1d665c8 ; symbol stub for: objc_msgSend
UPDATE
Core Foundation has a bug that makes CFAllocatorSetDefault useless.
Specifically, if you study the implementation of _CFRuntimeCreateInstance in CFRuntime.c, you'll see that:
If it's not using the system default allocator, it tries to retain the allocator.
If it's been passed NULL as its allocator argument, it will try to retain that NULL instead of the current default allocator.
The call to CFRetain will therefore crash.
What it should do is retain the current default allocator when it's given NULL as its allocator argument.
Since lots of functions in Apple's own libraries apparently pass NULL (or kCFAllocatorDefault, which is also a null pointer) to functions that create a Core Foundation object, you're bound to crash quickly if you change the default allocator at all.
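In other words, the fix would look something like the sketch below (the function and parameter names are illustrative only, not the actual CFRuntime.c source):

#include <CoreFoundation/CoreFoundation.h>

/* Illustrative sketch: retain the current default allocator when the caller
   passed NULL, instead of retaining the NULL itself. */
static CFAllocatorRef retainedAllocatorFor(CFAllocatorRef passedAllocator) {
    CFAllocatorRef allocator = passedAllocator;   /* may be NULL / kCFAllocatorDefault */
    if (allocator == NULL)
        allocator = CFAllocatorGetDefault();      /* substitute the current default */
    return (CFAllocatorRef)CFRetain(allocator);   /* no longer a NULL retain */
}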
My test case: I created a new, single-view iPhone app. I added one line to main:
int main(int argc, char *argv[])
{
    CFAllocatorSetDefault(kCFAllocatorMalloc);
    @autoreleasepool {
        return UIApplicationMain(argc, argv, nil, NSStringFromClass([AppDelegate class]));
    }
}
The app crashes during startup on the simulator and on my test device, in CFRetain, with EXC_BREAKPOINT, with a null pointer as the function argument.
ORIGINAL
You are passing a null pointer to CFRetain. If this has anything to do with your custom allocator, you need to post more details, like the full call stack when the exception occurs.
In your disassembly listing, the instructions from 0x1c089b0 through 0x1c089bd are the function prologue.
At 0x1c089be, the movl 8(%ebp), %esi instruction loads the first function argument from the stack into %esi.
At 0x1c089c1, the testl %esi, %esi instruction sets the processor flags based on the value of %esi. In particular, it sets the Z (zero) flag to 1 if %esi contains zero, and sets the Z flag to 0 if %esi contains anything else.
At 0x1c089c3, the jne 0x1c089db instruction jumps if the ne condition is true. The ne condition is true when the Z flag is 0 and false when the Z flag is 1. So this instruction jumps when %esi (the first argument) is non-zero, and falls through when %esi is zero.
At 0x1c089c5, the int3 instruction raises a SIGTRAP signal with exception code EXC_BREAKPOINT. The int3 instruction is normally stuffed into a program by the debugger when you set a breakpoint. In this case, it was hardcoded in the program at compile-time.
Thus, you are getting this exception because you are passing a null pointer to CFRetain.
You can also look at the source code of CFRetain if you like. It is in CFRuntime.c:
CFTypeRef CFRetain(CFTypeRef cf) {
    if (NULL == cf) { CRSetCrashLogMessage("*** CFRetain() called with NULL ***"); HALT; }
    if (cf) __CFGenericAssertIsCF(cf);
    return _CFRetain(cf, false);
}
So the very first thing CFRetain does is test whether its argument is NULL. CRSetCrashLogMessage is a macro defined in CoreFoundation_Prefix.h that does nothing. HALT is a macro defined in CFInternal.h:
#define HALT do {asm __volatile__("int3"); kill(getpid(), 9); } while (0)
As you can see, HALT has a hard-coded int3 instruction. Then it calls kill(getpid(), 9). This matches your disassembly listing.
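A minimal sketch that reproduces the same trap directly (assuming you link against CoreFoundation) is simply:

#include <CoreFoundation/CoreFoundation.h>

int main(void) {
    /* Hits the hard-coded int3 in HALT before kill(getpid(), 9) runs,
       so the debugger reports EXC_BREAKPOINT. */
    CFRetain(NULL);
    return 0;
}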
I am writing long (multi-word) addition in GAS inline assembly,
template <std::size_t NumBits>
void inline KA_add(vli<NumBits> & x, vli<NumBits> const& y);
If I specialize, I can do:
template <>
void inline KA_add<128>(vli<128>& x, vli<128> const& y) {
    asm("addq %2, %0; adcq %3, %1;" : "+r"(x[0]), "+r"(x[1]) : "g"(y[0]), "g"(y[1]) : "cc");
}
Nice, it works. Now, if I try to generalize it so that the template works for any length and let the compiler do the work...
template <std::size_t NumBits>
void inline KA_add(vli<NumBits>& x, vli<NumBits> const& y) {
    asm("addq %1, %0;" : "+r"(x[0]) : "g"(y[0]) : "cc");
    for(int i(1); i < vli<NumBits>::numwords; ++i)
        asm("adcq %1, %0;" : "+r"(x[i]) : "g"(y[i]) : "cc");
}
Well, it does not work: I have no guarantee that the carry bit (CB) is propagated. It is not preserved between the first asm statement and the second one. That seems logical, because the loop increments i and thereby clobbers the CB, I think; there should be a GAS constraint to preserve the CB across the two asm statements, but unfortunately I cannot find any such information.
Any idea?
Thank you!
PS: I rewrote my function to strip out the C++ abstractions:
template <std::size_t NumBits>
inline void KA_add_test(boost::uint64_t* x, boost::uint64_t const* y) {
    asm("addq %1, %0;" : "+r"(x[0]) : "g"(y[0]) : "cc");
    for(int i(1); i < vli<NumBits>::numwords; ++i)
        asm("adcq %1, %0;" : "+r"(x[i]) : "g"(y[i]) : "cc");
}
The generated asm (GCC, debug mode) is:
APP
addq %rdx, %rax;
NO_APP
movq -24(%rbp), %rdx
movq %rax, (%rdx)
.LBB94:
.loc 9 55 0
movl $1, -4(%rbp)
jmp .L323
.L324:
.loc 9 56 0
movl -4(%rbp), %eax
cltq
salq $3, %rax
movq %rax, %rdx
addq -24(%rbp), %rdx <----------------- Break the carry bit
movl -4(%rbp), %eax
cltq
salq $3, %rax
addq -32(%rbp), %rax
movq (%rax), %rcx
movq (%rdx), %rax
APP
adcq %rcx, %rax;
NO_APP
As we can see, there is an additional addq between the two asm statements, and it destroys the propagation of the CB.
I see no way to explicitly tell the compiler that the loop code must be created without instructions affecting the C flag.
It's surely possible to write such a loop by hand - use lea to count the array addresses upwards, dec to count the loop downwards, and test Z for the end condition. That way, nothing in the loop except the actual array sum changes the C flag.
You'd have to do a manual thing, like:
boost::uint64_t* xp = &x[0];        // the pointers and the count are modified inside
boost::uint64_t const* yp = &y[0];  // the asm, so they must be read-write ("+r") operands
std::size_t n = vli<NumBits>::numwords;
long long tmp;                      // holds a scratch register
__asm__ __volatile__(
    "clc\n"                  // start with the carry flag clear
    "0:\n\t"
    "movq (%2), %0\n\t"      // load the next word of y
    "lea 8(%2), %2\n\t"      // advance the y pointer (lea does not touch flags)
    "adcq %0, (%1)\n\t"      // x[i] += y[i] + carry
    "lea 8(%1), %1\n\t"      // advance the x pointer
    "dec %3\n\t"             // count down (dec leaves the carry flag alone)
    "jnz 0b"
    : "=&r"(tmp), "+r"(xp), "+r"(yp), "+r"(n)
    :
    : "cc", "memory");
For hot code, a tight loop isn't optimal though; for one, the instructions have dependencies, and there are significantly more instructions per iteration than in inlined / unrolled adc sequences. A better sequence would be something like the following (with %rbp and %rsi holding the start addresses of the source and target arrays):
0:
lea 64(%rbp), %r13
lea 64(%rsi), %r14
movq (%rbp), %rax
movq 8(%rbp), %rdx
adcq (%rsi), %rax
movq 16(%rbp), %rcx
adcq 8(%rsi), %rdx
movq 24(%rbp), %r8
adcq 16(%rsi), %rcx
movq 32(%rbp), %r9
adcq 24(%rsi), %r8
movq 40(%rbp), %r10
adcq 32(%rsi), %r9
movq 48(%rbp), %r11
adcq 40(%rsi), %r10
movq 56(%rbp), %r12
adcq 48(%rsi), %r11
movq %rax, (%rsi)
adcq 56(%rsi), %r12
movq %rdx, 8(%rsi)
movq %rcx, 16(%rsi)
movq %r8, 24(%rsi)
movq %r13, %rbp // next src
movq %r9, 32(%rsi)
movq %r10, 40(%rsi)
movq %r11, 48(%rsi)
movq %r12, 56(%rsi)
movq %r14, %rsi // next tgt
dec %edi // use counter / 8 (doing 8 words / iteration)
jnz 0b // loop again if not yet zero
and looping only around such blocks. The advantage would be that the loads are grouped together, and you'd deal with the loop count / termination condition only once per block.
I would, quite honestly, try not to make the general bit width particularly "neat", but rather special-case explicitly unrolled code for, say, power-of-two bit widths, and add a flag / compile-time message to the non-optimized template instantiation telling the user to "use a power of two".
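If dropping to inline asm for the general case feels too fragile, a plain-C sketch along these lines keeps the carry in an ordinary variable instead of the CPU flag (assuming GCC/Clang's __builtin_add_overflow is available; the names here are mine, not from the question), and recent compilers will often turn it into an adc chain at -O2:

#include <stdint.h>
#include <stddef.h>

/* x += y over numwords 64-bit limbs, with the carry propagated in plain C. */
static inline void ka_add_c(uint64_t *x, const uint64_t *y, size_t numwords)
{
    unsigned carry = 0;
    for (size_t i = 0; i < numwords; ++i) {
        uint64_t sum;
        unsigned c1 = __builtin_add_overflow(x[i], y[i], &sum);
        unsigned c2 = __builtin_add_overflow(sum, (uint64_t)carry, &x[i]);
        carry = c1 | c2;   /* the two additions can never both carry */
    }
}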