Understand DispatchTime on M1 machines

Understand DispatchTime on M1 machines - swift

In my iOS project were were able to replicate Combine's Schedulers implementation and we have an extensive suit of testing, everything was fine on Intel machines all the tests were passing, now we got some of M1 machines to see if there is a showstopper in our workflow.
Suddenly some of our library code starts failing, the weird thing is even if we use Combine's Implementation the tests still failing.
Our assumption is we are misusing DispatchTime(uptimeNanoseconds:) as you can see in the following screen shot (Combine's implementation)
We know by now that initialising DispatchTime with uptimeNanoseconds value doesn't mean they are the actual nanoseconds on M1 machines, according to the docs
Creates a DispatchTime relative to the system clock that
ticks since boot.
- Parameters:
- uptimeNanoseconds: The number of nanoseconds since boot, excluding
time the system spent asleep
- Returns: A new `DispatchTime`
- Discussion: This clock is the same as the value returned by
`mach_absolute_time` when converted into nanoseconds.
On some platforms, the nanosecond value is rounded up to a
multiple of the Mach timebase, using the conversion factors
returned by `mach_timebase_info()`. The nanosecond equivalent
of the rounded result can be obtained by reading the
`uptimeNanoseconds` property.
Note that `DispatchTime(uptimeNanoseconds: 0)` is
equivalent to `DispatchTime.now()`, that is, its value
represents the number of nanoseconds since boot (excluding
system sleep time), not zero nanoseconds since boot.
so, is the test wrong or we should not use DispatchTime like this?
we try to follow Apple suggestion and use this:
uint64_t MachTimeToNanoseconds(uint64_t machTime)
{
uint64_t nanoseconds = 0;
static mach_timebase_info_data_t sTimebase;
if (sTimebase.denom == 0)
(void)mach_timebase_info(&sTimebase);
nanoseconds = ((machTime * sTimebase.numer) / sTimebase.denom);
return nanoseconds;
}
it didnt help a lot.
Edit: Screenshot code:
func testSchedulerTimeTypeDistance() {
let time1 = DispatchQueue.SchedulerTimeType(.init(uptimeNanoseconds: 10000))
let time2 = DispatchQueue.SchedulerTimeType(.init(uptimeNanoseconds: 10431))
let distantFuture = DispatchQueue.SchedulerTimeType(.distantFuture)
let notSoDistantFuture = DispatchQueue.SchedulerTimeType(
DispatchTime(
uptimeNanoseconds: DispatchTime.distantFuture.uptimeNanoseconds - 1024
)
)
XCTAssertEqual(time1.distance(to: time2), .nanoseconds(431))
XCTAssertEqual(time2.distance(to: time1), .nanoseconds(-431))
XCTAssertEqual(time1.distance(to: distantFuture), .nanoseconds(-10001))
XCTAssertEqual(distantFuture.distance(to: time1), .nanoseconds(10001))
XCTAssertEqual(time2.distance(to: distantFuture), .nanoseconds(-10432))
XCTAssertEqual(distantFuture.distance(to: time2), .nanoseconds(10432))
XCTAssertEqual(time1.distance(to: notSoDistantFuture), .nanoseconds(-11025))
XCTAssertEqual(notSoDistantFuture.distance(to: time1), .nanoseconds(11025))
XCTAssertEqual(time2.distance(to: notSoDistantFuture), .nanoseconds(-11456))
XCTAssertEqual(notSoDistantFuture.distance(to: time2), .nanoseconds(11456))
XCTAssertEqual(distantFuture.distance(to: distantFuture), .nanoseconds(0))
XCTAssertEqual(notSoDistantFuture.distance(to: notSoDistantFuture),
.nanoseconds(0))
}

The difference between Intel and ARM code is precision.
With Intel code, DispatchTime internally works with nanoseconds. With ARM code, it works with nanoseconds * 3 / 125 (plus some integer rounding). The same applies to DispatchQueue.SchedulerTimeType.
DispatchTimeInterval and DispatchQueue.SchedulerTimeType.Stride internally use nanoseconds on both platforms.
So the ARM code uses lower precision for calculations but full precision when comparing distances. In addition, precision is lost when converting from nanoseconds to the internal unit.
The exact formula for the DispatchTime conversions are (executed as integer operations):
rawValue = (nanoseconds * 3 + 124) / 125
nanoseconds = rawValue * 125 / 3
As an example, let's take this code:
let time1 = DispatchQueue.SchedulerTimeType(.init(uptimeNanoseconds: 10000))
let time2 = DispatchQueue.SchedulerTimeType(.init(uptimeNanoseconds: 10431))
XCTAssertEqual(time1.distance(to: time2), .nanoseconds(431))
It results in the calculation:
(10000 * 3 + 124) / 125 -> 240
(10431 * 3 + 124) / 125 -> 251
251 - 240 -> 11
11 * 125 / 3 -> 458
The resulting comparison between 458 and 431 then fails.
So the main fix would be to allow for small differences (I haven't verified if 42 is the maximum difference):
XCTAssertEqual(time1.distance(to: time2), .nanoseconds(431), accuracy: .nanoseconds(42))
XCTAssertEqual(time2.distance(to: time1), .nanoseconds(-431), accuracy: .nanoseconds(42))
And there are more surprises: Other than with Intel code, distantFuture and notSoDistantFuture are equal with ARM code. It has probably been implemented like so to protect from an overflow when multiplying with 3. (The actual calculation would be: 0xFFFFFFFFFFFFFFFF * 3). And the conversion from the internal unit to nanoseconds would result in 0xFFFFFFFFFFFFFFFF * 125 / 3, a value to big to be represented with 64 bits.
Furthermore I think that you are relying on implementation specific behavior when calculating the distance between time stamps at or close to 0 and time stamps at or close to distant future. The tests rely on the fact the distant future internally uses 0xFFFFFFFFFFFFFFFF and that the unsigned subtraction wraps around and produces a result as if the internal value was -1.

I think your issue lies in this line:
nanoseconds = ((machTime * sTimebase.numer) / sTimebase.denom)
... which is doing integer operations.
The actual ratio here for M1 is 125/3 (41.666...), so your conversion factor is truncating to 41. This is a ~1.6% error, which might explain the differences you're seeing.

Related

Looking for a Better Alternative to String Format for MQTT

I am mqtt-ing a string from a Rasbperry Pi(sitting in a field, supported by LTE internet, costing me 10$/500MB/month) to an MQTT broker. I am using paho-mqtt client in python to do this for me. The string looks something like "MM:DD:YYY HH:MM:SS, X1, X2, X3, , , , X24", and I am sending a new string every 30 seconds. X1 to Xn are floating point numbers 0 to 700 with 2 digit precision. I think this will cost me a lot of internet when I deploy it to 24/7 use. Is my data format good? What other data formats should I look at?

You can represent the Unix time with a 4-byte float. And you can represent a float with an IEEE754 float in 4 bytes. So your time and 24 floats can be packed into 100 bytes with Python struct.pack(). That looks like this:
import struct
import time
import random
# Synthesize some sample data - a time and 24 floats 0..700
data = [time.time()] + [ random.uniform(0, 700) for _ in range(24)]
# Pack as 25 IEEE754 floats of 4 bytes each
payload = struct.pack('!25f', *data)
print(len(payload)) # prints 100 (bytes)
Currently, you seem to be using:
19 bytes for your time and
around 7 bytes for each float including separators
So, that's around 180 bytes as you currently have it.
If you multiplied your floats by 100 and made them integer you could maybe encode as 16-bit unsigned values (i.e. half the space of a 4-byte float) which would go from 0..65535 to represent 0..655 which is close to your data range of 0..700. So that would be 4 bytes for the time, plus 24 samples of 2 bytes each, for a total of 52 bytes.
So, rather than 100, use 65535/700 or 93.62:
# Scale the data to the range 0..65535 and make into integers
smallerData = [data[0]] + [ int(93.62*data[i]) for i in range(1,25)]
payload = struct.pack('!f24H', *smallerData)
print(len(payload)) # prints 52 (bytes)
Obviously all the numbers above exclude MQTT protocol overhead.

Why sometimes Apple Accelerate framework is slow?

I am playing with C and Swift 3.0 code using vecLib and Accelerate framework from Apple as dynamic lib + my code in C lang based project and Swift playground.
And in situation with calling Apple's wrapper from framework of SIMD instruction with 1 or < 4 elements computation function like vvcospif() from framework is slower than simple standart cos(x * PI) when functions calls from loop near 1.000 times as example.
I know about difference between vvcospif() and cos(), I should use exactly vvcospif() for x * PI.
Example in playground, you can just copy code and run it:
import Cocoa
import Accelerate
func cosine_interpolate(alpha: Float, a: Float, b: Float) -> Float {
let ft: Float = alpha * 3.1415927;
let f: Float = (1 - cos(ft)) * 0.5;
return a + f*(b - a);
}
var start: Date = NSDate() as Date
var interp: Float;
for index in 0..<1000 {
interp = cosine_interpolate(alpha: 0.25, a: 1.0, b: 0.75)
}
var end = NSDate();
var timeInterval: Double = end.timeIntervalSince(start);
print("cosine_interpolate in \(timeInterval) seconds")
func fast_cosine_interpolate(alpha: Float, a: Float, b: Float) -> Float {
var x: Float = alpha
var count: Int32 = 1
var result: Float = 0
vvcospif(&result, &x, &count)
let SINSIN_HALF_X: Float = (1 - result) * 0.5;
return a + SINSIN_HALF_X * (b - a);
}
start = NSDate() as Date
for index in 0..<1000 {
interp = fast_cosine_interpolate(alpha: 0.25, a: 1.0, b: 0.75)
}
end = NSDate();
timeInterval = end.timeIntervalSince(start);
print("fast_cosine_interpolate in \(timeInterval) seconds")
My question is:
Why vvcospif() is slow in this example?
May be because vvcospif() it is wrapper under Objective-C runtime and converting data structures / copying of memory from Intel SIMD -> Objective-C -> Swift runtime is slower then tiny cos()?
I also have performance issue with C code +
#include <Accelerate/Accelerate.h>
vvcospif(resultVector, inputVector, &count);
when inputVector and resultVector is small arrays with 1 or 2 elements or just float variable, and calls in loop with ~ 1.000.000 times.
cos(x * PI) computation time near 20 ms.
and
vvcospif(x) with processing one float or float array[2] - computation time near 80 ms! Where is Acceleration? :)
Yes, in Xcode I use compiler -O -whole-module-optimization optimisation with whole module opt. enabled.

For a more detailed discussion with examples, see "Introduction to Fast Bezier (and Trying the Accelerate.framework)".
The first, fundamental problem is that non-inlined function calls are extremely expensive. You don't want function calls if you can possibly help it in performance-critical code. Within a module, the compiler can often inline functions for you, and parts of stdlib can be inlined for you. But when you start crossing module barriers, Swift generally cannot optimize out the call.
The point of SIMD functions is that you set up all your data in the right format, and then call them just one time. That way the cost of the function call is made up by the SIMD optimized code you're calling.
But remember, you don't have to call into Accelerate to get SIMD optimizations. The compiler is perfectly capable of noticing you've written a loop and turning it into an inline SIMD algorithm itself (and it does this all the time). So for many simple problems, the compiler is going to win anyway. Think about it: if calling vvcospif with a count of 1 were faster than calling cos, wouldn't they just implement cos that way?
I haven't played with your code much, but if you want to improve its performance with Accelerate, you want to think about how to arrange all your input data so you can call vvcospif one time with a large N. It's quite possible in that case that it will be much faster that a loop (since cos is not trivial).
If you want an example of Accelerate in practice, and how you need to organize your data, see PinchText. This code is computing offsets for a page full of a few thousand glyphs based on up to 10 touches in real-time, with animations (see PinchText.mov for what the result looks like). In particular look at adjustViewPositions:count:forTouchPoint:. Notice how count is large, and the data is transformed step by step with no loops. Even throwing in a (very expensive) ObjC method call into that method doesn't matter very much because it's only made one time. Getting rid of function calls in loops is a huge part of performance programming.

arc4random throwing out huge numbers

In a cocos2d game, I use arc4random to generate random numbers like this:
float x = (arc4random()%10 - 5)*delta;
(delta is the time between updates in the scheduled update method)
NSLog(#"x: %f", x);
I have been checking them like that.
Most of the numbers that I get are like this:
2012-12-29 15:37:18.206 Jumpy[1924:907] x: 0.033444
or
2012-12-29 15:37:18.247 Jumpy[1924:907] x: 0.033369
But for some reason I get numbers like this sometimes:
2012-12-29 15:37:18.244 Jumpy[1924:907] x: 71658664.000000
Edit: Delta is almost always:
2012-12-29 17:01:26.612 Jumpy[2059:907] delta: 0.016590
I thought it should return numbers in a range of -5 to 5 (multiplied by some small number). Why I am getting numbers like this?

arc4random returns a u_int32_t. The u_ part tells you that it's unsigned. So all of the operators inside the parentheses use unsigned arithmetic.
If you perform the subtraction 2 - 5 using unsigned 32-bit arithmetic, you get 232 + 2 - 5 = 232 - 3 = 4294967293 (a “huge number”).
Cast to a signed type before performing the subtraction. Also, prefer arc4random_uniform if your deployment target is iOS 4.3 or later:
float x = ((int)arc4random_uniform(10) - 5) * delta;
If you want the range to include -5 and 5, you need to use 11 instead of 10, because the range [-5,5] (inclusive) contains 11 elements:
float x = ((int)arc4random_uniform(11) - 5) * delta;

arc4random returns a u_int32_t, an unsigned type. The modulus is also performed using unsigned arithmetic, which yields a number between 0 and 9, as expected (by the way, don't ever do this; use arc4random_uniform instead). You then subtract 5, which is interpreted as an unsigned value, yielding a possibly huge positive value due to underflow.
The solution is to explicitly type the 5 by storing it in a variable of signed type or with a suffix (like 5L).

Looks like arc4random % 10 becomes less than 5, and you are working with negative integer later.
What is the value of delta?

53 * .01 = .531250

I'm converting a string date/time to a numerical time value. In my case I'm only using it to determine if something is newer/older than something else, so this little decimal problem is not a real problem. It doesn't need to be seconds precise. But still it has me scratching my head and I'd like to know why..
My date comes in a string format of #"2010-09-08T17:33:53+0000". So I wrote this little method to return a time value. Before anyone jumps on how many seconds there are in months with 28 days or 31 days I don't care. In my math it's fine to assume all months have 31 days and years have 31*12 days because I don't need the difference between two points in time, only to know if one point in time is later than another.
-(float) uniqueTimeFromCreatedTime: (NSString *)created_time {
float time;
if ([created_time length]>19) {
time = ([[created_time substringWithRange:NSMakeRange(2, 2)]floatValue]-10) * 535680; // max for 12 months is 535680.. uh oh y2100 bug!
time=time + [[created_time substringWithRange:NSMakeRange(5, 2)]floatValue] * 44640; // to make it easy and since it doesn't matter we assume 31 days
time=time + [[created_time substringWithRange:NSMakeRange(8, 2)]floatValue] * 1440;
time=time + [[created_time substringWithRange:NSMakeRange(11, 2)]floatValue] * 60;
time=time + [[created_time substringWithRange:NSMakeRange(14, 2)]floatValue];
time = time + [[created_time substringWithRange:NSMakeRange(17, 2)]floatValue] * .01;
return time;
}
else {
//NSLog(#"error - time string not long enough");
return 0.0;
}
}
When passed that very string listed above the result should be 414333.53, but instead it is returning 414333.531250.
When I toss an NSLog in between each time= to track where it goes off I get this result:
time 0.000000
time 401760.000000
time 413280.000000
time 414300.000000
time 414333.000000
floatvalue 53.000000
time 414333.531250
Created Time: 2010-09-08T17:33:53+0000 414333.531250
So that last floatValue returned 53.0000 but when I multiply it by .01 it turns into .53125. I also tried intValue and it did the same thing.

Welcome to floating point rounding errors. If you want accuracy two a fixed number of decimal points, multiply by 100 (for 2 decimal points) then round() it and divide it by 100. So long as the number isn't obscenely large (occupies more than I think 57 bits) then you should be fine and not have any rounding problems on the division back down.
EDIT: My note about 57 bits should be noted I was assuming double, floats have far less precision. Do as another reader suggests and switch to double if possible.

IEEE floats only have 24 effective bits of mantissa (roughly between 7 and 8 decimal digits). 0.00125 is the 24th bit rounding error between 414333.53 and the nearest float representation, since the exact number 414333.53 requires 8 decimal digits. 53 * 0.01 by itself will come out a lot more accurately before you add it to the bigger number and lose precision in the resulting sum. (This shows why addition/subtraction between numbers of very different sizes in not a good thing from a numerical point of view when calculating with floating point arithmetic.)

This is from a classic floating point error resulting from how the number is represented in bits. First, use double instead of float, as it is quite fast to use on modern machines. When the result really really matters, use the decimal type, which is 20x slower but 100% accurate.

You can create NSDate instances form those NSString dates using the +dateWithString: method. It takes strings formatted as YYYY-MM-DD HH:MM:SS ±HHMM, which is what you're dealing with. Once you have two NSDates, you can use the -compare: method to see which one is later in time.

You could try multiplying all your constants by by 100 so you don't have to divide. The division is what's causing the problem because dividing by 100 produces a repeating pattern in binary.

Inaccurate division of doubles (Visual C++ 2008)

I have some code to convert a time value returned from QueryPerformanceCounter to a double value in milliseconds, as this is more convenient to count with.
The function looks like this:
double timeGetExactTime() {
LARGE_INTEGER timerPerformanceCounter, timerPerformanceFrequency;
QueryPerformanceCounter(&timerPerformanceCounter);
if (QueryPerformanceFrequency(&timerPerformanceFrequency)) {
return (double)timerPerformanceCounter.QuadPart / (((double)timerPerformanceFrequency.QuadPart) / 1000.0);
}
return 0.0;
}
The problem I'm having recently (I don't think I had this problem before, and no changes have been made to the code) is that the result is not very accurate. The result does not contain any decimals, but it is even less accurate than 1 millisecond.
When I enter the expression in the debugger, the result is as accurate as I would expect.
I understand that a double cannot hold the accuracy of a 64-bit integer, but at this time, the PerformanceCounter only required 46 bits (and a double should be able to store 52 bits without loss)
Furthermore it seems odd that the debugger would use a different format to do the division.
Here are some results I got. The program was compiled in Debug mode, Floating Point mode in C++ options was set to the default ( Precise (/fp:precise) )
timerPerformanceCounter.QuadPart: 30270310439445
timerPerformanceFrequency.QuadPart: 14318180
double perfCounter = (double)timerPerformanceCounter.QuadPart;
30270310439445.000
double perfFrequency = (((double)timerPerformanceFrequency.QuadPart) / 1000.0);
14318.179687500000
double result = perfCounter / perfFrequency;
2114117248.0000000
return (double)timerPerformanceCounter.QuadPart / (((double)timerPerformanceFrequency.QuadPart) / 1000.0);
2114117248.0000000
Result with same expression in debugger:
2114117188.0396111
Result of perfTimerCount / perfTimerFreq in debugger:
2114117234.1810646
Result of 30270310439445 / 14318180 in calculator:
2114117188.0396111796331656677036
Does anyone know why the accuracy is different in the debugger's Watch compared to the result in my program?
Update: I tried deducting 30270310439445 from timerPerformanceCounter.QuadPart before doing the conversion and division, and it does appear to be accurate in all cases now.
Maybe the reason why I'm only seeing this behavior now might be because my computer's uptime is now 16 days, so the value is larger than I'm used to?
So it does appear to be a division accuracy issue with large numbers, but that still doesn't explain why the division was still correct in the Watch window.
Does it use a higher-precision type than double for it's results?

Adion,
If you don't mind the performance hit, cast your QuadPart numbers to decimal instead of double before performing the division. Then cast the resulting number back to double.
You are correct about the size of the numbers. It throws off the accuracy of the floating point calculations.
For more about this than you probably ever wanted to know, see:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
http://docs.sun.com/source/806-3568/ncg_goldberg.html

Thanks, using decimal would probably be a solution too.
For now I've taken a slightly different approach, which also works well, at least as long as my program doesn't run longer than a week or so without restarting.
I just remember the performance counter of when my program started, and subtract this from the current counter before converting to double and doing the division.
I'm not sure which solution would be fastest, I guess I'd have to benchmark that first.
bool perfTimerInitialized = false;
double timerPerformanceFrequencyDbl;
LARGE_INTEGER timerPerformanceFrequency;
LARGE_INTEGER timerPerformanceCounterStart;
double timeGetExactTime()
{
if (!perfTimerInitialized) {
QueryPerformanceFrequency(&timerPerformanceFrequency);
timerPerformanceFrequencyDbl = ((double)timerPerformanceFrequency.QuadPart) / 1000.0;
QueryPerformanceCounter(&timerPerformanceCounterStart);
perfTimerInitialized = true;
}
LARGE_INTEGER timerPerformanceCounter;
if (QueryPerformanceCounter(&timerPerformanceCounter)) {
timerPerformanceCounter.QuadPart -= timerPerformanceCounterStart.QuadPart;
return ((double)timerPerformanceCounter.QuadPart) / timerPerformanceFrequencyDbl;
}
return (double)timeGetTime();
}