Why do strange timing results appear in spite of binding a thread to a specific CPU core?

I'm doing some experiments with low-latency programming. I want to eliminate context switching and be able to reliably measure latency without affecting performance too much.
To begin with, I wrote a program that requests the time in a loop 1M times and then prints statistics (code below), since I wanted to know how much time the call to the timer takes. Surprisingly, the output is the following (in microseconds):
Mean: 0.59, Min: 0.49, Max: 25.77
Mean: 0.59, Min: 0.49, Max: 11.73
Mean: 0.59, Min: 0.42, Max: 14.11
Mean: 0.59, Min: 0.42, Max: 13.34
Mean: 0.59, Min: 0.49, Max: 11.45
Mean: 0.59, Min: 0.42, Max: 14.25
Mean: 0.59, Min: 0.49, Max: 11.80
Mean: 0.59, Min: 0.42, Max: 12.08
Mean: 0.59, Min: 0.49, Max: 21.02
Mean: 0.59, Min: 0.42, Max: 12.15
As you can see, although the average time is less than one microsecond,
there are spikes of up to 20 microseconds. That's despite the fact that the code runs on a dedicated core (affinity set to a specific core, while the affinity of the init process is set to a group of other cores), and that hyper-threading is disabled on the machine. I tried it with multiple kernel versions, including preemptive and RT, and the results are essentially the same.
Can you explain the huge difference between mean and max? Is the problem in the calls to the timer, or with process isolation?
I also tried this with other clocks -- CLOCK_THREAD_CPUTIME_ID, CLOCK_MONOTONIC, CLOCK_PROCESS_CPUTIME_ID -- and the pattern observed was the same.
#include <time.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))

uint64_t
time_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    /* widen tv_sec before multiplying to avoid overflow on 32-bit time_t */
    return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
}

void
set_affinity(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set))
    {
        perror("sched_setaffinity");
    }
}

#define NUM 1000000
#define LOOPS 10

int
main(int argc, char **argv)
{
    set_affinity(3);
    for (int loop = 0; loop < LOOPS; ++loop)
    {
        uint64_t t_0 = time_now();
        uint64_t sum_val = 0;
        uint64_t max_val = 0;
        uint64_t min_val = (uint64_t)-1;
        for (int k = 0; k < NUM; ++k)
        {
            uint64_t t_1 = time_now();
            uint64_t t_diff = t_1 - t_0;
            sum_val += t_diff;
            min_val = min(t_diff, min_val);
            max_val = max(t_diff, max_val);
            t_0 = t_1;
        }
        printf("Mean: %.2f, Min: %.2f, Max: %.2f\n",
               ((double)sum_val) / NUM / 1000,
               ((double)min_val) / 1000,
               ((double)max_val) / 1000);
    }
    return 0;
}

Two sources of unpredictability are going to be device interrupts and timers. While you have set the affinity of userspace processes, interrupts from devices will still occur and will affect your process. The kernel also uses timers that fire at periodic ticks so it can keep track of time. I would say these are going to be your two main sources of unpredictability. Inter-Processor Interrupts (IPIs), used by cores to signal each other, are going to be another, but probably not as significant as the first two.
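One way to tell whether the spikes come from being scheduled out (as opposed to interrupt handlers running on the core, which never show up as context switches) is to read the involuntary context-switch counter around the measurement loop. A minimal sketch, assuming Linux (RUSAGE_THREAD is Linux-specific):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>

/* Minimal sketch: if ru_nivcsw stays at zero while the max latency still
 * spikes, the jitter is coming from interrupts/timer ticks rather than
 * from the thread being scheduled out. */
int main(void)
{
    struct rusage before, after;
    getrusage(RUSAGE_THREAD, &before);

    /* ... run the 1M-iteration timing loop from the question here ... */

    getrusage(RUSAGE_THREAD, &after);
    printf("involuntary context switches: %ld\n",
           after.ru_nivcsw - before.ru_nivcsw);
    return 0;
}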

Related

DPDK implementation of MPSC ring buffer

While going through the implementation of the DPDK MPSC (multi-producer & single-consumer) ring buffer API, I found the code that moves the producer's head to insert new elements into the ring buffer. The function is as follows:
static __rte_always_inline unsigned int
__rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
        unsigned int n, enum rte_ring_queue_behavior behavior,
        uint32_t *old_head, uint32_t *new_head,
        uint32_t *free_entries)
{
    const uint32_t capacity = r->capacity;
    uint32_t cons_tail;
    unsigned int max = n;
    int success;

    *old_head = __atomic_load_n(&r->prod.head, __ATOMIC_RELAXED);
    do {
        /* Reset n to the initial burst count */
        n = max;

        /* Ensure the head is read before tail */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);

        /* load-acquire synchronize with store-release of ht->tail
         * in update_tail.
         */
        cons_tail = __atomic_load_n(&r->cons.tail,
                        __ATOMIC_ACQUIRE);

        /* The subtraction is done between two unsigned 32bits value
         * (the result is always modulo 32 bits even if we have
         * *old_head > cons_tail). So 'free_entries' is always between 0
         * and capacity (which is < size).
         */
        *free_entries = (capacity + cons_tail - *old_head);

        /* check that we have enough room in ring */
        if (unlikely(n > *free_entries))
            n = (behavior == RTE_RING_QUEUE_FIXED) ?
                    0 : *free_entries;

        if (n == 0)
            return 0;

        *new_head = *old_head + n;
        if (is_sp)
            r->prod.head = *new_head, success = 1;
        else
            /* on failure, *old_head is updated */
            success = __atomic_compare_exchange_n(&r->prod.head,
                    old_head, *new_head,
                    0, __ATOMIC_RELAXED,
                    __ATOMIC_RELAXED);
    } while (unlikely(success == 0));
    return n;
}
The load and compare-exchange of the producer's head are done with __ATOMIC_RELAXED memory ordering. Isn't this a problem when multiple producers from different threads produce to the queue? Or am I missing something?
https://doc.dpdk.org/guides/prog_guide/ring_lib.html describes the basic mechanism that DPDK uses for implementing the Ring buffer.
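For what it's worth, the guarantee the retry loop leans on is a property of the builtin itself: on failure, __atomic_compare_exchange_n stores the freshly observed value back into *old_head, so every retry recomputes free_entries and the new head from current data. A minimal stand-alone sketch of that idiom (hypothetical names, not DPDK code):

#include <stdint.h>

/* On CAS failure, __atomic_compare_exchange_n writes the current value of
 * *head back into 'old', so each retry starts from the latest observation. */
static uint32_t claim_slots(uint32_t *head, uint32_t n)
{
    uint32_t old = __atomic_load_n(head, __ATOMIC_RELAXED);
    uint32_t new_head;
    do {
        new_head = old + n;   /* recomputed on every retry from fresh 'old' */
    } while (!__atomic_compare_exchange_n(head, &old, new_head,
                                          0 /* strong */,
                                          __ATOMIC_RELAXED,
                                          __ATOMIC_RELAXED));
    return old;               /* start index of the claimed slots */
}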

Can't overwrite a variable after it is assigned infinity (ANSI C)

I have a double array ypos[], and in some cases the value -1.#INF00 gets written into it instead of a regular double. I need to overwrite these values with 0.0, but I can't simply assign zero; the application crashes every time. Is there a way to solve this problem?
Please see http://codepad.org/KKYLhbkh for a simple example of how things "should work". Since you have not provided the code that doesn't work, we're guessing here.
#include <stdio.h>
#include <math.h>   /* for isinf() */

#ifndef INFINITY    /* C99 math.h provides INFINITY; define it otherwise */
#define INFINITY (1.0 / 0.0)
#endif

int main(void) {
    double a[5] = {INFINITY, 1, 2, INFINITY, 4};
    int ii;
    for (ii = 0; ii < 5; ii++) {
        printf("before: a[%d] is %lf\n", ii, a[ii]);
        if (isinf(a[ii])) a[ii] = 0.0;
        printf("after: a[%d] is %lf\n", ii, a[ii]);
    }
    return 0;
}
As was pointed out in @doynax's comment, you might want to disable floating-point exceptions to stop them from causing your program to keel over.
Edit: if your problem is caused by having taken the logarithm of a number outside log's domain (i.e. log(x) with x <= 0), the following code might help:
#include <stdio.h>
#include <signal.h>
#include <math.h>

#ifndef INFINITY
#define INFINITY (1.0 / 0.0)
#endif

int main(void) {
    double a[5] = {INFINITY, 1, 2, INFINITY, 4};
    int ii;
    signal(SIGFPE, SIG_IGN);  /* ignore floating-point exceptions */
    a[2] = log(-1.0);         /* yields NaN */
    for (ii = 0; ii < 5; ii++) {
        printf("before: a[%d] is %lf\n", ii, a[ii]);
        if (isinf(a[ii]) || isnan(a[ii])) a[ii] = 0.0;
        printf("after: a[%d] is %lf\n", ii, a[ii]);
    }
    return 0;
}
You get a different value for log(0.0) (namely -Inf) and for log(-1.0) (namely NaN). The above code shows how to deal with either of these.
A guess: your problem is somewhere else. Can you write 0.0 to the element before the INF is written?
Sorry guys, I'd been working on this problem for about two hours before asking here, but I've ended up with a solution of my own: comparing the variable's value against -DBL_MAX. Thank you all for trying.
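For reference, a minimal sketch of the comparison-based fix described above (the helper name is hypothetical). Negative infinity is the only double that compares less than -DBL_MAX, so it can be detected without isinf():

#include <float.h>

/* Hypothetical sketch: -inf is the only double below -DBL_MAX,
 * so it can be zeroed out without calling isinf(). */
static void zero_out_neg_inf(double *ypos, int n)
{
    for (int i = 0; i < n; i++) {
        if (ypos[i] < -DBL_MAX)
            ypos[i] = 0.0;
    }
}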

Why is the call to array_view::synchronize() so slow?

I've started experimenting with C++ AMP. I've created a simple test app just to see what it can do, but the results are quite surprising to me. Consider the following code:
#include <amp.h>
#include <cstdint>  /* uint32_t */
#include <cstdio>   /* printf */
#include "Timer.h"

using namespace concurrency;

int main( int argc, char* argv[] )
{
    uint32_t u32Threads = 16;
    uint32_t u32DataRank = u32Threads * 256;
    uint32_t u32DataSize = (u32DataRank * u32DataRank) / u32Threads;

    uint32_t* pu32Data = new (std::nothrow) uint32_t[ u32DataRank * u32DataRank ];
    for ( uint32_t i = 0; i < u32DataRank * u32DataRank; i++ )
    {
        pu32Data[i] = 1;
    }
    uint32_t* pu32Sum = new (std::nothrow) uint32_t[ u32Threads ];

    Timer tmr;

    tmr.Start();
    array< uint32_t, 1 > source( u32DataRank * u32DataRank, pu32Data );
    array_view< uint32_t, 1 > sum( u32Threads, pu32Sum );
    printf( "Array<> deep copy time: %.6f\n", tmr.Stop() );

    tmr.Start();
    parallel_for_each(
        sum.extent,
        [=, &source](index<1> idx) restrict(amp)
        {
            uint32_t u32Sum = 0;
            uint32_t u32Start = idx[0] * u32DataSize;
            uint32_t u32End = (idx[0] * u32DataSize) + u32DataSize;
            for ( uint32_t i = u32Start; i < u32End; i++ )
            {
                u32Sum += source[i];
            }
            sum[idx] = u32Sum;
        }
    );
    double dDuration = tmr.Stop();
    printf( "gpu computation time: %.6f\n", dDuration );

    tmr.Start();
    sum.synchronize();
    dDuration = tmr.Stop();
    printf( "synchronize time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    tmr.Start();
    for ( uint32_t idx = 0; idx < u32Threads; idx++ )
    {
        uint32_t u32Sum = 0;
        for ( uint32_t i = 0; i < u32DataSize; i++ )
        {
            u32Sum += pu32Data[(idx * u32DataSize) + i];
        }
        pu32Sum[idx] = u32Sum;
    }
    dDuration = tmr.Stop();
    printf( "cpu computation time: %.6f\n", dDuration );
    printf( "first and second row sum = %u, %u\n", pu32Sum[0], pu32Sum[1] );

    delete [] pu32Sum;
    delete [] pu32Data;
    return 0;
}
Note that Timer is a simple timing class using QueryPerformanceCounter. Anyway, the output of the code is the following:
Array<> deep copy time: 0.089784
gpu computation time: 0.000449
synchronize time: 8.671081
first and second row sum = 1048576, 1048576
cpu computation time: 0.006647
first and second row sum = 1048576, 1048576
Why is the call to synchronize() taking so long? Is there a way to get around this? Other than that, the performance of the computation is amazing, but the synchronize() overhead makes it unusable for me.
It is also possible that I am doing something terribly wrong; if so, please tell me. Thanks in advance.
Function synchronize() is probably taking so long because it is waiting for the actual kernel to complete its work.
From parallel_for_each from amp.h:
Please note that the parallel_for_each executes as if synchronous to the calling code, but in reality, it is asynchronous. I.e. once the parallel_for_each call is made and the kernel has been passed to the runtime, the [code after the parallel_for_each] continues to execute immediately by the CPU thread, while in parallel the kernel is executed by the GPU threads.
So, measuring the time spent in parallel_for_each is not particularly meaningful.
EDIT: The way the algorithm is written, it won't benefit much from GPU acceleration. The read of source[i] is non-coalesced, and so it will be almost 16x slower than a coalesced read. It is possible to coalesce the read by using shared memory, but it is not quite trivial. I'd recommend reading up on GPU programming.
If you just want a simple example that demonstrates the utility of C++ AMP, try matrix multiplication.
Of course, the performance you'll observe also greatly depends on the model of your GPU hardware.
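Putting this together, a minimal sketch (untested, reusing the question's Timer, source, sum, and u32DataSize) that times the actual GPU work keeps synchronize() inside the timed region, since parallel_for_each only queues the kernel:

tmr.Start();
parallel_for_each(
    sum.extent,
    [=, &source](index<1> idx) restrict(amp)
    {
        uint32_t u32Sum = 0;
        for ( uint32_t i = idx[0] * u32DataSize; i < (idx[0] + 1) * u32DataSize; i++ )
        {
            u32Sum += source[i];
        }
        sum[idx] = u32Sum;
    }
);
sum.synchronize(); // blocks until the kernel has finished and results are copied back
printf( "kernel + synchronize time: %.6f\n", tmr.Stop() );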
In addition to Igor's response on your specific algorithm, please note that there are multiple incorrect aspects to the way you are measuring C++ AMP performance in general (no exclusion of runtime initialization, no discarding of the initial JIT, no warm-up of data, and the already-pointed-out assumption that parallel_for_each is synchronous), so please follow our guidelines here:
http://blogs.msdn.com/b/nativeconcurrency/archive/2011/12/28/how-to-measure-the-performance-of-c-amp-algorithms.aspx

Fastest way of bitwise AND between two arrays on iPhone?

I have two image blocks stored as 1D arrays, and I have to perform the following bitwise AND operations among their elements.
int compare(unsigned char *a, int a_pitch,
            unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
    int overlap = 0;
    for (int y = 0; y < a_leny; y++)
        for (int x = 0; x < a_lenx; x++)
        {
            if (a[x + y * a_pitch] & b[x + y * b_pitch])
                overlap++;
        }
    return overlap;
}
Actually, I have to do this job about 220,000 times, so it becomes very slow on iPhone devices.
How can I accelerate this job on the iPhone?
I heard that NEON could be useful, but I'm not really familiar with it. In addition, it seems that NEON doesn't have a bitwise AND...
Option 1 - Work in the native width of your platform (it's faster to fetch 32-bits into a register and then do operations on that register than it is to fetch and compare data one byte at a time):
#include <stdint.h>

int compare(unsigned char *a, int a_pitch,
            unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
    int overlap = 0;
    uint32_t* a_int = (uint32_t*)a;
    uint32_t* b_int = (uint32_t*)b;
    /* four bytes per 32-bit word along x; only the x-dimension
       and the pitches shrink by a factor of 4 */
    int a_lenx_int = a_lenx / 4;
    int a_pitch_int = a_pitch / 4;
    int b_pitch_int = b_pitch / 4;
    for (int y = 0; y < a_leny; y++)
        for (int x = 0; x < a_lenx_int; x++)
        {
            uint32_t aVal = a_int[x + y * a_pitch_int];
            uint32_t bVal = b_int[x + y * b_pitch_int];
            uint32_t andVal = aVal & bVal;
            if (andVal & 0x000000FF)
                overlap++;
            if (andVal & 0x0000FF00)
                overlap++;
            if (andVal & 0x00FF0000)
                overlap++;
            if (andVal & 0xFF000000)
                overlap++;
        }
    return overlap;
}
Option 2 - Use a heuristic to get an approximate result using fewer calculations (a good approach if the absolute difference between 101 overlaps and 100 overlaps is not important to your application):
int compare(unsigned char *a, int a_pitch,
            unsigned char *b, int b_pitch, int a_lenx, int a_leny)
{
    int overlap = 0;
    /* we compare 1% of all the pixels, and use that as the result */
    for (int y = 0; y < a_leny; y += 10)
        for (int x = 0; x < a_lenx; x += 10)
        {
            if (a[x + y * a_pitch] & b[x + y * b_pitch])
                overlap++;
        }
    return overlap * 100;
}
Option 3 - Rewrite your function in inline assembly code. You're on your own for this one.
Your code is Rambo for the CPU - its worst nightmare:
- Byte access. Like aroth mentioned, ARM is VERY slow reading bytes from memory.
- Random access. Two absolutely unnecessary multiply/add operations on top of the already steep performance penalty of byte access.
Simply put, everything that can be wrong is wrong.
Don't call me rude. Let me be your angel instead.
First, I'll provide you a working NEON version. Then an optimized C version showing you exactly what you did wrong.
Just give me some time. I have to go to bed right now, and I have an important meeting tomorrow.
Why don't you learn ARM assembly? It's much easier and more useful than x86 assembly.
It will also improve your C programming capabilities by a huge step.
Strongly recommended
cya
==============================================================================
Ok, here is an optimized version written in C with ARM assembly in mind.
Please note that both the pitches AND a_lenx have to be multiples of 4. Otherwise, it won't work properly.
There isn't much room left for optimizations with ARM assembly upon this version. (NEON is a different story - coming soon)
Take a careful look at how the variable declarations, the loop, the memory accesses, and the AND operations are handled.
And make sure that this function runs in ARM mode and not Thumb for best results.
unsigned int compare(unsigned int *a, unsigned int a_pitch,
                     unsigned int *b, unsigned int b_pitch,
                     unsigned int a_lenx, unsigned int a_leny)
{
    unsigned int overlap = 0;
    unsigned int a_gap = (a_pitch - a_lenx) >> 2;
    unsigned int b_gap = (b_pitch - a_lenx) >> 2;
    unsigned int aval, bval, xcount;

    do
    {
        xcount = (a_lenx >> 2);
        do
        {
            aval = *a++;
            // ldr aval, [a], #4
            bval = *b++;
            // ldr bval, [b], #4
            aval &= bval;
            // and aval, aval, bval
            if (aval & 0x000000ff) overlap += 1;
            // tst aval, #0x000000ff
            // addne overlap, overlap, #1
            if (aval & 0x0000ff00) overlap += 1;
            // tst aval, #0x0000ff00
            // addne overlap, overlap, #1
            if (aval & 0x00ff0000) overlap += 1;
            // tst aval, #0x00ff0000
            // addne overlap, overlap, #1
            if (aval & 0xff000000) overlap += 1;
            // tst aval, #0xff000000
            // addne overlap, overlap, #1
        } while (--xcount);
        a += a_gap;
        b += b_gap;
    } while (--a_leny);

    return overlap;
}
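And here is a sketch of what the promised NEON version might look like (untested; compare_neon is a hypothetical name, and it assumes contiguous data, i.e. pitch == lenx, with a total length that is a multiple of 16). vtstq_u8 sets a byte lane to all-ones where the bitwise AND of its operands is non-zero, which is exactly the per-byte test from the original loop:

#include <arm_neon.h>
#include <stdint.h>

/* Sketch (untested): counts bytes where (a[i] & b[i]) != 0, 16 bytes
 * per iteration. Assumes pitch == lenx and len is a multiple of 16. */
unsigned int compare_neon(const uint8_t *a, const uint8_t *b, unsigned int len)
{
    uint32x4_t total = vdupq_n_u32(0);
    for (unsigned int i = 0; i < len; i += 16)
    {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        /* lane = 0xFF where (va & vb) != 0, else 0x00 */
        uint8x16_t hits = vtstq_u8(va, vb);
        /* turn 0xFF lanes into 1s, then widen-sum into four u32 lanes */
        uint8x16_t ones = vandq_u8(hits, vdupq_n_u8(1));
        total = vaddq_u32(total, vpaddlq_u16(vpaddlq_u8(ones)));
    }
    /* horizontal sum of the four 32-bit lanes */
    uint32_t lanes[4];
    vst1q_u32(lanes, total);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}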
First of all, why the double loop? You can do it with a single loop and a couple of pointers.
Also, you don't need to calculate x+y*pitch for every single pixel; just increment two pointers by one. Incrementing by one is a lot faster than x+y*pitch.
Why exactly do you need to perform this operation? I would make sure there are no high-level optimizations/changes available before looking into a low-level solution like NEON.

"Nearly divisible"

I want to check if a floating point value is "nearly" a multiple of 32. E.g. 64.1 is "nearly" divisible by 32, and so is 63.9.
Right now I'm doing this:
#define NEARLY_DIVISIBLE 0.1f

float offset = fmodf( val, 32.0f );
if ( offset < NEARLY_DIVISIBLE )
{
    // it's near from above
}
// if it was 63.9, the remainder would be large, so add a bit and check again
else if ( fmodf( val + 2 * NEARLY_DIVISIBLE, 32.0f ) < NEARLY_DIVISIBLE )
{
    // it's near from below
}
Got a better way to do this?
Well, you could cut out the second fmodf by just subtracting 32 one more time to get the remainder from below:

if ( offset < NEARLY_DIVISIBLE )
{
    // it's near from above
}
else if ( offset - 32.0f > -NEARLY_DIVISIBLE )
{
    // it's near from below
}
In a standard-compliant C implementation, one would use the remainder function instead of fmod:
#define NEARLY_DIVISIBLE 0.1f

float offset = remainderf(val, 32.0f);
if (fabsf(offset) < NEARLY_DIVISIBLE) {
    // Stuff
}
If one is on a non-compliant platform (MSVC++, for example), then remainder isn't available, sadly. I think that fastmultiplication's answer is quite reasonable in that case.
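To see why a single fabsf test suffices: remainderf rounds the quotient to the nearest integer rather than truncating, so the result is signed and lies in [-16, 16] for a divisor of 32. A small check (values approximate due to float rounding):

#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%f\n", remainderf(64.1f, 32.0f)); /* ~  0.1 (just above 64) */
    printf("%f\n", remainderf(63.9f, 32.0f)); /* ~ -0.1 (just below 64) */
    return 0;
}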
You mention that you have to test near-divisibility by 32. The following ought to hold true for near-divisibility testing against powers of two:
#define THRESHOLD 0.11

int nearly_divisible(float f) {
    register long l1, l2;
    l1 = (long) (f + THRESHOLD);
    l2 = (long) f;
    return !(l1 & 31) && (l2 & 31 ? 1 : f - (float) l2 <= THRESHOLD);
}
What we're doing is coercing the float, and the float + THRESHOLD, to long:

    f       (long) f    (long) (f + THRESHOLD)
    63.9    63          64
    64      64          64
    64.1    64          64
Now we test whether (long) f is divisible by 32. Just check the lower five bits; if they are all zero, the number is divisible by 32. This leads to a series of false positives: 64.2 to 64.8, when converted to long, are also 64, and would pass the first test. So, we check if the difference between the truncated form and f is less than or equal to THRESHOLD.
This, too, has a problem: f - (float) l2 <= THRESHOLD would hold true for 64 and 64.1, but not for 63.9. So, we add an exception for numbers just below a multiple (which, when incremented by THRESHOLD and then coerced to long -- note that this test has to be inclusive of the first test -- become divisible by 32), by requiring that the lower 5 bits of l2 are not zero. This holds true for 63 (63 == 111111 in binary, whose lower five bits are 11111).
A combination of these three tests would indicate whether the number is divisible by 32 or not. I hope this is clear, please forgive my weird English.
I just tested the extensibility to other powers of two -- the following program prints, for numbers between 383.5 and 388.4, whether each is nearly divisible by 128.
#include <stdio.h>

#define THRESHOLD 0.11

int nearly_divisible(float);

int main(void) {
    int i;
    float f = 383.5;
    for (i = 0; i < 50; i++) {
        printf("%6.1f %s\n", f, (nearly_divisible(f) ? "true" : "false"));
        f += 0.1;
    }
    return 0;
}

int nearly_divisible(float f) {
    register long l1, l2;
    l1 = (long) (f + THRESHOLD);
    l2 = (long) f;
    return !(l1 & 127) && (l2 & 127 ? 1 : f - (float) l2 <= THRESHOLD);
}
Seems to work well so far!
I think this is right (note that C's % operator is not defined for floating-point types, so fmodf is needed):

#include <math.h>
#include <stdbool.h>

bool nearlyDivisible(float num, float div) {
    float f = fmodf(num, div);  /* '%' does not work on floats in C */
    if (f > div / 2.0f) {
        f = f - div;            /* fold remainder into [-div/2, div/2] */
    }
    f = fabsf(f);
    return f < 0.1f;
}
From what I gather, you want to detect whether a number is nearly divisible by another, right?
I'd do something like this:

#include <math.h>
#include <stdbool.h>

#define NEARLY_DIVISIBLE 0.1f

bool IsNearlyDivisible(float n1, float n2)
{
    float remainder = fmodf(n1, n2) / n2;   /* fraction of a period */
    remainder = remainder < 0.0f ? -remainder : remainder;
    remainder = remainder > 0.5f ? 1.0f - remainder : remainder;
    return (remainder <= NEARLY_DIVISIBLE);
}
Why wouldn't you just divide by 32, round, and take the difference between the rounded number and the actual result?
Something like (forgive the untested pseudo-code, no time to look it up):

#define NEARLY_DIVISIBLE 0.1f

float result = val / 32.0f;
float nearest_int = nearbyintf(result);
float difference = fabsf(result - nearest_int);
if ( difference < NEARLY_DIVISIBLE )
{
    // It's nearly divisible
}

If you still wanted to check from above and below separately, you could remove the fabsf and check whether the difference is > 0 or < 0.
This is without using fmodf twice.

#include <stdio.h>
#include <math.h>

#define NEARLY_DIVISIBLE 0.1f
#define DIVISOR 32.0f
#define ARRAY_SIZE 4

int main(void)
{
    double test_var1[ARRAY_SIZE] = {63.9, 64.1, 65, 63.8};
    int i;
    double rest;
    for (i = 0; i < ARRAY_SIZE; i++)
    {
        rest = fmod(test_var1[i], DIVISOR);
        if (rest < NEARLY_DIVISIBLE)
        {
            printf("Number %f is at most %f above a multiple of the divisor %f\n",
                   test_var1[i], NEARLY_DIVISIBLE, DIVISOR);
        }
        else if (-(rest - DIVISOR) < NEARLY_DIVISIBLE)
        {
            printf("Number %f is at most %f below a multiple of the divisor %f\n",
                   test_var1[i], NEARLY_DIVISIBLE, DIVISOR);
        }
    }
    return 0;
}