My kernel is pretty simple. It checks whether the codes are valid and then stores only the unique codes at the positions given by the prefix-scan output:
__kernel void moveValid(__global int* sortCode, __global int* mark, __global int* processorOffsets, __global int* uniqueCode, __global int* numPoints, __global int* pointIndex)
{
    int ig = get_global_id(0);
    int m = mark[ig];
    int j = processorOffsets[ig];
    atomic_inc(&numPoints[j - 1]);
    // select
    if (m == true)
    {
        uniqueCode[j] = sortCode[ig];
        pointIndex[j] = ig;
    }
    barrier(CLK_GLOBAL_MEM_FENCE);
}
It seems the kernel is really slow. Is it due to the if statement? Can anyone give any tips on how the kernel can be improved? Also, can select be used in this scenario?
So, without looking too deeply at your code, I can give the following feedback about its speed. I am assuming that you are using a GPU as your device. If you are using a CPU as your device, some of the information may still apply.
atomic_inc(&numPoints[j-1]);
Atomic increments are PAINFULLY slow to global memory on most devices. This is because the data has to be committed to global memory and cannot be cached locally.
barrier(CLK_GLOBAL_MEM_FENCE);
This barrier ensures that all work-items in the work-group have reached this point before continuing execution. Why do you need it in your code? Since nothing is left to do after it, there is no reason not to let your work-items finish. This is also a big performance hit.
if(m == true)
This is not actually the worst if statement I've seen, because it only has one branch. It will still slow your code down, but not as substantially as the other issues. On most GPU architectures, work-items in the same wavefront/warp that diverge on the condition are executed serially before continuing.
Overall, in this code you are performing several global memory accesses and an atomic operation, with no math operations to hide the latency. GPUs are the worst type of device for this kind of algorithm, because global memory accesses are exceptionally slow, especially if they are not coalesced. Could you consider moving some of your arrays to local memory instead? A sketch of that idea follows.
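For example, here is a minimal, hypothetical sketch (not your kernel) of one way to cut the global atomic traffic: accumulate the count for each work-group in local memory and issue a single global atomic per group. The kernel and argument names are made up for illustration.

// Hypothetical sketch: count valid entries per work-group in local memory,
// then issue one global atomic per group instead of one per work-item.
__kernel void countValid(__global const int* mark, __global int* groupCounts)
{
    __local int localCount;

    if (get_local_id(0) == 0)
        localCount = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    if (mark[get_global_id(0)])
        atomic_inc(&localCount);          // local atomics avoid the round trip to global memory
    barrier(CLK_LOCAL_MEM_FENCE);

    if (get_local_id(0) == 0)
        atomic_add(&groupCounts[get_group_id(0)], localCount);  // one global atomic per group
}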
I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:
struct protoarray {
size_t dim1;
size_t dim2;
size_t dim3;
float* data;
};
I create two variables of type protoarray, dynamically allocate space for data via malloc and cudaMalloc on the host and device side respectively, and update dim1, dim2 and dim3 to reflect the size of the array I want this struct to represent. I read in this thread that the struct should be passed by copy. So this is what I do in my kernel:
__global__ void kernel(curandState_t *state, protoarray arr_device){
    const size_t dim1 = arr_device.dim1;
    const size_t dim2 = arr_device.dim2;
    for(size_t j(0); j < dim2; j++){
        for(size_t i(0); i < dim1; i++){
            // Do something
        }
    }
}
The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel and setting a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, nowhere large enough to cause overflow, but this value copies correctly into dim1 as 2, which means that the for loops execute as I intended them to. As a side question, is using size_t, which is essentially unsigned long long int, bad practice, seeing as GPUs are made of 32-bit cores?
Generally, how safe is it to pass structs and classes into kernels as arguments? Is it bad practice that should be avoided at all costs? I imagine that passing pointers to classes to kernels is difficult if they contain members which point to dynamically allocated memory, and that they should be very lightweight if I want to pass them by value.
This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.
The struct is passed by copy, so all its contents are copied into shared memory of each block.
Incorrect. Kernel arguments are stored in constant memory, which is device-global and not block-specific. They are not stored in shared memory (which is block-specific).
When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).
Generally, how safe is it to pass struct and class into kernels as arguments
My personal rule of thumb on this matter is: If the struct/class...
is trivially-copyable; and
all members of the struct/class are defined for both the host and the device side, or at least - designed with GPU use in mind;
then it should be safe to pass to a kernel.
passing structs and classes into kernels as arguments - is it bad practice that should be avoided at all costs?
No. But remember that most C++ libraries only provide host-side code and were not written with GPU use in mind. So I'd be wary of using non-trivial classes without a lot of scrutiny.
I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory
Yes, this can be problematic. However - if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged() - then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed. See:
Unified Memory in CUDA for beginners
Beyond GPU Memory Limits with Unified Memory on Pascal
and that they should be very lightweight if I want to pass [objects to kernels] by value.
Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.
Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
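To make this concrete, here is a small, hypothetical sketch (not the code from the question) of a trivially-copyable struct whose data pointer comes from cudaMallocManaged, passed to the kernel by value:

// Sketch: the struct itself is copied by value; the buffer it points to is shared.
struct protoarray {                 // trivially copyable: plain members, no constructors
    size_t dim1, dim2, dim3;
    float* data;                    // device-visible (e.g. managed) memory
};

__global__ void scale(protoarray arr)
{
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    size_t total = arr.dim1 * arr.dim2 * arr.dim3;
    if (idx < total)
        arr.data[idx] *= 2.0f;      // writes go to the shared buffer, not the copy
}

// Host-side usage sketch:
//   protoarray arr{2, 3, 4, nullptr};
//   cudaMallocManaged(&arr.data, arr.dim1 * arr.dim2 * arr.dim3 * sizeof(float));
//   scale<<<1, 32>>>(arr);
//   cudaDeviceSynchronize();
//   cudaFree(arr.data);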
I have posted here a function that I use to get the Accelerate FFT:
Setup the accelerator framework for fft on the iPhone
It is working great.
The thing is that I use it in real time, so for each new audio buffer I call this function with the new buffer.
I get a memory warning, probably because of these lines:
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
Questions:
Do I have another way, other than calling malloc on them again and again? (Don't forget I have to feed it a new buffer many times per second.)
How exactly do I free them? (code lines, please)
Could it be caused by the fact that the FFT is heavy on the system?
Any way to get rid of this warning will help me a lot.
Thanks a lot.
These things should be done once, at the start of your program:
Allocate memory for buffers, using code like float *buffer = malloc(NumberOfElements * sizeof *buffer);.
Create an FFT setup, using code like FFTSetup setup = vDSP_create_fftsetup(log2n, FFT_RADIX2);.
Also test the return values. If malloc or vDSP_create_fftsetup returns 0, write an error message and exit the program or take other appropriate error-handling action.
These things should be done once, at the end of your program:
Destroy the FFT setup, using code like vDSP_destroy_fftsetup(setup);.
Release the memory for the buffers, using code like free(buffer);.
In the middle of your program, while you are processing samples, the code should use the existing buffers and setup. So the variables pointing to the buffers and the setup must be visible to that code. You can either pass them in as parameters (perhaps grouped together in a struct) or make them global (which should be only a temporary solution for small programs).
Your program should be arranged so that it is never necessary to allocate memory or create an FFT setup while samples are being processed.
All memory that is allocated should be freed eventually.
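As a rough sketch of that structure (plain C, assuming a fixed buffer length n is known up front; the function names and globals here are made up for illustration):

#include <Accelerate/Accelerate.h>
#include <math.h>
#include <stdlib.h>

static FFTSetup setup;            // created once, reused for every audio buffer
static DSPSplitComplex A;         // scratch split-complex buffers, allocated once
static vDSP_Length log2n, nOver2;

void fftInit(vDSP_Length n)       // call once at startup
{
    log2n   = (vDSP_Length)lrint(log2((double)n));
    nOver2  = n / 2;
    A.realp = malloc(nOver2 * sizeof *A.realp);
    A.imagp = malloc(nOver2 * sizeof *A.imagp);
    setup   = vDSP_create_fftsetup(log2n, FFT_RADIX2);
    // check A.realp, A.imagp and setup for 0 here and bail out on failure
}

void fftProcess(float *buffer)    // call for every audio buffer; no allocation here
{
    vDSP_ctoz((DSPComplex *)buffer, 2, &A, 1, nOver2);
    vDSP_fft_zrip(setup, &A, 1, log2n, FFT_FORWARD);
}

void fftShutdown(void)            // call once at shutdown
{
    vDSP_destroy_fftsetup(setup);
    free(A.realp);
    free(A.imagp);
}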
If you are malloc'ing and never freeing, you will run out of memory. Make sure to 'free' your memory using free().
Note: free() doesn't actually erase any memory. It simply tells the system that we're done with the memory and it's available for other allocations.
// Example:
// allocating memory
int *intpointer;
intpointer = malloc(sizeof(int));
// ... do stuff...
// 'Freeing' it when you are done
free(intpointer);
I have several threads executing concurrently, each checking the value of a field in its own object. The field is set by the launching thread like this:
for (i = 0; i < ThreadCount; i++)
{
    ThreadContext[i].MyField = 1;
}
Within each thread, I then check the value of this field:
if (MyField == 1)
{
    ... // do something
}
However, I noticed that on a 4-core CPU, some of the (4) running threads need several milliseconds or even longer to see the changed MyField. MyField is a single char field. What appears to be happening is that when the memory bus is maxed out by the first thread that detects the change, all other threads may stall for almost the entire duration of the first thread's run (assuming there is enough memory pressure). Only when the first thread eases off memory (and works more in registers) do the other threads also get to see the new value.
I checked the asm and there is no compiler optimization in the way here; the reads and writes go directly to memory. How can this be fixed?
Thanks!
Jam
I got feedback from Intel: Yes, that's how it works (no easy fix).
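For reference, the portable way to publish such a flag at the language level is with C11 atomics. This pins down the visibility and ordering guarantees the compiler must provide, although it does not remove the hardware-level latency described above. A minimal sketch with made-up names:

#include <stdatomic.h>

struct thread_context {
    atomic_char my_field;        /* the published flag */
    /* ... other per-thread data ... */
};

/* launch thread: publish the value with release semantics */
void publish(struct thread_context *ctx)
{
    atomic_store_explicit(&ctx->my_field, 1, memory_order_release);
}

/* worker thread: poll it with acquire semantics */
int is_set(struct thread_context *ctx)
{
    return atomic_load_explicit(&ctx->my_field, memory_order_acquire) == 1;
}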
For a normal app, you'd never want to do this.
But ... I'm making an educational app to show people exactly what happens with the different threading models on different iPhone hardware and OS level. OS 4 has radically changed the different models (IME: lots of existing code DOES NOT WORK when run on OS 4).
I'm writing an interactive test app that lets you fire off threads for different models (selector main thread, selector background, nsoperationqueue, etc), and see what happens to the GUI + main app while it happens.
But one of the common use-cases I want to reproduce is: "a thread that does a background download, then does a CPU-intensive parse of the results". We see this a lot in real-world apps.
It's not entirely trivial; the manner of "being busy" matters.
So ... how can I simulate this? I'm looking for something that is guaranteed not to be thrown away by an optimizing compiler (either now, or with a better compiler), and is enough to force a thread to run at max CPU for about 5 seconds.
NB: in my real-world apps, I've noticed there are some strange things that happen when an iPhone thread gets busy - e.g. background threads will starve the main thread EVEN WHEN set at lower priority. Although this is clearly a bug in Apple's thread scheduler, I'd like to make a busy task that demonstrates this - and/or an alternate busy task that shows what happens when you DON'T trigger that behaviour in the scheduler.
UPDATE:
For instance, the following can have different effects:
for( int i=0; i<1000; i++ )
    for( int k=0; k<1000; k++ )
        CC_MD5( cStr, strlen(cStr), result );

for( int i=0; i<1000000; i++ )
    CC_MD5( cStr, strlen(cStr), result );
...sometimes, at least, the compiler seems to optimize the latter away (and I have no idea what the compiler voodoo is for that - on some builds it showed no difference, on some it did :()
UPDATE 2:
25 threads, on a first-gen iPhone, doing a million MD5s each ... and there's almost no perceptible effect on the GUI.
Whereas 5 threads parsing XML using the bundled SAX-based parser will usually grind the GUI to a halt.
It seems that MD5 hashing doesn't trigger the problems in the iPhone's buggy thread scheduler :(. I'm going to investigate memory allocations instead, to see if that has a different effect.
You can avoid the compiler optimising things away by making sure the compiler can't easily infer what you're trying to do at compile time.
For example, this:
for( int i=0; i<1000000; i++ )
    CC_MD5( cStr, strlen(cStr), result );
has an invariant input, so the compiler could realise that it's going to get the same result every time, or that the result isn't being used, so it doesn't need to calculate it at all.
You could avoid both these problems like so:
for( int i=0; i<1000000; i++ )
{
    CC_MD5( cStr, strlen(cStr), result );
    sprintf(cStr, "%02x%02x", result[0], result[1]);
}
If you're seeing the problem with SAX, then I'd start with getting the threads in your simulation app doing SAX and check you do see the same problems outside of your main app.
If the problem is not related to pure processor power or memory allocations, other areas you could look at would be disk I/O (depending where your xml input is coming from), mutexes or calling selectors/delegates.
Good luck, and do report back how you get on!
Apple actually provides sample code at developer.apple.com that does something similar to what you are looking for, with the intent of highlighting the performance differences between using LibXML (SAX) and CocoaXML. The focus is not on CPU performance, but assuming you can actually measure processor utilization, you could likely just scale up (repeat within your XML) the dataset that the sample downloads.
So after researching engines a lot, I've been building a 2D framework for the iPhone. As you know, the world of engine architecture is vast, so I've been trying to apply best practices as much as possible.
I've been using:
uint_fast8_t mId;
If I look up the definition of uint_fast8_t I find:
/* 7.18.1.3 Fastest-width integer types */
...
typedef uint8_t uint_fast8_t;
And I've been using these types throughout my code - my question is, is there a performance benefit to using these types? And what exactly is going on behind the scenes? Besides the obvious fact that this is the correct data type (unsigned 8-bit integer) for the data, is it worthwhile to have it peppered throughout my code?
Is this a needless optimization that the compiler would probably take care of anyway?
Thanks.
Edit: No responses/answers, so I'm putting a bounty on this!
the "fast" integer types are defined to be the fastest integer type available with at least the amount of bits required (in this case 8).
If your platform defines uint_fast8_t as uint8_t then there will be absolutely no difference in speed.
The reason is that there may be architectures that are slower when not using their native word length. E.g. I could find one reference where, for Alpha processors, uint_fast8_t was defined to be "unsigned int".
A uint_fast8_t is the fastest integer type guaranteed to be at least 8 bits wide. Depending on your platform, it could be 8, 16, or 32 bits wide.
It isn't taken care of by the compiler itself; it can indeed make your program execute faster.
Here are some resources I found; you might already have seen them: http://embeddedgurus.com/stack-overflow/2008/06/efficient-c-tips-1-choosing-the-correct-integer-size/
http://www.mail-archive.com/avr-gcc-list#nongnu.org/msg03149.html
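If you want to see what these types map to on your own toolchain, a trivial check (plain C, not from the links above) is to print their sizes:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* the mapping is platform-defined, so the output varies per toolchain */
    printf("uint8_t       : %zu bytes\n", sizeof(uint8_t));
    printf("uint_fast8_t  : %zu bytes\n", sizeof(uint_fast8_t));
    printf("uint_fast16_t : %zu bytes\n", sizeof(uint_fast16_t));
    printf("uint_fast32_t : %zu bytes\n", sizeof(uint_fast32_t));
    return 0;
}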
The header in mingw64 says the fast types are "Not actually guaranteed to be fastest for all purposes":
/* 7.18.1.3 Fastest minimum-width integer types
* Not actually guaranteed to be fastest for all purposes <---------------------
* Here we use the exact-width types for 8 and 16-bit ints.
*/
typedef signed char int_fast8_t;
typedef unsigned char uint_fast8_t;
typedef short int_fast16_t;
typedef unsigned short uint_fast16_t;
typedef int int_fast32_t;
typedef unsigned int uint_fast32_t;
__MINGW_EXTENSION typedef long long int_fast64_t;
__MINGW_EXTENSION typedef unsigned long long uint_fast64_t;
And that still applies to ARM and other architectures, because using a narrow type requires zero-extension or sign-extension in many situations, which is less optimal than a native int.
However, narrow types do pay off for large arrays or for slow operations (like division). I'm not sure how slow ARM division is, but on x86, 64-bit division is much slower than 32-bit or 8-bit division.
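A common compromise that follows from this (a sketch, not taken from the answer above): use the exact-width type for bulk storage so arrays stay small and cache-friendly, and a "fast" or native-width type for the scalars that live in registers:

#include <stdint.h>
#include <stddef.h>

/* uint8_t keeps the array compact; the accumulator and counter use wider types */
uint_fast32_t sum_bytes(const uint8_t *data, size_t n)
{
    uint_fast32_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}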