I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:
struct protoarray {
    size_t dim1;
    size_t dim2;
    size_t dim3;
    float* data;
};
I create two variables of type protoarray, dynamically allocate space for data via malloc and cudaMalloc on the host and device side respectively, and update dim1, dim2 and dim3 to reflect the size of the array I want this struct to represent. I read in this thread that the struct should be passed via copy. So this is what I do in my kernel:
__global__ void kernel(curandState_t *state, protoarray arr_device){
    const size_t dim1 = arr_device.dim1;
    const size_t dim2 = arr_device.dim2;
    for(size_t j(0); j < dim2; j++){
        for(size_t i(0); i < dim1; i++){
            // Do something
        }
    }
}
The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel with a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, which is nowhere near large enough to be an overflow. However, this value is copied correctly into dim1 as 2, which means that the for loops execute as I intended them to. As a side question, is using size_t, which is essentially unsigned long long int, bad practice, seeing as GPUs are made of 32-bit cores?
Generally, how safe is it to pass struct and class into kernels as arguments, is bad practice that should be avoided at all cost? I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory, and that they should be very lightweight if I want to pass them by value.
This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.
The struct is passed by copy, so all its contents are copied into shared memory of each block.
Incorrect. Kernel arguments are stored in constant memory, which is device-global and not block-specific. They are not stored in shared memory (which is block-specific).
When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).
Generally, how safe is it to pass struct and class into kernels as arguments
My personal rule of thumb on this matter is: If the struct/class...
is trivially-copyable; and
all of its members are defined for both the host and the device side, or at least designed with GPU use in mind;
then it should be safe to pass to a kernel.
passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?
No. But remember that most C++ libraries only provide host-side code and were not written with GPU use in mind. So I'd be wary of using non-trivial classes without a lot of scrutiny.
I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory
Yes, this can be problematic. However - if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged() - then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed. See:
Unified Memory in CUDA for beginners
Beyond GPU Memory Limits with Unified Memory on Pascal
and that they should be very lightweight if I want to pass [objects to kernels] by value.
Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.
Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
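As a rough host-only illustration in plain C (not CUDA, and not the asker's actual program), the sketch below shows what passing such a struct by value implies: the callee gets its own copy of the dimensions and of the pointer, while the pointed-to data is shared. The helper names protoarray_create and fill_by_value are made up for this example; in a real kernel launch the data member would have to hold a device address obtained from cudaMalloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Same layout as the protoarray struct in the question. */
typedef struct {
    size_t dim1;
    size_t dim2;
    size_t dim3;
    float *data;   /* in CUDA this would hold a device address */
} protoarray;

/* Hypothetical helper: allocates a zeroed dim1 x dim2 x dim3 array on the host.
 * On allocation failure the returned struct has data == NULL. */
protoarray protoarray_create(size_t dim1, size_t dim2, size_t dim3)
{
    protoarray arr = { dim1, dim2, dim3, NULL };
    arr.data = calloc(dim1 * dim2 * dim3, sizeof *arr.data);
    return arr;
}

/* Hypothetical stand-in for a kernel: receives its own copy of the struct. */
void fill_by_value(protoarray arr, float value)
{
    assert(arr.data != NULL);
    size_t n = arr.dim1 * arr.dim2 * arr.dim3;
    for (size_t i = 0; i < n; i++) {
        arr.data[i] = value;   /* writes through the shared pointer */
    }
    arr.dim1 = 0;              /* changes only the local copy */
}
```

After calling fill_by_value on a struct, the caller's dim1 is unchanged while the elements behind data have been overwritten. This is why a small struct holding sizes plus a pointer is cheap to pass by value: only the few members are copied, never the array itself.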
I wish to know if there could be any significant difference in terms of memory efficiency between marshaling a struct and marshaling a struct that contains an already-marshaled struct.
Example:
Assume we have a struct B with some fields.
message B{...}
The common representation:
message A {
B b = 1;
}
Another way:
message A {
bytes b = 1;
}
Where b is a marshaled B struct.
Generally, is it good practice? Are there any efficiency implications?
Thanks,
Elad
At the payload level, they are identical - however, in terms of how implementations treat them, there may be differences. The most obvious difference is that you can't use a bytes until you further deserialize it; this has pros and cons:
if you weren't ever going to touch it anyway, this could be fine and advantageous - avoiding some CPU processing that you didn't need for read or write; this will also mean that any downstream allocations (strings, etc) don't need to happen - so you only have a single allocation chunk: easy and efficient
if you do need to read it, then in addition to making life less convenient, you could have allocated an extra chunk of memory for the raw form (a chunk of bytes), and you'll need to allocate for the deserialized form; if you went straight to the deserialized form, most implementations would have skipped that intermediate allocation
So: yes, it will have different characteristics. Whether they are advantageous (or the opposite) depends on whether you also need to do the extra deserialization step on the bytes payload
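To make the "identical at the payload level" point concrete, here is a minimal hand-rolled sketch in C rather than a real protobuf library. It relies on the fact that embedded messages and bytes fields both use the length-delimited wire type (2), so field number 1 gets the tag byte 0x0A in either schema. The function name encode_len_field1 is made up, and the sketch only handles payloads shorter than 128 bytes so that the length fits in a single varint byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: writes field number 1 with wire type 2
 * (length-delimited) into out and returns the number of bytes written.
 * The same bytes result whether the schema declares the field as
 * "B b = 1;" or as "bytes b = 1;". */
size_t encode_len_field1(const uint8_t *payload, size_t len, uint8_t *out)
{
    assert(len < 128);                    /* single-byte varint length only */
    out[0] = (uint8_t)((1u << 3) | 2u);   /* tag: field 1, wire type 2 = 0x0A */
    out[1] = (uint8_t)len;                /* payload length */
    memcpy(out + 2, payload, len);        /* already-serialized B */
    return len + 2;
}
```

For example, if B (whose fields are elided in the question) is assumed to contain a single int32 field number 1 set to 150, it serializes to 08 96 01, and A serializes to 0A 03 08 96 01 under both definitions. The difference lies only in whether the receiver parses the inner bytes eagerly or leaves them as an opaque blob.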
I think it's a bad practice to declare a bytes field instead of a struct you would have otherwise specified in a proto file.
It's called a specification hole: you will have to write additional documentation to describe how the receiver should interpret the bytes.
I'm using the MATLAB profiler to observe memory usage with the commands
profile -memory on
profile clear
% my code
profile report
and I got this table.
1. I want to ask about the meaning of
Allocated Memory, Freed Memory, Self Memory, and Peak Memory.
2. What is the meaning of negative Self Memory?
After a quick google, it would seem that no-one knows, except perhaps MathWorks and they aren't telling. (I jest, but in truth I found very little information on the subject).
Logically however I would interpret the column names as follows:
Allocated memory = the total amount of memory allocated within the function and any it calls.
Freed memory = the total amount of memory released within the function and any it calls.
Peak Memory = the maximum amount of memory in use at any one time during the execution of the function.
Self Memory = the amount of memory used by the function, but not including any functions it calls.
I would hypothesize that a negative 'Self Memory' would indicate that the function frees more memory than it allocates. This could be that it has ownership of a piece of data passed to it, which it then clears. E.g.:
function A()
    foo = B();
    clear foo
end
function foo = B()
    foo = rand(10000,10000);
end
In the example above, the data is created in the call to B and since Matlab employs a lazy copy memory management, this case works pretty much as pass-by-reference for the return value. So, B allocates the memory, and A frees it.
Indeed, running that code with the profiling method in the question produces the following output, which supports my hypothesis.
I have posted here a function that I use to compute the FFT with the Accelerate framework:
Setup the accelerator framework for fft on the iPhone
It is working great.
The thing is that I use it in real time, so I call this function for each new audio buffer.
I get a memory warning, probably because of these lines:
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
Questions:
Is there another way than to malloc them again and again? (Don't forget that I have to feed it a new buffer many times a second.)
How exactly do I free them? (Code lines, please.)
Can it be caused by the fact that the FFT is heavy on the system?
Any way to get rid of this warning will help me a lot.
Thanks a lot.
These things should be done once, at the start of your program:
Allocate memory for buffers, using code like float *buffer = malloc(NumberOfElements * sizeof *buffer);.
Create an FFT setup, using code like FFTSetup setup = vDSP_create_fftsetup(log2n, FFT_RADIX2);.
Also test the return values. If malloc or vDSP_create_fftsetup returns 0, write an error message and exit the program, or handle the error in some other way.
These things should be done once, at the end of your program:
Destroy the FFT setup, using code like vDSP_destroy_fftsetup(setup);.
Release the memory for the buffers, using code like free(buffer);.
In the middle of your program, while you are processing samples, the code should use the existing buffers and setup. So the variables pointing to the buffers and the setup must be visible to that code. You can either pass them in as parameters (perhaps grouped together in a struct) or make them global (which should be only a temporary solution for small programs).
Your program should be arranged so that it is never necessary to allocate memory or create an FFT setup while samples are being processed.
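The arrangement described above might be sketched as follows. This is only an illustration: the struct and function names (fft_buffers, fft_buffers_init, fft_buffers_process, fft_buffers_destroy) are invented here, and the processing step is a placeholder that merely splits interleaved samples into the two buffers instead of calling vDSP. The point is the lifetime: allocate once, reuse for every audio buffer, free once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container for buffers that live for the whole program. */
typedef struct {
    size_t n_over_2;
    float *realp;
    float *imagp;
} fft_buffers;

/* Call once at startup. Returns 0 on success, -1 if malloc failed. */
int fft_buffers_init(fft_buffers *b, size_t n_over_2)
{
    b->n_over_2 = n_over_2;
    b->realp = malloc(n_over_2 * sizeof *b->realp);
    b->imagp = malloc(n_over_2 * sizeof *b->imagp);
    if (b->realp == NULL || b->imagp == NULL) {
        free(b->realp);            /* free(NULL) is a no-op */
        free(b->imagp);
        b->realp = b->imagp = NULL;
        return -1;
    }
    return 0;
}

/* Call for every new audio buffer of 2 * n_over_2 samples.
 * Reuses the existing memory and never allocates.
 * Placeholder for the real FFT: even samples go to realp, odd to imagp. */
void fft_buffers_process(fft_buffers *b, const float *samples)
{
    assert(b->realp != NULL && b->imagp != NULL);
    for (size_t i = 0; i < b->n_over_2; i++) {
        b->realp[i] = samples[2 * i];
        b->imagp[i] = samples[2 * i + 1];
    }
}

/* Call once at shutdown. */
void fft_buffers_destroy(fft_buffers *b)
{
    free(b->realp);
    free(b->imagp);
    b->realp = b->imagp = NULL;
}
```

In a real program the FFTSetup would be stored in the same struct, created in the init function and destroyed in the destroy function, and the audio callback would only ever call the process function.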
All memory that is allocated should be freed eventually.
If you are malloc'ing and never freeing, you will run out of memory. Make sure to 'free' your memory using free().
*Note: free() doesn't actually erase any memory. It simply tells the system that we're done with the memory and it's available for other allocations.
// Example:
// allocating memory
int *intpointer;
intpointer = malloc(sizeof(int));
// ... do stuff...
// 'Freeing' it when you are done
free(intpointer);
My MEX file is written in C++/CLI and calls a DLL written in C#.
When gcnew'ing an object, shouldn't it be garbage collected when the mexFunction returns?
Its references should be lost but nothing seems to be garbage collected... each call to the mex function increases MATLAB's memory allocation (and no, the memory is not used for MATLAB variables).
I've experimented with creating a large dummy value with narrow scope and when stepping through the MEX file I can see the memory allocated and released. But not so with the main object created in the mexFunction =(
I've tried to delete it in the destructor and finalizer, but I can't get it to garbage collect. How can I free the managed memory when returning to MATLAB?
I don't think external DLL files are the problem. To illustrate, I created this silly mexFunction:
public ref class Foo
{
public:
    Foo()
    {
        Dictionary<int,String^>^ bar = gcnew Dictionary<int,String^>;
        for(int i=0;i<10000000;i++)
        {
            bar->Add(i, "abcdefghijklmnopqrstuvxyz");
        }
    }
};
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, mxArray* prhs[])
{
    Foo^ test = gcnew Foo();
}
This bumps MATLAB's memory by around 300 MB, although subsequent calls don't increase the memory further like in my real MEX file.
EDIT:
I answered my own question, the culprit was mxArrayToString
Garbage collection marks the memory as available inside the .NET heap. It doesn't shrink the .NET heap (which would make the memory available to other processes and the address space available to non .NET code within your process).
It's explicitly documented that the Large Object Heap is never shrunk, and a Dictionary with 10 million entries is probably large enough to go onto the LOH.
I found the problem, turns out it wasn't .net related after all... sorry for that red herring
Since I wasn't using new, malloc or mxMalloc, I wrongly assumed that all my unmanaged memory would be on the stack and cleaned up when the mexFunction ended.
However, mxArrayToString doesn't return a pointer to the MATLAB data as mxGetData and other mx* functions do. It copies the data onto the heap, and one has to call mxFree to release it. I used the result of mxArrayToString as input to create a System::String^; the only change needed was to save a temporary char pointer, use that for the String^ constructor and then mxFree it.
So once again for SEO: The pointer from mxArrayToString needs to be mxFree'd!
When we define a variable with the following syntax, does that mean it stays in memory all the time?
static NSString *const kMyLabel = @"myLabel";
I have 100 constants. Should I instead use the #define preprocessor directive, considering that #define will not keep them alive in memory?
Hardcoded strings, in the format @"my string", are baked into the application binary. In order to make it not be permanent, you'd have to do:
static NSString *kMyLabel = nil;
...somewhere else
kMyLabel = [[NSMutableString alloc] initWithString:@"myLabel"];
But that'd be stupid, because then you'd have both @"myLabel" in memory (because it's part of the app binary) AND your allocated string. So double the memory.
In short:
If you have a constant string, there's no way to "unload" it from memory. And unless you're hard coding a few chapters from a book into your binary, it's not going to be something to worry about. Have you measured it as being a performance issue?
There would be no difference between a constant static variable and a #define directive. When using #define, the preprocessor will replace the macro name with @"myLabel" every time it is used. This could mean that you have one instance of the string for each use, but the compiler combines them so that any strings in the binary are unique. Using the constant static, the code will load the location of the variable when needed. This means #define may be a tiny bit faster as there is less dereferencing to get the string, but it would be unnoticeable.
It will be "in memory", but it will just be a memory mapped section of your application's executable file. If there's memory pressure, that page will be flushed without writing to disk.
Basically, it's "free" except for a tiny bit of IO on startup. Go nuts with them.