My MEX file is written in C++/CLI and calls a DLL written in C#.
When gcnew'ing an object, shouldn't it be garbage collected when the mexFunction returns?
Its references should be lost but nothing seems to be garbage collected... each call to the mex function increases MATLAB's memory allocation (and no, the memory is not used for MATLAB variables).
I've experimented with creating a large dummy value with narrow scope and when stepping through the MEX file I can see the memory allocated and released. But not so with the main object created in the mexFunction =(
I've tried to delete it in the destructor and finalizer, but I can't get it to garbage collect. How can I free the managed memory when returning to MATLAB?
I don't think external DLL filers are the problem. To illustrate, I created this silly mexFunction:
public ref class Foo
{
public:
Foo()
{
Dictionary<int,String^>^ bar = gcnew Dictionary<int,String^>;
for(int i=0;i<10000000;i++)
{
bar->Add(i, "abcdefghijklmnopqrstuvxyz");
}
}
};
void mexFunction(int nlhs, mxArray* plhs[], int nrhs, mxArray* prhs[])
{
Foo^ test = gcnew Foo();
}
This bumps MATLAB's memory by around 300 MB, although subsequent calls don't increase the memory further like in my real MEX file.
EDIT:
I answered my own question, the culprit was mxArrayToString
Garbage collection marks the memory as available inside the .NET heap. It doesn't shrink the .NET heap (which would make the memory available to other processes and the address space available to non .NET code within your process).
It's explicitly documented that the Large Object Heap is never shrunk, and a Dictionary with 10 million entries is probably large enough to go onto the LOH.
I found the problem, turns out it wasn't .net related after all... sorry for that red herring
Since I wasn't using new, malloc or mxMalloc I wrongly assumed that all my unmanaged memory would be in the stack and cleaned up when the mexFunction ended.
However mxArrayToString doesn't return a pointer to the MATLAB data as mxGetData and other mx* functions do. It copies the data onto the heap and one has to call mxFree to release it. I used mxArrayToString as input to create a System::String^, the only change needed was to save a temporay char pointer, use that for the String^ constructor and then mxFree it.
So once again for SEO: The pointer from mxArrayToString needs to be mxFree'd!
Related
I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:
struct protoarray {
size_t dim1;
size_t dim2;
size_t dim3;
float* data;
};
I create two variables of type protoarray, dynamically allocate space to data via malloc and cudaMalloc on the host and device side, and update dim1, dim2 and dim3 to reflect the size of array I want this struct to represent. I read in this thread that the struct should be passed via copy. So this is what I do in my kernel
__global__ void kernel(curandState_t *state, protoarray arr_device){
const size_t dim1 = arr_device.dim1;
const size_t dim2 = arr_device.dim2;
for(size_t j(0); j < dim2; j++){
for(size_t i(0); i < dim1; i++){
// Do something
}
}
}
The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel and setting a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, nowhere large enough to cause overflow, but this value copies correctly into dim1 as 2, which means that the for loops execute as I intended them to. As a side question, is using size_t which is essential unsigned long long int bad practice, seeing as the GPU's are made of 32bit cores?
Generally, how safe is it to pass struct and class into kernels as arguments, is bad practice that should be avoided at all cost? I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory, and that they should be very lightweight if I want to pass them by value.
This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.
The struct is passed by copy, so all its contents are copied into shared memory of each block.
Incorrect. Kernel arguments are stored in constant memory, which is device-global and not block-specific. They are not stored shared memory (which is block-specific).
When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).
Generally, how safe is it to pass struct and class into kernels as arguments
My personal rule of thumb on this matter is: If the struct/class...
is trivially-copyable; and
all its members of the struct/class are defined both for the host and the device side, or at least - designed with GPU use in mind;
then it should be safe to pass to a kernel.
passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?
No. But remember that most C++ libraries only provide host-side code; and were not written with a mind of being used on a GPU. So I'd be wary of using non-trivial classes without a lot of scrutiny.
I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory
Yes, this can be problematic. However - if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged() - then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed. See:
Unified Memory in CUDA for beginners
Beyond GPU Memory Limits with Unified Memory on Pascal
and that they should be very lightweight if I want to pass [objects to kernels] by value.
Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.
Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
This is an extension of the discussion here: pycuda shared memory error "pycuda._driver.LogicError: cuLaunchKernel failed: invalid value"
Is there a method in pycuda that is equivalent to the following C++ API call?
#define SHARED_SIZE 0x18000 // 96 kbyte
cudaFuncSetAttribute(func, cudaFuncAttributeMaxDynamicSharedMemorySize, SHARED_SIZE)
Working on a recent GPU (Nvidia V100), going beyond 48 kbyte shared memory requires this function attribute be set. Without it, one gets the same launch error as in the topic above. The "hard" limit on the device is 96 kbyte shared memory (leaving 32 kbyte for L1 cache).
There's a deprecated method Fuction.set_shared_size(bytes) that sounds promising, but I can't find what it's supposed to be replaced by.
PyCUDA uses the driver API, and the corresponding function call for setting a function dynamic memory limits is cuFuncSetAttribute.
I can't find that anywhere in the current PyCUDA tree, and therefore suspect that it has not been implemented.
I'm not sure if this is what you're looking for, but this might help someone looking in this direction.
The dynamic shared memory size in PyCUDA can be set either using:
shared argument in the direct kernel call (the "unprepared call"). For example:
myFunc(arg1, arg2, shared=numBytes, block=(1,1,1), grid=(1,1))
shared_size argument in the prepared kernel call. For example:
myFunc.prepared_call(grid, block, arg1, arg2, shared_size=numBytes)
where numBytes is the amount of memory in bytes you wish to allocate at runtime.
I'm using MATLAB profile to observe memory using the command
profile -memory on
profile clear
% my code
profile report
and i got this table
1- i want to ask about the meaning of
Allocated Memory,Freed Memory, SelfMemory, and Peak Memory
2- what is the meaning of negative self memory?
After a quick google, it would seem that no-one knows, except perhaps MathWorks and they aren't telling. (I jest, but in truth I found very little information on the subject).
Logically however I would interpret the column names as follows:
Allocated memory = the total amount of memory allocated within the function and any it calls.
Freed memory = the total amount of memory released within the function and any it calls.
Peak Memory = the maximum amount of memory in use at any one time during the execution of the function.
Self Memory = the amount of memory used by the function, but not including any functions it calls.
I would hypothesize that a negative 'Self Memory' would indicate that the function frees more memory than it allocates. This could be that it has ownership of a piece of data passed to it, which it then clears. E.g.:
function A()
foo = B();
clear foo
end
function foo = B()
foo = rand(10000,10000);
end
In the example above, the data is created in the call to B and since Matlab employs a lazy copy memory management, this case works pretty much as pass-by-reference for the return value. So, B allocates the memory, and A frees it.
Indeed, running that code with the profiling method in the question produces the following output, which supports my hypothesis.
I have post here ,a function that i use , to get the accelerator fft .
Setup the accelerator framework for fft on the iPhone
It is working great.
The thing is, that i use it in real time, so for each new audio buffer i call this function with the new buffer.
I get a memory warning because of these lines (probably)
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
questions :
do i have another way, but to malloc them again and again(dont forget i have to feed it with a new buffer many times a second )
how exactly do i free them? (code lines)
can it caused by the fact that the fft is heavy to the system ?
Any way to get rid of this warning will help me a lot .
Thanks a lot.
These things should be done once, at the start of your program:
Allocate memory for buffers, using code like float *buffer = malloc(NumberOfElements * sizeof *buffer);.
Create an FFT setup, using code like FFTSetup setup = vDSP_create_fftsetup(log2n, FFT_RADIX2);.
Also test the return values. If malloc or vDSP_create_fftsetup returns 0, write an error message and exit the program or take other exception behavior.
These things should be done once, at the end of your program:
Destroy the FFT setup, using code like vDSP_destroy_fftsetup(setup);.
Release the memory for the buffers, using code like free(buffer);.
In the middle of your program, while you are processing samples, the code should use the existing buffers and setup. So the variables pointing to the buffers and the setup must be visible to that code. You can either pass them in as parameters (perhaps grouped together in a struct) or make them global (which should be only a temporary solution for small programs).
Your program should be arranged so that it is never necessary to allocate memory or create an FFT setup while samples are being processed.
All memory that is allocated should be freed eventually.
If you are malloc'ing and never freeing, you will run out of memory. Make sure to 'free' your memory using free().
*Note: free() doesn't actually erase any memory. It simply tells the system that we're done with the memory and it's available for other allocations.
// Example:
// allocating memory
int *intpointer;
intpointer = malloc(sizeof(int));
// ... do stuff...
// 'Freeing' it when you are done
free(intpointer);
I got a memory leak when I detect with instrument. I don`t have so much experience about memory-management, so I can not figure out what is the possible cause for this problem, the memory leak is as below:
I want to know the possible reason about this kind of memory leak. Is some one who can give me some clues?
strdup uses malloc internally, so anything that has been strdup-ed has to be freed using free.
For example:
char *duplicate = strdup("abcdef");
...
free(duplicate);
strdup() is a library function, so you need to go back up the backtrace until you find a caller that's in your code. There you'll find a library call which is resulting in memory being allocated - it should have a corresponding freeing call elsewhere in your program.
(The freeing function is not necessarily a direct call to free() - for example if you call the getaddrinfo() library function the corresponding freeing function is freeaddrinfo()).