I have post here ,a function that i use , to get the accelerator fft .
Setup the accelerator framework for fft on the iPhone
It is working great.
The thing is, that i use it in real time, so for each new audio buffer i call this function with the new buffer.
I get a memory warning because of these lines (probably)
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
questions :
do i have another way, but to malloc them again and again(dont forget i have to feed it with a new buffer many times a second )
how exactly do i free them? (code lines)
can it caused by the fact that the fft is heavy to the system ?
Any way to get rid of this warning will help me a lot .
Thanks a lot.
These things should be done once, at the start of your program:
Allocate memory for buffers, using code like float *buffer = malloc(NumberOfElements * sizeof *buffer);.
Create an FFT setup, using code like FFTSetup setup = vDSP_create_fftsetup(log2n, FFT_RADIX2);.
Also test the return values. If malloc or vDSP_create_fftsetup returns 0, write an error message and exit the program or take other exception behavior.
These things should be done once, at the end of your program:
Destroy the FFT setup, using code like vDSP_destroy_fftsetup(setup);.
Release the memory for the buffers, using code like free(buffer);.
In the middle of your program, while you are processing samples, the code should use the existing buffers and setup. So the variables pointing to the buffers and the setup must be visible to that code. You can either pass them in as parameters (perhaps grouped together in a struct) or make them global (which should be only a temporary solution for small programs).
Your program should be arranged so that it is never necessary to allocate memory or create an FFT setup while samples are being processed.
All memory that is allocated should be freed eventually.
If you are malloc'ing and never freeing, you will run out of memory. Make sure to 'free' your memory using free().
*Note: free() doesn't actually erase any memory. It simply tells the system that we're done with the memory and it's available for other allocations.
// Example:
// allocating memory
int *intpointer;
intpointer = malloc(sizeof(int));
// ... do stuff...
// 'Freeing' it when you are done
free(intpointer);
Related
I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:
struct protoarray {
size_t dim1;
size_t dim2;
size_t dim3;
float* data;
};
I create two variables of type protoarray, dynamically allocate space to data via malloc and cudaMalloc on the host and device side, and update dim1, dim2 and dim3 to reflect the size of array I want this struct to represent. I read in this thread that the struct should be passed via copy. So this is what I do in my kernel
__global__ void kernel(curandState_t *state, protoarray arr_device){
const size_t dim1 = arr_device.dim1;
const size_t dim2 = arr_device.dim2;
for(size_t j(0); j < dim2; j++){
for(size_t i(0); i < dim1; i++){
// Do something
}
}
}
The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel and setting a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, nowhere large enough to cause overflow, but this value copies correctly into dim1 as 2, which means that the for loops execute as I intended them to. As a side question, is using size_t which is essential unsigned long long int bad practice, seeing as the GPU's are made of 32bit cores?
Generally, how safe is it to pass struct and class into kernels as arguments, is bad practice that should be avoided at all cost? I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory, and that they should be very lightweight if I want to pass them by value.
This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.
The struct is passed by copy, so all its contents are copied into shared memory of each block.
Incorrect. Kernel arguments are stored in constant memory, which is device-global and not block-specific. They are not stored shared memory (which is block-specific).
When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).
Generally, how safe is it to pass struct and class into kernels as arguments
My personal rule of thumb on this matter is: If the struct/class...
is trivially-copyable; and
all its members of the struct/class are defined both for the host and the device side, or at least - designed with GPU use in mind;
then it should be safe to pass to a kernel.
passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?
No. But remember that most C++ libraries only provide host-side code; and were not written with a mind of being used on a GPU. So I'd be wary of using non-trivial classes without a lot of scrutiny.
I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory
Yes, this can be problematic. However - if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged() - then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed. See:
Unified Memory in CUDA for beginners
Beyond GPU Memory Limits with Unified Memory on Pascal
and that they should be very lightweight if I want to pass [objects to kernels] by value.
Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.
Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
I have an app written in swift which works fine initially, but throughout time the app gets sluggish. I have opened an instruments profiling session using the Allocation and Leaks profile.
What I have found is that the allocation increases dramatically, doing something that should only overwrite the current data.
The memory in question is in the group < non-object >
Opening this group gives hundreds of different allocations, with the responsible library all being libvDSP. So with this I can conclude it is a vDSP call that is not releasing the memory properly. However, double clicking on any of these does not present me with any code, but the raw language I do not understand.
The function that callas vDSP is wrapped like this:
func outOfPlaceComplexFourierTransform(
setup: FFTSetup,
resultSize:Int,
logSize: UInt,
direction: FourierTransformDirection) -> ComplexFloatArray {
let result = ComplexFloatArray.zeros(count:resultSize)
self.useAsDSPSplitComplex { selfPointer in
result.useAsDSPSplitComplex { resultPointer in
vDSP_fft_zop(
setup,
&selfPointer,
ComplexFloatArray.strideSize,
&resultPointer,
ComplexFloatArray.strideSize,
logSize,
direction.rawValue)
}
}
return result
}
This is called from another function:
var mags1 = ComplexFloatArray.zeros(count: measurement.windowedImpulse!.count)
mags1 = (measurement.windowedImpulse?.outOfPlaceComplexFourierTransform(setup: fftSetup, resultSize: mags1.count, logSize: UInt(logSize), direction: ComplexFloatArray.FourierTransformDirection(rawValue: 1)!))!
Within this function, mags1 is manipulated and overwrites an existing array. It was my understanding that mags1 would be deallocated once this function has finished, as it is only available inside this function.
This is the function that is called, many times per second at times. Any help would be appreciated, as what should only take 5mb, very quickly grows by two hundred megabytes in a couple of seconds.
Any pointers to either further investigate the source of the leak, or to properly deallocate this memory once finished would be appreciated.
I cannot believe I solved this so quickly after posting this. (I genuinely had several hours of pulling my hair out).
Not included in my code here, I was creating a new FFTSetup every time this was called. Obviously this is memory intensive, and it was not reusing this memory.
In instruments looking at the call tree I was able to see the function utilising this memory.
I am currently building an app that uses Metal to efficiently render triangles and evaluate a fitness function on textures, I noticed that the memory usage of my Metal app is growing and I can't really understand why.
First of all I am surprised to see that in debug mode, according to Xcode debug panel, memory usage grows really slowly (about 20 MB after 200 generated images), whereas it grows way faster in release (about 100 MB after 200 generated images).
I don't store the generated images (at least not intentionally... but maybe there is some leak I am unaware of).
I am trying to understand where the leak (if it one) comes from but I don't really know where to start, I took a a GPU Frame capture to see the objects used by Metal and it seems suspicious to me:
Looks like there are thousands of objects (the list is way longer than what you can see on the left panel).
Each time I draw an image, there is a moment when I call this code:
trianglesVerticesCoordiantes = device.makeBuffer(bytes: &positions, length: bufferSize , options: MTLResourceOptions.storageModeManaged)
triangleVerticiesColors = device.makeBuffer(bytes: &colors, length: bufferSize, options: MTLResourceOptions.storageModeManaged)
I will definitely make it a one time allocation and then simply copy data into this buffer when needed, but could it cause the memory leak or not at all ?
EDIT with screenshot of instruments :
EDIT #2 : Tons of command encoder objects present when using Inspector:
EDIT #3 : Here is what seems to be the most suspect memory graph when analysed which Xcode visual debugger:
And some detail :
I don't really know how to interpret this...
Thank you.
I'm using MATLAB profile to observe memory using the command
profile -memory on
profile clear
% my code
profile report
and i got this table
1- i want to ask about the meaning of
Allocated Memory,Freed Memory, SelfMemory, and Peak Memory
2- what is the meaning of negative self memory?
After a quick google, it would seem that no-one knows, except perhaps MathWorks and they aren't telling. (I jest, but in truth I found very little information on the subject).
Logically however I would interpret the column names as follows:
Allocated memory = the total amount of memory allocated within the function and any it calls.
Freed memory = the total amount of memory released within the function and any it calls.
Peak Memory = the maximum amount of memory in use at any one time during the execution of the function.
Self Memory = the amount of memory used by the function, but not including any functions it calls.
I would hypothesize that a negative 'Self Memory' would indicate that the function frees more memory than it allocates. This could be that it has ownership of a piece of data passed to it, which it then clears. E.g.:
function A()
foo = B();
clear foo
end
function foo = B()
foo = rand(10000,10000);
end
In the example above, the data is created in the call to B and since Matlab employs a lazy copy memory management, this case works pretty much as pass-by-reference for the return value. So, B allocates the memory, and A frees it.
Indeed, running that code with the profiling method in the question produces the following output, which supports my hypothesis.
I'm dealing with neural networks here, but it's safe to ignore that, as the real question has to deal with blocks in objective-c. Here is my issue. I found a way to convert a neural network into a big block that can be executed all at once. However, it goes really, really slow, relative to activating the network. This seems a bit counterintuitive.
If I gave you a group of nested functions like
CGFloat answer = sin(cos(gaussian(1.5*x + 2.5*y)) + (.3*d + bias))
//or in block notation
^(CGFloat x, CGFloat y, CGFloat d, CGFloat bias) {
return sin(cos(gaussian(1.5*x + 2.5*y)) + (.3*d + bias));
};
In theory, running that function multiple times should be easier/quicker than looping through a bunch of connections, and setting nodes active/inactive, etc, all of which essentially calculate this same function in the end.
However, when I create a block (see thread: how to create function at runtime) and run this code, it is slow as all hell for any moderately sized network.
Now, what I don't quite understand is:
When you copy a block, what exactly are you copying?
Let's say, I copy a block twice, copy1 and copy2. If I call copy1 and copy2 on the same thread, is the same function called? I don't understand exactly what the docs mean for block copies: Apple Block Docs
Now if I make that copy again, copy1 and copy2, but instead, I call the copies on separate threads, now how do the functions behave? Will this cause some sort of slowdown, as each thread attempts to access the same block?
When you copy a block, what exactly
are you copying?
You are copying any state the block has captured. If that block captures no state -- which that block appears not to -- then the copy should be "free" in that the block will be a constant (similar to how #"" works).
Let's say, I copy a block twice, copy1
and copy2. If I call copy1 and copy2
on the same thread, is the same
function called? I don't understand
exactly what the docs mean for block
copies: Apple Block Docs
When a block is copied, the code of the block is never copied. Only the captured state. So, yes, you'll be executing the exact same set of instructions.
Now if I make that copy again, copy1
and copy2, but instead, I call the
copies on separate threads, now how do
the functions behave? Will this cause
some sort of slowdown, as each thread
attempts to access the same block?
The data captured within a block is not protected from multi-threaded access in any way so, no, there would be no slowdown (but there will be all the concurrency synchronization fun you might imagine).
Have you tried sampling the app to see what is consuming the CPU cycles? Also, given where you are going with this, you might want to become acquainted with your friendly local disassembler (otool -TtVv binary/or/.o/file) as it can be quite helpful in determining how costly a block copy really is.
If you are sampling and seeing lots of time in the block itself, then that is just your computation consuming lots of CPU time. If the block were to consume CPU during the copy, you would see the consumption in a copy helper.
Try creating a source file that contains a bunch of different kinds of blocks; with parameters, without, with captured state, without, with captured blocks with/without captured state, etc.. and a function that calls Block_copy() on each.
Disassemble that and you'll gain a deep understanding on what happens when blocks are copied. Personally, I find x86_64 assembly to be easier to read than ARM. (This all sounds like good blog fodder -- I should write it up).