I don't understand the difference between memory leak and null dereferencing. How are these two terms related?
The operating system keeps a memory map for each process (each executable you start, like the one you compile and run). This memory map tells the OS which pages of physical memory are allocated to the process. A memory leak is when you allocate memory with the new operator in C++ (or malloc() in C) but never release it later. The leak actually happens when you overwrite the address stored in a pointer that was obtained from new without first releasing that memory with delete, so the old block becomes unreachable.
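As a minimal sketch of that scenario (the function name here is just for illustration):

// The only pointer to the first allocation is overwritten before delete[]
// is ever called, so that first block can never be freed: a memory leak.
void leak_example() {
    unsigned char* buffer = new unsigned char[10];  // first heap allocation
    buffer = new unsigned char[10];                 // address overwritten: the first block is leaked
    delete[] buffer;                                // frees only the second block
}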
There are two kinds of memory allocation: static and dynamic. Static memory allocation looks like the following:
unsigned char memory[10];
In this example, I allocate 10 unsigned chars statically. This means the space is reserved in the executable at compile time: the executable contains room for the unsigned chars I allocated statically. When you launch the executable, the OS places the contents of that array in RAM (after loading the executable from disk). In this example, the name memory decays to a pointer to the first element of the unsigned char array when used in expressions.
Dynamic memory works like the following (in C++):
unsigned char* memory = new unsigned char[10];
In this example, I allocate 10 unsigned chars on the heap instead of statically in the executable. The heap is managed by the OS and grows according to how much memory you allocate. There is essentially no limit on allocation with new: nothing prevents a program from asking for all of the RAM. If a program runs for a long time and has memory leaks, it can keep allocating until the OS struggles to keep it running alongside the rest of the system (or until the memory allocated to the process exceeds the amount of RAM).
Under the hood this works through system calls into the OS: when the program executes the dynamic allocation above, the runtime's allocator asks the kernel for more memory via a system call (for example brk/sbrk or mmap on Linux). System calls are OS specific, so the program you compile has to be recompiled to work on a different OS.
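As a rough, POSIX-only illustration of the kind of request the allocator makes under the hood (the function name is made up; real allocators combine brk/sbrk, mmap and free lists):

#include <sys/mman.h>   // mmap, MAP_FAILED
#include <cstddef>

unsigned char* request_pages(std::size_t bytes) {
    void* p = mmap(nullptr, bytes,
                   PROT_READ | PROT_WRITE,       // readable and writable pages
                   MAP_PRIVATE | MAP_ANONYMOUS,  // fresh pages, not backed by a file
                   -1, 0);
    return p == MAP_FAILED ? nullptr : static_cast<unsigned char*>(p);
}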
Separately, you can create a null pointer, either with nullptr or by initializing the pointer to 0, like the following:
unsigned char* ptr = nullptr;
or
unsigned char* ptr = 0;
When you dereference that pointer, the dereference compiles to a memory fetch. That fetch triggers a page fault because the memory at address 0 was never allocated to your process. The OS then looks in its memory map for the process, determines that the access was not legal, and kills the process.
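A minimal sketch of that crash path:

int main() {
    unsigned char* ptr = nullptr;
    *ptr = 42;   // access at address 0: page fault, the OS kills the process
    return 0;
}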
So the two terms describe quite different problems, and there isn't much of a relation between them.
First of all, is it even possible to read back groupshared data? Or does groupshared data have to be copied to some RWBuffer before it can be transferred to CPU memory? RWBuffers can't be groupshared (I'm assuming that's because you don't know the size of the buffer at compile time).
For those interested, this is the error it throws when declaring a groupshared buffer:
Shader error in 'FOWComputeShader': 'Result': groupshared variables cannot hold resources at kernel CSMain at ...
Basically I'm declaring a big groupshared uint array in the shader, 16 KB worth. I'm binding a ComputeBuffer in the main code to this groupshared array, dispatching the shader, then reading back from the buffer. Sadly, the data I read back is all 0.
I'm working in a Unity environment with a compute shader, setting my buffer up like this:
// MapSize is 128 * 128, so 16kb
// sizeof(uint) is the stride size
// ComputeBufferType.Raw, because I intend to use each uint as 4 bytes later on, so I don't want funny stuff to happen to the values
ComputeBuffer FOWMapBuffer = new ComputeBuffer(MapSize, sizeof(uint), ComputeBufferType.Raw);
FOWComputeShader.SetBuffer(kernel, "_FoWMap", FOWMapBuffer);
//just the dispatch
int ThreadCount = Mathf.CeilToInt((float)FOWdata.Count / ThreadGroupSizeX);
FOWComputeShader.Dispatch(kernel, ThreadCount, 1, 1);
//outVisibleToFaction is a byte array of 128 * 128 size
FOWMapBuffer.GetData(outVisibleToFaction);
FOWMapBuffer.Dispose();
Then inside the shader:
// 4096 uints * 4 bytes per uint = 16kb
#define FoWMap_Size 4096
groupshared uint _FoWMap[FoWMap_Size];
[numthreads(32,1,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    for (uint i = 0; i < FoWMap_Size; i++)
    {
        _FoWMap[i] = i;
    }
}
That's my environment.
Does anyone know if reading back groupshared data is possible, and if so, why is my buffer reading back all 0s?
No, you can't access groupshared memory on the CPU directly. Groupshared memory is a block of on-chip memory and, as the name suggests, it's only shared between the threads inside a single group, so there isn't even one single groupshared memory but rather multiple instances (which may or may not co-exist, depending on hardware and shader). The lifetime of each block of groupshared memory ends once the thread group it belongs to has finished executing (which allows the hardware to reuse that memory for the next thread group). In your case, for example, you're actually dispatching ThreadCount groups, so there will be that many logical blocks of groupshared memory, each 16 KB in size.
So, as a summary, groupshared memory is more like a temporary cache that the threads inside your thread group can use to communicate with each other. Nothing outside of the 32 threads in your thread group knows about the content or even the existence of that memory (since it only really exists while those threads are executing).
If anything outside of these 32 threads needs to have access to the memory, you will need to write it out to an RW Buffer.
I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:
struct protoarray {
    size_t dim1;
    size_t dim2;
    size_t dim3;
    float* data;
};
I create two variables of type protoarray, dynamically allocate space for data via malloc and cudaMalloc on the host and device side respectively, and update dim1, dim2 and dim3 to reflect the size of the array I want this struct to represent. I read in this thread that the struct should be passed by copy. So this is what I do in my kernel:
__global__ void kernel(curandState_t *state, protoarray arr_device){
    const size_t dim1 = arr_device.dim1;
    const size_t dim2 = arr_device.dim2;
    for(size_t j(0); j < dim2; j++){
        for(size_t i(0); i < dim1; i++){
            // Do something
        }
    }
}
The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel and setting a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, which is nowhere near large enough to cause overflow, yet this value copies correctly into dim1 as 2, which means the for loops execute as I intended them to. As a side question, is using size_t, which is essentially unsigned long long int, bad practice, seeing as GPUs are made of 32-bit cores?
Generally, how safe is it to pass struct and class into kernels as arguments? Is it bad practice that should be avoided at all costs? I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory, and that they should be very lightweight if I want to pass them by value.
This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.
The struct is passed by copy, so all its contents are copied into shared memory of each block.
Incorrect. Kernel arguments are stored in constant memory, which is device-global, not block-specific. They are not stored in shared memory (which is block-specific).
When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).
Generally, how safe is it to pass struct and class into kernels as arguments
My personal rule of thumb on this matter is: If the struct/class...
is trivially-copyable; and
all members of the struct/class are defined for both the host and the device side, or at least designed with GPU use in mind;
then it should be safe to pass to a kernel.
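For example, a compile-time check along these lines (plain host-side C++, reusing the protoarray from the question) catches violations of the first point:

#include <stddef.h>
#include <type_traits>

struct protoarray {
    size_t dim1, dim2, dim3;
    float* data;   // must point to device-accessible memory before a kernel uses it
};

// If this assertion holds, passing the struct to a kernel by value copies just
// three sizes and a pointer; no constructors or destructors run on the device.
static_assert(std::is_trivially_copyable<protoarray>::value,
              "protoarray can be passed to a kernel by value");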
passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?
No. But remember that most C++ libraries only provide host-side code and were not written with GPU use in mind, so I'd be wary of using non-trivial classes without a lot of scrutiny.
I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory
Yes, this can be problematic. However, if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged(), then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when they are accessed. See:
Unified Memory in CUDA for beginners
Beyond GPU Memory Limits with Unified Memory on Pascal
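A minimal host-side sketch of the cudaMallocManaged() route, reusing the question's protoarray (error handling omitted; the helper name is made up):

#include <cuda_runtime.h>   // cudaMallocManaged
#include <stddef.h>

struct protoarray { size_t dim1, dim2, dim3; float* data; };

protoarray make_managed_protoarray(size_t d1, size_t d2, size_t d3) {
    protoarray arr{d1, d2, d3, nullptr};
    // Unified (managed) memory: the same pointer is valid on host and device,
    // and pages migrate on demand when either side touches them.
    cudaMallocManaged(&arr.data, d1 * d2 * d3 * sizeof(float));
    return arr;   // fine to pass to a kernel by value: the struct is trivially copyable
}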
and that they should be very lightweight if I want to pass [objects to kernels] by value.
Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.
Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
Since I was introduced to the concept of the heap of a process, I have been assuming that the OS allocates it when the process is created. But then I was doing some research and read a statement here.
It says:
When a program asks malloc for space, malloc asks sbrk to increment the heap size and returns a pointer to the start of the new region on the heap.
If I understood what's been said, the OS allocates zero heap cells to the process at first, and it is only by calling malloc that the process gets some heap cells. To me this makes more sense of the expression "dynamic allocation". Is this correct?
In a typical process memory layout, your C/C++ program has a free memory area between the heap and the stack, and both can grow into it until the region is full. So initially the heap is empty, and when a process calls malloc, it normally calls sbrk() to increase the size of the heap (although modern implementations often prefer mmap(), especially for large allocations). In reality malloc first searches its free list, and only if there is no suitable entry does it call sbrk(); see this question for an implementation of malloc(): malloc implementation?
So the OS doesn't directly decide up front how big the heap of a process should be; in C/C++ things work as described above, and I think in other languages the details can be slightly different.
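A deliberately naive sketch of that sbrk()-based growth (POSIX only; no free list, no alignment, no metadata), just to make the mechanism concrete:

#include <unistd.h>   // sbrk
#include <stddef.h>

void* naive_malloc(size_t size) {
    void* old_break = sbrk(size);   // push the program break up by `size` bytes
    if (old_break == (void*)-1)     // sbrk failed: the heap could not grow
        return NULL;
    return old_break;               // the previous break is the start of the new block
}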
My MEX file is written in C++/CLI and calls a DLL written in C#.
When gcnew'ing an object, shouldn't it be garbage collected when the mexFunction returns?
Its references should be lost but nothing seems to be garbage collected... each call to the mex function increases MATLAB's memory allocation (and no, the memory is not used for MATLAB variables).
I've experimented with creating a large dummy value with narrow scope and when stepping through the MEX file I can see the memory allocated and released. But not so with the main object created in the mexFunction =(
I've tried to delete it in the destructor and finalizer, but I can't get it to garbage collect. How can I free the managed memory when returning to MATLAB?
I don't think external DLL files are the problem. To illustrate, I created this silly mexFunction:
public ref class Foo
{
public:
    Foo()
    {
        Dictionary<int, String^>^ bar = gcnew Dictionary<int, String^>();
        for (int i = 0; i < 10000000; i++)
        {
            bar->Add(i, "abcdefghijklmnopqrstuvxyz");
        }
    }
};

void mexFunction(int nlhs, mxArray* plhs[], int nrhs, mxArray* prhs[])
{
    Foo^ test = gcnew Foo();
}
This bumps MATLAB's memory by around 300 MB, although subsequent calls don't increase the memory further like in my real MEX file.
EDIT:
I answered my own question, the culprit was mxArrayToString
Garbage collection marks the memory as available inside the .NET heap. It doesn't shrink the .NET heap (which would make the memory available to other processes and the address space available to non .NET code within your process).
It's explicitly documented that the Large Object Heap is never shrunk, and a Dictionary with 10 million entries is probably large enough to go onto the LOH.
I found the problem; it turns out it wasn't .NET related after all... sorry for that red herring.
Since I wasn't using new, malloc or mxMalloc, I wrongly assumed that all my unmanaged memory would live on the stack and be cleaned up when the mexFunction ended.
However, mxArrayToString doesn't return a pointer to the MATLAB data the way mxGetData and other mx* functions do. It copies the data onto the heap, and you have to call mxFree to release it. I used the result of mxArrayToString to create a System::String^; the only change needed was to save the temporary char pointer, use it for the String^ constructor, and then mxFree it.
So once again for SEO: The pointer from mxArrayToString needs to be mxFree'd!
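A minimal sketch of the fix (C++/CLI; the helper name is made up for illustration):

#include "mex.h"
using namespace System;

// Convert a MATLAB char array to a managed String^ without leaking the
// buffer that mxArrayToString allocates.
static String^ ToManagedString(const mxArray* arr)
{
    char* tmp = mxArrayToString(arr);     // copies the MATLAB data onto the heap
    String^ managed = gcnew String(tmp);  // build the managed copy for the C# DLL
    mxFree(tmp);                          // release the copy - this was the leak
    return managed;
}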
I noticed quite a strange thing while trying to allocate a lot of memory on my iPhone 3G running iOS 4.2.1.
When I call malloc(512 * 1024) in a loop, it returns a valid pointer about 1500 times, after which I get NULL and
app(2032,0x3e7518b8) malloc: *** mmap(size=524288) failed (error code=12)
*** error: can't allocate region
It surprised me, since I don't think my iPhone has 750 MB of RAM. Then I added a memset after the malloc and it brought the number of successful allocations down to 120, which makes much more sense.
Here's the super-simple code that I used:
for (int i = 1; ; ++i)
{
    void *p = malloc(512 * 1024);
    NSLog(@"%d %p", i, p);
    memset(p, 0, 512 * 1024);
}
I thought the iPhone didn't have any virtual memory system that could explain behavior like this. What would be a reasonable explanation for this?
On iOS (and many other systems), a call to malloc() doesn't necessarily commit physical memory right away. It reserves address space from the OS/kernel, but physical pages are only committed once the memory is actually written to (e.g. with memset()). This allows for greater efficiency in the system's memory management, but it can make malloc()'s behaviour look misleading.
The iPhone definitely has a virtual memory system. What it's missing is the ability to page memory out to disk. In other words, it's missing swap space.