Read groupshared variables back to cpu memory - unity3d

First of all, is it even possible to read groupshared data back? Or does groupshared data have to be copied to some RWBuffer before transferring it to CPU memory? I ask because RWBuffers can't be groupshared (I'm assuming it's because you don't know the size of the buffer at compile time).
For those interested, this is the error it throws when declaring a groupshared buffer:
Shader error in 'FOWComputeShader': 'Result': groupshared variables cannot hold resources at kernel CSMain at ...
Basically I'm declaring a big groupshared uint array in the shader, worth 16kb. I'm linking a ComputeBuffer in the main code to this groupshared array, dispatching the shader, then reading back from the buffer. Sadly the data I read back is all 0.
I'm working in a unity environment with a compute shader, setting my buffer up like this:
// MapSize is 128 * 128, so 16kb
// sizeof(uint) is the stride size
// ComputeBufferType.Raw, because I intend to use each uint as 4 bytes later on, so I don't want funny stuff to happen to the values
ComputeBuffer FOWMapBuffer = new ComputeBuffer(MapSize, sizeof(uint), ComputeBufferType.Raw);
FOWComputeShader.SetBuffer(kernel, "_FoWMap", FOWMapBuffer);
//just the dispatch
int ThreadCount = Mathf.CeilToInt((float)FOWdata.Count / ThreadGroupSizeX);
FOWComputeShader.Dispatch(kernel, ThreadCount, 1, 1);
//outVisibleToFaction is a byte array of 128 * 128 size
FOWMapBuffer.GetData(outVisibleToFaction);
FOWMapBuffer.Dispose();
Then inside the shader:
// 4096 uints * 4 bytes per uint = 16kb
#define FoWMap_Size 4096
groupshared uint _FoWMap[FoWMap_Size];
[numthreads(32,1,1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
for (uint i = 0; i < FoWMap_Size; i++)
{
_FoWMap[i] = i;
}
}
That's my environment.
Does anyone know if reading back groupshared data is possible, if so then why is my buffer reading back all 0s?

No, you can't access groupshared memory on the CPU directly. Groupshared memory is a block of on-chip memory, and as the name suggests, it's only shared between the threads inside a single group. So there isn't even one single groupshared memory, but rather multiple instances (which may or may not co-exist, depending on hardware and shader). The lifetime of each block of groupshared memory ends once the thread group that it belongs to finishes executing (which allows the hardware to re-use that memory for the next thread group). In your case, for example, you're actually dispatching ThreadCount groups, so there will be that many logical blocks of groupshared memory, each 16 KB in size.
So, as a summary, groupshared memory is more like a temporary cache that the threads inside your thread group can use to communicate with each other. Nothing outside of these 32 threads in your thread group knows about the content or even the existence of that memory (since it only really exists while these threads are executing).
If anything outside of these 32 threads needs to have access to the memory, you will need to write it out to an RW Buffer.

Related

Differentiate between memory leak and NULL dereferencing

I don't understand the difference between memory leak and null dereferencing. How are these two terms related?
The operating system has a memory map for each process (each executable you start, like the one you compile and run). This memory map tells the OS which pages of physical memory are allocated to the process. A memory leak is when you allocate memory using the new operator in C++ (or malloc() in C) but never release it later. A typical way this happens is when you overwrite the only pointer to a block allocated with new without first releasing that block with delete.
There are 2 types of memory allocation. One is static, the other is dynamic. Static memory allocation works like the following:
unsigned char memory[10];
In this example, I allocate 10 unsigned chars statically (assuming the declaration is at global scope). This means the space for them is reserved in the executable at compile time. When you launch the executable, the OS places that array in RAM after loading the executable from disk. In most expressions, the name memory converts to a pointer to the first element of the unsigned char array.
Dynamic memory works like the following (in C++):
unsigned char* memory = new unsigned char[10];
In this example, I allocate 10 unsigned chars on the heap instead of statically in the executable. The heap grows according to how much memory you allocate. new itself imposes no fixed limit: nothing prevents a program from trying to allocate the whole RAM. If a program runs for a long time and has memory leaks, it can keep allocating RAM until the OS has a hard time making it coexist with the rest of the system (or until the memory allocated to the process exceeds RAM).
Under the hood, new calls the runtime's allocator, which asks the OS for more memory through a system call when it needs to. That system call is OS-specific, so the program has to be recompiled to work on a different OS.
Separately, you can create a null pointer, or initialize a pointer to 0, like the following:
unsigned char* ptr = nullptr;
or
unsigned char* ptr = 0;
When you dereference that pointer, the dereference will be compiled to a memory fetch. The memory fetch will trigger a page fault because the memory at address 0 wasn't allocated to your process. The OS will then look in its memory map for the process, determine that this access wasn't legal, and kill the process.
The two terms are quite different, and there isn't much relation between them.
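To make the distinction concrete, here is a minimal sketch in C++. The function names (leaky, not_leaky, read_or_default) are made up for illustration and are not from the question or the answer above. The first function shows the leak pattern described earlier, the second shows one common way to avoid it, and the third guards against dereferencing a null pointer.

```cpp
#include <cstddef>
#include <memory>

// Leak: the address of the first block is overwritten before delete[] is
// called, so nothing can ever release those 10 bytes.
void leaky() {
    unsigned char* memory = new unsigned char[10];
    memory = new unsigned char[20];  // the 10-byte block is now unreachable
    delete[] memory;                 // frees only the 20-byte block
}

// No leak: std::unique_ptr releases the old block when it is reassigned
// and the current block when it goes out of scope.
void not_leaky() {
    auto memory = std::make_unique<unsigned char[]>(10);
    memory = std::make_unique<unsigned char[]>(20);
}

// A null dereference is a separate problem: the pointer refers to no object
// at all. Checking before dereferencing avoids it.
int read_or_default(const int* ptr, int fallback) {
    return ptr != nullptr ? *ptr : fallback;
}
```

A leak wastes memory that was validly allocated; a null dereference touches memory that was never allocated. The first degrades the process slowly, the second typically ends it immediately.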

Behaviour of passing struct as a parameter to a CUDA kernel

I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:
struct protoarray {
size_t dim1;
size_t dim2;
size_t dim3;
float* data;
};
I create two variables of type protoarray, dynamically allocate space to data via malloc and cudaMalloc on the host and device side, and update dim1, dim2 and dim3 to reflect the size of array I want this struct to represent. I read in this thread that the struct should be passed via copy. So this is what I do in my kernel
__global__ void kernel(curandState_t *state, protoarray arr_device){
const size_t dim1 = arr_device.dim1;
const size_t dim2 = arr_device.dim2;
for(size_t j(0); j < dim2; j++){
for(size_t i(0); i < dim1; i++){
// Do something
}
}
}
The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel and setting a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576, nowhere large enough to cause overflow, but this value copies correctly into dim1 as 2, which means that the for loops execute as I intended them to. As a side question, is using size_t, which is essentially unsigned long long int, bad practice, seeing as the GPUs are made of 32-bit cores?
Generally, how safe is it to pass struct and class into kernels as arguments, is bad practice that should be avoided at all cost? I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory, and that they should be very lightweight if I want to pass them by value.
This is a partial answer, since without a proper program to look into, it is difficult/impossible to guess why you would see an invalid value in your arr_device.dim1.
The struct is passed by copy, so all its contents are copied into shared memory of each block.
Incorrect. Kernel arguments are stored in constant memory, which is device-global and not block-specific. They are not stored in shared memory (which is block-specific).
When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).
Generally, how safe is it to pass struct and class into kernels as arguments
My personal rule of thumb on this matter is: If the struct/class...
is trivially-copyable; and
all members of the struct/class are defined both for the host and the device side, or at least designed with GPU use in mind;
then it should be safe to pass to a kernel.
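The first condition can be checked at compile time on the host side. The sketch below uses standard C++ only; protoarray is the struct from the question, while owning_array is a hypothetical counter-example added here for contrast.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// The struct from the question: plain sizes plus a raw pointer.
// For a kernel, data would have to point to device memory.
struct protoarray {
    std::size_t dim1;
    std::size_t dim2;
    std::size_t dim3;
    float* data;
};

// Hypothetical counter-example: owns host memory through std::vector,
// so copying it byte-for-byte would not produce a valid object.
struct owning_array {
    std::size_t dim1;
    std::vector<float> data;
};

static_assert(std::is_trivially_copyable<protoarray>::value,
              "protoarray can be copied byte-for-byte");
static_assert(!std::is_trivially_copyable<owning_array>::value,
              "owning_array needs a real copy constructor");
```

The check only covers trivial copyability. It says nothing about whether the pointer inside the struct refers to memory the device can actually access.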
passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?
No. But remember that most C++ libraries only provide host-side code and were not written with GPU use in mind. So I'd be wary of using non-trivial classes without a lot of scrutiny.
I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory
Yes, this can be problematic. However - if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged() - then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed. See:
Unified Memory in CUDA for beginners
Beyond GPU Memory Limits with Unified Memory on Pascal
and that they should be very lightweight if I want to pass [objects to kernels] by value.
Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.
Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.

Can't get max ram size - STM32 with rtos

I'm using an STM32F103R8T6 and I'm currently setting the max heap size for the RTOS.
When I try setting 12000:
#define configTOTAL_HEAP_SIZE ((size_t)12000)
I get a compilation error:
region `RAM' overflowed by 780 bytes Project-STM32 C/C++ Problem
So what's the max I can use?
Look in the linker (.ld) file. You'll see a section defining RAM. That will tell you how much RAM you have, assuming the linker file was properly generated.
The error message you've pasted indicates that the linker went 780 bytes past the end of the available RAM area. In your case (STM32F103R8T6), it tried to place 21260 bytes (20 KB + 780) into RAM, which is defined to fit only 20 KB. If you decrease configTOTAL_HEAP_SIZE by the amount reported by the linker, it'll likely link successfully. There will, however, be 0 bytes left for the regular (non-RTOS) heap, so no malloc or new will succeed, in case any part of your code wanted to use it.
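The arithmetic behind that suggestion can be written out as follows. This is only a sketch of the calculation, using the numbers from the question and the error message; the constant names are made up here, and alignment or padding in a real link may shift the result by a few bytes.

```cpp
#include <cstddef>

// Numbers taken from the question and the linker error message.
constexpr std::size_t kRamSize       = 20 * 1024;  // RAM region: 20 KB
constexpr std::size_t kOverflow      = 780;        // "overflowed by 780 bytes"
constexpr std::size_t kRequestedHeap = 12000;      // configTOTAL_HEAP_SIZE that failed

// Total the linker tried to place into RAM.
constexpr std::size_t kTotalPlaced = kRamSize + kOverflow;

// Everything in RAM that is not the RTOS heap (.data, the rest of .bss, stack).
constexpr std::size_t kOtherSections = kTotalPlaced - kRequestedHeap;

// Largest RTOS heap that still fits, leaving nothing for the regular heap.
constexpr std::size_t kLargestHeap = kRamSize - kOtherSections;

static_assert(kTotalPlaced == 21260, "20 KB + 780 bytes");
static_assert(kLargestHeap == 11220, "12000 - 780");
```

So with everything else unchanged, a configTOTAL_HEAP_SIZE of about 11220 is the upper bound, and anything that needs the regular heap requires going lower still.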
You can determine exactly what gets put into RAM by your linker by analyzing your *.map file (sidenote: map file is created only if your program gets linked successfully, so you need to at least get it to that state). When you open it, search for 20000000 (start of your RAM region) and there you should see what exactly gets put there, including size of each chunk.
Unless you did something out-of-ordinary to your project (which I think is safe to assume you didn't as you mention using generated project), your RAM area during linking will need to at least fit the following sections:
.data segment where things like global variables initialized by value live
.bss segment which is similar to the one above except values are zero-initialized. This is where eventually the byte array of size configTOTAL_HEAP_SIZE will be put that RTOS uses as its own heap
Stack (don't confuse with RTOS stack sizes, this one is totally separate) - stack used outside of RTOS tasks. This has a constant size - consult your sections.ld file to find the value.
Heap segment that has a size calculated dynamically by the linker and which is equal to total size of RAM minus size of all other sections. The bigger you make your other segments, the smaller your regular heap will be.
Having said that, apart from going through the *.map file to determine what else other than the RTOS heap occupies your RAM, I'd also think twice about why you'd need 12KB (out of 20KB total) assigned only to RTOS heap. Things like do you need so many tasks, do they need such large stacks, do you need so many/so large queues/mutexes/semaphores.

Memory warning when using the accelerator for fft

I have posted here a function that I use to compute an FFT with the Accelerate framework:
Setup the accelerator framework for fft on the iPhone
It is working great.
The thing is that I use it in real time, so for each new audio buffer I call this function with the new buffer.
I get a memory warning because of these lines (probably)
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
Questions:
Is there another way besides mallocing them again and again? (Don't forget I have to feed it a new buffer many times a second.)
How exactly do I free them? (code lines)
Can it be caused by the fact that the FFT is heavy on the system?
Any way to get rid of this warning will help me a lot.
Thanks a lot.
These things should be done once, at the start of your program:
Allocate memory for buffers, using code like float *buffer = malloc(NumberOfElements * sizeof *buffer);.
Create an FFT setup, using code like FFTSetup setup = vDSP_create_fftsetup(log2n, FFT_RADIX2);.
Also test the return values. If malloc or vDSP_create_fftsetup returns 0, write an error message and exit the program or handle the error in some other way.
These things should be done once, at the end of your program:
Destroy the FFT setup, using code like vDSP_destroy_fftsetup(setup);.
Release the memory for the buffers, using code like free(buffer);.
In the middle of your program, while you are processing samples, the code should use the existing buffers and setup. So the variables pointing to the buffers and the setup must be visible to that code. You can either pass them in as parameters (perhaps grouped together in a struct) or make them global (which should be only a temporary solution for small programs).
Your program should be arranged so that it is never necessary to allocate memory or create an FFT setup while samples are being processed.
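The arrangement described above can be sketched in C++ as follows. FftWorkspace and its members are hypothetical names, not part of Accelerate, and the even/odd split in load() is only an assumption about how the samples are packed into the real and imaginary buffers. The actual FFT call is left out so the sketch stays portable.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical workspace that owns the split real/imaginary buffers for
// the whole lifetime of the audio session.
class FftWorkspace {
public:
    // Allocation happens exactly once, here.
    explicit FftWorkspace(std::size_t nOver2)
        : realp_(nOver2, 0.0f), imagp_(nOver2, 0.0f) {}

    // Called for every incoming audio buffer of 2 * nOver2 samples.
    // It only overwrites the existing storage and never allocates.
    void load(const float* samples) {
        for (std::size_t i = 0; i < realp_.size(); ++i) {
            realp_[i] = samples[2 * i];      // even-indexed samples
            imagp_[i] = samples[2 * i + 1];  // odd-indexed samples
        }
        // The FFT itself would run here, operating on realp_ and imagp_.
    }

    const float* realp() const { return realp_.data(); }
    const float* imagp() const { return imagp_.data(); }

private:
    std::vector<float> realp_;
    std::vector<float> imagp_;
};
```

The workspace would be created once before audio processing starts and passed to the per-buffer callback. Its destructor releases the buffers, so there is no separate free step to forget.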
All memory that is allocated should be freed eventually.
If you are malloc'ing and never freeing, you will run out of memory. Make sure to 'free' your memory using free().
*Note: free() doesn't actually erase any memory. It simply tells the system that we're done with the memory and it's available for other allocations.
// Example:
// allocating memory
int *intpointer;
intpointer = malloc(sizeof(int));
// ... do stuff...
// 'Freeing' it when you are done
free(intpointer);

Volatile vars and multi-core thread synchronization!

I have several threads executing concurrently and checking a value of a field in their own object. The field is set by the launch thread like this:
for (i = 0; i < ThreadCount; i++)
{
ThreadContext[i].MyField = 1;
}
Within each thread I then check the value of this field:
if (MyField == 1)
{
...//do something
}
However, I noticed that on a 4-core CPU, some of the (4) running threads need several milliseconds or even longer to see the changed MyField. MyField is a single char field. What appears to be happening is that when the memory bus is maxed out by the first thread that detects the change, all other threads may stall for almost the entire duration of the first thread's run (assuming there is enough memory pressure). Only when the first thread eases off on memory (and does more with registers) do the other threads get to see the new value.
I checked the asm and there is no compiler optimization in the way here. The reads go directly to memory. How can this be fixed?
Thanks!
Jam
I got feedback from Intel: Yes, that's how it works (no easy fix).
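For reference, here is a minimal C++ sketch of the same flag pattern using std::atomic instead of a plain or volatile field. The question does not state its language, so the choice of C++ is an assumption, and run_workers is a made-up helper. This does not remove the latency described above. It only makes the cross-thread visibility of the write well-defined by the language rather than dependent on what the compiler happens to emit.

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct ThreadContext {
    std::atomic<char> MyField{0};
};

// Starts `count` workers that each wait on their own flag, then sets every
// flag from the launching thread. Returns how many workers saw the value 1.
int run_workers(int count) {
    std::vector<ThreadContext> contexts(count);
    std::atomic<int> seen{0};
    std::vector<std::thread> workers;

    for (int i = 0; i < count; ++i) {
        workers.emplace_back([&contexts, &seen, i] {
            // Acquire load pairs with the release store below.
            while (contexts[i].MyField.load(std::memory_order_acquire) != 1) {
                std::this_thread::yield();
            }
            seen.fetch_add(1, std::memory_order_relaxed);
        });
    }

    for (int i = 0; i < count; ++i) {
        contexts[i].MyField.store(1, std::memory_order_release);
    }

    for (auto& worker : workers) {
        worker.join();
    }
    return seen.load();
}
```

How quickly each worker observes the store still depends on the hardware, which matches the feedback quoted above.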