pycuda shared memory up to device hard limit

This is an extension of the discussion here: pycuda shared memory error "pycuda._driver.LogicError: cuLaunchKernel failed: invalid value"
Is there a method in pycuda that is equivalent to the following C++ API call?
#define SHARED_SIZE 0x18000 // 96 kbyte
cudaFuncSetAttribute(func, cudaFuncAttributeMaxDynamicSharedMemorySize, SHARED_SIZE)
On a recent GPU (NVIDIA V100), going beyond 48 kbyte of shared memory requires that this function attribute be set. Without it, one gets the same launch error as in the topic above. The "hard" limit on the device is 96 kbyte of shared memory (leaving 32 kbyte for L1 cache).
There's a deprecated method Function.set_shared_size(bytes) that sounds promising, but I can't find what it's supposed to be replaced by.

PyCUDA uses the driver API, and the corresponding call for setting a function's dynamic shared memory limit is cuFuncSetAttribute.
I can't find that anywhere in the current PyCUDA tree, and therefore suspect that it has not been implemented.
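For reference, here is what that driver-API call looks like in plain C against cuda.h, as a minimal sketch of the call a PyCUDA binding would need to expose (the module handle and the kernel name "myKernel" are illustrative):

#include <cuda.h>

#define SHARED_SIZE 0x18000 // 96 kbyte, the V100 per-block maximum

/* Sketch: opt a kernel in to dynamic shared memory beyond the default
   48 kbyte. Assumes mod was loaded earlier with cuModuleLoad. */
int enable_large_shared(CUmodule mod)
{
    CUfunction func;

    if (cuModuleGetFunction(&func, mod, "myKernel") != CUDA_SUCCESS)
        return -1;

    if (cuFuncSetAttribute(func,
                           CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                           SHARED_SIZE) != CUDA_SUCCESS)
        return -1;

    return 0;
}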

I'm not sure if this is what you're looking for, but this might help someone looking in this direction.
The dynamic shared memory size in PyCUDA can be set either with the shared argument in the direct kernel call (the "unprepared call"), for example:
myFunc(arg1, arg2, shared=numBytes, block=(1,1,1), grid=(1,1))
or with the shared_size argument in the prepared kernel call, for example:
myFunc.prepared_call(grid, block, arg1, arg2, shared_size=numBytes)
where numBytes is the amount of memory in bytes you wish to allocate at runtime.

Related

Is the heap preallocated for a process

Since I was introduced to the concept of the heap of a process, I have been assuming that the OS allocates it at the creation of the process. But then I was doing some research and read a statement here.
It says:
When a program asks malloc for space, malloc asks sbrk to increment the heap size and returns a pointer to the start
of the new region on the heap.
If I understood what's been said, the OS allocates zero cells for the process's heap, and it is only by calling malloc that the process gets some heap cells. And to me this makes more sense given the expression "dynamic allocation". Is this correct?
In the typical process memory layout, a C/C++ program has a free memory area between the heap and the stack, and each can grow into it until the region is full. So initially the heap is empty, and when a process calls malloc, it normally calls the sbrk() function to increase the memory size of the heap (in reality it first searches its free linked list, and only calls sbrk() if there is no suitable entry in the list; modern implementations often prefer to call mmap() instead). See this for an implementation of malloc(): malloc implementation?
So the OS doesn't directly decide how the heap of a process should be allocated; that is how things work in C/C++, though I think in other languages things can be slightly different.
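On Linux you can watch this happen from C by sampling the program break around a malloc call. A minimal sketch (the exact numbers depend on the allocator, and a large request served by mmap() will not move the break at all):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);   /* current end of the heap ("program break") */
    void *p = malloc(4096);   /* small request: typically grows the break  */
    void *after = sbrk(0);

    printf("break before: %p\n", before);
    printf("break after:  %p (moved %td bytes)\n",
           after, (char *)after - (char *)before);
    free(p);
    return 0;
}

The first malloc usually moves the break by much more than was asked for, because the allocator grabs a whole arena at once and then hands out pieces of it from its free list.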

What exactly happens when an OS goes into kernel mode?

I find that neither my textbooks nor my googling skills give me a proper answer to this question. I know it depends on the operating system, but on a general note: what happens and why?
My textbook says that a system call causes the OS to go into kernel mode, given that it's not already there. This is needed because kernel mode is what has control over I/O devices and other things outside of a specific process' address space. But if I understand it correctly, a switch to kernel mode does not necessarily mean a process context switch (where you save the current state of the process elsewhere than the CPU so that some other process can run).
Why is this? I was kinda thinking that some "admin" process was switched in, took care of the system call, and sent the result to the process' address space, but I guess I'm wrong. I can't seem to grasp what ACTUALLY happens in a switch to and from kernel mode, and how this affects a process' ability to operate on I/O devices.
Thanks a lot :)
EDIT: bonus question: does a library call necessarily end up in a system call? If not, do you have any examples of library calls that do not end up in system calls? If yes, why do we have library calls at all?
Historically, system calls have been issued with interrupts. Linux used the 0x80 vector and Windows NT used the 0x2E vector to access system calls, storing the function's index in the eax register. More recently, the SYSENTER and SYSEXIT instructions came into use. User applications run in Ring 3, i.e. userspace/usermode. The CPU is very tricky here, and switching from kernel mode to user mode requires special care: it effectively involves making the CPU believe it is returning from an interrupt to usermode when a special instruction called iret is issued. The only ways to get back from usermode to kernelmode are via an interrupt or the already mentioned SYSENTER/SYSEXIT instruction pairs. Both use a special structure called the Task State Segment, or TSS for short. This allows the CPU to find where the kernel's stack is, so yes, it essentially requires a task switch.
But what really happens?
When you issue a system call, the CPU looks up the TSS, gets its esp0 value, which is the kernel's stack pointer, and places it into esp. The CPU then looks up the interrupt vector's index in another special structure, the Interrupt Descriptor Table (IDT for short), and finds the address of the function that handles the system call. The CPU pushes the flags register, the code segment, the user's stack pointer, and the instruction pointer of the next instruction after the int instruction. After the system call has been serviced, the kernel issues an iret. The CPU then returns to usermode and your application continues as normal.
Do all library calls end in system calls?
Well, most of them do, but there are some which don't. For example, take a look at memcpy and the like.
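A quick way to see the difference for yourself is to run a toy program under strace: the memcpy below never enters the kernel, while the direct syscall(2) invocation does. A Linux-specific sketch:

#define _GNU_SOURCE           /* for the syscall() wrapper */
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    char src[] = "hello, kernel\n";
    char dst[sizeof(src)];

    memcpy(dst, src, sizeof(src));   /* pure library call: no trap, runs
                                        entirely in user space            */

    /* A system call issued directly: the wrapper places the call number
       and arguments in registers and executes the trap instruction
       (int 0x80, sysenter, or syscall, depending on the platform). */
    syscall(SYS_write, STDOUT_FILENO, dst, sizeof(src) - 1);
    return 0;
}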

Determining whether Renderscript is running on CPU/GPU & Number of Threads

I can't seem to find any documentation on how to check if RenderScript is actually parallelizing code. I'd like to know whether the CPU or GPU is being utilized and the number of threads dispatched.
The only thing I've found is this bug report:
http://code.google.com/p/android/issues/detail?id=28662
The author mentions that putting rsForEach in the script resulted in it being serialized, pointing to the following debug output:
01-02 00:21:59.960: D/RenderScript(1256): = 0 0x0
01-02 00:21:59.976: D/RenderScript(1256): = 1 0x1
I tried searching for a similar string in LogCat, but I haven't been able to find a match.
Any thoughts?
Update:
Actually I seem to have figured it out. It just seems that my LogCat-fu isn't as good as it should be. I filtered the debug output by my application information and found a line like this:
02-26 22:30:05.657: V/RenderScript(26113): rsContextCreate dev=0x5cec0458
02-26 22:30:05.735: V/RenderScript(26113): 0x5d9f63b8 Launching thread(s), CPUs 2
This will only tell you how many CPUs could be used. It will not indicate how many threads or which processor is being used; by design, RS avoids exposing this information.
In general RS will use all the available CPU cores unless you call a "serial" function such as the rsg* or time functions. As for what criteria will result in a script being punted from the GPU to the CPU, this will vary depending on the abilities of each vendor's GPU.
The bug you referenced was fixed in Android 4.1.
I came across the same issue when I was working with RS. I used a Nexus 5 for my testing. I found that the initial launch of RS utilized the CPU instead of the GPU; I verified this using the Trepn 5.0s application. Later I found that the Nexus 5 GPU doesn't support double precision (see the Adreno 330 specs), so by default the script is ported onto the CPU. To overcome this I put #pragma rs_fp_relaxed at the top of my .rs file, along with the header declarations.
So if you strictly want to run on the GPU, it may be best to look up your mobile GPU's specs, try the trick above, and measure GPU utilization using Trepn 5.0s or an equivalent application. As of now RS doesn't expose thread-level details, but during implementation we can use the x and y arguments of our root/kernel as thread indexes, as sketched below.
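For illustration, a minimal .rs kernel using that pragma might look like the following (a sketch; the package name and the kernel itself are made up, and RenderScript scripts are C99-based):

#pragma version(1)
#pragma rs java_package_name(com.example.rstest)
#pragma rs_fp_relaxed   /* relax IEEE-754 so GPUs without full float
                           support stay eligible to run the script */

uchar4 __attribute__((kernel)) invert(uchar4 in, uint32_t x, uint32_t y)
{
    /* x and y double as per-element "thread indexes", as noted above */
    uchar4 out = in;
    out.r = 255 - in.r;
    out.g = 255 - in.g;
    out.b = 255 - in.b;
    return out;
}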
Debugging properties
RenderScript includes the debug.rs.default-CPU-driver and debug.rs.script debugging properties.
debug.rs.default-CPU-driver
Values = 0 or 1
Default value = 0
If set to 1, the Android Open Source Project (AOSP) implementation of RenderScript Compute is used. This does not use any GPU features.
debug.rs.script
Values = 0 or 1
Default value = 0
If set to 1, additional diagnostic information is printed in the logcat. This information includes the actual device a kernel is running on, either the GPU or the application processor. If a kernel cannot be run on the GPU, more detailed information is provided explaining why. For example:
[RS-DIAG] No support for recursive calls on GPU
(Source: Arm Mali RenderScript Best Practices guide)
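These are ordinary Android system properties, so on a device or emulator where you have a shell they can usually be toggled with something like adb shell setprop debug.rs.script 1 before the app starts, though whether a given device's RenderScript driver honors them is, as far as I know, not guaranteed.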

memory sharing -- between system call & interrupt handler

I read the following link:
Linux Device Driver Program, where the program starts?
According to it, all system calls operate independently of each other.
1> Then how does one share common memory between different system calls and an interrupt handler? There should be some way to allocate memory so that they all have common access to a block of memory.
2> Also, through which pointer should the memory be allocated so that it is accessible by all?
Is there some example which uses driver private data?
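The usual pattern in a Linux driver looks something like this sketch (the names mydev and MYDEV_IRQ are made up): kmalloc a private structure once at init time, hand the same pointer to request_irq() via dev_id, and protect the shared fields with a spinlock so the system-call paths and the interrupt handler can both access them safely.

#include <linux/module.h>
#include <linux/slab.h>
#include <linux/interrupt.h>
#include <linux/spinlock.h>
#include <linux/fs.h>

struct mydev_data {           /* driver-private state shared by both paths */
    spinlock_t lock;          /* protects count against concurrent access  */
    unsigned long count;      /* written by the IRQ handler, read by read()*/
};

static struct mydev_data *mydev;    /* allocated once at load time */

static irqreturn_t mydev_isr(int irq, void *dev_id)
{
    struct mydev_data *d = dev_id;  /* same block, passed via dev_id */

    spin_lock(&d->lock);
    d->count++;
    spin_unlock(&d->lock);
    return IRQ_HANDLED;
}

/* Wiring this into a file_operations table is omitted for brevity. */
static ssize_t mydev_read(struct file *f, char __user *buf,
                          size_t len, loff_t *off)
{
    unsigned long flags, snapshot;

    spin_lock_irqsave(&mydev->lock, flags);  /* keep the ISR out while we read */
    snapshot = mydev->count;
    spin_unlock_irqrestore(&mydev->lock, flags);

    return simple_read_from_buffer(buf, len, off,
                                   &snapshot, sizeof(snapshot));
}

static int __init mydev_init(void)
{
    mydev = kzalloc(sizeof(*mydev), GFP_KERNEL);  /* kernel heap: visible to
                                                     every kernel-mode path */
    if (!mydev)
        return -ENOMEM;
    spin_lock_init(&mydev->lock);
    /* request_irq(MYDEV_IRQ, mydev_isr, 0, "mydev", mydev) would register
       the handler and hand it the same pointer through dev_id. */
    return 0;
}

static void __exit mydev_exit(void)
{
    kfree(mydev);
}

module_init(mydev_init);
module_exit(mydev_exit);
MODULE_LICENSE("GPL");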

Why is syncblk located at -4 and not at 0?

So if you want to look at the sync block for an object, under SOS you have to look 4 bytes (on 32-bit machines) before the object address. Does anyone know what the wisdom is for going back 4 bytes? I mean, they could have the sync block at 0, then the type handle at +4, and then the object fields at +8.
This is an implementation detail, so I can't give you the exact reason for the placement of the syncblock. However, if you look at the Shared Source CLI, you'll see that the runtime has all sorts of optimizations for how objects are allocated and used, and the data associated with a single instance is actually located in several different places. The syncblock, for instance, is just an index value for a structure located elsewhere. Similarly, the MethodTable and the EEClass are stored elsewhere. These are all implementation details. The important point, IMO, is understanding how to dig out the information needed during debugging. It is much less important to understand why the implementation details are as they are.
I'd say it matches expectations, especially for structs that have been explicitly laid out, where field offsets are expected to start at 0. As Brian says, it's just an implementation detail though. It's similar to how many implementations of malloc allocate more space than requested, store the allocation size in the first four (or eight) bytes, and then return a pointer offset to point to the next byte beyond that.
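That malloc analogy is easy to demonstrate with a toy allocator in C that hides its bookkeeping at a negative offset from the pointer it returns (a sketch; real allocators store more than just the size):

#include <stdio.h>
#include <stdlib.h>

static void *my_alloc(size_t size)
{
    size_t *block = malloc(sizeof(size_t) + size);
    if (!block)
        return NULL;
    block[0] = size;      /* header lives at a negative offset...       */
    return block + 1;     /* ...relative to the pointer the caller sees */
}

static size_t my_alloc_size(void *p)
{
    return ((size_t *)p)[-1];    /* peek "back 4 (or 8) bytes" */
}

static void my_free(void *p)
{
    free((size_t *)p - 1);       /* step back to the real start */
}

int main(void)
{
    char *s = my_alloc(32);
    printf("usable size: %zu\n", my_alloc_size(s));   /* prints 32 */
    my_free(s);
    return 0;
}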