Determining whether RenderScript is running on CPU/GPU & number of threads - renderscript

I can't seem to find any documentation on how to check if RenderScript is actually parallelizing code. I'd like to know whether the CPU or GPU is being utilized and the number of threads dispatched.
The only thing I've found is this bug report:
http://code.google.com/p/android/issues/detail?id=28662
The author mentions that putting rsForEach in the script resulted in it being serialized, pointing to the following debug output:
01-02 00:21:59.960: D/RenderScript(1256): = 0 0x0
01-02 00:21:59.976: D/RenderScript(1256): = 1 0x1
I tried searching for a similar string in LogCat, but I haven't been able to find a match.
Any thoughts?
Update:
Actually I seem to have figured it out. It just seems that my LogCat foo isn't as good as it should be. I filtered the debug output by my application information and found a line like this:
02-26 22:30:05.657: V/RenderScript(26113): rsContextCreate dev=0x5cec0458
02-26 22:30:05.735: V/RenderScript(26113): 0x5d9f63b8 Launching thread(s), CPUs 2

This will only tell you how many CPUs could be used. It will not indicate how many threads are dispatched or which processor is being used; by design, RS avoids exposing this information.
In general, RS will use all available CPU cores unless you call a "serial" function such as the rsg* or time functions. As for the criteria that cause a script to be punted from the GPU to the CPU, this varies with the capabilities of each vendor's GPU.
The bug you referenced has been fixed in Android 4.1.

I came across the same issue when I was working with RS. I used a Nexus 5 for my testing and found that the initial launch of RS utilized the CPU instead of the GPU; this was verified using the Trepn 5.0 application. Later I found that the Nexus 5 GPU doesn't support double precision (see the Adreno 330 specs), so by default the work is ported back to the CPU. To overcome this I added #pragma rs_fp_relaxed at the top of my .rs file, along with the header declarations.
So if you strictly want to run on the GPU, the best way may be to find out your mobile GPU's specs, try the trick above, and measure GPU utilization using Trepn 5.0 or an equivalent application. As of now RS doesn't expose thread-level details, but during implementation we can use the x and y arguments of our root/kernel function as thread indexes.
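For illustration, here is a minimal .rs kernel sketch (the package name and kernel body are hypothetical, and it uses the RS_KERNEL style rather than a root() function) showing both the rs_fp_relaxed pragma and the per-cell x/y arguments that can serve as thread indexes:

#pragma version(1)
#pragma rs java_package_name(com.example.rsdemo)  // hypothetical package name
#pragma rs_fp_relaxed  // allow relaxed float precision so the kernel can stay on the GPU

// Each cell invocation receives its coordinates in x and y, which can act as
// per-element "thread" indexes.
uchar4 RS_KERNEL invert(uchar4 in, uint32_t x, uint32_t y) {
    uchar4 out = in;
    out.r = 255 - in.r;
    out.g = 255 - in.g;
    out.b = 255 - in.b;
    return out;
}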

Debugging properties
RenderScript includes the debug.rs.default-CPU-driver and debug.rs.script debugging properties.
debug.rs.default-CPU-driver
Values = 0 or 1
Default value = 0
If set to 1, the Android Open Source Project (AOSP) implementation of RenderScript Compute is used. This does not use any GPU features.
debug.rs.script
Values = 0 or 1
Default value = 0
If set to 1, additional diagnostic information is printed in the logcat. This information includes the actual device a kernel is running on, either GPU or application processor.
If a kernel cannot be run on the GPU, more detailed information is provided explaining why. For example:
[RS-DIAG] No support for recursive calls on GPU
Arm® Mali™ RenderScript Best Practices (PDF)

Related

ThreadX RAM issue on STM32

I'm currently starting to use ThreadX on a STM32 Nucleo-H723ZG (STM32H723ZG MCU).
I noticed that when loading the Nx_TCP_Echo_Server / Nx_TCP_Echo_Client projects from CubeMX, the RAM gets filled up pretty much to the top, which makes me wonder how I'm supposed to add my own code and data.
Since I'm pretty new to RAM partitioning, RTOSes and the like, I don't have a good feeling for what is right or wrong and how to proceed (or whether it is a problem at all).
Nevertheless I wonder whether, by partitioning the RAM differently or by dropping some unnecessary code parts, the RAM could be freed up.
Or a different way of thinking:
Since RAM_D1 got filled, but RAM_D2, RAM_D3 and DTCMRAM are pretty much empty, is there a way to use that free RAM for my own purposes? (I would like to let SPI and ADC processing run via DMA, so this needs a place to go ...)
Hope my questions are not too confusing ;)
The system has the following amount of RAM, according to STM:
"SRAM: total 564 Kbytes all with ECC, including 128 Kbytes of data TCM RAM for critical real-time data + 432 Kbytes of system RAM (up to 256 Kbytes can remap on instruction TCM RAM for critical real time instructions) + 4 Kbytes of backup SRAM (available in the lowest-power modes)" (see STMs STM32H723ZG MCU product page)
Down below you'll find screenshots of the current RAM usage; for RAM_D1 especially, .tcp_sec eats up most of the RAM.
--> Can .tcp_sec be optimized or kicked out?
If tcp here refers to the TCP protocol, maybe that is a way to optimize, since I'm not sure whether I need a handshake etc.; maybe UDP is sufficient (and faster for the ADC data streaming) ... what do you say?
Edit:
The linker file shows that .tcp_sec is declared with (NOLOAD) ... is NOLOAD maybe a hint at a "placebo" RAM occupation (pre-allocation / reservation, but no actual usage)?
Linker-script extract:
/* User_heap_stack section, used to check that there is enough RAM left */
._user_heap_stack :
{
  . = ALIGN(8);
  PROVIDE ( end = . );
  PROVIDE ( _end = . );
  . = . + _Min_Heap_Size;
  . = . + _Min_Stack_Size;
  . = ALIGN(8);
} >RAM_D1

.tcp_sec (NOLOAD) : {
  . = ABSOLUTE(0x24048000);
  *(.RxDecripSection)
  . = ABSOLUTE(0x24048060);
  *(.TxDecripSection)
} >RAM_D1 AT> FLASH
For context:
I am developing a "system controller"; the plan is to have it run an RTOS which manages reading in analog values, writing control messages via SPI to two other STMs of the same kind, and communicating via Ethernet with my desktop application.
The desktop application is then in charge of post-processing the digitized analog values and sending control messages to the system controller. In the best case the system controller digitizes the analog signal on ADC3 at 5 MSPS (at probably 6-bit resolution = 30 Mbit/s) and sends that data hiccup-free to my desktop application.
-> Is this plan possible on this MCU?
I tried to buy a version of the Nucleo with more RAM, but due to shortages this one is the best I was able to get.
For the RTOS I'd like to stick with ThreadX, since FreeRTOS support in STM32CubeIDE seems to be phased out now that ThreadX has been adopted by ST as the RTOS.
(I like the easy register configuration using CubeMX/STM32CubeIDE, hence my drive to use that SW universe ... if there are good reasons to use a different RTOS, tell me :) )
Thank you for your time!
I generated the same project on my side and took a look. I believe you should be able to implement what you want on this CPU; you will just need to use the available memory carefully.
It seems there is some confusion about the .tcp_sec section. It contains the DMA reception and transmission descriptors for the Ethernet controller/driver. These are constrained by the driver and hardware to sit at a specific address. The descriptors themselves are rather small, but the buffers are bigger; with some work they can be reduced. If you are using Ethernet you will need this section, no matter whether you use TCP or not. As I said, the name can be confusing.
The flash still has plenty of space available. In the debug configuration only about 11% is used; the rest is available for your application code.
You can locate your data in other memory regions. How you tell the compiler/linker where your data goes depends on the toolchain you use. You can look towards the top of the main.c file in that example to see how the DMA descriptors are assigned to a specific section for three different toolchains (IAR, Arm MDK, GCC).
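As a rough sketch (GCC syntax; the section name, buffer names and sizes below are made up, and a matching output section mapped to RAM_D2 would have to be added to the linker script), placing large buffers outside RAM_D1 could look like this:

#include <stdint.h>

/* ADC sample buffer placed in RAM_D2 so it does not consume RAM_D1.
   Assumes the linker script defines an output section ".ram_d2_data"
   that is mapped to the RAM_D2 region. */
__attribute__((section(".ram_d2_data"), aligned(32)))
static uint16_t adc_dma_buffer[4096];

/* SPI transmit buffer, placed the same way. */
__attribute__((section(".ram_d2_data"), aligned(32)))
static uint8_t spi_tx_buffer[1024];

Keep in mind that on the STM32H7 the individual DMA controllers can only reach certain RAM domains, and that the data cache (or the MPU configuration) has to be handled for DMA buffers, so check the reference manual before moving them.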
In terms of how to most efficiently use and configure the microcontroller peripherals, please get in touch with STMicro; they will know best.
This should get you started. Let us know if this helps!

indexing problem when calling fit() function

I'm currently working on a project for a NN to play a game similar to Atari games (more details in the link). I'm having trouble with indexing; perhaps someone knows what the problem could be, because I can't seem to find it. Thank you for your time. Here's my code (click on the link) and here's the full traceback. The problem starts from the way I call
history = network.fit(state, epochs=10, batch_size=10) // in line 82
See this post: Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
As said in the correct answer,
Modern CPUs provide a lot of low-level instructions, besides the usual arithmetic and logic, known as extensions, e.g. SSE2, SSE4, AVX, etc. (see the Wikipedia article on Advanced Vector Extensions).
The warning states that your CPU does support AVX (hooray!).
Pretty much, AVX speeds up your training, etc. Sadly, TensorFlow is saying that it isn't going to use it ... why?
Because the default TensorFlow distribution is built without CPU extensions such as SSE4.1, SSE4.2, AVX, AVX2, FMA, etc. The default builds (the ones from pip install tensorflow) are intended to be compatible with as many CPUs as possible. Another argument is that even with these extensions a CPU is a lot slower than a GPU, and medium- and large-scale machine-learning training is expected to be performed on a GPU.
What should you do?
If you have a GPU, you shouldn't care about AVX support, because most expensive ops will be dispatched on a GPU device (unless explicitly set not to). In this case, you can simply ignore this warning by:
# Just disables the warning, doesn't enable AVX/FMA
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
If you don't have a GPU and want to utilize the CPU as much as possible, you should build TensorFlow from source, optimized for your CPU, with AVX, AVX2 and FMA enabled if your CPU supports them. This has been discussed in this question and also this GitHub issue. TensorFlow uses an ad-hoc build system called Bazel, and building it is not that trivial, but it is certainly doable. After this, not only will the warning disappear, TensorFlow performance should also improve.
You can find all the details and comments in this StackOverflow question.
NOTE: This answer is a product of my professional copy-and-pasting.
Happy coding,
Bobbay
Has the code been debugged line by line? That would trace it to the line causing the error.
I assume the index error crops up from the line below, where i and then targets[i] and outs[i] can be checked for the values they hold:
per_sample_losses = loss_fn.call(targets[i], outs[i])

Unaligned accesses are not detected by Raspberry PI version 1

I'm performing a set of activities to make sure Redis runs well on a set of embedded systems, including the Raspberry Pi. In order to fix certain code paths of Redis where unaligned memory accesses are performed (due to a change introduced in Redis 3.2), I'm trying to force the Pi to either log a message on unaligned memory accesses or send a signal to the process when they happen. This way I can both make sure that Redis will run well where unaligned accesses are a violation, and that it will run faster on platforms where such accesses are allowed but slower. ARM v6, the one used in the Pi v1, is apparently able to deal with unaligned memory accesses, so if I use the following command to configure Linux to send a signal to the process performing the unaligned access:
echo 4 > /proc/cpu/alignment
and then run the following program:
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv) {
    char *buf = "foobareklsjdfklsjdfslkjfskdljfskdfjdslkjfdslkjfsd";
    uint32_t *l = (uint32_t*) (buf+1);
    printf("%p\n", l);
    printf("%d\n", (int)*l);
    return 0;
}
I can't see any signal received by the process, nor do the counters at /proc/cpu/alignment increment.
My guess is that this is due to ARM v6's ability to deal with unaligned addresses automatically, if a given CPU configuration flag is set. My question is: is my hypothesis correct? And if so, how can I force a Pi version 1 to actually raise an exception on unaligned accesses, so that the Linux kernel can trap them and send a signal, log the access, and so forth, according to the /proc/cpu/alignment settings?
EDIT: It is worth noting that not all instructions can perform unaligned accesses, even on ARM v6. For instance STMDB, STMFD, LDMDB, LDMEA and similar multiple-word instructions will indeed raise an exception and will be trapped by the Linux kernel.
I think I eventually found my answers:
1. Yes, I was correct: up to the word size, ARM v6 (or greater) can silently handle unaligned accesses, so no trap is generated and the whole thing is completely transparent to the Linux kernel. Nothing is logged, nor is the traps counter in /proc/cpu/alignment incremented.
2. AFAIK there is no way to force the kernel to trap word-sized unaligned accesses, since to do that the CPU would have to be configured to trap unaligned addresses in every case, and the Linux kernel does not do that, probably because there is alignment-unsafe code inside the kernel itself. Checking the Linux kernel source code, one can indeed see:
if (cpu_is_v6_unaligned()) {
    set_cr(__clear_cr(CR_A));
    ai_usermode = safe_usermode(ai_usermode, false);
}
What this means is that the SCTLR.A bit is always cleared, so no trap will be generated for the unaligned accesses ARM v6 can handle.
3. A great number of instructions will still generate traps when used with unaligned addresses, for example multiple load/store instructions and loads and stores of double values.
4. However, there are instructions that GCC (the version shipped in the default Raspberry Pi Linux distribution) will happily produce that are not handled correctly by the Linux kernel; they result in a SIGBUS even when /proc/cpu/alignment is set to fix up the access.
So point number 4 basically means that it is not a good idea to fix programs to run on ARM v6 by just letting the Linux kernel handle unaligned addresses for us, even when the performance implications of unaligned accesses are not a problem: the program can still crash, since not all instructions are handled.
How to reliably find all the unaligned accesses in a program remains an open question AFAIK, since unfortunately the otherwise wonderful Valgrind never implemented this feature. In the past I had to use QEMU emulating SPARC, but that is a very slow process. Valgrind would be the trivial way to do it.
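To make points 3 and 4 concrete, here is a hedged sketch of a counterpart to the program in the question: a 64-bit load through a misaligned pointer is likely (depending on compiler and flags) to be compiled to LDRD or LDM, which does fault on ARM v6 and is then either fixed up or turned into a SIGBUS according to the /proc/cpu/alignment settings.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    char buf[32] = "0123456789abcdef";
    uint64_t *p = (uint64_t *)(buf + 1);      /* deliberately misaligned */
    printf("%llu\n", (unsigned long long)*p); /* may trap on ARM v6 */
    return 0;
}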

Where is the mode bit?

I just read this in "Operating System Concepts" from Silberschatz, p. 18:
A bit, called the mode bit, is added to the hardware of the computer
to indicate the current mode: kernel(0) or user(1). With the mode bit,
we are able to distinguish between a task that is executed on behalf
of the operating system and one that is executed on behalf of the
user.
Where is the mode bit stored?
(Is it a register in the CPU? Can you read the mode bit? As far as I understand it, the CPU has to be able to read the mode bit. How does it know which program gets mode bit 0? Do programs at a special address get mode bit 0? Who sets the mode bit / how is it set?)
Please note that your question depends highly on the CPU itself; though it's uncommon, you might come across certain processors where this concept of user level/kernel level does not even exist.
The cs register has another important function: it includes a 2-bit
field that specifies the Current Privilege Level (CPL) of the CPU. The
value 0 denotes the highest privilege level, while the value 3 denotes
the lowest one. Linux uses only levels 0 and 3, which are respectively
called Kernel Mode and User Mode.
(Taken from "Understanding the Linux Kernel 3e", section 2.2.1)
Also note that this depends on the CPU, as you can clearly see, and it will change from one CPU to another, but the concept generally holds.
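As a small illustration of that quote (x86, GCC/Clang inline assembly; reading cs from user space is allowed), a process can inspect its own privilege level:

#include <stdio.h>

int main(void) {
    unsigned short cs;
    /* The low two bits of the cs selector hold the Current Privilege Level:
       0 = kernel mode, 3 = user mode on x86 Linux. */
    __asm__ volatile ("mov %%cs, %0" : "=r" (cs));
    printf("CPL = %u\n", cs & 3);   /* prints 3 for an ordinary process */
    return 0;
}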
Who sets it? Typically the kernel/CPU; a user process cannot change it. But let me explain something here.
**This is an over-simplification, do not take it as it is**
Let's assume that the kernel is loaded and the first application has just started (the first shell). The kernel loads everything this application needs to start, sets the bits in the cs register (if you are running x86) and then jumps to the code of the shell process.
The shell will continue to execute all of its instructions in this context. If the process contains a privileged instruction, the CPU will fetch it but won't execute it; it raises a (hardware) exception that tells the kernel someone tried to execute a privileged instruction, and the kernel code handles the job (the CPU sets cs to kernel mode and jumps to a known location that handles this type of error, maybe terminating the process, maybe something else).
So how can a process do something privileged? Talking to a certain device for instance?
Here come the system calls; the kernel will do this job for you.
What happens is the following:
You set up what you want in a certain place (for instance you set that you want to access a file, the file location is x, you are accessing it for reading, etc.) in some registers (the kernel documentation will tell you about this) and then (on x86) you execute the int 0x80 instruction.
This interrupts the CPU, stops your work, sets the mode to kernel mode and jumps the instruction pointer to a known location that holds the code serving file I/O requests, and execution continues from there.
Once your data is ready, the kernel puts it in a place you can access (a memory location or register; it depends on the CPU/kernel/what you requested), sets cs back to user mode and jumps back to the instruction following your int 0x80.
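A minimal sketch of that flow (32-bit x86 Linux, GCC inline assembly; syscall number 4 is write on i386) might look like this: the int 0x80 switches to kernel mode, the kernel performs the write, and execution resumes in user mode with the result in eax.

int main(void) {
    const char msg[] = "hello from user mode\n";
    long ret;
    __asm__ volatile (
        "int $0x80"
        : "=a" (ret)                 /* result comes back in eax */
        : "a" (4),                   /* __NR_write on 32-bit x86 */
          "b" (1),                   /* fd = stdout */
          "c" (msg),                 /* buffer */
          "d" (sizeof(msg) - 1)      /* length */
        : "memory"
    );
    return (ret < 0) ? 1 : 0;
}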
Finally, this happens whenever a switch happens: the kernel gets notified that something happened, the CPU terminates your current instruction, changes the CPU status and jumps to the code that handles it. The process explained above is, roughly speaking, how a switch between kernel mode and user mode happens.
It's a CPU register. It's only accessible if you're already in kernel mode.
The details of how it gets set depend on the CPU design. In most common hardware, it gets set automatically when executing a special opcode that's used to perform system calls. However, there are other architectures where certain memory pages may have a flag set that indicates that they are "gateways" to the kernel -- calling a function on these pages sets the kernel mode bit.
These days it's given other names such as Supervisor Mode or a protection ring.

How to determine SSE prefetch instruction size?

I am working with code which contains inline assembly for SSE prefetch instructions. A preprocessor constant determines whether the instructions for 32-, 64- or 128-byte prefetches are used. The application is used on a wide variety of platforms, and so far I have had to investigate in each case which is the best option for the given CPU. I understand that this is the cache line size. Is this information obtainable automatically? It doesn't seem to be explicitly present in /proc/cpuinfo.
I think your question is related to this question or this one. I think it is clear that, unless you can rely on an OS or library function, you will want to use the CPUID instruction; the question then becomes exactly what information you are looking for, and of course AMD's and Intel's implementations don't have to agree. This page suggests using CPUID.1:EBX[15:8] (i.e., BH) on Intel and function 80000005h on AMD. In addition, on Intel, CPUID.2 seems to contain the relevant information, but it looks like a real pain to parse out the desired data.
I think, from what I've read, both AMD's and Intel's CPUID instructions support CPUID.1:EBX[15:8], which returns the size of one cache line in quadwords as used by the CLFLUSH instruction (which isn't present on all processors, so I don't know whether you'll always find something there). So, after executing CPUID.1, you'd have to multiply BH by 8 to get the cache line size in bytes. This hinges on my implicit assumption (can anyone say whether it is really valid?) that the cache line size is always the same for CLFLUSH and the PREFETCHh instructions.
Also, Intel's manuals state that PREFETCHh is only a hint, but that, if it prefetches anything, it will always fetch a minimum of 32 bytes.
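As a sketch of the CPUID.1 approach (GCC/Clang on x86, using the compiler-provided <cpuid.h> helper; whether this value really matches the line size PREFETCHh works on is the same assumption discussed above):

#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 1 not supported\n");
        return 1;
    }
    /* EBX bits 15:8 give the CLFLUSH line size in quadwords (8-byte units). */
    unsigned int line_bytes = ((ebx >> 8) & 0xff) * 8;
    printf("CLFLUSH line size: %u bytes\n", line_bytes);
    return 0;
}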
EDIT1:
Another useful resource (even if it does not directly answer your question) for the optimised use of PREFETCHh is Intel's optimisation manual, here.