CUSP host function parallelization with openMP - cusp-library

Is the CUSP host function also implemented in parallel with OpenMP?
I am a bit confused, since the GitHub discussion says "added functionality in openMP", but there is no clear statement in CUSP that it is fully implemented with OpenMP.

CUSP is built on Thrust.
Thrust can take advantage of OpenMP, for example as a device backend, and Thrust can now re-target the host backend as well; the backend is selected at compile time, e.g. via -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP.
This means that if you properly select the Thrust OMP backend, say for device activities, I think CUSP should respect this and perform "device" operations on the host using OpenMP.


How do I add a missing peripheral register to a STM32 MCU model in Renode?

I am trying out this MCU / SoC emulator, Renode.
I loaded their existing model template under platforms/cpus/stm32l072.repl, which just includes the repl file for stm32l071 and adds one little thing.
When I then load & run a program binary built with STM32CubeIDE and ST's LL library, and the code hits the initial SystemClock_Config() function, where the Flash:ACR register is probed in a loop to observe an expected change in value, it gets stuck there, as the Renode Monitor window outputs:
[WARNING] sysbus: Read from an unimplemented register Flash:ACR (0x40022000), returning a value from SVD: 0x0
This seems to be expected; not all existing templates model everything out of the box. I also found that the stm32l071 model is missing some of the USARTs and NVIC channels. I saw how the latter might probably be added, but there seems to be not a single one among the default models that defines the Flash:ACR register, which I could use as an example.
How would one add such a missing register for this particular MCU model?
Note1: For this test, I'm using a STM32 firmware binary which works as intended on actual hardware, e.g. a devboard for this MCU.
Note2:
The stated advantage of Renode over QEMU, which apparently does not emulate peripherals, is that it also allows sticking together a more complex system out of mocked external devices, e.g. I2C and other devices (apparently C# modules, I have not yet looked into it).
They say "use the same binary as on the real system".
Which is my reason for trying this out - it sounds like a lot of potential for implementing systems where the hardware is not yet fully available, and also for automated testing.
So the obvious workaround, commenting out a lot of parts of the init code to only test some hardware-independent code while sidestepping such issues, would defeat the purpose here.
If you want to just provide the ACR register for the flash to pass your init, use a tag.
You can either provide it via REPL (recommended, like here https://github.com/renode/renode/blob/master/platforms/cpus/stm32l071.repl#L175) or via RESC.
Assuming that your software would like to read the value 0xDEADBEEF, in the REPL you'd use:
sysbus:
    init:
        Tag <0x40022000, 0x40022003> "ACR" 0xDEADBEEF
In the resc or in the Monitor it would be just:
sysbus Tag <0x40022000, 0x40022003> "ACR" 0xDEADBEEF
If you want more complex logic, you can use a Python peripheral, as described in the docs (https://renode.readthedocs.io/en/latest/basic/using-python.html#python-peripherals-in-a-platform-description):
flash: Python.PythonPeripheral @ sysbus 0x40022000
    size: 0x1000
    initable: false
    filename: "script_with_complex_python_logic.py"
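For reference, here is a minimal sketch of what such a script could look like, assuming the request object described in the Renode Python-peripherals documentation linked above; the echo-back behaviour (so that a read-back polling loop on Flash:ACR terminates) is only an illustration, not a faithful FLASH model:
```python
# Hypothetical contents of script_with_complex_python_logic.py for the 'flash'
# peripheral declared above. Renode runs this body on every register access and
# exposes the access through the 'request' object.
if request.isInit:
    lastVal = 0                   # storage created when the peripheral is initialised
elif request.isWrite:
    lastVal = request.value       # remember what the firmware wrote (e.g. latency bits)
elif request.isRead:
    request.value = lastVal       # echo it back, so a read-back polling loop terminates
```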
If you really need an advanced implementation, then you need to create a complete C# model.
As you correctly mentioned, we do not want you to modify your binary. But we're ok with mocking some parts we're not interested in for a particular use case if the software passes with these mocks.
Disclaimer: I'm one of the Renode developers.

indexing problem when calling fit() function

I'm currently working on a project of an NN to play a game similar to Atari games (more details in the link). I'm having trouble with the indexing; perhaps anyone knows what could be the problem, because I can't seem to find it. Thank you for your time. Here's my code (click on the link) and here's the full traceback. The problem starts from the way I call
history = network.fit(state, epochs=10, batch_size=10) // in line 82
See this post: Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
As said in the correct answer,
Modern CPUs provide a lot of low-level instructions, besides the usual arithmetic and logic, known as extensions, e.g. SSE2, SSE4, AVX, etc. (see Wikipedia).
The warning states that your CPU does support AVX (hooray!).
Pretty much, AVX speeds up your training, etc. Sadly, tensorflow is saying that it isn't going to use it... Why?
Because tensorflow default distribution is built without CPU extensions, such as SSE4.1, SSE4.2, AVX, AVX2, FMA, etc. The default builds (ones from pip install tensorflow) are intended to be compatible with as many CPUs as possible. Another argument is that even with these extensions CPU is a lot slower than a GPU, and it's expected for medium- and large-scale machine-learning training to be performed on a GPU.
What should you do?
If you have a GPU, you shouldn't care about AVX support, because most expensive ops will be dispatched on a GPU device (unless explicitly set not to). In this case, you can simply ignore this warning by:
# Just disables the warning, doesn't enable AVX/FMA
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
If you don't have a GPU and want to utilize the CPU as much as possible, you should build tensorflow from source, optimized for your CPU, with AVX, AVX2, and FMA enabled if your CPU supports them. It's been discussed in this question and also this GitHub issue. Tensorflow uses an ad-hoc build system called bazel and building it is not that trivial, but it is certainly doable. After this, not only will the warning disappear, tensorflow performance should also improve.
You can find all the details and comments in this StackOverflow question.
NOTE: This answer is a product of my professional copy-and-pasting.
Happy coding,
Bobbay
Has the code been debugged line by line? That would trace down the line causing the error.
I assume the index error crops up from the line below, where "i", and further targets[i] and outs[i], can be checked for the values they have:
per_sample_losses = loss_fn.call(targets[i], outs[i])
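As a rough, self-contained illustration of that kind of check (the toy data, the model shape, and the names state/labels are placeholders, not taken from the linked code):
```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the asker's data and model, only to show the shape check.
state = np.random.rand(100, 8).astype("float32")    # 100 samples, 8 features each
labels = np.random.rand(100, 4).astype("float32")   # one target row per sample

network = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(4),
])
network.compile(optimizer="adam", loss="mse")

# The first dimensions must agree, and fit() needs a target for every model output,
# otherwise the per-output loss loop (targets[i], outs[i]) can index past the end.
print("inputs:", state.shape, "targets:", labels.shape)

history = network.fit(state, labels, epochs=10, batch_size=10)
```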

Porting word2vec to RISC-V.. potential proxy kernel issue?

We are trying to port word2vec to RISC-V. Towards this end, we have compiled word2vec with a cross compiler and are trying to run it on Spike.
The cross compiler compiles the standard RISC-V benchmarks and they run without failure on Spike, but when we use the same setup for word2vec, it fails with "bad syscall #179!". We tried two different versions; both fail around the same place, a minute or two into the run, while executing these instructions. After going through the loop several hundred thousand times, we see C1, C2 printed and then the crash. We are thinking this is more of a Spike/pk issue than a word2vec issue.
Has anyone had similar experiences when porting code to RISC-V? Any ideas on how we might track down whether it's the proxy kernel?
A related question is about getting gdb working with Spike.. will post that separately.
Thank you.
The riscv-pk does not support all possible syscalls. You'll need to track down which syscall it is and whether you can implement it in riscv-pk, or whether you need to move to running the program on a different kernel. For example, riscv-pk does not support any threading-related syscalls, as multithreaded kernel support is an explicit riscv-pk non-goal.
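As a first step, it can help to put a name to the syscall number. A small sketch that looks it up in the generic Linux syscall table shipped with a RISC-V toolchain (the header path is an assumption; adjust it to your installation):
```python
import pathlib
import re

# Assumed install location of the generic Linux syscall numbers; adjust as needed.
header = pathlib.Path("/opt/riscv/sysroot/usr/include/asm-generic/unistd.h")

wanted = 179  # the number reported by "bad syscall #179!"
for line in header.read_text().splitlines():
    match = re.match(r"#define\s+__NR(?:3264)?_(\w+)\s+(\d+)\b", line)
    if match and int(match.group(2)) == wanted:
        print("syscall", wanted, "is", match.group(1))
```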
I would also be wary of using riscv-pk in general. It's a very simple, thin kernel, which is great for running newlib user applications in the beginning, but it lacks rigorous testing and validation efforts, so running applications that stress virtual-memory systems, rely on lots of syscalls (ioctl and friends), or expect a more glibc-like environment may prove problematic.

Alternative to MQL5

I am starting with Expert Advisors on the MetaTrader Terminal software and I have many algorithms to use with it. These algorithms were developed in MATLAB using its powerful built-in functions ( e.g. svd, pinv, fft ).
To test my algorithms I have some alternatives:
1. Write all the algorithms in MQL5.
2. Write the algorithms in C++ and then make a DLL to be called from MQL5.
3. Write the algorithms in Python, embed them in C, and then make a DLL.
4. Convert the MATLAB source code to C and then make a DLL.
About the problems:
1. Impractical, because MQL5 does not have those built-in functions, so I would have to implement them one by one by hand.
2. I still did not try this, but I think it would take a long time to implement the algorithms ( I wrote some algorithms in C, but it took a good amount of time and the result wasn't as fast as MATLAB ).
3. I am getting a lot of errors when compiling to a DLL, but if I compile to an executable there is no error ( this would be a good alternative, since converting MATLAB to Python is quite simple and fast to do ).
4. I am trying this now, but I think there is a lot of work to do.
I researched other similar pieces of software like MetaTrader Terminal, but I didn't find a good one.
I would like to know if there is a simple ( and fast ) way to embed another language into MQL5, or some alternative to my issue.
Thanks.
Yes, there is an alternative ... 5 ) Go Distributed:
Having a similar motivation for using non-MQL4 code for fast & complex mathematics in external quantitative models for FX trading, I have started to use both { MATLAB | python | ... } and MetaTrader Terminal environments in an interconnected form of a heterogeneous distributed processing system.
MQL4 part is responsible for:
anAsyncFxMarketEventFLOW processing
aZmqInteractionFRAMEWORK setup and participation in message-patterns handling
anFxTradeManagementPOLICY processing
anFxTradeDetectorPolicyREQUESTOR sending analysis RQST-s to remote AI/ML-predictor
anFxTradeEntryPolicyEXECUTOR processing upon remote node(s) indication(s)
{ MATLAB | python | ... } part is responsible for:
aZmqInteractionFRAMEWORK setup and participation in message-patterns handling
anFxTradeDetectorPolicyPROCESSOR receiving & processing analysis RQST-s from the remote { MQL4 | ... } requestor
anFxTradeEntryPolicyREQUESTOR sending trade entry requests to remote { MQL4 | other-platform | ... }-market-interfacing-node(s)
Why start thinking in a Distributed way?
The core advantage is in re-using the strengths of MATLAB and other COTS AI/ML packages, without any need to reverse engineer the still-creeping MQL4 interfacing options ( yes, in the last few years, DLL interfaces have taken several dirty hits from newer updates ( strings ceased to be strings and started to become a struct (!!!) etc. ) -- many man*years of pain with a code base under maintenance, so there is some unforgettable experience of what ought to be avoided ... ).
The next advantage is to become able to add failure resilience. A distributed system can work in a ( 1 + N ) protected shadowing mode.
The next advantage is to become able to increase performance. A distributed system can provide a pool of processors - be it in a { SEQ | PAR } mode of operations ( a pipeline process or a parallel-form process execution ).
MATLAB node just joins:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% MATLAB script to setup
%%                                                 zeromq-matlab
clear all;
if ~ispc
    s1 = zmq( 'subscribe', 'ipc', 'MATLAB' );     %% using IPC transport on <localhost>
else
    disp( '0MQ IPC not supported on Windows.' )
    disp( 'Setup TCP transport class instead' )
    disp( 'Setting up TCP' )                      %% using TCP transport on <localhost>
    s1 = zmq( 'subscribe', 'tcp', 'localhost', 5555 );
end
recv_data1 = [];                                  %% setup RECV buffer
This said, one can preserve the strengths on each side and avoid any form of duplication of already implemented, high-performance tuned native libraries, while the distributed mode of operations also adds some brand new potential benefits to the Expert Advisor modus operandi.
one may add a remote keyboard interface to an EA automation and use some custom-specific commands ( CLI )
a fast, non-blocking, distributed remote logging
GPU / GPU-grid computing being used from inside MetaTrader Terminal
you may like to check other posts on extending the MetaTrader Terminal programming model
A Distributed System, on top of a Communication Framework:
MATLAB already has an available port of the ZeroMQ Communication Framework, the same one that MetaTrader Terminal has, thanks to Austin CONRAD's wrapper ( though the MQH interfaces to a ver 2.1.11 DLL, the services needed work like a charm ), so you are ready to use it straight away on each side, and these types of nodes are ready to join their respective roles in any form one can design into a truly heterogeneous distributed system.
My recent R&D uses several instances of python-side processes to operate an AI/ML predictor, r/KBD, r/RealTimeANALYSER and a centralised r/LOG service, which are actively used, over many PUSH/PULL + XREQ/XREP + PUB/SUB Scalable Formal Communication Patterns, from several instances of MetaTrader Terminal-s by their respective MQL4 code.
MATLAB functions could be re-used in the same way.
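To make the python side a bit more concrete, here is a minimal sketch of one such remote node using pyzmq; the port number, the CSV message format, and the run_model() stub are assumptions for illustration only, not part of the setup described above:
```python
import zmq

def run_model(payload):
    """Stand-in for the real MATLAB / python analytics; returns a toy 'prediction'."""
    values = [float(x) for x in payload.split(",") if x]
    return sum(values) / len(values) if values else 0.0

context = zmq.Context()
socket = context.socket(zmq.REP)       # REP pairs with a REQ/XREQ on the MQL4 side
socket.bind("tcp://*:5555")            # port chosen arbitrarily for this sketch

while True:
    request = socket.recv_string()     # e.g. a CSV of recent quotes sent by the EA
    prediction = run_model(request)
    socket.send_string(str(prediction))  # the EA-side requestor blocks until this reply
```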

trying to know more about the verilog language, vhdl, and assembly language

I would like to know what the difference is between Verilog and assembly language.
Next semester we will be working with microcontrollers, but I would like to learn a little bit about it before the semester begins. I've been doing a lot of research about low-level programming, and so far I have gained a good understanding of assembly language, but I get confused when trying to understand Verilog and VHDL.
Verilog and VHDL are a completely different kind of language: they describe hardware, for purposes of programming FPGAs.
FPGAs are devices that can be on-the-fly programmed to implement any sort of digital logic (and sometimes analog too).
So using Verilog or VHDL, you can design a circuit that creates a couple of latches, some two's-complement adders, a mux, and a clock source, and suddenly you've just designed a circuit that can calculate. You could then take the output from the VHDL compiler (or whatever it's called), "download" it to the FPGA, and now you actually have some hardware that can be used to do the calculation.
Of course, you can use FPGAs to implement all sorts of complicated stuff - even a fully custom CPU. One uses Verilog and VHDL to design the circuits that are programmed onto FPGAs. Those circuits could implement something simple like a ripple counter, something more complex like an LCD driver, or something even more complex like a USB transceiver. You can go from as simple as a few latches to as complicated as a fully operating CPU; as long as it's digital hardware, you can make whatever you want with VHDL and some FPGAs.
To clarify further -
"Assembly language" typically refers to raw instructions given to some sort of CPU. Of course, there are many different types of CPUs (x86, ARM, SPARC, MIPS) and further many different variants of those types of CPUs. Each CPU has its own instruction set.
Machine code is complete, fully specified, ready-to-execute instructions. Assembly languages allow you to type instructions from your CPU's instruction set in plain text, use labels and such, and describe the memory layout of the program. Put the assembly through an assembler and out comes machine code in your CPU's machine instruction set.
You could design your own CPU from scratch using VHDL. As you're designing the CPU, you would have it implement your own custom instruction set. From there, you could take the VHDL for your CPU, compile it, write it to an FPGA and have your own custom CPU. Then you could start writing programs for your made-up CPU using your custom instruction set by writing a custom assembler. Some friends of mine in college did this for giggles.
For example, you know how most CPUs are load-store, register based CPUs? Instructions tend to go something like this:
Load the value '1' into register A
Load the value '2' into register B
Add register A and register B, storing result in register A
(You just added 1 + 2! Heh)
That sort of model of computation happens to be the most popular, but it's not the only way you could do computation. What if you had a stack-based CPU, where you push values onto a hardware stack, and then computations work with the values on the top of the stack, pushing results back onto the stack?
For instance:
Push 1 onto the stack (stack currently contains: 1)
Push 2 onto the stack (stack currently contains: 2 1)
Push 3 onto the stack (stack currently contains: 3 2 1)
Add
'Add' takes the top two elements on the stack, adds them together, and pushes the result on the top of the stack.
Stack now contains: 5 1
Add
Stack now contains: 6
Neat, isn't it? As far as a computation model goes, it has its advantages - instructions don't have to name their operands, so they tend to be short and need fewer bits. Smaller instructions mean that the CPU can be faster.
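Just to make the model tangible, here is a tiny software mock-up of that stack machine (Python is used purely for illustration; an actual design would of course be written in Verilog or VHDL):
```python
# A toy stack machine: PUSH places a value on top of the stack, ADD replaces the
# two topmost values with their sum, exactly as in the walk-through above.
def run(program):
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.insert(0, args[0])            # the new value becomes the top of the stack
        elif op == "ADD":
            a, b = stack.pop(0), stack.pop(0)   # take the two topmost values
            stack.insert(0, a + b)              # push the result back
        print(op, args, "-> stack:", stack)
    return stack

run([("PUSH", 1), ("PUSH", 2), ("PUSH", 3), ("ADD",), ("ADD",)])
# Ends with the stack containing just [6], matching the example above.
```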
The problem is that no processor like this exists anymore.
But if you knew what you were doing, you could design one in VHDL, program it onto an FPGA, and suddenly you'd have one of the only operating stack-based processors in existence.
Say you were doing a master's thesis, for instance: you might dig around and find out that virtual-machine-based programming languages like C# and Java compile down to a bytecode for a CPU that doesn't really exist, but the model of that CPU proves useful for making code portable. You might find out that the imaginary machines used by these languages are based on stack-based processor models. If you were looking for something interesting to do, perhaps you could write, in VHDL, a processor that natively implements the Java bytecode language. Now you'd be the only person who has a computer that can directly run Java.
Verilog and VHDL are both HDLs (hardware description languages) used mainly for describing digital electronics. Their targets may be FPGAs or ASICs (custom silicon).
Assembly, on the other hand, uses a processor's instruction set to perform a series of calculations. Everything executed on a computer eventually ends up as assembly-level instructions. One example of an instruction set would be the x86 ISA.
Summary: Verilog, VHDL describe hardware. Assembly is the low level program being executed on a processor.