MultiCore Programming on Raspberry Pi via Simulink - matlab

I am currently developing a model in Simulink with three different main functions (let's call them A, B, and C for now), where one of them runs at a different sample time than the other two. When I simulated this system on the Raspberry Pi via external mode, I got a lot of overruns and a high CPU load. Now I am trying to split the model so that, for example, functions A and B are executed on one core and function C is executed on another core.
For this, I used this article from MathWorks, but I think you can't actually assign a task to a core that way, only specify its periodic execution. As a result I could reduce the CPU load to a maximum of 40% but still get a lot of overruns (which, in my opinion, is contradictory).
As a second approach, I tried this article, but I think this is not possible for the Raspberry Pi, since I cannot add and assign cores in the Concurrent Execution tab.
My goal is to assign each task to a core on the Raspberry Pi and to see the CPU load on the Raspberry Pi.
Many thanks in advance!

Related

What is responsible for changing a core's load and frequency in a multicore processor?

Having looked for a description of the multicore design, I keep finding several diagrams, but all of them look roughly the same.
I know from looking at i7z command output that different cores can run at different frequencies.
This would suggest that the decisions about which core is given a new process, and about changing a core's frequency, are made either by the operating system or by a control block within the core itself.
My question is: what controls the frequency of each individual core? Is the job of associating a READY process with a specific core placed upon the operating system, or is it done by something within the processor?
Scheduling processes/threads to cores is purely up to the OS. The hardware has no understanding of tasks waiting to run. Maintaining the OS's list of processes that are runnable vs. waiting for I/O is completely a software thing.
Migrating a thread from one core to another is done by kernel code on the original core storing the architectural state to memory, then OS code on the new core restoring that saved state and resuming user-space execution.
Traditionally, frequency and voltage scaling decisions are made by the OS. Take Linux as an example: the decision-making code is called a governor (an Arch Wiki page on the topic also comes up high on Google). It looks at things like how often processes have used their entire time slice on the current core. If the governor decides the CPU should run at a different speed, it programs some control registers to implement the change. As I understand it, the hardware takes care of choosing the right voltage to support the requested frequency.
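Purely as an illustration: on a Linux system the active governor and the current core frequency are exposed through the standard cpufreq sysfs files, so you can inspect them with a plain cat from a shell or, since this page is MATLAB-centric, through MATLAB's system() call, as in this sketch (the paths are the standard cpufreq locations; run it on the Linux target itself, e.g. a Raspberry Pi):

    % Sketch: read the active cpufreq governor and the current frequency of
    % core 0 from the standard Linux sysfs paths. A shell 'cat' does the
    % same job without MATLAB.
    [~, gov]  = system('cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor');
    [~, freq] = system('cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq');
    fprintf('governor: %s, current frequency: %s kHz\n', strtrim(gov), strtrim(freq));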
As I understand it, the OS running on each core makes decisions independently. On hardware that allows each core to run at a different frequency, the decision-making code on different cores doesn't need to coordinate. If running a high frequency on one core requires a high voltage chip-wide, the hardware takes care of that. I think the modern implementation of DVFS (dynamic voltage and frequency scaling) is fairly high-level, with the OS just telling the hardware which of N choices it wants, and the onboard power microcontroller taking care of the details of programming oscillators / clock dividers and voltage regulators.
Intel's "Turbo" feature, which opportunistically boosts the frequency above the max sustainable frequency, does the decision making in hardware. Any time the OS requests the highest advertised frequency, the CPU uses turbo when power and cooling allow.
Intel's Skylake takes this a step further: The OS can hand full control over DVFS to the hardware, optionally with constraints. That lets it react from microsecond to microsecond, rather than on a timescale of milliseconds. This does actually allow better performance in bursty workloads, because more power budget is available for turbo when it's useful. A few benchmarks are bursty enough to observe this, like some browser / javascript ones IIRC.
There was a whole talk about Skylake's new power management at IDF2015, check out the slides and/or archived webcast. The old method is described in a lot of detail there, too, to illustrate the difference, so you should really check it out if you want more detail than my summary. (The list of other IDF talks is here, thanks to Agner Fog's blog for the link)
The core frequency is controlled by a given voltage applied to a core's "oscillator".
This voltage can be changed by the Operating System but it can also be changed by the BIOS itself if a high temperature is detected in the CPU.

How to get a significant speed-up using the Parallel Computing Toolbox of MATLAB on a Core i7 processor?

I am working on image processing. I have a computer with an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz and 4 GB of RAM. I want to parallelize an image-processing algorithm using the spmd command of the Parallel Computing Toolbox (PCT). For this, I divided the image vertically into 8 parts, sent them to different labs, and used spmd to execute the image-processing algorithm in parallel on the different parts on the different labs.
I get the same (correct) result as with the sequential code, but it takes more time than the sequential code. I have tried this with images ranging from very large to very small, but didn't get any significant speed-up.
How can I get a significant speed-up using the spmd command?
Since you did not provide any code I'll have to stick to a general answer. In all parallel computing there are several design considerations, the two most important being: is your code able to run in parallel, and how much communication overhead do you create?
Calling workers means sending information back and forth, so there is an optimum in parallel computing. Make sure you provide your workers with enough work that the communication to and from them costs less time than the parallel execution saves.
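For illustration, here is a minimal sketch of the kind of spmd structure this usually implies; the test image and the median filter are stand-ins for the asker's actual data and algorithm, and whether it beats the sequential version depends entirely on how much work each strip represents:

    % Minimal spmd sketch: one vertical strip of the image per worker.
    % Requires the Parallel Computing Toolbox; medfilt2 is just a stand-in
    % workload, and boundary effects between strips are ignored.
    parpool;                        % start a pool of workers (once per session)
    img = imread('cameraman.tif');  % any test image

    spmd
        cols  = size(img, 2);
        edges = round(linspace(0, cols, numlabs + 1));             % strip boundaries
        strip = img(:, edges(labindex) + 1 : edges(labindex + 1));
        filtered = medfilt2(strip, [5 5]);                          % per-worker work
    end

    % 'filtered' is a Composite on the client; reassemble the strips.
    out = cat(2, filtered{:});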
Last but not least: if you provide a working code example the community is able to help you much better!
If you want to apply the same operation to several blocks within an image, then rather than worry about constructs such as spmd, you can just apply the command blockproc and set the UseParallel option to true. It will parallelize everything for you, without you needing to do anything.
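For example, something along these lines (a sketch assuming the Image Processing Toolbox and an open parallel pool; medfilt2 stands in for whatever operation is actually applied to each block):

    % Process the image in 256x256 tiles; with 'UseParallel' set to true,
    % blockproc spreads the tiles across the workers of the open pool.
    img = imread('cameraman.tif');
    fun = @(blk) medfilt2(blk.data, [5 5]);   % blk.data is the current tile
    out = blockproc(img, [256 256], fun, 'UseParallel', true);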
If that doesn't work for you and you really have a requirement to implement your algorithm directly using spmd, you'll need to post some example code to indicate what you've tried, and where it's going wrong.

Faster way to run a Simulink simulation repeatedly a large number of times

I want to run a simulation that includes SimEvents blocks (thus only the Normal simulation mode is available) a large number of times, at least 1000. When I use sim it compiles the model every time, and I wonder if there is another solution that just runs the simulation repeatedly in a faster way. I disabled the Rebuild option in the Configuration Parameters and it does make things faster, but it still takes ages to run around 100 times.
A single simulation does not take long at all.
Thank you!
It's difficult to say why the model compiles every time without actually seeing the model and what's inside it. However, the Parallel Computing Toolbox provides you with the ability to distribute the iterations of your model across several cores, or even several machines (with the MATLAB Distributed Computing Server). See Run Parallel Simulations in the documentation for more details.
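As a rough sketch of what that looks like in practice (assuming a release where parsim and Simulink.SimulationInput are available; 'myModel' and the varied 'seed' variable are placeholders):

    % Build one SimulationInput per run, varying whatever differs between
    % runs, then let parsim distribute the runs across the parallel pool.
    % Each worker loads and compiles the model once and reuses it.
    numRuns = 1000;
    in(1:numRuns) = Simulink.SimulationInput('myModel');
    for k = 1:numRuns
        in(k) = in(k).setVariable('seed', k);   % placeholder parameter
    end
    out = parsim(in, 'ShowProgress', 'on');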

Spawn multiple copies of MATLAB on the same machine

I am facing a huge problem. I built a complex C application with embedded MATLAB functions that I call using the MATLAB engine (engOpen() and such ...). The following happens:
I spawn multiple instances of this application on a machine, one for each core
However, the application then slows down to a crawl. In fact, on my 16-core machine, the application slows down by approximately a factor of 16.
Now I realized this is because there is only a single MATLAB engine started per machine, and all my 16 instances share the same copy of MATLAB!
I tried to replicate this with the MATLAB GUI and it's the same problem: I run a program in the GUI that takes 14 seconds, and then I run it in two GUIs at the same time and it takes 28 seconds.
This is a huge problem for me, because I will miss my deadline if I have to reprogram my entire C application without MATLAB. I know that MATLAB has commands for parallel programming, but my MATLAB calls are embedded in the C application and I want to run multiple instances of the C application. Again, I cannot refactor my entire C application because I will miss the deadline.
Can anyone please let me know if there is a solution for this (e.g. really starting multiple MATLAB processes on the same machine)? I am willing to pay for extra licenses. I currently have fully licensed MATLAB installed on all machines.
Thank you so so much!
EDIT
Thank you Ben Voigt for your help. I found that a single instance of MATLAB is already using multiple cores. In fact, running one instance shows me full utilization of 4 cores. If I run two copies of MATLAB, I get full utilization of 8 cores, so it is actually running in parallel. However, even though 2 instances seem to take up double the processing power, I still get a 2x slowdown. Hence, 2 instances seem to produce twice the result with 4x the total compute. Why could that be?
Your slowdown is not caused by stuffing all N instances into a single MATLAB instance on a single core, but by the fact that there are no longer 16 cores at the disposal of each instance. Many MATLAB vector operations use parallel computation even without explicit parallel constructs, so more than one core per instance is needed for optimal efficiency.
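One quick experiment to see how much the implicit multithreading matters (a sketch, not a definitive fix): cap each instance's computational threads so that 16 concurrent instances don't fight over the 16 cores. The same command can be sent from the C host through engEvalString, or MATLAB can be started with the -singleCompThread flag.

    % Limit this MATLAB instance to one computational thread, run the
    % workload, then restore the previous limit. maxNumCompThreads returns
    % the limit that was in effect before the call.
    oldLimit = maxNumCompThreads(1);
    % ... run the embedded MATLAB workload here ...
    maxNumCompThreads(oldLimit);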
MATLAB libraries are not thread-safe. If you create multithreaded applications, make sure only one thread accesses the engine application.
I think the MATLAB engine is the wrong technique. On Windows platforms, you can try using the COM Automation server instead, which has the .Single option that starts one MATLAB instance for each COM client you open.
Alternatives are:
Generate C++ code for the functions.
Create a .NET library (MATLAB Builder NE).
Run MATLAB via the command line.

Profiling: how to avoid the hardware FPU in VS Pro 2008 or 2012 RC and use emulated FPU code

I need to optimise some code for a Cortex-M3 processor, which doesn't have an FP unit. I'm completely new to the domain of optimisation. Anyway, I use VS 2012 Release Candidate for native compilation of the code on my PC (Intel Core i5, Windows 7 as OS) and then port it to the Cortex-M3. I tried to write my code so that it uses floating-point arithmetic as little as possible, but I still have a few such operations, so I know that when I embed it on the Cortex-M3 it will use emulated (software) FPU code instead. Since I'm not able to do profiling for the Cortex-M3, I did it on my PC using VS 2012 (instrumentation method) to verify which functions take more time and have to be optimised further.
I think that profiling results on my PC could be proportional to those on the Cortex-M3 if I don't use the FP unit of my PC.
Is there a keyword or way in Visual Studio (2008 Pro or 2012 RC) that allows me to skip the hardware FP unit?
Your insights are very much appreciated.
Your PC is so different from a Cortex-M3 that optimisations performed there are unlikely to be of any relevance. Some of the differences:
PC can issue more than one instruction per cycle
PC has some billion of those cycles per second vs some tens of millions
PC likely has more cache than your M3 has RAM
As you observe - the floating point unit
The M3 is an embedded processor - if you can't profile in the traditional way, either get a better toolset so that you can, or do it by hand by using the hardware timers in the device to time your functions. Or toggle some port pins and hang an oscilloscope off them - that's proper embedded :)
EDIT:
You can profile without an OS - higher-end embedded toolchains can instrument the code, run it, and pull the results back for post-processing.
There are other hardware timers than the watchdog. At the simplest level, write some functions that read the timer value before you perform some task, read it again afterwards, subtract the two readings, and print out the difference. More complex schemes are also possible, logging many iterations and keeping track of statistics, etc.
If you have a few port pins, just set one before the function(s) you want to profile, clear it when it completes.
With a 4 channel scope you can see the execution times (and when they happen relative to each other, which can be useful if one interrupts another) of 4 sections of code at a time. If you have more, get a logic analyser and you can do loads of them!
You can also see the jitter, or variation in execution time, which can be instructive. Try it on the libc trig functions as the angle varies: you'll see that at some angles the sin/cos functions (for example) take far longer to run than at other angles. This can be a significant problem in a real-time system.