Real-time benchmark between preempt_rt, Xenomai and RTAI

I need to compare performance between the preempt_rt patches, Xenomai and RTAI. Each has its own benchmarks, but they don't give comparable results, and not all benchmarks are available on all three platforms.
What I'm looking for is a benchmark that measures basic performance (interrupt latency, context switch time, etc.) and runs on all three platforms. I tried implementing the Thread-Metric benchmark, but it was too complicated for me.
Does anyone know if such a benchmark exists? Thanks in advance for any help.

Check out "Assessment of the Realtime Preemption Patches (RT-Preempt) and their impact on the general purpose performance of the system".

Factors that Impact Translation Time

While developing models I have run into cases where the translation time has become a serious problem: the model simulates quickly but takes far too long to translate. I could use some insight so I can look into resolving this.
So the question is:
What are some of the primary factors that impact the translation time of a model and ideas to address the issue?
For example, things that may have an impact:
for loops vs. a vectorized method - a basic model testing this didn't seem to show any impact
using input variables vs. parameters
impact of annotations (e.g., Evaluate=true)
use of many connect() statements - this seems to be a factor (perhaps the primary one), as it forces the translator to do all the heavy lifting
or, tough luck, this is tool dependent (Dymola, OMEdit, etc.) :(
Any insight is greatly appreciated.
Clearly the answer to this question is naturally open-ended. There are many things to consider when computation times may be a factor.
For distributed models (e.g., finite difference), building simple models and then using connect equations to link them in the appropriate order is not the best way to produce the models. Experience has shown that this method increases the translation time to unbearable lengths. It is better to create distributed models with the same approach that is used in the MSL Dynamic pipe (not exactly like it, but similar).
Changing the approach as described gives a significantly shorter translation time than using connect statements as the number of distributed elements grows (orders of magnitude for larger models, >~100,000 equations). This was tested using Dymola 2017 and 2017 FD01.
Some related materials pointed out by others that may be useful for more information have been included below:
https://modelica.org/events/modelica2011/Proceedings/pages/papers/07_1_ID_183_a_fv.pdf
Scalable Test Suite : https://dx.doi.org/10.3384/ecp15118459

Identification and Diagnosis of Efficiency Problems in HPC

There are many articles and books on problems in HPC, but I feel like I am missing out on how to diagnose scaling and efficiency issues. For example, I am reading a book called "Introduction to High Performance Computing for Scientists and Engineers" by Georg Hager and Gerhard Wellein, which discusses a wide variety of problems and solutions such as:
Cache misses
Load Imbalance
Poor Vectorization of code
etc.
But if I were handed a piece of code even remotely complex (i.e., more than nested for-loops), I would have a very hard time discovering what the bottleneck was or proving that the code had reached the limits of a given piece of hardware.
By analogy with medicine, I can currently list a bunch of possible diseases that make people "less efficient", but this is hardly useful. I need to figure out how to diagnose my "patients" and then prescribe a "cure".
Could I please be referred to literature that teaches how to diagnose HPC problems (efficiency, scalability, etc.)? Almost a step-by-step guide. Like: put stethoscope on chest, then listen, ...
This question is two questions: one is how do I find bottlenecks, the other is how do I know the limits of my hardware and if I am at them.
For the first, you must run the code inside a profiler. Any profiler with a "top down" view of your code, ordered by time, will show you the bottlenecks.
Try the profilers suggested here (the answer applies to C++ and Fortran): Good profiler for Fortran and MPI - both Allinea MAP and HPCToolkit have the sort of presentation you need. (NB I work for Allinea).
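To illustrate what a time-ordered, top-down view looks like, here is a minimal sketch using Python's built-in cProfile and pstats modules. It is only a stand-in for the C++/Fortran tools named above, and the functions slow_kernel, cheap_setup and run are hypothetical examples made up for this sketch.

```python
import cProfile
import io
import pstats


def slow_kernel(n):
    """Deliberately expensive: does n * n loop iterations."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i ^ j
    return total


def cheap_setup(n):
    """Cheap by comparison: linear work."""
    return list(range(n))


def run():
    """Top-level entry point that calls both pieces of work."""
    cheap_setup(1000)
    return slow_kernel(300)


# Collect timing data only for the call we care about.
profiler = cProfile.Profile()
profiler.enable()
result = run()
profiler.disable()

# Sort by cumulative time so that callers are listed together with the
# callees in which they spend their time; the bottleneck rises to the top.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

In the printed table, run and slow_kernel should account for nearly all of the cumulative time, while cheap_setup barely registers. That ranking is the kind of evidence to look for before optimizing anything.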
The second question is the most "open" part. That one needs your book or an optimization guide. However, a good start is to see how much vectorization you have (some of the profiler examples can show this), as this is where the most compute power can be found.
The bigger question is what the theoretical limit of your problem is, e.g., some problems are not amenable to vectorization, some have memory access needs that can never be cache friendly, and some have simple communication needs whereas others require costly regular global updates.
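One common way to put a number on such a limit is a roofline-style bound, which the answer itself does not mention. It caps achievable performance at the smaller of the peak compute rate and the memory bandwidth multiplied by the kernel's arithmetic intensity. The sketch below uses made-up machine figures (500 GFLOP/s peak, 50 GB/s bandwidth). They are assumptions for illustration only and should be replaced with measured values for real hardware.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline-style upper bound on performance in GFLOP/s.

    peak_gflops    -- peak floating-point rate of the machine
    bandwidth_gbs  -- sustained memory bandwidth in GB/s
    flops_per_byte -- arithmetic intensity of the kernel
    """
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical machine (assumed numbers, not measurements).
PEAK_GFLOPS = 500.0
BANDWIDTH_GBS = 50.0

# Vector triad a[i] = b[i] + s * c[i] in double precision:
# 2 flops per iteration, 3 * 8 bytes moved (write-allocate traffic ignored).
triad_intensity = 2 / 24
triad_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, triad_intensity)

# A kernel with high data reuse, e.g. 20 flops per byte (assumed).
dense_bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 20.0)

print(f"triad bound: {triad_bound:.2f} GFLOP/s (memory bound)")
print(f"dense bound: {dense_bound:.2f} GFLOP/s (compute bound)")
```

If a profiler shows the triad-like loop already running close to its bound, further vectorization will not help, because the limit is memory traffic rather than compute.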

Netlogo High performance Computing

Are there any high performance computing facilities available for running NetLogo BehaviorSpace, like R servers?
Thanks.
You can use headless mode to run batches of experiments on a cluster or cloud computing platform. This involves simply running an executable, so it should be compatible with most setups. If you don't have access to a cluster through an institution, I know people use AWS and Google Compute. You probably want an instance with many cores, since that allows a single instance of BehaviorSpace to automatically distribute the runs involved in an experiment across multiple processes. Higher processing power of course helps too. You shouldn't need much memory. The n1-highcpu-16 or n1-standard-16 instance types in Google Compute look pretty ideal to me.
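As a rough sketch of what such a headless run looks like, the snippet below assembles a command line for the netlogo-headless.sh launcher that ships with NetLogo, using its BehaviorSpace options --model, --experiment, --table and --threads. The model file, experiment name and output path are hypothetical placeholders, and the option names should be checked against the BehaviorSpace documentation for the installed NetLogo version.

```python
import os
import shlex


def behaviorspace_command(model, experiment, table, threads=None):
    """Build the argument list for a headless BehaviorSpace run.

    model      -- path to the .nlogo model file (placeholder here)
    experiment -- name of an experiment saved in the model
    table      -- path of the CSV table output
    threads    -- number of parallel runs; defaults to the core count
    """
    if threads is None:
        threads = os.cpu_count() or 1
    return [
        "./netlogo-headless.sh",
        "--model", model,
        "--experiment", experiment,
        "--table", table,
        "--threads", str(threads),
    ]


# Placeholder names; substitute your own model and experiment.
cmd = behaviorspace_command("model.nlogo", "experiment1", "results.csv", threads=16)
print(shlex.join(cmd))
```

On a cluster, the printed line would typically go into a job script. Matching --threads to the number of cores requested from the scheduler keeps the runs from competing for CPU time.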

How to do hardware independent parallel programming?

These days there are two main hardware environments for parallel programming: multi-threaded CPUs, and graphics cards, which can do parallel operations on arrays of data.
The question is, given that there are two different hardware environments, how can I write a program which is parallel but independent of these two different hardware environments?
I mean that I would like to write a program and, regardless of whether I have a graphics card or a multi-threaded CPU or both, the system should choose automatically what to execute it on: the graphics card, the multi-threaded CPU, or both.
Are there any software libraries/language constructs which allow this?
I know there are ways to target the graphics card directly to run code on, but my question is about how can we as programmers write parallel code without knowing anything about the hardware and the software system should schedule it to either graphics card or CPU.
If you require me to be more specific as to the platform/language, I would like the answer to be about C++ or Scala or Java.
Thanks
Martin Odersky's research group at EPFL just recently received a multi-million-euro European Research Grant to answer exactly this question. (The article contains several links to papers with more details.)
In a few years from now programs will rewrite themselves from scratch at run-time (hey, why not?)...
...as of right now (as far as I am aware) it's only viable to target related groups of parallel systems with given paradigms and a GPU ("embarrassingly parallel") is significantly different than a "conventional" CPU (2-8 "threads") is significantly different than a 20k processor supercomputer.
There are actually parallel run-times/libraries/protocols like Charm++ or MPI (think "Actors") that can scale -- with algorithms specially engineered for certain problems -- from a single CPU to tens of thousands of processors, so the above is a bit of hyperbole. However, there are enormous fundamental differences between a GPU -- or even a Cell microprocessor -- and a much more general-purpose processor.
Sometimes a square peg just doesn't fit into a round hole.
Happy coding.
OpenCL is precisely about running the same code on CPUs and GPUs, on any platform (Cell, Mac, PC...).
From Java you can use JavaCL, which is an object-oriented wrapper around the OpenCL C API that will save you lots of time and effort (handles memory allocation and conversion burdens, and comes with some extras).
From Scala, there's ScalaCL which builds upon JavaCL to completely hide away the OpenCL language : it converts some parts of your Scala program into OpenCL code, at compile-time (it comes with a compiler plugin to do so).
Note that Scala features parallel collections as part of its standard library since 2.9.0, which are useable in a pretty similar way to ScalaCL's OpenCL-backed parallel collections (Scala's parallel collections can be created out of regular collections with .par, while ScalaCL's parallel collections are created with .cl).
The (very-)recently announced MS C++ AMP looks like the kind of thing you're after. It seems (from reading the news articles) that initially it's targeted at using GPUs, but the longer term aim seems to be to include multi-core too.
Sure. See ScalaCL for an example, though it's still alpha code at the moment. Note also that it uses some Java libraries that perform the same thing.
I will cover the more theoretical answer.
Different parallel hardware architectures implement different models of computation. Bridging between these is hard.
In the sequential world we've been happily hacking away at basically the same single model of computation: the Random Access Machine. This creates a nice common language between hardware implementors and software writers.
No such single optimal model for parallel computation exists. Since the dawn of modern computers a large design space has been explored; current multicore CPUs and GPUs cover but a small fraction of this space.
Bridging these models is hard because parallel programming is essentially about performance. You typically make something work on two different models or systems by adding a layer of abstraction to hide specifics. However, it is rare that an abstraction does not come with a performance cost. This will typically land you with the lowest common denominator of both models.
And now answering your actual question. Having a computational model (language, OS, library, ...) that is independent of CPU or GPU will typically not abstract over both while retaining the full power you're used to with your CPU, due to the performance penalties. To keep everything relatively efficient the model will lean towards GPUs by restricting what you can do.
Silver lining:
What does happen is hybrid computations. Some computations are more suited for other kinds of architectures. You also rarely do only one type of computation, so that a 'sufficiently smart compiler/runtime' will be able to distinguish what part of your computation should run on what architecture.

How do I estimate tasks using function points?

What are the steps to estimating using function points?
Is there a quick-reference guide of some sort out there?
I took a conference session on Function Point Analysis a few years back. There is a lot to it. You can check out the Free Function Point Training Manual online, the Fundamentals of Function Points, or I suspect you can get a book on it at a computer store.
You might also check out the International Function Point Users Group and see if they have some resources or a local meeting for you.
You really need to get some training on it. Check with IFPUG. You will unknowingly pick up some destructive bad habits if self-taught. It also helps to have an experienced FP analyst review some of your early attempts.
It's the kind of thing that appears overwhelmingly complex until you "get it" and then it's fairly quick to do. It improved my requirements analysis a lot too. I often spot contradictions and gaps when doing a count.
It isn't limited to BDUF Waterfall projects either. I spent three years using FP and Planning Poker as cross-checks on one another when contracting agile methods projects.
I was IFPUG-certified from 2002-2005 and am still using FP analysis. I've seen it misused a lot, and I think that's why it has such a bad reputation.
I recommend you take a look at COSMIC Function points. https://cosmic-sizing.org. COSMIC Function points are also an ISO standard for measuring software size. They are an evolved improvement over IFPUG.
You can quickly estimate size by counting the entries, exits, reads and writes.
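As a minimal sketch of that counting step: in COSMIC each data movement (entry, exit, read or write) contributes one COSMIC Function Point (CFP), so the size of a piece of software is the total number of movements across its functional processes. The two functional processes below are hypothetical examples, not taken from any real system.

```python
from collections import Counter

# The four COSMIC data movement types.
MOVEMENT_TYPES = ("entry", "exit", "read", "write")


def cosmic_size(processes):
    """Return (total CFP, CFP per functional process).

    processes maps a functional process name to the list of data
    movements identified for it; each movement counts as 1 CFP.
    """
    per_process = {}
    for name, movements in processes.items():
        counts = Counter(movements)
        unknown = set(counts) - set(MOVEMENT_TYPES)
        if unknown:
            raise ValueError(f"{name}: unknown movement types {sorted(unknown)}")
        per_process[name] = sum(counts.values())
    return sum(per_process.values()), per_process


# Hypothetical functional processes for an order system.
processes = {
    "create order": ["entry", "read", "write", "exit"],
    "list orders": ["entry", "read", "exit"],
}

total, detail = cosmic_size(processes)
print(f"total size: {total} CFP")
for name, cfp in detail.items():
    print(f"  {name}: {cfp} CFP")
```

The hard part in practice is identifying the functional processes and their data groups from the requirements. Once that is done, the arithmetic is just this sum.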
Compared with the IFPUG manual, learning COSMIC is much easier. The free guide below is all you need, and you can read it in a day.
Recommended reading: https://cosmic-sizing.org/publications/measurement-guide/