Factors that Impact Translation Time - modelica

I have run across issues in developing models where the translation time (simulates quickly but takes far too long to translate) has become a serious issue and could use some insight so I can look into resolving this.
So the question is:
What are some of the primary factors that impact the translation time of a model and ideas to address the issue?
For example, things that may have an impact:
for loops vs a vectorized method - a basic model testing this didn't seem to impact anything
using input variables vs parameters
impact of annotations (e.g., Evaluate=true)
or tough luck, this is tool dependent (Dymola, OMEdit, etc.) :(
use of many connect() - this seems to be a factor (perhaps primary) as it forces translater to do all the heavy lifting
Any insight is greatly appreciated.

Clearly the answer to this question if naturally open ended. There are many things to consider when computation times may be a factor.
For distributed models (e.g., finite difference) the use of simple models and then using connect equations to link them in the appropriate order is not the best way to produce the models. Experience has shown that this method significantly increases the translation time to unbearable lengths. It is better to create distributed models in the same approach that is used the MSL Dynamic pipe (not exactly like it but similar).
Changing the approach as described is significantly faster in translational time (orders of magnitude for larger models, >~100,000 equations) than using connect statements as the number of distributed elements increases to larger numbers. This was tested using Dymola 2017 and 2017FD01.
Some related materials pointed out by others that may be useful for more information have been included below:
https://modelica.org/events/modelica2011/Proceedings/pages/papers/07_1_ID_183_a_fv.pdf
Scalable Test Suite : https://dx.doi.org/10.3384/ecp15118459

Related

checking for convergence in complex hierarchical models JAGS

I have estimated a complex hierarchical model with many random effects, but don't really know what the best approach is to checking for convergend. I have complex longitudinal data from a few hundred individuals and estimate quite a few parameters for every individual. Because of that, I have way to many traceplots to inspect visually. Or should I really spend a day going through all the traceplots? What would be a better way to check for convergence? Do I have to calculate Gelman and Rubin's Rhat for every parameter on the person level? And when can I conclude that the model converged? When absolutely all of the thousends of parameters reached convergence? Is it even sensible to expect that? Or is there something like "overall convergence"? And what does it mean when some person-level parameters did not converge? Does it make sense to use autorun.jags from the R2jags package with such a model or will it just run for ever? I know, these are a lot of question, but I just don't know how to approach that.
The measure I am using for convergence is a potential scale reduction factor (psrf)* using the gelman.diag function from the R package coda.
But nevertheless, I am also quickly visually inspecting all the traceplots, even though I also have tens/hundreds of them. It can be really fast if you put them in PNG files and then quickly go through them using e.g. IrfanView (let me know if you need me to expand on this).
The reason you should inspect the traceplots is pretty well described by an example from Marc Kery (author of great Bayesian books): see "Never blindly trust Rhat for convergence in a Bayesian analysis", here I include a self explanatory image from this email:
This is related to Rhat statistics while I use psrf, but it's pretty likely that psrf suffers from this too... and better to check the chains.
*) Gelman, A. & Rubin, D. B. Inference from iterative simulation using multiple sequences. Stat. Sci. 7, 457–472 (1992).

Identification and Diagnosis of Efficiency Problems in HPC

There are many articles and books on problems in HPC, but I feel like I am missing on the diagnose of scaling and efficiency issues. For example, I am reading a books called "Introduction to High Performance Computing for Scientists and Engineers" by Horst Simon where he discusses a wide variety of problems and solutions such as,
Cache misses
Load Imbalance
Poor Vectorization of code
etc.
But if I were handed a piece of code even remotely complex (ie more than nested for-loops) I would have a very hard time discovering what the bottleneck was or proving that the code had reached the limits of a given piece of hardware.
In analog with medicine, I can currently list out a bunch of possible diseases that make people "less efficient", but this is hardly useful. I need to figure out how to diagnose my "patients" and then prescribe a "cure".
Could I please be referred to literature that teaches how to diagnosis of HPC problems (efficiency, scalability, etc)? Almost a step-by-step guide. Like put stethoscope of chest, then listen, ...
This question is two questions: one is how do I find bottlenecks, the other is how do I know the limits of my hardware and if I am at them.
The first is that you must run the code inside a profiler. Any profiler with a "top down" view of your code according to time is showing you the bottlenecks.
Try the profilers suggested here (answer applies to c++ and Fortran): Good profiler for Fortran and MPI - both Allinea MAP and HPC Toolkit have the sort of presentation you need. (NB I work for Allinea).
The second question is the most "open" part. That one needs your book or optimization guide. However, a good start is to see how much vectorization you have (Some of the profiler examples can show this) as this is where the most compute power can be found.
The bigger question is what the theoretical limit of your problem is - eg. Some problems are not amenable to vectorization, some have memory access needs that can never be cache friendly, some have communication needs that are simple whereas others require costly regular global updates.

Barriers to translation stage in Modelica?

Some general Modelica advice?
We've built a model with ~2000 equations and three vectors of input from measured data. Using OpenModelica, attempts at simulation have begun to hang in the translation stage (which runs for hours where it used to take less than a minute) and now I regularly "lose connection to omc.exe." Is there perhaps something cumulative occurring that's degrading translation/compilation performance?
In general, are there any good rules of thumb for keeping simulations lighter and faster? I realize that, depending on the couplings, additional equations could be exponentially increasing the size of the resulting system of equations - could this be a problem?
Thanks for your thoughts!
It shouldn't take that long. Seems like a bug.
You can report this bug here:
https://trac.openmodelica.org/OpenModelica (New Ticket).
If your model is public you can post it there, if not you can contact the OpenModelica team privately.
I did some cleaning in the code; and got the part that repeats 12x (the module) down to ~180 equations; in the process I reduced the size of my input vectors (and also a 2D look-up table the module refers to) by quite a bit - they're both down to a few hundred values. It's working now--simulations run in reasonable time, a few minutes each.
Since all these tables were defined within Modelica functions (as you pointed out, Mr. Tiller) perhaps shrinking them helped to improve the performance. I had assumed that all that data just got spread out in a memory array, without going through any real processing, but maybe that's not the case...time to know more about what's going on under the hood in this environment (as always).
Thanks for the help!

Which language should I prefer working with if I want to use the Fast Artificial Neural Network Library (FANN)?

I am working on reducing dimentionality of a set of (Boolean) vectors with both the number and dimentionality of vectors tending to be of the order of 10^5-10^6 using autoencoders. Hence even though speed is not of essence (it is supposed to be a pre-computation for a clustering algorithm) but obviously one would expect that the computations take a reasonable amount of time. Seeing how the library itself was written in c++ would it be a good idea to stick to it or to code in Java (Since the rest of the code is written in Java)? Or would it not matter at all?
That question is difficult to answer. It depends on:
How computationally demanding will be your code? If the hard part is done by the library and your code is only to generate the input and post-process the output, Java would be a valid choice. Compare it to Matlab: The language is very slow but the built-in algorithms are super-fast.
How skilled are you (or your team, or your future students) in Java and C++. Consider learning C++ takes a lot of time. If you have only a small scaled project, it could be easier to buy a bigger machine or wait two days instead of one, to get the results.
Have you legacy code in one of the languages you want to couple or maybe re-use?
Overall, I would advice you to set up a benchmark example in whatever language you like more. Then give it a try. If the speed is ok, stick to it. If you wait to long, think about alternatives (new hardware, parallel execution, different language).

How to find the time value of operation to optimize new algorithm design?

My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)
On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach at an estimation for the processor load is OK, though, and is sensible: Make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative time is usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one so thorough for other platforms, except processor / instruction set manuals, but they deal with assembly instructions)
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.
xCode has instruments to measure performance of each function/operation, you can simply use them.