Programming practice: Does not creating variables at first lead to faster computation? - matlab

I have a, b and A.
a = some expression 1
b = some expression 2
A = a + b
vs
A = some expression 1 + some expression 2
In my code there are not just a and b but a lot of such variables. By using the latter method, i.e. not creating the variables first and just summing all the expressions directly in A, my program runs about 1 s faster: the total drops from about 11 s to about 10 s, and this is confirmed over many test runs. Is this due to just not creating the variables first? Does not creating variables first lead to faster computation?
I need to run a lot of for loops and an ODE solver, so the computation is long. The variables are calculated and created inside the loops. If I can get about a 10% decrease in run time, that is good.

In general (not just MATLAB).
In your first scenario these additional steps are required, which do not apply to the second scenario:
When a variable is created, memory needs to be allocated where the value for the variable can be stored.
When a value is assigned to that variable, that value needs to be written to the variable's space in memory.
When the calculation is requested, the value for each variable needs to be retrieved from memory.
Many compilers optimize away these additional overheads by using various techniques, but many interpreted languages do not. (This is not a hard and fast rule though, there are smart interpreted languages and stupid compiled ones).
I do not know exactly how the internals of MATLAB work, but I do think it is interpreted, which means that the additional steps will likely incur additional overhead.
The problem with your second scenario is that it is less readable and maintainable in the long run, though. It is easier to read computations and intermediate steps when variable names are used. The trick is to balance performance and readability.

I'm not sure how much of a difference it would make in terms of performance, but I don't think it would be a sizeable difference. Maybe a few hundredths of a second.
You can test it for yourself by using the tic and toc functions.
tic
a = some expression 1
b = some expression 2
A = a + b
toc
VS
tic
A = some expression 1 + some expression 2
toc
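For a less noisy comparison than a single tic/toc run, here is a sketch using timeit; the two expressions below are made-up placeholders for "some expression 1" and "some expression 2", so substitute your own (local functions in a script require a reasonably recent MATLAB release):
% timeit calls the function handle repeatedly and reports a typical run time,
% which is usually more reliable than one tic/toc measurement.
withVars = @() exprWithIntermediates();
oneLiner = @() sin(pi/3)^2 + cos(pi/3)^2;   % same result, no intermediate variables

tWithVars = timeit(withVars)
tOneLiner = timeit(oneLiner)

function A = exprWithIntermediates()
    a = sin(pi/3)^2;    % placeholder for "some expression 1"
    b = cos(pi/3)^2;    % placeholder for "some expression 2"
    A = a + b;
end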
As mentioned in the other answer, readability is the main difference. You want to keep your code as simple as possible, so that if there is a problem you know exactly where it is, and hopefully why it occurred!

Related

Hoare partition: is this implementation more/less efficient than the standard partition algorithm?

In the canonical implementations of the Hoare partition scheme, we pass the starting and ending indices of the array as arguments, and the algorithm maintains a pair of cursors for the start and end of the partitioned region. Here are some standard implementations I found:
QuickSort and Hoare Partition
Hoare Partition Correctness in java
Now, I did the following and tested it with a few random arrays. I'm not quite sure if I've done anything wrong -- are there any holes in this implementation? It intuitively feels very similar to the implementations above, except that it takes fewer arguments. Does it have better or worse performance (even marginally so) compared to the standard implementation? (Even though, yes, both are O(n).)
(MATLAB)
function m_input = partition(m_input)
    pivot = m_input(1);
    size = length(m_input);   % note: this shadows MATLAB's built-in size function
    flag = 1;
    k = 1;
    while k <= size
        if m_input(k) > pivot
            % swap m_input(size) and m_input(k); MATLAB has no built-in swap
            tmp = m_input(size); m_input(size) = m_input(k); m_input(k) = tmp;
            size = size - 1;
        else
            % swap m_input(k) and m_input(flag)
            tmp = m_input(k); m_input(k) = m_input(flag); m_input(flag) = tmp;
            flag = k;
            k = k + 1;
        end
    end
end
Edit: input changed to m_input.
I always prefer the standard Hoare implementation. If you look at it, it is not very intuitive, but it has a visible advantage: fewer swaps. While your implementation effectively always does exactly N comparisons and N swaps, the Hoare implementation does only N comparisons, and it does not swap anything that does not need to be swapped.
The difference is significant in some scenarios. First, in environments where swaps or assignments of variables/objects are costly operations, for example if you use C/C++ with arrays of objects. Other typical cases where the Hoare partition performs better are when many of the items in your array have the same value, or when the array is almost sorted and only a few items need to be swapped. In those cases the Hoare version performs almost no swaps, while yours still needs to swap N items, which takes N*3 assignment instructions.
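For comparison, here is a minimal MATLAB sketch of the standard two-cursor Hoare partition (the function name and interface are illustrative, not taken from the question):
function [A, j] = hoarePartition(A, lo, hi)
    % Two cursors move inward from the ends; a swap happens only when a pair of
    % out-of-place elements is found, so nearly sorted input causes few swaps.
    pivot = A(lo);
    i = lo - 1;
    j = hi + 1;
    while true
        i = i + 1;
        while A(i) < pivot
            i = i + 1;
        end
        j = j - 1;
        while A(j) > pivot
            j = j - 1;
        end
        if i >= j
            return   % j is the split point: A(lo:j) and A(j+1:hi) are the two parts
        end
        tmp = A(i); A(i) = A(j); A(j) = tmp;
    end
end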

Using i and j as variables in MATLAB

i and j are very popular variable names (see e.g., this question and this one).
For example, in loops:
for i=1:10,
% Do something...
end
As indices into a matrix:
mat(i, j) = 4;
Why shouldn't they be used as variable names in MATLAB?
Because i and j are both functions denoting the imaginary unit:
http://www.mathworks.co.uk/help/matlab/ref/i.html
http://www.mathworks.co.uk/help/matlab/ref/j.html
So a variable called i or j will override them, potentially silently breaking code that does complex maths.
Possible solutions include using ii and jj as loop variables instead, or using 1i whenever i is required to represent the imaginary unit.
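A small illustration of the shadowing problem (the numbers are arbitrary):
z = 3 + 4i      % complex literal, always unambiguous

for i = 1:5     % i is now an ordinary variable, equal to 5 after the loop
end

w  = 2 + 3*i    % evaluates to 17, not a complex number
w2 = 2 + 3i     % the literal form is unaffected by the variable i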
It is good practice to avoid i and j variables to prevent confusion about them being variables or the imaginary unit.
Personally, however, I use i and j as variables quite often as the index of short loops. To avoid problems in my own code, I follow another good practice regarding i and j: don't use them to denote imaginary numbers. In fact, MATLAB's own documentation states:
For speed and improved robustness, you can replace complex i and j by 1i.
So rather than avoiding two very commonly used variable names because of a potential conflict, I'm explicit about imaginary numbers. It also makes my code more clear. Anytime I see 1i, I know that it represents sqrt(-1) because it could not possibly be a variable.
In old versions of MATLAB, there used to be a good reason to avoid the use of i and j as variable names - early versions of the MATLAB JIT were not clever enough to tell whether you were using them as variables or as imaginary units, and would therefore turn off many otherwise possible optimizations.
Your code would therefore get slower just by the very presence of i and j as variables, and would speed up if you changed them to something else. That's why, if you read through much MathWorks code, you'll see ii and jj used fairly widely as loop indices. For a while, MathWorks might even have unofficially advised people to do that themselves (although they always officially advise people to program for elegance/maintainability rather than to whatever the current JIT does, as it's a moving target each version).
But that's rather a long time ago, and nowadays it's a bit of a "zombie" issue that is really much less important than many people still think, but refuses to die.
In any recent version, it's really a personal preference whether to use i and j as variable names or not. If you do a lot of work with complex numbers, you may want to avoid i and j as variables, to avoid any small potential risk of confusion (although you may also/instead want to only use 1i or 1j for even less confusion, and a little better performance).
On the other hand, in my typical work I never deal with complex numbers, and I find my code more readable if I feel free to use i and j as loop indices.
I see a lot of answers here that say It is not recommended... without saying who's doing that recommending. Here's the extent of MathWorks' actual recommendations, from the current release documentation for i:
Since i is a function, it can be overridden and used as a variable. However, it is best to avoid using i and j for variable names if you intend to use them in complex arithmetic. [...] For speed and improved robustness, you can replace complex i and j by 1i.
As described in other answers, the use of i in general code is not recommended for two reasons:
If you want to use the imaginary number, it can be confused with or overwritten by an index
If you use it as an index it can overwrite or be confused with the imaginary number
As suggested, 1i and ii are both recommended alternatives. However, although each is a fine replacement for i on its own, it is not very nice to use the two of them together.
Here is an example why (personally) I don't like it:
val2 = val + i % 1
val2 = val + ii % 2
val2 = val + 1i % 3
Line 1 will not easily be misread as line 2 or line 3, but lines 2 and 3 resemble each other.
Therefore my personal recommendation would be: In case you sometimes work with complex code, always use 1i combined with a different loop variable.
Examples of single-letter indices, for when you don't use many loop variables and letters suffice: t, u, k and p.
Examples of longer indices: i_loop, step, walk and t_now.
Of course this is a matter of personal taste as well, but it should not be hard to find indices to use that have a clear meaning without growing too long.
It was pointed out that 1i is an acceptable and unambiguous way to write sqrt(-1), and that as such there is no need to avoid using i. Then again, as Dennis pointed out, it can be hard to see the difference between 1i and ii. My suggestion: use 1j as the imaginary constant where possible. It's the same trick that electrical engineers employ - they use j for sqrt(-1) because i is already taken for current.
Personally I never use i and j; I use ii and jj as shorthand indexing variables, (and kk, ll, mm, ...) and 1j when I need to use complex numbers.
Confusion with the imaginary unit has been well covered here, but there are some other more prosaic reasons why these and other single-letter variable names are sometimes discouraged.
MATLAB specifically: if you're using coder to generate C++ source from your MATLAB code (don't, it's horrible) then you are explicitly warned not to reuse variables because of potential typing clashes.
Generally, and depending on your IDE, a single-letter variable name can cause havoc with highlighters and search/replace. MATLAB doesn't suffer from this and I believe Visual Studio hasn't had a problem for some time, but the C/C++ coding standards like MISRA, etc. tend to advise against them.
For my part I avoid all single-letter variables, despite the obvious advantages for directly implementing mathematical sources. It takes a little extra effort the first few hundred times you do it, but after that you stop noticing, and the advantages when you or some other poor soul come to read your code are legion.
By default i and j stand for the imaginary unit. So from MATLAB's point of view, using i as a variable is somehow like using 1 as a variable.
Any non-trivial code contains multiple for loops, and best practice recommends using a descriptive name indicative of the loop's purpose and scope. Since time immemorial (and unless it's a 5-10 line script that I am not going to save), I have always been using variable names like idxTask, idxAnotherTask and idxSubTask, etc.
Or, at the very least, doubling the first letter of the array being indexed, e.g. ss to index subjectList and tt to index taskList, but not ii or jj, which don't help me effortlessly identify which array they are indexing among my multiple for loops.
Unless you are a very confused user I think there is very little risk in using variable names i and j and I use them regularly. I haven't seen any official indication that this practice should be avoided.
While it's true that shadowing the imaginary unit could cause some confusion in some context as mentioned in other posts, overall I simply don't see it as a major issue. There are far more confusing things you can do in MATLAB, take for instance defining false=true
In my opinion the only time you should probably avoid them is if your code specifically deals with imaginary numbers.

optimization, reduction variables, and MATLAB parfor

I'm trying to write a simple generic parallel code for minimizing a function in MATLAB. The idea is very simple, essentially:
parfor k = 1:N
(...find a good solution xcurrent with cost fcurrent ... )
% keep best current value
fmin = min(fmin,fxcurrent)
end
This works fine, because fmin is a reduction variable, and thus I can use this construction to update the current best value.
However, I couldn't find a nice elegant way of keeping (or storing) the best current solution ("xcurrent").
How do I keep track of the best solution found so far?
In other words, if the current value is strictly smaller than fmin, how can I save xcurrent (subject to the constraints that parallel loops impose in MATLAB)?
[Of course, the serial version is trivial, just prepend
if fxcurrent < fmin;
xbest = xcurrent;
end;
but this does not work on a parfor loop.]
A few approaches that come to mind:
I could just store all solutions and costs (using sliced variables), but this is hugely memory inefficient (the number of iterations N is very large, and the solutions themselves are very big).
Similarly, I could use a (set or matrix) reduction variable and do:
solutionset = [solutionset,xcurrent]
but this is almost as bad in terms of memory requirement.
I could also save xcurrent to disk every time the solution is improved.
I tried to look around for a simpler solution, but nothing was very useful.
The question seems to be well-defined (so it's not like in other problems, where the output could depend on iteration order), but I couldn't find an elegant way of doing this.
Apologies in advance if I'm missing something obvious, and thanks a lot in advance!
Thanks; I'll copy the suggestion down here.
Just an idea: what if you write your own reduction function, basically just containing the if block and a save or output?
You will presumably need to maintain multiple xcurrent structures in memory anyway, since there will have to be a separate copy for each worker executing the loop body. I would try splitting your loop into an outer parallel part and an inner serial part; this will allow you to adjust the number of copies of xcurrent independently of the total iteration count.
The inner (serial) loop can use the normal if fxcurrent < fmin; xmin = xcurrent; end construct to update its best solution, and the outer (parallel) loop can just store all solutions using slicing. As a final step you select the best solution from your (small) set.
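A minimal sketch of that outer-parallel / inner-serial pattern; findCandidate, N and the chunk count are placeholders for your actual solver and problem sizes:
nChunks = 8;                        % e.g. roughly the number of workers
nInner  = ceil(N / nChunks);        % iterations handled serially inside each chunk
fminChunk  = inf(1, nChunks);       % sliced output: best cost per chunk
xbestChunk = cell(1, nChunks);      % sliced output: best solution per chunk

parfor p = 1:nChunks
    fminLocal = inf;
    xLocal = [];
    for k = 1:nInner                % serial inner loop: ordinary if/assignment works
        [xcurrent, fxcurrent] = findCandidate(p, k);   % hypothetical solver call
        if fxcurrent < fminLocal
            fminLocal = fxcurrent;
            xLocal = xcurrent;
        end
    end
    fminChunk(p)  = fminLocal;
    xbestChunk{p} = xLocal;
end

[fmin, best] = min(fminChunk);      % cheap final serial reduction
xbest = xbestChunk{best};
Only nChunks copies of the solution are ever stored, rather than one per iteration.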

Fastest language for FOR loops

I'm trying to figure out the best programming language for an analytical model I'm building. Primary consideration is speed at which it will run FOR loops.
Some detail:
The model needs to perform numerous (~30 per entry, over 12 cycles) operations on a set of elements from an array -- there are ~300k rows, and ~150 columns in the array. Most of these operations are logical in nature, e.g., if place(i) = 1, then j(i) = 2.
I've built an earlier version of this model using Octave; running it takes ~55 hours on an Amazon EC2 m2.xlarge instance (and it uses ~10 GB of memory, but I'm perfectly happy to throw more memory at it). Octave/Matlab won't do elementwise logical operations, so a large number of for loops are needed -- I'm relatively certain that I've vectorized as much as possible, and the loops that remain are necessary. I've gotten octave-multicore to work with this code, which makes some improvement (about a 30% reduction in run time when running on 8 EC2 cores), but it ends up being unstable with file locking, etc.
I'm really looking for a step change in run time -- I know from looking at some benchmarks that actually using Matlab might get me as much as a 50% improvement, but that is cost-prohibitive. The original plan when starting this was to actually run a Monte Carlo with this, but at 55 hours a run, that's completely impractical.
The next version of this is going to be a complete rebuild from the ground up (for IP reasons I won't get into, if nothing else), so I'm completely open to any programming language. I'm most familiar with Octave/Matlab, but have dabbled in R, C, C++, Java. I'm also proficient with SQL if the solution involves storing the data in a database. I'll learn any language for this -- this isn't complicated functionality we're looking for, no interfacing with other programs, etc., so I'm not too concerned about the learning curve.
So with all that said, what's the fastest programming language specifically for FOR loops? From a search of SO and Google, Fortran and C bubble to the top, but looking for some more advice before diving in to one or the other.
Thanks!
This for loop looks no more complex than this when it hits the CPU:
for(int i = 0; i != 1024; i++) translates to
mov r0, 0 ;;start the counter
top:
;;some processing
add r0, r0, 1 ;;increment the counter by 1
jne top: r0, 1024 ;;jump to the loop top if we haven't hit the end of the loop (1024 iterations)
;;continue on
As you can tell, this is sufficiently simple that you can't really optimize it very well[1]... Refocus towards the algorithm level.
The first cut at the problem is to look at cache locality. Look up the classic example of matrix multiplication and swapping the i and j indexes.
edit: As a second cut, I would suggest evaluating the algorithm for data-dependencies between iterations and data dependency between localities in your 'matrix' of data. It may be a good candidate for parallelization.
[1] There are some micro-optimizations possible, but those will not produce the speedups you're looking for.
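To see the cache-locality point above in MATLAB terms (MATLAB stores arrays column-major, so this is only loosely analogous to the C matrix-multiplication example, and the JIT may shrink the gap):
M = rand(5000);                     % ~200 MB matrix

tic                                 % inner index walks down a column:
s = 0;                              % sequential memory access, cache friendly
for col = 1:size(M, 2)
    for row = 1:size(M, 1)
        s = s + M(row, col);
    end
end
toc

tic                                 % inner index walks along a row:
s = 0;                              % strided access, many more cache misses
for row = 1:size(M, 1)
    for col = 1:size(M, 2)
        s = s + M(row, col);
    end
end
toc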
~300k * ~150 * ~30 * ~12 = ~16G iterations, right?
This number of primitive operations should complete in a matter of minutes (if not seconds) in any compiled language on any decent CPU.
Fortran, C/C++ should do it almost equally well. Even managed languages, such as Java and C#, would only fall behind by a small margin (if at all).
If you have a problem of ~16G iterations running 55 hours, this means that they are very far from being primitive (80k per second? this is ridiculous), so maybe we should know more. (as was already suggested, is disk access limiting performance? is it network access?)
As @Rotsor said, 16G operations / 55 hours is about 80,000 operations per second, or one operation every 12.5 microseconds. That's a lot of time per operation.
That means your loops are not the cause of poor performance, it's what's in the innermost loop that's taking the time. And Octave is an interpreted language. That alone means an order of magnitude slowdown.
If you want speed, you first need to be in a compiled language. Then you need to do performance tuning (aka profiling) or, just single step it in a debugger at the instruction level. That will tell you what it is actually doing in its heart of hearts. Once you've got it to where it's not wasting cycles, fancier hardware, cores, CUDA, etc. will give you a parallelism speedup. But it's silly to do that if your code is taking unnecessarily many cycles. (Here's an example of performance tuning - a 43x speedup just by trimming the fat.)
I can't believe the number of responders talking about matlab, APL, and other vectorized languages. Those are interpreters. They give you concise source code, which is not at all the same thing as fast execution. When it comes down to the bare metal, they are stuck with the same hardware as every other language.
Added: to show you what I mean, I just ran this C++ code, which does 16G operations, on this moldy old laptop, and it took 94 seconds, or about 6ns per iteration. (I can't believe you baby-sat that thing for 2 whole days.)
void doit() {
    double sum = 0;
    for (int i = 0; i < 1000; i++) {
        for (int j = 0; j < 16000000; j++) {
            sum += j * 3.1415926;
        }
    }
}
In terms of absolute speed, probably Fortran followed by C, followed by C++. In practical application, well-written code in any of the three, compiled with a decent compiler, should be plenty fast.
Edit- generally you are going to see much better performance with any kind of looped or forking (e.g.- if statements) code with a compiled language, versus an interpreted language.
To give an example, on a recent project I was working on, the data sizes and operations were about 3/4 the size of what you're talking about here, but like your code, had very little room for vectorization, and required significant looping. After converting the code from matlab to C++, runtimes went from 16-18 hours, down to around 25 minutes.
For what you're discussing, Fortran is probably your first choice. The closest second place is probably C++. Some C++ libraries use "expression templates" to gain some speed over C for this kind of task. It's not entirely certain those will do you any good, but C++ can be at least as fast as C, and possibly somewhat faster.
At least in theory, there's no reason Ada couldn't be competitive as well, but it's been so long since I used it for anything like this that I hesitate to recommend it -- not because it isn't good, but because I just haven't kept track of current Ada compilers well enough to comment on them intelligently.
Any compiled language should perform the loop itself on roughly equal terms.
If you can formulate your problem in its terms, you might want to look at CUDA or OpenCL and run your matrix code on the GPU - though this is less good for code with lots of conditionals.
If you want to stay on conventional CPUs, you may be able to formulate your problem in terms of SSE scatter/gather and bitmask operations.
Probably the assembly language for whatever your platform is. But compilers (especially special-purpose ones that specifically target a single platform (e.g., Analog Devices or TI DSPs)) are often as good as or better than humans. Also, compilers often know about tricks that you don't. For example, the aforementioned DSPs support zero-overhead loops and the compiler will know how to optimize code to use those loops.
Matlab will do element-wise logical operations and they are generally quite fast.
Here is a quick example on my computer (AMD Athlon 2.3GHz w/3GB):
d=rand(300000,150);
d=floor(d*10);
>> numel(d(d==1))
ans =
4501524
>> tic;d(d==1)=10;toc;
Elapsed time is 0.754711 seconds.
>> numel(d(d==1))
ans =
0
>> numel(d(d==10))
ans =
4501524
In general I've found matlab's operators are very speedy, you just have to find ways to express your algorithms directly in terms of matrix operators.
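Applied to the questioner's own example, "if place(i) = 1, then j(i) = 2" becomes a single vectorized statement (variable names taken from the question, assuming j and place are arrays of the same size):
j(place == 1) = 2;   % logical indexing replaces the loop and the if-test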
C++ is not fast when doing matrixy things with for loops. C is, in fact, specifically bad at it. See good math bad math.
I hear that C99 has restrict-qualified pointers that help, but I don't know much about them.
Fortran is still the go-to language for numerical computing.
How is the data stored? Your execution time is probably more affected by I/O (especially disk or, worse, network) than by your language.
Assuming operations on rows are orthogonal, I would go with C# and use PLINQ to exploit all the parallelism I could.
Might you not be best with a hand-coded assembler insert? Assuming, of course, that you don't need portability.
That and an optimized algorithm should help (and perhaps restructuring the data?).
You might also want to try multiple algorithms and profile them.
APL.
Even though it's interpreted, its primitive operators all operate on arrays natively, therefore you rarely need any explicit loops. You write the same code, whether the data is scalar or array, and the interpreter takes care of any looping needed internally, and thus with the minimum overhead - the loops themselves are in a compiled language, and will have been heavily optimised for the specific architecture of the CPU it's running on.
Here's an example of the simplicity of array handling in APL:
A <- 2 3 4 5 6 8 10
((2|A)/A) <- 0
A
2 0 4 0 6 8 10
The first line sets A to a vector of numbers.
The second line replaces all the odd numbers in the vector with zeroes.
The third line queries the new values of A, and the fourth line is the resulting output.
Note that no explicit looping was required, as scalar operators such as '|' (remainder) automatically extend to arrays as required. APL also has built-in primitives for searching and sorting, which will probably be faster than writing your own loops for these operations.
Wikipedia has a good article on APL, which also provides links to suppliers such as IBM and Dyalog.
Any modern compiled or JITted language is going to render down to pretty much the same machine language code, giving a loop overhead of 10 nanoseconds or less per iteration on modern processors.
Quoting @Rotsor:
If you have a problem of ~16G iterations running 55 hours, this means that they are very far from being primitive (80k per second? this is ridiculous), so maybe we should know more.
80k operations per second is around 12.5 microseconds each - a factor of 1000 greater than the loop overhead you'd expect.
Assuming your 55 hour runtime is single threaded, and if your per item operations are as simple as suggested, you should be able to (conservatively) achieve a speedup of 100x and cut it down to under an hour very easily.
If you want to run faster still, you'll want to look at writing a multi-threaded solution, in which case a language that provides good support would be essential. I'd lean towards PLINQ and C# 4.0, but that's because I already know C# -- YMMV.
What about a lazy language like Clojure? It is a Lisp, so like most Lisp dialects it lacks a for loop, but it has many other forms that operate more idiomatically for list processing. It might help your scaling issues as well, because the operations are thread safe, and because the language is functional it has fewer side effects. If you wanted to find all the items in the list that were 'i' values to operate on them, you might do something like this:
(def mylist ["i" "j" "i" "i" "j" "i"])
(map #(= "i" %) mylist)
result
(true false true true false true)

Matrix-Algebra Design Decomposition

I am looking at refactoring some very complex code which is a subsystem of a project I have at work. My examination of this code has shown that it is incredibly complex, containing a lot of inputs, intermediate values and outputs that depend on some core business logic.
I want to redesign this code to be easier to maintain as well as executing a hell of a lot faster, so to start off with I have been trying to look at each of the parameters and their dependencies on each other. This has led to quite a large and tangled graph and I would like a mechanism for simplifying this graph.
A while back I came across a technique in a book about SOA design called "Matrix Design Decomposition" which uses a matrix of outputs and the dependencies they have on the inputs, applies some form of matrix algebra and can generate Business Process diagrams for those dependencies.
I know there is a web tool available at http://www.designdecomposition.com/ however it is limited in the number of input/output dependencies you can have. I have tried looking around for the algorithmic source for this tool (so I could attempt to implement it myself without the size limitation), however I have had no luck.
Does anybody know a similar technique that I could use? Currently I am even considering taking the dependency matrix and applying some Genetic Algorithms to see if evolution can come up with a simpler workflow...
Cheers,
Aidos
EDIT:
I will explain the motivation:
The original code was written for a system which computed all of the values (about 60) every time the user performed an operation (adding, removing or modifying certain properties of an item). This code was written over ten years ago and is definitely showing signs of age - others have added more complex calculations into the system and now we are getting completely unreasonable performance (up to 2 minutes before control is returned to the user). It has been decided to detach the calculations from the user actions and provide a button to "recalculate" the values.
My problem arises because there are so many calculations that are going on and they are based on the assumption that all of the required data will be available for their computation - now when I try to re-implement the calculations I keep encountering problems because I haven't got the result for a different calculation that this calculation relies on.
This is where I want to use the matrix-decomposition approach. The MD approach allows me to specify all of the inputs and outputs and gives me the "simplest" workflow that I can use for generating all of the outputs.
I can then use this "workflow" to know the precedence of the calculations I need to perform to get the same result without generating any exceptions. It also shows me which parts of the calculation system I can parallelise and where the fork and join points will be (I won't worry about that part just yet). At the moment all I have is an insanely large matrix with lots of dependencies showing in it, with no idea where to start.
I will elaborate from my comment a little more:
I don't want to use the solution from the EA process in the actual program. I want to take the dependency matrix and decompose it into modules that I will then code manually - this is purely a design aid - I am just interested in what the inputs/outputs for these modules will be. Basically a representation of the complex interdependencies between these calculations, as well as some idea of precedence.
Say I have: A requires B and C; D requires A and E; F requires B, A and E. I want to effectively partition the problem space from a complex set of dependencies into a "workflow" that I can examine to get a better understanding. Once I have this understanding I can come up with a better design / implementation that is still human readable, so for the example I know I need to calculate B, C and E first, then A, and then D and F.
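A minimal MATLAB sketch of turning those example dependencies into an evaluation order; this uses digraph/toposort as a stand-in for the matrix algebra the linked site applies:
% Edges run from prerequisite to dependent:
% A needs B and C; D needs A and E; F needs A, B and E.
src = {'B','C', 'A','E', 'A','B','E'};
dst = {'A','A', 'D','D', 'F','F','F'};
G = digraph(src, dst);
order = G.Nodes.Name(toposort(G))'   % one valid calculation order, e.g. B C E A D F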
--
I know this seems kind of strange; if you take a look at the website I linked to before, the matrix-based decomposition there should give you some understanding of what I am thinking of...
kquinn, if it's the piece of code I think he's referring to (I used to work there), it's already a black-box solution that no human can understand as is. He's not looking to make it more complicated, less in fact. What he's dealing with is a whole heap of interlinked calculations.
What currently happens, is that whenever anything changes, it's an avalanche of events which cause a whole bunch of calculations to fire off, which in turn causes a whole bunch more events which continues on until finally it reaches a state of equilibrium.
What I assume he wants to do is find the dependencies for those outlying calculations and work in from there, so they can be rewritten, and find a way to stop calculations from happening just for the sake of it, rather than because they are needed.
I can't offer much advice in regards to simplifying the graph, as unfortunately it's not something I have much experience in. That said, I would start looking for those outlying calculations which have no dependencies, and just traverse the graph from there. Start building up a new framework that includes the core business logic of each calculation in the simplest possible way, and refactor the crap out of it along the way.
If this is, as you say, "core business logic", then you really don't want to be screwing around with fancy decompositions and evolutionary algorithms that produce a "black box" solution that no one in the world understands or is capable of modifying. I would be very surprised if any of these techniques actually yielded any useful result; the human brain is still incomprehensibly more capable than any machine at untangling complicated relationships.
What you want to do is traditional refactoring: clean up the individual procedures, streamlining them and merging them where possible. Your goal is to make the code clear, so your successor doesn't have to go through the same process.
What language are you using?
Your problem should be pretty easy to model using Java Executors and Future<> tasks, but a similar framework is perhaps available on your chosen platform as well?
Also, if I understand this correctly, you want to generate a critical path for a large set of interdependent calculations -- is that something done dynamically, or do you "just" need a static analysis?
Regarding an algorithmic solution: pick up the closest copy of your numerical analysis textbook and refresh your memory on singular value decompositions and LU factorization; I'm guessing off the top of my head that this is what lies behind the tool you linked to.
EDIT: Since you're using Java, I'll give a brief outline of a suggested approach:
-> Use a threadpool executor to parallelize all calculations easily
-> Solve interdependencies with an object map of Future<> or FutureTask<>s, i.e. if your variables are A, B and C, where A = B + C, do something like this:
static final Map<String, FutureTask<Integer>> mapping = ...
static final ThreadPoolExecutor threadpool = ...

FutureTask<Integer> a = new FutureTask<Integer>(new Callable<Integer>() {
    public Integer call() throws Exception {
        Integer b = mapping.get("B").get();   // blocks until B has been computed
        Integer c = mapping.get("C").get();   // blocks until C has been computed
        return b + c;
    }
});
FutureTask<Integer> b = new FutureTask<Integer>(...);
FutureTask<Integer> c = new FutureTask<Integer>(...);

mapping.put("A", a);
mapping.put("B", b);
mapping.put("C", c);

for (FutureTask<Integer> task : mapping.values())
    threadpool.execute(task);
Now, if I'm not totally off (and I may very well be, it was a while since I worked in Java), you should be able to solve the apparent deadlock problem by tuning the thread pool size, or use a growing thread pool. (You still have to make sure that there are no interdependent tasks though, such as if A = B + C, and B = A + 1...)
If the black-box is linear you can discover all the coefficients by simply concatenating many vectors of input and many vectors of output.
You have inputs x[i] and outputs y[i]; create a matrix Y whose columns are y[0], y[1], ..., y[n], and a matrix X whose columns are x[0], x[1], ..., x[n]. There will be a transformation Y = T * X, so you can determine T = Y * inverse(X).
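A toy MATLAB check of this idea (the 3x3 map and the five probe vectors are arbitrary):
rng(0);
T_true = rand(3);          % the unknown "black box" map
X = rand(3, 5);            % columns are probe input vectors
Y = T_true * X;            % columns are the observed outputs
T_est = Y * pinv(X);       % least-squares estimate; Y/X also works when X has full row rank
max(abs(T_est(:) - T_true(:)))   % ~1e-15, so the map is recovered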
But since you said it is complex, I bet it is not linear. In that case, if you still want a general framework, you can use a factor graph:
https://ieeexplore.ieee.org/document/910572
I would be curious if you can do this.
What I think is easier is to understand the code and rewrite it using the best practices.