Matrix-Algebra Design Decomposition - decomposition

I am looking at refactoring some very complex code which is a subsystem of a project I have at work. Part of my examination of this code is that it is incredibly complex, and contains a lot of inputs, intermediate values and outputs depending on some core business logic.
I want to redesign this code to be easier to maintain as well as executing a hell of a lot faster, so to start off with I have been trying to look at each of the parameters and their dependencies on each other. This has lead to quite a large and tangled graph and I would like a mechanism for simplifying this graph.
A while back I came across a technique in a book about SOA design called "Matrix Design Decomposition" which uses a matrix of outputs and the dependencies they have on the inputs, applies some form of matrix algebra and can generate Business Process diagrams for those dependencies.
I know there is a web tool available at http://www.designdecomposition.com/ however it is limited in the number of input/output dependencies you can have. I have tried looking around for the algorithmic source for this tool (so I could attempt to implement it myself without the size limitation), however I have had no luck.
Does anybody know a similar technique that I could use? Currently I am even considering taking the dependency matrix and applying some Genetic Algorithms to see if evolution can come up with a simpler workflow...
Cheers,
Aidos
EDIT:
I will explain the motivation:
The original code was written for a system which computed all of the values (about 60) every time the user performed an operation (adding, removing or modifying certain properties of a item). This code was written over ten years ago and is definitely showing signs of age - others have added more complex calculations into the system and now we are getting completely unreasonable performance (up to 2 minutes before control is returned to the user). It has been decided to detach the calculations from the user actions and provide a button to "recalculate" the values.
My problem arises because there are so many calculations that are going on and they are based on the assumption that all of the required data will be available for their computation - now when I try to re-implement the calculations I keep encountering problems because I haven't got the result for a different calculation that this calculation relies on.
This is where I want to use the matrix-decomposition approach. The MD approach allows me to specify all of the inputs and outputs and gives me the "simplest" workflow that I can use for generating all of the outputs.
I can then use this "workflow" to know the precedence of the calculations I need to perform to get the same result without generating any exceptions. It also shows me which parts of the calculation system I can parallelise and where the fork and join points will be (I won't worry about that part just yet). At the moment all I have is an insanely large matrix with lots of dependencies showing in it, with no idea where to start.
I will elaborate from my comment a little more:
I don't want to use the solution from the EA process in the actual program. I want to take the dependency matrix and decompose it into modules that I will then code manually - this is purely a design aid - I am just interested in what the inputs/outputs for these modules will be. Basically a representation of the complex interdependencies between these calculations, as well as some idea of precedence.
Say I have A requires B and C. D requires A and E. F requires B, A and E, I want to effectively partition the problem space from a complex set of dependencies into a "workflow" that I can examine to get a better understanding. Once I have this understanding I can come up with a better design / implementation that is still human readable, so for the example I know I need to calculate A, then C, then D, then F.
--
I know this seems kind of strange, if you take a look at the website I linked to before the matrix based decomposition there should give you some understanding of what I am thinking of...

kquinn, If it's the piece of code I think he's referring to (I used to work there), it's already a black box solution that no human can understand as is. He's not looking to make it more complicated, less in fact. What he's trying to achieve is a whole heap of interlinked calculations.
What currently happens, is that whenever anything changes, it's an avalanche of events which cause a whole bunch of calculations to fire off, which in turn causes a whole bunch more events which continues on until finally it reaches a state of equilibrium.
What I assume he wants to do is find the dependencies for those outlying calculations and work in from there so they can be rewritten and find a way for the calculations from happening for the sake of it, rather than because they need to.
I can't offer much advice in regards to simplifying the graph, as unfortunately it's not something I have much experience in. That said, I would start looking for those outlying calculations which have no dependencies, and just traverse the graph from there. Start building up a new framework that includes the core business logic of each calculation in the simplest possible way, and refactor the crap out of it along the way.

If this is, as you say, "core business logic", then you really don't want to be screwing around with fancy decompositions and evolutionary algorithms that produce a "black box" solution that no one in the world understands or is capable of modifying. I would be very surprised if any of these techniques actually yielded any useful result; the human brain is still incomprehensibly more capable than any machine at untangling complicated relationships.
What you want to do is traditional refactoring: clean up the individual procedures, streamlining them and merging them where possible. Your goal is to make the code clear, so your successor doesn't have to go through the same process.

What language are you using?
Your problem should be pretty easy to model using Java Executors and Future<> tasks, but a similar framework is perhaps availabe on your chosen platform as well?
Also, if I understand this correctly, you want to generate a critical path for a large set of interdependent calculations -- is that something done dynamically, or do you "just" need a static analysis?
Regarding an algorithmic solution; pick up the closest copy of your numerical analysis textbook and refresh your memory on singular value decompositions and LU factorization; I'm guessing from the top off my head that this is what lies behind the tool you linked to.
EDIT: Since you're using Java, I'll give a brief outline of a suggestion proposal:
-> Use a threadpool executor to parallellize all calculations easily
-> Solve interdependencies with an object map of Future<> or FutureTask<>:s, i.e. if you variables are A, B and C, where A = B + C, do something like this:
static final Map<String, FutureTask<Integer>> mapping = ...
static final ThreadPoolExecutor threadpool = ...
FutureTask<Integer> a = new FutureTask<Integer>(new Callable<Integer>() {
public Integer call() {
Integer b = mapping.get("B").get();
Integer c = mapping.get("C").get();
return b + c;
}
}
);
FutureTask<Integer> b = new FutureTask<Integer>(...);
FutureTask<Integer> c = new FutureTask<Integer>(...);
map.put("A", a);
map.put("B", a);
map.put("C", a);
for ( FutureTask<Integer> task : map.values() )
threadpool.execute(task);
Now, if I'm not totally off (and I may very well be, it was a while since I worked in Java), you should be able to solve the apparent deadlock problem by tuning the thread pool size, or use a growing thread pool. (You still have to make sure that there are no interdependent tasks though, such as if A = B + C, and B = A + 1...)

If the black-box is linear you can discover all the coefficients by simply concatenating many vectors of input and many vectors of output.
you have input x[i] and output y[i], then you create a matrix Y whose columns are y[0], y[1], ... y[n], and a matrix X whose columns are x[0], x[1], ..., x[n]. There will be a transformation Y = T * X, then you may determine T = Y * inverse(X).
But since you said it is complex I bet it is not linear. Then if you still want a general framework you can use this a factor-graph
https://ieeexplore.ieee.org/document/910572
I would be curious if you can do this.
What I think is easier is to understand the code and rewrite it using the best practices.

Related

checking for convergence in complex hierarchical models JAGS

I have estimated a complex hierarchical model with many random effects, but don't really know what the best approach is to checking for convergend. I have complex longitudinal data from a few hundred individuals and estimate quite a few parameters for every individual. Because of that, I have way to many traceplots to inspect visually. Or should I really spend a day going through all the traceplots? What would be a better way to check for convergence? Do I have to calculate Gelman and Rubin's Rhat for every parameter on the person level? And when can I conclude that the model converged? When absolutely all of the thousends of parameters reached convergence? Is it even sensible to expect that? Or is there something like "overall convergence"? And what does it mean when some person-level parameters did not converge? Does it make sense to use autorun.jags from the R2jags package with such a model or will it just run for ever? I know, these are a lot of question, but I just don't know how to approach that.
The measure I am using for convergence is a potential scale reduction factor (psrf)* using the gelman.diag function from the R package coda.
But nevertheless, I am also quickly visually inspecting all the traceplots, even though I also have tens/hundreds of them. It can be really fast if you put them in PNG files and then quickly go through them using e.g. IrfanView (let me know if you need me to expand on this).
The reason you should inspect the traceplots is pretty well described by an example from Marc Kery (author of great Bayesian books): see "Never blindly trust Rhat for convergence in a Bayesian analysis", here I include a self explanatory image from this email:
This is related to Rhat statistics while I use psrf, but it's pretty likely that psrf suffers from this too... and better to check the chains.
*) Gelman, A. & Rubin, D. B. Inference from iterative simulation using multiple sequences. Stat. Sci. 7, 457–472 (1992).

matlab running all linprog algortithms (is there a matlab-list of algorithms?)

Matlab offers multiple algorithms for solving Linear Programs.
For example Matlab R2012b offers: 'active-set', 'trust-region-reflective', 'interior-point', 'interior-point-convex', 'levenberg-marquardt', 'trust-region-dogleg', 'lm-line-search', or 'sqp'.
But other versions of Matlab support different algorithms.
I would like to run a loop over all algorithms that are supported by the users Matlab-Version. And I would like them to be ordered like the recommendation order of Matlab.
I would like to implement something like this:
i=1;
x=[];
while (isempty(x))
options=optimset(options,'Algorithm',Here_I_need_a_list_of_Algorithms(i))
x = linprog(f,A,b,Aeq,beq,lb,ub,x0,options);
end
In 99% this code should be equivalent to
x = linprog(f,A,b,Aeq,beq,lb,ub,x0,options);
but sometimes the algorithm gives back an empty array because of numerical problems (exitflag -4). If there is a chance that one of the other algorithms can find a solution I would like to try them too.
So my question is:
Is there a possibility to automatically get a list of all linprog-algorithms that are supported by the installed Matlab-version ordered like Matlab recommends them.
I think looping through all algorithms can make sense in other scenarios too. For example when you need very precise data and have a lot of time, you could run them all and than evaluate which gives the best results.
Or one would like to loop through all algorithms, if one wants to find which algorithms is the best for LPs with a certain structure.
There's no automatic way to do this as far as I know. If you really want to do it, the easiest thing to do would be to go to the online documentation, and check through previous versions (online documentation is available for old versions, not just the most recent release), and construct some variables like this:
r2012balgos = {'active-set', 'trust-region-reflective', 'interior-point', 'interior-point-convex', 'levenberg-marquardt', 'trust-region-dogleg', 'lm-line-search', 'sqp'};
...
r2017aalgos = {...};
v = ver('matlab');
switch v.Release
case '(R2012b)'
algos = r2012balgos;
....
case '(R2017a)'
algos = r2017aalgos;
end
% loop through each of the algorithms
Seems boring, but it should only take you about 30 minutes.
There's a reason MathWorks aren't making this as easy as you might hope, though, because what you're asking for isn't a great idea.
It is possible to construct artificial problems where one algorithm finds a solution and the others don't. But in practice, typically if the recommended algorithm doesn't find a solution this doesn't indicate that you should switch algorithms, it indicates that your problem wasn't well-formulated, and you should consider modifying it, perhaps by modifying some constraints, or reformulating the objective function.
And after all, why stop with just looping through the alternative algorithms? Why not also loop through lots of values for other options such as constraint tolerances, optimality tolerances, maximum number of function evaluations, etc.? These may have just as much likelihood of affecting things as a choice of algorithm. And soon you're running an optimisation algorithm to search through the space of meta-parameters for your original optimisation.
That's not a great plan - probably better to just choose one of the recommended algorithms, stick to that, and if things don't work out then focus on improving your formulation of the problems rather than over-tweaking the optimisation itself.

How to remove nodes from TensorFlow graph?

I need to write a program where part of the TensorFlow nodes need to keep being there storing some global information(mainly variables and summaries) while the other part need to be changed/reorganized as program runs.
The way I do now is to reconstruct the whole graph in every iteration. But then, I have to store and load those information manually from/to checkpoint files or numpy arrays in every iteration, which makes my code really messy and error prone.
I wonder if there is a way to remove/modify part of my computation graph instead of reset the whole graph?
Changing the structure of TensorFlow graphs isn't really possible. Specifically, there isn't a clean way to remove nodes from a graph, so removing a subgraph and adding another isn't practical. (I've tried this, and it involves surgery on the internals. Ultimately, it's way more effort than it's worth, and you're asking for maintenance headaches.)
There are some workarounds.
Your reconstruction is one of them. You seem to have a pretty good handle on this method, so I won't harp on it, but for the benefit of anyone else who stumbles upon this, a very similar method is a filtered deep copy of the graph. That is, you iterate over the elements and add them in, predicated on some condition. This is most viable if the graph was given to you (i.e., you don't have the functions that built it in the first place) or if the changes are fairly minor. You still pay the price of rebuilding the graph, but sometimes loading and storing can be transparent. Given your scenario, though, this probably isn't a good match.
Another option is to recast the problem as a superset of all possible graphs you're trying to evaluate and rely on dataflow behavior. In other words, build a graph which includes every type of input you're feeding it and only ask for the outputs you need. Good signs this might work are: your network is parametric (perhaps you're just increasing/decreasing widths or layers), the changes are minor (maybe including/excluding inputs), and your operations can handle variable inputs (reductions across a dimension, for instance). In your case, if you have only a small, finite number of tree structures, this could work well. You'll probably just need to add some aggregation or renormalization for your global information.
A third option is to treat the networks as physically split. So instead of thinking of one network with mutable components, treat the boundaries between fixed and changing pieces are inputs and outputs of two separate networks. This does make some things harder: for instance, backprop across both is now ugly (which it sounds like might be a problem for you). But if you can avoid that, then two networks can work pretty well. It ends up feeling a lot like dealing with a separate pretraining phase, which you many already be comfortable with.
Most of these workarounds have a fairly narrow range of problems that they work for, so they might not help in your case. That said, you don't have to go all-or-nothing. If partially splitting the network or creating a supergraph for just some changes works, then it might be that you only have to worry about save/restore for a few cases, which may ease your troubles.
Hope this helps!

Graph/tree representation and recursion

I'm currently writing an optimization algorithm in MATLAB, at which I completely suck, therefore I could really use your help. I'm really struggling to find a good way of representing a graph (or well more like a tree with several roots) which would look more or less like this:
alt text http://img100.imageshack.us/img100/3232/graphe.png
Basically 11/12/13 are our roots (stage 0), 2x is stage1, 3x stage2 and 4x stage3. As you can see nodes from stageX are only connected to several nodes from stage(X+1) (so they don't have to be connected to all of them).
Important: each node has to hold several values (at least 3-4), one will be it's number and at least two other variables (which will be used to optimize the decisions).
I do have a simple representation using matrices but it's really hard to maintain, so I was wondering is there a good way to do it?
Second question: when I'm done with that representation I need to calculate how good each route (from roots to the end) is (like let's say I need to compare is 11-21-31-41 the best or is 11-21-31-42 better) to do that I will be using the variables that each node holds. But the values will have to be calculated recursively, let's say we start at 11 but to calcultate how good 11-21-31-41 is we first need to go to 41, do some calculations, go to 31, do some calculations, go to 21 do some calculations and then we can calculate 11 using all the previous calculations. Same with 11-21-31-42 (we start with 42 then 31->21->11). I need to check all the possible routes that way. And here's the question, how to do it? Maybe a BFS/DFS? But I'm not quite sure how to store all the results.
Those are some lengthy questions, but I hope I'm not asking you for doing my homework (as I got all the algorithms, it's just that I'm not really good at matlab and my teacher wouldn't let me to do it in java).
Granted, it may not be the most efficient solution, but if you have access to Matlab 2008+, you can define a node class to represent your graph.
The Matlab documentation has a nice example on linked lists, which you can use as a template.
Basically, a node would have a property 'linksTo', which points to the index of the node it links to, and a method to calculate the cost of each of the links (possibly with some additional property that describe each link). Then, all you need is a function that moves down each link, and brings the cost(s) with it when it moves back up.

R-tree implementation in matlab

please, any one tell me how we can implement the R-tree structure in matlab to speed the image retrieval system , I would like to inform you that my database space a feature vector of Color Histogram (Multidimensional ) and also I I have a distance vector for similarity measure...
thanks
I don't use Matlab. So I do not have any idea how much cost is associated in Matlab with index structures. It doesn't appear to be designed for such things.
R-Trees seem to make quite a difference. Judging from http://elki.dbs.ifi.lmu.de/wiki/Benchmarking some algorithms can benefit immensely from having a good index structure. The numbers on that web page are 5 to 7 times faster on a 110250 image color histogram data set.
From my experience, R-Trees can indeed be quite hard to get right. But only if you want to go the full way. If you have a static database, you can get easily away with a bulk loaded R-Tree. Neither the bulk loading nor the queries are very hard to do. R-Trees get messy once you want to do the R*-Tree optimizations with complex split strategies, reinsertions, balancing, and do all this efficiently and on-disk with smart caching. But as long as you are operating in-memory and do not dynamically add objects, a STR bulk-loaded R-tree will help a lot and be a lot easier to implement.
You might still be better off building on something that has already a working R-Tree. Say SQLite with the rtree module or ELKI mentioned above.
Implementing R-tree is not really a simple task. You can use matlab binding for the LidarK library, it should be fast enough. The code is here:
http://graphics.cs.msu.ru/en/science/research/3dpoint/lidark
If you decide to use kd-tree (which is typical for image retrieval), there's a good implementation too.
http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
I'm not familiar with R-trees specifically but in general trees are dynamic data structures. Matlab doesn't really do dynamic data structures unless you start using its OO facilities. If you don't want to do that you can flatten your tree into a cell array. For example I'll write a (strictly) binary tree flattened into a cell array, which will save me having to draw a tree. Here goes:
{1,{2},{3}}
which represents a binary tree with root 1 and branches left to 2, right to 3. I can make this deeper:
{1,{2,{5,6}},{3,{7,8}}}
which adds another level to the previous tree. If you want to add data at any of the nodes, then your (first) tree might look like this:
{1,[a b c],{2,[e f]},{3,[h i j k l]}}
An alternative to this would be to define your nodes separately, like this
node1 = [a b c]; node2 = [e f]; node3 = [h i j k l],
then your tree becomes
{node1, node2, node3}
Your problem then becomes writing functions to build and to traverse the tree in your chosen representation. Most tree functions are best written as recursions. Any good text, and lots of Internet sites, will tell you all that you want to know about such functions.