What we should do with highly correlated features? - linear-regression

In my data set 2 features C1 and C2 are highly correlated. I did following steps. Could you please let me know if it is correct and make sense? do you have a better approach?
First I used linear model to find the fitted line:
C1=a*C2+b
from sklearn import linear_model
reg=linear_model.LinearRegression()
y_reg = data1['C1']
x_reg = data1['C2']
reg.fit(x_reg2,y_reg2)
a=reg.coef_
b=reg.intercept_
print(a,b)
After finding a and b I removed C1 and C2 from the dataset and added a new Vriable: new=a*C1+b
My next question is how I can understand if this line is good?

In general, it is recommended to avoid having correlated features in your dataset. Indeed, a group of highly correlated features will not bring additional information (or just very few), but will increase the complexity of the algorithm, thus increasing the risk of errors. Depending on the features and the model, correlated features might not always harm the performance of the model but that is a real risk.
You can see this as an interpretation of Occam's razor: with no significant difference in performance, simpler models should be preferred. In your case, a simpler model is a model with only C1 or C2 instead of both, if the performance is similar.
Now what you did when replacing C1 and C2 by a*C1+b actually gets rid of the multicolinearity, but this does not make much sense to me. I don't see any benefits in that compared to keeping C1 only: indeed, you replaced C1 and C2 by a new variable fitted to match... C1! If the linear fit is good, there will be almost no difference.
Feature engineering and feature selection should be justified by underlying theory or at least domain knowledge, so here are several things you could do instead:
Apply PCA to the dataset before training the model: this will produce a new and uncorrelated feature set. The disvantage is that you will not be able to explain the model's decision with the original features if that is something you needed.
Use a feature selection algorithm. The best algorithm will depend on the data and the model. Here are scikit learn's feature selection algorithms as an example.
Simply compare the performance of the model with both C1 and C2, then only C1 and only C2. Then estimate whether the difference in performance makes it worth keeping both features (this is actually applying Occam's razor principle). This can be seen as a "manual" feature selection algorithm.
Use domain knowledge to choose which variable to keep: which one is the most relevant to the problrm?. This can be done if the training of the model is too expensive to allow several experiments.
I would also recommend you to read this thread on data science stack exchange as it will give you several other opinions on the issue of multicolinearity. I think this is interesting for you because it will complete the insights I gave here and help you deciding what to do with C1 and C2.

Just remove the feature that are highly correlated. Not removing feature may decrease efficiency and unnecessary cost.

Related

How to make the dynamic model in Dymola agree with the steady-state design result?

Modelica modeling is the first principle modeling, so how to test the model and set an effective benchmark is important, for example, I could design a fluid network as my wish, but when building a dynamic simulation model, I need to know the detailed geometry structure and parameters to set up every piece of my model. Usually, I would build a steady-state model with simple energy and mass conservation laws, then design every piece of equipment based on the corresponding design manual, but when I put every dynamic component together, when simulation till steady-state, the result is different from the steady-state model more or less. So I was wondering if I should modify my workflow to make the dynamic model agree with the steady-state model. Any suggestions are welcome.
#dymola #modelica
To my understanding of the question, your parameter values are fixed and physically known. I would attempt the following approach as a heuristic to identify the (few) component(s) that one needs to carefully investigate in order to understand how they influence or violates the assumed first principles.
This is just as a first trial and it could be subject to further improvement and fine-tuning.
Consider the set of significant set of variables xd(p,t) \in R^n and parameters p. Note that p also includes significant start values. p in R^m includes only the set of additional parameters not available in the steady state model.
Denote the corresponding variables of the steady state model by x_s
Denote a time point where the dynamic model is "numerically" in "semi-" steady-state by t*
Consider the function C(xd(p,t*),xs) = ||D||^2 with D = xd(p,t*) - xs
It could be beneficial to describe C as a vector rather than a single valued function.
Compute the partial derivatives of C w.t. p expressed in terms of dxd/dp, i.e.
dC/dp = d[D^T D]/dp
= d[(x_d-x_s)^T (x_d - x_s)]/dp
= (dx_d/dp)^T D + ...
Consider scaling the above function, i.e. dC/dp * p/C (avoid expected numerical issues via some epsilon-tricks)
Here you get a ranking of most significant parameters which are causing the apparent differences. The hopefully few number of components including these parameters could be the ones causing such violation.
If this still does not help, may be due to expected high correlation among parameters, I would go further and consider a dummy parameter identification problem, out of which a more rigorous ranking of significant model parameters can be obtained.
If the Modelica language had capabilities for expressing dynamic parameter sensitivities, all the above computation can be easily carried out as a single Modelica model (with a slightly modified formulation).
For instance, if we had something like der(x,p) corresponding to dx/dp, one could simply state
dcdp = der(C,p)
An alternative approach is proposed via the DerXP library

checking for convergence in complex hierarchical models JAGS

I have estimated a complex hierarchical model with many random effects, but don't really know what the best approach is to checking for convergend. I have complex longitudinal data from a few hundred individuals and estimate quite a few parameters for every individual. Because of that, I have way to many traceplots to inspect visually. Or should I really spend a day going through all the traceplots? What would be a better way to check for convergence? Do I have to calculate Gelman and Rubin's Rhat for every parameter on the person level? And when can I conclude that the model converged? When absolutely all of the thousends of parameters reached convergence? Is it even sensible to expect that? Or is there something like "overall convergence"? And what does it mean when some person-level parameters did not converge? Does it make sense to use autorun.jags from the R2jags package with such a model or will it just run for ever? I know, these are a lot of question, but I just don't know how to approach that.
The measure I am using for convergence is a potential scale reduction factor (psrf)* using the gelman.diag function from the R package coda.
But nevertheless, I am also quickly visually inspecting all the traceplots, even though I also have tens/hundreds of them. It can be really fast if you put them in PNG files and then quickly go through them using e.g. IrfanView (let me know if you need me to expand on this).
The reason you should inspect the traceplots is pretty well described by an example from Marc Kery (author of great Bayesian books): see "Never blindly trust Rhat for convergence in a Bayesian analysis", here I include a self explanatory image from this email:
This is related to Rhat statistics while I use psrf, but it's pretty likely that psrf suffers from this too... and better to check the chains.
*) Gelman, A. & Rubin, D. B. Inference from iterative simulation using multiple sequences. Stat. Sci. 7, 457–472 (1992).

How to ensure the output of _best_programs of SymbolicTransformer of gplearn is different?

I am using the SymbolicTransformer of gplearn to generate some automated features. The issue is, when I inspect the expression of the features via looking at _best_programs after fitting, I find that most of the features have the same expression. I am wondering whether there is a way to ensure that we output different features using SymbolicTransformer after fitting?
I don't know if there is a way to explicitly enforce this but you can probably try to enforce more diverse populations each generation in the hopes that this leads to a a collection of more diverse _best_programs. In my opinion a few parameters you could look into are:
p_crossover
p_subtree_mutation
p_hoise_mutation
p_point_mutation
p_point_replace
If you increase the chance of crossover or mutation you will increase your expected diversity but you must not overdue it. There is a balance between a diverse population and an accurate one. The higher the crossover or mutation the more likely that you will take a strong individual candidate and change it into something meaningless.

How to employ Learning-to-rank models (CNN, LSTM) in short-pair ranking?

In common applied learn-to-rank tasks, the inputs are usually semantic and have good syntactic structure, like Question-Answer ranking tasks. In this scenario, CNN or LSTM is a good structure to capture the latent information (local or long dependency) of QA-pairs.
But in reality, sometimes we just have short pair and discrete words. In this occasion, CNN or LSTM is still a fair choice?Or is there some more appropriate method can handle this?
The bigger question is how much training data you have. There's a lot of interesting work, but the reason that the deep neural network approaches tend to use QA ranking tasks is because those tasks typically have hundreds of thousands or millions of training examples.
When you have shorter queries, i.e. title or web queries, you will possibly need even more data to learn, because less of the network will be exercised by each training instance. It is possible, but the method you choose should be based on the training data you have available, rather than the size of your queries, in general.
[0-50 queries] -> Hand-tuned, time-tested model such as Query Likelihood, BM25, (or if you want better results, ngram models such as SDM) (if you want more recall, pseudo-relevance-feedback models such as RM3).
[50-1000 queries] -> Linear or Tree-based learning-to-rank methods
[1000-millions] -> Deep approach, or possibly still learning-to-rank. I'm not sure any of the deep papers have truly dominated a state-of-the-art gradient-boosted-regression-tree setup.
A recent paper by one of my labmates used pseudo-labels from BM25 to bootstrap a DNN. They got good results (better than BM25), but they literally had to be Google (training-time-wise) to pull it off.

Matrix-Algebra Design Decomposition

I am looking at refactoring some very complex code which is a subsystem of a project I have at work. Part of my examination of this code is that it is incredibly complex, and contains a lot of inputs, intermediate values and outputs depending on some core business logic.
I want to redesign this code to be easier to maintain as well as executing a hell of a lot faster, so to start off with I have been trying to look at each of the parameters and their dependencies on each other. This has lead to quite a large and tangled graph and I would like a mechanism for simplifying this graph.
A while back I came across a technique in a book about SOA design called "Matrix Design Decomposition" which uses a matrix of outputs and the dependencies they have on the inputs, applies some form of matrix algebra and can generate Business Process diagrams for those dependencies.
I know there is a web tool available at http://www.designdecomposition.com/ however it is limited in the number of input/output dependencies you can have. I have tried looking around for the algorithmic source for this tool (so I could attempt to implement it myself without the size limitation), however I have had no luck.
Does anybody know a similar technique that I could use? Currently I am even considering taking the dependency matrix and applying some Genetic Algorithms to see if evolution can come up with a simpler workflow...
Cheers,
Aidos
EDIT:
I will explain the motivation:
The original code was written for a system which computed all of the values (about 60) every time the user performed an operation (adding, removing or modifying certain properties of a item). This code was written over ten years ago and is definitely showing signs of age - others have added more complex calculations into the system and now we are getting completely unreasonable performance (up to 2 minutes before control is returned to the user). It has been decided to detach the calculations from the user actions and provide a button to "recalculate" the values.
My problem arises because there are so many calculations that are going on and they are based on the assumption that all of the required data will be available for their computation - now when I try to re-implement the calculations I keep encountering problems because I haven't got the result for a different calculation that this calculation relies on.
This is where I want to use the matrix-decomposition approach. The MD approach allows me to specify all of the inputs and outputs and gives me the "simplest" workflow that I can use for generating all of the outputs.
I can then use this "workflow" to know the precedence of the calculations I need to perform to get the same result without generating any exceptions. It also shows me which parts of the calculation system I can parallelise and where the fork and join points will be (I won't worry about that part just yet). At the moment all I have is an insanely large matrix with lots of dependencies showing in it, with no idea where to start.
I will elaborate from my comment a little more:
I don't want to use the solution from the EA process in the actual program. I want to take the dependency matrix and decompose it into modules that I will then code manually - this is purely a design aid - I am just interested in what the inputs/outputs for these modules will be. Basically a representation of the complex interdependencies between these calculations, as well as some idea of precedence.
Say I have A requires B and C. D requires A and E. F requires B, A and E, I want to effectively partition the problem space from a complex set of dependencies into a "workflow" that I can examine to get a better understanding. Once I have this understanding I can come up with a better design / implementation that is still human readable, so for the example I know I need to calculate A, then C, then D, then F.
--
I know this seems kind of strange, if you take a look at the website I linked to before the matrix based decomposition there should give you some understanding of what I am thinking of...
kquinn, If it's the piece of code I think he's referring to (I used to work there), it's already a black box solution that no human can understand as is. He's not looking to make it more complicated, less in fact. What he's trying to achieve is a whole heap of interlinked calculations.
What currently happens, is that whenever anything changes, it's an avalanche of events which cause a whole bunch of calculations to fire off, which in turn causes a whole bunch more events which continues on until finally it reaches a state of equilibrium.
What I assume he wants to do is find the dependencies for those outlying calculations and work in from there so they can be rewritten and find a way for the calculations from happening for the sake of it, rather than because they need to.
I can't offer much advice in regards to simplifying the graph, as unfortunately it's not something I have much experience in. That said, I would start looking for those outlying calculations which have no dependencies, and just traverse the graph from there. Start building up a new framework that includes the core business logic of each calculation in the simplest possible way, and refactor the crap out of it along the way.
If this is, as you say, "core business logic", then you really don't want to be screwing around with fancy decompositions and evolutionary algorithms that produce a "black box" solution that no one in the world understands or is capable of modifying. I would be very surprised if any of these techniques actually yielded any useful result; the human brain is still incomprehensibly more capable than any machine at untangling complicated relationships.
What you want to do is traditional refactoring: clean up the individual procedures, streamlining them and merging them where possible. Your goal is to make the code clear, so your successor doesn't have to go through the same process.
What language are you using?
Your problem should be pretty easy to model using Java Executors and Future<> tasks, but a similar framework is perhaps availabe on your chosen platform as well?
Also, if I understand this correctly, you want to generate a critical path for a large set of interdependent calculations -- is that something done dynamically, or do you "just" need a static analysis?
Regarding an algorithmic solution; pick up the closest copy of your numerical analysis textbook and refresh your memory on singular value decompositions and LU factorization; I'm guessing from the top off my head that this is what lies behind the tool you linked to.
EDIT: Since you're using Java, I'll give a brief outline of a suggestion proposal:
-> Use a threadpool executor to parallellize all calculations easily
-> Solve interdependencies with an object map of Future<> or FutureTask<>:s, i.e. if you variables are A, B and C, where A = B + C, do something like this:
static final Map<String, FutureTask<Integer>> mapping = ...
static final ThreadPoolExecutor threadpool = ...
FutureTask<Integer> a = new FutureTask<Integer>(new Callable<Integer>() {
public Integer call() {
Integer b = mapping.get("B").get();
Integer c = mapping.get("C").get();
return b + c;
}
}
);
FutureTask<Integer> b = new FutureTask<Integer>(...);
FutureTask<Integer> c = new FutureTask<Integer>(...);
map.put("A", a);
map.put("B", a);
map.put("C", a);
for ( FutureTask<Integer> task : map.values() )
threadpool.execute(task);
Now, if I'm not totally off (and I may very well be, it was a while since I worked in Java), you should be able to solve the apparent deadlock problem by tuning the thread pool size, or use a growing thread pool. (You still have to make sure that there are no interdependent tasks though, such as if A = B + C, and B = A + 1...)
If the black-box is linear you can discover all the coefficients by simply concatenating many vectors of input and many vectors of output.
you have input x[i] and output y[i], then you create a matrix Y whose columns are y[0], y[1], ... y[n], and a matrix X whose columns are x[0], x[1], ..., x[n]. There will be a transformation Y = T * X, then you may determine T = Y * inverse(X).
But since you said it is complex I bet it is not linear. Then if you still want a general framework you can use this a factor-graph
https://ieeexplore.ieee.org/document/910572
I would be curious if you can do this.
What I think is easier is to understand the code and rewrite it using the best practices.