Warning " X is rank deficient to within machine precision" problem - matlab

I am trying to build a multiple linear regression in MATLAB with 20 predictors, which are categorical with 4 levels each. I am using the function "regress", like this (these are not the actual variables):
X = [ones(size(x1)) x1 x2 x3...x20];
[b,bint,r,rint] = regress(Y, X);
Before this, I transformed the vectors x1,x2...x20 in categorical variables with dummyvar.
I get a lot of 0's in the b coefficients, along with this warning:
Warning: X is rank deficient to within machine precision.
In the dummyvar documentation it is mentioned:
To use the dummy variables in a regression model, you must either delete a column (to create a reference group) or fit a regression model with no intercept term.
I tried not using the intercept ones(size(x1)) and I get the same warning.
I would appreciate any input on how to solve this.

Try to simplify the problem down to a minimal working example, and then post that here, so we can reproduce it and help you through. See https://en.wikipedia.org/wiki/Rank_(linear_algebra) for examples of rank deficiency.
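For what it's worth, here is a minimal sketch (with made-up data and a single categorical predictor) of where the rank deficiency comes from and how dropping a reference column removes it. With 20 predictors you would drop one dummy column from each predictor's block; note that dropping only the intercept is not enough, because every full dummyvar block sums to a column of ones, so two or more full blocks are already collinear with each other.
g = randi(4, 100, 1);            % categorical predictor with 4 levels (made-up data)
D = dummyvar(g);                 % 100-by-4 dummy matrix
Y = randn(100, 1);               % placeholder response
% Rank deficient: the 4 dummy columns sum to the intercept column.
Xbad = [ones(100,1) D];
rank(Xbad)                       % returns 4, although Xbad has 5 columns
% Full rank: drop one dummy column per predictor (level 1 becomes the reference group).
Xgood = [ones(100,1) D(:,2:end)];
[b, bint, r, rint] = regress(Y, Xgood);   % no rank-deficiency warning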

Related

Matlab ode functions to get specified number of values/outputs

I have a function file with my differential equations, and I am calling ode23s on it in the standard form, i.e.
[t,m]=ode23s('DE_function',tspan,[mA pA mB pB mC pC mD],optionsDE,p)
I obtain about 150 output values for each of mA and so on. My ode23s call is working fine.
I have an experimental dataset for the same mA and so on, which I have to use to calculate the least-squares error. I am trying to do this:
a = m(:,1) - A(:,2); and so on. In my experimental data I have just 20 values, corresponding to 20 time points, and I have defined the same time points in tspan. But since my matrices do not match in dimension, I am unable to proceed with my calculations. Is there a way to get exactly 20 values from ode23s, one for each of the 20 time points (1, 2, etc.), or a way to extract and store only those?
I have been trying to find a solution for this but have been unable to find anything suitable. Many thanks for any suggestions and hints.
The MATLAB documentation has all you need. When you call ode23s you can specify the time locations in tspan:
"Interval of integration, specified as a vector. At minimum, tspan must be a two element vector [t0 tf] specifying the initial and final times. To obtain solutions at specific times between t0 and tf, use a longer vector of the form [t0,t1,t2,...,tf]. The elements in tspan must be all increasing or all decreasing."

For loop equation into Octave / Matlab code

I'm using Octave 3.8.1, which works like MATLAB.
I have an array of thousands of values; I've only included three groupings as an example below:
(amp1=0.2; freq1=3; phase1=1; is an example of one grouping)
t=0;
amp1=0.2; freq1=3; phase1=1; %1st grouping
amp2=1.4; freq2=2; phase2=1.7; %2nd grouping
amp3=0.8; freq3=5; phase3=1.5; %3rd grouping
The Octave/MATLAB code below solves for Y so I can plug it back into the equation to check values, along with calculating values not located in the array.
clear all
t = 0;
Y = 0;
a1 = [.2, 3, 1; 1.4, 2, 1.7; .8, 5, 1.5]   % each row is one grouping: [amp freq phase]
for kk = 1:size(a1, 1)                     % loop over the n groupings (rows)
    Y = Y + a1(kk,1)*cos(a1(kk,2)*t + a1(kk,3))
    kk
end
Y
PS: I'm not trying to solve for Y, since it's already solved for; I'm trying to solve for the phase.
The formulas below are used to calculate the phase, but I'm not sure how to put them into a for loop that will work on an array of n groupings:
How would I write the equation / for loop for finding the phase if I'm given freq=2.5 and amp=.23 and the phase is unknown? I've looked online and it may require solving nonlinear equations, but I'm not sure how to convert what I'm trying to do into such an equation.
phase1_test=acos(Y/amp1-amp3*cos(2*freq3*pi*t+phase3)/amp1-amp2*cos(2*freq2*pi*t+phase2)/amp1)-2*freq1*pi*t
phase2_test=acos(Y/amp2-amp3*cos(2*freq3*pi*t+phase3)/amp2-amp1*cos(2*freq1*pi*t+phase1)/amp2)-2*freq2*pi*t
phase3_test=acos(Y/amp3-amp2*cos(2*freq2*pi*t+phase2)/amp3-amp1*cos(2*freq1*pi*t+phase1)/amp3)-2*freq3*pi*t
I would like to do a check / calculate the phase when given freq and amp values.
I know I have to use a for loop, but how do I convert the phase equation into a for loop so it will work on n groupings in an array and calculate values not found in the array?
Basically I would be given an array of n groupings plus freq=2.5 and amp=.23, and I would use the formula to calculate the phase. Note: freq will not always be in the array, which is why I'm trying to calculate the phase using a formula.
Ok, I think I finally understand your question:
You are trying to find a set of phase1, phase2, ..., phaseN such that equations like the ones you describe are satisfied.
You know how to find Y, and you supply values for freq and amp.
In MATLAB, such a problem would be solved using, for example, fsolve, but let's look at your problem step by step.
For simplicity, let me re-write your equations for phase1, phase2, and phase3. For example, your first equation, the one for phase1, would read
amp1*cos(2*freq1*pi*t + phase1) + amp2*cos(2*freq2*pi*t + phase2) + amp3*cos(2*freq3*pi*t + phase3) - Y = 0
Note that ampX (X is a placeholder for 1, 2, 3) are given, pi is a constant, t is given via Y (I think), freqX are given.
Hence, you are, in fact, dealing with a non-linear vector equation of the form
F(phase) = 0
where F is a multi-dimensional (vector) function taking a multi-dimensional (vector) input variable phase (comprised of phase1, phase2,..., phaseN). And you are looking for the set of phaseX, where all of the components of your vector function F are zero. N.B. F is a shorthand for your functions. Therefore, the first component of F, called f1, for example, is
f1 = amp1*cos(phase1+...) + amp2*cos(phase2+...) + amp3*cos(phase3+...) - Y = 0.
Hence, f1 is a one-dimensional function of phase1, phase2, and phase3.
The technical term for what you are trying to do is find a zero of a non-linear vector function, or find a solution of a non-linear vector function. In Matlab, there are different approaches.
For a one-dimensional function, you can use fzero, which is explained at http://www.mathworks.com/help/matlab/ref/fzero.html?refresh=true
For a multi-dimensional (vector) function as yours, I would look into using fsolve, which is part of Matlab's optimization toolbox (which means I don't know how to do this in Octave). The function fsolve is explained at http://www.mathworks.com/help/optim/ug/fsolve.html
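To make that concrete, here is a minimal sketch assuming the amplitudes, frequencies, t, and the measured value Y are all known (the numbers below are placeholders loosely based on the question). Because there is one equation and three unknown phases, the system is underdetermined, so the sketch uses fsolve's Levenberg-Marquardt algorithm, which accepts non-square systems; it finds a solution near the initial guess rather than a unique one.
amp  = [0.2 1.4 0.8];            % known amplitudes (placeholders)
freq = [3   2   5];              % known frequencies (placeholders)
t    = 0;                        % time at which Y is evaluated
Y    = 0.5;                      % measured value of the sum (placeholder)
% Residual of amp1*cos(2*pi*freq1*t + phase1) + ... + amp3*cos(...) - Y = 0
F = @(phase) sum(amp .* cos(2*pi*freq*t + phase)) - Y;
opts  = optimoptions('fsolve', 'Algorithm', 'levenberg-marquardt');
phase = fsolve(F, [1 1 1], opts)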
If you know an approximate solution for your phases, you may also look into iterative, local methods.
In particular, I would recommend you look into Newton's method, which allows you to find a solution to your system of equations F. Wikipedia has a good explanation of Newton's method at https://en.wikipedia.org/wiki/Newton%27s_method . Newton iterations are very simple to implement and you should find a lot of resources online. You will have to compute the derivative of your function F with respect to each of your variables phaseX, which is very simple since you're only dealing with cos() functions. For starters, have a look at the one-dimensional Newton iteration in MATLAB at http://www.math.colostate.edu/~gerhard/classes/331/lab/newton.html .
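As a minimal illustration of that idea, here is a hypothetical 1-D Newton iteration for a single unknown phase, assuming the other groupings are already known and lumped into rest (all numbers are placeholders):
amp  = 0.23;  freq = 2.5;  t = 0;            % values from the question
rest = 1.4*cos(2*pi*2*t + 1.7) + 0.8*cos(2*pi*5*t + 1.5);   % known groupings
Y    = 0;                                    % target value (placeholder)
f  = @(p)  amp*cos(2*pi*freq*t + p) + rest - Y;   % residual as a function of the phase p
df = @(p) -amp*sin(2*pi*freq*t + p);              % its derivative with respect to p
p = 1;                                       % initial guess
for it = 1:50
    step = f(p)/df(p);                       % Newton step
    p    = p - step;
    if abs(step) < 1e-12, break, end
end
p                                            % estimated phase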
Finally, if you want to dig deeper, I found a textbook on this topic from the society for industrial and applied math: https://www.siam.org/books/textbooks/fr16_book.pdf .
As you can see, this is a very large field; Newton's method should be able to help you out, though.
Good luck!

leave-one-out regression using lasso in Matlab

I have 300 data samples, each with a roughly 4000-dimensional feature vector. Each input has a 5-dimensional output in the range -2 to 2. I am trying to fit a lasso model to it. I went through a few posts which talk about cross validation strategies like this one: Leave one out cross validation algorithm in matlab
But I saw that lasso does not support leaveout in Matlab! http://www.mathworks.com/help/stats/lasso.html
How can I train a model using leave one out cross validation and fit a model using lasso on my dataset? I am trying to do this in matlab. I would like to get a set of weights which I will be able to use for future predictions on other data.
I tried using glmnet: http://www.stanford.edu/~hastie/glmnet_matlab/intro.html but I couldn't compile it on my machine due to lack of proper mex compiler.
Any solutions to my problem? Thanks :)
EDIT
I am also trying to use the lasso function built into MATLAB. It has an option to perform cross validation. It outputs B and the fit statistics FitInfo, where B is the fitted coefficients, a p-by-L matrix; p is the number of predictors (columns) in X, and L is the number of Lambda values.
Now given a new test sample, how can I calculate the output using this model?
You can use a leave-one-out approach regardless of your training method. As explained here, you can use crossvalind to split the data into training and test sets.
[Train, Test] = crossvalind('LeaveMOut', N, M)
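Building on that, here is a minimal sketch of wrapping lasso in an explicit leave-one-out loop (Xdata, Ydata, and the single lambda value are placeholders; lasso fits one response column at a time, so each of the 5 output dimensions would need its own loop like this). It also shows how to predict for a held-out sample from B and FitInfo.Intercept, which relates to the EDIT:
N      = size(Xdata, 1);                 % 300 samples
lambda = 0.1;                            % a single lambda value (placeholder)
err    = zeros(N, 1);
for i = 1:N
    trainIdx = setdiff(1:N, i);          % leave sample i out
    [B, FitInfo] = lasso(Xdata(trainIdx,:), Ydata(trainIdx,1), 'Lambda', lambda);
    % Prediction for the held-out sample: intercept + X*B
    yhat   = FitInfo.Intercept + Xdata(i,:) * B;
    err(i) = (Ydata(i,1) - yhat)^2;
end
looMSE = mean(err)                       % leave-one-out mean squared error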

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower 1/4th of the 'S' shaped plot that is typical of logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal).
The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB documentation states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is it only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional this means you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:
y = 1 / (1 + exp(-(b0 + b1*x)))
This will only return values of y = 0 or y = 1 when the expression inside the exponential in the denominator goes to negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.
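Here is a minimal sketch of that experiment on made-up, perfectly separable 1-D data, showing the extreme coefficients and how flipping a single label shrinks them (all values are placeholders):
X = [0.1 0.2 0.3 0.4 0.6 0.7 0.8 0.9]';
Y = [ 0   0   0   0   1   1   1   1 ]';
b_sep = glmfit(X, Y, 'binomial', 'link', 'logit')        % huge coefficients, convergence warnings
Yflip    = Y;
Yflip(5) = 0;                                            % flip one label so the classes overlap
b_over   = glmfit(X, Yflip, 'binomial', 'link', 'logit') % much smaller coefficients
p = glmval(b_over, 0.55, 'logit')                        % predicted P(correct) for a new observation x = 0.55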

Matlab - how to compute PCA on a huge data set [duplicate]

Possible Duplicate:
MATLAB is running out of memory but it should not be
I want to perform PCA analysis on a huge data set of points. To be more specific, I have size(dataPoints) = [329150 132], where 329150 is the number of data points and 132 is the number of features.
I want to extract the eigenvectors and their corresponding eigenvalues so that I can perform PCA reconstruction.
However, when I am using the princomp function (i.e. [eigenVectors projectedData eigenValues] = princomp(dataPoints);), I obtain the following error:
>> [eigenVectors projectedData eigenValues] = princomp(pointsData);
Error using svd
Out of memory. Type HELP MEMORY for your options.
Error in princomp (line 86)
[U,sigma,coeff] = svd(x0,econFlag); % put in 1/sqrt(n-1) later
However, if I am using a smaller data set, I have no problem.
How can I perform PCA on my whole dataset in Matlab? Has anyone encountered this problem?
Edit:
I have modified the princomp function and tried to use svds instead of svd, but I am obtaining pretty much the same error. I have added the error below:
Error using horzcat
Out of memory. Type HELP MEMORY for your options.
Error in svds (line 65)
B = [sparse(m,m) A; A' sparse(n,n)];
Error in princomp (line 86)
[U,sigma,coeff] = svds(x0,econFlag); % put in 1/sqrt(n-1) later
Solution based on Eigen Decomposition
You can first compute the eigendecomposition of X'X, as @david said. Specifically, see the script below:
sz = [329150 132];
X = rand(sz);
[V D] = eig(X.' * X);
Actually, V holds the right singular vectors, and it holds the principal vectors if you put your data vectors in rows. The eigenvalues, D, are the variances along each direction. The singular values, which are the standard deviations, are computed as the square root of the variances:
S = sqrt(D);
Then, the left singular vectors, U, are computed using the formula X = USV'. Note that U refers to the principal components if your data vectors are in columns.
U = X*V*S^(-1);
Let us reconstruct the original data matrix and see the L2 reconstruction error:
X2 = U*S*V';
L2ReconstructionError = norm(X(:)-X2(:))
It is almost zero:
L2ReconstructionError =
6.5143e-012
If your data vectors are in columns and you want to convert your data into eigenspace coefficients, you should do U.'*X.
This code snippet takes around 3 seconds on my moderate 64-bit desktop.
Solution based on Randomized PCA
Alternatively, you can use a faster approximate method which is based on randomized PCA. Please see my answer in Cross Validated. You can directly compute fsvd and get U and V instead of using eig.
You may employ randomized PCA if the data size is too big. But, I think the previous way is sufficient for the size you gave.
My guess is that you have a huge data set. You don't need all of the SVD coefficients. In this case, use svds instead of svd:
Taken directly from Matlab help:
s = svds(A,k) computes the k largest singular values and associated singular vectors of matrix A.
From your question, I understand that you don't call svd directly. But you might as well take a look at princomp (It is editable!) and alter the line that calls it.
You probably ended up forming an n-by-n matrix somewhere in your computation, that is to say:
329150 * 329150 * 8 bytes ~ 866 GB
of space, which explains why you're getting a memory error. There seems to be an efficient way to calculate PCA using princomp(X, 'econ'), which I suggest you give a try.
More on this on Stack Overflow and MathWorks.
Manually compute X'X (132x132) and svd on it. Or find NIPALS script.
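A minimal sketch of that last suggestion, using the dataPoints variable name from the question (centering is included because princomp also centers the data):
Xc = bsxfun(@minus, dataPoints, mean(dataPoints, 1));    % center each feature
C  = (Xc' * Xc) / (size(Xc, 1) - 1);                     % 132-by-132 covariance matrix
[coeff, S, ~] = svd(C);            % principal directions and variances
latent = diag(S);                  % variance explained by each component
score  = Xc * coeff;               % projected data (329150-by-132)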