I have a rather simple question that needs addressing in MATLAB. I think I understand, but I need someone to confirm I'm doing this correctly:
In the following example I'm trying to calculate the correlation between two vectors and the p values for the correlation.
dat = [1,3,45,2,5,56,75,3,3.3];
dat2 = [3,33,5,6,4,3,2,5,7];
[R,p] = corrcoef(dat,dat2,'rows','pairwise');
R2 = R(1,2).^2;
pvalue = p(1,2);
From this I have an R2 value of 0.11 and a p-value of 0.38. Does this mean that the vectors are correlated by 0.11 (i.e. 11%), and that this would be expected to occur by chance 38% of the time, so 62% of the time a different correlation could occur?
>> [R,p] = corrcoef(dat,dat2,'rows','pairwise')
R =
1.0000 -0.3331
-0.3331 1.0000
p =
1.0000 0.3811
0.3811 1.0000
The correlation is -0.3331 and the p-value is 0.3811. The latter is the probability of getting a correlation at least as large in magnitude as 0.3331 by random chance when the true correlation is zero. The p-value is large, so we cannot reject the null hypothesis of no correlation at any reasonable significance level.
The correlation coefficient here is
R(1,2)
ans =
-0.3331
which is a correlation of -33.3%, which tells you that the two datasets are negatively linearly correlated. You can see this by plotting them:
plot(dat, dat2, '.'), grid, lsline
The p-value of the correlation is
p(1,2)
ans =
0.3811
This tells you that even if there was no correlation between two random variables, then in a sample of 9 observations you would expect to see a correlation at least as extreme as -33.3% about 38.1% of the time.
By at least as extreme we mean that the measured correlation in a sample would be below -33.3%, or above 33.3%.
Given that the p-value is so large, you cannot reject the null hypothesis of zero correlation; the data are consistent with there being no linear relationship between the two vectors.
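If you want to convince yourself of that interpretation, a quick Monte Carlo check (my own sketch, assuming roughly normal data, not part of the original answer) reproduces the p-value approximately:
% draw many pairs of independent 9-element samples and count how often
% the sample correlation is at least as extreme as the observed -0.3331
nSim = 1e5;
count = 0;
for s = 1:nSim
    r = corrcoef(randn(9,1), randn(9,1));
    count = count + (abs(r(1,2)) >= 0.3331);
end
fprintf('Fraction with |r| >= 0.3331: %.3f\n', count/nSim)  % comes out near 0.38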
Related
I have 3 matrices (55000x3 double) and want to compare them.
I'm taking the arithmetic mean of the value of each position and want to provide in addition an indicator how the three matrices correlate.
The values in one position of the matrices are for example:
Matrix1 pos(1:1): 3.679
Matrix2 pos(1:1): 3.721
Matrix3 pos(1:1): 3.554
As I cannot just give the standard deviation for each value, because it would be too much information, I'm looking for a way to give a meaningful statement about the correlation without providing too much information.
What's the best way to do this?
I think you want the correlation coefficient. You can reshape each of your matrices into a vector (using (:)), and then compute the correlation coefficient for each pair of vectors (originally matrices) using corrcoef.
For example, let:
Matrix1 = [ 1 2; 3 4; 5 6 ];
Matrix2 = -2*[ 1 2; 3 4; 5 6 ];
Matrix3 = [ 1.1 2.3; 3.4 4.1; 4.9 6.3 ];
Then
C = corrcoef([Matrix1(:) Matrix2(:) Matrix3(:)]);
gives
C =
1.0000 -1.0000 0.9952
-1.0000 1.0000 -0.9952
0.9952 -0.9952 1.0000
This tells you that, in this case,
Each of the three matrices is perfectly correlated with itself (C(1,1), C(2,2) and C(3,3) equal 1). This is obvious.
Matrices 1 and 2 have correlation coefficient C(1,2) equal to -1. This was expected, because matrix 2 is a negative multiple of matrix 1.
Matrices 1 and 3 are highly correlated (C(1,3) is 0.9952). This is because matrix 3 was defined as matrix 1 with some random "noise".
Matrices 2 and 3 are also highly correlated but with negative sign (C(2,3) is -0.9952), as should be clear from the above.
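If you need to report a single indicator, as the question asks, one option (a sketch of my own, not the only reasonable summary) is the mean absolute off-diagonal entry of the correlation matrix C computed above:
offDiag = C(~eye(size(C)));        % drop the trivial 1s on the diagonal
summaryCorr = mean(abs(offDiag));  % one number describing overall agreement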
Have you tried representing your data using boxplot?
boxplot([Matrix1(:), Matrix2(:), Matrix3(:)]);  % one box per matrix
I have two variables, both of class double:
X = 11x3 Matrix (Showing number of Negative, Neutral, Positive elements in each row)
Y = 11x1 (showing prices)
How would I show the correlation between these two variables and also fit a linear regression model to them?
I have tried :
corrcoef([X,Y])
ans =
1.0000 0.3119 0.6753 0.0996
0.3119 1.0000 0.4582 -0.0565
0.6753 0.4582 1.0000 -0.0627
0.0996 -0.0565 -0.0627 1.0000
But not sure if this is correct
Many thanks
The specific problem with your code is that in your line corrcoef([X,Y]) you just lumped your X and Y into one variable. You can definitely get the answer that you want out of this matrix (the off-diagonal terms are the correlation between the columns of X and your Y) but this might not be quite what you were expecting.
When you are unsure, I always recommend breaking the problem down into the smallest steps. In this case, things are perhaps confusing because your X has three columns while your Y has only one. What does corrcoef do in that case?
For the operation that you are interested in (correlation with Y and a linear regression), there is no interdependence between the three columns of X. So, a good simplifying step would be to deal with the 3 columns independently. You can do it in a for loop (yes, you can do it all vectorized at once, but doing it in a for loop makes it easier to understand when one is unsure)...
%see the correlation between each column of X and Y
for I=1:3
x_foo = X(:,I);
%http://www.mathworks.com/help/matlab/ref/corrcoef.html
c = corrcoef(x_foo,Y)  % no semicolon so the result is displayed
end
Then, you can do the next step...the linear regression. Use the polyfit function to fit a line.
figure;
for I=1:3
x_foo = X(:,I);
%http://www.mathworks.com/help/matlab/ref/polyfit.html
N = 1; % order of the desired polynomial. N=1 means a line
p = polyfit(x_foo,Y,N); %N=1 will fit a line
%plot
subplot(3,1,I)
plot(x_foo,Y,'o',x_foo,polyval(p,x_foo),'s');
legend('Data','Linear Fit');
end
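As an aside, if you have the Statistics Toolbox, the correlation loop above collapses into a single call (a sketch, not required for the rest of the answer):
%corr returns the correlation of each column of X with each column of Y,
%so this gives a 3x1 vector, one entry per column of X
c_all = corr(X, Y);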
I already have my data prepared in terms of:
p1=input1 %load of today current hour
p2=input2 %load of today past one hour
p3=input3 %load of today past two hours
a1=output %load of next day current hour
I have the following code below:
%Input Set 1 For Weekday Load(d+1,t)
%(d,t),(d,t-1), (d,t-2)
L=xlsread('input_set1_weekday.xlsx',1); %2011
k=1;
size(L,1);
for a=5:2:size(L,1)-48 % L load for 2011
P(1,k)= L(a,1);
P(2,k)= L(a-2,1);
P(3,k)= L(a-4,1);
P(4,k)= L(a+48,1);
k=k+1;
end
I have my data arranged in such a way that in every column, p1, p2, p3 are my predictor variables and a1 is my response variable.
How do I now fit a linear model to this set of data to check the performance of my predictions? By the way it is electrical load forecasting model.
My other doubt is that in the examples shown by most sources, they use the last column of the data as the response variable, and this is the part I'm struggling with.
fitlm will be able to do this for you quite nicely. You use fitlm to train a linear regression model, so you provide it the predictors as well as the responses. Once you do this, you can then use predict to predict the new responses based on new predictors that you put in.
The basic way for you to call this is:
lmModel = fitlm(X, y, 'linear', 'RobustOpts', 'on');
X is a data matrix where each column is a predictor and each row is an observation. Therefore, you would have to transpose your matrix before running this function. Basically, you would do P(1:3,:).' as you only want the first three rows (now columns) of your data. y would be your output values for each observation and this is a column vector that has the same number of rows as your observations.

Regarding your comment about using the "last" column as the response vector, you don't have to do this at all. You specify your response vector in a completely separate input variable, which is y. As such, your a1 would serve here, while your predictors and observations would be stored in X. You can totally place your response vector as a column in your matrix; you would just have to subset it accordingly.
As such, y would be your a1 variable; make sure it's a column vector, which you can guarantee by doing a1(:). The 'linear' flag specifies linear regression, but that is the default anyway. RobustOpts is recommended so that you perform robust linear regression. For your case, you would call fitlm this way:
lmModel = fitlm(P(1:3,:).', a1(:), 'linear', 'RobustOpts', 'on');
Now to predict new responses, you would do:
ypred = predict(lmModel, Xnew);
Xnew would be your new observations, which follow the same layout as X. They must have the same number of columns as X, but you can have as many rows as you want. The output ypred will give you the predicted response for each observation of Xnew.

As an example, let's use a dataset that is built into MATLAB, split the data into training and test sets, fit a model with the training set, then run the test set through it and see what the predicted responses are. Let's split the data in a 75% / 25% ratio. We will use the carsmall dataset, which contains 100 observations for various cars with descriptors such as Weight, Displacement, Model... typically used to describe cars. We will use Weight, Cylinders and Acceleration as the predictor variables, and try to predict the miles per gallon MPG as our outcome. Once that's done, let's calculate the difference between the predicted values and the true values and compare them. As such:
load carsmall; %// Load in dataset
%// Build predictors and outcome
X = [Weight Cylinders Acceleration];
y = MPG;
%// Set seed for reproducibility
rng(1234);
%// Generate training and test data sets
%// Randomly select 75 observations for the training
%// dataset. First generate the indices to select the data
indTrain = randperm(100, 75);
%// The above may generate an error if you have anything below R2012a
%// As such, try this if the above doesn't work
%//indTrain = randPerm(100);
%//indTrain = indTrain(1:75);
%// Get those indices that haven't been selected as the test dataset
indTest = 1 : 100;
indTest(indTrain) = [];
%// Now build our test and training data
trainX = X(indTrain, :);
trainy = y(indTrain);
testX = X(indTest, :);
testy = y(indTest);
%// Fit linear model
lmModel = fitlm(trainX, trainy, 'linear', 'RobustOpts', 'on');
%// Now predict
ypred = predict(lmModel, testX);
%// Show differences between predicted and true test output
diffPredict = abs(ypred - testy);
This is what happens when you echo out what the linear model looks like:
lmModel =
Linear regression model (robust fit):
y ~ 1 + x1 + x2 + x3
Estimated Coefficients:
Estimate SE tStat pValue
__________ _________ _______ __________
(Intercept) 52.495 3.7425 14.027 1.7839e-21
x1 -0.0047557 0.0011591 -4.1031 0.00011432
x2 -2.0326 0.60512 -3.359 0.0013029
x3 -0.26011 0.1666 -1.5613 0.12323
Number of observations: 70, Error degrees of freedom: 66
Root Mean Squared Error: 3.64
R-squared: 0.788, Adjusted R-Squared 0.778
F-statistic vs. constant model: 81.7, p-value = 3.54e-22
This all comes from statistical analysis, but for a novice what matters most are the p-values for each of our predictors. The smaller the p-value, the more relevant the predictor is to the model. You can see that the first two predictors, Weight and Cylinders, do a good job of explaining the MPG. Acceleration... not so much. This means that Acceleration is not a meaningful predictor here, so you should probably drop it or use something else. In fact, if you removed this predictor and retrained your model, you would most likely see that the predicted values closely match those obtained when Acceleration was included.
This is a truly bastardized version of interpreting p-values, so I refer you to an actual regression or statistics course for more details.
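If you wanted to try that, a minimal sketch of my own (reusing the indTrain and indTest indices defined above) would look like this:
%// Refit without the Acceleration column and predict again
X2 = [Weight Cylinders];  %// drop Acceleration from the predictors
lmModel2 = fitlm(X2(indTrain,:), y(indTrain), 'linear', 'RobustOpts', 'on');
ypred2 = predict(lmModel2, X2(indTest,:));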
This is what we have predicted the values to be, given our test set and beside it what the true values are:
>> [ypred testy]
ans =
17.0324 18.0000
12.9886 15.0000
13.1869 14.0000
14.1885 NaN
16.9899 14.0000
29.1824 24.0000
23.0753 18.0000
28.6148 28.0000
28.2572 25.0000
29.0365 26.0000
20.5819 22.0000
18.3324 20.0000
20.4845 17.5000
22.3334 19.0000
12.2569 16.5000
13.9280 13.0000
14.7350 13.0000
26.6757 27.0000
30.9686 36.0000
30.4179 31.0000
29.7588 36.0000
30.6631 38.0000
28.2995 26.0000
22.9933 22.0000
28.0751 32.0000
The fourth actual output value from the test data set is NaN, which denotes that the value is missing. However, when we run the observation corresponding to this output value through our linear model, it predicts a value anyway, which is to be expected: the model was trained on the other observations, so it naturally draws on what it learned from them to produce a prediction for this one.
When we compute the difference between these two, we get:
diffPredict =
0.9676
2.0114
0.8131
NaN
2.9899
5.1824
5.0753
0.6148
3.2572
3.0365
1.4181
1.6676
2.9845
3.3334
4.2431
0.9280
1.7350
0.3243
5.0314
0.5821
6.2412
7.3369
2.2995
0.9933
3.9249
As you can see, there are some instances where the prediction was quite close and others where it was far from the truth; that's the crux of any prediction algorithm, really. You'll have to play around with which predictors you use, as well as with the training options. Have a look at the fitlm documentation for more details on what you can tune.
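If you want a single number to compare one choice of predictors against another, one simple summary (my own sketch) is the root-mean-square error over the test set, skipping the missing entry:
%// RMSE of the test-set predictions, ignoring the NaN observation
valid = ~isnan(diffPredict);
rmse = sqrt(mean(diffPredict(valid).^2));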
Edit - July 30th, 2014
As you don't have fitlm, you can easily use LinearModel.fit instead. You call it with the same inputs as fitlm. As such:
lmModel = LinearModel.fit(trainX, trainy, 'linear', 'RobustOpts', 'on');
This should give you exactly the same results. predict should exist pre-R2014a, so that should be available to you.
Good luck!
I have a set of features that I wish to rank according to their correlation coefficients with each other, without accounting for the true label (that would be supervised feature selection, right?).
My objective is to select as the first feature the one most correlated with all the others, remove it, and so on.
The problem is how to compute the correlation of a vector with a matrix (all the other vectors/features). Is it possible to do this, or am I going about it the wrong way?
PS: I'm using MATLAB 2013b
Thank you all
Say you have an n-by-d matrix X where the rows are instances and the columns are the features/dimensions; then you can compute the correlation coefficient matrix simply using the corr or corrcoef functions:
% Fisher Iris dataset, 150x4
>> load fisheriris
>> X = meas;
>> C = corr(X)
C =
1.0000 -0.1176 0.8718 0.8179
-0.1176 1.0000 -0.4284 -0.3661
0.8718 -0.4284 1.0000 0.9629
0.8179 -0.3661 0.9629 1.0000
The result is a d-by-d matrix containing correlation coefficients of each feature against every other feature. The diagonal is thus all ones (because corr(x,x) = 1), the matrix is also symmetric (because corr(x,y) = corr(y,x)). Values range from -1 to 1, where -1 means inverse correlation between two variables, 1 means positive correlation, and 0 means no linear correlation.
Now because you want to remove the feature which is on average the most correlated with other features, you have to summarize that matrix as one number per feature. One way to do that is to compute the mean:
% mean
>> mean_corr = mean(C)
mean_corr =
0.6430 0.0220 0.6015 0.6037
% most correlated feature on average
>> [~,idx] = max(mean_corr)
idx =
1
% drop that feature
>> X(:,idx) = [];
EDIT:
I probably should have taken the mean of the absolute value of C in the above code, because we don't care if two variables are positively or negatively correlated, only how strong the correlation is.
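Starting again from the full X = meas, that adjustment might look like this (just a sketch):
% average |correlation| per feature, excluding the trivial 1 on the diagonal
Cabs = abs(corr(X));
d = size(Cabs, 1);
mean_abs_corr = (sum(Cabs) - 1) / (d - 1);  % each column contains exactly one diagonal 1
[~,idx] = max(mean_abs_corr);               % feature most correlated with the rest on average
X(:,idx) = [];                              % drop it; repeat as needed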
I'd like to compute a low-rank approximation to a matrix which is optimal under the Frobenius norm. The trivial way to do this is to compute the SVD decomposition of the matrix, set the smallest singular values to zero and compute the low-rank matrix by multiplying the factors. Is there a simple and more efficient way to do this in MATLAB?
If your matrix is sparse, use svds.
Assuming it is not sparse but it's large, you can use random projections for fast low-rank approximation.
From a tutorial:
An optimal low rank approximation can be easily computed using the SVD of A in O(mn^2). Using random projections we show how to achieve an "almost optimal" low rank approximation in O(mn log(n)).
Matlab code from a blog:
clear
% preparing the problem
% trying to find a low rank approximation to A, an m x n matrix
% where m >= n
m = 1000;
n = 900;
%// first let's produce example A
A = rand(m,n);
%
% beginning of the algorithm designed to find a low rank approximation of A
% let us define that rank to be equal to k
k = 50;
% R is an m x l matrix drawn from a N(0,1)
% where l is such that l > c log(n)/ epsilon^2
%
l = 100;
% timing the random algorithm
trand =cputime;
R = randn(m,l);
B = 1/sqrt(l)* R' * A;
[a,s,b]=svd(B);
Ak = A*b(:,1:k)*b(:,1:k)';
trandend = cputime-trand;
% now timing the normal SVD algorithm
tsvd = cputime;
% doing it the normal SVD way
[U,S,V] = svd(A,0);
Aksvd= U(1:m,1:k)*S(1:k,1:k)*V(1:n,1:k)';
tsvdend = cputime -tsvd;
Also, remember the econ parameter of svd.
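As a quick check of the "optimal under the Frobenius norm" point from the question, you can compare the errors of the two approximations computed above (a sketch reusing A, Ak and Aksvd from that code):
% the truncated SVD should never have a larger Frobenius error than the
% random-projection approximation of the same rank
err_rand = norm(A - Ak, 'fro');     % randomized approximation error
err_svd  = norm(A - Aksvd, 'fro');  % optimal rank-k (truncated SVD) error
fprintf('random projection: %.4g, truncated SVD: %.4g\n', err_rand, err_svd)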
You can rapidly compute a low-rank approximation based on SVD, using the svds function.
[U,S,V] = svds(A,r); %# only first r singular values are computed
svds uses eigs to compute a subset of the singular values - it will be especially fast for large, sparse matrices. See the documentation; you can set the tolerance and the maximum number of iterations, or choose to calculate the smallest singular values instead of the largest.
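To actually form the low-rank matrix from those factors and see how good the approximation is, a minimal sketch (with A and r as above) would be:
%# rebuild the rank-r approximation and measure its relative error
Ar = U*S*V';                                    %# m-by-n, rank at most r
relErr = norm(A - Ar, 'fro') / norm(A, 'fro');  %# relative Frobenius error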
I thought svds and eigs could be faster than svd and eig for dense matrices, but then I did some benchmarking. They are only faster for large matrices when sufficiently few values are requested:
n k svds svd eigs eig comment
10 1 4.6941e-03 8.8188e-05 2.8311e-03 7.1699e-05 random matrices
100 1 8.9591e-03 7.5931e-03 4.7711e-03 1.5964e-02 (uniform dist)
1000 1 3.6464e-01 1.8024e+00 3.9019e-02 3.4057e+00
2 1.7184e+00 1.8302e+00 2.3294e+00 3.4592e+00
3 1.4665e+00 1.8429e+00 2.3943e+00 3.5064e+00
4 1.5920e+00 1.8208e+00 1.0100e+00 3.4189e+00
4000 1 7.5255e+00 8.5846e+01 5.1709e-01 1.2287e+02
2 3.8368e+01 8.6006e+01 1.0966e+02 1.2243e+02
3 4.1639e+01 8.4399e+01 6.0963e+01 1.2297e+02
4 4.2523e+01 8.4211e+01 8.3964e+01 1.2251e+02
10 1 4.4501e-03 1.2028e-04 2.8001e-03 8.0108e-05 random pos. def.
100 1 3.0927e-02 7.1261e-03 1.7364e-02 1.2342e-02 (uniform dist)
1000 1 3.3647e+00 1.8096e+00 4.5111e-01 3.2644e+00
2 4.2939e+00 1.8379e+00 2.6098e+00 3.4405e+00
3 4.3249e+00 1.8245e+00 6.9845e-01 3.7606e+00
4 3.1962e+00 1.9782e+00 7.8082e-01 3.3626e+00
4000 1 1.4272e+02 8.5545e+01 1.1795e+01 1.4214e+02
2 1.7096e+02 8.4905e+01 1.0411e+02 1.4322e+02
3 2.7061e+02 8.5045e+01 4.6654e+01 1.4283e+02
4 1.7161e+02 8.5358e+01 3.0066e+01 1.4262e+02
The table shows runtimes in seconds for size-n square matrices and k requested singular values/eigenvalues. I used Steve Eddins' timeit File Exchange function for benchmarking, which tries to account for overhead and run-to-run variation.
svds and eigs are faster only if you want a few values from a very large matrix. It also depends on the properties of the matrix in question (running edit svds to look at its implementation should give you some idea why).
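If you want to reproduce this kind of comparison on your own matrices, a minimal sketch using timeit (built into MATLAB since R2013b; older releases need the File Exchange version mentioned above) looks like this; the matrix size and number of values here are arbitrary examples:
n = 1000;                           % example matrix size
k = 4;                              % number of singular values requested
A = rand(n);                        % dense random test matrix
t_svds = timeit(@() svds(A, k));    % only the k largest singular values
t_svd  = timeit(@() svd(A));        % all singular values
fprintf('svds: %.3g s, svd: %.3g s\n', t_svds, t_svd)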