Multivariate Linear Regression in MATLAB

I already have my data prepared in terms of:
p1=input1 %load of today current hour
p2=input2 %load of today past one hour
p3=input3 %load of today past two hours
a1=output %load of next day current hour
I have the following code:
%Input Set 1 For Weekday Load(d+1,t)
%(d,t),(d,t-1), (d,t-2)
L=xlsread('input_set1_weekday.xlsx',1); %2011
k=1;
size(L,1);
for a=5:2:size(L,1)-48 % L load for 2011
P(1,k)= L(a,1);
P(2,k)= L(a-2,1);
P(3,k)= L(a-4,1);
P(4,k)= L(a+48,1);
k=k+1;
end
I have my data arranged in such a way that in every column, p1, p2, p3 are my predictor variables and a1 is my response variable.
How do I now fit a linear model to this set of data to check the performance of my predictions? By the way, it is an electrical load forecasting model.
My other doubt is that in the examples shown by most sources, the last column of the data is used as the response variable, and this is the part I'm struggling with.

fitlm will be able to do this for you quite nicely. You use fitlm to train a linear regression model, so you provide it the predictors as well as the responses. Once you do this, you can then use predict to predict the new responses based on new predictors that you put in.
The basic way for you to call this is:
lmModel = fitlm(X, y, 'linear', 'RobustOpts', 'on');
X is a data matrix where each column is a predictor and each row is an observation. Therefore, you would have to transpose your matrix before running this function. Basically, you would do P(1:3,:).' as you only want the first three rows (now columns) of your data. y would be your output values for each observation and this is a column vector that has the same number of rows as your observations. Regarding your comment about using the "last" column as the response vector, you don't have to do this at all. You specify your response vector in a completely separate input variable, which is y. As such, your a1 would serve here, while your predictors and observations would be stored in X. You can totally place your response vector as a column in your matrix; you would just have to subset it accordingly.
As such, y would be your a1 variable; make sure it's a column vector, which you can guarantee by writing a1(:). The 'linear' flag specifies linear regression, but that is the default anyway. 'RobustOpts' is recommended so that you perform robust linear regression. For your case, you would call fitlm this way:
lmModel = fitlm(P(1:3,:).', a1(:), 'linear', 'RobustOpts', 'on');
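If you instead keep the response as the fourth row of P, as built in your loop above, an equivalent call would be the following small sketch (same function, just subsetting P instead of using a separate a1 variable):
X = P(1:3,:).';   % predictors: load at (d,t), (d,t-1), (d,t-2), one observation per row
y = P(4,:).';     % response: load at (d+1,t)
lmModel = fitlm(X, y, 'linear', 'RobustOpts', 'on');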
Now to predict new responses, you would do:
ypred = predict(lmModel, Xnew);
Xnew would be your new observations, following the same layout as X: it must have the same number of columns as X, but can have as many rows as you want. The output ypred will give you the predicted response for each row of Xnew. As an example, let's use a dataset that is built into MATLAB, split the data into a training and a test set, fit a model with the training set, then feed in the test set and see what the predicted responses are. Let's split the data with a 75% / 25% ratio. We will use the carsmall dataset, which contains 100 observations of various cars with descriptors such as Weight, Displacement, Model... typically used to describe cars. We will use Weight, Cylinders and Acceleration as the predictor variables, and let's try to predict the miles per gallon MPG as our outcome. Once I do this, let's calculate the difference between the predicted values and the true values and compare them. As such:
load carsmall; %// Load in dataset
%// Build predictors and outcome
X = [Weight Cylinders Acceleration];
y = MPG;
%// Set seed for reproducibility
rng(1234);
%// Generate training and test data sets
%// Randomly select 75 observations for the training
%// dataset. First generate the indices to select the data
indTrain = randperm(100, 75);
%// The above may generate an error if you have anything below R2012a
%// As such, try this if the above doesn't work
%//indTrain = randperm(100);
%//indTrain = indTrain(1:75);
%// Get those indices that haven't been selected as the test dataset
indTest = 1 : 100;
indTest(indTrain) = [];
%// Now build our test and training data
trainX = X(indTrain, :);
trainy = y(indTrain);
testX = X(indTest, :);
testy = y(indTest);
%// Fit linear model
lmModel = fitlm(trainX, trainy, 'linear', 'RobustOpts', 'on');
%// Now predict
ypred = predict(lmModel, testX);
%// Show differences between predicted and true test output
diffPredict = abs(ypred - testy);
This is what happens when you echo out what the linear model looks like:
lmModel =
Linear regression model (robust fit):
y ~ 1 + x1 + x2 + x3
Estimated Coefficients:
Estimate SE tStat pValue
__________ _________ _______ __________
(Intercept) 52.495 3.7425 14.027 1.7839e-21
x1 -0.0047557 0.0011591 -4.1031 0.00011432
x2 -2.0326 0.60512 -3.359 0.0013029
x3 -0.26011 0.1666 -1.5613 0.12323
Number of observations: 70, Error degrees of freedom: 66
Root Mean Squared Error: 3.64
R-squared: 0.788, Adjusted R-Squared 0.778
F-statistic vs. constant model: 81.7, p-value = 3.54e-22
This all comes from statistical analysis, but for a novice, what matters are the p-values for each of our predictors. The smaller the p-value, the more suitable the predictor is for your model. You can see that the first two predictors, Weight and Cylinders, do a good job of determining the MPG. Acceleration... not so much. What this means is that this variable is not a meaningful predictor, so you should probably use something else. In fact, if you were to remove this predictor and retrain your model, you would most likely see that the predicted values closely match those where Acceleration was included.
This is a truly bastardized way of interpreting p-values, so I defer you to an actual regression or statistics course for more details.
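If you want to try that, here is a quick sketch, reusing the training and test variables from the code above and simply dropping the third column (Acceleration):
lmModelNoAccel = fitlm(trainX(:,1:2), trainy, 'linear', 'RobustOpts', 'on');
ypredNoAccel = predict(lmModelNoAccel, testX(:,1:2));
diffPredictNoAccel = abs(ypredNoAccel - testy);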
This is what we have predicted the values to be, given our test set and beside it what the true values are:
>> [ypred testy]
ans =
17.0324 18.0000
12.9886 15.0000
13.1869 14.0000
14.1885 NaN
16.9899 14.0000
29.1824 24.0000
23.0753 18.0000
28.6148 28.0000
28.2572 25.0000
29.0365 26.0000
20.5819 22.0000
18.3324 20.0000
20.4845 17.5000
22.3334 19.0000
12.2569 16.5000
13.9280 13.0000
14.7350 13.0000
26.6757 27.0000
30.9686 36.0000
30.4179 31.0000
29.7588 36.0000
30.6631 38.0000
28.2995 26.0000
22.9933 22.0000
28.0751 32.0000
The fourth actual output value from the test data set is NaN, which denotes that the value is missing. However, when we run the observation corresponding to this output value through our linear model, it predicts a value anyway, which is to be expected: the model was trained on the other observations, so it naturally produces a prediction for this observation by drawing on what it learned from them.
When we compute the difference between these two, we get:
diffPredict =
0.9676
2.0114
0.8131
NaN
2.9899
5.1824
5.0753
0.6148
3.2572
3.0365
1.4181
1.6676
2.9845
3.3334
4.2431
0.9280
1.7350
0.3243
5.0314
0.5821
6.2412
7.3369
2.2995
0.9933
3.9249
As you can see, there are some instances where the prediction was quite close, and others where it was far from the truth... that's the crux of any prediction algorithm, really. You'll have to play around with which predictors you use, as well as with the training options. Have a look at the fitlm documentation for more details on what you can tweak.
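If you also want a single summary number for the test error, one option is the mean absolute error, ignoring the missing value; a small sketch using the variables above:
valid = ~isnan(diffPredict);      % drop the NaN caused by the missing MPG value
MAE = mean(diffPredict(valid))    % mean absolute prediction error on the test set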
Edit - July 30th, 2014
As you don't have fitlm, you can easily use LinearModel.fit. You call it with the same inputs as fitlm. As such:
lmModel = LinearModel.fit(trainX, trainy, 'linear', 'RobustOpts', 'on');
This should give you exactly the same results. predict should exist pre-R2014a, so that should be available to you.
Good luck!

How to use the reduced data - the output of principal component analysis

I am finding it hard to link the theory with the implementation. I would appreciate help in knowing where my understanding is wrong.
Notation: matrices are written as bold capital letters and vectors as bold lowercase letters.
X is a dataset of n observations, each of d variables. So, given these observed d-dimensional data vectors, the q-dimensional principal axes are w_j, for j = 1, ..., q, where q is the target dimension.
The principal components of the observed data matrix would be Y = W'*X, where W is a d-by-q matrix, X is a d-by-n matrix, and Y is a q-by-n matrix.
Columns of W form an orthogonal basis for the q-dimensional feature space, and the output Y is the principal component projection that minimizes the squared reconstruction error sum_n || x_n - x_n_hat ||^2, and the optimal reconstruction of x_n is given by x_n_hat = W*y_n.
The data model is
X(i,j) = A(i,:)*S(:,j) + noise
where PCA should be done on X to get the output S. S must be equal to Y.
Problem 1: The reduced data Y is not equal to S that is used in the model. Where is my understanding wrong?
Problem 2: How to reconstruct such that the error is minimum?
Please help. Thank you.
clear all
clc
n1 = 5; %d dimension
n2 = 500; % number of examples
ncomp = 2; % target reduced dimension
%Generating data according to the model
% X(i,j) = A(i,:)*S(:,j) + noise
Ar = orth(randn(n1,ncomp))*diag(ncomp:-1:1);
T = 1:n2;
%generating synthetic data from a dynamical model
S = [ exp(-T/150).*cos( 2*pi*T/50 )
exp(-T/150).*sin( 2*pi*T/50 ) ];
% Normalizing to zero mean and unit variance
S = ( S - repmat( mean(S,2), 1, n2 ) );
S = S ./ repmat( sqrt( mean( S.^2, 2 ) ), 1, n2 );
Xr = Ar * S;
Xrnoise = Xr + 0.2 * randn(n1,n2);
h1 = tsplot(S);
X = Xrnoise;
XX = X';
[pc, ~] = eigs(cov(XX), ncomp);
Y = XX*pc;
UPDATE [10 Aug]
Based on the Answer, here is the full code that I ran:
clear all
clc
n1 = 5; %d dimension
n2 = 500; % number of examples
ncomp = 2; % target reduced dimension
%Generating data according to the model
% X(i,j) = A(i,:)*S(:,j) + noise
Ar = orth(randn(n1,ncomp))*diag(ncomp:-1:1);
T = 1:n2;
%generating synthetic data from a dynamical model
S = [ exp(-T/150).*cos( 2*pi*T/50 )
exp(-T/150).*sin( 2*pi*T/50 ) ];
% Normalizing to zero mean and unit variance
S = ( S - repmat( mean(S,2), 1, n2 ) );
S = S ./ repmat( sqrt( mean( S.^2, 2 ) ), 1, n2 );
Xr = Ar * S;
Xrnoise = Xr + 0.2 * randn(n1,n2);
X = Xrnoise;
XX = X';
[pc, ~] = eigs(cov(XX), ncomp);
Y = XX*pc; %Y are the principal components of X'
%what you call pc is misleading, these are not the principal components
%These Y columns are orthogonal, and should span the same space
%as S approximatively indeed (not exactly, since you introduced noise).
%If you want to reconstruct
%the original data can be retrieved by projecting
%the principal components back on the original space like this:
Xrnoise_reconstructed = Y*pc';
%Then, you still need to project it through
%to the S space, if you want to reconstruct S
S_reconstruct = Ar'*Xrnoise_reconstructed';
plot(1:length(S_reconstruct),S_reconstruct,'r')
hold on
plot(1:length(S),S)
The resulting plot is very different from the one shown in the Answer. Only one component of S exactly matches that of S_reconstruct. Shouldn't the entire original 2-dimensional space of the source input S be reconstructed?
Even if I remove the noise, still only one component of S is exactly reconstructed.
I see nobody answered your question, so here goes:
What you computed in Y are the principal components of X' (what you call pc is misleading, these are not the principal components). These Y columns are orthogonal, and should span the same space as S approximatively indeed (not exactly, since you introduced noise).
If you want to reconstruct Xrnoise, you must look at the theory (e.g. here) and apply it correctly: the original data can be retrieved by projecting the principal components back on the original space like this:
Xrnoise_reconstructed = Y*pc'
Then, you still need to transform it through pinv(Ar)*Xrnoise_reconstructed, if you want to reconstruct S.
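Putting that together with the question's variables, a minimal sketch (note the transpose, since Xrnoise_reconstructed comes out n-by-d):
Xrnoise_reconstructed = Y*pc';                        % back to the original 5-dimensional space (n-by-d)
S_reconstructed = pinv(Ar) * Xrnoise_reconstructed.'; % 2-by-n estimate of S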
Matches nicely for me:
answer to UPDATE [10 Aug]: (EDITED 12 Aug)
Your Ar matrix does not define an orthonormal basis, and as such, the transpose Ar' is not the reverse transformation. The earlier answer I provided was thus wrong. The answer has been corrected above.
Your understanding is quite right. One of the reasons to use PCA is to reduce the dimensionality of the data. The first principal component has the largest sample variance amongst all normalized linear combinations of the columns of X. The second principal component has maximum variance subject to being orthogonal to the first one, and so on.
One might then do a PCA on a dataset and decide to cut off the last principal component or several of the last principal components of the data. This is done to reduce the effect of the curse of dimensionality, the term for the fact that any group of vectors is sparse in a relatively high-dimensional space. Conversely, this means that you would need an absurd amount of data to form any model on a fairly high-dimensional dataset, such as a word histogram of a text document with possibly tens of thousands of dimensions.
In effect, a dimensionality reduction by PCA removes the redundancy between strongly correlated components. For example, let's take a look at a picture:
As you can see, most of the values are almost the same, strongly correlated. You could meld some of these correlated pixels by removing the last principal components. This would reduce the dimensionality of the image and compress it, by removing some of the information in the image.
There is no magic way that I'm aware of to determine the best number of principal components or the best reconstruction.
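One common rule of thumb, though (just a heuristic, not a definitive answer), is to keep enough components to explain a chosen fraction of the total variance, which the pca function reports directly. A sketch, reusing XX from the question's code and an arbitrary 95% threshold:
[coeff, score, latent, ~, explained] = pca(XX);   % XX = X' from the question, observations in rows
ncomp_chosen = find(cumsum(explained) >= 95, 1);  % smallest number of components explaining >= 95% of the variance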
Forgive me if I am not mathematically rigorous.
If we look at the equation X = A*S, we can say that we are taking some two-dimensional data and mapping it to a 2-dimensional subspace of a 5-dimensional space, where A is a basis for that 2-dimensional subspace.
When we solve the PCA problem for X and look at PC (the principal component directions), we see that the two big eigenvectors (which correspond to the two largest eigenvalues) span the same subspace that A did. (Multiply A'*PC and see that for the three small eigenvectors we get 0, which means those vectors are orthogonal to A, and only for the two largest do we get values different from 0.)
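A quick numerical check of this claim, reusing Ar and XX from the question's code (just a sketch):
[V, D] = eig(cov(XX));   % eigenvectors of the 5x5 covariance; eigenvalues in ascending order
disp(Ar' * V)            % the first three columns (small eigenvalues) should be close to zero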
So what I think is the reason why we get a different basis for this two-dimensional space is that X = A*S can be a product of some A1 and S1, and also of some other A2 and S2, and we will still get X = A1*S1 = A2*S2. What PCA gives us is a particular basis that maximizes the variance in each dimension.
So how do you solve the problem you have? I can see that you chose as the test data some exponentials times sin and cos, so I think you are dealing with a specific kind of data. I am not an expert in signal processing, but have a look at the MUSIC algorithm.
You could use the pca function from Statistics toolbox.
coeff = pca(X)
From the documentation, each column of coeff contains the coefficients for one principal component. Note that pca expects observations in rows, so for the d-by-n X of this question you would call pca(X'); the reduced data (the principal component scores) are then obtained by projecting the centered observations onto coeff.
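For the data in this question, a minimal sketch (assuming the Statistics Toolbox) would be:
[coeff, score, ~, ~, ~, mu] = pca(X');                      % observations in rows
Xreduced = score(:, 1:ncomp);                               % reduced data (principal component scores)
Xapprox = bsxfun(@plus, Xreduced * coeff(:, 1:ncomp)', mu); % reconstruction in the original space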

MATLAB's newrb for designing radial basis networks does not behave in accordance to the documentation. Why?

I'm trying to approximate various signals using radial basis networks. In particular, I make use of MATLAB's newrb.
My problem is that this function seems to behave incorrectly if I follow the description of newrb. As I understand it, it makes sense to transpose all arguments despite the documentation.
The following example hopefully illustrates my problem.
I create one period of a sine wave with 100 samples. I would like to approximate this sine wave by means of a radial basis network with maximally two hidden neurons. I have one input vector (t) and one target vector (s). Hence, according to the documentation, I should call newrb with two column vectors. However, the approximation is too good. In fact, the mean squared error is 0 which can't be true using only two neurons. Additionally, the visualization with view(net) shows not only one but 100 inputs if I use column vectors.
In the example, the vectors corresponding to the "correct" (according to the documentation) function call are indicated by _doc, the ones corresponding to the "incorrect" call by _not_doc.
Can anybody explain this behavior?
% one period sine signal with
% carrier frequency = 1, sampling frequency = 100
Ts = 1 / 100;
t = 2 * pi * (0:Ts:1-Ts); % size(t) = 1 100
s = sin(t); % size(s) = 1 100
% design radial basis network
MSE_goal = 0.0; % mean squared error goal, default value
spread = 1.0; % spread of radial basis functions, default value
max_neurons = 2; % maximum number of neurons, custom value
DF = 25; % number of neurons to add between displays, default value
net_not_doc = newrb( t , s , MSE_goal, spread, max_neurons, DF ); % row vectors
net_doc = newrb( t', s', MSE_goal, spread, max_neurons, DF ); % column vectors
% simulate network
approx_not_doc = sim( net_not_doc, t );
approx_doc = sim( net_doc, t' );
% plot
figure;
plot( t, s, 'DisplayName', 'Sine' );
hold on;
plot( t, approx_not_doc, 'r:', 'DisplayName', 'Approximation_{not doc}');
hold on;
plot( t, approx_doc', 'g:', 'DisplayName', 'Approximation_{doc}');
grid on;
legend show;
% view neural networks
view(net_not_doc);
view(net_doc);
Because I had the same problem myself, I'll try to give an answer for anyone who stumbles upon the same post.
As I figured out, the problem is not the transposed vectors. You can use your data as it is, without transposing anything.
The fact that you train your RBF network with vector t and then simulate with the same vector you trained it on is the reason why you get such a perfect approximation: you are testing your network with the same values you taught it.
If you really want to test your network, you must choose a different vector for testing. In your example I used this:
% simulate network
t_test = 2 * pi * ((1-Ts)/2:Ts:3-Ts);
approx_not_doc = sim( net_not_doc, t_test );
And now when you plot your results, you can observe that the points that have the same values as in your training vector are approximated almost flawlessly, while the rest are poorly approximated because of the small number of neurons (as you expected).
Plot of t_test with approx_not_doc.
Now, if you add more neurons (in this example I used 100), you can see that the new network can predict, with the same test vector t_test, an unknown part of your function. Plot of t_test with approx_not_doc for 100 neurons. Of course, if you try a different number of neurons and spread, your results will vary.
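A sketch of that last experiment, using the same call as in the question but with a larger neuron budget:
max_neurons = 100;
net_more = newrb( t, s, MSE_goal, spread, max_neurons, DF );
approx_more = sim( net_more, t_test );
figure;
plot( t_test, approx_more, 'm:', 'DisplayName', 'Approximation_{100 neurons}' );
legend show;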
Hope this will help anyone with the same problem.

Cross-Validation with libsvm to find best parameters

In order to find the best parameters to be used with libsvm, I used the code below. Instead of './heart_scale' I had a file containing positive and negative examples, each with a HOG vector in libsvm format. I had 1000 positive examples and 4000 negative ones. However, these were put in order, i.e. the first 1000 examples were positive and the rest were negative.
Question: Now I am in doubt whether the accuracy returned by this code is the actual accuracy. This is because, from what I have read about 5-fold cross-validation, it takes the first 4/5 of the data for training and the remaining 1/5 for testing. Does this mean it could be the case that the test set is all negative? Or does it take the examples randomly?
%# read some training data
[labels,data] = libsvmread('./heart_scale');
%# grid of parameters
folds = 5;
[C,gamma] = meshgrid(-5:2:15, -15:2:3);
%# grid search, and cross-validation
cv_acc = zeros(numel(C),1);
for i=1:numel(C)
cv_acc(i) = svmtrain(labels, data, ...
sprintf('-c %f -g %f -v %d', 2^C(i), 2^gamma(i), folds));
end
%# pair (C,gamma) with best accuracy
[~,idx] = max(cv_acc);
%# contour plot of parameter selection
contour(C, gamma, reshape(cv_acc,size(C))), colorbar
hold on
plot(C(idx), gamma(idx), 'rx')
text(C(idx), gamma(idx), sprintf('Acc = %.2f %%',cv_acc(idx)), ...
'HorizontalAlign','left', 'VerticalAlign','top')
hold off
xlabel('log_2(C)'), ylabel('log_2(\gamma)'), title('Cross-Validation Accuracy')
%# now you can train your model using best_C and best_gamma
best_C = 2^C(idx);
best_gamma = 2^gamma(idx);
%# ...
You can find answer to your question in the LIBSVM source code.
See the function svm_cross_validation in the svm.cpp.
As you can see, for the classification cross-validation problem LIBSVM first performs class grouping and then shuffling, so the folds are not simply the first 4/5 and last 1/5 of your file.
So, the answer to your question: yes, the accuracy returned by this code is the actual cross-validation accuracy.
Note: the accuracy estimate also depends on the nature of the data and on the number of cross-validation folds, and is itself a random value with some distribution.
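If you want to be extra cautious, you can also shuffle the examples yourself once before the grid search; a small sketch (not required, since LIBSVM already shuffles):
rng(0);                           % reproducible shuffle
perm = randperm(numel(labels));
labels = labels(perm);
data = data(perm, :);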

Unsupervised Filter Feature Selection - Rank by Correlation

I have a set of features which I wish to rank according to their correlation coefficients with each other, without accounting for the true label (that would be supervised feature selection, right?).
My objective is to select the first feature as the one most correlated with all the others, take it out, and so on.
The problem is how to test the correlation of a vector with a matrix (all the other vectors/features)? Is it possible to do this, or am I going about it all wrong?
PS: I'm using MATLAB 2013b
Thank you all
Say you have an n-by-d matrix X where the rows are instances and the columns are the features/dimensions. You can compute the correlation coefficient matrix simply using the corr or corrcoef functions:
% Fisher Iris dataset, 150x4
>> load fisheriris
>> X = meas;
>> C = corr(X)
C =
1.0000 -0.1176 0.8718 0.8179
-0.1176 1.0000 -0.4284 -0.3661
0.8718 -0.4284 1.0000 0.9629
0.8179 -0.3661 0.9629 1.0000
The result is a d-by-d matrix containing the correlation coefficient of each feature against every other feature. The diagonal is thus all ones (because corr(x,x) = 1), and the matrix is symmetric (because corr(x,y) = corr(y,x)). Values range from -1 to 1, where -1 means inverse correlation between two variables, 1 means positive correlation, and 0 means no linear correlation.
Now because you want to remove the feature which is on average the most correlated with other features, you have to summarize that matrix as one number per feature. One way to do that is to compute the mean:
% mean
>> mean_corr = mean(C)
mean_corr =
0.6430 0.0220 0.6015 0.6037
% most correlated feature on average
>> [~,idx] = max(mean_corr)
idx =
1
% drop that feature
>> X(:,idx) = [];
EDIT:
I probably should have taken the mean of the absolute value of C in the above code, because we don't care if two variables are positively or negatively correlated, only how strong the correlation is.
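Along those lines, a sketch of the full iterative ranking described in the question, using the absolute correlations (variable names here are just illustrative):
Xremaining = X;                      % copy of the feature matrix
origIdx = 1:size(X,2);               % original column indices
ranking = [];                        % features in the order they are removed
while size(Xremaining,2) > 1
    C = abs(corr(Xremaining));
    C(logical(eye(size(C)))) = 0;    % ignore self-correlation
    [~, j] = max(mean(C));           % feature most correlated with the rest, on average
    ranking(end+1) = origIdx(j);     %#ok<AGROW>
    Xremaining(:, j) = [];
    origIdx(j) = [];
end
ranking(end+1) = origIdx;            % the last remaining feature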

MATLAB - How to calculate 2D least squares regression based on both x and y. (regression surface)

I have a set of data with independent variables x and y. Now I'm trying to build a two-dimensional regression model that has a regression surface cutting through my data points. However, I couldn't find a way to achieve this. Can anyone give me some assistance?
You could use my favorite, polyfitn for linear or polynomial models. If you would like a different model, please edit your question or add a comment. HTH!
EDIT
Also, take a look here under Multiple Regression, likely can help you as well.
EDIT AGAIN
Sorry, I'm having too much fun with this; here's an example of multivariate regression using least squares with stock MATLAB:
t = (1:10)';
x = t;
y = exp(-t);
A = [ y x ];
z = 10*y + 0.5*x;
A\z
ans =
10.0000
0.5000
If you are performing linear regression, the best tool is the regress function. Note that if you are fitting a model of the form y(x1,x2) = b1*f(x1) + b2*g(x2) + b3, this is still a linear regression, as long as you know the functions f and g.
Nsamp = 100; %number of samples
X1 = randn(Nsamp,1); %regressor 1 (could also be some computed f(x1) )
X2 = randn(Nsamp,1); %regressor 2 (could also be some computed g(x2) )
Y = X1 + X2 + randn(Nsamp,1); %generate some data to be regressed
%now run the regression
[b,bint,r,rint,stats] = regress(Y,[X1 X2 ones(Nsamp,1)]);
% 'b' contains the coefficients b1, b2, b3 of the fit (can be used to plot the regression surface)
% 'r' contains residuals of the fit
% 'stats' contains the overall regression R^2, F stat, p-value and error variance
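To get the regression surface the question asks about, here is a quick sketch of plotting it from the fitted coefficients (reusing the variables from the block above):
[x1g, x2g] = meshgrid(linspace(min(X1), max(X1), 20), linspace(min(X2), max(X2), 20));
Yfit = b(1)*x1g + b(2)*x2g + b(3);   % fitted regression surface
surf(x1g, x2g, Yfit, 'FaceAlpha', 0.5);
hold on
scatter3(X1, X2, Y, 'filled');       % original data points
hold off
xlabel('x1'), ylabel('x2'), zlabel('y')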