Dimension Reduction in MATLAB using PCA

I have a matrix with 35 columns and I'm trying to reduce its dimensionality using PCA. I ran PCA on my data:
[coeff,score,latent,tsquared,explained,mu] = pca(data);
explained =
99.9955
0.0022
0.0007
0.0003
0.0002
0.0001
0.0001
0.0001
Then, looking at the vector explained, I noticed the value of the first element is 99.99. Based on this, I decided to take only the first component. So I did the following:
k = 1;
X = bsxfun(@minus, data, mean(data)) * coeff(:, 1:k);
Then I used X for SVM training:
svmStruct = fitcsvm(X,Y,'Standardize',true, 'Prior','uniform','KernelFunction','linear','KernelScale','auto','Verbose',0,'IterationLimit', 1000000);
However, when I tried to predict and compute the misclassification rate:
[label,score,cost] = predict(svmStruct, X);
the result was disappointing. I noticed that when I selected only one component (k=1), every classification was wrong. However, as I increased the number of included components k, the results improved, as you can see from the diagram below. But this doesn't make sense according to explained, which indicates that the first eigenvector alone should be enough.
Did I make a mistake?
This diagram shows the classification error as a function of the number of included eigenvectors:
This graph was generated after normalizing the data before PCA, as suggested by @zelanix:
This is the resulting plot:
And these are the explained values obtained after normalizing before PCA:
>> [coeff,score,latent,tsquared,explained,mu] = pca(data_normalised);
Warning: Columns of X are linearly dependent to within machine precision.
Using only the first 27 components to compute TSQUARED.
> In pca>localTSquared (line 501)
In pca (line 347)
>> explained
explained =
32.9344
15.6790
5.3093
4.7919
4.0905
3.8655
3.0015
2.7216
2.6300
2.5098
2.4275
2.3078
2.2077
2.1726
2.0892
2.0425
2.0273
1.9135
1.8809
1.7055
0.8856
0.3390
0.2204
0.1061
0.0989
0.0334
0.0085
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000
0.0000

Parag S. Chandakkar is absolutely right that there is no reason to expect that PCA will automatically improve your classification result. It is an unsupervised method, so it is not intended to improve separability, only to find the components with the largest variance.
But there are some other problems with your code. In particular, this line confuses me:
X = bsxfun(@minus, data, mean(data)) * coeff(:, 1:k);
You need to normalise your data before performing PCA, and each feature needs to be normalised separately. I use the following:
data_normalised = data;
for f = 1:size(data, 2)
    data_normalised(:, f) = data_normalised(:, f) - nanmean(data_normalised(:, f));
    data_normalised(:, f) = data_normalised(:, f) / nanstd(data_normalised(:, f));
end
pca_coeff = pca(data_normalised);
data_pca = data_normalised * pca_coeff;
You can then extract the first principal component as data_pca(:, 1).
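As a side note, if more than one component turns out to be needed, a common way to choose k is from the cumulative explained variance. A minimal sketch reusing data_normalised from above (the 95% threshold is an arbitrary choice, not part of the original answer):
[coeff, score, ~, ~, explained] = pca(data_normalised);
k = find(cumsum(explained) >= 95, 1);   % smallest k explaining 95% of the variance
data_pca_k = score(:, 1:k);             % pca centres its input, so score is ready to use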
Also, always plot your PCA results to get an idea of what is actually going on:
figure
scatter(data_pca(Y == 1, 1), data_pca(Y == 1, 2))
hold on;
scatter(data_pca(Y == 2, 1), data_pca(Y == 2, 2))

PCA gives the directions of maximum variance in the data; it does not necessarily lead to better classification. If you want to reduce your data while trying to maximize your accuracy, you should use LDA.
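In MATLAB, one way to try this is fitcdiscr from the Statistics Toolbox. A minimal sketch for the two-class case, assuming the labels Y and data_normalised from the answer above (an illustration, not the answerer's code):
lda = fitcdiscr(data_normalised, Y);   % fit a linear discriminant
w = lda.Coeffs(1,2).Linear;            % normal of the boundary between classes 1 and 2
data_lda = data_normalised * w;        % project onto this discriminative direction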
The following picture illustrates exactly what I want to convey.


Selecting the right essential matrix

I want to code the SfM pipeline myself in MATLAB because I need some outputs that the OpenCV functions don't provide. However, I'm using OpenCV for comparison.
The OpenCV function [E,mask] = cv.findEssentialMat(points1, points2, 'CameraMatrix',K, 'Method','Ransac'); provides the essential matrix solution using Nistér's five-point algorithm and RANSAC.
The inlier indices are found using: InliersIndices = find(mask>0);
I used this MATLAB implementation of Nistér's algorithm:
Fivepoint_algoithm_code
The call to the function is as follows:
[E_all, R_all, t_all, Eo_all] = five_point_algorithm( pts1, pts2, K, K);
The algorithm outputs up to 10 essential matrix solutions. However, I encountered the following issues:
The implementation stated above is only for perfect correspondences (without RANSAC), and I'm providing the algorithm 5 correspondences chosen via InliersIndices; the output essential matrices (up to 10) are all different from the one returned by OpenCV.
All the returned essential matrices should be valid solutions, so why don't I obtain the same 3D points when I triangulate with each one using the function below?
How do I choose the right essential matrix solution?
I triangulate using the MATLAB toolbox function with these projection matrices:
P1=K*[eye(3) [0;0;0]];
P2=K*[R_all{i} t_all{i}];
[pts3D,rep_error] = triangulate(pts1', pts2', P1',P2');
Edit
The returned E from [E,mask] = cv.findEssentialMat(points1, points2, 'CameraMatrix',K, 'Method','Ransac');
E =
0.0052 -0.7068 0.0104
0.7063 0.0050 -0.0305
-0.0113 0.0168 0.0002
For the five-point MATLAB implementation, 5 random indices from the inliers are taken, so:
pts1 =
736.7744 740.2372 179.2428 610.5297 706.8776
112.2673 109.9687 45.7010 91.4371 87.8194
pts2 =
722.3037 725.3770 150.3997 595.3550 692.5383
111.7898 108.6624 43.6847 90.6638 86.8139
K =
723.3631 7.9120 601.7643
-3.8553 719.6517 182.0588
0.0075 0.0044 1.0000
and 4 solutions are returned:
E1 =
-0.2205 0.9436 -0.1835
0.8612 0.2447 -0.1531
0.4442 -0.0600 -0.0378
E2 =
-0.2153 0.9573 0.1626
0.8948 0.2456 -0.3474
0.1003 0.1348 -0.0306
E3 =
0.0010 -0.9802 -0.0957
0.9768 0.0026 -0.1912
0.0960 0.1736 -0.0019
E4 =
-0.0005 -0.9788 -0.1427
0.9756 0.0021 -0.1658
0.1436 0.1470 -0.0030
Edit2:
pts1 and pts2 triangulated using the essential matrix E, with R and t returned by [R, t] = cv.recoverPose(E, p1, p2,'CameraMatrix',K);
X1 =
-0.0940 0.0478 -0.4984
-0.0963 0.0497 -0.4987
0.3033 0.1009 -0.5202
-0.0065 0.0636 -0.5053
-0.0737 0.0653 -0.5011
with
R =
-0.9977 -0.0063 0.0670
0.0084 -0.9995 0.0305
0.0667 0.0310 0.9973
and
t =
0.0239
0.0158
0.9996
When triangulated with the MATLAB code, the chosen solution is E_all{2}:
R_all{2}=
-0.8559 -0.2677 0.4425
-0.1505 0.9475 0.2821
-0.4948 0.1748 -0.8512
and
t_all{2}=
-0.1040
-0.1355
0.9853
X2 =
0.1087 -0.0552 0.5762
0.1129 -0.0578 0.5836
0.4782 0.1582 -0.8198
0.0028 -0.0264 0.2099
0.0716 -0.0633 0.4862
When doing
X1./X2
ans =
-0.8644 -0.8667 -0.8650
-0.8524 -0.8603 -0.8546
0.6343 0.6376 0.6346
-2.3703 -2.4065 -2.4073
-1.0288 -1.0320 -1.0305
There is an almost constant scale factor between triangulated 3D points.
However, rotation matrices are different and there is no scale factor between translations.
t./t_all{2}=
-0.2295
-0.1167
1.0145
which makes the plotted trajectory wrong
Answering your questions in order:
Beware that Nister's 5 point algorithm has many implementations, but most of them don't work well. Personal experience and unpublished work by colleagues show that OpenCV does not have a good implementation. The open implementation in Bundler and other working SfM pipelines work better in practice (but there is a lot of room for improvement).
The 10 solutions are simply zeros of a certain polynomial equation. As far as the polynomial equation can describe the problem, these 10 solutions all zero the equation. The equation does not describe that these 10 points are real, or that the 3D points corresponding to the 5 point correspondences have to be the same for each solution, but only that there are some 3D points (for each solution) that project to the 5 points, without even considering if the 3D points are in front of the respective cameras. Moreover, there may well be two sets of 3D points and cameras that happen to generate the same images of 5 points, so you would have to weed them out with some other procedure (below).
The choice of the right solution among the 10 (possibly complex) solutions is usually made using a few techniques:
Discard solutions that would lead to purely complex points or 3D points with negative depth (currently Bundler does not do this last check); a sketch of the depth check appears after this list.
Discard solutions that are not physical for some other reason (you may have to do some of that yourself for your application)
The more usual procedure: For each remaining solution, check which one is more consistent with additional correspondences. In a real system you don't know which additional correspondences are right and which are pure trash. So run RANSAC for each of the solutions and keep the one with the most inliers. This is computationally heavy so should be used as a last resort.
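Here is a minimal MATLAB-style sketch of that positive-depth (cheirality) check, reusing E_all, R_all, t_all, pts1, pts2 and K from the question; this is an illustration of the idea under those assumptions, not Bundler's code:
bestIdx = 0; bestCount = -1;
P1 = K * [eye(3), zeros(3,1)];                 % first camera at the origin
for i = 1:numel(R_all)
    if ~isreal(E_all{i}), continue; end        % discard purely complex solutions
    P2 = K * [R_all{i}, t_all{i}];
    X = triangulate(pts1', pts2', P1', P2');   % Nx3 points in the camera-1 frame
    d1 = X(:,3);                               % depths in camera 1
    d2 = X * R_all{i}(3,:)' + t_all{i}(3);     % depths in camera 2
    c = sum(d1 > 0 & d2 > 0);                  % points in front of both cameras
    if c > bestCount, bestCount = c; bestIdx = i; end
end
% bestIdx now indexes the most physically plausible of the candidate solutions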
You can see how Bundler does this at file 5point.c line 668:
generate_Ematrix_hypotheses(5, r_pts_inner, l_pts_inner, &num_hyp, E);

for (i = 0; i < num_hyp; i++) {
    int best_inlier;
    double score = 0.0;

    double E2[9], tmp[9], F[9];
    memcpy(E2, E + 9 * i, 9 * sizeof(double));
    E2[0] = -E2[0];
    E2[1] = -E2[1];
    E2[3] = -E2[3];
    E2[4] = -E2[4];
    E2[8] = -E2[8];

    matrix_transpose_product(3, 3, 3, 3, K2_inv, E2, tmp);
    matrix_product(3, 3, 3, 3, tmp, K1_inv, F);

    inliers = evaluate_Ematrix(n, r_pts, l_pts, // r_pts_norm, l_pts_norm,
                               thresh_norm, F, // E + 9 * i,
                               &best_inlier, &score);

    if (inliers > max_inliers ||
        (inliers == max_inliers && score < min_score)) {
        best = 1;
        max_inliers = inliers;
        min_score = score;
        memcpy(E_best, E + 9 * i, sizeof(double) * 9);
        r_best = r_pts_norm[best_inlier];
        l_best = l_pts_norm[best_inlier];
    }

    inliers_hyp[i] = inliers;
}

libsvm: optimization finished, #iter = 1 nu = nan

I use libsvm to train an SVM model in MATLAB, but when I call
model = svmtrain(labels,Feature,'-t 0');
It gives me this result:
*
optimization finished, #iter = 1
nu = nan
obj = nan, rho = nan
nSV = 0, nBSV = 0
Total nSV = 0
My positive and negative samples are almost equal in number (935 vs. 904), so this problem is not caused by an unbalanced training dataset. I also tried other kernels, and none of them work.
You do not want to use svmtrain any more. The replacement is templateSVM paired with fitcecoc. The documentation pages for both functions are quite extensive.
You'll eventually want to use your model to predict other data; use predict for that.
I recently encountered similar problems when trying to classify terrain in a point cloud with more than two classes. templateSVM and fitcecoc solved my problems.
The code I used is as follows, where trainingdata is my 5-dimensional training data and groups contains the label for each point, corresponding to the cell array classes.
SVMtemp = templateSVM('KernelFunction','polynomial','IterationLimit',1e4,...
    'PolynomialOrder',4,'OutlierFraction',ExpOut,...
    'Standardize',true);                               % create the SVM template
Model = fitcecoc(trainingdata(:,4:8),groups,'learners',SVMtemp,...
    'ClassNames',classes);                             % create the SVM model
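You'll then classify new data with predict. A hypothetical usage sketch (newdata is a placeholder with the same columns as the training data):
predictedClasses = predict(Model, newdata(:,4:8)); % newdata is hypothetical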

Results not right when using SeDuMi in YALMIP

I want to use YALMIP in MATLAB to solve an SDP problem:
min X11+X13
s.t. X22=1
X is positive semidefinite
Here is the code:
P = sdpvar(3,3);
cons = [P >= 0,P(2,2)==1];
options = sdpsettings('Solver','Sedumi');
obj = [P(1,1)+P(1,3)];
solvesdp(cons,obj,options);
PP = double(P)
PP(1,1)+PP(2,3)
The results are shown below:
PP =
1.2900 0.0000 -2.2900
0.0000 0.0000 0.0000
-2.2900 0.0000 5.8700
ans =
1.2900
I am quite curious about these results. I set the constraint P(2,2)==1, yet in the final result P(2,2)=5.87. Why does this happen? Can anyone help?
YALMIP assumes a symmetric decision matrix by default; P = sdpvar(3,3,'full') will do better.
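As a one-line sketch of the suggested change (only the sdpvar call differs from the question's code):
P = sdpvar(3,3,'full'); % element-wise parameterisation instead of the symmetric default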

How to train a neural network incrementally in MATLAB, and iteratively combine the results?

I have a very big training set, too big for MATLAB to handle at once, so I need to do large-scale training.
Is it possible to split the training set into parts and iteratively train the network, updating "net" on each iteration instead of overwriting it?
The code below shows the idea, but it won't work as intended: in each iteration it updates the net based only on the data chunk it was just trained on.
TF1 = 'tansig'; TF2 = 'tansig'; TF3 = 'tansig'; % transfer functions for the layers; TF3 is for the output layer
net = newff(trainSamples.P, trainSamples.T, [NodeNum1,NodeNum2,NodeOutput], {TF1 TF2 TF3}, 'traingdx'); % network created
net.trainfcn = 'traingdm';
net.trainParam.epochs = 1000;
net.trainParam.min_grad = 0;
net.trainParam.max_fail = 2000; % large value standing in for infinity
while(1) % iteratively takes 10 data points at a time
    p % => gets updated with the next 10 data points
    t % => gets updated with the next 10 data points
    [net, tr] = train(net, p, t, [], []);
end
I haven't had a chance to look at the adapt function yet, but I suspect train is updating rather than overwriting. To verify this, you can select a subset of your first data chunk as the second chunk for training. If train is overwriting, then when you test the trained net on your first data chunk, it should poorly predict the points that do not belong to the subset.
I tested this with a very simple program: learning the curve y = x^2. During the first training process, I trained on the data set [1,3,5,7,9]:
m=6;
P=[1 3 5 7 9];
T=P.^2;
[Pn,minP,maxP,Tn,minT,maxT] = premnmx(P,T);
clear net
net.IW{1,1}=zeros(m,1);
net.LW{2,1}=zeros(1,m);
net.b{1,1}=zeros(m,1);
net.b{2,1}=zeros(1,1);
net=newff(minmax(Pn),[m,1],{'logsig','purelin'},'trainlm');
net.trainParam.show =100;
net.trainParam.lr = 0.09;
net.trainParam.epochs =1000;
net.trainParam.goal = 1e-3;
[net,tr]=train(net,Pn,Tn);
Tn_predicted= sim(net,Pn)
Tn
The result (note that the outputs are scaled to the same reference; if you use standard normalization instead, make sure you always apply the mean and std from the first training set to all later chunks):
Tn_predicted =
-1.0000 -0.8000 -0.4000 0.1995 1.0000
Tn =
-1.0000 -0.8000 -0.4000 0.2000 1.0000
Now we are implementing the second training process, with the training data [1,9]:
Pt=[1 9];
Tt=Pt.^2;
n=length(Pt);
Ptn = tramnmx(Pt,minP,maxP);
Ttn = tramnmx(Tt,minT,maxT);
[net,tr]=train(net,Ptn,Ttn);
Tn_predicted= sim(net,Pn)
Tn
The result:
Tn_predicted =
-1.0000 -0.8000 -0.4000 0.1995 1.0000
Tn =
-1.0000 -0.8000 -0.4000 0.2000 1.0000
Note that the points x=[3,5,7] are still predicted precisely.
However, if we train only x=[1,9]; from the very beginning:
clear net
net.IW{1,1}=zeros(m,1);
net.LW{2,1}=zeros(1,m);
net.b{1,1}=zeros(m,1);
net.b{2,1}=zeros(1,1);
net=newff(minmax(Ptn),[m,1],{'logsig','purelin'},'trainlm');
net.trainParam.show =100;
net.trainParam.lr = 0.09;
net.trainParam.epochs =1000;
net.trainParam.goal = 1e-3;
[net,tr]=train(net,Ptn,Ttn);
Tn_predicted= sim(net,Pn)
Tn
Watch the result:
Tn_predicted =
-1.0071 -0.6413 0.5281 0.6467 0.9922
Tn =
-1.0000 -0.8000 -0.4000 0.2000 1.0000
Note that the trained net did not perform well on x=[3,5,7].
The test above indicates that the training builds on the previous net instead of restarting. The reason you get worse performance is that you only pass over each data chunk once (stochastic gradient descent rather than batch gradient descent), so the total error curve may not have converged yet. Suppose you have only two data chunks: you may need to re-train on chunk 1 after training on chunk 2, then re-train on chunk 2, then chunk 1, and so on until some condition is met. If you have many more chunks, you may not need to worry about the effect of the second training pass relative to the first. Online learning simply drops the previous data sets, regardless of whether the updated weights compromise performance on them.
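A minimal sketch of that alternation for two chunks (chunk1, chunk2, maxPasses and goal are hypothetical placeholders, following the trainSamples.P/.T structure from the question):
for pass = 1:maxPasses
    [net, tr] = train(net, chunk1.P, chunk1.T);   % revisit chunk 1
    [net, tr] = train(net, chunk2.P, chunk2.T);   % then chunk 2
    if tr.perf(end) < goal, break; end            % stop once the error is low enough
end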
Here is an example of how to train a NN iteratively (mini-batch) in MATLAB:
% create a toy dataset
[x, t] = building_dataset;
% mini-batch size and number of batches
M = 420;
imax = 10;
% let's check direct training vs mini-batch training
net = feedforwardnet(70,'trainscg');
dnet = feedforwardnet(70,'trainscg');
% standard training here: one single call with the whole data
dnet.trainParam.epochs = 100;
[dnet, tr, y] = train(dnet, x, t, 'useGPU','only','showResources','no');
% a measure of error: MAE (easy to swap for MSE or any other you want)
dperf = mean(mean(abs(t - dnet(x))))
% this is the iterative part: 1 epoch per call
net.trainParam.epochs = 1;
e = 1;
perf = inf;   % so the while condition holds on the first pass
% until we reach the previous method's error, for epoch comparison
while perf(end) > dperf
    % very, very important: randomize the data at each epoch!
    idx = randperm(size(x,2));
    % train iteratively with all the data chunks
    for i = 1:imax
        k = idx(1 + M*(i-1) : M*i);
        [net, tr] = train(net, x(:,k), t(:,k));
    end
    % compute the performance at each epoch
    perf(e) = mean(mean(abs(t - net(x))))
    e = e + 1;
end
% check the performance: we want a nice quasi-smooth, exp(-x)-like curve
plot(perf)

Unexpected behaviour of function findpeaks in MATLAB's Signal Processing Toolbox

Edit: Actually this is not unexpected behaviour, but I still need a solution.
findpeaks compares each element of data to its neighboring values.
I have data which contains peaks that I detect with the function findpeaks from the Signal Processing Toolbox. Sometimes the function does not seem to detect the peaks properly when the same value occurs twice in a row. This occurs very rarely in my data, but here is a sample to illustrate my problem:
>> values
values =
-0.0324
-0.0371
-0.0393
-0.0387
-0.0331
-0.0280
-0.0216
-0.0134
-0.0011
0.0098
0.0217
0.0352
0.0467
0.0548
0.0639
0.0740
0.0813
0.0858 <-- here should be another peak
0.0858 <--
0.0812
0.0719
0.0600
0.0473
0.0353
0.0239
0.0151
0.0083
0.0034
-0.0001
-0.0025
-0.0043
-0.0057
-0.0048
-0.0038
-0.0026
0.0007
0.0043
0.0062
0.0083
0.0106
0.0111
0.0116
0.0102
0.0089
0.0057
0.0025
-0.0025
-0.0056
Now the findpeaks function only finds one peak:
>> [pks loc] = findpeaks(values)
pks =
0.0116
loc =
42
If I plot the data, it becomes obvious that findpeaks misses one of the peaks at location 18/19 because both samples have the value 0.08579.
What is the best way to find those missing peaks?
If you have the image processing toolbox, you can use IMREGIONALMAX to find the peaks, after which you can use regionprops to find the center of the regions (if that's what you need), i.e.
bw = imregionalmax(signal);
peakLocations = find(bw); %# returns n peaks for an n-tuple of max-values
stats = regionprops(bw,'Centroid');
peakLocations = cat(1,stats.Centroid); %# returns the center of the n-tuple of max-values
This is an old topic, but maybe some people are still looking for an easier solution (like I did today):
You could also just subtract some very small fixed value from all values on a plateau, except the first one. This makes the first value on each plateau always the highest on that plateau, so it gets included as a peak.
Just make something like this part of your code:
peaks = yourdata;
verysmallvalue = .001;
plateauvalue = peaks(1);
for i = 2:size(peaks,1)
    if peaks(i) == plateauvalue
        peaks(i) = peaks(i) - verysmallvalue; % every first plateau sample stays highest
    else
        plateauvalue = peaks(i);
    end
end
[PKS,LOCS] = findpeaks(peaks);
plot(yourdata);
hold on;
plot(LOCS, yourdata(LOCS), 'Color','red', 'LineStyle','none', 'Marker','o');
Hope this helps!
Use the second derivative test instead?
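One way to read that terse suggestion, as a sketch (not the answerer's code): look for sign changes in the first difference, carrying the trend through flat steps so that plateaus register as peaks.
s = sign(diff(values));
for k = 2:numel(s)
    if s(k) == 0, s(k) = s(k-1); end                 % carry the last trend through flat steps
end
locs = find(s(1:end-1) == 1 & s(2:end) == -1) + 1;   % rising then falling
pks = values(locs);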
I ended up writing my own simpler version of findpeaks, which seems to work for my purpose.
function [pks,locs] = my_findpeaks(X)
M = numel(X);
pks = [];
locs = [];
if (M < 4)
    datamsgid = generatemsgid('emptyDataSet');
    error(datamsgid,'Data set must contain at least 4 samples.');
else
    for idx = 1:M-3
        % the peak (or the first sample of a two-sample plateau) sits at idx+1
        if X(idx) < X(idx+1) && X(idx+1) >= X(idx+2) && X(idx+2) > X(idx+3)
            pks = [pks X(idx+1)];
            locs = [locs idx+1];
        end
    end
end
end
Edit: To clarify, the problem arose when I had a peak lying exactly between two sample points which coincidentally had the same value. It only happened a couple of times in more than 10,000 cases.
The behavior that you describe is a known bug in versions of MATLAB prior to R2010b. The minimal example is
findpeaks([0 1 1 0])
which returns [], while
findpeaks([0 1 0])
returns the (position of the) peak.
The bug has been fixed in R2010b and later, see the official Bug Report. With that fix, findpeaks returns the rising edge of "peaks with repeated values" (which I would call plateaus).