PCA explained variances based on the size of dataset - cluster-analysis

I have a dataset of 600,000 rows by 262 columns. I am applying PCA to the whole dataset to get the first N components that explain 95% of the variance, which turns out to be 140. I then ran a clustering analysis on the PCA-transformed data. Now I want to predict the cluster membership of an additional 60k data points, and to do that I ran the PCA analysis on them too. My question arises here: when I try to extract the first N components that explain 95% of the variance from these new data, it now takes the first 142 components. I was expecting the first 140 components to explain 95% of the variance again. Is this behaviour normal, or am I making a mistake or a wrong assumption somewhere? Since I want to run these new data through the clustering, should I skip recomputing the 95% threshold and instead just extract the first 140 components and run them through the model?
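For what it's worth, the answer to the related question below suggests the same idea: fit PCA once on the training data and project new points with the loadings estimated there, rather than refitting on the new sample. A minimal sketch, assuming MATLAB's pca and hypothetical matrices Xtrain and Xnew:
[coeff, ~, ~, ~, explained, mu] = pca(Xtrain);       % fit PCA on the training data only
nComp = find(cumsum(explained) >= 95, 1);            % first N components covering 95% of variance
Znew = bsxfun(@minus, Xnew, mu) * coeff(:, 1:nComp); % project new points with the SAME loadings
Refitting on the 60k subset estimates a slightly different covariance structure, so a different component count (142 vs. 140) is not surprising.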

Related

Projecting new points on pca of second degree MATLAB

I am trying to use PCA to visualize my implementation of the k-means algorithm. I am following the linked tutorial on Principal Component Coefficients, Scores, and Variances.
I am using the following command: [coeff,score,~]=pca(X'); where X is my data.
My data is a 30 by 455 matrix, that is, 30 features with 455 samples. I have successfully used the score parameter to create a 2D plot for visualization purposes. Now I wish to project the 30-dimensional centers onto that plane. I have tried coeff*centers(:,1), but I do not understand if this is the correct usage.
How do I project a new 30-dimensional point onto the 2D plane of the first vs. the second PCA components?
I assume that by centers(:, 1) you denote a new observation. To express this observation in the principal components you should write
[coeff, score, ~, ~, ~, mu] = pca(X'); % return the estimated mean "mu"
tmp = centers(:, 1) - mu';             % remove the mean, since pca() centers the data by default
coeff' * tmp;                          % the new observation expressed in the principal components
Note that you have to subtract the mean since pca() by default centers the data. Also, note the transpose ' on coeff. In fact it should be inv(coeff), but since coeff is an orthogonal matrix we can use transpose instead.
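To place the point on the same 2D plot as score(:, 1) vs. score(:, 2), only the first two components are needed; a small sketch of that last step:
p2d = coeff(:, 1:2)' * tmp; % 2x1: coordinates in the first two principal components
hold on
plot(p2d(1), p2d(2), 'rx')  % mark the projected center on the existing 2D scatter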

Does transposing the training set affect the results with SVM?

I am working on human age classification, where I have to classify data into two classes, namely Young and Old. As the classifier I am using SVM, and this is what I have done so far to prepare the data:
The TrainingSet, whose size is (11264, 284): each column corresponds to an observation (a person). This means that I have 284 persons for the training task and 11264 features.
The TestSet is formatted the same way as the TrainingSet.
The Groups (labels) matrix is Groups(284, 1), filled with 1 for Old and -1 for Young.
I trained the SVM using the MATLAB built-in function to obtain the SvmStruct:
SvmStruct = svmtrain(TrainingSet, Groups')
Then I introduce the TestSet using this MATLAB function in order to get the classification results:
SvmClassify = svmclassify(SvmStruct, TestSet)
After reviewing the MATLAB help on SVM, I deduced that the data have to be passed to the SVM classifier such that each row of the TrainingSet corresponds to an observation (a person in my case) and each column corresponds to a feature. So it seems that what I should have been doing was to transpose those matrices (TrainingSet and TestSet).
Was what I did wrong, and are all the results I got wrong?
I looked at the source code for svmtrain, and it transposes the training data if the number of groups does not match the number of rows (svmtrain.m, line 249 ff, MATLAB 2015b):
% make sure data is the right size
if size(training,1) ~= size(groupIndex,1)
    if size(training,2) == size(groupIndex,1)
        training = training';
    else
        error(message('stats:svmtrain:DataGroupSizeMismatch'))
    end
end
So no, your training results are not wrong.
However, svmclassify does not transpose the test data; it only checks for the right number of features (svmclassify.m, line 63 ff.):
if size(sample,2) ~= size(svmStruct.SupportVectors,2)
    error(message('stats:svmclassify:TestSizeMismatch'));
end
So this should have triggered an error (sample is your TestSet).
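So the safer pattern is to orient both matrices as observations-in-rows yourself, instead of relying on svmtrain's automatic transpose; a sketch using the question's variables:
SvmStruct = svmtrain(TrainingSet', Groups);     % 284x11264: one row per person, Groups is 284x1
SvmClassify = svmclassify(SvmStruct, TestSet'); % TestSet transposed the same way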

How can I find the difference between two plots with a dimensional mismatch?

I have a question that I don't know has an off-the-bat solution.
Here it goes:
I have two data sets plotted on the same figure, and I need to find their difference. Simple so far...
The problem is that, say, matrix A has 1000 data points while the second (matrix B) has 580 data points. How can I find the difference between the two graphs when there is a dimensional mismatch between them?
One way I thought of is artificially inflating matrix B to 1000 data points while keeping the trend of the plot the same. Would this be possible, and if yes, how?
For example:
A = [1 45 33 4 1009];
B = [1 22 33 44 55 66 77 88 99 1010];
Ya = A.*20 + 4;
Yb = B./10 + 3;
C = abs(B - A) % fails: A has 5 elements, B has 10
plot(A, Ya, 'r', B, Yb)
xlim([-100 1000])
grid on
hold on
plot(length(B), C)
One way to do it is to resample the 580-element vector to 1000 samples. Use MATLAB's resample (it requires the Signal Processing Toolbox, I believe) for this:
x = randn(580,1);
y = randn(1000,1);
xr = resample(x, 50, 29); % 50/29 = 1000/580 is the resampling ratio
You should then be able to compare the two data vectors.
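Once the lengths match, the pointwise difference is direct:
d = abs(y - xr); % 1000-element difference vector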
There are two ways that I can think of:
1- Matching the sizes:
Generate more data for the matrix with the lower number of elements (using interpolation, etc.; see the sketch after this list), or
remove some data from the matrix with the higher number of elements (e.g. outlier removal).
2- Comparing the matrices via their properties:
For instance, you can calculate the mean and the covariance of one matrix and compare them to those of the other. Useful functions include cov, mean, median, std, var, xcorr, and xcov.
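A minimal sketch of the interpolation route (option 1), using synthetic vectors and assuming both series span the same x-range:
a = randn(1, 1000); % the longer series
b = randn(1, 580);  % the shorter series
b_stretched = interp1(linspace(0, 1, numel(b)), b, linspace(0, 1, numel(a)));
D = abs(a - b_stretched); % pointwise difference, both length 1000 now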

K-means clustering

I want to use k-means clustering on my features, which are of size 286 x 276, so that I can do clustering before using SVM. The features come from 16 different gestures. I am using the MATLAB function IDX = kmeans(Feat_train, 16). In the IDX variable I get a vector of size 286x1 containing numbers between 1 and 16, seemingly at random. I do not understand what those numbers show, or what I have to do next to prepare the input to SVM for training.
The way you invoked kmeans in MATLAB with your 286-by-276 feature matrix, kmeans assumes you have 286 points in a 276-dimensional space. It then tries to find the k=16 centers that best represent your 286 high-dimensional points.
Finally, it gives back IDX: an index per point telling you to which of the 16 centers that point belongs.
It is now up to you to decide how to feed this information into the SVM machinery.
The number shows which cluster each 1x276 "point" belongs to.
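One possible bridge to the SVM stage (an assumption on my part, not something the question fixes) is to use the distances to the 16 centroids as a compact feature representation:
[IDX, C] = kmeans(Feat_train, 16); % C is the 16x276 matrix of cluster centroids
D = pdist2(Feat_train, C);         % 286x16: distance of every sample to every centroid
% D (or IDX itself, used as pseudo-labels) can then be fed to the SVM training step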

kmeans in MATLAB: number of clusters > number of rows?

I'm using the Statistics Toolbox function kmeans in MATLAB for the first time. I want to get the total Euclidean distance to the nearest centroid as an indicator of the optimal k.
Here is my code:
clear all
N = 10;
opts = statset('MaxIter', 1000);
X = dlmread('data.txt');
crit = zeros(1, N);
for j = 1:N
    [a, b, c] = kmeans(X, j, 'Start','cluster', 'EmptyAction','drop', 'Options',opts);
    clear a b
    crit(j) = sum(c);
end
save(['crit_', VF, '_', num2str(i), '_limswvl1.mat'], 'crit')
Everything should go well, except that I get this error for j = 6:
X must have more rows than the number of clusters.
I do not understand the problem, since X has 54 rows and no NaNs.
I tried using different EmptyAction options, but it still won't work.
Any idea? :)
The problem occurs because you use the 'cluster' method to get the initial centroids. From the MATLAB documentation:
'cluster' - Perform preliminary clustering phase on random 10%
            subsample of X. This preliminary phase is itself
            initialized using 'sample'.
So when j = 6, it tries to divide 10% of the data into 6 clusters; 10% of 54 rows is about 5 points, which is fewer than the 6 clusters requested. Therefore you get the error X must have more rows than the number of clusters.
To get around this problem, choose the initial points either randomly ('Start','sample') or uniformly ('Start','uniform').
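For example, switching the start method inside the loop is a one-line change:
[a, b, c] = kmeans(X, j, 'Start','sample', 'EmptyAction','drop', 'Options',opts);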