Feature selection using kernel PCA (KPCA) - MATLAB

I have tried principal component analysis (PCA) for feature selection, which gave me 4 optimal features from a set of nine (Mean of Green, Variance of Green, Std. dev. of Green, Mean of Red, Variance of Red, Std. dev. of Red, Mean of Hue, Variance of Hue, Std. dev. of Hue, i.e. [MGcorr, VarGcorr, stdGcorr, MRcorr, VarRcorr, stdRcorr, MHcorr, VarHcorr, stdHcorr]) for classification of the data into two clusters. From the literature it seems that PCA is not a very good method for this, and that it is better to apply kernel PCA (KPCA) for feature selection. I want to apply KPCA and have tried the following:
d = 4; % number of features to be selected, i.e. the reduced dimension
[Y2, eigVector, para] = kPCA(feature, d); % feature is a 300x9 matrix: 300 observations,
                                          % 9 features; Y2 is the dimensionality-reduced data
The kPCA.m function above can be downloaded from:
http://www.mathworks.com/matlabcentral/fileexchange/39715-kernel-pca-and-pre-image-reconstruction/content/kPCA_v1.0/code/kPCA.m
In the above implementation I want to know how to find which 4 of the 9 features to select (i.e. which features are optimal) for clustering.
Alternatively, I also tried the following function for a KPCA implementation:
options.KernelType = 'Gaussian';
options.t = 1;
options.ReducedDim = 4;
[eigvector, eigvalue] = KPCA(feature', options);
With this implementation I have the same problem of determining the top 4 optimal features out of the nine.
This KPCA.m function can be downloaded from:
http://www.cad.zju.edu.cn/home/dengcai/Data/code/KPCA.m
It would be great if someone could help me implement kernel PCA for my problem.
Thanks

PCA doesn't provide optimal features per se. What it provides is a new set of features that are uncorrelated. When you select the "best" 4 features, you are picking the ones that have the greatest variance (largest eigenvalues). So for "normal" PCA, you simply select the 4 eigenvectors corresponding to the 4 largest eigenvalues, then you project the original 9 features onto those eigenvectors via matrix multiplication.
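For standard PCA, that selection and projection might look like the following minimal sketch (assuming your 300x9 matrix feature; all variable names here are illustrative):
% Standard-PCA sketch: keep the 4 eigenvectors with the largest
% eigenvalues and project the centered data onto them.
Xc = bsxfun(@minus, feature, mean(feature, 1)); % center the data
[V, D] = eig(cov(Xc));                          % 9x9 covariance eigendecomposition
[~, order] = sort(diag(D), 'descend');          % sort by eigenvalue
W = V(:, order(1:4));                           % top-4 eigenvectors (9x4)
Y = Xc * W;                                     % 300x4 projected data
Note that each column of Y is a linear combination of all nine original features, not a subset of them, so this is feature extraction rather than feature selection.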
From the link you provided for the kernel PCA function, the return value Y2 appears to be the original data transformed to the top d features of the kernel-PCA space, so the transformation is already done for you.

Related

How to compute distance and estimate quality of heterogeneous grids in Matlab?

I want to evaluate the grid quality where all coordinates differ in the real case.
The signal is an ECG signal, for which the average lifetime is 75 years.
My task is to evaluate its age at the moment of measurement, which is an inverse problem.
I think a 2D approximation of the 3D case is hard (done here by Abo-Zahhad) with 3 leads (2 on the chest and one at the left leg - MIT-BIH arrhythmia database):
A = f + \epsilon, where f is a piecewise continuous function in R^2, \epsilon is the error matrix and A is a 2D data matrix.
Now I evaluate the average grid distance along the x-axis (time) and along the y-axis (energy).
I think this can be done by Matlab's Image Analysis toolbox.
However, I am not sure how complete the toolbox's approaches are.
I think a transform approach must be used in the setting of uneven and noncontinuous grids. One approach is "Exact linear time Euclidean distance transforms of grid line sampled shapes" by Joakim Lindblad et al.
The method presents a distance transform (DT) which assigns to each image point its smallest distance to a selected subset of image points.
This kind of approach is often a basis of algorithms for many methods in image analysis.
I tested bwdist (distance transform of a binary image) without success: the chessboard option returns an empty square matrix, while the cityblock, euclidean and quasi-euclidean options return full matrices.
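For reference, a minimal bwdist sketch on made-up data (bwdist assigns each pixel its distance to the nearest nonzero pixel, so the input mask must contain at least one true pixel):
% bwdist sketch on a toy mask (illustrative data only).
BW = false(5);                  % 5x5 all-background mask
BW(3,3) = true;                 % single foreground pixel
D1 = bwdist(BW, 'euclidean');   % distance to nearest true pixel
D2 = bwdist(BW, 'chessboard');  % Chebyshev variant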
Another piece of pseudocode:
% https://stackoverflow.com/a/29956008/54964
%// retrieve picture
imgRGB = imread('dummy.png');
%// detect lines
imgHSV = rgb2hsv(imgRGB);
BW = (imgHSV(:,:,3) < 1);
BW = imclose(imclose(BW, strel('line',40,0)), strel('line',10,90));
%// clear those masked pixels by setting them to background white color
imgRGB2 = imgRGB;
imgRGB2(repmat(BW,[1 1 3])) = 255;
%// show extracted signal
imshow(imgRGB2)
I think this approach will not work here, though, because the grids are not necessarily continuous and not necessarily ideal.
pdist based on Lumbreras' answer
In the real examples all coordinates differ, so the pdist options hamming and jaccard are always 1 with real data.
The options euclidean, cityblock, minkowski, chebychev, mahalanobis, cosine, correlation, and spearman offer some description of the data.
However, these options now make little sense to me for such full matrices.
I want to estimate how long the signal can live.
Sources
J. Müller and S. Siltanen. Linear and Nonlinear Inverse Problems with Practical Applications.
EIT with the D-bar method: discontinuous heart-and-lungs phantom. http://wiki.helsinki.fi/display/mathstatHenkilokunta/EIT+with+the+D-bar+method%3A+discontinuous+heart-and-lungs+phantom Visited 29 Feb 2016.
There is a function in Matlab called pdist which computes the pairwise distance between all row elements of a matrix and lets you choose the type of distance you want to use (Euclidean, cityblock, correlation, ...). Are you after something like this? Not sure I understood your question!
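For illustration, a minimal pdist sketch (the data is made up):
% pdist sketch on made-up data.
X = [0 0; 3 4; 6 8];            % three points as rows
D_euc = pdist(X, 'euclidean');  % pairwise Euclidean distances: [5 10 5]
D_cb  = pdist(X, 'cityblock');  % pairwise city-block distances
Dsq   = squareform(D_euc);      % full symmetric 3x3 distance matrix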
cheers!
Simply, do not do it in post-processing. Those artifacts can stem from the raster images, from the viewer, and so on. Do quality assurance in the signal generation/processing step.
It is much easier to evaluate the original signal than its rendered views.

How can I classify my data for K-Means Clustering

A proof of concept prototype I have to do for my final year project is to implement K-Means Clustering on a big data set and display the results on a graph. I only know object-oriented languages like Java and C# and decided to give MATLAB a try. I notice that with a matrix-oriented language the approach to solving problems is very different, so I would like some insight on a few things if possible.
Suppose I have the following data set:
raw_data
400.39 513.29 499.99 466.62 396.67
234.78 231.92 215.82 203.93 290.43
15.07 14.08 12.27 13.21 13.15
334.02 328.79 272.2 306.99 347.79
49.88 52.2 66.35 47.69 47.86
732.88 744.62 687.53 699.63 694.98
And I picked rows 2 and 4 to be the 2 centroids:
centroids
234.78 231.92 215.82 203.93 290.43 % Centroid 1
334.02 328.79 272.2 306.99 347.79 % Centroid 2
I now want to compute the Euclidean distance of each point to each centroid, then assign each point to its closest centroid and display this on a graph. Let's say I want to classify the centroids as blue and green. How can I do this in MATLAB? If this were Java I would initialise each row as an object and add it to separate ArrayLists (representing the clusters).
If rows 1, 2 and 3 all belong to the first centroid / cluster, and rows 4, 5 and 6 belong to the second centroid / cluster - how can I classify these to display them as blue or green points on a graph? I am new to MATLAB and really curious about this. Thanks for any help.
(To begin with, Matlab has a flexible distance-measuring function, pdist2, and also a kmeans implementation, but I'm assuming that you want to build your code from scratch.)
In Matlab, you try to implement everything as matrix algebra, without loops over elements.
In your case, if R is the raw_data matrix and C is the centroids matrix,
you can shift the dimension that represents centroid number to the 3rd place by
permC = permute(C,[3 2 1]); Then the bsxfun function allows you to subtract C from R while expanding R's third dimension as necessary: D = bsxfun(@minus, R, permC). An element-wise square followed by summation across columns, SqD = sum(D.^2, 2), gives you the squared distance of each observation from each centroid. Performing all these operations within a single statement and shifting the third (centroid) dimension back to the 2nd place looks like this:
SqD = permute(sum(bsxfun(@minus, R, permute(C,[3 2 1])).^2, 2), [1 3 2])
Picking the centroid of minimal distance is now straightforward: [minDist,minCentroid]=min(SqD,[],2)
If this looks complex, I recommend inspecting the product of each sub-step and reading the help of each command.
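Putting it together with the data from the question and the blue/green display (a minimal sketch; plotting the first two feature columns is just an example, since the data is 5-dimensional):
% End-to-end sketch using the data from the question.
R = [400.39 513.29 499.99 466.62 396.67;
     234.78 231.92 215.82 203.93 290.43;
      15.07  14.08  12.27  13.21  13.15;
     334.02 328.79 272.20 306.99 347.79;
      49.88  52.20  66.35  47.69  47.86;
     732.88 744.62 687.53 699.63 694.98];
C = R([2 4],:);                           % rows 2 and 4 as centroids
SqD = permute(sum(bsxfun(@minus,R,permute(C,[3 2 1])).^2,2),[1 3 2]);
[~, minCentroid] = min(SqD, [], 2);       % cluster index per row
colors = {'b','g'};                       % blue = cluster 1, green = cluster 2
figure; hold on
for k = 1:2
    member = (minCentroid == k);
    % plot e.g. the first two feature columns of each cluster's rows
    plot(R(member,1), R(member,2), 'o', 'Color', colors{k});
end
hold off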

Plotting K-means results in Matlab

I have 3 sets of signals, each containing 4 distinct operational states, and I have to classify the states in each signal using K-means in Matlab. The classification is done after I have smoothened the original signal using a filter. My output should be a plot of the smoothened signal with each part of the signal in a different color to denote the different operational state.
I am very new to Matlab, and this is what I have for the classification part.
numClusters = 4;
idx_1 = kmeans([X_1 smoothY_1], numClusters, 'Replicates', 5);
[numDataPoints, numDimensions] = size(smoothY_1);
Colors = hsv(numClusters);
for i = 1 : numDataPoints
    plot(X_1(i), smoothY_1(i), '.', 'Color', Colors(idx_1(i),:))
    hold on
end
I have a few questions.
1) It appears to me that the kmeans function in Matlab will return a set of arbitrary cluster index in every run. For example, running the code above on the same signal twice may give me the cluster index (for 10 data points) [4 4 2 2 2 1 1 3 3 3] and [2 2 1 1 1 4 4 3 3 3], resulting in arbitrary colors denoting each state. Ideally, I would like the indices to be (somewhat) ordered and the colors to be the same for corresponding states, so that it makes sense to say "Red means Operational State 1, blue means State 2, etc". How can I synchronize this?
I have 2 pictures to illustrate this.
Set 1 and 2 are two of the datasets. Each stage of the signal is in a different color. I would like, for example, the first segment to be red, second in cyan, third in green, fourth in purple.
2) I can't seem to plot the graph using the specifier '-'. There is no output when I try to do that, so I'm forced to use '.', which isn't what I want. How can I plot a continuous curve here?
3) Right now, I'm running K-means independently on all 3 sets of data, so there's no concept of training/test datasets. I would like to use one dataset for training and the other 2 for testing, but I don't know how to do that using K-means in Matlab. How can I do that?
ETA: I noticed that my smoothed plots are all about half the heights of my plots of the original data, e.g. the highest point in my original signal is y = 22, while the highest point in my smoothed signal is y = 11, although the shape remains the same. Is this correct?
ETA2: I realized that it seems as if what the K-means clustering did was simply divide the graph into numClusters segments (based on X_1 values) and that's it. I've tried with different values of numClusters and each gave me equally divided segments. Surely this can't be right? For instance, isn't it more likely that the long segment after the biggest spike belong to the same cluster, rather than 3 clusters? Should I be using K-means at all?
For the first question:
You can reorder your vector with
[~,~,a] = unique(a,'stable');
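For example (made-up labels), this remaps the cluster indices to the order in which they first appear:
% Made-up run: arbitrary labels from kmeans.
a = [4 4 2 2 2 1 1 3 3 3];
[~, ~, a] = unique(a, 'stable');   % a is now [1 1 2 2 2 3 3 4 4 4]
Applying the same line to [2 2 1 1 1 4 4 3 3 3] yields the identical result, so corresponding states keep the same color across runs.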
For the second question:
You can find all the information about the LineSpec here:
LineSpec
If you don't add a LineSpec the default option is a continuous line, as you want.
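A minimal sketch of how that can look per cluster (assuming X_1, smoothY_1, idx_1, Colors and numClusters from the question; plotting each cluster in one call gives '-' consecutive points to connect):
% Plot each cluster as its own connected curve (sketch).
hold on
for k = 1:numClusters
    member = (idx_1 == k);
    plot(X_1(member), smoothY_1(member), '-', 'Color', Colors(k,:))
end
hold off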
For the third question:
I don't think you can train your k-means algorithm (due to the nature of the method) the way you could with an SVM, but I'm waiting for an expert opinion.

Alternative to spatial histograms in Bag of Words approach using vlfeat

The phow_caltech101 demo app in vlfeat creates a complete Bag of Words process for image classification on the Caltech101 dataset, roughly put:
Feature Extraction
Visual Vocabulary building
Spatial Histograms computation
SVM training
SVM testing and evaluation,
obtaining a model that can be used to later classify new, unclassified instances.
The only problem is that the computed histograms are spatial histograms. This means that if I have a visual vocabulary of size n, I would have expected each histogram to have size n x (size_collection), containing the occurrences of each visual word in each training instance.
The spatial histograms, however, are stored in a structure according to the model specified; by default it has two spatial arguments, numSpatialX and numSpatialY, which results in a structure of size numSpatialX * numSpatialY * (size_vocabulary) that is later normalized, and this is what is used to train the SVM.
Now, what if I want to use the normal histogram (normalized or not), the one that gives me a 1-1 correspondence of visual-word counts per image, or obtain this information from the spatial histogram? Also, how much more effective is the spatial histogram than the classical one I had in mind when picturing the Bag of Words process?
Any help appreciated.
UPDATE:
Here is the part of the code where the histograms are computed; you can see how, instead of ending up with a histogram vector of size (number_visual_words), you end up with one of size (numSpatialX * numSpatialY * number_visual_words). To clarify, in this case the model is defined to have numSpatialX = [2 4] and numSpatialY = [2 4].
for i = 1:length(model.numSpatialX)
    binsx = vl_binsearch(linspace(1,width,model.numSpatialX(i)+1), frames(1,:)) ;
    binsy = vl_binsearch(linspace(1,height,model.numSpatialY(i)+1), frames(2,:)) ;
    % combined quantization: binsa holds the visual-word index of each feature
    bins = sub2ind([model.numSpatialY(i), model.numSpatialX(i), numWords], ...
                   binsy, binsx, binsa) ;
    hist = zeros(model.numSpatialY(i) * model.numSpatialX(i) * numWords, 1) ;
    hist = vl_binsum(hist, ones(size(bins)), bins) ;
    hists{i} = single(hist / sum(hist)) ;
end
hist = cat(1, hists{:}) ;
hist = hist / sum(hist) ;
Part of the problem is also that I haven't worked with spatial histograms either, so I'm not sure how much better than "normal" histograms they are. Maybe someone who has worked with this kind of histogram before could give more helpful insight.
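If it helps, here is a sketch of how a plain per-word histogram could be recovered from one scale of the spatial histogram above, assuming the layout follows the sub2ind call in the loop (i, model, numWords and hists as in the code):
% Collapse one scale's spatial histogram to a plain word histogram
% by summing over the spatial cells (layout per the sub2ind call above).
nY = model.numSpatialY(i); nX = model.numSpatialX(i);
H = reshape(hists{i}, nY * nX, numWords);  % rows = spatial cells, cols = words
wordHist = sum(H, 1)';                     % numWords x 1 plain histogram
wordHist = wordHist / sum(wordHist);       % renormalize if desired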

Detect steps in a Piecewise constant signal

I have a piecewise constant signal shown below. I want to detect the locations of the step transitions (marked in red).
My current approach:
Smooth signal using moving average filter (http://www.mathworks.com/help/signal/examples/signal-smoothing.html)
Perform Discrete Wavelet transform to get discontinuities
Locate the discontinuities to get the location of step transition
I am currently implementing the last step, detecting the discontinuities. However, I cannot get their precise locations and I end up with many false detections.
My question:
Is this the correct approach?
If yes, can someone share some info or an algorithm to use for the last step?
Please suggest an alternate/better approach.
Thanks
Convolve your signal with a 1st derivative of a Gaussian to find the step positions, similar to a Canny edge detection in 1-D. You can do that in a multi-scale approach, starting from a "large" sigma (say ~10 pixels) detect local maxima, then to a smaller sigma (~2 pixels) to converge on the right pixels where the steps are.
You can see an implementation of this approach here.
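A minimal sketch of that idea (sigma and the threshold are illustrative; findpeaks is from the Signal Processing Toolbox):
% Step detection via convolution with a first-derivative-of-Gaussian
% kernel, a 1-D analogue of Canny edge detection. Values illustrative.
sigma = 10;                                   % "large" scale, in samples
t = -ceil(4*sigma):ceil(4*sigma);
g  = exp(-t.^2 / (2*sigma^2));
dg = -t .* g / (sigma^2 * sum(g));            % derivative-of-Gaussian kernel
r  = conv(y, dg, 'same');                     % y is the input signal
th = 0.5 * max(abs(r));                       % illustrative threshold
[~, steps] = findpeaks(abs(r), 'MinPeakHeight', th);  % candidate step locations
Repeating this with a smaller sigma (say ~2 samples) around the detected maxima refines the positions, as described above.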
If your function is really piecewise constant, why not use just abs of diff compared to a threshold?
th = 0.1;
x_steps = x(abs(diff(y)) > th)
where x is a vector with your x-axis values, y is your y-axis data, and th is a threshold.
Example:
>> x = [2 3 4 5 6 7 8 9];
>> y = [1 1 1 2 2 2 3 3];
>> th = 0.1;
>> x_steps = x(abs(diff(y)) > th)
x_steps =
4 7
Regarding your point 3 (please suggest an alternate/better approach):
I suggest to use a Potts "filter". This is a variational approach to get an accurate estimation of your piecewise constant signal (similar to the total variation minimization). It can be interpreted as adaptive median filtering. Given the Potts estimate u, the jump points are the points of non-zero gradient of u, that is, diff(u) ~= 0. (There are free Matlab implementations of the Potts filters on the web)
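Assuming u is the piecewise constant Potts estimate returned by such an implementation, extracting the jump positions is then a one-liner:
% u: Potts estimate from any Potts implementation (assumed available).
jump_idx = find(diff(u) ~= 0);   % indices just before each jump
x_steps  = x(jump_idx);          % corresponding x-axis positions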
See also http://en.wikipedia.org/wiki/Step_detection
Total Variation Denoising can produce a piecewise constant signal. Then, as pointed out above, "abs of diff compared to a threshold" returns the position of the transitions.
There exist very efficient algorithms for TVDN that process millions of data points within milliseconds:
http://www.gipsa-lab.grenoble-inp.fr/~laurent.condat/download/condat_fast_tv.c
Here's an implementation of a variational approach with python and matlab interface that also uses TVDN:
https://github.com/qubit-ulm/ebs
I think smoothing with a sharper lowpass filter should work better.
Try medfilt1() (a median filter) instead, since you have very distinct levels. If you know how long your plateaus are, you can take half or a quarter of the plateau length as the window, for example. That gives very sharp edges, which should be detectable using a Haar wavelet or even simple differentiation.
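A minimal sketch of that suggestion (the window length is illustrative and should be tied to your plateau length; medfilt1 is from the Signal Processing Toolbox):
% Median-filter the signal, then threshold the differences.
win = 15;                             % odd window, e.g. ~half a plateau
y_med = medfilt1(y, win);             % sharp-edged piecewise estimate
steps = find(abs(diff(y_med)) > th);  % th as in the diff answer above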