My program uses K-means clustering with a user-specified number of clusters. In this case k=4, and afterwards I would like to run the clustered data through MATLAB's naive Bayes classifier.
Is there a way to split the clusters up and feed each one into a different naive Bayes classifier in MATLAB?
Naive Bayes:
class = classify(test,training, target_class, 'diaglinear');
K-means:
%% generate sample data
K = 4;
numObservarations = 5000;
dimensions = 42;
data = rand([numObservarations dimensions]); % sample data matrix
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
    % squared Euclidean distance of every instance to centroid k
    D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find, for each instance, the closest cluster
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
Something like outputting the k clusters to a format k1, k2, k3 and then having the naive Bayes classifier pick those up; instead of test it would be k1, k2, etc.:
class = classify(k1,training, target_class, 'diaglinear');
But I just don't know how to get the output of the k clusters in MATLAB into some kind of format like that (I'm really new to this program).
EDIT
training = [1;0;-1;-2;4;0]; % this is the sample data.
target_class = ['posi';'zero';'negi';'negi';'posi';'zero'];% This should have the same number of rows as training data. The elements and the class on the same row should correspond.
% target_class lists the target class for each training example; here 'posi', 'zero' and 'negi' are the classes for the given training data
% Training and testing the classifier
test = 10*randn(10,1) % this is for testing. I am generating random numbers.
class = classify(test,training, target_class, 'diaglinear') % This command classifies the test data depending on the given training data using a naive Bayes classifier
% diaglinear is for naive bayes classifier; there is also diagquadratic
Try this:
% create 100 random points (this is the training data)
X = rand(100,3);
% cluster into 5 clusters
K = 5;
[IDX, C] = kmeans(X, K);
% now let us say you have new data and you want
% to classify it based on the training:
SAMPLE = rand(10,3);
CLASS = classify(SAMPLE,X,IDX);
And if you just want to filter one of the clusters out of the data, you can do something like this:
K1 = X(IDX==1,:)
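If you then want to feed each cluster into its own naive Bayes classifier (as asked above), a rough sketch would be to loop over the cluster labels; this assumes you also have a training matrix and target_class labels with the same number of columns/features as X:
results = cell(K,1);
for k = 1:K
    Ksub = X(IDX==k, :); % all rows of X assigned to cluster k
    % run the 'diaglinear' (naive-Bayes-like) classifier on this cluster's rows
    results{k} = classify(Ksub, training, target_class, 'diaglinear');
end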
Hope that was helpful..
Related
I want to fit a histogram to some data using predefined bins. All my data points are between 1 and 10, so I want the bins to start from xmin=1, and end at xmax=10, with a step of 0.5.
I use the following commands:
x = d1.data(:,4); % x is my data
H = histfit(x,10,'normal'); % fits a histogram using 10 bins
However when doing the above, bins are determined automatically per dataset and do not correspond to the edges I want. How can I ensure that the same bin edges are used for all datasets?
If you have access to the Curve Fitting Toolbox, I would suggest another approach that provides the required flexibility. This involves doing the fit "yourself" instead of relying on histfit:
% Generate some data:
rng(66221105) % set random seed, for reproducibility
REAL_SIG = 1.95;
REAL_MU = 5.5;
X = randn(200,1)*REAL_SIG + REAL_MU;
% Define the bin edges you want
EDGES = 1:0.5:10;
% Bin the data according to the predefined edges:
Y = histcounts(X, EDGES);
% Fit a normal distribution using the curve fitting tool:
binCenters = conv(EDGES, [0.5, 0.5], 'valid'); % moving average
[xData, yData] = prepareCurveData( binCenters, Y );
ft = fittype( 'gauss1' );
fitresult = fit( xData, yData, ft );
disp(fitresult); % optional
% Plot fit with data (optional)
figure();
histogram(X, EDGES); hold on; grid on;
plot(fitresult);
This yields a plot of the histogram with the fitted Gaussian overlaid, and the fitted model:
General model Gauss1:
fitresult(x) = a1*exp(-((x-b1)/c1)^2)
Coefficients (with 95% confidence bounds):
a1 = 19.65 (17.62, 21.68)
b1 = 5.15 (4.899, 5.401)
c1 = 2.971 (2.595, 3.348)
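If you need to repeat this for several datasets with the same edges, one option (a sketch; fitWithFixedEdges is just an illustrative name) is to wrap the binning and fitting in a small helper:
function fitresult = fitWithFixedEdges(x, edges)
% Bin x with the predefined edges and fit a Gaussian to the bin counts
counts = histcounts(x, edges);
centers = conv(edges, [0.5, 0.5], 'valid'); % bin centers via moving average
[xData, yData] = prepareCurveData(centers, counts);
fitresult = fit(xData, yData, fittype('gauss1'));
end
which you would then call per dataset, e.g. fitWithFixedEdges(d1.data(:,4), 1:0.5:10).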
I am training an Elman network (a specific type of Recurrent Neural Network) and for that reason my datasets (input/target) need to be cell arrays (so that the examples are considered as a sequence by the train function).
But I can't manage to get the train function to use a validation and test set.
Here is an example, where I want a validation and test set to be used but the train function is not using any (I know that by looking at the performance plot from the 'nntraintool' wizard or by looking at the content of the 'tr' variable in my example below). It seems the "divideind" property and indexes are ignored.
%% Set the parameters of the run
n_neurons = 50; % Number of neurons
n = 1000; % Total number of samples
ne = 500; % Number of epochs
%% Create the samples
% Allocate memory
u = zeros(1, n);
x = zeros(1, n);
y = zeros(1, n);
% Initialize u, x and y
u(1)=randn;
x(1)=rand+sin(u(1));
y(1)=x(1);
% Calculate the samples
for i=2:n
    u(i)=randn;
    x(i)=.8*x(i-1)+sin(u(i));
    y(i)=x(i);
end
%% Create the datasets
X=num2cell(u);
T=num2cell(y);
%% Train and simulate the network
% Create the net and apply the selected parameters
net = newelm(X,T,n_neurons); % Create network
net.trainParam.epochs = ne; % Number of epochs
%% This seems to be ignored
net.divideFcn = 'divideind';
net.divideParam.trainInd = 1:800;
net.divideParam.valInd = 801:900;
net.divideParam.testInd = 901:1000;
[net,tr]= train(net,X,T);
I found the answer: I need to add
net.divideMode = 'time';
so that cell arrays can be divided into train/validation/test sets, for example with:
net.divideFcn = 'divideind';
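Putting it together with the original script, the division setup becomes (this just combines the fix with the indices from the question):
net.divideFcn = 'divideind'; % divide by explicit indices
net.divideMode = 'time'; % required so cell-array (sequence) data is actually divided
net.divideParam.trainInd = 1:800;
net.divideParam.valInd = 801:900;
net.divideParam.testInd = 901:1000;
[net, tr] = train(net, X, T); % tr should now report validation/test performance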
I have conducted a linear SVM on a large dataset; however, in order to reduce the number of dimensions I performed a PCA and then conducted the SVM on a subset of the component scores (the first 650 components, which explained 99.5% of the variance). Now I want to plot the decision boundary in the original variable space using the beta weights and bias from the SVM created in PCA space, but I can't figure out how to project the bias term from the SVM into the original variable space. I've written a demo using the Fisher iris data to illustrate:
clear; clc; close all
% load data
load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
Y = species(inds);
mu = mean(X)
% perform the PCA
[eigenvectors, scores] = pca(X);
% train the svm
SVMModel = fitcsvm(scores,Y);
% plot the result
figure(1)
gscatter(scores(:,1),scores(:,2),Y,'rgb','osd')
title('PCA space')
% now plot the decision boundary
betas = SVMModel.Beta;
m = -betas(1)/betas(2); % my gradient
b = -SVMModel.Bias; % my y-intercept
f = @(x) m.*x + b; % my linear equation
hold on
fplot(f,'k')
hold off
axis equal
xlim([-1.5 2.5])
ylim([-2 2])
% inverse transform the PCA
Xhat = scores * eigenvectors';
Xhat = bsxfun(@plus, Xhat, mu);
% plot the result
figure(2)
hold on
gscatter(Xhat(:,1),Xhat(:,2),Y,'rgb','osd')
% and the decision boundary
betaHat = betas' * eigenvectors';
mHat = -betaHat(1)/betaHat(2);
bHat = b * eigenvectors';
bHat = bHat + mu; % I know I have to add mu somewhere...
bHat = bHat/betaHat(2);
bHat = sum(sum(bHat)); % sum to reduce the matrix to a single value
% the correct value of bHat should be 6.3962
f = @(x) mHat.*x + bHat;
fplot(f,'k')
hold off
axis equal
title('Recovered feature space')
xlim([3 7])
ylim([0 4])
Any guidance on how I'm calculating bHat incorrectly would be much appreciated.
Just in case anyone else comes across this problem: the solution is that the bias term can be used to find the y-intercept, b = -SVMModel.Bias/betas(2). The y-intercept is just another point in space, [0 b], which can be recovered/unrotated by inverse transforming it through the PCA. This new point can then be used to solve the linear equation y = mx + b (i.e., b = y - mx). So the code should be:
% and the decision boundary
betaHat = betas' * eigenvectors';
mHat = -betaHat(1)/betaHat(2);
yint = b/betas(2); % y-intercept in PCA space
yintHat = [0 yint] * eigenvectors'; % recover in original space
yintHat = yintHat + mu;
bHat = yintHat(2) - mHat*yintHat(1); % solve the linear equation
% the correct value of bHat is now 6.3962
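With mHat and the corrected bHat you can then draw the boundary in the recovered feature space exactly as in the original demo:
f = @(x) mHat.*x + bHat; % decision boundary in the original variable space
hold on
fplot(f,'k')
hold off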
I have a question on how to use the silhouette function in MATLAB.
I have a 90x90 correlation matrix X and the cluster membership numbers for my data; say I have five clusters, defined by cidx, a 90x1 vector in which each value is a number from 1 to 5.
Can I just pass the correlation matrix and cidx to the silhouette function and specify the measure as 'correlation', or should I be passing in my returns matrix instead?
Thanks for your help!
First of all you need to create your clusters. For example, the kmeans function in MATLAB does this for you:
cidx = kmeans(X,2,'distance','sqEuclidean');
According to MATLAB:
IDX = kmeans(X,k) partitions the points in the n-by-p data matrix X
into k clusters. This iterative partitioning minimizes the sum, over
all clusters, of the within-cluster sums of point-to-cluster-centroid
distances. Rows of X correspond to points, columns correspond to
variables. kmeans returns an n-by-1 vector IDX containing the cluster
indices of each point.
so here cidx is the n-by-1 vector of cluster indices.
After finding the indices you can pass the X and the cidx to the silhouette function:
s = silhouette(X,cidx,'Euclidean')
s is the silhouette values in the n-by-1 vector.
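If your clusters were actually built from the underlying returns with a correlation-based measure, you would pass that returns matrix (observations in rows) and the matching distance rather than the 90x90 correlation matrix itself; a sketch, assuming returns holds your raw data:
s = silhouette(returns, cidx, 'correlation'); % one silhouette value per observation
avrgScore = mean(s); % overall clustering quality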
Silhouette is used to determine the quality of clustering. The way this function works is illustrated below using a 100x3 matrix.
Example -
NofClusters = 3;
K = NofClusters; % the code below refers to the number of clusters as K
numObservarations = 100;
dimensions = 3;
data = rand([numObservarations dimensions]);
numObservarations = length(data);
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
I have a 1000x6 dataset, and the kmeans script below works fine, but when I want to output one of the clusters it only comes out as one column?
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',6);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
    % squared Euclidean distance of every instance to centroid k
    D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find, for each instance, the closest cluster
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
% Output cluster data to K datasets
K1 = data(clustIDX==1)
K2 = data(clustIDX==2)... etc
Shouldn't K1 = data(clustIDX==1) output the full row information, not just one column but all six like the original dataset? Or is this just outputting the distances?
Replace
K1 = data(clustIDX==1)
K2 = data(clustIDX==2)
with
K1 = data(clustIDX==1,:)
K2 = data(clustIDX==2,:)
The first form retrieves only the first column of the corresponding rows, because the logical vector indexes the matrix as if it were a single column. The second form selects whole rows, so you get all six columns; I've tried it and it works.
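And if you want all K clusters at once instead of creating K1, K2, ... by hand, one option (a sketch) is to collect them in a cell array:
Ksets = cell(K,1);
for k = 1:K
    Ksets{k} = data(clustIDX==k, :); % full rows (all 6 columns) of cluster k
end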