I have a question on how to use silhouette function in matlab
if i have my correlation matrix X = 90x90 and my cluster membership numbers for my data
; say i have five clusters. This is defined as cidx which is length 90x1 each value is assigned a number from 1 to 5.
Can I just pass the correlation matrix and cidx to the silhouette function and specify the measure as 'correlation' or should i be passing in my returns matrix instead?
Thanks for your help!
First of all you need to make your clusters. For example kmeans function in matlab does this for you.
cidx = kmeans(X,2,'distance','Euclidean');
According to MATLAB:
IDX = kmeans(X,k) partitions the points in the n-by-p data matrix X
into k clusters. This iterative partitioning minimizes the sum, over
all clusters, of the within-cluster sums of point-to-cluster-centroid
distances. Rows of X correspond to points, columns correspond to
variables. kmeans returns an n-by-1 vector IDX containing the cluster
indices of each point.
so here cidx is the n-by-1 cluster indices.
After finding the indices you can pass the X and the cidx to the silhouette function:
s = silhouette(X,cidx,'Euclidean')
s is the silhouette values in the n-by-1 vector.
Silhouette is used to determine the quality of clustering. The way this function works is illustrated below using a matrix of 100*3 size.
Example -
NofClusters=3;
numObservarations = 100;
dimensions = 3;
data = rand([numObservarations dimensions]);
numObservarations = length(data);
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
Related
I'm having problems in curve fitting my randomized data for the function
Here is my code
N = 100;
mu = 5; stdev = 2;
x = mu+stdev*randn(N,1);
bin=mu-6*stdev:0.5:mu+6*stdev;
f=hist(x,bin);
plot(bin,f,'bo'); hold on;
x_ = x(1):0.1:x(end);
y_ = (1./sqrt(8.*pi)).*exp(-((x_-mu).^2)./8);
plot(x_,y_,'b-'); hold on;
It seems like I'm having vector size problems since it is giving me the error
Error using plot
Vectors must be the same length.
Note that I simplified y_ since mu and the standard deviation is known.
Plot:
Well first of all some adjustments to your question:
You are not trying to do curve fitting. What you are trying to do (in my opinion) is to overlay a probability density function on an histogram obtained by taking random points from the same distribution (A normal distribution with parameters (mu,sigma)). These two curve should indeed overlay, as they represent the same thing, only one is analytical and the other one is obtained numerically.
As seen in the hist documentation, hist is not recommended and you should use histogram instead
First step: Generating your random data
Knowing the distribution is the Normal distribution, we can use MATLAB's random function to do that :
N = 150;
rng('default') % For reproducibility
mu = 5;
sigma = 2;
r = random('Normal',mu,sigma,N,1);
Second step: Plot the histogram
Because we don't just want a count of the elements in each bin, but a feel of the probability density function, we can use the 'Normalization' 'pdf' arguments
Nbins = 25;
f=histogram(r,Nbins,'Normalization','pdf');
hold on
Here I'd rather specify a number of bins than specifying the bins themselves, because you never know in advance how far from the mean your data is going to be.
Last step: overlay the probability density function over the histogram
The histogram being already consistent with a probability density function, it is sufficient to just overlay the density function:
x_ = linspace(min(r),max(r),100);
y_ = (1./sqrt(2*sigma^2*pi)).*exp(-((x_-mu).^2)./(2*sigma^2));
plot(x_,y_,'b-');
With N = 150
With N = 1500
With N = 150.000 and Nbins = 50
If for some obscure reason you want to use old hist() function
The old hist() function can't handle normalization, so you'll have to do it by hand, by normalizing your density function to fit your histogram:
N = 1500;
% rng('default') % For reproducibility
mu = 5;
sigma = 2;
r = random('Normal',mu,sigma,1,N);
Nbins = 50;
[~,centers]=hist(r,Nbins);
hist(r,Nbins); hold on
% Width of bins
Widths = diff(centers);
x_ = linspace(min(r),max(r),100);
y_ = N*mean(Widths)*(1./sqrt(2*sigma^2*pi)).*exp(-((x_-mu).^2)./(2*sigma^2));
plot(x_,y_,'r-');
In MATLAB there is a function called cov. If I insert a matrix X into cov like this cov(X), then cov will return a square matrix of covariance.
My question is very simple:
How can I, with MATLAB, plot that matrix cov(X) onto a 2D plot like this.
I can see a lot of covariance matrix plots at Google. But how do they create them?
My best guess is that you're trying to add the principal components to the plot. To do that, you could do something like this.
%% generate data points
S_tru = [2 1; 1 1];
N = 1000;
%% compute mean, covariance, principal components
X = mvnrnd([0,0],S_tru,N);
mu = mean(X);
S = cov(X);
[U,D] = eig(S);
%% specify base points/directions for arrows
base = [mu;mu];
vecs = sqrt(D)*U';
vecs = 2 * vecs;
%% plot
plot(X(:,1),X(:,2), 'r.')
axis equal
hold on
quiver(base(:,1),base(:,2),vecs(:,1),vecs(:,2),'blue','LineWidth',2)
Resulting graph:
I have a matrix M(mx2). The first column is my bins and the second one is the frequency associated with each bin. I want to fit a smooth curve to this histogram in matlab, but most of what I have tried (like kdensity) needs the real distribution of data, which I don't have them.
Is there any functions that can take the bins and their frequency and give me a smooth curve of bin-freq. ?
Here's a hack that should work for you: generate a sample from your histogram, then run ksdensity on the sample.
rng(42) % seed RNG to make reproducible
% make example histogram
N = 1e3;
bins = -5:5;
counts = round(rand(size(bins))*N);
M = [bins' counts'];
figure
hold on
bar(M(:,1), M(:,2));
% draw a sample from it
sampleCell = arrayfun( #(i) repmat(M(i,1), M(i,2), 1), 1:size(M,1), 'uniformoutput', false )';
sample = cat(1, sampleCell{:});
[f, x] = ksdensity(sample);
plot(x, f*sum(M(:,2)));
So suppose you pass some matrix N to hist3 in Matlab, which is a m-by-2 matrix, simply for an example purposes. Where the first column is your variable X and column 2 corresponds to your variable Y.
When you run the cnt = hist3(N, {bins_X bins_Y}), you would get a m-by-m matrix. Rows here are which variable, X or Y?
OP seems to have solved his problem. However, I am leaving a code snippet exemplifying hist3's output indexing in case anyone finds it useful.
% Simulate random 2-column matrix
X = randn(1e5,2);
% Scale x-axis data to see label distinction
X(:,1) = X(:,1)*10;
% Define bins
bin_x = linspace(-30,30,80);
bin_y = linspace(-3,3,100);
% Get frequency grid
cnt = hist3(X,{bin_x,bin_y});
% Plot frequency values with surf
[x,y] = meshgrid(bin_x,bin_y);
figure
surf(x,y,cnt')
title('Original hist3 output')
xlabel('First Column')
ylabel('Second Column')
zlabel('Frequency')
% Access and modify cnt, and plot again
cnt(end,1:10) = 60;
cnt(25:55,1:55)= 0;
figure
surf(x,y,cnt')
title('Modified hist3 output')
xlabel('First Column')
ylabel('Second Column')
zlabel('Frequency')
My programme uses K-means clustering of a set amount of clusters from the user. For this k=4 but I would like to run the clustered information through matlabs naive bayes classifier afterwards.
Is there a way to split the clusters up and feed them into different naive classifiers in matlab?
Naive Bayes:
class = classify(test,training, target_class, 'diaglinear');
K-means:
%% generate sample data
K = 4;
numObservarations = 5000;
dimensions = 42;
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
%d = sum((x-y).^2).^0.5
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find for all instances the cluster closet to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
something like outputing k clusters to a format k1,k2,k3 then having the naive classifier pick those up, instead of test it would be k1,k2.. etc
class = classify(k1,training, target_class, 'diaglinear');
But I just dont know how to send the output of the k clusters in matlab to some type of format? (really new to this programme)
EDIT
training = [1;0;-1;-2;4;0]; % this is the sample data.
target_class = ['posi';'zero';'negi';'negi';'posi';'zero'];% This should have the same number of rows as training data. The elements and the class on the same row should correspond.
% target_class are the different target classes for the training data; here 'positive' and 'negetive' are the two classes for the given training data
% Training and Testing the classifier (between positive and negative)
test = 10*randn(10,1) % this is for testing. I am generating random numbers.
class = classify(test,training, target_class, 'diaglinear') % This command classifies the test data depening on the given training data using a Naive Bayes classifier
% diaglinear is for naive bayes classifier; there is also diagquadratic
Try this:
% create 100 random points (this is the training data)
X = rand(100,3);
% cluster into 5 clusters
K = 5;
[IDX, C] = kmeans(X, K);
% now let us say you have new data and you want
% to classify it based on the training:
SAMPLE = rand(10,3);
CLASS = classify(SAMPLE,X,IDX);
And if you just want to filter out one of the clusters out of the data you can do something like that:
K1 = X(IDX==1)
Hope that was helpful..