Why does k-means clustering give different results every time?

I am using k-means clustering to segment a retinal image. However, every time I run my code the segmentation yields different results for the same image. What is the reason for this change? Following are three segmentation results of the same image.
Below is the code used for this segmentation.
idx = kmeans(double(imreslt1(:)),2);
classimage = reshape(idx, size(imreslt1));
minD = min( classimage (:));
maxD = max( classimage (:));
g = (double(classimage ) - minD) ./ (maxD - minD);
imshow(g);

This is an initialization issue. When kmeans starts, it picks k random initial points as centroids. It then assigns every data point to its nearest centroid, recomputes each centroid as the mean of the points assigned to it, and repeats until the assignments stop changing. Because the starting points are random, different runs can converge to different centroid locations, and the numbering of the clusters (which one is labelled 1 and which one 2) can change as well, although the results are usually similar.

If you read the MATLAB help file for the kmeans function, you'll see that the initial points for the k-means clustering algorithm are chosen randomly according to the k-means++ algorithm. To make this reproducible, you can either pass in your own initial points as follows:
kmeans(...,'Start',[random_points_matrix])
or, you could try seeding the MATLAB internal random number generator using the following:
rng(seed); % where seed is some constant you choose
idx = kmeans(...);
However, I'm not clear on the internals of the kmeans function, so I can't guarantee that this will necessarily produce reproducible results.
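As a minimal sketch of the seeding approach (assuming the Statistics and Machine Learning Toolbox is available; the synthetic data x and all variable names are made up for illustration and stand in for the image pixels), seeding the generator immediately before each call should reproduce the same labels. Sorting the centroids afterwards makes the label numbering independent of the random start:

```matlab
% Hypothetical 1-D data standing in for double(imreslt1(:))
rng(0);
x = [0.2 + 0.05*randn(200,1); 0.8 + 0.05*randn(200,1)];

% Same seed right before each call -> same initial centroids -> same labels
rng(42);
[idx1, c1] = kmeans(x, 2);
rng(42);
[idx2, c2] = kmeans(x, 2);
sameRun = isequal(idx1, idx2) && isequal(c1, c2);

% Relabel so that cluster 1 is always the one with the smaller centroid
[~, order] = sort(c1);              % order(k) = old label of the k-th smallest centroid
newLabel = zeros(size(order));
newLabel(order) = 1:numel(order);   % newLabel(oldLabel) = rank of that centroid
idxSorted = newLabel(idx1);
```

With the labels ordered by centroid value, the dark and bright regions keep the same label from run to run even if the seed is changed.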

Related

On Solving ODE equations and specifying number of samples

I am trying to understand the following set of equations given here: https://matlabgeeks.com/tips-tutorials/modeling-with-odes-in-matlab-part-5b/
The equations are those of a chaotic Lorenz system. The tutorial is quite easy to understand, but what I do not follow is how to set the number of data points to generate, i.e., the length of the time series. Which parameter determines how many data points are generated? Can somebody please help? I have looked into other resources as well, but I could not understand. For instance, by trial and error I found that if I specify
eps = 0.000001; T = [0 45] then the number of data points is about 7000. If I want 10,000 data points, I don't know what the values of these parameters should be.

As described in the article (and the previous parts 1 and 2 of the series), the sequence of sample points is generated dynamically so that each segment contributes about the same amount of truncation error towards the global error, weighted by the absolute and relative tolerances. Additionally, it uses interpolation inside the segment to produce 3 inner points so that a plot will appear curved also for large tolerances. That is, the internal segmentation is given by T(1:4:end), the other points are interpolated.
You can also prescribe your own sample times. The values at those times are likewise interpolated from the "dense output", that is, from the interpolation polynomials over the internally produced segmentation.
T = linspace(t0, tend, 7000);
[T, Y] = ode45('lorenz', T, Y0, options);
You could also extract the dense output via
sol = ode45('lorenz', [t0 tend], Y0, options);
and then use the provided interpolation to compute samples at arbitrary times
Y = deval(sol,T);
In Empirical error proof Runge-Kutta algorithm ... I also computed the error for the Lorenz system for a fixed-step RK method, which shows the same divergence of the solutions after a relatively short time.
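The prescribed-sample-times approach can be sketched as follows. This is only an illustration: the anonymous function lorenz, the classic parameter values 10, 28 and 8/3, the initial condition and the tolerances are assumptions, not taken from the tutorial.

```matlab
% Lorenz right-hand side with the classic parameters (assumed values)
s = 10; r = 28; b = 8/3;
lorenz = @(t, y) [s*(y(2) - y(1)); ...
                  y(1)*(r - y(3)) - y(2); ...
                  y(1)*y(2) - b*y(3)];
opts = odeset('RelTol', 1e-6, 'AbsTol', 1e-6);
y0 = [1; 1; 1];

% Option 1: ask for exactly 10000 output times; ode45 still chooses its
% own internal steps and interpolates onto these times
tspan = linspace(0, 45, 10000);
[T, Y] = ode45(lorenz, tspan, y0, opts);     % Y is 10000-by-3

% Option 2: keep the solution structure and sample it afterwards
sol = ode45(lorenz, [0 45], y0, opts);
Yd = deval(sol, tspan);                      % Yd is 3-by-10000
```

In both cases the number of returned points is set by the length of tspan, while the tolerances only control the accuracy of the underlying adaptive steps.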

How to get the threshold value of k-means algorithm that is used to binarize the images?

I applied the k-means algorithm to segment images, using the built-in kmeans function. It works properly, but I want to know the threshold value that k-means effectively uses to convert the image to binary. For example, with Otsu's method we can get the threshold value by using a built-in MATLAB function:
threshold=graythresh(grayscaledImage);
a=im2bw(grayscaledImage,threshold);
%Applying k-means....
imdata=reshape(grayscaledImage,[],1);
imdata=double(imdata);
[imdx mn]=kmeans(imdata,2);
imIdx=reshape(imdx,size(grayscaledImage));
imshow(imIdx,[]);

Actually, k-means and the well known Otsu threshold for binarizing intensity images based on a global threshold have an interesting relationship:
http://www-cs.engr.ccny.cuny.edu/~wolberg/cs470/doc/Otsu-KMeansHIS09.pdf
It can be shown that k-means is a locally optimal, iterative solution to the same objective function as Otsu, where Otsu is a globally optimal, non-iterative solution.
Given grayscale intensity data, one can compute a threshold with Otsu's method, which is available in MATLAB as graythresh or otsuthresh, depending on which interface you prefer.
A = imread('cameraman.tif');
A = im2double(A);
totsu = otsuthresh(histcounts(A,10000))
[~,c] = kmeans(A(:),2,'Replicates',10);
tkmeans = mean(c)
You can obtain a grayscale threshold from kmeans by taking the midpoint of the two centroids. This makes sense geometrically: a pixel on either side of that midpoint is closer to one centroid than to the other, and therefore belongs to that centroid's cluster. Running the code above gives:
totsu =
0.3308
tkmeans =
0.3472
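To check the midpoint argument, the following sketch compares the k-means labels with a plain threshold at the midpoint of the centroids. It assumes the Statistics and Machine Learning Toolbox; the synthetic intensities x are hypothetical and stand in for A(:).

```matlab
% Hypothetical bimodal "intensities" standing in for image pixels
rng(1);
x = [0.25 + 0.05*randn(500,1); 0.70 + 0.08*randn(500,1)];

[idx, c] = kmeans(x, 2, 'Replicates', 5);
t = mean(c);                    % midpoint between the two centroids

[~, brightLabel] = max(c);      % label of the cluster with the larger centroid
bwKmeans = (idx == brightLabel);
bwThresh = (x > t);

% Fraction of points on which the two binarizations agree
agreement = mean(bwKmeans == bwThresh);
```

For 1-D data with two clusters the two binarizations should coincide once kmeans has converged, so agreement is expected to be 1 or very close to it.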

You can't get the threshold because there is no threshold in the kmeans algorithm.
K-means is a clustering algorithm. It returns clusters, which in many cases cannot be obtained with simple thresholding.
See this link to learn further on how k-means works.

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.

If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
%note: interp1 needs unique sample points, so this assumes A has no zero entries
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
I hope that this helps!
Chip
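A variant of the same 1D trick that also tolerates zero-probability cells is sketched below. It swaps interp1 for a direct search of the CDF with find, and it recovers the (x,y) pairs with ind2sub. The example matrix A and all variable names are hypothetical.

```matlab
% Hypothetical 100x100 discrete PDF with many zero cells
A = zeros(100);
A(20:30, 40:60) = 1;
A(70:80, 10:20) = 2;
A = A / sum(A(:));                 % normalize so the entries sum to 1

% 1-D CDF over the linear indices of A
cdf = cumsum(A(:));
cdf = cdf / cdf(end);              % force the last entry to be exactly 1

% Inverse-transform sampling: first cell whose CDF reaches each uniform draw
rng(0);
N = 5000;
u = rand(N, 1);
ind = arrayfun(@(v) find(cdf >= v, 1), u);

% Convert linear indices back to (x,y) = (row, column) pairs
[xs, ys] = ind2sub(size(A), ind);
xy_values = [xs ys];

% Every sample should fall on a cell with nonzero probability
onSupport = all(A(ind) > 0);
```

The resulting xy_values can be used for fitting, and sample moments such as mean(xy_values) follow directly.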

I don't believe MATLAB has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can easily be generated by inverting the cumulative distribution function, that inversion has no direct analogue for multivariate distributions, so generating such numbers is much messier (the main problem is that two or more variables can be correlated). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as your mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and arg must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.
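The same double-trapz pattern extends to higher moments. The sketch below computes variances and the covariance for a smooth two-hump density similar to the one above (without the floor, which is a simplification made here); the helper int2 is a hypothetical name.

```matlab
% Mesh and a two-hump Gaussian density (same layout as above: rows <-> y, columns <-> x)
xv = linspace(-50, 50, 101);
yv = linspace(-30, 30, 100);
[x, y] = meshgrid(xv, yv);
A = exp(-((x-10).^2 + y.^2)/100) + exp(-((x+25).^2 + y.^2)/100);

% Double integral: inner trapz over y (rows), outer trapz over x
int2 = @(F) trapz(xv, trapz(yv, F));

A = A / int2(A);                   % normalize so the PDF integrates to 1

% First moments
mean_x = int2(A .* x);
mean_y = int2(A .* y);

% Central second moments
var_x  = int2(A .* (x - mean_x).^2);
var_y  = int2(A .* (y - mean_y).^2);
cov_xy = int2(A .* (x - mean_x) .* (y - mean_y));
```

For this density the analytic values are roughly mean_x = -7.5, mean_y = 0 and var_y = 50, up to small truncation effects at the mesh boundary, which gives a quick sanity check on the integration order.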

Using Linear Prediction Over Time Series to Determine Next K Points

I have a time series of N data points of sunspots and would like to predict based on a subset of these points the remaining points in the series and then compare the correctness.
I'm just getting introduced to linear prediction using Matlab and so have decided that I would go the route of using the following code segment within a loop so that every point outside of the training set until the end of the given data has a prediction:
%x is the data, training set is some subset of x starting from beginning
%'unknown' is the number of points to extend the prediction over starting from the
%end of the training set (i.e. difference in length of training set and data vectors)
%x_pred is set to x initially
p = length(training_set);
coeffs = lpc(training_set, p);
for i=1:unknown
nextValue = -coeffs(2:end) * x_pred(end-unknown-1+i:-1:end-unknown-1+i-p+1)';
x_pred(end-unknown+i) = nextValue;
end
error = norm(x - x_pred)
I have three questions regarding this:
1) Does this appropriately do what I have described? I ask because my error seems rather large (>100) when predicting over only the last 20 points of a dataset that has hundreds of points.
2) Am I interpreting the second argument of lpc correctly? Namely, that it means the 'order' or rather number of points that you want to use in predicting the next point?
3) Is there a more efficient, single-line function in Matlab that I can call to replace the loop and compute all necessary predictions for me, given some subset of my overall data as a training set?
I tried looking through the lpc Matlab tutorial but it didn't seem to do the prediction as I have described my needs require. I have also been using How to use aryule() in Matlab to extend a number series? as a reference.

So after much deliberation and experimentation I have found the above approach to be correct and there does not appear to be any single Matlab function to do the above work. The large errors experienced are reasonable since I am using a linear prediction algorithm for a problem (i.e. sunspot prediction) that has inherent nonlinear behavior.
Hope this helps anyone else out there working on something similar.
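For anyone who wants to check the recursive prediction loop on a case with a known answer, here is a small sketch. It is not the code above: it fits the predictor by least squares with the backslash operator instead of lpc (so it runs without the Signal Processing Toolbox), it uses a low order p rather than the full training length, and the sinusoidal test series is made up because a pure sinusoid is exactly predictable with p = 2.

```matlab
% Hypothetical noiseless test series: x(n) = 2*cos(w)*x(n-1) - x(n-2)
n = (0:299)';
x = sin(2*pi*n/25);

nTrain = 280;                       % points used for fitting
p = 2;                              % prediction order (number of past points used)
train = x(1:nTrain);

% Regression matrix: row r holds the p values preceding train(p+r)
X = zeros(nTrain - p, p);
for k = 1:p
    X(:, k) = train(p-k+1 : nTrain-k);
end
target = train(p+1 : nTrain);
a = X \ target;                     % least-squares predictor coefficients

% Recursive prediction: each new value is fed back as input for the next one
x_pred = x;
x_pred(nTrain+1:end) = 0;           % forget the values to be predicted
for i = nTrain+1 : numel(x)
    x_pred(i) = a' * x_pred(i-1:-1:i-p);
end
predErr = norm(x - x_pred);
```

On this linear test signal the prediction error is negligible, so a large error on real data such as sunspots points to the data rather than to the recursion itself.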

Measuring the entropy of a transition probability matrix in matlab

I'm working on a project which requires analyzing certain graph properties of transition probability matrices that are constructed as weighted directed graphs.
One of the properties of interest is the entropy of these graphs, which I have yet to find a proper way to measure. The general idea is that I need some measure that quantifies the extent to which a certain graph is "ordered", in order to ascertain the predictive value of the nodes within the graph (i.e., if all the nodes have the exact same connection patterns, then effectively their predictive value is zero, though this is a very simplistic explanation, as there are many other contributing factors to a node's predictive power).
I've experimented with certain built-in MATLAB commands:
entropy - generally used to determine the entropy of an image
wentropy - to be honest I do not fully understand the proper use of this function, but I've tried using it with the 'shannon' and 'log energy' types, and have produced some inconsistent results
This is a very basic script I whipped up to do some testing, which produces two matrices:
a 20*20 matrix constructed with values drawn entirely from a uniform distribution, intended to produce a matrix with a relatively low degree of order - unordgraph
a 20*20 matrix constructed with four 5*5 "patches" in which the values are integers drawn from a uniform distribution over a range that is significantly larger than one, while the rest of the values are drawn from a uniform distribution on the range 0-1 (as in the previous matrix); this form of graph is more "ordered" than the previous one - ordgraph
When I run the code:
clear all;
n = 50;
gsize = 20;
orderedrange = [100 200];
enttype = 'shannon';
for i = 1:n;
unordgraph = rand(gsize);
% entvec(1,i) = entropy(unordgraph);
entvec(1,i) = wentropy(unordgraph,enttype);
% ordgraph = reshape(1:gsize^2,gsize,gsize);
ordgraph = rand(gsize);
ordgraph(1:5,1:5) = randi(orderedrange,5);
ordgraph(6:10,6:10) = randi(orderedrange,5);
ordgraph(11:15,11:15) = randi(orderedrange,5);
ordgraph(16:20,16:20) = randi(orderedrange,5);
% entvec(2,i) = entropy(ordgraph);
entvec(2,i) = wentropy(ordgraph,enttype);
end
fprintf('the mean entropy of the unordered graph is: %.4f\n',mean(entvec(1,:)));
fprintf('the mean entropy of the ordered graph is: %.4f\n',mean(entvec(2,:)));
I get outputs such as:
the mean entropy of the unordered graph is: 88.8871
the mean entropy of the ordered graph is: -23936552.0113
I'm not really sure about the meaning of such negative values, as running the same script on a matrix comprised entirely of zeros or ones (and hence maximally ordered) produces a mean entropy of 0.
I have a pretty rudimentary background in graph theory, making this task that much more difficult, and I would be really grateful for any help, whether theoretical or algorithmic.
Thanks in advance,
Ron
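One observation that may help with the negative values: in its legacy syntax, the 'shannon' option of wentropy appears to compute -sum(s.^2 .* log(s.^2)) over the raw entries rather than over probabilities. That would account for both outputs above: entries larger than 1 contribute large negative terms, and for 400 uniform entries on 0-1 the expected value is 400*2/9, about 88.9. The sketch below reproduces that formula by hand and, as one possible alternative (an assumption about what is wanted, not the only way to measure order), normalizes each row to a probability distribution and computes its Shannon entropy in bits. All variable names are made up.

```matlab
rng(0);
gsize = 20;
W = rand(gsize);                        % hypothetical nonnegative edge weights

% Assumed wentropy 'shannon' formula applied to the raw entries
E_raw = -sum(W(:).^2 .* log(W(:).^2));

% Row-stochastic transition matrix: each row sums to 1
% (uses implicit expansion, available from R2016b)
P = W ./ sum(W, 2);

% Shannon entropy of each row in bits, treating 0*log2(0) as 0
logP = log2(P);
logP(P == 0) = 0;
H = -sum(P .* logP, 2);

Hmax = log2(gsize);                     % entropy of a uniform row
H_identity = -sum(eye(3) .* 0, 2);      % deterministic rows: entropy 0
```

Each entry of H lies between 0 (the next node is fully determined) and Hmax (all transitions equally likely), so the values are never negative.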
