Why does MATLAB k-means not find the best centroids while Excel Solver does? - matlab

I have a data set as follows:
Data = [4 12; 5 10; 8 7; 5 3; 5 4; 2 11; 5 4; 3 8; 6 2; 7 4; 10 8; 8 9; 10 9; 10 12]
Then I proceed with:
[idx,ctrs, sumD] = kmeans(Data,3)
It gives me the centroids and sumD (sums of point-to-centroid distances within cluster) like:
ctrs = [5.6000 3.4000; 3.5000 10.2500; 9.2000 9.0000]
sumD = [6.4000; 13.7500; 18.8000]
Whereas according to Excel Solver (from a published article), ctrs and sumD are as follows for k=3:
ctrs = [5.21815716 3.66736761; 3.615385665 10.461533; 9.47841197 8.75055345]
sumD = [5.151897802; 7.285383286; 8.573829765]
(NB: In that article, the authors give an initial (seed) centroid to each cluster such as [4 4; 5 12; 10 6] by visual decision from the plot.)
Apparently, Excel finds more accurate ctrs values and therefore smaller sumD values. I could not achieve this with MATLAB, so I tried other parameters of the kmeans function: 'replicates', 'options' (MaxIter) and 'start' - even with a 3D seed array - to no avail. I even used the same initial seed as the article. The following is what I tried without success:
First:
opts = statset('MaxIter',100);
Seed = [4 4; 5 12; 10 6];
[idx,ctrs] = kmeans(Data,3,'Replicates',50,'options',opts,'start',Seed)
This gives an error: The third dimension of the 'Start' array must match the 'replicates' parameter value.
Second:
I created a 3D array of 50 pages where the first page is the same initial seed as above and the remaining 49 are random. I created the random pages as:
T = rand(3,2,49);
After that, I created the 50 pages 3D array like this:
Seed2 = cat(3,Seed,T);
Then used kmeans:
[idx,ctrs] = kmeans(Data,3,'Replicates',50,'options',opts,'start',Seed2)
However, MATLAB gave warnings indicating that all the replicates after the first one were terminated because an empty cluster was created at iteration 1. Also, the idx, ctrs and sumD values obtained were still the same as before, as if I had run my very first call above (i.e. [idx,ctrs, sumD] = kmeans(Data,3) ).
I am stuck. I am trying to verify the results of the Excel Solver published in the article using MATLAB, because I will then apply the same algorithm that the article applied to 14 observations to a larger data set of 900+ observations.
What am I doing wrong? What should I correct in my code to obtain the same, or a very similar, result to the Excel Solver?
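A note on the second attempt: rand returns values between 0 and 1, while the data spans roughly 2 to 12 on both axes, so every random seed page sits far away from all observations. That is a plausible (not confirmed) reason for the empty-cluster warnings. A minimal sketch that draws the random seed pages inside the bounding box of the data instead; the names lo, hi, k and nRandom are introduced here for illustration only:

```matlab
Data = [4 12; 5 10; 8 7; 5 3; 5 4; 2 11; 5 4; 3 8; 6 2; 7 4; 10 8; 8 9; 10 9; 10 12];
Seed = [4 4; 5 12; 10 6];          % seed taken from the article

k = 3;                             % number of clusters
nRandom = 49;                      % number of additional random seed pages
lo = min(Data, [], 1);             % per-column minimum of the data
hi = max(Data, [], 1);             % per-column maximum of the data

% Draw every random seed uniformly inside the bounding box of the data
T = zeros(k, size(Data, 2), nRandom);
for p = 1:nRandom
    T(:, :, p) = repmat(lo, k, 1) + rand(k, size(Data, 2)) .* repmat(hi - lo, k, 1);
end

Seed2 = cat(3, Seed, T);           % 3-by-2-by-50, first page is the article's seed
```

Seed2 can then be passed to kmeans exactly as in the second attempt. This only addresses the warnings; it does not by itself make the centroids match the Excel values.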

The difference appears to be in the choice of the measure of distance used, not in the coding. There is more than one way to define "distance" in this context.
MATLAB uses squared Euclidean distance by default. By hand calculating this with the MATLAB results I can replicate the sumD results you get. However, using squared Euclidean distance measure with the results you give from the paper gives a higher value of sumD.
I get the same results for sumD as the paper if I use plain (not squared) Euclidean distance. Using this measure the MATLAB results return higher values for sumD.
So neither result is wrong as such; they are just measuring "rightness" in different ways.
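The hand calculation described above can be sketched as follows. It uses only base functions, assigns each point to its nearest centroid, and sums both the squared and the plain Euclidean distances per cluster. The variable names (ctrsMatlab, ctrsExcel, sumSq, sumEuc) are made up for this sketch:

```matlab
Data = [4 12; 5 10; 8 7; 5 3; 5 4; 2 11; 5 4; 3 8; 6 2; 7 4; 10 8; 8 9; 10 9; 10 12];
ctrsMatlab = [5.6 3.4; 3.5 10.25; 9.2 9.0];
ctrsExcel  = [5.21815716 3.66736761; 3.615385665 10.461533; 9.47841197 8.75055345];

centers = {ctrsMatlab, ctrsExcel};
sumSq  = zeros(3, 2);   % column 1: MATLAB centroids, column 2: Excel centroids
sumEuc = zeros(3, 2);
for c = 1:2
    C = centers{c};
    % D(i,k) = Euclidean distance from point i to centroid k
    D = zeros(size(Data, 1), size(C, 1));
    for k = 1:size(C, 1)
        D(:, k) = sqrt((Data(:, 1) - C(k, 1)).^2 + (Data(:, 2) - C(k, 2)).^2);
    end
    [dmin, idx] = min(D, [], 2);            % nearest centroid for every point
    for k = 1:size(C, 1)
        sumEuc(k, c) = sum(dmin(idx == k));     % sum of plain distances
        sumSq(k, c)  = sum(dmin(idx == k).^2);  % sum of squared distances
    end
end
% sumSq(:,1) reproduces the sumD from kmeans: [6.4; 13.75; 18.8]
% sumEuc(:,2) reproduces the sumD from the article, about [5.152; 7.285; 8.574]
disp(sumSq); disp(sumEuc);
```

Both sets of centroids produce the same cluster membership; each set is simply better under its own distance measure.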

How can you be certain that the Excel values are correct and that MATLAB's kmeans gives a less accurate result?
With the quick MATLAB script below I plotted the centroids, and at least visually they seem correct:
Data = [4 12; 5 10; 8 7; 5 3; 5 4; 2 11; 5 4; 3 8; 6 2; 7 4; 10 8; 8 9; 10 9; 10 12];
plot(Data(:,1), Data(:,2),'ob','markersize', 10);
axis([min(Data(:,1))-2, max(Data(:,1))+2, min(Data(:,2))-2, max(Data(:,2))+2]);
hold on;
[idx,ctrs, sumD] = kmeans(Data,3);
plot(ctrs(:,1), ctrs(:,2), '*r', 'markersize', 10);
If this is not accurate enough, instead of trying to customize MATLAB's kmeans, we can define our own k-means function. I implemented k-means some time ago, and it seemed easier than asking MATLAB to fine-tune the parameters.
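For reference, a hand-written k-means could look roughly like the following. This is a minimal sketch of the standard Lloyd iteration with squared Euclidean distance, not the implementation mentioned above; it starts from the seed given in the article and keeps a centroid unchanged if its cluster becomes empty (an assumption made here for simplicity):

```matlab
Data = [4 12; 5 10; 8 7; 5 3; 5 4; 2 11; 5 4; 3 8; 6 2; 7 4; 10 8; 8 9; 10 9; 10 12];
ctrs = [4 4; 5 12; 10 6];              % initial seed from the article
k = size(ctrs, 1);

for iter = 1:100
    % Assignment step: squared distance from every point to every centroid
    D = zeros(size(Data, 1), k);
    for j = 1:k
        D(:, j) = sum((Data - repmat(ctrs(j, :), size(Data, 1), 1)).^2, 2);
    end
    [~, idx] = min(D, [], 2);

    % Update step: each centroid moves to the mean of its assigned points
    newCtrs = ctrs;
    for j = 1:k
        if any(idx == j)
            newCtrs(j, :) = mean(Data(idx == j, :), 1);
        end
    end

    if isequal(newCtrs, ctrs)          % no change, so the iteration has converged
        break;
    end
    ctrs = newCtrs;
end
disp(ctrs);
```

With this seed the iteration settles on the same centroids that kmeans reported in the question, which supports the point that the difference from Excel is not caused by the starting values.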

Related

How to use GPU for extracting patches from images

I am looking for a way to extract patches of size m x n from a matrix in MATLAB using the GPU. The patch locations are determined by shifting an m x n window by s pixels horizontally and vertically.
Originally, I solved this problem with two naive nested for loops. However, that gets slow when I have a large image and small m and n. Because this task has to be done many times, I want to increase the speed. I thought this process could be done in parallel, so I tried parfor, but it is even slower than the normal for loops.
Right now I am trying to use the GPU for this task, but I currently have no idea how to implement it. I have checked arrayfun, but it seems to support only element-wise calculations.
So, is it possible to use the GPU to help solve this problem? It seems like it could be done in parallel.
EDIT: an example to the problem
Let's say I have a 3 x 3 matrix (I will write the matrix in MATLAB format).
A = [1 2 3
     4 5 6
     7 8 9];
I want to extract patches of size 2 x 2, and I want each patch to be generated by a 1-pixel shift. So, the result should be
`p1 = [1 2; 4 5]; p2 = [2 3; 5 6]; p3 = [4 5; 7 8]; p4 = [5 6; 8 9];`
And here is an example of what I did in the for loops method.
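A = magic(3);
patchSize = [2 2];
shift = 1;
% index of the last window position along each dimension (zero-based)
finalI = floor((size(A,1) - patchSize(1)) / shift);
finalJ = floor((size(A,2) - patchSize(2)) / shift);
% the loops run from 0 to finalI/finalJ, so the cell needs one extra row and column
patch = cell(finalI+1,finalJ+1);
for i = 0:finalI
    for j = 0:finalJ
        patch{i+1,j+1} = A(1+i*shift : patchSize(1)+i*shift, 1+j*shift : patchSize(2)+j*shift);
    end
end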
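One loop-free way to express the same extraction is to build the linear indices of all patches at once and index the matrix a single time. This is only a sketch using base functions; the names m, n, s, offs, base and P are chosen here for illustration. In principle A could be a gpuArray (Parallel Computing Toolbox) so that the indexing runs on the GPU, but whether that is actually faster has not been measured here:

```matlab
A = [1 2 3; 4 5 6; 7 8 9];
m = 2; n = 2;                 % patch size
s = 1;                        % shift in pixels
[H, W] = size(A);

% Top-left corners of all patches
[r0, c0] = ndgrid(1:s:(H - m + 1), 1:s:(W - n + 1));
base = r0(:)' + (c0(:)' - 1) * H;        % 1 x nPatches linear indices of the corners

% Linear-index offsets of the elements of one patch relative to its corner
[dr, dc] = ndgrid(0:m-1, 0:n-1);
offs = dr(:) + dc(:) * H;                % (m*n) x 1

% Column p of idx holds the linear indices of patch p (column-major order)
idx = repmat(offs, 1, numel(base)) + repmat(base, numel(offs), 1);
P = reshape(A(idx), m, n, []);           % P(:,:,p) is the p-th patch
```

The patches are ordered by moving the window down first and then to the right, so P(:,:,2) corresponds to p3 in the example above and P(:,:,3) to p2.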

How to get the maximal values and the related coordinates? [duplicate]

Suppose we want to determine the peaks in a vector as follows:
We have a real-valued one-dimensional vector of length m:
x(1), x(2), ..., x(m)
If x(1) > x(2), then clearly the first point is a peak, peak(1) = x(1); otherwise we compare x(3) to x(2), if x(3)
[ indexes,peaks]=function(x,m);
c=[];
b=[];
if x(1)>x(2)
peaks(1)=x(1);
else
for i=2:m-1
if x(i+1)< x(i) & x(i)>x(i-1)
peak(i)=x(i);
end;
end
end
end
Peaks are also determined using the following picture:
Sorry for the second picture; maybe it is not a triangle, since A and C are just on a straight line, but the peak here is B. I can't continue my code for an algorithm that finds the peak values in my vector. Please help me to continue it.
Update: numerical example
x=[2 1 3 5 4 7 6 8 9]
Here the first point is greater than the second, so peak(1)=2. Then we compare 1 to 3; because 3 is greater than 1, we next compare 5 to 3, which is also greater. Then we compare 5 to 4; because 5 is greater than 4, peak(2)=5. If we continue, the next peak is 7 and the final peak is 9.
If the first element is less than the second, we compare the second element to the third; if the second is greater than both the third and the first, then the second is a peak, and so on.
You could try something like this:
function [peaks,peak_indices] = find_peaks(row_vector)
    A = [min(row_vector)-1 row_vector min(row_vector)-1];
    j = 1;
    for i=1:length(A)-2
        temp=A(i:i+2);
        if(max(temp)==temp(2))
            peaks(j) = row_vector(i);
            peak_indices(j) = i;
            j = j+1;
        end
    end
end
Save it as find_peaks.m
Now, you can use it as:
>> A = [2 1 3 5 4 7 6 8 9];
>> [peaks, peak_indices] = find_peaks(A)
peaks =
2 5 7 9
peak_indices =
1 4 6 9
This would however give you "plateaus" as well (adjacent and equal "peaks").
You can use diff to do the comparison and add two points in the beginning and end to cover the border cases:
B=[1 diff(A) -1];
peak_indices = find(B(1:end-1)>=0 & B(2:end)<=0);
peaks = A(peak_indices);
It returns
peak_indices =
1 4 6 9
peaks =
2 5 7 9
for your example.
findpeaks (Signal Processing Toolbox) does this if you have a recent MATLAB version, but it is also a bit slow.
The proposed solution would be quite slow because of the for loop, and there is also a risk of rounding error because it compares the maximal value to the central one instead of comparing the position of the maximum, which is better for your purpose.
You can stack the data into three columns: the first holds the preceding value, the second the data itself and the third the next value. Take the max along each row; your local maxima are the points for which the position of the max along the columns is 2.
I've coded this as a subroutine of my own peak detection function, which adds a further level of iterative peak detection:
http://www.mathworks.com/matlabcentral/fileexchange/42927-find-peaks-using-scale-space-approach
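The three-column idea can be sketched as below. This is an illustration of the description above, not the code from the linked submission; padding both ends with -Inf so that the first and last points can count as peaks is an assumption made here to match the example in the question:

```matlab
x = [2 1 3 5 4 7 6 8 9];
x = x(:);                                   % work with a column vector

padded = [-Inf; x; -Inf];                   % lets the border points qualify as peaks
S = [padded(1:end-2), padded(2:end-1), padded(3:end)];   % [previous, current, next]

% Position of the row-wise maximum; for ties max returns the first position
[~, pos] = max(S, [], 2);

peak_indices = find(pos == 2)';             % current value is the maximum of its row
peaks = x(peak_indices)';
```

For the example vector this gives peak_indices = [1 4 6 9] and peaks = [2 5 7 9]. Because max returns the first position in case of ties, only the first sample of a flat plateau is reported.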

Using bin counts as weights for random number selection

I have a set of data that I wish to approximate via random sampling in a non-parametric manner, e.g.:
eventl=
4
5
6
8
10
11
12
24
32
In order to accomplish this, I initially bin the data up to a certain value:
binsize = 5;
nbins = 20;
[bincounts,ind] = histc(eventl,1:binsize:binsize*nbins);
Then populate a matrix with all possible numbers covered by the bins which the approximation can choose:
sizes = transpose(1:binsize*nbins);
To use the bin counts as weights for selection (e.g. the bin count for 1-5 is 2, so the weight for choosing 1, 2, 3, 4 or 5 is 2, whilst the count for 16-20 is 0, so 16, 17, 18, 19 or 20 can never be chosen), I simply take the bin counts and replicate them across the bin size:
w = repelem(bincounts,binsize);
To then perform weighted number selection, I use:
[~,R] = histc(rand(1,1),cumsum([0;w(:)./sum(w)]));
R = sizes(R);
For some reason this approach is unable to approximate the data. It was my understanding that, with sufficient sampling depth, the binned version of R would be identical to the binned version of eventl; however, there is significant variation, and data often ends up in bins whose weights were 0.
Could anybody suggest a better method to do this or point out the error?
For a better method, I suggest randsample:
values = [1 2 3 4 5 6 7 8]; %# values from which you want to pick
numberOfElements = 1000; %# how many values you want to pick
weights = [2 2 2 2 2 1 1 1]; %# weights given to the values (1-5 are twice as likely as 6-8)
sample = randsample(values, numberOfElements, true, weights);
Note that even with 1000 samples, the distribution does not exactly correspond to the weights, so if you only pick 20 samples, the histogram may look rather different.
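If randsample (Statistics Toolbox) is not available, the same weighted draw can be sketched with base functions only, applied to the data from the question. Note that this swaps in accumarray for the histc binning and plain inverse-CDF sampling for randsample; values with zero weight are removed before sampling, so they cannot be drawn. The names vals, p, cdf and N are introduced here for illustration:

```matlab
eventl = [4 5 6 8 10 11 12 24 32]';
binsize = 5;
nbins = 20;

% Count events per bin: bin b covers the values (b-1)*binsize+1 ... b*binsize
bincounts = accumarray(ceil(eventl / binsize), 1, [nbins 1]);

sizes = (1:binsize*nbins)';                 % all candidate values
w = bincounts(ceil(sizes / binsize));       % weight of a value = count of its bin

% Keep only values that can actually be drawn
vals = sizes(w > 0);
p = w(w > 0) / sum(w);
cdf = cumsum(p);
cdf(end) = 1;                               % guard against round-off in the last step

% Inverse-CDF sampling: pick the first value whose cumulative weight reaches r
N = 1000;
r = rand(N, 1);
R = vals(sum(bsxfun(@gt, r, cdf'), 2) + 1);
```

Binning R in the same way as eventl should then approximate the original bin proportions, with empty bins staying empty, subject to the sampling noise mentioned above.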

Indexing matrix to get corresponding values for condition,store in a new matrix each time?

Sorry for the perhaps confusing title...
Basically I have a 3x3 matrix containing elevation angle, azimuth angle and range. I want to generate new matrices each time elevation >5 deg. There are usually about 5 segments that have this data and I want to separate each one into a new matrix.
I know how to index but not sure how to put this condition in...
Thanks
sat_tcs=llh2tcsT(sat_llh,station_llh);
sat_elev=atan2(sat_tcs(3,:),sqrt(sat_tcs(1,:).^2+sat_tcs(2,:).^2));
sat_azim=atan2(-sat_tcs(2,:),sat_tcs(1,:));
range=sqrt(sat_tcs(1,:).^2+sat_tcs(2,:).^2+sat_tcs(3,:).^2);
sat_elev(sat_elev < 5*deg2rad) = NaN;
sat_look_tcs=[sat_elev;sat_azim;range];
It would be helpful to have some examples of the input and expected output, but taking a guess at what you mean I'd try this:
elevation_column = 3;
threshold = 5;
m = [1 2 3; 4 5 6; 7 8 9; 1 2 3];
n = m(m(:,elevation_column)>threshold,:);
This produces:
n =
4 5 6
7 8 9
Sorry, I would post an image of my graph, but apparently I need reputation points for that. The elevation data looks almost sinusoidal, so it has regions above 5 deg and then falls again. I want to generate a new matrix for every set above this angle.
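Splitting the data into one matrix per contiguous region above the threshold could be sketched as follows. The input values are made up for illustration, with the layout assumed from the question (row 1 elevation in degrees, row 2 azimuth, row 3 range); the resulting segments are stored in a cell array rather than in separately named matrices:

```matlab
% Hypothetical look-angle data: one column per time step
elev = [2 6 7 3 1 8 9 10 4 6];
azim = 10:10:100;
dist = 101:110;
look = [elev; azim; dist];

threshold = 5;
above = look(1, :) > threshold;        % true where elevation exceeds the threshold

% Pad with zeros so that runs touching either end are detected too
d = diff([0, above, 0]);
starts = find(d == 1);                 % first column of each run
stops  = find(d == -1) - 1;            % last column of each run

segments = cell(1, numel(starts));
for s = 1:numel(starts)
    segments{s} = look(:, starts(s):stops(s));
end
```

With these made-up values there are three segments, covering columns 2 to 3, 6 to 8 and 10.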
