Cluster Analysis: correcting observations with negative silhouette width

I am trying to find patterns in a dataset (~1000 series) containing time series data with yearly frequency. Some sample data:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1 1.0000 0.6154 0.0000 0.0769 0.0000 0.0000 0.0000 0.2308 0.6923 0.6923 0.6923 0.6923 0.6923 0.3846 0.3846 0.0769 0.0769 0.0769
2 1.0000 0.8354 0.5274 0.4451 0.4604 0.4634 0.4543 0.2195 0.0976 0.1159 0.0793 0.0000 0.0152 0.0305 0.0305 0.0335 0.0915 0.0152
3 0.9524 0.8571 0.2381 0.1429 0.6667 1.0000 1.0000 0.1905 0.4286 0.3810 0.3810 0.5714 0.0952 0.1905 0.0000 0.0000 0.0952 0.8571
4 0.9200 1.0000 0.6000 0.4000 0.0000 0.4200 0.3600 0.4400 0.4200 0.3200 0.4800 0.6400 0.5200 0.5200 0.5200 0.5400 0.4800 0.7800
5 0.8372 1.0000 0.7209 0.7907 0.6279 0.6047 0.6047 0.6279 0.5349 0.4419 0.4419 0.2791 0.4419 0.2326 0.1860 0.1860 0.1860 0.0000
6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.6154 0.6154 0.6154 0.6154 1.0000
Note that the data is normalized, because I want to cluster the time series based on similar shapes. I thought a cluster analysis would be appropriate, and I tried to cluster the time series with the following function:
a <- factoextra::eclust(Normalized_df, FUNcluster = "kmeans", nstart = 25, k.max = 5)
However, a few observations have a negative silhouette width. Is there a way to correct these assignments? For example, if sil_width is negative, place the observation in its neighbour cluster. An example can be found below.
cluster neighbor sil_width
1 1 3 -0.001258464
2 1 3 -0.004661913
3 1 4 -0.010083277
4 1 4 -0.012569472
5 1 3 -0.012793575
6 1 4 -0.013089868
7 1 5 -0.013346165
The motivation is to correct for these observations, in order to increase the average silhouette width for the clusters.
Any help would be much appreciated!

Moving points with a negative silhouette to another cluster would likely decrease the silhouette of other points in that cluster. It's not obvious how to further improve the results: a) the best solution may contain negative silhouette values, and b) it might be impossible to find a solution with only positive values. Last but not least, c) the result will no longer be a k-means clustering, because some points will no longer be assigned to the closest mean.
The core reason is that the scores within each cluster are tied together. Each point's score depends on its distances to the other members of its own cluster and of the neighbouring cluster, so moving one point to another cluster changes all their scores.
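The coupling can be seen on a toy example. The following is a minimal sketch in MATLAB/Octave (the language used in the rest of this page) that computes silhouette widths by hand with Euclidean distance. The five one-dimensional points and the two labelings are made up for illustration and are not taken from the question's data.

```matlab
% Toy 1-D data: two tight groups and one point (2.6) between them.
x = [0; 1; 2.6; 4; 5];
D = abs(bsxfun(@minus, x, x.'));   % pairwise distances

labelings = [1 1 1 2 2;            % point 3 assigned to cluster 1
             1 1 2 2 2];           % point 3 moved to cluster 2

n = numel(x);
S = zeros(size(labelings));        % S(t,i): silhouette of point i under labeling t
for t = 1:size(labelings, 1)
    lab = labelings(t, :);
    for i = 1:n
        own = (lab == lab(i));
        own(i) = false;                    % exclude the point itself
        a = mean(D(i, own));               % mean distance to its own cluster
        b = mean(D(i, lab ~= lab(i)));     % mean distance to the other cluster
        S(t, i) = (b - a) / max(a, b);
    end
end

disp(S)              % one row of silhouette widths per labeling
disp(mean(S, 2))     % average silhouette width per labeling
```

In this sketch point 3 has a negative width under the first labeling and a positive one after the move, but every other point's width changes as well, and the width of point 5 drops. Here the average happens to rise; nothing guarantees that for real data.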

Related

How do I perform operations on matrix rows while keeping the matrix intact?

Question/problem summary:
Create a 10 by 10 matrix whose first column is the numbers 1,2,3,4,5,6,7,8,9,10
the next column contains the squares of first column: 1, 4, 9,...,100
the third column contains the 3rd power of first column: 1, 8, 27,..., 1000.
the 10th column contains the 10th power of the first column.
Background:
This is for a class assignment, intro to analytical programming. I have tried the following code, but I am not sure why it is not giving the correct output. Any advice or suggestions are appreciated.
row1 = [1:10]
tenXtenMatrix = repmat(row1,10,1)
[row col] = size(tenXtenMatrix)
for i=2:row
for j=1:col
tenXtenMatrix(i,:).^i
end
end
what is expected:
1 2 3 4 5 6 7 8 9 10
1 4 9 16 25 36 49 64 81 100
1 8 27 64 125 216 343 512 729 1000
1 16 81 256 625 1296 2401 4096 6561 10000
etc..
what i got:
0.0000 0.0000 0.0000 0.0001 0.0010 0.0060 0.0282 0.1074 0.3487 1.0000
0.0000 0.0000 0.0000 0.0001 0.0010 0.0060 0.0282 0.1074 0.3487 1.0000
0.0000 0.0000 0.0000 0.0001 0.0010 0.0060 0.0282 0.1074 0.3487 1.0000
0.0000 0.0000 0.0000 0.0001 0.0010 0.0060 0.0282 0.1074 0.3487 1.0000
etc...
Using implicit expansion:
x = 1:10
A = x.^(x.')
Where:
.^ is the element-wise power operator
.' is the transpose operator
More information about implicit expansion can be found in the MATLAB documentation.
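Implicit expansion was introduced in MATLAB R2016b. On older releases the same table can presumably be built with bsxfun. A minimal sketch, which also checks one row against the expected output from the question:

```matlab
x = 1:10;

% bsxfun expands the row x against the column x.' element by element,
% so A(i,j) = x(j)^x(i) = j^i.
A = bsxfun(@power, x, x.');

disp(A(1:3, :))                   % first, second and third powers
disp(isequal(A(2, :), x.^2))      % row 2 holds the squares
```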
Fixes:
You loop over j but never use it.
You compute the power but never assign the result back to the matrix.
row1 = [1:10];
tenXtenMatrix = repmat(row1,10,1);
[row col] = size(tenXtenMatrix);
for i=2:row
tenXtenMatrix(i,:) = tenXtenMatrix(i,:).^i;
end

Augmented matrix rounding issue [duplicate]

This question already has an answer here:
How to round double to something that a 'normal' human can read. (MATLAB)
(1 answer)
Closed 6 years ago.
I am trying to create an augmented matrix to solve a problem, but I can't get MATLAB to stop rounding the values. I want to augment the matrix Diff with the column vector d. I want the decimal values in Diff to remain decimals and the larger values in d to remain larger values, yet whenever I concatenate them, MATLAB automatically reduces all of the values. Why is it doing this, and how do I fix it?
d = [74000;56000;10500;25000;17500;196000;5000]
d =
74000
56000
10500
25000
17500
196000
5000
Diff = I - A
Diff =
0.8412 -0.0064 -0.0025 -0.3404 -0.0014 -0.0083 -0.1594
-0.0057 0.7355 -0.0436 -0.0099 -0.0083 -0.0201 -0.3413
-0.0264 -0.1506 0.6443 -0.0139 -0.0142 -0.0070 -0.0236
-0.3299 -0.0565 -0.0495 0.6364 -0.0204 -0.0483 -0.0649
-0.0089 -0.0081 -0.0333 -0.0295 0.6588 -0.0237 -0.0020
-0.1190 -0.0901 -0.0996 -0.1260 -0.1722 0.7632 -0.3369
-0.0063 -0.0126 -0.0196 -0.0098 -0.0064 -0.0132 0.9988
Aug = [Diff,d]
Aug =
1.0e+05 *
0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.7400
-0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.5600
-0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.1050
-0.0000 -0.0000 -0.0000 0.0000 -0.0000 -0.0000 -0.0000 0.2500
-0.0000 -0.0000 -0.0000 -0.0000 0.0000 -0.0000 -0.0000 0.1750
-0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0000 -0.0000 1.9600
-0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 0.0000 0.0500
MATLAB is not rounding any values. If you look at the top left corner when you display Aug, you will see 1.0e+05 *, which means that every value displayed is the actual value divided by 1e5 (a common scale factor). Since you are concatenating very large values (d) with relatively small values (Diff), the significant digits of the small values don't appear because not enough decimal places are displayed. As a result, they look like 0. This is an artifact of the way the command window displays numbers.
You can change the display format to something else such as "shortg" which is typically used for large data ranges (the default is "short") and you will see that your data is not rounded.
format shortg
[Diff, d]
0.8412 -0.0064 -0.0025 -0.3404 -0.0014 -0.0083 -0.1594 74000
-0.0057 0.7355 -0.0436 -0.0099 -0.0083 -0.0201 -0.3413 56000
-0.0264 -0.1506 0.6443 -0.0139 -0.0142 -0.007 -0.0236 10500
-0.3299 -0.0565 -0.0495 0.6364 -0.0204 -0.0483 -0.0649 25000
-0.0089 -0.0081 -0.0333 -0.0295 0.6588 -0.0237 -0.002 17500
-0.119 -0.0901 -0.0996 -0.126 -0.1722 0.7632 -0.3369 1.96e+05
-0.0063 -0.0126 -0.0196 -0.0098 -0.0064 -0.0132 0.9988 5000
In general, you should rarely rely on the MATLAB command window output for much. If you think your data is being rounded, then you would actually want to test this explicitly.
data = [Diff, d];
isequal(Diff, data(:,1:end-1))
1
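The same point can be checked on a pair of values. This is a small sketch using one entry of Diff and one entry of d from the question; the variable name M is only for illustration. The display format changes what the command window prints, while fprintf shows that the stored values are unchanged.

```matlab
% One small value from Diff and one large value from d.
M = [0.8412, 74000];

format short
disp(M)                   % default display: a common scale factor is used
format short g
disp(M)                   % each entry is shown with its own scaling

% Printing with explicit formats shows the stored values directly.
fprintf('%.4f  %.0f\n', M(1), M(2));

format short              % restore the default display
```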

spectral clustering

First off I must say that I'm new to matlab (and to this site...) , so please excuse my ignorance.
I'm trying to write a function in matlab that will use Spectral Clustering to split a set of points into two clusters.
my code is as follows
function Groups = TrySpectralClustering(data)
dist_mat = squareform(pdist(data));
W= zeros(length(data),length(data));
for i=1:length(data),
for j=(i+1):length(data),
W(i,j)=10^(-dist_mat(i,j));
W(j,i)=W(i,j);
end
end
D = zeros(length(data),length(data));
for i=1:length(W),
D(i,i)=sum(W(i,:));
end
L=D-W;
L=D^(-0.5)*L*D^(-0.5);
[ V E ] = eig(L);
disp ('V:');
disp (V);
If I understand correctly, then by using the second smallest eigenvector I should be able to perform a partition of the data into two clusters - If the ith member of the 2nd eigenvector is positive, the ith data point would be in the one cluster, otherwise it would be in the other cluster.
However, when I try the following
f=[1,1;0,0;1,0;0,1;100,100;100,101;101,101;101,100]
TrySpectralClustering(f)
I would expect that the first four points would form one cluster, and the last four would form another.
However, I receive
V:
-0.0000 -0.5000 0.0000 -0.5777 0.0000 0.4078 -0.0000 0.5000
-0.0000 -0.5000 0.0000 0.5777 0.0000 -0.4078 -0.0000 0.5000
-0.0000 -0.5000 0.0000 0.4078 0.0000 0.5777 -0.0000 -0.5000
-0.0000 -0.5000 0.0000 -0.4078 0.0000 -0.5777 -0.0000 -0.5000
-0.5000 -0.0000 -0.0000 -0.0000 -0.7071 -0.0000 0.5000 -0.0000
-0.5000 -0.0000 0.7071 0.0000 -0.0000 -0.0000 -0.5000 -0.0000
-0.5000 0.0000 -0.0000 0.0000 0.7071 0.0000 0.5000 0.0000
-0.5000 0 -0.7071 0 0 0 -0.5000 0
Taking the 2nd eigenvector
-0.0000 -0.5000 0.0000 0.5777 0.0000 -0.4078 -0.0000 0.5000
I find the one cluster includes the points 1,0;0,1;100,100;101,100
and the other cluster is made from the points 1,1;0,0;100,101;101,101
I wonder what am I doing wrong.
Note: I am working on the above as a part of a homework project.
Thanks in advance!
What you are getting is correct. Let U be the matrix containing the eigenvectors as shown above, arranged so that the 1st column corresponds to the smallest eigenvalue and subsequent columns correspond to eigenvalues in ascending order. Then take a subset of the columns of U by retaining the eigenvectors corresponding to the smaller eigenvalues. Now read these columns row-wise into a new set of vectors, call it Y, so that row i of Y represents data point i. Cluster Y to get the spectral clusters. So, let us assume our subset is only the first column. We clearly see that if you were to cluster the first column, you would get the first 4 points in one cluster and the next 4 in another, which is what you want.
Take a look at the implementation on Prof. J. Shi's webpage. Pay close attention to discretisation.m function.
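The steps described above can be sketched as follows for the data in the question. This is not Prof. Shi's implementation. The final grouping step (comparing each normalized row of Y with the first row) is a simplification assumed here for two clearly separated groups; in general one would run k-means or a similar method on the rows of Y.

```matlab
% Data from the question: two well-separated groups of four points.
data = [1 1; 0 0; 1 0; 0 1; 100 100; 100 101; 101 101; 101 100];
n = size(data, 1);

% Pairwise Euclidean distances without toolbox functions.
sq = sum(data.^2, 2);
dist_mat = sqrt(max(bsxfun(@plus, sq, sq.') - 2 * (data * data.'), 0));

% Affinity matrix as in the question, with no self-loops.
W = 10 .^ (-dist_mat);
W(1:n+1:end) = 0;

% Symmetric normalized Laplacian L = I - D^(-1/2) * W * D^(-1/2).
d_half = 1 ./ sqrt(sum(W, 2));
L = eye(n) - bsxfun(@times, bsxfun(@times, W, d_half), d_half.');
L = (L + L.') / 2;              % remove round-off asymmetry before eig

% Sort the eigenpairs by ascending eigenvalue and keep the k smallest.
k = 2;
[V, E] = eig(L);
[~, order] = sort(diag(E));
Y = V(:, order(1:k));           % row i of Y represents data point i

% Normalize the rows of Y, then group them by similarity to row 1.
% This shortcut only suits k = 2 (assumption, see the note above).
Y = bsxfun(@rdivide, Y, sqrt(sum(Y.^2, 2)));
Groups = 1 + (abs(Y * Y(1, :).') < 0.5);
disp(Groups.')
```

Sorting by diag(E) matters because the order in which eig returns the eigenvalues should not be relied on. For this data the two smallest eigenvalues are both numerically zero, so a single "second eigenvector" is not well defined, which is one reason to cluster the rows of several eigenvectors rather than read signs off one column.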
Moreover, your code is very inefficient. You need to take more advantage of Matlab's vectorization:
n = size( dist_mat, 1 ); % number of data points
W = 10.^( - dist_mat ); % one line replaces the nested loop for computing W
W( 1:n+1:end ) = 0; % zero the diagonal, as the original loop leaves it
% computing the symmetric laplacian
d = sum( W, 2 ); % sum each row
d( d == 0 ) = 1; % avoid division by zero
d_half = 1./sqrt( d );
L = eye( n ) - bsxfun( @times, bsxfun( @times, W, d_half' ), d_half );
Two observations:
L=D-W; L=D^(-0.5)*L*D^(-0.5);
Why compute L = D - W first and then normalize it? D^(-0.5) * D * D^(-0.5) is just the identity matrix, so you can use eye(n) and subtract D^(-0.5) * W * D^(-0.5) from it to get the normalized Laplacian L.
eig returns the eigenvectors as columns, so why do you take a row? Did you check the corresponding eigenvalues in E, so you can be sure you are looking at the eigenvector that belongs to the 2nd smallest eigenvalue?

Powers Table MATLAB

For this question, I'm supposed to create a NxN powers table in matlab using arrays.
The code I have so far is as follows:
C = [];
D = [];
N = input('Enter the value you would like to use for your NxN Powers Table: ');
for i = 1:N
for j = 1:N
C = [C;i^j];
end
C = transpose(C);
D = [D;C];
C = [];
end
D
This code works perfectly fine for any numbers from 1-9, as soon as I enter anything greater than that, it prints out weird values.
Here is the output I have using 5 as an input, and the second one is using 10 as an input.
Enter the value you would like to use for your NxN Powers Table: 5
D =
1 1 1 1 1
2 4 8 16 32
3 9 27 81 243
4 16 64 256 1024
5 25 125 625 3125
Enter the value you would like to use for your NxN Powers Table: 10
D =
1.0e+010 *
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0010
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0010 0.0060
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0006 0.0040 0.0282
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0017 0.0134 0.1074
0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0043 0.0387 0.3487
0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0010 0.0100 0.1000 1.0000
Any ideas what could be wrong with my code? It seems like a simple fix, but I just can't figure out what's wrong with it. Any help is greatly appreciated. Thanks
Notice the 1.0e+010 *. It means that the numbers should be multiplied by 10000000000. Five digits are not enough to print it. Insert format long or format short g to see the whole numbers.
I think your code works fine. Note that 10^10 = 1e10; the very last element in your output D is indeed 1e10. Check individual elements D(i,j) to verify that those are correct. MATLAB can't display all the elements because some elements are so much larger than other ones; 1e10 has 10 digits in it, for instance, while 1^1 = 1 has 1 digit. So spacing would get screwed up if this behavior didn't happen.
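A quick way to check individual elements, and to print the whole table without a common scale factor, is sketched below. It builds the same table with meshgrid instead of the loops in the question; the names I, J and fmt are only for illustration.

```matlab
N = 10;
[J, I] = meshgrid(1:N, 1:N);    % I(i,j) = i (row index), J(i,j) = j (column index)
D = I .^ J;                     % D(i,j) = i^j, same layout as in the question

% Individual entries are stored exactly, whatever the matrix display shows.
fprintf('D(2,3)   = %d\n', D(2, 3));
fprintf('D(10,10) = %d\n', D(10, 10));

% Print the full table as integers, one row of D per line.
fmt = [repmat('%12d', 1, N), '\n'];
fprintf(fmt, D.');              % transpose: fprintf consumes values column by column
```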

Matrix creation MATLAB

I am building a nxn matrix in matlab with the following code:
x = linspace(a,b,n);
for i=1:n
for j=1:n
A(i,j) = x(j)^(i-1);
end
A
i
b(i) = (1/i)*x(n)^i - (1/i)*x(1)^i;
end
I am testing it with a=1 b=10 and n=10. I get the expected results up to i=8
i =
8
A =
Columns 1 through 7
1 1 1 1 1 1 1
1 2 3 4 5 6 7
1 4 9 16 25 36 49
1 8 27 64 125 216 343
1 16 81 256 625 1296 2401
1 32 243 1024 3125 7776 16807
1 64 729 4096 15625 46656 117649
1 128 2187 16384 78125 279936 823543
1 256 6561 65536 390625 1679616 5764801
Columns 8 through 10
1 1 1
8 9 10
64 81 100
512 729 1000
4096 6561 10000
32768 59049 100000
262144 531441 1000000
2097152 4782969 10000000
16777216 43046721 100000000
however from i=9 on it becomes this:
i =
9
A =
1.0e+09 *
Columns 1 through 9
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0005
0.0000 0.0000 0.0000 0.0000 0.0001 0.0003 0.0008 0.0021 0.0048
0.0000 0.0000 0.0000 0.0001 0.0004 0.0017 0.0058 0.0168 0.0430
0.0000 0.0000 0.0000 0.0003 0.0020 0.0101 0.0404 0.1342 0.3874
Column 10
0.0000
0.0000
0.0000
0.0000
0.0000
0.0001
0.0010
0.0100
0.1000
1.0000
Can someone please tell me what is happening? I am not very experienced in matlab (I mostly use c++ or python) and so far can't seem to figure it out myself.
It's just a formatting issue for larger numbers. Try
sprintf('%20.0f', A(end,end))
and you will see that the number is correct. At least up to some point, where you will run into double representation problems...
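The point where double precision stops representing every integer exactly is flintmax, which equals 2^53. A short sketch using the largest entry of the matrix in the question (10^9, called A_end here for illustration):

```matlab
A_end = 10^9;                       % largest entry of the 10-by-10 matrix
fprintf('%20.0f\n', A_end);         % every digit is printed, no scale factor

% Doubles represent every integer exactly only up to flintmax = 2^53.
fprintf('%20.0f\n', flintmax);
disp(flintmax + 1 == flintmax)      % logical 1: 2^53 + 1 rounds back to 2^53
disp(1e15 + 1 == 1e15)              % logical 0: below flintmax, so still exact
```

The entries in the question are far below that limit, so they are stored exactly and only the display is affected.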
This happens because a common scale factor is applied to the displayed data. Look at your output:
A =
1.0e+09 *
A common factor of 10^9 was factored out of every entry in your matrix.
You may want to adjust your output display using:
format short g