I am trying to find patterns in a dataset (~1000 series) containing time series data with yearly frequency. Some sample data:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1 1.0000 0.6154 0.0000 0.0769 0.0000 0.0000 0.0000 0.2308 0.6923 0.6923 0.6923 0.6923 0.6923 0.3846 0.3846 0.0769 0.0769 0.0769
2 1.0000 0.8354 0.5274 0.4451 0.4604 0.4634 0.4543 0.2195 0.0976 0.1159 0.0793 0.0000 0.0152 0.0305 0.0305 0.0335 0.0915 0.0152
3 0.9524 0.8571 0.2381 0.1429 0.6667 1.0000 1.0000 0.1905 0.4286 0.3810 0.3810 0.5714 0.0952 0.1905 0.0000 0.0000 0.0952 0.8571
4 0.9200 1.0000 0.6000 0.4000 0.0000 0.4200 0.3600 0.4400 0.4200 0.3200 0.4800 0.6400 0.5200 0.5200 0.5200 0.5400 0.4800 0.7800
5 0.8372 1.0000 0.7209 0.7907 0.6279 0.6047 0.6047 0.6279 0.5349 0.4419 0.4419 0.2791 0.4419 0.2326 0.1860 0.1860 0.1860 0.0000
6 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.6154 0.6154 0.6154 0.6154 1.0000
Note that the data is normalized because I want to cluster the time series by similar shape. I thought a cluster analysis would be appropriate, so I tried to cluster the time series with the following call:
a <- factoextra::eclust(Normalized_df, FUNcluster = "kmeans", nstart = 25, k.max = 5)
However, a couple of observations have a negative silhouette width. Is there a way to correct these assignments? For example, if sil_width is negative, move the observation to its neighbor cluster. An example can be found below.
cluster neighbor sil_width
1 1 3 -0.001258464
2 1 3 -0.004661913
3 1 4 -0.010083277
4 1 4 -0.012569472
5 1 3 -0.012793575
6 1 4 -0.013089868
7 1 5 -0.013346165
The motivation is to correct for these observations, in order to increase the average silhouette width for the clusters.
Any help would be much appreciated!
Moving points with a negative silhouette to another cluster would likely decrease the silhouette of other points in that cluster. It's not obvious how to further improve the results: a) the best solution may contain negative silhouette values, and b) it might be impossible to find a solution with only positive values. Last but not least, c) it will not be a k-means clustering result anymore, because some points will no longer be assigned to the closest mean.
The core reason is that the scores within each cluster are coupled. A point's silhouette depends on its distances to every other member of its own cluster and of its neighbor cluster, so moving one point to another cluster changes all of their scores.
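To make the coupling concrete, here is a small sketch, written in MATLAB rather than R and using seven made-up 1-D points (the data, labels and variable names are assumptions, not taken from the question). It computes silhouette widths by hand before and after moving the one negative-silhouette point to its neighbor cluster.

```matlab
% Toy 1-D data (made up for illustration): two groups plus one point in between.
x = [0 1 2 4.4 6 7 8]';
before = [1 1 1 1 2 2 2]';   % point 4 sits in cluster 1
after  = [1 1 1 2 2 2 2]';   % point 4 reassigned to its neighbor cluster
labelings = [before after];

n = numel(x);
D = abs(bsxfun(@minus, x, x'));   % pairwise distances
S = zeros(n, 2);                  % silhouette widths, one column per labeling

for c = 1:2
    L = labelings(:, c);
    for i = 1:n
        own = (L == L(i));
        own(i) = false;                 % exclude the point itself
        a = mean(D(i, own));            % mean distance within own cluster
        b = inf;                        % mean distance to the nearest other cluster
        for k = unique(L)'
            if k ~= L(i)
                b = min(b, mean(D(i, L == k)));
            end
        end
        S(i, c) = (b - a) / max(a, b);
    end
end

disp([S, S(:,2) - S(:,1)]);   % columns: before, after, change
```

In this toy case the moved point (x = 4.4) goes from a negative to a positive width, but every other point's width changes as well, and the points at x = 7 and x = 8 end up lower than before.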
Is it possible to vectorize, and possibly run on a GPU, the following code?
x = linspace(0,100,1000);
h = zeros(size(x));
for i = 1 : length(x)
exprho = expm(-x(i)*rho);
h(i) = trace(drho*exprho*drho*exprho);
end
out = 2 * trapz(x,h);
where rho and drho are two complex Hermitian square matrices of the same size. rho is in fact a quantum density matrix and drho is its derivative with respect to a parameter.
The size can range from 10 x 10 to 300 x 300 approximately but I would also like to reach bigger sizes.
Here are two sample matrices:
rho =
0.4046 0.3849 0.2589 0.1422 0.0676 0.0288 0.0112 0.0040 0.0014 0.0004 0.0001
0.3849 0.3661 0.2462 0.1352 0.0643 0.0274 0.0106 0.0038 0.0013 0.0004 0.0001
0.2589 0.2462 0.1656 0.0910 0.0433 0.0184 0.0071 0.0026 0.0009 0.0003 0.0001
0.1422 0.1352 0.0910 0.0500 0.0238 0.0101 0.0039 0.0014 0.0005 0.0002 0.0000
0.0676 0.0643 0.0433 0.0238 0.0113 0.0048 0.0019 0.0007 0.0002 0.0001 0.0000
0.0288 0.0274 0.0184 0.0101 0.0048 0.0020 0.0008 0.0003 0.0001 0.0000 0.0000
0.0112 0.0106 0.0071 0.0039 0.0019 0.0008 0.0003 0.0001 0.0000 0.0000 0.0000
0.0040 0.0038 0.0026 0.0014 0.0007 0.0003 0.0001 0.0000 0.0000 0.0000 0.0000
0.0014 0.0013 0.0009 0.0005 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000
0.0004 0.0004 0.0003 0.0002 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0001 0.0001 0.0001 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
drho =
0.0366 0.0156 -0.0025 -0.0085 -0.0074 -0.0046 -0.0023 -0.0010 -0.0004 -0.0002 -0.0001
0.0156 -0.0035 -0.0147 -0.0148 -0.0103 -0.0057 -0.0028 -0.0012 -0.0005 -0.0002 -0.0001
-0.0025 -0.0147 -0.0181 -0.0145 -0.0091 -0.0048 -0.0022 -0.0009 -0.0004 -0.0001 -0.0000
-0.0085 -0.0148 -0.0145 -0.0105 -0.0062 -0.0031 -0.0014 -0.0006 -0.0002 -0.0001 -0.0000
-0.0074 -0.0103 -0.0091 -0.0062 -0.0035 -0.0017 -0.0008 -0.0003 -0.0001 -0.0000 -0.0000
-0.0046 -0.0057 -0.0048 -0.0031 -0.0017 -0.0008 -0.0004 -0.0001 -0.0001 -0.0000 -0.0000
-0.0023 -0.0028 -0.0022 -0.0014 -0.0008 -0.0004 -0.0002 -0.0001 -0.0000 -0.0000 -0.0000
-0.0010 -0.0012 -0.0009 -0.0006 -0.0003 -0.0001 -0.0001 -0.0000 -0.0000 -0.0000 -0.0000
-0.0004 -0.0005 -0.0004 -0.0002 -0.0001 -0.0001 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0002 -0.0002 -0.0001 -0.0001 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000
-0.0001 -0.0001 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000 -0.0000
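One possible direction, offered as a sketch rather than a tested answer: because rho is Hermitian, it can be diagonalised once as rho = V*diag(lam)*V', which gives expm(-x*rho) = V*diag(exp(-x*lam))*V'. With A = V'*drho*V, the trace becomes h(x) = sum over i,j of |A_ij|^2 * exp(-x*(lam_i + lam_j)), so expm is no longer needed inside the loop. The random test matrices and the names A, w, s below are assumptions made for illustration.

```matlab
% Small random test matrices (made up): rho Hermitian PSD with unit trace, drho Hermitian.
rng(0);
n = 6;
B = randn(n) + 1i*randn(n);
rho = B*B';  rho = rho / trace(rho);
rho = (rho + rho')/2;                 % enforce exact Hermitian symmetry
C = randn(n) + 1i*randn(n);
drho = (C + C')/2;

x = linspace(0, 100, 1000).';         % column vector of integration nodes

% Diagonalise once instead of calling expm for every x.
[V, Lam] = eig(rho);
lam = real(diag(Lam));
A = V' * drho * V;                    % drho expressed in the eigenbasis of rho
w = abs(A).^2;                        % |A_ij|^2
s = bsxfun(@plus, lam, lam.');        % lam_i + lam_j

% h(x) = sum_ij |A_ij|^2 * exp(-x*(lam_i + lam_j)), evaluated for all x at once.
h = exp(-x * s(:).') * w(:);
out = 2 * trapz(x, h);

% Reference: the original loop, for comparison only.
h_ref = zeros(size(x));
for i = 1:numel(x)
    E = expm(-x(i) * rho);
    h_ref(i) = real(trace(drho*E*drho*E));
end
fprintf('max abs difference: %g\n', max(abs(h - h_ref)));
```

The intermediate matrix exp(-x * s(:).') has numel(x)*n^2 entries, which for n = 300 and 1000 nodes is roughly 720 MB of doubles, so larger sizes would need x processed in chunks. Whether this maps well onto a GPU has not been checked here.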
I'm trying to get two matrices to divide, properly, element by element.
Essentially, firstd is 6x499 and secd is 6x498. I first eliminate firstd's extra column with firstd(:,499)=[]; making it 6x498. The next step is to transform firstd into the numerator, nom=((firstd.^2)+1).^1.5; my denominator is just denom=secd;
Both nom and denom have come out as 6x498 matrices with real, non-zero data for each element. However, when doing Rlayer=nom./denom, Rlayer comes out as this ludicrous 6x498 zero-ridden matrix.
I also replaced the elements of denom that were equal to 0 with 0.0001.
Segment of result for Rlayer (Columns 493 through 498)
-0.0000 0.0000 -0.0000 0.0000 -0.0000 0.0000
-0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000
-0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000
-0.0000 0.0000 -0.0000 0.0000 -0.0000 0.0000
-0.0000 0.0000 -0.0000 0.0000 -0.0000 0.0000
-0.0000 0.0000 -0.0000 0.0000 0.0000 -0.0000
Below are two segments of denom (Columns 487 through 492)
0.0250 0.0281 -0.0281 0.0125 -0.0500 0.0969
-0.0125 0.0750 -0.1219 0.1094 -0.0938 0.0937
0.0344 0.0406 -0.1094 0.1187 -0.1344 0.1531
0.0001 0.0250 0.0001 -0.0437 0.0500 0.0062
0.0781 -0.0219 0.0094 -0.0125 -0.0188 0.1062
0.0250 0.0438 -0.0812 0.0937 -0.1063 0.1562
(Columns 493 through 498)
-0.1187 0.1156 -0.0844 0.0688 -0.0406 0.0125
-0.0969 0.1094 -0.0906 0.0469 0.0062 -0.0156
-0.1375 0.1719 -0.1656 0.0781 0.0187 -0.0531
-0.0562 0.1188 -0.1500 0.1438 -0.1187 0.1187
-0.1781 0.2281 -0.2156 0.1750 -0.1250 0.0812
-0.1750 0.1938 -0.1469 0.0563 0.0031 -0.0156
and this is a segment of nom (Columns 493 through 498)
1.0904 1.0235 1.0881 1.0368 1.0769 1.0514
1.0685 1.0201 1.0769 1.0272 1.0497 1.0532
1.0928 1.0180 1.1210 1.0201 1.0568 1.0685
1.0568 1.0285 1.1001 1.0170 1.0952 1.0260
1.0952 1.0078 1.1380 1.0107 1.1026 1.0272
1.0928 1.0078 1.1077 1.0212 1.0463 1.0480
Why is this division leading to this result? I've tried dividing with rdivide, in a double for loop, and row by row in a for loop. All number types are double.
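One thing worth ruling out, stated here as an assumption rather than a diagnosis: the 0.0000 entries may not be actual zeros. When one element of a matrix is far larger than the rest, MATLAB's default short format factors out a common scale (a line such as 1.0e+12 * printed above the first block of columns), and every much smaller element rounds to 0.0000 on screen. The sketch below uses made-up stand-ins for nom and denom, including one tiny denominator, and inspects the result in ways that do not depend on the display format.

```matlab
% Made-up stand-ins for nom and denom: ordinary values plus one tiny denominator.
nom   = [1.0904 1.0235 1.0881; 1.0685 1.0201 1.0769];
denom = [-0.1187 0.1156 -0.0844; -0.0969 1e-12 -0.0906];
Rlayer = nom ./ denom;

% These checks do not depend on how the matrix is displayed.
fprintf('exact zeros in Rlayer : %d\n', nnz(Rlayer == 0));
fprintf('largest |Rlayer|      : %g\n', max(abs(Rlayer(:))));
fprintf('smallest |denom|      : %g\n', min(abs(denom(:))));
fprintf('Rlayer(1,1)           : %.6f\n', Rlayer(1,1));
```

If the count of exact zeros is 0 and the largest magnitude is many orders above the typical quotient, the division itself is fine and the cause is a handful of very small values in denom.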
First off I must say that I'm new to MATLAB (and to this site...), so please excuse my ignorance.
I'm trying to write a function in MATLAB that will use Spectral Clustering to split a set of points into two clusters.
My code is as follows:
function Groups = TrySpectralClustering(data)
dist_mat = squareform(pdist(data));
W= zeros(length(data),length(data));
for i=1:length(data),
for j=(i+1):length(data),
W(i,j)=10^(-dist_mat(i,j));
W(j,i)=W(i,j);
end
end
D = zeros(length(data),length(data));
for i=1:length(W),
D(i,i)=sum(W(i,:));
end
L=D-W;
L=D^(-0.5)*L*D^(-0.5);
[ V E ] = eig(L);
disp ('V:');
disp (V);
If I understand correctly, then by using the eigenvector of the second smallest eigenvalue I should be able to partition the data into two clusters: if the ith member of that eigenvector is positive, the ith data point is in one cluster, otherwise it is in the other cluster.
However, when I try the following
f=[1,1;0,0;1,0;0,1;100,100;100,101;101,101;101,100]
TrySpectralClustering(f)
I would expect that the first four points would form one cluster, and the last four would form another.
However, I receive
V:
-0.0000 -0.5000 0.0000 -0.5777 0.0000 0.4078 -0.0000 0.5000
-0.0000 -0.5000 0.0000 0.5777 0.0000 -0.4078 -0.0000 0.5000
-0.0000 -0.5000 0.0000 0.4078 0.0000 0.5777 -0.0000 -0.5000
-0.0000 -0.5000 0.0000 -0.4078 0.0000 -0.5777 -0.0000 -0.5000
-0.5000 -0.0000 -0.0000 -0.0000 -0.7071 -0.0000 0.5000 -0.0000
-0.5000 -0.0000 0.7071 0.0000 -0.0000 -0.0000 -0.5000 -0.0000
-0.5000 0.0000 -0.0000 0.0000 0.7071 0.0000 0.5000 0.0000
-0.5000 0 -0.7071 0 0 0 -0.5000 0
Taking the 2nd eigenvector
-0.0000 -0.5000 0.0000 0.5777 0.0000 -0.4078 -0.0000 0.5000
I find that one cluster includes the points 1,0;0,1;100,100;101,100
and the other cluster is made from the points 1,1;0,0;100,101;101,101
I wonder what I am doing wrong.
Note: I am working on the above as a part of a homework project.
Thanks in advance!
What you are getting is correct. Let U be the matrix containing the eigenvectors shown above, arranged so that the 1st column corresponds to the smallest eigenvalue and subsequent columns correspond to ascending eigenvalues. Then take a subset of the columns of U by retaining the eigenvectors corresponding to the smaller eigenvalues. Now read these columns row-wise into a new set of vectors, call it Y, so that row i of Y represents data point i. Cluster Y to get the spectral clusters. So, let us assume our subset is only the first column. We clearly see that if you were to cluster the first column, you would get the first 4 points in one cluster and the next 4 in another, which is what you want.
Take a look at the implementation on Prof. J. Shi's webpage. Pay close attention to the discretisation.m function.
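The procedure above can be sketched on the question's data as follows. This is a minimal illustration under assumptions: distances are computed by hand so no toolbox is needed, the first two columns are kept, and the final split compares every row of Y with row 1, which is a crude stand-in for k-means that only suits two well-separated groups. It is not the discretisation routine mentioned above.

```matlab
% Data from the question.
f = [1,1; 0,0; 1,0; 0,1; 100,100; 100,101; 101,101; 101,100];
n = size(f, 1);

% Pairwise Euclidean distances without toolbox functions.
sq = sum(f.^2, 2);
dist_mat = sqrt(max(bsxfun(@plus, sq, sq') - 2*(f*f'), 0));

% Affinity and symmetric normalised Laplacian, as in the question.
W = 10.^(-dist_mat);
W(1:n+1:end) = 0;                     % no self-affinity
d_half = 1 ./ sqrt(sum(W, 2));
L = eye(n) - bsxfun(@times, bsxfun(@times, W, d_half'), d_half);
L = (L + L') / 2;                     % remove rounding asymmetry

% Sort eigenpairs by ascending eigenvalue; eigenvectors are the COLUMNS of U.
[U, E] = eig(L);
[vals, order] = sort(diag(E));
U = U(:, order);

% Keep the first two columns; row i of Y is the embedding of data point i.
Y = U(:, 1:2);

% Crude two-way split: rows that resemble row 1 form group 1, the rest group 2.
similarity = Y * Y(1, :)';
Groups = 1 + (similarity < 0.5 * similarity(1));
disp(Groups');
```

Working with the rows of Y, rather than the signs of a single column, keeps the split independent of which basis eig happens to return for the two near-zero eigenvalues. On this data it should place the first four points in one group and the last four in the other.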
Moreover, your code is very inefficient. You need to take more advantage of MATLAB's vectorization:
n = size( dist_mat, 1 ); % number of data points
W = 10.^( - dist_mat ); % one-liner replacing the nested loop for computing W
W( 1:n+1:end ) = 0; % keep the diagonal at zero, as the loop does
% computing the symmetric Laplacian
d = sum( W, 2 ); % sum each row
d( d == 0 ) = 1; % avoid division by zero
d_half = 1./sqrt( d );
L = eye( n ) - bsxfun( @times, bsxfun( @times, W, d_half' ), d_half );
Two observations:
L=D-W; L=D^(-0.5)*L*D^(-0.5);
Why make MATLAB calculate the identity matrix? D^(-0.5) * D * D^(-0.5) is just the identity, so use eye(n) and subtract D^(-0.5) * W * D^(-0.5) from it to calculate the Laplacian L.
eig returns the eigenvectors as columns, so why do you take a row? Did you check the corresponding eigenvalues in E, so you can be sure you are looking at the eigenvector corresponding to the 2nd smallest eigenvalue?
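Both observations can be checked on a small example. The 3x3 affinity matrix below is made up for illustration, and the variable names L1, L2 and v2 are assumptions.

```matlab
% Small symmetric affinity matrix, made up for illustration.
W = [0 2 1; 2 0 3; 1 3 0];
n = size(W, 1);
D = diag(sum(W, 2));

% The two ways of writing the normalised Laplacian agree,
% because D^(-0.5)*D*D^(-0.5) is the identity.
L1 = D^(-0.5) * (D - W) * D^(-0.5);
L2 = eye(n) - D^(-0.5) * W * D^(-0.5);
fprintf('max difference L1 vs L2: %g\n', max(abs(L1(:) - L2(:))));

% eig returns eigenvectors as columns; sort them by eigenvalue before picking one.
L2 = (L2 + L2') / 2;                  % remove rounding asymmetry
[V, E] = eig(L2);
[vals, order] = sort(diag(E));
V = V(:, order);
v2 = V(:, 2);                         % eigenvector of the 2nd smallest eigenvalue

% A column satisfies L*v = lambda*v; a row of V generally does not.
fprintf('residual for column 2: %g\n', norm(L2*v2 - vals(2)*v2));
fprintf('residual for row 2   : %g\n', norm(L2*V(2,:)' - vals(2)*V(2,:)'));
```

Sorting explicitly avoids relying on the order in which eig happens to return the eigenvalues.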
For this question, I'm supposed to create an NxN powers table in MATLAB using arrays.
The code I have so far is as follows:
C = [];
D = [];
N = input('Enter the value you would like to use for your NxN Powers Table: ');
for i = 1:N
for j = 1:N
C = [C;i^j];
end
C = transpose(C);
D = [D;C];
C = [];
end
D
This code works perfectly fine for any number from 1 to 9, but as soon as I enter anything greater than that, it prints out weird values.
Here is the output using 5 as input, followed by the output using 10 as input.
Enter the value you would like to use for your NxN Powers Table: 5
D =
1 1 1 1 1
2 4 8 16 32
3 9 27 81 243
4 16 64 256 1024
5 25 125 625 3125
Enter the value you would like to use for your NxN Powers Table: 10
D =
1.0e+010 *
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0010
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0010 0.0060
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0006 0.0040 0.0282
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0002 0.0017 0.0134 0.1074
0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0005 0.0043 0.0387 0.3487
0.0000 0.0000 0.0000 0.0000 0.0000 0.0001 0.0010 0.0100 0.1000 1.0000
Any ideas what could be wrong with my code? Seems like a simple fix, I just can't figure out what's wrong with it. Any help is greatly appreciated. Thanks
Notice the 1.0e+010 * above the matrix. It means that every displayed number should be multiplied by 10000000000. Five digits are not enough to print the smaller entries at that scale. Use format long or format short g to see the whole numbers.
I think your code works fine. Note that 10^10 = 1e10; the very last element in your output D is indeed 1e10. Check individual elements D(i,j) to verify that those are correct. MATLAB can't display all the elements because some elements are so much larger than other ones; 1e10 has 10 digits in it, for instance, while 1^1 = 1 has 1 digit. So spacing would get screwed up if this behavior didn't happen.
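A short sketch of both suggestions: checking individual elements and changing the display format. The one-line construction with bsxfun is an assumption added for brevity; it builds the same table as the nested loops in the question.

```matlab
% Same table as the nested loops build for N = 10, written in one line.
N = 10;
D = bsxfun(@power, (1:N)', 1:N);      % D(i,j) = i^j

% Individual elements are intact even when the default display looks like zeros.
fprintf('D(2,10)  = %d\n', D(2,10));
fprintf('D(10,10) = %d\n', D(10,10));

% Switch off the common scale factor for matrix display.
format short g
disp(D);
format                                % restore the default display format
```

Only the way the matrix is printed changes; the stored values are the same under every format setting.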