Need to generate a cluster of points in k-dimensional space in MATLAB - matlab

The points generated should be something like this-
21 32 34 54 76 34
23 55 67 45 75 23.322
54 23 45 76 85.1 32
the above example is when k=6.
How can I generate such a cluster of say around 1000 points and vary the value of k and the radius of the cluster.
Is there any built-in function that can do this for me? I can use any other tool if needed.
Any help would be appreciated.

Have a look at ELKI. It comes with a quite flexible data generator for clustering datasets, and there is a 640d subspace clustering example somewhere on the wiki.
Consider using d for the dimensionality, as when you are talking about clusters k usually refers to the number of clusters (think of k-means ...)

I think you would need to write your own code for this. Supposing your center is at the origin, you have to pick k numbers, in sequence, with the constraint at every step that the sum of the squares of all the numbers upto (and including) it must not exceed the radius of the hypersphere squared. That is, the k th number squared must be less than or equal to the radius squared minus the sum of the squares of all previously picked numbers.

If you have the stats toolbox this is easy
http://www.mathworks.co.uk/help/toolbox/stats/kmeans.html
Otherwise, you can quite easily write the code yourself using Lloyds algorithm.

Related

Writing (and using) principal component analysis in matlab

I (hope to) obtain a matrix with data on different characteristics on rat calls (in the ultrasound). Variables include starting frequency, ending frequency, duration etc etc. The observations will include all the rat calls in my audio recording.
I want to use PCA to analyze my data, hopefully decorrelating any principal components that are not important to the structure of these calls and how they work, allowing me to group the calls up.
My problem is that while I have a basic understanding of how PCA works, I don't have an understanding of the finer points including how to implement this in Matlab.
I know you should standardise my data. All methods I have seen involve means adjusting by subtracting the mean. However some others also divide by the standard deviation or divide the transpose of the means adjusted data by the square root of N-1 (N being the number of variables).
I know with the standardised data, you can then find the covariance matrix, and extract the eigen values and vectors such as with using eig(cov(...)). some others use svd(...) instead. I still don't understand what this is and why it is important
I know there are different ways to implement PCA, but I don't like how I get different results for all of them.
There is even a pca(...) command also.
While reconstructing the data, some people multiply the means adjust data with the principal component data, others do the same but with the transpose of the principal component data
I just want to be able to analyse my data by plotting graphs of the principal components, and of the data (with the most insignificant principal components removed). I want to know about the variances of these eigen vectors and how much they represent the total variance of the data. I want to be able to fully exploit all the information PCA can allow me to get out
can anyone help?
=========================================================
This code seems to work based on pg 20 of http://people.maths.ox.ac.uk/richardsonm/SignalProcPCA.pdf
X = [105 103 103 66; 245 227 242 267; 685 803 750 586;...
147 160 122 93; 193 235 184 209; 156 175 147 139;...
720 874 566 1033; 253 265 171 143; 488 570 418 355;...
198 203 220 187; 360 365 337 334; 1102 1137 957 674;...
1472 1582 1462 1494; 57 73 53 47; 1374 1256 1572 1506;...
375 475 458 135; 54 64 62 41];
[M,N] = size(X);
mn = mean(X,2);
data = X - repmat(mn,1,N);
Y = data' / sqrt(N-1);
[~,S,PC] = svd(Y);
S = diag(S);
V = S .* S;
signals = PC' * data;
%plotting single PC1 on its own
figure;
plot(signals(1,:),zeros(1,size(signals,2)),'b.','markersize',15)
xlabel('PC1')
title('plotting single PC1 on its own')
%plotting PC1 against PC2
figure;
plot(signals(1,:),signals(2,:),'b.','markersize',15)
xlabel('PC1'),ylabel('PC2')
title('plotting PC1 against PC2')
figure;
plot(PC(:,1),PC(:,2),'m.','markersize',15)
xlabel('effect(PC1)'),ylabel('effect(PC2)')
but where is the standard deviation? how is the result different to
B=zscore(X);
[PC, D] = eig(cov(B));
D = diag(D);
cumsum(flipud(D)) / sum(D)
PC*B %(note how this says PC whereas above it says PC')
If the principal components are represented as columns, then I can remove the most insignificant eigen vectors by finding the smallest eigenvalue and setting its corresponding eigen vector column to a column of zeros.
How can either of these methods above be applied by using the pca(...) command and achieve THE SAME result? can anyone help explain this to me (and ideally show me how all of these can achieve the same results)?

How to compute & plot Equal Error Rate (EER) from FAR/FRR values using matlab

I have the following values against FAR/FRR. i want to compute EER rates and then plot in matlab.
FAR FRR
19.64 20
21.29 18.61
24.92 17.08
19.14 20.28
17.99 21.39
16.83 23.47
15.35 26.39
13.20 29.17
7.92 42.92
3.96 60.56
1.82 84.31
1.65 98.33
26.07 16.39
29.04 13.13
34.49 9.31
40.76 6.81
50.33 5.42
66.83 1.67
82.51 0.28
Is there any matlab function available to do this. can somebody explain this to me. Thanks.
Let me try to answer your question
1) For your data EER can be the mean/max/min of [19.64,20]
1.1) The idea of EER is try to measure the system performance against another system (the lower the better) by finding the equal(if not equal then at least nearly equal or have the min distance) between False Alarm Rate (FAR) and False Reject Rate (FRR, or missing rate) .
Refer to your data, [19.64,20] gives min distance, thus it could used as EER, you can take mean/max/min value of these two value, however since it means to compare between systems, thus make sure other system use the same method(mean/max/min) to pick EER value.
The difference among mean/max/min can be ignored if the there are large amount of data. In some speaker verification task, there will be 100k data sample.
2) To understand EER ,better compute it by yourself, here is how:
two things you need to know:
A) The system score for each test case (trial)
B) The true/false for each trial
After you have A and B, then you can create [trial, score,true/false] pairs then sort it by the score value, after that loop through the score, eg from min-> max. At each loop assume threshold is that score and compute the FAR,FRR. After loop through the score find the FAR,FRR with "equal" value.
For the code you can refer to my pyeer.py , in function processDataTable2
https://github.com/StevenLOL/Research_speech_speaker_verification_nist_sre2010/blob/master/SRE2010/sid/pyeer.py
This function is written for the NIST SRE 2010 evaluation.
4) There are other measures similar to EER, such as minDCF which only play with the weights of FAR and FRR. You can refer to "Performance Measure" of http://www.nist.gov/itl/iad/mig/sre10results.cfm
5) You can also refer to this package https://sites.google.com/site/bosaristoolkit/ and DETware_v2.1.tar.gz at http://www.itl.nist.gov/iad/mig/tools/ for computing and plotting EER in Matlab
Plotting in DETWare_v2.1
Pmiss=1:50;Pfa=50:-1:1;
Plot_DET(Pmiss/100.0,Pfa/100.0,'r')
FAR(t) and FRR(t) are parameterized by threshold, t. They are cumulative distributions, so they should be monotonic in t. Your data is not shown to be monotonic, so if it is indeed FAR and FRR, then the measurements were not made in order. But for the sake of clarity, we can order:
FAR FRR
1 1.65 98.33
2 1.82 84.31
3 3.96 60.56
4 7.92 42.92
5 13.2 29.17
6 15.35 26.39
7 16.83 23.47
8 17.99 21.39
9 19.14 20.28
10 19.64 20
11 21.29 18.61
12 24.92 17.08
13 26.07 16.39
14 29.04 13.13
15 34.49 9.31
16 40.76 6.81
17 50.33 5.42
18 66.83 1.67
19 82.51 0.28
This is for increasing FAR, which assumes a distance score; if you have a similarity score, then FAR would be sorted in decreasing order.
Loop over FAR until it is larger than FRR, which occurs at row 11. Then interpolate the cross over value between rows 10 and 11. This is your equal error rate.

Definition of standard deviation, matlab

I have a matrix
tst=[20 15 26 32 18 28 35 14 26 22 17]
meantst= mean(tst)=23
stdtst= std(tst)=6.6
Matlab command
s = std(X)
one get standard deviation.
http://www.mathworks.de/de/help/matlab/ref/std.html
How can I get std with 1-sigma(68%), 2 sigma(95%), 3sigma(99%)".
I think you are not actually looking for the standarddeviation, but rather the quantiles of your distribution.
Assuming they are normally distributed, you could go for norminv:
X = norminv(P,mu,sigma)
In your particular examples says that 68 percent of data present between 16.4 to 29.6. If you consider 2Sigma, Then 95% of data present between 9.8 to 35.5. Standard deviation just tells, how much data is deviating from its mean value. If just consider standard deviation then standard deviation gives a range around mean, and in that range 68% of data exist. If we take 2Sigma, we increase that range twice and now 95% of data fall in that range.
Maybe what you want is the percentiles of the distribution?
prctile(tst,68) % or prctile(tst,100-68), depending on which direction you need
prctile(tst,95) % or prctile(tst,100-95)
prctile(tst,99) % or prctile(tst,100-99)
Note that you would need many more samples than your example contains in order to obtain accurate percentile values.

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters
I only need to find similar rows. Hamming distance tell us the similarity between rows and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always finds asked number
of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data while my questions concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear ask, for anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(a, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the hamming distance rather than the euclidian distance (which linkage does by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling rows with the cluster group they belong:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(#minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(#minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
The Hamming distance for binary strings a and b the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, so you could create a 6 by 6 matrix filled with the Hamming distance. The matrix would be symetric (distance from a to b is the same as distance from b to a) and the diagonal is 0 (distance for a to itself is nul).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i=1:6,
for j=1:6,
hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
end
end
(And yes this code is a redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing not worth the effort).
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distance are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).
Presentation
I recommend Kohonen network or similar technique to present your data in, say, 2 dimensions. In general this area is called Dimensionality reduction.
I you can also go simpler way, e.g. Principal Component Analysis, but there's no quarantee you can effectively remove 9998 dimensions :P
scikit-learn is a good Python package to get you started, similar exist in matlab, java, ect. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set though. 6 data points is really a small number. moreover your attributes seem boolean at first glance, if that's the case, manhattan distance if what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if attributes are actually a 1000-bit long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points, you have only 2 ** 6 combinations, that means 936 out of 1000 attributes you have are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test significance of your clusters, run K-means several times with different initial conditions and check if you get same clusters. If you get different clusters every time or even from time to time, you cannot really trust your result.
I used a barcode type visualization for my data. The code which was posted here earlier by Oleg was too heavy for my solution (image files were over 500 kb) so I used image() to make the figures
function barcode(A)
B = (A+1)*2;
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out'
end

element wise operation - MATLAB

I have a matrix in MATLAB, lets say:
a = [
89 79 96
72 51 74
94 88 87
69 47 78
]
I want to subtract from each element the average of its column and divide by the column's standard deviation. How can I do it in a way which could be implemented to any other matrix without using loops.
thanks
If your version supports bsxfun (which is probably the case unless you have very old matlab version), you should use it, it's much faster than repmat, and consumes much less memory.
You can just do: result = bsxfun(#rdivide,bsxfun(#minus,a,mean(a)),std(a))
You can use repmat to make your average/std vector the same size as your original matrix, then use direct computation like so:
[rows, cols] = size(a); %#to get the number of rows
avgc= repmat(avg(a),[rows 1]); %# average by column, vertically replicated by number of rows
stdc= repmat(std(a),[rows 1]); %# std by column, vertically replicated by number of rows
%# Here, a, avgc and stdc are the same size
result= (a - avgc) ./ stdc;
Edit:
Judging from a mathworks blog post,bsxfun solution is faster and consumes less memory (see acai answer). For moderate size matrices, I personally prefer repmat that makes code easier to read and debug (for me).
You could also use the ZSCORE function from the Statistics Toolbox:
result = zscore(a)
In fact, it calls BSXFUN underneath, but it is careful not to divide by a zero standard deviation (you can look at the source code yourself: edit zscore)