How to merge pairwise table(s) and non-pairwise table(s) in MATLAB?

I am trying to perform a clustering analysis. I have extracted all the possible data and have made several pairwise comparisons. Now I want to know: how do I merge the data?
Table 1:
entry    smth    coefficient
entry1   smth1   1.23
entry2   smth1   2.05
entry3   smth2   0.95
entry2   smth4   1.65
Table 2 (smth vs smth):
        smth1   smth2   smth4
smth1   100     59      35
smth2   59      100     82
smth4   35      82      100
Table 3 (entry vs entry):
        entry1  entry2  entry3
entry1  100     82      75
entry2  82      100     59
entry3  75      59      100
I am trying to understand how to program this. I am new to MATLAB; I am practicing a lot, so there is definitely some progress, but not enough yet to work out how to solve my problem.
UPDATE:
Here is the idea behind Table 2: there is some similarity and some difference between the different smths.
And here is the idea behind Table 3: the entries also have some relational distance to each other.
I also have input cases (Table 1). Each row of the table is a unique input case. However, the real table is long, so some of the cases may actually be the same, just named differently. Now I want to combine pairwise comparison 1, pairwise comparison 2, and, if required, up to pairwise comparison n.
Finally, there are also some non-relative parameters (in Table 1 there is only one, called coefficient), by which I want to move the position of each pairwise-verified point in space. That is, I am introducing a new axis corresponding to coefficient and moving the points along this axis, so that there will be some sort of distribution that I can use in the clustering analysis.
In other words: I have axis/plane 1, corresponding to the pairwise comparisons of the smths, and axis/plane 2, corresponding to the pairwise comparisons of the entries. Then I apply those to the cases; for example, in Table 1, entry2 and smth1 each appear more than once. I know the relational distance of entry2 vs. the other entries, as well as the relational distance of smth1 vs. the other smths. Now I want to combine them over all the cases given in Table 1 and to move each point along axis 3, the non-relational axis corresponding to the parameter coefficient.
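For what it's worth, one possible route in MATLAB (a sketch of one approach, not the definitive answer): treat each pairwise table as a distance matrix (distance = 100 - similarity), embed each with classical multidimensional scaling, then build one feature vector per row of Table 1 with the coefficient appended as the extra axis. cmdscale requires the Statistics Toolbox, and all variable names here are illustrative.
% Similarity matrices from Tables 2 and 3 (percent)
S_smth  = [100 59 35; 59 100 82; 35 82 100];   % smth1, smth2, smth4
S_entry = [100 82 75; 82 100 59; 75 59 100];   % entry1, entry2, entry3
% Convert similarities to distances and embed each set as 2-D points
Y_smth  = cmdscale(100 - S_smth, 2);           % one point per smth
Y_entry = cmdscale(100 - S_entry, 2);          % one point per entry
% Table 1 as index pairs plus the non-relational coefficient
entryIdx = [1; 2; 3; 2];                       % entry1, entry2, entry3, entry2
smthIdx  = [1; 1; 2; 3];                       % smth1, smth1, smth2, smth4
coeff    = [1.23; 2.05; 0.95; 1.65];
% One row per case: entry coordinates, smth coordinates, coefficient axis
X = [Y_entry(entryIdx,:), Y_smth(smthIdx,:), coeff];
% X can now be fed to a clustering routine, e.g. kmeans(X, 2)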

Related

Clustering matrix distance between 3 time series

I have a question about the application of clustering techniques, more concretely k-means.
I have a data frame with 3 sensors (A,B,C):
time      A   B   C
8:00:00    6  10  11
8:30:00   11  17  20
9:00:00   22  22  15
9:30:00   20  22  21
10:00:00  17  26  26
10:30:00  16  45  29
11:00:00  19  43  22
11:30:00  20  32  22
...       ..  ..  ..
And I want to group sensors that have the same behavior.
My question is: looking at the data frame above, should I calculate the correlation between the columns of the data frame and then apply the Euclidean distance to this correlation matrix, thus obtaining a 3x3 matrix of distance values?
Or should I transpose my data frame and then compute the dist() matrix with the Euclidean metric only, so that I get a 3x3 matrix of distance values?
You have just three sensors. That means you'll need three values: d(A,B), d(B,C) and d(A,C). Any "clustering" here does not seem to make sense to me; certainly not k-means. K-means is for points (!) in R^d, for small d.
Choose any form of time series similarity that you like. It could be simply correlation, but also DTW and the like.
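If you do go the correlation route and prefer MATLAB (the rest of this page is MATLAB-centric), here is a minimal sketch; the data values are copied from the table above, and 1 - r is just one common correlation-based distance:
% Readings as an n-by-3 matrix, one column per sensor (A, B, C)
data = [ 6 10 11; 11 17 20; 22 22 15; 20 22 21; ...
        17 26 26; 16 45 29; 19 43 22; 20 32 22];
R = corrcoef(data);   % 3-by-3 correlation matrix between the sensors
D = 1 - R;            % correlation distance: 0 means identical behavior
disp(D)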
Q1: No. Why: the correlation is not needed here.
Q2: No. Why: I'd calculate the distances differently.
For the first row, R's built-in dist() function (which uses Euclidean distance by default)
dist(c(6, 10, 11))
gives you the pairwise distances between the values:
    1  2
2   4
3   5  1
Items 2 and 3 are closest to each other. That's simple.
But there is no single way to calculate the distance between a point and a group of points. There you need a linkage function (min/max/average/...)
What I would do, using R's built-in kmeans() function (see the MATLAB sketch after this list for the same workflow):
- ignore the date column (assuming there are no NA values in any of the A, B, C columns);
- scale the data if necessary (here they all seem to have the same order of magnitude);
- perform a k-means analysis on the A, B, C columns with k = 1...n and evaluate the results;
- perform a final k-means with your suitable choice of k;
- get the cluster assignment for each row;
- put the assignments in a new column to the right of C.
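For MATLAB users, roughly the same workflow might look like the sketch below; zscore and kmeans are in the Statistics Toolbox, and the choice of k = 2 at the end is purely illustrative.
% Readings as an n-by-3 matrix (A, B, C); the date column is already dropped
data = [ 6 10 11; 11 17 20; 22 22 15; 20 22 21; ...
        17 26 26; 16 45 29; 19 43 22; 20 32 22];
Xs = zscore(data);                    % scale the data if necessary
% Try k = 1..4 and inspect the total within-cluster distance
for k = 1:4
    [~, ~, sumd] = kmeans(Xs, k, 'Replicates', 5);
    fprintf('k = %d: total within-cluster distance %.2f\n', k, sum(sumd));
end
% Final run with the chosen k; idx holds the cluster label of each row
idx = kmeans(Xs, 2, 'Replicates', 5);
labeled = [data, idx];                % cluster assignment as a new column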

How to compute & plot Equal Error Rate (EER) from FAR/FRR values using matlab

I have the following values for FAR and FRR. I want to compute the EER and then plot it in MATLAB.
FAR FRR
19.64 20
21.29 18.61
24.92 17.08
19.14 20.28
17.99 21.39
16.83 23.47
15.35 26.39
13.20 29.17
7.92 42.92
3.96 60.56
1.82 84.31
1.65 98.33
26.07 16.39
29.04 13.13
34.49 9.31
40.76 6.81
50.33 5.42
66.83 1.67
82.51 0.28
Is there any MATLAB function available to do this? Can somebody explain this to me? Thanks.
Let me try to answer your question.
1) For your data, the EER can be the mean/max/min of [19.64, 20].
1.1) The idea of the EER is to measure system performance against another system (the lower the better) by finding the point where the False Alarm Rate (FAR) and the False Reject Rate (FRR, or miss rate) are equal, or, if they are never exactly equal, at least nearly equal / at their minimum distance.
Referring to your data, [19.64, 20] gives the minimum distance, so it can be used as the EER; you can take the mean, max or min of these two values. However, since the EER is meant for comparing systems, make sure the other systems use the same method (mean/max/min) to pick their EER value.
The difference among mean/max/min can be ignored if there is a large amount of data. In some speaker verification tasks there are 100k data samples.
2) To understand the EER, it is best to compute it yourself. Here is how.
Two things you need to know:
A) the system score for each test case (trial);
B) the true/false label for each trial.
Once you have A and B, create [trial, score, true/false] tuples and sort them by score. Then loop through the scores, e.g. from min to max; at each step, take that score as the threshold and compute the FAR and FRR. After looping through all the scores, find the FAR and FRR with (approximately) "equal" value.
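A minimal MATLAB sketch of that procedure (the function name and inputs are illustrative: scores holds the system score per trial, labels is a logical vector with true for target trials):
function [eer, eerThreshold] = compute_eer(scores, labels)
% Sweep every score as a candidate threshold and find where FAR ~ FRR
thresholds = sort(scores);
nTarget    = sum(labels);
nImpostor  = sum(~labels);
far = zeros(size(thresholds));
frr = zeros(size(thresholds));
for i = 1:numel(thresholds)
    t = thresholds(i);
    far(i) = sum(scores(~labels) >= t) / nImpostor;   % false accepts
    frr(i) = sum(scores(labels)  <  t) / nTarget;     % false rejects
end
[~, idx] = min(abs(far - frr));     % point where FAR and FRR are closest
eer = mean([far(idx), frr(idx)]);   % one common convention: their mean
eerThreshold = thresholds(idx);
end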
For the code you can refer to my pyeer.py, in the function processDataTable2:
https://github.com/StevenLOL/Research_speech_speaker_verification_nist_sre2010/blob/master/SRE2010/sid/pyeer.py
This function is written for the NIST SRE 2010 evaluation.
3) There are other measures similar to the EER, such as minDCF, which merely changes the weights of FAR and FRR. You can refer to the "Performance Measure" section of http://www.nist.gov/itl/iad/mig/sre10results.cfm
4) You can also refer to this package https://sites.google.com/site/bosaristoolkit/ and to DETware_v2.1.tar.gz at http://www.itl.nist.gov/iad/mig/tools/ for computing and plotting the EER in MATLAB.
Plotting in DETware_v2.1:
% Illustrative miss/false-alarm rates (percent), converted to probabilities
Pmiss = 1:50;
Pfa = 50:-1:1;
Plot_DET(Pmiss/100.0, Pfa/100.0, 'r')
FAR(t) and FRR(t) are parameterized by the threshold t. They are cumulative distributions, so they should be monotonic in t. Your data as listed is not monotonic, so if it is indeed FAR and FRR, then the measurements were not recorded in threshold order. For the sake of clarity, we can sort them:
FAR FRR
1 1.65 98.33
2 1.82 84.31
3 3.96 60.56
4 7.92 42.92
5 13.2 29.17
6 15.35 26.39
7 16.83 23.47
8 17.99 21.39
9 19.14 20.28
10 19.64 20
11 21.29 18.61
12 24.92 17.08
13 26.07 16.39
14 29.04 13.13
15 34.49 9.31
16 40.76 6.81
17 50.33 5.42
18 66.83 1.67
19 82.51 0.28
This is for increasing FAR, which assumes a distance score; if you have a similarity score, then FAR would be sorted in decreasing order.
Loop over FAR until it is larger than FRR, which occurs at row 11. Then interpolate the crossover value between rows 10 and 11. This is your equal error rate.
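In MATLAB, that loop-and-interpolate step can look like the sketch below, where far and frr hold the sorted columns of the table above:
far = [1.65 1.82 3.96 7.92 13.2 15.35 16.83 17.99 19.14 19.64 ...
       21.29 24.92 26.07 29.04 34.49 40.76 50.33 66.83 82.51]';
frr = [98.33 84.31 60.56 42.92 29.17 26.39 23.47 21.39 20.28 20 ...
       18.61 17.08 16.39 13.13 9.31 6.81 5.42 1.67 0.28]';
i = find(far > frr, 1);      % first row where FAR exceeds FRR (row 11)
% Linear interpolation of the crossing between rows i-1 and i
t = (frr(i-1) - far(i-1)) / ((far(i) - far(i-1)) - (frr(i) - frr(i-1)));
eer = far(i-1) + t * (far(i) - far(i-1));
fprintf('EER is approximately %.2f%%\n', eer);   % about 19.8%% for this data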

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data, each having 1000 attributes; 3 of them should be alike or similar in some way. Applying clustering will reveal the clusters. Since I know the number of clusters, I only need to find similar rows. Hamming distance tells us the similarity between rows, and the result correctly shows that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the asked number of clusters]
I want to take that knowledge and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data, while my question concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used MATLAB's dendrogram function. For instance, after loading your dataset into the matrix x, I ran the following code:
l = linkage(x, 'average');   % average-linkage hierarchical clustering on x
dendrogram(l);
and got a dendrogram plot.
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance than the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling each row with the cluster group it belongs to:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates (A is the 6-by-1000 binary matrix)
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% Plot one black square per nonzero entry
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis properties
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
For binary strings a and b, the Hamming distance is equal to the number of ones (the population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6-by-6 matrix filled with the Hamming distances. The matrix would be symmetric (the distance from a to b is the same as the distance from b to a) and the diagonal is 0 (the distance from a string to itself is zero).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i = 1:6
    for j = 1:6
        hamming_dist(i,j) = sum(xor(x(i,:), x(j,:)));
    end
end
(And yes, this code is redundant given the symmetry and the zero diagonal, but the computation is minimal and optimizing it is not worth the effort.)
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
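For example, a minimal way to dump the matrix as text from MATLAB:
% Print the 6-by-6 distance matrix as fixed-width text
for i = 1:size(hamming_dist, 1)
    fprintf('%6d', hamming_dist(i, :));
    fprintf('\n');
end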
This does not use your "kmeans" approach, but your added description of the problem helped shape this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (here a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between strings 1 and 2 is the same as between strings 2 and 1, which is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distances are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not in the same cluster has ~50% dissimilarity (which is what you would expect from random binary data).
Presentation
I recommend a Kohonen network or a similar technique to present your data in, say, 2 dimensions. In general this area is called dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 dimensions :P
scikit-learn is a good Python package to get you started; similar packages exist for MATLAB, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set, though. 6 data points is a really small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually a 1000-bit-long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points you have only 2^6 = 64 distinct column patterns, which means 936 out of your 1000 attributes are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test the significance of your clusters, run k-means several times with different initial conditions and check whether you get the same clusters. If you get different clusters every time, or even from time to time, you cannot really trust your result; a sketch of this check follows.
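A sketch of that check in MATLAB, assuming x is the 6-by-1000 binary matrix; comparing co-membership matrices sidesteps the fact that cluster labels are arbitrary:
% Run k-means from several random starts and compare the partitions
nRuns = 10;
assignments = zeros(size(x, 1), nRuns);
for r = 1:nRuns
    assignments(:, r) = kmeans(x, 3, 'Distance', 'hamming', 'Start', 'sample');
end
% Two runs agree if every pair of rows is grouped the same way
same  = @(a, b) isequal(bsxfun(@eq, a, a'), bsxfun(@eq, b, b'));
agree = arrayfun(@(r) same(assignments(:, 1), assignments(:, r)), 1:nRuns);
fprintf('%d of %d runs match the first partition\n', sum(agree), nRuns);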
I used a barcode-type visualization for my data. The code posted here earlier by Oleg was too heavy for my solution (the image files were over 500 kB), so I used image() to make the figures:
function barcode(A)
% Display the binary matrix A as a barcode-style image
B = (A+1)*2;                     % map 0/1 values onto colormap indices
image(B);
colormap flag;                   % high-contrast colormap
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out';              % draw ticks outside the plot box
end

Need to generate a cluster of points in k-dimensional space in MATLAB

The points generated should be something like this:
21 32 34 54 76 34
23 55 67 45 75 23.322
54 23 45 76 85.1 32
The above example is for k = 6.
How can I generate such a cluster of, say, around 1000 points, varying both the value of k and the radius of the cluster?
Is there any built-in function that can do this for me? I can use any other tool if needed.
Any help would be appreciated.
Have a look at ELKI. It comes with a quite flexible data generator for clustering datasets, and there is a 640d subspace clustering example somewhere on the wiki.
Consider using d for the dimensionality: when you are talking about clusters, k usually refers to the number of clusters (think of k-means...).
I think you would need to write your own code for this. Supposing your center is at the origin, you have to pick k numbers in sequence, with the constraint at every step that the sum of the squares of all the numbers up to (and including) it must not exceed the squared radius of the hypersphere. That is, the k-th number squared must be less than or equal to the radius squared minus the sum of the squares of all previously picked numbers.
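As a concrete alternative to the sequential scheme above (a common recipe, not a built-in): draw Gaussian directions and scale them so the points are uniform inside a k-dimensional ball of the chosen radius.
n = 1000;                 % number of points
k = 6;                    % dimensionality
r = 5;                    % cluster radius
center = zeros(1, k);     % put the cluster center wherever you like
g = randn(n, k);                                    % random directions
g = bsxfun(@rdivide, g, sqrt(sum(g.^2, 2)));        % project onto the unit sphere
u = rand(n, 1) .^ (1/k);                            % radii that fill the ball uniformly
points = bsxfun(@plus, center, r * bsxfun(@times, u, g));   % n-by-k cluster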
If you have the stats toolbox this is easy
http://www.mathworks.co.uk/help/toolbox/stats/kmeans.html
Otherwise, you can quite easily write the code yourself using Lloyd's algorithm.

Matlab: Sum elements in array into another array

Suppose I have an array age = [16 17 25 18 32 89 43 55], which holds the ages of a certain list of people. I also have a second array, groups = [1 1 2 1 3 2 1 4], which denotes the group each person belongs to; i.e. the person whose age is 55 is the only person in group 4, there are four people in group 1, etc.
I want to calculate the combined sum of the ages in each group. That is, the result I want to get in this case is an array of 4 elements, its first entry containing the sum of the ages of the people in group #1 (16+17+18+43), its second entry containing the sum of the ages of the people in group #2 (25+89), etc.
I know of course how to do this with a for loop, but is it possible to do this using some variation of sum or something similar, so as to tap into MATLAB's vectorized operations?
The code in @Ismail's answer is fine, but you could also try this:
>> accumarray(groups', age')
ans =
94
114
32
55
I find it hard to get an appreciation from the documentation of exactly what accumarray can do in its full generality, but this is a great example of a simple usage. It's worth learning how to use it effectively: once you've worked it out it's very powerful, and it will be a lot faster (on a larger example) than arrayfun.
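As a taste of that generality: accumarray also accepts a function handle, so the same pattern covers other per-group statistics (an illustrative extension, not part of the original answer):
meanAge   = accumarray(groups', age', [], @mean);   % mean age per group
groupSize = accumarray(groups', 1);                 % number of people per group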
You can use arrayfun and unique as follows:
arrayfun(@(x) sum(age(groups==x)), unique(groups))