Clustering data after linkage algorithm - matlab

I am not an expert in statistics and data analysis, hence I can't understand if the behavior which I obtain is correct or not. I am here looking for your help.
Assume I have these samples which I would like to cluster (10 points in the plane - reduced version of the problem):
[X Y] =
266 450
266 400
258 168
290 442
295 438
273 432
294 158
318 161
250 423
253 413
To cluster them I can use a cluster tree
Z = linkage([ X Y ],'complete');
which is (by dendrogram(Z,10))
Now I would like to extract clusters on the basis of the distance attached to the nodes of the tree.
Say that my distance is 150, I would expect that the call
T = cluster(Z,'Cutoff',150);
returns me 2 clusters. But it gives me just one (I suppose), i.e.
T =
1
1
1
1
1
1
1
1
1
1
What am I missing?

Use inconsistent(Z,150) and look at the values in column 4. Increasing the cutoff from a small positive number steps you along the tree.
E.g.
cluster(Z,'cutoff',0.7)
does not give you what you want (I think)
but
cluster(Z,'cutoff',0.8)
does.

The criterion for cluster is inconsistency ('inconsistent') by default.
Since the height in dendrogram is distance, you can change the criterion to 'distance',
i.e:
T = cluster(Z, 'Cutoff', 150, 'criterion', 'distance');

Related

KDB+/Q: Custom min max scaler

Im trying to implement a custom min max scaler in kdb+/q. I have taken note of the implementation located in the ml package however I'm looking to be able to scale data between a custom range i.e. 0 and 255. What would be an efficient implementation of min max scaling in kdb+/q?
Thanks
Looking at the link to github on the page you referenced it looks like you may be able to define a function like so:
minmax255:{[sf;x]sf*(x-mnx)%max[x]-mnx:min x}[255]
Where sf is your scaling factor (here given by 255).
q)minmax255 til 10
0 28.33333 56.66667 85 113.3333 141.6667 170 198.3333 226.6667 255
If you don't like decimals you could round to the nearest whole number like:
q)minmax255round:{[sf;x]floor 0.5+sf*(x-mnx)%max[x]-mnx:min x}[255]
q)minmax255round til 10
0 28 57 85 113 142 170 198 227 255
(logic here is if I have a number like 1.7, add .5, and floor I'll wind up with 2, whereas if I had a number like 1.2, add .5, and floor I'll end up with 1)
If you don't want to start at 0 you could use | which takes the max of it's left and right arguments
q)minmax255roundlb:{[sf;lb;x]lb|floor sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmax255roundlb til 10
10 28 56 85 113 141 170 198 226 255
Where I'm using lb to mean 'lower bound'
If you want to apply this to a table you could use
q)show testtab:([]a:til 10;b:til 10)
a b
---
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
q)update minmax255 a from testtab
a b
----------
0 0
28.33333 1
56.66667 2
85 3
113.3333 4
141.6667 5
170 6
198.3333 7
226.6667 8
255 9
The following will work nicely
minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
As petty as it sounds, it is my strong recommendation that you do not follow through with Shehir94 solution for a custom minimum value. Applying a maximum to get a starting range, it will mess with the original distribution. A custom minmax scaling should be a simple linear transformation on a standard 0-1 minmax transformation.
X' = a + bX
For example, to get a custom scaling of 10-255, that would be a b=245 and a=10, we would expect the new mean to follow this formula and the standard deviation to only be a Multiplicative, but applying lower bound messes with this, for example.
q)dummyData:10000?100.0
q)stats:{`transform`minVal`maxVal`avgVal`stdDev!(x;min y;max y; avg y; dev y)}
q)minmax255roundlb:{[sf;lb;x]lb|sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
q)res:stats'[`orig`lb`linear;(dummyData;minmax255roundlb dummyData;minmaxCustom[10;255;dummyData])]
q)res
transform minVal maxVal avgVal stdDev
-----------------------------------------------
orig 0.02741043 99.98293 50.21896 28.92852
lb 10 255 128.2518 73.45999
linear 10 255 133.024 70.9064
// The transformed average should roughly be
q)10 + ((255-10)%100)*49.97936
132.4494
// The transformed std devaition should roughly be
q)2.45*28.92852
70.87487
To answer the comment, this could be applied over a large number of coluwould be applied to a table in the following manner
q)n:10000
q)tab:([]sym:n?`3;col1:n?100.0)
q)multiColApply:{[tab;scaler;colList]flip ft,((),colList)!((),scaler each (ft:flip tab)[colList])}
q)multiColApply[tab;minmaxCustom[10;20];`col1`col2]
sym col1 col2 col3
------------------------------
cag 13.78461 10.60606 392.7524
goo 15.26201 16.76768 517.0911
eoh 14.05111 19.59596 515.9796
kbc 13.37695 19.49495 406.6642
mdc 10.65973 12.52525 178.0839
odn 16.24697 17.37374 301.7723
ioj 15.08372 15.05051 785.033
mbc 16.7268 20 534.7096
bhj 12.95134 18.38384 711.1716
gnf 19.36005 15.35354 411.597
gnd 13.21948 18.08081 493.1835
khi 12.11997 17.27273 578.5203

How to insert a value

I want to insert a number in the following matrix: n x 1 matrix
6
103
104
660
579
750
300
299
300
750
579
661
580
760
302
301
302
760
580
662
581
How to I insert it in the middle and shift the remaining numbers? I tried the following code:
Idx=[723];
c=false(1,length(Element_set2)+length(Idx));
c(Idx)=true;
result=nan(size(c));
result(~c)=Element_set2;
result(c)=8
You are complicating things. Simply find the middle index by finding the length of the array, dividing by 2 and truncating any decimal points, then using simply indexing to update the new matrix. Supposing that result is the column vector that was created by you and number is the value you want to insert in the middle, do the following:
number = 8; %// Change to suit whatever number you desire
middle = floor(numel(result) / 2);
result = [result(1:middle); number; result(middle+1:end)];
In the future, please read this great MATLAB tutorial on indexing directly from MathWorks: http://www.mathworks.com/company/newsletters/articles/matrix-indexing-in-matlab.html. It's a good resource on the kinds of indexing operations one expects from starting out in MATLAB.

Calculate Variance of a Group data

I have a table contain height and frequency.I want to calculate the variance of it.
Height 140 150 160 170 180 190
Frequency 3 5 57 63 30 2
I have tried the below code:
height=[140 150 160 170 180 190;3 5 57 63 30 2]
height=height(:)
V = var(height) %Calculate Variance
**This give an answer of 5.7316e+03**
while with formula it give an answer of 81.8594. Now please tell me how can i do this?
Use weighted variance:
h=height;
var(h(1,:),h(2,:))

Matlab - Sum of surrounding elements

I want to calculate the sum of the elements surrounding a given element in a matrix. So far, I have written these lines of code:
for i=1:m,
rij(1:n)=0
for j=1:n,
alive = tijdelijk(i-1,j)+tijdelijk(i+1,j)+tijdelijk(i-1,j-1)+tijdelijk(i+1,j-1)+tijdelijk(i,j+1)+tijdelijk(i,j-1)+tijdelijk(i-1,j+1)+tijdelijk(i+1,j+1)
This results in an error because, for example, i-1 becomes zero for i=1. Anyone got an idea how to do this without getting this error?
You can sum the elements via filtering. conv2 can be used for this manner.
Let me give an example. I create a sample matrix
>> A = reshape(1:20, 4, 5)
A =
1 5 9 13 17
2 6 10 14 18
3 7 11 15 19
4 8 12 16 20
Then, I create a filter. The filter is like a mask where you put the center on the current cell and the locations corresponding to the 1's on the filter are summed. For eight-connected neighbor case, the filter should be as follows:
>> B = [1 1 1; 1 0 1; 1 1 1]
B =
1 1 1
1 0 1
1 1 1
Then, you simply convolve the matrix with this small matrix.
>> conv2(A, B, 'same')
ans =
13 28 48 68 45
22 48 80 112 78
27 56 88 120 83
18 37 57 77 50
If you want four-connected neighbors, you can make the corners of your filter 0. Similarly, you can design any filter for your purpose, such as for averaging all neighbors instead of summing them.
For details, please see the convolution article in Wikipedia.
Two possibilities : change the limits of the loops to i=k:(m-k) and j=k:(n-k) or use blkproc
ex :
compute the 2-D DCT of each 8-by-8 block
I = imread('cameraman.tif');
fun = #dct2;
J = blkproc(I,[8 8],fun);
imagesc(J), colormap(hot)
There are lots of things you can do at the edges. Which you do depends very specifically on your problem and is different from usage case to usage case. Typical things to do:
If (i-1) or (i+1) is out of range, then just ignore that element. This is equivalent to zero padding the matrix with zeros around the outside and adjusting the loop limits accordingly
Wrap around the edges. In other words, for an MxN matrix, if (i-1) takes you to 0 then instead of taking element (i-1, j) = (0, j) you take element (M, j).
Since your code mentions "your teacher" I'd guess that you can ask what should happen at the edges (or working it out in a sensible manner may well be part of the task!!).

How to compare the pairs of coordinates most efficiently without using nested loops in Matlab?

If I have 20 pairs of coordinates, whose x and y values are say :
x y
27 182
180 81
154 52
183 24
124 168
146 11
16 90
184 153
138 133
122 79
192 183
39 25
194 63
129 107
115 161
33 14
47 65
65 2
1 124
93 79
Now if I randomly generate 15 pairs of coordinates (x,y) and want to compare with these 20 pairs of coordinates given above, how can I do that most efficiently without nested loops?
If you're trying to see if any of your 15 randomly generated coordinate pairs are equal to any of your 20 original coordinate pairs, an easy solution is to use the function ISMEMBER like so:
oldPts = [...]; %# A 20-by-2 matrix with x values in column 1
%# and y values in column 2
newPts = randi(200,[15 2]); %# Create a 15-by-2 matrix of random
%# values from 1 to 200
isRepeated = ismember(newPts,oldPts,'rows');
And isRepeated will be a 15-by-1 logical array with ones where a row of newPts exists in oldPts and zeroes otherwise.
If your coordinates are 1) actually integers and 2) their span is reasonable (otherwise use sparse matrix), I'll utilize a simple truth table. Like
x_0= [27 180 ...
y_0= [182 81 ...
s= [200 200]; %# span of coordinates
T= false(s);
T(sub2ind(s, x_0, y_0))= true;
%# now obtain some other coordinates
x_1= [...
y_1= [...
%# and common coordinates of (x_0, y_0) and (x_1, y_1) are just
T(sub2ind(s, x_1, y_1))
If your original twenty points aren't going to change, you'd get better efficiency if you sorted them O(n log n); then you could see if each random point was in the list with a O(log n) search.
If your "original" points list changes (insertions / deletions), you could get equivalent performance with a binary tree.
BUT: If the number of points you're working with is really as low as in your question, your double loop might just be the fastest method! Algorithms with low Big-O curves will be faster as the amount of data gets really big, but it's often at the cost of a one-time slowdown (in your case, the sort) - and with only 15x20 data points... There won't be a human-perceptible difference; you might see one if you're timing it on your system clock. Or you might not.
Hope this helps!