Does h2o.kmeans() make predictions based on Euclidean distance? - cluster-analysis

I created a clustering model using h2o.kmeans(). The modeling dataset was standardized by scale() in R first.
The model has five clusters and the coordinates of the centroids are:
CENTROID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22
1 -0.646544 -0.6322714 -0.5101907 -0.2980412 -1.6182105 -1.7939725 -1.8194372 -1.82349 -1.8174061 -1.8069266 -2.2213561 -2.2618561 -2.2170297 -2.2004509 -2.196722 -2.2267695 -2.2536694 -2.2653944 -2.1599764 -2.2074994 -1.9114193 -2.78E-16
2 -0.2505012 -0.2582746 -0.2542313 -0.3205136 0.2912933 0.3239872 0.3236214 0.3231876 0.3234663 0.309818 0.362641 0.3800735 0.3615138 0.3542787 0.350817 0.3583391 0.375764 0.3715018 0.3533203 0.3533025 0.2651153 3.72E-15
3 0.4237044 0.4421857 0.408422 0.6620773 0.2371281 0.2592748 0.2597783 0.2782299 0.258803 0.3129833 0.4157714 0.3704712 0.3948566 0.4137049 0.4289137 0.4229101 0.3904031 0.4323851 0.3984215 0.442518 0.5278553 1.00E+00
4 2.2426614 2.2450805 2.0475964 1.5666675 0.2249847 0.2887632 0.3391117 0.3224008 0.3375972 0.3617759 0.5063836 0.4805747 0.5226613 0.5097081 0.5196333 0.5136624 0.4780912 0.4686772 0.4743151 0.5357567 0.5734882 8.24E-01
5 4.4718381 4.5243432 4.8917335 5.223828 0.2374653 0.3096633 0.3215417 0.3326531 0.3189998 0.414707 0.5065842 0.5113028 0.558864 0.5482378 0.543278 0.5436269 0.5204451 0.5341745 0.5096259 0.6486469 0.6595461 9.89E-01
When I use the model to make predictions for new data, the result mostly makes sense: it returns the cluster whose centroid has the shortest Euclidean distance to the data point. However, about 5% of the time the prediction is off. For example, take the data point below:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22
-0.2001578 -0.2485784 -0.3008685 -0.005366991 0.2624246 0.3142725 0.3074037 0.3221539 0.3033765 0.3403944 0.3557642 0.3810387 0.4848038 0.2788213 0.544491 0.2838926 0.2899755 0.3963652 0.2594092 0.3083141 0.463528 1
The prediction is cluster 3; however, the Euclidean distances between the data point and the centroids are:
cluster 1: 10
cluster 2: 1.11
cluster 3: 1.39
cluster 4: 4.53
cluster 5: 9.97.
Based on the calculation above, the data point should be assigned to cluster 2, not 3.
Is this a bug, or does h2o.kmeans() use some method other than Euclidean distance for prediction?
Thank you.

Yes, as stated in the K-Means documentation, it uses Euclidean distance.
If you can provide a reproducible example showing that this is a bug, please file a bug report. Thanks!
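The hand calculation in the question can be reproduced outside H2O. The sketch below is written in MATLAB to match the other snippets on this page; `C`, `x`, `d`, and `clusterIds` are illustrative names, and only centroids 2 and 3 from the question are included. It computes the plain Euclidean distance from the example point to each centroid on the coordinates exactly as printed. h2o.kmeans() also has a `standardize` argument; whether it plays a role here is an assumption that nothing in this thread confirms.

```matlab
% Centroids 2 and 3 from the question (columns X1..X22).
C = [ ...
 -0.2505012 -0.2582746 -0.2542313 -0.3205136 0.2912933 0.3239872 0.3236214 0.3231876 ...
  0.3234663 0.309818 0.362641 0.3800735 0.3615138 0.3542787 0.350817 0.3583391 ...
  0.375764 0.3715018 0.3533203 0.3533025 0.2651153 3.72E-15; ...
  0.4237044 0.4421857 0.408422 0.6620773 0.2371281 0.2592748 0.2597783 0.2782299 ...
  0.258803 0.3129833 0.4157714 0.3704712 0.3948566 0.4137049 0.4289137 0.4229101 ...
  0.3904031 0.4323851 0.3984215 0.442518 0.5278553 1.00E+00];
clusterIds = [2 3];

% The example data point from the question.
x = [-0.2001578 -0.2485784 -0.3008685 -0.005366991 0.2624246 0.3142725 0.3074037 0.3221539 ...
      0.3033765 0.3403944 0.3557642 0.3810387 0.4848038 0.2788213 0.544491 0.2838926 ...
      0.2899755 0.3963652 0.2594092 0.3083141 0.463528 1];

% Row-wise Euclidean distance; bsxfun keeps this working on older releases.
d = sqrt(sum(bsxfun(@minus, C, x).^2, 2));
[~, nearest] = min(d);

fprintf('cluster %d: %.2f\n', [clusterIds; d.']);
fprintf('nearest centroid: cluster %d\n', clusterIds(nearest));
```

On these numbers the distances come out at about 1.11 and 1.39, matching the question, so cluster 2 is the nearer centroid in the printed coordinates.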

Related

Potential Bug in MATLAB regress R2014a

MATLAB R2014a used to work fine with regress, but now I get an error even though the variables are fine and the rank is satisfactory.
X = rand([10 3])
X =
0.8407 0.3517 0.0759
0.2543 0.8308 0.0540
0.8143 0.5853 0.5308
0.2435 0.5497 0.7792
0.9293 0.9172 0.9340
0.3500 0.2858 0.1299
0.1966 0.7572 0.5688
0.2511 0.7537 0.4694
0.6160 0.3804 0.0119
0.4733 0.5678 0.3371
K>> Y = rand([10 1])
Y =
0.1622
0.7943
0.3112
0.5285
0.1656
0.6020
0.2630
0.6541
0.6892
0.7482
[B,BINT] = regress(Y,X)
Subscript indices must either be real positive integers or logicals.
Error in regress (line 93)
b(perm) = R \ (Q'*y);
Obviously, X and Y are fine. Something is wrong with the matrix math in regress: perm is, for some reason, being output as a vector (giving an index error). A few lines above, qr is called like this, with no further modification to perm:
[Q,R,perm] = qr(X,0);
The help file says qr is supposed to output a third argument that's a matrix, but how can that be if the math always expects a vector?
% [Q,R,E] = QR(A) produces unitary Q, upper triangular R and a
% permutation matrix E so that A*E = Q*R. The column permutation E is
% chosen so that ABS(DIAG(R)) is decreasing.
Very confusing considering both of these are built-in functions. I literally re-installed MATLAB R2014a and a few toolboxes and STILL get this error. It feels like qr got updated to output a different argument, but I don't understand why a fresh reinstall wouldn't take care of this, or why qr would have updated at all anyways. Everything else in my MATLAB is great.
Any ideas???
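For reference, the two calling forms of the built-in qr differ in their third output, which may explain the apparent mismatch between the help text and regress. The sketch below uses a small made-up matrix `X`. The economy-size form `qr(X,0)` returns a permutation vector that is safe to use as an index, while the full form `qr(X)` returns the permutation matrix described in the quoted help text. If `perm` comes back as anything else on a given installation, one possibility (an assumption; nothing in the question confirms it) is that another `qr` on the path shadows the built-in, which `which -all qr` would reveal.

```matlab
% Small test matrix (illustrative values only).
X = [1 2 0; 0 1 3; 4 0 1; 2 2 2];

% Economy-size call, as used inside regress: the third output is a
% permutation VECTOR, so that X(:,perm) equals Q*R.
[Q, R, perm] = qr(X, 0);
disp(perm)                       % some ordering of the column indices 1..3
disp(norm(X(:, perm) - Q*R))     % close to zero

% Full call, as in the quoted help text: the third output is a
% permutation MATRIX E, so that X*E equals Q2*R2.
[Q2, R2, E] = qr(X);
disp(size(E))                    % 3-by-3
disp(norm(X*E - Q2*R2))          % close to zero
```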

3d Graphing Data in matlab and changing the axis values

I am trying to make a 3D graph of an Excel file that is 343 by 81 cells. The first column needs to be X, the first row needs to be Y, and the remaining matrix needs to be Z. I have successfully imported the data from Excel and created a (343,1) column matrix from the first column called Energy (x-axis), a (1,81) row matrix called TimeDelay (y-axis), and a (343,81) matrix called AbsorbanceChange (z-axis) whose first row and column are zero. I get the 3D graph I need, but the axes should show the Energy and TimeDelay values instead of the indices of the AbsorbanceChange matrix. The relevant portion of the code is below, along with a picture of the graph:
EnergyString = dataArray{:, 1};
EnergyString(1,1) = {'0'};
Energy = str2double(EnergyString);
%Energy = [ Energy, zeros(343, 80) ];
TimeDelay = [ z1(1,1), z2(1,1), z3(1,1), z4(1,1), z5(1,1), z6(1,1), z7(1,1), z8(1,1), z9(1,1), z10(1,1), z11(1,1), z12(1,1), z13(1,1), z14(1,1), z15(1,1), z16(1,1), z17(1,1), z18(1,1), z19(1,1), z20(1,1), z21(1,1), z22(1,1), z23(1,1), z24(1,1), z25(1,1), z26(1,1), z27(1,1), z28(1,1), z29(1,1), z30(1,1), z31(1,1), z32(1,1), z33(1,1), z34(1,1), z35(1,1), z36(1,1), z37(1,1), z38(1,1), z39(1,1), z40(1,1), z41(1,1), z42(1,1), z42(1,1), z43(1,1), z44(1,1), z45(1,1), z46(1,1), z47(1,1), z48(1,1), z49(1,1), z50(1,1), z51(1,1), z52(1,1), z53(1,1), z54(1,1), z55(1,1), z56(1,1), z57(1,1), z58(1,1), z59(1,1), z60(1,1), z61(1,1), z62(1,1), z63(1,1), z64(1,1), z65(1,1), z66(1,1), z67(1,1), z68(1,1), z69(1,1), z70(1,1), z71(1,1), z72(1,1), z73(1,1), z74(1,1), z75(1,1), z76(1,1), z77(1,1), z78(1,1), z79(1,1), z80(1,1) ];
%TimeDelay = [ TimeDelay; zeros(342, 81)];
clear startRow formatSpec filename fileID delimiter ans EnergyString Alpha Beta Gamma Delta Epsilon Zeta Eta Theta Iota Kappa Lambda Mu Nu Xi Omicron Pi Rho Sigma Tau Upsilon Phi Chi Psi Omega AlphaAlpha AlphaBeta AlphaGamma AlphaDelta AlphaEpsilon AlphaZeta AlphaEta AlphaTheta AlphaIota AlphaKappa AlphaLambda AlphaMu AlphaNu AlphaXi AlphaOmicron AlphaPi AlphaRho AlphaSigma AlphaTau AlphaUpsilon AlphaPhi AlphaChi AlphaPsi AlphaOmega BetaAlpha BetaBeta BetaGamma BetaDelta BetaEpsilon BetaZeta BetaEta BetaTheta BetaIota BetaKappa BetaLambda BetaMu BetaNu BetaXi BetaOmicron BetaPi BetaRho BetaSigma BetaTau BetaUpsilon BetaPhi BetaChi BetaPsi BetaOmega GammaAlpha GammaBeta GammaGamma GammaDelta GammaEpsilon GammaZeta GammaEta GammaTheta; % Delete excess variables
AbsorbanceChange = [ zeros(343, 1), z1, z2, z3, z4, z5, z6, z7, z8, z9, z10, z11, z12, z13, z14, z15, z16, z17, z18, z19, z20, z21, z22, z23, z24, z25, z26, z27, z28, z29, z30, z31, z32, z33, z34, z35, z36, z37, z38, z39, z40, z41, z42, z43, z44, z45, z46, z47, z48, z49, z50, z51, z52, z53, z54, z55, z56, z57, z58, z59, z60, z61, z62, z63, z64, z65, z66, z67, z68, z69, z70, z71, z72, z73, z74, z75, z76, z77, z78, z79, z80];
AbsorbanceChange(1,:) = 0;
clear z1 z2 z3 z4 z5 z6 z7 z8 z9 z10 z11 z12 z13 z14 z15 z16 z17 z18 z19 z20 z21 z22 z23 z24 z25 z26 z27 z28 z29 z30 z31 z32 z33 z34 z35 z36 z37 z38 z39 z40 z41 z42 z43 z44 z45 z46 z47 z48 z49 z50 z51 z52 z53 z54 z55 z56 z57 z58 z59 z60 z61 z62 z63 z64 z65 z66 z67 z68 z69 z70 z71 z72 z73 z74 z75 z76 z77 z78 z79 z80;
mesh(AbsorbanceChange)
colorbar
title('WS2-Perovskite-image')
xlabel('Energy') % x-axis label
ylabel('Time-delay') % y-axis label
zlabel('Absorbance Change')
When I type help mesh in MATLAB I see, among other things, this:
mesh(x,y,Z) and mesh(x,y,Z,C), with two vector arguments replacing
the first two matrix arguments, must have length(x) = n and
length(y) = m where [m,n] = size(Z). In this case, the vertices
of the mesh lines are the triples (x(j), y(i), Z(i,j)).
Note that x corresponds to the columns of Z and y corresponds to
the rows.
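Here Z is AbsorbanceChange, which is 343-by-81, so its rows follow Energy and its columns follow TimeDelay. To put Energy on the x-axis and TimeDelay on the y-axis, pass the transposed matrix:
mesh(Energy, TimeDelay, AbsorbanceChange.');
I don't know how you read the data from the file, but there is a better way than specifying each cell individually in your code.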
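As a rough illustration of both points, the sketch below assumes the sheet has already been read into one numeric matrix `M` (for example with `xlsread` or, in newer releases, `readmatrix`). `M` and its values are made up here, and the assumed layout is energies in the first column, time delays in the first row, and an unused corner cell. The axes and the data block are then obtained by slicing instead of naming each column.

```matlab
% Synthetic stand-in for the imported sheet (made-up values, assumed layout).
energyVals = linspace(1.5, 2.5, 342).';      % 342-by-1
delayVals  = linspace(0, 10, 80);            % 1-by-80
M = [NaN, delayVals; energyVals, sin(energyVals * delayVals)];   % 343-by-81

% Slice the axes and the data block out of the matrix.
Energy           = M(2:end, 1);       % 342-by-1
TimeDelay        = M(1, 2:end);       % 1-by-80
AbsorbanceChange = M(2:end, 2:end);   % 342-by-80, rows follow Energy

% mesh(x, y, Z) needs length(x) == size(Z, 2) and length(y) == size(Z, 1),
% so Z is transposed to put Energy on the x-axis.
Z = AbsorbanceChange.';               % 80-by-342
mesh(Energy, TimeDelay, Z)
xlabel('Energy'); ylabel('Time-delay'); zlabel('Absorbance Change')
```

Unlike the code in the question, this version drops the header row and column entirely and does not zero-pad them, so no artificial zero band appears along the edges of the plot.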

difference between feval and predict in matlab

I am trying to learn a linear regression model in MATLAB. My variables are train_fv, train_fv_labels, test_fv, and test_fv_labels, with sizes 333x9, 333x1, 167x9, and 167x1, respectively. I want to train the model, predict the labels on test_fv, and compare them with the actual labels given in test_fv_labels.
My MATLAB code is as follows. I am using stepwise linear regression to get the best fit possible:
mdl = stepwiselm(train_fv,train_fv_labels,'PEnter',0.001,'verbose',1)
mdl1 = step(mdl,'upper','quadratic','verbose',1)
The output I get is as follows:
1. Adding x5, FStat = 83.3108, pValue = 7.06324e-18
2. Adding x1, FStat = 35.6014, pValue = 6.24096e-09
3. Adding x7, FStat = 41.0932, pValue = 5.0338e-10
4. Adding x5:x7, FStat = 33.3157, pValue = 1.81571e-08
5. Adding x1:x5, FStat = 14.1821, pValue = 0.000196729
mdl =
Linear regression model:
y ~ 1 + x1*x5 + x5*x7
Estimated Coefficients:
                Estimate         SE         tStat       pValue  
               __________    __________    _______    __________
(Intercept)     0.0014532    5.5229e-05     26.312    9.9458e-83
x1             0.00071972    0.00011402     6.3121    8.9595e-10
x5             -0.0021179    0.00018102      -11.7    1.1938e-26
x7              0.0011401    0.00022498     5.0678    6.7473e-07
x1:x5          -0.0015096    0.00040087    -3.7659    0.00019673
x5:x7          -0.0049673    0.00077872    -6.3788    6.0915e-10
Number of observations: 333, Error degrees of freedom: 327
Root Mean Squared Error: 0.001
R-squared: 0.442, Adjusted R-Squared 0.434
F-statistic vs. constant model: 51.9, p-value = 1.65e-39
6. Adding x5^2, FStat = 63.1344, pValue = 3.17359e-14
mdl1 =
Linear regression model:
y ~ 1 + x1*x5 + x5*x7 + x5^2
Estimated Coefficients:
                Estimate         SE         tStat       pValue  
               __________    __________    _______    __________
(Intercept)     0.0011415    6.4043e-05     17.825    4.3107e-50
x1             0.00071722    0.00010452     6.8618    3.4339e-11
x5             -0.0018651    0.00016896    -11.039    2.7782e-24
x7              0.0011951    0.00020635     5.7915    1.6426e-08
x1:x5          -0.0019348    0.00037135    -5.2101     3.354e-07
x5:x7          -0.0045341    0.00071592    -6.3332    7.9578e-10
x5^2            0.0033789    0.00042525     7.9457    3.1736e-14
Number of observations: 333, Error degrees of freedom: 326
Root Mean Squared Error: 0.000921
R-squared: 0.533, Adjusted R-Squared 0.524
F-statistic vs. constant model: 61.9, p-value = 5.33e-51
So the mdl model gives me the function y ~ 1 + x1*x5 + x5*x7, and mdl1 gives me y ~ 1 + x1*x5 + x5*x7 + x5^2.
But when I try to predict values for the test set with feval, I get an error. Why is this so?
test_fv_labels = feval(mdl1,test_fv);
Predictor data matrix must have 5 columns.
But if I use the predict function instead of feval, I do not get an error. Why is this so?
test_fv_labels = predict(mdl1,test_fv);
Please tell me where I am going wrong, and what the difference is between the predict and feval commands in MATLAB.

Generating a vector of symbolic variables in Matlab

I want to generate a symbolic vector p, with each element a symbolic variable:
p = [p1; p2; ...; pn];
I don't want to type syms p1 p2 ... because I have ~100 such variables. Is there a way to generate them automatically?
Yup. Use sym like so:
p = sym('p', [100 1]);
This syntax creates a vector of symbolic variables in which each element's name is the prefix p followed by its index. Here it gives a 100-by-1 symbolic vector running from p1 to p100; simply change the 100 to however many variables you want.
This is what p looks like:
>> p
p =
p1
p2
p3
p4
p5
p6
p7
p8
p9
p10
p11
p12
p13
p14
p15
p16
p17
p18
p19
p20
p21
p22
p23
p24
p25
p26
p27
p28
p29
p30
p31
p32
p33
p34
p35
p36
p37
p38
p39
p40
p41
p42
p43
p44
p45
p46
p47
p48
p49
p50
p51
p52
p53
p54
p55
p56
p57
p58
p59
p60
p61
p62
p63
p64
p65
p66
p67
p68
p69
p70
p71
p72
p73
p74
p75
p76
p77
p78
p79
p80
p81
p82
p83
p84
p85
p86
p87
p88
p89
p90
p91
p92
p93
p94
p95
p96
p97
p98
p99
p100
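Once created, the vector behaves like any other symbolic array. The sketch below assumes the Symbolic Math Toolbox is available and uses a small `n` for readability; `f` and `val` are illustrative names. It builds an expression from the generated variables and then substitutes numeric values for all of them at once.

```matlab
% Requires the Symbolic Math Toolbox.
n = 3;                          % small n for readability; the question needs about 100
p = sym('p', [n 1]);            % p = [p1; p2; p3]
f = sum(p.^2);                  % p1^2 + p2^2 + p3^2
val = subs(f, p, [1; 2; 3]);    % substitute p1 = 1, p2 = 2, p3 = 3
disp(val)                       % 1 + 4 + 9 = 14
```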

Associated Labels in a dendrogram plot - MATLAB

I have the following set of data stored in the file stations.dat:
Station A 305.2 321.1 420.9 383.5 311.7 197.1 160.2 113.9 60.5 60.5 64.8 154.3
Station B 281.1 304.0 353.1 231.9 84.6 20.9 11.7 11.9 31.1 75.8 133.0 235.3
Station C 312.3 342.2 366.2 335.2 200.1 74.4 45.9 27.5 24.0 53.6 87.7 177.0
Station D 402.2 524.5 554.9 529.5 347.5 176.8 120.2 35.0 12.6 13.3 14.0 61.6
Station E 261.3 262.7 282.3 232.6 103.8 33.2 16.7 33.2 111.0 149.0 184.8 227.0
By using the following commands,
Z = linkage (stations.data,'ward','euc');
figure (1), dendrogram(Z,0,'orientation', 'right')
I get the figure below:
So the members of cluster 1 are 4, 3, 1 (Stations D, C, and A, respectively) and the members of cluster 2 are 5, 2 (Stations E and B).
I want to put the name of Stations on plot, but if I use the command:
set (gca,'YTickLabel', stations.textdata);
The figure I get is the following:
How can I associate the data with the respective names and plot them in the dendrogram?
I have data for 144 stations; I used only 5 for illustration.
Try the following:
ind = str2num(get(gca,'YTickLabel'));
set(gca, 'YTickLabel',stations.textdata(ind))
An easier way would be to specify the labels of the data points in the dendrogram call directly:
dendrogram(Z,0, 'Orientation','right', 'Labels',stations.textdata)
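The first suggestion works because the default tick labels on the dendrogram are the row indices of the data, listed in plotted order, so they can be used to reorder the names. The sketch below isolates that reordering step without needing the plot. The order in `tickLabels` is hypothetical (the figure is not available here), and `names` stands in for `stations.textdata`. It uses `str2double(cellstr(...))` as a variation on `str2num`, on the assumption that tick labels may come back either as a character matrix or as a cell array, depending on the MATLAB release.

```matlab
% Station names in the same row order as the data (stand-in for stations.textdata).
names = {'Station A'; 'Station B'; 'Station C'; 'Station D'; 'Station E'};

% Hypothetical tick labels as the dendrogram might list them along the axis.
tickLabels = ['4'; '3'; '1'; '5'; '2'];

% Convert the labels to row indices, then reorder the names to match.
ind = str2double(cellstr(tickLabels));   % [4; 3; 1; 5; 2]
reordered = names(ind);
disp(reordered)   % Station D, Station C, Station A, Station E, Station B
```

On a real plot, `tickLabels` would come from `get(gca,'YTickLabel')` and `reordered` would be passed back with `set(gca,'YTickLabel',reordered)`.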