Does h2o.kmeans() make predictions based on Euclidean distance? - cluster-analysis
I created a clustering model using h2o.kmeans(). The modeling dataset was standardized by scale() in R first.
The model has five clusters and the coordinates of the centroids are:
CENTROID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22
1 -0.646544 -0.6322714 -0.5101907 -0.2980412 -1.6182105 -1.7939725 -1.8194372 -1.82349 -1.8174061 -1.8069266 -2.2213561 -2.2618561 -2.2170297 -2.2004509 -2.196722 -2.2267695 -2.2536694 -2.2653944 -2.1599764 -2.2074994 -1.9114193 -2.78E-16
2 -0.2505012 -0.2582746 -0.2542313 -0.3205136 0.2912933 0.3239872 0.3236214 0.3231876 0.3234663 0.309818 0.362641 0.3800735 0.3615138 0.3542787 0.350817 0.3583391 0.375764 0.3715018 0.3533203 0.3533025 0.2651153 3.72E-15
3 0.4237044 0.4421857 0.408422 0.6620773 0.2371281 0.2592748 0.2597783 0.2782299 0.258803 0.3129833 0.4157714 0.3704712 0.3948566 0.4137049 0.4289137 0.4229101 0.3904031 0.4323851 0.3984215 0.442518 0.5278553 1.00E+00
4 2.2426614 2.2450805 2.0475964 1.5666675 0.2249847 0.2887632 0.3391117 0.3224008 0.3375972 0.3617759 0.5063836 0.4805747 0.5226613 0.5097081 0.5196333 0.5136624 0.4780912 0.4686772 0.4743151 0.5357567 0.5734882 8.24E-01
5 4.4718381 4.5243432 4.8917335 5.223828 0.2374653 0.3096633 0.3215417 0.3326531 0.3189998 0.414707 0.5065842 0.5113028 0.558864 0.5482378 0.543278 0.5436269 0.5204451 0.5341745 0.5096259 0.6486469 0.6595461 9.89E-01
When I use the model to make predictions for new data, the result mostly makes sense: it returns the cluster whose centroid has the shortest Euclidean distance to the data point. However, sometimes (about 5% of the time) the prediction is off. For example, for the data point below:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22
-0.2001578 -0.2485784 -0.3008685 -0.005366991 0.2624246 0.3142725 0.3074037 0.3221539 0.3033765 0.3403944 0.3557642 0.3810387 0.4848038 0.2788213 0.544491 0.2838926 0.2899755 0.3963652 0.2594092 0.3083141 0.463528 1
The prediction is cluster 3; however, the Euclidean distances between the data point and the centroids are:
cluster 1: 10
cluster 2: 1.11
cluster 3: 1.39
cluster 4: 4.53
cluster 5: 9.97
Based on the calculation above, the data point should be assigned to cluster 2, not 3.
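For anyone who wants to re-check the arithmetic above, here is a minimal sketch in plain Python (not H2O code). The centroid and data-point values are copied from the question; the names centroids, point and euclidean are just illustrative.

```python
import math

# Centroid coordinates (X1..X22), copied from the table in the question.
centroids = {
    1: [-0.646544, -0.6322714, -0.5101907, -0.2980412, -1.6182105, -1.7939725,
        -1.8194372, -1.82349, -1.8174061, -1.8069266, -2.2213561, -2.2618561,
        -2.2170297, -2.2004509, -2.196722, -2.2267695, -2.2536694, -2.2653944,
        -2.1599764, -2.2074994, -1.9114193, -2.78e-16],
    2: [-0.2505012, -0.2582746, -0.2542313, -0.3205136, 0.2912933, 0.3239872,
        0.3236214, 0.3231876, 0.3234663, 0.309818, 0.362641, 0.3800735,
        0.3615138, 0.3542787, 0.350817, 0.3583391, 0.375764, 0.3715018,
        0.3533203, 0.3533025, 0.2651153, 3.72e-15],
    3: [0.4237044, 0.4421857, 0.408422, 0.6620773, 0.2371281, 0.2592748,
        0.2597783, 0.2782299, 0.258803, 0.3129833, 0.4157714, 0.3704712,
        0.3948566, 0.4137049, 0.4289137, 0.4229101, 0.3904031, 0.4323851,
        0.3984215, 0.442518, 0.5278553, 1.0],
    4: [2.2426614, 2.2450805, 2.0475964, 1.5666675, 0.2249847, 0.2887632,
        0.3391117, 0.3224008, 0.3375972, 0.3617759, 0.5063836, 0.4805747,
        0.5226613, 0.5097081, 0.5196333, 0.5136624, 0.4780912, 0.4686772,
        0.4743151, 0.5357567, 0.5734882, 0.824],
    5: [4.4718381, 4.5243432, 4.8917335, 5.223828, 0.2374653, 0.3096633,
        0.3215417, 0.3326531, 0.3189998, 0.414707, 0.5065842, 0.5113028,
        0.558864, 0.5482378, 0.543278, 0.5436269, 0.5204451, 0.5341745,
        0.5096259, 0.6486469, 0.6595461, 0.989],
}

# The data point from the question (X1..X22).
point = [-0.2001578, -0.2485784, -0.3008685, -0.005366991, 0.2624246,
         0.3142725, 0.3074037, 0.3221539, 0.3033765, 0.3403944, 0.3557642,
         0.3810387, 0.4848038, 0.2788213, 0.544491, 0.2838926, 0.2899755,
         0.3963652, 0.2594092, 0.3083141, 0.463528, 1.0]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Distance from the point to every centroid, then the closest one.
distances = {k: euclidean(point, c) for k, c in centroids.items()}
nearest = min(distances, key=distances.get)

for k in sorted(distances):
    print(f"cluster {k}: {distances[k]:.2f}")
print("nearest centroid:", nearest)
```

On the values as posted, this agrees with the distances listed above: cluster 2 is the nearest centroid, with cluster 3 second.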
Is this a bug, or does h2o.kmeans() use something other than Euclidean distance for prediction?
Thank you.
Yes, as stated in the K-Means documentation, it uses Euclidean distance.
If you can provide a reproducible example showing that this is a bug, please file a bug report. Thanks!
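One thing that may be worth ruling out before filing a report (this is a guess, not something confirmed in this thread): h2o.kmeans() has a standardize argument which, as far as I know, is enabled by default, so distances may be computed on columns that H2O has rescaled itself rather than on the values passed in. Most columns in the question were already scaled, but X22 looks like a 0/1 indicator, so any further rescaling would mostly affect that column. The sketch below is plain Python, not H2O code, and the standard deviation used for X22 (0.4) is a made-up value for illustration only. It just shows that rescaling one column can be enough to move the nearest centroid from 2 to 3.

```python
import math

# Centroids 2 and 3 and the data point, copied from the question (X1..X22).
c2 = [-0.2505012, -0.2582746, -0.2542313, -0.3205136, 0.2912933, 0.3239872,
      0.3236214, 0.3231876, 0.3234663, 0.309818, 0.362641, 0.3800735,
      0.3615138, 0.3542787, 0.350817, 0.3583391, 0.375764, 0.3715018,
      0.3533203, 0.3533025, 0.2651153, 3.72e-15]
c3 = [0.4237044, 0.4421857, 0.408422, 0.6620773, 0.2371281, 0.2592748,
      0.2597783, 0.2782299, 0.258803, 0.3129833, 0.4157714, 0.3704712,
      0.3948566, 0.4137049, 0.4289137, 0.4229101, 0.3904031, 0.4323851,
      0.3984215, 0.442518, 0.5278553, 1.0]
point = [-0.2001578, -0.2485784, -0.3008685, -0.005366991, 0.2624246,
         0.3142725, 0.3074037, 0.3221539, 0.3033765, 0.3403944, 0.3557642,
         0.3810387, 0.4848038, 0.2788213, 0.544491, 0.2838926, 0.2899755,
         0.3963652, 0.2594092, 0.3083141, 0.463528, 1.0]


def scaled_distance(a, b, scales):
    """Euclidean distance after dividing each column difference by its scale."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scales)))


def nearest_of_2_and_3(scales):
    """Return 2 or 3, whichever centroid is closer under the given scales."""
    d2 = scaled_distance(point, c2, scales)
    d3 = scaled_distance(point, c3, scales)
    return 2 if d2 < d3 else 3


# No extra scaling: every column is used as posted.
raw_scales = [1.0] * 22

# ASSUMPTION: X1..X21 already have unit variance, and X22 has a standard
# deviation of 0.4. The 0.4 is hypothetical, not taken from the question.
assumed_scales = [1.0] * 21 + [0.4]

print("nearest on the values as posted:", nearest_of_2_and_3(raw_scales))
print("nearest with X22 rescaled:", nearest_of_2_and_3(assumed_scales))
```

If something like this is going on, comparing predictions against a model fit with standardize = FALSE, or computing the distances in the same scaled space the model uses, would be a quick way to check before treating it as a bug.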
Related
Potential Bug in MATLAB regress R2014a
MATLAB R2014a used to work fine with regress, but now I get an error even though the variables are fine and the rank is satisfactory.

    X = rand([10 3])
    X =
        0.8407    0.3517    0.0759
        0.2543    0.8308    0.0540
        0.8143    0.5853    0.5308
        0.2435    0.5497    0.7792
        0.9293    0.9172    0.9340
        0.3500    0.2858    0.1299
        0.1966    0.7572    0.5688
        0.2511    0.7537    0.4694
        0.6160    0.3804    0.0119
        0.4733    0.5678    0.3371
    K>> Y = rand([10 1])
    Y =
        0.1622
        0.7943
        0.3112
        0.5285
        0.1656
        0.6020
        0.2630
        0.6541
        0.6892
        0.7482
    [B,BINT] = regress(Y,X)
    Subscript indices must either be real positive integers or logicals.
    Error in regress (line 93)
    b(perm) = R \ (Q'*y);

Obviously, X and Y are fine. There's something wrong with the matrix math in regress: perm is, for some reason, being output as a vector (giving an index error). A few lines above, qr is called like this, with no further modification to perm:

    [Q,R,perm] = qr(X,0);

The help file says qr is supposed to output a third argument that's a matrix, but how can that be if the math always expects a vector?

    % [Q,R,E] = QR(A) produces unitary Q, upper triangular R and a
    % permutation matrix E so that A*E = Q*R. The column permutation E is
    % chosen so that ABS(DIAG(R)) is decreasing.

This is very confusing considering both of these are built-in functions. I literally re-installed MATLAB R2014a and a few toolboxes and STILL get this error. It feels like qr got updated to output a different argument, but I don't understand why a fresh reinstall wouldn't take care of this, or why qr would have been updated at all. Everything else in my MATLAB works great. Any ideas?
3d Graphing Data in matlab and changing the axis values
I am trying to make a 3D graph from an Excel file that is 343 by 81 cells. The first column needs to be X, the first row needs to be Y, and the remaining matrix needs to be Z. I have successfully imported the data from Excel, and I create a (343,1) matrix from the first column called Energy (x-axis), a (1,81) row matrix called TimeDelay (y-axis), and a (343,81) matrix called AbsorbanceChange (z-axis) whose first column and first row are zero. I've got the 3D graph that I need, but I need the axes shown in the graph to be the Energy and TimeDelay values instead of the indices of the AbsorbanceChange matrix. The relevant portion of the code is below, along with a picture of the graph:

    EnergyString = dataArray{:, 1};
    EnergyString(1,1) = {'0'};
    Energy = str2double(EnergyString);
    %Energy = [ Energy, zeros(343, 80) ];
    TimeDelay = [ z1(1,1), z2(1,1), z3(1,1), z4(1,1), z5(1,1), z6(1,1), z7(1,1), z8(1,1), z9(1,1), z10(1,1), z11(1,1), z12(1,1), z13(1,1), z14(1,1), z15(1,1), z16(1,1), z17(1,1), z18(1,1), z19(1,1), z20(1,1), z21(1,1), z22(1,1), z23(1,1), z24(1,1), z25(1,1), z26(1,1), z27(1,1), z28(1,1), z29(1,1), z30(1,1), z31(1,1), z32(1,1), z33(1,1), z34(1,1), z35(1,1), z36(1,1), z37(1,1), z38(1,1), z39(1,1), z40(1,1), z41(1,1), z42(1,1), z42(1,1), z43(1,1), z44(1,1), z45(1,1), z46(1,1), z47(1,1), z48(1,1), z49(1,1), z50(1,1), z51(1,1), z52(1,1), z53(1,1), z54(1,1), z55(1,1), z56(1,1), z57(1,1), z58(1,1), z59(1,1), z60(1,1), z61(1,1), z62(1,1), z63(1,1), z64(1,1), z65(1,1), z66(1,1), z67(1,1), z68(1,1), z69(1,1), z70(1,1), z71(1,1), z72(1,1), z73(1,1), z74(1,1), z75(1,1), z76(1,1), z77(1,1), z78(1,1), z79(1,1), z80(1,1) ];
    %TimeDelay = [ TimeDelay; zeros(342, 81)];
    startRow formatSpec filename fileID delimiter ans EnergyString Alpha Beta Gamma Delta Epsilon Zeta Eta Theta Iota Kappa Lambda Mu Nu Xi Omicron Pi Rho Sigma Tau Upsilon Phi Chi Psi Omega AlphaAlpha AlphaBeta AlphaGamma AlphaDelta AlphaEpsilon AlphaZeta AlphaEta AlphaTheta AlphaIota AlphaKappa AlphaLambda AlphaMu AlphaNu AlphaXi AlphaOmicron AlphaPi AlphaRho AlphaSigma AlphaTau AlphaUpsilon AlphaPhi AlphaChi AlphaPsi AlphaOmega BetaAlpha BetaBeta BetaGamma BetaDelta BetaEpsilon BetaZeta BetaEta BetaTheta BetaIota BetaKappa BetaLambda BetaMu BetaNu BetaXi BetaOmicron BetaPi BetaRho BetaSigma BetaTau BetaUpsilon BetaPhi BetaChi BetaPsi BetaOmega GammaAlpha GammaBeta GammaGamma GammaDelta GammaEpsilon GammaZeta GammaEta GammaTheta; %Delete Excess Variable
    AbsorbanceChange = [ zeros(343, 1), z1, z2, z3, z4, z5, z6, z7, z8, z9, z10, z11, z12, z13, z14, z15, z16, z17, z18, z19, z20, z21, z22, z23, z24, z25, z26, z27, z28, z29, z30, z31, z32, z33, z34, z35, z36, z37, z38, z39, z40, z41, z42, z43, z44, z45, z46, z47, z48, z49, z50, z51, z52, z53, z54, z55, z56, z57, z58, z59, z60, z61, z62, z63, z64, z65, z66, z67, z68, z69, z70, z71, z72, z73, z74, z75, z76, z77, z78, z79, z80];
    AbsorbanceChange(1,:) = 0;
    clear z1 z2 z3 z4 z5 z6 z7 z8 z9 z10 z11 z12 z13 z14 z15 z16 z17 z18 z19 z20 z21 z22 z23 z24 z25 z26 z27 z28 z29 z30 z31 z32 z33 z34 z35 z36 z37 z38 z39 z40 z41 z42 z43 z44 z45 z46 z47 z48 z49 z50 z51 z52 z53 z54 z55 z56 z57 z58 z59 z60 z61 z62 z63 z64 z65 z66 z67 z68 z69 z70 z71 z72 z73 z74 z75 z76 z77 z78 z79 z80;
    mesh(AbsorbanceChange)
    colorbar
    title('WS2-Perovskite-image')
    xlabel('Energy') % x-axis label
    ylabel('Time-delay') % y-axis label
    zlabel('Absorbance Change')
When I type help mesh in MATLAB I see, among other things, this:

    mesh(x,y,Z) and mesh(x,y,Z,C), with two vector arguments replacing the first two matrix arguments, must have length(x) = n and length(y) = m where [m,n] = size(Z). In this case, the vertices of the mesh lines are the triples (x(j), y(i), Z(i,j)). Note that x corresponds to the columns of Z and y corresponds to the rows.

Thus, you can pass the axis vectors to mesh directly. Because your Z is 343-by-81, with Energy running along the rows and TimeDelay along the columns, transpose it so that Energy ends up on the x-axis:

    mesh(Energy, TimeDelay, AbsorbanceChange.');

I don't know how you read the data from the file, but there is a better way than specifying each cell individually in your code.
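To make the shape rule quoted from the help text concrete, here is a small sketch in plain Python (not MATLAB). The function mesh_vertices and the tiny 3-by-2 example are made up for illustration; they only mimic the rule length(x) = n, length(y) = m and the vertex triples (x(j), y(i), Z(i,j)). With Energy along the rows of Z, as in the question, the rule is only satisfied once Z is transposed.

```python
def mesh_vertices(x, y, Z):
    """Mimic the help-text rule: len(x) = n, len(y) = m, where Z is m-by-n.

    Returns the vertex triples (x[j], y[i], Z[i][j]).
    Hypothetical helper for illustration; this is not a MATLAB function.
    """
    m, n = len(Z), len(Z[0])
    if len(x) != n or len(y) != m:
        raise ValueError(f"need len(x) == {n} and len(y) == {m}, "
                         f"got {len(x)} and {len(y)}")
    return [(x[j], y[i], Z[i][j]) for i in range(m) for j in range(n)]


def transpose(Z):
    """Swap the rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*Z)]


# Toy stand-ins for the question's data: 3 energies, 2 time delays,
# and Z stored with one row per energy (3-by-2), like the 343-by-81 matrix.
energy = [1.0, 2.0, 3.0]
time_delay = [10.0, 20.0]
absorbance = [[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]]

# Passing Z as stored breaks the rule, because x would have to match the
# 2 columns of Z rather than its 3 rows.
try:
    mesh_vertices(energy, time_delay, absorbance)
    shape_error = False
except ValueError:
    shape_error = True

# After transposing, x runs along the columns of Z and the call is valid.
vertices = mesh_vertices(energy, time_delay, transpose(absorbance))
print(shape_error, len(vertices), vertices[0])
```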
difference between feval and predict in matlab
I am trying to learn a linear regression model in MATLAB. My variables are train_fv, train_fv_labels, test_fv and test_fv_labels, with sizes 333x9, 333x1, 167x9 and 167x1 respectively. I want to train the model, predict the labels on test_fv, and compare them with the actual labels given in test_fv_labels. I am using stepwise linear regression to get the best fit possible:

    mdl = stepwiselm(train_fv,train_fv_labels,'PEnter',0.001,'verbose',1)
    mdl1 = step(mdl,'upper','quadratic','verbose',1)

The output I get is as follows:

    1. Adding x5, FStat = 83.3108, pValue = 7.06324e-18
    2. Adding x1, FStat = 35.6014, pValue = 6.24096e-09
    3. Adding x7, FStat = 41.0932, pValue = 5.0338e-10
    4. Adding x5:x7, FStat = 33.3157, pValue = 1.81571e-08
    5. Adding x1:x5, FStat = 14.1821, pValue = 0.000196729

    mdl =
    Linear regression model:
        y ~ 1 + x1*x5 + x5*x7

    Estimated Coefficients:
                       Estimate      SE            tStat      pValue
                       __________    __________    _______    __________
        (Intercept)    0.0014532     5.5229e-05    26.312     9.9458e-83
        x1             0.00071972    0.00011402    6.3121     8.9595e-10
        x5             -0.0021179    0.00018102    -11.7      1.1938e-26
        x7             0.0011401     0.00022498    5.0678     6.7473e-07
        x1:x5          -0.0015096    0.00040087    -3.7659    0.00019673
        x5:x7          -0.0049673    0.00077872    -6.3788    6.0915e-10

    Number of observations: 333, Error degrees of freedom: 327
    Root Mean Squared Error: 0.001
    R-squared: 0.442, Adjusted R-Squared 0.434
    F-statistic vs. constant model: 51.9, p-value = 1.65e-39

    6. Adding x5^2, FStat = 63.1344, pValue = 3.17359e-14

    mdl1 =
    Linear regression model:
        y ~ 1 + x1*x5 + x5*x7 + x5^2

    Estimated Coefficients:
                       Estimate      SE            tStat      pValue
                       __________    __________    _______    __________
        (Intercept)    0.0011415     6.4043e-05    17.825     4.3107e-50
        x1             0.00071722    0.00010452    6.8618     3.4339e-11
        x5             -0.0018651    0.00016896    -11.039    2.7782e-24
        x7             0.0011951     0.00020635    5.7915     1.6426e-08
        x1:x5          -0.0019348    0.00037135    -5.2101    3.354e-07
        x5:x7          -0.0045341    0.00071592    -6.3332    7.9578e-10
        x5^2           0.0033789     0.00042525    7.9457     3.1736e-14

    Number of observations: 333, Error degrees of freedom: 326
    Root Mean Squared Error: 0.000921
    R-squared: 0.533, Adjusted R-Squared 0.524
    F-statistic vs. constant model: 61.9, p-value = 5.33e-51

So for regression using the mdl model I have the function y ~ 1 + x1*x5 + x5*x7, and for mdl1 I have y ~ 1 + x1*x5 + x5*x7 + x5^2.

But when I try to predict the values using the test set with feval, I get an error:

    test_fv_labels = feval(mdl1,test_fv);
    Predictor data matrix must have 5 columns.

If I use the predict function instead of feval, I do not get an error:

    test_fv_labels = predict(mdl1,test_fv);

Why is this so? Please tell me where I am going wrong, and what the difference is between the predict and feval commands in MATLAB.
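For readers unfamiliar with the formula notation in the output above: y ~ 1 + x1*x5 + x5*x7 + x5^2 expands to an intercept, the main effects x1, x5 and x7, the interactions x1:x5 and x5:x7, and the squared term x5^2, which matches the seven rows of the mdl1 coefficient table. Below is a minimal Python sketch of evaluating that fitted equation by hand for one 9-column row, using the coefficients printed for mdl1. The names COEF and predict_row and the example row are made up; this is not what MATLAB's predict or feval run internally, and it does not by itself explain the feval error.

```python
# Coefficients printed for mdl1 in the question.
COEF = {
    "intercept": 0.0011415,
    "x1": 0.00071722,
    "x5": -0.0018651,
    "x7": 0.0011951,
    "x1:x5": -0.0019348,
    "x5:x7": -0.0045341,
    "x5^2": 0.0033789,
}


def predict_row(row):
    """Evaluate the mdl1 equation for one observation.

    `row` holds the 9 predictor values x1..x9 in order; only x1, x5 and x7
    appear in the fitted model, so the other columns are ignored.
    """
    x1, x5, x7 = row[0], row[4], row[6]
    return (COEF["intercept"]
            + COEF["x1"] * x1
            + COEF["x5"] * x5
            + COEF["x7"] * x7
            + COEF["x1:x5"] * x1 * x5
            + COEF["x5:x7"] * x5 * x7
            + COEF["x5^2"] * x5 ** 2)


# Hypothetical example row (not taken from test_fv).
example_row = [0.2, 0.0, 0.0, 0.0, 0.1, 0.0, 0.3, 0.0, 0.0]
print(predict_row(example_row))
```

Note that although the model was trained on a 9-column matrix, only three of those columns survive in the final equation.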
Generating a vector of symbolic variables in Matlab
I want to generate a symbolic vector p, with each element a symbolic variable: p = [p1; p2; ...; pn]; I don't want to type syms p1 p2 ... because I have ~100 such variables. Is there a way to generate them automatically?
Yup. Use sym like so:

    p = sym('p', [100 1]);

This syntax creates a vector of symbolic variables whose names are p followed by an integer. Here we create 100 of them, which gives a symbolic vector from p1 up to p100; simply change the 100 to however many you want. This is what p looks like:

    >> p
    p =
    p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16 p17 p18 p19 p20 p21 p22 p23 p24 p25 p26 p27 p28 p29 p30 p31 p32 p33 p34 p35 p36 p37 p38 p39 p40 p41 p42 p43 p44 p45 p46 p47 p48 p49 p50 p51 p52 p53 p54 p55 p56 p57 p58 p59 p60 p61 p62 p63 p64 p65 p66 p67 p68 p69 p70 p71 p72 p73 p74 p75 p76 p77 p78 p79 p80 p81 p82 p83 p84 p85 p86 p87 p88 p89 p90 p91 p92 p93 p94 p95 p96 p97 p98 p99 p100
Associated Labels in a dendrogram plot - MATLAB
I have the following set of data stored in the file stations.dat:

    Station A 305.2 321.1 420.9 383.5 311.7 197.1 160.2 113.9 60.5 60.5 64.8 154.3
    Station B 281.1 304.0 353.1 231.9 84.6 20.9 11.7 11.9 31.1 75.8 133.0 235.3
    Station C 312.3 342.2 366.2 335.2 200.1 74.4 45.9 27.5 24.0 53.6 87.7 177.0
    Station D 402.2 524.5 554.9 529.5 347.5 176.8 120.2 35.0 12.6 13.3 14.0 61.6
    Station E 261.3 262.7 282.3 232.6 103.8 33.2 16.7 33.2 111.0 149.0 184.8 227.0

By using the following commands,

    Z = linkage (stations.data,'ward','euc');
    figure (1), dendrogram(Z,0,'orientation', 'right')

I get the figure below:

So the components of cluster 1 are 4, 3, 1 (Stations D, C and A, respectively) and those of cluster 2 are 5, 2 (Stations E and B). I want to put the station names on the plot, but if I use the command

    set (gca,'YTickLabel', stations.textdata);

the figure I get is the following:

How can I associate the data with the respective names and plot them in the dendrogram? I have data for 144 stations; I used only 5 for illustration.
Try the following:

    ind = str2num(get(gca,'YTickLabel'));
    set(gca, 'YTickLabel',stations.textdata(ind))

An easier way would be to specify the labels of the data points in the dendrogram call directly:

    dendrogram(Z,0, 'Orientation','right', 'Labels',stations.textdata)
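The reason the first snippet works is that the tick labels on the dendrogram axis are the leaf indices in plot order, so the names have to be looked up through those indices rather than assigned in file order. A minimal Python sketch of that lookup follows (plain Python, not MATLAB). The tick order 4, 3, 1, 5, 2 is taken from the cluster listing in the question and is only assumed to be the order on the axis, and the variable names are made up.

```python
# Station names in file order, i.e. data row 1..5.
station_names = ["Station A", "Station B", "Station C", "Station D", "Station E"]

# ASSUMPTION: the tick labels read back from the axis, in plot order.
# The order 4, 3, 1, 5, 2 is inferred from the question, not from the figure.
tick_labels = ["4", "3", "1", "5", "2"]

# Convert the label strings to 1-based indices (the role str2num plays above).
ind = [int(label) for label in tick_labels]

# Look up each name through its leaf index (1-based, hence the -1).
reordered = [station_names[i - 1] for i in ind]

# Assigning the names in file order instead would mislabel most leaves.
mislabeled = [name for name, right in zip(station_names, reordered)
              if name != right]

print(reordered)
print(len(mislabeled), "of", len(station_names), "leaves would be mislabeled")
```

Passing the names through the 'Labels' option, as in the second snippet of the answer, avoids doing this lookup by hand.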