Decide best 'k' in k-means algorithm in weka

I am using the k-means algorithm for clustering, but I am not sure how to decide the optimal value of k based on the results.
For example, I have applied k-means to a dataset with k=10:
kMeans
======
Number of iterations: 16
Within cluster sum of squared errors: 38.47923197081721
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster    Size  RI      Na       Mg      Al      Si       K       Ca       Ba      Fe      Type
Full Data  214   1.5184  13.4079  2.6845  1.4449  72.6509  0.4971  8.957    0.175   0.057   build wind non-float
0          16    1.5181  12.9988  3.4894  1.1844  72.785   0.4794  8.8069   0.015   0.2238  build wind float
1          9     1.5175  14.6467  1.3056  1.3667  73.2067  0       9.3567   0       0       tableware
2          13    1.5189  12.8277  0.7738  2.0338  72.3662  1.47    10.1238  0.1877  0.0608  containers
3          23    1.5178  13.2148  3.4261  1.3552  72.6526  0.527   8.5648   0.023   0.2013  build wind non-float
4          46    1.5172  13.1896  3.4987  1.4898  72.6989  0.59    8.3041   0.003   0.0104  build wind non-float
5          12    1.519   13.63    3.4917  1.3308  72.07    0.4108  8.87     0.0667  0.0167  build wind float
6          11    1.5255  12.6318  0.2145  1.1891  72.0709  0.2345  13.1291  0.2864  0.1109  build wind non-float
7          40    1.5175  13.0518  3.4958  1.2617  72.9532  0.547   8.5035   0       0.011   build wind float
8          15    1.5222  13.9107  3.8273  0.716   71.7467  0.1007  9.5887   0       0.0313  build wind float
9          29    1.5171  14.4421  0.5383  2.1228  72.9659  0.3252  8.4914   1.04    0.0134  headlamps

There are various methods for deciding the optimal value of "k" in the k-means algorithm: the thumb rule, the elbow method, the silhouette method, etc. In my work I followed the result obtained from the elbow method and had success with it; I did all the analysis in R.
Here is a link to a description of those methods: link. Follow the sub-links there, build the code for any one of the methods, and apply it to your data. A minimal sketch of the elbow method is given below.
I hope this helps. All the best with your work.
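For illustration, here is a minimal sketch of the elbow method in R. It assumes your data sits in a numeric data frame df, and it uses base R's kmeans() rather than Weka's SimpleKMeans:
set.seed(1)
k.max <- 15
# total within-cluster sum of squares for each candidate k
wss <- sapply(1:k.max, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:k.max, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# pick the k at the "elbow", where adding more clusters stops
# reducing the within-cluster sum of squares appreciably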

Related

mgcv: Difference between s(x, by=cat) and s(cat, bs='re')

What is the difference between adding a by= parameter to a smooth and adding a random effect smooth?
I've tried both, and get different results. E.g.:
library(mgcv)
library(tibble)   # tibble() is not provided by mgcv
set.seed(26)
gam.df <- tibble(y = rnorm(400),
                 x1 = rnorm(400),
                 cat = factor(rep(1:4, each = 100)))
gam0 <- gam(y ~ s(x1, by = cat), data = gam.df)
summary(gam0)
produces:
15:15:39> summary(gam0)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1, by = cat)
Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001275   0.049087  -0.026    0.979

Approximate significance of smooth terms:
           edf Ref.df     F p-value
s(x1):cat1   1      1 7.437 0.00667 **
s(x1):cat2   1      1 0.047 0.82935
s(x1):cat3   1      1 0.393 0.53099
s(x1):cat4   1      1 0.019 0.89015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = 0.00968   Deviance explained = 1.96%
GCV = 0.97413   Scale est. = 0.96195   n = 400
On the other hand:
gam1 <- gam(y ~ s(x1) + s(cat, bs='re'), data=gam.df)
summary(gam1)
produces:
15:16:33> summary(gam1)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1) + s(cat, bs = "re")
Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0001211  0.0572271   0.002    0.998

Approximate significance of smooth terms:
          edf Ref.df     F p-value
s(x1)  1.0000      1 2.359   0.125
s(cat) 0.7883      3 0.356   0.256

R-sq.(adj) = 0.00594   Deviance explained = 1.04%
GCV = 0.97236   Scale est. = 0.96558   n = 400
I understand that by= shows the summary by each factor level, but shouldn't the overall results of the model such as R^2 be the same?
The factor-by model, gam0, contains a separate smooth of x1 for each level of cat, but doesn't include anything specifically for the means of y in each group[*], so it is misspecified. Compare this with gam1, which has a single smooth of x1 plus group means for the levels of cat.
Even though you generated random data without any smooth or group-level effects, gam0 is a potentially much more complex and flexible model: it contains 4 separate smooths, each using up to 9 degrees of freedom. gam1 has a single smooth of x1 which uses up to 9 degrees of freedom, plus between 0 and 4 degrees of freedom for the random-effect smooth. gam0 is simply exploiting random variation in the data that can be explained a little by those extra potential degrees of freedom: it explains roughly twice the deviance that gam1 does, yet its R-sq.(adj) is only marginally higher, because the adjustment penalises the extra complexity (not that either model explains a good amount of deviance).
r$> library("gratia")
r$> smooths(gam0)
[1] "s(x1):cat1" "s(x1):cat2" "s(x1):cat3" "s(x1):cat4"
r$> smooths(gam1)
[1] "s(x1)" "s(cat)"
[*] Note that your by model should be
gam0 <- gam(y ~ cat + s(x1, by=cat), data=gam.df)
because the smooths created by s(x1, by=cat) are subject to an identifiability constraint (as there's a constant term, the intercept, in the model). This is a sum-to-zero constraint, which means the individual smooths do not contain the group means. It leaves the smooths trying to model not only how y changes as a function of x1 in each group but also the magnitude of y in each group, yet without any functions in the span of the basis that could represent such constant (magnitude) effects.
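To see the effect of adding the group means, a quick sketch (gam0b is just a name used here; gam.df is the simulated data from the question):
gam0b <- gam(y ~ cat + s(x1, by = cat), data = gam.df)
summary(gam0b)
AIC(gam0, gam0b, gam1)   # rough comparison of the three specifications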

Translating chemical equations from article, results differ (Matlab)

I've been trying to translate a set of chemical equations into MATLAB code, to be able to solve for different chemical species. I have the approximate solution (as it's read from a graph), but after entering all the data and checking multiple times I still haven't been able to find what is wrong. The source for the graph/equations is the article The chemistry of co-injected BOE; the graph I want to reproduce later on is figure 2 in the paper.
The results I get for 10cc, 40cc and 90cc, versus the article's approximate values, are:
10cc: HF 43%, H2F2 48%, F- 3%, HF2- 6% (article: ~28%, 63%, 2%, 7%)
40cc: HF 35%, H2F2 33%, F- 14%, HF2- 18% (article: ~24%, 44%, 6%, 26%)
90cc: HF 21%, H2F2 12%, F- 37%, HF2- 30% (article: ~18%, 23%, 20%, 45%)
The script is the following:
clc;
clear all;
%Units to be used
%Volume is in CC also cm^3, 1 litre is 1000 CC, 1 cc = 1 ml
%density is in g/cm^3
%weight percentages are in fractions of 0 to 1
%Molecular weight is in g/mol
% pts=10; %number of points for linear spacing
%weight percentages of NH4OH and HF
xhf=0.49;
xnh3=0.28;
%H2O
Vh2o=1800;
dh2o=1.00; %0.997 at 25C when rounded 1
mh2o=18.02;
%HF values
Vhf=100;
dhf49=1.15;
dhf=dh2o+(dhf49-dh2o)*xhf/0.49; %# 25C
Mhf=20.01;
nhf=mols(Vhf,dhf,xhf,Mhf);
%NH4OH (NH3) values
% Vnh3=linspace(0.1*Vhf,1.9*Vhf,pts);
Vnh3=10;
dnh3=0.9; %for ~20-31% #~20-25C
Mnh3=17.03; %The wt% of NH4OH actually refers to the wt% of NH3 dissolved in H2O
nnh3=mols(Vnh3,dnh3,xnh3,Mnh3);
if max(nnh3) >= nhf
    error(['There are more mols NH4OH,',num2str(max(nnh3)),', than mols HF,',num2str(nhf),'.'])
end
%% Calculations for species
Vt=(Vhf+Vh2o+Vnh3)/1000; %litre
A=nhf/Vt; %mol/l
B=nnh3/Vt; %mol/l
syms HF F H2F2 HF2 NH3 NH4 H OH
eq2= H*F/HF==6.85*10^(-4);
eq3= NH3*H/NH4==6.31*10^(-10);
eq4= H*OH==10^(-14);
eq5= HF2/(HF*F)==3.963;
eq6= H2F2/(HF^2)==2.7;
eq7= H+NH4==OH+F+HF2;
eq8= HF+F+2*H2F2+2*HF2==A;
eq9= NH3+NH4==B;
eqns=[eq2,eq5,eq6,eq8,eq4,eq3,eq9,eq7];
varias=[HF, F, H2F2, HF2, NH3, NH4, H, OH];
assume(HF> 0 & F>= 0 & H2F2>= 0 & HF2>= 0& NH3>= 0 & NH4>= 0 & H>= 0 & OH>= 0)
[HF, F, H2F2, HF2, NH3, NH4, H, OH]=vpasolve(eqns,varias);% [0 max([A,B])])
totalHF=double(HF)+double(F)+double(H2F2)+double(HF2);
HFf=double(HF)/totalHF %fraction of species for HF
H2F2f=double(H2F2)/totalHF %fraction of species for H2F2
Ff=double(F)/totalHF %fraction of species for F-
HF2f=double(HF2)/totalHF %fraction of species for HF2-
An extra function, mols.m, is also needed:
%%%% amount of mol, Vol=volume, d=density, pwt=%weight, M=molecularweight
function mol=mols(Vol, d, pwt, M)
mol=(Vol*d*pwt)/M;
end
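As a quick sanity check of mols() with the HF values above: dhf evaluates to 1.15 g/cm^3, so nhf = (100 * 1.15 * 0.49) / 20.01 ≈ 2.82 mol.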
The equations being used are from the article; note that (HF)2 in the article is H2F2 in my script.
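Written out, the system encoded in eq2 through eq9 above is (brackets denote concentrations in mol/l; A and B are the totals computed in the script):

$$\frac{[\mathrm{H^+}][\mathrm{F^-}]}{[\mathrm{HF}]} = 6.85\times10^{-4},\qquad \frac{[\mathrm{NH_3}][\mathrm{H^+}]}{[\mathrm{NH_4^+}]} = 6.31\times10^{-10},\qquad [\mathrm{H^+}][\mathrm{OH^-}] = 10^{-14}$$

$$\frac{[\mathrm{HF_2^-}]}{[\mathrm{HF}][\mathrm{F^-}]} = 3.963,\qquad \frac{[\mathrm{H_2F_2}]}{[\mathrm{HF}]^2} = 2.7$$

$$[\mathrm{H^+}] + [\mathrm{NH_4^+}] = [\mathrm{OH^-}] + [\mathrm{F^-}] + [\mathrm{HF_2^-}],\qquad [\mathrm{HF}] + [\mathrm{F^-}] + 2[\mathrm{H_2F_2}] + 2[\mathrm{HF_2^-}] = A,\qquad [\mathrm{NH_3}] + [\mathrm{NH_4^+}] = B$$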
So it appears the issue wasn't so much with Matlab; I had some help in that area as well.
The final solution and updated Matlab code can be found here:
https://chemistry.stackexchange.com/questions/98306/why-do-my-equilibrium-calculations-on-this-hf-nh4oh-buffer-system-not-match-thos

linear combination of curves to match a single curve with integer constraints

I have a set of vectors (curves) which I would like to match to a single curve. The issue isn't only finding the linear combination of the set of curves that most closely matches the single curve (this can be done with least squares, Ax = b). I need to be able to add constraints, for example limiting the number of curves used in the fit to a particular number, or requiring that the chosen curves lie next to each other. These constraints would be handled in mixed-integer programming.
I have started by using lsqlin, which allows constraints, and have been able to restrict the variables to be > 0.0, but I am at a loss as to how to add further constraints. Is there a way to add integer constraints to least squares, or alternatively is there a way to solve this with a MILP?
Any help in the right direction is much appreciated!
Edit: Based on the suggestion by ErwinKalvelagen I am attempting to use CPLEX and its quadratic solvers, but so far I have not managed to get it working. I have created a minimal 'notworking' example and have uploaded the data here; the code is below. The issue is that Matlab's LS solver lsqlin is able to solve the problem, whereas CPLEX's cplexlsqnonneglin returns CPLEX Error 5002: %s is not convex for the same problem.
function [] = minWorkingLSexample()
%MINWORKINGLSEXAMPLE minimal LS example for Matlab and CPLEX
%Matlab is able to solve the least squares problem; CPLEX returns:
%   Error using cplexlsqnonneglin
%   CPLEX Error 5002: %s is not convex.
%
%   Error in Backscatter_Transform_excel2_readMut_LINPROG_CPLEX (line 203)
%     cplexlsqnonneglin (C,d);
%
load('C_n_d_2.mat')   % provides the matrix C and the vector d
lb = zeros(size(C,2),1);
options = optimoptions('lsqlin','Algorithm','trust-region-reflective');
[fact2,resnorm,residual,exitflag,output] = ...
    lsqlin(C,d,[],[],[],[],lb,[],[],options);
%% CPLEX
ctype = cellstr(repmat('C',1,size(C,2)));
options = cplexoptimset;
options.Display = 'on';
[fact3, resnorm, residual, exitflag, output] = ...
    cplexlsqnonneglin (C,d);
end
I could reproduce the Cplex problem. Here is a workaround. Instead of solving the first model, use a model that is less nonlinear:
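The two models referred to here were shown as images in the original answer; reconstructing from the surrounding description, the first is the non-negative least-squares problem with the quadratic objective

$$\min_{x}\ \|Cx - d\|_2^2 \quad \text{subject to}\quad x \ge 0$$

and the workaround replaces it with a model that introduces explicit residual variables e, moving C into the constraints:

$$\min_{x,\,e}\ e^{\mathsf{T}}e \quad \text{subject to}\quad e = Cx - d,\quad x \ge 0$$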
The second model solves fine with Cplex. The problem is essentially a tolerance/numerical issue: for the second model we have a much better-behaved Q matrix (a diagonal). Essentially we moved some of the complexity from the objective into the linear constraints.
You should now see something like:
Tried aggregator 1 time.
QP Presolve eliminated 1 rows and 1 columns.
Reduced QP has 401 rows, 443 columns, and 17201 nonzeros.
Reduced QP objective Q matrix has 401 nonzeros.
Presolve time = 0.02 sec. (1.21 ticks)
Parallel mode: using up to 8 threads for barrier.
Number of nonzeros in lower triangle of A*A' = 80200
Using Approximate Minimum Degree ordering
Total time for automatic ordering = 0.00 sec. (3.57 ticks)
Summary statistics for Cholesky factor:
Threads = 8
Rows in Factor = 401
Integer space required = 401
Total non-zeros in factor = 80601
Total FP ops to factor = 21574201
Itn Primal Obj Dual Obj Prim Inf Upper Inf Dual Inf
0 3.3391791e-01 -3.3391791e-01 9.70e+03 0.00e+00 4.20e+04
1 9.6533667e+02 -3.0509942e+03 1.21e-12 0.00e+00 1.71e-11
2 6.4361775e+01 -3.6729243e+02 3.08e-13 0.00e+00 1.71e-11
3 2.2399862e+01 -6.8231454e+01 1.14e-13 0.00e+00 3.75e-12
4 6.8012056e+00 -2.0011575e+01 2.45e-13 0.00e+00 1.04e-12
5 3.3548410e+00 -1.9547176e+00 1.18e-13 0.00e+00 3.55e-13
6 1.9866256e+00 6.0981384e-01 5.55e-13 0.00e+00 1.86e-13
7 1.4271894e+00 1.0119284e+00 2.82e-12 0.00e+00 1.15e-13
8 1.1434804e+00 1.1081026e+00 6.93e-12 0.00e+00 1.09e-13
9 1.1163905e+00 1.1149752e+00 5.89e-12 0.00e+00 1.14e-13
10 1.1153877e+00 1.1153509e+00 2.52e-11 0.00e+00 9.71e-14
11 1.1153611e+00 1.1153602e+00 2.10e-11 0.00e+00 8.69e-14
12 1.1153604e+00 1.1153604e+00 1.10e-11 0.00e+00 8.96e-14
Barrier time = 0.17 sec. (38.31 ticks)
Total time on 8 threads = 0.17 sec. (38.31 ticks)
QP status(1): optimal
Cplex Time: 0.17sec (det. 38.31 ticks)
Optimal solution found.
Objective : 1.115360
See here for some details.
Update: the same reformulation can also be set up directly in Matlab.

How to find subset selection for linear regression model?

I am working with the mtcars dataset and using linear regression:
data(mtcars)
fit <- lm(mpg ~ ., data = mtcars); summary(fit)
When I fit the model with lm, it shows this result:
Call:
lm(formula = mpg ~ ., data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-3.5087 -1.3584 -0.0948  0.7745  4.6251
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.87913   20.06582   1.190   0.2525
cyl6        -2.64870    3.04089  -0.871   0.3975
cyl8        -0.33616    7.15954  -0.047   0.9632
disp         0.03555    0.03190   1.114   0.2827
hp          -0.07051    0.03943  -1.788   0.0939 .
drat         1.18283    2.48348   0.476   0.6407
wt          -4.52978    2.53875  -1.784   0.0946 .
qsec         0.36784    0.93540   0.393   0.6997
vs1          1.93085    2.87126   0.672   0.5115
amManual     1.21212    3.21355   0.377   0.7113
gear4        1.11435    3.79952   0.293   0.7733
gear5        2.52840    3.73636   0.677   0.5089
carb2       -0.97935    2.31797  -0.423   0.6787
carb3        2.99964    4.29355   0.699   0.4955
carb4        1.09142    4.44962   0.245   0.8096
carb6        4.47757    6.38406   0.701   0.4938
carb8        7.25041    8.36057   0.867   0.3995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.833 on 15 degrees of freedom
Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
I found that none of the variables is marked as significant at the 0.05 significance level.
To find the significant variables, I want to do subset selection to find the best combination of variables as predictors for the response mpg.
The function regsubsets in the package leaps does best subset regression (see ?leaps). Adapting your code:
library(leaps)
regfit <- regsubsets(mpg ~ ., data = mtcars)
summary(regfit)
# or for a more visual display
plot(regfit, scale = "Cp")
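If you want to pull the selected model out programmatically rather than read it off the plot, something along these lines works (reg.sum is just a name used here):
reg.sum <- summary(regfit)
best <- which.min(reg.sum$cp)   # size of the subset with the lowest Mallows' Cp
coef(regfit, best)              # coefficients of the best subset of that size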

Adaboost weka True positive vs False positive recognition issue

I am using the AdaBoost M1 algorithm in the Weka Experiment Environment with the default setup:
Runs (1-10) -> 10 runs, to provide more statistically significant results
Random Split Result Producer -> I use the train percentage to divide the training data from the evaluation data
Now, the problem is with the weighted average TP and FP results. I get this:
TP: 0.8
FP: 0.47
But as far as I am aware, if the TP rate is 0.8, shouldn't the FP rate be as low as 0.2?
I assume this has something to do with the 10 runs, but even if the values are averaged over those runs, shouldn't the FP rate still be much lower?
Sorry if this is too simple a question, but by my logic this looks like an error in the Weka toolkit; am I wrong? Thanks
EDIT:
To avoid asking a new question, and because this is related to the same problem: can anyone explain what the weighted average values displayed in Weka are?
I have included Atilla's example below: it can be seen that the weighted averages are not plain averages, e.g. AVG(0.933, 0.422) != 0.77.
Can someone explain what these values actually are?
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.933    0.578    0.776      0.933   0.847      0.429  0.844     0.917     tested_negative
               0.422    0.067    0.745      0.422   0.538      0.429  0.844     0.696     tested_positive
Weighted Avg.  0.77     0.416    0.766      0.77    0.749      0.429  0.844     0.847
I ran AdaBoostM1 with default parameters on the diabetes data set that ships with Weka and got the following results.
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.933    0.578    0.776      0.933   0.847      0.429  0.844     0.917     tested_negative
               0.422    0.067    0.745      0.422   0.538      0.429  0.844     0.696     tested_positive
Weighted Avg.  0.77     0.416    0.766      0.77    0.749      0.429  0.844     0.847
Notice that the TP rate and FP rate are reported for each of your class values. Since this data set has two (2) values for the class feature, there are two (2) lines.
Also notice that:
0.933 + 0.067 = 1
0.578 + 0.422 = 1
As you correctly pointed out, the TP rate of one class and the FP rate of the other class sum to one (1). So in your example, I assume you have the following class variable:
target {A,B}
TP Rate  FP Rate
0.8      0.47     ..... for A
0.53     0.2      ..... for B
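As for the weighted averages: Weka weights each per-class metric by the number of test instances in that class, so larger classes count for more than a plain mean over classes would give them. A small R sketch of that computation (the class counts here are hypothetical, for illustration only):
# per-class TP rates from the detailed accuracy table
tp_rate <- c(tested_negative = 0.933, tested_positive = 0.422)
# hypothetical test-set class counts
n <- c(tested_negative = 350, tested_positive = 162)
sum(tp_rate * n) / sum(n)   # weighted average; ~0.77 with these counts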