Why does the CLARA clustering method not give the same classes as when I do the clustering manually?

I am using CLARA (in the 'cluster' package). This method is supposed to assign each observation to the closest medoid. But when I calculate the distances between medoids and observations manually and assign the observations myself, the results differ slightly (about 1-2 percent of the observations end up in a different class). Does anyone know how clara calculates dissimilarities and why I get different clustering results?
This is the function I use to do clustering manually:
Manual.Clustering <- function(Data, Clusters, Weights=NULL) {
  ## Data:     original data (observations in rows)
  ## Clusters: t(Medoids), i.e. one medoid per column
  ## Weights:  optional variable weights
  if (is.null(Weights)) Weights <- rep(1, length(Data));
  if (length(Weights) == 1) Weights <- rep(Weights, length(Data));
  Data2 <- Data[, rownames(Clusters)];
  Data2 <- Weights * Data2;
  ## Euclidean distance of every observation to every medoid
  dist <- matrix(NA, nrow=nrow(Data), ncol=ncol(Clusters));
  for (i in 1:ncol(Clusters)) {
    dist[, i] <- Dist2Center(Data2, Clusters[, i], Weights=NULL);
  }
  ## assign each observation to the closest medoid
  classes <- apply(dist, 1, which.min);
  Out <- cbind(Data, classes);
  colnames(Out) <- c(colnames(Data), "Class");
  Freq <- FreqTable(Out[, "Class"]);  # FreqTable: the asker's own frequency-table helper (not shown)
  Freq <- as.data.frame(Freq);
  return(list(Data=Out, Freq=Freq));
}
=====================================
Dist2Center <- function(Data, Center, Weights=NULL) {
  ## Euclidean distance of each row of Data to the vector Center,
  ## ignoring missing values (na.rm = TRUE)
  if (is.null(Weights)) Weights <- matrix(rep(1, nrow(Data)), ncol=1);
  if (length(Weights) == 1) Weights <- rep(Weights, nrow(Data));
  if (ncol(Data) != length(Center)) stop("Data and Center have incompatible dimensions");
  Dist <- Weights * apply(Data, 1, function(x) { sqrt(sum((x - Center)^2, na.rm=TRUE)) });
  return(Dist);
}
Data: Original Data.
Clusters: t(Medoids).
Medoids: 'medoids' picked by clara.
Dist2Center: A function which calculates the Euclidean distance of each observation from each medoid.
Behnam.

I found that this happens only when the input data contains NA values. For inputs without NAs, the results of my algorithm and clara are identical. I think this is related to how clara handles NA values when calculating the distances of observations to medoids. Any comments? Any suggestions for replacing clara with an algorithm that copes with large datasets and NA values?

Having a look at the clara C code, I found that clara adjusts the distances whenever there are missing values. The line "dsum *= (nobs / pp)" shows that it counts the number of non-missing values in each pair of observations (nobs), divides this by the number of variables (pp), and multiplies the sum of squares by that factor. That is why it does not give the same results as my algorithm.
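For anyone who wants to reproduce clara's assignments, below is a minimal sketch of a distance function that applies the same correction. It is based only on the quoted "dsum *= (nobs / pp)" line, so treat it as an assumption about clara's internals rather than a verified reimplementation; the name Dist2CenterNA is just illustrative.
Dist2CenterNA <- function(Data, Center) {
  ## Euclidean-type distance of each row of Data to the medoid Center, rescaling the
  ## sum of squares by (number of non-missing pairs / number of variables),
  ## mimicking the quoted line from clara's C code (assumed behaviour, not verified)
  pp <- ncol(Data)                        # total number of variables
  apply(Data, 1, function(x) {
    ok   <- !is.na(x) & !is.na(Center)    # variables present in both observation and medoid
    nobs <- sum(ok)
    if (nobs == 0) return(NA_real_)       # no variables in common
    dsum <- sum((x[ok] - Center[ok])^2)   # sum of squares over the present variables
    sqrt(dsum * (nobs / pp))              # clara-style correction, then the square root
  })
}
Plugging this in place of Dist2Center in Manual.Clustering should bring the manual assignments closer to what clara reports when the data contain NAs.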

How to run an exponential decay mixed model?

I am not familiar with nonlinear regression and would appreciate some help with running an exponential decay model in R. Please see the graph below for what the data look like. My hunch is that an exponential model might be a good choice. I have one fixed effect and one random effect: y ~ x + (1|random factor). How do I get starting values for the exponential model in R (please assume that I know nothing about nonlinear regression), and how do I subsequently run a nonlinear model with these starting values? Could anyone please help me with the logic as well as the R code?
As I am not familiar with nonlinear regression, I haven't been able to attempt it in R.
[figure: raw plot of the data]
The correct syntax will depend on your experimental design and model, but I hope to give you a general idea of how to get started.
We begin by generating some data that should match the type of data you are working with. You mentioned one fixed factor and one random factor. Here, the fixed factor is represented by the variable treatment and the random factor by the variable grouping_factor.
library(nlraa)
library(nlme)
library(ggplot2)

## Setting this seed should allow you to reach the same result as me
set.seed(3232333)

example_data <- expand.grid(treatment = c("A", "B"),
                            grouping_factor = c('1', '2', '3'),
                            replication = c(1, 2, 3),
                            xvar = 1:15)
The next step is to create some "observations". Here we use an exponential function, y = a * exp(c * x), plus some random noise to create the data. We also add a constant to treatment A just to create some treatment differences.
example_data$y <- ave(example_data$xvar,
                      example_data[, c('treatment', 'replication', 'grouping_factor')],
                      FUN = function(x) {
                        expf(x = x, a = 10, c = -0.3) + rnorm(1, 0, 0.6)
                      })

example_data$y[example_data$treatment == 'A'] <- example_data$y[example_data$treatment == 'A'] + 0.8
All right, now we start fitting the model.
## Create a grouped data frame
exampleG <- groupedData(y ~ xvar | grouping_factor, data = example_data)

## Fit a separate nls model to each grouping level
fitL <- nlsList(y ~ SSexpf(xvar, a, c), data = exampleG)

## Combine the individual fits into a nonlinear mixed-effects model
## (nlme() has a method for nlsList objects); fit1 is the model used below
fit1 <- nlme(fitL)

## Grab the fixed-effect coefficients of the general model
fxf <- fixed.effects(fit1)

## Add treatment as a fixed effect. Also, use the coefficients from the previous
## model as starting values.
fit2 <- update(fit1, fixed = a + c ~ treatment,
               start = c(fxf[1], 0,
                         fxf[2], 0))
Looking at the model output, you will see information like the following:
Nonlinear mixed-effects model fit by maximum likelihood
  Model: y ~ SSexpf(xvar, a, c)
  Data: exampleG
       AIC      BIC    logLik
  475.8632 504.6506 -229.9316

Random effects:
 Formula: list(a ~ 1, c ~ 1)
 Level: grouping_factor
 Structure: General positive-definite, Log-Cholesky parametrization
              StdDev        Corr
a.(Intercept) 3.254827e-04  a.(In)
c.(Intercept) 1.248580e-06  0
Residual      5.670317e-01

Fixed effects: a + c ~ treatment
                  Value Std.Error  DF   t-value p-value
a.(Intercept)  9.634383 0.2189967 264  43.99329  0.0000
a.treatmentB   0.353342 0.3621573 264   0.97566  0.3301
c.(Intercept) -0.204848 0.0060642 264 -33.77976  0.0000
c.treatmentB  -0.092138 0.0120463 264  -7.64867  0.0000
 Correlation:
              a.(In) a.trtB c.(In)
a.treatmentB  -0.605
c.(Intercept) -0.785  0.475
c.treatmentB   0.395 -0.792 -0.503

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-1.93208903 -0.34340037  0.04767133  0.78924247  1.95516431

Number of Observations: 270
Number of Groups: 3
Then, if you wanted to visualize the model fit, you could do the following.
## Here we store the model predictions for visualization purposes
predictionsDf <- cbind(example_data,
                       predict_nlme(fit2, interval = 'conf'))

## Here we make a graph to check it out
ggplot() +
  geom_ribbon(data = predictionsDf,
              aes(x = xvar, ymin = Q2.5, ymax = Q97.5, fill = treatment),
              color = NA, alpha = 0.3) +
  geom_point(data = example_data, aes(x = xvar, y = y, col = treatment)) +
  geom_line(data = predictionsDf, aes(x = xvar, y = Estimate, col = treatment), size = 1.1)
This shows the model fit.

The result of FCM clustering for my test set in MATLAB [duplicate]

After finishing the cluster analysis, when I input some new data, how do I know which cluster the data belongs to?
data(freeny)
library(RSNNS)
options(digits = 2)

year <- as.integer(rownames(freeny))
freeny <- cbind(freeny, year)

## shuffle the rows
freeny <- freeny[sample(1:nrow(freeny), nrow(freeny)), 1:ncol(freeny)]

freenyValues  <- freeny[, 1:5]
freenyTargets <- decodeClassLabels(freeny[, 6])
freeny <- splitForTrainingAndTest(freenyValues, freenyTargets, ratio = 0.15)

km <- kmeans(freeny$inputsTrain, 10, iter.max = 100)
kclust <- km$cluster
kmeans returns an object containing the coordinates of the cluster centers in $centers. You want to find the cluster center to which the new object is closest (in terms of squared Euclidean distance):
v <- freeny$inputsTrain[1,] # just an example
which.min( sapply( 1:10, function( x ) sum( ( v - km$centers[x,])^2 ) ) )
The above returns 8, the same cluster to which the first row of freeny$inputsTrain was assigned.
In an alternative approach, you can first create a clustering and then train a supervised machine-learning model on the cluster labels, which you then use for prediction (a sketch is given at the end of this answer). However, the quality of that model will depend on how well the clustering really represents the data structure and on how much data you have. I have inspected your data with PCA (my favorite tool):
pca <- prcomp( freeny$inputsTrain, scale.= TRUE )
library( pca3d )
pca3d( pca )
My impression is that you have at most 6-7 clear classes to work with:
However, one should run more kmeans diagnostics (elbow plots etc.) to determine the optimal number of clusters:
wss <- sapply( 1:10, function( x ) { km <- kmeans(freeny$inputsTrain,x,iter.max = 100 ) ; km$tot.withinss } )
plot( 1:10, wss )
This plot suggests 3-4 classes as the optimum. For a more complex and informative approach, consult the clustergram: http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/
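To illustrate the supervised approach mentioned above: cluster the training inputs, treat the cluster labels as the response, train a classifier on them, and use it to predict the cluster of new observations. Below is a minimal sketch using k-nearest neighbours from the class package purely as an example (the choice of 4 clusters follows the elbow plot above); any classifier could be substituted.
library(class)                                        # for knn()

km4 <- kmeans(freeny$inputsTrain, 4, iter.max = 100)  # cluster the training data
train_labels <- factor(km4$cluster)                   # cluster labels become the "response"

## classify the held-out observations by 1-nearest-neighbour against the training set
new_labels <- knn(train = freeny$inputsTrain,
                  test  = freeny$inputsTest,
                  cl    = train_labels,
                  k     = 1)
table(new_labels)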

How can we find percentile or quantile of gamma distribution in MATLAB?

Suppose that we have this gamma distribution in MATLAB:
I want the part of the distribution with higher density (an x-axis range). How can I extract this in MATLAB? I fit this distribution using the histfit function.
PS. My code:
figure;
histfit(Data, 20, 'gamma');

[phat, pci] = gamfit(Data);

phat =
   11.3360    4.2276

pci =
    8.4434    3.1281
   15.2196    5.7136
When you fit a gamma distribution to your data with [phat, pci] = gamfit(Data);, phat contains the maximum-likelihood estimates of the shape and scale parameters.
You can plug these into gaminv:
x = gaminv(p, phat(1), phat(2));
where p is a vector of probabilities, e.g. p = [.2, .8], which gives you the 20th and 80th percentiles, i.e. the x-axis range covering the central 60% of the fitted distribution.
I always base my code on the following. It is code I wrote once and now alter where necessary. Maybe you will find it useful as well.
table2.10 <- cbind(lower = c(0, 2.5, 7.5, 12.5, 17.5, 22.5, 32.5,
                             47.5, 67.5, 87.5, 125, 225, 300),
                   upper = c(2.5, 7.5, 12.5, 17.5, 22.5, 32.5,
                             47.5, 67.5, 87.5, 125, 225, 300, Inf),
                   freq  = c(41, 48, 24, 18, 15, 14, 16, 12, 6, 11, 5, 4, 3))

## log-likelihood of the grouped (interval-censored) data under a gamma(alpha, beta) model
loglik <- function(p, d) {
  upper <- d[, 2]
  lower <- d[, 1]
  n     <- d[, 3]
  ll <- n * log(ifelse(upper < Inf, pgamma(upper, p[1], p[2]), 1) -
                pgamma(lower, p[1], p[2]))
  sum(ll)
}

p0 <- c(alpha = 0.47, beta = 0.014)
m  <- optim(p0, loglik, hessian = TRUE, control = list(fnscale = -1),
            d = table2.10)

## 95th percentile of the fitted gamma distribution
theta <- qgamma(0.95, m$par[1], m$par[2])
theta
One can also create a 95% confidence interval by using the delta method. To do so, we need to differentiate the quantile function of the claims distribution, F_X^{-1}(0.95; α, β), with respect to α and β.
p   <- m$par
eps <- c(1e-5, 1e-6, 1e-7)

## numerical derivatives of the 95% quantile with respect to alpha and beta
## (central differences with three different step sizes)
d.alpha <- 0 * eps
d.beta  <- 0 * eps
for (i in 1:3) {
  d.alpha[i] <- (qgamma(0.95, p[1] + eps[i], p[2]) - qgamma(0.95, p[1] - eps[i], p[2])) / (2 * eps[i])
  d.beta[i]  <- (qgamma(0.95, p[1], p[2] + eps[i]) - qgamma(0.95, p[1], p[2] - eps[i])) / (2 * eps[i])
}
d.alpha
d.beta

## delta-method variance of the 95% quantile and the resulting 95% confidence interval
var.p   <- solve(-m$hessian)
var.q95 <- t(c(d.alpha[2], d.beta[2])) %*% var.p %*% c(d.alpha[2], d.beta[2])
qgamma(0.95, p[1], p[2]) + qnorm(c(0.025, 0.975)) * sqrt(c(var.q95))
It is even possible to use a parametric bootstrap on the estimates of α and β to get B different estimates of the 95th percentile of the loss distribution (B = 10000 below), and to use these estimates to construct a 95% confidence interval:
library(mvtnorm)

B   <- 10000
q.b <- rep(NA, B)
for (b in 1:B) {
  ## draw (alpha, beta) from the asymptotic normal distribution of the MLE
  p.b <- rmvnorm(1, p, var.p)
  if (!any(p.b < 0)) q.b[b] <- qgamma(0.95, p.b[1], p.b[2])
}
quantile(q.b, c(0.025, 0.975), na.rm = TRUE)  # na.rm: draws with negative parameters are skipped above
To do the nonparametric bootstrap, we first ‘expand’ the data to reflect each individual observation. Then we sample with replacement from the line numbers, recalculate the frequency table, re-estimate the model, and take its 95th percentile.
## expand the grouped data: one line number per individual observation (217 in total)
line.numbers <- rep(1:13, table2.10[, "freq"])
q.b <- rep(NA, B)
table2.10b <- table2.10
for (b in 1:B) {
  ## resample observations, rebuild the frequency table, refit, take the 95th percentile
  line.numbers.b <- sample(line.numbers, size = 217, replace = TRUE)
  table2.10b[, "freq"] <- table(factor(line.numbers.b, levels = 1:13))
  m.b <- optim(m$par, loglik, hessian = TRUE, control = list(fnscale = -1),
               d = table2.10b)
  q.b[b] <- qgamma(0.95, m.b$par[1], m.b$par[2])
}
q.npb <- q.b
quantile(q.b, c(0.025, 0.975))

Basic help using an HMM to classify a sequence

I am very new to MATLAB, hidden Markov models and machine learning, and am trying to classify a given sequence of signals. Please let me know if the approach I have followed is correct:
1. Create an N-by-N transition matrix and fill it with random values which sum to 1 for each row (N is the number of states).
2. Create an N-by-M emission/observation matrix and fill it with random values which sum to 1 for each row.
3. Convert different instances of the sequence (i.e. each instance will be saying the word 'hello') into one long stream and feed each stream to the HMM train function, such that:
[new_transition_matrix, new_emission_matrix] = hmmtrain(sequence, old_transition_matrix, old_emission_matrix)
4. Give the final transition and emission matrices to hmmdecode with an unknown sequence to get the probability, i.e.:
[posterior_states, logarithmic_probability] = hmmdecode(sequence, final_transition_matrix, final_emission_matrix)
1. and 2. are correct. You have to be careful that your initial transition and emission matrices are not completely uniform; they should be slightly randomized for the training to work.
3. I would just feed in the 'Hello' sequences separately rather than concatenating them to form a single long sequence.
Let's say this is the sequence for 'Hello': [1,0,1,1,0,0]. If you form one long sequence from 3 'Hello' sequences, you would get:
data = [1,0,1,1,0,0,1,0,1,1,0,0,1,0,1,1,0,0]
This is not ideal; instead you should feed the sequences in separately, like:
data = [1,0,1,1,0,0; 1,0,1,1,0,0; 1,0,1,1,0,0]
Since you are using MATLAB, I would recommend the HMM toolbox by Kevin Murphy. It has a demo of how you can train an HMM with multiple observation sequences:
M = 3;
N = 2;

% "true" parameters
prior0    = normalise(rand(N, 1));
transmat0 = mk_stochastic(rand(N, N));
obsmat0   = mk_stochastic(rand(N, M));

% training data: a 5*6 matrix, e.g. 5 different 'Hello' sequences of length 6
number_of_seq = 5;
seq_len = 6;
data = dhmm_sample(prior0, transmat0, obsmat0, number_of_seq, seq_len);

% initial guess of parameters
prior1    = normalise(rand(N, 1));
transmat1 = mk_stochastic(rand(N, N));
obsmat1   = mk_stochastic(rand(N, M));

% improve guess of parameters using EM
[LL, prior2, transmat2, obsmat2] = dhmm_em(data, prior1, transmat1, obsmat1, 'max_iter', 5);
LL
4. What you say is correct; below is how you calculate the log probability in the HMM toolbox:
% use model to compute log[P(Obs|model)]
loglik = dhmm_logprob(data, prior2, transmat2, obsmat2)
Finally: have a look at this paper by Rabiner on how the mathematics works if anything is unclear.
Hope this helps.

Generating a set of emissions given a transition matrix and starting state in a hidden markov model

I have the transition matrix, emission matrix and starting state for a hidden Markov model. I want to generate a sequence of observations (emissions). However, I'm stuck on one thing.
I understand how to choose between two states (or emissions). If Event A occurs with probability x, then Event B (really, not-A) occurs with probability 1-x. To generate a sequence of A's and B's with a random number rand, you do the following:
for iteration in iterations:
    observation[iteration] <- A if rand < x else B
I don't understand how to extend this to more than two variables. For example, if three events occur such that Event A occurs with probability x1, Event B with x2 and Event C with 1-(x1+x2), then how do I extend the above pseudocode?
I didn't find the answer Googling. In fact I get the impression that I'm missing a basic fact that many of the notes online assume. :-/
One way would be:
x <- rand()
if x < x1            observation is A
else if x < x1 + x2  observation is B
else                 observation is C
Of course, if you have a large number of alternatives it might be better to build a cumulative probability table (holding x1, x1+x2, x1+x2+x3, ...) and then do a binary search in that table given the random number; a minimal sketch of this idea is given below. If you are willing to do more preprocessing, there is an even more efficient way, see here for example.
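For concreteness, here is a minimal sketch of the cumulative-table idea in R (the probabilities x1 and x2 and the helper name draw_event are only illustrative); findInterval() performs the table lookup that a binary search would do:
x1 <- 0.2; x2 <- 0.5
cum_probs <- cumsum(c(x1, x2, 1 - x1 - x2))   # cumulative probability table: 0.2, 0.7, 1.0
events <- c("A", "B", "C")

draw_event <- function() {
  u <- runif(1)                               # uniform random number in [0, 1)
  events[findInterval(u, cum_probs) + 1]      # first event whose cumulative probability exceeds u
}

replicate(10, draw_event())

## In practice, R's sample() does the same thing directly:
sample(events, size = 10, replace = TRUE, prob = c(x1, x2, 1 - x1 - x2))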
The two-value case is a binomial distribution: you generate random draws from a binomial distribution (a series of coin flips, essentially).
For more than two values, you need to draw samples from a multinomial distribution, which is simply a generalisation of the binomial distribution to more than two categories.
Regardless of what language you use, there should be built-in functions to accomplish this task. Below is some code in Python, which simulates a set of observations and states given your HMM model object:
import numpy as np

# These two functions are written as methods of an HMM object that stores
# self.priors, self.transition and self.emission (a list of emission matrices).

def random_MN_draw(self, n, probs):
    """ get a random draw from the multinomial distribution whose probabilities are given by 'probs' """
    mn_draw = np.random.multinomial(n, probs)  # do 1 multinomial experiment with the given probs; with probs = [0.5, 0.5] this is a coin flip
    return np.where(mn_draw == 1)[0][0]        # get the index of the state drawn, e.g. 0, 1, etc.

def simulate(self, nSteps):
    """ given an HMM = (A, B1, B2, pi), simulate state and observation sequences """
    lenB = len(self.emission)
    observations = np.zeros((lenB, nSteps), dtype=int)  # array of zeros
    states = np.zeros(nSteps, dtype=int)
    states[0] = self.random_MN_draw(1, self.priors)     # appoint the first state from the prior dist
    for i in range(0, lenB):                            # initialise observations[i, 0] for all observed variables
        observations[i, 0] = self.random_MN_draw(1, self.emission[i][states[0], :])  # ith variable array, states[0]th row
    for t in range(1, nSteps):                          # loop through t
        states[t] = self.random_MN_draw(1, self.transition[states[t-1], :])          # given prev state (t-1) pick what row of the A matrix to use
        for i in range(0, lenB):                        # loop through the observed variables for each t
            observations[i, t] = self.random_MN_draw(1, self.emission[i][states[t], :])  # given current state t, pick what row of the B matrix to use
    return observations, states
In pretty much every language, you can find built-in equivalents of np.random.multinomial() for the multinomial and other discrete or continuous distributions.