How to derive meaning/values from cluster analysis results - cluster-analysis

I am currently working on my master's thesis, for which I would like to build a simulation model of the walking behavior of older adults. To keep the model simple, I want to form groups using cluster analysis, so that I can assign a walking behavior to an older person based on the group they belong to (for example, if you belong to group 1, your walking time will be approximately 20 minutes).
However, I am not that familiar with cluster analysis. I have a large dataset describing many characteristics of the older adults (**variables of discrete and continuous nature**). Based on the literature, I currently use the following characteristics:
age, gender, health score, education category, income category, occupation, social network, living in a pleasant neighbourhood (yes/no), feeling safe in the neighbourhood (yes/no), the distance to green, having a dog, the walking time, walking in minutes.
After using the daisy function and the silhouette method to choose the ideal number of clusters (and thus groups), I got my clusters. Now I am wondering how to derive meaning from them. I find it difficult to use statistics such as means, since I am dealing with categories. What can I do to derive useful statistical conclusions for each cluster, for example: if you belong to cluster 1, your income level is on average around income group 10, your age is around 70 and your walking time is around 20 minutes? Ideally I would also like to have the standard deviation of each variable in each cluster.
That way I can easily use these values in my simulation model to assign walking behavior to older adults.

@Joy, you should first determine the relevant variables, which also helps with dimensionality reduction. Since you have not given a sample dataset to work with, I am creating my own. Note also that, before cluster analysis, it is important to aim for clusters that are pure. By purity I mean that the clustering is built only on those variables that account for most of the variance in the data. Variables that show little to negligible variance are best removed, because they contribute nothing to a cluster model. Once you have these (statistically) significant variables, the cluster analysis will be meaningful.
Theoretical concepts
Clustering is often used as a preprocessing or exploratory step. It is imperative to derive statistically significant variables in order to extract pure clusters. In a classification task, deriving these variables is called feature selection; in a clustering task, a common analogue is to extract Principal Components (PCs). Classical Principal Component Analysis (PCA) works only for continuous variables. To derive components from categorical variables, Correspondence Analysis (CA) can be used for two categorical variables, and Multiple Correspondence Analysis (MCA) for more than two.
Practical implementation
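Let's create a data frame containing mixed variables (i.e. both categorical and continuous):
R> digits = 0:9
# set seed for reproducibility
R> set.seed(17)
# function to create n random strings: 5 letters, 4 digits, 1 letter
R> createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
# stringsAsFactors = TRUE keeps name and studLoc as factors in R >= 4.0
R> df <- data.frame(ID = c(1:10), name = sample(letters[1:10]),
                 studLoc = sample(createRandString(10)),
                 finalmark = sample(c(0:100), 10),
                 subj1mark = sample(c(0:100), 10),
                 subj2mark = sample(c(0:100), 10),
                 stringsAsFactors = TRUE)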
R> str(df)
'data.frame': 10 obs. of 6 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10
$ name : Factor w/ 10 levels "a","b","c","d",..: 2 9 4 6 3 7 1 8 10 5
$ studLoc : Factor w/ 10 levels "APBQD6181U","GOSWE3283C",..: 5 3 7 9 2 1 8 10 4 6
$ finalmark: int 53 73 95 39 97 58 67 64 15 81
$ subj1mark: int 63 18 98 83 68 80 46 32 99 19
$ subj2mark: int 90 40 8 14 35 82 79 69 91 2
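I will inject random missing values into the data so that it is more similar to real-world datasets.
# add random NA values: each cell is kept with probability 0.85 and set to NA otherwise
R> df <- as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
# count the missing values per column
R> colSums(is.na(df))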
ID name studLoc finalmark subj1mark subj2mark
0 0 0 2 2 0
As you can see, the missing values are in the continuous variables finalmark and subj1mark. I chose median imputation over mean imputation because the median is more robust to outliers than the mean.
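To make the robustness argument concrete, here is a minimal sketch. The marks and the extreme value of 1000 are hypothetical and are not taken from the data frame above.

```r
# Hypothetical marks; the value 1000 stands in for a data-entry error
marks <- c(53, 58, 64, 67, 73)
marks_outlier <- c(marks, 1000)

# How far does each statistic move when the outlier is added?
mean_shift   <- mean(marks_outlier) - mean(marks)
median_shift <- median(marks_outlier) - median(marks)

cat("shift in mean:  ", mean_shift, "\n")
cat("shift in median:", median_shift, "\n")
```

The mean is pulled strongly towards the extreme value while the median barely moves, which is why imputing with the median is less sensitive to a few unusual observations.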
# Create a function to impute the missing values
R> ImputeMissing <- function(data = df) {
  # check if data frame
  if (!is.data.frame(data)) {
    stop("Only data frames are supported")
  }
  # Loop through the columns of the data frame
  for (i in seq_along(data)) {
    if (is.numeric(data[[i]])) {
      # missing continuous data to be replaced by the column median
      data[is.na(data[[i]]), i] <- median(data[[i]], na.rm = TRUE)
    } # end inner-if
  } # end for
  data # return the imputed data frame
} # end function
# Impute the missing values
R> df.complete <- ImputeMissing(df)
# check missing values
R> colSums(is.na(df.complete))
ID name studLoc finalmark subj1mark subj2mark
0 0 0 0 0 0
Now we can apply the method FAMD() from the FactoMineR package to the cleaned dataset. You can type ?FactoMineR::FAMD in the R console to read the documentation of this method. According to the documentation, FAMD is a principal component method dedicated to exploring data with both continuous and categorical variables. It can be seen roughly as a mix of PCA and MCA. More precisely, the continuous variables are scaled to unit variance, and the categorical variables are transformed into a disjunctive data table (crisp coding) and then scaled using the specific scaling of MCA. This balances the influence of continuous and categorical variables in the analysis, so that both kinds of variable are on an equal footing in determining the dimensions of variability.
R> df.princomp <- FactoMineR::FAMD(df.complete, graph = FALSE)
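The two preprocessing steps described for FAMD can be sketched in base R. This is only an illustration of unit-variance scaling and of a disjunctive (indicator) table. It does not reproduce the MCA-specific weighting that FAMD applies afterwards, and the data frame `toy` with its columns `mark` and `group` is hypothetical.

```r
# Hypothetical mixed data; column names are illustrative only
toy <- data.frame(
  mark  = c(53, 73, 95, 70, 97, 58),
  group = factor(c("a", "b", "a", "c", "b", "a"))
)

# Continuous part: centre and scale to unit variance
mark_scaled <- as.numeric(scale(toy$mark))

# Categorical part: one 0/1 indicator column per level (disjunctive table)
group_indicator <- model.matrix(~ group - 1, data = toy)

print(sd(mark_scaled))           # standard deviation after scaling
print(group_indicator)           # one column per level of group
print(rowSums(group_indicator))  # each row belongs to exactly one level
```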
Next, we can visualize the PCs with a scree plot (fig1):
R> factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
A Scree Plot (as shown in fig1) is a simple line segment plot that shows the fraction of total variance in the data as explained or represented by each Principal Component (PC). So we can see the first three PCs collectively are responsible for 44.5% of total variance. The question now naturally arises, "What are these variables?".
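The percentages in a scree plot are the eigenvalues expressed as a share of their sum. Below is a minimal sketch using base R's prcomp() on hypothetical random data, not on the FAMD result above. For a FactoMineR result, the corresponding values should be available in the `eig` element (for example `df.princomp$eig`).

```r
set.seed(1)
# Hypothetical continuous data: 50 observations of 4 variables
x <- matrix(rnorm(200), ncol = 4)

pca <- prcomp(x, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eig <- pca$sdev^2

# Percentage of total variance per component, and the running total
pct     <- 100 * eig / sum(eig)
cum_pct <- cumsum(pct)

print(round(data.frame(PC = seq_along(pct), pct, cum_pct), 1))
```

Reading the cum_pct column row by row gives the kind of statement made above, such as "the first three PCs collectively explain a given percentage of the total variance".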
To extract the contribution of each variable, I used fviz_contrib(), as shown in fig2:
R> factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
fig2 above visualizes the contribution of each variable to the first dimension (axes = 1) of the FAMD result. From it I can see that studLoc, name, subj2mark and finalmark are the most important variables and can be used for further analysis.
Now, you can proceed with cluster analysis.
# extract the important variables and store in a new dataframe
R> df.princomp.impvars<- df.complete[,c(2:3,6,4)]
# make the distance matrix
R> gower_dist <- cluster::daisy(df.princomp.impvars,
metric = "gower",
type = list(logratio = 3))
R> gower_mat <- as.matrix(gower_dist)
#make a hierarchical cluster model
R> model<-hclust(gower_dist)
#plotting the hierarchy
R> plot(model)
#cutting the tree at your decided level
R> clustmember<-cutree(model,3)
#adding the cluster member as a column to your data
R> df.clusters<-data.frame(df.princomp.impvars,cluster=clustmember)
R> df.clusters
name studLoc subj2mark finalmark cluster
1 b POTYQ0002N 90 53 1
2 i LWMTW1195I 40 73 1
3 d VTUGO1685F 8 95 2
4 f YCGGS5755N 14 70 1
5 c GOSWE3283C 35 97 2
6 g APBQD6181U 82 58 1
7 a VUJOG1460V 79 67 1
8 h YXOGP1897F 69 64 1
9 j NFUOB6042V 91 70 1
10 e QYTHG0783G 2 81 3
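To turn the cluster membership into values you can use in a simulation, one option is to summarise each variable per cluster: mean and standard deviation for continuous variables, and category shares plus the most frequent category for categorical ones. The sketch below copies the numeric columns from the printout above. The `grade` column is an assumption added for illustration: it is derived from finalmark and stands in for a categorical variable such as an income category.

```r
# Values copied from the df.clusters printout above (numeric columns only)
df.clusters <- data.frame(
  subj2mark = c(90, 40, 8, 14, 35, 82, 79, 69, 91, 2),
  finalmark = c(53, 73, 95, 70, 97, 58, 67, 64, 70, 81),
  cluster   = c(1, 1, 2, 1, 2, 1, 1, 1, 1, 3)
)

# Illustrative categorical variable (an assumption, not in the original data)
df.clusters$grade <- cut(df.clusters$finalmark,
                         breaks = c(0, 60, 80, 100),
                         labels = c("low", "mid", "high"))

# Continuous variables: mean and standard deviation per cluster
num_vars <- c("subj2mark", "finalmark")
cluster_means <- aggregate(df.clusters[num_vars],
                           by = list(cluster = df.clusters$cluster), FUN = mean)
# sd is NA for a cluster that contains a single observation
cluster_sds <- aggregate(df.clusters[num_vars],
                         by = list(cluster = df.clusters$cluster), FUN = sd)

# Categorical variable: share of each category within each cluster
grade_share <- prop.table(table(df.clusters$cluster, df.clusters$grade),
                          margin = 1)

# Most frequent category (mode) per cluster
grade_mode <- tapply(df.clusters$grade, df.clusters$cluster,
                     function(g) names(which.max(table(g))))

print(cluster_means)
print(cluster_sds)
print(round(grade_share, 2))
print(grade_mode)
```

In a simulation you could then, for example, draw a continuous value for a person in a given cluster from a distribution with that cluster's mean and standard deviation, and draw a category using the shares in grade_share as probabilities. Whether a normal distribution is appropriate depends on your data, so treat this as a starting point rather than a fixed recipe.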


