how to derive meaning/values from cluster analysis results - cluster-analysis

I am currently working on my master's thesis, for which I would like to build a simulation model of the walking behavior of older adults. To keep the simulation model simple, I want to form groups based on a cluster analysis, so that I can easily assign a certain walking behavior to an older person depending on the group they belong to (for example: if you belong to group 1, your walking time will be approximately 20 minutes).
However, I am not that familiar with cluster analysis. I have a big dataset containing many characteristics of the older adults (variables of both discrete and continuous nature); based on the literature, the following characteristics are currently used:
age, gender, health score, education category, income category, occupation, social network, living in a pleasant neighbourhood (yes/no), feeling safe in the neighbourhood (yes/no), distance to green space, having a dog, and walking time in minutes.
After using the daisy function and the silhouette method to determine the ideal number of clusters/groups, I obtained my clusters. Now I am wondering how to derive meaning from them. I find it difficult to use statistics such as the mean, since I am also dealing with categorical variables. What can I do to draw useful statistical conclusions about each cluster group, for example: if you belong to cluster group 1, your income level is on average around income group 10, your age is around 70, and your walking time is around 20 minutes? Ideally I would also like the standard deviation of each variable in each cluster group, so that I can easily use these values in my simulation model to assign a certain walking behavior to older adults.

Joy, you should first determine the relevant variables; this will also help with dimensionality reduction. Since you have not given a sample dataset to work with, I am creating my own. Also note that before cluster analysis it is important to aim for clusters that are pure: by purity I mean that the clustering should be driven only by the variables that account for most of the variance in the data. Variables that show little to negligible variance are best removed, because they do not contribute to the cluster model. Once you have these (statistically) significant variables, the cluster analysis will be meaningful.
Theoretical concepts
Clustering benefits greatly from this kind of preprocessing: it is imperative to derive statistically significant variables in order to extract pure clusters. In a classification task the derivation of these significant variables is called feature selection, whereas in a clustering task it is usually done through Principal Components (PCs). Classically, principal components are defined only for continuous variables. For categorical variables there is a related method called Correspondence Analysis (CA), and for several nominal categorical variables Multiple Correspondence Analysis (MCA) can be used.
Practical implementation
Let's create a data frame containing mixed variables (i.e. both categorical and continuous) like,
R> digits = 0:9
# set seed for reproducibility
R> set.seed(17)
# function to create random string
R> createRandString <- function(n = 5000) {
     a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
     paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
   }
R> df <- data.frame(ID = c(1:10), name = sample(letters[1:10]),
                    studLoc = sample(createRandString(10)),
                    finalmark = sample(c(0:100), 10),
                    subj1mark = sample(c(0:100), 10), subj2mark = sample(c(0:100), 10))
R> str(df)
'data.frame': 10 obs. of 6 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10
$ name : Factor w/ 10 levels "a","b","c","d",..: 2 9 4 6 3 7 1 8 10 5
$ studLoc : Factor w/ 10 levels "APBQD6181U","GOSWE3283C",..: 5 3 7 9 2 1 8 10 4 6
$ finalmark: int 53 73 95 39 97 58 67 64 15 81
$ subj1mark: int 63 18 98 83 68 80 46 32 99 19
$ subj2mark: int 90 40 8 14 35 82 79 69 91 2
I will inject random missing values into the data so that it is more similar to real-world datasets.
# add random NA values
R> df<-as.data.frame(lapply(df, function(cc) cc[ sample(c(TRUE, NA), prob = c(0.85, 0.15), size = length(cc), replace = TRUE) ]))
R> colSums(is.na(df))
ID name studLoc finalmark subj1mark subj2mark
0 0 0 2 2 0
As you can see, the missing values are in the continuous variables finalmark and subj1mark. I chose median imputation over mean imputation because the median is more robust to outliers than the mean.
# Create a function to impute the missing values
R> ImputeMissing <- function(data = df){
  # coerce to a data frame if necessary
  if(!is.data.frame(data)){
    data <- as.data.frame(data)
  }
  # loop through the columns of the data frame
  for(i in seq_along(data)){
    if(class(data[, i]) %in% c("numeric", "integer")){
      # missing continuous values are replaced by the column median
      data[is.na(data[, i]), i] <- median(data[, i], na.rm = TRUE)
    } # end if
  } # end for
  return(data)
} # end function
# Impute the missing values
R> df.complete<- ImputeMissing(df)
# check missing values
R> colSums(is.na(df.complete))
ID name studLoc finalmark subj1mark subj2mark
0 0 0 0 0 0
Now we can apply the FAMD() method from the FactoMineR package to the cleaned dataset. You can type ??FactoMineR::FAMD in the R console to look at the documentation of this method. From the documentation: FAMD is a principal component method dedicated to exploring data with both continuous and categorical variables. It can be seen roughly as a mix between PCA and MCA. More precisely, the continuous variables are scaled to unit variance and the categorical variables are transformed into a disjunctive data table (crisp coding) and then scaled using the specific scaling of MCA. This balances the influence of the continuous and the categorical variables in the analysis, meaning that both types of variables are on an equal footing when determining the dimensions of variability.
R> df.princomp <- FactoMineR::FAMD(df.complete, graph = FALSE)
Thereafter we can visualize the PCs using a screeplot shown in fig1 like,
R> factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
barfill = "gray", barcolor = "black",
ylim = c(0, 50), xlab = "Principal Component",
ylab = "Percentage of explained variance",
main = "Principal Component (PC) for mixed variables")
A scree plot (as shown in fig1) is a simple line plot that shows the fraction of the total variance in the data explained by each Principal Component (PC). Here we can see that the first three PCs collectively account for 44.5% of the total variance. The question now naturally arises: "What are these variables?".
To extract the contribution of the variables, I've used fviz_contrib() shown in fig2 like,
R> factoextra::fviz_contrib(df.princomp, choice = "var",
axes = 1, top = 10, sort.val = c("desc"))
Fig2 above visualizes the contribution of the variables to the first dimension of the analysis. From it I can see that the variables studLoc, name, subj2mark and finalmark are the most important ones and can be used for further analysis.
Now, you can proceed with cluster analysis.
# extract the important variables and store in a new dataframe
R> df.princomp.impvars<- df.complete[,c(2:3,6,4)]
# make the distance matrix
R> gower_dist <- cluster::daisy(df.princomp.impvars,
metric = "gower",
type = list(logratio = 3))
R> gower_mat <- as.matrix(gower_dist)
#make a hierarchical cluster model
R> model<-hclust(gower_dist)
#plotting the hierarchy
R> plot(model)
#cutting the tree at your decided level
R> clustmember<-cutree(model,3)
#adding the cluster member as a column to your data
R> df.clusters<-data.frame(df.princomp.impvars,cluster=clustmember)
R> df.clusters
name studLoc subj2mark finalmark cluster
1 b POTYQ0002N 90 53 1
2 i LWMTW1195I 40 73 1
3 d VTUGO1685F 8 95 2
4 f YCGGS5755N 14 70 1
5 c GOSWE3283C 35 97 2
6 g APBQD6181U 82 58 1
7 a VUJOG1460V 79 67 1
8 h YXOGP1897F 69 64 1
9 j NFUOB6042V 91 70 1
10 e QYTHG0783G 2 81 3
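Finally, to derive meaning from each cluster, as asked in the original question, here is a minimal sketch (using the toy column names above; in your data these would be age, income category, walking time, and so on). Report the mean and standard deviation per cluster for the continuous variables, and the level frequencies per cluster for the categorical variables; the most frequent level then serves as the "typical" category of that cluster.
# mean and standard deviation of the continuous variables, per cluster
R> aggregate(cbind(subj2mark, finalmark) ~ cluster, data = df.clusters,
             FUN = function(x) c(mean = mean(x), sd = sd(x)))
# counts (or row-wise proportions) of each categorical level, per cluster
R> table(df.clusters$cluster, df.clusters$name)
R> prop.table(table(df.clusters$cluster, df.clusters$studLoc), margin = 1)
The per-cluster mean and standard deviation of, say, walking time can then be plugged directly into the simulation model as the parameters of that group.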

Related

Fast intersection of several interval ranges?

I have several variables, all of which are numeric ranges: (intervals in rows)
a = [ 1 4; 5 9; 11 15; 20 30];
b = [ 2 6; 12 14; 19 22];
c = [ 15 22; 24 29; 33 35];
d = [ 0 3; 15 17; 23 26];
(The values in my real dataset are not integers, but are represented here as such for clarity).
I would like to find intervals in which at least 3 of the variables intersect. In the above example [20 22] and [24 26] would be two such cases.
One way to approach this is to bin my values and add the bins together, but as my values are continuous, that'd create an 'edge effect', and I'd lose time binning the values in the first place. (binning my dataset at my desired resolution would create hundreds of GB of data).
Another approach, which doesn't involve binning, would be to compute pairwise intersections (let's call them X) between all possible combinations of variables, and then intersect X with all the other variables, which is O(n^3).
What are your thoughts on this? Are there algorithms/libraries which have tools to solve this?
I was thinking of using sort of a geometric approach to solve this: Basically, if I considered that my intervals were segments in 1D space, then my desired output would be points where three segments (from three variables) intersect. I'm not sure if this is efficient algorithmically though. Advice?
O(N lg N) method:
Convert each interval (t_A, t_B) to a pair of tagged endpoints ('begin', t_A), ('end', t_B).
Sort all the endpoints by time; this is the most expensive step.
Do one pass through, tracking the nesting depth (increment if the tag is 'begin', decrement if it is 'end'). This takes linear time.
When the depth changes from 2 to 3, it's the start of an output interval.
When it changes from 3 to 2, it's the end of an output interval (see the sketch below).
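A minimal MATLAB sketch of this sweep, using the example intervals above (the variable names, k = 3, and the choice to process 'end' events before 'begin' events at equal times are my own assumptions, not part of the original answer; with that tie-breaking, intervals that only touch at a single point do not count as overlapping):
intervals = [a; b; c; d];                                % stack all intervals, one per row
starts = [intervals(:,1),  ones(size(intervals,1),1)];   % +1 tags a 'begin' endpoint
stops  = [intervals(:,2), -ones(size(intervals,1),1)];   % -1 tags an 'end' endpoint
events = sortrows([starts; stops]);                      % sort all endpoints by time ('end' before 'begin' on ties)
depth  = cumsum(events(:,2));                            % nesting depth after each event
k = 3;                                                   % required number of overlapping intervals
opens  = events(depth == k   & [0; depth(1:end-1)] == k-1, 1);  % depth rises from k-1 to k
closes = events(depth == k-1 & [0; depth(1:end-1)] == k,   1);  % depth falls from k to k-1
[opens closes]                                           % one row per region with >= k overlaps (includes [20 22] and [24 26])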

Using bin counts as weights for random number selection

I have a set of data that I wish to approximate via random sampling in a non-parametric manner, e.g.:
eventl=
4
5
6
8
10
11
12
24
32
In order to accomplish this, I initially bin the data up to a certain value:
binsize = 5;
nbins = 20;
[bincounts,ind] = histc(eventl,1:binsize:binsize*nbins);
Then populate a matrix with all possible numbers covered by the bins which the approximation can choose:
sizes = transpose(1:binsize*nbins);
To use the bin counts as weights for selection (i.e. bincount(1-5) = 2, so the weight for choosing 1, 2, 3, 4 or 5 is 2, whereas bincount(16-20) = 0, so 16, 17, 18, 19 or 20 can never be chosen), I simply replicate the bin counts across the bin size:
w = repelem(bincounts,binsize);
To then perform weighted number selection, I use:
[~,R] = histc(rand(1,1),cumsum([0;w(:)./sum(w)]));
R = sizes(R);
For some reason this approach is unable to approximate the data. It was my understanding that, with sufficient sampling depth, the binned version of R would be identical to the binned version of eventl; however, there is significant variation, and data are often found in bins whose weights were 0.
Could anybody suggest a better method to do this or point out the error?
For a better method, I suggest randsample:
values = [1 2 3 4 5 6 7 8]; %# values from which you want to pick
numberOfElements = 1000; %# how many values you want to pick
weights = [2 2 2 2 2 1 1 1]; %# weights given to the values (1-5 are twice as likely as 6-8)
sample = randsample(values, numberOfElements, true, weights);
Note that even with 1000 samples, the distribution does not exactly correspond to the weights, so if you only pick 20 samples, the histogram may look rather different.
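As a quick sanity check (a sketch using the variables defined just above), you can compare how often each value was actually drawn with the counts implied by the weights; the two columns get closer as numberOfElements grows:
counts   = histc(sample, values);                        %# observed draws per value
expected = weights ./ sum(weights) * numberOfElements;   %# counts implied by the weights
[counts(:) expected(:)]                                  %# observed vs expected, side by side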

Bootstrap method resampling in matlab

I am producing a script for creating bootstrap samples (random) from a precipitation data set (sskt and Kendall tau package in MATLAB).
I have one double array with 3 columns from my data.
The first column is the year, the second an index (for season or period), and the third the precipitation of that station (the index is the number of the station; I run this method for a regional trend).
1970 1 234
1971 1 244
1972 1 344
... ... ...
1970 2 342
1971 2 356
... ... ...
etc. I have 36 years for each of my stations (12 stations, so 12 x 36 = 432 rows of 3 columns).
I want an m-script file from which I can call the function sskt for N = 5000 repetitions of my data. My data is a CSV file, actually a double matrix in MATLAB. I want a bootstrap of each column that generates 5000 (or 1000) repetitions; 1000 repetitions means 1000 x 36 = 36000 resampled values. In each loop of 1000 I call the function sskt, and as results I get 1000 S slopes, 1000 Kendall taus and 1000 significance values.
Does anyone have an idea?
MATLAB has a bootstrap function called bootstrp. It draws N bootstrap data samples, computes statistics on each sample, and returns the results.
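As a rough sketch of how bootstrp could be used here (the variable data below stands in for your 432-by-3 matrix, and @median is only a placeholder for the statistic you actually want, e.g. a wrapper around your sskt call):
N = 1000;                                 % number of bootstrap repetitions
precip = data(data(:,2) == 1, 3);         % the 36 precipitation values of station 1
bootstat = bootstrp(N, @median, precip);  % N-by-1 vector of resampled statistics
mean(bootstat)                            % bootstrap estimate of the statistic
std(bootstat)                             % bootstrap standard error
Repeat this per station (or per column) to get the 1000 S slopes, Kendall taus and significance values you describe.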

Kmeans matlab "Empty cluster created at iteration 1" error

I'm using this script to cluster a set of 3D points using the kmeans matlab function but I always get this error "Empty cluster created at iteration 1".
The script I'm using:
[G,C] = kmeans(XX, K, 'distance','sqEuclidean', 'start','sample');
XX can be found in this link (XX value) and K is set to 3.
So if anyone could please advise me why this is happening.
It is simply telling you that during the assign-recompute iterations, a cluster became empty (lost all assigned points). This is usually caused by an inadequate cluster initialization, or that the data has less inherent clusters than you specified.
Try changing the initialization method using the start option. Kmeans provides four possible techniques to initialize clusters:
sample: sample K points randomly from the data as initial clusters (default)
uniform: select K points uniformly across the range of the data
cluster: perform preliminary clustering on a small subset
manual: manually specify initial clusters
You can also try different values of the emptyaction option, which tells MATLAB what to do when a cluster becomes empty.
Ultimately, I think you need to reduce the number of clusters, i.e try K=2 clusters.
I tried to visualize your data to get a feel for it:
load matlab_X.mat
figure('renderer','zbuffer')
line(XX(:,1), XX(:,2), XX(:,3), ...
'LineStyle','none', 'Marker','.', 'MarkerSize',1)
axis vis3d; view(3); grid on
After some manual zooming/panning, it looks like a silhouette of a person:
You can see that the data of 307200 points is really dense and compact, which confirms what I suspected; the data doesn't have that many clusters.
Here is the code I tried:
>> [IDX,C] = kmeans(XX, 3, 'start','uniform', 'emptyaction','singleton');
>> tabulate(IDX)
Value Count Percent
1 18023 5.87%
2 264690 86.16%
3 24487 7.97%
What's more, the points in cluster 2 are all duplicates of the same point ([0 0 0]):
>> unique(XX(IDX==2,:),'rows')
ans =
0 0 0
The other two clusters look like:
clr = lines(max(IDX));
for i=1:max(IDX)
line(XX(IDX==i,1), XX(IDX==i,2), XX(IDX==i,3), ...
'Color',clr(i,:), 'LineStyle','none', 'Marker','.', 'MarkerSize',1)
end
So you might get better clusters if you first remove the duplicate points...
In addition, you have a few outliers that might affect the result of the clustering. Visually, I narrowed down the range of the data to the following intervals, which encompass most of the data:
>> xlim([-500 100])
>> ylim([-500 100])
>> zlim([900 1500])
Here is the result after removing duplicate points (over 250K points) and outliers (around 250 data points), and clustering with K=3 (best out of 5 runs with the replicates option):
XX = unique(XX,'rows');
XX(XX(:,1) < -500 | XX(:,1) > 100, :) = [];
XX(XX(:,2) < -500 | XX(:,2) > 100, :) = [];
XX(XX(:,3) < 900 | XX(:,3) > 1500, :) = [];
[IDX,C] = kmeans(XX, 3, 'replicates',5);
with almost an equal split across the three clusters:
>> tabulate(IDX)
Value Count Percent
1 15605 36.92%
2 15048 35.60%
3 11613 27.48%
Recall that the default distance function is euclidean distance, which explains the shape of the formed clusters.
If you are confident in your choice of "k=3", here is the code I wrote to avoid getting an empty cluster:
[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');
while length(unique(IDX))<3 || histc(histc(IDX,[1 2 3]),1)~=0
% i.e. while one of the clusters is empty -- or -- we have one or more clusters with only one member
[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');
end
Amro described the reason clearly:
It is simply telling you that during the assign-recompute iterations,
a cluster became empty (lost all assigned points). This is usually
caused by an inadequate cluster initialization, or that the data has
less inherent clusters than you specified.
But the other option that could help to solve this problem is emptyaction:
Action to take if a cluster loses all its member observations.
error: Treat an empty cluster as an error (default).
drop: Remove any clusters that become empty. kmeans sets the corresponding return values in C and D to NaN. (For information about C and D see the kmeans documentation page.)
singleton: Create a new cluster consisting of the one point furthest from its centroid.
An example:
Let's run some simple code to see how this option changes the behavior and results of kmeans. This sample tries to partition 3 observations into 3 clusters while 2 of them are located at the same point:
clc;
X = [1 2; 1 2; 2 3];
[I, C] = kmeans(X, 3, 'emptyaction', 'singleton');
[I, C] = kmeans(X, 3, 'emptyaction', 'drop');
[I, C] = kmeans(X, 3, 'emptyaction', 'error')
The first call with singleton option displays a warning and returns:
I =
     3
     2
     1
C =
     2     3
     1     2
     1     2
As you can see, two cluster centroids are created at the same location ([1 2]), and the first two rows of X are assigned to these clusters.
The second call, with the drop option, also displays the same warning message, but returns different results:
I =
     1
     1
     3
C =
     1     2
   NaN   NaN
     2     3
It returns just two cluster centers and assigns the first two rows of X to the same cluster. I think most of the time this option is the most useful: in cases where observations are very close together and we want as many cluster centers as possible, we can let MATLAB decide on the number. You can remove the NaN rows from C like this:
C(any(isnan(C), 2), :) = [];
And finally, the third call generates an exception and halts the program as expected:
Empty cluster created at iteration 1.

Find median value of the largest clump of similar values in an array in the most computationally efficient manner

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one of the values removed and finding the value which corresponds to the largest reduction in STD and repeating that for the number of elements in the array, which is terribly inefficient.
for j = 1:7                       % discard 7 of the values, one per pass
    for i = 1:numel(H)            % try removing each value in turn
        G = double(H);
        G(i) = NaN;
        T(i) = nanstd(G);         % spread with value i removed
    end
    best = find(T==min(T));       % removal that reduces the spread the most
    H(best) = NaN;
end
x = find(H==max(H));
Any thoughts?
This approach bins your data and looks for the bin with the most elements. If your distribution consists of well-separated clusters, this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H); % <-- set # of bins here
[v bins]=hist(H,nbins);
[vm im]=max(v); % find max in histogram
bl = bins(2)-bins(1); % bin size
bm = bins(im); % position of bin with max #
ifb = find(abs(H-bm)<bl/2) % elements within the bin
median(H(ifb)) % median of those elements in the bin
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided neither of these is critical: you could set the number of bins to 3 (instead of length(H)) and it would still work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice; a better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
I should also note that there are clustering methods (kmeans) that may work better, but perhaps less efficiently. For instance this is the output of [H' kmeans(H',4) ]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H',nbin);
[mv iv]=max(histc(km,[1:nbin]));
H(km==km(iv))
median(H(km==km(iv)))
Notice however that kmeans does not necessarily return the same value every time it is run, so you might need to average over a few iterations.
I timed the two methods and found that kmeans takes ~10 X longer. However, it is more robust since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).