Clustering groups based on 3 variables in SPSS and R

I am currently trying to understand cluster analysis (using SPSS and R). Reading up on it has only further confused me about which clustering method to use to answer my research question.
My research question investigates whether a) certain participants can be clustered according to their change in variable A (a group that remains stable, a group that worsens, and a third group that improves over 2 assessments), and b) how these groups/clusters differ with regard to two other variables at assessment 1 (B and C). That is, do people with different patterns in B and C have a different change in A?
Question: I have standardised the data and, so far, have tried TwoStep hierarchical and k-means clustering. However, I am unsure whether this is the right method for answering my question. Where a fixed number of clusters is required, I chose 3 because I am interested in seeing clusters of people who improve/worsen/stay stable over time, and each cluster's individual pattern of B and C. Is this feasible? Am I missing something?
For k-means clustering I used the following syntax:
QUICK CLUSTER z_A_change z_B_mean z_C_mean
/MISSING=LISTWISE
/CRITERIA=CLUSTER(3) MXITER(10) CONVERGE(0)
/METHOD=KMEANS(NOUPDATE)
/SAVE CLUSTER DISTANCE.
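(For reference, I believe a rough R equivalent of this call might look like the following; the data frame df and its column names are assumptions based on the SPSS variable names above:

set.seed(1)  # k-means initialisation is random, so fix a seed for reproducibility
vars <- na.omit(df[, c("z_A_change", "z_B_mean", "z_C_mean")])  # listwise deletion
km <- kmeans(vars, centers = 3, iter.max = 10)
vars$cluster <- factor(km$cluster)  # save cluster membership, as /SAVE CLUSTER does

)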
Finally, is there any way to visualise these clusters on a 3-D plot in SPSS? I am not quite as proficient in R's ggplot2 or scatterplot3d as I would like to be.
Thank you in advance.

If you use TWOSTEP CLUSTER or QUICK CLUSTER to fit a three-cluster solution and save the cluster memberships as a new variable, you can create a grouped 3D scatter plot via the Chart Builder. In the menus, go to Graphs > Chart Builder. In the Gallery view, under Choose from: select Scatter/Dot. Among the icons shown underneath the main canvas, the second from the right in the top row should be the grouped 3D scatter. Move that icon into the canvas. Select each of the three variables used in the clustering for the X, Y, and Z axes. Specify the cluster membership variable as the Set Color variable, then click OK.
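If you would rather do this in R, here is a minimal sketch using the scatterplot3d package the question mentions, assuming a data frame df holding the three clustering variables and the saved cluster membership (all names hypothetical):

library(scatterplot3d)
# colour each point by its cluster membership
scatterplot3d(df$z_A_change, df$z_B_mean, df$z_C_mean,
              color = as.numeric(df$cluster), pch = 19,
              xlab = "change in A (z)", ylab = "B (z)", zlab = "C (z)")
legend("topright", legend = levels(df$cluster),
       col = seq_along(levels(df$cluster)), pch = 19)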

Related

How to merge clustering results for different clustering approaches?

Problem: It appears to me that a fundamental property of a clustering method c() is whether we can combine the results c(A) and c(B) by some function f() of two clusterings, in such a way that we do not have to apply the full clustering c(A+B) again but can instead compute f(c(A),c(B)) and still end up with the same result:
c(A+B) == f(c(A),c(B))
I suppose that a necessary condition for some c() to have this property is that it is deterministic, that is, the order of its internal processing is irrelevant to the result. However, this might not be sufficient.
It would be really nice to have some reference where one can look up which clustering methods support this and what a good f() looks like in each case.
Example: At the moment I am thinking about DBSCAN which should be deterministic if I allow border points to belong to multiple clusters at the same time (without connecting them):
One point is reachable from another point if it is in its eps-neighborhood
A core point is a point with at least minPts points reachable from it
An edge goes from every core point to all points reachable from it
Every point with incoming edge from a core point is in the same cluster as the latter
If you are wondering about the noise points: assume that each core point reaches itself (reflexivity); afterwards we define noise points to be clusters of size one. Border points are non-core points. If we then want a partitioning, we can randomly assign border points that lie in multiple clusters to one of them. I do not consider this relevant to the method itself.
Supposedly the only clustering where this is efficiently possible is single-linkage hierarchical clustering, because edges discarded within A x A and B x B are not needed for finding the MST of the joined set.
For DBSCAN specifically, you have the problem that the core-point property can change when you add data. So c(A+B) will likely have core points that were not core in either A or B. This can cause clusters to merge. f() therefore essentially needs to re-check all data points, i.e., rerun DBSCAN. While you can exploit the fact that core points of a subset must be core points of the entire set, you will still need to find the neighbors and the missing core points.
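You can see this effect directly with the dbscan package in R; the following is a toy sketch (the data, eps and minPts are made up):

library(dbscan)
set.seed(1)
A <- matrix(rnorm(100, mean = 0), ncol = 2)  # subset A
B <- matrix(rnorm(100, mean = 1), ncol = 2)  # subset B
cA <- dbscan(A, eps = 0.4, minPts = 5)             # c(A)
cB <- dbscan(B, eps = 0.4, minPts = 5)             # c(B)
cAB <- dbscan(rbind(A, B), eps = 0.4, minPts = 5)  # c(A+B)
# points that were noise/border in A or B can become core in A+B,
# so the number of clusters found jointly can differ from the separate runs:
c(max(cA$cluster), max(cB$cluster), max(cAB$cluster))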

Determine number of clusters for different datasets

I performed a clustering analysis of the media usage of different users in order to find groups that use a specific set of media (e.g. group 1 uses media A, B and C, and group 2 uses media B, C and D). I then divided the dataset into different groups according to which group each user belongs to (as a consequence, the original dataset and the new datasets have different sizes). Within these groups I would like to cluster again to find which media sets are used.
How can I determine the number of clusters to guarantee that the results are comparable?
Thank you in advance!
Don't rely on clustering to be stable.
It's a hypothesis generation tool.
You clustered, and now you have the hypothesis that there are groups with distinct media-usage patterns. You should first evaluate whether this hypothesis is adequate. What you want to do in your next step is assign those labels to subsets of the data. First of all, you should be able to simply carry over the previous labels by subsetting. But if this really is different data, you can label the new data, for example using the most similar record (nearest-neighbor classification), as sketched below. But that is classification now, because your classes are fixed.
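A minimal sketch of that nearest-neighbor labelling step in R, assuming old holds the records you already clustered, labels their cluster assignments, and new the records to label (all names hypothetical):

library(class)
# 1-NN classification: each new record inherits the cluster label
# of its most similar already-clustered record
new_labels <- knn(train = old, test = new, cl = factor(labels), k = 1)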

I cannot reproduce the results with kmeans in Orange

I've tried to repeat the same results with the same flow, and I don't understand why the results are different each time.
To describe the situation: I have a file with 192 instances and 37 features, and in all cases I select the same columns and preprocess by Median and StdDev. I compute a PCA with 7 principal components. The next step is to run the k-means algorithm (k between 2 and 8) on this 'new' dataset. The scatter plot shows the results for k=5.
I attached different images with my flows.
Image1: original flow
The first one is the original flow (highlighted in yellow), which I would like to repeat without the rest of the options (the second image).
Image2: flows repeated
However, when I tried to do so, I saw that the results are different (the third image). Of course the colors don't determine the differences, but the clusters are different. In addition, the Silhouette Scores differ between the flows.
Image3: results of the different flows
K-means initializes with k-means++, and my question is whether I can "control" this, or whether the k-means initialization is always random. I saw in other programs an option called seed, which is used so that an experiment can be repeated, but I didn't see this option or anything similar here.
I wonder if it is possible to always obtain the same results with the same flow (using k-means).
It seems that the issue happens because no random seed is set in the k-means widget, so the initialization is different each time you repeat the experiment, and because of the nature of your data the method converges differently. Could you please report this issue to the Orange3 issue tracker?
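To illustrate the general point outside Orange: once the seed is fixed, k-means becomes reproducible. For example, in R (using the built-in iris data):

set.seed(123)  # fix the seed so the random initialization repeats
km1 <- kmeans(scale(iris[, 1:4]), centers = 5, nstart = 10)
set.seed(123)
km2 <- kmeans(scale(iris[, 1:4]), centers = 5, nstart = 10)
identical(km1$cluster, km2$cluster)  # TRUE: same seed, same clustering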

Using node similarities in a graph or clustering visualization

Use case:
nodes are documents
Links are edges between documents, each with an associated correlation (e.g., 0 to 1)
Being new to this, it is not clear to me how to apply those correlations or "weights" so that the documents cluster in a logical manner.
Can anyone point me to an existing example?
Thanks in advance.
Positioning nodes is done by the layout. Use any force-directed (physics) layout, like CoSE or Cola. Those layouts allow you to specify how strongly nodes should be pulled towards one another on a per-edge basis.
Try some of the force-directed layouts to see which one gives results that you like. Each one has different trade-offs (speed, aesthetics, etc.).
Just make sure to set the edge force for whichever layout you choose, e.g. edgeElasticity for CoSE, to be proportional to edge.data('weight').
Example: http://js.cytoscape.org/demos/7b511e1f48ffd044ad66/
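The linked demo is Cytoscape.js (JavaScript); if it helps, the same principle, a force-directed layout where edge weight scales attraction, can be sketched in R with igraph (toy data, not your documents):

library(igraph)
edges <- data.frame(from = c("d1", "d1", "d2", "d3"),
                    to = c("d2", "d3", "d3", "d4"),
                    weight = c(0.9, 0.2, 0.8, 0.1))  # made-up correlations
g <- graph_from_data_frame(edges, directed = FALSE)
# Fruchterman-Reingold pulls nodes joined by high-weight edges closer together
lay <- layout_with_fr(g, weights = E(g)$weight)
plot(g, layout = lay, edge.width = 3 * E(g)$weight)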

Mixed variables (categorical and numerical) distance function

I want to fuzzy cluster a set of jobs.
Jobs Attributes are:
Categorical: position, diploma, skills
Numerical: salary, years of experience
My question is: how to calculate the distance between different jobs?
e.g. job1(programmer, bs computer science, (java, .net, responsibility), 1500, 3)
and job2(tester, bs computer science, (black and white box testing), 1200, 1)
PS: I'm a beginner in data mining and clustering; I highly appreciate your help.
You may take this as your starting point:
http://www.econ.upf.edu/~michael/stanford/maeb4.pdf. Distance between categorical data is nicely explained at the end.
Here is a good walk-through of several different clustering methods and how to use them in R: http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/clustering.pdf
In general, clustering of discrete data relies either on counts directly (e.g. overlaps between vectors) or on some statistic derived from counts. As much as I'd like to address the statistical side, I suppose you're interested in the algorithm, so I'll leave it at that.
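A common concrete choice for mixed data is Gower's distance, which handles categorical and numerical attributes together. Here is a minimal R sketch with cluster::daisy, feeding the resulting dissimilarities into the fuzzy clustering routine fanny (the toy jobs below are made up, and the set-valued skills attribute is omitted for simplicity):

library(cluster)
jobs <- data.frame(
  position = factor(c("programmer", "tester", "programmer", "analyst", "tester", "analyst")),
  diploma = factor(c("bs cs", "bs cs", "ms cs", "bs math", "bs cs", "ms math")),
  salary = c(1500, 1200, 2000, 1400, 1100, 1800),
  experience = c(3, 1, 6, 2, 1, 5))
d <- daisy(jobs, metric = "gower")  # mixed-type dissimilarity, scaled to [0, 1]
fc <- fanny(d, k = 2)               # fuzzy clustering on the dissimilarity matrix
fc$membership                       # degree of membership of each job in each cluster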