Cluster analysis and binomial testing - cluster-analysis

I performed a cluster analysis (using k-means) on some marketing data and found that I should use 5 clusters. Next, I would like to see whether the subjects in each cluster have a significant relationship with a disease. For each subject in each cluster, I know whether the subject had the disease or not. Is there a way to test for a significant relationship between cluster and disease? And if I can do that with a binomial test, what should the parameters be? Thanks!

Related

Cluster Analysis Assumptions and diagnostics

I have a question. I am conducting a study using cluster analysis. My number of observations is smaller than the number of variables, which means my data matrix has n < p. Is it true that I am violating the assumptions of cluster analysis? Do I have to reduce the number of variables?

Clustering in Weka

I have some data collected using an online survey, so there are no classes/labels in the data to evaluate clustering results against. I am trying to cluster the participants into groups for another task.
In the data, I have 10 attributes, such as Age and Gender, and 111 examples (data points).
It's my first time performing clustering, and it has been difficult to find potential clusters in the data.
Here are the steps I have performed in Weka:
I tried to cluster the data using all attributes, all types of clustering in Weka (Cobweb, EM, etc.) and different numbers of clusters (1-10). When I visualise the clusters, they don't make any sense, and the data are widely spread along the x and y axes.
I applied PCA and selected different numbers of attribute combinations according to the ranks obtained from PCA. The best clustering result was obtained using k-means with only 2 combinations of attributes, 3 clusters, and a seed of 7 (sorry, I have no idea what the seed is).
My Questions:
Are the steps I performed to cluster the data correct? If not, please give me some advice.
Is this considered a good clustering result?
How can I optimise or improve my clusters?
What is meant by the seed in Weka clustering?

Clustering evaluation, taking into account the number of clusters

I know how to calculate the recall, precision and F-measure for clusters, as explained in this course: https://www.coursera.org/learn/cluster-analysis/lecture/BcYhV/6-4-external-measures-1-matching-based-measures
However, what if the number of clusters generated by my system is larger than the number of clusters in the ground truth? How can we calculate these measures then?
It seems that there is no penalty for systems that generate more clusters, since we are just matching each cluster in the ground truth to the best cluster generated by my system. Am I missing something here?
Don't compute them as in classification!
Either you work with pairs of points, which is the most common approach and is used by the very popular ARI (Adjusted Rand Index) measure.
Or you find the cluster with the maximum overlap; this is sometimes called "matching". I am not convinced by this approach.
Last but not least, you could use the Hungarian algorithm to find the best partial 1:1 correspondence, and consider unmatched clusters to be all false.
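
To make the pair-based option concrete, here is a minimal sketch in plain Python. It is only an illustration, not the course's or any library's implementation: the function name pair_counting_scores and the toy labels are made up for this example. A pair of points counts as a true positive if both labelings put the two points in the same cluster, a false positive if only the predicted clustering does, and a false negative if only the ground truth does.

```python
from itertools import combinations


def pair_counting_scores(truth, predicted):
    """Return (precision, recall, F1) computed over all pairs of points."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_pred and same_truth:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # grouped together only by the system
        elif same_truth:
            fn += 1  # grouped together only in the ground truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: 2 ground-truth clusters, but the system produced 4.
truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 2, 2, 3]
precision, recall, f1 = pair_counting_scores(truth, predicted)
print(precision, recall, f1)
```

In this toy case every predicted cluster is pure, so pair precision stays at 1, but only 2 of the 6 ground-truth pairs are kept together, so recall drops to 1/3. That is the penalty for over-splitting that the cluster-to-best-cluster matching in the question does not show.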

Partitioning dense data points using clustering

I have to cluster data that are power profiles of solar panel output. I tried various algorithms, from classical k-means to shape-based clustering. I have to decide the number of clusters in the pool of data, and I always get 2 clusters, so I think they are very dense.
Is there any way I can partition a dense cluster?

Decision on number of clusters in Data Mining

Whenever we want to cluster some data, the user is required to give the number of clusters. For example, with the k-means algorithm we need to specify how many clusters are required.
My question: is it possible for the algorithm to decide by itself how many clusters are appropriate for a particular data set?
There are several clustering algorithms that do not require a desired number of clusters as an input to the algorithm. An example of such an algorithm is the mean-shift clustering algorithm. However, you will need to specify a kernel as an input to the algorithm. This kernel selection (e.g., the size and shape of the kernel) will impact the number of clusters that you get as an output.
Some more information:
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
http://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html
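
To show how the kernel size drives the number of clusters, here is a rough sketch of mean shift on 1-D data with a flat kernel, written in plain Python. It is a toy version for illustration only, not the scikit-learn implementation linked above; the function name mean_shift_1d, the data points and the two bandwidth values are assumptions chosen for this example.

```python
def mean_shift_1d(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift on 1-D data; returns the sorted list of modes."""
    modes = []
    for x in points:
        # Repeatedly move x to the mean of the points inside its window.
        for _ in range(max_iter):
            window = [p for p in points if abs(p - x) <= bandwidth]
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        # Treat modes that lie within one bandwidth of each other as one cluster.
        if not any(abs(m - x) <= bandwidth for m in modes):
            modes.append(x)
    return sorted(modes)


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
narrow = mean_shift_1d(points, bandwidth=1.0)
wide = mean_shift_1d(points, bandwidth=5.0)
print(len(narrow), "clusters with bandwidth 1.0")
print(len(wide), "clusters with bandwidth 5.0")
```

With the narrow window the three visible groups each keep their own mode, while the wide window merges everything into a single cluster. So the number of clusters is not given directly, but it is still controlled indirectly through the bandwidth.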
I'm not an expert on this, but to answer your question: yes, there are methods to automatically determine the number of clusters, for k-means for example.
It's quite complicated, but given a dataset and a clustering method you can compute what is called the gap statistic in order to estimate the number of clusters.
If you are an R user, try the clusGap and maxSE functions (in the cluster package).
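
For reference, this is how the gap statistic is usually written, following the definition in the original paper by Tibshirani, Walther and Hastie as I recall it. The symbols are not from the posts above: W_k is the pooled within-cluster dispersion for k clusters, d is the distance used, B is the number of reference data sets drawn from a null distribution with no cluster structure, and sd_k is the standard deviation of log W_k over those reference sets.

```latex
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{i, i' \in C_r} d_{i i'}

\mathrm{Gap}_n(k) = \mathrm{E}^{*}_{n}\left[ \log W_k \right] - \log W_k

s_k = \mathrm{sd}_k \sqrt{1 + \frac{1}{B}}

\hat{k} = \min \left\{ k : \mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1} \right\}
```

The expectation is estimated by averaging log W_k over the B reference sets. As far as I know, clusGap computes the gap values and their standard errors, and maxSE then applies a selection rule of this kind to pick the number of clusters.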