How to generate a 'clusterable' dataset in MATLAB

I need to test my Gap Statistics algorithm (which should tell me the optimal k for the dataset), and in order to do so I need to generate a big dataset that is easily clusterable, so that I know the optimal number of clusters a priori. Do you know any fast way to do it?

It very much depends on what kind of dataset you expect - 1D, 2D, 3D, normal distribution, sparse, etc? And how big is "big"? Thousands, millions, billions of observations?
Anyway, my general approach to creating easy-to-identify clusters is concatenating sequential vectors of random numbers with different offsets and spreads:
DataSet = [5*randn(1000,1); 20 + 3*randn(1000,1); 120 + 25*randn(1000,1)];
Groups = [1*ones(1000,1);2*ones(1000,1);3*ones(1000,1)];
This can be extended to N features by using e.g.
randn(1000,5)
or concatenating horizontally
DataSet1 = [5*randn(1000,1); 20 + 3*randn(1000,1); 120 + 25*randn(1000,1)];
DataSet2 = [-100 + 7*randn(1000,1); 1 + 0.1*randn(1000,1); 20 + 3*randn(1000,1)];
DataSet = [DataSet1 DataSet2];
and so on.
randn also accepts multidimensional size arguments, e.g.
randn(1000,10,3);
for looking at higher-dimensional clusters.
If you don't have details on what kind of datasets this is going to be applied to, you should find that out first.
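That said, a minimal sanity check for the original question might look like the following, assuming the Statistics and Machine Learning Toolbox is available (its evalclusters function implements the gap criterion, so the known k = 3 of the DataSet built above should be recovered):
% DataSet as constructed above (3000-by-1, three well-separated groups)
eva = evalclusters(DataSet, 'kmeans', 'gap', 'KList', 1:6);
disp(eva.OptimalK)                        % should report 3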

Related

Multi-label clustering

I have a question regarding a task that I am trying to solve. The data that I have are characterisation data,
meaning that I have a label (PASS/FAIL) for every single datapoint.
So my data matrix has n rows and m columns, and the target variables are again an n-by-m matrix
composed of binary values (0s and 1s).
My task is to apply clustering and partition all these datapoints into two clusters, one being for PASS
datapoints and the other for FAIL datapoints. I wasn't able to find an algorithm that can solve
this type of 'multi-label' problem with clustering.
I tried algorithms like k-means, but when tuning the number of clusters to initialise
I get k=6, which doesn't really make sense. Outliers have already been dropped from the data,
and it is normalised as well.
I have a large number of features in my data matrix (e.g. >3000), and I tried to apply
dimensionality reduction methods like PCA to at least drop the features that are less
relevant than the rest. But I am not sure whether this is applicable in my case, given that
I have a binary matrix as target variables.
Is there a specific algorithm that can solve this type of problem and if so, what is the
necessary pre-processing I should be doing before applying it?
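For reference, a minimal MATLAB sketch of the workflow described above (PCA for dimensionality reduction, then k-means forced to k = 2); the matrix name X and the 95% variance cutoff are assumptions for illustration:
% X: n-by-m data matrix, assumed already normalised with outliers removed
[~, score, ~, ~, explained] = pca(X);
nComp = find(cumsum(explained) >= 95, 1);   % keep ~95% of the variance (arbitrary)
Xred  = score(:, 1:nComp);
idx   = kmeans(Xred, 2, 'Replicates', 10);  % force two clusters (PASS/FAIL)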

Clustering cancer gene expression data at the same time?

I'm working on gene expression data clustering techniques and I have downloaded 35 datasets from the web.
Each of the 35 datasets represents a type of cancer, and each dataset has its own features. Some of these datasets share several features, and some of them share no features at all.
My question is: how do we ultimately cluster these data when many of them do not have the same features?
I think we should run the clustering operation on all 35 datasets at the same time.
Is my idea correct?
Any help is appreciated.
I assume that when you say heterogeneous you mean things like different gene expression platforms where different genes are present.
You can use any clustering technique, but you'll need to write your own distance metric that takes into account heterogeneity within your dataset. For instance, you could use the correlation of all the genes that are in common between pairwise samples, create a distance matrix from this, then use something like hierarchical clustering on this distance matrix.
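As an illustration, here is a hedged MATLAB sketch of that idea; the samples cell array and its Gene/Expr fields are a hypothetical layout, one table per sample:
% samples{i}: table with variables Gene (names) and Expr (values) -- hypothetical
n = numel(samples);
D = zeros(n);                               % pairwise distance matrix
for i = 1:n
    for j = i+1:n
        [~, ia, ib] = intersect(samples{i}.Gene, samples{j}.Gene);
        r = corr(samples{i}.Expr(ia), samples{j}.Expr(ib));
        D(i,j) = 1 - r;                     % turn correlation into a distance
        D(j,i) = D(i,j);
    end
end
Z   = linkage(squareform(D), 'average');    % hierarchical clustering on D
grp = cluster(Z, 'maxclust', 35);           % e.g. one cluster per cancer type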
I think there is no need to write your own distance metric. There already exist plenty of distance metrics that work for mixed data types. For instance, the Gower distance works well for mixed data types. See this post on the same. But if your data contains only continuous values then you can use k-means. You'll also be better off if the data is preprocessed first.

Comparing k-means clustering

I have 150 images, 15 each of 10 different people. So basically I know which images should belong together, if clustered.
These images have 73-dimensional feature vectors, and I clustered them into 10 clusters using the kmeans function in MATLAB.
Later, I processed these 150 data points and reduced their dimensionality from 73 to 3 for my work and applied the same kmeans function to them.
I want to compare the results obtained on these two datasets (processed and unprocessed) by applying the same k-means function, and I wish to know whether the processing that reduced the data to a lower dimension improves the k-means clustering or not.
I thought comparing the variance of each cluster could be one parameter for comparison; however, I am not sure if I can directly compare and evaluate my results (within-cluster sum of distances etc.), as the two cases have different dimensions. Could anyone please suggest a way to compare the k-means results, some way to normalize them, or any other comparison that I can make?
I can think of three options. I am unaware of any well-developed methodology to do this specifically with k-means clustering.
1. Look at the confusion matrices between the two approaches.
2. Compare the Mahalanobis distances between the clusters, and between items in clusters and their nearest other clusters.
3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with 3 is that the distance metrics get skewed: 3D distances and 73D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on k-means if you are adamant about that path; rank speculation is fun, but standing on the shoulders of giants is better.
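For the first option, a minimal MATLAB sketch, assuming trueLabels holds the known person identity (1 to 10) for each of the 150 images and X73/X3 are assumed names for the two feature matrices:
idx73 = kmeans(X73, 10, 'Replicates', 20);  % cluster the 73-D features
idx3  = kmeans(X3,  10, 'Replicates', 20);  % cluster the reduced 3-D features
C73 = confusionmat(trueLabels, idx73);      % rows: true person, columns: cluster
C3  = confusionmat(trueLabels, idx3);
% Cluster labels are arbitrary permutations of 1..10, so compare how close each
% confusion matrix is to a permuted diagonal rather than reading the raw diagonal.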

Select data based on a distribution in matlab

I have a set of data in a vector. If I were to plot a histogram of the data I could see (by clever inspection) that the data is distributed as the sum of three distributions:
One normal distribution centered around x_1 with variance s_1;
One normal distribution centered around x_2 with variance s_2;
One lognormal distribution.
My data is obviously a subset of the 'real' data.
What I would like to do is take a random subset of my data, ensuring that the resulting subset is a reasonably representative sample of the original data.
I would like to do this as easily as possible in MATLAB, but I am new to both statistics and MATLAB and am unsure where to start.
Thank you for any help :)
If you can identify each of the 3 distributions (in the sense that you can estimate their parameters), one approach could be to select a random subset of your data, then estimate the parameters for each distribution and see whether they are close enough (according to your own definition of "close") to the parameters of the original distributions. You should repeat this process several times and look at the average difference for a given random subset size.
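A hedged MATLAB sketch of that procedure, using a three-component Gaussian mixture (fitgmdist) as a rough stand-in for the two normals plus the lognormal; the variable name data, the 20% subset size and the trial count are arbitrary choices:
% data: full observation vector, assumed to be a column vector
fullFit = fitgmdist(data, 3, 'Replicates', 5);      % reference parameters

nTrials    = 100;
subsetSize = round(0.2 * numel(data));
muDiff     = zeros(nTrials, 1);
for t = 1:nTrials
    sub    = randsample(data, subsetSize);           % random subset, no replacement
    subFit = fitgmdist(sub, 3, 'Replicates', 5);
    muDiff(t) = norm(sort(subFit.mu) - sort(fullFit.mu));  % sorting handles label switching
end
fprintf('Average deviation of component means: %g\n', mean(muDiff));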

Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, where each data point is represented by a set of extracted features. Generally, clustering algorithms require that all input data have the same dimensionality (the same number of features), that is, the input data X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features for each data point will most likely differ (I mean, I have a dataset X where the data points do not all have the same number of features).
Is there any way to adapt them, in order to cluster them using some common clustering algorithm that requires data of the same dimensionality?
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques. For example Sparse SVD (e.g. Lanczos algorithm) or sparse PCA. Then apply traditional clustering on the dense lower dimensional outputs.
Directly apply a sparse clustering algorithm, such as sparse k-means. Note you can probably find a PDF of the paper if you look hard enough online (try scholar.google.com).
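A minimal sketch of the first option in MATLAB, assuming the points have already been arranged into one sparse matrix Xs over the union of all features (absent features left as zero); the target dimensionality and cluster count are arbitrary:
% Xs: n-by-d sparse matrix, one row per data point, zeros for absent features
k = 20;                                   % target dimensionality (arbitrary)
[U, S, ~] = svds(Xs, k);                  % truncated (Lanczos-based) SVD
Xlow = U * S;                             % dense low-dimensional embedding
idx  = kmeans(Xlow, 5, 'Replicates', 10); % conventional clustering on the embedding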
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note: the exact number of component classifications you choose doesn't really matter, as long as it's high enough, because the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms such as DBSCAN that in fact do not care about the form of your data. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, an extension of DBSCAN that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering, and is highly domain dependent. If your features are e.g. word occurrences, cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features with each other efficiently, as there is no order to the features (so one cannot simply compare the first keypoint of one image with the first keypoint of another, etc.). A possible approach here is to derive another, uniform, set of features. Typically, bag-of-words features are used for such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
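For example, with bag-of-words style counts arranged into a matrix B (one row per object, an assumed name), a hedged sketch using MATLAB's dbscan (R2019a or later) on a precomputed cosine distance matrix could look like:
D = squareform(pdist(B, 'cosine'));       % pairwise cosine distances
epsilon = 0.3;                            % neighbourhood radius (needs tuning)
minpts  = 5;                              % minimum neighbours (needs tuning)
idx = dbscan(D, epsilon, minpts, 'Distance', 'precomputed');
% idx == -1 marks noise points; positive integers are cluster labels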
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.