Can ELKI cluster non-normalized negative points? [duplicate] - cluster-analysis

This question already has answers here:
ELKI Kmeans clustering Task failed error for high dimensional data
(2 answers)
Closed 3 years ago.
I have gone through this question but the solution doesn't help.
ELKI Kmeans clustering Task failed error for high dimensional data
This is my first time with ELKI, so please bear with me. I have 45,000 2D data points (produced by doc2vec) that contain negative values and are not normalized. The dataset looks something like this:
-4.688612 32.793335
-42.990147 -20.499323
-24.948868 -10.822767
-45.502155 -40.917801
27.979715 -40.012688
1.867812 -9.838544
56.284512 6.756072
I am using the K-means algorithm to get 2 clusters. However, I get the following error:
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=0,maxdim=1 LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
So my question is: does ELKI require the data to be in the range [0, 1]? All the examples that I came across had their data within that range.
Or is it that ELKI does not accept negative values?
If it is something else, can someone please guide me through this?
Thank you!

ELKI can handle negative values just fine.
Your input data is not correctly formatted. It is the same problem as in ELKI Kmeans clustering Task failed error for high dimensional data.
Apparently your lines parse to either 0 or 1 values (that is the mindim=0,maxdim=1 in the error message). ELKI itself is fine with that, but
k-means requires the data to lie in a fixed-dimensional vector space R^d, so ELKI cannot run k-means on your data set. The underlying cause is a bad input file: double-check it, because there is probably at least one line that is not properly formatted.
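A minimal validation sketch (in Scala; the file name is hypothetical and a whitespace separator is assumed) that flags every line that does not parse into exactly two numbers:

import scala.io.Source
import scala.util.Try

object CheckElkiInput {
  def main(args: Array[String]): Unit = {
    // Hypothetical file name; point this at your actual input file.
    val lines = Source.fromFile("doc2vec-points.txt").getLines().zipWithIndex
    for ((line, i) <- lines) {
      val tokens = line.trim.split("\\s+").filter(_.nonEmpty)
      // For 2D k-means every line must contribute exactly two numeric values.
      if (tokens.length != 2 || tokens.exists(t => Try(t.toDouble).isFailure))
        println(s"line ${i + 1} looks malformed: '$line'")
    }
  }
}

Any line it reports (an empty line, a stray label, a wrong separator) would explain the mindim=0,maxdim=1 in the error message.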

Related

How to revert One-Hot Encoding in Spark (Scala)

After running k-means (Spark MLlib, Scala) I want to make sense of the cluster centers I obtained from data that I pre-processed using (among other transformers) MLlib's OneHotEncoder.
A center looks like this:
Cluster Center 0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0,0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]
Which is obviously not very human friendly... Any ideas on how to revert the one-hot encoding and retrieve the original categorical features?
What if I look for the data point which is closest (using the same distance metric that is used by k-means, which I assume is Euclidean distance) to the centroid and then revert the encoding of that particular data point?
For the cluster centroids it is not possible (or at least strongly discouraged) to reverse the encoding. Imagine the original feature is "3" out of 6, encoded as [0.0,0.0,1.0,0.0,0.0,0.0]. In this case it is easy to extract 3 as the correct feature from the encoding.
But after running k-means you may get a cluster centroid that, for this feature, looks like [0.0,0.13,0.0,0.77,0.1,0.0]. If you decode this back to the earlier representation, say as "4" out of 6 because feature 4 has the largest value, you lose information and may misrepresent the cluster.
Edit: adding a possible way to revert the encoding on data points, moved from the comments into the answer.
If you have IDs on the data points, you can perform a select/join on the ID after assigning each data point to a cluster, which recovers the state from before the encoding.
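A minimal sketch of that join (Spark SQL in Scala; the file and column names are assumptions, not from the question):

import org.apache.spark.sql.SparkSession

object RevertByJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("revert-by-join").getOrCreate()

    // original: one row per ID with the categorical/numeric columns before encoding
    val original = spark.read.parquet("original_features.parquet")
    // assigned: (id, cluster) pairs produced after running k-means on the encoded vectors
    val assigned = spark.read.parquet("cluster_assignments.parquet")

    // Joining on the ID recovers the pre-encoding representation of each cluster member.
    val decoded = assigned.join(original, Seq("id"))
    decoded.groupBy("cluster").count().show()

    spark.stop()
  }
}

You can then summarize the original categorical columns per cluster instead of reading the encoded centroid directly.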

Clustering in Matlab

Hi, I am trying to cluster using linkage(). Here is the code I am trying:
Y = pdist(data);
Z = linkage(Y);
T = cluster(Z,'maxclust',4096);
I am getting the following error:
The number of elements exceeds the maximum allowed size in
MATLAB.
Error in ==> linkage at 135
Z = linkagemex(Y,method);
The data size is 56710×128. How can I apply the code to small chunks of data and then merge the resulting clusters optimally? Or is there any other solution to the problem?
MATLAB probably cannot cluster this many objects with this algorithm.
Most likely it uses distance matrices in its implementation. A pairwise distance matrix for 56710 objects needs 56710*56709/2 = 1,607,983,695 entries, or some 12 GB of RAM; most likely a working copy of this is needed as well (see the back-of-the-envelope sketch below). Chances are that the default MATLAB data structures are not prepared to handle this amount of data (and you won't want to wait for the algorithm to finish either; that is probably why only a certain size is "allowed").
Try using a subset, and see how well it scales. If you use 1000 instances, does it work? How long does the computation take? If you increase to 2000, how much longer does it take?
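To make the memory estimate concrete, here is a small back-of-the-envelope sketch (in Scala):

object LinkageMemorySketch {
  def main(args: Array[String]): Unit = {
    val n = 56710L
    val entries = n * (n - 1) / 2                    // condensed pairwise distance matrix
    val gib = entries * 8.0 / (1024 * 1024 * 1024)   // 8 bytes per double entry
    println(f"$entries%,d entries, about $gib%.1f GiB as doubles")
  }
}

This prints roughly 12 GiB for the distances alone, before any working copies.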

K-Means with equal numbers of a binary attribute value in each cluster

Given a certain binary attribute, I want to ensure that the clusters produced by K-means have equal numbers of data points where the said binary attribute's value is 1.
I know the above sentence is wordy so I will explain using an example.
Suppose I have an attribute "Asian" with 40 out of my 100 data points having the value of "Asian" = 1. For k = 10, I want each cluster to have exactly 4 points with "Asian" = 1.
Is there a simple way of achieving this? I have racked my brains but have not been able to come up with one. Please note that I am a beginner when it comes to clustering problems.
Here is a tutorial on how to write such a k-means modification:
http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans
It's not exactly what you need, but it is a closely related k-means variant that can easily be adapted to your needs. Plus, it is a walkthrough tutorial.
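To illustrate the kind of adaptation involved, here is a sketch (in Scala, with made-up data, not taken from the tutorial) of a single quota-constrained assignment step: flagged points are assigned nearest-pair-first while their cluster still has quota room, and everything else gets the plain nearest-center assignment. Iterating this with the usual mean updates yields a constrained k-means in the spirit of the tutorial.

import scala.util.Random

object QuotaKMeansSketch {
  // Hypothetical 2D points carrying the binary flag from the question.
  case class Point(id: Int, x: Double, y: Double, flagged: Boolean)

  def dist(p: Point, c: (Double, Double)): Double =
    math.hypot(p.x - c._1, p.y - c._2)

  def main(args: Array[String]): Unit = {
    val rnd = new Random(0)
    val points = (0 until 100).map(i => Point(i, rnd.nextGaussian(), rnd.nextGaussian(), i < 40))
    val centers = Vector.fill(10)((rnd.nextGaussian(), rnd.nextGaussian()))
    val quota = 4 // 40 flagged points spread evenly over 10 clusters

    // Flagged points: walk all (distance, point, center) triples by increasing
    // distance and give each point the nearest center with quota room left.
    val triples = points.filter(_.flagged)
      .flatMap(p => centers.indices.map(c => (dist(p, centers(c)), p.id, c)))
      .sortBy(_._1)
    val counts = Array.fill(centers.size)(0)
    var flaggedAssign = Map.empty[Int, Int]
    for ((_, pid, c) <- triples) {
      if (!flaggedAssign.contains(pid) && counts(c) < quota) {
        counts(c) += 1
        flaggedAssign += (pid -> c)
      }
    }

    // Unflagged points keep the plain nearest-center assignment.
    val rest = points.filterNot(_.flagged)
      .map(p => p.id -> centers.indices.minBy(c => dist(p, centers(c))))

    (flaggedAssign.toSeq ++ rest).sortBy(_._1)
      .foreach { case (id, c) => println(s"point $id -> cluster $c") }
  }
}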

Gaussian Mixture Modelling Matlab

I'm using a Gaussian mixture model to estimate the log-likelihood function (the parameters are estimated by the EM algorithm), in MATLAB. My data has size 17991402×1, i.e. 17,991,402 data points of one dimension.
When I run gmdistribution.fit(X,2) I get the desired output.
But when I run gmdistribution.fit(X,k) for k > 2, the code crashes with an "OUT OF MEMORY" error. I have also tried an open-source implementation, which gives me the same problem. Can someone help me out here? I'm basically looking for code that lets me fit different numbers of components to such a large dataset.
Thanks!
Is it possible for you to decrease the number of iterations? The default is 100.
OPTIONS = statset('MaxIter',50,'Display','final','TolFun',1e-6)
gmdistribution.fit(X,3,OPTIONS)
Or you may consider under-sampling the original data; see the sketch below.
A general solution to out-of-memory problems is described in this document.
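A minimal sketch of the under-sampling idea (in Scala, with synthetic stand-in data; the 5% rate is an assumption to tune against your accuracy needs):

import scala.util.Random

object UnderSampleSketch {
  def main(args: Array[String]): Unit = {
    val rnd = new Random(42)
    // Stand-in for the 17,991,402 one-dimensional observations.
    val data = Array.fill(1000000)(rnd.nextGaussian())

    // Keep a random 5% of the points; for a handful of mixture components,
    // EM estimates are usually stable on far fewer points than the full set.
    val sample = data.filter(_ => rnd.nextDouble() < 0.05)
    println(s"kept ${sample.length} of ${data.length} points")
    // Write `sample` out and run gmdistribution.fit on it instead of the full X.
  }
}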

Mixed variables (categorical and numerical) distance function

I want to fuzzy cluster a set of jobs.
Jobs Attributes are:
Categorical: position, diploma, skills
Numerical: salary, years of experience
My question is: how to calculate the distance between different jobs?
e.g. job1(programmer, bs computer science, (java, .net, responsibility), 1500, 3)
and job2(tester, bs computer science, (black and white box testing), 1200, 1)
PS: I'm a beginner in data mining and clustering; I highly appreciate your help.
You may take this as your starting point:
http://www.econ.upf.edu/~michael/stanford/maeb4.pdf. Distances between categorical data are nicely explained at the end.
Here is a good walk-through of several different clustering methods and how to use them in R: http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/clustering.pdf
In general, clustering of discrete data relies either on counts directly (e.g. overlaps between vectors) or on some statistic derived from counts. As much as I'd like to address the statistical side, I suppose you're interested in the algorithm, so I'll leave it at that.
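As a concrete starting point, here is a sketch of a Gower-style mixed distance (in Scala; the numeric ranges and the equal attribute weights are assumptions you should adjust to your data):

object MixedDistanceSketch {
  // Hypothetical job record matching the example in the question.
  case class Job(position: String, diploma: String, skills: Set[String],
                 salary: Double, yearsExp: Double)

  // Assumed ranges used to scale the numeric attributes into [0, 1].
  val salaryRange = 5000.0
  val expRange = 40.0

  // Gower-style: simple matching for single categories, Jaccard for the skill
  // set, range-normalized absolute differences for the numeric attributes.
  def distance(a: Job, b: Job): Double = {
    val posD = if (a.position == b.position) 0.0 else 1.0
    val dipD = if (a.diploma == b.diploma) 0.0 else 1.0
    val union = (a.skills ++ b.skills).size
    val skillD =
      if (union == 0) 0.0
      else 1.0 - a.skills.intersect(b.skills).size.toDouble / union
    val salD = math.abs(a.salary - b.salary) / salaryRange
    val expD = math.abs(a.yearsExp - b.yearsExp) / expRange
    (posD + dipD + skillD + salD + expD) / 5.0  // unweighted mean over attributes
  }

  def main(args: Array[String]): Unit = {
    val job1 = Job("programmer", "bs computer science",
                   Set("java", ".net", "responsibility"), 1500, 3)
    val job2 = Job("tester", "bs computer science",
                   Set("black and white box testing"), 1200, 1)
    println(f"d(job1, job2) = ${distance(job1, job2)}%.3f")
  }
}

A fuzzy clustering algorithm that accepts an arbitrary distance function (e.g. fuzzy c-medoids) can then consume this directly.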