How to revert One-Hot Encoding in Spark (Scala)
After running k-means (Spark MLlib, Scala) I want to make sense of the cluster centers I obtained from data which I pre-processed using (among other transformers) MLlib's OneHotEncoder.
A center looks like this:
Cluster Center 0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0,0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]
Which is obviously not very human-friendly... Any ideas on how to revert the one-hot encoding and retrieve the original categorical features?
What if I look for the data point which is closest (using the same distance metric that is used by k-means, which I assume is Euclidean distance) to the centroid and then revert the encoding of that particular data point?
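If you do go down that route, a minimal sketch of the lookup could look like the following (assuming an RDD of (id, encoded vector) pairs and the centers from the trained model's clusterCenters; the names encoded and centers are made up for illustration):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// For every centroid, find the id of the encoded data point closest to it
// (squared Euclidean distance, which is what k-means minimises).
def closestPointPerCenter(encoded: RDD[(Long, Vector)],
                          centers: Array[Vector]): Map[Int, Long] =
  centers.zipWithIndex.map { case (center, k) =>
    val (bestId, _) = encoded
      .map { case (id, v) => (id, Vectors.sqdist(v, center)) }
      .reduce((a, b) => if (a._2 <= b._2) a else b)
    k -> bestId
  }.toMap

The id returned for each center can then be used to look up the original, un-encoded row.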
For the cluster centroids it is not really possible (or at least strongly discouraged) to reverse the encoding. Imagine you have the original feature "3" out of 6 and it is encoded as [0.0,0.0,1.0,0.0,0.0,0.0]. In this case it is easy to extract 3 as the correct feature from the encoding.
But after running k-means you may get a cluster centroid that, for this feature, looks like [0.0,0.13,0.0,0.77,0.1,0.0]. If you decode this back to the representation you had before, say "4" out of 6 because feature 4 has the largest value, then you lose information and the interpretation of the model may be misleading.
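For illustration, such a lossy argmax-style "decoding" of one one-hot block of a centroid could look like this (plain Scala, hypothetical helper):

// Lossy "decoding" of one one-hot block of a centroid: take the index of the
// largest weight. This loses exactly the information discussed above.
def argmaxDecode(block: Array[Double]): Int =
  block.zipWithIndex.maxBy(_._1)._2

argmaxDecode(Array(0.0, 0.13, 0.0, 0.77, 0.1, 0.0))
// -> 3 (zero-based), i.e. the 4th of the 6 original categories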
Edit: adding a possible way (from the comments) to revert the encoding on data points.
If you have IDs on the data points, you can perform a select/join operation on the ID after you have assigned a data point to a cluster, to get back the old state from before the encoding.
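A minimal Spark sketch of that join, assuming hypothetical column names "id" (on both frames) and "cluster" (on the assignments):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// original:    the raw rows with their categorical columns plus an "id" column
// assignments: (id, cluster) pairs produced after encoding + k-means
def originalRowsOfCluster(original: DataFrame,
                          assignments: DataFrame,
                          clusterId: Int): DataFrame =
  assignments
    .filter(col("cluster") === clusterId) // keep only the chosen cluster
    .join(original, "id")                 // recover the pre-encoding columns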
Related
How to merge clustering results for different clustering approaches?
Problem: It appears to me that a fundamental property of a clustering method c() is whether we can combine the results c(A) and c(B) by some function f() of two clusterings, in such a way that we do not have to apply the full clustering c(A+B) again, but can instead do f(c(A),c(B)) and still end up with the same result: c(A+B) == f(c(A),c(B)). I suppose that a necessary condition for some c() to have this property is that it is deterministic, i.e. the order of its internal processing is irrelevant for the result. However, this might not be sufficient. It would be really nice to have some reference where one can look up which clustering methods support this, and what a good f() looks like in the respective case.
Example: At the moment I am thinking about DBSCAN, which should be deterministic if I allow border points to belong to multiple clusters at the same time (without connecting them):
- One point is reachable from another point if it is in its eps-neighborhood.
- A core point is a point with at least minPts points reachable.
- An edge goes from every core point to all points reachable from it.
- Every point with an incoming edge from a core point is in the same cluster as the latter.
If you miss the noise points, then assume that each core node reaches itself (reflexivity), and afterwards we define noise points to be clusters of size one. Border points are non-core points. Afterwards, if we want a partitioning, we can randomly assign the border points that lie in multiple clusters to one of them. I do not consider this relevant for the method itself.
Supposedly the only clustering where this is efficiently possible is single-linkage hierarchical clustering, because edges removed from A x A and B x B are not necessary for finding the MST of the joined set. For DBSCAN specifically, you have the problem that the core-point property can change when you add data. So c(A+B) likely has core points that were not core in either A nor B. This can cause clusters to merge. f() supposedly needs to re-check all data points, i.e., rerun DBSCAN. While you can exploit the fact that core points of the subset must be core points of the entire set, you will still need to find neighbors and missing core points.
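A toy illustration of why the core-point property is not stable under merging (plain Scala, 1-dimensional points, made-up eps and minPts):

val eps = 1.0
val minPts = 3

// A point is core if its eps-neighbourhood (including itself) has at least minPts points.
def isCore(p: Double, data: Seq[Double]): Boolean =
  data.count(q => math.abs(q - p) <= eps) >= minPts

val a = Seq(0.0, 0.9)   // partition A
val b = Seq(-0.9, -5.0) // partition B

isCore(0.0, a)      // false: only 2 points within eps in A
isCore(0.0, a ++ b) // true:  0.0, 0.9 and -0.9 are within eps in A+B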
Running k-medoids algorithm in ELKI
I am trying to run ELKI to implement k-medoids (for k=3) on a dataset in the form of an arff file (using the ARFFParser in ELKI). The dataset has 7 dimensions, but the clustering results that I obtain show clustering only on the level of one dimension, and only for 3 attributes, ignoring the rest. Like this: Could anyone help with how I can obtain a clustering visualization for all dimensions?
ELKI is mostly used with numerical data. Currently, ELKI does not have a "mixed" data type, unfortunately. The ARFF parser will split your data set into multiple relations:
- a 1-dimensional numerical relation containing age
- a LabelList relation storing sex and region
- a 1-dimensional numerical relation containing salary
- a LabelList relation storing married
- a 1-dimensional numerical relation storing children
- a LabelList relation storing car
Apparently it has messed up the relation labels, though. But other than that, this approach works perfectly well with arff data sets that consist of numerical data + a class label, for example - the use case this parser was written for. It is well-defined and consistent behaviour, though not what you expected it to do. The algorithm then ran on the first relation it could work with, i.e. age only.
So here is what you need to do:
- Implement an efficient data type for storing mixed-type data.
- Modify the ARFF parser to produce a single relation of mixed-type data.
- Implement a distance function for this type, because the lack of a mixed-type data representation means we do not have a distance to go with it either.
- Choose this new distance function in k-Medoids.
- Share the code, so others do not have to do this again. ;-)
Alternatively, you could write a script to encode your data as a numerical data set; then it will work fine. But in my opinion, the results of one-hot-encoding etc. are usually not very convincing.
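If you go the "encode it yourself" route, a minimal sketch of one-hot encoding a single categorical column could look like this (plain Scala, made-up values; category order is simply alphabetical):

// Map each distinct category to a 0/1 indicator vector.
def oneHot(values: Seq[String]): Seq[Array[Double]] = {
  val categories = values.distinct.sorted
  values.map(v => categories.map(c => if (c == v) 1.0 else 0.0).toArray)
}

oneHot(Seq("male", "female", "male"))
// -> [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]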
In preprocessing data with high cardinality, do you hash first or one-hot-encode first?
Hashing reduces dimensionality, while one-hot-encoding essentially blows up the feature space by transforming multi-categorical variables into many binary variables. So it seems like they have opposite effects. My questions are:
- What is the benefit of doing both on the same dataset? I read something about capturing interactions, but not in detail - can somebody elaborate on this?
- Which one comes first, and why?
Binary one-hot-encoding is needed for feeding categorical data to linear models and SVMs with the standard kernels. For example, you might have a feature which is a day of the week. Then you create a one-hot-encoding for each of them:

1000000 Sunday
0100000 Monday
0010000 Tuesday
...
0000001 Saturday

Feature hashing is mostly used to allow significant storage compression of the parameter vectors: one hashes the high-dimensional input vectors into a lower-dimensional feature space. The parameter vector of a resulting classifier can therefore live in the lower-dimensional space instead of in the original input space. This can be used as a method of dimensionality reduction, so you usually expect to trade a small decrease in performance for a significant storage benefit.
The example in Wikipedia is a good one. Suppose you have three documents:
- John likes to watch movies.
- Mary likes movies too.
- John also likes football.
Using a bag-of-words model, you first create the document-to-words model below (each row is a document, each entry in the matrix indicates whether a word appears in the document). The problem with this process is that such dictionaries take up a large amount of storage space, and grow in size as the training set grows.
Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words) in the items under consideration, then using the hash values directly as feature indices and updating the resulting vector at those indices. Suppose you generate the hashed features below with 3 buckets (you apply k different hash functions to the original features and count how many times the hashed value hits a bucket):

        bucket1  bucket2  bucket3
doc1:   3        2        0
doc2:   2        2        0
doc3:   1        0        2

Now you have successfully transformed the features from 9 dimensions to 3 dimensions.
A more interesting application of feature hashing is personalization. The original feature hashing paper contains a nice example. Imagine you want to design a spam filter, but customized for each user. The naive way of doing this is to train a separate classifier for each user, which is infeasible for either training (to train and update the personalized models) or serving (to hold all classifiers in memory). A smarter way is illustrated below: each token is duplicated, and one copy is individualized by concatenating each word with a unique user id (see USER123_NEU and USER123_Votre). The bag-of-words model now holds the common keywords and also the user-specific keywords. All words are then hashed into a low-dimensional feature space where the document is trained and classified.
Now to answer your questions:
- Yes, one-hot-encoding should come first, since it transforms a categorical feature into binary features to make it consumable by linear models.
- You can apply both on the same dataset, as long as there is a benefit to using the compressed feature space. Note that if you can tolerate the original feature dimension, feature hashing is not required. For example, in a common digit recognition problem such as MNIST, the image is represented by 28x28 binary pixels. The input dimension is only 784, so feature hashing would not have any benefit in this case.
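To make the hashing-trick step concrete, here is a minimal sketch in Scala. Note that the bucket counts will not reproduce the table above exactly, since they depend on the hash function (a single one here) and the bucketing used:

val docs = Seq(
  "John likes to watch movies",
  "Mary likes movies too",
  "John also likes football")

val numBuckets = 3

// Hash every word into one of numBuckets buckets and count the hits;
// each document becomes a fixed-length 3-dimensional vector.
def hashFeatures(doc: String): Array[Int] = {
  val counts = Array.fill(numBuckets)(0)
  for (word <- doc.toLowerCase.split("\\s+")) {
    // non-negative bucket index, even for negative hash codes
    val bucket = ((word.hashCode % numBuckets) + numBuckets) % numBuckets
    counts(bucket) += 1
  }
  counts
}

docs.map(hashFeatures)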
Running DBSCAN in ELKI
I am trying to cluster some geospatial data, and I previously tried the WEKA library. I found this benchmarking, and decided to try ELKI. Despite the advice not to use ELKI as a Java library (which is supposed to be less maintained than the UI), I incorporated it in my application, and I can say that I am quite happy about the results. The structures that it uses to store data are far more efficient than the ones used by Weka, and the fact that it has the option of using a spatial index is definitely a plus.
However, when I compare the results of Weka's DBSCAN with the ones from ELKI's DBSCAN, I get a little bit puzzled. I would accept that different implementations can give rise to slightly different results, but this magnitude of difference makes me think there is something wrong with the algorithm (probably with my code). The number of clusters and their geometry is very different in the two algorithms.
For the record, I am using the latest version of ELKI (0.6.0), and the parameters I used for my simulations were:
minpts=50
epsilon=0.008
I coded two DBSCAN functions (for Weka and ELKI), where the "entry point" is a csv with points, and the "output" for both of them is also identical: a function that calculates the concave hull of a set of points (one for each cluster). Since the function that reads the csv file into an ELKI "database" is relatively simple, I think my problem could be: a) in the parametrization of the algorithm; b) in reading the results (most likely).
Parametrizing DBSCAN does not pose any challenges, and I use the two compulsory parameters, which I previously tested through the UI:

ListParameterization params2 = new ListParameterization();
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPoints);
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, epsilon);

Reading the result is a bit more challenging, as I don't completely understand the organization of the structure that stores the clusters. My idea is to iterate over each cluster, get the list of points, and pass it to the function that calculates the concave hull, in order to generate a polygon:

ArrayList<Clustering<?>> cs = ResultUtil.filterResults(result, Clustering.class);
for (Clustering<?> c : cs) {
  System.out.println("clusters: " + c.getAllClusters().size());
  for (de.lmu.ifi.dbs.elki.data.Cluster<?> cluster : c.getAllClusters()) {
    if (!cluster.isNoise()) {
      Coordinate[] ptList = new Coordinate[cluster.size()];
      int ct = 0;
      for (DBIDIter iter = cluster.getIDs().iter(); iter.valid(); iter.advance()) {
        ptList[ct] = dataMap.get(DBIDUtil.toString(iter));
        ++ct;
      }
      // there are no "empty" clusters
      assertTrue(ptList.length > 0);

      GeoPolygon poly = getBoundaryFromCoordinates(ptList);
      if (poly.getCoordinates().getGeometryType() == "Polygon") {
        try {
          out.write(poly.coordinates.toText() + "\n");
        } catch (IOException e) {
          // TODO Auto-generated catch block
          e.printStackTrace();
        }
      } else {
        System.out.println(poly.getCoordinates().getGeometryType());
      }
    } // !noise
  }
}

I noticed that the "noise" was coming up as a cluster, so I ignored this cluster (I don't want to draw it). I am not sure if this is the right way of reading the clusters, as I don't find many examples. I also have some questions, for which I have not found answers yet:
- What is the difference between getAllClusters() and getTopLevelClusters()?
- Are the DBSCAN clusters "nested", i.e. can we have points that belong to many clusters at the same time? Why?
I read somewhere that we should not use the database IDs to identify the points, as they are for ELKI's internal use, but what other way is there to get the list of points in each cluster? I read that you can use a relation for the labels, but I am not sure how to actually implement this... Any comments that could point me in the right direction, or any code suggestions for iterating over the result set of ELKI's DBSCAN, would be really welcome! I also used ELKI's OPTICSxi in my code, and I have even more questions regarding those results, but I guess I'll save that for another post.
This is mostly a follow-up to @Anony-Mousse, who gave a pretty complete answer.
- getTopLevelClusters() and getAllClusters() do the same for DBSCAN, as DBSCAN does not produce hierarchical clusters.
- DBSCAN clusters are disjoint. Treating clusters with isNoise()==true as singleton objects is likely the best way to handle noise. Clusters returned by our OPTICSXi implementation are also disjoint, but you should consider the members of all child clusters to be part of the outer cluster. For convex hulls, an efficient approach is to first compute the convex hull of the child clusters; then for the parent, compute the convex hull on the additional objects + the convex hull points of all children.
- The RangeDBIDs approach mentioned by @Anony-Mousse is pretty clean for static databases. A clean approach that also works with dynamic databases is to have an additional relation that identifies the objects. When using a CSV file as input, instead of relying on the line numbering to be consistent, you would just add a non-numeric column containing labels, e.g. object123. This is the best approach from a logical point of view - if you want to be able to identify objects, give them a unique identifier. ;-)
- We use ELKI for teaching, and we have verified its DBSCAN algorithm very, very carefully (you can find a DBSCAN step-by-step demonstration here, and ELKI's results match it exactly). The DBSCAN and OPTICS code in Weka was contributed by a student a long time ago, and has never been verified to the same extent. From a quick check, Weka does not produce the correct results on our class exercise data set. Because the exercise data set has the same extent of 10 in each dimension, we can adjust the epsilon parameter by 1/10, and then the Weka result seems to match the solution; so @Anony-Mousse's finding appears to be correct: Weka's implementation enforces a [0;1] scaling on the data.
Accessing the DBIDs of ELKI works if you pay attention to how they are assigned. For a static database, getDBIDs() will return a RangeDBIDs object, and it can give you an offset into the database. This is very reliable. And if you always restart your process, the DBIDs will be assigned deterministically anyway (only when using the MiniGUI will they differ if you rerun a job!). This will also be more efficient than DBIDUtil.toString.
DBSCAN results are not hierarchical, so every cluster should be a top-level cluster.
As for Weka, it sometimes does automatic normalization. Then the epsilon value will be distorted. For geographic data, I would prefer geodetic distance anyway; Euclidean distance on latitude and longitude does not make sense. Check this part of Weka's code: the "norm" function, used by EuclideanDataObject. This looks to me as if Weka's DBSCAN enforces a normalization on the data set! Try scaling your data to [0:1] (I'm pretty sure there is a filter for this in ELKI), and check whether the results are identical afterwards. Judging from this code snippet, I would blame Weka. The code above also looks a bit inefficient to me. The filter approach makes more sense IMHO than this enforced filtering in the data objects.
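A quick way to test that normalization hypothesis outside of either library is to min-max scale the data yourself and re-run both implementations on the scaled file. A minimal sketch (plain Scala, dense double arrays; not tied to either library's API):

// Min-max scale every column of a dense double matrix to [0, 1].
def minMaxScale(data: Array[Array[Double]]): Array[Array[Double]] = {
  val dims = data.head.length
  val mins = Array.tabulate(dims)(d => data.map(_(d)).min)
  val maxs = Array.tabulate(dims)(d => data.map(_(d)).max)
  data.map(row => Array.tabulate(dims) { d =>
    val range = maxs(d) - mins(d)
    if (range == 0.0) 0.0 else (row(d) - mins(d)) / range
  })
}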
K-Means with equal numbers of a binary attribute value in each cluster
Given a certain binary attribute, I want to ensure that the clusters produced by k-means have equal numbers of data points where the said binary attribute's value is 1. I know the above sentence is wordy, so I will explain using an example. Suppose I have an attribute "Asian", with 40 out of my 100 data points having the value "Asian" = 1. For k = 10, I want each cluster to have exactly 4 points with "Asian" = 1. Is there a simple way of achieving this? I have racked my brains but have not been able to come up with one. Please note that I am a beginner when it comes to clustering problems.
Here is a tutorial on how to perform such a k-means modification: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans It's not exactly what you need, but a close k-means variant that can easily be adapted to your needs. Plus, it is a walkthrough tutorial.
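To illustrate one possible (very rough) adaptation of that idea, here is a hedged sketch: after a normal k-means run has produced its centroids, the points with the binary attribute set to 1 are re-assigned greedily to the nearest centroid that still has spare capacity. All names and the capacity value are made up, and the tutorial describes a more principled constrained assignment; this is only meant to show the shape of the approach.

// points:   feature vectors of the 40 data points with the attribute set to 1
// centers:  the k = 10 centroids from a normal k-means run
// capacity: 4, i.e. 40 points / 10 clusters
def constrainedAssign(points: Seq[Array[Double]],
                      centers: Seq[Array[Double]],
                      capacity: Int): Seq[Int] = {
  def sqdist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  val remaining = Array.fill(centers.length)(capacity)
  points.map { p =>
    // nearest centre that still has room for another "attribute = 1" point
    val k = centers.indices.filter(remaining(_) > 0).minBy(i => sqdist(p, centers(i)))
    remaining(k) -= 1
    k
  }
}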