I'm new to ELKI and need to do some basic clustering of a dataset that I have already tested and clustered in Weka. I'm using the GUI version, and I read the tutorial "Analyzing the 'mouse' data set" on the ELKI site: http://elki.dbs.ifi.lmu.de/wiki/Tutorial#Analyzingthemousedataset
I clustered my dataset with EM and successfully visualized and wrote out the results (the only change from the tutorial was setting the parameter resultHandler to ResultWriter). The output folder contains cluster.txt, cluster-evaluation.txt and settings.txt.
I have problems with the output for hierarchical algorithms (SLINK, CLINK, etc.). The only output I get is settings.txt, but I need cluster.txt.
Do I need to change some other parameters? The log view shows no errors.
To get partitions from a hierarchical clustering result, you also need to specify a cluster extraction method:
-algorithm clustering.hierarchical.extraction.HDBSCANHierarchyExtraction
-algorithm CLINK
-hdbscan.minclsize 50
Note that there are now two -algorithm parameters, and their order matters. The extraction algorithm makes a "nested" algorithm call to do the actual hierarchical clustering.
In the long run, we want to move to an operator-based approach (in particular for GUIs). For the command line, the nested invocation is safer, because you cannot attempt to extract clusters without first running a hierarchical clustering.
As for CLINK, the cluster quality is usually not very good (it is also order dependent, so shuffling the data and running it multiple times will give different results). I'd also give AGNES or Anderberg with complete linkage a try. AGNES is always O(n^3); Anderberg is usually O(n^2) (only the worst case is O(n^3)). Both produce much better results; they are expected to produce the same results as each other except for tied distances, whereas CLINK differs.
I am looking for advice on how to determine whether the distribution of my model output is similar (and if so, how similar) to the distribution of the observed dataset.
Basically, I have a GBM model with mean reversion that gives seemingly good results when I compare its distribution to the observed data. You can see their PDFs side by side in the attached picture.
[Figure: PDF of observed and model data]
Both datasets are huge (~6 million data points), and I am starting to suspect that this is part of the problem...
I am looking for a way to verify that the two datasets' distributions are similar. I tried the two-sample Kolmogorov-Smirnov test and the two-sample t-test, but for some reason both rejected the null hypothesis (always, even with different alpha values). In some threads I've read that these tests are unreliable when applied to huge datasets, but there was no consensus about this.
I am currently using MATLAB, but I am open to other tools if necessary.
Any help would be appreciated! I am primarily looking for a hypothesis test for verification, but if you have a different idea, don't hold it back!
I have data sets for two groups, one much smaller than the other. For that reason, I am using the MATLAB bootstrapping function to estimate the performance of the smaller group. I have code that draws on my original data and generates 1000 'new' means. However, it is not clear how many of the original data points are used each time. Obviously, if all of the original data were used each time, the same mean would be generated every time.
Can anyone help me out with this?
Bootstrapping is based on sampling with replacement. Each resample uses the same number of points as the original data, but some of them are repeated (and others left out), which is why the means differ. There are some variants of bootstrapping which work slightly differently, however. See https://en.wikipedia.org/wiki/Bootstrapping_(statistics).
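To make the resampling step concrete, here is a minimal sketch in Java. It is not MATLAB's implementation; the class name BootstrapSketch, the method bootstrapMeans and the sample values are made up for illustration. Each resample draws exactly data.length indices with replacement, so every "new" mean is computed from as many values as the original data set has.

```java
import java.util.Random;

class BootstrapSketch {
    /**
     * Draws `resamples` bootstrap samples, each of size data.length and
     * sampled with replacement, and returns the mean of every resample.
     */
    static double[] bootstrapMeans(double[] data, int resamples, long seed) {
        Random rnd = new Random(seed);
        double[] means = new double[resamples];
        for (int r = 0; r < resamples; r++) {
            double sum = 0.0;
            for (int i = 0; i < data.length; i++) {
                // the same index may be drawn several times, others not at all
                sum += data[rnd.nextInt(data.length)];
            }
            means[r] = sum / data.length;
        }
        return means;
    }

    public static void main(String[] args) {
        double[] data = {2.0, 4.0, 4.0, 5.0, 7.0, 9.0}; // illustrative values only
        double[] means = bootstrapMeans(data, 1000, 42L);
        System.out.println("first bootstrap mean: " + means[0]);
    }
}
```

The spread of the returned means is what the bootstrap uses to estimate the uncertainty of the mean of the small group.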
I am trying to cluster some geospatial data, and I previously tried the WEKA library.
I found this benchmark and decided to try ELKI.
Despite the advice not to use ELKI as a Java library (which is supposed to be less well maintained than the UI), I incorporated it into my application, and I can say that I am quite happy with the results. The structures it uses to store data are far more efficient than the ones used by Weka, and the option of using a spatial index is definitely a plus.
However, when I compare the results of Weka's DBSCAN with those of ELKI's DBSCAN, I am a little puzzled. I would accept that different implementations can produce slightly different results, but this magnitude of difference makes me think there is something wrong with the algorithm (probably with my code). The number of clusters and their geometry are very different between the two implementations.
For the record, I am using the latest version of ELKI (0.6.0), and the parameters I used for my simulations were:
minpts=50
epsilon=0.008
I coded two DBSCAN functions (for Weka and ELKI), where the "entry point" is a csv with points, and the "output" for both of them is also identical: a function that calculates the concave hull of a set of points (one for each cluster). Since the function that reads the csv file into an ELKI "database" is relatively simple, I think my problem could be:
a) in the parametrization of the algorithm;
b) reading the results (most likely).
Parametrizing DBSCAN does not pose any challenges, and I use the two compulsory parameters, which I previously tested through the UI:
ListParameterization params2 = new ListParameterization();
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPoints);
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, epsilon);
Reading the result is a bit more challenging, as I don't completely understand the organization of the structure that stores the clusters. My idea is to iterate over each cluster, get its list of points, and pass that list to the function that calculates the concave hull, in order to generate a polygon.
// Collect every Clustering result produced by the algorithm run.
ArrayList<Clustering<?>> cs = ResultUtil.filterResults(result, Clustering.class);
for (Clustering<?> c : cs) {
    System.out.println("clusters: " + c.getAllClusters().size());
    for (de.lmu.ifi.dbs.elki.data.Cluster<?> cluster : c.getAllClusters()) {
        if (!cluster.isNoise()) {
            Coordinate[] ptList = new Coordinate[cluster.size()];
            int ct = 0;
            for (DBIDIter iter = cluster.getIDs().iter(); iter.valid(); iter.advance()) {
                ptList[ct] = dataMap.get(DBIDUtil.toString(iter));
                ++ct;
            }
            // there are no "empty" clusters
            assertTrue(ptList.length > 0);
            GeoPolygon poly = getBoundaryFromCoordinates(ptList);
            // compare string contents with equals(), not with ==
            if ("Polygon".equals(poly.getCoordinates().getGeometryType())) {
                try {
                    out.write(poly.coordinates.toText() + "\n");
                } catch (IOException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            } else {
                System.out.println(poly.getCoordinates().getGeometryType());
            }
        } // !noise
    }
}
I noticed that the "noise" was coming up as a cluster, so I ignore this cluster (I don't want to draw it).
I am not sure if this is the right way of reading the clusters, as I can't find many examples. I also have some questions for which I have not found answers yet:
1. What is the difference between getAllClusters() and getTopLevelClusters()?
2. Are the DBSCAN clusters "nested", i.e. can points belong to several clusters at the same time? Why?
3. I read somewhere that we should not use the database IDs to identify the points, as they are for ELKI's internal use, but what other way is there to get the list of points in each cluster? I read that you can use a relation for the labels, but I am not sure how to actually implement this...
Any comments that could point me in the right direction, or any code suggestions to iterate over the result set of ELKI's DBSCAN would be really welcome! I also used ELKI's OPTICSxi in my code, and I have even more questions regarding those results, but I guess I'll save that for another post.
This is mostly a follow-up to @Anony-Mousse, who gave a pretty complete answer.
getTopLevelClusters() and getAllClusters() do the same for DBSCAN, as DBSCAN does not produce hierarchical clusters.
DBSCAN clusters are disjoint. Treating clusters with isNoise()==true as singleton objects is likely the best way to handle noise. Clusters returned by our OPTICSXi implementation are also disjoint, but you should consider the members of all child clusters to be part of the outer cluster. For convex hulls, an efficient approach is to first compute the convex hulls of the child clusters; then, for the parent, compute the convex hull of its additional objects plus the hull points of all children.
The RangeDBIDs approach mentioned by @Anony-Mousse is pretty clean for static databases. A clean approach that also works with dynamic databases is to have an additional relation that identifies the objects. When using a CSV file as input, instead of relying on the line numbering to be consistent, you would just add a non-numeric column containing labels, e.g. object123. This is the best approach from a logical point of view: if you want to be able to identify objects, give them a unique identifier. ;-)
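The nested convex hull idea mentioned for hierarchical (OPTICSXi-style) results can be sketched as follows. This is plain Java and does not use ELKI classes; NestedHullSketch, convexHull and parentHull are hypothetical names, points are plain double[]{x, y} pairs, and the hull routine is Andrew's monotone chain, which is one common choice rather than anything ELKI prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NestedHullSketch {
    /** Cross product of vectors o->a and o->b; positive for a counter-clockwise turn. */
    static double cross(double[] o, double[] a, double[] b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
    }

    /** Andrew's monotone chain; returns the hull vertices in counter-clockwise order. */
    static List<double[]> convexHull(List<double[]> points) {
        List<double[]> p = new ArrayList<>(points);
        p.sort((a, b) -> a[0] != b[0] ? Double.compare(a[0], b[0]) : Double.compare(a[1], b[1]));
        int n = p.size();
        if (n < 3) {
            return p;
        }
        double[][] h = new double[2 * n][];
        int k = 0;
        for (int i = 0; i < n; i++) { // lower hull
            while (k >= 2 && cross(h[k - 2], h[k - 1], p.get(i)) <= 0) {
                k--;
            }
            h[k++] = p.get(i);
        }
        for (int i = n - 2, t = k + 1; i >= 0; i--) { // upper hull
            while (k >= t && cross(h[k - 2], h[k - 1], p.get(i)) <= 0) {
                k--;
            }
            h[k++] = p.get(i);
        }
        // the last vertex repeats the first one, so drop it
        return new ArrayList<>(Arrays.asList(h).subList(0, k - 1));
    }

    /** Hull of a parent cluster: its own extra points plus only the hull vertices of its children. */
    static List<double[]> parentHull(List<double[]> ownPoints, List<List<double[]>> childHulls) {
        List<double[]> candidates = new ArrayList<>(ownPoints);
        for (List<double[]> hull : childHulls) {
            candidates.addAll(hull);
        }
        return convexHull(candidates);
    }
}
```

Interior points of a child cluster can never be vertices of the parent's hull, so passing only the child hull vertices upward gives the same polygon while touching far fewer points. Note that this applies to convex hulls; a concave hull does not necessarily compose this way.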
We use ELKI for teaching, and we have verified its DBSCAN algorithm very carefully (you can find a step-by-step DBSCAN demonstration here, and ELKI's results match it exactly). The DBSCAN and OPTICS code in Weka was contributed by a student a long time ago and has never been verified to the same extent. From a quick check, Weka does not produce the correct results on our class exercise data set.
Because the exercise data set has the same extent of 10 in each dimension, we can scale the epsilon parameter by 1/10, and then the Weka result seems to match the solution. So @Anony-Mousse's finding appears to be correct: Weka's implementation enforces a [0;1] scaling on the data.
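The effect of such a scaling on epsilon can be illustrated with a small sketch. This is plain Java, not Weka or ELKI code; MinMaxScalingSketch, minMaxScale and the three sample points are invented for the example. When every dimension has an extent of 10, all Euclidean distances shrink by a factor of 10 after scaling to [0;1], so epsilon has to shrink by the same factor to describe the same neighborhoods.

```java
import java.util.Arrays;

class MinMaxScalingSketch {
    /** Rescales every column of data to [0, 1]; constant columns are mapped to 0. */
    static double[][] minMaxScale(double[][] data) {
        int dims = data[0].length;
        double[] min = new double[dims];
        double[] max = new double[dims];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : data) {
            for (int d = 0; d < dims; d++) {
                min[d] = Math.min(min[d], row[d]);
                max[d] = Math.max(max[d], row[d]);
            }
        }
        double[][] scaled = new double[data.length][dims];
        for (int i = 0; i < data.length; i++) {
            for (int d = 0; d < dims; d++) {
                double range = max[d] - min[d];
                scaled[i][d] = range == 0.0 ? 0.0 : (data[i][d] - min[d]) / range;
            }
        }
        return scaled;
    }

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // illustrative points with an extent of 10 in each dimension
        double[][] data = {{0, 0}, {10, 10}, {3, 4}};
        double[][] scaled = minMaxScale(data);
        System.out.println("original distance: " + euclidean(data[0], data[2]));
        System.out.println("scaled distance (about one tenth): " + euclidean(scaled[0], scaled[2]));
    }
}
```

If the dimensions have different extents, a per-dimension scaling distorts distances unevenly, and no single correction factor for epsilon will reproduce the unscaled result.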
Accessing the DBIDs of ELKI works, if you pay attention to how they are assigned.
For a static database, getDBIDs() will return a RangeDBIDs object, and it can give you an offset into the database. This is very reliable. But if you always restart your process, the DBIDs will be assigned deterministically anyway (only when using the MiniGUI, they will differ if you rerun a job!)
This will also be more efficient than DBIDUtil.toString.
DBSCAN results are not hierarchical, so every cluster should be a top level cluster.
As for Weka, it sometimes normalizes the data automatically, which distorts the epsilon value. For geographic data, I would prefer geodetic distance anyway; Euclidean distance on latitude and longitude does not make sense.
Check this part of Weka's code: the "norm" function, used by EuclideanDataObject. To me, this looks as if Weka's DBSCAN enforces a normalization on the data set! Try scaling your data to [0;1] (I'm pretty sure there is a filter for this in ELKI) and check whether the results are identical afterwards.
Judging from this code snippet, I would blame Weka. The code above also looks a bit inefficient to me. The filter approach makes IMHO more sense than this enforced filtering in the data objects.
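On the remark about geodetic distance: a minimal sketch of the haversine great-circle distance shows why Euclidean distance on raw latitude/longitude is misleading. This is plain Java rather than an ELKI distance function; HaversineSketch and haversineKm are hypothetical names, and the Earth is approximated as a sphere with the mean radius given in the code.

```java
class HaversineSketch {
    // mean Earth radius in km; assumes a spherical Earth (an approximation)
    static final double EARTH_RADIUS_KM = 6371.0088;

    /** Great-circle distance in km between two points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double dPhi = Math.toRadians(lat2 - lat1);
        double dLambda = Math.toRadians(lon2 - lon1);
        double sinPhi = Math.sin(dPhi / 2);
        double sinLambda = Math.sin(dLambda / 2);
        double a = sinPhi * sinPhi + Math.cos(phi1) * Math.cos(phi2) * sinLambda * sinLambda;
        // clamp to guard against rounding pushing a slightly above 1
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(Math.max(0.0, 1 - a)));
        return EARTH_RADIUS_KM * c;
    }

    public static void main(String[] args) {
        // one degree of longitude at the equator versus at 60 degrees north
        System.out.println("equator: " + haversineKm(0, 0, 0, 1) + " km");
        System.out.println("60N:     " + haversineKm(60, 0, 60, 1) + " km");
    }
}
```

One degree of longitude covers roughly half the ground distance at 60 degrees north that it covers at the equator, so a fixed epsilon in degrees means different things in different places.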
I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get into this area by doing some tests, and as a first experiment I'd like to cluster tweets based on content similarity. The basic idea for the experiment would be to store tweets in a database and periodically recompute the clustering (e.g. using a cron job). Please note that the database would receive new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Compute the Jaccard similarity against each of the existing clusters. If the result > threshold, the tweet is assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Putting computational cost aside, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10, which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but the similarity with C1 is higher. In that case Tn should go to C1. This could also be done by brute force, assigning the tweet to the cluster with maximum similarity.
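The steps described so far can be sketched roughly as follows. This is only an illustration of the idea in the question, not a recommended design; NgramJaccardSketch, ngrams, jaccard and bestCluster are made-up names, the n-grams are character n-grams, and "average similarity to all members" is just one possible way to compare a tweet with a cluster.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NgramJaccardSketch {
    /** Set of all character n-grams of the text (empty if the text is shorter than n). */
    static Set<String> ngrams(String text, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** Jaccard similarity |A and B| / |A or B|; defined as 0 here when both sets are empty. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    /**
     * Index of the cluster with the highest average similarity to the tweet,
     * or -1 if no cluster reaches the threshold (the tweet would start a new cluster).
     */
    static int bestCluster(Set<String> tweet, List<List<Set<String>>> clusters, double threshold) {
        int best = -1;
        double bestSim = -1.0;
        for (int c = 0; c < clusters.size(); c++) {
            List<Set<String>> members = clusters.get(c);
            if (members.isEmpty()) {
                continue;
            }
            double sum = 0.0;
            for (Set<String> member : members) {
                sum += jaccard(tweet, member);
            }
            double avg = sum / members.size();
            if (avg >= threshold && avg > bestSim) {
                best = c;
                bestSim = avg;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<List<Set<String>>> clusters = new ArrayList<>();
        clusters.add(List.of(ngrams("I'm checking in at the park", 3)));
        clusters.add(List.of(ngrams("My klout score is 50", 3)));
        int c = bestCluster(ngrams("I'm checking in at the mall", 3), clusters, 0.5);
        System.out.println("assigned to cluster " + c);
    }
}
```

Every new tweet is compared against every member of every cluster here, which is exactly the computational cost the question goes on to worry about.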
And last of all, there is the computational issue. I've been reading a bit about MinHash, and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend which approach I should aim for? I read some mentions of LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guidance.
From what I'm reading, a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case would be clustering similar tweets into groups without any prior information. For example, tweets from Foursquare ("I'm checking in ..."), which are similar to each other, would be one case; "My klout score is ..." would be another. Also note that I'd like this to be language independent, so I'm not interested in having to deal with language-specific issues.
It looks to me like you are trying to address two different problems at once, i.e. "syntactic" and "semantic" clustering. They are quite different problems, especially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your Foursquare example fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using an N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as few as 3-4 features). Normalization is also an important factor here; removing the RT tag, mentions and hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it is unlikely to work if you try to aggregate a random sample of tweets, because they usually carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (e.g. the ones matching a keyword or a hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union-find) aren't really suited for online clustering, unless you redo the clustering from scratch every time a new tweet is added to your database. "Averaging" the clusters to add new elements isn't a great solution in my experience, because you need to retain all the information of every cluster member to update the "average" every time new data comes in. Also, algorithms like hierarchical clustering and union-find work well because they can join pre-existing clusters if a similarity link is found between them; they don't simply assign a new element to the "closest" cluster, which is what you suggested in your post.
Algorithms like MinHash (or SimHash) are indeed better suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain similarity threshold (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(n log n) in time). It is, though, quadratic in space, so a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save the "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query new tweets against that index. You can then form a similarity graph by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
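To show what a "sketch" looks like and how it estimates Jaccard similarity, here is a minimal MinHash example. It only covers the signature and the estimate, not the index or the similarity graph; MinHashSketch, signature and estimateJaccard are hypothetical names, and the hash family (a*x + b) mod p over String.hashCode() is a simple textbook choice, not what any particular library uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a;
    private final long[] b;

    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt((int) PRIME - 1); // in [1, p-1]
            b[i] = rnd.nextInt((int) PRIME);         // in [0, p-1]
        }
    }

    /** For each hash function, keeps the minimum hash value over all features. */
    long[] signature(Set<String> features) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String f : features) {
            long x = f.hashCode() & 0x7fffffffL; // non-negative feature id
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    /** The fraction of positions where two signatures agree estimates the Jaccard similarity. */
    static double estimateJaccard(long[] s1, long[] s2) {
        int equal = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                equal++;
            }
        }
        return (double) equal / s1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(128, 7L);
        Set<String> t1 = new HashSet<>(Arrays.asList("i'm checking", "checking in", "in at", "at park"));
        Set<String> t2 = new HashSet<>(Arrays.asList("i'm checking", "checking in", "in at", "at mall"));
        System.out.println("estimated Jaccard: " + estimateJaccard(mh.signature(t1), mh.signature(t2)));
    }
}
```

The signatures are what would be stored in the database as the index; more hash functions give a tighter estimate at the price of larger signatures.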
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
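A minimal single-pass sketch of this description, on one-dimensional data for brevity. CanopySketch and canopies are made-up names, and this streaming variant is a simplification: earlier points are not added retroactively to canopies that are created later, which a full canopy implementation would handle.

```java
import java.util.ArrayList;
import java.util.List;

class CanopySketch {
    /**
     * Each returned list is one canopy; its first element is the object that
     * started it. A point joins every canopy whose center lies within
     * outerRadius, and starts a new canopy if no center lies within innerRadius.
     */
    static List<List<Double>> canopies(double[] points, double outerRadius, double innerRadius) {
        List<List<Double>> result = new ArrayList<>();
        for (double p : points) {
            boolean coveredByInner = false;
            for (List<Double> canopy : result) {
                double dist = Math.abs(p - canopy.get(0));
                if (dist <= outerRadius) {
                    canopy.add(p); // may join several canopies: they overlap
                }
                if (dist <= innerRadius) {
                    coveredByInner = true;
                }
            }
            if (!coveredByInner) {
                List<Double> fresh = new ArrayList<>();
                fresh.add(p);
                result.add(fresh);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [[0.0, 1.0, 3.0], [5.0, 6.0, 3.0], [3.0]]
        System.out.println(canopies(new double[] {0, 1, 5, 6, 3}, 3.0, 1.5));
    }
}
```

The point 3.0 ends up in all three canopies, which is the overlapping (non-disjoint) behavior described above.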
However, don't expect useful results from clustering tweets. Tweet data is just too noisy. Most tweets have just a few words, too few to define a good similarity. On the other hand, you have the various retweets that are near duplicates, but those are trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?
I am experiencing a strange situation when running Mahout K-means:
Using a pre-selected set of initial centroids, I run K-means on a SequenceFile generated by lucene.vector. The run is for testing purposes, so the file is small (around 10MB, ~10000 vectors).
When K-means is executed with a single mapper (the default, considering the Hadoop split size, which in my cluster is 128MB), it reaches a given clustering result in 2 iterations (Case A).
However, I wanted to test whether there would be any improvement or deterioration in the algorithm's execution speed when firing more map tasks (the Hadoop cluster has 6 nodes in total).
I therefore set the -Dmapred.max.split.size parameter to 5242880 bytes, in order to make Mahout fire 2 map tasks (Case B).
I did succeed in starting two mappers, but the strange thing was that the job finished after 5 iterations instead of 2, and that even at the first assignment of points to clusters, the mappers made different choices compared to the single-mapper execution. That is, after close inspection of the clusterDump for the first iteration in both cases, I found that in Case B some points were not assigned to their closest cluster.
Could this behavior be justified by the existing K-means Mahout implementation?
From a quick look at the sources, I see two problems with the Mahout k-means implementation.
First of all, the way the S0, S1, S2 statistics are kept is probably not numerically stable for large data sets. Oh, and since k-means does not even use S2, it is also unnecessarily slow. I bet a good implementation can beat this version of k-means by a factor of 2-5 at least.
For small data sets split across multiple machines, there seems to be an error in the way they compute their means. Ouch. This will be amplified if the reducer is applied to more than one input, in particular when the partitions are small. To be more verbose, the cluster mean apparently is initialized with the previous mean instead of the 0 vector. Now if you reduce 't' copies of it, the resulting vector will be off by 't' times the previous mean.
Initialization of AbstractCluster:
setS1(center.like());
Update of the mean:
getS1().assign(x, Functions.PLUS);
Merge of multiple copies of a cluster:
setS1(getS1().plus(cl.getS1()));
Finalization to new center:
setCenter(getS1().divide(getS0()));
So with this approach, the center will be offset from the proper value by the previous center times t / n where t is the number of splits, and n the number of objects.
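The arithmetic of that offset can be checked with a toy model. This does not call Mahout and makes no claim about what center.like() actually returns; it only simulates the hypothesis above, namely that each split's S1 starts from the previous center instead of from zero. MeanOffsetSketch and mergedCenter are invented names, and the data is one-dimensional for brevity.

```java
class MeanOffsetSketch {
    /**
     * Merges per-split partial statistics where every split's S1 starts at
     * initialS1 instead of 0, then finalizes the center as S1 / S0.
     */
    static double mergedCenter(double[][] splits, double initialS1) {
        double s0 = 0.0;
        double s1 = 0.0;
        for (double[] split : splits) {
            double partS0 = 0.0;
            double partS1 = initialS1; // hypothesized faulty initialization
            for (double x : split) {
                partS0 += 1.0;
                partS1 += x;
            }
            s0 += partS0; // merge of the partial clusters
            s1 += partS1;
        }
        return s1 / s0;
    }

    public static void main(String[] args) {
        double[][] twoSplits = {{1, 2}, {3, 4}};
        // correct initialization: prints 2.5
        System.out.println(mergedCenter(twoSplits, 0.0));
        // previous center 10, t = 2 splits, n = 4 points: prints 7.5 = 2.5 + 10 * 2 / 4
        System.out.println(mergedCenter(twoSplits, 10.0));
    }
}
```

Under this hypothesis the error grows with the number of splits and shrinks with the number of points, which would be consistent with small test files being affected most.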
To fix the numerical instability (which arises whenever the data set is not centered on the 0 vector), I recommend replacing the S1 statistic by the true mean, not S0*mean. Both S1 and S2 can be incrementally updated at little cost using the incremental mean formula, which AFAICT was used in the original "k-means" publication by MacQueen (which actually is an online k-means, while this is Lloyd-style batch iteration). Well, for an incremental k-means you obviously need the updatable mean vector anyway... I believe the formula was also discussed by Knuth in his essential books. I'm surprised that Mahout does not seem to use it. It's fairly cheap (just a few more CPU instructions and no additional data, so it all happens in the CPU cache line) and gives you extra precision when you are dealing with large data sets.
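For reference, the incremental mean update is mean_k = mean_(k-1) + (x_k - mean_(k-1)) / k. A minimal sketch follows; IncrementalMeanSketch is a hypothetical class, not Mahout code, and the merge step for combining partial results from several splits is my own addition using the standard weighted form of the same formula.

```java
class IncrementalMeanSketch {
    private final double[] mean;
    private long count;

    IncrementalMeanSketch(int dims) {
        mean = new double[dims];
    }

    /** mean_k = mean_(k-1) + (x_k - mean_(k-1)) / k, applied per dimension. */
    void add(double[] x) {
        count++;
        for (int d = 0; d < mean.length; d++) {
            mean[d] += (x[d] - mean[d]) / count;
        }
    }

    /** Combines two partial means, weighting each by its number of points. */
    void merge(IncrementalMeanSketch other) {
        if (other.count == 0) {
            return;
        }
        long total = count + other.count;
        for (int d = 0; d < mean.length; d++) {
            mean[d] += (other.mean[d] - mean[d]) * other.count / total;
        }
        count = total;
    }

    double[] mean() {
        return mean.clone();
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        IncrementalMeanSketch a = new IncrementalMeanSketch(2);
        a.add(new double[] {1, 10});
        a.add(new double[] {3, 20});
        IncrementalMeanSketch b = new IncrementalMeanSketch(2);
        b.add(new double[] {5, 30});
        b.add(new double[] {7, 40});
        a.merge(b);
        System.out.println(java.util.Arrays.toString(a.mean())); // [4.0, 25.0]
    }
}
```

Because the running value always stays in the range of the data, it avoids building up a large sum that is divided only at the end, which is where precision is lost on large, off-center data sets.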