How to cluster sets (users/documents) with distributed MinHash using the banding technique? - scala

I have a big doubt about the way I should cluster sets using MinHash together with the banding technique.
I assume everyone reading has a good knowledge of MinHash so I won't define most of the terms I'm using.
My goal is to use MinHash to cluster users according to the similarity of their signatures. In a local, non-banded settings this would be trivial: if their signature hash is the same, they go in the same cluster.
If we split signatures in bands and process them indipendently, I can treat a band as I said before and generate a group of clusters for every band. My question is: how should I aggregate these clusters? Just merge them if they have at least an element in common? Or should I do something different?
Thanks

MinHash is not really meant as standalone clustering algorithm. It is meant as a candidate filter for near-duplicate detection.
When looking for similar documents, you compute the minhashes to retrieve candidates. You then still need to check these candidates - they could be false positives!
The more signatures agree, the more likely they really match.
So if you consider the near-duplicate scenario again: if a is a near duplicate of b and b is a near duplicate of c, then a should also be a near duplicate of c. If this holds, you can throw all these matches (after verification) together. If it doesn't consider a hierarchical clustering like strategy to merge (or not merge) candidates.

Related

Questions about LSH (Locality-sensitive hashing) and minihashing implementation

I'm trying to implement this paper
Browser Fingerprint Coding Methods Increasing the Effectiveness of User Identification in the Web Traffic
I got a couple of questions about the LHS algorithm in general and the proposed implementation:
The LSH algorithm it's used only when you have a lot of documents to compare with each other (because it is supposed to put the similar ones in the same bucket from what I got). If for example I have a new document and I want to calculate the similarity with the others, I have to relaunch the LHS algorithm from scratch, including the new document, correct?
In 'Mining of Massive Datasets, Ch3', it is said that for the LHS we should use one hash function per band. Each hash function creates n buckets.
So, for the first band, we are going to have n buckets. For the second band onward, Am I supposed to keep using the same hash function (so this way I keep using the same buckets as before) or another one (ending so with m>>n buckets)?
This question is related t the previous one. If I use the same hash function for all the bands, then I'll have n buckets. No problem here. But If I have to use more hash functions (one different function per row), I'm going to end up with a lot of different buckets. Am I supposed to measure the similarity for each pair in each bucket? (If I have to use only one hash function then here it's not a problem).
In the paper, I understood most of the algorithm except for its end.
Basically, two Signatures matrices are created (one for stable features and one for unstable features) via minhashing. Then, they use LSH on the first matrix to obtain a list of candidates pairs. So far so good.
What happens at the end? do they perform the LHS on the second matrix? How the result of the first LHS is used? I cannot see the relationship between the first and the second LHS.
The output of the final step is supposed to be a list of pairing candidates, right? and all that I have to do is performing Jaccard similarity on them and setting a threshold, right?
Thanks for your answers!
I got a partial answer to my question (still missing question 4)
No. You would keep the bucket structure and hash the new doc into it. Then compare with only those docs in one of the buckets it fell into.
No. You HAVE to use different hash functions and a different set of buckets for each hash function.
This is irrelevant because of the answer to (2).

Running DBSCAN in ELKI

I am trying to cluster some geospatial data, and I previously tried the WEKA library.
I found this benchmarking, and decided to try ELKI.
Despite the advice to not use ELKI as a Java library (which is suppose to be less maintained than the UI), I incorporated it in my application, and I can say that I am quite happy about the results. The structures that it uses to store data, are far more efficient than the ones used by Weka, and the fact that it has the option of using a spatial index is definetly a plus.
However, when I compare the results of Weka's DBSCAN, with the ones from ELKI's DBSCAN, I get a little bit puzzled. I would accept different implementations can give origin to slightly different results, but these magnitude of difference makes me think there is something wrong with the algorithm (probably with my code). The number of clusters and their geometry is very different in the two algorithms.
For the record, I am using the latest version of ELKI (0.6.0), and the parameters I used for my simulations were:
minpts=50
epsilon=0.008
I coded two DBSCAN functions (for Weka and ELKI), where the "entry point" is a csv with points, and the "output" for both of them is also identical: a function that calculates the concave hull of a set of points (one for each cluster). Since the function that reads the csv file into an ELKI "database" is relatively simple, I think my problem could be:
a) in the parametrization of the algorithm;
b) reading the results (most likely).
Parametrizing DBSCAN does not pose any challenges, and I use the two compulsory parameters, which I previously tested through the UI:
ListParameterization params2 = new ListParameterization();
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPoints);
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, epsilon);
Reading the result is a bit more challenging, as I don't completely understand the organization of the structure that stores the clusters; My idea is to iterate over each cluster, get the list of points, and pass it to the function that calculates the concave hull, in order to generate a polygon.
ArrayList<Clustering<?>> cs = ResultUtil.filterResults(result, Clustering.class);
for (Clustering<?> c : cs) {
System.out.println("clusters: " + c.getAllClusters().size());
for (de.lmu.ifi.dbs.elki.data.Cluster<?> cluster : c.getAllClusters()) {
if (!cluster.isNoise()){
Coordinate[] ptList=new Coordinate[cluster.size()];
int ct=0;
for (DBIDIter iter = cluster.getIDs().iter(); iter.valid(); iter.advance()) {
ptList[ct]=dataMap.get(DBIDUtil.toString(iter));
++ct;
}
//there are no "empty" clusters
assertTrue(ptList.length>0);
GeoPolygon poly=getBoundaryFromCoordinates(ptList);
if (poly.getCoordinates().getGeometryType()==
"Polygon"){
try {
out.write(poly.coordinates.toText()+"\n");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}else
System.out.println(
poly.getCoordinates().getGeometryType());
}//!noise
}
}
I notice that the "noise" was coming up as a cluster, so I ignored this cluster (I don't want to draw it).
I am not sure if this is the right way of reading the clusters, as I don't find many examples. I also have some questions, for which I did not found answers yet:
What is the difference between getAllClusters() and
getTopLevelClusters()?
Are the DBSCAN clusters "nested", i.e.: can we have points that
belong to many clusters at the same time? Why?
I read somewhere that we should not use the database IDs to identify
the points, as they are for ELKI's internal use, but what other way
there is to get the list of points in each cluster? I read that you
can use a relation for the labels, but I am not sure how to actually
implement this...
Any comments that could point me in the right direction, or any code suggestions to iterate over the result set of ELKI's DBSCAN would be really welcome! I also used ELKI's OPTICSxi in my code, and I have even more questions regarding those results, but I guess I'll save that for another post.
This is mostly a follow-up to #Anony-Mousse, who gave a pretty complete answer.
getTopLevelClusters() and getAllClusters() do the same for DBSCAN, as DBSCAN does not produce hierarchical clusters.
DBSCAN clusters are disjoint. Treating clusters with isNoise()==true as singleton objects is likely the best way to handling noise. Clusters returned by our OPTICSXi implementation are also disjoint, but you should consider the members of all child clusters to be part of the outer cluster. For convex hulls, an efficient approach is to first compute the convex hull of the child clusters; then for the parent compute the convex hull on the additional objects + the convex hull points of all childs.
The RangeDBIDs approach mentioned by #Anony-Mousse is pretty clean for static databases. A clean approach that also works with dynamic databases is to have an additional relation that identifies the objects. When using a CSV file as input, instead of relying on the line numbering to be consistent, you would just add a non-numeric column, containing labels e.g. object123. This is the best approach from a logical point of view - if you want to be able to identify objects, give them a unique identifier. ;-)
We use ELKI for teaching, and we have verified its DBSCAN algorithm very very carefully (you can find a DBSCAN step by step demonstration here, and ELKI results exactly match this). The DBSCAN and OPTICS code in Weka was contributed by a student a long time ago, and has never been verified to the same extend. From a quick check, Weka does not produce the correct results on our class exercise data set.
Because the exercise data set has the same extend of 10 in each dimension, we can adjust the epsilon parameter by 1/10, and then the Weka result seems to match the solution; so #Anony-Mousses finding appears to be correct: Weka's implementation enforces a [0;1] scaling on the data.
Accessing the DBIDs of ELKI works, if you pay attention to how they are assigned.
For a static database, getDBIDs() will return a RangeDBIDs object, and it can give you an offset into the database. This is very reliable. But if you always restart your process, the DBIDs will be assigned deterministically anyway (only when using the MiniGUI, they will differ if you rerun a job!)
This will also be more efficient than DBIDUtil.toString.
DBSCAN results are not hierarchical, so every cluster should be a top level cluster.
As for Weka, it sometimes does automatic normalization. Then the epsilon value will be distorted. For geographic data, I would prefer geodetic distance anyway, Euclidean distance on latitude and longitude does not make sense.
Check this part of Wekas code: "norm" function, used by EuclideanDataObject. This does look to me as if Wekas DBSCAN enforces a normalization on the data set! Try scaling your data to [0:1] (I'm pretty sure there is a filter for this in ELKI), if the results are identical afterwards?
Judging from this code snippet, I would blame Weka. The code above also looks a bit inefficient to me. The filter approach makes IMHO more sense than this enforced filtering in the data objects.

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced into this area by doing some tests, and as a first experiment I'd like to create clusters on tweets based on content similarity. The basic idea for the experiment would be storing tweets on a database and periodically calculate the clustering (ie. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend me which approach should I aim to? I read some mentions about LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guiding.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks like to me that you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, expecially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using a N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as low as 3-4 features). Normalization is also an important factor here, removing RT tag, mentions, hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it won't likely work if you try to aggregate random sample of tweets, due to the fact that they, usually, carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (i.e. the one matching a keyword, or an hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering , unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution according to my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existant clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested to do in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
However don't expect useful results from clustering tweets. Tweet data is just to much noise. Most tweets have just a few words, too little to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

rapidminer: cluster performance operators..what does different value mean?

I have to check performance of various clustering algos using different performance operators in rapidminer. For that I want to know the following things:
what does cluster number index value shows which is output of cluster count performance operator?
what does small and large value of avg within cluster distance and avg. within centroid distance mean in terms of good and bad clustering?
I also want to check other indexes value like Dunn index,Jaccard index, Fowlkes–Mallows for various clustering algos. but rapidminer don't have any operator for this, what to do for that. I don't have experience with R.
I have copied part of the answer I gave on the Rapid-I forum
The cluster number index is the count of clusters - pointless you might say but when used with DBSCAN, it can be quite interesting http://rapidminernotes.blogspot.co.uk/2010/12/counting-clusters.html
The avg within cluster and centroid distances are hard to interpret - one thing to search for is "elbow criterion" in this context. As the number of clusters varies, note how the validity measure changes and look for an "elbow" that marks the point where the natural progression of the measure dominates the structure.
R has many validity measures and it's worth investing some time because you can always call the R process from RapidMiner which makes it easier to work out what is going on.

Hash operator in Matlab for linear indices of vectors

I am clustering a large set of points. Throughout the iterations, I want to avoid re-computing cluster properties if the assigned points are the same as the previous iteration. Each cluster keeps the IDs of its points. I don't want to compare them element wise, comparing the sum of the ID vector is risky (a small ID can be compensated with a large one), may be I should compare the sum of squares? Is there a hashing method in Matlab which I can use with confidence?
Example data:
a=[2,13,14,18,19,21,23,24,25,27]
b=[6,79,82,85,89,111,113,123,127,129]
c=[3,9,59,91,99,101,110,119,120,682]
d=[11,57,74,83,86,90,92,102,103,104]
So the problem is that if I just check the sum, it could be that cluster d for example, looses points 11,103 and gets 9,105. Then I would mistakenly think that there has been no change in the cluster.
This is one of those (very common) situations where the more we know about your data and application the better we are able to help. In the absence of better information than you provide, and in the spirit of exposing the weakness of answers such as this in that absence, here are a couple of suggestions you might reject.
One appropriate data structure for set operations is a bit-set, that is a set of length equal to the cardinality of the underlying universe of things in which each bit is set on or off according to the things membership of the (sub-set). You could implement this in Matlab in at least two ways:
a) (easy, but possibly consuming too much space): define a matrix with as many columns as there are points in your data, and one row for each cluster. Set the (cluster, point) value to true if point is a member of cluster. Set operations are then defined by vector operations. I don't have a clue about the relative (time) efficiency of setdiff versus rowA==rowB.
b) (more difficult): actually represent the clusters by bit sets. You'll have to use Matlab's bit-twiddling capabilities of course, but the pain might be worth the gain. Suppose that your universe comprises 1024 points, then you'll need an array of 16 uint64 values to represent the bit set for each cluster. The presence of, say, point 563 in a cluster requires that you set, for the bit set representing that cluster, bit 563 (which is probably bit 51 in the 9th element of the set) to 1.
And perhaps I should have started by writing that I don't think that this is a hashing sort of a problem, it's a set sort of a problem. Yeah, you could use a hash but then you'll have to program around the limitations of using a screwdriver on a nail (choose your preferred analogy).
If I understand correctly, to hash the ID's I would recommend using the matlab Java interface to use the Java hashing algorithms
http://docs.oracle.com/javase/1.4.2/docs/api/java/security/MessageDigest.html
You'll do something like:
hash = java.security.MessageDigest.getInstance('SHA');
Hope this helps.
I found the function
DataHash on FEX it is quiet fast for vectors and the strcmp on the keys is a lot faster than I expected.