I am really confused how to compute precision and recall in clustering applications.
I have the following situation:
Given two sets A and B, I can determine which elements of A and B match by using a unique key for each element. I want to cluster those elements based on features (not using the unique key, of course).
I am doing the clustering but I am not sure how to compute precision and recall. The formulas, according to the paper "Extended Performance Graphs for Cluster Retrieval" (http://staff.science.uva.nl/~nicu/publications/CVPR01_nies.pdf), are:
p = precision = relevant retrieved items/retrieved items and
r = recall = relevant retrieved items/relevant items
I really do not get what elements fall under which category.
What I did so far is, I checked within the clusters how many matching pairs I have (using the unique key). Is that already one of precision or recall? And if so, which one is it and how can I compute the other one?
Update: I just found another paper with the title "An F-Measure for Evaluation of Unsupervised Clustering with Non-Determined Number of Clusters" at http://mtg.upf.edu/files/publications/unsuperf.pdf.
I think you'll find wikipedia has a helpful article on precision and recall. In short:
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
There are several other measures of cluster validity that I've been using in some research on assessing clustering methods. In cases where you have a dataset labeled with classes (supervised clustering) you can use precision and recall as mentioned above, or purity and entropy.
Purity of a cluster = the number of occurrences of the most frequent class / the size of the cluster (this should be high)
Entropy of a cluster = a measure of how dispersed classes are within a cluster (this should be low)
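As a minimal sketch of these two measures (assuming that for each cluster you have the list of true class labels of its members; the function names below are mine), in Python:

from collections import Counter
from math import log2

def purity(cluster_labels):
    # fraction of items belonging to the most frequent class in the cluster
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def entropy(cluster_labels):
    # how dispersed the class labels are within the cluster (0 = pure)
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    return -sum((c / n) * log2(c / n) for c in counts.values())

labels = ["A", "A", "A", "B"]      # true classes of one cluster's members
print(purity(labels))              # 0.75  (high is good)
print(entropy(labels))             # ~0.81 (low is good)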
In cases where you don't have the class labels (unsupervised clustering), intra and inter similarity are good measures.
Intra-cluster similarity for a single cluster = average cosine similarity of all pairs within a cluster (this should be high)
Inter-cluster similarity for a single cluster = average cosine similarity of all items in one cluster compared to all items in every other cluster (this should be low)
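A rough sketch of intra- and inter-cluster similarity in Python/NumPy, assuming each cluster is given as a list of NumPy feature vectors (all names here are mine, not from any particular library):

import numpy as np

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def intra_similarity(cluster):
    # average cosine similarity over all pairs within one cluster (high is good)
    sims = [cosine_sim(cluster[i], cluster[j])
            for i in range(len(cluster)) for j in range(i + 1, len(cluster))]
    return float(np.mean(sims))

def inter_similarity(cluster, other_clusters):
    # average cosine similarity of this cluster's items to all items of the other clusters (low is good)
    sims = [cosine_sim(x, y) for x in cluster for other in other_clusters for y in other]
    return float(np.mean(sims))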
This paper has some good descriptions of all four of these measures.
http://glaros.dtc.umn.edu/gkhome/fetch/papers/edcICAIL05.pdf
Nice link with the unsupervised F-measure, I'm looking into that right now.
What I make of this problem is:
One of the sets A and B is the "positive" one; let's suppose A is the positive one.
Then, for an element of A in a cluster:
if the matching element of B is in the same cluster, it is a true positive;
if the matching element of B is not in the same cluster, it is a false negative;
if a non-matching element of B is in the same cluster, it is a false positive;
if a non-matching element of B is not in the same cluster, it is a true negative.
Then just use
Precision = true positives / (true positives + false positives)
Recall = true positives / (true positives + false negatives)
as mentioned by someone
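A minimal sketch of that counting in Python, under the assumption that you have a dict cluster_of giving every element's cluster, a dict match_in_B giving the matching B element for each A element, and the list of all B elements (all of these names are hypothetical):

def precision_recall(cluster_of, match_in_B, elements_B):
    tp = fp = fn = 0
    for a, b_match in match_in_B.items():
        same_cluster_bs = [b for b in elements_B if cluster_of[b] == cluster_of[a]]
        for b in same_cluster_bs:
            if b == b_match:
                tp += 1      # matching element of B in the same cluster
            else:
                fp += 1      # non-matching element of B in the same cluster
        if cluster_of[b_match] != cluster_of[a]:
            fn += 1          # matching element of B in a different cluster
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall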
I think there's a problem with your definitions.
Precision and recall are suited to classification problems, which are basically two-cluster problems. Had you clustered into something like "good items" (= retrieved items) and "bad items" (= non-retrieved items), then your definition would make sense.
In your case you calculated the percentage of correct clusterings out of all the items, which is sort of like precision, but not really, because, as I said, the definitions don't apply.
See "Introduction to Information Retrieval", chapter 18 (fat clustering), for ways to evaluate clustering algorithms.
http://nlp.stanford.edu/IR-book/html/htmledition/flat-clustering-1.html
This section of the book may also prove useful as it discusses metrics such as precision and recall:
http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-unranked-retrieval-sets-1.html
The problem with precision and recall is that they generally require you to have some idea of what the 'true' labels are, whereas in many cases (and in your description) you don't know the labels, but you know the partition to compare against. I'd suggest the adjusted Rand index perhaps:
http://en.wikipedia.org/wiki/Rand_index
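For example, a minimal sketch with scikit-learn (assuming it is available), comparing the partition induced by the unique keys against the clustering output:

from sklearn.metrics import adjusted_rand_score

true_partition     = [0, 0, 1, 1, 2, 2]   # e.g. group id derived from the unique key
cluster_assignment = [0, 0, 1, 2, 2, 2]   # output of the clustering
print(adjusted_rand_score(true_partition, cluster_assignment))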
If you consider one of the sets, say A, as gold clustering and the other set (B) as an output of your clustering process, (exact) precision and recall values can be estimated as:
Precision = (Number of elements common to A and B)/(Number of Elements in B)
Recall = (Number of elements common to A and B)/(Number of Elements in A)
From these, the standard F-measure can be computed as well.
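A small sketch of these formulas, with the F-measure included (treating A and B as plain Python sets):

def set_precision_recall_f1(A, B):
    common = len(A & B)                       # elements common to A and B
    precision = common / len(B) if B else 0.0
    recall = common / len(A) if A else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(set_precision_recall_f1({1, 2, 3, 4}, {2, 3, 5}))   # ~ (0.667, 0.5, 0.571)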
I am using a GLMM to determine differences in soil compaction across 3 locations and 2 seasons in undisturbed and disturbed sites. I used location and season as random effects. My teacher says to use the compaction reading divided by its upper bound as the Y value against the different sites (fixed effect). (I was previously using disturbed and undisturbed sites coded as 1/0 as Y against the compaction reading, i.e. the opposite way around.) The random-effect variances are minimal. I was using both glmer and glmmPQL: glmer to determine AIC and therefore the best model fit (this cannot be done in glmmPQL), while glmmPQL reports all the components of variation, which glmer does not. The outcomes are very similar when using disturbed/undisturbed as Y (and match the graphs), but when using the proportion of the compaction reading only glmmPQL is similar to the graphs; glmer using proportions is totally different. Additionally, my teacher says I need to validate my model choice with a chi-squared value and, if over-dispersed, use a quasi-binomial. But I cannot find any way to do this in glmmPQL, and with glmer showing strange results using proportions as Y I am unsure if this is correct. I also cannot use a quasi-binomial in either glmer or glmmPQL.
My response was the compaction reading, which is measured from 0 to 6 (kg per cm squared) inclusive. The explanatory variable was Type (different soils, either disturbed or not disturbed; 4 categories because some were artificially disturbed to pull out differences). All compaction readings were divided by 6 to make them a proportion, and so a continuous variable bounded by 0 and 1 but containing values of both 0 and 1. (I also tried the reverse: coded disturbed as 1 and undisturbed as 0, compared these groups separately across all Types (due to 4 Types), and left the compaction readings as original.) Using glmer with code:
library(lme4)   # glmer lives here
model1 <- glmer(comp/6 ~ Type + (1|Loc/Seas), data=mydata, family="binomial")
model2 <- glmer(comp/6 ~ Type + (1|Loc), data=mydata, family="binomial")
and using glmmPQL:
library(MASS)   # glmmPQL lives here
mod1 <- glmmPQL(comp/6 ~ Type, random=~1|Loc, family=binomial, data=mydata)
mod2 <- glmmPQL(comp/6 ~ Type, random=~1|Loc/Seas, family=binomial, data=mydata)
I could compare models in glmer but not in glmmPQL; however, the latter gave me the variance for all random effects plus the residual variance, whereas glmer did not provide the residual variance (so I was left wondering what the total amount was and what proportion of it these random effects were responsible for).
When I used glmer this way, the results were completely different from glmmPQL: there was no significant difference at all in glmer but a very significant difference in glmmPQL. (However, if I do the reverse and code by disturbed and undisturbed, the two do give similar results between glmer and glmmPQL and match what the graphs suggest, but my supervisor said this is not strictly correct, e.g. mod3 <- glmmPQL(Status~compaction, random=~1|Loc/Seas, family = binomial, data=mydata), where Status is 1 or 0 according to disturbed or undisturbed. Also, my supervisor would like me to provide a chi-squared goodness of fit for the chosen model, so can I only use glmer here?) Additionally, the random-effects variance is minimal, and glmer model selection removes these effects as non-significant (although keeping one in gives a smaller AIC). Removing them (as suggested by the chi-squared test, but not by AIC) and running only a glm is consistent with both the results from glmmPQL and what is observed on the graph. Sorry if this seems very pedantic, but I am trying to do what is correct for my supervisor and for the species I am researching. I know there are differences: they are seen and observed, eyeballing the data suggests so, and so do the graphs. Maybe I should just run the glm? Thank you for answering me. I will find some output to post.
I have a huge list of bit vectors (BV) that I want to group in clusters.
The idea behind these clusters is to be able to later pick BVs from each cluster and combine them to generate a BV with (almost) all ones (the number of ones must be maximized).
For example, imagine that 1 means an app is up and 0 means it is down on node X at a specific moment in time. We want to find the minimal list of nodes that keeps the app up:
App BV for node X in cluster 1: 1 0 0 1 0 0
App BV for node Y in cluster 2: 0 1 1 0 1 0
Combined BV for App (X+Y): 1 1 1 1 1 0
I have been checking the different clustering algorithms, but I did not find one that takes this "complementary" behavior into account, because in this case each column of the BV does not refer to a feature (it only means up or down in a specific timeframe).
Regarding other algorithms like k-means or hierarchical clustering, it is not clear to me whether I can include this consideration for the later grouping in the clustering algorithm.
Finally, I am using the Hamming distance to determine the intra-cluster and inter-cluster distances, since it seems to be the most appropriate metric for binary data, but the results show me that clusters are not closely grouped and well separated from each other, so I wonder whether I am applying the most suitable grouping/approximation method, or even whether I should filter the input data before grouping.
Any clue or idea regarding the grouping/clustering method or data filtering is welcome.
This does not at all sound like a clustering problem.
None of these algorithms will help you.
Instead, I would rather call this a match making algorithm. But I'd assume it is at least NP-hard (it resembles set cover) to find the true optimum, so you'll need to come up with a fast approximation. Best something specific to your use case.
Also, you haven't specified how two vectors are combined (you wrote + but that likely isn't what you want): is it xor or or? Nor whether it is possible to combine more than two, and what the cost is when doing so. A strategy would be to find, for each vector, the nearest neighbor of its inverse bit vector and always combine the best pair.
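Since this resembles set cover, a minimal greedy sketch in Python (assuming the combination is a bitwise or, which is my reading of the "+" in the question):

def greedy_cover(vectors):
    # vectors: list of equal-length 0/1 lists; returns indices of a small covering subset
    n = len(vectors[0])
    covered = [0] * n
    chosen = []
    while sum(covered) < n:
        best_i, best_gain = None, 0
        for i, v in enumerate(vectors):
            gain = sum(1 for c, b in zip(covered, v) if c == 0 and b == 1)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:        # nothing adds coverage; full coverage is impossible
            break
        chosen.append(best_i)
        covered = [c | b for c, b in zip(covered, vectors[best_i])]
    return chosen

print(greedy_cover([[1,0,0,1,0,0], [0,1,1,0,1,0], [0,0,0,0,0,1]]))   # [1, 0, 2]

This only approximates the optimum, but greedy set cover has a known approximation guarantee and is fast enough for large lists.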
In the Kademlia paper by Petar Maymounkov and David Mazières, it is said that the XOR distance is a valid non-Euclidian metric with limited explanations as to why each of the properties of a valid metric are necessary or interesting, namely:
d(x,x) = 0
d(x,y) > 0, if x != y
forall x,y : d(x,y) = d(y,x) -- symmetry
d(x,z) <= d(x,y) + d(y,z) -- triangle inequality
Why is it important for a metric to have these properties in general? Why is each of these properties necessary in the context of routing queries in the Kademlia Distributed Hash Table implementation?
In addition, the paper mentions that unidirectionality (for a given x, and a distance l, there exist only a single y for which d(x,y) = l) guarantees that all queries will converge along the same path. Why is that so?
I can only speak for Kademlia, maybe someone else can provide a more general answer. In the meantime...
d(x,x) = 0
d(x,y) > 0, if x != y
These two points together effectively mean that the closest point to x is x itself; every other point is further away. (This may seem intuitive, but other aspects of the XOR metric aren't.)
In the context of Kademlia, this is important since a lookup for node with ID x will yield that node as the closest. It would be awkward if that were not the case, since a search converging towards x might not find node x.
forall x,y : d(x,y) = d(y,x)
The structure of the Kademlia routing table is such that nodes maintain detailed knowledge of the address space closest to them, and exponentially decreasing knowledge of more distant address space. In short, a node tries to keep all the k closest contacts it hears about.
The symmetry is useful since it means that each of these closest contacts will be maintaining detailed knowledge of a similar part of the address space, rather than a remote part.
If we didn't have this property, it might be helpful to think of the search as more like the hands of a clock moving in one direction round a clockface. The node at 1 o'clock (Node1) is close to Node2 at 2 o'clock (30°), but Node2 is far from Node1 (330°). So imagine we're looking for the two closest to 3 o'clock (i.e. Node1 and Node2). If the search reaches Node2, it won't know about Node1 since it's far away. The whole lookup and topology would have to change.
d(x,z) <= d(x,y) + d(y,z)
If this weren't the case, it would be impossible for a node to know which contacts from its routing table to return during a lookup. It would know the k closest to the target, but there would be no guarantee that one of the other more distant contacts wouldn't yield a shorter overall path.
Because of this property and unidirectionality, different searches starting from vastly separated points will tend to converge down the same path.
The unidirectionality means that no two nodes can have the same distance from a given point. If that weren't the case, then the target point could be encircled by a bunch of nodes all the same distance from it. Then various different searches would be free pick any of those to pass through. However, unidirectionality guarantees that exactly one of this bunch will be the closest, and any search which chooses between this group will always select the same one.
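A brute-force illustration in Python over a tiny 4-bit ID space, checking the four properties plus the unidirectionality discussed here (this is not Kademlia code, just a sanity check of the axioms):

from itertools import product

IDS = range(16)                    # 4-bit identifiers
d = lambda x, y: x ^ y             # XOR distance, read as an integer

assert all(d(x, x) == 0 for x in IDS)
assert all(d(x, y) > 0 for x, y in product(IDS, IDS) if x != y)
assert all(d(x, y) == d(y, x) for x, y in product(IDS, IDS))
assert all(d(x, z) <= d(x, y) + d(y, z) for x, y, z in product(IDS, IDS, IDS))

# unidirectionality: for a fixed x and distance l there is exactly one y with d(x, y) == l
for x in IDS:
    for l in IDS:
        assert sum(1 for y in IDS if d(x, y) == l) == 1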
I've been bashing my head on this for quite some time: how can the XOR - as in the number of differing bits, a proper Hamming distance - be the basis of a total order?
Well, it can't: such a metric on its own is not enough for a comparison relationship; all it can do is dump nodes in circles around a point.
Then I read the paper more closely and noticed that it says "the XOR as an integer value" and it dawned on me: the crux is not the "XOR metric", but the length of the common prefix of the ID (of which XOR is a derivation mechanism.)
Take two nodes with the same Hamming distance from "self" but different lengths of common prefix with "self": the one with the shorter common prefix is the farther node.
The paper uses "XOR distance metric", but it really should read "ID prefix length total ordering".
I think this may explain it a wee bit, let me know: http://metaquestions.me/2014/08/01/shortest-distance-between-two-points-is-not-always-a-straight-line/
Basically, if each hop covered only one bit at a time in a fully populated network (an extreme case), then each hop would have twice the knowledge of the previous hop. As you converge, the knowledge grows until you get to the closest nodes, whose knowledge is the most complete in the network.
I have to check the performance of various clustering algorithms using different performance operators in RapidMiner. For that I want to know the following things:
What does the cluster number index value, the output of the cluster count performance operator, show?
What do small and large values of the average within-cluster distance and the average within-centroid distance mean in terms of good and bad clustering?
I also want to check the values of other indexes like the Dunn index, Jaccard index and Fowlkes–Mallows for various clustering algorithms, but RapidMiner doesn't have any operator for these. What can I do about that? I don't have experience with R.
I have copied part of the answer I gave on the Rapid-I forum
The cluster number index is the count of clusters - pointless, you might say, but when used with DBSCAN it can be quite interesting: http://rapidminernotes.blogspot.co.uk/2010/12/counting-clusters.html
The avg within cluster and centroid distances are hard to interpret - one thing to search for is "elbow criterion" in this context. As the number of clusters varies, note how the validity measure changes and look for an "elbow" that marks the point where the natural progression of the measure dominates the structure.
R has many validity measures and it's worth investing some time because you can always call the R process from RapidMiner which makes it easier to work out what is going on.
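Not a RapidMiner operator, but a minimal sketch of the elbow idea in Python with scikit-learn (assumed available; the data here is a random placeholder): compute a validity measure for a range of k and look for the bend where improvements level off.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)                        # placeholder data
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))               # within-cluster sum of squares; look for the elbow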
I am clustering a large set of points. Throughout the iterations, I want to avoid re-computing cluster properties if the assigned points are the same as in the previous iteration. Each cluster keeps the IDs of its points. I don't want to compare them element-wise; comparing the sum of the ID vector is risky (a small ID can be compensated by a large one), so maybe I should compare the sum of squares? Is there a hashing method in Matlab which I can use with confidence?
Example data:
a=[2,13,14,18,19,21,23,24,25,27]
b=[6,79,82,85,89,111,113,123,127,129]
c=[3,9,59,91,99,101,110,119,120,682]
d=[11,57,74,83,86,90,92,102,103,104]
So the problem is that if I just check the sum, it could be that cluster d, for example, loses points 11 and 103 and gets 9 and 105. Then I would mistakenly think that there has been no change in the cluster.
This is one of those (very common) situations where the more we know about your data and application the better we are able to help. In the absence of better information than you provide, and in the spirit of exposing the weakness of answers such as this in that absence, here are a couple of suggestions you might reject.
One appropriate data structure for set operations is a bit-set, that is, a vector of bits whose length equals the cardinality of the underlying universe of things, in which each bit is set on or off according to the thing's membership of the (sub-)set. You could implement this in Matlab in at least two ways:
a) (easy, but possibly consuming too much space): define a matrix with as many columns as there are points in your data, and one row for each cluster. Set the (cluster, point) value to true if point is a member of cluster. Set operations are then defined by vector operations. I don't have a clue about the relative (time) efficiency of setdiff versus rowA==rowB.
b) (more difficult): actually represent the clusters by bit sets. You'll have to use Matlab's bit-twiddling capabilities of course, but the pain might be worth the gain. Suppose that your universe comprises 1024 points, then you'll need an array of 16 uint64 values to represent the bit set for each cluster. The presence of, say, point 563 in a cluster requires that you set, for the bit set representing that cluster, bit 563 (which is probably bit 51 in the 9th element of the set) to 1.
And perhaps I should have started by writing that I don't think that this is a hashing sort of a problem, it's a set sort of a problem. Yeah, you could use a hash but then you'll have to program around the limitations of using a screwdriver on a nail (choose your preferred analogy).
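For what it's worth, here is the membership-matrix idea from option (a) sketched in Python/NumPy rather than Matlab, purely as an illustration of the comparison (the names are hypothetical):

import numpy as np

n_clusters, n_points = 4, 1024
membership = np.zeros((n_clusters, n_points), dtype=bool)
membership[3, [11, 57, 74, 83, 86, 90, 92, 102, 103, 104]] = True   # members of cluster d

previous = membership.copy()
# ... reassign points here ...
unchanged = np.all(membership == previous, axis=1)   # one boolean per cluster
print(unchanged)                                     # recompute only the clusters where this is False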
If I understand correctly, to hash the IDs I would recommend using the Matlab Java interface to access the Java hashing algorithms:
http://docs.oracle.com/javase/1.4.2/docs/api/java/security/MessageDigest.html
You'll do something like:
hash = java.security.MessageDigest.getInstance('SHA');
Hope this helps.
I found the function DataHash on the File Exchange; it is quite fast for vectors, and strcmp on the keys is a lot faster than I expected.