Computing F-measure for clustering - cluster-analysis

Can anyone help me calculate a single, overall F-measure? I know how to calculate recall and precision, but I don't know how to compute one F-measure value for a given algorithm.
As an example, suppose my algorithm creates m clusters, but I know there are n clusters for the same data (as created by another benchmark algorithm).
I found one PDF, but it was not useful because the overall value I got was greater than 1. The PDF is "F Measure explained". I have also read some research papers in which the authors compare two algorithms on the basis of F-measure, and their overall values lie between 0 and 1.
If you read the PDF mentioned above carefully, the formula is F(C,K) = ∑ |ci| / N * max {F(ci,kj)}
where ci is a reference cluster and kj is a cluster created by the other algorithm; i runs from 1 to n and j runs from 1 to m. Say |c1| = 218. As I read the PDF, N = m*n, so with m = 12 and n = 10 we get N = 120. Suppose max F(c1,kj) is reached for j = 2. F(c1,k2) is definitely between 0 and 1, but the value produced by the formula above is greater than 1.

The term f-measure itself is underspecified. It's the harmonic mean, usually of precision and recall. Actually you should even say F1-score if you mean the unweighted version, because you can put different weight on the two input values. But without saying which two values are averaged (not in the sense of the arithmetic mean!) this doesn't say much.
Note that the values must be in the 0-1 value range. Otherwise, you have an error earlier on.
In cluster analysis, the common approach is to apply the F1-Measure to the precision and recall of pairs, often referred to as "pair counting f-measure". But you could compute the same mean on other values, too.
Pair-counting has the nice property that it doesn't directly compare clusters, so the result is well defined when one result has m clusters and the other has n clusters. However, pair counting needs strict partitions. When elements are not clustered or are assigned to more than one cluster, the pair-counting measures can easily go out of the range 0-1.
E. Achtert, S. Goldhofer, H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Clusterings Metrics and Visual Support
Int. Conf. Data Engineering (ICDE 2012)
Discusses some of these metrics (including Rand index and such) and gives a simple explanation of the "pair counting F-measure".

The paper Characterization and evaluation of similarity measures for pairs of clusterings by Darius Pfitzner, Richard Leibbrandt and David Powers contains a lot of useful information regarding this subject, including the following example:
Given the set,
D = {1, 2, 3, 4, 5, 6}
and the partitions,
P = {1, 2, 3}, {4, 5}, {6}, and
Q = {1, 2, 4}, {3, 5, 6}
where P is the partition created by our algorithm and Q is the partition created by a standard algorithm that we know:
PairsP = {(1, 2), (1, 3), (2, 3), (4, 5)},
PairsQ = {(1, 2), (1, 4), (2, 4), (3, 5), (3, 6), (5, 6)}, and
PairsD = {(1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 3), (2, 4),
(2, 5), (2, 6), (3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)}
a = |PairsP ∩ PairsQ| = |{(1, 2)}| = 1
b = |PairsP - PairsQ| = |{(1, 3), (2, 3), (4, 5)}| = 3
c = |PairsQ - PairsP| = |{(1, 4), (2, 4), (3, 5), (3, 6), (5, 6)}| = 5
F-measure = 2a/(2a+b+c) = 2/(2+3+5) = 0.2
Note: There is an error in the publication on page 364, where a, b, c, and d are computed: the values of b and c are switched. This switch would throw off the results of some other measures, but the F-measure is unaffected because b and c enter it only through their sum.
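The pair-counting computation above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are made up for illustration, and returning 0.0 when neither partition contains any pair is an assumed convention, not something taken from the paper.

```python
from itertools import combinations


def co_clustered_pairs(partition):
    """Return the set of unordered pairs of elements that share a cluster."""
    pairs = set()
    for cluster in partition:
        # sorted() gives every pair a canonical (small, large) order
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def pair_counting_f1(p, q):
    """Pair-counting F-measure 2a / (2a + b + c) of two strict partitions."""
    pairs_p = co_clustered_pairs(p)
    pairs_q = co_clustered_pairs(q)
    a = len(pairs_p & pairs_q)  # pairs together in both partitions
    b = len(pairs_p - pairs_q)  # pairs together only in p
    c = len(pairs_q - pairs_p)  # pairs together only in q
    denominator = 2 * a + b + c
    # Assumed convention: no pairs at all in either partition gives 0.0
    return 2 * a / denominator if denominator else 0.0


P = [{1, 2, 3}, {4, 5}, {6}]
Q = [{1, 2, 4}, {3, 5, 6}]
print(pair_counting_f1(P, Q))
```

Because the measure only looks at pairs, P and Q may contain different numbers of clusters.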

The N in your formula, F(C,K) = ∑ |ci| / N * max {F(ci,kj)}, is the sum of the |ci| over all i, i.e. it is the total number of elements. You are perhaps mistaking it for the number of clusters and are therefore getting an answer greater than one. If you make this change, your answer will be between 0 and 1.
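To see why the weights |ci| / N keep the result in range once N is the total number of elements, here is a small Python sketch of the formula. It assumes each cluster is a set of element ids and that F(ci,kj) is the usual F1 of precision and recall between the two sets; the function names are illustrative only.

```python
def f1(reference, found):
    """F1 of precision and recall between a reference cluster and a found cluster."""
    overlap = len(reference & found)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def overall_f(reference_clusters, found_clusters):
    """F(C,K) = sum over i of |ci| / N * max over j of F(ci, kj)."""
    # N is the total number of elements, not the number of clusters
    n_total = sum(len(c) for c in reference_clusters)
    return sum(
        len(c) / n_total * max(f1(c, k) for k in found_clusters)
        for c in reference_clusters
    )


C = [{1, 2, 3}, {4, 5}, {6}]
K = [{1, 2, 4}, {3, 5, 6}]
print(overall_f(C, K))
```

The weights |ci| / N sum to 1 and every max F(ci,kj) is at most 1, so the weighted sum cannot exceed 1.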


Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each

I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a feature vector for one of the movie's fields such as Title, Plot, Genres, Actors, etc.). For Actors and Genres, for example, the vector shows whether a given actor is present (1) or absent (0) in the movie.
The task is to find top 10 similar movies for each movie.
I managed to write a script in Scala that performs all those computations and does the job. It works for smaller sets of movies such as 1000 movies but not for the whole dataset (out of memory, etc.).
The way I do this computation is by using a cross join on the movies dataset. Then reduce the problem by only taking rows where movie1_id < movie2_id.
Still, the dataset at this point contains about 46000^2/2 rows, which is 1,058,000,000.
And each row holds a significant amount of data.
Then I calculate similarity score for each row. After similarity is calculated I group the results where movie1_id is same and sort them in descending order by similarity score using a Window function taking top N items (similar to how it's described here: Spark get top N highest score results for each (item1, item2, score)).
The question is - can it be done more efficiently in Spark? E.g. without having to perform a crossJoin?
And another question - how does Spark deal with such huge Dataframes (1058000000 rows consisting of multiple SparseVectors)? Does it have to keep all this in memory at a time? Or does it process such dataframes piece by piece somehow?
I'm using the following function to calculate similarity between movie vectors:
// BSV is presumably an alias for breeze.linalg.SparseVector; toBreeze and
// magnitude are helper functions that are not shown here.
def intersectionCosine(movie1Vec: SparseVector, movie2Vec: SparseVector): Double = {
  val a: BSV[Double] = toBreeze(movie1Vec)
  val b: BSV[Double] = toBreeze(movie2Vec)

  // Dot product over the active (stored) entries of a.
  var dot: Double = 0
  var offset: Int = 0
  while (offset < a.activeSize) {
    val index: Int = a.indexAt(offset)
    val value: Double = a.valueAt(offset)
    dot += value * b(index)
    offset += 1
  }

  // Restrict b to the indices that are active in a.
  val bReduced: BSV[Double] = new BSV(a.index, a.index.map(i => b(i)), a.index.length)
  val maga: Double = magnitude(a)
  val magb: Double = magnitude(bReduced)
  if (maga == 0 || magb == 0) 0.0
  else dot / (maga * magb)
}
Each row in the DataFrame consists of two joined instances of the following class:
final case class MovieVecData(imdbID: Int,
                              Title: SparseVector,
                              Decade: SparseVector,
                              Plot: SparseVector,
                              Genres: SparseVector,
                              Actors: SparseVector,
                              Countries: SparseVector,
                              Writers: SparseVector,
                              Directors: SparseVector,
                              Productions: SparseVector,
                              Rating: Double)
It can be done more efficiently, as long as you are fine with approximations and don't require exact results (or an exact number of results).
Similarly to my answer to Efficient string matching in Apache Spark you can use LSH, with:
BucketedRandomProjectionLSH to approximate Euclidean distance.
MinHashLSH to approximate Jaccard Distance.
If feature space is small (or can be reasonably reduced) and each category is relatively small you can also optimize your code by hand:
Explode the feature array to generate #features records from a single record.
Self join result by feature, compute distance and filter out candidates (each pair of records will be compared if and only if they share specific categorical feature).
Take top records using your current code.
A minimal example would be (consider it pseudocode):
// This is oversimplified. In practice, don't assume only the sparse scenario.
val indices = udf((v: SparseVector) => v.indices)
val df = Seq(
  (1L, Vectors.sparse(1024, Array(1, 3, 5), Array(1.0, 1.0, 1.0))),
  (2L, Vectors.sparse(1024, Array(3, 8, 12), Array(1.0, 1.0, 1.0))),
  (3L, Vectors.sparse(1024, Array(3, 5), Array(1.0, 1.0))),
  (4L, Vectors.sparse(1024, Array(11, 21), Array(1.0, 1.0))),
  (5L, Vectors.sparse(1024, Array(21, 32), Array(1.0, 1.0)))
).toDF("id", "features")
// Pair up only the records that share at least one feature index.
val possibleMatches = df
  .withColumn("key", explode(indices($"features")))
  .transform(df => df.alias("left").join(df.alias("right"), Seq("key")))
def closeEnough(threshold: Double) = udf((v1: SparseVector, v2: SparseVector) => intersectionCosine(v1, v2) > threshold)
// threshold is a similarity cutoff of your choice (a Double).
possibleMatches.filter(closeEnough(threshold)($"left.features", $"right.features")).select($"left.id", $"right.id").distinct
Note that both solutions are worth the overhead only if hashing / features are selective enough (and optimally sparse). In the example shown above you'd compare only rows inside set {1, 2, 3} and {4, 5}, never between sets.
However, in the worst-case scenario (M records, N features) we can make N·M² comparisons instead of M².
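The candidate-generation idea (compare two records only if they share a feature, then keep the top N per record) can be illustrated without Spark. The following is a plain-Python sketch, not the Spark job itself: sparse vectors are assumed to be {index: value} dicts, the similarity is an ordinary cosine rather than the intersectionCosine from the question, and all function names are hypothetical.

```python
import heapq
import math
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    dot = sum(value * v.get(index, 0.0) for index, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_similar(vectors, n):
    """For each record id, return up to n (score, other_id) tuples, best first."""
    # Inverted index: feature index -> ids of the records that have it
    postings = defaultdict(set)
    for record_id, vector in vectors.items():
        for index in vector:
            postings[index].add(record_id)

    # Candidate pairs share at least one feature; the set removes duplicates
    candidates = set()
    for ids in postings.values():
        candidates.update(combinations(sorted(ids), 2))

    scores = defaultdict(list)
    for left, right in candidates:
        score = cosine(vectors[left], vectors[right])
        scores[left].append((score, right))
        scores[right].append((score, left))
    return {rid: heapq.nlargest(n, found) for rid, found in scores.items()}


# Same five records as in the example above
vectors = {
    1: {1: 1.0, 3: 1.0, 5: 1.0},
    2: {3: 1.0, 8: 1.0, 12: 1.0},
    3: {3: 1.0, 5: 1.0},
    4: {11: 1.0, 21: 1.0},
    5: {21: 1.0, 32: 1.0},
}
print(top_n_similar(vectors, 10))
```

With this data only pairs inside {1, 2, 3} and {4, 5} are ever scored, which is where the saving over a full cross join comes from.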
Another thought: given that your matrix is relatively small and sparse, it can fit in memory using breeze CSCMatrix[Int].
Then you can compute co-occurrences using A'B (A.transposed * B), followed by a top-N selection on the LLR (log-likelihood ratio) of each pair. Since you keep only the top 10 items per row, the output matrix will be very sparse as well.
You can lookup the details here:
You can borrow from the idea of locality sensitive hashing. Here is one approach:
Define a set of hash keys based on your matching requirements. You would use these keys to find potential matches. For example, a possible hash key could be based on the movie actor vector.
Perform reduce for each key. This will give sets of potential matches. For each of the potential matched set, perform your "exact match". The exact match will produce sets of exact matches.
Run Connected Component algorithm to perform set merge to get the sets of all exact matches.
I have implemented something similar using the above approach.
Hope this helps.
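The final set-merge step of this approach (connected components over the exact-match sets) can be sketched in plain Python with a union-find structure. This is an illustrative single-machine version under the assumption that the match sets fit in memory; it is not the implementation the answer refers to, and the function name is made up.

```python
def merge_match_sets(match_sets):
    """Merge overlapping sets of matched ids into connected components."""
    parent = {}

    def find(x):
        # Follow parent links to the root, halving the path on the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for match_set in match_sets:
        items = list(match_set)
        if items:
            find(items[0])  # registers single-element sets too
        for other in items[1:]:
            union(items[0], other)

    components = {}
    for item in list(parent):
        components.setdefault(find(item), set()).add(item)
    # Sort by smallest member so the output order is deterministic
    return sorted(components.values(), key=min)


print(merge_match_sets([{1, 2}, {2, 3}, {4, 5}]))
```

Sets that share any element end up in the same component, so {1, 2} and {2, 3} are merged while {4, 5} stays separate.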
Another possible solution would be to use the built-in RowMatrix and its brute-force columnSimilarities method, as explained on databricks:
Keep in mind that you will always have N^2 values in the resulting similarity matrix.
You will have to concatenate your sparse vectors.
One very important suggestion, which I have used in similar scenarios: if some movies have the following scores,
relation similarity score
A -> B 8/10
B -> C 7/10
C -> D 9/10
E -> A 4 // less than some threshold or hyperparameter
then don't calculate the similarity for
E -> B
E -> C

How can I group RDD by key then count per unique string?

I have an RDD like:
[(1, "Western"),
(1, "Western"),
(1, "Drama"),
(2, "Western"),
(2, "Romance"),
(2, "Romance")]
I wish to count, per userID, the occurrences of each movie genre, resulting in
1, { ("Western", 2), ("Drama", 1) } ...
After that, I intend to pick the genre with the largest count, which gives the most popular genre per user.
I have tried userGenre.sortByKey().countByValue(), but to no avail. I have no clue how to perform this task. I'm using PySpark in a Jupyter notebook.
I have tried the following and it seems to have worked; could someone confirm?
userGenre.map(lambda x: (x, 1)).aggregateByKey(
    0,                      # initial value for an accumulator
    lambda r, v: r + v,     # function that adds a value to an accumulator
    lambda r1, r2: r1 + r2  # function that merges/combines two accumulators
)
Here is one way of doing it:
rdd = sc.parallelize([('u1', "Western"), ('u2', "Western"), ('u1', "Drama"), ('u1', "Western"), ('u2', "Romance"), ('u2', "Romance")])
The occurrences of each movie genre per user could be computed as:
>>> counts = sc.parallelize(list(rdd.countByValue().items()))
>>> counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().mapValues(list).collect()
[('u1', [('Western', 2), ('Drama', 1)]), ('u2', [('Western', 1), ('Romance', 2)])]
Most popular genre:
>>> rdd.map(lambda ug: (ug, 1)).reduceByKey(lambda x, y: x + y).map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().mapValues(lambda genres: max(genres, key=lambda gc: gc[1])).collect()
[('u1', ('Western', 2)), ('u2', ('Romance', 2))]
Now one could ask: what should the most popular genre be if more than one genre has the same count?
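One possible answer to the tie question is to fix a deterministic rule. The plain-Python sketch below (no Spark involved) counts genres per user and, as an assumed rule chosen only for illustration, breaks ties alphabetically; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def most_popular_genre(pairs):
    """Map each user to (genre, count) for that user's most frequent genre."""
    counts = defaultdict(Counter)
    for user, genre in pairs:
        counts[user][genre] += 1

    result = {}
    for user, genre_counts in counts.items():
        # Highest count first; ties go to the alphabetically smallest genre
        # (an assumed tie-breaking rule, pick whatever suits your data)
        result[user] = min(genre_counts.items(), key=lambda gc: (-gc[1], gc[0]))
    return result


pairs = [('u1', "Western"), ('u2', "Western"), ('u1', "Drama"),
         ('u1', "Western"), ('u2', "Romance"), ('u2', "Romance")]
print(most_popular_genre(pairs))
```

The same key function, (-count, genre), could be passed to min inside a mapValues call on the grouped RDD.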

Create Spark dataset with parts of other dataset

I'm trying to create a new dataset by taking intervals from another dataset, for example, consider dataset1 as input and dataset2 as output:
dataset1 = [1, 2, 3, 4, 5, 6]
dataset2 = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
I managed to do that using arrays, but for MLlib a dataset is needed.
My code with arrays:
def generateSeries(values: Array[Double], n: Int): Seq[Array[Double]] = {
  // m is taken to be the length of the input array.
  val m = values.length
  // One window of n consecutive values per start position.
  for (i <- 0 to m - n) yield values.slice(i, i + n)
}
flatMap seems like the way to go, but how can a function look up the next value in the dataset?
The problem here is that an array is in no way similar to a DataSet. A DataSet is unordered and has no indices, so thinking in terms of arrays won't help you. Go for a Seq and treat it without using indices and positions at all.
So, to represent an array-like behaviour on a DataSet you need to create your own indices. This is simply done by pairing the value with the position in the "abstract array" we are representing.
So the type of your DataSet will be something like [(Int,Int)], where the first element is the index and the second is the value. The elements will arrive unordered, so you will need to rework your logic in a more functional way. It's not really clear what you're trying to achieve, but I hope I gave you a hint. Otherwise, explain the expected result in more detail in a comment on my answer and I will edit it.
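To make the index-pairing idea concrete, here is a plain-Python sketch that mimics what a flatMap followed by a group-by-key would do for windows of two consecutive values. It is an illustration under the assumption that the goal is the dataset2 shown in the question, not Spark code, and the function name is made up.

```python
from collections import defaultdict


def sliding_pairs(values):
    """Flatten all windows of two consecutive values, keyed by explicit indices."""
    # Step 1: pair every value with its position (the "abstract array" index)
    indexed = list(enumerate(values))

    # Step 2 (flatMap): each element is sent to the window it starts
    # and to the window it ends
    emitted = []
    for i, v in indexed:
        emitted.append((i, (i, v)))
        if i > 0:
            emitted.append((i - 1, (i, v)))

    # Step 3 (groupByKey): collect the elements of each window
    windows = defaultdict(list)
    for window_id, item in emitted:
        windows[window_id].append(item)

    # Step 4: order by window id and position; drop the incomplete last window
    result = []
    for window_id in sorted(windows):
        items = sorted(windows[window_id], key=lambda item: item[0])
        if len(items) == 2:
            result.extend(v for _, v in items)
    return result


print(sliding_pairs([1, 2, 3, 4, 5, 6]))
```

No step relies on the arrival order of the elements, only on the attached indices, which is what makes the approach usable on an unordered dataset.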

PySpark, intersection by Key

For example, I have two RDDs in PySpark. The first is:
((0,0), 1)
((0,1), 2)
((1,0), 3)
((1,1), 4)
and second is just
((0,1), 3)
((1,1), 0)
I want the intersection of the first RDD with the second one. In effect, the second RDD has to play the role of a mask for the first. The output should be:
((0,1), 2)
((1,1), 4)
That is, the values from the first RDD, but only for the keys from the second. The two RDDs have different lengths.
I have a tentative solution (still to be verified), something like this:
rdd3 = rdd1.cartesian(rdd2)
rdd4 = rdd3.filter(lambda pair: pair[0][0] == pair[1][0])  # keep pairs whose keys match
rdd5 = rdd4.map(lambda pair: pair[0])  # keep only (key1, val1)
I don't know how efficient this solution is. I would like to hear the opinion of experienced Spark programmers.
Perhaps we shouldn't think of this process as join. You're not really looking to join two datasets, you're looking to subtract one dataset from the other?
I'm going to state what I am assuming from your question:
You don't care about the values in the second dataset at all.
You only want to keep the values in the first dataset whose key appears in the second dataset.
Idea 1: Cogroup (I think probably the fastest way). It's basically calculating the intersection of both datasets.
rdd1 = sc.parallelize([((0,0), 1), ((0,1), 2), ((1,0), 3), ((1,1), 4)])
rdd2 = sc.parallelize([((0,1), 3), ((1,1), 0)])
intersection = rdd1.cogroup(rdd2).filter(lambda x: x[1][0] and x[1][1])
final_rdd = intersection.map(lambda x: (x[0], list(x[1][0]))).map(lambda kv: (kv[0], kv[1][0]))
Idea 2: Subtract By Key
rdd1 = sc.parallelize([((0,0), 1), ((0,1), 2), ((1,0), 3), ((1,1), 4)])
rdd2 = sc.parallelize([((0,1), 3), ((1,1), 0)])
unwanted_rows = rdd1.subtractByKey(rdd2)
wanted_rows = rdd1.subtractByKey(unwanted_rows)
I'm not 100% sure if this is faster than your method. It does require two subtractByKey operations, which can be slow. Also, this method does not preserve order (e.g. ((0, 1), 2), despite being first in your first dataset, is second in the final dataset). But I can't imagine this matters.
As to which is faster, I think it depends on how long your cartesian join takes. Mapping and filtering tend to be faster than the shuffle operations needed for subtractByKey, but of course cartesian is a time-consuming process.
Anyway, I figure you can try out this method and see if it works for you!
A sidenote for performance improvements, depending on how large your RDDs are.
If rdd1 is small enough to be held in main memory, the subtraction process can be sped up immensely if you broadcast it and then stream rdd2 against it. However, I acknowledge that this is rarely the case.
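The in-memory lookup idea can be shown in plain Python. This sketch assumes the small side is the mask (the second dataset): its keys go into a set, and the other dataset is filtered against it in one pass. In Spark the set would be shipped to the workers with sc.broadcast and used inside a filter; the function name here is hypothetical.

```python
def mask_by_keys(pairs, mask_pairs):
    """Keep the (key, value) pairs whose key also occurs in mask_pairs."""
    # Only the keys of the mask matter; its values are ignored
    wanted = {key for key, _ in mask_pairs}
    # A single pass over the data with constant-time set lookups
    return [(key, value) for key, value in pairs if key in wanted]


data1 = [((0, 0), 1), ((0, 1), 2), ((1, 0), 3), ((1, 1), 4)]
data2 = [((0, 1), 3), ((1, 1), 0)]
print(mask_by_keys(data1, data2))
```

Unlike the two subtractByKey calls, this keeps the original order of the first dataset and needs no shuffle, at the cost of holding one key set in memory.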

How to transpose an RDD in Spark

I have an RDD like this:
1 2 3
4 5 6
7 8 9
It is a matrix. Now I want to transpose the RDD like this:
1 4 7
2 5 8
3 6 9
How can I do this?
Say you have an N×M matrix.
If both N and M are so small that you can hold N×M items in memory, it doesn't make much sense to use an RDD. But transposing it is easy:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)
If N or M is so large that you cannot hold N or M entries in memory, then you cannot have an RDD line of this size. Either the original or the transposed matrix is impossible to represent in this case.
N and M may be of a medium size: you can hold N or M entries in memory, but you cannot hold N×M entries. In this case you have to blow up the matrix and put it together again:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
// Split the matrix into one number per line.
val byColumnAndRow = rdd.zipWithIndex.flatMap {
  case (row, rowIndex) => row.zipWithIndex.map {
    case (number, columnIndex) => columnIndex -> (rowIndex, number)
  }
}
// Build up the transposed matrix. Group and sort by column index first.
val byColumn = byColumnAndRow.groupByKey.sortByKey().values
// Then sort by row index.
val transposed = byColumn.map {
  indexedRow => indexedRow.toSeq.sortBy(_._1).map(_._2)
}
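The same blow-up-and-regroup steps can be traced in plain Python, which may help when checking the logic on a small matrix. This is a single-machine sketch of the idea, with a made-up function name; it does not use Spark.

```python
from collections import defaultdict


def transpose(rows):
    """Transpose a matrix by exploding it into cells and regrouping by column."""
    # Explode: one (column_index, (row_index, number)) record per cell
    cells = [
        (column_index, (row_index, number))
        for row_index, row in enumerate(rows)
        for column_index, number in enumerate(row)
    ]

    # Group by column index (the role played by groupByKey)
    by_column = defaultdict(list)
    for column_index, cell in cells:
        by_column[column_index].append(cell)

    # Sort the columns, then sort each column's cells by row index
    return [
        [number for _, number in sorted(by_column[c], key=lambda cell: cell[0])]
        for c in sorted(by_column)
    ]


print(transpose([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Because every cell carries both indices, the result does not depend on the order in which cells are processed.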
A first draft without using collect(), so everything runs worker side and nothing is done on driver:
val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
rdd.flatMap(row => (row.map(col => (col, row.indexOf(col))))) // flatMap by keeping the column position
.map(v => (v._2, v._1)) // key by column position
.groupByKey.sortByKey // regroup on column position, thus all elements from the first column will be in the first row
.map(_._2) // discard the key, keep only value
The problem with this solution is that the columns in the transposed matrix will end up shuffled if the operation is performed in a distributed system. Will think of an improved version
My idea is that in addition to attach the 'column number' to each element of the matrix, we attach also the 'row number'. So we could key by column position and regroup by key like in the example, but then we could reorder each row on the row number and then strip row/column numbers from the result.
I just don't have a way to know the row number when importing a file into an RDD.
You might think it's heavy to attach a column and a row number to each matrix element, but i guess that's the price to pay to have the possibility to process your input as chunks in a distributed fashion and thus handle huge matrices.
Will update the answer when i find a solution to the ordering problem.
As of Spark 1.6 you can use the pivot operation on DataFrames. Depending on the actual shape of your data, if you put it into a DataFrame you could pivot columns to rows. The following databricks blog is very useful, as it describes in detail a number of pivoting use cases with code examples