Getting a negative value for global clustering coefficient in Spark GraphX - scala

I'm working on a large graph in GraphX and want to calculate the global clustering coefficient. I'm using a function from the book Spark GraphX in Action, which is:
def clusteringCoefficient[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]) = {
  val numTriplets =
    g.aggregateMessages[Set[VertexId]](
        et => { et.sendToSrc(Set(et.dstId))
                et.sendToDst(Set(et.srcId)) },
        (a, b) => a ++ b)  // collect each vertex's neighbour set
     .map(x => { val s = (x._2 - x._1).size; s * (s - 1) / 2 })  // neighbour pairs (open triplets) at this vertex
     .reduce((a, b) => a + b)
  println(numTriplets)
  if (numTriplets == 0) 0.0 else
    g.triangleCount.vertices.map(_._2).reduce(_ + _) /
      numTriplets.toFloat
}
I canonicalise the graph and partition it before running the algorithm, but for some graphs I get a negative clustering coefficient, which should be impossible. I put the print statement in the function just for debugging, and for those graphs numTriplets itself comes out negative.
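(By canonicalising I mean something along the lines of the sketch below; the exact PartitionStrategy shouldn't matter, and graph just stands for whatever Graph I pass to the function.)

import org.apache.spark.graphx._

// Rough sketch: orient every edge from the smaller to the larger vertex id,
// drop duplicate edges, then partition the edge RDD.
// PartitionStrategy.RandomVertexCut is illustrative only.
val canonical = Graph(
    graph.vertices,
    graph.edges
      .map(e => if (e.srcId < e.dstId) e else Edge(e.dstId, e.srcId, e.attr))
      .distinct())
  .partitionBy(PartitionStrategy.RandomVertexCut)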
I'm not very experienced with Scala, so I can't see whether there is a bug in the implementation.
Any help would be appreciated!

Related

How does Spark Word2Vec merge each partition's results?

Increasing numPartitions for Spark's Word2Vec makes it faster but less accurate, since it fits each partition separately before merging the results, which reduces the context available for each word.
How exactly does it merge the results from multiple partitions? Is it just an average of the vectors? I'm looking to better understand how this affects the accuracy.
Looking at the source code, I think the merging is happening here:
val synAgg = partial.reduceByKey { case (v1, v2) =>
  blas.saxpy(vectorSize, 1.0f, v2, 1, v1, 1)  // in-place v1 += v2
  v1
}.collect()
That looks like a plain element-wise vector sum (so, up to a constant factor, an average). partial comes from:
val sentences: RDD[Array[Int]] = dataset.mapPartitions { sentenceIter =>
  // Each sentence will map to 0 or more Array[Int]
  sentenceIter.flatMap { sentence =>
    // Sentence of words, some of which map to a word index
    val wordIndexes = sentence.flatMap(bcVocabHash.value.get)
    // break wordIndexes into trunks of maxSentenceLength when has more
    wordIndexes.grouped(maxSentenceLength).map(_.toArray)
  }
}

val newSentences = sentences.repartition(numPartitions).cache()

val partial = newSentences.mapPartitionsWithIndex { case (idx, iter) =>
  // ... long calculation (skip-gram training, etc.)
}
But I'm not a Word2Vec/Spark ML/Scala expert, so I'm hoping someone more knowledgeable can verify.
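To make that concrete, here is a toy illustration (not the actual Word2Vec code) of what I understand that reduceByKey merge to be doing, with plain float arrays standing in for the partially trained vectors:

// Hypothetical stand-in for `partial`: (wordIndex, partiallyTrainedVector) pairs,
// one per partition that saw that word.
val partial = sc.parallelize(Seq(
  (0, Array(1.0f, 2.0f)),  // word 0, trained on partition A
  (0, Array(3.0f, 4.0f)),  // word 0, trained on partition B
  (1, Array(5.0f, 6.0f))
))

// Same shape as the reduceByKey above: element-wise addition per word index.
val merged = partial.reduceByKey { (v1, v2) =>
  var i = 0
  while (i < v1.length) { v1(i) += v2(i); i += 1 }
  v1
}.collect()
// word 0 -> Array(4.0, 6.0): a sum of the per-partition vectors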

Spark Mllib - Frequent Pattern Mining - Association Rules - Not getting the expected results

I have the following dataset:
[A,D]
[C,A,B]
[A]
[A,E,D]
[B,D]
And I am trying to extract some association rules from it using frequent pattern mining in Spark MLlib. For that I have the following code:
val transactions = sc.textFile("/user/cloudera/teste")

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

val freqItemsets = transactions
  .repartition(10)
  .map(_.split(","))
  .flatMap(xs =>
    (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++
     xs.combinations(4) ++ xs.combinations(5))
      .filter(_.nonEmpty)
      .map(x => (x.toList, 1L)))
  .reduceByKey(_ + _)
  .map { case (xs, cnt) => new FreqItemset(xs.toArray, cnt) }

val ar = new AssociationRules().setMinConfidence(0.8)
val results = ar.run(freqItemsets)

results.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",")
    + "=>"
    + rule.consequent.mkString(",") + "]," + rule.confidence)
}
But all the rules extracted have confidence equal to 1:
[[C=>A],1.0
[[C=>B]],1.0
[A,B]=>[C],1.0
[E=>D]],1.0
[E=>[A],1.0
[A=>B]],1.0
[A=>[C],1.0
[[C,A=>B]],1.0
[[A=>D]],1.0
[E,D]=>[A],1.0
[[A,E=>D]],1.0
[[C,B]=>A],1.0
[[B=>D]],1.0
[B]=>A],1.0
[B]=>[C],1.0
I really don't understand what the issue in my code is... Does anyone know what I am doing wrong in calculating the confidence?
Many thanks!
Your data set is too tiny. The maximum frequency of any item in your data is 3, so the only confidence values you can get are 0, 1/3, 1/2, 2/3, and 1, and only 1 is larger than 0.8.
Try setting the minimum confidence to 0.6; then you can actually get
[D]=>[A] with confidence 0.666
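For example, just lowering the threshold in your code (everything else unchanged):

// Lower the threshold so rules with confidence below 1.0 can surface
val ar = new AssociationRules().setMinConfidence(0.6)
val results = ar.run(freqItemsets)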

Aggregation of multiple values using scala/spark

I am new to Spark and Scala. I want to sum up all the values present in an RDD. Below is an example.
The RDD is a key-value pair, and suppose after doing some joins and transformations the RDD has 3 records as below, where A is the key:
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
(A, List(1,1,1,1,1,1,1))
Now I want to sum up the values of each record element-wise with the corresponding values in the other records, so the output should look like:
(A, List(3,3,3,3,3,3,3))
Can anyone please help me out with this? Is there any possible way to achieve this using Scala?
Big thanks in advance!
A naive approach is to reduceByKey:
rdd.reduceByKey(
  (xs, ys) => xs.zip(ys).map { case (x, y) => x + y }
)
but it is rather inefficient because it creates a new List on each merge.
You can improve on that by using, for example, aggregateByKey with a mutable buffer:
rdd.aggregateByKey(Array.fill(7)(0))(  // Mutable buffer as the zero value
  // For seqOp we'll mutate the accumulator in place
  (acc, xs) => {
    for {
      (x, i) <- xs.zipWithIndex
    } acc(i) += x
    acc
  },
  // For performance you could mutate acc1 as above instead
  (acc1, acc2) => acc1.zip(acc2).map { case (x, y) => x + y }
).mapValues(_.toList)
It should also be possible to use DataFrames, but by default recent versions schedule these aggregations separately, so without adjusting the configuration it is probably not worth the effort.
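For reference, a rough sketch of how a DataFrame version could look (assuming Spark 2.1+ for posexplode and spark.implicits._ in scope; the column names "key" and "values" are just illustrative):

import org.apache.spark.sql.functions._

val df = rdd.toDF("key", "values")

val summed = df
  .select($"key", posexplode($"values"))          // -> key, pos, col
  .groupBy($"key", $"pos")
  .agg(sum($"col").as("total"))                   // element-wise sum per position
  .groupBy($"key")
  .agg(sort_array(collect_list(struct($"pos", $"total"))).as("entries"))
  .select($"key", $"entries.total".as("totals"))  // back to an ordered list per key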

Spark task optimisation

I am trying to find an optimised way to generate a list of unique co-location pairings. I have tried doing this with a series of flatMaps and distinct queries, but I have found the flatMap approach not particularly performant when running over millions of records. Any help in optimising this would be gratefully received.
The dataset is (geohash, id) pairs and I am running this on a 30-node cluster.
val rdd = sc.parallelize(Seq(("gh5", "id1"), ("gh4", "id1"), ("gh5", "id2"), ("gh5", "id3")))

val uniquePairings = rdd.groupByKey()
  .map(value =>
    value._2.toList.sorted.combinations(2)
      .map { case Seq(x, y) => (x, y) }
      .filter(id => id._1 != id._2))
  .flatMap(x => x)
  .distinct()

// Expected output: Array(("id1","id2"), ("id1","id3"), ("id2","id3"))
A simple join should be more than enough here. For example with DataFrames:
val df = rdd.toDF

df.as("df1").join(df.as("df2"),
  ($"df1._1" === $"df2._1") &&
  ($"df1._2" < $"df2._2")
).select($"df1._2", $"df2._2")
or datasets
val ds = rdd.toDS

ds.as("ds1").joinWith(ds.as("ds2"),
  ($"ds1._1" === $"ds2._1") &&
  ($"ds1._2" < $"ds2._2")
).map { case ((_, x), (_, y)) => (x, y) }
Look into the cartesian function. It produces an RDD containing all possible pairs of elements from the input RDDs. Do note that this is an expensive operation (N^2 in the size of the RDD).
Cartesian example
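A rough sketch of that approach on the example rdd above (keeping the quadratic cost in mind):

// Pair the RDD with itself, keep each co-located id pair once, in canonical order.
val pairs = rdd.cartesian(rdd)
  .filter { case ((gh1, id1), (gh2, id2)) => gh1 == gh2 && id1 < id2 }
  .map { case ((_, id1), (_, id2)) => (id1, id2) }
  .distinct()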

Iterative algorithms with Spark streaming

So I understand that Spark can perform iterative algorithms on a single RDD, for example logistic regression:
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
The above example is iterative because it maintains a global state w that is updated after each iteration, and its updated value is used in the next iteration. Is this functionality possible in Spark Streaming? Consider the same example, except that points is now a DStream. In this case, you could create a new DStream that calculates the gradient with
val gradient = points.map(p =>
  (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
But how would you handle the global state w? It seems like w would have to be a DStream too (using updateStateByKey maybe), but then its latest value would somehow need to be passed into the points map function, which I don't think is possible. I don't think DStreams can communicate in that way. Am I correct, or is it possible to have iterative computations like this in Spark Streaming?
I just found out that this is quite straightforward with the foreachRDD function. MLlib actually provides models that you can train on DStreams, and I found the answer in the StreamingLinearAlgorithm code. It looks like you can just keep your global update variable locally in the driver and update it inside .foreachRDD, so there is actually no need to transform it into a DStream itself. So you can apply this to the example I provided with something like
points.foreachRDD { (rdd, time) =>
  val gradient = rdd.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
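(For reference, the built-in streaming models mentioned above are used roughly like this; a sketch with StreamingLinearRegressionWithSGD, where trainingStream is assumed to be an existing DStream[LabeledPoint] and D the number of features:)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// trainingStream: DStream[LabeledPoint] is assumed to exist already.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(D))

model.trainOn(trainingStream)  // the weights are updated as each micro-batch arrives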
Hmm... you can achieve something by parallelizing your iterations and then folding over them to update your gradient.
Also... I think you should keep Spark Streaming out of it, as this problem does not seem to have anything that ties it to any kind of streaming requirement.
// So, assuming... points is somehow an RDD[Point]
val points = sc.textFile(...).map(parsePoint).cache()

var w = Vector.random(D)

// since fold is ( T )( ( T, T ) => T ) => T
val temps = sc.parallelize(1 to ITERATIONS).map(_ => w)

// now fold over temps.
val gradient = temps.fold(w)((acc, v) => {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (acc dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  acc - gradient
})