Spark GraphX Aggregation Summation - scala

I'm trying to compute the sum of node values in a Spark GraphX graph. In short, the graph is a tree and the top node (root) should sum all children and their children. My tree looks like this, and the expected summed value at the root should be 1850:
VertexId 20 (root)   Value: sum of 11 & 911
├── VertexId 11      Value: sum of 14 & 24
│   ├── VertexId 14  Value: 1000
│   └── VertexId 24  Value: 550
└── VertexId 911     Value: 300
The first stab at this looks like this:
val vertices: RDD[(VertexId, Int)] =
  sc.parallelize(Array(
    (20L, 0),
    (11L, 0),
    (14L, 1000),
    (24L, 550),
    (911L, 300)
  ))

// note that the last value in the edge is for factor (positive or negative)
val edges: RDD[Edge[Int]] =
  sc.parallelize(Array(
    Edge(14L, 11L, 1),
    Edge(24L, 11L, 1),
    Edge(11L, 20L, 1),
    Edge(911L, 20L, 1)
  ))

val dataItemGraph = Graph(vertices, edges)

val sum: VertexRDD[(Int, BigDecimal, Int)] = dataItemGraph.aggregateMessages[(Int, BigDecimal, Int)](
  sendMsg = { triplet => triplet.sendToDst(1, triplet.srcAttr, 1) },
  mergeMsg = { (a, b) => (a._1, a._2 * a._3 + b._2 * b._3, 1) }
)

sum.collect.foreach(println)
This returns the following:
(20,(1,300,1))
(11,(1,1550,1))
It's doing the sum for vertex 11 but it's not rolling up to the root node (vertex 20). What am I missing or is there a better way of doing this? Of course the tree can be of arbitrary size and each vertex can have an arbitrary number of children edges.

Given that the graph is directed (as it seems to be in your example), it should be possible to write a Pregel program that does what you're asking for:
val result =
  dataItemGraph.pregel(0, activeDirection = EdgeDirection.Out)(
    (_, vd, msg) => msg + vd,
    t => Iterator((t.dstId, t.srcAttr)),
    (x, y) => x + y
  )
result.vertices.collect().foreach(println)
// Output is:
// (24,550)
// (20,1850)
// (14,1000)
// (11,1550)
// (911,300)
I'm using EdgeDirection.Out so that messages are sent only from the bottom up (otherwise we would get into an endless loop).
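If only the root's total is needed, it can be read straight off the result; a minimal sketch, assuming vertex 20 is the root as in the example:

val rootSum = result.vertices.filter { case (id, _) => id == 20L }.map(_._2).first()
// rootSum: Int = 1850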

Related

Scala: Run a function that is written for arrays on a DataFrame that contains a column of arrays

So, I have the following functions that work perfectly when I use them on arrays:
def magnitude(x: Array[Int]): Double = {
  math.sqrt(x.map(i => i * i).sum)
}

def dotProduct(x: Array[Int], y: Array[Int]): Int = {
  (for ((a, b) <- x zip y) yield a * b).sum
}

def cosineSimilarity(x: Array[Int], y: Array[Int]): Double = {
  require(x.size == y.size)
  dotProduct(x, y) / (magnitude(x) * magnitude(y))
}
But, I don't know how to run it on an array that I have in a spark dataframe column.
I know the problem is that the function expects an array, but I am giving a column to it. But, I don't know how to solve the problem.
One way is to wrap your functions in UDFs. However, UDFs are often suboptimal performance-wise, so you could instead rewrite your functions with Spark SQL primitives. To make the expressions easy to reuse, you can write functions that take Column objects as parameters.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

def magnitude(x: Column) = {
  // sum of the squared elements (note: no square root here)
  aggregate(transform(x, i => i * i), lit(0), _ + _)
}

def dotProduct(x: Column, y: Column) = {
  val products = transform(arrays_zip(x, y), s => s(x.toString) * s(y.toString))
  aggregate(products, lit(0), _ + _)
}

def cosineSimilarity(x: Column, y: Column) = {
  dotProduct(x, y) / (magnitude(x) * magnitude(y))
}
Let's test this:
import spark.implicits._

val df = spark.range(1).select(
  array(lit(1), lit(2), lit(3)) as "x",
  array(lit(1), lit(3), lit(5)) as "y"
)

df.select(
  'x, 'y,
  magnitude('x) as "magnitude_x",
  dotProduct('x, 'y) as "dot_prod_x_y",
  cosineSimilarity('x, 'y) as "cosine_x_y"
).show()
which yields:
+---------+---------+-----------+------------+--------------------+
| x| y|magnitude_x|dot_prod_x_y| cosine_x_y|
+---------+---------+-----------+------------+--------------------+
|[1, 2, 3]|[1, 3, 5]| 14| 22|0.044897959183673466|
+---------+---------+-----------+------------+--------------------+
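Note that magnitude as defined above returns the sum of squares without the square root (hence 14 rather than about 3.74), so cosine_x_y also differs from what the Array-based functions would give. If the Array semantics are needed, the aggregate can be wrapped in sqrt; a minimal sketch, assuming Spark 3.x where these higher-order functions exist (magnitudeSqrt is just an illustrative name):

import org.apache.spark.sql.functions.{aggregate, lit, sqrt, transform}

def magnitudeSqrt(x: Column) = {
  // square each element, sum the squares, then take the square root
  sqrt(aggregate(transform(x, i => i * i), lit(0), _ + _))
}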
To use your own Scala functions within Spark SQL, you need to wrap them in a UDF (user defined function).
val df = spark.range(1)
  .withColumn("x", array(lit(1), lit(2), lit(3)))

// defining the user defined functions from the scala functions
val magnitude_udf = udf(magnitude _)
val dot_product_udf = udf(dotProduct(_, _))

df
  .withColumn("magnitude", magnitude_udf('x))
  .withColumn("dot_product", dot_product_udf('x, 'x))
  .show
+---+---------+------------------+-----------+
| id| x| magnitude|dot_product|
+---+---------+------------------+-----------+
| 0|[1, 2, 3]|3.7416573867739413| 14|
+---+---------+------------------+-----------+

Using Scala, find four elements from a list that sum to a given value

I am new to the Scala programming language and want to implement code for the scenario below.
Given a list sampleone of n integers and an integer samplethree, there are elements a, b, c and d in sampleone such that a + b + c + d = samplethree. Find all unique quadruplets in the list that sum to samplethree.
Example:
sampleone =[1,0,-1,0,-2,2] and samplethree = 0
a solution set is
[-1,0,0,1]
[-2,-1,1,2]
[-2,0,0,2]
the code that I have used is
scala> def findFourElements(A: List[Int], n: Int, x: Int) = {
| {
| for(a <- 0 to A.length-3)
| {
| for(b <- a+1 to A.length-2)
| {
| for(c <- b+1 to A.length-1)
| {
| for(d <- c+1 to A.length)
| {
| if(A(a) + A(b) + A(c) + A(d) == x)
| {
| print(A(a)+A(b)+A(c)+A(d))
| }}}}}}
| }
findFourElements: (A: List[Int], n: Int, x: Int)Unit
scala> val sampleone = List(1,0,-1,0,-2,2)
sampleone: List[Int] = List(1, 0, -1, 0, -2, 2)
scala> val sampletwo = sampleone.length
sampletwo: Int = 6
scala> val samplethree = 0
samplethree: Int = 0
scala> findFourElements(sampleone,sampletwo,samplethree)
0java.lang.IndexOutOfBoundsException: 6
at scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
at scala.collection.immutable.List.apply(List.scala:84)
at $anonfun$findFourElements$1$$anonfun$apply$mcVI$sp$1$$anonfun$apply$mcVI$sp$2$$anonfun$apply$mcVI$sp$3.apply$mcVI$sp(<console>:33)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at $anonfun$findFourElements$1$$anonfun$apply$mcVI$sp$1$$anonfun$apply$mcVI$sp$2.apply$mcVI$sp(<console>:31)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at $anonfun$findFourElements$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(<console>:29)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at $anonfun$findFourElements$1.apply$mcVI$sp(<console>:27)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at findFourElements(<console>:25)
... 48 elided
But I am getting an IndexOutOfBoundsException.
Also, is there a way to write this more efficiently in Scala?
Thanks for the help.
The exception comes from your loop bounds: d ranges up to A.length, which is one past the last valid index (valid indices run from 0 to A.length - 1). As for a more concise approach, this may do what you want:
sampleone.combinations(4).filter(_.sum == samplethree)
The combinations method gives an iterator that delivers each possible combination of values in turn. If there is more than one way to construct the same sequence, only one will be returned.
The filter call removes any sequences that do not sum to the samplethree value.
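For example, with the sample data from the question (a quick sketch; the ordering of the returned combinations may differ):

val sampleone = List(1, 0, -1, 0, -2, 2)
val samplethree = 0

sampleone.combinations(4).filter(_.sum == samplethree).foreach(println)
// List(1, 0, -1, 0)
// List(1, -1, -2, 2)
// List(0, 0, -2, 2)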

How to find the number of vertices that are reachable from a given vertex in Spark GraphX

I want to find out the number of vertices reachable from a given vertex in a directed graph (see the edge data below). For example, for id=0L: it connects to 1L and 2L, 1L connects to 3L, and 2L connects to 4L, so the output should be 4. Following is the graph relationship data:
edgeid from to distance
0 0 1 10.0
1 0 2 5.0
2 1 2 2.0
3 1 3 1.0
4 2 1 3.0
5 2 3 9.0
6 2 4 2.0
7 3 4 4.0
8 4 0 7.0
9 4 3 5.0
I was able to set up the graph, but I am not sure how to use graph.edges.filter to get the output:
val vertexRDD: RDD[(Long, (Double))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(Double), Int] = Graph(vertexRDD, edgeRDD)
In your example all vertices are connected with a directed path so each vertex should result in a value of 4.
But if you were to remove the 4->0 (id=8) connection there would be a different number of course.
Since your problem relies on (recursively) traversing the graph in parallel the Graphx Pregel API is probably the best approach.
The pregel call takes three functions:
vprog, which applies an incoming (merged) message to a vertex; it is also used to initialize every vertex with the initial message (in your case an empty List[VertexId])
sendMsg, the update step applied on each iteration (in your case accumulating the neighboring VertexIds and returning an Iterator with the messages to send out in the next iteration)
mergeMsg, which merges two messages (two List[VertexId]s into one)
In code it would look like:
def vprog(id: VertexId, orig: List[VertexId], newly: List[VertexId]): List[VertexId] = newly

def mergeMsg(a: List[VertexId], b: List[VertexId]): List[VertexId] = (a ++ b).distinct

def sendMsg(trip: EdgeTriplet[List[VertexId], Double]): Iterator[(VertexId, List[VertexId])] = {
  val recursivelyConnectedNeighbors = (trip.dstId :: trip.dstAttr).filterNot(_ == trip.srcId)
  if (trip.srcAttr.intersect(recursivelyConnectedNeighbors).length != recursivelyConnectedNeighbors.length)
    Iterator((trip.srcId, recursivelyConnectedNeighbors))
  else
    Iterator.empty
}

val initList = List.empty[VertexId]

val result = graph
  .mapVertices((_, _) => initList)
  .pregel(
    initialMsg = initList,
    activeDirection = EdgeDirection.Out
  )(vprog, sendMsg, mergeMsg)
  .mapVertices((_, neighbors) => neighbors.length)

result.vertices.toDF("vertex", "value").show()
Output (for the graph with the 4->0 edge removed, as discussed above):
+------+-----+
|vertex|value|
+------+-----+
| 0| 4|
| 1| 3|
| 2| 3|
| 3| 1|
| 4| 1|
+------+-----+
Make sure to experiment with spark.graphx.pregel.checkpointInterval if you are getting OOMs while traversing large graphs (or configure maxIterations in the pregel call).
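Both knobs live outside the algorithm itself; a rough sketch of where they go (the interval of 2, the cap of 50 iterations, and the checkpoint path are arbitrary example values):

// spark.graphx.pregel.checkpointInterval is read from the SparkConf, so set it before
// the SparkContext is created and point the context at a checkpoint directory, e.g.:
//   new SparkConf().set("spark.graphx.pregel.checkpointInterval", "2")
//   sc.setCheckpointDir("/tmp/graphx-checkpoints")   // hypothetical path

// maxIterations is simply an extra parameter of the pregel call, capping the number of supersteps
val capped = graph
  .mapVertices((_, _) => initList)
  .pregel(initialMsg = initList, maxIterations = 50, activeDirection = EdgeDirection.Out)(
    vprog, sendMsg, mergeMsg)
  .mapVertices((_, neighbors) => neighbors.length)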

Add a column to DataFrame with value of 1 where prediction greater than a custom threshold

I am trying to add a column to a DataFrame that should have the value 1 when the output class probability is high. Something like this:
val output = predictions
  .withColumn(
    "easy",
    when($"label" === $"prediction" && $"probability" > 0.95, 1).otherwise(0)
  )
The problem is, probability is a Vector, and 0.95 is a Double, so the above doesn't work. What I really need is more like max($"probability") > 0.95 but of course that doesn't work either.
What is the right way of accomplishing this?
Here is a simple example of how to implement this. Create a UDF, pass the probability column to it, and return 0 or 1 for the newly added column. Note that inside a Row the array arrives as a WrappedArray rather than an Array or Vector.
import scala.collection.mutable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(Seq(
  (Vector(0.78, 0.98, 0.97), 1), (Vector(0.78, 0.96), 2), (Vector(0.78, 0.50), 3)
)).toDF("probability", "id")

// the array column is received in the UDF as a WrappedArray[Double]
def label = udf((prob: mutable.WrappedArray[Double]) => {
  if (prob.max >= 0.95) 1 else 0
})

data.withColumn("label", label($"probability")).show()
Output:
+------------------+---+-----+
| probability| id|label|
+------------------+---+-----+
|[0.78, 0.98, 0.97]| 1| 1|
| [0.78, 0.96]| 2| 1|
| [0.78, 0.5]| 3| 0|
+------------------+---+-----+
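In the question itself, though, probability comes out of an ML model, where it is an org.apache.spark.ml.linalg.Vector rather than an array, so the UDF parameter type changes accordingly. A minimal sketch under that assumption (maxProb is a hypothetical helper name):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{udf, when}

// returns the largest class probability in the vector
val maxProb = udf((prob: Vector) => prob.toArray.max)

val output = predictions.withColumn(
  "easy",
  when($"label" === $"prediction" && maxProb($"probability") > 0.95, 1).otherwise(0)
)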
Define a UDF (fill in the <type> placeholders to match your schema):
val findP = udf((label: <type>, prediction: <type>, probability: <type>) => {
  if (label == prediction && probability.toArray.max > 0.95) 1 else 0
})
Use the UDF in withColumn():
val output = predictions.withColumn("easy", findP($"label", $"prediction", $"probability"))
Use a UDF, something like:
import org.apache.spark.ml.linalg.Vector

val func = udf((label: String, prediction: String, vector: Vector) => {
  if (label == prediction && vector.toArray.max > 0.95) 1 else 0
})

val output = predictions
  .select($"label", func($"label", $"prediction", $"probability").as("easy"))

UDF to randomly assign values based on different probabilities

I would like to create a UDF to randomly assign values based on different probabilities.
In the following example, depending on the value returned by rand:
0 to 0.5: the value should be A (50% probability)
0.8 to 1: the value should be B (20% probability)
anything else: the value should be C (30% probability)
val names = Array("A", "B", "C")

val allocate = udf((p: Double) => {
  if (p < 0.5) names(0)
  else if (p > 0.8) names(1)
  else names(2)
})

val test = sqlContext.range(0, 100).select(
  $"id",
  round(abs(rand), 2).alias("val"),
  allocate(abs(rand)).alias("name")
)
However, when I print the result, the names are not assigned according to the rules defined in the UDF:
+---+----+----+
| id| val|name|
+---+----+----+
| 0|0.17| C| => should be A
| 1|0.12| A|
| 2|0.36| A|
| 3|0.56| B|
| 4|0.82| A|=> should be C
There is nothing unexpected going on here. You call the rand function twice, so val and name are computed from two different random values.
Either provide the same seed for both calls:
sqlContext.range(0, 100)
  .select(
    $"id",
    abs(rand(1)).alias("val"),
    allocate(abs(rand(1))).alias("name")
  )
or reuse the value:
sqlContext.range(0, 100)
  .withColumn("val", abs(rand))
  .withColumn("name", allocate($"val"))