applying a function to graph data using mapReduceTriplets in spark and graphx - scala

I'm having some problems applying mapReduceTriplets to my graph network in Spark using GraphX.
I've been following the tutorials and have read in my own data, which is put together as (Array[String], Int); for example my vertices are:
org.apache.spark.graphx.VertexRDD[Array[String]] e.g. (3999,Array(17, Low, 9))
And my edges are:
org.apache.spark.graphx.EdgeRDD[Int]
e.g. Edge(3999,4500,1)
I'm trying to apply an aggregate-type function using mapReduceTriplets which counts, for each vertex, how many of its connected vertices have a first integer (17 in the example above) that is the same as, or different from, the last integer in the vertex's own array (9 in the example above). So, for instance, a neighbour whose array starts with 9 would count as a match for vertex 3999, and one starting with anything else as a non-match.
You would end up with a list of counts for the number of matches and non-matches.
The problem I am having is applying any function at all using mapReduceTriplets. I am quite new to Scala, so this may be really obvious, but the GraphX tutorial example uses a graph of type Graph[Double, Int], whereas my graph is of type Graph[Array[String], Int]. So, as a first step, I'm just trying to figure out how I can use my graph in the example and then work from there.
The example on the GraphX website is as follows:
val olderFollowers: VertexRDD[(Int, Double)] = graph.mapReduceTriplets[(Int, Double)](
  triplet => { // Map Function
    if (triplet.srcAttr > triplet.dstAttr) {
      // Send message to destination vertex containing counter and age
      Iterator((triplet.dstId, (1, triplet.srcAttr)))
    } else {
      // Don't send a message for this triplet
      Iterator.empty
    }
  },
  // Add counter and age
  (a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
)
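For reference, here is my rough attempt at adapting that example to my Array[String] attributes (the indices and the direction of the comparison are guesses on my part, so treat it as a sketch only):
val matchCounts: VertexRDD[(Int, Int)] = graph.mapReduceTriplets[(Int, Int)](
  triplet => {
    // Sketch only: assumes every attribute array is non-empty.
    // Send a (match, nonMatch) message in both directions.
    val srcMsg = if (triplet.srcAttr.last == triplet.dstAttr.head) (1, 0) else (0, 1)
    val dstMsg = if (triplet.dstAttr.last == triplet.srcAttr.head) (1, 0) else (0, 1)
    Iterator((triplet.srcId, srcMsg), (triplet.dstId, dstMsg))
  },
  (a, b) => (a._1 + b._1, a._2 + b._2) // add up matches and non-matches
)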
Any advice would be most appreciated, or if you think there is a better way than using mapReduceTriplets I would be happy to hear it.
Edit: new code
val nodes = sc.textFile("C~nodeData.csv")
  .map(line => line.split(","))
  .map(parts => (parts.head.toLong, parts.tail))

val edges = GraphLoader.edgeListFile(sc, "C:~edges.txt")

val graph = edges.outerJoinVertices(nodes) {
  case (uid, deg, Some(attrList)) => attrList
  case (uid, deg, None) => Array.empty[String]
}

val countsRdd = graph.collectNeighbors(EdgeDirection.Either).leftOuterJoin(graph.vertices).map {
  case (id, (neighbors, attrOpt)) =>
    // leftOuterJoin yields an Option for the vertex attribute, so unwrap it first
    val nodeAttr = attrOpt.getOrElse(Array.empty[String])
    // count neighbours whose last element equals this vertex's first element
    neighbors.map(_._2).count(x => x.nonEmpty && nodeAttr.nonEmpty && x(x.size - 1) == nodeAttr(0))
}

I think you want to use GraphOps.collectNeighbors instead of either mapReduceTriplets or aggregateMessages.
collectNeighbors will give you an RDD with, for every VertexId in your graph, the connected nodes as an array. Just reduce the Array based on your needs. Something like:
val countsRdd = graph.collectNeighbors(EdgeDirection.Either)
  .join(graph.vertices)
  .map { case (vid, t) =>
    val neighbors = t._1
    val nodeAttr = t._2
    neighbors.map(_._2).filter( <add logic here> ).size
  }
If this doesn't get you going in the right direction, or you get stuck, let me know (the "<add logic here>" part, for example).
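For the specific comparison in the question (each vertex's last array entry against the first entry of every neighbour), the filter could look something like the sketch below; I'm assuming the attribute arrays are never empty and that you want separate match / non-match counts:
val matchCounts = graph.collectNeighbors(EdgeDirection.Either)
  .join(graph.vertices)
  .map { case (vid, (neighbors, attrs)) =>
    // Sketch only: per vertex, count neighbours whose first entry equals this vertex's last entry.
    val last = attrs.last                  // e.g. "9" in Array(17, Low, 9)
    val firsts = neighbors.map(_._2.head)  // first entry of each neighbour's array
    val matches = firsts.count(_ == last)
    (vid, (matches, firsts.length - matches)) // (same, different)
  }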

Related

What is the proper way to compute graph diameter in GraphX

I'm implementing an algorithm on GraphX for which I need to also compute the diameter of some relatively small graphs.
The problem is that GraphX doesn't have any notion of undirected graphs, so when using the built-in ShortestPaths method it obviously computes the shortest directed paths. That doesn't help much in computing a graph's diameter (the longest shortest undirected path between any pair of nodes).
I thought of duplicating the edges of my graph (so I would have 2|E| edges instead of |E|), but it didn't feel like the right way to do it (see the sketch after my code below). So, is there a better way to do this in GraphX in particular?
Here is my code for a directed graph:
// computing the query diameter
def getDiameter(graph: Graph[String, Int]): Long = {
  // Get the ids of the graph's vertices
  val vIds = graph.vertices.collect.toList.map(_._1)
  // Compute the list of shortest paths for every vertex in the graph
  val shortestPaths = lib.ShortestPaths.run(graph, vIds).vertices.collect
  // Extract only the distance values from a list of tuples <VertexId, Map>,
  // where the map contains <dst vertex, shortest directed distance> entries
  val values = shortestPaths.map(element => element._2).map(element => element.values)
  // The diameter is the longest shortest distance between any pair of nodes in the graph
  val diameter = values.map(m => m.max).max
  diameter
}
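For completeness, this is roughly what I meant by duplicating the edges (a sketch only; getDiameter is my function above):
import org.apache.spark.graphx.{Edge, Graph}

// Add a reversed copy of every edge so ShortestPaths effectively sees an undirected graph.
val undirected: Graph[String, Int] = Graph(
  graph.vertices,
  graph.edges.union(graph.edges.map(e => Edge(e.dstId, e.srcId, e.attr)))
)
val diameter = getDiameter(undirected)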
GraphX actually has no notion of direction if you don't tell it to use one.
If you look at the inner workings of the ShortestPaths library you'll see that it uses Pregel and that the direction is the default (EdgeDirection.Either). This means that for all triplets it will add both source and destination to the active set.
However, if in the sendMsg function of Pregel you only keep the srcId in the active set (as happens in the ShortestPaths lib), certain vertices (those with only outgoing edges) will not be re-evaluated.
Anyway, a solution is to write your own Diameter object/library, maybe looking like this (heavily based on ShortestPaths, so maybe there are even better solutions?):
import scala.reflect.ClassTag
import org.apache.spark.graphx._

object Diameter extends Serializable {
  type SPMap = Map[VertexId, Int]

  def makeMap(x: (VertexId, Int)*) = Map(x: _*)

  def incrementMap(spmap: SPMap): SPMap = spmap.map { case (v, d) => v -> (d + 1) }

  def addMaps(spmap1: SPMap, spmap2: SPMap): SPMap = {
    (spmap1.keySet ++ spmap2.keySet).map {
      k => k -> math.min(spmap1.getOrElse(k, Int.MaxValue), spmap2.getOrElse(k, Int.MaxValue))
    }(collection.breakOut) // more efficient alternative to [[collection.Traversable.toMap]]
  }

  // Removed landmarks, since all paths have to be taken into consideration
  def run[VD, ED: ClassTag](graph: Graph[VD, ED]): Int = {
    val spGraph = graph.mapVertices { (vid, _) => makeMap(vid -> 0) }
    val initialMessage: SPMap = makeMap()

    def vertexProgram(id: VertexId, attr: SPMap, msg: SPMap): SPMap = {
      addMaps(attr, msg)
    }

    def sendMessage(edge: EdgeTriplet[SPMap, _]): Iterator[(VertexId, SPMap)] = {
      // added the concept of updating the srcMap based on the dstMap + 1, and vice versa
      val newSrcAttr = incrementMap(edge.dstAttr)
      val newDstAttr = incrementMap(edge.srcAttr)
      List(
        if (edge.srcAttr != addMaps(newSrcAttr, edge.srcAttr)) Some((edge.srcId, newSrcAttr)) else None,
        if (edge.dstAttr != addMaps(newDstAttr, edge.dstAttr)) Some((edge.dstId, newDstAttr)) else None
      ).flatten.toIterator
    }

    val pregel = Pregel(spGraph, initialMessage)(vertexProgram, sendMessage, addMaps)
    // each vertex ends up with a map of its shortest distances to every other vertex;
    // the diameter is the largest of those distances over all vertices
    pregel.vertices.map(_._2.values.max).max
  }
}
val diameter = Diameter.run(graph)
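As a quick sanity check (my own toy example, not part of the original answer): a path graph 1-2-3-4 stored with directed edges only should still report a diameter of 3.
import org.apache.spark.graphx.{Edge, Graph}

// Toy path graph 1 -> 2 -> 3 -> 4; its undirected diameter is 3.
val testEdges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 4L, 1)))
val testGraph = Graph.fromEdges(testEdges, defaultValue = 0)
println(Diameter.run(testGraph)) // expected: 3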

GraphX - How to get all connected vertices from vertexId (not just the firsts adjacents)?

Considering this graph:
[Example graph image]
How can I get all connected vertices from a vertexID?
For example, from VertexId 5, it should return 5-3-7-8-10
collectNeighbors only returns the immediate neighbours.
I'm trying to use Pregel, but I don't know how to start from a specific vertex; I don't want to compute this for all the nodes.
Thanks!
I just noticed that the graph is directed, so you can use the code from the shortest-path example here: if the distance to a specific node is not infinity, then you can reach that node.
Or, better still, you can modify the shortest-path algorithm to fit your needs:
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

// A graph with edge attributes containing distances
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42 // The ultimate source

// Initialize the graph such that all vertices except the root have canReach = false.
val initialGraph: Graph[Boolean, Double] = graph.mapVertices((id, _) => id == sourceId)

val sssp = initialGraph.pregel(false)(
  (id, canReach, newCanReach) => canReach || newCanReach, // Vertex Program
  triplet => { // Send Message
    if (triplet.srcAttr && !triplet.dstAttr) {
      Iterator((triplet.dstId, true))
    } else {
      Iterator.empty
    }
  },
  (a, b) => a || b // Merge Message
)
println(sssp.vertices.collect.mkString("\n"))
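If you only need the list of reachable vertex ids (as in the 5-3-7-8-10 example from the question), you can filter on the resulting flag; a small follow-up sketch:
// Collect the ids of every vertex whose flag ended up true, i.e. reachable from sourceId.
val reachable = sssp.vertices.filter { case (_, canReach) => canReach }.keys.collect()
println(reachable.sorted.mkString(", "))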

Spark Scala - Split Map, Getkey and so on

I have a text file which contains the following:
A>B,C,D
B>A,C,D,E
C>A,B,D,E
D>A,B,C,E
E>B,C,D
I would like to write a Spark/Scala script to obtain the following (for each left member, we output all right members):
(A,B)
(A,C)
(A,D)
(B,A)
(B,C)
(B,D)
(B,E)
...
I tried to iterate through the map and get the keys to feed a new map with my results, but it did not work.
Here is my code (more like pseudo-code):
import scala.io.Source

// Loading file
val file = sc.textFile("friends.txt")

// MAP
// A;B
// A;C
// ...
var associations_persons_friends: Map[Char, Char] = Map()
var lines = file.map(line => line.split(">"))
for (line <- lines) {
  val person = line.key
  for (friend <- line.value.split(",")) {
    associations_persons_friends += (person -> friend)
  }
}
associations_persons_friends.collect()

val rdd = sc.parallelize(associations_persons_friends)
rdd.foreach(println)
// GROUP
// For each possible pair, all associated values
// AB;B-C-D-A-C-D-E
// REDUCE
// For each pair we keep occurrences >= 2
// AB;C-D
I wonder if it is possible to write basic code like this in Spark/Scala, because I can't find any answer to my needs on the web.
Thanks for your help.
You can achieve your requirement with a combination of map and flatMap:
val rdd = sc.textFile("path to the text file")
rdd.map(line => line.split(">"))
  .flatMap(array => array(1).split(",").map(arr => (array(0), arr)))
  .foreach(println)
You should get output like:
(A,B)
(A,C)
(A,D)
(B,A)
(B,C)
(B,D)
(B,E)
(C,A)
(C,B)
(C,D)
(C,E)
(D,A)
(D,B)
(D,C)
(D,E)
(E,B)
(E,C)
(E,D)
I hope the answer is helpful
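The GROUP / REDUCE steps sketched in the question's comments (keep, for each pair of people, the friends that appear in both lists, e.g. "AB;C-D") are not covered above; one possible continuation, assuming that is what is meant, could be:
// Sketch only: common friends per unordered pair of people.
val pairs = rdd.map(line => line.split(">"))
  .flatMap(array => array(1).split(",").map(friend => (array(0), friend)))
val friendsByPerson = pairs.groupByKey().mapValues(_.toSet)   // (person, set of friends)
val commonFriends = friendsByPerson.cartesian(friendsByPerson)
  .filter { case ((p1, _), (p2, _)) => p1 < p2 }              // keep each pair only once
  .map { case ((p1, f1), (p2, f2)) => (p1 + p2, f1 & f2) }    // e.g. ("AB", Set(C, D))
commonFriends.foreach(println)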

Compute an average in a RDD and then filter this RDD based on the average in Spark Streaming

I want to do something in Spark Streaming that feels a bit strange to me, and I'd like to get some feedback on it.
I have a DStream of tuples (String, Int). Let's say the String is an id and the Int a value.
For each microbatch, I want to compute the average of the Int field and, based on this average, filter that same microbatch, for example keeping only the elements whose value is greater than the average. So I wrote this code:
lineStreams
  .foreachRDD(rdd => {
    val totalElement = rdd.count()
    if (totalElement > 0) {
      val totalSum = rdd.map(elem => elem.apply(1).toInt).reduce(_ + _)
      val average = totalSum / totalElement
      rdd.foreach(elem => {
        if (elem.apply(1).toInt > average) {
          println("Element is higher than average")
        }
      })
    }
  })
But this code doesn't actually work: the first part of the computation looks OK, but not the test.
I know there are some dirty things in this code, but I just want to know if the logic is good.
Thanks for your advice!
Try:
lineStreams.transform { rdd =>
  val mean = rdd.values.map(_.toDouble).mean
  rdd.filter(_._2.toDouble > mean)
}
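Note that transform returns a new DStream, so the filtered microbatches can then be consumed downstream; a small usage sketch (the output action is chosen arbitrarily here):
// Keep only the elements above their microbatch's mean, then print them.
val aboveAverage = lineStreams.transform { rdd =>
  val mean = rdd.values.map(_.toDouble).mean
  rdd.filter(_._2.toDouble > mean)
}
aboveAverage.print()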

too many map keys causing out of memory exception in spark

I have an RDD 'inRDD' of the form RDD[(Vector[(Int, Byte)], Vector[(Int, Byte)])], i.e. a pair RDD (key, value) where the key is a Vector[(Int, Byte)] and the value is a Vector[(Int, Byte)].
For each element (Int, Byte) in the key's vector and each element (Int, Byte) in the value's vector, I would like to get a new (key, value) pair in the output RDD, of the form ((Int, Int), (Byte, Byte)).
That should give me an RDD of the form RDD[((Int, Int), (Byte, Byte))].
For example, inRDD contents could be like,
(Vector((3,2)),Vector((4,2))), (Vector((2,3), (3,3)),Vector((3,1))), (Vector((1,3)),Vector((2,1))), (Vector((1,2)),Vector((2,2), (1,2)))
which would become
((3,4),(2,2)), ((2,3),(3,1)), ((3,3),(3,1)), ((1,2),(3,1)), ((1,2),(2,2)), ((1,1),(2,2))
I have the following code for that.
val outRDD = inRDD.flatMap {
  case (left, right) =>
    for ((ll, li) <- left; (rl, ri) <- right) yield {
      (ll, rl) -> (li, ri)
    }
}
It works when the vectors in inRDD are small. But when there are a lot of elements in the vectors, I get an out-of-memory exception. Increasing the memory available to Spark only helps for somewhat larger inputs, and the error appears again for inputs larger still.
It looks like I am trying to assemble a huge structure in memory, but I am unable to rewrite this code in any other way.
I have implemented similar logic in Java on Hadoop as follows:
for (String fromValue : fromAssetVals) {
  fromEntity = fromValue.split(":")[0];
  fromAttr = fromValue.split(":")[1];
  for (String toValue : toAssetVals) {
    toEntity = toValue.split(":")[0];
    toAttr = toValue.split(":")[1];
    oKey = new Text(fromEntity.trim() + ":" + toEntity.trim());
    oValue = new Text(fromAttr + ":" + toAttr);
    outputCollector.collect(oKey, oValue);
  }
}
But when I try something similar in Spark, I get nested RDD exceptions.
How do I do this efficiently in Spark with Scala?
Well, if a Cartesian product is the only option, you can at least make it a little bit lazier:
inRDD.flatMap { case (xs, ys) =>
xs.toIterator.flatMap(x => ys.toIterator.map(y => (x, y)))
}
You can also handle this at the Spark level:
import org.apache.spark.RangePartitioner
val indexed = inRDD.zipWithUniqueId.map(_.swap)
val partitioner = new RangePartitioner(indexed.partitions.size, indexed)
val partitioned = indexed.partitionBy(partitioner)
val lefts = partitioned.flatMapValues(_._1)
val rights = partitioned.flatMapValues(_._2)
lefts.join(rights).values
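One caveat (my addition): both variants above yield pairs of the form ((Int, Byte), (Int, Byte)); to get the ((Int, Int), (Byte, Byte)) layout asked for in the question, a final map is still needed, for example:
// Reshape ((ll, li), (rl, ri)) into the requested ((ll, rl), (li, ri)) layout.
val outRDD = lefts.join(rights).values.map {
  case ((ll, li), (rl, ri)) => ((ll, rl), (li, ri))
}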