GraphX - How to get all connected vertices from a vertexId (not just the first adjacent ones)? - scala

Considering this graph:
[example graph image]
How can I get all connected vertices from a vertexId?
For example, from VertexId 5 it should return 5-3-7-8-10.
collectNeighbors only returns the immediate neighbors.
I'm trying to use Pregel, but I don't know how to start from a specific vertex; I don't want to run the computation over all the nodes.
Thanks!

I just noticed that the graph is directed. In that case you can use the code from the shortest-path example here: if the distance to a specific node is not infinity, then you can reach that node.
Better still, you can modify the shortest-path algorithm to fit your needs:
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

// A graph with edge attributes containing distances
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42 // The ultimate source

// Initialize the graph such that all vertices except the root have canReach = false.
val initialGraph: Graph[Boolean, Double] = graph.mapVertices((id, _) => id == sourceId)

val sssp = initialGraph.pregel(false)(
  (id, canReach, newCanReach) => canReach || newCanReach, // Vertex Program
  triplet => { // Send Message
    if (triplet.srcAttr && !triplet.dstAttr) {
      Iterator((triplet.dstId, true))
    } else {
      Iterator.empty
    }
  },
  (a, b) => a || b // Merge Message
)

println(sssp.vertices.collect.mkString("\n"))
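To get just the list of reachable vertex ids out of that result, a small follow-up not in the original answer (reusing the sssp value from above):

// Keep only the vertices whose flag ended up true, i.e. the ones reachable from sourceId.
val reachableIds = sssp.vertices.filter { case (_, canReach) => canReach }.map(_._1).collect()
println(reachableIds.mkString(", "))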

Related

How to get node type in graph-scala

I am generating a graph using the graph-scala library and I need to set the coordinates after building the graph.
My objects are Ball and Figure, both extending GraphNode, and I generate my Graph using the GraphNode objects:
val ball1 = new Ball(1, "BALL-A")
val figure1 = new Figure(1)
val figure2 = new Figure(2)
val figure3 = new Figure(3)

val edges = Seq(
  (ball1, figure1),
  (figure2, ball1),
  (ball1, figure3)
)

val graph1: Graph[GraphNode, HyperEdge] = edges
  .map { case (node1, node2) =>
    Graph[GraphNode, HyperEdge](node1 ~> node2)
  }
  .reduce(_ ++ _)
And now I want to set X, Y and Width properties for each node:
graph1.nodes
  .map(node => {
    node match {
      case b: Ball =>
        println("is a ball!")
        if (b.nodeType.equals("BALL-A"))
          b.copy(x = 0, y = 0, width = 100)
        else
          b.copy(x = 30, y = 30, width = 200)
      case otherType =>
        val name = otherType.getClass.getSimpleName
        println(name)
    }
    node.toJson
  })
  .foreach(println)
But I get the type "NodeBase" instead of my own types, so the node is never set. Any suggestions for setting properties once the graph is built? My underlying issue is getting the concrete type of each node so I can set its properties, but I am not able to.
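A likely cause, assuming "graph-scala" here is the scala-graph (scalax.collection) library: graph1.nodes iterates over the library's inner node wrappers rather than over your own GraphNode values, which is why the match only ever sees NodeBase. A sketch of matching on the unwrapped value instead (toOuter is the inner-to-outer accessor in recent scala-graph versions; older versions expose it as value):

graph1.nodes
  .map(node => node.toOuter match { // unwrap the inner node to the original GraphNode
    case b: Ball =>
      if (b.nodeType.equals("BALL-A")) b.copy(x = 0, y = 0, width = 100)
      else b.copy(x = 30, y = 30, width = 200)
    case other => other // leave non-Ball nodes unchanged
  })
  .foreach(println)

Bear in mind that copy returns a new object rather than mutating the node stored in the graph, so to keep the coordinates you would either rebuild the graph from the updated nodes or make x, y and width mutable fields.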

What is the proper way to compute graph diameter in GraphX

I'm implementing an algorithm on GraphX for which I also need to compute the diameter of some relatively small graphs.
The problem is that GraphX doesn't have any notion of undirected graphs, so the built-in ShortestPaths method obviously computes the shortest directed paths. That doesn't help much in computing a graph diameter (the longest shortest undirected path between any pair of nodes).
I thought of duplicating the edges of my graph (instead of |E| I would have 2|E| edges), but that didn't feel like the right way to do it. So, is there a better way to do it on GraphX in particular?
Here is my code for a directed graph:
// computing the query diameter
def getDiameter(graph: Graph[String, Int]): Long = {
  // Get the ids of the graph's vertices
  val vIds = graph.vertices.collect.toList.map(_._1)
  // Compute the list of shortest paths for every vertex in the graph
  val shortestPaths = lib.ShortestPaths.run(graph, vIds).vertices.collect
  // Extract only the distance values from a list of tuples <VertexId, Map>,
  // where each map contains <dst vertex, shortest directed distance> entries
  val values = shortestPaths.map(element => element._2).map(element => element.values)
  // The diameter is the longest shortest distance between any pair of nodes in the graph
  val diameter = values.map(m => m.max).max
  diameter
}
GraphX actually has no notion of direction if you don't tell it to use it.
If you look at the inner workings of the ShortestPaths library you'll see that it uses Pregel with the default direction (EdgeDirection.Either). This means that for all triplets it will add both source and destination to the active set.
However, if you specify in Pregel's sendMsg function to only keep the srcId in the active set (as happens in the ShortestPaths lib), certain vertices (those with only outgoing edges) will not be re-evaluated.
Anyway, a solution is to write your own Diameter object/library, maybe looking like this (heavily based on ShortestPaths, so maybe there are even better solutions?):
import scala.reflect.ClassTag
import org.apache.spark.graphx._

object Diameter extends Serializable {
  type SPMap = Map[VertexId, Int]

  def makeMap(x: (VertexId, Int)*) = Map(x: _*)

  def incrementMap(spmap: SPMap): SPMap = spmap.map { case (v, d) => v -> (d + 1) }

  def addMaps(spmap1: SPMap, spmap2: SPMap): SPMap = {
    (spmap1.keySet ++ spmap2.keySet).map {
      k => k -> math.min(spmap1.getOrElse(k, Int.MaxValue), spmap2.getOrElse(k, Int.MaxValue))
    }(collection.breakOut) // more efficient alternative to [[collection.Traversable.toMap]]
  }

  // Removed landmarks, since all paths have to be taken into consideration
  def run[VD, ED: ClassTag](graph: Graph[VD, ED]): Int = {
    val spGraph = graph.mapVertices { (vid, _) => makeMap(vid -> 0) }
    val initialMessage: SPMap = makeMap()

    def vertexProgram(id: VertexId, attr: SPMap, msg: SPMap): SPMap = {
      addMaps(attr, msg)
    }

    def sendMessage(edge: EdgeTriplet[SPMap, _]): Iterator[(VertexId, SPMap)] = {
      // added updating the srcMap based on the dstMap + 1 (and vice versa),
      // so messages flow in both directions regardless of edge orientation
      val newSrcAttr = incrementMap(edge.dstAttr)
      val newDstAttr = incrementMap(edge.srcAttr)
      List(
        if (edge.srcAttr != addMaps(newSrcAttr, edge.srcAttr)) Some((edge.srcId, newSrcAttr)) else None,
        if (edge.dstAttr != addMaps(newDstAttr, edge.dstAttr)) Some((edge.dstId, newDstAttr)) else None
      ).flatten.toIterator
    }

    val pregel = Pregel(spGraph, initialMessage)(vertexProgram, sendMessage, addMaps)
    // each vertex will contain a map with all shortest paths, so just take the first
    pregel.vertices.first()._2.values.max
  }
}
val diameter = Diameter.run(graph)
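As for the edge-duplication idea from the question: it also works, and can be combined with either the built-in ShortestPaths or the Diameter object above. A minimal sketch (not from the original answer), assuming the same Graph[String, Int] as in the question:

import org.apache.spark.graphx._

// Add a reversed twin of every edge so directed traversals behave as undirected ones.
val undirected: Graph[String, Int] = Graph(
  graph.vertices,
  graph.edges.union(graph.edges.map(e => Edge(e.dstId, e.srcId, e.attr)))
)

Also note that reading the shortest-path map of a single vertex (pregel.vertices.first()) only gives the true diameter when the graph is connected; unreachable vertices are simply absent from the map.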

How to take different types of elements from a Map

Here I have two hash sets:
var vertexes = new HashSet[String]()
var edges = new HashSet[RDFTriple]() //RDFTriple is a class
I want to put them into a map like this:
var graph = Map[String, HashSet[_]]()
graph.put("e", edges)
graph.put("v", vertexes)
But now I want to get vertexes and edges back out, and that fails. I have tried something like the following:
val a = graph.get("v")
a match {
case _ => val v = a
}
val b = graph.get("e")
b match {
case _ => val e = b
}
But v and e are typed as Option[HashSet[_]], while what I want are HashSet[String] and HashSet[RDFTriple].
How can I do this?
I would appreciate any help; this has bothered me for too long.
It is not recommended to mix different value types in the same Map; however, you could solve the problem by matching on Some and using asInstanceOf, like this:
val v = a match {
  case Some(a) => a.asInstanceOf[HashSet[String]]
  case None => // do something
}
Note that the assignment val v = ... is done outside the match to allow usage of the variable afterwards. The match for the edges is similar.
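Spelled out for the edges, with the same pattern (a sketch; what to do in the None branch is up to you):

val e = b match {
  case Some(set) => set.asInstanceOf[HashSet[RDFTriple]]
  case None => HashSet.empty[RDFTriple] // or report the missing key
}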
However, a better solution would be to use a case class for the graph. Then you would avoid a lot of hassle.
case class Graph(vertexes: HashSet[String], edges: HashSet[RDFTriple])
val graph = Graph(vertexes, edges)
val v = graph.vertexes // HashSet[String]
val e = graph.edges // HashSet[RDFTriple]

How to execute Pregel Shortest Path making each vertex the source vertex once on a Spark Cluster

We have an assignment to find shortest paths using the Pregel API on a graph of 300,000 (3 lakh) vertices. We are supposed to make each vertex the source vertex once and identify the shortest paths across all these executions. My code looks like this:
def shortestPath(sc: SparkContext, mainGraph: Graph[(String, String, Double), Double], singleSourceVertexFlag: Boolean) {
  var noOfIterations = mainGraph.vertices.count();
  // If the single-source-vertex flag is true, run one iteration only
  if (singleSourceVertexFlag) {
    noOfIterations = 1
  } else { // else loop through the complete list of vertices
    noOfIterations = mainGraph.vertices.count()
  }

  for (i <- 0 to (noOfIterations.toInt - 1)) {
    val sourceId: VertexId = i
    val modGraph = mainGraph.mapVertices((id, attr) =>
      if (id == sourceId) (0.0)
      else (Double.PositiveInfinity))

    val loopItrCount = modGraph.vertices.count().toInt;
    val sssp = modGraph.pregel(Double.PositiveInfinity, loopItrCount, EdgeDirection.Out)(
      (id, dist, newDist) =>
        if (dist < newDist) dist
        else newDist, // Vertex Program
      triplet => { // Send Message
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      (a, b) =>
        if (a < b) a // Merge Message
        else b)

    sssp.unpersist(true)
    modGraph.unpersist(true)
    println("****Shortest Path End**** SourceId" + sourceId)
  }
}
From this code I have to read the shortest paths from each loop and, from those, identify the minimum value as the final output (that is the future part; I have yet to code it).
This current code works fine for a 15-node graph and also for a 1112-node graph. But when I try to execute the algorithm on a 22k-node graph, it runs for 55 source nodes and then stops with an out-of-memory error. We have a two-node cluster (node 1: 64 GB RAM, node 2: 32 GB RAM).
My questions are:
1. How are for loops treated on a Spark cluster? Is there anything I should modify in the code so that it is better optimized?
2. I am trying to use unpersist so that on each loop the RDDs are cleared and new ones are created, but I still get out of memory after it executes for 55 nodes. What should be done to execute the same for all the nodes?
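An aside on question 2 (not part of the original post): sssp.unpersist(true) runs before any action has ever materialized sssp, so each Pregel result may later be recomputed from an ever-growing lineage, and the blocking unpersist also stalls the loop. A rough sketch of one common mitigation, materializing the per-source result before unpersisting:

// inside the loop, after pregel returns:
val reached = sssp.vertices.filter(_._2 < Double.PositiveInfinity).count() // force the computation first
println("****Shortest Path End**** SourceId " + sourceId + ", reached " + reached + " vertices")
sssp.unpersist(blocking = false)
modGraph.unpersist(blocking = false)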

Applying a function to graph data using mapReduceTriplets in Spark and GraphX

I'm having some problems applying mapReduceTriplets to my graph network in Spark using GraphX.
I've been following the tutorials and read in my own data, which is put together as [Array[String], Int]; for example, my vertices are:
org.apache.spark.graphx.VertexRDD[Array[String]] e.g. (3999, Array(17, Low, 9))
And my edges are:
org.apache.spark.graphx.EdgeRDD[Int] e.g. Edge(3999, 4500, 1)
I'm trying to apply an aggregate-type function using mapReduceTriplets which counts how often the last integer in a vertex's array (9 in the example above) is the same as or different from the first integer (17 in the example above) of all connected vertices.
So you would end up with a list of counts for the number of matches or non-matches.
The problem I am having is applying any function using mapReduceTriplets. I am quite new to Scala, so this may be really obvious, but the GraphX tutorial's example uses a graph of the format Graph[Double, Int], whereas my graph has the format Graph[Array[String], Int]. So as a first step I'm just trying to figure out how to use my own graph in the example and then work from there.
The example on the GraphX website is as follows:
val olderFollowers: VertexRDD[(Int, Double)] = graph.mapReduceTriplets[(Int, Double)](
  triplet => { // Map Function
    if (triplet.srcAttr > triplet.dstAttr) {
      // Send message to destination vertex containing counter and age
      Iterator((triplet.dstId, (1, triplet.srcAttr)))
    } else {
      // Don't send a message for this triplet
      Iterator.empty
    }
  },
  // Add counter and age
  (a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
)
Any advice would be most appreciated, or if you think there is a better way than using mapReduceTriplets I would be happy to hear it.
Edited new code:
val nodes = sc.textFile("C~nodeData.csv")
  .map(line => line.split(","))
  .map(parts => (parts.head.toLong, parts.tail))

val edges = GraphLoader.edgeListFile(sc, "C:~edges.txt")

val graph = edges.outerJoinVertices(nodes) {
  case (uid, deg, Some(attrList)) => attrList
  case (uid, deg, None) => Array.empty[String]
}

val countsRdd = graph.collectNeighbors(EdgeDirection.Either).leftOuterJoin(graph.vertices).map {
  case (id, t) => {
    val neighbors: Array[(VertexId, Array[String])] = t._1
    val nodeAttr = t._2.getOrElse(Array.empty[String]) // leftOuterJoin yields an Option
    neighbors.map(_._2).count(x => x.nonEmpty && nodeAttr.nonEmpty && x.last == nodeAttr(0))
  }
}
I think you want to use GraphOps.collectNeighbors instead of either mapReduceTriplets or aggregateMessages.
collectNeighbors will give you an RDD with, for every VertexId in your graph, the connected nodes as an array. Then just reduce the array based on your needs. Something like:
val countsRdd = graph.collectNeighbors(EdgeDirection.Either)
  .join(graph.vertices)
  .map { case (vid, t) => {
    val neighbors = t._1
    val nodeAttr = t._2
    neighbors.map(_._2).filter( <add logic here> ).size
  }}
If this doesn't get you going in the right direction, or you get stuck, let me know (the "<add logic here>" part, for example).
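For the specific count from the question, the filter could be filled in like this (a sketch that follows the edited code's orientation, comparing each neighbor's last array element against this vertex's first):

val countsRdd = graph.collectNeighbors(EdgeDirection.Either)
  .join(graph.vertices)
  .map { case (vid, (neighbors, nodeAttr)) =>
    // count neighbors whose last element equals this vertex's first element
    (vid, neighbors.map(_._2).count(n => n.nonEmpty && nodeAttr.nonEmpty && n.last == nodeAttr(0)))
  }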