What is the proper way to compute graph diameter in GraphX - scala

I'm implementing an algorithm on GraphX for which I need to also compute the diameter of some relatively small graphs.
The problem is that GraphX doesn't have any notion of undirected graphs, so when using the built-in method from ShortestPaths, it obviously gets the shortest directed paths. This doesn't help much in computing a graph diameter (the longest shortest undirected path between any pair of nodes).
I thought of duplicating the edges of my graph (instead of |E| I would have 2|E| edges), but I didn't feel it was the right way to do it. So, is there a better way to do it, notably on GraphX?
Here is my code for a directed graph:
// computing the query diameter
def getDiameter(graph: Graph[String, Int]): Long = {
  // Get ids of vertices of the graph
  val vIds = graph.vertices.collect.toList.map(_._1)
  // Compute list of shortest paths for every vertex in the graph
  val shortestPaths = lib.ShortestPaths.run(graph, vIds).vertices.collect
  // extract only the distance values from a list of tuples <VertexId, Map> where map contains <key, value>: <dst vertex, shortest directed distance>
  val values = shortestPaths.map(element => element._2).map(element => element.values)
  // diameter is the longest shortest undirected distance between any pair of nodes in the graph
  val diameter = values.map(m => m.max).max
  diameter
}

GraphX actually has no notion of direction if you don't tell it to use one.
If you look at the inner workings of the ShortestPaths library, you'll see that it uses Pregel with the default direction (EdgeDirection.Either). This means that for all triplets it will add both source and destination to the active set.
However, if you specify in the sendMsg function of Pregel to keep only the srcId in the active set (as happens in the ShortestPaths lib), certain vertices (those with only outgoing edges) will not be re-evaluated.
Anyway, a solution is to write your own Diameter object/library, which might look like this (heavily based on ShortestPaths, so maybe there are even better solutions):
import org.apache.spark.graphx._
import scala.reflect.ClassTag
object Diameter extends Serializable {
  type SPMap = Map[VertexId, Int]
  def makeMap(x: (VertexId, Int)*) = Map(x: _*)
  def incrementMap(spmap: SPMap): SPMap = spmap.map { case (v, d) => v -> (d + 1) }
  // Merge two distance maps, keeping the smaller distance for each vertex
  def addMaps(spmap1: SPMap, spmap2: SPMap): SPMap = {
    (spmap1.keySet ++ spmap2.keySet).map {
      k => k -> math.min(spmap1.getOrElse(k, Int.MaxValue), spmap2.getOrElse(k, Int.MaxValue))
    }.toMap // collection.breakOut avoids the intermediate Set but was removed in Scala 2.13
  }
  // Removed landmarks, since all paths have to be taken in consideration
  def run[VD, ED: ClassTag](graph: Graph[VD, ED]): Int = {
    val spGraph = graph.mapVertices { (vid, _) => makeMap(vid -> 0) }
    val initialMessage: SPMap = makeMap()
    def vertexProgram(id: VertexId, attr: SPMap, msg: SPMap): SPMap = {
      addMaps(attr, msg)
    }
    def sendMessage(edge: EdgeTriplet[SPMap, _]): Iterator[(VertexId, SPMap)] = {
      // added the concept of updating the dstMap based on the srcMap + 1
      val newSrcAttr = incrementMap(edge.dstAttr)
      val newDstAttr = incrementMap(edge.srcAttr)
      List(
        if (edge.srcAttr != addMaps(newSrcAttr, edge.srcAttr)) Some((edge.srcId, newSrcAttr)) else None,
        if (edge.dstAttr != addMaps(newDstAttr, edge.dstAttr)) Some((edge.dstId, newDstAttr)) else None
      ).flatten.toIterator
    }
    val pregel = Pregel(spGraph, initialMessage)(vertexProgram, sendMessage, addMaps)
    // each vertex holds a map of its shortest distances to every vertex it can reach;
    // the first vertex alone would only give that vertex's eccentricity, so take the max over all vertices
    pregel.vertices.map(_._2.values.max).max()
  }
}
val diameter = Diameter.run(graph)
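Since the graphs in question are relatively small, a local cross-check is also possible once the edge list has been collected to the driver. The following is only a minimal sketch in plain Scala (no GraphX), assuming an unweighted, connected graph with at least one edge; the object name LocalDiameter and its helpers are made up for illustration. It treats every edge as undirected and takes the largest BFS distance over all start vertices:

```scala
import scala.collection.immutable.Queue

object LocalDiameter {
  // Build an undirected adjacency map from a directed edge list
  // by adding every edge in both directions.
  def undirectedAdjacency(edges: Seq[(Long, Long)]): Map[Long, Set[Long]] =
    edges
      .flatMap { case (src, dst) => Seq(src -> dst, dst -> src) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // Hop distances from `start` to every vertex reachable from it (plain BFS).
  def bfsDistances(adj: Map[Long, Set[Long]], start: Long): Map[Long, Int] = {
    @annotation.tailrec
    def loop(queue: Queue[Long], dist: Map[Long, Int]): Map[Long, Int] =
      queue.dequeueOption match {
        case None => dist
        case Some((v, rest)) =>
          // neighbours that have not been assigned a distance yet
          val unseen = adj.getOrElse(v, Set.empty[Long]).filterNot(dist.contains)
          val d = dist(v) + 1
          loop(rest ++ unseen, dist ++ unseen.map(_ -> d))
      }
    loop(Queue(start), Map(start -> 0))
  }

  // Diameter = largest eccentricity over all vertices.
  // Assumption: the graph is connected and has at least one edge.
  def diameter(edges: Seq[(Long, Long)]): Int = {
    val adj = undirectedAdjacency(edges)
    adj.keys.map(v => bfsDistances(adj, v).values.max).max
  }
}
```

With GraphX, the edge list could be obtained with something like graph.edges.map(e => (e.srcId, e.dstId)).collect().toSeq; this only makes sense while the graph fits comfortably in driver memory.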

Related

GraphX - How to get all connected vertices from vertexId (not just the firsts adjacents)?

Considering this graph:
[Image: example graph]
How can I get all connected vertices from a vertexID?
For example, from VertexId 5, it should return 5-3-7-8-10
collectNeighbors only returns the directly adjacent ones.
I'm trying to use Pregel, but I don't know how to start from a specific vertex. I don't want to calculate all the nodes.
Thanks!
I just noticed that the graph is directed. In that case you can use the code of the shortest path example here: if the distance of a specific node is not infinity, then you can reach this node.
Or, a better idea: you can modify the shortest path algorithm to satisfy your needs.
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators
// A graph with edge attributes containing distances
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42 // The ultimate source
// Initialize the graph such that all vertices except the root have canReach = false.
val initialGraph: Graph[Boolean, Double] = graph.mapVertices((id, _) => id == sourceId)
val sssp = initialGraph.pregel(false)(
  (id, canReach, newCanReach) => canReach || newCanReach, // Vertex Program
  triplet => { // Send Message
    if (triplet.srcAttr && !triplet.dstAttr) {
      Iterator((triplet.dstId, true))
    } else {
      Iterator.empty
    }
  },
  (a, b) => a || b // Merge Message
)
println(sssp.vertices.collect.mkString("\n"))
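To make the propagation idea easier to follow, here is a rough local equivalent in plain Scala, without Spark. It is only a sketch: the object name Reachability is made up, and the edge list used to try it out is hypothetical, since the graph from the question's image is not available. Each round extends the frontier along outgoing edges, much like the Pregel supersteps do:

```scala
object Reachability {
  // All vertices reachable from `source` by following edges src -> dst.
  def reachableFrom(edges: Seq[(Long, Long)], source: Long): Set[Long] = {
    // outgoing neighbours per vertex
    val out: Map[Long, Seq[Long]] =
      edges.groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    @annotation.tailrec
    def loop(frontier: Set[Long], seen: Set[Long]): Set[Long] =
      if (frontier.isEmpty) seen
      else {
        // newly discovered vertices only, so every vertex is expanded once
        val next = frontier.flatMap(v => out.getOrElse(v, Seq.empty[Long])) -- seen
        loop(next, seen ++ next)
      }

    loop(Set(source), Set(source))
  }
}
```

For a hypothetical edge list such as Seq((5L, 3L), (3L, 7L), (7L, 8L), (8L, 10L), (1L, 5L)), calling Reachability.reachableFrom(edges, 5L) would collect 5, 3, 7, 8 and 10 but not 1, because the edge 1 -> 5 points the wrong way.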

How to take different types of elements from Map

Here I got two hash sets:
var vertexes = new HashSet[String]()
var edges = new HashSet[RDFTriple]() //RDFTriple is a class
I want to put them into a map like this:
var graph = Map[String, HashSet[_]]()
graph.put("e", edges)
graph.put("v", vertexes)
But now I want to get vertexes and edges back out separately, and I failed. I have tried something like the following:
val a = graph.get("v")
a match {
case _ => val v = a
}
val b = graph.get("e")
b match {
case _ => val e = b
}
But v and e are recognized as Option[HashSet[_]], while what I want are HashSet[String] and HashSet[RDFTriple].
How can I do this?
I would appreciate it so much, because this has been bothering me for too long.
It is not recommended to use different types in the same Map; however, you could solve the problem by using Some and asInstanceOf like this:
val v = a match {
  case Some(a) => a.asInstanceOf[HashSet[String]]
  case None => // do something, e.g. fall back to an empty HashSet[String] or throw
}
Note that the assignment val v = ... is done outside the match to allow usage of the variable afterwards. The match for the edges is similar.
However, a better solution would be to use a case class for the graph. Then you would avoid a lot of hassle.
case class Graph(vertexes: HashSet[String], edges: HashSet[RDFTriple])
val graph = Graph(vertexes, edges)
val v = graph.vertexes // HashSet[String]
val e = graph.edges // HashSet[RDFTriple]
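For completeness, here is a self-contained sketch of both approaches. It assumes the HashSet in the question is scala.collection.mutable.HashSet, and RDFTriple below is a hypothetical stand-in for the asker's class; the names GraphParts, RdfGraph and vertexesOf are likewise made up. Note that the cast is unchecked because of type erasure, so a wrong element type would only fail later, when an element is used:

```scala
import scala.collection.mutable

object GraphParts {
  // Hypothetical stand-in for the asker's RDFTriple class.
  final case class RDFTriple(subject: String, predicate: String, obj: String)

  // Typed container: both sets keep their element types, so no cast is needed.
  final case class RdfGraph(
      vertexes: mutable.HashSet[String],
      edges: mutable.HashSet[RDFTriple])

  // Cast-based lookup for the Map[String, HashSet[_]] layout from the question.
  // A missing key falls back to an empty set so the result type stays HashSet[String].
  def vertexesOf(graph: mutable.Map[String, mutable.HashSet[_]]): mutable.HashSet[String] =
    graph.get("v") match {
      case Some(set) => set.asInstanceOf[mutable.HashSet[String]]
      case None      => mutable.HashSet.empty[String]
    }
}
```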

applying a function to graph data using mapReduceTriplets in spark and graphx

I'm having some problems applying the mapReduceTriplets to my graph network in spark using graphx.
I've been following the tutorials and read in my own data which is put together as [Array[String],Int], so for example my vertices are:
org.apache.spark.graphx.VertexRDD[Array[String]] e.g. (3999,Array(17, Low, 9))
And my edges are:
org.apache.spark.graphx.EdgeRDD[Int]
e.g. Edge(3999,4500,1)
I'm trying to apply an aggregate-type function using mapReduceTriplets which counts, for each vertex, how many of its connected vertices have a first integer (17 in the above example) that is the same as or different from the last integer in the vertex's own array (9 in the above example).
So you would end up with a list of counts for the number of matches or non-matches.
The problem I am having is applying any function using mapReduceTriplets. I am quite new to Scala, so this may be really obvious, but the example in the GraphX tutorials uses a graph with the format Graph[Double, Int], whereas my graph has the format Graph[Array[String],Int]. So, as a first step, I'm just trying to figure out how I can use my graph in the example and then work from there.
The example on the graphx website is as follows:
val olderFollowers: VertexRDD[(Int, Double)] = graph.mapReduceTriplets[(Int, Double)](
  triplet => { // Map Function
    if (triplet.srcAttr > triplet.dstAttr) {
      // Send message to destination vertex containing counter and age
      Iterator((triplet.dstId, (1, triplet.srcAttr)))
    } else {
      // Don't send a message for this triplet
      Iterator.empty
    }
  },
  // Add counter and age
  (a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
)
Any advice would be most appreciated, or if you think there is a better way than using mapReduceTriplets I would be happy to hear it.
Edited new code
val nodes = (sc.textFile("C~nodeData.csv")
.map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))
val edges = GraphLoader.edgeListFile(sc, "C:~edges.txt")
val graph = edges.outerJoinVertices(nodes) {
case (uid, deg, Some(attrList)) => attrList
case (uid, deg, None) => Array.empty[String]
}
val countsRdd = graph.collectNeighbors(EdgeDirection.Either).leftOuterJoin(graph.vertices).map {
case (id, t) => {
val neighbors: Array[(VertexId, Array[String])] = t._1
val nodeAttr = (t._2)
neighbors.map(_._2).count( x => x.apply(x.size - 1) == nodeAttr(0))
}
}
I think you want to use GraphOps.collectNeighbors instead of either mapReduceTriplets or aggregateMessages.
collectNeighbors will give you an RDD with, for every VertexId in your graph, the connected nodes as an array. Just reduce the Array based on your needs. Something like:
val countsRdd = graph.collectNeighbors(EdgeDirection.Either)
  .join(graph.vertices)
  .map { case (vid, t) =>
    val neighbors = t._1
    val nodeAttr = t._2
    neighbors.map(_._2).filter( <add logic here> ).size
  }
If this doesn't get you going in the right direction, or you get stuck, let me know (the "<add logic here>" part, for example).
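One way the missing logic could look, going by the description in the question (compare the last element of a vertex's own array with the first element of each neighbour's array): the sketch below is plain Scala so it can be tried without Spark. The object and method names are made up, and skipping vertices or neighbours with empty arrays is an assumption rather than something the question specifies.

```scala
object NeighborCounts {
  // Returns (matches, nonMatches): how many neighbours have a first attribute
  // equal / not equal to the last attribute of the vertex itself.
  // Assumption: neighbours with an empty attribute array are skipped, and a
  // vertex with an empty attribute array yields (0, 0).
  def countMatches(
      nodeAttr: Array[String],
      neighbors: Array[(Long, Array[String])]): (Int, Int) = {
    val comparable = neighbors.map(_._2).filter(_.nonEmpty)
    nodeAttr.lastOption match {
      case None => (0, 0)
      case Some(last) =>
        val matches = comparable.count(_.head == last)
        (matches, comparable.length - matches)
    }
  }
}
```

Inside the map over the joined RDD this might be used as .map { case (vid, (neighbors, nodeAttr)) => vid -> NeighborCounts.countMatches(nodeAttr, neighbors) }, since VertexId is an alias for Long.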

Recursive Types in Scala with a List

Similarly to mutually recursive types in scala I am trying to create a mutually recursive type in Scala.
I am trying to make a graph defined with this type (which does compile) :
case class Node(val id : Int, val edges : Set[Node])
But I don't understand how I can actually create something with this type, because in order to initialize Node A with edges B and C, I need to at least have a lazy reference to B and C, but I can't simultaneously create their edgesets.
Is it possible to implement this recursive type?
EDIT:
Here is the solution I am currently using to convert an explicit adjacency list to a self-referential one.
def mapToGraph(edgeMap : Map[Int, mutable.Set[Int]]) : List[Node] = {
lazy val nodeMap = edgeMap map (kv => (kv._1, new Node(kv._1, futures.get(kv._1).get)))
lazy val futures : Map[Int, Set[Node]] = edgeMap map (kv => {
val edges = (kv._2 map (e => nodeMap.get(e).get)).toSet
(kv._1, edges)
})
val eval = nodeMap.values.toList
eval //to force from lazy to real - don't really like doing this
}
or alternatively, from an edgelist
//reads an edgeList into a graph
def readEdgelist(filename : String) : List[Node] = {
lazy val nodes = new mutable.HashMap[Int, Node]()
lazy val edges = new mutable.HashMap[Int, mutable.Buffer[Node]]()
Source.fromFile(filename).getLines() filter (x => x != None) foreach {edgeStr =>
val edge = edgeStr.split('\t')
if (edge.size != 2) goodbye("Not a well-formed edge : " + edgeStr + " size: " + edge.size.toString)
val src = edge(0).toInt
val des = edge(1).toInt
if (!(nodes.contains(src))) nodes.put(src, new Node(src, futures.get(src).get))
if (!(nodes.contains(des))) nodes.put(des, new Node(des, futures.get(des).get))
edges.put(src, edges.getOrElse(src, mutable.Buffer[Node]()) += nodes.get(des).get)
}
lazy val futures : Map[Int, Set[Node]] = nodes map {node => (node._1, edges.getOrElse(node._1, mutable.Buffer[Node]()).toSet)} toMap
val eval = nodes.values.toList
eval
}
Thanks everyone for the advice!
Sounds like you need to work from the bottom up
val b = Node(1, Set.empty)
val c = Node(2, Set.empty)
val a = Node(3, Set(b, c))
Hope that helps
Chicken & egg... you have three options:
Restrict your graphs to Directed Acyclic Graphs (DAGs) and use RKumsher's suggestion.
To retain immutability, you'll need to separate your node instances from your edge sets (two different classes: create nodes, then create edge sets/graph).
If you prefer the tight correlation, consider using a setter for the edge sets so that you can come back and set them later, after all the nodes are created.
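The last option (create all nodes first, wire the edges afterwards) could be sketched roughly as follows. This is an illustration under assumptions, not a drop-in replacement for the case class in the question: Node is a plain class here so that equality and hashCode stay reference-based, because a case class containing a cyclic Set[Node] would recurse endlessly when hashed. The names CyclicGraph, connect and fromAdjacency are made up.

```scala
object CyclicGraph {
  // Plain class (not a case class): equality and hashCode are reference-based,
  // so cycles are harmless when nodes are stored in a Set.
  final class Node(val id: Int) {
    private var _edges: Set[Node] = Set.empty
    def edges: Set[Node] = _edges
    // "Setter" used after all nodes exist.
    def connect(targets: Node*): Unit = _edges ++= targets
  }

  // Two passes: create every node first, then wire the edges.
  def fromAdjacency(adjacency: Map[Int, Set[Int]]): Map[Int, Node] = {
    val ids = adjacency.keySet ++ adjacency.values.flatten
    val nodes = ids.map(id => id -> new Node(id)).toMap
    for ((src, dsts) <- adjacency; dst <- dsts) nodes(src).connect(nodes(dst))
    nodes
  }
}
```

The price is a small amount of mutability inside Node; once fromAdjacency has returned, callers only see the read-only edges accessor unless they call connect themselves.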

Alternative to List.sliding?

I'm trying to get more familiar with functional programming and I was wondering if there is a more elegant way to group a list into pairs of 2 and apply a function to those pairs.
case class Line(start: Vector, end: Vector) {
def isLeftOf(v: Vector) = (end - start).cross(v - start) < 0
}
case class Polygon(vertices: List[Vector]) {
def edges = (vertices.sliding(2).toList :+ List(vertices.last,vertices.head)).map(l => Line(l(0), l(1)))
def contains(v: Vector) = {
edges.map(_.isLeftOf(v)).forall(_ == true)
}
}
I'm talking about this line
def edges = (vertices.sliding(2).toList :+ List(vertices.last,vertices.head)).map(l => Line(l(0), l(1)))
Is there a better way to write this?
val edges = (vertices, vertices.tail :+ vertices.head).zipped map Line
See also these questions:
How do you turn a Scala list into pairs?
Most concise way to combine sequence elements
Well you can simplify a bit by doing this:
case class Polygon(vertices: List[Vector]) {
def edges = Line(vertices.last, vertices.head) :: vertices.sliding(2).map(l => Line(l(0), l(1))).toList
def contains(v: Vector) = edges.forall(_.isLeftOf(v))
}
I've done three things:
Pulled the last/head Line out so that it isn't part of the map
Moved toList to after map so that you map over an Iterator, saving you from constructing two Lists.
Simplified contains to simply call forall with the predicate.
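The zip-based idea can also be written as a small generic helper, shown here as a sketch with made-up names (CyclicPairs, cyclicPairs). It uses drop(1) and take(1) instead of tail and head so that an empty list gives an empty result rather than an exception, and it uses plain zip because, as far as I know, .zipped on tuples is deprecated since Scala 2.13 in favour of lazyZip:

```scala
object CyclicPairs {
  // Pairs each element with its successor, wrapping the last element
  // around to the first one.
  def cyclicPairs[A](xs: List[A]): List[(A, A)] =
    xs.zip(xs.drop(1) ++ xs.take(1))
}
```

In the Polygon class the edges could then be built with CyclicPairs.cyclicPairs(vertices).map { case (a, b) => Line(a, b) }. Unlike the sliding(2) versions, a single-vertex list does not throw here; it produces one degenerate pair of the vertex with itself.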