How to convert a Map containing (vertexId,edgeId) into GraphX RDDs - scala

After parsing the graph from a file, I get a Map where the key represents a vertex (id) and the value represents an edge (id). To create the edges (Vx->Vy) we need to join the Map entries on their values (the edge id). The goal is to create a GraphX graph from this representation.
Here is what I have so far:
tempLHM.foreach(x=>println(x))
(A.L0,A)
(B.L0,B)
(C.L0,C)
(D.L0,D)
(E.L0,E)
(a.L0M1,A)
(b.L0M1,B)
(c.L0M1,n4)
(a.L0M2,n4)
(b.L0M2,D)
(c.L0M2,n5)
(a.L0M3,n5)
(b.L0M3,C)
(c.L0M3,E)
Is there a direct way to map this hashmap to vertex and edge RDDs?
tempLHM is a mutable LinkedHashMap[String,String]. In the above hashmap, in the elements (A.L0,A) and (a.L0M1,A), A.L0 and a.L0M1 are keys (vertices) that are joined by the common value A (the edge).
Here is what I want to derive:
val vertex: RDD[(vertexId, VertexName)]   i.e. ((A.L0).Long, A.L0), ((a.L0M1).Long, a.L0M1), etc.
val edge: RDD[((vertexId1, vertexId2), EdgeName)]   i.e. (((A.L0).Long, (a.L0M1).Long), A)

Assume you have this structure for your data.
val d = Map("v1" -> "e1", "v2" -> "e1", "v3" -> "e2", "v4" -> "e2")
Two edges here: ("v1","v2") and ("v3","v4").
This solution assumes a simple graph (not a hyper-graph where a single edge can connect more than two nodes): each edge connects exactly two nodes and appears just once.
import collection.mutable.{ HashMap, MultiMap, Set }
import java.security.MessageDigest
import org.apache.spark.graphx.Edge
import org.apache.spark.graphx.Graph
// A hacky way to go from String to Long since GraphX needs Longs to
// represent vertex IDs. You might want to do something different
// here to make sure that your IDs are unique.
def str2Long(s: String) = s.##.toLong
val d = Map("v1" -> "e1", "v2" -> "e1", "v3" -> "e2", "v4" -> "e2")
// We use a multi-map to create an inverse map (Edge->Set(Vertices))
val mm = new HashMap[String, Set[String]] with MultiMap[String, String]
d.foreach{ x => mm.addBinding(x._2,x._1) }
val edges = mm.map{ case(k,v) => Edge[String](str2Long(v.head),str2Long(v.last), k) }.toList
val vertices = d.keys.map(x => (str2Long(x), x)).toList
val edgeRdd = sc.parallelize(edges)
val vertexRdd = sc.parallelize(vertices)
val g = Graph(vertexRdd, edgeRdd)
If you print the edges and the vertices you get:
g.vertices.foreach(println)
g.edges.foreach(println)
(3709,v3)
(3707,v1)
(3708,v2)
(3710,v4)
Edge(3709,3710,e2)
Edge(3707,3708,e1)
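The str2Long above is based on hashCode, so different strings can collide. A minimal sketch of a less collision-prone mapping, using the MessageDigest import shown above (the MD5 choice and the helper name are my own, not part of the original answer):
import java.nio.ByteBuffer
import java.security.MessageDigest
// Take the first 8 of the 16 MD5 digest bytes as the vertex ID.
// Collisions are still theoretically possible, just far less likely than with hashCode.
def str2LongMd5(s: String): Long =
  ByteBuffer.wrap(MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"))).getLong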
Note: The solution here will only work for data that fits in the memory of a single node. From your question I see that you already load the data into a local Map, so the solution above would work for you. If you want to run this on a huge dataset across multiple nodes, it will not.
Updated Solution
This solution is more scalable than the one above. It makes sure that you always stay in the RDD domain without the need to collect the graph at the driver (for example, above we loaded all the raw data into a Scala Map, which we avoid here). It also covers the case where we have a common edge ID between different nodes (in a hyper-graph like way).
Let's assume that the text file has this format:
v1,e1
v2,e1
v3,e2
v4,e2
In the code below, we first read the raw data and then transform it into the proper vertex and edge RDDs.
import org.apache.spark.graphx.Edge
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD
def str2Long(s: String) = s.##.toLong
val rawData: RDD[String] = sc.textFile("...")
val toBeJoined: RDD[(String, String)]
= rawData.map(_.split(",")).map{ case Array(x,y) => (y,x) }
Note here that our resulting graph will be bidirectional: If we have edge (v1,v2) we also have edge (v2,v1).
val biDirectionalEdges: RDD[(String, (String, String))]
= toBeJoined.join(toBeJoined).filter{ case(e,(v1,v2)) => v1 != v2 }
val edgeRdd =
biDirectionalEdges.map{ case(e,v) => Edge[String](str2Long(v._1),str2Long(v._2), e) }
val vertexRdd =
toBeJoined.map(_._1).distinct.map(x => (str2Long(x), x))
val g = Graph(vertexRdd, edgeRdd)
// Verify that this is the right graph
g.vertices.take(10).foreach(println)
g.edges.take(10).foreach(println)
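Note that the self-join produces both directions of every edge (and, if more than two vertices share an edge ID, every ordered pair of them). If you only want each edge once, one possible sketch (my addition, not part of the original answer) keeps a single ordering per pair:
val singleDirectionEdgeRdd = biDirectionalEdges
  .filter { case (e, (v1, v2)) => v1 < v2 }   // keep one ordering per vertex pair
  .map { case (e, (v1, v2)) => Edge[String](str2Long(v1), str2Long(v2), e) }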

Related

Graph processing using GraphX in Apache Spark + big data

I have created a Graph using node and edge data in a Spark application. Now I want to come up with the adjacency list for the created graph. How can I achieve this?
I have written the following code to read CSV files for the node and edge data and to create the Graph.
val grapha = sc.textFile("graph.csv")
val getgdata = grapha.map(line=>line.split(","))
val node1 = getgdata.map(line=>(line(3).toLong,(line(0)))).distinct
val node2 = getgdata.map(line=>(line(4).toLong,(line(1)))).distinct
// This is node list of a graph.
val nodes = node1.union(node2).distinct
//This is edge list.
val routes = getgdata.map(line=>
(Edge(line(3).toLong,line(4).toLong,line(2)))).distinct
// now create graph using Graph library
val graphx = Graph(nodes,routes)
Now I need to see adjacency list for each node from this graph. How can I do it using scala?
Looking at your code, I am assuming that your graph.csv looks like the following:
node_1, node_2, node_1_relation_node_2, 1, 2
node_1, node_3, node_1_relation_node_3, 1, 3
node_2, node_3, node_2_relation_node_3, 2, 3
Now, you can read this into an RDD as follows:
val graphData = sc.textFile("graph.csv").map(line => line.split(",").map(_.trim))  // trim to drop the spaces after the commas
Now, to create your graph you need two things:
An RDD of vertices:
val verticesRdd = graphData.flatMap(line => List(
  (line(3).toLong, line(0)),
  (line(4).toLong, line(1))
)).distinct
An RDD of edges:
val edgesRdd = graphData.map(line => Edge(line(3).toLong, line(4).toLong, line(2))).distinct
Now, you can create your graph as,
val graph = Graph(verticesRdd, edgesRdd)
But if you just need the adjacency list, you can obtain it directly from graphData as follows:
val adjacencyRdd = graphData
  .flatMap(line => {
    val v1 = line(3).toLong
    val v2 = line(4).toLong
    List(
      (v1, v2),
      (v2, v1)
    )
  })
  .aggregateByKey(Set.empty[Long])(
    { case (adjacencySet, vertexId) => adjacencySet + vertexId },
    { case (adjacencySet1, adjacencySet2) => adjacencySet1 ++ adjacencySet2 }
  )
  .map({ case (vertexId, adjacencySet) => (vertexId, adjacencySet.toList) })
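Alternatively, once the graph has been built, GraphX can produce the adjacency list for you. A short sketch (assuming the graph value created above, with Long vertex IDs):
import org.apache.spark.graphx.EdgeDirection
val adjacencyFromGraph = graph
  .collectNeighborIds(EdgeDirection.Either)            // VertexRDD[Array[VertexId]]
  .mapValues(neighbours => neighbours.toList)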

what's the best practice to merge rdds in scala

I have multiple RDDs as results and want to merge them; they all have the same format:
RDD(id, HashMap[String, HashMap[String, Int]])
     ^       ^                 ^
     |       |                 |
 identity  category    distribution of the category
Here is a example of that rdd:
(1001, {age={10=3,15=5,16=8, ...}})
The first key of the outer HashMap is the category of the statistic, and the inner HashMap[String, Int] is the distribution of that category. After calculating the distribution for each category, I want to merge the RDDs by identity so that I can store the results in a database. Here is what I have currently:
def mergeRDD(rdd1: RDD[(String, util.HashMap[String, Object])],
             rdd2: RDD[(String, util.HashMap[String, Object])]): RDD[(String, util.HashMap[String, Object])] = {
  val mergedRDD = rdd1.join(rdd2).map {
    case (id, (m1, m2)) => {
      m1.putAll(m2)
      (id, m1)
    }
  }
  mergedRDD
}
val mergedRDD = mergeRDD(provinceRDD, mergeRDD(mergeRDD(levelRDD, genderRDD), actionTypeRDD))
I wrote the function mergeRDD so that I can merge two RDDs at a time, but that function is not very elegant. As a newbie to Scala, any inspiration is appreciated.
I don't see any easy way to achieve this without hurting performance.
The reason is that you are not simply merging two RDDs; rather, you want the hashmaps to hold consolidated values after the union of the RDDs.
Also, your merge function is wrong: as written, join performs an inner join, dropping keys that are present in one RDD but not in the other.
The correct way would be something like:
val mergedRDD = rdd1.union(rdd2).reduceByKey{
  case (m1, m2) => {
    m1.putAll(m2)
    m1  // reduceByKey must return the merged map; putAll itself returns Unit
  }
}
You may replace the java.util.HashMap with scala.collection.immutable.Map. From there:
val rdds = List(provinceRDD, levelRDD, genderRDD, actionTypeRDD)
val unionRDD = rdds.reduce(_ ++ _)
val mergedRDD = unionRDD.reduceByKey(_ ++ _)
This assumes that categories don't overlap between RDDs.
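If categories can overlap between RDDs, the inner distributions have to be merged rather than overwritten. A sketch under the assumption that you switched to immutable Scala maps of type Map[String, Map[String, Int]]:
type Dist = Map[String, Map[String, Int]]
// Merge two category -> distribution maps, summing the counts when a category and value appear in both.
def mergeDist(a: Dist, b: Dist): Dist =
  b.foldLeft(a) { case (acc, (category, dist)) =>
    val combined = dist.foldLeft(acc.getOrElse(category, Map.empty[String, Int])) {
      case (m, (value, count)) => m.updated(value, m.getOrElse(value, 0) + count)
    }
    acc.updated(category, combined)
  }
val mergedRDD = rdds.reduce(_ ++ _).reduceByKey(mergeDist)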

how to retrieve the value of a property using the value of another property in RDDs

I have a links:JdbcRDD[String] which contains links in the form:
{"bob,michael"}
respectively for the source and destination of each link.
I can split each string to retrieve the strings that uniquely identify the source node and the destination node.
I then have a users:RDD[(Long, Vertex)] that holds all the vertices in my graph.
Each vertex has a stringId:String property and a nodeId:Long property.
I'd like to retrieve the nodeId from the stringId, but I don't know how to implement this logic, being rather new to both Scala and Spark. I am stuck with this code:
val reflinks = links.map { x =>
// split each line in an array
val row = x.split(',')
// retrieve the id using the row(0) and row(1) values
val source = users.filter(_._2.stringId == row(0)).collect()
val dest = users.filter(_._2.stringId == row(1)).collect()
// return the link in GraphX format
Edge(source(0)._1, dest(0)._1, "ref")
}
with this solution I get:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
Unfortunately, you cannot have nested RDDs in Spark. That is, you cannot access one RDD from inside the closure sent to another RDD's transformation.
If you want to combine knowledge from more than one RDD you need to join them in some way. Here is one way to solve this problem:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
// These are some toy examples of the original data for the edges and the vertices
val rawEdges = sc.parallelize(Array("m,a", "c,a", "g,c"))
val rawNodes = sc.parallelize(Array( ("m", 1L), ("a", 2L), ("c", 3L), ("g", 4L)))
val parsedEdges: RDD[(String, String)] = rawEdges.map(x => x.split(",")).map{ case Array(x,y) => (x,y) }
// The two joins here are required since we need to get the ID for both nodes of each edge
// If you want to stay in the RDD domain, you need to do this double join.
val resolvedFirstRdd = parsedEdges.join(rawNodes).map{case (firstTxt,(secondTxt,firstId)) => (secondTxt,firstId) }
val edgeRdd = resolvedFirstRdd.join(rawNodes).map{case (firstTxt,(firstId,secondId)) => Edge(firstId,secondId, "ref") }
// The prints() are here for testing (they can be expensive to keep in the actual code)
edgeRdd.foreach(println)
val g = Graph(rawNodes.map(x => (x._2, x._1)), edgeRdd)
println("In degrees")
g.inDegrees.foreach(println)
println("Out degrees")
g.outDegrees.foreach(println)
The print output for testing:
Edge(3,2,ref)
Edge(1,2,ref)
Edge(4,3,ref)
In degrees
(3,1)
(2,2)
Out degrees
(3,1)
(1,1)
(4,1)
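For completeness, an alternative sketch (my own, not from the original answer): if the vertex table is small enough to fit in driver memory, you can collect it into a map and broadcast it, which avoids the double join at the cost of shipping the whole map to every executor:
val nodeIds = sc.broadcast(rawNodes.collectAsMap())    // scala.collection.Map[String, Long]
val edgeRddViaBroadcast = rawEdges
  .map(_.split(","))
  .collect { case Array(src, dst) if nodeIds.value.contains(src) && nodeIds.value.contains(dst) =>
    Edge(nodeIds.value(src), nodeIds.value(dst), "ref")   // drop links whose endpoints are unknown
  }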

Scala :- Gatling :- Concatenation of two Maps stores last value only and ignores all other values

I have two Maps and I want to concatenate them.
I tried almost all the examples given here: Best way to merge two maps and sum the values of same key?, but the key metrics ends up with only its last value, ignoring all the others.
I have downloaded scalaz-full_2.9.1-6.0.3.jar and imported it with import scalaz._, but it doesn't work for me.
How can I concatenate these two maps that have multiple values for the same keys?
Edit :-
Now I tried
import collection.mutable.{ HashMap, MultiMap, Set }
val map = new HashMap[String, Set[String]] with MultiMap[String, String]
map.addBinding("""report_type""" , """performance""")
map.addBinding("""start_date""" ,start_date)
map.addBinding("""end_date""" , end_date)
map.addBinding("metrics" , "plays")
map.addBinding("metrics", "displays")
map.addBinding("metrics" , "video_starts")
map.addBinding("metrics" , "playthrough_25")
map.addBinding("metrics", "playthrough_50")
map.addBinding("metrics", "playthrough_75")
map.addBinding("metrics", "playthrough_100")
val map1 = new HashMap[String, Set[String]] with MultiMap[String, String]
map1.addBinding("""dimensions""" , """asset""")
map1.addBinding("""limit""" , """50""")
Then I tried to convert these mutable maps to an immutable type, following this link, as:
val asset_query_string = map ++ map1
val asset_query_string_map =(asset_query_string map { x=> (x._1,x._2.toSet) }).toMap[String, Set[String]]
But still I get
i_ui\config\config.scala:51: Cannot prove that (String, scala.collection.immutable.Set[String]) <:< (String, scala.collection.mutable.Set[String]).
11:10:13.080 [ERROR] i.g.a.ZincCompiler$ - val asset_query_string_map =(asset_query_string map { x=> (x._1,x._2.toSet) }).toMap[String, Set[String]]
Your problem is not related to the concatenation but to the declaration of the metrics map. It's not possible to have multiple values for a single key in a Map. Perhaps you should look at this collection:
http://www.scala-lang.org/api/2.10.3/index.html#scala.collection.mutable.MultiMap
You can't have duplicate keys in a Map.
For a simple Map it is impossible to have duplicate keys; if you insert a duplicate key, the map keeps only the last value.
But you can use a MultiMap:
import collection.mutable.{ HashMap, MultiMap, Set }
val mm = new HashMap[String, Set[String]] with MultiMap[String, String]
mm.addBinding("metrics","plays")
mm.addBinding("metrics","displays")
mm.addBinding("metrics","players")
println(mm,"multimap")//(Map(metrics -> Set(players, plays, displays)),multimap)
I was able to create two MultiMaps, but when I tried to concatenate them with val final_map = map1 ++ map2 and followed the answer given here: Mutable MultiMap to immutable Map, my problem was still not solved. I got
config\config.scala:51: Cannot prove that (String, scala.collection.immutable.Set[String]) <:< (String, scala.collection.mutable.Set[String]).
Finally it was solved by:
val final_map = map1 ++ map2
val asset_query_string_map = final_map.map(kv => (kv._1,kv._2.toSet)).toMap
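One caveat worth adding: ++ keeps only the right-hand Set when the same key exists in both MultiMaps. That is fine here because map1 and map2 have disjoint keys, but if the keys could overlap, a hypothetical helper like the one below (my own sketch, not from the original answers) combines the bindings instead:
def mergeMultiMaps(a: HashMap[String, Set[String]] with MultiMap[String, String],
                   b: HashMap[String, Set[String]] with MultiMap[String, String]) = {
  val merged = new HashMap[String, Set[String]] with MultiMap[String, String]
  for ((k, vs) <- a; v <- vs) merged.addBinding(k, v)   // copy every binding from both maps
  for ((k, vs) <- b; v <- vs) merged.addBinding(k, v)
  merged
}
val combined_map = mergeMultiMaps(map1, map2)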

scala.MatchError: null on spark RDDs

I am relatively new to both spark and scala.
I was trying to implement collaborative filtering using scala on spark.
Below is the code
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt,item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws a scala.MatchError: null at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line.
The code works fine if I collect the distinctUsers RDD into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I going wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})
// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()
val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
(user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(userMap => keywords.map(userMap))
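If instead you only need the predictions for one user, you can also query the pair RDD directly with lookup; a tiny usage sketch (the user ID 1001 is made up):
val predictionsForOneUser: Seq[Map[Int, Double]] = predictions.lookup(1001)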
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.