I have created a graph from node and edge data in a Spark application. Now I want to build the adjacency list for this graph. How can I achieve this?
I have written the following code to read the CSV file for the node and edge data and create the graph.
val grapha = sc.textFile("graph.csv")
val getgdata = grapha.map(line=>line.split(","))
val node1 = getgdata.map(line=>(line(3).toLong,(line(0)))).distinct
val node2 = getgdata.map(line=>(line(4).toLong,(line(1)))).distinct
// This is node list of a graph.
val nodes = node1.union(node2).distinct
//This is edge list.
val routes = getgdata.map(line=>
(Edge(line(3).toLong,line(4).toLong,line(2)))).distinct
// now create graph using Graph library
val graphx = Graph(nodes,routes)
Now I need to see the adjacency list for each node of this graph. How can I do it using Scala?
Looking at your code, I am assuming that your graph.csv looks like the following:
node_1, node_2, node_1_relation_node_2, 1, 2
node_1, node_3, node_1_relation_node_3, 1, 3
node_2, node_3, node_2_relation_node_3, 2, 3
Now, you can read this into an RDD as follows:
import org.apache.spark.graphx.{Edge, Graph}

val graphData = sc.textFile("graph.csv").map(line => line.split(","))
Now, to create your graph you need two things:
An RDD of vertices:
// GraphX vertex IDs must be Longs, so parse columns 3 and 4
val verticesRdd = graphData.flatMap(line => List(
  (line(3).trim.toLong, line(0)),
  (line(4).trim.toLong, line(1))
)).distinct
An RDD of edges:
val edgesRdd = graphData.map(line => Edge(line(3).trim.toLong, line(4).trim.toLong, line(2))).distinct
Now, you can create your graph as follows:
val graph = Graph(verticesRdd, edgesRdd)
But if you just need the adjacency list, you can obtain it directly from graphData as follows:
val adjacencyRdd = graphData
  .flatMap(line => {
    val v1 = line(3).trim.toLong
    val v2 = line(4).trim.toLong
    // emit both directions so every vertex sees all of its neighbours
    List(
      (v1, v2),
      (v2, v1)
    )
  })
  .aggregateByKey(Set.empty[Long])(
    { case (adjacencySet, vertexId) => adjacencySet + vertexId },
    { case (adjacencySet1, adjacencySet2) => adjacencySet1 ++ adjacencySet2 }
  )
  .map { case (vertexId, adjacencySet) => (vertexId, adjacencySet.toList) }
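If you would rather read the adjacency list off the GraphX graph itself, collectNeighborIds from GraphOps is one option; a minimal sketch, assuming the graph built above:
import org.apache.spark.graphx.EdgeDirection

// For every vertex, collect the IDs of its neighbours over both edge directions.
val adjacency = graph.collectNeighborIds(EdgeDirection.Either)
  .map { case (vertexId, neighbours) => (vertexId, neighbours.toList) }

adjacency.collect().foreach(println)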
I am now trying to compute cosine similarity.
There is also an item list, whose elements are of type Array[(Int, Int)]:
val list = Array(Array((item_11, item_12), (item_13,item_14), (item_15,item_16), ...
Array((item_21, item_22), (item_23,item_24), (item_25, item_26), ...
...
Array((item_n1, item_n2), (item_n3,item_n4), (item_n5, item_n6), ... ))
I want to get the cosine similarity for each item pair (item_1, item_2), using the feature vectors extracted from the ALS model, by accessing each element of the array.
The output that I want:
Array( similarity from item_11, item_12 ), ...
Array( similarity from item_21, item_22 ), ...
I wrote some code, but Spark doesn't support nested RDDs, so it failed.
The code was:
val combinations = list.mapValues(_.toSeq.combinations(2).toArray.map { case Seq(x, y) => (x, y) }).map(_._2)
val simOnly = combinations.map { _.map { case (item_1, item_2) =>
  val itemFactor_1 = modelMLlib.productFeatures.lookup(item_1).head
  val itemFactor_2 = modelMLlib.productFeatures.lookup(item_2).head
  val itemVector_1 = new DoubleMatrix(itemFactor_1)
  val itemVector_2 = new DoubleMatrix(itemFactor_2)
  val sim = cosineSimilarity(itemVector_1, itemVector_2)
  sim
}}
Could anybody help me fix this?
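For what it's worth, a minimal sketch of one way around the nested-RDD limitation, assuming the ALS product factors fit in driver memory and reusing modelMLlib, list and cosineSimilarity from the code above:
import org.jblas.DoubleMatrix

// Collect the ALS item factors once into a local map
// (assumption: they fit in driver memory).
val featureMap: Map[Int, Array[Double]] = modelMLlib.productFeatures.collectAsMap().toMap

// list is a local Array[Array[(Int, Int)]], so plain Scala maps suffice and
// no RDD is referenced inside another RDD's closure.
val simOnly = list.map(_.map { case (item_1, item_2) =>
  val itemVector_1 = new DoubleMatrix(featureMap(item_1))
  val itemVector_2 = new DoubleMatrix(featureMap(item_2))
  cosineSimilarity(itemVector_1, itemVector_2)
})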
I have the data below, which needs to be processed using Spark (Scala) so that I only get the id of a person who visited "Walmart" but not "Bestbuy". A store may appear repeatedly because a person can visit it any number of times.
Input Data:
id, store
1, Walmart
1, Walmart
1, Bestbuy
2, Target
3, Walmart
4, Bestbuy
Output Expected:
3, Walmart
I got the output using DataFrames and running SQL queries on the Spark context. But is there any way to do this using groupByKey/reduceByKey etc. without DataFrames? Can someone help me with the code? After map -> groupByKey, a ShuffledRDD has been formed and I am having difficulty filtering the CompactBuffer!
The code with which I got it using sqlContext is below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Person(id: Int, store: String)
val people = sc.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(p => Person(p(0).trim.toInt, p(1).trim))
people.registerTempTable("people")
val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in('Walmart','Bestbuy') group by id having count(distinct store)=1) sample on people.id=sample.id and people.store='Walmart'")
The code which I am trying now is this, but I am stuck after the third step:
val data = sc.textFile("examples/src/main/resources/people.txt")
  .filter(!_.contains("id"))   // drop the header line
  .map(x => (x.split(",")(0), x.split(",")(1).trim))
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.map { case (x, y) =>
  val url = y.flatMap(x => x.split(",")).toList
  if (!url.contains("Bestbuy") && url.contains("Walmart")) {
    x.map(x => (x, y))
  }
}
If I do dataFiltered.collect(), I get:
Array[Any] = Array(Vector((3,Walmart)), (), ())
Please help me extract the output after this step.
To filter an RDD, just use RDD.filter:
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.filter {
// keep only lists that contain Walmart but do not contain Bestbuy:
case (x, y) => val l = y.toList; l.contains("Walmart") && !l.contains("Bestbuy")
}
dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))
// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }
result.foreach(println) // prints: (3, Walmart)
I also tried it another way and it worked out:
val data = sc.textFile("examples/src/main/resources/people.txt")
  .filter(!_.contains("id"))   // drop the header line
  .map(x => (x.split(",")(0), x.split(",")(1).trim))
data.cache()
val dataWalmart = data.filter { case (x, y) => y.contains("Walmart") }.distinct()
val dataBestbuy = data.filter { case (x, y) => y.contains("Bestbuy") }.distinct()
val result = dataWalmart.subtractByKey(dataBestbuy)
data.unpersist()
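As another option along the lines of the groupByKey/reduceByKey family asked about above, here is a hedged sketch using aggregateByKey to collect the distinct stores per id in one pass (the names storesPerId and walmartOnly are mine):
// Collect the set of stores seen for each id, then keep ids that have
// Walmart but not Bestbuy.
val storesPerId = data.aggregateByKey(Set.empty[String])(
  (stores, store) => stores + store,
  (a, b) => a ++ b
)
val walmartOnly = storesPerId
  .filter { case (_, stores) => stores.contains("Walmart") && !stores.contains("Bestbuy") }
  .map { case (id, _) => (id, "Walmart") }
walmartOnly.foreach(println) // prints: (3,Walmart)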
I have a list of vertices and edges created this way:
val v1 = new Vertex(1L, "foo")
val v2 = new Vertex(2L, "bar")
val e1 = new Edge(v1, v2, 0.5)
and want to create a Flink graph using the Graph.fromDataSet method (or any other, for that matter). How can I transform those edges and vertices into something that Flink can read?
Thank you!!
Given a list of vertices, val vertices: Seq[Vertex[Long, String]] = ..., and a list of edges, val edges: Seq[Edge[Long, String]] = ..., you can create a Graph using the Graph.fromCollection method:
val env = ExecutionEnvironment.getExecutionEnvironment
val vertices = Seq(new Vertex[Long, String](1L, "foo"), new Vertex[Long, String](2L, "bar"))
val edges = Seq(new Edge[Long, String](1L, 2L, "foobar"))
val graph = Graph.fromCollection(vertices, edges, env)
Note that you have to import the Scala version of Graph, org.apache.flink.graph.scala.Graph.
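For reference, a sketch of the imports this typically implies (assuming the Gelly Scala API):
import org.apache.flink.api.scala._
// the Scala Gelly Graph, not org.apache.flink.graph.Graph
import org.apache.flink.graph.scala.Graph
import org.apache.flink.graph.{Edge, Vertex}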
Alternatively, you can first create an edgeDataset: DataSet[Edge[Long, String]] and a vertexDataset: DataSet[Vertex[Long, String]] using the ExecutionEnvironment. A Graph can then be created by calling the Graph.fromDataSet method:
val vertexDataset = env.fromCollection(vertices)
val edgeDataset = env.fromCollection(edges)
val graph = Graph.fromDataSet(vertexDataset, edgeDataset, env)
After parsing the graph from a file, I get a Map where the keys represent vertices (by id) and the values represent edges (by id). In order to create the edges (Vx -> Vy) we need to join the Map entries on their values (the edge id). The goal is to create a GraphX graph from this representation.
Here is what I have so far:
tempLHM.foreach(x=>println(x))
(A.L0,A)
(B.L0,B)
(C.L0,C)
(D.L0,D)
(E.L0,E)
(a.L0M1,A)
(b.L0M1,B)
(c.L0M1,n4)
(a.L0M2,n4)
(b.L0M2,D)
(c.L0M2,n5)
(a.L0M3,n5)
(b.L0M3,C)
(c.L0M3,E)
Is there a direct way to map this hashmap to vertex and edge RDDs?
tempLHM is a mutable LinkedHashMap[String,String]. In the hashmap above, in the elements (A.L0,A) and (a.L0M1,A), the keys A.L0 and a.L0M1 are vertices joined by the common value A (the edge).
Here is what I want to derive
val vertex: RDD[(vertexId, VertexName)], i.e. ((A.L0).Long, A.L0), ((a.L0M1).Long, a.L0M1), etc.
val edge: RDD[((vertexId1, vertexId2), EdgeName)], i.e. (((A.L0).Long, (a.L0M1).Long), A)
Assume you have this structure for your data.
val d = Map("v1" -> "e1", "v2" -> "e1", "v3" -> "e2", "v4" -> "e2")
There are two edges here: ("v1","v2") and ("v3","v4").
Assume you have a simple graph (not a hyper-graph, where a single edge can connect more than two nodes). The assumption for this solution is therefore that an edge connects exactly two nodes and that each edge appears just once.
import collection.mutable.{ HashMap, MultiMap, Set }
import java.security.MessageDigest
import org.apache.spark.graphx.Edge
import org.apache.spark.graphx.Graph
// a hacky way to go from string to Long since GraphX need Longs to
// represent vertex IDs. You might want to do something different
// here to make sure that your IDs are unique.
def str2Long(s: String) = s.##.toLong
val d = Map("v1" -> "e1", "v2" -> "e1", "v3" -> "e2", "v4" -> "e2")
// We use a multi-map to create an inverse map (Edge->Set(Vertices))
val mm = new HashMap[String, Set[String]] with MultiMap[String, String]
d.foreach{ x => mm.addBinding(x._2,x._1) }
val edges = mm.map{ case(k,v) => Edge[String](str2Long(v.head),str2Long(v.last), k) }.toList
val vertices = d.keys.map(x => (str2Long(x), x)).toList
val edgeRdd = sc.parallelize(edges)
val vertexRdd = sc.parallelize(vertices)
val g = Graph(vertexRdd, edgeRdd)
If you print the edges and the vertices you get:
g.vertices.foreach(println)
g.edges.foreach(println)
(3709,v3)
(3707,v1)
(3708,v2)
(3710,v4)
Edge(3709,3710,e2)
Edge(3707,3708,e1)
Note: this solution will only work for data that fits in the memory of a single node. From your question I see that you load the data into a local Map, so it would work for you. If you want to run this on a huge dataset across multiple nodes, however, it will not.
Updated Solution
This solution is more scalable than the one above. It makes sure that you always stay in the RDD domain, without collecting the graph at the driver (for example, above we loaded all the raw data into a Scala Map, which we avoid here). It also covers the case where an edge ID is shared between different nodes (in a hyper-graph-like way).
Let's assume that the text file has this format:
v1,e1
v2,e1
v3,e2
v4,e2
In the code below, we first read the raw data and then transform it into the proper vertex and edge RDDs.
import org.apache.spark.graphx.Edge
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD
def str2Long(s: String) = s.##.toLong
val rawData: RDD[String] = sc.textFile("...")
val toBeJoined: RDD[(String, String)]
= rawData.map(_.split(",")).map{ case Array(x,y) => (y,x) }
Note here that our resulting graph will be bidirectional: If we have edge (v1,v2) we also have edge (v2,v1).
val biDirectionalEdges: RDD[(String, (String, String))]
= toBeJoined.join(toBeJoined).filter{ case(e,(v1,v2)) => v1 != v2 }
val edgeRdd =
biDirectionalEdges.map{ case(e,v) => Edge[String](str2Long(v._1),str2Long(v._2), e) }
val vertexRdd =
  toBeJoined.map(_._2).distinct.map(x => (str2Long(x), x))
val g = Graph(vertexRdd, edgeRdd)
// Verify that this is the right graph
g.vertices.take(10).foreach(println)
g.edges.take(10).foreach(println)
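As an aside, the hash-based str2Long above can silently merge vertices on a hash collision. A hedged sketch of a collision-free alternative, reusing toBeJoined and biDirectionalEdges from the code above (the names vertexIds, vertexRdd2, edgeRdd2 and g2 are mine), assigns IDs with zipWithUniqueId and joins them back:
// Assign a unique Long ID to every distinct vertex name.
val vertexIds: RDD[(String, Long)] = toBeJoined.map(_._2).distinct.zipWithUniqueId()
val vertexRdd2 = vertexIds.map { case (name, id) => (id, name) }

// Translate the edge endpoints from names to the generated IDs with two joins.
val edgeRdd2 = biDirectionalEdges
  .map { case (e, (src, dst)) => (src, (dst, e)) }
  .join(vertexIds)
  .map { case (_, ((dst, e), srcId)) => (dst, (srcId, e)) }
  .join(vertexIds)
  .map { case (_, ((srcId, e), dstId)) => Edge(srcId, dstId, e) }

val g2 = Graph(vertexRdd2, edgeRdd2)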
I have a links:JdbcRDD[String] which contains links in the form:
{"bob,michael"}
respectively for the source and destination of each link.
I can split each string to retrieve the string that uniquely identifies the source node and the destination node.
I then have a users:RDD[(Long, Vertex)] that holds all the vertices in my graph.
Each vertex has a nameId:String property and a nodeId:Long property.
I'd like to retrieve the nodeId from the stringId, but I don't know how to implement this logic, being rather new to both Scala and Spark. I am stuck with this code:
val reflinks = links.map { x =>
  // split each line into an array
  val row = x.split(',')
  // retrieve the ids using the row(0) and row(1) values
  val source = users.filter(_._2.stringId == row(0)).collect()
  val dest = users.filter(_._2.stringId == row(1)).collect()
  // return the link in GraphX format
  Edge(source(0)._1, dest(0)._1, "referral")
}
with this solution I get:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
Unfortunately, you cannot have nested RDDs in Spark. That is, you cannot access one RDD while you are inside the closure sent to another RDD.
If you want to combine knowledge from more than one RDD you need to join them in some way. Here is one way to solve this problem:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
// These are some toy examples of the original data for the edges and the vertices
val rawEdges = sc.parallelize(Array("m,a", "c,a", "g,c"))
val rawNodes = sc.parallelize(Array( ("m", 1L), ("a", 2L), ("c", 3L), ("g", 4L)))
val parsedEdges: RDD[(String, String)] = rawEdges.map(x => x.split(",")).map{ case Array(x,y) => (x,y) }
// The two joins here are required since we need to get the ID for both nodes of each edge
// If you want to stay in the RDD domain, you need to do this double join.
val resolvedFirstRdd = parsedEdges.join(rawNodes).map{case (firstTxt,(secondTxt,firstId)) => (secondTxt,firstId) }
val edgeRdd = resolvedFirstRdd.join(rawNodes).map{case (firstTxt,(firstId,secondId)) => Edge(firstId,secondId, "ref") }
// The prints() are here for testing (they can be expensive to keep in the actual code)
edgeRdd.foreach(println)
val g = Graph(rawNodes.map(x => (x._2, x._1)), edgeRdd)
println("In degrees")
g.inDegrees.foreach(println)
println("Out degrees")
g.outDegrees.foreach(println)
The print output for testing:
Edge(3,2,ref)
Edge(1,2,ref)
Edge(4,3,ref)
In degrees
(3,1)
(2,2)
Out degrees
(3,1)
(1,1)
(4,1)