How to convert a DataFrame to a specific RDD? - scala

I have the following DataFrame in Spark 2.2:
df =
v_in v_out
123 456
123 789
456 789
This df defines edges of a graph. Each row is a pair of vertices. I want to extract the Array of edges in order to create an RDD of edges as follows:
val edgeArray = Array(
Edge(2L, 1L, 0.0),
Edge(2L, 4L, 0.2),
Edge(3L, 2L, 0.9),
Edge(3L, 6L, 0.1),
Edge(4L, 1L, 0.0),
Edge(5L, 2L, 0.8),
Edge(5L, 3L, 0.7),
Edge(5L, 6L, 0.5)
)
val spark = SparkSession.builder()
.appName("ES")
.master("local[*]")
.getOrCreate()
implicit val sparkContext = spark.sparkContext
val edgeRDD: RDD[Edge[Double]] = sparkContext.parallelize(edgeArray)
How can I obtain an edgeArray with the same structure from df? In each Edge, the third value can be any random Double between 0 and 1.
UPDATE:
I did it this way, but I am not sure it is the best solution:
val edgeArray = df.rdd.collect().map(row => Edge(row.get(0).toString.toLong, row.get(1).toString.toLong, 0.0))
val edgeRDD: RDD[Edge[Double]] = sparkContext.parallelize(edgeArray)
I would rather not use an Array, because I might have millions of edges. Can I convert the DataFrame to an RDD more directly?

Given
val df = Seq((123, 456), (123, 789), (456, 789)).toDF("v_in", "v_out")
Import
import spark.implicits._
import org.apache.spark.sql.functions.rand
import org.apache.spark.graphx.Edge
and convert:
val edgeRDD = df.toDF("srcId", "dstId")
.withColumn("attr", rand)
.as[Edge[Double]].rdd
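As a follow-up (my own sketch, not part of the original answer), the resulting edgeRDD can be fed straight into GraphX; Graph.fromEdges builds the vertex set from the edge endpoints, with a default vertex attribute of your choosing:
import org.apache.spark.graphx.Graph
// 0.0 is an arbitrary default vertex attribute
val graph = Graph.fromEdges(edgeRDD, 0.0)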
With graphframes:
spark.jars.packages graphframes:graphframes:X.X.X-sparkY.Y-s_Z.ZZ
where X.X.X is the package version, Y.Y is the Spark version, and Z.ZZ is the Scala version, you can create a Graph like this:
GraphFrame.fromEdges(df.toDF("src", "dst")).toGraphX
but it'll use Row attributes.
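If you need Double edge attributes on that route, here is a minimal sketch of my own (assuming the graphframes package is on the classpath) that replaces the Row attribute with a random Double afterwards:
import org.graphframes.GraphFrame
import scala.util.Random
// edge columns must be named "src" and "dst"; toGraphX yields Graph[Row, Row]
val gx = GraphFrame.fromEdges(df.toDF("src", "dst")).toGraphX
// swap the Row edge attribute for a random Double in [0, 1)
val weighted = gx.mapEdges(_ => Random.nextDouble())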

Related

Partial/Full-match value in one RDD to values in another RDD

I have two RDDs where the first RDD has records of the form
RDD1 = (1, 2017-2-13,"ABX-3354 gsfette"
2, 2017-3-18,"TYET-3423 asdsad"
3, 2017-2-09,"TYET-3423 rewriu"
4, 2017-2-13,"ABX-3354 42324"
5, 2017-4-01,"TYET-3423 aerr")
and the second RDD has records of the form
RDD2 = ('mfr1',"ABX-3354")
('mfr2',"TYET-3423")
I need to find all the records in RDD1 that have a full or partial match for each value in RDD2, matching the 3rd column of RDD1 against the 2nd column of RDD2, and get the count.
For this example, the end result would be:
ABX-3354 2
TYET-3423 3
What is the best way to do this?
I am posting a couple of solutions using Spark SQL, focused on accurate pattern matching of the search string within the given text.
1: Using CrossJoin
import spark.implicits._
import org.apache.spark.sql.Row
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-335442324"), //changed from "ABX-3354 42324"
(5, "2017-4-01", "aerrTYET-3423") //changed from "TYET-3423 aerr"
).toDF("id", "dt", "txt")
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).toDF("col1", "key")
//match function for filter
def matcher(row: Row): Boolean = row.getAs[String]("txt")
.contains(row.getAs[String]("key"))
val join = df1.crossJoin(df2)
import org.apache.spark.sql.functions.count
val result = join.filter(matcher _)
.groupBy("key")
.agg(count("txt").as("count"))
2: Using Broadcast variable
import spark.implicits._
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "aerrTYET-3423"),
(6, "2017-4-01", "aerrYET-3423")
).toDF("id", "dt", "pattern")
//small local collection of keys to broadcast (despite the name df2 below, this is a plain Seq, not a DataFrame)
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).map(_._2) // keep only the key (second element) of each pair
//Lookup to use in UDF
val lookup = spark.sparkContext.broadcast(df2)
//Udf
import org.apache.spark.sql.functions._
val matcher = udf((txt: String) => {
val matches: Seq[String] = lookup.value.filter(txt.contains(_))
if (matches.size > 0) matches.head else null
})
val result = df1.withColumn("match", matcher($"pattern"))
.filter($"match".isNotNull) // not interested in non matching records
.groupBy("match")
.agg(count("pattern").as("count"))
Both solutions produce the same counts (in the second solution the grouping column is named match rather than key):
result.show()
+---------+-----+
| key|count|
+---------+-----+
|TYET-3423| 3|
| ABX-3354| 2|
+---------+-----+
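As a side note of my own (not part of the original answer), when df2 is small you can also hint Spark to broadcast it in the crossJoin variant, which turns the join into a broadcast nested loop join; the rest of solution 1 stays the same:
import org.apache.spark.sql.functions.broadcast
// broadcast() is only a hint to the optimizer
val joinHint = df1.crossJoin(broadcast(df2))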
Here is how you can get the result using plain RDD operations:
val RDD1 = spark.sparkContext.parallelize(Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "TYET-3423 aerr")
))
val RDD2 = spark.sparkContext.parallelize(Seq(
("mfr1","ABX-3354"),
("mfr2","TYET-3423")
))
RDD1.map(r => {
// key by the first whitespace-separated token of the text column
(r._3.split(" ")(0), (r._1, r._2, r._3))
})
.join(RDD2.map(r => (r._2, r._1)))
.groupBy(_._1)
.map(r => (r._1, r._2.size))
.foreach(println)
Output:
(TYET-3423,3)
(ABX-3354,2)
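Note that the split(" ")(0) key only works because the pattern happens to be the first whitespace-separated token. A sketch of my own that matches the pattern anywhere in the text, at the cost of a cartesian product:
RDD1.cartesian(RDD2)
.filter { case ((_, _, txt), (_, key)) => txt.contains(key) } // partial match anywhere in the text
.map { case (_, (_, key)) => (key, 1) }
.reduceByKey(_ + _)
.foreach(println)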
Hope this helps!

efficiently using union in spark

I am new to Scala and Spark. I have two RDDs, A = [(1,2),(2,3)] and B = [(4,5),(5,6)], and I want to get an RDD like [(1,2),(2,3),(4,5),(5,6)]. The problem is that my data is large; suppose both A and B are 10 GB. I use sc.union(A, B), but it is slow, and in the Spark UI I see 28308 tasks in this stage.
Is there a more efficient way to do this?
Why don't you convert the two RDDs to DataFrames and use the union function?
Converting to a DataFrame is easy: you just need to import sqlContext.implicits._ and apply the .toDF() function with the column names.
For example:
val sparkSession = SparkSession.builder().appName("testings").master("local").config("", "").getOrCreate()
val sqlContext = sparkSession.sqlContext
var firstTableColumns = Seq("col1", "col2")
var secondTableColumns = Seq("col3", "col4")
import sqlContext.implicits._
var firstDF = Seq((1, 2), (2, 3), (3, 4), (2, 3), (3, 4)).toDF(firstTableColumns:_*)
var secondDF = Seq((4, 5), (5, 6), (6, 7), (4, 5)).toDF(secondTableColumns: _*)
firstDF = firstDF.union(secondDF)
It should be easier for you to work with DataFrames than with RDDs. Converting a DataFrame back to an RDD is easy too; just call the .rdd function:
val rddData = firstDF.rdd
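For what it's worth (my own note), sc.union itself does not shuffle any data; the large task count simply reflects the combined number of partitions of A and B. If fewer partitions are wanted afterwards, a sketch (the target of 200 is a hypothetical value):
val unioned = A.union(B) // narrow operation, no shuffle
val compacted = unioned.coalesce(200) // reduce the partition count, still without a shuffle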

How to join in spark graphx given multiple vertex types

I am relatively new to spark graphx. Basically my graph has:
2 vertex types: person and car
an edge describes which person owns which car
Given all the person vertices in the graph, I want to traverse the edges and collect the list of cars for each person,
e.g.
person1 -> [car1, car2]
person2 -> [car3]
You can achieve this with a bit of SQL.
Let's assume that you have the following graph:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
// Create an RDD for the vertices
val v: RDD[(VertexId, (String))] =
sc.parallelize(Array((1L, ("car1")), (2L, ("car2")),
(3L, ("car3")), (4L, ("person1")),(5L, ("person2"))))
// Create an RDD for edges
val e: RDD[Edge[Int]] =
sc.parallelize(Array(Edge(4L, 1L,1), Edge(4L, 2L, 1),
Edge(5L, 1L,1)))
val graph = Graph(v,e)
Now extract the edges and vertices into DataFrames (spark.implicits._ is needed for toDF on RDDs):
import spark.implicits._
val vDf = graph.vertices.toDF("vId", "vName")
val eDf = graph.edges.toDF("person", "car", "attr")
Transform the data into the desired output:
import org.apache.spark.sql.functions.collect_set
eDf.drop("attr")
.join(vDf, 'person === 'vId).drop("vId", "person").withColumnRenamed("vName", "person")
.join(vDf, 'car === 'vId).drop("car", "vId")
.groupBy("person")
.agg(collect_set('vName)).toDF("person", "car")
.show()
+-------+------------+
| person| car|
+-------+------------+
|person2| [car1]|
|person1|[car2, car1]|
+-------+------------+
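A pure GraphX alternative (my own sketch, not part of the original answer) reads the same pairs straight off the edge triplets; each edge goes person -> car, so srcAttr is the person name and dstAttr the car name:
graph.triplets
.map(t => (t.srcAttr, t.dstAttr))
.groupByKey()
.collect()
.foreach { case (person, cars) => println(s"$person -> ${cars.mkString("[", ", ", "]")}") }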

Get all the nodes connected to a node in Apache Spark GraphX

Suppose we have the following input in Apache Spark GraphX:
Vertex RDD:
val vertexArray = Array(
(1L, "Alice"),
(2L, "Bob"),
(3L, "Charlie"),
(4L, "David"),
(5L, "Ed"),
(6L, "Fran")
)
Edge RDD:
val edgeArray = Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 1),
Edge(3L, 4L, 1),
Edge(5L, 6L, 1)
)
I need all the nodes connected to a given node in Apache Spark GraphX, for example:
1,[1,2,3,4]
5,[5,6]
You can use connectedComponents, which returns
a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex.
and reshape the result:
graph.connectedComponents.vertices.map(_.swap).groupByKey
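A minimal end-to-end sketch of my own, using the vertexArray and edgeArray from the question and an existing SparkContext sc:
import org.apache.spark.graphx._
val graph = Graph(sc.parallelize(vertexArray), sc.parallelize(edgeArray))
graph.connectedComponents.vertices
.map(_.swap) // (lowest vertex id in the component, vertex id)
.groupByKey
.mapValues(_.toSeq.sorted)
.collect()
.foreach(println) // prints (1,List(1, 2, 3, 4)) and (5,List(5, 6))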

How can I get the number of common edges in Spark Graphx?

For example, if I have two graphs with vertices and edges like this:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexRdd1: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
(1L, ("a", 28)),
(2L, ("b", 27)),
(3L, ("c", 65))
))
val edgeRdd1: RDD[Edge[Int]] = sc.parallelize(Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 8)
))
val vertexRdd2: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
(1L, ("a", 28)),
(2L, ("b", 27)),
(3L, ("c", 28)),
(4L, ("d", 27)),
(5L, ("e", 65))
))
val edgeRdd2: RDD[Edge[Int]] = sc.parallelize(Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 4),
Edge(3L, 5L, 1),
Edge(2L, 4L, 1)
))
How can I get the number of common edges between these two graphs, without considering the edge attribute? So, in the above example the number of common edges is 2 and the common edges are: Edge(1L, 2L, 1) common with Edge(1L, 2L, 1) and Edge(2L, 3L, 8) common with Edge(2L, 3L, 4).
I am programming in Scala.
Assuming you have graph1 (Graph(vertexRdd1, edgeRdd1)) and graph2 (Graph(vertexRdd2, edgeRdd2)), you can map the edges to (srcId, dstId) pairs and then use the intersection method:
val srcDst1 = graph1.edges.map(e => (e.srcId, e.dstId))
val srcDst2 = graph2.edges.map(e => (e.srcId, e.dstId))
srcDst1.intersection(srcDst2).count()
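If you also want the common (srcId, dstId) pairs themselves rather than only the count, the same intersection can be collected (my own extension of the answer); for the example graphs this yields (1,2) and (2,3):
srcDst1.intersection(srcDst2).collect().foreach(println)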