How can I get the number of common edges in Spark GraphX? - scala

For example, if I have two graphs with vertices and edges like this:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexRdd1: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
(1L, ("a", 28)),
(2L, ("b", 27)),
(3L, ("c", 65))
))
val edgeRdd1: RDD[Edge[Int]] = sc.parallelize(Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 8)
))
val vertexRdd2: RDD[(VertexId, (String, Int))] = sc.parallelize(Array(
(1L, ("a", 28)),
(2L, ("b", 27)),
(3L, ("c", 28)),
(4L, ("d", 27)),
(5L, ("e", 65))
))
val edgeRdd2: RDD[Edge[Int]] = sc.parallelize(Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 4),
Edge(3L, 5L, 1),
Edge(2L, 4L, 1)
))
How can I get the number of common edges between these two graphs, without considering the edge attribute? In the above example the number of common edges is 2: Edge(1L, 2L, 1) matches Edge(1L, 2L, 1), and Edge(2L, 3L, 8) matches Edge(2L, 3L, 4).
I am programming in Scala.

Assuming you have graph1 (Graph(vertexRdd1, edgeRdd1)) and graph2 (Graph(vertexRdd2, edgeRdd2)), you can map the edges to (srcId, dstId) pairs and then use the intersection method:
val srcDst1 = graph1.edges.map(e => (e.srcId, e.dstId))
val srcDst2 = graph2.edges.map(e => (e.srcId, e.dstId))
srcDst1.intersection(srcDst2).count()
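For completeness, a minimal sketch of how this fits together, assuming the imports and RDDs from the question are already defined (e.g. in a spark-shell session where sc is available):
val graph1 = Graph(vertexRdd1, edgeRdd1)
val graph2 = Graph(vertexRdd2, edgeRdd2)
// common edges, ignoring the edge attribute
val common = graph1.edges.map(e => (e.srcId, e.dstId))
  .intersection(graph2.edges.map(e => (e.srcId, e.dstId)))
common.count()    // 2
common.collect()  // Array((1,2), (2,3)) -- order may vary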

Related

How to convert a DataFrame to a specific RDD?

I have the following DataFrame in Spark 2.2:
df =
v_in v_out
123 456
123 789
456 789
This df defines edges of a graph. Each row is a pair of vertices. I want to extract the Array of edges in order to create an RDD of edges as follows:
val edgeArray = Array(
Edge(2L, 1L, 0.0),
Edge(2L, 4L, 0.2),
Edge(3L, 2L, 0.9),
Edge(3L, 6L, 0.1),
Edge(4L, 1L, 0.0),
Edge(5L, 2L, 0.8),
Edge(5L, 3L, 0.7),
Edge(5L, 6L, 0.5)
)
val spark = SparkSession.builder()
.appName("ES")
.master("local[*]")
.getOrCreate()
implicit val sparkContext = spark.sparkContext
val edgeRDD: RDD[Edge[Double]] = sparkContext.parallelize(edgeArray)
How can I obtain edgeArray of the same structure using df? In each Edge, the third value can be any random Double value from 0 to 1.
UPDATE:
I did it this way, but I am not sure it is the optimal solution:
val edgeArray = df.rdd.collect().map(row => Edge(row.get(0).toString.toLong, row.get(1).toString.toLong, 0.0))
val edgeRDD: RDD[Edge[Double]] = sparkContext.parallelize(edgeArray)
I would rather not use an Array, because I might have millions of edges. Can I convert the DataFrame to an RDD more directly?
Given
val df = Seq((123, 456), (123, 789), (456, 789)).toDF("v_in", "v_out")
Import
import org.apache.spark.sql.functions.rand
import org.apache.spark.graphx.Edge
import spark.implicits._ // needed outside spark-shell for the Edge[Double] encoder used by .as
and convert:
val edgeRDD = df.toDF("srcId", "dstId")
.withColumn("attr", rand)
.as[Edge[Double]].rdd
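If you prefer to skip the Dataset encoder, a minimal sketch of a variant (my own, not part of the original answer) is to map df.rdd directly, which also avoids the collect() from the update; it assumes v_in and v_out are integer columns as in the example:
import scala.util.Random
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.Edge

// assumes v_in and v_out are IntegerType columns
val edgeRDD: RDD[Edge[Double]] = df.rdd.map(row =>
  Edge(row.getAs[Int]("v_in").toLong, row.getAs[Int]("v_out").toLong, Random.nextDouble()))
This stays fully distributed; nothing is collected to the driver.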
With graphframes, added via the configuration
spark.jars.packages graphframes:graphframes:X.X.X-sparkY.Y-s_Z.ZZ
where X.X.X is the package version, Y.Y is the Spark version and Z.ZZ is the Scala version, you can create a Graph like this:
GraphFrame.fromEdges(df.toDF("src", "dst")).toGraphX
but the vertex and edge attributes will be Rows.

Partial/full match of values in one RDD against values in another RDD

I have two RDDs where the first RDD has records of the form
RDD1 = (1, "2017-2-13", "ABX-3354 gsfette")
       (2, "2017-3-18", "TYET-3423 asdsad")
       (3, "2017-2-09", "TYET-3423 rewriu")
       (4, "2017-2-13", "ABX-3354 42324")
       (5, "2017-4-01", "TYET-3423 aerr")
and the second RDD has records of the form
RDD2 = ("mfr1", "ABX-3354")
       ("mfr2", "TYET-3423")
I need to find all the records in RDD1 that have a full or partial match for each value in RDD2, matching the 3rd column of RDD1 against the 2nd column of RDD2, and get the count per value.
For this example, the end result would be:
ABX-3354 2
TYET-3423 3
What is the best way to do this?
I am posting a couple of solutions using Spark SQL, focused on accurate matching of the search string within the given text.
1: Using CrossJoin
import spark.implicits._
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-335442324"), //changed from "ABX-3354 42324"
(5, "2017-4-01", "aerrTYET-3423") //changed from "TYET-3423 aerr"
).toDF("id", "dt", "txt")
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).toDF("col1", "key")
//match function for the filter
import org.apache.spark.sql.Row
def matcher(row: Row): Boolean = row.getAs[String]("txt")
  .contains(row.getAs[String]("key"))
val join = df1.crossJoin(df2)
import org.apache.spark.sql.functions.count
val result = join.filter(matcher _)
.groupBy("key")
.agg(count("txt").as("count"))
2: Using Broadcast variable
import spark.implicits._
val df1 = Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "aerrTYET-3423"),
(6, "2017-4-01", "aerrYET-3423")
).toDF("id", "dt", "pattern")
//small dataset to broadcast
val df2 = Seq(
("mfr1", "ABX-3354"),
("mfr2", "TYET-3423")
).map(_._2) // keep only the search strings (second element of each pair)
//Lookup to use in UDF
val lookup = spark.sparkContext.broadcast(df2)
//UDF that returns the first matching search string, or null if there is none
import org.apache.spark.sql.functions._
val matcher = udf((txt: String) => {
val matches: Seq[String] = lookup.value.filter(txt.contains(_))
if (matches.size > 0) matches.head else null
})
val result = df1.withColumn("match", matcher($"pattern"))
.filter($"match".isNotNull) // not interested in non matching records
.groupBy("match")
.agg(count("pattern").as("count"))
Both solutions produce the same counts (the grouping column is named key in the first solution and match in the second):
result.show()
+---------+-----+
| key|count|
+---------+-----+
|TYET-3423| 3|
| ABX-3354| 2|
+---------+-----+
Here is another way to get the result, using plain RDD transformations:
val RDD1 = spark.sparkContext.parallelize(Seq(
(1, "2017-2-13", "ABX-3354 gsfette"),
(2, "2017-3-18", "TYET-3423 asdsad"),
(3, "2017-2-09", "TYET-3423 rewriu"),
(4, "2017-2-13", "ABX-3354 42324"),
(5, "2017-4-01", "TYET-3423 aerr")
))
val RDD2 = spark.sparkContext.parallelize(Seq(
("mfr1","ABX-3354"),
("mfr2","TYET-3423")
))
RDD1.map(r => (r._3.split(" ")(0), (r._1, r._2, r._3)))
  .join(RDD2.map(r => (r._2, r._1)))
  .groupBy(_._1)
  .map(r => (r._1, r._2.size))
  .foreach(println)
Output:
(TYET-3423,3)
(ABX-3354,2)
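As a side note (my own variant, not shown in the answer above), the counting can also be done with reduceByKey, which avoids materializing the grouped records; it assumes the same RDD1 and RDD2 as above:
RDD1.map(r => (r._3.split(" ")(0), 1L))
  .join(RDD2.map(_.swap))            // keep only keys that exist in RDD2
  .map { case (key, _) => (key, 1L) }
  .reduceByKey(_ + _)
  .foreach(println)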
Hope this helps!

Spark: anti-join two DStreams

I can do JOINs on two Spark DStreams like :
val joinStream = stream1.join(stream2)
Now, what if I need to filter out all the records that weren't joined? Essentially, something like stream1.anti-join(stream2). Is this possible somehow?
Thanks and appreciate any help!
Assuming you had these two RDDs:
val rdd1 = sc.parallelize(Array(
(1, "one"),
(2, "twow"),
(3, "three"),
(4, "four"),
(5, "five")
))
val rdd2 = sc.parallelize(Array(
(1, "otherOne"),
(4, "otherFour"),
(5,"otherFive"),
(6,"six"),
(7,"seven")
))
val antiJoined = rdd1.fullOuterJoin(rdd2).filter(r => r._2._1.isEmpty || r._2._2.isEmpty)
antiJoined.collect foreach println
(6,(None,Some(six)))
(2,(Some(twow),None))
(3,(Some(three),None))
(7,(None,Some(seven)))
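To apply the same idea per micro-batch on the DStreams themselves, one option is transformWith; this is only a sketch, under the assumption that both streams are DStream[(Int, String)]:
import org.apache.spark.rdd.RDD

// assumes stream1 and stream2 are both DStream[(Int, String)]
val antiJoinedStream = stream1.transformWith(stream2,
  (r1: RDD[(Int, String)], r2: RDD[(Int, String)]) =>
    r1.fullOuterJoin(r2).filter(r => r._2._1.isEmpty || r._2._2.isEmpty))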

Distribute elements of the first list to another list/array

I have an Array[(List[String], (Int, Int))] like this
((123, 456, 789), (1, 24))
((89, 284), (2, 6))
((125, 173, 88, 222), (3, 4))
I would like to distribute each element of the first list over the second element, like this
(123, (1, 24))
(456, (1, 24))
(789, (1, 24))
(89, (2, 6))
(284, (2, 6))
(125, (3, 4))
(173, (3, 4))
(88, (3, 4))
(222, (3, 4))
Can anyone help me with this? Thank you very much.
For input data defined as follows:
val data = Array((List("123", "456", "789"), (1, 24)), (List("89", "284"), (2, 6)), (List("125", "173", "88", "222"), (3, 4)))
you can use:
data.flatMap { case (l, ii) => l.map((_, ii)) }
which yields:
Array[(String, (Int, Int))] = Array(("123", (1, 24)), ("456", (1, 24)), ("789", (1, 24)), ("89", (2, 6)), ("284", (2, 6)), ("125", (3, 4)), ("173", (3, 4)), ("88", (3, 4)), ("222", (3, 4)))
which I believe matches what you are looking for.
Based on your example, it seemed to me that you were using a single type.
scala> val xs: List[(List[Int], (Int, Int))] =
| List( ( List(123, 456, 789), (1, 24) ),
| ( List(89, 284), (2,6)),
| ( List(125, 173, 88, 222), (3, 4)) )
xs: List[(List[Int], (Int, Int))] = List((List(123, 456, 789), (1,24)),
(List(89, 284),(2,6)),
(List(125, 173, 88, 222),(3,4)))
Then I wrote this function:
scala> def f[A](xs: List[(List[A], (A, A))]): List[(A, (A, A))] =
| for {
| x <- xs
| head <- x._1
| } yield (head, x._2)
f: [A](xs: List[(List[A], (A, A))])List[(A, (A, A))]
Apply f to xs.
scala> f(xs)
res9: List[(Int, (Int, Int))] = List((123,(1,24)), (456,(1,24)),
(789,(1,24)), (89,(2,6)), (284,(2,6)), (125,(3,4)),
(173,(3,4)), (88,(3,4)), (222,(3,4)))
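For reference, the for comprehension above desugars to the same flatMap/map shape as the first answer; applied to xs it is equivalent to:
xs.flatMap { case (l, pair) => l.map((_, pair)) }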

Get all the nodes connected to a node in Apache Spark GraphX

Suppose we have the following input in Apache Spark GraphX:
Vertex RDD:
val vertexArray = Array(
(1L, "Alice"),
(2L, "Bob"),
(3L, "Charlie"),
(4L, "David"),
(5L, "Ed"),
(6L, "Fran")
)
Edge RDD:
val edgeArray = Array(
Edge(1L, 2L, 1),
Edge(2L, 3L, 1),
Edge(3L, 4L, 1),
Edge(5L, 6L, 1)
)
I need all the nodes connected to each node in Apache Spark GraphX, in the form:
1,[1,2,3,4]
5,[5,6]
You can use the connectedComponents operator, which returns
a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex.
and reshape the result:
graph.connectedComponents.vertices.map(_.swap).groupByKey
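A minimal end-to-end sketch, assuming the vertexArray and edgeArray from the question are defined in a spark-shell session (sc is the SparkContext):
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val vertexRDD: RDD[(VertexId, String)] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph = Graph(vertexRDD, edgeRDD)

graph.connectedComponents.vertices
  .map(_.swap)   // (componentId, vertexId)
  .groupByKey    // componentId -> all vertex ids in that component
  .collect
  .foreach { case (cc, vs) => println(s"$cc,[${vs.mkString(",")}]") }
// 1,[1,2,3,4]
// 5,[5,6]   (order of ids within a component may vary)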