Depth-first search algorithm on a DataFrame (GraphFrame) in Spark/Scala

I have two DataFrames, one containing vertices:
val v = sqlContext.createDataFrame(scala.List(
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 1),
("f", "Fanny", 36),
("g", "Gabby", 60),
("h", "harry", 45),
("i", "ishwar", 37),
("j", "James", 65),
("z", "James", 65),
("k", "Kamla", 43),
("l", "laila", 54)
)).toDF("id", "name", "age")
and another one containing edges:
val e = sqlContext.createDataFrame(scala.List(
("a", "b", "follow", 193, 231),
("b", "c", "friend", 113, 211),
("c", "d", "follow", 124, 222),
("d", "e", "follow", 135, 233),
("f", "c", "follow", 146, 243),
("b", "f", "follow", 146, 243),
("h", "i", "friend", 123, 265),
("i", "h", "friend", 123, 265),
("i", "j", "friend", 126, 223),
("j", "h", "friend", 126, 223),
("f", "g", "friend", 157, 243),
("i", "a", "friend", 157, 243)
)).toDF("src", "dst", "relationship", "SNO", "Salary")
I need a DataFrame that contains all possible paths between two vertices, say from 'a' to 'e', just like the DFS algorithm does.
It should give output like:
+--------------+------------------------+------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
|n1            |e1                      |n2          |e2                      |n3              |e3                      |n4            |e4                      |n5            |
+--------------+------------------------+------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
|[a, Alice, 34]|[a, b, follow, 193, 231]|[b, Bob, 36]|[b, c, friend, 113, 211]|[c, Charlie, 30]|[c, d, follow, 124, 222]|[d, David, 29]|[d, e, follow, 135, 233]|[e, Esther, 1]|
+--------------+------------------------+------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
+--------------+------------------------+------------+------------------------+--------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
|n1            |e1                      |n2          |e2                      |n3            |e3                      |n4              |e4                      |n5            |e5                      |n6            |
+--------------+------------------------+------------+------------------------+--------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
|[a, Alice, 34]|[a, b, follow, 193, 231]|[b, Bob, 36]|[b, f, follow, 146, 243]|[f, Fanny, 36]|[f, c, follow, 146, 243]|[c, Charlie, 30]|[c, d, follow, 124, 222]|[d, David, 29]|[d, e, follow, 135, 233]|[e, Esther, 1]|
+--------------+------------------------+------------+------------------------+--------------+------------------------+----------------+------------------------+--------------+------------------------+--------------+
I want the DFS algorithm over a DataFrame so that I can perform further tasks on the derived DataFrame or GraphFrame.
Any help or suggestion would be great.
Thanks
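
As a starting point (only a sketch, assuming the graphframes package is on the classpath): GraphFrames has no built-in DFS, but you can wrap the two DataFrames in a GraphFrame and either use bfs for shortest paths or enumerate fixed-length paths with motif finding, which returns columns n1, e1, ..., n5 like the output above.
import org.graphframes.GraphFrame

val g = GraphFrame(v, e)

// BFS gives shortest paths from 'a' to 'e', not every DFS path
val shortest = g.bfs.fromExpr("id = 'a'").toExpr("id = 'e'").run()
shortest.show(false)

// Motif finding enumerates all paths of a fixed length (here 4 edges);
// repeat with longer motifs for longer paths
val fourHop = g.find("(n1)-[e1]->(n2); (n2)-[e2]->(n3); (n3)-[e3]->(n4); (n4)-[e4]->(n5)")
  .filter("n1.id = 'a' AND n5.id = 'e'")
fourHop.show(false)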

Related

How to solve "aggregateByKey is not a member of org.apache.spark.sql.Dataset" in Spark?

I am trying this example:
https://backtobazics.com/big-data/spark/apache-spark-aggregatebykey-example/
But instead of an RDD, I am using a DataFrame.
I tried the following:
val aggrRDD = student_df.map(r => (r.getString(0), (r.getString(1), r.getInt(2))))
.aggregateByKey(zeroVal)(seqOp, combOp)
which is part of this code snippet:
val student_df = sc.parallelize(Array(
("Joseph", "Maths", 83), ("Joseph", "Physics", 74), ("Joseph", "Chemistry", 91), ("Joseph", "Biology", 82),
("Jimmy", "Maths", 69), ("Jimmy", "Physics", 62), ("Jimmy", "Chemistry", 97), ("Jimmy", "Biology", 80),
("Tina", "Maths", 78), ("Tina", "Physics", 73), ("Tina", "Chemistry", 68), ("Tina", "Biology", 87),
("Thomas", "Maths", 87), ("Thomas", "Physics", 93), ("Thomas", "Chemistry", 91), ("Thomas", "Biology", 74),
("Cory", "Maths", 56), ("Cory", "Physics", 65), ("Cory", "Chemistry", 71), ("Cory", "Biology", 68),
("Jackeline", "Maths", 86), ("Jackeline", "Physics", 62), ("Jackeline", "Chemistry", 75), ("Jackeline", "Biology", 83),
("Juan", "Maths", 63), ("Juan", "Physics", 69), ("Juan", "Chemistry", 64), ("Juan", "Biology", 60)), 3).toDF("student", "subject", "marks")
def seqOp = (accumulator: Int, element: (String, Int)) =>
  if (accumulator > element._2) accumulator else element._2
def combOp = (accumulator1: Int, accumulator2: Int) =>
  if (accumulator1 > accumulator2) accumulator1 else accumulator2
val zeroVal = 0
val aggrRDD = student_df.map(r => (r.getString(0), (r.getString(1), r.getInt(2))))
.aggregateByKey(zeroVal)(seqOp, combOp)
That gives this error:
error: value aggregateByKey is not a member of org.apache.spark.sql.Dataset[(String, (String, Int))]
A possible cause might be that a semicolon is missing before value aggregateByKey?
What am I doing wrong here? How do I work with DataFrames or Datasets here?
Try calling rdd on student_df before the map:
val aggrRDD = student_df.rdd.map(r => (r.getString(0), (r.getString(1), r.getInt(2))))
.aggregateByKey(zeroVal)(seqOp, combOp)
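Alternatively, if you want to stay in the DataFrame API rather than dropping down to RDDs, here is a sketch of the same max-marks-per-student aggregation (assuming the student_df above):
import org.apache.spark.sql.functions.max

// groupBy + agg replaces aggregateByKey when working with DataFrames
val maxMarksDF = student_df.groupBy("student").agg(max("marks").as("maxMarks"))
maxMarksDF.show()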

Get all key combinations from nested maps

I have a nested map like so:
val m: Map[Int, Map[String, Seq[Int]]] =
Map(
1 -> Map(
"A" -> Seq(1, 2, 3),
"B" -> Seq(4, 5, 6)
),
2 -> Map(
"C" -> Seq(7, 8, 9),
"D" -> Seq(10, 11, 12),
"E" -> Seq(13, 14, 15)
),
3 -> Map(
"F" -> Seq(16, 17, 18)
)
)
The desired output should show every possible combination of the integers in the Seqs. For example:
List((1, "A", 1),
(1, "A", 2),
(1, "A", 3),
(1, "B", 4),
(1, "B", 5),
(1, "B", 6),
(2, "C", 7),
(2, "C", 8),
(2, "C", 9),
(2, "D", 10),
(2, "D", 11),
(2, "D", 12),
(2, "E", 13),
(2, "E", 14),
(2, "E", 15),
(3, "F", 16),
(3, "F", 17),
(3, "F", 18))
I have been trying different combinations of map and flatMap, but nothing has been working. Any ideas?
Here is a possibility using a for comprehension:
for {
  (k1, v1) <- m
  (k2, v2) <- v1
  v3 <- v2
} yield (k1, k2, v3)
This goes through all top-level key/value pairs of m. For each of these top-level values, it goes through all nested key/value pairs. And finally, for each of these nested values (which are the Seqs), it goes through each element and yields what's requested.
A for comprehension is equivalent to nested flatMaps followed by a map, such as:
m.flatMap {
  case (k1, v1) => v1.flatMap {
    case (k2, v2) => v2.map(v3 => (k1, k2, v3))
  }
}
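One small note: since the yielded elements are triples rather than key/value pairs, the result is an Iterable, not a Map; call .toList to get the List shown above. A quick sketch:
val combinations = (for {
  (k1, v1) <- m
  (k2, v2) <- v1
  v3 <- v2
} yield (k1, k2, v3)).toList

// Maps are unordered, so sort if you need the order from the question
combinations.sortBy { case (k1, k2, _) => (k1, k2) }.foreach(println)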

Split RDD into many RDDs and Cache

I have an RDD like so:
(aid, session, sessionnew, date)
(55-BHA, 58, 15, 2017-05-09)
(07-YET, 18, 5, 2017-05-09)
(32-KXD, 27, 20, 2017-05-09)
(19-OJD, 10, 1, 2017-05-09)
(55-BHA, 1, 0, 2017-05-09)
(55-BHA, 19, 3, 2017-05-09)
(32-KXD, 787, 345, 2017-05-09)
(07-YET, 4578, 1947, 2017-05-09)
(07-YET, 23, 5, 2017-05-09)
(32-KXD, 85, 11, 2017-05-09)
I want to split everything with the same aid into a new RDD and then cache each one for later use, so one RDD per unique aid. I saw some other answers, but they save the RDDs to files. Is there a problem with keeping this many RDDs in memory? It will likely be around 30k+.
I save the cached RDD with Spark JobServer.
I would suggest you cache the grouped RDD as below.
Let's say you have the RDD data as:
val rddData = sparkContext.parallelize(Seq(
("55-BHA", 58, 15, "2017-05-09"),
("07-YET", 18, 5, "2017-05-09"),
("32-KXD", 27, 20, "2017-05-09"),
("19-OJD", 10, 1, "2017-05-09"),
("55-BHA", 1, 0, "2017-05-09"),
("55-BHA", 19, 3, "2017-05-09"),
("32-KXD", 787, 345, "2017-05-09"),
("07-YET", 4578, 1947, "2017-05-09"),
("07-YET", 23, 5, "2017-05-09"),
("32-KXD", 85, 11, "2017-05-09")))
You can cache the data by grouping on "aid" and then use filter to select the grouped data you need:
val grouped = rddData.groupBy(_._1).cache
val filtered = grouped.filter(_._1 equals("32-KXD"))
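Note that cache is lazy; the grouped data is only materialized in memory once an action runs, after which later filters reuse it. A small sketch:
// an action (e.g. count) actually populates the cache
grouped.count()

// subsequent lookups for other aids reuse the cached, grouped RDD
val another = grouped.filter(_._1 equals "07-YET")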
But I would suggest you use a DataFrame as below, which is more efficient than RDDs:
import sqlContext.implicits._
val dataFrame = Seq(
("55-BHA", 58, 15, "2017-05-09"),
("07-YET", 18, 5, "2017-05-09"),
("32-KXD", 27, 20, "2017-05-09"),
("19-OJD", 10, 1, "2017-05-09"),
("55-BHA", 1, 0, "2017-05-09"),
("55-BHA", 19, 3, "2017-05-09"),
("32-KXD", 787, 345, "2017-05-09"),
("07-YET", 4578, 1947, "2017-05-09"),
("07-YET", 23, 5, "2017-05-09"),
("32-KXD", 85, 11, "2017-05-09")).toDF("aid", "session", "sessionnew", "date").cache
val newDF = dataFrame.select("*").where(dataFrame("aid") === "32-KXD")
newDF.show
I hope this helps.

Intersection of Two Map rdd's in Scala

I have two RDDs, for example:
firstmapRDD - (0-14,List(0, 4, 19, 19079, 42697, 444, 42748))
secondmapRdd-(0-14,List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94))
I want to find the intersection.
I tried var interResult = firstmapRDD.intersection(secondmapRdd), which shows no result in the output file.
I also tried cogrouping based on keys, mapRDD.cogroup(secondMapRDD).filter(x => ...), but I don't know how to find the intersection between the two values. Is it x => x._1.intersect(x._2)? Can someone help me with the syntax?
Even this throws a compile-time error: mapRDD.cogroup(secondMapRDD).filter(x => x._1.intersect(x._2))
var mapRDD = sc.parallelize(map.toList)
var secondMapRDD = sc.parallelize(secondMap.toList)
var interResult = mapRDD.intersection(secondMapRDD)
It may be because of the ArrayBuffer[List[]] values that the intersection is not working. Is there any way around it?
I tried doing this:
var interResult = mapRDD.cogroup(secondMapRDD)
  .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
  .map { case (k, (l, r)) => (k, l.toList.intersect(r.toList)) }
Still getting an empty list!
Since you are looking to intersect the values, you need to join both RDDs, get all the matched values, then do the intersect on the values.
Sample code:
val firstMap = Map(1 -> List(1,2,3,4,5))
val secondMap = Map(1 -> List(1,2,5))
val firstKeyRDD = sparkContext.parallelize(firstMap.toList, 2)
val secondKeyRDD = sparkContext.parallelize(secondMap.toList, 2)
val joinedRDD = firstKeyRDD.join(secondKeyRDD)
val finalResult = joinedRDD.map(tuple => {
  val matchedLists = tuple._2
  val intersectValues = matchedLists._1.intersect(matchedLists._2)
  (tuple._1, intersectValues)
})
finalResult.foreach(println)
The output will be
(1,List(1, 2, 5))
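For reference, the cogroup approach you tried can also work once the Iterable wrappers are flattened before intersecting; a sketch against the same sample RDDs:
// cogroup yields (key, (Iterable[List[Int]], Iterable[List[Int]])),
// so flatten each side to a plain List[Int] before intersecting
val cogrouped = firstKeyRDD.cogroup(secondKeyRDD)
  .filter { case (_, (l, r)) => l.nonEmpty && r.nonEmpty }
  .mapValues { case (l, r) => l.flatten.toList.intersect(r.flatten.toList) }

cogrouped.foreach(println) // (1,List(1, 2, 5))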

SparkSQL: Avg based on a column after GroupBy

I have an RDD of student grades, and I need to first group them by the first column, which is the university, and then show the average student count per course, like this. What is the easiest way to do this query?
+----------+----------------+
|university| avg of students|
+----------+----------------+
|       MIT|               3|
| Cambridge|            2.66|
+----------+----------------+
Here is the dataset.
case class grade(university: String, courseId: Int, studentId: Int, grade: Double)
val grades = List(
grade("Cambridge", 1, 1001, 4),
grade("Cambridge", 1, 1004, 4),
grade("Cambridge", 2, 1006, 3.5),
grade("Cambridge", 2, 1004, 3.5),
grade("Cambridge", 2, 1002, 3.5),
grade("Cambridge", 3, 1006, 3.5),
grade("Cambridge", 3, 1007, 5),
grade("Cambridge", 3, 1008, 4.5),
grade("MIT", 1, 1001, 4),
grade("MIT", 1, 1002, 4),
grade("MIT", 1, 1003, 4),
grade("MIT", 1, 1004, 4),
grade("MIT", 1, 1005, 3.5),
grade("MIT", 2, 1009, 2))
1) First groupBy university
2) then get course count per university
3) then groupBy courseId
4) then get student count per course
grades.groupBy(_.university).map { case (k, v) =>
  val courseCount = v.map(_.courseId).distinct.length
  val studentCountPerCourse = v.groupBy(_.courseId).map { case (k, v) => v.length }.sum
  k -> (studentCountPerCourse.toDouble / courseCount.toDouble)
}
Scala REPL
scala> val grades = List(
grade("Cambridge", 1, 1001, 4),
grade("Cambridge", 1, 1004, 4),
grade("Cambridge", 2, 1006, 3.5),
grade("Cambridge", 2, 1004, 3.5),
grade("Cambridge", 2, 1002, 3.5),
grade("Cambridge", 3, 1006, 3.5),
grade("Cambridge", 3, 1007, 5),
grade("Cambridge", 3, 1008, 4.5),
grade("MIT", 1, 1001, 4),
grade("MIT", 1, 1002, 4),
grade("MIT", 1, 1003, 4),
grade("MIT", 1, 1004, 4),
grade("MIT", 1, 1005, 3.5),
grade("MIT", 2, 1009, 2))
// grades: List[grade] = List(...)
scala> grades.groupBy(_.university).map { case (k, v) =>
val courseCount = v.map(_.courseId).distinct.length
val studentCountPerCourse = v.groupBy(_.courseId).map { case (k, v) => v.length }.sum
k -> (studentCountPerCourse.toDouble / courseCount.toDouble)
}
// res2: Map[String, Double] = Map("MIT" -> 3.0, "Cambridge" -> 2.6666666666666665)
gradesRdd.map({ case Grade(university: String, courseId: Int, studentId: Int, gpa: Int) =>
((university),(courseId))}).mapValues(x => (x, 1))
.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
.mapValues(y => 1.0 * y._1 / y._2).collect
res73: Array[(String, Double)] = Array((Cambridge,2.125), (MIT,1.1666666666666667))
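For a DataFrame-based take on the same query (a sketch, assuming spark.implicits._ is in scope and the corrected grades list above): count the distinct students per (university, courseId) and then average those counts per university, which reproduces the 3 and 2.66 figures from the question.
import org.apache.spark.sql.functions.{avg, countDistinct}

val gradesDF = grades.toDF()

// students per course, per university
val studentsPerCourse = gradesDF
  .groupBy("university", "courseId")
  .agg(countDistinct("studentId").as("students"))

// average student count over the courses of each university
val avgStudents = studentsPerCourse
  .groupBy("university")
  .agg(avg("students").as("avg of students"))

avgStudents.show()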