How to join 2 RDDs in Spark Scala

I have 2 RDDs as below:
val rdd1 = spark.sparkContext.parallelize(Seq((123, List(("000000011119",20),("000000011120",30),("000000011121",50))),(234, List(("000000011119",20),("000000011120",30),("000000011121",50)))))
val rdd2 = spark.sparkContext.parallelize(Seq((123, List("000000011119","000000011120")),(234, List("000000011121","000000011120"))))
I want to sum the values in rdd1 for the codes that are listed under the same key in rdd2.
Output required:
RDD[(123,50),(234,80)]
Any help will be appreciated.

Really this is a join on the first element of each row together with the first element of each item in its list.
So I'd explode both RDDs into one row per item and join that way:
val flat1 = rdd1.flatMap(r => r._2.map(e => ((r._1, e._1), e._2))) // looks like ((234,000000011119),20)
val flat2 = rdd2.flatMap(r => r._2.map(e => ((r._1, e), true))) // looks like ((234,000000011121),true)
val res = flat1.join(flat2)
.map(r => (r._1._1, r._2._1)) // looks like (123, 30)
.reduceByKey(_ + _) // total each key group
Result with a .foreach(println)
scala> :pas
// Entering paste mode (ctrl-D to finish)
flat1.join(flat2)
.map(r => (r._1._1, r._2._1)) // looks like (123, 30)
.reduceByKey(_ + _) // total each key group
.foreach(println)
// Exiting paste mode, now interpreting.
(123,50)
(234,80)
As usual, this kind of thing is much simpler with a Dataset, so that would be my recommendation for the future.
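For reference, a minimal sketch of what a DataFrame version might look like. The column names (id, items, codes, code, value, total) and the local SparkSession setup are my own assumptions, not something from the question; in spark-shell the session already exists and getOrCreate() simply returns it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, sum}

// Assumption: a local session for experimenting.
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Same data as rdd1 and rdd2 in the question, with hypothetical column names.
val df1 = Seq(
  (123, List(("000000011119", 20), ("000000011120", 30), ("000000011121", 50))),
  (234, List(("000000011119", 20), ("000000011120", 30), ("000000011121", 50)))
).toDF("id", "items")
val df2 = Seq(
  (123, List("000000011119", "000000011120")),
  (234, List("000000011121", "000000011120"))
).toDF("id", "codes")

// One row per (id, code, value); tuple fields become struct fields _1 and _2.
val flat1 = df1
  .select($"id", explode($"items").as("item"))
  .select($"id", $"item._1".as("code"), $"item._2".as("value"))
// One row per (id, code).
val flat2 = df2.select($"id", explode($"codes").as("code"))

// Inner join on both columns, then total the values per id.
val totals = flat1.join(flat2, Seq("id", "code")).groupBy("id").agg(sum("value").as("total"))
totals.show()
```

The join on Seq("id", "code") plays the same role as the composite tuple key in the RDD version above.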

Related

Finding Max sum of marks each year

I am new to Scala and Spark. Can someone optimize the Scala code below, which finds the maximum total marks scored by students each year?
val m=sc.textFile("marks.csv")
val SumOfMarks=m.map(_.split(",")).mapPartitionsWithIndex {(idx, iter) => if (idx == 0) iter.drop(1) else iter}.map(l=>((l(0),l(1)),l(3).toInt)).reduceByKey(_+_).sortBy(line => (line._1._1, line._2), ascending=false)
var s:Int=0
var y:String="0"
for(i<-SumOfMarks){ if((i._1._1!=y) || (i._2==s && i._1._1==y)){ println(i);s=i._2;y=i._1._1}}
Input : marks.csv
year,student,sub,marks
2016,ram,maths,90
2016,ram,physics,86
2016,ram,chemistry,88
2016,raj,maths,84
2016,raj,physics,96
2016,raj,chemistry,98
2017,raghu,maths,96
2017,raghu,physics,98
2017,raghu,chemistry,94
2017,rajesh,maths,92
2017,rajesh,physics,98
2017,rajesh,chemistry,98
Output :
2017,raghu,288
2017,rajesh,288
2016,raj,278
I am not sure what you mean exactly by "Optimised", but a more "scala-y" and "spark-y" way of doing this might be as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._ // provides col, sum and dense_rank
// Read your data file as a CSV file with row headers.
val marksDF = spark.read.option("header","true").csv("marks.csv")
// Calculate the total marks for each student in each year. The new total mark column will be called "totMark"
val marksByStudentYear = marksDF.groupBy(col("year"), col("student")).agg(sum(col("marks")).as("totMark"))
// Rank the marks within each year. Highest Mark will get rank 1, second highest rank 2 and so on.
// A benefit of rank is that if two scores have the same mark, they will both get the
// same rank.
val marksRankedByYear = marksByStudentYear.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy($"totMark".desc)))
// Finally filter so that we only have the "top scores" (rank = 1) for each year,
// order by year and student name and display the result.
val topStudents = marksRankedByYear.filter($"rank" === 1).orderBy($"year", $"student")
topStudents.show
This will produce the following output in Spark-shell:
+----+-------+-------+----+
|year|student|totMark|rank|
+----+-------+-------+----+
|2016| raj| 278.0| 1|
|2017| raghu| 288.0| 1|
|2017| rajesh| 288.0| 1|
+----+-------+-------+----+
If you need a CSV displayed as per your question, you can use:
topStudents.collect.map(_.mkString(",")).foreach(println)
which produces:
2016,raj,278.0,1
2017,raghu,288.0,1
2017,rajesh,288.0,1
I have broken the process up into individual steps. This will allow you to see what is going on at each step by simply running show on an intermediate result. For example, to see what the spark.read.option... does, simply enter marksDF.show into spark-shell.
Since OP wanted an RDD version, here is one example. Probably it is not optimal, but it does give the correct result:
import org.apache.spark.rdd.RDD
// A Helper function which makes it slightly easier to view RDD content.
def dump[R] (rdd : RDD[R]) = rdd.collect.foreach(println)
val marksRdd = sc.textFile("marks.csv")
// A case class to annotate the content in the RDD
case class Report(year:Int, student:String, sub:String, mark:Int)
// Create the RDD as a series of Report objects - ignore the header.
val marksReportRdd = marksRdd.map(_.split(",")).mapPartitionsWithIndex {
(idx, iter) => if (idx == 0) iter.drop(1) else iter
}.map(r => Report(r(0).toInt,r(1),r(2),r(3).toInt))
// Group the data by year and student.
val marksGrouped = marksReportRdd.groupBy(report => (report.year, report.student))
// Calculate the total score for each student for each year by adding up the scores
// of each subject the student has taken in that year.
val totalMarkStudentYear = marksGrouped.map{ case (key, marks:Iterable[Report]) => (key, marks.foldLeft(0)((acc, rep) => acc + rep.mark))}
// Determine the highest score for each year.
val yearScoreHighest = totalMarkStudentYear.map{ case (key, score:Int) => (key._1, score) }.reduceByKey(math.max(_, _))
// Determine the list of students who have received the highest score in each year.
// This is achieved by joining the total marks each student received in each year
// to the highest score in each year.
// The join is performed on the key, which must be a Tuple2(year, score).
// To achieve this, both RDD's must be mapped to produce this key with a data attribute.
// The data attribute for the highest scores is a dummy value "x".
// The data attribute for the student scores is the student's name.
val highestRankStudentByYear = totalMarkStudentYear.map{ case (key, score) => ((key._1, score), key._2)}.join (yearScoreHighest.map (k => (k, "x")))
// Finally extract the year, student name and score from the joined RDD
// Sort by year and name.
val result = highestRankStudentByYear.map{ case (key, score) => (key._1, score._1, key._2)}.sortBy( r => (r._1, r._2))
// Show the final result.
dump(result)
The result of the above is:
(2016,raj,278)
(2017,raghu,288)
(2017,rajesh,288)
As before, you can view the intermediate RDDs simply by dumping them using the dump function. NB: the dump function takes an RDD. If you want to show the content of a DataFrame or DataSet, use its show method.
There is probably a more optimal solution than the one above, but it does the job.
Hopefully the RDD version will encourage you to use DataFrames and/or DataSets if you can. Not only is the code simpler, but:
Spark analyses DataFrame and DataSet queries as a whole and can optimise the overall transformation process. RDD transformations are not optimised in this way (i.e. they are executed as written, one after another). In practice, DataFrame and DataSet based processes will likely run faster (assuming you don't manually optimise the RDD equivalent).
DataSets and DataFrames allow schemas to varying degrees (e.g. named columns and data typing).
DataFrames and DataSets can be queried using SQL.
DataFrame and DataSet operations/methods are more aligned with SQL constructs
DataFrames and DataSets are easier to use than RDD's
DataSets (and RDD's) offer compile time error detection.
DataSets are the future direction.
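To illustrate the compile-time point, here is a rough sketch of the per-student totals written against a typed Dataset. The Report case class is borrowed from the RDD version above; the three sample rows and the local session setup are assumptions made just for this example.

```scala
import org.apache.spark.sql.SparkSession

// Same shape as the Report case class in the RDD version.
case class Report(year: Int, student: String, sub: String, mark: Int)

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("dataset-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows, a subset of marks.csv.
val ds = Seq(
  Report(2016, "raj", "maths", 84),
  Report(2016, "raj", "physics", 96),
  Report(2016, "ram", "maths", 90)
).toDS()

// Field access is checked by the compiler:
// ds.map(_.marks)   // would not compile, Report has no field called marks

// Total mark per (year, student), using typed lambdas instead of column names.
val totals = ds
  .groupByKey(r => (r.year, r.student))
  .mapGroups { (key, rows) => (key._1, key._2, rows.map(_.mark).sum) }
totals.show()
```

With a DataFrame, a misspelled column name such as col("markz") would only fail at runtime, when the query is analysed.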
Check out these links for more information:
https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/
https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash/
https://medium.com/#sachee/apache-spark-dataframe-vs-rdd-24a04d2eb1b9
or simply google "spark should i use rdd or dataframe"
All the best with your project.
Try it in the Scala spark-shell:
scala> val df = spark.read.format("csv").option("header", "true").load("/CSV file location/marks.csv")
scala> df.createOrReplaceTempView("record") // registerTempTable is deprecated since Spark 2.0
scala> sql(" select year, student, marks from (select year, student, marks, RANK() over (partition by year order by marks desc) rank From ( Select year, student, SUM(marks) as marks from record group by Year, student)) where rank =1 ").show
It will generate the following table
+----+-------+-----+
|year|student|marks|
+----+-------+-----+
|2016| raj|278.0|
|2017| raghu|288.0|
|2017| rajesh|288.0|
+----+-------+-----+
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions
//Finding Max sum of marks each year
object Marks2 {
def getSparkContext() = {
val conf = new SparkConf().setAppName("MaxMarksEachYear").setMaster("local")
val sc = new SparkContext(conf)
sc
}
def dump[R](rdd: RDD[R]) = rdd.collect.foreach(println)
def main(args: Array[String]): Unit = {
// System.setProperty("hadoop.home.dir", "D:\\Setup\\hadoop_home")
val sc = getSparkContext()
val inpRDD = sc.textFile("marks.csv")
val head = inpRDD.first()
val marksRdd = inpRDD.filter(record=> !record.equals(head)).map(rec => rec.split(","))
val marksByNameyear = marksRdd.map(rec =>((rec(0).toInt,rec(1)),rec(3).toInt))
//marksByNameyear.cache()
val aggMarksByYearName = marksByNameyear.reduceByKey(_+_)
val maxMarksByYear = aggMarksByYearName.map(s => (s._1._1, s._2)).reduceByKey(math.max(_, _))
// Key both sides by (year, total) so that a student is only matched against the top total of the same year.
// Joining on the total alone would also match students from other years who happen to have that total.
val markYearName = aggMarksByYearName.map { case ((year, name), total) => ((year, total), name) }
val maxByYearKey = maxMarksByYear.map { case (year, total) => ((year, total), ()) }
val yearMarksName = maxByYearKey.join(markYearName)
val result = yearMarksName.map { case ((year, total), (_, name)) => (year, name, total) }.sortBy(_._1)
//dump(markYearName)
dump(result)
}
}

Merging RDD records to obtain a single Row with multiple conditional counters

For context: I have multiple rows that have already been reduced by a certain set of keys. After that first reduce, I would like to merge them into a single row per, for example, date, carrying each of the previously calculated counters. That may not be clear from the description alone, so here is a simple example of what should happen.
(("Volvo", "T4", "2019-05-01"), 5)
(("Volvo", "T5", "2019-05-01"), 7)
(("Audi", "RS6", "2019-05-01"), 4)
And once those Row objects are merged:
date , volvo_counter , audi_counter
"2019-05-01" , 12 , 4
I reckon this is quite a corner case and that there may be different approaches but I was wondering if there was any solution within the same RDD so there's no need for multiple RDDs divided by counter.
What you want to do is a pivot. You talk about RDDs so I assume that your question is: "how to do a pivot with the RDD API?". As far as I know there is no built-in function in the RDD API that does it. You could do it yourself like this:
// let's create sample data
val rdd = sc.parallelize(Seq(
(("Volvo", "T4", "2019-05-01"), 5),
(("Volvo", "T5", "2019-05-01"), 7),
(("Audi", "RS6", "2019-05-01"), 4)
))
// If the keys are not known in advance, we compute their distinct values
val values = rdd.map(_._1._1).distinct.collect.toSeq
// values: Seq[String] = WrappedArray(Volvo, Audi)
// Finally we make the pivot and use reduceByKey on the sequence
val res = rdd
.map{ case ((make, model, date), counter) =>
date -> values.map(v => if(make == v) counter else 0)
}
.reduceByKey((a, b) => a.indices.map(i => a(i) + b(i)))
// which gives you this
res.collect.head
// (String, Seq[Int]) = (2019-05-01,Vector(12, 4))
Note that you can write much simpler code with the SparkSQL API:
// let's first transform the previously created RDD to a dataframe:
val df = rdd.map{ case ((a, b, c), d) => (a, b, c, d) }
.toDF("make", "model", "date", "counter")
// And then it's as simple as that:
df.groupBy("date")
.pivot("make")
.agg(sum("counter"))
.show
+----------+----+-----+
| date|Audi|Volvo|
+----------+----+-----+
|2019-05-01| 4| 12|
+----------+----+-----+
I think it's easier to do with DataFrame:
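// Case classes assumed by this example (field names inferred from the column references below)
case class Key(model: String, date: String)
case class Record(key: Key, value: Int)
val data = Seq(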
Record(Key("Volvo", "2019-05-01"), 5),
Record(Key("Volvo", "2019-05-01"), 7),
Record(Key("Audi", "2019-05-01"), 4)
)
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF()
val modelsExpr = df
.select("key.model").as("model")
.distinct()
.collect()
.map(r => r.getAs[String]("model"))
.map(m => sum(when($"key.model" === m, $"value").otherwise(0)).as(s"${m}_counter"))
df
.groupBy("key.date")
.agg(modelsExpr.head, modelsExpr.tail: _*)
.show(false)

Filtering RDD1 on the basis of RDD2

I have 2 RDDs in the format below:
RDD1
178,1
156,1
23,2
RDD2
34
178
156
Now I want to filter rdd1 on the basis of the values in rdd2, i.e. if 178 is present in both rdd1 and rdd2, then it should return those tuples from rdd1.
I have tried
val out = reversedl1.filter({ case(x,y) => x.contains(lines)})
where lines is my 2nd RDD and reversedl1 is the first, but it's not working.
I also tried
val abce = reversedl1.subtractByKey(lines)
val defg = reversedl1.subtractByKey(abce)
This is also not working.
Any suggestions?
You can convert rdd2 to key value pairs and then join with rdd1 on the keys:
val rdd1 = sc.parallelize(Seq((178, 1), (156, 1), (23, 2)))
val rdd2 = sc.parallelize(Seq(34, 178, 156))
(rdd1.join(rdd2.distinct().map(k => (k, null)))
// here create a dummy value to form a pair wise RDD so you can join
.map{ case (k, (v1, v2)) => (k, v1) } // drop the dummy value
).collect
// res11: Array[(Int, Int)] = Array((156,1), (178,1))
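If rdd2 is small enough to be collected to the driver (an assumption, the question doesn't say how large it is), an alternative sketch is to collect it into a Set, broadcast that Set and use a plain filter. This is a different technique from the join above: it avoids the shuffle, but only works when the set of keys fits in memory. The name allowed and the local session setup are mine.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("filter-sketch").getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(Seq((178, 1), (156, 1), (23, 2)))
val rdd2 = sc.parallelize(Seq(34, 178, 156))

// Collect the (assumed small) second RDD into a Set and ship it to the executors once.
val allowed = sc.broadcast(rdd2.collect().toSet)

// Keep only the tuples of rdd1 whose key appears in rdd2.
val filtered = rdd1.filter { case (k, _) => allowed.value.contains(k) }
filtered.collect().foreach(println)
// expected content (order may vary): (178,1) and (156,1)
```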

Joining 2 RDDs when one has an Option type as key

I have 2 RDDs I would like to join, which look like this:
val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]
Is there any way I can do a left outer join on them?
I have tried this but it does not work because the type of the key is different i.e Int, Option[Int]
q.leftOuterJoin(a)
The natural solution is to convert the Int to Option[Int] so they have the same type.
Following your example:
val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]
q.map{ case (k,v) => (Option(k),v) }.leftOuterJoin(a) // Option(k) rather than Some(k), so the key type is Option[Int] and matches a
If you want to recover the Int type at the output, you can do this:
q.map{ case (k,v) => (Option(k),v) }.leftOuterJoin(a).map{ case (k,v) => (k.get, v) }
Note that you can do ".get" without any problem since it is not possible to get None's there.
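A small end-to-end sketch of the conversion, in case it helps. Here V is assumed to be String and the sample values are made up; the question leaves both unspecified. The local session setup is also an assumption.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("option-join-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical data with V = String.
val a = sc.parallelize(Seq[(Option[Int], String)]((Some(1), "a1"), (None, "a0")))
val q = sc.parallelize(Seq((1, "q1"), (2, "q2")))

val joined = q
  .map { case (k, v) => (Option(k), v) }   // lift Int keys to Option[Int]
  .leftOuterJoin(a)                        // keys now have the same type on both sides
  .map { case (k, v) => (k.get, v) }       // safe: every left-side key was built from an Int

joined.collect().foreach(println)
// expected content (order may vary): (1,(q1,Some(a1))) and (2,(q2,None))
```

The row of a keyed by None never matches, because no key on the left side is None.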
One way to do this is to convert them into DataFrames and join.
Here is a simple example
import spark.implicits._
val a = spark.sparkContext.parallelize(Seq(
(Some(3), 33),
(Some(1), 11),
(Some(2), 22)
)).toDF("id", "value1")
val q = spark.sparkContext.parallelize(Seq(
(Some(3), 33)
)).toDF("id", "value2")
q.join(a, a("id") === q("id") , "leftouter").show

Access joined RDD fields in a readable way

I joined 2 RDDs and now when I'm trying to access the new RDD fields I need to treat them as Tuples. It leads to code that is not so readable. I tried to use 'type' in order to create some aliases however it doesn't work and I still need to access the fields as Tuples. Any idea how to make the code more readable?
for example - when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x=>x._2._2._5!='temp')
I would like to use names instead of 2,5 etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true))))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
.join(rdd2)
.map {
case (name, ((age, isProlonged), payments)) => (name, payments.sum)
}
.filter {
case (name, sum) => sum > 0
}
.collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction over RDDs and write SQL queries.
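A rough sketch of that SQL route, using the same sample data as the pattern matching example. The table names (people, payments), the column names and the local session setup are assumptions for illustration; the payments are stored one per row rather than as a list.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("sql-sketch").getOrCreate()
import spark.implicits._

// Named columns replace the positional tuple fields.
val people = Seq(("John", 28, true), ("Mary", 22, true)).toDF("name", "age", "isProlonged")
val payments = Seq(("John", 100), ("John", 200), ("John", -20)).toDF("name", "amount")

people.createOrReplaceTempView("people")
payments.createOrReplaceTempView("payments")

// Join, total the payments per person and keep only positive totals.
val totals = spark.sql("""
  SELECT p.name, SUM(pay.amount) AS total
  FROM people p
  JOIN payments pay ON p.name = pay.name
  GROUP BY p.name
  HAVING SUM(pay.amount) > 0
""")
totals.show()
```

Every field is referred to by name here, so there is no need for accessors like _2._2._5.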