Scala/Spark - Aggregating RDD

Just wondering how I can do the following:
Suppose I have an RDD containing (username, age, movieBought) for many usernames and some lines can have the same username and age but a different movieBought.
How can I remove the duplicated lines and transform it into (username, age, movieBought1, movieBought2...)?
Kind Regards

val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(_._3)))
val results = grouped.collect.toList
UPDATE (if each tuple also has number of movies item):
val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(m => (m._3, m._4))))
val results = grouped.collect.toList

I was gonna suggest collect and to list, but ka4eli beat me to it.
I guess you could also use groupBy / groupByKey followed by a reduce / reduceByKey operation. The downside, of course, is that the results (movie1, movie2, movie3, ...) are concatenated into one string instead of a List structure, which makes accessing individual movies difficult.
val group = rdd.map(x => ((x.name, x.age), x.movie)).groupBy(_._1)
val result = group.map(x => (x._1._1, x._1._2, x._2.map(y => y._2).reduce(_ + "," + _)))
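For what it's worth, here is a minimal sketch that also drops exact duplicate lines and keeps the movies as a List rather than one string. It assumes a local SparkSession and made-up sample data; the user names, movie titles and app name are hypothetical and not from the question.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("group-movies").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample data in the (username, age, movieBought) shape from the question.
val rdd = sc.parallelize(Seq(
  ("alice", 30, "Alien"),
  ("alice", 30, "Brazil"),
  ("alice", 30, "Alien"), // exact duplicate line
  ("bob", 25, "Casablanca")
))

val grouped = rdd
  .distinct() // remove exact duplicate lines first
  .map { case (user, age, movie) => ((user, age), movie) }
  .groupByKey() // collect all movies per (username, age)
  .map { case ((user, age), movies) => (user, age, movies.toList.sorted) }

// Sort on the driver only to make the printed order stable.
val results = grouped.collect().toList.sortBy(_._1)
results.foreach(println)
```

Sorting the movie list is only there to make the output repeatable; drop it if the order does not matter.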

Related

Non Deterministic Behaviour of UNION of RDD in Spark

I'm performing a union operation on 3 RDDs. I'm aware that union doesn't preserve ordering, but in my case the behaviour is quite weird. Can someone explain what's wrong in my code?
I have a DataFrame of rows (myDF), which I converted to an RDD:
val myRdd = myDF.rdd.map(row => row.toSeq.toList.mkString(":")).map(rec => (2, rec))
myRdd.collect
/*
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
*/
val rowCount = myRdd.count() // Count of Records in myRdd
val header = "name:country:date:nextdate:1" // random header
// Generating Header Rdd
val headerRdd = sparkContext.parallelize(Array(header), 1).map(rec => (1, rec))
//Generating Trailer Rdd
val trailerRdd = sparkContext.parallelize(Array("T" + ":" + rowCount),1).map(rec => (3, rec))
//Performing Union
val unionRdd = headerRdd.union(myRdd).union(trailerRdd).map(rec => rec._2)
unionRdd.saveAsTextFile("pathLocation")
As union doesn't preserve ordering, I did not expect it to give the result below.
Output
name:country:date:nextdate:1
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
How is it possible to get the above output without using any sorting, such as:
sortByKey(true, 1)
But when I remove the map from headerRdd, myRdd and trailerRdd, the order is:
Deepak:7321c:Stack Overflow:AIR:INDIA:AIR999:N:2020-04-22T10:28:33.087
name:country:date:nextdate:1
Veeru:596621c:Medium:POWER:USA:LN49:Y:2020-14-22T10:38:43.287
Rajeev:1612801:Udemy:LEARN:ITALY:P4399:N:2020-04-22T13:08:43.887
T:3
What could be the reason for this behaviour?
In Spark, the elements within a particular partition are unordered; however, the partitions themselves are ordered.
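To make the partition ordering visible, here is a minimal sketch, assuming a local SparkSession and made-up strings (none of the data or names below come from the question). For RDDs that do not share a partitioner, union simply lines up the partitions of the first RDD, then those of the second, and so on; glom() shows that layout.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("union-order").getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-ins for the header, body and trailer RDDs.
val headerRdd = sc.parallelize(Seq("header"), 1)
val bodyRdd = sc.parallelize(Seq("row1", "row2", "row3"), 2)
val trailerRdd = sc.parallelize(Seq("trailer"), 1)

val unionRdd = headerRdd.union(bodyRdd).union(trailerRdd)

// 1 + 2 + 1 input partitions are kept as-is, in that order.
println(unionRdd.getNumPartitions)

// glom() turns every partition into an array, so the layout can be inspected.
unionRdd.glom().collect().zipWithIndex.foreach { case (part, idx) =>
  println(s"partition $idx: ${part.mkString(", ")}")
}

// collect() walks the partitions in index order.
val collected = unionRdd.collect().toList
println(collected)
```

This only shows where each input's partitions end up. The order of rows inside the partitions of an RDD derived from a DataFrame depends on how that DataFrame was produced, so it is still safer to sort explicitly if the order matters.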

Finding Max sum of marks each year

I am new to Scala and Spark, can someone optimize below Scala code for finding maximum marks scored by students each year
val m=sc.textFile("marks.csv")
val SumOfMarks=m.map(_.split(",")).mapPartitionsWithIndex {(idx, iter) => if (idx == 0) iter.drop(1) else iter}.map(l=>((l(0),l(1)),l(3).toInt)).reduceByKey(_+_).sortBy(line => (line._1._1, line._2), ascending=false)
var s:Int=0
var y:String="0"
for(i<-SumOfMarks){ if((i._1._1!=y) || (i._2==s && i._1._1==y)){ println(i);s=i._2;y=i._1._1}}
Input : marks.csv
year,student,sub,marks
2016,ram,maths,90
2016,ram,physics,86
2016,ram,chemistry,88
2016,raj,maths,84
2016,raj,physics,96
2016,raj,chemistry,98
2017,raghu,maths,96
2017,raghu,physics,98
2017,raghu,chemistry,94
2017,rajesh,maths,92
2017,rajesh,physics,98
2017,rajesh,chemistry,98
Output :
2017,raghu,288
2017,rajesh,288
2016,raj,278
I am not sure what you mean exactly by "Optimised", but a more "scala-y" and "spark-y" way of doing this might be as follows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank, sum}
// Read your data file as a CSV file with row headers.
val marksDF = spark.read.option("header","true").csv("marks.csv")
// Calculate the total marks for each student in each year. The new total mark column will be called "totMark"
val marksByStudentYear = marksDF.groupBy(col("year"), col("student")).agg(sum(col("marks")).as("totMark"))
// Rank the marks within each year. Highest Mark will get rank 1, second highest rank 2 and so on.
// A benefit of rank is that if two scores have the same mark, they will both get the
// same rank.
val marksRankedByYear = marksByStudentYear.withColumn("rank", dense_rank().over(Window.partitionBy("year").orderBy($"totMark".desc)))
// Finally filter so that we only have the "top scores" (rank = 1) for each year,
// order by year and student name and display the result.
val topStudents = marksRankedByYear.filter($"rank" === 1).orderBy($"year", $"student")
topStudents.show
This will produce the following output in Spark-shell:
+----+-------+-------+----+
|year|student|totMark|rank|
+----+-------+-------+----+
|2016| raj| 278.0| 1|
|2017| raghu| 288.0| 1|
|2017| rajesh| 288.0| 1|
+----+-------+-------+----+
If you need a CSV displayed as per your question, you can use:
topStudents.collect.map(_.mkString(",")).foreach(println)
which produces:
2016,raj,278.0,1
2017,raghu,288.0,1
2017,rajesh,288.0,1
I have broken the process up into individual steps. This will allow you to see what is going on at each step by simply running show on an intermediate result. For example, to see what the spark.read.option... does, simply enter marksDF.show into spark-shell.
Since OP wanted an RDD version, here is one example. Probably it is not optimal, but it does give the correct result:
import org.apache.spark.rdd.RDD
// A Helper function which makes it slightly easier to view RDD content.
def dump[R] (rdd : RDD[R]) = rdd.collect.foreach(println)
val marksRdd = sc.textFile("marks.csv")
// A case class to annotate the content in the RDD
case class Report(year:Int, student:String, sub:String, mark:Int)
// Create the RDD as a series of Report objects - ignore the header.
val marksReportRdd = marksRdd.map(_.split(",")).mapPartitionsWithIndex {
(idx, iter) => if (idx == 0) iter.drop(1) else iter
}.map(r => Report(r(0).toInt,r(1),r(2),r(3).toInt))
// Group the data by year and student.
val marksGrouped = marksReportRdd.groupBy(report => (report.year, report.student))
// Calculate the total score for each student for each year by adding up the scores
// of each subject the student has taken in that year.
val totalMarkStudentYear = marksGrouped.map{ case (key, marks:Iterable[Report]) => (key, marks.foldLeft(0)((acc, rep) => acc + rep.mark))}
// Determine the highest score for each year.
val yearScoreHighest = totalMarkStudentYear.map{ case (key, score:Int) => (key._1, score) }.reduceByKey(math.max(_, _))
// Determine the list of students who have received the highest score in each year.
// This is achieved by joining the total marks each student received in each year
// to the highest score in each year.
// The join is performed on the key, which must be a Tuple2(year, score).
// To achieve this, both RDD's must be mapped to produce this key with a data attribute.
// The data attribute for the highest scores is a dummy value "x".
// The data attribute for the student scores is the student's name.
val highestRankStudentByYear = totalMarkStudentYear.map{ case (key, score) => ((key._1, score), key._2)}.join (yearScoreHighest.map (k => (k, "x")))
// Finally extract the year, student name and score from the joined RDD
// Sort by year and name.
val result = highestRankStudentByYear.map{ case (key, score) => (key._1, score._1, key._2)}.sortBy( r => (r._1, r._2))
// Show the final result.
dump(result)
The result of the above is:
(2016,raj,278)
(2017,raghu,288)
(2017,rajesh,288)
As before, you can view the intermediate RDDs simply by dumping them using the dump function. NB: the dump function takes an RDD. If you want to show the content of a DataFrame or DataSet, use its show method.
It is probable that there is a more optimal solution than the one above, but it does the job.
Hopefully the RDD version will encourage you to use DataFrames and/or DataSets if you can. Not only is the code simpler, but:
Spark can analyse DataFrame and DataSet transformations as a whole and optimise the overall process. RDD transformations are not optimised (i.e. they are executed one after another as written). In other words, DataFrame and DataSet based processes will likely run faster (assuming you don't manually optimise the RDD equivalent).
DataSets and DataFrames allow schemas to varying degrees (e.g. named columns and data typing).
DataFrames and DataSets can be queried using SQL.
DataFrame and DataSet operations/methods are more aligned with SQL constructs
DataFrames and DataSets are easier to use than RDD's
DataSets (and RDD's) offer compile time error detection.
DataSets are the future direction.
Check out these couple of links for more information:
https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/
https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash/
https://medium.com/#sachee/apache-spark-dataframe-vs-rdd-24a04d2eb1b9
or simply google "spark should i use rdd or dataframe"
All the best with your project.
Try It on SCALA spark-shell
scala> val df = spark.read.format("csv").option("header", "true").load("/CSV file location/marks.csv")
scala> df.createOrReplaceTempView("record")
scala> sql(" select year, student, marks from (select year, student, marks, RANK() over (partition by year order by marks desc) rank From ( Select year, student, SUM(marks) as marks from record group by Year, student)) where rank =1 ").show
It will generate the following table
+----+-------+-----+
|year|student|marks|
+----+-------+-----+
|2016| raj|278.0|
|2017| raghu|288.0|
|2017| rajesh|288.0|
+----+-------+-----+
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions
//Finding Max sum of marks each year
object Marks2 {
def getSparkContext() = {
val conf = new SparkConf().setAppName("MaxMarksEachYear").setMaster("local")
val sc = new SparkContext(conf)
sc
}
def dump[R](rdd: RDD[R]) = rdd.collect.foreach(println)
def main(args: Array[String]): Unit = {
// System.setProperty("hadoop.home.dir", "D:\\Setup\\hadoop_home")
val sc = getSparkContext()
val inpRDD = sc.textFile("marks.csv")
val head = inpRDD.first()
val marksRdd = inpRDD.filter(record=> !record.equals(head)).map(rec => rec.split(","))
val marksByNameyear = marksRdd.map(rec =>((rec(0).toInt,rec(1)),rec(3).toInt))
//marksByNameyear.cache()
val aggMarksByYearName = marksByNameyear.reduceByKey(_+_)
val maxMarksByYear = aggMarksByYearName.map(s => (s._1._1, s._2)).reduceByKey(math.max(_, _))
val markYearName = aggMarksByYearName.map(s => (s._2.toInt,s._1._2))
val marksAndYear = maxMarksByYear.map(s => (s._2.toInt,s._1))
val tt = sc.broadcast(marksAndYear.collect().toMap)
marksAndYear.flatMap {case(key,value) => tt.value.get(key).map {other => (other,value, key) } }
val yearMarksName = marksAndYear.leftOuterJoin(markYearName)
val result = yearMarksName.map(s =>(s._2._1,s._2._2,s._1)).sortBy(f=>f._3, true)
//dump(markYearName)
dump(result)
}
}

Join two Dataframe without a common field in Spark-scala

I have two DataFrames in Spark Scala, one of which consists of a single column. I have to join them, but they have no column in common. The number of rows is the same.
val userFriends=userJson.select($"friends",$"user_id")
val x = userFriends.select($"friends")
.rdd
.map(x => x.getList(0).toArray.map(_.toString))
val y = x.map(z=>z.count(z=>true)).toDF("friendCount")
I have to join userFriends with y
It's not possible to join them without a common field, unless you can rely on an ordering; in that case you can add a row number (with a window function) to both DataFrames and join on the row number.
But in your case this does not seem necessary. Just keep the user_id column in your DataFrame; something like this should work:
val userFriends=userJson.select($"friends",$"user_id")
val result_df =
userFriends.select($"friends",$"user_id")
.rdd
.map(x => (x.getList(0).toArray.map(_.toString).count(z => true), x.getInt(1)))
.toDF("friendsCount","user_id")
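For the case where the row order really is the only link between two DataFrames, here is a minimal sketch of a positional join. Note that it swaps in RDD zipWithIndex in place of the window-function row number mentioned above, since zipWithIndex numbers rows by partition order and position. It assumes a local SparkSession, and the data and column values are made up.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("positional-join").getOrCreate()
import spark.implicits._

// Hypothetical inputs with the same number of rows and no common column.
val left = Seq("u1", "u2", "u3").toDF("user_id")
val right = Seq(5, 0, 2).toDF("friendCount")

// zipWithIndex attaches a 0-based position to every row; use it as the join key.
val leftIdx = left.rdd.zipWithIndex().map { case (row, i) => (i, row.getString(0)) }
val rightIdx = right.rdd.zipWithIndex().map { case (row, i) => (i, row.getInt(0)) }

val joined = leftIdx
  .join(rightIdx) // pair up rows that share a position
  .sortByKey() // restore the original row order
  .values
  .toDF("user_id", "friendCount")

joined.show()
```

This is only as reliable as the row order of the two inputs, which is why keeping user_id in the DataFrame, as shown above, is the safer choice when it is available.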

How to do aggregation on multiple columns at once in Spark

I have a DataFrame with multiple columns. I want to group by one of the columns and aggregate the other columns all at once. Let's say the table has 4 columns, cust_id, f1, f2 and f3, and I want to group by cust_id and then get avg(f1), avg(f2) and avg(f3). The real table will have many columns. Any hints?
The following code is a good start, but as I have many columns it may not be a good idea to write them out manually.
df.groupBy("cust_id").agg(sum("f1"), sum("f2"), sum("f3"))
Maybe you can try mapping over a list of the column names:
val groupCol = "cust_id"
val aggCols = (df.columns.toSet - groupCol).map(
colName => avg(colName).as(colName + "_avg")
).toList
df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*)
Alternatively, if needed, you can also match the schema and build the aggregations based on the type:
val aggCols = df.schema.collect {
case StructField(colName, IntegerType, _, _) => avg(colName).as(colName + "_avg")
case StructField(colName, StringType, _, _) => first(colName).as(colName + "_first")
}
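Putting the schema-based variant together end to end might look like the sketch below. It assumes a local SparkSession and a small made-up DataFrame (the label column and all values are hypothetical), and it filters out the grouping column first, because cust_id is itself an integer column and would otherwise be averaged too.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, first}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField}

// Assumption: a local session; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("agg-many-columns").getOrCreate()
import spark.implicits._

// Hypothetical data: two integer feature columns and one string column.
val df = Seq(
  (1, 10, 20, "a"),
  (1, 30, 40, "a"),
  (2, 5, 15, "b")
).toDF("cust_id", "f1", "f2", "label")

val groupCol = "cust_id"

// Build one aggregation per remaining column, chosen by the column's data type.
val aggCols = df.schema.fields.toList
  .filter(_.name != groupCol)
  .collect {
    case StructField(colName, IntegerType, _, _) => avg(colName).as(colName + "_avg")
    case StructField(colName, StringType, _, _) => first(colName).as(colName + "_first")
  }

// agg takes one Column plus varargs, hence the head / tail split.
val result = df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*).orderBy(groupCol)
result.show()
```

Columns whose type matches neither case are silently skipped by collect, so add further cases (for example DoubleType) as your schema requires.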

Access joined RDD fields in a readable way

I joined 2 RDDs and now when I'm trying to access the new RDD fields I need to treat them as Tuples. It leads to code that is not so readable. I tried to use 'type' in order to create some aliases however it doesn't work and I still need to access the fields as Tuples. Any idea how to make the code more readable?
for example - when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of 2,5 etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true))))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))
rdd1
.join(rdd2)
.map {
case (name, ((age, isProlonged), payments)) => (name, payments.sum)
}
.filter {
case (name, sum) => sum > 0
}
.collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction over RDDs and write SQL queries.
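A minimal sketch of that DataFrame / SQL route, assuming a local SparkSession and the same toy data as in the pattern-matching example, flattened into one payment per row. The view names, column names and app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("named-join").getOrCreate()
import spark.implicits._

// Named columns replace the positional tuple fields.
val people = Seq(("John", 28, true), ("Mary", 22, true)).toDF("name", "age", "isProlonged")
val payments = Seq(("John", 100), ("John", 200), ("John", -20)).toDF("name", "amount")

people.createOrReplaceTempView("people")
payments.createOrReplaceTempView("payments")

// The join and the filter now refer to fields by name instead of _2._2._5.
val totals = spark.sql(
  """SELECT p.name, SUM(pay.amount) AS total
    |FROM people p
    |JOIN payments pay ON p.name = pay.name
    |GROUP BY p.name
    |HAVING SUM(pay.amount) > 0""".stripMargin)

totals.show()
```

As in the RDD version, Mary drops out because she has no payments to join against.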