How to apply customizable Aggregator on Spark Dataset? - scala

I have a Spark dataset with the following schema and student records:
id | name | subject | score
1 | Tom | Math | 99
1 | Tom | Math | 88
1 | Tom | Physics | 77
2 | Amy | Math | 66
My goal is to transform this dataset into another one that shows, for each student, the list of the highest score in every subject:
id | name | subject_score_list
1 | Tom | [(Math, 99), (Physics, 77)]
2 | Amy | [(Math, 66)]
I've decided to use an Aggregator to do the transformation after turning this dataset into ((id, name), (subject, score)) key-value pairs.
For the buffer I tried to use a mutable Map[String, Integer] so I can update the score if the subject already exists and the new score is higher. Here's what the aggregator looks like:
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
type StudentSubjectPair = ((String, String), (String, Integer))
type SubjectMap = collection.mutable.Map[String, Integer]
type SubjectList = List[(String, Integer)]
val StudentSubjectAggregator = new Aggregator[StudentSubjectPair, SubjectMap, SubjectList] {
  def zero: SubjectMap = collection.mutable.Map[String, Integer]()

  def reduce(buf: SubjectMap, input: StudentSubjectPair): SubjectMap = {
    val (subject, score) = input._2
    if (buf.contains(subject))
      buf(subject) = math.max(buf(subject), score)
    else
      buf(subject) = score
    buf
  }

  def merge(b1: SubjectMap, b2: SubjectMap): SubjectMap = {
    for ((subject, score) <- b2) {
      if (b1.contains(subject))
        b1(subject) = math.max(b1(subject), score)
      else
        b1(subject) = score
    }
    b1
  }

  def finish(buf: SubjectMap): SubjectList = buf.toList

  override def bufferEncoder: Encoder[SubjectMap] = ExpressionEncoder[SubjectMap]()
  override def outputEncoder: Encoder[SubjectList] = ExpressionEncoder[SubjectList]()
}.toColumn.name("subject_score_list")
I use an Aggregator because I find it customizable: if I want the mean score per subject instead, I can simply change the reduce and merge functions.
I'm expecting answers to two questions in this post:
1. Is an Aggregator a good way to get this job done? Is there a simpler way to get the same output?
2. What's the correct encoder for collection.mutable.Map[String, Integer] and List[(String, Integer)]? I always get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 37.0 failed 1 times, most recent failure: Lost task 0.0 in stage 37.0 (TID 231, localhost, executor driver):
java.lang.ClassCastException: scala.collection.immutable.HashMap$HashTrieMap cannot be cast to scala.collection.mutable.Map
at $anon$1.merge(<console>:54)
I'd appreciate any input and help, thanks!

I think you can achieve your desired result with the DataFrame API.
val df = Seq((1, "Tom", "Math", 99),
  (1, "Tom", "Math", 88),
  (1, "Tom", "Physics", 77),
  (2, "Amy", "Math", 66)).toDF("id", "name", "subject", "score")
Group by id, name, and subject to get the max score, then group by id, name with a collect_list over a map of subject, score:
df.groupBy("id","name", "subject").agg(max("score").as("score")).groupBy("id","name").
agg(collect_list(map($"subject",$"score")).as("subject_score_list"))
+---+----+--------------------+
| id|name| subject_score_list|
+---+----+--------------------+
| 1| Tom|[[Physics -> 77],...|
| 2| Amy| [[Math -> 66]]|
+---+----+--------------------+
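If you would rather keep the typed Aggregator, note that the ClassCastException in your second question comes from the buffer encoder: Spark deserializes the map buffer back as an immutable HashMap (as your stack trace shows), so the cast to collection.mutable.Map fails in merge. Below is a minimal sketch that sidesteps the problem by using an immutable Map[String, Int] buffer and a kryo buffer encoder; the names and encoder choices are my assumptions, not taken from your code, and I haven't benchmarked it.
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

// Reuses the StudentSubjectPair alias from the question; the buffer is an immutable Map.
val studentSubjectMaxAgg = new Aggregator[StudentSubjectPair, Map[String, Int], List[(String, Int)]] {
  def zero: Map[String, Int] = Map.empty[String, Int]

  // Keep the highest score seen so far for each subject.
  def reduce(buf: Map[String, Int], input: StudentSubjectPair): Map[String, Int] = {
    val (subject, score) = input._2
    buf + (subject -> math.max(buf.getOrElse(subject, Int.MinValue), score))
  }

  def merge(b1: Map[String, Int], b2: Map[String, Int]): Map[String, Int] =
    b2.foldLeft(b1) { case (acc, (subject, score)) =>
      acc + (subject -> math.max(acc.getOrElse(subject, Int.MinValue), score))
    }

  def finish(buf: Map[String, Int]): List[(String, Int)] = buf.toList

  // Kryo for the intermediate buffer sidesteps the immutable/mutable cast issue entirely.
  def bufferEncoder: Encoder[Map[String, Int]] = Encoders.kryo[Map[String, Int]]
  def outputEncoder: Encoder[List[(String, Int)]] = ExpressionEncoder[List[(String, Int)]]()
}.toColumn.name("subject_score_list")
If you prefer the DataFrame answer above but want (subject, score) pairs rather than single-entry maps, collect_list(struct($"subject", $"score")) after the first groupBy should give an array of structs that is closer to your desired output.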

How to call FileSystem from udf

What I expected
The goal is to add a column with modification time to each DataFrame row.
Given
val data = spark.read.parquet("path").withColumn("input_file_name", input_file_name())
+----+------------------------+
| id | input_file_name |
+----+------------------------+
| 1 | hdfs://path/part-00001 |
| 2 | hdfs://path/part-00001 |
| 3 | hdfs://path/part-00002 |
+----+------------------------+
Expected
+----+------------------------+
| id | modification_time |
+----+------------------------+
| 1 | 2000-01-01Z00:00+00:00 |
| 2 | 2000-01-01Z00:00+00:00 |
| 3 | 2000-01-02Z00:00+00:00 |
+----+------------------------+
What I tried
I wrote a function to get the modification time
import org.apache.hadoop.fs.FileSystem

def getModificationTime(path: String): Long = {
  FileSystem.get(spark.sparkContext.hadoopConfiguration)
    .getFileStatus(new org.apache.hadoop.fs.Path(path))
    .getModificationTime()
}
val modificationTime = getModificationTime("hdfs://srsdev/projects/khajiit/data/OfdCheques2/date=2020.02.01/part-00002-04b9e4c8-5916-4bb2-b9ff-757f843a0142.c000.snappy.parquet")
modificationTime: Long = 1580708401253
... but it does not work in a query:
def input_file_modification_time = udf((path: String) => getModificationTime(path))
data.select(input_file_modification_time($"input_file_name") as "modification_time").show(20, false)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 54.0 failed 4 times, most recent failure: Lost task 0.3 in stage 54.0 (TID 408, srs-hdp-s1.dev.kontur.ru, executor 3): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$input_file_modification_time$1: (string) => bigint)
The problem is that spark is null inside the UDF, because it only exists on the driver. Another problem is that Hadoop's Configuration is not serializable, so you cannot easily enclose it in the UDF. But there is a workaround using org.apache.spark.SerializableWritable:
import org.apache.spark.SerializableWritable
import org.apache.hadoop.conf.Configuration
val conf = new SerializableWritable(spark.sparkContext.hadoopConfiguration)
def getModificationTime(path: String, conf: SerializableWritable[Configuration]): Long = {
  org.apache.hadoop.fs.FileSystem.get(conf.value)
    .getFileStatus(new org.apache.hadoop.fs.Path(path))
    .getModificationTime()
}

def input_file_modification_time(conf: SerializableWritable[Configuration]) =
  udf((path: String) => getModificationTime(path, conf))
data.select(input_file_modification_time(conf)($"input_file_name") as "modification_time").show(20, false)
Note: invoking getModificationTime for each row of your DataFrame will have a performance impact.
I modified your code to fetch the file metadata once and store it in files: Map[String, Long], then created a UDF input_file_modification_time that looks the timestamp up in that map.
Please check the below code.
scala> val df = spark.read.format("parquet").load("/tmp/par")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> :paste
// Entering paste mode (ctrl-D to finish)
def getModificationTime(path: String): Long = {
  FileSystem.get(spark.sparkContext.hadoopConfiguration)
    .getFileStatus(new org.apache.hadoop.fs.Path(path))
    .getModificationTime()
}
// Exiting paste mode, now interpreting.
getModificationTime: (path: String)Long
scala> implicit val files = df.inputFiles.flatMap(name => Map(name -> getModificationTime(name))).toMap
files: scala.collection.immutable.Map[String,Long] = Map(file:///tmp/par/part-00000-c6360540-c56d-48c4-8795-05a9c0ac4d18-c000_2.snappy.parquet -> 1588080295000, file:///tmp/par/part-00000-c6360540-c56d-48c4-8795-05a9c0ac4d18-c000_3.snappy.parquet -> 1588080299000, file:///tmp/par/part-00000-c6360540-c56d-48c4-8795-05a9c0ac4d18-c000_4.snappy.parquet -> 1588080302000, file:///tmp/par/part-00000-c6360540-c56d-48c4-8795-05a9c0ac4d18-c000.snappy.parquet -> 1588071322000)
scala> :paste
// Entering paste mode (ctrl-D to finish)
def getTime(fileName: String)(implicit files: Map[String, Long]): Long = {
  files.getOrElse(fileName, 0L)
}
// Exiting paste mode, now interpreting.
getTime: (fileName: String)(implicit files: Map[String,Long])Long
scala> val input_file_modification_time = udf(getTime _)
input_file_modification_time: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> df.withColumn("createdDate",input_file_modification_time(input_file_name)).show
+---+-------------+
| id| createdDate|
+---+-------------+
| 1|1588080295000|
| 2|1588080295000|
| 3|1588080295000|
| 4|1588080295000|
| 5|1588080295000|
| 6|1588080295000|
| 7|1588080295000|
| 8|1588080295000|
| 9|1588080295000|
| 10|1588080295000|
| 11|1588080295000|
| 12|1588080295000|
| 13|1588080295000|
| 14|1588080295000|
| 15|1588080295000|
| 16|1588080295000|
| 17|1588080295000|
| 18|1588080295000|
| 19|1588080295000|
| 20|1588080295000|
+---+-------------+
only showing top 20 rows
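If you would rather avoid a UDF altogether, another option is to build a small lookup DataFrame from the file metadata and join it on the input_file_name column. This is only a sketch (untested against your data, and it assumes the paths returned by inputFiles match what input_file_name() produced, which is worth verifying in your environment):
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

// Collect (path -> modification time) once on the driver; the file list is small.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val fileTimes = data.inputFiles.toSeq
  .map(p => (p, fs.getFileStatus(new Path(p)).getModificationTime))
  .toDF("input_file_name", "modification_time")

// data already carries input_file_name (see the question), so a broadcast join is enough.
val withTime = data.join(broadcast(fileTimes), Seq("input_file_name"), "left")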

How to append a string column to array string column in Scala Spark without using UDF?

I have a table which has a column containing an array, like this:
Student_ID | Subject_List | New_Subject
1 | [Mat, Phy, Eng] | Chem
I want to append the new subject into the subject list and get the new list.
Creating the dataframe -
val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")
I have tried this with a UDF as follows:
def append_list = (arr: Seq[String], s: String) => {
  arr :+ s
}
val append_list_UDF = udf(append_list)
val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List",$"New_Subject"))
With UDF, I get the required output
Student_ID | Subject_List | New_Subject | New_List
1 | [Mat, Phy, Eng] | Chem | [Mat, Phy, Eng, Chem]
Can we do it without a UDF? Thanks.
In Spark 2.4 or later, a combination of array and concat should do the trick:
import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column
def append(arr: Column, col: Column) = concat(arr, array(col))
df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+
|Student_ID| Subject_List|New_Subject| New_List|
+----------+---------------+-----------+--------------------+
| 1|[Mat, Phy, Eng]| Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+
but I wouldn't expect serious performance gains here.
Another option is to explode the existing list, union the new subject onto it, and collect it back per student:
val df = Seq((1, Array("Mat", "Phy", "Eng"), "Chem"),
  (2, Array("Hindi", "Bio", "Eng"), "IoT"),
  (3, Array("Python", "R", "scala"), "C")).toDF("Student_ID", "Subject_List", "New_Subject")
df.show(false)

val final_df = df.withColumn("exploded", explode($"Subject_List"))
  .select($"Student_ID", $"exploded")
  .union(df.select($"Student_ID", $"New_Subject"))
  .groupBy($"Student_ID")
  .agg(collect_list($"exploded") as "Your_New_List")

final_df.show(false)

Merging rows into a single struct column in spark scala has efficiency problems, how do we do it better?

I am trying to speed up and limit the cost of taking several columns and their values and inserting them into a map in the same row. This is a requirement because a legacy system reads from this job and isn't yet ready to be refactored. There is also another map with some data that needs to be combined with this one.
Currently we have a few solutions, all of which seem to result in about the same run time on the same cluster, with around 1 TB of data stored in Parquet:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import spark.implicits._
def jsonToMap(s: String, map: Map[String, String]): Map[String, String] = {
  implicit val formats = org.json4s.DefaultFormats
  val jsonMap = if(!s.isEmpty){
    parse(s).extract[Map[String, String]]
  } else {
    Map[String, String]()
  }
  if(map != null){
    map ++ jsonMap
  } else {
    jsonMap
  }
}
val udfJsonToMap = udf(jsonToMap _)

def addMap(key: String, value: String, map: Map[String, String]): Map[String, String] = {
  if(map == null) {
    Map(key -> value)
  } else {
    map + (key -> value)
  }
}
val addMapUdf = udf(addMap _)

val output = raw.columns.foldLeft(raw.withColumn("allMap", typedLit(Map.empty[String, String]))) { (memoDF, colName) =>
  if(colName.startsWith("columnPrefix/")){
    memoDF.withColumn("allMap", when(col(colName).isNotNull, addMapUdf(substring_index(lit(colName), "/", -1), col(colName), col("allMap"))))
  } else if(colName.equals("originalMap")){
    memoDF.withColumn("allMap", when(col(colName).isNotNull, udfJsonToMap(col(colName), col("allMap"))))
  } else {
    memoDF
  }
}
This takes about 1 hour on 9 m5.xlarge instances.
val resourceTagColumnNames = raw.columns.filter(colName => colName.startsWith("columnPrefix/"))

def structToMap: Row => Map[String, String] = { row =>
  row.getValuesMap[String](resourceTagColumnNames)
}
val structToMapUdf = udf(structToMap)

val experiment = raw
  .withColumn("allStruct", struct(resourceTagColumnNames.head, resourceTagColumnNames.tail: _*))
  .select("allStruct")
  .withColumn("allMap", structToMapUdf(col("allStruct")))
  .select("allMap")
This also runs in about 1 hour on the same cluster.
This code all works, but it isn't fast enough: it takes about 10 times longer than every other transform we have right now, and it is a bottleneck for us.
Is there another way to get this result that is more efficient?
Edit: I have also tried limiting the data by a key; however, because the values in the columns I am merging can change even when the key stays the same, I cannot limit the data size without risking data loss.
TL;DR: using only Spark SQL built-in functions can significantly speed up the computation.
As explained in this answer, Spark SQL native functions are more performant than user-defined functions, so we can try to implement the solution to your problem using only Spark SQL native functions.
I show two main versions of the implementation: one using all the SQL functions available in the latest version of Spark at the time I wrote this answer, which is Spark 3.0, and another using only the SQL functions that existed in the Spark version current when the question was asked, i.e. Spark 2.3. All the functions used in the second version are also available in Spark 2.2.
Spark 3.0 implementation with SQL functions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}
val mapFromPrefixedColumns = map_filter(
  map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c.dropWhile(_ != '/').tail), col(c))): _*),
  (_, v) => v.isNotNull
)

val mapFromOriginalMap = when(col("originalMap").isNotNull && col("originalMap").notEqual(""),
  from_json(col("originalMap"), MapType(StringType, StringType))
).otherwise(
  map()
)

val comprehensiveMapExpr = map_concat(mapFromPrefixedColumns, mapFromOriginalMap)
raw.withColumn("allMap", comprehensiveMapExpr)
Spark 2.2 implementation with SQL functions
In Spark 2.2, we don't have the functions map_concat (available in Spark 2.4) and map_filter (available in Spark 3.0), so I replace them with user-defined functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}
def filterNull(map: Map[String, String]): Map[String, String] = map.toSeq.filter(_._2 != null).toMap
val filter_null_udf = udf(filterNull _)

def mapConcat(map1: Map[String, String], map2: Map[String, String]): Map[String, String] = map1 ++ map2
val map_concat_udf = udf(mapConcat _)

val mapFromPrefixedColumns = filter_null_udf(
  map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c.dropWhile(_ != '/').tail), col(c))): _*)
)

val mapFromOriginalMap = when(col("originalMap").isNotNull && col("originalMap").notEqual(""),
  from_json(col("originalMap"), MapType(StringType, StringType))
).otherwise(
  map()
)

val comprehensiveMapExpr = map_concat_udf(mapFromPrefixedColumns, mapFromOriginalMap)
raw.withColumn("allMap", comprehensiveMapExpr)
Implementation with SQL functions, without JSON mapping
The last part of the question contains simplified code without the mapping of the JSON column and without the filtering of null values in the resulting map. I created the following implementation for this specific case. As I don't use functions that were added between Spark 2.2 and Spark 3.0, I don't need two versions of this implementation:
import org.apache.spark.sql.functions._
val mapFromPrefixedColumns = map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c), col(c))): _*)
raw.withColumn("allMap", mapFromPrefixedColumns)
Run
For the following dataframe as input:
+--------------------+--------------------+--------------------+----------------+
|columnPrefix/column1|columnPrefix/column2|columnPrefix/column3|originalMap |
+--------------------+--------------------+--------------------+----------------+
|a |1 |x |{"column4": "k"}|
|b |null |null |null |
|c |null |null |{} |
|null |null |null |null |
|d |2 |null | |
+--------------------+--------------------+--------------------+----------------+
I obtain the following allMap column:
+--------------------------------------------------------+
|allMap |
+--------------------------------------------------------+
|[column1 -> a, column2 -> 1, column3 -> x, column4 -> k]|
|[column1 -> b] |
|[column1 -> c] |
|[] |
|[column1 -> d, column2 -> 2] |
+--------------------------------------------------------+
And for the mapping without json column:
+---------------------------------------------------------------------------------+
|allMap |
+---------------------------------------------------------------------------------+
|[columnPrefix/column1 -> a, columnPrefix/column2 -> 1, columnPrefix/column3 -> x]|
|[columnPrefix/column1 -> b, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 -> c, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 ->, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 -> d, columnPrefix/column2 -> 2, columnPrefix/column3 ->] |
+---------------------------------------------------------------------------------+
Benchmark
I generated a CSV file of 10 million lines, uncompressed (about 800 MB), containing one column without the column prefix, nine columns with the column prefix, and one column containing JSON as a string:
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+
|id |columnPrefix/column1|columnPrefix/column2|columnPrefix/column3|columnPrefix/column4|columnPrefix/column5|columnPrefix/column6|columnPrefix/column7|columnPrefix/column8|columnPrefix/column9|originalMap |
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+
|1 |iwajedhor |ijoefzi |der |ob |galsu |ril |le |zaahuz |fuzi |{"column10":"true"}|
|2 |ofo |davfiwir |lebfim |roapej |lus |roum |te |javes |karutare |{"column10":"true"}|
|3 |jais |epciel |uv |piubnak |saajo |doke |ber |pi |igzici |{"column10":"true"}|
|4 |agami |zuhepuk |er |pizfe |lafudbo |zan |hoho |terbauv |ma |{"column10":"true"}|
...
The benchmark is to read this CSV file, create the column allMap, and write this column to Parquet. I ran this on my local machine and obtained the following results:
+--------------------------+--------------------+-------------------------+-------------------------+
| implementations | current (with udf) | sql functions spark 3.0 | sql functions spark 2.2 |
+--------------------------+--------------------+-------------------------+-------------------------+
| execution time | 138 seconds | 48 seconds | 82 seconds |
| improvement from current | 0 % faster | 64 % faster | 40 % faster |
+--------------------------+--------------------+-------------------------+-------------------------+
I also ran it against the second implementation in the question, which drops the mapping of the JSON column and the filtering of null values in the map.
+--------------------------+-----------------------+------------------------------------+
| implementations | current (with struct) | sql functions without json mapping |
+--------------------------+-----------------------+------------------------------------+
| execution time | 46 seconds | 35 seconds |
| improvement from current | 0 % | 23 % faster |
+--------------------------+-----------------------+------------------------------------+
Of course, the benchmark is rather basic, but we can see an improvement compared to the implementations that use user-defined functions.
Conclusion
When you have a performance issue and you use user-defined functions, it can be a good idea to try to replace those user-defined functions with Spark SQL functions.
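If you want to reproduce this kind of comparison on your own data, one quick way to time a run from spark-shell is spark.time; a sketch, assuming raw and comprehensiveMapExpr are defined as in the snippets above (the output path is a placeholder):
// Time one end-to-end run of the transformation and the Parquet write.
spark.time {
  raw.withColumn("allMap", comprehensiveMapExpr)
    .select("allMap")
    .write.mode("overwrite").parquet("/tmp/benchmark_output")
}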

How to evaluate binary classifier evaluation metrics per group (in scala)?

I have a dataframe which stores the scores and labels for various binary classification problems that I have. For example:
| problem | score | label |
|:--------|:------|-------|
| a | 0.8 | true |
| a | 0.7 | true |
| a | 0.2 | false |
| b | 0.9 | false |
| b | 0.3 | true |
| b | 0.1 | false |
| ... | ... | ... |
Now my goal is to get binary evaluation metrics (take AreaUnderROC for example, see https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification) for each problem, with end result being something like:
| problem | areaUnderROC |
| a | 0.83 |
| b | 0.68 |
| ... | ... |
I thought about doing something like:
df.groupBy("problem").agg(getMetrics)
but then I am not sure how to write getMetrics in terms of Aggregators (see https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html). Any suggestions?
There's a module built just for binary metrics; see it in the Python docs.
This code should work:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
score_and_labels_a = df.filter("problem = 'a'").select("score", "label")
metrics_a = BinaryClassificationMetrics(score_and_labels_a)
print(metrics_a.areaUnderROC)
print(metrics_a.areaUnderPR)
score_and_labels_b = df.filter("problem = 'b'").select("score", "label")
metrics_b = BinaryClassificationMetrics(score_and_labels_b)
print(metrics_b.areaUnderROC)
print(metrics_b.areaUnderPR)
... and so on for the other problems
This seems to me to be the easiest way :)
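Since the question is tagged Scala, a direct translation of this per-problem approach might look like the sketch below (my assumptions: score is a Double and label a Boolean, as in the example table; untested against your data):
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import spark.implicits._

// Build one BinaryClassificationMetrics per problem from an RDD[(score, label)].
val problems = df.select("problem").distinct().as[String].collect()

problems.foreach { p =>
  val scoreAndLabels = df
    .filter($"problem" === p)
    .select($"score", $"label".cast("double"))
    .as[(Double, Double)]
    .rdd
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  println(s"$p: areaUnderROC = ${metrics.areaUnderROC}, areaUnderPR = ${metrics.areaUnderPR}")
}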
Spark has very useful classes for getting metrics from binary or multiclass classification, but they are only available in the RDD-based API. So, with a little bit of code and some conversion between dataframes and RDDs, it can be done. A full example could look like the following:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{BooleanType, DoubleType, StringType, StructField, StructType}

object TestMetrics {

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("Example")
        .master("local[1]")
        .getOrCreate()

    import spark.implicits._
    val sc = spark.sparkContext

    // Test data with your schema
    val someData = Seq(
      Row("a", 0.8, true),
      Row("a", 0.7, true),
      Row("a", 0.2, true),
      Row("b", 0.9, true),
      Row("b", 0.3, true),
      Row("b", 0.1, true)
    )

    // Set your threshold to get a positive or negative
    val threshold: Double = 0.5

    import org.apache.spark.sql.functions._

    // First udf to convert a probability into a positive or negative
    def _thresholdUdf(threshold: Double): Double => Double = prob => if (prob > threshold) 1.0 else 0.0

    // Cast boolean to double
    val thresholdUdf = udf { _thresholdUdf(threshold) }
    val castToDouUdf = udf { (label: Boolean) => if (label) 1.0 else 0.0 }

    // Schema to build the dataframe
    val schema = List(StructField("problem", StringType), StructField("score", DoubleType), StructField("label", BooleanType))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))

    // Apply the first transformation to get the double representation of all fields
    val df0 = df.withColumn("binarypredict", thresholdUdf('score)).withColumn("labelDouble", castToDouUdf('label))

    // First loop to get the 'problems' list. Maybe it would be possible to do it all in one cycle
    val pbl = df0.select("problem").distinct().as[String].collect()

    // Get the RDD from the dataframe and build the Array[(String, BinaryClassificationMetrics)]
    val dfList = pbl.map(a => (a, new BinaryClassificationMetrics(df0.select("problem", "binarypredict", "labelDouble").as[(String, Double, Double)]
      .filter(el => el._1 == a).map { case (_, predict, label) => (predict, label) }.rdd)))

    // And the metrics for each 'problem' are available
    val results = dfList.toMap.mapValues(metrics =>
      Seq(metrics.areaUnderROC(),
        metrics.areaUnderROC()))

    val moreMetrics = dfList.toMap.map((metrics) => (metrics._1, metrics._2.scoreAndLabels))

    // Get metrics by key, in your case the 'problem'
    results.foreach(element => println(element))
    moreMetrics.foreach(element => element._2.foreach { pr => println(s"${element._1} ${pr}") })

    // Score and labels
  }
}

Spark "CodeGenerator: failed to compile" with Dataset.groupByKey

I'm new to both Scala and Spark, so hopefully someone can let me know where I'm going wrong here.
I have a three-column dataset (id, name, year) and I want to find the most recent year for each name. In other words:
BEFORE:
| id_1 | name_1 | 2015 |
| id_2 | name_1 | 2016 |
| id_3 | name_1 | 2014 |
| id_4 | name_2 | 2015 |
| id_5 | name_2 | 2014 |

AFTER:
| id_2 | name_1 | 2016 |
| id_4 | name_2 | 2015 |
I thought groupByKey and reduceGroups would get the job done:
val latestYears = ds
  .groupByKey(_.name)
  .reduceGroups((left, right) => if (left.year > right.year) left else right)
  .map(group => group._2)
But it gives this error, and spits out a lot of generated Java code:
ERROR CodeGenerator: failed to compile:
org.codehaus.commons.compiler.CompileException:
File 'generated.java', Line 21, Column 101: Unknown variable or type "value4"
Interestingly, if I create a dataset with just the name and year columns, it works as expected.
Here's the full code I'm running:
import org.apache.spark.sql.SparkSession

object App {
  case class Record(id: String, name: String, year: Int)

  def main(args: Array[String]) {
    val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._

    val ds = spark.createDataset[String](Seq(
      "id_1,name_1,2015",
      "id_2,name_1,2016",
      "id_3,name_1,2014",
      "id_4,name_2,2015",
      "id_5,name_2,2014"
    ))
    .map(line => {
      val fields = line.split(",")
      new Record(fields(0), fields(1), fields(2).toInt)
    })

    val latestYears = ds
      .groupByKey(_.name)
      .reduceGroups((left, right) => if (left.year > right.year) left else right)
      .map(group => group._2)

    latestYears.show()
  }
}
EDIT: I believe this may be a bug with Spark v2.0.1. After downgrading to v2.0.0, this no longer occurs.
Your groupByKey and reduceGroups functions are experimental. Why not use reduceByKey (api)?
Pros:
It should be easy to translate from the code you have.
It is more stable (not experimental).
It should be more efficient because it does not require a complete shuffle of all the items in each group (which can also create network I/O slowdowns and overflow the memory on a node).
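For reference, a rough sketch of that translation (untested; note it drops from the Dataset API to RDDs, which is a trade-off of its own):
import spark.implicits._

// Reduce per name on the RDD, keeping the record with the latest year,
// then come back to a Dataset.
val latestYears = ds.rdd
  .map(record => (record.name, record))
  .reduceByKey((left, right) => if (left.year > right.year) left else right)
  .map(_._2)
  .toDS()

latestYears.show()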