Spark collect_list and limit resulting list - scala

I have a dataframe of the following format:
name merged
key1 (internalKey1, value1)
key1 (internalKey2, value2)
...
key2 (internalKey3, value3)
...
What I want to do is group the dataframe by the name, collect the list and limit the size of the list.
This is how i group by the name and collect the list:
val res = df.groupBy("name")
.agg(collect_list(col("merged")).as("final"))
The resuling dataframe is something like:
key1 [(internalKey1, value1), (internalKey2, value2),...] // Limit the size of this list
key2 [(internalKey3, value3),...]
What I want to do is limit the size of the produced lists for each key. I' ve tried multiple ways to do that but had no success. I've already seen some posts that suggest 3rd party solutions but I want to avoid that. Is there a way?

So while a UDF does what you need, if you're looking for a more performant way that is also memory sensitive, the way of doing this would be to write a UDAF. Unfortunately the UDAF API is actually not as extensible as the aggregate functions that ship with spark. However you can use their internal APIs to build on the internal functions to do what you need.
Here is an implementation for collect_list_limit that is mostly a copy past of Spark's internal CollectList AggregateFunction. I would just extend it but its a case class. Really all that's needed is to override update and merge methods to respect a passed in limit:
case class CollectListLimit(
child: Expression,
limitExp: Expression,
mutableAggBufferOffset: Int = 0,
inputAggBufferOffset: Int = 0) extends Collect[mutable.ArrayBuffer[Any]] {
val limit = limitExp.eval( null ).asInstanceOf[Int]
def this(child: Expression, limit: Expression) = this(child, limit, 0, 0)
override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: Int): ImperativeAggregate =
copy(mutableAggBufferOffset = newMutableAggBufferOffset)
override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): ImperativeAggregate =
copy(inputAggBufferOffset = newInputAggBufferOffset)
override def createAggregationBuffer(): mutable.ArrayBuffer[Any] = mutable.ArrayBuffer.empty
override def update(buffer: mutable.ArrayBuffer[Any], input: InternalRow): mutable.ArrayBuffer[Any] = {
if( buffer.size < limit ) super.update(buffer, input)
else buffer
}
override def merge(buffer: mutable.ArrayBuffer[Any], other: mutable.ArrayBuffer[Any]): mutable.ArrayBuffer[Any] = {
if( buffer.size >= limit ) buffer
else if( other.size >= limit ) other
else ( buffer ++= other ).take( limit )
}
override def prettyName: String = "collect_list_limit"
}
And to actually register it, we can do it through Spark's internal FunctionRegistry which takes in the name and the builder which is effectively a function that creates a CollectListLimit using the provided expressions:
val collectListBuilder = (args: Seq[Expression]) => CollectListLimit( args( 0 ), args( 1 ) )
FunctionRegistry.builtin.registerFunction( "collect_list_limit", collectListBuilder )
Edit:
Turns out adding it to the builtin only works if you haven't created the SparkContext yet as it makes an immutable clone on startup. If you have an existing context then this should work to add it with reflection:
val field = classOf[SessionCatalog].getFields.find( _.getName.endsWith( "functionRegistry" ) ).get
field.setAccessible( true )
val inUseRegistry = field.get( SparkSession.builder.getOrCreate.sessionState.catalog ).asInstanceOf[FunctionRegistry]
inUseRegistry.registerFunction( "collect_list_limit", collectListBuilder )

You can create a function that limits the size of the aggregated ArrayType column as shown below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
case class KV(k: String, v: String)
val df = Seq(
("key1", KV("internalKey1", "value1")),
("key1", KV("internalKey2", "value2")),
("key2", KV("internalKey3", "value3")),
("key2", KV("internalKey4", "value4")),
("key2", KV("internalKey5", "value5"))
).toDF("name", "merged")
def limitSize(n: Int, arrCol: Column): Column =
array( (0 until n).map( arrCol.getItem ): _* )
df.
groupBy("name").agg( collect_list(col("merged")).as("final") ).
select( $"name", limitSize(2, $"final").as("final2") ).
show(false)
// +----+----------------------------------------------+
// |name|final2 |
// +----+----------------------------------------------+
// |key1|[[internalKey1,value1], [internalKey2,value2]]|
// |key2|[[internalKey3,value3], [internalKey4,value4]]|
// +----+----------------------------------------------+

You can use a UDF.
Here is a probable example without the necessity of schema and with a meaningful reduction:
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions._
import scala.collection.mutable
object TestJob1 {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
import sparkSession.sqlContext.implicits._
val rawDf = Seq(
("key", 1L, "gargamel"),
("key", 4L, "pe_gadol"),
("key", 2L, "zaam"),
("key1", 5L, "naval")
).toDF("group", "quality", "other")
rawDf.show(false)
rawDf.printSchema
val rawSchema = rawDf.schema
val fUdf = udf(reduceByQuality, rawSchema)
val aggDf = rawDf
.groupBy("group")
.agg(
count(struct("*")).as("num_reads"),
max(col("quality")).as("quality"),
collect_list(struct("*")).as("horizontal")
)
.withColumn("short", fUdf($"horizontal"))
.drop("horizontal")
aggDf.printSchema
aggDf.show(false)
}
def reduceByQuality= (x: Any) => {
val d = x.asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]]
val red = d.reduce((r1, r2) => {
val quality1 = r1.getAs[Long]("quality")
val quality2 = r2.getAs[Long]("quality")
val r3 = quality1 match {
case a if a >= quality2 =>
r1
case _ =>
r2
}
r3
})
red
}
}
here is an example with data like yours
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import scala.collection.mutable
object TestJob {
def main (args: Array[String]): Unit = {
val sparkSession = SparkSession
.builder()
.appName(this.getClass.getName.replace("$", ""))
.master("local")
.getOrCreate()
val sc = sparkSession.sparkContext
import sparkSession.sqlContext.implicits._
val df1 = Seq(
("key1", ("internalKey1", "value1")),
("key1", ("internalKey2", "value2")),
("key2", ("internalKey3", "value3")),
("key2", ("internalKey4", "value4")),
("key2", ("internalKey5", "value5"))
)
.toDF("name", "merged")
// df1.printSchema
//
// df1.show(false)
val res = df1
.groupBy("name")
.agg( collect_list(col("merged")).as("final") )
res.printSchema
res.show(false)
def f= (x: Any) => {
val d = x.asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]]
val d1 = d.asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].head
d1.toString
}
val fUdf = udf(f, StringType)
val d2 = res
.withColumn("d", fUdf(col("final")))
.drop("final")
d2.printSchema()
d2
.show(false)
}
}

Related

Update to the delta table in spark not working

package jobs
import io.delta.tables.DeltaTable
import model.RandomUtils
import org.apache.spark.sql.streaming.{ OutputMode, Trigger }
import org.apache.spark.sql.{ DataFrame, Dataset, Encoder, Encoders, SparkSession }
import jobs.SystemJob.Rate
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
case class Student(firstName: String, lastName: String, age: Long, percentage: Long)
case class Rate(timestamp: Timestamp, value: Long)
case class College(name: String, address: String, principal: String)
object RCConfigDSCCDeltaLake {
def getSpark(): SparkSession = {
SparkSession.builder
.appName("Delta table demo")
.master("local[*]")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
}
def main(args: Array[String]): Unit = {
val spark = getSpark()
val rate = 1
val studentProfile = "student_profile"
if (!DeltaTable.isDeltaTable(s"spark-warehouse/$studentProfile")) {
val deltaTable: DataFrame = spark.sql(s"CREATE TABLE `$studentProfile` (firstName String, lastName String, age Long, percentage Long) USING delta")
deltaTable.show()
deltaTable.printSchema()
}
val studentProfileDT = DeltaTable.forPath(spark, s"spark-warehouse/$studentProfile")
def processStream(student: Dataset[Student], college: Dataset[College]) = {
val studentQuery = student.writeStream.outputMode(OutputMode.Update()).foreachBatch {
(st: Dataset[Student], y: Long) =>
val listOfStudents = st.collect().toList
println("list of students :::" + listOfStudents)
val (o, n) = ("oldData", "newData")
val colMap = Map(
"firstName" -> col(s"$n.firstName"),
"lastName" -> col(s"$n.lastName"),
"age" -> col(s"$n.age"),
"percentage" -> col(s"$n.percentage"))
studentProfileDT.as(s"$o").merge(st.toDF.as(s"$n"), s"$o.firstName = $n.firstName AND $o.lastName = $n.lastName")
.whenMatched.update(colMap)
.whenNotMatched.insert(colMap)
.execute()
}.start()
val os = spark.readStream.format("delta").load(s"spark-warehouse/$studentProfile").writeStream.format("console")
.outputMode(OutputMode.Append())
.option("truncate", value = false)
.option("checkpointLocation", "retrieved").start()
studentQuery.awaitTermination()
os.awaitTermination()
}
import spark.implicits._
implicit val encStudent: Encoder[Student] = Encoders.product[Student]
implicit val encCollege: Encoder[College] = Encoders.product[College]
def rateStream = spark
.readStream
.format("rate") // <-- use RateStreamSource
.option("rowsPerSecond", rate)
.load()
.as[Rate]
val studentStream: Dataset[Student] = rateStream.filter(_.value % 25 == 0).map {
stu =>
Student(...., ....., ....., .....) //fill with values
}
val collegeStream: Dataset[College] = rateStream.filter(_.value % 40 == 0).map {
stu =>
College(...., ....., ......) //fill with values
}
processStream(studentStream, collegeStream)
}
}
What I am trying to do is a simple UPSERT operation with streaming datasets. But it fails with error
22/04/13 19:50:33 ERROR MicroBatchExecution: Query [id = 8cf759fd-9bee-460f-
b0d9-91889c59c524, runId = 55723708-fd3c-4425-a2bc-83d737c37589] terminated with
error
java.lang.UnsupportedOperationException: Detected a data update (for example part-
00000-d026d92e-1798-4d21-a505-67ec72d334e2-c000.snappy.parquet) in the source table
at version 4. This is currently not supported. If you'd like to ignore updates, set
the option 'ignoreChanges' to 'true'. If you would like the data update to be
reflected, please restart this query with a fresh checkpoint directory.
Dependencies :
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
--packages io.delta:delta-core_2.12:0.7.0
The update query works when the datasets are not streamed and only hardcoded.
Am I doing something wrong here ?

Assert RDD is not sorted

I have a method called split that accepts an RDD[T] and a splitSize and returns an Array[RDD[T]].
Now, one of the test cases I write for it should verify that this function also randomly shuffles the RDD.
So I create a sorted RDD, and then see the results:
it should "randomize shuffle" in {
val inputRDD = sc.parallelize((0 until 16))
val result = RDDUtils.split(inputRDD, 2)
result.foreach(rdd => {
rdd.collect.foreach(println)
})
// Asset result is not sorted
}
If the results are:
0
1
2
3
..
15
Then it's not working as expected.
A good result can be something like:
11
3
9
14
...
1
6
How can I assert the output Array[RDD[T]]] is not sorted?
You could try something like this
val resultOrder = result.sortBy(....)
assert(!resultOrder.sameElements(result))
or
val resultOrder = result.sortBy(....)
assert(!resultOrder.toList == result.toList)
It's important to note that the key is to know how to sort the Array. For an Integer data type it would be easy, but for a complex data type you could need an implicit Ordering for your data type. e.g:
implicit val ordering: Ordering[T] =
Ordering.fromLessThan[T]((sa: T, sb: T) => sa < sb)
// OR
implicit val ordering: Ordering[MyClass] =
Ordering.fromLessThan[MyClass]((sa: MyClass, sb: MyClass) => sa.field1 < sb.field1)
The exact code would depend of your data type.
As a full example of this
package tests
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object SortArrayRDD {
val spark = SparkSession
.builder()
.appName("SortArrayRDD")
.master("local[*]")
.config("spark.sql.shuffle.partitions","4") //Change to a more reasonable default number of partitions for our data
.config("spark.app.id","SortArrayRDD") // To silence Metrics warning
.getOrCreate()
val sc = spark.sparkContext
def main(args: Array[String]): Unit = {
try {
Logger.getRootLogger.setLevel(Level.ERROR)
val arrRDD: Array[RDD[Int]] = Array(sc.parallelize(List(2,3)),sc.parallelize(List(10,11)),sc.parallelize(List(6,7)),sc.parallelize(List(8,9)),
sc.parallelize(List(4,5)),sc.parallelize(List(0,1)),sc.parallelize(List(12,13)),sc.parallelize(List(14,15)))
val aux = arrRDD
implicit val ordering: Ordering[RDD[Int]] = Ordering.fromLessThan[RDD[Int]]((sa: RDD[Int], sb: RDD[Int]) => sa.sum() < sb.sum())
aux.sorted.foreach(rdd => println(rdd.collect().mkString(",")))
val resultOrder = aux.sorted
assert(!resultOrder.sameElements(arrRDD))
println("It's unordered")
} finally {
sc.stop()
}
}
}

Rewrite LogicalPlan to push down udf from aggregate

I have defined an UDF which increases the input value by one, named "inc", this is the code of my udf
spark.udf.register("inc", (x: Long) => x + 1)
this is my test sql
val df = spark.sql("select sum(inc(vals)) from data")
df.explain(true)
df.show()
this is the optimized plan of that sql
== Optimized Logical Plan ==
Aggregate [sum(inc(vals#4L)) AS sum(inc(vals))#7L]
+- LocalRelation [vals#4L]
I want to rewrite the plan, and extract the "inc" from the "sum", just like python udf does.
So, this is the optimized plan which I wanted.
Aggregate [sum(inc_val#6L) AS sum(inc(vals))#7L]
+- Project [inc(vals#4L) AS inc_val#6L]
+- LocalRelation [vals#4L]
I have found that source code file "ExtractPythonUDFs.scala" provides similar function which works on PythonUDF, but it inserts a new node named "ArrowEvalPython", this is the logical plan of pythonudf.
== Optimized Logical Plan ==
Aggregate [sum(pythonUDF0#7L) AS sum(inc(vals))#4L]
+- Project [pythonUDF0#7L]
+- ArrowEvalPython [inc(vals#0L)], [pythonUDF0#7L], 200
+- Repartition 10, true
+- RelationV2[vals#0L] parquet file:/tmp/vals.parquet
What I want to inset is just a "project node", I don't want to define a new node.
this is the test code of my project
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression, ScalaUDF}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
object RewritePlanTest {
case class UdfRule(spark: SparkSession) extends Rule[LogicalPlan] {
def collectUDFs(e: Expression): Seq[Expression] = e match {
case udf: ScalaUDF => Seq(udf)
case _ => e.children.flatMap(collectUDFs)
}
override def apply(plan: LogicalPlan): LogicalPlan = plan match {
case agg#Aggregate(g, a, _) if (g.isEmpty && a.length == 1) =>
val udfs = agg.expressions.flatMap(collectUDFs)
println("================")
udfs.foreach(println)
val test = udfs(0).isInstanceOf[NamedExpression]
println(s"cast ScalaUDF to NamedExpression = ${test}")
println("================")
agg
case _ => plan
}
}
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.WARN)
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Rewrite plan test")
.withExtensions(e => e.injectOptimizerRule(UdfRule))
.getOrCreate()
val input = Seq(100L, 200L, 300L)
import spark.implicits._
input.toDF("vals").createOrReplaceTempView("data")
spark.udf.register("inc", (x: Long) => x + 1)
val df = spark.sql("select sum(inc(vals)) from data")
df.explain(true)
df.show()
spark.stop()
}
}
I have extract ScalaUDF from the Aggregate node,
since the arguments needed for Project Node is Seq[NamedExpression]
case class Project(projectList: Seq[NamedExpression], child: LogicalPlan)
but it's failed to cast ScalaUDF to NamedExpression,
so I have no idea about how to construct the Project node.
Can someone give me some advices?
Thanks.
OK, finally I find way to so answer this question.
Though ScalaUDF can't cast to NamedExpression, but Alias could.
So, I create Alias from ScalaUDF, then construct Project.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, ExpectsInputTypes, ExprId, Expression, NamedExpression, ScalaUDF}
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LocalRelation, LogicalPlan, Project, Subquery}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.{AbstractDataType, DataType}
import scala.collection.mutable
object RewritePlanTest {
case class UdfRule(spark: SparkSession) extends Rule[LogicalPlan] {
def collectUDFs(e: Expression): Seq[Expression] = e match {
case udf: ScalaUDF => Seq(udf)
case _ => e.children.flatMap(collectUDFs)
}
override def apply(plan: LogicalPlan): LogicalPlan = plan match {
case agg#Aggregate(g, a, c) if g.isEmpty && a.length == 1 => {
val udfs = agg.expressions.flatMap(collectUDFs)
if (udfs.isEmpty) {
agg
} else {
val alias_udf = for (i <- 0 until udfs.size) yield Alias(udfs(i), s"udf${i}")()
val alias_set = mutable.HashMap[Expression, Attribute]()
val proj = Project(alias_udf, c)
alias_set ++= udfs.zip(proj.output)
val new_agg = agg.withNewChildren(Seq(proj)).transformExpressionsUp {
case udf: ScalaUDF if alias_set.contains(udf) => alias_set(udf)
}
println("====== new agg ======")
println(new_agg)
new_agg
}
}
case _ => plan
}
}
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.WARN)
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Rewrite plan test")
.withExtensions(e => e.injectOptimizerRule(UdfRule))
.getOrCreate()
val input = Seq(100L, 200L, 300L)
import spark.implicits._
input.toDF("vals").createOrReplaceTempView("data")
spark.udf.register("inc", (x: Long) => x + 1)
val df = spark.sql("select sum(inc(vals)) from data where vals > 100")
// val plan = df.queryExecution.analyzed
// println(plan)
df.explain(true)
df.show()
spark.stop()
}
}
This code output the LogicalPlan that I wanted.
====== new agg ======
Aggregate [sum(udf0#9L) AS sum(inc(vals))#7L]
+- Project [inc(vals#4L) AS udf0#9L]
+- LocalRelation [vals#4L]

Scala/Spark reduce RDD of class objects

I need your help for my last step of a school project.
val conf: SparkConf = new SparkConf() .setMaster("local[*]") .setAppName("AppName") .set("spark.driver.host", "localhost")
val sc: SparkContext = new SparkContext(conf)
var list_creature = new ListBuffer[creature]()
list_creature += new creature("ska")
list_creature(0).addspell("Heal")
list_creature(0).addspell("Attaque")
list_creature += new creature("moise")
list_creature(1).addspell("Tank")
list_creature(1).addspell("Defense")
list_creature(1).addspell("Attaque")
val rdd = sc.parallelize(list_creature)
val y = rdd.map(e=>(e.name,e.Spells)).collect()
val z = y.flatMap(x =>ListBuffer(x._2->x._1))
val ze = z.flatMap(e =>e._1.flatMap(x => ListBuffer(x->e._2)))
i get this as a result,
(Heal,ska)
(Attaque,ska)
(Tank,moise)
(Defense,moise)
(Attaque,moise)
So, i want to reduce this List[List[String]]
to get List[String,List[string]]
and the result will be :
(Heal,(ska))
(Attaque,(ska,moise))
(Tank,(moise))
(Defense,(moise))
Thanks you're the best ...
Not sure why you create a RDD then collect before all the major transformations. Since you didn't provide definition of class Creature, I'm creating a placeholder class based on your question content as follows:
class Creature(val name: String) extends Serializable {
var spells: List[String] = List.empty[String]
def addspell(spell: String): Unit = {
spells ::= spell
}
}
import scala.collection.mutable.ListBuffer
val list_creature = ListBuffer[Creature]()
list_creature += new Creature("ska")
list_creature(0).addspell("Heal")
list_creature(0).addspell("Attaque")
list_creature += new Creature("moise")
list_creature(1).addspell("Tank")
list_creature(1).addspell("Defense")
list_creature(1).addspell("Attaque")
val rdd = sc.parallelize(list_creature)
val reducedRDD = rdd.flatMap( c => c.spells.map(s => (s, List(c.name))) ).
reduceByKey( _ ++ _ )
reducedRDD.collect
// res1: Array[(String, List[String])] = Array(
// (Heal,List(ska)), (Defense,List(moise)), (Attaque,List(ska, moise)), (Tank,List(moise)
// ))

spark map partitions to fill nan values

I want to fill nan values in spark using the last good known observation - see: Spark / Scala: fill nan with last good observation
My current solution used window functions in order to accomplish the task. But this is not great, as all values are mapped into a single partition.
val imputed: RDD[FooBar] = recordsDF.rdd.mapPartitionsWithIndex { case (i, iter) => fill(i, iter) } should work a lot better. But strangely my fill function is not executed. What is wrong with my code?
+----------+--------------------+
| foo| bar|
+----------+--------------------+
|2016-01-01| first|
|2016-01-02| second|
| null| noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
Here is the full example code:
import java.sql.Date
import org.apache.log4j.{ Level, Logger }
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
case class FooBar(foo: Date, bar: String)
object WindowFunctionExample extends App {
Logger.getLogger("org").setLevel(Level.WARN)
val conf: SparkConf = new SparkConf()
.setAppName("foo")
.setMaster("local[*]")
val spark: SparkSession = SparkSession
.builder()
.config(conf)
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
val myDff = Seq(("2016-01-01", "first"), ("2016-01-02", "second"),
("2016-wrongFormat", "noValidFormat"),
("2016-01-04", "lastAssumingSameDate"))
val recordsDF = myDff
.toDF("foo", "bar")
.withColumn("foo", 'foo.cast("Date"))
.as[FooBar]
recordsDF.show
def notMissing(row: FooBar): Boolean = {
row.foo != null
}
val toCarry = recordsDF.rdd.mapPartitionsWithIndex { case (i, iter) => Iterator((i, iter.filter(notMissing(_)).toSeq.lastOption)) }.collectAsMap
println("###################### carry ")
println(toCarry)
println(toCarry.foreach(println))
println("###################### carry ")
val toCarryBd = spark.sparkContext.broadcast(toCarry)
def fill(i: Int, iter: Iterator[FooBar]): Iterator[FooBar] = {
var lastNotNullRow: FooBar = toCarryBd.value(i).get
iter.map(row => {
if (!notMissing(row))1
FooBar(lastNotNullRow.foo, row.bar)
else {
lastNotNullRow = row
row
}
})
}
// The algorithm does not step into the for loop for filling the null values. Strange
val imputed: RDD[FooBar] = recordsDF.rdd.mapPartitionsWithIndex { case (i, iter) => fill(i, iter) }
val imputedDF = imputed.toDS()
println(imputedDF.orderBy($"foo").collect.toList)
imputedDF.show
spark.stop
}
edit
I fixed the code as outlined by the comment. But the toCarryBd contains None values. How can this happen as I did filter explicitly for
def notMissing(row: FooBar): Boolean = {row.foo != null}
iter.filter(notMissing(_)).toSeq.lastOption
non None values.
(2,None)
(5,None)
(4,None)
(7,Some(FooBar(2016-01-04,lastAssumingSameDate)))
(1,Some(FooBar(2016-01-01,first)))
(3,Some(FooBar(2016-01-02,second)))
(6,None)
(0,None)
This leads to NoSuchElementException: None.getwhen trying to access toCarryBd.
Firstly, if your foo field can be null, I would recommend creating the case class as:
case class FooBar(foo: Option[Date], bar: String)
Then, you can rewrite your notMissing function to something like:
def notMissing(row: Option[FooBar]): Boolean = row.isDefined && row.get.foo.isDefined