Spark Catalyst flatMapGroupsWithState: Group State with sorted collection - scala

I am trying to keep a sorted collection in the state of my groups, and I get an error from Catalyst which I think relates to default instance creation for the collection.
Below is a simplified pipeline that demonstrates the error:
package com.example
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}
import scala.collection.immutable.TreeMap
case class Event(key: String)
case class KeyState(prop: TreeMap[Long, String])
object CatalystIssue {
  def updateState(k: String, vs: Iterator[Event],
                  state: GroupState[KeyState]): Iterator[Event] = vs
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("CatalystIssue")
      .getOrCreate()
    import spark.implicits._
    val df = spark.readStream.format("rate")
      .load()
      .select(lit("a").as("key"))
      .as[Event]
      .groupByKey(_.key)
      .flatMapGroupsWithState(OutputMode.Append(),
        GroupStateTimeout.NoTimeout())(updateState)
    val query = df.writeStream.format("console")
      .trigger(Trigger.ProcessingTime("30 seconds")).start()
    query.awaitTermination()
  }
}
Which produces the error:
ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 53, Column 106: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.generic.SortedMapFactory.newBuilder(scala.math.Ordering)"
This might be because sorted maps are not supported as a DataFrame attribute type. That is not my intention here, though: I would have thought KeyState would be opaque to Spark, since it is never accessed like a DataFrame attribute.
While not very attractive, one option might be to serialize the sorted map into a byte array and make that the attribute of KeyState, i.e.
case class KeyState
(
prop: Array[Byte]
)
If Java serialization were used, would that preserve the internal tree structure of the TreeMap, so that at least it would not have to be rebuilt? Are there alternative serialization technologies that would preserve the structure?
It seems useful to be able to keep sorted collections in the group state, especially as the computation is supposed to be primarily in memory. Is there something about the way Spark works that makes this fundamentally unworkable?
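A minimal sketch of the byte-array option described above, assuming plain Java serialization. The TreeMapBytes object and its method names are hypothetical and not part of Spark. It only demonstrates the round trip; whether the tree is written node by node or rebuilt on read depends on how the Scala version in use serializes TreeMap, so that would need checking separately.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.immutable.TreeMap

// Hypothetical helper: converts the sorted map to and from the Array[Byte]
// that a KeyState(prop: Array[Byte]) would hold.
object TreeMapBytes {
  def toBytes(m: TreeMap[Long, String]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(m) finally out.close()
    buffer.toByteArray
  }

  def fromBytes(bytes: Array[Byte]): TreeMap[Long, String] = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[TreeMap[Long, String]] finally in.close()
  }
}
```

Inside updateState, the state would then be decoded with TreeMapBytes.fromBytes before use and re-encoded with TreeMapBytes.toBytes before calling state.update.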

Related

Is using the Scala implicit classes feature on a Spark DataFrame monkey patching?

I'm trying to add side-effect functionality to the Spark DataFrame by extending the DataFrame class with Scala's implicit classes feature, because the "Dataset Transform Method" only allows returning a DataFrame.
From Wikipedia: "The term monkey patch ... referred to changing code sneakily – and possibly incompatibly with other such patches – at runtime"
In this post the writer warns against "Monkey Patching with Implicit Classes", but I'm not sure his claims are correct, because we are not changing any classes.
Is the following example potentially "monkey patching" that could somehow become incompatible with a future Spark version? Or, because I'm only extending the current DataFrame class and not overwriting it, can it do no harm?
// SaveMode is needed by .mode(SaveMode.Overwrite) below
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, get_json_object}
object dataFrameSql {
  implicit class DataFrameExSql(dataFrame: DataFrame) {
    def writeDFbyPartition(repartition: Int, output: String): Unit = {
      dataFrame
        .repartition(repartition)
        .write
        .option("partitionOverwriteMode", "dynamic")
        .mode(SaveMode.Overwrite)
        .parquet(output)
    }
  }
}
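For reference, a small Spark-free sketch of what an implicit class does, using hypothetical names (StringSyntax, ShoutOps, shout). The compiler rewrites the call into an explicit wrapper construction at compile time, and String itself is never modified.

```scala
object StringSyntax {
  // The wrapper class that carries the extra method; String is left untouched.
  implicit class ShoutOps(s: String) {
    def shout: String = s.toUpperCase + "!"
  }
}

object ShoutDemo {
  import StringSyntax._

  // Looks like a method call on String...
  val viaExtension: String = "hello".shout
  // ...and is equivalent to constructing the wrapper explicitly.
  val explicitWrapper: String = new ShoutOps("hello").shout
}
```

The extension is only visible where StringSyntax._ is imported, which is the main difference from patching a class at runtime.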

Best practice to define implicit/explicit encoding in dataframe column value extraction without RDD

I am trying to collect a column's data into a collection without the RDD map API (the pure DataFrame way):
object CommonObject{
def doSomething(...){
.......
val releaseDate = tableDF.where(tableDF("item") <=> "releaseDate").select("value").map(r => r.getString(0)).collect.toList.head
}
}
This all works, except that with Spark 2.3 I get
No implicits found for parameter evidence$6: Encoder[String]
between map and collect:
map(r => r.getString(0))(...).collect
I understand that I should add
import spark.implicits._
before this code; however, that requires a SparkSession instance.
This is pretty annoying, especially when there is no SparkSession instance available in the method. As a Spark newbie, how can I cleanly resolve the implicit encoder parameter in this context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already existing SparkSession and won't create a new one, so there is no performance impact. Then you can import the implicits, which work for all case classes. This is the easiest way to add an encoder. Alternatively, an explicit encoder can be supplied using the Encoders class.
val spark = SparkSession.builder
.appName("name")
.master("local[2]")
.getOrCreate()
import spark.implicits._
The other way is to get the SparkSession from the DataFrame via dataframe.sparkSession:
def dummy(df: DataFrame) = {
  val spark = df.sparkSession
  import spark.implicits._
}
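A sketch of the explicit-encoder alternative mentioned above, assuming the same tableDF layout as in the question (an "item" column and a string "value" column). The ReleaseDateLookup wrapper is a hypothetical name. Passing Encoders.STRING directly to map means the method needs neither a SparkSession nor the implicits import.

```scala
import org.apache.spark.sql.{DataFrame, Encoders}

// Hypothetical wrapper around the lookup from the question.
object ReleaseDateLookup {
  def releaseDate(tableDF: DataFrame): String =
    tableDF
      .where(tableDF("item") <=> "releaseDate")
      .select("value")
      // The encoder is passed explicitly instead of being resolved implicitly.
      .map(r => r.getString(0))(Encoders.STRING)
      .collect()
      .toList
      .head
}
```

As in the original one-liner, head throws if no row matches, so a headOption may be safer in real code.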

Fetching a DataFrame into a Case Class results instead in reading a Tuple1

Given a case class :
case class ScoringSummary(MatchMethod: String="",
TP: Double=0,
FP: Double=0,
Precision: Double=0,
Recall: Double=0,
F1: Double=0)
We are writing summary records out as:
summaryDf.write.parquet(path)
Later we (attempt to) read the parquet file into a new dataframe:
implicit val generalRowEncoder: Encoder[ScoringSummary] =
org.apache.spark.sql.Encoders.kryo[ScoringSummary]
val summaryDf = spark.read.parquet(path).as[ScoringSummary]
But this fails - for some reason spark believes the contents of the data were Tuple1 instead of ScoringSummary:
Try to map struct<MatchMethod:string,TP:double,FP:double,Precision:double,
Recall:double,F1:double> to Tuple1,
but failed as the number of fields does not line up.;
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$
.org$apache$spark$sql$catalyst$analysis$Analyzer$
ResolveDeserializer$$fail(Analyzer.scala:2168)
What step / setting is missing/incorrect for the correct translation?
Use import spark.implicits._ instead of registering an Encoder
I had forgotten that importing spark.implicits._ is required. The incorrect approach was to add the Encoder, i.e. do not include the following lines:
implicit val generalRowEncoder: Encoder[ScoringSummary] =
org.apache.spark.sql.Encoders.kryo[ScoringSummary] // Do NOT add this Encoder
Here is the error when removing the Encoder line
Error:(59, 113) Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._ Support for serializing
other types will be added in future releases.
val summaryDf = ParquetLoader.loadParquet(sparkEnv,res.state.dfs(ScoringSummaryTag).copy(df=None)).df.get.as[ScoringSummary]
Instead the following code should be added
import spark.implicits._
And then the same code works:
val summaryDf = spark.read.parquet(path).as[ScoringSummary]
As an aside: explicit encoders are not required for case classes or primitive types, and the above is a case class. Kryo becomes handy for complex object types.
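The mismatch can be illustrated by comparing the schemas of the two encoders. This is a sketch that assumes only the ScoringSummary case class from the question; the EncoderSchemas object is a hypothetical name. The Kryo encoder stores the whole object as one binary column, while the product encoder (the kind spark.implicits._ provides for case classes) has one column per field, which is presumably why the six-column Parquet data could not be mapped while the Kryo encoder was in scope.

```scala
import org.apache.spark.sql.Encoders

case class ScoringSummary(MatchMethod: String = "",
                          TP: Double = 0,
                          FP: Double = 0,
                          Precision: Double = 0,
                          Recall: Double = 0,
                          F1: Double = 0)

object EncoderSchemas {
  // Kryo serializes the whole object into a single binary column.
  val kryoFields: List[String] =
    Encoders.kryo[ScoringSummary].schema.fieldNames.toList

  // The product encoder exposes one column per case class field.
  val productFields: List[String] =
    Encoders.product[ScoringSummary].schema.fieldNames.toList
}
```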

Scala not able to save as sequence file in RDD, as per doc it is allowed

I am using Spark 1.6. According to the official docs, an RDD can be saved in sequence file format; however, for my RDD textFile I get:
scala> textFile.saveAsSequenceFile("products_sequence")
<console>:30: error: value saveAsSequenceFile is not a member of org.apache.spark.rdd.RDD[String]
I googled and found similar discussions that seem to suggest this works in PySpark. Is my understanding of the official docs wrong? Can saveAsSequenceFile() be used in Scala?
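saveAsSequenceFile is only available when the RDD holds key-value pairs. It is not a method of RDD itself: it is added by an implicit conversion (to SequenceFileRDDFunctions) that applies only to an RDD[(K, V)], in the same way the PairRDDFunctions operations are: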
https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
You can see that the API definition takes a K and a V.
If you change your code above to:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd._
object SequeneFile extends App {
  val conf = new SparkConf().setAppName("sequenceFile").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val rdd: RDD[(String, String)] = sc.parallelize(List(("foo", "foo1"), ("bar", "bar1"), ("baz", "baz1")))
  rdd.saveAsSequenceFile("foo.seq")
  sc.stop()
}
This works and you will get the foo.seq output. It works because the RDD holds key-value pairs rather than being just an RDD[String].
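For an existing RDD[String] such as textFile from the question, one option is to give each line a key first. A minimal sketch, assuming the line position is an acceptable key; SequenceFileKeys and withLineIndex are hypothetical names.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper: turns RDD[String] into RDD[(Long, String)].
object SequenceFileKeys {
  // Pair every line with its position so the RDD becomes a key-value RDD.
  def withLineIndex(lines: RDD[String]): RDD[(Long, String)] =
    lines.zipWithIndex().map { case (line, index) => (index, line) }
}
```

With that, SequenceFileKeys.withLineIndex(textFile).saveAsSequenceFile("products_sequence") should compile, since both Long and String can be converted to Hadoop Writable types.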

How do I pass the SparkContext to a function from foreach?

I need to pass the SparkContext to my function; please suggest how to do that in the scenario below.
I have a sequence in which each element refers to a specific data source, from which we get an RDD and process it. I have defined a function that takes the SparkContext and the data source and does the necessary work. I am currently using a while loop, but I would like to do it with foreach or map so that the processing runs in parallel. The function needs the SparkContext, but how can I pass it from within foreach?
This is just sample code, as I cannot present the actual code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object RoughWork {
  def main(args: Array[String]): Unit = {
    val str = "Hello,hw:How,sr:are,ws:You,re"
    val conf = new SparkConf
    conf.setMaster("local")
    conf.setAppName("app1")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val rdd = sc.parallelize(str.split(":"))
    rdd.map(x => { println("==>" + x); passTest(sc, x) }).collect()
  }
  def passTest(context: SparkContext, input: String): Unit = {
    val rdd1 = context.parallelize(input.split(","))
    rdd1.foreach(println)
  }
}
You cannot pass the SparkContext around like that. passTest will run on the executors, while the SparkContext lives only on the driver.
If I had to do a double split like that, one approach would be to use flatMap:
rdd
  .zipWithIndex
  .flatMap { l =>
    val parts = l._1.split(",")
    // Tag every part with the index of the line it came from.
    List.fill(parts.length)(l._2) zip parts
  }
  .countByKey
There may be prettier ways, but the basic idea is to use zipWithIndex to keep track of which line an item came from and then use key-value pair RDD methods to work on the data.
If you have more than one key, or just more structured data in general, you can look into using Spark SQL with DataFrames (or Datasets in recent versions) and explode instead of flatMap.
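To see what the zipWithIndex and flatMap steps produce, the same logic can be run on plain Scala collections with the sample string from the question. This is only a local illustration with hypothetical names (TagPartsDemo and its fields); no Spark is involved, and the index is an Int here rather than the Long that RDD.zipWithIndex returns.

```scala
object TagPartsDemo {
  val lines: Array[String] = "Hello,hw:How,sr:are,ws:You,re".split(":")

  // Same shape as the RDD version: (lineIndex, part) pairs.
  val tagged: List[(Int, String)] =
    lines.zipWithIndex.toList.flatMap { case (line, index) =>
      line.split(",").toList.map(part => (index, part))
    }

  // Local stand-in for countByKey: number of parts per line.
  val partsPerLine: Map[Int, Int] =
    tagged.groupBy(_._1).map { case (index, pairs) => (index, pairs.size) }
}
```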