Scala - template method pattern in a trait - scala

I'm implementing a template method pattern in Scala. The idea is that the method returns a Dataset[Metric].
But when I'm converting enrichedMetrics to a DataSet enrichedMetrics.as[Metric] I have to use implicits in order to map the records to the specified type. This means passing a SparkSession to the MetricsProcessor which seems not the best solution to me.
The solution I see now is to pass spark: SparkSession as a parameter to the template method. And then import spark.implicits._ within the template method.
Is there a more proper way to implement the template method pattern in this case?
trait MetricsProcessor {
// Template method
def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] = {
val metricsFromSource: DataFrame = queryMetrics(startDate, endDate)
val enrichedMetrics = enrichMetrics(metricsFromSource, metricId)
enrichedMetrics.as[Metric] <--- //requires spark.implicits
}
// abstract method
def queryMetrics(startDate: Date, endDate: Date): DataFrame
def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame = {
/*Default implementation*/
}
}

You're missing the Encoder for your type Metric here which spark cannot find implicitly, for common types like String, Int etc, spark has implicit encoders.
Also, you cannot do a simple .as on a data frame if the columns in the source type and destination type aren't the same. I'll make some assumptions here.
For a case class Metric
case class Metric( ??? )
the line in parseMetrics will change to,
Option 1 - Explicitly passing the Encoder
enrichedMetrics.map(row => Metric( ??? ))(Encoders.product[Metric])
Option 2 - Implicitly passing the Encoder
implicit val enc : Encoder[Metric] = Encoders.product[Metric]
enrichedMetrics.map(row => Metric( ??? ))
Note, as pointed in one of the comments, if your parseMetric method is always returning the Dataset[Metric], you can add the implicit encoder to the body of the trait.
Hope this helped.

Related

Scala: How to take any generic sequence as input to this method

Scala noob here. Still trying to learn the syntax.
I am trying to reduce the code I have to write to convert my test data into DataFrames. Here is what I have right now:
def makeDf[T](seq: Seq[(Int, Int)], colNames: String*): Dataset[Row] = {
val context = session.sqlContext
import context.implicits._
seq.toDF(colNames: _*)
}
The problem is that the above method only takes a sequence of the shape Seq[(Int, Int)] as input. How do I make it take any sequence as input? I can change the inputs shape to Seq[AnyRef], but then the code fails to recognize the toDF call as valid symbol.
I am not able to figure out how to make this work. Any ideas? Thanks!
Short answer:
import scala.reflect.runtime.universe.TypeTag
def makeDf[T <: Product: TypeTag](seq: Seq[T], colNames: String*): DataFrame = ...
Explanation:
When you are calling seq.toDF you are actually using an implicit defined in SQLImplicits:
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
DatasetHolder(_sqlContext.createDataset(s))
}
which in turn requires the generation of an encoder. The problem is that encoders are defined only on certain types. Specifically Product (i.e. tuple, case class etc.) You also need to add the TypeTag implicit so that Scala can get over the type erasure (in the runtime all Sequences have the type sequence regardless of the generics type. TypeTag provides information on this).
As a side node, you do not need to extract sqlcontext from the session, you can simply use:
import sparkSession.implicits._
As #AssafMendelson already explained the real reason of why you cannot create a Dataset of Any is because Spark needs an Encoder to transform objects from they JVM representation to its internal representation - and Spark cannot guarantee the generation of such Encoder for Any type.
Assaf answers is correct, and will work.
However, IMHO, it is too much restrictive as it will only work for Products (tuples, and case classes) - and even if that includes most use cases, there still a few ones excluded.
Since, what you really need is an Encoder, you may leave that responsibility to the client. Which in most situation will only need to call import spark.implicits._ to get them in scope.
Thus, this is what I believe will be the most general solution.
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
// Implicit SparkSession to make the call to further methods more transparent.
implicit val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
def makeDf[T: Encoder](seq: Seq[T], colNames: String*)
(implicit spark: SparkSession): DataFrame =
spark.createDataset(seq).toDF(colNames: _*)
def makeDS[T: Encoder](seq: Seq[T])
(implicit spark: SparkSession): Dataset[T] =
spark.createDataset(seq)
Note: This is basically re-inventing the already defined functions from Spark.

Pass case class to Spark UDF

I have a scala-2.11 function which creates a case class from Map based on the provided class type.
def createCaseClass[T: TypeTag, A](someMap: Map[String, A]): T = {
val rMirror = runtimeMirror(getClass.getClassLoader)
val myClass = typeOf[T].typeSymbol.asClass
val cMirror = rMirror.reflectClass(myClass)
// The primary constructor is the first one
val ctor = typeOf[T].decl(termNames.CONSTRUCTOR).asTerm.alternatives.head.asMethod
val argList = ctor.paramLists.flatten.map(param => someMap(param.name.toString))
cMirror.reflectConstructor(ctor)(argList: _*).asInstanceOf[T]
}
I'm trying to use this in the context of a spark data frame as a UDF. However, I'm not sure what's the best way to pass the case class. The approach below doesn't seem to work.
def myUDF[T: TypeTag] = udf { (inMap: Map[String, Long]) =>
createCaseClass[T](inMap)
}
I'm looking for something like this-
case class MyType(c1: String, c2: Long)
val myUDF = udf{(MyType, inMap) => createCaseClass[MyType](inMap)}
Thoughts and suggestions to resolve this is appreciated.
However, I'm not sure what's the best way to pass the case class
It is not possible to use case classes as arguments for user defined functions. SQL StructTypes are mapped to dynamically typed (for lack of a better word) Row objects.
If you want to operate on statically typed objects please use statically typed Dataset.
From try and error I learn that whatever data structure that is stored in a Dataframe or Dataset is using org.apache.spark.sql.types
You can see with:
df.schema.toString
Basic types like Int,Double, are stored like:
StructField(fieldname,IntegerType,true),StructField(fieldname,DoubleType,true)
Complex types like case class are transformed to a combination of nested types:
StructType(StructField(..),StructField(..),StructType(..))
Sample code:
case class range(min:Double,max:Double)
org.apache.spark.sql.Encoders.product[range].schema
//Output:
org.apache.spark.sql.types.StructType = StructType(StructField(min,DoubleType,false), StructField(max,DoubleType,false))
The UDF parameter type in this cases is Row, or Seq[Row] when you store an array of case classes
A basic debug technic is print to string:
val myUdf = udf( (r:Row) => r.schema.toString )
then, to see was happen:
df.take(1).foreach(println) //

Polymorphism with Spark / Scala, Datasets and case classes

We are using Spark 2.x with Scala for a system that has 13 different ETL operations. 7 of them are relatively simple and each driven by a single domain class, and differ primarily by this class and some nuances in how the load is handled.
A simplified version of the load class is as follows, for the purposes of this example say that there are 7 pizza toppings being loaded, here's Pepperoni:
object LoadPepperoni {
def apply(inputFile: Dataset[Row],
historicalData: Dataset[Pepperoni],
mergeFun: (Pepperoni, PepperoniRaw) => Pepperoni): Dataset[Pepperoni] = {
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val rawData: Dataset[PepperoniRaw] = inputFile.rdd.map{ case row : Row =>
PepperoniRaw(
weight = row.getAs[String]("weight"),
cost = row.getAs[String]("cost")
)
}.toDS()
val validatedData: Dataset[PepperoniRaw] = ??? // validate the data
val dedupedRawData: Dataset[PepperoniRaw] = ??? // deduplicate the data
val dedupedData: Dataset[Pepperoni] = dedupedRawData.rdd.map{ case datum : PepperoniRaw =>
Pepperoni( value = ???, key1 = ???, key2 = ??? )
}.toDS()
val joinedData = dedupedData.joinWith(historicalData,
historicalData.col("key1") === dedupedData.col("key1") &&
historicalData.col("key2") === dedupedData.col("key2"),
"right_outer"
)
joinedData.map { case (hist, delta) =>
if( /* some condition */) {
hist.copy(value = /* some transformation */)
}
}.flatMap(list => list).toDS()
}
}
In other words the class performs a series of operations on the data, the operations are mostly the same and always in the same order, but can vary slightly per topping, as would the mapping from "raw" to "domain" and the merge function.
To do this for 7 toppings (i.e. Mushroom, Cheese, etc), I would rather not simply copy/paste the class and change all of the names, because the structure and logic is common to all loads. Instead I'd rather define a generic "Load" class with generic types, like this:
object Load {
def apply[R,D](inputFile: Dataset[Row],
historicalData: Dataset[D],
mergeFun: (D, R) => D): Dataset[D] = {
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val rawData: Dataset[R] = inputFile.rdd.map{ case row : Row =>
...
And for each class-specific operation such as mapping from "raw" to "domain", or merging, have a trait or abstract class that implements the specifics. This would be a typical dependency injection / polymorphism pattern.
But I'm running into a few problems. As of Spark 2.x, encoders are only provided for native types and case classes, and there is no way to generically identify a class as a case class. So the inferred toDS() and other implicit functionality is not available when using generic types.
Also as mentioned in this related question of mine, the case class copy method is not available when using generics either.
I have looked into other design patterns common with Scala and Haskell such as type classes or ad-hoc polymorphism, but the obstacle is the Spark Dataset basically only working on case classes, which can't be abstractly defined.
It seems that this would be a common problem in Spark systems but I'm unable to find a solution. Any help appreciated.
The implicit conversion that enables .toDS is:
implicit def rddToDatasetHolder[T](rdd: RDD[T])(implicit arg0: Encoder[T]): DatasetHolder[T]
(from https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLImplicits)
You are exactly correct in that there's no implicit value in scope for Encoder[T] now that you've made your apply method generic, so this conversion can't happen. But you can simply accept one as an implicit parameter!
object Load {
def apply[R,D](inputFile: Dataset[Row],
historicalData: Dataset[D],
mergeFun: (D, R) => D)(implicit enc: Encoder[D]): Dataset[D] = {
...
Then at the time you call the load, with a specific type, it should be able to find an Encoder for that type. Note that you will have to import sparkSession.implicits._ in the calling context as well.
Edit: a similar approach would be to enable the implicit newProductEncoder[T <: Product](implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): Encoder[T] to work by bounding your type (apply[R, D <: Product]) and accepting an implicit JavaUniverse.TypeTag[D] as a parameter.

Why the case class returns a dataframe in spark sql

Below just take Union as an example.
I am reading the spark sql source code, and got stucken on this code, which is in the DataFrame.scala
def unionAll(other: DataFrame): DataFrame = Union(logicalPlan, other.logicalPlan)
and the Union is a case class which defined like this
case class Union(left: LogicalPlan, right: LogicalPlan) extends BinaryNode {...}
I'm confused, how can the result be treat as a instance of DataFrame type?
Well, if something is not clear in Scala it has to be an implicit. First lets take a look at the BinaryNode node definition:
abstract class BinaryNode extends LogicalPlan
Since LogicalPlan combined with SQLContext is the only thing required to create a DataFrame it looks like a good place for a conversion. And here it is:
#inline private implicit def logicalPlanToDataFrame(logicalPlan: LogicalPlan):
DataFrame = {
new DataFrame(sqlContext, logicalPlan)
}
Actually this conversion has been removed in 1.6.0 by SPARK-11513 with a following description:
DataFrame has an internal implicit conversion that turns a LogicalPlan into a DataFrame. This has been fairly confusing to a few new contributors. Since it doesn't buy us much, we should just remove that implicit conversion.

Scala wrapping method of a parametrized class (spark-cassandra-connector)

I am writing a set of methods that extend Spark RDD's API.
I have to implement a general method for storing the RDDs, and for a start I tried to wrap spark-cassandra-connector's saveAsCassandraTable, without success.
Here's the "extending RDD's API" part:
object NewRDDFunctions {
implicit def addStorageFunctions[T](rdd: RDD[T]):
RDDStorageFunctions[T] = new RDDStorageFunctions(rdd)
}
class RDDStorageFunctions[T](rdd: RDD[T]) {
def saveResultsToCassandra() {
rdd.saveAsCassandraTable("ks_name", "table_name") // this line produces errors!
}
}
...and importing the object as: import ...NewRDDFunctions._.
The marked line produces following errors:
Error:(99, 29) could not find implicit value for parameter rwf: com.datastax.spark.connector.writer.RowWriterFactory[T]
rdd.saveAsCassandraTable("ks_name", "table_name")
^
Error:(99, 29) not enough arguments for method saveAsCassandraTable: (implicit connector: com.datastax.spark.connector.cql.CassandraConnector, implicit rwf: com.datastax.spark.connector.writer.RowWriterFactory[T], implicit columnMapper: com.datastax.spark.connector.mapper.ColumnMapper[T])Unit.
Unspecified value parameters rwf, columnMapper.
rdd.saveAsCassandraTable("ks_name", "table_name")
^
I don't get why this doesn't work since saveAsCassandraTable is designed to work on any RDD. Any suggestions?
I had similar problem with the example in spark-cassandra-connector docs:
case class WordCount(word: String, count: Long)
val collection = sc.parallelize(Seq(WordCount("dog", 50), WordCount("cow", 60)))
collection.saveAsCassandraTable("test", "words_new", SomeColumns("word", "count"))
...and the solution was to move case class definition out of "main" function (but I don't really know if this applies to the mentioned problem...).
saveAsCassandraTable needs 3 implicit parameters. The first one (connector) has a default value, the last two (rwf and columnMapper) are not in implicit scope in your saveResultsToCassandra method, as a consequence your method doesn't compile.
Look at this answer on another question, if you need some more information about implicits.
Turning your saveResultsToCassandra into the function below should work, if you have defined your tables (TableDef) before.
def saveResultsToCassandra()(
// implicit parameters as a separate list!
implicit rwf: RowWriterFactory[T],
columnMapper: ColumnMapper[T]
) {
rdd.saveAsCassandraTable("ks_name", "table_name")
}