Scala: How to take any generic sequence as input to this method - scala

Scala noob here. Still trying to learn the syntax.
I am trying to reduce the code I have to write to convert my test data into DataFrames. Here is what I have right now:
def makeDf[T](seq: Seq[(Int, Int)], colNames: String*): Dataset[Row] = {
  val context = session.sqlContext
  import context.implicits._
  seq.toDF(colNames: _*)
}
The problem is that the above method only takes a sequence of the shape Seq[(Int, Int)] as input. How do I make it take any sequence as input? I can change the input's shape to Seq[AnyRef], but then the code fails to recognize the toDF call as a valid symbol.
I am not able to figure out how to make this work. Any ideas? Thanks!

Short answer:
import scala.reflect.runtime.universe.TypeTag
def makeDf[T <: Product: TypeTag](seq: Seq[T], colNames: String*): DataFrame = ...
Explanation:
When you are calling seq.toDF you are actually using an implicit defined in SQLImplicits:
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = {
  DatasetHolder(_sqlContext.createDataset(s))
}
which in turn requires the generation of an encoder. The problem is that encoders are defined only for certain types, specifically Product types (i.e. tuples, case classes, etc.). You also need to add the TypeTag implicit so that Scala can get around type erasure (at runtime all sequences have the same type Seq regardless of the generic type parameter; TypeTag provides the missing type information).
As a side note, you do not need to extract the SQLContext from the session; you can simply use:
import sparkSession.implicits._
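Putting the pieces together, a minimal sketch of the restricted-to-Product version (assuming a SparkSession value named sparkSession is in scope) could look like this:

import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.DataFrame

// Sketch: T must be a Product (tuple or case class) with a TypeTag,
// so that sparkSession.implicits._ can supply the Encoder behind toDF.
def makeDf[T <: Product: TypeTag](seq: Seq[T], colNames: String*): DataFrame = {
  import sparkSession.implicits._
  seq.toDF(colNames: _*)
}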

As #AssafMendelson already explained, the real reason why you cannot create a Dataset of Any is that Spark needs an Encoder to transform objects from their JVM representation to its internal representation, and Spark cannot guarantee the generation of such an Encoder for the Any type.
Assaf's answer is correct, and will work.
However, IMHO, it is too restrictive, as it will only work for Products (tuples and case classes); even though that covers most use cases, a few are still excluded.
Since what you really need is an Encoder, you can leave that responsibility to the client, which in most situations will only need to call import spark.implicits._ to get one in scope.
Thus, this is what I believe will be the most general solution.
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, SparkSession}
// Implicit SparkSession to make the call to further methods more transparent.
implicit val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
def makeDf[T: Encoder](seq: Seq[T], colNames: String*)
                      (implicit spark: SparkSession): DataFrame =
  spark.createDataset(seq).toDF(colNames: _*)

def makeDS[T: Encoder](seq: Seq[T])
                      (implicit spark: SparkSession): Dataset[T] =
  spark.createDataset(seq)
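For example, a hypothetical call site (the column names and data are illustrative; the tuple Encoder comes from the import spark.implicits._ above):

// Hypothetical usage sketch: (Int, String) tuples get their Encoder from spark.implicits._
val df: DataFrame = makeDf(Seq((1, "a"), (2, "b")), "id", "label")
val ds: Dataset[(Int, String)] = makeDS(Seq((1, "a"), (2, "b")))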
Note: This is basically re-inventing the already defined functions from Spark.

Related

How to define methods to deal with Datasets with parametrized types?

I'm trying to define some functions that take Datasets (typed DataFrames) as input and produce another as output, and I want them to be flexible enough to deal with parametrized types. In this example, I need a column to represent the ID of users, but it doesn't matter to my functions whether that ID is an Int, a Long, a String, etc. That's why my case classes have this type parameter A.
I tried at first simply writing my function and using Dataset instead of DataFrame:
import org.apache.spark.sql.Dataset
case class InputT[A](id: A, data: Long)
case class OutputT[A](id: A, dataA: Long, dataB: Long)
def someFunction[A](ds: Dataset[InputT[A]]): Dataset[OutputT[A]] = {
  ds.select().as[OutputT[A]] // suppose there are some transformations here
}
... but I got this error:
Unable to find encoder for type OutputT[A]. An implicit Encoder[OutputT[A]] is needed
to store OutputT[A] instances in a Dataset. Primitive types (Int, String, etc) and
Product types (case classes) are supported by importing spark.implicits._ Support
for serializing other types will be added in future releases.
So I tried providing an Encoder for my type:
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
case class InputT[A](id: A, data: Long)
case class OutputT[A](id: A, dataA: Long, dataB: Long)
implicit def enc[A]: Encoder[InputT[A]] = implicitly(ExpressionEncoder[OutputT[A]])
def someFunction[A](ds: Dataset[InputT[A]]): Dataset[OutputT[A]] = {
  ds.select().as[OutputT[A]] // suppose there are some transformations here
}
And now I get this error:
No TypeTag available for OutputT[A]
If the code is the same as above, but without type parameters (e.g., using String instead of A), then there are no errors.
Avoiding the use of import spark.implicits._ magic if possible, what should I do to fix it? Is it even possible to achieve this level of flexibility with Dataset?
If you check the Scaladoc, you would see that as requires an Encoder, so you only need to add one to the scope:
def someFunction[A](ds: Dataset[InputT[A]])(implicit ev: Encoder[OutputT[A]]): Dataset[OutputT[A]] = {
  ds.select().as[OutputT[A]]
}
Also, you may want to take a look at Where does Scala look for implicits.
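For completeness, a hypothetical call site that avoids importing spark.implicits._ by supplying the Encoder explicitly. It fixes the id type to a concrete String, so Encoders.product can build a Product encoder (inputDs is a hypothetical Dataset[InputT[String]]):

import org.apache.spark.sql.{Dataset, Encoders}

// Hypothetical: with the id type fixed to String, a product encoder for
// OutputT[String] can be built without spark.implicits._.
val out: Dataset[OutputT[String]] =
  someFunction(inputDs)(Encoders.product[OutputT[String]])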

How to derive a decoder semiautomatically for a list of some type with Circe?

I have an implicit class that decodes the server's response into JSON and later into the right case class, to avoid repeating calls to .as and .getOrElse all around the tests:
implicit class RouteTestResultBody(testResult: RouteTestResult) {
  def body: String = bodyOf(testResult)

  def decodedBody[T](implicit d: Decoder[T]): T =
    decode[Json](body)
      .fold(err => throw new Exception(s"Body is not a valid JSON: $body"), identity)
      .as[T]
      .getOrElse(throw new Exception(s"JSON doesn't have the right shape: $body"))
}
Of course, it relies on us passing a decoder:
import io.circe.generic.semiauto.deriveDecoder
val result: RouteTestResult = ...
result.decodedBody(deriveDecoder[SomeType[AnotherType]])
It works most of the time, but fails when the response is a list:
result.decodedBody(deriveDecoder[List[SomeType]])
// throws "JSON doesn't have the right shape"
How can I semiautomatically derive a decoder for a list with specific types inside?
The terminology here is unfortunately overloaded, in that we use "deriving" in two senses:
Providing an instance for e.g. List[A] given an instance for A.
Providing an instance for a case class or sealed trait hierarchy given instances for all member types.
This problem isn't specific to Circe, or even Scala. In writing about Circe I generally try to avoid referring to the first kind of instance generation as "derivation" at all, and to refer to the second kind as "generic derivation" to emphasize that we're generating instances via a generic representation of the algebraic data type.
The fact that we sometimes use the same word to refer to both kinds of type class instance generation is a problem because they're typically very distinct mechanisms in Scala. In Circe the thing that provides an encoder or decoder instance for List[A] given one for A is a method in the type class companion object. For example, in the object Decoder in circe-core we have a method like this:
implicit def decodeList[A](implicit decodeA: Decoder[A]): Decoder[List[A]] = ...
Because this method definition is in the Decoder companion object, if you ask for an implicit Decoder[List[A]] in a context where you have an implicit Decoder[A], the compiler will find and use decodeList. You don't need any imports or extra definitions. For example:
scala> case class Foo(i: Int)
class Foo
scala> import io.circe.Decoder, io.circe.parser
import io.circe.Decoder
import io.circe.parser
scala> implicit val decodeFoo: Decoder[Foo] = Decoder[Int].map(Foo(_))
val decodeFoo: io.circe.Decoder[Foo] = io.circe.Decoder$$anon$1#6e992c05
scala> parser.decode[List[Foo]]("[1, 2, 3]")
val res0: Either[io.circe.Error,List[Foo]] = Right(List(Foo(1), Foo(2), Foo(3)))
If we desugared the implicit machinery here, it would look like this:
scala> parser.decode[List[Foo]]("[1, 2, 3]")(Decoder.decodeList(decodeFoo))
val res1: Either[io.circe.Error,List[Foo]] = Right(List(Foo(1), Foo(2), Foo(3)))
Note that we could replace the first kind of derivation with the second, and it would still compile:
scala> import io.circe.generic.semiauto.deriveDecoder
import io.circe.generic.semiauto.deriveDecoder
scala> parser.decode[List[Foo]]("[1, 2, 3]")(deriveDecoder[List[Foo]])
val res2: Either[io.circe.Error,List[Foo]] = Left(DecodingFailure(CNil, List()))
This compiles because Scala's List is an algebraic data type that has a generic representation that circe-generic can create an instance for. The decoding fails for this input, though, since this representation doesn't result in the encoding we expect. We can derive the corresponding encoder to see what this encoding looks like:
scala> import io.circe.Encoder, io.circe.generic.semiauto.deriveEncoder
import io.circe.Encoder
import io.circe.generic.semiauto.deriveEncoder
scala> implicit val encodeFoo: Encoder[Foo] = Encoder[Int].contramap(_.i)
val encodeFoo: io.circe.Encoder[Foo] = io.circe.Encoder$$anon$1#2717857a
scala> deriveEncoder[List[Foo]].apply(List(Foo(1), Foo(2)))
val res3: io.circe.Json =
{
  "::" : [
    1,
    2
  ]
}
So we're actually seeing the :: case class for List, which is basically never what we want.
If you need to provide a Decoder[List[Foo]] explicitly, the solution is to use either the Decoder.apply "summoner" method, or to call Decoder.decodeList explicitly:
scala> Decoder[List[Foo]]
val res4: io.circe.Decoder[List[Foo]] = io.circe.Decoder$$anon$44#5d40f590
scala> Decoder.decodeList[Foo]
val res5: io.circe.Decoder[List[Foo]] = io.circe.Decoder$$anon$44#2f936a01
scala> Decoder.decodeList(decodeFoo)
val res6: io.circe.Decoder[List[Foo]] = io.circe.Decoder$$anon$44#7f525e05
These all provide exactly the same instance, and which you should choose is a matter of taste.
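Applied to the helper from the question, that means semiautomatically deriving only the element decoder and letting the companion object lift it to lists. A sketch, assuming SomeType[AnotherType] is an ordinary case class whose fields (including AnotherType) already have Decoder instances:

import io.circe.Decoder
import io.circe.generic.semiauto.deriveDecoder

// Derive the element decoder once; Decoder.decodeList (in the companion object)
// then supplies Decoder[List[SomeType[AnotherType]]] implicitly.
implicit val decodeSomeType: Decoder[SomeType[AnotherType]] =
  deriveDecoder[SomeType[AnotherType]]

val xs: List[SomeType[AnotherType]] = result.decodedBody[List[SomeType[AnotherType]]]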
As a footnote, I've thought about special-casing List in circe-generic so that deriveDecoder[List[X]] doesn't compile, since it's approximately never what you want (but seems like it might be, especially because of the confusing way we talk about instance derivation). I typically don't like the idea of having special cases like that, but I think in this case it might be the right thing to do, since this question comes up a lot.

Scala - template method pattern in a trait

I'm implementing a template method pattern in Scala. The idea is that the method returns a Dataset[Metric].
But when I'm converting enrichedMetrics to a Dataset with enrichedMetrics.as[Metric], I have to use implicits in order to map the records to the specified type. This means passing a SparkSession to the MetricsProcessor, which does not seem like the best solution to me.
The solution I see now is to pass spark: SparkSession as a parameter to the template method. And then import spark.implicits._ within the template method.
Is there a more proper way to implement the template method pattern in this case?
trait MetricsProcessor {
  // Template method
  def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] = {
    val metricsFromSource: DataFrame = queryMetrics(startDate, endDate)
    val enrichedMetrics = enrichMetrics(metricsFromSource, metricId)
    enrichedMetrics.as[Metric] // <-- requires spark.implicits
  }

  // abstract method
  def queryMetrics(startDate: Date, endDate: Date): DataFrame

  def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame = {
    /* Default implementation */
  }
}
You're missing the Encoder for your type Metric here, which Spark cannot find implicitly; for common types like String, Int, etc., Spark has implicit encoders.
Also, you cannot do a simple .as on a DataFrame if the columns of the source type and destination type aren't the same. I'll make some assumptions here.
For a case class Metric
case class Metric( ??? )
the line in parseMetrics will change to:
Option 1 - Explicitly passing the Encoder
enrichedMetrics.map(row => Metric( ??? ))(Encoders.product[Metric])
Option 2 - Implicitly passing the Encoder
implicit val enc : Encoder[Metric] = Encoders.product[Metric]
enrichedMetrics.map(row => Metric( ??? ))
Note, as pointed out in one of the comments, if your parseMetrics method always returns a Dataset[Metric], you can add the implicit encoder to the body of the trait.
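A minimal sketch of that variant, assuming Metric is a case class defined elsewhere (so Encoders.product can build its Encoder) and that Date here is java.sql.Date:

import java.sql.Date // assumption: the original code may use a different Date type
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Encoders}

trait MetricsProcessor {
  // Built once in the trait body; Metric is assumed to be a case class (a Product).
  implicit val metricEncoder: Encoder[Metric] = Encoders.product[Metric]

  // Template method
  def parseMetrics(startDate: Date, endDate: Date, metricId: Long): Dataset[Metric] =
    enrichMetrics(queryMetrics(startDate, endDate), metricId).as[Metric]

  def queryMetrics(startDate: Date, endDate: Date): DataFrame

  def enrichMetrics(metricsDf: DataFrame, metricId: Long): DataFrame =
    metricsDf // placeholder for the default implementation
}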
Hope this helped.

How to make it a monad?

I am trying to validate a list of strings sequentially and define the validation result type like this:
import cats._, cats.data._, cats.implicits._
case class ValidationError(msg: String)
type ValidationResult[A] = Either[NonEmptyList[ValidationError], A]
type ListValidationResult[A] = ValidationResult[List[A]] // not a monad :(
I would like to make ListValidationResult a monad. Should I implement flatMap and pure manually, or is there an easier way?
I suggest you take a totally different approach, leveraging cats' Validated:
import cats.data.Validated.{ invalidNel, valid }

val stringList: List[String] = ???

def evaluateString(s: String): ValidatedNel[ValidationError, String] =
  if (???) valid(s) else invalidNel(ValidationError(s"invalid $s"))

val validationResult: ListValidationResult[String] =
  stringList.map(evaluateString).sequenceU.toEither
It can be adapted for a generic type T, as per your example.
Notes:
val stringList: List[String] = ??? is the list of strings you want to validate;
ValidatedNel[A,B] is just a type alias for Validated[NonEmptyList[A],B];
evaluateString should be your evaluation function; it is currently just a stub with an unimplemented if condition;
for sequenceU, you may want to read the cats documentation about it: sequenceU;
toEither does exactly what you think it does, it converts a Validated[A,B] to an Either[A,B].
As #Michael pointed out, you could also use traverseU instead of map and sequenceU:
val validationResult: ListValidationResult[String] =
  stringList.traverseU(evaluateString).toEither
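Note that in more recent cats versions sequenceU and traverseU are no longer available; the plain traverse works directly. A self-contained, REPL-style sketch with a hypothetical non-empty-string rule:

import cats.data.{NonEmptyList, Validated, ValidatedNel}
import cats.implicits._

case class ValidationError(msg: String)
type ValidationResult[A] = Either[NonEmptyList[ValidationError], A]

// Hypothetical rule: a string is valid when it is non-empty.
def evaluateString(s: String): ValidatedNel[ValidationError, String] =
  if (s.nonEmpty) Validated.valid(s)
  else Validated.invalidNel(ValidationError(s"invalid $s"))

// traverse accumulates all errors; toEither converts back to the Either-based alias.
val validationResult: ValidationResult[List[String]] =
  List("a", "", "b").traverse(evaluateString).toEither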

Why the case class returns a dataframe in spark sql

Below, just take Union as an example.
I am reading the Spark SQL source code, and got stuck on this code, which is in DataFrame.scala:
def unionAll(other: DataFrame): DataFrame = Union(logicalPlan, other.logicalPlan)
and Union is a case class which is defined like this:
case class Union(left: LogicalPlan, right: LogicalPlan) extends BinaryNode {...}
I'm confused: how can the result be treated as an instance of the DataFrame type?
Well, if something is not clear in Scala it has to be an implicit. First, let's take a look at the BinaryNode definition:
abstract class BinaryNode extends LogicalPlan
Since a LogicalPlan combined with an SQLContext is the only thing required to create a DataFrame, it looks like a good place for a conversion. And here it is:
@inline private implicit def logicalPlanToDataFrame(logicalPlan: LogicalPlan): DataFrame = {
  new DataFrame(sqlContext, logicalPlan)
}
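With that implicit in scope, the compiler effectively rewrites unionAll into something like the following (a sketch of the desugaring, not the literal source):

// Desugared view: the Union logical plan is wrapped into a DataFrame
// by the implicit conversion before being returned.
def unionAll(other: DataFrame): DataFrame =
  logicalPlanToDataFrame(Union(logicalPlan, other.logicalPlan))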
Actually this conversion was removed in 1.6.0 by SPARK-11513, with the following description:
DataFrame has an internal implicit conversion that turns a LogicalPlan into a DataFrame. This has been fairly confusing to a few new contributors. Since it doesn't buy us much, we should just remove that implicit conversion.