GroupBy + custom aggregation on Dataset with Case class / Trait in the Key - scala

I am trying to refactor some code and put the general logic into a trait. I basically want to process datasets, group them by some key and aggregate:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{ Dataset, Encoder, Encoders, TypedColumn }
case class SomeKey(a: String, b: Boolean)
case class InputRow(
  key: SomeKey,
  v: Double
)
trait MyTrait {
def processInputs: Dataset[InputRow]
def groupAndAggregate(
logs: Dataset[InputRow]
): Dataset[(SomeKey, Long)] = {
import logs.sparkSession.implicits._
logs
.groupByKey(i => i.key)
.agg(someAggFunc)
}
//Whatever agg function: here, it counts the number of v that are >= 0.5
def someAggFunc: TypedColumn[InputRow, Long] =
new Aggregator[
/*input type*/ InputRow,
/* "buffer" type */ Long,
/* output type */ Long
] with Serializable {
def zero = 0L
def reduce(b: Long, a: InputRow) = {
if (a.v >= 0.5)
b + 1
else
b
}
def merge(b1: Long, b2: Long) =
b1 + b2
// map buffer to output type
def finish(b: Long) = b
def bufferEncoder: Encoder[Long] = Encoders.scalaLong
def outputEncoder: Encoder[Long] = Encoders.scalaLong
}.toColumn
}
everything works fine: I can instantiate a class that inherits from MyTrait and override the way I process inputs:
import spark.implicits._
case class MyTraitTest(testDf: DataFrame) extends MyTrait {
override def processInputs: Dataset[InputRow] = {
val ds = testDf
.select(
$"a",
$"b",
$"v",
)
.rdd
.map(
r =>
InputRow(
SomeKey(r.getAs[String]("a"), r.getAs[Boolean]("b")),
r.getAs[Double]("v")
)
)
.toDS
ds
}
}
val df: DataFrame = Seq(
("1", false, 0.40),
("1", false, 0.54),
("0", true, 0.85),
("1", true, 0.39)
).toDF("a", "b", "v")
val myTraitTest = MyTraitTest(df)
val ds: Dataset[InputRow] = myTraitTest.processInputs
val res = myTraitTest.groupAndAggregate(ds)
res.show(false)
+----------+----------------------------------+
|key |InputRow |
+----------+----------------------------------+
|[1, false]|1 |
|[0, true] |1 |
|[1, true] |0 |
+----------+----------------------------------+
Now the problem: I want SomeKey to derive from a more generic trait Key, because the key will not always have only two fields, the fields won't have the same types, etc. It will always be a simple tuple of basic primitive types, though.
So I tried to do the following:
trait Key extends Product
case class SomeKey(a: String, b: Boolean) extends Key
case class SomeOtherKey(x: Int, y: Boolean, z: String) extends Key
case class InputRow[T <: Key](
key: T,
v: Double
)
trait MyTrait[T <: Key] {
def processInputs: Dataset[InputRow[T]]
def groupAndAggregate(
logs: Dataset[InputRow[T]]
): Dataset[(T, Long)] = {
import logs.sparkSession.implicits._
logs
.groupByKey(i => i.key)
.agg(someAggFunc)
}
def someAggFunc: TypedColumn[InputRow[T], Long] = {...}
}
I now do:
case class MyTraitTest(testDf: DataFrame) extends MyTrait[SomeKey] {
override def processInputs: Dataset[InputRow[SomeKey]] = {
...
}
etc.
But now I get the error: Unable to find encoder for type T. An implicit Encoder[T] is needed to store T instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.groupByKey(i => i.key)
I really don't know how to work around this issue; I tried lots of things without success. Sorry for the quite lengthy description, but hopefully you have all the elements to help me understand... thanks!

Spark needs to be able to implicitly create the encoder for product type T, so you'll need to help it work around JVM type erasure and pass a TypeTag for T as an implicit parameter of your groupAndAggregate method.
A working example:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{ DataFrame, Dataset, Encoders, TypedColumn }
import scala.reflect.runtime.universe.TypeTag
trait Key extends Product
case class SomeKey(a: String, b: Boolean) extends Key
case class SomeOtherKey(x: Int, y: Boolean, z: String) extends Key
case class InputRow[T <: Key](key: T, v: Double)
trait MyTrait[T <: Key] {
def processInputs: Dataset[InputRow[T]]
def groupAndAggregate(
logs: Dataset[InputRow[T]]
)(implicit tTypeTag: TypeTag[T]): Dataset[(T, Long)] = {
import logs.sparkSession.implicits._
logs
.groupByKey(i => i.key)
.agg(someAggFunc)
}
def someAggFunc: TypedColumn[InputRow[T], Long] =
new Aggregator[InputRow[T], Long, Long] with Serializable {
def reduce(b: Long, a: InputRow[T]) = b + (a.v * 100).toLong
def merge(b1: Long, b2: Long) = b1 + b2
def zero = 0L
def finish(b: Long) = b
def bufferEncoder = Encoders.scalaLong
def outputEncoder = Encoders.scalaLong
}.toColumn
}
with a wrapping case class
case class MyTraitTest(testDf: DataFrame) extends MyTrait[SomeKey] {
import testDf.sparkSession.implicits._
import org.apache.spark.sql.functions.struct
override def processInputs = testDf
.select(struct($"a", $"b") as "key", $"v" )
.as[InputRow[SomeKey]]
}
and a test execution
val df = Seq(
("1", false, 0.40),
("1", false, 0.54),
("0", true, 0.85),
("1", true, 0.39)
).toDF("a", "b", "v")
val myTraitTest = MyTraitTest(df)
val ds = myTraitTest.processInputs
val res = myTraitTest.groupAndAggregate(ds)
res.show(false)
+----------+-----------------------------------------------+
|key |$anon$1($line5460910223.$read$$iw$$iw$InputRow)|
+----------+-----------------------------------------------+
|[1, false]|94 |
|[1, true] |39 |
|[0, true] |85 |
+----------+-----------------------------------------------+
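As a rough alternative sketch (not from the original answer): instead of a TypeTag, the trait can require the key Encoder itself, and the caller supplies it, e.g. via import spark.implicits._ or Encoders.product[SomeKey]. The grouping logic stays the same.
import org.apache.spark.sql.{Dataset, Encoder, TypedColumn}
trait MyTrait[T <: Key] {
  def processInputs: Dataset[InputRow[T]]
  // Same grouping as above, but the key encoder is an explicit requirement
  // instead of being derived from a TypeTag inside the method.
  def groupAndAggregate(
    logs: Dataset[InputRow[T]]
  )(implicit keyEncoder: Encoder[T]): Dataset[(T, Long)] =
    logs
      .groupByKey(_.key)
      .agg(someAggFunc)
  def someAggFunc: TypedColumn[InputRow[T], Long] // unchanged from the answer above
}
At the call site, import spark.implicits._ (or an explicit Encoders.product[SomeKey]) provides the Encoder[SomeKey].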

How to take Nothing out of an inferred type

The idea comes from this video: https://www.youtube.com/watch?v=BfaBeT0pRe0&t=526s, where they talk about implementing type safety through the implementation of custom types.
And a possible trivial implementation is
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
trait Col[Self] { self: Self =>
}
trait Id extends Col[Id]
object IdCol extends Id
trait Val extends Col[Val]
object ValCol extends Val
trait Comment extends Col[Comment]
object CommentCol extends Comment
case class DataSet[Schema >: Nothing](df: DataFrame) {
def validate[T1 <: Col[T1], T2 <: Col[T2]](
col1: (Col[T1], String),
col2: (Col[T2], String)
): Option[DataSet[Schema with T1 with T2]] =
if (df.columns
.map(e => e.toLowerCase)
.filter(
e =>
e.toLowerCase() == col1._2.toLowerCase || e
.toLowerCase() == col2._2.toLowerCase
)
.length >= 1)
Some(DataSet[Schema with T1 with T2](df))
else None
}
object SchemaTypes extends App {
lazy val spark: SparkSession = SparkSession
.builder()
.config(
new SparkConf()
.setAppName(
getClass()
.getName()
)
)
.getOrCreate()
import spark.implicits._
val df = Seq(
(1, "a", "first value"),
(2, "b", "second value"),
(3, "c", "third value")
).toDF("Id", "Val", "Comment")
val myData =
DataSet/*[Id with Val with Comment]*/(df)
.validate(IdCol -> "Id", ValCol -> "Val")
myData match {
case None => throw new java.lang.Exception("Required columns missing")
case _ =>
}
}
The type for myData is Option[DataSet[Nothing with T1 with T2]]. It makes sense, since the constructor is called without any type parameter, but in the video they show the type to be in line with DataSet[T1 with T2].
Of course, changing the invocation by passing an explicit type takes Nothing out, but it is redundant to specify the type parameter value since the types are already included in the arg list.
val myData =
DataSet[Id with Val with Comment](df).validate(IdCol -> "Id", ValCol -> "Val")
Types Id and Val can be inferred because IdCol and ValCol appear inside .validate, but the type Comment can't be inferred. So try
val myData =
DataSet[Comment](df)
.validate(IdCol -> "Id", ValCol -> "Val")
println(shapeless.test.showType(SchemaTypes.myData))
//Option[App.DataSet[App.Comment with App.Id with App.Val]]
https://scastie.scala-lang.org/yj0HnpkyQfCreKq8ZV4D7A
Actually if you specify DataSet[Id with Val with Comment](df) the type will be Option[DataSet[Id with Val with Comment with Id with Val]], which is equal (=:=) to Option[DataSet[Id with Val with Comment]].
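To see that equality concretely, here is a small self-check of my own; it compiles only because each compound type conforms to the other, which is what the =:= claim amounts to here:
// mutual conformance between the two compound types
implicitly[(Id with Val with Comment with Id with Val) <:< (Id with Val with Comment)]
implicitly[(Id with Val with Comment) <:< (Id with Val with Comment with Id with Val)]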
Ok, I watched the video up to that timecode. I guess the speakers were trying to explain their idea (combining F-bounded polymorphism T <: Col[T] with intersection types T with U), and you shouldn't take their slides literally; there can be inaccuracies there.
First they show the slide
case class DataSet[Schema](df: DataFrame) {
def validate[T <: Col[T]](
col: (Col[T], String)
): Option[DataSet[Schema with T]] = ???
}
and this code can be illustrated with
val myDF: DataFrame = ???
val myData = DataSet[VideoId](myDF).validate(Country -> "country_code")
myData : Option[DataSet[VideoId with Country]]
Then they show the slide
val myData = DataSet(myDF).validate(
VideoId -> "video_id",
Country -> "country_code",
ProfileId -> "profile_id",
Score -> "score"
)
myData : DataSet[VideoId with Country with ProfileId with Score]
but this example code doesn't correspond to the previous slide. You would have to define
// actually we don't use Schema here
case class DataSet[Schema](df: DataFrame) {
def validate[T1 <: Col[T1], T2 <: Col[T2], T3 <: Col[T3], T4 <: Col[T4]](
col1: (Col[T1], String),
col2: (Col[T2], String),
col3: (Col[T3], String),
col4: (Col[T4], String),
): DataSet[T1 with T2 with T3 with T4] = ???
}
So take it as an idea, not literally.
You can have something similar with
case class DataSet[Schema](df: DataFrame) {
def validate[T <: Col[T]](
col: (Col[T], String)
): Option[DataSet[Schema with T]] = ???
}
val myDF: DataFrame = ???
val myData = DataSet[Any](myDF).validate(VideoId -> "video_id").flatMap(
_.validate(Country -> "country_code")
).flatMap(
_.validate(ProfileId -> "profile_id")
).flatMap(
_.validate(Score -> "score")
)
myData: Option[DataSet[VideoId with Country with ProfileId with Score]]
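The same chain can also be written as a for-comprehension, which is just an equivalent sketch of my own reusing the column objects assumed above; note the inferred schema also carries the initial Any, which is harmless:
val myData2 = for {
  d1 <- DataSet[Any](myDF).validate(VideoId -> "video_id")
  d2 <- d1.validate(Country -> "country_code")
  d3 <- d2.validate(ProfileId -> "profile_id")
  d4 <- d3.validate(Score -> "score")
} yield d4
// myData2: Option[DataSet[Any with VideoId with Country with ProfileId with Score]]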
Dmytro Mitin's answer is good, but I wanted to provide some more information.
If you write something like DataSet(df).validate(...), the type parameter of DataSet(df) is inferred first. Here it's Nothing because there's no information which would make it anything else. So the Schema is Nothing, and Schema with T1 with T2 (which appears in the return type of validate) is Nothing with Id with Val.
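A minimal illustration of that inference order, using the DataSet, IdCol, ValCol and df defined above (my own sketch):
val plain = DataSet(df)  // Schema is inferred as Nothing: nothing constrains it here
// plain: DataSet[Nothing]
val checked = plain.validate(IdCol -> "Id", ValCol -> "Val")
// checked: Option[DataSet[Nothing with Id with Val]]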

Scala spark: Create List of Dataset from a Dataset map operation

Suppose I want to create two types of metric, MetricA or MetricB, after transforming another dataset. If a certain condition is met, it generates both MetricA and MetricB; if the condition is not met, it generates only MetricA. The idea is to write the two metrics to two different paths (pathA, pathB).
The approach I took was to create a Dataset of GeneralMetric and then, based on what's inside, write to different paths, but obviously it didn't work, as pattern matching inside a Dataset wouldn't work:
val s: SparkSession = SparkSession
.builder()
.appName("Metric")
.getOrCreate()
import s.implicits._
case class original(id: Int, units: List[Double])
case class MetricA (a: Int, b: Int, filtered_unit: List[Double])
case class MetricB (a: Int, filtered_unit: List[Double])
case class GeneralMetric(metricA: MetricA, metricB: Option[MetricB])
def createA: MetricA = {
  MetricA(1, 1, List(1.0, 2.0))
}
def createB: MetricB = {
  MetricB(1, List(10.0, 20.0))
}
def create(isBoth: Boolean): GeneralMetric = {
  if (isBoth) {
    val a: MetricA = createA
    val b: MetricB = createB
    GeneralMetric(a, Some(b))
  }
  else {
    val a: MetricA = createA
    GeneralMetric(a, None)
  }
}
val originalDF: DataFrame = ??? // produced elsewhere
val result: Dataset[GeneralMetric] =
originalDF.as[original]
.map { r =>
if(r.id == 21) create(true)
else create(false)
}
val pathA: String = "s3://pathA"
val pathB: String = "s3://pathB"
//below code obviously wouldn't work
result.map(x => {
case (metricA, Some(metricB)) => {
metricA.write.parquet(pathA)
metricB.write.parquet(pathB)
}
case (metricA, None) => metricA.write.parquet(pathA)
})
The next approach I was thinking of was putting the results in a List[GeneralMetric], where GeneralMetric is a sealed trait extended by both MetricA and MetricB, but how can I make a Dataset transformation return a list of GeneralMetric?
Any ideas would be helpful
Why wouldn't
result.map({
  case GeneralMetric(metricA, Some(metricB)) =>
    metricA.write.parquet(pathA)
    metricB.write.parquet(pathB)
  case GeneralMetric(metricA, None) => metricA.write.parquet(pathA)
})
work in your case? Is this just a syntax problem?
Also: it seems that you send metrics independently (or at least in this example). You could model it as:
sealed trait Metric {
def write
}
case class MetricA (a: Int, b: Int, filtered_unit: List[Double]) extends Metric {
override def write: Unit = ???
}
case class MetricB (a: Int, filtered_unit: List[Double]) extends Metric {
override def write: Unit = ???
}
and call
import org.apache.spark.sql.{Dataset, Encoder, Encoders}
implicit val enc: Encoder[Metric] = Encoders.kryo[Metric]
val result: Dataset[Metric] =
originalDF.as[original]
.flatMap { r =>
if (r.id == 21) createA :: createB :: Nil
else createA :: Nil
}
result.foreach(_.write)
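If the goal is still one parquet output per metric type (pathA and pathB from the question), a sketch under the same sealed-trait model would be to split the Dataset by runtime type and write each part once, rather than writing inside foreach:
import org.apache.spark.sql.{Dataset, Encoders}
// keep only the MetricA rows, with an explicit product encoder
val metricsA: Dataset[MetricA] = result.flatMap {
  case a: MetricA => Seq(a)
  case _          => Seq.empty[MetricA]
}(Encoders.product[MetricA])
// keep only the MetricB rows
val metricsB: Dataset[MetricB] = result.flatMap {
  case b: MetricB => Seq(b)
  case _          => Seq.empty[MetricB]
}(Encoders.product[MetricB])
metricsA.write.parquet(pathA)
metricsB.write.parquet(pathB)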

Scala - Get string representation of object property name, not value, for comparison

I want to be able to get the string representation of an object's property name, not the property's value, so that I can compare it with a variable's value inside a conditional statement.
case class CustomObj(name: String)
case class PropertyObj(property: String)
val custObj = CustomObj("Chris")
val propObj = PropertyObj("name")
if (propObj.property.equals(custObj. /* the property name as a String, so "name", not the value "Chris" */)) {
// do something
}
How can I access what is essentially the key of the property on the CustomObj?
Try productElementNames (available since Scala 2.13) like so
case class CustomObj(name: String)
case class PropertyObj(property: String)
val custObj = CustomObj("Chris")
val propObj = PropertyObj("name")
if (custObj.productElementNames.toList.headOption.contains(propObj.property)) { ... } else { ... }
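As a small follow-up sketch of my own (still assuming Scala 2.13+, where productElementNames exists): pairing the names with productIterator gives a name-to-value map, which makes the lookup read a bit more naturally.
// zip each field name with its value, then look up by the property name
val fieldsByName: Map[String, Any] =
  custObj.productElementNames.zip(custObj.productIterator).toMap
fieldsByName.contains(propObj.property)  // true: a field called "name" exists
fieldsByName.get(propObj.property)       // Some(Chris): the value behind it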
Addressing the comment, and based on Krzysztof's suggestion, try a shapeless solution
import shapeless._
import shapeless.ops.record._
def firstPropertyNameOf[P <: Product, L <: HList](p: P)(implicit
gen: LabelledGeneric.Aux[P, L],
toMap: ToMap[L]): Option[String] = {
toMap(gen.to(p)).map{ case (k: Symbol, _) => k.name }.toList.headOption
}
firstPropertyNameOf(custObj).contains(propObj.property) // res1: Boolean = true
I will assume you don't know the type of custObj at compile time. Then you'll have to use runtime reflection in Scala 2.12.
scala> case class CustomObj(name: String)
defined class CustomObj
scala> val custObj: Any = CustomObj("Chris")
custObj: Any = CustomObj(Chris)
scala> import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.currentMirror
scala> val sym = currentMirror.classSymbol(custObj.getClass)
sym: reflect.runtime.universe.ClassSymbol = class CustomObj
scala> val props = sym.info.members.collect{ case m if m.isMethod && m.asMethod.isCaseAccessor => m.name.toString }
props: Iterable[String] = List(name)
scala> if (props.exists(_ == "name")) println("ok")
ok
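Building on that session, a follow-up sketch of my own (not part of the original answer): once the case-accessor names are known, the same mirror machinery can also read the value behind a matching name, reusing custObj and sym from above.
import scala.reflect.runtime.currentMirror
val instanceMirror = currentMirror.reflect(custObj)
val valueOfName: Option[Any] = sym.info.members.collectFirst {
  case m if m.isMethod && m.asMethod.isCaseAccessor && m.name.toString == "name" =>
    instanceMirror.reflectMethod(m.asMethod).apply()
}
// valueOfName: Option[Any] = Some(Chris)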

Scala-Cats Validated: value mapN is not a member of ValidatedNel tuple

Scala community.
Currently I'm trying to implement custom model/single-parameter validation using the cats Validated type. But after the removal of Cartesian in cats 1.0, I'm unable to use (v1 |@| v2) map (f) and my code doesn't compile:
import cats.Semigroupal
import cats.data.Validated.{Invalid, Valid}
import cats.data.{ValidatedNel, _}
import cats.implicits._
import cats.instances.all._
case class FieldErrorInfo(name: String, error: String)
type FieldName = String
type ValidationResult[A] = ValidatedNel[FieldErrorInfo, A]
trait SingleFieldValidationRule[U] extends ((U, FieldName) => ValidationResult[U])
trait ModelValidationRule[M] extends (M => ValidationResult[M])
object ValidateNameRule extends SingleFieldValidationRule[String] {
override def apply(v1: String, name: String): ValidationResult[String] = {
if (v1.contains("cats"))
v1.validNel
else
FieldErrorInfo(name, "Some Error").invalidNel
}
}
object ValidateQuantityRule extends SingleFieldValidationRule[Int] {
override def apply(v1: Int, name: String): ValidationResult[Int] =
if (v1 > 0)
v1.validNel
else FieldErrorInfo(name, "Some Error").invalidNel
}
case class SampleModel(name: String, quantity: Int)
object ValidateSampleModel extends ModelValidationRule[SampleModel] {
override def apply(v1: SampleModel): ValidationResult[SampleModel] = {
val stage1: ValidatedNel[FieldErrorInfo, String] = ValidateNameRule(v1.name, "name")
val stage2: ValidatedNel[FieldErrorInfo, Int] = ValidateQuantityRule(v1.quantity, "quantity")
implicit val sga: Semigroupal[NonEmptyList] = new Semigroupal[NonEmptyList] {
override def product[A, B](fa: NonEmptyList[A], fb: NonEmptyList[B]): NonEmptyList[(A, B)] = fa.flatMap(a => fb.map(b => a -> b))
}
(stage1, stage2).mapN(SampleModel)
}
}
The compiler says:
Error:(43, 23) value mapN is not a member of (cats.data.ValidatedNel[FieldErrorInfo,String], cats.data.ValidatedNel[FieldErrorInfo,Int])
(stage1, stage2).mapN(SampleModel)
^
Please point me to how to use the new applicative syntax, or tell me what I did wrong (did I forget to create/import some implicits?).
You seem to be missing the following import:
import cats.syntax.apply._
for the mapN.
Please ensure that you have the -Ypartial-unification compiler flag activated; otherwise the compiler will have a hard time extracting ValidatedNel[FieldErrorInfo, ?] from the types of stage1 and stage2:
libraryDependencies += "org.typelevel" %% "cats-core" % "1.1.0"
scalaVersion := "2.12.5"
scalacOptions += "-Ypartial-unification"
With the above settings, the following works:
import cats.Semigroupal
import cats.data.Validated.{Invalid, Valid}
import cats.data.ValidatedNel
import cats.data.NonEmptyList
import cats.syntax.apply._ // for `mapN`
import cats.syntax.validated._ // for `validNel`
case class FieldErrorInfo(name: String, error: String)
type FieldName = String
type ValidationResult[A] = ValidatedNel[FieldErrorInfo, A]
trait SingleFieldValidationRule[U] extends ((U, FieldName) => ValidationResult[U])
trait ModelValidationRule[M] extends (M => ValidationResult[M])
object ValidateNameRule extends SingleFieldValidationRule[String] {
override def apply(v1: String, name: String): ValidationResult[String] = {
if (v1.contains("cats"))
v1.validNel
else
FieldErrorInfo(name, "Some Error").invalidNel
}
}
object ValidateQuantityRule extends SingleFieldValidationRule[Int] {
override def apply(v1: Int, name: String): ValidationResult[Int] =
if (v1 > 0)
v1.validNel
else FieldErrorInfo(name, "Some Error").invalidNel
}
case class SampleModel(name: String, quantity: Int)
object ValidateSampleModel extends ModelValidationRule[SampleModel] {
override def apply(v1: SampleModel): ValidationResult[SampleModel] = {
val stage1: ValidatedNel[FieldErrorInfo, String] = ValidateNameRule(v1.name, "name")
val stage2: ValidatedNel[FieldErrorInfo, Int] = ValidateQuantityRule(v1.quantity, "quantity")
implicit val sga: Semigroupal[NonEmptyList] = new Semigroupal[NonEmptyList] {
override def product[A, B](fa: NonEmptyList[A], fb: NonEmptyList[B]): NonEmptyList[(A, B)] = fa.flatMap(a => fb.map(b => a -> b))
}
(stage1, stage2).mapN(SampleModel)
}
}
The values stage1 and stage2 must have the type ValidationResult[_]. In that case the implicit needed by mapN should be found.
object ValidateSampleModel extends ModelValidationRule[SampleModel] {
override def apply(v1: SampleModel): ValidationResult[SampleModel] = {
val stage1: ValidationResult[String] = ValidateNameRule(v1.name, "name")
val stage2: ValidationResult[Int] = ValidateQuantityRule(v1.quantity, "quantity")
(stage1, stage2).mapN(SampleModel)
}
}
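A quick sanity check of these rules (my own example; printed output approximate):
ValidateSampleModel(SampleModel("cats are great", 2))
// Valid(SampleModel(cats are great,2))
ValidateSampleModel(SampleModel("dogs", 0))
// Invalid, accumulating both FieldErrorInfo(name, ...) and FieldErrorInfo(quantity, ...)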

How to make Reflection for getting the field value by its string name and its original type

When trying to get an object's field by its string name, the value is not returned with the correct Scala type:
import scala.language.reflectiveCalls
import scala.language.implicitConversions
case class Intity(flag: Boolean, id: Int, name: String)
val inty = Intity(false, 123, "blue")
implicit def reflect(r: AnyRef) = new {
def get(n:String) = {
val c = r.getClass.getDeclaredField(n)
c.setAccessible(true); c}
def getVal(n: String) = get(n).get(r)
def getType (n:String) = get(n).getType
}
then when using this
inty.getType("flag") // res0: Class[_] = boolean --not Boolean
inty.getVal("id") // res1: Object = 123 --Object not Int
Is there an efficient way of doing the above?
Not sure how one can return different types from a single function, but you can inspect the correct type of any class attribute using the Scala reflection API.
import scala.reflect.runtime.{universe => ru}
implicit class ForAnyInstance[T: ru.TypeTag](i: T)(implicit c: scala.reflect.ClassTag[T]) {
/* a mirror sets a scope of the entities on which we have reflective access */
val mirror = ru.runtimeMirror(getClass.getClassLoader)
/* here we get an instance mirror to reflect on an instance */
val im = ru.runtimeMirror(i.getClass.getClassLoader)
def fieldInfo(name: String) = {
ru.typeOf[T].members.filter(!_.isMethod).filter(_.name.decoded.trim.equals(name)).foreach(s => {
val fieldValue = im.reflect(i).reflectField(s.asTerm).get
/* typeSignature contains runtime type information about a Symbol */
s.typeSignature match {
case x if x =:= ru.typeOf[String] => /* do something */
case x if x =:= ru.typeOf[Int] => /* do something */
case x if x =:= ru.typeOf[Boolean] => /* do something */
}
})
}
}
And then invoke it as:
case class Entity(flag: Boolean, id: Int, name: String)
val e = Entity(false, 123, "blue")
e.fieldInfo("flag")
e.fieldInfo("id")
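If you also want the value back out (rather than only branching on the type inside fieldInfo), here is a variant sketch of my own along the same lines that returns the field's value together with its runtime Type:
import scala.reflect.runtime.{universe => ru}
// returns the field's value and its Type, or None if no such field exists
def fieldValueAndType[T: ru.TypeTag: scala.reflect.ClassTag](
  i: T, name: String
): Option[(Any, ru.Type)] = {
  val im = ru.runtimeMirror(i.getClass.getClassLoader)
  ru.typeOf[T].members
    .find(s => !s.isMethod && s.name.decodedName.toString.trim == name)
    .map(s => (im.reflect(i).reflectField(s.asTerm).get, s.typeSignature))
}
fieldValueAndType(e, "id")   // Some((123,Int))
fieldValueAndType(e, "flag") // Some((false,Boolean))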
You can do something similar at compile time with shapeless.
scala> import shapeless._
import shapeless._
scala> val inty = Intity(false, 123, "blue")
inty: Intity = Intity(false,123,blue)
scala> val intyGen = LabelledGeneric[Intity].to(inty)
intyGen: shapeless.::[Boolean with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("flag")],Boolean],shapeless.::[Int with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("id")],Int],shapeless.::[String with shapeless.labelled.KeyTag[Symbol with shapeless.tag.Tagged[String("name")],String],shapeless.HNil]]] = false :: 123 :: blue :: HNil
scala> import shapeless.record._
import shapeless.record._
scala> intyGen.get('flag)
res10: Boolean = false
scala> intyGen.get(Symbol("id"))
res11: Int = 123