Using typeclasses on Spark types - Scala

I am trying to use a Scala typeclass on Spark types; here is a small code snippet I wrote.
import org.apache.spark.sql.types.{IntegerType, StringType}

trait ThresholdMethods[A] {
  def applyThreshold(): Boolean
}

object ThresholdMethodsInstances {
  def thresholdMatcher[A](v: A)(implicit threshold: ThresholdMethods[A]): Boolean =
    threshold.applyThreshold()

  implicit val exactMatchThresholdStringType = new ThresholdMethods[StringType] {
    def applyThreshold(): Boolean = {
      print("string")
      true
    }
  }

  implicit val exactMatchThresholdNumericType = new ThresholdMethods[IntegerType] {
    def applyThreshold(): Boolean = {
      print("numeric")
      true
    }
  }
}
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ZenDataValidationLib")
      .config("spark.some.config.option", "some-value")
      .master("local")
      .getOrCreate()
    import spark.sqlContext.implicits._

    val df1 = Seq(
      ("2016-04-02", "14", "NULL", 9874, 880, "xyz"),
      ("2016-04-30", "14", "FR", 9875, 13, "xyz"),
      ("2017-06-10", "15", "PQR", 9867, 57721, "xyz")
    ).toDF("WEEK", "DIM1", "DIM2", "T1", "T2", "T3")

    import ThresholdMethodsInstances._

    println(df1.schema("T1").dataType)
    ThresholdMethodsInstances.thresholdMatcher(df1.schema("T1").dataType)
  }
}
When I run this locally in IntelliJ, the following error is thrown:
Error:(46, 51) could not find implicit value for parameter threshold: com.amazon.zen.datavalidation.activity.com.amazon.zen.datavalidation.activity.ThresholdMethods[org.apache.spark.sql.types.DataType]
ThresholdMethodsInstances.thresholdMatcher(df1.schema("T1").dataType)
Error:(46, 51) not enough arguments for method thresholdMatcher: (implicit threshold: com.amazon.zen.datavalidation.activity.com.amazon.zen.datavalidation.activity.ThresholdMethods[org.apache.spark.sql.types.DataType])Boolean.
Unspecified value parameter threshold.
ThresholdMethodsInstances.thresholdMatcher(df1.schema("T1").dataType)
I also tried the same thing using String and Int and it worked perfectly fine. Can someone help me do this with Spark types?

Please note that StructField.dataType returns DataType, not a specific subclass, and your code defines implicit ThresholdMethods instances only for IntegerType and StringType.
Because implicit resolution happens at compile time, there is not enough information for the compiler to determine the right instance that would make the types match.
This would work (though you obviously don't want to do that):
ThresholdMethodsInstances.thresholdMatcher(
  df1.schema("T1").dataType.asInstanceOf[IntegerType]
)
but in practice you'll need something more explicit, like pattern matching, rather than an implicit argument.
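A minimal sketch of that pattern-matching approach (the dispatch function and the error branch are mine, and it only covers the two instances defined above):

import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// Dispatch on the runtime DataType and call the matching instance explicitly.
def matchThreshold(dt: DataType): Boolean = dt match {
  case StringType  => ThresholdMethodsInstances.exactMatchThresholdStringType.applyThreshold()
  case IntegerType => ThresholdMethodsInstances.exactMatchThresholdNumericType.applyThreshold()
  case other       => sys.error(s"No ThresholdMethods instance for $other")
}

matchThreshold(df1.schema("T1").dataType) // prints "numeric" for the Int column T1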
You can also consider switching to a strongly typed Dataset, which would be a better choice here.
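As a rough illustration of the Dataset suggestion (the Record case class below is an assumption based on the columns in the question, not code from the answer):

// With a case class the column types are fixed at compile time,
// so an instance such as ThresholdMethods[Int] could be resolved statically.
case class Record(WEEK: String, DIM1: String, DIM2: String, T1: Int, T2: Int, T3: String)

import spark.implicits._
val ds = df1.as[Record]
ds.map(_.T1) // statically a Dataset[Int]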

Related

Mockito verify fails when method takes a function as argument

I have a Scala test which uses Mockito to verify that certain DataFrame transformations are invoked. I broke it down to this simple problematic example:
import org.apache.spark.sql.DataFrame
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.functions._
import org.mockito.{Mockito, MockitoSugar}

class SimpleTest extends AnyFunSuite {

  def withGreeting(df: DataFrame): DataFrame = {
    df.withColumn("greeting", lit("hello"))
  }

  test("sample test") {
    val mockDF = MockitoSugar.mock[DataFrame]
    val mockDF2 = MockitoSugar.mock[DataFrame]

    MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreeting)
    mockDF.transform(withGreeting)

    val orderVerifier = Mockito.inOrder(mockDF)
    orderVerifier.verify(mockDF).transform(withGreeting)
  }
}
I'm trying to assert that the transform was called on my mockDF, but it fails with
Argument(s) are different! Wanted:
dataset.transform(<function1>);
-> at org.apache.spark.sql.Dataset.transform(Dataset.scala:2182)
Actual invocations have different arguments:
dataset.transform(<function1>);
Why would the verify fail in this case?
You need to save the lambda expression argument for transform in a val and pass that same val to all transform calls:
def withGreeting(df: DataFrame): DataFrame = {
  df.withColumn("greeting", lit("hello"))
}

test("sample test") {
  val mockDF = MockitoSugar.mock[DataFrame]
  val mockDF2 = MockitoSugar.mock[DataFrame]

  // Save the function as a val once and reuse the same instance everywhere.
  val withGreetingExpression: DataFrame => DataFrame = df => withGreeting(df)

  MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreetingExpression)
  mockDF.transform(withGreetingExpression)

  val orderVerifier = Mockito.inOrder(mockDF)
  orderVerifier.verify(mockDF).transform(withGreetingExpression)
}
Mockito requires the same (or equal) arguments to be passed to mocked function calls. When you pass the lambda expression without saving it first, each call to transform(withGreeting) creates a new Function[DataFrame, DataFrame] object:
transform(withGreeting)
is the same as:
transform(new Function[DataFrame, DataFrame] {
  override def apply(df: DataFrame): DataFrame = withGreeting(df)
})
And they aren't equal to each other, which is the cause of the error message:
Argument(s) are different!
For example, try to execute:
println(((df: DataFrame) => withGreeting(df)) == ((df: DataFrame) => withGreeting(df))) //false
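As a small follow-up sketch, once the function is stored in a val, both sides of the comparison refer to the same Function1 instance:

val f: DataFrame => DataFrame = withGreeting _
println(f == f) // true: Function1 uses reference equality, and this is the same instance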
You can read more about object equality in Java (it's the same in Scala):
wikibooks
javaworld.com

json4s, how to deserialize json with FullTypeHints w/o explicitly setting TypeHints

I specify FullTypeHints before serialization:
def serialize(definition: Definition): String = {
  val hints = definition.tasks.map(_.getClass).groupBy(_.getName).values.map(_.head).toList
  implicit val formats = Serialization.formats(FullTypeHints(hints))
  writePretty(definition)
}
It produces json with type hints, great!
{
  "name": "My definition",
  "tasks": [
    {
      "jsonClass": "com.soft.RootTask",
      "name": "Root"
    }
  ]
}
Deserialization doesn't work; it ignores the "jsonClass" field with the type hint:
def deserialize(jsonString: String): Definition = {
  implicit val formats = DefaultFormats.withTypeHintFieldName("jsonClass")
  read[Definition](jsonString)
}
Why should I repeat the type hints using Serialization.formats(FullTypeHints(hints)) for deserialization if the hints are already in the JSON string?
Can json4s infer them from the JSON?
The deserialiser is not ignoring the type hint field name, it just does not have anything to map it with. This is where the hints come in. Thus, you have to declare and assign your hints list object once again and pass it to the DefaultFormats object either by using the withHints method or by overriding the value when creating a new instance of DefaultFormats. Here's an example using the latter approach.
val hints = definition.tasks.map(_.getClass).groupBy(_.getName).values.map(_.head).toList

implicit val formats: Formats = new DefaultFormats {
  outer =>
  override val typeHintFieldName = "jsonClass"
  override val typeHints: TypeHints = FullTypeHints(hints) // hints is a List[Class[_]], so it needs to be wrapped in FullTypeHints
}
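As an untested sketch (my assumption, not part of the answer above), the same thing can probably be expressed by adding the hints to DefaultFormats instead of subclassing it:

// Assumes the same `hints` list is rebuilt on the read side.
implicit val formats: Formats =
  DefaultFormats.withTypeHintFieldName("jsonClass") + FullTypeHints(hints)

val definition = read[Definition](jsonString)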
I did it this way since I have a contract:
the type hint field name is known in advance
the type hint field contains a fully qualified class name and it's always a case class
def deserialize(jsonString: String): Definition = {
  import org.json4s._
  import org.json4s.native.JsonMethods._
  import org.json4s.JsonDSL._
  import scala.util.Try

  val json = parse(jsonString)
  val classNames: List[String] = (json \\ $$definitionTypes$$ \\ classOf[JString])
  val hints: List[Class[_]] = classNames.map(clz =>
    Try(Class.forName(clz)).getOrElse(throw new RuntimeException(s"Can't get class for $clz")))
  implicit val formats = Serialization.formats(FullTypeHints(hints)).withTypeHintFieldName($$definitionTypes$$)
  read[Definition](jsonString)
}

udf No TypeTag available for type string

I don't understand a behavior of Spark.
I create a UDF which returns an Integer like below:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object Show {
  def main(args: Array[String]): Unit = {
    val (sc, sqlContext) = iniSparkConf("test")
    val testInt_udf = sqlContext.udf.register("testInt_udf", testInt _)
  }

  def iniSparkConf(appName: String): (SparkContext, SQLContext) = {
    val conf = new SparkConf().setAppName(appName) //.setExecutorEnv("spark.ui.port", "4046")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    val sqlContext = new SQLContext(sc)
    (sc, sqlContext)
  }

  def testInt(): Int = {
    return 2
  }
}
It works perfectly, but if I change the return type of the method from Int to String:
val testString_udf = sqlContext.udf.register("testString_udf", testString _)

def testString(): String = {
  return "myString"
}
I get the following error:
Error:(34, 43) No TypeTag available for String
val testString_udf = sqlContext.udf.register("testString_udf", testString _)
Error:(34, 43) not enough arguments for method register: (implicit evidence$1: reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.UserDefinedFunction.
Unspecified value parameter evidence$1.
val testString_udf = sqlContext.udf.register("testString_udf", testString _)
Here are my embedded jars:
datanucleus-api-jdo-3.2.6
datanucleus-core-3.2.10
datanucleus-rdbms-3.2.9
spark-1.6.1-yarn-shuffle
spark-assembly-1.6.1-hadoop2.6.0
spark-examples-1.6.1-hadoop2.6.0
I am a little bit lost... Do you have any idea?
Since I can't reproduce the issue by copy-pasting just your example code into a new file, I bet that in your real code String is actually shadowed by something else. To verify this theory, you can try to change your signature to:
def testString(): scala.Predef.String = {
  return "myString"
}
or
def testString(): java.lang.String = {
  return "myString"
}
If this one compiles, search for "String" to see how you shadowed the standard type. If you use IntelliJ IDEA, you can use "Ctrl+B" (Go to Declaration) to find it. The most obvious candidate is that you used String as the name of a generic type parameter, but there are other possibilities.
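A minimal illustration of that shadowing scenario (hypothetical code, not taken from the question):

class Wrapper[String] {           // a type parameter confusingly named String
  def testString(): String = ???  // this String is the type parameter, not scala.Predef.String,
                                  // so no TypeTag[String] is available for udf.register
}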

using datetime/timestamp in scala slick

Is there an easy way to use datetime/timestamp in Scala? What's best practice? I currently use "date" to persist data, but I'd also like to persist the current time.
I'm struggling to set the date. This is my code:
val now = new java.sql.Timestamp(new java.util.Date().getTime)
I also tried to do this:
val now = new java.sql.Date(new java.util.Date().getTime)
When changing the datatype in my evolutions to "timestamp", I got an error:
case class MyObjectModel(
  id: Option[Int],
  title: String,
  createdat: Timestamp,
  updatedat: Timestamp,
  ...)

object MyObjectModel {
  implicit val myObjectFormat = Json.format[MyObjectModel]
}
Console:
app\models\MyObjectModel.scala:31: No implicit format for
java.sql.Timestamp available.
[error] implicit val myObjectFormat = Json.format[MyObjectModel]
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
Update:
object ProcessStepTemplatesModel {
  implicit lazy val timestampFormat: Format[Timestamp] = new Format[Timestamp] {
    override def reads(json: JsValue): JsResult[Timestamp] =
      json.validate[Long].map(l => Timestamp.from(Instant.ofEpochMilli(l)))
    override def writes(o: Timestamp): JsValue = JsNumber(o.getTime)
  }
  implicit val processStepFormat = Json.format[ProcessStepTemplatesModel]
}
Try using this in your code:
implicit object timestampFormat extends Format[Timestamp] {
  val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SS'Z'")

  def reads(json: JsValue) = {
    val str = json.as[String]
    JsSuccess(new Timestamp(format.parse(str).getTime))
  }

  def writes(ts: Timestamp) = JsString(format.format(ts))
}
It is (de)serialized in a JS-compatible format like "2018-01-06T18:31:29.436Z".
Please note: the implicit object must be declared in the code before it is used.
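A quick round-trip sketch with that format in scope (Play JSON is assumed here, since the question uses Json.format):

val ts = new Timestamp(System.currentTimeMillis())
val js = Json.toJson(ts)    // a JsString such as "2018-01-06T18:31:29.43Z"
val back = js.as[Timestamp] // parsed back through the reads above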
I guess your question is handled in What's the standard way to work with dates and times in Scala? Should I use Java types or there are native Scala alternatives?.
Go with Java 8 "java.time".
In the subject you mention Slick (a Scala database library), but the error you got comes from a JSON library, and it says that you don't have a converter from java.sql.Timestamp to JSON. Without knowing which JSON library you are using, it's hard to help you with a working example.
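For the "java.time" route, a small sketch of the conversions at the java.sql boundary (plain Java 8 API, nothing Slick-specific):

import java.sql.Timestamp
import java.time.Instant

val now: Instant  = Instant.now()
val ts: Timestamp = Timestamp.from(now) // when the persistence layer wants java.sql.Timestamp
val back: Instant = ts.toInstant        // and back again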

How to write Expectations with ScalaMock?

This question seems pretty obvious, I know, but I've tried everything in the documentation and I can't mock a single method on any class.
For this test, I am using ScalaMock 3 for Scala 2.10 and ScalaTest 2.
DateServiceTest.scala
@RunWith(classOf[JUnitRunner])
class DateServiceTest extends FunSuite with MockFactory {
  val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
  val sc = new SparkContext(conf)
  implicit val sqlc = new SQLContext(sc)

  val m = mock[DateService]
  (m.getDate _).expects().returning(Some(new DateTime)) // Error here
}
DateService.scala
class DateService extends Serializable {
  def getDate(implicit sqlc: SQLContext): Option[DateTime] = {
    Some(new DateTime(loadDateFromDatabase(sqlc)))
  }
}
This seems pretty simple to me, but the expectation is throwing this error:
type mismatch; found : Option[org.joda.time.DateTime] required: ? ⇒ ?
Am I doing something wrong here? Is there another way to set the expectations of a method?
Your getDate method expects an SQLContext - try adding a wildcard (*) or passing the particular sqlc:
(m.getDate _).expects(*).returning(Some(new DateTime))
// Or, alternatively
(m.getDate _).expects(sqlc).returning(Some(new DateTime))