Mockito verify fails when method takes a function as argument - scala

I have a Scala test which uses Mockito to verify that certain DataFrame transformations are invoked. I reduced it to this minimal failing example:
import org.apache.spark.sql.DataFrame
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.functions._
import org.mockito.{Mockito, MockitoSugar}
class SimpleTest extends AnyFunSuite{
def withGreeting(df: DataFrame):DataFrame = {
df.withColumn("greeting", lit("hello"))
}
test("sample test") {
val mockDF = MockitoSugar.mock[DataFrame]
val mockDF2 = MockitoSugar.mock[DataFrame]
MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreeting)
mockDF.transform(withGreeting)
val orderVerifier = Mockito.inOrder(mockDF)
orderVerifier.verify(mockDF).transform(withGreeting)
}
}
I'm trying to assert that the transform was called on my mockDF, but it fails with
Argument(s) are different! Wanted:
dataset.transform(<function1>);
-> at org.apache.spark.sql.Dataset.transform(Dataset.scala:2182)
Actual invocations have different arguments:
dataset.transform(<function1>);
Why would the verify fail in this case?

For the test to work, store the function passed to transform in a val and pass that same val to every transform call:
def withGreeting(df: DataFrame):DataFrame = {
df.withColumn("greeting", lit("hello"))
}
test("sample test") {
val mockDF = MockitoSugar.mock[DataFrame]
val mockDF2 = MockitoSugar.mock[DataFrame]
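// The explicit type is required: the compiler cannot infer the parameter type of a bare lambda.
val withGreetingExpression: DataFrame => DataFrame = df => withGreeting(df)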
MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreetingExpression)
mockDF.transform(withGreetingExpression)
val orderVerifier = Mockito.inOrder(mockDF)
orderVerifier.verify(mockDF).transform(withGreetingExpression)
}
Mockito needs the arguments given at verification time to be the same as (or equal to) the arguments of the actual call. When you pass the method directly instead of a saved function value, each transform(withGreeting) call creates a new Function[DataFrame, DataFrame] object. In other words,
transform(withGreeting)
is the same as:
transform(new Function[DataFrame, DataFrame] {
override def apply(df: DataFrame): DataFrame = withGreeting(df)
})
These separately created function objects are not equal to each other, which is what causes the error message:
Argument(s) are different!
For example, try to execute:
println(((df: DataFrame) => withGreeting(df)) == ((df: DataFrame) => withGreeting(df))) //false
You can read more about object equality in Java (it works the same way in Scala): wikibooks, javaworld.com
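To see the equality rule in isolation, here is a minimal sketch that uses plain String functions instead of DataFrame, so it runs without Spark or Mockito. The names FunctionEqualityDemo, first, second and saved are illustrative and not part of the original test. Two separately constructed function objects compare as different because Function1 does not override equals, while a function kept in a val is equal to itself.

```scala
object FunctionEqualityDemo {
  // Stand-in for the DataFrame transformation from the question.
  def withGreeting(s: String): String = s + ", hello"

  // Two separately constructed function objects wrapping the same method.
  val first: String => String = new Function1[String, String] {
    override def apply(s: String): String = withGreeting(s)
  }
  val second: String => String = new Function1[String, String] {
    override def apply(s: String): String = withGreeting(s)
  }

  // One function object that is reused, as in the fixed test.
  val saved: String => String = first

  def main(args: Array[String]): Unit = {
    println(first == second)           // false: distinct objects, reference equality
    println(saved == first)            // true: the very same object
    println(first("x") == second("x")) // true: same behaviour, yet not equal as objects
  }
}
```

This is why stubbing, calling and verifying with one shared val makes the argument comparison succeed.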

Related

NullPointerException exception when using Flink's leftOuterJoinLateral in Scala

I am trying to follow the documentation and create a Table Function to "flatten" some data. The Table Function seems to work fine when using the joinLateral to do the flattening. When using leftOuterJoinLateral though, I get the following error. I'm using Scala and have tried both Table API and SQL with the same result:
Caused by: java.lang.NullPointerException: Null result cannot be stored in a Case Class.
Here is my job:
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.table.api.scala._
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.functions.TableFunction
object example_job{
// Split the List[Int] into multiple rows
class Split() extends TableFunction[Int] {
def eval(nums: List[Int]): Unit = {
nums.foreach(x =>
if(x != 3) {
collect(x)
})
}
}
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.createLocalEnvironment()
val tableEnv = StreamTableEnvironment.create(env)
val splitMe = new Split()
// Create some dummy data
val events: DataStream[(String, List[Int])] = env.fromElements(("simon", List(1,2,3)), ("jessica", List(3)))
val table = tableEnv.fromDataStream(events, 'name, 'numbers)
.leftOuterJoinLateral(splitMe('numbers) as 'number)
.select('name, 'number)
table.toAppendStream[(String, Int)].print()
env.execute("Flink jira ticket example")
}
}
When I change .leftOuterJoinLateral to .joinLateral I get the expected result:
(simon,1)
(simon,2)
When using the .leftOuterJoinLateral I would expect something like:
(simon,1)
(simon,2)
(simon,null) // or (simon, None)
(jessica,null) // or (jessica, None)
Seems like this might be a bug with the Scala API? I wanted to check here first before raising a ticket just in case I'm doing something stupid!
The problem is that Flink by default expects all fields of a row to be non-null. That's why the program fails when it sees the null result from the outer join operation. In order to accept null values, you either need to disable the null check via
val tableConfig = tableEnv.getConfig
tableConfig.setNullCheck(false)
Or you must specify the result type to tolerate null values, e.g. specifying a custom POJO output type:
table.toAppendStream[MyOutput].print()
with
class MyOutput(var name: String, var number: Integer) {
def this() {
this(null, null)
}
override def toString: String = s"($name, $number)"
}
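One detail of the POJO above is that number is declared as java.lang.Integer rather than Int. The following is a small standalone sketch (no Flink involved; NullableFieldDemo and toOption are hypothetical names) of why that matters: a boxed Integer can hold null, whereas a Scala Int cannot, so a result type of (String, Int) has no way to represent the missing value.

```scala
object NullableFieldDemo {
  // Converts a possibly-null boxed value into an Option, which is the
  // usual Scala way to represent "no value" once the data leaves the job.
  def toOption(n: Integer): Option[Int] = Option(n).map(_.intValue)

  def main(args: Array[String]): Unit = {
    val boxed: Integer = null       // allowed: Integer is a reference type
    // val primitive: Int = null    // does not compile: Int is a value type

    val row: (String, Integer) = ("jessica", boxed)
    println(row)                    // (jessica,null)
    println(toOption(row._2))       // None
    println(toOption(2))            // Some(2)
  }
}
```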

Using default argument values in Scala UDF from pyspark?

I have a UDF defined in Scala with a default argument value like so:
package myUDFs
import org.apache.spark.sql.api.java.UDF3
class my_udf extends UDF3[Int, Int, Int, Int] {
override def call(a: Int, b: Int, c: Int = 6): Int = {
c*(a + b)
}
}
I then build this appropriately with build clean assembly (can provide more build details if needed) and extract the jar myUDFs-assembly-0.1.1.jar and include that in my Spark configuration in Python:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
spark_conf = SparkConf().setAll([
    ('spark.jars', 'myUDFs-assembly-0.1.1.jar')
])
spark = SparkSession.builder \
    .appName('my_app') \
    .config(conf=spark_conf) \
    .enableHiveSupport() \
    .getOrCreate()
spark.udf.registerJavaFunction(
    "my_udf", "myUDFs.my_udf", IntegerType()
)
But, when I try to leverage the default, I'm rebuffed:
spark.sql('select my_udf(1, 2)').collect()
AnalysisException: 'Invalid number of arguments for function my_udf. Expected: 3; Found: 2; line x pos y'
Is it not possible to have a UDF with a default value like this? The output should be 6*(1+2) = 18.
Looking at the call chain, there is no way for the default argument to be recognized here:
Python's registerJavaFunction invokes the JVM-side UDFRegistration.registerJava.
registerJava invokes the matching register implementation,
which, in the case of UDF3, looks like this:
/**
 * Register a deterministic Java UDF3 instance as user-defined function (UDF).
 * @since 1.3.0
 */
def register(name: String, f: UDF3[_, _, _, _], returnType: DataType): Unit = {
val func = f.asInstanceOf[UDF3[Any, Any, Any, Any]].call(_: Any, _: Any, _: Any)
def builder(e: Seq[Expression]) = if (e.length == 3) {
ScalaUDF(func, returnType, e, e.map(_ => true), udfName = Some(name))
} else {
throw new AnalysisException("Invalid number of arguments for function " + name +
". Expected: 3; Found: " + e.length)
}
functionRegistry.createOrReplaceTempFunction(name, builder)
}
As you can see, the builder only checks that the number of provided expressions matches the arity of the function before the call is actually dispatched.
You might have better luck implementing an intermediate API which handles default arguments and dispatches to the UDF under the covers. This, however, will work only with the DataFrame API, so it might not fit your needs.
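The idea of an intermediate layer that supplies the default can be sketched in plain Scala. This is only an illustration of the dispatch pattern, not Spark code: myUdf stands in for the registered three-argument function, and myUdfWithDefault is a hypothetical wrapper name. Scala default arguments are filled in by the compiler at the call site, which is why they are visible to Scala callers of the wrapper but not to a registry that only sees a fixed-arity function.

```scala
object DefaultArgWrapper {
  // Stand-in for the three-argument UDF body from the question.
  val myUdf: (Int, Int, Int) => Int = (a, b, c) => c * (a + b)

  // Hypothetical wrapper: the default for c lives here, and the call is
  // always forwarded to the underlying function with all three arguments.
  def myUdfWithDefault(a: Int, b: Int, c: Int = 6): Int = myUdf(a, b, c)

  def main(args: Array[String]): Unit = {
    println(myUdfWithDefault(1, 2))    // 18, using the default c = 6
    println(myUdfWithDefault(1, 2, 3)) // 9, with c given explicitly
  }
}
```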
You are passing only two arguments when calling the function in Spark SQL. Try passing three arguments:
spark.sql('select my_udf(1, 2, 3 )').collect()

Scala : map Dataset[Row] to Dataset[Row]

I am trying to use Scala to transform a dataset with arrays into a dataset with a label and vectors, before feeding it into a machine learning algorithm.
So far, I have managed to add a double label, but I am stuck on the vectors part. Below is the code that creates the vectors:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types.{DataTypes, StructField}
import org.apache.spark.sql.{Dataset, Row, _}
import spark.implicits._
def toVectors(withLabelDs: Dataset[Row]) = {
val allLabel = withLabelDs.count()
var countLabel = 0
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
println("schema line {}", line.schema)
//StructType(
// StructField(label,DoubleType,false),
// StructField(code,ArrayType(IntegerType,true),true),
// StructField(score,ArrayType(IntegerType,true),true))
val label = line.getDouble(0)
val indicesList = line.getList(1)
val indicesSize = indicesList.size
val indices = new Array[Int](indicesSize)
val valuesList = line.getList(2)
val values = new Array[Double](indicesSize)
var i = 0
while ( {
i < indicesSize
}) {
indices(i) = indicesList.get(i).asInstanceOf[Int] - 1
values(i) = valuesList.get(i).asInstanceOf[Int].toDouble
i += 1
}
var r: Row = null
try {
r = Row(label, Vectors.sparse(195, indices, values))
countLabel += 1
}
catch {
case e: IllegalArgumentException =>
println("something went wrong with label {} / indices {} / values {}", label, indices, values)
println("", e)
}
println("Still {} labels to process", allLabel - countLabel)
r
})
newDataset
}
With this code, I got this error :
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
So naturally, I changed my code
def toVectors(withLabelDs: Dataset[Row]) = {
...
}, Encoders.bean(Row.getClass))
newDataset
}
But I got this error :
error: overloaded method value map with alternatives:
[U](func: org.apache.spark.api.java.function.MapFunction[org.apache.spark.sql.Row,U],
encoder: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
<and>
[U](func: org.apache.spark.sql.Row => U)
(implicit evidence$6: org.apache.spark.sql.Encoder[U])org.apache.spark.sql.Dataset[U]
cannot be applied to (org.apache.spark.sql.Row => org.apache.spark.sql.Row, org.apache.spark.sql.Encoder[?0])
val newDataset: Dataset[Row] = withLabelDs.map((line: Row) => {
How can I make this work, i.e. get a Dataset[Row] containing Vectors returned?
Two things:
.map is of type (T => U)(implicit Encoder[U]) => Dataset[U] but looks like you are calling it like it is (T => U, implicit Encoder[U]) => Dataset[U] which are slightly different. Instead of .map(f, encoder), try .map(f)(encoder).
Also, I doubt Encoders.bean(Row.getClass) will work since Row is not a bean. Some quick googling turned up RowEncoder which looks like it should work but I couldn't find much documentation about it.
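The difference between .map(f, encoder) and .map(f)(encoder) can be shown without Spark. In the sketch below, Enc and MiniDataset are hypothetical stand-ins that only copy the shape of Dataset.map (a function parameter list followed by an implicit parameter list); they are not Spark classes.

```scala
object ExplicitEncoderDemo {
  // Hypothetical stand-in for an encoder type class.
  trait Enc[A] { def name: String }

  // Hypothetical stand-in for a dataset; map has the same two-list shape
  // as Dataset.map: (f: T => U)(implicit enc: Enc[U]).
  final case class MiniDataset[T](rows: List[T], encName: String) {
    def map[U](f: T => U)(implicit enc: Enc[U]): MiniDataset[U] =
      MiniDataset(rows.map(f), enc.name)
  }

  // Deliberately not implicit, so it has to be passed by hand.
  val stringEnc: Enc[String] = new Enc[String] { val name = "string" }

  def run(): MiniDataset[String] = {
    val ds = MiniDataset(List(1, 2, 3), "int")
    // The explicit encoder goes in its own argument list.
    ds.map((i: Int) => s"row-$i")(stringEnc)
    // ds.map((i: Int) => s"row-$i", stringEnc) // does not compile: one argument per list
  }

  def main(args: Array[String]): Unit = {
    val out = run()
    println(out.rows)    // List(row-1, row-2, row-3)
    println(out.encName) // string
  }
}
```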
The error message is unfortunately quite poor. import spark.implicits._ is only correct in the spark-shell. What it actually means is to import <Spark Session object>.implicits._, spark just happens to be the variable name used for the SparkSession object in the spark-shell.
You can access the SparkSession from a Dataset, so at the top of your method you can add the import:
def toVectors(withLabelDs: Dataset[Row]) = {
val sparkSession = withLabelDs.sparkSession
import sparkSession.implicits._
//rest of method code

Scala Akka Stream: How to Pass Through a Seq

I'm trying to wrap some blocking calls in a Future. The return type is Seq[User], where User is a case class. The following just won't compile; the compiler complains about the various overloaded versions being present. Any suggestions? I tried almost all the variations of Source.apply without any luck.
// All I want is Seq[User] => Future[Seq[User]]
def findByFirstName(firstName: String) = {
val users: Seq[User] = userRepository.findByFirstName(firstName)
val sink = Sink.fold[User, User](null)((_, elem) => elem)
val src = Source(users) // doesn't compile
src.runWith(sink)
}
First of all, I assume that you are using version 1.0 of akka-http-experimental, because the API may have changed between releases.
The reason why your code does not compile is that akka.stream.scaladsl.Source$.apply() requires an immutable collection such as scala.collection.immutable.Seq, whereas a plain Seq (before Scala 2.13) is scala.collection.Seq, which is not guaranteed to be immutable.
Therefore you have to convert the sequence to an immutable one using the to[T] method.
Document: akka.stream.scaladsl.Source
Additionally, as the documentation shows, Source$.apply() also accepts a () => Iterator[T], so you can pass () => users.iterator as the argument instead.
Since Sink.fold(...) yields the final accumulated value, you can pass an empty Seq() as the initial value, append each user to the sequence as it arrives, and obtain the complete sequence at the end.
However, there might be a better solution that can create a Sink which puts each evaluated expression into Seq, but I could not find it.
The following code works.
import akka.actor._
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Source,Sink}
import scala.concurrent.ExecutionContext.Implicits.global
case class User(name:String)
object Main extends App{
implicit val system = ActorSystem("MyActorSystem")
implicit val materializer = ActorMaterializer()
val users = Seq(User("alice"),User("bob"),User("charlie"))
val sink = Sink.fold[Seq[User], User](Seq())(
(seq, elem) =>
{println(s"elem => ${elem} \t| seq => ${seq}");seq:+elem})
val src = Source(users.to[scala.collection.immutable.Seq])
// val src = Source(()=>users.iterator) // this also works
val fut = src.runWith(sink) // Future[Seq[User]]
fut.onSuccess({
case x=>{
println(s"result => ${x}")
}
})
}
The output of the code above is
elem => User(alice) | seq => List()
elem => User(bob) | seq => List(User(alice))
elem => User(charlie) | seq => List(User(alice), User(bob))
result => List(User(alice), User(bob), User(charlie))
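The accumulation step used by the Sink.fold above can be tried on its own with the standard library's foldLeft, which takes the same initial value and the same (accumulator, element) function. This is only a sketch of the folding logic without Akka; FoldAccumulateDemo and collected are illustrative names.

```scala
object FoldAccumulateDemo {
  final case class User(name: String)

  val users: Seq[User] = Seq(User("alice"), User("bob"), User("charlie"))

  // Start from an empty Seq and append every element, exactly like the
  // function passed to Sink.fold, but evaluated eagerly on a collection.
  val collected: Seq[User] =
    users.foldLeft(Seq.empty[User])((seq, elem) => seq :+ elem)

  def main(args: Array[String]): Unit =
    println(collected) // List(User(alice), User(bob), User(charlie))
}
```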
If you just need a Future[Seq[User]], don't use Akka Streams; use plain futures instead:
import scala.concurrent._
import ExecutionContext.Implicits.global
val session = socialNetwork.createSessionFor("user", credentials)
val f: Future[List[Friend]] = Future {
session.getFriends()
}
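Applied to the question, a minimal sketch could look as follows. It uses only the standard library; blockingFind is a hypothetical stand-in for userRepository.findByFirstName, and the global execution context is an assumption that may not suit a real application with many blocking calls.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureWrapDemo {
  final case class User(name: String)

  // Hypothetical stand-in for the blocking repository call.
  def blockingFind(firstName: String): Seq[User] = Seq(User(firstName))

  // Seq[User] => Future[Seq[User]]: run the blocking call on the execution
  // context and mark it as blocking so the pool can compensate.
  def findByFirstName(firstName: String): Future[Seq[User]] =
    Future { blocking { blockingFind(firstName) } }

  def main(args: Array[String]): Unit = {
    // Await is used here only to print the result in a demo.
    println(Await.result(findByFirstName("alice"), 5.seconds)) // List(User(alice))
  }
}
```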

Scala function does not return a value

I think I understand the rules of implicit returns but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array of strings. Note that head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class TaxiModel(sc: SparkContext, dat: String) {
val rawData = sc.textFile(dat)
val head = rawData.take(10)
val splithead = head.slice(1,11).foreach(splitData)
def splitData(dat: String): Array[String] = {
val splits = dat.split("\",\"")
val split0 = splits(0).substring(1, splits(0).length)
val split8 = splits(8).substring(0, splits(8).length - 1)
Array(split0).union(splits.slice(1, 8)).union(Array(split8))
}
}
foreach only evaluates the given function for each element and returns Unit; it does not collect any results while iterating. You probably need map or flatMap (see docs here):
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]
Consider also a for comprehension (which in this case desugars into map, so it yields Array[Array[String]]):
for (s <- head.slice(1,11)) yield splitData(s)
Note also that Scala strings support the ordered collection methods, so for a non-empty string
splits(0).substring(1, splits(0).length)
proves equivalent to any of the following
splits(0).drop(1)
splits(0).tail
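The difference between foreach, map, flatMap and the for comprehension discussed above can be checked with a small standalone sketch. It assumes a simplified splitData that just splits on commas, and sample input that is not from the original data.

```scala
object ForeachVsMapDemo {
  // Simplified stand-in for splitData from the question.
  def splitData(line: String): Array[String] = line.split(",")

  val lines: Array[String] = Array("a,b", "c,d")

  // foreach runs the function but throws the results away.
  val viaForeach: Unit = lines.foreach(line => splitData(line))
  // map keeps one result per input element.
  val viaMap: Array[Array[String]] = lines.map(line => splitData(line))
  // flatMap concatenates the per-element results.
  val viaFlatMap: Array[String] = lines.flatMap(line => splitData(line))
  // A for comprehension with a single generator and yield is a map.
  val viaFor: Array[Array[String]] = for (s <- lines) yield splitData(s)

  def main(args: Array[String]): Unit = {
    println(viaForeach)                                         // ()
    println(viaMap.map(_.mkString("[", ",", "]")).mkString(" ")) // [a,b] [c,d]
    println(viaFlatMap.mkString(","))                           // a,b,c,d
    println(viaFor.map(_.mkString("[", ",", "]")).mkString(" ")) // [a,b] [c,d]
  }
}
```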