How to invoke Spark functions (with arguments) from applications.properties(config file)? - scala

So, I have a typesafe config file named application.properties which contains certain values like:
dev.execution.mode = local
dev.input.base.dir = /Users/debaprc/Documents/QualityCheck/Data
dev.schema.lis = asin StringType,subs_activity_date DateType,marketplace_id DecimalType
I have used these values as Strings in my Spark code like:
def main(args: Array[String]): Unit = {
val props = ConfigFactory.load()
val envProps = props.getConfig("dev")
val spark = SparkSession.builder.appName("DataQualityCheckSession")
.config("spark.master", envProps.getString("execution.mode"))
.getOrCreate()
Now I have certain functions defined in my spark code (func1, func2, etc...). I want to specify which functions are to be called, along with the respective arguments, in my application.properties file. Something like this:
dev.functions.lis = func1,func2,func2,func3
dev.func1.arg1.lis = arg1,arg2
dev.func2.arg1.lis = arg3,arg4,arg5
dev.func2.arg2.lis = arg6,arg7,arg8
dev.func3.arg1.lis = arg9,arg10,arg11,arg12
Now, once I specify these, what do I do in Spark, to call the functions with the provided arguments? Or do I need to specify the functions and arguments in a different way?

I agree with #cchantep the approach seems wrong. But if you still want to do something like that, I would decouple the function names in the properties file from the actual functions/methods in your code.
I have tried this and worked fine:
def function1(args: String): Unit = {
println(s"func1 args: $args")
}
def function2(args: String): Unit = {
println(s"func2 args: $args")
}
val functionMapper: Map[String, String => Unit] = Map(
"func1" -> function1,
"func2" -> function2
)
val args = "arg1,arg2"
functionMapper("func1")(args)
functionMapper("func2")(args)
Output:
func1 args: arg1,arg2
func2 args: arg1,arg2
Edited: Simpler approach with output example.

Related

Flink: RowRowConverter seems to fail for nested DataTypes

I am trying to load a complex JSON file (multiple different data types, nested objects/arrays etc) from my local, read them in as a source using the Table API File System Connector, convert them into DataStream, and then do some action afterwards (not shown here for brevity).
The conversion gives me a DataStream of type DataStream[Row], which I need to convert to DataStream[RowData] (for sink purposes, won't go into details here). Thankfully, there's a RowRowConverter utility that helps to do this mapping. It works when I tried a completely flat JSON, but when I introduced Arrays and Maps within the JSON, it no longer works.
Here is the exception that was thrown - a null pointer exception:
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.allocateWriter(ArrayObjectArrayConverter.java:140)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toBinaryArrayData(ArrayObjectArrayConverter.java:114)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:93)
at org.apache.flink.table.data.conversion.ArrayObjectArrayConverter.toInternal(ArrayObjectArrayConverter.java:40)
at org.apache.flink.table.data.conversion.DataStructureConverter.toInternalOrNull(DataStructureConverter.java:61)
at org.apache.flink.table.data.conversion.RowRowConverter.toInternal(RowRowConverter.java:75)
at flink.ReadJsonNestedData$.$anonfun$main$2(ReadJsonNestedData.scala:48)
Interestingly, when I setup my breakpoints and debugger this is what I discovered: RowRowConverter::toInternal, the first time it was called works, will go all the way down to ArrayObjectArrayConverter::allocateWriter()
However, for some strange reason, RowRowConverter::toInternal runs twice, and if I continue stepping through eventually it will come back here, which is where the null pointer exception happens.
Example of the JSON (simplified with only a single nested for brevity). I placed it in my /src/main/resources folder
{"discount":[670237.997082,634079.372133,303534.821218]}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.table.api.DataTypes
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment
import org.apache.flink.table.data.conversion.RowRowConverter
import org.apache.flink.table.types.FieldsDataType
import org.apache.flink.table.types.logical.RowType
import scala.collection.JavaConverters._
object ReadJsonNestedData {
def main(args: Array[String]): Unit = {
// setup
val jsonResource = getClass.getResource("/NESTED.json")
val jsonFilePath = jsonResource.getPath
val tableName = "orders"
val readJSONTable =
s"""
| CREATE TABLE $tableName (
| `discount` ARRAY<DECIMAL(12, 6)>
| )WITH (
| 'connector' = 'filesystem',
| 'path' = '$jsonFilePath',
| 'format' = 'json'
|)""".stripMargin
val colFields = Array(
"discount"
)
val defaultDataTypes = Array(
DataTypes.ARRAY(DataTypes.DECIMAL(12, 6))
)
val rowType = RowType.of(defaultDataTypes.map(_.getLogicalType), colFields)
val defaultDataTypesAsList = defaultDataTypes.toList.asJava
val dataType = new FieldsDataType(rowType, defaultDataTypesAsList)
val rowConverter = RowRowConverter.create(dataType)
// Job
val env = StreamExecutionEnvironment.getExecutionEnvironment()
val tableEnv = StreamTableEnvironment.create(env)
tableEnv.executeSql(readJSONTable)
val ordersTable = tableEnv.from(tableName)
val dataStream = tableEnv
.toDataStream(ordersTable)
.map(row => rowConverter.toInternal(row))
dataStream.print()
env.execute()
}
}
I would hence like to know:
Why RowRowConverter is not working and how I can remedy it
Why RowRowConverter::toInternal is running twice for the same Row .. which may be the cause of that NullPointerException
If my method of instantiating and using the RowRowConverter is correct based on my code above.
Thank you!
Environment:
IntelliJ 2021.3.2 (Ultimate)
AdoptOpenJDK 1.8
Scala: 2.12.15
Flink: 1.13.5
Flink Libraries Used (for this example):
flink-table-api-java-bridge
flink-table-planner-blink
flink-clients
flink-json
The first call of RowRowConverter::toInternal is an internal implementation for making a deep copy of the StreamRecord emitted by table source, which is independent from the converter in your map function. The reason of the NPE is that the RowRowConverter in the map function is not initialized by calling RowRowConverter::open. You can use RichMapFunction instead to invoke the RowRowConverter::open in RichMapFunction::open.
Thank you to #renqs for the answer.
Here is the code, if anyone is interested.
class ConvertRowToRowDataMapFunction(fieldsDataType: FieldsDataType)
extends RichMapFunction[Row, RowData] {
private final val rowRowConverter = RowRowConverter.create(fieldsDataType)
override def open(parameters: Configuration): Unit = {
super.open(parameters)
rowRowConverter.open(this.getClass.getClassLoader)
}
override def map(row: Row): RowData =
this.rowRowConverter.toInternal(row)
}
// at main function
// ... continue from previous
val dataStream = tableEnv
.toDataStream(personsTable)
.map(new ConvertRowToRowDataMapFunction(dataType))

Create Spark UDF of a function that depends on other resources

I have a code for tokenizing a string.
But that tokenization method uses some data which is loaded when my application starts.
val stopwords = getStopwords();
val tokens = tokenize("hello i am good",stopwords)
def tokenize(string:String,stopwords: List[String]) : List[String] = {
val splitted = string.split(" ")
// I use this stopwords for filtering my splitted array.
// Then i return the items back.
}
Now I want to make the tokenize method an UDF for Spark.I want to use it to create new column in DataFrame Transformations.
I created simple UDFs before which had no dependencies like it needs items that needs to be read from text file etc.
Can some one tell me how to do these kind of operation?
This is what I have tried ,and its working.
val moviesDF = Seq(
("kingdomofheaven"),
("enemyatthegates"),
("salesinfointheyearofdecember"),
).toDF("column_name")
val tokenizeUDF: UserDefinedFunction = udf(tokenize(_: String): List[String])
moviesDF.withColumn("tokenized", tokenizeUDF(col("column_name"))).show(100, false)
def tokenize(name: String): List[String] = {
val wordFreqMap: Map[String, Double] = DataProviderUtil.getWordFreqMap()
val stopWords: Set[String] = DataProviderUtil.getStopWordSet()
val maxLengthWord: Int = wordFreqMap.keys.maxBy(_.length).length
.................
.................
}
Its giving me the expected output:
+----------------------------+--------------------------+
|columnname |tokenized |
+----------------------------+--------------------------+
|kingdomofheaven |[kingdom, heaven] |
|enemyatthegates |[enemi, gate] |
|salesinfointheyearofdecember|[sale, info, year, decemb]|
+----------------------------+--------------------------+
Now my question is , will it work when its deployed ? Currently I am
running it locally. My main concern it that this function reads from a
file to get information like stopwords,wordfreq etc for making the
tokenization possible. So registering it like this will work properly
?
At this point, if you deploy this code Spark will try to serialize your DataProviderUtil, you would need to mark as serializable that class. Another possibility is to declare you logic inside an Object. Functions inside objects are considered static functions and they are not serialized.

Mockito verify fails when method takes a function as argument

I have a Scala test which uses Mockito to verify that certain DataFrame transformations are invoked. I broke it down to this simple problematic example
import org.apache.spark.sql.DataFrame
import org.scalatest.funsuite.AnyFunSuite
import org.apache.spark.sql.functions._
import org.mockito.{Mockito, MockitoSugar}
class SimpleTest extends AnyFunSuite{
def withGreeting(df: DataFrame):DataFrame = {
df.withColumn("greeting", lit("hello"))
}
test("sample test") {
val mockDF = MockitoSugar.mock[DataFrame]
val mockDF2 = MockitoSugar.mock[DataFrame]
MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreeting)
mockDF.transform(withGreeting)
val orderVerifier = Mockito.inOrder(mockDF)
orderVerifier.verify(mockDF).transform(withGreeting)
}
}
I'm trying to assert that the transform was called on my mockDF, but it fails with
Argument(s) are different! Wanted:
dataset.transform(<function1>);
-> at org.apache.spark.sql.Dataset.transform(Dataset.scala:2182)
Actual invocations have different arguments:
dataset.transform(<function1>);
Why would the verify fail in this case?
You need to save lambda expression argument for transform as val for correct testing and pass it to all transform calls:
def withGreeting(df: DataFrame):DataFrame = {
df.withColumn("greeting", lit("hello"))
}
test("sample test") {
val mockDF = MockitoSugar.mock[DataFrame]
val mockDF2 = MockitoSugar.mock[DataFrame]
val withGreetingExpression = df => withGreeting(df)
MockitoSugar.doReturn(mockDF2).when(mockDF).transform(withGreetingExpression)
mockDF.transform(withGreetingExpression)
val orderVerifier = Mockito.inOrder(mockDF)
orderVerifier.verify(mockDF).transform(withGreetingExpression)
}
Mockito requires to provide same (or equal) arguments to the mocked functions calls. When you are passing lambda expression without saving each call transform(withGreeting) creates new object Function[DataFrame, DataFrame]
transform(withGreeting)
is the same as:
transform(new Function[DataFrame, DataFrame] {
override def apply(df: DataFrame): DataFrame = withGreeting(df)
})
And they aren't equal to each other - this is the cause of error message:
Argument(s) are different!
For example, try to execute:
println(((df: DataFrame) => withGreeting(df)) == ((df: DataFrame) => withGreeting(df))) //false
You can read more about objects equality in java (in the scala it's same):
wikibooks
javaworld.com

What is Spark execution order with function calls in scala?

I have a spark program as follows:
object A {
var id_set: Set[String] = _
def init(argv: Array[String]) = {
val args = new AArgs(argv)
id_set = args.ids.split(",").toSet
}
def main(argv: Array[String]) {
init(argv)
val conf = new SparkConf().setAppName("some.name")
val rdd1 = getRDD(paras)
val rdd2 = getRDD(paras)
//......
}
def getRDD(paras) = {
//function details
getRDDDtails(paras)
}
def getRDDDtails(paras) = {
//val id_given = id_set
id_set.foreach(println) //worked normal, not empty
someRDD.filter{ x =>
val someSet = x.getOrElse(...)
//id_set.foreach(println) ------wrong, id_set just empty set
(some_set & id_set).size > 0
}
}
class AArgs(args: Array[String]) extends Serializable {
//parse args
}
I have a global variable id_set. At first, it is just an empty set. In main function, I call init which sets id_set to a non-empty set from args. After that, I call getRDD function which calls getRDDDtails. In getRDDDtails, I filter a rdd based on contents in id_set. However, the result semms to be empty. I tried to print is_set in executor, and it is just an empty line. So, the problem seems to be is_set is not well initilized(in init function). However, when I try to print is_set in driver(in head lines of function getRDDDtails), it worked normal, not empty.
So, I have tried to add val id_given = id_set in function getRDDDtails, and use id_given later. This seems to fix the problem. But I'm totally confused why should this happen? What is the execution order of Spark programs? Why does my solution work?

Evalutate complex type with quasiquote scala, unlifting

I need to compile function and then evaluate it with different parameters of type List[Map[String, AnyRef]].
I have the following code that does not compile with such the type but compiles with simple type like List[Int].
I found that there are just certain implementations of Liftable in scala.reflect.api.StandardLiftables.StandardLiftableInstances
import scala.reflect.runtime.universe
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val functionWrapper =
"""
object FunctionWrapper {
def makeBody(messages: List[Map[String, AnyRef]]) = Map.empty
}""".stripMargin
val functionSymbol =
tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef])
val list: List[Map[String, AnyRef]] = List(Map("1" -> "2"))
tb.eval(q"$functionSymbol.function($list)")
Getting compilation error for this, how can I make it work?
Error:(22, 38) Can't unquote List[Map[String,AnyRef]], consider using
... or providing an implicit instance of
Liftable[List[Map[String,AnyRef]]]
tb.eval(q"$functionSymbol.function($list)")
^
The problem comes not from complicated type but from the attempt to use AnyRef. When you unquote some literal, it means you want the infrastructure to be able to create a valid syntax tree to create an object that would exactly match the object you pass. Unfortunately this is obviously not possible for all objects. For example, assume that you've passed a reference to Thread.currentThread() as a part of the Map. How it could possible work? Compiler is just not able to recreate such a complicated object (not to mention making it the current thread). So you have two obvious alternatives:
Make you argument also a Tree i.e. something like this
def testTree() = {
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val functionWrapper =
"""
| object FunctionWrapper {
|
| def makeBody(messages: List[Map[String, AnyRef]]) = Map.empty
|
| }
""".stripMargin
val functionSymbol =
tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef])
//val list: List[Map[String, AnyRef]] = List(Map("1" -> "2"))
val list = q"""List(Map("1" -> "2"))"""
val res = tb.eval(q"$functionSymbol.makeBody($list)")
println(s"testTree = $res")
}
The obvious drawback of this approach is that you loose type safety at compile time and might need to provide a lot of context for the tree to work
Another approach is to not try to pass anything containing AnyRef to the compiler-infrastructure. It means you create some function-like Wrapper:
package so {
trait Wrapper {
def call(args: List[Map[String, AnyRef]]): Map[String, AnyRef]
}
}
and then make your generated code return a Wrapper instead of directly executing the logic and call the Wrapper from the usual Scala code rather than inside compiled code. Something like this:
def testWrapper() = {
val tb = universe.runtimeMirror(getClass.getClassLoader).mkToolBox()
val functionWrapper =
"""
|object FunctionWrapper {
| import scala.collection._
| import so.Wrapper /* <- here probably different package :) */
|
| def createWrapper(): Wrapper = new Wrapper {
| override def call(args: List[Map[String, AnyRef]]): Map[String, AnyRef] = Map.empty
| }
|}
| """.stripMargin
val functionSymbol = tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef])
val list: List[Map[String, AnyRef]] = List(Map("1" -> "2"))
val tree: tb.u.Tree = q"$functionSymbol.createWrapper()"
val wrapper = tb.eval(tree).asInstanceOf[Wrapper]
val res = wrapper.call(list)
println(s"testWrapper = $res")
}
P.S. I'm not sure what are you doing but beware of performance issues. Scala is a hard language to compile and thus it might easily take more time to compile your custom code than to run it. If performance becomes an issue you might need to use some other methods such as full-blown macro-code-generation or at least caching of the compiled code.