I need to execute some functions based on the values that I receive from topics. I'm currently using ForeachWriter to convert all the topics to a List.
Now, I want to pass this List as a parameter to the methods.
This is what I have so far:
def doA(mylist: List[String]): Unit = { /* something for A */ }
def doB(mylist: List[String]): Unit = { /* something for B */ }
And this is how I call my streaming queries:
//{"s":"a","v":"2"}
//{"s":"b","v":"3"}
val readTopics = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "myTopic")
  .load()
val schema = new StructType()
.add("s",StringType)
.add("v",StringType)
val parseStringDF = readTopics.selectExpr("CAST(value AS STRING)")
val parseDF = parseStringDF.select(from_json(col("value"), schema).as("data"))
.select("data.*")
parseDF.writeStream
.format("console")
.outputMode("append")
.start()
// fails here: collect() is not supported on a streaming Dataset
val listOfTopics = parseDF.select("s").map(row => row.getString(0)).collect.toList
//unable to call the below methods
for (t <- listOfTopics) {
  if (t == "a")
    doA(listOfTopics)
  else if (t == "b")
    doB(listOfTopics)
  else
    println("do nothing")
}
spark.streams.awaitAnyTermination()
Questions:
How can I call a stand-alone (non-streaming) method in a streaming job?
I cannot use ForeachWriter here because I want to pass a SparkSession to the methods, and SparkSession is not serializable. What are the alternatives for calling the methods doA and doB in parallel?
If you want to be able to collect data to the local Spark driver, you need to use parseDF.writeStream.foreachBatch, i.e. a foreach-style sink; you cannot call collect() directly on a streaming DataFrame.
It's unclear what you need the SparkSession for within your two methods, but since they are working on non-Spark datatypes, you probably shouldn't be using a SparkSession instance, anyway
Alternatively, you could .select() and filter on your topic column, then apply the functions to separate "topic-a" and "topic-b" dataframes, thus parallelizing the workload. Otherwise, you would be better off just using a regular KafkaConsumer from kafka-clients, or Kafka Streams, rather than Spark.
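As a minimal sketch of the foreachBatch route (assuming Spark 2.4+ and micro-batches small enough to collect to the driver), reusing doA/doB from the question:

import org.apache.spark.sql.DataFrame

parseDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // collect the topic markers of this micro-batch on the driver
    val values = batchDF.select("s").collect().map(_.getString(0)).toList
    if (values.contains("a")) doA(values)
    if (values.contains("b")) doB(values)
  }
  .start()

spark.streams.awaitAnyTermination()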
Related
I have a use case where I manipulate streaming datasets, make an external API call to enrich the dataset, and write it to a sink. This is what I am doing so far:
val simpleDS: Dataset[SimpleModel] = spark
.readStream
.format("kafka")
.option(..)..
def enrich(model: SimpleModel): EnrichedModel = {
val fut: Future[Int] = lookupLabel(model.id)
val enrich: Int = Await.result(fut, 5.seconds)
EnrichedModel(model.id, enrich)
}
val enrichedDS = simpleDS.map(enrich)
enrichedDS
.toJSON
.writeStream
.format("kafka")
.option(..)..
Although this works, I am unsure about the Await.result part, since it blocks. However, future.onComplete, which is non-blocking, seems to be interested only in the side effect (Unit) and not in the value returned by the future (Int). Is there a way for me to use a non-blocking call to get the value returned by a Future?
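One possible way to limit the blocking, sketched here under the assumption that SimpleModel/EnrichedModel are case classes, spark.implicits._ is in scope, and lookupLabel returns a Future[Int], is to compose each Future with map (which, unlike onComplete, yields the returned value) and block only once per partition:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val enrichedDS = simpleDS.mapPartitions { models =>
  // map on a Future is non-blocking and keeps the returned value, unlike onComplete
  val futures = models
    .map(m => lookupLabel(m.id).map(label => EnrichedModel(m.id, label)))
    .toVector
  // the partition iterator must yield concrete values, so one Await per partition is still needed
  Await.result(Future.sequence(futures), 60.seconds).iterator
}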
In my Spark Kinesis streaming application I am using foreachBatch to get the streaming data, and I need to send it to the Drools rule engine for further processing.
My requirement is that I need to accumulate all the JSON data in a list/rule session and send it to the rule engine for processing as a batch on the executor side.
//Scala Code Example:
val dataFrame = sparkSession.readStream
.format("kinesis")
.option("streamName", streamName)
.option("region", region)
.option("endpointUrl",endpointUrl)
.option("initialPosition", "TRIM_HORIZON")
.load()
val query = dataFrame
.selectExpr("CAST(data as STRING) as krecord")
.writeStream
.foreachBatch(function)
.start()
query.awaitTermination()
val function = (batchDF: DataFrame, batchId: Long) => {
val ruleSession = kBase.newKieSession() //Drools Rule Session, this is getting created at driver side
batchDF.foreach(row => { // This piece of code is being run in executor.
val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
ruleSession.insert(jsonData) // Getting a null pointer exception here as the ruleSession is not available in executor.
}
)
ruleHandler.processRule(ruleSession) // Again this is in the driver scope.
}
In the above code, the problem I am facing is that the function used in foreachBatch is executed on the driver side, while the code inside batchDF.foreach is executed on the worker/executor side, and thus fails to get the ruleSession.
Is there any way to run the whole function at each executor side?
OR
Is there a better way to accumulate all the data in a batch DataFrame after transformation and send it to the next process from within the executor/worker?
I think this might work ... Rather than running foreach, you could use foreachBatch or foreachPartition (or a map version like mapPartitions if you want to return info). In this portion, open a connection to the Drools system. From that point, iterate over the dataset within each partition (or batch), sending each record to the Drools system (or you might send that whole chunk to Drools). At the end of the foreachPartition / foreachBatch section, close the connection (if applicable).
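A rough sketch of that idea against the code in the question (untested; it assumes kBase, jsonHandler and ruleHandler can be instantiated or reached on the executors, e.g. lazily in a singleton object, since they cannot be serialized from the driver):

import org.apache.spark.sql.{DataFrame, Row}

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.foreachPartition { (rows: Iterator[Row]) =>
      val ruleSession = kBase.newKieSession()   // created on the executor, once per partition
      rows.foreach { row =>
        val jsonData = jsonHandler.convertStringToJSONType(row.mkString)
        ruleSession.insert(jsonData)            // accumulate the whole partition
      }
      ruleHandler.processRule(ruleSession)      // fire the rules on the executor
    }
  }
  .start()

query.awaitTermination()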
@codeaperature, this is how I achieved batching, inspired by your answer; I'm posting it as an answer since it exceeds the comment length limit.
Using foreach on the dataframe and passing in a ForeachWriter.
Initializing the rule session in the open method of the ForeachWriter.
Adding each input JSON to the rule session in the process method.
Executing the rules in the close method, with the rule session loaded with the batch of data.
//Scala code:
val dataFrame = sparkSession.readStream
.format("kinesis")
.option("streamName", streamName)
.option("region", region)
.option("endpointUrl",endpointUrl)
.option("initialPosition", "TRIM_HORIZON")
.load()
val query = dataFrame
.selectExpr("CAST(data as STRING) as krecord")
.writeStream
.foreach(dataConsumer)
.start()
val dataConsumer = new ForeachWriter[Row] {
var ruleSession: KieSession = null;
def open(partitionId: Long, version: Long): Boolean = { // first open is called once for every batch
ruleSession = kBase.newKieSession()
true
}
def process(row: Row) = { // process is called once for each record in the batch
val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
ruleSession.insert(jsonData) // Add all input json to rule session.
}
def close(errorOrNull: Throwable): Unit = { // after process has been called for all records in the batch, close is called
val factCount = ruleSession.getFactCount
if (factCount > 0) {
ruleHandler.processRule(ruleSession) //batch processing of rule
}
}
}
I want to unit test code that reads a DataFrame from an RDBMS using sparkSession.read.jdbc(...), but I didn't find a way to mock DataFrameReader to return a dummy DataFrame for the test.
Code example:
object ConfigurationLoader {
def readTable(tableName: String)(implicit spark: SparkSession): DataFrame = {
spark.read
.format("jdbc")
.option("url", s"$postgresUrl/$postgresDatabase")
.option("dbtable", tableName)
.option("user", postgresUsername)
.option("password", postgresPassword)
.option("driver", postgresDriver)
.load()
}
def loadUsingFilter(dummyFilter: String*)(implicit spark: SparkSession): DataFrame = {
readTable(postgresFilesTableName)
.where(col("column").isin(dummyFilter: _*))
}
}
And a second problem: to mock a Scala object, it looks like I need a different approach to creating such a service.
In my opinion, unit tests are not meant to test database connections. This should be done in integration tests, to check that all the parts work together. Unit tests are just meant to test your functional logic, not Spark's ability to read from a database.
This is why I would design your code slightly differently and do just that, without caring about the DB.
/** This, I don't test. I trust spark.read */
def readTable(tableName: String)(implicit spark: SparkSession): DataFrame = {
spark.read
.option(...)
...
.load()
// Nothing more
}
/** This I test, this is my logic. */
def transform(df: DataFrame, dummyFilter: String*): DataFrame = {
  df
    .where(col("column").isin(dummyFilter: _*))
}
Then I use the code this way in production.
val source = readTable("...")
val result = transform(source, filter)
And now transform, which contains my logic, is easy to test. In case you wonder how to create dummy dataframes, one way I like is this:
import spark.implicits._ // required for .toDF on a Seq
val df = Seq((1, Some("a"), true), (2, Some("b"), false), (3, None, true)).toDF("x", "y", "z")
// and the test
val result = transform(df, filter)
result should be ...
If you want to test sparkSession.read.jdbc(...), you can play with an in-memory H2 database. I do it sometimes when I'm writing learning tests. You can find an example here: https://github.com/bartosz25/spark-scala-playground/blob/d3cad26ff236ae78884bdeb300f2e59a616dc479/src/test/scala/com/waitingforcode/sql/LoadingDataTest.scala Please note, however, that you may encounter some subtle differences from a "real" RDBMS.
On the other hand, you can better separate the concerns of the code and create the DataFrame differently, for instance with the toDF(...) method. You can find an example here: https://github.com/bartosz25/spark-scala-playground/blob/77ea416d2493324ddd6f3f2be42122855596d238/src/test/scala/com/waitingforcode/sql/CorrelatedSubqueryTest.scala
Finally, and IMO, if you have to mock DataFrameReader, it probably means there is something to improve in the code separation. For instance, you can put all your filters inside a Filters object and test each filter separately. The same goes for mapping or aggregation functions. Two years ago I wrote a blog post about testing Apache Spark: https://www.waitingforcode.com/apache-spark/testing-spark-applications/read It describes the RDD API, but the idea of separating concerns is the same.
Updated:
object Filters {
def isInFileTypes(inputDataFrame: DataFrame, fileTypes: Seq[String]): DataFrame = {
inputDataFrame.where(col("column").isin(fileTypes: _*))
}
}
object ConfigurationLoader {
def readTable(tableName: String)(implicit spark: SparkSession): DataFrame = {
val input = spark.read
.format("jdbc")
.option("url", s"$postgresUrl/$postgresDatabase")
.option("dbtable", tableName)
.option("user", postgresUsername)
.option("password", postgresPassword)
.option("driver", postgresDriver)
.load()
    Filters.isInFileTypes(input, Seq("txt", "doc"))
  }
}
And with that you can test your filtering logic however you want :) If you have more filters and want to test them, you can also combine them in a single method, pass any DataFrame you want, and voilà :)
You shouldn't test .load() unless you have very good reasons to do so. It's Apache Spark's internal logic, already tested.
Update, answer for:
So, now I am able to test filters, but how do I make sure that readTable really uses the proper filter (sorry for the thoroughness, it is just a question of full coverage)? Probably you have some simple approach for how to mock a Scala object (it is actually my second problem). – dytyniak 14 mins ago
object MyApp {
def main(args: Array[String]): Unit = {
val inputDataFrame = readTable(postgreSQLConnection)
val outputDataFrame = ProcessingLogic.generateOutputDataFrame(inputDataFrame)
}
}
object ProcessingLogic {
def generateOutputDataFrame(inputDataFrame: DataFrame): DataFrame = {
// Here you apply all needed filters, transformations & co
}
}
As you can see, there is no need to mock an object here. It may seem redundant, but it's not: you can test every filter in isolation thanks to the Filters object, and all of your processing logic combined thanks to the ProcessingLogic object (the name is only an example). And you can create your DataFrame in any valid way. The drawback is that you will need to define a schema explicitly or use case classes, since with your PostgreSQL source Apache Spark resolves the schema automatically (I explained this here: https://www.waitingforcode.com/apache-spark-sql/schema-projection/read).
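For illustration, here is a small sketch of both options (the column names, the values and the FileRecord case class are made up for the example, and an implicit spark: SparkSession is assumed to be in scope):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Option 1: spell out the schema the JDBC source would otherwise have resolved.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("file_type", StringType, nullable = true)
))
val explicitDf = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, "txt"), Row(2, "doc"))),
  schema
)

// Option 2: let a case class carry the schema (define it at the top level of the test file).
case class FileRecord(id: Int, fileType: String)
import spark.implicits._
val caseClassDf = Seq(FileRecord(1, "txt"), FileRecord(2, "doc")).toDF()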
You can write unit tests for code that uses DataFrameWriter, DataFrameReader, DataStreamReader, and DataStreamWriter. The sample test cases below use these steps:
Mock
Behavior
Assertion
Maven-based dependencies:
<dependency>
    <groupId>org.scalatestplus</groupId>
    <artifactId>mockito-3-4_2.11</artifactId>
    <version>3.2.3.0</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.mockito</groupId>
    <artifactId>mockito-inline</artifactId>
    <version>2.13.0</version>
    <scope>test</scope>
</dependency>
Let's use an example of a Spark class where the source is Hive and the sink is JDBC:
class DummySource extends SparkPipeline {
/**
* Method to read the source and create a Dataframe
*
* @param sparkSession : SparkSession
* @return : DataFrame
*/
override def read(spark: SparkSession): DataFrame = {
spark.read.table("Table_Name").filter("_2 > 1")
}
/**
* Method to transform the dataframe
*
* @param df : DataFrame
* @return : DataFrame
*/
override def transform(df: DataFrame): DataFrame = ???
/**
* Method to write/save the Dataframe to a target
*
* @param df : DataFrame
*
*/
override def write(df: DataFrame): Unit =
df.write.jdbc("url", "targetTableName", new Properties())
}
Mocking Read
test("Spark read table") {
val dummySource = new DummySource()
val sparkSession = SparkSession
.builder()
.master("local[*]")
.appName("mocking spark test")
.getOrCreate()
val testData = Seq(("one", 1), ("two", 2))
val df = sparkSession.createDataFrame(testData)
df.show()
val mockDataFrameReader = mock[DataFrameReader]
val mockSpark = mock[SparkSession]
when(mockSpark.read).thenReturn(mockDataFrameReader)
when(mockDataFrameReader.table("Table_Name")).thenReturn(df)
dummySource.read(mockSpark).count() should be(1)
}
Mocking Write
test("Spark write") {
val dummySource = new DummySource()
val mockDf = mock[DataFrame]
val mockDataFrameWriter = mock[DataFrameWriter[Row]]
when(mockDf.write).thenReturn(mockDataFrameWriter)
when(mockDataFrameWriter.mode(SaveMode.Append)).thenReturn(mockDataFrameWriter)
doNothing().when(mockDataFrameWriter).jdbc("url", "targetTableName", new Properties())
dummySource.write(df = mockDf)
}
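Since write(...) returns Unit, there is nothing to assert on directly; a possible follow-up (not shown in the referenced article) is to verify the interaction on the mocked writer with Mockito's verify:

// verify is available from the same Mockito import used for when/doNothing above
verify(mockDataFrameWriter).jdbc("url", "targetTableName", new Properties())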
The streaming (readStream/writeStream) examples are covered in the reference below.
Ref : https://medium.com/walmartglobaltech/spark-mocking-read-readstream-write-and-writestream-b6fe70761242
I am facing a strange behaviour from Spark. Here's my code:
object MyJob {
def main(args: Array[String]): Unit = {
val sc = new SparkContext()
val sqlContext = new hive.HiveContext(sc)
val query = "<Some Hive Query>"
val rawData = sqlContext.sql(query).cache()
val aggregatedData = rawData.groupBy("group_key")
.agg(
max("col1").as("max"),
min("col2").as("min")
)
val redisConfig = new RedisConfig(new RedisEndpoint(sc.getConf))
aggregatedData.foreachPartition {
rows =>
writePartitionToRedis(rows, redisConfig)
}
aggregatedData.write.parquet(s"/data/output.parquet")
}
}
Contrary to my intuition, the Spark scheduler yields two jobs, one for each data sink (Redis, HDFS/Parquet). The problem is that the second job also performs the Hive query, doubling the work. I assumed both write operations would share the data from the aggregatedData stage. Is something wrong, or is this the expected behaviour?
You've missed a fundamental concept of Spark: laziness.
An RDD does not contain any data; all it is is a set of instructions that will be executed when you call an action (like writing data to disk/HDFS). If you reuse an RDD (or DataFrame), there is no stored data, just stored instructions that will need to be evaluated every time you call an action.
If you want to reuse data without needing to reevaluate an RDD, use .cache() or, preferably, persist(). Persisting an RDD allows you to store the result of a transformation so that the RDD doesn't need to be reevaluated in future iterations.
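As a minimal sketch against the code in the question (names reused from it, untested), persisting the aggregated result once lets both sinks reuse the same computed partitions instead of re-running the Hive query:

import org.apache.spark.storage.StorageLevel

val aggregatedData = rawData.groupBy("group_key")
  .agg(
    max("col1").as("max"),
    min("col2").as("min")
  )
  .persist(StorageLevel.MEMORY_AND_DISK)   // materialized by the first action, reused by the second

aggregatedData.foreachPartition {
  rows => writePartitionToRedis(rows, redisConfig)   // first action: runs the Hive query and fills the cache
}
aggregatedData.write.parquet("/data/output.parquet") // second action: reads the persisted partitions

aggregatedData.unpersist()                           // free the storage once both sinks are written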
I am unsure if this is a bug. If you do something like this
// d:spark.RDD[String]
d.distinct().map(x => d.filter(_.equals(x)))
you will get a Java NPE. However, if you do a collect() immediately after distinct(), everything is fine.
I am using Spark 0.6.1.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list.
It looks like your current code is trying to group the elements of d by value; you can do this efficiently with the groupBy() RDD method:
scala> val d = sc.parallelize(Seq("Hello", "World", "Hello"))
d: spark.RDD[java.lang.String] = spark.ParallelCollection@55c0c66a
scala> d.groupBy(x => x).collect()
res6: Array[(java.lang.String, Seq[java.lang.String])] = Array((World,ArrayBuffer(World)), (Hello,ArrayBuffer(Hello, Hello)))
What about the windowing example provided in the Spark 1.3.0 streaming programming guide?
val dataset: RDD[(String, String)] = ...
val windowedStream = stream.window(Seconds(20))...
val joinedStream = windowedStream.transform { rdd => rdd.join(dataset) }
SPARK-5063 causes the example to fail, since the join is being called from within the transform method on an RDD.