I'm using Spark Structured Streaming to process messages from Kafka.
At one point my result table looks something like the one below, where each row in the Dataset contains a Spark SQL query.
+----+--------------------+
|code| triggerSql|
+----+--------------------+
| US|SELECT * FROM def...|
| UK|SELECT * FROM def...|
+----+--------------------+
I need to execute each of these queries and process the results. However, Structured Streaming won't let me collect these SQLs on the driver side, and we can't open a new SparkSession inside a transformation.
val query = df3.writeStream.foreach(new ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = {
    //..
    true
  }

  override def process(value: Row): Unit = {
    val triggerSqlString = value.getAs[String]("triggerSql")
    val code = value.getAs[String]("value")
    println("Code=" + code + "; TriggerSQL=" + triggerSqlString)
    //TODO
  }

  override def close(errorOrNull: Throwable): Unit = {
    // println("===> Closing..")
  }
}).trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
Is there a better alternative for dynamically executing these SQL statements in Spark?
tl;dr Use DataStreamWriter.foreachBatch operation.
The following sample shows how you could execute SQL queries arriving in each micro-batch (here read as text files with one query per line):
// Runs every SQL statement found in the current micro-batch on the driver.
def sqlExecution(ds: Dataset[String], batchId: Long): Unit = {
  ds.collect.foreach { s => ds.sparkSession.sql(s).show }
}

spark
  .readStream
  .textFile("sqls")   // one SQL statement per line
  .writeStream
  .foreachBatch(sqlExecution _)
  .start
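Applied to the original df3 (with its code and triggerSql columns), a sketch might look like the following; the collect and the sql calls run on the driver inside foreachBatch, so no extra SparkSession is needed. This is only an illustration, not tested against your exact schema:

import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.streaming.Trigger

// Runs every triggerSql of the current micro-batch on the driver.
def runTriggerSqls(batch: Dataset[Row], batchId: Long): Unit = {
  batch.select("code", "triggerSql").collect().foreach { row =>
    val code = row.getAs[String]("code")
    val triggerSqlString = row.getAs[String]("triggerSql")
    println("Code=" + code + "; TriggerSQL=" + triggerSqlString)
    batch.sparkSession.sql(triggerSqlString).show() // process the result as needed
  }
}

val query = df3.writeStream
  .foreachBatch(runTriggerSqls _)
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()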
Related
I want to unit test code that reads a DataFrame from an RDBMS using sparkSession.read.jdbc(...). But I didn't find a way to mock DataFrameReader to return a dummy DataFrame for the test.
Code example:
object ConfigurationLoader {

  def readTable(tableName: String)(implicit spark: SparkSession): DataFrame = {
    spark.read
      .format("jdbc")
      .option("url", s"$postgresUrl/$postgresDatabase")
      .option("dbtable", tableName)
      .option("user", postgresUsername)
      .option("password", postgresPassword)
      .option("driver", postgresDriver)
      .load()
  }

  def loadUsingFilter(dummyFilter: String*)(implicit spark: SparkSession): DataFrame = {
    readTable(postgresFilesTableName)
      .where(col("column").isin(dummyFilter: _*))
  }
}
And the second problem: to mock a Scala object, it looks like I need to use a different approach to create such a service.
In my opinion, unit tests are not meant to test database connections. That belongs in integration tests, which check that all the parts work together. Unit tests are meant to test your functional logic, not Spark's ability to read from a database.
That is why I would design your code slightly differently and do just that, without caring about the DB.
/** This, I don't test. I trust spark.read. */
def readTable(tableName: String)(implicit spark: SparkSession): DataFrame = {
  spark.read
    .option(...)
    ...
    .load()
  // Nothing more
}

/** This I test, this is my logic. */
def transform(df: DataFrame, dummyFilter: String*): DataFrame = {
  df.where(col("column").isin(dummyFilter: _*))
}
Then I use the code this way in production.
val source = readTable("...")
val result = transform(source, filter)
And now transform, which contains my logic, is easy to test. In case you wonder how to create dummy DataFrames, one way I like is this:
import spark.implicits._

val df = Seq(
  (1, Some("a"), true),
  (2, Some("b"), false),
  (3, None, true)
).toDF("x", "y", "z")

// and the test
val result = transform(df, filter)
result should be ...
If you want to test sparkSession.read.jdbc(...), you can play with an in-memory H2 database. I do it sometimes when I'm writing learning tests. You can find an example here: https://github.com/bartosz25/spark-scala-playground/blob/d3cad26ff236ae78884bdeb300f2e59a616dc479/src/test/scala/com/waitingforcode/sql/LoadingDataTest.scala Please note, however, that you may encounter some subtle differences from a "real" RDBMS.
On the other hand, you can better separate the concerns of the code and create the DataFrame differently, for instance with the toDF(...) method. You can find an example here: https://github.com/bartosz25/spark-scala-playground/blob/77ea416d2493324ddd6f3f2be42122855596d238/src/test/scala/com/waitingforcode/sql/CorrelatedSubqueryTest.scala
Finally, and IMO, if you have to mock DataFrameReader, it means that maybe there is something to improve in the code separation. For instance, you can put all your filters inside a Filters object and test each filter separately. The same goes for mapping or aggregation functions. Two years ago I wrote a blog post about testing Apache Spark - https://www.waitingforcode.com/apache-spark/testing-spark-applications/read It describes the RDD API, but the idea of separating concerns is the same.
Updated:
object Filters {
  def isInFileTypes(inputDataFrame: DataFrame, fileTypes: Seq[String]): DataFrame = {
    inputDataFrame.where(col("column").isin(fileTypes: _*))
  }
}

object ConfigurationLoader {
  def readTable(tableName: String)(implicit spark: SparkSession): DataFrame = {
    val input = spark.read
      .format("jdbc")
      .option("url", s"$postgresUrl/$postgresDatabase")
      .option("dbtable", tableName)
      .option("user", postgresUsername)
      .option("password", postgresPassword)
      .option("driver", postgresDriver)
      .load()
    Filters.isInFileTypes(input, Seq("txt", "doc"))
  }
}
And with that you can test your filtering logic however you want :) If you have more filters and want to test them, you can also combine them in a single method, pass any DataFrame you want, and voilà :)
You shouldn't test the .load() unless you have very good reasons to do so. It's Apache Spark internal logic, already tested.
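For illustration, here is a minimal ScalaTest sketch of testing the filter in isolation; the test style and the single "column" column are assumptions taken from the snippets above:

import org.apache.spark.sql.SparkSession
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class FiltersSpec extends AnyFlatSpec with Matchers {

  private val spark = SparkSession.builder()
    .master("local[*]")
    .appName("filters-test")
    .getOrCreate()

  import spark.implicits._

  "isInFileTypes" should "keep only the requested file types" in {
    // In-memory input, no database involved.
    val input = Seq("txt", "doc", "pdf").toDF("column")

    val result = Filters.isInFileTypes(input, Seq("txt", "doc"))

    result.as[String].collect() should contain theSameElementsAs Seq("txt", "doc")
  }
}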
Update, answering this follow-up comment:
So, now I am able to test filters, but how to make sure that readTable really uses the proper filter (sorry for thoroughness, it is just a question of full coverage). Probably you have some simple approach for how to mock a Scala object (it is actually my second problem). – dytyniak 14 mins ago
object MyApp {
  def main(args: Array[String]): Unit = {
    val inputDataFrame = readTable(postgreSQLConnection)
    val outputDataFrame = ProcessingLogic.generateOutputDataFrame(inputDataFrame)
  }
}

object ProcessingLogic {
  def generateOutputDataFrame(inputDataFrame: DataFrame): DataFrame = {
    // Here you apply all needed filters, transformations & co
    ???
  }
}
As you can see, there is no need to mock an object here. It may seem redundant, but it's not: you can test every filter in isolation thanks to the Filters object, and all your processing logic combined thanks to the ProcessingLogic object (the names are only examples). And you can create your DataFrame in any valid way. The drawback is that you will need to define a schema explicitly or use case classes, since with your PostgreSQL source Apache Spark resolves the schema automatically (I explained this here: https://www.waitingforcode.com/apache-spark-sql/schema-projection/read).
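A quick sketch of those two options for building test input; the column names are invented for the example, and toDF requires import spark.implicits._ to be in scope:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Option 1: a case class (defined outside the test method) plus toDF
case class FileRow(column: String, size: Long)
val fromCaseClass = Seq(FileRow("txt", 10L), FileRow("doc", 20L)).toDF()

// Option 2: an explicit schema plus Row objects
val schema = StructType(Seq(
  StructField("column", StringType, nullable = true),
  StructField("size", LongType, nullable = true)
))
val fromExplicitSchema = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("txt", 10L), Row("doc", 20L))),
  schema
)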
You can write unit tests for DataFrameWriter, DataFrameReader, DataStreamReader, and DataStreamWriter.
The sample test cases below use these steps:
1. Mock
2. Behavior
3. Assertion
Maven-based dependencies:
<dependency>
    <groupId>org.scalatestplus</groupId>
    <artifactId>mockito-3-4_2.11</artifactId>
    <version>3.2.3.0</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.mockito</groupId>
    <artifactId>mockito-inline</artifactId>
    <version>2.13.0</version>
    <scope>test</scope>
</dependency>
Let’s use an example of a Spark class where the source is Hive and the sink is JDBC:
class DummySource extends SparkPipeline {
  /**
   * Method to read the source and create a DataFrame
   *
   * @param spark : SparkSession
   * @return : DataFrame
   */
  override def read(spark: SparkSession): DataFrame = {
    spark.read.table("Table_Name").filter("_2 > 1")
  }

  /**
   * Method to transform the DataFrame
   *
   * @param df : DataFrame
   * @return : DataFrame
   */
  override def transform(df: DataFrame): DataFrame = ???

  /**
   * Method to write/save the DataFrame to a target
   *
   * @param df : DataFrame
   */
  override def write(df: DataFrame): Unit =
    df.write.jdbc("url", "targetTableName", new Properties())
}
Mocking Read
test("Spark read table") {
val dummySource = new DummySource()
val sparkSession = SparkSession
.builder()
.master("local[*]")
.appName("mocking spark test")
.getOrCreate()
val testData = Seq(("one", 1), ("two", 2))
val df = sparkSession.createDataFrame(testData)
df.show()
val mockDataFrameReader = mock[DataFrameReader]
val mockSpark = mock[SparkSession]
when(mockSpark.read).thenReturn(mockDataFrameReader)
when(mockDataFrameReader.table("Table_Name")).thenReturn(df)
dummySource.read(mockSpark).count() should be(1)
}
Mocking Write
test("Spark write") {
val dummySource = new DummySource()
val mockDf = mock[DataFrame]
val mockDataFrameWriter = mock[DataFrameWriter[Row]]
when(mockDf.write).thenReturn(mockDataFrameWriter)
when(mockDataFrameWriter.mode(SaveMode.Append)).thenReturn(mockDataFrameWriter)
doNothing().when(mockDataFrameWriter).jdbc("url", "targetTableName", new Properties())
dummySource.write(df = mockDf)
}
Streaming code is covered in the referenced post.
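As a rough, hypothetical sketch (not taken from that post), the same mock/behavior/assertion pattern could be applied to a streaming read, assuming a pipeline whose read() calls spark.readStream.format(...).load(); mockito-inline from the dependencies above helps here, since the streaming reader/writer classes are final:

test("Spark readStream") {
  // A real local session is only used to build the stand-in DataFrame.
  val sparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("mocking readStream test")
    .getOrCreate()
  import sparkSession.implicits._
  val df = Seq(("one", 1), ("two", 2)).toDF("key", "value") // stands in for the streamed data

  val mockSpark = mock[SparkSession]
  val mockStreamReader = mock[DataStreamReader]
  when(mockSpark.readStream).thenReturn(mockStreamReader)
  when(mockStreamReader.format("kafka")).thenReturn(mockStreamReader)
  when(mockStreamReader.load()).thenReturn(df)

  // streamingDummySource is hypothetical: its read(spark) would call
  // spark.readStream.format("kafka").load() and return the DataFrame.
  streamingDummySource.read(mockSpark).count() should be(2)
}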
Ref : https://medium.com/walmartglobaltech/spark-mocking-read-readstream-write-and-writestream-b6fe70761242
I am reading data from an MQTT streaming source with the Spark Structured Streaming API.
val lines = spark.readStream
  .format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
  .option("topic", "Employee")
  .option("username", "username")
  .option("password", "password")
  .option("clientId", "employee11")
  .load("tcp://localhost:8000").as[(String, Timestamp)]
I convert the streaming data to the Employee case class:
case class Employee(Name: String, Department: String)

// parse and DefaultFormats come from json4s
val ds = lines.map { row =>
  implicit val format = DefaultFormats
  parse(row._1).extract[Employee]
}
....some transformations
df.writeStream
  .outputMode("append")
  .format("es")
  .option("es.resource", "spark/employee")
  .option("es.nodes", "localhost")
  .option("es.port", 9200)
  .start()
  .awaitTermination()
Now, some messages in the queue had a different structure than the Employee case class. Let's say some required columns were missing. My streaming job failed with a field-not-found exception.
Now I would like to handle such exceptions and also send an alert notification for them. I tried putting in a try/catch block:
case class ErrorMessage(row: String)

catch {
  case e: Exception =>
    val ds = lines.map { row =>
      implicit val format = DefaultFormats
      parse(row._1).extract[ErrorMessage]
    }
    val error = lines.foreach(row => {
      sendErrorMail(row._1)
    })
}
I got the exception: Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Any help on this will be appreciated.
I think you should rather use the object returned by the start() method, as described in the Spark Structured Streaming documentation. Something like:
val query = df.writeStream. ... .start()
try {
  // If the query has terminated with an exception, the exception will be thrown here.
  query.awaitTermination()
} catch {
  case ex: Exception => // code to send mail
}
Implementing your own foreach sink can cause overhead due to frequently opening and closing connections.
I created a foreach sink in the catch block and was able to handle the exceptions and send out mail alerts as well.
catch {
  case e: Exception =>
    val foreachWriter = new ForeachWriter[Row] {
      override def open(partitionId: Long, version: Long): Boolean = {
        true
      }

      override def process(value: Row): Unit = {
        // code for sending mail.........
      }

      override def close(errorOrNull: Throwable): Unit = {}
    }

    val df = lines.selectExpr("cast (value as string) as json")
    df.writeStream
      .foreach(foreachWriter)
      .outputMode("append")
      .start()
      .awaitTermination()
}
If the stream is writing to Delta tables, you can use merge (in foreachBatch) to handle exceptions.
First, create the function to merge and handle problems:
from delta.tables import DeltaTable

myTable = DeltaTable.forName(spark, "MYTABLE")

# Function to upsert microBatchOutputDF into the Delta table using merge
def insertMessages(microBatchOutputDF, batchId):
    try:
        myTable.alias("trg").merge(
            microBatchOutputDF.alias("src"),
            """
            src.keyId = trg.keyId and
            src.secondKeyId = trg.secondKeyId
            """) \
            .whenNotMatchedInsertAll() \
            .execute()
    except Exception as e:
        print(f"Exception in writing data to MYTABLE: {e}")
        try:
            pass  # do something with the bad data / log the issue
        except:
            print(f"Exception in writing bad data / logging the issue: {e}")
Run the stream:
mytable_df.writeStream \
    .format("delta") \
    .foreachBatch(insertMessages) \
    .outputMode("append") \
    .option("checkpointLocation", "/tmp/delta/messages/_checkpoints2/") \
    .start()
Important note: if at least one record in the batch causes an exception (for example a NOT NULL constraint violation), then the whole batch (all records) is not merged. The stream keeps running after that issue; it does not break.
I am trying to use Spark Structured Streaming in update output mode to write to a file. I found this StructuredSessionization example and it works fine as long as the console format is configured. But if I change the output to a JSON file sink:
val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .format("json")
  .option("path", "/work/output/data")
  .option("checkpointLocation", "/work/output/checkpoint")
  .start()
I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source json does not support Update output mode;
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:279)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:286)
at palyground.StructuredStreamingMergeSpans$.main(StructuredStreamingMergeSpans.scala:84)
at palyground.StructuredStreamingMergeSpans.main(StructuredStreamingMergeSpans.scala)
Can I use update mode with the file format to write the result table to a file sink?
In the source code I found a pattern match that ensures Append mode.
You cannot write to a file in update mode using Spark Structured Streaming. You need to write a ForeachWriter for it. I have written a simple ForeachWriter below; you can modify it according to your requirements.
// FileUtils comes from Apache Commons IO; FileWriter/File from java.io.
val writerForText = new ForeachWriter[Row] {
  var fileWriter: FileWriter = _

  override def open(partitionId: Long, version: Long): Boolean = {
    FileUtils.forceMkdir(new File(s"src/test/resources/${partitionId}"))
    fileWriter = new FileWriter(new File(s"src/test/resources/${partitionId}/temp"))
    true
  }

  override def process(value: Row): Unit = {
    fileWriter.append(value.toSeq.mkString(",") + "\n") // one row per line
  }

  override def close(errorOrNull: Throwable): Unit = {
    fileWriter.close()
  }
}

val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .foreach(writerForText)
  .start()
Append output mode is required for any of the file format sinks, including json, which Spark Structured Streaming validates before starting your streaming query:
if (outputMode != OutputMode.Append) {
  throw new AnalysisException(
    s"Data source $className does not support $outputMode output mode")
}
In Spark 2.4, you could use the DataStreamWriter.foreach operator or the brand-new DataStreamWriter.foreachBatch operator, which simply takes a function that receives the Dataset of a micro-batch and the batch ID:
foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T]
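For example, a sketch of that approach for the JSON case (paths reused from the question; writing each micro-batch of updates to its own batch_id subdirectory is just one possible layout):

import org.apache.spark.sql.Dataset

// Writes the updated rows of one micro-batch as JSON; called once per trigger.
def writeUpdatesAsJson(batch: Dataset[_], batchId: Long): Unit =
  batch.write.mode("overwrite").json(s"/work/output/data/batch_id=$batchId")

val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .foreachBatch(writeUpdatesAsJson _)
  .option("checkpointLocation", "/work/output/checkpoint")
  .start()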
I have a data stream that needs to be written to a Kafka topic as well as HBase.
For Kafka, I use a format like this:
dataset.selectExpr("id as key", "to_json(struct(*)) as value")
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", Settings.KAFKA_URL)
.option("topic", Settings.KAFKA_TOPIC2)
.option("checkpointLocation", "/usr/local/Cellar/zookeepertmp")
.outputMode(OutputMode.Complete())
.start()
and then for HBase, I do something like this:
dataset.writeStream.outputMode(OutputMode.Complete())
  .foreach(new ForeachWriter[Row] {
    override def process(r: Row): Unit = {
      // my logic
    }

    override def close(errorOrNull: Throwable): Unit = {}

    override def open(partitionId: Long, version: Long): Boolean = {
      true
    }
  }).start().awaitTermination()
This writes to HBase as expected but doesn't always write to the Kafka topic. I am not sure why that is happening.
Use foreachBatch in Spark:
If you want to write the output of a streaming query to multiple locations, then you can simply write the output DataFrame/Dataset multiple times. However, each attempt to write can cause the output data to be recomputed (including possible re-reading of the input data). To avoid recomputations, you should cache the output DataFrame/Dataset, write it to multiple locations, and then uncache it. Here is an outline.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(…).save(…)  // location 1
  batchDF.write.format(…).save(…)  // location 2
  batchDF.unpersist()
}
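Applied to the question's two sinks, a rough sketch could look like the one below. It assumes dataset is a DataFrame; Settings.* and the checkpoint path come from the question, and the HBase part is only a placeholder, not a real HBase client call:

import org.apache.spark.sql.{DataFrame, Row}

def writeToBothSinks(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.persist()

  // Location 1: Kafka -- a batch write of the current micro-batch
  batchDF.selectExpr("id as key", "to_json(struct(*)) as value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", Settings.KAFKA_URL)
    .option("topic", Settings.KAFKA_TOPIC2)
    .save()

  // Location 2: HBase -- reuse the row-by-row logic from your ForeachWriter here,
  // ideally opening one connection per partition.
  batchDF.foreachPartition { (rows: Iterator[Row]) =>
    rows.foreach(row => println(row)) // stand-in for the real HBase put
  }

  batchDF.unpersist()
}

dataset.writeStream
  .foreachBatch(writeToBothSinks _)
  .option("checkpointLocation", "/usr/local/Cellar/zookeepertmp")
  .start()
  .awaitTermination()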
Is there a way in Spark's structured streaming to add a final operation to a DataStreamWriter's query plan? I'm attempting to read from a streaming data source, enrich the data in some way, and then write back to a partitioned, external table (assume Hive) in parquet format. The write operation works just fine, partitioning the data in directories for me, but I can't seem to figure out how to additionally run an MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION operation after writing the data to disk for any new partitions that may have been created.
For simplicity's sake, take the following Scala code as an example:
SparkSession
  .builder()
  .appName("some name")
  .enableHiveSupport()
  .getOrCreate()
  .readStream
  .format("text")
  .load("/path/from/somewhere")
  // additional transformations
  .writeStream
  .format("parquet")
  .partitionBy("some_column")
  .start("/path/to/somewhere")
  // <-- something I can place here for an additional operation?
  .awaitTermination()
Potential workarounds:
1: Maybe using something like .foreach(new ForeachWriter[Row]) and passing a FileStreamSink or something similar would work (using def close() to run an external query), but I haven't looked into it enough to get a good grasp on using it. Edit: using ForeachWriter does not result in the close() method being called after a batch completes.
2: Forking the stream. Something along the lines of the following:
val stream = SparkSession
  .builder()
  .appName("some name")
  .enableHiveSupport()
  .getOrCreate()
  .readStream
  .format("text")
  .load("/path/from/somewhere")
  // additional transformations

stream
  .writeStream
  .format("parquet")
  .partitionBy("some_column")
  .start("/path/to/somewhere")
  .awaitTermination()

stream
  .map(getPartitionName).distinct
  .map { partition =>
    // Run query here
    partition
  }
  .writeStream
  .start()
  .awaitTermination()
The problem here would be ensuring the first operation completes before the second.
3: Naming the query and attaching a listener for completed batches which manually adds all partitions. A bit of a waste, but potentially viable?
...
stream
  .writeStream
  .queryName("SomeName")
  ...

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = Unit

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.name == "SomeName") {
      // search through files in the filesystem and add partitions
      fileSystem.listDir("/path/to/directory").foreach { partition =>
        // run "ALTER TABLE ADD PARTITION $partition"
      }
    }
  }

  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = Unit
})
I didn't see anything in the documentation that covers this, hopefully I didn't miss anything. Thanks in advance.
Using a StreamingQueryListener works, though I'm not sure if it's good/bad practice.
I implemented something along the lines of this:
spark.streams.addListener(new StreamingQueryListener() {
  val client = new Client()

  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = Unit
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = Unit

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.numInputRows > 0 &&
        event.progress.sink.description.startsWith("FileSink") &&
        event.progress.sink.description.contains("/path/to/write/directory")) {
      client.sql(s"MSCK REPAIR TABLE $db.$table")
    }
  }
})
If you happen to have time-based partitions, this works decently as long as you intend to create partitions based on now():
spark.streams.addListener(new StreamingQueryListener() {
  val client = new Client()
  var lastPartition: String = ""
  val dateTimeFormat: String = "yyyy-MM-dd"

  override def onQueryStarted...
  override def onQueryTerminated...

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.numInputRows > 0 &&
        event.progress.sink.description.startsWith("FileSink[s3") &&
        event.progress.sink.description.contains("/path/to/write/directory")) {
      val newPartition = new DateTime().toString(dateTimeFormat)
      if (newPartition != lastPartition) {
        client.sql(s"ALTER TABLE $db.$table ADD IF NOT EXISTS PARTITION ($partitionColumn='$newPartition')")
        lastPartition = newPartition
      }
    }
  }
})