Spark Structured Streaming Run Final Operation on External Table (MSCK REPAIR) - scala

Is there a way in Spark's structured streaming to add a final operation to a DataStreamWriter's query plan? I'm attempting to read from a streaming data source, enrich the data in some way, and then write back to a partitioned, external table (assume Hive) in parquet format. The write operation works just fine, partitioning the data in directories for me, but I can't seem to figure out how to additionally run an MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION operation after writing the data to disk for any new partitions that may have been created.
For simplicity's sake, take the following Scala code as an example:
SparkSession
  .builder()
  .appName("some name")
  .enableHiveSupport()
  .getOrCreate()
  .readStream
  .format("text")
  .load("/path/from/somewhere")
  // additional transformations
  .writeStream
  .format("parquet")
  .partitionBy("some_column")
  .start("/path/to/somewhere")
  // <-------------------- something I can place here for an additional operation?
  .awaitTermination()
Potential workarounds?:
1: Maybe using something like .foreach(new ForeachWriter[Row]) and passing a FileStreamSink or something similar would work (using def close() to run an external query), but I haven't looked into it enough to get a good grasp on it. Edit: using a ForeachWriter doesn't result in the close() method being called after a batch completes.
2: Forking the stream. Something along the lines of the following:
val stream = SparkSession
  .builder()
  .appName("some name")
  .enableHiveSupport()
  .getOrCreate()
  .readStream
  .format("text")
  .load("/path/from/somewhere")
  // additional transformations

stream
  .writeStream
  .format("parquet")
  .partitionBy("some_column")
  .start("/path/to/somewhere")
  .awaitTermination()

stream
  .map(getPartitionName).distinct
  .map { partition =>
    // Run query here
    partition
  }
  .writeStream
  .start()
  .awaitTermination()
The problem here would be ensuring the first operation completes before the second.
3: Naming the query and attaching a listener for completed batches which manually adds all partitions. A bit of a waste, but potentially viable?
...
stream
  .writeStream
  .queryName("SomeName")
  ...

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.name == "SomeName") {
      // search through files in the filesystem and add partitions
      fileSystem.listDir("/path/to/directory").foreach { partition =>
        // run "ALTER TABLE ADD PARTITION $partition"
      }
    }
  }
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = ()
})
I didn't see anything in the documentation that covers this, hopefully I didn't miss anything. Thanks in advance.

Using a StreamingQueryListener works, though I'm not sure if it's good/bad practice.
I implemented something along the lines of this:
spark.streams.addListener(new StreamingQueryListener() {
  val client = new Client()

  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.numInputRows > 0 &&
        event.progress.sink.description.startsWith("FileSink") &&
        event.progress.sink.description.contains("/path/to/write/directory")) {
      client.sql(s"MSCK REPAIR TABLE $db.$table")
    }
  }
})
If you happen to have time-based partitions, this works decently as long as you intend to create partitions based on now():
spark.streams.addListener(new StreamingQueryListener() {
  val client = new Client()
  var lastPartition: String = ""
  val dateTimeFormat: String = "yyyy-MM-dd"

  override def onQueryStarted...
  override def onQueryTerminated...

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    if (event.progress.numInputRows > 0 &&
        event.progress.sink.description.startsWith("FileSink[s3") &&
        event.progress.sink.description.contains("/path/to/write/directory")) {
      val newPartition = new DateTime().toString(dateTimeFormat)
      if (newPartition != lastPartition) {
        client.sql(s"ALTER TABLE $db.$table ADD IF NOT EXISTS PARTITION ($partitionColumn='$newPartition')")
        lastPartition = newPartition
      }
    }
  }
})
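For a self-contained variant that doesn't depend on the Client helper, here is a minimal sketch that issues the repair through the SparkSession itself; the table name and output path below are placeholders, and listener callbacks run on the driver, so spark.sql is usable there:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val tableName = "mydb.mytable"         // placeholder: the external table to repair
val outputPath = "/path/to/somewhere"  // placeholder: must match the path passed to start()

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // Only react when the batch actually wrote rows to our file sink.
    val wroteToOurSink =
      event.progress.numInputRows > 0 &&
        event.progress.sink.description.startsWith("FileSink") &&
        event.progress.sink.description.contains(outputPath)

    if (wroteToOurSink) {
      // Listener callbacks run on the driver, so the SparkSession is available here.
      spark.sql(s"MSCK REPAIR TABLE $tableName")
    }
  }
})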

Related

How to collect all records at spark executor and process it as batch

In my Spark Kinesis streaming application I am using foreachBatch to get the streaming data and need to send it to the Drools rule engine for further processing.
My requirement is that I need to accumulate all the JSON data in a list/rule session and send it to the rule engine for processing as a batch on the executor side.
//Scala Code Example:
val dataFrame = sparkSession.readStream
  .format("kinesis")
  .option("streamName", streamName)
  .option("region", region)
  .option("endpointUrl", endpointUrl)
  .option("initialPosition", "TRIM_HORIZON")
  .load()

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreachBatch(function)
  .start()

query.awaitTermination()

val function = (batchDF: DataFrame, batchId: Long) => {
  val ruleSession = kBase.newKieSession() // Drools rule session; this is created on the driver side
  batchDF.foreach(row => { // this piece of code runs on the executors
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // NullPointerException here, as the ruleSession is not available on the executor
  })
  ruleHandler.processRule(ruleSession) // again, this is in the driver scope
}
In the above code, the problem I am facing is: the function used in foreachBatch executes on the driver side, while the code inside batchDF.foreach executes on the workers/executors, which therefore fail to get the ruleSession.
Is there any way to run the whole function on each executor?
OR
Is there a better way to accumulate all the data in a batch DataFrame after transformation and send it to the next process from within the executor/worker?
I think this might work ... Rather than running foreach, you could use foreachBatch or foreachPartition (or a map version like mapPartitions if you want to return info). In that portion, open a connection to the Drools system. From that point, iterate over the dataset within each partition (or batch), sending each record to the Drools system (or you might send that whole chunk to Drools). At the end of the foreachPartition / foreachBatch section, close the connection (if applicable). A rough sketch of that idea follows.
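As a rough sketch of that suggestion (my own illustration; kBase, jsonHandler, ruleHandler and JSONData are the helpers from the question and are assumed to be usable on the executors):
import org.apache.spark.sql.{DataFrame, Row}

val processBatch = (batchDF: DataFrame, batchId: Long) => {
  batchDF.foreachPartition { rows: Iterator[Row] =>
    // Everything in this block runs on an executor, so the rule session is
    // created where it is used instead of being shipped from the driver.
    val ruleSession = kBase.newKieSession()
    rows.foreach { row =>
      val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
      ruleSession.insert(jsonData)
    }
    // Fire the rules once per partition, then release the session
    // (the "close the connection" step).
    ruleHandler.processRule(ruleSession)
    ruleSession.dispose()
  }
}

dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreachBatch(processBatch)
  .start()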
#codeaperature, this is how I achieved batching, inspired by your answer; posting it as an answer since it exceeds the word limit of a comment.
Using foreach on the DataFrame and passing in a ForeachWriter:
Initializing the rule session in the open method of the ForeachWriter.
Adding each input JSON to the rule session in the process method.
Executing the rules in the close method with the rule session loaded with the batch of data.
//Scala code:
val dataFrame = sparkSession.readStream
  .format("kinesis")
  .option("streamName", streamName)
  .option("region", region)
  .option("endpointUrl", endpointUrl)
  .option("initialPosition", "TRIM_HORIZON")
  .load()

val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreach(dataConsumer)
  .start()

val dataConsumer = new ForeachWriter[Row] {
  var ruleSession: KieSession = null

  def open(partitionId: Long, version: Long): Boolean = { // open is called once per partition for every batch
    ruleSession = kBase.newKieSession()
    true
  }

  def process(row: Row): Unit = { // process is called for each record of the batch/partition
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // add each input JSON to the rule session
  }

  def close(errorOrNull: Throwable): Unit = { // close is called after process has been called for all records
    val factCount = ruleSession.getFactCount
    if (factCount > 0) {
      ruleHandler.processRule(ruleSession) // batch processing of the rules
    }
  }
}

How to execute dynamic SQLs in streaming queries?

I'm using Spark Structured Streaming and processing messages from Kafka.
At one point my result table looks something like below, where each line in the dataset has a Spark SQL query.
+----+--------------------+
|code| triggerSql|
+----+--------------------+
| US|SELECT * FROM def...|
| UK|SELECT * FROM def...|
+----+--------------------+
I need to execute each of these queries and process the results. However, Structured Streaming won't allow me to collect these SQL statements on the driver side, and we can't open a new SparkSession inside a transformation.
val query = df3.writeStream.foreach(new ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = {
    //..
  }

  override def process(value: Row): Unit = {
    val triggerSqlString = value.getAs[String]("triggerSql")
    val code = value.getAs[String]("value")
    println("Code=" + code + "; TriggerSQL=" + triggerSqlString)
    //TODO
  }

  override def close(errorOrNull: Throwable): Unit = {
    // println("===> Closing..")
  }
}).trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
Is there a better way to dynamically execute these SQL statements in Spark?
tl;dr Use DataStreamWriter.foreachBatch operation.
The following sample shows how one could achieve execution of SQL queries from a batch dataset:
def sqlExecution(ds: Dataset[String], batchId: Long): Unit = {
  ds.as[String].collect.foreach { s => sql(s).show } // sql here is spark.sql (imported by default in spark-shell)
}

spark
  .readStream
  .textFile("sqls")
  .writeStream
  .foreachBatch(sqlExecution)
  .start

How to use update output mode with FileFormat format?

I am trying to use Spark Structured Streaming in update output mode to write to a file. I found this StructuredSessionization example and it works fine as long as the console format is configured. But if I change the sink to a file format:
val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .format("json")
  .option("path", "/work/output/data")
  .option("checkpointLocation", "/work/output/checkpoint")
  .start()
I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source json does not support Update output mode;
  at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:279)
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:286)
  at palyground.StructuredStreamingMergeSpans$.main(StructuredStreamingMergeSpans.scala:84)
  at palyground.StructuredStreamingMergeSpans.main(StructuredStreamingMergeSpans.scala)
Can I use update mode with the FileFormat to write the result table to a file sink?
In the source code I found a pattern match that ensures Append mode.
You cannot write to a file in update mode using Spark Structured Streaming. You need to write a ForeachWriter for it. I have written a simple ForeachWriter here; you can modify it according to your requirements.
val writerForText = new ForeachWriter[Row] {
  var fileWriter: FileWriter = _

  override def process(value: Row): Unit = {
    fileWriter.append(value.toSeq.mkString(","))
  }

  override def close(errorOrNull: Throwable): Unit = {
    fileWriter.close()
  }

  override def open(partitionId: Long, version: Long): Boolean = {
    FileUtils.forceMkdir(new File(s"src/test/resources/${partitionId}"))
    fileWriter = new FileWriter(new File(s"src/test/resources/${partitionId}/temp"))
    true
  }
}
val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .foreach(writerForText)
  .start()
Append output mode is required for any of the FileFormat sinks, including json, which Spark Structured Streaming validates before starting your streaming query.
if (outputMode != OutputMode.Append) {
  throw new AnalysisException(
    s"Data source $className does not support $outputMode output mode")
}
In Spark 2.4, you could use the DataStreamWriter.foreach operator or the brand new DataStreamWriter.foreachBatch operator, which simply accepts a function that takes the Dataset of a batch and the batch ID.
foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T]
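For illustration, here is a minimal sketch of that approach (my own, not from the original answer), assuming sessionUpdates from the question can be treated as a DataFrame; inside foreachBatch the regular batch writer is used, which is not subject to the streaming Append-only check:
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch only: the output and checkpoint paths are the ones from the question.
def writeBatch(batchDF: DataFrame, batchId: Long): Unit = {
  // The batch writer has no streaming output-mode restriction; the updated
  // rows of each micro-batch are simply appended as new JSON files.
  batchDF.write
    .mode(SaveMode.Append)
    .json("/work/output/data")
}

val query = sessionUpdates.toDF()
  .writeStream
  .outputMode("update")
  .foreachBatch(writeBatch _)
  .option("checkpointLocation", "/work/output/checkpoint")
  .start()
Note that this just appends every batch's updated rows as new files; collapsing multiple updates for the same key is left to whatever reads the output.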

Structured Spark Streaming multiple writes

I have a data stream that needs to be written to a Kafka topic as well as HBase.
For Kafka, I use a format like this:
dataset.selectExpr("id as key", "to_json(struct(*)) as value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", Settings.KAFKA_URL)
  .option("topic", Settings.KAFKA_TOPIC2)
  .option("checkpointLocation", "/usr/local/Cellar/zookeepertmp")
  .outputMode(OutputMode.Complete())
  .start()
and then for HBase, I do something like this:
dataset.writeStream.outputMode(OutputMode.Complete())
  .foreach(new ForeachWriter[Row] {
    override def process(r: Row): Unit = {
      // my logic
    }
    override def close(errorOrNull: Throwable): Unit = {}
    override def open(partitionId: Long, version: Long): Boolean = {
      true
    }
  }).start().awaitTermination()
This writes to HBase as expected but doesn't always write to the Kafka topic. I am not sure why that is happening.
Use foreachBatch in Spark:
If you want to write the output of a streaming query to multiple locations, then you can simply write the output DataFrame/Dataset multiple times. However, each attempt to write can cause the output data to be recomputed (including possible re-reading of the input data). To avoid recomputations, you should cache the output DataFrame/Dataset, write it to multiple locations, and then uncache it. Here is an outline.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(…).save(…) // location 1
  batchDF.write.format(…).save(…) // location 2
  batchDF.unpersist()
}
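To make that outline concrete for this question, here is a rough sketch (my own adaptation, not from the docs) that writes each micro-batch to Kafka via the batch writer and then runs the per-row HBase logic on the executors; Settings.KAFKA_URL, Settings.KAFKA_TOPIC2 and the checkpoint path are reused from the snippets above, and the HBase put itself is left as a placeholder:
import org.apache.spark.sql.{DataFrame, Row}

def writeBoth(batchDF: DataFrame, batchId: Long): Unit = {
  batchDF.persist()

  // Location 1: Kafka (the batch writer supports the kafka format as well).
  batchDF.selectExpr("id as key", "to_json(struct(*)) as value")
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", Settings.KAFKA_URL)
    .option("topic", Settings.KAFKA_TOPIC2)
    .save()

  // Location 2: HBase, running the existing per-row logic on the executors.
  batchDF.foreachPartition { rows: Iterator[Row] =>
    rows.foreach { row =>
      // my HBase put logic goes here (placeholder)
    }
  }

  batchDF.unpersist()
}

dataset.toDF() // treat the stream as a DataFrame for this sketch
  .writeStream
  .option("checkpointLocation", "/usr/local/Cellar/zookeepertmp")
  .foreachBatch(writeBoth _)
  .start()
  .awaitTermination()
Since both writes now happen inside a single streaming query, they also share a single checkpoint.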

Structured Streaming - Foreach Sink

I am basically reading from a Kafka source and dumping each message through to my foreach processor (thanks to Jacek's page for the simple example).
If this actually works, I will perform some business logic in the process method here; however, this doesn't work. I believe the println doesn't show up since it runs on the executors and there is no way to get those logs back to the driver. However, the insert into a temp table should at least work and show me that the messages are actually consumed and processed through to the sink.
What am I missing here?
Really looking for a second set of eyes to check my effort here:
val stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", Util.getProperty("kafka10.broker"))
  .option("subscribe", src_topic)
  .load()

val rec = stream.selectExpr("CAST(value AS STRING) as txnJson").as[(String)]
val df = stream.selectExpr("cast (value as string) as json")

val writer = new ForeachWriter[Row] {
  val scon = new SConConnection

  override def open(partitionId: Long, version: Long) = {
    true
  }

  override def process(value: Row) = {
    println("++++++++++++++++++++++++++++++++++++" + value.get(0))
    scon.executeUpdate("insert into rs_kafka10(miscCol) values(" + value.get(0) + ")")
  }

  override def close(errorOrNull: Throwable) = {
    scon.closeConnection
  }
}

val yy = df.writeStream
  .queryName("ForEachQuery")
  .foreach(writer)
  .outputMode("append")
  .start()

yy.awaitTermination()
Thanks to the comments from Harald and others, I found out a couple of things which led me to normal processing behaviour:
Test the code in local mode; YARN isn't the biggest help in debugging.
For some reason, the process method of the foreach sink doesn't allow calling other methods. When I put my business logic directly in there, it works (a rough sketch of that follows below).
Hope it helps others.
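For completeness, a minimal sketch of that inlined approach (my own restatement, reusing the SConConnection, rs_kafka10 and df names from the snippet above); the connection is created in open on the executor instead of being a field shipped from the driver:
import org.apache.spark.sql.{ForeachWriter, Row}

val writer = new ForeachWriter[Row] {
  // Created on the executor in open(), not serialized from the driver.
  @transient private var scon: SConConnection = _

  override def open(partitionId: Long, version: Long): Boolean = {
    scon = new SConConnection
    true
  }

  override def process(value: Row): Unit = {
    // Business logic inlined here; this runs on the executors.
    val payload = value.getString(0)
    scon.executeUpdate(s"insert into rs_kafka10(miscCol) values('$payload')")
  }

  override def close(errorOrNull: Throwable): Unit = {
    scon.closeConnection
  }
}

df.writeStream
  .queryName("ForEachQuery")
  .foreach(writer)
  .outputMode("append")
  .start()
  .awaitTermination()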