Structured Spark Streaming multiple writes - scala

I am using a data stream to be written to a kafka topic as well as hbase.
For Kafka, I use a format as this:
dataset.selectExpr("id as key", "to_json(struct(*)) as value")
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", Settings.KAFKA_URL)
.option("topic", Settings.KAFKA_TOPIC2)
.option("checkpointLocation", "/usr/local/Cellar/zookeepertmp")
.outputMode(OutputMode.Complete())
.start()
and then for Hbase, I do something like this:
dataset.writeStream.outputMode(OutputMode.Complete())
.foreach(new ForeachWriter[Row] {
override def process(r: Row): Unit = {
//my logic
}
override def close(errorOrNull: Throwable): Unit = {}
override def open(partitionId: Long, version: Long): Boolean = {
true
}
}).start().awaitTermination()
This writes to Hbase as expected but doesn't always write to the kafka topic. I am not sure why that is happening.

Use foreachBatch in spark:
If you want to write the output of a streaming query to multiple locations, then you can simply write the output DataFrame/Dataset multiple times. However, each attempt to write can cause the output data to be recomputed (including possible re-reading of the input data). To avoid recomputations, you should cache the output DataFrame/Dataset, write it to multiple locations, and then uncache it. Here is an outline.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.persist()
batchDF.write.format(…).save(…) // location 1
batchDF.write.format(…).save(…) // location 2
batchDF.unpersist()
}

Related

How to execute dynamic SQLs in streaming queries?

I'm using Spark Structured streaming, and processing messages from Kafka.
At one point my result table looks something like below, where each line in the dataset has a Spark SQL query.
+----+--------------------+
|code| triggerSql|
+----+--------------------+
| US|SELECT * FROM def...|
| UK|SELECT * FROM def...|
+----+--------------------+
I need to execute each of these queries and process the results. However, structured streaming won't allow to collect these SQLs to driver side, and We can't open a new SparkSession inside any transformation.
val query = df3.writeStream.foreach(new ForeachWriter[Row] {
override def open(partitionId: Long, epochId: Long): Boolean = {
//..
}
override def process(value: Row): Unit = {
val triggerSqlString = value.getAs[String]("triggerSql")
val code = value.getAs[String]("value")
println("Code="+code+"; TriggerSQL="+triggerSqlString)
//TODO
}
override def close(errorOrNull: Throwable): Unit = {
// println("===> Closing..")
}
}).trigger(Trigger.ProcessingTime("5 seconds"))
.start()
Is there any better alternative way to dynamically execute these SQL in spark.
tl;dr Use DataStreamWriter.foreachBatch operation.
The following sample shows how one could achieve execution of SQL queries from a batch dataset:
def sqlExecution(ds: Dataset[String], batchId: Long): Unit = {
ds.as[String].collect.foreach { s => sql(s).show }
}
spark
.readStream
.textFile("sqls")
.writeStream
.foreachBatch(sqlExecution)
.start

How to use update output mode with FileFormat format?

I am trying to use spark structured streaming in update output mode write to a file. I found this StructuredSessionization example and it works fine as long as the console format is configured. But if I change the output mode to:
val query = sessionUpdates
.writeStream
.outputMode("update")
.format("json")
.option("path", "/work/output/data")
.option("checkpointLocation", "/work/output/checkpoint")
.start()
I get following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source json does not support Update output mode;
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:279)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:286)
at palyground.StructuredStreamingMergeSpans$.main(StructuredStreamingMergeSpans.scala:84)
at palyground.StructuredStreamingMergeSpans.main(StructuredStreamingMergeSpans.scala)
Can i use update mode and use the FileFormat to write the result table to a file sink?
In the source code i found a pattern match that ensures Append Mode.
You cannot write to file in update mode using spark structured streaming. You need to write ForeachWriter for it. I have written simple for each writer here. You can modify it according to your requirement.
val writerForText = new ForeachWriter[Row] {
var fileWriter: FileWriter = _
override def process(value: Row): Unit = {
fileWriter.append(value.toSeq.mkString(","))
}
override def close(errorOrNull: Throwable): Unit = {
fileWriter.close()
}
override def open(partitionId: Long, version: Long): Boolean = {
FileUtils.forceMkdir(new File(s"src/test/resources/${partitionId}"))
fileWriter = new FileWriter(new File(s"src/test/resources/${partitionId}/temp"))
true
}
}
val query = sessionUpdates
.writeStream
.outputMode("update")
.foreach(writerForText)
.start()
Append output mode is required for any of the FileFormat sinks, incl. json, which Spark Structured Streaming validates before starting your streaming query.
if (outputMode != OutputMode.Append) {
throw new AnalysisException(
s"Data source $className does not support $outputMode output mode")
}
In Spark 2.4, you could use DataStreamWriter.foreach operator or the brand new DataStreamWriter.foreachBatch operator that simply accepts a function that accepts the Dataset of a batch and the batch ID.
foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T]

Spark Structured Streaming Run Final Operation on External Table (MSCK REPAIR)

Is there a way in Spark's structured streaming to add a final operation to a DataStreamWriter's query plan? I'm attempting to read from a streaming data source, enrich the data in some way, and then write back to a partitioned, external table (assume Hive) in parquet format. The write operation works just fine, partitioning the data in directories for me, but I can't seem to figure out how to additionally run an MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION operation after writing the data to disk for any new partitions that may have been created.
For simplicity's sake, take the following Scala code as an example:
SparkSession
.builder()
.appName("some name")
.enableHiveSupport()
.getOrCreate()
.readStream
.format("text")
.load("/path/from/somewhere")
// additional transformations
.writeStream
.format("parquet")
.partitionBy("some_column")
.start("/path/to/somewhere")
<-------------------- something I can place here for an additional operation?
.awaitTermination()
Potential workarounds?:
1: Maybe using something like .foreach(new ForeachWriter[Row]) and passing a FileStreamSink or something similar would work (using the def close() to run an external query), but I haven't looked into it enough to get a good grasp on using it. - using ForeachWriter doesn't result in the close() method being called after a batch completes.
2: Forking the stream. Something along the lines of the following:
val stream = SparkSession
.builder()
.appName("some name")
.enableHiveSupport()
.getOrCreate()
.readStream
.format("text")
.load("/path/from/somewhere")
// additional transformations
stream
.writeStream
.format("parquet")
.partitionBy("some_column")
.start("/path/to/somewhere")
.awaitTermination()
stream
.map(getPartitionName).distinct
.map { partition =>
// Run query here
partition
}
.writeStream
.start()
.awaitTermination()
The problem here would be ensuring the first operation completes before the second.
3: Naming the query and attaching a listener for completed batches which manually adds all partitions. A bit of a waste, but potentially viable?
...
stream
.writeStream
.queryName("SomeName")
...
spark.streams.addListener(new StreamingQueryListener() {
override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = Unit
override def onQueryProgress(event: QueryProgressEvent): Unit = {
if (event.progress.name == "SomeName") {
// search through files in filesystem and add partitions
fileSystem.listDir("/path/to/directory").foreach { partition =>
// run "ALTER TABLE ADD PARTITION $partition"
}
}
}
override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = Unit
})
I didn't see anything in the documentation that covers this, hopefully I didn't miss anything. Thanks in advance.
Using a StreamingQueryListener works, though I'm not sure if it's good/bad practice.
I implemented something along the lines of this:
spark.streams.addListener(new StreamingQueryListener() {
val client = new Client()
override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = Unit
override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = Unit
override def onQueryProgress(event: QueryProgressEvent): Unit = {
if (event.progress.numInputRows > 0 && event.progress.sink.description.startsWith("FileSink") && event.progress.sink.description.contains("/path/to/write/directory")) {
client.sql(s"MSCK REPAIR TABLE $db.$table")
}
}
})
If you happen to have time-based partitions, this works decently as long as you intend to create partitions based on now():
spark.streams.addListener(new StreamingQueryListener() {
val client = new Client()
var lastPartition: String = ""
val dateTimeFormat: String = "yyyy-MM-dd"
override def onQueryStarted...
override onQueryTerminated...
override def onQueryProgress(event: QueryProgressEvent): Unit = {
if (event.progress.numInputRows > 0 && event.progress.sink.description.startsWith("FileSink[s3") && event.progress.sink.description.contains("/path/to/write/directory")) {
val newPartition = new DateTime().toString(dateTimeFormat)
if (newPartition != lastPartition) {
client.sql(s"ALTER TABLE $db.$table ADD IF NOT EXISTS PARTITION ($partitionColumn='$newPartition')")
lastPartition = newPartition
}
}
}

Structured Streaming - Foreach Sink

I am basically reading from a Kafka source, and dumping each message through to my foreach processor (Thanks Jacek's page for the simple example).
If this actually works, i shall actually perform some business logic in the process method here, however, this doesn't work. I believe that the println doesn't work since its running on executors and there is no way for getting those logs back to driver. However, this insert into a temp table should at least work and show me that the messages are actually consumed and processed through to the sink.
What am I missing here ?
Really looking for a second set of eyes to check my effort here:
val stream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", Util.getProperty("kafka10.broker"))
.option("subscribe", src_topic)
.load()
val rec = stream.selectExpr("CAST(value AS STRING) as txnJson").as[(String)]
val df = stream.selectExpr("cast (value as string) as json")
val writer = new ForeachWriter[Row] {
val scon = new SConConnection
override def open(partitionId: Long, version: Long) = {
true
}
override def process(value: Row) = {
println("++++++++++++++++++++++++++++++++++++" + value.get(0))
scon.executeUpdate("insert into rs_kafka10(miscCol) values("+value.get(0)+")")
}
override def close(errorOrNull: Throwable) = {
scon.closeConnection
}
}
val yy = df.writeStream
.queryName("ForEachQuery")
.foreach(writer)
.outputMode("append")
.start()
yy.awaitTermination()
Thanks for comments from Harald and others, I found out a couple of things, which led me to achieve normal processing behaviour -
test code with local mode, yarn isnt the biggest help in debugging
for some reason, the process method of foreach sink doesnt allow calling other methods. When i put my business logic directly in there, it works.
hope it helps others.

How to write ElasticsearchSink for Spark Structured streaming

I'm using Spark structured streaming to process high volume data from Kafka queue and doing some heaving ML computation but I need to write the result to Elasticsearch.
I tried using the ForeachWriter but can't get a SparkContext inside it, the other option probably is to do HTTP Post inside the ForeachWriter.
Right now, am thinking of writing my own ElasticsearchSink.
Is there any documentation out there to create a Sink for Spark Structured streaming ?
If you are using Spark 2.2+ and ES 6.x then there is a ES sink out of the box:
df
.writeStream
.outputMode(OutputMode.Append())
.format("org.elasticsearch.spark.sql")
.option("es.mapping.id", "mappingId")
.start("index/type") // index/type
If you are using ES 5.x like I was you need to implement an EsSink and an EsSinkProvider:
EsSinkProvider:
class EsSinkProvider extends StreamSinkProvider with DataSourceRegister {
override def createSink(sqlContext: SQLContext,
parameters: Map[String, String],
partitionColumns: Seq[String],
outputMode: OutputMode): Sink = {
EsSink(sqlContext, parameters, partitionColumns, outputMode)
}
override def shortName(): String = "my-es-sink"
}
EsSink:
case class ElasticSearchSink(sqlContext: SQLContext,
options: Map[String, String],
partitionColumns: Seq[String],
outputMode: OutputMode)
extends Sink {
override def addBatch(batchId: Long, df: DataFrame): Unit = synchronized {
val schema = data.schema
// this ensures that the same query plan will be used
val rdd: RDD[String] = df.queryExecution.toRdd.mapPartitions { rows =>
val converter = CatalystTypeConverters.createToScalaConverter(schema)
rows.map(converter(_).asInstanceOf[Row]).map(_.getAs[String](0))
}
// from org.elasticsearch.spark.rdd library
EsSpark.saveJsonToEs(rdd, "index/type", Map("es.mapping.id" -> "mappingId"))
}
}
And then lastly, when writing the stream use this provider class as the format:
df
.writeStream
.queryName("ES-Writer")
.outputMode(OutputMode.Append())
.format("path.to.EsProvider")
.start()
You can take a look at ForeachSink. It shows how to implement a Sink and convert DataFrame to RDD (it's very tricky and has a large comment). However, please be aware that the Sink API is still private and immature, it might be changed in future.