Spark Streaming write to Kafka with delay - after x minutes - scala

We have a Spark Streaming application.
The architecture is as follows:
Kinesis to Spark to Kafka.
The Spark application is using qubole/kinesis-sql for structured streaming from Kinesis. The data is then aggregated and then pushed to Kafka.
Our use case demands a delay of 4 minutes before pushing to Kafka.
The aggregation uses 2-minute windows with a watermark of 4 minutes:
val windowedCountsDF = messageDS
  .withWatermark("timestamp", "4 minutes")
  .groupBy(window($"timestamp", "2 minutes", "2 minutes"), $"id", $"eventType", $"topic")
The write to Kafka is triggered every two minutes:
val eventFilteredQuery = windowedCountsDF
  .selectExpr("topic", "id as key", "to_json(struct(*)) AS value")
  .writeStream
  .trigger(Trigger.ProcessingTime("2 minutes"))
  .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
  .option("checkpointLocation", checkPoint)
  .outputMode("update")
  .option("kafka.bootstrap.servers", kafkaBootstrapServers)
  .queryName("events_kafka_stream")
  .start()
I can change the trigger time to match the window, but some events still get pushed to Kafka instantly.
Is there any way to delay writes to Kafka until x minutes after the window is completed?
Thanks

Change your output mode from update to append (the default option). The update output mode writes every updated row to the sink on each trigger, so whether or not you use a watermark makes no difference to when rows are written.
However, in append mode any write needs to wait until the watermark is crossed, which is exactly what you want:
Append mode uses watermark to drop old aggregation state. But the output of a windowed aggregation is delayed the late threshold specified in withWatermark() as by the modes semantics, rows can be added to the Result Table only once after they are finalized (i.e. after watermark is crossed).
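A minimal sketch of the query from the question switched to append mode. Several things here are assumptions rather than the asker's code: the aggregation is taken to be count() (the question's snippet omits it and only the name windowedCountsDF suggests it), id is cast to a string for the Kafka key, and the object and method names are made up for illustration. In append mode a window is emitted only after the watermark (the maximum event time seen so far minus 4 minutes) has passed the end of that window, so the actual delay also depends on new events arriving and on the trigger interval.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

object DelayedKafkaWrite {

  // 2-minute tumbling windows with a 4-minute watermark on the event-time column.
  def windowedCounts(messages: DataFrame): DataFrame =
    messages
      .withWatermark("timestamp", "4 minutes")
      .groupBy(window(col("timestamp"), "2 minutes"), col("id"), col("eventType"), col("topic"))
      .count() // assumption: the question does not show its aggregation

  // Starts the Kafka sink in append mode, so each window is written once, after it is finalized.
  def start(messages: DataFrame, kafkaBootstrapServers: String, checkPoint: String): StreamingQuery =
    windowedCounts(messages)
      .selectExpr("topic", "CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", kafkaBootstrapServers)
      .option("checkpointLocation", checkPoint)
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("2 minutes"))
      .queryName("events_kafka_stream")
      .start()
}
```

Calling DelayedKafkaWrite.start(messageDS, kafkaBootstrapServers, checkPoint) would replace the update-mode query; a fresh checkpoint location is the safer choice when switching output modes.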

Related

Spark structured streaming avoid delay and checkpointing: startingOffsets latest does not work?

I am developing a Spark Structured Streaming process for a real-time application.
I need to read current Kafka messages without any delay.
Messages older than 30 seconds are not relevant in this project.
I am reading old messages with a big delay (minutes) relative to the current timestamp. It seems that Spark Structured Streaming does not honor the startingOffsets property when it is set to latest.
I guess that the problem is the HDFS checkpoint location of the topic I write to...
I do not want to read old messages; only the current ones are important!
I have tested many different configurations, Kafka properties, etc., but nothing worked.
Here is my code and the relevant config (the kafka.bootstrap.servers and kafka.ssl.* properties are not included here but they exist).
Spark version: 2.4.0-cdh6.3.3
Consumer properties used at readStream
offsets.retention.minutes -> 1,
groupIdPrefix -> 710b6fb4-4454-4a52-819e-f565e047ecb7,
subscribe -> topic_x,
consumer.override.auto.offset.reset -> latest,
failOnDataLoss -> false,
startingOffset -> latest,
offsets.retention.check.interval.ms -> 1000
Reader Kafka topic readerB
val readerB = spark
  .readStream
  .format("kafka")
  .options(consumerProperties)
  .load()
Producer properties used at writeStream
topic -> topic_destination,
checkpointLocation -> checkPointLocation
Write stream block
val sendToKafka = dtXXXKafka
  .select(to_sr(struct($"*"), xxx_out_topic, xxx_out_topic, schemaRegistryNamespace).alias("value"))
  .writeStream
  .format("kafka")
  .options(producerProperties(properties, xxx_agg_out_topic, xxx_agg_out_localcheckpoint_hdfs))
  .outputMode("append")
  .start()
The startingOffsets property (note the plural spelling of the option name) is applied only when a query is started, that is, only for the first batch of your streaming query. Afterwards, this property is ignored and the batches are defined according to the checkpoint data.
In your case, the streaming query starts by reading only the freshest data from your Kafka topic (thanks to the latest setting). But the second batch (and all following batches) will be defined according to the checkpoint; in other words, they will start exactly where the previous batch ended. The offsets folder in the checkpoint contains the ending offsets (exclusive) for each batch, so batch X starts from the offsets saved for batch X-1. This is how exactly-once delivery semantics are achieved in Spark Structured Streaming.
docs: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
If you want to process only the data from the last 30 seconds, I would suggest filtering the DataFrame by the timestamp field to include only the data from the desired time frame.
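The timestamp filter suggested above could look roughly like this. It is a sketch, assuming the DataFrame comes straight from the Kafka source (which exposes a timestamp column); the object name FreshOnly and the helper recentOnly are hypothetical. Note that this only drops old records after they have been read; it does not stop Spark from fetching them.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

object FreshOnly {
  // Keeps only rows whose Kafka record timestamp is at most maxAgeSeconds old.
  // In a streaming query, current_timestamp() is evaluated once per micro-batch.
  def recentOnly(kafkaDF: DataFrame, maxAgeSeconds: Int = 30): DataFrame =
    kafkaDF.filter(
      expr(s"`timestamp` >= current_timestamp() - INTERVAL $maxAgeSeconds seconds")
    )
}
```

With the reader from the question this would be used as FreshOnly.recentOnly(readerB) before any further transformation.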

Spark Kafka Streaming - There a is a lot of delay in processing the batch

I am running Spark Streaming with Kafka for a word count program. There is a lot of delay in batch creation and processing: around 2 minutes for each batch.
How can I reduce this time? Are there any properties, at the Spark Streaming level or the Kafka level, that can be configured to make this as quick as possible?
You should define the interval between batches in your (non-structured) StreamingContext, for example:
val ssc = new StreamingContext(new SparkConf(), Minutes(1))
In Structured Streaming you have the option kafkaConsumer.pollTimeoutMs,
with 512 ms as the default value in Spark 2.x; more information: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
Another problem can come from Kafka lag. Your application can take a long time to process a specific offset, maybe 2 minutes, and only once that offset is finished will it poll the next ones for processing. Try to look at the current offset of your consumer group and the last offset of your topic.
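To make the lag check concrete: lag per partition is the last offset in the topic minus the consumer group's current offset. The sketch below only does that arithmetic; the object name and the offset numbers are made up, and in practice the two maps would come from whatever tool you use to inspect the consumer group.

```scala
object KafkaLag {
  // Lag per partition = latest offset in the topic minus the consumer's current offset.
  // A partition the consumer has never read is treated as starting from offset 0.
  def lagPerPartition(latest: Map[Int, Long], current: Map[Int, Long]): Map[Int, Long] =
    latest.map { case (partition, end) =>
      partition -> math.max(0L, end - current.getOrElse(partition, 0L))
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical offsets for a two-partition topic.
    val latest  = Map(0 -> 1500L, 1 -> 1480L)
    val current = Map(0 -> 1200L, 1 -> 1480L)
    val lag = lagPerPartition(latest, current)
    println(s"lag per partition: $lag, total: ${lag.values.sum}")
  }
}
```

A lag that keeps growing from one check to the next suggests that processing, not batch creation, is the bottleneck.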

Spark Structured Streaming checkpoint usage in production

I have trouble understanding how checkpoints work in Spark Structured Streaming.
I have a Spark process that generates some events, which I log in a Hive table.
For those events, I receive a confirmation event in a Kafka stream.
I created a new Spark process that:
- reads the events from the Hive log table into a DataFrame
- joins those events with the stream of confirmation events using Spark Structured Streaming
- writes the joined DataFrame to an HBase table.
I tested the code in spark-shell and it works fine; the pseudocode is below (I'm using Scala).
val tableA = spark.table("tableA")
val startingOffsets = "earliest"
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()
val joinTableAWithStreamOfData = streamOfData.join(tableA, Seq("a"), "inner")
joinTableAWithStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  ).start()
  .awaitTermination()
Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling understanding how to use checkpoints here.
At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run, and inner join those new events with my log table, so to write only new data to the final HBase table.
I created a directory in HDFS where to store the checkpoint file.
I provided that location to the spark-submit command I use to call the spark code.
spark-submit --conf spark.sql.streaming.checkpointLocation=path_to_hdfs_checkpoint_directory \
  --all_the_other_settings_and_libraries
At this moment the code runs fine every 15 minutes without any error, but it doesn't do anything basically since it is not dumping the new events to the HBase table.
Also the checkpoint directory is empty, while I assume some file has to be written there?
And does the readStream call need to be adapted so that it starts reading from the latest checkpoint?
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets) // ??
  .option("otherOptions", otherOptions)
I'm really struggling to understand the spark documentation regarding this.
Thank you in advance!
Trigger
"Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling understanding how to use checkpoints here."
If you want your job to be triggered every 15 minutes, you can make use of Triggers.
You do not need to "use" checkpointing specifically, but just provide a reliable (e.g. HDFS) checkpoint location, see below.
Checkpointing
"At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run [...]"
When reading data from Kafka in a Spark Structured Streaming application it is best to have the checkpoint location set directly in your StreamingQuery. Spark uses this location to create checkpoint files that keep track of your application's state and also record the offsets already read from Kafka.
When the application is restarted, it checks these checkpoint files to determine where to continue reading from Kafka, so it does not skip or miss any message. You do not need to set startingOffsets manually.
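The rule described above can be pictured with a deliberately simplified model. This is not Spark's internal code; the object, type alias and function names are invented purely to illustrate that checkpointed offsets take precedence and startingOffsets is only a fallback for the very first run.

```scala
object OffsetResolution {
  // partition -> next offset to read
  type Offsets = Map[Int, Long]

  // Toy model: offsets recorded in the checkpoint win;
  // the startingOffsets setting is used only when no checkpoint exists yet.
  def startOffsets(checkpointed: Option[Offsets], fromStartingOffsets: Offsets): Offsets =
    checkpointed.getOrElse(fromStartingOffsets)

  def main(args: Array[String]): Unit = {
    val earliest: Offsets = Map(0 -> 0L, 1 -> 0L)
    // First run: nothing in the checkpoint, so startingOffsets applies.
    println(startOffsets(None, earliest))
    // Restart: the checkpoint holds the end of the previous batch, so startingOffsets is ignored.
    println(startOffsets(Some(Map(0 -> 42L, 1 -> 17L)), earliest))
  }
}
```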
Keep in mind that only specific changes to your application's code are allowed if the checkpoint files are to be used for safe restarts. A good overview can be found in the Structured Streaming Programming Guide under Recovery Semantics after Changes in a Streaming Query.
Overall, for production Spark Structured Streaming applications reading from Kafka I recommend the following structure:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
val spark = SparkSession.builder().[...].getOrCreate()
val streamOfData = spark.readStream
  .format("kafka")
  // option startingOffsets is only relevant the very first time this application runs. After that, the checkpoint files are used.
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()
// perform any kind of transformations on streaming DataFrames
val processedStreamOfData = streamOfData.[...]
val streamingQuery = processedStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  )
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.ProcessingTime("15 minutes"))
  .start()
streamingQuery.awaitTermination()

Spark Structured Streaming with foreach

I am using spark structured streaming to read data from kafka.
val readStreamDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", config.getString("kafka.source.brokerList"))
  .option("startingOffsets", config.getString("kafka.source.startingOffsets"))
  .option("subscribe", config.getString("kafka.source.topic"))
  .load()
Based on an uid in the message read from kafka, I have to make an api call to an external source and fetch data and write back to another kafka topic.
For this I am using a custom foreach writer and processing every message.
import spark.implicits._
val eventData = readStreamDF
  .select(from_json(col("value").cast("string"), event).alias("message"), col("timestamp"))
  .withColumn("uid", col("message.eventPayload.uid"))
  .drop("message")
val q = eventData
  .writeStream
  .format("console")
  .foreach(new CustomForEachWriter())
  .start()
The CustomForEachWriter makes an API call and fetches results for the given uid from a service. The result is an array of ids. These ids are then written back to another Kafka topic via a Kafka producer.
There are 30 Kafka partitions and I have launched Spark with the following config:
num-executors = 30
executors-cores = 3
executor-memory = 10GB
But the Spark job still starts lagging and is not able to keep up with the incoming data rate.
The incoming data rate is around 10K messages per second. The average time to process a single message is 100 ms.
I want to understand how Spark processes this in the case of Structured Streaming.
As I understand it, in Structured Streaming there is one dedicated executor which is responsible for reading data from all partitions of Kafka.
Does that executor distribute tasks based on the number of partitions in Kafka?
The data in a batch gets processed sequentially. How can it be processed in parallel so as to maximize the throughput?
I think the CustomForEachWriter works on a single row/record of the dataset at a time. If you are using Spark 2.4, you can experiment with foreachBatch, although its API is marked as Evolving.
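Some rough arithmetic shows why the job falls behind: at 10K messages per second and 100 ms per message, about 1000 calls have to be in flight at any moment, while 30 executors with 3 cores each give 90 task slots, which at one blocking call per slot is roughly 900 messages per second at best. The sketch below shows one way foreachBatch could be wired up to spread each micro-batch over more partitions than the topic has. It is only an illustration: lookupAndPublish is a hypothetical placeholder for the API call plus Kafka write, and the object name and the parallelism parameter are assumptions, not part of the question's code.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.streaming.StreamingQuery

object ParallelForeachBatch {

  // Calls that must be in flight at once to keep up: rate (msg/s) x latency (s), rounded up.
  def requiredConcurrency(messagesPerSec: Long, millisPerMessage: Long): Long =
    (messagesPerSec * millisPerMessage + 999L) / 1000L

  // Hypothetical placeholder for the external API call and the Kafka producer write.
  def lookupAndPublish(uid: String): Unit = ()

  def start(eventData: DataFrame, parallelism: Int): StreamingQuery = {
    // Runs on the executors, once per partition of the micro-batch.
    val processPartition: Iterator[Row] => Unit =
      rows => rows.foreach(row => lookupAndPublish(row.getString(0)))

    // Declared with an explicit function type to avoid overload ambiguity in foreachBatch.
    val processBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch
        .select("uid")
        .repartition(parallelism) // more tasks than the 30 Kafka partitions
        .rdd
        .foreachPartition(processPartition)

    eventData.writeStream.foreachBatch(processBatch).start()
  }
}
```

Repartitioning only helps up to the number of available cores; beyond that, the per-record latency itself would have to come down, for example by batching or issuing the lookups asynchronously inside each partition.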

Spark Structured Streaming - writeStream split data hourly

Currently we have Spark Streaming running in production. I am in the process of converting the code to use Structured Streaming.
I am able to successfully read data from Kinesis and write (sink) it to Parquet files in S3.
Our business logic demands that we write the streamed data to hourly folders. The incoming data from Kinesis does not have any date-time field, so we cannot partitionBy date-time.
We have defined a function [getSubfolderNameFromDate()] which gets us the current hour+date (1822062017 - 18th hour of the day, 22nd June 2017) so we can write data to hourly folders.
With Spark Streaming the context gets re-initialized and automatically writes data to the next hour's folder, but I am not able to achieve the same with Structured Streaming.
For example, if 2 million records were streamed in the 4th hour of the day, they should be written to "S3_location/0422062017.parquet/", data streamed in the following hour should be in "S3_location/0522062017.parquet/", and so on.
With Structured Streaming it continues to write to the same folder throughout the day, which I understand is because it evaluates the folder name just once and continues to append the data to it.
But I want to append new data to hourly folders. Is there a way to achieve this?
I am currently using the query below:
val query = streamedDataDF
  .writeStream
  .format("parquet")
  .option("checkpointLocation", checkpointDir)
  .option("path", fileLocation + FileDirectory.getSubfolderNameFromDate() + ".parquet/")
  .outputMode("append")
  .start()