Spark Structured Streaming - writeStream split data hourly - scala

We currently have Spark Streaming running in production, and I am in the process of converting the code to use Structured Streaming.
I am able to successfully read data from Kinesis and write (sink) it to Parquet files in S3.
Our business logic requires that we write the streamed data into hourly folders. The incoming data from Kinesis does not have any date-time field, so we cannot partitionBy a date-time column.
We have defined a function [getSubfolderNameFromDate()] which returns the current hour+date (1822062017 - 18th hour of the day, 22nd June 2017) so we can write data into hourly folders.
With Spark Streaming the context gets re-initialized and automatically writes data into the next hour's folder, but I am not able to achieve the same with Structured Streaming.
For example, if 2 million records were streamed in the 4th hour of the day, they should be written to "S3_location/0422062017.parquet/", data streamed in the following hour should go to "S3_location/0522062017.parquet/", and so on.
With Structured Streaming it continues to write to the same folder throughout the day, which I understand is because it evaluates the folder name just once and keeps appending data to it.
But I want to append new data to hourly folders. Is there a way to achieve this?
I am currently using the query below:
val query = streamedDataDF
  .writeStream
  .format("parquet")
  .option("checkpointLocation", checkpointDir)
  .option("path", fileLocation + FileDirectory.getSubfolderNameFromDate() + ".parquet/")
  .outputMode("append")
  .start()
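A hedged sketch (not part of the original post) of a common workaround: derive an hour+date column from processing time and let partitionBy create the hourly folders, instead of resolving the folder name once at query start. The hour_folder column and the "HHddMMyyyy" pattern are assumptions that mirror the getSubfolderNameFromDate() format described above.
import org.apache.spark.sql.functions.{current_timestamp, date_format}

// derive an hour+date value (e.g. 0422062017) from processing time
val withHourFolder = streamedDataDF
  .withColumn("hour_folder", date_format(current_timestamp(), "HHddMMyyyy"))

val query = withHourFolder
  .writeStream
  .format("parquet")
  .option("checkpointLocation", checkpointDir)
  .option("path", fileLocation)      // base path; Spark creates hour_folder=... subdirectories
  .partitionBy("hour_folder")        // e.g. S3_location/hour_folder=0522062017/
  .outputMode("append")
  .start()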

Related

Spark structured streaming avoid delay and checkpointing: startingOffsets latest does not work?

I am developing a Spark Structured Streaming process for a real-time application.
I need to read the current Kafka messages without any delay.
Messages older than 30 seconds are not relevant in this project.
Right now I am reading old messages with a big delay from the current timestamp (minutes); it seems that Spark Structured Streaming does not properly apply the startingOffsets property set to latest.
I guess that the problem is the HDFS checkpoint location of the topic where I write ...
I do not want to read old messages; only the current ones are important!
I have tested many different configurations, Kafka properties, etc., but nothing worked.
Here is my code and the relevant config (kafka.bootstrap.servers and kafka.ssl.* properties are not included here but exist).
Spark version: 2.4.0-cdh6.3.3
Consumer properties used at readStream
offsets.retention.minutes -> 1,
groupIdPrefix -> 710b6fb4-4454-4a52-819e-f565e047ecb7,
subscribe -> topic_x,
consumer.override.auto.offset.reset -> latest,
failOnDataLoss -> false,
startingOffset -> latest,
offsets.retention.check.interval.ms -> 1000
Reader Kafka topic readerB
val readerB = spark
  .readStream
  .format("kafka")
  .options(consumerProperties)
  .load()
Producer properties used at writeStream
topic -> topic_destination,
checkpointLocation -> checkPointLocation
Write stream block
val sendToKafka = dtXXXKafka
  .select(to_sr(struct($"*"), xxx_out_topic, xxx_out_topic, schemaRegistryNamespace).alias("value"))
  .writeStream
  .format("kafka")
  .options(producerProperties(properties, xxx_agg_out_topic, xxx_agg_out_localcheckpoint_hdfs))
  .outputMode("append")
  .start()
The startingOffset property is applied only when a query is started for the first time, i.e. for the first batch of your streaming query. Afterwards, this property is ignored and the batches are defined according to the checkpoint data.
In your case, the streaming query starts by reading only the freshest data from your Kafka topic (thanks to the startingOffset -> latest setting). But the second batch (and all subsequent batches) will be defined according to the checkpoint; in other words, they start exactly where the previous batch ended. The offsets folder in the checkpoint contains the ending offsets (exclusive) for each batch, so batch X starts from the offsets saved for batch X-1. This is how exactly-once delivery semantics are achieved in Spark Structured Streaming.
docs: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
If you want to process only the data from the last 30 seconds, I would suggest filtering the DataFrame by the timestamp field so that it includes only the data from the desired time frame.
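A minimal sketch of that filter (not part of the original answer), using the timestamp column that the Kafka source exposes:
val freshMessages = readerB
  .where("timestamp >= current_timestamp() - INTERVAL 30 SECONDS")  // keep only the last 30 seconds of processing time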

Spark Structured Streaming checkpoint usage in production

I have trouble understanding how checkpoints work with Spark Structured Streaming.
I have a Spark process that generates some events, which I log in a Hive table.
For those events, I receive a confirmation event in a Kafka stream.
I created a new spark process that
reads the events from the Hive log table into a DataFrame
joins those events with the stream of confirmation events using Spark Structured Streaming
writes the joined DataFrame to an HBase table.
I tested the code in spark-shell and it works fine; below is the pseudocode (I'm using Scala).
val tableA = spark.table("tableA")
val startingOffsets = "earliest"
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()
val joinTableAWithStreamOfData = streamOfData.join(tableA, Seq("a"), "inner")
joinTableAWithStreamOfData
  .writeStream
  .foreach(writeDataToHBaseTable())
  .start()
  .awaitTermination()
Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling to understand how to use checkpoints here.
At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run, and inner join those new events with my log table, so that only new data is written to the final HBase table.
I created a directory in HDFS where to store the checkpoint file.
I provided that location to the spark-submit command I use to call the spark code.
spark-submit --conf spark.sql.streaming.checkpointLocation=path_to_hdfs_checkpoint_directory
--all_the_other_settings_and_libraries
At the moment the code runs every 15 minutes without any error, but it basically doesn't do anything, since it is not writing the new events to the HBase table.
Also, the checkpoint directory is empty, while I assume some files should have been written there?
And does the readStream function need to be adapted so that it starts reading from the latest checkpoint?
val streamOfData = .readStream
.format("kafka")
.option("startingOffsets", startingOffsets) ??
.option("otherOptions", otherOptions)
I'm really struggling to understand the spark documentation regarding this.
Thank you in advance!
Trigger
"Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling understanding how to use checkpoints here.
In case you want your job to be triggered every 15 minutes, you can make use of Triggers.
You do not need to "use" checkpointing specifically, but just provide a reliable (e.g. HDFS) checkpoint location, see below.
Checkpointing
"At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run [...]"
When reading data from Kafka in a Spark Structured Streaming application it is best to have the checkpoint location set directly in your StreamingQuery. Spark uses this location to create checkpoint files that keep track of your application's state and also record the offsets already read from Kafka.
When restarting, the application will check these checkpoint files to understand where to continue reading from Kafka, so it does not skip or miss any messages. You do not need to set startingOffsets manually.
It is important to keep in mind that only specific changes in your application's code are allowed so that the checkpoint files can be used for safe restarts. A good overview can be found in the Structured Streaming Programming Guide on Recovery Semantics after Changes in a Streaming Query.
Overall, for production Spark Structured Streaming applications reading from Kafka I recommend the following structure:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().[...].getOrCreate()

val streamOfData = spark.readStream
  .format("kafka")
  // option startingOffsets is only relevant the very first time this application runs. After that, the checkpoint files are used.
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

// perform any kind of transformations on the streaming DataFrame
val processedStreamOfData = streamOfData.[...]

val streamingQuery = processedStreamOfData
  .writeStream
  .foreach(writeDataToHBaseTable())
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.ProcessingTime("15 minutes"))
  .start()

streamingQuery.awaitTermination()

Spark Streaming write to Kafka with delay - after x minutes

We have a Spark Streaming application.
The architecture is as follows:
Kinesis to Spark to Kafka.
The Spark application uses qubole/kinesis-sql for structured streaming from Kinesis. The data is then aggregated and pushed to Kafka.
Our use case demands a delay of 4 minutes before pushing to Kafka.
The windowing is done with a 2-minute window and a watermark of 4 minutes:
val windowedCountsDF = messageDS
  .withWatermark("timestamp", "4 minutes")
  .groupBy(window($"timestamp", "2 minutes", "2 minutes"), $"id", $"eventType", $"topic")
  .count()  // aggregation step assumed from the variable name; elided in the original snippet
Write to Kafka is triggered every two minutes
val eventFilteredQuery = windowedCountsDF
  .selectExpr("topic", "id as key", "to_json(struct(*)) AS value")
  .writeStream
  .trigger(Trigger.ProcessingTime("2 minutes"))
  .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
  .option("checkpointLocation", checkPoint)
  .outputMode("update")
  .option("kafka.bootstrap.servers", kafkaBootstrapServers)
  .queryName("events_kafka_stream")
  .start()
I can change the trigger time to match the window, but still some events get pushed to Kafka instantly.
Is there any way to delay writes to Kafka until x minutes after the window is completed?
Thanks
Change your output mode from update to append (the default option). The update output mode writes all updated rows to the sink, hence whether or not you use a watermark does not matter.
However, with append mode any writes need to wait until the watermark is crossed - which is exactly what you want:
Append mode uses watermark to drop old aggregation state. But the output of a windowed aggregation is delayed the late threshold specified in withWatermark() as by the modes semantics, rows can be added to the Result Table only once after they are finalized (i.e. after watermark is crossed).
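A minimal sketch of the suggested change (not part of the original answer), reusing the query from the question with only the output mode switched:
import org.apache.spark.sql.streaming.Trigger

val eventFilteredQuery = windowedCountsDF
  .selectExpr("topic", "id as key", "to_json(struct(*)) AS value")
  .writeStream
  .trigger(Trigger.ProcessingTime("2 minutes"))
  .format("org.apache.spark.sql.kafka010.KafkaSourceProvider")
  .option("checkpointLocation", checkPoint)
  .option("kafka.bootstrap.servers", kafkaBootstrapServers)
  .outputMode("append")  // was "update"; in append mode a window is emitted only after the 4-minute watermark passes
  .queryName("events_kafka_stream")
  .start()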

structured streaming read based on kafka partitions

I am using Spark Structured Streaming to read incoming messages from a Kafka topic and write them to multiple Parquet tables based on the incoming message.
So I created a single readStream, as the Kafka source is common, and for each Parquet table created a separate writeStream in a loop. This works, but the readStream is becoming a bottleneck, as each writeStream creates its own readStream and there is no way to cache the DataFrame that has already been read.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", conf.servers)
  .option("subscribe", conf.topics)
  // .option("earliestOffset","true")
  .option("failOnDataLoss", false)
  .load()
tables.foreach { tableName =>  // tables: the list of target table names (pseudocode in the original)
  // filter the data from the source based on the table name (filter logic omitted in the original),
  // then write to Parquet
  parquetDf.writeStream.format("parquet")
    .option("path", outputFolder + File.separator + tableName)
    .option("checkpointLocation", "checkpoint_" + tableName)
    .outputMode("append")
    .trigger(Trigger.Once())
    .start()
}
Now every writeStream creates a new consumer group, reads the entire data from Kafka, and then does the filtering and writing to Parquet. This creates a huge overhead. To avoid it, I could partition the Kafka topic to have as many partitions as there are tables, and then the readStream should only read from a given partition. But I don't see a way to specify partition details as part of the Kafka read stream.
If the data volume is not that high, write your own sink and collect the data of each micro-batch; then you should be able to cache that DataFrame and write it to different locations. It needs some tweaks, but it will work.
You can use the foreachBatch sink and cache the DataFrame. Hopefully that works.
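A hedged sketch of the foreachBatch suggestion (the tables collection, the table filter column, and the single checkpoint location are illustrative assumptions): Kafka is read once per micro-batch, the batch is cached, and it is written out once per table.
import java.io.File
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

val query = kafkaDf.writeStream
  .trigger(Trigger.Once())
  .option("checkpointLocation", "checkpoint_all_tables")
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    batchDf.persist()                           // cache the micro-batch so it is not recomputed per sink
    tables.foreach { tableName =>
      batchDf
        .filter(col("table") === tableName)     // assumed column holding the target table name
        .write
        .mode("append")
        .parquet(outputFolder + File.separator + tableName)
    }
    batchDf.unpersist()
  }
  .start()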

How to process files using Spark Structured Streaming chunk by chunk?

I am processing a large number of files, and I want to process them chunk by chunk; let's say that during each batch I want to process 50 files at a time.
How can I do this using Spark Structured Streaming?
I have seen that Jacek Laskowski (https://stackoverflow.com/users/1305344/jacek-laskowski) said in a similar question (Spark to process rdd chunk by chunk from json files and post to Kafka topic) that it was possible using Spark Structured Streaming, but I can't find any examples of it.
Thanks a lot,
If using File Source:
maxFilesPerTrigger: maximum number of new files to be considered in every trigger (default: no max)
spark
  .readStream
  .format("json")
  // note: streaming file sources also require an explicit schema via .schema(...)
  .option("maxFilesPerTrigger", 50)
  .load("/path/to/files")
If using a Kafka Source it would be similar but with the maxOffsetsPerTrigger option.
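A minimal sketch of the Kafka equivalent (broker and topic names are placeholders): maxOffsetsPerTrigger caps how many records are read in each micro-batch.
val kafkaChunks = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "some_topic")
  .option("maxOffsetsPerTrigger", 10000)  // at most 10,000 offsets per trigger
  .load()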