Offset Management for EventHub - Scala

Summary - I am using Scala Spark to read from Event Hub as a source and save it as a streaming dataframe,
and my streaming dataframe output looks like the below:
scala> messages.writeStream.outputMode("append").option("truncate", false).format("console").start().awaitTermination()
+----------+-----------------------+----------+---+--------+----+-------+
|Offset |Time (readable) |Timestamp |id |date |user|integer|
+----------+-----------------------+----------+---+--------+----+-------+
|4294970696|2022-01-25 07:20:19.082|1643113219|001|01292017|you |2696 |
+----------+-----------------------+----------+---+--------+----+-------+
How can I implement offset management in this code? I understand the checkpoint location, but I want to handle offset management based on my event failures, so that the next start resumes from the previous offset. I am not able to implement this logic. I would appreciate some guidance.
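For illustration only, a minimal sketch of one way to track the offsets yourself (an assumption, not something from the original question): register a StreamingQueryListener that reports each micro-batch's end offsets, and persist them somewhere durable so that a later run can be started from them. Checkpointing already does this for you; the exact option for feeding a stored offset back into the Event Hubs source depends on the connector version, so check its documentation.
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Sketch: after every micro-batch, print the per-source end offsets reported by the
// query progress; in a real job these would be written to durable storage instead.
class OffsetTrackingListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    event.progress.sources.foreach { source =>
      // source.endOffset is a JSON string describing the last offsets read in this batch
      println(s"${source.description} -> ${source.endOffset}")
    }
  }
}

spark.streams.addListener(new OffsetTrackingListener())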

Related

Convert Stream Dataframe into Spark Dataframe using Scala and Spark

I have the following stream dataframe.
+---------------------------------+
|value                            |
+---------------------------------+
|I am going to school 😀          |
|why are you crying 🙁 😞          |
|You are not very good my friend  |
+---------------------------------+
I have created the above dataframe using the code below:
val readStream = existingSparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("failOnDataLoss", false)
  .option("subscribe", "myTopic.raw")
  .load()
I want to store this stream dataframe as a regular Spark dataframe. Is it possible to convert it like that in Scala and Spark? In the end I want to convert the Spark dataframe into a list of sentences. The issue with the stream dataframe is that I am unable to convert it directly into a list that I can iterate over and run some data processing actions on.
You should be able to do many of the standard operations on the stream that you're getting from Kafka, but you need to take into account the differences in semantics between batch and streaming processing - refer to the Spark docs for that.
Also, when you're getting data from Kafka, the set of columns is fixed, and the payload is binary, so you need to cast the value column to string, or something like this (see docs):
val df = readStream.select($"value".cast("string").alias("sentences"))
After that you'll get a dataframe with the actual payload and can start processing. Depending on the complexity of the processing, you may need to resort to the foreachBatch functionality, but that may not be necessary - you need to provide more details on what kind of processing you need to do.
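For illustration, a minimal sketch of the foreachBatch route mentioned above (the per-sentence processing and checkpoint path are assumptions, not part of the original answer): inside foreachBatch each micro-batch is a regular dataframe, so it can be collected into a local list of sentences and iterated over.
import org.apache.spark.sql.DataFrame

// Sketch only: collect each micro-batch of the "sentences" dataframe built above
// into a local List[String] and iterate over it.
def processBatch(batch: DataFrame, batchId: Long): Unit = {
  val sentences: List[String] = batch.select("sentences").collect().map(_.getString(0)).toList
  sentences.foreach(println) // replace with the real per-sentence processing
}

df.writeStream
  .foreachBatch(processBatch _)
  .option("checkpointLocation", "/tmp/sentences-checkpoint") // placeholder path
  .start()
  .awaitTermination()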

Measure time taken for repartitioning of spark structured streaming dataframe

I want to do a performance comparison in which I measure the time taken by the repartitioning of a Spark Structured Streaming dataframe,
e.g. the time taken for the following step:
df.repartition($"colname")
Any ideas how this can be achieved?
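One possible approach, sketched under the assumption that a rough wall-clock measurement per micro-batch is good enough (this is not an authoritative answer): since repartition is lazy, force it with an action inside foreachBatch and time that.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: time the shuffle triggered by repartition for every micro-batch.
// The count() adds some overhead of its own, so treat the result as an upper bound.
df.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    val start = System.nanoTime()
    batch.repartition(col("colname")).count() // the count forces the repartition shuffle
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"batch $batchId: repartition + count took $elapsedMs ms")
  }
  .start()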

Spark Structured Streaming checkpoint usage in production

I have trouble understanding how checkpoints work when working with Spark Structured Streaming.
I have a Spark process that generates some events, which I log in a Hive table.
For those events, I receive a confirmation event in a Kafka stream.
I created a new Spark process that:
reads the events from the Hive log table into a DataFrame
joins those events with the stream of confirmation events using Spark Structured Streaming
writes the joined DataFrame to an HBase table.
I tested the code in spark-shell and it works fine; below is the pseudocode (I'm using Scala).
val tableA = spark.table("tableA")

val startingOffsets = "earliest"
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

val joinTableAWithStreamOfData = streamOfData.join(tableA, Seq("a"), "inner")

joinTableAWithStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  )
  .start()
  .awaitTermination()
Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling to understand how to use checkpoints here.
At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run, and inner join those new events with my log table, so that only new data is written to the final HBase table.
I created a directory in HDFS where to store the checkpoint file.
I provided that location to the spark-submit command I use to call the spark code.
spark-submit --conf spark.sql.streaming.checkpointLocation=path_to_hdfs_checkpoint_directory
--all_the_other_settings_and_libraries
At the moment the code runs fine every 15 minutes without any error, but it basically doesn't do anything, since it is not dumping the new events to the HBase table.
Also, the checkpoint directory is empty, while I assume some files have to be written there?
And does the readStream function need to be adapted so that it starts reading from the latest checkpoint?
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets) // ??
  .option("otherOptions", otherOptions)
I'm really struggling to understand the spark documentation regarding this.
Thank you in advance!
Trigger
"Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling understanding how to use checkpoints here.
In case you want your job to be triggered every 15 minutes, you can make use of Triggers.
You do not need to "use" checkpointing specifically, but just provide a reliable (e.g. HDFS) checkpoint location, see below.
Checkpointing
"At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run [...]"
When reading data from Kafka in a Spark Structured Streaming application it is best to have the checkpoint location set directly in your StreamingQuery. Spark uses this location to create checkpoint files that keep track of your application's state and also record the offsets already read from Kafka.
When restarting the application, it will check these checkpoint files to understand from where to continue reading from Kafka, so it does not skip or miss any messages. You do not need to set startingOffsets manually.
It is important to keep in mind that only specific changes in your application's code are allowed such that the checkpoint files can be used for secure re-starts. A good overview can be found in the Structured Streaming Programming Guide on Recovery Semantics after Changes in a Streaming Query.
Overall, for production Spark Structured Streaming applications reading from Kafka I recommend the following structure:
val spark = SparkSession.builder().[...].getOrCreate()

val streamOfData = spark.readStream
  .format("kafka")
  // option startingOffsets is only relevant for the very first time this application is running.
  // After that, checkpoint files are being used.
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

// perform any kind of transformations on streaming DataFrames
val processedStreamOfData = streamOfData.[...]

val streamingQuery = processedStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  )
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.ProcessingTime("15 minutes"))
  .start()

streamingQuery.awaitTermination()
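The writeDataToHBaseTable() call above is pseudocode from the question; for completeness, a hypothetical skeleton of what such a ForeachWriter could look like (the HBase client calls themselves are left as placeholders):
import org.apache.spark.sql.{ForeachWriter, Row}

class HBaseForeachWriter extends ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = {
    // open the HBase connection here; return true to process this partition
    true
  }
  override def process(row: Row): Unit = {
    // convert `row` into an HBase mutation and write it to the target table
  }
  override def close(errorOrNull: Throwable): Unit = {
    // flush and close the HBase connection
  }
}

// used as: .writeStream.foreach(new HBaseForeachWriter) ...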

Scala Spark Structured Streaming receiving duplicate messages

I am using Kafka and Spark 2.4.5 Structured Streaming. I am doing an average operation, but I am facing issues due to getting duplicate records from the Kafka topic in the current batch.
For example, these Kafka topic messages are received in the 1st batch in update mode:
car,Brand=Honda,speed=110,1588569015000000000
car,Brand=ford,speed=90,1588569015000000000
car,Brand=Honda,speed=80,1588569015000000000
Here the result is the average per car brand per timestamp,
i.e. grouping by 1588569015000000000 and Brand=Honda, the result we got is
(110+90)/2 = 100
Now, in the second batch, late data arrives containing duplicate messages with the same timestamp:
car,Brand=Honda,speed=50,1588569015000000000
car,Brand=Honda,speed=50,1588569015000000000
I am expecting the average to update to (110+90+50)/3 = 83.33,
but the result updates to (110+90+50+50)/4 = 75, which is wrong.
val rawDataStream: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("subscribe", "topic1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as data")

// group by timestamp and brand
// write to kafka with checkpoint
How can this be done with Spark Structured Streaming, or is there anything wrong in the code?
Spark Structured Streaming allows deduplication on a streaming dataframe using dropDuplicates. You need to specify the fields that identify a duplicate record, and across batches Spark will consider only the first record per combination; records with duplicate values will be discarded.
The snippet below will deduplicate your streaming dataframe on the Brand, speed and timestamp combination.
rawDataStream.dropDuplicates("Brand", "speed", "timestamp")
Refer to the Spark documentation here.
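Since rawDataStream above only exposes a single string column, the raw message would first have to be parsed into Brand, speed and timestamp columns before dropDuplicates can use them. A hedged sketch of that, assuming the comma-separated message format shown in the question (the parsing logic and column names are assumptions):
import org.apache.spark.sql.functions._

// Parse "car,Brand=Honda,speed=110,1588569015000000000" into typed columns.
val parsed = rawDataStream
  .withColumn("parts", split(col("data"), ","))
  .select(
    regexp_replace(col("parts").getItem(1), "Brand=", "").as("Brand"),
    regexp_replace(col("parts").getItem(2), "speed=", "").cast("double").as("speed"),
    col("parts").getItem(3).cast("long").as("timestamp"))

// Drop the late duplicates before aggregating; note that without a watermark the
// deduplication state grows unbounded, so consider withWatermark in production.
val averages = parsed
  .dropDuplicates("Brand", "speed", "timestamp")
  .groupBy(col("timestamp"), col("Brand"))
  .agg(avg(col("speed")).as("avg_speed"))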

Spark Structured Streaming - writeStream split data hourly

Currently we have Spark Streaming running in production. I am in the process of converting the code to use Structured Streaming.
I am able to successfully read data from Kinesis and write (sink) to a Parquet file in S3.
Our business logic demands that we write the streamed data into hourly folders. The incoming data from Kinesis does not have any date-time field, so we cannot partitionBy a date-time column.
We have defined a function [getSubfolderNameFromDate()] which gives us the current hour+date (1822062017 - 18th hour of the day, 22nd June 2017) so we can write data into hourly folders.
With Spark Streaming the context gets re-initialized and automatically writes data into the next hour's folder, but I am not able to achieve the same with Structured Streaming.
For example, if 2 million records were streamed in the 4th hour of the day, they should be written to "S3_location/0422062017.parquet/", data streamed in the following hour should be in "S3_location/0522062017.parquet/", and so on.
With Structured Streaming it continues to write to the same folder throughout the day, which I understand is because it evaluates the folder name just once and continues to append the data to it.
But I want to append new data to hourly folders; is there a way to achieve this?
I am currently using the query below:
val query = streamedDataDF
  .writeStream
  .format("parquet")
  .option("checkpointLocation", checkpointDir)
  .option("path", fileLocation + FileDirectory.getSubfolderNameFromDate() + ".parquet/")
  .outputMode("append")
  .start()
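One possible way around the single-evaluation problem, sketched under the assumption that ingestion time is an acceptable substitute for the missing date-time field (this is not the original solution): derive an hour+date column per record and let partitionBy create the hourly folders. Note the resulting layout is path/hour_date=0422062017/ rather than path/0422062017.parquet/.
import org.apache.spark.sql.functions._

// Sketch: tag every record with its ingestion hour and partition the sink on it,
// so new folders are created as the hour changes instead of being fixed at query start.
val withHourColumn = streamedDataDF
  .withColumn("ingest_ts", current_timestamp())
  .withColumn("hour_date", date_format(col("ingest_ts"), "HHddMMyyyy")) // e.g. 0422062017

val query = withHourColumn
  .writeStream
  .format("parquet")
  .option("checkpointLocation", checkpointDir)
  .option("path", fileLocation)
  .partitionBy("hour_date")
  .outputMode("append")
  .start()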