Spark Structured Streaming checkpoint vs SparkContext setCheckpointDir - Scala

Hello Stack Overflow community.
I'm running a Spark streaming app in a production environment, and we noticed that the Spark checkpoints contribute heavily to the under-replicated block count in HDFS and thus affect HDFS stability. I'm trying to find a proper way to clean up Spark checkpoints regularly rather than deleting them manually from HDFS. I referred to a couple of posts:
Spark Structured Streaming Checkpoint Cleanup and Spark structured streaming checkpoint size huge
What I came up with is to point the SparkContext checkpoint directory and the Structured Streaming checkpoint location at the same path and set the cleanup configuration to true. This creates one checkpoint directory per Spark context. I suspect this might contradict the purpose of checkpointing, but I'm still trying to understand Spark's internals and would appreciate any guidance here. Below is a snippet of my code:
spark.sparkContext.setCheckpointDir(checkPointLocation)
val options = Map("checkpointLocation" -> s"${spark.sparkContext.getCheckpointDir.get}")

val q = df.writeStream
  .options(options)
  .trigger(trigger)
  .queryName(queryName)
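For context, here is a minimal sketch of the full wiring I have in mind (assuming the "cleaning configuration" is spark.cleaner.referenceTracking.cleanCheckpoints; the source, sink, trigger and path are placeholders, not my real job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Sketch only. Assumes the "cleaning configuration" refers to
// spark.cleaner.referenceTracking.cleanCheckpoints, which cleans up RDD
// checkpoint files of out-of-scope RDDs (it does not touch the Structured
// Streaming checkpoint metadata itself).
val spark = SparkSession.builder()
  .appName("checkpoint-cleanup")
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()

val checkPointLocation = "hdfs:///tmp/spark-checkpoints"   // placeholder path
spark.sparkContext.setCheckpointDir(checkPointLocation)

val df = spark.readStream.format("rate").load()            // placeholder source

val q = df.writeStream
  .option("checkpointLocation", spark.sparkContext.getCheckpointDir.get)
  .trigger(Trigger.ProcessingTime("1 minute"))              // placeholder trigger
  .queryName("myQuery")
  .format("console")                                        // placeholder sink
  .start()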

Related

Spark Structured Streaming checkpoint usage in production

I have trouble understanding how checkpoints work with Spark Structured Streaming.
I have a Spark process that generates some events, which I log to a Hive table.
For those events, I receive a confirmation event in a Kafka stream.
I created a new Spark process that
reads the events from the Hive log table into a DataFrame,
joins those events with the stream of confirmation events using Spark Structured Streaming, and
writes the joined DataFrame to an HBase table.
I tested the code in spark-shell and it works fine; below is the pseudocode (I'm using Scala).
val tableA = spark.table("tableA")
val startingOffsets = "earliest"

val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

val joinTableAWithStreamOfData = streamOfData.join(tableA, Seq("a"), "inner")

joinTableAWithStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  )
  .start()
  .awaitTermination()
Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling to understand how to use checkpoints here.
At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run, and inner join those new events with my log table, so as to write only new data to the final HBase table.
I created a directory in HDFS in which to store the checkpoint files.
I provided that location to the spark-submit command I use to launch the Spark code.
spark-submit --conf spark.sql.streaming.checkpointLocation=path_to_hdfs_checkpoint_directory
--all_the_other_settings_and_libraries
At the moment the code runs every 15 minutes without any error, but it basically doesn't do anything, since it is not writing the new events to the HBase table.
Also, the checkpoint directory is empty, while I assume some files should be written there?
And does the readStream function need to be adapted so that it starts reading from the latest checkpoint?
val streamOfData = spark.readStream
  .format("kafka")
  .option("startingOffsets", startingOffsets) // ??
  .option("otherOptions", otherOptions)
I'm really struggling to understand the Spark documentation regarding this.
Thank you in advance!
Trigger
"Now I would like to schedule this code to run periodically, e.g. every 15 minutes, and I'm struggling understanding how to use checkpoints here.
In case you want your job to be triggered every 15 minutes, you can make use of Triggers.
You do not need to "use" checkpointing specifically, but just provide a reliable (e.g. HDFS) checkpoint location, see below.
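As a quick illustration, here is a minimal sketch of the trigger part only (the DataFrame name and sink are placeholders; Trigger.Once() is an alternative if you prefer to launch the job every 15 minutes from an external scheduler):

import org.apache.spark.sql.streaming.Trigger

// Sketch: the trigger is configured on the write side of the query.
// Trigger.ProcessingTime keeps the query running and fires a micro-batch
// every 15 minutes; Trigger.Once() processes whatever is available once
// and then stops, which suits an external scheduler.
val query = df.writeStream
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.ProcessingTime("15 minutes")) // or: .trigger(Trigger.Once())
  .format("console")                             // placeholder sink
  .start()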
Checkpointing
"At every run of this code, I would like to read from the stream only the events I haven't read yet in the previous run [...]"
When reading data from Kafka in a Spark Structured Streaming application, it is best to set the checkpoint location directly on your StreamingQuery. Spark uses this location to create checkpoint files that keep track of your application's state and also record the offsets already read from Kafka.
When restarting the application, it will check these checkpoint files to figure out where to continue reading from Kafka, so it does not skip or miss any messages. You do not need to set startingOffsets manually.
It is important to keep in mind that only specific changes in your application's code are allowed so that the checkpoint files can still be used for safe restarts. A good overview can be found in the Structured Streaming Programming Guide on Recovery Semantics after Changes in a Streaming Query.
Overall, for production Spark Structured Streaming applications reading from Kafka I recommend the following structure:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().[...].getOrCreate()

val streamOfData = spark.readStream
  .format("kafka")
  // startingOffsets is only relevant the very first time this application runs.
  // After that, the checkpoint files are used.
  .option("startingOffsets", startingOffsets)
  .option("otherOptions", otherOptions)
  .load()

// perform any kind of transformations on the streaming DataFrame
val processedStreamOfData = streamOfData.[...]

val streamingQuery = processedStreamOfData
  .writeStream
  .foreach(
    writeDataToHBaseTable()
  )
  .option("checkpointLocation", "/path/to/checkpoint/dir/in/hdfs/")
  .trigger(Trigger.ProcessingTime("15 minutes"))
  .start()

streamingQuery.awaitTermination()

When (if ever) to modify checkpointed metadata of streaming query in case of failure?

I have a doubt regarding Spark checkpoints. I have a Spark streaming application and I am managing checkpoints in HDFS using the following approach:
val checkpointDirectory = "hdfs://192.168.0.1:8020/markingChecksPoints"

df.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF
      .write
      .cassandraFormat(
        "table",
        "keyspace",
        "clustername"
      )
      .mode(SaveMode.Append)
      .save()
  }
  .outputMode(OutputMode.Append())
  .option("checkpointLocation", checkpointDirectory)
  .start()
Now, when I run the application, I get 4 folders in the checkpoint directory:
commits
offsets
metadata
sources
In the offsets folder, I have a file for each batch of offsets consumed, which looks like this:
v1
{"batchWatermarkMs":0,"batchTimestampMs":1574680172097,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"30"}}
{"datatopicname":{"23":441210,"8":3094007,"17":44862,"26":0,"11":4302147,"29":0,"2":3758094,"20":6273,"5":4620156,"14":15375428,"4":4511998,"13":10652363,"22":1247616,"7":1787900,"16":1239584,"25":0,"10":3441724,"1":1808759,"28":0,"19":4123,"27":0,"9":3293762,"18":68,"12":4439364,"3":5910468,"21":182,"15":13510271,"6":2510314,"24":0,"0":40337}}
So now my query is: in case of failure or any other scenario, how can I modify this directory so that when the application is restarted it resumes from a particular point?
I understand that whenever we restart the application it automatically picks up from the checkpoint, which is fine, but in case I want to start it from a specific value or change it, what should I do?
Shall I simply edit the last created file in "offsets"?
Or delete the checkpoint directory and restart the application with custom starting offsets for the first run, so that a new checkpoint directory is created from then onward?
There could be more files in the checkpointLocation directory (like state for stateful operators), but their role is exactly what you're asking for - in case of failure the stream processing engine of Spark Structured Streaming is going to resume a streaming query based on the checkpointed metadata.
Since these files are internal, it's not recommended to amend them in any way.
You can though (and perhaps that's one of the reasons why they're human-readable). Whether you change the existing offsets files or you create them from scratch does not really matter. The engine sees no difference. If the files are in proper format, they're going to be used. Otherwise, the checkpoint location will be of no use.
I'd rather use data source-specific options, e.g. startingOffsets for the kafka data source, to (re)start from a specific offset.
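For illustration, here is a minimal sketch of that approach for the topic shown in the question (the broker address and the offset values are placeholders; note that if a checkpoint already exists for the query, the checkpointed offsets take precedence, so this only applies to a fresh checkpoint location):

// Sketch: per-partition starting offsets for the Kafka source, passed as JSON.
// -2 stands for "earliest" and -1 for "latest" for a given partition.
// List every partition of the topic in practice; values here are made up.
val startingOffsetsJson =
  """{"datatopicname":{"0":40337,"1":1808759,"2":3758094}}"""

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "datatopicname")
  .option("startingOffsets", startingOffsetsJson)
  .load()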

Are spark.streaming.backpressure.* properties applicable to Spark Structured Streaming?

My understanding is that Spark Structured Streaming is built on top of Spark SQL and not Spark Streaming. Hence the question: do the properties that apply to Spark Streaming also apply to Spark Structured Streaming, such as:
spark.streaming.backpressure.initialRate
spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate
No, these settings are applicable only to the DStream API.
Spark Structured Streaming does not have a backpressure mechanism. You can find more details in this discussion: How Spark Structured Streaming handles backpressure?
No.
Spark Structured Streaming processes data as soon as possible by default, i.e. right after finishing the current micro-batch. You can control the rate of processing for various source types, e.g. maxFilesPerTrigger for files and maxOffsetsPerTrigger for Kafka.
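A minimal sketch of the Kafka case (broker address and topic name are placeholders):

// Sketch: cap how much data each micro-batch reads from Kafka.
// maxOffsetsPerTrigger limits the total number of offsets consumed per
// trigger interval, spread proportionally across the topic's partitions.
val throttledStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
  .option("subscribe", "events")                     // placeholder topic
  .option("maxOffsetsPerTrigger", "10000")           // at most 10,000 offsets per batch
  .load()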
This link http://javaagile.blogspot.com/2019/03/everything-you-needed-to-know-about.html explains that back pressure is not relevant.
It quotes: "Structured Streaming cannot do real backpressure, because, such as, Spark cannot tell other applications to slow down the speed of pushing data into Kafka.".
I am not sure this aspect is relevant, as Kafka buffers the data. Nonetheless, the article has good merit imho.

Connecting Spark streaming to streamsets input

I was wondering if it would be possible to provide input to Spark Streaming from StreamSets. I noticed that Spark Streaming is not listed among the StreamSets connector destinations https://streamsets.com/connectors/ .
I am exploring whether there are other ways to connect them for a sample POC.
The best way to process data coming in from StreamSets Data Collector (SDC) in Apache Spark Streaming would be to write the data out to a Kafka topic and read the data from there. This allows you to separate Spark Streaming from SDC, so each can proceed at its own rate of processing.
SDC micro-batches are defined by record count, while Spark Streaming micro-batches are dictated by time. This means that each SDC batch may not (and probably will not) correspond to a Spark Streaming batch (most likely a Spark Streaming batch will contain data from several SDC batches). SDC "commits" each batch once it is sent to the destination, so writing batches directly to Spark Streaming would mean that each SDC batch needs to correspond to a Spark Streaming batch to avoid data loss.
It is also possible that Spark Streaming "re-processes" already committed batches due to processing or node failures. SDC cannot re-process committed batches, so to recover from a situation like this you'd really have to write to something like Kafka that allows you to re-process the batches. So a direct connector that writes from SDC to Spark Streaming would be complex and would likely have data loss issues.
In short, your best option would be SDC -> Kafka -> Spark Streaming.
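For completeness, here is a minimal sketch of the consuming side of that pipeline, assuming SDC writes its records to a Kafka topic (the topic name, brokers, group id and batch interval are placeholders, not anything SDC mandates):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Sketch of SDC -> Kafka -> Spark Streaming, using the
// spark-streaming-kafka-0-10 integration.
val conf = new SparkConf().setAppName("sdc-kafka-consumer")
val ssc = new StreamingContext(conf, Seconds(30)) // placeholder batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",               // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "sdc-consumer-group",                  // placeholder
  "auto.offset.reset" -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("sdc_output_topic"), kafkaParams)
)

// Each record's value is whatever SDC wrote to the topic (e.g. records serialized as JSON)
stream.map(_.value()).foreachRDD { rdd => rdd.take(10).foreach(println) }

ssc.start()
ssc.awaitTermination()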

Stream the most recent data in cassandra with spark streaming

I continuously have data being written to Cassandra from an outside source.
Now I am using Spark Streaming to continuously read this data from Cassandra with the following code:
val ssc = new StreamingContext(sc, Seconds(5))
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
val dstream = new ConstantInputDStream(ssc, cassandraRDD)

dstream.foreachRDD { rdd =>
  println("\n" + rdd.count())
}

ssc.start()
ssc.awaitTermination()
sc.stop()
However, the following line:
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
takes the entire table's data from Cassandra every time, not just the newest data saved into the table.
What I want to do is have Spark Streaming read only the latest data, i.e., the data added after its previous read.
How can I achieve this? I tried to Google it but found very little documentation on it.
I am using Spark 1.4.1, Scala 2.10.4 and Cassandra 2.1.12.
Thanks!
EDIT:
The suggested duplicate question (asked by me) is NOT a duplicate, because it talks about connecting Spark Streaming and Cassandra, while this question is about streaming only the latest data. BTW, streaming from Cassandra IS possible using the code I provided. However, it takes the entire table every time and not just the latest data.
There is some low-level work underway on Cassandra that will allow notifying external systems (an indexer, a Spark stream, etc.) of new mutations coming into Cassandra; read this: https://issues.apache.org/jira/browse/CASSANDRA-8844