How to purge the checkpoint directory in Spark Structured Streaming

I am new to Spark Structured Streaming. I have set up a streaming pipeline from Kafka and also set up the checkpoint directory.
The offsets are now being written to the checkpoint directory and everything works fine.
However, I can see that the checkpoint directory keeps growing. Is there any property in Spark to purge it automatically? If not, what is a better way to do this?
I tried the property spark.cleaner.referenceTracking.cleanCheckpoints = true, but it does not seem to help.
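For context, a minimal sketch of the kind of setup described above; the broker address, topic, output path and checkpoint path are made-up placeholders, not taken from the question:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-structured-streaming")
  // The property mentioned in the question; as far as I know it only cleans up
  // RDD checkpoint files, which is likely why it has no visible effect here.
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  .getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()

events.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/output/")
  // Offsets, commits and state are stored under this directory, so it grows over time.
  .option("checkpointLocation", "hdfs:///data/checkpoints/query1/")
  .start()
  .awaitTermination()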

Related

Dataproc Spark checkpoint best practices? What should I set the checkpoint dir to?

I am running a very long-running batch job that generates a lot of OOM exceptions. To minimize this problem I added checkpoint() calls.
Where should I set the checkpoint dir to? The location has to be accessible to all the executors. Currently, I am using a bucket. Based on the log files I can see that my code has progressed past several of the checkpoint() calls, yet the bucket is empty:
sparkContext.setCheckpointDir("gs://myBucket/checkpointDir/")
Based on CPU utilization and log messages, it looks like my job is still running and making progress. Any idea where Spark is writing the checkpoint data?
2022-01-22 18:38:06 WARN DAGScheduler:69 - Broadcasting large task binary with size 4.9 MiB
2022-01-22 18:47:23 WARN BlockManagerMasterEndpoint:69 - No more replicas available for broadcast_50_piece0 !
2022-01-22 18:47:23 WARN BlockManagerMaster:90 - Failed to remove broadcast 50 with removeFromMaster = true - org.apache.spark.SparkException: Could not find BlockManagerEndpoint1.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)
kind regards
Andy
Did you manually trigger the checkpoint in your code? If not, it won't be triggered automatically. See https://programmer.help/blogs/spark_-correct-use-of-checkpoint-in-spark-and-its-difference-from-cache.html. Checkpointing is generally not a way to solve OOM problems in Spark.
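To illustrate that point, a minimal RDD-based sketch (assuming the job checkpoints RDDs; the input path and transformations are invented):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("long-batch-job").getOrCreate()
val sc = spark.sparkContext

// The bucket path from the question; it must be reachable from every executor.
sc.setCheckpointDir("gs://myBucket/checkpointDir/")

// Hypothetical input and transformations, just to have something to checkpoint.
val parsed = sc.textFile("gs://myBucket/input/")
  .map(_.split(","))
  .filter(_.nonEmpty)

parsed.checkpoint()         // only marks the RDD for checkpointing
val count = parsed.count()  // an action is what actually writes files into the bucket
println(s"processed $count records")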

Is it possible to work with Spark Structured Streaming without HDFS?

I have been working with HDFS and Kafka for some time, and I have noticed that Kafka is more reliable than HDFS.
So, now working with Spark Structured Streaming, I am surprised that checkpointing only works with HDFS.
Checkpointing with Kafka would be faster and more reliable.
So is it possible to work with Spark Structured Streaming without HDFS?
It seems strange that we have to use HDFS just to stream data from Kafka.
Or is it possible to tell Spark to forget about checkpointing and manage it in the program instead?
Spark 2.4.7
Thank you
You are not restricted to using an HDFS path as the checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be on "an HDFS compatible file system", so other file systems will work as well. However, it is mandatory that all executors have access to that file system. For example, choosing the local file system on the edge node of your cluster might work in local mode, but in cluster mode this can cause issues.
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer to How to manually set group.id and commit kafka offsets in spark structured streaming?.
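As an illustration, a small sketch with a non-HDFS checkpoint location; the rate source, console sink and the s3a bucket path are placeholders (any HDFS-compatible file system reachable by all executors would do):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("no-hdfs-checkpointing").getOrCreate()

// A built-in test source, just to have a stream to checkpoint.
val df = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

df.writeStream
  .format("console")
  // An object store path instead of HDFS; the bucket name is hypothetical.
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/rate-query/")
  .start()
  .awaitTermination()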

Azure Event Hubs Streaming: Does Checkpointing override setStartingPosition?

If we specify the starting position in EventHub conf like so:
EventHubsConf(ConnectionStringBuilder(eventHubConnectionString).build)
.setStartingPosition(EventPosition.fromStartOfStream)
or
.setStartingPosition(EventPosition.fromEndOfStream)
and also specify the checkpoint location in the DataStreamWriter:
streamingInputDF
.writeStream
.option("checkpointLocation", checkpointLocation)
...
After a restart, does the setStartingPosition become irrelevant because the checkpoint is always used as the point from where to begin reading?
Thanks.
The information on offsets stored in the checkpoint files will be used when restarting the streaming query.
Interestingly, this is not specifically mentioned in the Structured Streaming Event Hubs integration guide; however, it is in the DStreams guide:
"The connector fully integrates with the Structured Streaming checkpointing mechanism. You can recover the progress and state of your query on failures by setting a checkpoint location in your query. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."
Make sure to follow the general guidance on checkpoint recovery.
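Putting the two snippets from the question together, a rough sketch (the connection string and checkpoint path are placeholders; on the very first run the starting position applies, while on a restart the offsets found in the checkpoint take precedence):
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("eventhubs-stream").getOrCreate()

// Placeholder connection string and checkpoint path.
val eventHubConnectionString = "Endpoint=sb://mynamespace.servicebus.windows.net/;EntityPath=my-hub"
val checkpointLocation = "abfss://checkpoints@myaccount.dfs.core.windows.net/eventhubs-query/"

val ehConf = EventHubsConf(ConnectionStringBuilder(eventHubConnectionString).build)
  // Only honoured when no checkpoint exists yet for this query.
  .setStartingPosition(EventPosition.fromStartOfStream)

val streamingInputDF = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

streamingInputDF.writeStream
  .format("console")
  .option("checkpointLocation", checkpointLocation)
  .start()
  .awaitTermination()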

How does Apache Beam manage Kinesis checkpointing?

I have a streaming pipeline developed in Apache Beam (using the Spark runner) which reads from a Kinesis stream.
I am looking for options in Apache Beam to manage Kinesis checkpointing (i.e. periodically storing the current position in the Kinesis stream) so that the system can recover from failures and continue processing where the stream left off.
Is there a provision in Apache Beam to support Kinesis checkpointing, similar to Spark Streaming (reference link: https://spark.apache.org/docs/2.2.0/streaming-kinesis-integration.html)?
Since KinesisIO is based on UnboundedSource.CheckpointMark, it uses the standard checkpoint mechanism provided by Beam's UnboundedSource.UnboundedReader.
Once a KinesisRecord has been read (actually, pulled from a records queue that is fed separately by fetching the records from the Kinesis shard), the shard checkpoint is updated using the record's SequenceNumber and then, depending on the runner's implementation of UnboundedSource and checkpoint processing, it will be saved.
AFAIK, the Beam Spark runner uses Spark's state mechanism for this purpose.

Fault tolerance in Spark streaming

I am trying to understand fault tolerance here. Say I have files 1 to 10 in HDFS and Spark Streaming has read these files. Now my Spark Streaming job has unfortunately stopped. I now have files 1 to 20 in HDFS, where files 1 to 10 were already parsed by Spark Streaming and files 11 to 20 were newly added. When I start Spark Streaming again, I can see files 1-30. Since I started Spark at the time of the 21st file in HDFS, my Spark Streaming will lose files 11-20. How do I get the lost files back?
I use fileStream.
The behaviour of fileStream in Spark Streaming is to monitor a folder and pick up new files there, so it will only pick up files that are new after the process has started. In order to process files 11-20, you might have to rename them after the process has started.
A better way to handle this scenario is to use a messaging queue like Kafka, where you can continue processing the stream from any point you like:
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
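As a rough sketch of what that looks like with the Kafka 0.10 direct stream API (broker address, topic and consumer group are invented), committing offsets back to Kafka after each batch:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-offsets"), Seconds(2))

// Placeholder broker, topic and consumer group.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "file-ingest",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
)

stream.foreachRDD { rdd =>
  // Capture the offset ranges before any transformation that loses partition info.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch here ...

  // Commit offsets back to Kafka only after the batch has been processed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()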
Spark Streaming also provides an option for checkpointing.
If it is enabled, the process saves a checkpoint before the start of every batch (in the specified folder). Then, if the Spark Streaming process crashes for some reason, it can be restarted from the last checkpoint.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def createContext(folderName):
    sc = SparkContext(appName='SparkApplication')
    ssc = StreamingContext(sc, 2)  # 2 second batch interval
    ## Your stream configuration here
    ssc.checkpoint(folderName)
    return ssc

# Recreate the context from the checkpoint directory if one exists,
# otherwise build a fresh context that checkpoints into the same directory.
ssc = StreamingContext.getOrCreate('/path/to/checkpoint/directory',
                                   lambda: createContext('/path/to/checkpoint/directory'))
ssc.start()
ssc.awaitTermination()