Azure Event Hubs Streaming: Does Checkpointing override setStartingPosition? - spark-structured-streaming

If we specify the starting position in the EventHubsConf like so:
EventHubsConf(ConnectionStringBuilder(eventHubConnectionString).build)
.setStartingPosition(EventPosition.fromStartOfStream)
or
.setStartingPosition(EventPosition.fromEndOfStream)
And also specify the checkpoint location in the DataStreamWriter:
streamingInputDF
.writeStream
.option("checkpointLocation", checkpointLocation)
...
After a restart, does setStartingPosition become irrelevant because the checkpoint is always used as the point from which to begin reading?
Thanks.

The information on offsets stored in the checkpoint files will be used when restarting the streaming query.
Interestingly, this is not specifically mentioned in the Structured Streaming Event Hubs integration guide; however, the DStreams guide states it explicitly:
"The connector fully integrates with the Structured Streaming checkpointing mechanism. You can recover the progress and state of you query on failures by setting a checkpoint location in your query. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."
Make sure to follow the general guidance on checkpoint recovery.
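To make this concrete, here is a minimal sketch combining both settings (connection string, event hub name, and checkpoint path are placeholders; the azure-event-hubs-spark connector is assumed). The starting position only takes effect on the very first run, when the checkpoint location is still empty; on every restart the offsets recovered from the checkpoint take precedence.

import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("eventhubs-checkpoint-demo").getOrCreate()

// Placeholder connection string and event hub name
val connectionString = ConnectionStringBuilder("Endpoint=sb://<namespace>...;...")
  .setEventHubName("my-event-hub")
  .build

val ehConf = EventHubsConf(connectionString)
  // Only honoured when the checkpoint directory is empty (first start)
  .setStartingPosition(EventPosition.fromStartOfStream)

val streamingInputDF = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()

streamingInputDF.writeStream
  .format("console")
  // On restart, offsets are recovered from here and setStartingPosition is ignored
  .option("checkpointLocation", "/tmp/checkpoints/eventhubs-demo")
  .start()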

Related

Is it possible to work with Spark Structured Streaming without HDFS?

I've been working with HDFS and Kafka for some time, and I've noticed that Kafka is more reliable than HDFS.
So now, working with Spark Structured Streaming, I'm surprised that checkpointing only works with HDFS.
Checkpointing with Kafka would be faster and more reliable.
So is it possible to work with Spark Structured Streaming without HDFS?
It seems strange that we have to use HDFS just to stream data from Kafka.
Or is it possible to tell Spark to forget about checkpointing and manage it in the program instead?
Spark 2.4.7
Thank you
You are not restricted to using an HDFS path as the checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be in "an HDFS compatible file system", so other file systems will work as well. However, it is mandatory that all executors have access to that file system. For example, choosing the local file system of an edge node in your cluster might work in local mode, but in cluster mode it can cause issues.
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.
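As a minimal sketch (bucket, topic, and broker addresses are placeholders), any HDFS-compatible URI that every executor can reach works as the checkpoint location, for example an s3a:// path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpoint-without-hdfs").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  // Any HDFS-compatible file system that all executors can access works here,
  // e.g. s3a://, abfss://, or a shared mount exposed via file://
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/events-query")
  .option("path", "s3a://my-bucket/output/events")
  .start()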

How to purge the checkpoint directory in spark structured streaming

I am new to Spark Structured Streaming. I have set up a streaming pipeline from Kafka and also set up the checkpoint directory.
Now the offsets are being written to the checkpoint directory and it works fine.
However, I can see that the checkpoint directory keeps growing. Is there any property in Spark to purge it automatically? If not, what is a better way to do this?
I tried the property spark.cleaner.referenceTracking.cleanCheckpoints = true, but it does not seem to help.
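For reference, here is a sketch of how that property would typically be set (values are illustrative). Note that spark.cleaner.referenceTracking.cleanCheckpoints governs RDD checkpoint files created via sparkContext.setCheckpointDir and cleaned by the ContextCleaner; it does not prune a Structured Streaming checkpointLocation, which is most likely why it appears to have no effect here.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("checkpoint-cleanup-attempt")
  // Cleans RDD checkpoint files once the corresponding RDD is garbage collected;
  // it does not touch the metadata under a streaming checkpointLocation.
  .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
  // For streaming metadata, spark.sql.streaming.minBatchesToRetain controls the
  // minimum number of batches of offset/commit logs retained in the checkpoint
  // directory (older entries are cleaned up by Spark itself).
  .config("spark.sql.streaming.minBatchesToRetain", "100")
  .getOrCreate()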

How does Apache Beam manage Kinesis checkpointing?

I have a streaming pipeline developed in Apache Beam (using the Spark Runner) which reads from a Kinesis stream.
I am looking for options in Apache Beam to manage Kinesis checkpointing (i.e. periodically store the current position in the Kinesis stream) so that the system can recover from failures and continue processing where the stream left off.
Is there a provision in Apache Beam to support Kinesis checkpointing, similar to Spark Streaming (reference link - https://spark.apache.org/docs/2.2.0/streaming-kinesis-integration.html)?
Since KinesisIO is based on UnboundedSource.CheckpointMark, it uses the standard checkpoint mechanism provided by Beam's UnboundedSource.UnboundedReader.
Once a KinesisRecord has been read (actually, pulled from a records queue that is fed separately by fetching the records from the Kinesis shard), the shard checkpoint is updated using the record's SequenceNumber and then, depending on the runner's implementation of UnboundedSource and checkpoint processing, will be saved.
AFAIK, the Beam Spark Runner uses Spark's state mechanism for this purpose.

Will Flink resume from the last offset after executing yarn application -kill and running again?

I use FlinkKafkaConsumer to consume from Kafka and have checkpointing enabled. Now I'm a little confused about the offset management and the checkpoint mechanism.
I already know that Flink will by default start reading partitions from the consumer group's committed offsets:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-start-position-configuration
and that the offsets will be stored in checkpoints on a remote file system:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-and-fault-tolerance
What happens if I stop the application by executing yarn application -kill appid
and then run the start command like ./bin/flink run ...?
Will Flink get the offsets from the checkpoint or from the group.id managed by Kafka?
If you run the job again without restoring from a savepoint ($ bin/flink run -s :savepointPath [:runArgs]), Flink will try to get the offsets of your consumer group from Kafka (in older versions from ZooKeeper). But you will lose all other state of your Flink job (which might be acceptable if your Flink job is stateless).
I must admit that this behaviour is quite confusing. By default, starting a job without a savepoint is like starting from zero. As far as I know, only the implementation of the Kafka source deviates from that behaviour. If you want to change it, you can configure the start position on the FlinkKafkaConsumer[08/09/10] explicitly instead of relying on the default setStartFromGroupOffsets. This is described here: Kafka Consumers Start Position Configuration
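For illustration, a minimal sketch (topic, group id, and broker address are placeholders) of where the start-position configuration lives. This uses the universal FlinkKafkaConsumer, but the version-specific consumers expose the same setStartFrom* methods, and all of them are only consulted when the job is not restored from a checkpoint or savepoint:

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(60000) // checkpoint every 60 seconds

val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")
props.setProperty("group.id", "my-consumer-group")

val consumer = new FlinkKafkaConsumer[String]("my-topic", new SimpleStringSchema(), props)
// Default: start from the offsets committed for the consumer group in Kafka
consumer.setStartFromGroupOffsets()
// Alternatives: consumer.setStartFromEarliest() or consumer.setStartFromLatest()

env.addSource(consumer).print()
env.execute("kafka-start-position-demo")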
It might be worth having a closer look at the Flink documentation: What is a savepoint and how does it differ from checkpoints?
In a nutshell
Checkpoints:
The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. A Checkpoint’s lifecycle is managed by Flink
Savepoints:
Savepoints are created, owned, and deleted by the user. Their use-case is for planned, manual backup and resume
There are currently ongoing discussions on how to "unify" savepoints and checkpoints. You can find a lot of the technical details here: FLIP-47: Checkpoints vs Savepoints

Structured Streaming - Could not use FileContext API for managing metadata log files on AWS S3

I have a StreamingQuery in Spark (v2.2.0), i.e.,
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test")
.load()
val query = df
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("parquet")
.option("checkpointLocation", "s3n://bucket/checkpoint/test")
.option("path", "s3n://bucket/test")
.start()
When I run the query, data does get saved to AWS S3 and checkpoints are created at s3n://bucket/checkpoint/test. But I am also receiving the following WARNING in the logs:
WARN [o.a.s.s.e.streaming.OffsetSeqLog] Could not use FileContext API for managing metadata log files at path s3n://bucket/checpoint/test/offsets. Using FileSystem API instead for managing log files. The log may be inconsistent under failures.
I am not able to understand why this WARNING appears. Also, will my checkpoints be inconsistent in case of a failure?
Can anyone help me resolve it?
Looking at the source code, this error comes from the HDFSMetadataLog class. A comment in the code states that:
Note: [[HDFSMetadataLog]] doesn't support S3-like file systems as they don't guarantee listing files in a directory always shows the latest files.
So the problem is due to using AWS S3, which forces you to use the FileSystemManager API. Checking the comment for that class, we see that:
Implementation of FileManager using older FileSystem API. Note that this implementation cannot provide atomic renaming of paths, hence can lead to consistency issues. This should be used only as a backup option, when FileContextManager cannot be used.
Hence, some issues can come up when multiple writers want to perform rename operations concurrently. There is a related ticket here; however, it has been closed since the issue cannot be fixed in Spark itself.
Some things to consider if you need to checkpoint on S3:
To avoid the warning and potential trouble, checkpoint to HDFS and then copy over the results.
Checkpoint to S3, but have a long gap between checkpoints.
Nobody should be using S3n as the connector. It is obsolete and was removed in Hadoop 3. If you have the Hadoop 2.7.x JARs on the classpath, use s3a.
The issue with rename() is not just consistency: the bigger the file, the longer it takes.
Really, checkpointing to object stores needs to be done differently. If you look closely, there is no rename() on an object store, yet so much existing code expects it to be an O(1) atomic operation.
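Following that advice, a minimal sketch of the same query from the question, reusing the df defined above, with the checkpoint and output simply switched from s3n:// to s3a:// (the bucket name stays a placeholder; hadoop-aws and a matching AWS SDK JAR are assumed to be on the classpath):

val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  // s3a is the maintained S3 connector; s3n is obsolete and removed in Hadoop 3
  .option("checkpointLocation", "s3a://bucket/checkpoint/test")
  .option("path", "s3a://bucket/test")
  .start()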