How Apache Beam manage kinesis checkpointing? - apache-beam

I have a streaming pipeline developed in Apache Beam (using Spark Runner) which reads from kinesis stream.
I am looking out for options in Apache Beam to manage kinesis checkpointing (i.e. stores periodically the current position of kinesis stream) so as it allows the system to recover from failures and continue processing where the stream left off.
Is there a provision available for Apache Beam to support kinesis checkpointing as similar to Spark Streaming (Reference link - https://spark.apache.org/docs/2.2.0/streaming-kinesis-integration.html)?

Since KinesisIO is based on UnboundedSource.CheckpointMark, it uses the standard checkpoint mechanism, provided by Beam UnboundedSource.UnboundedReader.
Once a KinesisRecord has been read (actually, pulled from a records queue that is feed separately by actually fetching the records from Kinesis shard), then the shard checkpoint will be updated by using the record SequenceNumber and then, depending on runner implementation of UnboundedSource and checkpoints processing, will be saved.
Afaik, Beam Spark Runner uses Spark States mechanism for this purposes.

Related

Is it possible to work with Spark Structured Streaming without HDFS?

I'm working with HDFS and Kafka for times, and I note that Kafka is more reliable than HDFS.
So working now with Spark-structured-streaming , I'm suprised that checkpointing is only with HDFS.
Chekckpointing with Kafka would be faster and reliable.
So is it possible to work with spark structured streaming without HDFS ?
It seems strange that we have to use HDFS only for streaming data in Kafka.
Or is it possible to tell Spark to forget the ChekpPointing and managing it in the program as well ?
Spark 2.4.7
Thank you
You are not restricted to use a HDFS path as a checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide the path has to be "an HDFS compatible file system". Therefore, also other file systems will work. However, it is mandatory that all Executors have access to that file system. For example choosing the local file system on the Edge Node in your cluster might be working in local mode, however, in cluster mode this can cause issues.
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.

Apache Flink State Store vs Kafka Streams

As far as I know handles Kafka Streams its States localy in memory or on disc or in a Kafka topic because all the input date is from a partition, where all the messages are keyed by a defined value. Most of the time the computations can be done without knowing the state of other Processors. If so, you have another Streams instance whichs calculsates the result. Like in this picture:
Where exactly does Flink store its States? Can Flink also store the states locally or does it always publish them always to all instances (tasks)? Is it possible to configure Flink so that it stores the States in a Kafka Broker?
Flink also uses local stores (that can be keyed), similar to Kafka Streams. However, it does not write state into Kafka topics.
For fault-tolerance, it takes so-called "distributed snapshots", that are stored in a configurable state backend (eg, HDFS).
Check out the docs for more details:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/checkpointing.html
https://ci.apache.org/projects/flink/flink-docs-stable/internals/stream_checkpointing.html
https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/state_backends.html
There is a distinction between Flink and Kafka Streams. Flink is cluster framework, your code is deployed and runs as job in Flink Cluster. Kafka streams is API that you embed in your standard java application. Stream processing logic runs inside the your application java process. They both can sink results to Kafka, key value store, database or external systems. Flinkā€™s master node implements its own high availability mechanism based on ZooKeeper and ensures the availability interim states after the disaster. If you are using Kafka Streams once you managed to save your interim states to Kafka Cluster you will have the same HA features provided by Kafka Cluster.

Apache Beam over Apache Kafka Stream processing

What are the differences between Apache Beam and Apache Kafka with respect to Stream processing?
I am trying to grasp the technical and programmatic differences as well.
Please help me understand by reporting from your experience.
Beam is an API that uses an underlying stream processing engine like Flink, Storm, etc... in one unified way.
Kafka is mainly an integration platform that offers a messaging system based on topics that standalone applications use to communicate with each other.
On top of this messaging system (and the Producer/Consummer API), Kafka offers an API to perform stream processing using messages as data and topics as input or output. Kafka Stream processing applications are standalone Java applications and act as regular Kafka Consummer and Producer (this is important to understand how these applications are managed and how workload is shared among stream processing application instances).
Shortly said, Kafka Stream processing applications are standalone Java applications that run outside the Kafka Cluster, feed from the Kafka Cluster and export results to the Kafka Cluster. With other stream processing platforms, stream processing applications run inside the cluster engine (and are managed by this engine), feed from somewhere else and export results to somewhere else.
One big difference between Kafka and Beam Stream API is that Beam makes the difference between bounded and unbounded data inside the data stream whereas Kafka does not make that difference. Thereby, handling bounded data with Kafka API has to be done manually using timed/sessionized windows to gather data.
Beam is a programming API but not a system or library you can use. There are multiple Beam runners available that implement the Beam API.
Kafka is a stream processing platform and ships with Kafka Streams (aka Streams API), a Java stream processing library that is build to read data from Kafka topics and write results back to Kafka topics.

Fault Tolerance of FlinkKafkaConsumer in HiBench

I am running some experiments to test the fault tolerance capabilities of Apache Flink. I am currently using the HiBench framework with the WordCount micro benchmark implemented for Flink.
I noticed that if I kill a TaskManager during an execution, the state of the Flink operators is recovered after the automatic "redeploy" but many (all?) tuples sent from the benchmark to Kafka are missed (stored in Kafka but not received in Flink).
It seems that after the recovery, the FlinkKafkaConsumer (the benchmark uses FlinkKafkaConsumer08) in place of start reading from the last offset read before the failure start reading from the latest available one (losing all the event sent during the failure).
Any suggestion?
Thanks!
The problem was with the HiBench framework itself and with the latest version of Flink.
I had to update the version of Flink in the benchmark in order to use the "setStartFromGroupOffsets()" method in the Kafka consumer.

Spark Streaming with Nifi

I am looking for way where I can make use of spark streaming in Nifi. I see couple of posts where SiteToSite tcp connection is used for spark streaming application, but I think it will be good if I can launch Spark streaming from Nifi custom processor.
PublishKafka will publish message into Kafka followed by Nifi Spark streaming processor will read from Kafka Topic.
I can launch Spark streaming application from custom Nifi processor using Spark Streaming launcher API, but the biggest challenge is that it will create spark streaming context for each flow file, which can be costly operation.
Does anyone suggest storing spark streaming context in controller service ? or any better approach for running spark streaming application with Nifi ?
You can make use of ExecuteSparkInteractive to write your spark code which you are trying to include in your spark streaming application.
Here you need few things setup for spark code to run from within Nifi -
Setup Livy server
Add Nifi controllers to start spark Livy sessions.
LivySessionController
StandardSSLContextService (may be required)
Once you enable LivySessionController within Nifi, it will start spark sessions and you can check on spark UI if those livy sessions are up and running.
Now as we have Livy spark sessions running, so whenever flow file move through Nifi flow, it will run spark code within ExecuteSparkInteractive
This will be similar to Spark streaming application running outside Nifi. For me this approach is working very well and easy to maintain compare to having separate spark streaming application.
Hope this will help !!
I can launch Spark streaming application from custom Nifi processor using Spark Streaming launcher API, but the biggest challenge is that it will create spark streaming context for each flow file, which can be costly operation.
You'd be launching a standalone application in each case, which is not what you want. If you are going to integrate with Spark Streaming or Flink, you should be using something like Kafka to pub-sub between them.