Terminate a Flink job when using a Kafka Source - apache-kafka

When my producer has finished streaming all of its messages to Kafka, and after Flink has finished processing them, I want to be able to terminate the Flink job so it doesn't keep running, and also so I can know when Flink has finished processing all the data. I also cannot use batch processing as I need Flink to run in parallel to my Kafka stream.
Usually, Flink uses the isEndOfStream method in a DeserializationSchema class to see if it should end early (returning true in the method would automatically end the job). However, when using Kafka as a source with Flink, the new KafkaSource class has deprecated the use of the isEndOfStream method in deserializers and no longer checks it to see if the stream should end or not. Is there any other way to terminate a Flink job early?

The mechanism provided by the KafkaSource for operating on bounded streams is to use setBounded or setUnbounded with the builder, as in
KafkaSource<String> source = KafkaSource
    .<String>builder()
    .setBootstrapServers(...)
    .setGroupId(...)
    .setTopics(...)
    .setDeserializer(...) // or setValueOnlyDeserializer
    .setStartingOffsets(...)
    .setBounded(...) // or setUnbounded
    .build();
setBounded indicates that the source should be stopped once it has consumed all of the data up through the specified offsets.
setUnbounded can be used instead to indicate that while the source should not read any data past the specified offsets, it should remain running. This allows the source to participate in checkpointing if running in STREAMING mode.
If you know upfront how much you want to read, this works fine. I've used setBounded with a specific timestamp, e.g.,
.setBounded(
    OffsetsInitializer.timestamp(
        Instant.parse("2021-10-31T23:59:59.999Z").toEpochMilli()))
and also like this
.setBounded(OffsetsInitializer.latest())
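For reference, here's a minimal, untested sketch of how a bounded source can be wired into a streaming job so that execute() returns once the stopping offsets have been consumed and the pipeline drains. The bootstrap servers, group id, and topic below are placeholders, not values from the question.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")           // placeholder
                .setGroupId("my-group")                          // placeholder
                .setTopics("my-topic")                           // placeholder
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setBounded(OffsetsInitializer.latest())         // stop at the offsets that are latest at startup
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .print();

        // execute() returns once the bounded source is exhausted.
        env.execute("bounded-kafka-job");
    }
}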

Related

Is it possible to achieve at-least-once semantics using Kafka in Flink without using checkpoints?

I would like to write a simple Flink application that reads from a Kafka queue, processes the messages, and stores the output to an external system, with at-least-once semantics and without using checkpoints. I would like to avoid checkpoints because if the Kafka offsets are checkpointed, then all intermediate state will have to be checkpointed as well. In other words, I want the application to be as stateless as possible.
The way I envision at-least-once to work is the following:
1. a source reads from Kafka
2. processing happens
3. the output is stored to the external system
4. the message is acknowledged to Kafka
Note that:
If 2. or 3. fail, and the app restarts, the same message will be processed again (good)
If 2. and 3. succeed, 4. fails and the app restarts, we will have stored the result twice (acceptable)
Based on https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration, the only way to get at least once (or the stronger exactly once) guarantees is by using checkpoints.
It seems that the core of the issue is that 4. needs to communicate back to 1. to ack to Kafka, which cannot happen in standard Flink, but should be possible using stateful functions.
To put it all together, the question is:
Is it possible to achieve at-least-once semantics using Kafka in Flink without using checkpoints?
The documentation you already linked says:
"Checkpointing disabled: if checkpointing is disabled, the Flink Kafka Consumer relies on the automatic periodic offset committing capability of the internally used Kafka clients. Therefore, to disable or enable offset committing, simply set the enable.auto.commit / auto.commit.interval.ms keys to appropriate values in the provided Properties configuration."
As your goal is to disable checkpointing, you could set
enable.auto.commit=true
auto.commit.interval.ms=??? // use a time high enough such that your steps 2. and 3. are covered.
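For completeness, here's a hedged, untested sketch of how those properties could be passed to the legacy FlinkKafkaConsumer described in that documentation; the topic, servers, group id, class name, and interval value are placeholders, not values from the question.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class AutoCommitConsumerFactory {
    public static FlinkKafkaConsumer<String> create() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "my-group");                // placeholder
        props.setProperty("enable.auto.commit", "true");
        // pick an interval long enough that steps 2. and 3. normally finish before a commit
        props.setProperty("auto.commit.interval.ms", "60000");    // placeholder value
        return new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props);
    }
}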

End-to-end Exactly-once processing in Apache Flink

Apache Flink guarantees exactly-once processing upon failure and recovery by resuming the job from a checkpoint, the checkpoint being a consistent snapshot of the distributed data stream and operator state (Chandy-Lamport algorithm for distributed snapshots). This guarantees exactly-once upon failover.
During normal cluster operation, how does Flink guarantee exactly-once processing? For instance, given a Flink source that reads from an external source (say Kafka), how does Flink guarantee that each event is read exactly once from the source? Is there any kind of application-level acking between the event source and the Flink source? Also, how does Flink guarantee that events are propagated exactly once from upstream operators to downstream operators? Does that require any kind of acking for received events as well?
Flink does not guarantee that every event is read once from the sources. Instead, it guarantees that every event affects the managed state exactly once.
Checkpoints include the source offsets, and during a checkpoint restore, the sources are rewound and some events may be replayed. That's fine, because the checkpoint included the state throughout the job that had resulted from reading everything up to the offsets that were stored in the checkpoint, and nothing beyond those offsets.
Thus Flink's exactly-once guarantee requires replayable sources. Exactly-once messaging between operators relies on TCP.
Guaranteeing that the sinks don't receive duplicated results further requires transactional sinks. Flink commits transactions as part of checkpointing.
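To illustrate those last two points, here's a rough, untested sketch of enabling checkpointing and configuring a transactional Kafka sink with the KafkaSink API from flink-connector-kafka; the servers, topic, class name, transactional-id prefix, and checkpoint interval are all assumptions, not from the original answer.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSinkSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Kafka transactions are committed when checkpoints complete.
        env.enableCheckpointing(60_000);

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")                    // placeholder
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")                         // placeholder
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("my-exactly-once-job")          // placeholder
                .build();

        // ... attach the sink to a stream with stream.sinkTo(sink) and call env.execute()
    }
}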

Stream CDC change with Kafka and Spark still processes it in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. In my proposed architecture, PostgreSQL provides the data for Kafka. The PostgreSQL database is not empty, and I want to catch every CDC change in it. In the end, I want to grab the Kafka messages and process them as a stream with Spark, so I can get analysis of what is happening at the same time the CDC events happen.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of data for this case is an API we want to monitor, and there are few examples of database-to-database streaming processing. I have done the process before from Kafka to another database, but I need to transform and aggregate the data (I'm not using Confluent and rely on generic Kafka + Debezium + JDBC connectors).
For my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Spark Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I am not sure what the issue is.
If you want to process per record, then use the Spring Boot Kafka setup for consuming Kafka messages; that can work in various ways and fulfill your need (see the sketch after this answer). Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
https://medium.com/#contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b This article discusses the single message processing approach.
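As an illustration of the per-record option mentioned above, here's a minimal, untested Spring for Apache Kafka sketch; the topic, group id, and class name are placeholders, not from the original answer.

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class CdcRecordConsumer {

    // Invoked once per Kafka record, so each CDC change is processed individually.
    @KafkaListener(topics = "cdc-topic", groupId = "cdc-consumer")   // placeholders
    public void onRecord(String value) {
        // transform / aggregate the single record here
        System.out.println("received: " + value);
    }
}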

Apache NiFi & Kafka Integration

I am not sure whether this question has already been addressed somewhere, but I couldn't find a helpful answer anywhere on the internet.
I am trying to integrate Apache NiFi with Kafka - consuming data from Kafka using Apache NiFi. Below are a few questions that come to my mind before proceeding with this.
Q-1) The use case that we have is - read data from Kafka in real time, parse the data, do some basic validations on the data, and later push the data to HBase. I know Apache NiFi is the right candidate for doing this kind of processing, but how easy is it to build the workflow if the JSON that we are processing is a complex one? We were initially thinking of doing the same using Java code, but later realised this can be done with minimum effort in NiFi. Please note, 80% of the data that we are processing from Kafka would be simple JSONs, but 20% would be complex ones (involving arrays).
Q-2) The trickiest part while writing a Kafka consumer is handling the offsets properly. How will Apache NiFi handle offsets while consuming from Kafka topics? How would offsets be properly committed in case rebalancing is triggered while processing? Frameworks like Spring-Kafka provide options to commit the offsets (to some extent) in case a rebalance is triggered in the middle of processing. How does NiFi handle this?
I have deployed a number of pipelines in a 3-node NiFi cluster in production, one of which is similar to your use case.
Q-1) It's very simple and easy to build a pipeline for your use case. Since you didn't mention the types of tasks involved in processing a JSON, I'm assuming generic tasks. Generic tasks involving JSONs include schema validation, which can be achieved using the ValidateRecord processor; transformation using the JoltTransformRecord processor; extraction of attribute values using EvaluateJsonPath; conversion of JSON to some other format, say Avro, using the ConvertJSONToAvro processor; etc.
NiFi gives you the flexibility to scale each stage/processor in the pipeline independently. For example, if transformation using JoltTransformRecord is time consuming, you can scale it to run N concurrent tasks on each node by configuring Concurrent Tasks under the Scheduling tab.
Q-2) As far as the ConsumeKafka_2_0 processor is concerned, offset management is handled by committing the NiFi processor session first and then the Kafka offsets, which means we have an at-least-once guarantee by default.
When Kafka triggers rebalancing of consumers for a given partition, the processor quickly commits (processor session and Kafka offsets) whatever it has got and returns the consumer to the pool for reuse.
ConsumeKafka_2_0 handles committing offsets when members of the consumer group change or the subscription of the members changes. This can occur when processes die, new process instances are added, or old instances come back to life after failure. Cases where the number of partitions of the subscribed topic is administratively adjusted are also taken care of.

Gracefully shut down Flink Kafka Consumer at run time

I am using FlinkKafkaConsumer010 with Flink 1.2.0, and the problem I am facing is: is there a way to shut down the entire pipeline programmatically if some scenario is seen?
One possible solution is to shut down the Kafka consumer source by calling the close() method defined inside FlinkKafkaConsumer010, so that the pipeline shuts down as well. For this approach, I create a list that contains references to all FlinkKafkaConsumer010 instances that I created at the beginning of the pipeline for the Kafka topics. Then, during the execution of the pipeline, another thread calls close() on each of the FlinkKafkaConsumer010 instances in my list. I expected this to shut down the consumers, but the result is that the consumers are still running.
Can someone shed some light on this or give me another suggestion on how I can shut down the Flink pipeline programmatically at runtime?
Is the scenario that you're trying to respond to based on the input events? If so, I would suggest placing a MapFunction somewhere appropriate in the pipeline and deliberately throwing an exception to fail the job when some condition is met (see the sketch below).
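A minimal, untested sketch of that idea, with a hypothetical sentinel value standing in for your real condition:

import org.apache.flink.api.common.functions.MapFunction;

public class ShutdownOnCondition implements MapFunction<String, String> {
    @Override
    public String map(String value) throws Exception {
        if ("SHUTDOWN".equals(value)) {   // hypothetical sentinel event
            throw new RuntimeException("Shutdown condition met, failing the job on purpose");
        }
        return value;
    }
}

Note that if a restart strategy is configured, a thrown exception causes the job to restart rather than terminate, so you may need to disable restarts for this approach to actually stop the job.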
The other alternative is to look at the isEndOfStream method in KeyedDeserializationSchema. Basically, when the condition is met for some event, signal that the stream has ended.
One other option to consider is to make the MapFunction mentioned above a FlatMapFunction instead, which sends a signaling event to the outside world. A separate process external to the Flink job listens for that event and, when it is received, shuts down the Flink job via the Flink CLI.