We're trying to build a deduplication service using Kafka Streams.
The big picture is that it will use its RocksDB state store to check for existing keys during processing.
Please correct me if I'm wrong, but to make those state stores fault tolerant too, the Kafka Streams API will transparently copy the values in the state store into a Kafka topic (called the changelog).
That way, if our service fails, another instance will be able to rebuild its state store from the changelog found in Kafka.
But this raises a question: is this "StateStore --> changelog" step itself exactly once?
I mean, when the service updates its state store, will it update the changelog in an exactly-once fashion too?
If the service crashes, another one will take over the load, but can we be sure it won't miss a state store update from the crashed service?
Regards,
Yannick
Short answer is yes.
Using transactions (atomic multi-partition writes), Kafka Streams ensures that when an offset commit is performed, the state store has also been flushed to the changelog topic on the brokers. These operations are atomic, so if one of them fails, the application will reprocess messages from the previous offset position.
You can read more about exactly-once semantics in the following blog post: https://www.confluent.io/blog/enabling-exactly-kafka-streams/. See the section "How Kafka Streams Guarantees Exactly-Once Processing".
But this raises a question: is this "StateStore --> changelog" step itself exactly once?
Yes -- as others have already said here. You must of course configure your application to use exactly-once semantics via the configuration parameter processing.guarantee, see https://kafka.apache.org/21/documentation/streams/developer-guide/config-streams.html#processing-guarantee (this link is for Apache Kafka 2.1).
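To make the configuration step concrete, here is a minimal sketch of the relevant settings. It builds them as plain key/value pairs with java.util.Properties so it runs without the Kafka libraries; in a real application the same Properties object would be handed to the Kafka Streams client. The application id and broker address are hypothetical example values. The value "exactly_once" is the one documented for Kafka 2.1; newer Kafka releases add other values, so check the documentation for your version.

```java
import java.util.Properties;

// Sketch only: builds the settings as plain key/value pairs,
// without depending on the Kafka client library.
class ExactlyOnceStreamsConfigSketch {

    // applicationId and bootstrapServers are hypothetical example values from the caller.
    static Properties build(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        // The default is "at_least_once". With "exactly_once", changelog writes,
        // output writes, and offset commits are done inside one transaction.
        props.setProperty("processing.guarantee", "exactly_once");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("dedup-service", "localhost:9092");
        System.out.println(props.getProperty("processing.guarantee"));
    }
}
```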
We're trying to build a deduplication service using Kafka Streams. The big picture is that it will use its RocksDB state store to check for existing keys during processing.
There's also an event de-duplication example application available at https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java. This link points to the repo branch for Confluent Platform 5.1.0, which uses Apache Kafka 2.1.0, the latest version of Kafka available at the time of writing.
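For readers who just want the core idea, the check-then-remember logic can be sketched in plain Java. This is not the code from the linked example and does not use the Kafka Streams API: a HashMap stands in for the RocksDB state store, and the class and method names are made up for illustration. In the real service the store would be a Kafka Streams state store backed by a changelog topic.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// In-memory stand-in for the deduplication step; NOT the Kafka Streams API.
class DedupSketch {
    // Plays the role of the state store: event id -> timestamp when first seen.
    private final Map<String, Long> seen = new HashMap<>();

    // Returns the value if the id is new, or empty if it is a duplicate to drop.
    Optional<String> process(String eventId, String value, long timestampMs) {
        // putIfAbsent returns the previous mapping, or null if the id was unknown.
        if (seen.putIfAbsent(eventId, timestampMs) != null) {
            return Optional.empty();
        }
        return Optional.of(value);
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        System.out.println(dedup.process("id-1", "first", 1L).isPresent());
        System.out.println(dedup.process("id-1", "again", 2L).isPresent());
    }
}
```

With exactly-once enabled, the store update and the forwarded record are committed together, which is why a crash between the two should not leave a key marked as seen whose record was never emitted.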
Related
I would like to write a simple Flink application that reads from a Kafka queue and processes the message and stores the output to an external system, with at least once semantics and without using checkpoints. I would like to avoid checkpoints because if the Kafka offsets are checkpointed, then all intermediate state will have to be checkpointed as well. In other words, I want the application to be as stateless as possible.
The way I envision at least once to work is the following:
1. a source reads from Kafka
2. processing happens
3. the output is stored to the external system
4. the message is acknowledged to Kafka
Note that:
If 2. or 3. fail, and the app restarts, the same message will be processed again (good)
If 2. and 3. succeed, 4. fails and the app restarts, we will have stored the result twice (acceptable)
Based on https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration, the only way to get at least once (or the stronger exactly once) guarantees is by using checkpoints.
It seems that the core of the issue is that 4. needs to communicate back to 1. to ack to Kafka, which cannot happen in standard Flink, but should be possible using stateful functions.
To put it all together, the question is:
Is it possible to achieve at least once semantics using Kafka in Flink without using checkpoints?
The documentation you already linked says:
"Checkpointing disabled: if checkpointing is disabled, the Flink Kafka Consumer relies on the automatic periodic offset committing capability of the internally used Kafka clients. Therefore, to disable or enable offset committing, simply set the enable.auto.commit / auto.commit.interval.ms keys to appropriate values in the provided Properties configuration."
As your goal is to disable checkpointing, you could set
enable.auto.commit=true
auto.commit.interval.ms=??? // use a time high enough such that your steps 2. and 3. are covered.
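As a rough sketch, these two keys could be put into the Properties object that is passed to the Flink Kafka consumer. The helper below only builds that object with java.util.Properties, so it runs without Flink or Kafka on the classpath; the class name, broker address, and group id are hypothetical. The interval is left as a parameter because the right value depends on how long steps 2. and 3. take. Keep in mind that auto-commit is timer based, so this is a best-effort setup: an offset can still be committed before the output of a slow record has been stored.

```java
import java.util.Properties;

// Sketch only: builds the consumer settings as plain key/value pairs.
class AutoCommitConsumerPropsSketch {

    // commitIntervalMs is chosen by the caller; no specific value is implied here.
    static Properties build(String bootstrapServers, String groupId, long commitIntervalMs) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Let the Kafka client commit offsets periodically on its own.
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", Long.toString(commitIntervalMs));
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092", "my-flink-job", 60000L);
        System.out.println(props.getProperty("auto.commit.interval.ms"));
    }
}
```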
The Confluent documentation I was able to find focuses on Kafka Streams applications when it comes to exactly-once/transactions/idempotence.
However, the APIs for transactions were introduced on a "regular" Producer/Consumer level and all the explanations and diagrams focus on them.
I was wondering whether it's OK to use those APIs directly without Kafka Streams.
I do understand the consequences of Kafka processing boundaries and the guarantees, and I'm Ok with violating it. I don't have a need for 100% exactly-once guarantee, it's Ok to have a duplicate once in a while, for example, when I read from/write to external systems.
The problem I'm facing is that I need to create an ETL pipeline for a Big Data project where we are getting a lot of duplicates when the apps are restarted/relocated to different hosts automatically by Kubernetes.
In general, it's not a problem to have some duplicates, it's a pipeline for analytics where duplicates are acceptable, but if the issue can be mitigated at least on the Kafka side, that would be great. Will using the transactional API guarantee exactly-once for Kafka at least (to make sure that re-processing doesn't happen when reassignments/shut-downs/scaling activities are happening)?
Switching to Kafka Streams is not an option because we are quite late in the project.
Exactly-once semantics is achievable with regular producers and consumers as well. Kafka Streams is itself built on top of these clients.
We can use an idempotent producer to achieve this for writes; an atomic read-process-write loop additionally relies on the transactional producer API.
When dealing with external systems, it is important to ensure that we don't produce the same message again and again using producer.send(). Idempotence applies to internal retries by Kafka clients but doesn't take care of duplicate calls to send().
When we produce messages that arrive from a source, we need to ensure that the source doesn't produce a duplicate message. For example, if it is a database, use a WAL, maintain the last read offset for that WAL, and restart from that point. Debezium, for example, does that. You may check to see if it supports your data source.
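A minimal sketch of the producer settings involved, written with plain java.util.Properties so it runs without the Kafka client library. The keys enable.idempotence, acks, and transactional.id are standard producer configuration names; the class name, broker address, and transactional id below are hypothetical. Setting a transactional.id is only needed if you go on to use the transactional API.

```java
import java.util.Properties;

// Sketch only: producer settings as plain key/value pairs.
class IdempotentProducerPropsSketch {

    // Settings for a producer whose internal retries do not create duplicates.
    static Properties idempotent(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("enable.idempotence", "true");
        // Idempotence requires acknowledgement from all in-sync replicas.
        props.setProperty("acks", "all");
        return props;
    }

    // Adds a transactional id, which the transactional producer API requires.
    // The id should stay stable across restarts of the same logical producer.
    static Properties transactional(String bootstrapServers, String transactionalId) {
        Properties props = idempotent(bootstrapServers);
        props.setProperty("transactional.id", transactionalId);
        return props;
    }

    public static void main(String[] args) {
        System.out.println(transactional("localhost:9092", "etl-writer-0"));
    }
}
```

Note that neither setting protects against the application itself calling send() twice for the same source record, which is the point made above.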
I'm using Spring Kafka to interface with my Kafka instance. Assume that I have a single topic with, say, 2+ partitions.
In the instances where, for example, my Spring Kafka-based application crashes (or even rebalances), and then comes back online and there are messages waiting in the topic, I'm currently using a strategy where the latest committed offsets for each partition are stored in an external store, which I then look up on a consumer's assignment to a partition and then seek to that offset to resume processing.
(This is based on a strategy I'd read about in an O'Reilly book.)
Is there a better way of handling this situation in order to implement "exactly once" semantics and not to miss any waiting messages? Or is there a better/more idiomatic way with Spring Kafka to handle this situation?
Thanks in advance.
Is there a reason you don't checkpoint your offsets to Kafka itself?
Generally, your options for "exactly once" processing are:
1. Store your offsets and your side effects together transactionally. This is only possible if your side effects go into a transaction-capable system (say, a database).
2. Use Kafka transactions. This is a simplified variant of 1, as long as your side effects go to the same Kafka cluster you read from.
3. Come up with a scheme that allows you to detect and disregard duplicates downstream of your Kafka pipeline (aka idempotence).
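The first option can be sketched as follows. This is an in-memory illustration only: a synchronized method stands in for a database transaction, and all names are hypothetical. The point is that the side effect and the offset are written in one atomic step, so a record redelivered after a crash or rebalance is recognised and skipped, and the stored offset tells the consumer where to seek on assignment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for a transactional store; a real system would use
// a database transaction instead of a synchronized method.
class OffsetWithResultStoreSketch {
    private final Map<Integer, Long> lastOffsetByPartition = new HashMap<>();
    private final List<String> results = new ArrayList<>();

    // Applies the side effect and records the offset in one atomic step.
    // Returns false, changing nothing, if this offset was already applied.
    synchronized boolean apply(int partition, long offset, String result) {
        long last = lastOffsetByPartition.getOrDefault(partition, -1L);
        if (offset <= last) {
            return false; // redelivered record
        }
        results.add(result);
        lastOffsetByPartition.put(partition, offset);
        return true;
    }

    // Offset to seek to when the partition is assigned (0 if nothing is stored yet).
    synchronized long nextOffset(int partition) {
        return lastOffsetByPartition.getOrDefault(partition, -1L) + 1;
    }

    // Copy of the side effects applied so far.
    synchronized List<String> results() {
        return new ArrayList<>(results);
    }
}
```

On partition assignment the consumer would look up nextOffset(partition) and seek there before polling.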
I have a requirement to send data from an external system to a Kafka topic with exactly once semantics.
The source has an offset, we can consume messages from a given offset.
Looking at the Kafka documentation, I see there are 2 ways to do this.
1. Kafka Source Connector
2. Use a plain Kafka producer with transactions.
It looks like option 1 doesn't support exactly-once semantics now; Kafka JIRA issue 6080 is unresolved. Also, I would like to understand how we can do this directly with the producer APIs.
For option 2, the (consume, transform, produce) loop in all the documents shows committing the consumer's offsets using AddOffsetsToTxn. What is the recommended strategy if the source is not a Kafka topic? It looks like writing the source offset to a different topic as part of the transaction and using it during recovery would work. Is this the recommended way?
I am building a simple application which does the following, in order:
1) Reads messages from a remote IBM MQ (legacy system only works with IBM MQ)
2) Writes these messages to a Kafka topic
3) Reads these messages from the same Kafka topic and calls a REST API.
4) There could be other consumers reading from this topic in the future.
I came to know that Kafka has the new Streams API, which is supposed to be better than the Kafka consumer in terms of speed/simplicity etc. Can someone please let me know if the Streams API is a good fit for my use case, and at what point in my process I can plug it in?
It is true that Kafka Streams API has a simple way to consume records in comparison to Kafka Consumer API (e.g. you don't need to poll, manage a thread and loop), but it also comes with a cost (e.g. local data store - if you do stateful processing).
I would say that if you need to consume records one by one and call a REST API, use the Consumer API; if you need stateful processing, to query the topic state, etc., use the Streams API.
For more info, take a look at this blog post: https://balamaci.ro/kafka-streams-for-stream-processing/
Reads messages from a remote IBM MQ (legacy system only works with IBM MQ)
Writes these messages to Kafka Topic
I'd use Kafka Connect for (1) and (2). It is part of the Kafka project, and there are many free as well as commercial "connectors" available for hundreds of systems.
Reads these messages from the same Kafka Topic and calls a REST API.
You can use Kafka Streams as well as the lower-level Consumer API of Kafka, depending on what you prefer. I'd go with Kafka Streams as it is easier to use and far more powerful. (Both are part of the Kafka project.)
There could be other consumers reading from this topic in future.
This works out-of-the-box -- once data is stored in a Kafka topic according to step 2, many different applications and "consumers" can read this data independently.
It looks like you are not doing any processing/transformation once you consume your message from IBM MQ, or even after your Kafka topic.
First, going from IBM MQ to your Kafka topic is a kind of pipeline.
Second, you are just calling the REST API (I assume without any processing).
Considering these facts, this seems to be a good fit for a simple consumer.
Let's not use a technology only because it's there :)