Confused about Kafka exactly-once semantics - apache-kafka

So I've been reading about Kafka's exactly-once semantics, and I'm a bit confused about how it works.
I understand how the producer avoids sending duplicate messages (in case the ack from the broker fails), but what I don't understand is how exactly-once works when the consumer processes a message but then crashes before committing the offset. Won't Kafka retry in that scenario?

Here's what I think you mean:

1. Consumer X sees record Y and "acts" on it, yet does not commit its offset.
2. Consumer X crashes (still without committing its offsets).
3. Consumer X boots back up, is re-assigned the same partition (not guaranteed), and eventually sees record Y again.

This is totally possible. However, for Kafka exactly-once to "work", all of your side effects (state, output) must also go into the same Kafka cluster. So here's what's going to happen:

1. Consumer X starts a transaction.
2. Consumer X sees record Y and emits some output record Z (as part of the transaction started in 1).
3. Consumer X crashes. Shortly after, the broker acting as the transaction coordinator "rolls back" (I'm simplifying) the transaction started in 1, meaning no other Kafka consumer will ever see record Z.
4. Consumer X boots back up, is assigned the same partition(s) as before, and starts a new transaction.
5. Consumer X sees record Y again and emits record Z2 (as part of the transaction started in 4).
6. Some time later, consumer X commits its offsets (as part of the transaction from 4) and then commits that transaction.
7. Record Z2 becomes visible to downstream consumers.

If you have side effects outside of the same Kafka cluster (say, instead of record Z you insert a row into MySQL), there's no general way to make Kafka exactly-once work for you. You'd need to rely on old-school dedup and idempotence.
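Here's a minimal sketch of that flow with the plain Java clients (topic names, group id, and the transform step are placeholders I made up, and it uses the 2.5+ sendOffsetsToTransaction variant); Kafka Streams with processing.guarantee=exactly_once wires this same loop up for you:

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.ProducerFencedException;

public class ConsumeTransformProduce {

    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("transactional.id", "consumer-x-txn"); // hypothetical; unique per instance
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
        producer.initTransactions();

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "consumer-x");            // hypothetical group id
        cp.put("enable.auto.commit", "false");       // offsets are committed inside the transaction
        cp.put("isolation.level", "read_committed"); // never observe aborted records like Z
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
        consumer.subscribe(List.of("input-topic"));  // hypothetical topic

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) continue;
            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> y : records) {
                    // "record Z": output derived from record Y, inside the open transaction
                    producer.send(new ProducerRecord<>("output-topic", y.key(), transform(y.value())));
                    offsets.put(new TopicPartition(y.topic(), y.partition()),
                                new OffsetAndMetadata(y.offset() + 1));
                }
                // The consumed offsets commit as part of the same transaction, so
                // "output written" and "input consumed" succeed or fail together.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (ProducerFencedException e) {
                producer.close(); // another instance took over this transactional.id
                break;
            } catch (KafkaException e) {
                producer.abortTransaction(); // Z is rolled back, never seen downstream
                // (a real app would also rewind the consumer to the last committed offsets)
            }
        }
    }

    private static String transform(String value) {
        return value.toUpperCase(); // stand-in for real processing
    }
}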

Radal explained it well in their answer, regarding exactly-once within an isolated Kafka cluster.
When dealing with an external database (a transactional one, at least), one easy way to achieve exactly-once is to UPDATE one row (in a DBMS transaction) with your business value AND the partition/offset it came from. That way, if your consumer crashes before committing to Kafka, you'll be able to get back the last Kafka offset it processed (by using consumer.seek()).
It can add quite a bit of data overhead in your DBMS (keeping offset/partition for all your rows), but you might be able to optimize a bit.
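A rough JDBC sketch of that idea (table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.kafka.clients.consumer.ConsumerRecord;

public class ExactlyOnceDbWriter {

    // Update the business value AND the source partition/offset in one DB transaction.
    static void persist(String dbUrl, ConsumerRecord<String, String> record,
                        String businessValue) throws SQLException {
        try (Connection conn = DriverManager.getConnection(dbUrl)) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE ticket SET payload = ?, kafka_partition = ?, kafka_offset = ? WHERE id = ?")) {
                ps.setString(1, businessValue);
                ps.setInt(2, record.partition());
                ps.setLong(3, record.offset());
                ps.setString(4, record.key());
                ps.executeUpdate();
            }
            conn.commit(); // value and offset land atomically, or not at all
        }
    }
    // On restart: SELECT MAX(kafka_offset) per partition from the table,
    // then consumer.seek(partition, maxOffset + 1) before polling.
}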
Yannick

Related

How to handle various failure conditions in Kafka

Issue we were facing:
In our system we were logging a ticket in the database with status NEW and also putting it in the Kafka queue for further processing. The processors pick those tickets from the Kafka queue, do the processing, and update the status accordingly. We found that some tickets are left in the NEW state forever. So we were guessing whether tickets are failing to get produced to the queue or are not getting consumed.
Message loss / duplication scenarios (and some other related points):
So I started to dig exhaustively into all the ways we can face message loss and duplication in Kafka. Below I have listed all the message loss and duplication scenarios that I could find in this post:
How data loss can occur in different approaches to handling all replicas being down
Handle by waiting for the leader to come back online:
Messages sent between all replicas going down and the leader coming back online are lost.
Handle by electing a new broker as leader once it comes online:
If the new broker is out of sync with the previous leader, all data written between the time the old leader went down and when this broker was elected the new leader will be lost. As additional brokers come back up, they will see that they have committed messages that do not exist on the new leader and drop those messages.
How data loss can occur when leader goes down, while other replicas may be up
In this case, the Kafka controller will detect the loss of the leader and elect a new leader from the pool of in sync replicas. This may take a few seconds and result in LeaderNotAvailable errors from the client. However, no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.
When a consumer may fail to consume a message
If Kafka is configured to keep messages for a day and a consumer is down for longer than a day, the consumer will lose messages.
Evaluating different approaches to consumer consistency
A message might not be processed when the consumer is configured to receive each message at most once
A message might be duplicated / processed twice when the consumer is configured to receive each message at least once
No message is processed multiple times or left unprocessed if the consumer is configured to receive each message exactly once.
Kafka provides the below guarantees as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers.
Kafka makes the following guarantees about data consistency and availability:
Messages sent to a topic partition will be appended to the commit log in the order they are sent,
a single consumer instance will see messages in the order they appear in the log,
a message is ‘committed’ when all in sync replicas have applied it to their log, and
any committed message will not be lost, as long as at least one in sync replica is alive.
Approach I came up with:
After reading several articles, I felt I should do following:
If a message is not enqueued, the producer should resend it
For this, the producer should listen for an acknowledgement for each message sent. If no acknowledgement is received, it can retry sending the message.
Producer should be async with callback:
As explained in the last example here
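For illustration, an async send with a callback might look like this (the logging/retry policy is whatever fits your system):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

static void sendAsync(KafkaProducer<String, String> producer,
                      ProducerRecord<String, String> record) {
    producer.send(record, (metadata, exception) -> {
        if (exception != null) {
            // No acknowledgement: the send failed even after the producer's own retries,
            // so log it, park it for inspection, or resend it, per your policy.
            System.err.println("Send failed for " + record + ": " + exception);
        } else {
            System.out.println("Acked: partition " + metadata.partition()
                    + ", offset " + metadata.offset());
        }
    });
}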
How to avoid duplicates in case of producer retries sending
To avoid duplicates in the queue, set enable.idempotence=true in the producer configs. This makes the producer ensure that exactly one copy of each message is written. It requires the following properties to be set on the producer:
max.in.flight.requests.per.connection <= 5
retries > 0
acks=all (the send is acknowledged only once all in-sync replicas have the message)
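A sketch of that configuration (the broker address is a placeholder; with enable.idempotence=true the producer validates these three settings):

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");        // placeholder
props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("retries", Integer.MAX_VALUE);                 // any value > 0 satisfies the requirement
props.put("max.in.flight.requests.per.connection", 5);   // must be <= 5 for idempotence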
Producer should be transactional
As explained here.
Set transactional id to unique id:
producerProps.put("transactional.id", "prod-1");
Because we've enabled idempotence, Kafka will use this transactional id as part of its algorithm to deduplicate any message this producer sends and to fence out stale ("zombie") instances of the producer, ensuring idempotency across producer restarts.
Use transactions semantics: init, begin, commit, close
As explained here:
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record1);
    producer.send(record2);
    // Atomic: either both records become visible to read_committed consumers, or neither does.
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    // Another producer with the same transactional.id was started; this instance must close.
    producer.close();
} catch (KafkaException e) {
    // Any other error: abort so the records in this transaction are never exposed.
    producer.abortTransaction();
}
Consumer should be transactional
consumerProps.put("isolation.level", "read_committed");
This ensures that consumers don't read any transactional messages before the transaction completes.
Manually commit offset in consumer
As explained here
Process records and save offsets atomically
Say, by atomically saving both the record processing output and the offsets to a database. For this we need to set auto-commit on the database connection to false and commit manually only after persisting both the processing output and the offset. This also requires setting enable.auto.commit to false on the consumer.
Read the initial offset (say, the first read after a recovery) from the database
Seek the consumer to this offset and then read from that position.
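A condensed sketch of these last three points, assuming a consumer with enable.auto.commit=false and a JDBC connection; readOffsetFromDb, saveOutput, and saveOffset are hypothetical helpers around your own SQL:

// Resume from the offset persisted in the database, then process and commit atomically.
TopicPartition tp = new TopicPartition("tickets", 0);    // hypothetical topic/partition
consumer.assign(Collections.singletonList(tp));
consumer.seek(tp, readOffsetFromDb(tp) + 1);             // continue after the last processed record

while (true) {
    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
        dbConnection.setAutoCommit(false);
        saveOutput(dbConnection, rec);                   // processing result
        saveOffset(dbConnection, tp, rec.offset());      // source offset
        dbConnection.commit();                           // both persist, or neither does
    }
}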
Doubts I have:
(Some doubts might be basic and resolvable by writing code, but I'd like the word of experienced Kafka developers.)
Does the consumer need to read the offset from the database only for the initial read (/ the first read after consumer recovery), or for all reads? I feel it needs to read the offset from the database only on restarts, as explained here.
Do we have to opt for manual partition assignment? Does this approach work only with automatic assignment turned off? I have this doubt because this example explains storing offsets in MySQL by specifying partitions explicitly.
Do we need both: producer-side Kafka transactions and consumer-side database transactions (for storing offsets and processing records atomically)? I feel that for producer idempotence we need the producer to have a unique transactional id, and for that we need to use the Kafka transactional API (init, begin, commit). As a counterpart, the consumer also needs to set isolation.level to read_committed. However, can we ensure no message loss and no duplicate processing without using Kafka transactions? Or are they absolutely necessary?
Should we persist the offset to an external DB, as explained above and here,
or send the offset to the transaction, as explained here (also, I didn't get what exactly it means to send the offset to a transaction),
or follow the sync/async commit combo explained here?
I feel message loss / duplication scenarios 1 and 2 are handled by points 1 to 4 of the approach I explained above.
I feel message loss / duplication scenario 3 is handled by point 6 of the approach I explained above.
How do we implement the different consumer consistency approaches stated in message loss / duplication scenario 4? Is there any configuration for it, or does it need to be implemented with custom logic inside the consumer?
Message loss / duplication scenario 5 says: "Kafka provides the below guarantees as long as you are producing to one partition and consuming from one partition." Is this something to be concerned about while building a correct system?
Is any consideration unnecessary/redundant in the approach I came up with above? Did I miss any necessary consideration? Did I miss any message loss / duplication scenarios?
Is there any other standard / recommended / preferable approach to ensure no message loss and no duplicate processing than what I have thought of above?
Do I have to actually code the above approach using the Kafka APIs? Or is there a high-level API built atop the Kafka APIs that makes it easy to ensure no message loss and no duplicate processing?
Looking at the issue we were facing (as stated at the very beginning), we were wondering whether we could recover any lost/unprocessed messages from the files in which Kafka stores messages. However, that isn't correct, right?
(Apologies for such an exhaustive post, but I wanted to ask all the related questions in one place, to build a big picture of how to design a system around Kafka.)

Why do the offsets of the consumer-group (app-id) of my Kafka Streams Application get reset after application restart?

I have a Kafka Streams application for which, whenever I restart it, the offsets for the topic it is consuming get reset. Hence, for all partitions, the lags increase and the app needs to reprocess all the data.
UPDATE:
The output topic is receiving a burst of events that were already processed after the app gets restarted; it is not that the input topic offsets are getting reset as I said in the previous paragraph. However, the internal topic (KTABLE-SUPPRESS-STATE-STORE) offsets are getting reset, see comments below.
I have ensured the lag is 1 for every partition before the restart (this is for the output topic).
All consumers that belong to that consumer-group-id (app-id) are active.
The restart is immediate; it takes around 30 secs.
The app is using exactly-once as the processing guarantee.
I have read this answer: How does an offset expire for an Apache Kafka consumer group?
I have tried with auto.offset.reset = latest and auto.offset.reset = earliest.
It seems like the offsets for these topics are not effectively committed, (but I am not sure about this).
I assume that after the restart the app should pick-up from the latest committed offset for that consumer group.
UPDATE:
I assume this for the internal topic (KTABLE-SUPPRESS-STATE-STORE)
Does the Kafka Streams API ensure that all consumed offsets are committed before shutting down (after calling streams.close())?
I would really appreciate any clue about this.
UPDATE:
This is the code the App execute:
final StreamsBuilder builder = new StreamsBuilder();
final KStream<..., ...> events = builder
    .stream(inputTopicNames, Consumed.with(..., ...)
        .withTimestampExtractor(...));
events
    .filter((k, v) -> ...)
    .flatMapValues(v -> ...)
    .flatMapValues(v -> ...)
    .selectKey((k, v) -> v)
    .groupByKey(Grouped.with(..., ...))
    .windowedBy(
        TimeWindows.of(Duration.ofSeconds(windowSizeInSecs))
            .advanceBy(Duration.ofSeconds(windowSizeInSecs))
            .grace(Duration.ofSeconds(windowSizeGraceInSecs)))
    .reduce((agg, newValue) -> { // "new" is a reserved word in Java, renamed here
        ...
        return agg;
    })
    .suppress(Suppressed.untilWindowCloses(
        Suppressed.BufferConfig.unbounded()))
    .toStream()
    .to(outPutTopicNameOfGroupedData, Produced.with(..., ...));
The offset reset always, and only, happens (after restarting) with the KTABLE-SUPPRESS-STATE-STORE internal topic created by the Kafka Streams API.
I have tried with both the exactly-once and at-least-once processing guarantees.
Once again, I will really appreciate any clue about this.
UPDATE:
This has been solved in the release 2.2.1 (https://issues.apache.org/jira/browse/KAFKA-7895)
The offset reset always, and only, happens (after restarting) with the KTABLE-SUPPRESS-STATE-STORE internal topic created by the Kafka Streams API.
This is currently (version 2.1) expected behavior, because the suppress() operator works in-memory only. Thus, on restart, the suppress buffer must be recreated from the changelog topic before processing can start.
Note, it is planned to let suppress() write to disk in future releases (cf. https://issues.apache.org/jira/browse/KAFKA-7224). This will avoid the overhead of recreating the buffer from the changelog topic.
I think @Matthias J. Sax's reply covers most of the internals of suppress. One thing I need to clarify though: when you say "restart the application", what exactly did you do? Did you shut down the whole application gracefully and then restart it?
Commit frequency is controlled by the parameter commit.interval.ms. Check whether your offsets are indeed committed. By default, offsets are committed every 100 ms (with exactly_once) or every 30 secs (with at_least_once), depending on your processing guarantee config. Check this out.
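For reference, a sketch of those knobs in a Streams configuration (the application id is a placeholder):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");  // placeholder
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
// Defaults to 30000 ms, or 100 ms when exactly_once is enabled; can be set explicitly:
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 100);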

Kafka ordering guarantees

I was going through the Kafka documentation and came across:
Guarantees
At a high level, Kafka gives the following guarantees:
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log. A consumer instance sees records in the order they are stored in the log. For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
I had few questions.
Is it always guaranteed that M1 will have a lower offset than M2? What if M1 is retried later than M2?
I also understood from various pieces of documentation that ordering is not guaranteed, and the consumer has to deal with it.
A possible scenario even with a single partition is:
Producer sends M1
Producer sends M2
M1 is not ack'ed on the first try due to some failure
M2 is delivered
M1 is delivered in a subsequent try.
One easy way to avoid this is through the producer config max.in.flight.requests.per.connection=1.
This of course has performance implications, so it should be used with caution.
Note that the ordering guarantees apply at the partition level. So, if you have more than one partition in the topic, you'll need to set the same partition key for messages that you require to appear in order.
For example, if you want to collect messages from various sensors and each sensor has its own ID, then using this ID as the message key guarantees the ordering of messages from every sensor on the consumer side (as no sensor will write messages to more than one partition).
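A small sketch of both knobs (the topic name is a placeholder, and sensorId/payload stand in for your own values):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");       // placeholder
props.put("max.in.flight.requests.per.connection", 1);  // retries cannot reorder messages
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// Records with the same key always hash to the same partition,
// so each sensor's readings stay in order for consumers.
producer.send(new ProducerRecord<>("sensor-readings", sensorId, payload));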
To answer your questions:
Yes, M1 will always have a lower offset than M2. The offsets are set by the broker, so the time of message arrival at the broker is what matters here.
Ordering is guaranteed only at the partition level, not at the topic level.
I have written an article that digs deeper into the ordering guarantees provided by Kafka.
You can check it out in my Medium post.

Repeatedly produced to Apache Kafka, different offsets? (Exactly once semantics)

While trying to implement exactly-once semantics, I found this in the official Kafka documentation:
Exactly-once delivery requires co-operation with the destination storage system but Kafka provides the offset which makes implementing this straight-forward.
Does this mean that I can use the (topic, partition, offset) tuple as a unique primary identifier to implement deduplication?
An example implementation would be to use an RDBMS with this tuple as the primary key for an insert operation within a big processing transaction, where the transaction fails if the insertion is no longer possible because of an already existing primary key.
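A sketch of that dedup idea with JDBC; the table name and schema are made up, and the primary key does the heavy lifting:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

import org.apache.kafka.clients.consumer.ConsumerRecord;

// Assumed table:
// CREATE TABLE processed (topic VARCHAR(255), part INT, off BIGINT, result TEXT,
//                         PRIMARY KEY (topic, part, off));
static void processOnce(Connection conn, ConsumerRecord<String, String> rec) throws SQLException {
    conn.setAutoCommit(false);
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO processed (topic, part, off, result) VALUES (?, ?, ?, ?)")) {
        ps.setString(1, rec.topic());
        ps.setInt(2, rec.partition());
        ps.setLong(3, rec.offset());
        ps.setString(4, rec.value()); // stand-in for the real processing output
        ps.executeUpdate();
        conn.commit();                // lands only if this (topic, partition, offset) is new
    } catch (SQLIntegrityConstraintViolationException duplicate) {
        conn.rollback();              // already processed: a redelivery, safely ignored
    }
}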
I think the question is equivalent to:
Does a producer use the same offset for a message when retrying to send it after detecting a possible failure or does every retry attempt get its own offset?
If the offset is reused when retrying, consumers obviously see multiple messages with the same offset.
Another question, maybe somehow related:
With single or multiple producers producing to the same topic, can there be "gaps" in the offset number sequence seen by one consumer?
Another possibility could be that the offset is determined only once the message reaches the leader responsible for the partition (implying that, since the broker does not listen to anything like a producer-suggested offset, there are probably no gaps/offset jumps, but duplicate messages would get different offsets, and I would have to use my own unique identifier within the application's message at the application level).
To answer my own question:
The offset is generated solely by the server (more precisely: by the leader of the corresponding partition), not by the producing client. It is then sent back to the producer in the produce response. So:
Does a producer use the same offset for a message when retrying to send it after detecting a possible failure or does every retry attempt get its own offset?
No. (See update below!) The producer does not determine offsets and two identical/duplicate application messages can have different offsets. So the offset cannot be used to identify messages for producer deduplication purposes and a custom UID has to be defined in the application message. (Source)
With single or multiple producers producing to the same topic, can there be "gaps" in the offset number sequence seen by one consumer?
Due to the fact that there is only a single leader for every partition, which maintains the current offset, and the fact that (with the default configuration) this leadership is only transferred to an active in-sync replica in case of a failure, I assume that the latest used offset is always communicated correctly when electing a new leader for a partition, and therefore there should not be any offset gaps or jumps initially. However, because of the log compaction feature, there are cases (assuming log compaction is enabled) where there can indeed be gaps in a stream of offsets when consuming already committed messages of a partition once again after compaction has kicked in. (Source)
Update (Kafka >= 0.11.0)
Starting from Kafka version 0.11.0, producers additionally send a sequence number with their requests, which is then used by the leader to deduplicate requests by this number and the producer's ID. So with 0.11.0, the precondition on the producer side for implementing exactly-once semantics is given by Kafka itself, and there's no need to send another unique ID or sequence number within the application's message.
Therefore, the answer to question 1 could now also be yes, somehow.
However, note that exactly-once semantics are still only possible if the consumer never fails. Once the consumer can fail, one still has to watch out for duplicate message processing on the consumer side.

Kafka only once consumption guarantee

I see in some answers around Stack Overflow, and on the web in general, the idea that Kafka does not support consumption acknowledgements, or that exactly-once consumption is hard to achieve.
In the following entry, as a sample,
Is there any reason to use RabbitMQ over Kafka?, I can read the following statements:
RabbitMQ will keep all states about consumed/acknowledged/unacknowledged messages while Kafka doesn't
or
Exactly once guarantees are hard to get with Kafka.
This is not what I understand by reading the official Kafka documentation at:
https://kafka.apache.org/documentation/#design_consumerposition
The previous documentation states that Kafka does not use a traditional acknowledgement implementation (as RabbitMQ does). Instead, it relies on the relationship between partition, consumer, and offset...
This makes the equivalent of message acknowledgements very cheap
Could somebody please explain why an "only once consumption guarantee" is difficult to achieve in Kafka? How does this differ between Kafka and a more traditional message broker like RabbitMQ? What am I missing?
If you mean exactly-once, the problem is this.
Kafka consumers, as you may know, use a polling mechanism; that is, consumers ask the server for messages. Also, recall that consumers commit message offsets; that is, they tell the cluster what the next expected offset is. So, imagine what could happen.
The consumer polls for messages and gets a message with offset = 1.
A) If the consumer commits that offset immediately, before processing the message, then it can crash and will never receive that message again because it was already committed; on the next poll Kafka will return the message with offset = 2. This is what they call at-most-once semantics.
B) If the consumer processes the message first and then commits the offset, what could happen is that after processing the message but before committing, the consumer crashes, so in that case the next poll will again get the same message with offset = 1 and that message will be processed twice. This is what they call at-least-once.
In order to achieve exactly-once, you need to process the message and commit the offset in an atomic operation, where you always do both or neither of them. This is not so easy. One way to do this (if possible) is to store the result of the processing along with the offset of the message that generated that result. Then, when the consumer starts, it looks for the last processed offset outside Kafka and seeks to that offset.
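A minimal sketch of the two commit placements described above (process() stands in for your handling logic):

// A) At-most-once: commit BEFORE processing. A crash after the commit but
//    before processing finishes loses the message.
ConsumerRecords<String, String> batchA = consumer.poll(Duration.ofMillis(500));
consumer.commitSync();
for (ConsumerRecord<String, String> rec : batchA) process(rec);

// B) At-least-once: commit AFTER processing. A crash after processing but
//    before the commit makes the message get processed twice.
ConsumerRecords<String, String> batchB = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, String> rec : batchB) process(rec);
consumer.commitSync();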