kafka streams app - ignore old messages on restart

I deal with time-series data for a live application, so old data has no significance. I only want to process data received after the Streams app has started, not data from the previously committed offset onward. What is the correct way to ignore old records in a Kafka Streams app after a restart?
With the Kafka consumer API I generally used the seekToEnd() method to skip forward to the latest record. Is there an equivalent mechanism for Streams?
I want to avoid filtering through all the messages since the last commit just to discard the old ones.

You can create a separate consumer with the Kafka Consumer API, using the same group.id as the application.id of your Kafka Streams app, and have that consumer do a seekToEnd() before you start the stream. Disable auto-commit (enable.auto.commit=false) for this special consumer and commit the offsets manually after seekToEnd(). Then start your stream.
Make sure the stream is not started until the offsets from this reset consumer have been committed.
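
The ordering of the steps above can be sketched without a broker. This is only an illustration: OffsetClient, InMemoryOffsetClient, and SkipOldRecords are hypothetical names, not part of kafka-clients. In a real application the two interface methods would correspond roughly to KafkaConsumer.endOffsets(...) and KafkaConsumer.commitSync(...), called on a consumer whose group.id equals the Streams application.id, and the Runnable would call KafkaStreams.start().

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the consumer calls the reset step needs. */
interface OffsetClient {
    /** Partition -> offset of the next record to be written (cf. KafkaConsumer.endOffsets). */
    Map<Integer, Long> endOffsets();

    /** Stores partition -> next offset to read for the group (cf. KafkaConsumer.commitSync). */
    void commitSync(Map<Integer, Long> offsets);
}

/** In-memory fake so the sketch runs without a broker. */
class InMemoryOffsetClient implements OffsetClient {
    final Map<Integer, Long> logEnd = new HashMap<>();
    final Map<Integer, Long> committed = new HashMap<>();

    public Map<Integer, Long> endOffsets() {
        return new HashMap<>(logEnd);
    }

    public void commitSync(Map<Integer, Long> offsets) {
        committed.putAll(offsets);
    }
}

class SkipOldRecords {
    /** Commits the end offset of every partition, and only then starts the stream. */
    static void resetThenStart(OffsetClient client, Runnable startStreams) {
        Map<Integer, Long> end = client.endOffsets();
        client.commitSync(end); // must finish before the stream starts
        startStreams.run();     // in a real app: () -> kafkaStreams.start()
    }

    public static void main(String[] args) {
        InMemoryOffsetClient client = new InMemoryOffsetClient();
        client.logEnd.put(0, 120L);
        client.logEnd.put(1, 87L);
        client.committed.put(0, 15L); // stale offset left over from the previous run

        resetThenStart(client,
                () -> System.out.println("stream starts at " + client.committed));
    }
}
```

If you use seekToEnd() on a real consumer instead of endOffsets(), keep in mind that seekToEnd() is evaluated lazily: the position is only resolved once poll() or position() is called, so resolve it before committing.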

Related

Kafka consumer is processing all messages at startup

I am new to Kafka and am developing a personal project with a few services that communicate through Kafka, which I host remotely on Confluent.
Everything works fine, but when I start up a server it tries to process all the old messages in the topics that were generated while I was testing the system.
I would like to avoid this because it is time-consuming and those messages were already processed the last time the server was up. Is there any way to prevent this in the development environment?
Am I even using Kafka correctly? Are there good practices that I missed?

By "server", I assume you mean consumer. The broker server doesn't process data, only stores it.
If you have auto.offset.reset=earliest + enable.auto.commit=false + are not committing the records in your code (or are overall using a new group.id each time), this is the expected behavior since your group.id is not tracking already consumed data.
Since you're now in a situation where you have processed data but no stored offsets, first set a static group.id; then your options include:
1. Re-process all the data again, accepting the duplicates, perhaps adding some conditional filter in your consumer code to skip records.
2. Skip all processed and unprocessed data and only start consuming brand-new records after the consumer starts, either by setting a new group.id + auto.offset.reset=latest, or by using consumer.seekToEnd() / the kafka-consumer-groups CLI tool. The downside of relying on auto.offset.reset=latest is that if the consumer group has been idle too long and its committed offsets expire, the consumer jumps to the end of the topic, even though there may still be unprocessed data.
3. Manually find the offsets of the last processed data for all partitions and consumer.seek() to those offsets.

(Spring) Kafka appears to consume newly produced messages out of order

Situation:
We have a Spring Boot / Spring Kafka application that is reading from a Kafka topic with a single partition. There is a single instance of the application running and it has a single-threaded KafkaMessageListenerContainer (not Concurrent). We have a single consumer group.
We want to manage offsets ourselves based on committing to a transactional database. At startup, we read initial offsets from our database and seek to that offset and begin reading older messages. (For example with an empty database, we would start at offset 0.) We do this via implementing ConsumerRebalanceListener and seek()ing in that callback. We pause() the KafkaMessageListenerContainer prior to starting it so that we don't read any messages prior to the ConsumerRebalanceListener being invoked (then we resume() the container inside the ConsumerRebalanceListener.onPartitionsAssigned() callback). We acknowledge messages manually as they are consumed.
Issue:
While in the middle of reading these older messages (1000s of messages and 10s of seconds/minutes into the reading), a separate application produces messages into the same topic and partition we're reading from.
We observe that these newly produced messages are consumed immediately, intermingled with the older messages we're in the process of reading. So we observe message offsets that jump in this single consumer thread: from the basically sequential offsets of the older messages to ones that are from the new messages that were just produced, and then back to the older, sequential ones.
We don't see any errors in reading messages or anything that would trigger retries or anything like that. The reads of newer messages happen in the main thread as do the reads of older messages, so I don't believe there's another listener container running.
How could this happen? Doesn't this seem contrary to the ordering guarantees Kafka is supposed to provide? How can we prevent this behavior?
Details:
We have the following settings (some in properties, some in code, please excuse the mix):
properties.consumer.isolationLevel = KafkaProperties.IsolationLevel.READ_COMMITTED
properties.consumer.maxPollRecords = 500
containerProps.ackMode = ContainerProperties.AckMode.MANUAL
containerProps.eosMode = ContainerProperties.EOSMode.BETA
spring.kafka.consumer.auto-offset-reset=none
spring.kafka.enable-auto-commit=false
Versions:
Spring Kafka 2.5.5.RELEASE
Kafka 2.5.1
(we could definitely try upgrading if there was a reason to believe this was the result of a bug that was fixed since then.)
I can share some code snippets for any of the above if it's interesting.

kafka offset management auto vs manual

I'm working on a Spring Boot application that uses Kafka Streams. In my application I want to manage Kafka offsets myself and commit an offset only when the message has been processed successfully. This is important to be certain I won't lose messages even if Kafka restarts or ZooKeeper is down. My current situation is that when Kafka goes down and comes back up, my consumer starts from the beginning and consumes all the previous messages.
Also, I need to know the difference between managing Kafka offsets automatically using autoCommitOffset and managing them manually using HBase, ZooKeeper, or checkpoints.
Also, what are the benefits of managing them manually if there is an automatic config we can use?

You have no guarantee of durability with auto commit.
Older Kafka clients did use ZooKeeper for offset storage, but now it is all in the broker to minimize dependencies. The Kafka Streams API has no way to integrate offset storage outside of Kafka itself, so if you choose to keep offsets in external storage you must use the Consumer API to look up, seek to, and commit them. Even then, you can still end up with less-than-optimal message processing.
"my current situation is when my Kafka is down and up my consumer starts from the beginning and consumes all the previous messages"
Sounds like you set auto.offset.reset=earliest and you never commit any offsets at all...
The auto commit setting does a periodic commit, not "automatic after reading any message".
If you want to guarantee delivery, then you need to set at least acks=1 in the producer and actually do a commitSync in the consumer.
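
To make the "periodic commit" point concrete, here is a small simulation, not Kafka code. It assumes a consumer that starts at offset 0 and crashes after processing a given number of records, and it approximates the time-based auto-commit interval (auto.commit.interval.ms) by "one commit every N records". CommitCadence and resumeOffsetAfterCrash are hypothetical names.

```java
class CommitCadence {
    /**
     * Returns the offset a restarted consumer resumes from, given that it
     * crashed after processing `recordsProcessed` records and committed
     * once every `commitEveryNRecords` records.
     */
    static long resumeOffsetAfterCrash(long recordsProcessed, long commitEveryNRecords) {
        long committed = 0; // next offset to read, as stored for the group
        for (long offset = 0; offset < recordsProcessed; offset++) {
            // ... the record at `offset` is processed here ...
            long next = offset + 1;
            if (next % commitEveryNRecords == 0) {
                committed = next; // a commit happens only at these points
            }
        }
        return committed;
    }

    public static void main(String[] args) {
        long processed = 1234;
        long periodic = resumeOffsetAfterCrash(processed, 500); // stand-in for auto commit
        long perRecord = resumeOffsetAfterCrash(processed, 1);  // commit after every record
        System.out.println("periodic commit replays " + (processed - periodic) + " records");
        System.out.println("per-record commit replays " + (processed - perRecord) + " records");
    }
}
```

In this model the periodic variant resumes at offset 1000 and re-reads 234 records, while committing after each processed record resumes exactly where the consumer stopped. If nothing is ever committed, there is no stored offset at all and auto.offset.reset decides where the consumer starts.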

Confused about preventing duplicates with new Kafka idempotent producer API

My app has 5+ consumers consuming from five partitions of a Kafka topic (using Kafka version 0.11). Each consumer produces a message to another topic, saves some state to the database, does a MANUAL_IMMEDIATE acknowledgement, and then moves on to the next message.
I'm trying to solve the scenario where a consumer emits successfully to the outbound topic and then we have a failure or lose the consumer. When another consumer takes over the partition, it will emit ANOTHER message to the outbound topic. This is bad :(
I discovered that Kafka now has idempotent producers, but from what I read the guarantee only holds for a single producer session.
"When producer restarts, new PID gets assigned. So the idempotency is promised only for a single producer session" - (blog) - https://hevodata.com/blog/kafka-exactly-once
This seems largely useless to me. In my use case the whole point is that when a message is replayed on another consumer, it does not duplicate the outbound message.
Is there something I'm missing?

When using transactions, you shouldn't use ANY consumer-based mechanism, manual or otherwise, to commit the offsets.
Instead, you use the producer to send the offsets to the transaction so the offset commit is part of the transaction.
If configured with a KafkaTransactionManager or ChainedKafkaTransactionManager, the Spring listener container will send the offsets to the transaction when the listener exits normally.
If you don't use a Kafka transaction manager, you need to use the KafkaTemplate (or Producer if you are using the native APIs) to send the offsets to the transaction.
Using the consumer to commit the offset is not part of the transaction, so things will not work as expected.
When using a transaction manager, the listener container binds the Producer to the thread so any downstream KafkaTemplate operations participate in the transaction that the consumer starts. See the documentation.

Making Kafka consumers consume existing messages before subscription

With a publisher and N consumers, if the consumers use auto.offset.reset=latest they miss all the messages that were published to a topic before they subscribed to it. It is a known fact that a consumer with auto.offset.reset=latest doesn't replay messages that existed in the topic before it subscribed.
So I would need to either:
1. Make the publisher wait until all subscribers start consuming messages and then start publishing. I don't know how to do that without leveraging ZooKeeper, for instance. Does Kafka provide a means to do that?
2. Keep the auto.offset.reset=latest consumers and make them explicitly consume all existing messages first when they are about to subscribe to a topic that already contains messages.
What is the best practice for this case?
I guess the consumer must check the topic for existing messages, consume them if there are any, and then initiate auto.offset.reset=latest consumption. That sounds like the best way to me...

If a high level consumer gets started, it does the following:
look for committed offsets for its consumer group
a. if valid offsets are found, resume from there
b. if no valid offsets are found, set offsets according to auto.offset.reset
Thus, auto.offset.reset only triggers, if no valid offset was committed. This behavior is intended and necessary to provide at-least-once processing guarantees in case of failure.
Thus, if you want to read a topic from its beginning, you can either use a new consumer group.id and set auto.offset.reset=earliest, or explicitly modify the offsets on startup using seekToBeginning() before you start your poll() loop.
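
The startup decision described above can be written down as a small model. This is a sketch of the rule only, not the client's actual implementation: StartPosition, Reset, and resolve are hypothetical names, a committed offset is treated as valid when it still lies between the beginning and the end of the partition log, and the failure for auto.offset.reset=none is modelled with a plain IllegalStateException.

```java
import java.util.OptionalLong;

class StartPosition {
    /** Mirrors the three values of auto.offset.reset. */
    enum Reset { EARLIEST, LATEST, NONE }

    /**
     * Picks the offset a consumer starts reading from.
     *
     * @param committed offset stored for the group, if any
     * @param reset     the auto.offset.reset policy
     * @param beginning first offset still present in the partition
     * @param end       offset of the next record to be written
     */
    static long resolve(OptionalLong committed, Reset reset, long beginning, long end) {
        // a. a valid committed offset always wins; the reset policy is ignored
        if (committed.isPresent()
                && committed.getAsLong() >= beginning
                && committed.getAsLong() <= end) {
            return committed.getAsLong();
        }
        // b. otherwise fall back to auto.offset.reset
        switch (reset) {
            case EARLIEST:
                return beginning;
            case LATEST:
                return end;
            default:
                throw new IllegalStateException(
                        "no valid committed offset and auto.offset.reset=none");
        }
    }

    public static void main(String[] args) {
        // Existing group: resumes from its committed offset even with "latest".
        System.out.println(resolve(OptionalLong.of(42), Reset.LATEST, 0, 100));
        // New group.id with "earliest": reads the topic from its beginning.
        System.out.println(resolve(OptionalLong.empty(), Reset.EARLIEST, 0, 100));
    }
}
```

The first call shows why changing auto.offset.reset alone has no effect on a group that already has committed offsets; only a new group.id or an explicit seek changes the starting point.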

We use option (1), via a service discovery feature provided by Eureka (any other service discovery app would do the job) plus aliasing. Basically, a publisher does not register itself (or start processing requests or publishing notifications) until at least one subscriber is available.
