How to reread Kafka topic from the beginning - apache-kafka

I have Spring Boot app that uses Kafka via Spring Kafka module. A neighboring team sends data in JSON format periodically to a compacted topic that serves as "snapshot" of their internal database at a certain moment of time. But,sometimes, the team updates contract without notification, our DTOs don't reflect the recent changes and, obviously, deserialization fails, and, because we have our listener containers configured as batched with default BATCH ack mode and BatchLoggingErrorHandler, we found, from time to time, that our Kuber pod with consumer is full of errors and we can't reread topic with fresh DTOs after microservice redeploy that simple since the last offset in every partition is commited and we can't change group.id (InfoSec department policy) to use auto.offset.reset = "earliest" as a workaround.
So, is there a way to reposition every consumer in a consumer group to the initial offset in assigned partitions programmaticaly? If it is true, I think we could write a REST endpoint, which being called triggers a "reprocessing from scratch".

See the documentation.
If your listener extends AbstractConsumerSeekAware, you can perform all kinds of seek operations (e.g. initial seek during initialization, arbitrary seeks between polls).

Related

Kafka consumer is processing all messages at startup

I am new to Kafka, and am developing a personal project with a few services and the communication between them is made through Kafka and I am using Confluent for housing Kafka remotely.
All works fine, but when I startup a server it will try to process all the old messages in the topics that were generated as I was testing the system.
I would like to avoid this because it is time consuming and those messages were already processed, when the server was up the last time. Is there any way to prevent this in the development environment?
Am I even using Kafka correctly? Are there good practises that I missed?
By "server", I assume you mean consumer. The broker server doesn't process data, only stores it.
If you have auto.offset.reset=earliest + enable.auto.commit=false + are not committing the records in your code (or are overall using a new group.id each time), this is the expected behavior since your group.id is not tracking already consumed data.
Since you're now in a situation where you have processed data, but no stored offsets, first set a static group id, then your options include
re-process all the data again, accepting the duplicates, perhaps adding some conditional filter in your consumer code to skip records
skip all processed and un-processed data and only start consuming brand-new records after the consumer starts, by either setting a new group.id + auto.offset.reset=latest, or use consumer.seekToEnd() / the kafka-consumer-groups CLI tool ; downside of setting auto.offset.reset=latest is that you might run into a situation where the consumer group has been idle too long, and the group expires, causing you to go back to the end of the topic, even though there may still be un-processed data
manually find the offsets for all the partitions for the last processed data and consumer.seek() to those offsets

How to manage offsets in KafkaItemReader used in spring batch job in case of any exception occurs in mid of reading messages

I am working on a Kafka based spring boot application for the first time. My requirement was to create an output file with all the records using spring batch. I created a spring batch job where integrated with a customized class which extends KafkaItemReader. I don't want to commit the offsets for now as i might need to go back read some records from already consumed offsets. My consumer config has these properties;
enable.auto.commit: false
auto-offset-reset: latest
group.id:
There are two scenarios->
1. A happy path, where i can read all the messages from kafka topic and transform them and then write them to an output file using above configuration.
2. I am getting an exception while reading thru' the messages, and i am not sure how to manage the offsets in such cases. Even if i go back to rest the offset, how to make sure it is the correct offset for messages. I dont persist the payload of message record anywhere except it goes to the spring batch output file.
You need to use a persistent Job Repository for that and configure the KafkaItemReader to save its state. The state consists in the offset of each partition assigned to the reader and will be saved at chunk boundaries (aka at each transaction).
In a restart scenario, the reader will be initialized with the last offset for each partition from the execution context and resume where it left off.

Does rebuilding state stores in Kafka Streams propagate duplicate records to downstream topics?

I'm currently using Kafka Streams for a stateful application. The state is not stored in a Kafka state store though, but rather just in memory for the moment being. This means whenever I restart the application, all state is lost and it has to be rebuilt by processing all records from the start.
After doing some research on Kafka state stores, this seems to be exactly the solution I'm looking for to persist state between application restarts (either in memory or on disk). However, I find the resources online lack some pretty important details, so I still have a couple of questions on how this would work exactly:
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
State stores use their own changelog topics, and kafka-streams state stores take on responsibility for loading from them. If your state stores are uninitialised, your kafka-streams app will rehydrate its local state store from the changelog topic using EARLIEST, since it has to read every record.
This means the startup sequence for a brand new instance is roughly:
Observe there is no local state-store cache
Load the local state store by consumeing from the changelog topic for the statestore (the state-store's topic name is <state-store-name>-changelog)
Read each record and update a local rocksDB instance accordingly
Do not emit anything, since this is an application-service, not your actual topology
Read your consumer-groups offsets using EARLIEST or LATEST according to how you configured the topology. Not this is only a concern if your consumer group doesn't have any offsets yet
Process stuff, emitting records according to the topology
Whether you set your actual topology's auto.offset.reset to LATEST or EARLIEST is up to you. In the event they are lost, or you create a new group, its a balance between potentially skipping records (LATEST) vs handling reprocessing of old records & deduplication (EARLIEST),
Long story short: state-restoration is different from processing, and handled by kafka-streams its self.
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If you are re-launching the same application (e.g. after having stopped it before), then state will not be recalculated by reprocessing the original input data. Instead, the state will be restored from its "backup" (every state store or KTable is durably stored in a Kafka topic, the so-called "changelog topic" of that table/state store for such purposes) so that its data is exactly what it was when the application was stopped. This behavior enables you to seamlessly stop+restart your applications without skipping over records that arrived between "stop" and "restart".
But there is a different caveat that you need to be aware of: The configuration to set the offset start point (latest or earliest) is only used when you run your Kafka Streams application for the first time. Afterwards, whenever you stop+restart your application, it will always continue where it previously stopped. That's because, if the app has run at least once, it has stored its consumer offset information in Kafka, which allows it to know from where to automatically resume operations once it is being restarted.
If you need the different behavior of always (re)starting from e.g. the latest offsets (thus potentially skipping records that arrived in between when you stopped the application and when you restarted it), you must reset your Kafka Streams application. One of the steps the reset tool performs is removing the application's consumer offset information from Kafka, which makes the application think that it was never started before, so to speak.
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
This reprocessing will not happen by default as explained above. State will be automatically reconstructed to its prior state (pun intended) at the point when the application was stopped.
Reprocessing would only happen if you manually reset your application (see above) and e.g. configure the application to re-read historical data (like setting auto.offset.reset to earliest after you did the reset).

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages in a very high frequency to topics whose retention time = 10 hours. These messages are real-time updates and the used key is the ID of the element whose value has changed. So the topic is acting as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Having duplicates in compacted topic is a very high possibility even with setting the log compaction frequency the highest possible.
x2 number of topics on Kakfa server.
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that consumer can see it (Extra topics), or we will need consumers to execute KSQL SELECT using to KSQL Rest Server and query the table (Not as fast and performant as Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create 1 topic on Kafka server per KTable, which will result in a huge number of topics since we a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
Consumer starts and consumes the topic from beginning. This worked
perfectly, but the consumer has to consume the 10 hours change log to
construct the last values table.
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer group.id and you commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read it from the last comitted offset for that group.id.
So the problem of taking a lot of time occurs only initially (during first time). So long as you have the file, you don't need to consume from beginning.
In case, if the file is not there or is deleted, just seekToBeginning in the KafkaConsumer and build it again.
Somewhere, you need to store this key-values for retrieval and why cannot it be a persistent store?
In case if you want to use Kafka streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backed store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S: There will be a file called .checkpoint in the directory which stores the offsets. In case if the topic is deleted in the middle you get OffsetOutOfRangeException. You may want to avoid this, perhaps by using UncaughtExceptionHandler
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
It is better to use Consumer with persistent file rather than Streams for this, because of simplicity it offers.

Making Kafka consumers consume existing messages before subscription

Having Publisher and N Consumers, if consumers use auto.offset.reset=latest then they miss all the messages that were published to a topic before they subscribed to it ... It is a known fact that Consumer with auto.offset.reset=latestdoesn't replay messages that existed in the topic before it subscribed...
So I would need either :
Make publisher wait until all subscribers start consuming messages and then start publishing. Dunno how to do that without leveraging Zookeeper for instance. Does Kafka provide means to do that ?
Another way would be having auto.offset.reset=latest Consumers and make them explicitly consume all existing messages before in case they are about to subscribe to a topic with existing messages...
What is the best practice for this case?
I guess that consumer must check topic for existing messages, consume them if there are any and then initiate auto.offset.reset=latest consumption. That sounds like the best way to me ...
If a high level consumer gets started, it does the following:
look for committed offsets for its consumer group
a. if valid offsets are found, resume from there
b. if no valid offsets are found, set offsets according to auto.offset.reset
Thus, auto.offset.reset only triggers, if no valid offset was committed. This behavior is intended and necessary to provide at-least-once processing guarantees in case of failure.
Thus, is you want to read a topic from its beginning, you can either use a new consumer group.id and set auto.offset.reset = earliest or you explicitly modify the offsets on startup using seekToBeginning() before you start your poll() loop.
We do the option (1) using a service discovery feature provided by Eureka (any other service discovery app would do the job) + aliasing. Basically a publisher does not register itself (and start processing requests nor publish notifications) until at least one subscriber is available.