Does GlobalKTable maintain data after application restart? - apache-kafka

I'm working with Spring Cloud Stream and I have a BiFunction that receives a KStream and a GlobalKTable. I don't want to lose the GlobalKTable data after my application restarts, but that is what is happening.
@Bean
public BiFunction<KStream<String, MyClass1>, GlobalKTable<String, MyClass2>, KStream<String, MyClass3>> process() {
...
}
I've also configured the "materializedAs" property:
spring.cloud.stream.kafka.streams.bindings.process-in-1.consumer.materializedAs: MYTABLE
I have a topic A that has a retention time of 1 week. So, if a message from topic A was erased due to the retention time and my application restarts, the GlobalKTable doesn't find this message.
Should the GlobalKTable data really be erased when my application restarts?

GlobalKTable always restores from the input topic directly. It builds the state store based on the input topic. If the state store is already there and in sync with the input topic, I believe the restore on startup will be faster (therefore, if you are using Spring for Apache Kafka < 2.7, you need to do what Gary suggested above). However, if the input topic is completely removed, then the state store needs to be rebuilt entirely from scratch from the new input topic. That is the reason why you are not seeing any data restored on startup after deleting the topic. This thread has some more details on this topic.

See the binder documentation.
By default, the KafkaStreams.cleanup() method is called when the binding is stopped. See the Spring Kafka documentation. To modify this behavior, simply add a single CleanupConfig @Bean (configured to clean up on start, stop, or neither) to the application context; the bean will be detected and wired into the factory bean.
Spring for Apache Kafka 2.7 and later no longer removes the state by default: https://github.com/spring-projects/spring-kafka/commit/eff205404389b563849fdd4dceb52b23aeb38f20
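A minimal sketch of such a bean, assuming org.springframework.kafka.core.CleanupConfig and that the binder wires it into the StreamsBuilderFactoryBean as described above; here it is configured to clean up neither on start nor on stop:

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.CleanupConfig;

@Configuration
public class StateRetentionConfig {

    @Bean
    public CleanupConfig cleanupConfig() {
        // cleanupOnStart = false, cleanupOnStop = false: keep local state across restarts
        return new CleanupConfig(false, false);
    }
}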

Related

How to reread Kafka topic from the beginning

I have a Spring Boot app that uses Kafka via the Spring Kafka module. A neighboring team periodically sends data in JSON format to a compacted topic that serves as a "snapshot" of their internal database at a certain moment in time. But sometimes the team updates the contract without notification, our DTOs don't reflect the recent changes and, obviously, deserialization fails. Because our listener containers are configured as batched, with the default BATCH ack mode and BatchLoggingErrorHandler, we find from time to time that our Kubernetes pod with the consumer is full of errors, and we can't simply reread the topic with fresh DTOs after redeploying the microservice, since the last offset in every partition is committed and we can't change group.id (InfoSec department policy) to use auto.offset.reset = "earliest" as a workaround.
So, is there a way to reposition every consumer in a consumer group to the initial offset of its assigned partitions programmatically? If so, I think we could write a REST endpoint which, when called, triggers a "reprocessing from scratch".
See the documentation.
If your listener extends AbstractConsumerSeekAware, you can perform all kinds of seek operations (e.g. initial seek during initialization, arbitrary seeks between polls).
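A minimal sketch of that approach, assuming a Spring for Apache Kafka version whose AbstractConsumerSeekAware exposes seekToBeginning(); the class, topic, and group names below are hypothetical:

import java.util.List;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.listener.AbstractConsumerSeekAware;
import org.springframework.stereotype.Component;

@Component
public class SnapshotListener extends AbstractConsumerSeekAware {

    @KafkaListener(topics = "team-snapshot-topic", groupId = "fixed-group-id")
    public void onBatch(List<ConsumerRecord<String, String>> records) {
        // deserialize and process the batch ...
    }

    // Call this from a REST endpoint to trigger "reprocessing from scratch";
    // the seek is applied to all currently assigned partitions between polls.
    public void reprocessFromBeginning() {
        seekToBeginning();
    }
}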

Where does Kafka save the local state store?

I created a Kafka topic and sent some messages to it.
I created an application which had the stream topology. Basically, it was doing the sum and materialized it to a local state store.
I saw the new directory was created in the state folder I configured.
I could read the sum from the local state store.
Everything was good so far.
Then, I turned off my application which was running the stream.
I removed the directory created in my state folder.
I restarted the Kafka cluster.
I restarted my application which has the stream topology.
In my understanding, the state was gone and Kafka would need to do the aggregation again. But it did not. I was still able to get the previous sum result.
How come? Where did Kafka save the local state store?
Here is my code
Reducer<Double> reduceFunction = (subtotal, amount) -> {
    // detect when the reducer is triggered
    System.out.println("reducer is running to add subtotal with amount..." + amount);
    return subtotal + amount;
};

groupedByAccount.reduce(reduceFunction,
        Materialized.<String, Double, KeyValueStore<Bytes, byte[]>>as(BALANCE).withValueSerde(Serdes.Double()));
I explicitly put the System.out in the reduceFunction, so whenever it is executed I should see it on the console.
But I did not see any output after restarting the Kafka cluster and my application.
Does Kafka really recover the state? Or does it save the state somewhere else?
If I'm not mistaken, and according to Designing Event-Driven Systems by Ben Stopford (free book), page 137 states:
We could store these stats in a state store and they'll be saved locally as well as being backed up to Kafka, using what's called a changelog topic, inheriting all of Kafka's durability guarantees.
It seems like a copy of your state store is also backed up in Kafka itself (i.e. the changelog topic). I don't think restarting a cluster will flush out (or remove) messages already in a topic, as topic data is persisted on the brokers' disks (with the metadata tracked in ZooKeeper).
So once you restart your cluster and application again, the local state store is recovered from Kafka.
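To make that concrete for the reduce() in the question: with default settings the store is backed by a changelog topic named <application.id>-BALANCE-changelog and replayed from it on restart. A sketch of this, reusing the variables from the question's code (the withLoggingDisabled comment is only illustrative):

// The store named BALANCE is backed up to a changelog topic named
// "<application.id>-" + BALANCE + "-changelog" and replayed from it on restart.
groupedByAccount.reduce(reduceFunction,
        Materialized.<String, Double, KeyValueStore<Bytes, byte[]>>as(BALANCE)
                .withValueSerde(Serdes.Double()));
                // .withLoggingDisabled() would make the state purely local and
                // NOT recoverable once the local state directory is deleted.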

How to manage offsets in KafkaItemReader used in spring batch job in case of any exception occurs in mid of reading messages

I am working on a Kafka-based Spring Boot application for the first time. My requirement was to create an output file with all the records using Spring Batch. I created a Spring Batch job that is integrated with a customized class which extends KafkaItemReader. I don't want to commit the offsets for now, as I might need to go back and read some records from already-consumed offsets. My consumer config has these properties:
enable.auto.commit: false
auto-offset-reset: latest
group.id:
There are two scenarios:
1. A happy path, where I can read all the messages from the Kafka topic, transform them, and then write them to an output file using the above configuration.
2. I get an exception while reading through the messages, and I am not sure how to manage the offsets in such cases. Even if I go back to reset the offset, how do I make sure it is the correct offset for the messages? I don't persist the payload of the message records anywhere except in the Spring Batch output file.
You need to use a persistent Job Repository for that and configure the KafkaItemReader to save its state. The state consists of the offset of each partition assigned to the reader and will be saved at chunk boundaries (i.e., at each transaction).
In a restart scenario, the reader will be initialized with the last offset for each partition from the execution context and resume where it left off.
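A configuration sketch of such a reader, assuming Spring Batch's KafkaItemReaderBuilder; the topic name, partitions, and consumer properties below are placeholders:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.batch.item.kafka.KafkaItemReader;
import org.springframework.batch.item.kafka.builder.KafkaItemReaderBuilder;
import org.springframework.context.annotation.Bean;

@Bean
public KafkaItemReader<String, String> kafkaItemReader() {
    Properties consumerProperties = new Properties();
    consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, "batch-export-group");
    consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    return new KafkaItemReaderBuilder<String, String>()
            .name("kafkaItemReader")            // required so offsets are stored in the ExecutionContext
            .topic("input-topic")
            .partitions(0, 1, 2)                // partitions assigned to this reader
            .consumerProperties(consumerProperties)
            .saveState(true)                    // persist per-partition offsets at chunk boundaries
            .build();
}

With a persistent (e.g. JDBC-backed) JobRepository, a failed execution can then be restarted and the reader resumes from the offsets saved in the last committed chunk.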

Can Kafka Streams be used instead of Producer?

Kafka Streams with a RocksDB store looks very fine for processing existing messages, but can it be used as a fault-tolerant producer?
The idea of creating a custom Processor that checks for new events in some in-memory structure every 10-100 ms and forwards them to the ProcessorContext with an attached store looks really crazy. Is there any more straightforward approach?
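For reference, the approach described above would look roughly like the sketch below, using the Processor API and a wall-clock punctuation every 100 ms; the class and queue are hypothetical, and this is a sketch of the idea rather than a recommendation:

import java.time.Duration;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

// Hypothetical processor that drains an in-memory queue of externally produced
// events every 100 ms and forwards them downstream.
public class QueueDrainingProcessor extends AbstractProcessor<String, String> {

    private final Queue<String> pendingEvents = new ConcurrentLinkedQueue<>();

    @Override
    public void init(ProcessorContext context) {
        super.init(context);
        context.schedule(Duration.ofMillis(100), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            String event;
            while ((event = pendingEvents.poll()) != null) {
                context().forward(null, event); // key unused in this sketch
            }
        });
    }

    @Override
    public void process(String key, String value) {
        // Records from the upstream topic could also be handled here.
    }

    public void enqueue(String event) {
        pendingEvents.add(event);
    }
}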

Kafka Streams stateStores fault tolerance exactly once?

We're trying to build a deduplication service using Kafka Streams.
The big picture is that it will use its RocksDB state store in order to check for existing keys during processing.
Please correct me if I'm wrong, but to make those stateStores fault tolerant too, the Kafka Streams API will transparently copy the values in the stateStore into a Kafka topic (called the changelog).
That way, if our service goes down, another service will be able to rebuild its stateStore from the changelog found in Kafka.
But it raises a question in my mind: is this "StateStore --> changelog" step itself exactly once?
I mean, when the service updates its stateStore, will it update the changelog in an exactly-once fashion too?
If the service crashes, another one will take over the load, but can we be sure it won't miss a stateStore update from the crashed service?
Regards,
Yannick
Short answer is yes.
Using transactions (atomic multi-partition writes), Kafka Streams ensures that when the offset commit is performed, the state store is also flushed to the changelog topic on the brokers. These operations are atomic, so if one of them fails, the application will reprocess the messages from the previous offset position.
You can read more about exactly-once semantics in the following blog post: https://www.confluent.io/blog/enabling-exactly-kafka-streams/. There is a section: How Kafka Streams Guarantees Exactly-Once Processing.
But it raises a question in my mind: is this "StateStore --> changelog" step itself exactly once?
Yes -- as others have already said here. You must of course configure your application to use exactly-once semantics via the configuration parameter processing.guarantee, see https://kafka.apache.org/21/documentation/streams/developer-guide/config-streams.html#processing-guarantee (this link is for Apache Kafka 2.1).
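A minimal sketch of enabling it, assuming a Kafka 2.1 Streams application built with the standard org.apache.kafka.streams classes; the application id, bootstrap servers, and topology builder are placeholders:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-service");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Turns on transactions for changelog writes, offset commits, and output records.
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

Topology topology = buildDeduplicationTopology(); // hypothetical topology builder
KafkaStreams streams = new KafkaStreams(topology, props);
streams.start();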
We're trying to build a deduplication service using Kafka Streams. The big picture is that it will use its RocksDB state store in order to check for existing keys during processing.
There's also an event de-duplication example application available at https://github.com/confluentinc/kafka-streams-examples/blob/5.1.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java. This link points to the repo branch for Confluent Platform 5.1.0, which uses Apache Kafka 2.1.0, the latest version of Kafka available right now.