Does rebuilding state stores in Kafka Streams propagate duplicate records to downstream topics? - apache-kafka

I'm currently using Kafka Streams for a stateful application. The state is not stored in a Kafka state store though, but rather just in memory for the moment being. This means whenever I restart the application, all state is lost and it has to be rebuilt by processing all records from the start.
After doing some research on Kafka state stores, this seems to be exactly the solution I'm looking for to persist state between application restarts (either in memory or on disk). However, I find the resources online lack some pretty important details, so I still have a couple of questions on how this would work exactly:
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?

State stores use their own changelog topics, and kafka-streams state stores take on responsibility for loading from them. If your state stores are uninitialised, your kafka-streams app will rehydrate its local state store from the changelog topic using EARLIEST, since it has to read every record.
This means the startup sequence for a brand new instance is roughly:
Observe there is no local state-store cache
Load the local state store by consumeing from the changelog topic for the statestore (the state-store's topic name is <state-store-name>-changelog)
Read each record and update a local rocksDB instance accordingly
Do not emit anything, since this is an application-service, not your actual topology
Read your consumer-groups offsets using EARLIEST or LATEST according to how you configured the topology. Not this is only a concern if your consumer group doesn't have any offsets yet
Process stuff, emitting records according to the topology
Whether you set your actual topology's auto.offset.reset to LATEST or EARLIEST is up to you. In the event they are lost, or you create a new group, its a balance between potentially skipping records (LATEST) vs handling reprocessing of old records & deduplication (EARLIEST),
Long story short: state-restoration is different from processing, and handled by kafka-streams its self.

If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If you are re-launching the same application (e.g. after having stopped it before), then state will not be recalculated by reprocessing the original input data. Instead, the state will be restored from its "backup" (every state store or KTable is durably stored in a Kafka topic, the so-called "changelog topic" of that table/state store for such purposes) so that its data is exactly what it was when the application was stopped. This behavior enables you to seamlessly stop+restart your applications without skipping over records that arrived between "stop" and "restart".
But there is a different caveat that you need to be aware of: The configuration to set the offset start point (latest or earliest) is only used when you run your Kafka Streams application for the first time. Afterwards, whenever you stop+restart your application, it will always continue where it previously stopped. That's because, if the app has run at least once, it has stored its consumer offset information in Kafka, which allows it to know from where to automatically resume operations once it is being restarted.
If you need the different behavior of always (re)starting from e.g. the latest offsets (thus potentially skipping records that arrived in between when you stopped the application and when you restarted it), you must reset your Kafka Streams application. One of the steps the reset tool performs is removing the application's consumer offset information from Kafka, which makes the application think that it was never started before, so to speak.
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
This reprocessing will not happen by default as explained above. State will be automatically reconstructed to its prior state (pun intended) at the point when the application was stopped.
Reprocessing would only happen if you manually reset your application (see above) and e.g. configure the application to re-read historical data (like setting auto.offset.reset to earliest after you did the reset).

Related

How to reread Kafka topic from the beginning

I have Spring Boot app that uses Kafka via Spring Kafka module. A neighboring team sends data in JSON format periodically to a compacted topic that serves as "snapshot" of their internal database at a certain moment of time. But,sometimes, the team updates contract without notification, our DTOs don't reflect the recent changes and, obviously, deserialization fails, and, because we have our listener containers configured as batched with default BATCH ack mode and BatchLoggingErrorHandler, we found, from time to time, that our Kuber pod with consumer is full of errors and we can't reread topic with fresh DTOs after microservice redeploy that simple since the last offset in every partition is commited and we can't change group.id (InfoSec department policy) to use auto.offset.reset = "earliest" as a workaround.
So, is there a way to reposition every consumer in a consumer group to the initial offset in assigned partitions programmaticaly? If it is true, I think we could write a REST endpoint, which being called triggers a "reprocessing from scratch".
See the documentation.
If your listener extends AbstractConsumerSeekAware, you can perform all kinds of seek operations (e.g. initial seek during initialization, arbitrary seeks between polls).

Kafka Streams - Global State Store restoration before stream threads starts processing

I have Global State Store attached my topology. Global state store in reading from compacted topic. This global state store stores 100,000 records and these records should be there in state store for correct processing of the topology.
Question:
Q. During application restart, kafka streams will start the global state store thread and make sure that the state is fully built before starting streams threads ?
I am trying to find some documentation related to this topic
Please point me to code or documentation also.
It depends if there is still any state remaining in the state.dir local filesystem for your application.id.
If there is, it'll start rebuilding data onto there. Otherwise, the beginning of the topic will have to be consumed to recreate that data

Permanent Kafka Streams/KSQL retention policy

I'm presently working on an use case in which user interaction with a platform is tracked, thus generating a stream of events that gets stored into kafka and will be subsequently processed in Kafka Streams/KSQL.
But I've run into an issue concerning the state store and changelog topic retention policies. User sessions could happen indefinitely apart in time, therefore I must guarantee that the state will be persisted through that period and restored in case of node and clusterwide failures. During our searches, we came accross the following information:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
Kafka Streams allows for stateful stream processing, i.e. operators that have an internal state. (...). The default implementation used by Kafka Streams DSL is a fault-tolerant state store using 1. an internally created and compacted changelog topic (for fault-tolerance) and 2. one (or multiple) RocksDB instances (for cached key-value lookups). Thus, in case of starting/stopping applications and rewinding/reprocessing, this internal data needs to get managed correctly.
(...) Thus, RocksDB memory requirement does not grow infinitely (in contrast to changelog topic). (KAFKA-4015 was fixed in 0.10.1 release, and windowed changelog topics don't grow unbounded as they apply an additional retention time parameter).
Retention time in kafka local state store / changelog
"For windowed KTables there is a local retention time and there is the changlog retention time. You can set the local store retention time via Materialized.withRetentionTime(...) -- the default value is 24h.
If a new application is created, changelog topics are created with the same retention time as local store retention time."
https://docs.confluent.io/current/streams/developer-guide/config-streams.html
The windowstore.changelog.additional.retention.ms parameter states:
Added to a windows maintainMs to ensure data is not deleted from the log prematurely. Allows for clock drift.
It would seem that Kafka Streams' maintains both a (replicated) local state store and a changelog topic for fault tolerance, with both having a finite, configurable retention period, and will apparently erase records once the retention time expires. This would lead to unnaceptable data loss in our platform, thus raising the following questions:
Does Kafka Streams actually clean up the default state store over time or have I misunderstood something? Is there an actual risk of data loss?
In that case, is it advisable or even possible to set an infinite retention policy to the state store? Or perhaps there could be another way of making sure the state will be persisted, such as using a more traditional database as state store, if that makes sense?
Does the retention policy apply to standby replicas?
If it's impossible to persist the state permanently, could there be another stream processing framework that better suits our use case?
Any clarification would be appreciated.
Seems you're asking about two different things. Session windows and changelog topics...
Compacted topics retain unique key pairs forever. Session window duration should probably be closed over time; a user session a week/month/year from one today is arguably a new session, and you should tie together each individual session window as a collection by the userId, not only store the most recent session (which implies removing previous sessions from the state store)

When does kafka stream app clean it state store?

I have a kafka streams app which is currently just joining two KStreams with a 5-minute window and writing the join result to another topic.
Since I am joining two topics over a time window, my app will have state associated with it. I was under the impression that the state stores in my app would get pruned after every the 5-minute window (Because my app cares only about the 5 min window of events for the join state).
I was expecting a constant disk-space utilization. But, seems like that is not the case. Its been 12 hrs and I do not see that the state store is getting cleaned up. It's consistently growing.
So I have multiple concerns on this now,
When does Kafka Streams app clean up its state?
If one of my app in the kafka streams app cluster fails, and I boot another host and make it join the cluster, after rebalancing, is there orphaned state store sitting in the disk for the partitions that got rebalanced?
My understanding is that the events are joined only if they happen in the defined window, so, why does kafka need to hold on to data that is older than the defined window period in its state store?
Let me know if you need any other information from me regarding my streams app. I am currently running kafka-streams version 2.2.1 and my brokers are also on the same version.
When does Kafka Streams app clean up its state?
The size of the state depends on the retention period, that is 1 day by default.
Atm, it's not possible to change the retention period for KStream-KStream joins -- it's already WIP to add this feature: https://issues.apache.org/jira/browse/KAFKA-8558
If one of my app in the kafka streams app cluster fails, and I boot another host and make it join the cluster, after rebalancing, is there orphaned state store sitting in the disk for the partitions that got rebalanced?
Yes. However, this state will be cleaned (if you restart Kafka Streams on the recovered host) if the state is not reused after a configurable (state.cleanup.delay.ms) period of time.
My understanding is that the events are joined only if they happen in the defined window, so, why does kafka need to hold on to data that is older than the defined window period in its state store?
Having a higher retention period that your window size allows Kafka Streams to process out-of-order data. Note that Kafka Streams uses event time semantics, not processing time semantics.

Kafka Streams: Any guarantees on ordering of saves to state stores when using at_least_once?

We have a Kafka Streams Java topology built with the Processor API.
In the topology, we have a single processor, that saves to multiple state stores.
As we use at_least_once, we would expect to see some inconsistencies between the state stores - e.g. an incoming record results in writes to both state store A and B, but a crash between the saves results in only the save to store A getting written to the Kafka change log topic.
Are we guaranteed that the order in which we save will also be the order in which the writes to the state stores happen? E.g. if we first save to store A and then to store B, we can of course have situation where the write to both change logs succeeded, and a situation where only the write to change log A was completed - but can we also end up in a situation where only the write to change log B was completed?
What situations will result in replays? A crash of course - but what about rebalances, new broker partition leader, or when we get an "Offset commit failed" error (The request timed out)?
A while ago, we tried using exactly_once, which resulted in a lot of error messages, that didn't make sense to us. Would exactly_once give us atomic writes across multiple state stores?
Ad 3. According to The original design document on exactly-once support in Kafka Streams I think with eaxctly_once you get atomic writes across multiple state stores
When stream.commit() is called, the following steps are executed in order:
Flush local state stores (KTable caches) to make sure all changelog records are sent downstream.
Call producer.sendOffsetsToTransactions(offsets) to commit the current recorded consumer’s positions within the transaction. Note that although the consumer of the thread can be shared among multiple tasks hence multiple producers, task’s assigned partitions are always exclusive, and hence it is safe to just commit the offsets of this tasks’ assigned partitions.
Call producer.commitTransaction() to commit the current transaction. As a result the task state represented as the above triplet is committed atomically.
Call producer.beginTransaction() again to start the next transaction.