When does a Kafka Streams app clean its state store? - apache-kafka

I have a Kafka Streams app that currently just joins two KStreams over a 5-minute window and writes the join result to another topic.
Since I am joining two topics over a time window, my app will have state associated with it. I was under the impression that the state stores in my app would get pruned after each 5-minute window (because my app only cares about the 5-minute window of events for the join state).
I was expecting constant disk-space utilization, but that does not seem to be the case. It's been 12 hours and I do not see the state store getting cleaned up; it is consistently growing.
So I now have several questions:
When does a Kafka Streams app clean up its state?
If one of the apps in my Kafka Streams cluster fails, and I boot another host and make it join the cluster, is there, after rebalancing, an orphaned state store sitting on disk for the partitions that got reassigned?
My understanding is that events are joined only if they happen within the defined window, so why does Kafka need to hold on to data older than the defined window period in its state store?
Let me know if you need any other information from me regarding my streams app. I am currently running kafka-streams version 2.2.1 and my brokers are also on the same version.

When does a Kafka Streams app clean up its state?
The size of the state depends on the retention period, which is 1 day by default.
At the moment it's not possible to change the retention period for KStream-KStream joins -- work to add this feature is already in progress: https://issues.apache.org/jira/browse/KAFKA-8558
If one of the apps in my Kafka Streams cluster fails, and I boot another host and make it join the cluster, is there, after rebalancing, an orphaned state store sitting on disk for the partitions that got reassigned?
Yes. However, if you restart Kafka Streams on the recovered host, this state will be cleaned up once it has not been reused for a configurable period of time (state.cleanup.delay.ms).
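For reference, a minimal sketch of where that knob lives, using org.apache.kafka.streams.StreamsConfig (the one-hour value is only an illustration; the default is 10 minutes):

    Properties props = new Properties();
    // Local state of tasks that are no longer assigned to this instance is deleted
    // only after it has sat unused for this long (state.cleanup.delay.ms).
    props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, TimeUnit.HOURS.toMillis(1));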
My understanding is that events are joined only if they happen within the defined window, so why does Kafka need to hold on to data older than the defined window period in its state store?
Having a retention period larger than your window size allows Kafka Streams to process out-of-order data. Note that Kafka Streams uses event-time semantics, not processing-time semantics.
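For context, a rough sketch of the kind of join the question describes, written against the 2.2.x DSL (topic names and serdes are placeholders; the window is 5 minutes, while the underlying store retention stays at its default):

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> left  = builder.stream("left-topic",  Consumed.with(Serdes.String(), Serdes.String()));
    KStream<String, String> right = builder.stream("right-topic", Consumed.with(Serdes.String(), Serdes.String()));

    left.join(right,
              (l, r) -> l + "," + r,                  // value joiner
              JoinWindows.of(Duration.ofMinutes(5)),  // the 5-minute join window
              Joined.with(Serdes.String(), Serdes.String(), Serdes.String()))
        .to("joined-topic", Produced.with(Serdes.String(), Serdes.String()));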

Related

Kafka streams losing messages in State Store during first launch of app

We have been using Kafka Streams to implement a CDC-based app. Attached is the sub-topology of interest.
The table2 topic is created by Debezium, which is connected to a SQL database. It contains 26K records. We take table2 and create a new key, which is simply a conversion of the existing key from string to int. This means we should expect #table2 = #repartition-topic = #state-store, which is actually not what we observe. What we end up with is #table2 = #repartition-topic, but #repartition-topic > #state-store. We actually lose messages and thus corrupt the state store, which leaves the app in an incorrect state. (Please note that there are no insertions into table2, as we paused the connector while verifying the cardinality.)
The above happens only during the first launch, i.e. when the app has never been launched before and the internal topics do not exist yet. Restarts of pre-existing apps do not cause any problems.
We have:
Broker on Kafka 3.2.
Client app on 2.8 or 3.2 (we tried both and faced the same issue).
All parameters are at their defaults except CACHE_MAX_BYTES_BUFFERING_CONFIG, set to 0, and NUM_STREAM_THREADS_CONFIG, set to > 1 (see the sketch below).
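For clarity, this is roughly the configuration described above (application id and bootstrap servers are placeholders):

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdc-app");        // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
    props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);     // caching disabled
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);            // > 1, as described above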
What actually worked
Use a single thread at first launch: using one thread solves the problem, and #table2 = #repartition-topic = #state-store holds.
Pre-creating the Kafka internal topics: we noticed that whenever there is a rebalance during the first launch of the Kafka Streams app, the state stores end up missing values. This also happens when you launch multiple pods in K8s, for example. Reading through the logs, we noticed that a rebalance is triggered when we first launch the app, because the internal topics get created and assigned at that point. By creating the internal topics beforehand, we avoid that rebalance and end up with #table2 = #repartition-topic = #state-store. A rough sketch of the pre-creation follows below.
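A rough sketch of that pre-creation workaround using the Admin client. The topic names below are only illustrative: the real repartition/changelog names follow the <application.id>-...-repartition / <application.id>-<store-name>-changelog convention and can be read from Topology#describe() or the application logs. Partition counts must match the source topic, and error handling is omitted:

    try (Admin admin = Admin.create(Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"))) {
        admin.createTopics(List.of(
                new NewTopic("cdc-app-KSTREAM-KEY-SELECT-0000000001-repartition", 24, (short) 3),
                new NewTopic("cdc-app-table2-store-changelog", 24, (short) 3)))
             .all().get();   // wait until the topics exist before starting the Streams app
    }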
What we noticed from the logs
In multi-thread mode, we noticed that the partitions suffering the data loss are those assigned to the thread chosen by the coordinator as the leader of the consumers. What we think is happening is the following:
1. Consumer threads are launched and register with the coordinator.
2. The coordinator assigns topics/partitions and chooses the leader among the threads.
3. The app creates the internal topics.
4. Consumers/producers process data. Specifically, the consumer leader consumes from the repartition topic, which triggers the deletion of those messages without flushing them to the changelog topic.
5. The leader is notified of a new assignment that includes the internal topics, which triggers a rebalance.
6. The leader pauses its partitions.
7. The rebalance finishes and the leader resumes its partitions.
8. The leader fetches the oldest offset of the repartition partitions it got assigned. It does not start from zero, but from where it was interrupted in step 4, so the chunk of early messages is lost.
Please note that in single-thread mode there is no data loss, which is odd since the leader is then the only thread.
So my questions are:
Are we misunderstanding what's happening?
What could be the origin of this problem?

Uneven partition assignment in Kafka Streams

I am experiencing strange assignment behavior with Kafka Streams. I have a 3-node Kafka Streams cluster. My stream is pretty straightforward: one source topic (24 partitions; all Kafka brokers run on machines other than the Kafka Streams nodes), and the stream graph only takes messages, groups them by key, performs some filtering, and writes everything to a sink topic. Everything runs with 2 stream threads on each node.
However, whenever I do a rolling update of my Kafka Streams app (always shutting down only one instance, so the other two nodes keep running), the partition assignment ends up uneven per "node" (usually 16-9-0). Only after I restart node01, and sometimes node02, does the cluster get back to a more even state.
Can somebody offer a hint on how I can achieve a more equal distribution without additional restarts?
I assume the nodes running the Kafka Streams app have identical group IDs for consumption.
I suggest you check whether the partition assignment strategy your consumers are using is org.apache.kafka.clients.consumer.RangeAssignor.
If this is the case, configure it to be org.apache.kafka.clients.consumer.RoundRobinAssignor. This way, when the group coordinator receives a JoinGroup request and hands the partitions over to the group leader, the group leader will ensure the spread between the nodes isn't uneven by more than 1.
Unless you're using an older version of Kafka Streams, the default is the range assignor, which does not guarantee an even spread across consumers.
Is your Kafka Streams application stateful? If so, you can possibly thank this well-intentioned KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams
If you want to override this behaviour, you can set acceptable.recovery.lag=9223372036854775807 (Long.MAX_VALUE).
The definition of that config, from https://docs.confluent.io/platform/current/streams/developer-guide/config-streams.html#acceptable-recovery-lag:
The maximum acceptable lag (total number of offsets to catch up from the changelog) for an instance to be considered caught-up and able to receive an active task. Streams only assigns stateful active tasks to instances whose state stores are within the acceptable recovery lag, if any exist, and assigns warmup replicas to restore state in the background for instances that are not yet caught up. Should correspond to a recovery time of well under a minute for a given workload. Must be at least 0.
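A minimal sketch of the override mentioned above:

    Properties props = new Properties();
    // Treat every instance as caught up regardless of changelog lag,
    // effectively disabling KIP-441's lag-aware task placement.
    props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, Long.MAX_VALUE);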

Permanent Kafka Streams/KSQL retention policy

I'm presently working on a use case in which user interaction with a platform is tracked, generating a stream of events that gets stored in Kafka and will subsequently be processed in Kafka Streams/KSQL.
But I've run into an issue concerning the state store and changelog topic retention policies. User sessions can occur indefinitely far apart in time, so I must guarantee that the state is persisted through that period and restored in case of node and cluster-wide failures. During our searches, we came across the following information:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
Kafka Streams allows for stateful stream processing, i.e. operators that have an internal state. (...). The default implementation used by Kafka Streams DSL is a fault-tolerant state store using 1. an internally created and compacted changelog topic (for fault-tolerance) and 2. one (or multiple) RocksDB instances (for cached key-value lookups). Thus, in case of starting/stopping applications and rewinding/reprocessing, this internal data needs to get managed correctly.
(...) Thus, RocksDB memory requirement does not grow infinitely (in contrast to changelog topic). (KAFKA-4015 was fixed in 0.10.1 release, and windowed changelog topics don't grow unbounded as they apply an additional retention time parameter).
Retention time in kafka local state store / changelog
"For windowed KTables there is a local retention time and there is the changlog retention time. You can set the local store retention time via Materialized.withRetentionTime(...) -- the default value is 24h.
If a new application is created, changelog topics are created with the same retention time as local store retention time."
https://docs.confluent.io/current/streams/developer-guide/config-streams.html
The windowstore.changelog.additional.retention.ms parameter states:
Added to a windows maintainMs to ensure data is not deleted from the log prematurely. Allows for clock drift.
It would seem that Kafka Streams maintains both a (replicated) local state store and a changelog topic for fault tolerance, both with a finite, configurable retention period, and will apparently erase records once the retention time expires. This would lead to unacceptable data loss in our platform, which raises the following questions:
Does Kafka Streams actually clean up the default state store over time or have I misunderstood something? Is there an actual risk of data loss?
In that case, is it advisable or even possible to set an infinite retention policy on the state store? Or could there be another way of making sure the state is persisted, such as using a more traditional database as the state store, if that makes sense?
Does the retention policy apply to standby replicas?
If it's impossible to persist the state permanently, could there be another stream processing framework that better suits our use case?
Any clarification would be appreciated.
It seems you're asking about two different things: session windows and changelog topics.
Compacted topics retain the latest value for each unique key forever. Session windows should probably be closed over time; a user session a week/month/year from today is arguably a new session, and you should tie the individual session windows together as a collection keyed by the userId, rather than only storing the most recent session (which implies removing previous sessions from the state store). A sketch follows below.
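A hedged sketch of that idea using session windows with an inactivity gap, keyed by userId (topic names, types and the 30-minute gap are placeholders):

    KStream<String, String> events = builder.stream("user-events",
            Consumed.with(Serdes.String(), Serdes.String()));        // key = userId

    events.groupByKey()
          .windowedBy(SessionWindows.with(Duration.ofMinutes(30)))   // a session closes after 30 min of inactivity
          .count()                                                   // or aggregate(...) to keep the session payload
          .toStream((windowedKey, count) -> windowedKey.key())       // re-key the result by userId
          .to("user-session-counts", Produced.with(Serdes.String(), Serdes.Long()));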

Does rebuilding state stores in Kafka Streams propagate duplicate records to downstream topics?

I'm currently using Kafka Streams for a stateful application. The state is not stored in a Kafka state store though, but rather just in memory for the time being. This means whenever I restart the application, all state is lost and it has to be rebuilt by processing all records from the start.
After doing some research on Kafka state stores, they seem to be exactly the solution I'm looking for to persist state between application restarts (either in memory or on disk). However, I find that the resources online lack some pretty important details, so I still have a couple of questions on how this would work exactly:
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
State stores use their own changelog topics, and Kafka Streams takes responsibility for loading from them. If your state stores are uninitialized, your Kafka Streams app will rehydrate its local state store from the changelog topic using EARLIEST, since it has to read every record.
This means the startup sequence for a brand new instance is roughly:
Observe there is no local state-store cache
Load the local state store by consuming from the changelog topic for the state store (the changelog topic's name is <application.id>-<state-store-name>-changelog)
Read each record and update a local RocksDB instance accordingly
Emit nothing downstream while doing so, since this is internal restoration machinery, not your actual topology
Read your consumer group's offsets, using EARLIEST or LATEST according to how you configured the topology. Note this is only a concern if your consumer group doesn't have any offsets yet
Process stuff, emitting records according to the topology
Whether you set your actual topology's auto.offset.reset to LATEST or EARLIEST is up to you. In the event the offsets are lost, or you create a new group, it's a balance between potentially skipping records (LATEST) versus reprocessing old records and handling deduplication (EARLIEST).
Long story short: state restoration is different from processing, and is handled by Kafka Streams itself.
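If you want to watch that restoration phase (and confirm nothing is emitted downstream during it), you can attach the standard restore-listener hook before calling start(); a rough sketch:

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.setGlobalStateRestoreListener(new StateRestoreListener() {
        @Override
        public void onRestoreStart(TopicPartition tp, String store, long startOffset, long endOffset) {
            System.out.printf("Restoring %s from %s: offsets %d..%d%n", store, tp, startOffset, endOffset);
        }
        @Override
        public void onBatchRestored(TopicPartition tp, String store, long batchEndOffset, long numRestored) {
            // called repeatedly while the changelog is replayed; the topology is not running yet
        }
        @Override
        public void onRestoreEnd(TopicPartition tp, String store, long totalRestored) {
            System.out.printf("Restored %d records into %s%n", totalRestored, store);
        }
    });
    streams.start();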
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If you are re-launching the same application (e.g. after having stopped it before), then state will not be recalculated by reprocessing the original input data. Instead, the state will be restored from its "backup" (every state store or KTable is durably stored in a Kafka topic, the so-called "changelog topic" of that table/state store for such purposes) so that its data is exactly what it was when the application was stopped. This behavior enables you to seamlessly stop+restart your applications without skipping over records that arrived between "stop" and "restart".
But there is a different caveat that you need to be aware of: The configuration to set the offset start point (latest or earliest) is only used when you run your Kafka Streams application for the first time. Afterwards, whenever you stop+restart your application, it will always continue where it previously stopped. That's because, if the app has run at least once, it has stored its consumer offset information in Kafka, which allows it to know from where to automatically resume operations once it is being restarted.
If you need the different behavior of always (re)starting from e.g. the latest offsets (thus potentially skipping records that arrived in between when you stopped the application and when you restarted it), you must reset your Kafka Streams application. One of the steps the reset tool performs is removing the application's consumer offset information from Kafka, which makes the application think that it was never started before, so to speak.
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
This reprocessing will not happen by default as explained above. State will be automatically reconstructed to its prior state (pun intended) at the point when the application was stopped.
Reprocessing would only happen if you manually reset your application (see above) and e.g. configure the application to re-read historical data (like setting auto.offset.reset to earliest after you did the reset).
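For completeness, a sketch of where that setting goes in a Streams app (it is only consulted when the application's consumer group has no committed offsets, e.g. on the very first start or after running the reset tool):

    Properties props = new Properties();
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");  // or "latest"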

Kafka Streams processor taking long time to consume changelog topics and initialize state stores

I'm working on a stream processor which has a KStream-KStream join and a KStream-KTable join, and which also uses a state store to remove duplicates while doing the join.
We have been performing load tests for this processor and the messages in the topics are growing, which causes the stream processor to take a long time (~1 hour) to consume the changelog topics and initialize the state stores whenever a restart/redeployment happens.
We have a retention of 7 days for the topics.
There are multiple reasons why this happens:
Your broker performance, i.e. how much data your KStream app can pull from each broker
Your KStream performance
Your serialization format (if you use Avro, the data size will be way smaller)
The solution to avoid expensive restarts is to have a persistent local state store. For example, you can map the default state store folder (/tmp/kafka-streams) to some sort of persistent volume, as sketched below.
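A sketch of pointing the state directory at such a volume (the mount path is a placeholder):

    Properties props = new Properties();
    // Keep RocksDB state on a volume that survives restarts instead of the default
    // /tmp/kafka-streams, so a restarted instance can reuse its local stores
    // rather than replaying the whole changelog.
    props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");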