Kafka Streams Window Store Keeping Data for Much Longer than Retention Period

The use case is to flush records from partitions that did not receive new data in Kafka Streams, since we are using suppress, which requires stream time to advance.
So we have a window store with a tumbling window of 1 minute and a reduce operation followed by suppress. The design of suppress depends on stream time, so if a partition does not receive any new consumer record, suppress will not move ahead for the pending consumer records in that partition.
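The dependency on stream time can be shown with a toy Python model of a single partition. It is only a sketch: the class and its fields are hypothetical (not Kafka Streams APIs), no grace period is assumed, and a sum stands in for the reduce operation.

```python
from dataclasses import dataclass, field

WINDOW_MS = 60_000  # 1-minute tumbling window, as in the question
GRACE_MS = 0        # assumption: no grace period


@dataclass
class PartitionModel:
    """Toy model of one partition feeding reduce + suppress (not a Kafka API)."""
    stream_time: int = -1                        # highest record timestamp seen
    pending: dict = field(default_factory=dict)  # window start -> reduced value

    def process(self, timestamp_ms: int, value: int) -> list:
        """Add one record and return the windows that suppress would release."""
        # Stream time advances only when a record arrives, never by wall clock.
        self.stream_time = max(self.stream_time, timestamp_ms)
        start = timestamp_ms - timestamp_ms % WINDOW_MS
        if start + WINDOW_MS + GRACE_MS > self.stream_time:
            # Window still open: fold the value in (a sum stands in for reduce).
            self.pending[start] = self.pending.get(start, 0) + value
        closed = sorted(s for s in self.pending
                        if s + WINDOW_MS + GRACE_MS <= self.stream_time)
        return [(s, self.pending.pop(s)) for s in closed]


partition = PartitionModel()
partition.process(5_000, 1)    # window [0, 60000) stays buffered
partition.process(30_000, 2)   # still buffered, stream time is 30000
# Without a further record, stream time stays at 30000 and nothing is released.
released = partition.process(61_000, 10)  # a later record closes the first window
```

In this model the first window is released only by the record at 61000; if that record never arrives, the reduced value stays in the buffer, which is the situation described above.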
It is worth noting that the retention period of the state store is set to 65 seconds.
So, to do an explicit flush from the window state store, we decided to go with the transform API and used it in the DSL topology.
In the transform node we use context.schedule to schedule a punctuator that accesses the state store and runs a windowed query, i.e. fetchAll(startTimeInstant, endTimeInstant), to get old keys that have still not been flushed out.
It is worth noting from the documentation that the retention period is the minimum amount of time the data will stay in the window store. Data in a window is removed only once all of its records are old enough.
Now the idea is that successfully flushed records should no longer be in the state store when we run fetchAll (as the start time is UTC now minus 3 minutes). But data up to 6 minutes old, which was already flushed out, is still there in the window store.
The PROBLEM here is that I do not want to see old records in the window store, because then the payload has to be inspected/parsed to decide whether to flush the data or not, which is performance intensive.
I also checked the compact/delete policy of the changelog topic. It also has 65 seconds.
I know the classic approach is to send keep-alive packets on all partitions of the input topic, but that is not feasible in our case because the input topic is used by multiple clients. They would all have to change.
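The statement that retention is only a minimum can be illustrated with a small Python model. It is a sketch under assumptions that this thread does not confirm: that the window store groups records into time segments, that the default segment interval is max(retention / 2, 60 seconds), and that a segment is dropped as a whole once stream time (not wall-clock time) has moved past it by the retention period. The function names are hypothetical and are not part of the Kafka Streams API.

```python
RETENTION_MS = 65_000  # retention period from the question

# Assumption: records are stored in time segments, and the default segment
# interval is max(retention / 2, 60 seconds).
SEGMENT_MS = max(RETENTION_MS // 2, 60_000)


def segment_id(timestamp_ms: int) -> int:
    """Segment that a timestamp falls into in this model."""
    return timestamp_ms // SEGMENT_MS


def is_still_stored(record_ts_ms: int, stream_time_ms: int) -> bool:
    """True while the record's whole segment is still live in this model."""
    # Assumption: expiry is driven by stream time and removes whole segments.
    min_live_segment = segment_id(stream_time_ms - RETENTION_MS)
    return segment_id(record_ts_ms) >= min_live_segment


kept_at_65s = is_still_stored(0, 65_000)    # True: retention reached, segment live
kept_at_124s = is_still_stored(0, 124_999)  # True: segment 0 is still live
kept_at_125s = is_still_stored(0, 125_000)  # False: segment 0 has been dropped
```

Under these assumptions, a record with timestamp 0 stays visible until stream time reaches 125 seconds, and it stays visible for as long as no new records arrive to advance stream time.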

Related

Kafka Streams: reprocessing old data when windowing

I have a Kafka Streams application that performs windowing (using the original event time, not wall-clock time) via stream joins with a window of e.g. 1 day.
If I bring up this topology and reprocess the data from the start (as in a lambda-style architecture), will this window keep that old data there?
For example: if today is 2022-01-09, and I'm receiving data from 2021-03-01, will this old data enter the table, or will it be rejected from the start?
In that case, what strategies can be used to reprocess this data?
UPDATE: Using Kafka Streams 2.5.0

Updated Answer to OP Kafka Streams version 2.5:
When using event time, Kafka Streams behaves independently of wall-clock time, as long as no events contain the wall-clock time. You should not have configured a WallclockTimestampExtractor as your timestamp extractor.
Kafka Streams will assign your input topic partitions to stream tasks, which consume the partitions one event at a time. On any given topic, at most one partition will be assigned to a stream task. Time-windowed aggregations are carried out for each stream task separately. Kafka Streams uses an internal timestamp called "observedStreamTime" for each aggregation to keep track of the maximum timestamp seen so far. Incoming records are checked for their timestamp in comparison to the observedStreamTime. If they are older than the retention + grace period of the configured time window store, they will be dropped. Otherwise, they will be aggregated according to the configuration. The implementation can be found at https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamWindowAggregate.java#L108-L175
This processing will always yield the same result if the Kafka Streams application is reset. It is independent of the time at which the processing is executed. If events are dropped, the corresponding metrics are updated.
There is one caveat with this approach, when multiple topics are consumed. The observedStreamTime will reflect the highest timestamp of all partitions read by the stream task. If you have two topics (maybe because you want to join them) and one contains considerably younger data than the other (maybe because the latter received no new data), the observedStreamTime will be dominated by the younger topic. Events of the older topic might be dropped, if the time window configuration does not have enough retention or grace periods. See the JavaDoc of TimeWindows on the configuration options: https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindows.java
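The difference between linear reprocessing and the multi-topic caveat can be sketched with a toy Python model of one stream task. The class is hypothetical and not a Kafka Streams API; it assumes a 1-day tumbling window, no grace period, and that a record is dropped once the end of its window plus the grace period is at or before the observed stream time.

```python
DAY_MS = 86_400_000
WINDOW_MS = DAY_MS  # 1-day window, as in the question
GRACE_MS = 0        # assumption for this sketch


class TaskModel:
    """Toy model of one stream task; names are illustrative, not Kafka APIs."""

    def __init__(self):
        self.observed_stream_time = -1  # highest timestamp seen by this task
        self.dropped = 0                # stands in for the dropped-records metric

    def process(self, timestamp_ms: int) -> bool:
        """Return True if the record is aggregated, False if it is dropped."""
        self.observed_stream_time = max(self.observed_stream_time, timestamp_ms)
        window_end = timestamp_ms - timestamp_ms % WINDOW_MS + WINDOW_MS
        if window_end + GRACE_MS <= self.observed_stream_time:
            self.dropped += 1
            return False
        return True


# Reprocessing one topic from the start: timestamps only move forward.
linear = TaskModel()
linear_results = [linear.process(day * DAY_MS) for day in range(5)]

# Two inputs in one task: the younger input pushes stream time far ahead first.
mixed = TaskModel()
young_accepted = mixed.process(300 * DAY_MS)
old_accepted = mixed.process(10 * DAY_MS)
```

Here every record of the linear run is accepted, while the record from day 10 is dropped in the mixed run because stream time already sits at day 300.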
In your example the old data will be accepted, as long as the stream time has not progressed too far. Reprocessing the whole data set should work, since it will progress linearly through your topic. If old data belongs to a time window whose window size + grace period has already been exceeded by the stream time, Kafka Streams will reject the record. In that case Kafka Streams will also issue a log message and adjust its metrics accordingly. So this behaviour should be easy to pick up.
I suggest trying out this reprocessing if feasible and watching the logs and metrics.

Apache Kafka: large retention time vs. fast read of last value

Dear Apache Kafka friends,
I have a use case for which I am looking for an elegant solution:
Data is published to a Kafka topic at a relatively high rate. There are two competing requirements:
all records should be kept for 7 days (which is configured by min.compaction.lag)
applications should read the "last status" from the topic during their initialization phase
LogCompaction is enabled in order for the "last state" to be available in the topic.
Now comes the problem. If an application wants to initialize itself from the topic, it has to read a lot of records to get the last state for all keys (the entire topic content must be processed). But with this number of records, that cannot be done with acceptable performance.
Idea
A streaming process streams the data of the topic into a corresponding ShortTerm topic which has a much shorter min.compaction.lag time (1 hour). The applications initialize themselves from this topic.
Risk
The streaming process is a potential source of errors. If it temporarily fails, the applications will no longer receive the latest status.
My Question
Are there any other possible solutions that satisfy the two requirements? Did I maybe miss a Kafka concept that helps to handle these competing requirements?
Any contribution is welcome. Thank you all.

If you don't have a strict guarantee of how frequently each key will be updated, you cannot do anything other than what you proposed.
To avoid the risk that the downstream app does not get new updates (because the data replication job stalls), I would recommend only bootstrapping an app from the short-term topic and letting it consume from the original topic afterwards. To not miss any updates, you can sync the switch-over as follows:
1. On app startup, get the replication job's committed offsets from the original topic.
2. Get the short term topic's current end-offsets (because the replication job will continue to write data, you just need a fixed stopping point).
3. Consume the short term topic from beginning to the captured end offsets.
4. Resume consuming from the original topic using the captured committed offsets (from step 1) as start point.
This way, you might read some messages twice, but you won't lose any updates.
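The four steps can be sketched in Python with plain lists standing in for topics. This is a simplified model, not Kafka client code: it assumes a single partition, the list index plays the role of the offset, and the function and parameter names are hypothetical.

```python
def bootstrap_then_switch(original, short_term, replication_committed_offset):
    """Build the latest state per key from two in-memory "topics".

    original, short_term: lists of (key, value) records; list index = offset.
    replication_committed_offset: first offset of `original` that the
    replication job has not yet copied (captured before anything else).
    """
    # Step 1: remember the replication job's committed offset on the original.
    resume_from = replication_committed_offset
    # Step 2: capture the short-term topic's end offset as a fixed stop point.
    stop_at = len(short_term)
    # Step 3: read the short-term topic from the beginning up to that point.
    state = {}
    for key, value in short_term[:stop_at]:
        state[key] = value
    # Step 4: continue on the original topic from the offset of step 1.
    for key, value in original[resume_from:]:
        state[key] = value
    return state


original = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("c", 1)]
# The committed offset was captured as 3; by the time the end offset is
# captured, the replication job has already copied ("b", 2) as well.
short_term = [("b", 1), ("a", 2), ("b", 2)]
state = bootstrap_then_switch(original, short_term, replication_committed_offset=3)
```

In this run ("b", 2) is read twice, once from each topic, but no update is lost. With several partitions, the same capture would have to be done per partition.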

To me, the two requirements you have mentioned together with the requirement for new consumers are not competing. In fact, I do not see any reason why you should keep a message of an outdated key in your topic for 7 days, because
New consumers are only interested in the latest message of a key.
Already existing consumers will have processed the message within 1 hour (as taken from your comments).
Therefore, my understanding is that your requirement "all records should be kept for 7 days" can be replaced by "each consumer should have enough time to consume the message & the latest message for each key should be kept for 7 days".
Please correct me if I am wrong and explain which consumer actually does need "all records for 7 days".
If that is the case you could do the following:
Enable log compaction as well as time-based retention of 7 days for this topic.
Fine-tune the compaction frequency to be very eager, meaning to keep as few outdated messages per key as possible.
Set min.compaction.lag to 1 hour so that all consumers have the chance to keep up.
That way, new consumers will read (almost) only the latest message for each key. If that is not performant enough, you can try increasing the partitions and consumer threads of your consumer groups.
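As a rough illustration of these three settings, the topic-level configuration could look like the Python dictionary below. The keys are Kafka topic configuration names (min.compaction.lag.ms is the full name of the setting called min.compaction.lag in this thread). The values for min.cleanable.dirty.ratio and segment.ms are assumptions chosen only to make compaction eager and would need tuning for a real workload.

```python
from datetime import timedelta


def ms(delta: timedelta) -> str:
    """Render a duration as the millisecond string that topic configs expect."""
    return str(int(delta.total_seconds() * 1000))


topic_config = {
    # Compaction plus time-based deletion on the same topic.
    "cleanup.policy": "compact,delete",
    # Time-based retention of 7 days.
    "retention.ms": ms(timedelta(days=7)),
    # Records younger than 1 hour are not compacted away.
    "min.compaction.lag.ms": ms(timedelta(hours=1)),
    # Assumed value: a low dirty ratio makes the log cleaner run eagerly.
    "min.cleanable.dirty.ratio": "0.01",
    # Assumed value: roll segments often, since the active segment
    # is not compacted.
    "segment.ms": ms(timedelta(minutes=10)),
}
```

Such a dictionary could be handed to whatever admin tooling creates or alters the topic; how eager compaction can be in practice depends on broker load and partition count.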

Permanent Kafka Streams/KSQL retention policy

I'm presently working on a use case in which user interaction with a platform is tracked, thus generating a stream of events that gets stored in Kafka and is subsequently processed in Kafka Streams/KSQL.
But I've run into an issue concerning the state store and changelog topic retention policies. User sessions could happen indefinitely far apart in time, therefore I must guarantee that the state will be persisted through that period and restored in case of node and cluster-wide failures. During our searches, we came across the following information:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
Kafka Streams allows for stateful stream processing, i.e. operators that have an internal state. (...). The default implementation used by Kafka Streams DSL is a fault-tolerant state store using 1. an internally created and compacted changelog topic (for fault-tolerance) and 2. one (or multiple) RocksDB instances (for cached key-value lookups). Thus, in case of starting/stopping applications and rewinding/reprocessing, this internal data needs to get managed correctly.
(...) Thus, RocksDB memory requirement does not grow infinitely (in contrast to changelog topic). (KAFKA-4015 was fixed in 0.10.1 release, and windowed changelog topics don't grow unbounded as they apply an additional retention time parameter).
Retention time in kafka local state store / changelog
"For windowed KTables there is a local retention time and there is the changelog retention time. You can set the local store retention time via Materialized.withRetentionTime(...) -- the default value is 24h.
If a new application is created, changelog topics are created with the same retention time as local store retention time."
https://docs.confluent.io/current/streams/developer-guide/config-streams.html
The windowstore.changelog.additional.retention.ms parameter states:
Added to a windows maintainMs to ensure data is not deleted from the log prematurely. Allows for clock drift.
It would seem that Kafka Streams maintains both a (replicated) local state store and a changelog topic for fault tolerance, with both having a finite, configurable retention period, and will apparently erase records once the retention time expires. This would lead to unacceptable data loss in our platform, thus raising the following questions:
Does Kafka Streams actually clean up the default state store over time or have I misunderstood something? Is there an actual risk of data loss?
In that case, is it advisable or even possible to set an infinite retention policy to the state store? Or perhaps there could be another way of making sure the state will be persisted, such as using a more traditional database as state store, if that makes sense?
Does the retention policy apply to standby replicas?
If it's impossible to persist the state permanently, could there be another stream processing framework that better suits our use case?
Any clarification would be appreciated.

It seems you're asking about two different things: session windows and changelog topics.
Compacted topics retain the latest record for each unique key indefinitely. Session windows should probably be closed after some time; a user session that starts a week/month/year after one from today is arguably a new session, and you should tie the individual session windows together as a collection keyed by the userId, not only store the most recent session (which implies removing previous sessions from the state store).

Does rebuilding state stores in Kafka Streams propagate duplicate records to downstream topics?

I'm currently using Kafka Streams for a stateful application. The state is not stored in a Kafka state store though, but rather just in memory for the time being. This means whenever I restart the application, all state is lost and it has to be rebuilt by processing all records from the start.
After doing some research on Kafka state stores, this seems to be exactly the solution I'm looking for to persist state between application restarts (either in memory or on disk). However, I find the resources online lack some pretty important details, so I still have a couple of questions on how this would work exactly:
If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?

State stores use their own changelog topics, and kafka-streams state stores take on responsibility for loading from them. If your state stores are uninitialised, your kafka-streams app will rehydrate its local state store from the changelog topic using EARLIEST, since it has to read every record.
This means the startup sequence for a brand new instance is roughly:
Observe there is no local state-store cache
Load the local state store by consuming from the changelog topic of the state store (the changelog topic's name is <application.id>-<state-store-name>-changelog)
Read each record and update a local rocksDB instance accordingly
Do not emit anything, since this is an application-service, not your actual topology
Read your consumer group's offsets, using EARLIEST or LATEST according to how you configured the topology. Note this is only a concern if your consumer group doesn't have any offsets yet
Process stuff, emitting records according to the topology
Whether you set your actual topology's auto.offset.reset to LATEST or EARLIEST is up to you. In the event the offsets are lost, or you create a new group, it's a balance between potentially skipping records (LATEST) and handling reprocessing of old records & deduplication (EARLIEST).
Long story short: state restoration is different from processing, and is handled by Kafka Streams itself.
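The startup sequence above can be modelled in a few lines of Python. This is a toy sketch with hypothetical names and in-memory lists instead of topics; it assumes a single partition and a running sum per key as the stateful operation.

```python
def start_instance(changelog, input_topic, committed_offset, auto_offset_reset):
    """Toy model of a Kafka Streams instance starting without local state.

    changelog, input_topic: lists of (key, value); list index = offset.
    committed_offset: the consumer group's offset, or None if it has none.
    Returns (state, emitted), where emitted is what reaches the output topic.
    """
    state, emitted = {}, []
    # Restoration: replay the changelog into the local store, emit nothing.
    for key, value in changelog:
        state[key] = value
    # Committed offsets win; auto.offset.reset only applies without them.
    if committed_offset is not None:
        position = committed_offset
    elif auto_offset_reset == "earliest":
        position = 0
    else:
        position = len(input_topic)
    # Processing: only records from `position` onwards are processed and emitted.
    for key, value in input_topic[position:]:
        state[key] = state.get(key, 0) + value
        emitted.append((key, state[key]))
    return state, emitted


input_topic = [("a", 1), ("a", 2), ("a", 3)]
# Before the restart, two records were processed: the sums 1 and 3 were
# written to the changelog and offset 2 was committed.
restarted = start_instance([("a", 1), ("a", 3)], input_topic, 2, "latest")
# A brand-new group with an empty changelog reads everything with "earliest".
fresh = start_instance([], input_topic, None, "earliest")
```

In the restarted case only ("a", 6) is emitted, so the earlier outputs are not duplicated; the fresh group emits all three running sums.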

If the stream is set to start from offset latest, will the state still be (re)calculated from all the previous records?
If you are re-launching the same application (e.g. after having stopped it before), then state will not be recalculated by reprocessing the original input data. Instead, the state will be restored from its "backup" (every state store or KTable is durably stored in a Kafka topic, the so-called "changelog topic" of that table/state store for such purposes) so that its data is exactly what it was when the application was stopped. This behavior enables you to seamlessly stop+restart your applications without skipping over records that arrived between "stop" and "restart".
But there is a different caveat that you need to be aware of: The configuration to set the offset start point (latest or earliest) is only used when you run your Kafka Streams application for the first time. Afterwards, whenever you stop+restart your application, it will always continue where it previously stopped. That's because, if the app has run at least once, it has stored its consumer offset information in Kafka, which allows it to know from where to automatically resume operations once it is being restarted.
If you need the different behavior of always (re)starting from e.g. the latest offsets (thus potentially skipping records that arrived in between when you stopped the application and when you restarted it), you must reset your Kafka Streams application. One of the steps the reset tool performs is removing the application's consumer offset information from Kafka, which makes the application think that it was never started before, so to speak.
If previously already processed records need to be reprocessed in order to rebuild the state, will this propagate records through the rest of the Streams topology (e.g. InputTopic -> stateful processor -> OutputTopic, will this result in duplicated records in the OutputTopic because of rebuilding state)?
This reprocessing will not happen by default as explained above. State will be automatically reconstructed to its prior state (pun intended) at the point when the application was stopped.
Reprocessing would only happen if you manually reset your application (see above) and e.g. configure the application to re-read historical data (like setting auto.offset.reset to earliest after you did the reset).

Kafka Streams Sort Within Processing Time Window

I wonder if there's any way to sort records within a window using Kafka Streams DSL or Processor API.
Imagine the following situation as an example (arbitrary one, but similar to what I need):
There is a Kafka topic of some events, let's say user clicks. Let's say topic has 10 partitions. Messages are partitioned by key, but each key is unique, so it's sort of a random partitioning. Each record contains a user id, which is used later to repartition the stream.
We consume the stream, and publish each message to another topic, partitioning the record by its user id (repartition the original stream by user id).
Then we consume this repartitioned stream, and we store consumed records in a local state store windowed by 10 minutes. All clicks of a particular user are always in the same partition, but order is not guaranteed, because the original topic had 10 partitions.
I understand the windowing model of Kafka Streams, and that time is advanced when new records come in, but I need this window to use processing time, not the event time, and then when window is expired, I need to be able to sort buffered events, and emit them in that order to another topic.
Notice:
We need to be able to flush/process records within the window using processing time, not the event time. We can't wait for the next click to advance the time, because it may never happen.
We need to remove all the records from the store as soon as the window is sorted and flushed.
If application crashes, we need to recover (in the same or another instance of the application) and process all the windows that were not processed, without waiting for new records to come for a particular user.
I know Kafka Streams 1.0.0 allows using wall-clock time in the Processor API, but I'm not sure what would be the right way to implement what I need (more importantly, taking into account the recovery process requirement described above).

You can see my answer to a similar question here:
https://stackoverflow.com/a/44345374/7897191
Since your message keys are already unique you can ignore my comments about de-duplication.
Now that KIP-138 (wall-clock punctuation semantics) has been released in 1.0.0 you should be able to implement the outlined algorithm without issues. It uses the Processor API. I don't know of a way of doing this with only the DSL.
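The linked answer is not reproduced here, but the requirements from the question (buffer by processing time, sort on expiry, then delete) can be sketched as a toy Python model. The class and method names are hypothetical; in a real application the buffer would be a fault-tolerant state store and punctuate would be driven by a wall-clock punctuation scheduled through the Processor API.

```python
from collections import defaultdict

WINDOW_MS = 10 * 60 * 1000  # 10-minute window, as in the question


class SortingBuffer:
    """Toy model: buffer by arrival time, sort by event time when flushing."""

    def __init__(self):
        # (user_id, window start by arrival time) -> list of (event_ts, payload)
        self.store = defaultdict(list)

    def process(self, user_id, event_ts_ms, payload, now_ms):
        """Buffer a record in the window given by its wall-clock arrival time."""
        window_start = now_ms - now_ms % WINDOW_MS
        self.store[(user_id, window_start)].append((event_ts_ms, payload))

    def punctuate(self, now_ms):
        """Flush every window that has expired by wall-clock time."""
        expired = sorted(k for k in self.store if k[1] + WINDOW_MS <= now_ms)
        emitted = []
        for user_id, window_start in expired:
            # Removing the window from the store is part of the flush.
            records = self.store.pop((user_id, window_start))
            for event_ts_ms, payload in sorted(records, key=lambda r: r[0]):
                emitted.append((user_id, event_ts_ms, payload))
        return emitted


buffer = SortingBuffer()
buffer.process("u1", 2_000, "click-b", now_ms=1_000)  # arrives first, is newer
buffer.process("u1", 1_000, "click-a", now_ms=2_000)  # arrives second, is older
early = buffer.punctuate(now_ms=300_000)    # window still open, nothing flushed
flushed = buffer.punctuate(now_ms=600_000)  # window expired: sorted and removed
```

Because flushing depends only on the time passed to punctuate, no further click is needed to emit a window. If the buffer is restored after a crash, the first punctuation would pick up windows that expired in the meantime.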
