Deduplication using Kafka-Streams - apache-kafka

I want to do deduplication in my Kafka Streams application, which uses a state store, and I am following this very good example:
https://github.com/confluentinc/kafka-streams-examples/blob/5.5.0-post/src/test/java/io/confluent/examples/streams/EventDeduplicationLambdaIntegrationTest.java
I have a few questions about this example.
As far as I understand, this example briefly does the following:
A message comes into the input topic.
Look it up in the store; if it does not exist, write it to the state store and forward it.
If it does exist, drop the record, so deduplication is applied.
But in the code example there is also a time window size that you can set, as well as a retention time for the messages in the state store. You can also check whether a record is in the store by giving the timestamps timeFrom and timeTo:
final long eventTime = context.timestamp();
final WindowStoreIterator<String> timeIterator = store.fetch(
    key,
    eventTime - leftDurationMs,
    eventTime + rightDurationMs
);
What is the actual purpose of timeTo and timeFrom? I am not sure why I am checking the next time interval, since that would mean checking for future messages that have not arrived in my topic yet.
My second question: is this time interval related to the previous time window, and should it hit it?
If I am able to search a time interval by giving timeTo and timeFrom, why is the time window size important?
If I give the window size as 12 hours, can I guarantee that messages are deduplicated for 12 hours?
I think of it like this:
The first message comes with key "A" in the first minute after the application starts up; after 11 hours, a message with key "A" comes again. Can I catch this duplicated message by giving a large enough time interval, such as eventTime - 12 hours?
Thanks for any ideas!

The time window size decides how long you want the deduplication to run: no duplicates forever, or no duplicates only within, say, 5 minutes. Kafka has to store these records, so a large time window may consume a lot of resources on your server.
timeFrom and timeTo exist because your record (event) may arrive or be processed late in Kafka, so the event time of the record may be 1 minute ago, not now. While Kafka Streams is processing such an "old" record, it also needs to take care of records that are not that old, i.e. records that are in the relative "future" of the "old" one.
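For reference, here is a condensed sketch of the kind of deduplication transformer the linked example builds. It is only an illustration under assumptions: the store name "dedup-store", the String types, and deduplicating on the record key are made up here, and the real example also schedules punctuation to purge old entries.

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.WindowStore;
import org.apache.kafka.streams.state.WindowStoreIterator;

public class DeduplicationTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private final long leftDurationMs;   // how far back in event time to look for an earlier occurrence
    private final long rightDurationMs;  // how far "forward" to look, to cover out-of-order arrivals
    private ProcessorContext context;
    private WindowStore<String, String> store;

    public DeduplicationTransformer(final long leftDurationMs, final long rightDurationMs) {
        this.leftDurationMs = leftDurationMs;
        this.rightDurationMs = rightDurationMs;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(final ProcessorContext context) {
        this.context = context;
        this.store = (WindowStore<String, String>) context.getStateStore("dedup-store"); // assumed store name
    }

    @Override
    public KeyValue<String, String> transform(final String key, final String value) {
        final long eventTime = context.timestamp();
        // Has this key already been seen within [eventTime - left, eventTime + right]?
        try (final WindowStoreIterator<String> iterator =
                 store.fetch(key, eventTime - leftDurationMs, eventTime + rightDurationMs)) {
            if (iterator.hasNext()) {
                return null;                               // duplicate: drop the record
            }
            store.put(key, value, eventTime);              // remember this key at its event time
            return KeyValue.pair(key, value);              // first occurrence: forward it
        }
    }

    @Override
    public void close() {
    }
}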

Related

Is time-based log compaction in Kafka based on wall-clock time, event time, or a mix of both?

I have been trying to understand how to set up time-based log compaction, but I still can't understand its behavior properly. In particular, I am interested in the behavior of log.roll.ms.
What I would like to understand is the following statement taken from the official Kafka documentation: https://kafka.apache.org/documentation.html#upgrade_10_1_breaking
The log rolling time no longer depends on the log segment create time. Instead, it is now based on the timestamp in the messages. More specifically, if the timestamp of the first message in the segment is T, the log will be rolled out when a new message has a timestamp greater than or equal to T + log.roll.ms.
Regarding T + log.roll.ms:
a) I understand that T is based on the timestamp of the message and can therefore be considered event time. However, what is the clock behind log.roll.ms? In Kafka Streams, for instance, when working with event time it is clear what the stream time is: it is the highest timestamp seen so far. So does the time for log compaction progress with the timestamps of the messages, and is therefore event time, or does it progress based on the wall-clock time of the brokers?
I thought it was event time, but then I saw the following talk, https://www.confluent.io/kafka-summit-san-francisco-2019/whats-the-time-and-why/, where Matthias J. Sax talks about it. From his talk I got confused. It seems that compaction is indeed driven by both the event time T and wall-clock time.
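To make the quoted rule concrete, here is a tiny, hypothetical restatement of the rolling condition in code. The method and parameter names are made up; it only encodes the sentence from the upgrade notes, not the broker's actual implementation, and whether the message timestamps are event time or broker append time depends on the topic's message.timestamp.type setting.

// Hypothetical helper restating the quoted rule: the segment whose first message
// has timestamp T is rolled once a new message arrives with timestamp >= T + log.roll.ms.
static boolean shouldRollSegment(final long firstMessageTimestampInSegment,
                                 final long newMessageTimestamp,
                                 final long logRollMs) {
    return newMessageTimestamp >= firstMessageTimestampInSegment + logRollMs;
}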

KStreamWindowAggregate 2.0.1 vs 2.5.0: skipping records instead of processing

I've recently upgraded my Kafka Streams application from 2.0.1 to 2.5.0. As a result I'm seeing a lot of warnings like the following:
org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate$KStreamWindowAggregateProcessor Skipping record for expired window. key=[325233] topic=[MY_TOPIC] partition=[20] offset=[661798621] timestamp=[1600041596350] window=[1600041570000,1600041600000) expiration=[1600059629913] streamTime=[1600145999913]
There seems to be new logic in the KStreamWindowAggregate class that checks whether a window has closed. If it has been closed, the messages are skipped. In 2.0.1 these messages were still processed.
Question
Is there a way to get the same behavior as before? I'm seeing lots of gaps in my data with this upgrade and I'm not sure how to solve this, as previously these gaps were not seen.
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expiring windows?
Update
While exploring further, I see that it is indeed related to the grace period in ms. In my custom timestamp extractor (which has the logic to use the timestamp from the payload instead of the normal timestamp), I can see that the incoming timestamps for the expired-window warnings are indeed more than 24 hours beyond the event time from the payload.
I assume this is caused by consumer lag of over 24 hours.
The timestamp extractor's extract method has a partitionTime parameter, which according to the docs is:
partitionTime: the highest extracted valid timestamp of the current record's partition (could be -1 if unknown)
So is this the create time of the record on the topic? And is there a way to influence this so that my records are no longer skipped?
In 2.0.1 these messages were still processed.
That is a little bit surprising (even if I would need to double-check the code), at least for the default config. By default, the store retention time is set to 24h, and thus in 2.0.1 messages older than 24h should also not be processed, as the corresponding state has already been purged. If you changed the store retention time (via Materialized#withRetention) to a larger value, you would also need to increase the window grace period via the TimeWindows#grace() method accordingly.
The aggregate function that I'm using already deals with windowing and, as a result, with expired windows. How does this new logic relate to these expiring windows?
Not sure what you mean by this or how you actually do it. The old and new logic are similar with regard to how long a window is stored (the retention time config). The new part is the grace period, which you can increase to the same value as the retention time if you wish.
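As a sketch of what raising both knobs might look like in a topology (the topic name, serdes, 30-second window and 7-day values are assumptions for illustration, not a recommendation):

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

final StreamsBuilder builder = new StreamsBuilder();
builder.<String, Long>stream("my-input-topic")                                // assumed topic and types
    .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
    .windowedBy(TimeWindows.of(Duration.ofSeconds(30))
        .grace(Duration.ofDays(7)))                                           // accept records up to 7 days late
    .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("my-window-store")
        .withRetention(Duration.ofDays(7).plusSeconds(30)));                  // retention >= window size + grace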
About "partition time": it is computed base on whatever TimestampExtractor returns. For your case, it's the max of whatever you extracted from the message payload.

KTable suppress(Suppressed.untilTimeLimit()) does not hold records for the specified time

I have implemented a stream processing app that makes some calculations and transformations and sends the result to an output topic.
After that, I read from that topic and I want to suppress the results for 35 seconds, just like a timer, meaning that all the output records from that suppress will be sent to a specific "timeout" topic.
The simplified code looks like this:
inputStream
    .suppress(Suppressed.untilTimeLimit(Duration.ofSeconds(35), Suppressed.BufferConfig.unbounded()))
    .toStream()
    .peek((key, value) -> LOGGER.warn("incidence with Key: {} timeout -> Time = {}", key, 35))
    .filterNot((key, value) -> value.isDisconnection())
The problem I have here is that suppress holds the records for an arbitrary time, not the specified 35 seconds.
For more information, I'm using the event time extracted in the earlier processing described at the beginning, and records are arriving every second.
Thanks
Update
This is an input record example:
rowtime: 4/8/20 8:26:33 AM UTC, key: 34527882, value: {"incidenceId":"34527882","installationId":"18434","disconnection":false,"timeout":false, "creationDate":"1270801593"}
I ran into a similar issue some time ago, and the reason why suppress holds the records for an arbitrary time is that the suppress operator uses what is called stream time instead of the intuitive wall-clock time.
As of now, untilTimeLimit only supports stream time, which limits its usefulness. Work is underway to add a wall-clock time option, expanding this feature into a general rate control mechanism.
The important aspects of time for Suppress are:
The timestamp of each record: This is also known as event time.
What time is “now”? Stream processing systems have two main choices here:
The intuitive “now” is wall-clock time, the time you would see if you look at a clock on the wall while your program is running.
There is also stream time, which is essentially the maximum timestamp your program has observed in the stream so far. If you’ve been polling a topic, and you’ve seen records with timestamps 10, 11, 12, 11, then the current stream time is 12.
Reference: Kafka Streams’ Take on Watermarks and Triggers
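As a small plain-Java illustration of the stream-time bullet above (the timestamps are made up):

// Stream time is simply the maximum record timestamp observed so far.
long streamTime = Long.MIN_VALUE;
for (final long recordTimestamp : new long[] {10, 11, 12, 11}) {
    streamTime = Math.max(streamTime, recordTimestamp);
}
// streamTime is now 12. With untilTimeLimit(Duration.ofSeconds(35), ...), a record
// buffered at timestamp t is only emitted once stream time reaches t + 35000, i.e.
// only when newer records keep arriving and push stream time forward. If the input
// pauses, nothing is emitted, which is why the observed delay can look arbitrary.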

Kafka Late Arrival message processing

As we are new to Kafka, we are trying to understand late-arrival records based on the following details. Please help us with the questions below.
To process late-arrival records, what time parameter can be chosen to determine how long we can wait for late-arrival records?
If the records have not arrived by that time either, what happens in that case? Will the records be discarded from processing?
Consumers performing time-series analysis are the only ones that care about time; Kafka does not.
For example, say you had a device emitting metrics for some game while in airplane mode, and it cached data locally until the network reconnected.
In that case, you would collect both the time at which the events originally occurred while offline and the time at which the records reached Kafka.
Topics are append-only, so your analysis can only project data by the time at which it entered the system, and it's up to your analysis to discover the minimum time window of the original events.
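As a sketch of collecting both notions of time on the consumer side (the topic name and connection settings are assumptions; whether record.timestamp() is the producer's create time or the broker's append time depends on the topic's message.timestamp.type setting, and the original device-side event time would have to be carried inside the payload):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LateArrivalInspector {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "late-arrival-inspector");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("device-metrics"));    // assumed topic name
            final ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (final ConsumerRecord<String, String> record : records) {
                // timestampType() tells you whether the timestamp below is the producer's
                // CreateTime or the broker's LogAppendTime for this topic.
                System.out.printf("key=%s timestampType=%s timestamp=%d%n",
                        record.key(), record.timestampType(), record.timestamp());
            }
        }
    }
}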

Last value corresponding to each key sent on a Kafka topic

We have a Kafka topic configured on which we publish accumulated reports for each stock we traded throughout the day.
For example, Stock A - Buy-50, Sell-60; Stock B - Buy-44, Sell-34; and so on. The key used while publishing is the RIC code of the stock.
The next day I want all consumers to get the last published position for each stock individually. I want to understand how to configure the Kafka producer/consumer to achieve this behavior.
One thing that comes to mind is creating a partition for each stock; this will result in individual offsets for each stock, and all consumers can point to the highest offset and get the latest position.
Is this the correct approach or am I missing something obvious?
Your approach will work, but only if you don't care too much about time boundaries; for example, if you do not need to get the counts for each day separately, with a strict requirement that only events that happened between, say, [01/25/2017 00:00 - 01/26/2017 00:00] must be counted.
If you do need to get counts per day in a strict manner, you could try using Kafka Streams, with the RIC code as the key and the window set to 24 hours based on the event timestamp.
This is just one other way to do it; I'm sure there are more approaches available!
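As a sketch of the Kafka Streams approach suggested above (the topic names, String serdes, and keeping the last value with a reduce are assumptions for illustration; note that 24-hour windows are aligned to the epoch by default, not to calendar days):

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class LatestPositionPerRic {
    public static Topology build() {
        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("stock-positions", Consumed.with(Serdes.String(), Serdes.String()))  // key = RIC code
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .windowedBy(TimeWindows.of(Duration.ofHours(24)))             // one event-time window per day
            .reduce((previous, latest) -> latest)                         // keep only the last published position
            .toStream()
            .map((windowedRic, position) -> KeyValue.pair(windowedRic.key(), position))     // drop the window from the key
            .to("latest-positions", Produced.with(Serdes.String(), Serdes.String()));
        return builder.build();
    }
}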