How to select the type of time semantics when working with tumbling windows? - apache-kafka

I am working on Kafka Streams windowing, particularly tumbling windows, for my use case.
TimeWindowedKStream<String, Blob> windowedStreams = groupedStreams
    .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(5)));
This is a tumbling window of 5 minutes per record key that advances by 5 minutes. For my use case, I want no old message to be dropped, and hence I want it to use processing time as the time semantic.
What is the default behaviour of a tumbling window with respect to time semantics, and how do I specify which time semantic a tumbling window should pick: event time, processing time, or ingestion time?

The time semantics are not specified on the window definition, but depend on the configured TimestampExtractor. If you want to switch to processing-time semantics, you can set default.timestamp.extractor to WallclockTimestampExtractor.class in the KafkaStreams config.
Compare
https://docs.confluent.io/current/streams/concepts.html#time
https://docs.confluent.io/current/streams/developer-guide/config-streams.html#streams-developer-guide-timestamp-extractor
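As a minimal sketch, switching an application to processing-time semantics via this config would look as follows (the application id and broker address are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.WallclockTimestampExtractor;

Properties props = new Properties();
// placeholder application id and broker address
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-windowed-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// switch all operators to wall-clock (processing-time) semantics
props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG,
        WallclockTimestampExtractor.class);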

Related

Difference between eventTimeTimeout and processingTimeTimeout in mapGroupsWithState

What is the difference between eventTimeTimeout and processingTimeTimeout in mapGroupsWithState?
Also, is it possible to make a state expire after every 10 min, so that if the data for that particular key arrives after 10 min, the state is maintained from the beginning?
In short:
processing-time-based timeouts rely on the clock of the machine your job is running on. They are independent of any timestamps given in your data/events.
event-time-based timeouts rely on a timestamp column within your data that serves as the event time. In that case you need to declare this timestamp as a watermark.
More details are available in the Scala docs on the relevant class, GroupState:
With ProcessingTimeTimeout, the timeout duration can be set by calling GroupState.setTimeoutDuration. The timeout will occur when the clock has advanced by the set duration. Guarantees provided by this timeout with a duration of D ms are as follows:
Timeout will never occur before the clock time has advanced by D ms.
Timeout will occur eventually when there is a trigger in the query (i.e. after D ms). So there is no strict upper bound on when the timeout will occur. For example, the trigger interval of the query will affect when the timeout actually occurs. If there is no data in the stream (for any group) for a while, then there will not be any trigger, and the timeout function call will not occur until there is data.
Since the processing time timeout is based on the clock time, it is affected by the variations in the system clock (i.e. time zone changes, clock skew, etc.).
With EventTimeTimeout, the user also has to specify the event time watermark in the query using Dataset.withWatermark(). With this setting, data that is older than the watermark is filtered out. The timeout can be set for a group by setting a timeout timestamp using GroupState.setTimeoutTimestamp(), and the timeout will occur when the watermark advances beyond the set timestamp. You can control the timeout delay by two parameters: (i) the watermark delay and (ii) an additional duration beyond the timestamp in the event (which is guaranteed to be newer than the watermark due to the filtering). Guarantees provided by this timeout are as follows:
Timeout will never occur before the watermark has exceeded the set timeout timestamp.
Similar to processing-time timeouts, there is no strict upper bound on the delay before the timeout actually occurs. The watermark can advance only when there is data in the stream, and only when the event time of the data has actually advanced.
"Also, is possible to make a state expire after every 10 min and if the data for that particular key arrives after 10 min the state should be maintained from the beginning?"
This is happening automatically when using mapGroupsWithState. You just need to make sure to actually remove the state after the 10 minutes.
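A minimal sketch of that pattern with the Java API, assuming a KeyValueGroupedDataset<String, String> named grouped from a streaming query (all names here are illustrative): the state counts events per key, a processing-time timeout is re-armed on every update, and the state is removed once the timeout fires.

import java.util.Iterator;
import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.streaming.GroupState;
import org.apache.spark.sql.streaming.GroupStateTimeout;

// Counts events per key; the state expires 10 minutes after its last update.
MapGroupsWithStateFunction<String, String, Long, String> handler =
    (String key, Iterator<String> events, GroupState<Long> state) -> {
        if (state.hasTimedOut()) {
            state.remove();                  // drop the expired state so the
            return key + ": state expired";  // next event starts from scratch
        }
        long count = state.exists() ? state.get() : 0L;
        while (events.hasNext()) {
            events.next();
            count++;
        }
        state.update(count);
        state.setTimeoutDuration("10 minutes"); // re-arm the timeout
        return key + ": " + count;
    };

Dataset<String> output = grouped.mapGroupsWithState(
    handler, Encoders.LONG(), Encoders.STRING(),
    GroupStateTimeout.ProcessingTimeTimeout());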

How to find last hopping window using Apache Kafka Streams

I'm trying to get the average value over the last 30 seconds using hopping windows. Here is the windowing and suppression code:
.windowedBy(TimeWindows.of(Duration.ofSeconds(30)).advanceBy(Duration.ofSeconds(30)).grace(Duration.ZERO))
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
When I do that, I get hopping windows of 30 seconds each. But I'm interested in just the last 30 seconds. How do I catch the last hopping window? After that I'm going to look for the top 5 average values in that window using a Java TreeSet.
If you only want the latest window, you can put the windows in a KTable: since the records have the same key, only the latest window will remain in the table.
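A minimal sketch of that idea, assuming the suppressed aggregation above yields a KTable<Windowed<String>, Double> named windowedAverages (an illustrative name): re-keying by the original record key means each newer window overwrites the older one.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

// Drop the window from the key; each newer window then overwrites the
// older one, so the table always holds the latest window per key.
KTable<String, Double> latestWindow = windowedAverages
    .toStream()
    .map((windowedKey, avg) -> KeyValue.pair(windowedKey.key(), avg))
    .toTable(Materialized.with(Serdes.String(), Serdes.Double()));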

Is it possible to close a Kafka Streams window before its retention period?

First of all, I must say that I'm new to Kafka and streaming, but I'll try to explain the problem the best I can.
My team is currently developing an application to process data using Kafka Streams. We're using windowing to perform aggregation operations and we need to emit the results only when the window is closed.
Now here's the problem itself: our window retention period is three days (yes, that is correct). We must keep the window open for this long in case any record arrives late, but in the normal course of operations all the records arrive on time and we're able to identify when one batch is complete, so, for most of the time, the windows don't need to stay open for so long.
So my question is: knowing that the last record of a batch has already arrived before the 3-day window retention period, is it possible to close the window and emit the aggregation results?

Synchronize Data From Multiple Data Sources

Our team is trying to build a predictive maintenance system whose task is to look at a set of events and predict whether these events depict a set of known anomalies or not.
We are at the design phase and the current system design is as follows:
The events may occur on multiple sources of an IoT system (such as cloud platform, edge devices or any intermediate platforms)
The events are pushed by the data sources into a message queueing system (currently we have chosen Apache Kafka).
Each data source has its own queue (Kafka Topic).
From the queues, the data is consumed by multiple inference engines (which are actually neural networks).
Depending upon the feature set, an inference engine will subscribe to multiple Kafka topics and stream data from those topics to continuously output the inference.
The overall architecture follows the single-responsibility principle meaning that every component will be separate from each other and run inside a separate Docker container.
Problem:
In order to classify a set of events as an anomaly, the events have to occur in the same time window. E.g., say there are three data sources pushing their respective events into Kafka topics, but for some reason the data is not synchronized.
So one of the inference engines pulls the latest entries from each of the Kafka topics, but the corresponding events in the pulled data do not belong to the same time window (say 1 hour). That will result in invalid predictions due to out-of-sync data.
Question
We need to figure out how we can make sure that the data from all three sources is pushed in order, so that when an inference engine requests entries (say the last 100 entries) from multiple Kafka topics, the corresponding entries in each topic belong to the same time window.
I would suggest KSQL, a streaming SQL engine that enables real-time data processing against Apache Kafka. It also provides nice functionality for windowed aggregation, etc.
There are 3 ways to define windows in KSQL: hopping windows, tumbling windows, and session windows. Hopping and tumbling windows are time windows, because they're defined by fixed durations that you specify. Session windows are dynamically sized based on incoming data and defined by periods of activity separated by gaps of inactivity.
In your context, you can use KSQL to query and aggregate the topics of interest using Windowed Joins. For example,
SELECT t1.id, ...
FROM topic_1 t1
INNER JOIN topic_2 t2
WITHIN 1 HOURS
ON t1.id = t2.id;
Some suggestions -
Handle delay at the producer end -
Ensure all three producers always send data in sync to the Kafka topics by tuning batch.size and linger.ms.
E.g., if linger.ms is set to 1000, all messages would be sent to Kafka within 1 second.
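For illustration, a minimal producer configuration along those lines (the broker address is a placeholder):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // flush when the batch is full...
props.put(ProducerConfig.LINGER_MS_CONFIG, 1000);   // ...or after 1 second at the latest
KafkaProducer<String, String> producer = new KafkaProducer<>(props);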
Handle delay at the consumer end -
Any streaming engine on the consumer side (be it Kafka Streams, Spark Streaming, or Flink) provides windowing functionality to join/aggregate stream data based on keys while accounting for delayed records.
Check the Flink windows documentation for reference on how to choose the right window type.
To handle this scenario, data sources must provide some mechanism for the consumer to realize that all relevant data has arrived. The simplest solution is to publish each batch from a data source with a batch id (a GUID of some form). Consumers can then wait until the next batch id shows up, marking the end of the previous batch. This approach assumes sources will not skip a batch; otherwise they will become permanently misaligned. There is no algorithm to detect this, but you might have some fields in the data that show a discontinuity and allow you to realign the data.
A weaker version of this approach is to either just wait x seconds and assume all sources succeed within this much time, or look at some form of timestamps (logical or wall clock) to detect that a source has moved on to the next time window, implicitly showing completion of the last window.
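A minimal sketch of the batch-id alignment described above, with illustrative types (String records, UUID batch ids): each source's records are buffered until the next batch id appears, which marks the previous batch complete.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Buffers records per source; a batch is released only once the next
// batch id has been seen, which marks the previous batch as complete.
class BatchAligner {
    private final Map<String, UUID> currentBatch = new HashMap<>();
    private final Map<String, List<String>> buffer = new HashMap<>();

    // Returns the completed batch when a new batch id arrives, else empty.
    Optional<List<String>> onRecord(String source, UUID batchId, String record) {
        UUID open = currentBatch.get(source);
        Optional<List<String>> completed = Optional.empty();
        if (open != null && !open.equals(batchId)) {
            completed = Optional.of(buffer.remove(source)); // previous batch is done
        }
        currentBatch.put(source, batchId);
        buffer.computeIfAbsent(source, s -> new ArrayList<>()).add(record);
        return completed;
    }
}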
The following recommendations should maximize the success of event synchronization for the anomaly-detection problem using time-series data.
Use a network time synchronizer on all producer/consumer nodes.
Use a heartbeat message from producers every x units of time, with a fixed start time. E.g., the messages are sent every two minutes at the start of the minute.
Build predictors for producer message delay, using the heartbeat messages to compute it (see the sketch below).
With these primitives, we should be able to align the time-series events, accounting for time drifts due to network delays.
At the inference-engine side, expand your windows at a per-producer level to sync up events across producers.
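One way such a delay predictor could look, as a sketch under the assumptions above (clocks synchronized, heartbeats sent every two minutes on the tick; all names are illustrative):

import java.time.Duration;
import java.time.Instant;

// Estimates a producer's delivery delay from heartbeats that are sent on
// a fixed schedule, by comparing arrival time against the scheduled tick.
class DelayEstimator {
    private static final long PERIOD_MS = Duration.ofMinutes(2).toMillis();
    private static final double ALPHA = 0.2; // smoothing factor
    private double ewmaDelayMs = 0;          // smoothed delay estimate

    void onHeartbeat(Instant receivedAt) {
        long t = receivedAt.toEpochMilli();
        long scheduledSend = (t / PERIOD_MS) * PERIOD_MS; // latest tick before arrival
        ewmaDelayMs = ALPHA * (t - scheduledSend) + (1 - ALPHA) * ewmaDelayMs;
    }

    long estimatedDelayMs() {
        return (long) ewmaDelayMs;
    }
}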

How to specify retention period of join window?

I want to join two streams and I have set the join window to 25 hours as the records to be joined can be a maximum of 24 hours apart.
final Long JOIN_WINDOW = TimeUnit.HOURS.toMillis(25);

kstream.join(
    runsheetIdStream,
    (jt, r) -> { jt.setDate(r.getStart_date()); return jt; },
    JoinWindows.of(JOIN_WINDOW),
    Joined.with(Serdes.Long(), jobTransactionSerde, runsheetSerde));
This is throwing the following exception:
org.apache.kafka.streams.errors.TopologyException: Invalid topology: The retention period of the join window KSTREAM-JOINTHIS-0000000016-store must be no smaller than its window size.
How do I increase the retention period?
When you join and use JoinWindows.of(JOIN_WINDOW), you implicitly define the metadata of the underlying state store.
From the javadoc of JoinWindows.of:
Specifies that records of the same key are joinable if their timestamps are within timeDifference, i.e., the timestamp of a record from the secondary stream is max timeDifference earlier or later than the timestamp of the record from the primary stream.
The so-called retention period (aka window maintain duration) was earlier (before Kafka Streams 2.1.0) specified using until:
Set the window maintain duration (retention time) in milliseconds. This retention time is a guaranteed lower bound for how long a window will be maintained.
Since by default the retention is 1 day (can't find the reference at the moment), which is smaller than your 25-hour window, that's the reason for the exception.
As of Kafka Streams 2.1.0 you should be using Materialized API:
Used to describe how a StateStore should be materialized. You can either provide a custom StateStore backend through one of the provided methods accepting a supplier or use the default RocksDB backends by providing just a store name.
Materialized gives you full control over the underlying state store for the join, and provides withRetention(java.time.Duration retention):
Configure retention period for window and session stores.
Note that the retention period must be at least long enough to contain the windowed data's entire life cycle, from window-start through window-end, and for the entire grace period.
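For the pre-2.1 API used in the question, a minimal sketch of the fix via the until() quoted above; the exact retention value is an assumption (at least the full window span, i.e. 25 hours before plus 25 hours after a record's timestamp):

final Long JOIN_WINDOW = TimeUnit.HOURS.toMillis(25);

kstream.join(
    runsheetIdStream,
    (jt, r) -> { jt.setDate(r.getStart_date()); return jt; },
    // raise the window maintain duration so it is no smaller than the
    // window size; by default it is only 1 day
    JoinWindows.of(JOIN_WINDOW).until(2 * JOIN_WINDOW),
    Joined.with(Serdes.Long(), jobTransactionSerde, runsheetSerde));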