How does suppress with emitEarlyWhenFull() work for tumbling windows? - apache-kafka

I am using suppress on tumbling windows to get aggregated results. I am exploring both untilTimeLimit and untilWindowCloses for suppress.
I don't want my streams to shut down when the buffer fills up. I have seen the emitEarlyWhenFull() feature, but it cannot be applied on top of untilWindowCloses.
Hence, I am picking untilTimeLimit with emitEarlyWhenFull(); please refer to the code below:
groupedStreams
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(() -> initialBlob, blobAggregator, someserde)
    // public BufferConfig API rather than the internal StrictBufferConfigImpl;
    // the record limit here is illustrative
    .suppress(Suppressed.untilTimeLimit(Duration.ofMinutes(5),
        Suppressed.BufferConfig.maxRecords(1000).emitEarlyWhenFull()))
    .toStream()
In my case, I am using tumbling windows of 5 minutes, so a window is opened per record key every 5 minutes. According to the documentation, the oldest records will be emitted when the buffer becomes full.
What happens to new records with the same key that arrive, within the same tumbling window, after the old record has been sent downstream?
For example, the message flow is:
(A,1)
(A,2)
(A,3) -> agg result: (A,6). Suppose the buffer becomes full here, so (A,6) is sent downstream. If (A,4) now arrives within the same tumbling window, what will come next? Will it be (A,10), or will the aggregation start fresh with (A,4)?

If suppress() emits early, the aggregation state is preserved. Thus, for your example, the aggregation will continue and eventually (A,10) will be emitted.
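This behavior can be observed with TopologyTestDriver. The following is a minimal sketch, not your exact topology: the topic names, serdes, and the tiny maxRecords(1) bound are illustrative assumptions, and reduce() stands in for your aggregate(). Since the suppress buffer keeps one entry per key, a second key is used to force the overflow:

StreamsBuilder builder = new StreamsBuilder();
builder.stream("input", Consumed.with(Serdes.String(), Serdes.Integer()))
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .reduce(Integer::sum)
    .suppress(Suppressed.untilTimeLimit(Duration.ofMinutes(5),
        Suppressed.BufferConfig.maxRecords(1).emitEarlyWhenFull()))
    .toStream()
    .map((windowedKey, sum) -> KeyValue.pair(windowedKey.key(), sum))
    .to("output", Produced.with(Serdes.String(), Serdes.Integer()));

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "suppress-demo");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

try (TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props)) {
    TestInputTopic<String, Integer> in = driver.createInputTopic(
        "input", new StringSerializer(), new IntegerSerializer());
    in.pipeInput("A", 1);
    in.pipeInput("A", 2);
    in.pipeInput("A", 3); // buffered aggregate for A is now 6
    in.pipeInput("B", 1); // buffer exceeds maxRecords(1): oldest entry (A,6) is emitted early
    in.pipeInput("A", 4); // state was preserved, so A's aggregate becomes 10;
                          // (A,10) is emitted on a later eviction or when the time limit expires
}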

Related

How to find last hopping window using Apache Kafka Streams

I'm trying to get the average value over the last 30 seconds using hopping windows. Here is the windowing and suppression code:
.windowedBy(TimeWindows.of(Duration.ofSeconds(30)).advanceBy(Duration.ofSeconds(30)).grace(Duration.ZERO))
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
When I do that, I'm getting hopping windows of 30 seconds. But I'm interested in just the last 30 seconds. How do I catch the last hopping window? Then I'm going to look for the top 5 average values in that window using a Java TreeSet.
If you only want the latest window, you can put the windows into a KTable; since the results share the same key, the table will only hold the latest window per key.
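A minimal sketch of that idea (assuming Kafka Streams 2.5+ for KStream#toTable; the variable averages stands for the windowed KTable produced by your aggregation, and the serdes are illustrative). Re-keying each windowed result back to the original key makes later windows overwrite earlier ones in the table:

// averages: KTable<Windowed<String>, Double> from windowedBy(...).aggregate(...)
KTable<String, Double> latestWindow = averages
    .toStream()
    .map((windowedKey, avg) -> KeyValue.pair(windowedKey.key(), avg))
    .toTable(Materialized.with(Serdes.String(), Serdes.Double()));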

Window does not assess elements from Kafka Source

I think my perception of Flink windows may be wrong, since they are not evaluated as I would expect from the documentation or the Flink book. The goal is to join a Kafka topic, which has rather static data, with a Kafka topic with constantly incoming data.
env.addSource(createKafkaConsumer())
.join(env.addSource(createKafkaConsumer()))
.where(keySelector())
.equalTo(keySelector())
.window(TumblingProcessingTimeWindows.of(Time.hours(2)))
.apply(new RichJoinFunction<A, B, C>() { ... });
createKafkaConsumer() returns a FlinkKafkaConsumer
keySelector() is a placeholder for my key selector.
KafkaTopic A has 1 record, KafkaTopic B has 5. My understanding would be that the JoinFunction is triggered 5 times (the join condition is valid each time), resulting in 5 records in the sink. If a new record for topic A came in within the 2 hours, another 5 records would be created (2x5 records). However, what comes through in the sink is rather unpredictable; I could not see a pattern. Sometimes there's nothing, sometimes the initial records, but if I send additional messages, they are not processed by the join with prior records.
My key question:
What is actually happening here? Are the records only emitted after the window is done processing? I would expect real-time output to the sink, but that would explain a lot.
Related to that:
Could I handle this problem with an onElement trigger, or would this make my TimeWindow obsolete? Do those two concepts exist in parallel, i.e. the join window is 2 hours, but the join function + output is triggered per element? What about duplicates in that case?
Subsequently, does processing time mean the point in time when the record is consumed from the topic? So if I, e.g., setStartFromEarliest() on start, would all messages consumed within the next two hours fall into that window?
Additional info:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime); is set and I also switched to EventTime in between.
The semantics of a tumbling processing-time window are that it processes all events which fall into the given timespan; in your case, 2 hours. By default, the window will only output results once the 2 hours are over, because it needs to know that no other events will arrive for this window.
If you want to output early results (e.g. for every incoming record), then you can specify a custom Trigger which fires on every element, as sketched below. See the Trigger API docs for more information.
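A minimal sketch of such a trigger (the class name is made up; the generics match a processing-time TimeWindow). It fires on every element and purges the window once its end is reached, mirroring the built-in ProcessingTimeTrigger except for the per-element FIRE:

public class FireOnEveryElement<T> extends Trigger<T, TimeWindow> {

    @Override
    public TriggerResult onElement(T element, long timestamp, TimeWindow window, TriggerContext ctx) {
        // make sure the window is eventually purged at its end time
        ctx.registerProcessingTimeTimer(window.maxTimestamp());
        // emit the window's current contents for every arriving element
        return TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        // window end reached: emit one final result and drop the state
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
    }
}

It would be attached to the windowed join via .trigger(...) between window() and apply(). Note that each early firing re-emits the join results computed so far, so the sink has to tolerate duplicates.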
Update
The window does not start with the first element; instead, windows are aligned to multiples of the window length. For example, if your window size is 2 hours, then you can only have windows [0, 2), [2, 4), ..., but not [1, 3) or [3, 5).
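In code, the alignment works out as below (this mirrors Flink's TimeWindow.getWindowStartWithOffset(); millisecond timestamps and a zero offset are assumed):

long windowSize = Time.hours(2).toMilliseconds();
long offset = 0L;
long timestamp = System.currentTimeMillis(); // e.g. the record's processing time
// start of the window containing 'timestamp', aligned to multiples of windowSize
long windowStart = timestamp - ((timestamp - offset + windowSize) % windowSize);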

Process items in a window with Kafka Streams

I'm trying to process some events in a sliding window with Kafka Streams, but I think I don't understand some details of Kafka Streams, so I'm not able to do what I want.
What I have :
input topic of events with key/value like (Int, Person)
What I want :
read these events within a sliding window of 10 minutes
process each element in the sliding window
filter and count some elements, and fire events to another Kafka topic (e.g. if a wrong value is detected)
To put it simply: get all the events in a sliding window of 10 minutes, do a foreach on them, compute some stats/events in the context of the window, and go to the next window...
What I tried :
I tried to mix the Streams API and the Processor API, like this:
val streamBuilder = new StreamsBuilder()
streamBuilder.stream[Int, Person](topic)
.groupBy((_, value) => PersonWrapper(value.id, value.name))
.windowedBy(TimeWindows.of(10 * 60 * 1000L).advanceBy(1 * 60 * 1000L))
// now I have a window of (PersonWrapper, Person) right ?
streamBuilder.build().addProcessor(....)
And now I'd add a processor to this topology to process each events of the sliding window.
I don't understand what TimeWindowedKStream is and why we need a KGroupedStream to apply a window to events. Could someone enlighten me about Kafka Streams and what I'm trying to do?
Did you read the documentation: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#windowing
Windowing is a special form of grouping (grouping based on time)
Grouping is always required to compute an aggregation in Kafka Streams.
After you have a grouped and windowed stream, you call aggregate() for the actual processing (there is no need to attach a Processor manually; the call to aggregate() will implicitly add a Processor for you). See the sketch below.
Btw: Kafka Streams does not really support "sliding windows" for aggregation. The window you define is called a hopping window.
KGroupedStream and TimeWindowedKStream are basically just helper classes and intermediate representations that allow for a fluent API design.
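To make that concrete, here is a minimal sketch of the grouped -> windowed -> aggregated flow (in Java; the topic names are made up, count() stands in for a custom aggregate(), and personSerde is a placeholder for your Person serde):

StreamsBuilder builder = new StreamsBuilder();
builder.stream("persons", Consumed.with(Serdes.Integer(), personSerde))
    .groupByKey() // grouping: KStream -> KGroupedStream
    // hopping window: KGroupedStream -> TimeWindowedKStream
    .windowedBy(TimeWindows.of(Duration.ofMinutes(10)).advanceBy(Duration.ofMinutes(1)))
    .count() // the aggregation step; this implicitly adds the Processor for you
    .toStream()
    .filter((windowedKey, count) -> count > 100) // e.g. detect a suspicious value
    .map((windowedKey, count) -> KeyValue.pair(windowedKey.key(), count))
    .to("alerts", Produced.with(Serdes.Integer(), Serdes.Long()));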
The tutorial is also a good way to get started: https://docs.confluent.io/current/streams/quickstart.html
You should also check out the examples: https://github.com/confluentinc/kafka-streams-examples

Is it possible to close a kafka streams window before its retention period?

First of all, I must say that I'm new to Kafka and streaming, but I'll try to explain the problem the best I can.
My team is currently developing an application to process data using Kafka Streams. We're using windowing to perform aggregation operations, and we need to emit the results only when the window is closed.
Now here's the problem itself: our window retention period is three days (yes, that is correct). We must keep the window open for this long in case any record arrives late, but in the normal course of operations all the records arrive on time and we are able to identify when a batch is complete, so most of the time the windows don't need to stay open for so long.
So my question is: knowing that the last record of a batch has already arrived well before the 3-day retention period ends, is it possible to close the window and emit the aggregation results?

Flink session window with onEventTime trigger?

I want to create an EventTime-based session window in Flink, such that it triggers when the event time of a new message is more than 180 seconds greater than the event time of the message that created the window.
For example:
t1(0 seconds) : msg1 <-- This is the first message which causes the session-windows to be created
t2(13 seconds) : msg2
t3(39 seconds) : msg3
.
.
.
.
t7(190 seconds) : msg7 <-- The event time (t7) is more than 180 seconds after t1 (t7 - t1 = 190), so the window should be triggered and processed now.
t8(193 seconds) : msg8 <-- This message, and all subsequent messages, have to be ignored, as this window was processed at t7
I want to create a trigger such that the above behavior is achieved through appropriate watermark or onEventTime trigger. Can anyone please provide some examples to achieve this?
The best way to approach this might be with a ProcessFunction, rather than with custom windowing. If, as shown in your example, the events will be processed in timestamp order, then this will be pretty straightforward. If, on the other hand, you have to handle out-of-order events (which is common when working with event-time data), it will be somewhat more complex. (Imagine that msg6, with timestamp 187, arrives after msg8. If that's possible, and if it would affect the results you want to produce, then it has to be handled.)
If the events are in order, then the logic would look roughly like this:
Use an AscendingTimestampExtractor as the basis for watermarking.
Use Flink state (perhaps ListState) to store the window contents. When an event arrives, add it to the window and check to see if it has been more than 180 seconds since the first event. If so, process the window contents and clear the list.
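A rough sketch of that in-order logic as a KeyedProcessFunction (Event is a hypothetical POJO with a long timestamp field in milliseconds; the stream is assumed to be keyed, with the timestamp/watermark assigner already applied):

public class SessionWindowFunction extends KeyedProcessFunction<String, Event, List<Event>> {

    private transient ValueState<Long> firstTimestamp; // event time that opened the window
    private transient ValueState<Boolean> processed;   // true once the window has been emitted
    private transient ListState<Event> contents;       // buffered window contents

    @Override
    public void open(Configuration parameters) {
        firstTimestamp = getRuntimeContext().getState(
            new ValueStateDescriptor<>("firstTimestamp", Long.class));
        processed = getRuntimeContext().getState(
            new ValueStateDescriptor<>("processed", Boolean.class));
        contents = getRuntimeContext().getListState(
            new ListStateDescriptor<>("contents", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<List<Event>> out) throws Exception {
        if (Boolean.TRUE.equals(processed.value())) {
            return; // window already emitted; ignore all subsequent messages
        }
        Long first = firstTimestamp.value();
        if (first == null) {
            firstTimestamp.update(event.timestamp); // first message creates the window
            contents.add(event);
        } else if (event.timestamp - first > 180_000L) {
            // more than 180s after the first event: emit the window and stop collecting
            // (the triggering message itself is not added; include it before collecting if desired)
            List<Event> window = new ArrayList<>();
            for (Event e : contents.get()) {
                window.add(e);
            }
            out.collect(window);
            contents.clear();
            processed.update(true);
        } else {
            contents.add(event);
        }
    }
}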
If your events can be out-of-order, then use a BoundedOutOfOrdernessTimestampExtractor, and don't process the window's contents until currentWatermark indicates that event time has passed 180 seconds past the window's start time (you can use an event time timer for this). Don't completely clear the list when triggering a window, but just remove the elements that belong to the window that is closing.