How to write only final output of KStreams windowed operation? - apache-kafka

Say I need to do wordcount-like processing, but for every 5 minutes. So I am using tumbling windows, but in the output I also see the intermediate changelog counts. I want to see only the final counts for each window in the output.
Is there a way to achieve this?
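A minimal sketch of the approach the answers below converge on, using the suppress() operator available since Kafka 2.1 (topic names are hypothetical; each record value is assumed to already be a single word):
builder.stream("words", Consumed.with(Serdes.String(), Serdes.String()))
    .groupBy((key, word) -> word, Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ZERO))
    .count()
    // hold back intermediate updates; emit once per window, when it closes
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream((windowedKey, count) -> windowedKey.key() + "@" + windowedKey.window().start())
    .to("word-counts-5min", Produced.with(Serdes.String(), Serdes.Long()));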

Related

How to find last hopping window using Apache Kafka Streams

I'm trying to get the average value over the last 30 seconds using hopping windows. Here is the windowing and suppressing code:
.windowedBy(TimeWindows.of(Duration.ofSeconds(30)).advanceBy(Duration.ofSeconds(30)).grace(Duration.ZERO))
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
When I do that, I get hopping windows of 30 seconds. But I'm interested in just the last 30 seconds. How do I catch the last hopping window? Then I'm going to look for the top 5 average values in that window using a Java TreeSet.
If you only want the latest window, you can put the windows in a KTable; if they have the same key, only the latest window will remain in the table.
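A minimal sketch of that idea (topic and variable names are hypothetical): strip the window from the key so that a downstream KTable keeps only the newest window per key.
KTable<Windowed<String>, Double> averages = ...; // result of the windowed aggregation
averages
    .toStream()
    // re-key by the original record key, dropping the window
    .map((windowedKey, avg) -> KeyValue.pair(windowedKey.key(), avg))
    .to("latest-averages", Produced.with(Serdes.String(), Serdes.Double()));
// reading the topic back as a table keeps only the latest window per key
KTable<String, Double> latest = builder.table("latest-averages",
    Consumed.with(Serdes.String(), Serdes.Double()));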

Process item in a window with Kafka streams

I'm trying to process some events in a sliding window with Kafka Streams, but I think I don't understand some details of Kafka Streams, so I'm not able to do what I want.
What I have :
input topic of events with key/value like (Int, Person)
What I want :
read these events within a sliding window of 10 minutes
process each element in the sliding window
filter and count some elements, and fire events to another Kafka topic (e.g., if a wrong value is detected)
Put simply: get all the events in a sliding window of 10 minutes, do a foreach on them, compute some stats/events in the context of the window, then move on to the next window...
What I tried :
I tried to mix the Streams DSL and the Processor API, like:
val streamBuilder = new StreamsBuilder()
streamBuilder.stream[Int, Person](topic)
.groupBy((_, value) => PersonWrapper(value.id, value.name))
.windowedBy(TimeWindows.of(10 * 60 * 1000L).advanceBy(1 * 60 * 1000L))
// now I have a window of (PersonWrapper, Person) right ?
streamBuilder.build().addProcessor(....)
And now I'd add a processor to this topology to process each event of the sliding window.
I don't understand what TimeWindowedKStream is, or why we need a KGroupedStream to apply a window to events. Can someone enlighten me about Kafka Streams and what I'm trying to do?
Did you read the documentation? https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#windowing
Windowing is a special form of grouping (grouping based on time).
Grouping is always required to compute an aggregation in Kafka Streams.
After you have a grouped and windowed stream, you call aggregate() for the actual processing (no need to attach a Processor manually; the call to aggregate() will implicitly add a Processor for you).
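A minimal sketch of that pattern in Java (names are hypothetical and default serdes are assumed for Person; the Scala snippet above would follow the same shape): group, window, then aggregate.
KStream<Integer, Person> persons = builder.stream("person-topic");
KTable<Windowed<Integer>, Long> counts = persons
    .groupByKey()                                  // KGroupedStream
    .windowedBy(TimeWindows.of(Duration.ofMinutes(10))
        .advanceBy(Duration.ofMinutes(1)))         // TimeWindowedKStream
    .aggregate(
        () -> 0L,                                  // initializer
        (key, person, agg) -> agg + 1L,            // aggregator
        Materialized.with(Serdes.Integer(), Serdes.Long()));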
Btw: Kafka Streams does not really support "sliding windows" for aggregation. The window you define is called a hopping window.
KGroupedStream and TimeWindowedKStream are basically just helper classes and an intermediate representation that allows for a fluent API design.
The tutorial is also a good way to get started: https://docs.confluent.io/current/streams/quickstart.html
You should also check out the examples: https://github.com/confluentinc/kafka-streams-examples

How does kafka streams compute watermarks?

Does Kafka Streams internally compute watermarks? Is it possible to observe the result of a window (only) when it completes (i.e. when watermark passes end of window)?
Kafka Streams does not use watermarks internally, but a new feature in 2.1.0 lets you observe the result of a window when it closes. It's called Suppressed, and you can read about it in the docs under "Window Final Results":
KGroupedStream<UserId, Event> grouped = ...;
grouped
    .windowedBy(TimeWindows.of(Duration.ofHours(1)).grace(Duration.ofMinutes(10)))
    .count()
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()));
Does Kafka Streams internally compute watermarks?
No. Kafka Streams follows a continuous update processing model that does not require watermarks. You can find more details about this online:
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/resources/streams-tables-two-sides-same-coin
Is it possible to observe the result of a window (only) when it completes (i.e. when watermark passes end of window)?
You can observe the result of a window at any point in time: either by subscribing to the result changelog stream, for example via KTable#toStream()#foreach() (i.e., a push-based approach), or via Interactive Queries that let you query the result window actively (i.e., a pull-based approach).
As mentioned by @Dmitry, for the push-based approach you can also use the suppress() operator if you are only interested in a window's final result.
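A minimal sketch of the push-based approach (assuming counts holds the KTable<Windowed<UserId>, Long> produced by the count() call above): every update to the window result is pushed downstream as it happens.
counts
    .toStream()
    .foreach((windowedKey, count) ->
        System.out.printf("window=%s user=%s count=%d%n",
            windowedKey.window(), windowedKey.key(), count));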

Spark Structured Streaming Aggregation Output Interval

I'm reviewing the StructuredNetworkWordCountWindowed example in Apache Spark Structured Streaming, and I'm having trouble finding information about how to update the example to control the output intervals. When I run the example, I receive output every time a micro-batch is processed. I understand that this is intended, because the main case is to process data and emit results in real time, but what about the case where I want to process data in real time but output the state only at some specific interval? Does Spark Structured Streaming support this scenario? I reviewed the programming guide, and the only similar concept mentioned is the Trigger.ProcessingTime option. Unfortunately, this option is not quite what is needed, since it applies to the batch processing time, while the scenario described above still requires processing data in real time.
Is this feature supported? More specifically, how do I output the state only when the window ends, assuming a tumbling window and no late arrivals?

Need advice on storing time series data in aligned 10 minute batches per channel

I have time series data in Kafka. The schema is quite simple - the key is the channel name, and the values are Long/Double tuples of the timestamp and the value (in reality it's a custom Avro object but it boils down to this). They always come in correct chronological order.
The desired end result is data packaged in 10-minute batches, aligned on 10-minute boundaries (i.e., 00:00 < t <= 00:10, 00:10 < t <= 00:20, ..., 23:50 < t <= 00:00). Each package is to contain data from only one channel.
My idea is to have two Spark Streaming jobs. The first one takes the data from the Kafka topics and dumps it to a table in a Cassandra database where the key is the timestamp and the channel name, and every time such an RDD hits a 10 minute boundary, this boundary is posted to another topic, alongside the channel whose boundary is hit.
The second job listens to this "boundary topic", and for every received 10 minute boundary, the data is pulled from Cassandra, some calculations like min, max, mean, stddev are done and the data and these results are packaged to a defined output directory. That way, each directory contains the data from one channel and one 10 minute window.
However, this looks a bit clunky and like a lot of extra work to me. Is this a feasible solution or are there any other more efficient tricks to it, like some custom windowing of the Kafka data?
I agree with your intuition that this solution is clunky. How about simply using the time windowing functionality built into the Streams DSL?
http://kafka.apache.org/11/documentation/streams/developer-guide/dsl-api.html#windowing
The most natural output would be a new topic containing the windowed aggregations, but if you really need it written to a directory that should be possible with Kafka Connect.
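A minimal sketch of that suggestion (topic names are hypothetical): a 10-minute tumbling window, which Kafka Streams aligns to the epoch, keyed by channel name, with the aggregates written to a new topic.
KStream<String, Double> values = builder.stream("channel-values",
    Consumed.with(Serdes.String(), Serdes.Double()));
values
    .groupByKey()
    .windowedBy(TimeWindows.of(Duration.ofMinutes(10)))  // tumbling, epoch-aligned
    .count()                                             // or aggregate() for min/max/mean/stddev
    .toStream((windowedKey, count) ->
        windowedKey.key() + "@" + windowedKey.window().start())
    .to("channel-10min-counts", Produced.with(Serdes.String(), Serdes.Long()));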
I work with Flink stream processing, not Spark Streaming, but I guess the programming concepts of both are alike. So, supposing the data is ordered chronologically and you want to aggregate data for every 10 minutes and do some processing on the aggregated data, I think the best approach is to use the streaming window functions. I suggest defining a function that maps every incoming record's timestamp to the last 10-minute boundary, as sketched below:
12:10:24 ----> 12:10:00
12:10:30 ----> 12:10:00
12:25:24 ----> 12:20:00
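A minimal Java sketch of that mapping (the method name is hypothetical): floor each epoch-millisecond timestamp to the start of its 10-minute bucket.
// e.g., 12:25:24 maps to 12:20:00
static long floorToTenMinutes(long timestampMillis) {
    long bucketMillis = Duration.ofMinutes(10).toMillis();
    return timestampMillis - (timestampMillis % bucketMillis);
}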
So you can create a keyed stream object like:
StreamObject<Long, Tuple<data>>
Here the Long field is the mapped timestamp of every message. Then you can apply a window. You should look into which kind of window is most appropriate for your case.
Point: Setting a key for the data stream will cause the window function to consider a logical window for every key.
In the simplest case, you should define a time window of 10 minutes and aggregate all data arriving within that period.
The other approach, if you know the rate at which data is generated and how many messages will be generated in a period of 10 minutes, is to use a count window. For example, a window with a count of 20 will listen to the stream, aggregate all messages with the same key into a logical window, and apply the window function once the number of messages in the window reaches 20.
After messages are aggregated in a window as desired, you can apply your processing logic using a reduce function or a similar operation.
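A hedged sketch of the count-window variant, assuming the Flink DataStream API (with Flink's Tuple2 fields f0/f1): the window function fires once 20 messages with the same key have arrived.
DataStream<Tuple2<Long, Double>> input = ...; // (flooredTimestamp, value)
input
    .keyBy(t -> t.f0)                      // logical window per key
    .countWindow(20)                       // fire after 20 elements per key
    .reduce((a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1)); // e.g., sum the values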