Process items in a window with Kafka Streams - Scala

I'm trying to process some events in a sliding window with Kafka Streams, but I think I don't understand some details of Kafka Streams, so I'm not able to do what I want.
What I have:
input topic of events with key/value like (Int, Person)
What I want:
read these events within a sliding window of 10 minutes
process each element in the sliding window
filter and count some elements, fire some events to another Kafka topic (e.g. if a wrong value is detected)
To put it simply: get all the events in a sliding window of 10 minutes, do a foreach on them, compute some stats/events in the context of the window, move on to the next window...
What I tried:
I tried to mix the Streams DSL and the Processor API, like:
val streamBuilder = new StreamsBuilder()
streamBuilder.stream[Int, Person](topic)
.groupBy((_, value) => PersonWrapper(value.id, value.name))
.windowedBy(TimeWindows.of(10 * 60 * 1000L).advanceBy(1 * 60 * 1000L))
// now I have a window of (PersonWrapper, Person) right ?
streamBuilder.build().addProcessor(....)
And now I'd add a processor to this topology to process each event of the sliding window.
I don't understand what TimeWindowedKStream is, and why we need a KGroupedStream to apply a window on events. Could someone enlighten me about Kafka Streams and what I'm trying to do?

Did you read the documentation? https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#windowing
Windowing is a special form of grouping (grouping based on time)
Grouping is always required to compute an aggregation in Kafka Streams.
After you have a grouped and windowed stream, you call aggregate() for the actual processing (no need to attach a Processor manually; the call to aggregate() will implicitly add a Processor for you).
Btw: Kafka Streams does not really support "sliding windows" for aggregation. The window you define is called a hopping window.
KGroupedStream and TimeWindowedKStream are basically just helper classes and an intermediate representation that allows for a fluent API design.
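For illustration, here is a hedged Java sketch of that grouped -> windowed -> aggregate() chain. The topic name, the Person type, personSerde and the isWrong() check are placeholders, not taken from the question:
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;
import java.time.Duration;

// Count "wrong" Person events per key in a 10-minute window that advances by 1 minute.
StreamsBuilder builder = new StreamsBuilder();
KTable<Windowed<Integer>, Long> wrongPerWindow = builder
    .stream("persons", Consumed.with(Serdes.Integer(), personSerde))  // personSerde: assumed custom Serde
    .groupByKey()                                                     // grouping is required before windowing
    .windowedBy(TimeWindows.of(Duration.ofMinutes(10)).advanceBy(Duration.ofMinutes(1)))  // hopping window
    .aggregate(
        () -> 0L,                                                     // initializer
        (key, person, count) -> isWrong(person) ? count + 1 : count,  // isWrong(): placeholder business check
        Materialized.with(Serdes.Integer(), Serdes.Long()));
The resulting windowed KTable can then be turned back into a stream with toStream() and written to another topic with to() in order to fire your downstream events.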
The tutorial is also a good way to get started: https://docs.confluent.io/current/streams/quickstart.html
You should also check out the examples: https://github.com/confluentinc/kafka-streams-examples

Related

Merge collections with different window strategies

We have 3 different data sources between which we eventually need to do some kind of inner join.
We create all PCollections with a group-by-key:
pcollectionA - implemented using state (the data does not change)
pcollectionB - windowed for 5 hours; if another event arrives within that time, we would like to extend the window by another 5 hours. Implemented using a custom window.
pcollectionC - windowed for 30 minutes. Implemented using a fixed window.
Our goal is to send pcollectionC events only if the relevant key exists in pcollectionA & pcollectionB.
What is the best way to implement this, given that regular joins do not work due to the window differences?
From your question I am assuming that the first two streams (pcollectionA and pcollectionB) are more like control/reference data used by the stream processor of pcollectionC.
Have you tried the Broadcast State Pattern? You can convert pcollectionA and pcollectionB to Broadcast streams and then look them up in the processor of pcollectionC.
Here is a very high level design:
// broadcast pcollectionA (EventA/EventB are placeholder element types)
BroadcastStream<EventA> A_BroadcastStream = pcollectionA_stream
    .broadcast(aStateDescriptor);

// broadcast pcollectionB
BroadcastStream<EventB> B_BroadcastStream = pcollectionB_stream
    .broadcast(bStateDescriptor);

// process pcollectionC with the broadcast data from A and B
DataStream<String> output = pcollectionC_stream
    .connect(A_BroadcastStream)
    .process(
        // BroadcastProcessFunction: look up / join with pcollectionA items
    )
    .connect(B_BroadcastStream)
    .process(
        // BroadcastProcessFunction: look up / join with pcollectionB items
    );
This documentation has a great example for this design pattern - https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/broadcast_state/
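To make the pattern a bit more concrete, here is a hedged sketch of what one of those process() calls could contain. CEvent and AEvent are placeholder element types with a String key field (the real schema from the question is unknown), and the descriptor below corresponds to the aStateDescriptor used in the broadcast above:
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class JoinWithBroadcastA extends BroadcastProcessFunction<CEvent, AEvent, CEvent> {

    // the same descriptor must be passed to pcollectionA_stream.broadcast(...)
    public static final MapStateDescriptor<String, AEvent> A_STATE_DESCRIPTOR =
        new MapStateDescriptor<>("pcollectionA",
            BasicTypeInfo.STRING_TYPE_INFO, TypeInformation.of(AEvent.class));

    @Override
    public void processElement(CEvent event, ReadOnlyContext ctx, Collector<CEvent> out) throws Exception {
        // forward a pcollectionC event only if its key was already seen in pcollectionA
        if (ctx.getBroadcastState(A_STATE_DESCRIPTOR).contains(event.key)) {
            out.collect(event);
        }
    }

    @Override
    public void processBroadcastElement(AEvent event, Context ctx, Collector<CEvent> out) throws Exception {
        // remember every element broadcast from pcollectionA, keyed by its join key
        ctx.getBroadcastState(A_STATE_DESCRIPTOR).put(event.key, event);
    }
}
A second, analogous function holding pcollectionB's broadcast state would be passed to the second process() call.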

Kafka - windowing between two particular events

I would like to perform operations (e.g. aggregation) on different events occurring between two concrete events. E.g. a user clicks button 'A' and some time later clicks button 'B', and I would like to count how many events (from other topics) have arrived during this time.
The general concept I'm facing in my application is that my events have duration, they are not single events happening independently at a given time. In the example, the click on button 'A' would be the start of the event and the click on button 'B' would be the end.
My problem is that the windowing options offered by Kafka (tumbling, hopping, sliding, session) do not fit my scenario. Is there any other alternative for implementing this in Kafka Streams? Any other framework, such as Flink or Spark, that can handle it?
I am not sure about other frameworks but a generic windowing solution from KStreams will probably not work for your case.
However, there are ways to make it work for you. I don't know how your keys are set up, so I am going to assume that from the key you can determine the user and whether it is a "start" or "stop" event.
If you are willing to write a custom processor, you can react to a start event, gather events until a stop event, and then send that batch on as a single record, which is basically a window. You can combine this with your DSL code using process(), which simplifies constructing the topology.
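As a hedged sketch of that idea (using transform() here so the result can flow on in the DSL; the Event type, its isStart()/isStop() methods, the store name and the topic names are all placeholders):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

// Counts the events seen between a user's "start" and "stop" events and emits
// the count once the "stop" event arrives.
public class StartStopTransformer implements Transformer<String, Event, KeyValue<String, Long>> {

    private KeyValueStore<String, Long> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        counts = (KeyValueStore<String, Long>) context.getStateStore("open-windows");
    }

    @Override
    public KeyValue<String, Long> transform(String userId, Event event) {
        if (event.isStart()) {
            counts.put(userId, 0L);               // open the "window" for this user
            return null;
        }
        Long current = counts.get(userId);
        if (current == null) {
            return null;                          // no open window for this user, ignore
        }
        if (event.isStop()) {
            counts.delete(userId);                // close the window and emit the result
            return KeyValue.pair(userId, current);
        }
        counts.put(userId, current + 1);          // an event inside the open window
        return null;
    }

    @Override
    public void close() { }
}
Wiring it into the topology would look roughly like:
StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("open-windows"), Serdes.String(), Serdes.Long()));
builder.<String, Event>stream("events")
    .transform(StartStopTransformer::new, "open-windows")
    .to("counts-between-start-and-stop", Produced.with(Serdes.String(), Serdes.Long()));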
There is probably a way to do this by grouping the stream and aggregating a certain way but that might require changes to how your key is constructed.

Window does not assess elements from Kafka Source

I think my perception of Flink windows may be wrong, since they are not evaluated as I would expect from the documentation or the Flink book. The goal is to join a Kafka topic, which has rather static data, with a Kafka topic with constantly incoming data.
env.addSource(createKafkaConsumer())
    .join(env.addSource(createKafkaConsumer()))
    .where(keySelector())
    .equalTo(keySelector())
    .window(TumblingProcessingTimeWindows.of(Time.hours(2)))
    .apply(new RichJoinFunction<A, B, Out>() { ... });  // Out: the join result type
createKafkaConsumer() returns a FlinkKafkaConsumer
keySelector() is a placeholder for my key selector.
Kafka topic A has 1 record, Kafka topic B has 5. My understanding is that the JoinFunction is triggered 5 times (the join condition is valid each time), resulting in 5 records in the sink. If a new record for topic A comes in within the 2 hours, another 5 records would be created (2x5 records). However, what comes through to the sink is rather unpredictable; I could not see a pattern. Sometimes there's nothing, sometimes the initial records, but if I send additional messages, they are not processed by the join with prior records.
My key question:
What actually happens here? Are the records only emitted after the window is done processing? I would expect real-time output to the sink, but that would explain a lot.
Related to that:
Could I handle this problem with an onElement trigger, or would this make my TimeWindow obsolete? Do those two concepts exist in parallel, i.e. the join window is 2 hours, but the join function + output is triggered per element? What about duplicates in that case?
Subsequently, does processing time mean the point in time when the record is consumed from the topic? So if I e.g. setStartFromEarliest() on start, would all messages consumed within the next two hours fall into that window?
Additional info:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime); is set and I also switched to EventTime in between.
The semantics of a tumbling processing-time window are that it processes all events that fall into the given timespan; in your case, 2 hours. By default, the window only outputs results once the 2 hours are over, because it needs to know that no other events will arrive for this window.
If you want to output early results (e.g. for every incoming record), then you could specify a custom Trigger which fires on every element. See the Trigger API docs for more information about this.
Update
The window does not start with the first element; windows are aligned to multiples of the window length. For example, if your window size is 2 hours, then you can only have windows [0, 2), [2, 4), ..., but not [1, 3) or [3, 5).
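To illustrate the early-firing suggestion, here is a hedged sketch of the join from the question with a per-element trigger (streamA/streamB, the key selector and the A/B/String types are placeholders):
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

// Same tumbling 2-hour window, but the trigger fires for every incoming element,
// so joined results are emitted immediately instead of only when the window ends.
// Each firing re-evaluates the whole window content, so earlier matches are
// emitted again (i.e. you will see duplicates).
streamA
    .join(streamB)
    .where(keySelector())
    .equalTo(keySelector())
    .window(TumblingProcessingTimeWindows.of(Time.hours(2)))
    .trigger(CountTrigger.of(1))
    .apply(new JoinFunction<A, B, String>() {
        @Override
        public String join(A left, B right) {
            return left + " joined with " + right;  // placeholder join logic
        }
    });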

Category projections using Kafka and Cassandra for event sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a Kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and business logic, it then applies one or more events to the event store. This involves inserting the new event(s) into the event store table in Cassandra. The events are stamped with a version number for the aggregate - starting at version 0 for a new aggregate, making projections possible. In addition it sends the event to another topic (for projection purposes).
1.3 A Kafka consumer listens on the topic to which these events are published. This consumer acts as a projector. When it receives an event of interest, it loads the current read model for the aggregate. It checks that the version of the event it has received is the expected version, and then updates the read model.
This seems to work very well. The problem arises when I want what EventStore calls category projections. Let's take the Order aggregate as an example. I can easily project one or more read models per Order. But if I want, for example, a projection containing a customer's 30 last orders, then I would need a category projection.
I'm just scratching my head over how to accomplish this. I'm curious to know whether anyone else is using Cassandra and Kafka for event sourcing. I've read in a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumers accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.), but it would be a problem for some target stores.
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns - performance is usually likely to be good enough.
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, KTable is updated with changed aggregate, events are written to event stream and command response is written to command response stream.
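As a hedged sketch of this approach in Kafka Streams code: here the partitioned KTable of aggregates is approximated by a local key-value store updated inside a transformer; Aggregate, Command, Event, the serdes, decide()/apply() and all topic/store names are placeholders, and the atomicity relies on setting processing.guarantee=exactly_once in the Streams config. The command-response stream is omitted for brevity.
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();
builder.addStateStore(Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("aggregates"), Serdes.String(), aggregateSerde));

builder.<String, Command>stream("commands")
    .flatTransform(() -> new Transformer<String, Command, Iterable<KeyValue<String, Event>>>() {
        private KeyValueStore<String, Aggregate> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            store = (KeyValueStore<String, Aggregate>) context.getStateStore("aggregates");
        }

        @Override
        public Iterable<KeyValue<String, Event>> transform(String aggregateId, Command command) {
            Aggregate current = store.get(aggregateId);         // current aggregate state (null for a new aggregate)
            List<Event> newEvents = decide(current, command);   // placeholder: business decision
            store.put(aggregateId, apply(current, newEvents));  // placeholder: apply events to the aggregate
            List<KeyValue<String, Event>> out = new ArrayList<>();
            for (Event e : newEvents) {
                out.add(KeyValue.pair(aggregateId, e));
            }
            return out;
        }

        @Override
        public void close() { }
    }, "aggregates")
    .to("events");  // a command-response stream could be produced analogously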
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does it to give the category ordering). This can help with catch up subscriptions, etc. if you don't want to use Kafka for long term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events for ever, so this isn't always necessary.
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

How does Kafka Streams compute watermarks?

Does Kafka Streams internally compute watermarks? Is it possible to observe the result of a window (only) when it completes (i.e. when watermark passes end of window)?
Kafka Streams does not use watermarks internally, but a new feature in 2.1.0 lets you observe the result of a window when it closes. It's called suppression, and you can read about it in the docs under "Window Final Results":
KGroupedStream<UserId, Event> grouped = ...;
grouped
    .windowedBy(TimeWindows.of(Duration.ofHours(1)).grace(ofMinutes(10)))
    .count()
    .suppress(Suppressed.untilWindowCloses(unbounded()))
Does Kafka Streams internally compute watermarks?
No. Kafka Streams follows a continuous update processing model that does not require watermarks. You can find more details about this online:
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/resources/streams-tables-two-sides-same-coin
Is it possible to observe the result of a window (only) when it completes (i.e. when watermark passes end of window)?
You can observe the result of a window at any point in time, either by subscribing to the result changelog stream, for example via KTable#toStream()#foreach() (i.e., a push-based approach), or via Interactive Queries that let you actively query the result window (i.e., a pull-based approach).
As mentioned by @Dmitry, for the push-based approach you can also use the suppress() operator if you are only interested in a window's final result.
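A minimal sketch of the push-based approach, continuing the grouped stream from the snippet above (sendUpdate() is a placeholder callback):
// Push-based: react to every update of the windowed count as soon as it happens.
// Add .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded())) before
// toStream() if you only want each window's final result.
grouped
    .windowedBy(TimeWindows.of(Duration.ofHours(1)))
    .count()
    .toStream()
    .foreach((windowedUserId, count) ->
        sendUpdate(windowedUserId.key(), windowedUserId.window(), count));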