Kafka - windowing between two particular events - apache-kafka

I would like to perform operations (e.g. aggregation) of different events occurring between two concrete events. E.g. A user clicks button 'A' and some time after clicks button 'B' and I would like to count how many events (from other topics) have been arrived during this time.
The general concept I'm facing in my application is that my events have duration, they are not single events happening independently at a given time. In the example, the click on button 'A' would be the start of the event and the click on button 'B' would be the end.
My problem is that the windowing process offered by kafka (tumbling, hopping, sliding, session) does not fit to my scenario. Is there any other alternative for implementing this in Kafka Streams? Any other framework as Flink or Spark that can handle it?

I am not sure about other frameworks but a generic windowing solution from KStreams will probably not work for your case.
However there are ways to make it work for you. I don't know how your keys are set up so I am going to make an assumption that in the key you can determine the user and if it is a "start" or "stop" event.
If you are willing to make a new processor you can easily react on a start event, gather events until a stop event and then send that batch on as a single record. Which is basically a window. You can combine this with your DLS code using process, that simplifies constructing the topology.
There is probably a way to do this by grouping the stream and aggregating a certain way but that might require changes to how your key is constructed.

Related

Kafka Consumer per business use case - best way to implement

I'm using a single kafka topic for different types of related messages. Topic name is: apiEvents. Events are of type:
ApiUpdateEvent
EndpointUpdateEvent
TemplateUpdateEvent
One of the applications I have, consumes all these events. Moreover - I want it to consume the same event differently (twice), in two unrelated use cases.
For example, two use cases for the same event (EndpointUpdateEvent):
I'd like create a windowed time frame of 500ms and respond to an aggregation of all the events that came in this time frame - ONCE!
These same events as stated in section (1) - I want to respond to each one individually, by triggering some DB operation.
Thus, I want to write code that will be clean and maintainable and wouldn't want to throw all use cases in one big consumer with a lot of mess.
A solution I've thought about is to write a new kafka-consumer for each use case and to assign each consumer a different groupId (within the same application). That way, each business logic use case will have its own class which will handle the events in its own special way. Seems tidy enough.
May there arise any problems if I create too many consumer groups in one application?
Is there a better solution that will allow me to keep clean and divide different business logic use cases?
It sounds like you are on the right track by using separate consumer groups for different business logic use cases that will have separately managed offsets to track the individual progress. This will also align more with a microservice style architecture where different business cases may be implemented in different components.
One more consideration - And I cannot judge this just based on the information provided, but I would also think about splitting your topic into one per event type. It is not a problem for a consumer group to be subscribed to multiple topics at the same time. Whereas I believe it is less efficient to have consumers process/discard a large number of events that are irrelevant for them.
You can use Kafka Streams Processor API to consume and act on individual messages as well as window them within a specific, rolling/hopping time period

Delay fixed window from triggering for several minutes

Using Fixed Windows in Apache Beam. The watermark is set by the event time.
Some data may arrive out of order and cause the window to close.
How can a trigger be defined in Java to occur say 2 minutes after the last data was seen?
It's not entire clear what behavior you expect. One question is what do you expect to happen if the data arrives within the two minutes? Do you want to restart the two minutes interval, don't restart it, re-emit the data or not?
Looks like the trigger you are trying to describe is something along these lines:
wait until the watermark passed the end of window, in event time;
wait for additional 2 minutes in processing time;
emit the data;
If in step 2 it was event time, i.e. you wanted to re-emit the window if a late element arrives that fits within window + 2min, then you could use withAllowedLateness(). Though it sounds different from what you want, because it can keep re-emitting the window contents every time a matching late element arrives.
With processing-time in step 2 this is not possible in general with basic triggers that are available in Beam. You can probably achieve a behavior you want if you manually manage state and timers in your own ParDo, e.g. you can watch for the incoming elements, keep track on them in the state, and then on timer emit what you want. This can become very complicated and might still be not flexible enough for your specific use case.
One of the major problems is that there is no good way to define processing time triggers in Beam in general. It would be complicated to define a general mechanism of working with timers in this manner. For example, when you want to express "wait for 2 minutes", the framework needs to understand in relation to what these two minutes are, when to start the timer, so you need a mechanism to express that as well. And with composition, continuation and other complications this doesn't seem easy to reason about. So it's not in the framework in this general form.
In order to implement only the "wait for 2 minutes after the last element was seen in the window", the framework has to watch for it and set the timer. Technically it is possible to do something like this but doesn't seem like anyone has done it yet.
There seems to be only one meaningful processing time trigger available in Beam but it's not generic enough and doesn't do what you want. You can look at composite triggers like AfterFirst or AfterAll but they likely won't help you without a better general processing time trigger.
I decided against using Beam and implemented the solution in Kafka Streams.
I basically grouped by, then used fixed windows and the aggregated the result.
The "grace" on the window allows data to arrive late.
KGroupedStream<Long, OxyStreamItem> grouped = input.groupByKey();
TimeWindowedKStream<Long, OxyStreamItem> windowed =
grouped.windowedBy(
TimeWindows.of(WIN_SIZE)
.advanceBy(WIN_SIZE)
.grace(Duration.ofSeconds(5L)));
return windowed
.aggregate(
makeInitializer(),
makeAggregator(),
Materialized
.<Long, Aggregate, WindowStore<Bytes, byte[]>>as("tmp")
.withValueSerde(new AggregateSerde()))
.suppress(
Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
.toStream()
.map(calculateAvg());

How do you ensure that events are applied in order to read model?

This is easy for projections that subscribe to all events from the stream, you just keep version of the last event applied on your read model. But what do you do when projection is composite of multiple streams? Do you keep version of each stream that is partaking in the projection. But then what about the gaps, if you are not subscribing to all events? At most you can assert that version is greater than the last one. How do others deal with this? Do you respond to every event and bump up version(s)?
For the EventStore, I would suggest using the $all stream as the default stream for any read-model subscription.
I have used the category stream that essentially produces the snapshot of a given entity type but I stopped doing so since read-models serve a different purpose.
It might be not desirable to use the $all stream as it might also get events, which aren't domain events. Integration events could be an example. In this case, adding some attributes either to event contracts or to the metadata might help to create an internal (JS) projection that will create a special all stream for domain events, or any event category in that regard, where you can subscribe to. You can also use a negative condition, for example, filter out all system events and those that have the original stream name starting with Integration.
As well as processing messages in the correct order, you also have the problem of resuming a projection after it is restarted - how do you ensure you start from the right place when you restart?
The simplest option is to use an event store or message broker that both guarantees order and provides some kind of global stream position field (such as a global event number or an ordered timestamp with a disambiguating component such as MongoDB's Timestamp type). Event stores where you pull the events directly from the store (such as eventstore.org or homegrown ones built on a database) tend to guarantee this. Also, some message brokers like Apache Kafka guarantee ordering (again, this is pull-based). You want at-least-once ordered delivery, ideally.
This approach limits write scalability (reads scale fine, using read replicas) - you can shard your streams across multiple event store instances in various ways, then you have to track the position on a per-shard basis, which adds some complexity.
If you don't have these ordering, delivery and position guarantees, your life is much harder, and it may be hard to make the system completely reliable. You can:
Hold onto messages for a while after receiving them, before processing them, to allow other ones to arrive
Have code to detect missing or out-of-order messages. As you mention, this only works if you receive all events with a global sequence number or if you track all stream version numbers, and even then it isn't reliable in all cases.
For each individual stream, you keep things in order by fetching them from a data store that knows the correct order. A way of thinking of this is that your query the data store, and you get a Document Message back.
It may help to review Greg Young's Polyglot Data talk.
As for synchronization of events in multiple streams; a thing that you need to recognize is that events in different streams are inherently concurrent.
You can get some loose coordination between different streams if you have happens-before data encoded into your messages. "Event B happened in response to Event A, therefore A happened-before B". That gets you a partial ordering.
If you really do need a total ordering of everything everywhere, then you'll need to be looking into patterns like Lamport Clocks.

Event sourcing with Kafka streams

I'm trying to implement a simple CQRS/event sourcing proof of concept on top of Kafka streams (as described in https://www.confluent.io/blog/event-sourcing-using-apache-kafka/)
I have 4 basic parts:
commands topic, which uses the aggregate ID as the key for sequential processing of commands per aggregate
events topic, to which every change in aggregate state are published (again, key is the aggregate ID). This topic has a retention policy of "never delete"
A KTable to reduce aggregate state and save it to a state store
events topic stream ->
group to a Ktable by aggregate ID ->
reduce aggregate events to current state ->
materialize as a state store
commands processor - commands stream, left joined with aggregate state KTable. For each entry in the resulting stream, use a function (command, state) => events to produce resulting events and publish them to the events topic
The question is - is there a way to make sure I have the latest version of the aggregate in the state store?
I want to reject a command if violates business rules (for example - a command to modify the entity is not valid if the entity was marked as deleted). But if a DeleteCommand is published followed by a ModifyCommand right after it, the delete command will produce the DeletedEvent, but when the ModifyCommand is processed, the loaded state from the state store might not reflect that yet and conflicting events will be published.
I don't mind sacrificing command processing throughput, I'd rather get the consistency guarantees (since everything is grouped by the same key and should end up in the same partition)
Hope that was clear :) Any suggestions?
I don't think Kafka is good for CQRS and Event sourcing yet, the way you described it, because it lacks a (simple) way of ensuring protection from concurrent writes. This article talks about this in details.
What I mean by the way you described it is the fact that you expect a command to generate zero or more events or to fail with an exception; this is the classical CQRS with Event sourcing. Most of the people expect this kind of Architecture.
You could have Event sourcing however in a different style. Your Command handlers could yield events for every command that is received (i.e. DeleteWasAccepted). Then, an Event handler could eventually handle that Event in an Event sourced way (by rebuilding Aggregate's state from its event stream) and emit other Events (i.e. ItemDeleted or ItemDeletionWasRejected). So, commands are fired-and-forget, sent async, the client does not wait for an immediate response. It waits however for an Event describing the outcome of its command execution.
An important aspect is that the Event handler must process events from the same Aggregate in a serial way (exactly once and in order). This can be implemented using a single Kafka Consumer Group. You can see about this architecture in this video.
Please read this article by my colleague Jesper. Kafka is a great product but actually not a good fit at all for event sourcing
https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
A possible solution I came up with is to implement a sort of optimistic locking mechanism:
Add an expectedVersion field on the commands
Use the KTable Aggregator to increase the version of the aggregate snapshot for each handled event
Reject commands if the expectedVersion doesn't match the snapshot's aggregate version
This seems to provide the semantics I'm looking for

Is Event Sourcing applicable for batch inputs?

I have a use case where the inputs to the application comes in batches of XML files. For example, a nightly batch of bank transactions. I am trying to see if I can use event sourcing to create a log of events. Based on what I read so far, the examples seems to be based on user driven input (click streams, updates from a user interface etc.,). Is event sourcing using a distributed log mechanism(like Kafka) a valid approach for batch/file based inputs?
Below is the approach I would like to take:
Accept input as a batch in file/xml
Run some basic validations in the memory.
Convert the batch input into a series of events
Write the event log to a Kafka topic(s).
Use the event log to store the data into the database, send the events
to a search engine, update caches, run spark jobs to do aggregations
etc.,
Repeat the process for other incoming batches.
If this approach is not efficient, what other options are available for distributed processing of such inputs?
Are your inputs coming from something that looks like an event storage? I.e. a database that acts as an immutable source of truth, of append only events.
If that is the case, you have the foundation to use event sourcing, and additionally CQRS. (They're not the same thing)
What you would have to realize is that the so called write side / command side... has already been done for you.
The incoming batch of XML files with transactions... each transaction is an event already. It doesn't sound like you need to convert these to events, to then put these into Kafka. You can just map these to something you can put into Kafka, and then all subscribers of the topics can do stuff accordingly.
Effectively you would be implementing the read side of Event Sourcing + CQRS.
In practical terms, unless you are going to be doing things on the write side (where the xml files are generated / where user input is received)... I wouldn't worry too much about the subtleties of event sourcing as it relates to DDD and CQRS. I would simply think of what you're doing as a way to distribute your data to multiple services.
And make sure to consider how caches, search engines, etc. will only be updated whenever you get those XML files.
If each individual event in these xml files has a timestamp then you can think of the output to Kafka as just a steam of late arriving events. Kafka allows you to set the event time on these messages to be the timestamp of the event rather than the time it was ingested to Kafka. In that way, any downstream processing apps like Kafka Streams can put the event into the right temporal context and aggregate into the proper time windows or session windows or even join with other realtime inputs