Is there an equivalent to debounce in WebFlux? - reactive-programming

I have a WebFlux Flux that gets data from a stream of events, but I only want to react once it has reached a more or less stable state (i.e. the events stop flowing for a few milliseconds), much like the debounce operator in RxJS.
I can't seem to find anything like that in the API though.

The WebFlux (Reactor) equivalent is the bufferTimeout operator:
.bufferTimeout(20, Duration.ofSeconds(1))
The advantage is that you get all the events as a list, not just the last one, which is what debounce gives you.
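For illustration, a minimal sketch (the interval source, durations and printing are placeholders; Reactor also offers sampleTimeout, which is closer to a true debounce):

import java.time.Duration;
import java.util.List;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class DebounceExample {
    public static void main(String[] args) {
        // Stand-in for the event stream from the question.
        Flux<Long> events = Flux.interval(Duration.ofMillis(50)).take(40);

        // bufferTimeout: emits a List once 20 items have arrived or 1 second
        // has passed, whichever comes first.
        Flux<List<Long>> batches = events.bufferTimeout(20, Duration.ofSeconds(1));
        batches.doOnNext(batch -> System.out.println("batch: " + batch)).blockLast();

        // sampleTimeout behaves more like RxJS debounce: an item is only
        // emitted if no newer item arrives within 100 ms of it.
        events.sampleTimeout(e -> Mono.delay(Duration.ofMillis(100)))
              .doOnNext(e -> System.out.println("debounced: " + e))
              .blockLast();
    }
}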

Related

Kafka - windowing between two particular events

I would like to perform operations (e.g. aggregation) on different events occurring between two concrete events. E.g. a user clicks button 'A' and some time later clicks button 'B', and I would like to count how many events (from other topics) have arrived during this time.
The general concept I'm facing in my application is that my events have duration, they are not single events happening independently at a given time. In the example, the click on button 'A' would be the start of the event and the click on button 'B' would be the end.
My problem is that the windowing options offered by Kafka (tumbling, hopping, sliding, session) do not fit my scenario. Is there any other alternative for implementing this in Kafka Streams? Any other framework, such as Flink or Spark, that can handle it?
I am not sure about other frameworks but a generic windowing solution from KStreams will probably not work for your case.
However, there are ways to make it work for you. I don't know how your keys are set up, so I am going to assume that from the key you can determine the user and whether it is a "start" or "stop" event.
If you are willing to write a custom processor you can easily react to a start event, gather events until a stop event, and then send that batch on as a single record, which is basically a window. You can combine this with your DSL code using process(), which simplifies constructing the topology; see the sketch below.
There is probably a way to do this by grouping the stream and aggregating in a certain way, but that might require changes to how your key is constructed.
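If you go the custom-processor route, a rough sketch might look like the following (this assumes the newer Processor API, a state store registered under the made-up name "open-windows", and plain "START"/"STOP" marker values; none of these names come from the original question):

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueStore;

public class StartStopBatcher implements Processor<String, String, String, List<String>> {
    private ProcessorContext<String, List<String>> context;
    private KeyValueStore<String, ArrayList<String>> openWindows; // one open "window" per user

    @Override
    public void init(ProcessorContext<String, List<String>> context) {
        this.context = context;
        this.openWindows = context.getStateStore("open-windows");
    }

    @Override
    public void process(Record<String, String> record) {
        String user = record.key();
        ArrayList<String> window = openWindows.get(user);

        if ("START".equals(record.value())) {
            openWindows.put(user, new ArrayList<>());                  // open a window for this user
        } else if ("STOP".equals(record.value()) && window != null) {
            openWindows.delete(user);                                  // close the window...
            context.forward(record.withValue((List<String>) window));  // ...and emit the batch downstream
        } else if (window != null) {
            window.add(record.value());                                // event between START and STOP
            openWindows.put(user, window);
        }
        // Events arriving outside a START/STOP pair are dropped in this sketch.
    }
}

Depending on your Kafka Streams version you would wire this in with something like stream.process(StartStopBatcher::new, "open-windows") (or transform() on older versions), after registering the corresponding state store with the topology.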

Is the mongo timestamp type atomic with the reads?

I guess the title is confusing, but I could not find a better one.
I have an event stream in MongoDB with multiple producers and one consumer. To ensure that I read each event exactly once in the correct order, I use the MongoDB timestamp type as an incrementing value, populated by the server. In the SQL world I would probably use an auto-incremented integer.
My consumer just polls MongoDB and asks for all events since the last timestamp it has seen. In one of the environments we have realized that sometimes the consumer does not handle all events. It does not happen very often, roughly one in 50,000 events is missed, but ideally it should not happen at all.
My assumption is that MongoDB does something like this internally.
ParseDocument(doc);
lock
{
SetTimestamp(doc);
}
WriteDocument(doc);
UpdateIndex(doc);
So it could happen that for a very short period of time a document is not visible to the consumer when it queries the events, because only events #1, #2 and #4 have been written so far and event #3 is written a fraction of a millisecond later.
I have seen this with a C# client and MongoDB 4.2 running in Docker, but I guess the client does not matter here.
Is this assumption correct, and if so, what can I do about it?
My idea is to change my consumer to ask for all events since the last timestamp minus a few seconds and then filter out the already received events in the consumer.
But is there a more elegant solution? Perhaps some way to enforce collection level write locks or could transactions help?
Since you said "consumer" - singular, I suggest:
Use a change stream to be notified of events. Change stream, if correctly iterated, will not skip changes nor will it return the same change twice.
Whenever a document is returned from the change stream and processed by the singular consumer, attach a counter to it. Since there is only one consumer, it is relatively easy to implement the counter without race conditions and the like.
Also write the current resume token into each event being processed.
If you wish, you can use the counter to uniquely identify the events.
To iterate events again, use the counter to look up events in the past. Given that each event has both a counter and a resume token, once you get to the most recent event you can seamlessly transition from iterating based on the counter to iterating based on the resume token.
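With the MongoDB Java driver (the question used C#, but the client does not matter), a rough sketch of that loop might look like this; the connection string, database/collection names and the handle method are placeholders, and persisting the counter and resume token is left out:

import com.mongodb.client.ChangeStreamIterable;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.BsonDocument;
import org.bson.Document;

public class EventConsumer {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> events = client.getDatabase("app").getCollection("events");

        long counter = 0;                // single consumer, so a local counter is race-free
        BsonDocument resumeToken = null; // load the last persisted token here, if any

        ChangeStreamIterable<Document> stream =
                resumeToken == null ? events.watch() : events.watch().resumeAfter(resumeToken);

        try (MongoCursor<ChangeStreamDocument<Document>> cursor = stream.cursor()) {
            while (cursor.hasNext()) {
                ChangeStreamDocument<Document> change = cursor.next();
                counter++;
                resumeToken = change.getResumeToken();
                // Persist the counter and resume token alongside the processed event,
                // so you can later re-read by counter and then resume the stream.
                handle(change.getFullDocument(), counter, resumeToken);
            }
        }
    }

    static void handle(Document doc, long counter, BsonDocument token) {
        System.out.println(counter + " -> " + doc);
    }
}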

How do you ensure that events are applied in order to read model?

This is easy for projections that subscribe to all events from the stream: you just keep the version of the last event applied to your read model. But what do you do when the projection is a composite of multiple streams? Do you keep a version for each stream that partakes in the projection? But then what about the gaps, if you are not subscribing to all events? At most you can assert that the version is greater than the last one. How do others deal with this? Do you respond to every event and bump up the version(s)?
For the EventStore, I would suggest using the $all stream as the default stream for any read-model subscription.
I have used the category stream, which essentially produces a snapshot of a given entity type, but I stopped doing so since read-models serve a different purpose.
It might not be desirable to use the $all stream, as it might also contain events which aren't domain events; integration events could be an example. In that case, adding some attributes either to the event contracts or to the metadata might help to create an internal (JS) projection that builds a special "all" stream for domain events, or for any event category in that regard, which you can subscribe to. You can also use a negative condition, for example filtering out all system events and those whose original stream name starts with Integration.
As well as processing messages in the correct order, you also have the problem of resuming a projection after it is restarted - how do you ensure you start from the right place when you restart?
The simplest option is to use an event store or message broker that both guarantees order and provides some kind of global stream position field (such as a global event number or an ordered timestamp with a disambiguating component such as MongoDB's Timestamp type). Event stores where you pull the events directly from the store (such as eventstore.org or homegrown ones built on a database) tend to guarantee this. Also, some message brokers like Apache Kafka guarantee ordering (again, this is pull-based). You want at-least-once ordered delivery, ideally.
This approach limits write scalability (reads scale fine, using read replicas). You can shard your streams across multiple event store instances in various ways, but then you have to track the position on a per-shard basis, which adds some complexity.
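To make the "track the position" part concrete, here is an illustrative sketch; EventRecord, ReadModelStore and ReadModelTx are invented placeholders standing in for whatever store backs your read model:

import java.util.function.Consumer;

public class CheckpointedProjection {

    record EventRecord(long globalPosition, String type, String data) {}

    interface ReadModelTx {
        void apply(EventRecord event);                          // update the read model
        void saveCheckpoint(String projection, long position);
    }

    interface ReadModelStore {
        long loadCheckpoint(String projection);                 // 0 if none yet
        void inTransaction(Consumer<ReadModelTx> work);
    }

    private final ReadModelStore store;

    public CheckpointedProjection(ReadModelStore store) {
        this.store = store;
    }

    public void handle(EventRecord event) {
        // At-least-once delivery: drop anything at or before the checkpoint.
        if (event.globalPosition() <= store.loadCheckpoint("my-projection")) {
            return;
        }
        // Apply the event and advance the checkpoint in the same transaction,
        // so a crash can never leave them out of sync.
        store.inTransaction(tx -> {
            tx.apply(event);
            tx.saveCheckpoint("my-projection", event.globalPosition());
        });
    }

    public long resumeFrom() {
        // On restart, re-subscribe to the global ($all) stream from this position.
        return store.loadCheckpoint("my-projection");
    }
}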
If you don't have these ordering, delivery and position guarantees, your life is much harder, and it may be hard to make the system completely reliable. You can:
Hold onto messages for a while after receiving them, before processing them, to allow other ones to arrive
Have code to detect missing or out-of-order messages. As you mention, this only works if you receive all events with a global sequence number or if you track all stream version numbers, and even then it isn't reliable in all cases.
For each individual stream, you keep things in order by fetching events from a data store that knows the correct order. A way of thinking about this is that you query the data store and get a Document Message back.
It may help to review Greg Young's Polyglot Data talk.
As for synchronization of events across multiple streams: the thing you need to recognize is that events in different streams are inherently concurrent.
You can get some loose coordination between different streams if you have happens-before data encoded into your messages. "Event B happened in response to Event A, therefore A happened-before B". That gets you a partial ordering.
If you really do need a total ordering of everything everywhere, then you'll need to be looking into patterns like Lamport Clocks.
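For reference, a minimal Lamport clock sketch (this gives you timestamps consistent with happens-before; a total order is usually obtained by breaking ties with a node ID):

import java.util.concurrent.atomic.AtomicLong;

// Minimal Lamport clock: tick() before a local event or send, merge() when
// receiving a message carrying the sender's timestamp.
public class LamportClock {
    private final AtomicLong time = new AtomicLong(0);

    // Local event, or about to send a message: advance and return the timestamp.
    public long tick() {
        return time.incrementAndGet();
    }

    // Received a message stamped with the sender's clock: jump past it.
    public long merge(long received) {
        return time.updateAndGet(local -> Math.max(local, received) + 1);
    }

    public long current() {
        return time.get();
    }
}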

Event sourcing with Kafka streams

I'm trying to implement a simple CQRS/event sourcing proof of concept on top of Kafka streams (as described in https://www.confluent.io/blog/event-sourcing-using-apache-kafka/)
I have 4 basic parts:
commands topic, which uses the aggregate ID as the key for sequential processing of commands per aggregate
events topic, to which every change in aggregate state is published (again, the key is the aggregate ID). This topic has a retention policy of "never delete"
A KTable to reduce aggregate state and save it to a state store
events topic stream ->
group into a KTable by aggregate ID ->
reduce aggregate events to current state ->
materialize as a state store
commands processor - commands stream, left joined with the aggregate state KTable. For each entry in the resulting stream, use a function (command, state) => events to produce the resulting events and publish them to the events topic (a rough sketch of this topology follows below)
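Roughly, in Kafka Streams DSL terms, parts 3 and 4 might look something like this (a sketch with String values, illustrative topic/store names, and applyEvent/decide standing in for the real logic):

import java.util.List;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class CommandProcessorTopology {

    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Part 3: reduce the events topic (keyed by aggregate ID) into a
        // KTable-backed state store holding the current aggregate state.
        KTable<String, String> aggregateState = builder
                .stream("events", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey()
                .reduce(CommandProcessorTopology::applyEvent,
                        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("aggregate-store"));

        // Part 4: left join each command with the current aggregate state,
        // turn (command, state) into zero or more events, publish to "events".
        builder.stream("commands", Consumed.with(Serdes.String(), Serdes.String()))
                .leftJoin(aggregateState, CommandProcessorTopology::decide)
                .flatMapValues(events -> events)
                .to("events", Produced.with(Serdes.String(), Serdes.String()));

        return builder;
    }

    // Placeholder: fold one event into the current state.
    private static String applyEvent(String currentState, String event) {
        return currentState + "|" + event;
    }

    // Placeholder: business logic turning a command plus state into events.
    private static List<String> decide(String command, String currentState) {
        return List.of(command + "-handled");
    }
}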
The question is - is there a way to make sure I have the latest version of the aggregate in the state store?
I want to reject a command if it violates business rules (for example, a command to modify the entity is not valid if the entity was marked as deleted). But if a DeleteCommand is published and a ModifyCommand follows right after it, the delete command will produce the DeletedEvent, but when the ModifyCommand is processed, the state loaded from the state store might not reflect that yet and conflicting events will be published.
I don't mind sacrificing command processing throughput, I'd rather get the consistency guarantees (since everything is grouped by the same key and should end up in the same partition)
Hope that was clear :) Any suggestions?
I don't think Kafka is good for CQRS and Event sourcing yet, the way you described it, because it lacks a (simple) way of ensuring protection from concurrent writes. This article talks about this in details.
What I mean by the way you described it is the fact that you expect a command to generate zero or more events or to fail with an exception; this is the classical CQRS with Event sourcing. Most of the people expect this kind of Architecture.
You could have Event sourcing, however, in a different style. Your command handlers could yield events for every command that is received (e.g. DeleteWasAccepted). Then, an Event handler could eventually handle that event in an Event-sourced way (by rebuilding the Aggregate's state from its event stream) and emit other events (e.g. ItemDeleted or ItemDeletionWasRejected). So commands are fire-and-forget, sent asynchronously; the client does not wait for an immediate response. It waits, however, for an event describing the outcome of its command execution.
An important aspect is that the Event handler must process events from the same Aggregate serially (exactly once and in order). This can be implemented using a single Kafka consumer group. You can see more about this architecture in this video.
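A rough sketch of the consumer side, assuming the events are keyed by aggregate ID (the topic and group names here are made up): because a partition is owned by exactly one consumer in the group, all events of a given aggregate are handled serially and in order.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AggregateEventHandler {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "aggregate-event-handler"); // one group: one owner per partition
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("commands-accepted")); // hypothetical topic of accepted commands
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // record.key() is the aggregate ID; rebuild its state from the event
                    // stream and emit e.g. ItemDeleted or ItemDeletionWasRejected.
                    handle(record.key(), record.value());
                }
            }
        }
    }

    static void handle(String aggregateId, String acceptedCommand) {
        System.out.println(aggregateId + ": " + acceptedCommand);
    }
}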
Please read this article by my colleague Jesper. Kafka is a great product, but it is actually not a good fit at all for event sourcing:
https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
A possible solution I came up with is to implement a sort of optimistic locking mechanism:
Add an expectedVersion field on the commands
Use the KTable Aggregator to increase the version of the aggregate snapshot for each handled event
Reject commands if the expectedVersion doesn't match the snapshot's aggregate version
This seems to provide the semantics I'm looking for.
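A rough sketch of that check (the record types, field names and event payloads are illustrative, not from my actual code):

import java.util.List;

public class OptimisticCommandHandler {

    record Snapshot(long version, boolean deleted) {}
    record Command(long expectedVersion, String type) {}

    // Command processor: reject the command if its expectedVersion does not
    // match the current snapshot version loaded from the KTable state store.
    static List<String> decide(Command command, Snapshot snapshot) {
        long currentVersion = snapshot == null ? 0L : snapshot.version();
        if (command.expectedVersion() != currentVersion) {
            return List.of("CommandRejected(expected=" + command.expectedVersion()
                    + ", actual=" + currentVersion + ")");
        }
        if (snapshot != null && snapshot.deleted() && command.type().equals("Modify")) {
            return List.of("ModificationRejected(entity already deleted)");
        }
        return List.of(command.type() + "Applied");
    }

    // KTable aggregator: every applied event bumps the snapshot version.
    static Snapshot apply(Snapshot snapshot, String event) {
        long nextVersion = (snapshot == null ? 0L : snapshot.version()) + 1;
        boolean deleted = (snapshot != null && snapshot.deleted()) || event.startsWith("Delete");
        return new Snapshot(nextVersion, deleted);
    }
}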

Usage of Monix Debounce Observable

I'm trying out some of the operations that I could do on the Observable from Monix. I came across this debounce operator and could not understand its behavior:
Observable.interval(5.seconds).debounce(2.seconds)
This one above just emits a Long every 5 seconds.
Observable.interval(2.seconds).debounce(5.seconds)
This one, however, does not emit anything at all. So what is the real purpose of the debounce operator, and in which cases could I use it?
The term debounce comes from mechanical relays. You can think of it as a frequency filter: o.debounce(5.seconds) filters out any events that are emitted more frequently than once every 5 seconds.
An example of where I've used it is where I expect to get a batch of similar events in rapid succession, and my response to each event is the same. By debouncing I can reduce the amount of work I need to do by making the batch look like just one event.
It isn't useful in situations like your examples where the input frequency is constant, as the only possibilities are that it does nothing or it filters out everything.