How to replay in a deterministic way in CQRS / event-sourcing? - cqrs

In CQRS / ES based systems, you store events in an event-store. These events refer to an aggregate, and they have an order with respect to the aggregate they belong to. Furthermore, aggregates are consistency / transactional boundaries, which means that any transactional guarantees are only given on a per-aggregate level.
Now, suppose I have a read model which consumes events from multiple aggregates (which is perfectly fine, AFAIK). To be able to replay the read model in a deterministic way, the events need some kind of global ordering across aggregates; otherwise you wouldn't know whether to replay events for aggregate A before or after the ones for B, or how to intermix them.
The simplest solution is to put a timestamp on the events, but typically timestamps are not fine-grained enough (or, to put it another way, not all databases are created equal). Another option is to use a global sequence, but this is bad performance-wise and hinders scaling.
How do you solve this issue? Or is my basic assumption, that replays of read models should be deterministic, wrong?

I see these options:
Global sequence
if your database allows it, you can use timestamp+aggregateId+aggregateVersion as an index. This usually doesn't work well in the distributed database case.
in a distributed database you can use a vector clock to get a global sequence without taking a lock.
Event sequence inside each read model. You can literally store all events in the read model and sort them as you want before applying a projection function.
Allow non-determinism and deal with it. For instance, in your example, if there is no group when the add_user event arrives, just create an empty group record in the read model and add the user to it. When the create_group event arrives, update that group record. After all, you have checked in the UI and/or command handler that there is a group with this aggregateId, right?
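A minimal sketch of the last option, assuming hypothetical event tuples of the form (kind, group_id, payload) and an in-memory dict as the read model. The projection creates a placeholder group when add_user arrives first, so both arrival orders converge on the same state:

```python
# Hypothetical event shapes (not from the original post):
#   ("create_group", group_id, group_name)
#   ("add_user",     group_id, user_name)

def apply(read_model, event):
    kind, group_id, payload = event
    # Create a placeholder record if this group has not been seen yet.
    group = read_model.setdefault(group_id, {"name": None, "users": set()})
    if kind == "create_group":
        group["name"] = payload        # fill in (or update) the group record
    elif kind == "add_user":
        group["users"].add(payload)    # works even before create_group arrived
    return read_model

def project(events):
    model = {}
    for event in events:
        apply(model, event)
    return model

events = [("create_group", "g1", "admins"), ("add_user", "g1", "alice")]
in_order = project(events)
out_of_order = project(list(reversed(events)))
```

This only works because each handler is written so that the order of application does not matter; projections with order-dependent logic would still need one of the other options.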

How do you solve this issue?
It's a known issue, and of course neither simple timestamps, nor a global sequence, nor naïve event methods will help.
Use a vector clock with a weak timestamp to enumerate your events and a vector cursor to read them. That guarantees a stable deterministic order in which to intermix events between aggregates. This works even if each thread has a clock synchronization gap, which is the regular case for database clusters, because perfect timestamp synchronization is impossible.
This also automatically makes it possible to seamlessly mix reading events from the event store and the event bus later, and avoids any database locks between events of different aggregates.
Algorithm draft:
1) Determine the real number of simultaneous transactions in your database, e.g. the maximum number of workers in the cluster.
Since every event is written in exactly one transaction on one thread, you can define its unique id as the tuple (thread number, thread counter), where thread counter is the number of transactions processed on the current thread.
Calculate the event's weak timestamp as MAX(thread timestamp, aggregate timestamp), where aggregate timestamp is the timestamp of the last event for the current aggregate.
2) Prepare a vector cursor for reading events, partitioned by thread number. Read events from each thread sequentially until the timestamp gap exceeds the allowed value. The allowed weak timestamp gap is a trade-off between event reading performance and preserving the native event order.
The minimal value is the cluster's thread synchronization time delta, so that events arrive in their native intermixed order across aggregates. The maximum value is infinity, in which case events will be split by aggregate. When using an RDBMS like Postgres, that value can be determined automatically via a smart SQL query.
You can see a reference implementation for PostgreSQL for saving events and loading events. Saving performance is about 10000 events per second on a 4GB RAM RDS Postgres cluster.
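To make step 1 concrete, here is a rough in-memory sketch of the event id and weak timestamp calculation. It is not the referenced PostgreSQL implementation; the class and field names (Writer, StoredEvent, weak_ts) are made up for illustration, and timestamps are plain integers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredEvent:
    thread: int        # writer thread number
    counter: int       # number of transactions processed on that thread so far
    aggregate_id: str
    weak_ts: int       # MAX(thread timestamp, aggregate timestamp)

class Writer:
    def __init__(self):
        self.counters = {}      # thread number -> transactions processed
        self.last_agg_ts = {}   # aggregate id -> weak timestamp of its last event

    def append(self, thread, thread_ts, aggregate_id):
        counter = self.counters.get(thread, 0) + 1
        self.counters[thread] = counter
        # The weak timestamp never goes backwards within one aggregate,
        # even if the writing thread's clock lags behind another thread's.
        weak_ts = max(thread_ts, self.last_agg_ts.get(aggregate_id, 0))
        self.last_agg_ts[aggregate_id] = weak_ts
        return StoredEvent(thread, counter, aggregate_id, weak_ts)

w = Writer()
e1 = w.append(thread=0, thread_ts=100, aggregate_id="A")
e2 = w.append(thread=1, thread_ts=95, aggregate_id="A")   # thread 1's clock lags
e3 = w.append(thread=1, thread_ts=120, aggregate_id="B")
```

Here e2 gets weak timestamp 100 rather than 95, so the two events of aggregate A keep their write order. The vector cursor from step 2 is not shown.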


Category projections using kafka and cassandra for event-sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and business logic it then applies one or more events to the event store. This involves inserting the new event(s) into the event store table in Cassandra. The events are stamped with a version number for the aggregate, starting at version 0 for a new aggregate, which makes projections possible. In addition it sends the event to another topic (for projection purposes).
1.3 A Kafka consumer listens on the topic to which these events are published. This consumer acts as a projector. When it receives an event of interest, it loads the current read model for the aggregate. It checks that the version of the event it has received is the expected version, and then updates the read model.
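The version check in step 1.3 could look roughly like the sketch below. The read model layout, the event dict shape and the OutOfOrder exception are assumptions for illustration; a version of -1 stands for "nothing applied yet", so the first expected event version is 0:

```python
class OutOfOrder(Exception):
    """Raised when an event does not carry the next expected version."""

def project(read_model, event):
    # read_model: {"version": int, "state": dict}
    # event:      {"version": int, "data": dict}
    expected = read_model["version"] + 1
    if event["version"] != expected:
        raise OutOfOrder(f"expected version {expected}, got {event['version']}")
    read_model["state"].update(event["data"])
    read_model["version"] = event["version"]
    return read_model

order = {"version": -1, "state": {}}
project(order, {"version": 0, "data": {"status": "created"}})
project(order, {"version": 1, "data": {"status": "paid"}})
```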
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take the Order aggregate as an example. I can easily project one or more read models per Order. But if I want, for example, a projection which contains a customer's last 30 orders, then I would need a category projection.
I'm just scratching my head over how to accomplish this. I'm curious to know if anyone else is using Cassandra and Kafka for event sourcing. I've read in a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.), but it would be a problem for some target stores.
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns, and its performance is usually good enough.
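As an illustration of the second concern, a projection over a partitioned stream has to keep one position per partition rather than a single number. A minimal sketch, assuming a hypothetical PartitionedCheckpoint class keyed by (topic, partition) and not tied to any Kafka client API:

```python
class PartitionedCheckpoint:
    """Tracks the last processed offset for every (topic, partition) pair."""

    def __init__(self):
        self.positions = {}

    def should_process(self, topic, partition, offset):
        # Skip anything at or below the stored position (e.g. redelivery after restart).
        return offset > self.positions.get((topic, partition), -1)

    def mark(self, topic, partition, offset):
        self.positions[(topic, partition)] = offset

checkpoint = PartitionedCheckpoint()
checkpoint.mark("orders", 0, 41)
checkpoint.mark("orders", 1, 7)
```

With a global stream the same bookkeeping collapses to a single integer.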
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, KTable is updated with changed aggregate, events are written to event stream and command response is written to command response stream.
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does it to give the category ordering). This can help with catch up subscriptions, etc. if you don't want to use Kafka for long term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events for ever, so this isn't always necessary.
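The catch-up idea in parentheses can be sketched as follows, using plain Python lists as stand-ins for Cassandra and a single Kafka partition. The function name and data shapes are assumptions for illustration only:

```python
def catch_up(stored_events, live_log, handle):
    """Replay from long-term storage, then continue from the live log.

    stored_events: iterable of (position, event) pairs in stream order,
                   where position is the stream position copied onto the event.
    live_log:      list indexed by stream position (stand-in for one partition).
    handle:        callback invoked once per event.
    """
    last = -1
    for position, event in stored_events:
        handle(event)
        last = position
    # Switch to the live stream right after the last stored position.
    for position in range(last + 1, len(live_log)):
        handle(live_log[position])
        last = position
    return last

stored = [(0, "OrderPlaced"), (1, "OrderPaid")]
live = ["OrderPlaced", "OrderPaid", "OrderShipped", "OrderDelivered"]
seen = []
last_position = catch_up(stored, live, seen.append)
```

Because the stored events carry the stream position, no event is handled twice or skipped at the switch-over point.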
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

How do you ensure that events are applied in order to read model?

This is easy for projections that subscribe to all events from a stream: you just keep the version of the last event applied on your read model. But what do you do when a projection is a composite of multiple streams? Do you keep the version of each stream that takes part in the projection? But then what about the gaps, if you are not subscribing to all events? At most you can assert that the version is greater than the last one. How do others deal with this? Do you respond to every event and bump up the version(s)?
For the EventStore, I would suggest using the $all stream as the default stream for any read-model subscription.
I have used the category stream that essentially produces the snapshot of a given entity type but I stopped doing so since read-models serve a different purpose.
It might not be desirable to use the $all stream, as it might also contain events which aren't domain events. Integration events could be an example. In this case, adding some attributes either to the event contracts or to the metadata might help you create an internal (JS) projection that produces a special all stream for domain events, or for any event category for that matter, that you can subscribe to. You can also use a negative condition, for example, filter out all system events and those whose original stream name starts with Integration.
As well as processing messages in the correct order, you also have the problem of resuming a projection after it is restarted - how do you ensure you start from the right place when you restart?
The simplest option is to use an event store or message broker that both guarantees order and provides some kind of global stream position field (such as a global event number or an ordered timestamp with a disambiguating component such as MongoDB's Timestamp type). Event stores where you pull the events directly from the store (such as or homegrown ones built on a database) tend to guarantee this. Also, some message brokers like Apache Kafka guarantee ordering (again, this is pull-based). You want at-least-once ordered delivery, ideally.
This approach limits write scalability (reads scale fine, using read replicas). You can shard your streams across multiple event store instances in various ways, but then you have to track the position on a per-shard basis, which adds some complexity.
If you don't have these ordering, delivery and position guarantees, your life is much harder, and it may be hard to make the system completely reliable. You can:
Hold onto messages for a while after receiving them, before processing them, to allow other ones to arrive
Have code to detect missing or out-of-order messages. As you mention, this only works if you receive all events with a global sequence number or if you track all stream version numbers, and even then it isn't reliable in all cases.
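Both workarounds can be combined in a small resequencing buffer. This is a sketch under the assumption that every message carries a gap-free global sequence number starting at 0; the Resequencer class is hypothetical:

```python
class Resequencer:
    """Holds out-of-order messages until the missing ones have arrived."""

    def __init__(self, next_seq=0):
        self.next_seq = next_seq   # sequence number we are waiting for
        self.pending = {}          # seq -> message, held back for now

    def receive(self, seq, message):
        """Return the messages that are now safe to process, in order."""
        if seq < self.next_seq:
            return []              # duplicate of something already released
        self.pending[seq] = message
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

buffer = Resequencer()
first = buffer.receive(1, "b")    # arrives early, is held back
second = buffer.receive(0, "a")   # fills the gap, releases both
```

A real implementation would also need a timeout or alert for gaps that never close, which is part of why this approach is not fully reliable.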
For each individual stream, you keep things in order by fetching them from a data store that knows the correct order. A way of thinking about this is that you query the data store, and you get a Document Message back.
It may help to review Greg Young's Polyglot Data talk.
As for synchronization of events in multiple streams; a thing that you need to recognize is that events in different streams are inherently concurrent.
You can get some loose coordination between different streams if you have happens-before data encoded into your messages. "Event B happened in response to Event A, therefore A happened-before B". That gets you a partial ordering.
If you really do need a total ordering of everything everywhere, then you'll need to be looking into patterns like Lamport Clocks.
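For reference, a Lamport clock is small enough to sketch in a few lines. Each node keeps a counter, increments it for local events, and on receiving a message jumps past the sender's counter. The class below is illustrative; sorting by (time, node id) is the usual way to break ties and obtain a total order:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Call for every local event, including sending a message."""
        self.time += 1
        return self.time

    def observe(self, received_time):
        """Call when a message stamped with received_time arrives."""
        self.time = max(self.time, received_time) + 1
        return self.time

node_a, node_b = LamportClock(), LamportClock()
event_a = node_a.tick()            # A emits Event A
event_b = node_b.observe(event_a)  # B reacts to it, so A happened-before B

# Total order: sort by (timestamp, node id); the node id only breaks ties.
ordered = sorted([(event_b, "b"), (event_a, "a")])
```

The resulting order is consistent with happens-before, but concurrent events are ordered arbitrarily by the tie-breaker.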

How Axon framework's sequencing policy works in terms of statefulness

In Axon's reference guide it is written that
Besides these provided policies, you can define your own. All policies must implement the SequencingPolicy interface. This interface defines a single method, getSequenceIdentifierFor, that returns the sequence identifier for a given event. Events for which an equal sequence identifier is returned must be processed sequentially. Events that produce a different sequence identifier may be processed concurrently.
Even more, in this thread's last message it says that
with the sequencing policy, you indicate which events need to be processed sequentially. It doesn't matter whether the threads are in the same JVM, or in different ones. If the sequencing policy returns the same value for 2 messages, they will be guaranteed to be processed sequentially, even if you have tracking processor threads across multiple JVMs.
So does this mean that event processors are actually stateless? If yes, then how do they manage to synchronise? Is the token store used for this purpose?
I think this depends on what you count as state, but I assume that from the point of view you're looking at it, yes, the EventProcessor implementations in Axon are indeed stateless.
The SubscribingEventProcessor receives its events from a SubscribableMessageSource (the EventBus implements this interface) when they occur.
The TrackingEventProcessor retrieves its events from a StreamableMessageSource (the EventStore implements this interface) at its own leisure.
The latter therefore needs to keep track of where it is with regard to the events on the event stream. This information is stored in a TrackingToken, which is saved by the TokenStore.
A given TrackingEventProcessor thread can only handle events if it has laid a claim on the TrackingToken for the processing group it is part of. This ensures that the same event isn't handled by two distinct threads that would accidentally update the same query model.
The TrackingToken also allows multithreading this process, which is done by segmenting the token. The number of segments (adjustable through the initialSegmentCount) determines the number of pieces into which the TrackingToken for a given processing group will be partitioned. From the point of view of the TokenStore, this means you'll have several TrackingToken instances stored, equal in number to the segments you've configured.
The SequencingPolicy's job is to determine which events in a stream belong to which segment. For example, you could use the SequentialPerAggregatePolicy to ensure all the events with a given aggregate identifier are handled by one segment.
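The relationship between sequence identifier and segment can be illustrated with a simplified model. This is not Axon's code (Axon is Java, and its segment matching differs in detail); the function names and the hash-modulo mapping are assumptions that only show the principle that equal identifiers always land in the same segment:

```python
import zlib

def sequence_identifier(event):
    # Stand-in for a per-aggregate policy: the aggregate id is the sequence id.
    return event["aggregate_id"]

def segment_for(event, segment_count):
    # Use a stable hash so that every JVM/process computes the same segment.
    digest = zlib.crc32(str(sequence_identifier(event)).encode("utf-8"))
    return digest % segment_count

events = [
    {"aggregate_id": "order-1", "type": "OrderPlaced"},
    {"aggregate_id": "order-2", "type": "OrderPlaced"},
    {"aggregate_id": "order-1", "type": "OrderPaid"},
]
segments = [segment_for(e, segment_count=4) for e in events]
```

Since only the thread holding the claim on a segment's token processes that segment, events with the same sequence identifier end up being handled sequentially, without the processors sharing any other state.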

RDBMS Event-Store: Ensure ordering (single threaded writer)

Short description about the setup:
I'm trying to implement a "basic" event store/event-sourcing application using an RDBMS (in my case Postgres). The events are general purpose events with only some basic fields like eventtime, location, action, formatted as XML. Due to this general structure, there is no way of partitioning them in a useful way. The events are captured via a Java application that validates the events and then stores them in an events table. Each event gets a uuid and a recordtime when it is captured.
In addition, there can be subscriptions to external applications, which should get all events matching a custom criteria. When a new matching event is captured, the event should be PUSHED to the subscriber. To ensure, that the subscriber does not miss any event, I'm currently forcing the capture process to be single threaded. When a new event comes in, a lock is set, the event gets a recordtime assigned to the current time and the event is finally inserted into the DB table (explicitly waiting for the commit). Then the lock is released. For a subscription which runs scheduled for example every 5 seconds, I track the recordtime of the last sent event, and execute a query for new events like where recordtime > subscription_recordtime. When the matching events are successfully pushed to the subscriber, the subscription_recordtime is set to the events max recordtime.
Everything is actually working, but as you can imagine, a single threaded capture process does not scale very well. Thus the main question is: How can I optimise this and allow, for example, multiple capture processes running in parallel?
I already thought about setting the recordtime in the DB itself on insert, but since the order of commits cannot be guaranteed (JVM pauses), I think I might lose events when two capture transactions are running at nearly the same time. If I understand the DB generated timestamp correctly, it will be set before the actual commit. Thus a transaction with a recordtime t2 can already be visible to the subscription query, although another transaction with a recordtime t1 (t1 < t2) is still ongoing and so has not been committed. The recordtime for the subscription will be set to t2 and so the event from transaction 1 will be lost...
Is there a way to guarantee the order on a DB level, so that events are visible in the order they are captured/committed? Every newly visible event must have a later timestamp than the event before (strictly monotonically increasing). I know about a full table lock, but I think I would then have the same performance penalties as before.
Is it possible to set the DB to use a single threaded writer? Then each capture process would also be waiting for another write TX to finish, but on a DB level, which would be much better than a single instance/threaded capture application. Or can I use a different field/id for tracking the current state? Normal sequence ids will suffer from the same problem.
Is there a way to guarantee the order on a DB level, so that events are visible in the order they are captured/ committed?
You should not be concerned with global ordering of events. Your events should contain a Version property. When writing events, you should always be inserting monotonically increasing Version numbers for a given Aggregate/Stream ID. That really is the only ordering that should matter when you are inserting. For Customer ABC, with events 1, 2, 3, and 4, you should only write event 5.
A database transaction can ensure the correct order within a stream using the rules above.
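One common way to enforce this is a uniqueness constraint on (stream id, version), so that two writers racing to append the same version cannot both succeed. The sketch below uses Python's built-in sqlite3 module purely so that it is self-contained; the table layout and the append helper are assumptions, and the same idea carries over to Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " stream_id TEXT NOT NULL,"
    " version INTEGER NOT NULL,"
    " payload TEXT NOT NULL,"
    " PRIMARY KEY (stream_id, version))"
)

def append(conn, stream_id, expected_version, payload):
    """Append one event as version expected_version + 1.

    Returns False if another writer already stored that version.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO events (stream_id, version, payload) VALUES (?, ?, ?)",
                (stream_id, expected_version + 1, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = append(conn, "customer-abc", 0, "CustomerRegistered")
conflict = append(conn, "customer-abc", 0, "CustomerRenamed")   # lost the race
retry = append(conn, "customer-abc", 1, "CustomerRenamed")      # after reloading
```

The constraint rejects duplicate versions; it is the writer's job to load the stream first and pass the version it actually saw, so that no versions are skipped.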
For a subscription which runs scheduled for example every 5 seconds, I track the recordtime of the last sent event, and execute a query for new events like where recordtime > subscription_recordtime.
Reading events is a slightly different story. Firstly, you will likely have a serial column to uniquely identify events. That will give you ordering and allow you to determine if you have read all events. When you read events from the store, you may detect a gap in the sequence. This will happen if an insert was in flight when you read the latest events. In this case, simply re-read the data and see if the gap is gone. This requires your subscription to maintain its position in the index. Alternatively or additionally, you can read only events that are at least N milliseconds old, where N is a threshold high enough to compensate for delays in transactions (e.g. 500 or 1000).
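The gap check can be sketched like this, assuming the subscription has already fetched the rows with an id greater than its stored position, sorted by id. The function name and row shape are made up for the example:

```python
def read_contiguous(rows, last_seen):
    """Deliver rows only up to the first gap in the serial id.

    rows:      list of (id, event) with id > last_seen, sorted by id.
    last_seen: id of the last event already pushed to the subscriber.
    Returns (events to deliver now, new position).
    """
    delivered = []
    position = last_seen
    for event_id, event in rows:
        if event_id != position + 1:
            break  # gap: an insert may still be in flight, re-read later
        delivered.append(event)
        position = event_id
    return delivered, position

# Id 3 is not visible yet because its transaction has not committed.
batch, position = read_contiguous([(1, "a"), (2, "b"), (4, "d")], last_seen=0)
# On the next poll the gap is gone.
next_batch, next_position = read_contiguous([(3, "c"), (4, "d")], last_seen=position)
```

Note that a rolled-back insert leaves a permanent hole in a Postgres sequence, so a real reader has to stop waiting for a gap after some timeout, which is where the "at least N milliseconds old" rule comes in.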
Also, bear in mind that there are open source RDBMS event stores that you can either use or leverage in your process.

How can I measure the propagation latency of DynamoDB Streams?

I'm using DynamoDB Streams + Kinesis Client Library (KCL).
How can I measure latency between when an event was created in a stream and when it was processed on KCL side?
As far as I know, KCL's MillisBehindLatest metric is specific to Kinesis Streams (not DynamoDB Streams).
approximateCreationDateTime record attribute has a minute-level approximation, which is not acceptable for monitoring in sub-second latency systems.
Could you please help with some useful metrics for monitoring DynamoDB Streams latency?
You can change the way you do writes in your application to allow it to track the propagation delay of mutations in the table's stream. For example, you could always update a 'last_updated=' timestamp attribute when you create and update items. That way, when your creations and updates appear in the stream, you can estimate the propagation delay by subtracting last_updated in the NEW_IMAGE of the stream record from the current time.
Because deletions do not have a NEW_IMAGE in stream records, your deletes would need to take place in two steps:
1) logical deletion, where you write the 'logically_deleted=' timestamp to the item, and
2) physical deletion, where you actually call DeleteItem immediately after step 1.
Then, you would use the same math as for creations and updates, the only differences being that you would use the OLD_IMAGE when processing deletions and you would need to subtract at least around 10ms to account for the time it takes to perform the logical delete (step 1).
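The calculation could look roughly like the sketch below. It assumes the stream record is available in its JSON shape (eventName, dynamodb.NewImage, dynamodb.OldImage), that both timestamp attributes are stored as epoch milliseconds in number (N) attributes, and that 10 ms is a reasonable guess for the logical-delete overhead; the function name is hypothetical:

```python
import time

def propagation_delay_ms(record, now_ms=None, delete_overhead_ms=10):
    """Estimate how long a change took to show up in the stream consumer."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    data = record["dynamodb"]
    if record["eventName"] == "REMOVE":
        # Deletions carry no new image; use the logical-delete timestamp
        # from the old image and discount the time step 1 itself took.
        written_ms = int(data["OldImage"]["logically_deleted"]["N"])
        return now_ms - written_ms - delete_overhead_ms
    written_ms = int(data["NewImage"]["last_updated"]["N"])
    return now_ms - written_ms

update = {"eventName": "MODIFY",
          "dynamodb": {"NewImage": {"last_updated": {"N": "1000"}}}}
removal = {"eventName": "REMOVE",
           "dynamodb": {"OldImage": {"logically_deleted": {"N": "1000"}}}}
update_delay = propagation_delay_ms(update, now_ms=1250)
removal_delay = propagation_delay_ms(removal, now_ms=1250)
```

The estimate includes any clock skew between the writer and the consumer, so both sides should use synchronized clocks for sub-second measurements to be meaningful.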