NEventStore issue with replaying events - cqrs

We are using CQRS + ES. The ES is NEventStore (formerly JOliver EventStore). We have two aggregates, handled by different commands. The projections of the second AR depend on the data written to the read model by the first AR's projections. The problem is that when we run the software everything goes so fast that sometimes the two aggregates are persisted in the event store with an identical datetime (the CommitStamp column). When replaying the events we get them from the beginning ordered by the CommitStamp column, but if the two streams have an identical CommitStamp and are taken in the wrong order, the read model projections blow up with exceptions.
Any idea how to solve this problem?
===============================
Here is the discussion about this problem on GitHub:
https://github.com/NEventStore/NEventStore/issues/170
===============================
EDIT: This is how we currently replay events. I looked into how GetFrom(...) works and it turned out that the CommitStamp column is not used for ordering; in fact, there is no commit order at all. So if I start replaying events it may return an event from today, then an event recorded 2 years ago, and so on.
public void ReplayEvents(Action<List<UncommittedEvent>> whatToDoWithEvents, DateTime loadEventsAfterDate)
{
    // GetFrom(DateTime) filters commits by CommitStamp; it does NOT guarantee any ordering.
    var eventPortion = store.Advanced.GetFrom(loadEventsAfterDate);
    var uncommittedEventStream = new UncommittedEventStream();

    foreach (var commit in eventPortion)
    {
        foreach (var eventMessage in commit.Events.ToList())
        {
            uncommittedEventStream.Append(new UncommittedEvent(eventMessage.Body));
        }
    }

    whatToDoWithEvents(uncommittedEventStream.ToList());
}

In NEventStore, the consistency boundary is the stream. As of version 3.2 (as @Marijn mentioned, issue #159) the CommitSequence column is used to order CommitMessages (and the EventMessages they contain) when reading from a stream, across all persistence engines.
EventMessage ordering is guaranteed on a per-stream basis. There is no implied ordering of messages across streams. Any actual ordering that may occur as a result of some aspect of the chosen persistence engine is accidental and must not be relied upon.
Guaranteeing ordering across streams would severely restrict the distributed-friendly aspects of the library. Even if we were to consider such a feature, it would have to work with all supported persistence engines, which include NoSQL stores.
If you are practising Domain Driven Design, where each stream represents an aggregate root, and you need to guarantee ordering across 2 or more aggregates, this points to a design issue in your domain model.
If your projections need to merge values from multiple sources (streams), you can rely on ordering intra-source, but you need to be flexible about ordering inter-source. You should also account for the possibility of duplicate messages, especially if you are replaying through an external bus or queue.
If you attempt to re-order multiple streams on the receiver end using a timestamp (CommitStamp), that will be fragile. Timestamps have a fixed resolution (ms, tick, etc). Even with a single writer, things may still happen 'at the same time'.

Damian added a checkpoint column in the database. This is in the current master branch. When the events are replayed with GetFromCheckpoint(int), the results are correct.
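For reference, a replay method based on that checkpoint might look roughly like the sketch below. It mirrors the ReplayEvents method from the question but relies on checkpoint order instead of CommitStamp; GetFromCheckpoint is the call mentioned above, and its exact name and parameter type vary between NEventStore versions, so treat this as an assumption rather than the definitive API.

public void ReplayEventsFromCheckpoint(Action<List<UncommittedEvent>> whatToDoWithEvents, int checkpointToken)
{
    // Assumed behavior: commits come back ordered by the global checkpoint column,
    // so cross-stream ordering matches the order in which commits were persisted.
    var commits = store.Advanced.GetFromCheckpoint(checkpointToken);
    var uncommittedEventStream = new UncommittedEventStream();

    foreach (var commit in commits)
    {
        foreach (var eventMessage in commit.Events.ToList())
        {
            uncommittedEventStream.Append(new UncommittedEvent(eventMessage.Body));
        }
    }

    whatToDoWithEvents(uncommittedEventStream.ToList());
}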

At the database level, while the CommitStamp is fine for filtering, the CommitSequence column is the one that should guide the ordering.
As for what that translates to in terms of API calls on whatever version of the library you're using -- I'll leave that as an exercise for you (or, if you fill in a code snippet and/or mention the version, perhaps someone else can step in).

Related

Designing event-based architecture for the customer service

Being a developer with solid experience, I am only now entering the world of microservices and event-driven architecture. Things like loose coupling, independent scalability and proper implementation of asynchronous business processes are things that I feel should become simpler compared with the traditional monolith approach. So I'm giving it a try and making a simple PoC for myself.
I am considering making a simple application where a user can register, log in and change their customer details. However, I want to react to certain events asynchronously:
the customer logs in: we send them an email if the IP address used is new to the system.
the customer changes their name: we send them an email notifying them of the change.
The idea is to make a separate application that reacts to "CustomerLoggedIn" and "CustomerChangeName" events.
Here I can think of three approaches to implementing this simple functionality, each of them having some drawbacks. So, when a customer submits their name change:
Store the changed name: the changed name is stored in the DB and an event is sent to Kafka when the DB transaction completes. One of the big problems that arise here is that if a customer had two tabs open and almost simultaneously submitted a change from the initial name "Bob" to "Alice" in one tab and from "Bob" to "Jim" in the other, at the database level one of the updates overwrites the other, which is OK; however, we cannot guarantee that the order of the events will be the same. We can use some checks to ensure that the DB update is only done when "the last version" has been seen, thus preventing the second update altogether, so only one event will be emitted. But in the general case, this pattern will not allow us to preserve the same order of events in the DB as in Kafka, unless we do the DB change + Kafka event sending in one distributed transaction, which is an anti-pattern AFAIK.
Change the name in the DB, and use Debezium or similar DB CDC to capture the event and stream it. Here we get a single event source, so the ordering problem is solved; however, what bothers me is that I lose the ability to enrich the events with business information. Another related drawback is that CDC will stream all the updates in the "customer" table regardless of the business meaning of the event. So, in this case, I will probably need to build a Kafka Streams application to convert the DB CDC events to business events and decouple the DB structure from the event structure. The potential benefit of this approach is that I will be able to capture "direct" DB changes in the same manner as those originating in the application.
Emit the event from the application, without storing it in the DB. One of the subscribers might do the DB persistence, another will do the email sending, etc. The biggest problem I see here is: what do I return to the client? I cannot say "OK, your name is changed"; it's more like "OK, your request has been recorded and will be processed". If the customer quickly hits refresh, they expect to see their new name, and we don't want to have to explain eventual consistency to customers, do we? Also, the order of processing the same event by the "email sender" and the "db updater" is not guaranteed, so I could send an email before the change is persisted.
I am looking for advice regarding any of these three approaches (and maybe some others I am missing), and maybe the use cases when one can be preferable over the others?
It sounds to me like you want event sourcing. In event sourcing, all you need to store is the event: the current state of a customer is derived from replaying the events (either from the beginning of time, or since a snapshot: the snapshot is just an optional optimization). Some other process (there are a few ways to go about this) can then project the events to Kafka for consumption by interested parties. Since every event has a sequence number, you can use the sequence number to prevent concurrent modification (alternatively, the more actor-model-style event-sourcing implementations can use techniques like cluster sharding in Akka to achieve the same ends).
Doing this, you can have a "write-side" which processes the updates in a strongly consistent manner and can respond to queries which only involve a single customer having seen every update to that point (the consistency boundary basically makes customer in this case an aggregate in domain-driven-design terms). "Read-sides" consuming events are eventually consistent: the latencies are typically fairly short: in this case your services sending emails are read-sides (as would be a hypothetical panel showing names of all customers), but the customer's view of their own data could be served by the write-side.
(The separation into read-sides and write-side (the pluralization is significant) is Command Query Responsibility Segregation, which sometimes gets interpreted as "reads can only be served by a read-side". This is not totally accurate: for one thing a write-side's model needs to be read in order for the write-side to perform its task of validating commands and synchronizing updates, so nearly any CQRS-using project violates that interpretation. CQRS should instead be interpreted as "serve reads from the model that makes the most sense and avoid overcomplicating a model (including that model in the write-side) to support a new read".)
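To make the sequence-number check concrete, here is a minimal sketch of a version-guarded append. The IEventStore interface and the CustomerNameChanged type are hypothetical placeholders, not the API of any particular event-sourcing library; the point is only that the write side rejects a command whose caller has not seen the latest event.

using System;

public class ConcurrencyException : Exception { }

public record CustomerNameChanged(string CustomerId, string NewName);

// Hypothetical store: Append succeeds only when expectedVersion matches the
// stream's current version; otherwise it throws ConcurrencyException.
public interface IEventStore
{
    void Append(string streamId, object @event, long expectedVersion);
}

public class CustomerService
{
    private readonly IEventStore store;

    public CustomerService(IEventStore store) => this.store = store;

    public void ChangeName(string customerId, string newName, long versionSeenByClient)
    {
        // If another tab already changed the name, the versions won't match and the
        // command is rejected instead of silently overwriting the other change.
        store.Append(customerId, new CustomerNameChanged(customerId, newName), versionSeenByClient);
    }
}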
I think I qualify to answer this, having extensively used Debezium to simplify the architecture.
I would prefer Option 2:
Every transaction always results in an event emitted in the correct order
Options 1 and 3 have a corner case: what if the transaction succeeds but the application fails to emit the event?
To your point:
Another related drawback is that CDC will stream all the updates in the "customer" table regardless of the business meaning of the event. So, in this case, I will probably need to build a Kafka Streams application to convert the DB CDC events to business events and decouple the DB structure from the event structure.
I really don't think that is a roadblock. The benefit you get is that other use cases may crop up where another consumer of this topic wants to read other columns of the table.
Options 1 and 3 are only going to tie this to your core application logic, and that does you no favors from a simplification point of view. With option 2, with zero code changes to the core application APIs, a developer can work independently on the events, with no need to understand that core logic.
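For illustration, the CDC-to-business-event mapping step could look like the sketch below. The answer above would do this in Kafka Streams (a Java library); this is a language-neutral C# sketch with a simplified view of a Debezium change record (just the row state before and after the change), and the business event type is made up for the example.

// Simplified shapes; a real Debezium record carries more metadata (op, source, ts_ms, ...).
public record CustomerRow(string Id, string Name, string Email);
public record CustomerChangedName(string CustomerId, string OldName, string NewName);

public static class CdcToBusinessEventMapper
{
    // Returns a business event only when the row change actually means "the name changed".
    public static CustomerChangedName Map(CustomerRow before, CustomerRow after)
    {
        if (before == null || after == null) return null;   // insert or delete, not a name change
        if (before.Name == after.Name) return null;         // some other column changed
        return new CustomerChangedName(after.Id, before.Name, after.Name);
    }
}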

DDD, Event Sourcing, and the shape of the Aggregate state

I'm having a hard time understanding the shape of the state that's derived by applying an entity's events versus a projection of that entity's data.
Is an Aggregate's state ONLY used for determining whether or not a command can successfully be applied? Or should that state be usable in other ways?
An example - I have a Post entity for a standard blog post. I might have events like postCreated, postPublished, postUnpublished, etc. For my projections that I'll be persisting in my read tables, I need a projection for the base posts (which will include all posts, regardless of status, with lots of detail) as well as a published_posts projection (which will only represent posts that are currently published, with only the information necessary for rendering).
In the situation above, is my aggregate state ONLY supposed to be used to determine, for example, if a post can be published or unpublished, etc? If this is the case, is the shape of my state within the aggregate purely defined by what's required for these validations? For example, in my base post projection, I want to have a list of all users that have made a change to the post. In terms of validation for the aggregate/commands, I couldn't care less about the list of users that have made changes. Does that mean that this list should not be a part of my state within my aggregate?
TL;DR: yes - limit the "state" in the aggregate to that data that you choose to cache in support of data change.
In my aggregates, I distinguish two different ideas:
the history, aka the sequence of events that describes the changes in the lifetime of the aggregate
the cache, aka the data values we tuck away because querying the event history every time kind of sucks.
There's not a lot of value in caching results that we are never going to use.
One of the underlying lessons of CQRS is that we don't need aggregates everywhere
An AGGREGATE is a cluster of associated objects that we treat as a unit for the purpose of data changes. -- Evans, 2003
If we aren't changing the data, then we can safely work directly with immutable copies of the data.
The only essential purpose of the aggregate is to determine what events, if any, need to be applied to bring the aggregate's state in line with a command (if the aggregate can be brought so in line). All state that's not needed for that purpose can be offloaded to a read-side, which can be thought of as a remix of the event stream (with each read-side only maintaining the state it needs).
That said, there are in practice, reasons to use the aggregate state directly, with the primary one being a desire for a stronger consistency for the aggregate: CQRS is inherently eventually consistent. As with all questions of consistent updates, it's important to recognize that consistency isn't free and very often isn't even cheap; I tend to think of a project as having a consistency budget and I'm pretty miserly about spending it.
In your case, there's probably no reason to include the list of users changing a post in the aggregate state, unless e.g. there's something like "no single user can modify a given post more than n times".
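As a concrete illustration of keeping only decision-relevant state, here is a minimal sketch of the Post aggregate from the question. Only the published flag is cached, because that is all the publish/unpublish decisions need; the list of editing users is left to a read-side projection built from the same events. Class and event names are illustrative, not tied to any framework.

using System;
using System.Collections.Generic;

public class PostPublished { }
public class PostUnpublished { }

public class Post
{
    // The only cached state the commands below need.
    private bool published;

    private readonly List<object> uncommittedEvents = new List<object>();

    public void Publish()
    {
        if (published) throw new InvalidOperationException("Post is already published.");
        Raise(new PostPublished());
    }

    public void Unpublish()
    {
        if (!published) throw new InvalidOperationException("Post is not published.");
        Raise(new PostUnpublished());
    }

    private void Raise(object @event)
    {
        Apply(@event);                 // keep the cached state in line with the new event
        uncommittedEvents.Add(@event); // to be persisted by the repository
    }

    private void Apply(object @event)
    {
        switch (@event)
        {
            case PostPublished _:   published = true;  break;
            case PostUnpublished _: published = false; break;
        }
    }
}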

Axon or Kafka to support CQRS/ES

Consider the simple use case in which I want to store product ratings as events in an event store.
I could use two different approaches:
Using Axon: A Rating aggregate is responsible for handling the CreateRatingCommand and sending the RatingCreatedEvent. Sending the event would cause the Rating to be stored in the event store. Other event handlers have the possibility to replay the event stream when connecting to the Axon server instance and do whatever is needed with the ratings. In this case, the event handler will be used as a stream processor.
Using Kafka: A KafkaProducer will be used to store a Rating POJO (after proper serialization) in a Kafka topic. Setting the topic's retention time to indefinite would cause no events to get lost in time. Kafka Streams would in this case be used to do the actual rating processing logic.
Some architectural questions appear to me for both approaches:
When using Axon:
Is there any added value to use Axon (or similar solutions) if there is no real state to be maintained or altered within the aggregate? The aggregate just serves as a "dumb" placeholder for the data, but does not provide any state changing logic.
How does Axon handle multiple event handlers of the same event type? Will they all handle the same event (same aggregate id) in parallel, or is the same event only handled once by one of the handlers?
Are events stored in the Axon event store kept until the end of time?
When using Kafka:
Kafka stores events/messages with the same key in the same partition. How does one select the best value for a key in the use case of user-product ratings? UserId, ProductId or a separate topic for both and publish each event in both topics.
Would it be wise to use a separate topic for each user and each product resulting in a massive amount of topics on the cluster? (Approximately <5k products and >10k users).
I don't know if SO is the preferred forum for this kind of question... I was just wondering what you (would) recommend in this particular use case as best practice. Looking forward to your feedback, and feel free to point out other points of thought I missed in the previous questions.
EDIT 12/11/2020: I just found a related discussion containing useful information related to my question.
As Jan Galinski already puts it, this hasn't got a foolproof answer really. It is worth a broader discussion on, for example, AxonIQ's Discuss forum. Regardless, there are some questions in here I can definitely give an answer to, so let's get to it:
Axon Question 1 - Axon Framework is, as you've noticed, used a lot for DDD-centric applications. Nothing, however, forces you to base yourself on that notion at all. You can strip the framework of Event Sourcing specifics, as well as modelling specifics entirely, and purely go for the messaging idea of distinct commands, events and queries. It was actually a conscious decision to segregate Axon Framework version 3 into these sub-parts when version 4 (the current version) was released. Next to that, I think there is great value in not just basing yourself on event messages. Using distinct commands and queries only further decouples your components, making for a far richer and easier-to-extend application landscape.
Axon Question 2 - This actually depends on where the @EventHandler annotated methods are located. If they're in the same class, only one will be invoked. If they're positioned in distinct classes, then both will receive the same event. Furthermore, if they're segregated between distinct classes, it is important to note that Axon uses an Event Processor as the technical solution for invoking your event handlers. If distinct classes are grouped under the same Event Processor, you can impose a certain ordering for which handler is invoked first. Next to this, if the event handling should occur in parallel, you will have to configure a so-called TrackingEventProcessor (the default in Axon Framework), as it allows configuration of several threads to handle events concurrently. To conclude this section: everything you're asking in question two is an option, not a necessity. It's just a matter of configuration, really. It might be worth checking up on this documentation page of Axon Framework on the matter.
Axon Question 3 - As Axon Server serves the purpose of an Event Store, there is no retention period at all. So yes, events are by default kept until the end of time. There is nothing stopping you from dropping the events though, if you feel there's no value in storing them to, for example, base all your models on (as you'd do when using Event Sourcing).
It's the Kafka questions I'm personally less familiar with (which figures, as a contributor to Axon Framework, I guess). I can give you my two cents on the matter here too, although I'd recommend a second opinion:
Kafka Question 1 - From my personal feeling of what such an application would require, I'd assume you'd want to be able to retrieve all data for a given product as efficiently as possible. I'd wager it's important that all events are in the same partition to make this process as efficient as possible, as it wouldn't require any merging afterwards. With this in mind, I'd think using the ProductId makes the most sense.
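As an illustration of keying by ProductId, a producer might look roughly like this, using the Confluent.Kafka .NET client as a stand-in for the Java KafkaProducer in the question (the topic name and rating shape are made up). Because the key is the ProductId, all ratings for one product land in one partition and can be read back in order without merging.

using System.Text.Json;
using System.Threading.Tasks;
using Confluent.Kafka;

public class RatingPublisher
{
    private readonly IProducer<string, string> producer =
        new ProducerBuilder<string, string>(
            new ProducerConfig { BootstrapServers = "localhost:9092" }).Build();

    public Task PublishAsync(string productId, string userId, int stars)
    {
        var payload = JsonSerializer.Serialize(new { productId, userId, stars });
        return producer.ProduceAsync(
            "product-ratings",   // single topic; partitioning by key replaces per-product topics
            new Message<string, string> { Key = productId, Value = payload });
    }
}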
Kafka Question 2 - If you are anticipating only 5,000 products and 10,000 users, I'd guess it should be doable to have separate topics for these. Opinion incoming - it is here, though, where I personally feel that Kafka's intent to give you direct power to decide when to use topics overcomplicates what you're actually trying to achieve, which is business functionality. Giving you the power to segregate streams feels more like an afterthought from the perspective of application development. As soon as you require an enterprise-grade/efficient message bus, that's when this option really shines, I think, as then you can optimize for bulk.
Hoping all this helps you further, @KDW!

Kafka Topics--should I have more or fewer of them?

We are new to Kafka, so I am looking for some high level guidance. We have data for a single entity (we can call it an "Order") that is essentially a number of different entities (we can call one a "Widget" and one a "Gizmo," but there are about 20 different entity types).
Obviously, there is benefit to thinking of Orders as a single topic because all the parts are related to one order. But design wise, does it make more sense for these to be separate topics (Orders, Widgets, Gizmos, etc.)?
There is no direct correlation between the Widgets and Gizmos--the benefit of keeping them together would be things like order of processing, etc. Any suggestions or good resources to read would be very helpful. Thanks!
I would recommend initially recording the event as a single atomic message, and not splitting it up into several messages in several topics. It’s best to record events exactly as you receive them, in a form that is as raw as possible. You can always split up the compound event later, using a stream processor—but it’s much harder to reconstruct the original event if you split it up prematurely. Even better, you can give the initial event a unique ID (e.g. a UUID); that way later on when you split the original event into one event for each entity involved, you can carry that ID forward, making the provenance of each event traceable.
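A tiny sketch of that idea, with made-up types: the raw Order event keeps a unique EventId, and each per-entity event produced by the later split carries it forward as SourceEventId so its provenance stays traceable.

using System;
using System.Collections.Generic;

public record Widget(string Id);
public record Gizmo(string Id);

// The event exactly as received, recorded atomically with its own unique id.
public record OrderReceived(Guid EventId, string OrderId, List<Widget> Widgets, List<Gizmo> Gizmos);

// Derived per-entity events, split out later by a stream processor.
public record WidgetOrdered(Guid SourceEventId, string OrderId, string WidgetId);
public record GizmoOrdered(Guid SourceEventId, string OrderId, string GizmoId);

public static class OrderEventSplitter
{
    public static IEnumerable<object> Split(OrderReceived order)
    {
        foreach (var widget in order.Widgets)
            yield return new WidgetOrdered(order.EventId, order.OrderId, widget.Id);
        foreach (var gizmo in order.Gizmos)
            yield return new GizmoOrdered(order.EventId, order.OrderId, gizmo.Id);
    }
}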

Snapshot taking and restore strategies

I've been reading about CQRS + Event Sourcing patterns (which I wish to apply in the near future) and one point common to all the decks and presentations I found is to take snapshots of your model state in order to restore it, but none of them share patterns/strategies for doing that.
I wonder if you could share your thoughts and experience in this matter particularly in terms of:
When to snapshot
How to model a snapshot store
Application/cache cold start
TL;DR: How have you implemented Snapshotting in your CQRS+EventSourcing application? Pros and Cons?
Rule #1: Don't.
Rule #2: Don't.
Snapshotting an event sourced model is a performance optimization. The first rule of performance optimization? Don't.
Specifically, snapshotting reduces the amount of time you lose in your repository trying to reload the history of your model from your event store.
If your repository can keep the model in memory, then you aren't going to be reloading it very often. So the win from snapshotting will be small. Therefore: don't.
If you can decompose your model into aggregates, which is to say that you can decompose the history of your model into a number of entities that have non-overlapping histories, then your one long model history becomes many short histories that each describe the changes to a single entity. Each entity history that you need to load will be pretty short, so the win from a snapshot will be small. Therefore: don't.
The kind of systems I'm working on today require high performance but not 24x7 availability. So in a situation where I shut down my system for maintenance and restart it, I'd have to load and reprocess my entire event store, as my fresh system doesn't know which aggregate ids to process the events for. I need a better starting point for my systems to restart more efficiently.
You are worried about missing a write SLA when the repository memory caches are cold, and you have long model histories with lots of events to reload. Bolting on snapshotting might be a lot more reasonable than trying to refactor your model history into smaller streams. OK....
The snapshot store is a read model -- at any point in time, you should be able to blow away the model and rebuild it from the persisted history in the event store.
From the perspective of the repository, the snapshot store is a cache; if no snapshot is available, or if the store itself doesn't respond within the SLA, you want to fall back to reprocessing the entire event history, starting from the initial seed state.
The service provider interface is going to look something like
interface SnapshotClient {
    SnapshotRecord getSnapshot(Identifier id);
}
SnapshotRecord is going to provide to the repository the information it needs to consume the snapshot. That's going to include at a minimum
a memento that allows the repository to rehydrate the snapshotted state
a description of the last event processed by the snapshot projector when building the snapshot.
The model will then re-hydrate the snapshotted state from the memento, load the history from the event store, scanning backwards (ie, starting from the most recent event) looking for the event documented in the SnapshotRecord, then apply the subsequent events in order.
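Put together, the repository's load path might look like the sketch below. SnapshotClient follows the interface sketched above; the event-stream reader and the Model calls (ReadStream, Replay, RehydrateFrom, Apply) are hypothetical placeholders rather than a real library API.

public class ModelRepository
{
    private readonly SnapshotClient snapshotClient;
    private readonly IEventStreamReader eventStore;

    public ModelRepository(SnapshotClient snapshotClient, IEventStreamReader eventStore)
    {
        this.snapshotClient = snapshotClient;
        this.eventStore = eventStore;
    }

    public Model Load(Identifier id)
    {
        var snapshot = snapshotClient.getSnapshot(id);
        if (snapshot == null)
        {
            // Cache miss, or the snapshot store blew its SLA:
            // fall back to replaying the full history from the initial seed state.
            return Model.Replay(eventStore.ReadStream(id));
        }

        // Re-hydrate the cached state from the memento...
        var model = Model.RehydrateFrom(snapshot.Memento);

        // ...then apply only the events recorded after the one noted in the snapshot.
        foreach (var @event in eventStore.ReadStream(id, after: snapshot.LastEventId))
        {
            model.Apply(@event);
        }
        return model;
    }
}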
The SnapshotRepository itself could be a key-value store (at most one record for any given id), but a relational database with blob support will work fine too
select *
from snapshots s
where id = ?
order by s.total_events desc
limit 1
The snapshot projector and the repository are tightly coupled -- they need to agree on what the state of the entity should be for all possible histories, they need to agree how to de/re-hydrate the memento, and they need to agree which id will be used to locate the snapshot.
The tight coupling also means that you don't need to worry particularly about the schema for the memento; a byte array will be fine.
They don't, however, need to agree with previous incarnations of themselves. Snapshot Projector 2.0 discards/ignores any snapshots left behind by Snapshot Projector 1.0 -- the snapshot store is just a cache after all.
I'm designing an application that will probably generate millions of events a day. What can we do if we need to rebuild a view 6 months later?
One of the more compelling answers here is to model time explicitly. Do you have one entity that lives for six months, or do you have 180+ entities that each live for one day? Accounting is a good domain to reference here: at the end of the fiscal year, the books are closed, and the next year's books are opened with the carryover.
Yves Reynhout frequently talks about modeling time and scheduling; Evolving a Model may be a good starting point.
There are few instances where you need to snapshot for sure, but there are a couple. A common example is an account in a ledger: you'll have thousands, maybe millions, of credit/debit events producing the final BALANCE state of the account, and it would be insane not to snapshot that every so often.
My approach to snapshotting when I designed Aggregates.NET was that it's off by default; to enable it, your aggregates or entities must inherit from AggregateWithMemento or EntityWithMemento, which in turn requires your entity to define a RestoreSnapshot, a TakeSnapshot and a ShouldTakeSnapshot.
The decision whether to take a snapshot or not is left up to the entity itself. A common pattern is
Boolean ShouldTakeSnapshot() {
    return this.Version % 50 == 0;
}
Which of course would take a snapshot every 50 events.
When reading the entity stream, the first thing we do is check for a snapshot, then read the rest of the entity's stream from the moment the snapshot was taken. I.e. don't ask for the entire stream, just the part we have not snapshotted.
As for the store - you can use literally anything. VOU is right though: a key-value store is best, because you only need to 1. check if one exists and 2. load the entire thing, which is ideal for KV.
For system restarts - I'm not really following what your described problem is. There's no reason for your domain server to be stateful in the sense that it's doing something different at different points in time. It should do just one thing - process the next command. In the process of handling a command it loads data from the event store, including a snapshot, and runs the command against the entity, which either produces a business exception or domain events which are recorded to the store.
I think you may be trying to optimize too much with this talk of caching and cold starts.