Trying to implement Event Sourcing and CQRS for the first time, but got stuck when it came to persisting the aggregates.
This is where I'm at now
I've set up "EventStore" and a stream, "foos"
Connected to it from node-eventstore-client
I subscribe to events with catchup
This is all working fine.
With the help of the eventAppeared event handler function I can build the aggregate, whenever events occur. This is great, but what do I do with it?
Let's say I build an aggregate that is a list of Foos
[
  {
    id: 'some aggregate uuidv5 made from barId and bazId',
    barId: 'qwe',
    bazId: 'rty',
    isActive: true,
    history: [
      {
        id: 'some event uuid',
        data: {
          isActive: true,
        },
        timestamp: 123456788,
        eventType: 'IsActiveUpdated'
      },
      {
        id: 'some event uuid',
        data: {
          barId: 'qwe',
          bazId: 'rty',
        },
        timestamp: 123456789,
        eventType: 'FooCreated'
      }
    ]
  }
]
To follow CQRS I will build the above aggregate within a Read Model, right? But how do I store this aggregate in a database?
I guess just a NoSQL database should be fine for this, but I definitely need a db since I will put a gRPC API in front of this and other read models / aggregates.
But how do I actually get from having built the aggregate to persisting it in the db?
I once tried following this tutorial https://blog.insiderattack.net/implementing-event-sourcing-and-cqrs-pattern-with-mongodb-66991e7b72be which was super simple, since you'd use mongodb both as the event store and just create a view for the aggregate and update that one when new events are incoming. It had its flaws and limitations (the aggregation pipeline), which is why I have now turned to "EventStore" for the event store part.
But how to persist the aggregate, which is currently just built and stored in code/memory from events in "EventStore"...?
I feel this may be a silly question but do I have to loop over each item in the array and insert each item in the db table/collection or do you somehow have a way to dump the whole array/aggregate there at once?
What happens after? Do you create a materialized view per aggregate and query against that?
I'm open to picking the best db for this, whether that is postgres/other rdbms, mongodb, cassandra, redis, table storage etc.
Last question. For now I'm just using a single stream "foos", but at this level I expect new events to happen quite frequently (every couple of seconds or so) but as I understand it you'd still persist it and update it using materialized views right?
So given that barId and bazId in combination can be used for grouping events, instead of a single stream I'd think more specialized streams such as foos-barId-bazId would be the way to go, to try and reduce the frequency of incoming new events to a point where recreating materialized views will make sense.
Is there a general rule of thumb saying not to recreate/update/refresh materialized views if the update frequency gets below a certain limit? Then the only other alternative would be querying from a normal table/collection?
Edit:
In the end I'm trying to make a gRPC api that has just 2 rpcs - one for getting a single foo by id and one for getting all foos (with optional field for filtering by status - but that is not so important). The simplified proto would look something like this:
rpc GetFoo(FooRequest) returns (Foo);
rpc GetFoos(FoosRequest) returns (FoosResponse);

message FooRequest {
  string id = 1; // uuid
}

// If the optional status field is not specified, return all foos
message FoosRequest {
  // If this field is specified only return the Foos that have isActive true or false
  FooStatus status = 1;
  enum FooStatus {
    UNKNOWN = 0;
    ACTIVE = 1;
    INACTIVE = 2;
  }
}

message FoosResponse {
  repeated Foo foos = 1;
}

message Foo {
  string id = 1; // uuid
  string bar_id = 2; // uuid
  string baz_id = 3; // uuid
  bool is_active = 4;
  repeated Event history = 5;
  google.protobuf.Timestamp last_updated = 6;
}

message Event {
  string id = 1; // uuid
  google.protobuf.Any data = 2;
  google.protobuf.Timestamp timestamp = 3;
  string eventType = 4;
}
The incoming events would look something like this:
{
  id: 'some event uuid',
  barId: 'qwe',
  bazId: 'rty',
  timestamp: 123456789,
  eventType: 'FooCreated'
}
{
  id: 'some event uuid',
  isActive: true,
  timestamp: 123456788,
  eventType: 'IsActiveUpdated'
}
As you can see, the events carry no uuid that would make GetFoo(uuid) possible in the gRPC API, which is why I'll generate a uuidv5 from barId and bazId combined, which will be a valid uuid. I'm doing that in the projection / aggregate you see above.
Also the GetFoos rpc will either return all foos (if the status field is left undefined), or alternatively it'll return the foos whose isActive matches the status field (if specified).
Yet I can't figure out how to continue from the catchup subscription handler.
I have the events stored in "EventStore" (https://eventstore.com/), using a subscription with catchup, I have built an aggregate/projection with an array of Foo's in the form that I want them, but to be able to get a single Foo by id from a gRPC API of mine, I guess I'll need to store this entire aggregate/projection in a database of some sort, so I can connect and fetch the data from the gRPC API? And every time a new event comes in I'll need to add that event to the database also or how is this working?
I think I've read every resource I can possibly find on the internet, but still I'm missing some key pieces of information to figure this out.
The gRPC is not so important. It could be REST I guess, but my big question is how to make the aggregated/projected data available to the API service (possibly more APIs will need it as well)? I guess I will need to store the aggregated/projected data with the generated uuid and history fields in a database to be able to fetch it by uuid from the API service, but which database, and how is this storing process done from the catchup event handler where I build the aggregate?
I know exactly how you feel! This is basically what happened to me when I first tried to do CQRS and ES.
I think you have a couple of gaps in your knowledge which I'm sure you will rapidly plug. You hydrate an aggregate from the event stream as you are doing. That IS your aggregate persisted. The read model is something different. Let me explain...
Your read model is the thing you run queries against and use to provide data for display, to a UI for example. Your aggregates are not (directly) involved in that. In fact they should be encapsulated, meaning that you can't 'see' their state from the outside: no getters or setters, with the exception of the aggregate ID, which would have a getter.
This article gives you a helpful overview of how it all fits together: CQRS + Event Sourcing – Step by Step
The idea is that when an aggregate changes state it can only do so via an event it generates. You store that event in the event store. That event is also published so that read models can be updated.
Also looking at your aggregate it looks more like a typical read model object or DTO. An aggregate is interested in functionality, not properties. So you would expect to see void public functions for issuing commands to the aggregate. But not public properties like isActive or history.
I hope that makes sense.
EDIT:
Here are some more practical suggestions.
"To follow CQRS I will build the above aggregate within a Read Model, right? "
You do not build aggregates in the read model. They are separate things on separate sides of the CQRS equation. Aggregates are on the command side. Queries are done against read models, which are different from aggregates.
Aggregates have public void functions and no getters or setters (with the exception of the aggregate id). They are encapsulated. They generate events when their state changes as a result of a command being issued. These events are stored in an event store and are used to recover the state of an aggregate. In other words, that is how an aggregate is stored.
The events go on to be published so the event handlers and other processes can react to them and update the read model and or trigger new cascading commands.
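To make that a bit more concrete, here is a rough sketch of what an encapsulated aggregate could look like. It is written in TypeScript and is only an illustration: the FooAggregate class, its create/setActive commands and the pullUncommittedEvents method are names I made up, and only the two event types come from your question.

```typescript
// Event types taken from the question; the payload shape is an assumption.
type FooEvent =
  | { eventType: "FooCreated"; barId: string; bazId: string }
  | { eventType: "IsActiveUpdated"; isActive: boolean };

// Hypothetical aggregate: state is private, commands are void methods,
// and only the aggregate id is readable from the outside.
class FooAggregate {
  private created = false;
  private isActive = false;
  private uncommitted: FooEvent[] = [];

  constructor(readonly id: string) {}

  // Rehydrate by folding the stored events onto a fresh instance.
  static fromHistory(id: string, history: FooEvent[]): FooAggregate {
    const foo = new FooAggregate(id);
    for (const event of history) foo.apply(event);
    return foo;
  }

  // Command: validates against current state, then records an event.
  create(barId: string, bazId: string): void {
    if (this.created) throw new Error("Foo already exists");
    this.record({ eventType: "FooCreated", barId, bazId });
  }

  // Command: only records an event when the state would actually change.
  setActive(isActive: boolean): void {
    if (!this.created) throw new Error("Foo does not exist");
    if (this.isActive === isActive) return;
    this.record({ eventType: "IsActiveUpdated", isActive });
  }

  // New events, to be appended to the event store and then published.
  pullUncommittedEvents(): FooEvent[] {
    const events = this.uncommitted;
    this.uncommitted = [];
    return events;
  }

  private record(event: FooEvent): void {
    this.apply(event);
    this.uncommitted.push(event);
  }

  private apply(event: FooEvent): void {
    switch (event.eventType) {
      case "FooCreated":
        this.created = true;
        break;
      case "IsActiveUpdated":
        this.isActive = event.isActive;
        break;
    }
  }
}
```

A command handler would load the history from the stream, call FooAggregate.fromHistory, invoke one command, and append whatever pullUncommittedEvents returns. Nothing here is queried by the API; that is the read model's job.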
"Last question. For now I'm just using a single stream "foos", but at this level I expect new events to happen quite frequently (every couple of seconds or so) but as I understand it you'd still persist it and update it using materialized views right?"
Every couple of seconds is very likely to be fine. I'm more concerned about the "persist and update using materialised views" part. I don't know what you mean by that, but it doesn't sound like you have the right idea. Views should be very simple read models. There is no need for complex relations like you find in an RDBMS, and the read model can therefore be highly optimised for fast reading.
There can be a lot of confusion around the terminology and jargon used in DDD, CQRS and ES. I think in this case the confusion lies in what you think an aggregate is. You mention that you would like to persist your aggregate as a read model. As @Codescribler mentioned, at the sink end of your event stream there isn't a concept of an aggregate. Concretely, in ES, commands are applied to aggregates in your domain by loading the previous events pertaining to that aggregate, rehydrating the aggregate by folding each previous event onto it, and then applying the command, which generates more events to be persisted in the event store.
Downstream, a subscribing process receives all the events in order and builds a read model based on the events and the data contained within. The confusion here is that this read model, at this end, is not an aggregate per se. It might very well look exactly like your aggregate at the domain end, or it might be a read model that doesn't use all the events or all of the event data.
For example, you may choose to use every bit of information and build a read model that looks exactly like the aggregate hydrated up to the newest event (likely your source of confusion). You may instead have another process that builds a read model that only tallies a specific type of event. You might even subscribe to multiple streams and "join" them into a big read model.
As for how to store it, this is really up to you. It seems to me like you are taking the events and rebuilding your aggregate plus a history of events in a memory structure. This, of course, doesn't scale, which is why you want to store it at rest in a database. I wouldn't use the memory structure, since you would need to do a lot of state diffing when you flush to the database. You should modify the database directly in response to each individual event. Ideally, you also transactionally store the stream count (the position of the last processed event) with said modification so you don't process the same event again in the case of a failure.
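Here is a minimal sketch of that per-event update, assuming TypeScript and an in-memory Map standing in for the database. FooReadModel, RecordedEvent, the eventNumber field and the fooId field (the uuidv5 you derive from barId and bazId) are my own assumptions for illustration, not the API of any EventStore client.

```typescript
// Shape of an incoming event; eventNumber and fooId are assumed to be
// supplied by the subscription handler before calling the read model.
interface RecordedEvent {
  id: string;
  eventNumber: number; // position of the event in the stream
  eventType: "FooCreated" | "IsActiveUpdated";
  fooId: string; // e.g. the uuidv5 derived from barId and bazId
  data: { barId?: string; bazId?: string; isActive?: boolean };
  timestamp: number;
}

interface FooRow {
  id: string;
  barId: string;
  bazId: string;
  isActive: boolean;
  history: RecordedEvent[];
}

// In-memory stand-in for a database table or collection.
class FooReadModel {
  private rows = new Map<string, FooRow>();
  private checkpoint = -1; // number of the last processed event

  // Apply exactly one event; no in-memory list to diff and flush later.
  handle(event: RecordedEvent): void {
    if (event.eventNumber <= this.checkpoint) return; // redelivered, skip

    if (event.eventType === "FooCreated") {
      this.rows.set(event.fooId, {
        id: event.fooId,
        barId: event.data.barId ?? "",
        bazId: event.data.bazId ?? "",
        isActive: false,
        history: [event],
      });
    } else {
      const row = this.rows.get(event.fooId);
      if (row) {
        row.isActive = event.data.isActive ?? row.isActive;
        row.history.push(event);
      }
    }
    // With a real database, write this in the same transaction as the row.
    this.checkpoint = event.eventNumber;
  }

  // Backs a GetFoo-style query.
  getFoo(id: string): FooRow | undefined {
    return this.rows.get(id);
  }

  // Backs a GetFoos-style query with an optional status filter.
  getFoos(isActive?: boolean): FooRow[] {
    const all = Array.from(this.rows.values());
    return isActive === undefined ? all : all.filter((f) => f.isActive === isActive);
  }

  lastCheckpoint(): number {
    return this.checkpoint;
  }
}
```

The idea would be to call handle from the catch-up subscription's event handler and, after a restart, resume the subscription from lastCheckpoint() instead of from the beginning. The API service then only ever talks to the database behind getFoo and getFoos.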
Hope this helps a bit.
I am new to event sourcing, but as far as I have understood, when we have a command use case, we instantiate an aggregate in memory, apply events to it from the event store so as to be in the correct state, make the proper changes and then store those changes back to the event store. We also have a read model store that will eventually be updated by these changes.
In my case I have a CreateUserUseCase (which is a command use case) and I want to first check if the user already exists and if the username is already taken. For example something like this:
const userAlreadyExists = await this.userRepo.exists(email);
if (userAlreadyExists) {
  return new EmailAlreadyExistsError(email);
}

const alreadyCreatedUserByUserName = await this.userRepo
  .getUserByUserName(username);
if (alreadyCreatedUserByUserName) {
  return new UsernameTakenError(username);
}

const user = new User(username, password, email);
await this.userRepo.save(user);
So, for the save method I would use the event store and append the uncommitted events to it. What about the exists and getUserByUserName methods though? On the one hand I want to make a specific query so I could use my read model store to get the data that I need, but on the other hand this makes a contradiction with CQRS. So what do we do in these cases? Do we, in some way, perform queries to the event store? And in what way do we do this?
Thank you in advance!
CQRS shouldn't be interpreted as "don't query the write model" (because the process of determining state from the write model for the purpose of command processing entails a query, this restriction is untenable). Instead, interpret it as "it's perfectly acceptable to have a different data model for a query than the one you use for handling intentions to update". This formulation implies that if the write model is a good fit for a given query, it's OK to execute the query against the write model.
Event sourcing in turn is arguably (especially in conjunction with certain usage styles) the ultimate in data models that optimize for write vs. read and accordingly the event-sourced model makes nearly all queries outside of a fairly small set so inefficient that some form of CQRS is needed.
What query facilities an event store includes are typically limited, but the one query anything that's a suitable event store will have (because it's needed for replaying the events) is a compound query that amounts to "give me the latest snapshot for that entity and either (if the snapshot exists) the first n events after that snapshot or (if no snapshot) the first n events for that entity". The result of that query is dispositive (modulo things like retention etc.) to the question of "has this entity published events"?
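As a hedged illustration of using that one query for an existence check, here is a small TypeScript sketch. InMemoryEventStore, userStreamId and userExists are hypothetical names, snapshots are left out, and the sketch assumes the stream id is derived from the email, which is a design choice and not something the question states.

```typescript
interface StoredEvent {
  type: string;
  payload: Record<string, unknown>;
}

// Hypothetical minimal event store; a real client would expose some
// equivalent "read the first n events of this stream" call.
class InMemoryEventStore {
  private streams = new Map<string, StoredEvent[]>();

  append(streamId: string, events: StoredEvent[]): void {
    const existing = this.streams.get(streamId) ?? [];
    this.streams.set(streamId, existing.concat(events));
  }

  // "Give me the first n events for that entity."
  read(streamId: string, maxCount: number): StoredEvent[] {
    return (this.streams.get(streamId) ?? []).slice(0, maxCount);
  }
}

// Assumption: the user's stream id is derived from the email address.
function userStreamId(email: string): string {
  return `user-${email.toLowerCase()}`;
}

// "Has this entity published events?" answered by reading one event.
function userExists(store: InMemoryEventStore, email: string): boolean {
  return store.read(userStreamId(email), 1).length > 0;
}
```

This only answers the question for whatever the stream id is keyed on. With the stream keyed by email, the username check would still need either a second stream keyed by username or a query against a read model.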
We have implemented drools engine in our platform in order to be able to evaluate rules from streams.
In our use case we have a change detection stream which contains the changes of multiple entities.
Rules need to be evaluated for each entity from the stream over a period of time, evolving its state independently of the other entities (sessions). Those rules produce alerts based on the state of each entity. For this reason entities should be kept within boundaries, so that the state of one entity does not interfere with the others.
To achieve this, we create a session as a Spring Bean for each entity id and store it in an in-memory HashMap. Every time an entity arrives, we try to find its session in the in-memory map by its id. If we get null back, we create it.
This doesn't seem like the right way to accomplish it, because it offers neither a disaster recovery strategy nor good memory management.
We could use some kind of in-memory database such as Redis or Memcached, but I don't think it would be able to recover a stateful session precisely.
Does anyone know the right way to achieve disaster recovery and good memory management with embedded Drools and multiple sessions? Does the platform offer a solution?
Thanks very much for your attention and support
The answer is not to try to persist and reuse sessions, but rather to persist an object that models the current state of the entity.
Your current workflow is this:
1. Entity arrives at your application (from change detection stream or elsewhere)
2. You do a lookup on a hashmap to get a Session which has the entity's state stored
3. You fire the rules, which updates the session (and possibly the entity)
4. You persist the session in-memory.
What your workflow should be is this:
1. (same) Entity arrives at your application
2. You do a look-up on an external data source for the entity's state -- for example from a database or data store
3. You fire the rules, passing in the entity state. Instead of updating the session, you update the state instance.
4. You persist the state to your external data source.
If you add appropriate write-through caches you can guarantee both performance and consistency. This will also allow you to scale your application sideways if you implement appropriate locking / transaction handling for your data source.
Here's a toy example.
Let's say we have an application modelling a Library where a user is allowed to check out books. A user is only allowed to check out a total of 3 books at a time.
The 'event' we receive models a book check-in or check-out event:
class BookBorrowEvent {
  int userId;
  int bookId;
  EventType eventType; // EventType.CHECK_IN or EventType.CHECK_OUT
}
In an external data source we maintain a UserState record -- maybe as a distinct record in a traditional RDBMS or an aggregate; how we store it isn't really relevant to the example. But let's say our UserState record as returned from the data source looks something like this:
class UserState {
  int userId;
  int[] borrowedBookIds;
}
When we receive the event, we'll first retrieve the user state from the external data store (or an internally-managed write-through cache), then add the UserState to the rule inputs. We should be appropriately handling our sessions (disposing of them after use, using session pools as needed), of course.
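public void handleBookBorrow(BookBorrowEvent event) {
  // Load the current state from the external store (or write-through cache)
  UserState state = getUserStateFromStore(event.getUserId());

  KieSession kieSession = ...;
  kieSession.insert( event );
  kieSession.insert( state );
  kieSession.fireAllRules();

  // Persist whatever the rules changed on the state instance
  persistUserStateToStore(state);
}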
Your rules would then do their work against the UserState instance, instead of storing values in local variables.
Some example rules:
rule "User borrows a book"
when
  BookBorrowEvent( eventType == EventType.CHECK_OUT,
                   $bookId: bookId != null )
  $state: UserState( $checkedOutBooks: borrowedBookIds not contains $bookId )
  Integer( this < 3 ) from $checkedOutBooks.length
then
  modify( $state ) { ... }
end

rule "User returns a book"
when
  BookBorrowEvent( eventType == EventType.CHECK_IN,
                   $bookId: bookId != null )
  $state: UserState( $checkedOutBooks: borrowedBookIds contains $bookId )
then
  modify( $state ) { ... }
end
Obviously a toy example, but you could easily add additional rules for cases like user attempts to check out a duplicate copy of a book, user tries to return a book that they hadn't checked out, return an error if the user exceeds the 3 max book borrowing limit, add time-based logic for length of checkout allowed, etc.
Even if you were using stream-based processing so you can take advantage of the temporal operators, this workflow still works because you would be passing the state instance into the evaluation stream as you receive it. Of course in this case it would be more important to properly implement a write-through cache for performance reasons (unless your temporal operators are permissive enough to allow for some data source transaction latency). The only change you need to make is to refocus your rules so that they persist their data to the state object instead of the session itself -- which isn't generally recommended anyway since sessions are designed to be disposed of.
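For what it's worth, the write-through cache itself is not much code. Below is a language-neutral sketch of the load / evaluate / persist cycle, written in TypeScript rather than Java purely for brevity. StateStore, InMemoryStore, WriteThroughCache and handleCheckOut are hypothetical names, and the plain if-statement stands in for firing the Drools rules.

```typescript
interface UserState {
  userId: number;
  borrowedBookIds: number[];
}

// Hypothetical store interface standing in for the external data source.
interface StateStore {
  load(userId: number): UserState | undefined;
  save(state: UserState): void;
}

// Stand-in for a database; counts reads so the cache effect is visible.
class InMemoryStore implements StateStore {
  private rows = new Map<number, string>();
  reads = 0;

  load(userId: number): UserState | undefined {
    this.reads++;
    const raw = this.rows.get(userId);
    return raw === undefined ? undefined : (JSON.parse(raw) as UserState);
  }

  save(state: UserState): void {
    this.rows.set(state.userId, JSON.stringify(state));
  }
}

// Write-through: every save goes to the backing store first, then the cache.
class WriteThroughCache implements StateStore {
  private cache = new Map<number, UserState>();

  constructor(private backing: StateStore) {}

  load(userId: number): UserState | undefined {
    const cached = this.cache.get(userId);
    if (cached) return cached;
    const loaded = this.backing.load(userId);
    if (loaded) this.cache.set(userId, loaded);
    return loaded;
  }

  save(state: UserState): void {
    this.backing.save(state);
    this.cache.set(state.userId, state);
  }
}

// Mirrors handleBookBorrow: load state, apply the "rule", persist state.
function handleCheckOut(store: StateStore, userId: number, bookId: number): boolean {
  const state = store.load(userId) ?? { userId, borrowedBookIds: [] };
  const alreadyBorrowed = state.borrowedBookIds.indexOf(bookId) !== -1;
  if (alreadyBorrowed || state.borrowedBookIds.length >= 3) return false;
  state.borrowedBookIds.push(bookId);
  store.save(state);
  return true;
}
```

Because the state survives in the backing store, losing the process (and its sessions) loses nothing that can't be reloaded, and the cache can be bounded or evicted without affecting correctness.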
I'm trying to wrap my head around how to maintain id references between two aggregates, eg. when an event happens on either side that affects the relationship, the other side is updated as well in an eventual consistent manner.
I have two aggregates, one for "Team" and one for "Event", in the context of a festival with the following code:
@Aggregate
public class Event {
    @AggregateIdentifier
    private EventId eventId;
    private Set<TeamId> teams; // List of associated teams
    ... protected constructor, getters/setters and command handlers ...
}

@Aggregate
public class Team {
    @AggregateIdentifier
    private TeamId teamId;
    private EventId eventId; // Owning event
    ... protected constructor, getters/setters and command handlers ...
}
A Team must always be associated to an event (through the eventId). An event contains a list of associated teams (through the team id set).
When a team is created (CreateTeamCommand) on the Team aggregate, I would like the TeamId set on the Event aggregate to be updated with the team id of the newly created team.
If the command "DeleteEventCommand" on the Event aggregate is executed, all teams associated to the event should also be deleted.
If a team is moved from one event to another event (MoveTeamToEventCommand) on the Team aggregate, the eventId on the Team aggregate should be updated but the TeamId should be removed from the old Event aggregate and be added to the new Event aggregate.
My current idea was to create a saga where I would run SagaLifecycle.associateWith for both the eventId on the Event aggregate and the teamId on the Team aggregate with a @StartSaga on the "CreateTeamCommand" (essentially the first time the relationship starts) and then have an event handler for every event that affects the relationship. My main issues with this solution are:
1: It would mean I would have a unique saga for each possible combination of team and event. Could this cause trouble performance wise if it was scaled to eg. 1mil events with each event having 50 teams? (This is unrealistic for this scenario but relevant for a general solution to maintain relationships between aggregates).
2: It would require I had custom commands and event handlers dedicated to handle the update of teams in team list of the Event aggregate as the resulting events should not be processed in the saga to avoid an infinite loop of updating references.
Thank you for reading this small story and I hope someone can either confirm that I'm on the right track or point me in the direction of a proper solution.
An event contains a list of associated teams (through the team id set).
If you mean "An event aggregate" here by "An event", I don't believe your event aggregate needs team ids. If you think it does, it would be great to understand your reasoning on this.
What I think you need, though, is for your read side to know about this. Your read model for a single "Event" can listen for the events produced by CreateTeamCommand and MoveTeamToEventCommand, as well as all other "Event" related events, and build up the projection accordingly. Remember, don't design your aggregates with querying concerns in mind.
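A rough sketch of such a projection, in TypeScript and independent of Axon, might look like the following. The event names TeamCreated, TeamMovedToEvent and EventDeleted and the EventTeamsProjection class are assumptions of mine, since the question only names the commands.

```typescript
// Hypothetical events published after the corresponding commands succeed.
type DomainEvent =
  | { type: "TeamCreated"; teamId: string; eventId: string }
  | { type: "TeamMovedToEvent"; teamId: string; fromEventId: string; toEventId: string }
  | { type: "EventDeleted"; eventId: string };

// Read-side projection: which teams belong to which festival event.
class EventTeamsProjection {
  private teamsByEvent = new Map<string, Set<string>>();

  on(e: DomainEvent): void {
    switch (e.type) {
      case "TeamCreated":
        this.teamsOf(e.eventId).add(e.teamId);
        break;
      case "TeamMovedToEvent":
        this.teamsOf(e.fromEventId).delete(e.teamId);
        this.teamsOf(e.toEventId).add(e.teamId);
        break;
      case "EventDeleted":
        this.teamsByEvent.delete(e.eventId);
        break;
    }
  }

  // Query used by the read side; returns team ids in sorted order.
  teams(eventId: string): string[] {
    const result: string[] = [];
    const set = this.teamsByEvent.get(eventId);
    if (set) set.forEach((teamId) => result.push(teamId));
    return result.sort();
  }

  private teamsOf(eventId: string): Set<string> {
    let set = this.teamsByEvent.get(eventId);
    if (!set) {
      set = new Set<string>();
      this.teamsByEvent.set(eventId, set);
    }
    return set;
  }
}
```

With this in place the Event aggregate itself does not need to hold the team id set at all; the list of teams is purely a query concern.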
If the command "DeleteEventCommand" on the Event aggregate is executed, all teams associated to the event should also be deleted.
A few things here:
Again, your read side can listen on this event, and update the projections accordingly.
You can also start performing validation on relevant command handlers for the Team aggregate to check whether the Event exists or not before performing the operations. This won't have exact sync, but will cover for most cases (see "How can I verify that a customer ID really exists when I place an order?" section here).
If you really want to delete the associated Team aggregates off the back of a DeleteEventCommand, you need to handle this inside a saga, as there is no way to perform this atomically without leaking the specifics of the data storage system into your domain model. So you have certain retry and idempotency needs here, which a saga can give you. It's not exactly what you are suggesting here, but a related fact is that a single command can't act on a set of aggregates; see the "How can I update a set of aggregates with a single command?" section here.
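To illustrate the retry and idempotency part, here is a framework-free TypeScript sketch of a process that reacts to an event being deleted and issues one delete command per team. It does not use Axon's saga API; TeamCommandGateway, DeleteTeamsOnEventDeleted and the simulated transient failures are all hypothetical.

```typescript
// Hypothetical command gateway. It fails a configurable number of times to
// simulate transient errors; deleting an already deleted team is a no-op.
class TeamCommandGateway {
  readonly deleted = new Set<string>();
  private failuresLeft: number;

  constructor(failures: number) {
    this.failuresLeft = failures;
  }

  deleteTeam(teamId: string): void {
    if (this.failuresLeft > 0) {
      this.failuresLeft--;
      throw new Error("transient failure");
    }
    this.deleted.add(teamId);
  }
}

// Process manager: one command per Team aggregate, retried individually.
class DeleteTeamsOnEventDeleted {
  private handledEventIds = new Set<string>();

  constructor(
    private gateway: TeamCommandGateway,
    private teamsOfEvent: (eventId: string) => string[],
    private maxAttempts = 3,
  ) {}

  onEventDeleted(eventId: string): void {
    if (this.handledEventIds.has(eventId)) return; // safe on redelivery
    for (const teamId of this.teamsOfEvent(eventId)) {
      this.deleteWithRetry(teamId);
    }
    this.handledEventIds.add(eventId);
  }

  private deleteWithRetry(teamId: string): void {
    for (let attempt = 1; ; attempt++) {
      try {
        this.gateway.deleteTeam(teamId);
        return;
      } catch (err) {
        if (attempt >= this.maxAttempts) throw err; // give up, surface the error
      }
    }
  }
}
```

The lookup of which teams belong to an event (teamsOfEvent here) would typically come from a read model, and in a real saga the "already handled" bookkeeping would be persisted rather than kept in memory.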
I'm trying to achieve data join between entities.
I've got 2 separated microservices which can communicate with each other using events (rabbitmq). And all the requests are currently joined within an api gateway.
Suppose my first service is UserService , and second service is ProductService.
Usually, to get a list of products we do a GET request like /products; the same goes for creating a product, where we do a POST request like /products.
The product schema looks something like this:
{
  title: 'ProductTitle',
  description: 'ProductDescription',
  user: 'userId',
  ...
}
The user schema looks something like this:
{
  username: 'UserUsername',
  email: 'UserEmail',
  ...
}
So, when creating a product or getting a list of products, we will not have details about the user such as email, username...
What I'm trying to achieve is to include the user details when creating a product or querying for a list of products, like so:
[
  {
    title: 'ProductTitle',
    description: 'ProductDescription',
    user: {
      username: 'UserUsername',
      email: 'UserEmail'
    }
  }
]
I could make an REST GET request to UserService , to get the user details for each product.
But my concern is that if UserService goes down the product will not have user details.
What are other ways to JOIN tables, other than making REST API calls?
I've read about DATA REPLICATION, but that raises another concern: how do we keep a copy of the user details in ProductService when we create a new product with a POST request?
Usually I do not want to keep a copy of a user's details in ProductService if they have not created a product. I could also have the services emit events to each other.
Approach 1- Data Replication
Data replication is not harmful as long as it makes your service independent and resilient. But too much data replication is not good either. Microservices don't fit every case well, so we have to compromise on some things.
Approach 2- Event sourcing and Materialized views
Generally, if your data is spread across multiple services, you should consider event sourcing and materialized views. These views are pre-computed, disposable data tables that are updated from events published by the different data services. Say your "user" service publishes an event: you then update your view, and when another related event is published you add to or update the materialized views again, and so on. These views can be kept in a cache for fast retrieval and queried to get the data. This pattern adds a little complexity but it's highly scalable.
Event sourcing basically means keeping a store of all your events and replaying them to reach a particular state of the system. Generally we create materialized views from the event store.
Say, for example, you have an event store where you keep saving all your published events, and at the same time you keep updating your materialized views. When you want to query the data, you get it from the materialized views. Because materialized views are disposable, they can always be regenerated from the event store. If a materialized view in the cache gets corrupted, you can completely regenerate it by replaying the events from the event store. Likewise, on a cache miss you can still get the data from the event store by replaying the events. You can find more at the following links.
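A small TypeScript sketch of the "disposable view" idea, under the assumption that the user service publishes events shaped like the ones below (UserCreated, UserEmailChanged and UserDeleted are made-up names, as are applyToView and rebuildView):

```typescript
// Hypothetical events published by the user service.
type UserEvent =
  | { type: "UserCreated"; userId: string; username: string; email: string }
  | { type: "UserEmailChanged"; userId: string; email: string }
  | { type: "UserDeleted"; userId: string };

interface UserView {
  username: string;
  email: string;
}

// Apply one event to the view; used for live updates and for replay alike.
function applyToView(view: Map<string, UserView>, e: UserEvent): void {
  switch (e.type) {
    case "UserCreated":
      view.set(e.userId, { username: e.username, email: e.email });
      break;
    case "UserEmailChanged": {
      const user = view.get(e.userId);
      if (user) user.email = e.email;
      break;
    }
    case "UserDeleted":
      view.delete(e.userId);
      break;
  }
}

// Rebuild the view from scratch by replaying the whole event log in order.
function rebuildView(log: UserEvent[]): Map<string, UserView> {
  const view = new Map<string, UserView>();
  for (const e of log) applyToView(view, e);
  return view;
}
```

Since the live path and the rebuild path share applyToView, a view that was lost or corrupted ends up in the same state after a replay as one that was updated event by event.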
Event Sourcing , Materialized view
Actually we are working with data replication to make each microservice more resilient (giving them the chance to still work even if another service is down).
This can be achieved in many ways, e.g. in your case by making the ProductService listen to the events sent by the UserService when a user is created, deleted, etc.
Or the UserService could have a feed the ProductService is reading every n minutes or so marking the position last read on the feed. Etc.
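The feed variant could look roughly like this in TypeScript. UserFeed, ProductService.poll and the shape of the feed entries are assumptions for the sake of the sketch; in practice the feed would be an HTTP endpoint or a message queue rather than an in-memory array.

```typescript
interface UserFeedEntry {
  userId: string;
  username: string;
  email: string;
}

// Stand-in for an append-only feed exposed by UserService.
class UserFeed {
  private entries: UserFeedEntry[] = [];

  publish(entry: UserFeedEntry): void {
    this.entries.push(entry);
  }

  readFrom(position: number): UserFeedEntry[] {
    return this.entries.slice(position);
  }
}

interface Product {
  title: string;
  description: string;
  user: string; // userId
}

class ProductService {
  private users = new Map<string, { username: string; email: string }>();
  private products: Product[] = [];
  private position = 0; // last position read on the feed

  constructor(private feed: UserFeed) {}

  // Would be triggered by a timer every n minutes.
  poll(): void {
    const entries = this.feed.readFrom(this.position);
    for (const e of entries) {
      this.users.set(e.userId, { username: e.username, email: e.email });
    }
    this.position += entries.length;
  }

  addProduct(product: Product): void {
    this.products.push(product);
  }

  // The join happens locally, so it works even while UserService is down.
  listProducts() {
    return this.products.map((p) => ({
      title: p.title,
      description: p.description,
      user: this.users.get(p.user) ?? null,
    }));
  }
}
```

The trade-off is staleness: between polls the local copy can lag behind UserService, which is usually acceptable for display data like a username or email.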
There are many things to consider when designing services, and it really depends on your system's mission. E.g. you always have to evaluate the impact of coupling: whether or not it is fine for a service to be unable to work when another service is down. In other words, how important is a service, and what is the impact on other services when this one is not able to work?
If you do not want to keep a copy of data you don't need, you could replicate only the data of users that are related to a product. If a new product is created with a user that is not in your dataset, you would then get it from the UserService. This would give you stronger coupling than replicating everything, but weaker coupling than replicating no data at all.
Again, it really depends on what your system is designed for and what it needs to achieve.
I am building a Meteor application and am currently creating the publications and coming up against what seems like a common design quandary around related vs embedded documents. My data model (simplified) has Bookings, each of which have a related Client and a related Service. In order to optimise the speed of retrieving a collection I am embedding the key fields of a Client and Service in the Booking, and also linking to the ID - my Booking model has the following structure:
export interface Booking extends CollectionObject {
client_name: string;
service_name: string;
client_id: string;
service_id: string;
bookingDate: Date;
duration: number;
price: number;
}
In this model, client_id and service_id are references to the linked documents and client_name / service_name are embedded as they are used when displaying a list of bookings.
This all seems fine to me however the missing part of the puzzle is keeping this embedded data up to date. If a user in a separate part of the system updates a service (which would be a reactive collection) then I need this to trigger an update of the service_name to any bookings with the corresponding service ID. Is there an event I should subscribe to for this or am I able to? Client side, I have a form which allows the user to add / edit a Service which simply uses the insert or update method on the MongoObservable collection - the OOP part of me feels like this needs to be overridden in the server code to also then update the related data or am I completely going about this the wrong way?
Is this all irrelevant and should I actually just use https://atmospherejs.com/reywood/publish-composite and return collections of related documents (it just feels like it would harm performance in a production environment when returning several hundred bookings at once)?
i use a lot of the "foreign key" concept as you're describing, and do de-normalize data across collection as you're doing with the service name. i do this explicitly to avoid extra lookups / publishes.
i use 2 strategies to keep things up to date. the first is done when the source data is saved, say in a Meteor method call. i'll update the de-normalized data on the spot, touching the other collection(s). i would do all this in a "high read, low write" scenario.
the other strategy is to use collection hooks to fire when the source collection is updated. i use this package: matb33:collection-hooks
conceptually, it's similar to the first, but the hook into knowing when to do it is different.
an example we're using in the current app i'm working on: we have a news feed with comments. news items and comments are in separate collections, and each record in the comment collection has the id of the associated news item.
we keep a running comment count associated with the news item itself. whenever a comment is added or removed, we increment/decrement the count and update the news item right away.
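here's a stripped-down sketch of that comment count idea in plain TypeScript. to be clear, this is not the Meteor or matb33:collection-hooks api; CommentCollection, its afterInsert/afterRemove methods and keepCommentCountInSync are stand-ins i made up to show the shape of it.

```typescript
interface NewsItem {
  _id: string;
  title: string;
  commentCount: number; // de-normalized running count
}

interface NewsComment {
  _id: string;
  newsItemId: string;
  text: string;
}

type Hook = (comment: NewsComment) => void;

// Minimal stand-in for a collection that fires hooks after writes.
class CommentCollection {
  private docs = new Map<string, NewsComment>();
  private afterInsertHooks: Hook[] = [];
  private afterRemoveHooks: Hook[] = [];

  afterInsert(hook: Hook): void {
    this.afterInsertHooks.push(hook);
  }

  afterRemove(hook: Hook): void {
    this.afterRemoveHooks.push(hook);
  }

  insert(comment: NewsComment): void {
    this.docs.set(comment._id, comment);
    this.afterInsertHooks.forEach((hook) => hook(comment));
  }

  remove(id: string): void {
    const comment = this.docs.get(id);
    if (!comment) return; // nothing removed, so no hook fires
    this.docs.delete(id);
    this.afterRemoveHooks.forEach((hook) => hook(comment));
  }
}

// Wire the hooks once; every later insert/remove keeps the count in sync.
function keepCommentCountInSync(
  comments: CommentCollection,
  newsItems: Map<string, NewsItem>,
): void {
  const adjust = (newsItemId: string, delta: number): void => {
    const item = newsItems.get(newsItemId);
    if (item) item.commentCount += delta;
  };
  comments.afterInsert((c) => adjust(c.newsItemId, 1));
  comments.afterRemove((c) => adjust(c.newsItemId, -1));
}
```

the same wiring would apply to your case: a hook on the service collection that pushes the new service_name into every booking with the matching service_id.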