CQRS and Event Sourcing coupled with Relational Database Design

Let me start by saying, I do not have real world experience with CQRS and that is the basis for this question.
Background:
I am building a system where a new key requirement is allowing admins to "play back" user actions (admins want to be able to step through every action that has happened in the system up to any particular point). The caveats: the company already has reports generated off their current SQL database that they will not change (at least not in parallel with this new requirement), so the store of record will remain SQL. I do not have access to SQL's Change Data Capture, and the alternative of creating a bunch of history tables with triggers would be incredibly difficult to maintain, so I'd like to avoid that if at all possible. Lastly, there are potentially (not currently) a lot of data entry points that go through a versioning lifecycle resulting in changes to the SQL db (adding/removing fields), so if I tried to implement change tracking in SQL I'd have to maintain the tables that handle the older versions of the data (a nightmare).
Potential Solution
I am thinking about using NoSQL (Azure DocumentDB) to handle data storage (writes) and then have command handlers handle updating the current SQL (Azure SQL) with the relevant data to be queried (reads). That way the audit trail is created and that idea of "playing back" can be handled while also not disturbing the current back end functionality that is provided.
This approach would handle the requirement and satisfy the caveats. I wouldn't use CQRS for the entire app, just for the pieces where I need this "playback" functionality. I know that I would have to mitigate failure points along the Client -> Write to DocumentDB -> Respond to user with success/fail -> Write to SQL on successful DocumentDB write path, but my novice CQRS eyes can't see a reason why this isn't a great way to handle this.
Any advice would be greatly appreciated.

This article explains the CQRS pattern and provides an example of a CQRS implementation; please refer to it.
I am thinking about using NoSQL (Azure DocumentDB) to handle data storage (writes) and then have command handlers handle updating the current SQL (Azure SQL) with the relevant data to be queried (reads).
Here is my suggestion: when a user performs a write operation to update a record, always insert rather than update, so that admins can audit the user's operations afterwards. For example, if the user wants to update a record, insert a new version of the entity with a property indicating whether the operation has been audited by an admin, instead of updating the record directly.
Original data in document
{
"version1_data": {
"data": {
"id": "1",
"name": "jack",
"age": 28
},
"isaudit": true
}
}
To update the age field, we insert an entity with the updated information instead of modifying the original data directly.
{
"version1_data": {
"data": {
"id": "1",
"name": "jack",
"age": 28
},
"isaudit": true
},
"version2_data": {
"data": {
"id": "1",
"name": "jack",
"age": 29
},
"isaudit": false
}
}
An admin can then inspect the document to audit the user's operations and decide whether the updates should be written to the SQL database.
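As a rough sketch of that approach (in Java, with a hypothetical DocumentRepository standing in for the actual DocumentDB client calls), the command handler could append a new version entry rather than overwrite the document:

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical repository abstraction; swap in the real DocumentDB client calls.
interface DocumentRepository {
    Map<String, Object> load(String documentId);
    void save(String documentId, Map<String, Object> document);
}

class UpdateCustomerHandler {
    private final DocumentRepository repository;

    UpdateCustomerHandler(DocumentRepository repository) {
        this.repository = repository;
    }

    // Instead of overwriting the record, append a new "versionN_data" entry
    // flagged as not yet audited. An admin later flips "isaudit", and only the
    // audited version is what gets written to SQL.
    void handle(String documentId, Map<String, Object> updatedData) {
        Map<String, Object> document = repository.load(documentId);
        int nextVersion = document.size() + 1;

        Map<String, Object> entry = new LinkedHashMap<>();
        entry.put("data", updatedData);
        entry.put("isaudit", false);

        document.put("version" + nextVersion + "_data", entry);
        repository.save(documentId, document);
    }
}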

One way to think about this is to create a transaction object that has a unique id and represents the work that needs to be done. The transaction in this case would be "write an object to DocumentDB" or "write an object to the SQL database". It could contain the in-memory object to be written and the destination (DocumentDB, SQL, etc.) connection parameters.
Once you define your transaction, you would adjust your workflow for proper CQRS. Instead of the client writing to DocumentDB directly and waiting on the result of that call, have the client create a transaction with a unique id (which could be something like DateTime tick counts or an incremental transaction id) and write this transaction to a message queue such as an Azure Storage queue or Service Bus. Once the transaction is written to the queue, return success to the user. Worker roles then read the transaction messages from the queue and process them, writing the objects to DocumentDB. That means not overwriting the same entity in DocumentDB, but writing the transaction with its unique incremental id for that particular entity. As far as I know, Azure Table storage would work for that as well.
After successfully writing to DocumentDB, the same worker role writes the transaction to a different message queue, which is processed by its own set of worker roles that update the entity in the SQL database. If anything goes wrong along the way, record the failure in an error table that you can query and retry from later on.
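A minimal sketch of that flow (shown here in Java, with in-memory queues standing in for the Azure queue or Service Bus, and the persistence calls stubbed out) might look like this:

import java.util.UUID;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical message shape; in Azure this would be serialized onto a
// Storage queue or Service Bus queue rather than held in memory.
record TransactionMessage(String transactionId, String entityId, String payloadJson) {}

class WriteWorkflow {
    private final BlockingQueue<TransactionMessage> docDbQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<TransactionMessage> sqlQueue = new LinkedBlockingQueue<>();

    // Client-facing call: enqueue the transaction and return immediately.
    String submit(String entityId, String payloadJson) {
        String transactionId = UUID.randomUUID().toString();
        docDbQueue.add(new TransactionMessage(transactionId, entityId, payloadJson));
        return transactionId; // report success to the user at this point
    }

    // Worker role #1: append the transaction to the document store,
    // then hand it to the queue feeding the SQL updater.
    void docDbWorker() throws InterruptedException {
        while (true) {
            TransactionMessage message = docDbQueue.take();
            try {
                appendToDocumentDb(message);   // never overwrite, always append
                sqlQueue.add(message);
            } catch (Exception e) {
                recordFailure(message, e);     // error table for later retry
            }
        }
    }

    private void appendToDocumentDb(TransactionMessage message) { /* DocumentDB client call */ }
    private void recordFailure(TransactionMessage message, Exception e) { /* insert into error table */ }
}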

How to persist aggregate/read model from "EventStore" in a database?

Trying to implement Event Sourcing and CQRS for the first time, but got stuck when it came to persisting the aggregates.
This is where I'm at now
I've set up "EventStore" and a stream, "foos"
Connected to it from node-eventstore-client
I subscribe to events with catchup
This is all working fine.
With the help of the eventAppeared event handler function I can build the aggregate, whenever events occur. This is great, but what do I do with it?
Let's say I build an aggregate that is a list of Foos
[
{
id: 'some aggregate uuidv5 made from barId and bazId',
barId: 'qwe',
bazId: 'rty',
isActive: true,
history: [
{
id: 'some event uuid',
data: {
isActive: true,
},
timestamp: 123456788,
eventType: 'IsActiveUpdated'
},
{
id: 'some event uuid',
data: {
barId: 'qwe',
bazId: 'rty',
},
timestamp: 123456789,
eventType: 'FooCreated'
}
]
}
]
To follow CQRS I will build the above aggregate within a Read Model, right? But how do I store this aggregate in a database?
I guess just a nosql database should be fine for this, but I definitely need a db since I will put a gRPC API in front of this and other read models / aggregates.
But how do I actually go from having built the aggregate in memory to persisting it in the db?
I once tried following this tutorial https://blog.insiderattack.net/implementing-event-sourcing-and-cqrs-pattern-with-mongodb-66991e7b72be which was super simple, since you'd use mongodb both as the event store and just create a view for the aggregate and update that one when new events are incoming. It had its flaws and limitations (the aggregation pipeline), which is why I now turned to "EventStore" for the event store part.
But how to persist the aggregate, which is currently just built and stored in code/memory from events in "EventStore"...?
I feel this may be a silly question but do I have to loop over each item in the array and insert each item in the db table/collection or do you somehow have a way to dump the whole array/aggregate there at once?
What happens after? Do you create a materialized view per aggregate and query against that?
I'm open to picking the best db for this, whether that is postgres/other rdbms, mongodb, cassandra, redis, table storage etc.
Last question. For now I'm just using a single stream "foos", but at this level I expect new events to happen quite frequently (every couple of seconds or so). As I understand it, you'd still persist it and update it using materialized views, right?
So given that barId and bazId in combination can be used for grouping events, instead of a single stream I'd think more specialized streams such as foos-barId-bazId would be the way to go, to try and reduce the frequency of incoming new events to a point where recreating materialized views will make sense.
Is there a general rule of thumb saying not to recreate/update/refresh materialized views if the update frequency gets below a certain limit? Then the only other alternative would be querying from a normal table/collection?
Edit:
In the end I'm trying to make a gRPC api that has just 2 rpcs - one for getting a single foo by id and one for getting all foos (with optional field for filtering by status - but that is not so important). The simplified proto would look something like this:
rpc GetFoo(FooRequest) returns (Foo)
rpc GetFoos(FoosRequest) returns (FoosResponse)
message FooRequest {
string id = 1; // uuid
}
// If the optional status field is not specified, return all foos
message FoosRequest {
// If this field is specified only return the Foos that has isActive true or false
FooStatus status = 1;
enum FooStatus {
UNKNOWN = 0;
ACTIVE = 1;
INACTIVE = 2;
}
}
message FoosResponse {
repeated Foo foos = 1;
}
message Foo {
string id = 1; // uuid
string bar_id = 2; // uuid
string baz_id = 3; // uuid
boolean is_active = 4;
repeated Event history = 5;
google.protobuf.Timestamp last_updated = 6;
}
message Event {
string id = 1; // uuid
google.protobuf.Any data = 2;
google.protobuf.Timestamp timestamp = 3;
string eventType = 4;
}
The incoming events would look something like this:
{
id: 'some event uuid',
barId: 'qwe',
bazId: 'rty',
timestamp: 123456789,
eventType: 'FooCreated'
}
{
id: 'some event uuid',
isActive: true,
timestamp: 123456788,
eventType: 'IsActiveUpdated'
}
As you can see there is no uuid to make it possible to GetFoo(uuid) in the gRPC API, which is why I'll generate a uuidv5 from the barId and bazId, which combined will be a valid uuid. I'm creating that in the projection / aggregate you see above.
Also, the GetFoos rpc will either return all foos (if the status field is left undefined), or it will return the foos whose isActive matches the status field (if specified).
Yet I can't figure out how to continue from the catchup subscription handler.
I have the events stored in "EventStore" (https://eventstore.com/). Using a subscription with catchup, I have built an aggregate/projection with an array of Foos in the form that I want them. But to be able to get a single Foo by id from a gRPC API of mine, I guess I'll need to store this entire aggregate/projection in a database of some sort, so I can connect and fetch the data from the gRPC API? And every time a new event comes in I'll need to add that event to the database as well, or how does this work?
I think I've read every resource I can possibly find on the internet, but still I'm missing some key pieces of information to figure this out.
The gRPC is not so important. It could be REST I guess, but my big question is how to make the aggregated/projected data available to the API service (possibly more APIs will need it as well)? I guess I will need to store the aggregated/projected data with the generated uuid and history fields in a database to be able to fetch it by uuid from the API service, but what database and how is this storing process done, from the catchup event handler where I build the aggregate?
I know exactly how you feel! This is basically what happened to me when I first tried to do CQRS and ES.
I think you have a couple of gaps in your knowledge which I'm sure you will rapidly plug. You hydrate an aggregate from the event stream as you are doing. That IS your aggregate persisted. The read model is something different. Let me explain...
Your read model is the thing you use to run queries against and to provide data for display in a UI, for example. Your aggregates are not (directly) involved in that. In fact, they should be encapsulated, meaning that you can't 'see' their state from the outside, i.e. no getters and setters, with the exception of the aggregate ID, which would have a getter.
This article gives you a helpful overview of how it all fits together: CQRS + Event Sourcing – Step by Step
The idea is that when an aggregate changes state it can only do so via an event it generates. You store that event in the event store. That event is also published so that read models can be updated.
Also looking at your aggregate it looks more like a typical read model object or DTO. An aggregate is interested in functionality, not properties. So you would expect to see void public functions for issuing commands to the aggregate. But not public properties like isActive or history.
I hope that makes sense.
EDIT:
Here are some more practical suggestions.
"To follow CQRS I will build the above aggregate within a Read Model, right? "
You do not build aggregates in the read model. They are separate things on separate sides of the CQRS equation. Aggregates are on the command side. Queries are done against read models, which are different from aggregates.
Aggregates have public void functions and no getter or setters (with the exception of the aggregate id). They are encapsulated. They generate events when their state changes as a result of a command being issued. These events are stored in an event store and are used to recover the state of an aggregate. In other words, that is how an aggregate is stored.
The events go on to be published so that event handlers and other processes can react to them and update the read model and/or trigger new cascading commands.
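A bare-bones sketch of such an aggregate (in Java, reusing the FooCreated and IsActiveUpdated events from the question; the class and method names are illustrative, and the event-store repository is omitted) could look like this:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Event types mirroring the ones in the question.
interface FooEvent {}
record FooCreated(UUID fooId, String barId, String bazId) implements FooEvent {}
record IsActiveUpdated(UUID fooId, boolean isActive) implements FooEvent {}

class FooAggregate {
    private final UUID id;                 // only the id is exposed
    private boolean isActive;
    private final List<FooEvent> uncommittedEvents = new ArrayList<>();

    FooAggregate(UUID id) { this.id = id; }

    public UUID getId() { return id; }

    // Commands: public void methods that validate and raise events.
    public void deactivate() {
        if (!isActive) return;             // idempotent guard
        raise(new IsActiveUpdated(id, false));
    }

    // Rehydration: fold every stored event onto the aggregate.
    public void apply(FooEvent event) {
        if (event instanceof IsActiveUpdated e) {
            this.isActive = e.isActive();
        } else if (event instanceof FooCreated) {
            this.isActive = true;
        }
    }

    private void raise(FooEvent event) {
        apply(event);
        uncommittedEvents.add(event);      // persisted to the event store by a repository
    }

    // Only needed so the repository can save the new events; not query state.
    public List<FooEvent> getUncommittedEvents() { return uncommittedEvents; }
}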
"Last question. For now I'm just using a single stream "foos", but at this level I expect new events to happen quite frequently (every couple of seconds or so) but as I understand it you'd still persist it and update it using materialized views right?"
Every couple of seconds is very likely to be fine. I'm more concerned about "persist and update using materialised views". I don't know exactly what you mean by that, but it doesn't sound like you have the right idea. Views should be very simple read models, with no need for the complex relations you find in an RDBMS, and they are therefore highly optimised for fast reads.
There can be a lot of confusion around the terminology and jargon used in DDD, CQRS and ES. I think in this case the confusion lies in what you think an aggregate is. You mention that you would like to persist your aggregate as a read model. As @Codescribler mentioned, at the sink end of your event stream there isn't a concept of an aggregate. Concretely, in ES, commands are applied to aggregates in your domain by loading the previous events pertaining to that aggregate, rehydrating the aggregate by folding each previous event onto it, and then applying the command, which generates more events to be persisted in the event store.
Downstream, a subscribing process receives all the events in order and builds a read model based on the events and the data contained within. The confusion here is that this read model, at this end, is not an aggregate per se. It might very well look exactly like your aggregate at the domain end, or it could be a read model that doesn't use all the events and/or all the event data.
For example, you may choose to use every bit of information and build a read model that looks exactly like the aggregate hydrated up to the newest event (likely your source of confusion). You may instead have another process that builds a read model that only tallies a specific type of event. You might even subscribe to multiple streams and "join" them into a big read model.
As for how to store it, this is really up to you. It seems to me like you are taking the events and rebuilding your aggregate plus a history of events in a memory structure. This, of course, doesn't scale, which is why you want to store it at rest in a database. I wouldn't use the memory structure, since you would need to do a lot of state diffing when you flush to the database. You should modify the database directly in response to each individual event. Ideally, you also transactionally store the stream position (the number of the last event you processed) with that modification, so you don't process the same event again in the case of a failure.
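To illustrate that last point, here is a sketch of a projection handler that applies one event at a time and stores its checkpoint in the same transaction (Java/JDBC with a PostgreSQL-style upsert; the table names and event shape are assumptions, not something prescribed by EventStore):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Fed one event at a time from the catch-up subscription.
class FooProjection {
    private final Connection connection;   // e.g. a PostgreSQL connection

    FooProjection(Connection connection) { this.connection = connection; }

    void onIsActiveUpdated(String fooId, boolean isActive, long streamPosition) throws SQLException {
        connection.setAutoCommit(false);
        try (PreparedStatement upsert = connection.prepareStatement(
                 "INSERT INTO foo_read_model (id, is_active) VALUES (?, ?) " +
                 "ON CONFLICT (id) DO UPDATE SET is_active = EXCLUDED.is_active");
             PreparedStatement checkpoint = connection.prepareStatement(
                 "UPDATE projection_checkpoint SET position = ? WHERE projection = 'foos'")) {
            upsert.setString(1, fooId);
            upsert.setBoolean(2, isActive);
            upsert.executeUpdate();

            // Store the last handled stream position in the same transaction so a
            // restart never re-applies (or skips) events. Assumes the checkpoint row exists.
            checkpoint.setLong(1, streamPosition);
            checkpoint.executeUpdate();
            connection.commit();
        } catch (SQLException e) {
            connection.rollback();
            throw e;
        }
    }
}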
Hope this helps a bit.

Why doesn't Spring Data support returning an entity for modifying queries?

When implementing a system which creates tasks that need to be resolved by some workers, my idea would be to create a table which would have some task definition along with a status, e.g. for document review we'd have something like reviewId, documentId, reviewerId, reviewTime.
When documents are uploaded to the system we'd just store the documentId along with a generated reviewId and leave the reviewerId and reviewTime empty. When the next reviewer comes along and starts the review, we'd just set their id and the current time to mark the job as "in progress" (I deliberately skip the case where the reviewer takes a long time, or dies during the review).
When implementing such a use case in e.g. PostgreSQL we could use UPDATE review SET reviewerId = :reviewerId, reviewTime = :reviewTime WHERE reviewId = (SELECT reviewId FROM review WHERE reviewerId IS NULL AND reviewTime IS NULL LIMIT 1 FOR UPDATE SKIP LOCKED) RETURNING reviewId, documentId, reviewerId, reviewTime (so basically update the first non-taken row, using SKIP LOCKED to skip any already in-processing rows).
But when moving from the native solution to JDBC and beyond, I'm having trouble implementing this:
Spring Data JPA and Spring Data JDBC don't allow the @Modifying query to return anything other than void/boolean/int and force us to perform 2 queries in a single transaction - one for the first pending row, and a second one with the update
one alternative would be to use a stored procedure, but I really hate the idea of storing such logic so far away from the code
another alternative would be to use a persistent queue and skip the database altogether, but this introduces additional infrastructure components that need to be maintained and learned. Any suggestions are welcome though.
Am I missing something? Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
Why doesn't Spring Data support returning an entity for modifying queries?
Because it seems like a rather special thing to do and Spring Data JDBC tries to focus on the essential stuff.
Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
It is certainly possible to do this.
You can implement a custom method using an injected JdbcTemplate.
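For example (a sketch only; the Review record and the table and column names mirror the question, and PostgreSQL's UPDATE ... RETURNING is assumed to be available through the JDBC driver):

import java.sql.Timestamp;
import java.util.List;
import java.util.Optional;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical projection of the review table described in the question.
record Review(long reviewId, long documentId, long reviewerId, Timestamp reviewTime) {}

class ReviewClaimRepository {
    private final JdbcTemplate jdbcTemplate;

    ReviewClaimRepository(JdbcTemplate jdbcTemplate) { this.jdbcTemplate = jdbcTemplate; }

    // Claims the next free review and returns it in a single round trip.
    @Transactional
    public Optional<Review> claimNext(long reviewerId) {
        List<Review> claimed = jdbcTemplate.query(
            "UPDATE review SET reviewerId = ?, reviewTime = now() " +
            "WHERE reviewId = (SELECT reviewId FROM review WHERE reviewerId IS NULL " +
            "                  LIMIT 1 FOR UPDATE SKIP LOCKED) " +
            "RETURNING reviewId, documentId, reviewerId, reviewTime",
            (rs, rowNum) -> new Review(
                rs.getLong("reviewId"),
                rs.getLong("documentId"),
                rs.getLong("reviewerId"),
                rs.getTimestamp("reviewTime")),
            reviewerId);
        return claimed.stream().findFirst();
    }
}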

Microservices "JOIN" tables within different databases and data replication

I'm trying to achieve a data join between entities.
I've got 2 separate microservices which can communicate with each other using events (RabbitMQ), and all the requests are currently joined within an API gateway.
Suppose my first service is UserService, and the second service is ProductService.
Usually, to get a list of products we do a GET request to /products; the same goes when we want to create a product, we do a POST request to /products.
The product schema looks something like this:
{
title: 'ProductTitle',
description: 'ProductDescription',
user: 'userId'
...
}
The user schema looks something like this:
{
username: 'UserUsername',
email: 'UserEmail'
...
}
So, when creating a product or getting a list of products, we will not have user details like email, username, etc.
What I'm trying to achieve is to get the user details when creating or querying for a list of products, like so:
[
{
title: 'ProductTitle',
description: 'ProductDescription',
user: {
username: 'UserUsername',
email: 'UserEmail'
}
}
]
I could make a REST GET request to the UserService to get the user details for each product.
But my concern is that if the UserService goes down, the products will not have user details.
What are other ways to JOIN tables, other than making REST API calls?
I've read about DATA REPLICATION, but here's another concern: how do we keep a copy of the user details in the ProductService when we create a new product with a POST request?
Usually I do not want to keep a copy of the user details in the ProductService if the user has not created a product. I could also emit events between the services.
Approach 1: Data Replication
Data replication is not harmful as long as it makes your service independent and resilient, but too much data replication is not good either. Microservices don't fit every case well, so we have to compromise on some things.
Approach 2: Event Sourcing and Materialized Views
Generally, if you have data spread across multiple services, you should consider event sourcing and materialized views. These views are pre-computed, disposable data tables that can be updated using the events published by the different data services. Say your "user" service publishes an event: you update your view; when another related event is published, you add to or update the materialized views again, and so on. These views can be kept in a cache for fast retrieval and can be queried to get the data. This pattern adds a little complexity, but it's highly scalable.
Event sourcing is basically a store that saves all your events, so you can replay them to reach a particular state of the system. Generally we create materialized views from the event store.
Say you have an event store where you keep saving all your published events, and at the same time you are updating your materialized views. If you want to query the data, you get it from the materialized views. Since materialized views are disposable, they can always be regenerated from the event store: if a materialized view held in cache gets corrupted, you can completely regenerate it by replaying the events, and if you miss a cache hit you can still get the data from the event store the same way. You can find more in the following links.
Event Sourcing, Materialized View
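To make the idea concrete, here is a minimal sketch of a materialized view kept up to date from published events (Java; the event shapes are assumptions based on the schemas above, and an in-memory map stands in for the cache or view store):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Assumed event payloads; in practice these would arrive via RabbitMQ.
record UserCreated(String userId, String username, String email) {}
record ProductCreated(String productId, String title, String description, String userId) {}
record ProductView(String title, String description, String username, String email) {}

// A disposable materialized view: rebuildable at any time by replaying the event store.
class ProductWithUserView {
    private final Map<String, UserCreated> usersById = new ConcurrentHashMap<>();
    private final Map<String, ProductView> view = new ConcurrentHashMap<>();

    void on(UserCreated event) {
        usersById.put(event.userId(), event);
    }

    void on(ProductCreated event) {
        UserCreated user = usersById.get(event.userId());
        view.put(event.productId(), new ProductView(
            event.title(), event.description(),
            user != null ? user.username() : null,
            user != null ? user.email() : null));
    }

    ProductView get(String productId) { return view.get(productId); }
}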
We are actually using data replication to make each microservice more resilient (giving it the chance to keep working even if another service is down).
This can be achieved in many ways, e.g. in your case by making the ProductService listen to the events sent by the UserService when a user is created, deleted, etc.
Or the UserService could expose a feed that the ProductService reads every n minutes or so, marking the position it last read on the feed. Etc.
There are many things to consider when designing services, and it really depends on your system's mission. E.g. you always have to evaluate the impact of coupling: whether or not it is fine for a service to stop working when another service is down, how important a service is, and what the impact on other services is when this one is not able to work.
If you do not want to keep a copy of data you don't need, you could replicate only the data of the users that are related to a product. If a new product is created by a user that is not in your dataset, you would then get that user from the UserService. This gives you stronger coupling than replicating everything, but weaker coupling than replicating no data at all.
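A small sketch of that middle ground (Java; the UserServiceClient interface and the event handling method are assumptions for illustration, and a map stands in for the ProductService's local store):

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical client and DTO; the real call would be the UserService REST API.
interface UserServiceClient {
    Optional<UserDetails> fetchUser(String userId);
}
record UserDetails(String userId, String username, String email) {}

class ProductUserLookup {
    private final Map<String, UserDetails> localCopies = new ConcurrentHashMap<>();
    private final UserServiceClient userServiceClient;

    ProductUserLookup(UserServiceClient userServiceClient) {
        this.userServiceClient = userServiceClient;
    }

    // Called from the UserService event stream to keep copies we already hold fresh.
    void onUserChanged(UserDetails details) {
        localCopies.computeIfPresent(details.userId(), (id, old) -> details);
    }

    // Called when a product is created: copy the owner once, on demand.
    Optional<UserDetails> resolve(String userId) {
        UserDetails cached = localCopies.get(userId);
        if (cached != null) return Optional.of(cached);
        Optional<UserDetails> fetched = userServiceClient.fetchUser(userId);
        fetched.ifPresent(details -> localCopies.put(userId, details));
        return fetched;
    }
}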
Again, it really depends on what your system is designed for and what it needs to achieve.

In CQRS/Event Sourcing, which is the best approach for a parent to modify the state of all its children?

Use case: Suppose I have the following aggregates
Root aggregate - CustomerRootAggregate (manages each CustomerAggregate)
Child aggregate of the Root aggregate - CustomerAggregate (there are 10 customers)
Question: How do I send a DisableCustomer command to all 10 CustomerAggregates to update their state to disabled?
customerState.enabled = false
Solutions: Since CQRS does not allow the write side to query the read side to get a list of CustomerAggregate IDs, I thought of the following:
The CustomerRootAggregate always stores the IDs of all its CustomerAggregates in the database as JSON. When a DisableAllCustomers command is received by the CustomerRootAggregate, it fetches the CustomerIds JSON and sends a DisableCustomer command to all the children, where each child restores its state before applying the DisableCustomer command. But this means I will have to maintain the consistency of the CustomerIds JSON record.
The Client (Browser - UI) should always send the list of CustomerIds to apply DisableCustomer to. But this will be problematic for a database with thousands of customers.
In the REST API layer, check for the DisableAllCustomers command, fetch all the IDs from the read side, and send DisableAllCustomers(ids) with the IDs populated to the write side.
Which is the recommended approach, or is there a better approach?
Root aggregate - CustomerRootAggregate (manages each CustomerAggregate)
Child aggregate of the Root aggregate - CustomerAggregate (there are 10 customers)
For starters, the language "child aggregate" is a bit confusing. If your model includes a "parent" entity that holds a direct reference to a "child" entity, then both of those entities must be part of the same aggregate.
However, you might have a Customer aggregate for each customer, and a CustomerSet aggregate that manages a collection of Ids.
How do I send a DisableCustomer command to all 10 CustomerAggregates to update their state to disabled?
The usual answer is that you run a query to get the set of Customers to be disabled, and then you dispatch a disableCustomer command to each.
So both 3 and 2 are reasonable answers, with the caveat that you need to consider what your requirements are if some of the DisableCustomer commands fail.
2 in particular is seductive, because it clearly articulates that the client (human operator) is describing a task, which the application then translates into commands to be run by the domain model.
Trying to pack "thousands" of customer ids into the message may be a concern, but for several use cases you can find a way to shrink that down. For instance, if the task is "disable all", then client can send to the application instructions for how to recreate the "all" collection -- ie: "run this query against this specific version of the collection" describes the list of customers to be disabled unambiguously.
When a DisableAllCustomers command is received by the CustomerRootAggregate, it fetches the CustomerIds JSON and sends a DisableCustomer command to all the children, where each child restores its state before applying the DisableCustomer command. But this means I will have to maintain the consistency of the CustomerIds JSON record.
This is close to a right idea, but not quite there. You dispatch a command to the collection aggregate. If it accepts the command, it produces an event that describes the customer ids to be disabled. This domain event is persisted as part of the event stream of the collection aggregate.
Subscribe to these events with an event handler that is responsible for creating a process manager. This process manager is another event sourced state machine. It looks sort of like an aggregate, but it responds to events. When an event is passed to it, it updates its own state, saves those events off in the current transaction, and then schedules commands to each Customer aggregate.
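A compact sketch of such a process manager (Java; the event, command and CommandBus types are illustrative assumptions, and the persistence of the process manager's own events is omitted) might look like this:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Assumed event and command shapes for illustration only.
record AllCustomersDisablementRequested(List<String> customerIds) {}
record CustomerDisabled(String customerId) {}
record DisableCustomer(String customerId) {}

interface CommandBus {
    void send(DisableCustomer command);
}

// Event-sourced process manager: reacts to events, tracks its own state,
// and schedules one DisableCustomer command per id.
class DisableAllCustomersProcess {
    private final Set<String> pending = new HashSet<>();
    private final CommandBus commandBus;

    DisableAllCustomersProcess(CommandBus commandBus) { this.commandBus = commandBus; }

    void on(AllCustomersDisablementRequested event) {
        pending.addAll(event.customerIds());
        event.customerIds().forEach(id -> commandBus.send(new DisableCustomer(id)));
    }

    void on(CustomerDisabled event) {
        pending.remove(event.customerId());   // retry whatever is still pending on a timeout
    }

    boolean isComplete() { return pending.isEmpty(); }
}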
But it's a bunch of extra work to do it that way. Conventional wisdom suggests that you should usually begin by assuming that the process manager approach isn't necessary, and only introduce it if the business demands it. "Premature automation is the root of all evil" or something like that.

Persisting cached data after removal/eviction using Memcache API compatible store

This question specifically pertains to Couchbase, but I believe it would apply to anything with the memcached API.
Let's say I am creating a client/server chat application, and on my server I am storing chat session information for each user in a data bucket. After the chat session is over, I will remove the session object from the data bucket, but at the same time I also want to persist it to a permanent NoSQL datastore for reporting and analytics purposes. I also want session objects to be persisted upon cache eviction, when sessions time out, etc.
Is there some sort of "best practice" (or even a function of Couchbase that I am missing) that enables me to do this efficiently while maintaining the best possible performance of my in-memory caching system?
Using Couchbase Server 2.0, you could set up two buckets (or two separate clusters if you want to separate physical resources). On the session cluster, you'd store JSON documents (the value in the key/value pair), perhaps like the following:
{
"sessionId" : "some-guid",
"users" : [ "user1", "user2" ],
"chatData" : [ "message1", "message2"],
"isActive" : true,
"timestamp" : [2012, 8, 6, 11, 57, 00]
}
You could then write a Map/Reduce view in the session database that gives you a list of all expired items (note that the example below, with the meta argument, requires a recent build of Couchbase Server 2.0, not the DP4).
function(doc, meta) {
if (doc.sessionId && ! doc.isActive) {
emit(meta.id, null);
}
}
Then, using whichever Couchbase client library you prefer, you could have a task to query the view, get the items and move them into the analytics cluster (or bucket). So in C# this would look something like:
var view = sessionClient.GetView("sessions", "all_inactive");
foreach(var item in view)
{
var doc = sessionClient.Get(item.ItemId);
analyticsClient.Store(StoreMode.Add, item.ItemId, doc);
sessionClient.Remove(item.ItemId);
}
If, instead, you wanted to use an explicit timestamp or expiry, your view could index based on the timestamp:
function(doc) {
if (doc.sessionId && ! doc.isActive) {
emit(doc.timestamp, null);
}
}
Your task could then query the view by including a startkey to return all documents that have not been touched in x days.
var view = sessionClient.GetView("sessions", "all_inactive").StartKey(new int[] { DateTime.Now.Year, DateTime.Now.Month, DateTime.Now.Day - 1 });
foreach(var item in view)
{
var doc = sessionClient.Get(item.ItemId);
analyticsClient.Store(StoreMode.Add, item.ItemId, doc);
sessionClient.Remove(item.ItemId);
}
Check out http://www.couchbase.com/couchbase-server/next for more info on Couchbase Server 2.0, and if you need any clarification on this approach, just let me know on this thread.
-- John
CouchDB storage is (eventually) persistent and has no built-in expiry mechanism, so whatever you store in it will remain stored until you remove it; it's not like Memcached, where you can set a timeout for stored data.
So if you are storing sessions in CouchDB, you will have to remove them on your own when they expire, and since that is not an automated mechanism but something you do yourself, there is no reason not to save the data wherever you want at the same time.
TBH I see no advantage of using persistent NoSQL over SQL for session storage (or vice versa); the performance of both will be IO bound. A memory-only key store or a hybrid solution is a whole different story.
As for your problem: move the data in your app's session expiry/session close mechanism, and/or run a cron job that periodically checks session storage for expired sessions and moves the data.