CQRS (event sourcing): Projections with multiple aggregates

I have a question regarding projections involving multiple aggregates in a CQRS architecture.
For example's sake, suppose I have two aggregates, WorkItem and Developer, and that the following events happen sequentially (but not immediately):
WorkItemCreated (workItemId)
WorkItemTitleChanged (workItemId, title)
DeveloperCreated (developerId)
DeveloperNameChanged (developerId, name)
WorkItemAssigned (workItemId, developerId)
I wish to create a projection which is an "inner join" of developer-workitem:
| WorkItemId | DeveloperId | Title | DeveloperName | ... |
|------------|-------------|--------|---------------|-----|
| 1 | 1 | FixBug | John Doe | ... |
I build my projections incrementally, meaning I load the saved projections from the database and apply the remaining events as they come.
My problem is that the event responsible for creating a row in the projection table is WorkItemAssigned. However, that event does not carry the information required from previous events (work item title, developer name, etc.).
In order to have the required information by the time WorkItemAssigned arrives, I have to load all events from the event store and keep in-memory state for all WorkItems and Developers.
Sure, I could have a projection for WorkItem, another for Developer, and query them to retrieve their latest states. But that seems like a lot of work; if I am going to create projections for each aggregate separately, I might as well create a database view to inner-join them (in fact, that is what I am doing).
I am not doing all this by hand; I am currently using a good framework called EventFlow, but it doesn't point me to an answer to this question.
This is a question about the fundamentals of CQRS, and I feel I am missing something here.

I don't think you are missing anything. Projecting read models in an event-sourced system presents a different set of problems than querying from a relational model. The problems are not necessarily easier or harder to solve; they are just different.
The good news is that you have a lot of choices. Event Sourcing allows you to project data in any imaginable way, so you can decide on a solution that is most suitable for each individual projection. I guess the "bad" news (I would argue it's not bad news) is that the solution to the problem is not the same every time, as it is with a relational system, where the answer is always to construct a query using JOINs.
You've already identified a few possible solutions:
Use a relational model as one of your read models
When a certain type of event comes in, re-query the streams that hold the data you need and use them to project on demand
You could also simply hold some data in an interim state (in memory, a document database, the file system, etc.) that allows you to look up the data and project it when needed. So keep lists of updated WorkItems and Developers where they can be read and used whenever a WorkItemAssigned event comes in.
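As a rough sketch of that last option (the event names come from the example above; plain dictionaries stand in for whatever interim lookup store you pick, so this is an illustration rather than EventFlow code):

```python
# Interim lookup state: plain dicts standing in for any lookup store.
work_items = {}   # workItemId -> {"title": ...}
developers = {}   # developerId -> {"name": ...}
assignments = []  # rows of the "inner join" read model

def project(event):
    kind = event["type"]
    if kind == "WorkItemCreated":
        work_items[event["workItemId"]] = {"title": None}
    elif kind == "WorkItemTitleChanged":
        work_items[event["workItemId"]]["title"] = event["title"]
    elif kind == "DeveloperCreated":
        developers[event["developerId"]] = {"name": None}
    elif kind == "DeveloperNameChanged":
        developers[event["developerId"]]["name"] = event["name"]
    elif kind == "WorkItemAssigned":
        # Only now is a row emitted, joining the interim state kept above.
        assignments.append({
            "workItemId": event["workItemId"],
            "developerId": event["developerId"],
            "title": work_items[event["workItemId"]]["title"],
            "developerName": developers[event["developerId"]]["name"],
        })

for e in [
    {"type": "WorkItemCreated", "workItemId": 1},
    {"type": "WorkItemTitleChanged", "workItemId": 1, "title": "FixBug"},
    {"type": "DeveloperCreated", "developerId": 1},
    {"type": "DeveloperNameChanged", "developerId": 1, "name": "John Doe"},
    {"type": "WorkItemAssigned", "workItemId": 1, "developerId": 1},
]:
    project(e)

# assignments now holds the single joined row from the example table above.
```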
I would say creating a relational database as an interim or permanent read model is a perfectly viable way of solving the problem, assuming you are not trying to achieve massive scalability.

Related

How many databases or tables are needed for Event Sourcing?

I'm trying to use event sourcing, DDD, and CQRS.
I can't work out whether I have to create two databases (or tables) (1: JSON events, 2: a normalized database) or just one database (just JSON).
And also, if I do create two databases (or tables), do I have to save data to both (JSON and normalized) atomically, in one transaction, or not?
Best regards
DDD
We're making an assumption here that you fully understand DDD and its implications. Specifically related to Event Sourcing, it's a matter of defining Aggregate boundaries and the events that become their state.
CQRS
Again, we're making an assumption that you fully understand the implications. CQRS merely allows you to write code in vertical slices (i.e. from UI to database) for handling "commands" separately from code that handles "queries". That's all. While it's true that you can then take this further, by storing data in a "read model" that might be in a different table or even a different database, it's not a requirement of implementing CQRS.
As CQRS pertains to Event Sourcing, it's a good fit because the data model you tend to end up with in Event Sourcing is not conducive to complex queries. It's typically limited to "get the Aggregate by its ID". Therefore, having "projections" to store the data in other ways that are more appropriate for querying and loading into UIs is the typical approach.
Event Sourcing
If you implement a Domain Model in such a way that every command handled by an Aggregate (i.e. every use-case/task carried out by a user) generates one or more events, then Event Sourcing is the principle where you store that list of events in an append-only style against the Aggregate's ID, rather than storing a snapshot of the Aggregate after the command was successfully handled.
To load an aggregate from the event store, you load all its previous events and replay them in memory on the Aggregate object, again rather than loading in a single row/document as a snapshot/memento.
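As a loose sketch of that replay (the WorkItem class and the load_events callback are invented for illustration, not any particular framework's API):

```python
class WorkItem:
    """Hypothetical aggregate; its state is rebuilt purely from its own events."""
    def __init__(self):
        self.title = None
        self.assigned_to = None

    def apply(self, event):
        if event["type"] == "WorkItemTitleChanged":
            self.title = event["title"]
        elif event["type"] == "WorkItemAssigned":
            self.assigned_to = event["developerId"]

def rehydrate(aggregate_id, load_events):
    """Replay the stored events (ordered by Version) instead of loading a snapshot."""
    aggregate = WorkItem()
    for event in load_events(aggregate_id):
        aggregate.apply(event)
    return aggregate
```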
A document database is therefore an excellent choice for event stores, because a single document represents the event stream for a given Aggregate. However, if you want to store your event streams in SQL, that's fine, but you might store them in two tables:
create table Aggregate (Id int not null primary key);
create table AggregateEvent (AggregateId int not null references Aggregate(Id), Version int not null, EventBody nvarchar(max) not null, primary key (AggregateId, Version));
The actual event body would typically be the event itself, serialised to a text format like JSON.
Projections and Read Stores
If you take the events generated by the handling of commands by aggregates, and write code that consumes them by writing to a separate data store (SQL, pre-calculated ViewModels, etc), then you can call that a "projection". It's "projecting" the data that's in one shape into another shape fit for a different purpose. The result is a "read store", which you can then query however you need to.
I can't work out whether I have to create two databases (or tables) (1: JSON events, 2: a normalized database) or just one database (just JSON).
It's possible to get by with just an event store and nothing else.
"get by" isn't necessarily pleasant, however. Event stores are, as a rule, really good at "append new information", but not particularly good at "query". Thus, the usual answer is to deploy processes that copy information from your event store to something that has nicer query support.
do I have to save data to both (JSON and normalized) atomically, in one transaction, or not?
It's a common pattern to update the event storage only, and then later invoke the process to copy the information from the event storage to your query support. Of course, that also means that your queries may end up showing old/out-of-date information (here is the answer to your question as of five minutes ago).
If you store your query friendly data model with the event storage (tables in the same relational database, for instance), then you can arrange for at least some of your updates to the query friendly model to be synchronized with the events.
In other words, you get trade-offs, not a single cookie-cutter pattern that is used everywhere.
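As a small sketch of the synchronized variant mentioned above (the table layout and the sqlite-backed store are assumptions chosen to keep the example self-contained), the event append and the query-friendly update share one transaction:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for "the same relational database"
conn.executescript("""
    create table AggregateEvent (AggregateId integer, Version integer, EventBody text,
                                 primary key (AggregateId, Version));
    create table AggregateSummary (AggregateId integer primary key, Title text);
""")

def append_and_project(aggregate_id, version, event):
    # One transaction: the event and the query-friendly row commit or roll back together.
    with conn:
        conn.execute("insert into AggregateEvent values (?, ?, ?)",
                     (aggregate_id, version, json.dumps(event)))
        conn.execute("insert or replace into AggregateSummary values (?, ?)",
                     (aggregate_id, event["title"]))

append_and_project(1, 1, {"type": "TitleChanged", "title": "Hello"})
```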

DDD, Event Sourcing, and the shape of the Aggregate state

I'm having a hard time understanding the shape of the state that's derived by applying that entity's events vs. a projection of that entity's data.
Is an Aggregate's state ONLY used for determining whether or not a command can successfully be applied? Or should that state be usable in other ways?
An example - I have a Post entity for a standard blog post. I might have events like postCreated, postPublished, postUnpublished, etc. For my projections that I'll be persisting in my read tables, I need a projection for the base posts (which will include all posts, regardless of status, with lots of detail) as well as a published_posts projection (which will only represent posts that are currently published, with only the information necessary for rendering).
In the situation above, is my aggregate state ONLY supposed to be used to determine, for example, if a post can be published or unpublished, etc? If this is the case, is the shape of my state within the aggregate purely defined by what's required for these validations? For example, in my base post projection, I want to have a list of all users that have made a change to the post. In terms of validation for the aggregate/commands, I couldn't care less about the list of users that have made changes. Does that mean that this list should not be a part of my state within my aggregate?
TL;DR: yes - limit the "state" in the aggregate to that data that you choose to cache in support of data change.
In my aggregates, I distinguish two different ideas:
the history, aka the sequence of events that describes the changes in the lifetime of the aggregate
the cache, aka the data values we tuck away because querying the event history every time kind of sucks.
There's not a lot of value in caching results that we are never going to use.
One of the underlying lessons of CQRS is that we don't need aggregates everywhere:
An AGGREGATE is a cluster of associated objects that we treat as a unit for the purpose of data changes. -- Evans, 2003
If we aren't changing the data, then we can safely work directly with immutable copies of the data.
The only essential purpose of the aggregate is to determine what events, if any, need to be applied to bring the aggregate's state in line with a command (if the aggregate can be brought so in line). All state that's not needed for that purpose can be offloaded to a read-side, which can be thought of as a remix of the event stream (with each read-side only maintaining the state it needs).
That said, there are, in practice, reasons to use the aggregate state directly, with the primary one being a desire for stronger consistency for the aggregate: CQRS is inherently eventually consistent. As with all questions of consistent updates, it's important to recognize that consistency isn't free and very often isn't even cheap; I tend to think of a project as having a consistency budget, and I'm pretty miserly about spending it.
In your case, there's probably no reason to include the list of users changing a post in the aggregate state, unless e.g. there's something like "no single user can modify a given post more than n times".
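A minimal sketch of that idea for the Post example (method names and event shapes are assumptions, not a specific framework's API): the aggregate caches only the published flag, because that is all the publish/unpublish decision needs, while the list of editing users lives only in a read-side projection.

```python
class Post:
    """Caches only what command decisions need: the published flag."""

    def __init__(self, history):
        self.published = False
        for event in history:
            self.apply(event)

    def apply(self, event):
        if event["type"] == "postPublished":
            self.published = True
        elif event["type"] == "postUnpublished":
            self.published = False
        # postCreated and edit events carry data the read side projects,
        # but nothing this aggregate needs to cache for its decisions.

    def publish(self):
        """Decide which events, if any, bring the state in line with the command."""
        if self.published:
            return []  # already published: nothing to emit (or raise, per your rules)
        return [{"type": "postPublished"}]
```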

Understanding Lagom's persistent read side

I read through the Lagom documentation and have already written a few small services that interact with each other. But because this is my first foray into CQRS, I still have a few conceptual issues about the persistent read side that I don't really understand.
For instance, I have a user-service that keeps a list of users (as aggregates) and their profile data like email addresses, names, addresses, etc.
The questions I have now are:
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's ID and then query the event store using this ID for the profile data? Or should the read side already keep all profile information?
If the read side has all the information, what is the reason for the event-store? If it's truly write-only, it's not really useful, is it?
Should I design my system so that I can use the event-store as much as possible, or should I have a read side for everything? What are the scalability implications?
If the user model changes (for instance, the profile now includes a description) and I use a read side that contains all profile data, how do I update this read side in Lagom to now also contain this description?
Following that question, should I keep different read-side tables for different fields of the profile, instead of one table containing the whole profile?
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In the latter case, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
As you can see, this whole concept hasn't really 'clicked' yet, and I am thankful for answers and/or some pointers.
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's ID and then query the event store using this ID for the profile data? Or should the read side already keep all profile information?
You should use a specially designed ReadModel for searching profiles using the email address. You should query the Event-store only to rehydrate the Aggregates, and you rehydrate the Aggregates only to send them commands, not queries. In CQRS an Aggregate may not be queried.
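A rough sketch of such a ReadModel, with invented event names and dictionaries in place of real tables (this is not Lagom's API): it is updated from user events and answers the email lookup directly, without ever querying an Aggregate.

```python
# ReadModel specialised for "find profile by email".
profiles_by_email = {}   # email -> profile
email_by_user_id = {}    # userId -> current email, so email changes can be re-keyed

def project(event):
    if event["type"] == "UserRegistered":
        email_by_user_id[event["userId"]] = event["email"]
        profiles_by_email[event["email"]] = {"userId": event["userId"], "name": event["name"]}
    elif event["type"] == "EmailChanged":
        old_email = email_by_user_id.pop(event["userId"])
        profile = profiles_by_email.pop(old_email)
        email_by_user_id[event["userId"]] = event["newEmail"]
        profiles_by_email[event["newEmail"]] = profile

def find_profile(email):
    """The query never touches the event store or an Aggregate."""
    return profiles_by_email.get(email)
```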
If the read side has all the information, what is the reason for the event-store? If it's truly write-only, it's not really useful, is it?
The Event-store is the source of truth for the write side (Aggregates). It is used to rehydrate the Aggregates (they rebuild their internal & private state based on the previously emitted events) before they process commands, and to persist the new events. So the Event-store is append-only but is also used to read the event stream (the events emitted by an Aggregate instance). The Event-store ensures that an Aggregate instance (that is, one identified by a type and an ID) processes only one command at a time.
If the user model changes (for instance, the profile now includes a description) and I use a read side that contains all profile data, how do I update this read side in Lagom to now also contain this description?
I don't use any framework but my own, but I guess that you rewrite the projection (to use the newly added field on the events) and rebuild the ReadModel.
Following that question, should I keep different read-side tables for different fields of the profile, instead of one table containing the whole profile?
You should have a separate ReadModel (with its own table(s)) for each use case. The ReadModel should be blazing fast; this means it should be as small as possible, with only the fields needed for that particular use case. This is very important; it is one of the main benefits of using CQRS.
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In the latter case, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
This depends on you, the architect. It is preferred that each ReadModel own its data; that is, it should subscribe to the right events and not depend on other ReadModels. But this leads to a lot of code duplication. In my experience, I've seen a desire to have some canonical ReadModels that own some data but can also share it on demand. For this, in CQRS, there is also the term query. Just like commands and events, queries can travel in your system, but only from ReadModel to ReadModel.
Queries should not be sent during a client's request. They should be sent only in the background, as an asynchronous synchronization mechanism. This is an important aspect that influences the resilience and responsiveness of your system.
I've also used live queries, which are pushed from the authoritative ReadModels to the subscribed ReadModels in real time, when the answer changes.
In the latter case, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
No, it does not. CQRS does not specify how the R (Read side) is updated, only that the R should not process commands and C should not be queried.

How to handle application death and other mid-operation faults with Mongo DB

Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it is consistent (non-corrupt) data, if my application dies between making a write to one document and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data in a single document, which improves read speed and data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
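For instance, the invoice example mentioned above as a single de-normalized document (collection and field names are purely illustrative), so one read returns the whole invoice and one write replaces it consistently:

```python
# One invoice, one document: the items are embedded rather than normalized away
# into a separate "invoice items" table, so a single read/write covers everything.
invoice = {
    "_id": 1042,
    "customer": {"name": "ACME Corp", "city": "Springfield"},
    "items": [
        {"sku": "A-17", "description": "Widget", "qty": 3, "unitPrice": 9.99},
        {"sku": "B-02", "description": "Gadget", "qty": 1, "unitPrice": 24.50},
    ],
    "total": 54.47,
}
# With pymongo this is a single atomic write:  db.invoices.insert_one(invoice)
```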
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMSs, but exposes a completely different kind of concurrency to a higher level of the application, if not the user.
Thinking about this myself, I want to identify some categories of effects:
1. Your operation has only one database save (saving data into one document)
2. Your operation has two database saves (updates, inserts, or deletions), A and B
2.1. They are independent
2.2. B is required for A to be valid
2.3. They are interdependent (A is required for B to be valid, and B is required for A to be valid)
3. Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem: one database save is atomic. In case 2.1, same thing: if they're independent, they might as well be two separate operations.
For case 2.2, if you save B first and then A, at worst you will have some extra data (B data) that will take up space in your system, but otherwise be harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents. Let's ignore the possibility of putting the engine inside the car document for this example. It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas. So they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add a step that keeps the whole thing valid at all points. First, remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
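Sketched with pymongo calls (the collection layout and field names like gasType and engineId are borrowed from the examples above and are assumptions), each step leaves the data valid on its own:

```python
from pymongo import MongoClient

db = MongoClient()["garage"]  # assumes a local MongoDB instance

def switch_to_diesel(car_id, engine_id):
    # 1. Detach the engine: a car with gas but no engine is still valid.
    db.cars.update_one({"_id": car_id}, {"$unset": {"engineId": ""}})
    # 2. Change the gas type while no engine is attached.
    db.cars.update_one({"_id": car_id}, {"$set": {"gasType": "diesel"}})
    # 3. Convert the detached engine; a diesel engine on its own is valid too.
    db.engines.update_one({"_id": engine_id}, {"$set": {"type": "diesel"}})
    # 4. Re-attach the engine; a crash at any earlier step leaves consistent data.
    db.cars.update_one({"_id": car_id}, {"$set": {"engineId": engine_id}})
```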
If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property, but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure: find all documents that have "towing" but where the document with the ID in that property doesn't have its "towedBy" set to the right ID. The choices there would be to delete the "towing" property or to set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
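A sketch of that check, reusing the db handle from the previous snippet (property names from example 2.3.b):

```python
def find_broken_tows(db):
    """Cars that claim to be towing a car that doesn't point back at them."""
    broken = []
    for car in db.cars.find({"towing": {"$exists": True}}):
        towed = db.cars.find_one({"_id": car["towing"]})
        if towed is None or towed.get("towedBy") != car["_id"]:
            broken.append(car["_id"])
    return broken
```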
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.

Recreate a graph that changes in time

I have an entity in my domain that represents a city's electrical network. Currently my model is an entity with a List that contains breakers, transformers, and lines.
The network changes every time a breaker is opened/closed, users can change connections, etc.
In all examples of CQRS the EventStore is queried with Version and aggregateId.
Do you think I have to implement events only for the "network" aggregate or also for every "Connectable" item?
In this case, when I have to replay all events to get the "actual" status (as of a date), I can have nearly 10,000-20,000 events to process.
Should an Event modify one property, or do I need an Event that modifies a whole object (containing all of the object's properties)?
There's always an exception to the rule, but I think you need to have an event for every command handled in your domain. You can get around the problem of processing so many events by making use of Snapshots.
http://thinkbeforecoding.com/post/2010/02/25/Event-Sourcing-and-CQRS-Snapshots
I assume you mean that currently your "connectable items" are part of the "network" aggregate and you are asking if they should be their own aggregates? That really depends on the nature of your system and problem and is more of a DDD issue than simply a CQRS one. However, if the nature of your changes is typically to operate on the items independently of one another, then they should probably be aggregate roots themselves. Regardless, in order to answer that question we would need to know much more about the system you are modeling.
As for the challenge of replaying thousands of events, you certainly do not have to replay all your events for each command. Sure, snapshotting is an option, but even better is caching the aggregate root objects in memory after they are first loaded, so that you do not have to source from events with each command (unless the system crashes, in which case you can rely on snapshots for quicker recovery, though you may not need them with caching since you only pay the loading penalty once).
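A rough sketch combining caching and snapshots (the snapshot store, event loader, and apply function are placeholders rather than a specific framework's API):

```python
_cache = {}  # aggregateId -> (version, aggregate), kept in memory between commands

def load_aggregate(aggregate_id, load_snapshot, load_events_since, apply):
    """Avoid replaying the full 10,000+ event history on every command."""
    if aggregate_id in _cache:
        version, aggregate = _cache[aggregate_id]
    else:
        # e.g. returns (0, empty_network()) when no snapshot has been stored yet
        version, aggregate = load_snapshot(aggregate_id)
    for event in load_events_since(aggregate_id, version):
        aggregate = apply(aggregate, event)
        version += 1
    _cache[aggregate_id] = (version, aggregate)
    return aggregate
```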
Now, if you are distributing this system across multiple hosts or threads, there are some other issues to consider, but I think that discussion is best left for another question or the forums.
Finally, you asked (I think) whether an event can modify more than one property of the state of an object. Yes, if that is what makes sense based on what that event represents. The idea of an event is simply that it represents a state change in the aggregate; however, these events should also represent concepts that make sense to the business.
I hope that helps.