Let's say I have an object called document and it has a bunch of children in the form of images, audio, video, etc. So a user of my application can create a document by typing some text, adding an image, video, etc. From what I understand in DDD, document is an aggregate, while images and videos are always associated with a document as root. Based on this understanding, how would I design an app that enables a user to create/edit a document? I could have a REST endpoint to upload the document and all its children in one request, but that's potentially a long-running operation. Alternatively, I could design 2 REST endpoints, one to upload the document's text body and another that is called repeatedly to upload its children, which essentially means multiple transactions. Is the second approach still DDD? Am I violating the transaction boundary by splitting document creation and update into multiple requests?
Consistency boundaries (I prefer that term over "transaction boundaries") are not a concept that specifies the granularity of allowed changes. They tell you what can be changed atomically, and what cannot.
For example, if you design your documents to be separate aggregates from the images, then you should not change both the document and the image in one user operation (even when that's technically possible). This means that aggregates cannot be too small, because that would be overly restrictive for a user. However, they should also not be too big, because only one user can change an aggregate at a time, so larger aggregates tend to produce more conflicts.
You should try to design aggregates as small as possible, but still large enough to support your use cases. Thus, you'll have to figure that out yourself for your application with the rules above.
So both approaches that you mention are valid from a DDD point of view.
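To make the second approach concrete, here is a minimal sketch, assuming an in-memory Document aggregate guarded by a version counter. The names Document, Attachment, add_attachment and expected_version are hypothetical and not taken from the question. Each method call stands for one request and one transaction, and the aggregate is consistent after every one of them.

```python
from dataclasses import dataclass, field


class ConcurrencyError(Exception):
    """Raised when the caller worked from a stale version of the aggregate."""


@dataclass(frozen=True)
class Attachment:
    kind: str  # e.g. "image", "audio", "video" (illustrative values)
    name: str


@dataclass
class Document:
    """Hypothetical aggregate root: the text body plus its attachments."""
    body: str
    attachments: list = field(default_factory=list)
    version: int = 0

    def _check(self, expected_version: int) -> None:
        # Only one writer at a time may change the aggregate.
        if expected_version != self.version:
            raise ConcurrencyError(
                f"expected version {expected_version}, found {self.version}"
            )

    def edit_body(self, body: str, expected_version: int) -> None:
        self._check(expected_version)
        self.body = body
        self.version += 1

    def add_attachment(self, attachment: Attachment, expected_version: int) -> None:
        self._check(expected_version)
        self.attachments.append(attachment)
        self.version += 1


# One request creates the document, further requests add the children.
doc = Document(body="Draft text")
doc.add_attachment(Attachment("image", "diagram.png"), expected_version=0)
doc.add_attachment(Attachment("video", "demo.mp4"), expected_version=1)
```

A request that still carries an old version number is rejected instead of silently overwriting a newer change, which is the conflict behaviour that makes large aggregates more painful.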
I'm having a hard time understanding the shape of the state that's derived by applying that entity's events vs a projection of that entity's data.
Is an Aggregate's state ONLY used for determining whether or not a command can successfully be applied? Or should that state be usable in other ways?
An example - I have a Post entity for a standard blog post. I might have events like postCreated, postPublished, postUnpublished, etc. For my projections that I'll be persisting in my read tables, I need a projection for the base posts (which will include all posts, regardless of status, with lots of detail) as well as a published_posts projection (which will only represent posts that are currently published, with only the information necessary for rendering).
In the situation above, is my aggregate state ONLY supposed to be used to determine, for example, if a post can be published or unpublished, etc? If this is the case, is the shape of my state within the aggregate purely defined by what's required for these validations? For example, in my base post projection, I want to have a list of all users that have made a change to the post. In terms of validation for the aggregate/commands, I couldn't care less about the list of users that have made changes. Does that mean that this list should not be a part of my state within my aggregate?
TL;DR: yes - limit the "state" in the aggregate to that data that you choose to cache in support of data change.
In my aggregates, I distinguish two different ideas:
the history, aka the sequence of events that describes the changes in the lifetime of the aggregate
the cache, aka the data values we tuck away because querying the event history every time kind of sucks.
There's not a lot of value in caching results that we are never going to use.
One of the underlying lessons of CQRS is that we don't need aggregates everywhere.
An AGGREGATE is a cluster of associated objects that we treat as a unit for the purpose of data changes. -- Evans, 2003
If we aren't changing the data, then we can safely work directly with immutable copies of the data.
The only essential purpose of the aggregate is to determine what events, if any, need to be applied to bring the aggregate's state in line with a command (if the aggregate can be brought so in line). All state that's not needed for that purpose can be offloaded to a read-side, which can be thought of as a remix of the event stream (with each read-side only maintaining the state it needs).
That said, there are, in practice, reasons to use the aggregate state directly, with the primary one being a desire for stronger consistency for the aggregate: CQRS is inherently eventually consistent. As with all questions of consistent updates, it's important to recognize that consistency isn't free and very often isn't even cheap; I tend to think of a project as having a consistency budget and I'm pretty miserly about spending it.
In your case, there's probably no reason to include the list of users changing a post in the aggregate state, unless e.g. there's something like "no single user can modify a given post more than n times".
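The split can be sketched as follows. This is a minimal illustration, assuming events are simple records carrying a type and the acting user; the Event, PostAggregate and editors_projection names are hypothetical. The aggregate caches only the one flag its commands need, while the list of users is a separate fold over the same events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    type: str  # "postCreated", "postPublished" or "postUnpublished"
    user: str  # who made the change (assumed field)


class PostAggregate:
    """Write side: caches only what command validation needs."""

    def __init__(self, history):
        self.published = False
        for event in history:
            self._apply(event)

    def _apply(self, event):
        if event.type == "postPublished":
            self.published = True
        elif event.type == "postUnpublished":
            self.published = False

    def publish(self, user):
        # The only rule in this sketch: a published post cannot be published again.
        if self.published:
            raise ValueError("post is already published")
        event = Event("postPublished", user)
        self._apply(event)
        return [event]


def editors_projection(history):
    """Read side: a different fold over the same events."""
    seen = []
    for event in history:
        if event.user not in seen:
            seen.append(event.user)
    return seen


history = [
    Event("postCreated", "alice"),
    Event("postPublished", "bob"),
    Event("postUnpublished", "alice"),
]
post = PostAggregate(history)
new_events = post.publish("carol")
editors = editors_projection(history + new_events)
```

If a rule such as "no single user can modify a given post more than n times" were introduced, a per-user counter would move into PostAggregate, because it would then be needed to decide a command.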
I was recently recommended a talk by Jim Webber.
And there was a very interesting point in there.
Jim says the trouble starts when you think that there is a 1-1 correspondence between rows in your database, domain objects and resources in a REST service. This makes it hard when you want to transact work across arbitrary groups.
He goes on to point out that if you have, say, 3 users and want to update them, you do it sequentially, which is very poor because you have to track each of them and handle issues if 1 out of the 3 (or however many transactions you run) fails.
He mentioned the way you should handle this is to make a resource, for all of the 3 users. Resources are cheap and infinite (you can make as many as you want) so use them. So create that resource and in a single operation put their status update.
This is an extremely interesting point to me as there have been times where I have wanted to perform an operation on multiple things that I considered to be singular.
So here is an example:
Say I have a list of users. Say 100. Users would be their own thing/resource. I want to pick x amount of users out of that list (say 10 randomly) and apply 50 points to them.
I want to apply these points to users that have no unique connection in the domain; they are just a random group of users, an arbitrary group.
How would I create a rest endpoint/resource as Jim Webber is implying to handle this operation?
Now, in my admittedly old frame of mind, I would go about it by making a specific resource like users/points/bulk/ (or something) and pass in a list of user IDs and the points I would apply to them. I would never have had the mindset of treating them as a resource; I would have just had a hacky command REST endpoint to perform it.
This point Jim has pointed out is really something I never considered and is such a change of mindset, that it would really make things cleaner.
Could someone explain this to me and give an example of how it would look?
Thanks
He mentioned the way you should handle this is to make a resource, for all of the 3 users. Resources are cheap and infinite (you can make as many as you want) so use them. So create that resource and in a single operation put their status update.
...
How would I create a rest endpoint/resource as Jim Webber is implying to handle this operation?
The basic rule of thumb here is: How would you do it on the Web? As REST is just a generalization of the interaction model that allowed the Web to grow to its present size, the same concepts that have proven successful on the Web can (and should) be used in a REST architecture.
What is a group of resources, actually? If you think about most sports that are played in teams, such as football, almost all players can be divided into certain groups, e.g. players of Team A and players of Team B, or all defensive players and all attacking players. Each of the players is its own resource, but each of the available groups is its own resource as well, since we can give it a name too. We can then talk about the group instead of the individual player, which allows us to cover all of the players in a single, short statement instead of referencing each of them individually. A statement such as "Team A beat the crap out of Team B" will most likely imply that each of the players on Team A was playing better than their counterparts on the opposing team.
It is now only a matter of providing clients with the toolset to group resources together. In a typical HTML page you could, for example, have a table representation of all the active football players of this season across all teams, with a checkbox to select certain players and some control element, such as a submit button, that allows you to create a group for the selected players. The backing HTML form contains not only the actual data set you can select specific players from and a submit button, but also a target URI the request has to be sent to as well as the request method to use. HTML by default uses application/x-www-form-urlencoded as the representation format to send the data to the server, which knows how to process the data based on the invoked endpoint, the HTTP operation used and the media type received.
As a new resource will be created as a consequence of the previous grouping request, the server will respond with a 201 Created response code and a Location HTTP header whose value is a URI pointing to where the newly created grouping is accessible. A client may now get redirected to that URI automatically, or it can use the returned URI to invoke further operations on that resource. As the domain model does not need to (and probably should not) match a resource or affordance model, each of the individual player resources as well as the team resource may use the same database entries to present the data to the client. On updating one resource (either an individual player or the team as a whole), other resources may be affected by this operation as well.
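The grouping flow described so far can be sketched without a real HTTP stack. The following is only an illustration, assuming an in-memory store and hypothetical URIs (/user-groups and /user-groups/{id}/points) that do not come from the question or the talk. Creating the group answers with 201 and a Location header, and a single follow-up operation on the group touches every member.

```python
import itertools

# Sample data: five users, each its own resource (made-up IDs).
users = {f"u{i}": {"points": 0} for i in range(1, 6)}
groups = {}
_ids = itertools.count(1)


def post_user_groups(member_ids):
    """POST /user-groups (hypothetical): name an arbitrary set of users."""
    unknown = [m for m in member_ids if m not in users]
    if unknown:
        return 422, {}, {"unknown": unknown}
    group_id = f"g{next(_ids)}"
    groups[group_id] = list(member_ids)
    return 201, {"Location": f"/user-groups/{group_id}"}, None


def post_group_points(group_id, amount):
    """POST /user-groups/{id}/points (hypothetical): one operation for all members."""
    if group_id not in groups:
        return 404, {}, None
    for member in groups[group_id]:
        users[member]["points"] += amount
    return 204, {}, None


status, headers, _ = post_user_groups(["u1", "u3"])
group_id = headers["Location"].rsplit("/", 1)[-1]
points_status, _, _ = post_group_points(group_id, 50)
```

The user resources and the group resource read from the same underlying entries, so updating the group shows up as a side effect on /users/u1 and /users/u3.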
If you take a look at the definition of PUT in the HTTP specification, you can read something like this:
A PUT request applied to the target resource can have side effects on other resources.
Due to this side effect, it is possible for an update performed via PUT to achieve something similar to a partial update:
Partial content updates are possible by targeting a separately identified resource with state that overlaps a portion of the larger resource, or by using a different method that has been specifically defined for partial updates (for example, the PATCH method defined in RFC5789).
For example, if you update Player 1 of Team A via PUT, it creates, as a side effect, a partial update of the state of Team A, as this just uses the same data the data model provides for that particular player.
In order to achieve the same functionality in a REST architecture, the same concepts should be used, as mentioned before: provide a client with structured data it can select a subset from, and let it perform operations on that subset, such as creating a new resource for the selected elements. In contrast to the Web, where HTML is dominant, the supported media types may vary drastically in a REST architecture. Here, content-type negotiation is a very important part, as it allows the server to choose the most suitable representation format that is supported by the client. Instead of using proprietary representation formats, standardized formats should be used to increase the likelihood that clients not under your control are able to interact with your system. While there is an ongoing effort to introduce media types that give clients forms similar to the ones used in HTML, there is not yet a widely accepted de-facto standard form representation other than HTML. There are a couple of approaches in the works, mostly JSON-based, such as hal-forms, halo+json, Ion or Hydra, though, as mentioned, nothing that is really widely used in production.
As your actual intention is to update a bunch of resources atomically, you could use PATCH here as well, without the need to create new resources, as PATCH is defined to perform all of its instructions atomically: either all succeed or none at all. In the spec, PATCH is defined similarly to how patching is understood in software engineering, as a sequence of instructions that should be applied to a resource to transform it into the desired state. application/json-patch+json is a representation format that is quite close to that definition, whereas application/merge-patch+json takes a totally different approach by defining default rules to apply, depending on whether the request contains a modified or nullified field value. As the latter representation format can only work on a single resource, the former could be used for a batch update. By targeting the collection resource directly, JSON Pointers can be used to address the respective fields of the sub-resources in that collection.
To avoid data loss via PATCH operations due to intermediary updates between fetching the most recent state, calculating the necessary steps to apply and sending the request to the API, an optimistic locking approach should be used. This is achievable via conditional requests, for example by sending the ETag of the fetched representation back in an If-Match header.
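The all-or-nothing behaviour can be illustrated with a hand-written sketch. It covers only a small subset of JSON Patch (the test and replace operations, object members only, no pointer escaping) and is not a library API; apply_patch, resolve and PatchError are hypothetical names. All operations run against a copy, and the copy replaces the original only if every operation succeeded.

```python
import copy


class PatchError(Exception):
    pass


def resolve(doc, pointer):
    """Return (parent, key) for a simple JSON Pointer such as /u1/points."""
    parts = pointer.lstrip("/").split("/")
    parent = doc
    for part in parts[:-1]:
        parent = parent[part]
    return parent, parts[-1]


def apply_patch(doc, operations):
    """Apply every operation to a copy; return the copy only if all succeed."""
    draft = copy.deepcopy(doc)
    for op in operations:
        try:
            parent, key = resolve(draft, op["path"])
            current = parent[key]
        except (KeyError, TypeError) as exc:
            raise PatchError(f"bad path {op['path']}") from exc
        if op["op"] == "test":
            if current != op["value"]:
                raise PatchError(f"test failed at {op['path']}")
        elif op["op"] == "replace":
            parent[key] = op["value"]
        else:
            raise PatchError(f"unsupported op {op['op']}")
    return draft


# The collection resource, addressed as a whole (made-up sample data).
users = {"u1": {"points": 0}, "u2": {"points": 10}}
patch = [
    {"op": "replace", "path": "/u1/points", "value": 50},
    {"op": "replace", "path": "/u2/points", "value": 60},
]
updated = apply_patch(users, patch)
```

If any operation fails, apply_patch raises before returning anything, so the stored collection is never left half-updated.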
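A conditional update of this kind might look like the sketch below. It assumes an in-memory store and derives the ETag from a hash of the representation, which is only one possible way to produce one; get, put_if_match and etag_of are hypothetical helper names. A write carrying an outdated ETag is refused with 412 Precondition Failed.

```python
import hashlib
import json

# Made-up sample state for a single collection resource.
store = {"/users": {"u1": {"points": 0}}}


def etag_of(resource):
    """Derive a validator from the current representation (an assumption)."""
    payload = json.dumps(resource, sort_keys=True).encode()
    return '"' + hashlib.sha256(payload).hexdigest()[:16] + '"'


def get(uri):
    resource = store[uri]
    return 200, {"ETag": etag_of(resource)}, resource


def put_if_match(uri, if_match, new_state):
    """Write only if the client saw the latest state, otherwise 412."""
    if etag_of(store[uri]) != if_match:
        return 412  # Precondition Failed
    store[uri] = new_state
    return 204


_, headers, _ = get("/users")
first = put_if_match("/users", headers["ETag"], {"u1": {"points": 50}})
# A second writer that still holds the old ETag is rejected.
second = put_if_match("/users", headers["ETag"], {"u1": {"points": 99}})
```

The rejected client has to fetch the resource again, recalculate its changes against the new state and retry.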
While patching provides you with the capability to apply the changes atomically, I feel that grouping resources together, if they naturally form a group, such as in the player - team example, feels more common and reuses the interaction model proposed by REST also better IMO.
My team and I are refactoring a REST-API and I have a question.
For the sake of brevity, let us assume that we have an SQL database with 4 tables: Teachers, Students, Courses and Classrooms.
Right now all the relations between the items are represented in the REST-API through referencing the URL of the related item. For example for a course we could have the following
{ "id":"Course1", "teacher": "http://server.com/teacher1", ... }
In addition, if I ask for a list of courses through a GET call to /courses, I get a list of references as shown below:
{
... //pagination details
"items": [
{"href": "http://server1.com/course1"},
{"href": "http://server1.com/course2"}...
]
}
All this is nice and clean but if I want a list of all the course titles with the teachers' names and I have 2000 courses and 500 teachers I have to do the following:
Approximately 2500 queries just to read the data.
Implement the join between the teachers and courses
Optimize with caching etc, so that I will do it as fast as possible.
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently.
Colleagues say that this approach is the standard way of implementing a REST-API but then a relatively simple query becomes a big hassle.
My question therefore is:
1. Is it wrong if we nest the teacher information in the courses?
2. Should the listing of items e.g. GET /courses return a list of references or a list of items?
Edit: After some research I would say the model I have in mind corresponds mainly to the one shown in jsonapi.org. Is this a good approach?
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this approach is the standard way of implementing a REST-API but then a relatively simple query becomes a big hassle.
Your colleagues have lost the plot.
Here's your heuristic - how would you support this use case on a web site?
You would probably do it by defining a new web page that produces the report you need. You'd run the query, use the result set to generate a bunch of HTML, and ta-da! The client has the information that they need in a standardized representation.
A REST-API is the same thing, with more emphasis on machine readability. Create a new document, with a schema so that your clients can understand the semantics of the document you return to them, tell the clients how to find the target uri for the document, and voila.
Creating new resources to handle new use cases is the normal approach to REST.
Yes, I totally think you should design something similar to jsonapi.org. As a rule of thumb, I would say "prefer a solution that requires fewer network calls". This is especially true if the number of network calls drops by an order of magnitude.
Of course it doesn't eliminate the need to limit the request/response size if it becomes unreasonable.
Real life solutions must have a proper balance. Clean API is nice as long as it works.
So in your case I would do something like:
GET /courses?include=teachers
Or
GET /courses?includeTeacher=true
Or
GET /courses?includeTeacher=brief|full
In the last one the response can have only the teacher's id for brief and full teacher details for full.
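A handler for the first variant could be sketched as follows, assuming in-memory tables with made-up sample rows; get_courses and the TEACHERS and COURSES names are hypothetical. Without the parameter a course only carries the teacher's id, and with include=teachers the teacher is embedded, so the client needs a single call.

```python
from urllib.parse import parse_qs, urlparse

# Made-up sample data standing in for the Teachers and Courses tables.
TEACHERS = {"teacher1": {"id": "teacher1", "name": "Ada"}}
COURSES = [{"id": "course1", "title": "Algebra", "teacher": "teacher1"}]


def get_courses(url):
    """Handle GET /courses, optionally embedding teachers via ?include=teachers."""
    query = parse_qs(urlparse(url).query)
    include = set(",".join(query.get("include", [])).split(","))
    items = []
    for course in COURSES:
        item = dict(course)
        if "teachers" in include:
            # Embed the related row instead of only referencing it.
            item["teacher"] = TEACHERS[course["teacher"]]
        items.append(item)
    return {"items": items}


plain = get_courses("/courses")
embedded = get_courses("/courses?include=teachers")
```

In a real service the embedding would normally be done by the database join rather than by a lookup per course.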
My problem is that this method creates a lot of network traffic with thousands of REST-API calls and that I have to re-implement the natural join that the database would do way more efficiently. Colleagues say that this approach is the standard way of implementing a REST-API but then a relatively simple query becomes a big hassle.
Have you actually measured the overhead generated by each request? If not, how do you know that the overhead will be too high? From an object-oriented programmer's perspective it may sound bad to perform each call on its own; your design, however, lacks one important asset which helped the Web grow to its current size: caching.
Caching can occur on multiple levels. You can do it on the API level, the client might do it, or an intermediary server might do it. Fielding even made it a constraint of REST! So, if you want to comply with the REST architecture philosophy, you should also support caching of responses. Caching helps to reduce the number of requests that have to be calculated or even processed by a single server. With the help of stateless communication you might even introduce a multitude of servers that all perform calculations for billions of requests and act as one cohesive system to the client. An intermediary cache may further help to significantly reduce the number of requests that actually reach the server.
A URI as a whole (including any path, matrix or query parameters) is actually a key for a cache. Upon receiving a GET request, for example, an application checks whether its current cache already contains a stored response for that URI and, if the stored data is "fresh enough", returns the stored response directly to the client on behalf of the server. If the stored data has already exceeded the freshness threshold, it will throw away the stored data and route the request to the next hop in line (which might be the actual server or a further intermediary).
Spotting resources that are ideal for caching might not be easy at times, though most data doesn't change so quickly that caching should be neglected completely. Thus, introducing caching should be of general interest, especially the more traffic your API produces.
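The lookup just described can be sketched in a few lines. This is a simplified illustration, assuming a fixed freshness threshold for every response instead of real HTTP cache headers; UriCache, origin and clock are hypothetical names. The full URI, query string included, is the cache key, and a stale entry causes the request to be forwarded to the next hop.

```python
class UriCache:
    """Toy cache keyed by the full URI, with one freshness threshold."""

    def __init__(self, origin, max_age, clock):
        self.origin = origin    # callable: uri -> response (the next hop)
        self.max_age = max_age  # freshness threshold in seconds
        self.clock = clock      # callable returning the current time in seconds
        self.entries = {}       # full URI -> (stored_at, response)

    def get(self, uri):
        entry = self.entries.get(uri)
        now = self.clock()
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]  # fresh enough: answer on behalf of the server
        response = self.origin(uri)  # missing or stale: forward the request
        self.entries[uri] = (now, response)
        return response


calls = []


def origin(uri):
    calls.append(uri)
    return f"body of {uri}"


now = [0]  # fake clock so the example is deterministic
cache = UriCache(origin, max_age=60, clock=lambda: now[0])
cache.get("/courses?page=1")
cache.get("/courses?page=1")  # served from the cache
cache.get("/courses?page=2")  # different query string, different key
now[0] = 61
cache.get("/courses?page=1")  # stale, forwarded again
```

Only three of the four requests reach the origin, which is the effect an intermediary cache has on server load.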
While certain media-types such as HAL JSON, jsonapi, ... allow you to embed content gathered from related resources into the response, embedding content has some potential drawbacks such as:
Utilization of the cache might be low due to mixing data that changes quickly with data that is more static
Server might calculate data the client won't need
One server calculates the whole response
If related resources are only linked to instead of directly embedded, a client certainly has to fire off a further request to obtain that data, though that request is more likely to be (partly) served by a cache which, as mentioned a couple of times throughout this post, reduces the workload on the server. Besides that, a positive side effect could be that you gain more insight into what the clients are actually interested in (for example, if an intermediary cache is run by you).
Is it wrong if we nest the teacher information in the courses?
It is not wrong, but it might not be ideal, as explained above.
Should the listing of items e.g. GET /courses return a list of references or a list of items?
It depends. There is no right or wrong.
As REST is just a generalization of the interaction model used in the Web, basically the same concepts apply to REST as well. Depending on the size of the "item", it might be beneficial to return a short summary of the item's content and add a link to the item. Similar things are done on the Web as well. For a list of students enrolled in a course, this might be the student's name and matriculation number plus a link where further details of that student can be requested, accompanied by a link-relation name that gives the link some semantic context, which a client can use to decide whether invoking that URI makes sense or not.
Such link-relation names are either standardized by IANA, follow common approaches such as Dublin Core or schema.org, or are custom extensions as defined in RFC 8288 (Web Linking). For the above-mentioned list of students enrolled in a course you could, for example, make use of the about relation name to hint to a client that further information on the current item can be found by following the link. If you want to enable pagination, the relation names first, next, prev and last can and probably should be used as well, and so forth.
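A response built along these lines might be assembled as in the sketch below. The JSON layout (a links object keyed by relation name) and the URIs are assumptions chosen for illustration and do not follow any particular media type; students_page is a hypothetical helper. Each item carries an about link, and the page carries first, last, prev and next where they apply.

```python
def students_page(base_uri, student_ids, page, size):
    """Build one page of a student list with link relations (layout assumed)."""
    last = max((len(student_ids) - 1) // size, 0)
    links = {
        "first": f"{base_uri}?page=0",
        "last": f"{base_uri}?page={last}",
    }
    if page > 0:
        links["prev"] = f"{base_uri}?page={page - 1}"
    if page < last:
        links["next"] = f"{base_uri}?page={page + 1}"
    start = page * size
    items = [
        # Summary only; details are one "about" link away.
        {"id": sid, "links": {"about": f"/students/{sid}"}}
        for sid in student_ids[start:start + size]
    ]
    return {"links": links, "items": items}


ids = ["s1", "s2", "s3", "s4", "s5"]
first_page = students_page("/courses/course1/students", ids, page=0, size=2)
last_page = students_page("/courses/course1/students", ids, page=2, size=2)
```

A client that understands the relation names can walk the whole list by following next until it is absent, without knowing how the URIs are built.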
This is actually what HATEOAS is all about. Linking data together and giving them meaningful relation names to span a kind of semantic net between resources. By simply embedding things into a response such semantic graphs might be harder to build and maintain.
In the end, whether you embed or reference resources basically boils down to an implementation choice. I hope I could shed some light on the usefulness of caching and the benefits it could yield, especially on large-scale systems, as well as on the benefit of providing link-relation names for URIs, which enhance the semantic context of relations used within your API.
I read through the Lagom documentation, and already wrote a few small services that interact with each other. But because this is my first foray into CQRS, I still have a few conceptual issues about the persistent read side that I don't really understand.
For instance, I have a user-service that keeps a list of users (as aggregates) and their profile data like email addresses, names, addresses, etc.
The questions I have now are:
If I want to retrieve the user's profile given a certain email address, should I query the read side for the user's id, and then query the event-store using this id for the profile data? Or should the read side already keep all profile information?
If the read side has all information, what is the reason for the event-store? If it's truly write-only, it's not really useful, is it?
Should I design my system so that I can use the event-store as much as possible, or should I have a read side for everything? What are the scalability implications?
If the user-model changes (for instance, the profile now includes a description of the profile) and I use a read-side that contains all profile data, how do I update this read side in Lagom to now also contain this description?
Following that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
As you can see, this whole concept hasn't really 'clicked' yet, and I am thankful for answers and/or some pointers.
If I want to retrieve the user's profile given a certain email address, should I query the read side for the user's id, and then query the event-store using this id for the profile data? Or should the read side already keep all profile information?
You should use a specially designed ReadModel for searching profiles using the email address. You should query the Event-store only to rehydrate the Aggregates, and you rehydrate the Aggregates only to send them commands, not queries. In CQRS an Aggregate may not be queried.
If the read side has all information, what is the reason for the event-store? If it's truly write-only, it's not really useful, is it?
The Event-store is the source of truth for the write side (Aggregates). It is used to rehydrate the Aggregates (they rebuild their internal and private state based on the previously emitted events) before they process commands, and to persist the new events. So the Event-store is append-only but also used to read the event-stream (the events emitted by an Aggregate instance). The Event-store ensures that an Aggregate instance (that is, identified by a type and an ID) processes only one command at a time.
If the user-model changes (for instance, the profile now includes a description of the profile) and I use a read-side that contains all profile data, how do I update this read side in Lagom to now also contain this description?
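The cycle of rehydrating, handling a command and appending can be sketched as below. This is not Lagom's API; it is a minimal in-memory illustration with hypothetical names (EventStore, UserAggregate, handle_change_email), and it assumes an expected-version check on append as one common way to make sure two commands are never applied to the same stream version.

```python
class ConcurrencyError(Exception):
    pass


class EventStore:
    """Append-only streams keyed by (aggregate type, aggregate id)."""

    def __init__(self):
        self.streams = {}

    def load(self, aggregate_type, aggregate_id):
        return list(self.streams.get((aggregate_type, aggregate_id), []))

    def append(self, aggregate_type, aggregate_id, expected_version, events):
        stream = self.streams.setdefault((aggregate_type, aggregate_id), [])
        if len(stream) != expected_version:
            raise ConcurrencyError("stream changed since it was loaded")
        stream.extend(events)


class UserAggregate:
    def __init__(self, history):
        self.email = None
        self.version = 0
        for event in history:  # rehydrate from previously emitted events
            self._apply(event)

    def _apply(self, event):
        if event["type"] == "EmailChanged":
            self.email = event["email"]
        self.version += 1

    def change_email(self, email):
        if email == self.email:
            return []  # nothing to change, no new events
        return [{"type": "EmailChanged", "email": email}]


def handle_change_email(store, user_id, email):
    user = UserAggregate(store.load("User", user_id))
    events = user.change_email(email)
    store.append("User", user_id, user.version, events)
    return events


store = EventStore()
first = handle_change_email(store, "u1", "a@example.com")
repeat = handle_change_email(store, "u1", "a@example.com")
```

The store is read here only to rebuild the aggregate before it decides on a command; queries from clients never touch it.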
I don't use any other framework but my own but I guess that you rewrite (to use the new added field on the events) and rebuild the ReadModel.
Following that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
You should have a separate ReadModel (with its own table(s)) for each use case. The ReadModel should be blazing fast, this means it should be as small as possible, only with the fields needed for that particular use case. This is very important, it is one of the main benefits of using CQRS.
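A ReadModel dedicated to the lookup by email address could be sketched like this. The event shapes (UserRegistered and EmailChanged with userId, name and email fields) and the ProfilesByEmail class are assumptions for illustration. The model keeps only what this one use case needs, and it can be rebuilt at any time by replaying the events.

```python
class ProfilesByEmail:
    """ReadModel for one use case: find a profile by email address."""

    def __init__(self):
        self.rows = {}       # email -> {"userId": ..., "name": ...}
        self._email_of = {}  # user id -> current email, to move rows on change

    def on_event(self, event):
        user_id = event["userId"]
        if event["type"] == "UserRegistered":
            self._email_of[user_id] = event["email"]
            self.rows[event["email"]] = {"userId": user_id, "name": event["name"]}
        elif event["type"] == "EmailChanged":
            row = self.rows.pop(self._email_of[user_id])
            self._email_of[user_id] = event["email"]
            self.rows[event["email"]] = row

    def find(self, email):
        return self.rows.get(email)


def rebuild(events):
    """Replay the whole event history into a fresh ReadModel."""
    model = ProfilesByEmail()
    for event in events:
        model.on_event(event)
    return model


events = [
    {"type": "UserRegistered", "userId": "u1", "name": "Ann", "email": "ann@old.example"},
    {"type": "EmailChanged", "userId": "u1", "email": "ann@new.example"},
]
model = rebuild(events)
```

When a new field has to appear in the ReadModel, on_event is changed to pick it up and rebuild is run again over the stored events.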
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
This depends on you, the architect. It is preferred that each ReadModel owns its data, that is, it should subscribe to the right events and should not depend on other ReadModels. But this leads to a lot of code duplication. In my experience I've seen a desire to have some canonical ReadModels that own some data but can also share it on demand. For this, in CQRS, there is also the term query. Just like commands and events, queries can travel in your system, but only from ReadModel to ReadModel.
Queries should not be sent during a client's request. They should be sent only in the background, as an asynchronous synchronization mechanism. This is an important aspect that influences the resilience and responsiveness of your system.
I've also used live queries, which are pushed from the authoritative ReadModels to the subscribed ReadModels in real time when the answer changes.
In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
No, it does not. CQRS does not specify how the R (Read side) is updated, only that the R should not process commands and C should not be queried.
Since Mongo doesn't have transactions that can be used to ensure that nothing is committed to the database unless it's consistent (non-corrupt) data, if my application dies between making a write to one document, and making a related write to another document, what techniques can I use to remove the corrupt data and/or recover in some way?
The greater idea behind NoSQL was to use a carefully modeled data structure for a specific problem, instead of hitting every problem with a hammer. That is also true for transactions, which should be referred to as 'short-lived transactions', because the typical RDBMS transaction hardly helps with 'real', long-lived transactions.
The kind of transaction supported by RDBMSs is often required only because the limited data model forces you to store the data across several tables, instead of using embedded arrays (think of the typical invoice / invoice items examples).
In MongoDB, try to use write-heavy, de-normalized data structures and keep data in a single document which improves read speed, data locality and ensures consistency. Such a data model is also easier to scale, because a single read only hits a single server, instead of having to collect data from multiple sources.
However, there are cases where the data must be read in a variety of contexts and de-normalization becomes unfeasible. In that case, you might want to take a look at Two-Phase Commits or choose a completely different concurrency approach, such as MVCC (in a sentence, that's what the likes of svn, git, etc. do). The latter, however, is hardly a drop-in replacement for RDBMSs, but exposes a completely different kind of concurrency to a higher level of the application, if not the user.
Thinking about this myself, I want to identify some categories of effects:
1. Your operation has only one database save (saving data into one document)
2. Your operation has two database saves (updates, inserts, or deletions), A and B
2.1. They are independent
2.2. B is required for A to be valid
2.3. They are interdependent (A is required for B to be valid, and B is required for A to be valid)
3. Your operation has more than two database saves
I think this is a full list of the general possibilities. In case 1, you have no problem - one database save is atomic. In case 2.1, same thing, if they're independent, they might as well be two separate operations.
For case 2.2, if you do B first and then A, at worst you will have some extra data (B data) that will take up space in your system, but otherwise be harmless. In case 2.3, you'll likely have some corrupt data in the event of a catastrophic failure. And case 3 is just a composition of case 2s.
Some examples for the different cases:
1.0. You change a car document's color to 'blue'
2.1. You change the car document's color to 'red' and the driver's hair color to 'red'
2.2. You create a new engine document and add its ID to the car document
2.3.a. You change your car's 'gasType' to 'diesel', which requires changing your engine to a 'diesel' type engine.
2.3.b. Another example of 2.3: You hitch car document A to another car document B, A getting the "towedBy" property set to B's ID, and B getting the "towing" property set to A's ID
3.0. I'll leave examples of this to your imagination
In many cases, it's possible to turn a 2.3 scenario into a 2.2 scenario. In the 2.3.a example, the car document and engine are separate documents. Let's ignore the possibility of putting the engine inside the car document for this example. It's invalid both to have a diesel engine with non-diesel gas and to have a non-diesel engine with diesel gas. So they both have to change. But it may be valid to have no engine at all and have diesel gas. So you could add steps that keep the whole thing valid at all points. First, remove the engine, then replace the gas, then change the type of the engine, and lastly add the engine back onto the car.
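The four-step sequence can be checked with a small simulation. It uses plain dictionaries in place of MongoDB documents, and the field names engineId, gasType and type as well as the is_valid rule are assumptions made for this sketch. Each step stands for one single-document save, and the validity check runs after every one of them.

```python
def is_valid(car, engines):
    """A car is valid if it has no engine, or its engine matches its gas type."""
    engine_id = car.get("engineId")
    if engine_id is None:
        return True  # assumed: a car without an engine is acceptable
    return engines[engine_id]["type"] == car["gasType"]


def convert_to_diesel(car, engines, after_each_save=lambda: None):
    """Run the four single-document saves in an order that is always valid."""
    engine_id = car["engineId"]
    steps = [
        lambda: car.update(engineId=None),                 # 1. remove the engine
        lambda: car.update(gasType="diesel"),              # 2. replace the gas
        lambda: engines[engine_id].update(type="diesel"),  # 3. change the engine type
        lambda: car.update(engineId=engine_id),            # 4. add the engine back
    ]
    for save in steps:
        save()
        after_each_save()


car = {"gasType": "petrol", "engineId": "e1"}
engines = {"e1": {"type": "petrol"}}
observed = []
convert_to_diesel(car, engines, lambda: observed.append(is_valid(car, engines)))
```

A crash between any two steps leaves a car that is at worst missing its engine, which is incomplete but not contradictory.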
If you will get corrupt data from a 2.3 scenario, you'll want a way to detect the corruption. In example 2.3.b, things might break if one document has the "towing" property, but the other document doesn't have a corresponding "towedBy" property. So this might be something to check after a catastrophic failure. Find all documents that have "towing" but the document with the id in that property doesn't have its "towedBy" set to the right ID. The choices there would be to delete the "towing" property or set the appropriate "towedBy" property. They both seem equally valid, but it might depend on your application.
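The check for example 2.3.b could look like the sketch below. It works on an in-memory dictionary of car documents rather than on a real collection, and find_broken_tows and repair_by_setting_towed_by are hypothetical helpers. It only looks in the direction described above, from "towing" to the matching "towedBy".

```python
def find_broken_tows(cars):
    """Return (towing_id, towed_id) pairs whose towedBy back-reference is wrong."""
    broken = []
    for car_id, car in cars.items():
        towed_id = car.get("towing")
        if towed_id is None:
            continue
        towed = cars.get(towed_id)
        if towed is None or towed.get("towedBy") != car_id:
            broken.append((car_id, towed_id))
    return broken


def repair_by_setting_towed_by(cars, broken):
    """One of the two repair choices: complete the missing half of the link."""
    for car_id, towed_id in broken:
        if towed_id in cars:
            cars[towed_id]["towedBy"] = car_id


# Made-up state after a crash: B says it tows A, but A was never updated.
cars = {
    "A": {},
    "B": {"towing": "A"},
    "C": {"towing": "D"},
    "D": {"towedBy": "C"},
}
broken = find_broken_tows(cars)
repair_by_setting_towed_by(cars, broken)
```

The other choice, deleting the dangling "towing" property, would be just as easy to write; which one is right depends on the application.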
In some situations, you might be able to find corrupt data like this, but you won't know what the data was before those things were set. In those cases, setting a default is probably better than nothing. Some types of corruption are better than others (particularly the kind that will cause errors in your application rather than simply incorrect display data).
If the above kind of code analysis or corruption repair becomes unfeasible, or if you want to avoid any data corruption at all, your last resort would be to take mnemosyn's suggestion and implement Two-Phase Commits, MVCC, or something similar that allows you to identify and roll back changes in an indeterminate state.