Keeping duplicated DynamoDB records synchronized - nosql

I am currently trying to model the data for our application. The data consists of identities and groups. One group can have multiple identities and one identity can be in multiple groups (a typical many-to-many relationship).
So I have used the Adjacency List Design Pattern to structure my data as recommended by AWS:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-adjacency-graphs.html
I keep all the info about identities duplicated inside the groups and reading the data works just fine - a normal query for the details and a query against the index to get the relations of my objects.
How can I ensure that all duplicated records have the same value?
Every time the group changes, I am updating all the duplicated group records in the database.
I am okay with updating multiple records at once as changes will happen rarely but I want to avoid inconsistent data.
All the tutorials and guides only ever talk about reading and accessing data, not about updating it.
I know that there is a TransactWriteItems request, but it is limited to a maximum of 25 items. So is there another way/pattern to guarantee that all my identity records are updated when, e.g., the name changes?

You have to decide for yourself how consistent is consistent enough in your application.
The CAP theorem is alive and well and it says that to get availability and partition tolerance we have to sacrifice consistency.
Since updates happen infrequently, how does your application fail if it sees inconsistent records? If you can't use the transactional API because of the 25 item limit, maybe you could roll your own "lock-out" using an attribute you would set on items that must all be updated together:
first, you identify all items that need to be updated and set the "lock_out" attribute on them (this can be a timestamp indicating when the lock_out expires)
in your application, you can add business logic to treat items with the "lock_out" in a way that makes sense (maybe show them as being updated, or not show them at all etc.)
update the items
after the update is complete, clear the "lock_out" attribute
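A rough sketch of that lock-out flow with boto3 (the table name, key shape, and attribute names here are assumptions, not taken from the question):

import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("identities-and-groups")  # assumed table name

LOCK_TTL_SECONDS = 60

def update_duplicated_group(duplicate_keys, new_group_name):
    # duplicate_keys: the primary key of every duplicated record of this group.
    lock_until = int(time.time()) + LOCK_TTL_SECONDS

    # 1. Mark every duplicate as "being updated" so readers can treat it accordingly.
    for key in duplicate_keys:
        table.update_item(
            Key=key,
            UpdateExpression="SET lock_out = :until",
            ExpressionAttributeValues={":until": lock_until},
        )

    # 2. Apply the actual change to every duplicate.
    for key in duplicate_keys:
        table.update_item(
            Key=key,
            UpdateExpression="SET group_name = :n",
            ExpressionAttributeValues={":n": new_group_name},
        )

    # 3. Clear the lock once all duplicates are consistent again.
    for key in duplicate_keys:
        table.update_item(Key=key, UpdateExpression="REMOVE lock_out")

Readers that see a lock_out timestamp still in the future know the item is mid-update and can hide it or display it as being refreshed.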

Related

DDD, Event Sourcing, and the shape of the Aggregate state

I'm having a hard time understanding the shape of the state that's derived by applying an entity's events vs. a projection of that entity's data.
Is an Aggregate's state ONLY used for determining whether or not a command can successfully be applied? Or should that state be usable in other ways?
An example - I have a Post entity for a standard blog post. I might have events like postCreated, postPublished, postUnpublished, etc. For my projections that I'll be persisting in my read tables, I need a projection for the base posts (which will include all posts, regardless of status, with lots of detail) as well as a published_posts projection (which will only represent posts that are currently published, with only the information necessary for rendering).
In the situation above, is my aggregate state ONLY supposed to be used to determine, for example, if a post can be published or unpublished, etc? If this is the case, is the shape of my state within the aggregate purely defined by what's required for these validations? For example, in my base post projection, I want to have a list of all users that have made a change to the post. In terms of validation for the aggregate/commands, I couldn't care less about the list of users that have made changes. Does that mean that this list should not be a part of my state within my aggregate?
TL;DR: yes - limit the "state" in the aggregate to that data that you choose to cache in support of data change.
In my aggregates, I distinguish two different ideas:
the history, aka the sequence of events that describes the changes in the lifetime of the aggregate
the cache, aka the data values we tuck away because querying the event history every time kind of sucks.
There's not a lot of value in caching results that we are never going to use.
One of the underlying lessons of CQRS is that we don't need aggregates everywhere.
An AGGREGATE is a cluster of associated objects that we treat as a unit for the purpose of data changes. -- Evans, 2003
If we aren't changing the data, then we can safely work directly with immutable copies of the data.
The only essential purpose of the aggregate is to determine what events, if any, need to be applied to bring the aggregate's state in line with a command (if the aggregate can be brought so in line). All state that's not needed for that purpose can be offloaded to a read-side, which can be thought of as a remix of the event stream (with each read-side only maintaining the state it needs).
That said, there are in practice, reasons to use the aggregate state directly, with the primary one being a desire for a stronger consistency for the aggregate: CQRS is inherently eventually consistent. As with all questions of consistent updates, it's important to recognize that consistency isn't free and very often isn't even cheap; I tend to think of a project as having a consistency budget and I'm pretty miserly about spending it.
In your case, there's probably no reason to include the list of users changing a post in the aggregate state, unless e.g. there's something like "no single user can modify a given post more than n times".
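As a rough illustration (the event names follow the question, everything else is assumed), the aggregate below caches only what it needs to decide whether a publish command is valid; the list of editing users would live in a read-side projection instead:

class PostAggregate:
    def __init__(self):
        self.exists = False
        self.published = False

    def apply(self, event):
        # Fold one event into the cached state.
        if event["type"] == "postCreated":
            self.exists = True
        elif event["type"] == "postPublished":
            self.published = True
        elif event["type"] == "postUnpublished":
            self.published = False
        # An edit event may carry the editing user's id, but the aggregate
        # ignores it: "who touched this post" is read-side data.

    def publish(self):
        # Command handler: return the events to append, if any.
        if not self.exists:
            raise ValueError("post does not exist")
        if self.published:
            return []  # already published, nothing to emit
        return [{"type": "postPublished"}]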

How to model a domain model - aggregate root

I'm having some issues to correctly design the domain that I'm working on.
My straightforward use case is the following:
The user (~5,000 users) can access a list of ads (~5 million).
He can choose to add/remove some of them as favorites.
He can decide to show/hide some of them.
I have a command which will mutate the aggregate state, to set Favorite to TRUE, let's say.
In terms of DDD, how should I design the aggregates?
How should I design the relationship between a user and his selection of favorite ads?
Considering the large numbers of ads, I cannot duplicate each ad inside a user aggregate root.
Can I design an Ads aggregate root containing a user "collection"?
And finally, how should I handle/perform the read-model part?
Thanks in advance
Cheers
Two concepts may help you understand how to model this:
1. Aggregates are Transaction Boundaries.
An aggregate is a cluster of associated objects that are considered as a single unit. All parts of the aggregate are loaded and persisted together.
If you have an aggregate that encloses 1,000 entities, then you have to load all of them into memory. So it follows that you should preferably have small aggregates whenever possible.
2. Aggregates are Distinct Concepts.
An Aggregate represents a distinct concept in the domain. Behavior associated with more than one Aggregate (like Favoriting, in your case) is usually an aggregate by itself with its own set of attributes, domain objects, and behavior.
From your example, User is a clear aggregate.
An Ad has a distinct concept associated with it in the domain, so it is an aggregate too. There may be other attributes embedded within the Ad, like valid_until, description, is_active, etc.
The concept of favoriting an Ad links the User and the Ad aggregates. Your question seems to be centered around where this linkage should be preserved. Should it be in the User aggregate (a list of Ads), or should an Ad have a collection of User objects embedded within it?
While both are possibilities, IMHO, I think FavoriteAd is yet another aggregate, which holds references to both the User aggregate and the Ad aggregate. This way, you don't burden the concepts of User or the Ad with favoriting behavior.
Those aggregates will also not be required to load this additional data every time they are loaded into memory. For example, if you are loading an Ad object to edit its contents, you don't want the favorites collection to be loaded into memory by default.
These aggregate structures don't matter as far as read models are concerned. Aggregates only deal with the write side of the domain. You are free to rewire the data any way you want, in multiple forms, on the read side. You can have a subscriber just to listen to the Favorited event (raised after processing the Favorite command) and build a composite data structure containing data from both the User and the Ad aggregates.
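A minimal sketch of that shape (class, event, and field names are assumptions): the FavoriteAd aggregate holds nothing but the two identities and its own flags, and emits an event that a read-side subscriber can join with User and Ad data.

class FavoriteAd:
    # References User and Ad only by id, so neither of those aggregates
    # ever has to load favoriting data.
    def __init__(self, user_id, ad_id):
        self.user_id = user_id
        self.ad_id = ad_id
        self.is_favorite = False

    def favorite(self):
        if self.is_favorite:
            return []  # idempotent: already a favorite
        self.is_favorite = True
        # The read side can subscribe to this event and build its own view.
        return [{"type": "Favorited", "user_id": self.user_id, "ad_id": self.ad_id}]

    def unfavorite(self):
        if not self.is_favorite:
            return []
        self.is_favorite = False
        return [{"type": "Unfavorited", "user_id": self.user_id, "ad_id": self.ad_id}]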
I really like the answer given by Subhash Bhushan and I want to add another approach for you to consider.
If you look closely at your question you will see that you've made the assumption that an aggregate can 'see' everything that the user does when they are interacting with the UI. This doesn't need to be so.
Depending on the requirements of the domain you don't need to hold a list of any Ads in the aggregate to favourite them. Here's what I mean:
For this example, it doesn't matter where the 'favourite' ad command sits. It could be on the user aggregate or a specific aggregate for handling the concept of Favouriting. The command just needs to hold the id of the User and the Ad they are favouriting.
You may need to handle what happens if a user or ad is deleted but that would just be a case of an event process manager listening to the appropriate events and issuing compensating commands.
This way you don't need to load up 5 million ads. That's a job for the read model and UI, not the domain.
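In code, such a command can be as small as the two ids (a sketch with assumed names, reusing the hypothetical FavoriteAd aggregate sketched above; the repository interface is also assumed):

from dataclasses import dataclass

@dataclass
class FavoriteAdCommand:
    user_id: str
    ad_id: str

def handle_favorite_ad(command, repository):
    # Load (or create) only the one tiny aggregate addressed by the two ids;
    # the 5 million ads are never touched here.
    fav = repository.load(command.user_id, command.ad_id)
    if fav is None:
        fav = FavoriteAd(command.user_id, command.ad_id)
    events = fav.favorite()
    repository.save(fav, events)
    return events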
Just a thought.

Swift: Implementing custom merge policy

I'm building an OSX app using Swift, with Core Data as my data layer. As part of this, I have a table that lists a large number of files, with metadata associated with each. Each record can include a URI that points to one of three services it can be hosted on.
Each record has the following fields: title, created_at, size, uuid, source_local, source_remote, source_cloud.
I generate all the records using information pulled from the local source. These records all have a source_local string.
Later I import a number of records from the remote source. These records are all added and have a source_remote string.
A number of these records are hosted on both services, and have matching UUIDs. There is a unique constraint on the UUID field, and I want Swift to merge these two records' fields in some way when it has a constraint error.
I've tried:
NSMergeByPropertyStoreTrumpMergePolicy
and
NSMergeByPropertyObjectTrumpMergePolicy
But these policies result in one record completely trumping the other.
Currently I have to work around this limitation by checking if a record already exists with the UUID and updating the existing record with any missing fields in the new file.
However this feels non-optimal – is there a way to create a custom merge policy, in order to have Swift automatically handle conflicts in this way? At this stage I am not concerned with whether the Store or Memory record trumps the other, as long as I can correctly merge the source_* fields.
Thanks
First of all, thanks to @tom-harrington for his nod to extend NSMergePolicy. Huge oversight on my part that I hadn't even considered going down that route.
While exploring how NSMergeByPropertyStoreTrumpMergePolicy/NSMergeByPropertyObjectTrumpMergePolicy are implemented, however, I realised that this issue stems from a misunderstanding on my part. These policies already handle conflicts at a property level. Rather than discarding the entirety of one of the object states on conflict, they compare each property and apply the policy only to the individual properties that are actually in conflict.
NSOverwriteMergePolicy and NSRollbackMergePolicy are policies that will result in one of either object A or B being completely discarded on conflict.

Understanding Lagom's persistent read side

I read through the Lagom documentation and have already written a few small services that interact with each other. But because this is my first foray into CQRS, I still have a few conceptual issues about the persistent read side that I don't really understand.
For instance, I have a user-service that keeps a list of users (as aggregates) and their profile data like email addresses, names, addresses, etc.
The questions I have now are:
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's id and then query the event-store using this id for the profile data? Or should the read side already keep all profile information?
If the read side has all the information, what is the reason for the event-store? If it's truly write-only, it's not really useful, is it?
Should I design my system so that I can use the event-store as much as possible, or should I have a read side for everything? What are the scalability implications?
If the user model changes (for instance, the profile now includes a description) and I use a read side that contains all profile data, how do I update this read side in Lagom to now also contain this description?
Following that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
As you can see, this whole concept hasn't really 'clicked' yet, and I am thankful for answers and/or some pointers.
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's id and then query the event-store using this id for the profile data? Or should the read side already keep all profile information?
You should use a specially designed ReadModel for searching profiles using the email address. You should query the Event-store only to rehydrate the Aggregates, and you rehydrate the Aggregates only to send them commands, not queries. In CQRS an Aggregate may not be queried.
If the read side has all the information, what is the reason for the event-store? If it's truly write-only, it's not really useful, is it?
The Event-store is the source of truth for the write side (Aggregates). It is used to rehydrate the Aggregates (they rebuild their internal & private state based on the previously emitted events) before they process commands, and to persist the new events. So the Event-store is append-only but also used to read the event-stream (the events emitted by an Aggregate instance). The Event-store ensures that an Aggregate instance (that is, one identified by a type and an ID) processes only one command at a time.
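Outside of any particular framework, rehydration and command handling look roughly like this (all names here are assumptions):

def rehydrate(event_store, aggregate_cls, aggregate_id):
    # Rebuild the aggregate's private state by replaying its event stream.
    aggregate = aggregate_cls()
    for event in event_store.load_stream(aggregate_cls.__name__, aggregate_id):
        aggregate.apply(event)
    return aggregate

def handle_command(event_store, aggregate_cls, aggregate_id, command):
    aggregate = rehydrate(event_store, aggregate_cls, aggregate_id)
    new_events = aggregate.handle(command)  # decide what happened; never query the aggregate
    event_store.append(aggregate_cls.__name__, aggregate_id, new_events)
    return new_events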
If the user model changes (for instance, the profile now includes a description) and I use a read side that contains all profile data, how do I update this read side in Lagom to now also contain this description?
I don't use any framework other than my own, but I guess that you rewrite the projection logic (to use the newly added field on the events) and rebuild the ReadModel.
Following that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
You should have a separate ReadModel (with its own table(s)) for each use case. The ReadModel should be blazing fast; this means it should be as small as possible, with only the fields needed for that particular use case. This is very important, and it is one of the main benefits of using CQRS.
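For the email-lookup use case above, such a ReadModel could be a single small table keyed by email and kept up to date by an event subscriber; a sketch with assumed event and field names:

# In-memory stand-in for a table keyed by email; a real projection would
# write to a database table with the same shape.
profiles_by_email = {}

def project(event):
    if event["type"] == "UserRegistered":
        profiles_by_email[event["email"]] = {
            "user_id": event["user_id"],
            "name": event["name"],
        }
    elif event["type"] == "UserEmailChanged":
        profile = profiles_by_email.pop(event["old_email"], None)
        if profile is not None:
            profiles_by_email[event["new_email"]] = profile

def find_profile_by_email(email):
    return profiles_by_email.get(email)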
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
This depends on you, the architect. It is preferred that each ReadModel owns its data, that is, it should subscribe to the right events and not depend on other ReadModels. But this leads to a lot of code duplication. In my experience I've seen a desire to have some canonical ReadModels that own some data but can also share it on demand. For this, in CQRS, there is also the term query. Just like commands and events, queries can travel in your system, but only from ReadModel to ReadModel.
Queries should not be sent during a client's request. They should be sent only in the background, as an asynchronous synchronization mechanism. This is an important aspect that influences the resilience and responsiveness of your system.
I've also used live queries, which are pushed from the authoritative ReadModels to the subscribed ReadModels in real time, when the answer changes.
In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
No, it does not. CQRS does not specify how the R (Read side) is updated, only that the R should not process commands and C should not be queried.

Can I use a single DynamoDB table to support these three use cases?

I'm building a DynamoDB table that holds notification messages. Messages are directed from a given user (from_user) to another user (to_user). They're quite simple:
{ "to_user": "e17818ae-104e-11e3-a1d7-080027880ca6", "from_user": "e204ea36-104e-11e3-9b0b-080027880ca6", "notification_id": "e232f73c-104e-11e3-9b30-080027880ca6", "message": "Bob recommended a good read.", "type": "recommended", "isbn": "1844134016" }
These are the Hash/Range keys defined on the table:
HashKey: to_user, RangeKey: notification_id
Case 1: Users regularly phone home to ask for any available notifications.
With these keys, it's easy to fetch the notifications awaiting a given user:
notifications.query(to_user="e17818ae-104e-11e3-a1d7-080027880ca6")
Case 2: Once a user has seen a message, they will explicitly acknowledge it and it will be deleted. This is similarly simple to accomplish with the given Hash/Range keys:
notifications.delete(to_user="e17818ae-104e-11e3-a1d7-080027880ca6", notification_id="e232f73c-104e-11e3-9b30-080027880ca6")
Case 3: It may sometimes be necessary to delete items in this table identified by keys other than the to_user and notification_id. For example, user Bob decides to un-recommend a book and we would like to pull notifications with from_user=Bob, action=recommended and isbn=isbnval.
I know this can't be done with the Hash/Range keys I've chosen. Local secondary indexes also seem unhelpful here since I don't want to work within the table's chosen HashKey.
So am I stuck doing a full Scan? I can imagine creating a second table to map from_user+action+isbn back to items in the original table but that means I have to manage that additional complexity... and it seems like this hand-rolled index could get out of sync easily.
Any insights would be appreciated. I'm new to DynamoDB and trying to understand how typical data models map to it. Thanks.
Your analysis is correct. For case 3 and this schema, you must do a table scan.
There are a number of options which you've identified, but all of them will add a layer of complexity to your application.
1. Use a second table as you state. You are effectively creating your own global index and must manage that complexity yourself. This grows in complexity as you require more indices.
2. Perform a full table scan. Look at DynamoDB's scan segmenting for a method of distributing the scan across multiple worker nodes. Depending on your latency requirements (is it ok if the recommendations don't go away until the next scan?) you may be able to combine this and other future background tasks into a constant background process. This is also simpler than 1 (a segmented-scan sketch follows below).
Both of these seem to be fairly common models.
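A sketch of option 2 with a segmented scan in boto3 (the table name, worker setup, and the use of boto3 itself are assumptions): each worker scans one segment, filters for the un-recommended book, and deletes the matching notifications by their real keys.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("notifications")  # assumed table name

def delete_recommendations(from_user, isbn, segment, total_segments):
    # Each worker process handles one segment of the table in parallel.
    kwargs = {
        "Segment": segment,
        "TotalSegments": total_segments,
        "FilterExpression": "from_user = :u AND #t = :t AND isbn = :i",
        "ExpressionAttributeNames": {"#t": "type"},  # "type" is a reserved word
        "ExpressionAttributeValues": {":u": from_user, ":t": "recommended", ":i": isbn},
    }
    while True:
        page = table.scan(**kwargs)
        for item in page["Items"]:
            # Delete by the table's actual hash/range key.
            table.delete_item(Key={
                "to_user": item["to_user"],
                "notification_id": item["notification_id"],
            })
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]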