"Life Beyond Transactions" Entity-Message-Activity Model in Practice? - nosql

Over vacation I read Pat Helland's "Life Beyond Transactions" (yes, vacation was that good :). To sum it up briefly, it advocates limiting the scope of transactions to a single entity and then using groups of "activities" that have the ability to update the entity or cancel a task anytime a change takes place that would make that task invalid.
(E.g. Shipping Order A requires some amount of Item 1. The Shipping Orders and Items are stored as entities and have their own activities. Shipping Order B ships with the last of Item 1 before A finishes. The activity for Item 1 cancels Shipping Order A.)
I had thought I was printing out the Dynamo paper, so forgive me if I conflate the two here. I've seen quite a few "NoSQL" projects influenced by Dynamo and BigTable, particularly in how they address entities by keys and partition data. I was wondering if this Entity-Message-Activity model has influenced any of them?
Or, to put it in more concrete terms, if I have an operation in HBase, Cassandra, Riak, etc. that spans multiple entities, do I need to implement an Activity all by myself (as more of a design pattern in the application), or is there some kind of existing framework? Or do they do something else completely that renders this entire question moot?
Thanks!

I can add my 2 cents here just from a Cassandra point of view (I haven't used the other NoSQL engines available). Cassandra is primarily designed to be a fast read-write structure. Twitter is a great use case for Cassandra (check the twitter clone Twissandra for this)
Assuming I have understood your question correctly: yes you will have to implement the activity yourself. To understand the modeling of Column/SuperColumnFamilies I would suggest reading this great article WTF is a SuperColumn?
Cheers!

Related

CQRS projections, joining data from different aggregates via probe commands

In CQRS when we need to create a custom-tailored projections for our read-models, we usually prefer a "denormalized" projections (assume we are talking about projecting onto a DB). It is not uncommon to have the information need by the application/UI come from different aggregates (possibly from different BCs).
Imagine we need a projected table to contain customer's information together with her full address and that Customer and Address are different aggregates in our system (possibly in different BCs). Meaning that, addresses are generated and maintained independently of customers. Or, in other words, when a new customer is created, there is no guarantee that there will be an AddressCreatedEvent subsequently produced by the system, this event may have already been processed prior to the creation of the customer. All we have at the time of CreateCustomerCommand is an UUID of an existing address.
We have several solutions here.
Enrich CreateCustomerCommand and the subsequent CustomerCreatedEvent to contain full address of the customer (looking up this information on the fly from the UI or the controller). This way the projection handler will just update the table directly upon receiving CustomerCreatedEvent.
Use the addrUuid provided in CustomerCreatedEvent to perform an ad-hoc query in the projection handler to get the missing part of the address information before updating the table.
These are commonly discussed solution to this problem. However, as noted by many others, there are problems with each approach. Enriching events can be difficult to justify as well described by Enrico Massone in this question, for example. Querying other views/projections (kind of JOINs) will work but introduces coupling (see the same link).
I would like describe another method here, which, as I believe, nicely addresses these concerns. I apologize beforehand for not giving a proper credit if this is a known technique. Sincerely, I have not seen it described elsewhere (at least not as explicitly).
"A picture speaks a thousand words", as they say:
The idea is that :
We keep CreateCustomerCommand and CustomerCreatedEvent simple with only addrUuid attribute (no enriching).
In API controller we send two commands to the command handler (aggregates): the first one, as usual, - CreateCustomerCommand to create customer and project customer information together with addrUuid to the table leaving other columns (full address, etc.) empty for time being. (Warning: See the update, we may have concurrency issue here and need to issue the probe command from a Saga.)
Right after this, and after we have obtained custUuid of the newly created customer, we issue a special ProbeAddrressCommand to Address aggregate triggering an AddressProbedEvent which will encapsulate the full state of the address together with the special attribute probeInitiatorUuid which is, of course our custUuid from the previous command.
The projection handler will then act upon AddressProbedEvent by simply filling in the missing pieces of the information in the table looking up the required row by matching the provided probeInitiatorUuid (i.e. custUuid) and addrUuid.
So we have two phases: create Customer and probe for the related Address. They are depicted in the diagram with (1) and (2) correspondingly.
Obviously, we can send as many such "probe" commands (in parallel) as needed by our projection: ProbeBillingCommand, ProbePreferencesCommand, etc. effectively populating or "filling in" the denormalized projection with missing data from each handled "probe" event.
The advantages of this method is that we keep the commands/events in the first phase simple (only UUIDs to other aggregates) all the while avoiding synchronous coupling (joining) of the projections. The whole approach has a nice EDA feeling about it.
My question is then: is this a known technique? Seems like I have not seen this... And what can go wrong with this approach?
I would be more then happy to update this question with any references to other sources which describe this method.
UPDATE 1:
There is one significant flaw with this approach that I can see already: command ProbeAddrressCommand cannot be issued before the projection handler had a chance to process CustomerCreatedEvent. But this is impossible to know from the API gateway (or controller).
The solution would probably involve a Saga, say CustomerAddressJoinProjectionSaga with will start upon receiving CustomerCreatedEvent and which will only then issue ProbeAddrressCommand. The Saga will end upon registering AddressProbedEvent. Or, if many other aggregates are involved in probing, when all such events have been received.
So here is the updated diagram.
UPDATE 2:
As noted by Levi Ramsey (see answer below) my example is rather convoluted with respect to the choice of aggregates. Indeed, Customer and Address are often conceptualized as belonging together (same Aggregate Root). So it is a better illustration of the problem to think of something like Student and Course instead, assuming for the sake of simplicity that there is a straightforward relation between the two: a student is taking a course. This way it is more obvious that Student and Course are independent aggregates (students and courses can be created and maintained at different times and different places in the system).
But the question still remains: how can we obtain a projection containing the full information about a student (full name, etc.) and the courses she is registered for (title, credits, the instructor's full name, prerequisites, etc.) all in the same table, if the UI requires it ?
A couple of thoughts:
I question why address needs to be a separate aggregate much less in a different bounded context, in view of the requirement that customers have an address. If in some other bounded context customer addresses are meaningful (e.g. you want to know "which addresses have more customers" etc.), then that context can subscribe to the events from the customer service.
As an alternative, if there's a particularly strong reason to model addresses separately from customers, why not have the read side prospectively listen for events from the address aggregate and store the latest address for a given address UUID in case there's a customer who ends up with that address. The reliability per unit effort of that approach is likely to be somewhat greater, I would expect.

Composite unique constraint on business fields with Axon

We leverage AxonIQ Framework in our system. We've faced a problem implementing composite uniq constraint based on aggregate business fields.
Consider following Aggregate:
#Aggregate
public class PersonnelCardAggregate {
#AggregateIdentifier
private UUID personnelCardId;
private String personnelNumber;
private Boolean archived;
}
We want to avoid personnelNumber duplicates in the scope of NOT-archived (archived == false) records. At the same time personnelNumber duplicates may exist in the scope of archived records.
Query Side check seems NOT to be an option. Taking into account Eventual Consistency nature of our system, more than one creation request with the same personnelNumber may exist at the same time and the Query Side may be behind.
What the solution would be?
What you're asking is an issue that can occur as soon as you start implementing an application along the CQRS paradigm and DDD modeling techniques.
The PersonnelCardAggregate in your scenario maintains the consistency boundary of a single "Personnel Card". You are however looking to expand this scope to achieve a uniqueness constraints among all Personnel Cards in your system.
I feel that this blog explains the problem of "Set Based Consistency Validation" you are encountering quite nicely.
I will not iterate his entire blog, but he sums it up as having four options to resolving the problem:
Introduce locking, transactions and database constraints for your Personnel Card
Use a hybrid locking field prior to issuing your command
Really on the eventually consistent Query Model
Re-examine the domain model
To be fair, option 1 wont do if your using the Event-Driven approach to updating your Command and Query Model.
Option 3 has been pushed back by you in the original question.
Option 4 is something I cannot deduce for you given that I am not a domain expert, but I am guessing that the PersonnelCardAggregate does not belong to a larger encapsulating Aggregate Root. Maybe the business constraint you've stated, thus the option to reuse personalNumbers, could be dropped or adjusted? Like I said though, I cannot state this as a factual answer for you, as I am not the domain expert.
That leaves option 2, which in my eyes would be the most pragmatic approach too.
I feel this would require a combination of a cache at your command dispatching side to deal with quick successions of commands to resolve the eventual consistency issue. To capture the occurs that an update still comes through accidentally, I'd introduce some form of Event Handler that (1) knows the entire set of "PersonnelCards" from a personalNumber/archived point of view and (2) can react on a faulty introduction by dispatching a compensating action.
You'd thus introduce some business logic on the event handling side of your application, which I'd strongly recommend to segregate from the application part which updates your query models (as the use cases are entirely different).
Concluding though, this is a difficult topic with several ways around it.
It's not so much an Axon specific problem by the way, but more an occurrence of modeling your application through DDD and CQRS.

understanding Lagoms persistent read side

I read through the Lagom documentation, and already wrote a few small services that interact with each other. But because this is my first foray into CQRS i still have a few conceptual issues about the persistent read side that i don't really understand.
For instance, i have a user-service that keeps a list of users (as aggregates) and their profile data like email addresses, names, addresses, etc.
The questions i have now are
if i want to retrieve the users profile given a certain email-address, should i query the read side for the users id, and then query the event-store using this id for the profile data? or should the read side already keep all profile information?
If the read side has all information, what is the reason for the event-store? If its truly write-only, it's not really useful is it?
Should i design my system that i can use the event-store as much as possible or should i have a read side for everything? what are the scalability implications?
if the user-model changes (for instance, the profile now includes a description of the profile) and i use a read-side that contains all profile data, how do i update this read side in lagom to now also contain this description?
Following that question, should i keep different read-side tables for different fields of the profile instead of one table containing the whole profile
if a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
As you can see, this whole concept hasn't really 'clicked' yet, and i am thankful for answers and/or some pointers.
if i want to retrieve the users profile given a certain email-address, should i query the read side for the users id, and then query the event-store using this id for the profile data? or should the read side already keep all profile information?
You should use a specially designed ReadModel for searching profiles using the email address. You should query the Event-store only to rehydrate the Aggregates, and you rehydrate the Aggregates only to send them commands, not queries. In CQRS an Aggregate may not be queried.
If the read side has all information, what is the reason for the event-store? If its truly write-only, it's not really useful is it?
The Event-store is the source of truth for the write side (Aggregates). It is used to rehydrate the Aggregates (they rebuild their internal & private state based on the previous emitted events) before the process commands and to persist the new events. So the Event-store is append-only but also used to read the event-stream (the events emitted by an Aggregate instance). The Event-store ensures that an Aggregate instance (that is, identified by a type and an ID) processes only a command at a time.
if the user-model changes (for instance, the profile now includes a description of the profile) and i use a read-side that contains all profile data, how do i update this read side in lagom to now also contain this description?
I don't use any other framework but my own but I guess that you rewrite (to use the new added field on the events) and rebuild the ReadModel.
Following that question, should i keep different read-side tables for different fields of the profile instead of one table containing the whole profile
You should have a separate ReadModel (with its own table(s)) for each use case. The ReadModel should be blazing fast, this means it should be as small as possible, only with the fields needed for that particular use case. This is very important, it is one of the main benefits of using CQRS.
if a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
Here depends on you, the architect. It is preferred that each ReadModel owns its data, that is, it should subscribe to the right events, it should not depend on other ReadModels. But this leads to a lot of code duplication. In my experience I've seen a desire to have some canonical ReadModels that own some data but also can share it on demand. For this, in CQRS, there is also the term query. Just like commands and events, queries can travel in your system, but only from ReadModel to ReadModel.
Queries should not be sent during a client's request. They should be sent only in the background, as an asynchronous synchronization mechanism. This is an important aspect that influences the resilience and responsiveness of your system.
I've use also live queries, that are pushed from the authoritative ReadModels to the subscribed ReadModels in real time, when the answer changes.
In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
No, it does not. CQRS does not specify how the R (Read side) is updated, only that the R should not process commands and C should not be queried.

Migrating to Bounded Context

I currently have a Web API project which currently has all the system processing in the same solution. I'm breaking this out into separate solutions so that they can be ran independently (e.g. an Azure WebJob) as I don't want to have to redploy the Web project if something in the backend has changed.
My issue with this is that even though I have separated the logic they are tied together by a single context so that if I make a change in one I will have to redeploy all as the migrations won't match up.
So that's why I've been looking at Bounded Context and DDD. I'm looking at how to break this up but having trouble understanding how relationships work.
A lot of the site is administrative (i.e. creating entities, no actual processing) so was going to split contexts around this e.g.:
A user adds and maintains currency conversion rates (this is two entities in
total).
A user adds and maintains details on how to process payments (note that is is not processing payments, it only holds information about paypal account details etc).
So I was splitting the context's up by this, does this sound reasonable to start with (there are a lot more like this such as tax bands, charge structures etc)?
If this is the way to go, how do I handle relationships between those two contexts? As an example:
A payment method requires a link to an 'active' currency conversion. I understand I can just have this as an Id, but I need to check it's state so need access to the model.
A currency conversion can only be set to 'Inactive' if there are no payment methods currently using it. Again this needs access to the other model.
So logically the models need access to each other, how would this be included in the context? Can I add navigation properties to a model in a different context? Or should I add it as a separate DbSet and possibly map using a view?
Thanks
So I was splitting the context's up by this, does this sound
reasonable to start with (there are a lot more like this such as tax
bands, charge structures etc)?
"So that they can be ran and deployed independently" may not be a sufficient heuristic to tell when you should split Bounded Contexts. This addresses one aspect of the solution space, but if you haven't looked well enough at the problem space, you'll suffer from a misalignment between BC's and subdomains that can cause a lot of friction. You might end up always deploying a cluster of seemingly unrelated "independently deployable units" together because you didn't realize they talk about the same thing.
Identifying subdomains is the product of distillating your business - separating the big functional areas and defining which parts are your core domain and which are ancillary activities. Each subdomain has its own specific semantics (Ubiquitous Language). In your case, as has been pointed out in the comments, Currency Conversion and Payment Methods might well be part of the same subdomain (Payment?). It does not automatically mean that they should also be in the same BC but it might be a good idea to keep subdomains aligned 1-to-1 with BC's, as additional BC's come at a cost.
Back to deployability, even if it can be one beneficial effect of Bounded Contexts, they are not always so easily translatable in terms of independent units of deployment. Context mapping patterns (Shared Kernel, Customer Supplier, etc.) and BC communication in general can lead to a model, and therefore a part of a codebase, being shared by multiple BC's. Code and API synchronization issues arise that can question a simplistic "deployable free electron" view.
Just because you're using the Bounded Context approach doesn't mean you have to use DDD's tactical patterns (Aggregate Root, invariants, etc.) inside each BC. Using them should be an educated decision to trade solution space complexity off for problem space manageability. If "Currency Conversion can only be set to inactive..." is the only rule pertaining to payment method and currency management in your business, it might not be worth the bother to give that Bounded Context a full-fledged rich domain model. CRUD could be better suited there.

How do I adapt my recommendation engine to cold starts?

I am curious what are the methods / approaches to overcome the "cold start" problem where when a new user or an item enters the system, due to lack of info about this new entity, making recommendation is a problem.
I can think of doing some prediction based recommendation (like gender, nationality and so on).
You can cold start a recommendation system.
There are two type of recommendation systems; collaborative filtering and content-based. Content based systems use meta data about the things you are recommending. The question is then what meta data is important? The second approach is collaborative filtering which doesn't care about the meta data, it just uses what people did or said about an item to make a recommendation. With collaborative filtering you don't have to worry about what terms in the meta data are important. In fact you don't need any meta data to make the recommendation. The problem with collaborative filtering is that you need data. Before you have enough data you can use content-based recommendations. You can provide recommendations that are based on both methods, and at the beginning have 100% content-based, then as you get more data start to mix in collaborative filtering based.
That is the method I have used in the past.
Another common technique is to treat the content-based portion as a simple search problem. You just put in meta data as the text or body of your document then index your documents. You can do this with Lucene & Solr without writing any code.
If you want to know how basic collaborative filtering works, check out Chapter 2 of "Programming Collective Intelligence" by Toby Segaran
Maybe there are times you just shouldn't make a recommendation? "Insufficient data" should qualify as one of those times.
I just don't see how prediction recommendations based on "gender, nationality and so on" will amount to more than stereotyping.
IIRC, places such as Amazon built up their databases for a while before rolling out recommendations. It's not the kind of thing you want to get wrong; there are lots of stories out there about inappropriate recommendations based on insufficient data.
Working on this problem myself, but this paper from microsoft on Boltzmann machines looks worthwhile: http://research.microsoft.com/pubs/81783/gunawardana09__unified_approac_build_hybrid_recom_system.pdf
This has been asked several times before (naturally, I cannot find those questions now :/, but the general conclusion was it's better to avoid such recommendations. In various parts of the worls same names belong to different sexes, and so on ...
Recommendations based on "similar users liked..." clearly must wait. You can give out coupons or other incentives to survey respondents if you are absolutely committed to doing predictions based on user similarity.
There are two other ways to cold-start a recommendation engine.
Build a model yourself.
Get your suppliers to fill in key information to a skeleton model. (Also may require $ incentives.)
Lots of potential pitfalls in all of these, which are too common sense to mention.
As you might expect, there is no free lunch here. But think about it this way: recommendation engines are not a business plan. They merely enhance the business plan.
There are three things needed to address the Cold-Start Problem:
The data must have been profiled such that you have many different features (with product data the term used for 'feature' is often 'classification facets'). If you don't properly profile data as it comes in the door, your recommendation engine will stay 'cold' as it has nothing with which to classify recommendations.
MOST IMPORTANT: You need a user-feedback loop with which users can review the recommendations the personalization engine's suggestions. For example, Yes/No button for 'Was This Suggestion Helpful?' should queue a review of participants in one training dataset (i.e. the 'Recommend' training dataset) to another training dataset (i.e. DO NOT Recommend training dataset).
The model used for (Recommend/DO NOT Recommend) suggestions should never be considered to be a one-size-fits-all recommendation. In addition to classifying the product or service to suggest to a customer, how the firm classifies each specific customer matters too. If functioning properly, one should expect that customers with different features will get different suggestions for (Recommend/DO NOT Recommend) in a given situation. That would the 'personalization' part of personalization engines.