I have two aggregate roots: Employee and Company. Using CQRS, I have two actions to create each model: CreateCompany (POST /company) and NewEmploy (POST /employe), and two actions to retrieve them: GetCompany (GET /company/{id}) and GetEmploy (GET /employe/{id}). I want to get the employees that belong to a company, so I created the following endpoint: /company/1?include=employees. But I don't know whether I should do a join in my model to get the employees related to the company, or whether, every time I add a new employee, I should update the read model so it can be read directly without a join. Right now I'm using the same tables for the write model and the read model.
There are actually three choices.
1) When you add the new employee, also run the join query and use the results to update the read model; when you query the read model, just return the most recent copy.
2) When you add the new employee, stop. When you query the read model, run the join to update your data, then return that fresh copy.
3) When you add the new employee, stop. When you query the read model, just return the most recent copy. In the background, run a task that watches for newly added employees; if one was added, run the join and update the read model.
Running in the background can mean a lot of different things: you can schedule a job each time an employee is added, you can run the job on a scheduler, or you can give an admin controls to run the job on demand.
You'll probably end up making choices based on the kind of SLA you need to meet (how "old" the data in the read model is allowed to be before people start complaining), on how you are going to let users read their own writes, and on what other sorts of caching are in use in the system.
The important thing to understand is that "transform the write model to the read model" is an operation that you can run outside of the writes and reads.
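For illustration, here is a minimal sketch of option 1 in Scala, assuming a relational write model with an employees table and a denormalized company_view read table. All table, column, and function names are assumptions, not from the original post.

```scala
import java.sql.Connection

// A sketch of "transform the write model to the read model" as a
// standalone operation. Table and column names are hypothetical.
object CompanyProjection {

  // Re-run the join for one company and upsert the result into the
  // denormalized read table. Call it right after handling NewEmploy
  // (option 1) or from a background task (option 3).
  def refresh(conn: Connection, companyId: Long): Unit = {
    val select = conn.prepareStatement(
      "SELECT e.name FROM employees e WHERE e.company_id = ?")
    select.setLong(1, companyId)
    val rs = select.executeQuery()
    val names =
      Iterator.continually(rs).takeWhile(_.next()).map(_.getString("name")).toList

    // Postgres-style upsert; naive JSON building with no escaping.
    val upsert = conn.prepareStatement(
      """INSERT INTO company_view (company_id, employees_json) VALUES (?, ?)
        |ON CONFLICT (company_id) DO UPDATE
        |SET employees_json = EXCLUDED.employees_json""".stripMargin)
    upsert.setLong(1, companyId)
    upsert.setString(2, names.map("\"" + _ + "\"").mkString("[", ",", "]"))
    upsert.executeUpdate()
  }
}
```

The same refresh function serves options 1 and 3 unchanged; only the trigger that invokes it differs, which is the point of keeping the projection as its own operation.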
I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling to find the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transformation of fields from the Platform's data files (CSVs) into our Stage1 SQL database; by the time the data lands there, the data types should be set properly and the indexes in place.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer's context. Within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. Our incoming file store might look like this (simplified; there are 20-30 source datasets per Customer, depending on the Platform):
Platform
    Customer A
        Employees.csv
        PayPeriods.csv
        etc.
    Customer B
        Employees.csv
        PayPeriods.csv
        etc.
Each Customer lands in their own SQL schema, so after processing the above I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little schema drift between Customers, which does happen on some Platforms; we handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to set up ADF so I can effectively manage one set of mappings per Platform, and automatically accommodate any new Customers we add to that Platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per Platform, and one data flow per file type per Platform. The pipeline has a variable, "schemaname", which is set from the path of the file that triggered it (e.g. "CustomerA"). Then, depending on the file name, a branching conditional fires the right data flow: if it's "employees.csv" it runs one data flow; if it's "payperiods.csv" it runs a different one. They would all use the same generic target sink dataset, with the table name parameterized, and those parameters set in the pipeline from the schema variable and the file name from the conditional branch.
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect certain columns to be present, you may get data flow execution failures when those columns are not present in a Customer's source files.
The way to protect against that in ADF Data Flows is to use column patterns, which let you define mappings that are generic and more flexible.
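As for the trigger wiring, a minimal sketch in ADF's expression language, assuming a storage event trigger; the exact folderPath segments depend on your container layout, so treat the expressions as illustrative rather than definitive:

```
schemaname: @last(split(triggerBody().folderPath, '/'))
filename:   @triggerBody().fileName
tablename:  @replace(triggerBody().fileName, '.csv', '')
```

With a blob path like Platform/CustomerA/Employees.csv, schemaname would resolve to "CustomerA" and tablename to "Employees". The generic sink dataset can then expose schemaName and tableName as dataset parameters and reference them as @dataset().schemaName and @dataset().tableName in its type properties.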
I have a Customer read model that needs to be updated after a NewOrderEvent.
One thing I want to understand: should I update my read model on every event, or do I need to replay all events and rebuild the read model?
What I'm doing now is:
1) Saving the NewOrderEvent.
2) Getting or creating the Customer read model.
3) Invoking Customer.ApplyEvent(NewOrderEvent), which changes the Customer state.
4) Saving the Customer read model.
Am I missing something?
Usually yes, you want to update the read model every time you have an event. It's just a simple CRUD operation, a DB update. Replaying events is done when you want to (re)generate a read model from scratch, because you could have millions of events and it could be a very long-running operation.
Btw, the "apply" terminology should be reserved for the command model only, to avoid confusion. You apply events to a domain aggregate root (entity), but you use an event as the source of data for read model updates.
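To make that separation concrete, here is a minimal Scala sketch; the field names and the load/save functions are assumptions, not from the original post:

```scala
// Domain event produced by the command side.
final case class NewOrderEvent(customerId: String, orderTotal: BigDecimal)

// Read model: a denormalized row, not a domain aggregate.
final case class CustomerReadModel(customerId: String,
                                   orderCount: Int,
                                   totalSpent: BigDecimal)

object CustomerProjector {
  // Get-or-create the row, fold the event's data in, save it back.
  // Note: no ApplyEvent here; the event is just a source of data.
  def handle(event: NewOrderEvent,
             load: String => Option[CustomerReadModel],
             save: CustomerReadModel => Unit): Unit = {
    val current = load(event.customerId)
      .getOrElse(CustomerReadModel(event.customerId, 0, BigDecimal(0)))
    save(current.copy(
      orderCount = current.orderCount + 1,
      totalSpent = current.totalSpent + event.orderTotal))
  }
}
```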
Looks good to me. You may decide to replay the stream of events to recreate the read model only if you're introducing something new to it.
Some people rebuild read models whenever the schema changes, but in many cases you can use migrations for that. It really depends on your application.
The following is a use case in a workflow system.
A work order enters the system. The work order has a target that goes through different workflow states before the work order is completed.
Say a work order for a target Vehicle comes into the system; the workflow for this work order involves two tasks:
a) wash vehicle
b) inspect vehicle
Say the "wash vehicle" task changes a vehicle attribute from "not washed" to "washed", and the "inspect vehicle" task changes a vehicle attribute from "not inspected" to "inspection done".
If a user pulls the work order data, they will always see the latest vehicle data (in this example, assuming both workflow tasks are completed, the user sees "washed" and "inspection done"). However, when the user pulls ONLY the "wash vehicle" task's data, they will see just "washed": even though the second task was done, task 1 only sees what it modified. Getting data for task 2 shows both "washed" and "inspection done".
This involves milestoning (an audit trail) of data. One approach is shown in the image below: when a workflow task modifies data, it updates the version number and modified_ts, and keeps that version number in its own data row (via a JOIN table, as depicted). Basically this is nothing but maintaining a reference to a history record for the workflow task's data, so when pulling workflow task data we know which history record to pull back. Please ignore parent_id and the other notes and noise in the picture; they're not relevant to this question.
I am thinking event sourcing could be another alternative design. However, I don't want to apply event sourcing (or any other similar solution) as a wholesale solution, only for this particular use case (affecting only the 3 or so tables where the audit trail matters). I am trying to evaluate whether CQRS/event sourcing is the right fit as a partial solution (again, limited to the 3-4 tables that need to preserve history/audit-trail data), or whether ES/CQRS would be overkill. Any other thoughts?
P.S. Though this isn't related to Scala, Scala is the platform we are using, hence the tag, in case there are language-specific solutions that can help. Tagging Akka to find out whether ES/CQRS via Akka Persistence is an option. PostgreSQL is the DB, and DB triggers are not a solution I am looking for.
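For reference, a minimal sketch of what the Akka Persistence route could look like for just one of those audited entities, using the classic PersistentActor API. The command, event, and state here are assumptions for illustration, and a journal plugin still has to be configured:

```scala
import akka.persistence.PersistentActor

// Command and event (names are hypothetical).
final case class MarkWashed(vehicleId: String)
final case class VehicleWashed(vehicleId: String)

// One persistent actor per work-order target. Every state change is
// stored as an event, so the journal itself is the audit trail and
// can be read back per workflow task.
class VehicleWorkflow(vehicleId: String) extends PersistentActor {
  override def persistenceId: String = s"vehicle-$vehicleId"

  private var washed = false

  override def receiveCommand: Receive = {
    case MarkWashed(_) =>
      persist(VehicleWashed(vehicleId)) { _ =>
        washed = true
      }
  }

  // Replayed on actor restart to rebuild current state.
  override def receiveRecover: Receive = {
    case VehicleWashed(_) => washed = true
  }
}
```

Because only the entities you persist this way are event-sourced, the approach stays scoped to the 3-4 tables that need history rather than becoming a wholesale rewrite.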
A little background:
I have 2 entities (Product and Case). The Product entity holds all product records. A section in the Case will have the ability to choose products and auto-populate all related fields that come from the product record for that specific product. For example, the Product record has fields like hazardous, range, lot, etc. The same fields appear on the Case record. These fields should only be populated based on the product that was selected.
I was able to accomplish the above by creating a 1:N relationship and adding it to my Case form. I then created a workflow to populate the related fields (hazardous, range, lot, etc.). However, these fields only populate when the record is saved. Is there a way to make the fields update as soon as the product is chosen?
I want to refrain from using any type of JavaScript. If at all possible, I would like to strictly use workflows to accomplish this.
Real-time updates in your case can only be accomplished using JavaScript. Field mappings work too, but they have special behavior.
Workflows that fire when the record is created only execute after all core operations are done (native logic, plug-in logic, etc.), and you can't fire workflows if the record has not been created yet.
So using workflows is a good idea even if you can't see the information immediately.
I'm trying to understand CQRS to see if it can help in a reporting environment.
Problem: a CQRS-designed system is already in production, happily generating commands and events and updating the necessary query views. A new report is required. This report takes a number of parameters: start date, end date, product type, and product category.
How do I generate the aggregate views for a query store that will initially be empty, and that can be queried with parameters taking very different values?
Do I try and solve this using a CQRS approach, or is there a better alternative?
Thanks
If it is not reasonable to precompute all your report data into a flat view, then just don't do that. You may want to join a bunch of tables for your report. It's your decision what can be precomputed and what is not worth it (CPU and storage considerations).
In your particular case (StartDate, EndDate, ...), I can't see the problem with generating a single view model table and just querying it directly with the parameters.
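As a minimal sketch of that direct query, assuming the events have already been projected into a flat report_view table; every name and value below is hypothetical:

```scala
import java.sql.DriverManager

// Plain JDBC against the (hypothetical) precomputed view table.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost/readmodels")
val stmt = conn.prepareStatement(
  """SELECT product_type, product_category, SUM(amount) AS total
    |FROM report_view
    |WHERE sold_on BETWEEN ? AND ?
    |  AND product_type = ?
    |  AND product_category = ?
    |GROUP BY product_type, product_category""".stripMargin)
stmt.setDate(1, java.sql.Date.valueOf("2024-01-01"))  // start date
stmt.setDate(2, java.sql.Date.valueOf("2024-12-31"))  // end date
stmt.setString(3, "Bicycles")                         // product type
stmt.setString(4, "Road")                             // product category
val rs = stmt.executeQuery()
```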
Figure out which events are required to gather all report data.
Query all those events, republish them to the endpoint that handles updating the new report table(s).
Wait until all events have been processed.
Put some indexes on the columns that will function as report query criteria.
Done!
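A minimal sketch of steps 1-3, assuming a hypothetical EventStore abstraction over your journal and a ProductSold event; none of these names come from the original question:

```scala
final case class ProductSold(productType: String,
                             category: String,
                             amount: BigDecimal,
                             soldOn: java.time.LocalDate)

// Stand-in for whatever gives you ordered access to stored events.
trait EventStore {
  def replayProductSold(handler: ProductSold => Unit): Unit
}

// Step 1: only ProductSold events are needed for this report.
// Step 2: republish each one to the handler that fills the report table.
// Step 3: the replay here is synchronous, so when it returns, every
//         event has been processed and the indexes can be added (step 4).
def rebuildReport(store: EventStore, insertRow: ProductSold => Unit): Unit =
  store.replayProductSold(insertRow)
```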