Lagom persistent read side and model evolution

To learn Lagom I created a simple application with a few persistent entities and a persistent read side (as per the official documentation, using Cassandra).
The official docs contain a section about model evolution, describing how to change the model. However, there is no mention of evolution when it comes to the read side.
Assume I have an entity called Item, with an ID and a name, and the read side creates a table like CREATE TABLE IF NOT EXISTS items (id TEXT, name TEXT, PRIMARY KEY (id)).
I now want to change the item to include a description. This is trivial for the persistent entity, but the read side has to be changed as well.
I can see several approaches to (maybe) achieve that:
use a model evolution tool like Liquibase or Play evolutions to change the read side tables.
somehow include ALTER TABLE statements in createTables that migrate the model.
create additional tables containing the additional information, and keep the old tables without modifications
Which approach would be the most fitting? Is there something better?

Creating a new table and dropping the old table is an option too, IMHO.
It is as simple as modifying your "create table" command ("create table mytable_v2 ..." and "drop table mytable ..."), changing the offset name, and modifying your event handlers.
override def buildHandler(): ReadSideProcessor.ReadSideHandler[MyEvent] = {
  readSide.builder[MyEvent]("myOffset") // change it to "myOffset_v2"
    ...
}
This causes all events to be replayed and your read side table to be reconstructed from scratch. This may not be an option if the current table is really huge, as the reconstruction may take a very long time.
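A minimal sketch of what such a versioned processor could look like with the Lagom Scala API (a sketch, not the asker's code: the ItemEvent/ItemCreated types, the sharded ItemEvent.Tag, the items_v2 table and the myOffset_v2 offset ID are all assumptions based on the question):

import akka.Done
import com.datastax.driver.core.{BoundStatement, PreparedStatement}
import com.lightbend.lagom.scaladsl.persistence.cassandra.{CassandraReadSide, CassandraSession}
import com.lightbend.lagom.scaladsl.persistence.{AggregateEventTag, EventStreamElement, ReadSideProcessor}
import scala.concurrent.{ExecutionContext, Future}

class ItemReadSideProcessor(readSide: CassandraReadSide, session: CassandraSession)
                           (implicit ec: ExecutionContext) extends ReadSideProcessor[ItemEvent] {

  @volatile private var insertItem: PreparedStatement = _

  // Global prepare: runs once, cluster-wide, before any events are processed.
  private def createTable(): Future[Done] =
    session.executeCreateTable(
      "CREATE TABLE IF NOT EXISTS items_v2 (id TEXT, name TEXT, description TEXT, PRIMARY KEY (id))")

  // Per-tag prepare: cache the prepared insert statement.
  private def prepareStatements(): Future[Done] =
    session.prepare("INSERT INTO items_v2 (id, name, description) VALUES (?, ?, ?)").map { ps =>
      insertItem = ps
      Done
    }

  private def handleItemCreated(e: EventStreamElement[ItemCreated]): Future[List[BoundStatement]] =
    Future.successful(List(insertItem.bind(e.event.id, e.event.name, e.event.description)))

  override def buildHandler(): ReadSideProcessor.ReadSideHandler[ItemEvent] =
    readSide.builder[ItemEvent]("myOffset_v2") // new offset ID => events replay from the beginning
      .setGlobalPrepare(() => createTable())
      .setPrepare(_ => prepareStatements())
      .setEventHandler[ItemCreated](handleItemCreated)
      .build()

  override def aggregateTags: Set[AggregateEventTag[ItemEvent]] = ItemEvent.Tag.allTags
}

Once the new table has caught up, the old items table (and the old offset row) can be dropped.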
Regarding what @erip says, I see it as perfectly normal to add a new column to your read side table. Suppose there are lots of records in this table listing all entities, and you want to retrieve a list of entities based on some criteria, so you need some columns to be included in the WHERE clause. Retrieving the list of all entities and asking each of them whether it complies with the criteria is not an option at all - it would be very inefficient, as it needs more time, memory, and network usage.
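If a full replay is too costly, the in-place alternative is a plain ALTER TABLE issued from the global prepare step. A sketch, assuming it lives in the same processor class with an implicit ExecutionContext in scope, that executeCreateTable accepts ALTER statements, and that a failure from an already-existing column is safe to swallow (production code should inspect the error):

private def createTables(): Future[Done] =
  for {
    _ <- session.executeCreateTable(
           "CREATE TABLE IF NOT EXISTS items (id TEXT, name TEXT, PRIMARY KEY (id))")
    // Existing rows keep a null description until their entities emit new events;
    // the offset ID is unchanged, so nothing is replayed.
    _ <- session.executeCreateTable("ALTER TABLE items ADD description TEXT")
           .recover { case _ => Done } // assumed: re-runs fail with "column already exists"
  } yield Done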

The point of a read-side is to materialize views from the entity state changes in your service's event stream. In this respect, you, as the owner of the service, can decide what is important for your subscribers to know about. This is handled by creating read-sides with an anti-corruption layer (or ACL).
Typically your subscribers will subscribe to API events, which should undergo no evolution. Your internal events (or impl events) will likely need to evolve; because of this, there should be a transformation from the impl events to the API events.
This is why it's very important to consider your domain very carefully before design: you really need to nail down what subscribers will need to know about. In the case of a description, it strikes me as unlikely that subscribers will need (or want!) to know about that.
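A sketch of what that impl-to-API transformation can look like (the event types here are hypothetical; in Lagom this mapping typically lives wherever the service publishes its API topic):

// Internal (impl) events are free to evolve with the model...
sealed trait ItemEventImpl
case class ItemCreatedImpl(id: String, name: String, description: Option[String]) extends ItemEventImpl

// ...while the published API event stays stable for subscribers.
case class ApiItemCreated(id: String, name: String)

// The anti-corruption layer: impl events are translated before they leave the
// service, so internal evolution (like the new description) never leaks out.
def toApi(event: ItemEventImpl): ApiItemCreated = event match {
  case ItemCreatedImpl(id, name, _) => ApiItemCreated(id, name)
}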

Related

Upsert/delete in Google Cloud Datastore

I am working on a POC (to move part of the functionality from a relational DB to Cloud Datastore). I have a few questions:
1. I need to refresh a few kinds every night, as the data comes from a different data source (via flat files). I read about it and understood that there is no TRUNCATE-like functionality in Datastore. I believe the only option is to retrieve the keys from the kind in a loop, delete entity by entity, and then use the import functionality to load the new set of data. Is there any better option?
2. Assume I have a kind called department and a kind called store. Now I need a kind called dept-store, so its parent nodes are department and store. Is there a way to enforce this kind of relationship? From the documentation I see that there can only be one parent.
3. If I have a child entity in kind1 whose parent is present in kind2, and they are linked together, is there a way to query all the properties present in kind1 and kind2 together? From a relational DB perspective, it is like an equi-join with SELECT *. I am looking for equivalent functionality in Datastore.
In order to answer your questions:
1. There are two ways to delete multiple entities. First, you can use Cloud Dataflow to delete entities in bulk [1]. Second, once the keys are retrieved, you can make a batch delete operation by passing the keys to the Datastore delete function; you have a usage example here [2]. To retrieve the keys you can run a keys-only query [3] (see the sketch after this list).
2. In Datastore an entity can have only one parent, but it can have multiple children. For your use case you may try to have a third kind, dept-store, and assign as its properties the keys of the entities from the department and store kinds. This solution may need a good understanding of your needs for implementation, as Datastore is by nature a non-relational database.
3. You can look up multiple entities by providing the keys retrieved from kind1 and kind2 with batch operations [2].
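For points 1 and 3, a rough sketch with the google-cloud-datastore Java client, used from Scala here; the kind names come from the question, while the IDs and the chunk size are illustrative:

import com.google.cloud.datastore.{Datastore, DatastoreOptions, Entity, Key, Query}
import scala.jdk.CollectionConverters._

val datastore: Datastore = DatastoreOptions.getDefaultInstance.getService

// (1) Keys-only query: retrieves keys without entity payloads...
val keysQuery = Query.newKeyQueryBuilder().setKind("department").build()
val keys: Iterator[Key] = datastore.run(keysQuery).asScala

// ...then batch delete. Datastore caps writes per commit (commonly 500
// entities), so the keys are deleted in chunks.
keys.grouped(500).foreach(batch => datastore.delete(batch: _*))

// (3) Batch lookup: fetch entities from different kinds in one round trip.
val k1: Key = datastore.newKeyFactory().setKind("kind1").newKey(42L)
val k2: Key = datastore.newKeyFactory().setKind("kind2").newKey(7L)
val entities: Seq[Entity] = datastore.fetch(k1, k2).asScala.toSeq // null entries for missing keys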

DB Design for record tracking

Background: I am writing an app and designing the database using EF Core. There are particular tables that store user actions, and I need to keep track of who created each record, who last modified it, and who deleted it (soft deletes).
I have a user table that has an int as a PK, and each of the respective fields (CreatedBy, LastModifiedBy, DeletedBy) holds an int that points to a user row.
I do have a full audit setup where the entirety of a row has its old/new contents stored on save, and that works fine. But this particular question is about the more immediate created/modified/deleted-by tracking.
Help desk is generally who uses these fields on a daily basis to help users determine what's going on quickly, but there are a lot of places in the app itself that will draw upon those fields eventually (more so created/modified from the app perspective).
Question: I was going to create a PK/FK relationship between these tables and the user table. However, it got me thinking about whether there is a better strategy than adding those 3 fields and relationships to every single table going forward. Maybe a single table that stores the table name with its PK and created/modified/deleted columns that have a relationship back to the user table, so that only one table has those PK/FK relationships back to user. I just feel that there must be a better/more efficient way to handle this. Is there?
Don't do what you are thinking; stick with your original design - keep all auditing fields on the table they relate to. Adding a table that stores auditing fields from other tables just creates a design nightmare.
Now, a bigger question is how to track the audit transaction. A design I like is as follows:
CreateBy - add default binding SUBSTRING(SUSER_NAME(), 1, 50)
CreateTs - add default binding GETDATE()
UpdateBy
UpdateTs
Hard deletes are allowed only for bad data. Soft deletes come in the form of an additional column called ActiveInd (BIT), where that transaction is stored as an update. This means that updates and soft deletes are both recorded in the UpdateBy/UpdateTs columns.
That should get you what you need if you intend to track activity from a web application. If you have a back-end system that loads and manipulates data, I would include a LoadInfo table that tracks all of the jobs. You can then add both a LoadSequenceKey and a ParentSequenceKey (with a self-referencing foreign key), and create a foreign key on all tables that are modified by jobs, storing the sequence key as either a CreateSequenceKey or an UpdateSequenceKey.

New entity ID in domain event

I'm building an application with a domain model using CQRS and domain events concepts (but no event sourcing, just plain old SQL). There was no problem with events of the SomethingChanged kind. Then I got stuck implementing SomethingCreated events.
When I create an entity that is mapped to a table with an identity primary key, I don't know the Id until the entity is persisted. The entity is persistence-ignorant, so when publishing an event from inside the entity, the Id is just not known - it's magically set only after calling context.SaveChanges(). So how/where/when can I put the Id in the event data?
I was thinking of:
Including a reference to the entity in the event. That would work inside the domain but not necessarily in a distributed environment with multiple autonomous systems communicating by events/messages.
Overriding SaveChanges() to somehow update the events enqueued for publishing. But events are meant to be immutable, so this seems very dirty.
Getting rid of identity fields and using GUIDs generated in the entity constructor. This might be the easiest, but it could hurt performance and make other things harder, like debugging or querying (WHERE id = 'B85E62C3-DC56-40C0-852A-49F759AC68FB', no MIN, MAX etc.). That's what I see in many sample applications.
A hybrid approach - leave the identity alone and use it mainly for foreign keys and faster joins, but use a GUID as the unique identifier by which I pull the entities from the repository in the application.
Personally I like GUIDs for unique identifiers, especially in multi-user, distributed environments where numeric ids cause problems. As such, I never use database generated identity columns/properties and this problem goes away.
Short of that, since you are following CQRS, you undoubtedly have a CreateSomethingCommand and a corresponding CreateSomethingCommandHandler that actually carries out the steps required to create the new instance and persist the new object using the repository (via context.SaveChanges). I would raise the SomethingCreated event there rather than in the domain object itself.
For one, this solves your problem because the command handler can wait for the database operation to complete, pull out the identity value, update the object, then pass the identity in the event. But, more importantly, it also addresses the tricky question of exactly when the object is 'created'.
Raising a domain event in the constructor is bad practice, as constructors should be lean and simply perform initialization. Plus, in your model, the object isn't really created until it has an ID assigned. This means there are additional initialization steps required after the constructor has executed. If you have more than one such step, do you enforce the order of execution (another anti-pattern), or put a check in each to recognize when they are all done (ooh, smelly)? Hopefully you can see how this can quickly spiral out of hand.
So, my recommendation is to raise the event from the command handler. (NOTE: Even if you switch to GUID identifiers, I'd follow this approach because you should never raise events from constructors.)
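A rough sketch of that shape, with Scala stand-ins for the C#/EF pieces (all names here are hypothetical; save plays the role of context.SaveChanges, and surfacing the generated ID on the entity is an assumption about the persistence layer):

// Hypothetical types standing in for the C# domain objects.
class Something(val name: String) {
  var id: Long = 0L // assigned by the database on save
}

trait SomethingRepository {
  def save(s: Something): Unit // persists and populates s.id, like SaveChanges
}

case class CreateSomethingCommand(name: String)
case class SomethingCreated(id: Long, name: String)

class CreateSomethingCommandHandler(repository: SomethingRepository,
                                    publish: SomethingCreated => Unit) {
  def handle(command: CreateSomethingCommand): Unit = {
    val something = new Something(command.name) // constructor only initializes; no event here
    repository.save(something)                  // identity value is known after this point
    // The object is only now fully "created", so the event is raised here,
    // in the command handler, with the ID included.
    publish(SomethingCreated(something.id, something.name))
  }
}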

I don't need/want a key!

I have some views that I want to query with EF 4.1. These are specific, optimized views that have no keys to speak of; there will be no deletions or updates, just good ol' SELECT.
But EF wants a key set on the model. Is there a way to tell EF to move on, there's nothing to worry about?
More Details
The main purpose of this is to query against a set of views that have been optimized by size, query parameters, and joins. The underlying tables have their PKs, FKs, and so on. Everything is indexed, statisticized (is that a word?), and optimized.
I'd like to have a class like (this is a much smaller and simpler version of what I have...):
public class MyObject // this is a view
{
    public string Name { get; set; }
    public int Age { get; set; }
    public int TotalPimples { get; set; }
}
and a repository, built off of EF 4.1 CF where I can just
public List<MyObject> GetPimply(int numberOfPimples)
{
    return db.MyObjects.Where(d => d.TotalPimples > numberOfPimples).ToList();
}
I could expose a key, but what's the real purpose of exposing a 2- or 3-column natural key that will never be used?
Current Solution
Seeing as there will be no EF CF solution, I have added a composite key to the model and I am exposing it. While this goes "with the grain" of what one expects a "well designed" DB model to look like, in this case, IMHO, it added nothing but more logic in the model builder, more bytes over the wire, and extra properties on a class. These will never be used.
There is no way. EF demands unique identification of the record - an entity key. That doesn't mean you must expose any additional column. You can mark all your current properties (or any subset) as a key - that is exactly what EDMX does when you add a database view to the model: it goes through the columns and marks all non-nullable and non-computed columns as the primary key.
You must be aware of one problem: EF internally uses an identity map, and the entity key is the unique identification in this map (each entity key can be associated with only a single entity instance). This means that if you cannot choose a unique identification for the record and you load multiple records with the same identification (your defined key), they will all be represented by a single entity instance. I am not sure whether this can cause you any issues if you don't plan to modify these records.
EF is looking for a unique way to identify records. I am not sure you can force it to go against its nature of desiring something unique about objects.
But this is an answer to the "show me how to solve my problem the way I want to solve it" question, not one that tackles your core business requirement.
If this is an "I don't want to show the user the key" issue, then don't bind the key when you bind the data to your form (web or Windows). If this is an "I need to share these items, but don't want to give them the keys" issue, then map or surrogate the objects into an external domain model. It adds a bit of weight to the solution, but allows you to still do the heavy lifting with a drag-and-drop surface (EF).
The question is: what is the business requirement that is pushing you to create a bunch of objects without a unique identifier (key)?
One way to do this would be not to use views at all.
Just add the tables to your EF model and let EF create the SQL that you are currently writing by hand.

Self Tracking Entities versus timestamp column in database

In an optimistic concurrency scenario for a web app, I am considering giving each table a timestamp column (SQL Server), comparable to a GUID. When the timestamp column is decorated with a certain attribute in Entity Framework, LINQ to Entities will generate SQL update queries like WHERE id = @p0 AND timestamp = @p1. When the number of updated records returned is 0, we have detected a concurrency exception.
In a lot of posts I read about Self-Tracking Entities, which may be an alternative or better solution, but I don't see any advantage over the "simple" timestamp method described above, apart from the scenario where the database is immutable and doesn't offer the timestamp column.
Which solution is better, and why?
EDIT
Yury Tarabanko correctly states that STE is another concept.
However, zeeshanhirani's answer demonstrates that concurrency checking is one of the main motives for tracking changes.
Let's rephrase the question: why would anybody use the STE concept for concurrency checks when the 'timestamp column' method looks so much easier?
You are mixing two concepts here. STE is not about concurrency at all.
Self-tracking entities just know how to do their own change tracking, regardless of how those changes were made. So you always know the current state of the entity object graph, and you don't need to invoke additional change detection.
What is STE?
EDIT:
"concurrency check is one main motive to track changes"
AFAIK, STEs and POCOs share the same approach to concurrency checking, which simply results in additional WHERE condition(s) in the update statement sent to the DB. Equivalent to this:
UPDATE [schema].[table]
SET [prop1] = value1, ...
WHERE [key] = key_value
AND [concurrency_prop_1] = concurrency_prop1_old_value
AND [concurrency_prop_2] = concurrency_prop2_old_value
So 'the main motive' to track changes is, well, to track changes in an N-tier app.
Self Tracking Entity actually works with the concept you described. STE basically tracks changes to the object when the context is not around. However when it sends its changes back to the server using WCF service, it sends the current values of the property, the new state of the entity and also the original values of the the primary key column, independent association value and original value for any columns that are marked as Concurrency=Fixed in entity data model.