If I were using an RDBMS (e.g. SQL Server) to store event sourcing data, what might the schema look like?
I've seen a few variations talked about in an abstract sense, but nothing concrete.
For example, say one has a "Product" entity, and changes to that product could come in the form of: Price, Cost and Description. I'm confused about whether I'd:
Have a "ProductEvent" table, that has all the fields for a product, where each change means a new record in that table, plus "who, what, where, why, when and how" (WWWWWH) as appropriate. When cost, price or description are changed, a whole new row as added to represent the Product.
Store product Cost, Price and Description in separate tables joined to the Product table with a foreign key relationship. When changes to those properties occur, write new rows with WWWWWH as appropriate.
Store WWWWWH, plus a serialised object representing the event, in a "ProductEvent" table, meaning the event itself must be loaded, de-serialised and re-played in my application code in order to re-build the application state for a given Product.
I'm particularly worried about option 2 above. Taken to the extreme, it becomes almost one table per property, and loading the application state for a given product would require loading all events for that product from each property-event table. This table explosion smells wrong to me.
I'm sure "it depends", and while there's no single "correct answer", I'm trying to get a feel for what is acceptable, and what is totally not acceptable. I'm also aware that NoSQL can help here, where events could be stored against an aggregate root, meaning only a single request to the database to get the events to rebuild the object from, but we're not using a NoSQL db at the moment so I'm feeling around for alternatives.
The event store should not need to know about the specific fields or properties of events. Otherwise every modification of your model would result in having to migrate your database (just as in good old-fashioned state-based persistence). Therefore I wouldn't recommend options 1 and 2 at all.
Below is the schema as used in Ncqrs. As you can see, the table "Events" stores the related data as a CLOB (i.e. JSON or XML). This corresponds to your option 3 (only that there is no "ProductEvents" table, because you only need one generic "Events" table; in Ncqrs the mapping to your Aggregate Roots happens through the "EventSources" table, where each EventSource corresponds to an actual Aggregate Root).
Table Events:
Id [uniqueidentifier] NOT NULL,
TimeStamp [datetime] NOT NULL,
Name [varchar](max) NOT NULL,
Version [varchar](max) NOT NULL,
EventSourceId [uniqueidentifier] NOT NULL,
Sequence [bigint],
Data [nvarchar](max) NOT NULL
Table EventSources:
Id [uniqueidentifier] NOT NULL,
Type [nvarchar](255) NOT NULL,
Version [int] NOT NULL
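One practical consequence of this layout is that rebuilding an aggregate only ever needs a single query: read its rows in Sequence order and deserialise the Data column. A minimal sketch (T-SQL; @EventSourceId is an assumed parameter, and this is not code taken from Ncqrs itself):
-- Sketch: load the full event stream for one aggregate root
SELECT [Sequence], [Name], [Version], [Data]
FROM [Events]
WHERE [EventSourceId] = @EventSourceId
ORDER BY [Sequence];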
The SQL persistence mechanism of Jonathan Oliver's Event Store implementation consists basically of one table called "Commits" with a BLOB field "Payload". This is pretty much the same as in Ncqrs, only that it serializes the event's properties in binary format (which, for instance, adds encryption support).
Greg Young recommends a similar approach, as extensively documented on Greg's website.
The schema of his prototypical "Events" table reads:
Table Events
AggregateId [Guid],
Data [Blob],
SequenceNumber [Long],
Version [Int]
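Appending to a table like this is usually guarded by an optimistic concurrency check on the aggregate's expected version. A hedged T-SQL sketch of that idea (the @-parameters and locking hints are my assumptions, not part of Greg's published design):
-- Sketch: append a new event only if nobody else has appended since we loaded the aggregate
BEGIN TRANSACTION;

DECLARE @CurrentVersion int;
SELECT @CurrentVersion = ISNULL(MAX([Version]), 0)
FROM [Events] WITH (UPDLOCK, HOLDLOCK)   -- serialise concurrent appenders for this aggregate
WHERE [AggregateId] = @AggregateId;

IF @CurrentVersion = @ExpectedVersion
BEGIN
    INSERT INTO [Events] ([AggregateId], [Data], [SequenceNumber], [Version])
    VALUES (@AggregateId, @SerializedEvent, @SequenceNumber, @ExpectedVersion + 1);
    COMMIT TRANSACTION;
END
ELSE
    ROLLBACK TRANSACTION;   -- concurrency conflict: the caller should reload and retry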
The GitHub project CQRS.NET has a few concrete examples of how you could do EventStores in a few different technologies. At the time of writing there is an implementation in SQL using Linq2SQL and a SQL schema to go with it, there's one for MongoDB, one for DocumentDB (CosmosDB if you're in Azure) and one using EventStore (as mentioned above). There are more for Azure, such as Table Storage and Blob Storage, which are very similar to flat-file storage.
I guess the main point here is that they all conform to the same principle/contract. They all store information in a single place/container/table, they use metadata to identify one event from another, and they "just" store the whole event as it was: serialised in some cases, stored natively in technologies that support it. So depending on whether you pick a document database, a relational database or even flat files, there are several different ways to reach the same intent of an event store (which is useful if you change your mind at any point and find you need to migrate or support more than one storage technology).
As a developer on the project I can share some insights on some of the choices we made.
Firstly, we found that (even with unique UUIDs/GUIDs instead of integers) IDs are often generated sequentially for strategic reasons, so an ID alone wasn't unique enough for a key. We therefore merged our main ID key column with the data/object type to create what should be a truly unique key (in the sense of your application). I know some people say you don't need to store the type, but that will depend on whether you are greenfield or have to co-exist with existing systems.
We stuck with a single container/table/collection for maintainability reasons, but we did play around with a separate table per entity/object. We found in practice that meant either the application needed "CREATE" permissions (which generally speaking is not a good idea... generally; there are always exceptions/exclusions) or that each time a new entity/object came into existence or was deployed, new storage containers/tables/collections needed to be made. We found this painfully slow for local development and problematic for production deployments. You may not, but that was our real-world experience.
Another thing to remember is that asking for action X to happen may result in many different events occurring, so knowing all the events generated by a command/event/whatever is useful. They may also span different object types, e.g. pushing "buy" in a shopping cart may trigger account and warehousing events. A consuming application may want to know all of this, so we added a CorrelationId. This meant a consumer could ask for all events raised as a result of their request. You'll see that in the schema.
Specifically with SQL, we found that performance really became a bottleneck if indexes and partitions weren't used adequately. Remember that events will need to be streamed in reverse order if you are using snapshots. We tried a few different indexes and found in practice that some additional indexes were needed for debugging in-production, real-world applications. Again, you'll see that in the schema.
Other in-production metadata was useful during production-based investigations; timestamps gave us insight into the order in which events were persisted versus raised. That gave us some assistance on a particularly heavily event-driven system that raised vast quantities of events, giving us information about the performance of things like the network and the system's distribution across it.
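To pull the points above together, here is a purely illustrative sketch of the kind of table and index this adds up to (the names are my assumptions and this is not the actual CQRS.NET schema):
-- Illustrative only: single event table with a composite key of id + type,
-- a correlation id, and the persisted timestamp used during investigations
CREATE TABLE EventStore (
    AggregateId   uniqueidentifier NOT NULL,
    AggregateType nvarchar(255)    NOT NULL,  -- merged with the id to form a truly unique key
    Version       int              NOT NULL,
    CorrelationId uniqueidentifier NOT NULL,  -- every event caused by the same request shares this
    PersistedAt   datetime2        NOT NULL,  -- when the event was persisted (vs. when it was raised)
    Data          nvarchar(max)    NOT NULL,  -- the serialised event
    CONSTRAINT PK_EventStore PRIMARY KEY (AggregateId, AggregateType, Version)
);

-- The primary key already supports streaming one aggregate's events (forwards or in reverse);
-- this extra index answers "give me everything my command caused" during debugging
CREATE INDEX IX_EventStore_Correlation ON EventStore (CorrelationId);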
You might want to take a look at Datomic.
Datomic is a database of flexible, time-based facts, supporting queries and joins, with elastic scalability, and ACID transactions.
I wrote a detailed answer here.
You can watch a talk by Stuart Halloway explaining the design of Datomic here.
Since Datomic stores facts in time, you can use it for event sourcing use cases, and so much more.
I think solution (1 & 2) can become a problem very quickly as your domain model evolves. New fields are created, some change meaning, and some can become no longer used. Eventually your table will have dozens of nullable fields, and loading the events will be mess.
Also, remember that the event store should be used only for writes; you only query it to load the events, not the properties of the aggregate. They are separate things (that is the essence of CQRS).
Solution 3 is what people usually do; there are many ways to accomplish it.
As an example, the EventFlow CQRS framework, when used with SQL Server, creates a table with this schema:
CREATE TABLE [dbo].[EventFlow](
[GlobalSequenceNumber] [bigint] IDENTITY(1,1) NOT NULL,
[BatchId] [uniqueidentifier] NOT NULL,
[AggregateId] [nvarchar](255) NOT NULL,
[AggregateName] [nvarchar](255) NOT NULL,
[Data] [nvarchar](max) NOT NULL,
[Metadata] [nvarchar](max) NOT NULL,
[AggregateSequenceNumber] [int] NOT NULL,
CONSTRAINT [PK_EventFlow] PRIMARY KEY CLUSTERED
(
[GlobalSequenceNumber] ASC
)
)
where:
GlobalSequenceNumber: simple global identification; may be used for ordering or for identifying missing events when you build your projection (read model).
BatchId: an identification of the group of events that were inserted atomically (to be honest, I have no idea why this would be useful).
AggregateId: identification of the aggregate.
Data: the serialized event.
Metadata: other useful information about the event (e.g. the event type used for deserialization, a timestamp, the originator ID from the command, etc.).
AggregateSequenceNumber: the sequence number within the same aggregate (useful if you cannot allow writes to happen out of order, so you use this field for optimistic concurrency; see the sketch below).
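For example, the optimistic-concurrency role of AggregateSequenceNumber is typically backed by a unique index, so a stale or concurrent writer simply gets a duplicate-key error it can translate into a retry (a sketch; EventFlow's actual scripts may differ):
-- Sketch: let the database reject concurrent or out-of-order appends
CREATE UNIQUE INDEX IX_EventFlow_AggregateId_AggregateSequenceNumber
    ON [dbo].[EventFlow] ([AggregateId], [AggregateSequenceNumber]);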
However, if you are creating it from scratch I would recommend following the YAGNI principle and creating only the minimal required fields for your use case.
A possible hint: a design that follows the "Slowly Changing Dimension" (type 2) pattern should help you cover:
order of events occurring (via surrogate key)
durability of each state (valid from - valid to)
A left-fold function should also be straightforward to implement, but you need to think about future query complexity.
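A minimal sketch of what such an SCD type 2 table might look like for the Product example (illustrative names only):
-- Hypothetical SCD type 2 style history of a product's state
CREATE TABLE ProductHistory (
    ProductHistoryId bigint IDENTITY(1,1) PRIMARY KEY,  -- surrogate key gives the order of events
    ProductId        int            NOT NULL,
    Price            decimal(18, 2) NOT NULL,
    Cost             decimal(18, 2) NOT NULL,
    Description      nvarchar(max)  NOT NULL,
    ValidFrom        datetime2      NOT NULL,
    ValidTo          datetime2      NULL                -- NULL marks the currently valid row
);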
I realise this is a late answer, but I would like to point out that using an RDBMS as event-sourcing storage is entirely possible if your throughput requirements are not high. To illustrate, here are examples of an event-sourcing ledger I built:
https://github.com/andrewkkchan/client-ledger-service
The above is an event sourcing ledger web service.
https://github.com/andrewkkchan/client-ledger-core-db
In the above, I use an RDBMS to compute state, so you can enjoy all the advantages that come with an RDBMS, such as transaction support.
https://github.com/andrewkkchan/client-ledger-core-memory
And I have another consumer that processes in memory to handle bursts.
One could argue that the actual event store above still lives in Kafka, as an RDBMS is slow at inserting, especially when inserting is always appending.
I hope the code helps give you an illustration, apart from the very good theoretical answers already provided for this question.
I'm trying to use event sourcing, DDD and CQRS.
I can't work out whether I have to create two databases (or tables), one holding JSON events and one normalized, or just one database (just JSON).
Also, if I do create two databases (or tables), do I have to save the data to both (JSON and normalized) atomically, in one transaction, or not?
Best regards
DDD
We're making an assumption here that you fully understand using DDD and the implications. Specifically related to Event Sourcing, it's a matter of defining Aggregate boundaries and the events that become their state.
CQRS
Again we're making an assumption that you fully understand the implications. CQRS merely allows you to write code in vertical slices (i.e. from UI to database) for handling "commands" separately from code that handles "queries". That's all. While it's true that you can then take this further, by storing data in a "read model" that might even be in a different database, let alone table, it's not a requirement of implementing CQRS.
As CQRS pertains to Event Sourcing: it's a good fit because the data model you tend to end up with in Event Sourcing is not conducive to complex queries. It's typically limited to "get the Aggregate by its ID". Therefore, having "projections" that store the data in other ways more appropriate for querying and loading into UIs is the typical approach.
Event Sourcing
If you implement a Domain Model in such a way that every command handled by an Aggregate (i.e. every use-case/task carried out by a user) generates one or more events, then Event Sourcing is the principle whereby you store that list of events in an append-only style against the Aggregate's ID, rather than storing a snapshot of the Aggregate after the command was successfully handled.
To load an aggregate from the event store, you load all its previous events and replay them in memory on the Aggregate object, rather than loading a single row/document as a snapshot/memento.
A document database is therefore an excellent choice for event stores, because a single document represents the event stream for a given Aggregate. However if you want to store your event streams in SQL, that's fine, but you might store it in two tables:
create table Aggregate (Id int not null...);
create table AggregateEvent(AggregateId int not null FK..., Version int not null, eventBody nvarchar(max));
The actual event body would typically be the event itself, serialised to a text format like JSON.
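A slightly fuller, still illustrative version of those two tables, plus the replay query, might look like this (the column names are assumptions):
-- Illustrative sketch; the ellipses in the definitions above are deliberately left open
create table Aggregate (
    Id int not null primary key
);

create table AggregateEvent (
    AggregateId int not null references Aggregate(Id),
    Version     int not null,
    EventBody   nvarchar(max) not null,  -- the event serialised as JSON
    primary key (AggregateId, Version)
);

-- Rehydrating an aggregate = load its events in order and replay them in memory
select EventBody
from AggregateEvent
where AggregateId = @aggregateId
order by Version;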
Projections and Read Stores
If you take the events generated by the handling of commands by aggregates, and write code that consumes them by writing to a separate data store (SQL, pre-calculated ViewModels, etc), then you can call that a "projection". It's "projecting" the data that's in one shape into another shape fit for a different purpose. The result is a "read store", which you can then query however you need to.
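As a sketch (the table and column names are assumptions, reusing the Product example from the question), a projection could maintain a flat, query-friendly row per aggregate that a background consumer keeps up to date as events arrive:
-- Hypothetical read store, shaped for querying rather than for rebuilding aggregates
create table ProductReadModel (
    ProductId   int            not null primary key,
    Price       decimal(18, 2) not null,
    Cost        decimal(18, 2) not null,
    Description nvarchar(max)  not null,
    LastVersion int            not null   -- last event version applied, keeps updates idempotent
);

-- What the projection might run when it consumes a PriceChanged event (illustrative)
update ProductReadModel
set Price = @newPrice, LastVersion = @eventVersion
where ProductId = @productId and LastVersion < @eventVersion;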
I can't work out whether I have to create two databases (or tables), one holding JSON events and one normalized, or just one database (just JSON)
It's possible to get by with just an event store and nothing else.
"get by" isn't necessarily pleasant, however. Event stores are, as a rule, really good at "append new information", but not particularly good at "query". Thus, the usual answer is to deploy processes that copy information from your event store to something that has nicer query support.
If I do create two databases (or tables), do I have to save the data to both (JSON and normalized) atomically, in one transaction, or not?
It's a common pattern to update the event storage only, and then later invoke the process that copies the information from the event storage to your query support. Of course, that also means your queries may end up showing old/out-of-date information ("here is the answer to your question as of five minutes ago").
If you store your query friendly data model with the event storage (tables in the same relational database, for instance), then you can arrange for at least some of your updates to the query friendly model to be synchronized with the events.
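For instance, a sketch under the assumption that the events and the query-friendly model live in the same SQL database (the table names follow the illustrative sketches above):
-- Sketch: append the event and update the read model in the same transaction
begin transaction;

insert into AggregateEvent (AggregateId, Version, EventBody)
values (@productId, @version, @priceChangedJson);

update ProductReadModel
set Price = @newPrice, LastVersion = @version
where ProductId = @productId;

commit transaction;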
In other words, you get trade-offs, not a single cookie-cutter pattern that is used everywhere.
Is there a way to have my items automatically ordered by creation order in DynamoDB?
I've tried using an ISO timestamp in my sort keys, but I soon noticed that items created in the same second have no guaranteed order.
Another major issue is that my sort key is composite, for example:
Sort_Key : someRandomUuidHere|Created:someTimeStampHere
I need to generate UUIDs to try to guarantee uniqueness, but adding the timestamp at the end of the UUID doesn't order items by the timestamp; it orders them by the UUID instead. And if I add the timestamp to the start, I can't use things like begins_with, so it breaks my access patterns.
The only way I could think of was maintaining a "last-key" object and always asking it for the last item's index before inserting, but that would require an extra request and some ACID logic.
Or maybe just order things on the client side; there's always that option.
Sort keys are sorted left to right... so yes, with the UUID first, it is effectively the only component that affects the order.
Why not place the timestamp first?
Alternatively, consider a time-based UUID. DynamoDB doesn't offer one natively, although some NoSQL DBs, such as Cassandra, do.
Here's an article that discusses creating one with .NET:
Creating a Time UUID (GUID) in .NET
which includes a link to the following source:
https://github.com/fluentcassandra/fluentcassandra/blob/master/src/GuidGenerator.cs
I read through the Lagom documentation and have already written a few small services that interact with each other. But because this is my first foray into CQRS, I still have a few conceptual issues about the persistent read side that I don't really understand.
For instance, I have a user-service that keeps a list of users (as aggregates) and their profile data, like email addresses, names, addresses, etc.
The questions I have now are:
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's ID and then query the event store with this ID for the profile data? Or should the read side already hold all profile information?
If the read side has all the information, what is the reason for the event store? If it's truly write-only, it's not really useful, is it?
Should I design my system so that I can use the event store as much as possible, or should I have a read side for everything? What are the scalability implications?
If the user model changes (for instance, the profile now includes a description), and I use a read side that contains all profile data, how do I update this read side in Lagom so that it also contains this description?
Following on from that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In the latter case, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
As you can see, this whole concept hasn't really "clicked" yet, and I am thankful for answers and/or some pointers.
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's ID and then query the event store with this ID for the profile data? Or should the read side already hold all profile information?
You should use a specially designed ReadModel for searching profiles using the email address. You should query the Event-store only to rehydrate the Aggregates, and you rehydrate the Aggregates only to send them commands, not queries. In CQRS an Aggregate may not be queried.
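As a sketch, assuming a SQL read side (the table and column names are illustrative, not Lagom's defaults), the read model for this use case can be as small as:
-- Hypothetical read model dedicated to "find a profile by email address"
CREATE TABLE user_profile_by_email (
    email      varchar(320) PRIMARY KEY,
    user_id    varchar(36)  NOT NULL,
    first_name varchar(100) NOT NULL,
    last_name  varchar(100) NOT NULL,
    address    varchar(500) NULL
);

-- The query side answers this directly; the event store is never touched
SELECT user_id, first_name, last_name, address
FROM user_profile_by_email
WHERE email = 'someone@example.com';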
If the read side has all the information, what is the reason for the event store? If it's truly write-only, it's not really useful, is it?
The Event-store is the source of truth for the write side (the Aggregates). It is used to rehydrate the Aggregates (they rebuild their internal, private state from the previously emitted events) before they process commands, and to persist the new events. So the Event-store is append-only, but it is also used to read an event stream (the events emitted by an Aggregate instance). The Event-store also ensures that an Aggregate instance (identified by a type and an ID) processes only one command at a time.
If the user model changes (for instance, the profile now includes a description), and I use a read side that contains all profile data, how do I update this read side in Lagom so that it also contains this description?
I don't use any framework other than my own, but I guess that you rewrite the projection (to use the newly added field on the events) and rebuild the ReadModel.
Following on from that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
You should have a separate ReadModel (with its own table(s)) for each use case. The ReadModel should be blazing fast, this means it should be as small as possible, only with the fields needed for that particular use case. This is very important, it is one of the main benefits of using CQRS.
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In the latter case, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
This depends on you, the architect. It is preferable that each ReadModel owns its data, that is, it subscribes to the right events and does not depend on other ReadModels. But this leads to a lot of code duplication. In my experience I've seen a desire to have some canonical ReadModels that own some data but can also share it on demand. For this, CQRS also has the term query. Just like commands and events, queries can travel in your system, but only from ReadModel to ReadModel.
Queries should not be sent during a client's request. They should be sent only in the background, as an asynchronous synchronization mechanism. This is an important aspect that influences the resilience and responsiveness of your system.
I've also used live queries, which are pushed from the authoritative ReadModels to the subscribed ReadModels in real time, when the answer changes.
In the latter case, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
No, it does not. CQRS does not specify how the R (Read side) is updated, only that the R should not process commands and C should not be queried.
Some of the Users in my database will also be Practitioners.
This could be represented by either:
an is_practitioner flag in the User table
a separate Practitioner table with a user_id column
It isn't clear to me which approach is better.
Advantages of flag:
fewer tables
only one id per user (hence no possibility of confusion, and also no confusion in which id to use in other tables)
flexibility (I don't have to decide whether fields are Practitioner-only or not)
possible speed advantage for finding User-level information for a practitioner (e.g. e-mail address)
Advantages of new table:
no nulls in the User table
clearer as to what information pertains to practitioners only
speed advantage for finding practitioners
In my case specifically, at the moment, practitioner-related information is generally one-to-many (such as the locations they can work in, or the shifts they can work, etc.). I would not be at all surprised if it turns out I need to store simple attributes for practitioners (i.e., one-to-one).
Questions
Are there any other considerations?
Is either approach superior?
You might want to consider the fact that someone who is a practitioner today may be something else tomorrow (and by that I don't mean no longer being a practitioner): say a consultant, an author, or whatever the variants are in your subject domain, and you might want to keep track of their latest status in the Users table. So it might make sense to have a ProfType field (type of professional practice) or equivalent. This way you have all the advantages of having a flag: you could keep it as a string field and leave it as a blank string, or fill it with other ProfType codes as your requirements grow.
You mention that having a new table has the advantage of making it faster to find practitioners. No, you are better off with a WHERE clause on the Users table for that.
Your last paragraph (one-to-many), however, may tilt the whole choice in favour of a separate table. You might also want to consider the likely number of records, likely growth, the criticality of complicated queries, etc.
I tried to draw two scenarios, with some notes inside the image. It's really only a draft, just to help you "see" the various entities. Maybe you have already done something like this; in that case, please don't consider my answer. As Whirl stated in his last paragraph, you should consider other things too.
Personally I would go for a separate table, as long as you can already identify some extra data that makes sense only for a Practitioner (e.g. full professional title, the University, Hospital or any other entity the Practitioner is associated with).
So if in the future you discover more data that makes sense only for the Practitioner, and/or identify another distinct "subtype" of User (e.g. Intern), you can just add fields to the Practitioner subtable, or add a new table for the Intern, as sketched below.
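A sketch of that subtype layout, with purely illustrative names:
-- Hypothetical subtype layout: shared fields on users, practitioner-only data in its own table
CREATE TABLE users (
    user_id   int          PRIMARY KEY,
    email     varchar(320) NOT NULL,
    full_name varchar(200) NOT NULL
);

CREATE TABLE practitioners (
    user_id            int          PRIMARY KEY REFERENCES users(user_id),  -- one-to-one with users
    professional_title varchar(200) NOT NULL
);

-- One-to-many practitioner data hangs off the practitioner row, not off users
CREATE TABLE practitioner_locations (
    user_id     int NOT NULL REFERENCES practitioners(user_id),
    location_id int NOT NULL,
    PRIMARY KEY (user_id, location_id)
);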
It might be advantageous to use a User Type field as suggested by #Whirl Mind above.
I think that this is just one example of having to identify different type of Objects in your DB, and for that I refer to one of my previous answers here: Designing SQL database to represent OO class hierarchy
I have a database using PostgreSQL, which holds data on students, applications and job offers.
Is there some kind of constraint that will mean a student can only accept one job offer, so that by selecting "yes" on the "job accepted" attribute, they can no longer do this for any other jobs they may receive?
It is not exactly a "constraint". It is just a column. In the Student table have a column called AcceptedJobOffer. That solves the direct problem. In addition, you want the following:
AcceptedJobOfferId int references JobOffers(JobOfferid)
Then create a unique index on Applications over (StudentId, JobOfferId) and include:
foreign key (StudentId, AcceptedJobOfferId) references Applications(StudentId, JobOfferId)
This ensures that the accepted offer is a valid job offer and that it references an application (assuming that an application is a requirement, 100% of the time, for acceptance).
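Put together, a hedged PostgreSQL sketch of the above (assuming Students, Applications and JobOffers tables with these column names) could be:
-- Illustrative only; the table and column names are assumptions
ALTER TABLE Applications
    ADD CONSTRAINT UQ_Applications_Student_Offer UNIQUE (StudentId, JobOfferId);

ALTER TABLE Students
    ADD COLUMN AcceptedJobOfferId int NULL REFERENCES JobOffers(JobOfferId);

ALTER TABLE Students
    ADD CONSTRAINT FK_Students_AcceptedApplication
        FOREIGN KEY (StudentId, AcceptedJobOfferId)
        REFERENCES Applications (StudentId, JobOfferId);
-- With the default MATCH SIMPLE, rows where AcceptedJobOfferId is NULL are not checked,
-- so students who have not accepted anything yet remain valid.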
I imagine you have some kind of job applications table, with a field called is_accepted in it or something to that effect. You can add an exclusion constraint on it. Example here.
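A hedged PostgreSQL sketch of that idea, assuming an applications table with a student_id column and a boolean is_accepted column:
-- Hypothetical: at most one accepted application per student
CREATE EXTENSION IF NOT EXISTS btree_gist;  -- needed for equality exclusion on plain columns

ALTER TABLE applications
    ADD CONSTRAINT one_accepted_offer_per_student
    EXCLUDE USING gist (student_id WITH =) WHERE (is_accepted);

-- A partial unique index achieves the same effect:
-- CREATE UNIQUE INDEX one_accepted_offer_per_student_ix ON applications (student_id) WHERE is_accepted;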
An alternative is to add an accepted_job_id column (ideally a foreign key) to the students table, as already suggested by Gordon.
Side note: if this is going to deal with real data, rather than theoretical data in a database course, you probably do not want to enforce the constraint at all. Sometimes people want or need multiple jobs, so limiting the system so that they cannot accept more than one introduces an artificial limitation that may come back and bite you down the road.