Keeping a snapshot of the most current version of each aggregate in an event store

Keeping a snapshot of the most current version of each aggregate in an event store - cqrs

We're currently using an SQL-backed Event Store (the typical 2-table implementation) and some people in the team are afraid that even though we're using the Event Store only for writes, things may get a bit slower, so a suggestion was put in place to instead of adding snapshots here and there, to actually maintain a fully-consistent (with the event streams) snapshot of each aggregate in its most recent state (in JSON format). All the querying on the system will end up being done on the read-side, with a typical SQL database that is updated in an eventual consistency fashion from the ES (write) side.
Having such a system in place would allow us to enjoy the benefits of having an Event Store while simultaneously removing any possible performance issues altogether. We are currently not making use of any "time-travelling" feature, although sooner of later that will end up being the case.
Is this a good approach? There's something in it leaving my uncomfortable. For instance, if we need some sort of time-travelling feature, not having snapshots here and there in each aggregate's event-stream will prove a performance disaster. Of course we could have both a most-current-snapshot per aggregate instance and also snapshots throughout the event-streams.
In case we decide to go down this route, should we make the snapshot update for a given aggregate transactional to the events updates on that same aggregate, or should we just update the events and in an eventually-consistent manner update the snapshot?
What are the downsides of this approach? Has anyone tried something of the kind?

You should probably run your own benchmarks before adding unnecessary complexity to your system. We have noticed some performance problems when thousands of events need to be queried and applied to rebuild an aggregate from the event stream, where JSON to object deserialization was the biggest performance bottleneck. If each of your aggregates has only few events (say, < 100) you probably won't notice any significant differences in practice.
Most event stores record snapshots every n events/commits, say every 50-100 events, and on assembly query the latest snapshots and apply the missing events since the last snapshot. If you also keep all old snapshots in your snapshot database, the time traveling feature will be as fast as a usual query, and you'll only need slightly more persistence space, which is cheap nowadays.
The snapshots should always be written out of the original transaction (and can be generated in another thread), since it's non-crucial if the last snapshot is missing, but you want to don't want your business transaction to fail due to errors in the snapshot write transaction.
Depending on your usual system uptime and data size, it might make sense to held snapshots in memory or a distributed cache/graid or in another database (non-SQL).

Related

Event Sourcing - How to query inside a command?

We would like to be able to read state inside a command use case.
We could get the state from event store for the specific aggregate, but what about querying aggregates by field(not id) or performing more complicated queries, that are not fitted for the event store?
The approach we were thinking was to use our read model for those cases as well and not only for query use cases.
This might be inconsistent, so a solution could be to have the latest version of the aggregate stored in both write/read models, in order to be able to tell if the state is correct or stale.
Does this make sense and if yes, if we need to get state by Id should we use event store or the read model?

If you want the absolute latest state of an event-sourced aggregate, you're going to have to read the latest snapshot (assuming that you are snapshotting) and then replay events since that snapshot from the event store. You can be aggressive about snapshotting (conceivably even saving a snapshot after every command), but you're giving away some write performance to make the read faster.
Updating the read model directly is conceivably possible, though that level of coupling is something that should be considered very carefully. Note also that you will very likely need some sort of two-phase commit to ensure that the read model is only updated when the write model is updated and vice versa. I strongly suggest considering why you're using CQRS/ES in this project, because you are quite possibly undermining that reason by doing this sort of thing.
In general, if you need a query for processing a particular command, it's likely that query will generally be the same, i.e. you don't need free-form query support. In that case, you can often have a read model that's tuned for exactly that query and which only cares about events which could affect that query: often a fairly small subset of the events. The finer-grained the read model, the easier it is to keep in sync (if it ignores 99% of events, for instance, it can't really fall that far behind).
Needing to make complex queries as part of command processing could also be a sign that your aggregate boundaries aren't right and could do with a re-examination.

Does this make sense
Maybe. Let's start with
This might be inconsistent
Yup, they might be. So what?
We typically respond to a query by sending an unlocked copy of the answer. In other words, it's possible that the actual information in the write model will change after this response is dispatched but before the response arrives at its destination. The client will be looking at a copy of the answer taken from the past.
So we might reasonably ask how much better it is to get information no more than one minute old compared to information no more than five minutes old. If the difference in value is pennies, then you should probably deploy the five minute version. If the difference is millions of dollars, then you're in a good position to negotiate a real budget to solve the problem.
For processing a command in our own write model, that kind of inconsistency isn't usually acceptable or wise. But neither of the two common answers require keeping the read and write models synchronized. The most common answer is to just work with the write model alone. The less common answer is to grab a snapshot out of a cache, and then apply any additional events to it to bring it up to date. The latter approach is "just" a performance optimization (first rule: don't.)
The variation that trips everyone up is trying to process a command somewhere else, enforcing a consistency rule on our data here. Once again, you need a really clear picture of how valuable the consistency is to the business. If it's really important, that may be a signal that the information in question shouldn't be split into two different piles - you may be working with the wrong underlying data model.
Possibly useful references
Pat Helland Data on the Outside Versus Data on the Inside
Udi Dahan Race Conditions Don't Exist

Do Firebase/Firestore Transactions create internal queues?

I'm wondering if transactions (https://firebase.google.com/docs/firestore/manage-data/transactions) are viable tools to use in something like a ticketing system where users maybe be attempting to read/write to the same collection/document and whoever made the request first will be handled first and second will be handled second etc.
If not what would be a good structure for such a need with firestore?

Transactions just guarantee atomic consistent update among the documents involved in the transaction. It doesn't guarantee the order in which those transactions complete, as the transaction handler might get retried in the face of contention.
Since you tagged this question with google-cloud-functions (but didn't mention it in your question), it sounds like you might be considering writing a database trigger to handle incoming writes. Cloud Functions triggers also do not guarantee any ordering when under load.
Ordering of any kind at the scale on which Firestore and other Google Cloud products operate is a really difficult problem to solve (please read that link to get a sense of that). There is not a simple database structure that will impose an order where changes are made. I suggest you think carefully about your need for ordering, and come up with a different solution.
The best indication of order you can get is probably by adding a server timestamp to individual documents, but you will still have to figure out how to process them. The easiest thing might be to have a backend periodically query the collection, ordered by that timestamp, and process things in that order, in batch.

Commit to a log like Kafka + database with ACID properties?

I'm planning in test how make this kind of architecture to work:
http://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
Where all the data is stored as facts in a log, but the validations when posted a change must be against a table. For example, If I send a "Create Invoice with Customer 1" I will need to validate if the customer exist and other stuff, then when the validation pass commit to the log and put the current change to the table, so the table have the most up-to-date information yet I have all the history of the changes.
I could put the logs into the database in a table (I use PostgreSql). However I'm concerned about the scalability of doing that, also, I wish to suscribe to the event stream from multiple clients and PG neither other RDBMS I know let me to do this without polling.
But if I use Kafka I worry about the ACID between both storages, so Kafka could get wrong data that PG rollback or something similar.
So:
1- Is possible to keep consistency between a RDBMS and a log storage OR
2- Is possible to suscribe in real time and tune PG (or other RDBMS) for fast event storage?

Easy(1) answers for provided questions:
Setting up your transaction isolation level properly may be enough to achieve consistency and not worry about DB rollbacks. You still can occasionally create inconsistency, unless you set isolation level to 'serializable'. Even then, you're guaranteed to be consistent, but still could have undesirable behaviors. For example, client creates a customer and puts an invoice in a rapid succession using an async API, and invoice event hits your backed system first. In this case invoice event would be invalidated and a client will need to retry hoping that customer was created by that time. Easy to avoid if you control clients and mandate them to use sync API.
Whether it is possible to store events in a relational DB depends on your anticipated dataset size, hardware and access patterns. I'm a big time Postgres fan and there is a lot you can do to make event lookups blazingly fast. My rule of thumb -- if your operating table size is below 2300-300GB and you have a decent server, Postgres is a way to go. With event sourcing there are typically no joins and a common access pattern is to get all events by id (optionally restricted by time stamp). Postgres excels at this kind of queries, provided you index smartly. However, event subscribers will need to pull this data, so may not be good if you have thousands of subscribers, which is rarely the case in practice.
"Conceptually correct" answer:
If you still want to pursue streaming approach and fundamentally resolve race conditions then you have to provide event ordering guarantees across all events in the system. For example, you need to be able to order 'add customer 1' event and 'create invoice for customer 1' event so that you can guarantee consistency at any time. This is a really hard problem to solve in general for a distributed system (see e.g. vector clocks). You can mitigate it with some clever tricks that would work for your particular case, e.g. in the example above you can partition your events by 'customerId' early as they hit backend, then you can have a guarantee that all event related to the same customer will be processed (roughly) in order they were created.
Would be happy to clarify my points if needed.
(1) Easy vs simple: mandatory link

How to ensure external projections are in sync when using CQRS and EventSourcing?

I'm starting a new application and I want to use cqrs and eventsourcing. I got the idea of replaying events to recreate aggregates and snapshotting to speedup if needed, using in memory models, caching, etc.
My question is regarding large read models I don't want to hold in memory. Suppose I have an application where I sell products, and I want to listen to a stream of events like "ProductRegistered" "ProductSold" and build a table in a relational database that will be used for reporting or integration with another system. Suppose there are lots of records and this table may take from a few seconds to minutes to truncate/rebuild, and the application exports dozens of these projections for multiple purposes.
How does one handle the consistency of the projections in this scenario?
With in-memory data, it's quite simple and fast to replay the events. But I feel that external projections that are kept in disk will be much slower to rebuild.
Should I always start my application with a TRUNCATE TABLE + rebuild for every external projection? This seems impractical to me over time, but I may be worried about a problem I didn't have yet.
Since the table is itself like a snapshot, I could keep a "control table" to tell which event was the last one I handled for that projection, so I can replay only what's needed. But I'm worried about inconsistencies if the application or database crashes. It seems that checking the consistency of the table and rebuilding would be the same, which points to the solution 1 again.
How would you handle that in a way that is maintainable over time? Are there better solutions?
Thank you very much.

One way to handle this is the concept of checkpointing. Essentially either your event stream or your whole system has a version number (checkpoint) that increments with each event.
For each projection, you store the last committed checkpoint that was applied. At startup, you pull events greater than the last checkpoint number that was applied to the projection, and continue building your projection from there. If you need to rebuild your projection, you delete the data AND the checkpoint and rerun the whole stream (or set of streams).
Caution: the last applied checkpoint and the projection's read models need to be persisted in a single transaction to ensure they do not get out of sync.

Too much data duplication in mongodb?

I'm new to this whole NOSQL stuff and have recently been intrigued with mongoDB. I'm creating a new website from scratch and decided to go with MONGODB/NORM (for C#) as my only database. I've been reading up a lot about how to properly design your document model database and I think for the most part I have my design worked out pretty well. I'm about 6 months into my new site and I'm starting to see issues with data duplication/sync that I need to deal with over and over again. From what I read, this is expected in the document model, and for performance it makes sense. I.E. you stick embedded objects into your document so it's fast to read - no joins; but of course you can't always embed, so mongodb has this concept of a DbReference which is basically analogous to a foreign key in relational DBs.
So here's an example: I have Users and Events; both get their own document, Users attend events, Events have users attendees. I decided to embed a list of Events with limited data into the User objects. I embedded a list of Users also into the Event objects as their "attendees". The problem here is now I have to keep the Users in sync with the list of Users that is also embedded in the Event object. As I read it, this seems to be the preferred approach, and the NOSQL way to do things. Retrieval is fast, but the fall-back is when I update the main User document, I need to also go into the Event objects, possibly find all references to that user and update that as well.
So the question I have is, is this a pretty common problem people need to deal with? How much does this problem have to happen before you start saying "maybe the NOSQL strategy doesn't fit what I'm trying to do here"? When does the performance advantage of not having to do joins turn into a disadvantage because you're having a hard time keeping data in sync in embedded objects and doing multiple reads to the DB to do so?

Well that is the trade off with document stores. You can store in a normalized fashion like any standard RDMS, and you should strive for normalization as much as possible. It's only where its a performance hit that you should break normalization and flatten your data structures. The trade off is read efficiency vs update cost.
Mongo has really efficient indexes which can make normalizing easier like a traditional RDMS (most document stores do not give you this for free which is why Mongo is more of a hybrid instead of a pure document store). Using this, you can make a relation collection between users and events. It's analogous to a surrogate table in a tabular data store. Index the event and user fields and it should be pretty quick and will help you normalize your data better.
I like to plot the efficiency of flatting a structure vs keeping it normalized when it comes to the time it takes me to update a records data vs reading out what I need in a query. You can do it in terms of big O notation but you don't have to be that fancy. Just put some numbers down on paper based on a few use cases with different models for the data and get a good gut feeling about how much works is required.
Basically what I do is first try to predict the probability of how many updates a record will have vs. how often it's read. Then I try to predict what the cost of an update is vs. a read when it's both normalized or flattened (or maybe partially combination of the two I can conceive... lots of optimization options). I can then judge the savings of keeping it flat vs. the cost of building up the data from normalized sources. Once I plotted all the variables, if the savings of keeping it flat saves me a bunch, then I will keep it flat.
A few tips:
If you require fast lookups to be quick and atomic (perfectly up to date) you may want a favor a solution where you favor flattening over normalization and taking the hit on the update.
If you require update to be quick, and access immediately then favor normalization.
If you require fast lookups but don't require perfectly up to date data, consider building out your normalized data in batch jobs (using map/reduce possibly).
If your queries need to be fast, and updates are rare, and do not necessarily require your update to be accessible immediately or require transaction level locking that it went through 100% of the time (to guarantee your update was written to disk), you can consider writing your updates to a queue processing them in the background. (In this model, you will probably have to deal with conflict resolution and reconciliation later).
Profile different models. Build out a data query abstraction layer (like an ORM in a way) in your code so you can refactor your data store structure later.
There are lot of other ideas that you can employ. There a lot of great blogs on line that go into it like highscalabilty.org and make sure you understand CAP theorem.
Also consider a caching layer, like Redis or memcache. I will put one of those products in front my data layer. When I query mongo (which is storing everything normalized), I use the data to construct a flattened representation and store it in the cache. When I update the data, I will invalidate any data in the cache that references what I'm updating. (Although you have to take the time it takes to invalidate data and tracking data in the cache that is getting updated into consideration of your scaling factors). Someone once said "The two hardest things in Computer Science are naming things and cache invalidation."

Try adding an IList of type UserEvent property to your User object. You didn't specify much about how your domain model is designed. Check the NoRM group http://groups.google.com/group/norm-mongodb/topics
for examples.