How to ensure external projections are in sync when using CQRS and EventSourcing?

I'm starting a new application and I want to use CQRS and event sourcing. I understand the idea of replaying events to recreate aggregates and of snapshotting to speed things up if needed, using in-memory models, caching, etc.
My question is about large read models that I don't want to hold in memory. Suppose I have an application where I sell products, and I want to listen to a stream of events like "ProductRegistered" and "ProductSold" and build a table in a relational database that will be used for reporting or for integration with another system. Suppose there are lots of records, this table may take anywhere from a few seconds to minutes to truncate/rebuild, and the application maintains dozens of these projections for multiple purposes.
How does one handle the consistency of the projections in this scenario?
With in-memory data, it's quite simple and fast to replay the events. But I feel that external projections that are kept in disk will be much slower to rebuild.
Should I always start my application with a TRUNCATE TABLE + rebuild for every external projection? This seems impractical over time, but maybe I'm worrying about a problem I don't actually have yet.
Since the table is itself a kind of snapshot, I could keep a "control table" recording the last event I handled for each projection, so I can replay only what's needed. But I'm worried about inconsistencies if the application or the database crashes. It seems that checking the table's consistency would cost about as much as rebuilding it, which points back to option 1 again.
How would you handle that in a way that is maintainable over time? Are there better solutions?
Thank you very much.

One way to handle this is the concept of checkpointing. Essentially either your event stream or your whole system has a version number (checkpoint) that increments with each event.
For each projection, you store the last committed checkpoint that was applied. At startup, you pull events greater than the last checkpoint number that was applied to the projection, and continue building your projection from there. If you need to rebuild your projection, you delete the data AND the checkpoint and rerun the whole stream (or set of streams).
Caution: the last applied checkpoint and the projection's read models need to be persisted in a single transaction to ensure they do not get out of sync.
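For illustration, here is a minimal sketch of that pattern, assuming a Postgres-backed read model accessed through psycopg2. The projection_checkpoints table, the events_since generator and apply_to_read_model are made-up names standing in for your own event feed and projection logic:

```python
import psycopg2

def catch_up_projection(conn, projection_name, events_since):
    """Apply new events to a projection and advance its checkpoint in one transaction."""
    with conn:  # psycopg2: commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                "SELECT last_checkpoint FROM projection_checkpoints WHERE name = %s",
                (projection_name,),
            )
            row = cur.fetchone()
            checkpoint = row[0] if row else 0

            for event in events_since(checkpoint):  # only events newer than the checkpoint
                apply_to_read_model(cur, event)     # hypothetical projection handler
                checkpoint = event["checkpoint"]

            cur.execute(
                """INSERT INTO projection_checkpoints (name, last_checkpoint)
                   VALUES (%s, %s)
                   ON CONFLICT (name) DO UPDATE
                   SET last_checkpoint = EXCLUDED.last_checkpoint""",
                (projection_name, checkpoint),
            )
```

Because the read-model rows and the checkpoint land in the same transaction, a crash mid-rebuild leaves the projection at the last committed checkpoint rather than in an unknown state.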

Related

Event Sourcing - How to query inside a command?

We would like to be able to read state inside a command use case.
We could get the state from the event store for a specific aggregate, but what about querying aggregates by field (not id), or performing more complicated queries that are not a good fit for the event store?
The approach we were thinking was to use our read model for those cases as well and not only for query use cases.
This might be inconsistent, so a solution could be to have the latest version of the aggregate stored in both write/read models, in order to be able to tell if the state is correct or stale.
Does this make sense, and if so, when we need to get state by id, should we use the event store or the read model?
If you want the absolute latest state of an event-sourced aggregate, you're going to have to read the latest snapshot (assuming that you are snapshotting) and then replay events since that snapshot from the event store. You can be aggressive about snapshotting (conceivably even saving a snapshot after every command), but you're giving away some write performance to make the read faster.
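As a sketch of that read path (load_snapshot, load_events_after, apply and initial_state are placeholders for your own snapshot store, event store and fold function):

```python
def load_current_state(aggregate_id):
    """Latest snapshot (if any) plus a replay of the events recorded since it."""
    snapshot = load_snapshot(aggregate_id)       # hypothetical; None if never snapshotted
    state = snapshot.state if snapshot else initial_state()
    version = snapshot.version if snapshot else 0
    for event in load_events_after(aggregate_id, version):
        state = apply(state, event)              # pure fold: state + event -> new state
        version = event.version
    return state, version
```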
Updating the read model directly is conceivably possible, though that level of coupling is something that should be considered very carefully. Note also that you will very likely need some sort of two-phase commit to ensure that the read model is only updated when the write model is updated and vice versa. I strongly suggest considering why you're using CQRS/ES in this project, because you are quite possibly undermining that reason by doing this sort of thing.
In general, if you need a query for processing a particular command, it's likely that query will generally be the same, i.e. you don't need free-form query support. In that case, you can often have a read model that's tuned for exactly that query and which only cares about events which could affect that query: often a fairly small subset of the events. The finer-grained the read model, the easier it is to keep in sync (if it ignores 99% of events, for instance, it can't really fall that far behind).
Needing to make complex queries as part of command processing could also be a sign that your aggregate boundaries aren't right and could do with a re-examination.
Does this make sense
Maybe. Let's start with
This might be inconsistent
Yup, they might be. So what?
We typically respond to a query by sending an unlocked copy of the answer. In other words, it's possible that the actual information in the write model will change after this response is dispatched but before the response arrives at its destination. The client will be looking at a copy of the answer taken from the past.
So we might reasonably ask how much better it is to get information no more than one minute old compared to information no more than five minutes old. If the difference in value is pennies, then you should probably deploy the five minute version. If the difference is millions of dollars, then you're in a good position to negotiate a real budget to solve the problem.
For processing a command in our own write model, that kind of inconsistency isn't usually acceptable or wise. But neither of the two common answers requires keeping the read and write models synchronized. The most common answer is to just work with the write model alone. The less common answer is to grab a snapshot out of a cache, and then apply any additional events to it to bring it up to date. The latter approach is "just" a performance optimization (first rule: don't.)
The variation that trips everyone up is trying to process a command somewhere else, enforcing a consistency rule on our data here. Once again, you need a really clear picture of how valuable the consistency is to the business. If it's really important, that may be a signal that the information in question shouldn't be split into two different piles - you may be working with the wrong underlying data model.
Possibly useful references:
Pat Helland, Data on the Outside Versus Data on the Inside
Udi Dahan, Race Conditions Don't Exist

Do Firebase/Firestore Transactions create internal queues?

I'm wondering if transactions (https://firebase.google.com/docs/firestore/manage-data/transactions) are a viable tool for something like a ticketing system, where users may be attempting to read/write to the same collection/document and whoever made the request first should be handled first, the second request handled second, and so on.
If not, what would be a good structure for such a need with Firestore?
Transactions just guarantee an atomic, consistent update among the documents involved in the transaction. They don't guarantee the order in which those transactions complete, as the transaction handler might get retried in the face of contention.
Since you tagged this question with google-cloud-functions (but didn't mention it in your question), it sounds like you might be considering writing a database trigger to handle incoming writes. Cloud Functions triggers also do not guarantee any ordering when under load.
Ordering of any kind at the scale on which Firestore and other Google Cloud products operate is a really difficult problem to solve (please read that link to get a sense of that). There is no simple database structure that will impose an order on changes as they are made. I suggest you think carefully about your need for ordering, and come up with a different solution.
The best indication of order you can get is probably by adding a server timestamp to individual documents, but you will still have to figure out how to process them. The easiest thing might be to have a backend periodically query the collection, ordered by that timestamp, and process things in that order, in batch.
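A rough sketch of that idea with the Python client (google-cloud-firestore); the "tickets" collection, the field names and handle_ticket are assumptions, not anything Firestore prescribes:

```python
from google.cloud import firestore

db = firestore.Client()

def record_request(user_id, ticket_type):
    # Each request gets a server-assigned timestamp rather than trusting client clocks.
    db.collection("tickets").document().set({
        "user": user_id,
        "type": ticket_type,
        "requested_at": firestore.SERVER_TIMESTAMP,
    })

def process_batch(limit=50):
    # Periodic backend pass: approximate FIFO by the server-assigned timestamp.
    for doc in db.collection("tickets").order_by("requested_at").limit(limit).stream():
        handle_ticket(doc.id, doc.to_dict())     # hypothetical processing function
```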

Aggregate design with EventSourcing and large number of events

I'd like to start my adventure with EventSourcing. As a playground I have a system that gathers data from a set of Sensors organized into Arrays. Each Sensor has a single value, like a temperature. What I need from this system is:
- to get the current value of Sensor readings
- to keep the last month of Sensor value history
- to calculate the Array "status" when a Sensor value changes, and store it (also for a month)
- to allow the Array "status" to be corrected manually by the user
The number of Arrays and Sensors is growing, and for each Array I get many readings per second.
My first idea was to have the Array as an Aggregate with the Sensors as its entities. In that case each Sensor reading update would bump the Array Aggregate's version, which gives > 10M changes per month. In this design I can't cut off old events, and I dread to think how long restoring the ReadModels would take after a year of data.
Alternatively, I think I could store the current state in a CRUD table and remove the current Sensor data from the Array, keeping just the Sensor definitions. A service would then handle the Sensor data stream, check the Array "status", and keep the Array "status" as a separate Aggregate. The service would emit a "Sensor data update" event, which would trigger the ReadModel that keeps the historical data and enforces the 1-month constraint. I would not pollute the event store with Sensor reading events, and in the case of the Array "status" I would be able to remove whole past "status" Aggregates from the event store. The Arrays would keep only Sensor definitions, so the EventStore would stay relatively small.
The downside is that I lose the complete history: I can't restore my 1-month signal history ReadModel from the event store, and I would have to pay additional attention not to break it.
The goal is to learn how to scale an EventSourcing / CQRS system: how to handle a large EventStore and rebuild damaged (or populate new) ReadModels within hours, not days.
Does this idea fit into ES / CQRS?
(EDIT: is it OK to update a ReadModel from an event stream that doesn't come from an Aggregate?)
How do I handle the issues of a growing event store and of fixing broken ReadModels?
Thanks!
Does this idea fit into ES / CQRS?
One of the things that you need to be really careful about, is understanding which information is under the control of your domain model, and which belongs to something outside.
If your sensors are physical devices in the real world, broadcasting readings, then your domain model is not the authority. That sensor data is probably going to be read, validated (i.e. no corruption of the messages in transit) and stored. In other words, the sensor measurements are events (past), not commands (imperative). Throw them into a convenient data store.
With that in mind, you need to look carefully at whether your arrays are domain entities (reading in sensor data, and making interesting decisions) or projections (a reorganization of the streams of sensor measurements).
It may be useful to review When to avoid CQRS, by Udi Dahan. One of the things he talks about there is that, when done right, aggregates look like processes.
In short, make sure that you are applying the right tools to your problem.
That said, yes -- if you have enough events that folding them into a projection isn't easy, then it is hard. You have to look at how much budget you have to solve the problem, and start digging into more I/O efficient representations of your events, more memory efficient representations of your events, batching, etc. Trying to find different ways to partition the work among different cores.
LMAX did a pretty good job documenting the lessons they learned in processing high volume message streams; search for information about their architecture.
Aggregates with lots of events
An Aggregate is a write-side concept (the C in CQRS). An Aggregate receives a command and, using its state, emits events into the event store. The Aggregate's state is built from the events in the event store, so if there are a lot of events for a given aggregate, it takes time to build that state.
In order to speed up building an aggregate's state, CQRS/ES frameworks use snapshots - a serialized aggregate state stored for a particular aggregate version, so you build the state not from the beginning of time but from the latest snapshot. You can store snapshots for, say, every 100 events. And don't forget to rebuild them if your projection function changes.
Frameworks such as reSolve do this for you transparently.
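For a sense of what that plumbing looks like when done by hand, here is a rough sketch; SNAPSHOT_INTERVAL and the load/append/save functions are assumptions standing in for whatever your event store actually exposes:

```python
SNAPSHOT_INTERVAL = 100  # e.g. snapshot roughly every 100 events

def handle_command(aggregate_id, command):
    # Rebuild state from the latest snapshot plus the event tail.
    snapshot = load_latest_snapshot(aggregate_id)           # hypothetical snapshot store
    state = snapshot.state if snapshot else initial_state()
    base_version = version = snapshot.version if snapshot else 0

    for event in load_events_after(aggregate_id, version):  # hypothetical event store read
        state = apply(state, event)                         # pure fold: state + event -> state
        version += 1

    # Business logic: decide which new events to emit, then append them.
    new_events = decide(state, command)
    append_events(aggregate_id, expected_version=version, events=new_events)
    for event in new_events:
        state = apply(state, event)
        version += 1

    # Write a fresh snapshot once enough events have accumulated since the last one.
    if version - base_version >= SNAPSHOT_INTERVAL:
        save_snapshot(aggregate_id, version, state)
```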
Your scenario
In your particular case it seems to me that your business logic is trivial, meaning you don't need aggregate state to calculate anything or to make a decision - there is no business logic, you essentially just store events as they are generated by the sensors. So in your custom framework you can avoid building aggregate state on the write side - just store events as the sensor data comes in.
On the read side you would consume the event stream as usual: upon receiving an event, you store it in the Read Model database with the necessary categorization or time slots.
If you don't need old data in the ReadModel, you can simply skip old events during a rebuild - that should be very fast.
If you don't want to store old events in the event store, you can delete them, but then it would not really be event sourcing anymore.
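To make the read side concrete, a projector for the 1-month history could look something like this sketch (the event shape, the read-model API call and the hourly slot size are all assumptions):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=31)   # the 1-month constraint from the question

def project_sensor_reading(read_db, event):
    """Store one sensor reading into the read model, bucketed by hourly time slot."""
    if event["timestamp"] < datetime.now(timezone.utc) - RETENTION:
        return  # older than the retention window: skipped entirely during a rebuild
    slot = event["timestamp"].replace(minute=0, second=0, microsecond=0)
    read_db.upsert_reading(          # hypothetical read-model store call
        sensor_id=event["sensor_id"],
        slot=slot,
        value=event["value"],
    )
```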

Commit to a log like Kafka + database with ACID properties?

I'm planning to test how to make this kind of architecture work:
http://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
Where all the data is stored as facts in a log, but validation of a posted change must be done against a table. For example, if I send "Create Invoice for Customer 1", I need to validate that the customer exists (and other things); when the validation passes, I commit to the log and apply the change to the table, so the table has the most up-to-date information while I still have the full history of changes.
I could put the log into a table in the database (I use PostgreSQL). However, I'm concerned about the scalability of doing that; also, I want to subscribe to the event stream from multiple clients, and neither PG nor any other RDBMS I know lets me do this without polling.
But if I use Kafka I worry about ACID between the two storages - Kafka could end up with data that PG rolled back, or something similar.
So:
1- Is it possible to keep consistency between an RDBMS and a log store? OR
2- Is it possible to subscribe in real time and tune PG (or another RDBMS) for fast event storage?
Easy(1) answers to the questions as asked:
Setting your transaction isolation level properly may be enough to achieve consistency and stop worrying about DB rollbacks. You can still occasionally create inconsistency unless you set the isolation level to 'serializable'. Even then you're guaranteed to be consistent, but you could still see undesirable behavior. For example, a client creates a customer and posts an invoice in rapid succession using an async API, and the invoice event hits your backend system first. In this case the invoice event would be invalidated, and the client would need to retry, hoping that the customer has been created by then. This is easy to avoid if you control the clients and mandate that they use a sync API.
Whether it is possible to store events in a relational DB depends on your anticipated dataset size, hardware, and access patterns. I'm a big-time Postgres fan, and there is a lot you can do to make event lookups blazingly fast. My rule of thumb: if your operating table size is below roughly 200-300 GB and you have a decent server, Postgres is the way to go. With event sourcing there are typically no joins, and a common access pattern is to get all events by id (optionally restricted by timestamp). Postgres excels at this kind of query, provided you index smartly. However, event subscribers will need to pull this data, so it may not be a good fit if you have thousands of subscribers - which is rarely the case in practice.
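To make "index smartly" concrete, an illustrative Postgres layout might look like the following (column names are assumptions; SQL is kept as Python constants for use with any driver, and the composite primary key doubles as the index for the "all events for an aggregate, in order" lookup):

```python
EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS events (
    aggregate_id uuid        NOT NULL,
    version      bigint      NOT NULL,
    occurred_at  timestamptz NOT NULL DEFAULT now(),
    type         text        NOT NULL,
    payload      jsonb       NOT NULL,
    PRIMARY KEY (aggregate_id, version)
);
"""

LOAD_EVENTS_SQL = """
SELECT type, payload
FROM events
WHERE aggregate_id = %s
  AND occurred_at >= %s
ORDER BY version;
"""
```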
"Conceptually correct" answer:
If you still want to pursue the streaming approach and fundamentally resolve race conditions, then you have to provide event ordering guarantees across all events in the system. For example, you need to be able to order the 'add customer 1' event and the 'create invoice for customer 1' event so that you can guarantee consistency at any time. This is a really hard problem to solve in general for a distributed system (see e.g. vector clocks). You can mitigate it with clever tricks that work for your particular case; e.g. in the example above you can partition your events by 'customerId' as early as they hit the backend, and then you have a guarantee that all events related to the same customer will be processed (roughly) in the order they were created.
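A sketch of that partitioning with the confluent_kafka client (the topic name and event shape are assumptions): keying each message by the customer id makes Kafka route all of that customer's events to the same partition, which preserves their relative order.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_event(event):
    producer.produce(
        "invoice-events",                        # illustrative topic name
        key=str(event["customerId"]),            # same key -> same partition -> per-customer order
        value=json.dumps(event).encode("utf-8"),
    )

def flush():
    producer.flush()                             # block until outstanding messages are delivered
```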
Would be happy to clarify my points if needed.
(1) Easy vs simple: mandatory link

Keeping a snapshot of the most current version of each aggregate in an event store

We're currently using an SQL-backed Event Store (the typical 2-table implementation), and some people on the team are afraid that even though we're using the Event Store only for writes, things may get a bit slow. So a suggestion was made that, instead of adding snapshots here and there, we maintain a fully-consistent (with the event streams) snapshot of each aggregate in its most recent state (in JSON format). All querying on the system will end up being done on the read side, with a typical SQL database that is updated in an eventually consistent fashion from the ES (write) side.
Having such a system in place would allow us to enjoy the benefits of having an Event Store while removing any possible performance issues altogether. We are currently not making use of any "time-travelling" feature, although sooner or later that will end up being the case.
Is this a good approach? There's something about it that leaves me uncomfortable. For instance, if we ever need some sort of time-travelling feature, not having snapshots here and there in each aggregate's event stream will prove a performance disaster. Of course, we could have both a most-current snapshot per aggregate instance and snapshots throughout the event streams.
If we decide to go down this route, should we make the snapshot update for a given aggregate transactional with the event updates on that same aggregate, or should we just update the events and update the snapshot in an eventually consistent manner?
What are the downsides of this approach? Has anyone tried something of the kind?
You should probably run your own benchmarks before adding unnecessary complexity to your system. We have noticed performance problems when thousands of events need to be queried and applied to rebuild an aggregate from the event stream, with JSON-to-object deserialization being the biggest bottleneck. If each of your aggregates has only a few events (say, < 100) you probably won't notice any significant difference in practice.
Most event stores record snapshots every n events/commits, say every 50-100 events, and on rebuild they query the latest snapshot and apply the missing events since it. If you also keep all the old snapshots in your snapshot database, the time-travelling feature will be as fast as a normal query, and you'll only need slightly more storage space, which is cheap nowadays.
Snapshots should always be written outside the original transaction (and can be generated on another thread), since it's not crucial if the latest snapshot is missing, but you don't want your business transaction to fail due to errors in the snapshot write.
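One simple way to keep that write out of band is sketched below (illustrative only; append_events and save_snapshot stand in for your own event and snapshot stores):

```python
import logging
from concurrent.futures import ThreadPoolExecutor

snapshot_pool = ThreadPoolExecutor(max_workers=1)

def commit(aggregate_id, expected_version, new_events, new_state):
    # The business transaction: only the event append can fail the command.
    append_events(aggregate_id, expected_version, new_events)
    # Snapshot write happens on another thread; its failure is logged, not propagated.
    snapshot_pool.submit(_write_snapshot_safely, aggregate_id,
                         expected_version + len(new_events), new_state)

def _write_snapshot_safely(aggregate_id, version, state):
    try:
        save_snapshot(aggregate_id, version, state)   # hypothetical snapshot store
    except Exception:
        logging.exception("Snapshot write failed; the next snapshot will catch up")
```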
Depending on your usual system uptime and data size, it might make sense to hold snapshots in memory, in a distributed cache/grid, or in another (non-SQL) database.