Aggregate design with EventSourcing and large number of events - cqrs

I'd like to start adventure with EventSourcing. As a playground I have a system that gathers data from set of Sensors organized in Arrays. Each Sensor have a single value like temperature. What I need from this system is
to get the current value of Sensors readings
last month of Sensor value history
when Sensor value changes, I have to calculate Array "status" and
store it (also for a month)
Array "status" can be corrected manually by the user
Number of Arrays and Sensors is growing. For each Array I have many readings per second.
Now I wanted to have the Array as an Aggregate with Sensors as it's entities. In this case each Sensor reading update would upgrade Array Aggregate version. That gives > 10M of changes for a month. In this design I can't cut off not old events. I can't think about time required for restoring ReadModels after a year of data.
I think I could store current state as CRUD table and remove Sensor current data from Array. Keep just definition. Then I can use service that will handle the Sensor data stream, check Array "status" and keep Array "status" as separate Aggregate. Service would emit "Sensor data update" event. This event would trigger ReadModel keeping historical data handling 1 month constraint. I will not pollute event store with Sensor readings events. In case of Array "status" I will be able to remove whole past "status" Aggregates from the event store. Arrays would keep only Sensor definitions, so EventStore would be relatively small.
I loose complete history. I can't restore my 1 month signal history ReadModel. I would have to pay additional attention not to break it.
The goal is to learn how to scale EventSourcing / CQRS system. How to handle large EventStore and rebuild damaged or inflate new ReadModels within hours not days.
Does this idea fits into ES / CQRS?
(EDIT: is it OK to update RM with event stream not from an Aggregates?)
How to handle issues with growing event store and fixing broken ReadModels?
Thanks!

Does this idea fits into ES / CQRS?
One of the things that you need to be really careful about, is understanding which information is under the control of your domain model, and which belongs to something outside.
If your sensors are physical devices in the real world, broadcasting readings, then your domain model is not the authority. That sensor data is probably going to be read, validated (ie: no corruption to the messages in transit) and stored. In other words, the sensor measurements are events (past), not commands (imperative). Throw them into a convenient data store.
With that in mind, you need to look carefully at whether your arrays are domain entities (reading in sensor data, and making interesting decisions) or projections (a reorganization of the streams of sensor measurements).
It may be useful to review When to avoid CQRS, by Udi Dahan. One of the things he talks about there is that, when done right, aggregates look like processes.
In short, make sure that you are applying the right tools to your problem.
That said, yes -- if you have enough events that folding them into a projection isn't easy, then it is hard. You have to look at how much budget you have to solve the problem, and start digging into more I/O efficient representations of your events, more memory efficient representations of your events, batching, etc. Trying to find different ways to partition the work among different cores.
LMAX did a pretty good job documenting the lessons they learned in processing high volume message streams; search for information about their architecture.

Aggregates with lots of events
Aggregate is a term for the write side (C in CQRS). Aggregate receives a command, and using its state emits events into event store. Aggregate state is built using events from the event store. So if there are lot of events for the given aggregate, it takes time to build the state.
In order to speed up building a state for an aggregate, CQRS/ES frameworks are using snapshots - this is a serialized aggregate state that is stored for particular aggregate version, so you are building the state not from the beginning of time, but from the latest snapshot. You can store snapshots for, say, every 100 events. And don't forget to rebuild them if your projection function changed.
Frameworks such as reSolve are doing this for you transparently.
Your scenario
In your particular case it seems to me that your business logic is trivial, meaning you don't need an aggregate state to calculate anything or to make a decision - there is no business logic, you essentially just store events as they being generated by sensor. So in your custom framework you can just avoid building an aggregate state at write side - just store events as sensor data coming in.
At the read side you would use event stream as usual - upon receiving an event you can store it into Read Model database with necessary categorization or time slots.
If you don't need old data in the ReadModel - you may just skip old events during rebuilding - it should be very fast.
If you don't want to store old event in the event store - you can delete them, but this would not be real event sourcing anymore.

Related

Event Sourcing - How to query inside a command?

We would like to be able to read state inside a command use case.
We could get the state from event store for the specific aggregate, but what about querying aggregates by field(not id) or performing more complicated queries, that are not fitted for the event store?
The approach we were thinking was to use our read model for those cases as well and not only for query use cases.
This might be inconsistent, so a solution could be to have the latest version of the aggregate stored in both write/read models, in order to be able to tell if the state is correct or stale.
Does this make sense and if yes, if we need to get state by Id should we use event store or the read model?
If you want the absolute latest state of an event-sourced aggregate, you're going to have to read the latest snapshot (assuming that you are snapshotting) and then replay events since that snapshot from the event store. You can be aggressive about snapshotting (conceivably even saving a snapshot after every command), but you're giving away some write performance to make the read faster.
Updating the read model directly is conceivably possible, though that level of coupling is something that should be considered very carefully. Note also that you will very likely need some sort of two-phase commit to ensure that the read model is only updated when the write model is updated and vice versa. I strongly suggest considering why you're using CQRS/ES in this project, because you are quite possibly undermining that reason by doing this sort of thing.
In general, if you need a query for processing a particular command, it's likely that query will generally be the same, i.e. you don't need free-form query support. In that case, you can often have a read model that's tuned for exactly that query and which only cares about events which could affect that query: often a fairly small subset of the events. The finer-grained the read model, the easier it is to keep in sync (if it ignores 99% of events, for instance, it can't really fall that far behind).
Needing to make complex queries as part of command processing could also be a sign that your aggregate boundaries aren't right and could do with a re-examination.
Does this make sense
Maybe. Let's start with
This might be inconsistent
Yup, they might be. So what?
We typically respond to a query by sending an unlocked copy of the answer. In other words, it's possible that the actual information in the write model will change after this response is dispatched but before the response arrives at its destination. The client will be looking at a copy of the answer taken from the past.
So we might reasonably ask how much better it is to get information no more than one minute old compared to information no more than five minutes old. If the difference in value is pennies, then you should probably deploy the five minute version. If the difference is millions of dollars, then you're in a good position to negotiate a real budget to solve the problem.
For processing a command in our own write model, that kind of inconsistency isn't usually acceptable or wise. But neither of the two common answers require keeping the read and write models synchronized. The most common answer is to just work with the write model alone. The less common answer is to grab a snapshot out of a cache, and then apply any additional events to it to bring it up to date. The latter approach is "just" a performance optimization (first rule: don't.)
The variation that trips everyone up is trying to process a command somewhere else, enforcing a consistency rule on our data here. Once again, you need a really clear picture of how valuable the consistency is to the business. If it's really important, that may be a signal that the information in question shouldn't be split into two different piles - you may be working with the wrong underlying data model.
Possibly useful references
Pat Helland Data on the Outside Versus Data on the Inside
Udi Dahan Race Conditions Don't Exist

Understanding CQRS and EventSourcing

I read several blogs and watched video about usefulness of CQRS and ES. I am left with implementation confusion.
CQRS: when use separate table, one for "Write, Update and delete" and other for Read operation. So then how the data sync from write table to read table. Do we required to use cron job to sync data to read only table from write table or any other available options ?
Event Sourcing: Do we store only all Immutable sequential operation as record for each update happened upon once created in one storage. Or do we also store mutable record I mean the same record is updated in another storage
And Please explain RDBMS, NoSQL and Messaging to be used and where they fit into it
when use separate table, one for "Write, Update and delete" and other for Read operation. So then how the data sync from write table to read table.
You design an asynchronous process that understands how to transform the data from its "write" representation to its "read" representation, and you design a scheduler to decide when that asynchronous process runs.
Part of the point is that it's just plumbing, and you can choose whatever plumbing you want that satisfies your operational needs.
Event Sourcing
On the happy path, each "event stream" is a append only sequence of immutable events. In the case where you are enforcing a domain invariant over the contents of the stream, you'll normally have a "first writer wins" conflict policy.
But "the" stream is the authoritative copy of the events. There may also be non-authoritative copies (for instance, events published to a message bus). They are typically all immutable.
In some domains, where you have to worry about privacy and "the right to be forgotten", you may need affordances that allow you to remove information from a previously stored event. Depending on your design choices, you may need mutable events there.
RDBMS
For many sorts of queries, especially those which span multiple event streams, being able to describe the desired results in terms of relations makes the programming task much easier. So a common design is to have asynchronous process that read information from the event streams and update the RDBMS. The usual derived benefit is that you get low latency queries (but the data returned by those queries may be stale).
RDBMS can also be used as the core of the design of the event store / message store itself. Events are common written as blob data, with interesting metadata exposed as additional columns. The message store used by eventide-project is based on postgresql.
NoSQL
Again, can potentially be used as your cache of readable views, or as your message store, depending on your needs. Event Store would be an example of a NoSQL message store.
Messaging
Messaging is a pattern for temporal decoupling; the ability to store/retrieve messages in a stable central area affords the ability to shut down a message producer without blocking the message consumer, and vice versa. Message stores also afford some abstraction - the producer of a message doesn't necessarily know all of the consumers, and the consumer doesn't necessarily know all of the producers.
My Question is about Event Sourcing. Do we required only immutable sequence events to be stored and where to be stored ?
In event sourcing, the authoritative representation of the state is the sequence of events - your durable copy of that event sequence is the book of truth.
As for where they go? Well, that is going to depend on your architecture and storage choices. You could manage files on disk yourself, you could write them in to your own RDBMS; you could use an RDBMS designed by somebody else, you could use a NoSQL document store, you could use a dedicated message store.
There could be multiple stores -- for instance, in a micro service architecture, the service that accepts orders might be different from the service that tracks order fulfillment, and they could each be writing events into different storage appliances.

Keeping a snapshot of the most current version of each aggregate in an event store

We're currently using an SQL-backed Event Store (the typical 2-table implementation) and some people in the team are afraid that even though we're using the Event Store only for writes, things may get a bit slower, so a suggestion was put in place to instead of adding snapshots here and there, to actually maintain a fully-consistent (with the event streams) snapshot of each aggregate in its most recent state (in JSON format). All the querying on the system will end up being done on the read-side, with a typical SQL database that is updated in an eventual consistency fashion from the ES (write) side.
Having such a system in place would allow us to enjoy the benefits of having an Event Store while simultaneously removing any possible performance issues altogether. We are currently not making use of any "time-travelling" feature, although sooner of later that will end up being the case.
Is this a good approach? There's something in it leaving my uncomfortable. For instance, if we need some sort of time-travelling feature, not having snapshots here and there in each aggregate's event-stream will prove a performance disaster. Of course we could have both a most-current-snapshot per aggregate instance and also snapshots throughout the event-streams.
In case we decide to go down this route, should we make the snapshot update for a given aggregate transactional to the events updates on that same aggregate, or should we just update the events and in an eventually-consistent manner update the snapshot?
What are the downsides of this approach? Has anyone tried something of the kind?
You should probably run your own benchmarks before adding unnecessary complexity to your system. We have noticed some performance problems when thousands of events need to be queried and applied to rebuild an aggregate from the event stream, where JSON to object deserialization was the biggest performance bottleneck. If each of your aggregates has only few events (say, < 100) you probably won't notice any significant differences in practice.
Most event stores record snapshots every n events/commits, say every 50-100 events, and on assembly query the latest snapshots and apply the missing events since the last snapshot. If you also keep all old snapshots in your snapshot database, the time traveling feature will be as fast as a usual query, and you'll only need slightly more persistence space, which is cheap nowadays.
The snapshots should always be written out of the original transaction (and can be generated in another thread), since it's non-crucial if the last snapshot is missing, but you want to don't want your business transaction to fail due to errors in the snapshot write transaction.
Depending on your usual system uptime and data size, it might make sense to held snapshots in memory or a distributed cache/graid or in another database (non-SQL).

How to ensure external projections are in sync when using CQRS and EventSourcing?

I'm starting a new application and I want to use cqrs and eventsourcing. I got the idea of replaying events to recreate aggregates and snapshotting to speedup if needed, using in memory models, caching, etc.
My question is regarding large read models I don't want to hold in memory. Suppose I have an application where I sell products, and I want to listen to a stream of events like "ProductRegistered" "ProductSold" and build a table in a relational database that will be used for reporting or integration with another system. Suppose there are lots of records and this table may take from a few seconds to minutes to truncate/rebuild, and the application exports dozens of these projections for multiple purposes.
How does one handle the consistency of the projections in this scenario?
With in-memory data, it's quite simple and fast to replay the events. But I feel that external projections that are kept in disk will be much slower to rebuild.
Should I always start my application with a TRUNCATE TABLE + rebuild for every external projection? This seems impractical to me over time, but I may be worried about a problem I didn't have yet.
Since the table is itself like a snapshot, I could keep a "control table" to tell which event was the last one I handled for that projection, so I can replay only what's needed. But I'm worried about inconsistencies if the application or database crashes. It seems that checking the consistency of the table and rebuilding would be the same, which points to the solution 1 again.
How would you handle that in a way that is maintainable over time? Are there better solutions?
Thank you very much.
One way to handle this is the concept of checkpointing. Essentially either your event stream or your whole system has a version number (checkpoint) that increments with each event.
For each projection, you store the last committed checkpoint that was applied. At startup, you pull events greater than the last checkpoint number that was applied to the projection, and continue building your projection from there. If you need to rebuild your projection, you delete the data AND the checkpoint and rerun the whole stream (or set of streams).
Caution: the last applied checkpoint and the projection's read models need to be persisted in a single transaction to ensure they do not get out of sync.

Creating a snapshot in a distributed architecture

I'm thinking about the problem in question title: if I have to query for an aggregate in a distributed architecture where the distributed event store can eventually be waiting for last events to be distributed.. How can I know if the aggregate i'm reading via read model is not being replaced by the updated one in another server of the network?
I have an http server that receive events to save on the store. Store not exists actually but I want implement it soon.
Events regards huge aggregate that serialized in json format takes 4MB
Another sub-question is what storage do you recommend for the snapshot?
EDIT
I don't understand if the question is not written well or if I have selected wrong tags...
The ability to know when the "last" event in the distributed store is processed depends on two things:
Can you define "last"?
Does the distributed storage engine expose it to you?
The CAP theorem is a good reference to the sort of problems you are going to have with both of those in a distributed data store; in general, unless you give up availability you are not going to be able to have the properties needed to get what you want.
On the other hand, if you can define last in a meaningful way, you can still have what you want. For example: do your events expire after a while? If, for example, they expire after 12 hours, you know that you can always meaningfully define last as "the moment in time 12 hours ago", because any unprocessed event older than that is obsolete...
To answer your sub-question, I strongly recommend a storage engine that you do not write yourself, because distributed data storage is an awesomely hard problems that many very smart people, working for companies doing nothing but solving problems in this space, are doing for you.
Leverage their work instead.