Understanding CQRS and Event Sourcing

I have read several blogs and watched videos about the usefulness of CQRS and ES, but I am left confused about the implementation.
CQRS: when should we use separate tables, one for write/update/delete operations and another for reads? And how does the data get from the write table to the read table — do we need a cron job to sync the read-only table from the write table, or are there other options?
Event Sourcing: do we store only the immutable, sequential records of every operation that happened after creation, in one store? Or do we also store a mutable record, i.e. the same record updated in place in another store?
Also, please explain where RDBMS, NoSQL and messaging fit into this.

When should we use separate tables, one for write/update/delete and another for reads? And how does the data sync from the write table to the read table?
You design an asynchronous process that understands how to transform the data from its "write" representation to its "read" representation, and you design a scheduler to decide when that asynchronous process runs.
Part of the point is that it's just plumbing, and you can choose whatever plumbing you want that satisfies your operational needs.
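For illustration, a minimal Python sketch of that plumbing, with sqlite3 standing in for the real database and hypothetical orders_write / orders_read tables; the "scheduler" here is just a polling loop, but a cron job or a message bus subscription would serve the same purpose:

```
import sqlite3
import time

def sync_read_model(conn, last_seen_id):
    """Copy rows written after last_seen_id into the read-side table."""
    rows = conn.execute(
        "SELECT id, order_id, status FROM orders_write WHERE id > ?",
        (last_seen_id,),
    ).fetchall()
    for row_id, order_id, status in rows:
        # Transform the "write" representation into the "read" representation.
        conn.execute(
            "INSERT OR REPLACE INTO orders_read (order_id, status) VALUES (?, ?)",
            (order_id, status),
        )
        last_seen_id = row_id
    conn.commit()
    return last_seen_id

def run_scheduler(conn, interval_seconds=5):
    """The 'scheduler': a polling loop standing in for cron or a bus subscription."""
    last_seen_id = 0
    while True:
        last_seen_id = sync_read_model(conn, last_seen_id)
        time.sleep(interval_seconds)
```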
Event Sourcing
On the happy path, each "event stream" is an append-only sequence of immutable events. In the case where you are enforcing a domain invariant over the contents of the stream, you'll normally have a "first writer wins" conflict policy.
But "the" stream is the authoritative copy of the events. There may also be non-authoritative copies (for instance, events published to a message bus). They are typically all immutable.
In some domains, where you have to worry about privacy and "the right to be forgotten", you may need affordances that allow you to remove information from a previously stored event. Depending on your design choices, you may need mutable events there.
RDBMS
For many sorts of queries, especially those which span multiple event streams, being able to describe the desired results in terms of relations makes the programming task much easier. So a common design is to have asynchronous processes that read information from the event streams and update the RDBMS. The usual derived benefit is that you get low-latency queries (but the data returned by those queries may be stale).
An RDBMS can also be used as the core of the design of the event store / message store itself. Events are commonly written as blob data, with interesting metadata exposed as additional columns. The message store used by eventide-project is based on PostgreSQL.
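A rough sketch of such a table, using sqlite3 only to keep the example self-contained; the column names are illustrative, not taken from eventide or any particular message store:

```
import sqlite3

conn = sqlite3.connect("event_store.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        global_position INTEGER PRIMARY KEY AUTOINCREMENT,
        stream_name     TEXT    NOT NULL,
        stream_position INTEGER NOT NULL,   -- position within the stream
        event_type      TEXT    NOT NULL,   -- metadata useful for queries
        data            BLOB    NOT NULL,   -- the event payload itself
        recorded_at     TEXT    DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (stream_name, stream_position)  -- supports "first writer wins"
    )
""")
conn.commit()
```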
NoSQL
Again, can potentially be used as your cache of readable views, or as your message store, depending on your needs. Event Store would be an example of a NoSQL message store.
Messaging
Messaging is a pattern for temporal decoupling; the ability to store/retrieve messages in a stable central area affords the ability to shut down a message producer without blocking the message consumer, and vice versa. Message stores also afford some abstraction - the producer of a message doesn't necessarily know all of the consumers, and the consumer doesn't necessarily know all of the producers.
My question is about Event Sourcing: do we store only the immutable sequence of events, and where should they be stored?
In event sourcing, the authoritative representation of the state is the sequence of events - your durable copy of that event sequence is the book of truth.
As for where they go? Well, that is going to depend on your architecture and storage choices. You could manage files on disk yourself, you could write them into your own RDBMS; you could use an RDBMS designed by somebody else, you could use a NoSQL document store, you could use a dedicated message store.
There could be multiple stores -- for instance, in a micro service architecture, the service that accepts orders might be different from the service that tracks order fulfillment, and they could each be writing events into different storage appliances.
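Building on the illustrative sqlite3 table sketched earlier, an append could look like this; the UNIQUE(stream_name, stream_position) constraint provides the "first writer wins" behavior mentioned above (names remain illustrative):

```
import json
import sqlite3

def append_event(conn, stream_name, expected_position, event_type, payload):
    """Append one immutable event at the next position in the stream."""
    try:
        conn.execute(
            "INSERT INTO events (stream_name, stream_position, event_type, data) "
            "VALUES (?, ?, ?, ?)",
            (stream_name, expected_position + 1, event_type,
             json.dumps(payload).encode("utf-8")),
        )
        conn.commit()
    except sqlite3.IntegrityError:
        # Somebody else appended at that position first: first writer wins.
        raise RuntimeError(f"concurrent write detected on {stream_name}")
```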

Related

Query vs Transaction

In this picture, we can see that the saga is what implements transactions and CQRS implements queries. As far as I know, a transaction is a set of queries that follows the ACID properties.
So can I consider CQRS as an advanced version of saga which increases the speed of reads?
Note that this diagram is not meant to explain what Sagas and CQRS are. In fact, looking at it this way is quite confusing. What this diagram is telling you is which patterns you can use to read and write data that spans multiple microservices. It is saying that in order to write data (somehow transactionally) across multiple microservices you can use Sagas, and in order to read data which belongs to multiple microservices you can use CQRS. But that doesn't mean that Sagas and CQRS have anything in common. They are two different patterns to solve completely different problems (reads and writes). To make an analogy, it's like saying that to make pizzas (Write) you can use an oven and to view the pizza menu (Read) you can use a tablet.
On the specific patterns:
Sagas: you can see them as process managers or state machines. Note that they do not implement transactions in the RDBMS sense. Basically, they allow you to create a process that will take care of telling each microservice to do a write operation, and if one of the operations fails, it will take care of telling the other microservices to roll back (or compensate) the action they did. So these "transactions" won't be atomic, because while the process is running some microservices will have already modified the data and others won't have. And it is not guaranteed that whatever has succeeded can successfully be rolled back or compensated. A rough sketch of such a process appears after these two points.
CQRS (Command Query Responsibility Segregation): suggests the separation of Commands (writes) and Queries (reads). The reason for that is, as I was saying before, that reads and writes are two very different operations. Therefore, by separating them, you can implement each with the patterns that better fit its scenario. The reason why CQRS is shown in your diagram as a solution for reading data that comes from multiple microservices is that one way of implementing queries is to listen to Domain Events coming from multiple microservices and store the information in a single database, so that when it's time to query the data, you can find it all in a single place. An alternative to this would be Data Composition, which would mean that when the query arrives, you submit queries to multiple microservices at that moment and compose the response from their responses.
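Picking up the Sagas point above, here is a minimal Python sketch of that flow: a process that asks each service to do its write and, on failure, tells the services that already succeeded to compensate. The service objects and their methods are hypothetical stand-ins, not any particular framework's API.

```
def run_order_saga(order, payment_service, inventory_service, shipping_service):
    """Run each step; on failure, compensate everything that already succeeded."""
    completed = []  # compensations for the steps that have succeeded so far
    steps = [
        (payment_service.charge,    payment_service.refund),
        (inventory_service.reserve, inventory_service.release),
        (shipping_service.schedule, shipping_service.cancel),
    ]
    for action, compensation in steps:
        try:
            action(order)
            completed.append(compensation)
        except Exception:
            # Not atomic: earlier writes have already happened, so we
            # compensate them in reverse order; compensation may itself fail.
            for compensate in reversed(completed):
                compensate(order)
            return False
    return True
```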
So can I consider CQRS as an advanced version of saga which increases the speed of reads?
Personally I would not mix the concepts of CQRS and Sagas. I think this can really confuse you. Consider both patterns as two completely different things and try to understand them both independently.

CQRS Write database

In our company we are developing a microservice-based system and we apply the CQRS pattern. Since CQRS separates Commands and Queries, we have to develop 2 microservices. Currently I have been assigned to extend the CQRS pattern to save events in a separate database (event sourcing). I understand that having a separate event database is very important, but do we really need a separate Write Database? What is the actual use of the Write database?
If you have an event database, it is your Write database. It is the system-of-record and contains the transactionally-consistent state of the application.
If you have a separate Read database, it can be built off of the event log in either a strongly-consistent or eventually-consistent manner.
I understand that having a separate event database is very important but do we really need a separate Write Database? What is the actual use of the Write database?
The purpose of the write database is to stand as your book of record. The write database is the persisted representation that you use to recover on restart. It's the synchronization point for all writes.
It's "current Truth" as your system understands it.
In a sense, it is the "real" data, where the read models are just older/cached representations of what the real data used to look like.
It may help to think in terms of an RDBMS. When traffic is small, we can serve all of the incoming requests from a single database. As traffic increases, we want to start offloading some of that traffic. Since we want the persisted data to be in a consistent state, we can't offload the writes -- not if we want to be resolving conflicts at the point of the write. But we can shed reads onto other instances, provided that we are willing to admit some finite interval of time between when a write happens and when the written data is available on all systems.
So we send all writes to the leader, who is responsible for organizing everything into the write ahead log; changes to the log can then be replicated to the other instances, which in turn build out local copies of the data structures used to support low latency queries.
If you look very carefully, you might notice that your "event database" shares a lot in common with the "write ahead log".
No, you don't necessarily need a separate write database. The core of CQRS segregation is at the model (code) level. Going all the way to the DB might be beneficial or detrimental to your project, depending on the context.
As with many orthogonal architectural decisions surrounding the use of CQRS (Event Sourcing, Command Bus, etc.), the pros and cons should be considered carefully prior to adoption. Below some amount of concurrent access, separating read and write DBs might not be worth the effort.

CQRS, Event Sourcing and Scaling

It's clear that a system based on these patterns is easily scalable. But I would like to ask: how exactly? I have a few questions regarding scalability:
How to scale aggregates? If I create multiple instances of aggregate A, how do I sync them? If one of the instances processes a command and creates an event, should this event be propagated to every instance of that aggregate?
Shouldn't there be some business logic deciding which instance of the aggregate to request? If I am issuing multiple commands that apply to aggregate A (ORDERS) and to one specific order, it makes sense to deliver them to the same instance. Or does it?
In this article: https://initiate.andela.com/event-sourcing-and-cqrs-a-look-at-kafka-e0c1b90d17d8,
they are using Kafka with partitioning. The user management service (aggregate) is scaled, but each instance is subscribed only to a specific partition of the topic, which contains all events of a particular user.
Thanks!
How to scale aggregates?
Choose aggregates carefully and make sure your commands spread reasonably across many aggregates. You don't want an aggregate that is likely to receive a high number of commands from concurrent users.
Serialize the commands sent to an aggregate instance. This can be done with an aggregate repository and a command bus/queue, but for me the simplest way is optimistic locking with aggregate versioning, as described in this post by Michiel Rook.
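Not the exact scheme from Michiel Rook's post, but a minimal Python sketch of the idea: the repository remembers the version it loaded, and a save is rejected if somebody else saved a newer version in between. The event_store interface (read/append) is a hypothetical assumption.

```
class ConcurrencyError(Exception):
    pass

class AggregateRepository:
    def __init__(self, event_store):
        self.event_store = event_store

    def load(self, aggregate_id):
        events = self.event_store.read(aggregate_id)
        return events, len(events)          # event history + expected version

    def save(self, aggregate_id, new_events, expected_version):
        # In a real store, this version check and the append happen atomically
        # (e.g. via a unique constraint on stream + position).
        current_version = len(self.event_store.read(aggregate_id))
        if current_version != expected_version:
            raise ConcurrencyError(
                f"{aggregate_id} was modified concurrently "
                f"(expected v{expected_version}, found v{current_version})"
            )
        self.event_store.append(aggregate_id, new_events)
```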
which instance of the aggregate to request?
In our reSolve framework we create an instance of the aggregate on every command and don't keep it between requests. This works surprisingly fast: it is faster to fetch 100 events and reduce them to the aggregate state than to find the right aggregate instance in a cluster.
This approach is scalable and lets you go serverless: one lambda invocation per command and no shared state in between. The rare cases where an aggregate has too many events are solved by snapshots.
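This is not reSolve's actual API, but a rough Python illustration of "fetch the events and reduce them to aggregate state": a left fold of the event list over a projection function. Event types and fields are made up for the example.

```
from functools import reduce

def apply(state, event):
    """Project one event onto the current aggregate state."""
    if event["type"] == "OrderCreated":
        return {"status": "created", "items": []}
    if event["type"] == "ItemAdded":
        return {**state, "items": state["items"] + [event["item"]]}
    if event["type"] == "OrderShipped":
        return {**state, "status": "shipped"}
    return state

def rehydrate(events):
    """Rebuild the aggregate from scratch for each incoming command."""
    return reduce(apply, events, {})

# e.g. rehydrate(event_store.read("order-42")) -- event_store is hypothetical.
```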
How to scale aggregates?
The Aggregate instances are represented by their stream of events. Every Aggregate instance has its own stream of events. Events from one Aggregate instance are NOT used by other Aggregate instances. For example, if Order Aggregate with ID=1 creates an OrderWasCreated event with ID=1001, that Event will NEVER be used to rehydrate other Order Aggregate instances (with ID=2,3,4...).
That being said, you scale the Aggregates horizontally by creating shards on the Event store based on the Aggregate ID.
If I create multiple instances of aggregate A, how do I sync them? If one of the instances processes a command and creates an event, should this event be propagated to every instance of that aggregate?
You don't. Each Aggregate instance is completely separated from other instances.
In order to scale the processing of commands horizontally, it is recommended to load the Aggregate instance from the Event store each time, by replaying all its previously generated events. There is one optimization that you can do to boost performance: Aggregate snapshots, but it is recommended only if really needed. This answer could help.
Shouldn't there be some business logic deciding which instance of the aggregate to request? If I am issuing multiple commands that apply to aggregate A (ORDERS) and to one specific order, it makes sense to deliver them to the same instance. Or does it?
You assume that the Aggregate instances are running continuously in some servers' RAM. You could do that, but such an architecture is very complex. For example, what happens when one of the servers goes down and must be replaced by another? It's hard to determine which instances were living there and to restart them. Instead, you could have many stateless servers that can handle commands for any of the aggregate instances. When a command arrives, you identify the Aggregate ID, you load the Aggregate from the Event store by replaying all its previous events, and then it can execute the command. After the command is executed and the new events are persisted to the Event store, you can discard the Aggregate instance. The next command that arrives for the same Aggregate instance could be handled by any other stateless server. So, scalability is dictated only by the scalability of the Event store itself.
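A sketch of that stateless flow in Python; the event_store interface, the command shape, and the decide() domain function are all assumptions for illustration, and rehydrate() is the fold shown earlier.

```
def handle_command(event_store, command, rehydrate, decide):
    """Load, replay, execute, persist, then discard the in-memory instance."""
    aggregate_id = command["aggregate_id"]
    history = event_store.read(aggregate_id)           # replay previous events
    state = rehydrate(history)                         # rebuild aggregate state
    new_events = decide(state, command)                # pure domain logic (hypothetical)
    event_store.append(aggregate_id, new_events,
                       expected_version=len(history))  # optimistic concurrency
    return new_events                                  # nothing is kept in memory
```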
How to scale aggregates?
Each piece of information in the system has a single logical authority. Multiple authorities for a single piece of data get you contention. You scale the writes by creating smaller, non-overlapping boundaries -- each authority has a smaller area of responsibility.
To borrow from your example, an example of smaller responsibilities would be to shift from one aggregate for all ORDERS to one aggregate for each ORDER.
It's analogous to the difference between having a key-value store with all ORDERS stored in a document under one key, vs each ORDER being stored using its own key.
Reads are safe, you can scale them out with multiple copies. Those copies are only eventually consistent, however. This means that if you ask "what is the bid price of FCOJ now?" you may get different answers from each copy. Alternatively, if you ask "what was the bid price of FCOJ at 10:09:02?" then each copy will either give you a single answer or say "I don't know yet".
But if the granularity is already one command per aggregate, which in my opinion is not very often possible, and you really have many concurrent accesses, how do you solve it? How do you spread the load and avoid conflicts as much as possible?
Rough sketch: each aggregate is stored via a key that can be computed from the contents of the command message. An update to the aggregate is achieved by a compare-and-swap operation using that key.
Acquire a message
Compute the storage key
Load a versioned representation from storage
Compute a new versioned representation
Compare-and-swap the new representation for the old in storage
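In Python, that loop might look roughly like this; storage.get() and storage.compare_and_swap() are hypothetical operations on whatever storage appliance you pick, and deriving the key from the aggregate id in the message is an assumption.

```
import hashlib

def storage_key(message):
    # Key computable from the contents of the command message.
    return hashlib.sha256(message["aggregate_id"].encode("utf-8")).hexdigest()

def handle(storage, message, compute_new_state):
    key = storage_key(message)
    while True:
        version, current = storage.get(key)            # load versioned representation
        updated = compute_new_state(current, message)  # compute new representation
        if storage.compare_and_swap(key, version, updated):
            return updated
        # Somebody else won the swap; reload and try again.
```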
To provide additional traffic throughput, you add more stateless compute.
To provide storage throughput, you distribute the keys across more storage appliances.
A routing layer can be used to group messages together - the router uses the same storage key calculation as before, but uses it to choose where in the compute farm to forward the message. The compute can then check each batch of messages it receives for duplicate keys, and process those messages together (trading some extra compute to reduce the number of compare-and-swaps).
Sane message protocols are important; see Marc de Graauw's Nobody Needs Reliable Messaging.

Commit to a log like Kafka + database with ACID properties?

I'm planning to test how to make this kind of architecture work:
http://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/
Here all the data is stored as facts in a log, but the validation when a change is posted must be done against a table. For example, if I send a "Create Invoice with Customer 1" command, I need to validate that the customer exists (among other things); when the validation passes, I commit to the log and apply the change to the table, so the table has the most up-to-date information while I still have the full history of changes.
I could put the log into a table in the database (I use PostgreSQL). However, I'm concerned about the scalability of doing that; also, I want to subscribe to the event stream from multiple clients, and neither PG nor any other RDBMS I know lets me do this without polling.
But if I use Kafka, I worry about ACID between both storages: Kafka could end up with wrong data that PG rolled back, or something similar.
So:
1. Is it possible to keep consistency between an RDBMS and a log store, OR
2. Is it possible to subscribe in real time and tune PG (or another RDBMS) for fast event storage?
Easy(1) answers to the provided questions:
Setting up your transaction isolation level properly may be enough to achieve consistency and not worry about DB rollbacks. You can still occasionally create inconsistency unless you set the isolation level to 'serializable'. Even then you're guaranteed to be consistent, but could still have undesirable behaviors. For example, a client creates a customer and posts an invoice in rapid succession using an async API, and the invoice event hits your backend system first. In this case the invoice event would be invalidated and the client will need to retry, hoping that the customer has been created by that time. This is easy to avoid if you control the clients and mandate that they use a sync API.
Whether it is possible to store events in a relational DB depends on your anticipated dataset size, hardware and access patterns. I'm a big-time Postgres fan and there is a lot you can do to make event lookups blazingly fast. My rule of thumb: if your operating table size is below 200-300 GB and you have a decent server, Postgres is the way to go. With event sourcing there are typically no joins, and a common access pattern is to get all events by id (optionally restricted by timestamp). Postgres excels at this kind of query, provided you index smartly. However, event subscribers will need to pull this data, so it may not be a good fit if you have thousands of subscribers, which is rarely the case in practice.
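As an illustration of that access pattern (using psycopg2, with illustrative table and column names like the sketch earlier in this document): the composite index is what keeps "all events for this id, optionally bounded by time" fast.

```
import psycopg2

conn = psycopg2.connect("dbname=eventstore")  # connection string is an assumption
cur = conn.cursor()
# Index smartly: one composite index covering the common lookup.
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_events_stream_time
        ON events (stream_name, recorded_at)
""")
conn.commit()
# "All events for this id, optionally restricted by timestamp."
cur.execute(
    "SELECT data FROM events "
    "WHERE stream_name = %s AND recorded_at >= %s "
    "ORDER BY recorded_at",
    ("customer-1", "2020-01-01"),
)
events = [row[0] for row in cur.fetchall()]
```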
"Conceptually correct" answer:
If you still want to pursue the streaming approach and fundamentally resolve race conditions, then you have to provide event ordering guarantees across all events in the system. For example, you need to be able to order the 'add customer 1' event and the 'create invoice for customer 1' event so that you can guarantee consistency at any time. This is a really hard problem to solve in general for a distributed system (see e.g. vector clocks). You can mitigate it with some clever tricks that work for your particular case; e.g. in the example above you can partition your events by 'customerId' early, as they hit the backend, and then you have a guarantee that all events related to the same customer will be processed (roughly) in the order they were created.
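A rough Python sketch of that partitioning trick: hash the customerId to pick a partition, so all events for one customer land on (and are consumed from) the same partition in order. The partition count and event shape are assumptions.

```
import hashlib

NUM_PARTITIONS = 16

def partition_for(event):
    """Deterministically map a customerId to a partition number."""
    digest = hashlib.md5(event["customerId"].encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# partition_for({"type": "CustomerAdded",  "customerId": "1"}) and
# partition_for({"type": "InvoiceCreated", "customerId": "1"}) return the same
# partition, so one consumer sees them in the order they were appended.
```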
Would be happy to clarify my points if needed.
(1) Easy vs simple: mandatory link

Data Synchronization in a Distributed system

We have a REST-based application built on the Restlet framework that supports CRUD operations. It uses a local file to store the data.
Now the requirement is to deploy this application on multiple VMs, and any update operation in one VM needs to be propagated to the other application instances running on other VMs.
Our idea for solving this was to send multiple POST messages (to all other application instances) when an update operation happens in a given VM.
The assumption here is that each application has a list of URLs of all other applications.
Is there a better way to solve this?
Consistency is a deep topic, and a hard thing to get right. The trouble comes when two nearly-simultaneous changes occur to the same data: conflicting updates can arrive in one order on one server, and in another order on another. This is a problem, since the two servers no longer agree on what the data is, and it isn't clear who is "right".
The short story: get your favorite RDBMS (for example, MySQL is popular) and have your app servers connect to it in what is called the three-tier model. Be sure to perform complex updates in transactions, which will provide an acceptable consistency model.
The long story: the three-tier model serves well for small-to-medium scale web sites/services. You will eventually find that the single database becomes the bottleneck. For services whose read traffic is substantially larger than write traffic, a common optimization is to create a single-master, many-slave database replication arrangement, where all writes go to the single master (required for consistency with non-distributed transactions), but the more common reads can go to any of the read slaves.
For services with evenly mixed read/write traffic, you may be better served by dropping some of the conveniences (and accompanying restrictions) that formal SQL provides and instead using one of the various "NoSQL" data stores that have recently emerged. Their relative merits and fitness for various problems is a deep topic in itself.
I can see 7 major options for now. You should look into the details of each and decide whether its facilities and trade-offs are appropriate for your purpose.
Perform the CRUD operations on a common RDBMS. This is the simplest and most consistent option.
Perform the CRUD operations on a common RDBMS that runs as a fast in-memory RDBMS, e.g. TimesTen from Oracle.
Perform the CRUD operations on a distributed cache, or on your own home-cooked distributed hash table that can guarantee synchronization, e.g. Hazelcast, Ehcache and others.
Use a fast common state server like Redis/memcached, perform your updates on it in a synchronized manner, and write the successful operations out to a DB lazily if required.
Distribute your REST servers such that the CRUD operations on a single entity are only performed by a single master (a sketch of this routing appears after this list). Once this is done, the details about the changes can be communicated to everyone else using a reliable message bus or a distributed database (e.g. Postgres) that runs underneath and syncs all of your updates fairly fast.
Target eventual consistency and use a distributed data store like Cassandra, which lets you choose the consistency level you require.
Use distributed consensus algorithms like Paxos or Raft, or (recommended) an implementation of them such as ZooKeeper or etcd respectively, and take ownership of the item you want to change from each REST server before you perform the CRUD operation. This might be a bit slow, though, and is essentially what Cassandra can give you as well.
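As noted above, a minimal Python sketch of the "single master per entity" routing from the fifth option: every instance hashes the entity id over the known list of instance URLs (the URL list each application holds is the assumption already made in the question) to pick the same owner.

```
import hashlib

INSTANCES = [
    "http://vm-1:8080",  # hypothetical instance URLs
    "http://vm-2:8080",
    "http://vm-3:8080",
]

def owner_for(entity_id):
    """Deterministically pick the single master responsible for this entity."""
    digest = hashlib.sha1(entity_id.encode("utf-8")).digest()
    return INSTANCES[int.from_bytes(digest[:4], "big") % len(INSTANCES)]

# Every instance computes the same owner, so all writes for "invoice-17" are
# forwarded to owner_for("invoice-17"), and that owner broadcasts the change.
```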