We have the first version of an application based on a microservice architecture. We used REST for external and internal communication.
Now we want to switch to AP from CP (CAP theorem)* and use a message bus for communication between microservices.
There is a lot of information about how to create an event bus based on Kafka, RabbitMQ, etc.
But I can't find any best practices for a combination of REST and messaging.
For example, you create a car service and you need to add different car components. It would make more sense, for this purpose, to use REST with POST requests. On the other hand, a service for booking a car would be a good task for an event-based approach.
Do you have a similar approach when you have a different dictionary and business logic capabilities? How do you combine them? Just support both approaches separately? Or unify them in one approach?
* for the first version, we agreed to choose consistency and partition tolerance. But now availability becomes more important for us.
Bottom line up front: You're looking for Command Query Responsibility Segregation (CQRS), an architectural pattern that separates the responsibility of querying for data from the responsibility of asking for a process to be run. The short answer is that you do not want to mix the two, in either a query or a process, in a blocking fashion. The rest of this answer goes into detail as to why, and covers the three different ways you can do what you're trying to do.
This answer is a short form of the experience I have with Microservices. My bona fides: I've created Microservices topologies from scratch (and nearly zero knowledge) and as they say hit every branch on the way down.
One of the benefits of starting from zero knowledge is that the first topology I created used a mixture of inter-service synchronous and blocking (HTTP) communication (to retrieve data needed for an operation from the service that held it), and message queues + asynchronous events to run operations (for Commands).
I'll define both terms:
Commands: Telling a service to do something. For instance, "Run ETL Batch job". You expect there to be an output from this; but it is necessarily a process that you're not going to be able to reliably wait on. A command has side-effects. Something will change because of this action (If nothing happens and nothing changes, then you haven't done anything).
Query: Asking a service for data that it holds. This data may have been there because of a Command given, but asking for data should not have side effects. No Command operations should need to be run because of a Query received.
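To make the distinction concrete, here is a minimal sketch in Python. The JobService class and its method names are made up for illustration; the only point is that the command changes state and returns nothing to wait on, while the query only reads.

```python
from dataclasses import dataclass, field


@dataclass
class JobService:
    """Hypothetical service that owns the state of ETL jobs."""
    jobs: dict = field(default_factory=dict)

    def handle_run_job(self, job_id: str) -> None:
        # Command: has a side effect (state changes) and returns no result.
        self.jobs[job_id] = "queued"

    def get_job_status(self, job_id: str):
        # Query: reads state only; calling it repeatedly changes nothing.
        return self.jobs.get(job_id)


service = JobService()
service.handle_run_job("etl-42")           # command
status = service.get_job_status("etl-42")  # query
```

In a real system the command would arrive as a message and the query over HTTP; the in-process calls here only stand in for those transports.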
Anyway, back to the topology.
Level 1: Mixed HTTP and Events
For this first topology, we mixed Synchronous Queries with Asynchronous Events being emitted. This was... problematic.
Message Buses are by their nature observable. One setting in RabbitMQ, or an Event Source, and you can observe all events in the system. This has some good side-effects, in that when something happens in the process you can typically figure out what events led to that state (if you follow an event-driven paradigm + state machines).
HTTP Calls are not observable without inspecting network traffic or logging those requests (which itself has problems, so we're going to start with "not feasible" in normal operations). Therefore if you mix a message based process and HTTP calls, you're going to have holes where you can't tell what's going on. You'll have spots where due to a network error your HTTP call didn't return data, and your services didn't continue the process because of that. You'll also need to hook up Retry/Circuit Breaker patterns for your HTTP calls to ensure they at least try a few times, but then you have to differentiate between "Not up because it's down", and "Not up because it's momentarily busy".
In short, mixing the two methods for a Command Driven process is not very resilient.
Level 2: Events define RPC/Internal Request/Response for data; Queries are External
In step two of this maturity model, you separate out Commands and Queries. Commands should use an event driven system, and queries should happen through HTTP. If you need the results of a query for a Command, then you issue a message and use a Request/Response pattern over your message bus.
This has benefits and problems too.
Benefits-wise your entire Command is now observable, even as it hops through multiple services. You can also replay processes in the system by rerunning events, which can be useful in tracking down problems.
Problems-wise, some of your events now look a lot like queries, and you end up recreating for messages the semantics that HTTP and REST already give you, which is not terribly fun or useful. As an example, a 404 tells you there's no data in REST. For a message based event, you have to recreate those semantics (there's a good YouTube conference talk on the subject that I can't find, in which a team tried to do just that with great pain).
However, your events are now asynchronous and non-blocking, and every service can be refactored to a state-machine that will respond to a given event. Some caveats are those events should contain all the data needed for the operation (which leads to messages growing over the course of a process).
Your queries can still use HTTP for external communication; but for internal command/processes, you'd use the message bus.
I don't recommend this approach either (though it's a step up from the first approach). I don't recommend it because of the impurity your events start to take on, and in a microservices system having contracts be the same throughout the system is important.
Level 3: Producers of Data emit data as events. Consumers Record data for their use.
The third step in the maturity model (and we were on our way to that paradigm when I departed from the project) is for services that produce data to issue events when that data is produced. That data is then jotted down by services listening for those events, and those services will use that (could be?) stale data to conduct their operations. External customers still use HTTP; but internally you emit events when new data is produced, and each service that cares about that data will store it to use when it needs to. This is the crux of Michael Bryzek's talk Designing Microservices Architecture the Right way. Michael Bryzek is the CTO of Flow.io, a white-label e-commerce company.
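A rough sketch of that third level, in Python. The EventBus class, the "car.updated" topic and the BookingService are all hypothetical stand-ins (a real system would use Kafka, RabbitMQ or similar); the sketch only shows a consumer recording the data it cares about and then working from its own copy.

```python
from collections import defaultdict


class EventBus:
    """In-memory stand-in for a message bus (hypothetical)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)


class BookingService:
    """Consumer that jots down car data when the car service emits it."""

    def __init__(self, bus):
        self.known_cars = {}  # local, possibly stale copy
        bus.subscribe("car.updated", self.on_car_updated)

    def on_car_updated(self, event):
        self.known_cars[event["car_id"]] = event

    def can_book(self, car_id):
        # Uses only local data; no blocking call to the car service.
        car = self.known_cars.get(car_id)
        return car is not None and car["available"]


bus = EventBus()
booking = BookingService(bus)
bus.publish("car.updated", {"car_id": "c1", "available": True})
```

The trade-off is the one mentioned above: the booking service may act on stale data, but it never has to wait on another service to run its own operation.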
If you want a deeper answer along with other issues at play, I'll point you to my blog post on the subject.
I am using Kafka for Event Sourcing and I am interested in implementing sagas using Kafka.
Any best practices on how to do this? The Commander pattern mentioned here seems close to the architecture I am trying to build but sagas are not mentioned anywhere in the presentation.
This talk from this year's DDD eXchange is the best resource I came across wrt Process Manager/Saga pattern in event-driven/CQRS systems:
https://skillsmatter.com/skillscasts/9853-long-running-processes-in-ddd
(requires registering for a free account to view)
The demo shown there lives on github: https://github.com/flowing/flowing-retail
I've given it a spin and I quite like it. I do recommend watching the video first to set the stage.
Although the approach shown is message-bus agnostic, the demo uses Kafka for the Process Manager to send commands to and listen to events from other bounded contexts. It does not use Kafka Streams but I don't see why it couldn't be plugged into a Kafka Streams topology and become part of the broader architecture like the one depicted in the Commander presentation you referenced.
I hope to investigate this further for our own needs, so please feel free to start a thread on the Kafka users mailing list; that's a good place to collaborate on such patterns.
Hope that helps :-)
I would like to add something here about sagas and Kafka.
In general
In general, Kafka is a bit different from a normal queue. It is especially good at scaling, and that can cause some complications.
To accomplish scaling, Kafka partitions the data stream. Data is placed in partitions, each of which can be consumed at its own rate, independent of the other partitions of the same topic. Here is some info on it: how-choose-number-topics-partitions-kafka-cluster. I'll come back to why this is important.
The most common ways to ensure the order within Kafka are:
Use 1 partition for the topic
Use a message key to "assign" the message to a partition
In both scenarios your chronologically dependent messages need to stream through the same partition of the same topic.
Also, as #pranjal thakur points out, make sure the delivery method is set to "exactly once", which has a performance impact but ensures you will not receive the messages multiple times.
The caveat
Now, here's the caveat: when the number of partitions changes, the distribution of keyed messages over the partitions changes as well.
In normal conditions this can be handled easily. But if you have a high traffic situation, the migration toward a different number of partitions can result in a moment in time in which a saga-"flow" is handled over multiple partitions and the order is not guaranteed at that point.
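To see why the key-to-partition mapping moves, here is a small Python sketch. It assumes the common "hash of the key modulo the number of partitions" scheme and uses crc32 purely as a stand-in hash (Kafka's default partitioner uses a different hash function); the partition counts 6 and 8 and the key names are made up.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (crc32); the modulo step is what matters for this example.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


saga_keys = [f"booking-{i}" for i in range(1000)]

# Keys whose partition changes when the topic grows from 6 to 8 partitions.
moved = [k for k in saga_keys if partition_for(k, 6) != partition_for(k, 8)]
```

Any key in `moved` will have its older messages in one partition and its newer messages in another, which is exactly the window in which ordering for that saga is no longer guaranteed.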
It's up to you whether this will be an issue in your scenario.
Here are some questions you can ask to determine if this applies to your system:
What will happen if you somehow need to migrate/copy data to a new system, using Kafka? (high traffic scenario)
Can you send your data to 1 topic?
What will happen after a temporary outage of your saga service? (low availability scenario/high traffic scenario)
What will happen when you need to replay a bunch of messages? (high traffic scenario)
What will happen if we need to increase the partitions? (high traffic scenario/outage & recovery scenario)
The alternative
If you're thinking of setting up a saga, based on steps, like a state machine, I would challenge you to rethink your design a bit.
I'll give an example:
Let's consider a booking-a-hotel-room process:
Simplified, it might consist of the following steps:
Handle room reserved (incoming event)
Handle room paid (incoming event)
Send acknowledgement of the booking (after payment and some processing)
Now, if your saga is not able to handle the payment if the reservation hasn't come in yet, then you are relying on the order of events.
In this case you should ask yourself: when will this break?
If you conclude you want to avoid the chronological dependency; consider a system without a saga, or a saga which does not depend on the order of events - i.e.: accepting all messages, even when it's not their turn yet in the process.
Some examples:
aggregators
Modeled as business process: parallel gateways (parallel process flows)
Do note in such a setup it is even more crucial that every action has got an implemented compensating action (rollback action).
I know this is often hard to accomplish; but, if you start small, you might start to like it :-)
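As an illustration of a saga that does not depend on the order of events, here is a minimal Python sketch of the hotel booking example. The class and event names are hypothetical, and compensating actions are left out to keep it short.

```python
class BookingSaga:
    """Hypothetical saga for one booking; accepts events in any order."""

    REQUIRED = {"room_reserved", "room_paid"}

    def __init__(self):
        self.seen = set()
        self.acknowledged = False

    def handle(self, event_type: str) -> bool:
        # Record the event regardless of which one arrived first.
        self.seen.add(event_type)
        if not self.acknowledged and self.REQUIRED <= self.seen:
            # All required events are in: this is where the
            # acknowledgement would be sent.
            self.acknowledged = True
        return self.acknowledged


in_order = BookingSaga()
in_order.handle("room_reserved")
in_order.handle("room_paid")

out_of_order = BookingSaga()
out_of_order.handle("room_paid")      # payment arrives first
out_of_order.handle("room_reserved")
```

Both sagas end up acknowledged, so a repartitioning or replay that reorders the two events does not break the process.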
I am starting to learn Scala and functional programming. I was reading the book "Programming Scala: Tackle Multi-Core Complexity on the Java Virtual Machine". In the first chapter I came across the terms Event-Driven concurrency and Actor model. Before I continue reading this book I want to have an idea about Event-Driven concurrency and the Actor Model.
What is Event-Driven concurrency, and how is it related to Actor Model?
An Event Driven programming model involves registering code to be run when a given event fires. An example is, instead of calling a method that returns some data from a database:
val user = db.getUser(1)
println(user.name)
You could instead register a callback to be run when the data is ready:
db.getUser(1, u => println(u.name))
In the first example, no concurrency was happening; the current thread would block until db.getUser(1) returned data from the database. In the second example db.getUser would return immediately and the program would carry on executing the next code. In parallel to this, the callback u => println(u.name) will be executed at some point in the future.
Some people prefer the second approach because it means memory-hungry threads are not needlessly sitting around waiting for slow I/O to return.
The Actor Model is an example of how Event-Driven concepts can be used to help the programmer easily write concurrent programs.
From a super high level, Actors are objects that define a series of Event Driven message handlers that get fired when the Actor receives messages. In Akka, each instance of an Actor is single Threaded, however when many of these Actors are put together they create a system with concurrency.
For example, Actor A could send messages to Actor B and C in parallel. Actor B and C could fire messages back to Actor A. Actor A would have message handlers to receive these messages and behave as desired.
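The A/B/C exchange can be sketched without Akka. The following is a language-neutral toy in Python, not Akka's API: the Actor class here is an assumption of mine, with one thread draining one mailbox per actor, which mirrors the "single threaded per actor, concurrent as a system" idea.

```python
import queue
import threading


class Actor:
    """Toy actor: one thread processing one mailbox, one message at a time."""

    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        self.mailbox.put(message)  # returns immediately, never waits for a reply

    def stop(self):
        self.mailbox.put(None)     # sentinel: processed after earlier messages
        self._thread.join()

    def _run(self):
        while True:
            message = self.mailbox.get()
            if message is None:
                break
            self._handler(message)


replies = []
actor_a = Actor(replies.append)                        # A collects replies
actor_b = Actor(lambda n: actor_a.send(("B", n * 2)))  # B replies to A
actor_c = Actor(lambda n: actor_a.send(("C", n + 1)))  # C replies to A

actor_b.send(10)
actor_c.send(10)
actor_b.stop()
actor_c.stop()
actor_a.stop()
```

B and C run concurrently, so the order in which their replies reach A is not fixed; only the handling inside each actor is sequential.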
To learn more about the Actor model I would recommend reading the Akka documentation. It is really well written: http://doc.akka.io/docs/akka/2.1.4/
There is also lots of good documentation around the web about Event Driven Concurrency that is much more detailed than what I've written here. http://berb.github.io/diploma-thesis/original/055_events.html
Theon's answer provides a good modern overview. I'd like to add some historical perspective.
Tony Hoare and Robin Milner both developed process algebras for analysing concurrent systems (Communicating Sequential Processes, CSP, and the Calculus of Communicating Systems, CCS). Both of these look like heavy mathematics to most of us but the practical application is relatively straightforward. CSP led directly to the Occam programming language amongst others, with Go being the newest example. CCS led to Pi calculus and the mobility of communicating channel ends, a feature that is part of Go and was added to Occam in the last decade or so.
CSP models concurrency purely by considering autonomous entities ('processes', very lightweight things like green threads) interacting simply by event exchange. Events are passed along channels. Processes may have to deal with several inputs or outputs and they do this by selecting the event that is ready first. The events usually carry data from the sender to the receiver.
A principal feature of the CSP model is that a pair of processes engage in communication only when both are ready - in practical terms this leads to what is usually called 'synchronous' communication. However, the actual implementations (Go, Occam, Akka) allow channels to be buffered (the normal state in Akka) so that the lock-step exchange of events is often actually decoupled instead.
So in summary, an event-driven CSP-based system is really a data-flow network of processes connected by channels.
Besides the CSP interpretation of event-driven, there have been others. An important example is the 'event-wheel' approach, once popular for modelling concurrent systems whilst actually having a single processing thread. Such systems handle events by putting them into a processing queue and dealing with them in due course, usually via a callback. Java Swing's event processing engine is a good example. There were others, e.g. for time-based simulation engines. One might think of the JavaScript / NodeJS model as fitting into this category as well.
So in summary, an event-wheel was a way to express concurrency but without parallelism.
The irony of this is that the two approaches I've described above are both described as event driven but what they mean by event driven is different in each case. In one case, hardware-like entities are wired together; in the other, almost all actions are executed by callbacks. The CSP approach claims to be scalable because it's fully composable; it's naturally adept at parallel execution also. If there are any reasons to favour one over the other, these are probably it.
To understand the answer to this you have to look at event concurrency from the OS layer up. First you start with threads which are the smallest section of code that can be run by the OS and eventually deal with I/O, timing and other kinds of events.
The OS groups threads into a process in which they share the same memory, protection and security permissions. Above that layer you have user programs which typically make I/O requests that are handled by user libraries.
The I/O libraries handle these requests in one of two ways. Unix-like systems use a "reactor" model in which the library registers I/O handlers for all the different types of I/O and events in the system. These handlers are activated when I/O is ready on a specific device. Windows-like systems use an I/O completion model in which I/O requests are made and a callback is triggered when the request is complete.
Both of these models require a significant amount of overhead to manage overall program state if you were to use them directly. However some programming tasks (web apps / services) lend themselves to a seemingly more direct implementation if you use an event model directly, but you still need to manage all of that program state. In order to track program logic across dispatches of several related events you have to manually track state and pass it around to the callbacks. This tracking structure is usually called a state context or baton. As you might imagine passing batons around all over the place to numerous seemingly unrelated handlers makes for some extremely hard to read and spaghetti-like code. It's also a pain to write and debug -- especially when you're trying to handle the synchronization of various concurrent paths of execution. You start getting into Futures and then the code becomes really difficult to read.
One well-known event processing library is called libuv. It's a portable event loop that integrates Unix's reactor model with Windows' completion model into a single model usually called a "proactor". It's the event handler that drives NodeJS.
Which brings us to communicating sequential processes.
https://en.wikipedia.org/wiki/Communicating_sequential_processes
Rather than writing asynchronous I/O dispatch and synchronization code using one or more concurrency models (and their often competing conventions), we flip the problem on its head. We use a "coroutine" which looks like normal sequential code.
A simple example is a coroutine that receives a single byte over an event channel from another coroutine that sends a single byte. This effectively synchronizes I/O producer and consumer because the writer/sender has to wait for a reader/receiver and vice-versa. While either process is waiting they explicitly yield execution to other processes. When a coroutine yields, its scoped program state is saved on a stack frame thus saving you from the confusion of managing multi-layered baton state in an event loop.
Using applications built on these event channels we can construct arbitrary, reusable, concurrent logic and the algorithms no longer look like spaghetti code. In pure CSP systems if you write to a channel and there is no reader, you will be blocked. The channel endpoints are known via handles internally to the program.
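Here is a small Python sketch of that blocking hand-off. It is only an approximation under my own assumptions: threads stand in for coroutines, the Channel class is hypothetical, and it is meant for a single sender and a single receiver.

```python
import queue
import threading


class Channel:
    """Unbuffered channel: send() returns only after a receiver took the value."""

    def __init__(self):
        self._q = queue.Queue(maxsize=1)

    def send(self, value):
        self._q.put(value)
        self._q.join()        # block until recv() has called task_done()

    def recv(self):
        value = self._q.get() # block until a sender has put a value
        self._q.task_done()
        return value


def producer(ch, log):
    # Looks like plain sequential code; the channel does the synchronizing.
    for byte in b"hi":
        ch.send(byte)
        log.append(("sent", byte))


channel = Channel()
log = []
sender = threading.Thread(target=producer, args=(channel, log))
sender.start()
received = [channel.recv(), channel.recv()]
sender.join()
```

Note that neither side carries a baton or registers a callback: the producer's position in its loop is its state, which is the readability gain described above.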
Actor systems are different in a couple of ways. First, the endpoints are the actor threads and they are named and known external to the mainline program. The second difference is that sends and receives on these channels are buffered. In other words, if you send a message to an actor and there isn't one listening or it's busy, you aren't blocked waiting for it to read from its input channel. Other differences exist; for example, one actor can publish to two different actors concurrently.
As you might guess Actor systems can easily be built from CSP systems. There are other details like waiting for specific event patterns and selecting from them, but that's the basics.
I hope that clarifies things a bit.
Other constructs can be built from these ideas. Various programming systems (Go, Erlang, etc) include CSP implementations within them. Operating systems like Inferno and Node9 use CSPs and Channels as the basis of their distributed computing model.
Go: https://en.wikipedia.org/wiki/Go_(programming_language)
Erlang: https://en.wikipedia.org/wiki/Erlang_(programming_language)
Inferno: https://en.wikipedia.org/wiki/Inferno_(operating_system)
Node9: https://github.com/jvburnes/node9
I have just downloaded joliver eventstore and looking to wire up a service bus with Windows Service Bus 1.0 for an application separated across more than one Bounded Context process.
If a bounded context has been offline whilst events in other bounded contexts have been created (or may even be a new context that has been deployed), I can see the following sequence of events.
For an example ContextA, ContextB and ContextC, all connected using Service Bus 1.0 and each context with their own event store, they all share the same bus messaging backplane.
ContextC goes offline.
When ContextC comes back-up, other bounded contexts need to be notified of the events that need to be resent to the context that has just come back online. These events are replayed from each of the event stores.
My questions are:
The above scenario would apply to any event sourcing libraries, so is there any infrastructure code on top of this I can use, or do I have to roll my own?
With Windows Service Bus 1.0, how do I marry sequence numbers in my event store to sequence numbers on the Service Bus?
What is the best practice to detect and handle events that have already been received in a safe manner (protecting against message handlers failing)?
The above scenario would apply to any event sourcing libraries, so is there any infrastructure code on top of this I can use, or do I have to roll my own?
The notion of a Projection mechanism tied to the events is certainly common. Unfortunately, there are many many ways of handling how that might be done, depending on your stack, performance requirements and scale and many other factors.
As a result I'm not aware of a commoditized facility of this nature.
The GetEventStore store has an integrated Projection facility which looks extremely powerful and takes the need to build all this off the table. Before its existence, I'd have argued that one shouldn't even consider looking past the SRPness of the JOES.
You haven't said much about your actual stack other than mentioning Azure.
With Windows Service Bus, how do I marry sequence numbers in my event store to sequence numbers on the Service Bus?
You can use stream id + the commit sequence number as the MessageId (and use that to ensure duplicates are removed by the bus). You will probably also include properties in the Message metadata.
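As a sketch of the idea (in Python rather than .NET, and not the Service Bus API): the message_id helper and the DeduplicatingPublisher class below are hypothetical and only imitate what bus-side duplicate detection does with a deterministic id built from the stream id and commit sequence number.

```python
def message_id(stream_id: str, commit_sequence: int) -> str:
    # Deterministic id: the same commit always maps to the same message id.
    return f"{stream_id}:{commit_sequence}"


class DeduplicatingPublisher:
    """Hypothetical stand-in for bus-side duplicate detection."""

    def __init__(self):
        self._seen_ids = set()
        self.delivered = []

    def publish(self, stream_id, commit_sequence, body) -> bool:
        mid = message_id(stream_id, commit_sequence)
        if mid in self._seen_ids:
            return False  # duplicate: dropped instead of delivered twice
        self._seen_ids.add(mid)
        self.delivered.append((mid, body))
        return True


bus = DeduplicatingPublisher()
first = bus.publish("order-7", 1, "OrderCreated")
replayed = bus.publish("order-7", 1, "OrderCreated")  # same commit, resent
second = bus.publish("order-7", 2, "OrderPaid")
```

A real bus usually remembers ids only for a limited window, so this protects against resends shortly after the original, not against arbitrarily old replays.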
What is the best practice to detect and handle events that have already been received in a safe manner (protecting against message handlers failing)?
If you're on Azure and considering ServiceBus then the Topics can be used to ensure at-least-once delivery (and you'll use the sessioning facility). Go watch the two-hour deep-dive ClemensV Subscribe video plus a few other episodes, or you'll spend the same amount of time making mistakes.
To keep broadcast traffic down, if ContextC requests replays from ContextA and ContextB, is there any way for these replay messages to be sent only to ContextC? Or should I not worry about this?
Mu. You started off asking whether this stuff was a good idea but now seem to have baked in an assumption that it's the way to go.
Firstly, this infrastructure is a massive wheel to reinvent. Have you considered simply setting up a topic per BC and having anyone that needs to listen listen?
A key thing to bear in mind here: just because you can think of cases where BCs need to consume each other's events, it doesn't follow that you need a central magic bus that's everywhere and delivers everything everywhere.
EDIT: Answers to your edited versions of questions 2+
With Windows Service Bus 1.0, how do I marry sequence numbers in my event store to sequence numbers on the Service Bus?
Your event store doesn't have a sequence number. It has a commit sequence number per aggregate. You'd typically use a sessioned topic and subscription. Then you need to choose whether you want a global ordering (use a single session id) or per aggregate ordering (use the stream id as the session id).
Once events are on a topic, they have a MessageSequenceNumber, and the subscription (when sessioned) delivers them in sequence (strictly speaking, the subscriber receives them in sequence).
What is the best practice to detect and handle events that have already been received in a safe manner (protecting against message handlers failing)?
This is built into the Service Bus (or any queueing mechanism). You don't mark the Message completed until it has been successfully processed. Any failure leads to Abandonment (which puts it back on the queue for reprocessing).
The subscriber taking a break, becoming disconnected or work backing up is naturally dealt with by the Topic.
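The complete/abandon behaviour described above can be sketched as follows. This is plain Python, not the Service Bus client: the drain function, the max_attempts limit and the "dead" list are assumptions added so the loop terminates for a message that never succeeds.

```python
from collections import deque


def drain(messages, handler, max_attempts=3):
    """Process messages; a failed message is put back (abandoned) for retry."""
    pending = deque((m, 1) for m in messages)
    completed, dead = [], []
    while pending:
        message, attempt = pending.popleft()
        try:
            handler(message)
        except Exception:
            if attempt < max_attempts:
                pending.append((message, attempt + 1))  # abandon: back on the queue
            else:
                dead.append(message)                    # give up after max_attempts
        else:
            completed.append(message)                   # complete only after success
    return completed, dead


failures = {"b": 1}  # "b" fails once, then succeeds


def flaky_handler(message):
    if failures.get(message, 0) > 0:
        failures[message] -= 1
        raise RuntimeError("transient failure")
    if message == "poison":
        raise ValueError("always fails")


completed, dead = drain(["a", "b", "poison"], flaky_handler)
```

Because a message can be handed to the handler more than once, the handler itself should be safe to run twice for the same message.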
I have been experimenting with JOliver's Event Store 3.0 as a potential component in a project and have been trying to measure the throughput of events through the Event Store.
I started using a simple harness which essentially iterated through a for loop creating a new stream and committing a very simple event comprising a GUID id and a string property to a MSSQL2K8 R2 DB. The dispatcher was essentially a no-op.
This approach managed to achieve ~3K operations/second running on an 8 way HP G6 DL380 with the DB on a separate 32 way G7 DL580. The test machines were not resource bound, blocking looks to be the limit in my case.
Has anyone got any experience of measuring the throughput of the Event Store and what sort of figures have been achieved? I was hoping to get at least 1 order of magnitude more throughput in order to make it a viable option.
I would agree that blocking IO is going to be the biggest bottleneck. One of the issues that I can see with the benchmark is that you're operating against a single stream. How many aggregate roots do you have in your domain with 3K+ events per second? The primary design of the EventStore is for multithreaded operations against multiple aggregates which reduces contention and locks for real-world applications.
Also, what serialization mechanism are you using? JSON.NET? I don't have a Protocol Buffers implementation (yet), but every benchmark shows that PB is significantly faster in terms of performance. It would be interesting to run a profiler against your application to see where the biggest bottlenecks are.
Another thing I noticed was that you're introducing a network hop into the equation which increases latency (and blocking time) against any single stream. If you were writing to a local SQL instance which uses solid state drives, I could see the numbers being much higher as compared to a remote SQL instance running magnetic drives and which have the data and log files on the same platter.
Lastly, did your benchmark application use System.Transactions or did it default to no transactions? (The EventStore is safe without use of System.Transactions or any kind of SQL transaction.)
Now, with all of that being said, I have no doubt that there are areas in the EventStore that could be dramatically optimized with a little bit of attention. As a matter of fact, I'm kicking around a few backward-compatible schema revisions for the 3.1 release to reduce the number of writes performed within SQL Server (and RDBMS engines in general) during a single commit operation.
One of the biggest design questions I faced when starting on the 2.x rewrite that serves as the foundation for 3.x is the idea of async, non-blocking IO. We all know that node.js and other non-blocking web servers beat threaded web servers by an order of magnitude. However, the potential for complexity introduced on the caller is increased and is something that must be strongly considered because it is a fundamental shift in the way most programs and libraries operate. If and when we do move to an evented, non-blocking model, it would be more in a 4.x time frame.
Bottom line: publish your benchmarks so that we can see where the bottlenecks are.
Excellent question Matt (+1), and I see Mr Oliver himself replied as the answer (+1)!
I wanted to throw in a slightly different approach that I myself am playing with to help with the 3,000 commits-per-second bottleneck you are seeing.
The CQRS Pattern, that most people who use JOliver's EventStore seem to be attempting to follow, allows for a number of "scale out" sub-patterns. The first one people usually queue off is the Event commits themselves, which you are seeing a bottleneck in. "Queue off" meaning offloaded from the actual commits and inserting them into some write-optimized, non-blocking I/O process, or "queue".
My loose interpretation is:
Command broadcast -> Command Handlers -> Event broadcast -> Event Handlers -> Event Store
There are actually two scale-out points here in these patterns: the Command Handlers and Event Handlers. As noted above, most start with scaling out the Event Handler portions, or the Commits in your case to the EventStore library, because this is usually the biggest bottleneck due to the need to persist it somewhere (e.g. Microsoft SQL Server database).
I myself am using a few different providers to test for the best performance to "queue up" these commits: CouchDB and .NET's AppFabric Cache (which has a great GetAndLock() feature). [OT]I really like AppFabric's durable-cache features that let you create redundant cache servers that back up your regions across multiple machines; your cache therefore stays alive as long as there is at least 1 server up and running.[/OT]
So, imagine your Event Handlers do not write the commits to the EventStore directly. Instead, you have a handler insert them into a "queue" system, such as Windows Azure Queue, CouchDB, Memcache, AppFabric Cache, etc. The point is to pick a system with little to no blocks to queue up the events, but something that is durable with redundancy built-in (Memcache being my least favorite for redundancy options). You must have that redundancy, in the case that if a server drops, you still have the event queued up.
To finally commit from this "Queued Event", there are several options. I like Windows Azure's Queue pattern for this, because of the many "workers" you can have constantly looking for work in the queue. But it doesn't have to be Windows Azure - I've mimicked Azure's Queue pattern in local code using a "Queue" and "Worker Roles" running in background threads. It scales really nicely.
Say you have 10 workers constantly looking into this "queue" for any User Updated events (I usually write a single worker role per Event type, which makes scaling out easier as you get to monitor the stats of each type). Two events get inserted into the queue, the first two workers instantly pick up a message each, and insert them (Commit them) directly into your EventStore at the same time - multithreading, as Jonathan mentioned in his answer. Your bottleneck with that pattern would be whatever database/eventstore backing you select. Say your EventStore is using MSSQL and the bottleneck is still 3,000 RPS. That is fine, because the system is built to 'catch up' when the load drops down to, say, 50 RPS after a 20,000 burst. This is the natural pattern CQRS allows for: "Eventual Consistency."
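A minimal local version of that queue-and-workers pattern, sketched in Python with background threads. Everything here is a stand-in: run_workers and commit are hypothetical names, the in-memory queue is not durable, and the list plays the role of the real event store.

```python
import queue
import threading


def run_workers(events, commit, worker_count=2):
    """Drain queued events with several background workers."""
    pending = queue.Queue()
    for event in events:
        pending.put(event)

    def worker():
        while True:
            try:
                event = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            commit(event)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


event_store = []               # stand-in for the real event store
store_lock = threading.Lock()


def commit(event):
    with store_lock:           # the store is the shared bottleneck
        event_store.append(event)


burst = [f"UserUpdated-{i}" for i in range(100)]
run_workers(burst, commit, worker_count=10)
```

The producers only pay the cost of the enqueue; the workers absorb the burst and catch up at whatever rate the store allows, in no guaranteed order.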
I said there was other scale-out patterns native to the CQRS patterns. Another, as I mentioned above, is the Command Handlers (or Command Events). This is one I have done as well, especially if you have a very rich domain domain as one of my clients does (dozens of processor-intensive validation checks on every Command). In that case, I'll actually queue off the Commands themselves, to be processed in the background by some worker roles. This gives you a nice scale out pattern as well, because now your entire backend, including the EvetnStore commits of the Events, can be threaded.
Obviously, the downside to that is that you loose some real-time validation checks. I solve that by usually segmenting validation into two categories when structuring my domain. One is Ajax or real-time "lightweight" validations in the domain (kind of like a Pre-Command check). And the others are hard-failure validation checks, that are only done in the domain but not available for realtime checking. You would then need to code-for-failure in Domain model. Meaning, always code for a way out if something fails, usually in the form of a notification email back to the user that something went wrong. Because the user is no longer blocked by this queued Command, they need to be notified if the command fails.
And your validation checks that need to go to the 'backend' are going to your Query or "read-only" database, riiiight? Don't go into the EventStore to check for, say, a unique email address. You'd be doing your validation against the highly available read-only datastore that serves the Queries of your front end. Heck, dedicate a single CouchDB document to nothing but a list of all email addresses in the system as your Query portion of CQRS.
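As a rough sketch of that idea, the read side can keep a dedicated projection of email addresses that is updated from events and queried during validation. The event shapes and the `EmailReadModel` class below are assumptions made for the example, with an in-memory set playing the role of the single CouchDB document:

```python
class EmailReadModel:
    """Hypothetical read-side projection holding every email in the system."""

    def __init__(self):
        self._emails = set()

    def apply(self, event):
        """Update the projection from a domain event (assumed event shapes)."""
        if event["type"] == "UserRegistered":
            self._emails.add(event["email"].lower())
        elif event["type"] == "UserEmailChanged":
            self._emails.discard(event["old_email"].lower())
            self._emails.add(event["new_email"].lower())

    def is_taken(self, email):
        """Answer the uniqueness question without touching the event store."""
        return email.lower() in self._emails


if __name__ == "__main__":
    emails = EmailReadModel()
    emails.apply({"type": "UserRegistered", "email": "Ann@example.com"})
    print(emails.is_taken("ann@example.com"))
```

Because the projection is only eventually consistent, a check like this can be slightly stale, so the domain would still need its own hard check when the Command is actually processed.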
CQRS is just a set of suggestions... If you really need real-time checking of a heavy validation method, you can build a Query (read-only) store around it and speed up the validation at the PreCommand stage, before the Command gets inserted into the queue. Lots of flexibility. And I would even argue that validating things like empty Usernames and empty Emails is not even a domain concern but a UI responsibility (off-loading the need to do real-time validation in the domain). I've architected a few projects where I had very rich UI validation on my MVC/MVVM ViewModels. Of course my Domain had very strict validation, to ensure a Command is valid before processing it. But moving the trivial input-validation checks, or what I call "lightweight" validation, up into the ViewModel layers gives near-instant feedback to the end user without reaching into my domain. (There are tricks to keep that in sync with your domain as well.)
So in summary, possibly look into queuing off those Events before they are committed. This fits nicely with EventStore's multi-threading features as Jonathan mentions in his answer.
We built a small boilerplate for massive concurrency using Erlang/Elixir (https://github.com/work-capital/elixir-cqrs-eventsourcing), using Eventstore. We still have to optimize DB connections, pooling, etc., but the idea of having one process per aggregate with multiple DB connections is aligned with your needs.
I am working on my bachelor's thesis project, which should be a Minecraft server written in Scala and Akka. The server should be easily deployable in the cloud or onto a cluster (I'm not sure whether I'm using the proper terminology... it should run on multiple nodes). I am, however, a newbie in Akka, and I have been wondering how to implement such a thing. The problem I'm trying to figure out right now is how to share state among actors on different nodes. My first idea was to have a Camel actor that would read the TCP stream from Minecraft clients and then send it to a load balancer, which would select a node to process the request and then send a response to the client via TCP. Let's say I have an actor implementing AuthenticationService that checks whether the credentials provided by the user are valid. Every node would have such an actor (or perhaps more of them), and all of these actors should have exactly the same database (or state) of users all the time. My question is, what is the best approach to keep this state? I have come up with a few solutions, but I haven't done anything like this before, so please point out the faults:
Solution #1: Keep state in a database. This would probably work very well for this authentication example, where state is only represented by something like a list of usernames and passwords, but it probably wouldn't work in cases where state contains objects that can't be easily broken into integers and strings.
Solution #2: Every time a request to a certain actor changes its state, the actor will, after processing the request, broadcast information about the change to all other actors of the same type, which would then change their state according to the info sent by the original actor. This seems very inefficient and rather clumsy.
Solution #3: Have a certain node serve as a sort of state node, containing actors that represent the state of the entire server. Any actor outside that node would have no state and would ask the actors in the "state node" every time it needs some data. This also seems inefficient and not very fault-tolerant.
So there you have it. The only solution I actually like is the first one, but like I said, it probably works for only a very limited subset of problems (when state can be broken into Redis structures). Any response from more experienced gurus would be very much appreciated.
Regards, Tomas Herman
Solution #1 could possibly be slow. Also, it is a bottleneck and a single point of failure (meaning the application stops working if the node with the database fails). Solution #3 has similar problems.
Solution #2 is less trivial than it seems. First, it is a single point of failure. Second, there are no atomicity or other ordering guarantees (such as regularity) for reads or writes, unless you do a total order broadcast (which is more expensive than a regular broadcast). In fact, most distributed register algorithms do broadcasts under the hood, so, while inefficient, broadcasting may be necessary.
From what you've described, you need atomicity for your distributed register. What do I mean by atomicity? Atomicity means that any read or write in a sequence of concurrent reads and writes appears as if it occurs at a single point in time.
Informally, in Solution #2 with a single actor holding a register, this guarantees that if two subsequent writes to the register occur, W1 and then W2 (meaning two broadcasts), then no other actor reading values from the register will see them in any order other than first W1 and then W2 (it's actually more involved than that). If you go through a couple of examples of subsequent broadcasts where messages arrive at their destinations at different points in time, you will see that such an ordering property isn't guaranteed at all.
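One such example can be worked through in a few lines. This is a toy simulation rather than Akka code: the `Replica` class and the two delivery schedules are assumptions chosen to show how a plain broadcast lets two replicas apply W1 and W2 in different orders.

```python
class Replica:
    """Toy replica that applies writes in whatever order they are delivered."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.applied = []

    def deliver(self, write):
        label, value = write
        self.value = value  # last delivered write wins
        self.applied.append(label)


def simulate():
    w1 = ("W1", "old-password")
    w2 = ("W2", "new-password")  # issued after W1 by the same writer

    a, b = Replica("A"), Replica("B")

    # Replica A happens to receive the broadcasts in the order they were sent.
    for write in (w1, w2):
        a.deliver(write)

    # W1 is delayed on its way to replica B, so B sees W2 first.
    for write in (w2, w1):
        b.deliver(write)

    return a, b


if __name__ == "__main__":
    a, b = simulate()
    print(a.name, a.applied, a.value)  # A ['W1', 'W2'] new-password
    print(b.name, b.applied, b.value)  # B ['W2', 'W1'] old-password
```

After both broadcasts have been delivered everywhere, the two replicas still disagree, and a reader asking B gets the stale value. A total order broadcast would rule out the second delivery schedule.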
If ordering guarantees or atomicity aren't an issue, some sort of a gossip-based algorithm might do the trick to slowly propagate changes to all the nodes. This probably wouldn't be very helpful in your example.
If you want a fully fault-tolerant and atomic register, I recommend reading this book on reliable distributed programming by Rachid Guerraoui and Luís Rodrigues, or at least the parts related to distributed register abstractions. These algorithms are built on top of a message-passing communication layer and maintain a distributed register supporting read and write operations. You can use such an algorithm to store distributed state information. However, they aren't applicable to thousands of nodes or large clusters because they do not scale, typically having complexity polynomial in the number of nodes.
On the other hand, you may not need to have the state of the distributed register replicated across all of the nodes. Replicating it across a subset of your nodes (instead of just one node) and accessing those nodes to read from or write to it provides a certain level of fault tolerance (the register information is lost only if the entire subset of nodes fails). You can possibly adapt the algorithms in the book to serve this purpose.
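To make the subset idea concrete, here is a heavily simplified majority-quorum sketch. It is not one of the algorithms from the book: it assumes a single writer, no concurrent operations, and in-process `Node` objects instead of real message passing, and all names are hypothetical.

```python
class Node:
    """Toy replica holding a versioned copy of the register."""

    def __init__(self):
        self.up = True
        self.version = 0
        self.value = None

    def store(self, version, value):
        if not self.up:
            return False  # no acknowledgement from a crashed node
        if version > self.version:
            self.version, self.value = version, value
        return True

    def load(self):
        return (self.version, self.value) if self.up else None


class QuorumRegister:
    """Single-writer register replicated over a fixed subset of nodes."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.majority = len(replicas) // 2 + 1
        self._next_version = 1  # single-writer assumption: local counter

    def write(self, value):
        version = self._next_version
        self._next_version += 1
        acks = sum(node.store(version, value) for node in self.replicas)
        if acks < self.majority:
            raise RuntimeError("write failed: no majority of replicas reachable")

    def read(self):
        replies = [r for r in (node.load() for node in self.replicas) if r is not None]
        if len(replies) < self.majority:
            raise RuntimeError("read failed: no majority of replicas reachable")
        # Any two majorities overlap, so the newest completed write is among the replies.
        return max(replies, key=lambda reply: reply[0])[1]


if __name__ == "__main__":
    nodes = [Node(), Node(), Node()]
    register = QuorumRegister(nodes)
    register.write("users-v1")
    nodes[0].up = False           # one replica crashes
    register.write("users-v2")    # still succeeds with 2 of 3
    nodes[0].up, nodes[1].up = True, False
    print(register.read())        # users-v2
```

With three replicas the register survives any single failure, and reads and writes start failing only once a majority of the subset is unreachable. Handling concurrent writers and full atomicity needs the extra machinery the book describes.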