Designing event-based architecture for the customer service - apache-kafka

Being a developer with solid experience, i am only entering the world of microservices and event-driven architecture. Things like loose coupling, independent scalability and proper implementation of asynchronous business processes is something that i feel should get simplified as compared with traditional monolith approach. So giving it a try, making a simple PoC for myself.
I am considering making a simple application where user can register, login and change the customer details. However, i want to react on certain events asynchronously:
customer logs in - we send them an email, if the IP address used is new to the system.
customer changes their name, we send them an email notifying of the change.
The idea is to make a separate application that reacts on "CustomerLoggedIn", "CustomerChangeName" events.
Here i can think of three approaches, how to implement this simple functionality, with each of them having some drawbacks. So, when a customer submits their name change:
Store change name Changed name is stored in the DB + an event is sent to Kafkas when the DB transaction is completed. One of the big problems that arise here is that if a customer had 2 tabs open and almost simultaneously submits a change from initial name "Bob" to "Alice" in one tab and from "Bob" to "Jim" in another one, on a database level one of the updates overwrites the other, which is ok, however we cannot guarantee the order of the events to be the same. We can use some checks to ensure that DB update is only done when "the last version" has been seen, thus preventing the second update at all, so only one event will be emitted. But in general case, this pattern will not allow us to preserve the same order of events in the DB as in Kafka, unless we do DB change + Kafka event sending in one distributed transaction, which is anti-pattern afaik.
Change the name in the DB, and use Debezium or similar DB CDC to capture the event and stream it. Here we get a single event source, so ordering problem is solved, however what bothers me is that i lose the ability to enrich the events with business information. Another related drawback is that CDC will stream all the updates in the "customer" table regardless of the business meaning of the event. So, in this case, i will probably need to build a Kafka Streams application to convert the DB CDC events to business events and decouple the DB structure from event structure. The potential benefit of this approach is that i will be able to capture "direct" DB changes in the same manner as those originated in the application.
Emit event from the application, without storing it in the DB. One of the subscribers might to the DB persistence, another will do email sending, etc. The biggest problem i see here is - what do i return to the client? I cannot say "Ok, your name is changed", it's more like "Ok, you request has been recorded and will be processed". In case if the customer quickly hits refresh - he expects to see his new name, as we don't want to explain to the customers what's eventual consistency, do we? Also the order of processing the same event by "email sender" and "db updater" is not guaranteed, so i can send an email before the change is persisted.
I am looking for advices regarding any of these three approaches (and maybe some others i am missing), maybe the usecases when one can be preferrable over others?

It sounds to me like you want event sourcing. In event sourcing, all you need to store is the event: the current state of a customer is derived from replaying the events (either from the beginning of time, or since a snapshot: the snapshot is just an optional optimization). Some other process (there are a few ways to go about this) can then project the events to Kafka for consumption by interested parties. Since every event has a sequence number, you can use the sequence number to prevent concurrent modification (alternatively, the more actor modely event-sourcing implementations can use techniques like cluster sharding in Akka to achieve the same ends).
Doing this, you can have a "write-side" which processes the updates in a strongly consistent manner and can respond to queries which only involve a single customer having seen every update to that point (the consistency boundary basically makes customer in this case an aggregate in domain-driven-design terms). "Read-sides" consuming events are eventually consistent: the latencies are typically fairly short: in this case your services sending emails are read-sides (as would be a hypothetical panel showing names of all customers), but the customer's view of their own data could be served by the write-side.
(The separation into read-sides and write-side (the pluralization is significant) is Command Query Responsibility Segregation, which sometimes gets interpreted as "reads can only be served by a read-side". This is not totally accurate: for one thing a write-side's model needs to be read in order for the write-side to perform its task of validating commands and synchronizing updates, so nearly any CQRS-using project violates that interpretation. CQRS should instead be interpreted as "serve reads from the model that makes the most sense and avoid overcomplicating a model (including that model in the write-side) to support a new read".)

I think I qualify to answer this, having extensively used debezium for simplifying the architecture.
I would prefer Option 2:
Every transaction always results in an event emitted in correct order
Option 1/3 has a corner case, what if transaction succeeds, but application fails to emit the event?
To your point:
Another related drawback is that CDC will stream all the updates in
the "customer" table regardless of the business meaning of the event.
So, in this case, i will probably need to build a Kafka Streams
application to convert the DB CDC events to business events and
decouple the DB structure from event structure.
I really dont think that is a roadblock. The benefit you get is potentially other usecases may crop up where another consumer to this topic may want to read other columns of the table.
Option 1 and 3 are only going to tie this to your core application logic, and that is not doing any favor from simplifying PoV. With option 2, with zero code changes to core application APIs, a developer can independently work on the events, with no need to understand that core logic.

Related

Axon or Kafka to support CQRS/ES

Consider the simple use case in which I want to store product ratings as events in an event store.
I could use two different approaches:
Using Axon: A Rating aggregate is responsible for handling the CreateRatingCommand and sending the RatingCreatedEvent. Sending the event would case the Rating to be stored in the event store. Other event handlers have the possibility to replay the event stream when connecting to the Axon server instance and doing whatever needed with the ratings. In this case, the event handler will be used as a stream processor.
Using Kafka: A KafkaProducer will be used to store a Rating POJO (after proper serialization) in a Kafka topic. Setting the topic's retention time to indefinite would cause no events to get lost in time. Kafka Streams would in this case be used to do the actual rating processing logic.
Some architectural questions appear to me for both approaches:
When using Axon:
Is there any added value to use Axon (or similar solutions) if there is no real state to be maintained or altered within the aggregate? The aggregate just serves as a "dumb" placeholder for the data, but does not provide any state changing logic.
How does Axon handle multiple event handlers of the same event type? Will they all handle the same event (same aggregate id) in parallel, or is the same event only handled once by one of the handlers?
Are events stored in the Axon event store kept until the end of time?
When using Kafka:
Kafka stores events/messages with the same key in the same partition. How does one select the best value for a key in the use case of user-product ratings? UserId, ProductId or a separate topic for both and publish each event in both topics.
Would it be wise to use a separate topic for each user and each product resulting in a massive amount of topics on the cluster? (Approximately <5k products and >10k users).
I don't know if SO is the preferred forum for this kind of questions... I was just wondering what you (would) recommend in this particular use case as the best practise. Looking forward to your feedback and feel free to point out other points of thought I missed in the previous questions.
EDIT#12/11/2020 : I just found a related discussion containing useful information related to my question.
As Jan Galinski already puts it, this hasn't got a fool proof answer to it really. This is worth a broader discussion on for example indeed AxonIQ's Discuss forum. Regardless, there are some questions in here I can definitely give an answer to, so let's get to it:
Axon Question 1 - Axon Framework is as you've noticed used a lot for DDD centric applications. Nothing however forces you to base yourself on that notion at all. You can strip the framework from Event Sourcing specifics, as well as modelling specifics entirely and purely go for the messaging idea of distinct commands, events and queries. It has been a conscious decision to segregate Axon Framework version 3 into these sub-part when version 4 (current) was released actually. Next to that, I think there is great value in not just basing yourself on event messages. Using distinct commands and queries only further decouples your components, making for a far richer and easier to extend application landscape.
Axon Question 2 - This depends on where the #EventHandler annotated methods are located actually. If they're in the same class only one will be invoked. If they're positioned into distinct classes, then both will receive the same event. Furthermore if they're segregated between distinct classes, it is important to note Axon uses an Event Processor as the technical solution to invoking your event handlers. If distinct classes are grouped under the same Event Processor, you can impose a certain ordering which handler is invoked first. Next to this if the event handling should occur in parallel, you will have to configure a so called TrackingEventProcessor (the default in Axon Framework), as it allows configuration of several threads to handle events concurrently. Well, to conclude this section, everything you're asking in question two is an option, neither a necessity. Just a matter of configuration really. Might be worth checking up on this documentation page of Axon Framework on the matter.
Axon Question 3 - As Axon Server serves the purpose of an Event Store, there is no retention period at all. So yes, they're by default kept until the end of time. There is nothing stopping your from dropping the events though, if you feel there's no value in storing the events to for example base all your models on (as you'd do when using Event Sourcing).
It's the Kafka question I'm personally less familiar with (figures as a contributor to Axon Framework I guess). I can give you my two cents on the matter here too though, although I'd recommend a second opinion here:
Kafka Question 1 - From my personal feeling of what such an application would require, I'd assume you'd want to be able to retrieve all data for a given product as efficient as possible. I'd wager it's important that all events are in the same partition to make this process as efficient as possible, is it wouldn't require any merging afterwards. With this in mind, I'd think using the ProductId will make most sense.
Kafka Question 2 - If you are anticipating only 5_000 products and 10_000 users, I'd guess it should be doable to have separate topics for these. Opinion incoming - It is here though were I personally feel that Kafka's intent to provide you direct power to decide on when to use topics over complicates from what you'd actually try to achieve, which business functionality. Giving the power to segregate streams feels more like an after thought from the perspective of application development. As soon as you'd require an enterprise grade/efficient message bus, that's when this option really shines I think, as then you can optimize for bulk.
Hoping all this helps you further #KDW!

understanding Lagoms persistent read side

I read through the Lagom documentation, and already wrote a few small services that interact with each other. But because this is my first foray into CQRS i still have a few conceptual issues about the persistent read side that i don't really understand.
For instance, i have a user-service that keeps a list of users (as aggregates) and their profile data like email addresses, names, addresses, etc.
The questions i have now are
if i want to retrieve the users profile given a certain email-address, should i query the read side for the users id, and then query the event-store using this id for the profile data? or should the read side already keep all profile information?
If the read side has all information, what is the reason for the event-store? If its truly write-only, it's not really useful is it?
Should i design my system that i can use the event-store as much as possible or should i have a read side for everything? what are the scalability implications?
if the user-model changes (for instance, the profile now includes a description of the profile) and i use a read-side that contains all profile data, how do i update this read side in lagom to now also contain this description?
Following that question, should i keep different read-side tables for different fields of the profile instead of one table containing the whole profile
if a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
As you can see, this whole concept hasn't really 'clicked' yet, and i am thankful for answers and/or some pointers.
if i want to retrieve the users profile given a certain email-address, should i query the read side for the users id, and then query the event-store using this id for the profile data? or should the read side already keep all profile information?
You should use a specially designed ReadModel for searching profiles using the email address. You should query the Event-store only to rehydrate the Aggregates, and you rehydrate the Aggregates only to send them commands, not queries. In CQRS an Aggregate may not be queried.
If the read side has all information, what is the reason for the event-store? If its truly write-only, it's not really useful is it?
The Event-store is the source of truth for the write side (Aggregates). It is used to rehydrate the Aggregates (they rebuild their internal & private state based on the previous emitted events) before the process commands and to persist the new events. So the Event-store is append-only but also used to read the event-stream (the events emitted by an Aggregate instance). The Event-store ensures that an Aggregate instance (that is, identified by a type and an ID) processes only a command at a time.
if the user-model changes (for instance, the profile now includes a description of the profile) and i use a read-side that contains all profile data, how do i update this read side in lagom to now also contain this description?
I don't use any other framework but my own but I guess that you rewrite (to use the new added field on the events) and rebuild the ReadModel.
Following that question, should i keep different read-side tables for different fields of the profile instead of one table containing the whole profile
You should have a separate ReadModel (with its own table(s)) for each use case. The ReadModel should be blazing fast, this means it should be as small as possible, only with the fields needed for that particular use case. This is very important, it is one of the main benefits of using CQRS.
if a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
Here depends on you, the architect. It is preferred that each ReadModel owns its data, that is, it should subscribe to the right events, it should not depend on other ReadModels. But this leads to a lot of code duplication. In my experience I've seen a desire to have some canonical ReadModels that own some data but also can share it on demand. For this, in CQRS, there is also the term query. Just like commands and events, queries can travel in your system, but only from ReadModel to ReadModel.
Queries should not be sent during a client's request. They should be sent only in the background, as an asynchronous synchronization mechanism. This is an important aspect that influences the resilience and responsiveness of your system.
I've use also live queries, that are pushed from the authoritative ReadModels to the subscribed ReadModels in real time, when the answer changes.
In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
No, it does not. CQRS does not specify how the R (Read side) is updated, only that the R should not process commands and C should not be queried.

How to store sagas’ data?

From what I read aggregates must only contain properties which are used to protect their invariants.
I also read sagas can be aggregates which makes sense to me.
Now I modeled a registration process using a saga: on RegistrationStarted event it sends a ReserveEmail command which will trigger an EmailReserved or EmailReservationFailed given if the email is free or not. A listener will then either send a validation link or a message telling an account already exists.
I would like to use data from the RegistrationStarted event in this listener (say the IP and user-agent). How should I do it?
Storing these data in the saga? But they’re not used to protect invariants.
Pushing them through ReserveEmail command and the resulting event? Sounds tedious.
Project the saga to the read model? What about eventual consistency?
Another way?
Rinat Abdullin wrote a good overview of sagas / process managers.
The usual answer is that the saga has copies of the events that it cares about, and uses the information in those events to compute the command messages to send.
List[Command] processManager(List[Event] events)
Pushing them through ReserveEmail command and the resulting event?
Yes, that's the usual approach; we get a list [RegistrationStarted], and we use that to calculate the result [ReserveEmail]. Later on, we'll get [RegistrationStarted, EmailReserved], and we can use that to compute the next set of commands (if any).
Sounds tedious.
The data has to travel between the two capabilities somehow. So you are either copying the data from one message to another, or you are copying a correlation identifier from one message to another and then allowing the consumer to decide how to use the correlation identifier to fetch a copy of the data.
Storing these data in the saga? But they’re not used to protect invariants.
You are typically going to be storing events in the sagas (to keep track of what has happened). That gives you a copy of the data provided in the event. You don't have an invariant to protect because you are just caching a copy of a decision made somewhere else. You won't usually have the process manager running queries to collect additional data.
What about eventual consistency?
By their nature, sagas are always going to be "eventually consistent"; the "state" of an instance of a saga is just cached copies of data controlled elsewhere. The data is probably nanoseconds old by the time the saga sees it, there's no point in pretending that the data is "now".
If I understand correctly I could model my saga as a Registration aggregate storing all the events whose correlation identifier is its own identifier?
Udi Dahan, writing about CQRS:
Here’s the strongest indication I can give you to know that you’re doing CQRS correctly: Your aggregate roots are sagas.

Lagom | Return Values from read side processor

We are using Lagom for developing our set of microservices. The trick here is that although we are using event sourcing and persisting events into cassandra but we have to store the data in one of the graph DB as well since it will be the one that will be serving most of the queries because of the use case.
As per the Lagom's documentation, all the insertion into Graph database(or any other database) has to be done in ReadSideProcecssor after the command handler persist the events into cassandra as followed by philosophy of CQRS.
Now here is the problem which we are facing. We believe that the ReadSideProcecssor is a listener which gets triggered after the events are generated and persisted. What we want is we could return the response back from the ReadSideProcecssor to the ServiceImpl. Example when a user is added to the system, the unique id generated by the graph has to be returned as one of the response headers. How that can be achieved in Lagom since the response is constructed from setCommandHandler and not the ReadSideProcessor.
Also, we need to make sure that if due to any error at graph side, the API should notify the client that the request has failed but again exceptions occuring in ReadSideProcessor are not propagated to either PersistentEntity or ServiceImpl class. How can that be achieved as well?
Any helps are much appreciated.
The read side processor is not a listener that is attached to the command - it is actually completely disconnected from the persistent entity, it may be running on a different node, at a different time, perhaps even years in the future if you add a new read side processor that first comes up to speed with all the old events in history. If the read side processor were connected synchronously to the command, then it would not be CQRS, there would not be segregation between the command and the query side.
Read side processors essentially poll the database for new events, processing them as they detect them. You can add a new read side processor at any time, and it will get all events from all of history, not just the new ones that are added, this is one of the great things about event sourcing, you don't need to anticipate all your query needs from the start, you can add them as the query need comes.
To further explain why you don't want a connection between the two - what happens if the event persist succeeds, but the update on the graph db fails? Perhaps the graph db is crashed. Does the command have to retry? Does the event have to be deleted? What happens if the node doing the update itself crashes before it has an opportunity to fix the problem? Now your read side is in an inconsistent state from your entities. Connecting them leads to inconsistency in many failure scenarios - for example, like when you update your address with a utility company, and but your bills still go to the old address, and you contact them, and they say "yes, your new address is updated in our system", but they still go to the old address - that's the sort of terrible user experience that you are signing your users up for if you try to connect your read side and write side together. Disconnecting allows Lagom to ensure consistency between the events you have emitted on the write side, and the consumption of them on the read side.
So to address your specific concerns: ID generation should be done on the write side, or, if a subsequent ID is generated on the read side, it should also provide a way of mapping the IDs on the write side to the read side ID. And as for handling errors on the read side - all validation should be done on the write side - the write side should ensure that it never emits an event that is invalid.
Now if the read side processor encounters something that is invalid, then it has two options. One option is it could fail. In many cases, this is a good option, since if something is invalid or inconsistent, then it's likely that either you have a bug or some form of corruption. What you don't want to do is continue processing as if everything is happy, since that might make the data corruption or inconsistency even worse. Instead the read side processor stops, your monitoring should then detect the error, and you can go in and work out either what the bug is or fix the corruption. Of course, there are downsides to doing this, your read side will start lagging behind the write side while it's unable to process new events. But that's also an advantage of CQRS - the write side is able to continue working, continue enforcing consistency, etc, the failure is just isolated to the read side, and only in updating the read side. Instead of your whole system going down and refusing to accept new requests due to this bug, it's isolated to just where the problem is.
The other option that the read side has is it can store the error somewhere - eg, store the event in a dead letter table, or raise some sort of trouble ticket, and then continue processing. This way, you can go and fix the event after the fact. This ensures greater availability, but does come at the risk that if that event that it failed to process was important to the processing of subsequent events, you've potentially just got yourself into a bigger mess.
Now this does introduce specific constraints on what you can and can't do, but I can't really anticipate those without specific knowledge of your use case to know how to address them. A common constraint is set validation - for example, how do you ensure that email addresses are unique to a single user in your system? Greg Young (the CQRS guy) wrote this blog post about those types of problems:
http://codebetter.com/gregyoung/2010/08/12/eventual-consistency-and-set-validation/

Using aggregates and Domain events with nosql storage

I'm wandering on DDD and NoSql field actually. I have a doubt now: i need to produce events from the aggregate and i would like to use a NoSql storage. But how can i be sure that events are saved on the storage AND the changes on the aggregate root not having transactions?
Does it makes sense? Is there a way to do this without being forced to use event sourcing or a transactional db?
Actually i was lookin at implementing a 2 phase commit algorithm but it seems pretty heavy from a performance point of view...
Am i approaching the problem the wrong way?
Stuffed with questions...
Thanks for every suggestion
Enrico
PS
I'm a newbie on stackoverflow so any suggestion/critic/... is more than welcome
Enrico
Edit 1
Well i would need events to notify aggregates that something happened and i they should react to the change. The problem arise when such events are important for the business logic. As far as i understood, after a night of thinking, i can't use a nosql storage to do such things. Let me explain (thinking with loud voice :P):
With ES (1st scenery): I save the "diff" of the data. Then i produce an event associated with it. 2 operations.
With ES (2nd scenery): I save the "diff" of the data. A process, watch the ES and produce the event. But i'm tied to having only one watcher process to ensure the correct ordering of events.
With ES (3d scenery): Idempotent events. The events can be inferred by the state and every reapplication of the event can cause a change on the consumer only once, can have multiple "dequeue" processes, duplicates can't possibly happen. 1 operation, but it introduce heavy limitations on the consumers.
In general: I save the aggregate's data. Then i produce an event associated with it. 2 operations.
Now the question becomes wider imho, is it possible to work with domain events and nosql when such domain events are fundamental part of the business process?
I think that could be a better option to go relational... even if i would need to add quite a lot of machines to get the same performances.
Edit 2
For the sake of completness, searching for "domain events nosql idempotent" on google: http://svendvanderveken.wordpress.com/2011/08/26/transactional-event-based-nosql-storage/
If you need Event Sourcing, you should store events only.
This should be the sequence:
the aggregate root recieves a command
it fires proper events
events are stored
Each aggregate's re-hydratation should be done only by executing events over them. You can create aggregates' snapshots if you measure performance problems on their initialization, but this doesn't require two-phase commits, since you can build snapshots asynchronously via batch.
Note however that you need CQRS and/or Event Sourcing only if your application is heavily concurrent and you need to cope with partition tolerance and compensating actions.
edit
Event Sourcing is alternative to the persistence of object state. You either store the events or the state of the object model. You can save snapshot, but they're just performance tools: your application must be able to work without them. You can consider such snapshots as a caching technique. As an alternative you can persist object state (the classical model), but in that case you don't need to store events.
In my own DDD application, I use observable entities to decouple (via direct events' subscription from the repository) aggregates and their persistence. For example your repository can subscribe each domain events, and execute the actions required by the application (persist to the store, dispatch to a queue and so on...). But as a persistence technique, Event Sourcing is alternative to classical persistence of the observable object state. In most scenarios you don't need both.
edit 2
A final note: if you choose ES, one of the events subscriber can build a relational read-model too.