Synchronising events between microservices using Kafka and the MongoDB connector

I'm experimenting with a microservices architecture. I have a UserService and a ShoppingService.
In UserService I'm using MongoDB. When I create a new user in UserService, I want to sync basic user info to ShoppingService. In UserService I'm using something like event sourcing: when I create a new User, I first create a UserCreatedEvent and then apply the event to the domain User object. So in the end I get a domain User object that holds the current state plus a list of events containing one UserCreatedEvent.
I wonder whether I should persist the events as a nested property of the User document or in a separate UserEvents collection. I was planning to use Kafka Connect to synchronize the events from UserService to ShoppingService.
If I persist the events inside the User document, I don't need the transaction I would otherwise use to save the event to a separate UserEvents collection, but I can't set up the Kafka connector to track changes in the nested property only.
If I persist the events in a separate UserEvents collection, I need to wrap the changes to User and UserEvents in a transaction. But saving events to a separate collection makes setting up the Kafka connector very easy, because I only track inserts and don't need to track updates of a nested UserEvents array inside the User document.
I think I will go with the second option for the sake of simplicity, but maybe I've missed something. Is it a good idea to implement it like this?

I would generally advise the second approach. Note that you can also eliminate the need for a transaction by observing that User is just a snapshot based on the UserEvents up to some point in the stream and thus doesn't have to be immediately updated.
With this, your read operation for User becomes: select the user from User (the latest snapshot), which includes a version/sequence number saying which event it is current as of; then select the events with later sequence numbers and apply those events to the user. If some caller wants a faster response and can tolerate stale data, a different endpoint (or an option on the query) can bypass the event replay.
You can then have some asynchronous process which subscribes to the stream of user events and updates User based on those events.
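A minimal sketch of that snapshot-plus-replay read path, assuming hypothetical UserSnapshotRepository and UserEventRepository abstractions and a version/sequence number stored on the snapshot:

import java.util.List;

public class UserQueryService {

    private final UserSnapshotRepository snapshots; // latest persisted User documents
    private final UserEventRepository events;       // UserEvents collection, ordered by sequence

    public UserQueryService(UserSnapshotRepository snapshots, UserEventRepository events) {
        this.snapshots = snapshots;
        this.events = events;
    }

    public User getUser(String userId, boolean allowStale) {
        User user = snapshots.findById(userId); // snapshot as of user.getVersion()
        if (allowStale) {
            return user; // fast path: possibly stale, no replay
        }
        // Apply any events recorded after the snapshot's version to get the current state.
        List<UserEvent> newer = events.findByUserIdAndSequenceGreaterThan(userId, user.getVersion());
        for (UserEvent event : newer) {
            user = user.apply(event);
        }
        return user;
    }
}

The asynchronous updater then only has to persist a new snapshot carrying the highest sequence number it has applied.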

Related

Version of an aggregate in event sourcing

In event sourcing, when a command is handled, all resulting domain events have to be stored, and for each event the system must increase the version of the aggregate. My event store is something like this:
(AggregateId, AggregateVersion, Sequence, Data, EventName, CreatedDate)
(AggregateId, AggregateVersion) is key
In some cases it does not make sense to increase the version of an aggregate. For example,
a command registers a user and raises RegisteredUser, WelcomeEmailEvent and GiftCardEvent.
How can I handle this problem?
How can I handle this problem?
Avoid confusing your representation-of-information-changes events with your publishing-for-use-elsewhere events.
"Event sourcing", as commonly understood in the domain-driven design and CQRS space, is a kind of data model. We're talking specifically about the messages an aggregate sends to its future self that describe its own changes over time.
It's "just" another way of storing the state of the aggregate, same as we would do if we were storing information in a relational database, or a document store, etc.
Messages that we are going to send to other components and then forget about don't need to have events in the event stream.
In some cases, there can be confusion when we haven't recognized that there are multiple different processes at work.
A requirement like "when a new user is registered, we should send them a welcome email" is not necessarily part of the registration process; it might instead be an independent process that is triggered by the appearance of a RegisteredUser event. The information that you need to save for the SendEmail process would be "somewhere else" - outside of the Users event history.
An event changes the state of an aggregate, and therefore changes its version. If the state is not changed, then there should be no event for this aggregate.
In your example, I would ask myself: if WelcomeEmailEvent does not change the state of the User aggregate, then whose state does it change? Perhaps some other aggregate's - some EmailNotification aggregate that cares about successful or failed email attempts. In that case I would make it an event of the aggregate whose state it changes, and it will affect the version of that aggregate.
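To make the separation concrete, here is a minimal sketch (all class names are hypothetical, not taken from the question): only the state-changing event is appended to the user's stream and bumps its version, while the welcome email is handled by a separate process that merely reacts to it.

class RegisterUserHandler {

    private final EventStore eventStore; // hypothetical append-only store keyed by aggregate id

    RegisterUserHandler(EventStore eventStore) {
        this.eventStore = eventStore;
    }

    void handle(RegisterUser command) {
        // One state change -> one event -> one version increment on the User stream.
        UserRegistered event = new UserRegistered(command.getUserId(), command.getEmail());
        eventStore.append(command.getUserId(), /* expectedVersion */ 0, event);
    }
}

class WelcomeEmailProcess {

    private final EmailGateway emailGateway; // hypothetical

    WelcomeEmailProcess(EmailGateway emailGateway) {
        this.emailGateway = emailGateway;
    }

    // Subscribes to published UserRegistered events; writes nothing back to the User stream.
    void on(UserRegistered event) {
        // Any bookkeeping about sent/failed attempts belongs to this process's own state.
        emailGateway.sendWelcome(event.getEmail());
    }
}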

Maintain reference between aggregates

I'm trying to wrap my head around how to maintain ID references between two aggregates, e.g. when an event that affects the relationship happens on either side, the other side is updated as well in an eventually consistent manner.
I have two aggregates, one for "Team" and one for "Event", in the context of a festival with the following code:
@Aggregate
public class Event {

    @AggregateIdentifier
    private EventId eventId;
    private Set<TeamId> teams; // IDs of associated teams

    // ... protected constructor, getters/setters and command handlers ...
}

@Aggregate
public class Team {

    @AggregateIdentifier
    private TeamId teamId;
    private EventId eventId; // Owning event

    // ... protected constructor, getters/setters and command handlers ...
}
A Team must always be associated to an event (through the eventId). An event contains a list of associated teams (through the team id set).
When a team is created (CreateTeamCommand) on the Team aggregate, I would like the TeamId set on the Event aggregate to be updated with the team id of the newly created team.
If the command "DeleteEventCommand" on the Event aggregate is executed, all teams associated to the event should also be deleted.
If a team is moved from one event to another event (MoveTeamToEventCommand) on the Team aggregate, the eventId on the Team aggregate should be updated but the TeamId should be removed from the old Event aggregate and be added to the new Event aggregate.
My current idea was to create a saga where I would run SagaLifecycle.associateWith for both the eventId of the Event aggregate and the teamId of the Team aggregate, with @StartSaga on the "CreateTeamCommand" (essentially the first time the relationship starts), and then have an event handler for every event that affects the relationship. My main issues with this solution are:
1: It would mean I would have a unique saga for each possible combination of team and event. Could this cause trouble performance-wise if it were scaled to e.g. 1 million events with 50 teams each? (This is unrealistic for this scenario but relevant for a general solution to maintaining relationships between aggregates.)
2: It would require custom commands and event handlers dedicated to updating the team list of the Event aggregate, since the resulting events should not be processed in the saga, to avoid an infinite loop of updating references.
Thank you for reading this small story and I hope someone can either confirm that I'm on the right track or point me in the direction of a proper solution.
An event contains a list of associated teams (through the team id set).
If by "An event" you mean the Event aggregate, I don't believe your Event aggregate needs team IDs. If you think it does, it would be great to understand your reasoning.
What I think you need instead is for your read side to know about this. Your read model for a single Event can listen to the events produced by CreateTeamCommand and MoveTeamToEventCommand (e.g. TeamCreatedEvent and TeamMovedToEventEvent), as well as all other Event-related events, and build up the projection accordingly. Remember: don't design your aggregates with querying concerns in mind.
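A sketch of such a projection, assuming Axon 4 and hypothetical TeamCreatedEvent / TeamMovedToEventEvent types and a hypothetical read-model repository:

import org.axonframework.eventhandling.EventHandler;

public class EventTeamsProjection {

    private final EventViewRepository repository; // read-model storage, owned by the query side

    public EventTeamsProjection(EventViewRepository repository) {
        this.repository = repository;
    }

    @EventHandler
    public void on(TeamCreatedEvent event) {
        // Record the relationship only in the read model, not in the Event aggregate.
        repository.addTeamToEvent(event.getEventId(), event.getTeamId());
    }

    @EventHandler
    public void on(TeamMovedToEventEvent event) {
        repository.removeTeamFromEvent(event.getOldEventId(), event.getTeamId());
        repository.addTeamToEvent(event.getNewEventId(), event.getTeamId());
    }
}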
If the command "DeleteEventCommand" on the Event aggregate is executed, all teams associated to the event should also be deleted.
A few things here:
Again, your read side can listen to this event and update the projections accordingly.
You can also start performing validation in the relevant command handlers of the Team aggregate to check whether the Event exists before performing the operations. This won't give you exact synchronization, but it will cover most cases (see the "How can I verify that a customer ID really exists when I place an order?" section here).
If you really want to delete the associated Team aggregates off the back of a DeleteEventCommand, you need to handle this inside a saga, as there is no way for you to perform this atomically without leaking data-storage specifics into your domain model. So you need certain retry and idempotency guarantees here, which a saga can give you. It's not exactly what you are suggesting, but a related fact is that a single command can't act on a set of aggregates; see the "How can I update a set of aggregates with a single command?" section here.
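For illustration, a heavily simplified saga sketch assuming Axon 4 with Spring; the event and command types, and the idea of carrying the team IDs on the EventDeletedEvent, are assumptions for the example rather than your actual API:

import org.axonframework.commandhandling.gateway.CommandGateway;
import org.axonframework.modelling.saga.SagaEventHandler;
import org.axonframework.modelling.saga.SagaLifecycle;
import org.axonframework.modelling.saga.StartSaga;
import org.axonframework.spring.stereotype.Saga;
import org.springframework.beans.factory.annotation.Autowired;

import java.util.HashSet;
import java.util.Set;

@Saga
public class DeleteEventTeamsSaga {

    @Autowired
    private transient CommandGateway commandGateway;

    private final Set<TeamId> pendingTeams = new HashSet<>();

    @StartSaga
    @SagaEventHandler(associationProperty = "eventId")
    public void on(EventDeletedEvent event) {
        // The team IDs could equally be looked up from a read model here.
        pendingTeams.addAll(event.getTeamIds());
        pendingTeams.forEach(teamId -> commandGateway.send(new DeleteTeamCommand(teamId)));
    }

    @SagaEventHandler(associationProperty = "eventId")
    public void on(TeamDeletedEvent event) {
        // Assumes TeamDeletedEvent also carries the owning eventId so it associates with this saga.
        pendingTeams.remove(event.getTeamId());
        if (pendingTeams.isEmpty()) {
            SagaLifecycle.end();
        }
    }
}

The saga gives you a natural place to add retries and keep the per-team deletions idempotent, which is exactly what an atomic multi-aggregate update cannot give you.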

Microservices "JOIN" tables within different databases and data replication

I'm trying to achieve a data join between entities.
I've got 2 separate microservices which can communicate with each other using events (RabbitMQ), and all the requests currently go through an API gateway.
Suppose my first service is UserService, and the second service is ProductService.
Usually, to get a list of products we do a GET request to /products; likewise, when we want to create a product, we do a POST request to /products.
The product schema looks something like this:
{
  title: 'ProductTitle',
  description: 'ProductDescription',
  user: 'userId',
  ...
}
The user schema looks something like this:
{
  username: 'UserUsername',
  email: 'UserEmail',
  ...
}
So, when creating a product or getting a list of products, we will not have user details such as email and username.
What I'm trying to achieve is to get the products together with the user details when creating or querying for a list of products, like so:
[
  {
    title: 'ProductTitle',
    description: 'ProductDescription',
    user: {
      username: 'UserUsername',
      email: 'UserEmail'
    }
  }
]
I could make a REST GET request to UserService to get the user details for each product.
But my concern is that if UserService goes down, the products will not have user details.
What are other ways to JOIN tables, other than making REST API calls?
I've read about data replication, but here's another concern: how do we keep a copy of the user details in ProductService when we create a new product with a POST request?
Usually I do not want to keep a copy of a user's details in ProductService if they have not created a product. I could also have the services emit events to each other.
Approach 1 - Data Replication
Data replication is not harmful as long as it makes your service independent and resilient, but too much data replication is not good either. Microservices don't fit every case well, so we have to compromise on some things.
Approach 2 - Event sourcing and materialized views
Generally, if your data is spread across multiple services, you should be considering event sourcing and materialized views. These views are precomputed, disposable data tables that are updated from the events published by the different services: say your user service publishes an event, then you update your view; when another related event is published, you add to or update the materialized views, and so on. These views can be kept in a cache for fast retrieval and queried to get the data. This pattern adds a little complexity, but it's highly scalable.
Event sourcing is basically a store that saves all your events and can replay them to reach a particular state of the system. Generally we create materialized views from the event store.
Say, for example, you have an event store where you keep saving all your published events, and at the same time you are updating your materialized views. When you want to query the data, you get it from the materialized views. Since materialized views are disposable, they can always be regenerated from the event store: if a materialized view held in cache gets corrupted, or you miss a cache hit, you can completely regenerate the view by replaying the events. You can find more in the following links.
Event Sourcing, Materialized view
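A minimal in-memory sketch of such a materialized view for the product/user case; the event record types are assumptions, and a real view would live in a cache or read database rather than in maps:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

record UserCreated(String userId, String username, String email) {}
record ProductCreated(String productId, String title, String description, String userId) {}
record UserDetails(String username, String email) {}
record ProductView(String title, String description, UserDetails user) {}

class ProductWithUserView {

    private final Map<String, UserDetails> usersById = new HashMap<>();
    private final Map<String, ProductView> productsById = new HashMap<>();

    void apply(Object event) {
        if (event instanceof UserCreated e) {
            usersById.put(e.userId(), new UserDetails(e.username(), e.email()));
        } else if (event instanceof ProductCreated e) {
            productsById.put(e.productId(),
                    new ProductView(e.title(), e.description(), usersById.get(e.userId())));
        }
        // Updates and deletes are handled the same way with their own event types.
    }

    // The view is disposable: it can always be rebuilt by replaying the whole event store.
    static ProductWithUserView rebuild(List<Object> events) {
        ProductWithUserView view = new ProductWithUserView();
        events.forEach(view::apply);
        return view;
    }
}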
Actually, we are working with data replication to make each microservice more resilient (giving it the chance to keep working even if another service is down).
This can be achieved in many ways, e.g. in your case by having the ProductService listen to the events sent by the UserService when a user is created, deleted, etc.
Or the UserService could expose a feed that the ProductService reads every n minutes or so, marking the last position it has read on the feed. Etc.
There are many things to consider when designing services, and it really depends on your system's mission. E.g. you always have to evaluate the impact of coupling: is it acceptable for a service not to be able to work when another service is down? How important is a service, and what is the impact on other services when it is unavailable?
If you do not want to keep a copy of data you don't need, you could store only the data of users that are related to a product. If a new product is created for a user that is not in your dataset, you would then get it from the UserService. This gives you stronger coupling than replicating everything, but weaker coupling than replicating no data at all; a sketch of this option follows below.
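A rough sketch of that middle ground, with a hypothetical local repository and UserService client:

class ProductUserReplica {

    private final LocalUserRepository localUsers;       // ProductService-owned copy
    private final UserServiceClient userServiceClient;  // remote call, used only on a miss

    ProductUserReplica(LocalUserRepository localUsers, UserServiceClient userServiceClient) {
        this.localUsers = localUsers;
        this.userServiceClient = userServiceClient;
    }

    UserDetails userForProduct(String userId) {
        UserDetails user = localUsers.findById(userId);
        if (user == null) {
            // First product for this user: fetch once from UserService and keep a local copy,
            // then keep it up to date from UserService events.
            user = userServiceClient.getUser(userId);
            localUsers.save(user);
        }
        return user;
    }
}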
Again, it really depends on what your system is designed for and what it needs to achieve.

Event Sourcing and dealing with data dependencies

Given a REST API with the following operations resulting in events posted to Kafka:
AddCategory
UpdateCategory
RemoveCategory
AddItem (refers to a category by some identifier)
UpdateItem
RemoveItem
The environment is one where multiple users may use the REST API at the same time, and the consumers must all get the same events. The consumers may be offline for an extended period of time (more than a day). New consumers may be added, and others removed.
The problems:
Event ordering (is the only workaround a single topic/partition?)
AddItem before AddCategory, invalid category reference.
UpdateItem before AddCategory, used to be a valid reference, now invalid.
RemoveCategory before AddItem, category reference invalid.
....infinite list of other concurrency issues.
Event Store snapshots for fast resync of restarted consumers
Should there be a compacted log topic for both categories and items, each entity keyed by its identifier?
Can the whole compacted log topic be somehow identified as an offset?
Should there be only one entry in the compacted log topic, with its data containing a serialized blob of all categories and items as of a given offset (which would require a single topic/partition)?
How to deal with the handover from replaying the rendered-entities event store to the "live stream" of commands/events? Encode the offset in each item in the compacted log view, and use that to resume replaying from the live event log?
Are there other systems that fit this problem better?
I will give you a partial answer based on my experience in Event sourcing.
Event ordering (is the only workaround a single topic/partition?)
AddItem before AddCategory, invalid category reference.
UpdateItem before AddCategory, used to be a valid reference, now invalid.
RemoveCategory before AddItem, category reference invalid.
....infinite list of other concurrency issues.
All scalable event stores that I know of guarantee event ordering inside a partition only. In DDD terms, the event store ensures that the aggregate is rehydrated correctly by replaying the events in the order they were generated. An Apache Kafka topic partition seems to be a good choice for that. While this is sufficient for the write side of an application, it is harder for the read side to use. Harder, but not impossible.
Given that the events are already validated by the write side (because they represent facts that already happened), we can be sure that any inconsistency that appears in the system is due to the wrong ordering of events. Also, given that the read side is eventually consistent with the write side, the missing events will eventually reach our read models.
So, first of all, in your case "AddItem before AddCategory, invalid category reference" should in fact be "ItemAdded before CategoryAdded" (event names are in the past tense).
Second, when ItemAdded arrives, you try to load the Category by its ID, and if that fails (because of the delayed CategoryAdded event) you can create a NotYetAvailableCategory with the ID referenced by the ItemAdded event and a title of "Not Yet Available, Please Wait a Few Milliseconds". Then, when the CategoryAdded event arrives, you just update all the Items that reference that category ID. So the main idea is that you create temporary entities that are finalized when their events eventually arrive.
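A sketch of that read-model handler, with hypothetical event and repository types:

class CatalogProjection {

    private final CategoryViewRepository categories; // read-model storage
    private final ItemViewRepository items;

    CatalogProjection(CategoryViewRepository categories, ItemViewRepository items) {
        this.categories = categories;
        this.items = items;
    }

    void on(ItemAdded event) {
        CategoryView category = categories.findById(event.getCategoryId());
        if (category == null) {
            // CategoryAdded has not arrived yet: create a temporary placeholder entity.
            category = new CategoryView(event.getCategoryId(), "Not Yet Available");
            categories.save(category);
        }
        items.save(new ItemView(event.getItemId(), event.getCategoryId(), event.getTitle()));
    }

    void on(CategoryAdded event) {
        // Finalize the placeholder (or create the category) with its real title.
        categories.save(new CategoryView(event.getCategoryId(), event.getTitle()));
    }
}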
In the case of "CategoryRemoved before ItemAdded, category reference invalid": when the ItemAdded event arrives, you could check whether the category was deleted (by having a ListOfCategoriesThatWereDeleted read model) and then take the appropriate action in your Item entity; what that is depends on your business.

Service Fabric Actors - save state to database

I'm working on a sample Service Fabric project, where I have to maintain a shopping list. For this I have a ShoppingList actor, which is identifiable by a specific id. It stores the current list content in its state using StateManager. All works fine.
However, in parallel I'd like to maintain the shopping list content in a sql database. In particular:
store all add/remove item requests for future analysis (ML)
on first actor initialization, load the list content from the DB (e.g. after the cluster has been re-created)
What is the best approach to achieve that? Create a custom StateProvider (how? can't find examples)?
Or maybe have another service/actor for handling all db operations (possibly using queues and reminders)?
All the examples seem to rely completely on the default StateManager, with no data persistence to external storage, so I'm not sure what the best practice is.
The best way is to have a separate entity responsible for storing the data in the DB. The actor just sends an event (not meaning Service Fabric events) with some data about the performed operation, and the other entity catches it and performs the rest of the work.
Of course, you can implement this in the actor itself, but it brings two possible issues:
The actor will not be able to process other requests if there are issues with the DB, with the connectivity between the actor and the DB, or if the DB itself is under heavy load and processes requests slowly. The actor would have to wait until the transfer to the DB successfully completes.
The DB may be overloaded by many single connections from many actors, instead of one or a few connections from another entity doing batch insertion.
So, your final solution will depend on the workload of your system. But you will definitely need a reliable queue to safely get the data into the DB if that data is too valuable to lose.
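A very simplified sketch of the "separate persisting entity" idea: actors only enqueue change records, and one consumer drains them in batches. In a real cluster the queue should itself be reliable (e.g. a reliable collection or an external broker), not the in-memory one used here, and DatabaseWriter is a hypothetical abstraction.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ShoppingListChangeLog {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Called from the actor side: cheap, and never blocks on the database.
    void record(String changeJson) {
        queue.add(changeJson);
    }

    // Runs in the persisting entity: drains the queue and writes in batches,
    // so the DB sees a few pooled connections instead of one per actor.
    void drainToDatabase(DatabaseWriter db) throws InterruptedException {
        List<String> batch = new ArrayList<>();
        batch.add(queue.take());   // wait for at least one change
        queue.drainTo(batch, 99);  // then grab up to 99 more without blocking
        db.insertBatch(batch);     // retries/idempotency are the writer's responsibility
    }
}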
Also, I think you could use the default state manager to store logs and information about transactions before they are transferred to the DB, and remove them from the service's state after the transaction completes. There is no need to keep permanent storage of such data in the services.
Another thing to take into consideration is reading from the DB. If you have a relational database, insert new records into only one table, and have a huge number of actors querying that data on activation, you may see performance degradation, because the table will be locked for reading or writing unless you configure it to behave differently. So you will probably need a caching system for reading the data on actor activation; it depends on your workload.
As for implementing your custom state manager: take a look at this example. Basically, all you need to do is implement the IReliableStateManagerReplica interface and pass it to the StatefulService constructor.