Details of duality of Streams and Tables in Kafka Streams - apache-kafka

The Confluent documentation here states:
And Kafka exploits this duality in many ways: for example, to make your applications elastic, to support fault-tolerant stateful processing, or to run Kafka Streams Interactive Queries against your application’s latest processing results.
I wonder if there are more details on how the duality of streams/tables is used in these scenarios. I'm looking for a simple explanation rather than long design docs.

A stream can be considered a log, and a table can be considered a snapshot of that log at a given instant in time.
A stream is a flow of data: new data keeps coming, we process it as it arrives, and we store the processed results in a table for querying.
A table's data changes over time. At any given instant, we get a snapshot of that data. A table, therefore, can be used for performing queries and retrieving results on demand, which is not the case with 'just' streams.
For example,
User comments on a video can be a stream of events: new comments keep coming and simply get displayed in the UI. Nothing to query here (typically).
But there are also other use cases, like:
Cricket updates: for every new ball, we get the number of runs scored on that ball, and we need to add them to the score. We certainly need to store the previous score and update it with every new ball. We also need to query the score at any given instant (on demand). For performing queries or updating the score, we can use a table.
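A minimal Kafka Streams sketch of that cricket example (the topic and store names are hypothetical): the per-ball runs arrive as a stream, and the continuously updated score is maintained as a table.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

class ScoreTopology {
    static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topic "balls": key = match ID, value = runs scored on that ball.
        KStream<String, Integer> balls =
                builder.stream("balls", Consumed.with(Serdes.String(), Serdes.Integer()));

        // Summing the per-ball runs turns the stream into a table: one continuously
        // updated "current score" row per match, kept in a named state store that
        // can be read on demand with Kafka Streams Interactive Queries.
        KTable<String, Integer> score = balls
                .groupByKey()
                .reduce(Integer::sum,
                        Materialized.<String, Integer, KeyValueStore<Bytes, byte[]>>as("score-store")
                                .withKeySerde(Serdes.String())
                                .withValueSerde(Serdes.Integer()));

        return builder.build();
    }
}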
In the Kafka context, an event is a log message, and every message is immutable.
Consider an example of a user's information getting updated.
{user_id: 101, name: X}
{user_id: 101, name: Y}
The name for user_id=101 is updated from X to Y. When you perform the update directly on a database and then query it, you see only name: Y; the previous name of the user is gone, because it was overwritten with the new value.
In Kafka, we have two messages, 'X' and 'Y'.
At times, this may be useful and even critical. A hacker could have changed all the user's information, leaving the legitimate user with no way of proving their identity to reclaim the account. But if there is previous information about the account that they can provide as proof, they can reclaim it.
So for those who use Kafka, there are use cases for storing data as a table (or a map) and then retrieving it with queries.
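To make that last point concrete, here is a hedged Kafka Streams sketch (the topic name is made up) of reading such an update stream as a table, so the latest value per user can be queried while the log itself keeps the full history.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;

class UserTableTopology {
    static KTable<String, String> users(StreamsBuilder builder) {
        // Hypothetical topic "user-updates": key = user_id, value = the user record.
        // Read as a table, only the latest value per key is kept (101 -> "Y"),
        // while the topic itself still contains both the "X" and the "Y" messages.
        return builder.table("user-updates",
                Consumed.with(Serdes.String(), Serdes.String()));
    }
}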

Related

Designing event-based architecture for the customer service

Being a developer with solid experience, I am only now entering the world of microservices and event-driven architecture. Things like loose coupling, independent scalability, and proper implementation of asynchronous business processes are things I feel should become simpler compared with the traditional monolith approach. So I'm giving it a try and making a simple PoC for myself.
I am considering making a simple application where a user can register, log in, and change their customer details. However, I want to react to certain events asynchronously:
Customer logs in: we send them an email if the IP address used is new to the system.
Customer changes their name: we send them an email notifying them of the change.
The idea is to make a separate application that reacts to "CustomerLoggedIn" and "CustomerChangeName" events.
Here I can think of three approaches to implementing this simple functionality, each with some drawbacks. So, when a customer submits their name change:
Store the changed name in the DB, and send an event to Kafka when the DB transaction completes. One of the big problems that arises here is that if a customer had two tabs open and almost simultaneously submitted a change from the initial name "Bob" to "Alice" in one tab and from "Bob" to "Jim" in the other, then at the database level one of the updates overwrites the other, which is OK; however, we cannot guarantee that the order of the events will be the same. We can use some checks to ensure that the DB update is only done when "the last version" has been seen, thus preventing the second update entirely, so only one event will be emitted. But in the general case, this pattern will not allow us to preserve the same order of events in the DB as in Kafka, unless we do the DB change and the Kafka event sending in one distributed transaction, which is an anti-pattern AFAIK.
Change the name in the DB, and use Debezium or a similar DB CDC tool to capture the event and stream it. Here we get a single event source, so the ordering problem is solved; however, what bothers me is that I lose the ability to enrich the events with business information. Another related drawback is that CDC will stream all the updates in the "customer" table regardless of the business meaning of the event. So, in this case, I will probably need to build a Kafka Streams application to convert the DB CDC events to business events and decouple the DB structure from the event structure. The potential benefit of this approach is that I will be able to capture "direct" DB changes in the same manner as those originating in the application.
Emit the event from the application, without storing it in the DB. One of the subscribers might do the DB persistence, another the email sending, etc. The biggest problem I see here is: what do I return to the client? I cannot say "OK, your name is changed"; it's more like "OK, your request has been recorded and will be processed". If the customer quickly hits refresh, they expect to see their new name, as we don't want to explain eventual consistency to customers, do we? Also, the order in which the "email sender" and the "DB updater" process the same event is not guaranteed, so I could send an email before the change is persisted.
I am looking for advice regarding any of these three approaches (and maybe others I am missing), and perhaps the use cases where one is preferable over the others.
It sounds to me like you want event sourcing. In event sourcing, all you need to store is the event: the current state of a customer is derived from replaying the events (either from the beginning of time or since a snapshot; the snapshot is just an optional optimization). Some other process (there are a few ways to go about this) can then project the events to Kafka for consumption by interested parties. Since every event has a sequence number, you can use the sequence number to prevent concurrent modification (alternatively, the more actor-model-style event-sourcing implementations can use techniques like cluster sharding in Akka to achieve the same ends).
Doing this, you can have a "write-side" which processes the updates in a strongly consistent manner and can respond to queries which only involve a single customer having seen every update to that point (the consistency boundary basically makes customer in this case an aggregate in domain-driven-design terms). "Read-sides" consuming events are eventually consistent: the latencies are typically fairly short: in this case your services sending emails are read-sides (as would be a hypothetical panel showing names of all customers), but the customer's view of their own data could be served by the write-side.
(The separation into read-sides and write-side (the pluralization is significant) is Command Query Responsibility Segregation, which sometimes gets interpreted as "reads can only be served by a read-side". This is not totally accurate: for one thing a write-side's model needs to be read in order for the write-side to perform its task of validating commands and synchronizing updates, so nearly any CQRS-using project violates that interpretation. CQRS should instead be interpreted as "serve reads from the model that makes the most sense and avoid overcomplicating a model (including that model in the write-side) to support a new read".)
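A minimal sketch of that write side in plain Java; every type here is hypothetical and no particular event-sourcing library is assumed:

import java.util.List;

interface Event {}
record CustomerNameChanged(String customerId, String newName) implements Event {}

// Hypothetical append-only store with an expected-sequence-number check.
interface EventStore {
    List<Event> load(String customerId);
    // Fails (e.g. throws) if another writer appended an event in the meantime.
    void append(String customerId, long expectedSeqNr, Event event);
}

final class CustomerWriteSide {
    private final EventStore eventStore;

    CustomerWriteSide(EventStore eventStore) { this.eventStore = eventStore; }

    void changeName(String customerId, String newName) {
        // The current state is derived by replaying this customer's events.
        List<Event> history = eventStore.load(customerId);
        long expectedSeqNr = history.size();

        // Only the event is stored; the expected sequence number stops two
        // concurrent name changes from silently overwriting each other.
        eventStore.append(customerId, expectedSeqNr,
                new CustomerNameChanged(customerId, newName));
    }
}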
I think I am qualified to answer this, having used Debezium extensively to simplify architectures.
I would prefer Option 2:
Every transaction always results in an event emitted in the correct order.
Options 1 and 3 have a corner case: what if the transaction succeeds but the application fails to emit the event?
To your point:
Another related drawback is that CDC will stream all the updates in the "customer" table regardless of the business meaning of the event. So, in this case, I will probably need to build a Kafka Streams application to convert the DB CDC events to business events and decouple the DB structure from the event structure.
I really don't think that is a roadblock. The benefit you get is that other use cases may crop up where another consumer of this topic wants to read other columns of the table.
Options 1 and 3 will only tie this to your core application logic, and that does you no favors from a simplification point of view. With Option 2, with zero code changes to the core application APIs, a developer can work on the events independently, with no need to understand that core logic.
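For what it's worth, the Kafka Streams application the question mentions can stay quite small. A hedged sketch, with made-up topic names and record shapes and with serde configuration omitted (a real pipeline would deserialize the Debezium before/after envelope):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

// Hypothetical value types standing in for the deserialized CDC envelope.
record CustomerRow(String id, String name) {}
record CustomerRowChange(CustomerRow before, CustomerRow after) {}
record CustomerNameChangedEvent(String customerId, String oldName, String newName) {}

class CdcToBusinessEvents {
    static void define(StreamsBuilder builder) {
        // Raw CDC stream for the "customer" table (topic name depends on the connector).
        KStream<String, CustomerRowChange> cdc = builder.stream("dbserver.public.customer");

        // Keep only the changes that mean "the customer changed their name" and
        // reshape them into a business event that hides the table layout.
        cdc.filter((id, change) -> change != null
                        && change.before() != null && change.after() != null
                        && !change.before().name().equals(change.after().name()))
           .mapValues(change -> new CustomerNameChangedEvent(
                   change.after().id(), change.before().name(), change.after().name()))
           .to("customer-name-changed");
    }
}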

How to "join" a frequently updating stream with an irregularly updating stream in Apache Beam?

I have a stream of measurements keyed by an ID, PCollection<KV<ID,Measurement>>, and something like a changelog stream of additional information for that ID, PCollection<KV<ID,SomeIDInfo>>. New data is added to the measurement stream quite regularly, say once per second for every ID. The stream with additional information, on the other hand, is only updated when a user performs a manual re-configuration. We can't tell how often this happens and, in particular, the update frequency may vary among IDs.
My goal is now to enrich each entry in the measurements stream with the additional information for its ID. That is, the output should be something like PCollection<KV<ID,Pair<Measurement,SomeIDInfo>>>. Or, in other words, I would like to do a left join of the measurements stream with the additional-information stream.
I would expect this to be a quite common use case. Coming from Kafka Streams, this can be implemented quite easily with a KStream-KTable join. With Beam, however, none of my approaches so far seem to work. I have already thought about the following ideas.
Idea 1: CoGroupByKey with fixed time windows
Applying a window to the measurements stream would not be an issue. However, as the additional-information stream updates irregularly and also significantly less frequently than the measurements stream, there is no reasonable common window size such that every window contains at least one information update for each ID.
Idea 2: CoGroupByKey with a global window and a non-default trigger
Refining the previous idea, I thought about using a processing-time trigger, which fires e.g. every 5 seconds. The issue with this idea is that I need to use accumulatingFiredPanes() for the additional information, as there might be no new data for a key between two firings, but I have to use discardingFiredPanes() for the measurements stream, as otherwise my panes would quickly become too large. This simply does not work: when I configure my pipeline that way, the additional-information stream also discards changes. Setting both triggers to accumulating works, but, as I said, this is not scalable.
Idea 3: Side inputs
Another idea would be to use side inputs, but this solution is not really scalable either - at least if I'm not missing something. With side inputs, I would create a PCollectionView from the additional-information stream, which is a map of IDs to the (latest) additional information. The "join" can then be done in a DoFn with a side input of that view. However, the view seems to be shared by all instances that consume the side input. (It's a bit hard to find any information regarding this.) We would like to not make any assumptions regarding the number of IDs and the size of the additional info. Thus, using a side input also does not seem to work here.
The side input option you discuss is currently the best option, although you are correct about the scalability concern due to the side input being broadcast to all workers.
Alternatively, you can store the infrequently-updated side in an external key-value store and just do lookups from a DoFn. If you go this route, it's generally useful to do a GroupByKey first on the main input with ID as a key, which lets you cache the lookups with a good cache-hit ratio.
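To make the side-input option concrete, here is a hedged sketch against Beam's Java SDK. The element types are placeholders for the question's Measurement/SomeIDInfo, and it assumes the info stream has already been windowed, triggered, and reduced to one latest value per ID so the map view refreshes as new info arrives; the same broadcast-to-all-workers caveat applies.

import java.io.Serializable;
import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Placeholder element types.
class Measurement implements Serializable {}
class SomeIDInfo implements Serializable {}

class EnrichMeasurements {
    static PCollection<KV<String, KV<Measurement, SomeIDInfo>>> enrich(
            PCollection<KV<String, Measurement>> measurements,
            PCollection<KV<String, SomeIDInfo>> infos) {

        // The side input: latest additional info per ID, broadcast to all workers.
        PCollectionView<Map<String, SomeIDInfo>> infoView =
                infos.apply(View.<String, SomeIDInfo>asMap());

        return measurements.apply(ParDo.of(
                new DoFn<KV<String, Measurement>, KV<String, KV<Measurement, SomeIDInfo>>>() {
                    @ProcessElement
                    public void process(ProcessContext c) {
                        // Left join: emit the measurement even if no info exists yet (null).
                        SomeIDInfo info = c.sideInput(infoView).get(c.element().getKey());
                        c.output(KV.of(c.element().getKey(),
                                KV.of(c.element().getValue(), info)));
                    }
                }).withSideInputs(infoView));
    }
}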

Understanding Lagom's persistent read side

I read through the Lagom documentation and have already written a few small services that interact with each other. But because this is my first foray into CQRS, I still have a few conceptual issues about the persistent read side that I don't really understand.
For instance, I have a user-service that keeps a list of users (as aggregates) and their profile data like email addresses, names, addresses, etc.
The questions I have now are:
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's ID and then query the event store using this ID for the profile data? Or should the read side already keep all the profile information?
If the read side has all the information, what is the reason for the event store? If it's truly write-only, it's not really useful, is it?
Should I design my system so that I can use the event store as much as possible, or should I have a read side for everything? What are the scalability implications?
If the user model changes (for instance, the profile now includes a description) and I use a read side that contains all profile data, how do I update this read side in Lagom so that it now also contains this description?
Following on from that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In the latter case, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
As you can see, this whole concept hasn't really 'clicked' yet, and I am thankful for answers and/or pointers.
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's ID and then query the event store using this ID for the profile data? Or should the read side already keep all the profile information?
You should use a specially designed ReadModel for searching profiles using the email address. You should query the Event-store only to rehydrate the Aggregates, and you rehydrate the Aggregates only to send them commands, not queries. In CQRS an Aggregate may not be queried.
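A rough sketch of such a ReadModel (every name here is invented, and an in-memory map stands in for the read-side table): it is kept up to date from profile events and answers the email lookup directly, without touching the Event-store or the Aggregate.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class ProfileByEmailReadModel {
    record Profile(String userId, String email, String name) {}
    record ProfileCreated(String userId, String email, String name) {}
    record EmailChanged(String userId, String oldEmail, String newEmail) {}

    // In-memory stand-in for the read-side table, keyed by email address.
    private final Map<String, Profile> byEmail = new ConcurrentHashMap<>();

    // Event handlers keep the view up to date.
    void on(ProfileCreated e) {
        byEmail.put(e.email(), new Profile(e.userId(), e.email(), e.name()));
    }

    void on(EmailChanged e) {
        Profile old = byEmail.remove(e.oldEmail());
        if (old != null) {
            byEmail.put(e.newEmail(), new Profile(old.userId(), e.newEmail(), old.name()));
        }
    }

    // The query is served entirely from the ReadModel.
    Optional<Profile> findByEmail(String email) {
        return Optional.ofNullable(byEmail.get(email));
    }
}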
If the read side has all the information, what is the reason for the event store? If it's truly write-only, it's not really useful, is it?
The Event-store is the source of truth for the write side (the Aggregates). It is used to rehydrate the Aggregates (they rebuild their internal and private state based on the previously emitted events) before they process commands, and to persist the new events. So the Event-store is append-only, but it is also used to read the event stream (the events emitted by an Aggregate instance). The Event-store ensures that an Aggregate instance (that is, one identified by a type and an ID) processes only one command at a time.
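A rough illustration of that rehydration step in plain Java, with hypothetical event types (no Lagom API is shown): the Aggregate rebuilds its private state by replaying its own event stream before it handles a new command.

import java.util.List;

class UserProfileAggregate {
    record NameChanged(String newName) {}
    record EmailChanged(String newEmail) {}

    private String name;    // internal state only, never queried from outside
    private String email;   // internal state only, never queried from outside

    static UserProfileAggregate rehydrate(List<Object> eventStream) {
        UserProfileAggregate aggregate = new UserProfileAggregate();
        for (Object event : eventStream) {
            aggregate.apply(event);   // replay previously emitted events, in order
        }
        return aggregate;             // now ready to validate and process commands
    }

    private void apply(Object event) {
        if (event instanceof NameChanged e) {
            this.name = e.newName();
        } else if (event instanceof EmailChanged e) {
            this.email = e.newEmail();
        }
    }
}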
If the user model changes (for instance, the profile now includes a description) and I use a read side that contains all profile data, how do I update this read side in Lagom so that it now also contains this description?
I don't use any framework other than my own, but I guess that you rewrite the projection (to use the newly added field on the events) and rebuild the ReadModel.
Following on from that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
You should have a separate ReadModel (with its own table(s)) for each use case. The ReadModel should be blazing fast; this means it should be as small as possible, with only the fields needed for that particular use case. This is very important; it is one of the main benefits of using CQRS.
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In the latter case, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
This depends on you, the architect. It is preferred that each ReadModel owns its data; that is, it should subscribe to the right events and not depend on other ReadModels. But this leads to a lot of code duplication. In my experience, I've seen a desire to have some canonical ReadModels that own some data but can also share it on demand. For this, in CQRS, there is also the term query. Just like commands and events, queries can travel through your system, but only from ReadModel to ReadModel.
Queries should not be sent during a client's request. They should be sent only in the background, as an asynchronous synchronization mechanism. This is an important aspect that influences the resilience and responsiveness of your system.
I've also used live queries, which are pushed from the authoritative ReadModels to the subscribed ReadModels in real time, when the answer changes.
In the latter case, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
No, it does not. CQRS does not specify how the R (Read side) is updated, only that the R should not process commands and C should not be queried.

How to manage a user's game state using akka

I am trying to figure out how to manage a user's game state using Akka.
The game state will be persisted to MySQL, and this cannot change because we have other services that require it.
Anything that happens in a game is considered an "event".
Then I have "Levels" which someone can achieve. A level is achieved when you complete all the "events" associated with it.
So you have:
Level
- event1 e.g. reach a point in the game
- event2 e.g. pickup a sword
- event3 e.g. defeat a monster
So in a game there are many levels, and hundreds of events that are linked to levels.
So all "events" are sent via HTTP to my backend, and I save the event in the database.
I then have to load the user's game profile into memory and recalculate the levels achieved, since a new event has happened.
Note: This calculation cannot be done at the database level because it is a little more complicated than I am describing here.
The problem I see is that if I use Akka, I can't have multiple actors processing the events for the same user, because the data can become stale.
Just to be clear: when a new event arrives, I have to load the game profile into memory, loop through the levels and see if any of them have been achieved, and if they have, update the database:
e.g. update levels set achieved=true where level_id = 123 and user_id=234
For example, actor1 loads the profile (all the levels and events for this user) and then processes the new event that just arrived in its inbox.
At the same time, actor2 loads the profile (same as actor1) and then processes another new event. When it persists the changes to MySQL, the data will be out of sync.
If I were using threads, I would have to lock during the game-profile calculation and the persisting to the DB.
How can I do this using Akka and still handle things in parallel, or does this scenario not allow for it?
Let's think about how you would manage it without actors. So, in a nutshell, you have the following problem scenario:
- Two (or more) update requests arrive at the same time, both going to modify the same data.
- Both requests read some stable data state, then update it each in its own manner and persist it to the DB.
- The modifications from the request which checked in first are lost; more precisely, they are overridden by the later request.
This is a classic problem. There are at least two classic solutions to it:
Optimistic locking
Pessimistic locking: usually achieved by applying the Serializable isolation level to transactions.
It's worth reading this answer, which has a nice comparison of both worlds.
As you're using Akka, you most probably prefer better concurrency with occasional failures that are easy to recover from. That goes hand in hand with the Akka motto "let it crash".
So, you need to take the following steps:
Add a version column to your table(s). It can be numeric or a string (with a hash); numeric is the simplest.
When you insert a new record, initialize the version.
When you update the record, check that the version value has not changed. So, here's your update strategy:
Read the record and its version.
Update the record in memory.
Execute the update query with the criteria where rec_id=$id and version=$version.
If the updated record count is 1, you're good. If it is 0, throw an OptimisticLockException or something like it (see the sketch after these steps).
Finally, it's time for Akka to do its job: come up with an appropriate supervision strategy (I'd pick something like "try again in 1 second"). In the actor's preRestart method, return the update message to the actor's mailbox (see the Restart Hooks chapter in the Akka docs).
With this strategy, even if two requests try to update the same record at the same time, one of them will fail and will immediately be processed again.
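Here is what the versioned update can look like; a hedged sketch using plain JDBC against the question's levels table, with the version column from step 1 assumed to exist.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class LevelRepository {
    void markAchieved(Connection conn, long levelId, long userId, long expectedVersion)
            throws SQLException {
        // The WHERE clause only matches if nobody bumped the version in the meantime.
        String sql = "update levels set achieved = true, version = version + 1 "
                   + "where level_id = ? and user_id = ? and version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, levelId);
            ps.setLong(2, userId);
            ps.setLong(3, expectedVersion);
            if (ps.executeUpdate() == 0) {
                // Someone else updated the row first: let the enclosing actor crash,
                // and let its supervision strategy re-enqueue the message and retry.
                throw new IllegalStateException("Optimistic lock failure for level " + levelId);
            }
        }
    }
}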

EventStore: learning how to use it

I'm trying to learn EventStore. I like the concept, but when I try to apply it in practice I keep getting stuck at the same point.
Let's see the code:
foreach (var k in stream.CommittedEvents)
{
    // handling events
}
Two questions about that:
When an app starts up after some maintenance, how do we bookmark, in a safe way, which events to start reading from? Is there a pattern to use?
As soon as the events are all consumed, the loop ends... what about messages arriving at run time? I would expect the call to block until some new message arrives (which of course would need to be handled in a thread), or to have something like BeginRead/EndRead.
Do I have to bind an ESB to handle run-time events, or does EventStore provide some facility to do this?
Let me try to explain better with an example.
Suppose the aggregate is a financial portfolio, and the application shows that portfolio to a trader. Suppose the trader connects to the web app and looks at his own portfolio. The current state is the whole history, so I potentially have to read a lot of records to reproduce the state. I guess this could be done with a so-called snapshot, but who is responsible for creating it? When should one choose to create a snapshot? How can one tell whether a snapshot for an aggregate exists?
For the runtime part: as soon as the user looks at the reconstructed portfolio state, the real-time part begins to run. The user can place an order, and a new position can be created by successfully executing that order in the market. How is the portfolio updated by the infrastructure? I would expect, but maybe I'm completely wrong, the same event stream to be the source of that new "new long position" event; otherwise I have two paths handling the state of the same aggregate. I would like to know if this is how the strategy is supposed to work, even though it feels a little tricky having two state agents that can possibly overlap.
Just to clarify how I fear the overlap could happen:
I know events have to be idempotent, so I know it must not be a problem anyway,
But let's consider the following:
I subscribe to an event bus before streaming the events to update the state of the portfolio. Some "open position" event appears on the bus: I must handle it, but maybe the portfolio is not yet in the correct state to handle it, since it is not yet up to date. Even if I'm able to handle such events, I will find them again when I read the stream.
More insidious: I open the stream, read all the events, and create a state. Then I subscribe to the bus: some messages on the bus happen in between the end of the stream reading and the beginning of the subscription; those events are missed and the aggregate is not in the correct state.
Please be patient, everyone; my English is poor and the topic is tricky. I hope I managed to convey my doubts :)
The current state is the whole history, so I potentially have to read a lot of records to reproduce the state. I guess this could be done with a so-called snapshot, but who is responsible for creating it?
In CQRS and event sourcing, queries are served by projections which are generated from events emitted by aggregates. You don't use the aggregate instance as reconstituted from the event store to display information.
The term snapshot refers specifically to an optimization of the event store which allows rebuilding the aggregate without replaying all of the events.
Projections are essentially event handlers which maintain a denormalized view of aggregates. Events emitted from aggregates are published, possibly out of band, and the projection subscribes to and handles those events. A projection can combine multiple aggregates if a requirement exists to display summary information, for instance. In the case of a trading application, each view will typically contain data from various aggregates. Projections are designed in a consumer-driven way: application requirements determine the different views of the underlying data that are needed.
With this type of workflow you have to embrace eventual consistency throughout your application. For instance, if an end user is viewing their portfolio and initiating new trades, the UI has to subscribe to updates to reflect updated projections in an asynchronous manner.
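As a rough illustration of such a projection, and of the "bookmark" concern from the question, here is a hedged plain-Java sketch; none of the types correspond to a real EventStore client API, they only show the shape of the mechanism.

class PortfolioProjection {
    record PositionOpened(String portfolioId, String instrument, long quantity) {}

    // Hypothetical persistent bookmark of the last event this projection handled.
    interface CheckpointStore {
        long lastProcessed();
        void save(long position);
    }

    // Hypothetical denormalized view the trader's screen actually reads.
    interface PortfolioView {
        void addPosition(String portfolioId, String instrument, long quantity);
    }

    private final CheckpointStore checkpoints;
    private final PortfolioView view;

    PortfolioProjection(CheckpointStore checkpoints, PortfolioView view) {
        this.checkpoints = checkpoints;
        this.view = view;
    }

    // Called by a subscription for every event at or after the saved bookmark,
    // first while catching up and then for live events, in stream order.
    void handle(long position, Object event) {
        if (event instanceof PositionOpened e) {
            view.addPosition(e.portfolioId(), e.instrument(), e.quantity());
        }
        checkpoints.save(position);   // persist the bookmark after handling
    }
}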
Take a look here for an overview of CQRS and event sourcing.