How to store/aggregate correlated cdc events with flink? - apache-kafka

I have a Kafka queue where multiple CDC events are coming from a database.
Suppose the following three tables implement a student-course n:n association:
STUDENT
COURSE
STUDENT_COURSE
I can have the following "business" events:
A new student enrolls in a course: in this case I would receive 3 events on my Kafka queue. These events can come in any order, but I'd like to emit a single "business" event like this one: {"type": "enroll", "student": {"name": "Jhon", "age": ...}, "course": {"name": "physics", "teacher": "Foo", ...}}
A student changes their course: in this case I would only receive 1 event on my Kafka queue (on STUDENT_COURSE) and I'd emit a "business" event like this one: {"type": "change", "student": {"name": "Jhon", "age": ...}, "newcourse": {"name": "maths", "teacher": "Foo", ...}}
Updates to STUDENT information (say email, phone, ...) or COURSE information (time, teacher, ...): 1 event on the corresponding table
My issue is that I don't know how to store and correlate these CDC events to build a single business event; in fact I'd need to do something like this:
Receive the event and store it in an "uncertain" state, wait for a reasonable time, say 10 sec
If an event on another table is received then I'm in case 1
Otherwise I'm in case 2 or 3
Is there a way to obtain this behavior in Flink?

Looks like you could start with a streaming SQL join on the 3 dynamic tables derived from these CDC streams, which will produce an update stream along the lines of what you're looking for.
Some of the examples in https://github.com/ververica/flink-sql-cookbook should provide inspiration for getting started. https://github.com/ververica/flink-sql-CDC and https://github.com/ververica/flink-cdc-connectors are also good resources.
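For a rough idea of what that could look like, here is a minimal Table API sketch. It assumes the three tables are exposed via the flink-cdc-connectors 'mysql-cdc' connector and uses illustrative column, database and credential values, so treat it as a starting point rather than a working pipeline:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class EnrollmentJoin {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // One CDC-backed dynamic table per source table; STUDENT and COURSE are registered the same way.
        tEnv.executeSql(
            "CREATE TABLE student_course (" +
            "  student_id BIGINT, course_id BIGINT," +
            "  PRIMARY KEY (student_id, course_id) NOT ENFORCED" +
            ") WITH (" +
            "  'connector' = 'mysql-cdc'," +
            "  'hostname' = 'localhost', 'port' = '3306'," +
            "  'username' = '...', 'password' = '...'," +
            "  'database-name' = 'school', 'table-name' = 'STUDENT_COURSE')");
        // ... CREATE TABLE student (...) and CREATE TABLE course (...) omitted for brevity ...

        // The join produces an updating stream: once the STUDENT, COURSE and STUDENT_COURSE rows
        // are all present a combined record is emitted (and re-emitted on later updates), so there
        // is no need to hand-roll the 10-second "uncertain" buffer yourself.
        tEnv.executeSql(
            "SELECT sc.student_id, s.name AS student_name, s.email, " +
            "       sc.course_id, c.name AS course_name, c.teacher " +
            "FROM student_course AS sc " +
            "JOIN student AS s ON sc.student_id = s.id " +
            "JOIN course AS c ON sc.course_id = c.id").print();
    }
}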

Related

Understanding use case for max.in.flight.request property in Kafka

I'm building a Spring Boot consumer-producer project with Kafka as a middleman between two microservices. The theme of the project is a basketball game. Here is a small state machine diagram in which events are displayed. There will be many more different events; this is just a snippet.
Start event:
{
  "id": 5,
  "actualStartTime": "someStartTime"
}
Point event:
{
  "game": 5,
  "type": "POINT",
  "payload": {
    "playerId": 44,
    "value": 3
  }
}
Assist event:
{
  "game": 4,
  "type": "ASSIST",
  "payload": {
    "playerId": 278,
    "value": 1
  }
}
Jump event:
{
  "game": 2,
  "type": "JUMP",
  "payload": {
    "playerId": 55,
    "value": 1
  }
}
End event:
{
  "id": 5,
  "endTime": "someStartTime"
}
The main thing to note here is that if there was an Assist event, it must be followed by a Point event.
Since I'm new to Kafka, I'll keep things simple and have one broker with one topic and one partition. For my use case I need to maintain the ordering of these events as they actually happen live on the court (I have a json file with 7000 lines and a bunch of these and other events).
So, let's say that from the Admin UI someone is sending these events (for instance via WebSockets) to the producers app. The producer app will be doing some simple validation or whatever it needs to do. Now, we can also imagine that we have two instances of the producer app, one at ip:8080 (prd1) and the other at ip:8081 (prd2).
In reality this sequence of three events happened: Assist -> Point -> Jump. The operator on the court sent those three events in that order.
The Assist event was sent on prd1 and the Point event was sent on prd2. Let's now imagine that there was a network glitch in communication between prd1 and the Kafka cluster. Since we are using the latest Kafka at the time of this writing, we already have enable.idempotence=true and the Assist event will not be sent twice.
During the retry of the Assist event on prd1 (towards Kafka), the Point event on prd2 passed successfully. Then the Assist event passed, and after it the Jump event (from either producer) also ended up in Kafka.
Now in queue we have: Point -> Assist -> Jump. This is not allowed.
My question is whether these types of problems should be handled by the application's business logic (for example Spring State Machine) or whether this ordering can be handled by Kafka?
In case of the latter, is the property max.in.flight.requests.per.connection=1 responsible for ordering? Are there any other properties which might preserve ordering?
On a side note, is it a good tactic to use a single partition for a single match and multiple consumers for any of the partitions? Most probably I would be streaming different types of matches (basketball, soccer, golf, across different leagues and nations) and most of them will require some sort of ordering.
Maybe this can be done with Kafka Streams, but I'm still on Kafka's steep learning curve.
Update 1 (after Jessica Vasey's comments):
Hi, thanks for the very thorough comments. Unfortunately I didn't quite get all the pieces of the puzzle. What confuses me the most is some of the terminology you use and the order of things happening. Not saying it's not correct, just that I didn't understand it.
I'll have two microservices, so two producers. I've got to be able to understand Kafka in the microservices world, since I'm a Java Spring developer and it's all about microservices and multiple instances.
So let's say that on prd1 a few DTO events came along [Start -> Point -> Assist] and they are sent as a ProducerRequest (https://kafka.apache.org/documentation/#recordbatch); they are placed in the RECORDS field. On prd2 we got [Point -> Jump], also as a ProducerRequest. They are, in my understanding, two independent in-flight requests (out of 5 possible)? Is their ordering based on a timestamp?
So when joining the cluster, Kafka assigns an id to each producer, let's say '0' for prd1 and '1' for prd2 (I guess it also depends on the topic-partition they have been assigned). I don't understand whether each RecordBatch has its own monotonically increasing sequence number, or each Kafka message within a RecordBatch has its own monotonically increasing sequence number, or both? Also the 'time to recover' part is bugging me. Like, if I got an OutOfOrderSequenceException, does it mean that the [Point -> Jump] batch (with possibly other in-flight requests and other batches in the producer's buffer) will sit on Kafka until either delivery.timeout.ms expires or the [Start -> Point -> Assist] batch is finally sent successfully?
Sorry for confusing you further, it's some complex logic you have! Hopefully, I can clarify some points for you. I assumed you had one producer, but after re-reading your post I see you have two producers.
You cannot guarantee the order of messages across both producers. You can only guarantee the order for each individual producer. This post explains it quite nicely: Kafka ordering with multiple producers on the same topic and partition
On this question:
They are, in my understanding, two independent in-flight requests (out of 5 possible)? Is their ordering based on a timestamp?
Yes, each producer will have max.in.flight.requests.per.connection set to 5.
You could provide a timestamp in your producer, which could help with your situation. However, I won't go into too much detail on that right now and will first answer your questions.
I don't understand whether each RecordBatch has its own monotonically increasing sequence number, or each Kafka message within a RecordBatch has its own monotonically increasing sequence number, or both? Also the 'time to recover' part is bugging me. Like, if I got an OutOfOrderSequenceException, does it mean that the [Point -> Jump] batch (with possibly other in-flight requests and other batches in the producer's buffer) will sit on Kafka until either delivery.timeout.ms expires or the [Start -> Point -> Assist] batch is finally sent successfully?
Each message is assigned a monotonically increasing sequence number. This LinkedIn post explains it better than I ever could!
Yes, other batches will sit on the producer until either the previous batch is acknowledged (which could be less than 2 mins) OR delivery.timeout.ms expires.
Even if max.in.flight.requests.per.connection > 1, setting enable.idempotence=true should preserve the message order as this assigns the messages a sequence number. When a batch fails, all subsequent batches to the same partition fail with OutofOrderSequenceException.
Number of partitions should be determined by your target throughput. If you wanted to send basketball matches to one partition and golf to another, you can use keys to determine which message should be sent where.
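To make the configuration side concrete, here is an illustrative producer setup; the topic name, the choice of the game id as key, and the literal values are assumptions, not your actual code:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
// Idempotence keeps the per-producer order intact even with retries and up to 5 in-flight requests.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // All events of game 5 share the key "5", so they land in the same partition and keep
    // their per-producer order. Order across prd1 and prd2 is still not guaranteed.
    producer.send(new ProducerRecord<>("game-events", "5",
            "{\"game\":5,\"type\":\"ASSIST\",\"payload\":{\"playerId\":278,\"value\":1}}"));
}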

Microservices with Kafka - how can we know when a service has successfully processed a message

We currently have a topic that is being consumed by two services as outlined in the architecture below. One is an NLP service and the other is a CV service. They are separated because they belong to different teams.
Let's say the original message is like this:
{
  "id": 1234,
  "text": "I love pizza",
  "photo": "https://photo.service/photo001"
}
The NLP service will process the message and produce a new message to topic 1 as below:
{
  "id": 1234,
  "text": "I love pizza",
  "nlp": "pizza",
  "photo": "https://photo.service/photo001"
}
And the CV (Computer Vision) will process it and produce the below message to topic 2:
{
  "id": 1234,
  "text": "I love pizza",
  "photo": "https://photo.service/photo001",
  "cv": ["pizza", "restaurant", "cup", "spoon", "folk"]
}
Lastly, there's a final service that needs both pieces of information from the two services above. However, the amount of time taken by the NLP service and the CV service is different. Now, as the final service, how do I grab both messages from topic 1 and topic 2 for this particular message with id 1234?
You can use Kafka Streams or ksqlDB to run a join query. Otherwise, you'd use an external database for the same.
E.g. You'd create a table for whichever events finish "first", then you join the second incoming stream on the ID keys of that table. Without a persistent table, you can join two streams, but this assumes there is a time window in which both events will exist.
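As a hedged sketch of such a windowed stream-stream join, assuming both services key their output by the message id and use plain String/JSON values (the value-merging lambda is just a placeholder):

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> nlp = builder.stream("topic1", Consumed.with(Serdes.String(), Serdes.String()));
KStream<String, String> cv = builder.stream("topic2", Consumed.with(Serdes.String(), Serdes.String()));

// Join NLP and CV results that share the same key (the message id, e.g. "1234") and
// arrive within 10 minutes of each other; widen the window to whatever gap you expect.
KStream<String, String> combined = nlp.join(
        cv,
        (nlpJson, cvJson) -> nlpJson + cvJson, // placeholder: merge the two JSON payloads properly
        JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(10)),
        StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

combined.to("final-topic", Produced.with(Serdes.String(), Serdes.String()));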
Alternatively, don't split the incoming stream.
A -> NLP -> CV -> final service

How do I implement Event Sourcing using Kafka?

I would like to implement the event-sourcing pattern using kafka as an event store.
I want to keep it as simple as possible.
The idea:
My app contains a list of customers. Customers can be created and deleted. Very simple.
When a request to create a customer comes in, I am creating the event CUSTOMER_CREATED including the customer data and storing this in a Kafka topic using a KafkaProducer. The same when a customer is deleted, with the event CUSTOMER_DELETED.
Now when I want to list all customers, I have to replay all events that happened so far and then get the current state, meaning a list of all customers.
I would create a temporary customer list, and then process all the events one by one (create customer, create customer, delete customer, create customer etc.), consuming these events with a KafkaConsumer. In the end I return the temporary list.
I want to keep it as simple as possible and it's just about giving me an understanding on how event-sourcing works in practice. Is this event-sourcing? And also: how do I create snapshots when implementing it this way?
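For reference, the replay I have in mind would look roughly like this; the topic name, the string matching on the event type and the stop condition are simplified placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "customer-replay-" + System.currentTimeMillis());
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // replay from the beginning
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

Map<String, String> customers = new LinkedHashMap<>(); // temporary list: customerId -> customer JSON

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("customer-events"));
    int emptyPolls = 0;
    while (emptyPolls < 3) { // crude "caught up" check, just for the sketch
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        if (records.isEmpty()) { emptyPolls++; continue; }
        emptyPolls = 0;
        for (ConsumerRecord<String, String> rec : records) {
            if (rec.value() == null || rec.value().contains("CUSTOMER_DELETED")) {
                customers.remove(rec.key());           // CUSTOMER_DELETED drops the customer
            } else {
                customers.put(rec.key(), rec.value()); // CUSTOMER_CREATED adds/overwrites it
            }
        }
    }
}
// customers now holds the replayed current state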
when I want to list all customers, I have to replay all events that happened so far
You actually don't, or at least not after your app starts fresh and is actively collecting / tombstoning the data. I encourage you to lookup the "Stream Table Duality", which basically states that your table is the current state of the world in your system, and a snapshot in time of all the streamed events thus far, which would be ((customers added + customers modified) - customers deleted).
The way you implement this in Kafka would be to use a compacted Kafka topic for your customers, which can be read into a Kafka Streams KTable and persisted in memory or spilled to disk (backed by RocksDB). The message key would be some UUID for the customer, or some other identifier that cannot change (e.g. not name, email, phone, etc., as all of these can change).
With that, you can implement Interactive Queries on it to scan or lookup a certain customer's details.
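A rough sketch of that setup, with made-up topic and store names:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

StreamsBuilder builder = new StreamsBuilder();

// Read the compacted "customers" topic (keyed by customer UUID) into a KTable:
// the latest value per key is the current state, and a deletion is a null tombstone.
KTable<String, String> customers = builder.table(
        "customers",
        Consumed.with(Serdes.String(), Serdes.String()),
        Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("customers-store")
                .withKeySerde(Serdes.String())
                .withValueSerde(Serdes.String()));

// Once the topology is running, Interactive Queries can read the store directly:
// ReadOnlyKeyValueStore<String, String> store = streams.store(
//         StoreQueryParameters.fromNameAndType("customers-store", QueryableStoreTypes.keyValueStore()));
// store.get(uuid) looks up one customer, store.all() iterates over every customer.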
Theoretically you can do Event Sourcing with Kafka as you mentioned, replaying all events at application start, but as you mentioned, if you have 100 000 events to reach a state it is not practical.
As mentioned in the previous answers, you can use a Kafka Streams KTable for a sense of Event Sourcing, but since the KTable is hosted in the key/value database RocksDB, querying the data will be quite limited (you can ask what the state of customer id 123456789 is, but you can't ask for all customers with state CUSTOMER_DELETED).
To achieve that flexibility, we need help from another pattern, Command Query Responsibility Segregation (CQRS). Personally I advise you to use Kafka as a reliable, extremely performant broker and give the responsibility for Event Sourcing to a dedicated framework like Akka (which synergizes naturally with Kafka) with Apache Cassandra persistence, Akka Finite State Machine for the Command part and Akka Projection for the Query part.
If you want to see a sample of how all these technology stacks play together, I have a blog post about it. I hope it can help you.

Event Sourcing - Apache Kafka + Kafka Streams - How to assure atomicity / transactionality

I'm evaluating Event Sourcing with Apache Kafka Streams to see how viable it is for complex scenarios. As with relational databases, I have come across some cases where atomicity/transactionality is essential:
Shopping app with two services:
OrderService: has a Kafka Streams store with the orders (OrdersStore)
ProductService: has a Kafka Streams store (ProductStockStore) with the products and their stock.
Flow:
OrderService publishes an OrderCreated event (with productId, orderId, userId info)
ProductService gets the OrderCreated event and queries its KafkaStreams Store (ProductStockStore) to check if there is stock for the product. If there is stock it publishes an OrderUpdated event (also with productId, orderId, userId info)
The point is that this event would be listened to by the ProductService Kafka Streams topology, which would process it to decrease the stock. So far so good.
But, imagine this:
Customer 1 places an order, order1 (there is a stock of 1 for the product)
Customer 2 places concurrently another order, order2, for the same product (stock is still 1)
ProductService processes order1 and sends a message OrderUpdated to decrease the stock. This message is put in the topic after the one from order2 -> OrderCreated
ProductService processes order2-OrderCreated and sends a message OrderUpdated to decrease the stock again. This is incorrect since it will introduce an inconsistency (stock should be 0 now).
The obvious problem is that our materialized view (the store) should be updated directly when we process the first OrderUpdated event. However, the only way (that I know of) to update the Kafka Streams store is to publish another event (OrderUpdated) to be processed by the Kafka Streams topology. This way we can't perform this update transactionally.
I would appreciate ideas to deal with scenarios like this.
UPDATE: I'll try to clarify the problematic bit of the problem:
ProductService has a Kafka Streams Store, ProductStock with this stock (productId=1, quantity=1)
OrderService publishes two OrderPlaced events on the orders topic:
Event1 (key=product1, productId=product1, quantity=1, eventType="OrderPlaced")
Event2 (key=product1, productId=product1, quantity=1, eventType="OrderPlaced")
ProductService has a consumer on the orders topic. For simplicity let's suppose a single partition to ensure messages are consumed in order. This consumer executes the following logic:
if("OrderPlaced".equals(event.get("eventType"))){
Order order = new Order();
order.setId((String)event.get("orderId"));
order.setProductId((Integer)(event.get("productId")));
order.setUid(event.get("uid").toString());
// QUERY PRODUCTSTOCK TO CHECK AVAILABILITY
Integer productStock = getProductStock(order.getProductId());
if(productStock > 0) {
Map<String, Object> event = new HashMap<>();
event.put("name", "ProductReserved");
event.put("orderId", order.getId());
event.put("productId", order.getProductId());
// WRITES A PRODUCT RESERVED EVENT TO orders topic
orderProcessor.output().send(MessageBuilder.withPayload(event).build(), 500);
}else{
//XXX CANCEL ORDER
}
}
ProductService also has a Kafka Streams processor that is responsible to update the stock:
KStream<Integer, JsonNode> stream = kStreamBuilder.stream(integerSerde, jsonSerde, "orders");
stream.xxx().yyy(() -> {...}, "ProductsStock");
Event1 would be processed first and since there is still 1 available product it would generate the ProductReserved event.
Now, it's Event2's turn. If it is consumed by the ProductService consumer BEFORE the ProductService Kafka Streams processor processes the ProductReserved event generated by Event1, the consumer would still see that the ProductStore stock for product1 is 1, generating a ProductReserved event for Event2, thus producing an inconsistency in the system.
This answer is a little late for your original question, but let me answer anyway for completeness.
There are a number of ways to solve this problem, but I would encourage addressing it in an event-driven way. This would mean you (a) validate there is enough stock to process the order and (b) reserve the stock, as a single atomic action, all within a single Kafka Streams operation. The trick is to rekey by productId; that way you know orders for the same product will be executed sequentially on the same thread (so you can't get into the situation where Order1 & Order2 reserve stock of the same product twice).
There is a post that discusses how to do this: https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
Maybe more usefully there is some sample code also showing how it can be done:
https://github.com/confluentinc/kafka-streams-examples/blob/1cbcaddd85457b39ee6e9050164dc619b08e9e7d/src/main/java/io/confluent/examples/streams/microservices/InventoryService.java#L76
Note how in this KStreams code the first line rekeys to productId, then a Transformer is used to (a) validate there is sufficient stock to process the order and (b) reserve the stock required by updating the state store. This is done atomically, using Kafka's Transactions feature.
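To show the rough shape of that idea against your own snippet (rather than the Confluent example's actual classes): the store and topic names below are taken from your code, everything else is illustrative, and since your OrderPlaced events are already keyed by productId the rekey step only appears as a comment.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

StreamsBuilder builder = new StreamsBuilder();

// Local stock store, keyed by productId (store name reused from your snippet).
builder.addStateStore(Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("ProductsStock"),
        Serdes.String(), Serdes.Integer()));

// Your OrderPlaced events are already keyed by productId; if they weren't, a
// selectKey((k, order) -> productId) step here would force the repartitioning.
builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
    .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
        private KeyValueStore<String, Integer> stock;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            stock = (KeyValueStore<String, Integer>) context.getStateStore("ProductsStock");
        }

        @Override
        public KeyValue<String, String> transform(String productId, String orderJson) {
            Integer available = stock.get(productId);
            if (available != null && available > 0) {
                stock.put(productId, available - 1);        // check and reserve in the same step
                return KeyValue.pair(productId, orderJson); // forwarded as the "ProductReserved" event
            }
            return null;                                    // out of stock: nothing forwarded, cancel elsewhere
        }

        @Override
        public void close() { }
    }, "ProductsStock")
    .to("order-validations", Produced.with(Serdes.String(), Serdes.String()));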
This same problem is typical in assuring consistency in any distributed system. Instead of going for strong consistency, typically the process manager/saga pattern is used. This is somewhat similar to the 2-phase commit in distributed transactions but implemented explicitly in application code. It goes like this:
The Order Service asks the Product Service to reserve N items. The Product Service either accepts the command and reduces stock, or rejects the command if it doesn't have enough items available.
Upon a positive reply to the command, the Order Service can now emit an OrderCreated event (although I'd call it OrderPlaced, as "placed" sounds more idiomatic to the domain and "created" is more generic, but that's a detail).
The Product Service either listens for OrderPlaced events, or an explicit ConfirmReservation command is sent to it. Alternatively, if something else happened (e.g. funds failed to clear), an appropriate event can be emitted or a CancelReservation command sent explicitly to the ProductService.
To cater for exceptional circumstances, the ProductService may also have a scheduler (in Kafka Streams, punctuation can come in handy for this) to cancel reservations that weren't confirmed or aborted within a timeout period.
The technicalities of the orchestration of the two services and handling the error conditions and compensating actions (cancelling reservation in this case) can be handled in the services directly, or in an explicit Process Manager component to segregate this responsibility. Personally I'd go for an explicit Process Manager that could be implemented using Kafka Streams Processor API.

Handling out of order events in CQRS read side

I've read this nice post from Jonathan Oliver about handling out of order events.
http://blog.jonathanoliver.com/cqrs-out-of-sequence-messages-and-read-models/
The solution that we use is to dequeue a message and to place it in a “holding table” until all messages with a previous sequence are received. When all previous messages have been received we take all messages out of the holding table and run them in sequence through the appropriate handlers. Once all handlers have been executed successfully, we remove the messages from the holding table and commit the updates to the read models.
This works for us because the domain publishes events and marks them with the appropriate sequence number. Without this, the solution below would be much more difficult—if not impossible.
This solution is using a relational database as a persistence storage mechanism, but we’re not using any of the relational aspects of the storage engine. At the same time, there’s a caveat in all of this. If message 2, 3, and 4 arrive but message 1 never does, we don’t apply any of them. The scenario should only happen if there’s an error processing message 1 or if message 1 somehow gets lost. Fortunately, it’s easy enough to correct any errors in our message handlers and re-run the messages. Or, in the case of a lost message, to re-build the read models from the event store directly.
I've got a few questions, particularly about how he says we can always ask the event store for missing events.
Does the write side of CQRS have to expose a service for the read side to "demand" replaying of events? For example, if event 1 was not received but 2, 4, 3 have been, can we ask the event store through a service to republish events starting from 1?
Is this service the responsibility of the write side of CQRS?
How do we re-build the read model using this?
If you have a sequence number, then you can detect a situation where current event is out of order, e.g. currentEventNumber != lastReceivedEventNumber + 1
Once you've detected that, you just throw an exception. If your subscriber has a mechanism for 'retries' it will try to process this event again in a second or so. There is a pretty good chance that during this time earlier events will be processed and sequence will be correct. This is a solution if out-of-order events are happening rarely.
If you are facing this situation regularly, you need to implement a global locking mechanism, which will allow certain events to be processed sequentially.
For example, we were using sp_getapplock in MSSQL to achieve global "critical section" behaviour in certain situations. Apache ZooKeeper offers a framework to deal with even more complicated scenarios when multiple parts of the distributed application require something more than just a simple lock.
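A minimal sketch of that detection step, with made-up class and method names; the subscriber's retry mechanism is assumed to redeliver the event after the exception:

import java.util.HashMap;
import java.util.Map;

// Tracks the last applied sequence number per aggregate/stream on the read side;
// an out-of-order event throws, so the subscriber's retries redeliver it later.
public class SequenceGuard {
    private final Map<String, Long> lastApplied = new HashMap<>();

    public void ensureInOrder(String streamId, long currentEventNumber) {
        long lastReceivedEventNumber = lastApplied.getOrDefault(streamId, 0L);
        if (currentEventNumber != lastReceivedEventNumber + 1) {
            throw new IllegalStateException("Expected event " + (lastReceivedEventNumber + 1)
                    + " for " + streamId + " but got " + currentEventNumber);
        }
        lastApplied.put(streamId, currentEventNumber);
    }
}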
Timestamp based solution:
The incoming messages are:
{
  id: 1,
  timestamp: T2,
  name: Samuel
}
{
  id: 1,
  timestamp: T1,
  name: Sam,
  age: 26
}
{
  id: 1,
  timestamp: T3,
  name: Marlon Samuels,
  contact: 123
}
And what we expect to see irrespective of the ORDER in the database is:
{
  id: 1,
  timestamp: T3,
  name: Marlon Samuels,
  age: 26,
  contact: 123
}
For every incoming message, do the following:
Get the persisted record and compare the timestamps.
Whichever record's timestamp is greater, that record is the merge target.
Now let's go through the messages:
T2 arrives first: store it in the database as it's the first one.
T1 arrives next: persisted one (T2) & incoming (T1), so T2 is the target.
T3 arrives: persisted one (T2) & incoming (T3), so T3 is the target.
The following deepMerge(src, target) should be able to give us the resultant:
public static JsonObject deepMerge(JsonObject source, JsonObject target) {
    for (String key : source.keySet()) {
        JsonElement srcValue = source.get(key);
        if (!target.has(key)) { // add only when target doesn't have it already
            target.add(key, srcValue);
        } else {
            // handle recursively according to the requirement
        }
    }
    return target;
}
Let me know in the comments if you need the full version of deepMerge().
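For example, using the first two messages above and the deepMerge() sketch (Gson's JsonParser is assumed for building the objects):

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

JsonObject persisted = JsonParser.parseString(
        "{\"id\":1,\"timestamp\":\"T2\",\"name\":\"Samuel\"}").getAsJsonObject();
JsonObject incoming = JsonParser.parseString(
        "{\"id\":1,\"timestamp\":\"T1\",\"name\":\"Sam\",\"age\":26}").getAsJsonObject();

// T2 > T1, so the persisted record stays the target and the older incoming record
// is the source; only its missing field (age) is copied over.
// deepMerge(...) is the method shown above.
JsonObject merged = deepMerge(incoming, persisted);
System.out.println(merged); // {"id":1,"timestamp":"T2","name":"Samuel","age":26}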
Another alternative would be to feed the service that you're reading events from (S1) in such a way that it can only produce in-order events to your service (S2).
For example if you have loads of events for many different sessions coming in, have an ordering service (O1) at the front end responsible for order. It ensures only one event for each session gets passed to (S1) and only when (S1) and (S2) have both processed it successfully does (O1) allow a new event for that session to pass to (S1). Throw in a bit of queuing too for performance.