Kafka event Producer on RDBMS data & reading it at consumer in same order of producer in case of multiple topics - apache-kafka

I have two business entities in RDBMS: Associate & AssociateServingStore. I planned to have two topics currently writing ADD/UPDATE/DELETE into AssociateTopic & AssociateServingStoreTopic, and these two topics are consumed by several downstream systems which would use for their own business needs.
Whenever an Associate/AssociateServingStore is added from UI, currently I have Associate & AssociateServingStore writing into two separate topics, and I have a single consumer at my end to read both topics, the problem is order of messages that can be read from two separate topics.. as this follows a workflow I cannot read AssociateServingStore without reading Associate first.. how do I read them in order ? (with partition key I can read data in order for same topic within partition) but here I have two separate topics and want to read in an order, first read Associate & then AssociateServingSotre and How to design it in such a way that I can read Associate before AssociateServingStore.
If I thinking as a consumer myself, I am planning to read first 50 rows of Associate and then 50 rows from AssocateServingStore and process the messages, but the problem is if I get a row in AssociateServingStore from the 50 records that are consumed which is not in already read/processed from first 50 Associate events, I will get issues on my end saying parent record not found while child insert.
How to design consumer in these cases of RDBMS business events where we have multiple topics but read them in order so that I will not fall in a situation where I might read particular child topic message before reading parent topic message and get issues during insert/update like parent record not found. Is there a way we can stage the data in a staging table and process them accordingly with timestamp ? I couldn't think of design which would guarantee the read order and process them accordingly
Any suggestions ?

This seems like a streaming join use-case, supported by some stream-processing frameworks/libraries.
For instance, with Kafka Streams or ksqlDB you can treat these topics as either tables or streams, and apply joins between tables, streams, or stream to table joins.
These joins handle all the considerations related to streams that do not happen on traditional databases, like how long to wait when time on one of the streams is more recent than the other one[1][2].
This presentation[3] goes into the details of how joins work on both Kafka Streams and ksqlDB.
[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-695%3A+Further+Improve+Kafka+Streams+Timestamp+Synchronization
[3] https://www.confluent.io/events/kafka-summit-europe-2021/temporal-joins-in-kafka-streams-and-ksqldb/

Related

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a questions about Kafka Streams, more specifically, interviewer wanted to know why/when would you use Kafka Streams DSL over plain Kafka Consumer API to read and process streams of messages? I could not provide a convincing answer and wondering if others with using these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual it depends on the use case when to use KafkaStreams API and when to use plain KafkaProducer/Consumer. I would not dare to select one over the other in general terms.
First of all, KafkaStreams is build on top of KafkaProducers/Consumers so everything that is possible with KafkaStreams is also possible with plain Consumers/Producers.
I would say the KafkaStreams API is less complex but also less flexible compared to the plain Consumers/Producers. Now we could start long discussions on what means "less".
When it comes to developing Kafka Streams API you can directly jump into your business logic applying methods like filter, map, join, or aggregate because all the consuming and producing part is abstracted behind the scenes.
When you are developing applications with plain Consumer/Producers you need to think about how you build your clients at the level of subscribe, poll, send, flush etc.
If you want to have even less complexity (but also less flexibilty) ksqldb is another option you can choose to build your Kafka applications.
Here are some of the scenarios where you might prefer the Kafka Streams over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with much ease. So. let's assume (a contrived example) you have a topic containing customer orders and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filter data to a separate Kafka topic (using KStream.to() or KTable.to()), and finally using Kafka Connect, the messages will be stored into the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API also, but it would be much more coding.
In a data processing pipeline, you can do the consume-process-produce in a same transaction. So, in the above example, Kafka will ensure the exactly-once semantics and transaction from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates such as the count of orders at the level of individual product. In such scenarios duplicates will always give you wrong result.
You can also enrich your incoming data with much low latency. Let's assume in the above example, you want to enrich the order data with the customer email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network which will be definitely an expensive operation impacting your throughput. In such case, you might want to store the required customer data in a compacted Kafka topic and load it in the streaming application using KTable or GlobalKTable. And now, all you need to do a simple local lookup in the KTable for the customer email address. Note that the KTable data here will be stored in the embedded RocksDB which comes with Kafka Streams and also as the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of materialized view pattern.
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed or the payment event comes before the order event. In such case, you may want to do a one hour windowed join. So, that if the order and the corresponding payment events come within a one hour window, the order will be allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one hour window and that state will be stored in the Rocks DB of Kafka Streams.

Event Driven Architectures - Topics / Stream Design

This could be some kind of best practice question. Someone please who has worked on this clarify with examples. So that all of us could benefit!
For event-driven architectures with Kafka / Redis, when we create topics/streams for events, what are all the best practices to be followed.
Lets consider online order processing workflow.
I read some blogs saying that create topics/streams like order-created-events, order-deleted-events etc. But my question is how the order of the messages is guaranteed when we split this into multiple topics.
For ex:
order-created-events could have thousands of events and being slowly processed by a consumer. order-deleted-events could have only few records in the queue assuming only 5-10% would cancel the order.
Now, lets assume, an user first places an order. then he immediately cancels. This will make the order-deleted-event to process first as the topic/stream do not have much messages before some consumer processes order-created-event for the same order. It will cause some data inconsistency.
Hopefully my question is clear. So, how to come up with topics/streams design?
Kafka ensures sequencing for a particular partition only.
So, to take use of kafka partitioning and load balancing using partitions, multiple partitions for a single topic( like order) should be created.
Now, Use a partition class to generate a key for every message and that key should correspond to same partition only.
So , irrespective of Order A getting created , updated or deleted , they should always belong to same partition.
To properly achieve sequencing , this should be the basis of deciding topics , instead of 2 different topics for different activities.

Merge multiple events from RDBMS in Kafka

We are capturing Change data capture from different tables from a RDBMS database. Each individual change is treated as an event. All the events are published into a single Kafka topic. Every event (message) is having the table name as header. We need to cater certain Use cases, where we need to merge multiple events and populate the output.
Entire thing is happening in real time.
We are using Apache Kafka.
Not sure what you mean exactly by merging events, but this seems to be in Kafka streams domain.
You can design each of your events using streams and ktables, for which you'll apply a Kafka streams topology ( joining streams of events and applying some business logic for instance)
But do you need more technical suggestions?
Yannick

ordering across partitions in Kafka

I am writing a kafka producer and needs help in creating partitions.
I have a group and a user table. Group contains different users and at a time a user can be a part of only one group.
There can be two types of events which I will receive as input and based on that I will add them to Kafka.
The events related to users.
The events related to groups.
Whenever an event related to a group happens, all the users in that group must be updated in bulk at consumer end.
Whenever an event related to a user happens, it must be executed as such at the consumer end.
Also, I want to maintain ordering on basis of time.
If I create user level partitioning, then the bulk update won't be possible at consumer end.
If I create group level partitioning, then the parallel update of user events won't happen.
I am trying to figure out the possibilities I can try here.
Also, I want to maintain ordering on basis of time.
Means that topics, no matter how many, cannot have more than one partition, as you could have received messages out-of-order.
Obviously, unless you implement something like sequence ids in your messages (and can share that sequence across possibly multiple producers).
If I create user level partitioning, then the bulk update won't be possible at consumer end.
If I create group level partitioning, then the parallel update of user events won't happen.
It sounds like a very simple messaging design, where you have a single queue (that's actually backed by a single topic with a single partition) that's consumed by multiple users. Actually any pub-sub messaging technology would be sufficient here (e.g. RabbitMQ's fanout exchanges).
The messages on the queue contain the information whether they are group updates or user updates - the consumers then filter the input depending on what they are interested in.
To discuss an alternative: single queue for group updates, and another for user updates - I understand that it would not be enough due to order demands - it's possible to get a group update independently of user update, breaking the ordering.
From the kafka documentation :
https://kafka.apache.org/documentation/#intro_consumers
Kafka only provides a total order over records within a partition, not
between different partitions in a topic. Per-partition ordering
combined with the ability to partition data by key is sufficient for
most applications. However, if you require a total order over records
this can be achieved with a topic that has only one partition, though
this will mean only one consumer process per consumer group.
so the best you can do is to have single partition-single topic.

How can I consume a data sequentially(in order of their time-stamp) from a multi-partitioned Kafka topic

I know that Kafka will not be able to guarantee ordering of data when a topic has multiple partitions. But my problem is:- I need to have multiple partitions to an event topic(user activities generating events) since I want multiple consumer groups to consume the data from the topic.
But there are times when I need to bootstrap the entire data,i.e, read the complete data right from the beginning to the end and rebuild my graph of events from the historical messages in Kafka and then I lose the ordering which is creating problem.
One approach might be to process it in a Map-Reduce paradigm where I map the data based on time and order it and consume it.
Is there anybody who has faced similar situation / problem and who would like to help me out with the right approach / solution.
Thanks in advance.
As per kafka documentation global ordering throughout partitions not guaranteed so you can create N number of partitions with N number of consumers. Create partitions based on type of data i.e. all type of data of category A should go in one partition as the order of messages maintained within partition you can consume those messages in separate consumer and process data.
I gone through some blogs which saying buffer those messages and apply sorting logic on those messages, but this is not seems to be a good practice as one of partition may be slow message message is late in some cases and you need to sort your messages as and when every new message arrives.