Building a unified object from CDC events via Kafka - apache-kafka

I am trying to build my understanding of CDC and Kafka Streams.
I want to build a system called 'Order Store' that represents a unified data model of all orders booked across multiple order booking systems. Assume the order booking systems create orders in their own formats, in their own tables. These systems have CDC set up and push row changes as events to their Kafka topics, one Kafka topic per table in the order booking system. From here, how do I get to a complete, unified order in the Order Store?
Where will I be able to get all of the order information from the source order booking system, given that CDC only gives me specific row changes in my Kafka topic?
When I use Streams to join Kafka topics: say I have an 'Order Event' topic and an 'Order Details' topic. Order details were changed, which created an event in the Order Details topic. If I try to join it with the order in the order topic, I might not find the order info, since Kafka only stores the last x days' worth of data. In this case, what is done to build an order object that needs both the order and the order details?
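(For what it's worth, the general pattern the related answers below point at is to materialize each CDC topic as a table and join the tables. A minimal Kafka Streams sketch, assuming both topics are keyed by the order id, default String serdes, and placeholder topic names; a KTable keeps the latest value per key in a local state store backed by a compacted changelog, so the join is not limited by the source topic's retention window.)

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class UnifiedOrderTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Each CDC topic becomes a changelog-backed table holding the latest row per orderId.
        // Topic names and plain-String values are placeholders for this sketch.
        KTable<String, String> orders  = builder.table("orders-cdc");
        KTable<String, String> details = builder.table("order-details-cdc");

        // Table-table join: the unified record is re-emitted whenever either side changes,
        // so a late order-details change still finds the current order state.
        KTable<String, String> unified =
                orders.join(details, (order, det) -> order + "|" + det);

        // Feed the unified view into the topic that backs the "Order Store".
        unified.toStream().to("unified-orders");
        return builder;
    }
}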

Related

How to guarantee message ordering over multiple topics in kafka?

I am creating a system in which I use kafka as an event store. The problem I am having is not being able to guarantee the message ordering of all the events.
Let's say I have a User entity and an Order entity. Right now I have the topics configured as follows:
user-deleted
user-created
order-deleted
order-created
When consuming these topics from the start (when a new consumer group registers), first the user-deleted topic gets consumed, then user-created, and so on. The problem with this is that the events across multiple topics do not get consumed chronologically, only within each topic.
Let's say 2 users get created and after this one gets deleted. The result would be one remaining user.
Events:
user-created
user-created
user-deleted
My system would consume these like:
user-deleted
user-created
user-created
Which means the result is 2 remaining users which is wrong.
I do set the partition key (with the user id) but this seems only to guarantee order within a topic. How does this problem normally get tackled?
I have seen people use a topic per entity, resulting in 2 topics for this example (user and order), but this can still cause issues with related entities.
What you've designed is "request/response topics", and you cannot order between multiple topics this way.
Instead, design "entity topics" or "event topics". This way, ordering will be guaranteed, and you only need one topic per entity. For example,
Topic: users
For key=userId, you can structure events this way:
Create:  userId, {userId: userId, name:X, ...}
Update:  userId, {userId: userId, name:Y, ...}
Delete:  userId, null
Use a compacted topic for an event-store such that all deletes will be tombstoned and dropped from any materialized view.
You could go a step further and create a wrapper record.
userId, {action:CREATE, data:{name:X, ...}} // full-record
userId, {action:UPDATE, data:{name:Y}} // partial record
userId, {action:DELETE} // no data needed
This topic acts as your "event entity topic", but you then need a stream processor to parse and process these events consistently into the format above, e.g. converting any action:DELETE into a null value (a tombstone) and writing the result to a compacted topic (Kafka Streams can do much of this automatically via a KTable).
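A minimal Kafka Streams sketch of such a processor, assuming the wrapper records arrive as JSON strings on a hypothetical users-events topic and that users-compacted is configured with cleanup.policy=compact (all names here are illustrative, not from the answer):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class UserEventCompactor {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Wrapper records keyed by userId, e.g. {"action":"UPDATE","data":{"name":"Y"}}.
        KStream<String, String> events = builder.stream("users-events");

        events
            // Turn DELETE actions into tombstones (null values) so log compaction
            // eventually drops the key from the compacted topic.
            .mapValues(value ->
                    value != null && value.contains("\"action\":\"DELETE\"") ? null : value)
            // A real processor would also merge partial UPDATEs against current state
            // (e.g. via aggregate()/KTable) before writing the latest full record here.
            .to("users-compacted");

        return builder;
    }
}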
Kafka is not able to maintain ordering across multiple topics. Nor can it maintain ordering inside one topic that has several partitions. The only ordering guarantee we have is within each partition of one topic.
What this means is that if the order of user-created and user-deleted as known by a Kafka producer must be the same as the order of those events as perceived by a Kafka consumer (which is understandable, as you explain), then those events must be sent to the same partition of the same topic.
Usually you don't actually need the whole order to be exactly the same for the producer and the consumer (i.e. you don't need total ordering), but you do need it to be the same at least for each entity id, i.e. for each user id the user-created and user-deleted events must be in the same order for the producer and the consumer, while it's often acceptable to have events mixed up across users (i.e. you need partial ordering).
In practice this means you must use the same topic for all those events, which means this topic will contain events with different schemas.
One strategy for achieving that is to use union types, i.e. you declare in your event schema that the type can either be a user-created or a user-deleted. Both Avro and Protobuf offer this feature.
Another strategy, if you're using Confluent Schema registry, is to allow a topic to be associated with several types in the registry, using the RecordNameStrategy schema resolution strategy. The blog post Putting Several Event Types in the Same Topic – Revisited is probably a good source of information for that.
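For illustration, a producer configured that way might look roughly like this (a sketch only; the value.subject.name.strategy key and the RecordNameStrategy class come from the Confluent Avro serializer, so double-check them against your client version):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class MultiTypeProducer {
    public static KafkaProducer<String, Object> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent Avro serializer; registers schemas in Schema Registry.
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // Register subjects by record name instead of topic name, so user-created
        // and user-deleted (two different record types) can share the "users" topic.
        props.put("value.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.RecordNameStrategy");
        return new KafkaProducer<>(props);
    }
}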

how can I aggregate kafka topics?

I need to carry market data from source to target. I'd like to put each symbol, e.g. BTCUSD, in its own topic and have the target app subscribe to as many topics as it wants and receive data for multiple symbols in the correct time-based order.
I am currently putting all the data into a single topic and having the target filter out the data it's not interested in.
Can I achieve what I want with Kafka alone, or with an additional project, or can you name another message broker for the job?
Thanks.

Kafka event Producer on RDBMS data & reading it at consumer in same order of producer in case of multiple topics

I have two business entities in RDBMS: Associate & AssociateServingStore. I planned to have two topics currently writing ADD/UPDATE/DELETE into AssociateTopic & AssociateServingStoreTopic, and these two topics are consumed by several downstream systems which would use for their own business needs.
Whenever an Associate/AssociateServingStore is added from the UI, I currently have Associate and AssociateServingStore written into two separate topics, and a single consumer at my end reads both topics. The problem is the order of messages read from the two topics: because this follows a workflow, I cannot read AssociateServingStore without reading Associate first. How do I read them in order? (With a partition key I can read data in order within a partition of the same topic.) But here I have two separate topics and want to read them in order, first Associate and then AssociateServingStore. How do I design it so that I can read Associate before AssociateServingStore?
If I think of myself as the consumer, I could read the first 50 rows of Associate and then 50 rows of AssociateServingStore and process the messages. The problem is that if one of the 50 consumed AssociateServingStore records refers to an Associate that is not among the 50 Associate events already read/processed, I will get an error on my end saying the parent record was not found during the child insert.
How do I design the consumer in these cases of RDBMS business events where we have multiple topics but need to read them in order, so that I never read a child-topic message before its parent-topic message and hit insert/update issues like "parent record not found"? Is there a way to stage the data in a staging table and process it according to timestamps? I couldn't think of a design that would guarantee the read order.
Any suggestions?
This seems like a streaming join use-case, supported by some stream-processing frameworks/libraries.
For instance, with Kafka Streams or ksqlDB you can treat these topics as either tables or streams, and apply joins between tables, streams, or stream to table joins.
These joins handle all the considerations related to streams that do not arise in traditional databases, such as how long to wait when the time on one of the streams is more recent than the other [1][2].
This presentation[3] goes into the details of how joins work on both Kafka Streams and ksqlDB.
[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-695%3A+Further+Improve+Kafka+Streams+Timestamp+Synchronization
[3] https://www.confluent.io/events/kafka-summit-europe-2021/temporal-joins-in-kafka-streams-and-ksqldb/
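As a rough illustration of what such a join could look like with Kafka Streams (a sketch under assumptions: both topics are keyed by the same associate id and co-partitioned, and values are treated as plain strings):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class AssociateJoinTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Parent entity as a table: always reflects the latest Associate per key.
        KTable<String, String> associates = builder.table("AssociateTopic");

        // Child entity as a stream of change events keyed by the same associate id.
        KStream<String, String> servingStores = builder.stream("AssociateServingStoreTopic");

        // Stream-table join: each child event is enriched with the current parent record,
        // so downstream processing never sees a child without its parent. Timestamp
        // synchronization (KIP-353/KIP-695 above) governs how the two inputs are paced.
        servingStores
                .join(associates, (store, associate) -> associate + "|" + store)
                .to("associate-with-serving-store");

        return builder;
    }
}

Note that an inner stream-table join drops a child event whose parent has not arrived yet; a left join or some buffering strategy may be needed if that case can occur.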

Kafka used as Delivery Mechanism in News Feed

Can I create topics called update_i for different kinds of updates and partition them using user_id in a Kafka MQ? I've been through this post by confluent.io: https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ . I also know that I cannot create a topic with a dynamic number of partitions. Given these two facts (the post and the static number of Kafka partitions), what's the alternative delivery mechanism?
Can I create topics called update_i for different kinds of updates and partition them using user_id in a Kafka MQ ?
If I understand you correctly, the answer is Yes.
What you would need to do in a nutshell:
Topic configuration: Determine the required number of partitions for your topic(s). Usually, the number of partitions is determined based on (1) anticipated scale/volume of the incoming data, i.e. the Write-side of scaling, and/or (2) the required parallelism when consuming the messages for processing, i.e. the Read-side of scaling. See https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ for details.
Writing messages to these Kafka topics (i.e. the "Kafka producer" side): In Kafka, messages are key-value pairs. In your case, you would set the message key to be the user_id. Then, when using Kafka's default partitioner, messages with the same message key (here: user_id) are automatically sent to the same partition -- which is what you want to achieve.
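A minimal producer sketch for that (the topic name and payload are made up for the example):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UpdateProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With user_id as the message key, the default partitioner hashes the key,
            // so all updates for one user land in the same partition and stay in order.
            String userId = "42";
            producer.send(new ProducerRecord<>("updates", userId, "{\"type\":\"photo_liked\"}"));
        }
    }
}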
As a possible solution, I would suggest creating a number of partitions and then setting up producers to select the partition using the following rule:
user_id mod <number_of_partitions>
That will allow you to keep the order of messages for a particular user_id.
Then, if you need a consumer that processes only messages for a particular user_id, you can write a (low-level) consumer that reads a particular partition and processes only the messages sent for that particular user, ignoring all other messages.
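If you want the explicit user_id mod <number_of_partitions> rule rather than the default key hash, a custom partitioner is one way to express it (a sketch; it assumes the record key is the numeric user_id as a String and that the class is registered via the partitioner.class producer config):

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Routes each record to partition (user_id mod number_of_partitions).
public class UserIdPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        long userId = Long.parseLong((String) key);
        return (int) (userId % numPartitions);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}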

kafka as event store in event sourced system

This question is similar to Using Kafka as a (CQRS) Eventstore. Good idea?, but more implementation specific.
How do I use Kafka as an event store when I have thousands of event "sources" (aggregate roots in DDD)? As I've read in the linked question and some other places, I'll have problems with a topic per source. If I split events into topics by type, it will be much easier to consume and store, but I need access to the event stream of a particular source. How do I do event sourcing with Kafka?
Post all of your event sources to a single topic with a data type (Thrift?) that includes a unique identifier for each event source. Then create consumers for each event type you are interested in and give each a unique consumer group name. This way each unique source consumer will have its own committed offsets (stored in ZooKeeper by the old consumer, in Kafka itself by modern clients). Everybody reads the whole topic but only outputs (or deals with) info from a single source (or group of sources).
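A rough sketch of such a consumer (the topic name, group id, and key convention are assumptions made for the example):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SingleSourceProjector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // A distinct group.id per projection gives it its own committed offsets.
        props.put("group.id", "order-projector");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The key carries the source/aggregate id; skip sources this
                    // projection does not care about.
                    if (!record.key().startsWith("order-")) {
                        continue;
                    }
                    // apply(record.value()) would rebuild or update this aggregate's state.
                }
            }
        }
    }
}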