I am creating a system in which I use Kafka as an event store. The problem I am having is that I cannot guarantee the message ordering of all the events.
Let's say I have a User entity and an Order entity. Right now I have the topics configured as follows:
user-deleted
user-created
order-deleted
order-created
When consuming these topics from the start (when a new consumer group registers), first the user-deleted topic gets consumed, then user-created, etc. The problem with this is that events across multiple topics are not consumed chronologically, only within each topic.
Let's say 2 users get created and after this one gets deleted. The correct result would be one remaining user.
Events:
user-created
user-created
user-deleted
My system would consume these like:
user-deleted
user-created
user-created
Which means the result is 2 remaining users, which is wrong.
I do set the partition key (with the user id), but this only seems to guarantee order within a topic. How does this problem normally get tackled?
I have seen people use a topic per entity, resulting in 2 topics for this example (user and order), but this can still cause issues with related entities.
What you've designed is "request/response topics", and you cannot order between multiple topics this way.
Instead, design "entity topics" or "event topics". This way, ordering will be guaranteed, and you only need one topic per entity. For example,
Topic users
For a key=userId, you can structure events this way:
Creates
userId, {userId: userId, name:X, ...}
Updates
userId, {userId: userId, name:Y, ...}
Deletes
userId, null
Use a compacted topic for an event-store such that all deletes will be tombstoned and dropped from any materialized view.
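For illustration, here is a minimal Java producer sketch against this layout (the topic name and the JSON payloads are assumptions; a real event store would typically use an Avro/Protobuf serializer instead of raw strings):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class UserEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "user-42"; // placeholder id
            // Create: full record, keyed by userId
            producer.send(new ProducerRecord<>("users", userId, "{\"userId\":\"user-42\",\"name\":\"X\"}"));
            // Update: new state under the same key, so it lands on the same partition
            producer.send(new ProducerRecord<>("users", userId, "{\"userId\":\"user-42\",\"name\":\"Y\"}"));
            // Delete: a null value is a tombstone; compaction eventually drops the key
            producer.send(new ProducerRecord<>("users", userId, null));
        }
    }
}
```

Because all three events share the same key, they are guaranteed to be read back in exactly this order.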
You could go a step further and create a wrapper record.
userId, {action:CREATE, data:{name:X, ...}} // full-record
userId, {action:UPDATE, data:{name:Y}} // partial record
userId, {action:DELETE} // no data needed
This topic acts as your "event entity topic", but then you need a stream processor to parse these events and process them consistently into the above format, e.g. null-ing out any action:DELETE record and writing the result to a compacted topic (perhaps automatically, using a Kafka Streams KTable).
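As a rough sketch of that processor, assuming string-encoded wrapper records on the users topic and a pre-created compacted output topic named users-compacted (both names and the isDelete() helper are assumptions):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import java.util.Properties;

public class UserEventProcessor {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("users")                     // wrapper records in
               .mapValues(value -> isDelete(value) ? null : value)  // action:DELETE -> tombstone
               .to("users-compacted");                              // compacted "current state" topic

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-event-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }

    // Hypothetical helper: inspect the wrapper record's action field
    private static boolean isDelete(String value) {
        return value != null && value.contains("\"action\":\"DELETE\"");
    }
}
```

Downstream consumers can then materialize users-compacted into a table (e.g. with builder.table("users-compacted")) and always see the latest state per userId.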
Kafka is not able to maintain ordering across multiple topics. Nor is it able to maintain ordering inside one topic that has several partitions. The only ordering guarantee we have is within each partition of one topic.
What this means is that if the order of user-created and user-deleted as known by a Kafka producer must be the same as the order of those events as perceived by a Kafka consumer (which is understandable, as you explain), then those events must be sent to the same partition of the same topic.
Usually you don't actually need the whole order to be exactly the same for the producer and the consumer (i.e. you don't need total ordering), but you do need it to be the same at least for each entity id: for each user id, the user-created and user-deleted events must be in the same order for the producer and the consumer, although it's often acceptable to have events mixed up across users (i.e. you need partial ordering).
In practice this means you must use the same topic for all those events, which means this topic will contain events with different schemas.
One strategy for achieving that is to use union types, i.e. you declare in your event schema that the type can be either a user-created or a user-deleted. Both Avro and Protobuf offer this feature.
Another strategy, if you're using Confluent Schema registry, is to allow a topic to be associated with several types in the registry, using the RecordNameStrategy schema resolution strategy. The blog post Putting Several Event Types in the Same Topic – Revisited is probably a good source of information for that.
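As a sketch, the producer-side settings for that second strategy look roughly like this (broker and registry URLs are placeholders); with RecordNameStrategy, schemas are registered under the record name rather than under <topic>-value, so for instance UserCreated and UserDeleted records can coexist on one topic:

```java
import java.util.Properties;

public class MultiTypeProducerConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // Resolve schema subjects by record name instead of by topic name,
        // allowing several event types on the same topic
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.RecordNameStrategy");
        return props;
    }
}
```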
Related
I need to make Kafka consumers process all the messages with the same ID in each partition at once. For example, consider one topic containing all orders of different types, with multiple consumer instances subscribing to this topic. How can I run consumers that process all the messages with the same ID in each partition? Because when the orders are produced with that ID, although Kafka guarantees that all messages with the same ID go to the same partition, each partition may contain orders with different IDs. I need to process all the similar orders in each partition at once (not one by one) and once in a while (not as soon as a new message arrives).
As the comments say, you'll need to manually batch your data into "bins per ID", then process those on your own. For example, write each record to a database, group by ID, then iterate/process each batch.
As far as Kafka is concerned, you're required to look at each event "one by one", but this does not require you to "handle them" in that order, unless you care about sequential processing, at-least-once processing, and in-order offset commits.
There's also no way to get "all unique IDs" in any partition without consuming the whole partition end-to-end. As another solution, you could use the Kafka Streams aggregate function to help with this, and punctuate to periodically handle all gathered IDs up to a certain point.
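As an illustration of the manual approach, here is a minimal plain-consumer sketch that bins records per ID and flushes all bins on a fixed interval (the topic name, the interval, and processBatch() are placeholders):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.*;

public class OrderBinningConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-binning");
        props.put("enable.auto.commit", "false"); // commit only after a batch is handled
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, List<String>> bins = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            long lastFlush = System.currentTimeMillis();
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Kafka hands records over one by one; we accumulate them per ID
                    bins.computeIfAbsent(record.key(), k -> new ArrayList<>()).add(record.value());
                }
                // "once in a while": flush every bin on a fixed interval
                if (System.currentTimeMillis() - lastFlush > 60_000) {
                    bins.forEach(OrderBinningConsumer::processBatch);
                    bins.clear();
                    consumer.commitSync();
                    lastFlush = System.currentTimeMillis();
                }
            }
        }
    }

    // Placeholder: handle all messages that share one ID as a single batch
    private static void processBatch(String id, List<String> events) {
        System.out.printf("processing %d events for id %s%n", events.size(), id);
    }
}
```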
I have two business entities in an RDBMS: Associate & AssociateServingStore. I currently plan to have two topics, writing ADD/UPDATE/DELETE events into AssociateTopic & AssociateServingStoreTopic, and these two topics are consumed by several downstream systems for their own business needs.
Whenever an Associate/AssociateServingStore is added from the UI, I currently have Associate & AssociateServingStore written into two separate topics, and a single consumer at my end reads both topics. The problem is the order of messages read from the two separate topics: as this follows a workflow, I cannot read an AssociateServingStore without reading its Associate first. How do I read them in order? (With a partition key I can read data in order within a partition of the same topic.) But here I have two separate topics and want to read them in order: first Associate, then AssociateServingStore.
Thinking as a consumer myself, I planned to read the first 50 rows of Associate and then 50 rows from AssociateServingStore and process those messages. But the problem is that if one of the 50 consumed AssociateServingStore records has no parent among the first 50 Associate events already read/processed, I will get issues on my end saying "parent record not found" on the child insert.
How do I design a consumer for these RDBMS business events, where we have multiple topics but must read them in order, so that I never end up reading a child-topic message before its parent-topic message and getting "parent record not found" issues during insert/update? Is there a way to stage the data in a staging table and process it according to timestamps? I couldn't think of a design that would guarantee the read order.
Any suggestions?
This seems like a streaming-join use case, supported by some stream-processing frameworks/libraries.
For instance, with Kafka Streams or ksqlDB you can treat these topics as either tables or streams, and apply table-table, stream-stream, or stream-table joins.
These joins handle all the considerations related to streams that do not arise in traditional databases, like how long to wait when the time on one of the streams is more recent than the other [1][2].
This presentation[3] goes into the details of how joins work on both Kafka Streams and ksqlDB.
[1] https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization
[2] https://cwiki.apache.org/confluence/display/KAFKA/KIP-695%3A+Further+Improve+Kafka+Streams+Timestamp+Synchronization
[3] https://www.confluent.io/events/kafka-summit-europe-2021/temporal-joins-in-kafka-streams-and-ksqldb/
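For illustration, a minimal Kafka Streams sketch of such a stream-table join, reusing the topic names from the question (the output topic and the string-concatenating joiner are placeholders, and both topics are assumed to be keyed by the associate ID):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import java.util.Properties;

public class AssociateJoin {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // The parent entity becomes a table holding the latest state per key
        KTable<String, String> associates = builder.table("AssociateTopic");
        // The child entity stays a stream of events
        KStream<String, String> servingStores = builder.stream("AssociateServingStoreTopic");

        // Streams synchronizes its inputs by timestamp (see [1][2]), so a parent
        // written before its child is applied to the table before the join fires
        servingStores
            .join(associates, (servingStore, associate) -> associate + "|" + servingStore)
            .to("associate-with-serving-store");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "associate-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```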
I am writing a Kafka producer and need help creating partitions.
I have a group and a user table. A group contains different users, and at any time a user can be part of only one group.
There are two types of events which I will receive as input, and based on them I will add messages to Kafka.
The events related to users.
The events related to groups.
Whenever an event related to a group happens, all the users in that group must be updated in bulk at the consumer end.
Whenever an event related to a user happens, it must be executed as such at the consumer end.
Also, I want to maintain ordering on the basis of time.
If I create user-level partitioning, then the bulk update won't be possible at the consumer end.
If I create group-level partitioning, then the parallel update of user events won't happen.
I am trying to figure out the possibilities I can try here.
Also, I want to maintain ordering on the basis of time.
This means that your topics, no matter how many, cannot have more than one partition each, as otherwise you could receive messages out of order.
Obviously, unless you implement something like sequence ids in your messages (and can share that sequence across possibly multiple producers).
If I create user-level partitioning, then the bulk update won't be possible at the consumer end.
If I create group-level partitioning, then the parallel update of user events won't happen.
It sounds like a very simple messaging design, where you have a single queue (actually backed by a single topic with a single partition) that's consumed by multiple consumers. Actually, any pub-sub messaging technology would be sufficient here (e.g. RabbitMQ's fanout exchanges).
The messages on the queue contain the information whether they are group updates or user updates; the consumers then filter the input depending on what they are interested in.
To discuss an alternative, a single queue for group updates and another for user updates: I understand that it would not be enough due to the ordering demands, as it's possible to get a group update independently of a user update, breaking the ordering.
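As a sketch of that design, assuming a single-partition topic named updates and a type marker at the front of each message (both are assumptions), a consumer interested only in group updates could look like this:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GroupUpdateConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each interested system uses its own group.id, so all of them
        // receive the full, totally ordered stream (pub-sub / fanout)
        props.put("group.id", "group-update-handler");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("updates")); // single partition => total order
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Filter on the type marker; USER_UPDATE events are simply skipped here
                    if (record.value().startsWith("GROUP_UPDATE")) {
                        System.out.println("bulk-updating group members: " + record.value());
                    }
                }
            }
        }
    }
}
```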
From the Kafka documentation:
https://kafka.apache.org/documentation/#intro_consumers
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
So the best you can do is a single topic with a single partition.
I have multiple messages (more specifically, log messages) in a certain topic which have the same ID for a block of messages (these IDs keep changing, but remain the same for a certain block of messages), and I need to find a way to group all the messages with that ID, or to share the data contained in those same-ID messages between all the consumers in a consumer group.
So is there any way I could share data among various consumers in a consumer group?
This sounds like a sessionization use case to me. Kafka doesn't provide any means of grouping or nesting messages together, so you'd have to do that yourself by keeping state in the consumer while processing and wrapping each group of messages with some kind of header. Then you could push this to a new topic of wrapped message groups.
A better approach would probably be to make use of an external database or other system with more flexible means of selecting or organizing data based on fields. You can have a look at this blog post for an example using Spark Streaming + HBase.
There are two ways you can do that.
First, when you publish the message, set a partition key so that all messages with the same ID go to a single partition; on the consumer side they will then always be consumed by a single consumer, as sketched below. [https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example]
Second, if you use Spark Streaming on the consumer side, you could use the sliding-window concept to group all the messages with the same ID. [http://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations]
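For the first option, here is a minimal keyed-producer sketch (topic name, block ID, and payloads are placeholders); the default partitioner hashes the key, so every message with the same ID lands on the same partition:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;

public class KeyedLogProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition => one consumer sees the whole block, in order
            for (String line : new String[]{"log line 1", "log line 2", "log line 3"}) {
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("log-messages", "block-7", line))
                        .get();
                System.out.printf("key=block-7 -> partition %d, offset %d%n",
                                  meta.partition(), meta.offset());
            }
        }
    }
}
```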