consumers process all the related messages with a specific key - apache-kafka

I need to make Kafka consumers process all the messages with the same ID in each partition at once. For example, consider one topic containing all orders of different types, with multiple consumer instances subscribed to it. How can I run consumers so that each one processes all the messages in its partition that share the same ID? Kafka guarantees that all messages with the same ID go to the same partition, but each partition may still contain orders with different IDs. I need to process all the similar orders in each partition at once (not one by one), and only once in a while (not as soon as a new message arrives).

As the comments say, you'll need to manually batch your data into "bins per ID", then process those on your own. For example, write each record to a database, group by ID, then iterate over and process each batch.
As far as Kafka is concerned, you're required to look at each event "one by one", but this does not require you to handle them in that order, unless you care about sequential processing, at-least-once processing, and in-order offset commits.
There's also no way to get "all unique IDs" in any partition without consuming the whole partition end-to-end. As another solution, you could use the Kafka Streams aggregate function to help with this, and punctuate to periodically handle all gathered IDs up to a certain point.
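One possible shape for that punctuate idea, as a sketch using the Kafka Streams Processor API (the store name orders-by-id, the one-minute interval, and the string-concatenation batching are all assumptions; the state store must be registered on the topology separately):

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Gathers orders per ID in a state store and flushes them in batches on a timer.
public class BatchByIdProcessor implements Processor<String, String, String, String> {

    private KeyValueStore<String, String> store;

    @Override
    public void init(ProcessorContext<String, String> context) {
        // Assumed store name; the store must be registered on the topology
        store = context.getStateStore("orders-by-id");

        // Every minute, handle everything gathered so far, then clear it
        context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            List<KeyValue<String, String>> batches = new ArrayList<>();
            try (KeyValueIterator<String, String> it = store.all()) {
                it.forEachRemaining(batches::add);
            }
            for (KeyValue<String, String> batch : batches) {
                // ... process all orders gathered for batch.key at once ...
                store.delete(batch.key);
            }
        });
    }

    @Override
    public void process(Record<String, String> record) {
        // Append the new order to its ID's pending batch (string concat for brevity)
        String pending = store.get(record.key());
        store.put(record.key(), pending == null ? record.value() : pending + "," + record.value());
    }
}
```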

Related

grouping messages and processing bunch of messages at once with kafka [duplicate]


How to guarantee message ordering over multiple topics in kafka?

I am creating a system in which I use Kafka as an event store. The problem I am having is that I cannot guarantee the message ordering of all the events.
Let's say I have a User entity and an Order entity. Right now I have the topics configured as follows:
user-deleted
user-created
order-deleted
order-created
When consuming these topics from the start (when a new consumer group registers), first the user-deleted topic gets consumed, then user-created, etc. The problem with this is that the events across multiple topics do not get consumed chronologically, only within each topic.
Let's say 2 users get created, and after this one gets deleted. The result would be one remaining user.
Events:
user-created
user-created
user-deleted
My system would consume these like:
user-deleted
user-created
user-created
This means the result is 2 remaining users, which is wrong.
I do set the partition key (with the user ID), but this seems to only guarantee order within a topic. How does this problem normally get tackled?
I have seen people using a topic per entity, resulting in 2 topics for this example (user and order), but this can still cause issues with related entities.
What you've designed is "request/response topics", and you cannot order between multiple topics this way.
Instead, design "entity topics" or "event topics". This way, ordering will be guaranteed, and you only need one topic per entity. For example,
Topic: users
For a key of userId, you can structure events this way:
Creates: userId, {userId: userId, name: X, ...}
Updates: userId, {userId: userId, name: Y, ...}
Deletes: userId, null
Use a compacted topic for an event-store such that all deletes will be tombstoned and dropped from any materialized view.
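As a minimal sketch of producing those three event types (assuming String keys/values and plain JSON strings for brevity; the topic name users is from the example above):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "user-42";
            // Create: full record keyed by userId
            producer.send(new ProducerRecord<>("users", userId, "{\"userId\":\"user-42\",\"name\":\"X\"}"));
            // Update: new state under the same key
            producer.send(new ProducerRecord<>("users", userId, "{\"userId\":\"user-42\",\"name\":\"Y\"}"));
            // Delete: a null value acts as a tombstone on a compacted topic
            producer.send(new ProducerRecord<>("users", userId, null));
        }
    }
}
```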
You could go a step further and create a wrapper record.
userId, {action:CREATE, data:{name:X, ...}} // full-record
userId, {action:UPDATE, data:{name:Y}} // partial record
userId, {action:DELETE} // no data needed
This topic acts as your "event entity topic", but you then need a stream processor to parse and process these events consistently into the above format, e.g. turning any action:DELETE into a null tombstone and writing to a compacted topic (perhaps automatically, using a Kafka Streams KTable).
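A rough Kafka Streams sketch of that processor, assuming a hypothetical input topic user-events carrying the wrapper records as JSON strings and a compacted output topic users; the naive contains() check stands in for real JSON parsing:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class UserEventMaterializer {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Parse wrapper records; a DELETE action becomes a null value, i.e. a
        // tombstone once it lands on the compacted "users" topic.
        builder.<String, String>stream("user-events")
               .mapValues(value -> value.contains("\"action\":\"DELETE\"") ? null : value)
               .to("users");

        // A downstream app can then materialize the latest state per userId,
        // e.g. with builder.table("users") in its own topology.

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-event-materializer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```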
Kafka is not able to maintain ordering across multiple topics. Nor is it able to maintain ordering inside one topic that has several partitions. The only ordering guarantee we have is within each partition of one topic.
What this means is that if the order of user-created and user-deleted as known by a Kafka producer must be the same as the order of those events as perceived by a Kafka consumer (which is understandable, as you explain), then those events must be sent to the same Kafka partition of the same topic.
Usually, you don't actually need the whole order to be exactly the same for the producer and the consumer (i.e. you don't need total ordering), but you need it to be the same at least for each entity ID, i.e. for each user ID the user-created and user-deleted events must be in the same order for the producer and the consumer, though it's often acceptable to have events mixed up across users (i.e. you need partial ordering).
In practice this means you must use the same topic for all those events, which means this topic will contain events with different schemas.
One strategy for achieving that is to use union types, i.e. you declare in your event schema that the type can either be a user-created or a user-deleted. Both Avro and Protobuf offer this feature.
Another strategy, if you're using Confluent Schema registry, is to allow a topic to be associated with several types in the registry, using the RecordNameStrategy schema resolution strategy. The blog post Putting Several Event Types in the Same Topic – Revisited is probably a good source of information for that.
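As a hedged configuration sketch for that second strategy (assuming Confluent's Avro serializer; value.subject.name.strategy is Confluent's serializer setting for subject-name resolution):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class MultiTypeProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer, registering subjects by record name,
        // which lets one topic carry several event types
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.RecordNameStrategy");

        KafkaProducer<String, Object> producer = new KafkaProducer<>(props);
        // ... send user-created / user-deleted records (Avro union or separate types) ...
        producer.close();
    }
}
```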

ordering across partitions in Kafka

I am writing a Kafka producer and need help in creating partitions.
I have a group table and a user table. A group contains different users, and at any given time a user can be part of only one group.
There are two types of events which I will receive as input, and based on them I will write to Kafka:
Events related to users.
Events related to groups.
Whenever an event related to a group happens, all the users in that group must be updated in bulk at the consumer end.
Whenever an event related to a user happens, it must be executed as such at the consumer end.
Also, I want to maintain ordering on the basis of time.
If I create user-level partitioning, then the bulk update won't be possible at the consumer end.
If I create group-level partitioning, then the parallel update of user events won't happen.
I am trying to figure out the possibilities I can try here.
Also, I want to maintain ordering on the basis of time.
This means that your topics, no matter how many, cannot have more than one partition each, as otherwise you could receive messages out of order.
Obviously, unless you implement something like sequence IDs in your messages (and can share that sequence across possibly multiple producers).
If I create user-level partitioning, then the bulk update won't be possible at the consumer end.
If I create group-level partitioning, then the parallel update of user events won't happen.
It sounds like a very simple messaging design, where you have a single queue (actually backed by a single topic with a single partition) that's consumed by multiple consumers. Actually, any pub-sub messaging technology would be sufficient here (e.g. RabbitMQ's fanout exchanges).
The messages on the queue contain the information about whether they are group updates or user updates; the consumers then filter the input depending on what they are interested in.
To discuss an alternative, a single queue for group updates and another for user updates: I understand that it would not be enough given the ordering demands, as it's possible to get a group update independently of a user update, breaking the ordering.
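Returning to the single-queue design: a minimal sketch of consumers filtering by update type (the topic name updates and the JSON type field are assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class UpdateConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "update-handlers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("updates")); // single topic, single partition
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Filter on the event type carried in the message itself
                    if (record.value().contains("\"type\":\"GROUP\"")) {
                        // ... bulk-update all users in the group ...
                    } else {
                        // ... apply the single-user update ...
                    }
                }
            }
        }
    }
}
```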
From the Kafka documentation:
https://kafka.apache.org/documentation/#intro_consumers
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.
So the best you can do is a single topic with a single partition.

Apache Kafka order of messages with multiple partitions

As per the Apache Kafka documentation, the order of messages can only be achieved within a partition (or with a one-partition topic). In this case, what parallelism benefit are we getting? It is then equivalent to traditional MQs, isn't it?
In Kafka the parallelism is equal to the number of partitions for a topic.
For example, assume that your messages are partitioned based on user_id, and consider 4 messages having user_ids 1, 2, 3 and 4. Assume that you have a "users" topic with 4 partitions.
Since partitioning is based on user_id, assume that message having user_id 1 will go to partition 1, message having user_id 2 will go to partition 2 and so on..
Also assume that you have 4 consumers for the topic. Since you have 4 consumers, Kafka will assign each consumer to one partition. So in this case as soon as 4 messages are pushed, they are immediately consumed by the consumers.
If you had 2 consumers for the topic instead of 4, then each consumer will be handling 2 partitions and the consuming throughput will be almost half.
To completely answer your question,
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
i.e., if consumption is very slow on partition 2 and very fast on partition 4, then the message with user_id 4 will be consumed before the message with user_id 2. This is how Kafka is designed.
I decided to move my comment to a separate answer as I think it makes sense to do so.
While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for specific user_id (or whatever) to stay in order?
If the former, then there's not much you can do: you should use 1 partition and lose all the parallelism ability.
But in the latter case, you might consider partitioning your messages by some key so that all messages for that key arrive at one partition (they might actually go to another partition if you resize the topic, but that's a different case), which guarantees that all messages for that key are in order.
In Kafka, messages with the same key, from the same producer, are delivered to the consumer in order.
On top of that, data within a partition is stored in the order in which it is written; therefore, data read from a partition is read in order for that partition.
So if you want your messages ordered while using multiple partitions, you need to group your messages by a key, so that messages with the same key go to the same partition, and within that partition the messages are ordered.
In a nutshell, you need to design a two-level solution like the above to get messages ordered across multiple partitions.
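A minimal sketch of that keying step on the producer side (the topic name orders is an assumption); records sharing a key land in the same partition and are therefore ordered relative to each other:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedOrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition => ordered relative to each other
            producer.send(new ProducerRecord<>("orders", "user-1", "order-created"));
            producer.send(new ProducerRecord<>("orders", "user-1", "order-updated"));
            // A different key may land on another partition, processed in parallel
            producer.send(new ProducerRecord<>("orders", "user-2", "order-created"));
        }
    }
}
```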
You may consider having a field that carries the timestamp/date of when the dataset was created at the source.
Once the data is consumed, you can load it into a database. The data then needs to be sorted at the database level before being used for any use case. Well, this is an attempt to help you think in multiple ways.
Let's say the message key is the timestamp generated at the time the data is created, and the value is the actual message string.
As each message is picked up by the consumer, it is written into HBase with the RowKey set to the Kafka key and the value set to the Kafka value.
Since HBase is a sorted map, having the timestamp as the key automatically sorts the data in order. Then you can serve the data from HBase to the downstream apps.
This way you are not losing the parallelism of Kafka. You also retain the ability to sort the data and perform multiple kinds of processing logic at the database level.
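A rough sketch of that consumer-to-HBase bridge using the standard HBase client Put API (the topic/table name events and column family d are assumptions):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToHBase {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "hbase-writers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {
            consumer.subscribe(List.of("events"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    // RowKey = Kafka key (the timestamp), so HBase keeps rows sorted by time
                    Put put = new Put(Bytes.toBytes(rec.key()));
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(rec.value()));
                    table.put(put);
                }
            }
        }
    }
}
```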
Note: no distributed message broker guarantees overall ordering. If you insist on that, you may need to rethink and use another message broker, or use a single partition in Kafka, which is not a good idea. Kafka is all about parallelism, by increasing partitions or increasing consumer groups.
A traditional MQ works in such a way that once a message has been processed, it gets removed from the queue. A message queue allows a bunch of subscribers to pull a message, or a batch of messages, from the end of the queue. Queues usually allow for some level of transaction when pulling a message off, to ensure that the desired action was executed before the message gets removed.
With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.
You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture.
ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8
Well, this is an old thread, but still relevant, hence decided to share my view.
I think this question is a bit confusing.
If you need strict ordering of messages, then the same strict ordering should be maintained while consuming the messages. There is absolutely no point in ordering messages in the queue but not while consuming them. Kafka allows the best of both worlds: it allows ordering messages within a partition right from generation till consumption, while allowing parallelism between multiple partitions. Hence:
If you need absolute ordering of all events published on a topic, use a single partition. You will not have parallelism, nor do you need it (parallelism and strict ordering don't go together).
Otherwise, go for multiple partitions and consumers, and use consistent hashing to ensure that all messages which need to follow a relative order go to a single partition.
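For illustration, a minimal custom Partitioner that does the consistent-hashing step in the same style as Kafka's default keyed partitioning (a sketch that assumes every record carries a key):

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Consistent hashing on the key: the same key always maps to the same partition
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```

You would register it with props.put("partitioner.class", KeyHashPartitioner.class.getName()); in practice, simply setting a key on each record already gives you this behavior with the default partitioner.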

Data Modeling with Kafka? Topics and Partitions

One of the first things I think about when using a new service (such as a non-RDBMS data store or a message queue) is: "How should I structure my data?".
I've read and watched some introductory materials. In particular, take, for example, Kafka: a Distributed Messaging System for Log Processing, which writes:
"a Topic is the container with which messages are associated"
"the smallest unit of parallelism is the partition of a topic. This implies that all messages that ... belong to a particular partition of a topic will be consumed by a consumer in a consumer group."
Knowing this, what would be a good example that illustrates how to use topics and partitions? When should something be a topic? When should something be a partition?
As an example, let's say my (Clojure) data looks like:
{:user-id 101 :viewed "/page1.html" :at #inst "2013-04-12T23:20:50.22Z"}
{:user-id 102 :viewed "/page2.html" :at #inst "2013-04-12T23:20:55.50Z"}
Should the topic be based on user-id? viewed? at? What about the partition?
How do I decide?
When structuring your data for Kafka, it really depends on how it's meant to be consumed.
In my mind, a topic is a grouping of messages of a similar type that will be consumed by the same type of consumer, so in the example above I would just have a single topic; if you decide to push some other kind of data through Kafka, you can add a new topic for that later.
Topics are registered in ZooKeeper, which means that you might run into issues if you try to add too many of them, e.g. the case where you have a million users and have decided to create a topic per user.
Partitions, on the other hand, are a way to parallelize the consumption of the messages. The total number of partitions in a broker cluster needs to be at least the same as the number of consumers in a consumer group for the partitioning feature to make sense. Consumers in a consumer group split the burden of processing the topic between themselves according to the partitioning, so that one consumer is only concerned with messages in the partition it is "assigned to".
Partitioning can either be set explicitly using a partition key on the producer side or, if not provided, a partition will be selected for every message more or less at random.
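A small sketch of those options via the ProducerRecord constructors (the topic name page-views is an assumption):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordExamples {
    public static void main(String[] args) {
        // Keyed record: the producer hashes "user-101" to pick the partition,
        // so all of this user's events stay together and in order
        ProducerRecord<String, String> keyed =
                new ProducerRecord<>("page-views", "user-101", "/page1.html");

        // Explicit partition: bypass key hashing and target partition 0 directly
        ProducerRecord<String, String> explicit =
                new ProducerRecord<>("page-views", 0, "user-101", "/page1.html");

        // No key: the producer spreads records across partitions
        ProducerRecord<String, String> unkeyed =
                new ProducerRecord<>("page-views", "/page1.html");
    }
}
```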
Once you know how to partition your event stream, the topic name will be easy, so let's answer that question first.
#Ludd is correct - the partition structure you choose will depend largely on how you want to process the event stream. Ideally you want a partition key which means that your event processing is partition-local.
For example:
If you care about users' average time-on-site, then you should partition by :user-id. That way, all the events related to a single user's site activity will be available within the same partition. This means that a stream processing engine such as Apache Samza can calculate average time-on-site for a given user just by looking at the events in a single partition. This avoids having to perform any kind of costly partition-global processing
If you care about the most popular pages on your website, you should partition by the :viewed page. Again, Samza will be able to keep a count of a given page's views just by looking at the events in a single partition
Generally, we are trying to avoid having to rely on global state (such as keeping counts in a remote database like DynamoDB or Cassandra), and instead be able to work using partition-local state. This is because local state is a fundamental primitive in stream processing.
If you need both of the above use-cases, then a common pattern with Kafka is to first partition by say :user-id, and then to re-partition by :viewed ready for the next phase of processing.
On topic names, an obvious one here would be events or user-events. To be more specific, you could go with events-by-user-id and/or events-by-viewed.
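A Kafka Streams sketch of the re-partitioning pattern described above, using those suggested topic names (the extractViewedPage helper is hypothetical, standing in for real payload parsing):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class RepartitionByViewed {
    // Hypothetical helper: pull the :viewed page out of the event payload
    static String extractViewedPage(String event) {
        return event.replaceAll(".*\"viewed\":\"([^\"]+)\".*", "$1");
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Re-key the user-partitioned stream by the viewed page, so the next
        // phase of processing sees page-partitioned data
        builder.<String, String>stream("events-by-user-id")
               .selectKey((userId, event) -> extractViewedPage(event))
               .to("events-by-viewed");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "repartition-by-viewed");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```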
This is not exactly related to the question, but in case you have already decided upon the logical segregation of records based on topics and want to optimize the topic/partition count in Kafka, this blog post might come in handy.
Key takeaways in a nutshell:
In general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve. Let the max throughput achievable on a single partition be p for production and c for consumption, and let your target throughput be t. Then you need at least max(t/p, t/c) partitions. For example, with t = 2000 MB/s, p = 50 MB/s and c = 100 MB/s, you would need at least max(40, 20) = 40 partitions.
Currently, in Kafka, each broker opens a file handle of both the index and the data file of every log segment. So, the more partitions, the higher that one needs to configure the open file handle limit in the underlying operating system. E.g. in our production system, we once saw an error saying too many files are open, while we had around 3600 topic partitions.
When a broker is shut down uncleanly (e.g., kill -9), the observed unavailability could be proportional to the number of partitions.
The end-to-end latency in Kafka is defined by the time from when a message is published by the producer to when the message is read by the consumer. As a rule of thumb, if you care about latency, it’s probably a good idea to limit the number of partitions per broker to 100 x b x r, where b is the number of brokers in a Kafka cluster and r is the replication factor.
I think of a topic name as describing a kind of message: producers publish messages to the topic, and consumers receive messages by subscribing to the topic.
A topic can have many partitions. Partitions are good for parallelism; a partition is also the unit of replication, so in Kafka, leader and follower are also defined at the level of a partition. A partition is actually an ordered queue, where the order is the order in which messages arrived; in a word, a topic is composed of one or more such queues. This is useful for modeling our structure.
Kafka was developed by LinkedIn for log aggregation and delivery, and that scenario serves as a good example.
Users' events on your website or app can be logged by your web server and then sent to a Kafka broker through the producer. In the producer, you can specify the partitioning method, for example: event type (different events saved in different partitions), event time (partition a day into different periods according to your app logic), user type, or no logic at all, balancing all logs across many partitions.
Regarding your case in the question, you can create one topic called "page-view-event" and create N partitions, using hash keys to distribute the logs evenly across all partitions. Or you could choose partitioning logic that distributes the logs however you see fit.