Kafka to store the messages on a single partition for a user?

I have an e-commerce-like system which produces user events of different kinds.
I need to store them in Kafka for asynchronous data analysis. I want the events for a specific user to go to one partition, so that a consumer gets all of that user's messages
on one partition. This won't be a dedicated queue per user, which means a single partition can store the data for multiple customers. I'm not sure how
I can achieve this in Kafka.

To send messages of specific users to the same partition, you can use the key= parameter of the producer's send method. Set this parameter to a byte-encoded value that is the same for all messages of a given user (the user ID, for example).
For example, in Python:
producer.send("topic", json.dumps(msg).encode(), key=str(user_id).encode())
This ensures that messages concerning a given user are all pushed to the same partition of the topic.
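A slightly fuller sketch of the same idea, assuming kafka-python, a local broker at localhost:9092 and a made-up topic name "user-events":

import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

msg = {"user_id": 42, "event": "add_to_cart"}  # example payload
# The default partitioner hashes the key, so every message carrying the same
# user_id key lands on the same partition of the topic.
producer.send("user-events",
              value=json.dumps(msg).encode(),
              key=str(msg["user_id"]).encode())
producer.flush()  # block until the message is actually delivered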

@zebra8844's answer is correct. The same key will always go to the same partition, unless you increase the number of partitions in the future, in which case this will no longer hold. So just keep this in mind.

Related

How to sync data for a particular user, when reading from kafka?

I have a streaming service using Kafka, where I receive data from multiple users, and I want each user's data to be processed in a synchronous manner, whereas different users' data can be processed asynchronously. Is there any standard pattern for such scenarios?
You can achieve this by using the userId as the key when publishing the message to Kafka.
Keys ensure that messages published to Kafka with a particular key stay ordered, by pushing them all into a single partition.
And since, within a consumer group, each partition is assigned to exactly one consumer (a partition is never shared among consumers of the same group), that consumer will consume the data from the partition in the sequence it was pushed.
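As a minimal sketch (kafka-python; the topic name "user-events" and the handler are made up), consumers started with the same group_id split the topic's partitions between them, and each reads its partitions in order:

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",                      # hypothetical topic keyed by userId
    bootstrap_servers="localhost:9092",
    group_id="per-user-processors",     # all instances share this group id
)
for record in consumer:
    user_id = record.key.decode()
    event = json.loads(record.value.decode())
    # Every message for this user_id arrives on this partition, in publish
    # order, so per-user processing is naturally sequential.
    handle(user_id, event)              # hypothetical per-user handler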

mark end of logical section in kafka when multiple partitions are used

I want to share a problem and the solution I used, as I think it may be beneficial for others; if you have other solutions, please share.
I have a table with 1,000,000 rows which I want to send to Kafka, spreading the data between 20 partitions.
I want to notify the consumer when the producer has reached the end of the data, without a direct connection between producer and consumer.
I know Kafka is designed as a logically endless stream of data, but I still need to mark the end of this specific table.
There was a suggestion to count the number of items per logical section and send this count (to a metadata topic), so the consumer can count items and know when the logical section has ended.
There are several disadvantages to this approach:
As the data is spread between partitions, I can say there are X items in total in my logical section; however, if there are multiple consumers (one per partition), they'll need to share a counter of consumed messages per logical section. I want to avoid this complexity. Also, when a consumer is stopped and resumed, it needs to know how many items were already consumed and keep that context.
A regular producer session guarantees at-least-once delivery, which means I may have duplicate messages. Counting the messages would need to take this into account (and avoid counting duplicates).
There is also the case where I don't know the number of items per logical section in advance (I'm a kind of consumer myself, consuming a stream of events and being signaled when the data ends). In that case the producer would also need a counter, keep it across stops and resumes, share it between several producers, etc., which adds a lot of complexity to the process.
Solution 1:
I actually want the last message in each partition to indicate that it is the last message.
I can do some work in advance: create some random message keys, send messages partitioned by key, and test which partition each message is directed to. As partitioning by key is deterministic (for a given number of partitions), I can prepare a map of keys to target partitions. For example, key 'xyz' is directed to partition #0, key 'hjk' is directed to partition #1, etc. Finally, I reverse the map: for partition 0 use key 'xyz', for partition 1 use key 'hjk', etc.
Now I can send the entire table (except for the last 20 rows) with a random partitioning strategy, so the data is spread between the partitions for almost the whole table.
When I come to the last 20 rows, I'll send them partitioned by key, setting for each message a key which hashes it to a different partition. This way, each of the 20 partitions gets one of the last 20 messages. For each of these last 20 messages, I'll set a header stating that it is the last one.
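A sketch of the probing step, assuming kafka-python and a hypothetical 20-partition topic named "data": the RecordMetadata returned by send() tells us which partition a keyed message landed on.

import uuid
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Probe with random keys until we have found one key per partition.
partition_to_key = {}
while len(partition_to_key) < 20:
    key = uuid.uuid4().hex.encode()
    meta = producer.send("data", b"probe", key=key).get(timeout=10)
    partition_to_key.setdefault(meta.partition, key)

# partition_to_key[p] now holds a key that deterministically hashes to
# partition p (for the current partition count), ready for the last 20 rows.

Note that the probe messages themselves end up in the topic; in practice they could be sent to a scratch topic with the same partition count, since the key-to-partition mapping depends only on the key and the number of partitions.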
Solution 2:
Similar to Solution 1, but send the entire table spread across random partitions. Then send 20 metadata messages, directed to the 20 partitions using the partition-by-key strategy (by setting the appropriate keys).
Solution 3:
Have an additional control topic. After the table has been sent entirely to the data topic, send a message to the control topic saying the table is complete. The consumer will need to check the control topic from time to time; once it gets the 'end of data' message, it knows that if it has reached the end of a partition, it has actually reached the end of the data for that partition. This solution is less flexible and less recommended, but I wrote it down as well.
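A minimal sketch of the control-topic check (kafka-python; the topic name and the end marker are made up):

from kafka import KafkaConsumer

control = KafkaConsumer("control", bootstrap_servers="localhost:9092")

def table_completed():
    # Poll the control topic briefly; True once the end-of-data marker is seen.
    for records in control.poll(timeout_ms=500).values():
        for record in records:
            if record.value == b"table-complete":  # hypothetical marker
                return True
    return False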
Another solution is to use an open-source analog of S3 (e.g. minio.io). Producers can upload the data to object storage and send a Kafka message containing a link to the object. Consumers remove the data from object storage after collecting it.
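This is essentially the claim-check pattern. A sketch, assuming the minio and kafka-python clients, with made-up endpoints, bucket and object names:

import json
from minio import Minio
from kafka import KafkaProducer

store = Minio("localhost:9000", access_key="minioadmin",
              secret_key="minioadmin", secure=False)
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Producer: upload the payload, then publish only a reference to it.
store.fput_object("payloads", "table-0001.csv", "/tmp/table-0001.csv")
producer.send("data", json.dumps(
    {"bucket": "payloads", "object": "table-0001.csv"}).encode())

# Consumer side, after collecting the data:
# store.remove_object("payloads", "table-0001.csv")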

Kafka with a very large number of topics?

I am considering Kafka to stream updates from the back-end to the front-end applications.
- Data streams are specific to user requests, so each request will generate a stream in the back-end.
- Each user will have multiple concurrent requests: a one-to-many relationship between users and streams.
I first thought I would set up a topic "per user request", but learnt that hundreds of thousands of topics is bad for multiple reasons.
Reading online, I came across posts that suggest one topic partitioned on userid. How is that any better than multiple topics?
If partitioning on userid is the way to go, the consumer will receive updates for different requests (from that user), and that will cause issues: I need to be able to hold off processing a stream until I choose to, and if each request had its own topic this would work out great.
Thoughts?
I don't think Kafka will be a good option for your use case, as your use case is somewhat "synchronous" and "dynamic" in nature: a user request is submitted and the client waits for the stream of response events; the client should also know when the response for a particular request ends. Multiple user requests may end up in the same Kafka partition, as we cannot afford an exclusive partition per user when the number of users is high.
I guess Redis may be a better fit for this use case. Every request can have a unique id, and response events are added to a Redis list with some reasonable expiry time. The Redis list is given the same key name as the request id.
The Redis list will look like this (the key is the request id):
request id --> response event 1, response event 2, ..., response end event
The process relaying the events to the client deletes the list after it has successfully sent all the response events and encountered the "last response event" marker. If the relaying process dies before it can delete the list, Redis will take care of deleting it after the list's expiry time.
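A sketch of that flow with redis-py (the request id, events and end marker are all made up):

import redis

r = redis.Redis()                      # assumes a local Redis instance

request_id = "req-42"                  # hypothetical request id, used as the key

# Back end: append response events under the request id, with an expiry so
# orphaned lists are cleaned up even if the relaying process dies.
r.rpush(request_id, "event-1", "event-2", "END")  # "END" marks the last event
r.expire(request_id, 3600)

# Relaying process: read the events in order, delete once "END" was relayed.
events = [e.decode() for e in r.lrange(request_id, 0, -1)]
if events and events[-1] == "END":
    r.delete(request_id)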
Although it is possible (I guess) to have a Kafka cluster with several thousand topics, I'm not sure it is the way to go in your particular case.
Usually you design your Kafka app around streams of data: click-streams, page-views, etc. Then, if you want some kind of "sticky" processing, you need a partition key. In your case, if you select the user id as the key, Kafka will store all events from a user in the same partition.
A Kafka consumer, on the other side, reads messages from one or more partitions of a topic. That means that if, say, you have a topic with 10 partitions, you can start your Kafka consumers in a consumer group so that every consumer has a distinct set of partitions assigned.
This means, for the user-id example, that every user will be processed by exactly one consumer, depending on the key: for example, userid A goes to partition 1, while userid B goes to partition 10.
Again, you can use the message key to map your data stream to Kafka partitions. All events with the same key will be stored in the same partition and will be consumed/processed by the same consumer instance.

apache-kafka with 100 million topics

I'm trying to replace RabbitMQ with Apache Kafka and, while planning, I bumped into several conceptual problems.
First, we use RabbitMQ with a per-user queue policy, meaning each user uses one queue. This suits our needs because each user represents some job to be done for that particular user, and if a user causes a problem, the queue never affects other users, because the queues are separated. (By "problem" I mean: messages in the queue are dispatched to users via HTTP requests; if a user refuses to receive a message (server down, perhaps?), it goes back into a retry queue, so no messages are lost, unless the queue itself goes down.)
Now, Kafka is fault-tolerant and failure-safe because it writes to disk,
and that is exactly why I am trying to bring Kafka into our structure.
But there are problems with my plan.
First, I was thinking of creating one topic per user, meaning each user would have their own topic. (What problems will this cause? My max estimate is that I will have around 1~5 million topics.)
Second, if I decide to go for topics based on operation, partitioned by a hash of the user id: if there is a problem with one user not consuming messages at the moment, will all the users in that partition have to wait? What would be the best way to structure this situation?
So, in conclusion: 1~5 million users, and we do not want one user to block a large number of other users from being processed. Having a topic per user would solve this issue, but it seems there might be an issue with ZooKeeper if such a large number of topics comes in (is this true?).
What would be the best solution for structuring this, considering scalability?
First, I was thinking of creating one topic per user, meaning each user would have their own topic. (What problems will this cause? My max estimate is that I will have around 1~5 million topics.)
I would advise against modeling like this.
Google around for "kafka topic limits", and you will find the relevant considerations for this subject. I think you will find you won't want to make millions of topics.
Second, if I decide to go for topics based on operation, partitioned by a hash of the user id
Yes, have a single topic for these messages and then route those messages based on the relevant field, like user_id or conversation_id. This field can be present as a field on the message and serves as the ProducerRecord key that is used to determine which partition in the topic this message is destined for. I would not include the operation in the topic name, but in the message itself.
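In Python terms (kafka-python; the topic and field names are made up), that routing looks like:

import json
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# The operation lives inside the message; the user_id field becomes the key,
# so all of one user's jobs land on (and are ordered within) one partition.
msg = {"user_id": 1234, "operation": "send_notification"}
producer.send("user-jobs",
              value=json.dumps(msg).encode(),
              key=str(msg["user_id"]).encode())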
if there is a problem with one user not consuming messages at the moment, will all the users in that partition have to wait? What would be the best way to structure this situation?
This depends on how the users are consuming messages. You could set up a timeout, after which the message is routed to some "failed" topic. Or send messages to users UDP-style, without acks. There are many ways to model this, and it's tough to offer advice without knowing how your consumers forward messages to your clients.
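For instance, the "failed" topic idea might look like this sketch (the push_to_client callable and the topic name are hypothetical):

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def deliver(record, push_to_client):
    # Try to hand the consumed record to the client; if delivery times out,
    # park the message on a "failed" topic instead of blocking the partition.
    try:
        push_to_client(record.value)
    except TimeoutError:
        producer.send("failed", value=record.value, key=record.key)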
Also, if you are using Kafka Streams, take note of the StreamPartitioner interface. This interface appears in KStream and KTable methods that materialize messages to a topic, and may be useful in chat applications where clients are idling on a specific TCP connection.

Apache Kafka order of messages with multiple partitions

As per the Apache Kafka documentation, ordering of messages can only be achieved within a partition, i.e. within one partition of a topic. In this case, what parallelism benefit are we getting, and isn't it equivalent to traditional MQs?
In Kafka the parallelism is equal to the number of partitions for a topic.
For example, assume that your messages are partitioned based on user_id, and consider 4 messages having user_ids 1, 2, 3 and 4. Assume that you have a "users" topic with 4 partitions.
Since partitioning is based on user_id, assume that message having user_id 1 will go to partition 1, message having user_id 2 will go to partition 2 and so on..
Also assume that you have 4 consumers for the topic. Since you have 4 consumers, Kafka will assign each consumer to one partition. So in this case as soon as 4 messages are pushed, they are immediately consumed by the consumers.
If you had 2 consumers for the topic instead of 4, then each consumer would handle 2 partitions and the consuming throughput would be almost halved.
To completely answer your question,
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
i.e., if consumption is very slow in partition 2 and very fast in partition 4, then the message with user_id 4 will be consumed before the message with user_id 2. This is how Kafka is designed.
I decided to move my comment to a separate answer as I think it makes sense to do so.
While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for specific user_id (or whatever) to stay in order?
If the first, then there's not much you can do: you should use 1 partition and lose all the parallelism.
But in the second case, you might consider partitioning your messages by some key, so that all messages for that key arrive at one partition (they might actually move to another partition if you resize the topic, but that's a different case), which guarantees that all messages for that key are in order.
In Kafka, messages with the same key, from the same producer, are delivered to the consumer in order.
On top of that, data within a partition is stored in the order in which it is written; therefore, data read from a partition is read in order for that partition.
So if you want your messages ordered across multiple partitions, you need to group your messages by a key, so that messages with the same key go to the same partition, and within that partition the messages are ordered.
In a nutshell, you need to design a two-level solution like the above to get messages ordered across multiple partitions.
You may consider having a field which holds the timestamp/date of creation of the dataset at the source.
Once the data is consumed, you can load it into a database. The data then needs to be sorted at the database level before the dataset is used for any use case. Well, this is an attempt to help you think in multiple ways.
Let's say the message key is the timestamp generated at the time of creation of the data, and the value is the actual message string.
As each message is picked up by the consumer, it is written into HBase with the row key as the Kafka key and the value as the Kafka value.
Since HBase is a sorted map, having the timestamp as the key automatically sorts the data in order. Then you can serve the data from HBase to the downstream apps.
This way you are not losing the parallelism of Kafka, and you also have the option of sorting and performing multiple processing steps on the data at the database level.
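A sketch of that consumer, assuming the happybase client, an HBase table named "events" with a column family "d", and a topic whose message key is the creation timestamp (all names made up):

import happybase
from kafka import KafkaConsumer

connection = happybase.Connection("localhost")
table = connection.table("events")

consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")
for record in consumer:
    # Row key = Kafka key (the creation timestamp). HBase keeps rows sorted
    # by row key, so downstream reads see the data in creation order,
    # regardless of which Kafka partition each message came from.
    table.put(record.key, {b"d:value": record.value})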
Note: no distributed message broker guarantees overall ordering. If you insist on that, you may need to rethink and use another message broker, or use a single partition in Kafka, which is not a good idea. Kafka is all about parallelism, through increasing partitions or increasing consumer groups.
Traditional MQs work in such a way that once a message has been processed, it gets removed from the queue. A message queue allows a bunch of subscribers to pull a message, or a batch of messages, from the end of the queue. Queues usually allow for some level of transaction when pulling a message off, to ensure that the desired action was executed before the message gets removed.
With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.
You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture.
ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8
Well, this is an old thread, but it is still relevant, hence I decided to share my view.
I think this question is a bit confusing.
If you need strict ordering of messages, then the same strict ordering should be maintained while consuming the messages; there is absolutely no point in ordering messages in the queue but not while consuming them. Kafka allows the best of both worlds: it orders messages within a partition, from generation till consumption, while allowing parallelism between multiple partitions. Hence, if you need:
Absolute ordering of all events published on a topic: use a single partition. You will not have parallelism, nor do you need it (parallelism and strict ordering don't go together).
Relative ordering only: go for multiple partitions and consumers, and use consistent hashing to ensure that all messages which need to maintain relative order go to a single partition.