Split 1 topic/partitions into multiple topics - apache-kafka

I am only starting to learn about Kafka topics/partitions, so I have a case where I have 1 topic and possibly 10,000 partitions, possibly more.
I'm assuming 10,000 partitions is a very large number and that this is discouraged.
So what I am thinking is to split the 1 topic into logical topic buckets and thus have the 10,000 partitions spread among these topics.
So instead of:
1 topic + 10,000+ partitions
I will have:
10 topics + 1,000 partitions each
Is this a viable approach?
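For reference, a minimal sketch of how such a bucketed layout could be created with Kafka's AdminClient is shown below; the topic names, partition counts and replication factor are assumptions for illustration, not something stated in the question.

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateBucketedTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            List<NewTopic> topics = new ArrayList<>();
            // 10 "bucket" topics with 1,000 partitions each instead of 1 topic with 10,000
            for (int i = 0; i < 10; i++) {
                topics.add(new NewTopic("events-bucket-" + i, 1000, (short) 3));
            }
            admin.createTopics(topics).all().get(); // wait for all topics to be created
        }
    }
}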

Related

consume kafka topics with different partition numbers

Hi, I have a Kafka consumer (using the spring-kafka dependency) that listens to multiple topics. Let's say I have 3 topics, which are topicA, topicB and topicC. In my application I consume all three topics in one consumer, like below.
@KafkaListener(topics = {"topicA", "topicB", "topicC"}, groupId = "myGroup", concurrency = "3")
My topics have partitions, and the number of partitions differs for each. Let's say my topicA has 3 partitions, topicB has 6 partitions and topicC has 9 partitions. How should I determine a number for the "concurrency" option in @KafkaListener? (I'm confused since topicB and topicC contain 6 and 9 partitions respectively. So should I change the concurrency to 6 or 9? Or should I change it to 18, which is the total number of partitions from the 3 topics?)
I know that on the consumer side, Kafka always gives a single partition’s data to one consumer thread and the degree of parallelism in the consumer (within a consumer group) is bounded by the number of partitions being consumed.
My main goal is to consume in parallel by using the concurrency option in @KafkaListener.
If you set the concurrency to 18 with the default partition assignor, you will have idle consumers, because the concurrency is greater than the number of partitions in any single topic. The partitions from different topics have no bearing on how the partitions are distributed.
You can use a custom partition assignor (in the consumer configuration) to distribute the partitions differently.
See https://kafka.apache.org/documentation/#consumerconfigs_partition.assignment.strategy
Also see the discussion about RoundRobinAssignor here https://docs.spring.io/spring-kafka/docs/current/reference/html/#using-ConcurrentMessageListenerContainer
Or, simply add 3 separate @KafkaListener annotations to the method, one for each topic, with different concurrencies.
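As a rough illustration of that last option (the class and method names are made up, not from the answer), the per-topic listeners could look like this in a Spring Kafka application, with each concurrency matched to that topic's partition count:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class MultiTopicListener {

    // One listener container per topic; @KafkaListener is repeatable,
    // so all three annotations can sit on the same method.
    @KafkaListener(topics = "topicA", groupId = "myGroup", concurrency = "3")
    @KafkaListener(topics = "topicB", groupId = "myGroup", concurrency = "6")
    @KafkaListener(topics = "topicC", groupId = "myGroup", concurrency = "9")
    public void listen(ConsumerRecord<String, String> record) {
        // process the record
        System.out.println(record.topic() + "-" + record.partition() + ": " + record.value());
    }
}

Alternatively, the single-listener variant could keep concurrency = "18" and set partition.assignment.strategy to org.apache.kafka.clients.consumer.RoundRobinAssignor in the consumer configuration, as described in the documentation linked above.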

Number of consumers in kafka consumer-group

If a producer has 3 topics with 4 partitions each, should the consumer group contain 4 or 12 consumers?
I want to achieve ideal consumption.
There should be one consumer per partition for ideal consumption. So, in your case, 12 consumers would be ideal.
If you have N partitions, then you can have up to N consumers within the same consumer group each of which reading from a single partition. When you have less consumers than partitions, then some of the consumers will read from more than one partition. Also, if you have more consumers than partitions then some of the consumers will be inactive and will receive no messages at all.
You cannot have multiple consumers -within the same consumer group- consuming data from a single partition. Therefore, in order to consume data from the same partition using N consumers, you'd need to create N distinct consumer groups too.
Note that partitioning enhances the parallelism within a Kafka cluster. If you create thousands of consumers to consume data from only one partition, I suspect that you will lose some level of parallelism.
If you have 3 topics with 4 partitions each:
For best optimisation you should have 4 consumers per consumer group.
Reason: if you have more than 4 consumers, your extra consumers will be left idle, because the 4 consumers will be assigned the 4 partitions, with 1 consumer assigned to 1 partition. So, in short, more than 4 consumers per consumer group is not required.
If you have fewer consumers, say 2 consumers for the 4 partitions, each consumer will consume messages from 2 partitions, which will overload it.
There is no limit to the number of consumer groups which subscribe to a topic.
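As a small illustration (broker address and topic names are assumptions), consumers in a group only differ in how many copies of the same client you run; the instance below would be started 12 times with the same group.id to get one consumer per partition across the 3 topics:

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "myGroup");                 // same group for all 12 instances
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // 3 topics x 4 partitions = 12 partitions, so up to 12 useful instances
            consumer.subscribe(Arrays.asList("topic1", "topic2", "topic3"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d@%d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}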

How to repartition in Spark with one (or several) very big partitions?

I have a DataFrame with 10 partitions, but 90% of the data belongs to 1 or 2 partitions. If I invoke dataFrame.coalesce(10) this splits each partition into 10 parts, while this is not necessary for 8 of the partitions. Is there a way to split only the partitions with data into more parts than the others?
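This question is left unanswered here, but one common way to attack skew like this is to repartition on a synthetic salt column rather than to coalesce. A minimal Java sketch under that assumption (the column name and partition count are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.rand;

public final class SkewRebalance {
    // Spread a skewed DataFrame across numParts partitions using a random salt column.
    public static Dataset<Row> rebalance(Dataset<Row> df, int numParts) {
        return df
                .withColumn("salt", rand().multiply(numParts).cast("int")) // random bucket
                .repartition(numParts, col("salt"))                        // hash-partition by salt
                .drop("salt");                                             // discard helper column
    }
}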

How does offset work when I have multiple topics on one partition in Kafka?

I am trying to develop a better understanding of how Kafka works. To keep things simple, currently I am running Kafka on one ZooKeeper with 3 brokers and one partition with a replication factor of 3. I learned that, in general, it's better to have the number of partitions ~= the number of consumers.
Question 1: Do topics share offsets in the same partition?
I have multiple topics (e.g. dogs, cats, dinosaurs) on one partition (e.g. partition 0). Now my producers have produced a message to each of the topics: "msg: bark" to dogs, "msg: meow" to cats and "msg: rawr" to dinosaurs. I noticed that if I specify dogs[0][0], I get back bark, and if I do the same on cats and dinosaurs, I get back each message respectively. This is an awesome feature but it contradicts my understanding. I thought an offset is specific to a partition. If I have pushed three messages into a partition sequentially, shouldn't the messages be indexed with 0, 1, and 2? Now it seems to me that an offset is specific to a topic.
This is how I imagined it
['bark', 'meow', 'rawr']
In reality, it looks like this
['bark']
['meow']
['rawr']
But that can't be it. There must be something keeping track of offset and the actual physical location of where the message is in the log file.
Question 2: How do you manage your messages if you were to have multiple partitions for one topic?
In question 1, I have multiple topics in one partition, now let's say I have multiple partitions for one topic. For example, I have 4 partitions for the dogs topic and I have 100 messages to push to my Kafka cluster. Do I distribute the messages evenly across partitions like 25 goes in partition 1, 25 goes in partition 2 and so on...?
If a consumer wants to consume all those 100 messages at once, he/she needs to hit all four partitions. How is this different from hitting 1 partition with 100 messages? Does network bandwidth impose a bottleneck?
Thank you in advance
For your question 1: It is impossible to have multiple topics on one partition. A partition is, conceptually, part of a topic. You can have 3 topics, each of which has only one partition, so you have 3 partitions in total. That explains the behavior that you observed.
For your question 2: At the producer side, if a valid partition number is specified, that partition will be used when sending the record. If no partition is specified but a key is present, a partition will be chosen using a hash of the key. If neither key nor partition is present, a partition will be assigned in a round-robin fashion. Now, the number of partitions decides the maximum parallelism. There is a concept called a consumer group, which can have multiple consumers in the same group consuming the same topic. In the example you gave, if your topic has only one partition, the max parallelism is one and only one consumer in the consumer group will receive messages (all 100 of them). But if you have 4 partitions, you can have up to 4 consumers, one for each partition, and each receives 25 messages.
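A small sketch of those three producer cases (the topic name, partition number and broker address are placeholders, not from the answer):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitioningDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1. Explicit partition: always goes to partition 2 of "dogs"
            producer.send(new ProducerRecord<>("dogs", 2, "rex", "bark"));
            // 2. Key but no partition: the partition is chosen by hashing the key "rex"
            producer.send(new ProducerRecord<>("dogs", "rex", "bark"));
            // 3. No key, no partition: the producer spreads records across partitions itself
            producer.send(new ProducerRecord<>("dogs", "bark"));
        }
    }
}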

Calculate Kafka topic partitions

I have an n node Kafka cluster with just 2 topics
I have replicated the topics across all n nodes
I 'think' I have just a single consumer in the form of a MirrorMaker consuming all topics, although I intend to increase that from 1 to n MirrorMakers
How many partitions should my topics use? 1, then later n?
The number of nodes in your cluster (what you call "n") should be independent of the number of partitions in a topic (let's call that "p"). The max number of consumers in a consumer group (Mirror Maker or any other single group) will be p, but the number you actually need will be entirely driven by your throughput performance and message ordering requirements.
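To make the throughput point concrete, a widely used rule of thumb (not stated in this answer, and the numbers below are invented for illustration) is to take the larger of total-throughput/per-partition-producer-throughput and total-throughput/per-partition-consumer-throughput:

public class PartitionCountEstimate {
    public static void main(String[] args) {
        // All figures are illustrative assumptions, in MB/s.
        double targetThroughput = 200.0;   // what the topic must sustain overall
        double perPartitionProduce = 25.0; // measured producer throughput per partition
        double perPartitionConsume = 20.0; // measured consumer throughput per partition

        int p = (int) Math.ceil(Math.max(
                targetThroughput / perPartitionProduce,
                targetThroughput / perPartitionConsume));

        System.out.println("Estimated minimum partitions: " + p); // prints 10 in this example
    }
}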