Any advantages to splitting up Kafka topics?

I am working on an application/Kafka cluster which will be producing/consuming messages (around 100k a second) to a topic. The message format is identical, so my initial thought was to have a single topic for all messages.
However, is there any benefit to Kafka in splitting the messages into multiple topics? There is a logical separation which could be applied to split the topic into multiple (10-ish) topics.
Apart from the producer/consumer side of things, does Kafka itself have any preferences around performance, redundancy, stability, management, etc. with one large topic versus multiple smaller topics?

Topic partitions are the usual means of parallelism in Kafka, though you could opt to split into multiple topics as well if you wanted. But I would first look into the partitioning aspect of things. Here is a good Confluent article on how to pick the right number of partitions. In particular, note that if you are partitioning on keys, then adding partitions after the fact changes the key-to-partition mapping and splits each key's data, so think it through properly up front as best you can.
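To make that key-splitting point concrete, here is a minimal sketch, assuming the kafka-clients library on the classpath and using its public murmur2 helpers that the default partitioner is built on; the key name is made up:

```java
import org.apache.kafka.common.utils.Utils;
import java.nio.charset.StandardCharsets;

public class KeyMapping {
    // Mirrors the default partitioner's rule for keyed records:
    // murmur2 hash of the key bytes, modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        String key = "order-42"; // hypothetical message key
        // The same key usually maps to a different partition once the count
        // changes, so that key's records end up split across partitions.
        System.out.println(partitionFor(key, 10));
        System.out.println(partitionFor(key, 12));
    }
}
```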

Parallelism in Kafka depends on the number of partitions in a topic. Throughput increases with the number of partitions as long as that number is optimal (an unnecessarily large number of partitions creates overhead). By increasing the number of consumers, you can stream messages from the partitions simultaneously.
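As a minimal sketch of that consumer-side parallelism (broker address, topic name, and group id are placeholders; assumes the Java kafka-clients library): run several copies of this process with the same group.id, and Kafka assigns each one a disjoint subset of the topic's partitions.

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-processing-group");     // same id across all instances
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r ->
                    System.out.printf("partition %d offset %d: %s%n", r.partition(), r.offset(), r.value()));
            }
        }
    }
}
```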

Related

Advice on how I can decrease Kafka Lag

I'm relatively new to working with Kafka; below is a sample of my current setup.
Kafka Setup
Multiple topics that each have one partition. Two consumer groups, each containing one consumer.
The issue I am seeing is that the lag is enormous, sometimes upwards of 8-10 hours before messages are consumed; the load is about 100-200 million messages a day.
What steps should I look at in order to address this? Is it as simple as reassigning partitions, or creating new partitions for the 3 topics that are being consumed by the two consumers? I've also looked at compressing the producer's payloads with gzip, but it doesn't really help in terms of the lag. I've looked at network connections and don't feel they have anything to do with this. If anyone could point me in the direction of Kafka and low-latency documents, that would be good too.
Generally the approach is to parallelize your consumption by increasing the number of partitions, and the number of consumers in the consumer groups that subscribe to those topics (Nconsumers <= Npartitions).
And distribute your topics by increasing the number of brokers in your cluster.
So, from topic considerations:
Fewer partitions per topic result in:
producer and/or consumer lag;
starved or overloaded brokers and consumers.
More partitions per topic result in (take this into account):
more broker resources, i.e. file handles and memory;
overhead with each additional partition, and a limit on the number of partitions a broker can handle;
replication overhead.
Then increase the number of consumers in those consumer groups.
Try increasing partitions per topic, but by itself that will not help! You will also need to increase the number of consumers in your consumer group. Are those single consumers or consumer groups in your diagram? How many consumers are in your consumer group versus partitions on the topics they are subscribed to?
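For reference, a hedged sketch of growing an existing topic's partition count with the Admin API (topic name, target count, and broker address are placeholders; the same thing can be done with kafka-topics.sh --alter). Remember the caveat above: without more consumers in the group, the new partitions sit idle.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;
import java.util.Map;
import java.util.Properties;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Grow "my-topic" to 12 partitions in total (counts can only increase).
            admin.createPartitions(Map.of("my-topic", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}
```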
From this in your message:
I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag.
I get the idea that your messages may be huge! Is that so? If yes, try to keep messages small, for example by excluding BLOBs from the messages and keeping external links to them instead.
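A minimal sketch of that idea, sometimes called the claim-check pattern (uploadToBlobStore is a hypothetical helper standing in for S3/HDFS/whatever external store you use; topic and key names are placeholders):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class ClaimCheckProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        byte[] blob = new byte[5 * 1024 * 1024]; // e.g. a 5 MB payload
        String link = uploadToBlobStore(blob);   // hypothetical external store
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish only the small reference, not the blob itself.
            producer.send(new ProducerRecord<>("my-topic", "doc-1", link));
        }
    }

    // Hypothetical stand-in: in reality this would write to S3/HDFS/etc.
    static String uploadToBlobStore(byte[] payload) {
        return "s3://my-bucket/doc-1";
    }
}
```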
Still, the issue may be somewhere else, such as bad configs, consumer commit handling (acknowledgments), etc.
So I highly advise you to read the article Fine-tune Kafka performance with the Kafka optimization theorem.
I also advise you to go through the Apache Kafka courses on the Confluent web page.
This should be added as a comment, but I don't have the permissions to do so. The provided info is very limited and the diagram is incorrect, which limits the ability to provide an adequate, helpful answer. If possible, please correct your diagram and add more details about your setup, such as:
broker configuration (file attached);
consumer setup (consumer commit handling);
producer setup;
topic setup;
Kafka version (the defaults differ across major/minor versions).
The provided diagram is not correct in its notion of the topic-partition relationship, so I assume it is a typo and Partition 0 should be substituted with Broker 0, right?
Kafka's topics are divided into several partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit that holds a subset of records owned by a topic...
Then there is an open question of the number of partitions in each topic and the number of topics on each broker, as well as the number of brokers in your cluster!

Kafka - Standalone server - How to decide partitions?

I have a standalone Kafka setup with a single disk, planning to stream over a million records. How do I decide the number of partitions for my topic for better throughput? Does it have to be 1 partition?
Is it recommended to have multiple partitions for a topic on a standalone Kafka server?
Yes, you need multiple partitions even for a single-node Kafka cluster. That is because you can only have as many consumers as you have partitions. If you have a single partition then you can only have a single consumer, and that will limit throughput, especially if you want to stream millions of rows (although the period for those is not specified).
The only real downside is that messages are only consumed in order within the same partition. Other than that, you should go with multiple partitions. You will need to estimate the throughput of a single consumer in order to calculate the partition count, then maybe add one or two on top of that.
You can still add partitions later, but it's probably better to try to start with the right amount first and change later as you learn more or as your volume increases/decreases.
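If you do start with the right amount up front, creating the topic with an explicit partition count is a one-liner with the Admin API. A hedged sketch; the topic name, 12 partitions, and replication factor 1 (reasonable on a single node) are all assumptions:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            // name, number of partitions, replication factor (1 on a single node)
            admin.createTopics(List.of(new NewTopic("events", 12, (short) 1)))
                 .all().get();
        }
    }
}
```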
There are two main factors to consider:
Number of consumers
Within a consumer group, each partition is assigned to at most one consumer. For this reason, the number of partitions must be at least the number of consumers you want running in parallel.
Throughput
You must determine the throughput to calculate how many consumers should be in the consumer group. The combined read capacity of the consumers should be at least as high as the combined write capacity of the producers.
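A back-of-the-envelope sketch of that sizing rule (all numbers are made-up assumptions; measure your own producer and consumer rates first):

```java
public class PartitionSizing {
    public static void main(String[] args) {
        double targetMBps = 100.0;           // combined producer write rate (assumed)
        double perPartitionWriteMBps = 20.0; // measured write rate per partition (assumed)
        double perConsumerReadMBps = 10.0;   // measured processing rate per consumer (assumed)

        // Enough partitions that both the write path and the read path keep up.
        int partitions = (int) Math.ceil(Math.max(
                targetMBps / perPartitionWriteMBps,
                targetMBps / perConsumerReadMBps));
        System.out.println("Suggested minimum partitions: " + partitions); // 10 here
    }
}
```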

Kafka - Best practices in case of slow processing consumer. How to achieve more parallelism?

I'm aware that the maximum number of active consumers in a consumer group is the number of partitions of a topic.
What's the best practice in case of slow processing consumers? How to achieve more parallelism?
An example: A topic with 6 partitions and thousands of messages per second produced from Producers. So I have at most 6 consumers in the group. Consider that processing those messages is complex and the consumers are much slower than the producers. The result is that the consumers are always behind the last offset and the lag is increasing.
In a traditional MQ system, we simply add more and more consumers to stay up to date.
How can I achieve this with Kafka, given that the total number of consumers in a group is at most the number of partitions? Should I:
Configure the topic to have more partitions allowing more consumers per group?
Route the message from the consumer to a traditional MQ Queue (but lose the ordering)?
What's the best practice for this situation?
In Kafka, partitions are the unit of parallelism.
Without knowing your exact use case and requirements it's hard to come up with precise recommendations, but there are a few options.
First, you should really consider having more partitions. 6 partitions is relatively small; you could easily have 60, 120 or even more partitions (and the corresponding number of consumers). Suddenly the amount of work each consumer has to do is significantly reduced.
Also, if your requirements allow, you can consume at a fast rate and spread the processing of records across many workers, as sketched below. In solutions like this it's harder to maintain ordering, but if you don't need ordering then it's an option worth considering.
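A hedged sketch of that fast-poll/parallel-workers pattern (placeholder names; note the real complications it glosses over: per-partition ordering is lost, and with auto-commit enabled offsets can be committed before a worker finishes, so a production version needs manual offset tracking):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPoolConsumer {
    public static void main(String[] args) {
        ExecutorService workers = Executors.newFixedThreadPool(16); // parallelism beyond partition count
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "slow-processing-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic"));
            while (true) {
                // The poll loop stays fast; the expensive work happens on the pool.
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    workers.submit(() -> process(rec));
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> rec) {
        // slow, complex processing goes here
    }
}
```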
I'm not sure how routing messages through an MQ queue would really help in this scenario. If you are still reading more slowly than you are writing, the amount of data in the queue will grow until you have no disk space left.
Kafka is better designed to serve as a buffer between your producers and consumers, so just ensure you have retention limits on your topics that allow some flexibility on the consumer side without losing data.

Decide when to create new topic or increase partition count

Say I have a Kafka topic with 10 partitions. When the data rate increases, I can increase the partitions to speed up my processing logic.
But my doubt is whether increasing the partitions is the right move, or whether I should go for a topic split-up (that is, based on my application logic, some data will go to topic 1 and some data to topic 2; by doing this, I can split the data rate across two topics).
Will choosing a new topic rather than increasing partitions, or increasing partitions rather than creating a new topic, have any performance impact on the Kafka cluster?
Which one is the best solution?
It depends!
It is usually recommended to slightly over-partition topics that are likely to increase in throughput so you don't have to add partitions when this happens.
The main reason is that if you're using keyed messages, adding partitions will change the key-to-partition mapping. So after adding partitions, messages with a given key won't go to the same partition as before. If you need per-key ordering, this can be problematic.
Adding partitions is usually easier as consumers and producers won't need updates; you will just be able to add consumers to scale. You also keep all events together and only have a single topic to worry about. Depending on the size of your cluster, with only 10 partitions you probably still have a lot of leeway to add partitions. From Kafka's point of view, 10 partitions is pretty small and you can easily have 50 or even more.
On the other hand, when creating new topics, clients will need to be updated to use them. Nevertheless, that could be a solution if over time you start receiving more types of events and want to reorganize them across several topics.
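One way to soften that client-update cost: consumers can subscribe to a pattern instead of a fixed list, so new topics matching the naming scheme are picked up automatically. A sketch; the events-.* naming convention is an assumption:

```java
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.util.Properties;
import java.util.regex.Pattern;

public class PatternSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "events-group"); // pattern subscription requires a consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Matches events-orders, events-payments, ... including topics created later.
            consumer.subscribe(Pattern.compile("events-.*"));
            // poll loop as usual ...
        }
    }
}
```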

Maximum subscription limit of Kafka Topics Per Consumer

What is the maximum number of topics a consumer can subscribe to in Kafka? I am not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will there be a degradation in performance?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
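A minimal sketch of the keys-instead-of-topics idea (topic and key names are made up): records sharing a key always land on the same partition, so per-key ordering is preserved without needing a topic per entity.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One shared topic, hundreds of thousands of distinct keys.
            producer.send(new ProducerRecord<>("device-events", "device-123456", "{\"reading\":42}"));
        }
    }
}
```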
To be technical, the "maximum" number of topics you could be subscribed to would be constrained by the available memory space of your consumer process (if your topics are listed explicitly then a very large portion of the Java String pool will be your topic names). This seems the less likely limiting factor (listing that many topics explicitly would be prohibitive).
Another consideration is how the topic assignment data structures are set up on the group coordinator brokers. They could run out of space to record the topic assignments, depending on how they do it.
Lastly, and most plausibly, there is the available memory on your Apache ZooKeeper node. ZK keeps ALL data in memory for fast retrieval. ZK is also not sharded, meaning all data MUST fit onto one node. This means there is a limit to the number of topics you can create, constrained by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
The consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command-line consumer that is shipped with Kafka.
That said, the logic of subscribing to a Kafka topic, how many topics to subscribe to, and how to handle that data is up to the consumer. So the scalability issue here lies with the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub-sub mechanism that Kafka provides through the segregation of messages into various topics is to facilitate handling specific categories of messages with dedicated consumers. So if you want to consume many topics, say a few thousand of them, with a single consumer, why divide the data into separate topics in the first place?