What criteria should determine the number of Kafka partitions? - apache-kafka

I have a few questions that came up while I was setting up a Kafka service.
First question.
What is the recommended criterion for partitioning busy services?
I understand it is a good idea to decide the number of partitions based on the memory of the producers and the consumers.
Are there any other criteria for determining the number of partitions?
Accounts of your own experience would also be a great help.
Second question.
Sometimes only one broker is busy while the Kafka service is running.
How do I fix this?
Is there any way to prevent it?
Third question.
Is there any way to find out that a server shut down uncleanly (a dirty shutdown)?

In general, the more partitions in your Kafka cluster, the higher the throughput you can achieve. Note, however, that having too many partitions in total has a negative impact on things like availability and latency. This article from Confluent can shed some light on your first question.
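One commonly cited rule of thumb for a starting partition count is throughput-based: measure the throughput a single partition sustains when producing (p) and when consuming (c), pick a target throughput (t), and use at least max(t/p, t/c) partitions. The sketch below only illustrates that arithmetic; the function name and the numbers are made up for the example and should be replaced with your own measurements.

```python
import math


def min_partitions(target_mb_s: float, producer_mb_s: float, consumer_mb_s: float) -> int:
    """Lower bound on the partition count: max(t/p, t/c), rounded up.

    target_mb_s   -- throughput the topic must sustain (t)
    producer_mb_s -- measured produce throughput of ONE partition (p)
    consumer_mb_s -- measured consume throughput of ONE partition (c)
    """
    if min(target_mb_s, producer_mb_s, consumer_mb_s) <= 0:
        raise ValueError("throughput values must be positive")
    return max(
        math.ceil(target_mb_s / producer_mb_s),
        math.ceil(target_mb_s / consumer_mb_s),
    )


# Hypothetical measurements: 100 MB/s target, 20 MB/s per partition when
# producing, 10 MB/s per partition when consuming.
print(min_partitions(100, 20, 10))  # 10
```

Treat the result as a floor, not a recommendation: the availability and latency costs mentioned above argue against going far beyond it.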
Coming to your second question: a topic is made up of at least one partition, and a Kafka broker hosts multiple partitions of various topics. Some of these partitions are leaders and some are follower replicas of partitions whose leaders live on other brokers. A broker can therefore have some active partitions (leaders) and some inactive ones (replicas). I guess that in your case a single broker holds all the leader partitions, so you need to check your replication and partitioning strategies.
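To check whether leadership is skewed towards one broker, you can count leaders per broker. This is a minimal sketch that assumes you have already collected a mapping of (topic, partition) to leader broker id, for example by reading the Leader field printed by kafka-topics.sh --describe; the topic names and broker ids below are hypothetical.

```python
from collections import Counter


def leaders_per_broker(partition_leaders: dict, broker_ids: list) -> dict:
    """Count how many partition leaders each broker currently holds.

    partition_leaders -- {(topic, partition): leader_broker_id}
    broker_ids        -- every broker in the cluster, so that brokers
                         with zero leaders still show up in the result
    """
    counts = Counter(partition_leaders.values())
    return {broker: counts.get(broker, 0) for broker in broker_ids}


# Hypothetical metadata in which broker 1 leads every partition.
leaders = {
    ("orders", 0): 1,
    ("orders", 1): 1,
    ("orders", 2): 1,
    ("payments", 0): 1,
}
print(leaders_per_broker(leaders, [1, 2, 3]))  # {1: 4, 2: 0, 3: 0}
```

A result like this one, where a single broker holds all the leaders, would match the "only one broker is busy" symptom.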
Regarding your last question, consider a Kafka cluster monitoring tool such as Confluent's Control Center or Landoop's Lenses.

Related

Advice on how I can decrease Kafka Lag

I'm relatively new to working with Kafka. Below is a sample of my current setup.
Kafka Setup
Multiple topics that all have one partition each. 2 Consumer Groups with each group containing one consumer.
The issue I am seeing is that the lag is enormous: messages sometimes wait upwards of 8-10 hours to be consumed. The load is about 100-200 million messages a day.
What steps should I look at in order to address this? Is it as simple as reassigning partitions or creating new partitions for the 3 topics that are being consumed by the two consumers? I've also looked at compressing the producer's messages with gzip, but it doesn't really help in terms of the lag. I've looked at the network connections and don't feel that they have anything to do with this. If anyone could point me in the direction of documents on Kafka and low latency, that would also be good.
Generally, the approach is to parallelize consumption by increasing the number of partitions and the number of consumers in the consumer groups that subscribe to those topics (Nconsumers <= Npartitions).
You can also spread your topics across more brokers by increasing the number of brokers in your cluster.
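As a rough illustration of why the partition count caps parallelism, here is a small sketch. It relies on the fact that within one consumer group each partition is assigned to at most one consumer, so consumers beyond the partition count sit idle. The function names are made up for this example, and the load figure is the upper end of the range given in the question.

```python
SECONDS_PER_DAY = 86_400


def required_rate(messages_per_day: int) -> float:
    """Average messages per second needed just to keep up with the load."""
    return messages_per_day / SECONDS_PER_DAY


def active_consumers(consumers_in_group: int, partitions: int) -> int:
    """Consumers in one group that can actually receive data from a topic."""
    return min(consumers_in_group, partitions)


# 200 million messages a day is roughly 2315 messages per second on average.
print(round(required_rate(200_000_000)))  # 2315

# With a single partition, adding consumers to the group does not help.
print(active_consumers(4, 1))  # 1

# With 6 partitions, a group of 4 consumers can all work in parallel.
print(active_consumers(4, 6))  # 4
```

With one partition per topic and one consumer per group, a single consumer has to sustain the whole average rate on its own, before counting any peaks.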
So, from the topic point of view:
Fewer partitions per topic result in:
producer and/or consumer lag;
starved or overloaded brokers and consumers.
(But take into account) More partitions per topic result in:
more broker resources used: file handles and memory;
overhead with each additional partition, and the number of partitions a broker can handle is limited;
more replication load.
Then increase the number of consumers in those consumer groups.
Try increasing the number of partitions per topic, but by itself that will not help! You will also need to increase the number of consumers in your consumer group. Are those single consumers or consumer groups in your diagram? How many consumers are in each consumer group compared with the number of partitions on the topics they are subscribed to?
From this in your message:
I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag.
I get the idea that your messages may be huge! Is that so? If yes, try to keep messages small (for example, by excluding BLOBs and keeping external links to them instead).
Still, the issue may be somewhere else, such as bad configs, consumer commit handling (acknowledgments), etc.
So I strongly advise you to read the article Fine-tune Kafka performance with the Kafka optimization theorem.
I also advise you to go through the Apache Kafka courses on the Confluent website.
This should be added as a comment, but I don't have permission to do so. The provided info is very limited and the diagram is incorrect, which limits the ability to give an adequate, helpful answer. If possible, please correct your diagram and add more details about your setup, such as:
broker configuration, file attached;
consumer set-up (Consumer commit messages);
producer set-up;
topic set-up;
Kafka version (the defaults differ between major/minor versions).
The provided diagram is not correct in its notion of the topic-partition relationship, so I assume it is a typo and Partition 0 should be replaced with Broker 0, right?
Kafka's topics are divided into several partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit that holds a subset of records owned by a topic...
Then there is the open question of the number of partitions in each topic and the number of topics on each broker, as well as the number of brokers in your cluster!

Does Kafka support millions of partitions?

Will we have any problems if we have millions of partitions for one topic?
Due to a business requirement, we are considering creating a partition for every user in Kafka.
We have millions of users.
Any insight would be appreciated!
Yes, I think you will end up having problems if you have millions of partitions, for several reasons:
(Most importantly!) Customers come and go, so you will either have to change the number of partitions constantly or carry plenty of unused partitions (because you cannot reduce the number of partitions within a topic).
More Partitions Require More Open File Handles: more partitions means more directories and segment files on disk.
More Partitions May Increase Unavailability: planned shutdowns move leaders off a broker one at a time, with minimal downtime per partition. In a hard failure, all the leaders on that broker become unavailable at once.
More Partitions May Increase End-to-end Latency: for a message to be seen by a consumer, it must be committed. The broker replicates data from the leader with a single thread, resulting in overhead per partition.
More Partitions May Require More Memory in the Client.
More details are provided in the Confluent blog post How to choose the number of topics/partitions in a Kafka cluster?.
In addition, Confluent's training material for Kafka developers states:
"The current limits (2-4K Partitions/Broker, 100s K Partitions per cluster) are maximums. Most environments are well below these values (typically in the 1000-1500 range or less per Broker)."
This blog explains that "Apache Kafka Supports 200K Partitions Per Cluster".
This might change with the replacement of ZooKeeper (KIP-500) but, again, given the first bullet point above, this would still be an unhealthy software design.
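To put the quoted limits in perspective, here is a back-of-the-envelope sketch. It assumes the per-broker figure counts partition replicas hosted on a broker and takes the upper quoted value of 4,000; the replication factor of 3 and the function name are assumptions made for the example.

```python
import math


def brokers_needed(partitions: int, replication_factor: int, max_replicas_per_broker: int) -> int:
    """Minimum broker count so that no broker hosts more than the given
    number of partition replicas, assuming a perfectly even spread."""
    total_replicas = partitions * replication_factor
    return math.ceil(total_replicas / max_replicas_per_broker)


# One partition per user for one million users, replication factor 3,
# and at most 4,000 partition replicas per broker.
print(brokers_needed(1_000_000, 3, 4_000))  # 750

# The same topic is also five times over the quoted 200K-per-cluster figure.
print(1_000_000 / 200_000)  # 5.0
```

So even before operational concerns, one partition per user sits far outside the ranges quoted above.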

How to change the number of brokers for a topic in a kafka cluster?

I have a problem with some Kafka topics and couldn't find an answer to it yet.
While adding more partitions to __confluent.support.metrics shouldn't be a problem (I know how to do that), I wonder if it is possible to tell it to use brokers which obviously can not be seen by this topic?
Also I'd love to understand why these topics only inherit some brokers instead of all available 5 brokers in their cluster.
I'd love to fix these topics, but I fear that if I tell Kafka to add (or use) partitions on brokers the topic can't "see", it might not work or might even destroy the topic, which would be rather bad.
How can I tell these topics that there are 5 available brokers? Can I do it with one of the Kafka tools?
How could that have happened in the first place?
Why does the __consumer_offsets topic only "see" 4 brokers instead of 5 like all other topics in this cluster do?
FYI: I didn't set up any of this, but I have to clean up/revamp the running clusters and am stuck now. I have never come across this sort of problem before.
This happened because you have only one partition and one replica for the __confluent.support.metrics topic. In a 5-node cluster, this means the topic uses only 20% of the available brokers, which corresponds with the image you've posted. A topic with replication factor 1 and 1 partition will only ever hold data on one broker.
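The 20% figure follows from simple counting: a topic cannot occupy more brokers than it has partition replicas. A minimal sketch of that upper bound (the function name is made up for illustration):

```python
def max_brokers_used(partitions: int, replication_factor: int, brokers: int) -> int:
    """Upper bound on the number of brokers that can hold data for a topic.

    Each partition has replication_factor replicas, and each replica lives
    on exactly one broker, so the topic has partitions * replication_factor
    replicas in total and cannot touch more brokers than that.
    """
    return min(brokers, partitions * replication_factor)


# 1 partition with replication factor 1 on a 5-broker cluster.
used = max_brokers_used(1, 1, 5)
print(used, used / 5)  # 1 0.2
```

This is only an upper bound; the actual placement depends on which brokers were available when the topic was created.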
On the other hand, it is unusual that your __consumer_offsets topic would be using only 4 out of 5 brokers. My guess would be that your 5th broker was not online at the time of creation of __consumer_offsets (this is created when you consume from any topic for the first time) and thus no partitions were created on this broker.
However, this is probably nothing to worry about, as the spread of partitions across the cluster is generally handled by Kafka itself rather than being a user problem. There is no concept of a topic "seeing" a broker per se; rather, the brokers hold the data for the topics, and the topics will know which brokers they reside on. A topic doesn't generally need to concern itself with other brokers.
Both the consumer offsets and Confluent metrics topics have line items in the server properties file that determine the configuration those topics are created with.
To improve the health of those topics, you can attempt to increase the replication factor, which will spread your topic over more brokers and provide fault tolerance. Also see the Kafka Tools Wiki.

Maximum subscription limit of Kafka Topics Per Consumer

What is the maximum number of topics a consumer can subscribe to in Kafka? I am not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will performance degrade?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
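To illustrate the keys-instead-of-topics idea: with a keyed record, the producer's default partitioner hashes the key and takes the result modulo the partition count, so any number of keys maps onto a fixed, small set of partitions, and all records with the same key land in the same partition. The sketch below uses CRC32 purely as a stand-in hash (Kafka's default partitioner uses murmur2, so the actual partition numbers will differ); the key format and partition count are hypothetical.

```python
import zlib


def partition_for_key(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index.

    Stand-in for Kafka's default partitioner: CRC32 is used here only
    for illustration, NOT because Kafka uses it.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 12  # hypothetical partition count for a single topic

# Many distinct keys share a small, fixed set of partitions.
keys = [f"user-{i}" for i in range(10_000)]
partitions_used = {partition_for_key(k, NUM_PARTITIONS) for k in keys}
print(len(keys), "keys spread over", len(partitions_used), "partitions")

# The same key always maps to the same partition, which preserves
# per-key ordering as long as the partition count does not change.
print(partition_for_key("user-42", NUM_PARTITIONS) == partition_for_key("user-42", NUM_PARTITIONS))  # True
```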
To be technical, the "maximum" number of topics you could subscribe to would be constrained by the memory available to your consumer process (if your topics are listed explicitly, a very large portion of the Java String pool will be your topic names). This seems the less likely limiting factor (listing that many topics explicitly is prohibitive).
Another consideration is how the topic assignment data structures are set up at the Group Coordinator brokers. They could run out of space to record the topic assignment, depending on how they do it.
Lastly, and most plausibly, the limit is the available memory on your Apache ZooKeeper node. ZK keeps ALL data in memory for fast retrieval. ZK is also not sharded, meaning all data MUST fit onto one node. This means the number of topics you can create is limited by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
The consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command-line consumer that ships with Kafka.
That said, the logic of subscribing to Kafka topics, how many to subscribe to, and how to handle the data is up to the consumer. So the scalability issue here lies with the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub-sub mechanism that Kafka provides, by segregating messages into topics, is to let separate consumers handle specific categories of messages. So if you want to consume many topics, say a few thousand of them, with a single consumer, why divide the data into separate topics in the first place?

Increasing the number of topics in Kafka causes ZooKeeper to fail

We plan to use Kafka as a message broker for an IoT use case in which each device is treated as a unique topic. When I simulated 10 messages per second to 10 thousand topics, ZooKeeper became a bottleneck, and because of that all Kafka monitoring tools failed to read the throughput values and the number of topics from the JMX port. Will tuning ZooKeeper solve the issue? In the real IoT use case there will be millions of devices sending data to millions of topics. I want to make sure the approach is sound before going ahead. Please suggest.
There is another option available now: try Apache Pulsar.
It looks promising regarding the number of topics, is generally quite similar to Kafka, and is now compatible with Kafka.
https://pulsar.apache.org/
You're right that even though you could theoretically have millions of topics in Kafka, the number of topics is bound by ZooKeeper, so in practice you cannot have millions of topics. Creating a topic/partition per device is not a scalable solution.
Is there a reason why you need millions of topics?