I have a Kafka cluster with 3 brokers (v2.3.0), and each broker has 2 disks attached. I added a new topic (heavyweight) and was surprised that, even though the topic has 15 partitions, they weren't distributed evenly across the disks. As a result, one disk is almost empty while the other is almost full. Is there any way to have Kafka distribute data evenly across its disks?
Related
I have a Kafka topic with one partition. I'm trying to send messages to the broker. The source is 1.5 TB in size. My broker has two directories to store the Kafka partitions:
/dev/sdc1 1.1T 567G 460G 56% /data_disk_0
/dev/sdd1 1.1T 1.1T 0 100% /data_disk_1
Each one is 1.1 TB in size. As my topic has only one partition, Kafka is storing all the messages on /dev/sdd1. Eventually the disk fills up completely because the source is larger than the target disk. Can I span my topic's partition so that half the data is stored on disk0 and the other half on disk1, without changing the number of partitions?
Please advise.
I couldn't find any configuration-related changes that I could make to Kafka to do this.
This isn't possible at the Kafka configuration level. You'd have to use RAID or logical volume groups (LVM) to pool the disks together into one volume.
The Kafka documentation mentions:
You can either RAID these drives together into a single volume or format and mount each drive as its own directory
If your data is so heavily skewed towards one disk, meaning towards certain partitions, you should check how your producers are partitioning the data, start persisting such a large topic somewhere else, or turn on compaction / retention periods for these topics.
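For reference, the broker setting that controls multiple data directories is log.dirs in server.properties. A minimal sketch, assuming the mount points from the question (the kafka-logs sub-directories are just placeholders):

# server.properties on each broker
# Kafka balances whole partitions across these directories,
# not the data inside a single partition
log.dirs=/data_disk_0/kafka-logs,/data_disk_1/kafka-logs

With a one-partition topic, all of its log segments still end up in exactly one of these directories, which is why pooling the disks with RAID or LVM is the only way to split that partition's data across both disks.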
I have read the documentation here, and I'm a bit unsure about the partition log.
First they say:
For each topic, the Kafka cluster maintains a partitioned log that looks like this:
Then they show a picture:
Also they say
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Do I understand correctly that:
A cluster can have only one partition log of a topic? In other words, two partitions of the same topic cannot be in the same cluster?
A cluster can have multiple partition logs from different topics?
The picture about a topic should be more like this?
A topic consists of one or many partitions. You specify the number of partitions when creating the topic, and partitions can also be added after creation.
Kafka will spread the partitions across as many brokers as it can in the cluster. If you only have a single broker, they will all be on that broker.
Many partitions from the same topic can live on the same broker. This happens all the time, as most clusters have only a dozen brokers and it's not uncommon to have 50 partitions, so several partitions from the same topic will live on the same broker.
What the docs say is that a partition is a unit that cannot be split. It's either on a broker or not. A topic, on the other hand, is just a collection of partitions that share the same name and configuration.
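To make the creation step concrete, here is a minimal sketch using the Java AdminClient from kafka-clients (the topic name, partition count, replication factor, and broker address are placeholders):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Properties;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3 -- adjust to your cluster size
            NewTopic topic = new NewTopic("test", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}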
To answer your question:
For a Kafka cluster of b brokers and a topic with p partitions, each broker will hold roughly p/b partitions as the leader (primary) copy. Brokers may also hold replica partitions, but that depends on your replication factor. So, for example, if you have a 3-node cluster and a topic test with 6 partitions, each node will have 2 partitions.
Yes, it surely can. Extending the previous point, if you have two topics test1 and test2, each with 6 partitions, then each broker will hold 4 partitions in total (2 from each topic).
I guess that in your diagram you have mislabeled the brokers as clusters.
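To make the p/b arithmetic concrete, here is a tiny round-robin illustration in Java (Kafka's real assignment starts from a randomly chosen broker and staggers replicas, so treat this as a simplified sketch rather than the exact layout):

public class PartitionSpreadSketch {
    public static void main(String[] args) {
        int brokers = 3;     // b
        int partitions = 6;  // p

        // Simplified round-robin leader placement: partition i -> broker i % b
        for (int i = 0; i < partitions; i++) {
            System.out.println("partition " + i + " -> broker " + (i % brokers));
        }
        // Each broker ends up with roughly p/b = 2 leader partitions
    }
}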
Does the number of partitions have an impact on producer throughput in Kafka?
(I understand that the number of partitions is the upper bound for the degree of parallelism on the consumer side, but does it affect producer performance?)
I used the producer performance tool in Kafka to test this on a Kafka cluster set up on AWS. I observed that for 3, 6, and 20 partitions the aggregated throughput in the cluster was approximately the same (around 200 MB/s). I would appreciate it if you could help me clarify this issue.
Thank you.
An answer in two parts:
From the Kafka consumer perspective. Yes, partitions give improved throughput for Kafka consumers. But, I found that you really want to minimise the number of Kafka consumers (and therefore partitions) if you want good scalability. Here's a link to a blog I wrote last year on a Kafka IoT application (see section 2.3)
From the Kafka producer perspective, throughput drops with more partitions. Last week I ran some benchmarks with Kafka producers and different numbers of partitions and found that the throughput drops off significantly with more partitions. To "size" a Kafka cluster correctly, the only solution is to increase the Kafka cluster size (nodes and/or cores) until you get the target capacity with the required number of partitions. I needed 2M write/s and 200 partitions (for concurrency on the consumer side). For a 6 node (4 cores per node) cluster I could do 2.1M write/s with 6 partitions, but only 1.2M write/s with 200 partitions. On a 6 node cluster with 8 core nodes I could get 4.6M write/s with 6 partitions, and slightly more than my target throughput of 2.4M write/s with 200 partitions. I haven't blogged about these results yet but here's a link to the current blog series (Anomalia Machina).
Note: Throughput can also be increased by (a) reducing the replication factor or (b) by only writing to a subset of partitions (!) but then you probably don't need all the partitions.
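If it helps to reproduce this kind of measurement without the bundled perf tool, below is a bare-bones producer loop along the same lines. This is only a sketch: the topic name, record size, record count, and broker address are placeholders, and settings such as acks, batch.size, and linger.ms strongly affect the numbers.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import java.util.Properties;

public class ProducerThroughputSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());
        props.put("acks", "1"); // acks and batching settings heavily influence throughput

        byte[] payload = new byte[100];   // placeholder record size
        long records = 1_000_000L;        // placeholder record count

        long start = System.currentTimeMillis();
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (long i = 0; i < records; i++) {
                // No key: records are spread across all partitions of the topic
                producer.send(new ProducerRecord<>("test", payload));
            }
            producer.flush();
        }
        long elapsedMs = System.currentTimeMillis() - start;
        System.out.println((records * 1000.0 / elapsedMs) + " records/s");
    }
}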
We are in the process of designing a Kafka cluster (at least 3 nodes) that will process events from an array of web servers. Since the logs are largely identical, we are planning to create a single topic only (say, webevents).
We expect a lot of traffic from the servers. Since there is a single topic, there will be a single leader broker. In such a case, how will the cluster balance the high traffic? All write requests would always be routed to the leader broker, and the other nodes might be underutilized.
Does an external hardware load balancer help solve this problem? Alternatively, can a Kafka configuration help distribute write requests evenly in a single-topic cluster?
Thanks,
Sharod
Short answer: a topic may have multiple partitions, and each partition, not the topic, has a leader. Leaders are evenly distributed among brokers. So, if you have multiple partitions in your topic, you will have multiple leaders and your writes will be evenly distributed among brokers.
You will have a single topic with a lot of partitions; you can replicate partitions for high availability/durability of your data.
Each broker will hold an evenly distributed number of partitions, and each of these partitions can be either a leader or a replica for a topic. Kafka producers (the Kafka clients running in your web servers, in your case) write to a single leader; this provides a means of load balancing production so that each write can be serviced by a separate broker and machine.
Producers do the load balancing by selecting the target partition for each message. This can be done based on the message key, so that all messages with the same key go to the same partition, or in a round-robin fashion if you don't set a message key.
Take a look at this nice post. I took the diagram from there.
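A minimal sketch of the two partitioning modes described above, using the Java producer from kafka-clients (the topic name, key, value, and broker address are placeholders):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class PartitioningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed: all records with the same key hash to the same partition
            producer.send(new ProducerRecord<>("webevents", "server-42", "payload"));

            // No key: records are spread across partitions (round-robin in older
            // clients, sticky batching in newer ones)
            producer.send(new ProducerRecord<>("webevents", "payload"));
        }
    }
}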
We started to use Apache Kafka to persist Timeseries data into a Timeseries database. What we started with was to just have a single topic, a producer writing to this topic and a single consumer reading from this topic and dumping the data to the Timeseries database.
We had 3 broker instances, and what we noticed on the first try was that the producer was pretty fast at writing messages to the topic. Within about 30 minutes, we had around 1.5 million messages. The consumer was only doing 300 messages per second.
Our next approach was to partition the topic and have more consumer instances (equal to the number of partitions). This definitely improved the consumer write speed. Now my questions are:
What happens if I set my topic's partition count to 6, but I have only 3 broker instances? Which broker instance would be the leader for partitions 1 to 6?
Is there a formula to determine how many partitions I need? Since this was our test environment, we could play with it and scale it; we might not be able to do the same in our production environment. So how do I determine the partition count?
The partitions get distributed amongst your brokers. It's impossible to know which broker will be elected leader of a given partition -- and it can change over time. Depending on which version of Kafka and which Consumer API you use, your consumer may or may not discover partition leaders on its own. With the SimpleConsumer you have to find partition leaders on your own, and respond to new leader election in your code (instead of having it handled by the API automatically).
As to the number of partitions -- there's no real "formula" other than this: you can have no more parallelism than you have partitions. If you have 4 partitions and 5 consumers, one of the consumers will starve. I usually use numbers like 12 or 60 or multiples thereof for the number of partitions for large topics. Something that divides easily and cleanly among variable numbers of consumers.
Also, note that you can later on change the number of partitions, with some caveats. See this answer for how and what the caveats are.
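For reference, partitions can be added later with the Java AdminClient, with the caveats from that linked answer (in particular, key-to-partition mapping changes for keyed topics). A minimal sketch; the topic name, new partition count, and broker address are placeholders:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;
import java.util.Collections;
import java.util.Properties;

public class AddPartitionsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the topic to 12 partitions; the count can only be increased, never decreased
            admin.createPartitions(
                Collections.singletonMap("test", NewPartitions.increaseTo(12))
            ).all().get();
        }
    }
}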