Kafka cluster performance dropped after adding more Kafka brokers

Does anybody know of a possible reason for message processing slowing down when more Kafka brokers are added to the cluster?
The situation is the following:
Setup 1: In a Kafka cluster of 3 brokers I produce messages to 50 topics (replication factor=2, 1 partition, acks=1), each with a consumer assigned. I measure the average time to process one message (from producing to consuming).
Setup 2: I add 2 more Kafka brokers to the cluster. They are created by the same standard tool, so they have the same characteristics (CPU/RAM) and the same Kafka configs. I create 50 new topics (replication factor=2, 1 partition, acks=1), just to save time and avoid replica reassignment, so the replicas are spread over the 5 brokers. I produce messages only to the new 50 topics and measure the average processing time: it became slower by almost 1/3.
So I didn't change any producer, consumer, or broker settings (except for listing the 2 new brokers in the Kafka and ZooKeeper configs), and I can't explain the performance drop. Please point me to any config option, log file, or useful article that would help explain this. Thank you in advance.

In a Kafka cluster of 3 brokers I produce some messages to 50 topics
In the first setup, you have 50 topics with 3 brokers.
I add 2 more Kafka brokers to the cluster. I create 50 new topics
In the second setup, you have 100 topics with 5 brokers.
Even supposing scaling were linear, 100 topics would call for 6 brokers, not 5.
So the replicas are spread over the 5 brokers
Here, how the replicas are spread also matters. A broker may be serving 10 partitions as leader, another broker may be serving 7, and so on. In that case, a particular broker may have more load than the others, and that could be the cause of the slowdown.
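If you want to check for that kind of skew, a minimal sketch with the Java AdminClient that counts partition leaders per broker could look like the following (the bootstrap address is a placeholder); kafka-topics --describe shows the same information from the command line.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class LeaderDistribution {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address - point this at your own brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Fetch metadata for every topic, then count how many partitions
            // each broker currently leads.
            Collection<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).all().get();

            Map<Integer, Integer> leadersPerBroker = new HashMap<>();
            descriptions.values().forEach(d -> d.partitions().forEach(p -> {
                if (p.leader() != null) {
                    leadersPerBroker.merge(p.leader().id(), 1, Integer::sum);
                }
            }));

            leadersPerBroker.forEach((brokerId, count) ->
                    System.out.println("Broker " + brokerId + " leads " + count + " partition(s)"));
        }
    }
}
```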
Also, when you have replication.factor=2, what matters is whether acks=all, acks=1, or acks=0. If you have set acks=all, then all in-sync replicas must acknowledge the write back to the producer, which could slow it down.
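For reference, this is roughly what the acks setting looks like on the producer side; a minimal sketch, where the bootstrap address and topic name are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AcksExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap address and topic name.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // acks=1: only the partition leader must acknowledge the write (the setup
        // in the question). acks=all waits for all in-sync replicas and adds
        // latency; acks=0 does not wait for any acknowledgement at all.
        props.put(ProducerConfig.ACKS_CONFIG, "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("topic-1", "key", "value"));
        }
    }
}
```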
Next, the locality and configuration of the new brokers are worth considering: what machines they run on, their CPU and RAM, the processor load, and the network between the old brokers, the new brokers, and the clients.
Moreover, if your application consumes a lot of topics, it necessarily has to make requests to a lot of brokers, since the topic partitions are spread among different brokers. Utilizing one broker to the fullest (CPU, memory, etc.) versus spreading the load over multiple brokers is something you can benchmark.

Related

Kafka - How to recover if a partition is lost?

I have 4 Kafka nodes in a cluster, one topic split into 40 partitions, and a replica count of 2. The Kafka version is 2.3.1.
How can I recover from a situation where two Kafka nodes die at the same time, it is not possible to start them again, and the Kafka logs are lost?
I'm sure that I lose some data, because some partitions are lost (some partitions had replicas only on the dead nodes).
I tried adding two new Kafka nodes and reassigning partitions across all 4 available Kafka nodes. In the end, the lost partitions are not reassigned to the two new Kafka nodes, and clients cannot publish data that goes to the lost partitions.
Kafka recovers the lost partitions by itself only if those partitions still have at least one alive replica that was previously in sync. Otherwise, unclean leader election (unclean.leader.election.enable) must be enabled on the brokers to move leadership to an out-of-sync replica.
Since the partitions had only 2 replicas and you lost 2 nodes, you might lose some partitions.
You can go from 2 replicas to 4 replicas for more reliability.
The two added nodes should have the same broker IDs as the lost ones to be able to pull the replicas.
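To illustrate the unclean leader election option mentioned above, here is a minimal AdminClient sketch (the bootstrap address and topic name are placeholders) that sets unclean.leader.election.enable=true for one topic; the same setting can also be applied broker-wide or with the kafka-configs tool.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.Collections;
import java.util.Properties;

public class EnableUncleanElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address and topic name.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow an out-of-sync replica of "my-topic" to become leader.
            // This trades durability (possible data loss) for availability.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            AlterConfigOp enable = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "true"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(enable)))
                 .all().get();
        }
    }
}
```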

Impact of having a large number of consumers on producers / brokers

I want to understand the impact of a large number of consumers on producer latency and on the brokers. I have around 7K independent consumers. Each consumer consumes all partitions of a topic; I have manually assigned the partitions of a topic to each consumer, not using consumer groups. Each consumer consumes messages from only one topic. Message size is small, less than a KB, and produce throughput is also low, but when a produce spike comes, produce latency goes up (here acks=1). Broker resource usage is also very low. I want to understand the impact of having a large number of consumers on Kafka.
Cluster Details:
Brokers: 13 (each broker: 14 cores & 36 GB memory)
Kafka cluster version: 2.0.0
Kafka Java client version: 2.0.0
Number of topics: ~15.
Number of consumers: 7K (all independent; all partitions of a topic are manually assigned to a consumer, and each consumer consumes all partitions of one topic only)
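For reference, the manual-assignment pattern described above (no consumer group, offsets handled by the application) looks roughly like this; a minimal sketch, with the bootstrap address and topic name as placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

public class ManualAssignmentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap address and topic name.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // No group.id and no subscribe(): partitions are assigned manually, so
        // there is no consumer-group coordination or rebalancing. Offsets are
        // tracked by the application, not committed to Kafka.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("my-topic").stream()
                    .map(info -> new TopicPartition(info.topic(), info.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.partition() + ": " + r.value()));
            }
        }
    }
}
```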

Kafka Producer, Consumer, Broker on the same host?

Are there any downsides to running the same producer and consumer code on all nodes in the cluster? If there are 8 nodes in the cluster (8 consumers, 8 Kafka brokers, and 8 producers), would 8 producers then be running at the same time in the cluster? Is there a way to modify the cluster so that only one producer runs at a time?
A Kafka cluster is nothing but Kafka brokers running under distributed consensus. The cluster is agnostic about the number of producers and consumers running around it; producers and consumers are clients of the Kafka cluster. Producers stream data into Kafka and consumers consume data from Kafka. Within the cluster, data is distributed across topics, and topics are sharded using partitions. If multiple consumers belong to the same consumer group, the consumers can work in a self-healing fashion.
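As a minimal sketch of that consumer-group behaviour (bootstrap address, group id, and topic name are placeholders): every instance started with the same group.id splits the topic's partitions among the group members and takes over a member's partitions if it fails.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap address, group id and topic name.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // All instances started with the same group.id share the topic's
            // partitions; if one instance dies, its partitions are rebalanced
            // to the surviving members.
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.value()));
            }
        }
    }
}
```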
Is there a way to modify the cluster so that only one producer runs at a time?
If you intend to run a single producer at any given point in time, you don't need to make any change within the cluster.
Are there any downsides to running the same producer and consumer code on all nodes in the cluster?
The primary downsides here would be scalability and memory usage.
Producers and Consumers are not required to run on Brokers. Producers should be deployed where data is being generated (or run on separate hosts, like Kafka Connect workers).
Consumers should be scaled out independently based on the throughput and ordering guarantees that you need in your downstream systems.
There is nothing that says 8 brokers require 8 producers and 8 consumers; partitions are what matter more.
If you have N partitions in a topic, you can only scale to N active consumers (per consumer group) anyway, and arbitrarily many producers.
8 brokers can hold lots of partitions for any given topic
Running a single producer is an implementation detail of your own code; the broker cannot enforce it.

How to rebalance data on Kafka if data is stored persistently

I'm new to Kafka and preparing to use it in production.
What strategies can be used for rebalancing data storage if the brokers holding a topic's current partitions are running out of disk space and more brokers can be added to the cluster?
As a simple example, say a topic has 3 partitions at the beginning (1 replica, to simplify the problem), 3 brokers each store 1 partition of the topic, and each of these partitions takes up 1 TB of disk space.
How can I add 3 new broker servers, alter the topic's partition count to 6, and end up with a rebalanced result where each of the 6 partitions takes up 500 GB of disk space on its broker?
I think this problem is critical for storing large amounts of data indefinitely in a Kafka cluster.
Thanks.
kafka-reassign-partitions and kafka-preferred-replica-election are the built-in commands for handling such relocation tasks, as Kafka does not perform them automatically on cluster expansion.
There are also vendor alternatives, such as those from Confluent and DataDog.
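On clusters running Kafka 2.4 or newer, the same relocation can also be triggered through the AdminClient; a minimal sketch, where the topic name and target broker id are placeholders (older clusters would use kafka-reassign-partitions with a JSON plan instead):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

public class MovePartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address, topic name and broker id.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Move partition 0 of "big-topic" so that its single replica lives
            // on broker 4, e.g. one of the newly added, emptier brokers.
            Map<TopicPartition, Optional<NewPartitionReassignment>> plan = new HashMap<>();
            plan.put(new TopicPartition("big-topic", 0),
                     Optional.of(new NewPartitionReassignment(Arrays.asList(4))));
            admin.alterPartitionReassignments(plan).all().get();
        }
    }
}
```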
How can I add 3 more new broker servers
See Docs - Expanding your cluster
alter topic's partition amount to 6
Use kafka-topics --alter to increase the number of partitions (note: this does not relocate existing data to the new partitions, or in other words "re-key" the topic).
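The same partition increase can be done programmatically; a minimal AdminClient sketch, equivalent to kafka-topics --alter --partitions 6, with the bootstrap address and topic name as placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Collections;
import java.util.Properties;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address and topic name.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow "big-topic" from 3 to 6 partitions. Existing records stay in
            // their old partitions; only newly produced records land in the new ones.
            admin.createPartitions(
                    Collections.singletonMap("big-topic", NewPartitions.increaseTo(6)))
                 .all().get();
        }
    }
}
```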
Also, keep in mind that once you create topics, replicas and ISRs will get defined. Where possible, try to choose a replication factor of 3 for resiliency and durability. Having a replication factor of 2 in a 3-node cluster is not helpful in certain sticky situations, where if one (of the 3) brokers goes down, then none of the available or online brokers will join the replica set (to satisfy the replication factor) and move into the ISR.
In a situation like this, you will end up with an ISR that is incomplete and worse, end up with a single point of failure.
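As a small illustration of that advice, creating a new topic with replication factor 3 via the AdminClient might look like this (topic name, partition count, and bootstrap address are placeholders):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicWithRf3 {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address and topic name.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: a single broker failure still
            // leaves two in-sync replicas for every partition.
            NewTopic topic = new NewTopic("new-topic", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```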
Note that a broker being down is different from expanding or contracting the Kafka cluster.

Kafka consumers unable to keep pace on some brokers, not others

I have a topic with 6 partitions spread over 3 brokers (ie 2 partitions per broker).
I have consumers on 6 separate worker nodes (using Storm).
The partitions are all accepting 20MB/s of messages.
2 partitions are able to output 20MB/s to the consumers on 2 nodes but the other 2 are only managing ~15 MB/s.
File cache is working properly and there are no direct disk reads on any broker.
The offset tracking for the partitions is done by the consumer (i.e. manualPartitionAssignment; nothing is committed to Kafka or ZooKeeper).
What could be causing the apparent internal latency on 2 of the brokers for 4 of the partitions? The load profile, GC, etc. seem similar across all 3 brokers' JVMs. I am monitoring all manner of metrics for the consumer fetch operation through the JMX MBeans but can't figure this out. Any pointers?