We have too many partitions per topic, and it will be a while before we have time to properly address the issue.
In the meantime we are trying to mitigate it by decreasing the replication factor.
Will that affect the partition count per broker?
Decreasing the replication factor will reduce the number of file handles on the brokers, but it will not affect the partition count of the topics.
The number of partitions of a topic cannot be reduced.
I am trying to improve Kafka producer throughput. We have CSV reports that are processed and published to a Kafka topic. Using default Kafka settings we get on average 300-500 kbps of throughput. To improve it I have tried some combinations of linger.ms and batch.size, but it is not helping.
I tried:
linger.ms=30000, batch.size=1000000, buffer.memory=16777216
linger.ms=40000, batch.size=1500000, buffer.memory=16777216
but throughput stays around 150-200 kbps.
I even tried much smaller values,
linger.ms=200, batch.size=65000
but then throughput only drops further, to 100-150 kbps.
The Kafka topic has 12 partitions, acks is set to all, and compression is snappy.
Any suggestions are welcome.
There is a comprehensive white paper from Confluent which explains how to increase throughput and which configurations to look at.
Basically, you have already taken the right steps by increasing batch.size and tuning linger.ms. Depending on how much potential data loss you can accept, you may also reduce retries. As an important factor for increasing throughput, you should set a compression.type in your producer and at the same time set compression.type=producer at the broker level, so the broker keeps the data as compressed by the producer instead of recompressing it.
Remember that Kafka scales with partitions, and that only works if you have enough brokers in your cluster. Having many partitions that are all located on the same broker will not increase throughput.
To summarize, the white paper mentions the following producer configurations to increase throughput:
batch.size: increase to 100000 - 200000 (default 16384)
linger.ms: increase to 10 - 100 (default 0)
compression.type=lz4 (default none)
acks=1 (default 1)
retries=0 (default 0)
buffer.memory: increase if there are a lot of partitions (default 33554432)
Keep in mind that, in the end, each cluster behaves differently. In addition, each use case has a different message structure (volume, frequency, byte size, ...). Therefore, it is important to understand the producer configurations mentioned above and test their sensitivity on your actual cluster; a minimal example of how they plug into a producer is sketched below.
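For illustration only, here is a sketch of a producer with those settings in Scala (using the Java Kafka client). The broker address, topic name, and the concrete values are placeholders to tune against your own cluster:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")  // placeholder address
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
props.put(ProducerConfig.BATCH_SIZE_CONFIG, "131072")   // ~128 KB, within the 100000-200000 range above
props.put(ProducerConfig.LINGER_MS_CONFIG, "50")        // within the 10-100 ms range above
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4")
props.put(ProducerConfig.ACKS_CONFIG, "1")
props.put(ProducerConfig.RETRIES_CONFIG, "0")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("my-topic", "key", "value"))  // topic name is a placeholder
producer.close()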
Although we do not have any performance issues yet, and the nodes are pretty much idle, is it advisable to increase the number of Kafka brokers (and ZooKeeper nodes) from 3 to 5 right away to improve cluster high availability? The intention would then of course be to increase the replication factor from 3 to 5 as the default config for critical topics.
If a high level of data replication is essential for your business, it is advisable to increase the broker count. Keep in mind that, on top of the extra nodes, you also take on extra network load, because every message has to be replicated to more brokers. Obviously, if you increase the number of brokers in the cluster, you decrease the risk of losing high availability.
It depends on your needs. If you do not have to ensure very high availability (for example, a bank), increasing the replication factor in your cluster will reduce overall performance, because when you write a message to a topic/partition, that message will be replicated to 5 nodes instead of 3. You can increase the number of nodes for high availability and distribute fewer partitions to every node, but without increasing the replication factor.
Is there any way I can speed up the rate at which the replicas fetch data from the leader?
I am using bin/kafka-producer-perf-test.sh to test the throughput of my producer, and I have set a client quota of 50 MB/s. Without any replicas I get a throughput of ~50 MB/s, but when the replication factor is set to 3 it drops to ~30 MB/s.
There is no other traffic on the network, so I am not sure why things are slowing down. Is there some parameter, such as replica.socket.receive.buffer.bytes or replica.fetch.min.bytes, that needs to be tuned to achieve high throughput? How can I speed up my replicas?
Increasing the value of num.replica.fetchers should help. It is the number of threads used to replicate messages from the leaders; increasing it raises the degree of I/O parallelism in the follower broker. The default value is 1.
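As a sketch (the value 4 is only illustrative), this is a broker setting, so it goes into each broker's server.properties:

num.replica.fetchers=4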
I have 1 master and 3 slaves (4 cores each).
By default the minimum split size in my Spark cluster is 32 MB and my file size is 41 GB.
So I am trying to reduce the number of partitions by changing the minimum split size to 64 MB:
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", 64*1024*1024)
val data = sc.textFile("/home/ubuntu/BigDataSamples/Posts.xml", 800)
data.partitions.size = 657
So what are the advantages of increasing the partition size and reducing the number of partitions?
With around 1314 partitions the job took roughly 2-3 minutes, and even after reducing the partition count it still takes about the same amount of time.
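As a rough sanity check of the numbers above (assuming the 64 MB minimum split size wins over the minPartitions hint of 800): 41 GB ≈ 41 * 1024 MB = 41984 MB, and 41984 MB / 64 MB ≈ 656 splits, which lines up with the 657 partitions reported.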
The more partitions, the more overhead, but to some extent more partitions help with performance because you can process them all in parallel.
So, on one hand it makes sense to keep the number of partitions equal to the number of cores. On the other hand, a particular partition size may generate so much garbage in the JVM that it exceeds the memory limit; in that case you would want to increase the number of partitions to reduce the memory footprint of each of them.
It might also depend on the workflow. Consider groupByKey vs reduceByKey: with reduceByKey you can compute a lot locally and send only a little to the remote node. Shuffle data is written to disk before being sent to the remote node, so having more partitions might reduce performance.
It is also true that some overhead comes along with each partition.
If you'd like to share the cluster with several people, you might consider using a somewhat smaller number of partitions to process everything, so that all of the users get some processing time.
Something like that.
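As a side note, if the goal is simply to end up with fewer partitions after loading, coalescing or repartitioning the RDD is often simpler than tuning the input split size. A minimal sketch, with the path and target partition counts as placeholders:

val data = sc.textFile("/home/ubuntu/BigDataSamples/Posts.xml")
// coalesce reduces the partition count without a full shuffle
val fewer = data.coalesce(12)
// repartition performs a shuffle but rebalances the data evenly
val rebalanced = data.repartition(12)
println(fewer.partitions.size)   // should print 12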