I have configured a two-node, six-partition Kafka cluster with a replication factor of 2 on AWS. Each Kafka node runs on an m4.2xlarge EC2 instance backed by an EBS volume.
I understand that the rate of data flow from a Kafka producer to a Kafka broker is limited by the network bandwidth of the producer.
Say the network bandwidth between the Kafka producer and the broker is 1 Gbps (approx. 125 MB/s), and the bandwidth between the Kafka broker and its storage (between the EC2 instance and the EBS volume) is also 1 Gbps.
I used the org.apache.kafka.tools.ProducerPerformance tool for profiling the performance.
I observed that a single producer can write to the broker at around 90 MB/s when the message size is 100 bytes (hence the network is not saturated).
I also observed that the disk write rate to the EBS volume is around 120 MB/s.
Is this 90 MB/s due to some network bottleneck, or is it a limitation of Kafka? (Ignoring batch size, compression, etc. for simplicity.)
Could it be due to the bandwidth limitation between the broker and the EBS volume?
I also observed that when two producers (on two separate machines) produce data, the throughput of each producer drops to around 60 MB/s.
What could be the reason for this? Why doesn't that value reach 90 MB/s? Could this be due to a network bottleneck between the broker and the EBS volume?
What confuses me is that in both cases (single producer and two producers) the disk write rate to EBS stays around 120 MB/s (close to its upper limit).
Thank you
I ran into the same issue. As I understand it, in the first case one producer is sending data to two brokers (with nothing else on the network), so you get 90 MB/s overall and each broker receives roughly 45 MB/s. In the second case, two producers are sending data to the two brokers, so each producer can only push about 60 MB/s, but each broker is now receiving about 60 MB/s. So in aggregate you are actually pushing more data through Kafka.
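As rough arithmetic (assuming producer traffic is spread evenly over the two brokers and ignoring replication for the moment):

    1 producer:  90 MB/s total       -> ~45 MB/s arriving at each broker
    2 producers: 2 x 60 = 120 MB/s   -> ~60 MB/s arriving at each broker

So the cluster's aggregate intake rises from ~90 MB/s to ~120 MB/s even though each individual producer slows down.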
There are a couple things to consider:
There are separate disk and network limits that apply to both the instance and the volume.
You have to account for replication. With RF=2, every byte produced is written twice across the cluster, so assuming an even distribution of writes across partitions, the write traffic taken by a single broker is roughly RF*(PRODUCER_TRAFFIC)/(BROKER_COUNT); on a two-broker cluster that works out to the full producer rate per broker.
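A back-of-the-envelope check against the numbers in the question (a rough estimate that ignores protocol overhead, index files, and page-cache effects):

    Single producer:  90 MB/s x RF 2 / 2 brokers  ≈  90 MB/s written per broker
    Two producers:   120 MB/s x RF 2 / 2 brokers  ≈ 120 MB/s written per broker

Both figures are close to the ~120 MB/s of EBS writes you observed, which suggests the brokers' storage throughput, rather than the producer's network link, is the more likely ceiling in the two-producer case.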
I set up a Kafka cluster on my local machine and was testing producers with different throughputs to see what happens to the latency.
I used the kafka-test-perf benchmark for these tests:
https://docs.cloudera.com/runtime/7.2.10/kafka-managing/topics/kafka-manage-cli-perf-test.html
Different throughput on producer
When I set the throughput to 200,000 there are only 22k records/sec. Does this mean that the Kafka cluster on my local machine cannot handle this kind of throughput?
I tested different throughputs to try to understand what happens here.
org.apache.kafka.clients.producer.BufferExhaustedException: Failed to allocate memory within the configured max blocking time 5 ms.
This says that the exception is thrown when the producer is unable to allocate memory for the record within the configured max blocking time of 5 ms.
This is the error I got when trying to add Kafka s3-sink connectors. There are 11 topics across two Kafka brokers, and there were already consumers consuming from these topics. I was spinning up a 2-node Kafka Connect cluster with 11 connectors to consume from these topics, but there was a huge spike in errors when I started the s3-sink connectors. Once I stopped the connectors, the errors dropped and things seemed fine. I then started them again with fewer tasks, and this time the errors spiked when there was a sudden surge in traffic and went back to normal when the traffic returned to normal. There was a max retry of 5, and messages failed to write even after 5 attempts.
From what I have read, it might be due to the producer batch size, or the producer rate being higher than the consumer rate. I also guess each consumer can occupy up to 64 MB when there is bursty traffic. Could that be the reason? Should I try increasing the blocking time?
Producer Config:
lingerTime: 0
maxBlockTime: 5
bufferMemory: 1024000
batchSize: 102400
ack: "1"
maxRequestSize: 102400
retries: 1
maxInFlightRequestsPerConn: 1000
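For reference, here is roughly how that configuration would map onto the standard Java producer properties (a hedged sketch: the broker addresses, serializers, and class name are placeholders, and I am assuming the keys in the list above are thin wrappers around these settings):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ProducerSettings {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);              // lingerTime: 0
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 5L);          // maxBlockTime: 5 ms (default 60000)
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 1024000L);   // bufferMemory: ~1 MB (default 32 MB)
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 102400);        // batchSize: 100 KB
        props.put(ProducerConfig.ACKS_CONFIG, "1");                 // ack: "1"
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 102400);  // maxRequestSize: 100 KB
        props.put(ProducerConfig.RETRIES_CONFIG, 1);                // retries: 1
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1000); // maxInFlightRequestsPerConn

        // buffer.memory of ~1 MB holds only about ten 100 KB batches, and
        // max.block.ms of 5 ms gives send() almost no time to wait for space,
        // so a traffic burst can exhaust the buffer very quickly.
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // send records here
        }
    }
}

The values worth noting are buffer.memory and max.block.ms, which are far below the Kafka defaults of 32 MB and 60 s; whether they are the root cause here is a separate question.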
It turned out to be due to an increase in IOPS on the EC2 instances that the Kafka brokers couldn't handle. Increasing the number of bytes fetched per poll and decreasing the frequency of polls fixed it.
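The exact properties aren't named above, but on the consumer side the usual knobs for "more bytes per poll, fewer polls" are fetch.min.bytes, fetch.max.wait.ms and max.poll.records (for Kafka Connect sink connectors these would normally be applied through the worker's consumer.* overrides). A hedged sketch with illustrative values:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class BatchyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "s3-sink-test");                       // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        // Let the broker accumulate at least 1 MB (or wait up to 500 ms) before
        // answering a fetch, so each request moves more data per disk read.
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1048576);
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
        // Return more records per poll() so the application polls less often.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 2000);

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // subscribe and poll here
        }
    }
}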
We have set up a ZooKeeper quorum (3 nodes) and 3 Kafka brokers. The producers were unable to send records to Kafka, resulting in data loss. During the investigation we could still SSH to the affected broker, and we observed that its disk was full. We deleted topic logs to clear some disk space and the broker functioned as expected again.
Given that we could still SSH to that broker (we can't see the logs right now), I assume ZooKeeper could still hear the broker's heartbeat and didn't consider it down? What is the best practice for handling such events?
The best practice is to avoid this from happening!
You need to monitor the disk usage of your brokers and set up alerts so you are warned before available disk space runs low.
You need to put retention limits on your topics to ensure data is deleted regularly.
You can also use Topic Policies (see create.topic.policy.class.name) to control how much retention time/size is allowed when creating/updating topics to ensure topics can't fill your disk.
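To illustrate the retention point, here is a hedged sketch that caps a topic's retention by both time and size using the Java AdminClient (requires a 2.3+ client for incrementalAlterConfigs; the broker address, topic name, and limits are placeholders):

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class CapTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // placeholder
            Collection<AlterConfigOp> ops = List.of(
                // Keep at most 3 days of data per partition...
                new AlterConfigOp(new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET),
                // ...and at most ~50 GB per partition, whichever limit is reached first.
                new AlterConfigOp(new ConfigEntry("retention.bytes", "53687091200"), AlterConfigOp.OpType.SET));
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}

Note that retention.bytes applies per partition, so the effective cap for a topic is that value times its partition count.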
The recovery steps you took are OK, but to keep your cluster availability high you really don't want to let the disks fill up.
What are the appropriate values for the following Kafka broker properties for a production system?
log.flush.interval.messages
log.flush.interval.ms
I am seeing too many small IO write requests on my st1 HDD drive, which I would like to optimize. Will changing these properties help? What are the tradeoffs?
Also, how are these properties configured in a typical production system?
We have 5 Kafka brokers (r5.xlarge) with attached st1 HDD drives. Our usual data input rate is around 2 Mbps at peak and 700-800 Kbps the rest of the time.
I have 3 servers with 10 Gb connections between them. I run a Kafka cluster on 2 of the servers and generate test load from the third.
When I run a single Java producer (on the third server, which is not in the Kafka cluster), sending 1 million messages takes 3 seconds. But when I run a second Java producer (with a different topic), both producers take 6 seconds to send their messages.
I am sure the network connection is not the bottleneck (it is 10 Gb).
So why does this happen, and how can I solve it (I want both producers to take 3 seconds)?
It sounds like you are getting a consistent 333,333 messages/sec out of a two-node Kafka cluster, with ZooKeeper running on the same 2 machines as your 2 Kafka brokers. You don't say what size these messages are, what kind of disks you are using, how much memory you have, whether you are publishing with acks=all, or what programming language you are using (I assume Java), but that actually sounds like a good, consistent result that is probably disk-IO bound on the brokers or CPU bound on your single client machine.