Log Flush Policy for Kafka brokers in production systems - apache-kafka

What are the appropriate values for the following Kafka broker properties for a production system?
log.flush.interval.messages
log.flush.interval.ms
I am seeing too many small I/O write requests on my st1 HDD volume, which I would like to optimize. Will changing these properties help? What are the trade-offs?
Also, how are these properties configured in a typical production system?
We have 5 Kafka brokers (r5.xlarge) with attached st1 HDD volumes. Our usual data input rate is around 2 Mbps at peak and 700-800 Kbps the rest of the time.
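For reference, both properties live in each broker's server.properties. By default they are effectively unset, so Kafka leaves flushing to the OS page cache and relies on replication for durability; forcing explicit flushes bounds the window of unflushed data at the cost of throughput. A minimal sketch, with values that are illustrative assumptions rather than recommendations:

    # server.properties (illustrative sketch, not a recommendation)
    # Default behaviour: leave both properties unset and let the OS schedule flushes.
    # Explicit policy: flush after 10,000 messages or 1,000 ms, whichever comes first.
    #log.flush.interval.messages=10000
    #log.flush.interval.ms=1000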

Related

Apache Kafka Throughput and Latency

I made a Kafka cluster on my local machine and was testing producers at different throughputs to see what happens to the latency.
I used the kafka-producer-perf-test benchmark for these tests:
https://docs.cloudera.com/runtime/7.2.10/kafka-managing/topics/kafka-manage-cli-perf-test.html
Different throughput on producer
When I set the throughput to 200,000, I only get about 22k records/sec. Does this mean that the Kafka cluster on my local machine cannot handle this kind of throughput?
I tested different throughputs to try to understand what happens here.
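For context, a run like the one described can be reproduced with the producer perf-test CLI that the linked page documents; the topic name and bootstrap address below are assumptions:

    kafka-producer-perf-test \
        --topic perf-test \
        --num-records 1000000 \
        --record-size 100 \
        --throughput 200000 \
        --producer-props bootstrap.servers=localhost:9092

The tool reports the achieved records/sec and latency percentiles, so varying --record-size and the number of producer instances helps narrow down whether the 22k records/sec ceiling comes from the producer machine or the brokers.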

SSD or HDD for Kafka Brokers? (Using SSD for Kafka)

Kafka is fast because it uses sequential writing techniques on HDD.
If I use SSD for Kafka Brokers, do I get faster performance?
As far as I know, SSDs work differently than HDDs, and I don't think I would get the same benefit from sequential writes on an SSD. I'm worried that using SSDs wouldn't be good for Kafka brokers.
My questions:
Is SSD better than HDD for Kafka Brokers?
Does the "sequential write technique" also apply to SSD?
SSDs are best for ZooKeeper servers, not the brokers.
If I use SSD for Kafka Brokers, do I get faster performance?
Honestly, that is up to you to benchmark for your use cases.
However, Kafka does sequential scans/writes, not the random access of data that SSDs are designed for, so spinning disks are preferred, regardless of the claimed speeds of SSDs.
https://docs.confluent.io/current/kafka/deployment.html#disks
Also, disk pools (JBOD) and partition schemas are important, and ZFS seems to get good gains over XFS or ext4.
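For context, a JBOD layout is expressed on the broker by listing one log directory per physical disk; the paths below are assumptions:

    # server.properties (illustrative JBOD layout; paths are assumptions)
    log.dirs=/mnt/disk1/kafka-logs,/mnt/disk2/kafka-logs,/mnt/disk3/kafka-logs

Kafka then spreads new partitions across the listed directories.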
I have run Kafka in production for 8 years at 1 million messages per second.
Spinning disks will only work as long as you can avoid lagging consumers. If you have too many of them, disk access starts to look like random I/O, and spinning-disk-based Kafka clusters will fail (we tested this for you).
Do not put Kafka on consumer-grade drives; we tried, and they die hard after about a year. Enterprise NVMe is awesome if you can afford it. We are currently experimenting with a 22-disk SSD RAID 0. 10 Gbit+ NICs are a must.

Kafka performance reduced when adding more consumers or producers

I have 3 servers with 10 Gb connections between them; I run a Kafka cluster on 2 of the servers and generate test load from the third.
When I run a single Java producer (on the third server, which is not part of the Kafka cluster), sending 1 million messages takes 3 seconds, but when I run a second Java producer (with a different topic), both producers take 6 seconds to send their messages.
I am sure the network connection is not the bottleneck (it is 10 Gb).
So why does this happen, and how can I solve it (I want both producers to take 3 seconds)?
Sounds like you are getting a consistent 333,333 messages/sec out of a two-node Kafka cluster, with ZooKeeper running on the same 2 machines as your 2 Kafka brokers. You don't say what size these messages are, what kind of disks you are using, how much memory you have, whether you are publishing with acks=all, or what programming language you are using (I assume Java), but that actually sounds like a good, consistent result that is probably disk-I/O-bound on the brokers or CPU-bound on your single client machine.
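If the single client machine is the limit, producer batching and acknowledgement settings are the usual knobs to check. A minimal sketch of the relevant producer properties, with illustrative values that are assumptions rather than recommendations:

    # producer configuration (illustrative values only)
    acks=all                # wait for all in-sync replicas; safer but slower
    linger.ms=10            # wait up to 10 ms so batches can fill
    batch.size=65536        # larger batches amortize per-request overhead
    compression.type=lz4    # trade CPU for less network and disk traffic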

Kafka Producer and Broker Throughput Limitations

I have configured a two-node, six-partition Kafka cluster with a replication factor of 2 on AWS. Each Kafka node runs on an m4.2xlarge EC2 instance backed by an EBS volume.
I understand that the rate of data flow from the Kafka producer to the Kafka broker is limited by the network bandwidth of the producer.
Say the network bandwidth between the Kafka producer and the broker is 1 Gbps (approx. 125 MB/s), and the bandwidth between the Kafka broker and storage (between the EC2 instance and the EBS volume) is also 1 Gbps.
I used the org.apache.kafka.tools.ProducerPerformance tool for profiling the performance.
I observed that a single producer can write at around 90 MB/s to the broker when the message size is 100 bytes (hence the network is not saturated).
I also observed that the disk write rate to the EBS volume is around 120 MB/s.
Is this 90 MB/s due to some network bottleneck, or is it a limitation of Kafka? (Forgetting batch size, compression, etc. for simplicity.)
Could this be due to the bandwidth limitation between broker and ebs volume?
I also observed that when two producers (on two separate machines) produce data, the throughput of each producer dropped to around 60 MB/s.
What could be the reason for this? Why doesn't that value reach 90 MB/s? Could this be due to a network bottleneck between the broker and the EBS volume?
What confuses me is that in both cases (single producer and two producers), the disk write rate to EBS stays around 120 MB/s (close to its upper limit).
Thank you
I ran into the same issue. As per my understanding, in the first case one producer is sending data to two brokers (there is nothing else on the network), so you get 90 MB/s, with each broker receiving roughly 45 MB/s. In the second case two producers are sending data to two brokers, so each producer can only send at about 60 MB/s, but each broker is now receiving about 60 MB/s, so you are actually able to push more data through Kafka in total.
There are a couple of things to consider:
There are separate disk and network limits that apply to both the instance and the volume.
You have to account for replication. With RF=2 and writes distributed evenly across partitions, every byte of producer traffic is written on two brokers, so the write traffic taken by a single node is roughly RF * PRODUCER_TRAFFIC / BROKER_COUNT.
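As a rough back-of-the-envelope for the numbers in the question, assuming writes are evenly distributed and both replicas of every partition live on the two brokers (an assumption, not something stated in the question):

    1 producer:   90 MB/s in  x RF 2 = ~180 MB/s of replica writes -> ~90 MB/s per broker
    2 producers:  120 MB/s in x RF 2 = ~240 MB/s of replica writes -> ~120 MB/s per broker

The two-producer case lands right at the ~120 MB/s observed on each EBS volume, which suggests disk throughput rather than the network is the shared ceiling.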

Kafka recommended system configuration

I'm expecting our inflow into Kafka to rise to around 2 TB/day over time. I'm planning to set up a Kafka cluster with 2 brokers (each running on a separate system). What is the recommended hardware configuration for handling 2 TB/day?
To use as a base you could look here: https://docs.confluent.io/4.1.1/installation/system-requirements.html#hardware
You need to know how many messages you get per second/hour, because this will determine the size of your cluster. For the disks, it is not strictly necessary to use SSDs, because the system stores incoming data in RAM (the OS page cache) first. Still, you need reasonably fast hard disks to ensure that flushing the queue to disk does not slow your system down.
I would also recommend using 3 Kafka brokers and 3 or 4 ZooKeeper servers.
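As a rough sizing sketch (the replication factor and retention period below are assumptions, not values from the question):

    2 TB/day / 86,400 s       ~ 23 MB/s average inbound
    23 MB/s x RF 3            ~ 70 MB/s of cluster-wide disk writes
    2 TB/day x 7 days x RF 3  ~ 42 TB of retained data, before compression

Peak rates are usually several times the average, so disk and network headroom should be sized from peaks rather than the daily total.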