Tuning kafka performance to get 1 Million messages/second - apache-kafka

I'm using 3 VM servers (16 cores / 56 GB RAM / 1 TB each) to set up a Kafka cluster, running Kafka 0.10.0. I installed a broker on two of them and created a topic with 2 partitions, one partition per broker and no replication.
My goal is to reach 1 000 000 messages/second.
I ran a test with the kafka-producer-perf-test.sh script and I get between 150 000 msg/s and 204 000 msg/s.
My configuration is:
-batch size: 8k (8192)
-message size: 300 bytes (0.3 KB)
-thread num: 1
The producer configuration:
-request.required.acks=1
-queue.buffering.max.ms=0 #linger.ms=0
-compression.codec=none
-queue.buffering.max.messages=100000
-send.buffer.bytes=100000000
Any help to reach 1 000 000 msg/s would be appreciated.
Thank you

You're running an old version of Apache Kafka. The most recent release (0.11) brought a number of improvements, including performance improvements.
You might find this useful too: https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
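In case it is useful as a starting point, here is a rough sketch of how the usual levers (more partitions, larger batches, a small linger window, compression) might be combined with the bundled perf-test script. The topic name, broker/ZooKeeper addresses and values below are placeholders, not a guaranteed recipe for 1 000 000 msg/s:

# Hypothetical example: a topic with more partitions so both brokers
# (and several producer instances) share the load; values are illustrative only.
bin/kafka-topics.sh --create --zookeeper zk1:2181 \
  --topic perf-test-12p --partitions 12 --replication-factor 1

# Larger batches, a few ms of linger and compression usually raise per-producer
# throughput; run one instance per thread/machine and sum the results.
bin/kafka-producer-perf-test.sh \
  --topic perf-test-12p \
  --num-records 10000000 \
  --record-size 300 \
  --throughput -1 \
  --producer-props bootstrap.servers=broker1:9092,broker2:9092 \
    acks=1 batch.size=65536 linger.ms=5 compression.type=lz4 buffer.memory=67108864

Note that at 300 bytes per message, 1 000 000 msg/s is roughly 300 MB/s of aggregate traffic, so several producer instances running in parallel (and enough network and disk headroom) are usually part of the picture.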

Related

Lag reading messages from a Kafka topic with a Storm spout

While ingesting messages into a Kafka topic, the Storm spout does not pick them up immediately; there is a lag of more than 1 hour.
There is one spout and 3 bolts in the topology.
Spout: ddl
Bolts: kafkabolt, deletebolt, deletemapperbolt
Storm Config:
ddl.spout.executors: 3
topology.spout.executors: 10
topology.acker.executors: 3
topology.bolt.executors.kafkabolt: 2
topology.bolt.executors.deletebolt: 3
topology.bolt.tasks.deletebolt: 3
topology.max.spout.pending: 1
topology.bolt.executors.deletemapperbolt: 3
topology.bolt.tasks.deletemapperbolt: 3
topology.message.timeout.secs: 300
topology.max.task.parallelism: 100
topology.workers: 1
topology.debug: false
topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.receiver.buffer.size: 64
topology.transfer.buffer.size: 64

Why does the size of the last Kafka log segment shrink a while after the producer stops?

I am benchmarking a Kafka producer, and I found that the size of the last log segment shrinks a while after the producer stops.
See the example 00000000000005746692.log below:
2020-05-23 08:40:35 /bin/du /data/kafka/json_test-0 -a
...
4 /data/kafka/json_test-0/00000000000004793445.snapshot
1048704 /data/kafka/json_test-0/00000000000004793445.log
68 /data/kafka/json_test-0/00000000000004793445.index
104 /data/kafka/json_test-0/00000000000004793445.timeindex
4 /data/kafka/json_test-0/00000000000005746692.snapshot
258176 /data/kafka/json_test-0/00000000000005746692.log
10240 /data/kafka/json_test-0/00000000000005746692.index
10240 /data/kafka/json_test-0/00000000000005746692.timeindex
6571068 /data/kafka/json_test-0
2020-05-23 08:40:38 /bin/du /data/kafka/json_test-0 -a
...
4 /data/kafka/json_test-0/00000000000004793445.snapshot
1048464 /data/kafka/json_test-0/00000000000004793445.log
68 /data/kafka/json_test-0/00000000000004793445.index
104 /data/kafka/json_test-0/00000000000004793445.timeindex
4 /data/kafka/json_test-0/00000000000005746692.snapshot
222224 /data/kafka/json_test-0/00000000000005746692.log
10240 /data/kafka/json_test-0/00000000000005746692.index
10240 /data/kafka/json_test-0/00000000000005746692.timeindex
6534876 /data/kafka/json_test-0
The size of /data/kafka/json_test-0/00000000000005746692.log dropped from 258176 to 222224.
Why does the size of the last log segment shrink a while after the producer stops?
Edit:
kafka version: kafka_2.12-2.0.1
producer's compression.type: snappy
I suspected the log.preallocate configuration (KIP-20), but config/server.properties does not define this property, so it defaults to false (log.preallocate).
From the Kafka documentation on compression: "Data will be compressed by the producer, written in compressed format on the server and decompressed by the consumer."
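One way to see what is actually inside the shrinking segment (a sketch; the file path is the one from the listing above) is to dump it with the DumpLogSegments tool that ships with Kafka, once while the producer is running and once after it has stopped, and compare the two outputs:

# Hypothetical diagnostic: dump the record batches of the last segment,
# then diff the dumps taken before and after the producer stops.
bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --files /data/kafka/json_test-0/00000000000005746692.log \
  --print-data-log > dump_after.txt

Comparing the last offsets and batch sizes in the two dumps should show whether the set of batches in the file changed or only the size reported by du.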

Kafka producer quota and timeout exceptions

I am trying to come up with a configuration that enforces a producer quota based on the producer's average byte rate.
I ran a test on a 3-node cluster. The topic, however, was created with 1 partition and a replication factor of 1, so that the producer_byte_rate is measured against only 1 broker (the leader).
I set producer_byte_rate to 20480 for the client id test_producer_quota.
I used kafka-producer-perf-test to test out the throughput and throttle.
kafka-producer-perf-test --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
client.id=test_producer_quota \
--topic quota_test \
--producer.config /myfolder/client.properties \
--record-size 2048 --num-records 4000 --throughput -1
I expected the producer client to learn about the throttle and eventually smooth out the requests sent to the broker. Instead I noticed the throughput alternating between 98 recs/sec and 21 recs/sec for a period of more than 30 seconds. During this time the average latency kept slowly increasing, and when it finally hit 120000 ms I started to see a TimeoutException as below:
org.apache.kafka.common.errors.TimeoutException : Expiring 7 records for quota_test-0: 120000 ms has passed since batch creation.
What is possibly causing this issue?
The producer hits the timeout when latency reaches 120 seconds (the default value of delivery.timeout.ms).
Why isn't the producer learning about the throttle and quota and slowing down or backing off?
What other producer configuration could help alleviate this timeout issue?
(2048 * 4000) / 20480 = 400 (sec)
This means that if your producer tries to send the 4000 records at full speed (which is the case, because you set throughput to -1), it will batch them and put them in the queue within maybe one or two seconds (depending on your CPU).
Then, given your quota setting (20480 bytes/s), you can be sure the broker won't have 'completed' the processing of those 4000 records until at least 398-399 seconds have passed.
The broker does not return an error when a client exceeds its quota, but instead attempts to slow the client down. The broker computes the amount of delay needed to bring a client under its quota and delays the response for that amount of time.
With delivery.timeout.ms left at its default of 120 seconds (120000 ms), batches expire long before the throttled broker works through them, hence the TimeoutException.
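As a sketch of two possible mitigations (the numbers are illustrative and assume the 20480 bytes/s quota from the question): either keep the test at or below the quota, or give batches enough time to sit out the broker-side throttling before the producer expires them:

# Option 1 (hypothetical): cap the perf test at the quota itself:
# 20480 bytes/s / 2048 bytes per record = 10 records/s, so no throttle delay builds up.
kafka-producer-perf-test --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
  client.id=test_producer_quota \
  --topic quota_test \
  --producer.config /myfolder/client.properties \
  --record-size 2048 --num-records 4000 --throughput 10

# Option 2 (hypothetical): keep full speed but raise delivery.timeout.ms above the
# ~400 s the broker needs to work through the throttled batches.
kafka-producer-perf-test --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
  client.id=test_producer_quota delivery.timeout.ms=600000 \
  --topic quota_test \
  --producer.config /myfolder/client.properties \
  --record-size 2048 --num-records 4000 --throughput -1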

Spark Application - High "Executor Computing Time"

I have a Spark application that has now been running for 46 hours. While the majority of its jobs complete within 25 seconds, specific jobs take hours. Some details are provided below:
Task Time    Shuffle Read (size / records)    Shuffle Write (size / records)
7.5 h        2.2 MB / 257402                  2.9 MB / 128601
There are other, similar task times, of course, with values of 11.3 h, 10.6 h, 9.4 h etc., each of them spending the bulk of its time on "rdd at DataFrameFunctions.scala:42". The stage details reveal that the time is spent by the executor as "Executor Computing Time". This executor runs on DataNode 1, where CPU utilization is a very ordinary ~13%. The other boxes (4 more worker nodes) also show nominal CPU utilization.
When the Shuffle Read is within 5000 records, the job is extremely fast and completes within 25 seconds, as stated previously. Nothing is appended to the logs (Spark/Hadoop/HBase), nor is anything observed at /tmp or /var/tmp that would indicate disk-related activity in progress.
I am clueless about what is going wrong and have been struggling with this for quite some time. The software versions used are as follows:
Hadoop : 2.7.2
Zookeeper : 3.4.9
Kafka : 2.11-0.10.1.1
Spark : 2.1.0
HBase : 1.2.6
Phoenix : 4.10.0
Some configurations from the Spark defaults file:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.history.fs.logDirectory hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.yarn.jars hdfs://SDCHDPMAST1:8111/user/appuser/spark/share/lib/*.jar
spark.driver.maxResultSize 5G
spark.deploy.zookeeper.url SDCZKPSRV01
spark.executor.memory 12G
spark.driver.memory 10G
spark.executor.heartbeatInterval 60s
spark.network.timeout 300s
Is there any way I can reduce the time spent on "Executor Computing time"?
The job running on that specific dataset is skewed. Because of the skew, those jobs take much longer than expected.

What does ProducerPerformance Tool in Kafka give?

What does running the following Kafka tool actually give?
./bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --throughput=10000 --topic=TOPIC --num-records=50000000 --record-size=200 --producer-props bootstrap.servers=SERVERS buffer.memory=67108864 batch.size=64000
When running with a single producer I get 90 MB/s. When I use 3 separate producers on separate nodes I get only around 60 MB/s per producer. (My Kafka cluster consists of 2 nodes, and the topic has 6 partitions.)
What does 90 MB/s mean? Is it the maximum rate at which a producer can produce?
Does partition count affect this value?
Why does it drop to 60 MB/s per producer when there are 3 producers (still no network saturation on the broker side)?
Thank you