Performance testing in Kafka

Can someone please explain how performance is tested in Kafka using the following commands?
bin/kafka-consumer-perf-test.sh --topic benchmark-3-3-none \
--zookeeper kafka-zk-1:2181,kafka-zk-2:2181,kafka-zk-3:2181 \
--messages 15000000 \
--threads 1
and
bin/kafka-producer-perf-test.sh --topic benchmark-1-1-none \
--num-records 15000000 \
--record-size 100 \
--throughput 15000000 \
--producer-props \
acks=1 \
bootstrap.servers=kafka-kf-1:9092,kafka-kf-2:9092,kafka-kf-3:9092 \
buffer.memory=67108864 \
compression.type=none \
batch.size=8196
I am not clear on what the parameters are and what output should be obtained. How will I check, if I send 1000 messages to a Kafka topic, its performance and acknowledgement?

When we run this we get the following output:
Producer
| start.time | end.time | compression | message.size | batch.size | total.data.sent.in.MB | MB.sec | total.data.sent.in.nMsg | nMsg.sec |
| 2016-02-03 21:38:28:094 | 2016-02-03 21:38:28:449 | 0 | 100 | 200 | 0.01 | 0.0269 | 100 | 281.6901 |
Where,
• total.data.sent.in.MB shows the total data sent to the cluster in MB.
• MB.sec indicates how much data was transferred in MB per sec (throughput on size).
• total.data.sent.in.nMsg shows the total count of messages sent during this test.
• And last, nMsg.sec shows how many messages were sent in a sec (throughput on count of messages).
Consumer
| start.time | end.time | fetch.size | data.consumed.in.MB | MB.sec | data.consumed.in.nMsg | nMsg.sec |
| 2016-02-04 11:29:41:806 | 2016-02-04 11:29:46:854 | 1048576 | 0.0954 | 1.9869 | 1001 | 20854.1667 |
Where,
• start.time, end.time show when the test started and completed.
• fetch.size shows the amount of data to fetch in a single request.
• data.consumed.in.MB shows the size of all messages consumed.
• MB.sec indicates how much data was transferred in MB per sec (throughput on size).
• data.consumed.in.nMsg shows the total count of messages consumed during this test.
• And last, nMsg.sec shows how many messages were consumed in a sec (throughput on count of messages).
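To check the 1000-message case from the question, point the same two tools at a small run. Below is a minimal sketch reusing the hosts from the question (the topic name is a placeholder). With acks=1, a record counts as acknowledged once the partition leader has written it, so the reported latencies include that acknowledgement round-trip, and --throughput -1 removes the client-side rate cap:
bin/kafka-producer-perf-test.sh --topic my-test-topic \
--num-records 1000 \
--record-size 100 \
--throughput -1 \
--producer-props acks=1 bootstrap.servers=kafka-kf-1:9092,kafka-kf-2:9092,kafka-kf-3:9092
bin/kafka-consumer-perf-test.sh --topic my-test-topic \
--zookeeper kafka-zk-1:2181,kafka-zk-2:2181,kafka-zk-3:2181 \
--messages 1000 \
--threads 1
The producer run should then report total.data.sent.in.nMsg = 1000 and the consumer run data.consumed.in.nMsg of about 1000, with the throughput columns interpreted as described above.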

I would rather suggest going for a specialized performance testing tool like Apache JMeter with the Pepper-Box - Kafka Load Generator plugin in order to load test your Kafka installation.
This way you will be able to generate the load with full control over threads, ramp-up time, message size and content, etc. You will also be able to generate an HTML Reporting Dashboard with tables and charts of interesting metrics.
See Apache Kafka - How to Load Test with JMeter article for more details if needed.

If anyone runs into this question, please note that kafka-producer-perf-test.sh produces different output as of Kafka 3.3.2 (Scala 2.12).
For example, to send 1000 messages to a Kafka topic, use the command-line parameter --num-records 1000 (and --topic <topic_name>, of course). The generated output should resemble the following and includes the number of messages sent so far, the rate in messages per second and MB per second, and average and maximum latencies (I chose to send 1M messages):
323221 records sent, 64644.2 records/sec (63.13 MB/sec), 7.5 ms avg latency, 398.0 ms max latency.
381338 records sent, 76267.6 records/sec (74.48 MB/sec), 1.2 ms avg latency, 27.0 ms max latency.
1000000 records sent, 70244.450688 records/sec (68.60 MB/sec), 15.21 ms avg latency, 475.00 ms max latency, 1 ms 50th, 96 ms 95th, 353 ms 99th, 457 ms 99.9th.

Related

Understanding the relation between partitions and brokers in Kafka

I am new to Kafka and just doing some tweaks on my local machine following the docs.
Let's say I have 3 topics T1, T2 and T3.
T1 has 2 partitions,
T2 has 3 partitions,
T3 has 5 partitions
and,
I have two brokers B1 and B2.
Will Kafka manage assigning brokers to topics/partitions automatically? If yes, how?
Every topic is a particular stream of data (similar to a table in a database). Topics are split into partitions (as many as you like), where each message within a partition gets an incremental id, known as an offset, as shown below.
Partition 0:
+---+---+---+-----+
| 0 | 1 | 2 | ... |
+---+---+---+-----+
Partition 1:
+---+---+---+---+----+
| 0 | 1 | 2 | 3 | .. |
+---+---+---+---+----+
Now a Kafka cluster is composed of multiple brokers. Each broker is identified with an ID and can contain certain topic partitions.
Example of 2 topics (with 3 and 2 partitions respectively):
Broker 1:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| Topic 2 |
| Partition 1 |
+-------------------+
Broker 2:
+-------------------+
| Topic 1 |
| Partition 2 |
| |
| |
| Topic 2 |
| Partition 0 |
+-------------------+
Broker 3:
+-------------------+
| Topic 1 |
| Partition 1 |
| |
| |
| |
| |
+-------------------+
Note that data is distributed (and Broker 3 doesn't hold any data of topic 2).
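To the original question: yes, Kafka assigns partitions to brokers automatically when a topic is created; you only choose the partition count and replication factor. A minimal sketch for the three topics from the question (the ZooKeeper address is a placeholder; newer Kafka versions use --bootstrap-server instead of --zookeeper):
bin/kafka-topics.sh --create --zookeeper localhost:2181 --topic T1 --partitions 2 --replication-factor 1
bin/kafka-topics.sh --create --zookeeper localhost:2181 --topic T2 --partitions 3 --replication-factor 1
bin/kafka-topics.sh --create --zookeeper localhost:2181 --topic T3 --partitions 5 --replication-factor 1
With two brokers B1 and B2, the resulting 10 partitions are spread over both brokers in a round-robin fashion, as described further below.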
Topics should have a replication-factor greater than 1 (usually 2 or 3) so that when a broker is down, another one can serve the data of a topic. For instance, assume that we have a topic with 2 partitions and a replication-factor set to 2, as shown below:
Broker 1:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| |
| |
+-------------------+
Broker 2:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| Topic 1 |
| Partition 1 |
+-------------------+
Broker 3:
+-------------------+
| Topic 1 |
| Partition 1 |
| |
| |
| |
| |
+-------------------+
Now assume that Broker 2 has failed. Brokers 1 and 3 can still serve the data for topic 1. So a replication-factor of 3 is always a good idea, since it allows for one broker to be taken down for maintenance purposes and for another one to be taken down unexpectedly. Therefore, Apache Kafka offers strong durability and fault-tolerance guarantees.
Note about Leaders:
At any time, only one broker can be a leader of a partition and only that leader can receive and serve data for that partition. The remaining brokers will just synchronize the data (in-sync replicas). Also note that when the replication-factor is set to 1, the leader cannot be moved elsewhere when a broker fails. In general, when all replicas of a partition fail or go offline, the leader will automatically be set to -1.
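You can see which broker leads each partition by describing the topic. A sketch (the ZooKeeper address and topic name are placeholders), with output resembling the two lines shown, matching the replication example above (Partition 0 on brokers 1 and 2, Partition 1 on brokers 3 and 2):
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic topic1
Topic: topic1  Partition: 0  Leader: 1  Replicas: 1,2  Isr: 1,2
Topic: topic1  Partition: 1  Leader: 3  Replicas: 3,2  Isr: 3,2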
Note about retention period:
If you are planning to use Kafka as storage, you also need to be aware of the configurable retention period for every topic. If you don't take care of this setting, you might lose your data. According to the docs:
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space.
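Retention is a per-topic configuration. A sketch of setting it to two days with kafka-configs.sh (the ZooKeeper address and topic name are placeholders; 172800000 ms = 2 days; newer versions use --bootstrap-server):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
--entity-type topics --entity-name my-topic \
--add-config retention.ms=172800000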
According to the Kafka documentation on Replica Management, the assignment of partitions happens in a round-robin fashion:
We attempt to balance partitions within a cluster in a round-robin fashion to avoid clustering all partitions for high-volume topics on a small number of nodes. Likewise we try to balance leadership so that each node is the leader for a proportional share of its partitions.
Over time, after adding/removing topics and/or brokers to the cluster, this will usually lead to imbalances and inefficiencies that you should take care of as an operator of the platform. Also, the assignment of partitions to brokers does not happen based on any kind of information like data volume or number of read/write accesses.
To take care of these imbalances, Kafka comes with the command-line tool kafka-reassign-partitions.sh. An example can be found in the Kafka documentation under Automatically migrating data to new machines. The licensed Confluent Platform has the Auto Data Balancer.
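The workflow in that documentation example first generates a candidate assignment, then executes and verifies it. A condensed sketch (the ZooKeeper address, JSON file names, and broker ids are placeholders; newer versions use --bootstrap-server):
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
--topics-to-move-json-file topics-to-move.json \
--broker-list "5,6" --generate
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
--reassignment-json-file expand-cluster-reassignment.json --execute
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
--reassignment-json-file expand-cluster-reassignment.json --verify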

Kafka producer quota and timeout exceptions

I am trying to come up with a configuration that would enforce a producer quota based on the producer's average byte rate.
I did a test with a 3-node cluster. The topic, however, was created with 1 partition and a replication factor of 1 so that the producer_byte_rate can be measured against only 1 broker (the leader broker).
I set the producer_byte_rate to 20480 on client id test_producer_quota.
I used kafka-producer-perf-test to test out the throughput and throttle.
kafka-producer-perf-test --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
client.id=test_producer_quota \
--topic quota_test \
--producer.config /myfolder/client.properties \
--record-size 2048 --num-records 4000 --throughput -1
I expected the producer client to learn about the throttle and eventually smooth out the requests sent to the broker. Instead I noticed the throughput alternating between 98 records/sec and 21 records/sec for a period of more than 30 seconds. During this time the average latency slowly kept increasing, and finally, when it hit 120000 ms, I started to see a TimeoutException as below:
org.apache.kafka.common.errors.TimeoutException: Expiring 7 records for quota_test-0: 120000 ms has passed since batch creation.
What is possibly causing this issue?
The producer hits the timeout when latency reaches 120 seconds (the default value of delivery.timeout.ms).
Why isn't the producer learning about the throttle and quota and slowing down or backing off?
What other producer configuration could help alleviate this timeout issue?
(2048 * 4000) / 20480 = 400 (sec)
This means that, if your producer is trying to send the 4000 records at full speed (which is the case because you set throughput to -1), then it might batch them and put them in the queue within maybe one or two seconds (depending on your CPU).
Then, thanks to your quota setting (20480 bytes/sec), you can be sure that the broker won't 'complete' the processing of those 4000 records in less than roughly 400 seconds.
The broker does not return an error when a client exceeds its quota, but instead attempts to slow the client down. The broker computes the amount of delay needed to bring a client under its quota and delays the response for that amount of time.
With delivery.timeout.ms set to 120 seconds (its default), batches that cannot be delivered within that window expire, which is exactly the TimeoutException you see.
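For reference, a client-id quota like the one in the question is typically set with kafka-configs.sh. A sketch (the ZooKeeper address is a placeholder; newer clusters use --bootstrap-server):
bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
--add-config 'producer_byte_rate=20480' \
--entity-type clients --entity-name test_producer_quota
Given the arithmetic above, either raise delivery.timeout.ms well beyond the roughly 400 seconds the quota implies, or shrink the test (--num-records, --record-size) so the throttled batches can be delivered before they expire.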

How to recover kafka messages?

We are considering using Kafka for distributed development, but we would also like to use it as a database. Specific case: we write to a "transact" topic in Kafka and want to rely on it to store all the transactions.
The question is: Is a recovery plan needed in this design? Would Kafka lose data due to crashes or disk failures?
Or maybe Kafka has its own recovery mechanics, so users don't need a recovery plan on their side?
Short answer to your question:
Kafka provides durability and fault tolerance; however, you are responsible for the configuration of the corresponding parameters and the design of an architecture that can deal with failovers, in order to ensure that you never lose any data.
Long answer to your question:
I'll answer your question by explaining how Kafka works in general and how it deals with failures.
Every topic is a particular stream of data (similar to a table in a database). Topics are split into partitions (as many as you like), where each message within a partition gets an incremental id, known as an offset, as shown below.
Partition 0:
+---+---+---+-----+
| 0 | 1 | 2 | ... |
+---+---+---+-----+
Partition 1:
+---+---+---+---+----+
| 0 | 1 | 2 | 3 | .. |
+---+---+---+---+----+
Now a Kafka cluster is composed of multiple brokers. Each broker is identified with an ID and can contain certain topic partitions.
Example of 2 topics (with 3 and 2 partitions respectively):
Broker 1:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| Topic 2 |
| Partition 1 |
+-------------------+
Broker 2:
+-------------------+
| Topic 1 |
| Partition 2 |
| |
| |
| Topic 2 |
| Partition 0 |
+-------------------+
Broker 3:
+-------------------+
| Topic 1 |
| Partition 1 |
| |
| |
| |
| |
+-------------------+
Note that data is distributed (and Broker 3 doesn't hold any data of topic 2).
Topics should have a replication-factor greater than 1 (usually 2 or 3) so that when a broker is down, another one can serve the data of a topic. For instance, assume that we have a topic with 2 partitions and a replication-factor set to 2, as shown below:
Broker 1:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| |
| |
+-------------------+
Broker 2:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| Topic 1 |
| Partition 1 |
+-------------------+
Broker 3:
+-------------------+
| Topic 1 |
| Partition 1 |
| |
| |
| |
| |
+-------------------+
Now assume that Broker 2 has failed. Brokers 1 and 3 can still serve the data for topic 1. So a replication-factor of 3 is always a good idea, since it allows for one broker to be taken down for maintenance purposes and for another one to be taken down unexpectedly. Therefore, Apache Kafka offers strong durability and fault-tolerance guarantees.
Note about Leaders:
At any time, only one broker can be a leader of a partition and only that leader can receive and serve data for that partition. The remaining brokers will just synchronize the data (in-sync replicas). Also note that when the replication-factor is set to 1, the leader cannot be moved elsewhere when a broker fails. In general, when all replicas of a partition fail or go offline, the leader will automatically be set to -1.
Note about retention period:
If you are planning to use Kafka as storage, you also need to be aware of the configurable retention period for every topic. If you don't take care of this setting, you might lose your data. According to the docs:
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space.
Please read the replication section of Kafka docs, especially the subsection called "Availability and Durability Guarantees". After reading the docs, if you encounter problems, then feel free to post another question.
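If the goal is to never lose an acknowledged write, the usual combination is a replication factor of 3, a minimum of 2 in-sync replicas per write, and producers using acks=all. A hedged sketch using the "transact" topic from the question (the ZooKeeper address and partition count are placeholders; newer versions use --bootstrap-server):
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
--topic transact --partitions 3 --replication-factor 3 \
--config min.insync.replicas=2 \
--config unclean.leader.election.enable=false
On the producer side, set acks=all so a write is only acknowledged once all in-sync replicas have it; with the settings above, an acknowledged record then survives the loss of any single broker.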

Tuning kafka performance to get 1 Million messages/second

I'm using 3 VM servers, each with 16 cores / 56 GB RAM / 1 TB disk, to set up a Kafka cluster. I work with Kafka version 0.10.0. I installed a broker on two of them. I have created a topic with 2 partitions, 1 partition per broker, and without replication.
My goal is to reach 1,000,000 messages/second.
I made a test with the kafka-producer-perf-test.sh script and I get between 150,000 msg/s and 204,000 msg/s.
My configuration is:
- batch size: 8k (8192)
- message size: 300 bytes (0.3 KB)
- thread num: 1
The producer configuration:
- request.required.acks=1
- queue.buffering.max.ms=0  # linger.ms=0
- compression.codec=none
- queue.buffering.max.messages=100000
- send.buffer.bytes=100000000
Any help to reach 1,000,000 msg/s will be appreciated.
Thank you
You're running an old version of Apache Kafka. The most recent release (0.11) includes improvements, particularly around performance.
You might find this useful too: https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
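As a concrete starting point for tuning, larger batches, a small linger, and compression usually raise producer throughput considerably. A hedged sketch (hosts and topic are placeholders, and the numbers are values to iterate on, not a guaranteed recipe):
bin/kafka-producer-perf-test.sh --topic benchmark \
--num-records 15000000 --record-size 300 --throughput -1 \
--producer-props bootstrap.servers=host1:9092,host2:9092 \
acks=1 batch.size=65536 linger.ms=10 compression.type=lz4 \
buffer.memory=67108864
Beyond client settings, more partitions (so more brokers work in parallel) and running several producer instances typically matter more than any single producer property.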

Why single node multiple broker in kafka cluster not preferred?

I am trying to put Kafka into production. I wanted to know why a single-node, multiple-broker Kafka setup is not preferred. A few people suggested that if multiple brokers are used on a single node, they should be allocated separate disks, but the reason for doing so is not clear.
Can someone please explain the impact of a single-broker vs. a multiple-broker Kafka setup on a single node.
If you have multiple brokers on a single node with a single disk, then all brokers have to read from and write to that one disk. That makes the system do lots of random reads and random writes, and the Kafka cluster will have poor performance.
In contrast, if you have multiple disks on a single node and each broker reads from and writes to a different disk, then you can avoid the random read/write problem.
UPDATE
Also, if you have too many brokers on a single machine, the network bandwidth might become a bottleneck, since all brokers have to share it.
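For completeness, each broker on a node needs its own config with a unique id, its own port, and ideally its own disk. A minimal sketch of two server.properties files (ids, ports, and mount points are placeholders); each broker is then started with bin/kafka-server-start.sh pointing at its own file:
# broker-1.properties
broker.id=1
listeners=PLAINTEXT://:9092
log.dirs=/mnt/disk1/kafka-logs
# broker-2.properties
broker.id=2
listeners=PLAINTEXT://:9093
log.dirs=/mnt/disk2/kafka-logs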
Every topic is a particular stream of data (similar to a table in a database). Topics are split into partitions (as many as you like), where each message within a partition gets an incremental id, known as an offset, as shown below.
Partition 0:
+---+---+---+-----+
| 0 | 1 | 2 | ... |
+---+---+---+-----+
Partition 1:
+---+---+---+---+----+
| 0 | 1 | 2 | 3 | .. |
+---+---+---+---+----+
Now a Kafka cluster is composed of multiple brokers. Each broker is identified with an ID and can contain certain topic partitions.
Example of 2 topics (with 3 and 2 partitions respectively):
Broker 1:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| Topic 2 |
| Partition 1 |
+-------------------+
Broker 2:
+-------------------+
| Topic 1 |
| Partition 2 |
| |
| |
| Topic 2 |
| Partition 0 |
+-------------------+
Broker 3:
+-------------------+
| Topic 1 |
| Partition 1 |
| |
| |
| |
| |
+-------------------+
Note that data is distributed (and Broker 3 doesn't hold any data of topic 2).
Topics should have a replication-factor greater than 1 (usually 2 or 3) so that when a broker is down, another one can serve the data of a topic. For instance, assume that we have a topic with 2 partitions and a replication-factor set to 2, as shown below:
Broker 1:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| |
+-------------------+
Broker 2:
+-------------------+
| Topic 1 |
| Partition 0 |
| |
| |
| Topic 1 |
| Partition 1 |
+-------------------+
Broker 3:
+-------------------+
| Topic 1 |
| Partition 1 |
| |
| |
| |
+-------------------+
Now assume that Broker 2 has failed. Brokers 1 and 3 can still serve the data for topic 1. So a replication-factor of 3 is always a good idea, since it allows for one broker to be taken down for maintenance purposes and for another one to be taken down unexpectedly. Therefore, Apache Kafka offers strong durability and fault-tolerance guarantees.
Like most things, the answer to this question is 'it depends'. Your question is generic in nature. It would help if you could be more specific about which attributes of your system you are interested in: performance, availability, etc. From a performance standpoint, having lots of instances on a box (node) is fine if it has lots of resources. But it will not help you from an availability perspective, i.e. your system will have a single point of failure and is at huge risk if that one node happens to go down (unless you have multiple such high-resource nodes at your disposal :-)).
If you have multiple brokers on the same node, then it's possible to end up with all the partitions of a topic on a single node. If that node fails, that particular topic would become unavailable.