Apache Kafka Throughput and Latency

I set up a Kafka cluster on my local machine and tested producers with different throughput settings to see what happens to the latency.
I used the kafka-producer-perf-test benchmark for these tests:
https://docs.cloudera.com/runtime/7.2.10/kafka-managing/topics/kafka-manage-cli-perf-test.html
Different throughput on producer
When I set the throughput to 200,000, I only get about 22k records/sec. Does this mean that the Kafka cluster on my local machine cannot handle this throughput?
I tested different throughput values to try to understand what is happening here.
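One way to read these numbers, as a minimal sketch rather than a description of the tool's internals: the throughput setting of the producer perf test is a throttle, so the achieved rate is roughly the smaller of the requested rate and whatever the producer, broker and disk can sustain. The 22,000 records/sec ceiling is taken from the run described above; the function name and the other requested rates are hypothetical.

```python
def achieved_rate(requested_per_sec, sustainable_per_sec):
    """Model the throughput setting as a cap on the send rate."""
    if requested_per_sec < 0:
        # A negative value (such as -1) disables throttling in the perf test,
        # so the run goes as fast as the setup allows.
        return sustainable_per_sec
    return min(requested_per_sec, sustainable_per_sec)


SUSTAINABLE = 22_000  # records/sec observed when 200,000 was requested

for requested in (1_000, 10_000, 22_000, 200_000):  # hypothetical targets
    print(requested, "->", achieved_rate(requested, SUSTAINABLE))
```

Under this reading, latency would be expected to climb once the requested rate passes the sustainable rate, because records queue up in the producer faster than they can be sent.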

Related

Druid Tuning Configuration

I am a beginner with Druid and Kafka.
I want to build an interactive, real-time Kafka-to-Druid data pipeline. I am still confused about which tuning settings I should change in this configuration.
Thanks in advance.
It depends:
on the size of your Kafka messages
on your producer's capacity and the rate of messages produced to Kafka
on your Kafka servers, partitioning and lots of other factors
on your Druid deployment model (single server or cluster)
and so on.
But the most important thing I see missing here is your task count, which sets the amount of parallel processing (and since your source is Kafka, that means parallel consumers). Increase it and make sure your Druid host (or your MiddleManager host if it is a cluster) has enough cores for your tasks. Also make sure that on your MiddleManager you have increased the total number of available task slots:
druid.worker.capacity
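The capacity check described above can be sketched as follows. This is only an illustration: it assumes the Kafka supervisor's taskCount and replicas settings, it assumes that tasks beyond the number of partitions have nothing to read, and the headroom factor of 2 (room for one set of tasks reading while the previous set is still publishing) is a rule of thumb, not a guarantee. All numbers are hypothetical.

```python
def reading_tasks(task_count, replicas, partitions):
    # Extra tasks beyond the partition count would sit idle,
    # so the effective task count is capped by the number of partitions.
    return min(task_count, partitions) * replicas


def fits_worker_capacity(task_count, replicas, partitions,
                         worker_capacity, headroom=2):
    # headroom=2 is an assumption: one generation of tasks reading
    # while the previous generation is still publishing segments.
    needed = reading_tasks(task_count, replicas, partitions) * headroom
    return needed <= worker_capacity


# Hypothetical setup: 8 partitions, taskCount=4, replicas=1.
print(reading_tasks(4, 1, 8))
print(fits_worker_capacity(4, 1, 8, worker_capacity=4))
print(fits_worker_capacity(4, 1, 8, worker_capacity=8))
```

If the check fails, either raise druid.worker.capacity (given enough cores) or lower the task count.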

How does the distribution mechanism work when Kafka runs locally?

How does the distribution mechanism work when Kafka runs locally? Please describe the disadvantages too.
If you only run one broker locally, you have a single point of failure and no processing is truly distributed.
If you have multiple brokers on the same machine, and you mount a different volume for each broker process's logs, you'd end up with distributed storage and some fault tolerance, but still no distributed processing.
In either case, you can create as many topics as you want with many partitions, but the replication factor of a topic can be at most the number of active brokers.
Multiple consumer processes are also able to run fine on a single machine, but you'd get more throughput by separating brokers and consumers across several physical machines (more CPU available, and different network interfaces).
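The replication-factor limit can be sketched as a small check. This is a hypothetical helper, not Kafka's own validation code; it only encodes the rule stated above, plus the usual reading that a replication factor of N keeps a partition's data through N - 1 broker failures.

```python
def check_replication(replication_factor, active_brokers):
    """Return how many broker failures a topic's data survives,
    or raise if the replication factor cannot be satisfied."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    if replication_factor > active_brokers:
        raise ValueError(
            f"replication factor {replication_factor} exceeds "
            f"{active_brokers} active broker(s)"
        )
    return replication_factor - 1


print(check_replication(1, 1))  # single local broker
print(check_replication(3, 3))  # three brokers on one machine
```

Note that with all brokers on one machine, losing that machine still takes every replica with it.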

Is scaling Kafka Connect the same as scaling a Kafka consumer?

We need to pull data from Kafka and write it to AWS S3. The Kafka cluster is managed by a separate department and we only have access to a specific topic.
Based on the Kafka documentation, it looks like Kafka Connect is an easy solution for me because I don't have any custom message processing logic.
Normally, when we run a Kafka consumer, we can run multiple JVMs with the same consumer group for scalability. The JVMs of a given consumer group can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say the topic has 20 partitions.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical machine?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have, for example:
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
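The splits listed above are just an even division of partitions over workers. A minimal sketch of that arithmetic follows; the real assignment is decided by Kafka Connect's rebalancing and by how many tasks the connector runs, so the function below (a hypothetical name) only shows the expected shape of the result.

```python
def spread(partitions, workers):
    """Split partitions as evenly as possible across workers."""
    if workers < 1:
        raise ValueError("need at least one worker")
    base, extra = divmod(partitions, workers)
    # The first `extra` workers take one additional partition each.
    return [base + 1 if i < extra else base for i in range(workers)]


for workers in (1, 5, 20):
    print(workers, "worker(s):", spread(20, workers))
```

With a worker count that does not divide 20 evenly, some workers simply carry one more partition than others.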
To run Kafka Connect in distributed mode across multiple nodes, follow the instructions here and make sure you give all workers the same group.id, which identifies them as members of the same cluster (and thus eligible for sharing the workload of tasks among them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in distributed mode, as it makes scaling out simpler (you just add additional nodes, but the execution and config remain the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks and connectors, as required.
My understanding is that if you only have a single machine, you should launch only one Kafka Connect instance and set the connector's tasks.max property to the amount of parallelism you'd like to achieve (in your example, 20 might be good). This should allow Kafka Connect to read from your partitions in parallel; see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same, if not better, performance.
If you want Kafka Connect to run on multiple machines and read data from the same topic, you can run it in distributed mode.
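As an illustration of where tasks.max goes, here is a sketch that builds a connector definition of the kind submitted to a distributed Kafka Connect cluster. It assumes the Confluent S3 sink connector, which the thread does not name; the connector name, topic, bucket and region are hypothetical placeholders, and flush.size is an arbitrary value.

```python
import json

PARTITIONS = 20  # from the question


def s3_sink_definition(topic, bucket, region, tasks_max):
    # Assumption: Confluent's S3 sink connector. All values below
    # other than tasks.max are placeholders for illustration.
    return {
        "name": "s3-sink-example",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "topics": topic,
            "s3.bucket.name": bucket,
            "s3.region": region,
            "flush.size": "1000",
            # More tasks than partitions would leave tasks idle.
            "tasks.max": str(min(tasks_max, PARTITIONS)),
        },
    }


definition = s3_sink_definition("my-topic", "my-bucket", "us-east-1", 20)
print(json.dumps(definition, indent=2))
```

The same definition works whether the cluster has one worker or several; adding workers only changes where the tasks run.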

Kafka performance reduced when adding more consumers or producers

I have 3 servers with a 10GB connection between them. I run a Kafka cluster on 2 of the servers and generate test load from the third.
When I run a single Java producer (on the third server, which is not in the Kafka cluster), sending 1 million messages takes 3 seconds, but when I run another Java producer (with a different topic) at the same time, both producers take 6 seconds to send their messages.
I am sure the network connection is not the bottleneck (it is 10GB).
So why does this happen, and how can I solve it? (I want both producers to take 3 seconds.)
Sounds like you are getting a consistent 333,333 messages/sec (1 million messages in 3 seconds, or 2 million in 6 seconds) out of a two-node Kafka cluster, with ZooKeeper running on the same 2 machines as your 2 Kafka brokers. You don’t say what size these messages are, what kind of disks you are using, how much memory you have, whether you are publishing with acks=all, or what programming language you are using (I assume Java). Still, that actually sounds like good, consistent results that are probably disk I/O bound on the brokers or CPU bound on your single client machine.
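The arithmetic behind that figure, as a short sketch using only the numbers from the question:

```python
def messages_per_second(messages, seconds):
    return messages / seconds


one_producer = messages_per_second(1_000_000, 3)
# Two producers, 1 million messages each, both finishing after 6 seconds.
two_producers = messages_per_second(2 * 1_000_000, 6)

print(round(one_producer), round(two_producers))
print("same aggregate rate:", one_producer == two_producers)
```

The aggregate rate is the same in both runs, which is what a shared bottleneck (broker disks or the single client machine) would look like: adding a second producer splits the available capacity instead of adding to it.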

Kafka Producers/Consumers over WAN?

I have a Kafka Cluster in a data center. A bunch of clients that may communicate across WANs (even the internet) will send/receive real time messages to/from the cluster.
I read from Kafka's Documentation:
...It is possible to read from or write to a remote Kafka cluster over the WAN though TCP tuning will be necessary for high-latency links.
It is generally not advisable to run a single Kafka cluster that spans multiple datacenters as this will incur very high replication latency both for Kafka writes and Zookeeper writes and neither Kafka nor Zookeeper will remain available if the network partitions.
From what I understand here and here:
Producing over a WAN doesn't require ZK and is okay, just mind tweaks to TCP for high latency connections. Great! Check.
The High Level consumer APIs require ZK connections.
Aren't clients reading from or writing to Kafka over a WAN then subject to the same limitations as the clusters in the statements highlighted above?
The statements you have highlighted are mostly targeted at the internal communication within the Kafka/ZooKeeper cluster, where evil things will happen during network partitions, which are much more common across a WAN.
Producers are isolated and, if there are network issues, should be able to buffer and retry based on your settings.
High-level consumers are trickier since, as you note, they require a connection to ZooKeeper. Here, when disconnects occur, there will be rebalancing and a higher chance that messages will get duplicated.
Keep in mind that the producer will need to be able to reach every Kafka broker, and the consumer will need to be able to reach all ZooKeeper nodes and Kafka brokers, so a load balancer won't work.
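On the "TCP tuning for high-latency links" point from the quoted documentation, a common starting point is the bandwidth-delay product: the number of bytes that must be in flight to keep a link full. The sketch below only does that arithmetic; the link speed and round-trip time are hypothetical, and whether the result should go into socket buffer settings such as the producer's send.buffer.bytes (and the operating system's own limits) is an assumption to verify for your setup.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Bytes in flight needed to keep a link of this speed and RTT full."""
    # bits/sec * seconds = bits; divide by 8 for bytes.
    return bandwidth_bits_per_sec * rtt_ms // (8 * 1000)


# Hypothetical WAN link: 100 Mbit/s with an 80 ms round trip.
bdp = bandwidth_delay_product_bytes(100_000_000, 80)
print(bdp, "bytes")

# Hypothetical LAN link for comparison: 1 Gbit/s with a 1 ms round trip.
print(bandwidth_delay_product_bytes(1_000_000_000, 1), "bytes")
```

A socket buffer much smaller than this value caps a single connection's throughput below the link speed, which is why defaults that work on a LAN can underperform over a WAN.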