FlinkKafkaConsumer010 can not consume data with full parallelism - apache-kafka

I have a Kafka (0.10.2.0) cluster with 10 partitions (on 10 individual Kafka server ports) on one machine, which holds one topic named "test".
I also have a Flink cluster with 294 task slots on 7 machines. A Flink app with parallelism 250 runs on this cluster and uses FlinkKafkaConsumer010 to consume data from the Kafka server with one group id, "TestGroup".
However, I found that only 2 Flink IPs, with 171 TCP connections, have established connections to the Kafka cluster. Worse, only 10 of those connections transfer data; only these 10 had data transferred from beginning to end.
I have checked "Reading from multiple broker kafka with flink", but it did not work in my case.
Any information would be appreciated, thank you.

Related

Kafka connect multiple workers with same connect topics

I'm running Kafka Connect in distributed mode. There are 3 workers, all with the same configuration, such as the group id and the Connect topic names (connect offset, status, config).
My use case is running Debezium connectors (6 connectors) to extract data from 6 different MySQL servers.
Is it a good practice to maintain them in the same Kafka topic (I mean the offsets and all)?

kafka Connect: Tasks.max more than # of partitions but the status says RUNNING

In our setup, we have 50 tasks and 40 partitions in the topic. We have 2 workers. Ideally, the connector should start just 40 tasks but we see all 50 tasks have the status as RUNNING. How is that possible?
There may be idle tasks, but that does not necessarily mean they are in the UNASSIGNED or FAILED state. They are active and running as part of a consumer group (assuming a sink connector).
If you had a source connector, then there are just 50 running producer threads, sending data to all 40 partitions. Producers have no one-per-partition limit like the one that applies to consumers in a group.
You're welcome to PUT a new configuration for the connector and set tasks.max back to 40.
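To illustrate why 10 of the 50 sink tasks sit idle while still reporting RUNNING, here is a minimal sketch of spreading partitions over tasks. The function name `assign_partitions` and the round-robin rule are assumptions made for illustration; Kafka's real assignors differ in detail, but they share the property that each partition has exactly one owner within a group.

```python
def assign_partitions(num_partitions, num_tasks):
    """Round-robin partitions over tasks; each partition gets exactly one owner.

    Hypothetical helper for illustration, not Kafka's actual assignor.
    """
    assignment = {task: [] for task in range(num_tasks)}
    for partition in range(num_partitions):
        assignment[partition % num_tasks].append(partition)
    return assignment


# Numbers from the question: 40 partitions, tasks.max = 50.
assignment = assign_partitions(num_partitions=40, num_tasks=50)
idle_tasks = [task for task, parts in assignment.items() if not parts]

print(len(assignment) - len(idle_tasks), "tasks own a partition")
print(len(idle_tasks), "tasks are running but idle")
```

Under these assumptions, every task beyond the partition count ends up with an empty assignment, which matches the "idle but RUNNING" behavior described above.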

Is scaling Kafka Connect the same as scaling a Kafka consumer?

We need to pull data from Kafka and write it to AWS S3. Kafka is managed by a separate department, and we only have access to a specific topic.
Based on the Kafka documentation, Kafka Connect looks like an easy solution for me because I don't have any custom message processing logic.
Normally, when we run a Kafka consumer, we can run multiple JVMs with the same consumer group for scalability. The consumer JVMs of a given group can run on the same physical server or on different ones. What would be the case when I want to use Kafka Connect?
Let's say I have 20 partitions of the topic.
How can I run Kafka Connect with 20 instances?
Can I have multiple instances of Kafka Connect running on the same physical instance?
Kafka Connect handles balancing the load across all its workers. In your example of 20 partitions, you could have, for example:
1 Kafka Connect worker, processing 20 partitions
5 Kafka Connect workers, each processing 4 partitions
20 Kafka Connect workers, each processing 1 partition
It depends on your volumes and required throughput.
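The arithmetic behind those three layouts can be sketched as follows. This assumes an even split of partitions across workers; `partitions_per_worker` is a hypothetical helper, and the actual balance depends on how Connect distributes tasks and how the consumer group assigns partitions.

```python
def partitions_per_worker(num_partitions, num_workers):
    """Split partitions as evenly as possible across workers (illustrative only)."""
    base, extra = divmod(num_partitions, num_workers)
    # The first `extra` workers take one additional partition each.
    return [base + 1 if i < extra else base for i in range(num_workers)]


for workers in (1, 5, 20):
    print(workers, "worker(s):", partitions_per_worker(20, workers))
```

A worker count that does not divide 20 evenly, such as 3, would give an uneven split like 7, 7, and 6 partitions.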
To run Kafka Connect in Distributed mode across multiple nodes, follow the instructions here and make sure you give them all the same group.id which identifies them as members of the same cluster (and thus eligible for sharing workload of tasks out across them). More config details for distributed mode here.
Even if you're running Kafka Connect on a single node, I would personally recommend running it in Distributed mode as it makes scale-out more simple (you just add additional nodes, but the execution & config remains the same).
I don't see a benefit in running multiple Kafka Connect workers on a single node. Each Kafka Connect worker can run multiple tasks, and connectors, as required.
My understanding is that if you only have a single machine, you should only launch one kafka connect instance, and configure the tasks.max property to the amount of parallelism you'd like to achieve (in your example 20 might be good). This should allow kafka connect to read from your partitions in parallel, see the docs for this here.
You could launch multiple instances on the same machine in theory. It makes sense to do this if you need each instance to consume data from different topics. But if you want the instances to consume data from the same topic, I don't think doing this would benefit you. Using separate threads within the same process with tasks.max will give you the same if not better performance.
If you want kafka connect to run on multiple machines and read data from the same topic it is possible to run in distributed mode.

kafka Performance Reduced when adding more consumer or producer

I have 3 servers with a 10GB connection between them. I run a Kafka cluster on 2 of the servers and generate test load from the third.
When I run a single Java producer (on the third server, which is not in the Kafka cluster), sending 1 million messages takes 3 seconds. But when I run another Java producer (with a different topic) at the same time, both producers take 6 seconds to send their messages.
I am sure the network connection is not the bottleneck (it is 10GB).
So why does this happen, and how can I solve it (I want both producers to take 3 seconds)?
Sounds like you are getting a consistent 333,333 messages/sec performance out of a two node kafka cluster, with zookeeper running on the same 2 machines as your 2 kafka brokers. You don’t say what size these messages are or what kind of disks you are using, or how much memory, or if you are publishing with acks=all, or what programming language you are using (I assume java) but that actually sounds like good consistent results that are probably disk IO bound on the brokers or cpu bound on your single client machine.
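The "consistent 333,333 messages/sec" figure follows directly from the numbers in the question. A small sketch of that arithmetic, assuming each producer sends 1 million messages and both finish in the stated times:

```python
import math


def throughput(messages, seconds):
    """Aggregate messages per second."""
    return messages / seconds


single = throughput(1_000_000, 3)        # one producer, 3 seconds
combined = throughput(2 * 1_000_000, 6)  # two producers, 6 seconds total

print(round(single), round(combined))
print("same aggregate rate:", math.isclose(single, combined))
```

The aggregate rate is the same in both runs, which is what you would expect when some shared resource (broker disk, client CPU) is already saturated by a single producer.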

How to run kafka on different machines

For the last 10 days I have been trying to set up Kafka on different machines:
Server32
Server56
Below is the list of tasks I have done so far.
Configured Zookeeper and started it on both servers with
server.1=Server32_IP:2888:3888
server.2=Server56_IP:2888:3888
I also changed the server and server-1 properties as below
broker.id=0
port=9092
log.dir=/tmp/kafka0-logs
host.name=Server32
zookeeper.connect=Server32_IP:9092,Server56_IP:9062
and server-1
broker.id=1
port=9062
log.dir=/tmp/kafka1-logs
host.name=Server56
zookeeper.connect=Server32_IP:9092,Server56_IP:9062
I ran the server properties on Server32
I ran the server-1 properties on Server56
The problem is: when I start producers on both servers and try to consume from either one, it works. BUT
when I stop either server, the other one is not able to send data.
Please help me understand the process.
Running 2 zookeepers is not fault tolerant. If one of the zookeepers is stopped, then the system will not work. Unlike Kafka brokers, zookeeper needs a quorum (or majority) of the configured nodes in order to work. This is why zookeeper is typically deployed with an odd number of instances (nodes). Since 1 of 2 nodes is not a majority it really is no better than running a single zookeeper. You need at least 3 zookeepers to tolerate a failure because 2 of 3 is a majority so the system will stay up.
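The majority rule described above can be written out as a short calculation. The function names are hypothetical; the only fact used is that a quorum is a strict majority of the configured nodes.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many nodes can stop while a majority is still running."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (1, 2, 3, 5):
    print(f"{n} node(s): quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With 2 nodes the quorum is 2, so no failure is tolerated, exactly as with a single node. Three nodes is the smallest ensemble that survives losing one.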
Kafka is different: you can have any number of Kafka brokers, and if they are configured correctly and you create your topics with a replication factor of 2 or greater, then the Kafka cluster can continue if you take any one of the broker nodes down, even if it's just 1 of 2.
There's a lot of information missing here, like the Kafka version and whether you're using the new consumer APIs or the old APIs. I'm assuming you're probably using a newer version of Kafka like 0.10.x along with the new client APIs. With the new client APIs, consumer offsets are stored on the Kafka brokers and not in Zookeeper as in the older versions. I think your issue here is that you created your topics with a replication factor of 1 and, coincidentally, the Kafka broker you shut down was hosting the only replica, so you won't be able to produce or consume messages. You can confirm the health of your topics by running the command:
kafka-topics.sh --zookeeper ZHOST:2181 --describe
You might want to increase the replication factor to 2. That way you might be able to get away with one broker failing. Ideally you would have 3 or more Kafka Broker servers with a replication factor of 2 or higher (obviously not more than the number of brokers in your cluster). Refer to the link below:
https://kafka.apache.org/documentation/#basic_ops_increase_replication_factor
"For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log."
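To make the replication-factor point concrete, here is a simplified sketch that checks which partitions lose every replica when a broker stops. The replica layouts and the helper `unavailable_partitions` are made up for illustration; the model deliberately ignores in-sync replica tracking, `min.insync.replicas`, and leader election.

```python
def unavailable_partitions(replica_assignment, failed_brokers):
    """Return partitions whose replicas all live on failed brokers.

    replica_assignment maps partition id -> list of broker ids (hypothetical layout).
    """
    failed = set(failed_brokers)
    return [
        partition
        for partition, replicas in replica_assignment.items()
        if set(replicas) <= failed
    ]


# Two brokers (ids 0 and 1), two partitions.
rf1 = {0: [0], 1: [1]}        # replication factor 1
rf2 = {0: [0, 1], 1: [1, 0]}  # replication factor 2

print("RF=1, broker 1 down:", unavailable_partitions(rf1, [1]))
print("RF=2, broker 1 down:", unavailable_partitions(rf2, [1]))
```

With a replication factor of 1, stopping broker 1 takes partition 1 offline. With a replication factor of 2, every partition still has a surviving replica on broker 0.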