Do Kafka consumers have an open connection per partition? - apache-kafka

I know that each partition is allocated to one Kafka consumer (inside of a consumer-group), but one Kafka consumer can be consuming multiple partitions at the same time. If each has an open connection to the partition, then I can imagine tens of thousands of connections open per consumer. If this is true, that seems like something to watch out for when deciding on number of partitions, no?

I'm assuming you are asking about the official Java client. Third party clients could do something else.
The KafkaConsumer does not have a network connection per partition. As you hinted, that would not scale very well.
Instead the KafkaConsumer has a connection to each broker/node that is the leader of a partition it is consuming from. Data for partitions that have the same leader is transmitted using the same connection. It also uses an additional connection to the Coordinator for its group. So at worst it can have <# of brokers in the cluster> + 1 connections to the Kafka cluster.
Have a look at NetworkClient.java; you'll see that connections are handled per Node (broker).
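The arithmetic above can be sketched in a few lines (illustrative Python, not actual client code; the partition-to-leader map below is hypothetical):

```python
def consumer_connections(partition_leaders):
    # One connection per DISTINCT leader broker, plus one to the group
    # coordinator -- not one connection per partition.
    return len(set(partition_leaders.values())) + 1

# 10,000 partitions whose leaders are spread over 5 brokers
# still need only 6 connections from a single consumer.
leaders = {p: p % 5 for p in range(10_000)}
assert consumer_connections(leaders) == 6
```

So the connection count is bounded by the broker count, not the partition count, which is why a high partition count is cheap on the networking side.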

Related

Kafkajs - multiple consumers reading from same topic and partition

I'm planning to use Kafkajs https://kafka.js.org/ and implement it in a NodeJs server.
I would like to know what is the expected behavior in case I have 2 (or more) instances of the server running, each of them having a consumer which is configured with the same group id and topic?
Does this mean that they might read the same messages?
Should I specify a unique consumer group for each server instance?
I read this - Multiple consumers consuming from same topic - but I'm not sure it applies to Kafkajs.
In any Kafka client library, it's not possible for multiple consumers in a single consumer group to read from overlapping partitions. If your topic has only one partition, only one instance of your application will be consuming it, and if that instance dies, the group will rebalance and the other instance will take over (possibly re-reading some of the same data, due to the nature of at-least-once delivery, but never at the same time as the other instance).
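A rough sketch of that disjoint split, assuming a simple round-robin assignor (the real assignor is pluggable: range, round-robin, sticky, ...):

```python
def assign(partitions, consumers):
    # Each partition goes to exactly one consumer in the group; with more
    # consumers than partitions, the extras sit idle.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# One partition, two server instances with the same group id:
# only one instance consumes, the other is a hot standby.
a = assign([0], ["instance-1", "instance-2"])
assert a == {"instance-1": [0], "instance-2": []}
```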

How exactly Kafka's consumer communicate to server?

Does the Kafka consumer keep checking the health of the broker (Kafka server), or vice versa?
Let's say that, one way or another, consumers and brokers know each other's health; how exactly does the consumer read from the partition?
Let's say I have 48 partitions for a topic and two consumer groups for the topic; how many threads will be consuming the data from all the partitions?
Consumers send out periodic heartbeats so that the broker knows whether the consumers are healthy. Brokers' health is tracked by the Controller, a Kafka service that runs on every broker in a Kafka cluster, but only one can be active (elected) at any point in time.
See this video for a detailed description. In a nutshell, the first consumer in the group is the leader and decides on the assignments for the rest of the consumers in the same group. This data is sent to the broker and distributed across the consumers.
Thread management is your own responsibility. Both KafkaProducer and KafkaConsumer are single-threaded components.
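A minimal sketch of the usual pattern, one consumer instance per thread (since the consumer is not thread-safe), using a hypothetical DummyConsumer as a stand-in for the real client:

```python
import threading

class DummyConsumer:
    # Hypothetical stand-in for a Kafka consumer; it only counts poll() calls.
    def __init__(self, group_id, client_id):
        self.group_id, self.client_id = group_id, client_id
        self.polled = 0

    def poll(self):
        self.polled += 1  # a real consumer would return fetched records here

def consume_loop(consumer, polls):
    for _ in range(polls):
        consumer.poll()

threads, consumers = [], []
for group in ("group-a", "group-b"):  # two independent consumer groups
    for i in range(3):                # three threads per group, each
        c = DummyConsumer(group, f"{group}-consumer-{i}")  # owning its OWN consumer
        t = threading.Thread(target=consume_loop, args=(c, 10))
        consumers.append(c)
        threads.append(t)
        t.start()
for t in threads:
    t.join()

assert len(consumers) == 6 and all(c.polled == 10 for c in consumers)
```

With 48 partitions and two groups, you could run up to 48 such threads per group (96 total) before threads start sitting idle with no partition assigned.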

Under what circumstance will kafka overflow?

I am testing a POC where data will be sent from 50+ servers from a UDP port to a Kafka client (I will use a UDP-Kafka bridge). Each server generates ~2000 messages/sec, which adds up to 100K+ messages/sec across all 50+ servers. My questions here are:
What will happen if kafka is not able to ingest all those messages?
Will that create a backlog on my 50+ servers?
If I have a kafka cluster, how do I decide which server sends message to which broker?
1) It depends what you mean by "not able" to ingest. If you mean it's bottlenecked by the network, then you'll likely just get timeouts when trying to produce some of your messages. If you have retries (documented here) set to anything other than the default (0), it will attempt to send the message that many times. You can also customise request.timeout.ms, whose default value is 30 seconds. If you mean in terms of disk space, the broker will probably crash.
2) You do this in a roundabout way. When you create a topic with N partitions, the brokers will create those partitions on whichever server has the least usage at that time, one by one. This means that, if you have 10 brokers, and you create a 10 partition topic, each broker should receive one partition. Now, you can specify which partition to send a message to, but you can't specify which broker that partition will be on. It's possible to find out which broker a partition is on after you have created it, so you could create a map of partitions to brokers and use that to choose which broker to send to.
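The "one partition per broker" placement can be sketched as simple round-robin (illustrative only; real placement also weighs each broker's current usage):

```python
from collections import Counter

def place_partitions(num_partitions, brokers):
    # Round-robin sketch of where a new topic's partitions land.
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}

# 10 partitions over 10 brokers: each broker should receive exactly one,
# giving you the partition -> broker map described above.
placement = place_partitions(10, [f"broker-{i}" for i in range(10)])
assert all(n == 1 for n in Counter(placement.values()).values())
```

Inverting that map gives you a broker -> partitions lookup, which is what you would use to steer a message toward a particular broker by choosing one of its partitions.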

Load Balance 1-Topic Kafka Cluster

We are in the process of designing a Kafka Cluster (at least 3 nodes) that will process events from an array of web servers. Since the logs are largely identical, we are planning to create a single Topic only (say - webevents)
We expect a lot of traffic from the servers. Since there is a single topic, there will be a single leader broker. In such a case how will the cluster balance the high traffic? All write requests will always be routed to the leader broker at all times and other nodes might be underutilized.
Does a external hardware balancer help solve this problem? Alternately, can a Kafka configuration help distribute write requests evenly on a 1-topic cluster?
Short answer: a topic may have multiple partitions and each partition, not topic, has a leader. Leaders are evenly distributed among brokers. So, if you have multiple partitions in your topic you will have multiple leaders and your writes will be evenly distributed among brokers.
You will have a single topic with a lot of partitions; you can replicate partitions for high availability/durability of your data.
Each broker will hold an evenly distributed number of partitions, and for each of these partitions the broker can be either the leader or a replica. Kafka producers (in your case, Kafka clients running on your web servers) write to a single leader; this load-balances production, so that each write can be serviced by a separate broker and machine.
Producers do the load balancing selecting the target partition for each message. It can be done based on the message key, so all messages with same key go to the same partition, or on a round-robin fashion if you don't set a message key.
Take a look at this nice post.
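The key-based routing described above can be sketched like this (illustrative; the real Java client hashes keys with murmur2, crc32 here is just a stand-in deterministic hash):

```python
import zlib

def partition_for(key, num_partitions):
    # Same key -> same partition, so per-key ordering is preserved.
    # Messages with no key would instead be spread across partitions
    # (round-robin, or sticky batching in newer Java clients).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p = partition_for("web-server-42", 12)
assert p == partition_for("web-server-42", 12)  # deterministic
assert 0 <= p < 12
```

Because each partition has its own leader and leaders are spread across brokers, this per-message partition choice is what spreads the write load over the whole cluster.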

Why do Kafka consumers connect to zookeeper, and producers get metadata from brokers?

Why is it that consumers connect to zookeeper to retrieve the partition locations? And kafka producers have to connect to one of the brokers to retrieve metadata.
My point is, what exactly is the use of zookeeper when every broker already has all the necessary metadata to tell producers the location to send their messages? Couldn't the brokers send this same information to the consumers?
I can understand why brokers have the metadata, to not have to make a connection to zookeeper each time a new message is sent to them. Is there a function that zookeeper has that I'm missing? I'm finding it hard to think of a reason why zookeeper is really needed within a kafka cluster.
First of all, ZooKeeper is needed only by the high-level consumer. SimpleConsumer does not require ZooKeeper to work.
The main reason ZooKeeper is needed by the high-level consumer is to track consumed offsets and handle load balancing.
Now in more detail.
Regarding offset tracking, imagine the following scenario: you start a consumer, consume 100 messages and shut the consumer down. Next time you start your consumer you'll probably want to resume from your last consumed offset (which is 100), and that means you have to store the maximum consumed offset somewhere. Here's where ZooKeeper kicks in: it stores offsets for every group/topic/partition. That way, the next time you start your consumer it can ask "hey ZooKeeper, what's the offset I should start consuming from?". Kafka is actually moving towards being able to store offsets not only in ZooKeeper but in other storages as well (for now only the ZooKeeper and Kafka offset storages are available, and I'm not sure the Kafka storage is fully implemented).
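What ZooKeeper provides here amounts to a small keyed store, sketched below (illustrative; the class and method names are made up):

```python
class OffsetStore:
    # Sketch of offset tracking: one committed offset per
    # (group, topic, partition), which is what ZooKeeper (or, later,
    # the __consumer_offsets topic) stores for the group.
    def __init__(self):
        self.offsets = {}

    def commit(self, group, topic, partition, offset):
        self.offsets[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition):
        # A restarting consumer asks: where should I resume from?
        return self.offsets.get((group, topic, partition), 0)

store = OffsetStore()
store.commit("my-group", "events", 0, 100)   # consumed 100, then shut down
assert store.fetch("my-group", "events", 0) == 100  # resume from 100
assert store.fetch("my-group", "events", 1) == 0    # never consumed: start at 0
```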
Regarding load balancing, the volume of messages produced can be too large to handle on one machine, and you'll probably want to add computing power at some point. Let's say you have a topic with 100 partitions, and to handle this volume you have 10 machines. Several questions arise here:
how should these 10 machines divide partitions between each other?
what happens if one of the machines dies?
what happens if you want to add another machine?
And again, here's where ZooKeeper kicks in: it tracks all consumers in the group, and each high-level consumer subscribes to changes in that group. The point is that when a consumer appears or disappears, ZooKeeper notifies all consumers and triggers a rebalance so that they split the partitions near-equally (i.e., to balance load). This way it guarantees that if one consumer dies, the others will continue processing the partitions that were owned by that consumer.
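The near-equal split before and after a member dies can be sketched as:

```python
def assignment_after(num_partitions, members):
    # Near-equal round-robin split, as the rebalance described above produces.
    out = {m: [] for m in members}
    for p in range(num_partitions):
        out[members[p % len(members)]].append(p)
    return out

before = assignment_after(100, [f"c{i}" for i in range(10)])
after = assignment_after(100, [f"c{i}" for i in range(9)])  # c9 died, group rebalanced

assert all(len(parts) == 10 for parts in before.values())   # 10 partitions each
assert sum(len(parts) for parts in after.values()) == 100   # no partition orphaned
```

The key guarantee is the second assertion: after a rebalance, every partition is still owned by some surviving consumer.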
With Kafka 0.9+ the new Consumer API was introduced. New consumers do not need a connection to ZooKeeper, since group balancing is provided by Kafka itself.
You are right, consumers haven't needed to connect to ZooKeeper since the Kafka 0.9 release. They redesigned the API and introduced a new consumer client:
the 0.9 release introduces beta support for the newly redesigned consumer client. At a high level, the primary difference in the new consumer is that it removes the distinction between the "high-level" ZooKeeper-based consumer and the "low-level" SimpleConsumer APIs, and instead offers a unified consumer API.
and
Finally this completes a series of projects done in the last few years to fully decouple Kafka clients from Zookeeper, thus entirely removing the consumer client's dependency on ZooKeeper.