Does a Kafka consumer keep checking the health of the broker (Kafka server), or vice versa?
Assuming consumers and brokers do know each other's health, how exactly does a consumer read from a partition?
Let's say I have 48 partitions for a topic and have two consumer groups for the topic, so how many threads will be consuming the data from all partitions?
Consumers send heartbeats so that the broker knows whether they are healthy. Broker health is tracked by the Controller, a Kafka service that runs on every broker in the cluster, although only one instance is active (elected) at any point in time.
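The heartbeat idea can be pictured with a small, library-free sketch. This is not Kafka's implementation: the class `GroupCoordinatorSketch`, its methods, and the 10-second timeout are assumptions made for illustration. In a real consumer the related settings are `session.timeout.ms` and `heartbeat.interval.ms`.

```python
class GroupCoordinatorSketch:
    """Toy model of a broker-side component tracking consumer heartbeats."""

    def __init__(self, session_timeout_s):
        self.session_timeout_s = session_timeout_s
        self.last_heartbeat = {}  # consumer id -> time of last heartbeat

    def heartbeat(self, consumer_id, now):
        # Record that this consumer was alive at time `now`.
        self.last_heartbeat[consumer_id] = now

    def live_members(self, now):
        # A consumer counts as alive while its last heartbeat is recent enough.
        return sorted(
            c for c, t in self.last_heartbeat.items()
            if now - t <= self.session_timeout_s
        )


coordinator = GroupCoordinatorSketch(session_timeout_s=10)
coordinator.heartbeat("consumer-a", now=0)
coordinator.heartbeat("consumer-b", now=0)
coordinator.heartbeat("consumer-a", now=8)  # consumer-b goes silent

print(coordinator.live_members(now=9))   # both are still within the timeout
print(coordinator.live_members(now=12))  # consumer-b has expired
```

Once a member expires, its partitions would be handed to the remaining members of the group.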
See this video for a detailed description. In a nutshell, the first consumer to join the group becomes the group leader and decides on the partition assignments for the rest of the consumers in the same group. The assignments are sent to the broker, which distributes them to the consumers.
Thread management is your own responsibility. KafkaConsumer is not thread-safe, so each instance should be used from a single thread; KafkaProducer, by contrast, is thread-safe and can be shared across threads.
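For the 48-partition question, one common arrangement is one consumer instance per thread. The sketch below uses no Kafka client at all: `run_consumer` is a stand-in for a poll loop, the group names are made up, and `THREADS_PER_GROUP = 4` is an arbitrary assumption. It only illustrates that each group covers all 48 partitions on its own, and that the thread count per group is your choice (more than 48 in one group would leave the extras idle).

```python
import threading

NUM_PARTITIONS = 48
THREADS_PER_GROUP = 4  # assumption: any value from 1 to 48 does useful work


def partitions_for(member_index, member_count, num_partitions):
    """Partitions owned by one member when spread round-robin."""
    return list(range(member_index, num_partitions, member_count))


def run_consumer(group, member_index, seen, lock):
    # Stand-in for a poll loop: each thread owns its own "consumer"
    # and only touches the partitions assigned to it.
    owned = partitions_for(member_index, THREADS_PER_GROUP, NUM_PARTITIONS)
    with lock:
        seen[group].extend(owned)


seen = {"group-1": [], "group-2": []}  # hypothetical group names
lock = threading.Lock()
threads = [
    threading.Thread(target=run_consumer, args=(group, i, seen, lock))
    for group in seen
    for i in range(THREADS_PER_GROUP)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

for group, partitions in seen.items():
    print(group, "covered", len(set(partitions)), "partitions with",
          THREADS_PER_GROUP, "threads")
```

With these numbers there are 8 consuming threads in total, 4 per group, each handling 12 partitions.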
I could not find the answer in the transaction API. I think it's probably not possible looking at the design under the hood but I wanted a confirmation: is it possible to have a Kafka transaction if the consumer and the producer are connected to different Kafka clusters? Meaning the topics I am consuming from and the topics I am pushing to are on 2 separate Kafka clusters.
You are correct, a Kafka transaction cannot span across multiple clusters.
Both the consumer and producer must use the same cluster. The main reason is that both the offsets from the consumer and the records from the producer have to be committed together in the transaction.
Is there any way to pause or throttle a Kafka producer based on consumer lag or other consumer issues? Would the producer need to determine itself if there is consumer lag then perform throttling itself?
Kafka is built on a pub/sub design. Producers publish messages to a central topic, and multiple consumers can subscribe to that topic. Because multiple consumers are involved, the producer's speed cannot be set by any one of them: one consumer may be slow while another is fast. Throttling the producer on consumer lag also goes against the design principle, since it would tightly couple the two systems. If your use case requires throttling, you may want to evaluate another approach, such as a direct REST call.
Producer and Consumer are decoupled.
Producers push data to Kafka topics (partitioned topics), which are stored on the Kafka brokers. A producer doesn't know who consumes its messages or how often.
Consumers consume data from the brokers. A consumer doesn't know how many producers produce the messages. The same messages can even be consumed by several consumers in different groups. For example, one consumer can consume faster than another.
You can read more about producers and consumers on the Apache Kafka website.
It is not possible to throttle producers based on the performance of consumers.
In my scenario I don't want to lose events if the disk size is
exceeded before a message is consumed
To tackle your issue, you have to rely on the parallelism Kafka offers. Your topic should have multiple partitions, and producers have to use different keys when populating it, so that your data is distributed across the partitions. By adding a consumer group you can then spread the load across a group of consumers. All data within a partition is processed in order, which may be relevant since you are dealing with event processing.
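How different keys end up on different partitions can be sketched as a hash of the key modulo the partition count. The Java client's default partitioner uses a murmur2 hash for keyed records; the CRC32 below is only a stand-in so the example runs with the standard library, and the partition count and key names are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    # Stand-in hash: Kafka's Java client uses murmur2 here, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 6  # hypothetical partition count
keys = [f"device-{i}" for i in range(1000)]  # hypothetical event keys

# Count how many keys land on each partition.
load = Counter(partition_for(k, NUM_PARTITIONS) for k in keys)
print(sorted(load.items()))

# The same key always maps to the same partition, which is what
# preserves ordering for all events that share a key.
first = partition_for("device-42", NUM_PARTITIONS)
second = partition_for("device-42", NUM_PARTITIONS)
print(first == second)
```

If all events used a single key, they would all land on one partition and the extra partitions (and consumers) would sit idle.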
I have two consumer servers with the same group id subscribed to the same topic.
A Kafka server is running with only one partition.
As far as I know, messages should be consumed randomly by those two consumer servers.
But now it seems that consumer server A always consumes the messages and the other one consumes none. If I stop consumer server A, the other one works fine.
What I expect is that they consume messages randomly.
To be able to use two consumer instances in parallel you need at least two partitions in the topic. A consumer will bind to one or more partitions of a topic and other consumers with the same groupId will not claim partitions which already have consumers bound to them. If a consumer fails/crashes, the partition will be released and then picked up by another consumer instance.
Is there a way I can make a Kafka topic non-persistent? I plan to use multiple consumers on a single topic, but I don't want all my consumers picking up the same messages.
In Kafka, to simulate the behaviour of a queue, all your consumers would be in the same consumer group.
See the kafka docs for more information
Consumers
Messaging traditionally has two models: queuing and publish-subscribe.
In a queue, a pool of consumers may read from a server and each
message goes to one of them; in publish-subscribe the message is
broadcast to all consumers. Kafka offers a single consumer abstraction
that generalizes both of these—the consumer group. Consumers label
themselves with a consumer group name, and each message published to a
topic is delivered to one consumer instance within each subscribing
consumer group. Consumer instances can be in separate processes or on
separate machines.
If all the consumer instances have the same consumer group, then this
works just like a traditional queue balancing load over the consumers.
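The delivery rule in the quoted passage (one consumer instance per subscribing group receives each message) can be illustrated with a toy model. The group and member names are hypothetical, and per-message round-robin is a simplification: Kafka distributes work by assigning partitions to group members, not by dealing out individual messages.

```python
from collections import defaultdict
from itertools import cycle

# Hypothetical membership: two groups subscribed to the same topic.
groups = {
    "billing": ["billing-1", "billing-2"],
    "audit": ["audit-1"],
}

# Round-robin stands in for partition-based distribution inside a group.
next_member = {name: cycle(members) for name, members in groups.items()}
received = defaultdict(list)

for message in ["m0", "m1", "m2", "m3"]:
    for name in groups:
        # Every group gets the message; exactly one member of it handles it.
        received[next(next_member[name])].append(message)

for member in sorted(received):
    print(member, received[member])
```

The two `billing` members split the messages like a queue, while `audit-1` sees every message, which is the publish-subscribe side of the same abstraction.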
If you want to control when messages are deleted from the log, you can set retention.ms or retention.bytes in the topic configuration. Be aware that these settings delete messages regardless of whether they have been consumed.
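A rough model of time-based retention shows why a slow consumer can miss data. This is a simplification under stated assumptions: the record layout, the `apply_retention` helper, and the numbers are invented, and a real broker removes whole log segments rather than individual records.

```python
def apply_retention(log, now_ms, retention_ms):
    """Drop records older than retention_ms; consumer progress is never consulted."""
    return [r for r in log if now_ms - r["timestamp_ms"] <= retention_ms]


log = [
    {"offset": 0, "timestamp_ms": 0},
    {"offset": 1, "timestamp_ms": 4_000},
    {"offset": 2, "timestamp_ms": 9_000},
]
consumer_position = 0  # hypothetical slow consumer that has read nothing yet

remaining = apply_retention(log, now_ms=10_000, retention_ms=5_000)
# Records that were deleted although the consumer had not reached them.
lost = [
    r["offset"] for r in log
    if r not in remaining and r["offset"] >= consumer_position
]

print("remaining offsets:", [r["offset"] for r in remaining])
print("deleted before being consumed:", lost)
```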
Why is it that consumers connect to zookeeper to retrieve the partition locations? And kafka producers have to connect to one of the brokers to retrieve metadata.
My point is, what exactly is the use of zookeeper when every broker already has all the necessary metadata to tell producers the location to send their messages? Couldn't the brokers send this same information to the consumers?
I can understand why brokers have the metadata, to not have to make a connection to zookeeper each time a new message is sent to them. Is there a function that zookeeper has that I'm missing? I'm finding it hard to think of a reason why zookeeper is really needed within a kafka cluster.
First of all, ZooKeeper is needed only for the high-level consumer. SimpleConsumer does not require ZooKeeper to work.
The main reason ZooKeeper is needed for a high-level consumer is to track consumed offsets and handle load balancing.
Now in more detail.
Regarding offset tracking, imagine the following scenario: you start a consumer, consume 100 messages and shut the consumer down. The next time you start your consumer you'll probably want to resume from your last consumed offset (which is 100), and that means you have to store the maximum consumed offset somewhere. Here's where ZooKeeper kicks in: it stores offsets for every group/topic/partition. So the next time you start your consumer it can ask "hey ZooKeeper, what's the offset I should start consuming from?". Kafka is actually moving towards being able to store offsets not only in ZooKeeper but in other storages as well (for now only ZooKeeper and Kafka offset storage are available, and I'm not sure Kafka storage is fully implemented).
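The resume-after-restart scenario can be sketched with an in-memory stand-in for the offset store. `OffsetStoreSketch`, the `consume` helper, and the topic and group names are hypothetical; the only point is that the position is kept per group/topic/partition outside the consumer process.

```python
class OffsetStoreSketch:
    """In-memory stand-in for wherever committed offsets are kept."""

    def __init__(self):
        self._committed = {}  # (group, topic, partition) -> next offset to read

    def commit(self, group, topic, partition, next_offset):
        self._committed[(group, topic, partition)] = next_offset

    def fetch(self, group, topic, partition, default=0):
        return self._committed.get((group, topic, partition), default)


log = [f"event-{i}" for i in range(250)]  # one hypothetical partition
store = OffsetStoreSketch()


def consume(group, max_records):
    # Ask the store where to start, read a batch, then record the new position.
    start = store.fetch(group, "events", 0)
    batch = log[start:start + max_records]
    store.commit(group, "events", 0, start + len(batch))
    return batch


first = consume("reporting", 100)   # first run: offsets 0..99
second = consume("reporting", 100)  # after a "restart": resumes at offset 100

print(first[0], first[-1])
print(second[0], second[-1])
print(store.fetch("reporting", "events", 0))
```

A different group name would start from offset 0 again, because offsets are tracked per group.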
Regarding load balancing, the volume of messages produced can be too large to be handled by one machine, and you'll probably want to add computing power at some point. Let's say you have a topic with 100 partitions, and to handle this volume of messages you have 10 machines. Several questions arise here:
how should these 10 machines divide partitions between each other?
what happens if one of the machines dies?
what happens if you want to add another machine?
And again, here's where ZooKeeper kicks in: it tracks all consumers in a group, and each high-level consumer subscribes to changes in this group. When a consumer appears or disappears, ZooKeeper notifies all consumers and triggers a rebalance so that they split partitions near-equally (i.e. to balance load). This guarantees that if one consumer dies, the others will continue processing the partitions it owned.
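The three questions above (how to divide, what happens on a death, what happens on an addition) can be worked through with a small sketch. The `rebalance` function is an illustrative round-robin split, not the assignment algorithm Kafka actually ships, and the machine names are made up.

```python
def rebalance(num_partitions, members):
    """Spread partitions over members so that sizes differ by at most one."""
    members = sorted(members)
    assignment = {m: [] for m in members}
    for p in range(num_partitions):
        assignment[members[p % len(members)]].append(p)
    return assignment


def sizes(assignment):
    # Number of partitions each member owns, smallest first.
    return sorted(len(parts) for parts in assignment.values())


machines = [f"machine-{i}" for i in range(10)]

print(sizes(rebalance(100, machines)))                   # ten machines
print(sizes(rebalance(100, machines[:-1])))              # one machine died
print(sizes(rebalance(100, machines + ["machine-10"])))  # one machine added
```

With ten machines each owns 10 partitions; with nine, eight own 11 and one owns 12; with eleven, ten own 9 and one owns 10. In every case all 100 partitions stay covered.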
With Kafka 0.9+ the new Consumer API was introduced. New consumers do not need a connection to ZooKeeper, since group balancing is provided by Kafka itself.
You are right, consumers haven't needed to connect to ZooKeeper since the Kafka 0.9 release. The API was redesigned and a new consumer client was introduced:
the 0.9 release introduces beta support for the newly redesigned
consumer client. At a high level, the primary difference in the new
consumer is that it removes the distinction between the “high-level”
ZooKeeper-based consumer and the “low-level” SimpleConsumer APIs, and
instead offers a unified consumer API.
and
Finally this completes a series of projects done in the last few years
to fully decouple Kafka clients from Zookeeper, thus entirely removing
the consumer client’s dependency on ZooKeeper.