I understand the broker controller is responsible for managing all the brokers in the cluster, and that ZooKeeper helps in identifying the controller.
Is ZooKeeper's responsibility limited to identifying the controller, or does ZooKeeper have more responsibility in managing the cluster?
Secondly, producers and consumers take a broker list to identify the state of the cluster. Why don't producers and consumers interact with ZooKeeper?
Prior to KIP-500 (the removal of ZooKeeper), ZooKeeper maintained the list of topics, their replica placements, and Kafka ACLs, among other details, beyond simply providing leader-election facilities.
Why don't producers and consumers interact with ZooKeeper?
They used to (Kafka clients older than 0.9), but this was removed to ease the maintenance burden and simplify the codebase for a rewritten client library.
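For context, this is roughly what a post-0.9 client configuration looks like: only broker addresses, no ZooKeeper connection string. (Host names here are placeholders.)

```properties
# Modern (0.9+) consumer configuration: brokers only, no zookeeper.connect
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
group.id=example-group
# Offsets are committed to the internal __consumer_offsets topic, not to ZooKeeper
```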
So my question is this: If i have a server running Kafka (And zookeeper), and another machine only consuming messages, does the consumer machine need to run zookeeper too? Or does the server take care of all?
No.
The role of ZooKeeper in Kafka is:
Broker registration: cluster membership, with a heartbeat mechanism to keep the list current
Storing topic configuration: which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and the list of in-sync replicas (ISR) for each partition
Electing the controller: the controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions.
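To make the controller-election point concrete, here is a toy sketch (not real ZooKeeper, just a simulation of its create-ephemeral-node semantics) showing why the first broker to register wins, and why a replacement is elected when its session expires. All names here are illustrative.

```python
# Toy model of ZooKeeper ephemeral-node semantics used for controller election.
class FakeZooKeeper:
    def __init__(self):
        self.ephemeral = {}  # path -> owning broker id

    def create_ephemeral(self, path, owner):
        """Succeeds only if the path does not exist yet (ZK create semantics)."""
        if path in self.ephemeral:
            return False
        self.ephemeral[path] = owner
        return True

    def session_expired(self, owner):
        """An expired session deletes all of that owner's ephemeral nodes."""
        self.ephemeral = {p: o for p, o in self.ephemeral.items() if o != owner}

zk = FakeZooKeeper()
# Brokers 1..3 race to create /controller; only the first create succeeds.
winners = [b for b in (1, 2, 3) if zk.create_ephemeral("/controller", b)]
assert winners == [1]

# Broker 1 dies: its ephemeral node disappears, so broker 3 can take over.
zk.session_expired(1)
assert zk.create_ephemeral("/controller", 3)
```

The ephemeral node is the key mechanism: the controller's claim vanishes automatically when its ZooKeeper session dies, which is what triggers a re-election.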
So ZooKeeper is required only by the Kafka brokers. There is no need to have ZooKeeper on the producer or consumer side.
The consumer does not need ZooKeeper
You have not mentioned which version of Kafka or the clients you're using.
Kafka consumers using 0.8 store their offsets in ZooKeeper, so it is required for them. However, no, you would not need to run ZooKeeper on the consumer machine.
From 0.9 onward, clients no longer need it (unless you want to manage your own external connections to ZooKeeper for storing data).
A Kafka cluster has a controller node and a ZooKeeper cluster, both with their own set of responsibilities. What is the need for the controller when we already have ZooKeeper?
For example: controller election is performed by ZooKeeper, while partition leader election is done by the controller. Why doesn't Kafka use ZooKeeper for partition leader election as well, when it already has the information about which partitions are on which nodes and which nodes are actually alive?
In short, I am struggling to understand the need for the controller despite ZooKeeper being present. It would be really helpful if someone could explain the reason for this design choice and its advantages.
Kafka uses ZooKeeper for a few things:
cluster membership - the live brokers of a cluster are those that hold ephemeral ZK nodes
leader election - election of the Kafka broker that acts as the controller
state storage - some (mostly older) state is stored in ZK, for example topic configuration. Some state that used to be in ZK has been migrated to special topics (consumer offsets), and some newer functionality was written to store state entirely in Kafka (transaction logs, for example).
The general trend is to stop keeping state in ZK and instead self-host it (although older parts of the code have never been migrated out).
As for why not use ZK for partition leader election - one reason is that there is logic involved. When electing the controller broker there is no preference - any broker will do. This fits well with how ZK-based leader election works (the first member to create and own an ephemeral znode wins).
When choosing a partition leader, however, you need a little more logic. For example, you would like to elect the leader with the highest watermark (the most up-to-date data - remember that replication is generally asynchronous). There is also logic around unclean leader election. ZK alone cannot do that, hence it is done by the controller.
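A simplified sketch of that controller-side logic, assuming the usual rule of preferring a live in-sync replica and falling back to an out-of-sync one only when unclean leader election is enabled (function and variable names are illustrative, not Kafka's actual code):

```python
# Simplified partition-leader election: prefer a live replica in the ISR;
# only fall back to a lagging replica when unclean election is allowed.
def elect_leader(assigned_replicas, isr, live_brokers, unclean=False):
    # Pick the first assigned replica that is alive and in sync.
    for broker in assigned_replicas:
        if broker in live_brokers and broker in isr:
            return broker
    if unclean:
        # Unclean election: accept an out-of-sync replica and risk data loss.
        for broker in assigned_replicas:
            if broker in live_brokers:
                return broker
    return None  # no eligible replica: the partition goes offline

# Replicas for the partition live on brokers 1, 2, 3; broker 1 (old leader) died.
print(elect_leader([1, 2, 3], isr={1, 2}, live_brokers={2, 3}))           # 2
print(elect_leader([1, 2, 3], isr={1}, live_brokers={3}))                 # None
print(elect_leader([1, 2, 3], isr={1}, live_brokers={3}, unclean=True))   # 3
```

This kind of conditional preference (ISR membership, liveness, unclean fallback) is exactly the "extra logic" that a plain ZK ephemeral-node race cannot express.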
ZooKeeper works as a coordination service, and Kafka uses it for exactly that purpose.
ZooKeeper is required by Kafka's design, because ZooKeeper has the responsibility of managing the Kafka cluster. It holds the list of all Kafka brokers, and the controller of the cluster is selected via ZooKeeper and recorded there.
Kafka stores only minimal information in ZooKeeper.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-500%3A+Replace+ZooKeeper+with+a+Self-Managed+Metadata+Quorum
Another reason is to protect ZooKeeper. Without a controller, ZooKeeper would need to trigger as many listeners as there are brokers for every change, and most of those notifications would be useless - a potential risk for ZooKeeper. With a controller in place, only the controller interacts with ZooKeeper.
I've configured a cluster of Kafka brokers and a cluster of Zk instances using kafka_2.11-1.1.0 distribution archive.
For Kafka brokers I've configured config/server.properties
broker.id=1,2,3
zookeeper.connect=box1:2181,box2:2181,box3:2181
For Zk instances I've configured config/zookeeper.properties:
server.1=box1:2888:3888
server.2=box2:2888:3888
server.3=box3:2888:3888
I've created a basic producer and a basic consumer, and I don't know why I am able to write and read messages even after I shut down all the ZooKeeper instances while all the Kafka brokers are up and running.
Even booting up new consumers and producers works without any issue.
I thought having a quorum of Zk instances is a vital point for a Kafka cluster.
For both consumer and producer, I've used following configuration:
bootstrap.servers=box1:9092,box2:9092,box3:9092
I thought having a quorum of Zk instances is a vital point for a Kafka cluster.
A ZooKeeper quorum is vital for managing partition lists, leaders, etc. In general, ZooKeeper is necessary for the management work done by the cluster controller.
Basically, right now (with ZK down), you cannot modify topics (as the partition metadata is stored in ZK), start up or shut down brokers (as they use ZK for discovery), or perform other similar operations.
Even booting up new consumers and producers works without any issue.
Producer and consumer operations reach out to brokers only. A broker instance can still append to its log and can still communicate with other brokers for replication. So it is possible to send a message and have it received by the broker and saved to disk, with other brokers replicating it: the followers continuously send fetch requests to the leader, and they know who the partition leader is because they cached that information while ZK was still running.
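A toy sketch of this behavior, assuming a broker that simply keeps a cached partition-to-leader map (all class and method names here are illustrative, not Kafka's actual internals):

```python
# Why produce/consume keeps working with ZooKeeper down: brokers act on
# partition-leader metadata they cached while ZooKeeper was still up.
class Broker:
    def __init__(self, cached_leaders):
        self.leaders = dict(cached_leaders)  # (topic, partition) -> broker id
        self.zk_available = True

    def produce(self, topic, partition):
        # Appending to a log only needs the cached leader, not ZooKeeper.
        leader = self.leaders.get((topic, partition))
        return f"appended via leader {leader}" if leader is not None else "error"

    def create_topic(self, topic):
        # Topic creation must write metadata to ZooKeeper, so it fails.
        return "created" if self.zk_available else "error: ZooKeeper unavailable"

b = Broker({("events", 0): 1})
b.zk_available = False                 # simulate the whole ZK ensemble going down
print(b.produce("events", 0))          # appended via leader 1
print(b.create_topic("new-topic"))     # error: ZooKeeper unavailable
```

The asymmetry is the point: the data path runs on cached state, while the metadata path (topic changes, leader changes, broker membership) is stalled until ZooKeeper returns.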
Does a Kafka broker store the metadata that the producer API uses (e.g. which partitions are leaders for a topic, etc.)? As per my understanding this metadata is stored in ZooKeeper - is that correct? If so, how are brokers updated by ZooKeeper with the latest information?
All Kafka brokers can answer a metadata request that describes the current state of the cluster: what topics there are, which partitions those topics have, which broker is the leader for those partitions etc.
ZooKeeper is responsible for:
Electing a controller broker - and making sure there is only one
Cluster membership - allowing brokers to join a cluster
Topic configuration - which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic
Quotas - how much data is each client allowed to read and write
ACLs - who is allowed to read and write to which topic
There is regular communication between Kafka and ZooKeeper such that ZooKeeper knows a Kafka broker is still alive (ZooKeeper heartbeat mechanism) and also in response to events such as a topic being created or a replica falling out of sync for a topic-partition.
Kafka is a distributed system and is built to use Zookeeper which is responsible for controller election, topic configuration, clustering etc.
More precisely, ZooKeeper initiates controller election. The controller broker is a single broker in the Kafka cluster which takes care of the leader broker and followers for every partition. When a particular broker is taken down, the controller notifies the other replicas (in order to handle partition leadership, etc.). Moreover, when the controller fails, ZooKeeper initiates a new election to choose the broker that will act as the new controller.
Furthermore, ZooKeeper knows which brokers are part of the Kafka cluster and which are still alive. Similarly, it is also aware of topic-specific information such as which topics exist, how many partitions each has, where the replicas are, and so on.
Zookeeper also stores information regarding quotas and ACLs, i.e. what volume of data each client is allowed to consume/produce and also, who is allowed to consume or produce from a particular topic.
This is a follow-up question to an earlier discussion. I think of Zookeeper as a coordinator for instances of the Kafka broker, or "message bus". I understand why we might want producer/consumer clients transacting through Zookeeper -- because Zookeeper has built-in fault-tolerance as to which Kafka broker to transact with. But with the new model -- ie, 0.10.1+ -- should we always bypass Zookeeper altogether in our producer/consumer clients? Are we giving up any advantages (Eg, better fault-tolerance) by doing that? Or is Zookeeper ultimately still at work behind the scenes?
To add to the answer of Hans Jespersen, recent Kafka producer/consumer clients (0.9+) do not interact with ZooKeeper anymore.
Nowadays ZooKeeper is only used by the Kafka brokers (i.e., the server-side of Kafka). This means you can e.g. lock-down external access from clients to all ZooKeeper instances for better security.
I understand why we might want producer/consumer clients transacting through Zookeeper -- because Zookeeper has built-in fault-tolerance as to which Kafka broker to transact with.
Producer/consumer clients are not "transacting" through ZooKeeper, see above.
But with the new model -- ie, 0.10.1+ -- should we always bypass Zookeeper altogether in our producer/consumer clients?
If the motivation of your question is that you want to implement your own Kafka producer or consumer client, then the answer is: your custom client should not use ZooKeeper any longer. The official Kafka producer/consumer clients (Java/Scala), or e.g. Confluent's C/C++, Python, or Go clients for Kafka, demonstrate how scalability, fault tolerance, etc. can be achieved by leveraging Kafka functionality (rather than having to rely on a separate service such as ZooKeeper).
Are we giving up any advantages (Eg, better fault-tolerance) by doing that? Or is Zookeeper ultimately still at work behind the scenes?
No, we are not giving up any advantages here. Otherwise the Kafka project would not have changed its producer/consumer clients to stop using ZooKeeper and start using Kafka themselves for their inner workings.
ZooKeeper is only still at work behind the scenes for the Kafka brokers, see above.
ZooKeeper is still at work behind the scenes, but 0.9+ clients don't need to worry about it anymore because consumer offsets are now stored in a Kafka topic rather than in ZooKeeper.