This is a follow-up question to an earlier discussion. I think of ZooKeeper as a coordinator for instances of the Kafka broker, or "message bus". I understand why we might want producer/consumer clients transacting through ZooKeeper -- because ZooKeeper has built-in fault-tolerance as to which Kafka broker to transact with. But with the new model -- i.e., 0.10.1+ -- should we always bypass ZooKeeper altogether in our producer/consumer clients? Are we giving up any advantages (e.g., better fault-tolerance) by doing that? Or is ZooKeeper ultimately still at work behind the scenes?
To add to Hans Jespersen's answer: recent Kafka producer/consumer clients (0.9+) no longer interact with ZooKeeper.
Nowadays ZooKeeper is only used by the Kafka brokers (i.e., the server side of Kafka). This means you can, for example, lock down external client access to all ZooKeeper instances for better security.
I understand why we might want producer/consumer clients transacting through ZooKeeper -- because ZooKeeper has built-in fault-tolerance as to which Kafka broker to transact with.
Producer/consumer clients are not "transacting" through ZooKeeper, see above.
But with the new model -- i.e., 0.10.1+ -- should we always bypass ZooKeeper altogether in our producer/consumer clients?
If the motivation behind your question is that you want to implement your own Kafka producer or consumer client, then the answer is: your custom client should not use ZooKeeper any longer. The official Kafka producer/consumer clients (Java/Scala) and e.g. Confluent's C/C++, Python, and Go clients for Kafka demonstrate how scalability, fault-tolerance, etc. can be achieved by leveraging Kafka functionality itself (rather than having to rely on a separate service such as ZooKeeper).
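As a minimal sketch (the broker address and topic name below are placeholders), this is all a modern Java producer needs -- a broker list via `bootstrap.servers`, with no ZooKeeper connection string anywhere:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class NoZkProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Only a broker list is needed; there is no ZooKeeper setting at all.
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "my-topic" is a placeholder topic name.
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}
```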
Are we giving up any advantages (e.g., better fault-tolerance) by doing that? Or is ZooKeeper ultimately still at work behind the scenes?
No, we are not giving up any advantages here. Otherwise the Kafka project would not have changed its producer/consumer clients to stop using ZooKeeper and to use Kafka itself for their inner workings.
ZooKeeper is only still at work behind the scenes for the Kafka brokers, see above.
ZooKeeper is still at work behind the scenes, but the 0.9+ clients don't need to worry about it anymore because consumer offsets are now stored in a Kafka topic rather than in ZooKeeper.
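To illustrate (a sketch; the broker address, group id, and topic name are placeholders), with a modern Java consumer the offset commit below goes to Kafka's internal __consumer_offsets topic on the brokers, not to a ZooKeeper znode:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NoZkConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // brokers, not ZooKeeper
        props.put("group.id", "my-group");                // placeholder group id
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("%s -> %s%n", record.key(), record.value());
            }
            // This commit is written to the internal __consumer_offsets topic
            // on the brokers, not to a ZooKeeper znode as in pre-0.9 clients.
            consumer.commitSync();
        }
    }
}
```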
Related
I understand the Broker Controller is responsible for managing all the brokers in the cluster. As per my understanding, ZooKeeper helps in identifying the Controller.
Is the responsibility of ZooKeeper limited to identifying the Controller, or does ZooKeeper have more responsibilities in the management of the cluster?
Secondly, the Producers / Consumers take the broker list to identify the state of the cluster, so why don't Producers / Consumers interact with ZooKeeper?
Prior to KIP-500 (the removal of ZooKeeper), ZooKeeper maintained the list of topics, their replica placements, and Kafka ACLs, among other details, beyond simply providing leader-election facilities.
why don't Producers / Consumers interact with ZooKeeper?
They used to (in Kafka clients older than 0.9), but this dependency was removed to ease the maintenance burden and simplify the codebase for the rewritten client library.
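For example, a client can discover the whole cluster state from nothing but a broker list, which is why client access to ZooKeeper is unnecessary. A sketch using the Java AdminClient (the broker address is a placeholder):

```java
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;

public class ClusterInfo {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        // One reachable broker is enough to bootstrap the full cluster metadata.
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            System.out.println("Nodes:      " + cluster.nodes().get());
            System.out.println("Controller: " + cluster.controller().get());
        }
    }
}
```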
I'm new to Kafka.
Kafka is supposed to be used as a distributed service, but the tutorials and blog posts I found online never mention whether there should be one or several ZooKeeper nodes.
The tutorials just spin up one ZooKeeper instance and then multiple Kafka brokers.
Is that how it is supposed to be done?
ZooKeeper is a centralized coordination service for distributed systems; clusters use it to maintain the distributed system as a whole. The distributed synchronization it provides is achieved via metadata such as configuration information, naming, and so on.
In typical architectures, a Kafka cluster is served by 3 ZooKeeper nodes. If the deployment is very large, this can be ramped up to 5 ZooKeeper nodes, but that in turn adds load on the ensemble, since all nodes must stay in sync and all metadata-related activity is handled by ZooKeeper.
Also note that, as an improvement, newer Kafka releases reduce the dependency on ZooKeeper in order to make metadata handling more scalable, reduce the complexity of maintaining metadata in an external component, and speed up recovery from unexpected shutdowns. With the new approach, controller failover is almost instantaneous. This is achieved by the Kafka Raft metadata mode, termed 'KRaft', which runs Kafka without ZooKeeper by moving all the responsibilities previously handled by ZooKeeper into a service within the Kafka cluster itself, operating on the event-driven mechanism used by the KRaft protocol.
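For illustration, here is a sketch of what a KRaft-mode server.properties might look like, with all ZooKeeper settings gone and a Raft controller quorum in their place (node ids, hostnames, and ports are placeholders):

```
# server.properties in KRaft mode: no zookeeper.connect at all.
# Node ids, hostnames, and ports below are placeholders.
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@kraft1:9093,2@kraft2:9093,3@kraft3:9093
listeners=PLAINTEXT://:9092,CONTROLLER://:9093
controller.listener.names=CONTROLLER
inter.broker.listener.name=PLAINTEXT
```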
Tutorials generally keep things nice and simple, so one ZooKeeper (often one Kafka broker too). Useful for getting started; useless for any kind of resilience :)
In practice, you are going to need three ZooKeeper nodes minimum.
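For illustration, a sketch of the corresponding broker setting (hostnames are placeholders); each broker lists the full ensemble so that losing a single ZooKeeper node does not cut the broker off:

```
# server.properties (ZooKeeper mode); hostnames are placeholders.
# Listing the full ensemble lets the broker fail over between ZooKeeper nodes.
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
```

A three-node ensemble keeps its quorum (2 of 3) through the loss of any single node, which is why three is the practical minimum.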
If it helps, here is an enterprise reference architecture whitepaper for the deployment of Apache Kafka
Disclaimer: I work for Confluent, who publish the above whitepaper.
While I was creating a cluster setup for Kafka, I came to learn that a ZooKeeper quorum setup is needed for coordination between Kafka brokers.
Are there any other real-world scenarios where ZooKeeper is used, other than for a Kafka setup?
This link lists many applications and organisations using ZooKeeper
https://zookeeper.apache.org/doc/r3.6.2/zookeeperUseCases.html
ZooKeeper is used with many Apache projects and is a distributed coordination service used to manage a large set of hosts. In simple terms, ZooKeeper lets workers get on with their jobs and handles all the other complexities: if a leader goes down, alerting the workers, electing a new leader, and so on.
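As a concrete illustration of the leader-election part (a sketch, not Kafka's own implementation; the connect string and latch path are placeholders), here is how an application might use Apache Curator's LeaderLatch recipe on top of ZooKeeper:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        // Ensemble connect string and latch path are placeholders.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",
                new ExponentialBackoffRetry(1000, 3)); // retry policy for ZK hiccups
        client.start();

        try (LeaderLatch latch = new LeaderLatch(client, "/my-app/leader")) {
            latch.start();
            latch.await(); // blocks until this process wins the election
            System.out.println("Elected leader; if this process dies, "
                    + "ZooKeeper notifies the others and a new leader is elected.");
        } finally {
            client.close();
        }
    }
}
```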
I am studying the internals of Apache Kafka and how it works.
The Kafka brokers deal with requests from multiple producers and consumers.
I want to know how Kafka schedules those requests:
- Is it First-Come-First-Served (FCFS) or Processor Sharing (PS)?
- Do producers have higher priority than consumers?
The official Kafka documentation has no explanation of this.
Can anyone give me an idea on this?
Thanks,
There is a TCP connection per client at the broker (a client can be a consumer or a producer, and there can be any number of producers and/or consumers).
The way CPU resources are shared between different connections is not a property controlled by Kafka. It depends on the OS your broker is running on: specifically, the scheduler implementation of your OS (which decides how processes are scheduled on cores) determines this.
If the scheduler is FCFS, this may very well be FCFS. More generally, the scheduler implementation in most operating systems is some version of a Multi-Level Feedback Queue.
Thus, this has got nothing to do with Kafka.
I notice that when sending messages to Kafka (as a producer) the samples show connecting to port 9092 -- writing directly to a broker. When consuming, the examples show connecting to port 2181, presumably ZooKeeper.
The latter makes sense -- I want to read from "the cluster", letting ZooKeeper figure out which broker the client should communicate with and handle such things as knowing who's alive/dead in the cluster.
Why wouldn't publishes/writes work the same way, i.e. write to "the cluster" (via ZooKeeper)?
Am I understanding this correctly -- that for producing I'm bypassing ZooKeeper (cluster knowledge) and must know the broker nodes (and presumably figure out what to do if one fails)?
The "high level consumer" of Kafka uses Zookeeper to keep track of which partitions each member in a consumer group is consuming and sometimes to track which offsets were read in which partition. Since access to Zookeeper is required, we may as well use it to figure out where are the brokers...
In the new consumer (coming soon in the next release), ZooKeeper is no longer needed, and consumers connect directly to brokers, just like producers currently do.
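A minimal sketch of the configuration difference (addresses and group id are placeholders) -- the old high-level consumer pointed at ZooKeeper on 2181, while the new consumer points straight at the brokers on 9092:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BrokerVsZkConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Old high-level consumer (pre-0.9) pointed at ZooKeeper:
        //   props.put("zookeeper.connect", "localhost:2181");
        // The new consumer points straight at the brokers, like a producer does:
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "my-group");                // placeholder group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            System.out.println("Connected via broker list; no ZooKeeper involved.");
        }
    }
}
```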