Zookeeper on Dataproc - apache-zookeeper

I could use some advice about setting up a Zookeeper ensemble on Dataproc.
The scenario at hand is a project that will have 3 long-running Dataproc clusters, and many ephemeral clusters that will be dynamically created per job.
I would like to have an ensemble of at least 7 Zookeeper nodes but haven't found any documentation on how to get it done.
I know that I can add the Zookeeper component to each Dataproc cluster, but how can I create an ensemble comprised of several clusters?

As you mentioned, you can activate Zookeeper on Dataproc with the help of the Zookeeper component.
Each Dataproc cluster with the Zookeeper component will initialize an independent 3-node Zookeeper cluster.
If you want a single Zookeeper ensemble that spans multiple Dataproc clusters, you need to configure it manually. It should be possible to join the Zookeeper nodes of multiple Dataproc clusters into one ensemble by setting Dataproc cluster properties with the zookeeper: prefix.
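Whatever mechanism ends up writing the configuration, every member of a spanning ensemble needs the same list of server.N entries in its zoo.cfg. A minimal sketch of generating that list is below. The hostnames are hypothetical placeholders, 2888 and 3888 are only the conventional peer and election ports, and nothing here confirms which zookeeper: properties Dataproc accepts for this.

```python
def zoo_cfg_server_lines(hosts, peer_port=2888, election_port=3888):
    """Return zoo.cfg 'server.N=host:peer_port:election_port' lines.

    N starts at 1 and follows the order of `hosts`.
    """
    return [
        f"server.{server_id}={host}:{peer_port}:{election_port}"
        for server_id, host in enumerate(hosts, start=1)
    ]


# Hypothetical hostnames: 7 nodes taken from the 3 long-running clusters.
ensemble_hosts = [
    "long-a-node-0", "long-a-node-1", "long-a-node-2",
    "long-b-node-0", "long-b-node-1",
    "long-c-node-0", "long-c-node-1",
]

for line in zoo_cfg_server_lines(ensemble_hosts):
    print(line)
```

Each node also needs a myid file whose content matches its own N, and the ephemeral clusters would then act as clients of this ensemble rather than run their own.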

Related

consolidate and migrate multiple kafka clusters to 1 cluster

I have 2 on-prem Kafka clusters in 2 environments, dev and test (they have the same topic names). Now I want to consolidate them into only one cluster (AWS MSK). I would like my new Kafka cluster to have the topics of both environments, differentiated by a prefix in their names. Example: dev_topicA, test_topicA. Is that possible?
It is possible with MirrorMaker2:
run one MirrorSourceConnector with the property source.cluster.alias=dev and the source cluster bootstrap servers pointing to the dev cluster
run another MirrorSourceConnector with the property source.cluster.alias=test and the source cluster bootstrap servers pointing to the test cluster
in both connectors, use your MSK cluster as the target cluster bootstrap servers
The default replication policy prefixes each replicated topic name with the source cluster alias.
The default separator is "." (a period). If you would like to have _ as in your example, override the default separator: replication.policy.separator=_
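To make the naming concrete, here is a small sketch of how the target topic names come out under the default policy (alias, then separator, then the original topic name). The function name is made up for illustration and is not part of MirrorMaker2.

```python
def remote_topic_name(source_alias, topic, separator="."):
    """Mimic the default naming: <source alias><separator><topic>."""
    return f"{source_alias}{separator}{topic}"


# With replication.policy.separator=_ on both connectors:
for alias in ("dev", "test"):
    print(remote_topic_name(alias, "topicA", separator="_"))
# prints:
# dev_topicA
# test_topicA
```

With the default separator the same topics would land as dev.topicA and test.topicA.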

How to set kafka schema registry cluster

I have set up a Zookeeper cluster and a Kafka broker cluster. I want to set up a multi-node Schema Registry cluster for failover.
The Zookeeper cluster has 3 nodes.
The Kafka broker cluster has 3 nodes.
Could you please describe the detailed steps to set up multiple Schema Registry instances?
I am using Confluent version 5.0.
Schema Registry is designed to work as a distributed service with a single-master architecture, so at any given time there is only one master and the other nodes forward write requests to it. See the Schema Registry architecture documentation for details.
You can run a 3-node Schema Registry cluster (on the same nodes as Zookeeper/Kafka if you like). As you are using Confluent 5.0, you can start it with the Confluent CLI:
confluent start schema-registry
Update the schema-registry.properties,
#zookeeper urls
kafkastore.connection.url=zookeeper-1:2181,zookeeper-2:2181,...
#make every node eligible to become master for failover
master.eligibility=true
On the consumer and producer side, pass the list of Schema Registry URLs in the consumer and producer properties:
props.put("schema.registry.url", "http://schemaregistry-1:8081,http://schemaregistry-2:8081,http://schemaregistry-3:8081");
By default, Schema Registry listens on port 8081.
Hope this helps.

multiple kafka clusters on single zookeeper ensemble

I currently have a 3 node Kafka cluster which connects to base chroot path in my zookeeper ensemble.
zookeeper.connect=172.12.32.123:2181,172.11.43.211:2181,172.18.32.131:2181
Now, I want to add a new 5 node Kafka cluster which will connect to some other chroot path in the same zookeeper ensemble.
zookeeper.connect=172.12.32.123:2181,172.11.43.211:2181,172.18.32.131:2181/cluster/2
Will these configurations work, given the relative paths of the two chroots (one cluster at the root, the other at /cluster/2)? I understand that the original Kafka cluster should have been connected on some path other than the base chroot path for better isolation.
Also, is it good to share the same Zookeeper ensemble across Kafka clusters? The documentation says that it is generally better to have isolated Zookeeper ensembles for different clusters.
If you're limited to a single Zookeeper cluster, then it should work out fine with a unique chroot that doesn't collide with the other cluster's znodes.
It is not "good" to share, no, because Zookeeper losing quorum takes both Kafka clusters down, but again, if you're limited on hardware, it will still work.
Note: You can only afford to lose one ZK server with 3 nodes in the cluster, which is why a cluster of 5 is recommended.
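For reference, the chroot is a single suffix after the whole host list, not something attached to each host. A rough sketch of how the two zookeeper.connect values from the question break down (the helper name is made up, and IPv6 addresses are not handled):

```python
def split_zookeeper_connect(connect):
    """Split a zookeeper.connect value into (list of host:port, chroot path).

    The chroot starts at the first '/' and applies to every host in the list.
    A value without '/' uses the root path.
    """
    slash = connect.find("/")
    if slash == -1:
        return connect.split(","), "/"
    return connect[:slash].split(","), connect[slash:]


existing = "172.12.32.123:2181,172.11.43.211:2181,172.18.32.131:2181"
new = "172.12.32.123:2181,172.11.43.211:2181,172.18.32.131:2181/cluster/2"

hosts_a, chroot_a = split_zookeeper_connect(existing)
hosts_b, chroot_b = split_zookeeper_connect(new)

print(chroot_a)            # /
print(chroot_b)            # /cluster/2
print(hosts_a == hosts_b)  # True: same ensemble, different paths
```

Since the existing cluster sits at the root, the new chroot is a subtree of the path the first cluster already uses, so it is worth checking that the chosen path does not overlap any znode the first cluster creates.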

2 clusters of zookeeper servers in hadoop+kafka cluster - is it possible?

We have a Kafka cluster with the following details:
3 Kafka machines
3 Zookeeper servers
We also have a Hadoop cluster that includes datanode machines.
All applications use the Zookeeper servers, including the Kafka machines.
Now we want to make the following changes:
We want to add 3 additional Zookeeper servers that will form a separate cluster.
Only the Kafka machines will use these additional Zookeeper servers.
Is it possible?
Editing the ha.zookeeper.quorum in Hadoop configurations to be separate from zookeeper.connect in Kafka configurations, such that you have two individual Zookeeper clusters, can be achieved, yes.
However, I don't think Ambari or Cloudera Manager, for example, allow you to view or configure more than one Zookeeper cluster at a time.
Yes, that's possible. Kafka uses Zookeeper to perform various distributed coordination tasks, such as deciding which Kafka broker is responsible for allocating partition leaders, and to store metadata about the topics in the brokers.
After stopping Kafka, you can copy the original Zookeeper cluster's data to the new cluster with a tool such as zkcopy, a Zookeeper data transfer utility.
If you cannot stop the Kafka cluster, you will need to plan how to transfer the Zookeeper data to the additional Zookeeper servers while Kafka keeps running.

How to scale Zookeeper with kafka

I am working on scaling the Kafka cluster in prod. Confluent provides an easy way to add Kafka brokers. However, how do I know how to scale Zookeeper along with Kafka? What should the ratio be? Right now we have 5 Zookeeper nodes for 5 Kafka brokers. If I have 10 Kafka brokers, how many Zookeeper nodes should there be?
Zookeeper acts as a coordination service for Apache Kafka and stores the metadata of the Kafka cluster. A Zookeeper cluster is called an ensemble.
An ensemble should have an odd number of servers (3, 5, etc.). The number determines how fault tolerant the cluster is, because a majority of the servers must be up for the ensemble to serve requests. A three-node ensemble can run with one node missing.
A five-node ensemble can run with two nodes missing and still be available.
You can add as many Zookeeper servers as you need for the system to stay functional in spite of failures; however, a Zookeeper cluster of more than 7 nodes is not recommended because of the latency overhead and the extra communication between the nodes. The ensemble size depends on the fault tolerance you need, not on the number of Kafka brokers.
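The fault tolerance numbers above follow from the majority rule, which can be sketched as follows (function names are just for illustration):

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} down")
# 3 nodes: quorum 2, tolerates 1 down
# 4 nodes: quorum 3, tolerates 1 down
# 5 nodes: quorum 3, tolerates 2 down
# 6 nodes: quorum 4, tolerates 2 down
# 7 nodes: quorum 4, tolerates 3 down
```

This is also why even sizes are avoided: going from 3 to 4 or from 5 to 6 adds a server without letting you lose one more.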