Roles of ZooKeeper in SolrCloud - apache-zookeeper

I am new to SolrCloud (4.x). Can anybody explain in detail the roles and responsibilities of ZooKeeper in SolrCloud?
Also, how does ZooKeeper work with regard to search/add requests to Solr?

ZooKeeper is the central repository for SolrCloud configuration. You can think of it as a distributed filesystem that all Solr nodes in the cluster can access. So if you change any config file, you only need to upload it to ZooKeeper, not to every node in the cluster.
Another important responsibility of ZooKeeper is keeping an eye on the state of all Solr nodes in the cluster. If a node goes down and a search request comes in for that node, the cluster state kept in ZooKeeper is used to route the request to an alternative replica node.
Likewise, when you update a document in SolrCloud, the cluster state in ZooKeeper is what lets your update request be delegated to the appropriate node in the cloud holding that document.
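To make the "state of all Solr nodes" part concrete, here is a minimal sketch (not from the original answer) that uses the plain ZooKeeper Java client to list the live Solr nodes SolrCloud registers as ephemeral children of the /live_nodes znode; the connection string and timeout are placeholders for your ensemble:

```java
import org.apache.zookeeper.ZooKeeper;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class LiveNodes {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Placeholder connection string; point it at your ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                event -> connected.countDown());
        connected.await();
        // Each live Solr node registers an ephemeral child under /live_nodes;
        // when a node dies, its znode disappears and the cluster state changes.
        List<String> liveNodes = zk.getChildren("/live_nodes", false);
        liveNodes.forEach(System.out::println);
        zk.close();
    }
}
```

Solr's own SolrJ CloudSolrClient does essentially this internally: it watches the cluster state in ZooKeeper to decide which node should receive each request.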
For in-depth details you should read this:
https://cwiki.apache.org/confluence/display/ZOOKEEPER/ProjectDescription

Related

How to handle failure scenario for Kafka and Zookeeper in Kubernetes

What I have: a Zookeeper setup running on server1, server2 and server3, and similarly Kafka also running on server1, server2 and server3.
The setup is running in Kubernetes.
Problem statement:
1. If one Zookeeper server goes down, will the entire setup go down, because Kafka depends on Zookeeper? Am I right?
2. If Q1 is correct - is there any way to build the setup so that Kafka keeps running even if one Zookeeper server goes down?
3. How do I expose the Kafka port in a Kubernetes setup?
4. What is the recommended way to persist data in Kubernetes for a production server?
I fail to see how the Zookeeper questions are related to k8s... But you should definitely set affinity rules such that Zookeeper and Kafka are not on the same physical servers or sharing the same disks.
If one Zookeeper out of three goes down, the ensemble still has a majority (two of three) and keeps working; it is only when a second one fails that quorum is lost, no leader can be elected, and the ensemble stops serving requests. Losing quorum effectively can crash or corrupt Kafka, yes.
To mitigate that risk, you can choose to run 5 Zookeepers, in which case it takes losing 3 servers to reach the same state. Kafka: The Definitive Guide covers these concepts in the first few chapters.
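The majority arithmetic is easy to check for yourself; here is a tiny sketch (plain Java, no Kafka or Zookeeper APIs involved) of how many failures an ensemble of n servers tolerates:

```java
// Quorum arithmetic for a ZooKeeper ensemble of n servers:
// a majority of n/2 + 1 (integer division) must be alive, so the
// ensemble tolerates n - (n/2 + 1) failures before losing quorum.
public class QuorumMath {
    public static void main(String[] args) {
        for (int n : new int[]{1, 2, 3, 5, 7}) {
            int quorum = n / 2 + 1;
            System.out.printf("ensemble=%d quorum=%d tolerated failures=%d%n",
                    n, quorum, n - quorum);
        }
    }
}
```

For n=3 that prints a tolerance of 1 failure, and for n=5 a tolerance of 2, which is why even-sized ensembles buy you nothing over the next smaller odd size.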
Regarding the other questions - NodePorts and PVCs, generally speaking.
Use one of the popular Kafka Operators on GitHub and you won't need to think too hard about setting those properties.
You still must manually perform Kafka admin tasks in any installation... You can use extra services like Cruise Control if you want to reduce that workload, though

Building a Kafka Cluster using two servers only

I'm planning to build a Kafka Cluster using two servers, and host Zookeeper on these two servers as well.
The question is: since Kafka requires Zookeeper to run, what is the best Zookeeper cluster layout for implementing a Kafka cluster on two servers?
For example, I'm currently running two Zookeepers on both servers and one Kafka broker on each server, and in the Kafka configuration they point to all Zookeepers.
Is there a better way to do this?
First of all, you don't have to set up Zookeeper and Kafka on the same servers. One of the roles of Zookeeper is electing the controller (the broker responsible for maintaining the leader/follower relationship for all the partitions). For the election, a majority of Zookeeper nodes must be alive. In your case, if even one Zookeeper instance is down, you cannot elect a controller, so in terms of fault tolerance there is no difference between having one Zookeeper or two. That's why it is recommended to have at least 3 nodes in a Zookeeper ensemble; that way you can handle the failure of one Zookeeper node.
In addition to this, it is highly recommended to have at least three brokers in your Kafka cluster to maintain both consistency and high availability. (link1, link2)
UPDATE:
As long as you are limited to only two servers, you can consider sacrificing some high availability by setting min.insync.replicas=2 on your brokers and creating topics with replication.factor=2. If availability is more important to you than avoiding data loss, you can instead keep the min.insync.replicas=1 (default) broker config, again with topic replication.factor=2. In this circumstance, those are your options IMHO. (Having one or two Zookeepers is not important, as I mentioned above.)
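As a concrete illustration of the first option, here is a minimal sketch using Kafka's Java AdminClient to create such a topic; the broker addresses, topic name, and partition count are placeholders:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateTwoReplicaTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder addresses for the two brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "server1:9092,server2:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // replication.factor=2 keeps a copy on each broker.
            // min.insync.replicas=2 makes acks=all writes fail if either broker
            // is down (consistency over availability); set it to 1 to favor
            // availability at the risk of losing the unreplicated writes.
            NewTopic topic = new NewTopic("events", 3, (short) 2)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```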
I am often faced with the same problem as you, #frisky5, where I would like to achieve a "suboptimal" HA system using only 2 nodes, and thus workarounds are always needed with cloud-native frameworks that rely on the assumption that clusters will have lots of nodes available.
That ain't always the case in real life, is it ;) ?
That being said, I see you as essentially having 2 options:
1. Externalize the zookeeper configuration on a replicated storage system using 2 nodes (e.g. DRBD)
2. Replicate the Kafka data volumes entirely on the second node and use 2 one-node Kafka clusters that you switch on and off depending on which is the current master node.
I would go for the first option. In that case you would have 2 Kafka servers and one zookeeper server whose IP needs to be static (a virtual IP). When the zookeeper node goes down, it is restarted on the second node with the same VIP, but it needs access to the synchronized data folder.
I am not too familiar with zookeeper's internals and I can't tell you whether it will run into conflicts when starting up on a data store that "wasn't its own", but I would guess it makes sense for you to test it with a simple rsync setup.
Another way to achieve consensus, if you are using a k3s-based Kubernetes cluster, would be to rely on the internal k8s distributed consensus mechanics to "tell Kafka" which node is the leader. This works for the postgres operator by CrunchyData because Patroni is cool ( https://patroni.readthedocs.io/en/latest/kubernetes.html ) 😎 but I am not sure whether Kafka/zookeeper are that flexible and can communicate with a REST API to set their locks...
Once you have achieved this intermediate step, you can use a PostgreSQL db as the external source of truth for k3s, and then it is as simple as syncing the postgres data folder between the machines (easily done with rsync). The beauty of this approach is that it is far more generic and could be used for other systems too.
Let me know what you think about these two approaches and whether you manage to set up a test environment. If you do, I can help you out with the implementation on GitHub.

Can I query any zookeeper node to get any data?

I have a small zookeeper cluster of 3 nodes. I also have another piece of software that needs to be configured to talk to zookeeper, also running in a cluster of 3 nodes on the same hosts.
I don't know anything about how zookeeper works. Do I have to configure this other software to talk to all hosts, or should it work to just configure it to talk to the localhost zookeeper?
Put another way, can I query any zookeeper node to get any data?
Yes. With a ZooKeeper cluster, you can query any ZooKeeper node and get eventually consistent data.
For how ZooKeeper works you can check this awesome post here: Explaining Apache ZooKeeper
A lot of good projects use ZooKeeper as a backbone: HBase, Kafka, etc. Google them and learn from those projects for more detail.
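As a minimal sketch with the ZooKeeper Java client (hosts and the znode path are placeholders), the connection string lists all three nodes; the client picks one to connect to, any of them can answer reads, and it fails over to another if its server goes away:

```java
import org.apache.zookeeper.ZooKeeper;
import java.util.concurrent.CountDownLatch;

public class AnyNodeRead {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // The client connects to one server from this list (any of them works)
        // and transparently reconnects to another on failure.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
                event -> connected.countDown());
        connected.await();
        // Reads are answered by the connected server from its local state,
        // which is why the data is eventually (not strongly) consistent.
        byte[] data = zk.getData("/my/znode", false, null); // placeholder path
        System.out.println(new String(data));
        zk.close();
    }
}
```

If a read absolutely must reflect the latest write, ZooKeeper's sync() call asks the connected server to catch up with the leader before you read.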

How to migrate Kafka from old Zookeeper cluster to new Zookeeper cluster with different znode parent path

I have a three-node Kafka cluster in service running on a separate three-node Zookeeper cluster. I intend to switch Kafka to use a new five-node Zookeeper cluster, and although I have found information about doing that, I have an extra wrinkle where Kafka will be using a custom znode parent path on the new cluster.
For instance, my current Kafka Zookeeper string looks something like this:
192.0.2.11:2181,192.0.2.12:2181,192.0.2.13:2181
I'm looking to switch it to this:
192.0.2.21:2181,192.0.2.22:2181,192.0.2.23:2181,192.0.2.24:2181,192.0.2.25:2181/kafka/uid1
The reason for this is that we intend to reuse the larger Zookeeper cluster for other Kafka clusters. Don't worry, this is for testing and not production. However, we still want to do this without losing any data on the stream that is coming into Kafka, so we want to do this without taking anything down.
Is this possible?
I have come across the following questions:
Copy/Migrate old zookeeper znode/data to new zookeeper
best way to copy data across 2 zookeeper cluster?
Unfortunately they appear to require some downtime, which I'm hoping to avoid.
This page (https://qgraph.io/blog/migrating-kafka-zookeeper-cluster/) was a little more helpful in the way of rollover, but not with znode migration.
I've been looking for 'znode symlinks' or 'specifying znode path per zookeeper server', but neither seems possible. Am I out of luck, facing downtime and possibly lost data?
From what I can tell, there is no way to move Kafka's parent znode without restarting Kafka. There is no such thing as a hard or soft link for znodes: https://www.igvita.com/2010/04/30/distributed-coordination-with-zookeeper/
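The data-migration half can at least be scripted. Here is a minimal sketch of a recursive znode copy with the ZooKeeper Java client; the paths come from the question, the target parent znodes (/kafka/uid1) are assumed to already exist, ephemeral nodes such as live broker registrations are skipped rather than copied, and none of this removes the need to restart Kafka against the new connect string:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeCopy {
    // Recursively copy the subtree at 'from' on the source ensemble
    // to 'to' on the destination ensemble (persistent nodes only).
    static void copy(ZooKeeper src, ZooKeeper dst, String from, String to) throws Exception {
        if (src.exists(from, false).getEphemeralOwner() != 0) {
            return; // skip ephemeral znodes (e.g. live broker registrations)
        }
        byte[] data = src.getData(from, false, null);
        dst.create(to, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        for (String child : src.getChildren(from, false)) {
            copy(src, dst, from + "/" + child, to + "/" + child);
        }
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper src = new ZooKeeper("192.0.2.11:2181", 15000, e -> {});
        // Parent znodes /kafka and /kafka/uid1 must exist on the target first.
        ZooKeeper dst = new ZooKeeper("192.0.2.21:2181", 15000, e -> {});
        copy(src, dst, "/brokers", "/kafka/uid1/brokers");
        src.close();
        dst.close();
    }
}
```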

Why do we need to add all zookeeper nodes in Kafka Consumer Configuration

Looks like we need to add the IP addresses of all zookeeper nodes in the property "zookeeper.connect" when configuring a consumer.
Now my understanding says the zookeeper cluster has a leader which is managed in a fail-safe way.
So why can't we just provide a bootstrap list of zookeeper nodes, like it's done in the Producer configuration (with the bootstrap broker list), and have them provide metadata about the entire zookeeper cluster?
You can specify a subset of the nodes. The nodes in that list are only used to make an initial connection to the cluster, and the client goes through the list until a connection is made. Usually the first node is up and available, so the client doesn't have to go too far into the list. You only need to add extra nodes to the list depending on how pessimistic you are.
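In other words, zookeeper.connect already behaves like a bootstrap list. A minimal sketch of the old (pre-0.9) high-level consumer configuration, with placeholder hosts and group id:

```java
import java.util.Properties;

public class OldConsumerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Any subset of the ensemble works as a contact list; the client tries
        // each host in turn until one connection succeeds, so listing all three
        // just adds redundancy.
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
        props.put("group.id", "my-group"); // placeholder consumer group
        props.put("zookeeper.session.timeout.ms", "6000");
        // ... pass props to the old high-level consumer constructor.
    }
}
```

Note that this only applies to the old consumer: since Kafka 0.9 the new consumer talks exclusively to brokers via bootstrap.servers and does not need a ZooKeeper connection at all.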