I'm looking to start using Kafka for a system and I'm trying to cover all use cases.
Normally it would be run as a cluster of brokers running on virtual servers (replication factor 3-5). but some customers though don't care about resilience and a broker failure needing a manual reboot of the whole system is fine with them, they just care about hardware costs.
So my question is, are there any issues with using Kafka as a single broker system for small installations with low throughput?
Cheers
It's absolutely OK to use a single Kafka broker. Note, however, that with a single broker you won't have a highly available service meaning that when the broker fails you will have a downtime.
Your replication-factor will be limited to 1 and therefore all of the partitions of a topic will be stored on the same node.
For a proof-of-concept or non-critical dev work, a single node cluster works just fine. However having a cluster has multiple benefits. It's okay to go with a single node cluster if the following are not important/relevant for you.
scalability [spreads load across multiple brokers to maintain certain throughput]
fail-over [guards against data loss in case one/more node(s) go down]
availability [system remains reachable and functioning even if one/more node(s) go down]
Related
I am trying to figure out an appropriate production deployment strategy for an Apache Kafka cluster with High Availability.
I was unable to find a specific documentation which describes such a strategy. So based on the articles I found, I have come up with the following strategy.
3 zookeeper nodes
3 kafka brokers (each having a replica of all the topic partitions that I'm planning to use)
Replication factor of 3 for each Topic
on 3 physical machines (each having a zookeeper node and a broker node)
The reason why I have decided to have a zookeeper node and a broker node on each machine is to avoid a 'brain split' in an event of a network partitioning as described in this question and the accepted answer
I want to know,
Whether there is a adverse performance impact in having both a zookeeper node and a broker node on a single machine? (and whether it would make more sense to go ahead with 6 physical machines by deploying such that each machine would either have a kafka broker or a zookeeper node?)
Whether the deployment strategy I have come with is suitable for a production deployment? (Or how it can be improved?)
Also, if you have come across a guide which recommends a suitable deployment configuration, kindly include its link.
Appreciate any help on this matter. Thanks in advance.
Three is the minimum number of brokers you'll need, but you might want more for additional redundancy and/or capacity
Usually, people deploy their Kafka brokers and Zookeeper nodes on separate hardware.
This Reference Architecture should help you further.
I'm planning to build a Kafka Cluster using two servers, and host Zookeeper on these two servers as well.
The Question is, since Kafka requires Zookeeper to run, what is the best cluster build for zookeeper to implement Kafka Cluster on two servers?
for eg. I'm currently running two zookeepers on both servers and one Kafka on each server, and in the Kafka configuration they point to all Zookeepers.
Is there a better way to do this?
First of all, you don't have to setup Zookeper and Kafka in the same server. One of the roles of Zookeeper is electing controller. (one of the brokers which is responsible for maintaining the leader/follower relationship for all the partitions) For election; majority of Zookeper nodes must be alive. In your case even one Zookeeper instance is down, you cannot select controller. So there is no difference between having one Zookeper or two. That's why it is recommended to have at least 3 nodes in Zookeeper cluster. By this way you can handle failure of one Zookeeper node.
An addition to this, it is highly recommended to have at least three brokers in your Kafka cluster to maintain both consistency and high availability. (link1, link2)
UPDATE:
As long as you are limited to only two servers, then you can consider sacrificing from high availability by set up your broker by setting min.insync.replicas=2 and having topics with replication.factor=2. If HA is more important than data loss, then you can use min.insync.replicas=1 (default) broker config with again topic replication.factor=2. In this circumstance, your options are these IMHO. (Having one or two Zookeepers is not important as I mentioned above)
I am often faced with the same problem as you do #frisky5 where i would like to achieve a "suboptimal" HA system using only 2 nodes, and thus workarounds are always needed with cloud-native frameworks that rely on the assumption that clusters will have lot of nodes available.
That ain't always the case in real life, is it ;) ?
That being said, i see you essentially having 2 options:
Externalize zookeeper configuration on a replicated storage system using 2 nodes (e.g. DRBD)
Replicate Kafka data volumes entirely on the second nodes and use 2 one-node Kafka clusters that you switch on and off depending on who is the current master node.
I would go for the first option. In that case you would have 2 Kafka servers and one zookeeper server whose ip needs to be static (virtual ip). When the zookeeper node goes down, it is restarted one the second node with same VIP, but it needs to access the synchronized data folder.
I am not too familiar with zookeepers internals and i can't tell you whether it will go in conflict when starting up on a data store who "wasn't its own" but i would guess it makes sense for you to test it using a simple rsync setup.
Another way to achieve consensus if you are using a k3s based kubernetes cluster would be to rely on internal k8s distributed consensus mechanics to "tell Kafka" which node is the leader. This works for the postgresoperator by chruncydata because Patroni is cool ( https://patroni.readthedocs.io/en/latest/kubernetes.html ) 😎 but i am not sure if Kafka/zookeeper are that flexible and can communicate with a rest API to set their locks ...
Once you have achieved this intermediate step, then you can use a PostgreSQL db as external source of truth for k3s and then it is as simple as syncing the postgres data folder between the machines (easily done with rsync). The beauty of this approach is that it is way more generic and could be used for other systems too.
Let me know what do you think about these two approaches and whether you manage to setup a test environment. If you do on GitHub i can help you out with implementation
I need to automate a rolling restart of a kafka cluster (3 kafka brokers). I can easily do it manually - restart one after the other, while checking the log to see when it's fine (e.g., when the new process has joined the cluster).
What is a good way to automate this check? How can I ask the broker whether it's up and running, connected to its peers, all topics up-to-date and such? In my restart script, I have access to the metrics, but to be frank, I did not really see one there which gives me a clear picture.
Another way would be to ask what a good "readyness" probe would be that does not simply check some TCP/IP port, but looks at the actual server...
I would suggest exposing JMX metrics and tracking the following for cluster health
the controller count (must be 1 over the whole cluster)
under replicated partitions (should be zero for healthy cluster)
unclean leader elections (if you don't disable this in server.properties make sure there are none in the metric counts)
ISR shrinks within a reasonable time period, like 10 minute window (should be none)
Also, Yelp has tooling for rolling restarts implemented in Python, which requires Jolokia JMX Agents installed on the brokers, and it polls the metrics to make sure some of the above conditions are true
Assuming your cluster was healthy at the beginning of the restart operation, at a minimum, after each broker restart, you should ensure that the under-replicated partition count returns to zero before restarting the next broker.
As the previous responders mentioned, there is existing code out there to automate this. I don’t use Jolikia, myself, but my solution (which I’m working on now) also uses JMX metrics.
Kakfa Utils by Yelp is one of the best tools that can be used to detect when a kafka broker is "done". Specifically, kafka_rolling_restart is the tool which gets broker details from zookeeper and URP (Under Replicated Partitions) metrics from each broker. When a broker is restarted, total URPs across Kafka cluster is periodically collected and when it goes to zero, it restarts another broker. The controller broker is restarted at the last.
I am setting up a kafka+zookeeper cluster. Let's say I want 3 kafka brokers. I am wondering if I can setup 3 machines with kafka on them and then run the zookeeper cluster on the same nodes. So each machine has a kafka+zookeeper node in the cluster, instead of having 3 machines for kafka and 3 machines for zookeeper (6 in total).
What are the advantages and disadvantages? These machines will most probably be dedicated to running kafka/zookeeper. I am thinking if I can reduce costs a bit without sacrificing performance.
We have been running zookeeper and kafka broker on the same node in production environment for years without any problems. The cluster is running at very very high qps and IO traffics, so I dare say that our experience suits most scenarios.
The advantage is quite simple, which is saving machines. Kafka brokers are IO-intensive, while zookeeper nodes don't cost too much disk IO as well as CPU. So they won't disturb each other in most occasions.
But do remember to keep watching at your CPU and IO(not only disk but also network) usages, and increase cluster capacity before they reach bottleneck.
I don't see any disadvantages because we have very good cluster capacity planning.
It makes sense to collocate them when Kafka cluster is small, 3-5 nodes. But keep in mind that it is a colocation of two applications that are sensitive to disk I/O. The workloads and how chatty they are with local Zk's also plays an important role here, especially from page cache memory usage perspective.Â
Once Kafka cluster grows to a dozen or more nodes, collocation of Zk’s accordingly on each node will create quorum overheads(like slower writes, more nodes in quorum checks), so a separate Zk cluster has to be in place.
Overall, if from the start Kafka cluster usage is low and you want to save some costs, then it is reasonable to start them collocated, but have a migration strategy for setting up a separate Zk cluster to not be caught of guard once Kafka cluster has to be scaled horizontally.Â
I am currently working on trying to use zookeeper in a two node cluster. I have my own cluster formation algorithm running on the nodes based on configuration. We only need Zookeeper's distributed DB functionality.
Is it possible to use Zookeeper in a two node cluster ? Do you know of any solutions where this has been done ?
Can we still retain the zookeepers DB functionality without forming a quorum ?
Note: Fault tolerance is not the main concern in this project. If one of the nodes go down we have enough code logic to run without the zookeeper service. We use the zookeeper to share data when both the nodes are alive.
Would greatly appreciate any help.
Zookeeper is a coordination system which is basically used to coordinate among nodes. When writes are occurred to such a distributed system, in ordered to coordinate and agree upon values which are being stored, all the writes are gone through master (aka leader). Reads can occur through any node. Zookeeper requires a master/leader to be elected per a quorum in order to serve write requests consistently. Zookeeper make use of the ZAB protocol as the consensus algorithm.
In order to elect a leader, a quorum should ideally have an odd number of nodes (Otherwise, a node will not be able to win majority and become the leader). In your case, with two nodes, zookeeper will not possibly be able to elect a leader for a long time since both nodes will be candidates and wait for the other node to vote for it. Even though they elect a leader, your ensemble will not work properly in network patitioning situations.
As I said, zookeeper is not a distributed storage. If you need to use it in a distributed manner (more than one node), it need to form a quorum.
As I see, what you need is a distributed database. Not a distributed coordination system.