Does it make sense to have more #replicas than #broker in kafka?

Does it make sense to have more #replicas than #broker in kafka? - apache-kafka

I am going to use kafka as messaging system. Still missing the following dots in my mind.
How many brokers can I have on one machine ?
Does it make sense to have more #replicas (partition replication) than #broker in kafka ?
Is it possible to add additional zookeeper server(on other machine) to scale without shutting down/restarting the current service ?

You could have more than one broker per machine but there is usually not any good reason to have more than one.
I can not think of a good reason to have more #replicas specified than #brokers.
Your Zookeeper servers should optimally be on separate machines and be and odd number of nodes. There is a tradeoff between write latency and resiliency here. 3 Zookeepers are common where write latency is very important. 5 or even 7 nodes can be used for more resiliency.

Related

What is the recommended production deployment strategy for Apache Kafka?

I am trying to figure out an appropriate production deployment strategy for an Apache Kafka cluster with High Availability.
I was unable to find a specific documentation which describes such a strategy. So based on the articles I found, I have come up with the following strategy.
3 zookeeper nodes
3 kafka brokers (each having a replica of all the topic partitions that I'm planning to use)
Replication factor of 3 for each Topic
on 3 physical machines (each having a zookeeper node and a broker node)
The reason why I have decided to have a zookeeper node and a broker node on each machine is to avoid a 'brain split' in an event of a network partitioning as described in this question and the accepted answer
I want to know,
Whether there is a adverse performance impact in having both a zookeeper node and a broker node on a single machine? (and whether it would make more sense to go ahead with 6 physical machines by deploying such that each machine would either have a kafka broker or a zookeeper node?)
Whether the deployment strategy I have come with is suitable for a production deployment? (Or how it can be improved?)
Also, if you have come across a guide which recommends a suitable deployment configuration, kindly include its link.
Appreciate any help on this matter. Thanks in advance.

Three is the minimum number of brokers you'll need, but you might want more for additional redundancy and/or capacity
Usually, people deploy their Kafka brokers and Zookeeper nodes on separate hardware.
This Reference Architecture should help you further.

Building a Kafka Cluster using two servers only

I'm planning to build a Kafka Cluster using two servers, and host Zookeeper on these two servers as well.
The Question is, since Kafka requires Zookeeper to run, what is the best cluster build for zookeeper to implement Kafka Cluster on two servers?
for eg. I'm currently running two zookeepers on both servers and one Kafka on each server, and in the Kafka configuration they point to all Zookeepers.
Is there a better way to do this?

First of all, you don't have to setup Zookeper and Kafka in the same server. One of the roles of Zookeeper is electing controller. (one of the brokers which is responsible for maintaining the leader/follower relationship for all the partitions) For election; majority of Zookeper nodes must be alive. In your case even one Zookeeper instance is down, you cannot select controller. So there is no difference between having one Zookeper or two. That's why it is recommended to have at least 3 nodes in Zookeeper cluster. By this way you can handle failure of one Zookeeper node.
An addition to this, it is highly recommended to have at least three brokers in your Kafka cluster to maintain both consistency and high availability. (link1, link2)
UPDATE:
As long as you are limited to only two servers, then you can consider sacrificing from high availability by set up your broker by setting min.insync.replicas=2 and having topics with replication.factor=2. If HA is more important than data loss, then you can use min.insync.replicas=1 (default) broker config with again topic replication.factor=2. In this circumstance, your options are these IMHO. (Having one or two Zookeepers is not important as I mentioned above)

I am often faced with the same problem as you do #frisky5 where i would like to achieve a "suboptimal" HA system using only 2 nodes, and thus workarounds are always needed with cloud-native frameworks that rely on the assumption that clusters will have lot of nodes available.
That ain't always the case in real life, is it ;) ?
That being said, i see you essentially having 2 options:
Externalize zookeeper configuration on a replicated storage system using 2 nodes (e.g. DRBD)
Replicate Kafka data volumes entirely on the second nodes and use 2 one-node Kafka clusters that you switch on and off depending on who is the current master node.
I would go for the first option. In that case you would have 2 Kafka servers and one zookeeper server whose ip needs to be static (virtual ip). When the zookeeper node goes down, it is restarted one the second node with same VIP, but it needs to access the synchronized data folder.
I am not too familiar with zookeepers internals and i can't tell you whether it will go in conflict when starting up on a data store who "wasn't its own" but i would guess it makes sense for you to test it using a simple rsync setup.
Another way to achieve consensus if you are using a k3s based kubernetes cluster would be to rely on internal k8s distributed consensus mechanics to "tell Kafka" which node is the leader. This works for the postgresoperator by chruncydata because Patroni is cool ( https://patroni.readthedocs.io/en/latest/kubernetes.html ) 😎 but i am not sure if Kafka/zookeeper are that flexible and can communicate with a rest API to set their locks ...
Once you have achieved this intermediate step, then you can use a PostgreSQL db as external source of truth for k3s and then it is as simple as syncing the postgres data folder between the machines (easily done with rsync). The beauty of this approach is that it is way more generic and could be used for other systems too.
Let me know what do you think about these two approaches and whether you manage to setup a test environment. If you do on GitHub i can help you out with implementation

Kafka cluster with single broker

I'm looking to start using Kafka for a system and I'm trying to cover all use cases.
Normally it would be run as a cluster of brokers running on virtual servers (replication factor 3-5). but some customers though don't care about resilience and a broker failure needing a manual reboot of the whole system is fine with them, they just care about hardware costs.
So my question is, are there any issues with using Kafka as a single broker system for small installations with low throughput?
Cheers

It's absolutely OK to use a single Kafka broker. Note, however, that with a single broker you won't have a highly available service meaning that when the broker fails you will have a downtime.
Your replication-factor will be limited to 1 and therefore all of the partitions of a topic will be stored on the same node.

For a proof-of-concept or non-critical dev work, a single node cluster works just fine. However having a cluster has multiple benefits. It's okay to go with a single node cluster if the following are not important/relevant for you.
scalability [spreads load across multiple brokers to maintain certain throughput]
fail-over [guards against data loss in case one/more node(s) go down]
availability [system remains reachable and functioning even if one/more node(s) go down]

Running zookeeper on a cluster of 2 nodes

I am currently working on trying to use zookeeper in a two node cluster. I have my own cluster formation algorithm running on the nodes based on configuration. We only need Zookeeper's distributed DB functionality.
Is it possible to use Zookeeper in a two node cluster ? Do you know of any solutions where this has been done ?
Can we still retain the zookeepers DB functionality without forming a quorum ?
Note: Fault tolerance is not the main concern in this project. If one of the nodes go down we have enough code logic to run without the zookeeper service. We use the zookeeper to share data when both the nodes are alive.
Would greatly appreciate any help.

Zookeeper is a coordination system which is basically used to coordinate among nodes. When writes are occurred to such a distributed system, in ordered to coordinate and agree upon values which are being stored, all the writes are gone through master (aka leader). Reads can occur through any node. Zookeeper requires a master/leader to be elected per a quorum in order to serve write requests consistently. Zookeeper make use of the ZAB protocol as the consensus algorithm.
In order to elect a leader, a quorum should ideally have an odd number of nodes (Otherwise, a node will not be able to win majority and become the leader). In your case, with two nodes, zookeeper will not possibly be able to elect a leader for a long time since both nodes will be candidates and wait for the other node to vote for it. Even though they elect a leader, your ensemble will not work properly in network patitioning situations.
As I said, zookeeper is not a distributed storage. If you need to use it in a distributed manner (more than one node), it need to form a quorum.
As I see, what you need is a distributed database. Not a distributed coordination system.

Kafka recommended system configuration

I'm expecting our influx in to Kafka to raise to around 2 TB/day over a period of time. I'm planning to setup a Kafka cluster with 2 brokers (each running on separate system). What is the recommended hardware configuration for handling 2 TB/day ?

To use as a base you could look here: https://docs.confluent.io/4.1.1/installation/system-requirements.html#hardware
You need to know the amount of messages you get per second/hour because this will determine the size of your cluster. For HD, it's not necessary to get SSD because the system will use RAM to store the data first. Still you could need quite speed hard disk to ensure that the flushing process of the queue will not slow your system.
I would also recommend to use 3 kafka broker and 3 or 4 zookeeper server too.