Doubts Regarding Kafka Cluster Setup - apache-kafka

I have a use case I want to set up a Kafka cluster initially at the starting I have 1 Kafka Broker(A) and 1 Zookeeper Node. So below mentioned are my queries:
On adding a new Kafka Broker(B) to the cluster. Will all data present on broker A will be distributed automatically? If not what I need to do distribute the data.
Not let's suppose somehow the case! is solved my data is distributed on both the brokers. Now due to some maintenance issue, I want to take down the server B.
How to transfer the data of Broker B to the already existing broker A or to a new Broker C.
How can I increase the replication factor of my brokers at runtime
How can I change the zookeeper IPs present in Kafka Broker Config at runtime without restarting Kafka?
How can I dynamically change the Kafka Configuration at runtime
Regarding Kafka Client:
Do I need to specify all Kafka broker IP to kafkaClient for connection?
And each and every time a broker is added or removed does I need to add or remove my IP in Kafka Client connection String. As it will always require to restart my producer and consumers?
Note:
Kafka Version: 2.0.0
Zookeeper: 3.4.9
Broker Size : (2 core, 8 GB RAM) [4GB for Kafka and 4 GB for OS]

To run a topic from a single kafka broker you will have to set a replication factor of 1 when creating that topic (explicitly, or implicitly via default.replication.factor). This means that the topic's partitions will be on a single broker, even after increasing the number of brokers.
You will have to increase the number of replicas as described in the kafka documentation. You will also have to pay attention that the internal __consumer_offsets topic has enough replicas. This will start the replication process and eventually the original broker will be the leader of every topic partition, and the other broker will be the follower and fully caught up. You can use kafka-topics.sh --describe to check that every partition has both brokers in the ISR (in-sync replicas).
Once that is done you should be able to take the original broker offline and kafka will elect the new broker as the leader of every topic partition. Don't forget to update the clients so they are aware of the new broker as well, in case a client needs to restart when the original broker is down (otherwise it won't find the cluster).

Here are the answers in brief:
Yes, the data present on broker A will also be distributed in Kafka broker B
You can set up three brokers A, B and C so if A fails then B and C will, and if B fails then, C will take over and so on.
You can increase the replication factor of your broker
you could create increase-replication-factor.json and put this content in it:
{"version":1,
"partitions":[
{"topic":"signals","partition":0,"replicas":[0,1,2]},
{"topic":"signals","partition":1,"replicas":[0,1,2]},
{"topic":"signals","partition":2,"replicas":[0,1,2]}
]}
To increase the number of replicas for a given topic, you have to:
Specify the extra partitions to the existing topic with below command(let us say the increase from 2 to 3)
bin/kafktopics.sh --zookeeper localhost:2181 --alter --topic topic-to-increase --partitions 3
There is zoo.cfg file where you can add the IP and configuration related to ZooKeeper.

Related

Kafka Topic, Broker, ZooKeeper architecture overview

I have read a bunch of articles regarding Kafka architecture but I'm still brand-new in this and when it came to coding there was some confusion if I get the things correctly.
From what I understand Kafka server, broker and node are synonyms. There can be a few brokers within Kafka cluster. There is a Kafka topic (T1) and it consists of a few partitions (P1, P2..). These partitions can be replicated across the brokers (B1, B2..). B1 can be leader for P1, B2 for P2 and so on. Do we say that there is topic T1 defined for broker or cluster, and if we treat topic as set of partitions can we say 'topic replicas'?
From the official Kafka documentation:
bootstrap.servers: A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
So from what I understand, defining host1:port1,host2:port2 says that there are two brokers.
In this case, does ZooKeeper automatically distribute a message to a leader when executing bin/kafka-console-producer.sh --broker-list host1:port1,host2:port2 --topic test ? (I believe somewhere I have read that a producer should read broker id from ZooKeeper, but wouldn't it be unnecessary here?)
Is it equal to publishing using bin/kafka-console-producer.sh --zookeeper host1:z_port1,host2:z_port2 --topic test ?
How should I basically understand bin/kafka-configs.sh --zookeeper host1:z_port1,host2:z_port2? We have only one zookeeper instance?
Do we say that there is topic T1 defined for broker or cluster, and if we treat topic as set of partitions can we say 'topic replicas'?
1) Cluster. 2) Partitions are individually replicated across multiple brokers, often more than the replication factor itself. The more proper term would be the "in sync replicas (ISR)"
does ZooKeeper automatically distribute a message to a leader when executing
Zookeeper does not, no. Your client communicates with a Broker Controller, then receives all brokers in the cluster, which also returns metadata about which broker is the leader for which topic-partitions. The client then individually connects and produces to each leader broker for the calculated partitions
Is it equal to publishing
Producing*, yes.
We have only one zookeeper instance?
One Zookeeper cluster can manage multiple Kafka clusters via a feature called a chroot, the root directory in the Zookeeper znodes that contains information about the managed service.
Also, kafka-topics command can now use --bootstrap-server, not --zookeeper

Why is my kafka topic not consumable with a broker down?

My issue is that I have a three broker Kafka Cluster and an availability requirement to have access to consume and produce to a topic when one or two of my three brokers is down.
I also have a reliability requirement to have a replication factor of 3. These seem to be conflicting requirements to me. Here is how my problem manifests:
I create a new topic with replication factor 3
I send several messages to that topic
I kill one of my brokers to simulate a broker issue
I attempt to consume the topic I created
My consumer hangs
I review my logs and see the error:
Number of alive brokers '2' does not meet the required replication factor '3' for the offsets topic
If I set all my broker's offsets.topic.replication.factor setting to 1, then I'm able to produce and consume my topics, even if I set the topic level replication factor to 3.
Is this an okay configuration? Or can you see any pitfalls in setting things up this way?
You only need as many brokers as your replication factor when creating the topic.
I'm guessing in your case, you start with a fresh cluster and no consumers have connected yet. In this case, the __consumer_offsets internal topic does not exist as it is only created when it's first needed. So first connect a consumer for a moment and then kill one of the brokers.
Apart from that, in order to consume you only need 1 broker up, the leader for the partition.

What happens when one of the Kafka replicas is down

I have a cluster of 2 Kafka brokers and a topic with replication factor 2. If one of the brokers dies, will my producers be able to continue sending new messages to this degraded cluster of 1 node? Or replication factor 2 requires 2 alive nodes and messaged will be refused?
It depends on a few factors:
What is your producer configuration for acks? If you configure to "all", the leader broker won't answer with an ACK until the message have been replicated to all nodes in ISR list. At this point is up to your producer to decide if he cares about ACKs or not.
What is your value for min.insync.replicas? If the number of nodes is below this config, your broker leader won't accept more messages from producers until more nodes are available.
So basically your producers may get into a pause for a while, until more nodes are up.
Messages will not be ignored if the no. of alive brokers is lesser than the configured replicas. Whenever a new Kafka broker joins the cluster, the data gets replicated to that node.
You can reproduce this scenario by configuring the replication factor as 3 or more and start only one broker.
Kafka will handle reassigning partitions for producers and consumers that where dealing with the partitions lost, but it will problematic for new topics.
You could start one broker with a replication factor of 2 or 3. It does work. However, you could not create a topic with that replication factor until you have that amount of brokers in the cluster. Either the topic is auto generated on the first message or created manually, kafka will throw an error.
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic test
Error while executing topic command : Replication factor: 3 larger than available brokers: 1.
[2018-08-08 15:23:18,339] ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
As soon as new node joined to the kafka cluster, data will be replicated, the replicas factor does not effect the publisher messages
replication-factor 2 doesn't require 2 live brokers, it publish message while one broker is down depends on those configurations
- acks
- min.insync.replicas
Check those configurations as mentioned above #Javier

One Kafka broker connects to multiple zookeepers

I'm new to Kafka, zookeeper and Storm.
I our environment we have one Kafka broker connecting to multiple zookeepers. Is there an advantage having the producer send the messages to a specific topic and partition on one broker to multiple zookeepers vs multiple brokers to multiple zookeepers?
Yes there is. Kafka allows you to scale by adding brokers. When you use a Kafka cluster with a single broker, as you have, all partitions reside on that single broker. But when you have multiple brokers, Kafka will split the partitions between them. So, broker A may be elected leader for partitions 1 and 2 of your topic, and broker B leader for partition 3. So, when you publish messages to the topic, the client will split the messages between the various partitions on the two brokers.
Note that I also mentioned leader election. Adding brokers to your Kafka cluster gives you replication. Kafka uses ZooKeeper to elect a leader for each partition as I mentioned in my example. Once a leader is elected, the client splits messages among partitions and sends each message to the leader for the appropriate partition. Depending on the topic configuration, the leader may synchronously replicate messages to a backup. So, in my example, if the replication factor for the topic is 2 then broker A will synchronously replicate messages for partitions 1 and 2 to broker B and broker B will synchronously replicate messages for partition 3 to broker A.
So, that's all to say that adding brokers gives you both scalability and fault-tolerance.

How to load balance the Kafka Leadership?

My kafka version is kafka_2.9.2-0.8.1.1. I have two brokers in the cluster, 4 topics and each topic has 4 partitions.
When I run
sh kafka-topics.sh --describe --zookeeper rhost:2181
for all the topics/partitions, I see broker 1 as Leader.
How can I load balance the leader?
For example, for topic 1 and topic 2 have broker 1 as leader and
for topic 3 and topic 4 have broker 2 as leader.
The partitions should be automatically rebalanced, since the default value of the broker configuration parameter auto.leader.rebalance.enable is true. (see documentation)
However, by default this rebalance occurs every 5 minutes, as defined by the leader.imbalance.check.interval.seconds parameter. If you wish this to occur more frequently, you will have to modify this parameter.
You can use the Preferred Replica Leader Election Tool:
sh kafka-preferred-replica-election.sh --zookeeper zklist
This guarantees that the leadership load across the brokers in a cluster is evenly balanced.
I know it is a bit late, maybe you already have the answer, but to balance leaders, first you need to make brokers equally preferred between all partitions, for a broker to be a "preferred leader" it has two criteria, first, it needs to be in sync replica, second, it has to be the first element on the replicas list. So, if you have a small enough number of topics/partitions, you can do that manually, it would be easier, otherwise you need to reassign partitions with distributing the first element (preferred replica) among all brokers, then kick off preferred leader election tool which will make sure that the preferred leader is actually the leader.
Brokers have a property which can be set in server.properties file, which will enable auto re-balancing of the leadership. By default, it is not enabled. Add the following line of code to every broker and restart kafka.
auto.leader.rebalance.enable=true