How do I add a backup producer in Apache Kafka?

I am a Kafka newbie. I have order/market data multicast over TIBCO Rendezvous; my Kafka producer listens to it and publishes to one topic, all in a single partition, on a broker (I have a list of brokers and a ZooKeeper ensemble of 3 nodes), tolerating ZooKeeper and broker failures.
However, persistence on the Kafka brokers, though necessary, won't be sufficient if my producer goes down, as I would have lost the multicast messages anyway. My consumer commits offsets after every message because it cannot process a single message twice.
I was thinking of having a backup producer publish to a different topic, but how would the consumer know where to pick up, even if Kafka allows some leeway when restarting the consumer?
Additionally I might not have a unique identifier on the incoming message.

Whatever data you produce is stored on the brokers, in the Kafka logs, not on the consumer. If you need a backup, a simple solution is to create the topic with replicas:
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 1 --topic topicname
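Separately, if a backup producer republishes the same multicast feed to the same topic, the consumer will see duplicates. Since the incoming messages carry no unique identifier, one hedged option is to deduplicate by content hash on the consumer side. A minimal sketch (the function and names are illustrative, and it assumes duplicate payloads are byte-identical and that the seen-set fits in memory):

```python
import hashlib

def dedupe(payloads, seen=None):
    """Drop payloads whose content hash was already seen.

    Illustrative only: assumes duplicates from the backup producer are
    byte-identical to the originals.
    """
    seen = set() if seen is None else seen
    unique = []
    for p in payloads:
        digest = hashlib.sha256(p).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(p)
    return unique
```

The `seen` set can be passed across poll loops so deduplication survives multiple batches from the two producers.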

Related

Kafka topics not created empty

I have a Kafka cluster consisting of 3 servers, all connected through ZooKeeper. But when I delete a topic that has some data and then create a topic with the same name, the offset does not start from zero.
I tried restarting both Kafka and ZooKeeper, and deleting the topics directly from ZooKeeper.
What I expect is to have a clean topic when I create it again.
I found the problem: a consumer group was still consuming from the topic, so the topic was never actually deleted. I used https://github.com/tchiotludo/kafkahq, a GUI tool that makes it easy to inspect topics. In any case, the active consumer groups can be listed with:
bin/kafka-consumer-groups.sh --list --bootstrap-server localhost:9092

Kafka Consumer does not receive data when one of the brokers is down

Kafka Quickstart
Using Kafka v2.1.0 on RHEL v6.9
Consumer fails to receive data when one of the Kafka brokers is down.
Steps performed:
1. Start zookeeper
2. Start Kafka-Server0 (localhost:9092, kafkalogs1)
3. Start Kafka-Server1 (localhost:9094, kafkalog2)
4. Create topic "test1", num of partitions = 1, replication factor = 2
5. Run producer for topic "test1"
6. Run consumer
7. Send messages from the producer
8. Receive messages on the consumer side.
All the above steps worked without any issues.
When I shut down Kafka-Server0, the consumer stops getting data from the producer.
When I bring Kafka-Server0 back up, the consumer starts getting messages from where it left off.
These are the commands used
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test1
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test1
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 1 --topic test1
The behavior is the same (no message received on the consumer side) when I run the consumer with two servers specified in the --bootstrap-server option.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092,localhost:9094 --topic test1
Any idea why the consumer stops getting messages when server0 is down even though the replication factor for the topic test1 was set to 2?
There is a similar question already but it was not answered completely
Kafka 0.10 quickstart: consumer fails when "primary" broker is brought down
If the offsets topic is unavailable, you cannot consume.
Look for these settings in server.properties, and per the comment, increase them accordingly (this only applies if the internal topics don't already exist):
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 (such as 3) is recommended to ensure availability.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
According to your previous question, it looks like the topic only has one replica.
See the documentation on how to increase the replication factor for an existing topic.
In early versions of Kafka, offsets were managed in ZooKeeper, but Kafka has continuously evolved over time, introducing lots of new features. Now Kafka manages offsets in an internal topic, __consumer_offsets.
Think of a scenario where you created a topic with a replication factor of 1. If the broker goes down, the data exists only on that Kafka node, which is down, so you can't get the data. The same analogy applies to the __consumer_offsets topic.
You need to revisit server.properties in order to get the behavior you expect. If in the meantime you still want to consume messages from the replica partition, you may need to restart the console consumer with --from-beginning (the flag takes no value).
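The dependency described above can be reduced to a tiny sketch: a consumer group can commit and fetch offsets only while at least one replica of the relevant __consumer_offsets partition lives on a broker that is up (broker IDs below are purely illustrative):

```python
def offsets_available(offsets_replicas, down_brokers):
    # A consumer group can commit/fetch offsets only if at least one
    # replica of its __consumer_offsets partition is on a live broker.
    return any(b not in down_brokers for b in offsets_replicas)

# With offsets.topic.replication.factor=1, losing that one broker
# halts consumption:
single = offsets_available([0], down_brokers={0})
# With a replication factor of 2, one broker can fail:
replicated = offsets_available([0, 1], down_brokers={0})
```

This is why the two-broker quickstart breaks when Server0 dies: the topic itself had 2 replicas, but the offsets topic (created with the default factor of 1) did not.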

Doubts Regarding Kafka Cluster Setup

I have a use case where I want to set up a Kafka cluster. Initially I have 1 Kafka broker (A) and 1 ZooKeeper node. Below are my queries:
On adding a new Kafka broker (B) to the cluster, will all the data present on broker A be distributed automatically? If not, what do I need to do to distribute the data?
Now let's suppose the first case is somehow solved and my data is distributed on both brokers. Due to a maintenance issue, I want to take down server B.
How do I transfer the data on broker B to the already existing broker A, or to a new broker C?
How can I increase the replication factor of my topics at runtime?
How can I change the ZooKeeper IPs in the Kafka broker config at runtime, without restarting Kafka?
How can I change the Kafka configuration dynamically at runtime?
Regarding the Kafka client:
Do I need to specify all the Kafka broker IPs in the Kafka client connection string?
And every time a broker is added or removed, do I need to update the IPs in that connection string? Will that always require restarting my producers and consumers?
Note:
Kafka Version: 2.0.0
Zookeeper: 3.4.9
Broker Size : (2 core, 8 GB RAM) [4GB for Kafka and 4 GB for OS]
To run a topic on a single Kafka broker, you will have had to set a replication factor of 1 when creating that topic (explicitly, or implicitly via default.replication.factor). This means the topic's partitions will stay on a single broker, even after increasing the number of brokers.
You will have to increase the number of replicas as described in the Kafka documentation. You will also have to make sure that the internal __consumer_offsets topic has enough replicas. This will start the replication process, and eventually the original broker will be the leader of every topic partition, with the other broker a fully caught-up follower. You can use kafka-topics.sh --describe to check that every partition has both brokers in the ISR (in-sync replicas).
Once that is done, you should be able to take the original broker offline, and Kafka will elect the new broker as the leader of every topic partition. Don't forget to update the clients so they are aware of the new broker as well, in case a client needs to restart while the original broker is down (otherwise it won't find the cluster).
Here are the answers in brief:
Yes, the data present on broker A can also be distributed to Kafka broker B, once you reassign partitions (it does not happen automatically).
You can set up three brokers A, B and C, so that if A fails then B and C take over, if B fails then C takes over, and so on.
You can increase the replication factor of your topics:
you could create increase-replication-factor.json and put this content in it:
{"version":1,
"partitions":[
{"topic":"signals","partition":0,"replicas":[0,1,2]},
{"topic":"signals","partition":1,"replicas":[0,1,2]},
{"topic":"signals","partition":2,"replicas":[0,1,2]}
]}
To apply that plan and actually increase the number of replicas, run the reassignment tool against the JSON file:
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file increase-replication-factor.json --execute
Note that kafka-topics.sh --alter changes the number of partitions, not replicas; for example, to increase an existing topic from 2 to 3 partitions:
bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic-to-increase --partitions 3
For the ZooKeeper side, there is a zoo.cfg file where you can add the IPs and ZooKeeper-related configuration.
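Hand-writing that JSON gets tedious for topics with many partitions, so one could generate it instead. A small sketch (the topic name and broker IDs are simply the ones from the example above; the helper name is illustrative):

```python
import json

def reassignment_plan(topic, num_partitions, replicas):
    # Builds the JSON payload expected by kafka-reassign-partitions.sh,
    # assigning the same replica list to every partition of the topic.
    return {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": p, "replicas": list(replicas)}
            for p in range(num_partitions)
        ],
    }

plan = reassignment_plan("signals", 3, [0, 1, 2])
# Save json.dumps(plan) as increase-replication-factor.json and feed it
# to kafka-reassign-partitions.sh --execute.
payload = json.dumps(plan, indent=1)
```
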

What happens when one of the Kafka replicas is down

I have a cluster of 2 Kafka brokers and a topic with replication factor 2. If one of the brokers dies, will my producers be able to continue sending new messages to this degraded cluster of 1 node? Or does a replication factor of 2 require 2 alive nodes, so messages will be refused?
It depends on a few factors:
What is your producer configuration for acks? If you set it to "all", the leader broker won't answer with an ACK until the message has been replicated to all nodes in the ISR list. At that point it is up to your producer to decide whether it cares about the ACK or not.
What is your value for min.insync.replicas? If the number of in-sync replicas falls below this value, the leader broker won't accept more messages from producers (with acks=all) until more nodes are available.
So basically your producers may pause for a while, until more nodes are back up.
Messages will not be refused just because the number of alive brokers is less than the configured number of replicas. Whenever a new Kafka broker joins the cluster, the data gets replicated to that node.
You can reproduce this scenario by configuring the replication factor as 3 or more and start only one broker.
Kafka will handle re-electing leaders for the partitions that producers and consumers were using on the lost broker, but it will be problematic for new topics.
You can start a single broker with a default replication factor of 2 or 3; that does work. However, you cannot create a topic with a replication factor larger than the number of brokers in the cluster. Whether the topic is auto-created on the first message or created manually, Kafka will throw an error:
kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic test
Error while executing topic command : Replication factor: 3 larger than available brokers: 1.
[2018-08-08 15:23:18,339] ERROR org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
As soon as a new node joins the Kafka cluster, data will be replicated to it; the replication factor by itself does not affect whether messages can be published.
A replication factor of 2 doesn't require 2 live brokers; whether publishing succeeds while one broker is down depends on these configurations:
- acks
- min.insync.replicas
Check those configurations, as mentioned in @Javier's answer above.
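The interplay of acks and min.insync.replicas described in these answers can be sketched as a small decision function (a deliberate simplification of the real broker logic, for illustration only):

```python
def write_accepted(acks, isr_size, min_insync_replicas):
    """Rough model of whether the leader accepts a produce request.

    Simplified: acks=0/1 only need the leader itself to be up, while
    acks='all' additionally requires the in-sync replica set to be at
    least min.insync.replicas.
    """
    if acks in (0, 1):
        return isr_size >= 1
    if acks == "all":
        return isr_size >= min_insync_replicas
    raise ValueError("acks must be 0, 1, or 'all'")

# RF=2, one broker down (ISR shrinks to 1), default min.insync.replicas=1:
ok = write_accepted("all", isr_size=1, min_insync_replicas=1)
# The same outage with min.insync.replicas=2 pauses producers:
blocked = write_accepted("all", isr_size=1, min_insync_replicas=2)
```

So with the defaults, the 2-broker cluster in the question keeps accepting writes after one broker dies; raising min.insync.replicas to 2 trades that availability for stronger durability.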

Why does kafka producer take a broker endpoint when being initialized instead of the zk

If I have multiple brokers, which broker should my producer use? Do I need to manually switch the broker to balance the load? Also why does the consumer only need a zookeeper endpoint instead of a broker endpoint?
quick example from tutorial:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
which broker should my producer use? Do I need to manually switch the broker to balance the load?
Kafka runs as a cluster, meaning a set of nodes, so when producing you need to give the producer the LIST of brokers that you've configured for your application. Below is a small note taken from the documentation:
“metadata.broker.list” defines where the producer can find one or more brokers to determine the leader for each topic. This does not need to be the full set of brokers in your cluster, but should include at least two in case the first broker is not available. No need to worry about figuring out which broker is the leader for the topic (and partition); the producer knows how to connect to a broker, ask for the metadata, and then connect to the correct broker.
Hope this clears some of your confusion.
Also why does the consumer only need a zookeeper endpoint instead of a broker endpoint?
This is not technically correct, as there are two consumer APIs available: the high-level and the low-level (simple) consumer.
The high-level consumer takes care of most things like leader detection, threading issues, etc., but does not provide much control over messages, which is exactly the purpose of the alternative simple (low-level) consumer, where you need to provide the broker and partition details yourself.
So the consumer needs a ZooKeeper endpoint only when you go with the high-level API; when using the simple consumer, you do need to provide the broker information instead.
Kafka sets a single broker as the leader for each partition of each topic. The leader is responsible for handling both reads and writes to that partition. You cannot decide to read or write from a non-Leader broker.
So, what does it mean to provide a broker or list of brokers to the kafka-console-producer ? Well, the broker or brokers you provide on the command-line are just the first contact point for your producer. If the broker you list is not the leader for the topic/partition you need, your producer will get the current leader info (called "topic metadata" in kafka-speak) and reconnect to other brokers as necessary before sending writes. In fact, if your topic has multiple partitions it may even connect to several brokers in parallel (if the partition leaders are different brokers).
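The bootstrap-then-metadata handshake described above can be sketched roughly like this (the metadata dict stands in for the broker's real topic-metadata response; all names and addresses are illustrative):

```python
def find_leader(bootstrap_brokers, metadata, topic, partition):
    # Step 1: contact any reachable bootstrap broker; step 2: read the
    # topic metadata it returns; step 3: return the leader to connect to,
    # which may be a broker that was never in the bootstrap list.
    for broker in bootstrap_brokers:
        if broker in metadata["brokers"]:      # broker is reachable
            leaders = metadata["leaders"]      # topic metadata response
            return leaders[(topic, partition)]
    raise ConnectionError("no bootstrap broker reachable")

metadata = {
    "brokers": {"localhost:9092", "localhost:9094"},
    "leaders": {("test1", 0): "localhost:9094"},
}
leader = find_leader(["localhost:9092"], metadata, "test1", 0)
```

Here the producer bootstraps from localhost:9092 but ends up writing to localhost:9094, because that broker is the partition leader according to the metadata.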
Second question: why does the consumer require a ZooKeeper list for connections instead of a broker list? The answer is that Kafka consumers can operate in "groups", and ZooKeeper is used to coordinate those groups (how groups work is a larger issue, beyond the scope of this question). ZooKeeper also stores the broker lists for topics, so the consumer can pull broker lists directly from ZooKeeper, making an additional --broker-list a bit redundant.
Kafka Producer API does not interact directly with Zookeeper. However, the High Level Consumer API connects to Zookeeper to fetch/update the partition offset information for each consumer. So, the consumer API would fail if it cannot connect to Zookeeper.
All the answers above are correct for older versions of Kafka, but things changed with the arrival of Kafka 0.9.
Now there is no longer any direct interaction with ZooKeeper from either the producer or the consumer. Another interesting thing is that with 0.9, Kafka removed the distinction between the high-level and low-level APIs; both follow a unified consumer API.