Kafka Stream Internal Topics lag increases on taking Kafka Broker down - apache-kafka

I have a Kafka Streams Application version - 0.11.0.1 which takes data from few topics and joins the data and puts it in another topic.
Kafka Configuration:
5 kafka brokers - version 0.11
Kafka Topics - 15 partitions and 3 replication factor.
Few millions of records are consumed/produced every hour.
Note: Whenever I take any kafka broker down, it brings down few consumers and Kafka streams (consumer) rebalances and lag of internal topics increase from 0 - few millions(1-10).
Is this because of any local state store config or something? How can I handle this?

Related

Impact on having large number of consumers on producers / brokers

I wanted to understand, having large numbers of consumers on producer latency and brokers. I have around 7K independent consumers. Each consumer is consuming all partitions of a topic. I have manually assigned partitions of a topic to a consumer, not using consumer groups. Each consumer is consuming msg only one topic. Message size is small less than a KB. Produce throughput is also low. But when produce spike come then produce latency (here ack=1). Brokers resource is also very low. I want to understand the impact on having large number of consumers on Kafka.
Cluster Details:
Broker#: 13 (1 Broker : 14 cores & 36GB memory )
Kafka cluster version: 2.0.0
Kafka Java client version: 2.0.0
Number topics: ~15.
Number of consumers: 7K (all independent and manually assigned all partitions of a topic to a consumers. One consumer is consuming all partitions from a topic only)

How can I run Kafka Consumer processor instance on multiple nodes with Apache Nifi

Currently we are using Apache NiFi to consume messages via Kafka consumer. Output of kafka consumer is connected to hive processor.
I'm looking into how to run kafka consumer instance on a nifi cluster.
I have 3 nodes of nifi cluster and a kafka topic which have 3 partitions, I want the kafka consumer to be able run on each node so each consumer can poll message from one of topic partitions.
After I started the kafka consumer processor ,i can only see that the kafka consumer always run on a single node but not all nodes.
Is there any configuration that I missed?
NiFi uses the Apache Kafka client which is what performs the assignment of consumers to partitions. When you start the processor, assuming you have it set to 1 concurrent task, then you should have 1 consumer on each node of your cluster, and each consumer should get assigned a different partition.
https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka

Kafka Producer, Consumer, Broker in same host?

Are there any downsides to running the same producer and consumer code for all nodes in the cluster? If there are 8 nodes in the cluster (8 consumer, 8 kafka broker, and 8 producers), would 8 producers be running at the same time in the cluster then? Is there a way to modify cluster so that only one producer runs at a time?
Kafka cluster is nothing but Kafka brokers running under a distributed consensus. Kafka cluster is agnostic about number of producers and consumers running around it. Producers and consumers are clients of the Kafka cluster. Producers will stream data to Kafka and consumers consume the data from Kafka. Within Kafka cluster data will be distributed within topics. Topics are sharded using partitions. If multiple consumers belong to the same consumer group consumers can work in a self healing fashion.
Is there a way to modify cluster so that only one producer runs at a
time?
If you intend to run a single producer at certain point of time, you don't need to make any change within cluster.
Are there any downsides to running the same producer and consumer code for all nodes in the cluster?
The primary downsides here would be scalability and memory usage.
Producers and Consumers are not required to run on Brokers. Producers should be deployed where data is being generated (or running as separate hosts, like Kafka Connect workers).
Consumers should be scaled out independently based on the throughput and ordering guarantees that you need in your downstream systems.
There is nothing that says 8 brokers requires 8 producers and 8 consumers; partitions are what matters more
If you have N partitions in a topic, you can only scale to N active consumers anyway, and infinitely many producers
8 brokers can hold lots of partitions for any given topic
Running a single producer is an implementation of your own code. The broker cannot force it.

Kafka Stream reprocessing old messages on rebalancing

I have a Kafka Streams application which reads data from a few topics, joins the data and writes it to another topic.
This is the configuration of my Kafka cluster:
5 Kafka brokers
Kafka topics - 15 partitions and replication factor 3.
My Kafka Streams applications are running on the same machines as my Kafka broker.
A few million records are consumed/produced per hour. Whenever I take a broker down, the application goes into rebalancing state and after rebalancing many times it starts consuming very old messages.
Note: When the Kafka Streams application was running fine, its consumer lag was almost 0. But after rebalancing, its lag went from 0 to 10million.
Can this be because of offset.retention.minutes.
This is the log and offset retention policy configuration of my Kafka broker:
log retention policy : 3 days
offset.retention.minutes : 1 day
In the below link I read that this could be the cause:
Offset Retention Minutes reference
Any help in this would be appreciated.
Offset retention can have an impact. Cf this FAQ: https://docs.confluent.io/current/streams/faq.html#why-is-my-application-re-processing-data-from-the-beginning
Also cf How to commit manually with Kafka Stream? and How to commit manually with Kafka Stream? about how commits work.

One Kafka broker connects to multiple zookeepers

I'm new to Kafka, zookeeper and Storm.
I our environment we have one Kafka broker connecting to multiple zookeepers. Is there an advantage having the producer send the messages to a specific topic and partition on one broker to multiple zookeepers vs multiple brokers to multiple zookeepers?
Yes there is. Kafka allows you to scale by adding brokers. When you use a Kafka cluster with a single broker, as you have, all partitions reside on that single broker. But when you have multiple brokers, Kafka will split the partitions between them. So, broker A may be elected leader for partitions 1 and 2 of your topic, and broker B leader for partition 3. So, when you publish messages to the topic, the client will split the messages between the various partitions on the two brokers.
Note that I also mentioned leader election. Adding brokers to your Kafka cluster gives you replication. Kafka uses ZooKeeper to elect a leader for each partition as I mentioned in my example. Once a leader is elected, the client splits messages among partitions and sends each message to the leader for the appropriate partition. Depending on the topic configuration, the leader may synchronously replicate messages to a backup. So, in my example, if the replication factor for the topic is 2 then broker A will synchronously replicate messages for partitions 1 and 2 to broker B and broker B will synchronously replicate messages for partition 3 to broker A.
So, that's all to say that adding brokers gives you both scalability and fault-tolerance.