I have a Kafka topic that somehow went from 3 ISRs to 1 ISR, in a Kafka cluster with 3 brokers. I changed the minimum ISR from 2 to 1 to allow it to function. Presumably the other brokers are trying to replicate the topic from the leader; how can I monitor their progress?
You can monitor this metric to see the replication lag:
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
It reports the lag in number of messages per follower replica, which tells you whether a replica is slow or has stopped replicating from the leader.
As stated in https://kafka.apache.org/documentation/
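As a sketch, this metric is exposed over JMX by the follower broker doing the fetching, and you can poll it with the JmxTool shipped with Kafka. The hostname, JMX port (set via the JMX_PORT environment variable when starting the broker), topic name, partition number and the exact clientId value are all assumptions here; list the broker's MBeans first to find the real clientId.

```shell
# Poll the follower's replication lag for one partition every 5 seconds.
# broker1, port 9999, topic/partition and the ReplicaFetcherThread clientId
# are placeholders - adjust them to your cluster.
bin/kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url 'service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi' \
  --object-name 'kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=ReplicaFetcherThread-0-1,topic=my-topic,partition=0' \
  --reporting-interval 5000
```

A lag value that steadily shrinks toward 0 means the follower is catching up; a constant or growing value means it is stuck or too slow.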
Yannick
I am trying to understand how failover and replication factors work in Kafka.
Let's say my cluster has 3 brokers and the replication factor is also 3. In this case each broker will have one copy of the partition, and one of the brokers is the leader. If the leader broker fails, then one of the follower brokers will become the leader, but now the replication factor is down to 2. At this point, if I add a new broker to the cluster, will Kafka make sure that the replication factor is 3, and will it copy the required data to the new broker?
How will the above scenario work if my cluster already has an additional broker?
In your setup (3 brokers, 3 replicas), when 1 broker fails Kafka will automatically elect new leaders (on the remaining brokers) for all the partitions whose leader was on the failed broker.
The replication factor does not change. The replication factor is a topic configuration that can only be changed by the user.
Similarly the Replica list does not change. This lists the brokers that should host each partition.
However, the In Sync Replicas (ISR) list will change and only contain the 2 remaining brokers.
If you add another broker to the cluster, what happens depends on its broker.id:
if the broker.id is the same as the broker that failed, this new broker will start replicating data and eventually join the ISR for all the existing partitions.
if it uses a different broker.id, nothing will happen. You will be able to create new topics with 3 replicas (that is not possible while there are only 2 brokers) but Kafka will not automatically replicate existing partitions. You can manually trigger a reassignment if needed, see the docs.
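For that last case, a manual reassignment can be sketched as below. The topic name, broker ids and ZooKeeper address are assumptions; on newer Kafka versions the tool takes --bootstrap-server instead of --zookeeper.

```shell
# topics.json lists the topics whose partitions should be spread
# over the new broker set (topic name is a placeholder).
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "my-topic"}]}
EOF

# 1) Generate a candidate assignment over brokers 1,2,3 and review it.
#    Save the proposed JSON it prints as reassign.json.
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3" --generate

# 2) Execute the reassignment.
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassign.json --execute

# 3) Check progress until every partition reports "completed successfully".
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 \
  --reassignment-json-file reassign.json --verify
```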
Leaving partitions aside (which are another Kafka concept):
The replication factor does not say how many times a topic is replicated, but rather how many times it should be replicated. It is not affected by brokers shutting down.
Once the leader broker shuts down, leadership moves to another broker that is in sync, i.e. one that has replicated the current state and is not lagging behind. Electing a broker that is not in sync as leader would obviously lead to data loss, so this never happens (when using the right settings).
The replicas eligible to take over leadership are called in-sync replicas (ISR). This matters because the configuration min.insync.replicas specifies how many in-sync replicas (counting the leader itself) must exist for a write with acks=all to be acknowledged. With min.insync.replicas=1 (the default), a message is acknowledged as soon as the leader has written it; if that broker then dies before any follower catches up, the unreplicated data is lost. With min.insync.replicas=2, the acknowledgement waits until at least one follower also has the message, so a leader failure no longer loses acknowledged data. If fewer brokers are alive than the configured minimum, producers using acks=all start failing with NotEnoughReplicas errors.
So to answer your question: if you have 2 running brokers, min.insync.replicas=1 (the default) and a replication factor of 3, your cluster runs fine and will catch the third replica up as soon as you start another broker. If another of the 2 brokers dies before you bring up the third one, you will run into problems.
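The leader's acknowledgement rule for acks=all can be sketched as a tiny decision function. This is a simplification of the real broker logic (which tracks the ISR per partition and counts the leader as one of the in-sync replicas):

```shell
# Sketch: should an acks=all produce request be acknowledged?
# Acknowledge only while the current ISR size meets min.insync.replicas.
can_ack() {
  local isr_size=$1 min_insync=$2
  if [ "$isr_size" -ge "$min_insync" ]; then
    echo "ack"
  else
    echo "NOT_ENOUGH_REPLICAS"
  fi
}

can_ack 3 2   # all replicas in sync -> ack
can_ack 2 2   # one replica fell out of the ISR -> still ack
can_ack 1 2   # only the leader left -> NOT_ENOUGH_REPLICAS
```

With min.insync.replicas=1 the last call would still acknowledge, which is exactly the durability trade-off described above.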
We have a cluster of 3 ZooKeeper nodes and 3 Kafka brokers running on different systems in AWS, and we changed the properties below to ensure high availability and prevent data loss.
server.properties
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=1
I have the following questions.
Assume brokers A, B, C.
Since we set the replication factor to 3, all the data will be available on all three brokers A, B and C, so if broker A is down it won't affect the flow.
But whenever broker A goes down while we are continuously receiving data from a connector, that data is stored only on brokers B and C.
Say broker A comes back up after 2 hours.
Is the data that arrived between broker A going down and coming back up available on broker A or not?
Is there any specific configuration we need to set for that?
How does replication between the brokers happen when one broker comes back online after being offline?
I don't know whether this is a valid question, but please share your thoughts to help me understand how the replication factor works.
While A is recovering, it will be out of the ISR list. If you've disabled unclean leader election, then A cannot become the leader of any partition it holds (no client can write to or read from it) and will replicate data from the other replicas until it is up to date, then rejoin the ISR.
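You can watch A catch up and rejoin from the command line. The topic name and ZooKeeper address are assumptions; newer Kafka versions take --bootstrap-server instead.

```shell
# Re-run this (or wrap it in `watch`) while A recovers: the Isr column
# will grow back to match the full Replicas list once A has caught up.
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic my-topic
```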
running confluent-kafka 3.3.1e Kafka 0.11.0x
I have a single partition with replicas = 3.
My producer is running with acks=-1.
The partition has 1 out of sync replica
Replica lag time max ms = 10000
Min insync replicas = 2
1) Will the record get committed by the producer when acks=-1?
2) How can I get the out of sync replica back in sync?
1) With acks=-1, Kafka will accept records as long as at least min.insync.replicas replicas are in sync.
Since min.insync.replicas is 2 for your topic and only one of the 3 replicas is out of sync, 2 replicas are still in the ISR, so yes, the record will be accepted by Kafka.
2) In a normal case, Kafka always tries to keep all replicas in-sync. If that's not happening then you want to check the broker that is hosting the out-of-sync replica as something is wrong. Is this broker running? Is it healthy? Is it overloaded? Are other partitions on this broker also out-of-sync?
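A quick way to narrow this down (the ZooKeeper address is an assumption) is to list only the problem partitions:

```shell
# Show every partition whose ISR is smaller than its replica list;
# if they all share one broker id, that broker is the one to investigate.
bin/kafka-topics.sh --zookeeper localhost:2181 --describe --under-replicated-partitions
```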
Older versions of Kafka also had some replication issues which could cause out-of-sync replicas. A good workaround for these is to force a controller re-election:
zookeeper-shell [ZK_HOST:ZK_PORT] rmr /controller
I have a setup of 4 Kafka brokers. Each partition in each topic in my setup has a replication factor of 2. All partitions are balanced - Leaders and followers are uniformly distributed
This setup has been running for over 6 months
While monitoring the setup via Kafka Manager I see that 8% of my partitions are under-replicated.
All these partitions were assigned to the same set of replicas. And every partition which was assigned to this set of replicas is displayed as under-replicated
Let's call this set of brokers [1,2] - brokers 1 and 2. The ISR for all these partitions is [1] right now.
Both brokers 1 and 2 are up and running. All other partitions have the ISR count as expected.
The script bin/kafka-topics.sh also shows 8% of partitions as under-replicated.
But the JMX metric UnderReplicatedPartitions (read via Jolokia) is 0.
I need help answering:
Is there an issue?
Why is there an inconsistency between the Jolokia metric and the Kafka console?
How can I fix the issue?
I can't say anything about the Jolokia metric, but we experienced the same because we had a "slow" broker that was lagging behind in replicating the data.
"Slow" meaning that the replication requests sometimes breached the broker-wide configuration replica.lag.time.max.ms, which defaults to 10 seconds and is described as:
"If a follower hasn't sent any fetch requests or hasn't consumed up to the leaders log end offset for at least this time, the leader will remove the follower from isr"
Slightly increasing this configuration solved the problem for us.
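For example, in server.properties (the value shown is an assumption; tune it to the lag you actually observe):

```
# server.properties: tolerate followers lagging up to 30s before the
# leader evicts them from the ISR (default is 10000 ms)
replica.lag.time.max.ms=30000
```

Note the trade-off: a larger value hides slow followers from under-replication alerts and lets the ISR contain more stale replicas.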
The relatively scarce documentation for Kafka 0.8 does not describe the expected behaviour for balancing existing topics, partitions and replicas across brokers.
More specifically, what is the expected behaviour on arrival of a broker and on crash of a broker (leader or not) ?
Thanks.
I tested those 2 cases a while ago, not under heavy load: one producer sending 10k messages (just a small string) synchronously to a topic with a replication factor of 2 and 2 partitions, on a cluster of 2 brokers, with 2 consumers. Each component is deployed on a separate machine. What I observed is:
Normal operation: broker 1 is the leader for partition 1 and a replica for partition 2; broker 2 is the leader for partition 2 and a replica for partition 1. Bringing a broker 3 into the cluster does not automatically trigger a rebalance of the partitions.
Broker revival (crashed then rebooted): rebalancing is transparent to the producer and consumers. The rebooting broker replicates the log first and then makes itself available.
Broker crash (leader or not), simulated by kill -9 on any one broker: the producer and consumers freeze until the killed broker's ephemeral node in ZooKeeper expires. After that, operations resume normally.
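The length of that freeze is governed by the broker's ZooKeeper session timeout; for example in server.properties (the value shown is the old default and is an assumption for your version):

```
# server.properties: how long ZooKeeper waits before declaring a broker
# dead and expiring its ephemeral node (lower = faster failover, but
# more risk of false positives under GC pauses or network blips)
zookeeper.session.timeout.ms=6000
```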