Can I create a Kafka topic with an ISR lower than the replication factor? - apache-kafka

Meaning:
rf = 3, but min.insync.replicas = 1, and the producer is set up with acks=1. I want to make sure that only the leader needs to get the data for the write to be considered good.
However, the describe output still shows the full ISR by default, even though I set min.insync.replicas lower with the command. It automatically shows like this:
Topic:isr-set1 PartitionCount:6 ReplicationFactor:3 Configs:min.insync.replicas=1
Topic: isr-set1 Partition: 0 Leader: 3 Replicas: 3,1,5 Isr: 3,1,5
Topic: isr-set1 Partition: 1 Leader: 4 Replicas: 4,5,3 Isr: 4,5,3
Topic: isr-set1 Partition: 2 Leader: 0 Replicas: 0,3,4 Isr: 0,3,4
Topic: isr-set1 Partition: 3 Leader: 2 Replicas: 2,4,0 Isr: 2,4,0
I expect something like:
Topic:isr-set1 PartitionCount:6 ReplicationFactor:3 Configs:min.insync.replicas=1
Topic: isr-set1 Partition: 0 Leader: 3 Replicas: 3,1,5 Isr: 3
Topic: isr-set1 Partition: 1 Leader: 4 Replicas: 4,5,3 Isr: 4
Topic: isr-set1 Partition: 2 Leader: 0 Replicas: 0,3,4 Isr: 0
Topic: isr-set1 Partition: 3 Leader: 2 Replicas: 2,4,0 Isr: 2
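(For reference, a topic with this layout would be created with a command roughly like the following; the broker address is a placeholder, and older releases take --zookeeper instead of --bootstrap-server.)
$ ./bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --topic isr-set1 --partitions 6 --replication-factor 3 \
  --config min.insync.replicas=1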

ISR means in-sync replicas. It can be equal to or lower than the replication factor, but you cannot control it directly.
If you create a topic with replication factor = 3, that means Kafka keeps that topic-partition's log in 3 different places. To keep it in 3 different places, the follower replicas need to stay synced with the leader replica for the partition. In-sync replicas (ISR) refers to this sync state: how many replicas are currently in sync with the leader.
In your case you configured min ISR to 1. That means that when you produce data with acks=all, Kafka checks whether at least min ISR replicas (including the leader) are in sync with the leader. If not, it returns an error and the record is not produced.
If you only need leader acknowledgement (acks=1), min ISR is not considered, and the ISR is not consulted at all when producing. If you don't need replication (in-sync replicas) at all, just create the topic with replication factor = 1, but that is risky because there is no fault tolerance for that topic's partitions.
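As a minimal sketch of that difference (the topic name and broker address are placeholders, and older console producers take --broker-list instead of --bootstrap-server):

# acks=1: only the leader has to acknowledge; min.insync.replicas is not checked
$ ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic isr-set1 --producer-property acks=1

# acks=all: the write fails with NotEnoughReplicas if fewer than
# min.insync.replicas replicas (leader included) are currently in sync
$ ./bin/kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic isr-set1 --producer-property acks=all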
Replication
Kafka replicates the log for each topic's partitions across a
configurable number of servers (you can set this replication factor on
a topic-by-topic basis). This allows automatic failover to these
replicas when a server in the cluster fails so messages remain
available in the presence of failures.
min.insync.replicas
When a producer sets acks to "all" (or "-1"), this configuration
specifies the minimum number of replicas that must acknowledge a write
for the write to be considered successful. If this minimum cannot be
met, then the producer will raise an exception (either
NotEnoughReplicas or NotEnoughReplicasAfterAppend). When used
together, min.insync.replicas and acks allow you to enforce greater
durability guarantees. A typical scenario would be to create a topic
with a replication factor of 3, set min.insync.replicas to 2, and
produce with acks of "all". This will ensure that the producer raises
an exception if a majority of replicas do not receive a write.
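A sketch of that typical scenario (topic name and broker address are placeholders; older tool versions take --zookeeper instead of --bootstrap-server):

$ ./bin/kafka-topics.sh --create --bootstrap-server localhost:9092 \
  --topic durable-topic --partitions 6 --replication-factor 3 \
  --config min.insync.replicas=2

# or add the override to an existing topic
$ ./bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name durable-topic \
  --add-config min.insync.replicas=2

With that in place, a producer using acks=all gets NotEnoughReplicas as soon as fewer than 2 replicas are in sync, while the ISR column in the describe output still lists every replica that is currently caught up.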

Related

Some replicas are not in sync when installing a Kafka cluster from scratch

We are installing a new Apache Kafka cluster, version 2.7, on Linux machines running RHEL 7.9.
The cluster has 5 Kafka machines in total.
The installation is now complete, but we noticed that not all ISRs are in sync.
I want to share all the reasons that may explain what causes a replica to be out of sync:
Slow replica: A follower replica that is consistently not able to catch up with the writes on the leader for a certain period of time. One of the most common reasons for this is an I/O bottleneck on the follower replica causing it to append the copied messages at a rate slower than it can consume from the leader.
Stuck replica: A follower replica that has stopped fetching from the leader for a certain period of time. A replica could be stuck either due to a GC pause or because it has failed or died.
Bootstrapping replica: When the user increases the replication factor of the topic, the new follower replicas are out-of-sync until they are fully caught up to the leader’s log.
But since we are dealing with a brand-new Kafka cluster, I wonder whether the out-of-sync ISR problem may be related to some parameters in the Kafka server.properties that are not set well.
Here is an example with the __consumer_offsets topic.
We can see many missing ISRs:
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:3 Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 1003 Replicas: 1003,1001,1002 Isr: 1003,1001,1002
Topic: __consumer_offsets Partition: 1 Leader: 1001 Replicas: 1001,1002,1003 Isr: 1001,1003,1002
Topic: __consumer_offsets Partition: 2 Leader: 1003 Replicas: 1002,1003,1001 Isr: 1003,1001
Topic: __consumer_offsets Partition: 3 Leader: 1003 Replicas: 1003,1002,1001 Isr: 1003,1001
Topic: __consumer_offsets Partition: 4 Leader: 1001 Replicas: 1001,1003,1002 Isr: 1001,1003
Topic: __consumer_offsets Partition: 5 Leader: 1001 Replicas: 1002,1001,1003 Isr: 1003,1001,1002
Topic: __consumer_offsets Partition: 6 Leader: 1003 Replicas: 1003,1001,1002 Isr: 1003,1001,1002
Topic: __consumer_offsets Partition: 7 Leader: 1001 Replicas: 1001,1002,1003 Isr: 1001,1003,1002
Topic: __consumer_offsets Partition: 8 Leader: 1003 Replicas: 1002,1003,1001 Isr: 1003,1001
Topic: __consumer_offsets Partition: 9 Leader: 1003 Replicas: 1003,1002,1001 Isr: 1003,1001
Topic: __consumer_offsets Partition: 10 Leader: 1001 Replicas: 1001,1003,1002 Isr: 1001,1003
Topic: __consumer_offsets Partition: 11 Leader: 1001 Replicas: 1002,1001,1003 Isr: 1003
Here is an example of what we have in server.properties.
After googling for a while, we have not found what could avoid the problem of ISRs that are not in sync:
auto.create.topics.enable=false
auto.leader.rebalance.enable=true
background.threads=10
log.retention.bytes=-1
log.retention.hours=12
delete.topic.enable=true
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
log.dir=/var/kafka/kafka-data
log.flush.interval.messages=9223372036854775807
log.flush.interval.ms=1000
log.flush.offset.checkpoint.interval.ms=60000
log.flush.scheduler.interval.ms=9223372036854775807
log.flush.start.offset.checkpoint.interval.ms=60000
compression.type=producer
log.roll.jitter.hours=0
log.segment.bytes=1073741824
log.segment.delete.delay.ms=60000
message.max.bytes=1000012
min.insync.replicas=1
num.io.threads=8
num.network.threads=3
num.recovery.threads.per.data.dir=1
num.replica.fetchers=1
offset.metadata.max.bytes=4096
offsets.commit.required.acks=-1
offsets.commit.timeout.ms=5000
offsets.load.buffer.size=5242880
offsets.retention.check.interval.ms=600000
offsets.retention.minutes=10080
offsets.topic.compression.codec=0
offsets.topic.num.partitions=50
offsets.topic.replication.factor=3
offsets.topic.segment.bytes=104857600
queued.max.requests=500
quota.consumer.default=9223372036854775807
quota.producer.default=9223372036854775807
replica.fetch.min.bytes=1
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.lag.time.max.ms=10000
replica.socket.receive.buffer.bytes=65536
replica.socket.timeout.ms=30000
request.timeout.ms=30000
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
transaction.max.timeout.ms=900000
transaction.state.log.load.buffer.size=5242880
transaction.state.log.min.isr=2
transaction.state.log.num.partitions=50
transaction.state.log.replication.factor=3
transaction.state.log.segment.bytes=104857600
transactional.id.expiration.ms=604800000
unclean.leader.election.enable=false
zookeeper.connection.timeout.ms=600000
zookeeper.max.in.flight.requests=10
zookeeper.session.timeout.ms=600000
zookeeper.set.acl=false
broker.id.generation.enable=true
connections.max.idle.ms=600000
connections.max.reauth.ms=0
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
controller.socket.timeout.ms=30000
default.replication.factor=2
delegation.token.expiry.time.ms=86400000
delegation.token.max.lifetime.ms=604800000
delete.records.purgatory.purge.interval.requests=1
fetch.purgatory.purge.interval.requests=1000
group.initial.rebalance.delay.ms=3000
group.max.session.timeout.ms=1800000
group.max.size=2147483647
group.min.session.timeout.ms=6000
log.cleaner.backoff.ms=15000
log.cleaner.dedupe.buffer.size=134217728
log.cleaner.delete.retention.ms=86400000
log.cleaner.enable=true
log.cleaner.io.buffer.load.factor=0.9
log.cleaner.io.buffer.size=524288
log.cleaner.io.max.bytes.per.second=1.7976931348623157e308
log.cleaner.max.compaction.lag.ms=9223372036854775807
log.cleaner.min.cleanable.ratio=0.5
log.cleaner.min.compaction.lag.ms=0
log.cleaner.threads=1
log.cleanup.policy=delete
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.message.timestamp.difference.max.ms=9223372036854775807
log.message.timestamp.type=CreateTime
log.preallocate=false
log.retention.check.interval.ms=300000
max.connections=2147483647
max.connections.per.ip=2147483647
max.incremental.fetch.session.cache.slots=1000
num.partitions=1
producer.purgatory.purge.interval.requests=1000
queued.max.request.bytes=-1
replica.fetch.backoff.ms=1000
replica.fetch.max.bytes=1048576
replica.fetch.response.max.bytes=10485760
reserved.broker.max.id=1500
transaction.abort.timed.out.transaction.cleanup.interval.ms=60000
transaction.remove.expired.transaction.cleanup.interval.ms=3600000
zookeeper.sync.time.ms=2000
broker.rack=/default-rack
We would appreciate suggestions on how to get the replicas back in sync.
links
Fixing under replicated partitions in kafka
https://emilywibberley.com/blog/kafka-how-to-fix-out-of-sync-replicas/
What is a right value for replica.lag.time.max.ms?
https://strimzi.io/blog/2021/06/08/broker-tuning/
https://community.cloudera.com/t5/Support-Questions/Kafka-Replica-out-of-sync-for-over-24-hrs/m-p/82922
https://hevodata.com/learn/kafka-replication/
Here are the options that we are considering (only as suggestions, not solutions):
restart the Kafka brokers, one broker at a time
remove the out-of-sync replica with rm -rf (for example rm -rf TEST_TOPIC_1) and hope that Kafka will recreate the replica so that it ends up in sync
try to use kafka-reassign-partitions
maybe the ISR will get in sync by itself after some time?
increase replica.lag.time.max.ms to a higher value such as 1 day and restart the brokers
What is the ISR?
The ISR is simply all the replicas of a partition that are "in-sync" with the leader. The definition of "in-sync" depends on the topic configuration, but by default, it means that a replica is or has been fully caught up with the leader in the last 10 seconds. The setting for this time period is: replica.lag.time.max.ms and has a server default which can be overridden on a per topic basis.
At a minimum, the ISR will consist of the leader replica and any additional follower replicas that are also considered in-sync. Followers replicate data from the leader by sending fetch requests periodically, by default every 500ms.
If a follower fails, it will cease sending fetch requests and, after the default 10 seconds, will be removed from the ISR. Likewise, if a follower slows down, perhaps due to a network-related issue or constrained server resources, then as soon as it has been lagging behind the leader for more than 10 seconds it is removed from the ISR.
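A quick way to see which partitions currently have followers missing from the ISR is the --under-replicated-partitions filter of kafka-topics.sh (the broker address is a placeholder; --zookeeper works the same way on older versions):

$ ./bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 \
  --under-replicated-partitions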
Some other important related parameters to be configured are:
min.insync.replicas: Specifies the minimum number of replicas that must acknowledge a write for the write to be considered successful.
offsets.retention.check.interval.ms: Frequency at which to check for stale Offsets.
offsets.topic.segment.bytes: This should be kept relatively small in order to facilitate faster Log Compaction and Cache Loads.
replica.lag.time.max.ms: If the follower has not consumed the Leaders log OR sent fetch requests, for at least this much time, it is removed from the ISR.
replica.fetch.wait.max.ms: Max wait time for each fetcher request issued by follower replicas, must be less than the replica.lag.time.max.ms to avoid shrinking of ISR.
transaction.max.timeout.ms: In case a client requests a timeout greater than this value, it’s not allowed so as to not stall other consumers.
zookeeper.session.timeout.ms: Zookeeper session timeout.
zookeeper.sync.time.ms: How far a ZooKeeper follower can be behind the ZooKeeper leader; setting this too high can result in an ISR that has potentially many out-of-sync nodes.
The time-related settings aren't what you want; if you increase them, it just means it will take longer for Kafka to show you the problem, while your data actually gets further behind. For a brand-new cluster, you should have no out-of-sync replicas until you start adding load.
Increasing num.replica.fetchers and num.network.threads will allow the brokers to read replicas over the network faster. At most, you can try setting these to the number of CPU cores on the machines.
Smaller segment bytes can be used to speed up replication, but it's better to set that on a per-topic basis, and only for compacted topics, rather than adjusting it cluster-wide.
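As a hedged sketch (the value 4 is arbitrary and should be sized to the available cores, and the broker address is a placeholder), num.replica.fetchers can be raised in server.properties followed by a rolling restart, or, on brokers that support dynamic configuration, applied cluster-wide without a restart:

$ ./bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type brokers --entity-default \
  --add-config num.replica.fetchers=4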

How do I avoid data loss when using Kafka over Kubernetes and one of the nodes fails?

My application runs over a Kubernetes cluster of 3 nodes and uses Kafka to stream data.
I am trying to check my system's ability to recover from node failure, so I deliberately fail one of the nodes for 1 minute.
Around 50% of the time, I experience the loss of a single data record after the node failure.
If the controller Kafka broker was running on the failed node, I see that a new controller broker was elected as expected.
When the data loss occurs, I see the following error in the new controller broker's log:
ERROR [Controller id=2 epoch=13] Controller 2 epoch 13 failed to
change state for partition __consumer_offsets-45 from OfflinePartition
to OnlinePartition (state.change.logger) [controller-event-thread]
I am not sure if that's the problem, but searching the web for information about this error made me suspect that I need to configure Kafka to have more than 1 replica for each topic.
This is what my topics/partitions/replicas configuration looks like:
My questions:
Is my suspicion that more replicas are required correct?
If yes, how do I increase the number of replicas for a topic? I played around with a few broker parameters, such as default.replication.factor and replication.factor, but I did not see the number of replicas change.
If no, what is the meaning of this error log?
Thanks!
Yes, if the broker hosting the single replica goes down, then you can expect that topic's partitions to become unavailable. If you have unclean leader election disabled, however, you shouldn't lose data that has already been persisted to the broker.
To modify existing topics, you must use the kafka-reassign-partitions tool, not any of the broker settings, as those only apply to brand-new topics.
Kafka | Increase replication factor of multiple topics
Ideally, you should disable auto topic creation, as well, to force clients to use Topic CRD resources in Strimzi that include a replication factor, and you can use other k8s tools to verify that they have values greater than 1.
Yes, you're right, you need to set the replication factor to more than 1 to be able to sustain broker-level failures.
Once you set this value as the default, new topics will be created with the configured number of replicas. For existing topics, you need to follow the steps below:
Describe the topic
$ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic one
Topic: one PartitionCount: 3 ReplicationFactor: 1 Configs: segment.bytes=1073741824
Topic: one Partition: 0 Leader: 1 Replicas: 1 Isr: 1
Topic: one Partition: 1 Leader: 0 Replicas: 0 Isr: 0
Topic: one Partition: 2 Leader: 2 Replicas: 2 Isr: 2
Create the json file with the topic reassignment details
$ cat >>increase.json <<EOF
{
"version":1,
"partitions":[
{"topic":"one","partition":0,"replicas":[0,1,2]},
{"topic":"one","partition":1,"replicas":[1,0,2]},
{"topic":"one","partition":2,"replicas":[2,1,0]},
]
}
EOF
Execute this reassignment plan
$ ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase.json --execute
Current partition replica assignment
{"version":1,"partitions":[{"topic":"one","partition":0,"replicas":[0,1,2],"log_dirs":["any","any"]},{"topic":"one","partition":1,"replicas":[1,0,2],"log_dirs":["any","any"]},{"topic":"one","partition":2,"replicas":[2,1.0],"log_dirs":["any","any"]}]}
Save this to use as the --reassignment-json-file option during rollback
Successfully started partition reassignments for one-0,one-1,one-2
Describe the topic again
$ ./bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic one
Topic: one PartitionCount: 3 ReplicationFactor: 3 Configs: segment.bytes=1073741824
Topic: one Partition: 0 Leader: 0 Replicas: 0,1,2 Isr: 0,1,2
Topic: one Partition: 1 Leader: 1 Replicas: 1,0,2 Isr: 1,0,2
Topic: one Partition: 2 Leader: 2 Replicas: 2,1,0 Isr: 2,1,0
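Optionally, the same JSON file can be reused to confirm that the reassignment has completed:
$ ./bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file increase.json --verify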

How do I change the topic leader or remove a partition after some brokers go down?

We have a Kafka cluster with 4 brokers and some topics with replication factor 1 and 10 partitions.
At one point, 2 of our 4 Kafka servers failed.
So now we have 2 brokers with the same topics.
When I run the command
./kafka_topics.sh --zookeeper localhost:2181 --describe
I get this:
Topic:outcoming-notification-error-topic PartitionCount:10 ReplicationFactor:1 Configs:
Topic: outcoming-error-topic Partition: 0 Leader: 2 Replicas: 2 Isr: 2
Topic: outcoming-error-topic Partition: 1 Leader: 3 Replicas: 3 Isr: 3
Topic: outcoming-error-topic Partition: 2 Leader: 4 Replicas: 4 Isr: 4
Topic: outcoming-error-topic Partition: 3 Leader: 1 Replicas: 1 Isr: 1
Topic: outcoming-error-topic Partition: 4 Leader: 2 Replicas: 2 Isr: 2
Topic: outcoming-error-topic Partition: 5 Leader: 3 Replicas: 3 Isr: 3
Topic: outcoming-error-topic Partition: 6 Leader: 4 Replicas: 4 Isr: 4
Topic: outcoming-error-topic Partition: 7 Leader: 1 Replicas: 1 Isr: 1
Topic: outcoming-error-topic Partition: 8 Leader: 2 Replicas: 2 Isr: 2
Topic: outcoming-error-topic Partition: 9 Leader: 3 Replicas: 3 Isr: 3
How can I delete leaders 2...4? Or maybe I need to delete the partitions for those leaders, but how?
UPD:
We also use kafka_exporter to monitor Kafka with Prometheus. After the 2 brokers went down, we get these errors in the kafka_exporter log:
level=error msg="Cannot get oldest offset of topic outcoming-error-topic partition 10: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes." source="kafka_exporter.go:296"
You could use Kafka's kafka-reassign-partitions.sh to do that. There are two ways: one is to generate a proposal of new assignments, and the other is to manually specify the replicas (and therefore the preferred leader) for specific partitions.
1. Generate a proposal
The first method, as specified on the kafka docs, follows this logic:
1.1 Generate Proposed partition reassignment configuration
First, you should create a json file such as the one provided in the link. Let's name it topics.json.
{
"topics": [{"topic": "foo1"},
{"topic": "foo2"}],
"version":1
}
This tells Kafka which topics you want to relocate partitions from. In the example, Kafka is asked to make a proposal for topics foo1 and foo2.
With that json, call the tool and set the active broker list in the command:
kafka-reassign-partitions.sh --zookeeper $ZK_HOSTS \
--topics-to-move-json-file topics.json --broker-list "1,2,3,4,5" --generate
This will output Kafka's proposal, which you can save into another .json file. For example:
{
"version":1,
"partitions":[{"topic":"foo1","partition":2,"replicas":[5,6]},
{"topic":"foo1","partition":0,"replicas":[5,6]},
{"topic":"foo2","partition":2,"replicas":[5,6]},
{"topic":"foo2","partition":0,"replicas":[5,6]},
{"topic":"foo1","partition":1,"replicas":[5,6]},
{"topic":"foo2","partition":1,"replicas":[5,6]}]
}
You can manually modify some of the assignments if you want to (or if you think it's the proper thing to do, as the tool is not perfect). Save the json into a file, for example reassign-example.json, which will be used in the next step.
1.2. Execute the Proposed partition reassignment
Let's make Kafka execute the proposal and move the partitions. For that, execute:
bin/kafka-reassign-partitions.sh --zookeeper $ZK_HOSTS \
--reassignment-json-file reassign-example.json --execute
This will execute the partition movement defined on the reassign-example.json file.
2. Manual specification
The second method is simpler, but you must manually identify the partitions you want to reassign. For example, if you want partition 1 of topic XXX to move to brokers 5 and 6, you could create a json file (manual-reassign.json) such as:
{"version":1,"partitions":[{"topic":"XXX","partition":1,"replicas":[5,6]}]}
The way it's launched is the same as in the previous method:
bin/kafka-reassign-partitions.sh --zookeeper $ZK_HOSTS \
--reassignment-json-file manual-reassign.json --execute
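To list the partitions that currently have no leader at all, for example before and after running such a reassignment, kafka-topics.sh can filter the describe output (this is a sketch, with $ZK_HOSTS as above):
bin/kafka-topics.sh --zookeeper $ZK_HOSTS --describe --unavailable-partitions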

Kafka monitoring in cluster environment

I have a Kafka cluster (3 machines, with 1 ZooKeeper node and 1 broker running on each machine).
I am using kafka_exporter to monitor the consumer lag metric, and it works fine in the normal case.
But when I kill 1 broker, Prometheus cannot get metrics from http://machine1:9308/metric (the kafka_exporter metrics endpoint), because it takes a long time to get the data (1.5 minutes), so it times out.
Now, if I restart kafka_exporter, I see errors like:
Cannot get leader of topic __consumer_offsets partition 20: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes
When I run the command: kafka-topics.bat --describe --zookeeper machine1:2181,machine2:2181,machine3:2181 --topic __consumer_offsets
The results are:
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:1 Configs:compression.type=producer,cleanup.policy=compact,segment.bytes=104857600
Topic: __consumer_offsets Partition: 0 Leader: -1 Replicas: 1 Isr: 1
Topic: __consumer_offsets Partition: 1 Leader: 2 Replicas: 2 Isr: 2
Topic: __consumer_offsets Partition: 49 Leader: 2 Replicas: 2 Isr: 2
Is this a configuration error? How can I get the consumer lag in this case? Is "Leader: -1" an error? If I shut down machine 1 forever, will the rest still work fine?
The leader being -1 means that there is no other broker in the cluster that has a copy of the data for the partition.
The problem in your case is that the replication factor for your topic __consumer_offsets is 1, which means that there is only one broker that hosts the data of any partition in the topic. If you lose any one of the brokers, all the partitions on the broker become unavailable resulting in the topic becoming unavailable. So, your kafka_exporter will fail to read from this topic.
The fix for this, if you want to continue exporting consumer offsets after a broker loss, is to reconfigure the __consumer_offsets topic to have a replication factor greater than 1.
Advised config: replication factor 3, min.insync.replicas 2.
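As a hedged sketch of that advice (the broker ids 1, 2, 3 and the bootstrap address machine1:9092 are placeholders that must match the actual cluster, and the real file needs one entry per partition, 0 through 49), the replication factor is raised with kafka-reassign-partitions.sh and the min.insync.replicas override is then added to the existing topic:

cat > increase-offsets.json <<EOF
{
  "version": 1,
  "partitions": [
    {"topic": "__consumer_offsets", "partition": 0, "replicas": [1,2,3]},
    {"topic": "__consumer_offsets", "partition": 1, "replicas": [2,3,1]},
    {"topic": "__consumer_offsets", "partition": 2, "replicas": [3,1,2]}
  ]
}
EOF

# execute the reassignment (use --zookeeper on older releases)
bin/kafka-reassign-partitions.sh --bootstrap-server machine1:9092 --reassignment-json-file increase-offsets.json --execute

# then add the topic-level override
bin/kafka-configs.sh --bootstrap-server machine1:9092 --alter --entity-type topics --entity-name __consumer_offsets --add-config min.insync.replicas=2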

Unfair Leader election in Kafka - Same leader for all partitions

I have a Kafka cluster with 5 brokers.
On scaling it down to 3 brokers, leader election took place several times.
Finally only one broker became the leader for all the 3 partitions of one of my topics.
Topic: test PartitionCount:3 ReplicationFactor:3
Topic: test Partition: 0 Leader: 2 Replicas: 2,0,1 Isr: 2,1,0
Topic: test Partition: 1 Leader: 2 Replicas: 3,1,2 Isr: 2,1
Topic: test Partition: 2 Leader: 2 Replicas: 4,2,3 Isr: 2
2,1,0 are the brokers that are running.
Partition 0 is replicated on 2, 0, 1. All of these brokers are available, so isr=2,1,0.
Partition 1 is replicated on 3, 1, 2, but 3 is a removed broker, so isr=2,1.
Partition 2 is replicated on 4, 2, 3, but both 4 and 3 are removed brokers, so isr=2.
Note that only 2 has been elected as the leader. Even if we assume that it has the highest watermark among the brokers, all the ISR members for a given partition should be in sync and therefore have the same offsets for that partition (otherwise they would have been removed from the ISR).
I have waited for a long time (there is a timeout after which a replica that cannot keep up is removed from the ISR), but the leader assignment is still the same.
Leaders can be evenly distributed (load balanced).
For example, partition-0 leader can be 0
partition 1 leader can be 1
partition 2 leader can be 2
Why is this not so?
Note: I did not enable unclean leader election. It is the default value only.
If we assume that 0 and 1 came up after the leader election happened, why is there no re-election then? If the ISRs are updated, ideally the leaders should be too, shouldn't they?
That is, if Kafka knows that 0 and 1 are up and have in-sync replicas, it SHOULD have conducted one more leader election.
Is there any specific reason why that is not the case?
Kafka has the concept of a preferred leader, meaning that if possible it will elect that replica to serve as the leader. The first replica listed in the replicas list is the preferred leader. Now looking at the current cluster state:
Topic: test Partition: 0 Leader: 2 Replicas: 2,0,1 Isr: 2,1,0
Topic: test Partition: 1 Leader: 2 Replicas: 3,1,2 Isr: 2,1
Topic: test Partition: 2 Leader: 2 Replicas: 4,2,3 Isr: 2
Partition 0, broker 2 is the preferred leader and is the current leader
Partition 1, broker 3 is the preferred leader but it's not in-sync, so a random leader is picked between 2 and 1
Partition 2, broker 4 is the preferred leader but again 4 is not in-sync. Only 2 is in-sync, so it's elected.
If all your brokers were to go back in-sync, by default Kafka would re-elect the preferred leaders (or it can be forced using the kafka-preferred-replica-election.sh tool, see Balancing leadership).
If the missing brokers are not going to be restarted, you can change the replica assignments for the partitions, to balance the leadership manually using the kafka-reassign-partitions.sh tool. Just make sure you put the preferred leader as the first entry in the replicas list.
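For example, once the preferred leaders are back in the ISR (or after reassigning so that a live broker is listed first), a preferred leader election can be triggered on demand. On recent Kafka versions the kafka-leader-election.sh tool does this (broker address is a placeholder); older releases use kafka-preferred-replica-election.sh with --zookeeper instead:

bin/kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions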