I have two questions.
I wonder how the leader partition and its follower partitions are synchronized.
If the leader partition receives a message, does the leader broadcast it to the follower partitions over some background communication channel? It seems the Kafka config file does not include any such features (synchronization port info, etc.).
Assume the following architecture:
Two brokers - two partitions - two replicas
Broker#1 - leader partition#1, follower partition#2
Broker#2 - leader partition#2, follower partition#1
Messages are sent round-robin to these two brokers...
If message#1 goes to Broker#1 (partition#1) and Broker#1 is shut down,
does Broker#2 then open its follower copy of partition#1, so that Broker#2 ends up with two active leader partitions (and can deliver message#1)?
This is already handled by Kafka. You only need to define the replication factor for a topic. According to the Kafka docs:
The partitions of the log are distributed over the servers in the
Kafka cluster with each server handling data and requests for a share
of the partitions. Each partition is replicated across a configurable
number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or
more servers which act as "followers". The leader handles all read and
write requests for the partition while the followers passively
replicate the leader. If the leader fails, one of the followers will
automatically become the new leader. Each server acts as a leader for
some of its partitions and a follower for others so load is well
balanced within the cluster.
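There is no separate synchronization port to configure: once a topic has a replication factor greater than one, the follower brokers fetch new records from the leader themselves over the normal inter-broker listener. As a minimal sketch (assuming the Java AdminClient; the topic name and broker addresses are made up), creating a topic whose two partitions are each replicated on two brokers could look like this:

    // Rough sketch, not from the answer above: topic name, partition count and
    // broker addresses are hypothetical.
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    import java.util.Collections;
    import java.util.Properties;

    public class CreateReplicatedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // 2 partitions, replication factor 2: each partition gets a leader on one
                // broker and a follower on the other, and Kafka replicates between them.
                NewTopic topic = new NewTopic("my-topic", 2, (short) 2);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }

Which broker leads which partition, and how the followers catch up, is then handled entirely by the brokers.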
Your question is not entirely clear. I believe my answer to this question should shed some light on Kafka partitions, distribution of messages, and fault tolerance.
For example, I have a topic that has 2 partitions and a producer using the default partitioner (round-robin, I assume) writes to the topic. At some point, partition 1 becomes unavailable because all of the replica brokers go offline. Assuming the messages have no specified keys, will the producer resend the messages to partition 2, or does it simply get stuck?
That is an interesting question and we should look at it from a broader (cluster) perspective.
At some point, partition 1 becomes unavailable because all of the replica brokers go offline.
I see the following scenarios:
1. All replica brokers of partition one are different from the replica brokers of partition two.
2. All replica brokers of partition one are the same as for partition two.
3. Some replica brokers are the same as for partition two.
In scenario "1" it means you still have enough brokers alive as the replication factor is a topic-wide not a partition-based configuration. In that case as soon as the first broker goes down its data will be moved to another broker to ensure that your partition always has enough in-sync replicas.
In scenarios "2" both partitions become unavailable and your KafkaProducer will eventually time out. Now, it depends if you have other brokers that are alive and can take on the data of the partitions.
In scenario "3" the dead replicas would be shifted to running brokers. During that time the KafkaProducer will only write to partition 2 as this is the only available partition in the topic. As soon as partition 1 has enough in-sync replicas the producer will start producing again to both partitions.
Actually, I could think of many more scenarios. If you need a more concrete answer you need to specify
how many brokers you have,
what your replication factor actually is and
in what timely order which broker goes down.
Assuming the messages have no specified keys, will the producer resend the messages to partition 2?
The KafkaProducer will not re-send the data that was previously sent to partition 1 to partition 2. Whatever was written to partition 1 will stay in partition 1.
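To see which of the scenarios above you are actually in, it helps to check where each partition's leader and in-sync replicas currently live. A hedged sketch (assuming the Java AdminClient, version 3.1+ for allTopicNames(); topic name and broker address are hypothetical):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    import java.util.Collections;
    import java.util.Properties;

    public class ShowPartitionLeaders {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription description = admin
                        .describeTopics(Collections.singleton("my-topic"))
                        .allTopicNames().get()
                        .get("my-topic");

                // Print the current leader and in-sync replicas for every partition.
                for (TopicPartitionInfo p : description.partitions()) {
                    System.out.printf("partition %d: leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr());
                }
            }
        }
    }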
Suppose default.replication.factor is set to one. For the sake of simplicity, let's say we have a topic with only one partition. We have a Kafka setup with three brokers. The topic we are interested in lives on the broker that just went down. Obviously, we won't have access to messages on this topic until the broker is brought back, but my question is: what will happen to messages for this topic that come from producers while the broker is down? Will they be rejected?
The producer will not be able to locate the leader replica of the partition, because there will be no available leader and no ISRs (in-sync replicas) at failover time. There will be an error, but I'm not exactly sure it surfaces on the send itself, especially if you're batching sends.
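As an illustration (my own sketch, not from the answer above; topic name, broker address and timeout values are hypothetical): with the Java producer, the failure may show up synchronously on send() while it blocks waiting for metadata, or later in the callback / returned Future once the delivery timeout expires.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class SendWithErrorHandling {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Fail faster than the defaults if no leader becomes available
            // (delivery.timeout.ms must be >= request.timeout.ms + linger.ms).
            props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "5000");
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "10000");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "value"), (metadata, exception) -> {
                    if (exception != null) {
                        // e.g. a TimeoutException if the partition has no leader for too long
                        System.err.println("send failed: " + exception);
                    } else {
                        System.out.println("written to partition " + metadata.partition());
                    }
                });
            }
        }
    }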
Kafka has the concept of an in-sync replica set, which is the set of nodes that aren't too far behind the leader.
What happens if the network cleanly partitions so that a minority containing the leader is on one side, and a majority containing the other in-sync nodes is on the other side?
The minority/leader-side presumably thinks that it lost a bunch of nodes, reduces the ISR size accordingly, and happily carries on.
The other side probably thinks that it lost the leader, so it elects a new one and happily carries on.
Now we have two leaders in the same cluster, accepting writes independently. In a system that requires a majority of nodes to proceed after a partition, the old leader would step down and stop accepting writes.
What happens in this situation in Kafka? Does it require majority vote to change the ISR set? If so, is there a brief data loss until the leader side detects the outages?
I haven't tested this, but I think the accepted answer is wrong and Lars Francke is correct about the possibility of split-brain.
Zookeeper quorum requires a majority, so if the ZK ensemble partitions, at most one side will have a quorum.
Being a controller requires having an active session with ZK (ephemeral znode registration). If the current controller is partitioned away from ZK quorum, it should voluntarily stop considering itself a controller. This should take at most zookeeper.session.timeout.ms = 6000. Brokers still connected to ZK quorum should elect a new controller among themselves. (based on this: https://stackoverflow.com/a/52426734)
Being a topic-partition leader also requires an active session with ZK. Leader that lost a connection to ZK quorum should voluntarily stop being one. Elected controller will detect that some ex-leaders are missing and will assign new leaders from the ones in ISR and still connected to ZK quorum.
Now, what happens to producer requests received by the partitioned ex-leader during the ZK timeout window? There are some possibilities.
If the producer's acks = all and the topic's min.insync.replicas = replication.factor, then all ISRs should have exactly the same data. The ex-leader will eventually reject in-progress writes and producers will retry them. The newly elected leader will not have lost any data. On the other hand, it won't be able to serve any write requests until the partition heals. It will be up to producers to decide whether to reject client requests or keep retrying in the background for a while.
Otherwise, it is very probable that the new leader will be missing up to zookeeper.session.timeout.ms + replica.lag.time.max.ms = 16000 ms worth of records, and they will be truncated from the ex-leader after the partition heals.
Let's say you expect network partitions to last longer than you are comfortable being read-only for.
Something like this can work:
you have 3 availability zones and expect that at most 1 zone will be partitioned from the other 2
in each zone you have a Zookeeper node (or a few), so that 2 zones combined can always form a majority
in each zone you have a bunch of Kafka brokers
each topic has replication.factor = 3, one replica in each availability zone, min.insync.replicas = 2
producers' acks = all
This way there should be two Kafka ISRs on the ZK quorum side of the network partition, at least one of them fully up to date with the ex-leader. So there is no data loss on the brokers, and the partition stays available for writes from any producers that can still connect to the winning side.
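To make the setup above concrete, here is a minimal sketch (assuming the Java clients; the zone host names, topic name and partition count are made up, and the brokers are assumed to set broker.rack per availability zone so that rack-aware assignment spreads the three replicas across zones):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    public class ThreeZoneSetup {
        public static void main(String[] args) throws Exception {
            String bootstrap = "zone-a-broker:9092,zone-b-broker:9092,zone-c-broker:9092";

            // Topic with replication.factor = 3 and min.insync.replicas = 2.
            Properties adminProps = new Properties();
            adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
            try (AdminClient admin = AdminClient.create(adminProps)) {
                NewTopic topic = new NewTopic("events", 6, (short) 3)
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }

            // Producer with acks = all: a write succeeds only once all in-sync replicas have it.
            Properties producerProps = new Properties();
            producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
            producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            producerProps.put(ProducerConfig.ACKS_CONFIG, "all");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("events", "key", "value")).get();
            }
        }
    }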
In a Kafka cluster, one of the brokers is elected to serve as the controller.
Among other things, the controller is responsible for electing new leaders. The Replica Management section covers this briefly: http://kafka.apache.org/documentation/#design_replicamanagment
Kafka uses Zookeeper to try to ensure there's only one controller at a time. However, the situation you described could still happen, splitting both the Zookeeper ensemble (assuming both sides can still have quorum) and the Kafka cluster in two, resulting in two controllers.
In that case, Kafka has a number of configurations to limit the impact:
unclean.leader.election.enable: False by default, this is used to prevent replicas that were not in-sync from ever becoming leaders. If no available replicas are in-sync, Kafka marks the partition as offline, preventing data loss.
replication.factor and min.insync.replicas: For example, if you set them to 3 and 2 respectively, in case of a "split-brain" you can prevent producers from sending records to the minority side, provided they use acks=all.
See also KIP-101 for the details about handling logs that have diverged once the cluster is back together.
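If you want those settings as topic-level overrides rather than broker-wide defaults, a rough sketch with the Java AdminClient (topic name and broker address are hypothetical) could look like this:

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class TightenTopicConfigs {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                Collection<AlterConfigOp> ops = List.of(
                        // keep out-of-sync replicas from ever becoming leader
                        new AlterConfigOp(new ConfigEntry("unclean.leader.election.enable", "false"),
                                AlterConfigOp.OpType.SET),
                        // with acks=all, writes then need at least 2 in-sync replicas
                        new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"),
                                AlterConfigOp.OpType.SET));
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }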
We are in the process of designing a Kafka cluster (at least 3 nodes) that will process events from an array of web servers. Since the logs are largely identical, we are planning to create a single topic only (say, webevents).
We expect a lot of traffic from the servers. Since there is a single topic, there will be a single leader broker. In such a case, how will the cluster balance the high traffic? All write requests will be routed to the leader broker at all times and the other nodes might be underutilized.
Does an external hardware load balancer help solve this problem? Alternatively, can a Kafka configuration help distribute write requests evenly on a one-topic cluster?
Thanks,
Sharod
Short answer: a topic may have multiple partitions and each partition, not the topic, has a leader. Leaders are evenly distributed among brokers. So, if you have multiple partitions in your topic, you will have multiple leaders and your writes will be evenly distributed among brokers.
You will have a single topic with a lot of partitions, and you can replicate partitions for high availability/durability of your data.
Each broker will hold an evenly distributed number of partitions, and each of these partitions can be either a leader or a replica for the topic. Kafka producers (the Kafka clients running in your web servers, in your case) write to the single leader of a given partition; this provides a means of load balancing production so that each write can be serviced by a separate broker and machine.
Producers do the load balancing by selecting the target partition for each message. It can be done based on the message key, so all messages with the same key go to the same partition, or in a round-robin fashion if you don't set a message key.
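As a small illustration (my own sketch, not from the original answer; topic name and broker address are made up): records that share a key always land in the same partition, while records with a null key are distributed by the producer itself. Note that newer clients use a "sticky" strategy for null keys, filling a batch for one partition before moving on to the next, rather than strict per-record round-robin.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class PartitionSelection {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // keyed: "server-42" always hashes to the same partition
                RecordMetadata keyed = producer
                        .send(new ProducerRecord<>("webevents", "server-42", "GET /index")).get();
                // unkeyed: the producer picks the partition (round-robin or sticky, by version)
                RecordMetadata unkeyed = producer
                        .send(new ProducerRecord<>("webevents", null, "GET /about")).get();

                System.out.println("keyed record -> partition " + keyed.partition());
                System.out.println("unkeyed record -> partition " + unkeyed.partition());
            }
        }
    }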
I learned this from http://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka:
Our goal was to support replication in a Kafka cluster within a single datacenter, where network partitioning is rare
In distributed systems, I think "partitioning" is a basic assumption, so I don't understand how Kafka can guarantee availability without partition tolerance when only a server node has failed. Or am I missing something?
I think you may be confusing the sharding sense of "partitioning" with network partitions.
Kafka does indeed provide sharding and replication. Kafka elects a unique leader for each partition of each topic. All writes for a topic partition go through the leader. This is relevant to the documentation you cited indicating that Kafka favors availability over partition tolerance.
What is meant by network partitions is a break in communication between servers. Network communication failures are more rare in a LAN than in a WAN, so Kafka was architected to provide consistency except in cases where a network partition occurs.
In the event of a network partition, Kafka's replicas may diverge from one another, with nodes on both sides of the partition potentially accepting writes. The reason this may occur is that when a network partition happens, nodes on each side of the partition can perceive nodes on the other side as having failed, when in fact only the link between them failed. This means that each side of the network partition may elect a new leader for some topic partitions, and therefore each side of the network partition can accept writes for some topic partitions. Once the network partition heals (the network is fixed), writes made on one side of the partition may overwrite writes made on the other side.