I'm a little confused about how the Raft log drives the state machine of my system
Let's take Kafka as an example,
Every partition has a leader and replicas; does that mean there's a Raft instance per partition?
If yes, then every time a producer sends a record, should it be persisted in the Raft log before being appended to the Kafka partition?
If no, then how does Kafka keep the leader and replicas in sync during failures?
Consensus is confusing me
There are multiple consensus points, IMO, and I'm unsure where Raft is needed and where it's not required, e.g. cluster leader election, partition leader election; even replication is a form of consensus to me
Other doubts regarding Raft
It is said that followers should update their state machine only after a commit is done. What if a follower is isolated and ends up committing the wrong event? Is this even possible? If yes, should the state machine have logic for rollbacks? From my experience with Elasticsearch, there should always be a rollback mechanism for obsolete events
I'm trying to implement something similar to Kafka for learning purposes, and I'm confused to the point where I don't even know how to begin implementing one
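For the "apply only after commit" part, here's a minimal sketch of what a partition could look like if each partition were its own Raft group. This is a hypothetical learning design (all names are made up), not how Kafka itself replicates partition data: the produced record is the Raft log entry, and it only becomes readable once the leader knows a majority has it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each partition is its own Raft group, and the
// produced record IS the Raft log entry. The "state machine" here is just
// the set of records that consumers are allowed to read.
class RaftPartitionSketch {
    static class Entry {
        final int term;
        final byte[] record;
        Entry(int term, byte[] record) { this.term = term; this.record = record; }
    }

    private final List<Entry> log = new ArrayList<>(); // persisted before acking the producer
    private int commitIndex = -1;                      // highest index known to be on a majority
    private int lastApplied = -1;                      // highest index visible to consumers

    // Leader path: a produce request becomes a log entry first, then gets replicated.
    int append(int currentTerm, byte[] record) {
        log.add(new Entry(currentTerm, record));
        return log.size() - 1; // index the leader now replicates to followers
    }

    // Called when the leader learns a majority of replicas hold entries up to newCommitIndex.
    void advanceCommit(int newCommitIndex) {
        commitIndex = Math.max(commitIndex, newCommitIndex);
        while (lastApplied < commitIndex) {
            lastApplied++; // "apply" = expose log.get(lastApplied) to fetch/read requests
        }
    }
}
```

In this kind of design the Raft append and the "partition write" aren't two separate steps: the partition log is the replicated log, and visibility is gated on the commit index.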
Related
I have two questions I'm hoping someone with experience in MSK/Kafka and MirrorMaker2 can help with.
Currently we have a production MSK 2.7.0 cluster with 3 brokers and roughly 1 TB of topic data. We use the Debezium plugin for most things, a few JDBC/MySQL sink connectors, and then a handful of random consumers so far. For DR purposes, I'm considering adding a second MSK cluster of the same size and using MirrorMaker2 to replicate everything to it. I've done a fair amount of searching and reading about how others might be approaching DR for Kafka. It seems that MM2 is the standard.
I've seen conflicting views on whether active/standby or active/active is recommended. It seems that active/active would be ideal, but it comes with a lot of considerations for producers and consumers, mostly when event ordering is important. Curious if anyone can elaborate on that, and how realistic it would be to set up that topology. Event order is important for most of our cases.
For an active/standby configuration, it's not clear to me from what I've read what to plan for in the event the primary cluster goes down permanently and all of the consumers/producers have to migrate over to the new cluster. There's a lot written about how MM2 replicates its own offset data, but I'm not finding much about what a consumer needs to account for when being moved over to the replicated topic. I'm especially interested in what it would mean to move the Debezium connectors over, and whether there's a mechanism built in for such a thing, or what I should expect.
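For the consumer-offset part of that question, MM2 emits checkpoint records that can be translated into offsets valid on the standby cluster. A rough sketch using the mirror client utilities; the bootstrap address, the source-cluster alias "primary", and the group name below are placeholders, not values from this setup:

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class OffsetFailoverSketch {
    public static void main(String[] args) throws Exception {
        // Consumer-style config pointing at the *target* (DR) cluster.
        Map<String, Object> targetClusterConfig = new HashMap<>();
        targetClusterConfig.put("bootstrap.servers", "dr-cluster:9092"); // placeholder address

        // Translate the group's committed offsets from the source cluster
        // ("primary" is the source-cluster alias used in the MM2 config) into
        // offsets that are valid on the replicated topics in the DR cluster.
        Map<TopicPartition, OffsetAndMetadata> translated =
            RemoteClusterUtils.translateOffsets(
                targetClusterConfig, "primary", "my-consumer-group", Duration.ofSeconds(30));

        // A consumer on the DR cluster could then seek()/commitSync() to these
        // offsets before resuming, instead of starting from earliest/latest.
        translated.forEach((tp, om) -> System.out.println(tp + " -> " + om.offset()));
    }
}
```

This only covers plain consumer groups; Connect-based workloads like Debezium keep their own offsets, which is a separate consideration.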
Let's say I have a cheap and less reliable datacenter A, and an expensive and more reliable datacenter B. I want to run Kafka in the most cost-effective way, even if that means risking data loss and/or downtime. I can run any number of brokers in either datacenter, but remember that costs need to be as low as possible.
For this scenario, assume that no costs are incurred if brokers are not running. Also assume that producers/consumers run completely reliably with no concern for their cost.
Two ideas I have are as follows:
Provision two completely separate Kafka clusters, one in each datacenter, but keep the cluster in the more expensive datacenter (B) powered off. Upon detecting an outage in A, power on the cluster in B. Producers/consumers will have logic to switch between clusters.
Run the ZooKeeper cluster in B, with powered-on brokers in A and powered-off brokers in B. If there is an outage in A, then brokers in B come online to pick up where A left off.
Option 1 would be cheaper, but requires more complexity in the producers/consumers. Option 2 would be more expensive, but requires less complexity in the producers/consumers.
Is Option 2 even possible? If there is an outage in A, is there any way to have brokers in B come online, get elected as leaders for the topics and have the producers/consumers seamlessly start sending to them? Again, data loss is okay and so is switchover downtime. But whatever option needs to not require manual intervention.
Is there any other approach that I can consider?
Neither is feasible.
Topics and their records are unique to each cluster. Only one leader can exist for any given Kafka partition within a cluster.
With these two pieces of information, example scenarios include:
Producers cut over to a new cluster and find the new leaders until the old cluster comes back.
Even if the above could happen instantaneously, or with minimal retries, consumers are then responsible for reading from where? A single consumer cannot aggregate data from more than one bootstrap.servers setting at a time.
So now you get into a situation where both clusters always need to be available, with N consumer threads for the N partitions in the other cluster and M threads for the original cluster.
Meanwhile, producers are back to writing to the appropriate (cheaper) cluster, so data will potentially be out of order, since you have no control over which consumer threads process what data first.
Only after you track the consumer lag of the more expensive cluster's consumers can you reasonably stop those threads and shut that cluster down, once lag reaches zero across all consumers.
Another thing to keep in mind is that topic creation/update/delete events aren't automatically synced across clusters, so Kafka Streams apps in particular will be unable to maintain state with this approach.
You can use tools like MirrorMaker or Confluent Replicator / Cluster Linking to help with all this, but the client failover piece I've personally never seen handled very well, especially when record order and idempotent processing matter.
Ultimately, this is what availability zones are for. From what I understand, the chances of a cloud provider losing more than one availability zone at a time are extremely small. So you'd set up one Kafka cluster across 3 or more availability zones and configure "rack awareness" so Kafka accounts for its installation locations.
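Rack awareness itself is just a couple of configs. A minimal sketch below, with made-up broker IDs, addresses, and AZ names; in practice the broker side lives in server.properties rather than Java code:

```java
import java.util.Properties;

public class RackAwarenessSketch {
    public static void main(String[] args) {
        // Broker side (normally server.properties): tag each broker with its AZ
        // so replicas of a partition get spread across zones.
        Properties brokerProps = new Properties();
        brokerProps.put("broker.id", "1");            // hypothetical broker
        brokerProps.put("broker.rack", "us-east-1a"); // this broker's availability zone

        // Client side (optional): let a consumer fetch from the closest in-sync
        // replica instead of always hitting the leader. The brokers also need
        // replica.selector.class set to the RackAwareReplicaSelector for this.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "broker1:9092"); // placeholder
        consumerProps.put("group.id", "my-group");              // placeholder
        consumerProps.put("client.rack", "us-east-1a");

        System.out.println(brokerProps);
        System.out.println(consumerProps);
    }
}
```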
If you want to keep the target/passive cluster shut down while it's not needed and then spin it up on demand, you should be OK, as long as you don't need any history and don't care about the consumer lag gap in the source cluster. Obviously use-case dependent.
MM2, or any sort of async directional replication, requires the target cluster to be up and running the whole time.
A stretch cluster is not really doable because of the two-datacenter constraint; whether you use Raft (KRaft) or ZooKeeper, you need a third DC for quorum, and that would probably be your most expensive option.
Redpanda has the capability of offloading all of your log segments to S3 and indexing them so they can be used by other clusters. So if you constantly wrote one copy of your log segments to a storage array with an S3 interface in your standby DC, it might be palatable: whenever needed, you spin up a cluster on demand in the target DC, point it at the object store, and you can immediately start producing and consuming with your new clients.
I'm designing an event driven distributed system.
One of the events we need to distribute needs:
1- Low latency
2- High availability
Durability of the message and consistency between replicas is not that important for this event type.
Reading the Kafka documentation, it seems that consumers need to wait until all in-sync replicas for a partition have applied the message to their log before consumers can read it from any replica.
Is my understanding correct? If so, is there a way around it?
If configured improperly, consumers can read data that has not been written to every replica yet.
As per the book,
Data is only available to consumers after it has been committed to Kafka, meaning it was written to all in-sync replicas.
If you have configured min.insync.replicas=1, then Kafka will not wait for the other replicas to catch up before serving the data to consumers.
The recommended value for min.insync.replicas depends on the type of application. If you don't care about the data, it can be 1; if it's a critical piece of information, you should configure it to be greater than 1.
There are two things you should think about:
Is it all right if a message from the producer never makes it to Kafka? (fire-and-forget strategy with acks=0)
Is it all right if the consumer doesn't read a message? (With min.insync.replicas=1, if a broker goes down you may lose some data.)
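To make the trade-off concrete, here is a minimal sketch of a low-latency producer configuration; the bootstrap address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LowLatencyProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // acks=0  : fire and forget, lowest latency, no durability guarantee.
        // acks=1  : wait for the leader's write only.
        // acks=all: wait for all in-sync replicas (this is where the
        //           topic/broker setting min.insync.replicas comes into play).
        props.put("acks", "0");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("low-latency-events", "key", "value"));
        }
    }
}
```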
A 7-member cluster, one of which is the leader.
The leader attempts to replicate a log entry (some write).
A network partition occurs, splitting the cluster into groups of 3 and 4 members.
The leader ends up in the minority partition.
The leader only reaches 2 followers → replication to a majority fails.
What happens in this situation?
As I understand it: the 2 followers have applied a "bad" write, and when the network partition mends they will overwrite that write with the majority leader's history. But this violates linearizability.
🤔
You're confusing replication with commitment. Merely replicating an entry to a minority of this cluster doesn't break linearizability. What's important is when that change is considered committed. Since the leader on the minority side of the partition is unable to replicate the change to a majority of the cluster, it will never commit the change and will never acknowledge to a client that the change has been persisted. Furthermore, the uncommitted change will never have been applied to the state machine on any node. Therefore, overwriting the uncommitted change when the partition is healed does not break any guarantees.
Guarantees would only be broken if the leader were to increase the commitIndex and acknowledge a write after replicating only to a minority of the cluster.
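A rough sketch of the rule described above (the names are hypothetical, not from any particular implementation): the leader only advances its commit index to an index that a majority of servers already hold, so the minority-side leader in this example can never get past replication.

```java
import java.util.Arrays;

// Hypothetical sketch of how a Raft leader decides an entry is committed.
class CommitRule {
    // matchIndex[i] = highest log index known to be replicated on server i
    // (the leader counts as one of the copies here).
    // Real Raft additionally requires the entry to be from the leader's current term.
    static int commitIndex(int[] matchIndex, int clusterSize) {
        int[] sorted = matchIndex.clone();
        Arrays.sort(sorted);
        // The highest index that is present on a majority of servers.
        return sorted[sorted.length - (clusterSize / 2 + 1)];
    }

    public static void main(String[] args) {
        // 7-node cluster, leader stuck in a 3-node minority partition:
        // the new entry (index 10) reached only the leader and 2 followers.
        int[] matchIndex = {10, 10, 10, 9, 9, 9, 9};
        // A majority needs 4 copies, so commitIndex stays at 9; the entry at
        // index 10 is never applied and never acknowledged to the client.
        System.out.println(commitIndex(matchIndex, 7)); // prints 9
    }
}
```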
Does it stop acting as the leader (i.e. stop serving produce and fetch requests), returning the "not a leader for partition" exception? Or does it keep thinking it's the leader?
If it's the latter, any connected consumers that wait for new requests on that replica will do so in vain. Since the cluster controller will elect a new partition leader, this particular replica will become stale.
I would expect this node to do the former, but I'd like to check to make sure. (I understand it's an edge case, and maybe not a realistic one at that, but still.)
According to the documentation, more specifically in the Distribution section:
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
Considering that a loss of connection is one of the many kinds of failure, I'd say that your first hypothesis is more likely to happen.