Does Kafka Cluster give disaster recovery advantage? - apache-kafka

A Kafka cluster provides high availability, but does it also provide some disaster recovery protection?
Specifically, if say one of your topic files was somehow corrupted or deleted on one server, can you recover from this with the topic files on your other servers in the cluster?

Topic replication accounts for these scenarios, yes.
If topics have a replication factor higher than one and unclean leader election is disabled, it is highly unlikely for a topic or partition to become unrecoverable.
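As a rough illustration, here is how such a topic could be created with the Java AdminClient. The broker address, topic name, partition count and min.insync.replicas value are made-up placeholders; the topic-level unclean.leader.election.enable override simply pins the behaviour described above (it can also be set cluster-wide in the broker configuration):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateReplicatedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions, replication factor 3 (requires at least 3 brokers)
                NewTopic topic = new NewTopic("orders", 3, (short) 3)
                        // topic-level overrides: never elect an out-of-sync replica as leader,
                        // and require at least 2 in-sync replicas to acknowledge acks=all writes
                        .configs(Map.of("unclean.leader.election.enable", "false",
                                        "min.insync.replicas", "2"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }

With this combination, losing a single broker (or one corrupted copy of a partition) still leaves in-sync replicas to take over or re-replicate from.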

Related

Do Kafka brokers share the same location to store data logs if they are in a cluster?

I am reading an article about Kafka basics. It says that if a broker (brokerX) dies in a cluster, copies of brokerX's data move to the other live brokers in the cluster.
If that is the case, does ZooKeeper / the Kafka controller copy brokerX's data folder to the live brokers, like copying files from one machine's hard disk to another (a physical copy)?
Or do the live brokers share a common location, so that ZooKeeper / the controller just links or points to brokerX's location (a logical copy)?
I am having a little trouble understanding this. Could someone help me?
If a broker dies, it's dead. There is no background process that will copy data off of it.
The replication of topics only happens while the broker is running.
Also, the diagram in that article is wrong: partitions = 2 means exactly that. A third partition doesn't just appear when a broker dies.
This all depends on whether the topic has a replication factor greater than 1. In that case, brokers holding a follower replica constantly send fetch requests to the leader replica (on a specific broker), with the goal of staying fully caught up with the leader (both the follower replica and the leader replica having the same records stored on disk).
So when a broker goes down, all it takes is for the controller to select and promote an in-sync replica (by default; non-in-sync replicas can be selected if unclean leader election is enabled) to take over as leader of the partition. No copy/paste is required: all brokers holding a partition of that topic (as a follower or leader replica) were storing the same information before the broker shut down.
If a broker dies, the behaviour depends on the role of the dead broker. If it was not the leader for its partitions, there is no problem: when the broker comes back online it will copy any missing data from the leader replicas. If the dead broker was the leader for a partition, a new leader is elected according to some rules. If the newly elected leader was in sync before the old leader died, there is no message loss, and the follower brokers will sync their replicas from the new leader, as will the failed broker once it is back up. If the newly elected leader was not in sync, you may lose some messages. In any case, you can tune the behaviour of your Kafka cluster by setting various parameters to balance speed, data integrity and reliability.
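A quick way to observe this replication state is to describe the topic and compare each partition's leader with its in-sync replica (ISR) set. A minimal sketch with the Java AdminClient, where the broker address and topic name are placeholders:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class ShowPartitionLeaders {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin.describeTopics(Collections.singleton("orders"))
                        .all().get().get("orders");
                // The leader serves reads/writes for each partition; the ISR lists the
                // replicas that are fully caught up and therefore eligible to take over.
                desc.partitions().forEach(p ->
                        System.out.printf("partition %d leader=%s isr=%s replicas=%s%n",
                                p.partition(), p.leader(), p.isr(), p.replicas()));
            }
        }
    }

If a broker goes down, its ID drops out of the ISR lists and, for the partitions it led, one of the remaining ISR members shows up as the new leader.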

Kafka scalability if consuming from replica node

In a cluster scenario with a replication factor > 1, why is it that we must always consume from the leader of a partition instead of being able to consume from a follower node that holds a replica of that partition?
I understand that Kafka will always route the request to the leader (of that particular partition/topic), but doesn't this affect scalability (since all requests go to a single node)? Wouldn't it be better if we could read from any node containing the replica and not necessarily the leader?
Partition leader replicas, from which you can write and read data, are evenly distributed among the available brokers, so load is already spread across the cluster. That said, you may also want to leverage the "fetch from closest replica" functionality, which is described in KIP-392 and available since Kafka 2.4.0.
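A sketch of what enabling follower fetching might look like, assuming Kafka 2.4+; the rack names, topic, group ID and broker address are placeholders, and the broker-side settings are shown only as comments because they belong in server.properties rather than client code:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class RackAwareConsumer {
        public static void main(String[] args) {
            // Broker side (server.properties) on each broker:
            //   broker.rack=us-east-1a
            //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // Tell the brokers which "rack" (e.g. availability zone) this consumer is in,
            // so fetches can be served by the closest in-sync replica instead of the leader.
            props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("orders"));
                consumer.poll(Duration.ofSeconds(5)).forEach(r ->
                        System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            }
        }
    }

Note that this only affects consumers; produce requests still go to the partition leader.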

Kafka broker with "No space left on device"

I have a 6-node Kafka cluster where, due to unforeseen circumstances, the disk partition used by Kafka on one of the brokers filled up completely.
Kafka understandably won't start.
We managed to process the data from topics on the other brokers.
We have a replication factor of 4 so all is good there.
Can I delete an index file from a topic manually so that Kafka can start and clear the data itself, or is there a risk of corruption if I do that?
Once the broker starts, it should clear most of the space, as we have already lowered the retention on the topics that have been processed.
What is the best approach?
The best way I found in this case is to remove log segments manually and decrease the retention (or replication) settings of the affected topics.
Some comments mention tuning the retention. I mentioned that we had already done that. The problem was that the broker that had a full disk could not start until some space was cleared.
After testing on dev environment I was able to resolve this by deleting some .log and .index files from one Kafka log folder. This allowed the broker to start. It then automatically started to clear the data based on retention and the situation was resolved.
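For completeness, the retention drop described above can also be applied programmatically. A minimal sketch with the Java AdminClient; the broker address, topic name and one-hour value are placeholders, and the broker only reclaims the space once it is running again:

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class LowerRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
                // Temporarily keep only one hour of data; the broker deletes older
                // log segments on its own once the retention check runs.
                AlterConfigOp setRetention = new AlterConfigOp(
                        new ConfigEntry("retention.ms", "3600000"), AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> updates =
                        Map.of(topic, Collections.singletonList(setRetention));
                admin.incrementalAlterConfigs(updates).all().get();
            }
        }
    }
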

Handle kafka broker full disk space

We have set up a ZooKeeper quorum (3 nodes) and 3 Kafka brokers. The producers were unable to send records to Kafka, resulting in data loss. During the investigation, we could still SSH to that broker and observed that its disk was full. We deleted topic logs to free some disk space and the broker functioned as expected again.
Given that we could still SSH to that broker (we can't see the logs right now), I assume ZooKeeper could still hear the broker's heartbeat and didn't consider it down? What is the best practice to handle such events?
The best practice is to avoid this from happening!
You need to monitor the disk usage of your brokers and have alerts in advance in case available disk space runs low.
You need to put retention limits on your topics to ensure data is deleted regularly.
You can also use Topic Policies (see create.topic.policy.class.name) to control how much retention time/size is allowed when creating or updating topics, to ensure topics can't fill your disk; a sketch of such a policy follows below.
The recovery steps you took are OK, but to keep your cluster availability high you really don't want to fill the disks in the first place.
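As an illustration of that create.topic.policy.class.name hook, here is a minimal sketch of a create-time policy that rejects topics without a bounded retention.ms. The class name and the 7-day limit are made-up examples; note that this interface only covers topic creation, and covering config updates as well would need the separate alter.config.policy.class.name hook:

    import java.util.Map;

    import org.apache.kafka.common.errors.PolicyViolationException;
    import org.apache.kafka.server.policy.CreateTopicPolicy;

    // Enabled on the brokers via create.topic.policy.class.name in server.properties.
    public class RetentionLimitPolicy implements CreateTopicPolicy {

        private static final long MAX_RETENTION_MS = 7L * 24 * 60 * 60 * 1000; // example limit

        @Override
        public void configure(Map<String, ?> configs) {
            // no-op: the limit is hard-coded in this sketch
        }

        @Override
        public void validate(RequestMetadata request) throws PolicyViolationException {
            String retentionMs = request.configs().get("retention.ms");
            if (retentionMs == null) {
                throw new PolicyViolationException(
                        "Topic " + request.topic() + " must set an explicit retention.ms");
            }
            if (Long.parseLong(retentionMs) > MAX_RETENTION_MS) {
                throw new PolicyViolationException(
                        "retention.ms for " + request.topic() + " must not exceed 7 days");
            }
        }

        @Override
        public void close() {
            // nothing to clean up
        }
    }
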

Kafka Producer, multi DC failover support

I have two distinct Kafka clusters located in different data centers, DC1 and DC2. How can I organize Kafka producer failover between the two DCs? If the primary Kafka cluster (DC1) becomes unavailable, I want the producer to switch to the failover Kafka cluster (DC2) and continue publishing to it. The producer should also be able to switch back to the primary cluster once it is available again. Are there any good patterns, existing libraries, approaches, or code examples?
Each partition of the Kafka topic your producer is publishing to has a separate leader, often spread across multiple brokers in the cluster, so the producer is connected to many "primary" brokers simultaneously. Should any one of them fail, another in-sync replica (ISR) will be elected as leader and automatically take over. You do not need to do anything in your client app for it to reconnect to the new leader(s), retry any failed requests, and continue.
If this is for Multi-Data Center (MDC) failover then things get much more complicated, depending on whether the client apps die as well or keep running and only need their cluster connections to fail over. Offsets are not preserved across multiple Kafka clusters, so while producers are simpler, consumers need to call offsetsForTimes() upon failover.
For a great write-up of the MDC failover modes and best practices, see the MDC whitepaper here: https://www.confluent.io/white-paper/disaster-recovery-for-multi-datacenter-apache-kafka-deployments/
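To make the consumer side concrete: offsets committed on the primary cluster mean nothing on the secondary one, so after failover you would map the point in time you had reached back to offsets with offsetsForTimes() and seek there. A rough sketch, where the secondary cluster address, topic, group ID and the "five minutes ago" timestamp are all placeholders:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class FailoverSeek {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "dc2-kafka:9092"); // secondary cluster
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Single partition assumed for brevity; real code would look the partitions up.
                List<TopicPartition> partitions = List.of(new TopicPartition("orders", 0));
                consumer.assign(partitions);

                // Map the timestamp we had reached on the primary cluster to offsets on
                // the secondary cluster ("five minutes ago" is just an example value).
                long resumeFrom = Instant.now().minus(Duration.ofMinutes(5)).toEpochMilli();
                Map<TopicPartition, Long> query = new HashMap<>();
                partitions.forEach(tp -> query.put(tp, resumeFrom));

                Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(query);
                offsets.forEach((tp, ot) -> {
                    if (ot != null) {
                        consumer.seek(tp, ot.offset()); // resume close to where we left off
                    }
                });
                consumer.poll(Duration.ofSeconds(1));
            }
        }
    }

Expect some duplicates with this approach; timestamp-based seeking gets you close to, not exactly at, the last processed record.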
Since you asked only about producers: your app can detect that the primary cluster is down (say, after a certain number of retries) and, instead of attempting to reconnect, connect to another broker list from the secondary cluster. Alternatively, you can redirect the DNS names of the broker list hosts to point to the secondary cluster.
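There is no built-in producer setting for this, so the switch is typically wrapped around the producer yourself. A deliberately simplified sketch of that idea; the cluster addresses, topic, timeouts and the "any send failure means the cluster is down" rule are all made-up assumptions, and the record that triggered the failover is not re-sent here:

    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.TimeUnit;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class FailoverProducer {

        // Placeholder broker lists for the primary (DC1) and failover (DC2) clusters.
        private static final List<String> CLUSTERS = List.of("dc1-kafka:9092", "dc2-kafka:9092");

        private static KafkaProducer<String, String> producerFor(String bootstrap) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            // Fail sends reasonably quickly so the wrapper can decide to switch clusters.
            props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "10000");
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "15000");
            return new KafkaProducer<>(props);
        }

        public static void main(String[] args) {
            int current = 0; // index of the cluster we are currently publishing to
            KafkaProducer<String, String> producer = producerFor(CLUSTERS.get(current));
            for (int i = 0; i < 100; i++) {
                try {
                    producer.send(new ProducerRecord<>("orders", "key-" + i, "value-" + i))
                            .get(30, TimeUnit.SECONDS);
                } catch (Exception e) {
                    // Treat any send failure as "cluster unavailable" and fail over to the
                    // other data center; real code would be far more selective and would
                    // also re-send the record that failed.
                    producer.close();
                    current = (current + 1) % CLUSTERS.size();
                    producer = producerFor(CLUSTERS.get(current));
                }
            }
            producer.close();
        }
    }

Switching back to the primary cluster once it recovers, and replicating the already-published data between the two clusters, are separate concerns that the whitepaper linked above discusses.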