What happens when a Kafka Broker runs out of space before the configured Retention time/bytes? - apache-kafka

I understand that most systems should have monitoring in place to make sure this doesn't happen (and that we should set the retention policies properly), but I am just curious what happens if the Kafka Broker does indeed run out of disk space (for example, if we set the retention time to 30 days, but the Broker runs out of disk space by the 1st day)?
In a single-Broker scenario, does the Broker simply stop receiving any new messages and return an exception to the Producer? Or does it delete old messages to make space for the new ones?
In a multi-Broker scenario, assuming we have Broker A (leader of the partition, but with no more disk space) and Broker B (follower of the partition, with disk space still available), will leadership move to Broker B? What happens when both Brokers run out of space? Does it also return an exception to the Producer?

Assuming the main data directory is not on a separate volume, the OS processes themselves will start locking up because there's no free space left on the device.
Otherwise, if the log directories are isolated to Kafka, you can expect any producer acks to stop working. I'm not sure if a specific error message is returned to clients, though. From what I remember, the brokers just stop responding to Kafka client requests, and we had to SSH to them, stop the Kafka services, and manually clean up files rather than wait for retention policies. And no, the brokers don't delete old data to make room for new records.
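To make that failure visible on the client side, here is a minimal sketch of a producer that logs delivery failures through the send callback (broker address, topic, and the timeout value are placeholders, not something from the answer above); when acks stop arriving, the callback eventually fires with an exception (typically a TimeoutException once delivery.timeout.ms is exceeded) instead of the application failing silently:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerWithDeliveryLogging {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-a:9092"); // placeholder address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // Bound how long a record may sit unacknowledged before send() is failed.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // If the broker stops acknowledging writes (e.g. disk full), this fires
                    // with an exception rather than the send disappearing quietly.
                    System.err.println("Delivery failed: " + exception);
                } else {
                    System.out.println("Delivered to " + metadata.topic() + "-"
                            + metadata.partition() + " @ offset " + metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```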

Related

Prevent data loss while upgrading Kafka with a single broker

I have a Kafka server which runs on a single node. There is only 1 node because it's a test server. But even for a test server, I need to be sure that no data loss will occur while the upgrade is in progress.
I upgrade Kafka as:
Stop Kafka, Zookeeper, Kafka Connect and Schema Registry.
Upgrade all the components.
Start upgraded services.
Data loss may occur in the first step, where Kafka is not running. I guess you can do a rolling update (?) with multiple brokers to prevent data loss, but in my case that is not possible. How can I do something similar with a single broker? Is it possible? If not, what is the best approach for upgrading?
I have to say, obviously, you are always vulnerable to data loss if you are using only one node.
If you can't have more nodes, your only choice is to:
Stop producing;
Stop consuming;
Enable the controlled.shutdown.enable parameter - this will ensure that your broker saves its state cleanly in case of a shutdown.
I guess the first 2 steps are quite tricky (a sketch of shutting the clients down cleanly follows below).
Unfortunately, there is not much to play with - Kafka was not designed to be fault-tolerant with only one node.
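For the first two steps, a minimal sketch of draining the clients before taking the broker down might look like the following (the 30-second timeouts are arbitrary; controlled.shutdown.enable itself is a broker-side setting in server.properties, not something you set in client code):

```java
import java.time.Duration;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;

public class GracefulClientShutdown {

    // Rough sketch: stop handing new records to the producer first, then drain and close
    // both clients so everything is flushed and committed before the broker is stopped.
    static void shutdown(KafkaProducer<String, String> producer,
                         KafkaConsumer<String, String> consumer) {
        producer.flush();                        // push any buffered records to the broker
        producer.close(Duration.ofSeconds(30));  // wait (bounded) for outstanding acks

        consumer.commitSync();                   // persist the last processed offsets
        consumer.close(Duration.ofSeconds(30));  // leave the consumer group cleanly
    }
}
```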
The process of a rolling upgrade is still the same for a single broker.
Existing data during the upgrade shouldn't be lost.
Obviously, if producers are still running, all their requests will be denied while the broker is down, which is why you not only need multiple brokers to prevent data loss, but also a balanced cluster (with unclean leader election disabled) where your restart cycles don't take a whole set of topics offline.

Kafka broker with "No space left on device"

I have a 6-node Kafka cluster where, due to unforeseen circumstances, the Kafka partition on one of the brokers filled up completely.
Kafka understandably won't start.
We managed to process the data from topics on the other brokers.
We have a replication factor of 4 so all is good there.
Can I delete an index file from a topic manually so that Kafka can start and clear the data itself, or is there a risk of corruption if I do that?
Once the broker starts, it should clear most of the space, as we have already lowered the retention on the topics that have been processed.
What is the best approach?
The best way that I found, in this case, is to remove logs and decrease the retention or replication of the Kafka topics.
Some comments mention tuning the retention. I mentioned that we had already done that. The problem was that the broker that had a full disk could not start until some space was cleared.
After testing in a dev environment, I was able to resolve this by deleting some .log and .index files from one Kafka log folder. This allowed the broker to start. It then automatically started to clear data based on retention, and the situation was resolved.
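For reference, the "set the retention low" step can be done with the kafka-configs CLI or programmatically; here is a minimal sketch using the Java Admin client (the broker address, topic name, and one-hour value are placeholders, not values from the answer above):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

public class LowerTopicRetention {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // Temporarily keep only one hour of data so log cleanup frees space quickly.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, "3600000"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(setRetention)))
                 .all()
                 .get();
        }
    }
}
```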

Handle kafka broker full disk space

We have set up a ZooKeeper quorum (3 nodes) and 3 Kafka brokers. The producers are unable to send records to Kafka, so we have data loss. During the investigation, we could still SSH to that broker and observed that its disk was full. We deleted topic logs to clear some disk space and the broker functioned as expected again.
Given that we could still SSH to that broker (we can't see the logs right now), I assume ZooKeeper could still hear the heartbeat of that broker and didn't consider it down? What is the best practice for handling such events?
The best practice is to avoid this from happening!
You need to monitor the disk usage of your brokers and have alerts in advance in case available disk space runs low.
You need to put retention limits on your topics to ensure data is deleted regularly.
You can also use Topic Policies (see create.topic.policy.class.name) to control how much retention time/size is allowed when creating/updating topics, to ensure topics can't fill your disk (a sketch of such a policy follows below).
The recovery steps you did are ok, but you really don't want to fill the disks if you care about keeping your cluster availability high.
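A bare-bones sketch of such a create-topic policy is below; the class name and the 7-day cap are invented for illustration. The compiled class would be put on the brokers' classpath and referenced via create.topic.policy.class.name in server.properties:

```java
import java.util.Map;

import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.CreateTopicPolicy;

// Hypothetical policy: reject topics that are created without a bounded retention.ms.
public class BoundedRetentionPolicy implements CreateTopicPolicy {

    private static final long MAX_RETENTION_MS = 7L * 24 * 60 * 60 * 1000; // illustrative 7-day cap

    @Override
    public void configure(Map<String, ?> configs) {
        // No custom configuration needed for this sketch.
    }

    @Override
    public void validate(RequestMetadata request) throws PolicyViolationException {
        String retention = request.configs().get("retention.ms");
        if (retention == null) {
            throw new PolicyViolationException(
                    "Topic " + request.topic() + " must set retention.ms explicitly");
        }
        long retentionMs = Long.parseLong(retention);
        if (retentionMs < 0 || retentionMs > MAX_RETENTION_MS) {
            throw new PolicyViolationException(
                    "Topic " + request.topic() + " must set retention.ms <= " + MAX_RETENTION_MS);
        }
    }

    @Override
    public void close() {
        // Nothing to clean up.
    }
}
```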

What happens to consumer groups in Kafka if the entire cluster goes down?

We have a consumer service that is always trying to read data from a topic using a consumer group. Due to redeployments, our Kafka cluster is periodically brought down and recreated.
Whenever the cluster comes back, we observe that although the previous topics are picked up (probably from ZooKeeper), the previous consumer groups are not recreated. Because of this, our running consumer process, which was created with a previous consumer group, gets stuck and never recovers.
Is this how the behavior of the consumer groups should be or is there a configuration we need to enable somewhere?
Any help is greatly appreciated.
Kafka brokers keep a cache of healthy consumers and consumer groups; if the entire cluster is destroyed and recreated, it no longer has any knowledge of those consumers and groups, including their offsets. The consumers will have to reconnect and re-establish the group and its offsets from the beginning of the topic.
Operationally, it makes more sense to keep the Kafka cluster running long-term and to do version upgrades in a rolling fashion so you don't interrupt the service.
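On the client side there is no dedicated recovery API for this; a minimal consumer sketch (broker, topic, and group names are placeholders) just makes the post-rebuild behaviour explicit: with auto.offset.reset=earliest, a group whose committed offsets were lost starts again from the beginning of the log rather than from the tail.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReconnectingConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");  // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");       // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // If the recreated cluster has no committed offsets for this group,
        // start from the beginning of the log rather than from the tail.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d@%d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```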

Maximum value for zookeeper.connection.timeout.ms

Right now we are running Kafka on AWS EC2 servers, and ZooKeeper is also running on separate EC2 instances.
We have created services (systemd units) for Kafka and ZooKeeper to make sure they are started in case the server gets rebooted.
The problem is that sometimes the ZooKeeper servers are a little late in starting, and by that time the Kafka brokers have already terminated.
To deal with this issue, we are planning to increase zookeeper.connection.timeout.ms to some high value like 10 minutes on the broker side. Is this a good approach?
Are there any side effects of increasing the zookeeper.connection.timeout.ms timeout?
Increasing zookeeper.connection.timeout.ms may or may not fix the problem at hand, but there is a possibility that it will take longer to detect a broker soft failure.
A couple of things you can do:
1) Alter the systemd unit to delay launching Kafka by 10 minutes (the time you wanted to put in the ZooKeeper timeout).
2) We are using an HDP cluster, which automatically takes care of such scenarios.
Here is an explanation from Kafka FAQs:
During a broker soft failure, e.g., a long GC, its session on ZooKeeper may timeout and hence be treated as failed. Upon detecting this situation, Kafka will migrate all the partition leaderships it currently hosts to other replicas. And once the broker resumes from the soft failure, it can only act as the follower replica of the partitions it originally leads.
To move the leadership back to the brokers, one can use the preferred replica leader election tool. Also, in 0.8.2 a new feature will be added which periodically triggers this functionality.
To reduce Zookeeper session expiration, either tune the GC or increase zookeeper.session.timeout.ms in the broker config.
https://cwiki.apache.org/confluence/display/KAFKA/FAQ
Hope this helps
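For what it's worth, newer Kafka releases (2.4+) also expose preferred leader election through the Admin API, in addition to the preferred leader election tool the FAQ mentions; a rough sketch, assuming a 2.4+ client (broker address, topic, and partition number are placeholders):

```java
import java.util.Collections;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class PreferredLeaderElection {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Ask the controller to hand leadership of this partition back to its preferred replica.
            Set<TopicPartition> partitions =
                    Collections.singleton(new TopicPartition("my-topic", 0));   // placeholders
            admin.electLeaders(ElectionType.PREFERRED, partitions).partitions().get();
        }
    }
}
```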