Kafka Storage Exception

I am getting this Kafka storage exception periodically, every 3-4 days, for some time now: "Halting due to unrecoverable IO error in handling produce requests". It causes my broker to shut down, which leaves offsets uncommitted and results in data loss.
Can someone please suggest a reason for this?

One possible cause is running the Kafka broker on a Windows machine with the Kafka logs stored on the Windows system drive (C:). Try changing the Kafka log directory from the C: drive to the D: drive; after doing that I never faced the issue again (see the sketch below).
Ensure that your disk has enough space.
Try deleting the Kafka logs directory (do not even think of trying this in production) and clean up the ZooKeeper logs as well.
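For reference, the setting involved is log.dirs in the broker's server.properties; a minimal sketch, assuming a D: drive (the exact path is an example):

    # server.properties (broker configuration)
    # Keep Kafka data off the system drive; forward slashes avoid
    # backslash-escaping issues in Java properties files on Windows.
    log.dirs=D:/kafka-logs

The broker must be restarted after this change, and on a fresh path it starts with empty logs.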

Related

Kafka is not restarting - cleared logs and disk space, restarted, but it turns off again and again

We run a talent-matching system and have installed ZooKeeper and Kafka on our AWS instance to send our requests to the core engine and get matches.
On our UI we are getting the error:
NoBrokersAvailable
and when we check, Kafka is down. We restart it and it goes down again. We checked and cleared the logs, and we also cleared disk space.
Still the same problem of Kafka not starting. What should we do?
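(An aside, not from the original thread: NoBrokersAvailable simply means the client could not reach any bootstrap broker. A minimal sketch of a connectivity check using Kafka's Java AdminClient; the localhost:9092 address is an assumption to replace with your broker's host:port.)

    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    public class BrokerCheck {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed bootstrap address; replace with your broker's host:port
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "5000");
            try (AdminClient admin = AdminClient.create(props)) {
                // Succeeds only if at least one broker answers
                System.out.println("Brokers: " + admin.describeCluster().nodes().get());
            } catch (Exception e) {
                // A timeout here is the same symptom as NoBrokersAvailable
                System.err.println("No brokers reachable: " + e);
            }
        }
    }

If this times out, the problem is the broker process or networking, not the client.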

When one Broker has a problem, what is the best way to resolve the situation?

Suppose I have three brokers running in a Kafka cluster, and one of them fails due to an error, so I only have two running brokers left.
1) Usually, when this happens, will restarting the failed broker solve the problem?
2) If restarting the broker doesn't solve the problem, can I erase all the data that the failed broker had and restart it (since all the data will be restored from the two other brokers)? Is this method okay in production? If not, why?
When I was testing Kafka on my Windows 10 desktop a long time ago, if a broker had an error and restarting the server didn't work, I erased all its data, and then it ran okay again. (I am aware of the Kafka-on-Windows issues.) So I am curious whether this would also work in a clustered Kafka (Linux) environment.
Ultimately, it depends what the error is. If it is a networking error, then there is nothing necessarily wrong with the logs, so you should leave them alone (unless they are not being replicated properly).
The main downside of deleting all data from a broker is that some topics may have only one replica, and it may be on that node. Or, if you lose other brokers while replication is catching up, then all data is potentially gone. Also, if you have many TB of data replicating back to one node, you have to be aware of any disk/network contention that may occur, and consider throttling the replication, which could mean hours before the node is healthy again (see the sketch below).
But yes, Windows and Linux ultimately work the same in this regard, and it is one way to address the situation in a clustered environment.
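For the throttling mentioned above, a minimal sketch using the stock kafka-configs.sh tool to cap replication traffic on a broker (the broker id 0, the bootstrap address, and the ~10 MB/s rate are examples):

    # Cap leader- and follower-side replication bandwidth on broker 0
    bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
      --entity-type brokers --entity-name 0 \
      --add-config 'leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760'

Note that which partitions the throttle applies to is controlled separately via the *.replication.throttled.replicas topic configs, and the throttle should be removed (--delete-config) once the broker has caught up.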

After reboot, Kafka topic appears to be lost

Having installed Kafka and having looked at these posts:
kafka loses all topics on reboot
Kafka topic no longer exists after restart
and having accordingly moved kafka-logs to an /opt... location, I still note that when I reboot:
I can re-create the topic again.
the kafka-logs directory contains information on topics, offsets etc., but it gets corrupted.
I am wondering how to rectify this.
Testing of new topics prior to reboot works fine.
There can be two potential problems:
If Kafka is running in Docker without a persistent volume for its data, then recreating the container wipes the previous state and creates a new cluster, hence all topics are lost.
Check the Kafka log.dirs and the ZooKeeper data path (dataDir). If either points into /tmp, it will be cleaned on each reboot, so all logs will be gone and the topics will be lost (see the sketch below).
In this VM I noted the ZooKeeper data directory was defined under /tmp. I changed that to /opt (though presumably it should be /var), and the clearing of Kafka data when the instance terminated stopped. I am not sure how to explain this completely.
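A minimal sketch of the two settings to check (the /var/lib paths are examples; any directory that survives a reboot will do):

    # Kafka: server.properties
    log.dirs=/var/lib/kafka/data

    # ZooKeeper: zoo.cfg
    dataDir=/var/lib/zookeeper

Both processes need a restart after changing these, and existing data must be moved to the new paths if you want to keep it.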

Lagom Kafka Unexpected Close Error

In the Lagom dev environment, after starting Kafka using lagomKafkaStart,
it sometimes shows "KafkaServer closed unexpectedly"; after that I need to run the clean command to get it running again.
Please suggest whether this is the expected behaviour.
This can happen if you forcibly shut down sbt and the ZooKeeper data becomes corrupted.
Other than running the clean command, you can manually delete the target/lagom-dynamic-projects/lagom-internal-meta-project-kafka/ directory.
This will clear your local data from Kafka, but not from any other database (Cassandra or RDBMS). If you are using Lagom's message broker API, it will automatically repopulate the Kafka topic from the source database when you restart your service.
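For reference, the manual cleanup amounts to removing that directory from the project root, e.g.:

    # Run from the root of the Lagom project; removes only the dev-mode Kafka data
    rm -rf target/lagom-dynamic-projects/lagom-internal-meta-project-kafka/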

Reproducing UnknownTopicOrPartitionException: This server does not host this topic-partition

We have encountered a few exceptions in our production environment:
UnknownTopicOrPartitionException: This server does not host this topic-partition
As per my analysis, one possible workaround for this issue is increasing the number of retries, since this is a retriable exception.
I am facing some difficulty reproducing this issue locally. I tried bringing down a broker while producing, but then it fails with a TimeoutException instead.
I am looking for suggestions on how to reproduce this issue.
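As context for the retry workaround: a minimal sketch of the producer settings involved (the values and bootstrap address are examples; in modern Java clients delivery.timeout.ms bounds the total time spent retrying):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class RetryingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // example address
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Retriable errors (UnknownTopicOrPartitionException among them) are
            // retried until retries or delivery.timeout.ms is exhausted.
            props.put(ProducerConfig.RETRIES_CONFIG, 10);                 // example value
            props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);       // example value
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120000); // the client default
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // producer.send(...) as usual
            }
        }
    }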
If you get this error in the logs during the topic creation process, there is an open issue for it:
KAFKA-6221 ReplicaFetcherThread throws UnknownTopicOrPartitionException on topic creation
At some point during batch topic creation, it is likely that UpdateMetadata requests get processed later than a FetchRequest, so the metadata cache is not updated in a timely fashion.
The issue is about log messages that have no impact on cluster health.