Recovering Kafka Data from .log Files - apache-kafka

I have a single-node Kafka instance that crashed recently. I was able to salvage the .log and .index files from /tmp/kafka-logs/mytopic-0/, and I have moved these files to a different server and installed Kafka on it.
Is there a way to have the new kafka server serve the data contained in these .log files?
Update:
I probably didn't do this the right way, but here is what I've tried:
created a topic named recovermytopic on the new kafka server
stopped kafka
moved all the .log files into /tmp/kafka-logs/recovermytopic-0
restarted kafka
It appeared that for each .log file, Kafka generated a .index file, which looked promising, but after the index files were created I saw the messages below:
WARN Partition [recovermytopic,0] on broker 0: No checkpointed highwatermark is found for partition [recovermytopic,0] (kafka.cluster.Partition)
INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions [recovermytopic,0] (kafka.server.ReplicaFetcherManager)
When I try to check the topic using kafka-console-consumer, the kafka server says:
INFO Closing socket connection to /127.0.0.1. (kafka.network.Processor)
No messages are being consumed.

Kafka comes packaged with a DumpLogSegments tool that will extract messages (along with offsets, etc.) from Kafka data log files:
$KAFKA_HOME/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files mytopic-0/00000000000000132285.log > 00000000000000132285_messages.out
The output will vary a bit depending on which version of Kafka you're using, but it should be easy to extract the message keys and values with the use of sed or some other tool. The messages can then be replayed into your Kafka cluster using the kafka-console-producer.sh tool, or programmatically.
While this method is a bit roundabout, I think it's more transparent/reliable than trying to get a broker to start with data log files obtained from somewhere else. I've tested the DumpLogSegments tool with various versions of Kafka from 0.9 all the way up to 2.4.
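As a rough sketch of that extract-and-replay step (the payload: field name and the exact line layout differ between Kafka versions, and the topic name and broker address below are only placeholders):
# keep only the message values from the dump; adjust the pattern to whatever field layout your Kafka version prints
sed -n 's/^.*payload: //p' 00000000000000132285_messages.out > values.txt
# replay the values into the new topic (newer releases use --bootstrap-server instead of --broker-list)
$KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic recovermytopic < values.txt
Note that this simple form drops keys, timestamps, and headers; if you need the keys as well, emit key/value pairs and run the console producer with --property parse.key=true and a key.separator that does not occur in your data.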

Related

Kafka Stream for Kafka to HDFS

I have a Flink job which reads data from Kafka topics and writes it to HDFS. There are some problems with checkpointing; for example, after stopping the Flink job some files stay in a pending state, and there are other checkpoint-related problems when writing to HDFS as well.
I want to try Kafka Streams for the same kind of Kafka-to-HDFS pipeline. I found the following problem - https://github.com/confluentinc/kafka-connect-hdfs/issues/365
Could you please tell me how to resolve it?
Could you tell me where Kafka Streams keeps files for recovery?
Kafka Streams only interacts with topics within the same cluster, not with external systems.
The Kafka Connect HDFS 2 connector maintains offsets in an internal offsets topic. Older versions of it maintained offsets in the filenames and used a write-ahead log to ensure file delivery.
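If you do go the Connect route, a minimal sketch of an HDFS 2 sink connector configuration looks something like this (property names as in the Confluent connector docs; the connector name, topic, and namenode URL are placeholders):
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=mytopic
hdfs.url=hdfs://namenode:8020
flush.size=1000
Here flush.size controls how many records are accumulated per topic partition before a file is committed to HDFS.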

Kafka brokers shut down because log dirs have failed

I have a 3-broker Kafka cluster with the Kafka logs in the /tmp directory. I am running a Debezium source connector for MongoDB which polls data from 4 collections.
However, within 5 minutes of starting the connector, the Kafka brokers shut down with the following error:
[2020-04-16 18:25:08,642] ERROR Shutdown broker because all log dirs in /tmp/kafka-logs-1 have failed (kafka.log.LogManager)
I have tried the different suggestions, viz. deleting the Kafka logs and cleaning out the ZooKeeper logs, but I ran into the same problem again.
I have also noticed that the Kafka logs occupy 100% of the /tmp directory when this happens, so I have also changed the log retention policy based on size.
log.retention.hours=168
log.retention.bytes=1073741824
log.segment.bytes=1073741824
log.retention.check.interval.ms=10000
This also turned out to be futile.
I would like to have some assistance regarding this. Thanks in advance!
Your log files probably got corrupted because you ran out of storage.
I would suggest changing log.dirs in server.properties. Also make sure that you don't use the /tmp location, as it is purged when your machine shuts down. Once you have changed log.dirs you can restart Kafka.
Note that the older messages will be lost.
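For example, in server.properties (the path below is just an illustration; point it at a disk with enough free space):
log.dirs=/var/lib/kafka-logs
Also keep in mind that log.retention.bytes applies per partition, so with many partitions the total on-disk size can still be much larger than that single value.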

Wiping ALL Kafka data?

I'm running Kafka on Windows 10 x64. I stopped ZooKeeper and Kafka, deleted the logs folder in both, and deleted my kafka-streams folder. But still, when I start Kafka up, I get a bunch of messages like:
[2020-04-02 10:32:46,717] INFO [Partition xxx-stream-e1d97106-95ab-4f64-a692-ebfe73382e4c-KSTREAM-AGGREGATE-STATE-STORE-0000000003-repartition-0 broker=0] No checkpointed highwatermark is found for partition xxx-stream-e1d97106-95ab-4f64-a
and similar messages about consumer offsets.
Where is it storing this stuff? I thought it was all in the logs directory?

Is a Kafka topic linked with ZooKeeper, and if ZooKeeper changes will the topic disappear?

I was working with Kafka. I downloaded ZooKeeper, extracted it, and started it.
Then I downloaded Kafka, extracted the zipped file, and started Kafka. Everything was working well. I created a few topics and I was able to send and receive messages. After that I stopped Kafka and ZooKeeper. Then I read that Kafka itself provides ZooKeeper, so I started the ZooKeeper that was provided with Kafka. However, its data directory was different, and then I started Kafka from the same configuration file and the same data directory location. After starting Kafka I could not find the topics that I had created.
I just want to know: does this mean the metadata about the topics is maintained by ZooKeeper? I searched the Kafka documentation, but I could not find anything in detail.
https://kafka.apache.org/documentation/
Check this documentation provided by Confluent. According to it, Apache Kafka® uses ZooKeeper to store persistent cluster metadata, and ZooKeeper is a critical component of a Confluent Platform deployment. For example, if you lost the Kafka data in ZooKeeper, the mapping of replicas to brokers and the topic configurations would be lost as well, making your Kafka cluster no longer functional and potentially resulting in total data loss.
So the answer to your question is yes: the purpose of ZooKeeper is to store relevant metadata about the Kafka brokers, topics, etc.
Also, since you have just started working with Kafka and ZooKeeper, I would like to mention this: by default, Kafka stores its data in a temp location which gets deleted on system reboot, so you should change that as well.
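If you want to see for yourself where that metadata lives, you can list the topic znodes with the ZooKeeper shell that ships with Kafka (assuming ZooKeeper is listening on localhost:2181):
$KAFKA_HOME/bin/zookeeper-shell.sh localhost:2181 ls /brokers/topics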
The answer to your question is yes.
1) Initially you started the standalone ZooKeeper from the zip file and then stopped it, which means the topics that were created and stored in that standalone ZooKeeper are lost. The persistent cluster metadata related to Kafka is now lost.
2) The second time, you started the ZooKeeper that comes packaged with Kafka. This new ZooKeeper instance does not have any information about the topics you created previously, so you need to create them again.
3) Suppose, in case 1, you close the terminal and start the standalone ZooKeeper again: you do not need to create the topic again. But if you stopped the standalone ZooKeeper server, then the topics are lost.
In short: you created two separate ZooKeeper instances, and topics are not shared between them.

How to save a Kafka topic at shutdown

I'm configuring my first Kafka network. I can't seem to find any documentation on saving a configured topic. I know I can configure a topic from the quickstart guide here, but how do I save it? I thought I could add the topic info to a .properties file inside the config dir, but I don't see any support for that.
If I shut down my machine, my topic is deleted. How do I save the configuration?
Could the topic be deleted because you are using the default broker config? With the default config, Kafka logs are stored under the /tmp folder, which gets wiped out during a machine reboot. You could change the broker config and pick another location for the Kafka logs.
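Concretely, that means editing both config files before restarting; the paths below are only examples, and anything outside /tmp that survives a reboot will do:
# config/server.properties
log.dirs=/var/lib/kafka-logs
# config/zookeeper.properties
dataDir=/var/lib/zookeeper
The broker's message data lives under log.dirs and the topic metadata lives in ZooKeeper's dataDir, so both need to be moved off /tmp for topics to survive a reboot.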