Is there a procedure in Kafka to take a backup of Kafka broker data?
How do backup and restore work in Kafka?
Note:
One method is to create another DC and configure inter-DC replication.
But is there any other method to take a backup of the data?
Thanks!
One approach I'd recommend is to continuously back up your Kafka data into HDFS. To do this, you can use the Confluent HDFS Sink connector [1], which can store your records in Avro or Parquet format; a sketch of such a configuration is below.
Going the other way, using HDFS as the data source lets you replay all of those records back into Kafka.
[1] Confluent HDFS-Sink: https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html
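As a rough sketch (the connector name, topic, and NameNode URL below are placeholders), a minimal HDFS sink configuration along the lines of the quickstart in [1] could look like this:

    name=hdfs-backup-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    # topic(s) to back up (placeholder name)
    topics=my-topic
    # HDFS NameNode to write to (placeholder URL)
    hdfs.url=hdfs://namenode:8020
    # number of records to accumulate before rolling a file
    flush.size=1000
    # Avro output; ParquetFormat is the other option mentioned above
    format.class=io.confluent.connect.hdfs.avro.AvroFormat

For the restore direction, the connector itself only writes out of Kafka, so replaying those files back into topics would be a separate source job reading from HDFS.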
I understand that with Confluent 5.x and higher, consumer offsets are only committed back to the consumer group periodically for the Connect HDFS sink connector. This makes consumer lag monitoring not entirely accurate, since the connector uses WALs to keep track of its current offsets within HDFS itself.
If this is truly the case, is there another place to keep track of the sink's lag that does not involve reading the files out of HDFS, but comes from the connector itself?
Or am I missing something, and is consumer group lag always up to date and the proper way to track progress?
Versions:
Confluent Community Edition 5.5.2
Confluent HDFS 2 Sink Connector 10.1.1
I have a Flink job which reads data from Kafka topics and writes it to HDFS. There are some problems with checkpoints; for example, after stopping the Flink job some files stay in pending mode, and there are other checkpoint problems when writing to HDFS as well.
I want to try Kafka Streams for the same kind of pipeline, Kafka to HDFS. I ran into the following problem: https://github.com/confluentinc/kafka-connect-hdfs/issues/365
Could you tell me please how to resolve it?
Could you also tell me where Kafka Streams keeps its files for recovery?
Kafka Streams only moves data between topics of the same cluster; it does not interact with external systems.
The Kafka Connect HDFS2 connector maintains offsets in an internal offsets topic. Older versions of it maintained offsets in the filenames and used a write-ahead log to ensure file delivery.
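A sink connector consumes through an ordinary consumer group (named connect-<connector-name> by default unless overridden), so one rough way to watch its progress from the Kafka side, with the broker address and connector name below being placeholders, is:

    bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group connect-hdfs-sink

with the caveat raised in the question above: the connector only commits offsets back to that group periodically, so the reported lag can trail the files actually landed in HDFS.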
I have a scenario where my Kafka messages (from the same topic) flow through a single enrichment pipeline and are written at the end into HDFS and MongoDB. My Kafka consumer for HDFS will run on an hourly basis (for micro-batching). So I need to know the best possible way to route FlowFiles to PutHDFS and PutMongo based on which consumer they came from (the consumer for HDFS or the consumer for MongoDB).
Or please suggest if there is any other way to achieve micro-batching through NiFi.
Thanks
You could set NiFi up to use a scheduling strategy for the processors that upload data.
I would think you want the Kafka consumers to always read data, building a backlog of FlowFiles in NiFi, and then have the put processors run on a less frequent basis, as sketched below.
This is similar to how Kafka Connect would run its HDFS connector.
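For example (the processor names come from your question; the schedule itself is just one possible choice), on the PutHDFS and PutMongo processors' Scheduling tab you could set:

    Scheduling Strategy: CRON driven
    Run Schedule: 0 0 * * * ?

which fires at the top of every hour (NiFi cron expressions include a leading seconds field), while the ConsumeKafka processors stay timer-driven so they keep pulling records and queuing FlowFiles between runs.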
Is it possible in Kafka to archive data to some directory on a daily basis?
Also, let me know whether it is possible to create a partition on a daily basis.
You can use Kafka Connect with the DailyPartitioner class in Confluent's connectors to back up topic data to HDFS or S3; see the sketch below.
There are also FileSink connectors for local disk out of the box with Kafka, but you might need to implement the partitioner yourself.
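A rough HDFS sink configuration with daily partitioning (the topic, NameNode URL, and flush size are placeholders; the same partitioner.class setting works with the S3 sink connector):

    name=hdfs-daily-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=my-topic
    hdfs.url=hdfs://namenode:8020
    flush.size=1000
    format.class=io.confluent.connect.hdfs.avro.AvroFormat
    # write one directory per day, e.g. .../year=2021/month=01/day=31/
    partitioner.class=io.confluent.connect.storage.partitioner.DailyPartitioner
    # required by the time-based partitioners
    locale=en-US
    timezone=UTC

By contrast, the FileStreamSink connector that ships with Kafka only takes a single file= path, which is why you would have to add the daily-directory logic yourself if you go that route.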
Currently I have Kafka installed on Linux; I created a topic and published messages to it, and it saves data in the folder /tmp/kafka-logs/topicname-0. I checked that the local file system type is xfs. Is there any way Kafka can save its data on an HDFS file system instead? If yes, please help me with the configuration or steps.
Kafka runs on top of a local filesystem; it cannot be run on HDFS. If you want to move data from Kafka into HDFS, one option is using a connector to push the data to HDFS: https://docs.confluent.io/current/connect/connect-hdfs/docs/index.html
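As a quick sketch (the config file names here are assumptions, and the HDFS connector plugin has to be installed on the Connect worker's plugin.path first), the standalone Connect worker that ships with Kafka can run such a connector:

    bin/connect-standalone.sh config/connect-standalone.properties hdfs-sink.properties

where hdfs-sink.properties sets topics to your topic and hdfs.url to your NameNode, along the lines of the configs sketched earlier; the broker itself keeps writing its logs to the local xfs directories.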