Kafka partition data: archiving data daily - apache-kafka

Is it possible in Kafka to archive data daily to some directory?
Also, please let me know whether it is possible to create a partition on a daily basis.

You can use Kafka Connect with the DailyPartitioner class in Confluent's connectors to back up topic data to HDFS or S3.
There is also a FileStreamSink connector for local disk that ships with Kafka out of the box, but you would need to implement the daily partitioning yourself.
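As a minimal sketch, a sink connector configuration using the daily partitioner might look like the following (shown here for the Confluent S3 sink; the topic, bucket, and region are placeholders to substitute with your own):

    # Hypothetical S3 sink configuration; topic, bucket, and region are placeholders
    name=s3-daily-archive
    connector.class=io.confluent.connect.s3.S3SinkConnector
    tasks.max=1
    topics=my-topic
    s3.bucket.name=my-kafka-archive
    s3.region=us-east-1
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.avro.AvroFormat
    flush.size=1000
    # Write one directory per day based on each record's timestamp
    partitioner.class=io.confluent.connect.storage.partitioner.DailyPartitioner
    timestamp.extractor=Record
    locale=en-US
    timezone=UTC

The same partitioner.class, locale, timezone, and timestamp.extractor settings apply to the HDFS sink connector as well.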

Related

Kafka Streams for Kafka to HDFS

I have a Flink job which reads data from Kafka topics and writes it to HDFS. There are some problems with checkpointing; for example, after stopping the Flink job some files stay in a pending state, along with other checkpoint-related problems when writing to HDFS.
I want to try Kafka Streams for the same type of pipeline, Kafka to HDFS. I found the following problem: https://github.com/confluentinc/kafka-connect-hdfs/issues/365
Could you please tell me how to resolve it?
Could you also tell me where Kafka Streams keeps its files for recovery?
Kafka Streams only moves data between topics of the same cluster; it does not integrate with external systems, so it cannot write to HDFS for you.
The Kafka Connect HDFS 2 connector maintains offsets in an internal offsets topic. Older versions maintained offsets in the filenames themselves and used a write-ahead log to ensure file delivery.

Is there any way we can use Kafka Streams for loading a file into a database?

It can be any file. I just want to know whether it's possible using Kafka Streams.
Use Kafka Connect's JDBC Sink connector to stream data from Kafka to a database.
Here's an example of it in use: https://rmoff.dev/kafka-jdbc-video
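A minimal sketch of what that JDBC Sink configuration might look like, assuming a PostgreSQL target; the topic name, connection details, and primary-key fields are placeholders:

    # Hypothetical JDBC Sink configuration; connection details are placeholders
    name=jdbc-sink
    connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
    tasks.max=1
    topics=my-topic
    connection.url=jdbc:postgresql://localhost:5432/mydb
    connection.user=myuser
    connection.password=mypassword
    # Create the target table from the record schema if it does not already exist
    auto.create=true
    insert.mode=upsert
    pk.mode=record_key
    pk.fields=id

To get a file into Kafka in the first place, you would pair this with a source (for example the FileStreamSource connector or a producer of your own); Kafka Streams itself does not read files or write to databases.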

Micro-batching through NiFi

I have a scenario where my Kafka messages (from the same topic) flow through a single enrichment pipeline and are written at the end to both HDFS and MongoDB. My Kafka consumer for HDFS will run on an hourly basis (for micro-batching). So I need to know the best way to route FlowFiles to PutHDFS and PutMongo based on which consumer they are coming from (the consumer for HDFS or the consumer for MongoDB).
Or please suggest if there is any other way to achieve micro-batching through NiFi.
Thanks
You could set NiFi up to use a scheduling strategy for the processors that upload data.
I would think you want the Kafka consumers to always read data, building a backlog of FlowFiles in NiFi, and then have the put processors run on a less frequent schedule.
This is similar to how Kafka Connect runs its HDFS connector.

Confluent Kafka Backup and Recovery

Is there a procedure in Kafka to take a backup of Kafka broker data?
How do backup and restore work in Kafka?
Note:
One method is to create another DC and configure inter-DC replication.
But is there any other method to take a backup of the data?
Thanks!
One approach I'd recommend is to continuously back up your Kafka data into HDFS. To do this, you can use the Confluent HDFS Sink connector [1]. You can store your records in Avro or Parquet format.
In the other direction, using HDFS as a data source allows you to replay all of your records back into Kafka.
[1] Confluent HDFS-Sink: https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html
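As a sketch, such a backup connector configuration might look like the following, assuming the Confluent HDFS Sink connector and Parquet output; the topic names, HDFS URL, and directory are placeholders:

    # Hypothetical HDFS Sink backup configuration; topics and HDFS details are placeholders
    name=hdfs-backup
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=orders,payments
    hdfs.url=hdfs://namenode:8020
    topics.dir=/backup/kafka
    # Parquet output assumes Avro records with a schema (i.e. the Schema Registry is in use)
    format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
    flush.size=10000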

Unable to push Avro data to HDFS using Confluent Platform

I have a system pushing Avro data into multiple Kafka topics.
I want to push that data to HDFS. I came across Confluent, but I am not sure how I can send data to HDFS without starting kafka-avro-console-producer.
Steps I performed:
I have my own Kafka and ZooKeeper running, so I just started the Confluent Schema Registry.
I started kafka-connect-hdfs after changing the topic name.
This step was also successful; it was able to connect to HDFS.
After this I started pushing data to Kafka, but the messages were not being pushed to HDFS.
Please help. I'm new to Confluent.
You can avoid using the kafka-avro-console-producer and use your own producer to send messages to the topics, but we strongly encourage you to use the Confluent Schema Registry (https://github.com/confluentinc/schema-registry) to manage your schemas and to use the Avro serializer that is bundled with the Schema Registry to keep your Avro data consistent. There's a nice writeup on the rationale for why this is a good idea here.
If you are able to send messages produced with the kafka-avro-console-producer to HDFS, then your problem is likely that the kafka-connect-hdfs connector is unable to deserialize the data. I assume you are going through the quickstart guide. The best results will come from using the same serialization format on both sides (into and out of Kafka) if you intend to write Avro to HDFS. How this process works is described in this documentation.
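As a rough sketch, the relevant settings on both sides might look like the following; the Schema Registry URL is a placeholder for your own deployment:

    # Hypothetical producer settings: serialize values as Avro via the Schema Registry
    key.serializer=org.apache.kafka.common.serialization.StringSerializer
    value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
    schema.registry.url=http://localhost:8081

    # Hypothetical Connect worker settings: deserialize that same Avro data on the way out
    key.converter=org.apache.kafka.connect.storage.StringConverter
    value.converter=io.confluent.connect.avro.AvroConverter
    value.converter.schema.registry.url=http://localhost:8081

If the producer writes plain JSON or strings while the Connect worker is configured with the AvroConverter (or vice versa), the connector will fail to deserialize the records and nothing will reach HDFS.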