I have a scenario where my Kafka messages (from the same topic) flow through a single enrichment pipeline and are written at the end into HDFS and MongoDB. My Kafka consumer for HDFS will run on an hourly basis (for micro-batching). So I need to know the best possible way to route FlowFiles to PutHDFS and PutMongo based on which consumer they are coming from (the consumer for HDFS or the consumer for MongoDB).
Or please suggest if there is any other way to achieve micro-batching through NiFi.
Thanks
You could set NiFi up to use a Scheduling Strategy for the processors that upload data, e.g. CRON driven with an hourly run schedule such as 0 0 * * * ? (top of every hour).
And I would think you want the Kafka consumers to always read data, building a backlog of FlowFiles in NiFi, and then have the puts run on a less frequent basis.
This is similar to how Kafka Connect runs for its HDFS Connector.
Related
I need to consume change events coming from MongoDB.
As I explored my options, I noticed that there are two common ways to do so:
Consume the MongoDB change streams directly (a rough sketch of what this looks like is below)
Use the MongoDB Kafka source connector to publish the messages to the Kafka topic and then consume this topic.
I'm dealing with high throughput, so scalability is important.
What is the right option and why? Thanks
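For context, the first option (reading the change stream directly) looks roughly like the minimal pymongo sketch below; the connection string, database and collection names are placeholders:

```python
from pymongo import MongoClient

# Placeholder connection string, database and collection names
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["mycollection"]

# Change streams require a replica set (or Atlas); full_document="updateLookup"
# asks MongoDB to also include the full document for update events.
with coll.watch(full_document="updateLookup") as stream:
    for change in stream:
        print(change["operationType"], change.get("fullDocument"))
```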
I am using the Confluent MongoDB Atlas Source Connector to pull data from a MongoDB collection into Kafka. I have noticed that the connector is creating multiple topics in the Kafka cluster. I need the data to be available on one topic so that the consumer application can consume the data from that topic. How can I do this?
Besides, why is the Kafka connector creating so many topics? Isn't it difficult for consumer applications to retrieve the data with that approach?
Kafka Connect creates 3 internal topics for the whole cluster (for its own configs, offsets and status) to manage its workload. You should never need/want external consumers to use these.
In addition to that, connectors can create their own topics. Debezium, for example, creates a "database history" topic, and again, this shouldn't be read outside of the Connect framework.
Most connectors only need to create one topic for the source to pull data into, and that is the topic consumers should actually care about.
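If you want to see which topics are which, you can list them and separate Connect's internal ones. A small sketch assuming confluent-kafka-python, a placeholder broker address, and the default connect- prefix for the internal topics (their names are configurable, so treat the filter as an assumption):

```python
from confluent_kafka.admin import AdminClient

# Placeholder bootstrap address
admin = AdminClient({"bootstrap.servers": "broker:9092"})

metadata = admin.list_topics(timeout=10)
for name in sorted(metadata.topics):
    # Distributed-mode defaults are usually connect-configs, connect-offsets
    # and connect-status; adjust the check if your cluster names them differently.
    kind = "internal (Connect)" if name.startswith("connect-") else "data"
    print(f"{kind:>18}  {name}")
```

Whatever remains after excluding the internal ones is what your consumer application should subscribe to.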
I'm trying to put some data into a Kafka topic that doesn't exist yet, using the PublishKafka_2.0 processor in NiFi.
I don't have direct access to the Kafka server, only via the NiFi flow, and I need to create 3 new topics for the data.
How can it be done using NiFi?
Thank you!
You would need to enable automatic topic creation in Kafka itself (the broker setting auto.create.topics.enable). NiFi doesn't have any control over Kafka; it just supports consuming and producing. From the sound of it, you may have a setup where automatic topic creation is disabled, so you'll need to have someone create the topics for you.
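If whoever does have access to the cluster can run a small script, creating the topics up front is a one-off job. A sketch using confluent-kafka-python; the broker address, topic names, partition count and replication factor are placeholders:

```python
from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder broker address
admin = AdminClient({"bootstrap.servers": "broker:9092"})

# Placeholder topic names and sizing; pick values that fit your cluster
new_topics = [
    NewTopic(name, num_partitions=3, replication_factor=2)
    for name in ("topic-a", "topic-b", "topic-c")
]

# create_topics() returns a dict of topic -> future; result() raises on failure
for topic, future in admin.create_topics(new_topics).items():
    try:
        future.result()
        print(f"created {topic}")
    except Exception as exc:
        print(f"failed to create {topic}: {exc}")
```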
A similar question has been answered before, but the solution doesn't work for my use case.
We run 2 Kafka clusters, one in each of 2 separate DCs. Our overall incoming traffic is split between these 2 DCs.
I'd be running a separate Kafka Streams app in each DC to transform that data, and I want to write the results to a Kafka topic in a single DC.
How can I achieve that?
Ultimately we'd be indexing the Kafka topic data in Druid. It's not possible to run separate Druid clusters since we are trying to aggregate the data.
I've read that this is not possible with a single Kafka Streams app. Is there a way I can use another Kafka Streams app to read from DC1 and write to the DC2 Kafka cluster?
As you wrote yourself, you cannot use the Kafka Streams API to read from Kafka cluster A and write to a different Kafka cluster B.
Instead, if you want to move data between Kafka clusters (whether in the same DC or across DCs), you should use a tool such as Apache Kafka's MirrorMaker or Confluent Replicator.
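For intuition, those tools are essentially a consumer on cluster A feeding a producer on cluster B, plus offset tracking, failover and monitoring that you get for free. A bare-bones sketch of that idea with confluent-kafka-python (placeholder addresses and topic name; not a substitute for MirrorMaker):

```python
from confluent_kafka import Consumer, Producer

# Placeholder bootstrap addresses for the DC1 and DC2 clusters
consumer = Consumer({
    "bootstrap.servers": "dc1-broker:9092",
    "group.id": "dc1-to-dc2-bridge",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "dc2-broker:9092"})

consumer.subscribe(["transformed-topic"])  # placeholder topic name
try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        # Re-publish the record into the DC2 cluster under the same topic name
        producer.produce("transformed-topic", key=msg.key(), value=msg.value())
        producer.poll(0)  # serve delivery callbacks
finally:
    producer.flush()
    consumer.close()
```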
Is it possible in Kafka to archive data on a daily basis to some directory?
Also, let me know if it is possible to create a partition on a daily basis.
You can use Kafka Connect with the DailyPartitioner class in Confluent's connectors to back up topic data to HDFS or S3.
There are also FileSink connectors for local disk that ship out of the box with Kafka, but you might need to implement the partitioner yourself.
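As an example, registering the Confluent HDFS sink with the DailyPartitioner through the Connect REST API could look roughly like the sketch below; the Connect endpoint, topic name, HDFS URL and the exact config values are placeholders, so check them against the connector docs for your version:

```python
import requests

# Placeholder Connect REST endpoint
connect_url = "http://connect:8083/connectors"

connector = {
    "name": "hdfs-daily-backup",  # placeholder connector name
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "my-topic",                # placeholder topic
        "hdfs.url": "hdfs://namenode:8020",  # placeholder HDFS URL
        "flush.size": "1000",
        # DailyPartitioner writes records into per-day directories; locale and
        # timezone are required by the time-based partitioners. Older HDFS
        # connector versions use io.confluent.connect.hdfs.partitioner.DailyPartitioner.
        "partitioner.class": "io.confluent.connect.storage.partitioner.DailyPartitioner",
        "locale": "en-US",
        "timezone": "UTC",
    },
}

resp = requests.post(connect_url, json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())
```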