I'm trying to solve the problem of data denormalization before indexing to Elasticsearch. Right now, my Postgres 11 database is configured with the pgoutput plugin, and Debezium with the PostgreSQL connector streams the log changes to RabbitMQ; they are then aggregated by doing a reverse lookup on the database and fed to Elasticsearch.
Although this works okay, the lookup at the app layer to aggregate the data is expensive and takes a lot of execution time (the query is already refined, but it has about 10 joins, which makes it slow).
The other alternative I explored was to use Kafka Streams (KStreams) for the data aggregation. My knowledge of Apache Kafka is minimal, hence this question: is it a requirement to have Apache Kafka as the broker in order to use the Java KStreams API, or can it be leveraged with any broker such as RabbitMQ? I'm unsure about this because all the articles talk about Kafka topics and key-value pairs, which are specific to Apache Kafka.
If there is a better way to solve the data denormalization problem, I'm open to it too.
Thanks
Kafka Streams is only for Kafka. You're more than welcome to use Kafka Streams between Debezium and the process that consumes any topic (the Postgres connector that writes to RabbitMQ?).
You can use Spark, Flink, or Beam for stream processing on other message queues, but Debezium requires Kafka, so start with tools around that.
Spark, for example, has an Elasticsearch writer library; not sure about the others.
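If you do go the Kafka route, a minimal Kafka Streams sketch of the denormalization step could look like the following. The topic names, the assumption that both streams are keyed by the same id, and the plain String serdes are all illustrative, not your actual Debezium setup: it joins the change stream against a lookup table and writes the denormalized record to an output topic that a consumer or sink connector could index into Elasticsearch.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class DenormalizeApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "denormalize-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Change events, assumed to be keyed by customer id (hypothetical topic names).
        KStream<String, String> orders = builder.stream("dbserver.public.orders");
        // Latest state per customer, materialized as a table.
        KTable<String, String> customers = builder.table("dbserver.public.customers");

        // Enrich each order with its customer record instead of querying Postgres.
        orders.join(customers, (order, customer) -> order + " | " + customer)
              .to("orders.denormalized", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}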
Good afternoon. My question is pretty simple; I'm new to Apache Kafka, but I'm doing some work as part of my internship, which is why I came here with this question.
I'll provide as much context as I can, and I hope someone can help me clear up my doubts.
I was asked to develop a pipeline (or workflow), first using Apache NiFi.
The pipeline consisted of the following.
I fetched data from a local MySQL database using NiFi, then the data was sent to a Kafka topic, later processed to clean some of the raw data using the Kafka client for Java (KStream, KTable and some regular expressions), and sent again to a Kafka topic.
Once the processing was done, the new data was read again using Apache NiFi and then written to a new MySQL table.
I've included a picture for a better understanding.
General Pipeline
After that, I was asked to do the same but using Kafka Connect instead of Apache NiFi, which was even shorter: I only had to use a source connector to read the data from the MySQL database and send it to a Kafka topic, then process it with the Kafka client for Java and send it to a new Kafka topic, and finally use a sink connector to write the processed data from the new topic straight to a new table in the database.
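To give an idea of the cleaning step in the middle of both pipelines, it was roughly something like the sketch below (the topic names and regular expressions here are only for illustration, not my real ones):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class CleanRowsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "clean-rows-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("mysql.raw_rows")
               // Regex clean-up: collapse whitespace and drop non-printable characters.
               .mapValues(v -> v == null ? null : v.replaceAll("\\s+", " ")
                                                   .replaceAll("[^\\p{Print}]", "")
                                                   .trim())
               .to("mysql.clean_rows");

        new KafkaStreams(builder.build(), props).start();
    }
}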
Then someone in charge asked me when I should use Apache NiFi + Kafka instead of Kafka Connect + Kafka, and to be honest I have no idea.
So let's say that the most important point here is applying data enrichment, and let's consider two scenarios:
data coming from different sources where the data is not streaming data, AND data that is streaming (as well as data that is not).
All of it needs to be processed, integrated, cleaned and finally unified in order to apply data enrichment.
Considering the context provided above, my questions and doubts are:
When should I (and when shouldn't I) use NiFi with Kafka, and why?
When should I (and when shouldn't I) use Kafka Connect with Kafka, and why?
I think I have a basic idea, and I have been reading in order to answer this for myself, but to be honest I haven't come up with an acceptable answer or a clear idea of when to use each one.
So, I would really appreciate your help.
I have a straightforward scenario for an ETL job: take data from a Kafka topic and put it into an HBase table. In the future I'm going to add support for some logic after reading data from the topic.
I'm considering two options:
use Kafka Streams to read data from the topic and write each record via the native HBase client
use a Kafka -> HBase connector
I have the following concerns about my options:
Is it a good idea to write data each time it arrives in a Kafka Streams window? I suspect this will degrade performance.
The Kafka HBase connector is supported only by a third-party developer; I'm not sure about the code quality of that solution, or about the option to add custom aggregation logic over the data from the topic.
I myself have been trying to find ETL options for Kafka to HBase, but so far my research tells me that it's not a good idea to have an external system interaction within a Kafka Streams application (check the answers here and here). Kafka Streams is super powerful and great if you have a Kafka -> transform message -> Kafka kind of use case, and eventually you can have Kafka Connect take the data from your Kafka topic and write it to a sink.
Since you do not want to use the third-party Kafka Connect connector for HBase, one option is to write something yourself using the Connect API; the other option is to use the Kafka consumer/producer API and write the app the traditional way: poll the messages, write them to the sink, commit the batch and move on.
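A rough sketch of that second option follows; it assumes String keys/values and an HBase table named events with a cf column family (all made up for illustration), and omits error handling:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

public class KafkaToHBase {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "kafka-to-hbase");
        props.put("enable.auto.commit", "false"); // commit only after the HBase write succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = hbase.getTable(TableName.valueOf("events"))) {

            consumer.subscribe(Collections.singletonList("events-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                List<Put> batch = new ArrayList<>();
                for (ConsumerRecord<String, String> record : records) {
                    // Assumes a non-null String key, used here as the HBase row key.
                    Put put = new Put(Bytes.toBytes(record.key()));
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"),
                                  Bytes.toBytes(record.value()));
                    batch.add(put);
                }
                if (!batch.isEmpty()) {
                    table.put(batch);      // write the whole batch to HBase
                    consumer.commitSync(); // then move the offsets forward
                }
            }
        }
    }
}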
Apache Kafka is a real-time messaging system. It stores streams of data safely, in a distributed and fault-tolerant way, and we can filter streaming data as it comes in from producers. What I don't understand is why we need NoSQL databases like MongoDB to store the same data that is in Apache Kafka. The real question is: why do we store the same data both in a NoSQL database and in Apache Kafka?
I think that if we need a NoSQL database, we could collect the streams of data from clients directly into MongoDB, without using Apache Kafka. But most big data architectures prefer to put Apache Kafka between the data source and the NoSQL database. (see)
What are the advantages of that for real systems?
This architecture has several advantages:
Kafka as Data Integration Bus
It helps distribute data between several producers and many consumers easily. Here Apache Kafka serves as a data integration message bus.
Kafka as Data Buffer
Putting Kafka in front of your "end" data storages like MongoDB or MySQL acts like a natural data buffer, so you are able to deploy/maintain/redeploy your consumer services independently. While your service is down for maintenance, Kafka still stores all incoming data, which is quite useful.
Kafka as Short-Term Data Storage
You don't have to store everything in Kafka: very often you use Kafka topics with retention, which means all data older than some threshold is deleted by Kafka automatically. So, for example, you may have a Kafka topic with 1-week retention (so you store only 1 week of data), while at the same time the data lives on in long-term storage services like classic SQL databases, Cassandra, etc.
Kafka as Long-Term Data Storage
On the other hand, you can use Apache Kafka as a long-term storage system. Using compacted topics enables you to store only the last value for each key, so your topic becomes the last-state storage of your app.
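As a small illustration of the last two points, topics for both styles can be created with the Java AdminClient; the topic names, partition counts and the one-week retention value below are only examples:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Arrays;
import java.util.Map;
import java.util.Properties;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Short-term storage: keep one week of events, then let Kafka delete them.
            NewTopic events = new NewTopic("events", 3, (short) 1)
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));

            // Long-term "last state" storage: keep only the latest value per key.
            NewTopic state = new NewTopic("entity-state", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));

            admin.createTopics(Arrays.asList(events, state)).all().get();
        }
    }
}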
I want to set up Flink so that it transforms and redirects data streams from Apache Kafka to MongoDB. For testing purposes I'm building on top of the flink-streaming-connectors.kafka example (https://github.com/apache/flink).
The Kafka streams are being read properly by Flink, and I can map them etc., but the problem occurs when I want to save each received and transformed message to MongoDB. The only example I've found of MongoDB integration is flink-mongodb-test from GitHub. Unfortunately it uses a static data source (a database), not a DataStream.
I believe there should be some DataStream.addSink implementation for MongoDB, but apparently there isn't.
What would be the best way to achieve this? Do I need to write a custom sink function, or maybe I'm missing something? Maybe it should be done in a different way?
I'm not tied to any solution, so any suggestion would be appreciated.
Below is an example of exactly what I'm getting as input and what I need to store as output.
Apache Kafka Broker <-------------- "AAABBBCCCDDD" (String)
Apache Kafka Broker --------------> Flink: DataStream<String>
Flink: DataStream.map({
return ("AAABBBCCCDDD").convertTo("A: AAA; B: BBB; C: CCC; D: DDD")
})
.rebalance()
.addSink(MongoDBSinkFunction); // store the row in MongoDB collection
As you can see in this example I'm using Flink mostly for Kafka's message stream buffering and some basic parsing.
As an alternative to Robert Metzger's answer, you can write your results back to Kafka and then use one of the maintained Kafka connectors to drop the contents of that topic into your MongoDB database.
Kafka -> Flink -> Kafka -> Mongo/Anything
With this approach you can maintain the at-least-once semantics behaviour.
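The Flink-to-Kafka leg can be as simple as the sketch below; the exact producer class depends on your Flink version, and the broker address and topic name are placeholders (a Kafka Connect MongoDB sink would then read from that topic):

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class WriteBackToKafka {
    public static void addKafkaSink(DataStream<String> transformed) {
        transformed.addSink(new FlinkKafkaProducer<>(
                "localhost:9092",   // broker list
                "mongo-staging",    // output topic for the downstream MongoDB sink connector
                new SimpleStringSchema()));
    }
}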
There is currently no Streaming MongoDB sink available in Flink.
However, there are two ways for writing data into MongoDB:
Use the DataStream.write() call of Flink. It allows you to use any OutputFormat (from the Batch API) with streaming. Using the HadoopOutputFormatWrapper of Flink, you can use the official MongoDB Hadoop connector.
Implement the Sink yourself. Implementing sinks is quite easy with the Streaming API, and I'm sure MongoDB has a good Java client library (see the sketch at the end of this answer).
Neither approach provides any sophisticated processing guarantees. However, when you're using Flink with Kafka (and checkpointing enabled) you'll have at-least-once semantics: in an error case, the data is streamed again to the MongoDB sink.
If you're doing idempotent updates, redoing these updates shouldn't cause any inconsistencies.
If you really need exactly-once semantics for MongoDB, you should probably file a JIRA in Flink and discuss with the community how to implement this.
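For reference, option 2 could look roughly like the sketch below, using the MongoDB Java driver. The class, database and collection names are made up, and the exact open/invoke signatures vary between Flink versions:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.bson.Document;

public class MongoSink extends RichSinkFunction<String> {
    private transient MongoClient client;
    private transient MongoCollection<Document> collection;

    @Override
    public void open(Configuration parameters) {
        // One client per parallel sink instance.
        client = MongoClients.create("mongodb://localhost:27017");
        collection = client.getDatabase("streaming").getCollection("messages");
    }

    @Override
    public void invoke(String value, SinkFunction.Context context) {
        // Parse the message into a document here; this sketch just stores the raw line.
        collection.insertOne(new Document("raw", value));
    }

    @Override
    public void close() {
        if (client != null) {
            client.close();
        }
    }
}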
In the Hadoop world, Flume or Kafka is used to stream or collect data and store it in Hadoop. I'm just wondering whether MongoDB has some similar mechanisms or tools to achieve the same?
MongoDB is just the database layer, not a complete solution like the Hadoop ecosystem. I actually use Kafka along with Storm to store data in MongoDB in cases where there is a very large flow of incoming data that needs to be processed and stored.
Although Flume is frequently used and treated as a member of the Hadoop ecosystem, it's not impossible to use it with other sources/sinks. MongoDB is no exception. In fact, Flume is flexible enough to be extended to create your own custom sources/sinks. See this project, for example: it is a custom Flume-to-Mongo sink.
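As an illustration of how little is needed, a bare-bones custom Flume sink writing event bodies to MongoDB might look like this (the URI, database and collection names are invented, and a real sink should batch events and handle errors properly; the project linked above already provides a maintained implementation):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.bson.Document;

import java.nio.charset.StandardCharsets;

public class SimpleMongoSink extends AbstractSink implements Configurable {
    private MongoClient client;
    private MongoCollection<Document> collection;

    @Override
    public void configure(Context context) {
        client = MongoClients.create(context.getString("mongo.uri", "mongodb://localhost:27017"));
        collection = client.getDatabase("flume").getCollection("events");
    }

    @Override
    public Status process() {
        Channel channel = getChannel();
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                tx.commit();
                return Status.BACKOFF; // nothing to do right now
            }
            collection.insertOne(new Document("body",
                    new String(event.getBody(), StandardCharsets.UTF_8)));
            tx.commit();
            return Status.READY;
        } catch (Exception e) {
            tx.rollback();
            return Status.BACKOFF;
        } finally {
            tx.close();
        }
    }
}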