Collecting MongoDB delta data and pushing to Kafka?

We have a MongoDB database that keeps receiving data from different sources, and I want to keep pushing this data to Kafka as a producer in real time so that I can use Spark-Kafka integration for my analytics. Let me know if anyone has done anything on this or if there is any probable solution to it. Flume does not support MongoDB as a source, and Sqoop is for RDBMSs.

You can use Kafka Connect for that:
https://www.confluent.io/product/connectors/
As per the above, there are at least two source connectors for MongoDB available (a rough sketch of the data flow they implement follows the links):
https://github.com/DataReply/kafka-connect-mongodb
https://github.com/teambition/kafka-connect-mongo
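The connectors above are the easiest route. If you want to see roughly what they do under the hood, or prototype without Connect, here is a minimal sketch using MongoDB change streams (3.6+, replica set required) and the plain Java producer; the connection URI, database/collection names, topic name, and serializers are all assumptions:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.bson.Document;

import java.util.Properties;

public class MongoChangeStreamProducer {
    public static void main(String[] args) {
        // Kafka producer config (bootstrap servers and topic name are assumptions)
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {

            MongoCollection<Document> coll =
                    mongo.getDatabase("mydb").getCollection("events");

            // Change streams need MongoDB 3.6+ and a replica set
            try (MongoCursor<ChangeStreamDocument<Document>> cursor = coll.watch().iterator()) {
                while (cursor.hasNext()) {
                    ChangeStreamDocument<Document> change = cursor.next();
                    // Full document may be null for deletes (and plain updates
                    // unless full-document lookup is enabled)
                    Document full = change.getFullDocument();
                    if (full != null) {
                        // Forward each insert/update as JSON to the Kafka topic
                        producer.send(new ProducerRecord<>("mongo.events",
                                full.getObjectId("_id").toHexString(), full.toJson()));
                    }
                }
            }
        }
    }
}
```

A real connector additionally persists offsets/resume tokens so it can pick up where it left off after a restart, which is a good reason to prefer Kafka Connect over hand-rolled code.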

Related

RabbitMQ and KStreams for Data Aggregation

I'm trying to solve the problem of data denormalization before indexing into Elasticsearch. Right now, my Postgres 11 database is configured with the pgoutput plugin, and Debezium with the PostgreSQL connector is streaming the log changes to RabbitMQ; they are then aggregated by doing a reverse lookup on the DB and fed to Elasticsearch.
Although this works okay, the lookup at the app layer to aggregate the data is expensive and takes a lot of execution time (the query is already refined, but it has about 10 joins, which makes it slow).
The other alternative I explored was to use KStreams for data aggregation. My knowledge of Apache Kafka is minimal, and thus I'm here. My question is: is it a requirement to have Apache Kafka as the broker in order to use the Java KStreams API, or can it be leveraged with any broker such as RabbitMQ? I'm unsure about this because all the articles talk about Kafka topics and key-value pairs, which are specific to Apache Kafka.
If there is a better way to solve the data denormalization problem, I'm open to it too.
Thanks
Kafka Streams is only for Kafka. You're more than welcome to use Kafka Streams between Debezium and the process that consumes any topic (the Postgres connector that writes to RabbitMQ?).
You can use Spark, Flink, or Beam for stream processing on other message queues, but Debezium requires Kafka, so start with tools around that.
Spark, for example, has an Elasticsearch writer library; not sure about the others.
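If you do end up with the data in Kafka (e.g. via Debezium), a denormalizing join in Kafka Streams looks roughly like the sketch below. The topic names, the assumption that both topics are keyed by the same id, and the naive string merge are illustrative only:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class DenormalizeApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "denormalize-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Change events, assumed already keyed by customer id
        KStream<String, String> orders = builder.stream("orders");
        // Reference data kept as a table, keyed by the same id
        KTable<String, String> customers = builder.table("customers");

        // Enrich each order with its customer record (naive string merge for illustration)
        orders.join(customers, (order, customer) -> order + " | " + customer)
              .to("orders-enriched"); // the Elasticsearch sink reads this topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The enriched topic can then be written to Elasticsearch by a sink connector or a consumer, avoiding the reverse lookup against Postgres.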

Streaming database data to Kafka topic without using a connector

I have a use case where I have to push all my MySQL database data to a Kafka topic. Now, I know I can get this up and running using a Kafka connector, but I want to understand how it all works internally without using a connector. In my Spring Boot project I have already created a Kafka producer class where I set all my configuration, create a ProducerRecord, and so on.
Has anyone tried this approach before? Can anyone throw some light on this?
Create entities using Spring Data JPA for the tables and send the data to the topic using findAll. Use a scheduler for fetching the data and sending it to the topic. You can add your own logic for fetching from the DB and separate logic for sending to the Kafka topic, e.g. fetch by auto-increment ID, fetch by last-updated timestamp, or do a bulk fetch. The same logic as the JDBC connector's can be implemented; a sketch of the timestamp-based variant follows.
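A rough sketch of that scheduler approach with Spring is below. The Customer entity, the CustomerRepository derived query, the topic name, and the poll interval are all hypothetical:

```java
import java.time.Instant;
import java.util.List;

import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class DbToKafkaPoller {

    private final CustomerRepository repository;        // hypothetical Spring Data JPA repository
    private final KafkaTemplate<String, String> kafka;  // configured elsewhere in the project
    private Instant lastPoll = Instant.EPOCH;

    public DbToKafkaPoller(CustomerRepository repository, KafkaTemplate<String, String> kafka) {
        this.repository = repository;
        this.kafka = kafka;
    }

    // Poll every 10 seconds for rows changed since the last run (timestamp-based incremental fetch)
    @Scheduled(fixedDelay = 10_000)
    public void pollAndPublish() {
        Instant now = Instant.now();
        // Derived query; assumes the table has an updatedAt column
        List<Customer> changed = repository.findByUpdatedAtAfter(lastPoll);
        for (Customer c : changed) {
            // Key by primary key so updates for the same row land in the same partition
            kafka.send("customers-topic", String.valueOf(c.getId()), c.toJson()); // toJson() stands in for real serialization
        }
        lastPoll = now;
    }
}
```

This assumes @EnableScheduling is active and that Customer/CustomerRepository exist; persisting the last-seen timestamp and handling failures is exactly the bookkeeping the JDBC source connector does for you.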
Kafka Connect will do it in an optimized way.

How to do the transformations in Kafka (PostgreSQL -> Redshift)

I'm new to Kafka/AWS. My requirement is to load data from several sources into a DW (Redshift).
One of my sources is PostgreSQL. I found a good article on using Kafka to sync data into Redshift.
The article is good enough to sync the data from PostgreSQL to Redshift, but my requirement is to transform the data before loading it into Redshift.
Can somebody help me with how to transform the data in Kafka (PostgreSQL -> Redshift)?
Thanks in Advance
Jay
Here's an article I just published on exactly this pattern, describing how to use Apache Kafka's Connect API and KSQL (which is built on Kafka's Streams API) to do streaming ETL: https://www.confluent.io/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
You should check out Debezium for streaming events from Postgres into Kafka.
For this, you can use any streaming application, be it Storm, Spark, or Kafka Streams. These applications will consume data from different sources, and the data transformation can be done on the fly. All three have their own advantages and complexity; a minimal Kafka Streams sketch is shown below.
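As a concrete illustration of the Kafka Streams option, a stateless transform sitting between the topic written by the Postgres source connector and the topic the Redshift sink reads is only a few lines; the topic names and the transformation itself are placeholders:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class PgToRedshiftTransform {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pg-to-redshift-transform");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("postgres.public.orders") // topic written by the source connector (name assumed)
               .filter((key, value) -> value != null)            // drop tombstones
               .mapValues(value -> value.toUpperCase())          // placeholder for the real transformation
               .to("orders_for_redshift");                       // topic the Redshift sink reads (name assumed)

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

KSQL expresses the same thing declaratively as a CREATE STREAM ... AS SELECT statement, without writing Java.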

Build a solution for Kafka+Spark for RDBMS data

My current project is on mainframes with DB2 as its database. We have 70 databases with nearly 60 tables in each of them. Our architect proposed a plan of using Kafka with Spark Streaming for processing the data. How good is Kafka at reading RDBMS tables for data? Do we read the data directly from the tables using Kafka, or is there another way to get the data from the RDBMS into Kafka?
If there is a better solution, your suggestions would help a lot.
Do not read directly from the database; it will create additional load. I would suggest two approaches:
1. Send new data both to the database and to Kafka, or send it to Kafka first and then consume it for processing.
2. Read data from the database's write-ahead log (I know this is possible for MySQL with Maxwell, but I am not sure about DB2) and send it to Kafka for further processing.
You can use Spark Streaming or Kafka Streams depending on your needs.
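On the processing side, a minimal Spark Structured Streaming job reading one of those Kafka topics could look like this (requires the spark-sql-kafka package; the topic and server addresses are assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class Db2EventsJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("db2-events")
                .getOrCreate();

        // Read the change events that were pushed into Kafka
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "db2.changes")   // topic name assumed
                .load();

        // Kafka records arrive as binary key/value; cast to strings before processing
        Dataset<Row> values = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Placeholder sink: print to the console; replace with your real processing and sink
        StreamingQuery query = values.writeStream()
                .format("console")
                .start();

        query.awaitTermination();
    }
}
```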

Flume or Kafka's equivalent for MongoDB

In the Hadoop world, Flume or Kafka is used to stream or collect data and store it in Hadoop. I am just wondering whether MongoDB has some similar mechanisms or tools to achieve the same.
MongoDB is just the database layer, not a complete solution like the Hadoop ecosystem. I actually use Kafka along with Storm to store data in MongoDB in cases where there is a very large flow of incoming data that needs to be processed and stored.
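Leaving Storm aside, the core of that pipeline is a Kafka consumer that writes each record into MongoDB. A stripped-down sketch, where the topic, group id, connection strings, and the assumption that messages are JSON are all illustrative:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.bson.Document;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class KafkaToMongoWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "mongo-writer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {

            MongoCollection<Document> coll = mongo.getDatabase("mydb").getCollection("events");
            consumer.subscribe(Collections.singletonList("events")); // topic name assumed

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Assumes the message value is a JSON document
                    coll.insertOne(Document.parse(record.value()));
                }
            }
        }
    }
}
```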
Although Flume is frequently used and treated as a member of the Hadoop ecosystem, it's not impossible to use it with other sources/sinks, and MongoDB is no exception. In fact, Flume is flexible enough to be extended with your own custom sources/sinks. See this project, for example: it is a custom Flume MongoDB sink.