I'm designing a new way for my company to stream data from multiple MongoDB databases, perform some arbitrary initial transformations, and sink them into BigQuery.
There are various requirements, but the key ones are speed and the ability to omit or redact certain fields before they reach the data warehouse.
We're using Dataflow to basically do this:
MongoDB -> Dataflow (Apache Beam, Python) -> BigQuery
We basically just need to wait on the collection.watch() call as the input, but from the docs and existing research it may not be possible.
At the moment, the MongoDB connector is bounded, and there seems to be no readily available way to read from a change stream, or from a collection, in an unbounded way.
Is it possible to read from a changeStream and have the pipeline wait until the task is killed rather than being out of records?
In this instance I decided to go via Google Pub/Sub which serves as the unbounded data source.
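For illustration, here is a minimal sketch of what the producer side of that approach could look like: a small process that waits on collection.watch(), omits or redacts fields, and publishes each event to Pub/Sub. The field names, connection URI, database, collection, project, and topic are all placeholders, and the helpers scrub, forward_changes, and main are hypothetical names, not part of any library. It assumes pymongo and google-cloud-pubsub are installed and that MongoDB runs as a replica set (which change streams require).

```python
import json
from collections.abc import Mapping
from typing import Any, Callable, Iterable

# Hypothetical field names, used only for illustration.
OMIT = {"ssn"}
REDACT = {"email"}


def scrub(doc, omit=OMIT, redact=REDACT):
    """Return a copy of doc with omitted fields dropped and redacted fields masked."""
    out = {}
    for key, value in doc.items():
        if key in omit:
            continue
        if key in redact:
            out[key] = "REDACTED"
        elif isinstance(value, Mapping):
            out[key] = scrub(value, omit, redact)  # recurse into sub-documents
        else:
            out[key] = value
    return out


def forward_changes(changes: Iterable[Mapping], publish: Callable[[bytes], Any]) -> int:
    """Scrub each change event's document and hand it to publish as JSON bytes."""
    count = 0
    for change in changes:
        doc = change.get("fullDocument")
        if doc is None:  # e.g. delete events carry no fullDocument
            continue
        payload = {"op": change["operationType"], "doc": scrub(doc)}
        # default=str turns ObjectId and datetime values into strings.
        publish(json.dumps(payload, default=str).encode("utf-8"))
        count += 1
    return count


def main():
    # Imported here so the helpers above work without these packages.
    from google.cloud import pubsub_v1
    from pymongo import MongoClient

    publisher = pubsub_v1.PublisherClient()
    topic = publisher.topic_path("my-project", "mongo-changes")  # placeholders
    collection = MongoClient("mongodb://localhost:27017")["mydb"]["mycoll"]
    # Iterating the change stream blocks until the next event arrives,
    # so this loop runs until the process is stopped.
    with collection.watch(full_document="updateLookup") as stream:
        forward_changes(stream, lambda data: publisher.publish(topic, data))
```

Calling main() starts the loop. The Dataflow pipeline can then read the topic as an ordinary unbounded source, and the redaction has already happened before the data leaves this process.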
Related
I am configuring the MongoDB to BigQuery CDC template. The job connects to MongoDB and starts up, but it does not process any change stream events automatically. Only when I manually publish a message to the Pub/Sub topic does it process the message and write to BigQuery.
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#mongodb-to-bigquery-cdc
My understanding is that, if the configuration asks for a MongoDB connection URI, database, and collection name, it should connect directly to the change stream and populate the data into BigQuery. It doesn't really make sense that I would need a separate process that reads the change stream from MongoDB, extracts each record, and then sends it to Pub/Sub.
If this is the case, why is the configuration asking for MongoDB parameters?
I'm not too familiar with that template; it was introduced by MongoDB, and they are the ones who usually support it.
However, looking at https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/mongodb-to-googlecloud/docs/MongoDbToBigQueryCDC/README.md, it appears that the template requires a separately running change stream process that pushes the changes from MongoDB to a Pub/Sub topic, which appears to be the missing piece here.
Maybe this is helpful: https://www.mongodb.com/developer/products/mongodb/stream-data-mongodb-bigquery-subscription/
As part of our architecture, we have Kinesis streams that will send data to Redshift. The current plan is to create an external schema on top of the Kinesis streams and then use materialized views to persist the data with minimal transformations as needed.
In addition, there is a need to perform a series of quite complex transformations on this data and load the results into target tables. I was thinking about using stored procedures to perform these transformations and load into a target table. So the flow would be: Kinesis Streams -> External View (real time) -> Batch Processing (materialized view and stored procedure).
To call the stored procedures (SPs) on a schedule, the plan is to use scheduled SQL queries. Being fairly new to Amazon Redshift and Kinesis streaming ingestion, and still exploring the available options, I would like to get thoughts on this approach.
I also understand that stored procedures have limitations (https://docs.aws.amazon.com/redshift/latest/dg/stored-procedure-constraints.html), and that scheduled SQL queries might not allow for a good orchestration flow. Hence I would like to understand what alternative methods are available to implement the above, specifically for the processing that happens after the data is available in the external view.
I was reading about MongoDB ChangeStreams, and I understand they reduce the risk of tailing the oplog. We are currently tailing the oplog to publish the data to Kafka.
Please help me understand how ChangeStreams compare to Pub/Sub technologies like Kafka or RabbitMQ, and in what way they are better.
ChangeStreams should not be compared to Pub/Sub technologies - ChangeStreams are there to provide a safe way to enable real-time (data) change events to be captured and then processed (and as you rightly pointed out, previously you had to tail the oplog in MongoDB in order to achieve a similar outcome which had its own set of issues, risks and complexities which taxed the developer).
As I mentioned above, ChangeStreams provide a safe way to look at each data change event occurring in MongoDB, apply a filter to those events, and then process each qualifying event. ChangeStreams let you replay previous events within the timeframe that your oplog covers - for example, if the application that consumes the ChangeStream fails, you can pick up from the point at which it failed.
Whilst ChangeStreams exhibit Pub/Sub-like behaviour from an event identification/processing perspective, that is where the likeness stops. A typical/common use case, where you are interested in capturing/identifying data-change events in MongoDB for downstream processing, is to create a Kafka Producer that utilises the MongoDB driver, instantiates a ChangeStream, and for each qualifying event occurring in MongoDB (made available via the ChangeStream) passes that onto Kafka.
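As a rough sketch of the filter, process, and resume behaviour described above: each change event's _id field is its resume token, so a consumer can store the token after handling an event and pass it back via resume_after when it restarts. The names process_with_checkpoint, run, load_token, save_token, and send are hypothetical; send stands in for whatever forwards the event downstream (for example a Kafka producer), and the URI, database, and collection are placeholders. Resuming only works while the stored token is still covered by the oplog.

```python
from typing import Any, Callable, Iterable, Mapping, Optional


def process_with_checkpoint(
    changes: Iterable[Mapping],
    handle: Callable[[Mapping], Any],
    save_token: Callable[[Any], Any],
) -> Optional[Any]:
    """Handle each event, then persist its resume token; return the last token."""
    last = None
    for change in changes:
        handle(change)
        # Save the token only after the event was handled, so a crash
        # in handle() causes the event to be replayed on restart.
        last = change["_id"]
        save_token(last)
    return last


def run(load_token, save_token, send):
    # Imported here so the helper above works without pymongo installed.
    from pymongo import MongoClient

    collection = MongoClient("mongodb://localhost:27017")["mydb"]["mycoll"]
    # Server-side filter: only qualifying events reach the application.
    pipeline = [{"$match": {"operationType": {"$in": ["insert", "update", "replace"]}}}]
    # resume_after=None starts from the current point in time;
    # a stored token continues from just after that event.
    with collection.watch(pipeline, resume_after=load_token()) as stream:
        process_with_checkpoint(stream, send, save_token)
```

Saving the token after handling gives at-least-once delivery: an event may be sent twice after a crash, but it is not skipped.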
Problem statement:
To transfer data from MongoDB to Spark optimally, with minimal latency.
Problem Description:
I have my data stored in MongoDB and want to process it (on the order of 100-500GB) using Apache Spark.
I used the MongoDB Spark connector and was able to read/write data from/to MongoDB (https://docs.mongodb.com/spark-connector/master/python-api/)
The problem is that the Spark DataFrame has to be created on the fly each time.
Is there a solution to handling such huge data transfers?
I looked into:
spark streaming API
Apache Kafka
Amazon S3 and EMR
But couldn't make a decision as to whether it was the optimal way to do it.
What strategy would you reckon to handle transferring such data?
Would keeping the data on the Spark cluster and syncing just the deltas (changes in the database) to the local file be the way to go, or is reading from MongoDB each time the only (or the optimal) way to go about it?
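To make the "sync just the deltas" option concrete, here is a small sketch of the idea in plain Python. It is only an illustration of the merge logic, not a Spark implementation: the snapshot is a dict keyed by _id, whereas in practice it would be a file on the cluster. apply_deltas is a hypothetical helper, and the event shape assumes MongoDB change stream events opened with full_document="updateLookup", so that updates carry the whole document.

```python
def apply_deltas(snapshot: dict, changes) -> dict:
    """Apply change events to a local snapshot keyed by _id, in place."""
    for change in changes:
        key = change["documentKey"]["_id"]
        op = change["operationType"]
        if op == "delete":
            snapshot.pop(key, None)
        elif op in ("insert", "replace", "update"):
            doc = change.get("fullDocument")
            if doc is not None:  # may be missing if the document was deleted meanwhile
                snapshot[key] = doc
    return snapshot


# One full read builds the snapshot; afterwards only the deltas are applied.
snapshot = {1: {"_id": 1, "v": "a"}, 2: {"_id": 2, "v": "b"}}
deltas = [
    {"operationType": "update", "documentKey": {"_id": 1}, "fullDocument": {"_id": 1, "v": "a2"}},
    {"operationType": "delete", "documentKey": {"_id": 2}},
    {"operationType": "insert", "documentKey": {"_id": 3}, "fullDocument": {"_id": 3, "v": "c"}},
]
apply_deltas(snapshot, deltas)
```

The trade-off is that the initial full read is paid once, while each later refresh costs roughly in proportion to the number of changes rather than the size of the collection.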
EDIT 1:
The following suggests reading the data directly from MongoDB (thanks to secondary indexes, data retrieval is faster): https://www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
EDIT 2:
The advantages of using the Parquet format: What are the pros and cons of parquet format compared to other formats?
I was discussing the MongoDB Connector for Hadoop with a coworker, and he explained that it is very inefficient. He stated that the MongoDB connector uses its own map reduce and then uses the Hadoop map reduce, which slows down the entire system.
If that is the case, what is the most efficient way to transport my data to the Hadoop cluster? What purpose does the MongoDB connector serve if it is less efficient? In my scenario, I want to get the daily inserted data from MongoDB (roughly 10MB) and put it all into Hadoop. I should also add that the MongoDB nodes and the Hadoop nodes all share the same server.
The MongoDB Connector for Hadoop reads data directly from MongoDB. You can configure multiple input splits to read data from the same collection in parallel. The Mapper and Reducer jobs are run by Hadoop's Map/Reduce engine, not MongoDB's Map/Reduce.
If your data estimate is correct (only 10MB per day?) that is a small amount to ingest and the job may be faster if you don't have any input splits calculated.
You should be wary of Hadoop and MongoDB competing for resources on the same server, as contention for memory or disk can affect the efficiency of your data transfer.
To transfer your data from MongoDB to Hadoop you can use an ETL tool like Talend or Pentaho; it's much easier and more practical! Good luck!