Moving all documents in a collection from MongoDB to Azure Blob Storage - mongodb

I am trying to move all documents in a MongoDB collection to Azure Blob Storage from a scheduled Azure WebJob, using C# and the MongoDB 1.9.1 driver.
I do not want to hold all 100,000 documents in memory in the WebJob. Is there a better way, perhaps a batched retrieval of documents from MongoDB? Or is there a completely different approach that I can look into?

You could have one web job responsible for queuing up each document individually. This web job only needs the unique identifier of each document, which it pushes as a message to an Azure Storage Queue. You can configure this web job as scheduled or manual, depending on the need.
Then have another web job that migrates a single document. You can set this web job up as continuous, so that as long as there are messages on the queue it keeps processing them. By default a web job pulls 16 items off a queue in parallel; this is configurable.
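A rough sketch of the pattern is below. It is written in Python with pymongo and the Azure Storage SDKs purely to illustrate the flow (the question is C# with the 1.9.1 driver, where the same idea applies); the connection strings, queue, container, and collection names are all placeholders.

    # First job (scheduled): stream only the _id values and queue one message per document.
    # Second job (continuous): triggered per message, copy a single document to blob storage.
    from pymongo import MongoClient
    from bson import ObjectId, json_util
    from azure.storage.queue import QueueClient
    from azure.storage.blob import BlobServiceClient

    MONGO_URI = "mongodb://localhost:27017"                 # placeholder
    STORAGE_CONN = "<azure-storage-connection-string>"      # placeholder

    collection = MongoClient(MONGO_URI)["mydb"]["documents"]
    queue = QueueClient.from_connection_string(STORAGE_CONN, queue_name="docs-to-migrate")
    blobs = BlobServiceClient.from_connection_string(STORAGE_CONN)

    def enqueue_ids(batch_size=500):
        # The cursor streams in batches, so memory stays flat regardless of collection size.
        for doc in collection.find({}, {"_id": 1}, batch_size=batch_size):
            queue.send_message(str(doc["_id"]))

    def migrate_one(message_text):
        # Runs once per queue message: fetch a single document and upload it as a blob.
        doc = collection.find_one({"_id": ObjectId(message_text)})
        if doc is None:
            return
        blob = blobs.get_blob_client(container="migrated-docs", blob=f"{message_text}.json")
        blob.upload_blob(json_util.dumps(doc), overwrite=True)

Because only the _id is projected in the first job and each message carries a single identifier, neither job ever has to hold the full 100,000 documents in memory.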

Related

GCP MongoDB to BigQuery CDC Template does not stream / read data from MongoDB change streams

I am configuring the MongoDB to BigQuery CDC template. The job is able to connect to MongoDB and starts up, but it does not process any change streams automatically. Only when I manually publish a message to the Pub/Sub topic does it process anything and write to BigQuery.
https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#mongodb-to-bigquery-cdc
My understanding is that if the configuration asks for the MongoDB connection URI and the database and collection names, shouldn't it connect to the change stream directly and populate the data into BigQuery? It doesn't quite make sense that I would need a separate process that reads the change stream from MongoDB, extracts the record, and then sends it to Pub/Sub.
If that is the case, why does the configuration ask for MongoDB parameters?
I'm not too familiar with that template; it was introduced by MongoDB and they are the ones who usually support it.
However, looking at https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/mongodb-to-googlecloud/docs/MongoDbToBigQueryCDC/README.md, it appears that the template requires a change-stream process running that pushes the changes from MongoDB to a Pub/Sub topic, which seems to be the missing piece here.
Maybe this is helpful: https://www.mongodb.com/developer/products/mongodb/stream-data-mongodb-bigquery-subscription/
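Conceptually the missing piece is quite small: a process that tails the change stream and publishes each event to the Pub/Sub topic the template reads from. Here is a hedged sketch in Python with pymongo and the Pub/Sub client; the connection URI, project, topic, database, and collection names are placeholders, and the MongoDB tutorial linked above shows a fuller version.

    import json
    from pymongo import MongoClient
    from google.cloud import pubsub_v1

    client = MongoClient("<mongodb-connection-uri>")                    # placeholder
    collection = client["mydb"]["mycollection"]                         # placeholder

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-gcp-project", "mongodb-cdc")  # placeholder

    # Tail the change stream and forward every event to Pub/Sub for the Dataflow job.
    with collection.watch(full_document="updateLookup") as stream:
        for change in stream:
            payload = json.dumps(change, default=str).encode("utf-8")
            publisher.publish(topic_path, data=payload)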

How to keep track of Cassandra write successes while using Kafka in cluster

When working in my cluster, I have the constraint that my frontend cannot display a job as finished until all of the job's different results have been written to Cassandra. These results are computed in their individual microservices and sent via Kafka to a Cassandra writer.
My question is whether there are any best practices for letting the frontend know when these writes have completed. Should I make another database entry for results, or is there some other smart way that would scale well?
Each job has about 100 different results written to it, and I have roughly 1000 jobs/day.
I used Cassandra for a UI backend with Kafka in the past, and we would store a status field in each DB record, which would periodically get updated through a slew of Kafka Streams processors (there were easily more than 1000 DB writes per day).
The UI itself ran a setInterval(refresh) JS function that would query the latest database state and then update the DOM accordingly.
Your other option is to push websocket/SSE data into the UI from some other service that indicates "data is finished".
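As a hedged illustration of the status-field idea, the Cassandra writer (or a small consumer next to it) can bump a per-job progress counter that the frontend's API polls. The table, topic, and expected-result count below are hypothetical, and note that with at-least-once Kafka delivery a plain counter can over-count, so a per-result row plus a count query is the safer variant.

    import json
    from cassandra.cluster import Cluster
    from kafka import KafkaConsumer

    # Hypothetical schema:
    #   CREATE TABLE job_progress (job_id text PRIMARY KEY, results_written counter);
    session = Cluster(["cassandra-host"]).connect("jobs_ks")

    consumer = KafkaConsumer("job-results",
                             bootstrap_servers="kafka:9092",
                             value_deserializer=lambda v: json.loads(v))

    for msg in consumer:
        result = msg.value
        # ... write the actual result row here ...
        session.execute(
            "UPDATE job_progress SET results_written = results_written + 1 WHERE job_id = %s",
            (result["job_id"],),
        )

    # The frontend (or its API) can then poll a cheap single-partition read:
    def is_finished(job_id, expected=100):
        row = session.execute(
            "SELECT results_written FROM job_progress WHERE job_id = %s", (job_id,)
        ).one()
        return row is not None and row.results_written >= expected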

Single Batch job performing heavy database reads

I have a Spring Batch solution which reads several tables in an Oracle database, does some flattening and cleaning of the data, and sends it to a RESTful API, which is our BI platform. The Spring Batch job breaks this data down into chunks by date, not by size, so it may happen that on a particular day one chunk consists of a million rows. We run the complete end-to-end flow in the following way:
Control-M sends a trigger to the load balancer at a scheduled time
Through the load balancer, the request lands on an instance of the Spring Batch app
Spring Batch reads data for that day in chunks from Oracle database
Chunks are then sent to target API
My problems are:
The chunks can get heavy. If a chunk contains a million rows, the instance's heap usage grows, and at some point the chunks get processed at a trickling pace
One instance bears the load of the entire batch processing
How can I distribute this processing across a group of instances? Is parallel processing achievable, and if so, how can I make sure that the same rows are not read by multiple instances (to avoid duplication)? Any other suggestions?
Thanks.
You can use a (locally or remotely) partitioned step where each worker step is assigned a distinct dataset. You can find more details and a code example in the documentation here:
https://docs.spring.io/spring-batch/docs/current/reference/html/spring-batch-integration.html#remote-partitioning
https://github.com/spring-projects/spring-batch/tree/main/spring-batch-samples#partitioning-sample

Kubernetes deployment strategy using CQRS with dotnet & MongoDb

I am re-designing a dotnet backend api using the CQRS approach. This question is about how to handle the Query side in the context of a Kubernetes deployment.
I am thinking of using MongoDB as the query database. The app is a dotnet webapi app. So what would be the best approach:
Run the dotnet app AND MongoDB together in one pod (MongoDB as a sidecar container). Scale as needed.
Containerize MongoDB in its own pod and deploy one MongoDB pod PER REGION. Then have the dotnet containers use the MongoDB pod within their own region. Scale MongoDB by region, and the dotnet pods as needed within and between regions.
Some other approach I haven't thought of
I would start with the simplest approach, and that is to place the write and read sides together, because they belong to the same bounded context.
Then, in the future, if it is needed, I would consider adding more read sides or scaling out to other regions.
To get started I would also consider putting the read side inside the same VM as the write side, just to keep it simple, as getting it all up and working in production is always a big task with a lot of pitfalls.
I would consider using a Kafka-like system to transport the data to the read sides. With plain queues, if you later add a new read side, or want to rebuild a read-side instance, things can get troublesome, because the sender needs to know which read sides you have. With a Kafka style of integration, each read side can consume the events at its own pace, you can more easily add more read sides later on, and the sender does not need to be aware of the receivers.
Kafka allows you to decouple the producers of data from the consumers of the data: in Kafka you have a set of producers appending data to the Kafka log, and then one or more consumers processing that log of events.
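A minimal sketch of that decoupling, using kafka-python with hypothetical topic and group names: the write side just appends events, and each read side is its own consumer group, so it consumes the same log at its own pace, and a new read side can be added (or rebuilt by replaying from the beginning) without touching the producer.

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Write side: append domain events, unaware of who will consume them.
    producer = KafkaProducer(bootstrap_servers="kafka:9092",
                             value_serializer=lambda v: json.dumps(v).encode())
    producer.send("order-events", {"type": "OrderPlaced", "order_id": "42"})
    producer.flush()

    # Read side: its own consumer group; it projects events into its query store (e.g. MongoDB).
    consumer = KafkaConsumer("order-events",
                             bootstrap_servers="kafka:9092",
                             group_id="orders-read-model",
                             auto_offset_reset="earliest",
                             value_deserializer=lambda v: json.loads(v))
    for event in consumer:
        # update the read model here
        print(event.value)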
It has been almost 2 years since I posted this question. Now, with 20/20 hindsight, I thought I would post my solution. I ended up simply provisioning an Azure Cosmos DB in the region where my cluster lives, and hitting the Cosmos DB for all my query-side requirements.
(My cluster already lives in the Azure cloud.)
I maintain one Postgres DB in my original cluster for my write-side requirements, and my app scales nicely in the cluster.
I have not yet needed to deploy clusters to new regions. When that happens, I will provision a replica of the Cosmos DB in that additional region or regions, but still keep just one Postgres DB for the write-side requirements. I am not going to bother trying to maintain/sync replicas of the Postgres DB.
Additional insight #1. By provisioning the Cosmos DB separately from my cluster (but in the same region), I take the load off of my cluster nodes. In effect, the Cosmos DB has its own dedicated compute resources, backups, etc.
Additional insight #2. It is obvious now, but wasn't back then, that tightly coupling a document DB (such as MongoDB) to a particular pod is...a bonkers bad idea. Imagine horizontally scaling your app: with each new instance of your app you would instantiate a new document DB. You would quickly bloat your nodes and crash your cluster. One read-side document DB per cluster is an efficient and easy way to roll.
Additional insight #3. The read side of any CQRS system can get a nice jolt of adrenaline from an in-memory cache like Redis. You can first check whether some data is available in the cache before you hit the document DB. I use this approach for data such as a checkout cart, where I leave the data in the cache for 24 hours and then let it expire. You could conceivably use Redis for all your read-side requirements, but memory could quickly become bloated. So the idea here is to consider deploying an in-memory cache on your cluster -- only one instance of the cache -- and have all your apps hit it for low latency/high availability, but do not use the cache as a replacement for the document DB.
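A cache-aside sketch of insight #3, assuming redis-py and pymongo with made-up names: check Redis first, fall back to the document DB, and cache the result with a TTL so the cache never becomes the source of truth.

    import json
    import redis
    from pymongo import MongoClient

    cache = redis.Redis(host="redis", port=6379)
    carts = MongoClient("mongodb://mongo:27017")["shop"]["carts"]   # placeholder

    CART_TTL_SECONDS = 24 * 60 * 60  # matches the 24-hour expiry mentioned above

    def get_cart(cart_id: str):
        key = f"cart:{cart_id}"
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
        # Cache miss: read from the document DB and populate the cache with a TTL.
        cart = carts.find_one({"_id": cart_id}, {"_id": 0})
        if cart is not None:
            cache.setex(key, CART_TTL_SECONDS, json.dumps(cart, default=str))
        return cart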

Reading from a MongoDB changeStream with unbounded PCollections in Apache Beam

I'm designing a new way for my company to stream data from multiple MongoDB databases, perform some arbitrary initial transformations, and sink them into BigQuery.
There are various requirements but the key ones are speed and ability to omit or redact certain fields before they reach the data warehouse.
We're using Dataflow to basically do this:
MongoDB -> Dataflow (Apache Beam, Python) -> BigQuery
We basically need to just wait on the collection.watch() call as the input, but from the docs and existing research it may not be possible.
At the moment, the MongoDB connector is bounded, and there seems to be no readily available solution for reading from a changeStream, or from a collection, in an unbounded way.
Is it possible to read from a changeStream and have the pipeline wait until the task is killed, rather than running out of records?
In this instance I decided to go via Google Pub/Sub, which serves as the unbounded data source.
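With Pub/Sub in front, the pipeline itself becomes an ordinary streaming job. Here is a sketch with the Beam Python SDK, assuming a separate change-stream process already publishes MongoDB events to the topic; the subscription, table, and field names are placeholders.

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    def redact(doc):
        doc.pop("ssn", None)  # example of omitting a sensitive field before the warehouse
        return doc

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
               subscription="projects/my-project/subscriptions/mongo-changes")
         | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "Redact" >> beam.Map(redact)
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my-project:analytics.mongo_events",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))

The pipeline keeps reading until the job is drained or cancelled, which is the "wait until the task is killed" behaviour the question asks about.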