How do I efficiently migrate MongoDB to Azure Cosmos DB with the help of Azure Databricks?

While searching for a service to migrate our on-premises MongoDB to Azure Cosmos DB with the Mongo API, we came across a service called Azure Databricks. We have a total of 186 GB of data, which we need to migrate to Cosmos DB with as little downtime as possible. How can we improve the data transfer rate? If someone can give some insights into this Spark-based PaaS provided by Azure, it would be very helpful.
Thank you

Have you referred to the article on our docs page?
In general you can assume the migration workload will consume the entire provisioned throughput, so the provisioned throughput gives an estimate of the migration speed. You could consider increasing the RUs for the duration of the migration and reducing them afterwards.
Migration performance can be tuned through these configurations (a sketch follows the list):
Number of workers and cores in the Spark cluster
maxBatchSize
Disable indexes during data transfer
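As a minimal, hedged sketch of such a copy from a Databricks notebook, assuming the MongoDB Spark connector (v10+) is attached to the cluster and that the URIs, database/collection names, and batch size below are placeholders to replace with your own:

# spark is the SparkSession that Databricks provides by default.
source_uri = "mongodb://<on-prem-host>:27017"  # placeholder source
target_uri = (
    "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/"
    "?ssl=true&replicaSet=globaldb&retrywrites=false"
)  # placeholder Cosmos DB (Mongo API) connection string

df = (
    spark.read.format("mongodb")
    .option("connection.uri", source_uri)
    .option("database", "mydb")            # placeholder names
    .option("collection", "mycollection")
    .load()
)

(
    df.write.format("mongodb")
    .option("connection.uri", target_uri)
    .option("database", "mydb")
    .option("collection", "mycollection")
    .option("maxBatchSize", 100)           # smaller batches make it easier to stay under the provisioned RUs
    .mode("append")
    .save()
)

Scale the copy by adding workers and cores to the cluster and raising the RUs for the duration of the migration, and create non-essential indexes on the target only after the transfer completes.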

Related

High latency using Azure Database for PostgreSQL flexible server

I have a PostgreSQL database deployed on Azure using Azure Database for PostgreSQL flexible server.
I have the same database created locally, and queries that take 70 ms locally take 800 ms on the database deployed on Azure.
The latency occurs whether I query the database from my local machine or from an app deployed in Azure.
Any clue what may be the issue or how this can be improved?
The compute tier I am using in Azure is the following (I know it's the worst one but I still think it shouldn't take almost a second to query a table that has 50 records in it and two columns):
Burstable (1-2 vCores) - Best for workloads that don't need the full CPU continuously
Standard_B1ms (1 vCore, 2 GiB memory, 640 max iops)
The metrics Azure shows me in the overview are why I am guessing the problem is network related and not CPU/memory related.
Burstable SKUs are meant for testing and small workloads; changing to the General Purpose tier will improve performance.
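One way to check whether the time is spent on the network or on the server is to compare the client-measured round trip with the server-side execution time reported by EXPLAIN ANALYZE. A minimal sketch, assuming psycopg2 and placeholder connection details and table name:

import time
import psycopg2

conn = psycopg2.connect(
    host="<server>.postgres.database.azure.com",  # placeholder
    dbname="mydb", user="myuser", password="<password>", sslmode="require",
)
cur = conn.cursor()

start = time.perf_counter()
cur.execute("SELECT * FROM my_table;")  # client-observed latency includes the network hop
cur.fetchall()
client_ms = (time.perf_counter() - start) * 1000

cur.execute("EXPLAIN ANALYZE SELECT * FROM my_table;")
plan = "\n".join(row[0] for row in cur.fetchall())  # 'Execution Time' here is measured on the server

print(f"client round trip: {client_ms:.1f} ms")
print(plan)

If the server reports an execution time of a few milliseconds while the client round trip is hundreds, the difference is network and connection overhead rather than compute.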

MongoDB && Serverless architecture && batch processing

I believe that AWS Lambdas (serverless) are not good for batching; by definition they stop after 15 minutes of processing.
I have a MongoDB Atlas (MongoDB's cloud service) DB and I need to process a large dataset/collection several times a week from an EC2 Node.js app.
Which architectural solutions would make this efficient?
Regards
Two key observations here: the size of the data and how it is stored. MongoDB is partitioned, so data can be read and processed in parallel.
Given both of those properties, Apache Spark is the best processing option. In AWS, a couple of services provide it: Amazon EMR and AWS Glue. From a cost and flexibility perspective, Amazon EMR is the better option.
What if you are not looking for parallel processing? In that case, AWS Batch would be a better option. With AWS Batch you can run batch jobs on either EC2 or Fargate infrastructure, and you don't have to worry about provisioning and terminating a cluster.
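As a hedged illustration of the AWS Batch route (the region, queue, and job-definition names below are placeholders that must already exist and point at an EC2 or Fargate compute environment, and the container image is assumed to hold your processing code):

import boto3

batch = boto3.client("batch", region_name="us-east-1")  # placeholder region

response = batch.submit_job(
    jobName="mongo-weekly-processing",
    jobQueue="my-job-queue",            # placeholder: pre-created job queue
    jobDefinition="mongo-processor:1",  # placeholder: pre-registered job definition
    containerOverrides={
        "environment": [
            {"name": "MONGO_URI", "value": "mongodb+srv://<atlas-cluster>/mydb"},  # placeholder Atlas URI
        ]
    },
)
print("submitted job:", response["jobId"])

Unlike Lambda, the job can run for hours, and a managed compute environment can scale down to zero when nothing is queued.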

Will Serverless support AWS DocumentDB?

I work in a company that's using Serverless to build cloud-native applications and services. Today we use DynamoDB and SQL Databases with AWS Aurora.
We want to go with DocumentDB for our next application, but we could not find anything about Serverless and AWS DocumentDB. Does Serverless support AWS DocumentDB? If not, are there any plans to support it in the future?
Serverless supports any AWS resources that you can define using CloudFormation. As per the Serverless docs here:
Define your AWS resources in a property titled resources. What goes in this property is raw CloudFormation template syntax, in YAML...
The YAML for creating a DocumentDB cluster is going to look something like:
resources:
  Resources:
    DBCluster:
      Type: "AWS::DocDB::DBCluster"
      DeletionPolicy: Delete
      Properties:
        DBClusterIdentifier: "MyCluster"
        MasterUsername: "MasterUser"
        MasterUserPassword: "Password1234!"

    DBInstance:
      Type: "AWS::DocDB::DBInstance"
      Properties:
        DBClusterIdentifier: "MyCluster"
        DBInstanceIdentifier: "MyInstance"
        DBInstanceClass: "db.r4.large"
      DependsOn: DBCluster
You can find the other CloudFormation resources that you can define in the resources parameter of your Serverless.yaml here.
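Once the cluster exists, a function deployed by Serverless can connect to it like any other MongoDB endpoint. A minimal sketch with pymongo, assuming the cluster endpoint below is a placeholder, the credentials are the ones from the template above, and the function runs inside the cluster's VPC:

from pymongo import MongoClient

# DocumentDB requires TLS; AWS publishes a combined CA bundle
# (rds-combined-ca-bundle.pem) that has to be shipped alongside the code.
client = MongoClient(
    "mongodb://MasterUser:Password1234!@<cluster-endpoint>:27017/"
    "?tls=true&tlsCAFile=rds-combined-ca-bundle.pem"
    "&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
)
client["mydb"]["items"].insert_one({"hello": "documentdb"})  # placeholder database/collection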
DocumentDB is not a serverless service. You need to manage the backend server to use it.
Please refer to this blog: https://blogs.itemis.com/en/serverless-services-on-aws, where you can see it is not in the list of "SERVERLESS SERVICES ON AWS".
No, this won't be serverless; if you really want a serverless option you can go with DynamoDB. You can also see the differences below if you want.
DocumentDB
MongoDB is supported by this database, which makes it easy to learn
Stored procedures are needed here, and data retrieval and aggregation are done with their help
Document size is limited to 16 MB and storage scales up to 64 TB of data
Daily backups are managed by the service itself and can be restored whenever required
This is costly, as you pay around $200/month even if you use only a few database instances or only a few hours
AWS is not involved in storing user credentials, as these are stored directly in the database
Available in specific regions
Can be easily migrated out of AWS into any MongoDB
In case of a primary node failure, the service promotes a read replica to primary. Multi-AZ has to be configured by users. Backups can be copied across regions
DynamoDB
MongoDB is not directly supported, and it is not easy to migrate from MongoDB to DynamoDB
Stored procedures are not needed, which makes things easier for users
There is no fixed limit on document size, as it can scale up to the size of user requirements
Daily backups are not automatic; backups have to be triggered explicitly by users, and can be restored whenever needed
There is an initial cost associated with it, but the overall cost is lower. On-demand pricing is also available, where users can manage with as little as $1/month; 25 GB of data is provided for free in the first stage
AWS controls user access to the database through Identity and Access Management (IAM), where authentication and authorization are required at a low level as well
Available in all regions
Cannot be easily migrated out of AWS into MongoDB; you need to write code to transform the data
Supports global tables, which protect users against regional failure. Data is automatically replicated across multiple AZs in a single region.

MongoDB on Azure worker role

I'm developing an application that uses SignalR to manage WebSockets and allow my clients to talk to each other.
I'm planning to host this back office on an Azure worker role. Since my SignalR requests carry data that is most of the time saved to the database, I'm wondering whether MongoDB, instead of the classic SQL Server/Entity Framework pairing, would be a good approach.
Assuming that most of my application's data will be strings, I think MongoDB will be a reliable and performant solution, and it will let me avoid Azure SQL Database costs.
For information, the Azure worker role will be running on a machine with the following hardware: a 1-core CPU, 3.5 GB RAM and 50 GB SSD storage.
Do you think I'm off to a good start with this architecture?
Thanks
Do you think I'm off to a good start with this architecture?
In a word, no.
A user asked a similar question regarding running Redis on Worker Roles - Setting up Redis on Azure cloud service worker role - all of the content on that Q/A is relevant in the MongoDb context.
I'd suggest that you read my answer as it goes into more detail, but as an overview of why this is a bad architectural approach:
You cannot guarantee when a Worker Role will be restarted by the Azure Service Fabric.
In a real-world implementation of Mongo, you would run multiple nodes within a cluster; with a single Worker Role (as you have suggested in your question) this won't be possible.
You will need to manage your MongoDb installation within the Worker Role and they simply aren't designed for this.
If you are really fixed on using Mongo, I would suggest that you use a hosted solution such as MongoLabs (as suggested in earlier answers), or consider hosting it on Azure IaaS VM's.
If you are not fixed on using Mongo, I would sincerely suggest that you look at Azure DocumentDb (also suggested above), Microsoft's Azure NoSQL offering - I have used it in several production systems already and it is certainly a capable NoSQL solution; granted, it may not have all of the features available with MongoDb.
If you are looking at a NoSQL solution for caching of data (i.e. not long term storage), I would suggest you take a look at Azure Redis Cache, which is a very capable Redis offering.
Azure has its own native NoSQL document database called DocumentDB, have you had a look at it? If I were you I would use DocumentDB unless you have some special requirements that you have not mentioned, but from the little requirement info you have posted, DocumentDB would do just fine. I don't think it is quite the same as MongoDB in terms of basic functionality; see this article for a comparison between Azure DocumentDB and MongoDB.

How to continuously write MongoDB data into a running HDInsight cluster

I want to keep a Windows Azure HDInsight cluster always running so that I can periodically write updates from my master data store (which is MongoDB) and have it process MapReduce jobs on demand.
How can I periodically sync data from MongoDB to the HDInsight service? I'm trying not to have to upload all of the data whenever a new query is submitted (which could happen at any time), but instead have it somehow pre-warmed.
Is that possible on HDInsight? Is it even possible with Hadoop?
Thanks,
It is certainly possible to have that data pushed from Mongo into Hadoop.
Unfortunately, HDInsight does not support HBase (yet); otherwise you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog used for replication and writes it out to HBase.
Another solution might be to write documents out from your Mongo to Azure Blob storage; this means you wouldn't have to keep the cluster up all the time, but you would still be able to use it for periodic MapReduce analytics against the files in storage.
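A hedged sketch of that blob-storage approach, using pymongo and the Azure Storage Blob SDK; the connection strings, container name, and the updated_at field are placeholders, and the sync cutoff would normally be persisted between runs rather than hard-coded:

import json
from datetime import datetime, timezone
from pymongo import MongoClient
from azure.storage.blob import BlobServiceClient

mongo = MongoClient("mongodb://<master-data-store>:27017")  # placeholder
collection = mongo["mydb"]["mycollection"]                  # placeholder names

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")  # placeholder
container = blob_service.get_container_client("hdinsight-input")

# Export only documents changed since the last sync (assumes an updated_at field).
last_sync = datetime(2020, 1, 1, tzinfo=timezone.utc)  # placeholder cutoff
docs = collection.find({"updated_at": {"$gt": last_sync}})

lines = "\n".join(json.dumps(doc, default=str) for doc in docs)
container.upload_blob(
    name=f"sync-{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.jsonl",
    data=lines,
    overwrite=True,
)

The cluster (or an on-demand job) can then run its MapReduce work against the .jsonl files in the container without touching MongoDB directly.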
Your best method is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.