MongoDB, serverless architecture, and batch processing

I believe that AWS Lambda (serverless) is not a good fit for batch processing, since by design a function stops after 15 minutes of processing.
I have a MongoDB Atlas (MongoDB's cloud service) database and I need to process a large dataset/collection several times a week from a Node.js app on EC2.
What would be good architectural solutions to do this efficiently?
Regards

There are two key observations here: the size of the data and how it is stored. MongoDB is partitioned, so the data can be read and processed in parallel.
Given both of those properties, Apache Spark is the best processing option. In AWS, a couple of services provide it, namely Amazon EMR and AWS Glue. From a cost and flexibility perspective, Amazon EMR is the better option.
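As a rough illustration of the Spark route, here is a minimal PySpark sketch, assuming the MongoDB Spark connector (3.x-style API) is available on the EMR cluster; the Atlas URI, database/collection names, the aggregation, and the output bucket are placeholders, not anything from the question.

```python
# Minimal sketch: read an Atlas collection in parallel on EMR and write results
# to S3. The connection URI, names and the aggregation are assumed placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("atlas-weekly-batch")
         # Assumed Atlas connection string; the connector splits the collection
         # into partitions so executors read and process it in parallel.
         .config("spark.mongodb.input.uri",
                 "mongodb+srv://user:password@cluster0.example.mongodb.net/appdb.events")
         .getOrCreate())

events = spark.read.format("mongo").load()

# Stand-in for the real processing logic.
summary = events.groupBy("status").count()

summary.write.mode("overwrite").parquet("s3://my-results-bucket/weekly-run/")
```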
What if you are not looking for parallel processing? In that case, AWS Batch would be the better option. With AWS Batch you can run batch jobs on either EC2 or Fargate infrastructure, and you don't have to worry about provisioning and terminating a cluster yourself.
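For the AWS Batch route, here is a minimal sketch of submitting the weekly job with boto3. The job queue, job definition, and environment variable names are assumptions; you would create them in AWS Batch beforehand and point the job definition at a container image holding your processing code. A scheduled EventBridge rule or a cron entry on the existing EC2 host could trigger this a few times a week.

```python
# Minimal sketch: submit a batch job that processes the Atlas collection.
# Queue, job definition and variable names are assumed, not real resources.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="weekly-mongo-processing",
    jobQueue="mongo-processing-queue",        # assumed job queue
    jobDefinition="mongo-processor:1",        # assumed job definition (container image)
    containerOverrides={
        "environment": [
            # Assumed way of passing the Atlas connection string; in practice
            # you would likely pull it from Secrets Manager instead.
            {"name": "MONGODB_URI", "value": "mongodb+srv://..."},
        ]
    },
)
print("Submitted job:", response["jobId"])
```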

Related

Should I use AWS Glue or Spark on EMR for processing binary data to parquet format

I have a work requirement to read binary data from sensors and produce Parquet output for analytics.
For storage I have chosen S3 and DynamoDB.
For the processing engine, I'm unsure how to choose between AWS EMR and AWS Glue.
The data processing codebase will be maintained in Python together with Spark.
Please share your suggestions on choosing between AWS EMR and AWS Glue.
Choosing between Glue and EMR depends on your use case.
EMR is a managed cluster of servers and costs less than Glue, but it also requires more maintenance and set-up overhead. On EMR you can run not only Spark but also other frameworks such as Flink.
Glue is serverless Spark/Python and really easy to use. It does not run the latest Spark version and abstracts a lot of Spark away, in a good but also a bad sense: you cannot set specific configurations very easily.
It's an opinion-based question, and note that AWS EMR Serverless is now available as well.
AWS Glue is 1) more managed and thus more restricted, 2) has, imho, issues with crawling for schema changes that you need to consider, 3) has its own interpretation of DataFrames, 4) offers less run-time configuration, and 5) fewer options for serverless scalability. There also seem to be a few bugs that keep popping up.
AWS EMR is 1) an AWS platform that is easy enough to configure, 2) comes with the AWS flavour of what they think the best way of running Spark is, 3) has some limitations when it comes to scaling resources back down after dynamic scale-out, 4) uses standard Spark, so there is a bigger pool of people to hire, and 5) allows bootstrapping of software that is not supplied as standard, as well as selecting standard software such as, say, HBase.
So they are comparable to an extent, and divergent in other ways: AWS Glue is ETL/ELT, while AWS EMR is that with more capabilities.
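To make the trade-off concrete, here is a minimal PySpark sketch of the binary-to-Parquet step itself; it runs on either EMR or a Glue Spark job (Spark 3.x). The bucket paths are placeholders and the decoding step is left as a stub, since the real sensor format isn't known here.

```python
# Minimal sketch: read raw sensor files from S3 and write Parquet for analytics.
# Paths are placeholders; replace the select() with the real payload decoder.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sensor-binary-to-parquet").getOrCreate()

# Spark 3.x ships a "binaryFile" source: each row carries path,
# modificationTime, length and the raw bytes in the "content" column.
raw = spark.read.format("binaryFile").load("s3://my-sensor-bucket/raw/")

decoded = raw.select(
    F.col("path"),
    F.col("modificationTime"),
    F.length("content").alias("payload_bytes"),   # stub for real decoding
)

decoded.write.mode("overwrite").parquet("s3://my-sensor-bucket/curated/")
```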

How do I efficiently migrate MongoDB to Azure Cosmos DB with the help of Azure Databricks?

While searching for a service to migrate our on-premises MongoDB to Azure Cosmos DB's API for MongoDB, we came across Azure Databricks. We have a total of 186 GB of data that we need to migrate to Cosmos DB with as little downtime as possible. How can we improve the data transfer rate? Any insights into this Spark-based PaaS offered by Azure would be very helpful.
Thank you
Have you referred to the article on our docs page?
In general, you can assume the migration workload will consume the entire provisioned throughput, so the throughput you provision gives an estimate of the migration speed. You could consider increasing the RUs for the duration of the migration and reducing them afterwards.
The migration performance can be adjusted through these configurations:
Number of workers and cores in the Spark cluster
maxBatchSize
Disable indexes during data transfer
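As a rough sketch of tying those knobs together on Databricks, assuming the MongoDB Spark connector is attached to the cluster; the connection strings, database/collection names, and the batch size value below are placeholders.

```python
# Minimal sketch: copy a collection from on-premises MongoDB into Cosmos DB's
# API for MongoDB. URIs and names are placeholders; maxBatchSize is one of the
# tuning knobs listed above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongo-to-cosmos-migration").getOrCreate()

source = (spark.read.format("mongo")
          .option("uri", "mongodb://source-host:27017")   # assumed source URI
          .option("database", "appdb")
          .option("collection", "events")
          .load())

(source.write.format("mongo")
 # Assumed Cosmos DB Mongo API connection string from the Azure portal.
 .option("uri", "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true")
 .option("database", "appdb")
 .option("collection", "events")
 .option("maxBatchSize", 512)     # documents per bulk write
 .mode("append")
 .save())
```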

What is the best way to take snapshots of an EC2 instance running MongoDB?

I wanted to automate taking snapshots of the volume attached to the EC2 instance running the primary node of our production MongoDB replica set. While trying to gauge the pitfalls and best practices on Google, I came across the fact that data inconsistency and corruption are very much possible while creating a snapshot, but not if journaling is enabled, which it is in our case.
So my question is - is it safe to go ahead and execute aws ec2 create-snapshot --volume-id <volume-id> to get clean backups of my data?
Moreover, I plan on running the same command via a cron job that runs once every week. Is that a good enough plan to have scheduled backups?
For MongoDB on an EC2 instance I do the following:
mongodump to a backup directory on the EBS volume
zip the mongodump output directory
copy the zip file to an S3 bucket (with encryption and versioning enabled)
initiate a snapshot of the EBS volume
I write a script to perform the above tasks, and schedule it to run daily via cron. Now you will have a backup of the database on the EC2 instance, in the EBS snapshot, and on S3. You could go one step further by enabling cross region replication on the S3 bucket.
This setup provides multiple levels of backup. You should now be able to recover your database in the event of an EC2 failure, an EBS failure, an AWS Availability Zone failure or even a complete AWS Region failure.
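A minimal sketch of such a script in Python (boto3), following the four steps above; the bucket name, volume ID, and backup path are placeholders, and in practice you would add error handling and record the snapshot ID. Scheduling it is then a one-line cron entry on the instance.

```python
#!/usr/bin/env python3
# Hypothetical backup script following the four steps above: mongodump,
# compress, upload to S3, snapshot the EBS volume. Bucket, volume ID and
# paths are placeholders.
import datetime
import shutil
import subprocess

import boto3

stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
dump_dir = f"/backup/mongodump-{stamp}"
bucket = "my-mongodb-backups"          # assumed bucket name
volume_id = "vol-0123456789abcdef0"    # assumed EBS volume ID

# 1. mongodump to a backup directory on the EBS volume.
subprocess.run(["mongodump", "--out", dump_dir], check=True)

# 2. Zip the mongodump output directory.
archive = shutil.make_archive(dump_dir, "zip", dump_dir)

# 3. Copy the zip file to the S3 bucket (encryption and versioning are set on the bucket).
boto3.client("s3").upload_file(archive, bucket, f"mongodump-{stamp}.zip")

# 4. Initiate a snapshot of the EBS volume.
boto3.client("ec2").create_snapshot(
    VolumeId=volume_id,
    Description=f"MongoDB backup {stamp}",
)
```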
I would recommend reading the official MongoDB documentation on EC2 deployments:
https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
https://docs.mongodb.org/ecosystem/tutorial/backup-and-restore-mongodb-on-amazon-ec2/

How to set up a cross-region replica of AWS RDS for PostgreSQL

I have an RDS for PostgreSQL instance set up in Asia and would like to have a read copy in the US.
Unfortunately, I just found on the official site that only RDS for MySQL has cross-region replicas, not RDS for PostgreSQL.
I also saw that this page introduces other ways to migrate data into and out of RDS for PostgreSQL.
Short of buying an EC2 instance and installing PostgreSQL myself in the US, is there any way to synchronize data from the Asia RDS instance to a US RDS instance?
It all depends on the purpose of your replication. Is it to provide a local data source and avoid network latency?
Assuming that your goal is to have cross-region replication, you have a couple of options.
Custom EC2 Instances
You can create your own EC2 instances and install PostgreSQL so you can customize replication behavior.
I've documented configuring master-slave replication with PostgreSQL on my blog: http://thedulinreport.com/2015/01/31/configuring-master-slave-replication-with-postgresql/
Of course, you lose some of the benefits of AWS RDS, namely automated multi-AZ redundancy, etc., and now all of a sudden you have to be responsible for maintaining your configuration. This is far from perfect.
Two-Phase Commit
An alternative option is to build replication into your application. One approach is to use a database driver that can do this, or to implement your own two-phase commit. If you are using Java, some ideas are described here: JDBC - Connect Multiple Databases
Use SQS to decouple database writes
OK, this is the one I would personally prefer: for all of your database writes, use SQS and have background writer processes that take messages off the queue.
You will need a writer in the Asia region and a writer in the US region. To publish to SQS across regions you can use an SNS configuration that publishes messages onto multiple queues: http://docs.aws.amazon.com/sns/latest/dg/SendMessageToSQS.html
Of course, unlike a two-phase commit, this approach is subject to bugs and it is possible for your US database to get out of sync. You will need to implement a reconciliation process; a simple one could be a weekly pg_dump from the Asia database and pg_restore into the US database to re-sync it. Another approach is something like a Cassandra read repair: for every 10 reads out of your US database, spin up a background process to run the same query against the Asia database, and if they return different results, kick off a process to replay some messages.
This approach is common, actually, and I've seen it used on Wall St.
So, pick your battle: either you create your own EC2 instances and take ownership of configuration and devops (yuck), implement a two-phase commit that guarantees consistency, or relax consistency requirements and use SQS and asynchronous writers.
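For the SQS option, here is a minimal sketch of the background writer process, assuming boto3 and psycopg2 and a very simple message format. The queue URL, DSN, and the {"sql", "params"} payload shape are all assumptions, not a production design; messages arriving via SNS fan-out would additionally be wrapped in an SNS envelope you would need to unwrap.

```python
# Hypothetical background writer for the US region: drain the queue and apply
# writes to the local RDS instance. Queue URL, DSN and the {"sql", "params"}
# message shape are assumptions.
import json

import boto3
import psycopg2

QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/db-writes"  # assumed
conn = psycopg2.connect("host=my-us-rds dbname=app user=writer")          # assumed DSN

sqs = boto3.client("sqs", region_name="us-west-2")

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])  # e.g. {"sql": "INSERT ...", "params": [...]}
        with conn, conn.cursor() as cur:
            cur.execute(body["sql"], body["params"])
        # Delete only after the transaction committed, so a crash replays the message.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```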
This is now directly supported by RDS.
An example of creating a cross-region replica using the CLI:
aws rds create-db-instance-read-replica \
--db-instance-identifier DBInstanceIdentifier \
--region us-west-2 \
--source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:my-postgres-instance

How to continuously write MongoDB data into a running HDInsight cluster

I want to keep a Windows Azure HDInsight cluster running at all times so that I can periodically write updates from my master data store (which is MongoDB) and have it process MapReduce jobs on demand.
How can I periodically sync data from MongoDB to the HDInsight service? I'm trying not to have to upload all the data whenever a new query is submitted (which can happen at any time), but instead have it somehow pre-warmed.
Is that possible on HDInsight? Is it even possible with Hadoop?
Thanks,
It is certainly possible to have that data pushed from Mongo into Hadoop.
Unfortunately HDInsight does not support HBase (yet); otherwise you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog used for replication and then writes it out to HBase.
Another solution might be to write documents from your Mongo out to Azure Blob storage. That way you wouldn't have to keep the cluster up all the time, but you could still use it to run periodic MapReduce analytics against the files in storage.
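A minimal sketch of that export, assuming the pymongo and azure-storage-blob packages; the connection strings, database/collection, container, and blob names are placeholders, and a real job would export incrementally (for example by filtering on an updated-at field) rather than dumping everything each run.

```python
# Hypothetical export job: dump a collection as newline-delimited JSON into a
# blob container that the HDInsight cluster reads from. Connection strings and
# names are placeholders.
import json

from azure.storage.blob import BlobServiceClient
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")  # assumed source MongoDB
blob_service = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...")  # assumed

collection = mongo["appdb"]["events"]
container = blob_service.get_container_client("hdinsight-input")

# One JSON document per line so Hadoop/Hive jobs can read the blob directly.
lines = "\n".join(json.dumps(doc, default=str)
                  for doc in collection.find({}, {"_id": 0}))
container.upload_blob(name="events/export.json", data=lines, overwrite=True)
```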
Your best method is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.