AWS migrate data from MongoDB to DynamoDB/S3/Redshift - mongodb

The issue is that migrating data from MongoDB to DynamoDB/S3/Redshift is, as I understand it, currently not available to us via the AWS DMS service, as it does not support all data types. Or maybe I'm wrong.
The problem is that our Mongo objects contain non-scalar fields (arrays, maps).
So when I create a migration task via AWS DMS in table mode, it pulls the data badly. For some reason only selection works; transformation rules are ignored by DMS (I tried renaming and removing).
In document mode everything is OK, but how can I run the migration with some custom script for the transformation? Storing the data this way still needs transformation.
We need modifications like: renaming and removing fields, and flattening some fields (for example, we have a map object and it should be flattened into several scalar fields).
The migration should go into one of these targets: S3, DynamoDB, Redshift.
I will be thankful for any help and suggestions.
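As a starting point, here is a minimal sketch of the kind of custom transformation script described above, assuming a plain Python worker with pymongo and boto3 instead of DMS. The connection string, database/collection names, field names and the target DynamoDB table are all hypothetical placeholders.

# Hypothetical sketch: pull documents from MongoDB, rename/remove/flatten
# fields, and write the result to DynamoDB. All names are placeholders.
from pymongo import MongoClient
import boto3

mongo = MongoClient("mongodb://localhost:27017")           # assumed source
source = mongo["my_db_name"]["my_collection"]

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
target = dynamodb.Table("my_target_table")                 # assumed target

def transform(doc):
    """Rename, drop and flatten fields of a single document."""
    item = dict(doc)
    item["id"] = str(item.pop("_id"))          # rename _id -> id (string key)
    item.pop("legacy_field", None)             # drop a field we do not need
    # Flatten a map field like {"address": {"city": ..., "zip": ...}}
    # into scalar attributes address_city, address_zip.
    address = item.pop("address", {}) or {}
    for key, value in address.items():
        item[f"address_{key}"] = value
    # Note: float values would need converting to decimal.Decimal for DynamoDB.
    return item

with target.batch_writer() as batch:           # batched PutItem calls
    for doc in source.find():
        batch.put_item(Item=transform(doc))

The same transform function could just as well write JSON lines to S3 for a later Redshift COPY instead of calling DynamoDB.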

Use the script below to take a backup of the MongoDB database:
mongodump -h localhost:27017 -d my_db_name -o $DEST
Use the command below to sync your backup to an S3 bucket:
aws s3 sync ~/db_backups s3://my-bucket-name
Once your data is in S3, you can load it into Redshift very easily using the COPY command.
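A hedged sketch of that last COPY step, assuming the data was exported as newline-delimited JSON (e.g. with mongoexport --type=json; mongodump itself writes BSON, which COPY cannot read) and that an IAM role is attached to the cluster. The cluster endpoint, table, bucket and role ARN are placeholders.

# Hypothetical sketch: run a Redshift COPY of newline-delimited JSON from S3.
# Cluster endpoint, table, bucket and IAM role ARN are placeholders.
import psycopg2

copy_sql = """
    COPY my_schema.my_table
    FROM 's3://my-bucket-name/db_backups/my_collection.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    FORMAT AS JSON 'auto';
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="admin", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)        # Redshift loads the files in parallel
conn.close()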

Related

Best practice for importing bulk data to AWS RDS PostgreSQL database

I have a big AWS RDS database that needs to be updated with data on a periodic basis. The data is in JSON files stored in S3 buckets.
This is my current flow:
Download all the JSON files locally
Run a ruby script to parse the JSON files to generate a CSV file matching the table in the database
Connect to RDS using psql
Use \copy command to append the data to the table
I would like to switch this to an automated approach (maybe using an AWS Lambda). What would be the best practices?
Approach 1:
Run a script (Ruby / JS) that parses all folders in the past period (e.g., week) and within the parsing of each file, connect to the RDS db and execute an INSERT command. I feel this is a very slow process with constant writes to the database and wouldn't be optimal.
Approach 2:
I already have a Ruby script that parses local files to generate a single CSV. I can modify it to parse the S3 folders directly and create a temporary CSV file in S3. The question is - how do I then use this temporary file to do a bulk import?
Are there any other approaches that I have missed and might be better suited for my requirement?
Thanks.
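Not from the original thread, but a minimal sketch of the bulk-import step from Approach 2, assuming a scheduled Lambda-style Python handler that streams a CSV already staged in S3 into the RDS table with psycopg2's copy_expert. The bucket, key, host, table and credentials are hypothetical.

# Hypothetical sketch: stream a staged CSV from S3 into an RDS Postgres table
# using client-side COPY ... FROM STDIN (the programmatic cousin of \copy).
import boto3
import psycopg2

def handler(event, context):          # Lambda-style entry point (assumed)
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-bucket", Key="staging/latest.csv")

    conn = psycopg2.connect(
        host="mydb.abc123.us-east-1.rds.amazonaws.com",
        dbname="mydb", user="loader", password="...",
    )
    with conn, conn.cursor() as cur:
        cur.copy_expert(
            "COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)",
            obj["Body"],               # botocore StreamingBody is file-like
        )
    conn.close()

psycopg2 would need to be packaged with the function (e.g. as a layer), and the same script runs unchanged from cron on an EC2 instance if Lambda is not a good fit.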

Azure Cosmos DB: Clone collection to another database

Currently I am trying to clone a Cosmos DB collection from one database to another database within the same Cosmos DB account. The API of the Cosmos DB account is set to the Mongo API.
I already tried to use Azure Data Factory, but it looks like there is no support for the Mongo API so far.
Does anyone have an idea how to do this with respect to efficiency, automation and performance?
Any ideas are appreciated.
You can use the Data Migration Tool suggested by Microsoft to do the same.
There is no way to take a backup of Cosmos DB and import it.
EDIT:
With the new Cosmic Clone tool, you can take a clone/backup with data, stored procedures, triggers, UDFs, etc. Read my blog on the same.
I already tried to use Azure Data Factory, but it looks like there is no support for the Mongo API so far.
Actually, the Cosmos DB Mongo API and SQL API both belong to the Azure Cosmos DB service. So you can still create a Cosmos DB linked service and dataset in Azure Data Factory for your database.
Then you could create a copy activity to import data from one collection into another.
If you want to make this an automated task, I suggest the following two ways to run the copy activity:
1. An Azure Function with a time trigger (a rough sketch of this option follows below).
2. A WebJob running in the background of an Azure Web App.
Hope it helps. If you have any concerns, please feel free to let me know.
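If you go the timer route, a rough sketch of option 1 could look like the following, assuming a Python Azure Function with a timer trigger that starts an existing Data Factory pipeline via azure-mgmt-datafactory. The subscription ID, resource group, factory and pipeline names are placeholders, and the copy activity itself is assumed to live inside that pipeline.

# Hypothetical sketch: an Azure Function timer trigger that kicks off a
# Data Factory pipeline run on a schedule. All names are placeholders.
import azure.functions as func
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-resource-group"
FACTORY_NAME = "my-data-factory"
PIPELINE_NAME = "CopyCollectionPipeline"

def main(mytimer: func.TimerRequest) -> None:
    client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    run = client.pipelines.create_run(RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME)
    # run.run_id can be logged or polled later via client.pipeline_runs.get(...)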
I used mongodump and mongorestore to copy my database (with mongodb version 4.0.9 installed). From the windows command line I ran the following commands from my mongodb bin directory (c:\Program Files\MongoDB\Server\4.0\bin in my case).
This will copy all the collections, including indexes, in the DB to the specified /out directory as .json files.
mongodump.exe /uri:URI /out:A_DIRECTORY_TO_DUMP_TO
I then ran the following command to take everything in the /out directory and write it to the target DB:
mongorestore.exe /uri:URI /dir:DIRECTORY_TO_RESTORE_FROM
NOTE: Before importing I also had to increase the throughput for the collection, otherwise I ran into rate limiting errors. If you've set throughput at the database level this may need to be changed.
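On the throughput point, a hedged sketch of raising a collection's RU/s before running mongorestore, assuming pymongo and the Cosmos-specific extension command exposed by Azure Cosmos DB's API for MongoDB; the connection string, database and collection names are placeholders.

# Hypothetical sketch: temporarily raise throughput on the target collection
# before a mongorestore, using Cosmos DB's MongoDB extension command.
from pymongo import MongoClient

client = MongoClient("<cosmos-db-mongo-connection-string>")   # placeholder
db = client["target_db"]

db.command({
    "customAction": "UpdateCollection",   # Cosmos-specific extension command
    "collection": "target_collection",
    "offerThroughput": 10000,             # RU/s to use during the restore
})
# ...run mongorestore, then issue the same command with a lower RU/s value.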

Backing up redshift database

I want to back up the entire Redshift cluster so that I can use it in other databases like MySQL or Hadoop in the future.
I was looking it up, and creating a manual snapshot seems to be an option, but I guess that won't work across different database engines.
So what would be the detailed steps to back up the entire Redshift cluster?
Cluster backups can be done via the AWS console; however, these can only be restored to another Redshift cluster.
Because Redshift is not the same as Postgres in many ways, it will be impossible / tricky to use standard tools like pg_dump and pg_restore.
I think that your best option is to:
1. Extract the DDL from the Redshift tables that you wish to create elsewhere; most IDEs have a simple way to do this.
2. Modify the DDL to work with your target database (e.g. Postgres will be easy, MySQL harder); a small rewrite sketch follows this list.
3. Copy the contents of the Redshift database, one table at a time, to S3 using the UNLOAD command.
4. Import the data that you unloaded in step 3 into your target tables.
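For step 2, here is a rough sketch of the kind of mechanical cleanup involved when the target is plain Postgres, assuming the DDL arrives as one CREATE TABLE statement per string. The regexes only strip Redshift-specific clauses; real DDL (and a MySQL target with its different type names) would need more care.

# Hypothetical sketch: strip Redshift-only clauses (DISTSTYLE, DISTKEY,
# SORTKEY, ENCODE, BACKUP) so the CREATE TABLE runs on plain Postgres.
import re

REDSHIFT_ONLY = [
    r"\bDISTSTYLE\s+\w+",
    r"\bDISTKEY\s*\([^)]*\)",
    r"\b(?:COMPOUND\s+|INTERLEAVED\s+)?SORTKEY\s*\([^)]*\)",
    r"\bENCODE\s+\w+",
    r"\bBACKUP\s+(?:YES|NO)",
]

def to_postgres_ddl(redshift_ddl: str) -> str:
    ddl = redshift_ddl
    for pattern in REDSHIFT_ONLY:
        ddl = re.sub(pattern, "", ddl, flags=re.IGNORECASE)
    return re.sub(r"[ \t]+", " ", ddl)      # tidy leftover whitespace

print(to_postgres_ddl(
    "CREATE TABLE sales (id BIGINT ENCODE az64, amount DECIMAL(10,2)) "
    "DISTSTYLE KEY DISTKEY(id) SORTKEY(id);"
))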

Easy way to get all tables out to S3 on a nightly basis?

I need to be able to dump the contents of each table in my redshift data warehouse each night to S3.
The outcome that I want to achieve is the same as if I were manually issuing an UNLOAD command for each table.
For something this simple, I assumed I could use something like data pipeline or glue, but these don’t seem to make this easy.
Am I looking at this problem wrong? This seems like it should be simple.
I had this process but in reverse recently. My solution: a Python script that queried pg_schema (to grab eligible table names), and then looped through the results, using each table name as a parameter in an INSERT query. I ran the script as a cron job on an EC2 instance.
In theory, you could set up the script via Lambda or as a ShellCommand in Pipeline. But I could never get that to work, whereas a cron job was super simple.
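A hedged sketch of that kind of script adapted to this question's direction (UNLOAD instead of INSERT), assuming psycopg2, an IAM role attached to the cluster, and placeholder schema, bucket and endpoint names; it reads table names from information_schema and unloads each one.

# Hypothetical sketch: nightly job that UNLOADs every table in a schema
# to S3. Cluster endpoint, schema, bucket and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="admin", password="...",
)
conn.autocommit = True            # run each UNLOAD as its own statement
cur = conn.cursor()

cur.execute(
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = 'public' AND table_type = 'BASE TABLE';"
)
for (table,) in cur.fetchall():   # table names come from the catalog
    cur.execute(f"""
        UNLOAD ('SELECT * FROM public.{table}')
        TO 's3://scratchpad_bucket/nightly/{table}_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
        GZIP ALLOWOVERWRITE;
    """)

cur.close()
conn.close()

Run it from cron on a small EC2 instance (e.g. 0 2 * * * /usr/bin/python3 /home/ec2-user/unload_all.py), exactly as described above.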
Do you have a specific use case for explicitly UNLOADing data to S3? Like being able to use that data with Spark/Hive?
If not, you should schedule snapshots of your Redshift cluster to S3 every day. This happens by default anyway.
Snapshots are stored in S3 as well.
Snapshots are incremental and fast. You can restore entire clusters using snapshots.
You can also restore individual tables from snapshots.
Here is the documentation about it: https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html
This is as simple as creating a script (shell/Python/...) and putting it in crontab. Something along the lines of (snippet from a shell script):
psql -U$username -p $port -h $hostname $database -f path/to/your/unload_file.psql
and your unload_file.psql would contain the standard Redshift unload statement:
unload ('select * from schema.tablename') to 's3://scratchpad_bucket/filename.extension'
credentials 'aws_access_key_id=XXXXXXXXXX;aws_secret_access_key=XXXXXXXXXX'
[options];
Put your shell script in a crontab and execute it daily at the time when you want to take the backup.
However, remember:
While taking backups is indispensable, daily full backups will generate a mammoth bill for S3. You should rotate the backups / log files, i.e. regularly delete them, or move the backups out of S3 and store them locally.
A full daily backup might not be the best thing to do. Check whether you can do it incrementally.
It would be better to tar and gzip the files and then send them to S3, rather than storing an Excel file or a CSV.

How to load data from S3 to PostgreSQL RDS

I have a need to load data from S3 to Postgres RDS (around 50-100 GB). I don't have the option to use AWS Data Pipeline, and I am looking for something similar to using the COPY command to load data from S3 into Amazon Redshift.
I would appreciate any suggestions on how I can accomplish this.
Originally, this answer was trying to use the S3 to Postgres RDS Functionality. That whole enterprise failed (see below).
The way I have finally been able to do this is:
Set-up an EC2 instance with psql installed (see below near end of post)
Copy the relevant CSVs to import from S3 to the local instance
Use the psql \copy command to import the files
This last part is really, really important. If you use the SQL COPY command, the entire RDS Postgres role structure will frustrate you to no end. It has a wonky SUPERRDSADMIN role which is not very super at all. However, if you use the psql \copy command you apparently can do anything. I have confirmed this to be the case and have started my uploads successfully. I will come back and re-edit this post (time permitting) to add relevant documentation steps for the above.
Caveat Emptor: The post below was all the original work I had done trying to get this implemented. I don't want to bury the lede: despite multiple efforts (including what can only be described as pathetic tech support from AWS), I don't believe that this feature is ready for prime time. Despite a very simple test environment that is easy to replicate, AWS has not provided an effective way to keep the copy statement from crapping out as follows:
The actual call to aws_s3.table_import_from_s3(...) is reporting a permission problem between RDS and S3. From my research work with psql this appears to be a C library, probably installed by AWS.
NOTICE: CURL error code: 28 when attempting to validate pre-signed URL, 1 attempt(s) remaining
NOTICE: HINT: make sure your instance is able to connect with S3.
S3 to Postgres RDS Functionality Now Added
On 2019-04-24 AWS released functionality allowing a Postgres RDS to load directly from S3. You can read the announcement here, and see the documentation page here.
I am sharing with the OP because this appears to be the AWS supported way of solving the question posed.
Key summary points:
Requires Postgres 11.1 or greater
Need access to psql and the ability to connect it to the RDS instance
Need to install the aws_s3 extension which pulls in aws_commons.
You can get to the S3 bucket by specifying credentials or by assigning IAM roles to RDS
It advertises supporting all of the same data formats as the postgres COPY command
It currently only appears to support a single file at a time (ie no regex)
The instructions are fairly detailed and provide a variety of paths to configuring (AWS CLI scripts, Console instructions, etc). Additionally, the option to use your IAM keys rather than have to set-up roles is nice.
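For reference, a hedged sketch of what the actual import call can look like when driven from Python rather than typed into psql, assuming the RDS instance already has an s3Import IAM role attached (otherwise pass aws_commons.create_aws_credentials(...) as an extra argument, per the documentation linked above). Table, bucket, key and region are placeholders.

# Hypothetical sketch: call the aws_s3 extension's import function from
# Python instead of interactively from psql. Names are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="mydb.abc123.us-east-1.rds.amazonaws.com",
    dbname="mydb", user="loader", password="...",
)
with conn, conn.cursor() as cur:
    # Requires sufficient privileges; CASCADE also installs aws_commons.
    cur.execute("CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;")
    cur.execute("""
        SELECT aws_s3.table_import_from_s3(
            'my_table',                      -- target table
            '',                              -- column list ('' = all columns)
            '(format csv, header true)',     -- same options as COPY
            aws_commons.create_s3_uri('my-bucket', 'staging/latest.csv', 'us-east-1')
        );
    """)
conn.close()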
I did not find a way to download just psql, so I had to install a full Postgres distribution on my Mac, but that was no big deal with brew:
brew install postgres
and since the DB service does not get activated it is the quickest way to get psql.
Update: I decided that having psql on my Mac was a security hole (port forwarding, etc.). I found that there is a simple Postgres install available for Amazon Linux 2 under the Amazon Linux Extras rubric. The install command is fairly simple on your AMI instance:
sudo amazon-linux-extras install postgresql10
psql is fairly easy to use; however, keep in mind that any instructions to psql itself are prefixed with a \. Documentation on psql can be found here. I recommend going through it at least once before executing the AWS recommended scripts.
To the extent that you run tight security and have access to your RDS instances seriously restricted (which I do), don't forget to open up the ports from your AMI instance running Postgres to your RDS instance.
If your preference is a GUI, then you can try to use pgAdmin 4. It is the AWS recommended way of connecting to RDS Postgres instances according to the docs. I was unable to get any of the SSH tunneling features to work (which is why I ended up doing the localhost SSH mapping that I used for psql). I also found it to be rather buggy in other ways. Reading reviews of the product, it seems that version 4 may not be the most stable of releases.
http://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html
Use the COPY command to load a table in parallel from data files on
Amazon S3. You can specify the files to be loaded by using an Amazon
S3 object prefix or by using a manifest file.
The syntax to specify the files to be loaded by using a prefix is as
follows:
copy <table_name> from 's3://<bucket_name>/<object_prefix>'
authorization;
Update:
Another option is to mount S3 and use a direct path to the CSV with the COPY command. I'm not sure if it will handle 100 GB effectively, but it's worth trying. Here is a list of software options.
Yet another option would be "parsing" the S3 file part by part with something described here, writing it to a file, and using COPY from a named pipe, described here.
And the most obvious option, to just download the file to local storage and use COPY, I don't cover at all.
Also worth mentioning is s3_fdw (status: unstable). The readme is very laconic, but I assume you could create a foreign table pointing to an S3 file, which in turn means you could load the data into another relation...