AWS DMS - Scheduled DB Migration - postgresql

I have Postgresql db in RDS. I need to fetch data from a bunch of tables in postgresql db and push data into a S3 bucket every hour. I only want the delta changes (any new inserts / updates) to be sent in the hourly. Is it possible to do this using DMS or is EMR a better tool for performing this activity?

You can create an automated environment of migration data from RDS to S3 using AWS DMS (Data Migration Service) tasks.
Create a source endpoint (reading your RDS database - Postgres, MySQL, Oracle, etc...);
Create a target endpoint using S3 as an engine endpoint (read it: Using Amazon S3 as a Target for AWS Database Migration Service);
Create a replication instance, responsible to make a bridge between source data and target endpoint (you will only pay while processing);
Create a database migration task using the option 'Replication data change only' on migration type field;
Create a cron lambda, which starts a DMS task, with stack Python following these instructions of this articles Lambda with scheduled events e Start DMS tasks with boto3 in Python.
Connecting these resources above you may can have what you want.
Regards,
Renan S.

Related

Alert about DB creation on RDS/Aurora PostgreSQL

I have some Aurora PostgreSQL Clusters created on our AWS account. Because of some access issues (which we are working on already), there are several people in other teams who create random DB's on these Aurora Clusters and then we need to work on cleaning them up.
I wanted to check if there is a way to get alerted (via SNS Notifications etc.) whenever a new DB is created on these AWS Postgres clusters using some AWS Tools itself.
Thanks
You could do it using AWS Aurora Database Activity Streams, it will capture all database activity on the database and send it AWS Kinesis Data Stream and you could create a AWS Lambda function to read Kinesis Data Stream and identify the events needed (ex. create database)and finally send notification to AWS SNS from AWS Lambda code.
Another option is enable pgaudit on your AWS Aurora PostgreSQL, send logs to AWS CloudWatch, create AWS Lambda to read the events from AWS CloudWatch and send AWS Notification
Below you can find step by step on AWS Blog below.
Part 2: Audit Aurora PostgreSQL databases using Database Activity Streams and pgAudit

Can I configure AWS RDS to only stream INSERT operations to AWS DMS?

My requirement is to stream only INSERTs on a specific table in my db to a Kinesis data stream.
I have configured this pipeline in my AWS environment:
RDS Postgres 13 -> DMS (Database Migration Service) -> KDS (Kinesis Data Stream)
This setup works correctly but it processes all changes, even UPDATEs and DELETEs, on my source table.
What I've tried:
Looking for config options in the Postgres logical decoding plugin. DMS uses the test_decoding PG plugin which does not accept options to include/exclude data changes by operation type.
Looking at the DMS selection and filtering rules. Still didn't see anything that might help.
Of course I could simply ignore records originated from non-INSERT operations in my Kinesis consumer, but this doesn't look like a cost-efficient implementation.
Is there any way to meet my requirements using these AWS services (RDS -> DMS -> Kinesis)?
Well DMS does not have this capability .
If you want only INSERT to be send to Kinesis in that case you can have a lambda function on every INSERT of RDS .
Lambda function can be configured as trigger for INSERT .
You can invoke lambda only for INSERT and write to Kinesis directly .
Cost wise also this will be less .
In DMS you are paying for Replication instance even when not in use .
For detailed reference Stream changes from Amazon RDS for PostgreSQL using Amazon Kinesis Data Streams and AWS Lambda

Best way to set up jupyter notebook project in AWS

My current project have the following structure:
Starts with a script in jupyter notebook which dowloads data from a CRM API to put in a local PostgressSql database I run with PgAdmin. After that it runs cluster analysis, return some scoring values, creates a table in database with the results and updates this values in the CRM with another API call. This process will take between 10 to 20 hours (the API only allows 400 requests per minute).
The second notebook reads the database, detects last update, runs api call to update database since the last call, runs kmeans analysis to cluster the data, compare results with the previous call, updates the new ones and the CRM via API. This second process takes less than 2 hours in my estimation and I want this script to run every 24 hours.
After testing, this works fine. Now I'm evaluating how to put this in production in AWS. I understand for the notebooks I need Sagemaker and from I have seen is not that complicated, my only doubt here is if I can call the API without implementing aditional code or need some configuration. My second problem is database. I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3. My goal is to write the less code as possible, but a have try some tutorial of RDS like this one: [1]: https://www.youtube.com/watch?v=6fDTre5gikg&t=10s, and I understand this connect my local postgress to AWS but I can't find the data in the amazon page, only creates an instance?? and how to connect to it to analysis this data from SageMaker. My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud. Just some orientation about how to use this tools would be appreciated.
I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3
RDS and Aurora are relational databases fully managed by AWS. "Regular" RDS allows you to launch the existing popular databases such as MySQL, PostgreSQSL and other which you can launch at home/work as well.
Aurora is in-house, cloud-native implementation databases compatible with MySQL and PosrgreSQL. It can store the same data as RDS MySQL or PosrgreSQL, but provides a number of features not available for RDS, such as more read replicas, distributed storage, global databases and more.
S3 is not a database, but an object storage, where you can store your files, such as images, csv, excels, similarly like you would store them on your computer.
I understand this connect my local postgress to AWS but I can't find the data in the amazon page, only creates an instance??
You can migrate your data from your local postgress to RDS or Aurora if you wish. But RDS nor Aurora will not connect to your existing local database, as they are databases themselfs.
My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud.
I don't see a reason why you wouldn't be able to connect to the database. You can try to make it work, and if you encounter difficulties you can make new question on SO with RDS/Aurora setup details.

How to continuously populate a Redshift cluster from AWS Aurora (not a sync)

I have a number of MySql databases (OLTP) running on an AWS Aurora cluster. I also have a Redshift cluster that will be used for OLAP. The goal is to replicate inserts and changes from Aurora to Redshift, but not deletes. Redshift in this case will be an ever-growing data repository, while the Aurora databases will have records created, modified and destroyed — Redshift records should never be destroyed (at least, not as part of this replication mechanism).
I was looking at DMS, but it appears that DMS doesn't have the granularity to exclude deletes from the replication. What is the simplest and most effective way of setting up the environment I need? I'm open to third-party solutions, as well, as long as they work within AWS.
Currently have DMS continuous sync set up.
You could consider using DMS to replicate to S3 instead of Redshift, then use Redshift Spectrum (or Athena) against that S3 data.
S3 as a DMS target is append only, so you never lose anything.
see
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html
and
https://aws.amazon.com/blogs/database/replicate-data-from-amazon-aurora-to-amazon-s3-with-aws-database-migration-service/
That way, things get a bit more complex and you may need some ETL to process that data (depending on your needs)
You will still get the deletes coming through with a record type of "D", but you can ignore or process these depending on your needs.
A simple and effective way to capture Insert and Updates from Aurora to Redshift may be to use below approach:
Aurora Trigger -> Lambda -> Firehose -> S3 -> RedShift
Below AWS blog-post eases this implementation and look almost similar to your use-case.
It provides sample code also to get the changes from Aurora table to S3 through AWS Lambda and Firehose. In Firehose, you may setup the destination as Redshift, which will copy over data from S3 seemlessly into Redshift.
Capturing Data Changes in Amazon Aurora Using AWS Lambda
AWS Firehose Destinations

Loading data from S3 to PostgreSQL RDS

We are planning to go for PostgreSQL RDS in AWS environment. There are some files in S3 which we will need to load every week. I don't see any option in AWS documentation where we can load data from S3 to PostgreSQL RDS. I see it is possible for Aurora but cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON script that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's setup to move data between S3 and MySQL. You can find it here. You can easily follow this and swap out the MySQL parameters with those associated with your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!