How to continuously populate a Redshift cluster from AWS Aurora (not a sync) - amazon-redshift

I have a number of MySQL databases (OLTP) running on an AWS Aurora cluster. I also have a Redshift cluster that will be used for OLAP. The goal is to replicate inserts and changes from Aurora to Redshift, but not deletes. Redshift in this case will be an ever-growing data repository, while the Aurora databases will have records created, modified and destroyed; Redshift records should never be destroyed (at least, not as part of this replication mechanism).
I was looking at DMS, but it appears that DMS doesn't have the granularity to exclude deletes from the replication. What is the simplest and most effective way of setting up the environment I need? I'm open to third-party solutions, as well, as long as they work within AWS.
Currently have DMS continuous sync set up.

You could consider using DMS to replicate to S3 instead of Redshift, then use Redshift Spectrum (or Athena) against that S3 data.
S3 as a DMS target is append only, so you never lose anything.
See:
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html
and
https://aws.amazon.com/blogs/database/replicate-data-from-amazon-aurora-to-amazon-s3-with-aws-database-migration-service/
This approach adds a bit of complexity, and you may need some ETL to process that data.
You will still get the deletes coming through with a record type of "D", but you can ignore or process these depending on your needs.
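If you post-process those CDC files yourself, dropping the delete rows is straightforward. Here is a minimal sketch (Python with boto3), assuming the default CSV output format where DMS writes the operation flag as the first column of each row; the bucket and key names are hypothetical:
```
import csv
import io

import boto3

s3 = boto3.client("s3")

def strip_deletes(bucket, key, dest_key):
    """Copy a DMS CDC CSV file, dropping rows whose operation flag is 'D'.

    Assumes the default S3 target CSV format, where the first column of each
    CDC row is the operation indicator: I (insert), U (update), D (delete).
    """
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    out = io.StringIO()
    writer = csv.writer(out)
    for row in csv.reader(io.StringIO(body)):
        if row and row[0] != "D":          # keep inserts and updates only
            writer.writerow(row)
    s3.put_object(Bucket=bucket, Key=dest_key, Body=out.getvalue())

# Hypothetical usage:
# strip_deletes("my-dms-bucket", "cdc/changes.csv", "cleaned/changes.csv")
```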

A simple and effective way to capture inserts and updates from Aurora to Redshift may be the approach below:
Aurora Trigger -> Lambda -> Firehose -> S3 -> RedShift
The AWS blog post below walks through this implementation and looks very similar to your use case.
It also provides sample code to push changes from an Aurora table to S3 through AWS Lambda and Firehose. In Firehose, you can set the destination to Redshift, which will copy the data from S3 into Redshift seamlessly.
Capturing Data Changes in Amazon Aurora Using AWS Lambda
AWS Firehose Destinations
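For reference, the Lambda piece of that pipeline stays small. Below is a minimal sketch (Python with boto3), assuming the Aurora trigger invokes the function with the changed row as the event payload; the delivery stream name is hypothetical:
```
import json

import boto3

firehose = boto3.client("firehose")

def lambda_handler(event, context):
    """Receive a changed row from an Aurora trigger and push it to Firehose.

    Firehose buffers the records in S3 and, if configured with a Redshift
    destination, issues the COPY into Redshift for you.
    """
    firehose.put_record(
        DeliveryStreamName="aurora-changes-stream",   # hypothetical name
        Record={"Data": json.dumps(event) + "\n"},
    )
    return {"status": "ok"}
```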

Related

Can I configure AWS RDS to only stream INSERT operations to AWS DMS?

My requirement is to stream only INSERTs on a specific table in my db to a Kinesis data stream.
I have configured this pipeline in my AWS environment:
RDS Postgres 13 -> DMS (Database Migration Service) -> KDS (Kinesis Data Stream)
This setup works correctly but it processes all changes, even UPDATEs and DELETEs, on my source table.
What I've tried:
Looking for config options in the Postgres logical decoding plugin. DMS uses the test_decoding PG plugin which does not accept options to include/exclude data changes by operation type.
Looking at the DMS selection and filtering rules. Still didn't see anything that might help.
Of course I could simply ignore records originating from non-INSERT operations in my Kinesis consumer, but this doesn't look like a cost-efficient implementation.
Is there any way to meet my requirements using these AWS services (RDS -> DMS -> Kinesis)?
DMS does not have this capability.
If you want only INSERTs to be sent to Kinesis, you can have a Lambda function invoked on every INSERT in RDS.
The Lambda function can be configured as a trigger for INSERT.
You can invoke Lambda only for INSERTs and write to Kinesis directly.
Cost-wise this will also be cheaper.
With DMS you are paying for the replication instance even when it is not in use.
For a detailed reference, see Stream changes from Amazon RDS for PostgreSQL using Amazon Kinesis Data Streams and AWS Lambda
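To illustrate the suggestion above, the Lambda invoked on INSERT can write straight to the stream. Here is a minimal sketch (Python with boto3); the stream name and partition key column are hypothetical, and the event shape depends on how your trigger invokes the function:
```
import json

import boto3

kinesis = boto3.client("kinesis")

def lambda_handler(event, context):
    """Forward an inserted row (passed in by the database trigger) to Kinesis."""
    kinesis.put_record(
        StreamName="inserted-rows-stream",             # hypothetical stream name
        Data=json.dumps(event),
        PartitionKey=str(event.get("id", "default")),  # hypothetical key column
    )
    return {"status": "ok"}
```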

Migrate data from Citus to RDS

Since Citus is not going to be available as a managed service in AWS, I am trying to move the database to RDS (not the whole history, but only the transactional portion, as an OLTP). The migration from Citus is not clear because the data does not reside in a single node. I want to check the options we might have to move data from Citus to RDS.
Amazon DMS: This option is good for the supported databases (PostgreSQL), but we do not know how it will behave with Citus, given the distributed nature of the engine. Has someone migrated the data to S3, to another DB, or something along these lines?
I saw this paper from AWS https://d1.awsstatic.com/whitepapers/aws-cloud-data-ingestion-patterns-practices.pdf?did=wp_card&trk=wp_card on how to ingest data from different sources, and DMS seems like a good option, but I do not know the internals of Citus well enough to tell whether we will get all the data and capture the CDC correctly.
A custom migration: Via a support ticket, we can access the S3 buckets that Citus uses for disaster recovery, where the WAL logs are available, and we could use something like WAL-G to take those logs and replay them into a Postgres instance. The issue here is that this is a very custom migration and the development time might be too high.
Is there any other option to move data from Citus to RDS or Aurora in AWS, and what looks like a good path for the database migration? All the documentation refers to moving data the other way around, from Aurora or RDS to Citus.
Sumedh from Citus Cloud here. Please go ahead and open a support ticket with us to further investigate solutions. We can evaluate if using DMS is a viable approach for your use-case.

How to set up a replica in on-premises Postgres for a master database in AWS RDS Postgres?

I have a requirement to check whether an exact copy of a master database in AWS RDS can be created on premises or not.
I have already established connectivity between on-premises and AWS, and I have tested data migration using pg_dump. But I am not sure how to create the replica without using DMS. For security reasons we are not supposed to use DMS. Is there any other way to implement this?
Any help will be much appreciated.
It appears that your goal is disaster recovery.
Amazon RDS offers a few options for this:
Amazon RDS Snapshots are a backup of the database, stored in a region. If your database is in an Availability Zone that fails, the snapshot can be restored as a new database in another AZ. All AZs are physically separate data centers, much like your own data center is physically separate from an AWS data center.
Snapshots can also be copied to other Regions, which would guarantee a separation distance between data centers.
Multi-AZ Amazon RDS databases keep a second copy of the data in another AZ and can switch over to the alternate AZ without losing any data. This is faster than restoring a snapshot, but costs twice as much since two separate database servers are deployed.
These options would be easier to manage than replicating your data to an on-premises system. A Multi-AZ deployment will automatically fail over to the standby instance, so your app can continue operating with only a short delay and no data loss. This is much better than you could offer if you failed over to an on-premises system.
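If you do go the snapshot route, the cross-region copy mentioned above is easy to script. Here is a minimal sketch with boto3; the regions and snapshot identifiers are hypothetical:
```
import boto3

# Client in the *destination* region; SourceRegion tells boto3 where the
# snapshot lives and lets it generate the required pre-signed URL.
rds = boto3.client("rds", region_name="eu-west-1")  # hypothetical target region

response = rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:snapshot:mydb-snapshot"  # hypothetical ARN
    ),
    TargetDBSnapshotIdentifier="mydb-snapshot-copy",
    SourceRegion="us-east-1",
)
print(response["DBSnapshot"]["Status"])
```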

Best way to set up jupyter notebook project in AWS

My current project have the following structure:
It starts with a script in a Jupyter notebook which downloads data from a CRM API and puts it in a local PostgreSQL database I run with pgAdmin. After that it runs a cluster analysis, returns some scoring values, creates a table in the database with the results, and updates these values in the CRM with another API call. This process will take between 10 and 20 hours (the API only allows 400 requests per minute).
The second notebook reads the database, detects the last update, runs an API call to update the database since the last call, runs k-means analysis to cluster the data, compares the results with the previous call, and updates the new values and the CRM via the API. This second process takes less than 2 hours by my estimation, and I want this script to run every 24 hours.
After testing, this works fine. Now I'm evaluating how to put this into production in AWS. I understand that for the notebooks I need SageMaker, and from what I have seen it is not that complicated; my only doubt here is whether I can call the API without implementing additional code or whether I need some configuration. My second problem is the database. I don't understand the difference between RDS, which is the one I think I have to use for this, and Aurora or S3. My goal is to write as little code as possible, but I have tried a tutorial on RDS like this one: [1]: https://www.youtube.com/watch?v=6fDTre5gikg&t=10s, and I understand this connects my local Postgres to AWS, but I can't find the data in the AWS console; it only creates an instance? And how do I connect to it to analyze the data from SageMaker? My final goal is to run the notebooks in the cloud and connect to my Postgres in the cloud. Just some orientation about how to use these tools would be appreciated.
I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3
RDS and Aurora are relational databases fully managed by AWS. "Regular" RDS allows you to launch the existing popular databases such as MySQL, PostgreSQL and others, which you could also run at home or at work.
Aurora is AWS's in-house, cloud-native database implementation, compatible with MySQL and PostgreSQL. It can store the same data as RDS MySQL or PostgreSQL, but provides a number of features not available for RDS, such as more read replicas, distributed storage, global databases and more.
S3 is not a database but an object storage service, where you can store your files, such as images, CSVs or Excel files, much like you would store them on your computer.
I understand this connects my local Postgres to AWS, but I can't find the data in the AWS console; it only creates an instance?
You can migrate your data from your local Postgres to RDS or Aurora if you wish. But neither RDS nor Aurora will connect to your existing local database, as they are databases themselves.
My final goal is to run the notebooks in the cloud and connect to my Postgres in the cloud.
I don't see a reason why you wouldn't be able to connect to the database. You can try to make it work, and if you encounter difficulties you can ask a new question on SO with your RDS/Aurora setup details.
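For what it's worth, connecting from a SageMaker notebook to RDS/Aurora Postgres is just a normal Postgres client connection, as long as the database's security group allows traffic from the notebook. A minimal sketch with psycopg2; the endpoint, credentials and table are hypothetical:
```
import psycopg2  # pip install psycopg2-binary in the notebook environment

# Hypothetical RDS endpoint and credentials; the security group attached to
# the RDS instance must allow inbound traffic from the SageMaker notebook.
conn = psycopg2.connect(
    host="mydb.abc123xyz.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="crm",
    user="analyst",
    password="change-me",
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM customers;")  # hypothetical table
    print(cur.fetchone()[0])

conn.close()
```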

Loading data from S3 to PostgreSQL RDS

We are planning to go with PostgreSQL RDS in an AWS environment. There are some files in S3 which we will need to load every week. I don't see any option in the AWS documentation for loading data from S3 into PostgreSQL RDS. I see it is possible for Aurora, but cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON-defined workflow that allows you to orchestrate the flow of data between data sources on AWS.
There's a template offered by AWS that's set up to move data between S3 and MySQL; the steps below show where to find it. You can easily follow this and swap out the MySQL parameters with those associated with your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!
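If the weekly load turns out to be simple enough, an alternative to Data Pipeline is a small script that streams the file from S3 and issues a COPY into Postgres. This is a minimal sketch with boto3 and psycopg2, not the Data Pipeline approach above; the bucket, key, table and connection details are hypothetical:
```
import io

import boto3
import psycopg2

s3 = boto3.client("s3")

def load_weekly_file(bucket, key, table, conn_params):
    """Stream a CSV file from S3 into a Postgres table via COPY."""
    obj = s3.get_object(Bucket=bucket, Key=key)
    data = io.StringIO(obj["Body"].read().decode("utf-8"))

    conn = psycopg2.connect(**conn_params)
    try:
        with conn, conn.cursor() as cur:
            cur.copy_expert(
                f"COPY {table} FROM STDIN WITH (FORMAT csv, HEADER true)",
                data,
            )
    finally:
        conn.close()

# Hypothetical usage:
# load_weekly_file(
#     "my-weekly-bucket", "exports/latest.csv", "weekly_data",
#     {"host": "mydb.abc123.us-east-1.rds.amazonaws.com",
#      "dbname": "mydb", "user": "loader", "password": "change-me"},
# )
```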