How to nuke aws if aws-nuke mostly fails - aws-nuke

In general aws-nuke fails removing resources e.g. S3 buckets, dynamodb tables. Is there any other solution before closing the credit card?

Related

Can you use AWS DMS to move Aurora DB from one account to another?

I am trying to migrate an Aurora cluster from one of our accounts to another. We actually do not have a lot write requests and the database itself is quite small, but somehow we decided to minimize the downtime.
I have looked into several options
Use snapshot: cut off the mutation in source DB, take snapshot, share and restore in another account. This would definitely introduce some downtime
Use Aurora cloning: cut off the mutation in source DB, clone the cluster in target account and switch to the target DB. According to AWS, the cloning is much faster than taking and restoring a snapshot, so the downtime should be shorter.
I am not sure if I can use DMS to do this as I did not find useful doc/tutorials about moving Aurora across accounts. Also, I am not sure whether DMS will sync any write requests to target DB during migration.
If DMS can not live sync, then I probably should use Bucardo to live migrate.
Looking at the docs, AWS Aurora with PostgreSQL compatibility is allowed as source & target endpoints. So, answering your question, yes it's possible.
Obviously, your source Aurora DB should be accessible from the target account. Check that the DB endpoint is public and the traffic is not restricted by ACLs rules or SGs rules.
Also, if you want to enable ongoing replication, you need to grant rds_replication (or rds_superuser) role to the source database user. Link to the docs.
We actually ended up using DMS for this migration. What we did was:
Take a snapshot of the target DB in the original account.
Share the snapshot to the target account and restore it over there. (You have to use snapshot for migrating things like triggers, custom types, sequence, etc)
Setup connections (like VPC peering or security groups) between two accounts.
Setup DMS in source account (endpoints, replication instance, task)
Write SQL to temporarily disable/delete constraints, triggers, etc which may cause error when load source data.
Using DMS to load source data and enable ongoing replication.
Enable/add constraints, triggers, etc back.
Post migration test

Dataproc: Hot data on HDFS, cold data on Cloud Storage?

I am studying for the Professional Data Engineer and I wonder what is the "Google recommended best practice" for hot data on Dataproc (given that costs are no concern)?
If cost is a concern then I found a recommendation to have all data in Cloud Storage because it is cheaper.
Can a mechanism be set up, such that all data is on Cloud Storage and recent data is cached on HDFS automatically? Something like AWS does with FSx/Lustre and S3.
What to store in HDFS and what to store in GCS is a case-dependant question. Dataproc supports running hadoop or spark jobs on GCS with GCS connector, which makes Cloud Storage HDFS compatible without performance losses.
Cloud Storage connector is installed by default on all Dataproc cluster nodes and it's available on both Spark and PySpark environments.
After researching a bit: the performance of HDFS and Cloud Storage (or any other blog store) is not completely equivalent. For instance a "mv" operation in a blob store is emulated as copy + delete.
What the ASF can do is warn that our own BlobStore filesystems (currently s3:, s3n: and swift:) are not complete replacements for hdfs:, as operations such as rename() are only emulated through copying then deleting all operations, and so a directory rename is not atomic -a requirement of POSIX filesystems which some applications (MapReduce) currently depend on.
Source: https://cwiki.apache.org/confluence/display/HADOOP2/HCFS

How to continuously populate a Redshift cluster from AWS Aurora (not a sync)

I have a number of MySql databases (OLTP) running on an AWS Aurora cluster. I also have a Redshift cluster that will be used for OLAP. The goal is to replicate inserts and changes from Aurora to Redshift, but not deletes. Redshift in this case will be an ever-growing data repository, while the Aurora databases will have records created, modified and destroyed — Redshift records should never be destroyed (at least, not as part of this replication mechanism).
I was looking at DMS, but it appears that DMS doesn't have the granularity to exclude deletes from the replication. What is the simplest and most effective way of setting up the environment I need? I'm open to third-party solutions, as well, as long as they work within AWS.
Currently have DMS continuous sync set up.
You could consider using DMS to replicate to S3 instead of Redshift, then use Redshift Spectrum (or Athena) against that S3 data.
S3 as a DMS target is append only, so you never lose anything.
see
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html
and
https://aws.amazon.com/blogs/database/replicate-data-from-amazon-aurora-to-amazon-s3-with-aws-database-migration-service/
That way, things get a bit more complex and you may need some ETL to process that data (depending on your needs)
You will still get the deletes coming through with a record type of "D", but you can ignore or process these depending on your needs.
A simple and effective way to capture Insert and Updates from Aurora to Redshift may be to use below approach:
Aurora Trigger -> Lambda -> Firehose -> S3 -> RedShift
Below AWS blog-post eases this implementation and look almost similar to your use-case.
It provides sample code also to get the changes from Aurora table to S3 through AWS Lambda and Firehose. In Firehose, you may setup the destination as Redshift, which will copy over data from S3 seemlessly into Redshift.
Capturing Data Changes in Amazon Aurora Using AWS Lambda
AWS Firehose Destinations

How to setup cross region replica of AWS RDS for PostgreSQL

I have a RDS for PostgreSQL setup in ASIA and would like to have a read copy in US.
But unfortunately just found from the official site that only RDS for MySQL has cross-region replica but not for PostgreSQL.
And I saw this page introduced other ways to migrate data in to and out of RDS for PostgreSQL.
If not buy an EC2 to install a PostgreSQL by myself in US, is there any way the synchronize data from ASIA RDS to US RDS?
It all depends on the purpose of your replication. Is it to provide a local data source and avoid network latencies ?
Assuming that your goal is to have cross-region replication, you have a couple of options.
Custom EC2 Instances
You can create your own EC2 instances and install PostgreSQL so you can customize replication behavior.
I've documented configuring master-slave replication with PostgreSQL on my blog: http://thedulinreport.com/2015/01/31/configuring-master-slave-replication-with-postgresql/
Of course, you lose some of the benefits of AWS RDS, namely automated multi-AZ redundancy, etc., and now all of a sudden you have to be responsible for maintaining your configuration. This is far from perfect.
Two-Phase Commit
Alternate option is to build replication into your application. One approach is to use a database driver that can do this, or to do your own two-phase commit. If you are using Java, some ideas are described here: JDBC - Connect Multiple Databases
Use SQS to uncouple database writes
Ok, so this one is the one I would personally prefer. For all of your database writes you should use SQS and have background writer processes that take messages off the queue.
You will need to have a writer in Asia and a writer in the US regions. To publish on SQS across regions you can utilize SNS configuration that publishes messages onto multiple queues: http://docs.aws.amazon.com/sns/latest/dg/SendMessageToSQS.html
Of course, unlike a two phase commit, this approach is subject to bugs and it is possible for your US database to get out of sync. You will need to implement a reconciliation process -- a simple one can be a pg_dump from Asian and pg_restore into US on a weekly basis to re-sync it, for instance. Another approach can do something like a Cassandra read-repair: every 10 reads out of your US database, spin up a background process to run the same query against Asian database and if they return different results you can kick off a process to replay some messages.
This approach is common, actually, and I've seen it used on Wall St.
So, pick your battle: either you create your own EC2 instances and take ownership of configuration and devops (yuck), implement a two-phase commit that guarantees consistency, or relax consistency requirements and use SQS and asynchronous writers.
This is now directly supported by RDS.
Example of creating a cross region replica using the CLI:
aws rds create-db-instance-read-replica \
--db-instance-identifier DBInstanceIdentifier \
--region us-west-2 \
--source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:my-postgres-instance

Sandbox version for AWS RedShift

I have been using RedShift for a few months and I like it. But I need to add some tests around it and I am not sure what the most cost effective way of doing it is. I can only think of using one server RedShift cluster as Sandbox but that seems to be too costly even if I only use it during testing
Databases in Redshift cannot 'see' each other and cross-database queries are not supported. Therefore we simply have 'development', 'test' and 'production' databases on the same cluster.
When we're ready to push to production we:
take a snapshot
drop production
rename test to production
This generally works fine for use because we find Redshift to be over-provisioned on storage, i.e., filling our nodes to their max storage capacity does not provide acceptable performance.
NOTE: You cannot drop your "master" database defined when the cluster was created. If you are using that as your primary database you will have to unload your cluster and recreate it for this approach to be viable.
I got the answer from AWS RedShift forum: "There is no way of creating a sandbox version of Redshift. We'll add this to our backlog of feature requests"