Google Cloud SQL PostgreSQL replication? - postgresql

I want to make sure that there's not a better (easier, more elegant) way of emulating what I think is typically referred to as "logical replication" ("logical decoding"?) within the PostgreSQL community.
I've got a Cloud SQL instance (PostgreSQL v9.6) that contains two databases, A and B. I want B to mirror A, as closely as possible, but don't need to do so in real time or anything near that. Cloud SQL does not offer the capability of logical replication where write-ahead logs are used to mirror a database (or subset thereof) to another database. So I've cobbled together the following:
A Cloud Scheduler job publishes a message to a topic in Google Cloud Platform (GCP) Pub/Sub.
A Cloud Function kicks off an export. The exported file is in the form of a pg_dump file.
The dump file is written to a named bucket in Google Cloud Storage (GCS).
Another Cloud Function (the import function) is triggered by the writing of this export file to GCS.
The import function makes an API call to delete database B (the pg_dump file created by the export API call does not contain initial DROP statements and there is no documented facility for adding them via the API).
It creates database B anew.
It makes an API call to import the pg_dump file.
It deletes the old pg_dump file.
That's five different objects across four GCP services, just to obtain already existing, native functionality in PostgreSQL.
Is there a better way to do this within Google Cloud SQL?
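For reference, the export and import calls in the steps above could use request bodies like the ones in this minimal sketch. It assumes the Cloud SQL Admin API `instances.export` and `instances.import` methods; the function names, bucket, and object names are hypothetical, and the code only builds the bodies without calling the API.

```python
"""Sketch: request bodies for the Cloud SQL Admin API export/import calls.

The helper names, bucket, and object names are illustrative assumptions.
Nothing here contacts Google Cloud; it only builds the JSON payloads.
"""


def export_body(bucket: str, object_name: str, database: str) -> dict:
    """Body for instances.export: dump one database as SQL into GCS."""
    return {
        "exportContext": {
            "fileType": "SQL",
            "uri": f"gs://{bucket}/{object_name}",
            # For PostgreSQL, a single database is exported per call.
            "databases": [database],
        }
    }


def import_body(bucket: str, object_name: str, database: str) -> dict:
    """Body for instances.import: load the SQL dump into the target database."""
    return {
        "importContext": {
            "fileType": "SQL",
            "uri": f"gs://{bucket}/{object_name}",
            "database": database,
        }
    }


if __name__ == "__main__":
    print(export_body("my-dump-bucket", "a.sql", "A"))
    print(import_body("my-dump-bucket", "a.sql", "B"))
```

The export function would send the first body and the import function the second, after database B has been dropped and recreated.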

Related

Best way to set up jupyter notebook project in AWS

My current project has the following structure:
It starts with a script in a Jupyter notebook that downloads data from a CRM API into a local PostgreSQL database I manage with pgAdmin. After that it runs a cluster analysis, returns some scoring values, creates a table in the database with the results, and updates these values in the CRM with another API call. This process will take between 10 and 20 hours (the API only allows 400 requests per minute).
The second notebook reads the database, detects the last update, makes an API call to update the database since the last call, runs a k-means analysis to cluster the data, compares the results with the previous run, and updates the new ones and the CRM via the API. This second process takes less than 2 hours by my estimate, and I want this script to run every 24 hours.
After testing, this works fine. Now I'm evaluating how to put this into production on AWS. I understand that for the notebooks I need SageMaker, and from what I have seen it is not that complicated; my only doubt here is whether I can call the API without implementing additional code or whether I need some configuration. My second problem is the database. I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3. My goal is to write as little code as possible, but I have tried some RDS tutorials like this one: [1]: https://www.youtube.com/watch?v=6fDTre5gikg&t=10s, and I understand this connects my local Postgres to AWS, but I can't find the data on the Amazon page. Does it only create an instance? And how do I connect to it to analyze this data from SageMaker? My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud. Some orientation on how to use these tools would be appreciated.
I don't understand the difference between RDS which is the one I think I have to use for this and Aurora or S3
RDS and Aurora are relational databases fully managed by AWS. "Regular" RDS lets you launch existing popular database engines such as MySQL, PostgreSQL and others, which you could also run at home or at work.
Aurora is AWS's in-house, cloud-native database implementation, compatible with MySQL and PostgreSQL. It can store the same data as RDS MySQL or PostgreSQL, but provides a number of features not available in RDS, such as more read replicas, distributed storage, global databases and more.
S3 is not a database but an object store, where you can keep files such as images, CSVs and Excel files, much as you would store them on your computer.
I understand this connects my local Postgres to AWS, but I can't find the data on the Amazon page. Does it only create an instance?
You can migrate your data from your local PostgreSQL to RDS or Aurora if you wish. But neither RDS nor Aurora will connect to your existing local database, as they are databases themselves.
My final goal is to run the notebooks in the cloud and connect to my postgres in the cloud.
I don't see a reason why you wouldn't be able to connect to the database. You can try to make it work, and if you run into difficulties you can ask a new question on SO with your RDS/Aurora setup details.
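As a rough illustration of what connecting from a notebook involves, the sketch below builds a PostgreSQL connection URL for an RDS endpoint. The host name, user, password and database name are made-up placeholders, and `postgres_url` is a hypothetical helper, not part of any AWS library.

```python
from urllib.parse import quote


def postgres_url(user: str, password: str, host: str, dbname: str,
                 port: int = 5432) -> str:
    """Build a libpq-style connection URL, percent-encoding the credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{dbname}"
    )


if __name__ == "__main__":
    # Placeholder values; the real endpoint is shown on the instance's page
    # in the RDS console once the instance has been created.
    print(postgres_url("analyst", "p@ss/word",
                       "mydb.example.us-east-1.rds.amazonaws.com", "crm"))
```

A URL like this can typically be passed to a PostgreSQL client library such as psycopg2 or SQLAlchemy inside the notebook, provided the instance's security group allows inbound connections from where the notebook runs.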

Cloud PostgreSQL clean large objects vacuumlo

We are using GCP Cloud SQL for our PostgreSQL database. At the moment one of our applications uses large objects, and I was wondering how to perform a vacuumlo operation on such platforms (the question may also apply to AWS RDS or any other cloud PostgreSQL provider).
Is writing custom queries/procedures to perform the same task the only solution?
Since vacuumlo is a client tool, it should work just fine with hosted databases.
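A minimal sketch of pointing vacuumlo at a hosted instance, assuming the client tool is installed locally and the instance accepts connections from your machine. The host, user and database names are placeholders, and `vacuumlo_command` is a hypothetical helper that only assembles the command line.

```python
def vacuumlo_command(host: str, user: str, dbname: str,
                     port: int = 5432, dry_run: bool = True) -> list:
    """Assemble a vacuumlo invocation against a remote PostgreSQL server."""
    cmd = ["vacuumlo", "-h", host, "-p", str(port), "-U", user, "-v"]
    if dry_run:
        # -n reports orphaned large objects without removing them.
        cmd.append("-n")
    cmd.append(dbname)
    return cmd


if __name__ == "__main__":
    # Placeholder connection details for illustration only.
    print(" ".join(vacuumlo_command("10.0.0.5", "postgres", "appdb")))
```

The resulting list can be run with `subprocess.run(cmd, check=True)`; the password can be supplied through the `PGPASSWORD` environment variable or a `~/.pgpass` file. Starting with the dry run shows what would be removed before anything is deleted.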

Loading data from S3 to PostgreSQL RDS

We are planning to go for PostgreSQL RDS in AWS environment. There are some files in S3 which we will need to load every week. I don't see any option in AWS documentation where we can load data from S3 to PostgreSQL RDS. I see it is possible for Aurora but cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON script that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's set up to move data between S3 and MySQL. You can find it here. You can follow it and swap out the MySQL parameters for those of your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!

If I need to unload data and copy data between 2 Redshift clusters, what is the best approach to script the process?

I have migrated data between Amazon Redshift clusters interactively, using UNLOAD/COPY commands via S3. The next step is to automate the process, and I'm looking for the best approach to do so.
You can use Java or any other language to perform the steps below and automate them:
1) connect to cluster 1
2) unload data to amazon s3
3) connect to cluster 2
4) copy data from amazon s3 to redshift cluster
A shell script, PHP, or a simple Java program will do.
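The four steps above could be sketched as follows. This only generates the UNLOAD and COPY statements; the table name, S3 prefix and IAM role ARN are placeholders, the helper names are hypothetical, and the options shown (GZIP, ALLOWOVERWRITE) are one reasonable choice rather than a requirement.

```python
def unload_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Statement to run on cluster 1: write the table to S3 as gzip files."""
    return (
        f"UNLOAD ('SELECT * FROM {table}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "GZIP ALLOWOVERWRITE;"
    )


def copy_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Statement to run on cluster 2: load the gzip files from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "GZIP;"
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    prefix = "s3://my-staging-bucket/sales/part_"
    role = "arn:aws:iam::123456789012:role/my-redshift-role"
    print(unload_sql("sales", prefix, role))
    print(copy_sql("sales", prefix, role))
```

A script would execute the first statement over a connection to cluster 1 and the second over a connection to cluster 2, using any PostgreSQL-compatible driver. Data containing the delimiter character or quotes may need extra options such as ESCAPE on both statements.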
Here are the two ways that you can try:
Use a Python or bash script to unload and copy data from one Redshift cluster to another. In this approach the staging area is S3. If you are unloading and copying between separate accounts, you need appropriate IAM roles and trust policies, which can be a little challenging. You can automate this process with AWS Data Pipeline.
Take a snapshot and restore a Redshift cluster from it. If you want to share the snapshot with another account, go to Manage Access and enter the account ID of the destination Redshift cluster. This is very simple and requires no code.

How to replicate MySQL database to Cloud SQL Database

I have read that you can replicate a Cloud SQL database to MySQL. Instead, I want to replicate from a MySQL database (that the business uses to keep inventory) to Cloud SQL so it can have up-to-date inventory levels for use on a web site.
Is it possible to replicate MySQL to Cloud SQL? If so, how do I configure that?
This is something that is not yet possible in CloudSQL.
I'm using DBSync to do it, and it is working fine.
http://dbconvert.com/mysql.php
The Sync version does what you want.
It works well with App Engine and Cloud SQL. You must authorize external connections first.
This is a rather old question, but it might be worth noting that this now seems possible by Configuring External Masters.
The high level steps are:
Create a dump of the data from the master and upload the file to a storage bucket
Create a master instance in CloudSQL
Set up a replica of that instance, using the external master IP, username and password. Also provide the dump file location
Set up additional replicas if needed
Voilà!