Using Presto on Cloud Dataproc with Google Cloud SQL?

I use both Hive and MySQL (via Google Cloud SQL) and I want to use Presto to connect to both easily. I have seen there is a Presto initialization action for Cloud Dataproc but it does not work with Cloud SQL out of the box. How can I get that initialization action to work with Cloud SQL so I can use both Hive/Spark and Cloud SQL with Presto?

The easiest way to do this is to edit the initialization action installing Presto on the Cloud Dataproc cluster.
Cloud SQL setup
Before you do this, however, make sure to configure Cloud SQL so it will work with Presto. You will need to:
Create a user for Presto (or have a user ready)
Adjust any necessary firewall rules so your Cloud Dataproc cluster can connect to the Cloud SQL instance
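For example, both steps can be done with the gcloud CLI. This is only a sketch: the instance name my-cloudsql-instance, the password placeholder, and the 203.0.113.0/24 address range are assumptions you would replace with your own values, and the exact flag syntax can vary between Cloud SDK versions.
# Create a MySQL user that Presto will connect as (instance name and password are placeholders)
gcloud sql users create presto --instance=my-cloudsql-instance --password=<strong_password>
# Authorize the Dataproc nodes' external IP range so the cluster can reach the instance
gcloud sql instances patch my-cloudsql-instance --authorized-networks=203.0.113.0/24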
Changing the initialization action
In the Presto initialization action there is a section which sets up the Hive configuration and looks like this:
cat > presto-server-${PRESTO_VERSION}/etc/catalog/hive.properties <<EOF
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
EOF
Below it, you can add a new section that sets up the MySQL properties. Add something like this:
cat > presto-server-${PRESTO_VERSION}/etc/catalog/mysql.properties <<EOF
connector.name=mysql
connection-url=jdbc:mysql://<ip_address>:3306
connection-user=<username>
connection-password=<password>
EOF
You will obviously want to replace <ip_address>, <username>, and <password> with your correct values. Moreover, if you have multiple Cloud SQL instances to connect to, you can add multiple sections and give them different names, so long as the filename ends in .properties.
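Once a cluster is created with the modified initialization action, you can sanity-check the new catalog from the master node. A rough sketch, assuming the initialization action also installs the Presto CLI as presto (the stock script does) and that a database named inventory with a table named stock exists in Cloud SQL; the catalog name mysql comes from the mysql.properties filename above, and the Hive table name is likewise a placeholder:
# List the schemas (databases) Presto sees through the mysql catalog
presto --catalog mysql --execute "SHOW SCHEMAS;"
# Query Cloud SQL and Hive data side by side using fully qualified table names
presto --execute "SELECT * FROM mysql.inventory.stock LIMIT 10;"
presto --execute "SELECT * FROM hive.default.my_hive_table LIMIT 10;"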

Related

AWS Aurora RDS PostgreSQL: create global database for existing cluster through CloudFormation script

We already have an Aurora PostgreSQL cluster and instance in the abc region. Now, as part of our disaster recovery strategy, we are trying to create a read replica in the xyz region.
I was able to create it manually by clicking on "Add Region" in the AWS web console, as explained here.
As part of that, the following has been created:
1. A global database for the existing cluster
2. A secondary-region cluster
3. A secondary-region instance
Everything is fine. Now I have to implement this through a CloudFormation script.
My first question is: can we do this through a CloudFormation script without losing data if the primary cluster and instance have already been created?
If possible, please share the AWS documentation for such CloudFormation scripts.
Please see the other post on this subject: CloudFormation templates for Global Aurora Database
The CloudFormation resource required for setting up the global cluster is AWS::RDS::GlobalCluster, and it is currently not listed in the CloudFormation documentation.
I was able to do the same using Terraform, and that is documented for PostgreSQL here: Getting Aurora PostgreSQL Global Database setup using Terraform
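If infrastructure-as-code is negotiable and the goal is simply to promote the existing cluster without data loss, the same GlobalCluster API is also reachable from the AWS CLI. A hedged sketch, where the identifiers and the cluster ARN are placeholders and flag availability depends on your CLI version:
# Wrap the existing primary Aurora PostgreSQL cluster in a new global database (no data is copied or dropped)
aws rds create-global-cluster \
    --global-cluster-identifier my-global-db \
    --source-db-cluster-identifier arn:aws:rds:abc-region:123456789012:cluster:existing-cluster
# The secondary-region cluster and instance can then be attached to that global database,
# e.g. with aws rds create-db-cluster ... --global-cluster-identifier my-global-db --engine aurora-postgresql in the xyz region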

A way to automatically script starting and stopping the SQL database in GCP

I want to run a job in Cloud Scheduler in GCP to start and stop the SQL database on weekdays during working hours.
I have tried triggering a Cloud Function via Pub/Sub, but I have not found a proper way to do it.
You can use the Cloud SQL Admin API to start or stop an instance. Depending on your language, there are client libraries available to help you do this. This page contains examples using curl.
Once you've created two Cloud Functions (one to start, and one to stop), you can configure the Cloud Scheduler to send a pub/sub trigger to your function. Check out this tutorial which walks you through the process.
In order to achieve this, you can use a Cloud Function to make a call to the Cloud SQL Admin API to start and stop your Cloud SQL instance (you will need two Cloud Functions). You can see my code on how to use a Cloud Function to start a Cloud SQL instance and to stop a Cloud SQL instance.
After creating your Cloud Functions, you can configure Cloud Scheduler to trigger the HTTP address of each Cloud Function.
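As a concrete illustration of the curl approach mentioned above, stopping and starting a Cloud SQL instance comes down to PATCHing its activationPolicy through the Admin API; the two Cloud Functions essentially wrap these two calls. A rough sketch, where my-project and my-instance are placeholders and the access token is taken from gcloud for simplicity:
# Stop the instance by setting its activation policy to NEVER
curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"settings": {"activationPolicy": "NEVER"}}' \
    "https://sqladmin.googleapis.com/sql/v1beta4/projects/my-project/instances/my-instance"
# Start it again by setting the activation policy back to ALWAYS
curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"settings": {"activationPolicy": "ALWAYS"}}' \
    "https://sqladmin.googleapis.com/sql/v1beta4/projects/my-project/instances/my-instance"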

How can I use dataproc to pull data from bigquery that is not in the same project as my dataproc cluster?

I work for an organisation that needs to pull data from one of our client's BigQuery datasets using Spark, and given that both the client and we use GCP, it makes sense to use Dataproc to achieve this.
I have read Use the BigQuery connector with Spark which looks very useful however it seems to make the assumption that the dataproc cluster, the bigquery dataset and the storage bucket for temporary BigQuery export are all in the same GCP project - that is not the case for me.
I have a service account key file that allows me to connect to and interact with our client's data stored in bigquery, how can I use that service account key file in conjunction with the BigQuery connector and dataproc in order to pull data from bigquery and interact with it in dataproc? To put it another way, how can I modify the code provided at Use the BigQuery connector with Spark to use my service account key file?
To use service account keyfile authorization, you need to set the mapred.bq.auth.service.account.enable property to true and point the BigQuery connector to the service account JSON keyfile using the mapred.bq.auth.service.account.json.keyfile property (at the cluster or job level). Note that this property value is a local path, which is why you need to distribute the keyfile to all the cluster nodes beforehand, using an initialization action, for example.
Alternatively, you can use any authorization method described here, but you need to replace the fs.gs properties prefix with mapred.bq for the BigQuery connector.
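For example, a rough sketch of the job-level variant, assuming an initialization action has already copied the keyfile to /etc/hadoop/conf/client-key.json on every node (the cluster name, script name, and paths are placeholders; the spark.hadoop. prefix is just the standard Spark mechanism for passing Hadoop configuration to the job):
# Submit a PySpark job that authorizes the BigQuery connector with the client's service account key
gcloud dataproc jobs submit pyspark my_bq_job.py --cluster=my-cluster \
    --properties=spark.hadoop.mapred.bq.auth.service.account.enable=true,spark.hadoop.mapred.bq.auth.service.account.json.keyfile=/etc/hadoop/conf/client-key.json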

Loading data from S3 to PostgreSQL RDS

We are planning to go with PostgreSQL RDS in the AWS environment. There are some files in S3 which we will need to load every week. I don't see any option in the AWS documentation for loading data from S3 into PostgreSQL RDS. I see it is possible for Aurora but cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON script that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's set up to move data between S3 and MySQL. You can find it here. You can easily follow this and swap out the MySQL parameters with those of your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!
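If you prefer to script it rather than click through the console, the same pipeline can be created and activated with the AWS CLI once you have the template's definition saved as JSON. A hedged sketch, where the pipeline name and the definition file are placeholders:
# Create an empty pipeline and capture its ID
PIPELINE_ID=$(aws datapipeline create-pipeline \
    --name weekly-s3-to-rds --unique-id weekly-s3-to-rds \
    --query pipelineId --output text)
# Upload the definition (the "Load S3 to MySQL table" template, edited to point at your Postgres RDS instance)
aws datapipeline put-pipeline-definition --pipeline-id "$PIPELINE_ID" \
    --pipeline-definition file://s3-to-rds-definition.json
# Activate it; the schedule inside the definition takes care of the weekly runs
aws datapipeline activate-pipeline --pipeline-id "$PIPELINE_ID"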

How to replicate MySQL database to Cloud SQL Database

I have read that you can replicate a Cloud SQL database to MySQL. Instead, I want to replicate from a MySQL database (that the business uses to keep inventory) to Cloud SQL so it can have up-to-date inventory levels for use on a web site.
Is it possible to replicate from MySQL to Cloud SQL? If so, how do I configure that?
This is something that is not yet possible in Cloud SQL.
I'm using DBSync to do it, and it's working fine.
http://dbconvert.com/mysql.php
The Sync version does what you want.
It works well with App Engine and Cloud SQL. You must authorize external connections first.
This is a rather old question, but it might be worth noting that this now seems possible by Configuring External Masters.
The high level steps are:
Create a dump of the data from the master and upload the file to a Cloud Storage bucket
Create a master instance in Cloud SQL
Set up a replica of that instance using the external master IP, username, and password. Also provide the dump file location
Set up additional replicas if needed
Voilà!
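As a rough sketch of steps 1 and 3 from the command line: the bucket name, host address, user, database, and instance names below are placeholders, and the gcloud flags for the external-master replica flow have changed between SDK versions, so treat them as an assumption and verify them against the Configuring External Masters page:
# Step 1: dump the external MySQL database and upload it to a Cloud Storage bucket
mysqldump -h 203.0.113.10 -u repl_user -p --databases inventory \
    --master-data=1 --single-transaction --hex-blob | gzip > inventory-dump.sql.gz
gsutil cp inventory-dump.sql.gz gs://my-replication-bucket/
# Step 3: create the Cloud SQL replica, pointing it at the external master representation and the dump file
# (flag names are an assumption; check your gcloud version and add the master password flag it expects)
gcloud sql instances create inventory-replica \
    --master-instance-name=external-mysql-master \
    --master-username=repl_user \
    --master-dump-file-path=gs://my-replication-bucket/inventory-dump.sql.gz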