How can I use dataproc to pull data from bigquery that is not in the same project as my dataproc cluster? - google-cloud-dataproc

I work for an organisation that needs to pull data from one of our client's BigQuery datasets using Spark, and given that both the client and ourselves use GCP it makes sense to use Dataproc to achieve this.
I have read Use the BigQuery connector with Spark, which looks very useful; however, it seems to assume that the Dataproc cluster, the BigQuery dataset and the storage bucket for the temporary BigQuery export are all in the same GCP project - that is not the case for me.
I have a service account key file that allows me to connect to and interact with our client's data stored in BigQuery. How can I use that service account key file in conjunction with the BigQuery connector and Dataproc in order to pull data from BigQuery and interact with it in Dataproc? To put it another way, how can I modify the code provided in Use the BigQuery connector with Spark to use my service account key file?

To use service account keyfile authorization you need to set the mapred.bq.auth.service.account.enable property to true and point the BigQuery connector to the service account JSON keyfile using the mapred.bq.auth.service.account.json.keyfile property (as a cluster or job property). Note that this property value is a local path, which is why you need to distribute the keyfile to all the cluster nodes beforehand, using an initialization action, for example.
Alternatively, you can use any authorization method described here, but you need to replace the fs.gs properties prefix with mapred.bq for the BigQuery connector.
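For concreteness, here is a minimal PySpark sketch in the style of the Use the BigQuery connector with Spark sample, with the two authorization properties above added. The input-format class and the mapred.bq property names follow that documented sample; the project IDs, bucket, dataset, table and keyfile path are placeholders, and the keyfile must already exist at the given local path on every node:
from pyspark import SparkContext

sc = SparkContext()

conf = {
    # Service account keyfile authorization (the two properties above).
    # The path is local to each node, hence the initialization action.
    "mapred.bq.auth.service.account.enable": "true",
    "mapred.bq.auth.service.account.json.keyfile": "/etc/keys/client-sa.json",
    # Project and bucket used for the temporary BigQuery export.
    "mapred.bq.project.id": "your-project-id",
    "mapred.bq.gcs.bucket": "your-temp-export-bucket",
    "mapred.bq.temp.gcs.path": "gs://your-temp-export-bucket/tmp/bq-export",
    # The client's project, dataset and table you want to read.
    "mapred.bq.input.project.id": "client-project-id",
    "mapred.bq.input.dataset.id": "client_dataset",
    "mapred.bq.input.table.id": "client_table",
}

# Each record comes back as (row id, JSON string of the row).
table_data = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)
print(table_data.take(5))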

Related

Issue while connecting Spark to Redshift using the spark-redshift connector

I need to connect Spark to my Redshift instance to generate data.
I am using Spark 1.6 with Scala 2.10.
I have used a compatible JDBC connector and the spark-redshift connector.
But I am facing a weird problem. I am using pyspark:
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("query", "select top 10 * from fact_table") \
    .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
    .option("tempdir", "s3a://redshift-archive/") \
    .load()
When I do df.show() it gives me a permission denied error on my bucket.
This is weird because I can see files being created in my bucket, but they cannot be read.
PS: I have also set the access key and secret access key.
PS: I am also confused between the s3a and s3n file systems.
Connector used:
https://github.com/databricks/spark-redshift/tree/branch-1.x
It seems the permissions are not set for Redshift to access the S3 files. Please follow the steps below (a PySpark sketch follows the list):
Add a bucket policy to that bucket that allows access from the Redshift account
Create an IAM role in the Redshift account that Redshift can assume
Grant permissions to access the S3 bucket to the newly created role
Associate the role with the Redshift cluster
Run COPY statements
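To tie this back to the snippet in the question, here is a hedged PySpark sketch of the resulting setup: Spark writes the temp files to S3 with its own credentials, while Redshift reads them by assuming the role created above. Note that the aws_iam_role option only exists in newer spark-redshift releases (it may not be available on branch-1.x), and the bucket, role ARN and credentials are placeholders:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Credentials Spark itself uses to read/write the tempdir via s3a.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

df = (sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass")
      .option("query", "select top 10 * from fact_table")
      .option("tempdir", "s3a://redshift-archive/")
      # Role associated with the Redshift cluster in the steps above;
      # Redshift's COPY/UNLOAD uses it to reach the bucket.
      .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-s3-access")
      .load())
df.show()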

Loading data from S3 to PostgreSQL RDS

We are planning to go for PostgreSQL RDS in AWS environment. There are some files in S3 which we will need to load every week. I don't see any option in AWS documentation where we can load data from S3 to PostgreSQL RDS. I see it is possible for Aurora but cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON-defined workflow that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's set up to move data between S3 and MySQL. You can find it here. You can easily follow this and swap out the MySQL parameters with those associated with your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!
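If you prefer to drive this from code rather than the console, the same pipeline can also be activated and checked with boto3. This is an optional addition rather than part of the template workflow, and the region and pipeline id below are placeholders:
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Find the pipeline created from the "Load S3 to MySQL table" template.
for p in dp.list_pipelines()["pipelineIdList"]:
    print(p["id"], p["name"])

pipeline_id = "df-XXXXXXXXXXXX"  # placeholder

# Kick off a run now; scheduled runs continue as configured.
dp.activate_pipeline(pipelineId=pipeline_id)

# Check the overall state (@pipelineState is e.g. SCHEDULED or FINISHED).
desc = dp.describe_pipelines(pipelineIds=[pipeline_id])
for field in desc["pipelineDescriptionList"][0]["fields"]:
    if field["key"] == "@pipelineState":
        print("state:", field.get("stringValue"))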

Using Presto on Cloud Dataproc with Google Cloud SQL?

I use both Hive and MySQL (via Google Cloud SQL) and I want to use Presto to connect to both easily. I have seen there is a Presto initialization action for Cloud Dataproc but it does not work with Cloud SQL out of the box. How can I get that initialization action to work with Cloud SQL so I can use both Hive/Spark and Cloud SQL with Presto?
The easiest way to do this is to edit the initialization action installing Presto on the Cloud Dataproc cluster.
Cloud SQL setup
Before you do this, however, make sure to configure Cloud SQL so it will work with Presto. You will need to:
Create a user for Presto (or have a user ready)
Adjust any necessary firewall rules so your Cloud Dataproc cluster can connect to the Cloud SQL instance
Changing the initialization action
In the Presto initialization action there is a section which sets up the Hive configuration and looks like this:
cat > presto-server-${PRESTO_VERSION}/etc/catalog/hive.properties <<EOF
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
EOF
Below that, you can add a new section which sets up the MySQL properties. Add something like this:
cat > presto-server-${PRESTO_VERSION}/etc/catalog/mysql.properties <<EOF
connector.name=mysql
connection-url=jdbc:mysql://<ip_address>:3306
connection-user=<username>
connection-password=<password>
EOF
You will obviously want to replace <ip_address>, <username>, and <password> with your correct values. Moreover, if you have multiple Cloud SQL instances to connect to, you can add multiple sections and give them different names, so long as the filename ends in .properties.
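Once the cluster is up you can verify both catalogs from a client. As one possibility (this is an addition, not part of the initialization action itself), here is a sketch using the presto-python-client package (pip install presto-python-client); the host, port, schema and table are placeholders, and the catalog names match the .properties files created above:
import prestodb

conn = prestodb.dbapi.connect(
    host="cluster-name-m",  # the Dataproc master, where Presto runs
    port=8080,              # adjust to whatever port your initialization action uses
    user="presto",
    catalog="mysql",        # or "hive" for the Hive catalog
    schema="your_database",
)

cur = conn.cursor()
cur.execute("SELECT * FROM your_table LIMIT 10")
for row in cur.fetchall():
    print(row)
Switching catalog="mysql" to catalog="hive" lets the same script query your Hive tables.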

Google Cloud Dataproc - Submit Spark Jobs Via Spark

Is there a way to submit Spark jobs to Google Cloud Dataproc from within Scala code?
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("...")
What should the master URI look like?
What key-value pairs should be set to authenticate with an API key or keypair?
In this case, I'd strongly recommend an alternative approach. This type of connectivity has not been tested and is not recommended, for a few reasons:
It requires opening firewall ports to connect to the cluster
Unless you use a tunnel, your data may be exposed
Authentication is not enabled by default
Is SSHing into the master node (the node named cluster-name-m) a non-starter? It is pretty easy to SSH into the master node and use Spark directly.
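If SSH does not fit your workflow and you still want to trigger jobs from code, one documented route (not covered above, so treat this as a hedged sketch) is the Dataproc Jobs API rather than pointing SparkConf at the cluster. The example below assumes the google-cloud-dataproc Python client (v2+ request style); the project, region, cluster name, main class and jar URI are placeholders:
from google.cloud import dataproc_v1

project_id = "your-project"
region = "us-central1"
cluster_name = "your-cluster"

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "spark_job": {
        "main_class": "com.example.YourSparkApp",            # placeholder
        "jar_file_uris": ["gs://your-bucket/your-app.jar"],  # placeholder
    },
}

# Submit and wait for completion; credentials come from the environment
# (e.g. GOOGLE_APPLICATION_CREDENTIALS) rather than SparkConf settings.
operation = client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
)
result = operation.result()
print(result.status.state)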

How to replicate MySQL database to Cloud SQL Database

I have read that you can replicate a Cloud SQL database to MySQL. Instead, I want to replicate from a MySQL database (that the business uses to keep inventory) to Cloud SQL so it can have up-to-date inventory levels for use on a web site.
Is it possible to replicate MySQL to Cloud SQL? If so, how do I configure that?
This is something that is not yet possible in Cloud SQL.
I'm using DBSync to do it, and it is working fine.
http://dbconvert.com/mysql.php
The Sync version does what you want.
It works well with App Engine and Cloud SQL. You must authorize external connections first.
This is a rather old question, but it might be worth noting that this now seems possible by Configuring External Masters.
The high-level steps are:
Create a dump of the data from the master and upload the file to a storage bucket (this step is sketched below)
Create a master instance in Cloud SQL
Set up a replica of that instance, using the external master IP, username and password. Also provide the dump file location
Set up additional replicas if needed
Voilà!
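For the first step in the list above, here is a hedged sketch of dumping the external master and uploading the file to a bucket with the google-cloud-storage client. The host, credentials, database name, bucket and the exact mysqldump flags are placeholders you should adapt to your setup:
import subprocess
from google.cloud import storage

DUMP_FILE = "/tmp/inventory-dump.sql"

# Dump the database; --master-data records the binlog position the
# replica will start from (adjust the flags to your environment).
subprocess.check_call([
    "mysqldump",
    "-h", "mysql-master-host",
    "-u", "repl_user",
    "-pYOUR_PASSWORD",
    "--databases", "inventory",
    "--master-data=2",
    "--single-transaction",
    "--hex-blob",
    f"--result-file={DUMP_FILE}",
])

# Upload the dump to the bucket that Cloud SQL will import it from.
bucket = storage.Client().bucket("your-replication-bucket")
bucket.blob("dumps/inventory-dump.sql").upload_from_filename(DUMP_FILE)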