Issue while connecting Spark to Redshift using the spark-redshift connector (PySpark)

I need to connect Spark to my Redshift instance to generate data.
I am using Spark 1.6 with Scala 2.10.
I have used a compatible JDBC driver and the spark-redshift connector.
But I am facing a weird problem.
I am using PySpark:
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("query", "select top 10 * from fact_table") \
    .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
    .option("tempdir", "s3a://redshift-archive/") \
    .load()
When I call df.show(), it gives me a permission denied error on my bucket.
This is weird because I can see files being created in my bucket, but they cannot be read.
PS: I have also set the access key and secret access key.
PS: I am also confused between the s3a and s3n file systems.
Connector used :
https://github.com/databricks/spark-redshift/tree/branch-1.x

It seems the permission is not set for Redshift to access the S3 files. Please follow the steps below:
Add a bucket policy to that bucket that allows the Redshift account access
Create an IAM role in the Redshift account that Redshift can assume
Grant permissions to access the S3 bucket to the newly created role
Associate the role with the Redshift cluster
Run COPY statements
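Once the role is associated with the cluster, the connector can use it instead of embedded keys via the aws_iam_role reader option (documented in the databricks/spark-redshift README). A minimal sketch of the reader options, assuming a placeholder role ARN and the query/URL from the question:

```python
def redshift_read_options(query, url, tempdir, iam_role):
    """Build the option map for sqlContext.read.format("com.databricks.spark.redshift")."""
    return {
        "query": query,
        "url": url,
        "tempdir": tempdir,
        # With a role associated to the cluster, pass its ARN instead of keys:
        "aws_iam_role": iam_role,
    }

opts = redshift_read_options(
    "select top 10 * from fact_table",
    "jdbc:redshift://redshift_host:5439/events?user=username&password=pass",
    "s3a://redshift-archive/",
    "arn:aws:iam::123456789012:role/redshift-s3-read",  # placeholder ARN
)
# In the PySpark job (not executed here):
# df = sqlContext.read.format("com.databricks.spark.redshift") \
#          .options(**opts).load()
```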

Related

How to connect Apache Spark (installed on-prem) to BigQuery using a service account?

I am trying to connect to BigQuery from Spark installed on premises, but I am unable to figure out how to do that. I have tried accessing BigQuery tables from Dataproc and that works fine.
Essentially I want to authenticate using service account credentials. One workaround I found is to create the Spark DataFrame, convert it into a pandas DataFrame, and then use that DataFrame to create tables in BigQuery. Is there any way to create tables in BigQuery from a Spark DataFrame directly?

How can I copy data from one schema to another in Redshift with Airflow without an S3 bucket

Can you help me create a DAG for AWS Managed Airflow to copy data from one schema to another (both are in one database) in Redshift without an S3 bucket?
Thanks.
INSERT INTO {target_schema}.{table} SELECT * FROM {source_schema}.{table}
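Since both schemas live in the same database, the copy can run entirely inside Redshift with no S3 staging. A minimal sketch of rendering that statement, with placeholder schema and table names; in a real DAG the rendered string would be passed to a PostgresOperator or RedshiftSQLOperator from the corresponding Airflow provider:

```python
def copy_statement(source_schema: str, target_schema: str, table: str) -> str:
    """Render the cross-schema INSERT ... SELECT run inside Redshift itself."""
    return (
        f"INSERT INTO {target_schema}.{table} "
        f"SELECT * FROM {source_schema}.{table}"
    )

sql = copy_statement("staging", "analytics", "events")
# Hypothetical task wiring (Airflow Postgres provider, not executed here):
# copy_task = PostgresOperator(task_id="copy_events",
#                              postgres_conn_id="redshift_default", sql=sql)
```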

AWS Glue: Unable to process data from multiple sources (S3 bucket and PostgreSQL db) with AWS Glue using Scala-Spark

For my requirement, I need to join data present in a PostgreSQL db (hosted in RDS) with a file present in an S3 bucket. I have created a Glue job (Spark-Scala) which should connect to both PostgreSQL and the S3 bucket and complete the processing.
But the Glue job encounters a connection timeout while connecting to S3 (below is the error message). It successfully fetches data from PostgreSQL.
There is no permission-related issue with S3 because I am able to read/write from the same S3 bucket/path using a different job. The exception happens only if I try to connect to both PostgreSQL and S3 in one Glue job/script.
In the Glue job, the Glue context is created from a SparkContext object. I have tried creating two different SparkSessions, one each for S3 and the PostgreSQL db, but this approach didn't work; the same timeout issue was encountered.
Please help me resolve the issue.
Error/Exception from log:
ERROR[main] glue.processLauncher (Logging.scala:logError(91)):Exception in User Class
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to emp_bucket.s3.amazonaws.com:443
[emp_bucket.s3.amazonaws.com/] failed : connect timed out
This is fixed.
The issue was with the security group: only TCP traffic was allowed earlier.
As part of the fix, traffic was opened up, and an HTTPS rule was added to the inbound rules as well.

How can I use dataproc to pull data from bigquery that is not in the same project as my dataproc cluster?

I work for an organisation that needs to pull data from one of our client's bigquery datasets using Spark and given that both the client and ourselves use GCP it makes sense to use Dataproc to achieve this.
I have read Use the BigQuery connector with Spark which looks very useful however it seems to make the assumption that the dataproc cluster, the bigquery dataset and the storage bucket for temporary BigQuery export are all in the same GCP project - that is not the case for me.
I have a service account key file that allows me to connect to and interact with our client's data stored in bigquery, how can I use that service account key file in conjunction with the BigQuery connector and dataproc in order to pull data from bigquery and interact with it in dataproc? To put it another way, how can I modify the code provided at Use the BigQuery connector with Spark to use my service account key file?
To use service account keyfile authorization you need to set the mapred.bq.auth.service.account.enable property to true and point the BigQuery connector to a service account JSON keyfile using the mapred.bq.auth.service.account.json.keyfile property (at the cluster or job level). Note that this property value is a local path, which is why you need to distribute the keyfile to all the cluster nodes beforehand, using an initialization action, for example.
Alternatively, you can use any authorization method described here, but you need to replace the fs.gs properties prefix with mapred.bq for the BigQuery connector.
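Putting the two properties from the answer together, assuming a placeholder keyfile path (in practice they would be supplied as cluster or job properties, e.g. via a --properties flag on job submission, and the keyfile must exist at that path on every node):

```python
# The two connector properties named in the answer; the keyfile path is a
# placeholder and must be present on all worker nodes (hence the init action).
bq_auth_props = {
    "mapred.bq.auth.service.account.enable": "true",
    "mapred.bq.auth.service.account.json.keyfile": "/etc/bq/client-key.json",
}

# Rendered in the comma-separated key=value form a properties flag expects:
props_flag = ",".join(f"{k}={v}" for k, v in bq_auth_props.items())
```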

Loading data from S3 to PostgreSQL RDS

We are planning to go with PostgreSQL RDS in the AWS environment. There are some files in S3 which we will need to load every week. I don't see any option in the AWS documentation to load data from S3 into PostgreSQL RDS. I see it is possible for Aurora but cannot find anything for PostgreSQL.
Any help will be appreciated.
One option is to use AWS Data Pipeline. It's essentially a JSON script that allows you to orchestrate the flow of data between sources on AWS.
There's a template offered by AWS that's set up to move data between S3 and MySQL. You can find it here. You can easily follow this and swap out the MySQL parameters with those associated with your Postgres instance. Data Pipeline simply looks for RDS as the type and does not distinguish between MySQL and Postgres instances.
Scheduling is also supported by Data Pipeline, so you can automate your weekly file transfers.
To start this:
Go to the Data Pipeline service in your AWS console
Select "Build from template" under source
Select the "Load S3 to MySQL table" template
Fill in the rest of the fields and create the pipeline
From there, you can monitor the progress of the pipeline in the console!