AWS Glue: Unable to process data from multiple sources (S3 bucket and PostgreSQL db) using Scala-Spark

For my requirement, I need to join data present in a PostgreSQL db (hosted in RDS) with a file present in an S3 bucket. I have created a Glue job (Spark-Scala) which should connect to both PostgreSQL and the S3 bucket and complete the processing.
But the Glue job encounters a connection timeout while connecting to S3 (the error message is below). It successfully fetches data from PostgreSQL.
There is no permission-related issue with S3, because I am able to read/write from the same S3 bucket/path using a different job. The exception happens only when I try to connect to both PostgreSQL and S3 in one Glue job/script.
In the Glue job, the GlueContext is created from a SparkContext object. I have also tried creating two different SparkSessions, one for S3 and one for the PostgreSQL db, but this approach didn't work; the same timeout was encountered.
Please help me resolve the issue.
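For reference, the read logic is roughly equivalent to the following sketch (shown here in PySpark for brevity; the Scala GlueContext API is analogous, and the hosts, credentials, paths, and join key are placeholders):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# One GlueContext/SparkSession is used for both sources
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# PostgreSQL (RDS) via JDBC -- connection details are placeholders
pg_df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://my-rds-host:5432/mydb") \
    .option("dbtable", "public.employees") \
    .option("user", "my_user") \
    .option("password", "my_password") \
    .load()

# File in the S3 bucket -- path is a placeholder
s3_df = spark.read.option("header", "true").csv("s3://emp_bucket/input/employees.csv")

# Join the two sources; the join key is hypothetical
joined = pg_df.join(s3_df, "employee_id")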
Error/Exception from log:
ERROR[main] glue.processLauncher (Logging.scala:logError(91)):Exception in User Class
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to emp_bucket.s3.amazonaws.com:443
[emp_bucket.s3.amazonaws.com/] failed : connect timed out

This is fixed.
The issue was with the security group: only TCP traffic was allowed earlier.
As part of the fix, traffic was opened for all protocols, and an HTTPS rule was added to the inbound rules as well.
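For reference, the equivalent security group change via boto3 would look roughly like this (the group ID is a placeholder; Glue connections also expect a self-referencing rule that allows traffic within the group):

import boto3

ec2 = boto3.client("ec2")
SG_ID = "sg-0123456789abcdef0"  # placeholder: the security group used by the Glue connection

# Inbound HTTPS rule
ec2.authorize_security_group_ingress(
    GroupId=SG_ID,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)

# Self-referencing rule allowing all traffic within the same group
ec2.authorize_security_group_ingress(
    GroupId=SG_ID,
    IpPermissions=[{
        "IpProtocol": "-1",
        "UserIdGroupPairs": [{"GroupId": SG_ID}],
    }],
)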

Related

Dynamically change VPC inside Glue job

Hi, I am having issues with VPC settings. When I use a connection (that has a VPC attached) in a Glue job, I can read data from SQL Server, but then I can't write data into the target Snowflake server due to a "timeout". By timeout I mean that the job doesn't fail for any reason other than the timeout; there are no errors in the logs, etc.
If I remove the connection from the same job and replace the SQL Server data frame with a dummy Spark data frame, it all writes to Snowflake without any issue.
For the connection to SQL Server I use the Data Catalog, and for Snowflake I use the Spark JDBC connector (2 JAR files added to the job).
I am thinking about connecting to the VPC dynamically from the job script itself, then pulling the data into a data frame, then disconnecting from the VPC and writing the data frame to the target. Does anyone think that is possible? I didn't find any mention of it in the documentation, TBH.
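For context, the Snowflake write in such a job typically follows this pattern (a sketch assuming the spark-snowflake connector; all connection values are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glue-to-snowflake").getOrCreate()

# 'df' stands in for the DataFrame read from SQL Server earlier in the job
df = spark.createDataFrame([(1, "example")], ["id", "value"])

# Every connection value below is a placeholder
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}

df.write \
    .format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "TARGET_TABLE") \
    .mode("overwrite") \
    .save()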

Copying data from S3 to Redshift stuck

I have a Redshift cluster in a private subnet, and I have attached an IAM role with a policy granting full permission to S3.
But the COPY or UNLOAD command still gets stuck and finally fails with a timeout error. Any thoughts?
It seems I have to make some network or connectivity changes.

Data Fusion Dataproc compute profile in a different account

I'm trying to execute a pipeline on a Dataproc cluster in a different project than the one where the Data Fusion instance is deployed, but I am having some trouble. The Dataproc cluster seems to be created correctly, but the start of the job fails. Any idea on how to solve this?
Here is the stack trace of the error.
Thanks
It seems like the project where the Cloud Dataproc cluster runs doesn't have the SSH port open. Can you check that your project allows connections on port 22? Cloud Data Fusion uses SSH to upload and monitor the job on the Cloud Dataproc cluster.

Issue while connecting Spark to Redshift using the spark-redshift connector

I need to connect Spark to my Redshift instance to generate data.
I am using Spark 1.6 with Scala 2.10.
I have used a compatible JDBC connector and the spark-redshift connector.
But I am facing a weird problem:
I am using PySpark:
# Read from Redshift via the spark-redshift connector, staging the unload through the S3 tempdir
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("query", "select top 10 * from fact_table") \
    .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
    .option("tempdir", "s3a://redshift-archive/") \
    .load()
When I do df.show(), it gives me a permission denied error on my bucket.
This is weird because I can see files being created in my bucket, but they cannot be read.
PS: I have also set the access key and secret access key.
PS: I am also confused between the s3a and s3n file systems.
Connector used :
https://github.com/databricks/spark-redshift/tree/branch-1.x
It seems the permission is not set for Redshift to access the S3 files. Please follow the steps below:
Add a bucket policy to that bucket that allows the Redshift account access;
Create an IAM role in the Redshift account that Redshift can assume;
Grant permissions to access the S3 bucket to the newly created role;
Associate the role with the Redshift cluster;
Run COPY statements.
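Once the role is associated with the cluster, the read in the question can authorize through it instead of embedded keys, provided the connector version supports the aws_iam_role parameter; a rough sketch (the role ARN is a placeholder):

df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshift_host:5439/events?user=username&password=pass") \
    .option("query", "select top 10 * from fact_table") \
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-s3-access") \
    .option("tempdir", "s3a://redshift-archive/") \
    .load()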

AWS DMS - Scheduled DB Migration

I have a PostgreSQL db in RDS. I need to fetch data from a bunch of tables in the PostgreSQL db and push the data into an S3 bucket every hour. I only want the delta changes (any new inserts/updates) to be sent in the hourly run. Is it possible to do this using DMS, or is EMR a better tool for this activity?
You can create an automated environment for migrating data from RDS to S3 using AWS DMS (Database Migration Service) tasks:
Create a source endpoint (reading your RDS database - Postgres, MySQL, Oracle, etc.);
Create a target endpoint using S3 as the endpoint engine (see: Using Amazon S3 as a Target for AWS Database Migration Service);
Create a replication instance, responsible for bridging the source data and the target endpoint (you only pay while it is processing);
Create a database migration task using the 'Replicate data changes only' option for the migration type;
Create a cron Lambda which starts the DMS task, written in Python, following the instructions in these articles: Lambda with scheduled events and Start DMS tasks with boto3 in Python.
Connecting the resources above, you can have what you want.
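A minimal sketch of such a cron Lambda, assuming the replication task ARN is supplied through an environment variable (the variable name is a placeholder):

import os
import boto3

# Placeholder: ARN of the DMS replication task created in the steps above
TASK_ARN = os.environ["DMS_TASK_ARN"]

dms = boto3.client("dms")

def lambda_handler(event, context):
    # 'resume-processing' continues change capture from where the task last stopped;
    # use 'start-replication' for the very first run
    response = dms.start_replication_task(
        ReplicationTaskArn=TASK_ARN,
        StartReplicationTaskType="resume-processing",
    )
    return response["ReplicationTask"]["Status"]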
Regards,
Renan S.