Using AWS Glue Python jobs to run ETL on redshift - amazon-redshift

We have a setup to sync rds postgres changes into s3 using DMS. Now, I want to run ETL on this s3 data(in parquet) using Glue as scheduler.
My plan is to build SQL queries to do the transformation, execute them on redshift spectrum and unload data back into s3 in parquet format. I don't want to Glue Spark as my data loads do not require that kind of capacity.
However, I am facing some problems connecting to redshift from glue, primarily library version issues and the right whl files to be used for pg8000/psycopg2. Wondering if anyone has experience with such implementation and how were you able to manage the db connections from Glue Python shell.

I'm doing something similar in a Python Shell Job but with Postgres instead of Redshift.
This is the whl file I use
psycopg_binary-2.9.2-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
An updated version can be found here.

Related

Best practice for importing bulk data to AWS RDS PostgreSQL database

I have a big AWS RDS database that needs to be updated with data on a periodic basis. The data is in JSON files stored in S3 buckets.
This is my current flow:
Download all the JSON files locally
Run a ruby script to parse the JSON files to generate a CSV file matching the table in the database
Connect to RDS using psql
Use \copy command to append the data to the table
I would like switch this to an automated approach (maybe using an AWS Lambda). What would be the best practices?
Approach 1:
Run a script (Ruby / JS) that parses all folders in the past period (e.g., week) and within the parsing of each file, connect to the RDS db and execute an INSERT command. I feel this is a very slow process with constant writes to the database and wouldn't be optimal.
Approach 2:
I already have a Ruby script that parses local files to generate a single CSV. I can modify it to parse the S3 folders directly and create a temporary CSV file in S3. The question is - how do I then use this temporary file to do a bulk import?
Are there any other approaches that I have missed and might be better suited for my requirement?
Thanks.

o110.pyWriteDynamicFrame. null

I have created a visual job in AWS Glue where I extract data from Snowflake and then my target is a postgresql database in AWS.
I have been able to connect to both Snowflak and Postgre, I can preview data from both.
I have also been able to get data from snoflake, write to s3 as csv and then take that csv and upload it to postgre.
However when I try to get data from snowflake and push it to postgre I get the below error:
o110.pyWriteDynamicFrame. null
So it means that you can get the data from snowflake in a Datafarme and while writing the data from this datafarme to postgres, you are failing.
You need to check was glue logs to get more understanding why is this failing while writing the data into postgres.
Please check if you have the right version of jars (needed by postgres) compatible with scala(on was glue side).

Is it possible to connect to a Postgres DB from SageMaker Data Wrangler?

I set up a regular Postgres DB in AWS using the Amazon Relational Database Service (RDS). I would like to ingest this data using data wrangler for inspection and further processing.
Is this possible? I only see S3, Athena, Redshift and SnowFlake as the data ingestion options. Does this mean I must move the data from Postgresql to one of these 4 options before I can use Data Wrangler?
If it's not possible through data wrangler, can I connect to my Postgres through a Jupyter notebook, using a connection string or some kind of option like this? I'm looking to use the data for the SageMaker Feature Store.
Using Amazon RDS with DataWrangler is not possible (yet).
That said, you can connect to Postgres in Jupyter using Python and then use the Feature Store API to ingest data and use it for subsequent tasks.
Source: AWS employee

loading one table from RDS / postgres into Redshift

We have a Redshift cluster that needs one table from one of our RDS / postgres databases. I'm not quite sure the best way to export that data and bring it in, what the exact steps should be.
In piecing together various blogs and articles the consensus appears to be using pg_dump to copy the table to a csv file, then copying it to an S3 bucket, and from there use the Redshift COPY command to bring it in to a new table-- that's my high level understanding, but am not sure what the command line switches should be, or the actual details. Is anyone doing this currently and if so, is what I have above the 'recommended' way to do a one-off import into Redshift?
It appears that you want to:
Export from Amazon RDS PostgreSQL
Import into Amazon Redshift
From Exporting data from an RDS for PostgreSQL DB instance to Amazon S3 - Amazon Relational Database Service:
You can query data from an RDS for PostgreSQL DB instance and export it directly into files stored in an Amazon S3 bucket. To do this, you use the aws_s3 PostgreSQL extension that Amazon RDS provides.
This will save a CSV file into Amazon S3.
You can then use the Amazon Redshift COPY command to load this CSV file into an existing Redshift table.
You will need some way to orchestrate these operations, which would involve running a command against the RDS database, waiting for it to finish, then running a command in the Redshift database. This could be done via a Python script that connects to each database (eg via psycopg2) in turn and runs the command.

Backing up redshift database

I want to backup entire redshift cluster, such that I can use it in other databases like mysql or hadoop in future.
I was looking up and creating a manual screenshot seems to be an option but I guess that wont work for cross database languages.
So what would be the detailed steps to backup the entire cluster of redshift
Cluster backups can be done via the aws console, however these can only be restored to another redshift cluster.
Because Redshift is not the same as postgres in many ways, it will be inpossible / tricky to use standard tools like pg_dump and pg_restore.
I think that your best option is to :
extract the ddl from the Redshift tables that you wish to create elsewhere, most ide's have a simple way to do this.
modify the ddl to work with your target database (e.g. postgres will
be easy, mysql harder)
copy the contents of the Redshift database, one table at a time to s3 using
the unload command
import the data that you unloaded in step 3 to your target tables