o110.pyWriteDynamicFrame. null - postgresql

I have created a visual job in AWS Glue where I extract data from Snowflake, and my target is a PostgreSQL database in AWS.
I have been able to connect to both Snowflake and PostgreSQL, and I can preview data from both.
I have also been able to get data from Snowflake, write it to S3 as CSV, and then take that CSV and load it into PostgreSQL.
However, when I try to get data from Snowflake and push it directly to PostgreSQL, I get the error below:
o110.pyWriteDynamicFrame. null

So it means that you can get the data from Snowflake into a DataFrame, but the write of that DataFrame to PostgreSQL is failing.
You need to check the AWS Glue logs to get a better understanding of why the write to PostgreSQL is failing.
Please also check that you have the right versions of the JARs needed by PostgreSQL (the JDBC driver), compatible with the Scala/Spark version on the AWS Glue side.
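If the logs do not make the cause obvious, it can also help to reproduce the write outside the visual editor. Below is a minimal sketch of writing a DynamicFrame to PostgreSQL over JDBC from a Glue script; the endpoint, table, and credentials are placeholders, and the small hand-built DataFrame stands in for the Snowflake read.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Stand-in for the DynamicFrame that the visual job reads from Snowflake.
df = spark.createDataFrame([(1, "example")], ["id", "name"])
dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# Write the DynamicFrame to PostgreSQL over JDBC.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="postgresql",
    connection_options={
        "url": "jdbc:postgresql://my-host:5432/mydb",  # placeholder endpoint
        "dbtable": "public.target_table",              # placeholder table
        "user": "my_user",                             # placeholder credentials
        "password": "my_password",
    },
)
If this standalone write succeeds, the problem is more likely in the connection or driver configuration of the visual job's PostgreSQL target than in the data itself.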

Related

Best practice for importing bulk data to AWS RDS PostgreSQL database

I have a big AWS RDS database that needs to be updated with data on a periodic basis. The data is in JSON files stored in S3 buckets.
This is my current flow:
Download all the JSON files locally
Run a ruby script to parse the JSON files to generate a CSV file matching the table in the database
Connect to RDS using psql
Use \copy command to append the data to the table
I would like to switch this to an automated approach (maybe using an AWS Lambda). What would be the best practices?
Approach 1:
Run a script (Ruby / JS) that parses all folders in the past period (e.g., a week) and, while parsing each file, connects to the RDS DB and executes an INSERT command. I feel this would be a very slow process with constant writes to the database and wouldn't be optimal.
Approach 2:
I already have a Ruby script that parses local files to generate a single CSV. I can modify it to parse the S3 folders directly and create a temporary CSV file in S3. The question is - how do I then use this temporary file to do a bulk import?
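In case it helps frame the question, here is a rough sketch of how I imagine that bulk-import step working from a script or Lambda, assuming psycopg2 and boto3 are available; the bucket, key, table name, and connection details below are placeholders.
import boto3
import psycopg2

s3 = boto3.client("s3")

# Stream the temporary CSV straight from S3 (placeholder bucket/key).
obj = s3.get_object(Bucket="my-bucket", Key="tmp/weekly-import.csv")

conn = psycopg2.connect(
    host="my-rds-endpoint",  # placeholder connection details
    dbname="mydb",
    user="my_user",
    password="my_password",
)
with conn, conn.cursor() as cur:
    # COPY ... FROM STDIN is the server-side counterpart of psql's \copy,
    # so the rows are appended in one bulk operation.
    cur.copy_expert(
        "COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)",
        obj["Body"],
    )
conn.close()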
Are there any other approaches that I have missed and might be better suited for my requirement?
Thanks.

How to read Synapse database tables using PySpark

I am a newbie to Azure Synapse and I have to work in an Azure Spark notebook. One of my colleagues connected the on-premises database using an Azure linked service. I have now written a test framework for comparing the on-premises data with the data lake (curated) data, but I don't understand how to read those tables using PySpark.
Here is my linked service data structure, showing my linked service names and database name (screenshot).
You can read any file stored in a Synapse linked location as a table by using the Azure Synapse Dedicated SQL Pool Connector for Apache Spark.
First you need to read the file that you want to expose as a table in Synapse. Use the code below to read the file:
%%pyspark
df = spark.read.load('abfss://sampleadls2@sampleadls1.dfs.core.windows.net/business.csv', format='csv', header=True)
Then save this DataFrame as a Spark table using the code below:
%%pyspark
spark.sql("CREATE DATABASE IF NOT EXISTS business")
df.write.mode("overwrite").saveAsTable("business.data")
Now you can run any Spark SQL command on this table as shown below:
%%pyspark
data = spark.sql("SELECT * FROM business.data")
display(data)
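If the tables you want to compare already live in a dedicated SQL pool rather than in files, the same connector can read them directly. A minimal sketch, assuming a Synapse Spark 3 pool where the connector exposes synapsesql() on the DataFrame reader; the pool, schema, and table names are placeholders.
%%pyspark
# Read a dedicated SQL pool table through the Dedicated SQL Pool Connector
# (placeholder three-part name: <pool database>.<schema>.<table>).
onprem_copy = spark.read.synapsesql("mysqlpool.dbo.customer")

# The curated side can then be read as the Spark table created above.
curated = spark.sql("SELECT * FROM business.data")
display(onprem_copy)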

Using AWS Glue Python jobs to run ETL on Redshift

We have a setup to sync RDS Postgres changes into S3 using DMS. Now I want to run ETL on this S3 data (in Parquet) using Glue as the scheduler.
My plan is to build SQL queries to do the transformation, execute them on Redshift Spectrum, and unload the data back into S3 in Parquet format. I don't want to use Glue Spark, as my data loads do not require that kind of capacity.
However, I am facing some problems connecting to Redshift from Glue, primarily library version issues and finding the right .whl files to use for pg8000/psycopg2. I'm wondering if anyone has experience with such an implementation and how you were able to manage the DB connections from a Glue Python Shell job.
I'm doing something similar in a Python Shell job, but with Postgres instead of Redshift.
This is the .whl file I use:
psycopg_binary-2.9.2-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
An updated version can be found here.
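Once the .whl is attached to the job (for a Python Shell job, via the Python library path / --extra-py-files setting), the connection itself is just a regular psycopg2 connection. Here is a minimal sketch for the Redshift Spectrum + UNLOAD flow described in the question; the cluster endpoint, credentials, schema, bucket, and IAM role are all placeholders.
import psycopg2

# Placeholder connection details; in practice these would come from a Glue
# connection or Secrets Manager rather than being hard-coded.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="etl_user",
    password="etl_password",
)
conn.autocommit = True

# Transform via Redshift Spectrum and write the result back to S3 as Parquet.
transform_and_unload = """
    UNLOAD ('SELECT col_a, col_b FROM spectrum_schema.raw_table')
    TO 's3://my-bucket/curated/raw_table/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
    FORMAT AS PARQUET
"""

with conn.cursor() as cur:
    cur.execute(transform_and_unload)
conn.close()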

Is it possible to connect to a Postgres DB from SageMaker Data Wrangler?

I set up a regular Postgres DB in AWS using the Amazon Relational Database Service (RDS). I would like to ingest this data using Data Wrangler for inspection and further processing.
Is this possible? I only see S3, Athena, Redshift and Snowflake as the data ingestion options. Does this mean I must move the data from PostgreSQL to one of these four options before I can use Data Wrangler?
If it's not possible through Data Wrangler, can I connect to my Postgres database from a Jupyter notebook, using a connection string or some option like that? I'm looking to use the data for the SageMaker Feature Store.
Using Amazon RDS with Data Wrangler is not possible (yet).
That said, you can connect to Postgres in Jupyter using Python and then use the Feature Store API to ingest data and use it for subsequent tasks.
Source: AWS employee
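For the notebook route, a minimal sketch might look like the following, assuming psycopg2, pandas, and the SageMaker Python SDK are installed and a feature group with a matching schema already exists; the connection details and feature group name are placeholders.
import pandas as pd
import psycopg2
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

# Read the source table from RDS Postgres (placeholder connection string).
conn = psycopg2.connect(
    "host=my-rds-endpoint dbname=mydb user=my_user password=my_password"
)
df = pd.read_sql("SELECT * FROM my_source_table", conn)
conn.close()

# Ingest the DataFrame into an existing feature group (placeholder name).
session = sagemaker.Session()
feature_group = FeatureGroup(name="my-feature-group", sagemaker_session=session)
feature_group.ingest(data_frame=df, max_workers=3, wait=True)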

Linking Access tables into a PostgreSQL Database using a foreign data wrapper

I'm new to Postgres, so this problem is probably a relatively easy one for someone else. However, I have spent many frustrating hours trying to figure out the solution. I have an Access database of metadata that must be kept updated for sending records to other groups. I also have a database using PostgreSQL and pgAdmin which has these same metadata tables. Currently the tables in the Postgres database get updated manually by exporting the Access tables as Excel files and then importing them into the SQL tables. It's not the most efficient process, and it could lead to errors in the SQL database if someone forgets to check that they are using the most recent data from Access before running any queries. So I would like to integrate some of the tables from my Access database with my Postgres database.
Originally I tried just installing drivers to export the Access tables directly to Postgres, which worked, but not in the way that I wanted, since it just brings in a static table that I would still need to update manually. From my understanding, I can instead create a server connection from Postgres to Access, which would then bring in updated data using a foreign data wrapper.
I tried to use ogr_fdw.
CREATE EXTENSION ogr_fdw;
When I try:
CREATE SERVER metadata
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
datasource 'H:\Databases\20170712.accdb',
format 'ODBC' );
I receive: ERROR: unable to connect to data source "H:\Databases\20170712.accdb"
SQL state: HV00D
When I try:
CREATE SERVER metadata
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
datasource 'H:\Databases\20170712.accdb',
format 'ACCDB' );
I receive: ERROR: unable to find format "ADDCB"
HINT: See the formats list at http://www.gdal.org/ogr_formats.html.
I also tried MDB and received the same error. MDB is the format name given on that website, but it says it needs a JDK/JRE to compile, and I'm not really sure whether that's another type of driver I would need or what it is.
When I try:
CREATE SERVER metadata
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
datasource 'H:\Databases\20170712.mdb',
format 'ODBC' );
I receive: ERROR: unable to connect to data source "H:\Databases\20170712.mdb"
SQL state: HV00D
Hint: Unable to initialize ODBC connection to DSN for DRIVER=Microsoft Access Driver (*.mdb);DBQ=H:\Databases\20170712.mdb,
[Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified
However, after looking at the GitHub FAQ for ogr_fdw (https://github.com/pramsey/pgsql-ogr-fdw/blob/master/FAQ.md), I thought it didn't need ODBC or special drivers.
A lot of this is probably due to my limited knowledge of the terminology as I read through this material. Also, my Access database is an .accdb file, but since that wasn't working I also tried MDB and ODBC as the "format". If anyone has any suggestions I would greatly appreciate it.
Thanks!