Query Athena (Add Partition) using AWS Glue Scala - scala

Is there a way to run an ALTER TABLE ADD PARTITION query against an existing table in Athena from a Glue script written in Scala?
Are there libraries for this (for example, something extending from aws.athena.connections), or can Spark be used to query the Athena table (from the Glue Data Catalog) to add the partition?

You can make a call to Athena with an ALTER TABLE ADD PARTITION query, or add the partition via the Glue API. The AWS SDK is already provided for you by Glue, so you can use the appropriate Athena or Glue client classes.
If your jobs run in a custom VPC, please make sure the VPC has access to those AWS services.
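For illustration, the Athena call could look roughly like the sketch below using boto3 (a Scala Glue job would use the equivalent client classes from the AWS SDK for Java); the database, table, partition value, region, and S3 locations are placeholders:

import boto3

# Athena client; the AWS SDK is already available inside Glue jobs.
athena = boto3.client("athena", region_name="us-east-1")

ddl = """
ALTER TABLE my_table
ADD IF NOT EXISTS PARTITION (dt = '2021-01-01')
LOCATION 's3://my-bucket/my_table/dt=2021-01-01/'
"""

# Athena needs an output location for query results, even for DDL.
response = athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(response["QueryExecutionId"])

Alternatively, the same partition can be registered directly through the Glue API (create_partition or batch_create_partition on the Glue client), which skips Athena entirely.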

Related

Creating an external redshift table which maps to multiple glue tables?

Context: I'm trying to create a single Redshift external table which maps to multiple Glue tables (the Glue tables point to separate S3 buckets and are populated by a crawler).
Question: is there a way I can create a single external table in Redshift which will "group by" the multiple Glue tables?

Accessing Aurora Postgres Materialized Views from Glue data catalog for Glue Jobs

I have an Aurora Serverless instance which has data loaded across 3 tables (mixture of standard and jsonb data types). We currently use traditional views where some of the deeply nested elements are surfaced along with other columns for aggregations and such.
We have two materialized views that we'd like to send to Redshift. Both the Aurora Postgres and Redshift are in Glue Catalog and while I can see Postgres views as a selectable table, the crawler does not pick up the materialized views.
Currently exploring two options to get the data to Redshift:
1. Output to Parquet and use COPY to load.
2. Point the materialized view to a JDBC sink specifying Redshift.
Wanted recommendations on the most efficient approach, if anyone has done a similar use case.
Questions:
1. In option 1, would I be able to handle incremental loads?
2. Is bookmarking supported for JDBC (Aurora Postgres) to JDBC (Redshift) transactions, even if through Glue?
3. Is there a better way (other than the options I am considering) to move the data from Aurora Postgres Serverless (10.14) to Redshift?
Thanks in advance for any guidance provided.
Went with option 2. The Redshift COPY/load process writes CSV with a manifest to S3 in any case, so duplicating that is pointless.
Regarding the questions:
1. N/A.
2. Job bookmarking does work. There are some gotchas, though: ensure that connections to both RDS and Redshift are present in the Glue PySpark job, that the IAM self-referencing rules are in place, and that you identify a column that uniquely identifies a row [I chose the primary key of the underlying table as an additional column in my materialized view] to use as the bookmark key.
Using the primary key of the core table may also buy some efficiency in pruning the materialized views during maintenance cycles. Just retrieve the latest bookmark from the CLI with aws glue get-job-bookmark --job-name yourjobname and then use that in the WHERE clause of the MV, e.g. WHERE id >= idinbookmark (a boto3 sketch of this is shown after the code below).
# Pull the JDBC connection details stored in the Glue Catalog connection.
conn = glueContext.extract_jdbc_conf("yourGlueCatalogdBConnection")

connection_options_source = {
    "url": conn['url'] + "/yourdB",
    "dbtable": "table in dB",
    "user": conn['user'],
    "password": conn['password'],
    # Bookmark on a column that uniquely identifies a row in the source.
    "jobBookmarkKeys": ["unique identifier from source table"],
    "jobBookmarkKeysSortOrder": "asc",
}

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options=connection_options_source,
    transformation_ctx="datasource0",
)
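For the maintenance-cycle pruning mentioned above, here is a small sketch of pulling the bookmark programmatically with boto3 instead of the CLI (the job name is a placeholder, and the exact layout of the stored bookmark payload isn't formally documented, so inspect it before relying on it):

import boto3

glue = boto3.client("glue")

# Same information as `aws glue get-job-bookmark --job-name yourjobname` on the CLI.
entry = glue.get_job_bookmark(JobName="yourjobname")["JobBookmarkEntry"]
print(entry["JobBookmark"])  # inspect the stored key value, then use it in the
                             # MV definition, e.g. WHERE id >= idinbookmark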

Can Pyspark Use JDBC to Pass Alter Table

I would like to pass an alter table command to my PostgreSQL database after I load data from a Databricks notebook using pyspark. I know that I can pass a query using spark.read.jdbc but in this case I would like to add a unique constraint once the data has loaded. The purpose is to speed up the data load process into the db by reducing the time to create the unique index.
Spark is a framework for data processing, so its API is mostly designed around read and write operations on data sources. In your case you have DDL statements to execute, and Spark isn't meant to perform such operations.
A better option is to keep the DDL step separate from the data processing done in Spark SQL; you can add a separate PostgreSQL job to perform such operations.
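A minimal sketch of that idea from the notebook itself, assuming the PostgreSQL JDBC driver is on the cluster classpath and using placeholder host, table, and credential values:

# Run the DDL over a plain JDBC connection after the Spark write has finished.
jdbc_url = "jdbc:postgresql://<jdbcHostname>:5432/<jdbcDatabase>"

driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, "<jdbcUsername>", "<jdbcPassword>")
try:
    stmt = conn.createStatement()
    # Add the unique constraint only once the bulk load is complete.
    stmt.executeUpdate(
        "ALTER TABLE public.my_table ADD CONSTRAINT my_table_id_uq UNIQUE (id)"
    )
    stmt.close()
finally:
    conn.close()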
I was experiencing this exact problem in Redshift. After reviewing the doc on JDBC connections, it looks like you can do something like this:
%sql
ALTER TABLE <jdbcTable> {SOME ACTIONS}
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:<databaseServerType>://<jdbcHostname>:<jdbcPort>",
dbtable "<jdbcDatabase>.atable",
user "<jdbcUsername>",
password "<jdbcPassword>"
)

Unable to access AWS Athena table from AWS Redshift

I am trying to access an existing AWS Athena table from AWS Redshift.
I tried creating an external schema (pointing to the AWS Athena DB) in the AWS Redshift console. It creates the external schema successfully, but it doesn't display the tables from the Athena DB. Below is the code used.
CREATE EXTERNAL SCHEMA Ext_schema_1
FROM DATA CATALOG
DATABASE 'sample_poc'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::55276673986:role/sample_Redshift_Role';
A few observations:
1. Even if I specify a non-existent Athena DB name, it still creates the external schema in Redshift.
2. My Redshift role has full access to S3 & Athena.
AWS Glue Catalog contains databases, which contain tables. There are no schemas from the perspective of Athena or Glue Catalog.
In Redshift Spectrum, you create an EXTERNAL SCHEMA which is really a placeholder object, a pointer within Redshift to the Glue Catalog.
Even if I specify a non-existent Athena DB name, it still creates the external schema in Redshift.
The creation of the object is lazy as you have discovered, which is useful if the IAM Role needs adjusting. Note the example in the docs has an additional clause:
create external database if not exists
So your full statement would need to be this if you wanted the database to be created as well.
CREATE EXTERNAL SCHEMA Ext_schema_1
FROM DATA CATALOG
DATABASE 'sample_poc'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::55276673986:role/sample_Redshift_Role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
it doesn't display tables from Athena DB
If you are creating an EXTERNAL SCHEMA to a non-existent database then there will be nothing to display. I assume your point 1. is unrelated to the real attempt you made to create the external schema; that you pointed it to an existing schema with tables.
I have found that tables created using Redshift Spectrum DDL are immediately available to Athena via the Glue Catalog.
I have also tried specifying tables in Glue Catalog, and alternatively using the Crawler, and in both cases those tables are visible in Redshift.
What tool are you using to attempt to display the tables? Do you mean the tables don't appear in metadata views, or that the contents of the tables don't display?
Redshift does appear to have some differences in the datatypes that are allowed, and the Hive DDL required in Athena can have some differences to the Redshift Spectrum DDL. Spectrum has some nesting limitations.
My Redshift role has full access to S3 & Athena
Assuming you are using Glue Catalog and not the old Athena catalog then your role doesn't need any Athena access.

PySpark save to Redshift table with "Overwrite" mode results in dropping table?

Using PySpark in AWS Glue to load data from S3 files into a Redshift table, the code used mode("overwrite") and got an error stating "can't drop table because other objects depend on the table". It turned out there is a view created on top of that table. It seems the "overwrite" mode actually drops and re-creates the Redshift table and then loads the data; is there any option that only truncates the table instead of dropping it?
AWS Glue uses the Databricks spark-redshift connector (it's not documented anywhere, but I verified it empirically). The spark-redshift connector's documentation mentions:
Overwriting an existing table: By default, this library uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it.
There is a related discussion in line with your question, where truncate is used instead of overwrite via a combination of Lambda and Glue. Please refer to it for the detailed discussion and code samples. Hope this helps.
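If the Redshift target is written with glueContext.write_dynamic_frame.from_jdbc_conf (as Glue PySpark jobs typically do), one way to sketch the truncate approach is to pass the statement through the connector's preactions option; the connection, database, table, and bucket names below are placeholders:

# Truncate the target table before appending, instead of letting "overwrite"
# drop and re-create it. "preactions" runs on Redshift before the data is loaded.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dynamic_frame,                          # DynamicFrame produced earlier in the job
    catalog_connection="my-redshift-connection",
    connection_options={
        "database": "mydb",
        "dbtable": "public.my_table",
        "preactions": "TRUNCATE TABLE public.my_table;",
    },
    redshift_tmp_dir="s3://my-bucket/temp/",      # staging area Glue uses for the COPY
)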