AWS Glue View like Hive View - aws-glue-data-catalog

In Hive we all know we can define a non-materialized View. Easy.
My question is whether a similar construct can be created in the AWS Glue Data Catalog. I see materialized views, but that is not what I mean.
I think it is not possible, but I may be missing something. Thanks in advance.

You can use Athena for this. Athena is a serverless SQL query engine (Presto) that can query your Data Catalog.
Athena provides a view feature that lets you create non-materialized views on top of your Glue Data Catalog tables.
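As a rough illustration of the answer above, here is a minimal sketch of creating an Athena view from Python. The helper names (build_create_view_sql, create_view), the view name and the S3 output location are hypothetical; the boto3 call used is start_query_execution on the Athena client, and it needs AWS credentials to actually run.

```python
import re

# Conservative identifier check; an assumption of this sketch, not an Athena rule.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_create_view_sql(view_name, select_sql):
    """Return a CREATE OR REPLACE VIEW statement for Athena."""
    if not _IDENT.match(view_name):
        raise ValueError(f"unsafe view name: {view_name!r}")
    # Drop surrounding whitespace and a trailing semicolon from the SELECT.
    query = select_sql.strip().rstrip(";")
    return f"CREATE OR REPLACE VIEW {view_name} AS {query}"


def create_view(database, view_name, select_sql, output_location):
    """Submit the DDL to Athena; the view is stored in the Glue Data Catalog."""
    import boto3  # imported lazily so the SQL builder works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=build_create_view_sql(view_name, select_sql),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The view created this way is non-materialized: Athena stores only the definition and runs the underlying SELECT each time the view is queried.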

Related

Refresh redshift materialized view using glue jdbc connection

I have a Glue connection that connects to Redshift using JDBC.
Redshift has Spectrum, which automatically fetches the latest data from the S3 location.
I need to create a view from that Spectrum data while using the Glue connection.
I tried the code below, where I added a SQL query in postactions to refresh the Redshift view:
datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=inputGDF_final,
    catalog_connection=f"{srvc_user}",
    connection_options={
        "preactions": pre_query,
        "dbtable": f"{table_dbname}.{table}",
        "database": f"{database}",
        "postactions": post_query,
    },
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink1",
)
The issue with the above code is that it creates a table in Redshift and then uses that table to create the view. I need to use Redshift Spectrum, not a table.
Can anyone please help?

Is there a way to use Spark SQL to query partition information in AWS Glue Data Catalog (similar to in Athena)?

I'm currently developing a Glue ETL script in PySpark that needs to query my Glue Data Catalog's partitions and join that information with other Glue tables programmatically.
At the moment, I'm able to do this with Athena using SELECT * FROM db_name.table_name$partitions JOIN table_name2 ON ..., but it looks like this doesn't work with Spark SQL. The closest thing I've been able to find is SHOW PARTITIONS db_name.table_name, which doesn't seem to cut it.
Does anyone know an easy way I can leverage Glue ETL / Boto3 (Glue API) / PySpark to query my partition information in a SQL-like manner?
For the time being, the only possible workaround seems like the get_partitions() method in Boto3, but this looks like a lot more complex work to deal with from my end. I already have my Athena queries to get the information I need, so if there's ideally a way to replicate getting my tables' partitions in a similar way using SQL, that'd be amazing. Please let me know, thank you!
For those interested, an alternative workaround I've been able to find but still need to test out is the Athena API with the Boto3 client. I may also possibly use the AWS Wrangler integrated with Athena to retrieve a dataframe.
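For the get_partitions() route mentioned in the question, a minimal sketch of flattening the Glue API response into plain rows might look like this. The helper names (partitions_to_rows, fetch_partition_rows) are hypothetical; the boto3 calls used are get_table and the get_partitions paginator on the Glue client, and they need AWS credentials to run.

```python
def partitions_to_rows(partition_keys, partitions):
    """Turn Glue partition entries into one dict per partition.

    partition_keys: partition column names, in table order.
    partitions: items from the "Partitions" list of a get_partitions response.
    """
    rows = []
    for partition in partitions:
        # "Values" is ordered the same way as the table's partition keys.
        row = dict(zip(partition_keys, partition["Values"]))
        row["location"] = partition.get("StorageDescriptor", {}).get("Location")
        rows.append(row)
    return rows


def fetch_partition_rows(database, table):
    """Fetch all partitions of a Glue table and flatten them into rows."""
    import boto3  # imported lazily so the flattening helper works without boto3

    glue = boto3.client("glue")
    table_def = glue.get_table(DatabaseName=database, Name=table)["Table"]
    keys = [key["Name"] for key in table_def.get("PartitionKeys", [])]

    partitions = []
    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=database, TableName=table):
        partitions.extend(page["Partitions"])
    return partitions_to_rows(keys, partitions)
```

Inside a Glue job, the resulting rows could then be turned into a Spark DataFrame and registered as a temporary view, so that the join with other tables can still be written in Spark SQL.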

Accessing Aurora Postgres Materialized Views from Glue data catalog for Glue Jobs

I have an Aurora Serverless instance which has data loaded across 3 tables (mixture of standard and jsonb data types). We currently use traditional views where some of the deeply nested elements are surfaced along with other columns for aggregations and such.
We have two materialized views that we'd like to send to Redshift. Both the Aurora Postgres and Redshift are in Glue Catalog and while I can see Postgres views as a selectable table, the crawler does not pick up the materialized views.
Currently exploring two options to get the data to Redshift:
1. Output to parquet and use copy to load
2. Point the materialized view to a JDBC sink specifying Redshift.
Wanted recommendations on what might be most efficient approach if anyone has done a similar use case.
Questions:
1. In option 1, would I be able to handle incremental loads?
2. Is bookmarking supported for JDBC (Aurora Postgres) to JDBC (Redshift) transactions even if through Glue?
3. Is there a better way (other than the options I am considering) to move the data from Aurora Postgres Serverless (10.14) to Redshift?
Thanks in advance for any guidance provided.
Went with option 2. The Redshift Copy/Load process writes csv with manifest to S3 in any case so duplicating that is pointless.
Regarding the Questions:
1. N/A
2. Job bookmarking does work. There are some gotchas though: ensure that connections to both RDS and Redshift are attached to the Glue PySpark job, that the IAM self-referencing rules are in place, and that you identify a column that is unique per row to use as the bookmark key [I chose the primary key of the underlying table and added it as a column in my materialized view].
Using the primary key of the core table may buy efficiencies in pruning materialized views during maintenance cycles. Just retrieve the latest bookmark from the CLI using aws glue get-job-bookmark --job-name yourjobname and then use that in the WHERE clause of the MV, as in where id >= idinbookmark
conn = glueContext.extract_jdbc_conf("yourGlueCatalogdBConnection")
connection_options_source = {
    "url": conn['url'] + "/yourdB",
    "dbtable": "table in dB",
    "user": conn['user'],
    "password": conn['password'],
    "jobBookmarkKeys": ["unique identifier from source table"],
    "jobBookmarkKeysSortOrder": "asc",
}
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options=connection_options_source,
    transformation_ctx="datasource0",
)
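The "where id >= idinbookmark" pruning step described above could be sketched roughly as follows. This assumes the bookmark key is an integer column and that you have already read the last bookmarked value yourself (for example from the get-job-bookmark output); the function name and its parameters are hypothetical.

```python
def incremental_where_clause(key_column, last_bookmarked_id):
    """Build the WHERE clause used to prune the materialized view.

    key_column: name of the unique, ascending bookmark column (assumed integer).
    last_bookmarked_id: last key value recorded by the Glue job bookmark.
    """
    if not key_column.isidentifier():
        raise ValueError(f"unsafe column name: {key_column!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(last_bookmarked_id, bool) or not isinstance(last_bookmarked_id, int):
        raise TypeError("last_bookmarked_id must be an integer")
    return f"WHERE {key_column} >= {last_bookmarked_id}"


# Example: keep only rows at or after the last bookmarked primary key.
clause = incremental_where_clause("id", 120045)
```

The clause would be appended to the SELECT that defines the materialized view when it is rebuilt during maintenance.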
That's all, folks

Where does the join get executed when using Custom SQL in Amazon QuickSight?

I am using Custom SQL in Amazon QuickSight to join several tables from Redshift. I wonder where the join happens: does QuickSight send the query to the Redshift cluster and get the results back, or does the join happen in QuickSight? I thought about creating a view in Redshift and selecting data from the view to make sure the join happens in Redshift; however, I have read in a few articles that using views in Redshift is not a good idea.
QuickSight pushes SQL down to the underlying database, e.g. Redshift.
Using custom SQL is the same as using a view inside Redshift from a performance point of view.
In my opinion it is easier to manage as a Redshift view, as you can:
Use QuickSight wizards more effectively
Drop and recreate the view as needed to add new columns
Have visibility into your SQL source code by storing it in a code repo, e.g. git.

PySpark save to Redshift table with "Overwrite" mode results in dropping table?

I am using PySpark in AWS Glue to load data from S3 files into a Redshift table. The code uses mode("Overwrite") and fails with an error stating "can't drop table because other object depend on the table". It turned out there is a view created on top of that table. It seems the "Overwrite" mode actually drops and re-creates the Redshift table and then loads the data. Is there any option that only truncates the table instead of dropping it?
AWS Glue uses the Databricks spark-redshift connector (it's not documented anywhere, but I verified that empirically). The spark-redshift connector's documentation mentions:
Overwriting an existing table: By default, this library uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it.
There is a related discussion in line with your question, where they used truncate instead of overwrite; it also uses a combination of Lambda and Glue. Please refer to it for the detailed discussion and code samples. Hope this helps.
regards
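One way the truncate-instead-of-overwrite idea might look in a Glue job is to keep the write in append mode and run a TRUNCATE through the "preactions" connection option. This is a minimal sketch, not the code from the linked discussion; the helper name and the schema, table and database names are hypothetical.

```python
def redshift_truncate_options(schema, table, database):
    """Build Glue connection_options that empty the target table before loading.

    The table itself is never dropped, so views that depend on it stay valid.
    """
    target = f"{schema}.{table}"
    return {
        "dbtable": target,
        "database": database,
        # Runs on Redshift before the load starts.
        "preactions": f"TRUNCATE TABLE {target};",
    }


options = redshift_truncate_options("public", "sales", "dev")

# Inside a Glue job the options would be passed along these lines
# (glueContext, frame, connection name and args come from the job itself):
#
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=frame,
#     catalog_connection="your-redshift-connection",
#     connection_options=options,
#     redshift_tmp_dir=args["TempDir"],
# )
```

Note that TRUNCATE in Redshift commits the current transaction, so if the load fails afterwards the table is left empty; a DELETE FROM statement in preactions is a slower alternative that avoids the implicit commit.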