Creating an external Redshift table which maps to multiple Glue tables? - amazon-redshift

Context: I’m trying to create a single Redshift external table which maps to multiple Glue tables (each Glue table points to a separate S3 bucket and is populated by a crawler).
Question: is there a way I can create a single external table in Redshift which will combine ("group by") the multiple Glue tables?

Related

How to read the schema from a config table and attach it to a PySpark DataFrame?

I have 50 tables on my on-premises server. I want to migrate those 50 tables from on-premises to Delta tables in Databricks. Every table has its own schema defined, but I need to design a single ADF pipeline to move all fifty tables from on-premises to Delta tables.
How do I attach the schema to the DataFrame at run time based on the table name?
I would use mapping data flows for this scenario:
Create a static table/file with the list of your tables.
Add a ForEach loop in the ADF pipeline.
Within the ForEach, create a mapping data flow.
As the source, provide your on-prem database (the schema will be detected automatically).
Use a Delta table as the destination (sink).
Mapping data flows are Spark-based, so you can see in the "Projection" tab that the schema has already been translated to Spark types.
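
If you prefer to attach the schema in code rather than rely on the data flow's automatic detection, here is a minimal PySpark sketch, assuming a config table with (table_name, column_name, data_type) rows; the table, column, and path names below are hypothetical placeholders, not part of the original answer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def schema_for(table_name):
    # Look up this table's column definitions in the (hypothetical) config
    # table and assemble a DDL schema string such as "id INT, name STRING".
    rows = (spark.table("config.table_schemas")
            .where(f"table_name = '{table_name}'")
            .collect())
    return ", ".join(f"{r['column_name']} {r['data_type']}" for r in rows)

# Apply the schema at run time, keyed on the table name passed in by the loop.
df = spark.read.schema(schema_for("customers")).parquet("/landing/customers/")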

Accessing Aurora Postgres Materialized Views from Glue data catalog for Glue Jobs

I have an Aurora Serverless instance which has data loaded across 3 tables (mixture of standard and jsonb data types). We currently use traditional views where some of the deeply nested elements are surfaced along with other columns for aggregations and such.
We have two materialized views that we'd like to send to Redshift. Both the Aurora Postgres instance and Redshift are in the Glue Catalog, and while I can see the Postgres views as selectable tables, the crawler does not pick up the materialized views.
We are currently exploring two options to get the data to Redshift:
Output to Parquet and use COPY to load.
Point the materialized view to a JDBC sink, specifying Redshift.
I wanted recommendations on the most efficient approach, if anyone has done a similar use case.
Questions:
In option 1, would I be able to handle incremental loads?
Is bookmarking supported for JDBC (Aurora Postgres) to JDBC (Redshift) transfers, even if through Glue?
Is there a better way (other than the options I am considering) to move the data from Aurora Postgres Serverless (10.14) to Redshift?
Thanks in advance for any guidance provided.
I went with option 2. The Redshift COPY/load process writes CSV with a manifest to S3 in any case, so duplicating that step is pointless.
Regarding the Questions:
N/A
Job bookmarking does work. There are some gotchas, though: ensure the Glue Connections to both RDS and Redshift are attached to the Glue PySpark job, that the self-referencing security group rules are in place, and that you identify a row that is unique [I chose the primary key of the underlying table as an additional column in my materialized view] to use as the bookmark.
Using the primary key of the core table may buy efficiencies in pruning materialized views during maintenance cycles. Just retrieve the latest bookmark from the CLI using aws glue get-job-bookmark --job-name yourjobname and then use that in the WHERE clause of the MV, as in where id >= idinbookmark.
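
For reference, the same lookup can be done programmatically; a minimal boto3 sketch (the JobBookmark payload is a JSON string whose exact shape varies by source type, so the parsing here is illustrative only):

import json
import boto3

glue = boto3.client("glue")
entry = glue.get_job_bookmark(JobName="yourjobname")["JobBookmarkEntry"]
state = json.loads(entry["JobBookmark"])  # per-source bookmark state (shape varies)
print(state)  # find your bookmark key's last value, then use it in the MV's WHERE clause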
# Pull connection details from the Glue Catalog connection, then read the
# source with job bookmarks keyed on a unique column.
conn = glueContext.extract_jdbc_conf("yourGlueCatalogdBConnection")
connection_options_source = {
    "url": conn["url"] + "/yourdB",
    "dbtable": "table in dB",
    "user": conn["user"],
    "password": conn["password"],
    "jobBookmarkKeys": ["unique identifier from source table"],
    "jobBookmarkKeysSortOrder": "asc",
}
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options=connection_options_source,
    transformation_ctx="datasource0",
)
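
And, for completeness, a hedged sketch of the Redshift sink side plus the bookmark commit; the connection name, target table, and temp dir below are placeholders I've added, not from the original answer:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # bookmarks only advance when the job is initialized

# ... build datasource0 as in the snippet above ...

glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=datasource0,
    catalog_connection="yourRedshiftConnection",  # Glue connection to Redshift
    connection_options={"dbtable": "target_table", "database": "yourdB"},
    redshift_tmp_dir="s3://your-temp-bucket/tmp/",  # staging area for COPY
    transformation_ctx="sink0",
)
job.commit()  # persists the new bookmark value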
That's all, folks

Unable to write data from two separate DB2 tables into an Elastic index

I have created two pipelines for two tables in Kubernetes. Each pipeline corresponds to one table in DB2. They both show up under the monitoring tab in Kibana, but the index is populated only with the data from one of the tables.
Any idea?

Unable to access AWS Athena table from AWS Redshift

I am trying to access an existing AWS Athena table from AWS Redshift.
I tried creating an external schema (pointing to the AWS Athena DB) in the AWS Redshift console. It creates the external schema successfully, but it doesn't display tables from the Athena DB. Below is the code used.
CREATE EXTERNAL SCHEMA Ext_schema_1
FROM DATA CATALOG
DATABASE 'sample_poc'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::55276673986:role/sample_Redshift_Role';
A few observations:
Even if I specify a non-existent Athena DB name, it still creates the external schema in Redshift.
My Redshift role has full access to S3 & Athena.
AWS Glue Catalog contains databases, which contain tables. There are no schemas from the perspective of Athena or Glue Catalog.
In Redshift Spectrum, you create an EXTERNAL SCHEMA which is really a placeholder object, a pointer within Redshift to the Glue Catalog.
Even if I specify a non-existent Athena DB name, it still creates the external schema in Redshift.
The creation of the object is lazy as you have discovered, which is useful if the IAM Role needs adjusting. Note the example in the docs has an additional clause:
create external database if not exists
So your full statement would need to be this if you wanted the database to be created as well.
CREATE EXTERNAL SCHEMA Ext_schema_1
FROM DATA CATALOG
DATABASE 'sample_poc'
REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::55276673986:role/sample_Redshift_Role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
it doesn't display tables from Athena DB
If you are creating an EXTERNAL SCHEMA pointing to a non-existent database, then there will be nothing to display. I assume your point 1 is unrelated to the real attempt you made to create the external schema, i.e. that you pointed it at an existing database with tables.
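
A quick way to confirm from the Glue side that the database exists and has tables; a minimal boto3 sketch (assuming the database name from the question):

import boto3

glue = boto3.client("glue", region_name="us-east-1")
# Raises EntityNotFoundException if the database doesn't exist.
for t in glue.get_tables(DatabaseName="sample_poc")["TableList"]:
    print(t["Name"])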
I have found that tables created using Redshift Spectrum DDL are immediately available to Athena via the Glue Catalog.
I have also tried specifying tables in Glue Catalog, and alternatively using the Crawler, and in both cases those tables are visible in Redshift.
What tool are you using to attempt to display the tables? Do you mean the tables don't list in the metadata views, or that the contents of the tables don't display?
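
If the question is whether the tables are registered at all, you can check Redshift's metadata views directly; a hedged sketch using the Redshift Data API (cluster, database, and user below are placeholders):

import boto3

rsd = boto3.client("redshift-data")
resp = rsd.execute_statement(
    ClusterIdentifier="my-cluster",  # placeholder
    Database="dev",                  # placeholder
    DbUser="awsuser",                # placeholder
    Sql="SELECT schemaname, tablename FROM svv_external_tables "
        "WHERE schemaname = 'ext_schema_1';",
)
print(resp["Id"])  # poll rsd.get_statement_result(Id=resp["Id"]) for the rows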
Redshift does appear to have some differences in the data types that are allowed, and the Hive DDL required in Athena can differ from the Redshift Spectrum DDL. Spectrum has some nesting limitations.
My Redshift role has full access to S3 & Athena
Assuming you are using the Glue Catalog and not the old Athena-managed catalog, your role doesn't need any Athena access.

Query Athena (Add Partition) using AWS Glue Scala

Is it possible to run an ALTER TABLE ADD PARTITION query against an existing table in Athena from a Glue script using Scala?
Are there libraries for this extending from aws.athena.connections, or
can Spark be used to add the partition to the Athena table via the Glue Data Catalog?
You can make a call to Athena with an ALTER TABLE ADD PARTITION query, or add the partition via the Glue API. The AWS SDK should already be provided for you by Glue, so you can use the appropriate Athena or Glue client classes.
If your jobs are running in a custom VPC, please make sure there is access to the AWS services.
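
The question asks for Scala; in a Scala Glue job you would use the equivalent AWS Java SDK clients, but as a hedged illustration of both routes the answer mentions, here is a Python/boto3 sketch (database, table, bucket, and partition values are placeholders):

import boto3

# Route 1: run the DDL through Athena.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=(
        "ALTER TABLE mytable ADD IF NOT EXISTS "
        "PARTITION (dt='2021-01-01') "
        "LOCATION 's3://my-bucket/mytable/dt=2021-01-01/'"
    ),
    QueryExecutionContext={"Database": "mydb"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# Route 2: register the partition directly in the Glue Data Catalog.
glue = boto3.client("glue")
table = glue.get_table(DatabaseName="mydb", Name="mytable")["Table"]
sd = dict(table["StorageDescriptor"])  # inherit the table's format/SerDe
sd["Location"] = "s3://my-bucket/mytable/dt=2021-01-01/"
glue.create_partition(
    DatabaseName="mydb",
    TableName="mytable",
    PartitionInput={"Values": ["2021-01-01"], "StorageDescriptor": sd},
)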