Accessing Aurora Postgres Materialized Views from Glue Data Catalog for Glue Jobs

I have an Aurora Serverless instance which has data loaded across 3 tables (mixture of standard and jsonb data types). We currently use traditional views where some of the deeply nested elements are surfaced along with other columns for aggregations and such.
We have two materialized views that we'd like to send to Redshift. Both the Aurora Postgres database and Redshift are in the Glue Catalog, and while I can see the Postgres views as selectable tables, the crawler does not pick up the materialized views.
Currently exploring two options to get the data to Redshift:
1. Output to Parquet and use COPY to load.
2. Point the materialized view to a JDBC sink specifying Redshift.
Wanted recommendations on what might be the most efficient approach if anyone has done a similar use case.
Questions:
In option 1, would I be able to handle incremental loads?
Is bookmarking supported for JDBC (Aurora Postgres) to JDBC (Redshift) transactions even if through Glue?
Is there a better way (other than the options I am considering) to move the data from Aurora Postgres Serverless (10.14) to Redshift?
Thanks in advance for any guidance provided.

Went with option 2. The Redshift COPY/load process writes CSV with a manifest to S3 in any case, so duplicating that is pointless.
Regarding the Questions:
N/A
Job bookmarking does work. There are some gotchas though: ensure connections to both RDS and Redshift are present in the Glue PySpark job, IAM self-referencing rules are in place, and identify a row that is unique [I chose the primary key of the underlying table as an additional column in my materialized view] to use as the bookmark.
Using the primary key of the core table may buy efficiencies in pruning materialized views during maintenance cycles. Just retrieve the latest bookmark from the CLI using aws glue get-job-bookmark --job-name yourjobname and then use that in the WHERE clause of the MV, as in where id >= idinbookmark
conn = glueContext.extract_jdbc_conf("yourGlueCatalogdBConnection")

# Bookmark on a unique column so only rows past the last bookmark are read
connection_options_source = {
    "url": conn['url'] + "/yourdB",
    "dbtable": "table in dB",
    "user": conn['user'],
    "password": conn['password'],
    "jobBookmarkKeys": ["unique identifier from source table"],
    "jobBookmarkKeysSortOrder": "asc",
}

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options=connection_options_source,
    transformation_ctx="datasource0",
)
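A sketch of the matching Redshift sink side, under the assumption that the job writes through a Glue catalog connection; "yourRedshiftConnection", the table, database, and bucket names are placeholders, not from the original job.

```python
# Sink options for the Redshift write; names below are placeholders.
connection_options_sink = {
    "dbtable": "your_target_table",
    "database": "yourdB",
}

# Inside the Glue job this would be followed by:
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=datasource0,
#     catalog_connection="yourRedshiftConnection",
#     connection_options=connection_options_sink,
#     redshift_tmp_dir="s3://your-temp-bucket/tmp/",
#     transformation_ctx="datasink0",
# )
```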
That's all, folks

Related

CDC Migration from AWS RDS to AWS Redshift

How can I migrate my whole database, currently in AWS RDS Postgres, to AWS Redshift, and also keep the two DBs in sync? If any column is updated in RDS, it must get updated in Redshift as well.
I know we can achieve it with AWS Glue, but the above scenario is mandatory in my case. The migration task is easy to do, but the CDC migration is a bit challenging. I am also aware of the bookmark key, but my situation is a bit different: I do not have any sequential column in the tables. They all have an updated_at field, however, so that column is the only field I can use to check whether a record has already been processed (so that duplicate processing does not occur) and to ensure that newly inserted data also gets replicated to Redshift.
So, would anyone help me out to do this, even by using a PySpark script?
Thanks.
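A sketch under the assumption that Glue job bookmarks accept a timestamp column as the bookmark key, so updated_at can drive incremental loads without a sequential id; all connection and table names below are placeholders.

```python
# Source options bookmarking on updated_at instead of a sequential id.
# Endpoint, database, table, and credential values are placeholders.
connection_options_source = {
    "url": "jdbc:postgresql://your-rds-endpoint:5432/yourdb",
    "dbtable": "your_table",
    "user": "your_user",
    "password": "your_password",
    "jobBookmarkKeys": ["updated_at"],
    "jobBookmarkKeysSortOrder": "asc",
}

# In the Glue job (job bookmarks must also be enabled on the job itself):
# datasource = glueContext.create_dynamic_frame.from_options(
#     connection_type="postgresql",
#     connection_options=connection_options_source,
#     transformation_ctx="datasource_updated_at",
# )
```

Note that a timestamp bookmark catches inserts and updates, but not deletes; rows deleted in RDS would need separate handling.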

Source of data in Redshift tables

I am looking to find the data source of a couple of tables in Redshift. I have gone through all the stored procedures in the Redshift instance and couldn't find any that populate these tables. I have also checked the Database Migration Service and didn't see these tables being migrated from an RDS instance. However, the tables are updated regularly each day.
What would be the way to find how data is populated in those 2 tables? Are there any logs or system tables I can look into?
One place I'd look is svl_statementtext. That will pull any queries and utility queries that may be inserting or running COPY jobs against that table. Just use WHERE text LIKE '%yourtablenamehere%' and see what comes back.
https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html
Also check scheduled queries in the Redshift UI console.
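A minimal way to build that lookup query as a string (the table name is a placeholder). One caveat worth knowing: svl_statementtext stores statement text in 200-character chunks, so a long statement spans several rows; ordering by xid and sequence lets you reassemble it.

```python
# Build the svl_statementtext search query; table name is a placeholder.
table = "yourtablenamehere"
sql = (
    "SELECT starttime, xid, sequence, text "
    "FROM svl_statementtext "
    f"WHERE text ILIKE '%{table}%' "
    "ORDER BY starttime DESC, xid, sequence;"
)
```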

Where the join gets executed when using Custom SQL in Amazon QuickSight?

I am using Custom SQL in Amazon QuickSight to join several tables from Redshift. I wonder where the join happens: does QuickSight send the query to the Redshift cluster and get the results back, or does the join happen in QuickSight? I thought to create a view in Redshift and select data from the view to make sure the join happens in Redshift; however, I read in a few articles that using views in Redshift is not a good idea.
QuickSight pushes the SQL down to the underlying database, e.g. Redshift.
From a performance point of view, using custom SQL is the same as using a view inside Redshift.
In my opinion it is easier to manage as a Redshift view as you can:
Use QuickSight wizards more effectively
Drop and recreate the view as needed to add new columns
Have visibility into your SQL source code by storing it in a code repo, e.g. git.
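A sketch of the view-based approach with hypothetical schema, table, and column names: keep the join DDL in version control and create the view from it, then point QuickSight at the view.

```python
# Hypothetical DDL moving the join into a Redshift view; QuickSight would
# then select straight from reporting.orders_joined instead of custom SQL.
view_ddl = """
CREATE OR REPLACE VIEW reporting.orders_joined AS
SELECT o.order_id, o.order_date, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;
"""
```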

PySpark save to Redshift table with "Overwrite" mode results in dropping table?

Using PySpark in AWS Glue to load data from S3 files to a Redshift table, the code used mode("overwrite") and got an error stating "can't drop table because other objects depend on the table". It turned out there is a view created on top of that table. It seems the "overwrite" mode actually drops and re-creates the Redshift table and then loads the data. Is there any option that only truncates the table instead of dropping it?
AWS Glue uses the Databricks spark-redshift connector (it's not documented anywhere, but I verified that empirically). The spark-redshift connector's documentation mentions:
Overwriting an existing table: By default, this library uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it.
There is a related discussion in line with your question, where they used truncate instead of overwrite; it's a combination of Lambda and Glue. Please refer there for the detailed discussion and code samples. Hope this helps.
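One way to keep the table in place within Glue itself, sketched under the assumption that the job writes through a Glue catalog connection: the Redshift writer accepts a "preactions" SQL string in connection_options, so you can truncate and then append rather than let Spark drop and recreate. All names here are placeholders.

```python
# Truncate-then-append instead of Spark's drop/recreate "overwrite":
# TRUNCATE runs as a preaction, then the frame is loaded in append fashion.
connection_options = {
    "dbtable": "your_table",
    "database": "yourdb",
    "preactions": "TRUNCATE TABLE your_table;",
}

# In the Glue job:
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=frame,
#     catalog_connection="yourRedshiftConnection",
#     connection_options=connection_options,
#     redshift_tmp_dir="s3://your-temp-bucket/tmp/",
# )
```

Because the table is truncated rather than dropped, dependent views survive the load.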

Replicate views from Postgres to Postgres on AWS DMS

I'm using AWS DMS to migrate data from one Postgres database to another Postgres database. Everything works fine, except one thing: the views are not replicated on my target database.
I've read that this cannot be done between heterogeneous databases (i.e. from Oracle to Postgres) using DMS, but I imagine that this is possible somehow when we're using the same database.
Does someone know how to replicate the views using AWS DMS from Postgres to Postgres?
DMS is a data migration service. A view is a virtual table (represented by SQL code/an object) and it does not contain any data by itself like a table does.
If you aren't looking to migrate the actual DDL but instead are happy to replicate the data from views in the source to tables in the target (akin to materialising the views in the target), then you can do this, but you need to jump into the JSON configuration and there are some caveats.
Per the Selection rules and actions portion of the DMS documentation, you can add to the object-locator object a table-type field, setting its value to one of view or all to migrate views. It defaults to table which skips views, and there's no configuration field for it in the console aside from modifying the JSON.
For example:
{
  "rule-type": "selection",
  "rule-id": "123456",
  "rule-name": "some-rule",
  "object-locator": {
    "schema-name": "my_schema",
    "table-name": "%",
    "table-type": "view"
  },
  "rule-action": "include",
  "filters": []
}
There are a couple of caveats though:
Not all sources support migrating views. Aurora PostgreSQL doesn't, for instance, and saving the task will give an error if you try, so you may need to replace the endpoint with a vanilla PostgreSQL one.
If you're migrating views, your only option is a full load. CDC isn't going to work, so partial migrations aren't on the cards.
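The selection rule above can also be generated programmatically, which helps when driving DMS from boto3: create_replication_task takes TableMappings as a JSON string. A sketch, with the rule id/name as placeholders:

```python
import json

# Build the table-mappings JSON that includes views (table-type "view"
# departs from the console default of "table"). Ids/names are placeholders.
rule = {
    "rule-type": "selection",
    "rule-id": "123456",
    "rule-name": "some-rule",
    "object-locator": {
        "schema-name": "my_schema",
        "table-name": "%",
        "table-type": "view",
    },
    "rule-action": "include",
    "filters": [],
}
table_mappings = json.dumps({"rules": [rule]})
# table_mappings is then passed as the TableMappings argument to
# boto3's dms client create_replication_task call.
```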