Error when trying to access a late-binding view in a cross-database query in Redshift - amazon-redshift

So far I've been able to run cross-database queries between two databases whose data is stored inside Redshift (using ra3 node types). But when I query a database that has an external schema/table from Redshift Spectrum, I get an error that the relation doesn't exist (my guess is that it is unreachable). I also tried creating a late-binding view over the external schema/table and got the following error:
ERROR: Localized temp table cannot be created while getting lbv definition
Creating a materialized view for the external table instead does work, but we would rather avoid it because of the operational maintenance of those views.
I want to run a cross-database query that combines data from Redshift Spectrum and Redshift.
Could anyone help me?
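For reference, a minimal sketch of the kind of setup that triggers the error; the schema, table, Glue database, and IAM role names below are placeholders, not the actual objects from the question:

-- In database dev: external schema backed by Spectrum
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG DATABASE 'glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role';

-- Late-binding view over the external table
CREATE VIEW public.spectrum_lbv AS
SELECT * FROM spectrum_schema.my_external_table
WITH NO SCHEMA BINDING;

-- Cross-database query from another database on the same cluster;
-- this is the step that fails with the "Localized temp table" error
SELECT * FROM dev.public.spectrum_lbv LIMIT 10;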

Related

Refresh Redshift materialized view using Glue JDBC connection

I have a Glue connection which connects to Redshift using JDBC.
The Redshift cluster has a Spectrum external schema which automatically picks up the latest data from an S3 location.
I need to create a view from that Spectrum table while using the Glue connection.
I have tried the code below, where I added a SQL query in post_actions to refresh the Redshift view:
datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=inputGDF_final,
    catalog_connection=f"{srvc_user}",
    connection_options={
        "preactions": pre_query,
        "dbtable": f"{table_dbname}.{table}",
        "database": f"{database}",
        "postactions": post_query,
    },
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink1",
)
But the issue with the above piece of code is that it creates a table in Redshift and then uses that table to create the view. I need the view to use the Redshift Spectrum (external) table, not a loaded table.
Can anyone please help?
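For what it's worth, a hedged sketch of what the post_query statement could contain if the goal is a late-binding view defined directly over the Spectrum (external) table rather than over the staged Redshift table; the schema and table names here are placeholders:

-- Passed in as the post_query / post_actions string
CREATE OR REPLACE VIEW reporting.latest_events AS
SELECT *
FROM spectrum_schema.events_external
WITH NO SCHEMA BINDING;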

Accessing Aurora Postgres Materialized Views from Glue data catalog for Glue Jobs

I have an Aurora Serverless instance which has data loaded across 3 tables (a mixture of standard and jsonb data types). We currently use traditional views where some of the deeply nested elements are surfaced along with other columns for aggregations and such.
We have two materialized views that we'd like to send to Redshift. Both the Aurora Postgres and Redshift databases are in the Glue Catalog, and while I can see the Postgres views as selectable tables, the crawler does not pick up the materialized views.
Currently exploring two options to get the data to Redshift:
Output to Parquet and use COPY to load
Point the materialized view to a JDBC sink specifying Redshift.
Wanted recommendations on what might be most efficient approach if anyone has done a similar use case.
Questions:
In option 1, would I be able to handle incremental loads?
Is bookmarking supported for JDBC (Aurora Postgres) to JDBC (Redshift) transactions, even if through Glue?
Is there a better way (other than the options I am considering) to move the data from Aurora Postgres Serverless (10.14) to Redshift?
Thanks in advance for any guidance provided.
Went with option 2. The Redshift COPY/load process writes CSV with a manifest to S3 in any case, so duplicating that is pointless.
Regarding the Questions:
N/A
Job bookmarking does work. There are some gotchas though: ensure connections to both RDS and Redshift are present in the Glue PySpark job, that IAM self-referencing rules are in place, and that you identify a unique row [I chose the primary key of the underlying table as an additional column in my materialized view] to use as the bookmark.
Using the primary key of the core table may buy efficiencies in pruning the materialized views during maintenance cycles. Just retrieve the latest bookmark from the CLI using aws glue get-job-bookmark --job-name yourjobname and then use that in the WHERE clause of the materialized view, as in where id >= idinbookmark.
conn = glueContext.extract_jdbc_conf("yourGlueCatalogdBConnection")

# Bookmark on a unique column from the source table so only new rows are read each run.
connection_options_source = {
    "url": conn['url'] + "/yourdB",
    "dbtable": "table in dB",
    "user": conn['user'],
    "password": conn['password'],
    "jobBookmarkKeys": ["unique identifier from source table"],
    "jobBookmarkKeysSortOrder": "asc",
}

datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options=connection_options_source,
    transformation_ctx="datasource0",
)
That's all, folks

Source of data in Redshift tables

I am looking to find the data source of a couple of tables in Redshift. I have gone through all the stored procedures in the Redshift instance and couldn't find any stored procedure which populates these tables. I have also checked the Database Migration Service and didn't see these tables being migrated from an RDS instance. However, the tables are updated regularly each day.
What would be the way to find how data is populated in those 2 tables? Are there any logs or system tables I can look into?
One place I'd look is svl_statementtext. That will pull any queries and utility statements that may be inserting into or running COPY jobs against that table. Just use WHERE text LIKE '%yourtablenamehere%' and see what comes back.
https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html
Also check scheduled queries in the Redshift UI console.
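A minimal sketch of that SVL_STATEMENTTEXT lookup, with the table name as a placeholder to replace:

-- Find recent statements that reference the table
SELECT starttime, xid, type, text
FROM svl_statementtext
WHERE text ILIKE '%yourtablenamehere%'
ORDER BY starttime DESC
LIMIT 100;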

Where does the join get executed when using Custom SQL in Amazon QuickSight?

I am using Custom SQL in Amazon QuickSight to join several tables from Redshift. I wonder where the join happens: does QuickSight send the query to the Redshift cluster and get the results back, or does the join happen in QuickSight? I thought about creating a view in Redshift and selecting data from the view to make sure the join happens in Redshift; however, I read in a few articles that using views in Redshift is not a good idea.
Quicksight pushes SQL down to the underlying database e.g. Redshift.
Using custom SQL is the same as using a view inside Redshift from a performance point of view.
In my opinion it is easier to manage as a Redshift view as you can:
Use Quicksight wizards more effectively
Drop and recreate the view as needed to add new columns
Have visibility into your SQL source code by storing it in a code repo, e.g. git.
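For illustration, a hedged sketch of moving the custom SQL join into a Redshift view that the QuickSight dataset then selects from; all schema, table, and column names below are made up:

CREATE OR REPLACE VIEW analytics.orders_enriched AS
SELECT o.order_id,
       o.order_date,
       c.customer_name,
       p.product_name
FROM sales.orders o
JOIN sales.customers c ON c.customer_id = o.customer_id
JOIN sales.products p ON p.product_id = o.product_id;

-- In QuickSight, point the dataset at analytics.orders_enriched instead of the custom SQL.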

How to replicate rows into tables of a different database in PostgreSQL?

I use PostgreSQL. I have many databases on one server. There is one database which I use the most, say 'main'. This 'main' database has many tables inside it, and the other databases also have many tables inside them.
What I want to do is: whenever a new row is inserted into the 'main.users' table, I wish to insert the same data into the 'users' table of the other databases. How can I do that in PostgreSQL? Similarly, I wish to do the same for all actions like UPDATE, DELETE, etc.
I have gone through the "logical replication" concept as suggested. In my case I know the source db name up front, and I will only come to know the target db name as part of the query, so it is going to be dynamic.
How can I achieve this? Is there any database feature available in PostgreSQL for it? I also welcome all other possible approaches. Please share some ideas on this.
If this is all on the same Postgres instance (aka "cluster"), then I would recommend using a foreign table to access the tables from the "main" database in the other databases.
Those foreign tables look like "local" tables inside each database, but access the original data in the source database directly, so there is no need to synchronize anything.
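A rough sketch of that foreign-table setup using postgres_fdw, run inside one of the other databases; the server name, credentials, and column list are placeholders:

-- One-time setup in the target database
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER main_srv
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'localhost', dbname 'main');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER main_srv
  OPTIONS (user 'app_user', password 'secret');

-- Foreign table that reads main.public.users directly, so nothing has to be synchronized
CREATE FOREIGN TABLE users (
  id   integer,
  name text
)
SERVER main_srv
OPTIONS (schema_name 'public', table_name 'users');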
Upgrade to a recent PostgreSQL release and use logical replication.
Add a trigger on the table in the master database that uses dblink to access and write to the other databases.
Be sure to consider what should be done if the row already exists remotely, or if the remote server is unreachable.
Also note that updates propagated using dblink are not rolled back if the invoking transaction is rolled back.
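A hedged sketch of the trigger-plus-dblink variant for inserts only (PostgreSQL 11+ trigger syntax); the connection string, table, and column names are placeholders and none of the caveats above are handled here:

-- Requires the dblink extension in the 'main' database
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE OR REPLACE FUNCTION replicate_users_insert() RETURNS trigger AS $$
BEGIN
  -- Push the new row into the other database's users table
  PERFORM dblink_exec(
    'dbname=other_db host=localhost user=replicator password=secret',
    format('INSERT INTO users (id, name) VALUES (%L, %L)', NEW.id, NEW.name)
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_insert_replication
AFTER INSERT ON users
FOR EACH ROW EXECUTE FUNCTION replicate_users_insert();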