I have a Glue connection that connects to Redshift over JDBC.
Redshift has Spectrum set up, which automatically reads the latest data from an S3 location.
I need to create a view on top of that Spectrum data while using the Glue connection.
I tried the code below, where I added a SQL query in postactions to refresh the Redshift view:
datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=inputGDF_final,
    catalog_connection=f"{srvc_user}",
    connection_options={"preactions": pre_query, "dbtable": f"{table_dbname}.{table}",
                        "database": f"{database}", "postactions": post_query},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink1")
The problem with this code is that it creates a table in Redshift and then builds the view on that table. I need the view to use Redshift Spectrum, not a table.
Can anyone please help?
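For what it's worth, Redshift only allows a view over a Spectrum (external) table if the view is late-binding, i.e. created WITH NO SCHEMA BINDING, and the external table has to be schema-qualified. Below is a minimal sketch of how the postactions SQL could be built. The function name, the view name and the external schema/table names are all made up for illustration, and this only covers the view DDL: the write call itself will still load the frame into dbtable.

```python
def build_spectrum_view_sql(view_name: str, external_table: str) -> str:
    """Return DDL for a late-binding view over a Spectrum external table."""
    # Redshift requires external tables to be referenced as schema.table.
    if "." not in external_table:
        raise ValueError("external_table must be schema-qualified, e.g. 'spectrum_schema.events'")
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM {external_table} "
        "WITH NO SCHEMA BINDING;"  # makes the view late-binding
    )


# Hypothetical names, not taken from the question.
post_query = build_spectrum_view_sql("public.latest_events_v", "spectrum_schema.events")
print(post_query)
```

Because the view is late-binding, it reads whatever the external table exposes at query time, so it should not need the intermediate Redshift table.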
Related
I am setting up a data warehouse.
The source is Aurora MySQL and I will use Redshift as the DW.
I succeeded in doing a full data load from Aurora MySQL to Redshift.
Now I want to incrementally load only the previous day's data.
There is a date field in the table.
ex)
source table is :
target table is :
If today is 2022-06-04, only the rows dated 2022-06-03 should be loaded incrementally.
I tried an UPSERT in Glue and the result was good.
But data is written to the source in real time.
And the KPIs in BI are shown up to the day before.
So I need an incremental load of the previous day only, NOT an UPSERT or an insert of the latest data.
In the Python script of the Glue job, I declared preactions and postactions as SQL with a WHERE clause.
But the WHERE clause is not working.
All data is loaded.
I don't know why the WHERE clause is not working.
And I want to know how to do an incremental load from Aurora MySQL to Redshift with Glue.
Please help me.
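One likely explanation: preactions and postactions are SQL statements that run on Redshift before and after the load, so a WHERE clause there does not restrict which rows are read from the source. The filter has to be applied on the read side instead. Here is a minimal sketch that builds a "previous day only" source query. The function name, table name and column name are hypothetical, and it assumes the date field is a plain DATE column (a datetime column would need a range filter instead).

```python
from datetime import date, timedelta
from typing import Optional


def previous_day_query(table: str, date_column: str, today: Optional[date] = None) -> str:
    """Build a SELECT that returns only the rows dated the day before `today`."""
    today = today or date.today()
    previous_day = today - timedelta(days=1)
    # Assumes date_column is a DATE; equality would miss rows in a DATETIME column.
    return f"SELECT * FROM {table} WHERE {date_column} = '{previous_day.isoformat()}'"


# Hypothetical table and column names, using the date from the question.
query = previous_day_query("orders", "order_date", today=date(2022, 6, 4))
print(query)
```

How the query reaches the source depends on the reader. Spark's JDBC reader, for example, accepts a query option; alternatively the full table can be read and the resulting DataFrame filtered on the date column before writing to Redshift.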
I have an Aurora Serverless instance which has data loaded across 3 tables (mixture of standard and jsonb data types). We currently use traditional views where some of the deeply nested elements are surfaced along with other columns for aggregations and such.
We have two materialized views that we'd like to send to Redshift. Both the Aurora Postgres and Redshift are in Glue Catalog and while I can see Postgres views as a selectable table, the crawler does not pick up the materialized views.
I am currently exploring two options to get the data to Redshift:
1. Output to Parquet and use COPY to load.
2. Point the materialized view to a JDBC sink specifying Redshift.
I wanted recommendations on what might be the most efficient approach, if anyone has done a similar use case.
Questions:
1. In option 1, would I be able to handle incremental loads?
2. Is bookmarking supported for JDBC (Aurora Postgres) to JDBC (Redshift) transactions, even if through Glue?
3. Is there a better way (other than the options I am considering) to move the data from Aurora Postgres Serverless (10.14) to Redshift?
Thanks in advance for any guidance provided.
Went with option 2. The Redshift copy/load process writes CSV with a manifest to S3 in any case, so duplicating that is pointless.
Regarding the questions:
1. N/A
2. Job bookmarking does work. There are some gotchas though: ensure that connections to both RDS and Redshift are present in the Glue PySpark job, that IAM self ref rules are in place, and that you identify a column that is unique per row to use as the bookmark [I chose the primary key of the underlying table as an additional column in my materialized view].
Using the primary key of the core table may buy efficiencies in pruning materialized views during maintenance cycles. Just retrieve the latest bookmark from the CLI using aws glue get-job-bookmark --job-name yourjobname and then use that value in the WHERE clause of the MV, as in where id >= idinbookmark
conn = glueContext.extract_jdbc_conf("yourGlueCatalogdBConnection")
connection_options_source = {
    "url": conn['url'] + "/yourdB", "dbtable": "table in dB",
    "user": conn['user'], "password": conn['password'],
    "jobBookmarkKeys": ["unique identifier from source table"],
    "jobBookmarkKeysSortOrder": "asc"}
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql", connection_options=connection_options_source,
    transformation_ctx="datasource0")
That's all, folks
I am using custom SQL in Amazon QuickSight to join several tables from Redshift. I wonder where the join happens: does QuickSight send the query to the Redshift cluster and get the results back, or does the join happen in QuickSight? I thought of creating a view in Redshift and selecting from the view to make sure the join happens in Redshift; however, I have read in a few articles that using views in Redshift is not a good idea.
QuickSight pushes SQL down to the underlying database, e.g. Redshift.
Using custom SQL is the same as using a view inside Redshift from a performance point of view.
In my opinion it is easier to manage as a Redshift view, as you can:
- Use QuickSight wizards more effectively
- Drop and recreate the view as needed to add new columns
- Have visibility into your SQL source code by storing it in a code repo, e.g. git.
I am using PySpark in AWS Glue to load data from S3 files into a Redshift table. With mode("Overwrite") in the code, I got an error stating "can't drop table because other object depend on the table". It turned out that there is a view created on top of that table. It seems that "Overwrite" mode actually drops and re-creates the Redshift table and then loads the data. Is there an option that only truncates the table instead of dropping it?
AWS Glue uses the Databricks spark-redshift connector (it's not documented anywhere, but I verified that empirically). The spark-redshift connector's documentation mentions:
Overwriting an existing table: By default, this library uses transactions to perform overwrites, which are implemented by deleting the destination table, creating a new empty table, and appending rows to it.
Here is a related discussion in line with your question, where truncate was used instead of overwrite; it also combines Lambda and Glue. Please refer to it for detailed discussion and code samples. Hope this helps.
regards
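As a rough illustration of the truncate approach when writing through Glue's JDBC sink: append the data and let a preactions statement empty the table first, so the table (and any view that depends on it) is never dropped. This is only a sketch; the helper name and the database, schema and table names are hypothetical.

```python
def truncate_then_load_options(database: str, schema: str, table: str) -> dict:
    """Build connection options that empty the target table before the load."""
    target = f"{schema}.{table}"
    return {
        "database": database,
        "dbtable": target,
        # Runs on Redshift before the load; the table and its dependent views stay in place.
        "preactions": f"TRUNCATE TABLE {target};",
    }


# Hypothetical names, for illustration only.
options = truncate_then_load_options("dev", "public", "sales")
print(options["preactions"])
```

The resulting dict would be passed as connection_options to glueContext.write_dynamic_frame.from_jdbc_conf. Note that TRUNCATE commits immediately in Redshift, so if the load must be able to roll back, a DELETE FROM statement in preactions may be the safer choice.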
Currently we have a workbook developed in Tableau using an Oracle server as the data store, where we have all our tables and views. Now we are migrating to Redshift for better performance. We have the same table structure in Redshift as in Oracle, with the same table names and field names. The Tableau workbook is already developed and we now need to point it to the Redshift tables and views. How do we point the existing workbook to Redshift? Kindly help.
Also let me know any other inputs in this regard.
Thanks,
Raj
Use the Replace Data Source functionality of Tableau Desktop
You can bypass Replace Data Source and move data directly from Oracle to Redshift using bulk loaders.
A simple combo of SQL*Plus + Python + boto + psycopg2 will do the job.
It should:
1. Open a read pipe from Oracle SQL*Plus
2. Compress the data stream
3. Upload the compressed stream to S3
4. Bulk append the data from S3 to the Redshift table.
You can check an example of how to extract table or query data from Oracle and then load it into Redshift using the COPY command from S3.
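The compress and bulk-append steps above can be sketched roughly as follows. This is not the linked example, just an illustration using the standard library: the rows stand in for what would be read from the SQL*Plus pipe, and the table name, bucket, key and IAM role ARN are all made up. Authorising COPY with an IAM role is also an assumption here; the original may well use access keys.

```python
import gzip
import io
from typing import Iterable


def compress_rows(rows: Iterable[str]) -> bytes:
    """Gzip a stream of CSV lines in memory, one row per line."""
    buffer = io.BytesIO()
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        for row in rows:
            gz.write(row.encode("utf-8") + b"\n")
    return buffer.getvalue()


def build_copy_sql(table: str, s3_uri: str, iam_role: str) -> str:
    """Build a Redshift COPY statement for a gzipped, comma-delimited S3 object."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' "
        "DELIMITER ',' GZIP;"
    )


# Stand-in for rows read from the SQL*Plus pipe (hypothetical data).
rows = ["1,alice", "2,bob"]
payload = compress_rows(rows)

# Hypothetical table, bucket and role.
copy_sql = build_copy_sql(
    "public.users",
    "s3://my-bucket/users.csv.gz",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(copy_sql)
```

In a real job the payload would be uploaded to the S3 key named in the COPY statement, and the statement would then be executed against Redshift through a psycopg2 cursor.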