SCD2 Implementation in Redshift using AWS GLue Pyspark

SCD2 Implementation in Redshift using AWS GLue Pyspark - pyspark

I have a requirement to move data from S3 to Redshift. Currently I am using Glue for the work.
Current Requirement:
Compare the primary key of record in redshift table with the incoming file, if a match is found close the old record's end date (update it from high date to current date) and insert the new one.
If primary key match is not found then insert the new record.
Implementation:
I have implemented it in Glue using pyspark with the following steps:
Created dataframes which will cover three scenarios:
If a match is found update the existing record's end date to current date.
Insert the new record to Redshift table where PPK match is found
Insert the new record to Redshift table where PPK match is not found
Finally, Union all these three data frames into one and write this to redshift table.
With this approach, both old record ( which has high date value) and the new record ( which was updated with current date value) will be present.
Is there a way to delete the old record with high date value using pyspark? Please advise.

We have successfully implemented the desired functionality where in we were using AWS RDS [PostGreSql] as database service and GLUE as a ETL service . My Suggestion would be instead of computing the delta in sparkdataframes it would be far more easier and elegant solution if you create stored procedures and call them in pyspark Glue Shell .
[for example : S3 bucket - > Staging table -> Target Table]
In addition if your execution logic is getting executed in less than 10 mins I will suggest you to use python shell and use external libraries such as psycopyg2 / sqlalchemy for DB operations .

Related

Azure Data Factory Copy Pipeline with Geography Data Type

I am trying to get a geography data type from a production DB to another DB on a nightly occurrence. I really wanted to leverage upsert as the write activity, but it seems that geography is not supported with this method. I was reading a similar post about bringing the data through ADF as a well known text data type and then changing it, but I keep getting confused on what to do with the data once it is brought over as a well known data type. I would appreciate any advice, thank you.
Tried to utilize ADF pipelines and data flows. Tried to convert the data type once it was in the destination, but then I was not able to run the pipeline again.

I tried to upsert the data with geography datatype from one Azure SQL database to another using copy activity and got error message.
Then, I did the upsert using dataflow activity. Below are the steps.
A source table is taken in dataflow as in below image.
CREATE TABLE SpatialTable
( id int ,
GeogCol1 geography,
GeogCol2 AS GeogCol1.STAsText() );
INSERT INTO SpatialTable (id,GeogCol1)
VALUES (1,geography::STGeomFromText('LINESTRING(-122.360 46.656, -122.343 46.656 )', 4326));
INSERT INTO SpatialTable (id,GeogCol1)
VALUES (2,geography::STGeomFromText('POLYGON((-122.357 47.653 , -122.348 47.649, -122.348 47.658, -122.358 47.658, -122.358 47.653))', 4326));
Then Alter Row transformation is taken and in Alter Row Conditions, Upsert if isNull(id)==false()is given. (Based on the column id, sink table upserted)
Then, in Sink dataset for target table is given. In sink settings, Update method is selected as Allow Upsert and required Key column is given. (Here column id is selected)
When pipeline is run for the first time, data is inserted into target table.
When pipeline is run for the second time by updating the existing data and inserting new records to source, data is upserted correctly.
Source Data is changed for id=1 and new row is inserted with id=3
Sink data is reflecting the changes done in source.

Is there any other alternatives for load command in DB2

We daily receive 7 millions of records , we are going to append to the existing target table.The target table is partitioned based on date
We are using DB2 Load command to load data from one DB2 table (stage) to another DB2 table target
call SYSPROC.ADMIN_CMD('LOAD FROM (SELECT * FROM stage_table )
OF CURSOR INSERT INTO target_table NONRECOVERABLE INDEXING MODE INCREMENTAL ALLOW READ ACCESS')
As per IBM documentation , ALLOW READ ACCESS is going to be deprecated suggested to use INGEST method instead of that
https://www.ibm.com/docs/en/db2/10.1.0?topic=functionality-fp1-allow-read-access-parameter-load-command
Question:
How to use INGEST method to load data from DB2 to DB2 tables ?
what would be other alternatives to load millions of records with improved performance.

Glue load job not retaining default column value in redshift

I have a Glue job that loads a CSV from S3 into a redshift table. There is 1 column (updated_date) which is not mapped. Default value for that column is set to current_timestamp in UTC. But each time the Glue job runs, this updated_date column is null.
I tried removing updated_dt from Glue metadata table. I tried removing updated_dt from SelectFields.apply() in Glue script.
When I do a normal insert statement in Redshift without using updated_dt column, default current_timestamp() value is inserted for those rows.
Thanks

Well I had the same problem. AWS support told me to convert the Glue DynamicFrame into a Spark DataFrame and use the Spark writer to load data into Redshift:
SparkDF = GlueDynFrame.toDF()
SparkDF.write.format('jdbc').options(
url = ‘<‘JDBC url>’,
dbtable=‘<schema>.<table>’,
user=‘<username>’,
password=‘<password>’).mode(‘append’).save()
Me on the other hand handled the problem by either dropping and creating the target table using preactions or just use some update to set the default values in the postactions.

How to upsert/Delete the DB2 source table data using Pyspark/SQL/DataFrames SPARK RDD's?

I'm trying to run the upsert/delete some of the values in DB2 database source table, which is a existing table on DB2. Is it possible using Pyspark/Spark SQL/Dataframes.

There is no direct way for update/delete in relational database using Pyspark job, but there are workarounds.
(1) You can create a identical empty table (secondary table) in relational database and insert data into secondary table using pyspark job, and write a DML trigger that would perform desired DML operation on your primary table.
(2) You can create a dataframe (eg. a) in spark that would be copy of your existing relational table and merge existing table dataframe with current dataframe(eg. b) and create a new dataframe(eg. c) that would be having latest changes. Now truncate the relational database table and reload with spark latest changes dataframe(c).
These is just a workaround and not a optimal solution for huge amount of data.

Hive: dynamic partition adding to external table

I am running hive 071, processing existing data which is has the following directory layout:
-TableName
- d= (e.g. 2011-08-01)
- d=2011-08-02
- d=2011-08-03
... etc
under each date I have the date files.
now to load the data I'm using
CREATE EXTERNAL TABLE table_name (i int)
PARTITIONED BY (date String)
LOCATION '${hiveconf:basepath}/TableName';**
I would like my hive script to be able to load the relevant partitions according to some input date, and number of days. so if I pass date='2011-08-03' and days='7'
The script should load the following partitions
- d=2011-08-03
- d=2011-08-04
- d=2011-08-05
- d=2011-08-06
- d=2011-08-07
- d=2011-08-08
- d=2011-08-09
I havn't found any discent way to do it except explicitlly running:
ALTER TABLE table_name ADD PARTITION (d='2011-08-03');
ALTER TABLE table_name ADD PARTITION (d='2011-08-04');
ALTER TABLE table_name ADD PARTITION (d='2011-08-05');
ALTER TABLE table_name ADD PARTITION (d='2011-08-06');
ALTER TABLE table_name ADD PARTITION (d='2011-08-07');
ALTER TABLE table_name ADD PARTITION (d='2011-08-08');
ALTER TABLE table_name ADD PARTITION (d='2011-08-09');
and then running my query
select count(1) from table_name;
however this is offcourse not automated according to the date and days input
Is there any way I can define to the external table to load partitions according to date range, or date arithmetics?

I have a very similar issue where, after a migration, I have to recreate a table for which I have the data, but not the metadata. The solution seems to be, after recreating the table:
MSCK REPAIR TABLE table_name;
Explained here
This also mentions the "alter table X recover partitions" that OP commented on his own post. MSCK REPAIR TABLE table_name; works on non-Amazon-EMR implementations (Cloudera in my case).

The partitions are a physical segmenting of the data - where the partition is maintained by the directory system, and the queries use the metadata to determine where the partition is located. so if you can make the directory structure match the query, it should find the data you want. for example:
select count(*) from table_name where (d >= '2011-08-03) and (d <= '2011-08-09');
but I do not know of any date-range operations otherwise, you'll have to do the math to create the query pattern first.
you can also create external tables, and add partitions to them that define the location.
This allows you to shred the data as you like, and still use the partition scheme to optimize the queries.

I do not believe there is any built-in functionality for this in Hive. You may be able to write a plugin. Creating custom UDFs
Probably do not need to mention this, but have you considered a simple bash script that would take your parameters and pipe the commands to hive?
Oozie workflows would be another option, however that might be overkill. Oozie Hive Extension - After some thinking I dont think Oozie would work for this.

I have explained the similar scenario in my blog post:
1) You need to set properties:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
2)Create a external staging table to load the input files data in to this table.
3) Create a main production external table "production_order" with date field as one of the partitioned columns.
4) Load the production table from the staging table so that data is distributed in partitions automatically.
Explained the similar concept in the below blog post. If you want to see the code.
http://exploredatascience.blogspot.in/2014/06/dynamic-partitioning-with-hive.html

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

SCD2 Implementation in Redshift using AWS GLue Pyspark - pyspark

Related

Azure Data Factory Copy Pipeline with Geography Data Type

Is there any other alternatives for load command in DB2

Glue load job not retaining default column value in redshift

How to upsert/Delete the DB2 source table data using Pyspark/SQL/DataFrames SPARK RDD's?

Hive: dynamic partition adding to external table

Categories

Resources