How to do an incremental load from Aurora MySQL to Redshift with Glue?

I am setting up a data warehouse.
The source is Aurora MySQL and I will use Redshift as the DW.
I succeeded in doing a full data load from Aurora MySQL to Redshift.
But I want to incrementally load only the previous day's data.
There is a date field in the table.
ex)
source table is :
target table is :
If today is 2022-06-04, only the rows dated 2022-06-03 should be loaded.
I tried an UPSERT in Glue and the result was good.
But data is written to the source in real time,
and the KPIs in BI are shown only up to the previous day.
So I need an incremental load of the previous day's data, NOT an UPSERT or an insert of the latest data.
In the Python script of the Glue job, I declared preactions and postactions in SQL with a WHERE clause.
But the WHERE clause is not working.
All data is loaded.
I don't know why the WHERE clause is not working.
I also want to know how to do an incremental load from Aurora MySQL to Redshift with Glue.
Please help me.
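
A note on the WHERE clause: in Glue, preactions and postactions are SQL statements that run on Redshift before and after the write. They do not filter the rows of the frame being written, so a WHERE clause there cannot limit what gets loaded. The filter has to be applied when reading from the source. Below is a minimal sketch of that idea, not a tested Glue job: it only builds the SQL strings for the previous day. The table name `orders` and column `order_date` are hypothetical, and the commented-out Glue/Spark calls assume a `spark` session, a `glueContext`, and connection settings that are not shown in the question.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def previous_day_window(today: Optional[date] = None) -> Tuple[date, date]:
    """Return (start, end) for the day before `today`; start inclusive, end exclusive."""
    today = today or date.today()
    return today - timedelta(days=1), today


def build_source_query(table: str, date_column: str, today: Optional[date] = None) -> str:
    """SELECT to run against the source so only yesterday's rows are read."""
    start, end = previous_day_window(today)
    return (
        f"SELECT * FROM {table} "
        f"WHERE {date_column} >= '{start.isoformat()}' "
        f"AND {date_column} < '{end.isoformat()}'"
    )


def build_preaction(table: str, date_column: str, today: Optional[date] = None) -> str:
    """DELETE to run on the target first, so rerunning the job does not duplicate rows."""
    start, end = previous_day_window(today)
    return (
        f"DELETE FROM {table} "
        f"WHERE {date_column} >= '{start.isoformat()}' "
        f"AND {date_column} < '{end.isoformat()}';"
    )


# Hypothetical table and column names, using the date from the question.
query = build_source_query("orders", "order_date", date(2022, 6, 4))
preaction = build_preaction("orders", "order_date", date(2022, 6, 4))
print(query)
print(preaction)

# Assumed usage inside a Glue job (not executed here):
# df = (spark.read.format("jdbc")
#       .option("url", jdbc_url)          # assumption: Aurora MySQL JDBC URL
#       .option("query", query)           # the filter runs on the source database
#       .option("user", user).option("password", password)
#       .load())
# glueContext.write_dynamic_frame.from_jdbc_conf(
#     frame=DynamicFrame.fromDF(df, glueContext, "yesterday"),
#     catalog_connection="redshift-connection",   # assumption: connection name
#     connection_options={"dbtable": "orders", "database": "dw",
#                         "preactions": preaction},
#     redshift_tmp_dir=tmp_dir)
```

The half-open range (`>=` start, `<` end) also works if the date field carries a time component, and the DELETE preaction keeps the load repeatable for the same day.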

Related

CDC Migration from AWS RDS to AWS Redshift

How can I migrate my whole database, which is currently in AWS RDS Postgres, to AWS Redshift, and how can I keep the two databases in sync? If any column is updated in RDS, it must also be updated in Redshift.
I know this can be achieved with AWS Glue, but the sync requirement above is mandatory in my case. The migration task is easy, but the CDC migration is a bit challenging. I am also aware of the bookmark key, but my situation is a bit different: I do not have any sequential column in the tables, but all tables have an updated_at field. This column is the only field I can use to check whether a record has already been processed, so that duplicate processing does not occur, and any newly inserted data should also be replicated to Redshift.
So, could anyone help me do this, even with a PySpark script?
Thanks.
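
One common way to use an updated_at column for this is a high-water mark: remember the largest updated_at already copied, and on each run fetch only rows newer than that. The sketch below illustrates the logic with in-memory SQLite databases standing in for RDS and Redshift, so it is an illustration of the pattern and not PySpark or Redshift code. The `customers` table and its columns are hypothetical. On Redshift itself the upsert step would typically go through a staging table, since `INSERT OR REPLACE` is SQLite syntax.

```python
import sqlite3


def sync_changes(source: sqlite3.Connection, target: sqlite3.Connection, watermark: str) -> str:
    """Copy rows changed after `watermark` into the target and return the new watermark."""
    rows = source.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # Upsert by primary key: an existing id is replaced, a new id is inserted.
    target.executemany(
        "INSERT OR REPLACE INTO customers (id, name, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    # Advance the watermark only if something was copied.
    return rows[-1][2] if rows else watermark


# Stand-ins for the source and target databases (assumption: same simple schema).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")

source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ann", "2022-06-01 10:00:00"), (2, "Bob", "2022-06-02 09:00:00")],
)
source.commit()

# First run copies everything; the second run copies only the updated row.
watermark = sync_changes(source, target, "1970-01-01 00:00:00")
source.execute(
    "UPDATE customers SET name = 'Bobby', updated_at = '2022-06-03 08:00:00' WHERE id = 2"
)
source.commit()
watermark = sync_changes(source, target, watermark)

synced = target.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(synced, watermark)
```

Two caveats with this pattern: a strict `>` comparison can miss rows that share the exact watermark timestamp, and rows deleted in the source are never seen, because they no longer have an updated_at to compare.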

Source of data in Redshift tables

I am trying to find the data source of a couple of tables in Redshift. I have gone through all the stored procedures in the Redshift instance and couldn't find any stored procedure that populates these tables. I have also checked the Data Migration Service and didn't see these tables being migrated from the RDS instance. However, the tables are updated regularly each day.
How can I find out how data is populated in those 2 tables? Are there any logs or system tables I can look into?

One place I'd look is svl_statementtext. That will pull any queries and utility queries that may be inserting into or running COPY jobs against that table. Just use WHERE text LIKE '%yourtablenamehere%' and see what comes back.
https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html
Also check scheduled queries in the Redshift UI console.
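
As a rough illustration of the svl_statementtext suggestion, the helper below builds such a query for a given table name. It is only a sketch: the function name and the example table `daily_sales` are made up, and the resulting SQL still has to be run in whatever Redshift client you use.

```python
def statements_touching(table_name: str) -> str:
    """Build a query listing logged statements whose text mentions `table_name`."""
    # Double any single quotes so the name is safe inside a SQL string literal.
    pattern = "%" + table_name.replace("'", "''") + "%"
    return (
        "SELECT starttime, type, sequence, text\n"
        "FROM svl_statementtext\n"
        f"WHERE text ILIKE '{pattern}'\n"
        "ORDER BY starttime, sequence;"
    )


# Hypothetical table name for illustration.
sql = statements_touching("daily_sales")
print(sql)
```

Long statements are split across several rows in this view, which is why the result is ordered by `sequence` within each start time.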

Data Migration from one DB to another

I have to create an app that transfers data from Snowflake to Postgres every day. Some tables in Postgres are truncated before the migration and all data from the corresponding Snowflake table is copied. For other tables, only data after the last timestamp in Postgres is copied from Snowflake.
This job has to run at night, not during the day when customers are using the service.
What is the best way to do this?

Do you have constraints, limiting your choices in:
ETL or bulk data tooling
Development languages?

According to this site, you can create a foreign data wrapper on Postgresql for snowflake

AWS DMS how to deal with invalid timestamp

I am moving data from a MySQL Aurora DB that contains timestamps such as "0000-00-00 00:00:00".
The target is a Redshift cluster. The migration is failing because the invalid timestamps are not accepted by Redshift.
I cannot modify the source data. I cannot drop that column (most of the timestamps are valid).
I have tried to use a replace-prefix transformation to replace "0000-00-00" with "1970-01-01". It doesn't work because DMS tries to load the data first and process it afterwards.
How can I import the data?

Slow insert and update commands during mysql to redshift replication

I am trying to build a replication server from MySQL to Redshift, and for this I am parsing the MySQL binlog. For the initial replication, I take a dump of the MySQL table, convert it into a CSV file, upload it to S3, and then use the Redshift COPY command. This performs well.
After the initial replication, for the continuous sync, the inserts and updates I read from the binlog have to be run sequentially, which is very slow.
Is there anything that can be done to increase the performance?
One possible solution that I can think of is to wrap the statements in a transaction and send the transaction at once, to avoid multiple network calls. But that would not address the problem that single UPDATE and INSERT statements in Redshift run very slowly. A single UPDATE statement takes 6s. Given the limitations of Redshift (it is a columnar database, so single-row insertion will be slow), what can be done to work around those limitations?
Edit 1:
Regarding DMS: I want to use Redshift as a warehousing solution that just replicates our MySQL continuously. I don't want to denormalise the data, since I have 170+ tables in MySQL. During ongoing replication, DMS shows many errors several times a day and fails completely after a day or two, and it's very hard to decipher the DMS error logs. Also, when I drop and reload tables, it deletes the existing tables on Redshift, creates a new table, and then starts inserting data, which causes downtime in my case. What I wanted was to create a new table, switch the old one with the new one, and then delete the old table.

Here is what you need to do to get DMS to work:
1) Create and run a DMS task with "migrate and ongoing replication" and "Drop tables on target".
2) This will probably fail; do not worry. Stop the DMS task.
3) On Redshift, make the following changes to the table:
Change all dates and timestamps to varchar (because the options used by DMS for the Redshift COPY cannot cope with the '00:00:00 00:00' dates that you get in MySQL).
Change all bool columns to varchar, due to a bug in DMS.
4) In DMS, modify the task to "Truncate" in "Target table preparation mode".
5) Restart the DMS task with a full reload.
Now the initial copy and ongoing binlog replication should work.
Make sure you are on the latest replication instance software version.
Make sure you have followed the instructions here exactly:
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html
If your source is Aurora, also make sure you have set binlog_checksum to "none" (bad documentation).
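
Separately from DMS, the question observes that the bulk COPY path was fast while row-by-row statements were slow. A common workaround, which is not part of the answer above, is micro-batching: collect binlog events for a short interval, collapse them to the final state per primary key, and apply the result in bulk (for example a file in S3, a COPY into a staging table, then one DELETE and one INSERT against the real table). The sketch below shows only the collapsing step. The `(operation, key, row)` event shape is an assumption, not the format of any particular binlog parser.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, int, Optional[dict]]  # (operation, primary key, row or None)


def collapse_events(events: List[Event]) -> Tuple[List[dict], List[int]]:
    """Reduce an ordered list of change events to the final state per primary key."""
    latest: Dict[int, Tuple[str, Optional[dict]]] = {}
    for op, key, row in events:
        latest[key] = (op, row)  # a later event for the same key overwrites the earlier one
    # Rows whose last event was an insert or update get bulk-loaded.
    upserts = [row for op, row in latest.values() if op != "delete"]
    # Keys whose last event was a delete get removed in one statement.
    deletes = [key for key, (op, _) in latest.items() if op == "delete"]
    return upserts, deletes


# Hypothetical batch of events, in binlog order.
events = [
    ("insert", 1, {"id": 1, "qty": 5}),
    ("update", 1, {"id": 1, "qty": 7}),
    ("insert", 2, {"id": 2, "qty": 1}),
    ("delete", 2, None),
    ("insert", 3, {"id": 3, "qty": 9}),
]
upserts, deletes = collapse_events(events)
print(upserts)
print(deletes)
```

Five row-level statements shrink to two rows to load and one key to delete, so the cost on Redshift depends on the number of batches and not on the number of individual changes.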
