How does DDL replication work in AWS DMS? - postgresql

Could you please explain how DDL replication works in AWS DMS (in the case of the two Postgres databases)? I didn't find an explanation of this process in the official documentation.
As I can see the replication task installs awsdms_ddl_audit trigger (here is information about of this trigger). This trigger intercepts DDL operations and writes them to the awsdms_ddl_audit table. I don't understand what happens with these intercepted DDL operations after it.
P.S.
I am asking this because I've noticed that DMS is applying these DDL operations in the middle of the CDC process i.e. it doesn't arrange them with a timeline of CDC changes.
In my case, I update the source database for an hour and at the end of it I remove several columns. The DMS removes these columns when the CDC synchronization process isn't finished yet.
It's very strange behavior.

Related

CDC Migration from AWS RDS to AWS Redshift

How to migrate my whole database which is currently in AWS RDS Postgres to AWS Redshift and also can you please help me out how can I keep both these DBs in sync. I want to sync even if any column is updated in RDS so it must get updated in Redshift also.
I know we can achieve it with AWS Glue, but the above scenario is mandatory in my case. Migration task is easy to do but to to the CDC migration is bit challenging. I am also aware about the bookmark key but my situation is bit different, I do not have any sequential column in the tables, but it has updated_at field in all the tables so this column is the only field on which I can check whether the record is processed or not so that duplicate processing may not occur and if any new data is inserted it should also get replicated in RedShift.
So, would anyone help me out to do this even by using pyspark script?
Thanks.

AWS - DMS migration missing sequence , views , routines ... etc

I a trying to do the migration for our Postgres database to Aurora postgres
first I create a normal task it migrates all tables only except its constraints.
My tries to clone our database
I downloaded AWS SCT (Schema Conversion Tool) then set my configuration to generate a migration report, here is the report
We completed the analysis of your PostgreSQL source database and
estimate that 100% of the database storage objects and 99.1% of
database code objects can be converted automatically or with minimal
changes if you select Amazon Aurora (PostgreSQL compatible) as your
migration target. Database storage objects include schemas, tables,
table constraints, indexes, types, sequences and foreign tables.
Database code objects include triggers, views, materialized views,
functions, domains, rules, operators, collations, fts configurations,
fts dictionaries and aggregates. Based on the source code syntax
analysis, we estimate 99.9% (based on # lines of code) of your code
can be converted to Amazon Aurora (PostgreSQL compatible)
automatically. To complete the migration, we recommend 133 conversion
action(s) ranging from simple tasks to medium-complexity actions to
complex conversion actions.
my question:
1- is there a way to automate including everything in my source database
2- the report mentions we recommend 133 conversion action(s) where I can find these conversion actions.
3- is it safe to ongoing migration as in my case we need to run migration every day.
Sequence, Index, and Constraint are not migrated and it is mentioned in the official docs on AWS.
You can use this source.
This will help you to migrate Sequence, Index, and Constraint at once.
p.s: this doesn't include View and Routine.
There's no way AFAIK in AWS to automate everything if that was there then it would have been already added in SCT. however, if there are similar errors that are occurring in code/DDL/function like some datatype conversions. you can create a script that will take schema dump and convert all these data types to the desired ones.
Choose the SQL Conversion Actions tab in SCT tool.
The SQL Conversion Actions tab contains a list of SQL code items that can't be converted automatically. There are also recommendations for how to manually convert the SQL code. You can look into the errors and make changes accordingly.
In case if you are migrating to the same version of PG in aurora you can take a schema only dump and restore it into target aurora and later setup a full load/ongoing replication with DMS and you don't have to take SCT into consideration(most of the time worked for me). Just make sure you adhere to aurora limitations specific to the PG version
We have been using ongoing migration in our project at it's working great. There are some best practices we have developed but that will differ from project to project
DDL changes must be made on the target first and stop replication while doing it and resume once done
Separate the tables with high transactions as different DMS task as it will help you in troubleshooting and your rest of the tables can still be working
Always keep in mind DMS replicates data, not the view/function/procedures
Active monitoring of tasks and replication instances
And I would like to suggest if you are performing homogenous migration(PG -> PG) you should consider pg_dump & pg_restore that easy and sophisticated for the same versions and AWS aurora supports it.

Bidirectional Replication Design: best way to script and execute unmatched row on Source DB to multiple subscriber DBs, sequentially or concurrently?

Thank you for help or suggestion offered.
I am trying to build my own multi-master replication on Postgresql 10 in Windows, for a situation which cannot use any of the current 3rd party tools for PG multimaster replication, which can also involve another DB platform in a subscriber group (Sybase ADS). I have the following logic to create bidirectional replication, partially inspired by Bucardo's logic, between 1 publisher and 2 subscribers:
When INSERT, UPDATE, or DELETE is made on Source table, Source table Trigger adds row to created meta table on Source DB that will act as a replication transaction to be performed on the 2 subscriber DBs which subcribe to it.
A NOTIFY signal will be sent to a service, or script written in Python or some scripting language will monitor for changes in the metatable or trigger execution and be able to do a table compare or script the statement to run on each subscriber database.
***I believe that triggers on the subscribers will need to be paused to keep them from pushing their received statements to their subscribers, i.e. if node A and node B both subscribe to each other's table A, then an update to node A's table A should replicate to node B's table A without then replicating back to table A in a bidirectional "ping-pong storm".
There will be a final compare between tables and the transaction will be closed. Re-enable triggers on subscribers if they were paused/disabled when pushing transactions from step 2 addendum.
This will hopefully be able to be done bidirectionally, in order of timestamp, in FIFO order, unless I can figure out a to create child processes to run the synchronizations concurrently.
For this, I am trying to figure out the best way to setup the service logic---essentially Step 2 above, which has apparently been done using a daemon in Linux, but I have to work in Windows, making it run as, or resembling, a service/agent---or come up with a reasonably easy and efficient design to send the source DBs statements to the subscribers DBs.
Does anyone see that this plan is faulty or may not work?
Disclaimer: I don't know anything about Postgresql but have done plenty of custom replication.
The main problem with bidirectional replication is merge issues.
If the same key is used in both systems with different attributes, which one gets to push their change? If you nominate a master it's easier. Then the slave just gets overwritten every time.
How much latency can you handle? It's much easier to take the 'notify' part out and just have a five minute windows task scheduler job that inspects log tables and pushes data around.
In other words, this kind of pattern:
Change occurs in a table. A database trigger on that table notes the change and writes the PK of the table to a change log table. A ReplicationBatch column in the log table is set to NULL by default
A windows scheduled task inspects all change log tables to find all changes that happened since the last run and 'reserves' these records by setting their replication state to a replication batch number
i.e. you run a UPDATE LogTable Set ReplicationBatch=BatchNumber WHERE ReplicationState IS NULL
All records that have been marked are replicated
you run a SELECT * FROM LogTable WHERE ReplicationState=RepID to get the records to be processed
When complete, the reserved records are marked as complete so the next time around only subsequent changes are replicated. This completion flag might be in the log table or it might be in a ReplicaionBatch number table
The main point is that you need to reserve records for replication, so that as you are replicating them out, additional log records can be added in from the source without messing up the batch
Then periodically you clear out the log tables.

How to see changes in a postgresql database

My postresql database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS instance (RDS). I have automated backups enabled (these are different to RDS snapshots which are user initiated). Is it possible to see the difference between two RDS automated backups?
I do not know if it is possible to see difference between RDS snapshots. But in the past we tested several solutions for similar problem. Maybe you can take some inspiration from it.
Obvious solution is of course auditing system. This way you can see in relatively simply way what was changed. Depending on granularity of your auditing system down to column values. Of course there is impact on your application due auditing triggers and queries into audit tables.
Another possibility is - for tables with primary keys you can store values of primary key and 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before updated and compare them with values after update. But this way you can identify only changed / inserted / deleted rows but not changes in different columns.
You can make streaming replica and set replication slots (and to be on the safe side also WAL log archiving ). Then stop replication on replica before updates and compare data after updates using dblink selects. But these queries can be very heavy.

Slow insert and update commands during mysql to redshift replication

I am trying to make a replication server from MySQL to redshift, for this, I am parsing the MySQL binlog. For initial replication, I am taking the dump of the mysql table, converting it into a CSV file and uploading the same to S3 and then I use the redshift copy command. For this the performance is efficient.
After the initial replication, for the continuous sync when I am reading the binlog the inserts and updates have to be run sequentially which are very slow.
Is there anything that can be done for increasing the performance?
One possible solution that I can think of is to wrap the statements in a transaction and then send the transaction at once, to avoid multiple network calls. But that would not address the problem that single update and insert statements in redshift run very slow. A single update statement is taking 6s. Knowing the limitations of redshift (That it is a columnar database and single row insertion will be slow) what can be done to work around those limitations?
Edit 1:
Regarding DMS: I want to use redshift as a warehousing solution which just replicates our MYSQL continuously, I don't want to denormalise the data since I have 170+ tables in mysql. During ongoing replication, DMS shows many errors multiple times in a day and fails completely after a day or two and it's very hard to decipher DMS error logs. Also, When I drop and reload tables, it deletes the existing tables on redshift and creates and new table and then starts inserting data which causes downtime in my case. What I wanted was to create a new table and then switch the old one with new one and delete old table
Here is what you need to do to get DMS to work
1) create and run a dms task with "migrate and ongoing replication" and "Drop tables on target"
2) this will probably fail, do not worry. "stop" the dms task.
3) on redshift make the following changes to the table
Change all dates and timestamps to varchar (because the options used
by dms for redshift copy cannot cope with '00:00:00 00:00' dates that
you get in mysql)
change all bool to be varchar - due to a bug in dms.
4) on dms - modify the task to "Truncate" in "Target table preparation mode"
5) restart the dms task - full reload
now - the initial copy and ongoing binlog replication should work.
Make sure you are on latest replication instance software version
Make sure you have followed the instructions here exactly
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html
If your source is aurora, also make sure you have set binlog_checksum to "none" (bad documentation)