How do I identify modified rows using AWS DMS to Redshift? - amazon-redshift

I thought I had a simple question, but I am having a hard time finding an answer. This makes me suspicious that I'm asking the wrong question...
I am new to Redshift... I am using a DMS migration task to pull data from a SQL Server database sitting on an EC2 instance into my Redshift database. I have it set to do a Full Load with Ongoing Replication. This is working.
However, I want to know specifically which rows have changed after ongoing replication makes its updates. It is replicating to my staging tables, and I do further transformations from there depending on certain changes to columns (e.g. history tracking), which is why I need to know what changed. I compare the staging tables to the existing facts and dimensions, but I don't want to compare the entire table, just the modified rows.
The source database is older and I can't trust the modification timestamp columns are always updated. I thought that setting the migration task to truncate the table, then ingesting ongoing changes would leave my staging table with just changed rows. In hindsight, maybe that was a silly thought.
The other route I am thinking of is to turn on CDC in the source tables, load staging tables on the SQL Server side with the net changes, and point DMS at those tables instead. I was hoping that extra step would not be necessary.
Help is appreciated!

Not sure if you're still looking for an answer on this, but you could always use a DMS transformation rule to flag rows that have changed. For example:
{
  "rule-type": "transformation",
  "rule-id": "1",
  "rule-name": "1",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "%",
    "table-name": "%"
  },
  "rule-action": "add-column",
  "value": "dms_load_ts",
  "expression": "current_timestamp",
  "data-type": {
    "type": "datetime"
  }
}
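With a column like this in place, the downstream comparison can be restricted to recently loaded rows instead of the whole table. A minimal sketch of that query, assuming a hypothetical staging table and a small watermark table that records the last transformation run (neither name comes from the original answer):

```sql
-- Pull only the rows DMS has touched since the last transformation run.
-- staging.orders and etl.etl_watermark are illustrative names.
SELECT s.*
FROM staging.orders s
WHERE s.dms_load_ts > (
    SELECT last_run_ts
    FROM etl.etl_watermark
    WHERE table_name = 'orders'
);

-- After processing, advance the watermark:
UPDATE etl.etl_watermark
SET last_run_ts = getdate()
WHERE table_name = 'orders';
```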

Related

Retention management for time-series table in Redshift

I have a table which I use DMS to migrate from Aurora to Redshift. This table is insert-only, with a lot of data keyed by timestamp.
I would like to have a trimmed version of that table in Redshift.
The idea was to partition it and use a retention script to keep just the last 2 months. However, Redshift has no partitions; what I found instead is the time-series table pattern, which sounds similar. If I understand it correctly, my table should look like:
create table public."bigtable"(
    "id" integer NOT NULL DISTKEY,
    "date" timestamp,
    "name" varchar(256)
)
SORTKEY(date);
However, I can't find good documentation on how the retention is managed. I would appreciate any corrections and advice :)
A couple of ways this is typically done in Redshift.
For small to medium tables, the data can just be DELETEd and the table VACUUMed (usually a delete-only vacuum). Redshift is very good at handling large amounts of data, and even for fairly large tables this works fine. There is some overhead for the delete and vacuum, but if these are scheduled during off hours it works just fine and is simple.
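The delete-and-vacuum approach can be sketched as follows, using the table and the 2-month retention window from the question:

```sql
-- Drop everything older than the retention window in one pass:
DELETE FROM public.bigtable
WHERE "date" < dateadd(month, -2, current_date);

-- Reclaim the space without the cost of a full re-sort:
VACUUM DELETE ONLY public.bigtable;
```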
When the table in question gets really big, or there are no low-workload windows in which to perform the delete and vacuum, people set up "month" tables for their data and use a view that UNION ALLs these tables together. Then "removing" a month is just redefining the view and dropping the unneeded table. This is very low effort for Redshift to perform but is a bit more complex to set up: your incoming data needs to be routed into the correct table based on month, so it is no longer just a straight copy from Aurora. This process also simplifies UNLOADing the old tables to S3 for history-capture purposes.
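A sketch of the month-table pattern, assuming tables named by month (the naming scheme is an assumption, not from the answer):

```sql
-- Each month's data lands in its own table:
CREATE TABLE public.bigtable_2023_01 (LIKE public.bigtable);
CREATE TABLE public.bigtable_2023_02 (LIKE public.bigtable);

-- A view stitches the retained months together:
CREATE OR REPLACE VIEW public.bigtable_v AS
SELECT * FROM public.bigtable_2023_01
UNION ALL
SELECT * FROM public.bigtable_2023_02;

-- "Removing" a month = redefine the view without it, then
-- DROP TABLE (or UNLOAD to S3 first for history capture).
```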

How to see changes in a postgresql database

My PostgreSQL database is updated each night.
At the end of each nightly update, I need to know what data changed.
The update process is complex, taking a couple of hours and requires dozens of scripts, so I don't know if that influences how I could see what data has changed.
The database is around 1 TB in size, so any method that requires starting a temporary database may be very slow.
The database is an AWS RDS instance. I have automated backups enabled (these are different from RDS snapshots, which are user-initiated). Is it possible to see the difference between two RDS automated backups?
I do not know if it is possible to see the difference between RDS snapshots, but in the past we tested several solutions for a similar problem. Maybe you can take some inspiration from them.
The obvious solution is of course an auditing system. This way you can see in a relatively simple way what was changed, down to column values depending on the granularity of your auditing system. Of course there is an impact on your application due to the auditing triggers and the queries into the audit tables.
Another possibility, for tables with primary keys: you can store the primary key values together with the 'xmin' and 'ctid' hidden system columns (https://www.postgresql.org/docs/current/static/ddl-system-columns.html) for each row before the update and compare them with the values after the update. This way you can identify changed / inserted / deleted rows, but not which columns changed.
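A sketch of the snapshot-and-compare approach; the table and column names (my_table, id) are assumptions, and note that ctid can also change on VACUUM FULL:

```sql
-- Before the update: snapshot keys plus the hidden system columns.
CREATE TABLE before_snapshot AS
SELECT id, xmin::text AS row_xmin, ctid AS row_ctid
FROM my_table;

-- After the update: inserted rows, plus rows whose xmin/ctid changed.
SELECT t.id
FROM my_table t
LEFT JOIN before_snapshot b ON b.id = t.id
WHERE b.id IS NULL
   OR (t.xmin::text, t.ctid) IS DISTINCT FROM (b.row_xmin, b.row_ctid);

-- Deleted rows are the keys left only in the snapshot:
SELECT b.id
FROM before_snapshot b
LEFT JOIN my_table t ON t.id = b.id
WHERE t.id IS NULL;
```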
You can set up a streaming replica with replication slots (and, to be on the safe side, WAL archiving as well). Then stop replication on the replica before the updates and compare the data afterwards using dblink selects. But these queries can be very heavy.
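The dblink comparison might look like the sketch below; the connection string and column list are assumptions, and as noted, running this over a ~1 TB database can be very heavy:

```sql
-- Rows on the primary (post-update) that differ from, or are missing on,
-- the paused replica (pre-update state).
SELECT id, name
FROM my_table
EXCEPT
SELECT id, name
FROM dblink('host=replica-host dbname=mydb',
            'SELECT id, name FROM my_table')
     AS old_rows(id integer, name text);
```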

Replicate views from Postgres to Postgres on AWS DMS

I'm using AWS DMS to migrate data from one Postgres database to another Postgres database. Everything works fine, except one thing: the views are not replicated to my target database.
I've read that this cannot be done between heterogeneous databases (e.g. from Oracle to Postgres) using DMS, but I imagine it is possible somehow when we're using the same engine on both sides.
Does someone know how to replicate the views using AWS DMS from Postgres to Postgres?
DMS is a data migration service. A view is a virtual table (represented by SQL code) and does not contain any data by itself the way a table does.
If you aren't looking to migrate the actual DDL, but instead are happy to replicate the data from views in the source into tables in the target (akin to materialising the views in the target), then you can do this, but you need to jump into the JSON configuration and there are some caveats.
Per the Selection rules and actions portion of the DMS documentation, you can add to the object-locator object a table-type field, setting its value to one of view or all to migrate views. It defaults to table which skips views, and there's no configuration field for it in the console aside from modifying the JSON.
For example:
{
  "rule-type": "selection",
  "rule-id": "123456",
  "rule-name": "some-rule",
  "object-locator": {
    "schema-name": "my_schema",
    "table-name": "%",
    "table-type": "view"
  },
  "rule-action": "include",
  "filters": []
}
There are a couple of caveats though:
Not all sources support migrating views. Aurora PostgreSQL doesn't, for instance, and saving the task will give an error if you try, so you may need to replace the endpoint with a vanilla PostgreSQL one.
If you're migrating views, your only option is a full load. CDC isn't going to work, so partial migrations aren't on the cards.

Slow insert and update commands during mysql to redshift replication

I am trying to build a replication pipeline from MySQL to Redshift by parsing the MySQL binlog. For the initial replication, I take a dump of the MySQL table, convert it into a CSV file, upload it to S3, and then use the Redshift COPY command. The performance of this step is fine.
After the initial replication, for the continuous sync, when I read the binlog the inserts and updates have to be run sequentially, which is very slow.
Is there anything that can be done for increasing the performance?
One possible solution I can think of is to wrap the statements in a transaction and send the transaction at once, to avoid multiple network calls. But that would not address the underlying problem: single UPDATE and INSERT statements in Redshift run very slowly. A single UPDATE statement takes 6 s. Knowing Redshift's limitations (it is a columnar database, and single-row insertion will be slow), what can be done to work around them?
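One common way around those limitations, sketched below under assumed table names: batch the binlog changes to S3, COPY them into a staging table, and then apply them as one set-based delete-and-insert instead of per-row statements.

```sql
-- Assumes the changed rows were already COPYed into staging_changes.
-- target_table and staging_changes are illustrative names.
BEGIN;

-- Remove the current versions of every changed row in one pass:
DELETE FROM target_table
USING staging_changes s
WHERE target_table.id = s.id;

-- Insert the latest version of each changed row:
INSERT INTO target_table
SELECT * FROM staging_changes;

TRUNCATE staging_changes;
COMMIT;
```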
Edit 1:
Regarding DMS: I want to use Redshift as a warehousing solution which just replicates our MySQL continuously; I don't want to denormalise the data, since I have 170+ tables in MySQL. During ongoing replication, DMS shows many errors multiple times a day and fails completely after a day or two, and it's very hard to decipher the DMS error logs. Also, when I drop and reload tables, it deletes the existing tables on Redshift, creates a new table, and then starts inserting data, which causes downtime in my case. What I wanted was to create a new table, swap the old one with the new one, and then delete the old table.
Here is what you need to do to get DMS to work:
1) Create and run a DMS task with "migrate and ongoing replication" and "Drop tables on target".
2) This will probably fail; do not worry. "Stop" the DMS task.
3) On Redshift, make the following changes to the tables:
Change all dates and timestamps to varchar (because the options used by DMS for the Redshift COPY cannot cope with '00:00:00 00:00' dates that you get in MySQL).
Change all bool columns to varchar, due to a bug in DMS.
4) On DMS, modify the task to "Truncate" in "Target table preparation mode".
5) Restart the DMS task with a full reload.
Now the initial copy and ongoing binlog replication should work.
Make sure you are on the latest replication instance software version.
Make sure you have followed the instructions here exactly:
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html
If your source is Aurora, also make sure you have set binlog_checksum to "none" (this is poorly documented).
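Since Redshift cannot alter a timestamp or boolean column to varchar in place, step 3 in practice means recreating each affected table and swapping it in. A sketch with illustrative table and column names:

```sql
-- Recreate the table with varchar in place of the problem columns:
CREATE TABLE staging.orders_new (
    id          integer NOT NULL,
    created_at  varchar(32),   -- was timestamp: tolerates '00:00:00 00:00'
    is_active   varchar(8),    -- was boolean: works around the DMS bug
    amount      decimal(12, 2)
);

-- Swap it in, keeping the old table until verified:
ALTER TABLE staging.orders RENAME TO orders_old;
ALTER TABLE staging.orders_new RENAME TO orders;
```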

Sync postgreSql data with ElasticSearch

Ultimately I want to have a scalable search solution for the data in PostgreSQL. My findings point me towards using Logstash to ship write events from Postgres to Elasticsearch; however, I have not found a usable solution. The solutions I have found involve using the jdbc input to query all data from Postgres on an interval, and delete events are not captured.
I think this is a common use case so I hope you guys could share with me your experience, or give me some pointers to proceed.
If you also need to be notified of DELETEs and delete the respective records in Elasticsearch, it is true that the Logstash jdbc input will not help. You'd have to use a solution working around the write-ahead log, as suggested here.
However, if you still want to use the Logstash jdbc input, what you could do is simply soft-delete records in PostgreSQL, i.e. create a new BOOLEAN column in order to mark your records as deleted. The same flag would then exist in Elasticsearch, and you can exclude those records from your searches with a simple term query on the deleted field.
Whenever you need to perform some cleanup, you can delete all records flagged deleted in both PostgreSQL and Elasticsearch.
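The soft-delete flag can be sketched as below; the table name, id value, and the updated_at bump (so the jdbc input's interval query picks the row up) are assumptions:

```sql
-- One-time schema change:
ALTER TABLE items ADD COLUMN deleted boolean NOT NULL DEFAULT false;

-- Application "deletes" become updates the jdbc input can see:
UPDATE items
SET deleted = true, updated_at = now()
WHERE id = 42;

-- Periodic cleanup on the Postgres side (mirror it in Elasticsearch):
DELETE FROM items WHERE deleted;
```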
You can also take a look at PGSync.
It's similar to Debezium but a lot easier to get up and running.
PGSync is a change data capture tool for moving data from Postgres to Elasticsearch. It allows you to keep Postgres as your source of truth and expose structured, denormalized documents in Elasticsearch.
You simply define a JSON schema describing the structure of the data in Elasticsearch.
Here is an example schema (you can also have nested objects):
{
  "nodes": {
    "table": "book",
    "columns": [
      "isbn",
      "title",
      "description"
    ]
  }
}
PGSync generates the queries for your document on the fly, so unlike Logstash there is no need to write queries yourself. It also supports and tracks deletion operations.
It operates both a polling and an event-driven model: the initial sync polls the database for changes made since the last time the daemon was run, and thereafter event notifications (based on triggers, handled by pg_notify) capture changes to the database as they occur.
It has very little development overhead.
Create a schema as described above
Point pgsync at your Postgres database and Elasticsearch cluster
Start the daemon.
You can easily create a document that includes multiple relations as nested objects. PGSync tracks any changes for you.
Have a look at the GitHub repo for more details.
You can install the package from PyPI
Please take a look at Debezium. It's a change data capture (CDC) platform which allows you to stream your data.
I created a simple GitHub repository which shows how it works.