Data replication from MySQL to Redshift

I would like to load data from MySQL into Redshift.
My data values can change at any time, so I need to capture both the old and the new records in Redshift.
The modified (old) records need to be archived; only the current records should be reflected in Redshift.
For example:
MySQL table:
ID NAME SAL
-- ---- -----
1 XYZ 10000
2 ABC 20000
First load into Redshift (this should be the same as the MySQL table):
ID NAME SAL
-- ---- ----
1 XYZ 10000
2 ABC 20000
Second load (I changed the salary of employee 'XYZ' from 10000 to 30000):
ID NAME SAL
-- ---- ----
1 XYZ 30000
2 ABC 20000
The above table should be reflected in Redshift, and the modified record (1, XYZ, 10000) should be archived.
Is this possible?

How many rows are you expecting?
One approach would be to add a timestamp column which gets updated to current time whenever a record is modified.
Then an external replication process could get the max timestamp from Redshift, select any MySQL records with a later timestamp, and dump them to S3 so they can be loaded into Redshift with COPY.
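A minimal sketch of that approach on the MySQL side (table and column names are hypothetical; the literal timestamp stands in for the max value read from Redshift):
-- Add a column that MySQL maintains automatically on every update.
ALTER TABLE employee
  ADD COLUMN updated_at TIMESTAMP NOT NULL
  DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
-- Extract only the rows changed since the last load.
SELECT id, name, sal, updated_at
FROM employee
WHERE updated_at > '2016-01-01 00:00:00';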
To load the new records and archive the old ones you'll need a variation of the Redshift upsert pattern: load into a temporary table, identify the records in the main table that should be archived, move those to an archive table (or UNLOAD them to an S3 archive), and then ALTER TABLE APPEND the new records into the main table.
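A hedged sketch of that pattern on the Redshift side (table names are hypothetical; the staging table is assumed to be loaded with COPY from S3):
-- Stage the changed rows extracted from MySQL.
CREATE TEMP TABLE employee_stage (LIKE employee);
-- ... load employee_stage here with COPY from S3 ...

BEGIN;

-- Archive the old versions of the rows that are about to be replaced.
INSERT INTO employee_archive
SELECT e.*
FROM employee e
JOIN employee_stage s ON s.id = e.id;

-- Remove the old versions from the main table.
DELETE FROM employee
USING employee_stage s
WHERE employee.id = s.id;

-- Bring in the new versions. For very large loads, a permanent staging table
-- plus ALTER TABLE ... APPEND (run outside a transaction) is cheaper.
INSERT INTO employee
SELECT * FROM employee_stage;

COMMIT;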

See this site https://flydata.com/blog/mysql-to-amazon-redshift-replication.
A better option is Change Data Capture (CDC). CDC is a technique that captures the changes made to data in MySQL and applies them to the destination Redshift table. It's similar to the technique mentioned by systemjack, but it only imports changed data, not the entire database.
To use the CDC method with a MySQL database, you must use the binary log (binlog). The binlog allows you to capture change data as a stream, enabling near-real-time replication.
Binlog not only captures data changes (INSERT, UPDATE, DELETE) but also table schema changes such as ADD/DROP COLUMN. It also ensures that rows deleted from MySQL are also deleted in Redshift.
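As a rough illustration (exact settings depend on your MySQL version and the CDC tool you choose), a CDC pipeline typically needs row-based binary logging:
-- Check whether the binary log is enabled and in row format.
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
-- Row format can be switched at runtime; persist it in my.cnf as well
-- (log_bin, binlog_format = ROW, binlog_row_image = FULL, server_id).
SET GLOBAL binlog_format = 'ROW';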

Related

Copy data activity in ADF is not deleting data in Azure Table Storage

Right now, I am using the ADF Copy Data activity to copy data from Azure SQL to Azure Table Storage.
It inserts/replaces/updates data when the ADF pipeline is triggered; this behavior is controlled by the "Insert type" option in the Sink of the Copy Data activity.
But it is not deleting records in the destination (ATS table).
How can I sync Azure SQL data with Azure Table Storage (including deleted data)?
Ex:
Source SQL table: Employee
Id Name
1 User1
2 User2
Now, using this Copy Data activity, these two rows are synced to ATS.
Destination ATS Table: Employee
PartitionKey RowKey Timestamp Name
1 NewGuid 2022-07-22 11:30 User1
2 NewGuid 2022-07-22 11:30 User2
Now the source SQL table gets updated as below:
Id Name
1 User2
3 User3
Now Id 2 was deleted, the Name for Id 1 was updated, and Id 3 was added.
If I run the pipeline again, ATS is updated as below:
PartitionKey RowKey Timestamp Name
1 NewGuid 2022-07-22 12:30 User2
2 NewGuid 2022-07-22 11:30 User2
3 NewGuid 2022-07-22 12:30 User3
Here, PartitionKey 2 is not deleted, but the insert and update were done.
How can I delete this record as well using the Copy Data activity sync?
Have you tried using a stored procedure for that case?
I reproduced this and got the same result.
AFAIK, the Copy activity does not delete any data from the target; it overwrites the data in the target with the source data.
The Sink settings also describe it as an insert type.
Azure Table Storage only updates or replaces a record from outside (which effectively deletes the old data) when the new record's PartitionKey and RowKey match an existing record in the target.
That is not happening with your last row, which is why it is not updated.
You can raise a feature request for deletion with the Copy activity for this type of storage here.
You can try this manual approach to delete those kinds of records when your data is small:
Create a new unique column to be used only as the RowKey in Table Storage. Assign your regular table Id to the PartitionKey and this new column to the RowKey. Then, whenever you want to delete old records and update new ones, reuse the old RowKey value, as in the sketch below.
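A rough sketch of what the Copy activity's source query could look like under that approach (dbo.Employee and StableRowId are hypothetical names; it assumes the sink is configured to take the PartitionKey and RowKey from these columns):
-- Expose stable key values so a re-run overwrites the same ATS entity
-- instead of creating a new one with a fresh GUID.
SELECT
    CAST(Id AS NVARCHAR(50))          AS PartitionKey,  -- regular table Id
    CAST(StableRowId AS NVARCHAR(50)) AS RowKey,        -- new unique column kept only for ATS
    Name
FROM dbo.Employee;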

Redshift CDC or delta load

Does anyone know the best way of loading deltas or CDC, with or without using any tools?
I have a big table with billions of records and want to update or insert, like MERGE in SQL Server or Oracle, but in Amazon Redshift via S3.
We also have lots of columns, so we can't compare all of the columns either.
E.g.
TableA
Col1 Col2 Col3 ...
It already has records.
So when inserting new records, I need to check whether that particular record already exists: if it does, don't insert; if it doesn't, insert; and if it has changed, update the record.
I do have key id and date columns, but since the table has 200+ columns it isn't easy to check all of them and it takes a lot of time.
Many thanks in advance
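One common way to avoid comparing 200+ columns is to compare a single hash of the non-key columns. A hedged sketch in Redshift SQL (table, key, and column names are hypothetical; in practice the hash would cover all non-key columns, and non-text columns need casting before concatenation):
-- Stage the incoming delta (loaded with COPY from S3).
CREATE TEMP TABLE stage_tablea (LIKE tablea);

-- Update existing keys whose content hash differs from the staged row.
UPDATE tablea
SET col1 = s.col1,
    col2 = s.col2,
    col3 = s.col3
FROM stage_tablea s
WHERE tablea.key_id = s.key_id
  AND MD5(tablea.col1 || '|' || tablea.col2 || '|' || tablea.col3)
   <> MD5(s.col1 || '|' || s.col2 || '|' || s.col3);

-- Insert keys that do not exist in the target yet.
INSERT INTO tablea
SELECT s.*
FROM stage_tablea s
LEFT JOIN tablea t ON t.key_id = s.key_id
WHERE t.key_id IS NULL;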

Purging of transactional data in DB2

We have an existing table of more than 130 TB in DB2 from which we have to delete records. Using a DELETE statement would hang the system. One option is to partition the table by month and year and then drop the partitions one by one using truncate or drop. I am looking for a script which can create the partitions and subsequently drop them.
You can't partition the data within an existing table. You would need to move the data to a new range-partitioned table.
If using Db2 LUW, and depending on your specific requirements, consider using ADMIN_MOVE_TABLE to move your data to a new table while keeping your table "online".
ADMIN_MOVE_TABLE has the ability to add range partitioning and/or multi-dimensional clustering on the new table during the move.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.db2.luw.sql.rtn.doc/doc/r0055069.html
Still, a 130 TB table is very large, and you would be well advised to be careful in planning and testing such a move.
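A hedged sketch of such a call (schema, table, tablespace, and column names are hypothetical; verify the exact DATA_PART string format against the ADMIN_MOVE_TABLE documentation linked above):
-- Move APPSCHEMA.BIGTABLE online into new tablespaces, adding monthly range
-- partitions on TXN_DATE during the move.
CALL SYSPROC.ADMIN_MOVE_TABLE(
  'APPSCHEMA', 'BIGTABLE',
  'TBSP_DATA', 'TBSP_IDX', 'TBSP_LOB',
  '',                                -- no MDC / ORGANIZE BY columns
  '',                                -- keep the existing distribution key
  'PARTITION BY RANGE (TXN_DATE) (STARTING ''1/1/2015'' ENDING ''12/31/2024'' EVERY 1 MONTH)',
  '',                                -- keep the existing column definitions
  '',                                -- no additional options
  'MOVE'                             -- run all phases in one call
);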

From Postgres table to KSQL table with updates tracking

My task is to transfer data from a Postgres table to a KSQL table (for future joins with streams). Let's imagine the table has three records:
id | name | description
-------------------------
1 | name1 | description1
2 | name2 | description2
3 | name3 | description3
This is easy to do by means of the Kafka JdbcSourceConnector. But there is one little problem: data in the table may change, and the changes must appear in the KTable too.
According to the documentation there is no way to track changes except bulk mode. But bulk mode takes absolutely all rows and inserts them into the topic.
I thought about setting up bulk mode for the connector, creating a KStream for that topic, and creating a KTable for that stream...
And here I do not know what to do. How can I make sure that changes in the Postgres table end up in the KTable too?
Bulk mode would work: you just define the key of the stream, and then new bulk writes will update the KTable rows with the same key. In other words, you need to ensure the primary keys don't change in your database.
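A hedged sketch in ksqlDB syntax of materializing the latest row per primary key (topic, stream, and table names are hypothetical; it assumes the connector writes JSON values that include the id column):
-- Stream over the topic the JDBC connector writes to.
CREATE STREAM items_stream (id INT, name VARCHAR, description VARCHAR)
  WITH (KAFKA_TOPIC='pg-items', VALUE_FORMAT='JSON');

-- Materialize the latest value per key; each bulk run overwrites older values.
CREATE TABLE items_table AS
  SELECT id,
         LATEST_BY_OFFSET(name) AS name,
         LATEST_BY_OFFSET(description) AS description
  FROM items_stream
  GROUP BY id
  EMIT CHANGES;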
Alternatively, Debezium is a CDC connector for Kafka Connect.
The JDBC source doesn't capture UPDATE queries, as you've stated.
Debezium will produce records that contain the previous and new versions of the modified rows.

DB2 Partitioning

I know how partitioning in DB2 works, but I am not aware of where these partition values actually get stored. After writing a create-partition query, for example:
CREATE TABLE orders(id INT, shipdate DATE, …)
PARTITION BY RANGE(shipdate)
(
STARTING '1/1/2006' ENDING '12/31/2006'
EVERY 3 MONTHS
)
After running the above query we know that partitions of orders are created for every 3 months, and when we run a select query the query engine refers to these partitions. I am curious to know where this actually gets stored: in the same table, or does DB2 have a different table where the partition values for every table are stored?
Thanks,
Table partitions in DB2 are stored in tablespaces.
For regular tables (if table partitioning is not used) table data is stored in a single tablespace (not considering LOBs).
For partitioned tables multiple tablespaces can be used for their partitions.
This is achieved by the IN clause of the CREATE TABLE statement.
CREATE TABLE parttab
...
in TBSP1, TBSP2, TBSP3
In this example the first partition will be stored in TBSP1, the second in TBSP2, the third in TBSP3, the fourth in TBSP1, and so on.
Table partitions are named in DB2 (by default PART1..PARTn), and all these details, including the specified partition ranges, can be looked up in the system catalog view SYSCAT.DATAPARTITIONS.
See also http://www-01.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?cp=SSEPGG_10.5.0%2F2-12-8-27&lang=en
The column used as the partitioning key can be looked up in SYSCAT.DATAPARTITIONEXPRESSION.
There is also a longer syntax for creating partitioned tables where the partition names can be explicitly specified, as well as the tablespaces where the partitions will be stored.
To applications, partitioned tables look like a single normal table.
Partitions can be detached from a partitioned table. In this case a partition is "disconnected" from the partitioned table and converted to a stand-alone table without moving the data (and, conversely, an existing table can be attached as a new partition).
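For illustration, detaching a partition looks like this (the partition and target table names are assumed; the detached table can then be archived or dropped without a long-running DELETE):
ALTER TABLE orders DETACH PARTITION part1 INTO orders_2006_q1;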
best regards
Michael
After a bit of research I finally figured it out and want to share this information; I hope it may be useful to others.
How to see these key values? For LUW (Linux/Unix/Windows) you can see the keys in the Table Object Editor or the Object Viewer Script tab. For z/OS there is an Object Viewer tab called "Limit Keys". I've opened issue TDB-885 to create an Object Viewer tab for LUW tables.
A simple query to check these values:
SELECT * FROM SYSCAT.DATAPARTITIONS
WHERE TABSCHEMA = ? AND TABNAME = ?
ORDER BY SEQNO
reference: http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0021353.html?lang=en
DB2 creates a separate physical location for each partition, so each partition will have its own tablespace. When you SELECT from this partitioned table, your SQL may go directly to a single partition or it may span many of them, depending on the predicates in your SQL. This may also allow your SQL to run in parallel, i.e. many tablespaces can be accessed concurrently to speed up the SELECT.