Copy data activity in ADF is not deleting data in Azure Table Storage - azure-data-factory

Right now, I am using ADF CopyData activity to copy the data from Azure SQL to Azure Table Storage.
It inserts/replaces/updates data when the pipeline is triggered. This behaviour is controlled by the "Insert type" option in the Sink of the Copy Data activity.
However, it does not delete records in the destination (ATS table).
How can I sync Azure SQL data with Azure Table Storage (including deleted data)?
Example:
Source SQL table: Employee
Id  Name
--  -----
1   User1
2   User2
Now, after running the Copy Data activity, these two rows are synced to ATS.
Destination ATS Table: Employee
PartitionKey  RowKey   Timestamp         Name
------------  -------  ----------------  -----
1             NewGuid  2022-07-22 11:30  User1
2             NewGuid  2022-07-22 11:30  User2
Now the source SQL table is updated as below:
Id  Name
--  -----
1   User2
3   User3
Id 2 was deleted, the Name for Id 1 was updated, and Id 3 was added.
If I run the pipeline again, ATS is updated as below:
PartitionKey  RowKey   Timestamp         Name
------------  -------  ----------------  -----
1             NewGuid  2022-07-22 12:30  User2
2             NewGuid  2022-07-22 11:30  User2
3             NewGuid  2022-07-22 12:30  User3
Here PartitionKey 2 is not deleted, although the insert and update were done.
How can I delete this record as well when syncing with the Copy Data activity?

Have you tried using a stored procedure for that case?

I reproduced this and got the same result.
AFAIK, the Copy activity does not delete any data from the target; it only overwrites target data with the source data.
In the Sink settings it is also described as an insert type.
Azure Table storage only supports updating or replacing a record from outside (which effectively deletes the old data) when the new record's PartitionKey and RowKey match an existing record in the target.
That does not happen for the record that was removed from the source (PartitionKey 2), so it is never replaced.
You can raise a feature request for deletion support in the Copy activity for this type of storage here.
You can try this manual approach to delete such records when your data is small.
Create a new unique column used only for the RowKey of Table storage. Map your regular table Id to the PartitionKey and this column to the RowKey. Then, whenever you want to delete old records and update with new ones, supply the old RowKey value for that record so the existing entity is replaced.
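As a minimal sketch of the kind of source query this workaround implies (assuming a persisted RowKeyGuid column has been added to the SQL table; the table and column names are illustrative):
-- Source query for the Copy Data activity: the SQL Id becomes the PartitionKey
-- and the persisted RowKeyGuid becomes the RowKey, so re-running the pipeline
-- replaces the same entity instead of adding a new one.
SELECT
    CAST(Id AS varchar(50))         AS PartitionKey,
    CAST(RowKeyGuid AS varchar(36)) AS RowKey,
    Name
FROM dbo.Employee;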

Related

Azure Data Factory Copy Pipeline with Geography Data Type

I am trying to get a geography data type from a production DB to another DB on a nightly basis. I really wanted to leverage upsert as the write activity, but it seems that geography is not supported with this method. I was reading a similar post about bringing the data through ADF as a well-known text (WKT) data type and then converting it, but I keep getting confused about what to do with the data once it is brought over as well-known text. I would appreciate any advice, thank you.
I tried to utilize ADF pipelines and data flows, and tried to convert the data type once it was in the destination, but then I was not able to run the pipeline again.
I tried to upsert data with the geography datatype from one Azure SQL database to another using the Copy activity and got an error message.
Then I did the upsert using a Data Flow activity. Below are the steps.
A source table is created and taken as the source in the data flow:
CREATE TABLE SpatialTable
(
    id int,
    GeogCol1 geography,
    GeogCol2 AS GeogCol1.STAsText()
);
INSERT INTO SpatialTable (id,GeogCol1)
VALUES (1,geography::STGeomFromText('LINESTRING(-122.360 46.656, -122.343 46.656 )', 4326));
INSERT INTO SpatialTable (id,GeogCol1)
VALUES (2,geography::STGeomFromText('POLYGON((-122.357 47.653 , -122.348 47.649, -122.348 47.658, -122.358 47.658, -122.358 47.653))', 4326));
Then an Alter Row transformation is added, and in the Alter Row conditions, Upsert if isNull(id)==false() is given. (The sink table is upserted based on the column id.)
Then, in the Sink, the dataset for the target table is given. In the sink settings, the update method is set to Allow upsert and the required key column is given (here the column id is selected).
When the pipeline is run for the first time, data is inserted into the target table.
When the pipeline is run a second time, after updating existing data and inserting new records in the source, the data is upserted correctly.
Source data is changed for id=1 and a new row is inserted with id=3.
The sink data reflects the changes made in the source.
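If you do want to pursue the well-known-text route with the Copy activity instead, a minimal T-SQL sketch of the sink side could look like this, assuming the WKT strings are landed in a staging table first (the staging table and its columns are illustrative, not part of the original setup):
-- Staging table that the copy lands WKT strings into.
CREATE TABLE SpatialTable_Staging
(
    id int PRIMARY KEY,
    GeogWkt nvarchar(max)
);

-- Convert the WKT back to geography while upserting into the real table.
MERGE SpatialTable AS tgt
USING SpatialTable_Staging AS src
    ON tgt.id = src.id
WHEN MATCHED THEN
    UPDATE SET tgt.GeogCol1 = geography::STGeomFromText(src.GeogWkt, 4326)
WHEN NOT MATCHED THEN
    INSERT (id, GeogCol1)
    VALUES (src.id, geography::STGeomFromText(src.GeogWkt, 4326));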

MS Access - PostgreSQL auto number

I have a frontend in MS Access and a backend in PostgreSQL. The database used to be on SQL Server before moving to PostgreSQL. Most tables have a column called ID that is an AutoNumber in Access, so it gets the next ID automatically. It used to work fine in plain Access (the database started as an .accdb file) and then when it was moved to SQL Server.
After moving to PostgreSQL, though, the column is now labeled as "Number" in Access and it doesn't give new IDs when a new record is added, which causes a "Primary Key cannot be null" error. The column ID is set to INTEGER with IDENTITY: Always and an increment of 1. Obviously I cannot alter the table in Access and change Number into AutoNumber, because it's read-only as a linked table.
How can I make MS Access recognize the IDENTITY column of PostgreSQL as an AutoNumber?
Edit with some more information:
I have a table called Orders and a table called Jobs. Each Order has multiple Jobs and they are connected with an OrderID column. Adding an Order without any Jobs works fine (even though the column is "Number" instead of "AutoNumber", Postgres creates new unique values correctly). The problem comes when I add a Job to that Order: I get the "Primary Key cannot be null" error.
I added the Job ID field to be visible on the form and it doesn't have any value while I edit the other columns of the Job form. I used to create a new ID using Me.Dirty = False on the Job-adding form, and that's the exact point where the error is reported. So Me.Dirty = False doesn't create a new ID on Postgres, but it creates one just fine on Access or SQL Server.
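For reference, the column definition being described corresponds to something like the following in PostgreSQL (table and column names are illustrative; this only restates the setup, it is not a fix):
-- An "IDENTITY: Always" column with an increment of 1, as described above.
CREATE TABLE jobs (
    id      integer GENERATED ALWAYS AS IDENTITY (INCREMENT BY 1) PRIMARY KEY,
    orderid integer NOT NULL
);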

From Postgres table to KSQL table with updates tracking

My task is to transfer data from a Postgres table to a KSQL table (for future joins with streams). Let's imagine the table has three records:
id | name  | description
---+-------+--------------
1  | name1 | description1
2  | name2 | description2
3  | name3 | description3
It is easy to do by means of the Kafka JdbcSourceConnector. But there is one little problem: data in the table may change, and the changes must end up in the KTable too.
According to the documentation there is no way to track changes except bulk mode. But bulk mode takes absolutely all rows and inserts them into the topic.
I thought of setting up bulk mode for the connector, creating a KStream for that topic, and creating a KTable from that stream...
And here I do not know what to do. How can I make sure changes in the Postgres table end up in the KTable too?
Bulk mode would work: you just define the key of the stream, and then new bulk writes will update the KTable rows with the same key. In other words, you need to ensure the primary keys don't change in your database.
Alternatively, Debezium provides CDC connectors for Kafka Connect.
The JDBC source doesn't capture UPDATE queries, as you've stated.
Debezium will produce records that contain both the previous and the new versions of the modified rows.
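A minimal ksqlDB sketch of the bulk-mode approach, assuming the connector writes to a topic named postgres_table and the record key has been set to id (for example via the ValueToKey/ExtractField single message transforms); the topic name and formats are illustrative:
-- A TABLE keyed on id: each bulk load produces fresh records per key,
-- and the latest record for a key becomes the current row of the table.
CREATE TABLE source_table (
    id INT PRIMARY KEY,
    name VARCHAR,
    description VARCHAR
) WITH (
    KAFKA_TOPIC = 'postgres_table',
    VALUE_FORMAT = 'JSON'
);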

Moving a table from one database to another - Only insert missing rows

I have two databases that are alike, one called datastore and the other called datarestore.
datarestore is a copy of datastore which was created from a backup image. The problem is that I accidentally deleted a little too much data from datastore.
Both databases are located on different AWS instances and I typically connect to them using pgAdmin III or Python to create scripts that handle the data.
I want to get the rows that I accidentally deleted from datastore, which still exist in datarestore, back into datastore. Does anyone have any idea how this can be achieved? Both databases contain close to 1,000,000,000 rows and are on version 9.6.
I have seen some backup/import/restore options within pgAdmin III; I just don't know how they work and whether they support my needs. I also thought about creating a Python script, but querying my database has become pretty slow, so this does not seem to be an option either.
-----------------------------------------------------
| id (serial - auto incrementing int) | - primary key
| did (varchar) |
| sensorid (int) |
| timestamp (bigint) |
| data (json) |
| db_timestamp (bigint) |
-----------------------------------------------------
If you preserved primary keys between those databases, you could create foreign tables pointing from datarestore to datastore, check which keys are missing (using, for example, select pk from old_table except select pk from new_table), and fetch the missing rows through the same foreign table you created. This should limit the first check for missing PKs to index-only scans (plus network transfer), followed by an index scan to fetch the missing data. If you are missing only a small part of the data, it shouldn't take long.
If you require a more detailed example, I'll update my answer.
EDIT:
Example of foreign table/server usage
These commands need to be executed on datarestore (or on datastore if you choose to push data instead of pulling it).
If you don't have the foreign data wrapper "installed" yet:
CREATE EXTENSION postgres_fdw;
This will create a virtual server on your datarestore host. It is just metadata pointing at the foreign server:
CREATE SERVER foreign_datastore FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'foreign_hostname', dbname 'foreign_database_name',
port '5432_or_whatever_you_have_on_datastore_host');
This tells your datarestore host which user it should connect as when using the FDW on the server foreign_datastore. It will be used only for your_local_role_name logged in on datarestore:
CREATE USER MAPPING FOR your_local_role_name SERVER foreign_datastore
OPTIONS (user 'foreign_username', password 'foreign_password');
You need to create a schema on datarestore; it is where the new foreign tables will be created.
CREATE SCHEMA schema_where_foreign_tables_will_be_created;
This will log in to the remote host and create foreign tables on datarestore, pointing to the tables on datastore. Only the table definitions are created this way; no data is copied, just the structure of the tables.
IMPORT FOREIGN SCHEMA foreign_datastore_schema_name_goes_here
FROM SERVER foreign_datastore INTO schema_where_foreign_tables_will_be_created;
This will return the list of ids that are missing from your datarestore database for this table:
SELECT id FROM foreign_datastore_schema_name_goes_here.table_a
EXCEPT
SELECT id FROM datarestore_schema.table_a
You can either store them in a temp table (CREATE TABLE table_a_missing_pk AS [query from above here])
or use them right away:
INSERT INTO datarestore_schema.table_a (id, did, sensorid, timestamp, data, db_timestamp)
SELECT id, did, sensorid, timestamp, data, db_timestamp
FROM foreign_datastore_schema_name_goes_here.table_a
WHERE id = ANY((
SELECT array_agg(id)
FROM (
SELECT id FROM foreign_datastore_schema_name_goes_here.table_a
EXCEPT
SELECT id FROM datarestore_schema.table_a
) sub
)::int[])
From my tests, this should push down (that is, send to the remote host) something like this:
Remote SQL: SELECT id, did, sensorid, timestamp, data, db_timestamp
FROM foreign_datastore_schema_name_goes_here.table_a WHERE ((id = ANY ($1::integer[])))
You can make sure it does by running EXPLAIN VERBOSE on your full query to see what plan it will execute; you should see Remote SQL in there.
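For instance, prefixing the full insert from above (same placeholder names as in this answer):
EXPLAIN VERBOSE
INSERT INTO datarestore_schema.table_a (id, did, sensorid, timestamp, data, db_timestamp)
SELECT id, did, sensorid, timestamp, data, db_timestamp
FROM foreign_datastore_schema_name_goes_here.table_a
WHERE id = ANY((
    SELECT array_agg(id)
    FROM (
        SELECT id FROM foreign_datastore_schema_name_goes_here.table_a
        EXCEPT
        SELECT id FROM datarestore_schema.table_a
    ) sub
)::int[]);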
In case it does not work as expected, you can instead create the temp table as mentioned earlier and make sure that this temp table is on the datastore host.
An alternative approach would be to create a foreign server on datastore pointing to datarestore and push data from your old database to the new one (you can insert into foreign tables). This way you won't have to worry about the list of ids not being pushed down to datastore, which would otherwise mean fetching all the data and filtering it afterwards (which would be extremely slow).

Data replication from MySQL to Redshift

I would like to load data from MySQL to Redshift.
My data values can change at any time, so I need to capture both the old and the new records in Redshift.
Modified records need to be archived; only the new records should be reflected in Redshift.
For example:
MySQL table:
ID NAME SAL
-- ---- -----
1 XYZ 10000
2 ABC 20000
For the first load into Redshift (this should be the same as the MySQL table):
ID NAME SAL
-- ---- ----
1 XYZ 10000
2 ABC 20000
For the second load (I changed the salary of employee 'XYZ' from 10000 to 30000):
ID NAME SAL
-- ---- ----
1 XYZ 30000
2 ABC 20000
The above table should be reflected in Redshift, and the modified record (1 XYZ 10000) should be archived.
Is this possible?
How many rows are you expecting?
One approach would be to add a timestamp column that gets updated to the current time whenever a record is modified.
Then, with an external process doing a replication run, you could get the max timestamp from Redshift, select any records from MySQL that are newer than that timestamp and, if you use the COPY method to load into Redshift, dump them to S3.
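A minimal MySQL sketch of that idea, with illustrative table and column names (the external process and the S3 dump are not shown):
-- Add a column that MySQL keeps up to date on every modification.
ALTER TABLE employee
    ADD COLUMN updated_at TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP
        ON UPDATE CURRENT_TIMESTAMP;

-- The replication run then pulls only rows changed since the last load;
-- @last_max_ts would be set from SELECT MAX(updated_at) in Redshift.
SET @last_max_ts = '1970-01-01 00:00:00';
SELECT id, name, sal, updated_at
FROM employee
WHERE updated_at > @last_max_ts;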
To load the new records and archive the old ones, you'll need a variation of the Redshift upsert pattern. This involves loading into a temporary (staging) table, identifying the records in the original table to be archived, moving those to an archive table or UNLOADing them to an S3 archive, and then using ALTER APPEND to move the new records into the main table, as sketched below.
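A minimal Redshift sketch of that pattern, assuming staging and archive tables with the same layout as the main table (the table names, S3 path and IAM role are placeholders):
-- Load the latest extract into the staging table.
COPY employee_stage
FROM 's3://your-bucket/employee/'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-copy-role'
CSV;

BEGIN;
-- Archive the current versions of the rows that are about to change.
INSERT INTO employee_archive
SELECT e.*
FROM employee e
JOIN employee_stage s ON e.id = s.id;

-- Remove those rows from the main table.
DELETE FROM employee
USING employee_stage s
WHERE employee.id = s.id;
COMMIT;

-- Move the new versions in; ALTER TABLE APPEND empties the staging table
-- and cannot run inside a transaction block.
ALTER TABLE employee APPEND FROM employee_stage;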
See this site https://flydata.com/blog/mysql-to-amazon-redshift-replication.
A better option is Change Data Capture (CDC). CDC is a technique that captures changes made to the data in MySQL and applies them to the destination Redshift table. It's similar to the technique mentioned by systemjack, but it only imports changed data, not the entire database.
To use the CDC method with a MySQL database, you must use the binary log (binlog). The binlog allows you to capture change data as a stream, enabling near real-time replication.
The binlog not only captures data changes (INSERT, UPDATE, DELETE) but also table schema changes such as ADD/DROP COLUMN. It also ensures that rows deleted from MySQL are deleted in Redshift as well.
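A quick MySQL sketch of the server-side settings a binlog-based CDC tool typically needs (these are standard MySQL system variables; the CDC pipeline itself is not shown, and log_bin usually has to be enabled in the server configuration rather than at runtime):
-- Check whether binary logging is enabled and which format is in use.
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';

-- Row-based logging records the actual changed rows, which is what
-- CDC tools read; this requires sufficient privileges.
SET GLOBAL binlog_format = 'ROW';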