Set up delta load in Azure Data Factory - T-SQL

I have an on-premises SQL database that I want to get data from. The database has a column called last_update that records when a row was last updated. The first time I run my pipeline I want it to copy everything from the on-premises database to an Azure database. On subsequent runs I only want to copy the rows that have been updated since the last run, i.e. everything where last_update is greater than the time of the last run. Is there a way of using information about the time of the last run in a pipeline? Is there any other good way of achieving what I want?

I think you can do this by developing a custom copy activity. You can add your own transformation/processing logic and use the activity in a pipeline.
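The question's own idea of filtering on the time of the last run can also be kept in plain T-SQL with a small watermark table; the following is only a minimal sketch with hypothetical table and column names (in ADF the watermark value would typically be read by a Lookup activity and passed into the copy's source query as a parameter):

-- 1) Hypothetical watermark table, e.g. kept in the target Azure database
CREATE TABLE dbo.watermark (
    table_name sysname      NOT NULL PRIMARY KEY,
    last_run   datetime2(3) NOT NULL
);

-- 2) At the start of each run, read the old watermark
--    (in ADF this would usually be a Lookup activity feeding a parameter)
DECLARE @last_run datetime2(3) =
    (SELECT last_run FROM dbo.watermark WHERE table_name = 'my_source_table');

-- 3) Source query for the copy: only rows changed since the last run
SELECT *
FROM   dbo.my_source_table
WHERE  last_update > @last_run;

-- 4) After a successful copy, move the watermark forward
UPDATE dbo.watermark
SET    last_run = SYSUTCDATETIME()
WHERE  table_name = 'my_source_table';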

Related

Real Time Data Copy from Oracle DB to ADLS via Azure Data Factory?

I want to copy my data from my source, which is an Oracle DB, to the sink, which is ADLS Gen2, via Azure Data Factory. But I want to make this replication real time, meaning whenever there is a change in my Oracle database it should be propagated to ADLS storage.
How can I achieve this via ADF? Any good suggestions?
No, we can't achieve a real-time copy with Data Factory. The closest you can get is to set a schedule trigger for your pipeline, but even with this trigger the minimum recurrence for the next pipeline execution is 5 minutes.
You can create the pipeline with a Copy activity or a Data Flow activity to copy the data from Oracle to ADLS Gen2, then create a schedule trigger to run the pipeline. The pipeline can only be triggered every 5 minutes at best.
Since, as you said, the table data changes every second, that is not a real-time copy.
HTH.
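Since the pipeline can run at most every few minutes anyway, one partial mitigation (a sketch only, assuming the Oracle table carries a last-modified timestamp column; the names are illustrative) is to have each run copy just the rows touched since the previous window:

-- Hypothetical Oracle source query for a pipeline triggered every 5 minutes;
-- assumes the source table has a LAST_MODIFIED timestamp column
SELECT *
FROM   sales.orders
WHERE  last_modified >= SYSTIMESTAMP - INTERVAL '5' MINUTE;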

Cloud SQL: export data to CSV periodically avoiding duplicates

I want to export the data from Cloud SQL (Postgres) to a CSV file periodically (once a day, for example), and each time the DB rows are exported they must not be exported again in the next export task.
I'm currently using a POST request to perform the export task using Cloud Scheduler. The problem here (or at least as far as I know) is that it won't be able to export and delete (or update the rows to mark them as exported) in a single HTTP export request.
Is there any possibility to delete (or update) the rows which have been exported automatically with any Cloud SQL parameter in the HTTP export request?
If not, I assume it should be done in a Cloud Function triggered by Pub/Sub (using Scheduler to send data once a day to Pub/Sub), but is there any optimal way to take all the IDs of the rows retrieved from the select statement (which will be used in the export) to delete (or update) them later?
You can export and delete (or update) at the same time using RETURNING.
\copy (DELETE FROM pgbench_accounts WHERE aid<1000 RETURNING *) to foo.txt
The problem would be in the face of crashes. How can you know that foo.txt has been written and flushed to disk before the DELETE is allowed to commit? Or the reverse: foo.txt is partially (or fully) written, but a crash prevents the DELETE from committing.
Can't you make the system idempotent, so that exporting the same row more than once doesn't create problems?
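If deleting feels too aggressive, the same RETURNING trick also works with an update that only flags the rows as exported; a sketch, assuming a boolean exported column has been added to the table:
\copy (UPDATE pgbench_accounts SET exported = true WHERE exported = false RETURNING *) to foo.txt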
You could use a setup like the following to achieve what you are looking for:
1. Create a Cloud Function, subscribed to a Pub/Sub topic, that extracts the information from the database.
2. Create a Pub/Sub topic to trigger that function.
3. Create a Cloud Scheduler job that invokes the Pub/Sub trigger.
4. Run the Cloud Scheduler job.
5. Then create a trigger which activates another Cloud Function to delete all the required data from the database once the CSV has been created (see the sketch after the links below).
Here I leave you some documents which could help you if you decide to follow this path:
Using Pub/Sub to trigger a Cloud Function: https://cloud.google.com/scheduler/docs/tut-pub-sub
Connecting to Cloud SQL from Cloud Functions: https://cloud.google.com/sql/docs/mysql/connect-functions
Cloud Storage tutorial: https://cloud.google.com/functions/docs/tutorials/storage
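For step 5, the cleanup function would essentially run one statement against Cloud SQL once the CSV exists; this is a sketch with hypothetical names, assuming rows carry an insertion-date column like the one used later in this thread:

-- Hypothetical cleanup: remove (or alternatively flag) the day that was just exported
DELETE FROM public.tablename
WHERE  insertion_date = CURRENT_DATE - 1;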
Another method, aside from #jjanes' answer, would be to partition your database by date. This would allow you to create an index on the date, making exporting or deleting a day's entries very easy. With this implementation, you could also create a cron job that deletes all partitions (tables) older than X days.
The documentation provided will walk you through setting up a ranged partition:
The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
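A minimal sketch of such a range partition in PostgreSQL, with illustrative names:

-- Parent table partitioned by insertion date
CREATE TABLE exports (
    id             bigint NOT NULL,
    payload        text,
    insertion_date date   NOT NULL
) PARTITION BY RANGE (insertion_date);

-- One partition per day; exporting or dropping a single day's data becomes trivial
CREATE TABLE exports_2020_05_01 PARTITION OF exports
    FOR VALUES FROM ('2020-05-01') TO ('2020-05-02');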
Thank you for all your answers. There are multiple ways of doing this, so I'm going to explain how I did it.
In the database I have included a column which contains the date when the data was inserted.
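(For reference, a column like that can be added with a default so existing insert statements don't need to change; the names below are simply the ones used in the query that follows.)

-- Hypothetical: add the insertion-date column, populated automatically on insert
ALTER TABLE public."tablename"
    ADD COLUMN "Insertion_Date" date DEFAULT CURRENT_DATE;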
I used a Cloud Scheduler job with the following body:
{"exportContext":{"fileType": "CSV", "csvExportOptions" :{"selectQuery" : "select \"column1\", \"column2\",... , \"column n\" from public.\"tablename\" where \"Insertion_Date\" = CURRENT_DATE - 1" },"uri": "gs://bucket/filename.csv","databases": ["postgres"]}}
This scheduler will be triggered once a day and it will export only the data of the previous day.
Also, note that in the query used in Cloud Scheduler you can choose which columns you want to export; this way you can avoid exporting the Insertion_Date column itself and use it only as an auxiliary column.
Finally, Cloud Scheduler will automatically create the CSV file in a bucket.

How to trigger a Google Cloud Composer DAG based on the latest record in a control table

I have a DAG that loads data into a number of raw tables. There is a control table that stores the list of tables and when they were last updated by the DAG. This is all managed by a different team. I am trying to create a DAG that runs a query on one of the raw tables and loads it into a persistent table. I would like to run the DAG as soon as I see a timestamp in the control table that is later than what I have already processed.
I am new to Cloud Composer, can you please let me know how I can accomplish this?
Thanks
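For context, the check being described boils down to a query like the following against the control table; every name here is hypothetical, and the result could drive a sensor or short-circuit task in the DAG:

-- Hypothetical: has the raw table been refreshed since the last load we processed?
SELECT c.table_name,
       c.last_updated
FROM   control_table AS c
WHERE  c.table_name = 'raw_table_1'
  AND  c.last_updated > (SELECT last_processed_ts
                         FROM   processing_state
                         WHERE  table_name = 'raw_table_1');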

COPY command runs but no data being copied from Teradata (on-prem)

I am running into an issue where I have set up a pipeline that gets a list of tables from Teradata using a Lookup activity and then passes those items to a ForEach activity, which copies the data in parallel and saves it as a gzipped file. The requirement is essentially to archive some tables that are no longer being used.
For this pipeline I am not using any partition options, as most of the tables are small and I wanted to keep it flexible.
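(For reference, the table list for the Lookup activity can come from the Teradata data dictionary with something along these lines; DBC.TablesV and the database name are assumptions.)

-- Hypothetical lookup query: base tables to archive from one database
SELECT TRIM(TableName) AS TableName
FROM   DBC.TablesV
WHERE  DatabaseName = 'archive_db'
  AND  TableKind    = 'T';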
[Screenshot: Pipeline]
[Screenshot: Copy activity within the ForEach activity]
99% of the tables ran without issues and were copied as gz files into blob storage, but two tables in particular run for a long time (approx. 4 to 6 hours) without any of the data being written to the blob storage account.
Note that the image above says "Cancelled", but that was done by me. Before that I had a run time as described above, but still no data being written. This is affecting only 2 tables.
I checked with our Teradata team and those tables are not being used by anyone (hence they are not locked). I also looked at "Teradata Viewpoint" (the admin tool) and at the query monitor, and saw that the query was running on Teradata without issues.
Any insight would be greatly appreciated.
Looking at the issue you mention, it looks like the data size of the table is more than can be handled in a single copy (as you are not using any partition options).
Use a partition option to optimize performance and handle the data.
Link
Just in case someone else comes across this, the way I solved this was to create a new data store connection called "TD_Prod_datasetname". The purpose of this dataset is not to point to a specific table, but to just accept an "@item().TableName" value.
This data source contains two main values. The first is the @dataset().TeradataName expression:
[Screenshot: Dataset property]
I only came up with that after doing a little bit of digging in Google.
I then created a parameter called "TeradataTable" as String.
I then updated my pipeline. As above, the two main activities remain the same: I have a Lookup and then a ForEach activity (where the ForEach gets the item values).
However, in the COPY activity inside the ForEach activity I updated the source. Instead of getting "item().Name" I am passing through @item().TableName.
This then enabled me to select the "Table" option, and because I am using a table instead of a query I can use the "Hash" partition option. I left the partition column blank because, according to the Microsoft documentation, it will automatically find the Primary Key to use for this.
The only issue I ran into with this was that if a table does not have a Primary Key, that item will fail and will need to be handled either by a different process or manually outside of this job.
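To spot such tables up front, one could query the Teradata dictionary first; this is a rough sketch, assuming the DBC.TablesV and DBC.IndicesV views (the index type codes may need checking against your Teradata release):

-- Hypothetical check: tables in the database that have no primary index or primary key
SELECT t.TableName
FROM   DBC.TablesV AS t
LEFT JOIN DBC.IndicesV AS i
       ON  i.DatabaseName = t.DatabaseName
       AND i.TableName    = t.TableName
       AND i.IndexType   IN ('P', 'Q', 'K')   -- primary index (non/partitioned) or primary key
WHERE  t.DatabaseName = 'archive_db'
  AND  t.TableKind    = 'T'
  AND  i.TableName IS NULL;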
Because of this change, the files that previously just hung there and did not copy now copied successfully into our blob storage account.
Hope this helps someone else that wants to see how to create parallel copies using Teradata as a source and pass through multiple table values.

Real time Change Data Capture in Talend

I am trying to set up real-time change data capture between two different MySQL databases using Talend Studio. I was able to successfully create a job that uses the Publish/Subscribe model to pick up only the changed data from the source and populate it in the target database.
I could not find documentation on setting up CDC in real time, i.e. as soon as a new row is inserted in the source database it is picked up by the job and populated in the target database. The Talend job would be running continuously to look for possible changes in the source.
My question: is scheduling the Talend job with some scheduler at the desired interval the only option in this case?
Thanks in advance.
You could also use database triggers for insert, update and delete, and have those triggers push the data somewhere or start a downstream process.
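A minimal sketch of that trigger idea in MySQL, with hypothetical table names; here changes are written to a change-log table that the Talend job (or any other process) can poll:

-- Hypothetical change-log table populated by triggers on the source table
CREATE TABLE orders_changelog (
    id          BIGINT AUTO_INCREMENT PRIMARY KEY,
    order_id    INT         NOT NULL,
    change_type VARCHAR(10) NOT NULL,
    changed_at  TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
);

DELIMITER $$
CREATE TRIGGER orders_after_insert
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
    -- Record the new row's key so a downstream job can pick it up
    INSERT INTO orders_changelog (order_id, change_type)
    VALUES (NEW.id, 'INSERT');
END$$
DELIMITER ;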