Azure copy data activity - azure-data-factory

Assume that I'm trying to copy 1000 records from a database table to an Azure SQL DB/Synapse using the ADF Copy activity. If the Copy activity fails after copying 500 records, is it possible to re-run/restart the pipeline so that the Copy activity skips the records already copied (the 500 records copied in the earlier run) and resumes the copy operation with the remaining 500 records?
Thank you.

Unfortunately, based on my understanding, the Copy activity would begin from the start; it cannot resume from the point where it failed (in the case of a bulk copy).
There is a way (though a bad way): you can use a ForEach loop with as many iterations as there are rows, copy one row per iteration, and maintain a watermark.
That way, if the copy fails, the rerun starts from the watermark, but this is a poor design and performs badly compared with simply allowing a full rerun.
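
To make the watermark idea concrete, here is a minimal in-memory sketch in Python. It is not ADF code: the source and sink are plain lists, `copy_with_watermark`, `state` and the `fail_after` hook are hypothetical names invented for the illustration, and it copies in batches rather than one row per iteration.

```python
def copy_with_watermark(source_rows, sink, state, batch_size=100, fail_after=None):
    """Copy rows whose id is above state["watermark"], batch by batch.

    fail_after is a hypothetical test hook: raise once that many rows have
    been copied in this run, to imitate a failure in the middle of a copy.
    """
    pending = sorted(
        (row for row in source_rows if row["id"] > state["watermark"]),
        key=lambda row: row["id"],
    )
    copied = 0
    for start in range(0, len(pending), batch_size):
        if fail_after is not None and copied >= fail_after:
            raise RuntimeError("simulated copy failure")
        batch = pending[start:start + batch_size]
        sink.extend(batch)                    # write the batch to the sink
        state["watermark"] = batch[-1]["id"]  # record progress after the write
        copied += len(batch)
    return copied


source = [{"id": i} for i in range(1, 1001)]
sink, state = [], {"watermark": 0}

try:
    copy_with_watermark(source, sink, state, fail_after=500)
except RuntimeError:
    pass  # the first run stops after 500 rows; the watermark stays at 500

# The rerun only picks up ids above the stored watermark.
resumed = copy_with_watermark(source, sink, state)
```

In a real pipeline the watermark would have to live somewhere durable (a control table, for instance), and the sink write and the watermark update would need to succeed or fail together, otherwise a rerun could still duplicate a batch.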

Related

Azure Data Factory - Copy Data Upsert only updating a single row at a time

I'm using Data Factory (well, Synapse pipelines) to ingest data from sources into a staging layer. I am using the Copy Data activity with UPSERT. However, I found the performance of incrementally loading large tables particularly slow, so I did some digging.
So my incremental load brought in 193k new/modified records from the source. These get stored in the transient staging/landing table that the Copy Data activity creates in the database in the background. In this table it adds a column called BatchIdentifier; however, the batch identifier value is different for every row.
Profiling the load, I can see individual statements issued for each BatchIdentifier, so effectively it's processing the incoming data row by row rather than using a batch process to do the same thing.
I tried setting the sink writeBatchSize property on the Copy Data activity to 10k, but that doesn't make any difference.
Has anyone else come across this, or found a better way to perform a dynamic upsert without having to specify all the columns in advance (which I'm really hoping to avoid)?
This is the SQL statement issued 193k times on my load as an example.
It does a check to see if the record exists in the target table; if so it performs an update, otherwise it performs an insert. The logic makes sense, but it's performing this on a row-by-row basis when it could just be done in bulk.
Is your primary key definition in the source the same as in the sink?
I just ran into this same behavior when the primary keys in the source and destination tables used different columns.
It also appears ADF/Synapse does not use MERGE for upserts, but its own IF EXISTS THEN UPDATE ELSE INSERT logic, so there may be something behind the scenes making it select single rows for those BatchId executions.
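
For contrast with the row-by-row pattern described in the question, a set-based upsert from a staging table can be sketched as below. This is only an illustration of doing it "in bulk": it uses an in-memory SQLite database from Python, the `target` and `staging` tables and the `bulk_upsert` helper are hypothetical, and it is not the SQL that ADF/Synapse generates.

```python
import sqlite3


def bulk_upsert(conn):
    """Apply every staging row to target with two set-based statements."""
    with conn:  # run both statements in one transaction
        # Update all rows that already exist in the target.
        conn.execute(
            """
            UPDATE target
               SET val = (SELECT s.val FROM staging AS s WHERE s.id = target.id)
             WHERE id IN (SELECT id FROM staging)
            """
        )
        # Insert all staging rows that have no match in the target.
        conn.execute(
            """
            INSERT INTO target (id, val)
            SELECT s.id, s.val
              FROM staging AS s
             WHERE NOT EXISTS (SELECT 1 FROM target AS t WHERE t.id = s.id)
            """
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old"), (2, "old")])
conn.executemany("INSERT INTO staging VALUES (?, ?)", [(2, "new"), (3, "new")])

bulk_upsert(conn)
rows = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
```

The number of statements stays at two no matter how many rows the staging table holds, which is the difference from issuing one IF EXISTS check per BatchIdentifier.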

Azure Data Factory - Degree of copy parallelism

I'm running an Azure Data Factory that copies multiple tables from an on-prem SQL Server to an Azure Data Lake.
So, I set up many Copy activities through the Azure Data Factory designer to execute parallel copies (each activity carries out the extract of one table).
For better resource optimization, I would like to know if there is a way to copy multiple tables with one Copy activity.
I heard of "degree of copy parallelism", but don't know how to use it.
Rgds,
If the question helped, up-vote it. Thanks in advance.
To use one Copy activity for multiple tables, you'd need to wrap a single parameterized Copy activity in a ForEach activity. The ForEach can scale to run multiple sources at one time by setting isSequential to false and setting the batchCount value to the number of threads you want. The default batch count is 20 and the max is 50. Copy Parallelism on a single Copy activity just uses more threads to concurrently copy partitions of data from the same data source.
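
The ForEach settings named above can be sketched as a pipeline fragment, built here as a Python dict and printed as JSON. This is a partial sketch under assumptions: the activity names, the `tableList` pipeline parameter, the source/sink types and the `parallelCopies` value are illustrative, and the Copy activity's dataset inputs and outputs are left out.

```python
import json

MAX_BATCH_COUNT = 50  # upper limit quoted in the answer above


def build_foreach(items_expression, copy_activity, batch_count=20):
    """Wrap one parameterized Copy activity in a parallel ForEach."""
    if not 1 <= batch_count <= MAX_BATCH_COUNT:
        raise ValueError(f"batchCount must be between 1 and {MAX_BATCH_COUNT}")
    return {
        "name": "ForEachTable",  # hypothetical activity name
        "type": "ForEach",
        "typeProperties": {
            "isSequential": False,      # run iterations concurrently
            "batchCount": batch_count,  # how many iterations run at once
            "items": {"value": items_expression, "type": "Expression"},
            "activities": [copy_activity],
        },
    }


# Assumed source/sink types for an on-prem SQL Server to data lake copy.
copy_activity = {
    "name": "CopyOneTable",  # hypothetical activity name
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "SqlServerSource"},
        "sink": {"type": "ParquetSink"},
        "parallelCopies": 4,  # degree of copy parallelism inside one copy
    },
}

foreach = build_foreach("@pipeline().parameters.tableList", copy_activity, batch_count=10)
print(json.dumps(foreach, indent=2))
```

Here `batchCount` controls how many tables are copied at the same time, while `parallelCopies` only affects how a single table's copy is split up.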

COPY command runs but no data being copied from Teradata (on-prem)

I am running into an issue where I have set up a pipeline that gets a list of tables from Teradata using a Lookup activity and then passes those items to a ForEach activity, which copies the data in parallel and saves each table as a gzipped file. The requirement is essentially to archive some tables that are no longer being used.
For this pipeline I am not using any partition options as most of the tables are small and I kept it to be flexible.
Pipeline
COPY activity within ForEach activity
99% of the tables ran without issues and were copied as gz files into blob storage, but two tables in particular run for a long time (approximately 4 to 6 hours) without any of the data being written to the blob storage account.
Note that the image above says "Cancelled", but that was done by me. Before that I had a run time as described above, but still no data being written. This is affecting only 2 tables.
I checked with our Teradata team and those tables are not being used by anyone (hence they are not locked). I also looked at the query monitor in "Teradata Viewpoint" (admin tool) and saw that the query was running on Teradata without issues.
Any insight would be greatly appreciated.
From the issue described, it looks like the data size of the table is more than a single blob can store (as you are not using any partition options).
Use a partition option to optimize performance and hold the data.
Just in case someone else comes across this, the way I solved this was to create a new dataset called "TD_Prod_datasetname". The purpose of this dataset is not to point to a specific table, but just to accept an "item().TableName" value.
This dataset contains two main values. The first is @dataset().TeradataName
Dataset property
I only came up with that after doing a little bit of digging in Google.
I then created a parameter called "TeradataTable" as String.
I then updated my pipeline. As above, the two main activities remain the same. I have a Lookup and then a ForEach activity (where the ForEach gets the item values):
However, in the Copy activity inside the ForEach activity I updated the source. Instead of getting "item().Name" I am passing through @item().TableName:
This enabled me to select the "Table" option, and because I am using Table instead of Query I can use the "Hash" partition. I left the partition column blank because, according to the Microsoft documentation, it will automatically find the primary key to use.
The only issue I ran into with this approach is that a table without a primary key will fail and will need to be run through either a different process or manually outside of this job.
Because of this change, the files that previously just hung there and did not copy were now copied successfully into our blob storage account.
Hope this helps someone else that wants to see how to create parallel copies using Teradata as a source and pass through multiple table values.
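
As a rough sketch of the setup described in this answer, the parameterized dataset and the way the Copy activity feeds it could look like the following, written as Python dicts. Several things here are assumptions to check against the Teradata connector documentation: the `TeradataTable` dataset type, the `TeradataSource` source type, the `partitionOption` property name, and the hypothetical linked service name. The sketch uses the parameter name "TeradataTable" mentioned above.

```python
# Dataset that takes the table name as a parameter instead of pointing
# at one fixed table.
dataset = {
    "name": "TD_Prod_datasetname",
    "properties": {
        "type": "TeradataTable",  # assumed dataset type
        "linkedServiceName": {
            "referenceName": "TeradataLinkedService",  # hypothetical name
            "type": "LinkedServiceReference",
        },
        "parameters": {"TeradataTable": {"type": "String"}},
        "typeProperties": {
            "table": {"value": "@dataset().TeradataTable", "type": "Expression"},
        },
    },
}


def copy_input_for(table_expression="@item().TableName"):
    """Dataset reference for the Copy activity inside the ForEach."""
    return {
        "referenceName": dataset["name"],
        "type": "DatasetReference",
        "parameters": {
            "TeradataTable": {"value": table_expression, "type": "Expression"},
        },
    }


# Source settings: table mode with hash partitioning and no explicit
# partition column, as in the answer. Property name is an assumption.
copy_source = {"type": "TeradataSource", "partitionOption": "Hash"}

copy_input = copy_input_for()
```

Every parameter declared on the dataset needs a value in the reference, which is why the same "TeradataTable" key appears in both places.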

How do you treat a sequential file stage that cannot find the file similar to an empty table?

I have several DataStage jobs that will run, but the source file MIGHT not be there. If it is not, I want the DataStage job to complete just as it would if I were using a source DB connector and the source table had zero rows.
How can this be done?
Thanks
The SequentialFile stage in DataStage expects a file to exist, even if it is zero bytes in size.
One option would be to place a WaitForFile stage in front of your job to avoid running the job if no file exists. This would save the effort of loading lookup data etc., but it is not 100% the behavior of an empty table. You could also touch an empty file in that case to get the behavior you want, but I doubt this is a good design.
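
The "touch an empty file" workaround could be done in a small pre-job step. The sketch below is plain Python rather than anything DataStage-specific, and `ensure_source_file` is a hypothetical helper name; it only creates a zero-byte file when nothing exists at the path.

```python
from pathlib import Path


def ensure_source_file(path):
    """Create a zero-byte file at path if nothing exists there.

    Returns True if the placeholder was created, False if a file was
    already present (an existing file is never truncated).
    """
    target = Path(path)
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.touch()
    return True
```

The return value lets the calling script log or flag runs that used a placeholder, so a missing delivery is not silently treated as "no data".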

How to push a big file data in talend?

I have a text input file which is 7.5 GB in size and contains 65 million records, and I want to push that data into an Amazon Redshift table that I have created.
But after processing 5.6 million records the job stops making progress.
What can be the issue? Is there any limitation with tFileOutputDelimited as the job has been running for 3 hours.
Below is the job which I have created to push data in to Redshift table.
tFileInputDelimited(.text)---tMap--->tFileOutputDelimited(csv)
|
|
tS3Put(copy output file to S3) ------> tRedShiftRow(createTempTable)--> tRedShiftRow(COPY to Temp)
The limitation comes from the tMap component; it is not a good choice for dealing with large amounts of data. In your case, you have to enable the option "Store temp data" to overcome the memory consumption limitation of tMap.
It's well described in the Talend Help Center.
It looks like tFileOutputDelimited(csv) is creating the problem; a file may not be handled well beyond a certain amount of data. Not sure, though. Try to find a way to load only a portion of the parent input file and commit it in Redshift. Repeat the process until the parent input file is completely processed.
Use AWS Glue to push your file data from S3 to Redshift. AWS Glue will push the large data into Redshift without any issue.
Steps:
1: Create a connection with your Redshift
2: Create a database and two tables.
a: Data-from-S3 (this will be used to crawl file data from S3)
b: data-to-redshift (add the Redshift connection)
3: Create a Job:
a: In Data source, select the "Data-from-S3" table
b: In Data Target, select the "data-to-redshift" table
4: Run the job.
Note: You can also automate this with lambda and SNS trigger.
You can use the COPY command to load large data into AWS Redshift; if the COPY command doesn't support a txt file, then we need a csv file. Processing 65 million records at once will cause issues, so we need to split the work and run it in parts: create 65 iterations and process 1 million records at a time. To implement this, use tLoop and set the values inside the component. Use the global variables of tLoop in the Header and Limit values of the tFileInputDelimited component.
job:
tLoop----->tFileInputDelimited---->tMap(if needed)-------->tFileOutputDelimited
Also enable the option "Store temp data" to handle the memory issue.
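
The split-and-run idea boils down to computing, for each loop iteration, how many rows to skip (Header) and how many to read (Limit). Here is a small Python sketch of that arithmetic; `chunk_plan` is a hypothetical helper for illustration, not a Talend API.

```python
def chunk_plan(total_rows, chunk_size):
    """Return (header, limit) pairs: rows to skip and rows to read per pass."""
    if total_rows < 0 or chunk_size <= 0:
        raise ValueError("total_rows must be >= 0 and chunk_size must be > 0")
    return [
        (skip, min(chunk_size, total_rows - skip))
        for skip in range(0, total_rows, chunk_size)
    ]


# 65 million records processed 1 million at a time gives 65 iterations.
plan = chunk_plan(65_000_000, 1_000_000)
```

Each pair maps onto one tLoop iteration: the first value goes into the Header setting and the second into the Limit setting of tFileInputDelimited. Note that every pass still has to scan past the skipped rows, so later iterations read more of the file than earlier ones.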