Talend: How to copy a file modified as of today

I have a job in Talend which connects to an FTP folder and looks for a file, e.g. ABCD. This file is created every day and placed in the FTP path, and I need to move it to another folder. I'm new to Talend and Java. Could you please help me move this file only when its last-modified date is the same as the job run date?

You can use tFTPFileProperties to obtain the properties of the remote file, then access those properties in a tJavaRow. There you can either compare the file's modified date to the current date and store the result in a global variable, or simply put the date itself in a global variable. You then use an IF trigger to join to the tFTPGet component.
The IF trigger will either check the result of your comparison or do the comparison itself. It will only execute the FTP get if it evaluates to true.
This shows the overall job structure, including the fields made available from the file properties.
This shows how to obtain the datetime of the remote file. This is where you will need to store it in a global variable (a sketch of that code is shown below) so you can use it in your IF trigger code.
This shows the datetime of the remote file when the job is run.
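A minimal tJavaRow sketch, assuming the tFTPFileProperties output schema exposes the modification time as a long column named mtime (check your component's schema, as column names can differ by version):

// tJavaRow: keep the remote file's last-modified date for the IF trigger.
// "mtime" is assumed to be milliseconds since the epoch; adjust to your schema.
java.util.Date modified = new java.util.Date(input_row.mtime);
globalMap.put("fileModifiedDate", modified);
// Pass the row through unchanged.
output_row.mtime = input_row.mtime;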
This points you in the right direction, but you will still need to do some work: the comparison happens in your IF trigger, so you need to know how to compare dates in Java.
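A hedged example of the condition you might put in the IF trigger, comparing only the date portion of the timestamp stored above (TalendDate is Talend's built-in date routine class):

TalendDate.formatDate("yyyy-MM-dd", (java.util.Date) globalMap.get("fileModifiedDate"))
    .equals(TalendDate.formatDate("yyyy-MM-dd", TalendDate.getCurrentDate()))

If the two formatted strings match, the file was modified today and the tFTPGet branch runs.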

Related

ADF - built-in copy task to create folders based on variables

I am using Azure Data Factory's built-in copy task, set on a daily schedule, to copy data into a container in Azure Data Lake Storage Gen2. For my destination, I'm trying to use date variables to create a folder structure for the data. In the resulting pipeline the formula looks like this:
Dir1/Dir2/@{formatDateTime(pipeline().parameters.windowStart,'yyyy')}/@{formatDateTime(pipeline().parameters.windowStart,'MM')}/@{formatDateTime(pipeline().parameters.windowStart,'dd')}
Unfortunately this is throwing an error:
Operation on target ForEach_h33 failed: Activity failed because an inner activity failed; Inner activity name: Copy_h33, Error: The function 'formatDateTime' expects its first parameter to be of type string. The provided value is of type 'Null'.
Everything I've created was generated by the tool; the folder path I used when following it was as suggested:
Dir1/Dir2/{year}/{month}/{day}
(I was then able to set the format of each variable, e.g. yyyy, MM, dd, which suggests the tool understood what I was doing.)
The only other thing I can think of is that the folder structure in the container only contains Dir1/Dir2/ - I am expecting the subdirectories to be created as the copy task runs.
I'll also add, everything runs fine if I just use the directory Dir1/Dir2/ - so the issue is with my variables.
There is nothing wrong with your built-in copy task; it will work correctly when it runs at its scheduled trigger time.
If you run it with a manual trigger, however, you will get the error above. With a manual trigger you must supply the windowStart parameter value yourself; the error is telling you that the value is null.
Give the value in MM/DD/YYYY format. The schedule trigger picks this value up automatically when it runs daily, but in a manual run you have to specify it.
Then click OK. You will no longer get that error, and the folders will be created in the year/month/day format.
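For reference, a schedule trigger typically supplies windowStart along these lines; a hedged sketch in which the trigger name, pipeline name, and start time are placeholders:

{
  "name": "DailyTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": { "frequency": "Day", "interval": 1, "startTime": "2023-01-01T00:00:00Z" }
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "CopyPipeline", "type": "PipelineReference" },
        "parameters": { "windowStart": "@trigger().scheduledTime" }
      }
    ]
  }
}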

How can I pass output from a filter activity directly to a copy activity in ADF?

I have 4,000 files, each averaging 30 KB in size, landing in a folder on our on-premises file system each day. I want to apply conditional logic (several and/or conditions) against details in their file names, to move only the files matching those conditions into another folder.

I have tried linking a Get Metadata activity (which gets all files in the source folder) to a Filter activity (which applies the conditional logic) and then to a ForEach activity with an embedded Copy activity. This works, but it is taking hours to process the files. When running the pipeline in debug, the output window appears to list each file copied as a line item. I've increased the batch count setting in the ForEach to 50, but it hasn't improved things. Is there a way to link the Filter activity directly to the Copy activity without using the ForEach, i.e. pass the collection from the Filter straight into the Copy's source?

Alternatively, some of our other pipelines just use the Copy activity pointing at a source folder, and we configure its file filter setting with a simple wildcard pattern using a combination of * and ?, which is extremely fast. In this particular scenario, however, my conditional logic is more complex: I need to compare attributes in each file's name with values to decide whether the file should be moved. The file filter setting allows dynamic content, so I could remove the Filter activity completely, point the Copy at the source folder, and put the conditional logic in the file filter's dynamic content area, but how would I get a reference to the file name to do the conditional checks?
Here is one solution:
Write the Filter activity's array output as text to a .json file in Blob Storage (or wherever). Here are the steps to make that work:
Copy Data Source:
Copy Data Sink:
Write the json (array output) to a text file that contains the names of the files you want to copy.
Copy Activity Source (to get it from JSON to .txt):
Sink will be .txt file in your Blob.
Use that text file in your main copy activity, pointing the source's 'List of files' (fileListPath) setting at it:
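A hedged sketch of what the source portion of that copy activity might look like in pipeline JSON, assuming a blob source and a list file at staging/filelist.txt (both names are placeholders; the list file must sit in the same data store as the source files and contain one relative path per line):

"source": {
  "type": "DelimitedTextSource",
  "storeSettings": {
    "type": "AzureBlobStorageReadSettings",
    "fileListPath": "staging/filelist.txt"
  }
}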
This should copy over all the files that you identified in your Filter Activity.
I realize this is a workaround, but it really is the only solution for what you are asking; otherwise there is no way to link a Filter activity straight to a Copy activity.

Validation checks on dataset ADF vs databricks

I want to perform some file-level and field-level validation checks on the dataset I receive.
Given below are some checks which I want to perform, capturing any issues into audit tables.
File-level checks: file is present, size of the file, count of records matches the count in the control file.
Field-level checks: content in the right format, duplicate key checks, range checks on important fields.
I want to make this a template so that all projects can adopt it. Is it better to perform these checks in ADF or in Databricks? If ADF, any reference to an example data flow/pipeline would be very helpful.
Thanks,
Kumar
You can accomplish these tasks by using various activities in an Azure Data Factory pipeline.
To check for file existence, you can use the Validation activity.
In the Validation activity you specify several things: the dataset whose existence you want to validate, sleep (how long to wait between retries), and timeout (how long it should keep trying before giving up and timing out). The minimum size is optional.
Be sure to set the timeout value properly. The default is 7 days, much too long for most jobs.
If the file is found, the activity reports success.
If the file is not found, or is smaller than the minimum size, the activity can time out, which downstream dependencies treat as a failure.
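A hedged sketch of a Validation activity definition in pipeline JSON (the activity name, dataset name, and values are placeholders; sleep is in seconds and minimumSize in bytes):

{
  "name": "CheckFileExists",
  "type": "Validation",
  "typeProperties": {
    "dataset": { "referenceName": "SourceFileDataset", "type": "DatasetReference" },
    "timeout": "0.01:00:00",
    "sleep": 60,
    "minimumSize": 1024
  }
}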
To count matching records, assuming you are using CSV, you could create a generic one-column dataset and run a copy activity over whatever folders you want to count, into a temp folder. Then get the row count from the copy activity's output and save it.
At the end, delete everything in your temp folder.
Something like this:
Lookup activity (gets your list of base folders, just for easy rerunning)
For Each (base folder)
Copy recursively to the temp folder
Stored Procedure activity which stores the copy activity's rowsCopied output (see the expression after this list)
Delete the temp files recursively.
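The stored procedure's parameter would be fed an expression along these lines (the activity name Copy_to_temp is a placeholder for whatever your copy activity is called):

@activity('Copy_to_temp').output.rowsCopied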
To use the same pipeline repeatedly for multiple datasets, you can make your pipeline dynamic. Refer: https://sqlitybi.com/how-to-build-dynamic-azure-data-factory-pipelines/

Azure Data factory, How to incrementally copy blob data to sql

I have an Azure blob container where JSON files with data get put every 6 hours, and I want to use Azure Data Factory to copy them to an Azure SQL DB. The file pattern for the files is like this: "customer_year_month_day_hour_min_sec.json.data.json"
The blob container also has other json data files, so I have to filter for the files I want in the dataset.
The first question is: how can I set the file path on the blob dataset to only pick up the json files that I want? I tried the wildcard *.data.json, but that doesn't work; the only filename wildcard I have gotten to work is *.json.
The second question is: how can I copy data only from the new files (with the specific file pattern) that land in the blob storage to Azure SQL? I have no control over the process that puts the data in the blob container and cannot move the files to another location, which makes this harder.
Please help.
You could use an ADF event trigger to achieve this.
Define your event trigger as 'blob created' and specify the blobPathBeginsWith and blobPathEndsWith properties based on your filename pattern.
For the first question: when an event trigger fires for a specific blob, the event captures the folder path and file name of the blob in the properties @triggerBody().folderPath and @triggerBody().fileName. You need to map those properties to pipeline parameters and pass an @pipeline().parameters.parameterName expression to the fileName in your copy activity.
This also answers the second question: each time the trigger fires, you'll get the folder path and file name of the newly created file in @triggerBody().folderPath and @triggerBody().fileName.
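A hedged sketch of such a trigger (the trigger name, pipeline name, storage-account scope, and path prefix are placeholders; blobPathEndsWith matches the .data.json suffix from the question):

{
  "name": "NewDataFileTrigger",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "blobPathBeginsWith": "/mycontainer/blobs/customer_",
      "blobPathEndsWith": ".data.json"
    },
    "pipelines": [
      {
        "pipelineReference": { "referenceName": "ImportPipeline", "type": "PipelineReference" },
        "parameters": {
          "sourceFolder": "@triggerBody().folderPath",
          "sourceFile": "@triggerBody().fileName"
        }
      }
    ]
  }
}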
Thanks.
I understand your situation. It seems they've used a new platform to recreate a decades-old problem. :)
The pattern I would set up first looks something like this:
Create a storage account trigger that will fire on every new file in the source container.
In the triggered pipeline, examine the blob name to see if it fits your parameters. If not, just end, taking no action. If so, binary-copy the blob to an account/container your app owns, leaving the original in place (see the expression sketch after this list).
Create another Trigger on your container that runs the import Pipeline.
Run your import process.
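The name check in the triggered pipeline could be an If Condition with an expression along these lines, assuming the trigger's fileName was mapped to a pipeline parameter named sourceFile (both names are placeholders):

@and(startswith(pipeline().parameters.sourceFile, 'customer_'), endswith(pipeline().parameters.sourceFile, '.data.json'))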
A couple of caveats your management has to understand: you can be very, very reliable, but you cannot guarantee compliance, because there is no transaction/contract between you and the source container. Also, there may be a sequence gap, since a small file can usually finish processing while a larger file is still being processed.
If for any reason you do miss a file, all you need to do is copy it to your container where your process will pick it up. You can load all previous blobs in the same way.

Append date and time to the std_out_file in jil

I have a job in Autosys. I want a new log file to be created for every run, so I want to append the date and time to the log file name I am giving in std_out_file. Is there any way to do this other than creating a global variable and updating it every day using another Autosys job?
There is no built-in way to do exactly this, but you can use the following:
\\logpath\%auto_job_name%.%autorun%.OUT.txt
This embeds the run number, so the logs will not be overwritten even if there are multiple runs a day. For sorting, you can use the file's creation date/time.
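A minimal JIL sketch using that pattern, mirroring the variable syntax above (the job name, machine, and command are placeholders):

insert_job: daily_extract_job
job_type: CMD
machine: winbox01
command: C:\scripts\extract.bat
std_out_file: "\\logpath\%auto_job_name%.%autorun%.OUT.txt"
std_err_file: "\\logpath\%auto_job_name%.%autorun%.ERR.txt"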