Google Cloud Data Fusion variable in S3 path - google-cloud-data-fusion

I have an S3 source in my pipeline. I would like to fetch the CSV file including today's date in the filename. Is it possible to somehow include the date in the path? Let's say the path is:
s3a://products-data/products_20210204.csv

You can use the logical start time macro function in the name of the S3 file: s3a://products-data/products_${logicalStartTime(yyyyMMdd)}.csv. At runtime this resolves to the logical start date of the pipeline run, which for a run started today is today's date.
For more information about the logical start time macro, please check out this article: https://cdap.atlassian.net/wiki/spaces/KB/pages/269910059/Add+processed+date+to+all+records+in+the+destination

Related

Azure Data Factory data flow file sink

I am using a .csv file to import data into an Azure SQL database. After the data import is complete, I move the source file from the Source container to the myArchive container. I am trying to save the file as SaleData_yyyyMMdd_HHmm.csv, but instead a folder with this name gets created and the file is broken down into multiple part files (part-00000-, part-00001-, ...). Could you please guide me on how to specify the filename with the current date and timestamp?
File System: myArchive
Folder Path: concat('SalesDepartment/Warehouse1/','SaleData_',toString(currentTimestamp(),'yyyyMMdd_HHmm'),'.csv')
The folder path can be set directly in the sink dataset. (Note: my source and sink are both of the delimited type.)
For the file name:
In the sink dataset, create a parameter for the file name and use it in the file name portion of the dataset.
Use the expression below as the parameter value in the copy activity's sink:
@concat('SaleData_',formatDateTime(utcnow(),'yyyyMMdd_HHmm'),'.csv')
Remember, this only copies your source file under a different name. You need to add a Delete activity to remove the original source file.
If you are using a data flow:
Make sure you choose Single partition in the Optimize tab of the sink instead of Use current partitioning.
Then go to Settings and choose Output to single file. Under the file name, enter the expression with the timestamp:
concat('SaleData_',toString(currentUTC(),'yyyyMMdd_HHmm'),'.csv')
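
To sanity-check the file name these expressions are meant to produce, the same formatting can be reproduced outside Data Factory. This is a rough Python sketch, not ADF expression code; archive_file_name is a hypothetical helper, and %Y%m%d_%H%M is the strftime counterpart of the yyyyMMdd_HHmm pattern.

```python
from datetime import datetime, timezone


def archive_file_name(now: datetime) -> str:
    """Build SaleData_<yyyyMMdd_HHmm>.csv from the given timestamp."""
    return "SaleData_" + now.strftime("%Y%m%d_%H%M") + ".csv"


# Fixed UTC timestamp so the result is reproducible.
print(archive_file_name(datetime(2021, 2, 4, 9, 5, tzinfo=timezone.utc)))
# prints SaleData_20210204_0905.csv
```

Passing datetime.now(timezone.utc) instead gives the name for the current UTC time.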

Can a Mapping Data Flow use a parameterized Parquet dataset?

thanks for coming in.
I am trying to develop a Mapping Data Flow in an Azure Synapse workspace (so I believe this also applies to ADFv2) that takes a Delta input and transforms it straight into Parquet-formatted output. The relevant detail is that the Parquet dataset points to ADLS Gen2 with a parameterized file system and folder rather than hard-coded ones, because hard-coding would mean creating too many datasets: there are too many folders of interest in the Data Lake.
The Mapping Data Flow:
When I use it as a Source in my Mapping Data Flow, the debug configuration (as well as the parent pipeline configuration) duly asks for values for those parameters, which I am happy to enter.
Then, as soon as I try to debug or run the pipeline, I get this error in less than 1 second:
{
"Message": "ErrorCode=InvalidTemplate, ErrorMessage=The expression 'body('DataFlowDebugExpressionResolver')?.50_DeltaToParquet_xxxxxxxxx?.ParquetCurrent.directory' is not valid: the string character '_' at position '43' is not expected."
}
RunId: xxx-xxxxxx-xxxxxx
This error message is not specific enough to tell me where I should look.
I tried replacing the parameterized Parquet dataset with a hard-coded one, and it works perfectly in both debug and pipeline-run modes. However, this does not get me what I need, which is the ability to reuse my Parquet dataset instead of having to create a specific dataset for each Data Lake folder.
There are also no spaces in the Data Lake file system. Please refer to these parameters that look a lot like my production environment:
File System: prodfs001
Directory: synapse/workspace01/parquet/dim_mydim
Thanks in advance to all of you, folks!
The directory name synapse/workspace01/parquet/dim_mydim has an underscore in dim_mydim. Can you try replacing the underscore, for example by using dimmydim, to test whether it works?

Is there a way to find the oldest file in a directory using Azure Data Lake?

Is there a way to find the oldest file in a directory using Azure Data Lake?
I had assumed I could use the Get Metadata activity to get all the file names and dates (which I can). I then thought I could use a ForEach to set two pipeline variables (Name and Date) to the values from the list whenever they were older than the variables' current values. This does not work because all the files are processed in parallel. This really should not be this hard.
Yes. The ForEach activity in Azure Data Factory runs in parallel by default, but you can make it run sequentially by checking the Sequential option.
For more details, you can refer to this documentation.
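
The reason sequential execution matters is that the "keep the oldest seen so far" logic reads and updates shared state on every iteration. A minimal sketch of that logic in Python follows. It is not pipeline code: the sample list, its field names, and the timestamp format are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical sample of a file listing: a name plus a last-modified timestamp.
files = [
    {"name": "b.csv", "lastModified": "2021-02-03T10:00:00Z"},
    {"name": "a.csv", "lastModified": "2021-01-15T08:30:00Z"},
    {"name": "c.csv", "lastModified": "2021-02-04T12:00:00Z"},
]


def parse(ts: str) -> datetime:
    """Parse the assumed timestamp format of the sample listing."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def oldest(items):
    """Visit items one at a time, keeping the oldest name and date seen so far."""
    oldest_name, oldest_time = None, None
    for item in items:  # strictly in order, like a ForEach with Sequential checked
        modified = parse(item["lastModified"])
        if oldest_time is None or modified < oldest_time:
            oldest_name, oldest_time = item["name"], modified
    return oldest_name, oldest_time


print(oldest(files)[0])
# prints a.csv
```

If the iterations ran concurrently, two of them could compare against the same stale value and overwrite each other, which matches the behavior described in the question.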

Reading file from Google Drive with Talend

I need to read an uploaded file in Google Drive and perform X transformation with it. As per my reading, the only way to do it is to download the file to my local machine with the Talend component and then read it from there.
If that is correct, I cannot figure out what the file name would be, assuming that I don't want to use the exact name of the file.
I found http://meowbi.com/2018/02/23/getting-google-sheet-gdrive-talend/ and it is exactly what I need: read from Google Drive, check the file name and proceed if the file name is X. What is unclear to me is what they used in tJava.
The output schema of the tGoogleDriveList component's Main row contains a field, name, that holds the file name you're looking for. Using an Iterate row is less straightforward, as you need to extract values from the GlobalMap. In the article you cited, they get the file name via the "tGoogleDriveList_1_TITLE" key of the GlobalMap.
Main row between tGoogleDriveList and tJava
For more details, please look into the Talend Reference for Google Drive components. The Listing files and folders in Google Drive section should be particularly relevant to your case.

Talend: How to copy the file with modified as of today

I have a job in Talend that connects to an FTP folder and looks for files, e.g. ABCD. This file is created every day and placed in the FTP path, and I need to move it to another folder. I'm new to Talend and Java. Could you please help me move this file only when its last modified date is the job run date?
You can use tFTPFileProperties to obtain the properties of the remote file, then access those properties in a tJavaRow. You can then either compare against the current date in the tJavaRow and store the result in a global variable, or just store the date in a global variable. You then use an IF trigger to connect to the tFTPGet component.
The IF trigger will either check the result of your comparison or do the comparison itself. It will only execute the FTP Get if the condition is true.
This shows overall job structure, including the fields made available from the file properties:
This shows how to obtain the datetime of the remote file. This is where you will need to stick it in a global variable (code for that is not shown) so you can use it in your IF trigger code.
This shows the datetime of the remote file when the job is run.
This points you in the right direction, but you will still need to do some work: you will need to do the comparison in your IF trigger and know how to compare dates.
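
As a rough illustration of the date comparison itself, here is a sketch in Python. In the Talend job the equivalent check would be written in Java inside the tJavaRow or the IF trigger. modified_on_run_date is a hypothetical helper, and the sketch assumes both values are in the same time zone.

```python
from datetime import date, datetime


def modified_on_run_date(last_modified: datetime, run_date: date) -> bool:
    """Return True when the file's last-modified timestamp falls on the run date."""
    # Only the calendar date is compared; the time of day is ignored.
    return last_modified.date() == run_date


run_date = date(2021, 2, 4)  # in a real job this would be date.today()
print(modified_on_run_date(datetime(2021, 2, 4, 6, 30), run_date))   # prints True
print(modified_on_run_date(datetime(2021, 2, 3, 23, 59), run_date))  # prints False
```

Truncating to the calendar date before comparing is the key step: comparing full timestamps would almost never match the moment the job runs.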