Data Factory - Can I use the date field in a CSV to determine the destination folder in Copy Activity?

I have some CSV files that I want to copy to a specific folder in ADLS based on the date column within the file.
i.e. CSV file has a column named "date" that reads "2022-02-23" on all rows. I want to copy that file to a folder that has the corresponding year and month, such as "/curated/UK/ProjectABC/2022/02"
I've got a Lookup activity that's pointing to the source CSV file and populating a Set Variable activity with the month using this dynamic content - @substring(string(activity('Lookup1').output.firstRow.date),5,2)
Would this be the right approach, to use a variable?
I can't use variables in the Directory portion of the Sink Dataset, as far as I know.
Have you come across this situation before?

Sounds like you're on the right path. You can absolutely use Dataset parameters.
Then populate them in your pipeline using a variable (or parameter, or expression).
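A minimal sketch of how that can fit together (the parameter name folderPath is an assumption, not from the original post): add a string parameter called folderPath to the sink Dataset and set its Directory to @dataset().folderPath. In the Copy activity's sink, the parameter can then be populated from the Lookup output, for example:
@concat('curated/UK/ProjectABC/', substring(string(activity('Lookup1').output.firstRow.date), 0, 4), '/', substring(string(activity('Lookup1').output.firstRow.date), 5, 2))
Here concat, substring, string, and activity are standard ADF expression functions; the year and month path segments are taken straight from the "date" value of the first row.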

Related

pyspark read delta csv file by date

I have several CSV files in a folder; please refer to the screenshot below.
The files with '20221205' are delta files and are newly uploaded into the folder today.
I want to read only these 2 delta CSV files, do some transformation, and then append them to an existing table.
Every day, I will upload 2 files with the current date as a suffix, then run the notebook to handle only the files uploaded today.
Question: how can I read only today's files with PySpark?
How should I load the delta files?
What you call delta is actually a normal CSV file with a date suffix, not to be confused with the Delta Lake data format.
You can match the suffix using glob patterns: simply put the date into the path string and Spark will read only the files ending with that date:
spark.read.csv("path/to/folder/*20221205.csv")
However, I recommend, if possible, storing the CSVs partitioned in your file system. This means each date is in a separate folder.
The file system will be something like:
folder
date=2022-01-01
date=2022-01-02
....
Then you can simply:
from pyspark.sql.functions import col
spark.read.csv('folder').filter(col('date') == '2022-01-02')
The filter on the date will take milliseconds since the data is partitioned; behind the scenes, Spark knows that CSVs with date = X are stored ONLY in the date=X folder.
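To produce that layout in the first place, the daily upload could be written with partitionBy. A small sketch under the same assumptions (an existing spark session and a DataFrame df containing a date column named date):
# append today's rows into the partitioned structure: folder/date=YYYY-MM-DD/part-*.csv
df.write.mode("append").partitionBy("date").csv("folder", header=True)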

Copy latest files from SAP Server using Azure Data Factory and File System Linked Service and store it in Azure Data Lake

I have a task to copy the newly added files every day from an SAP server and store them in ADLS. There are 2 types of files on the server (Recurring and Upfront), appended with a date. I need to store the files in separate folders and, every day, add the latest file from the SAP server to ADLS.
File name format:
R_07292021.orc
Recurring_08312021.orc
U_07292021.orc
Upfront_08312021.orc
Below are the steps I have taken so far:
Get Metadata activity to get the list of files from the server
Use a Filter activity to separate the files based on their names, i.e. filtering on the initial letter
I tried using the Foreach activity and If Condition, but it doesn't seem to be working.
I am stuck at this point trying to figure out how to proceed. Any help would be very much appreciated.
If you are trying to get the latest modified date of a file in a folder, you can refer to the process below.
I tested it with one type of file, the ones that start with "U".
Create 2 variables, one to store maxdate and the other to store latestfilename. Assign an initial value (a date in the past) to the maxdate variable.
Use a Get Metadata activity to get the list of files that start with "U" by hardcoding the filename parameter value as "U".
Output of Get Metadata1:
Pass the Get Metadata activity output child items to the ForEach activity to loop through all the files from the list.
Inside the ForEach, use another Get Metadata activity to get the metadata (last modified date & file name) of the current file in the loop.
Output of Get Metadata2:
Connect the Get Metadata activity to an If Condition and use the greater and ticks functions to evaluate the If Condition expression.
The ticks function returns the integer value of the specified timestamp, and the greater function compares the two values.
@greater(ticks(activity('Get_lastmodified_date_and_name').output.lastModified),ticks(formatDateTime(variables('maxdate'))))
When the expression evaluates to true, add two Set Variable activities to store the last modified date and the respective file name in the variables created initially.
Maxdate:
LatestFileName:
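The expressions behind those two Set Variable activities aren't shown, but with the activity name used above they would plausibly be (itemName is the standard Get Metadata output field for the file name, assumed here):
Maxdate: @activity('Get_lastmodified_date_and_name').output.lastModified
LatestFileName: @activity('Get_lastmodified_date_and_name').output.itemName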
Note: These variables' values will be overridden if the file in the next iteration of the loop has a timestamp greater than that of the previous (or first) file.
In the subsequent activities, use the latestfilename variable to assign the value for the filename parameter in the source.
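For example, if the source dataset exposes a filename parameter (assumed here, since the dataset definition isn't shown), its value could simply be set to @variables('latestfilename').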

Upload multiple files to pentaho

In Pentaho Data Integration, how do I import a list of xlsx files that are in the same folder?
Note: the number of columns is always the same.
If your Excel column names and sheet name are always the same, then you can use this solution, where I take all the xlsx files from the source folder and convert them to CSV one by one.
But if your Excel column names and sheet name are dynamic, or you need a more dynamic solution, then you can use my other Stack Overflow solution here.

Skip lines while reading csv - Azure Data Factory

I am trying to copy data from Blob to Azure SQL using data flows within a pipeline.
The data files are in CSV format and the header is on the 4th row of the file.
I want to use the header exactly as it appears in the CSV data file.
I want to loop through all the files and upload the data.
Thanks
Add a Surrogate Key transformation and then a Filter transformation to filter out row number 4.
You need to first uncheck the "First row as header" in your CSV dataset. Then you can use the "Skip line count" field in the copy data activity source tab and skip any number of lines you want.

Talend Open Studio DI: Replace content of one column of .xlsx file with another column of .csv file

I have two input files:
an .xlsx file that looks like this:
a .csv file that looks like this:
I already have a talend job that transforms the .xlsx file into an .xml file.
One node in the .xml file contains the stock location code:
<stockLocationCode>SL213</stockLocationCode>
The output .xml file looks like this:
Now I need to replace every occurrence of the stockLocationCode with the second column of the .csv file. In this case the result would be:
My talend job looks like this:
I use a tMap component to put the columns of the .xlsx file into the right node of the output xml file.
But I do not know how I can replace the stockLocationCode with the actual full stock location using the .csv file. I tried to also map the .csv file with the tMap component.
I would need to build in a method that looks at the current value of the node <stockLocationCode>, loops over the whole .csv file until it finds it in the first column, and then replaces the <stockLocationCode> content with the content of the second column of the .csv file.
Performance is not important ;)
First, you'll need a lookup in e.g. a tMap or tXMLMap component, where you map your keys and add a new column with the second column of the CSV file.
The resulting columns would look like this:
Product; Stock Location Code; CSV 2nd column data
Now in a second map you could just remove the stock location code and do the rest of your job.
Voila, you exchanged the columns.
You can use tXMLMap, which supports lookups.