Has anyone had an issue with the Directory setting in the Sink Dataset? The files are ending up in a location that only includes the File System value:
So the files end up in /curated
But they should end up in /curated/profiledata
It depends on the File Name Option that you have selected.
"Output to single file" will honor the folder in the dataset's file path.
But if you are using "As data in column", we start back at the container root so that you can put your files in multiple different folder locations. You can just add your folder path to the value in the column with a derived column to set the proper path, as in the sketch below.
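For example, a minimal derived-column expression (the column name filename is hypothetical, and the folder is assumed to be profiledata under the curated file system):
concat('profiledata/', filename)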
Related
Is there a way to specify the name of a parquet file when I am saving it in Databricks to Azure Data Lake? For example, when I try to run the following statement:
append_df.write.mode('append').format('parquet').save('/mnt/adls/covid/base/Covid19_Cases')
a folder called Covid19_Cases gets created and there are parquet files with random names inside of it.
What I would like to do is use the saved parquet file in a Data Factory copy activity. In order to do that, I need to specify the parquet file's name; otherwise I can't point to a specific file.
Since Spark executes in distributed mode and files, or their derivatives such as DataFrames, are processed in parallel, the processed data is stored as multiple files in the same folder. You can point the Data Factory copy activity at the folder. But if you really want a single file, you can use the approach below:
save_location= "/mnt/adls/covid/base/Covid19_Cases"+year
parquet_location = save_location+"temp.folder"
file_location = save_location+'export.parquet'
df.repartition(1).write.parquet(path=parquet_location, mode="append", header="true")
file = dbutils.fs.ls(parquet_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(parquet_location, recurse=True)
I am using a .csv file to import data into an Azure SQL database. After the data import is complete I move the source file from the source container to the myArchive container. I am trying to save the file as SaleData_yyyyMMdd_HHmm.csv, but instead I get a folder with this name and the file is broken down into multiple part files (part-00000-, part-00001-, ...). Could you please guide me on how to specify the file name with the current date and timestamp?
File System: myArchive
Folder Path: concat('SalesDepartment/Warehouse1/','SaleData_',toString(currentTimestamp(),'yyyyMMdd_HHmm'),'.csv')
The folder path can be specified directly in the sink dataset. (Note: my source and sink are both delimited text datasets.)
For the file name:
Under the sink dataset, create a parameter to pass the file name and use it in the file name portion of the dataset.
Use the expression below as the copy activity sink's parameter value:
@concat('SaleData_', formatDateTime(utcnow(), 'yyyyMMdd_HHmm'), '.csv')
Remember, this just copies your source under a different name. You need to add a Delete activity to delete the original source file.
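For reference, assuming the dataset parameter is named fileName (a hypothetical name), the dataset's file name field would then just reference it via dynamic content:
@dataset().fileName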
If you are using a Data Flow,
make sure you choose Single partition in the Optimize tab of the sink instead of Use current partitioning.
Then go to Settings and choose Output to single file. Under the file name, enter the expression with the timestamp:
concat('SaleData_', toString(currentUTC(), 'yyyyMMdd_HHmm'), '.csv')
I have 4,000 files, each averaging 30 KB in size, landing in a folder on our on-premises file system each day. I want to apply conditional logic (several and/or conditions) against details in their file names and move only the matching files into another folder. I have tried linking a Get Metadata activity, which gets all files in the source folder, with a Filter activity that applies the conditional logic, and a ForEach activity with an embedded Copy activity. This works, but it is taking hours to process the files. When running the pipeline in debug, the output window appears to list each file copied as a line item. I've increased the batch count setting in the ForEach to 50, but it hasn't improved things.
Is there a way to link the Filter activity directly to the Copy activity without using the ForEach activity, i.e. pass the collection from the Filter straight into the Copy's source?
Alternatively, some of our other pipelines just use the Copy activity pointing to a source folder, and we configure its file filter setting with a simple wildcard combining * and ?, which is extremely fast. However, in this particular scenario my conditional logic is more complex and I need to compare attributes in each file's name with values to decide whether the file should be moved. The file filter setting allows dynamic content, so I could remove the Filter activity completely, point the Copy at the source folder and put the conditional logic in the file filter's dynamic content area, but how would I get a reference to the file name to do the conditional checks?
Here is one solution:
Write the Filter activity's array output as text to a .json file in Blob Storage (or wherever). Here are the steps to make that work:
Copy Data source: the array output from the Filter activity.
Copy Data sink: a .json file in Blob Storage.
Next, write that JSON (the array output) to a text file that holds the names of the files you want to copy.
Copy activity source (to get it from .json to .txt): the .json file written above.
The sink will be a .txt file in your Blob storage.
Use that text file in your main copy activity, pointing the source's list-of-files path setting at it.
This should copy over all the files that you identified in your Filter activity.
I realize this is a workaround, but it really is the only solution for what you are asking; otherwise there is no way to link a Filter activity straight to a Copy activity.
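For reference, a minimal sketch of the dynamic content involved (the activity names GetFileList and FilterFiles are hypothetical, and the condition is only an example):
Filter activity items:     @activity('GetFileList').output.childItems
Filter activity condition: @and(startswith(item().name, 'Sale'), endswith(item().name, '.csv'))
File-list file content:    @string(activity('FilterFiles').output.Value)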
As part of migrating data from ADLS Gen1 to ADLS Gen2 using ADFv2, we have the scenario below.
source -> raw/datasourceA/2019/2019-Aug/12-Aug-2019/files
raw/datasourceA/2019/2019-Aug/13-Aug-2019/files
raw/datasourceA/2020/2020-Apr/02-Apr-2020/files
target -> raw/eval/datasourceA/12-Aug-2019/files
raw/eval/datasourceA/13-Aug-2019/files
raw/eval/datasourceA/02-Apr-2020/files
One option is to keep a source path and target path mapping in a table and read each row using an ADF Lookup activity. However, doing so we will end up with a very large number of entries in the table.
Is there any other way to achieve this dynamically in ADF?
In the control table I just want to have the source and target paths below, with the rest handled by ADF.
source path -> raw/datasourceA/
target path -> raw/eval/datasourceA/
Because your folders are hierarchical, I support your idea to pass the file path as a parameter to the copy activity. In ADF, it is more convenient to traverse one hierarchical folder at a time.
Declare an array type variable and assign it the value ["2019/2019-Aug","2020/2020-Apr"], then loop over it with a ForEach activity.
Inside the ForEach, specify the source file path via dynamic content: @concat('raw/datasourceA/', item()).
Then sink to the target folder, as in the sketch below.
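For reference, a minimal sketch of the settings, assuming the array variable is named FolderPaths (a hypothetical name):
ForEach items:           @variables('FolderPaths')
Copy source folder path: @concat('raw/datasourceA/', item())
Copy sink folder path:   raw/eval/datasourceA/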
We can see the source folders were copied to the target folder.
I am trying to copy files from one folder to another folder using the SharePoint REST API. Some columns in the destination folder have a default value defined. Even though the files are copied successfully, some files do not get the default value for those columns.
On a closer look, I found that the newer Office document types (.docx, .xlsx, .pptx, etc.) get the default values, while the older Office document types (.doc, .xls, .ppt) do not.
Also, the older Office documents get the values only when they come from a source folder that already contains the same columns as the destination folder.
I am wondering why the older Office documents do not get the values and whether anything can be done.
Is it a bug in SharePoint Server or am I missing any configuration to make all files work?
My understanding is that this is expected. Because you are copying files, the copy includes not only the file itself but also its metadata. If the file in the source folder doesn't have values in those columns, it makes sense that when you copy it to a destination folder, those same columns won't have values either. Now, why do some files (.docx, .pptx, etc.) have values in the destination? Probably because of the SharePoint document parser feature (document property promotion and demotion). So in your case, what you can do is, instead of copying the files, download and re-upload them, for instance with code like the sketch below.
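As a minimal sketch (not the original answerer's code), a download and re-upload via the SharePoint REST API could look like the following; the site URL, file paths, and authentication handling are all assumptions:

import requests

site = "https://contoso.sharepoint.com/sites/demo"   # hypothetical site URL
src = "/sites/demo/SourceFolder/report.doc"          # hypothetical source file
dest = "/sites/demo/DestinationFolder"               # hypothetical destination folder

session = requests.Session()
session.headers.update({"Accept": "application/json;odata=verbose"})
# Authentication is assumed to be configured here (e.g. an Authorization header
# with a bearer token); the POST below also requires it, or a request digest.

# Download the raw file content from the source folder
resp = session.get(f"{site}/_api/web/GetFileByServerRelativeUrl('{src}')/$value")
resp.raise_for_status()

# Upload it as a new file into the destination folder, so the library's
# default column values are applied to the newly created item
upload_url = (f"{site}/_api/web/GetFolderByServerRelativeUrl('{dest}')"
              f"/Files/add(url='report.doc', overwrite=true)")
session.post(upload_url, data=resp.content).raise_for_status()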