Is there a way to specify the name of a parquet file when I am saving it in Databricks to Azure Data Lake? For example, when I try to run the following statement:
append_df.write.mode('append').format('parquet').save('/mnt/adls/covid/base/Covid19_Cases')
a folder called Covid_Cases gets created and there are parquet files with random names inside of it.
What I would like to do is to use the saved parquet file in Data Factory copy activity. In order to do that, I need to specify the parquet file's name, otherwise I can't point to a specific file.
Since spark is executing in distributed mode and files or their revatives, e.g. dataframes, are being processed in parallel, processed data will be stored in different files in same folder . You can use folder level name to Data Factory copy activity. But you really want to make it single file, you can use below approach ,
save_location= "/mnt/adls/covid/base/Covid19_Cases"+year
parquet_location = save_location+"temp.folder"
file_location = save_location+'export.parquet'
df.repartition(1).write.parquet(path=parquet_location, mode="append", header="true")
file = dbutils.fs.ls(parquet_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(parquet_location, recurse=True)
Related
I'm using PySpark code to create a parquet file; specifically, I'm using the parquet() function of the DataFrameWriter class and providing just the location, not the name of the parquet file. I'd like to know the name of the parquet file that was created; however, the function returns None. Any suggestions?
Files' name created by DataFrameWriter are unpredictable, because of the nature of distribution work (i.e. multiple workers is writing in the same location). However, you can retrieve the file name using input_file_name when re-read those parquet files.
I am using a .csv file to import data into an Azure SQL database. After the data import is complete I am now moving the source file from the Source container to myArchive container. I am now trying to save the filename as SaleData_yyyyMMdd_HHmm.csv, but, I have the folder with this name getting created and the file is broken down into multiple part files (part-00000-, part-00001-,...). Could you please guide me on how to specify the filename with current data & timestamp.
File System: myArchive
Folder Path: concat('SalesDepartment/Warehouse1/','SaleData_',toString(currentTimestamp(),'yyyyMMdd_HHmm'),'.csv')
Folder path can be mentioned directly in the sink dataset. (Note, my source and sink both are delimited type)
For filename,
Under sink data set, create a parameter to pass file name and use it in the file name portion of dataset.
Use the below expression in copy activity sink's parameter value
#concat('SaleData_',formatDateTime(utcnow(),'yyyyMMdd_HHmm'),'.csv')
Remember, this just copies your source in a different name. We need to add a delete activity to delete the original source file.
If you are using a dataflow,
make sure you are choosing single partition in the optimize tab of Sink instead of Use current Partitioning.
Then, go to Settings, choose Output to SIngle file. Under filename, mention the expression with timestamp.
concat('SaleData_',toString(currentUTC('yyyyMMdd_HHmm')),'.csv')
As part of migrating data from ADLS Gen1 to ADLS Gen2 using ADFv2, we are having below scenario.
source -> raw/datasourceA/2019/2019-Aug/12-Aug-2019/files
raw/datasourceA/2019/2019-Aug/13-Aug-2019/files
raw/datasourceA/2020/2020-Apr/02-Apr-2020/files
target -> raw/eval/datasourceA/12-Aug-2019/files
raw/eval/datasourceA/13-Aug-2019/files
raw/eval/datasourceA/02-Apr-2020/files
One option to achieve this is by having source path and target path mapping in table and read each row using ADF lookup activity. However doing so, we will end up having so many entries in table.
Is there any other way to achieve it dynamically in ADF ?
In control table I just want to have below source and target path and rest to be handled by ADF.
source path -> raw/datasourceA/
target path -> raw/eval/datasourceA/
Because your folders are hierarchical, I support your idea to pass the file path as a parameter to the copy activity. In ADF, it is more convenient to traverse one hierarchical file.
Declare an array type variable and asign the value ["2019/2019-Aug","2020/2020-Apr"].
Specify the file path via add dynamic content #concat('raw/datasourceA/',item()).
Then sink to the target folder.
We can see the source folders were copied to the target folder.
I have to migrate multiple files(around 2000) in same folder in azure blob storage. I want to read each file with header(as header is different for every file).
And write it into destination folder.
Is there anyway I can do it parallel via pyspark?
I am using below code, but it is only picking header from first file, which is producing wrong output.
Df.read.option(“header”, “true”).parquet(directory/*.parquet)
Df.write.option(“header”,”true”).csv(directory)
Please help me if you know how can I read all the files with source headers of their own.
Thanks!
thanks for coming in.
I am trying to develop a Mapping Data Flow in an Azure Synapse workspace (so I believe that this can also apply to ADFv2) that takes a Delta input and transforms it straight into a Parquet -formatted output, with the relevant detail of using a Parquet dataset pointing to ADLSGen2 with parameterized file system and folder, in opposition to a hard-coded file-system and folder, because this would take creating too many datasets as there are too many folders of interest in the Data Lake.
The Mapping Data Flow:
As I try to use it as a Source in my Mapping Data Flows, the debug configuration (as well as the parent pipeline configuration) will duly ask for my input on those parameters, which I am happy to enter.
Then, as soon I try to debug or run the pipeline I get this error in less than 1 second:
{
"Message": "ErrorCode=InvalidTemplate, ErrorMessage=The expression 'body('DataFlowDebugExpressionResolver')?.50_DeltaToParquet_xxxxxxxxx?.ParquetCurrent.directory' is not valid: the string character '_' at position '43' is not expected."
}
RunId: xxx-xxxxxx-xxxxxx
This error message is not very specific to know where I should look.
I tried replacing the parameterized Parquet dataset with a hard-coded one, and it works perfectly both in debug and pipeline -run modes. However, this does not gets me what I need which is the ability to reuse my Parquet dataset instead of having to create a specific dataset for each Data Lake folder.
There are also no spaces in the Data Lake file system. Please refer to these parameters that look a lot like my production environment:
File System: prodfs001
Directory: synapse/workspace01/parquet/dim_mydim
Thanks in advance to all of you, folks!
The directory name synapse/workspace01/parquet/dim_mydim has an _ in dim_mydim, can you try replacing the underscore, or maybe you can use dimmydim to test whether it works.