Obtain name of file created by parquet() function of DataFrameWriter class? - pyspark

I'm using PySpark code to create a parquet file; specifically, I'm using the parquet() function of the DataFrameWriter class and providing just the location, not the name of the parquet file. I'd like to know the name of the parquet file that was created; however, the function returns None. Any suggestions?

File names created by DataFrameWriter are unpredictable because of the distributed nature of the work (i.e., multiple workers writing to the same location). However, you can retrieve the file name with input_file_name when you re-read those parquet files.
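For example, a minimal sketch of that re-read approach (the output path below and the df/spark variables are placeholders, not from the original question):

from pyspark.sql.functions import input_file_name

# Write the DataFrame; Spark chooses the part-file names itself
df.write.parquet("/tmp/output/my_table")  # hypothetical output location

# Re-read the parquet files and attach the full path of the file each row came from
files_df = (spark.read.parquet("/tmp/output/my_table")
                .withColumn("source_file", input_file_name()))

# List the distinct file names that were actually created
files_df.select("source_file").distinct().show(truncate=False)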

Related

Specify parquet file name when saving in Databricks to Azure Data Lake

Is there a way to specify the name of a parquet file when I am saving it in Databricks to Azure Data Lake? For example, when I try to run the following statement:
append_df.write.mode('append').format('parquet').save('/mnt/adls/covid/base/Covid19_Cases')
a folder called Covid_Cases gets created and there are parquet files with random names inside of it.
What I would like to do is to use the saved parquet file in Data Factory copy activity. In order to do that, I need to specify the parquet file's name, otherwise I can't point to a specific file.
Since Spark executes in distributed mode and files, or their derivatives such as DataFrames, are processed in parallel, the processed data is stored in multiple files within the same folder. You can point the Data Factory copy activity at the folder level. But if you really want a single file, you can use the approach below:
save_location = "/mnt/adls/covid/base/Covid19_Cases" + year   # 'year' is assumed to be a string defined earlier
parquet_location = save_location + "/temp.folder"
file_location = save_location + "/export.parquet"

# Write to a single partition so only one part file is produced
df.repartition(1).write.parquet(path=parquet_location, mode="append")

# Pick the part file (skipping _SUCCESS and other metadata files) and copy it to the desired name
part_file = [f.path for f in dbutils.fs.ls(parquet_location) if f.path.endswith(".parquet")][-1]
dbutils.fs.cp(part_file, file_location)
dbutils.fs.rm(parquet_location, recurse=True)

Azure Data Factory data flow file sink

I am using a .csv file to import data into an Azure SQL database. After the data import is complete, I am moving the source file from the Source container to the myArchive container. I am now trying to save the file as SaleData_yyyyMMdd_HHmm.csv, but instead a folder with this name gets created and the file is broken into multiple part files (part-00000-, part-00001-, ...). Could you please guide me on how to specify the filename with the current date & timestamp.
File System: myArchive
Folder Path: concat('SalesDepartment/Warehouse1/','SaleData_',toString(currentTimestamp(),'yyyyMMdd_HHmm'),'.csv')
The folder path can be specified directly in the sink dataset. (Note: my source and sink are both of delimited-text type.)
For the file name:
Under the sink dataset, create a parameter to pass the file name and use it in the file name portion of the dataset.
Use the expression below as the value of the copy activity sink's parameter:
@concat('SaleData_',formatDateTime(utcnow(),'yyyyMMdd_HHmm'),'.csv')
Remember, this just copies your source under a different name. You need to add a Delete activity to remove the original source file.
If you are using a dataflow,
make sure you choose Single partition in the Optimize tab of the Sink instead of Use current partitioning.
Then, go to Settings and choose Output to single file. Under the file name, use an expression with the timestamp:
concat('SaleData_', toString(currentUTC(), 'yyyyMMdd_HHmm'), '.csv')

How to rename file name in ADF?

I am copying data from SQL to ADLS dynamically, and I want to rename the file after it has been copied into ADLS. How can I achieve this? Any suggestions would be appreciated.
Thanks in Advance.
Regards,
Ashok
My first question would be "why bother renaming parquet files?" Hopefully you aren't generating a single parquet file, which would seem to defeat the purpose of using Parquet. Instead, my focus would be on the folder name.
OPTION 1
If I did care about the file names, I would use a Data Flow and configure the Sink to use patterned naming. You could then pass the desired file name in as a Data Flow parameter and set it dynamically using an expression.
[NOTE: I haven't tested this syntax, but I recommend you always use the Expression Builder to enter these expressions].
OPTION 2
If none of that suits your purposes, then another option would be brute force: use a Copy activity with binary datasets to copy the file to a new file with the desired name, then a Delete activity to remove the old one.

Is there a way to add literals as columns to a Spark dataframe when reading multiple files at once, if the column values depend on the filepath?

I'm trying to read a lot of avro files into a spark dataframe. They all share the same s3 filepath prefix, so initially I was running something like:
path = "s3a://bucketname/data-files"
df = spark.read.format("avro").load(path)
which was successfully identifying all the files.
The individual files are something like:
"s3a://bucketname/data-files/timestamp=20201007123000/id=update_account/0324345431234.avro"
Upon attempting to manipulate the data, the code kept erroring out, with a message that one of the files was not an Avro data file. The actual error message received is: org.apache.spark.SparkException: Job aborted due to stage failure: Task 62476 in stage 44102.0 failed 4 times, most recent failure: Lost task 62476.3 in stage 44102.0 (TID 267428, 10.96.134.227, executor 9): java.io.IOException: Not an Avro data file.
To circumvent the problem, I was able to get the explicit filepaths of the avro files I'm interested in. After putting them in a list (file_list), I was successfully able to run spark.read.format("avro").load(file_list).
The issue now is this - I'm interested in adding a number of fields to the dataframe that are part of the filepath (i.e. the timestamp and the id from the example above).
While using just the bucket and prefix filepath to find the files (approach #1), these fields were automatically appended to the resulting dataframe. With the explicit filepaths, I don't get that advantage.
I'm wondering if there's a way to include these columns while using spark to read the files.
Sequentially processing the files would look something like:
from pyspark.sql.functions import lit

for file in file_list:
    df = spark.read.format("avro").load(file)
    id, timestamp = parse_filename(file)   # parse_filename: the asker's own helper for splitting the path
    df = df.withColumn("id", lit(id)) \
           .withColumn("timestamp", lit(timestamp))
but there are over 500k files and this would take an eternity.
I'm new to Spark, so any help would be much appreciated, thanks!
Two separate things to tackle here:
Specifying Files
Spark has built-in handling for reading all the files of a particular type under a given path. As @Sri_Karthik suggested, try supplying a path like "s3a://bucketname/data-files/*.avro" (if that doesn't work, maybe try "s3a://bucketname/data-files/**/*.avro"... I can't remember the exact pattern-matching syntax Spark uses), which should grab only the avro files and get rid of the error you are seeing from non-avro files in those paths. In my opinion this is more elegant than manually fetching the file paths and specifying them explicitly.
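As a rough, untested sketch (the bucket and layout come from the example path in the question; the glob depth is an assumption about your directory structure):

# Two directory levels (timestamp=.../id=...) sit between the prefix and the .avro files
path = "s3a://bucketname/data-files/*/*/*.avro"
df = spark.read.format("avro").load(path)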
As an aside, the reason you are seeing this is likely that folders typically get marked with metadata files like .SUCCESS or .COMPLETED to indicate they are ready for consumption.
Extracting metadata from filepaths
If you check out this stackoverflow question, it shows how you can add the filename as a new column (both for Scala and PySpark). You could then use the regexp_extract function to parse out the desired elements from that filename string. I've never used Scala in Spark so I can't help you there, but it should be similar to the PySpark version.
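A minimal PySpark sketch of that idea, assuming paths shaped like the example in the question (.../timestamp=<value>/id=<value>/<file>.avro):

from pyspark.sql.functions import input_file_name, regexp_extract

# Attach the source path to every row, then pull the partition-style values out of it
df = (spark.read.format("avro").load("s3a://bucketname/data-files/*/*/*.avro")
          .withColumn("path", input_file_name())
          .withColumn("timestamp", regexp_extract("path", "timestamp=([^/]+)", 1))
          .withColumn("id", regexp_extract("path", "id=([^/]+)", 1))
          .drop("path"))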
Why don't you try reading the files first with the wholeTextFiles method and adding the path name into the data itself at the beginning? Then you can filter out the file names from the data and add them as a column while creating the dataframe. I agree it's a two-step process, but it should work. To get the timestamp of a file you will need a FileSystem object, which is not serializable, i.e. it can't be used in Spark's parallelized operations, so you will have to create a local collection of files and timestamps and join it somehow with the RDD you created with wholeTextFiles.

Accessing value of df.write.partitionBy in file name and performing transformations while saving

I am doing something like
df.write.mode("overwrite").partitionBy("sourcefilename").format("orc").save("s3a://my/dir/path/output-data");
The above code generates the ORC files successfully under the partition directories; however, the file names are something like part-0000.
I need to change the partitionBy (sourcefilename) value while saving, e.g. if the source file name is ABC then the partition directory (which is created during the write) should be 123, if DEF then 345, and so on.
How can we do the above requirements? I am using AWS S3 for reading and writing of files.
I am using Spark 2.x and Scala 2.11.
Given that this example shows the general DataFrameWriter
df.write.partitionBy("EVENT_NAME","dt","hour").save("/apps/hive/warehouse/db/sample")
format, your approach should be to create an extra column xc that is set by a UDF, or some def or val, which maps the name to a value, e.g. ABC --> 123, etc. Then you partition by this xc column and accept that part-xxxxx naming is just how it works in Spark.
You could then rename the files via a script yourself subsequently.
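The question is in Scala, but a rough PySpark sketch of the same idea (the ABC/DEF mapping and the "unknown" fallback are purely illustrative):

from pyspark.sql.functions import col, when

# Derive the partition value xc from the source file name
out = df.withColumn(
    "xc",
    when(col("sourcefilename") == "ABC", "123")
    .when(col("sourcefilename") == "DEF", "345")
    .otherwise("unknown"))

out.write.mode("overwrite").partitionBy("xc").format("orc").save("s3a://my/dir/path/output-data")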
The part-1234 style is how the work is partitioned: different tasks get their own partition of the split data source and save it with that numbering to guarantee that no other task generates output with the same name.
This is fundamental to getting the performance of parallel execution.