How to write a .csv file to ADLS using PySpark - pyspark

I am reading a JSON file from ADLS and writing it back to ADLS as CSV, but Spark creates files with random names (I am writing the script in Azure Synapse):
a _SUCCESS file, and
a data file with a generated name like part-000-***.csv.
I want the file to keep its original name.
Example: sfmc.json should be written to ADLS as sfmc.csv.

This is how Spark persists data from different partitions: each partition is written as its own part file. You can use the Databricks file system utility (dbutils.fs) to rename the file afterwards.
I have written a small utility function that gathers all data onto one partition, persists it as Parquet, and copies the only data file in the folder to the desired name. You can adapt it for JSON or CSV. The utility accepts the folder path and file name, writes to a "tmp" subfolder, and then copies the data file into the target folder under the requested name:
def export_spark_df_to_parquet(df, dir_dbfs_path, parquet_file_name):
    tmp_parquet_dir_name = "tmp"
    tmp_parquet_dir_dbfs_path = dir_dbfs_path + "/" + tmp_parquet_dir_name
    parquet_file_dbfs_path = dir_dbfs_path + "/" + parquet_file_name
    # Export dataframe to Parquet
    df.repartition(1).write.mode("overwrite").parquet(tmp_parquet_dir_dbfs_path)
    listFiles = dbutils.fs.ls(tmp_parquet_dir_dbfs_path)
    for _file in listFiles:
        if len(_file.name) > len(".parquet") and _file.name[-len(".parquet"):] == ".parquet":
            dbutils.fs.cp(_file.path, parquet_file_dbfs_path)
            break
Usage:
export_spark_df_to_parquet(df, "dbfs:/my_folder", "my_df.parquet")
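The same "write to a temporary folder, then promote the single part file" pattern can be sketched on a local file system, which makes it easy to try without a cluster. This is only an illustration under assumptions: the function name promote_single_part_file and the sample file names are hypothetical, and on ADLS or DBFS the move would go through the platform's file system utility rather than pathlib and shutil.

```python
import shutil
import tempfile
from pathlib import Path


def promote_single_part_file(tmp_dir: str, target_path: str, suffix: str = ".csv") -> str:
    """Move the only part file in tmp_dir to target_path and return target_path."""
    # Spark also writes marker files such as _SUCCESS, so keep only part files
    # that carry the expected extension.
    part_files = sorted(
        p for p in Path(tmp_dir).iterdir()
        if p.is_file() and p.name.startswith("part-") and p.name.endswith(suffix)
    )
    if len(part_files) != 1:
        raise RuntimeError(f"expected exactly one part file, found {len(part_files)}")
    Path(target_path).parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(part_files[0]), target_path)
    return target_path


# Simulate a Spark output folder locally (hypothetical file names).
out_dir = Path(tempfile.mkdtemp())
tmp_dir = out_dir / "tmp"
tmp_dir.mkdir()
(tmp_dir / "_SUCCESS").write_text("")
(tmp_dir / "part-00000-abc.csv").write_text("id,name\n1,a\n")

final_path = promote_single_part_file(str(tmp_dir), str(out_dir / "sfmc.csv"))
```

The helper raises if it finds zero or several part files, which is a deliberate choice: it makes a missing repartition(1) visible instead of silently picking one partition.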

Spark does not let you choose the name of the output file. It generates part files with random names. When I used df.write (where df is a Spark dataframe), I got a randomly generated filename.
If you want a file with a specific name, one option is pandas. Convert the Spark dataframe to a pandas dataframe using toPandas() and then save the file using the to_csv() method (assuming CSV is the required file format). Note that toPandas() collects all the data onto the driver, so this only suits data that fits in driver memory.
pdf = df.toPandas()
pdf.to_csv("abfss://data@datalk0711.dfs.core.windows.net/output/output.csv")
Running the above code produced the required file with required file name.

Related

How to loop over azure blob storage locations in Databricks and process files into dataframes?

I have a file system created by Event Hubs that saves files to a location every 10 minutes in this format:
{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}.avro
Example:
eventhub/eventhubservice/0/2022/01/01/01/11/31.avro
There are 2 partitions, 0 and 1, and the rest of the path is the date format shown above.
I'm trying to figure out a way to loop over every folder, read each Avro file into a dataframe, and then write it somewhere more sensible.
However, I haven't really got anywhere with it. I've gotten as far as:
dbutils.fs.ls('/mnt/mount-name/eventhub/eventhubservice/0/2022/01/01/01/11/31.avro')
df = spark.read.format("com.databricks.spark.avro").load("/mnt/mount-name/eventhub/eventhubservice/0/2022/01/01/01/11/31.avro")
display(df)
Has anyone done something like this before in Azure Databricks?
You can use a glob file path when reading the data into a dataframe. For demonstration, I have a folder structure where each folder contains a CSV file.
To read all the files inside these folders into a dataframe, I used a glob file path as follows:
df = spark.read.option("header",True).format("csv").load("/mnt/repro/2022/10/*/*.csv")
#both the files have same data for demonstration
df.show()
So, wherever a level of the path contains multiple folders or files, replace it with *, which matches all of its contents. Use the following code:
df = spark.read.format("com.databricks.spark.avro").load("/mnt/mount-name/eventhub/eventhubservice/*/2022/*/*/*/*/*.avro")
display(df)
#for year 2022.
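If you need to narrow the read to a particular partition or date, the glob can be assembled from its parts instead of typed by hand. The following is a minimal sketch: capture_glob and its parameter names are hypothetical, and it only builds the path string following the layout given in the question. The Spark call is left as a comment.

```python
import fnmatch


def capture_glob(root, namespace, event_hub, partition="*", year="*",
                 month="*", day="*", hour="*", minute="*"):
    """Build a glob for {Namespace}/{EventHub}/{PartitionId}/{Year}/.../{Second}.avro."""
    parts = [root.rstrip("/"), namespace, event_hub, partition,
             year, month, day, hour, minute, "*.avro"]
    return "/".join(parts)


# All partitions, everything in 2022.
pattern_2022 = capture_glob("/mnt/mount-name", "eventhub", "eventhubservice", year="2022")

# Sanity check against the example file from the question.
example_file = "/mnt/mount-name/eventhub/eventhubservice/0/2022/01/01/01/11/31.avro"
example_matches = fnmatch.fnmatch(example_file, pattern_2022)

# In a notebook the pattern would then be passed to the reader, for example:
# df = spark.read.format("com.databricks.spark.avro").load(pattern_2022)
```

Passing partition="0" or month="01" narrows the pattern to a single partition or month in the same way.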

How can I have nice file names & efficient storage usage in my Foundry Magritte dataset export?

I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).
The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.
The issues are:
By default the file name is part-0000-<rid>.snappy.parquet, with a different rid on every build. This means that whenever a new file is uploaded, it appears in the same folder as an additional file, and the only way to tell which is the newest version is by last modified date.
Every version of the data is stored in my external system, which takes up unnecessary storage unless I frequently go in and delete old files.
All of this adds unnecessary complexity to my downstream system. I just want to be able to pull the latest version of the data in a single step.
This is possible by renaming the single parquet file in the dataset so that it always has the same file name. That way the export task will overwrite the previous file in the external system.
This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.
Notes
The build will fail if the input contains more than one parquet file. As pointed out in the question, calling .coalesce(1) (or .repartition(1)) is necessary in the upstream transform.
If you require transaction history in your external store, or your dataset is much larger than 512 MB, this method is not appropriate: only the latest version is kept, and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
The transform does not require any Spark executors, so it is possible to use @configure() to give it a driver-only profile. Giving the driver additional memory should fix out-of-memory errors when working with larger datasets.
shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.
Full code snippet
example_transform.py
from transforms.api import transform, Input, Output
from . import utils  # relative import of utils.py in the same package
@transform(
    output=Output("/path/to/output"),
    source_df=Input("/path/to/input"),
)
def compute(output, source_df):
    return utils.write_single_named_parquet_file(output, source_df, "readable_file_name")
utils.py
from transforms.api import Input, Output
import shutil
import logging
log = logging.getLogger(__name__)
def write_single_named_parquet_file(output: Output, input: Input, file_name: str):
    """Write a single ".snappy.parquet" file with a given file name to a transforms output, containing the data of the
    single ".snappy.parquet" file in the transforms input. This is useful when you need to export the data using
    magritte and want a human readable name in the output. When not using separate transaction folders this should
    cause the previous output to be automatically overwritten.
    The input to this function must contain a single ".snappy.parquet" file. This can be achieved by calling
    `.coalesce(1)` or `.repartition(1)` on your dataframe at the end of the upstream transform that produces the input.
    This function should not be used for large dataframes (e.g. those greater than 512 MB in size), instead
    transaction folders should be enabled in the export. This function can work for larger sizes, but you may find you
    need additional driver memory to perform both the coalesce/repartition in the upstream transform, and here.
    This produces a dataset without a schema, so features like expectations can't be used.
    Parameters:
        output (Output): The transforms output to write the single custom named ".snappy.parquet" file to, this is
            the dataset you want to export
        input (Input): The transforms input containing the data to be written to output, this must contain only one
            ".snappy.parquet" file (it can contain other files, for example logs)
        file_name: The name of the file to be written, ".snappy.parquet" will be automatically appended if not
            already there, and ".snappy" and ".parquet" will be corrected to ".snappy.parquet"
    Raises:
        RuntimeError: Input dataset must be coalesced or repartitioned into a single file.
        RuntimeError: Input dataset file system cannot be empty.
    Returns:
        None: writes the response to output, no return value
    """
    output.set_mode("replace")  # Make sure it is snapshotting
    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non parquet files
    if len(input_files) > 1:
        raise RuntimeError("Input dataset must be coalesced or repartitioned into a single file.")
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")
    input_file_path = input_files[0]
    log.info("Initial output file name: " + file_name)
    # check for snappy.parquet and append if needed
    if file_name.endswith(".snappy.parquet"):
        pass  # if it is already correct, do nothing
    elif file_name.endswith(".parquet"):
        # if it ends with ".parquet" (and not ".snappy.parquet"), remove parquet and append ".snappy.parquet"
        file_name = file_name.removesuffix(".parquet") + ".snappy.parquet"  # str.removesuffix needs Python 3.9+
    elif file_name.endswith(".snappy"):
        # if it ends with just ".snappy" then append ".parquet"
        file_name = file_name + ".parquet"
    else:
        # if doesn't end with any of the above, add ".snappy.parquet"
        file_name = file_name + ".snappy.parquet"
    log.info("Final output file name: " + file_name)
    with input.filesystem().open(input_file_path, "rb") as in_f:  # open the input file
        with output.filesystem().open(file_name, "wb") as out_f:  # open the output file
            shutil.copyfileobj(in_f, out_f)  # write the file into a new file
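The file name handling in the function above can be checked in isolation, without a Foundry environment. The sketch below pulls the same four-way suffix rule into a standalone helper. The name normalize_parquet_name is hypothetical and not part of the original answer, and slicing is used in place of str.removesuffix so that it also runs on Python versions older than 3.9.

```python
def normalize_parquet_name(file_name: str) -> str:
    """Return file_name with a ".snappy.parquet" extension, mirroring the rule above."""
    if file_name.endswith(".snappy.parquet"):
        return file_name  # already correct
    if file_name.endswith(".parquet"):
        # ".parquet" without ".snappy": swap the suffix
        return file_name[: -len(".parquet")] + ".snappy.parquet"
    if file_name.endswith(".snappy"):
        return file_name + ".parquet"
    return file_name + ".snappy.parquet"


names = [
    normalize_parquet_name(n)
    for n in ("report", "report.parquet", "report.snappy", "report.snappy.parquet")
]
```

All four inputs end up with the same name, which is what lets the export overwrite the previous file regardless of how the caller spelled it.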
You can also use the rewritePaths functionality of the export plugin to rename the file matching spark/*.snappy.parquet to "export.parquet" while exporting. This of course only works if there is only a single file, so .coalesce(1) in the transform is a must:
excludePaths:
  - ^_.*
  - ^spark/_.*
rewritePaths:
  '^spark/(.*[\/])(.*)': $1/export.parquet
uploadConfirmation: exportedFiles
incrementalType: snapshot
retriesPerFile: 0
bucketPolicy: BucketOwnerFullControl
directoryPath: features
setBucketPolicy: true
I ran into the same requirement. The only difference was that the dataset had to be split into multiple parts due to its size. Here is the code, updated to handle this use case.
def rename_multiple_parquet_outputs(output: Output, input: Input, file_name_prefix: str):
    """
    Slight improvement to allow multiple output files to be renamed
    """
    output.set_mode("replace")  # Make sure it is snapshotting
    input_files_df = input.filesystem().files()  # Get all files
    input_files = [row[0] for row in input_files_df.collect()]  # noqa - first column in files_df is path
    input_files = [f for f in input_files if f.endswith(".snappy.parquet")]  # filter non parquet files
    if len(input_files) == 0:
        raise RuntimeError("Input dataset file system cannot be empty.")
    input_file_path = input_files[0]
    print(f'input files {input_files}')
    print("prefix for target name: " + file_name_prefix)
    for i, f in enumerate(input_files):
        with input.filesystem().open(f, "rb") as in_f:  # open the input file
            with output.filesystem().open(f'{file_name_prefix}_part_{i}.snappy.parquet', "wb") as out_f:  # open the output file
                shutil.copyfileobj(in_f, out_f)  # write the file into a new file
Also, to use this in a Code Workbook, the input needs to be persisted, and the output parameter can be retrieved as shown below.
def rename_outputs(persisted_input):
    output = Transforms.get_output()
    rename_multiple_parquet_outputs(output, persisted_input, "prefix_for_renamed_files")

How to read first record from .dat file transform it and finally store in HDFS

I am trying to read a .dat file in aws s3 using spark scala shell, and create a new file with just the first record of the .dat file.
Let's say my file path to the .dat file is "s3a://filepath.dat"
I assume my logic should look something like this, but I wasn't able to figure out how to get the first record.
val file = sc.textFile("s3a://filepath.dat")
val onerecord = file.getFirstRecord()
onerecord.saveAsTextFile("s3a://newfilepath.dat")
I've been trying to follow these solutions
How to skip first and last line from a dat file and make it to dataframe using scala in databricks
https://stackoverflow.com/questions/51809228/spark-scalahow-to-read-data-from-dat-file-transform-it-and-finally-store-in-h#:~:text=dat%20file%20in%20Spark%20RDD,be%20delimited%20by%20%22%20%25%24%20%22%20signs
It depends on how records are separated in your .dat file, but in general you could do something like this (assuming the delimiter is '|'):
val raw = session.sqlContext.read.format("csv").option("delimiter","|").load("data/input.txt")
val firstItem = raw.first()
It looks weird but it will solve your problem.

How to read a CSV file whose data contains double quotes and is comma separated, using a Spark dataframe in Databricks

I'm trying to read a CSV file using a Spark dataframe in Databricks. The CSV file contains double-quoted, comma-separated columns. I tried the code below but was not able to read the file correctly, even though I can see the file in the data lake.
The input and output are as follows:
df = spark.read.format("com.databricks.spark.csv")\
    .option("header","true")\
    .option("quoteAll","true")\
    .option("escape",'"')\
    .csv("mnt/A/B/test1.csv")
The input file data (with header):
"A","B","C"
"123","dss","csc"
"124","sfs","dgs"
Output:
"A"|"B"|"C"|

Save files into different format using original filenames

After reading .csv files in a directory, I want to save each of them into .html files using their original file names. But, my code below brings along the extension (.csv) from the original filenames.
For example,
Original files: File1.csv, File2.csv
Result files: File1.csv.html, File2.csv.html
I want to remove ".csv" from the new file names.
import pandas as pd
import glob, os
os.chdir(r"C:\Users\.....\...")
for counter, file in enumerate(glob.glob("*.csv")):
    df = pd.read_csv(file, skipinitialspace=False, sep =',', engine='python')
    df.to_file(os.path.basename(file) + ".html")
The code below removed ".csv" but also ".html"
df.to_file(os.path.basename(file + ".html").split('.')[0])
My expectation is:
File1.html, File2.html
EDIT:
Another post, "How to get the filename without the extension from a path in Python?", suggested how to list existing files without extensions. My issue, however, is to read existing files in a directory and save them using their original file names (excluding the original extension) with a new extension.
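One way to get File1.html from File1.csv is to split off only the last extension before appending the new one, rather than splitting on every dot. The sketch below shows two standard-library options. The helper names are hypothetical, and the resulting name could then be passed to a writer such as pandas' DataFrame.to_html.

```python
import os
from pathlib import Path


def html_name_for(csv_path: str) -> str:
    """Replace the final extension of csv_path with .html using os.path.splitext."""
    stem, _ext = os.path.splitext(os.path.basename(csv_path))
    return stem + ".html"


def html_name_for_pathlib(csv_path: str) -> str:
    """Same result using pathlib, which swaps only the last suffix."""
    return Path(csv_path).with_suffix(".html").name


results = [html_name_for(name) for name in ("File1.csv", "File2.csv")]
```

Unlike split('.')[0], both helpers keep any earlier dots in the name, so a file such as my.data.csv becomes my.data.html.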