Adding a column containing the file name to a pyspark dataframe

I am iterating through csv files in a folder using a for loop and performing some operations on each csv (getting the count of rows for each unique id and storing all these outputs in a pyspark dataframe). Now my requirement is to also add the name of the file to the dataframe for each iteration. Can anyone suggest a way to do this?

You can get the file name as a column using the function pyspark.sql.functions.input_file_name. If your files have the same schema and you want to apply the same processing pipeline, you don't need to loop over the files; you can read them all at once using a glob pattern:
from pyspark.sql.functions import input_file_name

df = spark.read.csv("path/to/the/files/*.csv", header=True, sep=";") \
    .withColumn("file_name", input_file_name())
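input_file_name() returns the full path/URI of each input file. If only the base file name is wanted, one possible sketch (the regular expression simply strips everything up to the last slash) is:

from pyspark.sql.functions import input_file_name, regexp_extract

df = (spark.read.csv("path/to/the/files/*.csv", header=True, sep=";")
      .withColumn("file_name", regexp_extract(input_file_name(), r"([^/]+)$", 1)))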

Related

Save pyspark dataframe entries as separate html files on s3

So, if I have a list of file locations on s3, I can build a dataframe with a column containing the contents of each file in a separate row by doing the following (for example):
s3_path_list = list(df.select('path').toPandas()['path'])
df2 = spark.read.format("binaryFile").load(s3_path_list)
which returns:
df2: pyspark.sql.dataframe.DataFrame
path:string
modificationTime:timestamp
length:long
content:binary
What is the inverse of this operation?
Specifically... I have plotly generating html content stored as a string in an additional 'plot_string' column.
df3: pyspark.sql.dataframe.DataFrame
save_path:string
plot_string:string
How would I go about efficiently saving off each 'plot_string' entry as an html file at some s3 location specified in the 'save_path' column?
Clearly some form of df.write can be used to save off the dataframe (bucketed or partitioned) as parquet, csv, text table, etc... but I can't seem to find any straightforward method to perform a simple parallel write operation without a udf that initializes separate boto clients for each file... which, for large datasets, is a bottleneck (as well as being inelegant). Any help is appreciated.
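One possible sketch (not an answer from the thread; it assumes boto3 is available on the executors and that each save_path is a full s3://bucket/key URI) is to write from foreachPartition, so that a single client is created per partition rather than per row:

import boto3
from urllib.parse import urlparse

def write_partition(rows):
    # One S3 client per partition instead of one per file.
    s3 = boto3.client("s3")
    for row in rows:
        parsed = urlparse(row["save_path"])          # e.g. s3://bucket/some/key.html
        s3.put_object(Bucket=parsed.netloc,
                      Key=parsed.path.lstrip("/"),
                      Body=row["plot_string"].encode("utf-8"),
                      ContentType="text/html")

df3.select("save_path", "plot_string").foreachPartition(write_partition)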

Write only when all tables are valid with Databricks and Delta tables

I'm looping through some CSV files in a folder. I want to write these CSV files as Delta tables only if they are all valid. Each CSV file in the folder has a different name and schema. I want to reject the entire folder and all the files it contains until the data are fixed. I'm running a lot of tests, but ultimately I have to actually write the files as Delta tables with the following loop (simplified for this question):
for f in files:
    # read csv
    df = spark.read.csv(f, header=True, schema=schema)
    # writing to already existing delta table
    df.write.format("delta").save('path/' + f)
Is there a callback mechanism so the write method is executed only if none of the dataframes returns any errors? Delta table schema enforcement is pretty rigid, which is great, but errors can pop up at any time despite all the tests I'm running before passing these files into this loop.
Union is not an option because I want to handle this by date, and each file has a different schema and name.
You can use df.union() or df.unionByName() to combine all of your files into a single dataframe. Then that one is either written fully or fails.
# Create empty dataframe with schema to fill up
emptyRDD = spark.sparkContext.emptyRDD()
df = spark.createDataFrame(emptyRDD, schema)

for f in files:
    # read csv
    dfNext = spark.read.csv(f, header=True, schema=schema)
    df = df.unionByName(dfNext)

df.write.format("delta").save(path)
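If union really is not an option because the schemas differ, one alternative sketch (the schemas dict and paths below are hypothetical) is a two-pass approach: read every file with FAILFAST and force a full pass over the data first, and only start writing once all files have been validated. Note this catches parse errors but not Delta schema-enforcement errors raised at write time.

# Hypothetical two-pass validate-then-write loop; `schemas` maps file name -> expected schema.
validated = {}
try:
    for f in files:
        df = spark.read.csv(f, header=True, schema=schemas[f], mode="FAILFAST")
        df.count()                       # force a full read so malformed rows fail here
        validated[f] = df
except Exception as e:
    print(f"Rejecting the whole folder, {f} failed validation: {e}")
else:
    for f, df in validated.items():
        df.write.format("delta").save('path/' + f)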

Retaining column ordering in PySpark

I have a Synapse Analytics notebook. I am reading a csv file into a pyspark dataframe. When I write this dataframe to a json file, the column order changes to alphabetical order. Can someone help me with how I can retain the column order without hardcoding the column names in the notebook?
For example, when I do df.show() I am getting BCol, CCol, ACol.
When I write to the json file it is written as {ACol='';BCol='';CCol=''}. I am not able to retain the original order.
I am using the following code to write to the json file:
df.coalesce(1).write.format("json").mode("overwrite").save(dest_location)

How to read file path values as columns in Spark?

I'm working in Azure Synapse Notebooks and reading file(s) into a Dataframe from a well-formed folder path whose folders are named with a State=<value> segment, read with a wildcard.
Given there are many folders referenced by that wildcard, how do I capture the "State" value as a column in the resulting Dataframe?
Use the input_file_name function to get the full input path, and then apply regexp_extract to pull out the part that you want.
Example:
from pyspark.sql import functions as F

df = df.withColumn("filepath", F.input_file_name())
df = df.withColumn("filepath", F.regexp_extract("filepath", r"State=(.+)\.snappy\.parquet", 1))
There is no need to use the wildcard *.
Try: df = spark.read.load("abfss://....dfs.core.windows.net/")
Spark can read partitioned folders directly, and df should then contain the column State with its different values.
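For example (the path, container name and values below are purely illustrative), with a layout such as .../data/State=NY/part-00000.snappy.parquet, partition discovery adds the State column automatically:

# Illustrative path; spark.read.load defaults to the parquet source.
df = spark.read.load("abfss://container@account.dfs.core.windows.net/data/")
df.printSchema()                       # schema now includes the discovered `State` column
df.filter(df.State == "NY").show()     # filtering on it also prunes the partitions read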

How to write a csv file into one file with pyspark

I use this method to write a csv file, but it generates a directory containing multiple part files. That is not what I want; I need it in one file. I also found another post using Scala to force everything to be calculated on one partition and then get one file.
First question: how do I achieve this in Python?
The second post also says a Hadoop function could merge multiple files into one.
Second question: is it possible to merge two files in Spark?
You can use:
df.coalesce(1).write.csv('result.csv')
Note: when you use the coalesce function you will lose your parallelism.
You can do this by using the cat command-line tool as below. This will concatenate all of the part files into one csv. There is no need to repartition down to 1 partition.
import os
test.write.csv('output/test')
os.system("cat output/test/p* > output/test.csv")
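If a header row is also wanted with this approach, one sketch (same local-filesystem assumption as above; paths are illustrative) is to write a single header line first and then append the concatenated part files:

import os

test.write.csv('output/test')                     # part files are written without headers
with open('output/test.csv', 'w') as out:
    out.write(','.join(test.columns) + '\n')      # single header line
os.system("cat output/test/p* >> output/test.csv")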
The requirement is to save an RDD in a single CSV file by bringing the RDD to one executor. This means the RDD partitions present across executors would be shuffled to one executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, one can add a column header to the resulting csv file.
First we can define a utility function to make the data CSV-compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let's suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. So I create a header RDD and union it with MyRDD as below, which most of the time keeps the header at the top of the csv file.
unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)
unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
The saveAsPickleFile RDD API method can be used to serialize the saved data in order to save space; use SparkContext's pickleFile to read the pickled file back.
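For example (the path is illustrative):

# Save the RDD in pickled form and read it back later.
MyRDD.saveAsPickleFile("MyPickleLocation")
restored = sc.pickleFile("MyPickleLocation")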
I needed my csv output in a single file with headers, saved to an s3 bucket with the filename I provided. The currently accepted answer, when I run it (Spark 3.3.1 on a Databricks cluster), gives me a folder with the desired filename, and inside it there is one csv file (due to coalesce(1)) with a random name and no headers.
I found that sending it to pandas as an intermediate step provided just a single file with headers, exactly as expected.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)