How to save pyspark data frame in a single csv file

This is a continuation of the "how to save dataframe into csv pyspark" thread.
I'm trying to save my PySpark DataFrame df to a CSV file, using PySpark 3.0.1. So I wrote
df.coalesce(1).write.csv('mypath/df.csv')
But after running this, I see a folder named df.csv in mypath that contains the following 4 files:
1. _committed_..
2. _started_...
3. _SUCCESS
4. part-00000-.. .csv
How do I save all the data into a single file named df.csv?

You can use .coalesce(1) to write the data as a single CSV part file, then rename that file and move it to the desired folder.
Here is a function that does that (it relies on dbutils, so it only works on Databricks):
df: your DataFrame
fileName: name you want for the CSV file (without the .csv extension)
filePath: folder where you want to save it
def export_csv(df, fileName, filePath):
    # Spark always writes a directory of part files, so write to a temporary directory first
    filePathDestTemp = filePath + ".dir/"
    df\
        .coalesce(1)\
        .write\
        .csv(filePathDestTemp)
    # Copy the single part file to its final name
    listFiles = dbutils.fs.ls(filePathDestTemp)
    for subFiles in listFiles:
        if subFiles.name[-4:] == ".csv":
            dbutils.fs.cp(filePathDestTemp + subFiles.name, filePath + fileName + '.csv')
    # Remove the temporary directory and everything in it
    dbutils.fs.rm(filePathDestTemp, recurse=True)

If you want a single output file named df.csv, you can first write to a temporary folder, then move the part file generated by Spark and rename it.
These steps can be done with the Hadoop FileSystem API, available through the JVM gateway (sc is the SparkContext):
temp_path = "mypath/__temp"
target_path = "mypath/df.csv"
df.coalesce(1).write.mode("overwrite").csv(temp_path)
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
# get the part file generated by spark write
fs = Path(temp_path).getFileSystem(sc._jsc.hadoopConfiguration())
csv_part_file = fs.globStatus(Path(temp_path + "/part*"))[0].getPath()
# move and rename the file
fs.rename(csv_part_file, Path(target_path))
fs.delete(Path(temp_path), True)
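When the output lands on a local filesystem (for example, a single-node run), the same move-and-rename step could be sketched with only the Python standard library. This is a minimal sketch, not part of either answer: the helper name promote_part_file is hypothetical, and it assumes Spark has already written exactly one part-*.csv file into the temporary folder.

```python
import glob
import os
import shutil


def promote_part_file(temp_dir: str, target_path: str) -> str:
    """Move the single part file in temp_dir to target_path, then delete temp_dir.

    promote_part_file is a hypothetical helper name used for illustration.
    """
    # The pattern skips _SUCCESS and hidden .crc files, which do not match part-*.csv
    part_files = glob.glob(os.path.join(temp_dir, "part-*.csv"))
    if len(part_files) != 1:
        raise ValueError(f"expected exactly one part file, found {len(part_files)}")
    # An earlier Spark write may have left a directory at the target path
    if os.path.isdir(target_path):
        raise IsADirectoryError(f"{target_path} is a directory; remove it first")
    shutil.move(part_files[0], target_path)
    shutil.rmtree(temp_dir)
    return target_path
```

It would be called after df.coalesce(1).write.mode("overwrite").csv(temp_path), with the same temp_path and target_path values as above.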

Related

PySpark - Return first row from each file in a folder

I have multiple .csv files in a folder on Azure. Using PySpark, I am trying to create a DataFrame with two columns, filename and firstrow, captured for each file in the folder.
Ideally, I would like to avoid reading the files in full, as some of them can be quite large.
I am new to PySpark and do not yet understand the basics, so I would appreciate any help.

I have written code for your scenario, and it works.
Create an empty list and append all the filenames stored in the source folder to it:
# read filenames
filenames = []
l = dbutils.fs.ls("/FileStore/tables/")
for i in l:
    print(i.name)
    filenames.append(i.name)
# convert filenames to one-element tuples
d = [(x,) for x in filenames]
print(d)
Read the first row of each file and store the results in a list:
# build the data by reading from each file
data = []
i = 0
for n in filenames:
    # limit(1) keeps only the first data row; with header=true the header line is not counted as data
    temp = spark.read.option("header", "true").csv("/FileStore/tables/" + n).limit(1)
    temp = temp.collect()[0]
    temp = str(temp)
    s = d[i] + (temp,)
    data.append(s)
    i += 1
print(data)
Now create a DataFrame from the data, with column names:
column = ["filename", "filedata"]
df = spark.createDataFrame(data, column)
df.head(2)
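If the folder is reachable as an ordinary path (for example through a mount), the first line of each file can also be read with plain Python, which stops after the first newline instead of scanning the whole file. This is a minimal sketch under that assumption; first_lines is a hypothetical helper name, and it returns the raw first line, which is the header line if the file has one.

```python
import os


def first_lines(folder: str, suffix: str = ".csv") -> list:
    """Return (filename, first line) tuples for every file in folder ending with suffix.

    first_lines is a hypothetical helper name used for illustration.
    """
    rows = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not name.endswith(suffix) or not os.path.isfile(path):
            continue
        # readline() reads only up to the first newline, not the whole file
        with open(path, "r", encoding="utf-8", newline="") as fh:
            line = fh.readline()
        rows.append((name, line.rstrip("\r\n")))
    return rows


# Possible use with an existing SparkSession (the path is an assumption):
# df = spark.createDataFrame(first_lines("/dbfs/FileStore/tables/"), ["filename", "firstrow"])
```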

PySpark: read all files and write them back to the same files after transformation

Hi, I have files in a directory:
Folder/1.csv
Folder/2.csv
Folder/3.csv
I want to read all of these files into a PySpark DataFrame/RDD, change some column values, and write the results back to the same files.
I have tried this, but it creates a new file in the folder named something like part_000. Instead, I want the modified data written back into the same files (1.csv, 2.csv, 3.csv).
How can I achieve that, for example with a loop that loads each file into its own DataFrame, or with an array or some other logic?

Let's say that, after your transformations, df_1, df_2 and df_3 are the DataFrames that will be saved back into the folder under the same names.
Then you can use this function:
def export_csv(df, fileName, filePath):
    # Write to a temporary directory first, then copy the part file to its final name
    filePathDestTemp = filePath + ".dir/"
    df\
        .coalesce(1)\
        .write\
        .mode('overwrite')\
        .csv(filePathDestTemp)
    listFiles = dbutils.fs.ls(filePathDestTemp)
    for subFiles in listFiles:
        if subFiles.name[-4:] == ".csv":
            dbutils.fs.cp(filePathDestTemp + subFiles.name, filePath + fileName + '.csv')
    dbutils.fs.rm(filePathDestTemp, recurse=True)
...and call it for each df (fileName is passed without the extension, because the function appends '.csv'):
export_csv(df_1, '1', 'Folder/')
export_csv(df_2, '2', 'Folder/')
export_csv(df_3, '3', 'Folder/')

In Palantir Foundry how do I parse a very large csv file without OOMing the driver or executor?

Similar to "How do I parse large compressed csv files in Foundry?", but here the file is not compressed: a system-generated CSV file (>10 GB) needs to be parsed as a Foundry dataset.
A file of this size normally causes the driver to OOM, so how can I parse it?

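Using the filesystem, you can read the file line by line and yield one row per line, splitting each line on the separator (a comma in this case).
from pyspark.sql import Row
df = raw_dataset
fs = df.filesystem()
def process_file(fl):
    # the file name is hardcoded here; fl is the file entry that flatMap passes in
    with fs.open("data_pull.csv", "r") as f:
        # the first line holds the column names
        header = [x.strip() for x in f.readline().split(",")]
        Log = Row(*header)
        for i in f:
            # drop the trailing newline so it does not end up in the last field
            yield Log(*i.rstrip("\n").split(","))
rdd = fs.files().rdd
rdd = rdd.flatMap(process_file)
df = rdd.toDF()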
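Splitting on "," breaks any field that contains a quoted comma. If the file may contain such fields, the per-line parsing could be delegated to Python's csv module, which still streams one line at a time. This is a minimal sketch and not part of the answer above; iter_csv_rows is a hypothetical helper name, and it takes any iterable of text lines, such as an open file handle.

```python
import csv
from typing import Dict, Iterable, Iterator


def iter_csv_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per data row, keyed by the header names.

    iter_csv_rows is a hypothetical helper name used for illustration.
    """
    # csv.reader consumes the lines lazily, so the whole file is never held in memory
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to yield
    names = [name.strip() for name in header]
    for row in reader:
        # zip stops at the shorter sequence, so extra fields in a row are dropped
        yield dict(zip(names, row))
```

Inside a function like process_file, the loop body would become something like `for row in iter_csv_rows(f): yield Row(**row)`, assuming the header names are valid keyword names.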

Pyspark - How to calculate file hashes

I have a bunch of CSV files in a mounted blob container, and I need to calculate the SHA1 hash of every file to store as an inventory. I'm very new to Azure and PySpark, so I'm not sure how to do this efficiently. I wrote the following code with Python and pandas, and I'm trying to use it in PySpark. It seems to work, but it takes quite a while to run because there are thousands of CSV files. I understand that things work differently in PySpark, so can someone tell me whether my approach is correct, or whether there is better code for this task?
import os
import hashlib
import pandas as pd
class File:
    def __init__(self, path):
        self.path = path
    def get_hash(self):
        # read the file in 4 KB chunks so large files are never held in memory at once
        hash = hashlib.sha1()
        with open(self.path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash.update(chunk)
        self.sha1hash = hash.hexdigest()
        return self.sha1hash
path = '/dbfs/mnt/data/My_Folder' #Path to CSV files
cnt = 0
rlist = []
for path, subdirs, files in os.walk(path):
    for fi in files:
        if cnt < 10: #check on only 10 files for now as it takes ages!
            f = File(os.path.join(path, fi))
            cnt +=1
            hash_value = f.get_hash()
            results = {'File_Name': fi, 'File_Path': f.path, 'SHA1_Hash_Value': hash_value}
            rlist.append(results)
            print(fi)
print(str(cnt) + ' files processed')
df = pd.DataFrame(rlist)
#df.to_csv('/dbfs/mnt/workspace/Inventory/File_Hashes.csv', mode='a', header=False) #not sure how to write files in pyspark!
display(df)
Thanks

Since you want to treat the files as blobs rather than read them into a table, I would recommend using spark.sparkContext.binaryFiles. In PySpark this gives you an RDD of pairs where the key is the file path and the value is the file's content as bytes, on which you can calculate the hash in a map function (rdd.mapValues(calculate_hash_of_bytes)).
For more information, refer to the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.binaryFiles.html#pyspark.SparkContext.binaryFiles

How to avoid generating crc files and SUCCESS files while writing data frame into CSV file?

I'm trying to save a DataFrame to a CSV file using the following code: df.repartition(1).write.csv('path',sep = ','). Besides the CSV file, other files are generated (.crc files and a _SUCCESS file).
How do I save the df to a CSV file without generating those CRC files? If that is not possible, how can I make pandas read only the CSV files and skip all the others, taking into account that there is also a file with the extension .csv.crc?

For pandas to read only the CSV files, you can do:
import pandas as pd
from os import listdir
# you can change the suffix; ".csv" is the default
def find_csv_filenames(path_to_dir, suffix=".csv"):
    filenames = listdir(path_to_dir)
    # files ending in .csv.crc are skipped because they do not end with the suffix
    return [filename for filename in filenames if filename.endswith(suffix)]
your_dir = '/your_path_here/complete_route'
csv_files = find_csv_filenames(your_dir)
for filename in csv_files:
    print(pd.read_csv(your_dir + "/" + filename))
If you want to read all the files into a single DataFrame:
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df = pd.concat((pd.read_csv(your_dir + "/" + filename) for filename in csv_files), ignore_index=True)
