Pyspark - Code to calculate file hash/checksum not working

I have the below pyspark code to calculate the SHA1 hash of each file in a folder. I'm using spark.sparkContext.binaryFiles to get an RDD of pairs where the key is the file name and the value is a file-like object, on which I'm calculating the hash in a map function rdd.mapValues(map_hash_file). However, I'm getting the below error at the second-last line, which I don't understand - how can this be fixed please? Thanks
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 66.0 failed 4 times, most recent failure: Lost task 0.3 in stage 66.0
Code:
#Function to calculate hash-value/checksum of a file
def map_hash_file(row):
    file_name = row[0]
    file_contents = row[1]
    sha1_hash = hashlib.sha1()
    sha1_hash.update(file_contents.encode('utf-8'))
    return file_name, sha1_hash.hexdigest()

rdd = spark.sparkContext.binaryFiles('/mnt/workspace/Test_Folder', minPartitions=None)

#As a check, print the list of files collected in the RDD
dataColl = rdd.collect()
for row in dataColl:
    print(row[0])

#Apply the function to calculate the hash of each file and store the results
hash_values = rdd.mapValues(map_hash_file)

#Store each file name and its hash value in a dataframe to later export as a CSV
df = spark.createDataFrame(data=hash_values)
display(df)

You will get your expected result if you do the following:
Change file_contents.encode('utf-8') to file_contents; file_contents is already of type bytes.
Change rdd.mapValues(map_hash_file) to rdd.map(map_hash_file); mapValues passes only the file contents to the function, while map_hash_file expects the whole (file_name, file_contents) tuple.
Also consider:
Adding import hashlib
Not collecting the content of all files to the driver - you risk consuming all the memory on the driver.
With the above changes, your code should look something like this:
import hashlib

#Function to calculate hash-value/checksum of a file
def map_hash_file(row):
    file_name = row[0]
    file_contents = row[1]
    sha1_hash = hashlib.sha1()
    sha1_hash.update(file_contents)
    return file_name, sha1_hash.hexdigest()

rdd = spark.sparkContext.binaryFiles('/mnt/workspace/Test_Folder', minPartitions=None)

#Apply the function to calculate the hash of each file and store the results
hash_values = rdd.map(map_hash_file)

#Store each file name and its hash value in a dataframe to later export as a CSV
df = spark.createDataFrame(data=hash_values)
display(df)
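If you then want the inventory as a CSV file, something like the following sketch should work; the output path is a placeholder, and toDF simply names the two columns that createDataFrame produced:

# name the columns, collapse to a single part file, and write with a header
# the output path below is hypothetical - adjust it to your mount point
df.toDF('file_name', 'sha1_hash') \
  .coalesce(1) \
  .write \
  .mode('overwrite') \
  .option('header', 'true') \
  .csv('/mnt/workspace/Inventory/File_Hashes')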

Related

PySpark - Return first row from each file in a folder

I have multiple .csv files in a folder on Azure. Using PySpark I am trying to create a dataframe that has two columns, filename and firstrow, which are captured for each file within the folder.
Ideally I would like to avoid having to read the files in full as some of them can be quite large.
I am new to PySpark and do not yet understand the basics, so I would appreciate any help.
I have written code for your scenario and it is working fine.
Create an empty list and append to it all the filenames stored in the source.
# read filenames
filenames = []
l = dbutils.fs.ls("/FileStore/tables/")
for i in l:
    print(i.name)
    filenames.append(i.name)

# converting filenames to tuples
d = [(x,) for x in filenames]
print(d)
Read the data from the multiple files and store it in a list.
# create data by reading from multiple files
data = []
i = 0
for n in filenames:
    temp = spark.read.option("header", "true").csv("/FileStore/tables/" + n).limit(1)
    temp = temp.collect()[0]
    temp = str(temp)
    s = d[i] + (temp,)
    data.append(s)
    i += 1
print(data)
Now create a DataFrame from the data with column names.
column = ["filename", "filedata"]
df = spark.createDataFrame(data, column)
df.head(2)
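A more compact variant of the same idea (just a sketch, assuming the same /FileStore/tables/ path; .limit(1).collect() still only brings the first row of each file back to the driver):

# build (filename, first row) pairs in a single loop
rows = []
for f in dbutils.fs.ls("/FileStore/tables/"):
    first = spark.read.option("header", "true").csv("/FileStore/tables/" + f.name).limit(1).collect()
    rows.append((f.name, str(first[0]) if first else ""))

df = spark.createDataFrame(rows, ["filename", "filedata"])
display(df)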

Pyspark read all files and write it back it to same file after transformation

Hi I have files in a directory
Folder/1.csv
Folder/2.csv
Folder/3.csv
I want to read all these files into a pyspark dataframe/rdd, change some column values, and write them back to the same files.
I have tried it, but it creates a new file in the folder (part-000-something), whereas I want to write the data back into the same files (1.csv, 2.csv, 3.csv) after modifying the column values.
How can I achieve that, whether with a loop, loading each file into its own dataframe, an array, or any other logic?
Let's say that after your transformations df_1, df_2 and df_3 are the dataframes that will be saved back into the folder under the same names.
Then, you can use this function:
def export_csv(df, fileName, filePath):
    filePathDestTemp = filePath + ".dir/"
    df \
        .coalesce(1) \
        .write \
        .format('csv') \
        .mode('overwrite') \
        .save(filePathDestTemp)
    listFiles = dbutils.fs.ls(filePathDestTemp)
    for subFiles in listFiles:
        if subFiles.name[-4:] == ".csv":
            # copy the single part file over to the desired file name
            dbutils.fs.cp(filePathDestTemp + subFiles.name, filePath + fileName)
    dbutils.fs.rm(filePathDestTemp, recurse=True)
...and call it for each df:
export_csv(df_1, '1.csv', 'Folder/')
export_csv(df_2, '2.csv', 'Folder/')
export_csv(df_3, '3.csv', 'Folder/')
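For completeness, a sketch of how one of those dataframes might be produced before the call; the column name and transformation below are placeholders, not something from the question:

from pyspark.sql import functions as F

# read one source file and apply a placeholder column change, then hand it to export_csv above
df_1 = spark.read.option("header", "true").csv("Folder/1.csv")
df_1 = df_1.withColumn("some_column", F.upper(F.col("some_column")))  # "some_column" is hypothetical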

Pyspark - How to calculate file hashes

I have a bunch of CSV files in a mounted blob container and I need to calculate the SHA1 hash value of every file to store as an inventory. I'm very new to Azure cloud and pyspark, so I'm not sure how this can be achieved efficiently. I have written the following code in Python Pandas and I'm trying to use this in pyspark. It seems to work; however, it takes quite a while to run as there are thousands of CSV files. I understand that things work differently in pyspark, so can someone please advise whether my approach is correct, or if there is a better piece of code I can use to accomplish this task?
import os
import subprocess
import hashlib
import pandas as pd

class File:
    def __init__(self, path):
        self.path = path

    def get_hash(self):
        hash = hashlib.sha1()
        with open(self.path, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash.update(chunk)
        self.sha1hash = hash.hexdigest()
        return self.sha1hash

path = '/dbfs/mnt/data/My_Folder' #Path to CSV files
cnt = 0
rlist = []
for path, subdirs, files in os.walk(path):
    for fi in files:
        if cnt < 10: #check only 10 files for now as it takes ages!
            f = File(os.path.join(path, fi))
            cnt += 1
            hash_value = f.get_hash()
            results = {'File_Name': fi, 'File_Path': f.path, 'SHA1_Hash_Value': hash_value}
            rlist.append(results)
            print(fi)

print(str(cnt) + ' files processed')
df = pd.DataFrame(rlist)
#df.to_csv('/dbfs/mnt/workspace/Inventory/File_Hashes.csv', mode='a', header=False) #not sure how to write files in pyspark!
display(df)
Thanks
Since you want to treat the files as blobs and not read them into a table, I would recommend using spark.sparkContext.binaryFiles. That gives you an RDD of pairs where the key is the file name and the value is the file's contents as bytes, on which you can calculate the hash in a map function (rdd.mapValues(calculate_hash_of_file_like)).
For more information, refer to the documentation: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.binaryFiles.html#pyspark.SparkContext.binaryFiles
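A minimal sketch of that approach (assuming the mount path from the question; calculate_sha1 is a hypothetical helper name):

import hashlib

def calculate_sha1(file_bytes):
    # binaryFiles yields each file's full contents as bytes, so hash them directly
    return hashlib.sha1(file_bytes).hexdigest()

rdd = spark.sparkContext.binaryFiles('/mnt/data/My_Folder')
hashes = rdd.mapValues(calculate_sha1)  # (file_path, sha1_hex) pairs
df = spark.createDataFrame(hashes, ['File_Path', 'SHA1_Hash_Value'])
display(df)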

Databricks PySpark: Make Dataframe from Rows of Strings

In Azure Databricks using PySpark, I'm reading file names from a directory. I am able to print the rows I need:
df_ls = dbutils.fs.ls('/mypath/')
for row in df_ls:
    filename = row.name.lower()
    if 'mytext' in filename:
        print(filename)
Outputs, for example:
mycompany_mytext_2020-12-22_11-34-46.txt
mycompany_mytext_2021-02-01_10-40-57.txt
I want to put those rows into a dataframe but have not been able to make it work. Some of my failed attempts include:
df_ls = dbutils.fs.ls('/mypath/')
for row in df_ls:
    filename = row.name.lower()
    if 'mytext' in filename:
        print(filename)
        # file_list = row[filename].collect() # tuple indices must be integers or slices, not str
        # file_list = filename # last row
        # file_list = filename.collect() # error
        # file_list = spark.sparkContext.parallelize(list(filename)).collect() # breaks last row into a list of each character
# col = 'fname' # this and below generates ParseException
# df = spark.createDataFrame(data = file_list, schema = col)
The question is, how do I collect the row output into a single dataframe column with a row per value?
You can collect the filenames into a list; Spark expects a nested list (one inner list per row).
The program will be as follows:
df_ls = dbutils.fs.ls('/mypath/')
file_names = []
for row in df_ls:
    if 'mytext' in row.name.lower():
        file_names.append([row.name])

df = spark.createDataFrame(file_names, ['filename'])
display(df)
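The same result can be written a bit more compactly with a list comprehension (same path and filter as above):

file_names = [[row.name] for row in dbutils.fs.ls('/mypath/') if 'mytext' in row.name.lower()]
df = spark.createDataFrame(file_names, ['filename'])
display(df)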

To split data into good and bad rows and write to output file using Spark program

I am trying to filter the good and bad rows by counting the number of delimiters in a TSV.gz file and write them to separate files in HDFS.
I ran the below commands in spark-shell
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
var good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I check the first record, the value can be seen in the spark shell:
good.first()
But when I try to write to an output file, I see the records below:
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;@1287b635
[Ljava.lang.String;@2ef89922
Could you please let me know how to get the required output file in HDFS?
Thanks!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], which leads to writing object references instead of string values in the write operation.
You should convert each array of strings back into a tab-separated string before saving. Just try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")