Databricks PySpark: Make Dataframe from Rows of Strings - pyspark

In Azure Databricks using PySpark, I'm reading file names from a directory. I am able to print the rows I need:
df_ls = dbutils.fs.ls('/mypath/')
for row in df_ls:
    filename = row.name.lower()
    if 'mytext' in filename:
        print(filename)
Outputs, for example:
mycompany_mytext_2020-12-22_11-34-46.txt
mycompany_mytext_2021-02-01_10-40-57.txt
I want to put those rows into a dataframe but have not been able to make it work. Some of my failed attempts include:
df_ls = dbutils.fs.ls('/mypath/')
for row in df_ls:
    filename = row.name.lower()
    if 'mytext' in filename:
        print(filename)
        # file_list = row[filename].collect()  # tuple indices must be integers or slices, not str
        # file_list = filename  # only keeps the last row
        # file_list = filename.collect()  # error
        # file_list = spark.sparkContext.parallelize(list(filename)).collect()  # breaks the last row into a list of individual characters
# col = 'fname'  # this and the line below raise a ParseException
# df = spark.createDataFrame(data = file_list, schema = col)
The question is, how do I collect the row output into a single dataframe column with a row per value?

You can collect the filenames into a list; note that spark.createDataFrame expects a nested list (one inner list per row).
The program will be as follows:
df_ls = dbutils.fs.ls('/mypath/')
file_names = []
for row in df_ls:
    if 'mytext' in row.name.lower():
        file_names.append([row.name])
df = spark.createDataFrame(file_names,['filename'])
display(df)
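As a follow-up, the same loop can be written as a list comprehension. This is just an equivalent sketch, assuming the same '/mypath/' directory and 'mytext' filter as above:
# Equivalent sketch: build the nested list (one inner list per row) in a single comprehension
file_names = [[row.name] for row in dbutils.fs.ls('/mypath/')
              if 'mytext' in row.name.lower()]
df = spark.createDataFrame(file_names, ['filename'])
display(df)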

Related

PySpark - Return first row from each file in a folder

I have multiple .csv files in a folder on Azure. Using PySpark I am trying to create a dataframe that has two columns, filename and firstrow, which are captured for each file within the folder.
Ideally I would like to avoid having to read the files in full as some of them can be quite large.
I am new to PySpark and do not yet understand the basics, so I would appreciate any help.
I have written code for your scenario and it is working fine.
Create an empty list and append to it all the filenames stored in the source:
# read filenames
filenames = []
l = dbutils.fs.ls("/FileStore/tables/")
for i in l:
    print(i.name)
    filenames.append(i.name)
# converting filenames to tuple
d = [(x,) for x in filenames]
print(d)
Read the first row from each of the files and store it in a list:
# create data by reading from multiple files
data = []
i = 0
for n in filenames:
    temp = spark.read.option("header", "true").csv("/FileStore/tables/" + n).limit(1)
    temp = temp.collect()[0]
    temp = str(temp)
    s = d[i] + (temp,)
    data.append(s)
    i += 1
print(data)
Now create a DataFrame from the data with column names:
column = ["filename", "filedata"]
df = spark.createDataFrame(data, column)
df.head(2)
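Since the goal is to avoid reading the files in full, a possible alternative is to skip Spark for the first-row lookup and use dbutils.fs.head, which only fetches the first bytes of a file. This is only a sketch: the 1000-byte limit is an assumption and would need to cover the header plus the first row of your widest file.
# Sketch: grab only the first data row of each CSV via dbutils.fs.head
# (reads at most the first 1000 bytes per file - an assumed limit)
data = []
for n in filenames:
    head = dbutils.fs.head("/FileStore/tables/" + n, 1000)
    lines = head.splitlines()
    first_row = lines[1] if len(lines) > 1 else ""  # line 0 is the header
    data.append((n, first_row))
df = spark.createDataFrame(data, ["filename", "firstrow"])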

Pyspark - Code to calculate file hash/checksum not working

I have the below pyspark code to calculate the SHA1 hash of each file in a folder. I'm using spark.sparkContext.binaryFiles to get an RDD of pairs where the key is the file name and the value is a file-like object, on which I'm calculating the hash in a map function rdd.mapValues(map_hash_file). However, I'm getting the below error at the second-last line, which I don't understand - how can this be fixed please? Thanks
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 66.0 failed 4 times, most recent failure: Lost task 0.3 in stage 66.0
Code:
#Function to calculate hash-value/checksum of a file
def map_hash_file(row):
    file_name = row[0]
    file_contents = row[1]
    sha1_hash = hashlib.sha1()
    sha1_hash.update(file_contents.encode('utf-8'))
    return file_name, sha1_hash.hexdigest()
rdd = spark.sparkContext.binaryFiles('/mnt/workspace/Test_Folder', minPartitions=None)
#As a check, print the list of files collected in the RDD
dataColl=rdd.collect()
for row in dataColl:
    print(row[0])
#Apply the function to calculate the hash of each file and store the results
hash_values = rdd.mapValues(map_hash_file)
#Store each file name and its hash value in a dataframe to later export as a CSV
df = spark.createDataFrame(data=hash_values)
display(df)
You will get your expected result if you do the following:
Change file_contents.encode('utf-8') to file_contents; file_contents is already of type bytes.
Change rdd.mapValues(map_hash_file) to rdd.map(map_hash_file). The function map_hash_file expects a tuple.
Also consider:
Adding an import hashlib
Not collecting the content of all files to the driver, since you risk consuming all the memory on the driver.
With the above changes, your code should look something like this:
import hashlib
#Function to calculate hash-value/checksum of a file
def map_hash_file(row):
    file_name = row[0]
    file_contents = row[1]
    sha1_hash = hashlib.sha1()
    sha1_hash.update(file_contents)
    return file_name, sha1_hash.hexdigest()
rdd = spark.sparkContext.binaryFiles('/mnt/workspace/Test_Folder', minPartitions=None)
#Apply the function to calculate the hash of each file and store the results
hash_values = rdd.map(map_hash_file)
#Store each file name and its hash value in a dataframe to later export as a CSV
df = spark.createDataFrame(data=hash_values)
display(df)
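Since the stated goal is to export the result as a CSV later, a possible final step might look like the following sketch; the column names and the '/mnt/workspace/file_hashes' output path are assumptions:
# Sketch: name the columns and write the filename/hash pairs out as a single CSV file
df.toDF("filename", "sha1_hash") \
  .coalesce(1) \
  .write.option("header", True).mode("overwrite") \
  .csv("/mnt/workspace/file_hashes")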

Reading flat file with multi-line string without quote with PySpark

I have a flat file delimited by | (pipe), without a quote character. Sample data looks like the following:
SOME_NUMBER|SOME_MULTILINE_STRING|SOME_STRING
23|multiline
text1|text1
24|multi
mulitline
text2|text2
25|text3|text4
What I'm trying to do is to load it into a dataframe to look something like this:
SOME_NUMBER | SOME_MULTILINE_STRING | SOME_STRING
23          | multilinetext1        | text1
24          | multimulitlinetext2   | text2
25          | text3                 | text4
I tried to specify the multiLine option with no luck; regardless of whether it is set to True or False, the output doesn't change. I suppose what I'm trying to achieve is to specify that I'm expecting multi-line data, and that every record has the same number of columns specified in the schema.
df_file = spark.read.csv(filePath,
                         sep="|",
                         header=True,
                         enforceSchema=True,
                         schema=df_table.schema,  # I need to explicitly specify the schema
                         quote='',
                         multiLine=True)
To fix that type of PSV (pipe-separated values) file, without quotes but with multiline values (newlines in cell values), you need a stateful algorithm that loops through the rows and decides when to insert the quotes. This means the operation is not easily parallelizable, so you might as well do it using Python on the RDD rows:
def from_psv_without_quotes(path, sep='|', quote='"'):
    rddFromFile = sc.textFile(path)
    rdd = rddFromFile.zipWithIndex().keys()
    headers = rdd.first()
    cols = headers.split(sep)
    schema = ", ".join([f"{col} STRING" for col in cols])
    n_pipes = headers.count('|')
    rows = rdd.collect()
    processed_rows = []
    n_pipes_in_current_row = 0
    complete_row = ""
    for row in rows:
        if n_pipes_in_current_row < n_pipes:
            complete_row += row if n_pipes_in_current_row == 0 else "\n" + row
            n_pipes_in_current_row += row.count('|')
        if n_pipes_in_current_row == n_pipes:
            complete_row = quote + complete_row.replace('|', f'{quote}|{quote}') + quote
            processed_rows.append(complete_row)
            n_pipes_in_current_row = 0
            complete_row = ""
    processed_rdd = sc.parallelize(processed_rows)
    print(processed_rdd.collect())
    df = spark.read.csv(
        processed_rdd,
        sep=sep,
        quote=quote,
        ignoreTrailingWhiteSpace=True,
        ignoreLeadingWhiteSpace=True,
        header=True,
        mode='PERMISSIVE',
        schema=schema,
        enforceSchema=False,
    )
    return df
df = from_psv_without_quotes('/path/to/unqoted_multiline.psv')
df.show()
I am assuming you are constrained to reading from Hadoop, so the example solution is a naive first attempt. It is really inefficient because of the rdd.collect() etc. I am sure you could do this much more efficiently if you avoided the whole Spark infrastructure and preprocessed the unquoted multiline file with some GNU tools like sed and awk.
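If the file fits on a single machine, the same preprocessing can also be done outside of Spark in plain Python before handing the quoted result to spark.read.csv. The sketch below reuses the pipe-counting idea above; the local input and output paths are assumptions:
# Sketch: quote the fields of an unquoted, multiline PSV file in plain Python
# so that spark.read.csv can parse it afterwards. Paths are assumptions.
def quote_psv(in_path, out_path, sep='|', quote='"'):
    with open(in_path) as f:
        lines = f.read().splitlines()
    n_seps = lines[0].count(sep)  # separators expected per logical record
    out, buf, seen = [], "", 0
    for line in lines:
        buf += line if seen == 0 else "\n" + line
        seen += line.count(sep)
        if seen == n_seps:
            out.append(quote + buf.replace(sep, f'{quote}{sep}{quote}') + quote)
            buf, seen = "", 0
    with open(out_path, 'w') as f:
        f.write("\n".join(out))

quote_psv('/tmp/unquoted_multiline.psv', '/tmp/quoted_multiline.psv')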

Reading and processing data in Spark - output is not delimited correctly

So my stored output looks like this; it is one column with
\N|\N|\N|8931|\N|1
where | is supposed to be the column delimiter. So it should have 6 columns, but it only has one.
My code to generate this is
val distData = sc.textFile(inputFileAdl).repartition(partitions.toInt)
val x = new UdfWrapper(inputTempProp, "local")
val wrapper = sc.broadcast(x)
distData.map({s =>
  wrapper.value.exec(s.toString)
}).toDF().write.parquet(outFolder)
Nothing inside of the map can be changed. wrapper.value.exec(s.toString) returns a delimited string (this cannot be changed). I want to write this delimited string to a parquet file, but have it be correctly split into columns by a given delimiter. How can I accomplish this?
So current output - one column which is a delimited string
Expected output - six columns from the single delimited string

How to parse a specific portion of a text file using two delimiters or strings in Scala

I have a sample.txt file.
The file contains logs with dates and times.
For example:
10.10.2012:
erewwetrt=1
wrtertret=2
ertertert=3
;
10.10.2012:
asdafdfd=1
adadfadf=2
adfdafdf=3
;
10.12.2013:
adfsfsdfgg=1
sdfsdfdfg=2
sdfsdgsdg=3
;
12.12.2012:
asdasdas=1
adasfasdf=2
dfsdfsdf=3
;
I just want to retrieve only the year 2012 data, that is, between 12.12.2012: and ;
How can I do this in Scala or Spark Scala?
Finally I need to replace = with a comma and save it in CSV format.
How can I do it?
To extract that specific part you can use this:
def main(args: Array[String]): Unit = {
  val text = "10.10.2012:\nerewwetrt=1\nwrtertret=2\nertertert=3\n;\n10.10.2012:\nasdafdfd=1\nadadfadf=2\nadfdafdf=3\n;\n10.12.2013:\nadfsfsdfgg=1\nsdfsdfdfg=2\nsdfsdgsdg=3\n;\n12.12.2012:\nasdasdas=1\nadasfasdf=2\ndfsdfsdf=3\n;"
  val lines = text.split("\n")
  val extracted = lines.dropWhile(_ != "12.12.2012:").drop(1).takeWhile(_ != ";")
  extracted.foreach(println(_))
}