Create PySpark DataFrame from a list? - pyspark

So I have a list of headers e.g.
Headers=["col1", "col2", "col3"]
and a list of rows
Body=[ ["val1", "val2", "val3"], ["val1", "val2", "val3"] ]
where val1 corresponds to a value that should go under col1, etc.
If I try createDataFrame(data=Body) it gives an error: can't infer schema type for str.
Is it possible to get a list like this into a pyspark dataframe?
I have tried appending the header to the body e.g.
body.append(header) and then using the createDataFrame function, but it throws up this error:
field _22: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.LongType'>
This is my whole code for this part for generating the body and header:
Basically I am using openpyxl to read an Excel file, where it skips the first x rows etc. and only reads in sheets which have certain column names.
After the body and header have been generated I want to read them directly into Spark.
We had a contractor who wrote it out as a CSV and then read it in using Spark, but it seemed to make more sense to put it directly into Spark.
I want the columns to all be strings at this point.
import csv
from os import sys
from openpyxl import load_workbook

excel_file = "/dbfs/{}".format(path)
wb = load_workbook(excel_file, read_only=True)
sheet_names = wb.get_sheet_names()
sheets = spark.read.option("multiline", "true").format("json").load(configPath)
if dataFrameContainsColumn(sheets, "sheetNames"):
    config_sheets = jsonReader(configFilePath, "sheetNames")
else:
    config_sheets = []
skip_rows = -1
# get a list of the required columns
required_fields_list = jsonReader(configFilePath, "requiredColumns")
for worksheet_name in sheet_names:
    count = 0
    sheet_count = 0
    second_break = False
    worksheet = wb.get_sheet_by_name(worksheet_name)
    # assign the sheet name to the object sheet
    # create empty header and body lists for each sheet
    header = []
    body = []
    # for each row in the sheet we need to append the cells to the header and body
    for i, row in enumerate(worksheet.iter_rows()):
        # if the row index is greater than skip_rows then we want to read that row in as the header
        if i == skip_rows + 1:
            header.append([cell.value for cell in row])
        elif i > skip_rows + 1:
            count = count + 1
            if count == 1:
                header = header[0]
                header = [w.replace(' ', '_') for w in header]
                header = [w.replace('.', '') for w in header]
                if not all(elem in header for elem in required_fields_list):
                    second_break = True
                    break
                else:
                    count = 2
                    sheet_count = sheet_count + 1
            body.append([cell.value for cell in row])

There are a few methods to create a DataFrame from a list.
You can check them out here
Let Spark infer the schema
list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]
df = sc.parallelize(list_of_persons).toDF(['name', 'age', 'score'])
df.printSchema()
df.show()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- score: double (nullable = true)
+-----+---+-----+
| name|age|score|
+-----+---+-----+
|Arike| 28| 78.6|
| Bob| 32|45.32|
|Corry| 65|98.47|
+-----+---+-----+
Specify the types using a map transformation
from pyspark.sql import Row
list_of_persons = [('Arike', 28, 78.6), ('Bob', 32, 45.32), ('Corry', 65, 98.47)]
rdd = sc.parallelize(list_of_persons)
person = rdd.map(lambda x: Row(name=x[0], age=int(x[1]), score=float(x[2])))
schemaPeople = sqlContext.createDataFrame(person)
schemaPeople.printSchema()
schemaPeople.show()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
|-- score: double (nullable = true)
+---+-----+-----+
|age| name|score|
+---+-----+-----+
| 28|Arike| 78.6|
| 32| Bob|45.32|
| 65|Corry|98.47|
+---+-----+-----+
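Given the Headers and Body lists from the question, and since all columns should be strings at this stage, you can also build an explicit all-string schema and pass it to createDataFrame. A minimal sketch (assuming an active SparkSession named spark):
from pyspark.sql.types import StructType, StructField, StringType

Headers = ["col1", "col2", "col3"]
Body = [["val1", "val2", "val3"], ["val1", "val2", "val3"]]

# every column is declared as a string, so Spark does not need to infer types
schema = StructType([StructField(name, StringType(), True) for name in Headers])

df = spark.createDataFrame(Body, schema=schema)
df.printSchema()
df.show()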

Related

Not able to write loop expression in withcolumn in pyspark

I have a dataframe where the DealKeys column has data like
[{"Charge_Type": "DET", "Country": "VN", "Tariff_Loc": "VNSGN"}]
The expected output could be
[{"keyname": "Charge_Type", "value": "DET", "description": "..."}, {"keyname": "Country", "value": "VN", "description": "..."}, {"keyname": "Tariff_Loc", "value": "VNSGN", "description": "..."}]
When I create the dataframe I get the below error:
df = df2.withColumn('new_column',({'keyname' : i, 'value' : dictionary[i],'description' : "..."} for i in col("Dealkeys")))
Error: Column is not iterable
DF2 schema:
root
|-- Charge_No: string (nullable = true)
|-- Status: string (nullable = true)
|-- DealKeys: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Charge_Type: string (nullable = true)
| | |-- Country: string (nullable = true)
| | |-- Tariff_Loc: string (nullable = true)
We cannot iterate through a DataFrame column in PySpark, hence the error occurred.
To get the expected output, you need to follow the approach specified below.
Create the new_column column with an empty string as its value beforehand, so that we can update its value as we iterate through each row.
Since a column cannot be iterated, we can use the collect() method to get the DealKeys values so that we can insert them into the corresponding new_column value.
df.collect() returns a list of rows (and rows can be iterated through). As per the schema, each element of DealKeys is itself a Row. Using dealkey_row (the DealKeys value as a Row) with asDict(), a list comprehension builds the list of dictionaries that will be inserted for the corresponding Charge_No value.
# df is the initial dataframe
from pyspark.sql.functions import lit, when, col

df = df.withColumn("new_column", lit(''))
rows = df.collect()
for row in rows:
    key = row[0]             # Charge_No column value (string type)
    dealkey_row = row[2][0]  # DealKeys column value (Row type)
    # Row.asDict() gives a dictionary, so its keys can be iterated
    lst = [{'keyname': i, 'value': dealkey_row[i], 'description': "..."} for i in dealkey_row.asDict()]
    df = df.withColumn('new_column', when(col('Charge_No') == key, str(lst)).otherwise(col('new_column')))
df.show(truncate=False)
Row.asDict() converts a row into a dictionary so that the list comprehension can be written easily. Using withColumn() along with the when(<condition>, <update_value>) function in PySpark, insert the output of your list comprehension into the new_column column (otherwise() retains the previous value when the Charge_No value doesn't match).
The above code produced the expected output when I used it.
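If you would rather avoid collect() entirely, here is a rough alternative sketch (not part of the answer above; it assumes df2 is the DataFrame with the schema shown and produces new_column as an array of structs rather than a string): read the struct field names from the schema and build the column with array() and struct():
from pyspark.sql import functions as F

# field names of the struct elements inside the DealKeys array, read from the schema
deal_fields = [f.name for f in df2.schema["DealKeys"].dataType.elementType.fields]

# build an array<struct<keyname, value, description>> from the first element of DealKeys
df_out = df2.withColumn(
    "new_column",
    F.array(*[
        F.struct(
            F.lit(name).alias("keyname"),
            F.col("DealKeys").getItem(0).getField(name).alias("value"),
            F.lit("...").alias("description"),
        )
        for name in deal_fields
    ]),
)
df_out.show(truncate=False)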

Join two spark Dataframe using the nested column and update one of the columns

I am working on a requirement in which I am getting one small table in the form of a CSV file, as follows:
root
|-- ACCT_NO: string (nullable = true)
|-- SUBID: integer (nullable = true)
|-- MCODE: string (nullable = true)
|-- NewClosedDate: timestamp (nullable = true)
We also have a very big external hive table in the form of Avro which is stored in HDFS, as follows:
root
|-- accountlinks: array (nullable = true)
| | |-- account: struct (nullable = true)
| | | |-- acctno: string (nullable = true)
| | | |-- subid: string (nullable = true)
| | | |-- mcode: string (nullable = true)
| | | |-- openeddate: string (nullable = true)
| | | |-- closeddate: string (nullable = true)
Now, the requirement is to look up the external hive table based on the three columns from the CSV file: ACCT_NO, SUBID and MCODE. If they match, update accountlinks.account.closeddate with NewClosedDate from the CSV file.
I have already written the following code to explode the required columns and join it with the small table, but I am not really sure how to update the closeddate field (this is currently null for all account holders) with NewClosedDate, because closeddate is a nested column and I cannot easily use withColumn to populate it. In addition, the schema and column names cannot be changed as these files are linked to an external hive table.
val df = spark.sql("select * from db.table where archive='201711'")
val ExtractedColumn = df
.coalesce(150)
.withColumn("ACCT_NO", explode($"accountlinks.account.acctno"))
.withColumn("SUBID", explode($"accountlinks.account.acctsubid"))
.withColumn("MCODE", explode($"C.mcode"))
val ReferenceData = spark.read.format("csv")
.option("header","true")
.option("inferSchema","true")
.load("file.csv")
val FinalData = ExtractedColumn.join(ReferenceData, Seq("ACCT_NO","SUBID","MCODE") , "left")
All you need is to explode the accountlinks array and then join the 2 dataframes like this:
val explodedDF = df.withColumn("account", explode($"accountlinks"))
val joinCondition = $"ACCT_NO" === $"account.acctno" && $"SUBID" === $"account.subid" && $"MCODE" === $"account.mcode"
val joinDF = explodedDF.join(ReferenceData, joinCondition, "left")
Now you can update the account struct column like below, and use collect_list to get back the array structure:
val FinalData = joinDF.withColumn("account",
    struct($"account.acctno", $"account.subid", $"account.mcode",
      $"account.openeddate", $"NewClosedDate".alias("closeddate")))
  .groupBy().agg(collect_list($"account").alias("accountlinks"))
The idea is to create a new struct with all the fields from account, taking closeddate from the NewClosedDate column.
If the struct contains many fields, you can use a for-comprehension to generate all the fields except closeddate, to avoid typing them all out.

PySpark DataFrame change column of string to array before using explode

I have a column called event_data in JSON format in my Spark DataFrame; after parsing it using from_json, I get this schema:
root
|-- user_id: string (nullable = true)
|-- event_data: struct (nullable = true)
| |-- af_content_id: string (nullable = true)
| |-- af_currency: string (nullable = true)
| |-- af_order_id: long (nullable = true)
I only need af_content_id from this column. This attribute can be of different formats:
a String
an Integer
a List of Int and Str. eg ['ghhjj23','123546',12356]
None (sometimes event_data doesn't contain af_content_id)
I want to use the explode function in order to return a new row for each element in af_content_id when it is a list. But when I apply it, I get an error:
from pyspark.sql.functions import explode

def get_content_id(column):
    return column.af_content_id

df_transf_1 = df_transf_1.withColumn(
    "products_basket",
    get_content_id(df_transf_1.event_data)
)
df_transf_1 = df_transf_1.withColumn(
    "product_id",
    explode(df_transf_1.products_basket)
)
cannot resolve 'explode(products_basket)' due to data type mismatch: input to function explode should be array or map type, not StringType;
I know the reason: it's because of the different types that the field af_content_id may contain, but I don't know how to resolve it. Using pyspark.sql.functions.array() directly on the column doesn't work, because it becomes an array of arrays and explode will not produce the expected result.
A sample code to reproduce the step that I'm stuck on:
import pandas as pd

arr = [
    ['b5ad805c-f295-4852-82fc-961a88', 12732936],
    ['0FD6955D-484C-4FC8-8C3F-DA7D28', ['Gklb38', '123655']],
    ['0E3D17EA-BEEF-4931-8104', '12909841'],
    ['CC2877D0-A15C-4C0A-AD65-762A35C1', [12645715, 12909837, 12909837]]
]
df = pd.DataFrame(arr, columns=['user_id', 'products_basket'])
df = df[['user_id', 'products_basket']].astype(str)
df_transf_1 = spark.createDataFrame(df)
I'm looking for a way to convert products_basket to one single format, an array, so that when I apply explode it will produce one id per row.
If you are starting with a DataFrame like:
df_transf_1.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |['Gklb38', '123655'] |
#|0E3D17EA-BEEF-4931-8104 |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
where the products_basket column is a StringType:
df_transf_1.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: string (nullable = true)
You can't call explode on products_basket because it's not an array or map.
One workaround is to remove any leading/trailing square brackets and then split the string on ", " (comma followed by a space). This will convert the string into an array of strings.
from pyspark.sql.functions import col, regexp_replace, split

df_transf_new = df_transf_1.withColumn(
    "products_basket",
    split(regexp_replace(col("products_basket"), r"(^\[)|(\]$)|(')", ""), ", ")
)
df_transf_new.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
The regular expression pattern matches any of the following:
(^\[): An opening square bracket at the start of the string
(\]$): A closing square bracket at the end of the string
('): Any single quote (because your strings are quoted)
and replaces these with an empty string.
This assumes that your data does not contain any single quotes or square brackets that are part of the values inside products_basket.
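As a side note (a sketch, not part of the original answer): pyspark.sql.functions.split treats its pattern as a regular expression, so splitting on a comma followed by optional whitespace is slightly more forgiving if the spacing after the comma is not guaranteed:
from pyspark.sql.functions import col, regexp_replace, split

# same cleanup as above, but tolerate a comma with or without a trailing space
df_transf_new = df_transf_1.withColumn(
    "products_basket",
    split(regexp_replace(col("products_basket"), r"(^\[)|(\]$)|(')", ""), r",\s*")
)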
After the split, the schema of the new DataFrame is:
df_transf_new.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: array (nullable = true)
# | |-- element: string (containsNull = true)
Now you can call explode:
from pyspark.sql.functions import explode
df_transf_new.withColumn("product_id", explode("products_basket")).show(truncate=False)
#+--------------------------------+------------------------------+----------+
#|user_id |products_basket |product_id|
#+--------------------------------+------------------------------+----------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |Gklb38 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |123655 |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12645715 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#+--------------------------------+------------------------------+----------+

Filter a DataFrame based on an element from an array column

I'm working with a dataframe with this schema:
root
|-- c: long (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I'm trying to filter this dataframe based on whether the element ["value1", "key1"] exists in the data array, i.e. if this element exists in data for a row, keep it, otherwise drop it. I tried
df.filter(col("data").contain("["value1", "key1"])
but it didn't work. I also tried
val f = Array("value1", "key1") and then df.filter(col("data").contain(f)), but that didn't work either.
Any help please?
A straightforward approach would be to use a udf function, as a udf lets you apply logic row by row on primitive datatypes (which is what your requirement suggests: check every key and value of the struct elements in the array column data).
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// udf to check for key1 in key and value1 in value of every struct in the array field
def containsUdf = udf((data: Seq[Row]) => data.exists(row => row.getAs[String]("key") == "key1" && row.getAs[String]("value") == "value1"))
// calling the udf function in the filter
val filteredDF = df.filter(containsUdf(col("data")))
so filteredDF should be your desired output.
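For reference, on Spark 2.4+ the same check can be expressed without a UDF using the SQL higher-order function exists. A minimal PySpark sketch, assuming a DataFrame df with the data column described in the question:
from pyspark.sql import functions as F

# exists() (Spark 2.4+) returns true if any struct in the array matches the predicate
filtered_df = df.filter(
    F.expr("exists(data, x -> x.key = 'key1' AND x.value = 'value1')")
)
filtered_df.show(truncate=False)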

DataFrame column names conflict with .(dot)

I have a DataFrame df which has this schema:
root
|-- person.name: string (nullable = true)
|-- person: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
When I do df.select("person.name") I obviously fetch the sub-field name from person. How could I select the column person.name?
For a column name that contains a .(dot), you can use the ` character to enclose the column name:
df.select("`person.name`")
This selects the outer string column person.name: string (nullable = true)
And
df.select("person.name")
This gets the name field from the person column, which is a struct:
|-- person: struct (nullable = true)
| |-- age: long (nullable = true)
If you have the column name in a variable, you can just prepend and append the ` character to it:
"`" + columnName + "`"
I hope this was helpful!
To access the column name with a period using pyspark, do this:
spark.sql("select person.name from person_table")
Note: person_table is a temporary table registered on df (via registerTempTable).
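A minimal sketch of that SQL route (assuming df is the DataFrame from the question); createOrReplaceTempView is the non-deprecated way to register the view, and backticks escape the dot so the top-level dotted column is selected:
# register df as a temporary view
df.createOrReplaceTempView("person_table")

# backticks make Spark treat the dot as part of the column name,
# so this returns the top-level string column "person.name"
spark.sql("select `person.name` from person_table").show()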
My answer provides a working code snippet that illustrates the problem of having dots in column names and explains how you can easily remove dots from column names.
Let's create a DataFrame with some sample data:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("person.name", StringType(), True),
    StructField("person", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)]))
])
data = [
    ("charles", Row("chuck", 42)),
    ("larry", Row("chipper", 48))
]
df = spark.createDataFrame(data, schema)
df.show()
+-----------+-------------+
|person.name| person|
+-----------+-------------+
| charles| [chuck, 42]|
| larry|[chipper, 48]|
+-----------+-------------+
Let's illustrate that selecting person.name will return different results depending on whether or not backticks are used.
cols = ["person.name", "person", "person.name", "`person.name`"]
df.select(cols).show()
+-------+-------------+-------+-----------+
|   name|       person|   name|person.name|
+-------+-------------+-------+-----------+
|  chuck|  [chuck, 42]|  chuck|    charles|
|chipper|[chipper, 48]|chipper|      larry|
+-------+-------------+-------+-----------+
You definitely don't want to write or maintain code that changes results based on the presence of backticks. It's always better to replace all the dots with underscores when starting the analysis.
clean_df = df.toDF(*(c.replace('.', '_') for c in df.columns))
clean_df.select("person_name", "person.name", "person.age").show()
+-----------+-------+---+
|person_name|   name|age|
+-----------+-------+---+
|    charles|  chuck| 42|
|      larry|chipper| 48|
+-----------+-------+---+
This post explains how and why to avoid dots in PySpark column names in more detail.