I have a dataframe like the one below:
id contact_persons
-----------------------
1 [[abc, abc#xyz.com, 896676, manager],[pqr, pqr#xyz.com, 89809043, director],[stu, stu#xyz.com, 09909343, programmer]]
The schema looks like this:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I need to convert this dataframe to match the schema below:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- emails: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- phone: string (nullable = true)
| | |-- roles: string (nullable = true)
I know there is a struct function in PySpark, but in this scenario I don't know how to use it, since the array is dynamically sized.
You can use the TRANSFORM expression to cast it:
import pyspark.sql.functions as f
df = spark.createDataFrame([
    [1, [['abc', 'abc#xyz.com', '896676', 'manager'],
         ['pqr', 'pqr#xyz.com', '89809043', 'director'],
         ['stu', 'stu#xyz.com', '09909343', 'programmer']]]
], schema='id string, contact_persons array<array<string>>')
expression = 'TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))'
output_df = df.withColumn('contact_persons', f.expr(expression))
# output_df.printSchema()
# root
# |-- id: string (nullable = true)
# |-- contact_persons: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- name: string (nullable = true)
# | | |-- emails: string (nullable = true)
# | | |-- phone: string (nullable = true)
# | | |-- roles: string (nullable = true)
output_df.show(truncate=False)
+---+-----------------------------------------------------------------------------------------------------------------------+
|id |contact_persons |
+---+-----------------------------------------------------------------------------------------------------------------------+
|1 |[{abc, abc#xyz.com, 896676, manager}, {pqr, pqr#xyz.com, 89809043, director}, {stu, stu#xyz.com, 09909343, programmer}]|
+---+-----------------------------------------------------------------------------------------------------------------------+
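If you prefer the DataFrame API over a SQL expression string, here is a minimal sketch of the same conversion, assuming Spark 3.1+ where pyspark.sql.functions.transform accepts a Python lambda:

# Sketch assuming Spark >= 3.1; same reshaping as the TRANSFORM expression above.
output_df = df.withColumn(
    'contact_persons',
    f.transform(
        'contact_persons',
        lambda el: f.struct(
            el[0].alias('name'),
            el[1].alias('emails'),
            el[2].alias('phone'),
            el[3].alias('roles'),
        ),
    ),
)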
I have a DataFrame with the following schema:
|-- Data: struct (nullable = true)
| |-- address_billing: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| |-- address_shipping: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- city: string (nullable = true)
| |-- cancelled_initiator: string (nullable = true)
| |-- cancelled_reason: string (nullable = true)
| |-- statuses: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- store_code: string (nullable = true)
| |-- store_name: string (nullable = true)
| |-- tax_code: string (nullable = true)
| |-- total: string (nullable = true)
| |-- updated_at: string (nullable = true)
I need to extract all of its fields into separate columns without manually giving the names.
Is there any way to do this?
I tried:
val df2=df1.select(df1.col("Data.*"))
but got the error
org.apache.spark.sql.AnalysisException: No such struct field * in address_billing, address_shipping,....
Also, can anyone suggest how to add a prefix to all these columns, since some of the column names may be the same?
The output should look like:
address_billing_address1
address_billing_address2
.
.
.
Just change df1.col to col. Any of these should work:
df1.select(col("Data.*"))
df1.select($"Data.*")
df1.select("Data.*")
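For the second part of the question (adding a prefix so that colliding inner field names stay distinct), one option is to build the select list from the schema. A minimal sketch in PySpark syntax, assuming a DataFrame named df1 with the Data struct above; the Scala equivalent follows the same pattern:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

# Flatten Data one more level, prefixing each inner field with its parent name
# so that address_billing.address1 and address_shipping.address1 stay distinct.
flat_cols = []
for field in df1.select("Data.*").schema.fields:
    if isinstance(field.dataType, StructType):
        flat_cols += [
            col(f"Data.{field.name}.{sub.name}").alias(f"{field.name}_{sub.name}")
            for sub in field.dataType.fields
        ]
    else:
        flat_cols.append(col(f"Data.{field.name}").alias(field.name))

df2 = df1.select(*flat_cols)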
I have a PySpark dataframe loaded from a 3 GB json.gz file, with the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- articleID: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- author: string (nullable = true)
| | |-- source: string (nullable = true)
I need to drop the title, author and date fields, or create a new DataFrame that does not include these fields.
So far I've managed to get the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- articleID: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- source: array (nullable = true)
| | | |-- element: string (containsNull = true)
using
df.select(df._id, df.quote,
          array(
              struct(
                  col("occurrences.articleID"),
                  col("occurrences.source")
              )
          ).alias("occurrences"))
But I need a way to keep articleIDs and sources together in the same struct. How can I do this?
Okay, I found something that works:
from pyspark.sql.functions import col, collect_set, explode, struct

clean_df = (
    df.withColumn("exploded", explode("occurrences"))
      .select(
          "_id",
          "quote",
          col("exploded.articleID").alias("articleID"),
          col("exploded.source").alias("source"),
      )
      .withColumn("occs", struct(col("articleID"), col("source")))
      .groupBy("_id", "quote")
      .agg(collect_set("occs").alias("occurrences"))
)
But if anyone has a better solution, I'd love to hear it, since this seems very round-about. (And as a side note, collect_set only seems to work with Java 8.)
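One less round-about sketch, assuming Spark 2.4+ where higher-order functions are available: rebuild each struct inside the array in place with transform, which keeps articleID and source together and avoids the explode/groupBy shuffle entirely.

from pyspark.sql import functions as F

# Keep only articleID and source inside each occurrences element, without exploding.
clean_df = df.withColumn(
    "occurrences",
    F.expr("transform(occurrences, o -> struct(o.articleID as articleID, o.source as source))"),
)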
I am trying to flatten a complex JSON structure containing nested arrays and struct elements, using a generic function that should work for any JSON file with any schema.
Below is part of a sample JSON structure that I want to flatten:
root
|-- Data: struct (nullable = true)
| |-- Record: struct (nullable = true)
| | |-- FName: string (nullable = true)
| | |-- LName: long (nullable = true)
| | |-- Address: struct (nullable = true)
| | | |-- Applicant: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Id: long (nullable = true)
| | | | | |-- Type: string (nullable = true)
| | | | | |-- Option: long (nullable = true)
| | | |-- Location: string (nullable = true)
| | | |-- Town: long (nullable = true)
| | |-- IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
to
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant_Id: long (nullable = true)
|-- Data_Record_Address_Applicant_Type: string (nullable = true)
|-- Data_Record_Address_Applicant_Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
I am using the code below, as suggested in this thread:
How to flatten a struct in a Spark dataframe?
from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []

    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])

    flat_df.append(nested_df.select(flat_cols[0] +
                                    [col(nc + '.' + c).alias(nc + '_' + c)
                                     for nc in nested_cols[0]
                                     for c in nested_df.select(nc + '.*').columns]))

    for i in range(1, layers):
        print(flat_cols[i - 1])
        flat_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] == 'struct'])

        flat_df.append(flat_df[i - 1].select(flat_cols[i] +
                                             [col(nc + '.' + c).alias(nc + '_' + c)
                                              for nc in nested_cols[i]
                                              for c in flat_df[i - 1].select(nc + '.*').columns]))

    return flat_df[-1]

my_flattened_df = flatten_df(jsonDF, 10)
my_flattened_df.printSchema()
But it doesn't work for array elements. With the above code I get the output below. How can I modify this code to include arrays too?
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Id: long (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
This is not a duplicate, as there is no existing post about a generic function that flattens a complex JSON schema including arrays.
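For what it's worth, here is a hedged sketch of one way to extend the idea to arrays (PySpark; the helper name flatten is mine): on each pass, either expand the first struct column found or explode the first array column, repeating until neither remains. Note that exploding duplicates the other columns once per array element, which is what a fully flat output implies.

from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import ArrayType, StructType

def flatten(df):
    while True:
        nested = [f for f in df.schema.fields
                  if isinstance(f.dataType, (StructType, ArrayType))]
        if not nested:
            return df
        field = nested[0]
        if isinstance(field.dataType, StructType):
            # Expand the struct into prefixed top-level columns.
            df = df.select(
                *[c for c in df.columns if c != field.name],
                *[col(f"{field.name}.{sub.name}").alias(f"{field.name}_{sub.name}")
                  for sub in field.dataType.fields],
            )
        else:
            # Explode the array in place; its elements are handled on later passes.
            df = df.withColumn(field.name, explode_outer(col(field.name)))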
I am working on JSON files with DataFrames, and I can't manage to filter an array's fields.
This is my input struct:
root
|-- MyObject: struct (nullable = true)
| |-- Field1: long (nullable = true)
| |-- Field2: string (nullable = true)
| |-- Field3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Field3_1: boolean (nullable = true)
| | | |-- Field3_2: string (nullable = true)
| | | |-- Field3_3: string (nullable = true)
| | | |-- Field3_3: string (nullable = true)
and I want a DataFrame like this:
root
|-- Field1: long (nullable = true)
|-- Field3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Field3_1: boolean (nullable = true)
| | |-- Field3_3: string (nullable = true)
The best I've got is with
df.select($"MyObject.Field1",
$"MyObject.Field3.Field3_1" as "Field3.Field3_1",
$"MyObject.Field3.Field3_3" as "Field3.Field3_3")
which gives me:
root
|-- Field1: long (nullable = true)
|-- Field3_1: array (nullable = true)
| |-- element: boolean (nullable = true)
|-- Field3_3: array (nullable = true)
| |-- element: string (nullable = true)
I can't use the array function because Field3_1 and Field3_3 don't have the same type.
How can I create an array of structs with only the selected fields?
I'm a beginner with Spark SQL, maybe I'm missing something!
Thanks.
The easiest solution is to use a udf function, as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def arraystructUdf = udf((f3: Seq[Row]) => f3.map(row => field3(row.getAs[Boolean]("Field3_1"), row.getAs[String]("Field3_3"))))
df.select(col("MyObject.Field1"), arraystructUdf(col("MyObject.Field3")).as("Field3"))
where field3 is a case class
case class field3(Field3_1:Boolean, Field3_3:String)
which should give you
root
|-- Field1: long (nullable = true)
|-- Field3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Field3_1: boolean (nullable = false)
| | |-- Field3_3: string (nullable = true)
I hope the answer is helpful.
I have the two DataFrame schemas below. Inside USER_INFO, modules is an array, and content is an array nested inside each modules element. I want to join/attach some additional data (METADATA) to each content element, such that
USER_INFO.modules.content.id = METADATA.cust_id
What would be a solution?
USER_INFO
root
|-- userId: string (nullable = true)
|-- modules: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- content: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- distance: double (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- impressionId: string (nullable = true)
| | |-- id: string (nullable = true)
METADATA
root
|-- cust_id: string (nullable = true)
|-- image_url: string (nullable = true)
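No answer is included here, but one common approach, sketched in PySpark and assuming DataFrames named user_info and metadata with the schemas above, is to explode down to the content level, join on the id, and rebuild the nested shape. Note that collect_list does not preserve the original element order.

from pyspark.sql import functions as F

# Explode modules and content so each row holds a single content element.
exploded = (
    user_info
    .withColumn("module", F.explode("modules"))
    .withColumn("content", F.explode("module.content"))
    .select("userId", F.col("module.id").alias("module_id"), "content")
)

# Attach image_url to every content element, then rebuild modules per user.
enriched = (
    exploded
    .join(metadata, F.col("content.id") == F.col("cust_id"), "left")
    .withColumn("content", F.struct("content.distance", "content.id",
                                    "content.impressionId", "image_url"))
    .groupBy("userId", "module_id")
    .agg(F.collect_list("content").alias("content"))
    .groupBy("userId")
    .agg(F.collect_list(F.struct(F.col("content"),
                                 F.col("module_id").alias("id"))).alias("modules"))
)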