Selecting fields of structs inside an array inside a dataframe - pyspark

I have a PySpark dataframe loaded from a 3 GB json.gz file, with the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- articleID: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- author: string (nullable = true)
| | |-- source: string (nullable = true)
I need to drop the title, author, and date fields, or create a new DataFrame that does not include them.
So far I've managed to get the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- articleID: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- source: array (nullable = true)
| | | |-- element: string (containsNull = true)
using:
from pyspark.sql.functions import array, col, struct

df.select(df._id, df.quote,
          array(
              struct(
                  col("occurrences.articleID"),
                  col("occurrences.source")
              )
          ).alias("occurrences"))
But I need a way to keep articleIDs and sources together in the same struct. How can I do this?

Okay, I found something that works:
from pyspark.sql.functions import col, collect_set, explode, struct

clean_df = (
    df.withColumn("exploded", explode("occurrences"))
      .select(
          "_id",
          "quote",
          col("exploded.articleID").alias("articleID"),
          col("exploded.source").alias("source"),
      )
      .withColumn("occs", struct(col("articleID"), col("source")))
      .groupBy("_id", "quote")
      .agg(collect_set("occs").alias("occurrences"))
)
But if anyone has a better solution, I'd love to hear it, since this seems very roundabout. (As a side note, collect_set only seems to work with Java 8.)
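For reference, on Spark 2.4+ the explode/groupBy round trip can be avoided entirely: the TRANSFORM higher-order function rebuilds each struct in place. A sketch against the schema above (TRANSFORM and STRUCT are standard Spark SQL; the column names are taken from the question):

from pyspark.sql import functions as F

# Rebuild every element of `occurrences`, keeping only the wanted fields.
# Requires Spark 2.4+ for the TRANSFORM higher-order function.
clean_df = df.select(
    "_id",
    "quote",
    F.expr(
        "TRANSFORM(occurrences, o -> STRUCT(o.articleID AS articleID, o.source AS source))"
    ).alias("occurrences"),
)

This keeps articleID and source paired within each struct, preserves the original element order, and needs no aggregation at all.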

Related

convert array of array to array of struct in pyspark

I have a dataframe like the one below:
id contact_persons
-----------------------
1 [[abc, abc#xyz.com, 896676, manager],[pqr, pqr#xyz.com, 89809043, director],[stu, stu#xyz.com, 09909343, programmer]]
The schema looks like this:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I need to convert this dataframe to match the schema below:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- emails: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- phone: string (nullable = true)
| | |-- roles: string (nullable = true)
I know there is a struct function in PySpark, but in this scenario I don't know how to use it, since the array is dynamically sized.
You can use the TRANSFORM expression to reshape it:
import pyspark.sql.functions as f

df = spark.createDataFrame([
    ['1', [['abc', 'abc#xyz.com', '896676', 'manager'],
           ['pqr', 'pqr#xyz.com', '89809043', 'director'],
           ['stu', 'stu#xyz.com', '09909343', 'programmer']]]
], schema='id string, contact_persons array<array<string>>')

expression = 'TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))'
output_df = df.withColumn('contact_persons', f.expr(expression))
# output_df.printSchema()
# root
# |-- id: string (nullable = true)
# |-- contact_persons: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- name: string (nullable = true)
# | | |-- emails: string (nullable = true)
# | | |-- phone: string (nullable = true)
# | | |-- roles: string (nullable = true)
output_df.show(truncate=False)
+---+-----------------------------------------------------------------------------------------------------------------------+
|id |contact_persons                                                                                                        |
+---+-----------------------------------------------------------------------------------------------------------------------+
|1  |[{abc, abc#xyz.com, 896676, manager}, {pqr, pqr#xyz.com, 89809043, director}, {stu, stu#xyz.com, 09909343, programmer}]|
+---+-----------------------------------------------------------------------------------------------------------------------+
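On Spark 3.1+ the same reshaping can also be written with the Python-native transform function instead of an SQL string (a sketch, equivalent to the expression above):

import pyspark.sql.functions as f

# Same reshaping as the TRANSFORM expression, via the Python API (Spark 3.1+).
output_df = df.withColumn(
    'contact_persons',
    f.transform('contact_persons', lambda el: f.struct(
        el[0].alias('name'),
        el[1].alias('emails'),
        el[2].alias('phone'),
        el[3].alias('roles'),
    )),
)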

Extract struct fields in Spark scala

I have a df with the following schema:
|-- Data: struct (nullable = true)
| |-- address_billing: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| |-- address_shipping: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- city: string (nullable = true)
| |-- cancelled_initiator: string (nullable = true)
| |-- cancelled_reason: string (nullable = true)
| |-- statuses: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- store_code: string (nullable = true)
| |-- store_name: string (nullable = true)
| |-- tax_code: string (nullable = true)
| |-- total: string (nullable = true)
| |-- updated_at: string (nullable = true)
I need to extract all of its fields into separate columns without manually giving the names.
Is there any way we can do this?
I tried:
val df2 = df1.select(df1.col("Data.*"))
but got the error
org.apache.spark.sql.AnalysisException: No such struct field * in address_billing, address_shipping,....
Also, can anyone suggest how to add a prefix to all these columns, since some of the column names may be the same?
The output should look like:
address_billing_address1
address_billing_address2
.
.
.
Just change df1.col to col. Any of these should work:
df1.select(col("Data.*"))
df1.select($"Data.*")
df1.select("Data.*")

How to display the string variable in the root using Spark SQL?

Long story short: I am using Spark code in the Scala IDE to convert JSON to CSV. I don't have much knowledge of Spark, as I have worked only on RDBMSs like Oracle, TD and DB2. All I was given was the code that converts the JSON data to CSV and instructions on how to pass arguments to retrieve data from the schema.
Now, I am able to fetch the data inside a struct and array by using:
val val1 = df.select(explode($"data.business").as("ID")).select($"ID.amountTO")
val1.repartition(1).write.format("com.databricks.spark.csv")
  .option("header", "true").save(args(2) + "\\Result" + "\\" + timeForpath + "\\val1")
I don't know how to export the columns that are not inside the struct and sit directly at the root of the schema, like QAYONOutCome, QA1PartiesComments, etc.:
root
|-- QAYONOutCome: string (nullable = true)
|-- QA1PartiesComments: string (nullable = true)
|-- QA1PartiesQID: string (nullable = true)
|-- QA1PartiesResponse: string (nullable = true)
|-- QAHolderTypeComments: string (nullable = true)
|-- QAHolderTypeQID: string (nullable = true)
|-- QAHolderTypeResponse: string (nullable = true)
|-- QAhighRiskComments: string (nullable = true)
|-- QAhighRiskQID: string (nullable = true)
|-- QAhighRiskResponse: string (nullable = true)
|-- QA2ClassComments: string (nullable = true)
|-- QA2ClassQID: string (nullable = true)
|-- QA2ClassResponse: string (nullable = true)
|-- QAoutcomeComments: string (nullable = true)
|-- QAoutcomeQID: string (nullable = true)
|-- QAoutcomeResponse: string (nullable = true)
|-- data: struct (nullable = true)
| |-- business: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- amountTO: string (nullable = true)
| | | |-- ID: string (nullable = true)
| | | |-- Registration: struct (nullable = true)
| | | | |-- country: string (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- line1: string (nullable = true)
| | | | |-- line2: string (nullable = true)
| | | | |-- postCode: string (nullable = true)
Any help is appreciated. Apologies if my question sounds very dumb :( Please let me know if more information is needed to provide a solution or clarity. Thanks much in advance.
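For what it's worth, columns at the root of the schema need no explode at all; they can be selected by name directly. A sketch in PySpark syntax (the question uses Scala, but the calls map one-to-one; the column list and output path are illustrative):

# Root-level columns can be selected directly by name.
val2 = df.select("QAYONOutCome", "QA1PartiesComments", "QA1PartiesQID")

# Written out the same way as val1 above (path shortened for illustration).
val2.repartition(1).write.format("com.databricks.spark.csv") \
    .option("header", "true").save("Result/val2")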

How to filter an array's fields in a DataFrame?

I am working on JSON files with DataFrames, and I can't manage to filter an array's fields.
This is my input struct:
root
|-- MyObject: struct (nullable = true)
| |-- Field1: long (nullable = true)
| |-- Field2: string (nullable = true)
| |-- Field3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Field3_1: boolean (nullable = true)
| | | |-- Field3_2: string (nullable = true)
| | | |-- Field3_3: string (nullable = true)
| | | |-- Field3_3: string (nullable = true)
and I want a DF like that :
root
|-- Field1: long (nullable = true)
|-- Field3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Field3_1: boolean (nullable = true)
| | |-- Field3_3: string (nullable = true)
The best I got is with:
df.select($"MyObject.Field1",
          $"MyObject.Field3.Field3_1" as "Field3.Field3_1",
          $"MyObject.Field3.Field3_3" as "Field3.Field3_3")
which gives me :
root
|-- Field1: long (nullable = true)
|-- Field3_1: array (nullable = true)
| |-- element: boolean (nullable = true)
|-- Field3_3: array (nullable = true)
| |-- element: string (nullable = true)
I can't use the array function because Field3_1 and Field3_3 don't have the same type.
How can I create an array with only selected fields?
I'm a beginner with Spark SQL, maybe I'm missing something!
Thanks.
The easiest solution is to use a udf function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def arraystructUdf = udf((f3: Seq[Row]) =>
  f3.map(row => field3(row.getAs[Boolean]("Field3_1"), row.getAs[String]("Field3_3"))))

df.select(col("MyObject.Field1"), arraystructUdf(col("MyObject.Field3")).as("Field3"))
where field3 is a case class
case class field3(Field3_1: Boolean, Field3_3: String)
which should give you
root
|-- Field1: long (nullable = true)
|-- Field3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Field3_1: boolean (nullable = false)
| | |-- Field3_3: string (nullable = true)
I hope the answer is helpful.
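On Spark 2.4+ the udf can also be avoided with the TRANSFORM higher-order function, which keeps the operation inside Catalyst. A sketch in PySpark (the same expr string works verbatim in Scala's expr(...)):

from pyspark.sql import functions as F

# Rebuild each Field3 element, keeping only the two wanted fields (Spark 2.4+).
result = df.select(
    F.col("MyObject.Field1").alias("Field1"),
    F.expr(
        "TRANSFORM(MyObject.Field3, x -> STRUCT(x.Field3_1 AS Field3_1, x.Field3_3 AS Field3_3))"
    ).alias("Field3"),
)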

Spark Scala: How to Replace a Field in Deeply Nested DataFrame

I have a DataFrame which contains multiple nested columns. The schema is not static and could change upstream of my Spark application. Schema evolution is guaranteed to always be backward compatible. An anonymized, shortened version of the schema is pasted below
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: string (nullable = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
The Spark job is supposed to transform "structX.uriArguments" from string to map(string, string). There is a somewhat similar situation asked in this post; however, that answer assumes the schema is static and does not change, so the case class approach does not work in my situation.
What would be the best way to transform structX.uriArguments without hard-coding the entire schema inside the code? The outcome should look like this:
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
Thanks
You could try using DataFrame.withColumn(). It allows you to reference nested fields: you could add a new map column and drop the flat one. This question shows how to handle structs with withColumn.
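A sketch of that suggestion in PySpark (the post is Scala, but withColumn behaves the same). Note that it names only structX's own fields, not the whole schema, and it assumes uriArguments is a k1=v1&k2=v2 query string, which the question does not specify:

from pyspark.sql import functions as F

# Rebuild structX with uriArguments parsed into a map<string, string>.
# str_to_map assumes "k1=v1&k2=v2" formatting -- adjust the delimiters
# if the real payload differs.
df2 = df.withColumn("structX", F.struct(
    F.col("structX.hostIPAddress").alias("hostIPAddress"),
    F.expr("str_to_map(structX.uriArguments, '&', '=')").alias("uriArguments"),
))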