I have a DataFrame like the one below:
id contact_persons
-----------------------
1 [[abc, abc#xyz.com, 896676, manager],[pqr, pqr#xyz.com, 89809043, director],[stu, stu#xyz.com, 09909343, programmer]]
The schema looks like this:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I need to convert this DataFrame to the schema below:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- emails: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- phone: string (nullable = true)
| | |-- roles: string (nullable = true)
I know there is a struct function in PySpark, but I don't know how to use it in this scenario because the array is dynamically sized.
You can use a TRANSFORM expression to convert each inner array to a struct:
import pyspark.sql.functions as f

df = spark.createDataFrame([
    [1, [['abc', 'abc#xyz.com', '896676', 'manager'],
         ['pqr', 'pqr#xyz.com', '89809043', 'director'],
         ['stu', 'stu#xyz.com', '09909343', 'programmer']]]
], schema='id string, contact_persons array<array<string>>')

# TRANSFORM maps over the outer array, whatever its length,
# and turns each inner 4-element array into a named struct.
expression = 'TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))'
output_df = df.withColumn('contact_persons', f.expr(expression))
# output_df.printSchema()
# root
# |-- id: string (nullable = true)
# |-- contact_persons: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- name: string (nullable = true)
# | | |-- emails: string (nullable = true)
# | | |-- phone: string (nullable = true)
# | | |-- roles: string (nullable = true)
output_df.show(truncate=False)
+---+-----------------------------------------------------------------------------------------------------------------------+
|id |contact_persons |
+---+-----------------------------------------------------------------------------------------------------------------------+
|1 |[{abc, abc#xyz.com, 896676, manager}, {pqr, pqr#xyz.com, 89809043, director}, {stu, stu#xyz.com, 09909343, programmer}]|
+---+-----------------------------------------------------------------------------------------------------------------------+
I have a DataFrame with this schema:
|-- Data: struct (nullable = true)
| |-- address_billing: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| |-- address_shipping: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- city: string (nullable = true)
| |-- cancelled_initiator: string (nullable = true)
| |-- cancelled_reason: string (nullable = true)
| |-- statuses: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- store_code: string (nullable = true)
| |-- store_name: string (nullable = true)
| |-- tax_code: string (nullable = true)
| |-- total: string (nullable = true)
| |-- updated_at: string (nullable = true)
I need to extract all of its fields into separate columns without typing each name manually.
Is there any way to do this?
I tried:
val df2=df1.select(df1.col("Data.*"))
but got the error
org.apache.spark.sql.AnalysisException: No such struct field * in address_billing, address_shipping,....
Also, can anyone suggest how to add a prefix to all these columns, since some of the column names may be the same?
The output should look like:
address_billing_address1
address_billing_address2
.
.
.
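Just change df1.col to col. Any of these should work (col needs import org.apache.spark.sql.functions.col, and the $ syntax needs import spark.implicits._):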
df1.select(col("Data.*"))
df1.select($"Data.*")
df1.select("Data.*")
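For the prefix part of the question, one possible approach is to walk the schema and build the aliases from it. The sketch below is in PySpark rather than Scala, and `flat_select_exprs` is a hypothetical helper name, not a Spark API. It relies only on the `fields`, `name`, and `dataType` attributes that Spark's `StructType` and `StructField` expose.

```python
def flat_select_exprs(struct_type, path, prefix=""):
    """Build 'path.to.leaf AS prefix_leaf' strings for every leaf under a struct."""
    exprs = []
    for field in struct_type.fields:
        full_path = f"{path}.{field.name}"
        alias = f"{prefix}{field.name}"
        # Only struct types expose a `fields` attribute; arrays, maps and
        # scalars do not, so they are treated as leaves.
        if hasattr(field.dataType, "fields"):
            exprs.extend(flat_select_exprs(field.dataType, full_path, alias + "_"))
        else:
            exprs.append(f"{full_path} AS {alias}")
    return exprs
```

With a real DataFrame this could be used as `df1.selectExpr(*flat_select_exprs(df1.schema["Data"].dataType, "Data"))`, which for the schema above would produce columns such as address_billing_address1 and address_shipping_city. Field names containing dots or spaces would need backtick quoting, which this sketch omits.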
Long story short: I am using Spark code in Scala IDE to convert JSON to CSV. I have no Spark background, as I have only worked with RDBMSs such as Oracle, TD and DB2. All I was given was the code that converts the JSON data to CSV and instructions on how to pass the arguments to retrieve data from the schema.
Now I am able to fetch the data that is inside a struct and array by using
val val1 = df.select(explode($"data.business").as("ID")).select($"ID.amountTO")
val1.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(args(2) + "\\Result" + "\\" + timeForpath + "\\val1")
I don't know how to export the columns that are not in the struct but sit directly at the root of the schema, such as QAYONOutCome, QA1PartiesComments, etc.
root
|-- QAYONOutCome: string (nullable = true)
|-- QA1PartiesComments: string (nullable = true)
|-- QA1PartiesQID: string (nullable = true)
|-- QA1PartiesResponse: string (nullable = true)
|-- QAHolderTypeComments: string (nullable = true)
|-- QAHolderTypeQID: string (nullable = true)
|-- QAHolderTypeResponse: string (nullable = true)
|-- QAhighRiskComments: string (nullable = true)
|-- QAhighRiskQID: string (nullable = true)
|-- QAhighRiskResponse: string (nullable = true)
|-- QA2ClassComments: string (nullable = true)
|-- QA2ClassQID: string (nullable = true)
|-- QA2ClassResponse: string (nullable = true)
|-- QAoutcomeComments: string (nullable = true)
|-- QAoutcomeQID: string (nullable = true)
|-- QAoutcomeResponse: string (nullable = true)
|-- data: struct (nullable = true)
| |-- business: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- amountTO: string (nullable = true)
| | | |-- ID: string (nullable = true)
| | | |-- Registration: struct (nullable = true)
| | | | |-- country: string (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- line1: string (nullable = true)
| | | | |-- line2: string (nullable = true)
| | | | |-- postCode: string (nullable = true)
Any help is appreciated. Apologies if my question sounds very dumb :(. Please let me know if some more information is needed to provide a solution or clarity. Thanks much in advance.
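One possible way to pick up the root-level columns is to select them by name, or to derive the list from the schema so that nothing has to be typed by hand. The sketch below is PySpark rather than Scala, and `top_level_scalar_columns` is a hypothetical helper. It relies only on `schema.fields`, `name`, and `dataType.typeName()`, which Spark's schema classes provide.

```python
# Type names Spark reports for nested column types.
NESTED_TYPES = {"struct", "array", "map"}

def top_level_scalar_columns(schema):
    """Return the names of root-level columns whose type is not nested."""
    return [
        field.name
        for field in schema.fields
        if field.dataType.typeName() not in NESTED_TYPES
    ]
```

`df.select(*top_level_scalar_columns(df.schema))` would then give a DataFrame holding QAYONOutCome, QA1PartiesComments and the other flat columns, which can be written out with the same kind of `write` call used for val1. In Scala, the direct equivalent for a known list is `df.select("QAYONOutCome", "QA1PartiesComments")`.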
I am working on JSON files with DataFrames, and I can't manage to filter the fields of an array's elements.
This is my input schema:
root
|-- MyObject: struct (nullable = true)
| |-- Field1: long (nullable = true)
| |-- Field2: string (nullable = true)
| |-- Field3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Field3_1: boolean (nullable = true)
| | | |-- Field3_2: string (nullable = true)
| | | |-- Field3_3: string (nullable = true)
| | | |-- Field3_3: string (nullable = true)
and I want a DataFrame like this:
root
|-- Field1: long (nullable = true)
|-- Field3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Field3_1: boolean (nullable = true)
| | |-- Field3_3: string (nullable = true)
The best I got is with
df.select($"MyObject.Field1",
$"MyObject.Field3.Field3_1" as "Field3.Field3_1",
$"MyObject.Field3.Field3_3" as "Field3.Field3_3")
which gives me :
root
|-- Field1: long (nullable = true)
|-- Field3_1: array (nullable = true)
| |-- element: boolean (nullable = true)
|-- Field3_3: array (nullable = true)
| |-- element: string (nullable = true)
I can't use the array function because Field3_1 and Field3_3 do not have the same type.
How can I create an array with only the selected fields?
I'm a beginner with Spark SQL, so maybe I'm missing something!
Thanks.
The easiest solution is to use a udf function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// Map each element of Field3 to a case class that holds only the wanted fields
def arraystructUdf = udf((f3: Seq[Row]) => f3.map(row => field3(row.getAs[Boolean]("Field3_1"), row.getAs[String]("Field3_3"))))
df.select(col("MyObject.Field1"), arraystructUdf(col("MyObject.Field3")).as("Field3"))
where field3 is a case class defined beforehand as
case class field3(Field3_1:Boolean, Field3_3:String)
which should give you
root
|-- Field1: long (nullable = true)
|-- Field3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Field3_1: boolean (nullable = false)
| | |-- Field3_3: string (nullable = true)
I hope the answer is helpful
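A UDF is not the only option. On Spark 2.4 or later, the SQL higher-order function TRANSFORM (the same one used in the PySpark answer earlier) can rebuild each array element with only the wanted fields. The sketch below is PySpark and only builds the expression string; `project_array_expr` is a hypothetical helper name, not a Spark API.

```python
def project_array_expr(array_col, keep_fields):
    """Build a TRANSFORM expression that keeps only `keep_fields` of each element."""
    # Each kept field becomes 'el.<name> AS <name>' inside STRUCT(...).
    struct_args = ", ".join(f"el.{name} AS {name}" for name in keep_fields)
    return f"TRANSFORM({array_col}, el -> STRUCT({struct_args}))"
```

Assuming `pyspark.sql.functions` is imported as `f`, it could be applied as `df.select("MyObject.Field1", f.expr(project_array_expr("MyObject.Field3", ["Field3_1", "Field3_3"])).alias("Field3"))`. Because no case class is involved, the list of kept fields can be changed without recompiling anything.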
I have a DataFrame which contains multiple nested columns. The schema is not static and could change upstream of my Spark application. Schema evolution is guaranteed to always be backward compatible. An anonymized, shortened version of the schema is pasted below
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: string (nullable = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
The Spark job is supposed to transform "structX.uriArguments" from string to map(string, string). There is a somewhat similar situation asked in this post. However, the answer there assumes the schema is static and does not change, so a case class does not work in my situation.
What would be the best way to transform structX.uriArguments without hard-coding the entire schema inside the code? The outcome should look like this:
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
Thanks
You could try DataFrame.withColumn(), which lets you reference nested fields. You could add a new map column built from structX.uriArguments and drop the flat string one. This question shows how to handle structs with withColumn.
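A minimal sketch of the withColumn idea in PySpark, assuming the arguments are encoded as `k1=v1&k2=v2` (the question does not say how they are delimited). It rebuilds structX from whatever field names the schema currently has, so no other field is hard-coded. `rebuild_struct_expr` is a hypothetical helper; `named_struct` and `str_to_map` are Spark SQL built-ins.

```python
def rebuild_struct_expr(struct_col, field_names, replacements):
    """Build a named_struct(...) expression that copies every field of
    `struct_col`, swapping in a new expression for the replaced ones."""
    parts = []
    for name in field_names:
        # Fields without a replacement are copied through unchanged.
        value = replacements.get(name, f"{struct_col}.{name}")
        parts.append(f"'{name}', {value}")
    return f"named_struct({', '.join(parts)})"

# Hypothetical usage against a real DataFrame `df` (requires pyspark):
#   import pyspark.sql.functions as f
#   names = df.schema["structX"].dataType.fieldNames()
#   expr = rebuild_struct_expr(
#       "structX", names,
#       {"uriArguments": "str_to_map(structX.uriArguments, '&', '=')"})
#   df = df.withColumn("structX", f.expr(expr))
```

Since the field list is read from the schema at run time, fields added upstream are carried through automatically. On Spark 3.1 or later, `Column.withField` may be a shorter route to the same result.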