I have a df of schema
|-- Data: struct (nullable = true)
| |-- address_billing: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| |-- address_shipping: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- city: string (nullable = true)
| |-- cancelled_initiator: string (nullable = true)
| |-- cancelled_reason: string (nullable = true)
| |-- statuses: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- store_code: string (nullable = true)
| |-- store_name: string (nullable = true)
| |-- tax_code: string (nullable = true)
| |-- total: string (nullable = true)
| |-- updated_at: string (nullable = true)
I need to extract all of its fields into separate columns, without specifying each name manually.
Is there any way to do this?
I tried:
val df2 = df1.select(df1.col("Data.*"))
but got the error
org.apache.spark.sql.AnalysisException: No such struct field * in address_billing, address_shipping,....
Also, can anyone suggest how to add a prefix to all these columns, since some of the column names may be the same?
The output should be like:
address_billing_address1
address_billing_address2
.
.
.
Just change df1.col to col. Either of these should work:
df1.select(col("Data.*"))
df1.select($"Data.*")
df1.select("Data.*")
I am a newbie to Spark SQL (using Scala) and have some basic questions about an error I am facing.
I am merging two data frames (oldData and newData) as follows:
if (!oldData.isEmpty) {
  oldData
    .join(newData, Seq("internalUUID"), "left_anti")
    .unionByName(newData)
    .na.drop("all") // Drop records that have null in all fields
} else {
  newData
}
The error I see is:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ....
at the 8th column of the second table;;
'Union
:- Project [internalUUID#342, TenantID#339, ObjectName#340, DataSource#341, product#343, plant#344, isMarkedForDeletion#345, distributionProfile#346, productionAspect#347, salesPlant#348, listing#349]
: +- Join LeftAnti, (internalUUID#342 = internalUUID#300)
: :- Relation[TenantID#339,ObjectName#340,DataSource#341,internalUUID#342,product#343,plant#344,isMarkedForDeletion#345,distributionProfile#346,productionAspect#347,salesPlant#348,listing#349] parquet
: +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
+- Project [internalUUID#300, TenantID#298, ObjectName#297, DataSource#296, product#304, plant#303, isMarkedForDeletion#301, distributionProfile#299, productionAspect#305, salesPlant#306, listing#302]
+- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
The schema structure is as follows:
OldData
root
|-- TenantID: string (nullable = true)
|-- ObjectName: string (nullable = true)
|-- DataSource: string (nullable = true)
|-- internalUUID: string (nullable = true)
|-- product: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- plant: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- isMarkedForDeletion: boolean (nullable = true)
|-- distributionProfile: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- productionAspect: struct (nullable = true)
| |-- productMovementPlants: struct (nullable = true)
| | |-- unitOfIssue: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| |-- productPlanningPlants: struct (nullable = true)
| | |-- goodsReceiptProcessDuration: long (nullable = true)
| | |-- goodsIssueProcessDuration: long (nullable = true)
| | |-- mrpType: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- mrpController: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- sourceOfSupplyCategory: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- abcIndicator: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
|-- salesPlant: struct (nullable = true)
| |-- loadingGroup: struct (nullable = true)
| | |-- code: string (nullable = true)
| | |-- internalRefUUID: string (nullable = true)
|-- listing: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- validFrom: string (nullable = true)
| | |-- validTo: string (nullable = true)
| | |-- isListed: boolean (nullable = true)
and NewData
root
|-- DataSource: string (nullable = true)
|-- ObjectName: string (nullable = true)
|-- TenantID: string (nullable = true)
|-- distributionProfile: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- internalUUID: string (nullable = true)
|-- isMarkedForDeletion: boolean (nullable = true)
|-- listing: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- isListed: boolean (nullable = true)
| | |-- validFrom: string (nullable = true)
| | |-- validTo: string (nullable = true)
|-- plant: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- product: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- productionAspect: struct (nullable = true)
| |-- productMovementPlants: struct (nullable = true)
| | |-- unitOfIssue: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| |-- productPlanningPlants: struct (nullable = true)
| | |-- abcIndicator: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- goodsIssueProcessDuration: long (nullable = true)
| | |-- goodsReceiptProcessDuration: long (nullable = true)
| | |-- mrpController: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- mrpType: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- sourceOfSupplyCategory: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
|-- salesPlant: struct (nullable = true)
| |-- loadingGroup: struct (nullable = true)
| | |-- code: string (nullable = true)
| | |-- internalRefUUID: string (nullable = true)
However, I am not quite sure what the "8th column of the second table" denotes. Moreover, the columns are not ordered the same way in the two data frames. Is there any guidance on how to proceed?
When using unionByName, column order does not matter, since columns are resolved by name. But this applies only to columns at the root (those returned by df.columns), not to nested ones. As for the "8th column of the second table": it denotes the position of the offending column in the Union's second input, counting the columns in the order shown in the plan's Project list.
In your case, you get that error because some nested column types don't match between the two dataframes.
Take the column listing as an example:
newData => array<struct<isListed:boolean,validFrom:string,validTo:string>>
oldData => array<struct<validFrom:string,validTo:string,isListed:boolean>>
In a StructType, both the order and the types of the fields matter. You can see this with a simple snippet (oldListing mirrors oldData's field order, newListing mirrors newData's):
import org.apache.spark.sql.types.StructType

val oldListing = new StructType().add("validFrom", "string").add("validTo", "string").add("isListed", "boolean")
val newListing = new StructType().add("isListed", "boolean").add("validFrom", "string").add("validTo", "string")
oldListing == newListing
//res239: Boolean = false
I have a dataframe like the one below:
id contact_persons
-----------------------
1 [[abc, abc#xyz.com, 896676, manager],[pqr, pqr#xyz.com, 89809043, director],[stu, stu#xyz.com, 09909343, programmer]]
The schema looks like this:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I need to convert this dataframe to match the schema below:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- emails: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- phone: string (nullable = true)
| | |-- roles: string (nullable = true)
I know there is a struct function in PySpark, but in this scenario I don't know how to use it, since the arrays are dynamically sized.
You can use the TRANSFORM expression to convert it:
import pyspark.sql.functions as f
df = spark.createDataFrame([
    ['1', [['abc', 'abc#xyz.com', '896676', 'manager'],
           ['pqr', 'pqr#xyz.com', '89809043', 'director'],
           ['stu', 'stu#xyz.com', '09909343', 'programmer']]]
], schema='id string, contact_persons array<array<string>>')
expression = 'TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))'
output_df = df.withColumn('contact_persons', f.expr(expression))
# output_df.printSchema()
# root
# |-- id: string (nullable = true)
# |-- contact_persons: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- name: string (nullable = true)
# | | |-- emails: string (nullable = true)
# | | |-- phone: string (nullable = true)
# | | |-- roles: string (nullable = true)
output_df.show(truncate=False)
+---+-----------------------------------------------------------------------------------------------------------------------+
|id |contact_persons |
+---+-----------------------------------------------------------------------------------------------------------------------+
|1 |[{abc, abc#xyz.com, 896676, manager}, {pqr, pqr#xyz.com, 89809043, director}, {stu, stu#xyz.com, 09909343, programmer}]|
+---+-----------------------------------------------------------------------------------------------------------------------+
Long story short: I am using Spark code in the Scala IDE to convert JSON to CSV. I don't have much knowledge about Spark, as I have worked only on RDBMSs like Oracle, TD, and DB2. All I was given was the code that converts the JSON data to CSV, and how to pass the arguments to retrieve data from the schema.
Now, I am able to fetch the data that is inside a struct and array by using:
val val1 = df.select(explode($"data.business").as("ID")).select($"ID.amountTO")
val1.repartition(1).write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save(args(2) + "\\Result" + "\\" + timeForpath + "\\val1")
I don't know how to export the columns that are not inside the struct and sit directly at the root of the schema, like QAYONOutCome, QA1PartiesComments, etc.
root
|-- QAYONOutCome: string (nullable = true)
|-- QA1PartiesComments: string (nullable = true)
|-- QA1PartiesQID: string (nullable = true)
|-- QA1PartiesResponse: string (nullable = true)
|-- QAHolderTypeComments: string (nullable = true)
|-- QAHolderTypeQID: string (nullable = true)
|-- QAHolderTypeResponse: string (nullable = true)
|-- QAhighRiskComments: string (nullable = true)
|-- QAhighRiskQID: string (nullable = true)
|-- QAhighRiskResponse: string (nullable = true)
|-- QA2ClassComments: string (nullable = true)
|-- QA2ClassQID: string (nullable = true)
|-- QA2ClassResponse: string (nullable = true)
|-- QAoutcomeComments: string (nullable = true)
|-- QAoutcomeQID: string (nullable = true)
|-- QAoutcomeResponse: string (nullable = true)
|-- data: struct (nullable = true)
| |-- business: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- amountTO: string (nullable = true)
| | | |-- ID: string (nullable = true)
| | | |-- Registration: struct (nullable = true)
| | | | |-- country: string (nullable = true)
| | | | |-- id: long (nullable = true)
| | | | |-- line1: string (nullable = true)
| | | | |-- line2: string (nullable = true)
| | | | |-- postCode: string (nullable = true)
Any help is appreciated. Apologies if my question sounds very dumb :( Please let me know if more information is needed to provide a solution or clarity. Thanks much in advance.
I am trying to flatten a complex JSON structure containing nested arrays and struct elements, using a generic function that should work for any JSON file with any schema.
Below is part of a sample JSON structure that I want to flatten:
root
|-- Data: struct (nullable = true)
| |-- Record: struct (nullable = true)
| | |-- FName: string (nullable = true)
| | |-- LName: long (nullable = true)
| | |-- Address: struct (nullable = true)
| | | |-- Applicant: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Id: long (nullable = true)
| | | | | |-- Type: string (nullable = true)
| | | | | |-- Option: long (nullable = true)
| | | |-- Location: string (nullable = true)
| | | |-- Town: long (nullable = true)
| | |-- IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
to
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant_Id: long (nullable = true)
|-- Data_Record_Address_Applicant_Type: string (nullable = true)
|-- Data_Record_Address_Applicant_Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
I am using the code below, as suggested in the following thread:
How to flatten a struct in a Spark dataframe?
from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []
    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])
    flat_df.append(nested_df.select(flat_cols[0] +
                                    [col(nc + '.' + c).alias(nc + '_' + c)
                                     for nc in nested_cols[0]
                                     for c in nested_df.select(nc + '.*').columns])
                   )
    for i in range(1, layers):
        print(flat_cols[i - 1])
        flat_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] == 'struct'])
        flat_df.append(flat_df[i - 1].select(flat_cols[i] +
                                             [col(nc + '.' + c).alias(nc + '_' + c)
                                              for nc in nested_cols[i]
                                              for c in flat_df[i - 1].select(nc + '.*').columns])
                       )
    return flat_df[-1]

my_flattened_df = flatten_df(jsonDF, 10)
my_flattened_df.printSchema()
But it doesn't work for array elements. With the above code I get the output below. Can you please help? How can I modify this code to include arrays too?
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Id: long (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
This is not a duplicate, as there is no existing post about a generic function for flattening a complex JSON schema that includes arrays too.
I am working with JSON files in DataFrames, and I can't manage to filter an array's fields.
This is my input struct:
root
|-- MyObject: struct (nullable = true)
| |-- Field1: long (nullable = true)
| |-- Field2: string (nullable = true)
| |-- Field3: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Field3_1: boolean (nullable = true)
| | | |-- Field3_2: string (nullable = true)
| | | |-- Field3_3: string (nullable = true)
and I want a DF like this:
root
|-- Field1: long (nullable = true)
|-- Field3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Field3_1: boolean (nullable = true)
| | |-- Field3_3: string (nullable = true)
The best I got is with:
df.select($"MyObject.Field1",
$"MyObject.Field3.Field3_1" as "Field3.Field3_1",
$"MyObject.Field3.Field3_3" as "Field3.Field3_3")
which gives me :
root
|-- Field1: long (nullable = true)
|-- Field3_1: array (nullable = true)
| |-- element: boolean (nullable = true)
|-- Field3_3: array (nullable = true)
| |-- element: string (nullable = true)
I can't use the array function because Field3_1 and Field3_3 don't have the same type.
How can I create an array with only selected fields?
I'm a beginner with Spark SQL, maybe I'm missing something!
Thanks.
The easiest solution is to use a udf function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def arraystructUdf = udf((f3: Seq[Row]) => f3.map(row => field3(row.getAs[Boolean]("Field3_1"), row.getAs[String]("Field3_3"))))

df.select(col("MyObject.Field1"), arraystructUdf(col("MyObject.Field3")).as("Field3"))
where field3 is a case class
case class field3(Field3_1: Boolean, Field3_3: String)
which should give you
root
|-- Field1: long (nullable = true)
|-- Field3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Field3_1: boolean (nullable = false)
| | |-- Field3_3: string (nullable = true)
I hope the answer is helpful