I am new to Spark SQL (using Scala) and have some basic questions about an error I am facing.
I am merging two DataFrames (oldData and newData) as follows:
if (!oldData.isEmpty) {
  oldData
    .join(newData, Seq("internalUUID"), "left_anti")
    .unionByName(newData)
    .na.drop("all") // Drop records that have null in all fields
} else {
  newData
}
The error I see is:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ....
at the 8th column of the second table;;
'Union
:- Project [internalUUID#342, TenantID#339, ObjectName#340, DataSource#341, product#343, plant#344, isMarkedForDeletion#345, distributionProfile#346, productionAspect#347, salesPlant#348, listing#349]
:  +- Join LeftAnti, (internalUUID#342 = internalUUID#300)
:     :- Relation[TenantID#339,ObjectName#340,DataSource#341,internalUUID#342,product#343,plant#344,isMarkedForDeletion#345,distributionProfile#346,productionAspect#347,salesPlant#348,listing#349] parquet
:     +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
+- Project [internalUUID#300, TenantID#298, ObjectName#297, DataSource#296, product#304, plant#303, isMarkedForDeletion#301, distributionProfile#299, productionAspect#305, salesPlant#306, listing#302]
   +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
The schema structure is as follows:
OldData
root
|-- TenantID: string (nullable = true)
|-- ObjectName: string (nullable = true)
|-- DataSource: string (nullable = true)
|-- internalUUID: string (nullable = true)
|-- product: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- plant: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- isMarkedForDeletion: boolean (nullable = true)
|-- distributionProfile: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- productionAspect: struct (nullable = true)
| |-- productMovementPlants: struct (nullable = true)
| | |-- unitOfIssue: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| |-- productPlanningPlants: struct (nullable = true)
| | |-- goodsReceiptProcessDuration: long (nullable = true)
| | |-- goodsIssueProcessDuration: long (nullable = true)
| | |-- mrpType: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- mrpController: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- sourceOfSupplyCategory: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- abcIndicator: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
|-- salesPlant: struct (nullable = true)
| |-- loadingGroup: struct (nullable = true)
| | |-- code: string (nullable = true)
| | |-- internalRefUUID: string (nullable = true)
|-- listing: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- validFrom: string (nullable = true)
| | |-- validTo: string (nullable = true)
| | |-- isListed: boolean (nullable = true)
and NewData
root
|-- DataSource: string (nullable = true)
|-- ObjectName: string (nullable = true)
|-- TenantID: string (nullable = true)
|-- distributionProfile: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- internalUUID: string (nullable = true)
|-- isMarkedForDeletion: boolean (nullable = true)
|-- listing: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- isListed: boolean (nullable = true)
| | |-- validFrom: string (nullable = true)
| | |-- validTo: string (nullable = true)
|-- plant: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- product: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- productionAspect: struct (nullable = true)
| |-- productMovementPlants: struct (nullable = true)
| | |-- unitOfIssue: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| |-- productPlanningPlants: struct (nullable = true)
| | |-- abcIndicator: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- goodsIssueProcessDuration: long (nullable = true)
| | |-- goodsReceiptProcessDuration: long (nullable = true)
| | |-- mrpController: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- mrpType: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- sourceOfSupplyCategory: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
|-- salesPlant: struct (nullable = true)
| |-- loadingGroup: struct (nullable = true)
| | |-- code: string (nullable = true)
| | |-- internalRefUUID: string (nullable = true)
However, I am not sure what "8th column of the second table" denotes. Moreover, the columns are not in the same order in the two DataFrames. Is there any guidance on how to proceed?
When using unionByName, column order does not matter because columns are resolved by name. However, this applies only to top-level columns (those returned by df.columns), not to nested fields.
In your case, you get the error because some column types do not match between the two DataFrames.
Take the column listing as an example:
newData => array<struct<isListed:boolean,validFrom:string,validTo:string>>
oldData => array<struct<validFrom:string,validTo:string,isListed:boolean>>
In a StructType, both the order and the types of the fields matter. You can see this with a simple check:
import org.apache.spark.sql.types.StructType
// Same fields and types, different order (matching the listing element structs above)
val oldListing = new StructType().add("validFrom", "string").add("validTo", "string").add("isListed", "boolean")
val newListing = new StructType().add("isListed", "boolean").add("validFrom", "string").add("validTo", "string")
oldListing == newListing
//res239: Boolean = false
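To find out which columns are affected, one option is to compare the two schemas after sorting nested fields by name. The following is only a sketch: normalize and reorderedColumns are hypothetical helper names, not part of Spark, and the sketch only inspects schemas (it does not rewrite any data).

```scala
import org.apache.spark.sql.types._

// Recursively sort struct fields by name so that field order no longer affects equality.
def normalize(dt: DataType): DataType = dt match {
  case s: StructType =>
    StructType(s.fields.sortBy(_.name).map(f => f.copy(dataType = normalize(f.dataType))))
  case a: ArrayType => a.copy(elementType = normalize(a.elementType))
  case m: MapType   => m.copy(keyType = normalize(m.keyType), valueType = normalize(m.valueType))
  case other        => other
}

// Hypothetical helper: top-level columns present in both schemas whose types
// differ as-is but become equal once nested field order is ignored.
def reorderedColumns(left: StructType, right: StructType): Seq[String] =
  left.fieldNames.toSeq.filter { name =>
    right.fieldNames.contains(name) &&
    left(name).dataType != right(name).dataType &&
    normalize(left(name).dataType) == normalize(right(name).dataType)
  }
```

Calling reorderedColumns(oldData.schema, newData.schema) should list the columns whose nested fields need to be put in the same order before the union. Judging by the schemas printed in the question, I would expect that to include productionAspect and listing.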
I have a DataFrame with the following schema:
|-- Data: struct (nullable = true)
| |-- address_billing: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| |-- address_shipping: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- city: string (nullable = true)
| |-- cancelled_initiator: string (nullable = true)
| |-- cancelled_reason: string (nullable = true)
| |-- statuses: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- store_code: string (nullable = true)
| |-- store_name: string (nullable = true)
| |-- tax_code: string (nullable = true)
| |-- total: string (nullable = true)
| |-- updated_at: string (nullable = true)
I need to extract all of its fields into separate columns without naming them manually.
Is there any way to do this?
I tried:
val df2=df1.select(df1.col("Data.*"))
but got the error
org.apache.spark.sql.AnalysisException: No such struct field * in address_billing, address_shipping,....
Also, can anyone suggest how to add a prefix to all these columns, since some of the column names may be the same?
The output should look like:
address_billing_address1
address_billing_address2
.
.
.
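Just change df1.col to col. Any of these should work:
import org.apache.spark.sql.functions.col
import spark.implicits._ // for the $ syntax, assuming your SparkSession is named spark
df1.select(col("Data.*"))
df1.select($"Data.*")
df1.select("Data.*")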
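For the prefix part of the question, here is a minimal sketch that walks the schema and builds an alias from the path of each nested field. leafPaths and flattenStruct are hypothetical helper names; arrays and maps are treated as leaves and kept as they are.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// For every leaf field, return (backtick-quoted path, underscore-joined alias).
// Only structs are descended into; arrays and maps are returned as leaves.
def leafPaths(schema: StructType, parents: Seq[String] = Nil): Seq[(String, String)] =
  schema.fields.toSeq.flatMap { case StructField(name, dataType, _, _) =>
    val path = parents :+ name
    dataType match {
      case nested: StructType => leafPaths(nested, path)
      case _ => Seq((path.map(p => s"`$p`").mkString("."), path.mkString("_")))
    }
  }

// Hypothetical helper: expand the struct column, then flatten whatever is inside it.
def flattenStruct(df: DataFrame, structCol: String): DataFrame = {
  val expanded = df.select(col(s"$structCol.*"))
  val columns = leafPaths(expanded.schema).map { case (path, alias) => col(path).alias(alias) }
  expanded.select(columns: _*)
}
```

With the schema from the question, flattenStruct(df1, "Data") should produce columns such as address_billing_address1 and address_shipping_city, while statuses stays an array column.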
I am trying to flatten a complex JSON structure containing nested arrays and struct elements, using a generic function that should work for any JSON file with any schema.
Below is part of a sample JSON structure that I want to flatten:
root
|-- Data: struct (nullable = true)
| |-- Record: struct (nullable = true)
| | |-- FName: string (nullable = true)
| | |-- LName: long (nullable = true)
| | |-- Address: struct (nullable = true)
| | | |-- Applicant: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Id: long (nullable = true)
| | | | | |-- Type: string (nullable = true)
| | | | | |-- Option: long (nullable = true)
| | | |-- Location: string (nullable = true)
| | | |-- Town: long (nullable = true)
| | |-- IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
to
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant_Id: long (nullable = true)
|-- Data_Record_Address_Applicant_Type: string (nullable = true)
|-- Data_Record_Address_Applicant_Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
I am using the code below, as suggested in the thread "How to flatten a struct in a Spark dataframe?":
from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []
    # First layer: split columns into plain ones and struct ones
    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])
    flat_df.append(nested_df.select(flat_cols[0] +
                                    [col(nc+'.'+c).alias(nc+'_'+c)
                                     for nc in nested_cols[0]
                                     for c in nested_df.select(nc+'.*').columns])
                   )
    # Remaining layers: repeat on the result of the previous layer
    for i in range(1, layers):
        print(flat_cols[i-1])
        flat_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] == 'struct'])
        flat_df.append(flat_df[i-1].select(flat_cols[i] +
                                           [col(nc+'.'+c).alias(nc+'_'+c)
                                            for nc in nested_cols[i]
                                            for c in flat_df[i-1].select(nc+'.*').columns])
                       )
    return flat_df[-1]

my_flattened_df = flatten_df(jsonDF, 10)
my_flattened_df.printSchema()
But it doesn't work for array elements. With the above code I get the output below. Can you please help? How can I modify this code to handle arrays too?
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Id: long (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
This is not a duplicate, as there is no post about a generic function to flatten a complex JSON schema that includes arrays too.
I have a DataFrame which contains multiple nested columns. The schema is not static and could change upstream of my Spark application. Schema evolution is guaranteed to always be backward compatible. An anonymized, shortened version of the schema is pasted below
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: string (nullable = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
The Spark job is supposed to transform "structX.uriArguments" from string to map(string, string). A somewhat similar situation is asked about in this post. However, the answer there assumes the schema is static and does not change, so a case class does not work in my situation.
What would be the best way to transform structX.uriArguments without hard-coding the entire schema in the code? The outcome should look like this:
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
Thanks
You could try DataFrame.withColumn(). The expression you pass to it can reference nested fields (for example col("structX.uriArguments")), so you could add a new map column and drop the string one. This question shows how to handle structs with withColumn.
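A minimal sketch of that idea, which rebuilds structX from the DataFrame's own schema so the other fields never have to be listed in code. replaceStructField is a hypothetical helper name, and the '&' and '=' delimiters passed to str_to_map in the usage comment are assumptions about the uriArguments format.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, expr, struct}
import org.apache.spark.sql.types.StructType

// Rebuild a top-level struct column: one field is replaced, all others are
// copied by name as read from the current schema.
// Assumes field names contain no dots or other characters that need quoting.
def replaceStructField(df: DataFrame,
                       structName: String,
                       fieldName: String,
                       replacement: Column): DataFrame = {
  val fieldNames = df.schema(structName).dataType.asInstanceOf[StructType].fieldNames
  val rebuilt = fieldNames.map { name =>
    if (name == fieldName) replacement.as(name)
    else col(s"$structName.$name").as(name)
  }
  df.withColumn(structName, struct(rebuilt: _*))
}

// Usage (the delimiters are an assumption, e.g. "a=1&b=2"):
// val out = replaceStructField(df, "structX", "uriArguments",
//   expr("str_to_map(structX.uriArguments, '&', '=')"))
```

New fields added upstream to structX are picked up automatically because the field list comes from the schema at run time. As far as I know, rebuilding with struct() turns a null structX into a struct of null fields, so check whether that matters for your data. On Spark 3.1 or later, col("structX").withField("uriArguments", ...) may be a shorter alternative.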
I have a DataFrame with the following schema:
scala> final_df.printSchema
root
|-- mstr_prov_id: string (nullable = true)
|-- prov_ctgry_cd: string (nullable = true)
|-- prov_orgnl_efctv_dt: timestamp (nullable = true)
|-- prov_trmntn_dt: timestamp (nullable = true)
|-- prov_trmntn_rsn_cd: string (nullable = true)
|-- npi_rqrd_ind: string (nullable = true)
|-- prov_stts_aray_txt: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- PROV_STTS_KEY: string (nullable = true)
| | |-- PROV_STTS_EFCTV_DT: timestamp (nullable = true)
| | |-- PROV_STTS_CD: string (nullable = true)
| | |-- PROV_STTS_TRMNTN_DT: timestamp (nullable = true)
| | |-- PROV_STTS_TRMNTN_RSN_CD: string (nullable = true)
I am running the following code to do basic cleansing, but it does not work inside "prov_stts_aray_txt": it does not go inside the array type and apply the desired transformation. I want to iterate through all fields of the DataFrame (both flat and nested) and perform a basic transformation.
for (dt <- final_df.dtypes) {
  final_df = final_df.withColumn(dt._1, when(upper(trim(col(dt._1))) === "NULL", lit(" ")).otherwise(col(dt._1)))
}
Please help.
Thanks