how to create and match schema in scala - scala

Hi i have a schema coming in as follows
|-- eventObject: struct (nullable = true)
| |-- baseDivisionCode: string (nullable = true)
| |-- countryCode: string (nullable = true)
| |-- dcNumber: long (nullable = true)
| |-- financialReportingGroup: string (nullable = true)
| |-- itemList: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- availabletosellQty: long (nullable = true)
| | | |-- distroAvailableQty: long (nullable = true)
| | | |-- itemNumber: long (nullable = true)
| | | |-- itemUPC: string (nullable = true)
| | | |-- ossIndicator: string (nullable = true)
| | | |-- turnAvailableQty: long (nullable = true)
| | | |-- unitOfMeasurement: string (nullable = true)
| | | |-- weightFormatType: string (nullable = true)
| | | |-- whpkRatio: long (nullable = true)
to map this i have create this following schema type
|-- eventObject: struct (nullable = true)
| |-- baseDivisionCode: string (nullable = true)
| |-- countryCode: string (nullable = true)
| |-- dcNumber: integer (nullable = true)
| |-- financialReportingGroup: string (nullable = true)
| |-- itemList: struct (nullable = true)
| | |-- availabletosellQty: long (nullable = true)
| | |-- distroAvailableQty: long (nullable = true)
| | |-- itemNumber: long (nullable = true)
| | |-- itemUPC: string (nullable = true)
| | |-- ossIndicator: string (nullable = true)
| | |-- turnAvailableQty: long (nullable = true)
| | |-- unitOfMeasurement: string (nullable = true)
| | |-- weightFormatType: string (nullable = true)
| | |-- whpkRatio: long (nullable = true)
by writing something like this
val testSchema = new StructType()
.add("eventObject", new StructType()
.add("baseDivisionCode", StringType)
.add("countryCode",StringType)
.add("dcNumber", IntegerType)
.add("financialReportingGroup",StringType)
.add("itemList",new StructType(
Array(
StructField("availabletosellQty",LongType),
StructField("distroAvailableQty",LongType),
StructField("itemNumber", LongType),
StructField("itemUPC", StringType),
StructField("ossIndicator",StringType),
StructField("turnAvailableQty",LongType),
StructField("unitOfMeasurement",StringType),
StructField("weightFormatType",StringType),
StructField("whpkRatio",LongType)))))
but it is not matching the schema that i am receiving...what am i doing wrong in this?
i am getting null values when i try to populate the with some data...
|-- eventObject: struct (nullable = true)
| |-- baseDivisionCode: string (nullable = true)
| |-- countryCode: string (nullable = true)
| |-- dcNumber: long (nullable = true)
| |-- financialReportingGroup: string (nullable = true)
| |-- itemList: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- itemNumber: long (nullable = true)
| | | |-- itemUPC: string (nullable = true)
| | | |-- unitOfMeasurement: string (nullable = true)
| | | |-- availabletosellQty: long (nullable = true)
| | | |-- turnAvailableQty: long (nullable = true)
| | | |-- distroAvailableQty: long (nullable = true)
| | | |-- ossIndicator: string (nullable = true)
| | | |-- weightFormatType: string (nullable = true)
|-- kafka_timestamp: timestamp (nullable = true)
|-- baseDivisionCode: string (nullable = true)
|-- countryCode: string (nullable = true)
|-- dcNumber: long (nullable = true)
|-- financialReportingGroup: string (nullable = true)
|-- itemList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemNumber: long (nullable = true)
| | |-- itemUPC: string (nullable = true)
| | |-- unitOfMeasurement: string (nullable = true)
| | |-- availabletosellQty: long (nullable = true)
| | |-- turnAvailableQty: long (nullable = true)
| | |-- distroAvailableQty: long (nullable = true)
| | |-- ossIndicator: string (nullable = true)
| | |-- weightFormatType: string (nullable = true)
|-- kafka_timestamp: timestamp (nullable = true)
when i further try to flatten it, its erroring out cause of array
"Exception in thread "main" org.apache.spark.sql.AnalysisException: Can only star expand struct data types. Attribute: ArrayBuffer(itemList);"
trying to get it to
|-- facilityCountryCode: string (nullable = true)
|-- facilityNum: string (nullable = true)
|-- WMT_CorrelationId: string (nullable = true)
|-- WMT_IdempotencyKey: string (nullable = true)
|-- WMT_Timestamp: string (nullable = true)
|-- countryCode: string (nullable = true)
|-- dcNumber: integer (nullable = true)
|-- financialReportingGroup: string (nullable = true)
|-- baseDivisionCode: string (nullable = true)
|-- itemNumber: integer (nullable = true)
|-- itemUPC: string (nullable = true)
|-- unitOfMeasurement: string (nullable = true)
|-- availabletosellQty: integer (nullable = true)
|-- turnAvailableQty: integer (nullable = true)
|-- distroAvailableQty: integer (nullable = true)
|-- ossIndicator: string (nullable = true)
|-- weightFormatType: string (nullable = true)
|-- kafka_timestamp: timestamp (nullable = true)
|-- year-month-day: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- hour: integer (nullable = true)
this is what i did
val testParsed=TestExploded.select($"exploded.*",$"kafka_timestamp")
val testFlattened=testParsed.select($"eventObject.*",$"kafka_timestamp")
val test_flattened_further=testFlattened.select($"countryCode",
$"dcNumber",$"financialReportingGroup",$"baseDivisionCode",**$"itemList.*"**,$"kafka_timestamp")

Use ArrayType to specify array type:
val testSchema = new StructType()
.add("eventObject", new StructType()
.add("baseDivisionCode", StringType)
.add("countryCode", StringType)
.add("dcNumber", LongType)
.add("financialReportingGroup", StringType)
.add("itemList", new ArrayType(
new StructType(
Array(
StructField("itemNumber", LongType),
StructField("itemUPC", StringType),
StructField("unitOfMeasurement", StringType),
StructField("availabletosellQty", LongType),
StructField("turnAvailableQty", LongType),
StructField("distroAvailableQty", LongType),
StructField("ossIndicator", StringType),
StructField("weightFormatType", StringType))), containsNull = true)))
To fully flatten the DataFrame you can use explode array of structs and move struct type into top level columns by select("structColName.*") syntax as follows:
df
.select("eventObject.*")
.select(
col("baseDivisionCode"),
col("countryCode"),
col("dcNumber"),
col("financialReportingGroup"),
explode(col("itemList")).as("explodedItemList"))
.select(
col("baseDivisionCode"),
col("countryCode"),
col("dcNumber"),
col("financialReportingGroup"),
col("explodedItemList.*")
)
.printSchema()
Will output:
root
|-- baseDivisionCode: string (nullable = true)
|-- countryCode: string (nullable = true)
|-- dcNumber: long (nullable = true)
|-- financialReportingGroup: string (nullable = true)
|-- itemNumber: long (nullable = true)
|-- itemUPC: string (nullable = true)
|-- unitOfMeasurement: string (nullable = true)
|-- availabletosellQty: long (nullable = true)
|-- turnAvailableQty: long (nullable = true)
|-- distroAvailableQty: long (nullable = true)
|-- ossIndicator: string (nullable = true)
|-- weightFormatType: string (nullable = true)

Related

Error incompatible column type when using unionByName

I am a newbie to Spark SQL(using Scala) and have some basic questions regarding an error I am facing.
I am merging 2 data frames (oldData and newData) as follows
if (!oldData.isEmpty) {
oldData
.join(newData, Seq("internalUUID"),"left_anti")
.unionByName(newData)
.drop("all") //Drop records that have null in all fields
} else {
newData
}
The error I see is
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ....
at the 8th column of the second table;;
'Union
:- Project [internalUUID#342, TenantID#339, ObjectName#340, DataSource#341, product#343, plant#344, isMarkedForDeletion#345, distributionProfile#346, productionAspect#347, salesPlant#348, listing#349]
: +- Join LeftAnti, (internalUUID#342 = internalUUID#300)
: :- Relation[TenantID#339,ObjectName#340,DataSource#341,internalUUID#342,product#343,plant#344,isMarkedForDeletion#345,distributionProfile#346,productionAspect#347,salesPlant#348,listing#349] parquet
: +- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
+- Project [internalUUID#300, TenantID#298, ObjectName#297, DataSource#296, product#304, plant#303, isMarkedForDeletion#301, distributionProfile#299, productionAspect#305, salesPlant#306, listing#302]
+- LogicalRDD [DataSource#296, ObjectName#297, TenantID#298, distributionProfile#299, internalUUID#300, isMarkedForDeletion#301, listing#302, plant#303, product#304, productionAspect#305, salesPlant#306], false
The schema structure is as follows :
OldData
root
|-- TenantID: string (nullable = true)
|-- ObjectName: string (nullable = true)
|-- DataSource: string (nullable = true)
|-- internalUUID: string (nullable = true)
|-- product: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- plant: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- isMarkedForDeletion: boolean (nullable = true)
|-- distributionProfile: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- productionAspect: struct (nullable = true)
| |-- productMovementPlants: struct (nullable = true)
| | |-- unitOfIssue: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| |-- productPlanningPlants: struct (nullable = true)
| | |-- goodsReceiptProcessDuration: long (nullable = true)
| | |-- goodsIssueProcessDuration: long (nullable = true)
| | |-- mrpType: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- mrpController: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- sourceOfSupplyCategory: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- abcIndicator: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
|-- salesPlant: struct (nullable = true)
| |-- loadingGroup: struct (nullable = true)
| | |-- code: string (nullable = true)
| | |-- internalRefUUID: string (nullable = true)
|-- listing: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- validFrom: string (nullable = true)
| | |-- validTo: string (nullable = true)
| | |-- isListed: boolean (nullable = true)
and NewData
root
|-- DataSource: string (nullable = true)
|-- ObjectName: string (nullable = true)
|-- TenantID: string (nullable = true)
|-- distributionProfile: struct (nullable = true)
| |-- code: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- internalUUID: string (nullable = true)
|-- isMarkedForDeletion: boolean (nullable = true)
|-- listing: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- isListed: boolean (nullable = true)
| | |-- validFrom: string (nullable = true)
| | |-- validTo: string (nullable = true)
|-- plant: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- product: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- internalRefUUID: string (nullable = true)
|-- productionAspect: struct (nullable = true)
| |-- productMovementPlants: struct (nullable = true)
| | |-- unitOfIssue: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| |-- productPlanningPlants: struct (nullable = true)
| | |-- abcIndicator: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- goodsIssueProcessDuration: long (nullable = true)
| | |-- goodsReceiptProcessDuration: long (nullable = true)
| | |-- mrpController: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- mrpType: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
| | |-- sourceOfSupplyCategory: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- internalRefUUID: string (nullable = true)
|-- salesPlant: struct (nullable = true)
| |-- loadingGroup: struct (nullable = true)
| | |-- code: string (nullable = true)
| | |-- internalRefUUID: string (nullable = true)
However I am not quite sure what does the "8th column of the 2nd table" denote? Moreover the columns are not ordered in the same way in both data frames. Is there any guidance on how to proceed on this?
When using unionByName the order does not matter as it resolves using column names. But this is only applicable for columns at root (those returned by df.columns) and not the nested ones.
In your case, you get that error because you have some column types that don't match between the 2 dataframes.
We can take the example of column listing:
newData => array<struct<isListed:boolean,validFrom:string,validTo:string>>
oldData => array<struct<validFrom:string,validTo:string,isListed:boolean>>
In StructType, the order and the type of the fields is important. You can see it by using this simple code:
val oldListing = new StructType().add("isListed", "boolean").add("validFrom", "string").add("validTo", "string")
val newListing = new StructType().add("validFrom", "string").add("validTo", "string").add("isListed", "boolean")
oldListing == newListing
//res239: Boolean = false

Spark dataframe how to select columns using Seq[String]

Input schema
root
|-- class: string (nullable = true)
|-- createdBy: string (nullable = true)
|-- createdDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- id: struct (nullable = true)
| |-- $oid: string (nullable = true)
|-- lastModifiedBy: string (nullable = true)
|-- lastModifiedDate: struct (nullable = true)
| |-- $date: long (nullable = true)
|-- planId: string (nullable = true)
|-- planWeekDataFormatted: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- bbDemoImps: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- bbDemoImpsAttributes: struct (nullable = true)
| | | | | |-- demoId: string (nullable = true)
| | | | | |-- imps: long (nullable = true)
| | | | | |-- ue: long (nullable = true)
| | | | |-- uuid: long (nullable = true)
| | |-- demoValues: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- demoAttributes: struct (nullable = true)
| | | | | |-- cpm: long (nullable = true)
| | | | | |-- cpp: long (nullable = true)
| | | | | |-- demoId: string (nullable = true)
| | | | | |-- grps: long (nullable = true)
| | | | | |-- imps: long (nullable = true)
| | | | | |-- rcImps: long (nullable = true)
| | | | | |-- totalCpm: long (nullable = true)
| | | | | |-- totalGrps: long (nullable = true)
| | | | | |-- totalImps: long (nullable = true)
| | | | | |-- ue: long (nullable = true)
| | | | | |-- vpvh: long (nullable = true)
| | | | |-- demoId: long (nullable = true)
| | |-- hhDemo: struct (nullable = true)
| | | |-- demoId: string (nullable = true)
| | | |-- imps: long (nullable = true)
| | | |-- ue: long (nullable = true)
| | |-- periodId: string (nullable = true)
| | |-- rcPublishedDate: string (nullable = true)
| | |-- unitRates: struct (nullable = true)
| | | |-- rate: long (nullable = true)
| | | |-- rcRate: long (nullable = true)
| | | |-- totalRate: long (nullable = true)
| | | |-- units: string (nullable = true)
| | |-- uuid: long (nullable = true)
| | |-- weekStartDate: long (nullable = true)
|-- planWorkspaceProduct: struct (nullable = true)
| |-- channelId: string (nullable = true)
| |-- commercialTypeId: string (nullable = true)
| |-- lineClassAttributes: struct (nullable = true)
| | |-- canExport: boolean (nullable = true)
| | |-- canInvoice: boolean (nullable = true)
| | |-- canProduce: boolean (nullable = true)
| | |-- guaranteedAudience: long (nullable = true)
| | |-- guaranteedRate: long (nullable = true)
| | |-- hasPerformance: boolean (nullable = true)
| | |-- planAudience: long (nullable = true)
| | |-- planRate: long (nullable = true)
| |-- lineClassId: string (nullable = true)
| |-- lineId: string (nullable = true)
| |-- lineNo: struct (nullable = true)
| | |-- $numberLong: string (nullable = true)
| |-- planProductId: string (nullable = true)
| |-- productId: string (nullable = true)
| |-- spotLengthId: string (nullable = true)
|-- rates: struct (nullable = true)
| |-- period: struct (nullable = true)
| | |-- endDate: long (nullable = true)
| | |-- name: string (nullable = true)
| | |-- startDate: long (nullable = true)
|-- version: struct (nullable = true)
| |-- $numberLong: string (nullable = true)
|-- offsets: integer (nullable = true)
|-- modifiedTime: long (nullable = true)
|-- opCode: string (nullable = true)
|-- partition: integer (nullable = true)
|-- tenant: string (nullable = true)
|-- etl_timestamp: long (nullable = false)
|-- topic: string (nullable = true)
Expected output schema
root
|-- class: string (nullable = true)
|-- createdBy: string (nullable = true)
|-- lastModifiedBy: string (nullable = true)
|-- planId: string (nullable = true)
|-- offsets: integer (nullable = true)
|-- modifiedTime: long (nullable = true)
|-- opCode: string (nullable = true)
|-- partition: integer (nullable = true)
|-- tenant: string (nullable = true)
|-- etl_timestamp: long (nullable = false)
|-- topic: string (nullable = true)
|-- createdDate_$date: long (nullable = true)
|-- id_$oid: string (nullable = true)
|-- lastModifiedDate_$date: long (nullable = true)
|-- planWorkspaceProduct_channelId: string (nullable = true)
|-- planWorkspaceProduct_commercialTypeId: string (nullable = true)
|-- planWorkspaceProduct_lineClassId: string (nullable = true)
|-- planWorkspaceProduct_lineId: string (nullable = true)
|-- planWorkspaceProduct_planProductId: string (nullable = true)
|-- planWorkspaceProduct_productId: string (nullable = true)
|-- planWorkspaceProduct_spotLengthId: string (nullable = true)
|-- version_$numberLong: string (nullable = true)
|-- planWeekDataFormatted_periodId: string (nullable = true)
|-- planWeekDataFormatted_rcPublishedDate: string (nullable = true)
|-- planWeekDataFormatted_weekStartDate: long (nullable = true)
|-- planWorkspaceProduct_lineClassAttributes_canExport: boolean (nullable = true)
|-- planWorkspaceProduct_lineClassAttributes_canInvoice: boolean (nullable = true)
|-- planWorkspaceProduct_lineClassAttributes_canProduce: boolean (nullable = true)
|-- planWorkspaceProduct_lineClassAttributes_guaranteedAudience: long (nullable = true)
|-- planWorkspaceProduct_lineClassAttributes_guaranteedRate: long (nullable = true)
|-- planWorkspaceProduct_lineClassAttributes_hasPerformance: boolean (nullable = true)
|-- planWorkspaceProduct_lineClassAttributes_planAudience: long (nullable = true)
|-- planWorkspaceProduct_lineClassAttributes_planRate: long (nullable = true)
|-- planWorkspaceProduct_lineNo_$numberLong: string (nullable = true)
|-- rates_period_endDate: long (nullable = true)
|-- rates_period_name: string (nullable = true)
|-- rates_period_startDate: long (nullable = true)
**|-- planWeekDataFormatted_hhDemo_demoId: string (nullable = true)**
|-- planWeekDataFormatted_unitRates_rate: long (nullable = true)
|-- planWeekDataFormatted_unitRates_rcRate: long (nullable = true)
|-- planWeekDataFormatted_unitRates_totalRate: long (nullable = true)
|-- planWeekDataFormatted_unitRates_units: string (nullable = true)
**|-- planWeekDataFormatted_bbDemoImps_bbDemoImpsAttributes_demoId: string (nullable = true)**
**|-- planWeekDataFormatted_demoValues_demoAttributes_demoId: string (nullable = true)**
Trying the below code to explode the ArrayType column 'planWeekDataFormatted', then the nested ArrayType columns 'bbDemoImps', 'demoValues' and trying to extract only the demoIds from each object in the arrays.
//get all columns from resultDF, except "planWeekDataFormatted" column
val dfwithoutPlanWeekData = resultDF.drop("planWeekDataFormatted")
val colsWithoutPlanWeekData = dfwithoutPlanWeekData.columns.toSeq
val planweek_exploded = resultDF.withColumn("planWeekItem", explode($"planWeekDataFormatted"))
.withColumn("bbDemoImpsAttribute", explode($"planWeekItem.bbDemoImps"))
.withColumn("demoValuesAttribute", explode($"planWeekItem.demoValues"))
.withColumn("hhDemoAttribute", $"planWeekItem.hhDemo")
.select(
colsWithoutPlanWeekData.map(c => col(c)): _*,
col("bbDemoImpsAttribute.bbDemoImpsAttributes.demoId").as("bbDemoId"),
col("demoValuesAttribute.demoAttributes.demoId").as("demoId"),
col("hhDemoAttribute.demoId").as("hhDemoId")
).drop("planWeekItem", "bbDemoImpsAttribute", "demoValuesAttribute", "hhDemoAttribute")
Not allowing Spark dataframe to select mapped columns from Seq[String]
Getting the below error
> overloaded method value select with alternatives: [U1, U2, U3,
> U4](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1],
> c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2], c3:
> org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U3], c4:
> org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U4])org.apache.spark.sql.Dataset[(U1,
> U2, U3, U4)] <and> (col: String,cols:
> String*)org.apache.spark.sql.DataFrame <and> (cols:
> org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame cannot be
> applied to (String, org.apache.spark.sql.Column,
> org.apache.spark.sql.Column, org.apache.spark.sql.Column)
> .select(
Use :
.select(
(colsWithoutPlanWeekData.map(c => col(c)) ++ Seq(
col("bbDemoImpsAttribute.bbDemoImpsAttributes.demoId").as("bbDemoId"),
col("demoValuesAttribute.demoAttributes.demoId").as("demoId"),
col("hhDemoAttribute.demoId").as("hhDemoId"))): _*
)
Concat the 2 Seq before using the syntactic-sugar : _*

How to append more columns to a structural datafame in scala

I have two dataframes (A and B), A is a structural schema whereas B is a common schema as below and will append B columns into A for C
A:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
B:
root
|-- globalPackageId: long (nullable = true)
|-- order_id: long (nullable = true)
|-- order_address: string (nullable = true)
|-- order_number: integer (nullable = true)
C:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
| |-- order_id: long (nullable = true)
| |-- order_address: string (nullable = true)
| |-- order_number: integer (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
I am struggling to use .withColumn(struct("xxx"), "xxx")
But looks still not expected
Do you have any experience on this
Thanks,

How to join using a nested column in Spark dataframe

I have one dataframe with this schema:
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
data:
+-----------+-----------+--------------------------------------------------+
|Activity_A1|Activity_A2|Details |
+-----------+-----------+--------------------------------------------------+
|Act1_Attr1 |Act1_Attr2 |[[Agr2_Attr1,Agr2_Attr2], [Agr1_Attr1,Agr1_Attr2]]|
|Act2_Attr1 |Act2_Attr2 |[[Agr4_Attr1,Agr4_Attr2], [Agr3_Attr1,Agr3_Attr2]]|
|Act3_Attr1 |Act3_Attr2 |[[Agr5_Attr1,Agr5_Attr2]] |
+-----------+-----------+--------------------------------------------------+
And the second one with this schema:
|-- Agreement_A1: string (nullable = true)
| | |-- Lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Line_A1: string (nullable = true)
| | | | |-- Line_A2: string (nullable = true)
How can I join this two dataframes with the Agreement_A1 column, so the schema of this new dataframe would look like this:
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
| | |-- Lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Line_A1: string (nullable = true)
| | | | |-- Line_A2: string (nullable = true)
Hope this helps. You need to unnest (explode) "Details" and join on "Agreement_A1" with your second dataframe. Then, structure your columns as desired.
scala> df1.show(false)
+-----------+-----------+----------------------------------------------------+
|Activity_A1|Activity_A2|Details |
+-----------+-----------+----------------------------------------------------+
|Act1_Attr1 |Act1_Attr2 |[[Agr2_Attr1, Agr2_Attr2], [Agr1_Attr1, Agr1_Attr2]]|
|Act2_Attr1 |Act2_Attr2 |[[Agr4_Attr1, Agr4_Attr2], [Agr3_Attr1, Agr3_Attr2]]|
|Act3_Attr1 |Act3_Attr2 |[[Agr5_Attr1, Agr5_Attr2]] |
+-----------+-----------+----------------------------------------------------+
scala> df1.printSchema
root
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
scala> df2.show(false)
+------------+--------------------------+
|Agreement_A1|Lines |
+------------+--------------------------+
|Agr1_Attr1 |[[A1At1Line1, A1At1Line2]]|
|Agr3_Attr1 |[[A3At1Line1, A3At1Line2]]|
|Agr4_Attr1 |[[A4At1Line1, A4At1Line2]]|
|Agr5_Attr1 |[[A5At1Line1, A5At1Line2]]|
|Agr6_Attr1 |[[A6At1Line1, A6At1Line2]]|
+------------+--------------------------+
scala> df2.printSchema
root
|-- Agreement_A1: string (nullable = true)
|-- Lines: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Line_A1: string (nullable = true)
| | |-- Line_A2: string (nullable = true)
scala> val outputDF = df1.withColumn("DetailsExploded", explode($"Details")).join(
| df2, $"DetailsExploded.Agreement_A1" === $"Agreement_A1").withColumn(
| "DetailsWithAgreementA1Lines", struct($"DetailsExploded.Agreement_A1" as "Agreement_A1", $"DetailsExploded.Agreement_A2" as "Agreement_A2", $"Lines"))
outputDF: org.apache.spark.sql.DataFrame = [Activity_A1: string, Activity_A2: string ... 5 more fields]
scala> outputDF.show(false)
+-----------+-----------+----------------------------------------------------+------------------------+------------+--------------------------+----------------------------------------------------+
|Activity_A1|Activity_A2|Details |DetailsExploded |Agreement_A1|Lines |DetailsWithAgreementA1Lines |
+-----------+-----------+----------------------------------------------------+------------------------+------------+--------------------------+----------------------------------------------------+
|Act1_Attr1 |Act1_Attr2 |[[Agr2_Attr1, Agr2_Attr2], [Agr1_Attr1, Agr1_Attr2]]|[Agr1_Attr1, Agr1_Attr2]|Agr1_Attr1 |[[A1At1Line1, A1At1Line2]]|[Agr1_Attr1, Agr1_Attr2, [[A1At1Line1, A1At1Line2]]]|
|Act2_Attr1 |Act2_Attr2 |[[Agr4_Attr1, Agr4_Attr2], [Agr3_Attr1, Agr3_Attr2]]|[Agr3_Attr1, Agr3_Attr2]|Agr3_Attr1 |[[A3At1Line1, A3At1Line2]]|[Agr3_Attr1, Agr3_Attr2, [[A3At1Line1, A3At1Line2]]]|
|Act2_Attr1 |Act2_Attr2 |[[Agr4_Attr1, Agr4_Attr2], [Agr3_Attr1, Agr3_Attr2]]|[Agr4_Attr1, Agr4_Attr2]|Agr4_Attr1 |[[A4At1Line1, A4At1Line2]]|[Agr4_Attr1, Agr4_Attr2, [[A4At1Line1, A4At1Line2]]]|
|Act3_Attr1 |Act3_Attr2 |[[Agr5_Attr1, Agr5_Attr2]] |[Agr5_Attr1, Agr5_Attr2]|Agr5_Attr1 |[[A5At1Line1, A5At1Line2]]|[Agr5_Attr1, Agr5_Attr2, [[A5At1Line1, A5At1Line2]]]|
+-----------+-----------+----------------------------------------------------+------------------------+------------+--------------------------+----------------------------------------------------+
scala> outputDF.printSchema
root
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
|-- DetailsExploded: struct (nullable = true)
| |-- Agreement_A1: string (nullable = true)
| |-- Agreement_A2: string (nullable = true)
|-- Agreement_A1: string (nullable = true)
|-- Lines: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Line_A1: string (nullable = true)
| | |-- Line_A2: string (nullable = true)
|-- DetailsWithAgreementA1Lines: struct (nullable = false)
| |-- Agreement_A1: string (nullable = true)
| |-- Agreement_A2: string (nullable = true)
| |-- Lines: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Line_A1: string (nullable = true)
| | | |-- Line_A2: string (nullable = true)
scala> outputDF.groupBy("Activity_A1", "Activity_A2").agg(collect_list($"DetailsWithAgreementA1Lines") as "Details").show(false)
+-----------+-----------+------------------------------------------------------------------------------------------------------------+
|Activity_A1|Activity_A2|Details |
+-----------+-----------+------------------------------------------------------------------------------------------------------------+
|Act1_Attr1 |Act1_Attr2 |[[Agr1_Attr1, Agr1_Attr2, [[A1At1Line1, A1At1Line2]]]] |
|Act2_Attr1 |Act2_Attr2 |[[Agr3_Attr1, Agr3_Attr2, [[A3At1Line1, A3At1Line2]]], [Agr4_Attr1, Agr4_Attr2, [[A4At1Line1, A4At1Line2]]]]|
|Act3_Attr1 |Act3_Attr2 |[[Agr5_Attr1, Agr5_Attr2, [[A5At1Line1, A5At1Line2]]]] |
+-----------+-----------+------------------------------------------------------------------------------------------------------------+
scala> outputDF.groupBy("Activity_A1", "Activity_A2").agg(collect_list($"DetailsWithAgreementA1Lines") as "Details").printSchema
root
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
| | |-- Lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Line_A1: string (nullable = true)
| | | | |-- Line_A2: string (nullable = true)

How can I create a nested column by joining in Spark?

I would like to perform a "join" on two Spark DataFrames (Scala), but instead of a SQL-like join, I'd like to insert the "joined" row from the second DataFrame as a single nested column in the first. The reason to do so is, ultimately, to write back out to JSON with a nested structure. I know the answer is likely already on Stackoverflow, but some searching has not turned up my answer.
Table 1
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
Table 2
root
|-- BioProject: string (nullable = true)
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- abstract: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- dbGaP: string (nullable = true)
|-- description: string (nullable = true)
|-- external_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- submitter_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
Join is on table1.study_accession with table2.accession. Result is below. Note the new column called study that contains record equivalents of Rows from table 2.
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
|-- accession: string (nullable = true)
|-- study: struct (nullable = true)
| |-- BioProject: string (nullable = true)
| |-- Insdc: string (nullable = true)
| |-- LastMetaUpdate: string (nullable = true)
| |-- LastUpdate: string (nullable = true)
| |-- Published: string (nullable = true)
| |-- Received: string (nullable = true)
| |-- ReplacedBy: string (nullable = true)
| |-- Status: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- abstract: string (nullable = true)
| |-- accession: string (nullable = true)
| |-- alias: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tag: string (nullable = true)
| | | |-- value: string (nullable = true)
| |-- dbGaP: string (nullable = true)
| |-- description: string (nullable = true)
| |-- external_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- submitter_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- tags: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: string (nullable = true)
From my understanding to your question, lets say you have two dataframes
df1
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
and
df2
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: double (nullable = false)
You will have to combine all the columns of df2 into a struct column and select the columns to be joined and the struct column. Here I am taking col1 as the joining column
import org.apache.spark.sql.functions._
val nestedDF2 = df2.select($"col1", struct(df2.columns.map(col):_*).as("nested_df2"))
Then final step is to join (here default is the inner join)
df1.join(nestedDF2, Seq("col1"))
which should give you
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
|-- nested_df2: struct (nullable = false)
| |-- col1: string (nullable = true)
| |-- col2: string (nullable = true)
| |-- col3: double (nullable = false)
I hope the answer is helpful