Create Nested Array DataFrame From Existing DataFrame

I am trying to create a nested struct array column from a DataFrame during a join operation in Scala. The only thing I can get working is an array-of-elements structure, which does not look right in the JSON output.
The current schema I am starting with is:
root
|-- memberId: integer (nullable = false)
|-- memberSubscriberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
root
|-- memberSubscriberId: integer (nullable = false)
|-- subscriberaddresstypecode: string (nullable = false)
|-- lineOne: string (nullable = false)
|-- lineTwo: string (nullable = false)
|-- lineThree: string (nullable = false)
|-- cityName: string (nullable = false)
|-- stateCode: string (nullable = false)
|-- zipCode: string (nullable = false)
|-- countyCode: string (nullable = false)
|-- countryCode: string (nullable = false)
|-- subscriberphonenumber: string (nullable = false)
|-- subscriberphoneextensionnumber: string (nullable = false)
|-- subscriberfaxnumber: string (nullable = false)
|-- subscriberfaxextensionnumber: string (nullable = false)
|-- address: string (nullable = false)
The schema I think I need is:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- lineOne: string (nullable = false)
| |-- lineTwo: string (nullable = false)
| |-- lineThree: string (nullable = false)
| |-- cityName: string (nullable = false)
| |-- stateCode: string (nullable = false)
| |-- zipCode: string (nullable = false)
| |-- countyCode: string (nullable = false)
| |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- phoneNumber: string (nullable = false)
| |-- effectiveDate: null (nullable = true)
| |-- terminationDate: null (nullable = true)
| |-- isCurrent: null (nullable = true)
| |-- isActive: null (nullable = true)
| |-- telecomType: string (nullable = false)
Current code:
val clientDF: DataFrame
val addrDF: DataFrame
import spark.implicits._
val nestedAddr = addrDF.select(
  $"clientSubscriberId",
  array(
    struct(
      $"lineOne",
      $"lineTwo",
      $"lineThree",
      $"cityName",
      $"stateCode",
      $"zipCode",
      $"countyCode",
      $"countryCode"
    )
  ).as("clientAddresses"),
  array(
    struct(
      $"subscriberphonenumber".alias("phoneNumber"),
      //$"subscriberphoneextensionnumber"
      lit(null).alias("effectiveDate"),
      lit(null).alias("terminationDate"),
      lit(null).alias("isCurrent"),
      lit(null).alias("isActive"),
      lit("home").alias("telecomType")
    ),
    struct(
      $"subscriberfaxnumber".alias("phoneNumber"),
      //$"subscriberfaxextensionnumber".map(c => col(c).as("phoneNumber"))
      lit(null).alias("effectiveDate"),
      lit(null).alias("terminationDate"),
      lit(null).alias("isCurrent"),
      lit(null).alias("isActive"),
      lit("fax").alias("telecomType")
    )
  ).as("memeberPhoneNumbers")
)
val addrMbrDF = mbrDF.join(nestedAddr, Seq("clientSubscriberId"))
Resulting schema:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- lineOne: string (nullable = false)
| | |-- lineTwo: string (nullable = false)
| | |-- lineThree: string (nullable = false)
| | |-- cityName: string (nullable = false)
| | |-- stateCode: string (nullable = false)
| | |-- zipCode: string (nullable = false)
| | |-- countyCode: string (nullable = false)
| | |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- phoneNumber: string (nullable = false)
| | |-- effectiveDate: null (nullable = true)
| | |-- terminationDate: null (nullable = true)
| | |-- isCurrent: null (nullable = true)
| | |-- isActive: null (nullable = true)
| | |-- telecomType: string (nullable = false)
Expected schema:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- lineOne: string (nullable = false)
| |-- lineTwo: string (nullable = false)
| |-- lineThree: string (nullable = false)
| |-- cityName: string (nullable = false)
| |-- stateCode: string (nullable = false)
| |-- zipCode: string (nullable = false)
| |-- countyCode: string (nullable = false)
| |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- phoneNumber: string (nullable = false)
| |-- effectiveDate: null (nullable = true)
| |-- terminationDate: null (nullable = true)
| |-- isCurrent: null (nullable = true)
| |-- isActive: null (nullable = true)
| |-- telecomType: string (nullable = false)
I have tried multiple different things to get it to work:
).as("clientAddresses"),
array(
struct(
).as("clientAddresses"),
struct(
).as("clientAddresses"),
array(
).as("clientAddresses"),
collect_list(
struct(

Simply put, the schema you expect is not possible. An array always contains elements of a given type, which in your case is a struct, and printSchema shows that type as the "element" level. So I'd say that the schema you're getting is exactly what you want to achieve.
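To make this concrete, here is a minimal sketch, assuming a local SparkSession and a made-up sample row with only two of the address columns (the names ArrayOfStructsDemo and build are hypothetical). It shows that the element level appears only in the printed schema, while the JSON output holds plain objects inside the array:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, struct}

object ArrayOfStructsDemo {
  // Hypothetical sample data: one address row with two of the question's columns.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, "1 Main St", "Springfield"))
      .toDF("memberSubscriberId", "lineOne", "cityName")
      .select(
        col("memberSubscriberId"),
        // array(struct(...)) yields array<struct<lineOne, cityName>>
        array(struct(col("lineOne"), col("cityName"))).as("memberAddresses"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()
    val df = build(spark)
    // The printed schema contains the "element: struct" level under memberAddresses.
    df.printSchema()
    // Each JSON row holds an array of plain objects; there is no "element" key.
    df.toJSON.collect().foreach(println)
    spark.stop()
  }
}
```

If the JSON looks wrong, the cause is more likely the number of structs placed in the array than the element level itself.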

Related

How to drop nested column or filter nested column in scala

root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| | |-- index: string (nullable = false)
| | |-- failed_at: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- updated_at: string (nullable = true)
How can I remove columns from the webhooks struct based on an input list,
e.g. filterList: List[String] = List("index", "status")? Is there a way to do this by iterating over rows, so that the intermediate schema changes but not the final schema?
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| | |-- index: string (nullable = false)
| | |-- status: string (nullable = true)

Check below code.
scala> df.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = true)
| |-- index: string (nullable = true)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
scala> val actualColumns = df.select(s"webhooks.*").columns
scala> val removeColumns = Seq("index","status")
scala> val webhooks = struct(actualColumns.filter(c => !removeColumns.contains(c)).map(c => col(s"webhooks.${c}")):_*).as("webhooks")
Output
scala> df.withColumn("webhooks",webhooks).printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- updated_at: string (nullable = true)

You can also look at https://stackoverflow.com/a/39943812/2204206,
which can be more convenient when removing deeply nested columns.
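On Spark 3.1 or later, Column.dropFields offers a shorter route to the same result as the struct rebuild shown above. A minimal sketch, assuming Spark 3.1+ and made-up sample values (the names DropNestedFieldsDemo, sample and dropFromWebhooks are hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct}

object DropNestedFieldsDemo {
  // Builds a tiny DataFrame shaped like the question's: an _id plus a webhooks struct.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("a1", "0", "2020-01-01", "ok", "2020-01-02"))
      .toDF("_id", "index", "failed_at", "status", "updated_at")
      .select(
        col("_id"),
        struct(col("index"), col("failed_at"), col("status"), col("updated_at")).as("webhooks"))
  }

  // Removes the named fields from the webhooks struct (assumes Spark 3.1+).
  def dropFromWebhooks(df: DataFrame, remove: Seq[String]): DataFrame =
    df.withColumn("webhooks", col("webhooks").dropFields(remove: _*))
}
```

Usage would be along the lines of DropNestedFieldsDemo.dropFromWebhooks(df, Seq("index", "status")).printSchema(). On older Spark versions the struct rebuild remains the way to go.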

how to create and match schema in scala

Hi, I have a schema coming in as follows:
|-- eventObject: struct (nullable = true)
| |-- baseDivisionCode: string (nullable = true)
| |-- countryCode: string (nullable = true)
| |-- dcNumber: long (nullable = true)
| |-- financialReportingGroup: string (nullable = true)
| |-- itemList: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- availabletosellQty: long (nullable = true)
| | | |-- distroAvailableQty: long (nullable = true)
| | | |-- itemNumber: long (nullable = true)
| | | |-- itemUPC: string (nullable = true)
| | | |-- ossIndicator: string (nullable = true)
| | | |-- turnAvailableQty: long (nullable = true)
| | | |-- unitOfMeasurement: string (nullable = true)
| | | |-- weightFormatType: string (nullable = true)
| | | |-- whpkRatio: long (nullable = true)
To map this, I have created the following schema type:
|-- eventObject: struct (nullable = true)
| |-- baseDivisionCode: string (nullable = true)
| |-- countryCode: string (nullable = true)
| |-- dcNumber: integer (nullable = true)
| |-- financialReportingGroup: string (nullable = true)
| |-- itemList: struct (nullable = true)
| | |-- availabletosellQty: long (nullable = true)
| | |-- distroAvailableQty: long (nullable = true)
| | |-- itemNumber: long (nullable = true)
| | |-- itemUPC: string (nullable = true)
| | |-- ossIndicator: string (nullable = true)
| | |-- turnAvailableQty: long (nullable = true)
| | |-- unitOfMeasurement: string (nullable = true)
| | |-- weightFormatType: string (nullable = true)
| | |-- whpkRatio: long (nullable = true)
by writing something like this:
val testSchema = new StructType()
.add("eventObject", new StructType()
.add("baseDivisionCode", StringType)
.add("countryCode",StringType)
.add("dcNumber", IntegerType)
.add("financialReportingGroup",StringType)
.add("itemList",new StructType(
Array(
StructField("availabletosellQty",LongType),
StructField("distroAvailableQty",LongType),
StructField("itemNumber", LongType),
StructField("itemUPC", StringType),
StructField("ossIndicator",StringType),
StructField("turnAvailableQty",LongType),
StructField("unitOfMeasurement",StringType),
StructField("weightFormatType",StringType),
StructField("whpkRatio",LongType)))))
But it does not match the schema that I am receiving. What am I doing wrong?
I am getting null values when I try to populate it with some data.
|-- eventObject: struct (nullable = true)
| |-- baseDivisionCode: string (nullable = true)
| |-- countryCode: string (nullable = true)
| |-- dcNumber: long (nullable = true)
| |-- financialReportingGroup: string (nullable = true)
| |-- itemList: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- itemNumber: long (nullable = true)
| | | |-- itemUPC: string (nullable = true)
| | | |-- unitOfMeasurement: string (nullable = true)
| | | |-- availabletosellQty: long (nullable = true)
| | | |-- turnAvailableQty: long (nullable = true)
| | | |-- distroAvailableQty: long (nullable = true)
| | | |-- ossIndicator: string (nullable = true)
| | | |-- weightFormatType: string (nullable = true)
|-- kafka_timestamp: timestamp (nullable = true)
|-- baseDivisionCode: string (nullable = true)
|-- countryCode: string (nullable = true)
|-- dcNumber: long (nullable = true)
|-- financialReportingGroup: string (nullable = true)
|-- itemList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemNumber: long (nullable = true)
| | |-- itemUPC: string (nullable = true)
| | |-- unitOfMeasurement: string (nullable = true)
| | |-- availabletosellQty: long (nullable = true)
| | |-- turnAvailableQty: long (nullable = true)
| | |-- distroAvailableQty: long (nullable = true)
| | |-- ossIndicator: string (nullable = true)
| | |-- weightFormatType: string (nullable = true)
|-- kafka_timestamp: timestamp (nullable = true)
When I try to flatten it further, it errors out because of the array:
"Exception in thread "main" org.apache.spark.sql.AnalysisException: Can only star expand struct data types. Attribute: ArrayBuffer(itemList);"
I am trying to get it to:
|-- facilityCountryCode: string (nullable = true)
|-- facilityNum: string (nullable = true)
|-- WMT_CorrelationId: string (nullable = true)
|-- WMT_IdempotencyKey: string (nullable = true)
|-- WMT_Timestamp: string (nullable = true)
|-- countryCode: string (nullable = true)
|-- dcNumber: integer (nullable = true)
|-- financialReportingGroup: string (nullable = true)
|-- baseDivisionCode: string (nullable = true)
|-- itemNumber: integer (nullable = true)
|-- itemUPC: string (nullable = true)
|-- unitOfMeasurement: string (nullable = true)
|-- availabletosellQty: integer (nullable = true)
|-- turnAvailableQty: integer (nullable = true)
|-- distroAvailableQty: integer (nullable = true)
|-- ossIndicator: string (nullable = true)
|-- weightFormatType: string (nullable = true)
|-- kafka_timestamp: timestamp (nullable = true)
|-- year-month-day: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- hour: integer (nullable = true)
This is what I did:
val testParsed = TestExploded.select($"exploded.*", $"kafka_timestamp")
val testFlattened = testParsed.select($"eventObject.*", $"kafka_timestamp")
// the $"itemList.*" expansion below is the part that fails, because itemList is an array
val test_flattened_further = testFlattened.select($"countryCode",
  $"dcNumber", $"financialReportingGroup", $"baseDivisionCode", $"itemList.*", $"kafka_timestamp")

Use ArrayType to specify array type:
val testSchema = new StructType()
.add("eventObject", new StructType()
.add("baseDivisionCode", StringType)
.add("countryCode", StringType)
.add("dcNumber", LongType)
.add("financialReportingGroup", StringType)
.add("itemList", new ArrayType(
new StructType(
Array(
StructField("itemNumber", LongType),
StructField("itemUPC", StringType),
StructField("unitOfMeasurement", StringType),
StructField("availabletosellQty", LongType),
StructField("turnAvailableQty", LongType),
StructField("distroAvailableQty", LongType),
StructField("ossIndicator", StringType),
StructField("weightFormatType", StringType))), containsNull = true)))
To fully flatten the DataFrame, explode the array of structs and then move the struct fields into top-level columns with the select("structColName.*") syntax, as follows:
df
.select("eventObject.*")
.select(
col("baseDivisionCode"),
col("countryCode"),
col("dcNumber"),
col("financialReportingGroup"),
explode(col("itemList")).as("explodedItemList"))
.select(
col("baseDivisionCode"),
col("countryCode"),
col("dcNumber"),
col("financialReportingGroup"),
col("explodedItemList.*")
)
.printSchema()
Will output:
root
|-- baseDivisionCode: string (nullable = true)
|-- countryCode: string (nullable = true)
|-- dcNumber: long (nullable = true)
|-- financialReportingGroup: string (nullable = true)
|-- itemNumber: long (nullable = true)
|-- itemUPC: string (nullable = true)
|-- unitOfMeasurement: string (nullable = true)
|-- availabletosellQty: long (nullable = true)
|-- turnAvailableQty: long (nullable = true)
|-- distroAvailableQty: long (nullable = true)
|-- ossIndicator: string (nullable = true)
|-- weightFormatType: string (nullable = true)
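The two steps above can be combined end to end. The following is a sketch under assumptions: the JSON payload is passed in as a string, the schema is trimmed down to two item fields, and the names FlattenDemo and flatten are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructType}

object FlattenDemo {
  // Trimmed-down version of the schema above: itemList is an array of structs.
  val schema: StructType = new StructType()
    .add("eventObject", new StructType()
      .add("countryCode", StringType)
      .add("dcNumber", LongType)
      .add("itemList", ArrayType(new StructType()
        .add("itemNumber", LongType)
        .add("itemUPC", StringType))))

  // Parses one JSON string, then produces one flat row per itemList element.
  def flatten(spark: SparkSession, json: String): DataFrame = {
    import spark.implicits._
    Seq(json).toDF("value")
      .select(from_json(col("value"), schema).as("parsed"))
      .select(col("parsed.eventObject").as("eventObject"))
      .select(col("eventObject.*"))
      // explode turns each array element into its own row
      .select(col("countryCode"), col("dcNumber"), explode(col("itemList")).as("item"))
      // star expansion works here because item is a struct, not an array
      .select(col("countryCode"), col("dcNumber"), col("item.*"))
  }
}
```

Note that explode drops rows whose array is null or empty; explode_outer keeps them, if that matters for the data at hand.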

How to append more columns to a structural datafame in scala

I have two DataFrames, A and B. A has a nested (struct) schema whereas B has a flat schema, as shown below. I want to append B's columns into A to produce C.
A:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
B:
root
|-- globalPackageId: long (nullable = true)
|-- order_id: long (nullable = true)
|-- order_address: string (nullable = true)
|-- order_number: integer (nullable = true)
C:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
| |-- order_id: long (nullable = true)
| |-- order_address: string (nullable = true)
| |-- order_number: integer (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
I am struggling with .withColumn(struct("xxx"), "xxx"),
but the result is still not what I expect.
Do you have any experience with this?
Thanks,

How to cast all columns of a DataFrame (with Nested StructTypes and nested ArrayType) to string in Spark

root
|-- channelGrouping: string (nullable = true)
|-- clientId:string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |--index: Long (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable =true)
| |-- browser:string(nullable = true)
| |-- browserSize: Int (nullable = true)
| |-- browserVersion:string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |--isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)
val structCastExpression1 = df.schema
  .filter(_.dataType.isInstanceOf[StructType])
  .map(c => (c.name, c.dataType.asInstanceOf[StructType].map(_.name)))
  .map { case (col, sub) =>
    s"""cast($col as struct${sub.map { c => s"$c:string" }.mkString("<", ",", ">")} ) as $col"""
  }
// List(cast(s1 as struct<x:string,y:string> ) as s1,
//      cast(s2 as struct<u:string,v:string> ) as s2)
val otherColumns = df.schema
  .filterNot(_.dataType.isInstanceOf[StructType])
  .map(c => s""" cast(${c.name} as string) as ${c.name} """)
// List(" cast(id as string) as id ", " cast(d as string) as d")
// original columns
val originalColumns = df.columns
// Union both the expressions into one big expression
val finalExpression = otherColumns.union(structCastExpression1)
// List(" cast(id as string) as id ",
//      " cast(d as string) as d ",
//      cast(s1 as struct<x:string,y:string> ) as s1,
//      cast(s2 as struct<u:string,v:string> ) as s2 )
// Use `selectExpr` to pass the expression
df.selectExpr(finalExpression: _*)
  .select(originalColumns.head, originalColumns.tail: _*)
  .printSchema
After running this, I get:
root
|-- channelGrouping: string (nullable = true)
|-- clientId:string (nullable = true)
|-- customDimensions: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion:string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |--isMobile: string (nullable = true)
| |-- javaEnabled: string (nullable = true)
| |-- language: string (nullable = true)
The expected output is:
root
|-- channelGrouping: string (nullable = true)
|-- clientId:string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |--index: String (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable =true)
| |-- browser:string(nullable = true)
| |-- browserSize: String (nullable = true)
| |-- browserVersion:string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |--isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)

How can I create a nested column by joining in Spark?

I would like to perform a "join" on two Spark DataFrames (Scala), but instead of a SQL-like join, I'd like to insert the "joined" row from the second DataFrame as a single nested column in the first. The reason to do so is, ultimately, to write back out to JSON with a nested structure. I know the answer is likely already on Stackoverflow, but some searching has not turned up my answer.
Table 1
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
Table 2
root
|-- BioProject: string (nullable = true)
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- abstract: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- dbGaP: string (nullable = true)
|-- description: string (nullable = true)
|-- external_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- submitter_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
Join is on table1.study_accession with table2.accession. Result is below. Note the new column called study that contains record equivalents of Rows from table 2.
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
|-- accession: string (nullable = true)
|-- study: struct (nullable = true)
| |-- BioProject: string (nullable = true)
| |-- Insdc: string (nullable = true)
| |-- LastMetaUpdate: string (nullable = true)
| |-- LastUpdate: string (nullable = true)
| |-- Published: string (nullable = true)
| |-- Received: string (nullable = true)
| |-- ReplacedBy: string (nullable = true)
| |-- Status: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- abstract: string (nullable = true)
| |-- accession: string (nullable = true)
| |-- alias: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tag: string (nullable = true)
| | | |-- value: string (nullable = true)
| |-- dbGaP: string (nullable = true)
| |-- description: string (nullable = true)
| |-- external_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- submitter_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- tags: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: string (nullable = true)

From my understanding of your question, let's say you have two DataFrames:
df1
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
and
df2
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: double (nullable = false)
You will have to combine all the columns of df2 into a struct column, then select the joining column together with that struct column. Here I am taking col1 as the joining column:
import org.apache.spark.sql.functions._
// needed for the $"..." column syntax; assumes a SparkSession named spark is in scope
import spark.implicits._
val nestedDF2 = df2.select($"col1", struct(df2.columns.map(col):_*).as("nested_df2"))
Then the final step is to join (the default is an inner join):
df1.join(nestedDF2, Seq("col1"))
which should give you
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
|-- nested_df2: struct (nullable = false)
| |-- col1: string (nullable = true)
| |-- col2: string (nullable = true)
| |-- col3: double (nullable = false)
I hope the answer is helpful
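Applied to the question's tables, where the join keys have different names, the same idea might look like the sketch below. The names NestedJoinDemo, nestStudy, runs and studies are assumptions made for illustration; only study_accession, accession and study come from the question.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}

object NestedJoinDemo {
  // Packs every column of `studies` into one struct column named "study",
  // then attaches it to `runs` where runs.study_accession == studies.accession.
  def nestStudy(runs: DataFrame, studies: DataFrame): DataFrame = {
    val nested = studies.select(
      // rename the key so it matches the column name on the left side
      col("accession").as("study_accession"),
      struct(studies.columns.map(col): _*).as("study"))
    // left join: rows of `runs` without a matching study get a null "study" struct
    runs.join(nested, Seq("study_accession"), "left")
  }
}
```

A left join is used here so that unmatched rows are kept; switching the last argument to "inner" reproduces the default behaviour of the answer above. Writing the result with df.write.json(...) then produces study as a nested JSON object.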
