How can I create a nested column by joining in Spark? - scala

I would like to perform a "join" on two Spark DataFrames (Scala), but instead of a SQL-like join, I'd like to insert the "joined" row from the second DataFrame as a single nested column in the first. The reason to do so is, ultimately, to write back out to JSON with a nested structure. I know the answer is likely already on Stackoverflow, but some searching has not turned up my answer.
Table 1
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
Table 2
root
|-- BioProject: string (nullable = true)
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- abstract: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- dbGaP: string (nullable = true)
|-- description: string (nullable = true)
|-- external_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- submitter_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
Join is on table1.study_accession with table2.accession. Result is below. Note the new column called study that contains record equivalents of Rows from table 2.
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
|-- accession: string (nullable = true)
|-- study: struct (nullable = true)
| |-- BioProject: string (nullable = true)
| |-- Insdc: string (nullable = true)
| |-- LastMetaUpdate: string (nullable = true)
| |-- LastUpdate: string (nullable = true)
| |-- Published: string (nullable = true)
| |-- Received: string (nullable = true)
| |-- ReplacedBy: string (nullable = true)
| |-- Status: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- abstract: string (nullable = true)
| |-- accession: string (nullable = true)
| |-- alias: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tag: string (nullable = true)
| | | |-- value: string (nullable = true)
| |-- dbGaP: string (nullable = true)
| |-- description: string (nullable = true)
| |-- external_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- submitter_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- tags: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: string (nullable = true)

From my understanding to your question, lets say you have two dataframes
df1
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
and
df2
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: double (nullable = false)
You will have to combine all the columns of df2 into a struct column and select the columns to be joined and the struct column. Here I am taking col1 as the joining column
import org.apache.spark.sql.functions._
val nestedDF2 = df2.select($"col1", struct(df2.columns.map(col):_*).as("nested_df2"))
Then final step is to join (here default is the inner join)
df1.join(nestedDF2, Seq("col1"))
which should give you
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
|-- nested_df2: struct (nullable = false)
| |-- col1: string (nullable = true)
| |-- col2: string (nullable = true)
| |-- col3: double (nullable = false)
I hope the answer is helpful

Related

Spark reading Open Street Map data and selecting entries

I have OpenStreetMap (OSM) data from a .orc stored in var nlorc of a country for which I am trying to read out data for specific cities. As far as I know, a city entity is defined as a 'relation' in OSM. The nlorc.printSchema() of my data returns the following:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- lat: decimal(9,7) (nullable = true)
|-- lon: decimal(10,7) (nullable = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
|-- changeset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- uid: long (nullable = true)
|-- user: string (nullable = true)
|-- version: long (nullable = true)
|-- visible: boolean (nullable = true)
As an example, https://www.openstreetmap.org/relation/47798#map=13/51.4373/4.8888 shows that the name of the city is part of "Tags". How can I access the keys of Tags and select specific cities?
You can use getItem to access the elements of the map:
df = ...
df.filter(df("tags").getItem("name")==="Baarle-Nassau").show()

How to drop nested column or filter nested column in scala

root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| | |-- index: string (nullable = false)
| | |-- failed_at: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- updated_at: string (nullable = true)
How to remove the column from (webhooks) by taking the input from list
eg filterList: List[String]= List("index","status"). Is there any way to do by iterating row like the intermediate schema will change not the final schema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| | |-- index: string (nullable = false)
| | |-- status: string (nullable = true)
Check below code.
scala> df.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = true)
| |-- index: string (nullable = true)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
scala> val actualColumns = df.select(s"webhooks.*").columns
scala> val removeColumns = Seq("index","status")
scala> val webhooks = struct(actualColumns.filter(c => !removeColumns.contains(c)).map(c => col(s"webhooks.${c}")):_*).as("webhooks")
Output
scala> df.withColumn("webhooks",webhooks).printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- updated_at: string (nullable = true)
Can also look at https://stackoverflow.com/a/39943812/2204206
Can be more convenient when removing deeply nested columns

How to append more columns to a structural datafame in scala

I have two dataframes (A and B), A is a structural schema whereas B is a common schema as below and will append B columns into A for C
A:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
B:
root
|-- globalPackageId: long (nullable = true)
|-- order_id: long (nullable = true)
|-- order_address: string (nullable = true)
|-- order_number: integer (nullable = true)
C:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
| |-- order_id: long (nullable = true)
| |-- order_address: string (nullable = true)
| |-- order_number: integer (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
I am struggling to use .withColumn(struct("xxx"), "xxx")
But looks still not expected
Do you have any experience on this
Thanks,

How to cast all columns of a DataFrame (with Nested StructTypes and nested ArrayType) to string in Spark

root
|-- channelGrouping: string (nullable = true)
|-- clientId:string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |--index: Long (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable =true)
| |-- browser:string(nullable = true)
| |-- browserSize: Int (nullable = true)
| |-- browserVersion:string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |--isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)
val structCastExpression1 = df.schema
.filter(_.dataType.isInstanceOf[StructType])
.map(c=> (c.name, c.dataType.asInstanceOf[StructType].map(_.name)))
.map{ case (col, sub) => s"""cast($col as struct${sub.map{ c =>
s"$c:string" }.mkString("<" , "," , ">")} ) as $col"""}
//List(cast(s1 as struct<x:string,y:string> ) as s1, // cast(s2
as struct<u:string,v:string> ) as s2)
val otherColumns = df.schema
.filterNot(_.dataType.isInstanceOf[StructType])
.map( c=> s""" cast(${c.name} as string) as ${c.name} """) //List(" cast(id as string) as id ", " cast(d as string) as d")
//original columns val originalColumns = df.columns
// Union both the expressions into one big expression val
finalExpression = otherColumns.union(structCastExpression1) //
List(" cast(id as string) as id ", // " cast(d as string) as d
", // cast(s1 as struct<x:string,y:string> ) as s1, //
cast(s2 as struct<u:string,v:string> ) as s2 )
// Use `selectExpr` to pass the expression
df.selectExpr(finalExpression : _*)
.select(originalColumns.head, originalColumns.tail: _*)
.printSchema
After i am using this
root
|-- channelGrouping: string (nullable = true)
|-- clientId:string (nullable = true)
|-- customDimensions: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion:string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |--isMobile: string (nullable = true)
| |-- javaEnabled: string (nullable = true)
| |-- language: string (nullable = true)
expected out put is
root
|-- channelGrouping: string (nullable = true)
|-- clientId:string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |--index: String (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable =true)
| |-- browser:string(nullable = true)
| |-- browserSize: String (nullable = true)
| |-- browserVersion:string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |--isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)

Create Nested Array DataFrame From Existing DataFrame

I am attempting to create a nested struct array column from a dataframe during a 'join' operation in scala. The only thing I appear to be able to get working is setting up a array of elements structure which does not look write in the json output.
The current schema I am starting with is:
root
|-- memberId: integer (nullable = false)
|-- memberSubscriberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
root
|-- memberSubscriberId: integer (nullable = false)
|-- subscriberaddresstypecode: string (nullable = false)
|-- lineOne: string (nullable = false)
|-- lineTwo: string (nullable = false)
|-- lineThree: string (nullable = false)
|-- cityName: string (nullable = false)
|-- stateCode: string (nullable = false)
|-- zipCode: string (nullable = false)
|-- countyCode: string (nullable = false)
|-- countryCode: string (nullable = false)
|-- subscriberphonenumber: string (nullable = false)
|-- subscriberphoneextensionnumber: string (nullable = false)
|-- subscriberfaxnumber: string (nullable = false)
|-- subscriberfaxextensionnumber: string (nullable = false)
|-- address: string (nullable = false)
Going to I think:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- lineOne: string (nullable = false)
| |-- lineTwo: string (nullable = false)
| |-- lineThree: string (nullable = false)
| |-- cityName: string (nullable = false)
| |-- stateCode: string (nullable = false)
| |-- zipCode: string (nullable = false)
| |-- countyCode: string (nullable = false)
| |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- phoneNumber: string (nullable = false)
| |-- effectiveDate: null (nullable = true)
| |-- terminationDate: null (nullable = true)
| |-- isCurrent: null (nullable = true)
| |-- isActive: null (nullable = true)
| |-- telecomType: string (nullable = false)
Current code:
val clientDF: DataFrame
val addrDF: DataFrame
import spark.implicits._
val nestedAddr = addrDF.select(
$"clientSubscriberId",
array(
struct(
$"lineOne",
$"lineTwo",
$"lineThree",
$"cityName",
$"stateCode",
$"zipCode",
$"countyCode",
$"countryCode"
)
).as("clientAddresses"),
array(
struct(
$"subscriberphonenumber".alias("phoneNumber"),
//$"subscriberphoneextensionnumber"
lit(null).alias("effectiveDate"),
lit(null).alias("terminationDate"),
lit(null).alias("isCurrent"),
lit(null).alias("isActive"),
lit("home").alias("telecomType")
),
struct(
$"subscriberfaxnumber".alias("phoneNumber"),
//$"subscriberfaxextensionnumber".map(c => col(c).as("phoneNumber"))
lit(null).alias("effectiveDate"),
lit(null).alias("terminationDate"),
lit(null).alias("isCurrent"),
lit(null).alias("isActive"),
lit("fax").alias("telecomType")
)
).as("memeberPhoneNumbers")
)
val addrMbrDF = mbrDF.join(nestedAddr, Seq("clientSubscriberId"))
Resulting schema:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- lineOne: string (nullable = false)
| | |-- lineTwo: string (nullable = false)
| | |-- lineThree: string (nullable = false)
| | |-- cityName: string (nullable = false)
| | |-- stateCode: string (nullable = false)
| | |-- zipCode: string (nullable = false)
| | |-- countyCode: string (nullable = false)
| | |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- phoneNumber: string (nullable = false)
| | |-- effectiveDate: null (nullable = true)
| | |-- terminationDate: null (nullable = true)
| | |-- isCurrent: null (nullable = true)
| | |-- isActive: null (nullable = true)
| | |-- telecomType: string (nullable = false)
Expected schema:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- lineOne: string (nullable = false)
| |-- lineTwo: string (nullable = false)
| |-- lineThree: string (nullable = false)
| |-- cityName: string (nullable = false)
| |-- stateCode: string (nullable = false)
| |-- zipCode: string (nullable = false)
| |-- countyCode: string (nullable = false)
| |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- phoneNumber: string (nullable = false)
| |-- effectiveDate: null (nullable = true)
| |-- terminationDate: null (nullable = true)
| |-- isCurrent: null (nullable = true)
| |-- isActive: null (nullable = true)
| |-- telecomType: string (nullable = false)
I have tried multiple different things to get it to work:
).as("clientAddresses"),
array(
struct(
).as("clientAddresses"),
struct(
).as("clientAddresses"),
array(
).as("clientAddresses"),
collect_list(
struct(
Simply, the expected schema you want is not possible. I mean, when you have an array, it always contains an element with a given schema, which in your case is a struct. So I'd actually say that the schema you're getting is exactly what you want to achieve.