My code (to get the deal keys) is below. The schema of lt_online:
root
|-- FT/RT: string (nullable = true)
|-- Country: string (nullable = true)
|-- Charge_Type: string (nullable = true)
|-- Tariff_Loc: string (nullable = true)
|-- Charge_No: string (nullable = true)
|-- Status: string (nullable = true)
|-- Validity_from: string (nullable = true)
|-- Validity_to: string (nullable = true)
|-- Range_Basis: string (nullable = true)
|-- Limited_Parties: string (nullable = true)
|-- Charge_Detail: string (nullable = true)
|-- Freetime_Unit: string (nullable = true)
|-- Freetime: string (nullable = true)
|-- Count_Holidays: string (nullable = true)
|-- Majeure: string (nullable = true)
|-- Start_Event: string (nullable = true)
|-- Same/Next_Day: string (nullable = true)
|-- Next_Day_if_AFTER: string (nullable = true)
|-- Availability_Date: string (nullable = true)
|-- Route_Group: string (nullable = true)
|-- Route_Code: string (nullable = true)
|-- Origin: string (nullable = true)
|-- LoadZone: string (nullable = true)
|-- FDischZone: string (nullable = true)
|-- PODZone: string (nullable = true)
|-- FDestZone: string (nullable = true)
|-- Equipment_Group: string (nullable = true)
|-- Equipment_Type: string (nullable = true)
|-- Range_From: string (nullable = true)
|-- Range_To: void (nullable = true)
|-- Cargo_Type: string (nullable = true)
|-- Commodity: string (nullable = true)
|-- SC_Group: string (nullable = true)
|-- SC_Number: string (nullable = true)
|-- IMO: string (nullable = true)
|-- Shipper_Group: string (nullable = true)
|-- Cnee_Group: string (nullable = true)
|-- Direction: string (nullable = true)
|-- Service: string (nullable = true)
|-- Haulage: string (nullable = true)
|-- Transport_Type: string (nullable = true)
|-- Option1: string (nullable = true)
|-- Option2: string (nullable = true)
|-- 1st_of_Route_Group: string (nullable = true)
|-- 1st_of_LoadZone: string (nullable = true)
|-- 1st_of_FDischZone: string (nullable = true)
|-- 1st_of_PODZone: string (nullable = true)
|-- 1st_of_FDestZone: string (nullable = true)
|-- 1st_of_Equipment_Group: string (nullable = true)
|-- 1st_of_SC_Group: string (nullable = true)
|-- 1st_of_Shipper_Group: string (nullable = true)
|-- 1st_of_Cnee_Group: string (nullable = true)
PySpark code as below:
from pyspark.sql.functions import col, lit, when

df = lt_online.withColumn("dealkeys", lit('')).withColumn("dealAttributes", lit(''))
start = []
start_dict = {}
dealatt = ["Charge_No", "Status", "Validity_from", "Validity_to"]
dealkeys = ["Charge_Type", "Direction"]
for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        #final = row[i]
        start_dict[i] = row[i]
    df_deal_att = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        #key = row['Charge_No']
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
    #final_val = {"value": row['Charge_Type']}
    #start.append(final_val)
    #df3 = lt_online.withColumn("new_column", str(start))
    print(key, start_dict)
    df3 = df_deal_att.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
When I display df3, the dealAttributes and dealkeys values set in earlier iterations are blank and only the last record's values are present.
Please see the screenshot
Since the lt_online dataframe is large, I have selected only the required columns from it and worked with that reduced schema.
The problem arises because you are not changing df in place but assigning the result to df_deal_att. This updates df_deal_att (and hence df3) only for the current row of the loop, because df itself never changes during the entire process. Calling df_deal_att.show() inside the loop will help in seeing this.
Use the following code instead to get the desired output:
for index, row in lt_online.toPandas().iterrows():
    start = []
    start_dict = {}
    key = row['Charge_No']
    for i in dealatt:
        start_dict[i] = row[i]
    # ASSIGN TO df INSTEAD OF df_deal_att
    df = df.withColumn('dealkeys', when(col('Charge_No') == key, str(start_dict)).otherwise(col('dealkeys')))
    for i in dealkeys:
        final = {"keyname": i, "value": row[i], "description": ".."}
        start.append(final)
    # USE df AND ASSIGN TO df INSTEAD OF USING df_deal_att AND ASSIGNING TO df3
    df = df.withColumn('dealAttributes', when(col('Charge_No') == key, str(start)).otherwise(col('dealAttributes')))
Assigning the conditional withColumn result back to df itself (instead of to df_deal_att or df3) solves the issue, because each iteration then builds on the previous one. The following image reflects the output achieved after using the above code.
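As a side note, chaining one withColumn per row makes the Spark query plan grow with the number of rows. A possible alternative, sketched below (not part of the answer above; it assumes the same dealatt and dealkeys lists and an active spark session), builds the strings once in pandas and joins them back on Charge_No:

# Sketch: compute the per-row strings in pandas, then join back in Spark.
pdf = lt_online.toPandas()
pdf["dealkeys"] = pdf.apply(lambda r: str({k: r[k] for k in dealatt}), axis=1)
pdf["dealAttributes"] = pdf.apply(
    lambda r: str([{"keyname": k, "value": r[k], "description": ".."} for k in dealkeys]),
    axis=1,
)
lookup = spark.createDataFrame(pdf[["Charge_No", "dealkeys", "dealAttributes"]])
df = lt_online.join(lookup, on="Charge_No", how="left")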
Related
I am trying to read a huge CSV through spark.sql.
I created a dataframe from the CSV, and the dataframe seems to be created correctly.
I can read the schema and perform select and filter on it.
I would like to create a temp view so I can run the same queries with SQL, which I am more comfortable with, but the temp view seems to be created on the CSV header only.
Where am I making the mistake?
Thanks
>>> df = spark.read.options(header=True,inferSchema=True,delimiter=";").csv("./elenco_dm_tutti_csv_formato_opendata_UltimaVersione.csv")
>>> df.printSchema()
root
|-- TIPO: integer (nullable = true)
|-- PROGRESSIVO_DM_ASS: integer (nullable = true)
|-- DATA_PRIMA_PUBBLICAZIONE: string (nullable = true)
|-- DM_RIFERIMENTO: integer (nullable = true)
|-- GRUPPO_DM_SIMILI: integer (nullable = true)
|-- ISCRIZIONE_REPERTORIO: string (nullable = true)
|-- INIZIO_VALIDITA: string (nullable = true)
|-- FINE_VALIDITA: string (nullable = true)
|-- FABBRICANTE_ASSEMBLATORE: string (nullable = true)
|-- CODICE_FISCALE: string (nullable = true)
|-- PARTITA_IVA_VATNUMBER: string (nullable = true)
|-- CODICE_CATALOGO_FABBR_ASS: string (nullable = true)
|-- DENOMINAZIONE_COMMERCIALE: string (nullable = true)
|-- CLASSIFICAZIONE_CND: string (nullable = true)
|-- DESCRIZIONE_CND: string (nullable = true)
|-- DATAFINE_COMMERCIO: string (nullable = true)
>>> df.count()
1653697
>>> df.createOrReplaceTempView("mask")
>>> spark.sql("select count(*) from mask")
DataFrame[count(1): bigint]
Spark operations like sql() are lazy and do not process anything by default. You need to add an action such as .show() or .collect() to get results.
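For example, continuing the session above, an action triggers the computation and prints the result (the count should match the df.count() output earlier):
>>> spark.sql("select count(*) from mask").show()
+--------+
|count(1)|
+--------+
| 1653697|
+--------+
The view is not built on the header only; the bare sql() call simply returned an unevaluated DataFrame.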
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- index: string (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
How can I remove columns from webhooks by taking the input from a list,
e.g. filterList: List[String] = List("index","status")? Is there any way to do this by iterating over rows, so that the intermediate schema changes but not the final schema? The desired final schema:
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- index: string (nullable = false)
| |-- status: string (nullable = true)
Check the code below.
scala> df.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = true)
| |-- index: string (nullable = true)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
scala> val actualColumns = df.select(s"webhooks.*").columns
scala> val removeColumns = Seq("index","status")
scala> val webhooks = struct(actualColumns.filter(c => !removeColumns.contains(c)).map(c => col(s"webhooks.${c}")):_*).as("webhooks")
Output
scala> df.withColumn("webhooks",webhooks).printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- updated_at: string (nullable = true)
You can also look at https://stackoverflow.com/a/39943812/2204206, which can be more convenient when removing deeply nested columns.
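To take the list as input, as the question's filterList does, the same idea can be wrapped in a small helper. This is only a sketch (the name dropWebhookFields is mine, and it assumes webhooks is a top-level struct column):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}

// Rebuild webhooks keeping only the fields NOT listed in filterList.
def dropWebhookFields(df: DataFrame, filterList: List[String]): DataFrame = {
  val remaining = df.select("webhooks.*").columns
    .filterNot(filterList.contains)
    .map(c => col(s"webhooks.${c}"))
  df.withColumn("webhooks", struct(remaining: _*))
}

// usage: dropWebhookFields(df, List("index", "status"))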
I am attempting to create a nested struct array column from a dataframe during a join operation in Scala. The only thing I appear to be able to get working is an array-of-elements structure, which does not look right in the JSON output.
The current schema I am starting with is:
root
|-- memberId: integer (nullable = false)
|-- memberSubscriberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
root
|-- memberSubscriberId: integer (nullable = false)
|-- subscriberaddresstypecode: string (nullable = false)
|-- lineOne: string (nullable = false)
|-- lineTwo: string (nullable = false)
|-- lineThree: string (nullable = false)
|-- cityName: string (nullable = false)
|-- stateCode: string (nullable = false)
|-- zipCode: string (nullable = false)
|-- countyCode: string (nullable = false)
|-- countryCode: string (nullable = false)
|-- subscriberphonenumber: string (nullable = false)
|-- subscriberphoneextensionnumber: string (nullable = false)
|-- subscriberfaxnumber: string (nullable = false)
|-- subscriberfaxextensionnumber: string (nullable = false)
|-- address: string (nullable = false)
The schema I think I want to end up with:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- lineOne: string (nullable = false)
| |-- lineTwo: string (nullable = false)
| |-- lineThree: string (nullable = false)
| |-- cityName: string (nullable = false)
| |-- stateCode: string (nullable = false)
| |-- zipCode: string (nullable = false)
| |-- countyCode: string (nullable = false)
| |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- phoneNumber: string (nullable = false)
| |-- effectiveDate: null (nullable = true)
| |-- terminationDate: null (nullable = true)
| |-- isCurrent: null (nullable = true)
| |-- isActive: null (nullable = true)
| |-- telecomType: string (nullable = false)
Current code:
val clientDF: DataFrame
val addrDF: DataFrame
import spark.implicits._
val nestedAddr = addrDF.select(
$"clientSubscriberId",
array(
struct(
$"lineOne",
$"lineTwo",
$"lineThree",
$"cityName",
$"stateCode",
$"zipCode",
$"countyCode",
$"countryCode"
)
).as("clientAddresses"),
array(
struct(
$"subscriberphonenumber".alias("phoneNumber"),
//$"subscriberphoneextensionnumber"
lit(null).alias("effectiveDate"),
lit(null).alias("terminationDate"),
lit(null).alias("isCurrent"),
lit(null).alias("isActive"),
lit("home").alias("telecomType")
),
struct(
$"subscriberfaxnumber".alias("phoneNumber"),
//$"subscriberfaxextensionnumber".map(c => col(c).as("phoneNumber"))
lit(null).alias("effectiveDate"),
lit(null).alias("terminationDate"),
lit(null).alias("isCurrent"),
lit(null).alias("isActive"),
lit("fax").alias("telecomType")
)
).as("memeberPhoneNumbers")
)
val addrMbrDF = mbrDF.join(nestedAddr, Seq("clientSubscriberId"))
Resulting schema:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- lineOne: string (nullable = false)
| | |-- lineTwo: string (nullable = false)
| | |-- lineThree: string (nullable = false)
| | |-- cityName: string (nullable = false)
| | |-- stateCode: string (nullable = false)
| | |-- zipCode: string (nullable = false)
| | |-- countyCode: string (nullable = false)
| | |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- phoneNumber: string (nullable = false)
| | |-- effectiveDate: null (nullable = true)
| | |-- terminationDate: null (nullable = true)
| | |-- isCurrent: null (nullable = true)
| | |-- isActive: null (nullable = true)
| | |-- telecomType: string (nullable = false)
Expected schema:
root
|-- memberSubscriberId: integer (nullable = false)
|-- memberId: integer (nullable = false)
|-- memberIdSuffix: integer (nullable = false)
|-- memberLastName: string (nullable = false)
|-- memberFirstName: string (nullable = false)
|-- memberMiddleInitial: string (nullable = false)
|-- memberSocialSecurityNumber: string (nullable = false)
|-- memberGender: string (nullable = false)
|-- memberBirthDate: timestamp (nullable = false)
|-- memberworkphonenumber: string (nullable = false)
|-- memberworkphoneextensionnumber: string (nullable = false)
|-- membercellphone: string (nullable = false)
|-- memberAddresses: array (nullable = false)
| |-- lineOne: string (nullable = false)
| |-- lineTwo: string (nullable = false)
| |-- lineThree: string (nullable = false)
| |-- cityName: string (nullable = false)
| |-- stateCode: string (nullable = false)
| |-- zipCode: string (nullable = false)
| |-- countyCode: string (nullable = false)
| |-- countryCode: string (nullable = false)
|-- memeberPhoneNumbers: array (nullable = false)
| |-- phoneNumber: string (nullable = false)
| |-- effectiveDate: null (nullable = true)
| |-- terminationDate: null (nullable = true)
| |-- isCurrent: null (nullable = true)
| |-- isActive: null (nullable = true)
| |-- telecomType: string (nullable = false)
I have tried multiple different variants to get it to work, for example:
array(struct(...)).as("clientAddresses")
struct(...).as("clientAddresses")
array(...).as("clientAddresses")
collect_list(struct(...))
Simply put, the expected schema you want is not possible. When you have an array, it always contains elements with a given schema, which in your case is a struct. So I'd actually say that the schema you're getting is exactly what you want to achieve.
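To illustrate (a spark-shell sketch with made-up values): the element entry is only how printSchema renders an array's element type, and it never shows up in the data or in the JSON output.
scala> val df = Seq(("line1", "city1")).toDF("lineOne", "cityName")
scala> val nested = df.select(array(struct($"lineOne", $"cityName")).as("memberAddresses"))
scala> nested.toJSON.first
res0: String = {"memberAddresses":[{"lineOne":"line1","cityName":"city1"}]}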
My question is whether there are any approaches to update the schema of a DataFrame without explicitly calling SparkSession.createDataFrame(dataframe.rdd, newSchema).
Details are as follows.
I have an original Spark DataFrame with schema below:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
I applied Dataset.mapPartitions on the original DataFrame and got a new DataFrame (returned by Dataset.mapPartitions).
The reason for using Dataset.mapPartitions but not Dataset.map is better transformation speed.
In this new DataFrame, every row should have a schema like below:
root
|-- column21: string (nullable = true)
|-- column22: long (nullable = true)
|-- column23: string (nullable = true)
|-- column24: long (nullable = true)
|-- column25: struct (nullable = true)
| |-- column251: string (nullable = true)
| |-- column252: string (nullable = true)
| |-- column253: string (nullable = true)
| |-- column254: string (nullable = true)
| |-- column255: string (nullable = true)
| |-- column256: string (nullable = true)
So the schema of the new DataFrame should be the same as the above.
However, the schema of the new DataFrame is not updated automatically. The output of Dataset.printSchema on the new DataFrame is still the original:
root
|-- column11: string (nullable = true)
|-- column12: string (nullable = true)
|-- column13: string (nullable = true)
|-- column14: string (nullable = true)
|-- column15: string (nullable = true)
|-- column16: string (nullable = true)
|-- column17: string (nullable = true)
|-- column18: string (nullable = true)
|-- column19: string (nullable = true)
So, in order to get the correct (updated) schema, what I'm doing is using SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
My concern here is that falling back to RDD (newDataFrame.rdd) will hurt the transformation speed because Spark Catalyst doesn't handle RDD as well as Dataset/DataFrame.
My question is whether there are any approaches to update the schema of the new DataFrame without explicitly calling SparkSession.createDataFrame(newDataFrame.rdd, newSchema).
Thanks a lot.
You can use RowEncoder to define the schema for newDataFrame.
See the following example.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import spark.implicits._

val originalDF = spark.sparkContext.parallelize(List(("Tonny", "city1"), ("Rogger", "city2"), ("Michal", "city3"))).toDF("name", "city")
val r = scala.util.Random
val encoderForNewDF = RowEncoder(StructType(Array(
StructField("name", StringType),
StructField("num", IntegerType),
StructField("city", StringType)
)))
val newDF = originalDF.mapPartitions { partition =>
partition.map{ row =>
val name = row.getAs[String]("name")
val city = row.getAs[String]("city")
val num = r.nextInt
Row.fromSeq(Array[Any](name, num, city))
}
} (encoderForNewDF)
newDF.printSchema()
root
|-- name: string (nullable = true)
|-- num: integer (nullable = true)
|-- city: string (nullable = true)
RowEncoder in Spark SQL: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-RowEncoder.html
I would like to perform a "join" on two Spark DataFrames (Scala), but instead of a SQL-like join, I'd like to insert the "joined" row from the second DataFrame as a single nested column in the first. The reason is, ultimately, to write back out to JSON with a nested structure. I know the answer is likely already on Stack Overflow, but some searching has not turned up my answer.
Table 1
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
Table 2
root
|-- BioProject: string (nullable = true)
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- abstract: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- dbGaP: string (nullable = true)
|-- description: string (nullable = true)
|-- external_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- submitter_id: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- namespace: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
The join is on table1.study_accession with table2.accession. The result is below. Note the new column called study, which contains record equivalents of the rows from Table 2.
root
|-- Insdc: string (nullable = true)
|-- LastMetaUpdate: string (nullable = true)
|-- LastUpdate: string (nullable = true)
|-- Published: string (nullable = true)
|-- Received: string (nullable = true)
|-- ReplacedBy: string (nullable = true)
|-- Status: string (nullable = true)
|-- Type: string (nullable = true)
|-- accession: string (nullable = true)
|-- alias: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- value: string (nullable = true)
|-- center_name: string (nullable = true)
|-- design_description: string (nullable = true)
|-- geo_accession: string (nullable = true)
|-- instrument_model: string (nullable = true)
|-- library_construction_protocol: string (nullable = true)
|-- library_name: string (nullable = true)
|-- library_selection: string (nullable = true)
|-- library_source: string (nullable = true)
|-- library_strategy: string (nullable = true)
|-- paired: boolean (nullable = true)
|-- platform: string (nullable = true)
|-- read_spec: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- base_coord: long (nullable = true)
| | |-- read_class: string (nullable = true)
| | |-- read_index: long (nullable = true)
| | |-- read_type: string (nullable = true)
|-- sample_accession: string (nullable = true)
|-- spot_length: long (nullable = true)
|-- study_accession: string (nullable = true)
|-- tags: array (nullable = true)
| |-- element: string (containsNull = true)
|-- title: string (nullable = true)
|-- accession: string (nullable = true)
|-- study: struct (nullable = true)
| |-- BioProject: string (nullable = true)
| |-- Insdc: string (nullable = true)
| |-- LastMetaUpdate: string (nullable = true)
| |-- LastUpdate: string (nullable = true)
| |-- Published: string (nullable = true)
| |-- Received: string (nullable = true)
| |-- ReplacedBy: string (nullable = true)
| |-- Status: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- abstract: string (nullable = true)
| |-- accession: string (nullable = true)
| |-- alias: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tag: string (nullable = true)
| | | |-- value: string (nullable = true)
| |-- dbGaP: string (nullable = true)
| |-- description: string (nullable = true)
| |-- external_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- submitter_id: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- namespace: string (nullable = true)
| |-- tags: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- title: string (nullable = true)
From my understanding of your question, let's say you have two dataframes:
df1
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
and
df2
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: double (nullable = false)
You will have to combine all the columns of df2 into a struct column, then select the column to be joined on along with the struct column. Here I am taking col1 as the joining column:
import org.apache.spark.sql.functions._
val nestedDF2 = df2.select($"col1", struct(df2.columns.map(col):_*).as("nested_df2"))
Then the final step is to join (the default here is an inner join):
df1.join(nestedDF2, Seq("col1"))
which should give you
root
|-- col1: string (nullable = true)
|-- col2: integer (nullable = false)
|-- col3: double (nullable = false)
|-- nested_df2: struct (nullable = false)
| |-- col1: string (nullable = true)
| |-- col2: string (nullable = true)
| |-- col3: double (nullable = false)
I hope the answer is helpful.