root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- index: string (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
How can I remove columns from the webhooks struct, taking the names to drop as input from a list,
e.g. filterList: List[String] = List("index","status")? Is there a way to do this by iterating over the fields, so that only the intermediate schema changes and not the final schema?
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- index: string (nullable = false)
| |-- status: string (nullable = true)
Check the code below.
scala> df.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = true)
| |-- index: string (nullable = true)
| |-- failed_at: string (nullable = true)
| |-- status: string (nullable = true)
| |-- updated_at: string (nullable = true)
scala> val actualColumns = df.select(s"webhooks.*").columns
scala> val removeColumns = Seq("index","status")
scala> val webhooks = struct(actualColumns.filter(c => !removeColumns.contains(c)).map(c => col(s"webhooks.${c}")):_*).as("webhooks")
Output
scala> df.withColumn("webhooks",webhooks).printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: struct (nullable = false)
| |-- failed_at: string (nullable = true)
| |-- updated_at: string (nullable = true)
You can also look at https://stackoverflow.com/a/39943812/2204206, which can be more convenient when removing deeply nested columns.
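If you want to take the list of fields to drop as an input, here is a minimal sketch that wraps the same approach into a reusable helper (dropStructFields is a hypothetical name, not a Spark API):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct}

// Rebuild `structCol`, keeping only the nested fields that are not in `filterList`.
def dropStructFields(df: DataFrame, structCol: String, filterList: List[String]): DataFrame = {
  val keep = df.select(s"$structCol.*").columns
    .filterNot(filterList.contains)
    .map(c => col(s"$structCol.$c"))
  df.withColumn(structCol, struct(keep: _*))
}

For example, dropStructFields(df, "webhooks", List("index", "status")).printSchema should print the schema shown above.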
Hi, I have a schema coming in as follows:
|-- eventObject: struct (nullable = true)
| |-- baseDivisionCode: string (nullable = true)
| |-- countryCode: string (nullable = true)
| |-- dcNumber: long (nullable = true)
| |-- financialReportingGroup: string (nullable = true)
| |-- itemList: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- availabletosellQty: long (nullable = true)
| | | |-- distroAvailableQty: long (nullable = true)
| | | |-- itemNumber: long (nullable = true)
| | | |-- itemUPC: string (nullable = true)
| | | |-- ossIndicator: string (nullable = true)
| | | |-- turnAvailableQty: long (nullable = true)
| | | |-- unitOfMeasurement: string (nullable = true)
| | | |-- weightFormatType: string (nullable = true)
| | | |-- whpkRatio: long (nullable = true)
To map this I have created the following schema type:
|-- eventObject: struct (nullable = true)
| |-- baseDivisionCode: string (nullable = true)
| |-- countryCode: string (nullable = true)
| |-- dcNumber: integer (nullable = true)
| |-- financialReportingGroup: string (nullable = true)
| |-- itemList: struct (nullable = true)
| | |-- availabletosellQty: long (nullable = true)
| | |-- distroAvailableQty: long (nullable = true)
| | |-- itemNumber: long (nullable = true)
| | |-- itemUPC: string (nullable = true)
| | |-- ossIndicator: string (nullable = true)
| | |-- turnAvailableQty: long (nullable = true)
| | |-- unitOfMeasurement: string (nullable = true)
| | |-- weightFormatType: string (nullable = true)
| | |-- whpkRatio: long (nullable = true)
by writing something like this
val testSchema = new StructType()
.add("eventObject", new StructType()
.add("baseDivisionCode", StringType)
.add("countryCode",StringType)
.add("dcNumber", IntegerType)
.add("financialReportingGroup",StringType)
.add("itemList",new StructType(
Array(
StructField("availabletosellQty",LongType),
StructField("distroAvailableQty",LongType),
StructField("itemNumber", LongType),
StructField("itemUPC", StringType),
StructField("ossIndicator",StringType),
StructField("turnAvailableQty",LongType),
StructField("unitOfMeasurement",StringType),
StructField("weightFormatType",StringType),
StructField("whpkRatio",LongType)))))
But it is not matching the schema that I am receiving. What am I doing wrong here?
I am getting null values when I try to populate it with some data...
|-- eventObject: struct (nullable = true)
| |-- baseDivisionCode: string (nullable = true)
| |-- countryCode: string (nullable = true)
| |-- dcNumber: long (nullable = true)
| |-- financialReportingGroup: string (nullable = true)
| |-- itemList: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- itemNumber: long (nullable = true)
| | | |-- itemUPC: string (nullable = true)
| | | |-- unitOfMeasurement: string (nullable = true)
| | | |-- availabletosellQty: long (nullable = true)
| | | |-- turnAvailableQty: long (nullable = true)
| | | |-- distroAvailableQty: long (nullable = true)
| | | |-- ossIndicator: string (nullable = true)
| | | |-- weightFormatType: string (nullable = true)
|-- kafka_timestamp: timestamp (nullable = true)
After flattening eventObject one level (with $"eventObject.*") I get:
|-- baseDivisionCode: string (nullable = true)
|-- countryCode: string (nullable = true)
|-- dcNumber: long (nullable = true)
|-- financialReportingGroup: string (nullable = true)
|-- itemList: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemNumber: long (nullable = true)
| | |-- itemUPC: string (nullable = true)
| | |-- unitOfMeasurement: string (nullable = true)
| | |-- availabletosellQty: long (nullable = true)
| | |-- turnAvailableQty: long (nullable = true)
| | |-- distroAvailableQty: long (nullable = true)
| | |-- ossIndicator: string (nullable = true)
| | |-- weightFormatType: string (nullable = true)
|-- kafka_timestamp: timestamp (nullable = true)
When I further try to flatten it, it errors out because of the array:
"Exception in thread "main" org.apache.spark.sql.AnalysisException: Can only star expand struct data types. Attribute: ArrayBuffer(itemList);"
I am trying to get it to:
|-- facilityCountryCode: string (nullable = true)
|-- facilityNum: string (nullable = true)
|-- WMT_CorrelationId: string (nullable = true)
|-- WMT_IdempotencyKey: string (nullable = true)
|-- WMT_Timestamp: string (nullable = true)
|-- countryCode: string (nullable = true)
|-- dcNumber: integer (nullable = true)
|-- financialReportingGroup: string (nullable = true)
|-- baseDivisionCode: string (nullable = true)
|-- itemNumber: integer (nullable = true)
|-- itemUPC: string (nullable = true)
|-- unitOfMeasurement: string (nullable = true)
|-- availabletosellQty: integer (nullable = true)
|-- turnAvailableQty: integer (nullable = true)
|-- distroAvailableQty: integer (nullable = true)
|-- ossIndicator: string (nullable = true)
|-- weightFormatType: string (nullable = true)
|-- kafka_timestamp: timestamp (nullable = true)
|-- year-month-day: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- hour: integer (nullable = true)
This is what I did:
val testParsed = TestExploded.select($"exploded.*", $"kafka_timestamp")
val testFlattened = testParsed.select($"eventObject.*", $"kafka_timestamp")
val test_flattened_further = testFlattened.select($"countryCode",
  $"dcNumber", $"financialReportingGroup", $"baseDivisionCode", $"itemList.*", $"kafka_timestamp") // the $"itemList.*" star-expand is what fails
Use ArrayType to specify array type:
val testSchema = new StructType()
  .add("eventObject", new StructType()
    .add("baseDivisionCode", StringType)
    .add("countryCode", StringType)
    .add("dcNumber", LongType)
    .add("financialReportingGroup", StringType)
    .add("itemList", new ArrayType(
      new StructType(
        Array(
          StructField("itemNumber", LongType),
          StructField("itemUPC", StringType),
          StructField("unitOfMeasurement", StringType),
          StructField("availabletosellQty", LongType),
          StructField("turnAvailableQty", LongType),
          StructField("distroAvailableQty", LongType),
          StructField("ossIndicator", StringType),
          StructField("weightFormatType", StringType))),
      containsNull = true)))
To fully flatten the DataFrame, explode the array of structs and move the struct fields into top-level columns with the select("structColName.*") syntax, as follows:
df
.select("eventObject.*")
.select(
col("baseDivisionCode"),
col("countryCode"),
col("dcNumber"),
col("financialReportingGroup"),
explode(col("itemList")).as("explodedItemList"))
.select(
col("baseDivisionCode"),
col("countryCode"),
col("dcNumber"),
col("financialReportingGroup"),
col("explodedItemList.*")
)
.printSchema()
Will output:
root
|-- baseDivisionCode: string (nullable = true)
|-- countryCode: string (nullable = true)
|-- dcNumber: long (nullable = true)
|-- financialReportingGroup: string (nullable = true)
|-- itemNumber: long (nullable = true)
|-- itemUPC: string (nullable = true)
|-- unitOfMeasurement: string (nullable = true)
|-- availabletosellQty: long (nullable = true)
|-- turnAvailableQty: long (nullable = true)
|-- distroAvailableQty: long (nullable = true)
|-- ossIndicator: string (nullable = true)
|-- weightFormatType: string (nullable = true)
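If you also need to keep columns that sit next to eventObject, such as kafka_timestamp in your target schema, you can carry them through each step. A minimal sketch, assuming df has the incoming schema shown in the question:

import org.apache.spark.sql.functions.{col, explode}

df
  .select(col("eventObject.*"), col("kafka_timestamp"))
  // one generator (explode) per select, alongside all existing columns
  .select(col("*"), explode(col("itemList")).as("item"))
  .drop("itemList")
  // expand the exploded struct into top-level columns, then drop the struct itself
  .select(col("*"), col("item.*"))
  .drop("item")
  .printSchema()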
I have two DataFrames (A and B). A has a nested (struct) schema whereas B has a flat schema, as shown below, and I want to append B's columns into A's package struct to produce C.
A:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
B:
root
|-- globalPackageId: long (nullable = true)
|-- order_id: long (nullable = true)
|-- order_address: string (nullable = true)
|-- order_number: integer (nullable = true)
C:
root
|-- package: struct (nullable = true)
| |-- globalPackageId: long (nullable = true)
| |-- naPackageId: string (nullable = true)
| |-- packageName: string (nullable = true)
| |-- order_id: long (nullable = true)
| |-- order_address: string (nullable = true)
| |-- order_number: integer (nullable = true)
|-- supplies: struct (nullable = true)
| |-- supplyMask: integer (nullable = true)
| |-- supplyIds: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- timestampDetails: struct (nullable = true)
| |-- packageTimestamp: string (nullable = true)
| |-- onboardTimestamp: string (nullable = true)
I am struggling with .withColumn(struct("xxx"), "xxx"), but the result is still not what I expect.
Do you have any experience with this?
Thanks,
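A minimal sketch, assuming A and B are the two DataFrames above: join on the key nested inside package, rebuild the package struct with B's columns appended, and drop the flat copies afterwards. The column names are taken from the schemas you posted.

import org.apache.spark.sql.functions.{col, struct}

// Join B onto A using the key nested inside the package struct.
val joined = A.join(B, A.col("package.globalPackageId") === B.col("globalPackageId"), "left")

// Rebuild package with B's columns appended, then drop the top-level copies from B.
val C = joined
  .withColumn("package", struct(
    col("package.globalPackageId"),
    col("package.naPackageId"),
    col("package.packageName"),
    col("order_id"),
    col("order_address"),
    col("order_number")))
  .drop("globalPackageId", "order_id", "order_address", "order_number")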
root
|-- channelGrouping: string (nullable = true)
|-- clientId: string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: long (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: integer (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)
val structCastExpression1 = df.schema
  .filter(_.dataType.isInstanceOf[StructType])
  .map(c => (c.name, c.dataType.asInstanceOf[StructType].map(_.name)))
  .map { case (col, sub) =>
    s"""cast($col as struct${sub.map(c => s"$c:string").mkString("<", ",", ">")}) as $col"""
  }
// List(cast(s1 as struct<x:string,y:string>) as s1,
//      cast(s2 as struct<u:string,v:string>) as s2)

val otherColumns = df.schema
  .filterNot(_.dataType.isInstanceOf[StructType])
  .map(c => s""" cast(${c.name} as string) as ${c.name} """)
// List(" cast(id as string) as id ", " cast(d as string) as d ")

// original columns
val originalColumns = df.columns

// Union both expressions into one big expression
val finalExpression = otherColumns.union(structCastExpression1)
// List(" cast(id as string) as id ",
//      " cast(d as string) as d ",
//      cast(s1 as struct<x:string,y:string>) as s1,
//      cast(s2 as struct<u:string,v:string>) as s2)

// Use `selectExpr` to pass the expression
df.selectExpr(finalExpression: _*)
  .select(originalColumns.head, originalColumns.tail: _*)
  .printSchema
After using this I get:
root
|-- channelGrouping: string (nullable = true)
|-- clientId: string (nullable = true)
|-- customDimensions: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: string (nullable = true)
| |-- javaEnabled: string (nullable = true)
| |-- language: string (nullable = true)
The expected output is:
root
|-- channelGrouping: string (nullable = true)
|-- clientId: string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: string (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)
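One way to cast only the leaf fields while keeping the array and struct shapes is to cast to an explicit complex type in selectExpr. A minimal sketch, assuming the input schema above; the field lists are illustrative and a struct cast must list every field of the source struct, in order:

// Cast customDimensions element fields to string but keep the array<struct> shape;
// for device, list boolean fields as boolean so they are left untouched.
df.selectExpr(
  "channelGrouping",
  "clientId",
  "cast(customDimensions as array<struct<index:string,value:string>>) as customDimensions",
  "date",
  "cast(device as struct<browser:string,browserSize:string,browserVersion:string,deviceCategory:string,flashVersion:string,isMobile:boolean,javaEnabled:boolean>) as device")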
I have a DataFrame with the following schema:
scala> final_df.printSchema
root
|-- mstr_prov_id: string (nullable = true)
|-- prov_ctgry_cd: string (nullable = true)
|-- prov_orgnl_efctv_dt: timestamp (nullable = true)
|-- prov_trmntn_dt: timestamp (nullable = true)
|-- prov_trmntn_rsn_cd: string (nullable = true)
|-- npi_rqrd_ind: string (nullable = true)
|-- prov_stts_aray_txt: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- PROV_STTS_KEY: string (nullable = true)
| | |-- PROV_STTS_EFCTV_DT: timestamp (nullable = true)
| | |-- PROV_STTS_CD: string (nullable = true)
| | |-- PROV_STTS_TRMNTN_DT: timestamp (nullable = true)
| | |-- PROV_STTS_TRMNTN_RSN_CD: string (nullable = true)
I am running the following code to do basic cleansing, but it is not working inside "prov_stts_aray_txt": it does not go inside the array type and perform the desired transformation. I want to iterate through all fields in the DataFrame (flat and nested) and perform a basic transformation.
for(dt <- final_df.dtypes){
final_df = final_df.withColumn(dt._1,when(upper(trim(col(dt._1))) === "NULL",lit(" ")).otherwise(col(dt._1)))
}
Please help.
Thanks
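A minimal sketch, assuming Spark 2.4+ (for the transform higher-order function) and the final_df shown above: cleanse the top-level string columns as in your loop, then rebuild each element of the array column explicitly. The field names mirror the schema you posted.

import org.apache.spark.sql.functions.{col, expr, lit, trim, upper, when}

// Apply the NULL-literal cleanup only to top-level string columns.
val cleanedTop = final_df.dtypes
  .filter { case (_, dt) => dt == "StringType" }
  .foldLeft(final_df) { case (df, (c, _)) =>
    df.withColumn(c, when(upper(trim(col(c))) === "NULL", lit(" ")).otherwise(col(c)))
  }

// Rebuild every element of the array<struct> column, cleansing its string fields.
val cleaned = cleanedTop.withColumn(
  "prov_stts_aray_txt",
  expr("""
    transform(prov_stts_aray_txt, x -> named_struct(
      'PROV_STTS_KEY',
        CASE WHEN upper(trim(x.PROV_STTS_KEY)) = 'NULL' THEN ' ' ELSE x.PROV_STTS_KEY END,
      'PROV_STTS_EFCTV_DT', x.PROV_STTS_EFCTV_DT,
      'PROV_STTS_CD',
        CASE WHEN upper(trim(x.PROV_STTS_CD)) = 'NULL' THEN ' ' ELSE x.PROV_STTS_CD END,
      'PROV_STTS_TRMNTN_DT', x.PROV_STTS_TRMNTN_DT,
      'PROV_STTS_TRMNTN_RSN_CD',
        CASE WHEN upper(trim(x.PROV_STTS_TRMNTN_RSN_CD)) = 'NULL' THEN ' ' ELSE x.PROV_STTS_TRMNTN_RSN_CD END
    ))
  """))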