I am trying some logic to return an empty column if a column does not exist in the dataframe. The schema changes very frequently; sometimes the whole struct is missing (temp1), or an array inside the struct is missing (suffix).
Schema looks like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp1: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- code1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- suffix: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
Or like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
When I try the below logic on the second schema, I get an exception that the struct is not found:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import scala.util.Try

def has_Column(df: DataFrame, path: String) = Try(df(path)).isSuccess
df.withColumn("id", col("id")).
withColumn("tempLn", explode(col("temp"))).
withColumn("temp1_code1", when(lit(has_Column(df, "tempLn.temp1.code1")), concat_ws(" ",col("tempLn.temp1.code1"))).otherwise(lit("").cast("string"))).
withColumn("temp2_suffix", when(lit(has_Column(df, "tempLn.temp2.suffix")), concat_ws(" ",col("tempLn.temp2.suffix"))).otherwise(lit("").cast("string")))
Error:
org.apache.spark.sql.AnalysisException: No such struct field temp1;
You need to do the existence check outside the select/withColumn... methods. Because you reference the column in the then part of the case when expression, Spark tries to resolve it during the analysis of the query.
So you'll need to test like this:
if (has_Column(df, "tempLn.temp1.code1"))
  df.withColumn("temp1_code1", concat_ws(" ", col("tempLn.temp1.code1")))
else
  df.withColumn("temp1_code1", lit(""))
To do it for multiple columns you can use foldLeft like this:
// df is assumed to already contain the exploded tempLn column
val df1 = Seq(
  ("tempLn.temp1.code1", "temp1_code1"),
  ("tempLn.temp2.suffix", "temp2_suffix")
).foldLeft(df) { case (acc, (field, newCol)) =>
  if (has_Column(acc, field))
    acc.withColumn(newCol, concat_ws(" ", col(field)))
  else
    acc.withColumn(newCol, lit(""))
}
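Note that has_Column works by letting Spark's analyzer try to resolve the path. If you'd rather not trigger analysis at all, you can walk the schema instead. A minimal sketch, assuming dot-separated paths and that arrays should be traversed transparently (hasField is a hypothetical helper, not a Spark API):

import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

def hasField(schema: StructType, path: String): Boolean = {
  // Recursively descend the schema, consuming one path segment per struct
  // level; arrays are stepped through without consuming a segment.
  def step(dt: DataType, parts: List[String]): Boolean = (dt, parts) match {
    case (_, Nil)                => true
    case (s: StructType, h :: t) => s.fields.find(_.name == h).exists(f => step(f.dataType, t))
    case (a: ArrayType, p)       => step(a.elementType, p)
    case _                       => false
  }
  step(schema, path.split('.').toList)
}

// hasField(df.schema, "temp.temp1.code1") is true for the first schema
// above and false for the second.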
I have a dataframe like below:
id contact_persons
-----------------------
1 [[abc, abc#xyz.com, 896676, manager],[pqr, pqr#xyz.com, 89809043, director],[stu, stu#xyz.com, 09909343, programmer]]
The schema looks like this:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I need to convert this dataframe to match the below schema:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- emails: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- phone: string (nullable = true)
| | |-- roles: string (nullable = true)
I know there is a struct function in PySpark, but in this scenario I don't know how to use it, as the array is dynamically sized.
You can use the TRANSFORM expression to cast it:
import pyspark.sql.functions as f
df = spark.createDataFrame([
[1, [['abc', 'abc#xyz.com', '896676', 'manager'],
['pqr', 'pqr#xyz.com', '89809043', 'director'],
['stu', 'stu#xyz.com', '09909343', 'programmer']]]
], schema='id string, contact_persons array<array<string>>')
expression = 'TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))'
output_df = df.withColumn('contact_persons', f.expr(expression))
# output_df.printSchema()
# root
# |-- id: string (nullable = true)
# |-- contact_persons: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- name: string (nullable = true)
# | | |-- emails: string (nullable = true)
# | | |-- phone: string (nullable = true)
# | | |-- roles: string (nullable = true)
output_df.show(truncate=False)
+---+-----------------------------------------------------------------------------------------------------------------------+
|id |contact_persons |
+---+-----------------------------------------------------------------------------------------------------------------------+
|1 |[{abc, abc#xyz.com, 896676, manager}, {pqr, pqr#xyz.com, 89809043, director}, {stu, stu#xyz.com, 09909343, programmer}]|
+---+-----------------------------------------------------------------------------------------------------------------------+
I have a PySpark dataframe loaded from a 3 GB json.gz file, with the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- articleID: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- author: string (nullable = true)
| | |-- source: string (nullable = true)
I need to drop the title, author, and date fields, or create a new DataFrame that does not include these fields.
So far I've managed to get the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- articleID: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- source: array (nullable = true)
| | | |-- element: string (containsNull = true)
using
df.select(df._id, df.quote,
array(
struct(
col("occurrences.articleID"),
col("occurrences.source")
)
).alias("occurrences"))
But I need a way to keep articleIDs and sources together in the same struct. How can I do this?
Okay, I found something that works:
from pyspark.sql.functions import col, collect_set, explode, struct

clean_df = (
    df.withColumn("exploded", explode("occurrences"))
      .drop("occurrences")
      .select(
          col("_id"),
          col("quote"),
          col("exploded.articleID").alias("articleID"),
          col("exploded.source").alias("source"),
      )
      .withColumn("occs", struct(col("articleID"), col("source")))
      .groupBy("_id", "quote")
      .agg(collect_set("occs").alias("occurrences"))
)
But if anyone has a better solution, I'd love to hear it, since this seems very round-about. (And as a side note, collect_set only seems to work with Java 8.)
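If you are on Spark 2.4+, a higher-order TRANSFORM rebuilds the structs in place and avoids the explode/groupBy round trip entirely. A sketch (written in Scala; since the logic lives in the SQL expression string, the same string works from PySpark via f.expr):

import org.apache.spark.sql.functions.expr

// Rebuild each struct inside the array, keeping only articleID and source;
// row count and array order are untouched.
val cleanDf = df.withColumn(
  "occurrences",
  expr("transform(occurrences, o -> struct(o.articleID as articleID, o.source as source))")
)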
I have one dataframe with this schema:
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
data:
+-----------+-----------+--------------------------------------------------+
|Activity_A1|Activity_A2|Details |
+-----------+-----------+--------------------------------------------------+
|Act1_Attr1 |Act1_Attr2 |[[Agr2_Attr1,Agr2_Attr2], [Agr1_Attr1,Agr1_Attr2]]|
|Act2_Attr1 |Act2_Attr2 |[[Agr4_Attr1,Agr4_Attr2], [Agr3_Attr1,Agr3_Attr2]]|
|Act3_Attr1 |Act3_Attr2 |[[Agr5_Attr1,Agr5_Attr2]] |
+-----------+-----------+--------------------------------------------------+
And the second one with this schema:
|-- Agreement_A1: string (nullable = true)
|-- Lines: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Line_A1: string (nullable = true)
| | |-- Line_A2: string (nullable = true)
How can I join these two dataframes on the Agreement_A1 column, so that the schema of the new dataframe looks like this:
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
| | |-- Lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Line_A1: string (nullable = true)
| | | | |-- Line_A2: string (nullable = true)
Hope this helps. You need to unnest (explode) "Details" and join with your second dataframe on "Agreement_A1". Then structure your columns as desired.
scala> df1.show(false)
+-----------+-----------+----------------------------------------------------+
|Activity_A1|Activity_A2|Details |
+-----------+-----------+----------------------------------------------------+
|Act1_Attr1 |Act1_Attr2 |[[Agr2_Attr1, Agr2_Attr2], [Agr1_Attr1, Agr1_Attr2]]|
|Act2_Attr1 |Act2_Attr2 |[[Agr4_Attr1, Agr4_Attr2], [Agr3_Attr1, Agr3_Attr2]]|
|Act3_Attr1 |Act3_Attr2 |[[Agr5_Attr1, Agr5_Attr2]] |
+-----------+-----------+----------------------------------------------------+
scala> df1.printSchema
root
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
scala> df2.show(false)
+------------+--------------------------+
|Agreement_A1|Lines |
+------------+--------------------------+
|Agr1_Attr1 |[[A1At1Line1, A1At1Line2]]|
|Agr3_Attr1 |[[A3At1Line1, A3At1Line2]]|
|Agr4_Attr1 |[[A4At1Line1, A4At1Line2]]|
|Agr5_Attr1 |[[A5At1Line1, A5At1Line2]]|
|Agr6_Attr1 |[[A6At1Line1, A6At1Line2]]|
+------------+--------------------------+
scala> df2.printSchema
root
|-- Agreement_A1: string (nullable = true)
|-- Lines: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Line_A1: string (nullable = true)
| | |-- Line_A2: string (nullable = true)
scala> val outputDF = df1.withColumn("DetailsExploded", explode($"Details")).join(
| df2, $"DetailsExploded.Agreement_A1" === $"Agreement_A1").withColumn(
| "DetailsWithAgreementA1Lines", struct($"DetailsExploded.Agreement_A1" as "Agreement_A1", $"DetailsExploded.Agreement_A2" as "Agreement_A2", $"Lines"))
outputDF: org.apache.spark.sql.DataFrame = [Activity_A1: string, Activity_A2: string ... 5 more fields]
scala> outputDF.show(false)
+-----------+-----------+----------------------------------------------------+------------------------+------------+--------------------------+----------------------------------------------------+
|Activity_A1|Activity_A2|Details |DetailsExploded |Agreement_A1|Lines |DetailsWithAgreementA1Lines |
+-----------+-----------+----------------------------------------------------+------------------------+------------+--------------------------+----------------------------------------------------+
|Act1_Attr1 |Act1_Attr2 |[[Agr2_Attr1, Agr2_Attr2], [Agr1_Attr1, Agr1_Attr2]]|[Agr1_Attr1, Agr1_Attr2]|Agr1_Attr1 |[[A1At1Line1, A1At1Line2]]|[Agr1_Attr1, Agr1_Attr2, [[A1At1Line1, A1At1Line2]]]|
|Act2_Attr1 |Act2_Attr2 |[[Agr4_Attr1, Agr4_Attr2], [Agr3_Attr1, Agr3_Attr2]]|[Agr3_Attr1, Agr3_Attr2]|Agr3_Attr1 |[[A3At1Line1, A3At1Line2]]|[Agr3_Attr1, Agr3_Attr2, [[A3At1Line1, A3At1Line2]]]|
|Act2_Attr1 |Act2_Attr2 |[[Agr4_Attr1, Agr4_Attr2], [Agr3_Attr1, Agr3_Attr2]]|[Agr4_Attr1, Agr4_Attr2]|Agr4_Attr1 |[[A4At1Line1, A4At1Line2]]|[Agr4_Attr1, Agr4_Attr2, [[A4At1Line1, A4At1Line2]]]|
|Act3_Attr1 |Act3_Attr2 |[[Agr5_Attr1, Agr5_Attr2]] |[Agr5_Attr1, Agr5_Attr2]|Agr5_Attr1 |[[A5At1Line1, A5At1Line2]]|[Agr5_Attr1, Agr5_Attr2, [[A5At1Line1, A5At1Line2]]]|
+-----------+-----------+----------------------------------------------------+------------------------+------------+--------------------------+----------------------------------------------------+
scala> outputDF.printSchema
root
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
|-- DetailsExploded: struct (nullable = true)
| |-- Agreement_A1: string (nullable = true)
| |-- Agreement_A2: string (nullable = true)
|-- Agreement_A1: string (nullable = true)
|-- Lines: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Line_A1: string (nullable = true)
| | |-- Line_A2: string (nullable = true)
|-- DetailsWithAgreementA1Lines: struct (nullable = false)
| |-- Agreement_A1: string (nullable = true)
| |-- Agreement_A2: string (nullable = true)
| |-- Lines: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Line_A1: string (nullable = true)
| | | |-- Line_A2: string (nullable = true)
scala> outputDF.groupBy("Activity_A1", "Activity_A2").agg(collect_list($"DetailsWithAgreementA1Lines") as "Details").show(false)
+-----------+-----------+------------------------------------------------------------------------------------------------------------+
|Activity_A1|Activity_A2|Details |
+-----------+-----------+------------------------------------------------------------------------------------------------------------+
|Act1_Attr1 |Act1_Attr2 |[[Agr1_Attr1, Agr1_Attr2, [[A1At1Line1, A1At1Line2]]]] |
|Act2_Attr1 |Act2_Attr2 |[[Agr3_Attr1, Agr3_Attr2, [[A3At1Line1, A3At1Line2]]], [Agr4_Attr1, Agr4_Attr2, [[A4At1Line1, A4At1Line2]]]]|
|Act3_Attr1 |Act3_Attr2 |[[Agr5_Attr1, Agr5_Attr2, [[A5At1Line1, A5At1Line2]]]] |
+-----------+-----------+------------------------------------------------------------------------------------------------------------+
scala> outputDF.groupBy("Activity_A1", "Activity_A2").agg(collect_list($"DetailsWithAgreementA1Lines") as "Details").printSchema
root
|-- Activity_A1: string (nullable = true)
|-- Activity_A2: string (nullable = true)
|-- Details: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Agreement_A1: string (nullable = true)
| | |-- Agreement_A2: string (nullable = true)
| | |-- Lines: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Line_A1: string (nullable = true)
| | | | |-- Line_A2: string (nullable = true)
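For reference, the REPL session above condenses to a single pipeline (a sketch, assuming the df1/df2 shapes shown and that the intermediate columns are not needed):

import org.apache.spark.sql.functions.{collect_list, explode, struct}

// assumes spark.implicits._ is in scope for the $ syntax
val result = df1
  .withColumn("d", explode($"Details"))
  .join(df2, $"d.Agreement_A1" === df2("Agreement_A1"))
  .groupBy("Activity_A1", "Activity_A2")
  .agg(collect_list(struct(
    $"d.Agreement_A1" as "Agreement_A1",
    $"d.Agreement_A2" as "Agreement_A2",
    $"Lines"
  )) as "Details")

Note this is an inner join, so Details entries with no match in df2 (Agr2_Attr1 above) are dropped; use a left join and handle null Lines if they must be kept.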
I have the below 2 DataFrame schemas. Inside USER_INFO, modules is an array, and content is an array nested inside modules. I want to join/attach some additional data (METADATA) to each content element, such that
USER_INFO.modules.content.id = METADATA.cust_id
What would be a solution?
USER_INFO
root
|-- userId: string (nullable = true)
|-- modules: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- content: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- distance: double (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- impressionId: string (nullable = true)
| | |-- id: string (nullable = true)
METADATA
root
|-- cust_id: string (nullable = true)
|-- image_url: string (nullable = true)
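One way to approach this is the same explode/join/regroup pattern as above, applied twice because there are two nested array levels. A sketch, assuming the frames are named userInfo and metadata and that content elements without metadata should be kept:

import org.apache.spark.sql.functions.{collect_list, explode, struct}

// assumes spark.implicits._ is in scope for the $ syntax

// Flatten to one row per content element, then attach image_url by id.
val flat = userInfo
  .withColumn("module", explode($"modules"))
  .withColumn("c", explode($"module.content"))
  .join(metadata, $"c.id" === metadata("cust_id"), "left")

// Regroup in two steps: rebuild each module's content array, then the
// modules array. Caveat: explode drops rows whose arrays are null or empty;
// use explode_outer if those users/modules must be preserved.
val rebuilt = flat
  .groupBy($"userId", $"module.id" as "module_id")
  .agg(collect_list(struct(
    $"c.distance" as "distance",
    $"c.id" as "id",
    $"c.impressionId" as "impressionId",
    $"image_url"
  )) as "content")
  .groupBy($"userId")
  .agg(collect_list(struct($"content", $"module_id" as "id")) as "modules")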
How can I get all individual elements from MEMBERDETAIL?
scala> xmlDF.printSchema
root
|-- MEMBERDETAIL: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FILE_ID: double (nullable = true)
| | |-- INP_SOURCE_ID: long (nullable = true)
| | |-- NET_DB_CR_SW: string (nullable = true)
| | |-- NET_PYM_AMT: string (nullable = true)
| | |-- ORGNTD_DB_CR_SW: string (nullable = true)
| | |-- ORGNTD_PYM_AMT: double (nullable = true)
| | |-- RCVD_DB_CR_SW: string (nullable = true)
| | |-- RCVD_PYM_AMT: string (nullable = true)
| | |-- RECON_DATE: string (nullable = true)
| | |-- SLNO: long (nullable = true)
scala> xmlDF.head
res147: org.apache.spark.sql.Row = [WrappedArray([1.1610100000001425E22,1,D, 94,842.38,C,0.0,D, 94,842.38,2016-10-10,1], [1.1610100000001425E22,1,D, 33,169.84,C,0.0,D, 33,169.84,2016-10-10,2], [1.1610110000001425E22,1,D, 155,500.88,C,0.0,D, 155,500.88,2016-10-11,3], [1.1610110000001425E22,1,D, 164,952.29,C,0.0,D, 164,952.29,2016-10-11,4], [1.1610110000001425E22,1,D, 203,061.06,C,0.0,D, 203,061.06,2016-10-11,5], [1.1610110000001425E22,1,D, 104,040.01,C,0.0,D, 104,040.01,2016-10-11,6], [2.1610110000001427E22,1,C, 849.14,C,849.14,C, 0.00,2016-10-11,7], [1.1610100000001465E22,1,D, 3.78,C,0.0,D, 3.78,2016-10-10,1], [1.1610100000001465E22,1,D, 261.54,C,0.0,D, ...
After trying many ways, I am able to get just an "Any" object like below, but still not able to read all the fields separately.
xmlDF.select($"MEMBERDETAIL".getItem(0)).head().get(0)
res56: Any = [1.1610100000001425E22,1,D,94,842.38,C,0.0,D,94,842.38,2016-10-10,1]
And the StructType is like below -
res61: org.apache.spark.sql.DataFrame = [MEMBERDETAIL[0]: struct<FILE_ID:double,INP_SOURCE_ID:bigint,NET_DB_CR_SW:string,NET_PYM_AMT:string,ORGNTD_DB_CR_SW:string,ORGNTD_PYM_AMT:double,RCVD_DB_CR_SW:string,RCVD_PYM_AMT:string,RECON_DATE:string,SLNO:bigint>]
This actually helped me -
xmlDF.selectExpr("explode(MEMBERDETAIL) as e")
  .select("e.FILE_ID", "e.INP_SOURCE_ID", "e.NET_DB_CR_SW", "e.NET_PYM_AMT",
    "e.ORGNTD_DB_CR_SW", "e.ORGNTD_PYM_AMT", "e.RCVD_DB_CR_SW", "e.RCVD_PYM_AMT",
    "e.RECON_DATE").show()
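A slightly shorter equivalent, if you want every field without typing out the list, is to explode and then select e.* (assuming the usual functions import):

import org.apache.spark.sql.functions.explode

xmlDF.select(explode($"MEMBERDETAIL") as "e").select("e.*").show()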