How to aggregate an array-of-struct column in PySpark without exploding

I would like to sum Order.amount for each CustomerID where date <= 03/30/2021 (dates are in MM/dd/yyyy format), taking advantage of the fact that each CustomerID row already holds an array of orders.
Expected output, based on the input data below:
CustomerID | total_amount
1 | 250
2 | 450
Input:
CustomerID | Order
1 | [[1,100,01/01/2021],[2,200,06/01/2021],[3,150,03/01/2021]]
2 | [[1,200,02/01/2021],[2,250,03/01/2021],[3,300,05/01/2021]]
Schema: CustomerID is an int, and Order is an array of structs with the fields Order, amount and date.

Suppose df is your dataframe:
df = spark.createDataFrame(
    [(1, [[1, 100, '01/01/2021'], [2, 200, '06/01/2021'], [3, 150, '03/01/2021']]),
     (2, [[1, 200, '02/01/2021'], [2, 250, '03/01/2021'], [3, 300, '05/01/2021']])],
    "CustomerID: int, Order: array<struct<Order: int, amount: int, date: string>>")
df.printSchema()
# root
# |-- CustomerID: integer (nullable = true)
# |-- Order: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- Order: integer (nullable = true)
# | | |-- amount: integer (nullable = true)
# | | |-- date: string (nullable = true)
For Spark 2.4+, you can use higher-order functions to work with arrays. In this case, filter by date, then aggregate the amounts.
import pyspark.sql.functions as F

sql_expr = """
    aggregate(
        filter(Order, x -> to_date(x.date, 'MM/dd/yyyy') <= '2021-03-30').amount,
        0, (a, b) -> a + b)
"""
df = df.withColumn('total_amount', F.expr(sql_expr))
df.show()
# +----------+--------------------+------------+
# |CustomerID| Order|total_amount|
# +----------+--------------------+------------+
# | 1|[[1, 100, 01/01/2...| 250|
# | 2|[[1, 200, 02/01/2...| 450|
# +----------+--------------------+------------+
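On Spark 3.1+, the same higher-order functions are exposed directly in the Python API, so the SQL expression string can be avoided. A minimal sketch, assuming Spark 3.1+ and the df defined above:
import pyspark.sql.functions as F

# Spark 3.1+ sketch: filter the structs by date, then sum the amounts,
# using the Python wrappers for the higher-order functions.
df = df.withColumn(
    'total_amount',
    F.aggregate(
        F.filter('Order', lambda x: F.to_date(x['date'], 'MM/dd/yyyy') <= F.lit('2021-03-30')),
        F.lit(0),
        lambda acc, x: acc + x['amount'],
    ),
)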

Related

In Spark SQL, how can I select a subset of columns from a nested struct and keep it as a nested struct in the result, using a SQL statement?

I can run the following statement in Spark SQL:
result_df = spark.sql("""select
    one_field,
    field_with_struct
from purchases""")
The resulting data frame will contain the full struct in field_with_struct:
one_field | field_with_struct
123 | {name1,val1,val2,f2,f4}
555 | {name2,val3,val4,f6,f7}
I want to select only a few fields from field_with_struct, but keep them in a struct in the resulting data frame. I would like something like this (this is not real code):
result_df = spark.sql("""select
    one_field,
    struct(
        field_with_struct.name,
        field_with_struct.value2
    ) as my_subset
from purchases""")
To get this:
one_field | my_subset
123 | {name1,val2}
555 | {name2,val4}
Is there any way of doing this with SQL (not with the fluent API)?
There's a much simpler solution using arrays_zip; there's no need to explode/collect_list (which can be error-prone and difficult with complex data, since it relies on something like an id column):
>>> from pyspark.sql import Row
>>> from pyspark.sql.functions import arrays_zip
>>> df = spark.createDataFrame((([Row(x=1, y=2, z=3), Row(x=2, y=3, z=4)],),), ['array_of_structs'])
>>> df.show(truncate=False)
+----------------------+
|array_of_structs |
+----------------------+
|[{1, 2, 3}, {2, 3, 4}]|
+----------------------+
>>> df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
| | |-- z: long (nullable = true)
>>> # Selecting only two of the nested fields:
>>> selected_df = df.select(arrays_zip("array_of_structs.x", "array_of_structs.y").alias("array_of_structs"))
>>> selected_df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
>>> selected_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+
EDIT: Adding the corresponding Spark SQL code, since the OP requested it:
>>> df.createTempView("test_table")
>>> sql_df = spark.sql("""
SELECT
transform(array_of_structs, x -> struct(x.x, x.y)) as array_of_structs
FROM test_table
""")
>>> sql_df.printSchema()
root
|-- array_of_structs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
>>> sql_df.show()
+----------------+
|array_of_structs|
+----------------+
|[{1, 2}, {2, 3}]|
+----------------+
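On Spark 3.1+, the same transform is also available through the Python functions API. A minimal sketch, assuming the df defined above:
>>> from pyspark.sql import functions as F
>>> # Sketch: keep only the x and y fields of each struct, without a SQL string.
>>> selected_df = df.select(
...     F.transform("array_of_structs",
...                 lambda s: F.struct(s["x"].alias("x"), s["y"].alias("y"))
...                ).alias("array_of_structs"))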
In fact, the pseudo-code I provided does work for a plain struct column. For a nested array of structs it's not as straightforward: first the array has to be exploded (the EXPLODE() function), then the subset selected, and then the rows reassembled with COLLECT_LIST().
WITH unfold_by_items AS (
    SELECT id, EXPLODE(Items) AS item FROM spark_tbl_items
),
format_items AS (
    SELECT
        id,
        STRUCT(
            item.item_id,
            item.name
        ) AS item
    FROM unfold_by_items
),
fold_by_items AS (
    SELECT id, COLLECT_LIST(item) AS Items FROM format_items GROUP BY id
)
SELECT * FROM fold_by_items
This keeps only two fields from each struct in Items and, in the end, returns a dataset that again contains an Items array.
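The same EXPLODE / COLLECT_LIST round trip can also be written with the DataFrame API. A sketch, assuming a SparkSession named spark and the spark_tbl_items table from the query above:
from pyspark.sql import functions as F

# Sketch: explode the array, keep a two-field struct, then fold back per id.
items_df = spark.table("spark_tbl_items")
folded = (items_df
          .select("id", F.explode("Items").alias("item"))
          .select("id", F.struct("item.item_id", "item.name").alias("item"))
          .groupBy("id")
          .agg(F.collect_list("item").alias("Items")))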

How to apply SHA2 to a particular column inside an array of structs in Hive or Spark SQL, dynamically?

I have the following data in Hive:
id | name | kyc
1001 | smith | [pnno:999, ssn:12345, email:ss#mail.com]
When we select these columns, the output is:
1001, smith, [999, 12345, ss#mail.com]
I have to apply SHA2 to the ssn field inside this array column, so that the output displays:
1001, smith, [999, *****(sha2 masked value), ss#gmail.com]
The output should keep the same array-of-struct format.
I am currently creating a separate view and joining it in a query. Is there any way to handle this dynamically in a Hive query, or inside Spark/Scala using a dataframe? Is there any Spark/Scala config for this?
Thank you
You can use transform to hash the ssn field in the array of structs:
// sample dataframe
df.show(false)
+----+-----+---------------------------+
|id |name |kyc |
+----+-----+---------------------------+
|1001|smith|[[999, 12345, ss#mail.com]]|
+----+-----+---------------------------+
// sample schema
df.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = false)
// |-- kyc: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- pnno: integer (nullable = false)
// | | |-- ssn: integer (nullable = false)
// | | |-- email: string (nullable = false)
import org.apache.spark.sql.functions.expr

val df2 = df.withColumn(
  "kyc",
  expr("""
    transform(kyc,
      x -> struct(x.pnno pnno, sha2(string(x.ssn), 512) ssn, x.email email)
    )
  """)
)
df2.show(false)
+----+-----+------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |name |kyc |
+----+-----+------------------------------------------------------------------------------------------------------------------------------------------------------+
|1001|smith|[[999, 3627909a29c31381a071ec27f7c9ca97726182aed29a7ddd2e54353322cfb30abb9e3a6df2ac2c20fe23436311d678564d0c8d305930575f60e2d3d048184d79, ss#mail.com]]|
+----+-----+------------------------------------------------------------------------------------------------------------------------------------------------------+
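For completeness, a PySpark equivalent of this transform (a sketch, assuming Spark 3.1+ and the same column names as above):
import pyspark.sql.functions as F

# Sketch: rebuild each struct in the kyc array, hashing only the ssn field.
df2 = df.withColumn(
    "kyc",
    F.transform(
        "kyc",
        lambda x: F.struct(
            x["pnno"].alias("pnno"),
            F.sha2(x["ssn"].cast("string"), 512).alias("ssn"),
            x["email"].alias("email"),
        ),
    ),
)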

Converting a dataframe to an array of struct of column names and values

Suppose I have a dataframe like this
val customer = Seq(
  ("C1", "Jackie Chan", 50, "Dayton", "M"),
  ("C2", "Harry Smith", 30, "Beavercreek", "M"),
  ("C3", "Ellen Smith", 28, "Beavercreek", "F"),
  ("C4", "John Chan", 26, "Dayton", "M")
).toDF("cid", "name", "age", "city", "sex")
How can I get the cid values in one column and the rest of the values in an array<struct<column_name, column_value>> in Spark?
The only difficulty is that arrays must contain elements of the same type. Therefore, you need to cast all the columns to strings before putting them in an array (age is an int in your case). Here is how it goes:
import org.apache.spark.sql.functions._

val cols = customer.columns.tail
val result = customer.select('cid,
  array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")): _*) as "array")
result.show(false)
+---+-----------------------------------------------------------+
|cid|array |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]] |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]] |
+---+-----------------------------------------------------------+
result.printSchema()
root
|-- cid: string (nullable = true)
|-- array: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- name: string (nullable = false)
| | |-- value: string (nullable = true)
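For reference, the same idea in PySpark (a sketch, assuming the same customer DataFrame):
from pyspark.sql import functions as F

# Sketch: build array<struct<name,value>> from every column except cid,
# casting each value to string so the array elements share one type.
cols = [c for c in customer.columns if c != "cid"]
result = customer.select(
    "cid",
    F.array(*[F.struct(F.lit(c).alias("name"), F.col(c).cast("string").alias("value"))
              for c in cols]).alias("array"),
)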
You can do it using array and struct functions:
customer.select($"cid", array(
  struct(lit("name") as "column_name", $"name" as "column_value"),
  struct(lit("age") as "column_name", $"age" as "column_value")
))
will make:
|-- cid: string (nullable = true)
|-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- column_name: string (nullable = false)
| | |-- column_value: string (nullable = true)
Map columns might be a better way to deal with the overall problem. You can keep different value types in the same map without having to cast them to string explicitly.
from pyspark.sql.functions import create_map, lit, col, when, map_concat

df_map = df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"), col("sex")
    ).alias('map_col')
)
(or wrap the map column in an array if you want one)
This way you can still do numerical or string transformations on the relevant key or value. For example:
df_map.select('*',
    map_concat(col('map_col'),
               create_map(lit('u_age'), when(col('map_col')['age'] < 18, True)))
)
Hope that makes sense; I typed this straight in here, so forgive me if there's a bracket missing somewhere.
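If you later need the long key/value form back, a map column can be exploded into rows. A small sketch, assuming the df_map frame from the snippet above:
from pyspark.sql.functions import explode

# Each map entry becomes one (attribute, value) row per cid.
df_map.select('cid', explode('map_col').alias('attribute', 'value')).show()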

PySpark Dataframe Transpose as List

I'm working with the PySpark SQL API, trying to group rows that share a value into a list of the remaining contents. It's similar to a transpose, but instead of pivoting all the values, it puts the values into an array.
Current output:
group_id | member_id | name
55 | 123 | jake
55 | 234 | tim
65 | 345 | chris
Desired output:
group_id | members
55 | [[123, 'jake'], [234, 'tim']]
65 | [[345, 'chris']]
You need to groupby the group_id and use pyspark.sql.functions.collect_list() as the aggregation function.
As for combining the member_id and name columns, you have two options:
Option 1: Use pyspark.sql.functions.array:
from pyspark.sql.functions import array, collect_list
df1 = df.groupBy("group_id")\
.agg(collect_list(array("member_id", "name")).alias("members"))
df1.show(truncate=False)
#+--------+-------------------------------------------------+
#|group_id|members |
#+--------+-------------------------------------------------+
#|55 |[WrappedArray(123, jake), WrappedArray(234, tim)]|
#|65 |[WrappedArray(345, chris)] |
#+--------+-------------------------------------------------+
This returns a WrappedArray of arrays of strings. The integers are converted to strings because you can't have mixed type arrays.
df1.printSchema()
#root
# |-- group_id: integer (nullable = true)
# |-- members: array (nullable = true)
# | |-- element: array (containsNull = true)
# | | |-- element: string (containsNull = true)
Option 2: Use pyspark.sql.functions.struct
from pyspark.sql.functions import collect_list, struct
df2 = df.groupBy("group_id")\
.agg(collect_list(struct("member_id", "name")).alias("members"))
df2.show(truncate=False)
#+--------+-----------------------+
#|group_id|members |
#+--------+-----------------------+
#|65 |[[345,chris]] |
#|55 |[[123,jake], [234,tim]]|
#+--------+-----------------------+
This returns an array of structs, with named fields for member_id and name
df2.printSchema()
#root
# |-- group_id: integer (nullable = true)
# |-- members: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- member_id: integer (nullable = true)
# | | |-- name: string (nullable = true)
What's useful about the struct method is that you can access elements of the nested array by name using the dot accessor:
df2.select("group_id", "members.member_id").show()
#+--------+----------+
#|group_id| member_id|
#+--------+----------+
#| 65| [345]|
#| 55|[123, 234]|
#+--------+----------+

Flatten Nested Struct in PySpark Array

Given a schema like:
root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisors: struct
| | | |-- advisor1: string
| | | |-- advisor2: string
How can I get a schema like:
root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisor1: string
| | |-- advisor2: string
Currently, I explode the array, flatten the structure by selecting advisor.* and then group by first_name, last_name and rebuild the array with collect_list. I'm hoping there's a cleaner/shorter way to do this. Currently, there's a lot of pain renaming some fields and stuff that I don't want to get into here. Thanks!
You can use a udf to change the datatype of nested columns in the dataframe.
Suppose you have read the dataframe as df1:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

def foo(data):
    # Rebuild each element as (school, advisor1, advisor2)
    return list(map(
        lambda x: (
            x["school"],
            x["advisors"]["advisor1"],
            x["advisors"]["advisor2"]
        ),
        data
    ))

struct = ArrayType(
    StructType([
        StructField("school", StringType()),
        StructField("advisor1", StringType()),
        StructField("advisor2", StringType())
    ])
)

udf_foo = udf(foo, struct)
df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()
output:
root
|-- degrees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- school: string (nullable = true)
| | |-- advisor1: string (nullable = true)
| | |-- advisor2: string (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
Here's a more generic solution which can flatten multiple nested struct layers:
from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []

    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])

    flat_df.append(nested_df.select(flat_cols[0] +
                                    [col(nc + '.' + c).alias(nc + '_' + c)
                                     for nc in nested_cols[0]
                                     for c in nested_df.select(nc + '.*').columns])
                   )
    for i in range(1, layers):
        print(flat_cols[i - 1])
        flat_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] == 'struct'])

        flat_df.append(flat_df[i - 1].select(flat_cols[i] +
                                             [col(nc + '.' + c).alias(nc + '_' + c)
                                              for nc in nested_cols[i]
                                              for c in flat_df[i - 1].select(nc + '.*').columns])
                       )
    return flat_df[-1]
Just call it with:
my_flattened_df = flatten_df(my_df_having_structs, 3)
(the second parameter is the number of layers to be flattened; in my case it's 3)
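Finally, for the original array-of-struct question, a non-udf alternative on Spark 2.4+ is to rebuild each array element with a transform expression. A minimal sketch, assuming a DataFrame df with the schema from the question:
from pyspark.sql import functions as F

# Sketch: pull advisor1/advisor2 up one level inside each array element,
# without exploding and re-collecting the array.
flat = df.withColumn(
    "degrees",
    F.expr("""
        transform(degrees, d -> named_struct(
            'school', d.school,
            'advisor1', d.advisors.advisor1,
            'advisor2', d.advisors.advisor2))
    """),
)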