How to cast a column with a nested array in PySpark

I have this schema in pyspark:
root
|-- SortedLenders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- LenderID: string (nullable = true)
| | |-- MaxProfit: string (nullable = true)
|-- FilteredOutDecisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ApprovedAmount: integer (nullable = true)
| | |-- Reasons: array (nullable = true)
| | | |-- element: integer (containsNull = true)
How do I cast the FilteredOutDecisions.Reasons column to double? Thank you in advance!

Try this:
import pyspark.sql.functions as f

df = df.withColumn(
    'newFilteredOutDecisions',
    f.expr('transform(FilteredOutDecisions, element -> struct(element.ApprovedAmount as ApprovedAmount, transform(element.Reasons, value -> cast(value as double)) as Reasons))')
)
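On Spark 3.1+ the same rewrite can also be expressed with the Python transform function instead of a SQL expression; a minimal sketch, assuming the schema shown above:
from pyspark.sql import functions as F

df = df.withColumn(
    'newFilteredOutDecisions',
    F.transform(
        'FilteredOutDecisions',
        # rebuild each struct, casting every value in Reasons to double
        lambda e: F.struct(
            e['ApprovedAmount'].alias('ApprovedAmount'),
            F.transform(e['Reasons'], lambda v: v.cast('double')).alias('Reasons')
        )
    )
)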

Related

Add a field that already exists in the df to a struct field in PySpark

I have the following df:
sku category price state infos_gerais
33344 mmmma 3.00 SP [{5, 5656655, 5845454}]
33344 mmmma 3.00 MG [{5, 6565767, 5854545}]
33344 mmmma 3.00 RS [{5, 8788787, 4564646}]
The schema of the df follows:
|-- sku: string (nullable = true)
|-- category: string (nullable = true)
|-- price: double (nullable = true)
|-- state: string (nullable = true)
|-- infos_gerais: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- service_type_id: integer (nullable = true)
| | |-- cep_ini: integer (nullable = true)
| | |-- cep_fim: integer (nullable = true)
Note that the field in the df that doesn't repeat is 'state', so I need to insert this field into the struct 'infos_gerais' and apply a groupBy. I tried the code below, but it returns an error. Can anyone help me?
df_end = df_end.withColumn(
    "infos_gerais",
    sf.collect_list(
        sf.struct(
            sf.col("infos_gerais.*"),
            sf.col('infos_gerais.state').alias('state'))
    )
)
I need the following df output:
sku category price infos_gerais
33344 mmmma 3.00 [{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG},{5, 8788787, 4564646, RS}]
Given you have an array of structs, you can use transform to process the elements of the array and withField on the structs to add/replace a struct field (withField requires Spark 3.1+).
Here's a simple example:
import pyspark.sql.functions as func

data_sdf. \
    withColumn('infos_gerais',
               func.transform('infos_gerais', lambda x: x.withField('state', func.col('state')))
               ). \
    groupBy('sku', 'category', 'price'). \
    agg(func.flatten(func.collect_list('infos_gerais')).alias('infos_gerais')). \
    show(truncate=False)
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |sku |category|price|infos_gerais |
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |33344|mmmma |3.0 |[{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG}, {5, 8788787, 4564646, RS}]|
# +-----+--------+-----+---------------------------------------------------------------------------------+
# root
# |-- sku: string (nullable = true)
# |-- category: string (nullable = true)
# |-- price: double (nullable = true)
# |-- infos_gerais: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- service_type_id: integer (nullable = true)
# | | |-- cep_ini: integer (nullable = true)
# | | |-- cep_fim: integer (nullable = true)
# | | |-- state: string (nullable = true)
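If Column.withField is not available (it was added in Spark 3.1), a minimal sketch of the same rewrite using a SQL transform expression (Spark 2.4+), assuming the struct fields shown in the schema above:
import pyspark.sql.functions as func

# rebuild each struct explicitly and append the outer 'state' column as a new field
data_sdf.withColumn(
    'infos_gerais',
    func.expr("transform(infos_gerais, x -> struct(x.service_type_id as service_type_id, "
              "x.cep_ini as cep_ini, x.cep_fim as cep_fim, state as state))")
)
The groupBy/agg step stays the same.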

How to convert a spark dataframe to a list of structs in scala

I have a spark dataframe composed of 12 rows and different columns, 22 in this case.
I want to convert it into a dataframe of the format:
root
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- ast: double (nullable = true)
| | |-- blk: double (nullable = true)
| | |-- dreb: double (nullable = true)
| | |-- fg3_pct: double (nullable = true)
| | |-- fg3a: double (nullable = true)
| | |-- fg3m: double (nullable = true)
| | |-- fg_pct: double (nullable = true)
| | |-- fga: double (nullable = true)
| | |-- fgm: double (nullable = true)
| | |-- ft_pct: double (nullable = true)
| | |-- fta: double (nullable = true)
| | |-- ftm: double (nullable = true)
| | |-- games_played: long (nullable = true)
| | |-- seconds: double (nullable = true)
| | |-- oreb: double (nullable = true)
| | |-- pf: double (nullable = true)
| | |-- player_id: long (nullable = true)
| | |-- pts: double (nullable = true)
| | |-- reb: double (nullable = true)
| | |-- season: long (nullable = true)
| | |-- stl: double (nullable = true)
| | |-- turnover: double (nullable = true)
Where each element of the dataframe data field corresponds to a different row of the original dataframe.
The final goal is exporting it to .json file which will have the format:
{"data": [{row1}, {row2}, ..., {row12}]}
The code I am employing at the moment is the following:
val best_12_struct = best_12.withColumn("data", array((0 to 11).map(i => struct(col("ast"), col("blk"), col("dreb"), col("fg3_pct"), col("fg3a"),
col("fg3m"), col("fg_pct"), col("fga"), col("fgm"),
col("ft_pct"), col("fta"), col("ftm"), col("games_played"),
col("seconds"), col("oreb"), col("pf"), col("player_id"),
col("pts"), col("reb"), col("season"), col("stl"), col("turnover"))) : _*))
val best_12_data = best_12_struct.select("data")
But array((0 to 11).map(...): _*) copies the same element 12 times into data. Therefore, the .json I finally obtain has 12 {"data": ...} objects, each containing the same row repeated 12 times, instead of a single {"data": ...} with 12 elements, each corresponding to one row of the original dataframe.
You get the same row 12 times because withColumn only picks information from the row currently being processed.
You need to aggregate rows at the dataframe level with collect_list, which is an aggregate function, as follows:
import org.apache.spark.sql.functions._
val best_12_data = best_12
.withColumn("row", struct(col("ast"), col("blk"), col("dreb"), col("fg3_pct"), col("fg3a"), col("fg3m"), col("fg_pct"), col("fga"), col("fgm"), col("ft_pct"), col("fta"), col("ftm"), col("games_played"), col("seconds"), col("oreb"), col("pf"), col("player_id"), col("pts"), col("reb"), col("season"), col("stl"), col("turnover")))
.agg(collect_list(col("row")).as("data"))

How to parse the JSON data using Spark-Scala

I have a requirement to parse JSON data into the expected result shown below; currently I don't see how to include the signal names (ABS, ADA, ADW) in the SIGNAL column. Any help would be much appreciated.
I tried something which gives the result shown below, but I also need to include all the signals in the SIGNAL column, as shown in the expected results.
jsonDF.select(explode($"ABS") as "element").withColumn("stime", col("element.E")).withColumn("can_value", col("element.V")).drop(col("element")).show()
+-----------+-----------+
|stime      |can_value  |
+-----------+-----------+
|value of E |value of V |
+-----------+-----------+
df.printSchema
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- APP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
I will need output like below:
+------+-----------+-----------+
|SIGNAL|stime      |can_value  |
+------+-----------+-----------+
|ABS   |value of E |value of V |
|ADA   |value of E |value of V |
|ADW   |value of E |value of V |
+------+-----------+-----------+
To get the expected output and insert values in the SIGNAL column:
jsonDF.select(explode($"ABS") as "element")
.withColumn("stime", col("element.E"))
.withColumn("can_value", col("element.V"))
.drop(col("element"))
.withColumn("SIGNAL",lit("ABS"))
.show()
And the generalized version of the above approach (based on the result of df.printSchema, assuming that the signal names are column names and that those columns contain arrays whose elements are of the form struct(E, V)):
val columns: Array[String] = df.columns
var arrayOfDFs: Array[DataFrame] = Array()

for (col_name <- columns) {
  val temp = df.selectExpr("explode(" + col_name + ") as element")
    .select(
      lit(col_name).as("SIGNAL"),
      col("element.E").as("stime"),
      col("element.V").as("can_value"))
  arrayOfDFs = arrayOfDFs :+ temp
}

val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)

How to project an array of structs in spark dataframe API

I have a Dataframe like this:
val df = Seq(
Seq(("a","b","c"))
)
.toDF("arr")
.select($"arr".cast("array<struct<c1:string,c2:string,c3:string>>"))
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c2: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|[[a,b,c]]|
+---------+
I want to select only c1 and c3, such that:
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|[[a,c]] |
+---------+
Can this be done without a UDF?
I can do it with a UDF, but I'd like a solution without it, something like
df
.select($"arr.c1".as("arr"))
root
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
But this only works to select one struct field. I've also tried:
df
.select(array(struct($"arr.c1",$"arr.c3")).as("arr"))
but this gives
root
|-- arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- c1: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- c3: array (nullable = true)
| | | |-- element: string (containsNull = true)
I can only give an answer for the Python API but I am sure the Scala API has something very similar.
The key is the function arrays_zip, which, according to the documentation, "[r]eturns a merged array of structs in which the N-th struct contains all N-th values of input arrays."
Example (still from the documentation):
from pyspark.sql.functions import arrays_zip
df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
# Prints: [Row(zipped=[Row(vals1=1, vals2=2), Row(vals1=2, vals2=3), Row(vals1=3, vals2=4)])]
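Applied to the question's df, a minimal sketch might look like this (the trailing cast only pins the struct field names, since the names arrays_zip assigns can vary across Spark versions):
from pyspark.sql.functions import arrays_zip, col

# zip the c1 and c3 arrays back into an array of two-field structs
df.select(
    arrays_zip(col("arr.c1"), col("arr.c3"))
        .cast("array<struct<c1:string,c3:string>>")
        .alias("arr")
)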

How to join two dataframes where key to be used for joining has different datatype in both dataframes

I have two dataframes, df1 and df2, whose schemas are as follows.
df1 is of the form:
|-- slotSize2: struct (nullable = true)
| |-- 120x600: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 160x600: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
| |-- 250x250: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 300x250: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
Dataframe df2 has schema:
root
|-- bidId: array (nullable = true)
| |-- element: string (containsNull = true)
|-- slotSize1: array (nullable = true)
| |-- element: string (containsNull = true)
In dataframe df2 we have slotSize (named slotSize1) in the form of a string, while in dataframe df1 we have a nested form of slotSize, i.e. for every slotSize there is a corresponding map.
I want to join the two dataframes df1 and df2 to form a new dataframe df3 with schema (bidId, slotSize, viewMap), where bidId is present in df2, slotSize is of the form 120x600 and is present in both schemas, and viewMap is the nested map corresponding to each slotSize in df1.
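One possible approach, sketched under the assumptions that bidId and slotSize1 in df2 are parallel arrays and that each per-slot struct in df1 can be treated as a map<string,string> (so that slots with different struct schemas share one column type):
from functools import reduce
from pyspark.sql import functions as F

# Unpivot df1 into one row per (slotSize, viewMap); to_json/from_json turns each
# per-slot struct into a map<string,string> so the per-slot rows union cleanly.
slot_names = df1.select("slotSize2.*").columns
df1_long = reduce(
    lambda a, b: a.unionByName(b),
    [
        df1.select(
            F.lit(s).alias("slotSize"),
            F.from_json(F.to_json(F.col("slotSize2.`{}`".format(s))), "map<string,string>").alias("viewMap"),
        )
        for s in slot_names
    ],
)

# Explode df2 so each (bidId, slotSize) pair becomes a row, then join on slotSize.
df2_long = (
    df2.select(F.explode(F.arrays_zip("bidId", "slotSize1")).alias("z"))
       .select(F.col("z.bidId").alias("bidId"), F.col("z.slotSize1").alias("slotSize"))
)

df3 = df2_long.join(df1_long, "slotSize")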