How to project an array of structs in the Spark DataFrame API - Scala

I have a Dataframe like this:
val df = Seq(
    Seq(("a", "b", "c"))
  )
  .toDF("arr")
  .select($"arr".cast("array<struct<c1:string,c2:string,c3:string>>"))
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c2: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|[[a,b,c]]|
+---------+
I want to select only c1 and c3, such that:
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|[[a,c]] |
+---------+
Can this be done without a UDF?
I can do it with a UDF, but I'd like a solution without one, something like:
df
.select($"arr.c1".as("arr"))
root
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
But this only works for selecting a single struct field. I've also tried:
df
.select(array(struct($"arr.c1",$"arr.c3")).as("arr"))
but this gives
root
|-- arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- c1: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- c3: array (nullable = true)
| | | |-- element: string (containsNull = true)

I can only give an answer for the Python API, but I am sure the Scala API has something very similar.
The key is the function arrays_zip, which, according to the documentation, "[r]eturns a merged array of structs in which the N-th struct contains all N-th values of input arrays."
Example (still from the documentation):
from pyspark.sql.functions import arrays_zip
df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
# Prints: [Row(zipped=[Row(vals1=1, vals2=2), Row(vals1=2, vals2=3), Row(vals1=3, vals2=4)])]
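For the Scala question above, a minimal sketch of the same idea (assuming Spark 2.4+; the field names produced by arrays_zip can differ between Spark versions, so verify the resulting schema):

import org.apache.spark.sql.functions._

// Zip the extracted field arrays back into an array of structs.
val zipped = df.select(arrays_zip($"arr.c1", $"arr.c3").as("arr"))

// Or rebuild each element with a higher-order function (SQL transform, Spark 2.4+).
val projected = df.select(expr("transform(arr, x -> struct(x.c1 as c1, x.c3 as c3))").as("arr"))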

Related

How to cast column with nested array in pyspark

I have this schema in pyspark:
root
|-- SortedLenders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- LenderID: string (nullable = true)
| | |-- MaxProfit: string (nullable = true)
|-- FilteredOutDecisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ApprovedAmount: integer (nullable = true)
| | |-- Reasons: array (nullable = true)
| | | |-- element: integer (containsNull = true)
How do I cast the FilteredOutDecisions.Reasons column to double? Thank you in advance!
Try this:
import pyspark.sql.functions as f

df = df.withColumn(
    'newFilteredOutDecisions',
    f.expr('transform(FilteredOutDecisions, element -> struct(element.ApprovedAmount as ApprovedAmount, transform(element.Reasons, value -> cast(value as double)) as Reasons))')
)
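The same idea can be sketched in Scala (assuming the schema shown above; the SQL transform expression is identical, only the host API changes):

import org.apache.spark.sql.functions.expr

// Rebuild each struct, casting the inner Reasons array to double.
val recast = df.withColumn(
  "newFilteredOutDecisions",
  expr("transform(FilteredOutDecisions, element -> struct(element.ApprovedAmount as ApprovedAmount, transform(element.Reasons, value -> cast(value as double)) as Reasons))"))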

How to convert a Spark dataframe to a list of structs in Scala

I have a Spark dataframe composed of 12 rows and a number of columns, 22 in this case.
I want to convert it into a dataframe of the format:
root
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- ast: double (nullable = true)
| | |-- blk: double (nullable = true)
| | |-- dreb: double (nullable = true)
| | |-- fg3_pct: double (nullable = true)
| | |-- fg3a: double (nullable = true)
| | |-- fg3m: double (nullable = true)
| | |-- fg_pct: double (nullable = true)
| | |-- fga: double (nullable = true)
| | |-- fgm: double (nullable = true)
| | |-- ft_pct: double (nullable = true)
| | |-- fta: double (nullable = true)
| | |-- ftm: double (nullable = true)
| | |-- games_played: long (nullable = true)
| | |-- seconds: double (nullable = true)
| | |-- oreb: double (nullable = true)
| | |-- pf: double (nullable = true)
| | |-- player_id: long (nullable = true)
| | |-- pts: double (nullable = true)
| | |-- reb: double (nullable = true)
| | |-- season: long (nullable = true)
| | |-- stl: double (nullable = true)
| | |-- turnover: double (nullable = true)
Where each element of the dataframe data field corresponds to a different row of the original dataframe.
The final goal is exporting it to a .json file which will have the format:
{"data": [{row1}, {row2}, ..., {row12}]}
The code I am employing at the moment is the following:
val best_12_struct = best_12.withColumn("data",
  array((0 to 11).map(i =>
    struct(col("ast"), col("blk"), col("dreb"), col("fg3_pct"), col("fg3a"),
           col("fg3m"), col("fg_pct"), col("fga"), col("fgm"),
           col("ft_pct"), col("fta"), col("ftm"), col("games_played"),
           col("seconds"), col("oreb"), col("pf"), col("player_id"),
           col("pts"), col("reb"), col("season"), col("stl"), col("turnover"))) : _*))
val best_12_data = best_12_struct.select("data")
But array((0 to 11).map(...)) copies the same element into data 12 times. Therefore, the .json I finally obtain has 12 {"data": ...} objects, each containing the same row repeated 12 times, instead of a single {"data": ...} with 12 elements, one per row of the original dataframe.
You have the same row 12 times because withColumn only picks information from the row currently being processed.
You need to aggregate rows at the dataframe level with collect_list, which is an aggregate function, as follows:
import org.apache.spark.sql.functions._

val best_12_data = best_12
  .withColumn("row", struct(
    col("ast"), col("blk"), col("dreb"), col("fg3_pct"), col("fg3a"), col("fg3m"), col("fg_pct"),
    col("fga"), col("fgm"), col("ft_pct"), col("fta"), col("ftm"), col("games_played"), col("seconds"),
    col("oreb"), col("pf"), col("player_id"), col("pts"), col("reb"), col("season"), col("stl"), col("turnover")))
  .agg(collect_list(col("row")).as("data"))
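To reach the stated goal of a single {"data": [...]} object, the aggregated single-row dataframe can then be written out. A sketch, with a hypothetical output path:

// One row -> one JSON object of the form {"data": [...]}.
// coalesce(1) keeps the output in a single part file; the path below is a placeholder.
best_12_data.coalesce(1).write.json("/tmp/best_12_json")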

How to parse the JSON data using Spark-Scala

I have a requirement to parse JSON data as shown in the expected results below; currently I'm not getting how to include the signal names (ABS, ADA, ADW) in the SIGNAL column. Any help would be much appreciated.
I tried something which gives the results shown below, but I also need to include all the signals in the SIGNAL column, as shown in the expected results.
jsonDF.select(explode($"ABS") as "element")
  .withColumn("stime", col("element.E"))
  .withColumn("can_value", col("element.V"))
  .drop(col("element"))
  .show()
+----------+----------+
|     stime| can_value|
+----------+----------+
|value of E|value of V|
+----------+----------+
df.printSchema
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- APP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
I will need output like below:
+------+----------+----------+
|SIGNAL|     stime| can_value|
+------+----------+----------+
|   ABS|value of E|value of V|
|   ADA|value of E|value of V|
|   ADW|value of E|value of V|
+------+----------+----------+
To get the expected output and insert the values in the SIGNAL column:
jsonDF.select(explode($"ABS") as "element")
.withColumn("stime", col("element.E"))
.withColumn("can_value", col("element.V"))
.drop(col("element"))
.withColumn("SIGNAL",lit("ABS"))
.show()
And here is a generalized version of the above approach (based on the result of df.printSchema, assuming that the signal names are column names and each of those columns contains an array of elements of the form struct(E, V)):
val columns: Array[String] = df.columns
var arrayOfDFs: Array[DataFrame] = Array()

for (col_name <- columns) {
  val temp = df.selectExpr("explode(" + col_name + ") as element")
    .select(
      lit(col_name).as("SIGNAL"),
      col("element.E").as("stime"),
      col("element.V").as("can_value"))
  arrayOfDFs = arrayOfDFs :+ temp
}

val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)
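For reference, the same approach can be sketched without the mutable array, mapping over the columns and reducing with union (same assumption that every column holds an array of struct(E, V)):

val jsonDF = df.columns.map { c =>
    // One DataFrame per signal column, tagged with the column name.
    df.select(lit(c).as("SIGNAL"), explode(col(c)).as("element"))
      .select($"SIGNAL", $"element.E".as("stime"), $"element.V".as("can_value"))
  }.reduce(_ union _)
jsonDF.show(false)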

Fetch values from nested struct in Spark DF

I have a Spark DF with one column, where each row is of type org.apache.spark.sql.Row and has the following form:
col1: array (nullable = true)
| |-- A1: struct (containsNull = true)
| | |-- B1: struct (nullable = true)
| | | |-- B11: string (nullable = true)
| | | |-- B12: string (nullable = true)
| | |-- B2: string (nullable = true)
I am trying to get the value of A1 -> B1 -> B11.
Are there any methods to fetch this with the DataFrame APIs or indexing, without converting each row into a Seq and iterating through it, which hurts my performance badly? Any suggestions would be great. Thanks.
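No answer was posted for this one; a possible sketch, assuming A1 and B1 are struct fields along the path inside the array elements (adjust the path if A1 is actually the array element itself):

import org.apache.spark.sql.functions._

// Dot notation extracts the nested field across the whole array, yielding array<string>.
val asArray = df.select($"col1.A1.B1.B11".as("B11"))

// Or explode first to get one B11 value per array element.
val perElement = df.select(explode($"col1").as("e")).select($"e.A1.B1.B11".as("B11"))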

How to join two dataframes where the key used for joining has a different datatype in each dataframe

I have two dataframes df1 and df2 whose schemas are as follows.
df1 is of the form:
slotSize2: struct (nullable = true)
| |-- 120x600: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 160x600: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
| |-- 250x250: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 300x250: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
Dataframe df2 has the schema:
root
|-- bidId: array (nullable = true)
| |-- element: string (containsNull = true)
|-- slotSize1: array (nullable = true)
| |-- element: string (containsNull = true)
In dataframe df2 we have slotSize (named slotSize1) as a string, whereas in dataframe df1 slot sizes appear in nested form, i.e. for every slot size there is a corresponding map.
I want to join df1 and df2 to form a new dataframe df3 with the schema (bidId, slotSize, viewMap), where bidId comes from df2, slotSize is of the form 120x600 and is present in both schemas, and viewMap is the nested map corresponding to each slotSize in df1.
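No answer was posted here either; one possible sketch under explicit assumptions: that bidId and slotSize1 in df2 are parallel arrays, that Spark 2.4+ is available for arrays_zip, and that the exact zipped field names may vary by Spark version. The idea is to turn df1's struct-of-structs into (slotSize, viewMap) rows, explode df2 into (bidId, slotSize) rows, and join on slotSize:

import org.apache.spark.sql.functions._

// One row per slot size from df1; to_json unifies the differently-shaped per-slot structs.
val df1Rows = df1.select("slotSize2.*").columns.map { s =>
    df1.select(lit(s).as("slotSize"), to_json(col(s"slotSize2.`$s`")).as("viewMap"))
  }.reduce(_ union _)

// One row per (bidId, slotSize) pair from df2, assuming the two arrays are aligned.
val df2Rows = df2.select(explode(arrays_zip($"bidId", $"slotSize1")).as("z"))
  .select($"z.bidId".as("bidId"), $"z.slotSize1".as("slotSize"))

val df3 = df2Rows.join(df1Rows, Seq("slotSize"))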