Given a schema like:
root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisors: struct
| | | |-- advisor1: string
| | | |-- advisor2: string
How can I get a schema like:
root
|-- first_name: string
|-- last_name: string
|-- degrees: array
| |-- element: struct
| | |-- school: string
| | |-- advisor1: string
| | |-- advisor2: string
Currently, I explode the array, flatten the structure by selecting advisors.*, and then group by first_name and last_name to rebuild the array with collect_list. I'm hoping there's a cleaner/shorter way to do this; as it stands there's a lot of painful field renaming involved that I won't go into here. Thanks!
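If you're on a recent Spark version, one shorter option is to rebuild each array element in place, with no explode/groupBy round trip. A minimal sketch, assuming Spark 3.1+ for functions.transform and that df is the dataframe with the schema above:
from pyspark.sql.functions import struct, transform

df2 = df.withColumn(
    "degrees",
    transform(
        "degrees",
        # rebuild each element as a flat struct; aliases set the field names
        lambda d: struct(
            d["school"].alias("school"),
            d["advisors"]["advisor1"].alias("advisor1"),
            d["advisors"]["advisor2"].alias("advisor2"),
        ),
    ),
)
df2.printSchema()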
You can use a udf to change the datatype of nested columns in a dataframe.
Suppose you have read the dataframe as df1:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
def foo(data):
    # rebuild each element as a flat (school, advisor1, advisor2) tuple
    return [
        (x["school"], x["advisors"]["advisor1"], x["advisors"]["advisor2"])
        for x in data
    ]
struct = ArrayType(
    StructType([
        StructField("school", StringType()),
        StructField("advisor1", StringType()),
        StructField("advisor2", StringType())
    ])
)
udf_foo = udf(foo, struct)
df2 = df1.withColumn("degrees", udf_foo("degrees"))
df2.printSchema()
output:
root
|-- degrees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- school: string (nullable = true)
| | |-- advisor1: string (nullable = true)
| | |-- advisor2: string (nullable = true)
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
Here's a more generic solution which can flatten multiple nested struct layers:
from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []
    # split the top level into non-struct and struct columns
    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])
    # expand each struct column into "<struct>_<field>" columns
    flat_df.append(nested_df.select(flat_cols[0] +
                                    [col(nc + '.' + c).alias(nc + '_' + c)
                                     for nc in nested_cols[0]
                                     for c in nested_df.select(nc + '.*').columns])
                   )
    # repeat the same expansion once per remaining layer of nesting
    for i in range(1, layers):
        flat_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] == 'struct'])
        flat_df.append(flat_df[i - 1].select(flat_cols[i] +
                                             [col(nc + '.' + c).alias(nc + '_' + c)
                                              for nc in nested_cols[i]
                                              for c in flat_df[i - 1].select(nc + '.*').columns])
                       )
    return flat_df[-1]
Just call it with:
my_flattened_df = flatten_df(my_df_having_structs, 3)
(the second parameter is the number of layers to flatten; in my case it's 3)
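If you'd rather not count the layers by hand, a small wrapper around the flatten_df above (a sketch, relying only on that function) can loop until no struct columns remain:
def flatten_all(nested_df):
    # keep applying one flattening pass until no struct columns are left
    while any(dtype.startswith('struct') for _, dtype in nested_df.dtypes):
        nested_df = flatten_df(nested_df, 1)
    return nested_df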
I have the following df:
sku category price state infos_gerais
33344 mmmma 3.00 SP [{5, 5656655, 5845454}]
33344 mmmma 3.00 MG [{5, 6565767, 5854545}]
33344 mmmma 3.00 RS [{5, 8788787, 4564646}]
The schema of df follows:
root
|-- sku: string (nullable = true)
|-- category: string (nullable = true)
|-- price: double (nullable = true)
|-- state: string (nullable = true)
|-- infos_gerais: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- service_type_id: integer (nullable = true)
| | |-- cep_ini: integer (nullable = true)
| | |-- cep_fim: integer (nullable = true)
Note that the only field in df that doesn't repeat is 'state', so I need to insert this field into the 'infos_gerais' struct and then apply a groupBy. I tried the code below, but it returns an error. Can anyone help me?
df_end = df_end.withColumn(
"infos_gerais",
sf.collect_list(
sf.struct(
sf.col("infos_gerais.*"),
sf.col('infos_gerais.state').alias('state'))
)
)
I need the following df output:
sku category price infos_gerais
33344 mmmma 3.00 [{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG},{5, 8788787, 4564646, RS}]
Given that you have an array of structs, you can use transform to process the elements of the array and withField on the structs to add/replace a struct field.
Here's a simple example:
import pyspark.sql.functions as func

# withField injects the outer state column into each struct; flatten undoes
# the array-of-arrays that collect_list produces over an array column
data_sdf. \
    withColumn('infos_gerais',
               func.transform('infos_gerais', lambda x: x.withField('state', func.col('state')))
               ). \
    groupBy('sku', 'category', 'price'). \
    agg(func.flatten(func.collect_list('infos_gerais')).alias('infos_gerais')). \
    show(truncate=False)
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |sku |category|price|infos_gerais |
# +-----+--------+-----+---------------------------------------------------------------------------------+
# |33344|mmmma |3.0 |[{5, 5656655, 5845454, SP}, {5, 6565767, 5854545, MG}, {5, 8788787, 4564646, RS}]|
# +-----+--------+-----+---------------------------------------------------------------------------------+
# root
# |-- sku: string (nullable = true)
# |-- category: string (nullable = true)
# |-- price: double (nullable = true)
# |-- infos_gerais: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- service_type_id: integer (nullable = true)
# | | |-- cep_ini: integer (nullable = true)
# | | |-- cep_fim: integer (nullable = true)
# | | |-- state: string (nullable = true)
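Note that Column.withField and the Python transform function require Spark 3.1+. On older versions (back to 2.4, where SQL higher-order functions were introduced), the same reshaping can be written as a SQL expression; a minimal sketch, assuming the same data_sdf:
import pyspark.sql.functions as func

# the unbound `state` reference inside the lambda resolves to the outer column
data_sdf.withColumn(
    'infos_gerais',
    func.expr("transform(infos_gerais, x -> struct(x.service_type_id, x.cep_ini, x.cep_fim, state))")
)
# then groupBy and flatten(collect_list(...)) exactly as above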
I have a requirement to parse JSON data as shown in the expected results below. Currently I can't figure out how to include the signal names (ABS, ADA, ADW) in the SIGNAL column. Any help would be much appreciated.
I tried something that gives the results shown below, but I will also need to include all the signal names in the SIGNAL column, as shown in the expected results.
jsonDF.select(explode($"ABS") as "element").withColumn("stime", col("element.E")).withColumn("can_value", col("element.V")).drop(col("element")).show()
+----------+----------+
|     stime| can_value|
+----------+----------+
|value of E|value of V|
+----------+----------+
df.printSchema
root
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- APP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
I will need output like below:
+------+----------+----------+
|SIGNAL|     stime| can_value|
+------+----------+----------+
|   ABS|value of E|value of V|
|   ADA|value of E|value of V|
|   ADW|value of E|value of V|
+------+----------+----------+
To get the expected output and insert the values into the SIGNAL column:
jsonDF.select(explode($"ABS") as "element")
.withColumn("stime", col("element.E"))
.withColumn("can_value", col("element.V"))
.drop(col("element"))
.withColumn("SIGNAL",lit("ABS"))
.show()
And here is a generalized version of the above approach.
(Based on the result of df.printSchema, this assumes the signal names are the column names, and that each of those columns contains an array of struct(E, V) elements.)
val columns: Array[String] = df.columns
var arrayOfDFs: Array[DataFrame] = Array()

// build one (SIGNAL, stime, can_value) dataframe per signal column
for (col_name <- columns) {
  val temp = df.selectExpr("explode(" + col_name + ") as element")
    .select(
      lit(col_name).as("SIGNAL"),
      col("element.E").as("stime"),
      col("element.V").as("can_value"))
  arrayOfDFs = arrayOfDFs :+ temp
}

// stack the per-signal dataframes into a single result
val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)
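For completeness, a rough PySpark equivalent of the same approach (a sketch; the cast is added so long- and double-valued signals union cleanly):
from functools import reduce
import pyspark.sql.functions as F

dfs = [
    df.select(F.lit(c).alias("SIGNAL"), F.explode(F.col(c)).alias("element"))
      .select("SIGNAL",
              F.col("element.E").alias("stime"),
              F.col("element.V").cast("double").alias("can_value"))
    for c in df.columns
]
json_df = reduce(lambda a, b: a.union(b), dfs)
json_df.show(truncate=False)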
I have a Spark DataFrame with one column, where each row is of type org.apache.spark.sql.Row and has the form:
col1: array (nullable = true)
| |-- A1: struct (containsNull = true)
| | |-- B1: struct (nullable = true)
| | | |-- B11: string (nullable = true)
| | | |-- B12: string (nullable = true)
| | |-- B2: string (nullable = true)
I am trying to get the value of A1 -> B1 -> B11. Is there any way to fetch this with the DataFrame APIs or indexing, without converting each row into a Seq and iterating through it, which hurts my performance badly? Any suggestions would be great. Thanks!
I have a Dataframe like this:
val df = Seq(
Seq(("a","b","c"))
)
.toDF("arr")
.select($"arr".cast("array<struct<c1:string,c2:string,c3:string>>"))
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c2: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|[[a,b,c]]|
+---------+
I want to select only c1 and c3, such that:
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|[[a,c]] |
+---------+
Can this be done without a UDF?
I can do it with a UDF, but I'd like a solution without one, something like:
df
.select($"arr.c1".as("arr"))
root
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
But this only works for selecting a single struct field. I've also tried:
df
.select(array(struct($"arr.c1",$"arr.c3")).as("arr"))
but this gives
root
|-- arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- c1: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- c3: array (nullable = true)
| | | |-- element: string (containsNull = true)
I can only give an answer for the Python API but I am sure the Scala API has something very similar.
The key is the function arrays_zip, which, according to the documentation, "[r]eturns a merged array of structs in which the N-th struct contains all N-th values of input arrays."
Example (still from the documentation):
from pyspark.sql.functions import arrays_zip
df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
# Prints: [Row(zipped=[Row(vals1=1, vals2=2), Row(vals1=2, vals2=3), Row(vals1=3, vals2=4)])]
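Applied to the question's df, it would look something like this (a sketch; the cast pins the struct field names, which can otherwise vary across Spark versions):
from pyspark.sql.functions import arrays_zip, col

# zip the projected c1 and c3 arrays back into an array of structs
df2 = df.select(
    arrays_zip(col("arr.c1"), col("arr.c3"))
        .cast("array<struct<c1:string,c3:string>>")
        .alias("arr")
)
df2.printSchema()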
I have two dataframes, df1 and df2, whose schemas are as follows:
DF1 is of the form:
root
|-- slotSize2: struct (nullable = true)
| |-- 120x600: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 160x600: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
| |-- 250x250: struct (nullable = true)
| | |-- pView: string (nullable = true)
| |-- 300x250: struct (nullable = true)
| | |-- level: string (nullable = true)
| | |-- pATF: string (nullable = true)
| | |-- pView: string (nullable = true)
| | |-- pViewV1: string (nullable = true)
| | |-- sPos: string (nullable = true)
Dataframe df2 has schema :
root
|-- bidId: array (nullable = true)
| |-- element: string (containsNull = true)
|-- slotSize1: array (nullable = true)
| |-- element: string (containsNull = true)
In dataframe df2 we have slotSize (named slotSize1) as a plain string, while in dataframe df1 we have the nested form of slotSize, i.e. for every slot size there is a corresponding struct of view fields.
I want to join the two dataframes df1 and df2 to form a new dataframe df3 with schema (bidId, slotSize, viewMap), where bidId comes from df2, slotSize is of the form 120x600 and is present in both schemas, and viewMap corresponds to the nested struct for each slotSize in df1.
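One way to approach this (a sketch; it assumes df2's bidId and slotSize1 arrays are index-aligned, and it serializes the per-slot structs to JSON strings because their shapes differ from one slot size to the next):
from functools import reduce
from pyspark.sql.functions import arrays_zip, col, explode, lit, to_json

# one (slotSize, viewMap) row per field of the slotSize2 struct;
# backticks are needed because the field names start with digits
slot_dfs = [
    df1.select(lit(name).alias("slotSize"),
               to_json(col("slotSize2.`{}`".format(name))).alias("viewMap"))
    for name in df1.select("slotSize2.*").columns
]
df1_long = reduce(lambda a, b: a.unionByName(b), slot_dfs)

# explode df2's parallel arrays into one row per (bidId, slotSize) pair
df2_long = (df2
            .select(explode(arrays_zip("bidId", "slotSize1")).alias("z"))
            .select(col("z.bidId").alias("bidId"),
                    col("z.slotSize1").alias("slotSize")))

df3 = df2_long.join(df1_long, on="slotSize", how="left")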