access fields of an array within pyspark dataframe - pyspark

I am developing SQL queries against a Spark DataFrame that is based on a group of ORC files. The program goes like this:
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("test").getOrCreate()
sdf = spark_session.read.orc("../data/")
sdf.createOrReplaceTempView("test")
Now I have a table called "test". If I do something like:
spark_session.sql("select count(*) from test")
then the result will be fine. But I need to get more columns in the query, including some of the fields in the array.
In [8]: sdf.take(1)[0]["person"]
Out[8]:
[Row(name='name', value='tom'),
Row(name='age', value='20'),
Row(name='gender', value='m')]
I have tried something like:
spark_session.sql("select person.age, count(*) from test group by person.age")
But this does not work. My question is: how do I access the fields in the "person" array?
Thanks!
EDIT:
result of sdf.printSchema()
In [3]: sdf.printSchema()
root
|-- person: integer (nullable = true)
|-- customtags: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: string (nullable = true)
Error messages:
AnalysisException: 'No such struct field age in name, value; line 16 pos 8'

I don't know how to do this using only PySpark-SQL, but here is a way to do it using PySpark DataFrames.
Basically, we can convert the array-of-structs column into a MapType() column using the create_map() function. Then we can access the fields directly by key.
Consider the following example:
Define Schema
from pyspark.sql.types import StructType, StructField, ArrayType, StringType, IntegerType

schema = StructType([
    StructField('person', IntegerType()),
    StructField(
        'customtags',
        ArrayType(
            StructType([
                StructField('name', StringType()),
                StructField('value', StringType())
            ])
        )
    )
])
Create Example DataFrame
data = [
    (1, [{'name': 'name', 'value': 'tom'},
         {'name': 'age', 'value': '20'},
         {'name': 'gender', 'value': 'm'}]),
    (2, [{'name': 'name', 'value': 'jerry'},
         {'name': 'age', 'value': '20'},
         {'name': 'gender', 'value': 'm'}]),
    (3, [{'name': 'name', 'value': 'ann'},
         {'name': 'age', 'value': '20'},
         {'name': 'gender', 'value': 'f'}])
]

df = spark.createDataFrame(data, schema)  # assumes an active SparkSession named spark
df.show(truncate=False)
#+------+------------------------------------+
#|person|customtags |
#+------+------------------------------------+
#|1 |[[name,tom], [age,20], [gender,m]] |
#|2 |[[name,jerry], [age,20], [gender,m]]|
#|3 |[[name,ann], [age,20], [gender,f]] |
#+------+------------------------------------+
Convert the array-of-structs column to a map
from functools import reduce
from operator import add
import pyspark.sql.functions as f

df = df.withColumn(
    'customtags',
    f.create_map(
        *reduce(
            add,
            [[f.col('customtags')['name'][i],
              f.col('customtags')['value'][i]] for i in range(3)]
        )
    )
).select('person', 'customtags')
df.show(truncate=False)
#+------+------------------------------------------+
#|person|customtags |
#+------+------------------------------------------+
#|1 |Map(name -> tom, age -> 20, gender -> m) |
#|2 |Map(name -> jerry, age -> 20, gender -> m)|
#|3 |Map(name -> ann, age -> 20, gender -> f) |
#+------+------------------------------------------+
The catch here is that you have to know a priori the length of the ArrayType() (in this case 3), as I don't know of a way to dynamically loop over it. This also assumes that the array has the same length for all rows.
I had to use reduce(add, ...) here because create_map() expects its arguments in alternating pairs of the form (key, value).
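If you are on Spark 2.4 or later, map_from_entries() achieves the same conversion in one call, without knowing the array length a priori (a sketch; apply it to the original df instead of the create_map step above):
import pyspark.sql.functions as f

# Spark 2.4+: builds the map straight from the array of (name, value) structs,
# whatever the per-row length of the array
df = df.withColumn('customtags', f.map_from_entries('customtags'))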
Group by fields in the map column
df.groupBy((f.col('customtags')['name']).alias('name')).count().show()
#+-----+-----+
#| name|count|
#+-----+-----+
#| ann| 1|
#|jerry| 1|
#| tom| 1|
#+-----+-----+
df.groupBy((f.col('customtags')['gender']).alias('gender')).count().show()
#+------+-----+
#|gender|count|
#+------+-----+
#| m| 2|
#| f| 1|
#+------+-----+
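For reference, once customtags is a map you can also go back to plain SQL for the original group-by; the temp view name test_map below is just illustrative, and spark is assumed to be the active SparkSession:
df.createOrReplaceTempView("test_map")
spark.sql(
    "select customtags['age'] as age, count(*) as cnt "
    "from test_map group by customtags['age']"
).show()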

Related

Pyspark - How do I Flatten Nested Struct Column preserving parent name

If I have a dataframe with a struct column named structA, which in turn has 3 fields named a, b and c, and I want to flatten the struct, I can easily do that with df.select("structA.*") and it will display
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  3|  5|  6|
+---+---+---+
What I wanted is:
+---------+---------+---------+
|structA.a|structA.b|structA.c|
+---------+---------+---------+
|        1|        2|        3|
|        3|        5|        6|
+---------+---------+---------+
How can I do this?
I'm afraid it's not as straightforward as it should be. You'll need to loop through the schema to build the desired column names, then rename the columns in bulk. Something like this:
Sample dataset
df = spark.createDataFrame([
((1, 2, 3),),
((4, 5, 6),),
], 'structA struct<a:int, b:int, c:int>')
df.show()
df.printSchema()
+---------+
| structA|
+---------+
|{1, 2, 3}|
|{4, 5, 6}|
+---------+
root
|-- structA: struct (nullable = true)
| |-- a: integer (nullable = true)
| |-- b: integer (nullable = true)
| |-- c: integer (nullable = true)
from pyspark.sql import functions as F
struct_col = 'structA'
struct_cols = [[F.col(b.name).alias(f'{a.name}_{b.name}') for b in a.dataType.fields] for a in df.schema if a.name == struct_col][0]
# [Column<'a AS structA_a'>, Column<'b AS structA_b'>, Column<'c AS structA_c'>]
df.select(f'{struct_col}.*').select(struct_cols).show()
+---------+---------+---------+
|structA_a|structA_b|structA_c|
+---------+---------+---------+
| 1| 2| 3|
| 4| 5| 6|
+---------+---------+---------+
I did the following in order to do this:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
#create dataframe ----
df_data=[((1,2,3),(-1,-2,-3)),((4,5,6),(-4,-5,-6))]
structureSchema = StructType([
StructField('structA', StructType([
StructField('a', IntegerType(), True),
StructField('b', IntegerType(), True),
StructField('c', IntegerType(), True)
])),
StructField('structB', StructType([
StructField('a', IntegerType(), True),
StructField('b', IntegerType(), True),
StructField('c', IntegerType(), True)
]))
])
df=spark.createDataFrame(df_data,structureSchema)
df.show()
+---------+------------+
| structA| structB|
+---------+------------+
|[1, 2, 3]|[-1, -2, -3]|
|[4, 5, 6]|[-4, -5, -6]|
+---------+------------+
If we have multiple struct-like columns, we need to find them like this:
nested_cols = [c[0] for c in df.dtypes if c[1][:6] == 'struct']
nested_cols
['structA', 'structB']
Now I will create a JSON-like object (a plain dict) with the new structure:
struct_columns = {}
for struct_column in nested_cols:
    struct_columns[struct_column] = df.select(struct_column + ".*").columns
struct_columns
{'structA': ['a', 'b', 'c'], 'structB': ['a', 'b', 'c']}
Using that structure, I will create the flattened DataFrame:
flatten_df = df
for key in struct_columns:
    print(key)
    for column in struct_columns[key]:
        flatten_df = flatten_df.withColumn(key + "_" + column, F.expr(f"{key}.{column}"))
flatten_df.drop(*df.columns).show()
+---------+---------+---------+---------+---------+---------+
|structA_a|structA_b|structA_c|structB_a|structB_b|structB_c|
+---------+---------+---------+---------+---------+---------+
| 1| 2| 3| -1| -2| -3|
| 4| 5| 6| -4| -5| -6|
+---------+---------+---------+---------+---------+---------+
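The same flattened result can also be produced in a single select, without withColumn in a loop (a sketch reusing the nested_cols list from above):
flat_cols = [F.col(f"{s}.{c}").alias(f"{s}_{c}")
             for s in nested_cols
             for c in df.select(f"{s}.*").columns]
df.select(*flat_cols).show()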

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions

I got this error while using this code to drop a nested column with PySpark. Why is this not working? I tried using a tilde (~) instead of != as the error suggests, but that doesn't work either. So what do you do in that case?
def drop_col(df, struct_nm, delete_struct_child_col_nm):
    fields_to_keep = filter(lambda x: x != delete_struct_child_col_nm,
                            df.select("{}.*".format(struct_nm)).columns)
    fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
    return df.withColumn(struct_nm, struct(fields_to_keep))
I built a simple example with a struct column and a few dummy columns:
from pyspark import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, lit, col, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
schema = StructType(
[
StructField('addresses',
StructType(
[StructField("state", StringType(), True),
StructField("street", StringType(), True),
StructField("country", StringType(), True),
StructField("code", IntegerType(), True)]
)
)
]
)
rdd = [({'state': 'pa', 'street': 'market', 'country': 'USA', 'code': 100},),
({'state': 'ca', 'street': 'baker', 'country': 'USA', 'code': 101},)]
df = sql_context.createDataFrame(rdd, schema)
df = df.withColumn('id', monotonically_increasing_id())
df = df.withColumn('name', lit('test'))
print(df.show())
print(df.printSchema())
Output:
+--------------------+-----------+----+
| addresses| id|name|
+--------------------+-----------+----+
|[pa, market, USA,...| 8589934592|test|
|[ca, baker, USA, ...|25769803776|test|
+--------------------+-----------+----+
root
|-- addresses: struct (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- country: string (nullable = true)
| |-- code: integer (nullable = true)
|-- id: long (nullable = false)
|-- name: string (nullable = false)
To drop the whole struct column, you can simply use the drop function:
df2 = df.drop('addresses')
print(df2.show())
Output:
+-----------+----+
| id|name|
+-----------+----+
| 8589934592|test|
|25769803776|test|
+-----------+----+
To drop specific fields in a struct column, it's a bit more complicated; there are some other similar questions here:
Dropping a nested column from Spark DataFrame
Dropping nested column of Dataframe with PySpark
In any case, I found them to be a bit complicated - my approach would just be to reassign the original column with the subset of struct fields you want to keep:
columns_to_keep = ['country', 'code']
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+----------+-----------+----+
| addresses| id|name|
+----------+-----------+----+
|[USA, 100]| 8589934592|test|
|[USA, 101]|25769803776|test|
+----------+-----------+----+
Alternatively, if you just wanted to specify the columns you want to remove rather than the columns you want to keep:
columns_to_remove = ['country', 'code']
all_columns = df.select("addresses.*").columns
columns_to_keep = list(set(all_columns) - set(columns_to_remove))
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+------------+-----------+----+
| addresses| id|name|
+------------+-----------+----+
|[pa, market]| 8589934592|test|
| [ca, baker]|25769803776|test|
+------------+-----------+----+
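If you want this wrapped up as a reusable helper in the spirit of drop_col from the question, a minimal sketch (the function name is just illustrative) could be:
def drop_struct_field(df, struct_name, field_to_drop):
    # keep every field of the struct except the one being dropped
    fields_to_keep = [c for c in df.select(f"{struct_name}.*").columns
                      if c != field_to_drop]
    return df.withColumn(struct_name,
                         struct(*[f"{struct_name}.{c}" for c in fields_to_keep]))

# e.g. drop_struct_field(df, 'addresses', 'code').select('addresses').show()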
Hope this helps!

Unable to explode() Map[String, Struct] in Spark

Been struggling with this for a while and still can't wrap my mind around it.
I'm trying to flatMap (or use .withColumn with explode() instead, as it seems easier so I don't lose the column names), but I'm always getting the error UDTF expected 2 aliases but got 'name' instead.
I've revisited some similar questions, but none of them shed any light since their schemas are too simple.
The column of the schema I'm trying to flatMap over is the following one...
StructField(CarMake,
StructType(
List(
StructField(
Models,
MapType(
StringType,
StructType(
List(
StructField(Variant, StringType),
StructField(GasOrPetrol, StringType)
)
)
)
)
)
))
What I'm trying to achieve by calling explode() like this...
carsDS
.withColumn("modelsAndVariant", explode($"carmake.models"))
...is to get a Row without that nested Map and Struct, so I end up with as many rows as there are variants.
Example input
(country: Sweden, carMake: Volvo, carMake.Models: {"850": ("T5", "petrol"), "V50": ("T5", "petrol")})
Example output
(country: Sweden, carMake: Volvo, Model: "850", Variant: "T5", GasOrPetrol: "petrol")
(country: Sweden, carMake: Volvo, Model: "V50", Variant: "T5", GasOrPetrol: "petrol")
Basically leaving the nested Map with its inner Struct all in the same level.
Try this:
case class Models(variant:String, gasOrPetrol:String)
case class CarMake(brand:String, models : Map[String, Models] )
case class MyRow(carMake:CarMake)
val df = List(
MyRow(CarMake("volvo",Map(
"850" -> Models("T5","petrol"),
"V50" -> Models("T5","petrol")
)))
).toDF()
df.printSchema()
df.show()
gives
root
|-- carMake: struct (nullable = true)
| |-- brand: string (nullable = true)
| |-- models: map (nullable = true)
| | |-- key: string
| | |-- value: struct (valueContainsNull = true)
| | | |-- variant: string (nullable = true)
| | | |-- gasOrPetrol: string (nullable = true)
+--------------------+
| carMake|
+--------------------+
|[volvo, [850 -> [...|
+--------------------+
Now explode. Note that withColumn does not work here, because explode on a map returns 2 columns (key and value), so you need to use select:
val cols: Array[Column] = df.columns.map(col)
df
.select((cols:+explode($"carMake.models")):_*)
.select((cols:+$"key".as("model"):+$"value.*"):_*)
.show()
gives:
+--------------------+-----+-------+-----------+
| carMake|model|variant|gasOrPetrol|
+--------------------+-----+-------+-----------+
|[volvo, [850 -> [...| 850| T5| petrol|
|[volvo, [850 -> [...| V50| T5| petrol|
+--------------------+-----+-------+-----------+
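For anyone who needs the same thing from PySpark, a rough equivalent of the select-based explode (assuming a PySpark DataFrame df with the same carMake schema) is:
exploded = df.selectExpr("carMake", "explode(carMake.models) AS (model, info)")
exploded.select("carMake.brand", "model", "info.*").show()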

Spark scala dataframe: Merging multiple columns into single column

I have a spark dataframe which looks something like below:
+---+------+----+
| id|animal|talk|
+---+------+----+
| 1| bat|done|
| 2| mouse|mone|
| 3| horse| gun|
| 4| horse|some|
+---+------+----+
I want to generate a new column, say merged which would look something like
+---+-----------------------------------------------------------+
| id| merged columns |
+---+-----------------------------------------------------------+
| 1| [{name: animal, value: bat}, {name: talk, value: done}] |
| 2| [{name: animal, value: mouse}, {name: talk, value: mone}] |
| 3| [{name: animal, value: horse}, {name: talk, value: gun}] |
| 4| [{name: animal, value: horse}, {name: talk, value: some}] |
+---+-----------------------------------------------------------+
Basically, combining all the columns into an Array of case class merged(name:String, value: String).
Can anyone help me with how to do this in Scala?
Here, for simplicity, I have used only two columns, but a generic answer that works for N columns would help greatly.
If I understand your expected output correctly, you want a list of name-value structured objects. Consider using foldLeft to iteratively convert the wanted columns to StructType name-value columns, and group them into an ArrayType column:
import org.apache.spark.sql.functions._
val df = Seq(
(1, "bat", "done"),
(2, "mouse", "mone"),
(3, "horse", "gun"),
(4, "horse", "some")
).toDF("id", "animal", "talk")
val cols = df.columns.filter(_ != "id")
val resultDF = cols.
foldLeft(df)( (accDF, c) =>
accDF.withColumn(c, struct(lit(c).as("name"), col(c).as("value")))
).
select($"id", array(cols.map(col): _*).as("merged"))
resultDF.show(false)
// +---+-----------------------------+
// |id |merged |
// +---+-----------------------------+
// |1 |[[animal,bat], [talk,done]] |
// |2 |[[animal,mouse], [talk,mone]]|
// |3 |[[animal,horse], [talk,gun]] |
// |4 |[[animal,horse], [talk,some]]|
// +---+-----------------------------+
resultDF.printSchema
// root
// |-- id: integer (nullable = false)
// |-- merged: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- name: string (nullable = false)
// | | |-- value: string (nullable = true)
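A PySpark sketch of the same foldLeft idea, in case you need it from Python (assumes a PySpark DataFrame df with the same id/animal/talk columns):
from functools import reduce
import pyspark.sql.functions as F

cols = [c for c in df.columns if c != "id"]
merged_df = reduce(
    lambda acc, c: acc.withColumn(c, F.struct(F.lit(c).alias("name"), F.col(c).alias("value"))),
    cols,
    df,
).select("id", F.array(*[F.col(c) for c in cols]).alias("merged"))
merged_df.show(truncate=False)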

Join/unfolded mapType column in spark back with the original dataframe

I have a dataframe in (py)Spark, where one of the columns is of type 'map'. I want to flatten or split that column into multiple columns, which should be added to the original dataframe. I'm able to unfold the column with flatMap, however I lose the key needed to join the new dataframe (from the unfolded column) back to the original dataframe.
My schema is like this:
root
|-- key: string (nullable = true)
|-- metric: map (nullable = false)
| |-- key: string
| |-- value: float (valueContainsNull = true)
As you can see, the column 'metric' is a map-field. This is the column that I want to flatten. Before flattening it looks like:
+----+---------------------------------------------------+
|key |metric |
+----+---------------------------------------------------+
|123k|Map(metric1 -> 1.3, metric2 -> 6.3, metric3 -> 7.6)|
|d23d|Map(metric1 -> 1.5, metric2 -> 2.0, metric3 -> 2.2)|
|as3d|Map(metric1 -> 2.2, metric2 -> 4.3, metric3 -> 9.0)|
+----+---------------------------------------------------+
To convert that field to columns I do
df2.select('metric').rdd.flatMap(lambda x: x).toDF().show()
which gives
+------------------+-----------------+-----------------+
| metric1| metric2| metric3|
+------------------+-----------------+-----------------+
|1.2999999523162842|6.300000190734863|7.599999904632568|
| 1.5| 2.0|2.200000047683716|
| 2.200000047683716|4.300000190734863| 9.0|
+------------------+-----------------+-----------------+
However, I don't see the key, so I don't know how to add this data back to the original dataframe.
What I want is:
+----+-------+-------+-------+
| key|metric1|metric2|metric3|
+----+-------+-------+-------+
|123k| 1.3| 6.3| 7.6|
|d23d| 1.5| 2.0| 2.2|
|as3d| 2.2| 4.3| 9.0|
+----+-------+-------+-------+
My question thus is: how can I get from df2 back to df (given that I originally don't know df and only have df2)?
To make df2:
from pyspark.sql.types import StructType, StructField, StringType, FloatType

# sc and sqlContext come from the running (py)spark shell
rdd = sc.parallelize([('123k', 1.3, 6.3, 7.6),
                      ('d23d', 1.5, 2.0, 2.2),
                      ('as3d', 2.2, 4.3, 9.0)])
schema = StructType([StructField('key', StringType(), True),
                     StructField('metric1', FloatType(), True),
                     StructField('metric2', FloatType(), True),
                     StructField('metric3', FloatType(), True)])
df = sqlContext.createDataFrame(rdd, schema)
from pyspark.sql.functions import lit, col, create_map
from itertools import chain
metric = create_map(list(chain(*(
(lit(name), col(name)) for name in df.columns if "metric" in name
)))).alias("metric")
df2 = df.select("key", metric)
from pyspark.sql.functions import explode
# fetch column names of the original dataframe from keys of MapType 'metric' column
col_names = df2.select(explode("metric")).select("key").distinct().sort("key").rdd.flatMap(lambda x: x).collect()
exprs = [col("key")] + [col("metric").getItem(k).alias(k) for k in col_names]
df2_to_original_df = df2.select(*exprs)
df2_to_original_df.show()
Output is:
+----+-------+-------+-------+
| key|metric1|metric2|metric3|
+----+-------+-------+-------+
|123k| 1.3| 6.3| 7.6|
|d23d| 1.5| 2.0| 2.2|
|as3d| 2.2| 4.3| 9.0|
+----+-------+-------+-------+
I can select a certain key from a MapType column by doing:
df.select('maptypecolumn.key')
In my example I did it as follows:
columns = df2.select('metric').rdd.flatMap(lambda x: x).toDF().columns
for i in columns:
    df2 = df2.withColumn(i, lit(df2.metric[i]))
You can access key and value for example like this:
from pyspark.sql.functions import explode
df.select(explode("custom_dimensions")).select("key")
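Applied to the metric example above, that pattern yields one row per map entry (a sketch; the aliases are illustrative):
df2.selectExpr("key", "explode(metric) AS (metric_name, metric_value)").show()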