pyspark: groupby and aggregate avg and first on multiple columns

I have the following sample PySpark dataframe. After a groupby I want to calculate the mean and the first value of multiple columns. In the real case I have hundreds of columns, so I can't do it column by column.
sp = spark.createDataFrame([['a',2,4,'cc','anc'], ['a',4,7,'cd','abc'], ['b',6,0,'as','asd'], ['b', 2, 4, 'ad','acb'],
['c', 4, 4, 'sd','acc']], ['id', 'col1', 'col2','col3', 'col4'])
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| a| 2| 4| cc| anc|
| a| 4| 7| cd| abc|
| b| 6| 0| as| asd|
| b| 2| 4| ad| acb|
| c| 4| 4| sd| acc|
+---+----+----+----+----+
This is what I am trying
mean_cols = ['col1', 'col2']
first_cols = ['col3', 'col4']
sc.groupby('id').agg(*[ f.mean for col in mean_cols], *[f.first for col in first_cols])
but it's not working. How can I do this with PySpark?

The best way for multiple functions on multiple columns is to use the .agg(*expr) format.
import pyspark.sql.functions as F
# Test data (col4 is the grouping key; values chosen to match the output shown below)
tst = sqlContext.createDataFrame([(1,2,3,4),(3,4,5,4),(5,6,7,5),(7,8,9,5)],schema=['col1','col2','col3','col4'])
# Functions to apply and the columns to apply them to
fn_l = [F.min,F.max,F.mean,F.first]
col_l=['col1','col2','col3']
# One aliased expression per (function, column) pair, e.g. min_col1
expr = [fn(coln).alias(str(fn.__name__)+'_'+str(coln)) for fn in fn_l for coln in col_l]
tst_r = tst.groupby('col4').agg(*expr)
The result will be
tst_r.show()
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
|col4|min_col1|min_col2|min_col3|max_col1|max_col2|max_col3|mean_col1|mean_col2|mean_col3|first_col1|first_col2|first_col3|
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
| 5| 5| 6| 7| 7| 8| 9| 6.0| 7.0| 8.0| 5| 6| 7|
| 4| 1| 2| 3| 3| 4| 5| 2.0| 3.0| 4.0| 1| 2| 3|
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
To apply different functions to different columns, build several expression lists and concatenate them in the aggregation.
fn_l = [F.min,F.max]
fn_2=[F.mean,F.first]
col_l=['col1','col2']
col_2=['col1','col3','col4']
expr1 = [fn(coln).alias(str(fn.__name__)+'_'+str(coln)) for fn in fn_l for coln in col_l]
expr2 = [fn(coln).alias(str(fn.__name__)+'_'+str(coln)) for fn in fn_2 for coln in col_2]
tst_r = tst.groupby('col4').agg(*(expr1+expr2))

A simpler way to do it, using the dataframe and column lists from the question:
import pyspark.sql.functions as F
mean_cols = ['col1', 'col2']
first_cols = ['col3', 'col4']
res = (sp.groupby('id')
    .agg(*[F.mean(c).alias(f"{c}_mean") for c in mean_cols],
         *[F.first(c).alias(f"{c}_first") for c in first_cols]))
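If the list of aggregates is easier to manage as plain strings (for example when it comes from a config file), the same idea can be sketched with Spark SQL expression strings. The helper name build_agg_exprs and the alias scheme are assumptions made for illustration, not something from the answers above; the strings are meant to be passed through F.expr.

```python
def build_agg_exprs(mean_cols, first_cols):
    """Return Spark SQL aggregate expressions as plain strings.

    Hypothetical helper: the name and the alias scheme are illustrative only.
    Backticks protect column names that contain spaces or dots.
    """
    exprs = [f"avg(`{c}`) AS `{c}_mean`" for c in mean_cols]
    exprs += [f"first(`{c}`) AS `{c}_first`" for c in first_cols]
    return exprs


exprs = build_agg_exprs(["col1", "col2"], ["col3", "col4"])
for e in exprs:
    print(e)

# With a live SparkSession and the question's DataFrame `sp` (assumed to exist):
#   import pyspark.sql.functions as F
#   sp.groupBy("id").agg(*[F.expr(e) for e in exprs]).show()
```

Keep in mind that first depends on row order, which Spark does not guarantee after a shuffle, so the "first" value per group may differ between runs unless the data is ordered explicitly.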

Related

Dataframe transform column for each day into rows [duplicate]

Is there an equivalent of Pandas Melt function in Apache Spark in PySpark or at least in Scala?
I was running a sample dataset till now in Python and now I want to use Spark for the entire dataset.
Spark >= 3.4
In Spark 3.4 or later you can use the built-in melt method:
(sdf
.melt(
ids=['A'], values=['B', 'C'],
variableColumnName="variable",
valueColumnName="value")
.show())
+---+--------+-----+
| A|variable|value|
+---+--------+-----+
| a| B| 1|
| a| C| 2|
| b| B| 3|
| b| C| 4|
| c| B| 5|
| c| C| 6|
+---+--------+-----+
This method is available across all APIs so could be used in Scala
sdf.melt(Array($"A"), Array($"B", $"C"), "variable", "value")
or SQL
SELECT * FROM sdf UNPIVOT (val FOR col in (col_1, col_2))
Spark 3.2 (Python only, requires Pandas and pyarrow)
(sdf
.to_koalas()
.melt(id_vars=['A'], value_vars=['B', 'C'])
.to_spark()
.show())
+---+--------+-----+
| A|variable|value|
+---+--------+-----+
| a| B| 1|
| a| C| 2|
| b| B| 3|
| b| C| 4|
| c| B| 5|
| c| C| 6|
+---+--------+-----+
Spark < 3.2
There is no built-in melt function (in SQL you can use the stack function, for example through selectExpr, but it has no dedicated method in the DataFrame API), but it is trivial to roll your own. Required imports:
from pyspark.sql.functions import array, col, explode, lit, struct
from pyspark.sql import DataFrame
from typing import Iterable
Example implementation:
def melt(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str="variable", value_name: str="value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""
    # Create array<struct<variable: str, value: ...>>
    _vars_and_vals = array(*(
        struct(lit(c).alias(var_name), col(c).alias(value_name))
        for c in value_vars))
    # Add to the DataFrame and explode
    _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
    cols = id_vars + [
        col("_vars_and_vals")[x].alias(x) for x in [var_name, value_name]]
    return _tmp.select(*cols)
And some tests (based on Pandas doctests):
import pandas as pd
pdf = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6}})
pd.melt(pdf, id_vars=['A'], value_vars=['B', 'C'])
A variable value
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6
sdf = spark.createDataFrame(pdf)
melt(sdf, id_vars=['A'], value_vars=['B', 'C']).show()
+---+--------+-----+
| A|variable|value|
+---+--------+-----+
| a| B| 1|
| a| C| 2|
| b| B| 3|
| b| C| 4|
| c| B| 5|
| c| C| 6|
+---+--------+-----+
Note: For use with legacy Python versions remove type annotations.
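To make the expected shape concrete without a cluster, here is a plain-Python model of what the melt above does to each row. This is only a sketch for checking results by hand: melt_rows is a hypothetical name, and rows are represented as dictionaries rather than Spark Rows. It produces one output row per (input row, value column) pair, which is why the Spark output is grouped by row (a B, a C, b B, ...) while pandas lists all of B before C.

```python
def melt_rows(rows, id_vars, value_vars, var_name="variable", value_name="value"):
    """Plain-Python model of melt: one output row per (input row, value column)."""
    out = []
    for row in rows:
        ids = {k: row[k] for k in id_vars}  # identifier columns are copied unchanged
        for c in value_vars:
            out.append({**ids, var_name: c, value_name: row[c]})
    return out


rows = [
    {"A": "a", "B": 1, "C": 2},
    {"A": "b", "B": 3, "C": 4},
    {"A": "c", "B": 5, "C": 6},
]
melted = melt_rows(rows, id_vars=["A"], value_vars=["B", "C"])
for r in melted:
    print(r)
```

Spark itself does not promise any row order, so compare results as sets (or sort both sides) rather than relying on the order shown by show().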
Related:
R SparkR - equivalent to melt function
Gather in sparklyr
Came across this question in my search for an implementation of melt in Spark for Scala.
Posting my Scala port in case someone also stumbles upon this.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame}
/** Extends the [[org.apache.spark.sql.DataFrame]] class
 *
 *  @param df the data frame to melt
 */
implicit class DataFrameFunctions(df: DataFrame) {
    /** Convert [[org.apache.spark.sql.DataFrame]] from wide to long format.
     *
     *  melt is (kind of) the inverse of pivot
     *  melt is currently (02/2017) not implemented in spark
     *
     *  @see reshape package in R (https://cran.r-project.org/web/packages/reshape/index.html)
     *  @see this is a scala adaptation of http://stackoverflow.com/questions/41670103/pandas-melt-function-in-apache-spark
     *
     *  @todo method overloading for simple calling
     *
     *  @param id_vars the columns to preserve
     *  @param value_vars the columns to melt
     *  @param var_name the name for the column holding the melted columns names
     *  @param value_name the name for the column holding the values of the melted columns
     *
     */
    def melt(
            id_vars: Seq[String], value_vars: Seq[String],
            var_name: String = "variable", value_name: String = "value") : DataFrame = {
        // Create array<struct<variable: str, value: ...>>
        val _vars_and_vals = array((for (c <- value_vars) yield { struct(lit(c).alias(var_name), col(c).alias(value_name)) }): _*)
        // Add to the DataFrame and explode
        val _tmp = df.withColumn("_vars_and_vals", explode(_vars_and_vals))
        val cols = id_vars.map(col _) ++ { for (x <- List(var_name, value_name)) yield { col("_vars_and_vals")(x).alias(x) }}
        return _tmp.select(cols: _*)
    }
}
Since I am not that advanced in Scala, I'm sure there is room for improvement.
Any comments are welcome.
I voted for user6910411's answer. It works as expected; however, it cannot handle None values well, so I refactored the melt function as follows:
from pyspark.sql.functions import array, col, explode, lit
from pyspark.sql.functions import create_map
from pyspark.sql import DataFrame
from typing import Iterable
from itertools import chain
def melt(
        df: DataFrame,
        id_vars: Iterable[str], value_vars: Iterable[str],
        var_name: str="variable", value_name: str="value") -> DataFrame:
    """Convert :class:`DataFrame` from wide to long format."""
    # Create map<key: value>
    _vars_and_vals = create_map(
        list(chain.from_iterable([
            [lit(c), col(c)] for c in value_vars]
        ))
    )
    _tmp = df.select(*id_vars, explode(_vars_and_vals)) \
        .withColumnRenamed('key', var_name) \
        .withColumnRenamed('value', value_name)
    return _tmp
Test it with the following dataframe:
import pandas as pd
pdf = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
'B': {0: 1, 1: 3, 2: 5},
'C': {0: 2, 1: 4, 2: 6},
'D': {1: 7, 2: 9}})
pd.melt(pdf, id_vars=['A'], value_vars=['B', 'C', 'D'])
A variable value
0 a B 1.0
1 b B 3.0
2 c B 5.0
3 a C 2.0
4 b C 4.0
5 c C 6.0
6 a D NaN
7 b D 7.0
8 c D 9.0
sdf = spark.createDataFrame(pdf)
melt(sdf, id_vars=['A'], value_vars=['B', 'C', 'D']).show()
+---+--------+-----+
| A|variable|value|
+---+--------+-----+
| a| B| 1.0|
| a| C| 2.0|
| a| D| NaN|
| b| B| 3.0|
| b| C| 4.0|
| b| D| 7.0|
| c| B| 5.0|
| c| C| 6.0|
| c| D| 9.0|
+---+--------+-----+
UPD
Finally, I found the most effective implementation for my case. It uses all of the cluster resources available in my YARN configuration.
from pyspark.sql.functions import explode
def melt(df):
    sp = df.columns[1:]
    return (df
            .rdd
            .map(lambda x: [str(x[0]), [(str(i[0]),
                                         float(i[1] if i[1] else 0)) for i in zip(sp, x[1:])]],
                 preservesPartitioning = True)
            .toDF()
            .withColumn('_2', explode('_2'))
            .rdd.map(lambda x: [str(x[0]),
                                str(x[1][0]),
                                float(x[1][1] if x[1][1] else 0)],
                     preservesPartitioning = True)
            .toDF()
            )
For a very wide dataframe, I saw performance degrade at the _vars_and_vals generation step in user6910411's answer.
Implementing the melt via selectExpr helped:
columns=['a', 'b', 'c', 'd', 'e', 'f']
pd_df = pd.DataFrame([[1,2,3,4,5,6], [4,5,6,7,9,8], [7,8,9,1,2,4], [8,3,9,8,7,4]], columns=columns)
df = spark.createDataFrame(pd_df)
+---+---+---+---+---+---+
| a| b| c| d| e| f|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 5| 6|
| 4| 5| 6| 7| 9| 8|
| 7| 8| 9| 1| 2| 4|
| 8| 3| 9| 8| 7| 4|
+---+---+---+---+---+---+
cols = df.columns[1:]
df.selectExpr('a', "stack({}, {})".format(len(cols), ', '.join(("'{}', {}".format(i, i) for i in cols))))
+---+----+----+
| a|col0|col1|
+---+----+----+
| 1| b| 2|
| 1| c| 3|
| 1| d| 4|
| 1| e| 5|
| 1| f| 6|
| 4| b| 5|
| 4| c| 6|
| 4| d| 7|
| 4| e| 9|
| 4| f| 8|
| 7| b| 8|
| 7| c| 9|
...
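The string passed to selectExpr above gets hard to read as the column list grows, and the output columns end up with the default names col0 and col1. A small sketch of a builder that names the output columns and optionally casts the values follows. build_stack_expr and its parameters are hypothetical names chosen for this example; the cast is there because stack expects the value columns to share a compatible type.

```python
def build_stack_expr(value_cols, var_name="variable", value_name="value", cast_to=None):
    """Build a Spark SQL stack(...) expression string for selectExpr.

    Hypothetical helper. `cast_to` (e.g. "string") wraps every value column in
    CAST so that columns of different types can share one output column.
    Column names containing a single quote are not handled here.
    """
    if not value_cols:
        raise ValueError("value_cols must not be empty")
    pairs = []
    for c in value_cols:
        value = f"CAST(`{c}` AS {cast_to})" if cast_to else f"`{c}`"
        pairs.append(f"'{c}', {value}")
    return f"stack({len(value_cols)}, {', '.join(pairs)}) AS (`{var_name}`, `{value_name}`)"


expr = build_stack_expr(["b", "c"], var_name="col", value_name="val")
print(expr)

# With the DataFrame `df` from above (assumed to exist):
#   df.selectExpr("a", build_stack_expr(df.columns[1:])).show()
```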
Use a list comprehension to create an array of structs holding each column name and its value, then explode the new column with the inline function. Code below:
melted_df = (df.withColumn(
    # Create an array of structs of column names and corresponding values
    'tab', F.array(*[F.struct(F.lit(x).alias('var'), F.col(x).alias('val')) for x in df.columns if x != 'A']))
    # Explode the array of structs into rows and columns
    .selectExpr('A', "inline(tab)")
)
melted_df.show()
+---+---+---+
| A|var|val|
+---+---+---+
| a| B| 1|
| a| C| 2|
| b| B| 3|
| b| C| 4|
| c| B| 5|
| c| C| 6|
+---+---+---+
1) Copy & paste
2) Change the first 2 variables
to_melt = {'latin', 'greek', 'chinese'}
new_names = ['lang', 'letter']
melt_str = ','.join([f"'{c}', `{c}`" for c in to_melt])
df = df.select(
*(set(df.columns) - to_melt),
F.expr(f"stack({len(to_melt)}, {melt_str}) ({','.join(new_names)})")
)
If some values are null, the result contains rows with a null value. To remove them, add this:
.filter(f"{new_names[1]} is not null")
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([(101, "A", "Σ", "西"), (102, "B", "Ω", "诶")], ['ID', 'latin', 'greek', 'chinese'])
df.show()
# +---+-----+-----+-------+
# | ID|latin|greek|chinese|
# +---+-----+-----+-------+
# |101| A| Σ| 西|
# |102| B| Ω| 诶|
# +---+-----+-----+-------+
to_melt = {'latin', 'greek', 'chinese'}
new_names = ['lang', 'letter']
melt_str = ','.join([f"'{c}', `{c}`" for c in to_melt])
df = df.select(
*(set(df.columns) - to_melt),
F.expr(f"stack({len(to_melt)}, {melt_str}) ({','.join(new_names)})")
)
df.show()
# +---+-------+------+
# | ID| lang|letter|
# +---+-------+------+
# |101| latin| A|
# |101| greek| Σ|
# |101|chinese| 西|
# |102| latin| B|
# |102| greek| Ω|
# |102|chinese| 诶|
# +---+-------+------+

PySpark: How to groupby with Or in columns

I want to group by in PySpark, but the value can appear in more than one column, so a row should be grouped under a value if that value appears in any of the selected columns.
For example, if I have this table in Pyspark:
I want to sum the visits and investments for each ID, so that the result would be:
Note that the result for ID 1 is the sum of rows 0, 1 and 3, which have the ID 1 in one of the first three columns [ID 1 Visits = 500 + 100 + 200 = 800].
The result for ID 2 is the sum of rows 1 and 2, and so on.
Note 1: For the sake of simplicity my example is a small dataframe, but the real one is much larger, with many rows and many variables, and it needs other operations, not just "sum".
This can't be done in pandas because the data is too large; it has to be done in PySpark.
Note 2: For illustration I printed the tables with pandas, but the real data is in PySpark.
I appreciate all the help and thank you very much in advance.
First of all let's create our test dataframe.
>>> import pandas as pd
>>> data = {
...     "ID1": [1, 2, 5, 1],
...     "ID2": [1, 1, 3, 3],
...     "ID3": [4, 3, 2, 4],
...     "Visits": [500, 100, 200, 200],
...     "Investment": [1000, 200, 400, 200]
... }
>>> df = spark.createDataFrame(pd.DataFrame(data))
>>> df.show()
+---+---+---+------+----------+
|ID1|ID2|ID3|Visits|Investment|
+---+---+---+------+----------+
| 1| 1| 4| 500| 1000|
| 2| 1| 3| 100| 200|
| 5| 3| 2| 200| 400|
| 1| 3| 4| 200| 200|
+---+---+---+------+----------+
Once we have a DataFrame to operate on, we define a function that returns the list of unique IDs from columns ID1, ID2 and ID3.
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import ArrayType, IntegerType
>>> @F.udf(returnType=ArrayType(IntegerType()))
... def ids_list(*cols):
...     return list(set(cols))
Now it's time to apply our udf on a DataFrame.
>>> df = df.withColumn('ids', ids_list('ID1', 'ID2', 'ID3'))
>>> df.show()
+---+---+---+------+----------+---------+
|ID1|ID2|ID3|Visits|Investment| ids|
+---+---+---+------+----------+---------+
| 1| 1| 4| 500| 1000| [1, 4]|
| 2| 1| 3| 100| 200|[1, 2, 3]|
| 5| 3| 2| 200| 400|[2, 3, 5]|
| 1| 3| 4| 200| 200|[1, 3, 4]|
+---+---+---+------+----------+---------+
To make use of ids column we have to explode it into separate rows and drop ids column.
>>> df = df.withColumn("ID", F.explode('ids')).drop('ids')
>>> df.show()
+---+---+---+------+----------+---+
|ID1|ID2|ID3|Visits|Investment| ID|
+---+---+---+------+----------+---+
| 1| 1| 4| 500| 1000| 1|
| 1| 1| 4| 500| 1000| 4|
| 2| 1| 3| 100| 200| 1|
| 2| 1| 3| 100| 200| 2|
| 2| 1| 3| 100| 200| 3|
| 5| 3| 2| 200| 400| 2|
| 5| 3| 2| 200| 400| 3|
| 5| 3| 2| 200| 400| 5|
| 1| 3| 4| 200| 200| 1|
| 1| 3| 4| 200| 200| 3|
| 1| 3| 4| 200| 200| 4|
+---+---+---+------+----------+---+
Finally we have to group our DataFrame by ID column and calculate sums. Final result is ordered by ID.
>>> final_df = (
... df.groupBy('ID')
... .agg( F.sum('Visits'), F.sum('Investment') )
... .orderBy('ID')
... )
>>> final_df.show()
+---+-----------+---------------+
| ID|sum(Visits)|sum(Investment)|
+---+-----------+---------------+
| 1| 800| 1400|
| 2| 300| 600|
| 3| 500| 800|
| 4| 700| 1200|
| 5| 200| 400|
+---+-----------+---------------+
I hope you find it useful.
You can do something like the following:
1) Create an array of all ID columns (the ids column below)
2) Explode the ids column
3) You will now get duplicates; to avoid aggregating them twice, use distinct
4) Finally, group by the ids column and perform all your aggregations
Note: If your dataset can contain exact duplicate rows, add a column with df.withColumn('uid', f.monotonically_increasing_id()) before creating the array; otherwise distinct will drop them.
Example for your dataset:
import pyspark.sql.functions as f
df.withColumn('ids', f.explode(f.array('id1','id2','id3'))).distinct().groupBy('ids').agg(f.sum('visits'), f.sum('investments')).orderBy('ids').show()
+---+-----------+----------------+
|ids|sum(visits)|sum(investments)|
+---+-----------+----------------+
| 1| 800| 1400|
| 2| 300| 600|
| 3| 500| 800|
| 4| 700| 1200|
| 5| 200| 400|
+---+-----------+----------------+
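Both answers rely on the same rule: each row contributes once to every distinct ID it contains. Below is a minimal sketch that models this rule in plain Python, so the expected sums can be checked without a cluster, together with a SQL string that removes repeated IDs inside a row using array_distinct (available in Spark 2.4 and later). The helper names explode_ids_expr and expected_sums are assumptions for this example, and the rows are the test data from the first answer.

```python
from collections import defaultdict


def explode_ids_expr(id_cols, out_name="ID"):
    """SQL string that explodes the distinct IDs of each row (hypothetical helper)."""
    cols = ", ".join(f"`{c}`" for c in id_cols)
    return f"explode(array_distinct(array({cols}))) AS `{out_name}`"


def expected_sums(rows, id_cols, value_cols):
    """Plain-Python model: each row counts once for every distinct ID it holds."""
    totals = defaultdict(lambda: [0] * len(value_cols))
    for row in rows:
        for key in {row[c] for c in id_cols}:  # the set drops IDs repeated within a row
            for i, v in enumerate(value_cols):
                totals[key][i] += row[v]
    return {k: tuple(v) for k, v in sorted(totals.items())}


rows = [
    {"ID1": 1, "ID2": 1, "ID3": 4, "Visits": 500, "Investment": 1000},
    {"ID1": 2, "ID2": 1, "ID3": 3, "Visits": 100, "Investment": 200},
    {"ID1": 5, "ID2": 3, "ID3": 2, "Visits": 200, "Investment": 400},
    {"ID1": 1, "ID2": 3, "ID3": 4, "Visits": 200, "Investment": 200},
]
result = expected_sums(rows, ["ID1", "ID2", "ID3"], ["Visits", "Investment"])
print(result)
print(explode_ids_expr(["ID1", "ID2", "ID3"]))

# With the DataFrame `df` from the first answer (assumed to exist):
#   (df.selectExpr("*", explode_ids_expr(["ID1", "ID2", "ID3"]))
#      .groupBy("ID").sum("Visits", "Investment").orderBy("ID").show())
```

Because the duplicates are removed inside each row's array, this variant needs neither a Python UDF nor a distinct() over whole rows, so rows that are genuinely identical are still counted separately.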

create column in pyspark based on conditions [duplicate]

I have a PySpark Dataframe with two columns:
+---+----+
| Id|Rank|
+---+----+
| a| 5|
| b| 7|
| c| 8|
| d| 1|
+---+----+
For each row, I'm looking to replace Id column with "other" if Rank column is larger than 5.
If I use pseudocode to explain:
For row in df:
    if row.Rank > 5:
        then replace(row.Id, "other")
The result should look like this:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Any clue how to achieve this? Thanks!!!
To create this Dataframe:
df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank'])
You can use when and otherwise like -
from pyspark.sql.functions import *
df\
.withColumn('Id_New',when(df.Rank <= 5,df.Id).otherwise('other'))\
.drop(df.Id)\
.select(col('Id_New').alias('Id'),col('Rank'))\
.show()
this gives output as -
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Starting with @Pushkr's solution, couldn't you just use the following?
from pyspark.sql.functions import *
df.withColumn('Id',when(df.Rank <= 5,df.Id).otherwise('other')).show()
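The same rule can also be written as a SQL CASE expression, which some people find easier to read than chained when/otherwise calls. This is only a sketch: case_when_expr is a hypothetical helper, and the resulting string is meant to be passed to F.expr.

```python
def case_when_expr(target, condition, replacement):
    """Build a CASE expression that replaces `target` when `condition` holds.

    Hypothetical helper. The replacement is inlined as a SQL string literal,
    so values containing a single quote are rejected instead of escaped.
    """
    if "'" in replacement:
        raise ValueError("replacement must not contain a single quote")
    return f"CASE WHEN {condition} THEN '{replacement}' ELSE `{target}` END"


expr = case_when_expr("Id", "`Rank` > 5", "other")
print(expr)

# With the DataFrame `df` from the question (assumed to exist):
#   import pyspark.sql.functions as F
#   df.withColumn("Id", F.expr(expr)).show()
```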

Split large array columns into multiple columns - Pyspark

I have:
+---+-------+-------+
| id| var1| var2|
+---+-------+-------+
| a|[1,2,3]|[1,2,3]|
| b|[2,3,4]|[2,3,4]|
+---+-------+-------+
I want:
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
The solution provided in "How to split a list to multiple columns in Pyspark?":
df1.select('id', df1.var1[0], df1.var1[1], ...).show()
works, but some of my arrays are very long (max 332).
How can I write this so that it takes account of all length arrays?
This solution works regardless of the number of initial columns and the size of your arrays. Moreover, if a column holds arrays of different sizes (e.g. [1,2], [3,4,5]), the result has the maximum number of columns, with null values filling the gaps.
from pyspark.sql import functions as F
df = spark.createDataFrame(sc.parallelize([['a', [1,2,3], [1,2,3]], ['b', [2,3,4], [2,3,4]]]), ["id", "var1", "var2"])
columns = df.drop('id').columns
df_sizes = df.select(*[F.size(col).alias(col) for col in columns])
df_max = df_sizes.agg(*[F.max(col).alias(col) for col in columns])
max_dict = df_max.collect()[0].asDict()
df_result = df.select('id', *[df[col][i] for col in columns for i in range(max_dict[col])])
df_result.show()
>>>
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
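Column names such as var1[0] are awkward to reference later because of the brackets. As a sketch, the per-element expressions can be generated as strings with friendlier aliases and passed to selectExpr. array_item_exprs and the var1_0 naming scheme are assumptions for this example; max_sizes plays the role of max_dict from the answer above.

```python
def array_item_exprs(max_sizes, sep="_"):
    """Return selectExpr strings such as "`var1`[0] AS `var1_0`".

    `max_sizes` maps each array column to the largest array length found in it.
    Hypothetical helper; the alias scheme is an assumption.
    """
    return [
        f"`{c}`[{i}] AS `{c}{sep}{i}`"
        for c, n in max_sizes.items()
        for i in range(n)
    ]


exprs = array_item_exprs({"var1": 3, "var2": 2})
for e in exprs:
    print(e)

# With `df` and `max_dict` from the answer above (assumed to exist):
#   df.selectExpr("id", *array_item_exprs(max_dict)).show()
```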

Different aggregate operations on different columns pyspark

I am trying to apply different aggregation functions to different columns in a pyspark dataframe. Following some suggestions on stackoverflow, I tried this:
the_columns1 = ["product1","product2"]
the_columns2 = ["customer1","customer2"]
exprs = [mean(col(d)) for d in the_columns1, count(col(c)) for c in the_columns2]
followed by
df.groupby(*group).agg(*exprs)
where "group" is a column not present in either the_columns1 or the_columns2. This does not work. How can I apply different aggregation functions to different columns?
You are very close already. Instead of putting both comprehensions in one list, concatenate two lists so that you end up with a flat list of expressions:
exprs = [mean(col(d)) for d in the_columns1] + [count(col(c)) for c in the_columns2]
Here is a demo:
import pyspark.sql.functions as F
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 2| 1|
| 1| 2| 2| 2|
| 2| 3| 3| 3|
| 2| 4| 3| 4|
+---+---+---+---+
cols = ['b']
cols2 = ['c', 'd']
exprs = [F.mean(F.col(x)) for x in cols] + [F.count(F.col(x)) for x in cols2]
df.groupBy('a').agg(*exprs).show()
+---+------+--------+--------+
| a|avg(b)|count(c)|count(d)|
+---+------+--------+--------+
| 1| 1.5| 2| 2|
| 2| 3.5| 2| 2|
+---+------+--------+--------+