Split large array columns into multiple columns - Pyspark - pyspark

I have:
+---+-------+-------+
| id| var1| var2|
+---+-------+-------+
| a|[1,2,3]|[1,2,3]|
| b|[2,3,4]|[2,3,4]|
+---+-------+-------+
I want:
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+
The solution provided by How to split a list to multiple columns in Pyspark?
df1.select('id', df1.var1[0], df1.var1[1], ...).show()
works, but some of my arrays are very long (max 332).
How can I write this so that it takes account of all length arrays?

This solution will work for your problem, no matter the number of initial columns and the size of your arrays. Moreover, if a column has different array sizes (eg [1,2], [3,4,5]), it will result in the maximum number of columns with null values filling the gap.
from pyspark.sql import functions as F
df = spark.createDataFrame(sc.parallelize([['a', [1,2,3], [1,2,3]], ['b', [2,3,4], [2,3,4]]]), ["id", "var1", "var2"])
columns = df.drop('id').columns
df_sizes = df.select(*[F.size(col).alias(col) for col in columns])
df_max = df_sizes.agg(*[F.max(col).alias(col) for col in columns])
max_dict = df_max.collect()[0].asDict()
df_result = df.select('id', *[df[col][i] for col in columns for i in range(max_dict[col])])
df_result.show()
>>>
+---+-------+-------+-------+-------+-------+-------+
| id|var1[0]|var1[1]|var1[2]|var2[0]|var2[1]|var2[2]|
+---+-------+-------+-------+-------+-------+-------+
| a| 1| 2| 3| 1| 2| 3|
| b| 2| 3| 4| 2| 3| 4|
+---+-------+-------+-------+-------+-------+-------+

Related

How to do a groupBy by a given column but still keep all the rows of the original DataFrame?

I want to do a groupBy and aggregate by a given column in PySpark but I still want to keep all the rows from the original DataFrame.
For example lets say we have the following DataFrame and we want to do a max on the "value" column then we would get the result below.
Original DataFrame
+--+-----+
|id|value|
+--+-----+
| 1| 1|
| 1| 2|
| 2| 3|
| 2| 4|
+--+-----+
Result
+--+-----+---+
|id|value|max|
+--+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+--+-----+---+
You can do it simply by joining aggregated dataframe with original dataframe
aggregated_df = (
df
.groupby('id')
.agg(F.max('value').alias('max'))
)
max_value_df = (
df
.join(aggregated_df, 'id')
)
Use window function
df.withColumn('max', max('value').over(Window.partitionBy('id'))).show()
+---+-----+---+
| id|value|max|
+---+-----+---+
| 1| 1| 2|
| 1| 2| 2|
| 2| 3| 4|
| 2| 4| 4|
+---+-----+---+

pyspark: groupby and aggregate avg and first on multiple columns

I have a following sample pyspark dataframe and after groupby I want to calculate mean, and first of multiple columns, In real case I have 100s of columns, so I cant do it individually
sp = spark.createDataFrame([['a',2,4,'cc','anc'], ['a',4,7,'cd','abc'], ['b',6,0,'as','asd'], ['b', 2, 4, 'ad','acb'],
['c', 4, 4, 'sd','acc']], ['id', 'col1', 'col2','col3', 'col4'])
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| a| 2| 4| cc| anc|
| a| 4| 7| cd| abc|
| b| 6| 0| as| asd|
| b| 2| 4| ad| acb|
| c| 4| 4| sd| acc|
+---+----+----+----+----+
This is what I am trying
mean_cols = ['col1', 'col2']
first_cols = ['col3', 'col4']
sc.groupby('id').agg(*[ f.mean for col in mean_cols], *[f.first for col in first_cols])
but it's not working. How can I do it like this with pyspark
The best way for multiple functions on multiple columns is to use the .agg(*expr) format.
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
import numpy as np
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,4,5,1),(5,6,7,8),(7,8,9,2)],schema=['col1','col2','col3','col4'])
fn_l = [F.min,F.max,F.mean,F.first]
col_l=['col1','col2','col3']
expr = [fn(coln).alias(str(fn.__name__)+'_'+str(coln)) for fn in fn_l for coln in col_l]
tst_r = tst.groupby('col4').agg(*expr)
The result will be
tst_r.show()
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
|col4|min_col1|min_col2|min_col3|max_col1|max_col2|max_col3|mean_col1|mean_col2|mean_col3|first_col1|first_col2|first_col3|
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
| 5| 5| 6| 7| 7| 8| 9| 6.0| 7.0| 8.0| 5| 6| 7|
| 4| 1| 2| 3| 3| 4| 5| 2.0| 3.0| 4.0| 1| 2| 3|
+----+--------+--------+--------+--------+--------+--------+---------+---------+---------+----------+----------+----------+
For selectively applying functions on columns, you can have multiple expression arrays and concatenate them in aggregation.
fn_l = [F.min,F.max]
fn_2=[F.mean,F.first]
col_l=['col1','col2']
col_2=['col1','col3','col4']
expr1 = [fn(coln).alias(str(fn.__name__)+'_'+str(coln)) for fn in fn_l for coln in col_l]
expr2 = [fn(coln).alias(str(fn.__name__)+'_'+str(coln)) for fn in fn_2 for coln in col_2]
tst_r = tst.groupby('col4').agg(*(expr1+expr2))
A simpler way to do:
import pyspark.sql.functions as F
tst_r = ( tst.groupby('col4')
.agg(*[F.mean(col).alias(f"{col}_mean") for col in means_col],
*[F.first(col).alias(f"{col}_first") for col in firsts_col]) )

PySpark: How to groupby with Or in columns

I want to groupby in PySpark, but the value can appear in more than a columns, so if it appear in any of the selected column it will be grouped by.
For example, if I have this table in Pyspark:
I want to sum the visits and investments for each ID, so that the result would be:
Note that the ID1 was the sum of the rows 0,1,3 which have the ID1 in one of the first three columns [ID1 Visits = 500 + 100 + 200 = 800].
The ID2 was the sum of the rows 1,2, etc
OBS 1: For the sake of simplicity my example was a simple dataframe, but in real is a much larger df with a lot of rows and a lot of variables, and other operations, not just "sum".
This can't be worked on pandas, because is too large. Should be in PySpark
OBS2: For ilustration I printed in pandas the tables, but in real it is in the PySpark
I appreciate all the help and thank you very much in advance
First of all let's create our test dataframe.
>>> import pandas as pd
>>> data = {
"ID1": [1, 2, 5, 1],
"ID2": [1, 1, 3, 3],
"ID3": [4, 3, 2, 4],
"Visits": [500, 100, 200, 200],
"Investment": [1000, 200, 400, 200]
}
>>> df = spark.createDataFrame(pd.DataFrame(data))
>>> df.show()
+---+---+---+------+----------+
|ID1|ID2|ID3|Visits|Investment|
+---+---+---+------+----------+
| 1| 1| 4| 500| 1000|
| 2| 1| 3| 100| 200|
| 5| 3| 2| 200| 400|
| 1| 3| 4| 200| 200|
+---+---+---+------+----------+
Once we have DataFrame that we can operate on we have to define a function which will return list of unique IDs from columns ID1, ID2 and ID3.
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import ArrayType, IntegerType
>>> #F.udf(returnType=ArrayType(IntegerType()))
... def ids_list(*cols):
... return list(set(cols))
Now it's time to apply our udf on a DataFrame.
>>> df = df.withColumn('ids', ids_list('ID1', 'ID2', 'ID3'))
>>> df.show()
+---+---+---+------+----------+---------+
|ID1|ID2|ID3|Visits|Investment| ids|
+---+---+---+------+----------+---------+
| 1| 1| 4| 500| 1000| [1, 4]|
| 2| 1| 3| 100| 200|[1, 2, 3]|
| 5| 3| 2| 200| 400|[2, 3, 5]|
| 1| 3| 4| 200| 200|[1, 3, 4]|
+---+---+---+------+----------+---------+
To make use of ids column we have to explode it into separate rows and drop ids column.
>>> df = df.withColumn("ID", F.explode('ids')).drop('ids')
>>> df.show()
+---+---+---+------+----------+---+
|ID1|ID2|ID3|Visits|Investment| ID|
+---+---+---+------+----------+---+
| 1| 1| 4| 500| 1000| 1|
| 1| 1| 4| 500| 1000| 4|
| 2| 1| 3| 100| 200| 1|
| 2| 1| 3| 100| 200| 2|
| 2| 1| 3| 100| 200| 3|
| 5| 3| 2| 200| 400| 2|
| 5| 3| 2| 200| 400| 3|
| 5| 3| 2| 200| 400| 5|
| 1| 3| 4| 200| 200| 1|
| 1| 3| 4| 200| 200| 3|
| 1| 3| 4| 200| 200| 4|
+---+---+---+------+----------+---+
Finally we have to group our DataFrame by ID column and calculate sums. Final result is ordered by ID.
>>> final_df = (
... df.groupBy('ID')
... .agg( F.sum('Visits'), F.sum('Investment') )
... .orderBy('ID')
... )
>>> final_df.show()
+---+-----------+---------------+
| ID|sum(Visits)|sum(Investment)|
+---+-----------+---------------+
| 1| 800| 1400|
| 2| 300| 600|
| 3| 500| 800|
| 4| 700| 1200|
| 5| 200| 400|
+---+-----------+---------------+
I hope you make it useful.
You can do something like below:
Create array of all id columns- > ids column below
explode ids column
Now you will get duplicates, to avoid duplicate aggregation use distinct
Finally groupBy ids column and perform all your aggregations
Note: : If your dataset can have exact duplicate rows then add one columns with df.withColumn('uid', f.monotonically_increasing_id()) before creating array otherwise distinct will drop it.
Example for your dataset:
import pyspark.sql.functions as f
df.withColumn('ids', f.explode(f.array('id1','id2','id3'))).distinct().groupBy('ids').agg(f.sum('visits'), f.sum('investments')).orderBy('ids').show()
+---+-----------+----------------+
|ids|sum(visits)|sum(investments)|
+---+-----------+----------------+
| 1| 800| 1400|
| 2| 300| 600|
| 3| 500| 800|
| 4| 700| 1200|
| 5| 200| 400|
+---+-----------+----------------+

Improve the efficiency of Spark SQL in repeated calls to groupBy/count. Pivot the outcome

I have a Spark DataFrame consisting of columns of integers. I want to tabulate each column and pivot the outcome by the column names.
In the following toy example, I start with this DataFrame df
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 1| 1| 1| 0| 2|
| 1| 1| 1| 1| 1|
| 2| 2| 2| 3| 3|
| 0| 0| 0| 0| 1|
| 1| 1| 1| 0| 0|
| 3| 3| 3| 2| 2|
| 0| 1| 1| 1| 0|
+---+---+---+---+---+
Each cell can only contain one of {0, 1, 2, 3}. Now I want to tabulate the counts in each column. Ideally, I would have a column for each label (0, 1, 2, 3), and a row for each column. I do:
val output = df.columns.map(cs => df.select(cs).groupBy(cs).count().orderBy(cs).
withColumnRenamed(cs, "severity").
withColumnRenamed("count", "counts").withColumn("window", lit(cs))
)
I get an Array of DataFrames, one for each row of the df. Each of these dataframes has 4 rows (one for each outcome). Then I do:
val longOutput = output.reduce(_ union _) // flatten the array to produce one dataframe
longOutput.show()
to collapse the Array.
+--------+------+------+
|severity|counts|window|
+--------+------+------+
| 0| 2| a|
| 1| 3| a|
| 2| 1| a|
| 3| 1| a|
| 0| 1| b|
| 1| 4| b|
| 2| 1| b|
| 3| 1| b|
...
And finally, I pivot on the original column names
longOutput.cache()
val results = longOutput.groupBy("window").pivot("severity").agg(first("counts"))
results.show()
+------+---+---+---+---+
|window| 0| 1| 2| 3|
+------+---+---+---+---+
| e| 2| 2| 2| 1|
| d| 3| 2| 1| 1|
| c| 1| 4| 1| 1|
| b| 1| 4| 1| 1|
| a| 2| 3| 1| 1|
+------+---+---+---+---+
However the reduction piece took 8 full seconds on the toy example. It ran for over 2 hours on my actual data which had 1000 columns and 400,000 rows before I terminated it. I am running locally on a machine with 12 cores and 128G of RAM. But clearly, what I'm doing is slow on even a small amount of data, so machine size is not in itself the problem. The column groupby/count took only 7 minutes on the full data set. But then I can't do anything with that Array[DataFrame].
I tried several ways of avoiding union. I tried writing out my array to disk, but that failed due to a memory problem after several hours of effort. I also tried to adjust memory allowances on Zeppelin
So I need a way of doing the tabulation that does not give me an Array of DataFrames, but rather a simple data frame.
The problem with your code is that you trigger one spark job per column and then a big union. In general, it's much faster to try and keep everything within the same one.
In your case, instead of dividing the work, you could explode the dataframe to do everything in one pass like this:
df
.select(array(df.columns.map(c => struct(lit(c) as "name", col(c) as "value") ) : _*) as "a")
.select(explode('a))
.select($"col.name" as "name", $"col.value" as "value")
.groupBy("name")
.pivot("value")
.count()
.show()
This first line is the only one that's a bit tricky. It creates an array of tuples where each column name is mapped to its value. Then we explode it (one line per element of the array) and finally compute a basic pivot.

Different aggregate operations on different columns pyspark

I am trying to apply different aggregation functions to different columns in a pyspark dataframe. Following some suggestions on stackoverflow, I tried this:
the_columns = ["product1","product2"]
the_columns2 = ["customer1","customer2"]
exprs = [mean(col(d)) for d in the_columns1, count(col(c)) for c in the_columns2]
followed by
df.groupby(*group).agg(*exprs)
where "group" is a column not present in either the_columns or the_columns2. This does not work. How to do different aggregation functions on different columns?
You are very close already, instead of put the expressions in a list, add them so you have a flat list of expressions:
exprs = [mean(col(d)) for d in the_columns1] + [count(col(c)) for c in the_columns2]
Here is a demo:
import pyspark.sql.functions as F
df.show()
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 1| 1| 2| 1|
| 1| 2| 2| 2|
| 2| 3| 3| 3|
| 2| 4| 3| 4|
+---+---+---+---+
cols = ['b']
cols2 = ['c', 'd']
exprs = [F.mean(F.col(x)) for x in cols] + [F.count(F.col(x)) for x in cols2]
df.groupBy('a').agg(*exprs).show()
+---+------+--------+--------+
| a|avg(b)|count(c)|count(d)|
+---+------+--------+--------+
| 1| 1.5| 2| 2|
| 2| 3.5| 2| 2|
+---+------+--------+--------+