I have a Spark dataframe in PySpark and I need to drop all constant columns from it. Since I don't know which columns are constant, I cannot manually deselect them, i.e. I need an automatic procedure. I am surprised I was not able to find a simple solution on Stack Overflow.
Example:
import pandas as pd
import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
d = {'col1': [1, 2, 3, 4, 5],
     'col2': [1, 2, 3, 4, 5],
     'col3': [0, 0, 0, 0, 0],
     'col4': [0, 0, 0, 0, 0]}
df_panda = pd.DataFrame(data=d)
df_spark = spark.createDataFrame(df_panda)
df_spark.show()
Output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| 1| 1| 0| 0|
| 2| 2| 0| 0|
| 3| 3| 0| 0|
| 4| 4| 0| 0|
| 5| 5| 0| 0|
+----+----+----+----+
Desired output:
+----+----+
|col1|col2|
+----+----+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 5|
+----+----+
What is the best way to automatically drop constant columns in pyspark?
Count distinct values in each column first and then drop columns that contain only one distinct value:
import pyspark.sql.functions as f
cnt = df_spark.agg(*(f.countDistinct(c).alias(c) for c in df_spark.columns)).first()
cnt
# Row(col1=5, col2=5, col3=1, col4=1)
df_spark.drop(*[c for c in cnt.asDict() if cnt[c] == 1]).show()
+----+----+
|col1|col2|
+----+----+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 5|
+----+----+
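If you need this often, the same idea can be wrapped in a small helper (a sketch; the name drop_constant_columns is mine, not a built-in):
import pyspark.sql.functions as f

def drop_constant_columns(df):
    # count distinct values per column in a single pass, then drop the single-valued ones
    cnt = df.agg(*(f.countDistinct(c).alias(c) for c in df.columns)).first()
    return df.drop(*[c for c in cnt.asDict() if cnt[c] == 1])

# df_clean = drop_constant_columns(df_spark)
Note that countDistinct ignores nulls, so a column that is entirely null reports 0 distinct values and is not dropped by the == 1 check; use <= 1 if you want such columns removed as well.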
Related
I am having difficulty implementing this existing answer:
PySpark - get row number for each row in a group
Consider the following:
# imports for the window functions used below
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# create df (sc is the SparkContext of the active session)
df = spark.createDataFrame(sc.parallelize([
    [1, 'A', 20220722, 1],
    [1, 'A', 20220723, 1],
    [1, 'B', 20220724, 2],
    [2, 'B', 20220722, 1],
    [2, 'C', 20220723, 2],
    [2, 'B', 20220724, 3],
]), ['ID', 'State', 'Time', 'Expected'])
# rank
w = Window.partitionBy('State').orderBy('ID', 'Time')
df = df.withColumn('rn', F.row_number().over(w))
df = df.withColumn('rank', F.rank().over(w))
df = df.withColumn('dense', F.dense_rank().over(w))
# view
df.show()
+---+-----+--------+--------+---+----+-----+
| ID|State| Time|Expected| rn|rank|dense|
+---+-----+--------+--------+---+----+-----+
| 1| A|20220722| 1| 1| 1| 1|
| 1| A|20220723| 1| 2| 2| 2|
| 1| B|20220724| 2| 1| 1| 1|
| 2| B|20220722| 1| 2| 2| 2|
| 2| B|20220724| 3| 3| 3| 3|
| 2| C|20220723| 2| 1| 1| 1|
+---+-----+--------+--------+---+----+-----+
How can I get the expected value and also sort the dates correctly such that they are ascending?
You restart the count for each new ID value, which means the ID field is your partition field, not State.
Here is an approach using a sum window function.
import sys
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

data_sdf. \
    withColumn('st_notsame',
               func.coalesce(func.col('state') != func.lag('state').over(wd.partitionBy('id').orderBy('time')),
                             func.lit(True)).cast('int')
               ). \
    withColumn('rank',
               func.sum('st_notsame').over(wd.partitionBy('id').orderBy('time', 'state').rowsBetween(-sys.maxsize, 0))
               ). \
    show()
# +---+-----+--------+--------+----------+----+
# | id|state| time|expected|st_notsame|rank|
# +---+-----+--------+--------+----------+----+
# | 1| A|20220722| 1| 1| 1|
# | 1| A|20220723| 1| 0| 1|
# | 1| B|20220724| 2| 1| 2|
# | 2| B|20220722| 1| 1| 1|
# | 2| C|20220723| 2| 1| 2|
# | 2| B|20220724| 3| 1| 3|
# +---+-----+--------+--------+----------+----+
You first flag all consecutive occurrences of the same state as 0 and the rest as 1 - this enables a running sum.
Then use a sum window with an unbounded lookback within each ID to get the desired ranking.
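The same frame can be written with the built-in boundary constants instead of -sys.maxsize, which reads a bit more clearly (a sketch under the same data_sdf/column assumptions as the snippet above):
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

w_lag = wd.partitionBy('id').orderBy('time')
w_run = (wd.partitionBy('id')
           .orderBy('time', 'state')
           .rowsBetween(wd.unboundedPreceding, wd.currentRow))

result = (
    data_sdf
    .withColumn('st_notsame',
                func.coalesce(func.col('state') != func.lag('state').over(w_lag),
                              func.lit(True)).cast('int'))
    .withColumn('rank', func.sum('st_notsame').over(w_run))
    .drop('st_notsame')
)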
For example, if we have the following dataframe:
df = spark.createDataFrame([['a', 1], ['a', 1],
                            ['b', 1], ['b', 2],
                            ['c', 2], ['c', 2], ['c', 2]],
                           ['col1', 'col2'])
+----+----+
|col1|col2|
+----+----+
| a| 1|
| a| 1|
| b| 1|
| b| 2|
| c| 2|
| c| 2|
| c| 2|
+----+----+
I want to mark groups based on col1 where values in col2 repeat themselves. I have an idea to find the difference between the group size and the count of distinct values:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

window = Window.partitionBy('col1')
df.withColumn('col3', F.count('col2').over(window)).\
    withColumn('col4', F.approx_count_distinct('col2').over(window)).\
    select('col1', 'col2', (F.col('col3') - F.col('col4')).alias('col3')).show()
Maybe you have a better solution. My expected output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| a| 1| 1|
| a| 1| 1|
| b| 1| 0|
| b| 2| 0|
| c| 2| 2|
| c| 2| 2|
| c| 2| 2|
+----+----+----+
As you can see, all groups where col3 equals zero have only unique values in col2.
Based on your needs, you can compute this by counting within groups of col1 and col2:
df = df.withColumn('col3', F.expr('count(*) over (partition by col1,col2) - 1'))
df.show(truncate=False)
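The same logic can also be written with the DataFrame Window API instead of a SQL expression (a sketch, equivalent to the expr above):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('col1', 'col2')
df = df.withColumn('col3', F.count(F.lit(1)).over(w) - 1)
df.show(truncate=False)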
Problem statement
Consider the following data (see code generation at the bottom)
+-----+-----+-------+--------+
|index|group|low_num|high_num|
+-----+-----+-------+--------+
| 0| 1| 1| 1|
| 1| 1| 2| 2|
| 2| 1| 3| 3|
| 3| 2| 1| 3|
+-----+-----+-------+--------+
Then, for a given index, I want to count how many times that index's high_num is greater than the low_num values across its group.
For instance, consider the second row, with index 1. Index 1 is in group 1 and its high_num is 2. That high_num is greater than the low_num on index 0, equal to the low_num on index 1, and smaller than the low_num on index 2. So the high_num of index 1 exceeds a low_num in the group exactly once, and I want the value in the answer column to be 1.
Dataset with desired output
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+
Dataset generation code
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .getOrCreate()
)
## Example df
## Note the inclusion of "desired" which is the desired output.
df = spark.createDataFrame(
    [
        (0, 1, 1, 1, 0),
        (1, 1, 2, 2, 1),
        (2, 1, 3, 3, 2),
        (3, 2, 1, 3, 1)
    ],
    schema=["index", "group", "low_num", "high_num", "desired"]
)
Pseudocode that might have solved the problem
The pseudocode might look like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w_spec = Window.partitionBy("group").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)

## F.collect_list_when does not exist
## F.current_col does not exist
## Probably wouldn't work like this anyway
ddf = df.withColumn("Counts",
                    F.size(F.collect_list_when(
                        F.current_col("high_num") > F.col("low_num"), 1
                    ).otherwise(None).over(w_spec))
                    )
You can do a filter on the collect_list, and check its size:
import pyspark.sql.functions as F

df2 = df.withColumn(
    'desired',
    F.expr('size(filter(collect_list(low_num) over (partition by group), x -> x < high_num))')
)
df2.show()
+-----+-----+-------+--------+-------+
|index|group|low_num|high_num|desired|
+-----+-----+-------+--------+-------+
| 0| 1| 1| 1| 0|
| 1| 1| 2| 2| 1|
| 2| 1| 3| 3| 2|
| 3| 2| 1| 3| 1|
+-----+-----+-------+--------+-------+
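The same filter-then-size idea can also be written in the DataFrame API with the higher-order filter function (a sketch, assuming Spark 3.1+ where pyspark.sql.functions.filter is available):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('group')
df2 = df.withColumn(
    'desired',
    # collect the group's low_num values, keep those below this row's high_num, count them
    F.size(F.filter(F.collect_list('low_num').over(w),
                    lambda x: x < F.col('high_num')))
)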
I am trying to rank a column where the "ID" column numbering runs from 1 up to some maximum and then resets to 1.
So the first three rows have continuous numbering on "ID"; hence they should be grouped together with group rank = 1. Rows four and five are in another group, with group rank = 2.
The rows are sorted by the "rownum" column. I am aware of the row_number window function, but I don't think I can apply it to this use case as there is no constant window. I can only think of looping through each row of the dataframe, but I am not sure how I can update a column when the numbering resets to 1.
val df = Seq(
  (1, 1),
  (2, 2),
  (3, 3),
  (4, 1),
  (5, 2),
  (6, 1),
  (7, 1),
  (8, 2)
).toDF("rownum", "ID")
df.show()
Expected result is below:
You can do it with two window functions: the first flags whether the ID increased, the second calculates a running sum of the resets:
// assumes spark-shell, where spark.implicits._ (for the $"col" syntax) is pre-imported
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

df
  .withColumn("increase", $"ID" > lag($"ID", 1).over(Window.orderBy($"rownum")))
  .withColumn("group_rank_of_ID", sum(when($"increase", lit(0)).otherwise(lit(1))).over(Window.orderBy($"rownum")))
  .drop($"increase")
  .show()
gives:
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
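For readers working in PySpark rather than Scala, a minimal sketch of the same two-step approach (assuming the same rownum/ID columns; note that Window.orderBy without partitionBy pulls all rows into a single partition, which is fine for small data):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy('rownum')
result = (
    df.withColumn('increase', F.col('ID') > F.lag('ID', 1).over(w))              # did the ID keep increasing?
      .withColumn('group_rank_of_ID',
                  F.sum(F.when(F.col('increase'), 0).otherwise(1)).over(w))      # running count of resets
      .drop('increase')
)
result.show()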
As @Prithvi noted, we can use lag here.
The tricky part is that, in order to use a window function such as lag, we need to at least provide an ordering.
Consider
// assumes spark-shell, where spark.implicits._ (for the 'ID symbol syntax) is pre-imported
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// lag gives the previous ID in rownum order (default -1 for the first row)
val nextID = lag('ID, 1, -1) over Window.orderBy('rownum)
val isNewGroup = 'ID <= nextID cast "integer"
val group_rank_of_ID = sum(isNewGroup) over Window.orderBy('rownum)
/* you can try
df.withColumn("intermediate", nextID).show
// ^^^^^^^-- can be `isNewGroup`, or other vals
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 0|
| 2| 2| 0|
| 3| 3| 0|
| 4| 1| 1|
| 5| 2| 1|
| 6| 1| 2|
| 7| 1| 3|
| 8| 2| 3|
+------+---+----------------+
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID + 1).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
*/
I want to group by in PySpark, but the value can appear in more than one column, so if it appears in any of the selected columns it should be grouped by that value.
For example, if I have this table in PySpark:
I want to sum the visits and investments for each ID, so that the result would be:
Note that ID = 1 is the sum of rows 0, 1 and 3, which have the value 1 in one of the first three columns [ID 1 Visits = 500 + 100 + 200 = 800].
ID = 2 is the sum of rows 1 and 2, etc.
OBS 1: For the sake of simplicity my example is a simple dataframe, but in reality it is a much larger df with a lot of rows and a lot of variables, and other operations, not just "sum".
This can't be done in pandas, because the data is too large; it should be in PySpark.
OBS 2: For illustration I printed the tables with pandas, but in reality they live in PySpark.
I appreciate all the help and thank you very much in advance.
First of all let's create our test dataframe.
>>> import pandas as pd
>>> data = {
...     "ID1": [1, 2, 5, 1],
...     "ID2": [1, 1, 3, 3],
...     "ID3": [4, 3, 2, 4],
...     "Visits": [500, 100, 200, 200],
...     "Investment": [1000, 200, 400, 200]
... }
>>> df = spark.createDataFrame(pd.DataFrame(data))
>>> df.show()
+---+---+---+------+----------+
|ID1|ID2|ID3|Visits|Investment|
+---+---+---+------+----------+
| 1| 1| 4| 500| 1000|
| 2| 1| 3| 100| 200|
| 5| 3| 2| 200| 400|
| 1| 3| 4| 200| 200|
+---+---+---+------+----------+
Once we have a DataFrame we can operate on, we have to define a function which will return a list of unique IDs from the columns ID1, ID2 and ID3.
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import ArrayType, IntegerType
>>> @F.udf(returnType=ArrayType(IntegerType()))
... def ids_list(*cols):
...     return list(set(cols))
Now it's time to apply our udf to the DataFrame.
>>> df = df.withColumn('ids', ids_list('ID1', 'ID2', 'ID3'))
>>> df.show()
+---+---+---+------+----------+---------+
|ID1|ID2|ID3|Visits|Investment| ids|
+---+---+---+------+----------+---------+
| 1| 1| 4| 500| 1000| [1, 4]|
| 2| 1| 3| 100| 200|[1, 2, 3]|
| 5| 3| 2| 200| 400|[2, 3, 5]|
| 1| 3| 4| 200| 200|[1, 3, 4]|
+---+---+---+------+----------+---------+
To make use of the ids column, we have to explode it into separate rows and then drop the ids column.
>>> df = df.withColumn("ID", F.explode('ids')).drop('ids')
>>> df.show()
+---+---+---+------+----------+---+
|ID1|ID2|ID3|Visits|Investment| ID|
+---+---+---+------+----------+---+
| 1| 1| 4| 500| 1000| 1|
| 1| 1| 4| 500| 1000| 4|
| 2| 1| 3| 100| 200| 1|
| 2| 1| 3| 100| 200| 2|
| 2| 1| 3| 100| 200| 3|
| 5| 3| 2| 200| 400| 2|
| 5| 3| 2| 200| 400| 3|
| 5| 3| 2| 200| 400| 5|
| 1| 3| 4| 200| 200| 1|
| 1| 3| 4| 200| 200| 3|
| 1| 3| 4| 200| 200| 4|
+---+---+---+------+----------+---+
Finally, we have to group our DataFrame by the ID column and calculate the sums. The final result is ordered by ID.
>>> final_df = (
... df.groupBy('ID')
... .agg( F.sum('Visits'), F.sum('Investment') )
... .orderBy('ID')
... )
>>> final_df.show()
+---+-----------+---------------+
| ID|sum(Visits)|sum(Investment)|
+---+-----------+---------------+
| 1| 800| 1400|
| 2| 300| 600|
| 3| 500| 800|
| 4| 700| 1200|
| 5| 200| 400|
+---+-----------+---------------+
I hope you find it useful.
You can do something like below:
Create an array of all the id columns -> the ids column below
Explode the ids column
Now you will get duplicates; to avoid duplicate aggregation, use distinct
Finally, groupBy the ids column and perform all your aggregations
Note: If your dataset can have exact duplicate rows, then add a column with df.withColumn('uid', f.monotonically_increasing_id()) before creating the array, otherwise distinct will drop them (a sketch of this is shown after the example output below).
Example for your dataset:
import pyspark.sql.functions as f

df.withColumn('ids', f.explode(f.array('id1', 'id2', 'id3'))) \
  .distinct() \
  .groupBy('ids') \
  .agg(f.sum('visits'), f.sum('investments')) \
  .orderBy('ids') \
  .show()
+---+-----------+----------------+
|ids|sum(visits)|sum(investments)|
+---+-----------+----------------+
| 1| 800| 1400|
| 2| 300| 600|
| 3| 500| 800|
| 4| 700| 1200|
| 5| 200| 400|
+---+-----------+----------------+
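Regarding the note about exact duplicate rows, a minimal sketch of that safeguard (column names follow the example above and are assumptions about the real data):
import pyspark.sql.functions as f

deduped = (
    df.withColumn('uid', f.monotonically_increasing_id())           # make every original row unique
      .withColumn('ids', f.explode(f.array('id1', 'id2', 'id3')))
      .distinct()                                                    # now only removes duplicates created by explode
      .drop('uid')
)
deduped.groupBy('ids').agg(f.sum('visits'), f.sum('investments')).orderBy('ids').show()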