Get groups with duplicated values in PySpark

For example, if we have the following dataframe:
df = spark.createDataFrame([['a', 1], ['a', 1],
                            ['b', 1], ['b', 2],
                            ['c', 2], ['c', 2], ['c', 2]],
                            ['col1', 'col2'])
+----+----+
|col1|col2|
+----+----+
|   a|   1|
|   a|   1|
|   b|   1|
|   b|   2|
|   c|   2|
|   c|   2|
|   c|   2|
+----+----+
I want to mark groups based on col1 where values in col2 repeat themselves. I have an idea to find the difference between the group size and the count of distinct values:
window = Window.partitionBy('col1')
df.withColumn('col3', F.count('col2').over(window)).\
    withColumn('col4', F.approx_count_distinct('col2').over(window)).\
    select('col1', 'col2', (F.col('col3') - F.col('col4')).alias('col3')).show()
Maybe you have a better solution. My expected output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   a|   1|   1|
|   a|   1|   1|
|   b|   1|   0|
|   b|   2|   0|
|   c|   2|   2|
|   c|   2|   2|
|   c|   2|   2|
+----+----+----+
As you can see, all groups where col3 equals zero contain only unique values in col2.

Depending on your needs, you can compute the grouping statistics partitioned by both col1 and col2:
df = df.withColumn('col3', F.expr('count(*) over (partition by col1,col2) - 1'))
df.show(truncate=False)
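For completeness, a minimal self-contained sketch of the same idea using the window API instead of an SQL expression (assuming the df created above and standard imports):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Count rows sharing the same (col1, col2) pair and subtract 1, so unique pairs get 0.
w = Window.partitionBy('col1', 'col2')
df.withColumn('col3', F.count('col2').over(w) - 1).show()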

Related

PySpark: Incremental Row Counter

I am having difficulty implementing this existing answer:
PySpark - get row number for each row in a group
Consider the following:
# create df
df = spark.createDataFrame(sc.parallelize([
        [1, 'A', 20220722, 1],
        [1, 'A', 20220723, 1],
        [1, 'B', 20220724, 2],
        [2, 'B', 20220722, 1],
        [2, 'C', 20220723, 2],
        [2, 'B', 20220724, 3],
    ]),
    ['ID', 'State', 'Time', 'Expected'])
# rank
w = Window.partitionBy('State').orderBy('ID', 'Time')
df = df.withColumn('rn', F.row_number().over(w))
df = df.withColumn('rank', F.rank().over(w))
df = df.withColumn('dense', F.dense_rank().over(w))
# view
df.show()
+---+-----+--------+--------+---+----+-----+
| ID|State|    Time|Expected| rn|rank|dense|
+---+-----+--------+--------+---+----+-----+
|  1|    A|20220722|       1|  1|   1|    1|
|  1|    A|20220723|       1|  2|   2|    2|
|  1|    B|20220724|       2|  1|   1|    1|
|  2|    B|20220722|       1|  2|   2|    2|
|  2|    B|20220724|       3|  3|   3|    3|
|  2|    C|20220723|       2|  1|   1|    1|
+---+-----+--------+--------+---+----+-----+
How can I get the expected value and also sort the dates correctly such that they are ascending?
You restart your count for each new ID value, which means the ID field is your partition field, not State.
Here is an approach using the sum window function:
import sys
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# data_sdf is the question's input dataframe (column lookups are case-insensitive)
data_sdf. \
    withColumn('st_notsame',
               func.coalesce(func.col('state') != func.lag('state').over(wd.partitionBy('id').orderBy('time')),
                             func.lit(True)).cast('int')
               ). \
    withColumn('rank',
               func.sum('st_notsame').over(wd.partitionBy('id').orderBy('time', 'state').rowsBetween(-sys.maxsize, 0))
               ). \
    show()
# +---+-----+--------+--------+----------+----+
# | id|state|    time|expected|st_notsame|rank|
# +---+-----+--------+--------+----------+----+
# |  1|    A|20220722|       1|         1|   1|
# |  1|    A|20220723|       1|         0|   1|
# |  1|    B|20220724|       2|         1|   2|
# |  2|    B|20220722|       1|         1|   1|
# |  2|    C|20220723|       2|         1|   2|
# |  2|    B|20220724|       3|         1|   3|
# +---+-----+--------+--------+----------+----+
You first flag all consecutive occurrences of the same state as 0 and the others as 1; this enables a running sum.
Then use the sum window with an infinite lookback for each ID to get the desired ranking.
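A rough adaptation to the question's df and column names (ID, State, Time) might look like the sketch below; Spark resolves column names case-insensitively by default, so this mirrors the snippet above:
import sys
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w_lag = Window.partitionBy('ID').orderBy('Time')
w_sum = Window.partitionBy('ID').orderBy('Time', 'State').rowsBetween(-sys.maxsize, 0)

# Flag rows whose State differs from the previous row within the same ID,
# then take a running sum of the flag to get the incremental counter.
df = df.withColumn('st_notsame',
                   F.coalesce(F.col('State') != F.lag('State').over(w_lag), F.lit(True)).cast('int'))
df = df.withColumn('rank', F.sum('st_notsame').over(w_sum))
df.orderBy('ID', 'Time').show()
With the sample data above, the rank column matches the Expected column.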

PySpark join columns after pivot

For the following example DataFrame:
df = spark.createDataFrame(
    [
        ('2017-01-01', 'A', 1),
        ('2017-01-01', 'B', 2),
        ('2017-01-01', 'C', 3),
        ('2017-01-02', 'A', 4),
        ('2017-01-02', 'B', 5),
        ('2017-01-02', 'C', 6),
        ('2017-01-03', 'A', 7),
        ('2017-01-03', 'B', 8),
        ('2017-01-03', 'C', 9),
    ],
    ('date', 'type', 'value')
)
I would like to convert it to have the columns equal to all unique "types" (A, B, and C).
Currently, I have found this code works closest to what I would like to achieve:
df.groupby("date", "type").pivot("type").sum().orderBy("date").show()
+----------+----+----+----+----+
|      date|type|   A|   B|   C|
+----------+----+----+----+----+
|2017-01-01|   C|null|null|   3|
|2017-01-01|   A|   1|null|null|
|2017-01-01|   B|null|   2|null|
|2017-01-02|   B|null|   5|null|
|2017-01-02|   C|null|null|   6|
|2017-01-02|   A|   4|null|null|
|2017-01-03|   A|   7|null|null|
|2017-01-03|   C|null|null|   9|
|2017-01-03|   B|null|   8|null|
+----------+----+----+----+----+
The issue is that I still have too many rows (containing all "null").
What I would like to get is:
+----------+---+---+---+
|      date|  A|  B|  C|
+----------+---+---+---+
|2017-01-01|  1|  2|  3|
|2017-01-02|  4|  5|  6|
|2017-01-03|  7|  8|  9|
+----------+---+---+---+
In other words, I would like something with functionality similar to pandas.DataFrame.unstack().
If anyone has any tips on how I can achieve this in PySpark that would be great.
You need to do another group by on the "date" column and then select the max values from A, B, and C.
Example:
df.groupby("date", "type").pivot("type").sum().orderBy("date").groupBy("date").agg(max(col("A")).alias("A"),max(col("B")).
#+----------+---+---+---+
#|      date|  A|  B|  C|
#+----------+---+---+---+
#|2017-01-01|  1|  2|  3|
#|2017-01-02|  4|  5|  6|
#|2017-01-03|  7|  8|  9|
#+----------+---+---+---+
# dynamic way
aggregate = ["A", "B", "C"]
funs = [max]
exprs = [f(col(c)).alias(c) for f in funs for c in aggregate]
df.groupby("date", "type").pivot("type").sum().orderBy("date").groupBy("date").agg(*exprs).show()
#+----------+---+---+---+
#|      date|  A|  B|  C|
#+----------+---+---+---+
#|2017-01-01|  1|  2|  3|
#|2017-01-02|  4|  5|  6|
#|2017-01-03|  7|  8|  9|
#+----------+---+---+---+
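A hedged variation on the dynamic approach: derive the list of pivoted columns from the DataFrame itself instead of hardcoding ["A", "B", "C"] (assuming the same df as above):
from pyspark.sql.functions import col, max as max_

pivoted = df.groupby("date", "type").pivot("type").sum().orderBy("date")

# Aggregate every pivoted column except the grouping keys.
value_cols = [c for c in pivoted.columns if c not in ("date", "type")]
exprs = [max_(col(c)).alias(c) for c in value_cols]
pivoted.groupBy("date").agg(*exprs).orderBy("date").show()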

How to automatically drop constant columns in pyspark?

I have a Spark dataframe in PySpark and I need to drop all constant columns from it. Since I don't know which columns are constant, I cannot manually unselect them, i.e. I need an automatic procedure. I am surprised I was not able to find a simple solution on Stack Overflow.
Example:
import pandas as pd
import pyspark
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName("test").getOrCreate()
d = {'col1': [1, 2, 3, 4, 5],
     'col2': [1, 2, 3, 4, 5],
     'col3': [0, 0, 0, 0, 0],
     'col4': [0, 0, 0, 0, 0]}
df_panda = pd.DataFrame(data=d)
df_spark = spark.createDataFrame(df_panda)
df_spark.show()
Output:
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   1|   0|   0|
|   2|   2|   0|   0|
|   3|   3|   0|   0|
|   4|   4|   0|   0|
|   5|   5|   0|   0|
+----+----+----+----+
Desired output:
+----+----+
|col1|col2|
+----+----+
|   1|   1|
|   2|   2|
|   3|   3|
|   4|   4|
|   5|   5|
+----+----+
What is the best way to automatically drop constant columns in pyspark?
Count distinct values in each column first and then drop columns that contain only one distinct value:
import pyspark.sql.functions as f
cnt = df_spark.agg(*(f.countDistinct(c).alias(c) for c in df_spark.columns)).first()
cnt
# Row(col1=5, col2=5, col3=1, col4=1)
df_spark.drop(*[c for c in cnt.asDict() if cnt[c] == 1]).show()
+----+----+
|col1|col2|
+----+----+
|   1|   1|
|   2|   2|
|   3|   3|
|   4|   4|
|   5|   5|
+----+----+
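If you need this repeatedly, the same logic can be wrapped in a small helper; a minimal sketch (the name drop_constant_columns is just illustrative, not a built-in):
import pyspark.sql.functions as f

def drop_constant_columns(sdf):
    # Count distinct values per column in one pass, then drop single-valued columns.
    cnt = sdf.agg(*(f.countDistinct(c).alias(c) for c in sdf.columns)).first()
    return sdf.drop(*[c for c in cnt.asDict() if cnt[c] == 1])

drop_constant_columns(df_spark).show()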

Apache Spark - Scala API - Aggregate on sequentially increasing key

I have a data frame that looks something like this:
val df = sc.parallelize(Seq(
  (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
  (2, 1, "D"), (2, 2, "E"),
  (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
  (2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")
+------+---+------+
|TotalN|  N|String|
+------+---+------+
|     3|  1|     A|
|     3|  2|     B|
|     3|  3|     C|
|     2|  1|     D|
|     2|  2|     E|
|     3|  1|     F|
|     3|  2|     G|
|     3|  3|     G|
|     2|  1|     X|
|     2|  2|     X|
+------+---+------+
I need to aggregate the strings by concatenating them together based on the TotalN and the sequentially increasing ID (N). The problem is there is not a unique ID for each aggregation I can group by. So, I need to do something like "for each row look at the TotalN, loop through the next N rows and concatenate, then reset".
+------+------+
|TotalN|String|
+------+------+
|     3|   ABC|
|     2|    DE|
|     3|   FGG|
|     2|    XX|
+------+------+
Any pointers much appreciated.
Using Spark 2.3.1 and the Scala Api.
Try this:
val df = spark.sparkContext.parallelize(Seq(
  (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
  (2, 1, "D"), (2, 2, "E"),
  (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
  (2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")

df.createOrReplaceTempView("data")

val sqlDF = spark.sql(
  """
    | SELECT TotalN d, N, String, ROW_NUMBER() over (order by TotalN) as rowNum
    | FROM data
  """.stripMargin)

sqlDF.withColumn("key", $"N" - $"rowNum")
  .groupBy("key").agg(collect_list('String).as("texts")).show()
The solution is to calculate a grouping variable using the row_number function, which can then be used in a later groupBy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.orderBy("TotalN")
df.withColumn("GeneratedID", $"N" - row_number.over(w)).show
+------+---+------+-----------+
|TotalN|  N|String|GeneratedID|
+------+---+------+-----------+
|     2|  1|     D|          0|
|     2|  2|     E|          0|
|     2|  1|     X|         -2|
|     2|  2|     X|         -2|
|     3|  1|     A|         -4|
|     3|  2|     B|         -4|
|     3|  3|     C|         -4|
|     3|  1|     F|         -7|
|     3|  2|     G|         -7|
|     3|  3|     G|         -7|
+------+---+------+-----------+

Spark dataframe groupBy multiple times

val df = (Seq((1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
              (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11")).
  toDF("col1", "col2", "col3"))
So I have a spark dataframe with 3 columns.
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|  10|
|   1|   b|  12|
|   1|   c|  13|
|   2|   a|  14|
|   2|   c|  11|
|   1|   b|  12|
|   2|   c|  12|
|   3|   r|  11|
+----+----+----+
My requirement is to perform two levels of groupBy, as explained below.
Level 1:
If I do a groupBy on col1 and sum col3, I get the two columns below:
1. col1
2. sum(col3)
I lose col2 here.
Level 2:
If I instead group by both col1 and col2 and sum col3, I get these 3 columns:
1. col1
2. col2
3. sum(col3)
What I need is to perform both levels of groupBy and have both sums (sum(col3) from level 1 and sum(col3) from level 2) in a single final dataframe.
How can I do this? Can anyone explain?
Spark: 1.6.2
Scala: 2.10
One option is to do the two sums separately and then join them back:
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   3|   r|      11.0|      11.0|
|   1|   a|      10.0|      47.0|
+----+----+----------+----------+
Another option is to use window functions, since sum_level1 is just the sum of sum_level2 within each col1 group:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy($"col1")
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
   withColumn("sum_level1", sum($"sum_level2").over(w)).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
|   1|   c|      13.0|      47.0|
|   1|   b|      24.0|      47.0|
|   1|   a|      10.0|      47.0|
|   3|   r|      11.0|      11.0|
|   2|   c|      23.0|      37.0|
|   2|   a|      14.0|      37.0|
+----+----+----------+----------+