df = spark.createDataFrame(
    [
        ['A', '1', '3'],
        ['A', '2', '7'],
        ['A', '3', '1'],
        ['A', '1', '5'],
        ['A', '3', '4'],
        ['A', '5', '2'],
        ['B', '1', '8'],
        ['B', '2', '4'],
        ['B', '4', '2'],
        ['B', '6', '8']
    ],
    ['col1', 'col2', 'col3']
)
df.show()
Grouping by col1 and using the value of col2 as the condition (a new group starts whenever col2 is 1), I want to collect the rows into lists and get:
+----+------------+------------+
|col1|        col2|        col3|
+----+------------+------------+
|   A|   [1, 2, 3]|   [3, 7, 1]|
|   A|   [1, 3, 5]|   [5, 4, 2]|
|   B|[1, 2, 4, 6]|[8, 4, 2, 8]|
+----+------------+------------+
I have changed the content of the question: I added one column (col4) to order those rows; if some values in this column are duplicated, the order of those rows does not matter:
df = spark.createDataFrame(
    [
        ['A', '1', '3', '2'],
        ['A', '2', '7', '2'],
        ['A', '3', '1', '2'],
        ['A', '1', '5', '3'],
        ['A', '3', '4', '3'],
        ['A', '5', '2', '4'],
        ['B', '1', '8', '4'],
        ['B', '2', '4', '5'],
        ['B', '4', '2', '6'],
        ['B', '6', '8', '7']
    ],
    ['col1', 'col2', 'col3', 'col4']
)
df.show()
As you have a new column for sorting, you can use .sum() and Window to create a group column:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = df.orderBy(['col1', 'col4', 'col2'])
df = df.withColumn(
    'group',
    func.sum(func.when(func.col('col2') == "1", 1).otherwise(0))
        .over(Window.partitionBy('col1').orderBy(func.asc('col4')))
)
df.show()
+----+----+----+----+-----+
|col1|col2|col3|col4|group|
+----+----+----+----+-----+
|   A|   1|   3|   2|    1|
|   A|   2|   7|   2|    1|
|   A|   3|   1|   2|    1|
|   A|   1|   5|   3|    2|
|   A|   3|   4|   3|    2|
|   A|   5|   2|   4|    2|
|   B|   1|   8|   4|    1|
|   B|   2|   4|   5|    1|
|   B|   4|   2|   6|    1|
|   B|   6|   8|   7|    1|
+----+----+----+----+-----+
Then you can group by col1 and the group column and collect the lists:
df\
    .groupby('col1', 'group')\
    .agg(
        func.collect_list('col2').alias('col2'),
        func.collect_list('col3').alias('col3')
    )\
    .drop('group')\
    .show(10, False)
+----+------------+------------+
|col1|col2        |col3        |
+----+------------+------------+
|A   |[1, 2, 3]   |[3, 7, 1]   |
|A   |[1, 3, 5]   |[5, 4, 2]   |
|B   |[1, 2, 4, 6]|[8, 4, 2, 8]|
+----+------------+------------+
How can I remove all rows before a certain condition in PySpark?
df = spark.createDataFrame(
    sc.parallelize([
        ['A', '2019-01-01', None, None, None],
        ['A', '2019-01-02', None, None, None],
        ['A', '2019-01-03', 'O', 'O', 1],
        ['A', '2019-01-04', 'O', 'P', 2],
        ['A', '2019-01-05', 'O', 'P', 3],
        ['A', '2019-01-06', None, None, None],
        ['A', '2019-01-07', None, None, 4],
    ]),
    ['ID', 'Time', 'State', 'State2', 'LatestRecord'])
# expected
+---+----------+-----+------+------------+
| ID|      Time|State|State2|LatestRecord|
+---+----------+-----+------+------------+
|  A|2019-01-03|    O|     O|           1|
|  A|2019-01-04|    O|     P|           2|
|  A|2019-01-05|    O|     P|           3|
|  A|2019-01-06| null|  null|        null|
|  A|2019-01-07| null|  null|           4|
+---+----------+-----+------+------------+
The condition that jumped out to me was: remove all rows where Time is earlier than the Time at which LatestRecord == 1, but I am quite stuck as to how to make that happen.
My (failed) attempts so far:
# transform min date; fails
df = df.withColumn('earliestDate', F.when( F.col('LatestRecord') == 1, F.col('Time').over(Window.partitionBy('ID'))))
# then when earliestDate >= Time, filter
df = df.filter(df.earliestDate >= df.Time)
I devised an answer as follows that seems to work and gives me a condition I can filter on:
# condition that I can filter by
w = Window.partitionBy("ID").orderBy(F.asc('Time'))
criteria = F.when((F.col("LatestRecord") == 1), F.lit(1)).otherwise(F.lit(0))
df.withColumn("Flag", F.when(F.sum(criteria).over(w) > 0, F.lit(1)).otherwise(F.lit(0))).show()
+---+----------+-----+------+------------+----+
| ID|      Time|State|State2|LatestRecord|Flag|
+---+----------+-----+------+------------+----+
|  A|2019-01-01| null|  null|        null|   0|
|  A|2019-01-02| null|  null|        null|   0|
|  A|2019-01-03|    O|     O|           1|   1|
|  A|2019-01-04|    O|     P|           2|   1|
|  A|2019-01-05|    O|     P|           3|   1|
|  A|2019-01-06| null|  null|        null|   1|
|  A|2019-01-07| null|  null|           4|   1|
+---+----------+-----+------+------------+----+
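For completeness, a minimal sketch of the remaining filter step (assuming the same df and the F / Window imports used above); it keeps only the rows at or after the first LatestRecord == 1 per ID and then drops the helper column:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ID").orderBy(F.asc("Time"))
criteria = F.when(F.col("LatestRecord") == 1, F.lit(1)).otherwise(F.lit(0))

# running count of LatestRecord == 1 rows; anything at or after the first one stays
result = (
    df.withColumn("Flag", F.sum(criteria).over(w))
    .filter(F.col("Flag") > 0)
    .drop("Flag")
)
result.show()
This should produce the expected table from the question (rows from 2019-01-03 onwards).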
I am having difficulty implementing this existing answer:
PySpark - get row number for each row in a group
Consider the following:
# create df
df = spark.createDataFrame(
    sc.parallelize([
        [1, 'A', 20220722, 1],
        [1, 'A', 20220723, 1],
        [1, 'B', 20220724, 2],
        [2, 'B', 20220722, 1],
        [2, 'C', 20220723, 2],
        [2, 'B', 20220724, 3],
    ]),
    ['ID', 'State', 'Time', 'Expected'])
# rank
w = Window.partitionBy('State').orderBy('ID', 'Time')
df = df.withColumn('rn', F.row_number().over(w))
df = df.withColumn('rank', F.rank().over(w))
df = df.withColumn('dense', F.dense_rank().over(w))
# view
df.show()
+---+-----+--------+--------+---+----+-----+
| ID|State|    Time|Expected| rn|rank|dense|
+---+-----+--------+--------+---+----+-----+
|  1|    A|20220722|       1|  1|   1|    1|
|  1|    A|20220723|       1|  2|   2|    2|
|  1|    B|20220724|       2|  1|   1|    1|
|  2|    B|20220722|       1|  2|   2|    2|
|  2|    B|20220724|       3|  3|   3|    3|
|  2|    C|20220723|       2|  1|   1|    1|
+---+-----+--------+--------+---+----+-----+
How can I get the expected value and also sort the dates correctly such that they are ascending?
You restart your count for each new id value, which means the id field is your partition field, not state.
Here is an approach with the sum window function:
import sys

from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

# data_sdf is the input dataframe (df above)
data_sdf. \
    withColumn('st_notsame',
               func.coalesce(func.col('state') != func.lag('state').over(wd.partitionBy('id').orderBy('time')),
                             func.lit(True)).cast('int')
               ). \
    withColumn('rank',
               func.sum('st_notsame').over(wd.partitionBy('id').orderBy('time', 'state').rowsBetween(-sys.maxsize, 0))
               ). \
    show()
# +---+-----+--------+--------+----------+----+
# | id|state|    time|expected|st_notsame|rank|
# +---+-----+--------+--------+----------+----+
# |  1|    A|20220722|       1|         1|   1|
# |  1|    A|20220723|       1|         0|   1|
# |  1|    B|20220724|       2|         1|   2|
# |  2|    B|20220722|       1|         1|   1|
# |  2|    C|20220723|       2|         1|   2|
# |  2|    B|20220724|       3|         1|   3|
# +---+-----+--------+--------+----------+----+
You first flag all the consecutive occurrences of the state as 0 and the others as 1 - this enables you to do a running sum.
Then use the sum window with an infinite lookback for each id to get your desired ranking.
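As a side note, the same infinite lookback can also be written with the built-in Window frame constants instead of -sys.maxsize; a minimal sketch of an equivalent window, assuming the same wd alias:
from pyspark.sql.window import Window as wd

# same frame as rowsBetween(-sys.maxsize, 0), using the named constants
w = wd.partitionBy('id') \
    .orderBy('time', 'state') \
    .rowsBetween(wd.unboundedPreceding, wd.currentRow)
func.sum('st_notsame').over(w) then gives the same rank column.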
For example, if we have the following dataframe:
df = spark.createDataFrame([['a', 1], ['a', 1],
                            ['b', 1], ['b', 2],
                            ['c', 2], ['c', 2], ['c', 2]],
                           ['col1', 'col2'])
+----+----+
|col1|col2|
+----+----+
|   a|   1|
|   a|   1|
|   b|   1|
|   b|   2|
|   c|   2|
|   c|   2|
|   c|   2|
+----+----+
I want to mark groups based on col1 where values in col2 repeat themselves. I have an idea to find the difference between the group size and the count of distinct values:
window = Window.partitionBy('col1')
df.withColumn('col3', F.count('col2').over(window)).\
    withColumn('col4', F.approx_count_distinct('col2').over(window)).\
    select('col1', 'col2', (F.col('col3') - F.col('col4')).alias('col3')).show()
Maybe you have a better solution. My expected output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   a|   1|   1|
|   a|   1|   1|
|   b|   1|   0|
|   b|   2|   0|
|   c|   2|   2|
|   c|   2|   2|
|   c|   2|   2|
+----+----+----+
As you can see all groups where col3 is equal to zero have only unique values in col2.
Based on your needs, you can compute the count grouped by col1 and col2:
df = df.withColumn('col3', F.expr('count(*) over (partition by col1,col2) - 1'))
df.show(truncate=False)
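Equivalently, a minimal sketch of the same count-minus-one written with the DataFrame window API instead of the SQL expression (assuming F is pyspark.sql.functions):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# number of rows in the (col1, col2) group, excluding the row itself
w = Window.partitionBy('col1', 'col2')
df = df.withColumn('col3', F.count('*').over(w) - 1)
df.show(truncate=False)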
I have a complicated question. I have a column that contains pairs (tuples) and another column with their frequencies:
Col1 Col2
('A','B') 5
('C','C') 4
('F','D') 8
I also have another dataframe with the individual elements of the tuples and their frequencies:
Col3 Col4
'A' 2
'B' 5
'C' 1
'F' 2
'D' 3
I need to make a new column from the frequencies. For each tuple (A, B) I need the frequency of A, the frequency of B, and the frequency of the tuple.
Output:
Col1 new_col
('A','B') 2,5,5
('C','C') 1,1,4
('F','D') 2,3,8
creation of the data based on your example
dataset 1:
b = "Col1 Col2".split()
a = [
    (["A", "B"], 5),
    (["C", "C"], 4),
    (["F", "D"], 8),
]
df1 = spark.createDataFrame(a, b)
df1.show()
+------+----+
|  Col1|Col2|
+------+----+
|[A, B]|   5|
|[C, C]|   4|
|[F, D]|   8|
+------+----+
dataset 2:
b = "Col3 Col4".split()
a = [
    ["A", 2],
    ["B", 5],
    ["C", 1],
    ["F", 2],
    ["D", 3],
]
df2 = spark.createDataFrame(a, b)
df2.show()
+----+----+
|Col3|Col4|
+----+----+
|   A|   2|
|   B|   5|
|   C|   1|
|   F|   2|
|   D|   3|
+----+----+
preparation of df1
df1 = df1.withColumn("value1", df1["col1"].getItem(0)).withColumn(
    "value2", df1["col1"].getItem(1)
)
df1.show()
+------+----+------+------+
|  Col1|Col2|value1|value2|
+------+----+------+------+
|[A, B]|   5|     A|     B|
|[C, C]|   4|     C|     C|
|[F, D]|   8|     F|     D|
+------+----+------+------+
join of the dataframes
from pyspark.sql import functions as F

df3 = (
    df1.join(
        df2.alias("value1"), on=F.col("value1") == F.col("value1.col3"), how="left"
    )
    .join(df2.alias("value2"), on=F.col("value2") == F.col("value2.col3"), how="left")
    .select(
        "col1",
        "value1.col4",
        "value2.col4",
        "col2",
    )
)
df3.show()
+------+----+----+----+
|  col1|col4|col4|col2|
+------+----+----+----+
|[A, B]|   2|   5|   5|
|[F, D]|   2|   3|   8|
|[C, C]|   1|   1|   4|
+------+----+----+----+
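To get the single new_col asked for in the question, one possible final step is to concatenate the three frequencies; a sketch reusing the aliases above (concat_ws and the string casts are my additions, not part of the original answer):
from pyspark.sql import functions as F

# "frequency of element 1, frequency of element 2, frequency of the pair"
df4 = (
    df1.join(df2.alias("value1"), on=F.col("value1") == F.col("value1.col3"), how="left")
    .join(df2.alias("value2"), on=F.col("value2") == F.col("value2.col3"), how="left")
    .select(
        "Col1",
        F.concat_ws(
            ",",
            F.col("value1.col4").cast("string"),
            F.col("value2.col4").cast("string"),
            F.col("Col2").cast("string"),
        ).alias("new_col"),
    )
)
df4.show()
This matches the new_col layout from the question (2,5,5 / 1,1,4 / 2,3,8), though the row order may differ.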
For the following example DataFrame:
df = spark.createDataFrame(
    [
        ('2017-01-01', 'A', 1),
        ('2017-01-01', 'B', 2),
        ('2017-01-01', 'C', 3),
        ('2017-01-02', 'A', 4),
        ('2017-01-02', 'B', 5),
        ('2017-01-02', 'C', 6),
        ('2017-01-03', 'A', 7),
        ('2017-01-03', 'B', 8),
        ('2017-01-03', 'C', 9),
    ],
    ('date', 'type', 'value')
)
I would like to convert it to have the columns equal to all unique "types" (A, B, and C).
Currently, I have found this code works closest to what I would like to achieve:
df.groupby("date", "type").pivot("type").sum().orderBy("date").show()
+----------+----+----+----+----+
|      date|type|   A|   B|   C|
+----------+----+----+----+----+
|2017-01-01|   C|null|null|   3|
|2017-01-01|   A|   1|null|null|
|2017-01-01|   B|null|   2|null|
|2017-01-02|   B|null|   5|null|
|2017-01-02|   C|null|null|   6|
|2017-01-02|   A|   4|null|null|
|2017-01-03|   A|   7|null|null|
|2017-01-03|   C|null|null|   9|
|2017-01-03|   B|null|   8|null|
+----------+----+----+----+----+
The issue is that I still have too many rows (containing all "null").
What I would like to get is:
+----------+---+---+---+
|      date|  A|  B|  C|
+----------+---+---+---+
|2017-01-01|  1|  2|  3|
|2017-01-02|  4|  5|  6|
|2017-01-03|  7|  8|  9|
+----------+---+---+---+
Aka, I would like something that has similar functionality to pandas.DataFrame.unstack().
If anyone has any tips on how I can achieve this in PySpark that would be great.
You need to do another group by on the "date" column and then select the max values of A, B, and C.
Example:
df.groupby("date", "type").pivot("type").sum().orderBy("date").groupBy("date").agg(max(col("A")).alias("A"),max(col("B")).
#+----------+---+---+---+
#|      date|  A|  B|  C|
#+----------+---+---+---+
#|2017-01-01|  1|  2|  3|
#|2017-01-02|  4|  5|  6|
#|2017-01-03|  7|  8|  9|
#+----------+---+---+---+
# dynamic way
aggregate = ["A","B","C"]
funs=[max]
exprs=[f(col(c)).alias(c) for f in funs for c in aggregate]
df.groupby("date", "type").pivot("type").sum().orderBy("date").groupBy("date").agg(*exprs).show()
#+----------+---+---+---+
#|      date|  A|  B|  C|
#+----------+---+---+---+
#|2017-01-01|  1|  2|  3|
#|2017-01-02|  4|  5|  6|
#|2017-01-03|  7|  8|  9|
#+----------+---+---+---+
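Alternatively, a simpler sketch (not part of the original answer): if you group only by "date" before the pivot, Spark already produces one row per date and the second aggregation is not needed:
# group only by "date" so the pivot yields a single row per date
df.groupBy("date").pivot("type").sum("value").orderBy("date").show()
This yields the desired table shown above, with one row per date and one column per type.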