PySpark: Create incrementing group column counter

How can I generate the expected column, ExpectedGroup, so that its value stays the same while cond1 is True, and increments by 1 each time a new run of True rows starts after a False in cond1?
Consider:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame(sc.parallelize([
['A', '2019-01-01', 'P', 'O', 2, None],
['A', '2019-01-02', 'O', 'O', 5, 1],
['A', '2019-01-03', 'O', 'O', 10, 1],
['A', '2019-01-04', 'O', 'P', 4, None],
['A', '2019-01-05', 'P', 'P', 300, None],
['A', '2019-01-06', 'P', 'O', 2, None],
['A', '2019-01-07', 'O', 'O', 5, 2],
['A', '2019-01-08', 'O', 'O', 10, 2],
['A', '2019-01-09', 'O', 'P', 4, None],
['A', '2019-01-10', 'P', 'P', 300, None],
['B', '2019-01-01', 'P', 'O', 2, None],
['B', '2019-01-02', 'O', 'O', 5, 3],
['B', '2019-01-03', 'O', 'O', 10, 3],
['B', '2019-01-04', 'O', 'P', 4, None],
['B', '2019-01-05', 'P', 'P', 300, None],
]),
['ID', 'Time', 'FromState', 'ToState', 'Hours', 'ExpectedGroup'])
# condition statement
cond1 = (df.FromState == 'O') & (df.ToState == 'O')
df = df.withColumn('condition', cond1.cast("int"))
df = df.withColumn('conditionLead', F.lead('condition').over(Window.orderBy('ID', 'Time')))
df = df.na.fill(value=0, subset=["conditionLead"])
df = df.withColumn('finalCondition', ( (F.col('condition') == 1) & (F.col('conditionLead') == 1)).cast('int'))
# working pandas option:
# cond1 = ( (df.FromState == 'O') & (df.ToState == 'O') )
# df['ExpectedGroup'] = (cond1.shift(-1) & cond1).cumsum().mask(~cond1)
# other working option:
# cond1 = ( (df.FromState == 'O') & (df.ToState == 'O') )
# df['ExpectedGroup'] = (cond1.diff()&cond1).cumsum().where(cond1)
# failing here
windowval = (Window.partitionBy('ID').orderBy('Time').rowsBetween(Window.unboundedPreceding, 0))
df = df.withColumn('ExpectedGroup2', F.sum(F.when(cond1, F.col('finalCondition'))).over(windowval))

Use the same logic as in your pandas code: get the previous value of cond1 with the lag window function, set a flag to 1 only when the current cond1 is true and the previous cond1 is false, then take the cumulative sum of that flag and keep it only where cond1 is true. See the code below. (BTW, you probably want to add ID to the partitionBy clause of the WindowSpec; in that case the last ExpectedGroup1 should be 1 instead of 3.)
from pyspark.sql import functions as F, Window
w = Window.partitionBy().orderBy('ID', 'Time')
df_new = (df.withColumn('cond1', (F.col('FromState')=='O') & (F.col('ToState')=='O'))
# lag defaults to False so that a run starting on the very first row is still counted
.withColumn('f', F.when(F.col('cond1') & (~F.lag(F.col('cond1'), 1, False).over(w)),1).otherwise(0))
.withColumn('ExpectedGroup1', F.when(F.col('cond1'), F.sum('f').over(w)))
)
df_new.show()
+---+----------+---------+-------+-----+-------------+-----+---+--------------+
| ID| Time|FromState|ToState|Hours|ExpectedGroup|cond1| f|ExpectedGroup1|
+---+----------+---------+-------+-----+-------------+-----+---+--------------+
| A|2019-01-01| P| O| 2| null|false| 0| null|
| A|2019-01-02| O| O| 5| 1| true| 1| 1|
| A|2019-01-03| O| O| 10| 1| true| 0| 1|
| A|2019-01-04| O| P| 4| null|false| 0| null|
| A|2019-01-05| P| P| 300| null|false| 0| null|
| A|2019-01-06| P| O| 2| null|false| 0| null|
| A|2019-01-07| O| O| 5| 2| true| 1| 2|
| A|2019-01-08| O| O| 10| 2| true| 0| 2|
| A|2019-01-09| O| P| 4| null|false| 0| null|
| A|2019-01-10| P| P| 300| null|false| 0| null|
| B|2019-01-01| P| O| 2| null|false| 0| null|
| B|2019-01-02| O| O| 5| 3| true| 1| 3|
| B|2019-01-03| O| O| 10| 3| true| 0| 3|
| B|2019-01-04| O| P| 4| null|false| 0| null|
| B|2019-01-05| P| P| 300| null|false| 0| null|
+---+----------+---------+-------+-----+-------------+-----+---+--------------+
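The parenthetical suggestion above (restart the counter for each ID) can be checked without a Spark session. The following is a minimal plain-Python sketch of the same flag-and-cumulative-sum logic, assuming the rows are ordered by Time within each ID; the function name label_runs is hypothetical and not part of any library.

```python
from itertools import groupby

# (ID, Time, FromState, ToState) rows copied from the question's sample data
rows = [
    ("A", "2019-01-01", "P", "O"), ("A", "2019-01-02", "O", "O"),
    ("A", "2019-01-03", "O", "O"), ("A", "2019-01-04", "O", "P"),
    ("A", "2019-01-05", "P", "P"), ("A", "2019-01-06", "P", "O"),
    ("A", "2019-01-07", "O", "O"), ("A", "2019-01-08", "O", "O"),
    ("A", "2019-01-09", "O", "P"), ("A", "2019-01-10", "P", "P"),
    ("B", "2019-01-01", "P", "O"), ("B", "2019-01-02", "O", "O"),
    ("B", "2019-01-03", "O", "O"), ("B", "2019-01-04", "O", "P"),
    ("B", "2019-01-05", "P", "P"),
]


def label_runs(rows):
    """Return one group number (or None) per row, restarting at 1 for each ID."""
    labels = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        previous, counter = False, 0  # the "lag" of cond1 defaults to False
        for _id, _time, from_state, to_state in partition:
            cond1 = from_state == "O" and to_state == "O"
            if cond1 and not previous:  # a new run of True rows starts here
                counter += 1
            labels.append(counter if cond1 else None)
            previous = cond1
    return labels


labels = label_runs(rows)
print(labels)
```

With the counter partitioned by ID, the run in B is numbered 1 rather than 3, which is the difference the answer points out.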

How to create a group column counter in PySpark?
To create a group column counter in PySpark, we can combine a Window specification with the row_number function. The Window specification defines how rows are partitioned and ordered, and row_number returns the position of each row within its partition.
For example, suppose we have a PySpark DataFrame called df with the following data:
+---+----+--------+
| id|name|category|
+---+----+--------+
|  1|   A|       X|
|  2|   B|       X|
|  3|   C|       Y|
|  4|   D|       Y|
|  5|   E|       Z|
+---+----+--------+
We want to create a new column called group_counter that assigns a number to each row within the same category, starting from 1. To do this, we can use the following code:
# Import the required modules
from pyspark.sql import Window
from pyspark.sql.functions import row_number
# Define the window specification
window = Window.partitionBy("category").orderBy("id")
# Create the group column counter
df = df.withColumn("group_counter", row_number().over(window))
# Show the result
df.show()
The output of the code is:
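+---+----+--------+-------------+
| id|name|category|group_counter|
+---+----+--------+-------------+
|  1|   A|       X|            1|
|  2|   B|       X|            2|
|  3|   C|       Y|            1|
|  4|   D|       Y|            2|
|  5|   E|       Z|            1|
+---+----+--------+-------------+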
As we can see, the group_counter column increments by 1 for each row within the same category, and resets to 1 when the category changes.
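To make the partition-and-number behaviour concrete, here is a small plain-Python sketch of what row_number over a window partitioned by category and ordered by id computes. It is only an illustration of the semantics, not how Spark executes it; the helper name add_group_counter is hypothetical.

```python
from collections import defaultdict

# (id, name, category) rows from the example table
rows = [(1, "A", "X"), (2, "B", "X"), (3, "C", "Y"), (4, "D", "Y"), (5, "E", "Z")]


def add_group_counter(rows):
    """Append a 1-based counter that restarts for every category."""
    counters = defaultdict(int)  # one running counter per partition key
    result = []
    for id_, name, category in sorted(rows, key=lambda r: r[0]):  # order by id
        counters[category] += 1
        result.append((id_, name, category, counters[category]))
    return result


numbered = add_group_counter(rows)
for row in numbered:
    print(row)
```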
Why is creating a group column counter useful?
Creating a group column counter can be useful for various purposes, such as:
Ranking the rows within a group based on some criteria, such as sales, ratings, or popularity.
Assigning labels or identifiers to the rows within a group, such as customer segments, product categories, or order numbers.
Performing calculations or aggregations based on the group column counter, such as cumulative sums, averages, or percentages.

Related

When Assembling a Vector, should I be concerned about the format of the vectorized features

Working with PySpark in Databricks. I noticed that when I assemble a vector using just a few columns, then the output is how I expect to see it. But when I use a larger number of columns (and many of them have 0's), then the output column looks different.
Example: Just a few columns...
import pandas as pd
#create data
data = [[1, 10, 0, 1], [2, 15, 16, 1], [3, 0, 10, 0]]
pdf = pd.DataFrame(data, columns=["id", "iv1", "iv2", "dv"])
df1 = spark.createDataFrame(pdf)
df2 = spark.createDataFrame(data, schema="id LONG, iv1 integer, iv2 integer, dv integer")
df2.show()
+---+---+---+---+
| id|iv1|iv2| dv|
+---+---+---+---+
| 1| 10| 0| 1|
| 2| 15| 16| 1|
| 3| 0| 10| 0|
+---+---+---+---+
#assemble vector
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols= ["iv1", "iv2"],
outputCol= "features")
output = assembler.transform(df2)
output.select("features").show(truncate=False)
+-----------+
|features |
+-----------+
|[10.0,0.0] |
|[15.0,16.0]|
|[0.0,10.0] |
+-----------+
In the above example, the output is easy to read as [10.0,0.0]. Compare to the following example which uses more columns and many zeros.
data = [[58,1,0,0,0,1,2,0,0,0,0,0,0,0,0,0,0,1]]
pdf = pd.DataFrame(data, columns=["iv1", "iv2", "iv3", "iv4", "iv5", "iv6", "iv7", "iv8", "iv9", "iv10", "iv11", "iv12", "iv13", "iv14", "iv15", "iv16", "iv17", "dv"])
df1 = spark.createDataFrame(pdf)
df2 = spark.createDataFrame(data, schema="""iv1 integer, iv2 integer, iv3 integer, iv4 integer, iv5 integer, iv6 integer,
iv7 integer, iv8 integer, iv9 integer, iv10 integer, iv11 integer, iv12 integer, iv13 integer,
iv14 integer, iv15 integer, iv16 integer, iv17 integer, dv integer""")
df2.show()
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+---+
|iv1|iv2|iv3|iv4|iv5|iv6|iv7|iv8|iv9|iv10|iv11|iv12|iv13|iv14|iv15|iv16|iv17| dv|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+---+
| 58| 1| 0| 0| 0| 1| 2| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1|
+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+----+---+
assembler = VectorAssembler(inputCols= ["iv1", "iv2", "iv3", "iv4", "iv5", "iv6", "iv7", "iv8", "iv9",
"iv10", "iv11", "iv12", "iv13", "iv14", "iv15", "iv16", "iv17"],
outputCol= "features")
output = assembler.transform(df2)
output.select("features").show(truncate=False)
+------------------------------------------------+
|features |
+------------------------------------------------+
|(17,[0,1,5,6],[58.0,1.0,1.0,2.0])|
+------------------------------------------------+
The output is (17,[0,1,5,6],[58.0,1.0,1.0,2.0]).
Compare that to [10.0,0.0] when just a few well-filled columns are vectorized.
My question is this: do I need to be concerned about this format when passing the vectorized output to train a regression model?
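For reading the second output: it uses Spark's sparse vector notation, (size, [indices], [values]), where every position not listed is 0.0. As far as I know, both notations describe the same kind of feature vector and differ only in how the values are stored. The plain-Python sketch below expands the sparse triple from the question into the dense form; sparse_to_dense is a hypothetical helper, not a Spark function.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list of floats."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = value
    return dense


# The triple printed in the question: (17,[0,1,5,6],[58.0,1.0,1.0,2.0])
dense = sparse_to_dense(17, [0, 1, 5, 6], [58.0, 1.0, 1.0, 2.0])
print(dense)
```

The expanded list matches the iv1 to iv17 values of the original row (58, 1, 0, 0, 0, 1, 2, then zeros).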

Get the first row with positive amount in Pyspark

I have data like this
I want to flag the first positive amount as below
How do I flag the first positive amount for each id as shown above in Active column?
df = spark.createDataFrame(
[
('10/01/2022', '1', None),
('18/01/2022', '1', 50),
('31/01/2022', '1', -100)
], ['Date', 'Id', 'Amount']
)
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.partitionBy('Id').orderBy('Date')
df\
.withColumn('only_pos', F.when(F.col('Amount')>0, F.col('Amount')).otherwise(F.lit(None)))\
.withColumn('First_pos', F.first('only_pos', True).over(w))\
.withColumn('Active', F.when(F.col('only_pos')==F.col('First_pos'),F.lit('Yes')).otherwise(F.lit(None)))\
.select('Date', 'Id', 'Amount', 'Active')\
.show()
+----------+---+------+------+
| Date| Id|Amount|Active|
+----------+---+------+------+
|10/01/2022| 1| null| null|
|18/01/2022| 1| 50| Yes|
|31/01/2022| 1| -100| null|
+----------+---+------+------+
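Two caveats with the snippet above are worth checking against your data. Comparing only_pos to First_pos also flags any later row that repeats the same positive amount, and ordering by a dd/MM/yyyy string is not chronological across months. The plain-Python sketch below shows the intended rule (only the first positive row per Id, in date order); the fourth row is a hypothetical addition to expose the repeated-amount case, and flag_first_positive is an illustrative name.

```python
from datetime import datetime

rows = [
    ("10/01/2022", "1", None),
    ("18/01/2022", "1", 50),
    ("31/01/2022", "1", -100),
    ("05/02/2022", "1", 50),  # hypothetical extra row: same amount, later date
]


def flag_first_positive(rows):
    """Mark only the chronologically first positive Amount for each Id."""
    already_flagged = set()

    def sort_key(row):
        # parse the date so that ordering is chronological, not lexicographic
        return (row[1], datetime.strptime(row[0], "%d/%m/%Y"))

    result = []
    for date, id_, amount in sorted(rows, key=sort_key):
        is_first = amount is not None and amount > 0 and id_ not in already_flagged
        if is_first:
            already_flagged.add(id_)
        result.append((date, id_, amount, "Yes" if is_first else None))
    return result


flagged = flag_first_positive(rows)
for row in flagged:
    print(row)
```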

Select Data from multiple rows to one row

I'm pretty new to functional programming and PySpark, and I'm currently struggling to condense my source data into the shape I want.
Let's say I have two tables as DataFrames:
# if not already created automatically, instantiate Sparkcontext
spark = SparkSession.builder.getOrCreate()
columns = ['Id', 'JoinId', 'Name']
vals = [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')]
persons = spark.createDataFrame(vals,columns)
columns = ['Id', 'JoinId', 'Specification', 'Date', 'Destination']
vals = [(1, 10, 'I', '20051205', 'New York City'), (2, 11, 'I', '19991112', 'Berlin'), (3, 11, 'O', '20030101', 'Madrid'), (4, 13, 'I', '20200113', 'Paris'), (5, 11, 'U', '20070806', 'Lissabon')]
movements = spark.createDataFrame(vals,columns)
persons.show()
+---+------+----------+
| Id|JoinId| Name|
+---+------+----------+
| 1| 11| FirstName|
| 2| 12|SecondName|
| 3| 13| ThirdName|
+---+------+----------+
movements.show()
+---+------+-------------+--------+-------------+
| Id|JoinId|Specification| Date| Destination|
+---+------+-------------+--------+-------------+
| 1| 10| I|20051205|New York City|
| 2| 11| I|19991112| Berlin|
| 3| 11| O|20030101| Madrid|
| 4| 13| I|20200113| Paris|
| 5| 11| U|20070806| Lissabon|
+---+------+-------------+--------+-------------+
What I want to create is
+--------+----------+---------+---------+-----------+
|PersonId|PersonName| IDate| ODate|Destination|
| 1| FirstName| 19991112| 20030101| Berlin|
| 3| ThirdName| 20200113| | Paris|
+--------+----------+---------+---------+-----------+
The rules would be:
PersonId is the Id of the Person
IDate is the Date saved in the Movements DataFrame where Specification is I
ODate the Date saved in the Movements DataFrame where Specification is O
The Destination is the Destination of the joined entry where the Specification was I
I already joined the dataframes on JoinId:
from pyspark.sql.functions import col
joined = persons.withColumnRenamed('JoinId', 'P_JoinId').join(movements, col('P_JoinId') == movements.JoinId, how='inner')
joined.show()
+---+--------+---------+---+------+-------------+--------+-----------+
| Id|P_JoinId| Name| Id|JoinId|Specification| Date|Destination|
+---+--------+---------+---+------+-------------+--------+-----------+
| 1| 11|FirstName| 2| 11| I|19991112| Berlin|
| 1| 11|FirstName| 3| 11| O|20030101| Madrid|
| 1| 11|FirstName| 5| 11| U|20070806| Lissabon|
| 3| 13|ThirdName| 4| 13| I|20200113| Paris|
+---+--------+---------+---+------+-------------+--------+-----------+
But I'm struggling to select data from multiple rows and put them with the given rules into a single row...
Thank you for your help
Note: I have renamed Id in movements to Id_movements to avoid confusion when grouping later.
You can pivot the joined data on Specification and aggregate Date and Destination. That gives you one date column and one destination column per Specification value.
import pyspark.sql.functions as F
persons =sqlContext.createDataFrame( [(1, 11, 'FirstName'), (2, 12, 'SecondName'), (3, 13, 'ThirdName')],schema=['Id', 'JoinId', 'Name'])
movements=sqlContext.createDataFrame([(1, 10, 'I', '20051205', 'New York City'), (2, 11, 'I', '19991112', 'Berlin'), (3, 11, 'O', '20030101', 'Madrid'), (4, 13, 'I', '20200113', 'Paris'), (5, 11, 'U', '20070806', 'Lissabon')],schema=['Id_movements', 'JoinId', 'Specification', 'Date', 'Destination'])
df_joined = persons.withColumnRenamed('JoinId', 'P_JoinId').join(movements, F.col('P_JoinId') == movements.JoinId, how='inner')
#%%
df_pivot = df_joined.groupby(['Id','Name']).pivot('Specification').agg(F.min('Date').alias("date"),F.min('Destination').alias('destination'))
Here I have chosen the min aggregation, but you can choose whichever aggregation fits your needs and drop the irrelevant columns.
Result:
+---+---------+--------+-------------+--------+-------------+--------+-------------+
| Id| Name| I_date|I_destination| O_date|O_destination| U_date|U_destination|
+---+---------+--------+-------------+--------+-------------+--------+-------------+
| 1|FirstName|19991112| Berlin|20030101| Madrid|20070806| Lissabon|
| 3|ThirdName|20200113| Paris| null| null| null| null|
+---+---------+--------+-------------+--------+-------------+--------+-------------+
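The pivoted result still has to be reduced to the columns asked for in the question (IDate, ODate, and the Destination of the I row). As a way to check the target shape, here is a plain-Python sketch of those rules applied to the joined rows. It assumes at most one I row and one O row per person (if there are several, the last one seen wins); the function name condense is hypothetical.

```python
# (person Id, Name, Specification, Date, Destination) taken from the joined output
joined = [
    (1, "FirstName", "I", "19991112", "Berlin"),
    (1, "FirstName", "O", "20030101", "Madrid"),
    (1, "FirstName", "U", "20070806", "Lissabon"),
    (3, "ThirdName", "I", "20200113", "Paris"),
]


def condense(joined):
    """Collapse the joined rows into one record per person."""
    people = {}
    for person_id, name, spec, date, destination in joined:
        record = people.setdefault(person_id, {
            "PersonId": person_id, "PersonName": name,
            "IDate": None, "ODate": None, "Destination": None,
        })
        if spec == "I":    # the I row supplies both the date and the destination
            record["IDate"], record["Destination"] = date, destination
        elif spec == "O":  # the O row supplies only its date
            record["ODate"] = date
    return [people[key] for key in sorted(people)]


result = condense(joined)
for record in result:
    print(record)
```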

How to apply conditional counts (with reset) to grouped data in PySpark?

I have PySpark code that effectively groups up rows numerically, and increments when a certain condition is met. I'm having trouble figuring out how to transform this code, efficiently, into one that can be applied to groups.
Take this sample dataframe df
df = sqlContext.createDataFrame(
[
(33, [], '2017-01-01'),
(33, ['apple', 'orange'], '2017-01-02'),
(33, [], '2017-01-03'),
(33, ['banana'], '2017-01-04')
],
('ID', 'X', 'date')
)
This code achieves what I want for this sample df, which is to order by date and to create groups ('grp') that increment when the size column goes back to 0.
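from pyspark.sql import Window
# note: this sum is Spark's column aggregate and shadows Python's built-in sum
from pyspark.sql.functions import col, size, sum
df \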
.withColumn('size', size(col('X'))) \
.withColumn(
"grp",
sum((col('size') == 0).cast("int")).over(Window.orderBy('date'))
).show()
This is partly based on Pyspark - Cumulative sum with reset condition
Now what I am trying to do is apply the same approach to a dataframe that has multiple IDs - achieving a result that looks like
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2)
],
('ID', 'X', 'date', 'size', 'group')
)
edit for clarity
1) For the first date of each ID - the group should be 1 - regardless of what shows up in any other column.
2) However, for each subsequent date, I need to check the size column. If the size column is 0, then I increment the group number. If it is any non-zero, positive integer, then I continue the previous group number.
I've seen a few ways to handle this in pandas, but I'm having difficulty understanding how they apply in PySpark and how grouped data differs between pandas and Spark (e.g. do I need to use something called UDAFs?)
Create a column zero_or_first by checking whether the size is zero or the row is the first row. Then sum.
df2 = sqlContext.createDataFrame(
[
(33, [], '2017-01-01', 0, 1),
(33, ['apple', 'orange'], '2017-01-02', 2, 1),
(33, [], '2017-01-03', 0, 2),
(33, ['banana'], '2017-01-04', 1, 2),
(55, ['coffee'], '2017-01-01', 1, 1),
(55, [], '2017-01-03', 0, 2),
(55, ['banana'], '2017-01-01', 1, 1)
],
('ID', 'X', 'date', 'size', 'group')
)
from pyspark.sql import functions as F, Window
w = Window.partitionBy('ID').orderBy('date')
df2 = df2.withColumn('row', F.row_number().over(w))
df2 = df2.withColumn('zero_or_first', F.when((F.col('size')==0)|(F.col('row')==1), 1).otherwise(0))
df2 = df2.withColumn('grp', F.sum('zero_or_first').over(w))
df2.orderBy('ID').show()
Here's the output. You can see that column grp matches group, which holds the expected results.
+---+---------------+----------+----+-----+---+-------------+---+
| ID| X| date|size|group|row|zero_or_first|grp|
+---+---------------+----------+----+-----+---+-------------+---+
| 33| []|2017-01-01| 0| 1| 1| 1| 1|
| 33| [banana]|2017-01-04| 1| 2| 4| 0| 2|
| 33|[apple, orange]|2017-01-02| 2| 1| 2| 0| 1|
| 33| []|2017-01-03| 0| 2| 3| 1| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1| 1| 1|
| 55| [banana]|2017-01-01| 1| 1| 2| 0| 1|
| 55| []|2017-01-03| 0| 2| 3| 1| 2|
+---+---------------+----------+----+-----+---+-------------+---+
I added a window function, and created an index within each ID. Then I expanded the conditional statement to also reference that index. The following seems to produce my desired output dataframe - but I am interested in knowing if there is a more efficient way to do this.
window = Window.partitionBy('ID').orderBy('date')
df \
.withColumn('size', size(col('X'))) \
.withColumn('index', rank().over(window)) \
.withColumn(
"grp",
sum(((col('size') == 0) | (col('index') == 1)).cast("int")).over(window)
).show()
which yields
+---+---------------+----------+----+-----+---+
| ID| X| date|size|index|grp|
+---+---------------+----------+----+-----+---+
| 33| []|2017-01-01| 0| 1| 1|
| 33|[apple, orange]|2017-01-02| 2| 2| 1|
| 33| []|2017-01-03| 0| 3| 2|
| 33| [banana]|2017-01-04| 1| 4| 2|
| 55| [coffee]|2017-01-01| 1| 1| 1|
| 55| []|2017-01-03| 0| 2| 2|
+---+---------------+----------+----+-----+---+
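The two rules from the question's edit (the first date of each ID starts group 1, and every later row with size 0 starts the next group) can also be written as a plain-Python reference, which is handy for checking the window-function output on small samples. This is a sketch under the assumption that dates are unique within an ID; assign_groups is a hypothetical name.

```python
from itertools import groupby

# (ID, X, date) rows matching the expected df2 in the question
rows = [
    (33, [], "2017-01-01"),
    (33, ["apple", "orange"], "2017-01-02"),
    (33, [], "2017-01-03"),
    (33, ["banana"], "2017-01-04"),
    (55, ["coffee"], "2017-01-01"),
    (55, [], "2017-01-03"),
]


def assign_groups(rows):
    """Return (ID, date, size, grp) tuples with grp restarting for each ID."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for _, partition in groupby(ordered, key=lambda r: r[0]):
        grp = 0
        for position, (id_, x, date) in enumerate(partition):
            # a new group starts on the first row of an ID or when size is 0
            if position == 0 or len(x) == 0:
                grp += 1
            result.append((id_, date, len(x), grp))
    return result


grouped = assign_groups(rows)
for row in grouped:
    print(row)
```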

How to properly label original observations with predicted clusters using k-means in PySpark?

I'd like to understand how the k-means method works in PySpark.
For this, I've done this small example:
In [120]: entry = [ [1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[5,5,5],[5,5,5],[1,1,1],[5,5,5]]
In [121]: rdd_entry = sc.parallelize(entry)
In [122]: clusters = KMeans.train(rdd_entry, k=5, maxIterations=10, initializationMode="random")
In [123]: rdd_labels = clusters.predict(rdd_entry)
In [125]: rdd_labels.collect()
Out[125]: [3, 1, 0, 0, 2, 2, 2, 3, 2]
In [126]: entry
Out[126]:
[[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4],
[5, 5, 5],
[5, 5, 5],
[5, 5, 5],
[1, 1, 1],
[5, 5, 5]]
At first glance it seems that rdd_labels returns the cluster to which each observation belongs, respecting the order of the original rdd. Although in this example it is evident, how can I be sure in a case where I will work with 8 million observations?
Also, I'd like to know how to join rdd_entry and rdd_labels, respecting that order, so that each observation of rdd_entry is correctly labeled with its cluster.
I tried a .join(), but it raises an error:
In [127]: rdd_total = rdd_entry.join(rdd_labels)
In [128]: rdd_total.collect()
TypeError: 'int' object has no attribute '__getitem__'
Hope it helps! (this solution is based on pyspark.ml)
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
#sample data
df = sc.parallelize([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[5,5,5],[5,5,5],[1,1,1],[5,5,5]]).\
toDF(('col1','col2','col3'))
vecAssembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vector_df = vecAssembler.transform(df)
#kmeans clustering
kmeans=KMeans(k=3, seed=1)
model=kmeans.fit(vector_df)
predictions=model.transform(vector_df)
predictions.show()
Output is:
+----+----+----+-------------+----------+
|col1|col2|col3| features|prediction|
+----+----+----+-------------+----------+
| 1| 1| 1|[1.0,1.0,1.0]| 0|
| 2| 2| 2|[2.0,2.0,2.0]| 0|
| 3| 3| 3|[3.0,3.0,3.0]| 2|
| 4| 4| 4|[4.0,4.0,4.0]| 1|
| 5| 5| 5|[5.0,5.0,5.0]| 1|
| 5| 5| 5|[5.0,5.0,5.0]| 1|
| 5| 5| 5|[5.0,5.0,5.0]| 1|
| 1| 1| 1|[1.0,1.0,1.0]| 0|
| 5| 5| 5|[5.0,5.0,5.0]| 1|
+----+----+----+-------------+----------+
Although pyspark.ml offers the better approach, I thought I'd write code that achieves the same result with pyspark.mllib (prompted by the comment from @Muhammad). So here is the solution based on pyspark.mllib:
from pyspark.mllib.clustering import KMeans
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType
#sample data
rdd = sc.parallelize([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5],[5,5,5],[5,5,5],[1,1,1],[5,5,5]])
#K-Means example
model = KMeans.train(rdd, k=3, seed=1)
labels = model.predict(rdd)
#add cluster label to the original data
df1 = rdd.toDF(('col1','col2','col3')) \
.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
df2 = spark.createDataFrame(labels, IntegerType()).toDF(('label')) \
.withColumn('row_index', row_number().over(Window.orderBy(monotonically_increasing_id())))
df = df1.join(df2, on=["row_index"]).drop("row_index")
df.show()