Create another column for checking a different value in PySpark

I wish to have the expected output shown below.
My code:
import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

pd_dataframe = pd.DataFrame({'id': [i for i in range(10)],
                             'values': [10, 5, 3, -1, 0, -10, -4, 10, 0, 10]})
sp_dataframe = spark.createDataFrame(pd_dataframe)
sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType())
sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values'))
sp_dataframe.show()
I want to create another column that increments by 1 whenever the value is different from the previous row.
Expected output:
   id  values  sign  numbering
0   0      10     1          1
1   1       5     1          1
2   2       3     1          1
3   3      -1    -1          2
4   4       0     0          3
5   5     -10    -1          4
6   6      -4    -1          4
7   7      10     1          5
8   8       0     0          6
9   9      10     1          7

Here's a way you can do it using a custom function:
import pyspark.sql.functions as F

# compare each value with the previous one; bump the counter when it changes
def f(x):
    c = 1
    l = [c]
    last_value = [x[0]]
    for i in x[1:]:
        if i == last_value[-1]:
            l.append(c)
        else:
            c += 1
            l.append(c)
        last_value.append(i)
    return l

# take the sign column as a list
sign_list = sp_dataframe.select('sign').rdd.map(lambda x: x.sign).collect()

# create a new dataframe from the output
sp = spark.createDataFrame(pd.DataFrame(f(sign_list), columns=['numbering']))
Appending a list as a column to a DataFrame is a bit tricky in PySpark. For this we'll need to create a dummy row_idx to join the two DataFrames.
# create dummy indexes
sp_dataframe = sp_dataframe.withColumn("row_idx", F.monotonically_increasing_id())
sp = sp.withColumn("row_idx", F.monotonically_increasing_id())

# join the dataframes
final_df = (sp_dataframe
            .join(sp, sp_dataframe.row_idx == sp.row_idx)
            .orderBy('id')
            .drop("row_idx"))
final_df.show()
+---+------+----+---------+
| id|values|sign|numbering|
+---+------+----+---------+
| 0| 10| 1| 1|
| 1| 5| 1| 1|
| 2| 3| 1| 1|
| 3| -1| -1| 2|
| 4| 0| 0| 3|
| 5| -10| -1| 4|
| 6| -4| -1| 4|
| 7| 10| 1| 5|
| 8| 0| 0| 6|
| 9| 10| 1| 7|
+---+------+----+---------+
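As a side note, the same numbering can be computed without collecting to the driver or joining on a dummy index, using only window functions. A minimal sketch, assuming the id column defines the row order:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy('id')
numbered = (sp_dataframe
            .withColumn('changed', (F.col('sign') != F.lag('sign').over(w)).cast('int'))  # 1 when sign differs from the previous row
            .fillna(0, subset=['changed'])                                                # the first row has no previous value
            .withColumn('numbering', F.sum('changed').over(w) + 1)                        # running count of changes, starting at 1
            .drop('changed'))
numbered.show()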

Related

pct_change function in PySpark

I need to calculate the percentage change of the values in the value column for each ID separately, in a new pct_change column. Example df below.
Some sources on the internet say that there is a pct_change() function available in PySpark 2.4+ which would make this easy, but I am on 3.0.1 and I am not able to import it from pyspark.sql.functions.
ID value pct_change
1 1 nan
1 2 1
1 4 1
2 1 nan
2 1 0
2 0.5 -0.5
3 5 nan
3 5 0
3 7 0.4
Use a window function in PySpark. Code and logic below:
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, monotonically_increasing_id

w = Window.partitionBy('ID').orderBy('index')  # .rowsBetween(-1, 0)
(df.withColumn('index', monotonically_increasing_id())               # create an index to order by
   .withColumn('pct_change', (col('value') - lag('value').over(w))   # change between consecutive rows
                             / lag('value').over(w))                 # rate of change between consecutive rows
   .drop('index')                                                    # drop the ordering column
 ).show()
+---+-----+----------+
| ID|value|pct_change|
+---+-----+----------+
| 1| 1.0| null|
| 1| 2.0| 1.0|
| 1| 4.0| 1.0|
| 2| 1.0| null|
| 2| 1.0| 0.0|
| 2| 0.5| -0.5|
| 3| 5.0| null|
| 3| 5.0| 0.0|
| 3| 7.0| 0.4|
+---+-----+----------+
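The same ratio can also be written as value / lag(value) - 1. A minimal equivalent sketch under the same assumptions (df with the ID and value columns above, row order given by the generated index):

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy('ID').orderBy('index')
df_pct = (df.withColumn('index', F.monotonically_increasing_id())
            .withColumn('pct_change', F.col('value') / F.lag('value').over(w) - 1)  # (current - previous) / previous
            .drop('index'))
df_pct.show()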

Spark dataframe - transform rows with same ID to columns

I want to transform the source dataframe below (using PySpark):
Key  ID  segment
1    A   m1
2    A   m1
3    B   m1
4    C   m2
1    D   m1
2    E   m1
3    F   m1
4    G   m2
1    J   m1
2    J   m1
3    J   m1
4    K   m2
Into the result dataframe below:
ID  key1  key2
A   1     2
B   3     -
C   4     -
D   1     -
F   3     -
G   4     -
J   1     2
J   1     3
J   2     3
K   4     -
In other words: I want to highlight the "pairs" in the dataframe. If I have more than one key for the same ID, I would like to show each relation on a different line.
Thank you for your help.
Use window functions. I assume - means a one-member group; if not, you can use a when/otherwise condition to blank the 1s out.
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lag, desc

w = Window.partitionBy('ID').orderBy(desc('Key'))
df = (df.withColumn('key2', lag('segment').over(w))     # new column with the value of the preceding segment for each row
        .withColumn('key2', col('key2').isNotNull())    # boolean selection
        .withColumn('key2', F.sum(F.col('key2').cast('integer'))
                             .over(w.rowsBetween(Window.currentRow, sys.maxsize)) + 1)  # create cumulative groups
        .orderBy('ID', 'Key')                            # reorder the frame
     )
df.show()
df.show()
+---+---+-------+----+
|Key| ID|segment|key2|
+---+---+-------+----+
| 1| A| m1| 2|
| 2| A| m1| 2|
| 3| B| m1| 1|
| 4| C| m2| 1|
| 1| D| m1| 1|
| 2| E| m1| 1|
| 3| F| m1| 1|
| 4| G| m2| 1|
| 1| J| m1| 2|
| 2| J| m1| 3|
| 3| J| m1| 3|
| 4| K| m2| 1|
+---+---+-------+----+
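The window output above only labels the groups. To get the actual key1/key2 pairs from the expected output, one option (not part of the answer above, just a sketch against the original df, with Key cast to string so the '-' placeholder fits) is a self-join on ID plus a union of the single-key IDs:

from pyspark.sql import functions as F

# all ordered key pairs within the same ID
pairs = (df.alias('a')
           .join(df.alias('b'),
                 (F.col('a.ID') == F.col('b.ID')) & (F.col('a.Key') < F.col('b.Key')))
           .select(F.col('a.ID').alias('ID'),
                   F.col('a.Key').cast('string').alias('key1'),
                   F.col('b.Key').cast('string').alias('key2')))

# IDs with a single key get a '-' placeholder
singles = (df.groupBy('ID')
             .agg(F.count('*').alias('n'), F.min('Key').cast('string').alias('key1'))
             .where(F.col('n') == 1)
             .select('ID', 'key1', F.lit('-').alias('key2')))

pairs.unionByName(singles).orderBy('ID', 'key1', 'key2').show()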

Pyspark Counter based on column condition

Hello, I would like to create a new column with a counter based on the condition of the Tag1 column.
I have this:
Time  Tag1
1     0
2     1
3     1
4     1
5     0
6     0
7     1
8     1
9     1
10    1
11    0
12    0
And I would like this:
Time  Tag1  Counter
1     0     0
2     1     1
3     1     2
4     1     3
5     0     0
6     0     0
7     1     1
8     1     2
9     1     3
10    1     4
11    0     0
12    0     0
I tried with function.when(df.Tag1 == 1, function.lag(df.Tag1)+1).otherwise(0) but it doesn't work.
Any idea?
Thanks a lot
Use a window function:
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when

new = (df.withColumn('Counter', (col('Tag1') == 0))               # create a bool flag for the rows that reset the counter
         .withColumn('Counter', F.sum(F.col('Counter').cast('integer'))
                                 .over(Window.orderBy('Time').rowsBetween(-sys.maxsize, 0)))  # create groups by summing the bools
         .withColumn('Counter', when(col('Tag1') == 0, col('Tag1'))
                                .otherwise(F.sum('Tag1').over(Window.partitionBy('Counter')
                                                                    .orderBy('Time')
                                                                    .rowsBetween(-sys.maxsize, 0))))  # conditionally add within each group
      )
new.show()
new.show()
+----+----+-------+
|Time|Tag1|Counter|
+----+----+-------+
| 1| 0| 0|
| 2| 1| 1|
| 3| 1| 2|
| 4| 1| 3|
| 5| 0| 0|
| 6| 0| 0|
| 7| 1| 1|
| 8| 1| 2|
| 9| 1| 3|
| 10| 1| 4|
| 11| 0| 0|
| 12| 0| 0|
+----+----+-------+
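An equivalent sketch using row_number over the same zero-delimited groups (my wording, assuming Time defines the row order), which may read a little more directly:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy('Time')
new = (df.withColumn('grp', F.sum((F.col('Tag1') == 0).cast('int')).over(w))   # every 0 starts a new group
         .withColumn('Counter',
                     F.when(F.col('Tag1') == 0, F.lit(0))
                      .otherwise(F.row_number().over(Window.partitionBy('grp').orderBy('Time')) - 1))
         .drop('grp'))
new.show()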

Change value on duplicated rows using Pyspark, keeping the first record as is

How can I change the status column value on rows that contain duplicate records on specific columns, keeping the first one (with the lowest id) as A? For example:
Logic:
if the account_id and user_id combination already exists, the status is E; the first record (lowest id) is A
if the user_id exists but the account_id is different, the status is I; the first record (lowest id) is A
input sample:
id  account_id  user_id
1   a           1
2   a           1
3   b           1
4   c           2
5   c           2
6   c           2
7   d           3
8   d           3
9   e           3
Output sample:
id  account_id  user_id  status
1   a           1        A
2   a           1        E
3   b           1        I
4   c           2        A
5   c           2        E
6   c           2        E
7   d           3        A
8   d           3        E
9   e           3        I
I think I need to group into multiple datasets and join them back, then compare and change the values, but I think I'm overthinking it. Help?
Thanks!!
Two window functions will help you determine the duplicates and rank them.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(df
 # Distinguishes between "first occurrence" vs "2nd occurrence" and so on
 .withColumn('rank', F.rank().over(W.partitionBy('account_id', 'user_id').orderBy('id')))
 # Detecting if there is no duplication per pair of 'account_id' and 'user_id'
 .withColumn('count', F.count('*').over(W.partitionBy('account_id', 'user_id')))
 # building status based on conditions
 .withColumn('status', F
     .when(F.col('count') == 1, 'I')  # if there is only one record, status is 'I'
     .when(F.col('rank') == 1, 'A')   # if there is more than one record, the first occurrence is 'A'
     .otherwise('E')                  # finally, the other occurrences are 'E'
 )
 .orderBy('id')
 .show()
)
# Output
# +---+----------+-------+----+-----+------+
# | id|account_id|user_id|rank|count|status|
# +---+----------+-------+----+-----+------+
# | 1| a| 1| 1| 2| A|
# | 2| a| 1| 2| 2| E|
# | 3| b| 1| 1| 1| I|
# | 4| c| 2| 1| 3| A|
# | 5| c| 2| 2| 3| E|
# | 6| c| 2| 3| 3| E|
# | 7| d| 3| 1| 2| A|
# | 8| d| 3| 2| 2| E|
# | 9| e| 3| 1| 1| I|
# +---+----------+-------+----+-----+------+
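If you want only the four columns from the output sample, the helper columns can be dropped at the end of the same chain (df_with_status below is a hypothetical name for the chained result assigned to a variable instead of calling .show() directly):

# drop the helper columns so only id, account_id, user_id and status remain
df_with_status.drop('rank', 'count').orderBy('id').show()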

How can I add a column to a DataFrame which groups rows in chunks of N? Like NTILE, but with a fixed bucket size

Say I have a DataFrame like:
+------------+-----------+-----+
| feed|artist |count|
+------------+-----------+-----+
| y| Kanye West| 9|
| y| Radiohead| 6|
| y| Zero 7| 3|
| y| Puts Marie| 1|
| gwas| Drax| 7|
| gwas| Calibre| 4|
| gwas| Aphex Twin| 1|
| gwas| Jay Z| 1|
| x| DJ Krush| 2|
| x| Titeknots| 1|
+------------+-----------+-----+
I want to add a new column which chunks the rows into buckets of N rows for each partition (feed).
It seems like the inverse of NTILE to me. NTILE lets you choose the # of buckets but I want to choose the bucket-size instead.
Here's the desired result. Notice how each feed is chunked into groups of N = 2, including the x feed which has just one chunk of 2 rows. (Edit: each partition is ordered by count, so group 1 in each partition will be the rows with the highest value for count)
+------------+-----------+-----+-----+
|        feed|artist     |count|group|
+------------+-----------+-----+-----+
|           y| Kanye West|    9|    1|
|           y|  Radiohead|    6|    1|
|           y|     Zero 7|    3|    2|
|           y| Puts Marie|    1|    2|
|        gwas|       Drax|    7|    1|
|        gwas|    Calibre|    4|    1|
|        gwas| Aphex Twin|    1|    2|
|        gwas|      Jay Z|    1|    2|
|           x|   DJ Krush|    2|    1|
|           x|  Titeknots|    1|    1|
+------------+-----------+-----+-----+
As a bonus, I would like each bucket to be a different size. E.g. List(2, 2, 4, 10, 10, -1) would mean that the first bucket has 2 rows, the second has 2 rows, the third has 4 rows, etc., and the final bucket (-1) contains the remainder.
EDIT
(Another useful variation)
While implementing the answers, I realized that there's another variation which I would prefer:
Add a column to a DataFrame which chunks its rows into groups of N, without knowing the size of the DataFrame.
Example:
If N = 100 and the DataFrame has 800 rows, it would chunk it into 8 buckets of 100. If the DataFrame has 950 rows, it would chunk it into 9 buckets of 100 and 1 bucket of 50. It should not require a scan/call to .count().
The example DataFrames are analogous to the ones above.
(meta: should I make a new question for this variation? I feel like "NTILE with a fixed bucket size" is a more elegant problem and probably more common than my original use-case)
If I understand you correctly, this can be handled by using an SQL expression:
import org.apache.spark.sql.functions.{expr,row_number,desc}
import org.apache.spark.sql.expressions.Window
// set up WindowSpec
val w1 = Window.partitionBy("feed").orderBy(desc("count"))
val L = List(2, 2, 4, 10, 10, -1)
// dynamically create SQL expression from the List `L` to map row_number into group-id
var sql_expr = "CASE"
var running_total = 0
for (i <- 1 to L.size) {
  running_total += L(i-1)
  sql_expr += (if (L(i-1) > 0) s" WHEN rn <= $running_total THEN $i " else s" ELSE $i END")
}
println(sql_expr)
//CASE WHEN rn <= 2 THEN 1 WHEN rn <= 4 THEN 2 WHEN rn <= 8 THEN 3 WHEN rn <= 18 THEN 4 WHEN rn <= 28 THEN 5 ELSE 6 END
val df_new = df.withColumn("rn", row_number().over(w1)).withColumn("group", expr(sql_expr)).drop("rn")
df_new.show
+----+----------+-----+-----+
|feed| artist|count|group|
+----+----------+-----+-----+
|gwas| Drax| 7| 1|
|gwas| Calibre| 4| 1|
|gwas|Aphex Twin| 1| 2|
|gwas| Jay Z| 1| 2|
| x| DJ Krush| 2| 1|
| x| Titeknots| 1| 1|
| y|Kanye West| 9| 1|
| y| Radiohead| 6| 1|
| y| Zero 7| 3| 2|
| y|Puts Marie| 1| 2|
+----+----------+-----+-----+
For a fixed N, just cast (row_number-1)/N + 1 to int:
val N = 2
val df_new = df.withColumn("group", ((row_number().over(w1)-1)/N+1).cast("int"))
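For reference, a rough PySpark equivalent of the fixed-N formula (a sketch, reusing the df, feed and count names from the question); for the EDIT variation without a partitioning column, the same formula works with row_number over a single global window:

from pyspark.sql import Window
from pyspark.sql import functions as F

N = 2
w1 = Window.partitionBy('feed').orderBy(F.desc('count'))
df_new = df.withColumn('group', ((F.row_number().over(w1) - 1) / N + 1).cast('int'))  # bucket of N rows per feed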
This could work:
val bucketDef = List(2, 2, 4, 10, 10)
val bucketRunsum = bucketDef.scanLeft(1)(_ + _) // calc running sum

// maps a row-number to a bucket
val indexBucketMapping = bucketRunsum.zip(bucketRunsum.tail)
  .zipWithIndex
  .map { case ((start, end), index) => ((start, end), index + 1) } // make index start at 1
// gives List(((1,3),1), ((3,5),2), ((5,9),3), ((9,19),4), ((19,29),5))

// udf to assign a bucket to a given row-number
val calcBucket = udf((rnb: Long) => indexBucketMapping
  .find { case ((start, end), _) => start <= rnb && rnb < end }
  .map(_._2)                                 // get index
  .getOrElse(indexBucketMapping.last._2 + 1) // is in the last bucket
)

df.withColumn("group", calcBucket(row_number().over(Window.partitionBy($"feed").orderBy($"count"))))
Alternatively (without a UDF), construct a DataFrame which maps a row number to a bucket and then join:
val bucketSizeDef = List(2, 2, 4, 10, 10)
val bucketDef = (1 +: bucketSizeDef).zipWithIndex.map { case (bs, index) => (bs, index + 1) }
  .toDF("bucketSize", "group")
  .withColumn("i", sum($"bucketSize").over(Window.orderBy($"group")))
  .withColumn("i_to", coalesce(lead($"i", 1).over(Window.orderBy($"group")), lit(Long.MaxValue)))
  .drop($"bucketSize")
bucketDef.show()
bucketDef.show()
gives:
+-----+---+-------------------+
|group| i| i_to|
+-----+---+-------------------+
| 1| 1| 3|
| 2| 3| 5|
| 3| 5| 9|
| 4| 9| 19|
| 5| 19| 29|
| 6| 29|9223372036854775807|
+-----+---+-------------------+
Then join to df:
df
  .withColumn("rnb", row_number().over(Window.partitionBy($"feed").orderBy($"count")))
  .join(broadcast(bucketDef), $"rnb" >= $"i" and $"rnb" < $"i_to")
  .drop("rnb", "i", "i_to")