Using Spark SQL for conditional lag summation - PySpark

Below is the dataframe I have:
df = sqlContext.createDataFrame(
    [("0", "0"), ("1", "2"), ("2", "3"), ("3", "4"), ("4", "0"), ("5", "5"), ("6", "5")],
    ["id", "value"])
+---+-----+
| id|value|
+---+-----+
| 0| 0|
| 1| 2|
| 2| 3|
| 3| 4|
| 4| 0|
| 5| 5|
| 6| 5|
+---+-----+
And what I want to get is:
+---+-----+--------+-------+
| id|value|masterid|partsum|
+---+-----+--------+-------+
|  0|    0|       0|      0|
|  1|    2|       0|      2|
|  2|    3|       0|      5|
|  3|    4|       0|      9|
|  4|    0|       4|      0|
|  5|    5|       4|      5|
|  6|    5|       4|     10|
+---+-----+--------+-------+
So I tried to use Spark SQL to do it:
df=df.withColumn("masterid", F.when( df.value !=0 , F.lag(df.id)).otherwise(df.id))
I originally thought the lag function could help me look back at the previous row so as to build the masterid column. Unfortunately, after checking the manual, I found it can't.
So I would like to ask: is there any special function I could use to do what I want? Or is there some kind of "conditional lag" function, so that when I see a non-zero item I can keep lagging until I find a zero?

IIUC, you can try defining a sub-group label (g in the code below) and two Window specs:
from pyspark.sql import Window, functions as F

w1 = Window.orderBy('id')
w2 = Window.partitionBy('g').orderBy('id')

df.withColumn('g', F.sum(F.expr('if(value=0,1,0)')).over(w1)).select(
    'id',
    'value',
    F.first('id').over(w2).alias('masterid'),
    F.sum('value').over(w2).alias('partsum')
).show()
#+---+-----+--------+-------+
#| id|value|masterid|partsum|
#+---+-----+--------+-------+
#| 0| 0| 0| 0.0|
#| 1| 2| 0| 2.0|
#| 2| 3| 0| 5.0|
#| 3| 4| 0| 9.0|
#| 4| 0| 4| 0.0|
#| 5| 5| 4| 5.0|
#| 6| 5| 4| 10.0|
#+---+-----+--------+-------+
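One side note: partsum comes back as a double (0.0, 2.0, ...) because value was created as a string column, so the window sum implicitly casts it. If you prefer integer sums, a minimal tweak of the same query (a sketch reusing w1 and w2 from above) is to cast before summing:
result = df.withColumn('g', F.sum(F.when(F.col('value') == 0, 1).otherwise(0)).over(w1)).select(
    'id',
    'value',
    F.first('id').over(w2).alias('masterid'),
    # casting the string column to int keeps partsum integral
    F.sum(F.col('value').cast('int')).over(w2).alias('partsum')
)
result.show()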

Related

Spark Categorize ordered dataframe values by a condition

Let's say I have a dataframe
val userData = spark.createDataFrame(Seq(
  (1, 0),
  (2, 2),
  (3, 3),
  (4, 0),
  (5, 3),
  (6, 4)
)).toDF("order_clause", "some_value")

userData.withColumn("passed", when(col("some_value") <= 1.5, 1))
  .show()
+------------+----------+------+
|order_clause|some_value|passed|
+------------+----------+------+
| 1| 0| 1|
| 2| 2| null|
| 3| 3| null|
| 4| 0| 1|
| 5| 3| null|
| 6| 4| null|
+------------+----------+------+
That dataframe is ordered by order_clause. When values in some_value become smaller than 1.5 I can say one round is done.
What I want to do is create column round like:
+------------+----------+------+-----+
|order_clause|some_value|passed|round|
+------------+----------+------+-----+
| 1| 0| 1| 1|
| 2| 2| null| 1|
| 3| 3| null| 1|
| 4| 0| 1| 2|
| 5| 3| null| 2|
| 6| 4| null| 2|
+------------+----------+------+-----+
Then I would be able to get subsets of rounds from this dataframe. I searched for hints on how to do this but haven't found a way.
You're probably looking for a rolling sum of the passed column. You can do it using a sum window function:
import org.apache.spark.sql.expressions.Window

val result = userData.withColumn(
  "passed",
  when(col("some_value") <= 1.5, 1)
).withColumn(
  "round",
  sum("passed").over(Window.orderBy("order_clause"))
)

result.show
+------------+----------+------+-----+
|order_clause|some_value|passed|round|
+------------+----------+------+-----+
| 1| 0| 1| 1|
| 2| 2| null| 1|
| 3| 3| null| 1|
| 4| 0| 1| 2|
| 5| 3| null| 2|
| 6| 4| null| 2|
+------------+----------+------+-----+
Or, more simply:
import org.apache.spark.sql.expressions.Window

val result = userData.withColumn(
  "round",
  sum(when(col("some_value") <= 1.5, 1)).over(Window.orderBy("order_clause"))
)
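For reference, here is a rough PySpark rendering of the same rolling-sum idea (assuming an analogous PySpark DataFrame also named userData):
from pyspark.sql import Window, functions as F

w = Window.orderBy("order_clause")

# sum() ignores the nulls produced by when() without otherwise(), so the
# running total only increments on rows where some_value <= 1.5
result = userData.withColumn(
    "round",
    F.sum(F.when(F.col("some_value") <= 1.5, 1)).over(w)
)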

Pyspark: How to group rows into N groups?

I am performing a df.groupBy().apply() in my pyspark script and want to create a custom column that groups all my rows into N groups (as even as possible, so about rows/N each). That way, I can control the number of groups sent to my UDF every time the script runs.
How can I do this using pyspark?
If you need an exact split, then you need windowing
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
w=Window.orderBy(F.lit(1))
tst_mod = tst.withColumn("id", (F.row_number().over(w)) % 3)  # 3 is the number of groups in this example
tst_mod.show()
+----+----+----+----+---+
|col1|col2|col3|col4| id|
+----+----+----+----+---+
| 5| 3| 7| 5| 1|
| 3| 2| 5| 4| 2|
| 5| 3| 7| 5| 0|
| 7| 3| 9| 5| 1|
| 1| 2| 3| 4| 2|
| 7| 3| 9| 5| 0|
| 1| 2| 3| 4| 1|
| 5| 3| 7| 5| 2|
| 7| 3| 9| 5| 0|
| 1| 2| 3| 4| 1|
| 3| 2| 5| 4| 2|
| 5| 3| 7| 5| 0|
| 3| 2| 5| 4| 1|
| 7| 3| 9| 5| 2|
| 3| 2| 5| 4| 0|
| 1| 2| 3| 4| 1|
+----+----+----+----+---+
tst_mod.groupby('id').count().show()
+---+-----+
| id|count|
+---+-----+
| 1| 6|
| 2| 5|
| 0| 5|
+---+-----+
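If you would rather have contiguous buckets than the round-robin assignment above, ntile is another option worth trying (a sketch on the same tst dataframe); it splits the ordered rows into exactly N groups whose sizes differ by at most one:
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.orderBy(F.lit(1))  # same arbitrary ordering as above
tst_ntile = tst.withColumn("grp", F.ntile(3).over(w))  # bucket ids 1..3
tst_ntile.groupby('grp').count().show()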
If you are OK with an approximately even random split rather than an exact one, you can try a technique called salting:
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
tst_salt = tst.withColumn("salt", (F.rand(seed=10)*3).cast('int'))  # random group label in {0, 1, 2}
If you group by the salt column, you get randomly assigned groups whose sizes are roughly, but not exactly, equal.
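A quick sanity check on the group sizes (using the tst_salt dataframe above):
tst_salt.groupby('salt').count().show()
# the three counts sum to 16 but are only approximately equal,
# since each row is assigned to a group at random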

Spark window functions: first match in window

I'm trying to extend the results of my previous question, but haven't been able to figure out how to achieve my new goal.
Before, I wanted to key on either a flag match or a string match. Now, I want to create a unique grouping key from a run starting with either a flag being true or the first string match preceding a run of true flag values.
Here's some example data:
val msgList = List("b", "f")
val df = spark.createDataFrame(Seq(("a", false), ("b", false), ("c", false), ("b", false), ("c", true), ("d", false), ("e", true), ("f", true), ("g", false)))
  .toDF("message", "flag")
  .withColumn("index", monotonically_increasing_id)
df.show
+-------+-----+-----+
|message| flag|index|
+-------+-----+-----+
| a|false| 0|
| b|false| 1|
| c|false| 2|
| b|false| 3|
| c| true| 4|
| d|false| 5|
| e| true| 6|
| f| true| 7|
| g|false| 8|
+-------+-----+-----+
The desired output is something equivalent to either of key1 or key2:
+-------+-----+-----+-----+-----+
|message| flag|index| key1| key2|
+-------+-----+-----+-----+-----+
| a|false| 0| 0| null|
| b|false| 1| 1| 1|
| c|false| 2| 1| 1|
| b|false| 3| 1| 1|
| c| true| 4| 1| 1|
| d|false| 5| 2| null|
| e| true| 6| 3| 2|
| f| true| 7| 3| 2|
| g|false| 8| 4| null|
+-------+-----+-----+-----+-----+
From the answer to my previous question, I already have a precursor:
import org.apache.spark.sql.expressions.Window

val checkMsg = udf { (s: String) => s != null && msgList.exists(s.contains(_)) }

val df2 = df.withColumn("message_match", checkMsg($"message"))
  .withColumn("match_or_flag", when($"message_match" || $"flag", 1).otherwise(0))
  .withColumn("lead", lead("match_or_flag", -1, 1).over(Window.orderBy("index")))
  .withColumn("switched", when($"match_or_flag" =!= $"lead", $"index"))
  .withColumn("base_key", last("switched", ignoreNulls = true).over(Window.orderBy("index").rowsBetween(Window.unboundedPreceding, 0)))
df2.show
+-------+-----+-----+-------------+-------------+----+--------+--------+
|message| flag|index|message_match|match_or_flag|lead|switched|base_key|
+-------+-----+-----+-------------+-------------+----+--------+--------+
| a|false| 0| false| 0| 1| 0| 0|
| b|false| 1| true| 1| 0| 1| 1|
| c|false| 2| false| 0| 1| 2| 2|
| b|false| 3| true| 1| 0| 3| 3|
| c| true| 4| false| 1| 1| null| 3|
| d|false| 5| false| 0| 1| 5| 5|
| e| true| 6| false| 1| 0| 6| 6|
| f| true| 7| true| 1| 1| null| 6|
| g|false| 8| false| 0| 1| 8| 8|
+-------+-----+-----+-------------+-------------+----+--------+--------+
base_key here is somewhat close to key1 above, but assigns separate keys to rows 1 and rows 3-4. I want rows 1-4 to get a single key based on the fact that row 1 contains the first msgList match within or preceding a run of flag = true.
Looking at the Spark window function API, it looks like there might be some way to use rangeBetween to accomplish this as of Spark 2.3.0, but the docs are bare enough that I haven't been able to figure out how to make it work.
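As a side note, the frame itself is easy to specify even though it does not by itself solve the keying problem above; here is a minimal sketch of the rangeBetween syntax (shown in PySpark, the Scala API is analogous):
from pyspark.sql import Window, functions as F

# a value-based frame: every row whose `index` lies within 2 of the current
# row's `index`, up to and including the current row
w = Window.orderBy("index").rangeBetween(-2, Window.currentRow)

# e.g. count the flag = true rows that fall inside that frame
# df.withColumn("flags_in_range", F.sum(F.col("flag").cast("int")).over(w))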

How to union 2 dataframe without creating additional rows?

I have 2 dataframes and I wanted to do .filter($"item" === "a") on one of them while keeping the full running sequence of "S/N" values.
I tried the following, but it ended up with additional rows when I used union. Is there a way to union 2 dataframes without creating additional rows?
var DF1 = Seq(
  ("1", "a", 2),
  ("2", "a", 3),
  ("3", "b", 3),
  ("4", "b", 4),
  ("5", "a", 2)
).toDF("S/N", "item", "value")

var DF2 = Seq(
  ("1", "a", 2),
  ("2", "a", 3),
  ("3", "b", 3),
  ("4", "b", 4),
  ("5", "a", 2)
).toDF("S/N", "item", "value")

DF2 = DF2.filter($"item" === "a")
var DF3 = DF1.withColumn("item", lit(0)).withColumn("value", lit(0))
DF1.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| b| 3|
| 4| b| 4|
| 5| a| 2|
+---+----+-----+
DF2.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
+---+----+-----+
DF3.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+
DF2.union(DF3).show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+
Left outer join your S/N column with the filtered dataframe, then use coalesce to get rid of the nulls:
val DF3 = DF1.select("S/N")
val DF4 = (DF3.join(DF2, Seq("S/N"), joinType="leftouter")
.withColumn("item", coalesce($"item", lit(0)))
.withColumn("value", coalesce($"value", lit(0))))
DF4.show
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| 0| 0|
| 4| 0| 0|
| 5| a| 2|
+---+----+-----+
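For completeness, roughly the same approach in PySpark (a sketch assuming analogous DataFrames named df1 and df2 with the same columns):
from pyspark.sql import functions as F

df4 = (df1.select("S/N")
       .join(df2, ["S/N"], "left_outer")
       .withColumn("item", F.coalesce(F.col("item"), F.lit("0")))   # item is a string column
       .withColumn("value", F.coalesce(F.col("value"), F.lit(0))))
df4.show()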

Pyspark - Ranking columns keeping ties

I'm looking for a way to rank columns of a dataframe while preserving ties. Specifically for this example, I have a pyspark dataframe as follows, where I want to generate ranks for colA & colB (though I want to support ranking an arbitrary number of columns):
+--------+----------+-----+----+
| Entity| id| colA|colB|
+--------+----------+-----+----+
| a|8589934652| 21| 50|
| b| 112| 9| 23|
| c|8589934629| 9| 23|
| d|8589934702| 8| 21|
| e| 20| 2| 21|
| f|8589934657| 2| 5|
| g|8589934601| 1| 5|
| h|8589934653| 1| 4|
| i|8589934620| 0| 4|
| j|8589934643| 0| 3|
| k|8589934618| 0| 3|
| l|8589934602| 0| 2|
| m|8589934664| 0| 2|
| n| 25| 0| 1|
| o| 67| 0| 1|
| p|8589934642| 0| 1|
| q|8589934709| 0| 1|
| r|8589934660| 0| 1|
| s| 30| 0| 1|
| t| 55| 0| 1|
+--------+----------+-----+----+
What I'd like is a way to rank this dataframe where tied values receive the same rank such as:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+--------+----------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 21| 2| 3|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+--------+----------+-----+----+---------+---------+
My current implementation with the first dataframe looks like:
def getRanks(mydf, cols=None, ascending=False):
    from pyspark import Row
    # This takes a dataframe and a list of columns to rank
    # If no list is provided, it ranks *all* columns
    # returns a new dataframe
    def addRank(ranked_rdd, col, ascending):
        # This assumes an RDD of the form (Row(...), list[...])
        # it orders the rdd by col, finds the order, then adds that to the
        # list
        myrdd = ranked_rdd.sortBy(lambda (row, ranks): row[col],
                                  ascending=ascending).zipWithIndex()
        return myrdd.map(lambda ((row, ranks), index): (row, ranks + [index + 1]))

    myrdd = mydf.rdd
    fields = myrdd.first().__fields__
    ranked_rdd = myrdd.map(lambda x: (x, []))
    if cols is None:
        cols = fields
    for col in cols:
        ranked_rdd = addRank(ranked_rdd, col, ascending)
    rank_names = [x + "_rank" for x in cols]
    # Hack to make sure columns come back in the right order
    ranked_rdd = ranked_rdd.map(lambda (row, ranks): Row(*row.__fields__ + rank_names)(*row + tuple(ranks)))
    return ranked_rdd.toDF()
which produces:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+--------+----------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 3| 3|
| d|8589934702| 8| 21| 4| 4|
| e| 20| 2| 21| 5| 5|
| f|8589934657| 2| 5| 6| 6|
| g|8589934601| 1| 5| 7| 7|
| h|8589934653| 1| 4| 8| 8|
| i|8589934620| 0| 4| 9| 9|
| j|8589934643| 0| 3| 10| 10|
| k|8589934618| 0| 3| 11| 11|
| l|8589934602| 0| 2| 12| 12|
| m|8589934664| 0| 2| 13| 13|
| n| 25| 0| 1| 14| 14|
| o| 67| 0| 1| 15| 15|
| p|8589934642| 0| 1| 16| 16|
| q|8589934709| 0| 1| 17| 17|
| r|8589934660| 0| 1| 18| 18|
| s| 30| 0| 1| 19| 19|
| t| 55| 0| 1| 20| 20|
+--------+----------+-----+----+---------+---------+
As you can see, the function getRanks() takes a dataframe, specifies the columns to be ranked, sorts them, and uses zipWithIndex() to generate an ordering or rank. However, I can't figure out a way to preserve ties.
This stackoverflow post is the closest solution I've found:
rank-users-by-column, but it appears to only handle one column (I think).
Thanks so much for the help in advance!
EDIT: column 'id' is generated from calling monotonically_increasing_id() and in my implementation is cast to a string.
You're looking for dense_rank.
First let's create our dataframe:
df = spark.createDataFrame(sc.parallelize(
    [["a", 8589934652, 21, 50], ["b", 112, 9, 23], ["c", 8589934629, 9, 23],
     ["d", 8589934702, 8, 21], ["e", 20, 2, 21], ["f", 8589934657, 2, 5],
     ["g", 8589934601, 1, 5], ["h", 8589934653, 1, 4], ["i", 8589934620, 0, 4],
     ["j", 8589934643, 0, 3], ["k", 8589934618, 0, 3], ["l", 8589934602, 0, 2],
     ["m", 8589934664, 0, 2], ["n", 25, 0, 1], ["o", 67, 0, 1], ["p", 8589934642, 0, 1],
     ["q", 8589934709, 0, 1], ["r", 8589934660, 0, 1], ["s", 30, 0, 1], ["t", 55, 0, 1]]),
    ["Entity", "id", "colA", "colB"])
We'll define two window specs:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("colA"))
wB = Window.orderBy(psf.desc("colB"))
df = df.withColumn(
    "colA_rank",
    psf.dense_rank().over(wA)
).withColumn(
    "colB_rank",
    psf.dense_rank().over(wB)
)
df.show()
+------+----------+----+----+---------+---------+
|Entity| id|colA|colB|colA_rank|colB_rank|
+------+----------+----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 2| 2|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+------+----------+----+----+---------+---------+
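Since the question asks to support an arbitrary number of columns, the same idea wraps easily in a small helper (a sketch; add_dense_ranks is just a hypothetical name, not a built-in):
from pyspark.sql import Window
import pyspark.sql.functions as psf

def add_dense_ranks(df, cols, ascending=False):
    # adds a <col>_rank column for each requested column; tied values share a rank
    for c in cols:
        order = psf.asc(c) if ascending else psf.desc(c)
        df = df.withColumn(c + "_rank", psf.dense_rank().over(Window.orderBy(order)))
    return df

ranked = add_dense_ranks(df, ["colA", "colB"])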
I'll also pose an alternative:
from pyspark.sql.functions import col

# here `data` is the input dataframe (the one called df above)
for cols in data.columns[2:]:
    lookup = (data.select(cols)
              .distinct()
              .orderBy(cols, ascending=False)
              .rdd
              .zipWithIndex()
              .map(lambda x: x[0] + (x[1], ))
              .toDF([cols, cols + "_rank_lookup"]))
    name = cols + "_ranks"
    data = (data.join(lookup, [cols])
            .withColumn(name, col(cols + "_rank_lookup") + 1)
            .drop(cols + "_rank_lookup"))
Not as elegant as dense_rank(), and I'm uncertain about the performance implications.