Related
I have the below dataframe .
scala> df.show
+---+------+---+
| M|Amount| Id|
+---+------+---+
| 1| 5| 1|
| 1| 10| 2|
| 1| 15| 3|
| 1| 20| 4|
| 1| 25| 5|
| 1| 30| 6|
| 2| 2| 1|
| 2| 4| 2|
| 2| 6| 3|
| 2| 8| 4|
| 2| 10| 5|
| 2| 12| 6|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
| 3| 4| 4|
| 3| 5| 5|
| 3| 6| 6|
+---+------+---+
created by
val df=Seq( (1,5,1), (1,10,2), (1,15,3), (1,20,4), (1,25,5), (1,30,6), (2,2,1), (2,4,2), (2,6,3), (2,8,4), (2,10,5), (2,12,6), (3,1,1), (3,2,2), (3,3,3), (3,4,4), (3,5,5), (3,6,6) ).toDF("M","Amount","Id")
Here I have a base column M and is ranked as ID based on Amount.
I am trying to compute the percentile keeping M as a group but for every last three values of Amount.
I am Using the below code to find the percentile for a group. But how can I target the last three values. ?
df.withColumn("percentile",percentile_approx(col("Amount") ,lit(.5)) over Window.partitionBy("M"))
Expected Output
+---+------+---+-----------------------------------+
| M|Amount| Id| percentile |
+---+------+---+-----------------------------------+
| 1| 5| 1| percentile(Amount) whose (Id-1) |
| 1| 10| 2| percentile(Amount) whose (Id-1,2) |
| 1| 15| 3| percentile(Amount) whose (Id-1,3) |
| 1| 20| 4| percentile(Amount) whose (Id-2,4) |
| 1| 25| 5| percentile(Amount) whose (Id-3,5) |
| 1| 30| 6| percentile(Amount) whose (Id-4,6) |
| 2| 2| 1| percentile(Amount) whose (Id-1) |
| 2| 4| 2| percentile(Amount) whose (Id-1,2) |
| 2| 6| 3| percentile(Amount) whose (Id-1,3) |
| 2| 8| 4| percentile(Amount) whose (Id-2,4) |
| 2| 10| 5| percentile(Amount) whose (Id-3,5) |
| 2| 12| 6| percentile(Amount) whose (Id-4,6) |
| 3| 1| 1| percentile(Amount) whose (Id-1) |
| 3| 2| 2| percentile(Amount) whose (Id-1,2) |
| 3| 3| 3| percentile(Amount) whose (Id-1,3) |
| 3| 4| 4| percentile(Amount) whose (Id-2,4) |
| 3| 5| 5| percentile(Amount) whose (Id-3,5) |
| 3| 6| 6| percentile(Amount) whose (Id-4,6) |
+---+------+---+----------------------------------+
This seems to be little bit tricky to me as I am still learning spark.Expecting answers from enthusiasts here.
Adding orderBy("Amount") and rowsBetween(-2,0) to the Window definition gets the required result:
orderBy sorts the rows within each group by Amount
rowsBetween takes only the current row and the two rows before into account when calculating the percentile
val w = Window.partitionBy("M").orderBy("Amount").rowsBetween(-2,0)
df.withColumn("percentile",PercentileApprox.percentile_approx(col("Amount") ,lit(.5))
.over(w))
.orderBy("M", "Amount") //not really required, just to make the output more readable
.show()
prints
+---+------+---+----------+
| M|Amount| Id|percentile|
+---+------+---+----------+
| 1| 5| 1| 5|
| 1| 10| 2| 5|
| 1| 15| 3| 10|
| 1| 20| 4| 15|
| 1| 25| 5| 20|
| 1| 30| 6| 25|
| 2| 2| 1| 2|
| 2| 4| 2| 2|
| 2| 6| 3| 4|
| 2| 8| 4| 6|
| 2| 10| 5| 8|
| 2| 12| 6| 10|
| 3| 1| 1| 1|
| 3| 2| 2| 1|
| 3| 3| 3| 2|
| 3| 4| 4| 3|
| 3| 5| 5| 4|
| 3| 6| 6| 5|
+---+------+---+----------+
I have the following dataframe:
+----+----+-----+
|col1|col2|value|
+----+----+-----+
| 11| a| 1|
| 11| a| 2|
| 11| b| 3|
| 11| a| 4|
| 11| b| 5|
| 22| a| 6|
| 22| b| 7|
+----+----+-----+
I want to calculate to calculate the cumsum of the 'value' column that is partitioned by 'col1' and ordered by 'col2'.
This is the desired output:
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 1| 1|
| 11| a| 2| 3|
| 11| a| 4| 7|
| 11| b| 3| 10|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
I have used this code which gives me the df shown below. It is not what I wanted. Can someone help me please?
df.withColumn("cumsum", F.sum("value").over(Window.partitionBy("col1").orderBy("col2").rangeBetween(Window.unboundedPreceding, 0)))
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 2| 7|
| 11| a| 1| 7|
| 11| a| 4| 7|
| 11| b| 3| 15|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
You have to use .rowsBetween instead of .rangeBetween in your window clause.
rowsBetween (vs) rangeBetween
Example:
df.withColumn("cumsum", sum("value").over(Window.partitionBy("col1").orderBy("col2").rowsBetween(Window.unboundedPreceding, 0))).show()
#+----+----+-----+------+
#|col1|col2|value|cumsum|
#+----+----+-----+------+
#| 11| a| 1| 1|
#| 11| a| 2| 3|
#| 11| a| 4| 7|
#| 11| b| 3| 10|
#| 11| b| 5| 15|
#| 12| a| 6| 6|
#| 12| b| 7| 13|
#+----+----+-----+------+
I will expose my problem based on the initial dataframe and the one I want to achieve:
val df_997 = Seq [(Int, Int, Int, Int)]((1,1,7,10),(1,10,4,300),(1,3,14,50),(1,20,24,70),(1,30,12,90),(2,10,4,900),(2,25,30,40),(2,15,21,60),(2,5,10,80)).toDF("policyId","FECMVTO","aux","IND_DEF").orderBy(asc("policyId"), asc("FECMVTO"))
df_997.show
+--------+-------+---+-------+
|policyId|FECMVTO|aux|IND_DEF|
+--------+-------+---+-------+
| 1| 1| 7| 10|
| 1| 3| 14| 50|
| 1| 10| 4| 300|
| 1| 20| 24| 70|
| 1| 30| 12| 90|
| 2| 5| 10| 80|
| 2| 10| 4| 900|
| 2| 15| 21| 60|
| 2| 25| 30| 40|
+--------+-------+---+-------+
Imagine I have partitioned this DF by the column policyId and created the column row_num based on it to better see the Windows:
val win = Window.partitionBy("policyId").orderBy("FECMVTO")
val df_998 = df_997.withColumn("row_num",row_number().over(win))
df_998.show
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 10| 4| 300| 3|
| 1| 20| 24| 70| 4|
| 1| 30| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 10| 4| 900| 2|
| 2| 15| 21| 60| 3|
| 2| 25| 30| 40| 4|
+--------+-------+---+-------+-------+
Now, for each window, if the value of aux is 4, I want to set the value of IND_DEF column for that register to the column FEC_MVTO for this register on until the end of the window.
The resulting DF would be:
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 300| 4| 300| 3|
| 1| 300| 24| 70| 4|
| 1| 300| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 900| 4| 900| 2|
| 2| 900| 21| 60| 3|
| 2| 900| 30| 40| 4|
+--------+-------+---+-------+-------+
Thanks for your suggestions as I am very stuck in here...
Here's one approach: First left-join the DataFrame with its aux == 4 filtered version, followed by applying Window function first to backfill nulls with the wanted IND_DEF values per partition, and finally conditionally recreate column FECMVTO:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1,1,7,10), (1,10,4,300), (1,3,14,50), (1,20,24,70), (1,30,12,90),
(2,10,4,900), (2,25,30,40), (2,15,21,60), (2,5,10,80)
).toDF("policyId","FECMVTO","aux","IND_DEF")
val win = Window.partitionBy("policyId").orderBy("FECMVTO").
rowsBetween(Window.unboundedPreceding, 0)
val df2 = df.
select($"policyId", $"aux", $"IND_DEF".as("IND_DEF2")).
where($"aux" === 4)
df.join(df2, Seq("policyId", "aux"), "left_outer").
withColumn("IND_DEF3", first($"IND_DEF2", ignoreNulls=true).over(win)).
withColumn("FECMVTO", coalesce($"IND_DEF3", $"FECMVTO")).
show
// +--------+---+-------+-------+--------+--------+
// |policyId|aux|FECMVTO|IND_DEF|IND_DEF2|IND_DEF3|
// +--------+---+-------+-------+--------+--------+
// | 1| 7| 1| 10| null| null|
// | 1| 14| 3| 50| null| null|
// | 1| 4| 300| 300| 300| 300|
// | 1| 24| 300| 70| null| 300|
// | 1| 12| 300| 90| null| 300|
// | 2| 10| 5| 80| null| null|
// | 2| 4| 900| 900| 900| 900|
// | 2| 21| 900| 60| null| 900|
// | 2| 30| 900| 40| null| 900|
// +--------+---+-------+-------+--------+--------+
Columns IND_DEF2, IND_DEF3 are kept only for illustration (and can certainly be dropped).
#I believe below can be solution for your issue
Considering input_df is your input dataframe
//Step#1 - Filter rows with IND_DEF = 4 from input_df
val only_FECMVTO_4_df1 = input_df.filter($"IND_DEF" === 4)
//Step#2 - Filling FECMVTO value from IND_DEF for the above result
val only_FECMVTO_4_df2 = only_FECMVTO_4_df1.withColumn("FECMVTO_NEW",$"IND_DEF").drop($"FECMVTO").withColumnRenamed("FECMVTO",$"FECMVTO_NEW")
//Step#3 - removing all the records from step#1 from input_df
val input_df_without_FECMVTO_4 = input_df.except(only_FECMVTO_4_df1)
//combining Step#2 output with output of Step#3
val final_df = input_df_without_FECMVTO_4.union(only_FECMVTO_4_df2)
I'm looking for a way to rank columns of a dataframe preserving ties. Specifically for this example, I have a pyspark dataframe as follows where I want to generate ranks for colA & colB (though I want to support being able to rank N number of columns)
+--------+----------+-----+----+
| Entity| id| colA|colB|
+-------------------+-----+----+
| a|8589934652| 21| 50|
| b| 112| 9| 23|
| c|8589934629| 9| 23|
| d|8589934702| 8| 21|
| e| 20| 2| 21|
| f|8589934657| 2| 5|
| g|8589934601| 1| 5|
| h|8589934653| 1| 4|
| i|8589934620| 0| 4|
| j|8589934643| 0| 3|
| k|8589934618| 0| 3|
| l|8589934602| 0| 2|
| m|8589934664| 0| 2|
| n| 25| 0| 1|
| o| 67| 0| 1|
| p|8589934642| 0| 1|
| q|8589934709| 0| 1|
| r|8589934660| 0| 1|
| s| 30| 0| 1|
| t| 55| 0| 1|
+--------+----------+-----+----+
What I'd like is a way to rank this dataframe where tied values receive the same rank such as:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+-------------------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 21| 2| 3|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+--------+----------+-----+----+---------+---------+
My current implementation with the first dataframe looks like:
def getRanks(mydf, cols=None, ascending=False):
from pyspark import Row
# This takes a dataframe and a list of columns to rank
# If no list is provided, it ranks *all* columns
# returns a new dataframe
def addRank(ranked_rdd, col, ascending):
# This assumes an RDD of the form (Row(...), list[...])
# it orders the rdd by col, finds the order, then adds that to the
# list
myrdd = ranked_rdd.sortBy(lambda (row, ranks): row[col],
ascending=ascending).zipWithIndex()
return myrdd.map(lambda ((row, ranks), index): (row, ranks +
[index+1]))
myrdd = mydf.rdd
fields = myrdd.first().__fields__
ranked_rdd = myrdd.map(lambda x: (x, []))
if (cols is None):
cols = fields
for col in cols:
ranked_rdd = addRank(ranked_rdd, col, ascending)
rank_names = [x + "_rank" for x in cols]
# Hack to make sure columns come back in the right order
ranked_rdd = ranked_rdd.map(lambda (row, ranks): Row(*row.__fields__ +
rank_names)(*row + tuple(ranks)))
return ranked_rdd.toDF()
which produces:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+-------------------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 3| 3|
| d|8589934702| 8| 21| 4| 4|
| e| 20| 2| 21| 5| 5|
| f|8589934657| 2| 5| 6| 6|
| g|8589934601| 1| 5| 7| 7|
| h|8589934653| 1| 4| 8| 8|
| i|8589934620| 0| 4| 9| 9|
| j|8589934643| 0| 3| 10| 10|
| k|8589934618| 0| 3| 11| 11|
| l|8589934602| 0| 2| 12| 12|
| m|8589934664| 0| 2| 13| 13|
| n| 25| 0| 1| 14| 14|
| o| 67| 0| 1| 15| 15|
| p|8589934642| 0| 1| 16| 16|
| q|8589934709| 0| 1| 17| 17|
| r|8589934660| 0| 1| 18| 18|
| s| 30| 0| 1| 19| 19|
| t| 55| 0| 1| 20| 20|
+--------+----------+-----+----+---------+---------+
As you can see, the function getRanks() takes a dataframe, specifies the columns to be ranked, sorts them, and uses zipWithIndex() to generate an ordering or rank. However, I can't figure out a way to preserve ties.
This stackoverflow post is the closest solution I've found:
rank-users-by-column But it appears to only handle 1 column (I think).
Thanks so much for the help in advance!
EDIT: column 'id' is generated from calling monotonically_increasing_id() and in my implementation is cast to a string.
You're looking for dense_rank
First let's create our dataframe:
df = spark.createDataFrame(sc.parallelize([["a",8589934652,21,50],["b",112,9,23],["c",8589934629,9,23],
["d",8589934702,8,21],["e",20,2,21],["f",8589934657,2,5],
["g",8589934601,1,5],["h",8589934653,1,4],["i",8589934620,0,4],
["j",8589934643,0,3],["k",8589934618,0,3],["l",8589934602,0,2],
["m",8589934664,0,2],["n",25,0,1],["o",67,0,1],["p",8589934642,0,1],
["q",8589934709,0,1],["r",8589934660,0,1],["s",30,0,1],["t",55,0,1]]
), ["Entity","id","colA","colB"])
We'll define two windowSpec:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("colA"))
wB = Window.orderBy(psf.desc("colB"))
df = df.withColumn(
"colA_rank",
psf.dense_rank().over(wA)
).withColumn(
"colB_rank",
psf.dense_rank().over(wB)
)
+------+----------+----+----+---------+---------+
|Entity| id|colA|colB|colA_rank|colB_rank|
+------+----------+----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 2| 2|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+------+----------+----+----+---------+---------+
I'll also pose an alternative:
for cols in data.columns[2:]:
lookup = (data.select(cols)
.distinct()
.orderBy(cols, ascending=False)
.rdd
.zipWithIndex()
.map(lambda x: x[0] + (x[1], ))
.toDF([cols, cols+"_rank_lookup"]))
name = cols + "_ranks"
data = data.join(lookup, [cols]).withColumn(name,col(cols+"_rank_lookup")
+ 1).drop(cols + "_rank_lookup")
Not as elegant as dense_rank() and I'm uncertain as to performance implications.
I have a tall table which contains up to 10 values per group. How can I transform this table into a wide format i.e. add 2 columns where these resemble the value smaller or equal to a threshold?
I want to find the maximum per group, but it needs to be smaller than a specified value like:
min(max('value1), lit(5)).over(Window.partitionBy('grouping))
However min()will only work for a column and not for the Scala value which is returned from the inner function?
The problem can be described as:
Seq(Seq(1,2,3,4).max,5).min
Where Seq(1,2,3,4) is returned by the window.
How can I formulate this in spark sql?
edit
E.g.
+--------+-----+---------+
|grouping|value|something|
+--------+-----+---------+
| 1| 1| first|
| 1| 2| second|
| 1| 3| third|
| 1| 4| fourth|
| 1| 7| 7|
| 1| 10| 10|
| 21| 1| first|
| 21| 2| second|
| 21| 3| third|
+--------+-----+---------+
created by
case class MyThing(grouping: Int, value:Int, something:String)
val df = Seq(MyThing(1,1, "first"), MyThing(1,2, "second"), MyThing(1,3, "third"),MyThing(1,4, "fourth"),MyThing(1,7, "7"), MyThing(1,10, "10"),
MyThing(21,1, "first"), MyThing(21,2, "second"), MyThing(21,3, "third")).toDS
Where
df
.withColumn("somethingAtLeast5AndMaximum5", max('value).over(Window.partitionBy('grouping)))
.withColumn("somethingAtLeast6OupToThereshold2", max('value).over(Window.partitionBy('grouping)))
.show
returns
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5| somethingAtLeast6OupToThereshold2 |
+--------+-----+---------+----------------------------+-------------------------+
| 1| 1| first| 10| 10|
| 1| 2| second| 10| 10|
| 1| 3| third| 10| 10|
| 1| 4| fourth| 10| 10|
| 1| 7| 7| 10| 10|
| 1| 10| 10| 10| 10|
| 21| 1| first| 3| 3|
| 21| 2| second| 3| 3|
| 21| 3| third| 3| 3|
+--------+-----+---------+----------------------------+-------------------------+
Instead, I rather would want to formulate:
lit(Seq(max('value).asInstanceOf[java.lang.Integer], new java.lang.Integer(2)).min).over(Window.partitionBy('grouping))
But that does not work as max('value) is not a scalar value.
Expected output should look like
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5|somethingAtLeast6OupToThereshold2|
+--------+-----+---------+----------------------------+-------------------------+
| 1| 4| fourth| 4| 7|
| 21| 1| first| 3| NULL|
+--------+-----+---------+----------------------------+-------------------------+
edit2
When trying a pivot
df.groupBy("grouping").pivot("value").agg(first('something)).show
+--------+-----+------+-----+------+----+----+
|grouping| 1| 2| 3| 4| 7| 10|
+--------+-----+------+-----+------+----+----+
| 1|first|second|third|fourth| 7| 10|
| 21|first|second|third| null|null|null|
+--------+-----+------+-----+------+----+----+
The second part of the problem remains that some columns might not exist or be null.
When aggregating to arrays:
df.groupBy("grouping").agg(collect_list('value).alias("value"), collect_list('something).alias("something"))
+--------+-------------------+--------------------+
|grouping| value| something|
+--------+-------------------+--------------------+
| 1|[1, 2, 3, 4, 7, 10]|[first, second, t...|
| 21| [1, 2, 3]|[first, second, t...|
+--------+-------------------+--------------------+
The values are already next to each other, but the right values need to be selected. This is probably still more efficient than a join or window function.
Would be easier to do in two separate steps - calculate max over Window, and then use when...otherwise on result to produce min(x, 5):
df.withColumn("tmp", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('tmp > lit(5), 5).otherwise('tmp))
EDIT: some example data to clarify this:
val df = Seq((1, 1),(1, 2),(1, 3),(1, 4),(2, 7),(2, 8))
.toDF("grouping", "value1")
df.withColumn("result", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('result > lit(5), 5).otherwise('result))
.show()
// +--------+------+------+
// |grouping|value1|result|
// +--------+------+------+
// | 1| 1| 4| // 4, because Seq(Seq(1,2,3,4).max,5).min = 4
// | 1| 2| 4|
// | 1| 3| 4|
// | 1| 4| 4|
// | 2| 7| 5| // 5, because Seq(Seq(7,8).max,5).min = 5
// | 2| 8| 5|
// +--------+------+------+