Pyspark: How to group rows into N groups? - pyspark

I am performing a df.groupBy().apply() in my pyspark script and want to create a custom column that assigns all my rows to N groups (as even as possible, so roughly rows/N per group). That way, I can control the number of groups sent to my udf function every time the script runs.
How can I do this using pyspark?

If you need an exact split, then you need windowing
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
w=Window.orderBy(F.lit(1))
tst_mod = tst.withColumn("id", (F.row_number().over(w)) % 3)  # 3 is the number of groups in this example
tst_mod.show()
+----+----+----+----+---+
|col1|col2|col3|col4| id|
+----+----+----+----+---+
|   5|   3|   7|   5|  1|
|   3|   2|   5|   4|  2|
|   5|   3|   7|   5|  0|
|   7|   3|   9|   5|  1|
|   1|   2|   3|   4|  2|
|   7|   3|   9|   5|  0|
|   1|   2|   3|   4|  1|
|   5|   3|   7|   5|  2|
|   7|   3|   9|   5|  0|
|   1|   2|   3|   4|  1|
|   3|   2|   5|   4|  2|
|   5|   3|   7|   5|  0|
|   3|   2|   5|   4|  1|
|   7|   3|   9|   5|  2|
|   3|   2|   5|   4|  0|
|   1|   2|   3|   4|  1|
+----+----+----+----+---+
tst_mod.groupby('id').count().show()
+---+-----+
| id|count|
+---+-----+
|  1|    6|
|  2|    5|
|  0|    5|
+---+-----+
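A side note (not from the original answer): pyspark.sql.functions also provides ntile, which assigns rows to N buckets that are as even as possible over the same kind of ordered window, so you get the group id directly instead of taking a modulo. A minimal sketch, reusing tst and w from above; keep in mind that a global window with no partitionBy funnels every row through a single partition:
import pyspark.sql.functions as F
# ntile(3) splits the rows into 3 nearly-even buckets with ids 1, 2, 3
tst_ntile = tst.withColumn("id", F.ntile(3).over(w))
tst_ntile.groupby('id').count().show()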
If you are ok with an approximately even (random) split rather than an exact one, then you can try a technique called salting
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
tst_salt = tst.withColumn("salt", F.floor(F.rand(seed=10) * 3))  # integer salt in {0, 1, 2}
If you group by the salt column, the groups will be close to evenly sized (the salt values are uniformly distributed, so the split is approximate rather than exact).
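To sanity-check how even the split is (again not part of the original answer, just a quick verification):
tst_salt.groupby('salt').count().show()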

Related

create a new column to increment value when value resets to 1 in another column in pyspark

In a Pyspark DataFrame, consider a column like [1,2,3,4,1,2,1,1,2,3,1,2,1,1,2].
I want to create a new column whose value increments every time the value in this column resets to 1.
The expected output is [1,1,1,1,2,2,3,4,4,4,5,5,6,7,7].
I am a bit new to pyspark; if anyone can help me it would be great.
I have written the logic like below:
def sequence(row_num):
    results = [1, ]
    flag = 1
    for col in range(0, len(row_num) - 1):
        if row_num[col][0] >= row_num[col + 1][0]:
            flag += 1
        results.append(flag)
    return results
but I am not able to pass a column through a udf. Please help me with this.
Your Dataframe:
df = spark.createDataFrame(
    [
        ('1', 'a'),
        ('2', 'b'),
        ('3', 'c'),
        ('4', 'd'),
        ('1', 'e'),
        ('2', 'f'),
        ('1', 'g'),
        ('1', 'h'),
        ('2', 'i'),
        ('3', 'j'),
        ('1', 'k'),
        ('2', 'l'),
        ('1', 'm'),
        ('1', 'n'),
        ('2', 'o')
    ], ['group', 'label']
)
+-----+-----+
|group|label|
+-----+-----+
|    1|    a|
|    2|    b|
|    3|    c|
|    4|    d|
|    1|    e|
|    2|    f|
|    1|    g|
|    1|    h|
|    2|    i|
|    3|    j|
|    1|    k|
|    2|    l|
|    1|    m|
|    1|    n|
|    2|    o|
+-----+-----+
You can create a flag and use a window function to calculate the cumulative sum. No need to use a UDF:
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.partitionBy().orderBy('label').rowsBetween(W.unboundedPreceding, 0)
df\
    .withColumn('Flag', F.when(F.col('group') == 1, 1).otherwise(0))\
    .withColumn('Output', F.sum('Flag').over(w))\
    .show()
+-----+-----+----+------+
|group|label|Flag|Output|
+-----+-----+----+------+
|    1|    a|   1|     1|
|    2|    b|   0|     1|
|    3|    c|   0|     1|
|    4|    d|   0|     1|
|    1|    e|   1|     2|
|    2|    f|   0|     2|
|    1|    g|   1|     3|
|    1|    h|   1|     4|
|    2|    i|   0|     4|
|    3|    j|   0|     4|
|    1|    k|   1|     5|
|    2|    l|   0|     5|
|    1|    m|   1|     6|
|    1|    n|   1|     7|
|    2|    o|   0|     7|
+-----+-----+----+------+
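One aside worth flagging: the window above orders by label, which only yields the intended cumulative sum because the labels here happen to follow the row order. On real data you would order by a genuine sequence column such as a timestamp; if none exists, a rough sketch is to tag rows with an id first (the 'seq' column below is hypothetical):
from pyspark.sql import functions as F
from pyspark.sql import Window as W
# hypothetical 'seq' column to pin down row order before windowing
df_ordered = df.withColumn('seq', F.monotonically_increasing_id())
w = W.partitionBy().orderBy('seq').rowsBetween(W.unboundedPreceding, 0)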

Efficient code for imputation of negative values using pyspark

I am working on a data set which contains item-wise, date-wise information about the quantity sold of each item. However, there are some negative values in the 'quantity sold' column which I intend to impute. The logic here is to replace such negative values with the mode of the quantity sold for each item at the date level. I have already computed the count of each distinct value of the quantity sold and obtained the most frequent quantity sold of a particular item on each given date. However, I am unable to find a function that would replace the negative values with this value for each item * date combination. I am relatively new to pyspark. Which would be the best approach to use in this case?
Based on the limited information you provided, you can try something like this -
from pyspark import SparkContext
from pyspark.sql import SQLContext
from functools import reduce
import pyspark.sql.functions as F
from pyspark.sql import Window
sc = SparkContext.getOrCreate()
sql = SQLContext(sc)
input_list = [
    (1,10,"2019-11-07")
    ,(1,5,"2019-11-07")
    ,(1,5,"2019-11-07")
    ,(1,5,"2019-11-08")
    ,(1,6,"2019-11-08")
    ,(1,7,"2019-11-09")
    ,(1,7,"2019-11-09")
    ,(1,8,"2019-11-09")
    ,(1,8,"2019-11-09")
    ,(1,8,"2019-11-09")
    ,(1,-10,"2019-11-09")
    ,(2,10,"2019-11-07")
    ,(2,3,"2019-11-07")
    ,(2,9,"2019-11-07")
    ,(2,9,"2019-11-08")
    ,(2,-10,"2019-11-08")
    ,(2,5,"2019-11-09")
    ,(2,5,"2019-11-09")
    ,(2,2,"2019-11-09")
    ,(2,2,"2019-11-09")
    ,(2,2,"2019-11-09")
    ,(2,-10,"2019-11-09")
]
sparkDF = sql.createDataFrame(input_list,['product_id','sold_qty','date'])
sparkDF = sparkDF.withColumn('date',F.to_date(F.col('date'), 'yyyy-MM-dd'))
Mode Implementation
#### Mode implementation
modeDF = sparkDF.groupBy('date', 'sold_qty')\
                .agg(F.count(F.col('sold_qty')).alias('mode_count'))\
                .select(F.col('date'), F.col('sold_qty').alias('mode_sold_qty'), F.col('mode_count'))

window = Window.partitionBy("date").orderBy(F.desc("mode_count"))

#### Filtering out the most frequent value
modeDF = modeDF\
    .withColumn('order', F.row_number().over(window))\
    .where(F.col('order') == 1)
Merging back with Base DataFrame to impute
sparkDF = sparkDF.join(modeDF
                       ,sparkDF['date'] == modeDF['date']
                       ,'inner'
                       ).select(sparkDF['*'], modeDF['mode_sold_qty'], modeDF['mode_count'])

sparkDF = sparkDF.withColumn('imputed_sold_qty',
                             F.when(F.col('sold_qty') < 0, F.col('mode_sold_qty'))
                              .otherwise(F.col('sold_qty')))
>>> sparkDF.show(100)
+----------+--------+----------+-------------+----------+----------------+
|product_id|sold_qty|      date|mode_sold_qty|mode_count|imputed_sold_qty|
+----------+--------+----------+-------------+----------+----------------+
|         1|       7|2019-11-09|            2|         3|               7|
|         1|       7|2019-11-09|            2|         3|               7|
|         1|       8|2019-11-09|            2|         3|               8|
|         1|       8|2019-11-09|            2|         3|               8|
|         1|       8|2019-11-09|            2|         3|               8|
|         1|     -10|2019-11-09|            2|         3|               2|
|         2|       5|2019-11-09|            2|         3|               5|
|         2|       5|2019-11-09|            2|         3|               5|
|         2|       2|2019-11-09|            2|         3|               2|
|         2|       2|2019-11-09|            2|         3|               2|
|         2|       2|2019-11-09|            2|         3|               2|
|         2|     -10|2019-11-09|            2|         3|               2|
|         1|       5|2019-11-08|            9|         1|               5|
|         1|       6|2019-11-08|            9|         1|               6|
|         2|       9|2019-11-08|            9|         1|               9|
|         2|     -10|2019-11-08|            9|         1|               9|
|         1|      10|2019-11-07|            5|         2|              10|
|         1|       5|2019-11-07|            5|         2|               5|
|         1|       5|2019-11-07|            5|         2|               5|
|         2|      10|2019-11-07|            5|         2|              10|
|         2|       3|2019-11-07|            5|         2|               3|
|         2|       9|2019-11-07|            5|         2|               9|
+----------+--------+----------+-------------+----------+----------------+
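Note that the sketch above computes the mode per date across both products, while the question asks for it per item and date. A hedged adjustment, assuming sparkDF is the original frame created above (before the join), is to carry product_id through the grouping, the window, and the join:
#### Mode per product_id and date (sketch, not part of the original answer)
mode_per_item = sparkDF.groupBy('product_id', 'date', 'sold_qty')\
                       .agg(F.count('sold_qty').alias('mode_count'))\
                       .withColumnRenamed('sold_qty', 'mode_sold_qty')
window_item = Window.partitionBy('product_id', 'date').orderBy(F.desc('mode_count'))
mode_per_item = mode_per_item.withColumn('order', F.row_number().over(window_item))\
                             .where(F.col('order') == 1)\
                             .drop('order')
imputed = sparkDF.join(mode_per_item, on=['product_id', 'date'], how='inner')\
                 .withColumn('imputed_sold_qty',
                             F.when(F.col('sold_qty') < 0, F.col('mode_sold_qty'))
                              .otherwise(F.col('sold_qty')))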

Percentile over a specific column

I have the below dataframe.
scala> df.show
+---+------+---+
|  M|Amount| Id|
+---+------+---+
|  1|     5|  1|
|  1|    10|  2|
|  1|    15|  3|
|  1|    20|  4|
|  1|    25|  5|
|  1|    30|  6|
|  2|     2|  1|
|  2|     4|  2|
|  2|     6|  3|
|  2|     8|  4|
|  2|    10|  5|
|  2|    12|  6|
|  3|     1|  1|
|  3|     2|  2|
|  3|     3|  3|
|  3|     4|  4|
|  3|     5|  5|
|  3|     6|  6|
+---+------+---+
created by
val df=Seq( (1,5,1), (1,10,2), (1,15,3), (1,20,4), (1,25,5), (1,30,6), (2,2,1), (2,4,2), (2,6,3), (2,8,4), (2,10,5), (2,12,6), (3,1,1), (3,2,2), (3,3,3), (3,4,4), (3,5,5), (3,6,6) ).toDF("M","Amount","Id")
Here I have a base column M, and Id is the rank within each M group based on Amount.
I am trying to compute the percentile keeping M as a group, but only over the last three values of Amount.
I am using the below code to find the percentile for a group. But how can I target the last three values?
df.withColumn("percentile",percentile_approx(col("Amount") ,lit(.5)) over Window.partitionBy("M"))
Expected Output
+---+------+---+-----------------------------------+
|  M|Amount| Id| percentile                        |
+---+------+---+-----------------------------------+
|  1|     5|  1| percentile(Amount) whose (Id-1)   |
|  1|    10|  2| percentile(Amount) whose (Id-1,2) |
|  1|    15|  3| percentile(Amount) whose (Id-1,3) |
|  1|    20|  4| percentile(Amount) whose (Id-2,4) |
|  1|    25|  5| percentile(Amount) whose (Id-3,5) |
|  1|    30|  6| percentile(Amount) whose (Id-4,6) |
|  2|     2|  1| percentile(Amount) whose (Id-1)   |
|  2|     4|  2| percentile(Amount) whose (Id-1,2) |
|  2|     6|  3| percentile(Amount) whose (Id-1,3) |
|  2|     8|  4| percentile(Amount) whose (Id-2,4) |
|  2|    10|  5| percentile(Amount) whose (Id-3,5) |
|  2|    12|  6| percentile(Amount) whose (Id-4,6) |
|  3|     1|  1| percentile(Amount) whose (Id-1)   |
|  3|     2|  2| percentile(Amount) whose (Id-1,2) |
|  3|     3|  3| percentile(Amount) whose (Id-1,3) |
|  3|     4|  4| percentile(Amount) whose (Id-2,4) |
|  3|     5|  5| percentile(Amount) whose (Id-3,5) |
|  3|     6|  6| percentile(Amount) whose (Id-4,6) |
+---+------+---+-----------------------------------+
This seems a little bit tricky to me as I am still learning Spark. Expecting answers from enthusiasts here.
Adding orderBy("Amount") and rowsBetween(-2,0) to the Window definition gets the required result:
orderBy sorts the rows within each group by Amount
rowsBetween takes only the current row and the two rows before into account when calculating the percentile
val w = Window.partitionBy("M").orderBy("Amount").rowsBetween(-2,0)
df.withColumn("percentile",PercentileApprox.percentile_approx(col("Amount") ,lit(.5))
.over(w))
.orderBy("M", "Amount") //not really required, just to make the output more readable
.show()
prints
+---+------+---+----------+
|  M|Amount| Id|percentile|
+---+------+---+----------+
|  1|     5|  1|         5|
|  1|    10|  2|         5|
|  1|    15|  3|        10|
|  1|    20|  4|        15|
|  1|    25|  5|        20|
|  1|    30|  6|        25|
|  2|     2|  1|         2|
|  2|     4|  2|         2|
|  2|     6|  3|         4|
|  2|     8|  4|         6|
|  2|    10|  5|         8|
|  2|    12|  6|        10|
|  3|     1|  1|         1|
|  3|     2|  2|         1|
|  3|     3|  3|         2|
|  3|     4|  4|         3|
|  3|     5|  5|         4|
|  3|     6|  6|         5|
+---+------+---+----------+
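For completeness, since the rest of this page is pyspark: on Spark 3.1+ the same window can be written in Python, where percentile_approx is exposed directly in pyspark.sql.functions (a sketch along the same lines, not part of the original answer):
import pyspark.sql.functions as F
from pyspark.sql import Window
w = Window.partitionBy("M").orderBy("Amount").rowsBetween(-2, 0)
df.withColumn("percentile", F.percentile_approx("Amount", 0.5).over(w))\
  .orderBy("M", "Amount")\
  .show()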

How to replace for loop in python with map transformation in pyspark where we want to compare previous row and current row with multiple conditions

I have just run into a roadblock while applying a map function on a pyspark dataframe and need your help getting out of it.
Though the real problem is more complicated, let me simplify it with the example below using a dictionary and a for loop; I need the solution in pyspark.
Here is an example of python code on dummy data. I want the same in a pyspark map transformation, with a when clause, using a window or any other way.
Problem - I have a pyspark dataframe whose column names are the keys in the dictionary below, and I want to add/modify the section column with logic similar to the for loop in this example.
record = [
    {'id': xyz, 'SN': xyz, 'miles': xyz, 'feet': xyz, 'MP': xyz, 'section': xyz},
    {'id': xyz, 'SN': xyz, 'miles': xyz, 'feet': xyz, 'MP': xyz, 'section': xyz},
    {'id': xyz, 'SN': xyz, 'miles': xyz, 'feet': xyz, 'MP': xyz, 'section': xyz}
]
last_rec = None
section = 0
for cur_rec in record:
    if last_rec is not None:
        if (last_rec['id'] != cur_rec['id'] or last_rec['SN'] != cur_rec['SN']):
            section += 1
        elif (last_rec['miles'] == cur_rec['miles'] and abs(last_rec['feet'] - cur_rec['feet']) > 1):
            section += 1
        elif (last_rec['MP'] == 555 and cur_rec['MP'] != 555):
            section += 1
        elif (abs(last_rec['miles'] - cur_rec['miles']) > 1):
            section += 1
    cur_rec['section'] = section
    last_rec = cur_rec
Your window function is a cumulative sum of a boolean variable.
Let's start with a sample dataframe:
import numpy as np
record_df = spark.createDataFrame(
    [list(x) for x in zip(*[np.random.randint(0, 10, 100).tolist() for _ in range(5)])],
    ['id', 'SN', 'miles', 'feet', 'MP'])
record_df.show()
+---+---+-----+----+---+
| id| SN|miles|feet| MP|
+---+---+-----+----+---+
|  9|  5|    7|   5|  1|
|  0|  6|    3|   7|  5|
|  8|  2|    7|   3|  5|
|  0|  2|    6|   5|  8|
|  0|  8|    9|   1|  5|
|  8|  5|    1|   6|  0|
|  0|  3|    9|   0|  3|
|  6|  4|    9|   0|  8|
|  5|  8|    8|   1|  0|
|  3|  0|    9|   9|  9|
|  1|  1|    2|   7|  0|
|  1|  3|    7|   7|  6|
|  4|  9|    5|   5|  5|
|  3|  6|    0|   0|  0|
|  5|  5|    5|   9|  3|
|  8|  3|    7|   8|  1|
|  7|  1|    3|   1|  8|
|  3|  1|    5|   2|  5|
|  6|  2|    3|   5|  6|
|  9|  4|    5|   9|  1|
+---+---+-----+----+---+
A cumulative sum is an ordered window function, therefore we'll need to use monotonically_increasing_id to give an order to our rows:
import pyspark.sql.functions as psf
record_df = record_df.withColumn(
    'rn',
    psf.monotonically_increasing_id())
For the boolean variable we'll need to use lag:
from pyspark.sql import Window
w = Window.orderBy('rn')
record_df = record_df.select(
    record_df.columns
    + [psf.lag(c).over(w).alias('prev_' + c) for c in ['id', 'SN', 'miles', 'feet', 'MP']])
Since all the conditions yield the same result on section, it is an or clause:
clause = (psf.col("prev_id") != psf.col("id")) | (psf.col("prev_SN") != psf.col("SN")) \
| ((psf.col("prev_miles") == psf.col("miles")) & (psf.abs(psf.col("prev_feet") - psf.col("feet")) > 1)) \
| ((psf.col("prev_MP") == 555) & (psf.col("MP") != 555)) \
| (psf.abs(psf.col("prev_miles") - psf.col("miles")) > 1)
record_df = record_df.withColumn("tmp", (clause).cast('int'))
And finally for the cumulative sum
record_df = record_df.withColumn("section", psf.sum("tmp").over(w))

Pyspark - Ranking columns keeping ties

I'm looking for a way to rank columns of a dataframe preserving ties. Specifically for this example, I have a pyspark dataframe as follows where I want to generate ranks for colA & colB (though I want to support being able to rank N number of columns)
+--------+----------+-----+----+
|  Entity|        id| colA|colB|
+--------+----------+-----+----+
|       a|8589934652|   21|  50|
|       b|       112|    9|  23|
|       c|8589934629|    9|  23|
|       d|8589934702|    8|  21|
|       e|        20|    2|  21|
|       f|8589934657|    2|   5|
|       g|8589934601|    1|   5|
|       h|8589934653|    1|   4|
|       i|8589934620|    0|   4|
|       j|8589934643|    0|   3|
|       k|8589934618|    0|   3|
|       l|8589934602|    0|   2|
|       m|8589934664|    0|   2|
|       n|        25|    0|   1|
|       o|        67|    0|   1|
|       p|8589934642|    0|   1|
|       q|8589934709|    0|   1|
|       r|8589934660|    0|   1|
|       s|        30|    0|   1|
|       t|        55|    0|   1|
+--------+----------+-----+----+
What I'd like is a way to rank this dataframe where tied values receive the same rank such as:
+--------+----------+-----+----+---------+---------+
|  Entity|        id| colA|colB|colA_rank|colB_rank|
+--------+----------+-----+----+---------+---------+
|       a|8589934652|   21|  50|        1|        1|
|       b|       112|    9|  23|        2|        2|
|       c|8589934629|    9|  21|        2|        3|
|       d|8589934702|    8|  21|        3|        3|
|       e|        20|    2|  21|        4|        3|
|       f|8589934657|    2|   5|        4|        4|
|       g|8589934601|    1|   5|        5|        4|
|       h|8589934653|    1|   4|        5|        5|
|       i|8589934620|    0|   4|        6|        5|
|       j|8589934643|    0|   3|        6|        6|
|       k|8589934618|    0|   3|        6|        6|
|       l|8589934602|    0|   2|        6|        7|
|       m|8589934664|    0|   2|        6|        7|
|       n|        25|    0|   1|        6|        8|
|       o|        67|    0|   1|        6|        8|
|       p|8589934642|    0|   1|        6|        8|
|       q|8589934709|    0|   1|        6|        8|
|       r|8589934660|    0|   1|        6|        8|
|       s|        30|    0|   1|        6|        8|
|       t|        55|    0|   1|        6|        8|
+--------+----------+-----+----+---------+---------+
My current implementation with the first dataframe looks like:
def getRanks(mydf, cols=None, ascending=False):
    from pyspark import Row
    # This takes a dataframe and a list of columns to rank
    # If no list is provided, it ranks *all* columns
    # returns a new dataframe
    def addRank(ranked_rdd, col, ascending):
        # This assumes an RDD of the form (Row(...), list[...])
        # it orders the rdd by col, finds the order, then adds that to the
        # list
        myrdd = ranked_rdd.sortBy(lambda (row, ranks): row[col],
                                  ascending=ascending).zipWithIndex()
        return myrdd.map(lambda ((row, ranks), index): (row, ranks +
                                                        [index + 1]))

    myrdd = mydf.rdd
    fields = myrdd.first().__fields__
    ranked_rdd = myrdd.map(lambda x: (x, []))
    if (cols is None):
        cols = fields
    for col in cols:
        ranked_rdd = addRank(ranked_rdd, col, ascending)
    rank_names = [x + "_rank" for x in cols]
    # Hack to make sure columns come back in the right order
    ranked_rdd = ranked_rdd.map(lambda (row, ranks): Row(*row.__fields__ +
                                                         rank_names)(*row + tuple(ranks)))
    return ranked_rdd.toDF()
which produces:
+--------+----------+-----+----+---------+---------+
|  Entity|        id| colA|colB|colA_rank|colB_rank|
+--------+----------+-----+----+---------+---------+
|       a|8589934652|   21|  50|        1|        1|
|       b|       112|    9|  23|        2|        2|
|       c|8589934629|    9|  23|        3|        3|
|       d|8589934702|    8|  21|        4|        4|
|       e|        20|    2|  21|        5|        5|
|       f|8589934657|    2|   5|        6|        6|
|       g|8589934601|    1|   5|        7|        7|
|       h|8589934653|    1|   4|        8|        8|
|       i|8589934620|    0|   4|        9|        9|
|       j|8589934643|    0|   3|       10|       10|
|       k|8589934618|    0|   3|       11|       11|
|       l|8589934602|    0|   2|       12|       12|
|       m|8589934664|    0|   2|       13|       13|
|       n|        25|    0|   1|       14|       14|
|       o|        67|    0|   1|       15|       15|
|       p|8589934642|    0|   1|       16|       16|
|       q|8589934709|    0|   1|       17|       17|
|       r|8589934660|    0|   1|       18|       18|
|       s|        30|    0|   1|       19|       19|
|       t|        55|    0|   1|       20|       20|
+--------+----------+-----+----+---------+---------+
As you can see, the function getRanks() takes a dataframe, specifies the columns to be ranked, sorts them, and uses zipWithIndex() to generate an ordering or rank. However, I can't figure out a way to preserve ties.
This stackoverflow post is the closest solution I've found:
rank-users-by-column. But it appears to only handle 1 column (I think).
Thanks so much for the help in advance!
EDIT: column 'id' is generated from calling monotonically_increasing_id() and in my implementation is cast to a string.
You're looking for dense_rank
First let's create our dataframe:
df = spark.createDataFrame(sc.parallelize(
    [["a",8589934652,21,50],["b",112,9,23],["c",8589934629,9,23],
     ["d",8589934702,8,21],["e",20,2,21],["f",8589934657,2,5],
     ["g",8589934601,1,5],["h",8589934653,1,4],["i",8589934620,0,4],
     ["j",8589934643,0,3],["k",8589934618,0,3],["l",8589934602,0,2],
     ["m",8589934664,0,2],["n",25,0,1],["o",67,0,1],["p",8589934642,0,1],
     ["q",8589934709,0,1],["r",8589934660,0,1],["s",30,0,1],["t",55,0,1]]
), ["Entity","id","colA","colB"])
We'll define two windowSpec:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("colA"))
wB = Window.orderBy(psf.desc("colB"))
df = df.withColumn(
    "colA_rank",
    psf.dense_rank().over(wA)
).withColumn(
    "colB_rank",
    psf.dense_rank().over(wB)
)
+------+----------+----+----+---------+---------+
|Entity|        id|colA|colB|colA_rank|colB_rank|
+------+----------+----+----+---------+---------+
|     a|8589934652|  21|  50|        1|        1|
|     b|       112|   9|  23|        2|        2|
|     c|8589934629|   9|  23|        2|        2|
|     d|8589934702|   8|  21|        3|        3|
|     e|        20|   2|  21|        4|        3|
|     f|8589934657|   2|   5|        4|        4|
|     g|8589934601|   1|   5|        5|        4|
|     h|8589934653|   1|   4|        5|        5|
|     i|8589934620|   0|   4|        6|        5|
|     j|8589934643|   0|   3|        6|        6|
|     k|8589934618|   0|   3|        6|        6|
|     l|8589934602|   0|   2|        6|        7|
|     m|8589934664|   0|   2|        6|        7|
|     n|        25|   0|   1|        6|        8|
|     o|        67|   0|   1|        6|        8|
|     p|8589934642|   0|   1|        6|        8|
|     q|8589934709|   0|   1|        6|        8|
|     r|8589934660|   0|   1|        6|        8|
|     s|        30|   0|   1|        6|        8|
|     t|        55|   0|   1|        6|        8|
+------+----------+----+----+---------+---------+
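Since the question mentions wanting to rank N columns, the same dense_rank pattern generalizes with a loop (a sketch, assuming rank_cols holds whichever columns you want ranked in descending order):
rank_cols = ["colA", "colB"]  # any number of columns
for c in rank_cols:
    df = df.withColumn(c + "_rank", psf.dense_rank().over(Window.orderBy(psf.desc(c))))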
I'll also pose an alternative:
from pyspark.sql.functions import col  # col() is used below

for cols in data.columns[2:]:
    lookup = (data.select(cols)
              .distinct()
              .orderBy(cols, ascending=False)
              .rdd
              .zipWithIndex()
              .map(lambda x: x[0] + (x[1], ))
              .toDF([cols, cols + "_rank_lookup"]))
    name = cols + "_ranks"
    data = data.join(lookup, [cols]).withColumn(name, col(cols + "_rank_lookup") + 1)\
               .drop(cols + "_rank_lookup")
Not as elegant as dense_rank() and I'm uncertain as to performance implications.