Percentile over a specific column - scala

I have the dataframe below.
scala> df.show
+---+------+---+
| M|Amount| Id|
+---+------+---+
| 1| 5| 1|
| 1| 10| 2|
| 1| 15| 3|
| 1| 20| 4|
| 1| 25| 5|
| 1| 30| 6|
| 2| 2| 1|
| 2| 4| 2|
| 2| 6| 3|
| 2| 8| 4|
| 2| 10| 5|
| 2| 12| 6|
| 3| 1| 1|
| 3| 2| 2|
| 3| 3| 3|
| 3| 4| 4|
| 3| 5| 5|
| 3| 6| 6|
+---+------+---+
It was created with:
val df=Seq( (1,5,1), (1,10,2), (1,15,3), (1,20,4), (1,25,5), (1,30,6), (2,2,1), (2,4,2), (2,6,3), (2,8,4), (2,10,5), (2,12,6), (3,1,1), (3,2,2), (3,3,3), (3,4,4), (3,5,5), (3,6,6) ).toDF("M","Amount","Id")
Here M is the base column, and Id ranks the rows within each M by Amount.
I am trying to compute the percentile within each M group, but only over the last three values of Amount (the current row and the two rows before it).
I am using the code below to find the percentile for a whole group. How can I restrict it to the last three values?
df.withColumn("percentile",percentile_approx(col("Amount") ,lit(.5)) over Window.partitionBy("M"))
Expected Output
+---+------+---+-----------------------------------+
| M|Amount| Id| percentile |
+---+------+---+-----------------------------------+
| 1| 5| 1| percentile(Amount) whose (Id-1) |
| 1| 10| 2| percentile(Amount) whose (Id-1,2) |
| 1| 15| 3| percentile(Amount) whose (Id-1,3) |
| 1| 20| 4| percentile(Amount) whose (Id-2,4) |
| 1| 25| 5| percentile(Amount) whose (Id-3,5) |
| 1| 30| 6| percentile(Amount) whose (Id-4,6) |
| 2| 2| 1| percentile(Amount) whose (Id-1) |
| 2| 4| 2| percentile(Amount) whose (Id-1,2) |
| 2| 6| 3| percentile(Amount) whose (Id-1,3) |
| 2| 8| 4| percentile(Amount) whose (Id-2,4) |
| 2| 10| 5| percentile(Amount) whose (Id-3,5) |
| 2| 12| 6| percentile(Amount) whose (Id-4,6) |
| 3| 1| 1| percentile(Amount) whose (Id-1) |
| 3| 2| 2| percentile(Amount) whose (Id-1,2) |
| 3| 3| 3| percentile(Amount) whose (Id-1,3) |
| 3| 4| 4| percentile(Amount) whose (Id-2,4) |
| 3| 5| 5| percentile(Amount) whose (Id-3,5) |
| 3| 6| 6| percentile(Amount) whose (Id-4,6) |
+---+------+---+----------------------------------+
This seems a little tricky to me as I am still learning Spark. I would appreciate any answers.

Adding orderBy("Amount") and rowsBetween(-2,0) to the Window definition gets the required result:
orderBy sorts the rows within each group by Amount
rowsBetween takes only the current row and the two rows before into account when calculating the percentile
val w = Window.partitionBy("M").orderBy("Amount").rowsBetween(-2,0)
df.withColumn("percentile",PercentileApprox.percentile_approx(col("Amount") ,lit(.5))
.over(w))
.orderBy("M", "Amount") //not really required, just to make the output more readable
.show()
prints
+---+------+---+----------+
| M|Amount| Id|percentile|
+---+------+---+----------+
| 1| 5| 1| 5|
| 1| 10| 2| 5|
| 1| 15| 3| 10|
| 1| 20| 4| 15|
| 1| 25| 5| 20|
| 1| 30| 6| 25|
| 2| 2| 1| 2|
| 2| 4| 2| 2|
| 2| 6| 3| 4|
| 2| 8| 4| 6|
| 2| 10| 5| 8|
| 2| 12| 6| 10|
| 3| 1| 1| 1|
| 3| 2| 2| 1|
| 3| 3| 3| 2|
| 3| 4| 4| 3|
| 3| 5| 5| 4|
| 3| 6| 6| 5|
+---+------+---+----------+
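To make the frame semantics concrete, here is a minimal plain-Python sketch (not Spark code) of what partitionBy("M").orderBy("Amount").rowsBetween(-2,0) does. The helper names lower_percentile and rolling_percentile are hypothetical, and the percentile rule (take the smallest value whose cumulative share reaches p) is an assumption chosen because it reproduces the output above on this small data set.

```python
from collections import defaultdict
from math import ceil

# (M, Amount, Id) rows equivalent to the question's dataframe
rows = [(m, step * i, i) for m, step in ((1, 5), (2, 2), (3, 1)) for i in range(1, 7)]


def lower_percentile(values, p):
    """Return the smallest value whose cumulative share of the sorted values reaches p."""
    ordered = sorted(values)
    index = max(ceil(p * len(ordered)) - 1, 0)
    return ordered[index]


def rolling_percentile(rows, p=0.5, size=3):
    """Emulate partitionBy(M).orderBy(Amount).rowsBetween(-(size - 1), 0)."""
    groups = defaultdict(list)
    for m, amount, row_id in rows:
        groups[m].append((amount, row_id))
    result = []
    for m in sorted(groups):
        ordered = sorted(groups[m])  # order rows inside the group by Amount
        for i, (amount, row_id) in enumerate(ordered):
            # frame = current row plus up to (size - 1) preceding rows
            frame = [a for a, _ in ordered[max(0, i - size + 1): i + 1]]
            result.append((m, amount, row_id, lower_percentile(frame, p)))
    return result


for row in rolling_percentile(rows):
    print(row)
```

The first two rows of every group see fewer than three values, which is why their percentile equals the smallest Amount of the group.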

Related

Pyspark keep state within tasks

This is related to this question: Pyspark dataframe column value dependent on value from another row but this one gets even more complicated.
I have a dataframe:
columns = ['id','seq','manufacturer']
data = [("1",1,"Factory"), ("1",2,"Sub-Factory-1"), ("1",3,"Order"),("1",4,"Sub-Factory-1"),("2",1,"Factory"), ("2",2,"Sub-Factory-1"), ("2",5,"Sub-Factory-1"),("3",1, "Sub-Factory-1"),("3",2,"Order"), ("3",4, "Sub-Factory-1"), ("4", 1,"Factory"), ("4",3, "Sub-Factory-1"),("4",4, "Sub-Factory-1"),("5",1,"Sub-Factory-1"), ("5",2, "Sub-Factory-1"), ("5", 6,"Order"), ("6",2,"Factory"), ("6",3, "Order"), ("6",4,"Sub-Factory-1"), ("6", 6,"Sub-Factory-1"), ("6",7,"Order"), ("7",1,"Sub-Factory-1"), ("7",2,"Factory" ), ("7", 3,"Order"), ("7", 4,"Sub-Factory-1"),("7",5,"Factory"), ("7",8, "Sub-Factory-1"),("7",10,"Sub-Factory-1")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.orderBy('id','seq').show(40)
+---+---+-------------+
| id|seq| manufacturer|
+---+---+-------------+
| 1| 1| Factory|
| 1| 2|Sub-Factory-1|
| 1| 3| Order|
| 1| 4|Sub-Factory-1|
| 2| 1| Factory|
| 2| 2|Sub-Factory-1|
| 2| 5|Sub-Factory-1|
| 3| 1|Sub-Factory-1|
| 3| 2| Order|
| 3| 4|Sub-Factory-1|
| 4| 1| Factory|
| 4| 3|Sub-Factory-1|
| 4| 4|Sub-Factory-1|
| 5| 1|Sub-Factory-1|
| 5| 2|Sub-Factory-1|
| 5| 6| Order|
| 6| 2| Factory|
| 6| 3| Order|
| 6| 4|Sub-Factory-1|
| 6| 6|Sub-Factory-1|
| 6| 7| Order|
| 7| 1|Sub-Factory-1|
| 7| 2| Factory|
| 7| 3| Order|
| 7| 4|Sub-Factory-1|
| 7| 5| Factory|
| 7| 8|Sub-Factory-1|
| 7| 10|Sub-Factory-1|
+---+---+-------------+
What I want to do is assign hierarchical values to another column (not saying it's the best idea) that I can use with the logic from "Pyspark dataframe column value dependent on value from another row". Within each id group, in seq order, I want only the first Sub-Factory to be attributed to a Factory, and only if there is a Factory above that Sub-Factory within the same id.
So end result should look like:
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
| 6| 7| Order| 0|
| 7| 1|Sub-Factory-1| 0|
| 7| 2| Factory| 1|
| 7| 3| Order| 0|
| 7| 4|Sub-Factory-1| 1|
| 7| 5| Factory| 1|
| 7| 8|Sub-Factory-1| 1|
| 7| 10|Sub-Factory-1| 0|
+---+---+-------------+-------+
The dataset is large, so I can't use something like df.collect() and then loop over the data, because it runs out of memory. My first idea was to use an accumulator like:
acc = sc.accumulator(0)
def myFunc(manufacturer):
    if manufacturer == 'Factory':
        acc.value = 1
        return 1
    elif manufacturer == 'Sub-Factory-1' and acc.value == 1:
        acc.value = 0
        return 1
    else:
        return 0
myFuncUDF = F.udf(myFunc, IntegerType())
df = df.withColumn('test', myFuncUDF(col('manufacturer')))
But it's a bad idea, since tasks can only add to an accumulator; its value can only be read on the driver.
A Window function also solves it if I want to attribute all Sub-Factories below a Factory within the same id, but now only the first Sub-Factory should be attributed. Any ideas?
from pyspark.sql.window import Window
from pyspark.sql.functions import *
df_mod = df.filter(df.manufacturer == 'Sub-Factory-1')
W = Window.partitionBy("id").orderBy("seq")
df_mod = df_mod.withColumn("rank",rank().over(W))
df_mod = df_mod.filter(col('rank') == 1)
df_mod2 = df.filter(col('manufacturer') == 'Factory')\
.select('id', 'seq', col('manufacturer').alias('Factory_chk_2'))
df_f = df\
.join(df_mod, ['id', 'seq'], 'left')\
.select('id', 'seq', df.manufacturer, 'rank')\
.join(df_mod2, 'id', 'left')\
.select('id', df.seq, df.manufacturer, 'rank', 'Factory_chk_2')\
.withColumn('Factory_chk', when(df.manufacturer=='Factory', 1))\
.withColumn('Factory_chk_2', when(col('Factory_chk_2')=='Factory', 1))\
.withColumn('checker',when(col('Factory_chk_2')=='1', coalesce(col('rank'),col('Factory_chk'))).otherwise(lit(0)))\
.select('id', 'seq', 'manufacturer', 'checker')\
.na.fill(value=0)\
.orderBy('id', 'seq')
df_f.show()
+---+---+-------------+-------+
| id|seq| manufacturer|checker|
+---+---+-------------+-------+
| 1| 1| Factory| 1|
| 1| 2|Sub-Factory-1| 1|
| 1| 3| Order| 0|
| 1| 4|Sub-Factory-1| 0|
| 2| 1| Factory| 1|
| 2| 2|Sub-Factory-1| 1|
| 2| 5|Sub-Factory-1| 0|
| 3| 1|Sub-Factory-1| 0|
| 3| 2| Order| 0|
| 3| 4|Sub-Factory-1| 0|
| 4| 1| Factory| 1|
| 4| 3|Sub-Factory-1| 1|
| 4| 4|Sub-Factory-1| 0|
| 5| 1|Sub-Factory-1| 0|
| 5| 2|Sub-Factory-1| 0|
| 5| 6| Order| 0|
| 6| 2| Factory| 1|
| 6| 3| Order| 0|
| 6| 4|Sub-Factory-1| 1|
| 6| 6|Sub-Factory-1| 0|
+---+---+-------------+-------+
only showing top 20 rows
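For reference, the rule from the question can be written as a small per-group scan. The following is a plain-Python sketch, not Spark code and not the approach used in the answer above; add_checker and the pending flag are hypothetical names. It only covers ids 6 and 7 from the question's data to stay short, and it assumes each id group is small enough to scan sequentially.

```python
from itertools import groupby

# (id, seq, manufacturer) rows for ids 6 and 7 from the question
rows = [
    ("6", 2, "Factory"), ("6", 3, "Order"), ("6", 4, "Sub-Factory-1"),
    ("6", 6, "Sub-Factory-1"), ("6", 7, "Order"),
    ("7", 1, "Sub-Factory-1"), ("7", 2, "Factory"), ("7", 3, "Order"),
    ("7", 4, "Sub-Factory-1"), ("7", 5, "Factory"), ("7", 8, "Sub-Factory-1"),
    ("7", 10, "Sub-Factory-1"),
]


def add_checker(rows):
    """Flag every Factory, plus the first Sub-Factory-1 that follows it in the same id."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        pending = False  # True after a Factory until a Sub-Factory-1 consumes it
        for row_id, seq, manufacturer in group:
            if manufacturer == "Factory":
                pending = True
                checker = 1
            elif manufacturer == "Sub-Factory-1" and pending:
                pending = False
                checker = 1
            else:
                checker = 0
            result.append((row_id, seq, manufacturer, checker))
    return result


for row in add_checker(rows):
    print(row)
```

The state is reset at the start of every id, which is what the accumulator attempt could not guarantee across tasks.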

Pyspark: How to group rows into N groups?

I am performing a df.groupBy().apply() in my pyspark script and want to create a custom column that groups all my rows into N groups (as even as possible, so about rows/N each). That way, I can control the number of groups sent to my udf function every time the script runs.
How can I do this using pyspark?
If you need an exact split, then you need windowing:
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
w=Window.orderBy(F.lit(1))
tst_mod = tst.withColumn("id",(F.row_number().over(w))%3) # 3 is the number of groups in this example
tst_mod.show()
+----+----+----+----+---+
|col1|col2|col3|col4| id|
+----+----+----+----+---+
| 5| 3| 7| 5| 1|
| 3| 2| 5| 4| 2|
| 5| 3| 7| 5| 0|
| 7| 3| 9| 5| 1|
| 1| 2| 3| 4| 2|
| 7| 3| 9| 5| 0|
| 1| 2| 3| 4| 1|
| 5| 3| 7| 5| 2|
| 7| 3| 9| 5| 0|
| 1| 2| 3| 4| 1|
| 3| 2| 5| 4| 2|
| 5| 3| 7| 5| 0|
| 3| 2| 5| 4| 1|
| 7| 3| 9| 5| 2|
| 3| 2| 5| 4| 0|
| 1| 2| 3| 4| 1|
+----+----+----+----+---+
tst_mod.groupby('id').count().show()
+---+-----+
| id|count|
+---+-----+
| 1| 6|
| 2| 5|
| 0| 5|
+---+-----+
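The group sizes above follow directly from taking the row number modulo the number of groups. A minimal plain-Python sketch of that arithmetic (assign_groups is a hypothetical helper, not part of PySpark):

```python
from collections import Counter


def assign_groups(n_rows, n_groups):
    """Round-robin assignment; row numbers start at 1, as with row_number()."""
    return [row_number % n_groups for row_number in range(1, n_rows + 1)]


group_ids = assign_groups(16, 3)
sizes = Counter(group_ids)
print(sizes)
```

With 16 rows and 3 groups, the sizes differ by at most one, which is as even as an integer split can be.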
If you are OK with an approximately even (uniformly random) split, then you can try a technique called salting:
import pyspark.sql.functions as F
from pyspark.sql import Window
#Test data
tst = sqlContext.createDataFrame([(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5),(1,2,3,4),(3,2,5,4),(5,3,7,5),(7,3,9,5)],schema=['col1','col2','col3','col4'])
# rand() is uniform on [0, 1); floor turns it into an integer bucket 0, 1 or 2
tst_salt= tst.withColumn("salt", F.floor(F.rand(seed=10)*3))
If you group by the column salt, you will get groups of roughly equal size (uniformly distributed, not exactly even).
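As an illustration of why the salt has to be turned into an integer bucket before grouping, here is a plain-Python sketch. It uses Python's own random generator, so the bucket values will not match what Spark's rand(seed=10) produces; salt_bucket is a hypothetical helper.

```python
import random
from collections import Counter


def salt_bucket(rng, n_groups):
    """Map a uniform draw from [0, 1) to an integer bucket in 0..n_groups-1."""
    return int(rng.random() * n_groups)


rng = random.Random(10)  # fixed seed so the assignment is repeatable
buckets = [salt_bucket(rng, 3) for _ in range(16)]
counts = Counter(buckets)
print(counts)
```

Without the integer conversion, nearly every row would get its own distinct salt value and therefore its own group.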

PySpark: counting rows based on current row value

I have a DataFrame with a column "Speed". Can I efficiently add a column that holds, for each row, the number of rows in the DataFrame whose "Speed" is within +/-2 of that row's "Speed"?
results = spark.createDataFrame([[1],[2],[3],[4],[5],
[4],[5],[4],[5],[6],
[5],[6],[1],[3],[8],
[2],[5],[6],[10],[12]],
['Speed'])
results.show()
+-----+
|Speed|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 4|
| 5|
| 4|
| 5|
| 6|
| 5|
| 6|
| 1|
| 3|
| 8|
| 2|
| 5|
| 6|
| 10|
| 12|
+-----+
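You could use a window function:
from pyspark.sql import Window
import pyspark.sql.functions as F
# Order the window by Speed and take the value range [Speed, Speed+2]
# (use rangeBetween(-2, 2) for the symmetric range [Speed-2, Speed+2])
w = Window.orderBy('Speed').rangeBetween(0,2)
# Add a column counting the rows whose Speed falls inside that range
results = results.withColumn('count+2',F.count('Speed').over(w)).orderBy('Speed')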
results.show()
+-----+-----+
|Speed|count|
+-----+-----+
| 1| 6|
| 1| 6|
| 2| 7|
| 2| 7|
| 3| 10|
| 3| 10|
| 4| 11|
| 4| 11|
| 4| 11|
| 5| 8|
| 5| 8|
| 5| 8|
| 5| 8|
| 5| 8|
| 6| 4|
| 6| 4|
| 6| 4|
| 8| 2|
| 10| 2|
| 12| 1|
+-----+-----+
Note: the window also counts the current row itself. You can correct this by subtracting 1 from the count:
results = results.withColumn('count+2',F.count('Speed').over(w)-1).orderBy('Speed')
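The difference between a one-sided range and the +/-2 range from the question can be checked with a plain-Python sketch (not Spark code). count_in_range is a hypothetical helper that emulates rangeBetween(lower_offset, upper_offset) over a window ordered by Speed.

```python
from bisect import bisect_left, bisect_right

speeds = [1, 2, 3, 4, 5, 4, 5, 4, 5, 6, 5, 6, 1, 3, 8, 2, 5, 6, 10, 12]


def count_in_range(values, lower_offset, upper_offset):
    """For each value v, count the values inside [v + lower_offset, v + upper_offset]."""
    ordered = sorted(values)
    return [
        (v, bisect_right(ordered, v + upper_offset) - bisect_left(ordered, v + lower_offset))
        for v in ordered
    ]


forward = count_in_range(speeds, 0, 2)     # like rangeBetween(0, 2)
symmetric = count_in_range(speeds, -2, 2)  # like rangeBetween(-2, 2)
print(forward)
print(symmetric)
```

The forward list reproduces the counts in the table above; the symmetric list is what the +/-2 wording of the question describes.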

Pyspark - Ranking columns keeping ties

I'm looking for a way to rank columns of a dataframe preserving ties. Specifically for this example, I have a pyspark dataframe as follows where I want to generate ranks for colA & colB (though I want to support being able to rank N number of columns)
+--------+----------+-----+----+
| Entity| id| colA|colB|
+-------------------+-----+----+
| a|8589934652| 21| 50|
| b| 112| 9| 23|
| c|8589934629| 9| 23|
| d|8589934702| 8| 21|
| e| 20| 2| 21|
| f|8589934657| 2| 5|
| g|8589934601| 1| 5|
| h|8589934653| 1| 4|
| i|8589934620| 0| 4|
| j|8589934643| 0| 3|
| k|8589934618| 0| 3|
| l|8589934602| 0| 2|
| m|8589934664| 0| 2|
| n| 25| 0| 1|
| o| 67| 0| 1|
| p|8589934642| 0| 1|
| q|8589934709| 0| 1|
| r|8589934660| 0| 1|
| s| 30| 0| 1|
| t| 55| 0| 1|
+--------+----------+-----+----+
What I'd like is a way to rank this dataframe where tied values receive the same rank such as:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+-------------------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 2| 2|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+--------+----------+-----+----+---------+---------+
My current implementation with the first dataframe looks like:
def getRanks(mydf, cols=None, ascending=False):
    from pyspark.sql import Row
    # This takes a dataframe and a list of columns to rank
    # If no list is provided, it ranks *all* columns
    # returns a new dataframe

    def addRank(ranked_rdd, col, ascending):
        # This assumes an RDD of the form (Row(...), list[...])
        # it orders the rdd by col, finds the order, then adds that to the
        # list
        # (lambdas index into the pair because Python 3 has no tuple parameters)
        myrdd = ranked_rdd.sortBy(lambda pair: pair[0][col],
                                  ascending=ascending).zipWithIndex()
        return myrdd.map(lambda item: (item[0][0], item[0][1] + [item[1] + 1]))

    myrdd = mydf.rdd
    fields = myrdd.first().__fields__
    ranked_rdd = myrdd.map(lambda x: (x, []))
    if cols is None:
        cols = fields
    for col in cols:
        ranked_rdd = addRank(ranked_rdd, col, ascending)
    rank_names = [x + "_rank" for x in cols]
    # Hack to make sure columns come back in the right order
    ranked_rdd = ranked_rdd.map(lambda pair: Row(*(pair[0].__fields__ +
                                rank_names))(*(pair[0] + tuple(pair[1]))))
    return ranked_rdd.toDF()
which produces:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+-------------------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 3| 3|
| d|8589934702| 8| 21| 4| 4|
| e| 20| 2| 21| 5| 5|
| f|8589934657| 2| 5| 6| 6|
| g|8589934601| 1| 5| 7| 7|
| h|8589934653| 1| 4| 8| 8|
| i|8589934620| 0| 4| 9| 9|
| j|8589934643| 0| 3| 10| 10|
| k|8589934618| 0| 3| 11| 11|
| l|8589934602| 0| 2| 12| 12|
| m|8589934664| 0| 2| 13| 13|
| n| 25| 0| 1| 14| 14|
| o| 67| 0| 1| 15| 15|
| p|8589934642| 0| 1| 16| 16|
| q|8589934709| 0| 1| 17| 17|
| r|8589934660| 0| 1| 18| 18|
| s| 30| 0| 1| 19| 19|
| t| 55| 0| 1| 20| 20|
+--------+----------+-----+----+---------+---------+
As you can see, the function getRanks() takes a dataframe, specifies the columns to be ranked, sorts them, and uses zipWithIndex() to generate an ordering or rank. However, I can't figure out a way to preserve ties.
This Stack Overflow post is the closest solution I've found: rank-users-by-column. But it appears to only handle one column (I think).
Thanks so much for the help in advance!
EDIT: column 'id' is generated from calling monotonically_increasing_id() and in my implementation is cast to a string.
You're looking for dense_rank
First let's create our dataframe:
df = spark.createDataFrame(sc.parallelize([["a",8589934652,21,50],["b",112,9,23],["c",8589934629,9,23],
["d",8589934702,8,21],["e",20,2,21],["f",8589934657,2,5],
["g",8589934601,1,5],["h",8589934653,1,4],["i",8589934620,0,4],
["j",8589934643,0,3],["k",8589934618,0,3],["l",8589934602,0,2],
["m",8589934664,0,2],["n",25,0,1],["o",67,0,1],["p",8589934642,0,1],
["q",8589934709,0,1],["r",8589934660,0,1],["s",30,0,1],["t",55,0,1]]
), ["Entity","id","colA","colB"])
We'll define two window specs, one per column to rank:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("colA"))
wB = Window.orderBy(psf.desc("colB"))
df = df.withColumn(
"colA_rank",
psf.dense_rank().over(wA)
).withColumn(
"colB_rank",
psf.dense_rank().over(wB)
)
+------+----------+----+----+---------+---------+
|Entity| id|colA|colB|colA_rank|colB_rank|
+------+----------+----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 2| 2|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+------+----------+----+----+---------+---------+
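What dense_rank computes can be stated in a few lines of plain Python: tied values share a rank, and the next distinct value gets the next consecutive rank. This is only a sketch of the semantics on the colA and colB values from the question, not Spark code; the function name dense_rank here is a local helper.

```python
def dense_rank(values, descending=True):
    """Rank values so that ties share a rank and ranks have no gaps."""
    distinct = sorted(set(values), reverse=descending)
    rank_of = {value: rank for rank, value in enumerate(distinct, start=1)}
    return [rank_of[value] for value in values]


col_a = [21, 9, 9, 8, 2, 2, 1, 1] + [0] * 12
col_b = [50, 23, 23, 21, 21, 5, 5, 4, 4, 3, 3, 2, 2] + [1] * 7

print(dense_rank(col_a))
print(dense_rank(col_b))
```

Because each column is ranked independently, the same helper can be applied to any number of columns, which mirrors defining one window spec per column.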
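I'll also pose an alternative (data is the input dataframe):
from pyspark.sql.functions import col
for cols in data.columns[2:]:
    # Build a lookup of distinct values and their position in descending order
    lookup = (data.select(cols)
              .distinct()
              .orderBy(cols, ascending=False)
              .rdd
              .zipWithIndex()
              .map(lambda x: x[0] + (x[1], ))
              .toDF([cols, cols+"_rank_lookup"]))
    name = cols + "_ranks"
    # Join the lookup back and shift the zero-based index to a one-based rank
    data = data.join(lookup, [cols]).withColumn(name,col(cols+"_rank_lookup")
                                                + 1).drop(cols + "_rank_lookup")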
Not as elegant as dense_rank() and I'm uncertain as to performance implications.

How to find the previous occurrence of a value 'a' before some value 'b'

I joined two data frames and got the resulting data frame below.
+---------+-----------+-----------+-------------------+---------+-------------------+
|a |b | c | d | e | f |
+---------+-----------+-----------+-------------------+---------+-------------------+
| 7| 2| 1|2015-04-12 23:59:01| null| null |
| 15| 2| 2|2015-04-12 23:59:02| | |
| 11| 2| 4|2015-04-12 23:59:03| null| null|
| 3| 2| 4|2015-04-12 23:59:04| null| null|
| 8| 2| 3|2015-04-12 23:59:05| {NORMAL} 2015-04-12 23:59:05|
| 16| 2| 3|2017-03-12 23:59:06| null| null|
| 5| 2| 3|2015-04-12 23:59:07| null| null|
| 18| 2| 3|2015-03-12 23:59:08| null| null|
| 17| 2| 1|2015-03-12 23:59:09| null| null|
| 6| 2| 1|2015-04-12 23:59:10| null| null|
| 19| 2| 3|2015-03-12 23:59:11| null| null|
| 9| 2| 3|2015-04-12 23:59:12| null| null|
| 1| 2| 2|2015-04-12 23:59:13| null| null|
| 1| 2| 2|2015-04-12 23:59:14| null| null|
| 1| 2| 2|2015-04-12 23:59:15| null| null|
| 10| 3| 2|2015-04-12 23:59:16| null| null|
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
| 12| 3| 1|2015-04-12 23:59:18| null| null|
| 13| 3| 1|2015-04-12 23:59:19| null| null|
| 14| 2| 1|2015-04-12 23:59:20| null| null|
+---------+-----------+-----------+-------------------+---------+-------------------+
Now I have to find the first occurring 1 before each 3 in column c. For example:
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
Before this record, I want to know the first 1 that occurred in column c, which is
| 17| 2| 1|2015-03-12 23:59:09| null| null|
Any help is appreciated
You can use the Spark window function lag:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
In the first step, filter your data on column "c", keeping only the rows whose value is 1 or 3. You will get data similar to:
dft.show()
+---+---+---+---+
| id| a| b| c|
+---+---+---+---+
| 1| 7| 2| 1|
| 2| 15| 2| 3|
| 3| 11| 2| 3|
| 4| 3| 2| 1|
| 5| 8| 2| 3|
+---+---+---+---+
Next, define the window
val w = Window.orderBy("id")
Once this is done, create a new column and put previous value in it
dft.withColumn("prev", lag("c",1).over(w)).show()
+---+---+---+---+----+
| id| a| b| c|prev|
+---+---+---+---+----+
| 1| 7| 2| 1|null|
| 2| 15| 2| 3| 1|
| 3| 11| 2| 3| 3|
| 4| 3| 2| 1| 3|
| 5| 8| 2| 3| 1|
+---+---+---+---+----+
Finally filter on the values of column "c" and "prev"
Note: Do combine the steps when you are writing final code, so as to apply filter directly.
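The lag-then-filter steps can be traced on the small dft example with a plain-Python sketch (not Spark code). with_prev is a hypothetical helper that emulates lag("c", 1) over a window ordered by id, and the final filter keeps the rows where c is 3 and the previous c is 1.

```python
# (id, a, b, c) rows from the filtered example above
rows = [(1, 7, 2, 1), (2, 15, 2, 3), (3, 11, 2, 3), (4, 3, 2, 1), (5, 8, 2, 3)]


def with_prev(rows):
    """Append the previous row's c value (None for the first row), ordered by id."""
    ordered = sorted(rows, key=lambda r: r[0])
    prev_values = [None] + [r[3] for r in ordered[:-1]]
    return [r + (p,) for r, p in zip(ordered, prev_values)]


lagged = with_prev(rows)
matches = [r for r in lagged if r[3] == 3 and r[4] == 1]
print(matches)
```

Note that this only looks one row back, so a 3 that directly follows another 3 (id 3 here) is not matched.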