Pyspark: Window function for stddev and quantiles produce NaN and Nulls - pyspark

Trying to compute the stddev and 25,75 quantiles but they produce NaN and Null values
# Window Time = 30min
window_time = 1800
# Stats fields for window
stat_fields = ['source_packets', 'destination_packets']
df = sqlContext.createDataFrame([('192.168.1.1','10.0.0.1',22,51000, 17, 1, "2017-03-10T15:27:18+00:00"),
('192.168.1.2','10.0.0.2',51000,22, 1,2, "2017-03-15T12:27:18+00:00"),
('192.168.1.2','10.0.0.2',53,51000, 2,3, "2017-03-15T12:28:18+00:00"),
('192.168.1.2','10.0.0.2',51000,53, 3,4, "2017-03-15T12:29:18+00:00"),
('192.168.1.3','10.0.0.3',80,51000, 4,5, "2017-03-15T12:28:18+00:00"),
('192.168.1.3','10.0.0.3',51000,80, 5,6, "2017-03-15T12:29:18+00:00"),
('192.168.1.3','10.0.0.3',22,51000, 25,7, "2017-03-18T11:27:18+00:00")],
["source_ip","destination_ip","source_port","destination_port", "source_packets", "destination_packets", "timestampGMT"])
def add_stats_column(r_df, field, window):
'''
Input:
r_df: dataframe
field: field to generate stats with
window: pyspark window to be used
'''
r_df = r_df \
.withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
.withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
.withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
.withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
.withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
.withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx('{}', 0.25)".format(field)).over(window))\
.withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx('{}', 0.75)".format(field)).over(window))
return r_df
w_s = (Window()
.partitionBy("ip")
.orderBy(F.col("timestamp"))
.rangeBetween(-window_time, 0))
df2 = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
.withColumn("arr",F.array(F.col("source_ip"),F.col("destination_ip")))\
.selectExpr("explode(arr) as ip","*")\
.drop(*['arr','source_ip','destination_ip'])
df2 = (reduce(partial(add_stats_column,window=w_s),
stat_fields,
df2
))
#print(df2.explain())
df2.show(100)
output
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| ip|source_port|destination_port|source_packets|destination_packets| timestampGMT| timestamp|source_packets_sum_30m|source_packets_avg_30m|source_packets_std_30m|source_packets_min_30m|source_packets_max_30m|source_packets_q25_30m|source_packets_q75_30m|destination_packets_sum_30m|destination_packets_avg_30m|destination_packets_std_30m|destination_packets_min_30m|destination_packets_max_30m|destination_packets_q25_30m|destination_packets_q75_30m|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
|192.168.1.3| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
|192.168.1.3| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
|192.168.1.3| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
| 10.0.0.1| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
| 10.0.0.2| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
| 10.0.0.2| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
| 10.0.0.2| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
| 10.0.0.3| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
| 10.0.0.3| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
| 10.0.0.3| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
|192.168.1.2| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
|192.168.1.2| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
|192.168.1.2| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
|192.168.1.1| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
+-----------+-----------+----------------+--------------+-------------------+--------------------+----------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+

from pyspark api doc, we can get that:
pyspark.sql.functions.stddev(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
pyspark.sql.functions.stddev_pop(col)
Aggregate function: returns population standard deviation of the expression in a group.
New in version 1.6.
pyspark.sql.functions.stddev_samp(col)
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
so, maybe you can try stddev_pop:population standard deviation other than stddev:unbiased sample standard deviation.
unbiased sample standard deviation cause divide by zero error (get NaN) when only one sample.

Related

Pyspark cumsum over same values in orderBy column

I have the following dataframe:
+----+----+-----+
|col1|col2|value|
+----+----+-----+
| 11| a| 1|
| 11| a| 2|
| 11| b| 3|
| 11| a| 4|
| 11| b| 5|
| 22| a| 6|
| 22| b| 7|
+----+----+-----+
I want to calculate to calculate the cumsum of the 'value' column that is partitioned by 'col1' and ordered by 'col2'.
This is the desired output:
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 1| 1|
| 11| a| 2| 3|
| 11| a| 4| 7|
| 11| b| 3| 10|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
I have used this code which gives me the df shown below. It is not what I wanted. Can someone help me please?
df.withColumn("cumsum", F.sum("value").over(Window.partitionBy("col1").orderBy("col2").rangeBetween(Window.unboundedPreceding, 0)))
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 2| 7|
| 11| a| 1| 7|
| 11| a| 4| 7|
| 11| b| 3| 15|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
You have to use .rowsBetween instead of .rangeBetween in your window clause.
rowsBetween (vs) rangeBetween
Example:
df.withColumn("cumsum", sum("value").over(Window.partitionBy("col1").orderBy("col2").rowsBetween(Window.unboundedPreceding, 0))).show()
#+----+----+-----+------+
#|col1|col2|value|cumsum|
#+----+----+-----+------+
#| 11| a| 1| 1|
#| 11| a| 2| 3|
#| 11| a| 4| 7|
#| 11| b| 3| 10|
#| 11| b| 5| 15|
#| 12| a| 6| 6|
#| 12| b| 7| 13|
#+----+----+-----+------+

Apache Spark visualization

I'm new to Apache Spark and trying to learn visualization in Apache Spark/Databricks at the moment. If I have the following csv datasets;
Patient.csv
+---+---------+------+---+-----------------+-----------+------------+-------------+
| Id|Post_Code|Height|Age|Health_Cover_Type|Temperature|Disease_Type|Infected_Date|
+---+---------+------+---+-----------------+-----------+------------+-------------+
| 1| 2096| 131| 22| 5| 37| 4| 891717742|
| 2| 2090| 136| 18| 5| 36| 1| 881250949|
| 3| 2004| 120| 9| 2| 36| 2| 878887136|
| 4| 2185| 155| 41| 1| 36| 1| 896029926|
| 5| 2195| 145| 25| 5| 37| 1| 887100886|
| 6| 2079| 172| 52| 2| 37| 5| 871205766|
| 7| 2006| 176| 27| 1| 37| 3| 879487476|
| 8| 2605| 129| 15| 5| 36| 1| 876343336|
| 9| 2017| 145| 19| 5| 37| 4| 897281846|
| 10| 2112| 171| 47| 5| 38| 6| 882539696|
| 11| 2112| 102| 8| 5| 36| 5| 873648586|
| 12| 2086| 151| 11| 1| 35| 1| 894724066|
| 13| 2142| 148| 22| 2| 37| 1| 889446276|
| 14| 2009| 158| 57| 5| 38| 2| 887072826|
| 15| 2103| 167| 34| 1| 37| 3| 892094506|
| 16| 2095| 168| 37| 5| 36| 1| 893400966|
| 17| 2010| 156| 20| 3| 38| 5| 897313586|
| 18| 2117| 143| 17| 5| 36| 2| 875238076|
| 19| 2204| 155| 24| 4| 38| 6| 884159506|
| 20| 2103| 138| 15| 5| 37| 4| 886765356|
+---+---------+------+---+-----------------+-----------+------------+-------------+
And coverType.csv
+--------------+-----------------+
|cover_type_key| cover_type_label|
+--------------+-----------------+
| 1| Single|
| 2| Couple|
| 3| Family|
| 4| Concession|
| 5| Disable|
+--------------+-----------------+
Which I've managed to load as DataFrames (Patient and coverType);
val PatientDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/Patient.csv")
.load()
val coverTypeDF=spark.read
.format("csv")
.option("header","true")
.option("inferSchema","true")
.option("nullValue","NA")
.option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss")
.option("mode","failfast")
.option("path","/spark-data/covertype.csv")
.load()
How do I generate a bar chart visualization to show the distribution of different Disease_Type in my dataset.
How do I generate a bar chart visualization to show the average Post_Code of each cover type with string labels for cover type.
How do I extract the year (YYYY) from the Infected_Date (represented in date (unix seconds since 1/1/1970 UTC)) ordering the result in decending order of the year and average age.
To display charts natively with Databricks you need to use the display function on a dataframe. For number one, we can accomplish what you'd like by aggregating the dataframe on disease type.
display(PatientDF.groupBy(Disease_Type).count())
Then you can use the charting options to build a bar chart, you can do the same for your 2nd question, but instead of .count() use .avg("Post_Code")
For the third question you need to use the year function after casting the timestamp to a date and an orderBy.
from pyspark.sql.functions import *
display(PatientDF.select(year(to_timestamp("Infected_Date")).alias("year")).orderBy("year"))

Spark Scala Window extend result until the end

I will expose my problem based on the initial dataframe and the one I want to achieve:
val df_997 = Seq [(Int, Int, Int, Int)]((1,1,7,10),(1,10,4,300),(1,3,14,50),(1,20,24,70),(1,30,12,90),(2,10,4,900),(2,25,30,40),(2,15,21,60),(2,5,10,80)).toDF("policyId","FECMVTO","aux","IND_DEF").orderBy(asc("policyId"), asc("FECMVTO"))
df_997.show
+--------+-------+---+-------+
|policyId|FECMVTO|aux|IND_DEF|
+--------+-------+---+-------+
| 1| 1| 7| 10|
| 1| 3| 14| 50|
| 1| 10| 4| 300|
| 1| 20| 24| 70|
| 1| 30| 12| 90|
| 2| 5| 10| 80|
| 2| 10| 4| 900|
| 2| 15| 21| 60|
| 2| 25| 30| 40|
+--------+-------+---+-------+
Imagine I have partitioned this DF by the column policyId and created the column row_num based on it to better see the Windows:
val win = Window.partitionBy("policyId").orderBy("FECMVTO")
val df_998 = df_997.withColumn("row_num",row_number().over(win))
df_998.show
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 10| 4| 300| 3|
| 1| 20| 24| 70| 4|
| 1| 30| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 10| 4| 900| 2|
| 2| 15| 21| 60| 3|
| 2| 25| 30| 40| 4|
+--------+-------+---+-------+-------+
Now, for each window, if the value of aux is 4, I want to set the value of IND_DEF column for that register to the column FEC_MVTO for this register on until the end of the window.
The resulting DF would be:
+--------+-------+---+-------+-------+
|policyId|FECMVTO|aux|IND_DEF|row_num|
+--------+-------+---+-------+-------+
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 300| 4| 300| 3|
| 1| 300| 24| 70| 4|
| 1| 300| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 900| 4| 900| 2|
| 2| 900| 21| 60| 3|
| 2| 900| 30| 40| 4|
+--------+-------+---+-------+-------+
Thanks for your suggestions as I am very stuck in here...
Here's one approach: First left-join the DataFrame with its aux == 4 filtered version, followed by applying Window function first to backfill nulls with the wanted IND_DEF values per partition, and finally conditionally recreate column FECMVTO:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1,1,7,10), (1,10,4,300), (1,3,14,50), (1,20,24,70), (1,30,12,90),
(2,10,4,900), (2,25,30,40), (2,15,21,60), (2,5,10,80)
).toDF("policyId","FECMVTO","aux","IND_DEF")
val win = Window.partitionBy("policyId").orderBy("FECMVTO").
rowsBetween(Window.unboundedPreceding, 0)
val df2 = df.
select($"policyId", $"aux", $"IND_DEF".as("IND_DEF2")).
where($"aux" === 4)
df.join(df2, Seq("policyId", "aux"), "left_outer").
withColumn("IND_DEF3", first($"IND_DEF2", ignoreNulls=true).over(win)).
withColumn("FECMVTO", coalesce($"IND_DEF3", $"FECMVTO")).
show
// +--------+---+-------+-------+--------+--------+
// |policyId|aux|FECMVTO|IND_DEF|IND_DEF2|IND_DEF3|
// +--------+---+-------+-------+--------+--------+
// | 1| 7| 1| 10| null| null|
// | 1| 14| 3| 50| null| null|
// | 1| 4| 300| 300| 300| 300|
// | 1| 24| 300| 70| null| 300|
// | 1| 12| 300| 90| null| 300|
// | 2| 10| 5| 80| null| null|
// | 2| 4| 900| 900| 900| 900|
// | 2| 21| 900| 60| null| 900|
// | 2| 30| 900| 40| null| 900|
// +--------+---+-------+-------+--------+--------+
Columns IND_DEF2, IND_DEF3 are kept only for illustration (and can certainly be dropped).
#I believe below can be solution for your issue
Considering input_df is your input dataframe
//Step#1 - Filter rows with IND_DEF = 4 from input_df
val only_FECMVTO_4_df1 = input_df.filter($"IND_DEF" === 4)
//Step#2 - Filling FECMVTO value from IND_DEF for the above result
val only_FECMVTO_4_df2 = only_FECMVTO_4_df1.withColumn("FECMVTO_NEW",$"IND_DEF").drop($"FECMVTO").withColumnRenamed("FECMVTO",$"FECMVTO_NEW")
//Step#3 - removing all the records from step#1 from input_df
val input_df_without_FECMVTO_4 = input_df.except(only_FECMVTO_4_df1)
//combining Step#2 output with output of Step#3
val final_df = input_df_without_FECMVTO_4.union(only_FECMVTO_4_df2)

Pyspark - Ranking columns keeping ties

I'm looking for a way to rank columns of a dataframe preserving ties. Specifically for this example, I have a pyspark dataframe as follows where I want to generate ranks for colA & colB (though I want to support being able to rank N number of columns)
+--------+----------+-----+----+
| Entity| id| colA|colB|
+-------------------+-----+----+
| a|8589934652| 21| 50|
| b| 112| 9| 23|
| c|8589934629| 9| 23|
| d|8589934702| 8| 21|
| e| 20| 2| 21|
| f|8589934657| 2| 5|
| g|8589934601| 1| 5|
| h|8589934653| 1| 4|
| i|8589934620| 0| 4|
| j|8589934643| 0| 3|
| k|8589934618| 0| 3|
| l|8589934602| 0| 2|
| m|8589934664| 0| 2|
| n| 25| 0| 1|
| o| 67| 0| 1|
| p|8589934642| 0| 1|
| q|8589934709| 0| 1|
| r|8589934660| 0| 1|
| s| 30| 0| 1|
| t| 55| 0| 1|
+--------+----------+-----+----+
What I'd like is a way to rank this dataframe where tied values receive the same rank such as:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+-------------------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 21| 2| 3|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+--------+----------+-----+----+---------+---------+
My current implementation with the first dataframe looks like:
def getRanks(mydf, cols=None, ascending=False):
from pyspark import Row
# This takes a dataframe and a list of columns to rank
# If no list is provided, it ranks *all* columns
# returns a new dataframe
def addRank(ranked_rdd, col, ascending):
# This assumes an RDD of the form (Row(...), list[...])
# it orders the rdd by col, finds the order, then adds that to the
# list
myrdd = ranked_rdd.sortBy(lambda (row, ranks): row[col],
ascending=ascending).zipWithIndex()
return myrdd.map(lambda ((row, ranks), index): (row, ranks +
[index+1]))
myrdd = mydf.rdd
fields = myrdd.first().__fields__
ranked_rdd = myrdd.map(lambda x: (x, []))
if (cols is None):
cols = fields
for col in cols:
ranked_rdd = addRank(ranked_rdd, col, ascending)
rank_names = [x + "_rank" for x in cols]
# Hack to make sure columns come back in the right order
ranked_rdd = ranked_rdd.map(lambda (row, ranks): Row(*row.__fields__ +
rank_names)(*row + tuple(ranks)))
return ranked_rdd.toDF()
which produces:
+--------+----------+-----+----+---------+---------+
| Entity| id| colA|colB|colA_rank|colB_rank|
+-------------------+-----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 3| 3|
| d|8589934702| 8| 21| 4| 4|
| e| 20| 2| 21| 5| 5|
| f|8589934657| 2| 5| 6| 6|
| g|8589934601| 1| 5| 7| 7|
| h|8589934653| 1| 4| 8| 8|
| i|8589934620| 0| 4| 9| 9|
| j|8589934643| 0| 3| 10| 10|
| k|8589934618| 0| 3| 11| 11|
| l|8589934602| 0| 2| 12| 12|
| m|8589934664| 0| 2| 13| 13|
| n| 25| 0| 1| 14| 14|
| o| 67| 0| 1| 15| 15|
| p|8589934642| 0| 1| 16| 16|
| q|8589934709| 0| 1| 17| 17|
| r|8589934660| 0| 1| 18| 18|
| s| 30| 0| 1| 19| 19|
| t| 55| 0| 1| 20| 20|
+--------+----------+-----+----+---------+---------+
As you can see, the function getRanks() takes a dataframe, specifies the columns to be ranked, sorts them, and uses zipWithIndex() to generate an ordering or rank. However, I can't figure out a way to preserve ties.
This stackoverflow post is the closest solution I've found:
rank-users-by-column But it appears to only handle 1 column (I think).
Thanks so much for the help in advance!
EDIT: column 'id' is generated from calling monotonically_increasing_id() and in my implementation is cast to a string.
You're looking for dense_rank
First let's create our dataframe:
df = spark.createDataFrame(sc.parallelize([["a",8589934652,21,50],["b",112,9,23],["c",8589934629,9,23],
["d",8589934702,8,21],["e",20,2,21],["f",8589934657,2,5],
["g",8589934601,1,5],["h",8589934653,1,4],["i",8589934620,0,4],
["j",8589934643,0,3],["k",8589934618,0,3],["l",8589934602,0,2],
["m",8589934664,0,2],["n",25,0,1],["o",67,0,1],["p",8589934642,0,1],
["q",8589934709,0,1],["r",8589934660,0,1],["s",30,0,1],["t",55,0,1]]
), ["Entity","id","colA","colB"])
We'll define two windowSpec:
from pyspark.sql import Window
import pyspark.sql.functions as psf
wA = Window.orderBy(psf.desc("colA"))
wB = Window.orderBy(psf.desc("colB"))
df = df.withColumn(
"colA_rank",
psf.dense_rank().over(wA)
).withColumn(
"colB_rank",
psf.dense_rank().over(wB)
)
+------+----------+----+----+---------+---------+
|Entity| id|colA|colB|colA_rank|colB_rank|
+------+----------+----+----+---------+---------+
| a|8589934652| 21| 50| 1| 1|
| b| 112| 9| 23| 2| 2|
| c|8589934629| 9| 23| 2| 2|
| d|8589934702| 8| 21| 3| 3|
| e| 20| 2| 21| 4| 3|
| f|8589934657| 2| 5| 4| 4|
| g|8589934601| 1| 5| 5| 4|
| h|8589934653| 1| 4| 5| 5|
| i|8589934620| 0| 4| 6| 5|
| j|8589934643| 0| 3| 6| 6|
| k|8589934618| 0| 3| 6| 6|
| l|8589934602| 0| 2| 6| 7|
| m|8589934664| 0| 2| 6| 7|
| n| 25| 0| 1| 6| 8|
| o| 67| 0| 1| 6| 8|
| p|8589934642| 0| 1| 6| 8|
| q|8589934709| 0| 1| 6| 8|
| r|8589934660| 0| 1| 6| 8|
| s| 30| 0| 1| 6| 8|
| t| 55| 0| 1| 6| 8|
+------+----------+----+----+---------+---------+
I'll also pose an alternative:
for cols in data.columns[2:]:
lookup = (data.select(cols)
.distinct()
.orderBy(cols, ascending=False)
.rdd
.zipWithIndex()
.map(lambda x: x[0] + (x[1], ))
.toDF([cols, cols+"_rank_lookup"]))
name = cols + "_ranks"
data = data.join(lookup, [cols]).withColumn(name,col(cols+"_rank_lookup")
+ 1).drop(cols + "_rank_lookup")
Not as elegant as dense_rank() and I'm uncertain as to performance implications.

How to find the previous occurence of a value 'a' before some value 'b'

I join two data frames and have the resulting data frame as below.Now I want to
+---------+-----------+-----------+-------------------+---------+-------------------+
|a |b | c | d | e | f |
+---------+-----------+-----------+-------------------+---------+-------------------+
| 7| 2| 1|2015-04-12 23:59:01| null| null |
| 15| 2| 2|2015-04-12 23:59:02| | |
| 11| 2| 4|2015-04-12 23:59:03| null| null|
| 3| 2| 4|2015-04-12 23:59:04| null| null|
| 8| 2| 3|2015-04-12 23:59:05| {NORMAL} 2015-04-12 23:59:05|
| 16| 2| 3|2017-03-12 23:59:06| null| null|
| 5| 2| 3|2015-04-12 23:59:07| null| null|
| 18| 2| 3|2015-03-12 23:59:08| null| null|
| 17| 2| 1|2015-03-12 23:59:09| null| null|
| 6| 2| 1|2015-04-12 23:59:10| null| null|
| 19| 2| 3|2015-03-12 23:59:11| null| null|
| 9| 2| 3|2015-04-12 23:59:12| null| null|
| 1| 2| 2|2015-04-12 23:59:13| null| null|
| 1| 2| 2|2015-04-12 23:59:14| null| null|
| 1| 2| 2|2015-04-12 23:59:15| null| null|
| 10| 3| 2|2015-04-12 23:59:16| null| null|
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
| 12| 3| 1|2015-04-12 23:59:18| null| null|
| 13| 3| 1|2015-04-12 23:59:19| null| null|
| 14| 2| 1|2015-04-12 23:59:20| null| null|
+---------+-----------+-----------+-------------------+---------+-------------------+
Now I have to find the first occuring 1 before each 3 in column c .For example
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
Before this record I want to know the first occured 1 in column c which is
| 17| 2| 1|2015-03-12 23:59:09| null| null|
Any help is appreciated
You can use Spark window function lag import org.apache.spark.sql.expressions.Window
In first step you filter your data on the column "c" based on value as either 1 or 3. You will get data similar to
dft.show()
+---+---+---+---+
| id| a| b| c|
+---+---+---+---+
| 1| 7| 2| 1|
| 2| 15| 2| 3|
| 3| 11| 2| 3|
| 4| 3| 2| 1|
| 5| 8| 2| 3|
+---+---+---+---+
Next, define the window
val w = Window.orderBy("id")
Once this is done, create a new column and put previous value in it
dft.withColumn("prev", lag("c",1).over(w)).show()
+---+---+---+---+----+
| id| a| b| c|prev|
+---+---+---+---+----+
| 1| 7| 2| 1|null|
| 2| 15| 2| 3| 1|
| 3| 11| 2| 3| 3|
| 4| 3| 2| 1| 3|
| 5| 8| 2| 3| 1|
+---+---+---+---+----+
Finally filter on the values of column "c" and "prev"
Note: Do combine the steps when you are writing final code, so as to apply filter directly.