I want to run a calculation on only a specified subset of a dataframe by creating a window that can include a given Date:
window_row = Window.partitionBy('I1','Id2')
df=df.withColumn('max_Date', when((col('Date')<=target_date),max('Date').over(window_row)))
df=df.withColumn('cum_sum', when((col('Date')==col('max_Date')),sum('Sale').over(window_row)))
When target_date= '2020-01-01', I get the following output:
|I1| Id2| Date| Sale| max_Date|cum_sum|
|AA| B0| 2019-07-01| 1|2020-12-01| null|
|AA| B0| 2020-01-01| 23|2020-12-01| null|
|AA| B0| 2020-01-01| 2|2020-01-01| null|
|AA| B0| 2020-02-01| 0| null| null|
|AA| B0| 2020-12-01| 116| null| null|
|BB| C0| 2019-03-01| 1|2020-03-01| null|
|BB| C0| 2019-05-01| 26|2020-03-01| null|
|BB| C0| 2020-03-01| 1| null| null|
|CC| B0| 2019-03-01| 8|2019-04-01| null|
|CC| B0| 2019-04-01| 1|2019-04-01| 1|
However, the desired output is:
|I1| Id2| Date| Sale| max_Date|cum_sum|
|AA| B0| 2019-07-01| 1|2020-01-01| null|
|AA| B0| 2020-01-01| 23|2020-01-01| 25|
|AA| B0| 2020-01-01| 2|2020-01-01| 25|
|AA| B0| 2020-02-01| 0| null| null|
|AA| B0| 2020-12-01| 116| null| null|
|BB| C0| 2019-03-01| 1|2019-05-01| null|
|BB| C0| 2019-05-01| 26|2019-05-01| 26|
|BB| C0| 2020-03-01| 1| null| null|
|CC| B0| 2019-03-01| 8|2019-04-01| null|
|CC| B0| 2019-04-01| 1|2019-04-01| 1|
How do I implement this in an efficient way?
Here is a smaller example that may point you in the right direction for windowing. It shows a window with sum, max, and cumulative sum operations. I was having some trouble following the data and desired output above, so I created a smaller sample of data. Hope this helps.
# create data with pandas
import pandas as pd
data = {
    "DATE": ["2020", "2019", "2018", "2020", "2019", "2018"],
    "VALUE": [1, 3, 4, 5, 9, 6]
}
pd_df = pd.DataFrame(data)
# make it a spark dataframe
spark_df = spark.createDataFrame(pd_df)
Here is the input data frame (columns DATE and VALUE):
|2020| 1|
|2019| 3|
|2018| 4|
|2020| 5|
|2019| 9|
|2018| 6|
# perform window operations
from pyspark.sql.window import Window
from pyspark.sql.functions import max, sum
w = Window.partitionBy("DATE")
w_preceding = Window.partitionBy("DATE").orderBy('VALUE').rangeBetween(Window.unboundedPreceding, 0)
spark_df = spark_df.withColumn("MAX_WINDOW", max("VALUE").over(w))
spark_df = spark_df.withColumn("SUM_WINDOW", sum("VALUE").over(w))
spark_df = spark_df.withColumn("CUM_SUM_WINDOW", sum("VALUE").over(w_preceding))
Here is the final result data frame (columns DATE, VALUE, MAX_WINDOW, SUM_WINDOW, CUM_SUM_WINDOW):
|2020| 1| 5| 6| 1|
|2020| 5| 5| 6| 6|
|2019| 3| 9| 12| 3|
|2019| 9| 9| 12| 12|
|2018| 4| 6| 10| 4|
|2018| 6| 6| 10| 10|
Let me know if this helps.
This might be helpful:
>>> w = Window.partitionBy("date")
>>> w_preceding = Window.partitionBy("I1","ID2").orderBy('DATE').rangeBetween(Window.unboundedPreceding, 0)
>>> df=df.withColumn('max_Date', when((col('Date')<=target_date),max('Date').over(w)))
>>> df=df.withColumn('cum_sum', when((col('Date')==col('max_Date')),sum('Sale').over(w_preceding)))
>>> df.show()
| Date| I1|Id2|Sale| max_Date|cum_sum|
|2019-05-01| BB| C0| 26|2019-05-01| 26.0|
|2020-03-01| BB| C0| 1|2020-03-01| 27.0|
|2019-07-01| AA| B0| 1|2019-07-01| 2.0|
|2019-07-01| AA| B0| 1|2019-07-01| 2.0|
|2020-02-01| AA| B0| 0|2020-02-01| 2.0|
|2020-09-01| AA| B0| 0| null| null|
|2019-04-01| CC| B0| 1|2019-04-01| 1.0|
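One way to reach the desired output from the question is to compute the latest eligible date per group first, and then sum Sale only over the rows that fall on that date. The sketch below is an illustration under assumptions, not the code of either answer: the helper names `add_max_date_and_sum` and `reference_rows` are hypothetical, and `Date` is assumed to hold ISO-formatted values. The plain-Python `reference_rows` only mirrors the same rule so expected values can be checked on a small sample without a cluster.

```python
from collections import defaultdict


def add_max_date_and_sum(df, target_date):
    """PySpark sketch (hypothetical helper): adds max_Date and cum_sum per (I1, Id2)."""
    # Imported lazily so the plain-Python reference below runs without pyspark installed.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    w = Window.partitionBy("I1", "Id2")
    eligible = F.col("Date") <= target_date
    # Latest Date at or before target_date in each partition
    # (later rows become null inside when() and are skipped by max).
    df = df.withColumn("_latest", F.max(F.when(eligible, F.col("Date"))).over(w))
    # Only rows at or before target_date get a max_Date.
    df = df.withColumn("max_Date", F.when(eligible, F.col("_latest")))
    # Sum Sale over the rows that fall exactly on the latest eligible date.
    sale_on_latest = F.sum(
        F.when(F.col("Date") == F.col("_latest"), F.col("Sale"))
    ).over(w)
    df = df.withColumn(
        "cum_sum", F.when(F.col("Date") == F.col("max_Date"), sale_on_latest)
    )
    return df.drop("_latest")


def reference_rows(rows, target_date):
    """Plain-Python model of the same rule; rows are (I1, Id2, Date, Sale) tuples."""
    latest = {}
    for i1, id2, date, _ in rows:
        if date <= target_date:
            key = (i1, id2)
            latest[key] = max(latest.get(key, date), date)
    totals = defaultdict(int)
    for i1, id2, date, sale in rows:
        if latest.get((i1, id2)) == date:
            totals[(i1, id2)] += sale
    out = []
    for i1, id2, date, sale in rows:
        key = (i1, id2)
        max_date = latest.get(key) if date <= target_date else None
        cum_sum = totals[key] if max_date is not None and date == max_date else None
        out.append((i1, id2, date, sale, max_date, cum_sum))
    return out
```

The intermediate `_latest` column is materialized first because a window expression cannot be nested inside another windowed aggregate; it is dropped again before returning.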
I am trying to compute the stddev and the 25th and 75th percentiles over a window, but they produce NaN and null values:
# Window Time = 30min
window_time = 1800
# Stats fields for window
stat_fields = ['source_packets', 'destination_packets']
df = sqlContext.createDataFrame([('','',22,51000, 17, 1, "2017-03-10T15:27:18+00:00"),
('','',51000,22, 1,2, "2017-03-15T12:27:18+00:00"),
('','',53,51000, 2,3, "2017-03-15T12:28:18+00:00"),
('','',51000,53, 3,4, "2017-03-15T12:29:18+00:00"),
('','',80,51000, 4,5, "2017-03-15T12:28:18+00:00"),
('','',51000,80, 5,6, "2017-03-15T12:29:18+00:00"),
('','',22,51000, 25,7, "2017-03-18T11:27:18+00:00")],
["source_ip","destination_ip","source_port","destination_port", "source_packets", "destination_packets", "timestampGMT"])
def add_stats_column(r_df, field, window):
    """
    r_df: dataframe
    field: field to generate stats with
    window: pyspark window to be used
    """
    r_df = r_df \
        .withColumn('{}_sum_30m'.format(field), F.sum(field).over(window))\
        .withColumn('{}_avg_30m'.format(field), F.avg(field).over(window))\
        .withColumn('{}_std_30m'.format(field), F.stddev(field).over(window))\
        .withColumn('{}_min_30m'.format(field), F.min(field).over(window))\
        .withColumn('{}_max_30m'.format(field), F.max(field).over(window))\
        .withColumn('{}_q25_30m'.format(field), F.expr("percentile_approx('{}', 0.25)".format(field)).over(window))\
        .withColumn('{}_q75_30m'.format(field), F.expr("percentile_approx('{}', 0.75)".format(field)).over(window))
    return r_df
w_s = (Window()
.rangeBetween(-window_time, 0))
df2 = df.withColumn("timestamp", F.unix_timestamp(F.to_timestamp("timestampGMT"))) \
.selectExpr("explode(arr) as ip","*")\
df2 = (reduce(partial(add_stats_column,window=w_s),
| ip|source_port|destination_port|source_packets|destination_packets| timestampGMT| timestamp|source_packets_sum_30m|source_packets_avg_30m|source_packets_std_30m|source_packets_min_30m|source_packets_max_30m|source_packets_q25_30m|source_packets_q75_30m|destination_packets_sum_30m|destination_packets_avg_30m|destination_packets_std_30m|destination_packets_min_30m|destination_packets_max_30m|destination_packets_q25_30m|destination_packets_q75_30m|
|| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
|| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
|| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
|| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
|| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
|| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
|| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
|| 80| 51000| 4| 5|2017-03-15T12:28:...|1489580898| 4| 4.0| NaN| 4| 4| null| null| 5| 5.0| NaN| 5| 5| null| null|
|| 51000| 80| 5| 6|2017-03-15T12:29:...|1489580958| 9| 4.5| 0.7071067811865476| 4| 5| null| null| 11| 5.5| 0.7071067811865476| 5| 6| null| null|
|| 22| 51000| 25| 7|2017-03-18T11:27:...|1489836438| 25| 25.0| NaN| 25| 25| null| null| 7| 7.0| NaN| 7| 7| null| null|
|| 51000| 22| 1| 2|2017-03-15T12:27:...|1489580838| 1| 1.0| NaN| 1| 1| null| null| 2| 2.0| NaN| 2| 2| null| null|
|| 53| 51000| 2| 3|2017-03-15T12:28:...|1489580898| 3| 1.5| 0.7071067811865476| 1| 2| null| null| 5| 2.5| 0.7071067811865476| 2| 3| null| null|
|| 51000| 53| 3| 4|2017-03-15T12:29:...|1489580958| 6| 2.0| 1.0| 1| 3| null| null| 9| 3.0| 1.0| 2| 4| null| null|
|| 22| 51000| 17| 1|2017-03-10T15:27:...|1489159638| 17| 17.0| NaN| 17| 17| null| null| 1| 1.0| NaN| 1| 1| null| null|
From the PySpark API docs:
stddev: Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
stddev_pop: Aggregate function: returns population standard deviation of the expression in a group.
New in version 1.6.
stddev_samp: Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
New in version 1.6.
So you could try stddev_pop (population standard deviation) instead of stddev (unbiased sample standard deviation).
The unbiased sample standard deviation divides by n - 1, so when the window holds only one sample it divides by zero and you get NaN.
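The difference between the two estimators can be checked by hand. The following is a minimal plain-Python sketch, not Spark code; `population_std` and `sample_std` are hypothetical stand-ins for what stddev_pop and stddev compute over one window, and the explicit NaN branch is an assumption that mirrors the NaN shown in the question's output.

```python
import math


def population_std(values):
    """Population standard deviation: divide the squared deviations by n."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def sample_std(values):
    """Unbiased sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared = sum((v - mean) ** 2 for v in values)
    if n == 1:
        # n - 1 is zero here, so the result is undefined. Returning NaN mirrors the
        # question's output; plain Python division would raise ZeroDivisionError.
        return float("nan")
    return math.sqrt(squared / (n - 1))


single = [4]        # a window that contains one row only
pair = [4, 5]       # a window that contains two rows
print(sample_std(single), population_std(single))
print(sample_std(pair), population_std(pair))
```

For the two-row window the sample estimate matches the 0.7071067811865476 that appears in the question's table, while the one-row window is only defined for the population version.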
I'm new to Apache Spark and trying to learn visualization in Apache Spark/Databricks at the moment. If I have the following csv datasets;
| Id|Post_Code|Height|Age|Health_Cover_Type|Temperature|Disease_Type|Infected_Date|
| 1| 2096| 131| 22| 5| 37| 4| 891717742|
| 2| 2090| 136| 18| 5| 36| 1| 881250949|
| 3| 2004| 120| 9| 2| 36| 2| 878887136|
| 4| 2185| 155| 41| 1| 36| 1| 896029926|
| 5| 2195| 145| 25| 5| 37| 1| 887100886|
| 6| 2079| 172| 52| 2| 37| 5| 871205766|
| 7| 2006| 176| 27| 1| 37| 3| 879487476|
| 8| 2605| 129| 15| 5| 36| 1| 876343336|
| 9| 2017| 145| 19| 5| 37| 4| 897281846|
| 10| 2112| 171| 47| 5| 38| 6| 882539696|
| 11| 2112| 102| 8| 5| 36| 5| 873648586|
| 12| 2086| 151| 11| 1| 35| 1| 894724066|
| 13| 2142| 148| 22| 2| 37| 1| 889446276|
| 14| 2009| 158| 57| 5| 38| 2| 887072826|
| 15| 2103| 167| 34| 1| 37| 3| 892094506|
| 16| 2095| 168| 37| 5| 36| 1| 893400966|
| 17| 2010| 156| 20| 3| 38| 5| 897313586|
| 18| 2117| 143| 17| 5| 36| 2| 875238076|
| 19| 2204| 155| 24| 4| 38| 6| 884159506|
| 20| 2103| 138| 15| 5| 37| 4| 886765356|
And coverType.csv
|cover_type_key| cover_type_label|
| 1| Single|
| 2| Couple|
| 3| Family|
| 4| Concession|
| 5| Disable|
Which I've managed to load as DataFrames (Patient and coverType);
val PatientDF=spark.read
val coverTypeDF=spark.read
How do I generate a bar chart visualization to show the distribution of the different Disease_Type values in my dataset?
How do I generate a bar chart visualization to show the average Post_Code of each cover type, with string labels for the cover type?
How do I extract the year (YYYY) from Infected_Date (stored as Unix seconds since 1/1/1970 UTC), ordering the result in descending order of year and average age?
To display charts natively in Databricks, call the display function on a DataFrame. For the first question, you can get what you want by aggregating the DataFrame on disease type and counting the rows in each group.
Then you can use the charting options to build a bar chart. You can do the same for your second question, but use .avg("Post_Code") instead of .count().
For the third question, use the year function after casting the timestamp to a date, followed by an orderBy.
from pyspark.sql.functions import *
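The three steps described above might look like the sketch below. It is an illustration under assumptions rather than the answer's original code: the helper names are hypothetical, the DataFrames are assumed to carry the column names shown in the question, and `Infected_Date` is assumed to be castable to a long. `year_from_unix_seconds` is a plain-Python check of the same conversion in UTC; Spark's own result depends on the session time zone.

```python
from datetime import datetime, timezone


def disease_counts(patient_df):
    """Question 1: one row per Disease_Type with its count (input for a bar chart)."""
    return patient_df.groupBy("Disease_Type").count()


def avg_post_code_by_cover(patient_df, cover_type_df):
    """Question 2: average Post_Code per cover type, keyed by the string label."""
    # Imported lazily so the plain-Python helper below runs without pyspark installed.
    from pyspark.sql import functions as F

    joined = patient_df.join(
        cover_type_df,
        patient_df["Health_Cover_Type"] == cover_type_df["cover_type_key"],
    )
    return joined.groupBy("cover_type_label").agg(
        F.avg("Post_Code").alias("avg_post_code")
    )


def avg_age_by_year(patient_df):
    """Question 3: average Age per infection year, newest year first."""
    from pyspark.sql import functions as F

    # Epoch seconds -> timestamp -> date, so that year() can extract YYYY.
    infected = F.col("Infected_Date").cast("long").cast("timestamp").cast("date")
    return (
        patient_df.withColumn("year", F.year(infected))
        .groupBy("year")
        .agg(F.avg("Age").alias("avg_age"))
        .orderBy(F.desc("year"), F.desc("avg_age"))
    )


def year_from_unix_seconds(seconds):
    """Plain-Python check of the same conversion, evaluated in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).year
```

In a Databricks notebook, passing any of the returned DataFrames to display, for example `display(disease_counts(PatientDF))`, shows the result grid from which the bar chart can be selected.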
I will explain my problem using the initial DataFrame and the one I want to achieve:
val df_997 = Seq [(Int, Int, Int, Int)]((1,1,7,10),(1,10,4,300),(1,3,14,50),(1,20,24,70),(1,30,12,90),(2,10,4,900),(2,25,30,40),(2,15,21,60),(2,5,10,80)).toDF("policyId","FECMVTO","aux","IND_DEF").orderBy(asc("policyId"), asc("FECMVTO"))
| 1| 1| 7| 10|
| 1| 3| 14| 50|
| 1| 10| 4| 300|
| 1| 20| 24| 70|
| 1| 30| 12| 90|
| 2| 5| 10| 80|
| 2| 10| 4| 900|
| 2| 15| 21| 60|
| 2| 25| 30| 40|
Imagine I have partitioned this DF by the column policyId and created the column row_num based on it to better see the Windows:
val win = Window.partitionBy("policyId").orderBy("FECMVTO")
val df_998 = df_997.withColumn("row_num",row_number().over(win))
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 10| 4| 300| 3|
| 1| 20| 24| 70| 4|
| 1| 30| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 10| 4| 900| 2|
| 2| 15| 21| 60| 3|
| 2| 25| 30| 40| 4|
Now, for each window, if the value of aux is 4, I want to copy that row's IND_DEF value into the FECMVTO column for that row and for every following row until the end of the window.
The resulting DF would be:
| 1| 1| 7| 10| 1|
| 1| 3| 14| 50| 2|
| 1| 300| 4| 300| 3|
| 1| 300| 24| 70| 4|
| 1| 300| 12| 90| 5|
| 2| 5| 10| 80| 1|
| 2| 900| 4| 900| 2|
| 2| 900| 21| 60| 3|
| 2| 900| 30| 40| 4|
Thanks for your suggestions, as I am very stuck here...
Here's one approach: first left-join the DataFrame with its aux == 4 filtered version, then apply a Window function to fill the nulls with the wanted IND_DEF values per partition, and finally conditionally recreate column FECMVTO:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (1,1,7,10), (1,10,4,300), (1,3,14,50), (1,20,24,70), (1,30,12,90),
  (2,10,4,900), (2,25,30,40), (2,15,21,60), (2,5,10,80)
).toDF("policyId", "FECMVTO", "aux", "IND_DEF")
val win = Window.partitionBy("policyId").orderBy("FECMVTO").
  rowsBetween(Window.unboundedPreceding, 0)
val df2 = df.
  select($"policyId", $"aux", $"IND_DEF".as("IND_DEF2")).
  where($"aux" === 4)
df.join(df2, Seq("policyId", "aux"), "left_outer").
  withColumn("IND_DEF3", first($"IND_DEF2", ignoreNulls=true).over(win)).
  withColumn("FECMVTO", coalesce($"IND_DEF3", $"FECMVTO")).
  show
// +--------+---+-------+-------+--------+--------+
// |policyId|aux|FECMVTO|IND_DEF|IND_DEF2|IND_DEF3|
// +--------+---+-------+-------+--------+--------+
// | 1| 7| 1| 10| null| null|
// | 1| 14| 3| 50| null| null|
// | 1| 4| 300| 300| 300| 300|
// | 1| 24| 300| 70| null| 300|
// | 1| 12| 300| 90| null| 300|
// | 2| 10| 5| 80| null| null|
// | 2| 4| 900| 900| 900| 900|
// | 2| 21| 900| 60| null| 900|
// | 2| 30| 900| 40| null| 900|
// +--------+---+-------+-------+--------+--------+
Columns IND_DEF2, IND_DEF3 are kept only for illustration (and can certainly be dropped).
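The same carry-forward idea can also be sketched without the join, by turning IND_DEF into a marker that is non-null only on aux == 4 rows. The block below is a PySpark rendering under assumptions, not the answer's Scala code: `fill_fecmvto` and `reference_fill` are hypothetical names, and the plain-Python `reference_fill` only models the rule so it can be checked against the question's sample.

```python
def fill_fecmvto(df):
    """PySpark sketch (hypothetical helper): carry IND_DEF forward from the aux == 4 row."""
    # Imported lazily so the plain-Python reference below runs without pyspark installed.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    win = (
        Window.partitionBy("policyId")
        .orderBy("FECMVTO")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
    # IND_DEF on aux == 4 rows, null elsewhere.
    marker = F.when(F.col("aux") == 4, F.col("IND_DEF"))
    # The first non-null marker seen so far in the partition, or null before any aux == 4 row.
    carried = F.first(marker, ignorenulls=True).over(win)
    return df.withColumn("FECMVTO", F.coalesce(carried, F.col("FECMVTO")))


def reference_fill(rows):
    """Plain-Python model; rows are (policyId, FECMVTO, aux, IND_DEF) tuples."""
    out = []
    carried = {}
    for policy, fecmvto, aux, ind_def in sorted(rows, key=lambda r: (r[0], r[1])):
        if aux == 4 and policy not in carried:
            carried[policy] = ind_def
        out.append((policy, carried.get(policy, fecmvto), aux, ind_def))
    return out


sample = [
    (1, 1, 7, 10), (1, 10, 4, 300), (1, 3, 14, 50), (1, 20, 24, 70), (1, 30, 12, 90),
    (2, 10, 4, 900), (2, 25, 30, 40), (2, 15, 21, 60), (2, 5, 10, 80),
]
for row in reference_fill(sample):
    print(row)
```

Skipping the join avoids a shuffle for the filtered copy, at the cost of the intermediate IND_DEF2 and IND_DEF3 columns no longer being visible for inspection.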
I believe the following can solve your issue.
Assume input_df is your input DataFrame.
//Step#1 - Filter rows with aux = 4 from input_df
val only_FECMVTO_4_df1 = input_df.filter($"aux" === 4)
//Step#2 - Fill the FECMVTO value from IND_DEF for the above result
val only_FECMVTO_4_df2 = only_FECMVTO_4_df1.withColumn("FECMVTO_NEW",$"IND_DEF").drop("FECMVTO").withColumnRenamed("FECMVTO_NEW","FECMVTO")
//Step#3 - Remove all the records of Step#1 from input_df
val input_df_without_FECMVTO_4 = input_df.except(only_FECMVTO_4_df1)
//Combine the Step#2 output with the Step#3 output (unionByName, Spark 2.3+, matches columns by name because Step#2 changed the column order)
val final_df = input_df_without_FECMVTO_4.unionByName(only_FECMVTO_4_df2)
I have two data frames.
Data frame 1 looks like this:
| OrgId|ItemId|segmentId|Sequence|Action|
|4295877341| 136| 4| 1| I|!||
|4295877346| 136| 4| 1| I|!||
|4295877341| 138| 2| 1| I|!||
|4295877341| 141| 4| 1| I|!||
|4295877341| 143| 2| 1| I|!||
|4295877341| 145| 14| 1| I|!||
| 123456789| 145| 14| 1| I|!||
| 809580109| 145| 9| 9| I|!||
Data frame 2 looks like this:
| OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
| 123456789| 145| 14| 1| D|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
| 809580109| 145| NULL| NULL| I|!||
Now I have to join both data frames and update the columns of data frame 1 with the matching records from data frame 2.
The key in both data frames is OrgId and ItemId.
So the expected output should be:
| OrgId|ItemId|segmentId|Sequence|Action|
|4295877346| 136| 4| 1| I|!||
|4295877341| 145| 14| 1| I|!||
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
| 809580109| 145| 9| 9| I|!||
So I need to update data frame 1 with the records from data frame 2.
If a record in data frame 1 is not found in data frame 2, it must be retained as well.
If any new records are found in data frame 2, they need to be added to the output.
Here is what I am doing:
val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
.select($"OrgId", $"ItemId",$"segmentId_1",$"Sequence_1",$"Action_1")
But I am getting the output below:
| OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
I am not getting the 4295877346| 136| 4| 1| I|!| record from data frame 1.
left_outer gives me the output below:
| OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
|4295877341| 136| null| null| I|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
Let me first explain what your mistake is.
If you only join as below
val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
you will get
| OrgId|ItemId|segmentId|Sequence|Action|segmentId_1|Sequence_1|Action_1|
|4295877346| 136| 4| 1| I|!|| null| null| null|
|4295877341| 145| 14| 1| I|!|| null| null| null|
|4295877343| 149| null| null| null| 15| 2| I|!||
|4295877341| 136| 4| 1| I|!|| null| null| I|!||
| 123456789| 145| 14| 1| I|!|| 14| 1| D|!||
|4295877341| 138| 2| 1| I|!|| 11| 22| I|!||
|4295877341| 141| 4| 1| I|!|| 10| 1| I|!||
|4295877341| 143| 2| 1| I|!|| 1| 1| I|!||
It is evident that the filter in your code also drops the rows where the Action_1 column is null.
So the fix is to replace the null values produced by the join with the valid data from the other table, wherever that data is present.
val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "outer")
.withColumn("segmentId_1", when($"segmentId_1".isNotNull, $"segmentId_1").otherwise($"segmentId"))
.withColumn("Sequence_1", when($"Sequence_1".isNotNull, $"Sequence_1").otherwise($"Sequence"))
.withColumn("Action_1", when($"Action_1".isNotNull, $"Action_1").otherwise($"Action"))
.select($"OrgId", $"ItemId",$"segmentId_1",$"Sequence_1",$"Action_1")
.filter(!$"Action_1".contains("D") )
You should get the desired output:
| OrgId|ItemId|segmentId_1|Sequence_1|Action_1|
|4295877346| 136| 4| 1| I|!||
|4295877341| 145| 14| 1| I|!||
|4295877343| 149| 15| 2| I|!||
|4295877341| 136| null| null| I|!||
|4295877341| 138| 11| 22| I|!||
|4295877341| 141| 10| 1| I|!||
|4295877341| 143| 1| 1| I|!||
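The join-then-fill rule used in this answer can be written more compactly with coalesce. The sketch below is a PySpark rendering under assumptions, not the answer's Scala code: `upsert` and `reference_upsert` are hypothetical names, and the sample keys and values in the usage lines are made up for illustration. Note that with this rule a null in data frame 2 falls back to the value from data frame 1.

```python
def upsert(df1, df2):
    """PySpark sketch (hypothetical helper): full outer join, prefer df2 values, drop 'D' rows."""
    # Imported lazily so the plain-Python reference below runs without pyspark installed.
    from pyspark.sql import functions as F

    joined = df1.join(df2, ["OrgId", "ItemId"], "outer")
    return joined.select(
        "OrgId",
        "ItemId",
        # coalesce returns the first non-null argument, so df2 wins when it has a value.
        F.coalesce(F.col("segmentId_1"), F.col("segmentId")).alias("segmentId"),
        F.coalesce(F.col("Sequence_1"), F.col("Sequence")).alias("Sequence"),
        F.coalesce(F.col("Action_1"), F.col("Action")).alias("Action"),
    ).filter(~F.col("Action").contains("D"))


def reference_upsert(left, right):
    """Plain-Python model; both inputs map (OrgId, ItemId) -> (segmentId, Sequence, Action)."""
    merged = {}
    keys = list(left) + [k for k in right if k not in left]
    for key in keys:
        old = left.get(key, (None, None, None))
        new = right.get(key, (None, None, None))
        row = tuple(n if n is not None else o for n, o in zip(new, old))
        # Mirror the Spark filter: a null Action or one containing "D" is dropped.
        if row[2] is not None and "D" not in row[2]:
            merged[key] = row
    return merged


# Made-up sample: one update, one null fallback, one delete, one new record.
left = {(1, 136): (4, 1, "I"), (2, 136): (4, 1, "I"), (3, 145): (14, 1, "I")}
right = {
    (1, 136): (11, 22, "I"),
    (2, 136): (None, None, "I"),
    (3, 145): (14, 1, "D"),
    (4, 149): (15, 2, "I"),
}
print(reference_upsert(left, right))
```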
Try left-outer instead of outer:
val df3 = df1.join(df2, Seq("OrgId", "ItemId"), "left_outer")
.select($"OrgId", $"ItemId",$"segmentId_1",$"Sequence_1",$"Action_1")
A left outer join should retain all unmatched rows from the left side.
A nice tutorial here.
I joined two data frames and have the resulting data frame below.
|a |b | c | d | e | f |
| 7| 2| 1|2015-04-12 23:59:01| null| null |
| 15| 2| 2|2015-04-12 23:59:02| | |
| 11| 2| 4|2015-04-12 23:59:03| null| null|
| 3| 2| 4|2015-04-12 23:59:04| null| null|
| 8| 2| 3|2015-04-12 23:59:05| {NORMAL}|2015-04-12 23:59:05|
| 16| 2| 3|2017-03-12 23:59:06| null| null|
| 5| 2| 3|2015-04-12 23:59:07| null| null|
| 18| 2| 3|2015-03-12 23:59:08| null| null|
| 17| 2| 1|2015-03-12 23:59:09| null| null|
| 6| 2| 1|2015-04-12 23:59:10| null| null|
| 19| 2| 3|2015-03-12 23:59:11| null| null|
| 9| 2| 3|2015-04-12 23:59:12| null| null|
| 1| 2| 2|2015-04-12 23:59:13| null| null|
| 1| 2| 2|2015-04-12 23:59:14| null| null|
| 1| 2| 2|2015-04-12 23:59:15| null| null|
| 10| 3| 2|2015-04-12 23:59:16| null| null|
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
| 12| 3| 1|2015-04-12 23:59:18| null| null|
| 13| 3| 1|2015-04-12 23:59:19| null| null|
| 14| 2| 1|2015-04-12 23:59:20| null| null|
Now I have to find the first occurring 1 before each 3 in column c. For example:
| 4| 2| 3|2015-04-12 23:59:17| {NORMAL}|2015-04-12 23:59:17|
Before this record I want to know the first occurrence of 1 in column c, which is
| 17| 2| 1|2015-03-12 23:59:09| null| null|
Any help is appreciated
You can use the Spark window function lag:
import org.apache.spark.sql.expressions.Window
In the first step, filter your data on column "c", keeping only the rows whose value is either 1 or 3. You will get data similar to
| id| a| b| c|
| 1| 7| 2| 1|
| 2| 15| 2| 3|
| 3| 11| 2| 3|
| 4| 3| 2| 1|
| 5| 8| 2| 3|
Next, define the window
val w = Window.orderBy("id")
Once this is done, create a new column and put the previous value of "c" in it
dft.withColumn("prev", lag("c",1).over(w)).show()
| id| a| b| c|prev|
| 1| 7| 2| 1|null|
| 2| 15| 2| 3| 1|
| 3| 11| 2| 3| 3|
| 4| 3| 2| 1| 3|
| 5| 8| 2| 3| 1|
Finally filter on the values of column "c" and "prev"
Note: Do combine the steps when you are writing final code, so as to apply filter directly.
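Combined into one expression, the filter, lag, and final filter steps might look like the sketch below. It is a PySpark rendering under assumptions, not the answer's Scala code: `threes_after_ones` and `reference_threes_after_ones` are hypothetical names, and the `id` ordering column is the one used in the answer's sample. The plain-Python reference models the same rule on a list of dicts.

```python
def threes_after_ones(df):
    """PySpark sketch (hypothetical helper): rows with c == 3 whose previous kept row has c == 1."""
    # Imported lazily so the plain-Python reference below runs without pyspark installed.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # No partitionBy: Spark moves all rows into a single partition to order them.
    w = Window.orderBy("id")
    return (
        df.filter(F.col("c").isin(1, 3))
        .withColumn("prev", F.lag("c", 1).over(w))
        .filter((F.col("c") == 3) & (F.col("prev") == 1))
    )


def reference_threes_after_ones(rows):
    """Plain-Python model; rows are dicts with 'id' and 'c'. Returns the matching ids."""
    kept = sorted((r for r in rows if r["c"] in (1, 3)), key=lambda r: r["id"])
    return [
        cur["id"]
        for prev, cur in zip(kept, kept[1:])
        if cur["c"] == 3 and prev["c"] == 1
    ]


# The five-row sample from the answer above.
sample = [
    {"id": 1, "c": 1},
    {"id": 2, "c": 3},
    {"id": 3, "c": 3},
    {"id": 4, "c": 1},
    {"id": 5, "c": 3},
]
print(reference_threes_after_ones(sample))
```

To recover the 1 row itself rather than the 3 row, the same window can carry the previous row's other columns with additional lag calls.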