I want a rolling 3-month average of sales in PySpark.
Input:
Product Date Sales
A 01/04/2020 50
A 02/04/2020 60
A 01/05/2020 70
A 05/05/2020 80
A 10/06/2020 100
A 13/06/2020 150
A 25/07/2020 160
Output:
Product Date Sales 3-month avg sales
A 01/04/2020 50 36.67
A 02/04/2020 60 36.67
A 01/05/2020 70 86.67
A 05/05/2020 80 86.67
A 10/06/2020 100 170
A 13/06/2020 150 170
A 25/07/2020 160 186.67
The average for July is the sum of sales for (May + June + July) / 3 = 560 / 3 = 186.67.
Sometimes dense_rank is quite expensive, so I have calculated a custom index instead and then followed similar steps to @Cena's answer.
from pyspark.sql import Window
from pyspark.sql.functions import *
w = Window.partitionBy('Product').orderBy('index').rangeBetween(-2, 0)
df.withColumn('Date', to_date('Date', 'dd/MM/yyyy')) \
.withColumn('index', (year('Date') - 2020) * 12 + month('Date')) \
.withColumn('avg', sum('Sales').over(w) / 3) \
.show()
+-------+----------+-----+-----+------------------+
|Product| Date|Sales|index| avg|
+-------+----------+-----+-----+------------------+
| A|2020-04-01| 50| 4|36.666666666666664|
| A|2020-04-02| 60| 4|36.666666666666664|
| A|2020-05-01| 70| 5| 86.66666666666667|
| A|2020-05-05| 80| 5| 86.66666666666667|
| A|2020-06-10| 100| 6| 170.0|
| A|2020-06-13| 150| 6| 170.0|
| A|2020-07-25| 160| 7|186.66666666666666|
+-------+----------+-----+-----+------------------+
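A small aside on the index above (my own note, not part of the original answer): the -2020 offset is only cosmetic. rangeBetween(-2, 0) compares differences between index values, so a plain year*12 + month index gives exactly the same averages without hard-coding a base year:
from pyspark.sql import functions as F, Window
# same window as above; only the month index changes
w = Window.partitionBy('Product').orderBy('index').rangeBetween(-2, 0)
df.withColumn('Date', F.to_date('Date', 'dd/MM/yyyy')) \
  .withColumn('index', F.year('Date') * 12 + F.month('Date')) \
  .withColumn('avg', F.sum('Sales').over(w) / 3) \
  .show()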
You can use dense_rank() over the month column to compute the moving average. Cast the date and extract the month from it; dense_rank() over the month gives you consecutive ranks.
For the moving average, use rangeBetween(-2, 0) to look back 2 months from the current month, then sum the sales over the window and divide by 3 for the output.
Your df:
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql.functions import col, month, from_unixtime, unix_timestamp
from pyspark.sql.types import DateType
from pyspark.sql.window import Window
row = Row("Product", "Date", "Sales")
df = sc.parallelize([row("A", "01/04/2020", 50), row("A", "02/04/2020", 60),
                     row("A", "01/05/2020", 70), row("A", "05/05/2020", 80),
                     row("A", "10/06/2020", 100), row("A", "13/06/2020", 150),
                     row("A", "25/07/2020", 160)]).toDF()
df = df.withColumn('date_cast', from_unixtime(unix_timestamp('Date', 'dd/MM/yyyy')).cast(DateType()))
df = df.withColumn('month', month("date_cast"))
w=Window().partitionBy("Product").orderBy("month")
df = df.withColumn('rank', F.dense_rank().over(w))
w2 = (Window().partitionBy(col("Product")).orderBy("rank").rangeBetween(-2, 0))
df.select(col("*"), ((F.sum("Sales").over(w2))/3).alias("mean"))\
.drop("date_cast", "month", "rank").show()
Output:
+-------+----------+-----+------------------+
|Product| Date|Sales| mean|
+-------+----------+-----+------------------+
| A|01/04/2020| 50|36.666666666666664|
| A|02/04/2020| 60|36.666666666666664|
| A|01/05/2020| 70| 86.66666666666667|
| A|05/05/2020| 80| 86.66666666666667|
| A|10/06/2020| 100| 170.0|
| A|13/06/2020| 150| 170.0|
| A|25/07/2020| 160|186.66666666666666|
+-------+----------+-----+------------------+
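If you also want the two-decimal figures from the question's expected output, a round() on the computed column does it. This is my addition, not part of the answer above; it reuses the answer's df, w2, and col:
result = df.select(col("*"), ((F.sum("Sales").over(w2)) / 3).alias("mean")) \
           .drop("date_cast", "month", "rank")
# e.g. 36.666666... -> 36.67
result.withColumn("mean", F.round("mean", 2)).show()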
I have a dataframe like so:
df = sc.parallelize([("num1", "1"), ("num2", "5"), ("total", "10")]).toDF(("key", "val"))
key val
num1 1
num2 5
total 10
I want to pivot only the total row to a new column and keep its value for each row:
key val total
num1 1 10
num2 5 10
I've tried pivoting and aggregating but cannot get a single column that carries the same total value on every row.
You could join a dataframe with only the total to a dataframe without the total.
Another option would be to collect the total and add it as a literal.
from pyspark.sql import functions as f
# option 1
df1 = df.filter("key <> 'total'")
df2 = df.filter("key = 'total'").select(f.col('val').alias('total'))
df1.join(df2).show()
+----+---+-----+
| key|val|total|
+----+---+-----+
|num1| 1| 10|
|num2| 5| 10|
+----+---+-----+
# option 2
total = df.filter("key = 'total'").select('val').collect()[0][0]
df.filter("key <> 'total'").withColumn('total', f.lit(total)).show()
+----+---+-----+
| key|val|total|
+----+---+-----+
|num1| 1| 10|
|num2| 5| 10|
+----+---+-----+
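Two hedged footnotes from me on the options above: on Spark 2.x the condition-less df1.join(df2) may be rejected as an implicit cartesian product unless you call df1.crossJoin(df2) or set spark.sql.crossJoin.enabled=true; and a window function is a third way to spread the total across rows without a join or a collect. A minimal sketch:
# option 3 (sketch): carry the total across all rows via an unpartitioned window
from pyspark.sql import functions as f, Window
w = Window.partitionBy()  # one global window; fine for a small lookup frame like this
df.withColumn('total', f.max(f.when(f.col('key') == 'total', f.col('val'))).over(w)) \
  .filter("key <> 'total'") \
  .show()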
I want to explode the columns of a dataframe in Spark Scala. The input is:
reference_month M M+1 M+2
2020-01-01 10 12 10
2020-02-01 10 12 10
The output should look like this:
reference_month Month reference_date_id
2020-01-01 10 2020-01
2020-01-01 12 2020-02
2020-01-01 10 2020-03
2020-02-01 10 2020-02
2020-02-01 12 2020-03
2020-02-01 10 2020-04
where reference_date_id = reference_month + x (with x derived from M, M+1, M+2).
Is there any way to get the output in this format in Spark Scala?
You can use the unpivot (stack) technique of Apache Spark:
import org.apache.spark.sql.functions.expr
data.select($"reference_month", expr("stack(3, `M`, `M+1`, `M+2`) as (Month)")).show()
You can use the stack function:
import sys
from pyspark.sql.types import StructType,StructField,IntegerType,StringType
from pyspark.sql.functions import when,concat_ws,lpad,row_number,sum,col,expr,substring,length
from pyspark.sql.window import Window
schema = StructType([
StructField("reference_month", StringType(), True),\
StructField("M", IntegerType(), True),\
StructField("M+1", IntegerType(), True),\
StructField("M+2", IntegerType(), True)
])
mnt = [("2020-01-01",10,12,10),("2020-02-01",10,12,10)]
df=spark.createDataFrame(mnt,schema)
newdf = df.withColumn("t",col("reference_month").cast("date")).drop("reference_month").withColumnRenamed("t","reference_month")
exp = expr("""stack(3,`M`,`M+1`,`M+2`) as (Values)""")
t = newdf.select("reference_month",exp).withColumn('mnth',substring("reference_month",6,2)).withColumn("newmnth",col("mnth").cast("Integer")).drop('mnth')
windowval = (Window.partitionBy('reference_month').orderBy('reference_month').rowsBetween(-sys.maxsize, 0))
ref_cal=t.withColumn("reference_date_id",row_number().over(windowval)-1)
ref_cal.withColumn('new_dt',
                   concat_ws('-',
                             substring("reference_month", 1, 4),
                             when(length(col("reference_date_id") + col("newmnth")) < 2,
                                  lpad(col("reference_date_id") + col("newmnth"), 2, '0'))
                             .otherwise(col("reference_date_id") + col("newmnth")))) \
    .drop("newmnth", "reference_date_id") \
    .withColumnRenamed("new_dt", "reference_date_id") \
    .orderBy("reference_month") \
    .show()
+---------------+------+-----------------+
|reference_month|Values|reference_date_id|
+---------------+------+-----------------+
| 2020-01-01| 10| 2020-01|
| 2020-01-01| 12| 2020-02|
| 2020-01-01| 10| 2020-03|
| 2020-02-01| 10| 2020-02|
| 2020-02-01| 12| 2020-03|
| 2020-02-01| 10| 2020-04|
+---------------+------+-----------------+
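An alternative sketch of the same idea (my own variant, not from the answer above): posexplode yields the 0/1/2 offset together with the value, and add_months shifts the reference month by that offset, which avoids the window and the string padding:
from pyspark.sql import functions as F
df.withColumn("reference_month", F.col("reference_month").cast("date")) \
  .select("reference_month",
          F.posexplode(F.array("M", "M+1", "M+2")).alias("offset", "Values")) \
  .withColumn("reference_date_id",
              F.expr("date_format(add_months(reference_month, offset), 'yyyy-MM')")) \
  .drop("offset") \
  .show()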
We can create an array of M, M+1, M+2 and then explode the array to get the required dataframe.
Example:
df.selectExpr("reference_month","array(M,`M+1`,`M+2`)as arr").
selectExpr("reference_month","explode(arr) as Month").show()
+---------------+-----+
|reference_month|Month|
+---------------+-----+
| 202001| 10|
| 202001| 12|
| 202001| 10|
| 202002| 10|
| 202002| 12|
| 202002| 10|
+---------------+-----+
//or
val cols= Seq("M","M+1","M+2")
df.withColumn("arr",array(cols.head,cols.tail:_*)).drop(cols:_*).
selectExpr("reference_month","explode(arr) as Month").show()
I have a dataframe that looks like this:
Genres | Year | Number_Movies
Drama |2015 | 705
Romance|2015 | 203
Comedy |2015 | 586
Drama |2014 | 605
Romance|2014 | 293
Comedy |2014 | 786
I would like to return, for each year, the genre that has the maximum number of movies:
Genres | Year | Number_Movies
Drama |2015 | 705
Comedy |2014 | 786
Please help if possible. Thanks a lot.
Here are a few options that can solve this:
df = spark.createDataFrame(
    [('Drama', 2015, 705), ('Romance', 2015, 203), ('Comedy', 2015, 586),
     ('Drama', 2014, 605), ('Romance', 2014, 293), ('Comedy ', 2014, 786)],
    ['Genres', 'Year', 'Number_Movies'])
First option: define a rank using a window function (partition by Year, order by Number_Movies descending). The highest Number_Movies in each year gets rank 1.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,desc
w = Window.partitionBy("Year").orderBy(desc("Number_Movies"))
rank = row_number().over(w).alias('rank')
df.withColumn("rank", rank)\
.where("rank=1")\
.drop("rank")\
.show()
#+-------+----+-------------+
#| Genres|Year|Number_Movies|
#+-------+----+-------------+
#|Comedy |2014| 786|
#| Drama|2015| 705|
#+-------+----+-------------+
Second option: get the maximum of Number_Movies for each year and self-join with the dataframe to get the Genres.
from pyspark.sql.functions import max,col
joining_condition = [col('a.Year') == col('b.Year'), col('a.max_Number_Movies') == col('b.Number_Movies')]
df.groupBy("Year").\
agg(max("Number_Movies").alias("max_Number_Movies")).alias("a").\
join(df.alias("b"), joining_condition).\
selectExpr("b.Genres","b.Year","b.Number_Movies").\
show()
#+-------+----+-------------+
#| Genres|Year|Number_Movies|
#+-------+----+-------------+
#|Comedy |2014| 786|
#| Drama|2015| 705|
#+-------+----+-------------+
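A third option, offered as my own sketch rather than part of the answer above: take the max of a struct, so the Genres rides along with the maximum Number_Movies and no join or window is needed:
from pyspark.sql import functions as F
# struct ordering compares fields left to right, so max() keeps the row with the
# largest Number_Movies and carries its Genres along
df.groupBy("Year") \
  .agg(F.max(F.struct("Number_Movies", "Genres")).alias("m")) \
  .select(F.col("m.Genres").alias("Genres"), "Year", F.col("m.Number_Movies").alias("Number_Movies")) \
  .show()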
I have a Spark dataframe with columns "date" of type timestamp and "quantity" of type long. For each date, I have some value for quantity. The dates are sorted in increasing order, but some dates are missing.
For example, the current df:
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
14-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
20-09-2016 | 2
As you can see, the df has some missing dates like 12-09-2016, 13-09-2016, etc. I want to put 0 in the quantity field for those missing dates, so that the resulting df looks like this:
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
12-09-2016 | 0
13-09-2016 | 0
14-09-2016 | 0
15-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
18-09-2016 | 0
19-09-2016 | 0
20-09-2016 | 2
Any help/suggestion regarding this will be appreciated. Thanks in advance.
Note that I am coding in Scala.
I have written this answer in a somewhat verbose way so the code is easy to follow. It can be optimized.
Needed imports
import java.time.format.DateTimeFormatter
import java.time.{LocalDate, LocalDateTime}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, TimestampType}
UDF to convert the string dates to a valid date format
val date_transform = udf((date: String) => {
val dtFormatter = DateTimeFormatter.ofPattern("d-M-y")
val dt = LocalDate.parse(date, dtFormatter)
"%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
.replaceAll(" ", "0")
})
The UDF below is taken from Iterate over dates range
def fill_dates = udf((start: String, excludedDiff: Int) => {
val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val fromDt = LocalDateTime.parse(start, dtFormatter)
(1 to (excludedDiff - 1)).map(day => {
val dt = fromDt.plusDays(day)
"%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
.replaceAll(" ", "0")
})
})
Setting up sample dataframe (df)
val df = Seq(
("10-09-2016", 1),
("11-09-2016", 2),
("14-09-2016", 0),
("16-09-2016", 1),
("17-09-2016", 0),
("20-09-2016", 2)).toDF("date", "quantity")
.withColumn("date", date_transform($"date").cast(TimestampType))
.withColumn("quantity", $"quantity".cast(LongType))
df.printSchema()
root
|-- date: timestamp (nullable = true)
|-- quantity: long (nullable = false)
df.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-14 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+
Create a temporary dataframe (tempDf) to union with df:
val w = Window.orderBy($"date")
val tempDf = df.withColumn("diff", datediff(lead($"date", 1).over(w), $"date"))
.filter($"diff" > 1) // Pick date diff more than one day to generate our date
.withColumn("next_dates", fill_dates($"date", $"diff"))
.withColumn("quantity", lit("0"))
.withColumn("date", explode($"next_dates"))
.withColumn("date", $"date".cast(TimestampType))
tempDf.show(false)
+-------------------+--------+----+------------------------+
|date |quantity|diff|next_dates |
+-------------------+--------+----+------------------------+
|2016-09-12 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-13 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-15 00:00:00|0 |2 |[2016-09-15] |
|2016-09-18 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
|2016-09-19 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
+-------------------+--------+----+------------------------+
Now union the two dataframes:
val result = df.union(tempDf.select("date", "quantity"))
.orderBy("date")
result.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-12 00:00:00| 0|
|2016-09-13 00:00:00| 0|
|2016-09-14 00:00:00| 0|
|2016-09-15 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-18 00:00:00| 0|
|2016-09-19 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+
Based on @mrsrinivas's excellent answer, here is the PySpark version.
Needed imports
from typing import List
import datetime
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import col, lit, udf, datediff, lead, explode
from pyspark.sql.types import DateType, ArrayType
UDF to create the range of next dates
def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
return [start_date + datetime.timedelta(days=days) for days in range(1, diff)]
Function to create the DataFrame that fills in the missing dates (supports "grouping" columns):
def _get_fill_dates_df(df: DataFrame, date_column: str, group_columns: List[str], fill_column: str) -> DataFrame:
get_next_dates_udf = udf(_get_next_dates, ArrayType(DateType()))
window = Window.orderBy(*group_columns, date_column)
return df.withColumn("_diff", datediff(lead(date_column, 1).over(window), date_column)) \
.filter(col("_diff") > 1).withColumn("_next_dates", get_next_dates_udf(date_column, "_diff")) \
.withColumn(fill_column, lit("0")).withColumn(date_column, explode("_next_dates")) \
.drop("_diff", "_next_dates")
The usage of the function:
fill_df = _get_fill_dates_df(df, "Date", [], "Quantity")
df = df.union(fill_df)
It assumes that the date column is already of date type.
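For the sample data in the question, getting the date column into date type first could look like the sketch below (my addition; it assumes the dd-MM-yyyy strings shown in the question):
from pyspark.sql import functions as F
df = df.withColumn("Date", F.to_date("Date", "dd-MM-yyyy"))
fill_df = _get_fill_dates_df(df, "Date", [], "Quantity")
# note: the function fills with lit("0"), so Quantity may come back as a string after the union; cast it back if needed
result = df.union(fill_df).orderBy("Date")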
Here is a slight modification to use this function with months, and to pass measure columns (columns that should be set to zero) instead of group columns:
from typing import List
import datetime
from dateutil import relativedelta
import math
import pyspark.sql.functions as f
from pyspark.sql import DataFrame, Window
from pyspark.sql.types import DateType, ArrayType, IntegerType
def fill_time_gaps_date_diff_based(df: DataFrame, measure_columns: list, date_column: str):
group_columns = [col for col in df.columns if col not in [date_column]+measure_columns]
# save measure sums for qc
qc = df.agg({col: 'sum' for col in measure_columns}).collect()
# convert month to date
convert_int_to_date = f.udf(lambda mth: datetime.datetime(year=math.floor(mth/100), month=mth%100, day=1), DateType())
df = df.withColumn(date_column, convert_int_to_date(date_column))
# sort values
df = df.orderBy(group_columns)
# get_fill_dates_df (instead of months_between also use date_diff for days)
window = Window.orderBy(*group_columns, date_column)
# calculate diff column
fill_df = df.withColumn(
"_diff",
f.months_between(f.lead(date_column, 1).over(window), date_column).cast(IntegerType())
).filter(
f.col("_diff") > 1
)
# generate next dates
def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
return [
start_date + relativedelta.relativedelta(months=months)
for months in range(1, diff)
]
get_next_dates_udf = f.udf(_get_next_dates, ArrayType(DateType()))
fill_df = fill_df.withColumn(
"_next_dates",
get_next_dates_udf(date_column, "_diff")
)
# set measure columns to 0
for col in measure_columns:
fill_df = fill_df.withColumn(col, f.lit(0))
# explode next_dates column
fill_df = fill_df.withColumn(date_column, f.explode('_next_dates'))
# drop unneccessary columns
fill_df = fill_df.drop(
"_diff",
"_next_dates"
)
# union df with fill_df
df = df.union(fill_df)
# qc: should be removed for productive runs
if qc != df.agg({col: 'sum' for col in measure_columns}).collect():
raise ValueError('Sums before and after run do not fit.')
return df
Please note that I assume the month is given as an integer in the form YYYYMM. This could easily be adjusted by modifying the "convert month to date" part.
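For example (again only a sketch of mine), if the month arrived as a 'YYYY-MM' string rather than a YYYYMM integer, the "convert month to date" step inside the function could be replaced with:
import pyspark.sql.functions as f
# parse '2020-09' style strings by appending a day-of-month and using to_date
df = df.withColumn(date_column, f.to_date(f.concat(f.col(date_column), f.lit("-01")), "yyyy-MM-dd"))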
I have the below requirement to aggregate data on a Spark dataframe in Scala.
I have a spark dataframe with two columns.
mo_id sales
201601 11.01
201602 12.01
201603 13.01
201604 14.01
201605 15.01
201606 16.01
201607 17.01
201608 18.01
201609 19.01
201610 20.01
201611 21.01
201612 22.01
As shown above the dataframe has two columns 'mo_id' and 'sales'.
I want to add a new column (agg_sales) to the dataframe which should have the sum of sales up to the current month, as shown below.
mo_id sales agg_sales
201601 10 10
201602 20 30
201603 30 60
201604 40 100
201605 50 150
201606 60 210
201607 70 280
201608 80 360
201609 90 450
201610 100 550
201611 110 660
201612 120 780
Description:
For the month 201603 agg_sales will be sum of sales from 201601 to 201603.
For the month 201604 agg_sales will be sum of sales from 201601 to 201604.
and so on.
Can anyone please help to do this.
Versions used: Spark 1.6.2 and Scala 2.10.
You are looking for a cumulative sum which can be accomplished with a window function:
scala> val df = sc.parallelize(Seq((201601, 10), (201602, 20), (201603, 30), (201604, 40), (201605, 50), (201606, 60), (201607, 70), (201608, 80), (201609, 90), (201610, 100), (201611, 110), (201612, 120))).toDF("id","sales")
df: org.apache.spark.sql.DataFrame = [id: int, sales: int]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val ordering = Window.orderBy("id")
ordering: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@75d454a4
scala> df.withColumn("agg_sales", sum($"sales").over(ordering)).show
16/12/27 21:11:35 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+-----+-------------+
| id|sales| agg_sales |
+------+-----+-------------+
|201601| 10| 10|
|201602| 20| 30|
|201603| 30| 60|
|201604| 40| 100|
|201605| 50| 150|
|201606| 60| 210|
|201607| 70| 280|
|201608| 80| 360|
|201609| 90| 450|
|201610| 100| 550|
|201611| 110| 660|
|201612| 120| 780|
+------+-----+-------------+
Note that I defined the ordering on the ids; you would probably want some sort of timestamp to order the summation.