How to add a column to an existing DataFrame, using a window function to sum specific rows into the new column, in Scala/Spark 2.2

E.g., I would like to sum the quantity sold per date.
Date Quantity
11/4/2017 20
11/4/2017 23
11/4/2017 12
11/5/2017 18
11/5/2017 12
Output with the new column:
Date Quantity New_Column
11/4/2017 20 55
11/4/2017 23 55
11/4/2017 12 55
11/5/2017 18 30
11/5/2017 12 30

Simply use sum as a window function by specifying a WindowSpec:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
df.withColumn("New_Column", sum("Quantity").over(Window.partitionBy("Date"))).show
+---------+--------+----------+
| Date|Quantity|New_Column|
+---------+--------+----------+
|11/5/2017| 18| 30|
|11/5/2017| 12| 30|
|11/4/2017| 20| 55|
|11/4/2017| 23| 55|
|11/4/2017| 12| 55|
+---------+--------+----------+
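
To make the semantics of that window explicit, here is a minimal plain-Python sketch of what a sum over a window partitioned by Date computes. It is not Spark code, and the names rows, totals and with_total are made up for illustration.

```python
from collections import defaultdict

# Sample rows from the question: (Date, Quantity)
rows = [
    ("11/4/2017", 20),
    ("11/4/2017", 23),
    ("11/4/2017", 12),
    ("11/5/2017", 18),
    ("11/5/2017", 12),
]

# First pass: total quantity per date (one total per "partition").
totals = defaultdict(int)
for date, quantity in rows:
    totals[date] += quantity

# Second pass: attach the partition total to every row, keeping all rows.
with_total = [(date, quantity, totals[date]) for date, quantity in rows]

for row in with_total:
    print(row)
```

Unlike a plain groupBy, which would collapse the data to one row per date, the window keeps all five input rows and repeats the per-date total on each of them.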

How to transform the below table to required format?

I have the following table loaded as a DataFrame:
Id Name customCount Custom1 Custom1value custom2 custom2Value custom3 custom3Value
1 qwerty 2 Height 171 Age 76 Null Null
2 asdfg 2 Weight 78 Height 166 Null Null
3 zxcvb 3 Age 28 SkinColor white Height 67
4 tyuio 1 Height 177 Null Null Null Null
5 asdfgh 2 SkinColor brown Age 34 Null Null
I need to change this table into the format below:
Id Name customCount Height Weight Age SkinColor
1 qwerty 2 171 Null 76 Null
2 asdfg 2 166 78 Null Null
3 zxcvb 3 67 Null 28 white
4 tyuio 1 177 Null Null Null
5 asdfgh 2 Null Null 34 brown
I tried this for two custom field columns:
val rawDf= spark.read.option("Header",false).options(Map("sep"->"|")).csv("/sample/data.csv")
rawDf.createOrReplaceTempView("Table")
val dataframe=spark.sql("select distinct * from (select `_c3` from Table union select `_c5` from Table)")
val dfWithDistinctColumns=dataframe.toDF("colNames")
val list=dfWithDistinctColumns.select("colNames").map(x=>x.getString(0)).collect().toList
val rawDfWithSchema=rawDf.toDF("Id","Name","customCount","h1","v1","h2","v2")
val expectedDf=list.foldLeft(rawDfWithSchema)((df1,c)=>(df1.withColumn(c, when(col("h1")===c,col("v1")).when(col("h2")===c,col("v2")).otherwise(null)))).drop("h1","h2","v1","v2")
But I am not able to do the union on multiple columns when I try it with 3 custom fields.
Can you please give me any idea or solution for this?
You can do a pivot, but you also need to clean up the format of the dataframe first:
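import org.apache.spark.sql.functions.{array, explode, first}
// $"..." needs spark.implicits._ (imported automatically in spark-shell)
val df2 = df.select(
  $"Id", $"Name", $"customCount",
  // pair each custom name with its value, then emit one row per pair
  explode(array(
    array($"Custom1", $"Custom1value"),
    array($"custom2", $"custom2Value"),
    array($"custom3", $"custom3Value")
  )).alias("custom")
).select(
  $"Id", $"Name", $"customCount",
  $"custom"(0).alias("key"),
  $"custom"(1).alias("value")
).groupBy(
  "Id", "Name", "customCount"
  // pairs with a null key pivot into a column named "null", dropped here
).pivot("key").agg(first("value")).drop("null").orderBy("Id")
df2.show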
+---+------+-----------+----+------+---------+------+
| Id| Name|customCount| Age|Height|SkinColor|Weight|
+---+------+-----------+----+------+---------+------+
| 1|qwerty| 2| 76| 171| null| null|
| 2| asdfg| 2|null| 166| null| 78|
| 3| zxcvb| 3| 28| 67| white| null|
| 4| tyuio| 1|null| 177| null| null|
| 5|asdfgh| 2| 34| null| brown| null|
+---+------+-----------+----+------+---------+------+
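
The reshaping itself can be sketched in plain Python to show what the explode plus pivot combination does conceptually. This is not Spark code: the names records, columns and wide are hypothetical, and the values are kept as strings on the assumption that they were read from a CSV file.

```python
# Each record: (Id, Name, customCount, [(key, value), ...]) from the question.
records = [
    (1, "qwerty", 2, [("Height", "171"), ("Age", "76"), (None, None)]),
    (2, "asdfg", 2, [("Weight", "78"), ("Height", "166"), (None, None)]),
    (3, "zxcvb", 3, [("Age", "28"), ("SkinColor", "white"), ("Height", "67")]),
    (4, "tyuio", 1, [("Height", "177"), (None, None), (None, None)]),
    (5, "asdfgh", 2, [("SkinColor", "brown"), ("Age", "34"), (None, None)]),
]

# The distinct non-null keys become the output columns.
columns = sorted({key for rec in records for key, _value in rec[3] if key is not None})

# For every record, look up each column's value; missing keys yield None.
wide = []
for id_, name, count, pairs in records:
    lookup = {key: value for key, value in pairs if key is not None}
    wide.append((id_, name, count) + tuple(lookup.get(col) for col in columns))

print(columns)
for row in wide:
    print(row)
```

Ignoring the null keys up front plays the same role as dropping the "null" column after the pivot in the Spark version.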

Calculating the rolling sums in pyspark

I have a DataFrame that contains information on daily sales and daily clicks. Before I run my analysis, I want to aggregate the data. To make this clearer, I will explain it with an example DataFrame:
item_id date Price Sale Click Discount_code
2 01.03.2019 10 1 10 NULL
2 01.03.2019 8 1 10 Yes
2 02.03.2019 10 0 4 NULL
2 03.03.2019 10 0 6 NULL
2 04.03.2019 6 0 15 NULL
2 05.03.2019 6 0 14 NULL
2 06.03.2019 5 0 7 NULL
2 07.03.2019 5 1 11 NULL
2 07.03.2019 5 1 11 NULL
2 08.03.2019 5 0 9 NULL
If there are two sales for the given day, I have two observations for that particular day. I want to convert my dataframe to the following one by collapsing observations by item_id and price:
item_id Price CSale Discount_code Cclicks firstdate lastdate
2 10 1 No 20 01.03.2019 03.03.2019
2 8 1 Yes 10 01.03.2019 01.03.2019
2 6 0 NULL 29 04.03.2019 05.03.2019
2 5 2 NULL 38 06.03.2019 08.03.2019
Here CSale corresponds to the cumulative sales for the given price and item_id, Cclicks corresponds to the cumulative clicks for the given price and item_id, firstdate is the first date on which the item was available at the given price, and lastdate is the last date on which the item was available at the given price.
According to the problem, OP wants to aggregate the DataFrame on the basis of item_id and Price.
# Creating the DataFrames
from pyspark.sql.functions import col, to_date, sum, min, max, first
df = sqlContext.createDataFrame([(2,'01.03.2019',10,1,10,None),(2,'01.03.2019',8,1,10,'Yes'),
(2,'02.03.2019',10,0,4,None),(2,'03.03.2019',10,0,6,None),
(2,'04.03.2019',6,0,15,None),(2,'05.03.2019',6,0,14,None),
(2,'06.03.2019',5,0,7,None),(2,'07.03.2019',5,1,11,None),
(2,'07.03.2019',5,1,11,None),(2,'08.03.2019',5,0,9,None)],
('item_id','date','Price','Sale','Click','Discount_code'))
# Converting string column date to proper date
df = df.withColumn('date',to_date(col('date'),'dd.MM.yyyy'))
df.show()
+-------+----------+-----+----+-----+-------------+
|item_id| date|Price|Sale|Click|Discount_code|
+-------+----------+-----+----+-----+-------------+
| 2|2019-03-01| 10| 1| 10| null|
| 2|2019-03-01| 8| 1| 10| Yes|
| 2|2019-03-02| 10| 0| 4| null|
| 2|2019-03-03| 10| 0| 6| null|
| 2|2019-03-04| 6| 0| 15| null|
| 2|2019-03-05| 6| 0| 14| null|
| 2|2019-03-06| 5| 0| 7| null|
| 2|2019-03-07| 5| 1| 11| null|
| 2|2019-03-07| 5| 1| 11| null|
| 2|2019-03-08| 5| 0| 9| null|
+-------+----------+-----+----+-----+-------------+
As the printSchema output below shows, the DataFrame's date column is now of date type.
df.printSchema()
root
|-- item_id: long (nullable = true)
|-- date: date (nullable = true)
|-- Price: long (nullable = true)
|-- Sale: long (nullable = true)
|-- Click: long (nullable = true)
|-- Discount_code: string (nullable = true)
Finally, aggregate the columns with agg() as shown below. One caveat: since Discount_code is a string column and needs to be aggregated as well, we take the first non-null value within each group.
df = df.groupBy('item_id','Price').agg(sum('Sale').alias('CSale'),
first('Discount_code',ignorenulls = True).alias('Discount_code'),
sum('Click').alias('Cclicks'),
min('date').alias('firstdate'),
max('date').alias('lastdate'))
df.show()
+-------+-----+-----+-------------+-------+----------+----------+
|item_id|Price|CSale|Discount_code|Cclicks| firstdate| lastdate|
+-------+-----+-----+-------------+-------+----------+----------+
| 2| 6| 0| null| 29|2019-03-04|2019-03-05|
| 2| 5| 2| null| 38|2019-03-06|2019-03-08|
| 2| 8| 1| Yes| 10|2019-03-01|2019-03-01|
| 2| 10| 1| null| 20|2019-03-01|2019-03-03|
+-------+-----+-----+-------------+-------+----------+----------+

scala find group by id max date

I need to group by id and times and show max date
Id Key Times date
20 40 1 20190323
20 41 1 20191201
31 33 3 20191209
My output should be:
Id Key Times date
20 41 1 20191201
31 33 3 20191209
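You can simply group by Id to get the maximum date, then join the result with the original dataset to bring the Key column back into the resulting DataFrame. Try the following code:
import org.apache.spark.sql.functions.max
//toDF and $"..." need spark.implicits._ (imported automatically in spark-shell)
//your original dataframe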
val df = Seq((20,40,1,20190323),(20,41,1,20191201),(31,33,3,20191209))
.toDF("Id","Key","Times","date")
df.show()
//output
//+---+---+-----+--------+
//| Id|Key|Times| date|
//+---+---+-----+--------+
//| 20| 40| 1|20190323|
//| 20| 41| 1|20191201|
//| 31| 33| 3|20191209|
//+---+---+-----+--------+
//group by Id column
val maxDate = df.groupBy("Id").agg(max("date").as("maxdate"))
//join with original DF to get rest of the column
maxDate.join(df, Seq("Id"))
.where($"date" === $"maxdate")
.select("Id","Key","Times","date").show()
//output
//+---+---+-----+--------+
//| Id|Key|Times| date|
//+---+---+-----+--------+
//| 31| 33| 3|20191209|
//| 20| 41| 1|20191201|
//+---+---+-----+--------+
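
The two steps of that approach (find the maximum per Id, then keep the matching rows) can be sketched in plain Python to show how ties behave. This is an illustration only, not Spark code, and the names max_date and latest are made up.

```python
# (Id, Key, Times, date) rows from the question.
rows = [(20, 40, 1, 20190323), (20, 41, 1, 20191201), (31, 33, 3, 20191209)]

# Step 1: maximum date per Id (the groupBy + max).
max_date = {}
for id_, _key, _times, date in rows:
    if id_ not in max_date or date > max_date[id_]:
        max_date[id_] = date

# Step 2: keep the rows whose date equals the maximum for their Id
# (the join followed by the where clause).
latest = [row for row in rows if row[3] == max_date[row[0]]]

print(latest)
```

If two rows for the same Id share the maximum date, both are kept. The same holds for the join-based Spark code, so an extra tie-breaking rule is needed when exactly one row per Id is required.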

Calculate variance across columns in pyspark

How can I calculate the variance across several columns in PySpark?
E.g., if the pyspark.sql.DataFrame is:
ID A B C
1 12 15 7
2 6 15 2
3 56 25 25
4 36 12 5
and the output needed is:
ID A B C Variance
1 12 15 7 10.9
2 6 15 2 29.6
3 56 25 25 213.6
4 36 12 5 176.2
There is a variance function in pyspark but it works only column-wise.
Just concatenate the columns you need with the concat_ws function and use a UDF to calculate the variance, as below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from statistics import pvariance
def calculateVar(row):
    data = [float(x.strip()) for x in row.split(",")]
    return pvariance(data)
varUDF = udf(calculateVar,FloatType())
df.withColumn('Variance',varUDF(concat_ws(",",df.a,df.b,df.c))).show()
Output:
+---+---+---+---+---------+
| id| a| b| c| Variance|
+---+---+---+---+---------+
| 1| 12| 15| 7|10.888889|
| 2| 6| 15| 2|29.555555|
| 3| 56| 25| 25|213.55556|
| 4| 36| 12| 5|176.22223|
+---+---+---+---+---------+
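
As a quick sanity check of the numbers, the row-wise population variance can be reproduced without Spark using only the standard library. This is a sketch for verification, with the hypothetical names rows and variances; it is not a replacement for the UDF.

```python
from statistics import pvariance

# ID -> values of columns A, B, C from the question.
rows = {
    1: [12, 15, 7],
    2: [6, 15, 2],
    3: [56, 25, 25],
    4: [36, 12, 5],
}

# Population variance: mean squared deviation, divided by n (not n - 1).
variances = {row_id: pvariance(values) for row_id, values in rows.items()}

for row_id, var in variances.items():
    print(row_id, round(var, 1))
```

Rounded to one decimal, these agree with the values requested in the question. Note that pvariance divides by n, whereas Spark's column-wise variance function computes the sample variance (dividing by n - 1), so the two would give different numbers on the same data.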

Spark dataframe data aggregation

I have the following requirement to aggregate data in a Spark DataFrame in Scala.
I have a Spark DataFrame with two columns.
mo_id sales
201601 11.01
201602 12.01
201603 13.01
201604 14.01
201605 15.01
201606 16.01
201607 17.01
201608 18.01
201609 19.01
201610 20.01
201611 21.01
201612 22.01
As shown above, the DataFrame has two columns, 'mo_id' and 'sales'.
I want to add a new column (agg_sales) to the DataFrame that holds the sum of sales up to and including the current month, as shown below.
mo_id sales agg_sales
201601 10 10
201602 20 30
201603 30 60
201604 40 100
201605 50 150
201606 60 210
201607 70 280
201608 80 360
201609 90 450
201610 100 550
201611 110 660
201612 120 780
Description:
For the month 201603 agg_sales will be sum of sales from 201601 to 201603.
For the month 201604 agg_sales will be sum of sales from 201601 to 201604.
and so on.
Can anyone please help me do this?
Versions used: Spark 1.6.2 and Scala 2.10
You are looking for a cumulative sum which can be accomplished with a window function:
scala> val df = sc.parallelize(Seq((201601, 10), (201602, 20), (201603, 30), (201604, 40), (201605, 50), (201606, 60), (201607, 70), (201608, 80), (201609, 90), (201610, 100), (201611, 110), (201612, 120))).toDF("id","sales")
df: org.apache.spark.sql.DataFrame = [id: int, sales: int]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val ordering = Window.orderBy("id")
ordering: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@75d454a4
scala> df.withColumn("agg_sales", sum($"sales").over(ordering)).show
16/12/27 21:11:35 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+-----+-------------+
| id|sales| agg_sales |
+------+-----+-------------+
|201601| 10| 10|
|201602| 20| 30|
|201603| 30| 60|
|201604| 40| 100|
|201605| 50| 150|
|201606| 60| 210|
|201607| 70| 280|
|201608| 80| 360|
|201609| 90| 450|
|201610| 100| 550|
|201611| 110| 660|
|201612| 120| 780|
+------+-----+-------------+
Note that I defined the ordering on the ids; you would probably want some sort of timestamp to order the summation.
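
For intuition, the running total that the ordered window produces can be sketched in plain Python with itertools.accumulate. This is an illustration of the semantics only, not Spark code; the names rows, running and result are made up.

```python
from itertools import accumulate

# (mo_id, sales) pairs matching the answer's sample data: 201601..201612, 10..120.
rows = [(201600 + month, 10 * month) for month in range(1, 13)]

# A running total depends on the order, so sort by the month id first
# (the counterpart of Window.orderBy).
rows.sort(key=lambda row: row[0])

# accumulate yields the sum of all sales up to and including each row.
running = list(accumulate(sales for _mo_id, sales in rows))

result = [(mo_id, sales, agg) for (mo_id, sales), agg in zip(rows, running)]
for row in result:
    print(row)
```

Each output row carries the sum of its own sales and all earlier months, which is why the ordering column matters: sorting by a different key would change every intermediate total.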