Weighted running total with conditional in PySpark/Hive

I have product, brand, percentage and price columns. For each row, I want to sum the percentage column over the rows above it, separately for rows with a different brand than the current row and for rows with the same brand. I want to weigh the contributions by price: if a product above the current row has a higher price than the current row, I want to down-weigh its percentage by multiplying it by 0.8. How can I do this in PySpark or using spark.sql? An answer to the version of this question without the price weighting is here.
import pandas as pd

df = pd.DataFrame({'a': ['a1','a2','a3','a4','a5','a6'],
                   'brand': ['b1','b2','b1','b3','b2','b1'],
                   'pct': [40, 30, 10, 8, 7, 5],
                   'price': [0.6, 1, 0.5, 0.8, 1, 0.5]})
df = spark.createDataFrame(df)
What I am looking for
product  brand  pct  pct_same_brand  pct_different_brand
a1       b1     40   null            null
a2       b2     30   null            40
a3       b1     10   32              30
a4       b3     8    null            80
a5       b2     7    24              58
a6       b1     5    40              45
Update:
I have added the data points below to help clarify the problem. As can be seen, the same row above can be multiplied by 0.8 when it contributes to one row and by 1.0 when it contributes to another.
product  brand  pct  price  pct_same_brand  pct_different_brand
a1       b1     30   0.6    null            null
a2       b2     20   1.3    null            30
a3       b1     10   0.5    30*0.8          20
a4       b3     8    0.8    null            60
a5       b2     7    0.5    20*0.8          48
a6       b1     6    0.8    30*1 + 10*1     35
a7       b2     5    1.5    20*1 + 7*1      54
Update 2: In the data that I provided above, the weight within each row happens to be a single number (0.8 or 1), but the weights can also be mixed: 0.8 for some of the contributing rows and 1 for others.
For example, in the data frame below, the multipliers for the last row (a10) should be 0.8 for a6 and 1.0 for the rest of brand b1, because a6 (price 0.8) is the only b1 product above a10 that is more expensive than a10 (price 0.7):
df = pd.DataFrame({'a': ['a1','a2','a3','a4','a5','a6','a7','a8','a9','a10'],
                   'brand': ['b1','b2','b1','b3','b2','b1','b2','b1','b1','b1'],
                   'pct': [30, 20, 10, 8, 7, 6, 5, 4, 3, 2],
                   'price': [0.6, 1.3, 0.5, 0.8, 0.5, 0.8, 1.5, 0.5, 0.65, 0.7]})
df = spark.createDataFrame(df)

You can add a weight column to facilitate calculation:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'weight',
    F.when(
        F.col('price') <= F.lag('price').over(
            Window.partitionBy('brand').orderBy(F.desc('pct'))
        ),
        0.8
    ).otherwise(1.0)
).withColumn(
    'pct_same_brand',
    F.col('weight') * F.sum('pct').over(
        Window.partitionBy('brand')
              .orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand',
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand'), F.lit(0)) / F.col('weight')
)
df2.show()
+---+-----+---+-----+------+--------------+-------------------+
| a|brand|pct|price|weight|pct_same_brand|pct_different_brand|
+---+-----+---+-----+------+--------------+-------------------+
| a1| b1| 40| 0.6| 1.0| null| null|
| a2| b2| 30| 1.0| 1.0| null| 40.0|
| a3| b1| 10| 0.5| 0.8| 32.0| 30.0|
| a4| b3| 8| 0.8| 1.0| null| 80.0|
| a5| b2| 7| 1.0| 0.8| 24.0| 58.0|
| a6| b1| 5| 0.5| 0.8| 40.0| 45.0|
+---+-----+---+-----+------+--------------+-------------------+
Output for the edited question:
+---+-----+---+-----+------+--------------+-------------------+
| a|brand|pct|price|weight|pct_same_brand|pct_different_brand|
+---+-----+---+-----+------+--------------+-------------------+
| a1| b1| 30| 0.6| 1.0| null| null|
| a2| b2| 20| 1.3| 1.0| null| 30.0|
| a3| b1| 10| 0.5| 0.8| 24.0| 20.0|
| a4| b3| 8| 0.8| 1.0| null| 60.0|
| a5| b2| 7| 0.5| 0.8| 16.0| 48.0|
| a6| b1| 6| 0.8| 1.0| 40.0| 35.0|
| a7| b2| 5| 1.5| 1.0| 27.0| 54.0|
+---+-----+---+-----+------+--------------+-------------------+
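Since the question also asks for a spark.sql option, here is a rough SQL rendering of the same logic (an untested sketch; it assumes df has been registered as a temp view named df):

df.createOrReplaceTempView("df")

spark.sql("""
    SELECT t.*,
           -- weighted running sum of pct over earlier rows of the same brand
           weight * SUM(pct) OVER (PARTITION BY brand ORDER BY pct DESC
                                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS pct_same_brand,
           -- running sum over all earlier rows minus the unweighted same-brand part
           SUM(pct) OVER (ORDER BY pct DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
             - COALESCE(SUM(pct) OVER (PARTITION BY brand ORDER BY pct DESC
                                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) AS pct_different_brand
    FROM (
        SELECT *,
               CASE WHEN price <= LAG(price) OVER (PARTITION BY brand ORDER BY pct DESC)
                    THEN 0.8 ELSE 1.0 END AS weight
        FROM df
    ) t
""").show()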

In case anyone has a similar question, this is what worked for me.
Basically, I outer-joined the dataframe with itself to assign the weights, and then used window functions.
df_copy = df.withColumnRenamed('a', 'asin')\
            .withColumnRenamed('brand', 'brandd')\
            .withColumnRenamed('pct', 'pct2')\
            .withColumnRenamed('price', 'price2')

df2 = df.join(df_copy, on=[df.brand == df_copy.brandd], how='outer').orderBy('brand')

df3 = df2.filter(~((df2.a == df2.asin) & (df2.brand == df2.brandd))
                 & (df2.pct <= df2.pct2))

df3 = df3.withColumn('weight', F.when(df3.price2 > df3.price, 0.8).otherwise(1))

df4 = df3.groupBy(['a', 'brand', 'pct', 'price']).agg(
    F.sum(df3.pct2 * df3.weight).alias('same_brand_pct'))

df5 = df.join(df4, on=['a', 'brand', 'pct', 'price'], how='left')

df6 = df5.withColumn(
    'pct_same_brand_unscaled',
    F.sum('pct').over(
        Window.partitionBy('brand')
              .orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand',
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand_unscaled'), F.lit(0))
).drop('pct_same_brand_unscaled')
gives:
+---+-----+---+-----+--------------+-------------------+
|  a|brand|pct|price|same_brand_pct|pct_different_brand|
+---+-----+---+-----+--------------+-------------------+
| a1|   b1| 30|  0.6|          null|               null|
| a2|   b2| 20|  1.3|          null|                 30|
| a3|   b1| 10|  0.5|          24.0|                 20|
| a4|   b3|  8|  0.8|          null|                 60|
| a5|   b2|  7|  0.5|          16.0|                 48|
| a6|   b1|  6|  0.8|          40.0|                 35|
| a7|   b2|  5|  1.5|          27.0|                 54|
| a8|   b1|  4|  0.5|          38.8|                 40|
| a9|   b1|  3| 0.65|          48.8|                 40|
|a10|   b1|  2|  0.7|          51.8|                 40|
+---+-----+---+-----+--------------+-------------------+
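As a small design note, the intermediate weight column above isn't strictly required; the same df4 aggregation could (as an equivalent, untested sketch) be folded into a single conditional sum:

df4 = df3.groupBy(['a', 'brand', 'pct', 'price']).agg(
    F.sum(
        # down-weigh pct2 by 0.8 when the other product's price is higher
        F.when(df3.price2 > df3.price, df3.pct2 * 0.8).otherwise(df3.pct2)
    ).alias('same_brand_pct')
)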

Related

PySpark : Dataframe : Numeric + Null column values resulting in NULL instead of numeric value

I am facing a problem with a PySpark DataFrame loaded from a CSV file, where my numeric columns have empty values, like below:
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| |
| Abid Ali, S| 29| 5| |
|Adhikari, H R| 21| | |
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
I cast those columns to integer, and all the empty values became null:
df_data_csv_casted = df_data_csv.select(
    df_data_csv['Country'],
    df_data_csv['Player_Name'],
    df_data_csv['Test_Matches'].cast(IntegerType()).alias("Test_Matches"),
    df_data_csv['ODI_Matches'].cast(IntegerType()).alias("ODI_Matches"),
    df_data_csv['T20_Matches'].cast(IntegerType()).alias("T20_Matches"))
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| null|
| Abid Ali, S| 29| 5| null|
|Adhikari, H R| 21| null| null|
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
Then I am taking a total, but if one of the values is null, the result also comes out as null. How can I solve this?
df_data_csv_withTotalCol = df_data_csv_casted.withColumn(
    'Total_Matches',
    df_data_csv_casted['Test_Matches'] + df_data_csv_casted['ODI_Matches']
        + df_data_csv_casted['T20_Matches'])
+-------------+------------+-----------+-----------+-------------+
|Player_Name |Test_Matches|ODI_Matches|T20_Matches|Total_Matches|
+-------------+------------+-----------+-----------+-------------+
| Aaron, V R | 9| 9| null| null|
|Abid Ali, S | 29| 5| null| null|
|Adhikari, H R| 21| null| null| null|
|Agarkar, A B | 26| 191| 4| 221|
+-------------+------------+-----------+-----------+-------------+
You can fix this by using the coalesce function. For example, let's create some sample data:
from pyspark.sql.functions import coalesce,lit
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
+----+----+
| a| b|
+----+----+
|null|null|
| 1|null|
|null| 2|
+----+----+
When I do a simple sum as you did:
cDf.withColumn('Total',cDf.a+cDf.b).show()
I get the total as null, same as you described:
+----+----+-----+
| a| b|Total|
+----+----+-----+
|null|null| null|
| 1|null| null|
|null| 2| null|
+----+----+-----+
To fix it, use coalesce along with the lit function, which replaces the null values with zeroes:
cDf.withColumn('Total',coalesce(cDf.a,lit(0)) +coalesce(cDf.b,lit(0))).show()
This gives me the correct results:
+----+----+-----+
|   a|   b|Total|
+----+----+-----+
|null|null|    0|
|   1|null|    1|
|null|   2|    2|
+----+----+-----+
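An alternative, if there are many numeric columns, is to zero-fill the nulls once with fillna and then add the columns directly. A short sketch using the column names from the question:

cols = ["Test_Matches", "ODI_Matches", "T20_Matches"]
filled = df_data_csv_casted.fillna(0, subset=cols)   # nulls become 0 in these columns only
df_with_total = filled.withColumn(
    "Total_Matches",
    filled["Test_Matches"] + filled["ODI_Matches"] + filled["T20_Matches"]
)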

How to efficiently perform this column operation on a Spark Dataframe?

I have a dataframe as follows:
+---+---+---+
| F1| F2| F3|
+---+---+---+
| x| y| 1|
| x| z| 2|
| x| a| 4|
| x| a| 4|
| x| y| 1|
| t| y2| 6|
| t| y3| 4|
| t| y4| 5|
+---+---+---+
I want to add another column whose value is (the number of unique ("F1", "F2") rows for each unique "F3") / (the total number of unique ("F1", "F2") rows).
For example, for the above table, below is the desired new dataframe:
+---+---+---+----+
| F1| F2| F3| F4|
+---+---+---+----+
| t| y4| 5| 1/6|
| x| y| 1| 1/6|
| x| y| 1| 1/6|
| x| z| 2| 1/6|
| t| y2| 6| 1/6|
| t| y3| 4| 2/6|
| x| a| 4| 2/6|
| x| a| 4| 2/6|
+---+---+---+----+
Note: for F3 = 4, there are only 2 unique (F1, F2) pairs, {(t, y3), (x, a)}. Therefore, for all occurrences of F3 = 4, F4 will be 2 / (the total number of unique (F1, F2) pairs); here there are 6 such pairs.
How to achieve the above transformation in Spark Scala?
While trying to solve your problem, I just learned that you can't use distinct aggregate functions inside a Window over DataFrames.
So what I did was create a temporary DataFrame and join it with the initial one to obtain your desired results:
case class Dog(F1:String, F2: String, F3: Int)
val df = Seq(Dog("x", "y", 1), Dog("x", "z", 2), Dog("x", "a", 4), Dog("x", "a", 4), Dog("x", "y", 1), Dog("t", "y2", 6), Dog("t", "y3", 4), Dog("t", "y4", 5)).toDF
val unique_F1_F2 = df.select("F1", "F2").distinct.count
val dd = df.withColumn("X1", concat(col("F1"), col("F2")))
.groupBy("F3")
.agg(countDistinct(col("X1")).as("distinct_count"))
val final_df = dd.join(df, "F3")
.withColumn("F4", col("distinct_count")/unique_F1_F2)
.drop("distinct_count")
final_df.show
+---+---+---+-------------------+
| F3| F1| F2| F4|
+---+---+---+-------------------+
| 1| x| y|0.16666666666666666|
| 1| x| y|0.16666666666666666|
| 6| t| y2|0.16666666666666666|
| 5| t| y4|0.16666666666666666|
| 4| t| y3| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 2| x| z|0.16666666666666666|
+---+---+---+-------------------+
I hope this is what you expected!
EDIT: I changed df.count to unique_F1_F2.
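As a side note, one known way around the "no distinct aggregates over a Window" limitation is size(collect_set(...)) over a window, which avoids the extra join. Here is a PySpark sketch of the idea (the same size, collect_set and concat_ws functions exist in the Scala API):

from pyspark.sql import functions as F, Window

# concat_ws with a separator avoids collisions like ("xy", "z") vs ("x", "yz")
pair = F.concat_ws("|", F.col("F1"), F.col("F2"))
total_pairs = df.select("F1", "F2").distinct().count()   # 6 in the example

result = df.withColumn(
    "F4",
    F.size(F.collect_set(pair).over(Window.partitionBy("F3"))) / F.lit(total_pairs)
)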

Join 2 DFs with different dimensions in Scala

Hi, I have 2 different DFs:
scala> d1.show()
+--------+-------+
|   fecha|eventos|
+--------+-------+
|20180404|      3|
|20180405|      7|
|20180406|     10|
|20180409|      4|
....

scala> d1.count()
res3: Long = 60

scala> d2.show()
+--------+----------+
|   fecha|TotalEvent|
+--------+----------+
|       0|     23534|
|20180322|        10|
|20180326|        50|
|20180402|         6|
|20180403|       118|
|20180404|      1110|
...

scala> d2.count()
res7: Long = 74
But I would like to join them by fecha without losing data, and then create a new column with the math operation (TotalEvent - eventos)*100/TotalEvent.
Something like this:
+---------+-------+----------+--------+
|fecha |eventos|TotalEvent| KPI |
+---------+-------+----------+--------+
| 0| | 23534 | 100.00|
| 20180322| | 10 | 100.00|
| 20180326| | 50 | 100.00|
| 20180402| | 6 | 100.00|
| 20180403| | 118 | 100.00|
| 20180404| 3 | 1110 | 99.73|
| 20180405| 7 | 1204 | 99.42|
| 20180406| 10 | 1526 | 99.34|
| 20180407| | 14 | 100.00|
| 20180409| 4 | 1230 | 99.67|
| 20180410| 11 | 1456 | 99.24|
| 20180411| 6 | 1572 | 99.62|
| 20180412| 5 | 1450 | 99.66|
| 20180413| 7 | 1214 | 99.42|
.....
The problem is that I can't find the way to do it.
When I use:
scala> d1.join(d2,d2("fecha").contains(d1("fecha")), "left").show()
I lose the data that isn't in both tables.
+--------+-------+--------+----------+
| fecha|eventos| fecha|TotalEvent|
+--------+-------+--------+----------+
|20180404| 3|20180404| 1110|
|20180405| 7|20180405| 1204|
|20180406| 10|20180406| 1526|
|20180409| 4|20180409| 1230|
|20180410| 11|20180410| 1456|
....
Additionally, how can I add the new column with the math operation?
Thank you
I would recommend left-joining df2 with df1 and calculating KPI based on whether eventos is null or not in the joined dataset (using when/otherwise):
import org.apache.spark.sql.functions._
val df1 = Seq(
  ("20180404", 3),
  ("20180405", 7),
  ("20180406", 10),
  ("20180409", 4)
).toDF("fecha", "eventos")

val df2 = Seq(
  ("0", 23534),
  ("20180322", 10),
  ("20180326", 50),
  ("20180402", 6),
  ("20180403", 118),
  ("20180404", 1110),
  ("20180405", 100),
  ("20180406", 100)
).toDF("fecha", "TotalEvent")

df2.
  join(df1, Seq("fecha"), "left_outer").
  withColumn("KPI",
    round(
      when($"eventos".isNull, 100.0).
        otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"),
      2
    )
  ).
  show
// +--------+----------+-------+-----+
// | fecha|TotalEvent|eventos| KPI|
// +--------+----------+-------+-----+
// | 0| 23534| null|100.0|
// |20180322| 10| null|100.0|
// |20180326| 50| null|100.0|
// |20180402| 6| null|100.0|
// |20180403| 118| null|100.0|
// |20180404| 1110| 3|99.73|
// |20180405| 100| 7| 93.0|
// |20180406| 100| 10| 90.0|
// +--------+----------+-------+-----+
Note that if the more precise raw KPI is wanted instead, just remove the wrapping round( , 2).
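For readers coming from the PySpark question at the top of this page, the same left join plus when/otherwise KPI would look roughly like this (a sketch, assuming d1 and d2 carry the fecha, eventos and TotalEvent columns shown above):

from pyspark.sql import functions as F

kpi = d2.join(d1, ["fecha"], "left_outer").withColumn(
    "KPI",
    F.round(
        F.when(F.col("eventos").isNull(), 100.0)
         .otherwise((F.col("TotalEvent") - F.col("eventos")) * 100.0 / F.col("TotalEvent")),
        2
    )
)
kpi.show()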
I would do this in several steps: first join, then compute the calculated column, then fill in the NA values:
val df2a = df2.withColumnRenamed("fecha", "fecha2")  // to avoid ambiguous column names after the join
val df3 = df1.join(df2a, df1("fecha") === df2a("fecha2"), "outer")
val kpi = df3.withColumn("KPI", (($"TotalEvent" - $"eventos") / $"TotalEvent" * 100 as "KPI")).na.fill(100, Seq("KPI"))
kpi.show()
+--------+-------+--------+----------+-----------------+
| fecha|eventos| fecha2|TotalEvent| KPI|
+--------+-------+--------+----------+-----------------+
| null| null|20180402| 6| 100.0|
| null| null| 0| 23534| 100.0|
| null| null|20180322| 10| 100.0|
|20180404| 3|20180404| 1110|99.72972972972973|
|20180406| 10| null| null| 100.0|
| null| null|20180403| 118| 100.0|
| null| null|20180326| 50| 100.0|
|20180409| 4| null| null| 100.0|
|20180405| 7| null| null| 100.0|
+--------+-------+--------+----------+-----------------+
I solved the problem by mixing both of the suggestions I received:
val dfKPI = d1.join(right = d2, usingColumns = Seq("cliente", "fecha"), joinType = "outer")
  .orderBy("fecha")
  .withColumn("KPI", round(when($"eventos".isNull, 100.0).otherwise(($"TotalEvent" - $"eventos") * 100.0 / $"TotalEvent"), 2))

Joining data in Scala using array_contains() method

I have the below data in Scala in a Spark environment:
val abc = Seq(
(Array("A"),0.1),
(Array("B"),0.11),
(Array("C"),0.12),
(Array("A","B"),0.24),
(Array("A","C"),0.27),
(Array("B","C"),0.30),
(Array("A","B","C"),0.4)
).toDF("channel_set", "rate")
abc.show(false)
abc.createOrReplaceTempView("abc")
val df = abc.withColumn("totalChannels",size(col("channel_set"))).toDF()
df.show()
scala> df.show
+-----------+----+-------------+
|channel_set|rate|totalChannels|
+-----------+----+-------------+
| [A]| 0.1| 1|
| [B]|0.11| 1|
| [C]|0.12| 1|
| [A, B]|0.24| 2|
| [A, C]|0.27| 2|
| [B, C]| 0.3| 2|
| [A, B, C]| 0.4| 3|
+-----------+----+-------------+
val oneChannelDF = df.filter($"totalChannels" === 1)
oneChannelDF.show()
oneChannelDF.createOrReplaceTempView("oneChannelDF")
+-----------+----+-------------+
|channel_set|rate|totalChannels|
+-----------+----+-------------+
| [A]| 0.1| 1|
| [B]|0.11| 1|
| [C]|0.12| 1|
+-----------+----+-------------+
val twoChannelDF = df.filter($"totalChannels" === 2)
twoChannelDF.show()
twoChannelDF.createOrReplaceTempView("twoChannelDF")
+-----------+----+-------------+
|channel_set|rate|totalChannels|
+-----------+----+-------------+
| [A, B]|0.24| 2|
| [A, C]|0.27| 2|
| [B, C]| 0.3| 2|
+-----------+----+-------------+
I want to join oneChannel and twoChannel dataframes so that I can see my resultant data as below -
+-----------+----+-------------+------------+-------+
|channel_set|rate|totalChannels|channel_set | rate |
+-----------+----+-------------+------------+-------+
| [A]| 0.1| 1| [A,B] | 0.24 |
| [A]| 0.1| 1| [A,C] | 0.27 |
| [B]|0.11| 1| [A,B] | 0.24 |
| [B]|0.11| 1| [B,C] | 0.30 |
| [C]|0.12| 1| [A,C] | 0.27 |
| [C]|0.12| 1| [B,C] | 0.30 |
+-----------+----+-------------+------------+-------+
Basically I need all the rows where a record from the oneChannel dataframe is present in the twoChannel dataframe.
I have tried:
spark.sql("""select * from oneChannelDF one inner join twoChannelDF two on array_contains(one.channel_set,two.channel_set)""").show()
However, I am facing this error -
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(one.`channel_set`, two.`channel_set`)' due to data type mismatch: Arguments must be an array followed by a value of same type as the array members; line 1 pos 62;
I guess I figured out the error: I need to pass an array element as the second argument to the array_contains() method. Since the size of every array in the channel_set column of oneChannelDF is 1, the code below gets me the correct data frame.
scala> spark.sql("""select * from oneChannelDF one inner join twoChannelDF two where array_contains(two.channel_set,one.channel_set[0])""").show()
+-----------+----+-------------+-----------+----+-------------+
|channel_set|rate|totalChannels|channel_set|rate|totalChannels|
+-----------+----+-------------+-----------+----+-------------+
| [A]| 0.1| 1| [A, B]|0.24| 2|
| [A]| 0.1| 1| [A, C]|0.27| 2|
| [B]|0.11| 1| [A, B]|0.24| 2|
| [B]|0.11| 1| [B, C]| 0.3| 2|
| [C]|0.12| 1| [A, C]|0.27| 2|
| [C]|0.12| 1| [B, C]| 0.3| 2|
+-----------+----+-------------+-----------+----+-------------+
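If the arrays on the left side could ever hold more than one element, channel_set[0] would no longer be enough; Spark 2.4+ provides arrays_overlap for that general case. A PySpark sketch of it (the same function is available in the Scala API and in Spark SQL):

from pyspark.sql import functions as F

joined = oneChannelDF.alias("one").join(
    twoChannelDF.alias("two"),
    # true when the two channel_set arrays share at least one element
    F.arrays_overlap(F.col("one.channel_set"), F.col("two.channel_set")),
    "inner"
)
joined.show()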

How to get a running sum based on two columns using a Spark Scala RDD

I have data in an RDD which has 4 columns: geog, product, time and price. I want to calculate the running sum based on geog and time.
(The given data and the expected result were attached as images in the original question.)
I need this as a Spark Scala RDD. I am new to the Scala world; I can achieve this easily in SQL, but I want to do it with a Spark Scala RDD, e.g. using map and flatMap.
Thanks in advance for your help.
This is possible by defining a window function:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val data = List(
  ("India", "A1", "Q1", 40),
  ("India", "A2", "Q1", 30),
  ("India", "A3", "Q1", 21),
  ("German", "A1", "Q1", 50),
  ("German", "A3", "Q1", 60),
  ("US", "A1", "Q1", 60),
  ("US", "A2", "Q2", 25),
  ("US", "A4", "Q1", 20),
  ("US", "A5", "Q5", 15),
  ("US", "A3", "Q3", 10)
)

val df = sc.parallelize(data).toDF("country", "part", "quarter", "result")

df.show()
+-------+----+-------+------+
|country|part|quarter|result|
+-------+----+-------+------+
| India| A1| Q1| 40|
| India| A2| Q1| 30|
| India| A3| Q1| 21|
| German| A1| Q1| 50|
| German| A3| Q1| 60|
| US| A1| Q1| 60|
| US| A2| Q2| 25|
| US| A4| Q1| 20|
| US| A5| Q5| 15|
| US| A3| Q3| 10|
+-------+----+-------+------+
val window = Window.partitionBy("country").orderBy("part", "quarter")
val resultDF = df.withColumn("agg", sum(df("result")).over(window))

resultDF.show()
+-------+----+-------+------+---+
|country|part|quarter|result|agg|
+-------+----+-------+------+---+
| India| A1| Q1| 40| 40|
| India| A2| Q1| 30| 70|
| India| A3| Q1| 21| 91|
| US| A1| Q1| 60| 60|
| US| A2| Q2| 25| 85|
| US| A3| Q3| 10| 95|
| US| A4| Q1| 20|115|
| US| A5| Q5| 15|130|
| German| A1| Q1| 50| 50|
| German| A3| Q1| 60|110|
+-------+----+-------+------+---+
You can do this using window functions; please take a look at the Databricks blog post introducing window functions in Spark SQL:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Hope this helps.
Happy Sparking! Cheers, Fokko
I think this will help others also. I tried it with a Scala RDD:
val fileName_test_1 = "C:\\venkat_workshop\\Qintel\\Data_Files\\test_1.txt"

val rdd1 = sc.textFile(fileName_test_1)
  .map { x =>
    (x.split(",")(0).toString(),
     x.split(",")(1).toString(),
     x.split(",")(2).toString(),
     x.split(",")(3).toDouble)
  }
  .groupBy(x => (x._1, x._3))                    // group by (geog, time)
  .mapValues {
    _.toList
      .sortWith { (a, b) => (a._4) > (b._4) }    // sort each group by price, descending
      .scanLeft(("", "", "", 0.0, 0.0)) {        // 5th field carries the running sum
        (a, b) => (b._1, b._2, b._3, b._4, b._4 + a._5)
      }
      .tail                                      // drop the scanLeft seed element
  }
  .flatMapValues(f => f)
  .values
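For comparison with the PySpark question at the top of this page, the same RDD-only approach translates fairly directly to Python using itertools.accumulate. A rough sketch, assuming rdd1 is an RDD of comma-separated lines in the same geog, product, time, price order:

from itertools import accumulate

def add_running_sum(rows):
    # sort each (geog, time) group by price descending, then attach a running sum of price
    rows = sorted(rows, key=lambda r: r[3], reverse=True)
    running = accumulate(r[3] for r in rows)
    return [r + (s,) for r, s in zip(rows, running)]

result = (
    rdd1.map(lambda line: line.split(","))
        .map(lambda f: (f[0], f[1], f[2], float(f[3])))
        .groupBy(lambda r: (r[0], r[2]))          # key on (geog, time)
        .flatMapValues(add_running_sum)
        .values()
)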