PySpark - advanced aggregation of monthly data

I have a table of the following format.
| Customer | Month | Sales |
|----------|-------|-------|
| A        | 3     | 40    |
| A        | 2     | 50    |
| B        | 1     | 20    |
I need it in the format below:
| Customer | Month 1 | Month 2 | Month 3 |
|----------|---------|---------|---------|
| A        | 0       | 50      | 40      |
| B        | 20      | 0       | 0       |
Can you please help me solve this in PySpark?

This should help. I am assuming you are using SUM to aggregate values from the original DataFrame:
>>> df.show()
+--------+-----+-----+
|Customer|Month|Sales|
+--------+-----+-----+
| A| 3| 40|
| A| 2| 50|
| B| 1| 20|
+--------+-----+-----+
>>> import pyspark.sql.functions as F
>>> df2 = (df.withColumn('COLUMN_LABELS', F.concat(F.lit('Month '), F.col('Month')))
...          .groupby('Customer')
...          .pivot('COLUMN_LABELS')
...          .agg(F.sum('Sales'))
...          .fillna(0))
>>> df2.show()
+--------+-------+-------+-------+
|Customer|Month 1|Month 2|Month 3|
+--------+-------+-------+-------+
| A| 0| 50| 40|
| B| 20| 0| 0|
+--------+-------+-------+-------+
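If you want the month columns to stay consistent even when some months are missing from the data, you can pass the expected labels to pivot explicitly (a small sketch; the three labels are assumed from the example above, and supplying them also saves Spark a pass to compute the distinct pivot values):
>>> months = ['Month 1', 'Month 2', 'Month 3']
>>> df2 = (df.withColumn('COLUMN_LABELS', F.concat(F.lit('Month '), F.col('Month')))
...          .groupby('Customer')
...          .pivot('COLUMN_LABELS', months)
...          .agg(F.sum('Sales'))
...          .fillna(0))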

Related

PySpark - how can I get the number of elements that account for 80% of the total in a column?

I have a column named "Sales" and another column with the salesman, and I want to know how many salesmen account for 80% of the sales for each sales type (A, B, C).
For this example,
+---------+------+-----+
|salesman |sales |type |
+---------+------+-----+
|       5 |    9 |   a |
|       8 |   12 |   b |
|       6 |    3 |   b |
|       6 |    1 |   a |
|       1 |    3 |   a |
|       5 |    1 |   b |
|       2 |   11 |   b |
|       4 |    3 |   a |
|       1 |    1 |   b |
|       2 |    3 |   a |
|       3 |    4 |   a |
+---------+------+-----+
The result should be:
+-----+----------+
|type |Salesman80|
+-----+----------+
| a   |         4|
| b   |         2|
+-----+----------+
We'll find the total sales per type,
join this back to the table,
group by salesman and type to get each salesman's total sales per type,
then use a math trick to get the percentage.
You can then chop this table to any percentage you wish with a where clause (a sketch of that is shown after the output below).
# create some data
from pyspark.sql.functions import floor, rand, sum, avg, col, expr

data = spark.range(1, 1000)
sales = data.select(
    data.id,
    floor(rand() * 13).alias("salesman"),
    floor((rand() * 26) + 65).alias("type"),
    floor(rand() * 26).alias("sale"))

totalSales = sales.groupby(sales.type)\
    .agg(sum(sales.sale).alias("total_sales"))\
    .select(col("*"), expr("chr(type)").alias("type_"))  # convert the int type code back to a character

sales.join(totalSales, ["type"])\
    .groupby("salesman", "type_")\
    .agg((sum("sale") / avg("total_sales")).alias("percentage"),
         # math trick: total_sales is the same on every row of a type, so its average equals the total sales for that type
         avg("total_sales").alias("total_sales_by_type")
    ).show()
+--------+-----+--------------------+-------------------+
|salesman|type_| percentage|total_sales_by_type|
+--------+-----+--------------------+-------------------+
| 10| H| 0.04710144927536232| 552.0|
| 9| U| 0.21063394683026584| 489.0|
| 0| I| 0.09266409266409266| 518.0|
| 11| K| 0.09683426443202979| 537.0|
| 0| F|0.027070063694267517| 628.0|
| 11| F|0.054140127388535034| 628.0|
| 1| G| 0.08086253369272237| 371.0|
| 5| N| 0.1693548387096774| 496.0|
| 9| L| 0.05353728489483748| 523.0|
| 7| R|0.003058103975535...| 327.0|
| 0| C| 0.05398457583547558| 389.0|
| 6| G| 0.1105121293800539| 371.0|
| 12| A|0.057007125890736345| 421.0|
| 0| J| 0.09876543209876543| 567.0|
| 11| B| 0.11337209302325581| 344.0|
| 8| K| 0.08007448789571694| 537.0|
| 4| N| 0.06854838709677419| 496.0|
| 11| H| 0.1358695652173913| 552.0|
| 10| W| 0.11617312072892938| 439.0|
| 1| C| 0.06940874035989718| 389.0|
+--------+-----+--------------------+-------------------+
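As a concrete example of the "chop with a where clause" mentioned above (a sketch; the by_salesman name and the 10% threshold are my own choices, not part of the original answer):
by_salesman = sales.join(totalSales, ["type"])\
    .groupby("salesman", "type_")\
    .agg((sum("sale") / avg("total_sales")).alias("percentage"))

# keep only salesmen holding at least 10% of their type's sales; adjust the threshold as needed
by_salesman.where(col("percentage") >= 0.10).show()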
I am guessing you mean "how many salesmen contribute to 80% of the total sales per type", and that you want the lowest possible number of salesmen.
If that is what you meant, you can do it in these steps:
1. Calculate the total sales per group.
2. Get the cumulative sum of the sales percentage (sales / total sales).
3. Assign a row number by ordering sales in descending order.
4. Take the minimum row number where the cumulative sum of the sales percentage >= 80% per group.
Note this is probably not the most efficient approach, but it produces what you want.
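For reference, a minimal way to build the sample data from the question as df (a sketch; only the values and column names shown above are used):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(5, 9, 'a'), (8, 12, 'b'), (6, 3, 'b'), (6, 1, 'a'), (1, 3, 'a'), (5, 1, 'b'),
     (2, 11, 'b'), (4, 3, 'a'), (1, 1, 'b'), (2, 3, 'a'), (3, 4, 'a')],
    ['salesman', 'sales', 'type'])
And the solution itself: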
from pyspark.sql import functions as F
from pyspark.sql.window import Window

part_window = Window.partitionBy('type')
order_window = part_window.orderBy(F.desc('sales'))
cumsum_window = order_window.rowsBetween(Window.unboundedPreceding, 0)

df = (df.withColumn('total_sales', F.sum('sales').over(part_window))  # Step 1
        .select('*',
                F.sum(F.col('sales') / F.col('total_sales')).over(cumsum_window).alias('cumsum_percent'),  # Step 2
                F.row_number().over(order_window).alias('rn'))  # Step 3
        .groupby('type')  # Step 4
        .agg(F.min(F.when(F.col('cumsum_percent') >= 0.8, F.col('rn'))).alias('Salesman80')))
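Applied to that sample, the result matches the expected output from the question (row order may differ):
df.show()
# +----+----------+
# |type|Salesman80|
# +----+----------+
# |   a|         4|
# |   b|         2|
# +----+----------+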

PySpark Column Creation by queuing filtered past rows

In PySpark, I want to make a new column in an existing table that stores the last K texts for a particular user that had label 1.
Example-
Index | user_name | text | label |
  0   | u1        | t0   |   0   |
  1   | u1        | t1   |   1   |
  2   | u2        | t2   |   0   |
  3   | u1        | t3   |   1   |
  4   | u2        | t4   |   0   |
  5   | u2        | t5   |   1   |
  6   | u2        | t6   |   1   |
  7   | u1        | t7   |   0   |
  8   | u1        | t8   |   1   |
  9   | u1        | t9   |   0   |
The table after adding the new column (text_list) should be as follows, storing the last K = 2 label-1 messages for each user.
Index | user_name | text | label | text_list |
  0   | u1        | t0   |   0   | []        |
  1   | u1        | t1   |   1   | []        |
  2   | u2        | t2   |   0   | []        |
  3   | u1        | t3   |   1   | [t1]      |
  4   | u2        | t4   |   0   | []        |
  5   | u2        | t5   |   1   | []        |
  6   | u2        | t6   |   1   | [t5]      |
  7   | u1        | t7   |   0   | [t3, t1]  |
  8   | u1        | t8   |   1   | [t3, t1]  |
  9   | u1        | t9   |   0   | [t8, t3]  |
A naïve way to do this would be to loop through each row and maintain a queue for each user. But the table could have millions of rows. Can we do this without looping in a more scalable, efficient way?
If you are using Spark version >= 2.4, there is a way you can try. Let's say df is your dataframe.
df.show()
# +-----+---------+----+-----+
# |Index|user_name|text|label|
# +-----+---------+----+-----+
# | 0| u1| t0| 0|
# | 1| u1| t1| 1|
# | 2| u2| t2| 0|
# | 3| u1| t3| 1|
# | 4| u2| t4| 0|
# | 5| u2| t5| 1|
# | 6| u2| t6| 1|
# | 7| u1| t7| 0|
# | 8| u1| t8| 1|
# | 9| u1| t9| 0|
# +-----+---------+----+-----+
Two steps:
get a list of structs of the columns text and label over a window using collect_list
filter the array where label = 1 and take the text values, sort the array in descending order using sort_array, and take the first two elements using slice
It would be something like this:
from pyspark.sql.functions import col, collect_list, struct, expr, sort_array, slice
from pyspark.sql.window import Window
# window : first row to row before current row
w = Window.partitionBy('user_name').orderBy('index').rowsBetween(Window.unboundedPreceding, -1)
df = (df
      .withColumn('text_list', collect_list(struct(col('text'), col('label'))).over(w))
      .withColumn('text_list', slice(sort_array(expr("FILTER(text_list, value -> value.label = 1).text"), asc=False), 1, 2))
     )
df.sort('Index').show()
# +-----+---------+----+-----+---------+
# |Index|user_name|text|label|text_list|
# +-----+---------+----+-----+---------+
# | 0| u1| t0| 0| []|
# | 1| u1| t1| 1| []|
# | 2| u2| t2| 0| []|
# | 3| u1| t3| 1| [t1]|
# | 4| u2| t4| 0| []|
# | 5| u2| t5| 1| []|
# | 6| u2| t6| 1| [t5]|
# | 7| u1| t7| 0| [t3, t1]|
# | 8| u1| t8| 1| [t3, t1]|
# | 9| u1| t9| 0| [t8, t3]|
# +-----+---------+----+-----+---------+
Thanks to the solution posted here. I modified it slightly (since it assumed the text field can be sorted) and finally arrived at a working solution. Here it is:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, collect_list, slice, reverse

K = 2

# window: all rows for the user before the current row
windowPast = Window.partitionBy("user_name").orderBy("Index")\
    .rowsBetween(Window.unboundedPreceding, Window.currentRow - 1)

# collect_list ignores nulls, so only the texts with label == 1 are kept
df.withColumn("text_list",
              collect_list(when(col("label") == 1, col("text")).otherwise(F.lit(None)))
              .over(windowPast))\
  .withColumn("text_list", slice(reverse(col("text_list")), 1, K))\
  .sort(F.col("Index"))\
  .show()

Add column elements to a DataFrame - Scala Spark

I have two dataframes, and I want to add one to all rows of the other one.
My dataframes are like:
id | name | rate
 1 | a    | 3
 1 | b    | 4
 1 | c    | 1
 2 | a    | 2
 2 | d    | 4

name
a
b
c
d
e
And I want a result like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
1 | d | null
1 | e | null
2 | a | 2
2 | b | null
2 | c | null
2 | d | 4
2 | e | null
How can I do this?
It's a bit more than a simple join: cross join the distinct ids with the names, then left join back to the original to pick up the rates:
val df = df1.select("id").distinct()
  .crossJoin(df2)
  .join(df1, Seq("name", "id"), "left")
  .orderBy("id", "name")
df.show
+----+---+----+
|name| id|rate|
+----+---+----+
| a| 1| 3|
| b| 1| 4|
| c| 1| 1|
| d| 1|null|
| e| 1|null|
| a| 2| 2|
| b| 2|null|
| c| 2|null|
| d| 2| 4|
| e| 2|null|
+----+---+----+

How to calculate the 5-day mean, 10-day mean & 15-day mean for given data?

Scenario:
I have the following dataframe:
```
companyId | calc_date  | mean |
----------|------------|------|
1111      | 01-08-2002 | 15   |
1111      | 02-08-2002 | 16.5 |
1111      | 03-08-2002 | 17   |
1111      | 04-08-2002 | 15   |
1111      | 05-08-2002 | 23   |
1111      | 06-08-2002 | 22.6 |
1111      | 07-08-2002 | 25   |
1111      | 08-08-2002 | 15   |
1111      | 09-08-2002 | 15   |
1111      | 10-08-2002 | 16.5 |
1111      | 11-08-2002 | 22.6 |
1111      | 12-08-2002 | 15   |
1111      | 13-08-2002 | 16.5 |
1111      | 14-08-2002 | 25   |
1111      | 15-08-2002 | 16.5 |
```
Required:
For the given data, calculate the 5-day mean, 10-day mean, and 15-day mean for every record of every company.
5-day mean  --> aggregated over the previous 5 available days' mean values
10-day mean --> aggregated over the previous 10 available days' mean values
15-day mean --> aggregated over the previous 15 available days' mean values
The resultant dataframe should have the calculated columns as below:
----------------------------------------------------------------------------
companyId | calc_date | mean | 5-day mean | 10-day mean | 15-day mean |
----------------------------------------------------------------------------
Question:
How can I achieve this?
What is the best way to do this in Spark?
Here's one approach: use a Window partitioned by company to compute the n-day mean over the current row and the previous rows within the specified timestamp range via rangeBetween, as shown below (using a dummy dataset):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = (1 to 3).flatMap(i => Seq.tabulate(15)(j => (i, s"${j+1}-2-2019", j+1))).
toDF("company_id", "calc_date", "mean")
df.show
// +----------+---------+----+
// |company_id|calc_date|mean|
// +----------+---------+----+
// | 1| 1-2-2019| 1|
// | 1| 2-2-2019| 2|
// | 1| 3-2-2019| 3|
// | 1| 4-2-2019| 4|
// | 1| 5-2-2019| 5|
// | ... |
// | 1|14-2-2019| 14|
// | 1|15-2-2019| 15|
// | 2| 1-2-2019| 1|
// | 2| 2-2-2019| 2|
// | 2| 3-2-2019| 3|
// | ... |
// +----------+---------+----+
// window partitioned by company, ordered by the unix timestamp of the date
def winSpec = Window.partitionBy("company_id").orderBy("ts")
// frame covering the previous `days` days (in seconds) up to and including the current row
def dayRange(days: Int) = winSpec.rangeBetween(-(days * 24 * 60 * 60), 0)

df.
  withColumn("ts", unix_timestamp(to_date($"calc_date", "d-M-yyyy"))).
  withColumn("mean-5", mean($"mean").over(dayRange(5))).
  withColumn("mean-10", mean($"mean").over(dayRange(10))).
  withColumn("mean-15", mean($"mean").over(dayRange(15))).
  show
// +----------+---------+----+----------+------+-------+-------+
// |company_id|calc_date|mean| ts|mean-5|mean-10|mean-15|
// +----------+---------+----+----------+------+-------+-------+
// | 1| 1-2-2019| 1|1549008000| 1.0| 1.0| 1.0|
// | 1| 2-2-2019| 2|1549094400| 1.5| 1.5| 1.5|
// | 1| 3-2-2019| 3|1549180800| 2.0| 2.0| 2.0|
// | 1| 4-2-2019| 4|1549267200| 2.5| 2.5| 2.5|
// | 1| 5-2-2019| 5|1549353600| 3.0| 3.0| 3.0|
// | 1| 6-2-2019| 6|1549440000| 3.5| 3.5| 3.5|
// | 1| 7-2-2019| 7|1549526400| 4.5| 4.0| 4.0|
// | 1| 8-2-2019| 8|1549612800| 5.5| 4.5| 4.5|
// | 1| 9-2-2019| 9|1549699200| 6.5| 5.0| 5.0|
// | 1|10-2-2019| 10|1549785600| 7.5| 5.5| 5.5|
// | 1|11-2-2019| 11|1549872000| 8.5| 6.0| 6.0|
// | 1|12-2-2019| 12|1549958400| 9.5| 7.0| 6.5|
// | 1|13-2-2019| 13|1550044800| 10.5| 8.0| 7.0|
// | 1|14-2-2019| 14|1550131200| 11.5| 9.0| 7.5|
// | 1|15-2-2019| 15|1550217600| 12.5| 10.0| 8.0|
// | 3| 1-2-2019| 1|1549008000| 1.0| 1.0| 1.0|
// | 3| 2-2-2019| 2|1549094400| 1.5| 1.5| 1.5|
// | 3| 3-2-2019| 3|1549180800| 2.0| 2.0| 2.0|
// | 3| 4-2-2019| 4|1549267200| 2.5| 2.5| 2.5|
// | 3| 5-2-2019| 5|1549353600| 3.0| 3.0| 3.0|
// +----------+---------+----+----------+------+-------+-------+
// only showing top 20 rows
Note that one could use rowsBetween (as opposed to rangeBetween) directly on calc_date if the dates are guaranteed to be contiguous per-day time series.
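A minimal sketch of that rowsBetween alternative, written in PySpark for consistency with the rest of the thread (assumptions: one row per company per day, calc_date parsed with to_date so the ordering is correct, and the row_range name is my own):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def row_range(days):
    # previous `days` rows plus the current row, counted by rows rather than by a time range
    return (Window.partitionBy("company_id")
                  .orderBy(F.to_date("calc_date", "d-M-yyyy"))
                  .rowsBetween(-days, 0))

df_with_means = (df.withColumn("mean-5", F.mean("mean").over(row_range(5)))
                   .withColumn("mean-10", F.mean("mean").over(row_range(10)))
                   .withColumn("mean-15", F.mean("mean").over(row_range(15))))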

Pyspark Join Tables

I'm new to PySpark. I have 'Table A' and 'Table B' and I need to join them to get 'Table C'. Can anyone please help me?
I'm using DataFrames...
I don't know how to join those tables together in the right way...
Table A:
+--+----------+-----+
|id|year_month| qt |
+--+----------+-----+
| 1| 2015-05| 190 |
| 2| 2015-06| 390 |
+--+----------+-----+
Table B:
+----------+-----+
|year_month| sem |
+----------+-----+
|   2016-01|    1|
|   2015-02|    1|
|   2015-03|    1|
|   2016-04|    1|
|   2015-05|    1|
|   2015-06|    1|
|   2016-07|    2|
|   2015-08|    2|
|   2015-09|    2|
|   2016-10|    2|
|   2015-11|    2|
|   2015-12|    2|
+----------+-----+
Table C:
The join adds columns and also adds rows...
+--+----------+-----+-----+
|id|year_month| qt | sem |
+--+----------+-----+-----+
| 1| 2015-05 | 0 | 1 |
| 1| 2016-01 | 0 | 1 |
| 1| 2015-02 | 0 | 1 |
| 1| 2015-03 | 0 | 1 |
| 1| 2016-04 | 0 | 1 |
| 1| 2015-05 | 190 | 1 |
| 1| 2015-06 | 0 | 1 |
| 1| 2016-07 | 0 | 2 |
| 1| 2015-08 | 0 | 2 |
| 1| 2015-09 | 0 | 2 |
| 1| 2016-10 | 0 | 2 |
| 1| 2015-11 | 0 | 2 |
| 1| 2015-12 | 0 | 2 |
| 2| 2015-05 | 0 | 1 |
| 2| 2016-01 | 0 | 1 |
| 2| 2015-02 | 0 | 1 |
| 2| 2015-03 | 0 | 1 |
| 2| 2016-04 | 0 | 1 |
| 2| 2015-05 | 0 | 1 |
| 2| 2015-06 | 390 | 1 |
| 2| 2016-07 | 0 | 2 |
| 2| 2015-08 | 0 | 2 |
| 2| 2015-09 | 0 | 2 |
| 2| 2016-10 | 0 | 2 |
| 2| 2015-11 | 0 | 2 |
| 2| 2015-12 | 0 | 2 |
+--+----------+-----+-----+
Code:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
lA = [(1,"2015-05",190),(2,"2015-06",390)]
tableA = sqlContext.createDataFrame(lA, ["id","year_month","qt"])
tableA.show()
lB = [("2016-01",1),("2015-02",1),("2015-03",1),("2016-04",1),
("2015-05",1),("2015-06",1),("2016-07",2),("2015-08",2),
("2015-09",2),("2016-10",2),("2015-11",2),("2015-12",2)]
tableB = sqlContext.createDataFrame(lB,["year_month","sem"])
tableB.show()
It's not really a join, more a cartesian product (cross join):
Spark 2
import pyspark.sql.functions as psf
tableA.crossJoin(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)
Spark 1.6
tableA.join(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)
+---+---+----------+---+
| id| qt|year_month|sem|
+---+---+----------+---+
| 1| 0| 2015-02| 1|
| 1| 0| 2015-03| 1|
| 1|190| 2015-05| 1|
| 1| 0| 2015-06| 1|
| 1| 0| 2016-01| 1|
| 1| 0| 2016-04| 1|
| 1| 0| 2015-08| 2|
| 1| 0| 2015-09| 2|
| 1| 0| 2015-11| 2|
| 1| 0| 2015-12| 2|
| 1| 0| 2016-07| 2|
| 1| 0| 2016-10| 2|
| 2| 0| 2015-02| 1|
| 2| 0| 2015-03| 1|
| 2| 0| 2015-05| 1|
| 2|390| 2015-06| 1|
| 2| 0| 2016-01| 1|
| 2| 0| 2016-04| 1|
| 2| 0| 2015-08| 2|
| 2| 0| 2015-09| 2|
| 2| 0| 2015-11| 2|
| 2| 0| 2015-12| 2|
| 2| 0| 2016-07| 2|
| 2| 0| 2016-10| 2|
+---+---+----------+---+
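If you also want the columns in the same order as Table C, you can reselect them from the result and sort by id (a small sketch; result is my own name for the output of the Spark 2 snippet above):
result = tableA.crossJoin(tableB)\
    .withColumn(
        "qt",
        psf.when(tableB.year_month == tableA.year_month, psf.col("qt")).otherwise(0))\
    .drop(tableA.year_month)

result.select("id", "year_month", "qt", "sem").orderBy("id", "year_month").show(30)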