Thanks in advance for any help.
I have a table with unique tickets, customer IDs and ticket prices. For each ticket, I want to see the number of tickets and the total revenue from that customer in the 3 months after the date of the ticket.
I tried using a window function with PARTITION BY and the date condition set in the ON clause, but it evaluates all of the customer's tickets rather than just the 3-month period I want.
select distinct on (at2.ticket_number)
at2.customer_id
,at2.ticket_id
,at2.ticket_number
,at2.initial_sale_date
,ata.tix "a_tix"
,ata.aov "a_aov"
,ata.rev "a_rev"
from reports.agg_tickets at2
left join (
    select at2.customer_id,
           at2.final_fare_value,
           at2.initial_sale_date,
           count(at2.customer_id) over (partition by at2.customer_id) as tix,
           avg(at2.final_fare_value) over (partition by at2.customer_id) as aov,
           sum(at2.final_fare_value) over (partition by at2.customer_id) as rev
    from reports.agg_tickets at2
) ata
on (ata.customer_id = at2.customer_id
and ata.initial_sale_date > at2.initial_sale_date
and ata.initial_sale_date < at2.initial_sale_date + interval '3 months')
I could use a LEFT JOIN LATERAL, but it takes far too long. I'm slightly confused about how to achieve what I want, so any help would be greatly appreciated.
Many thanks
Edit:
Here is a sample of the data (picture of the data table).
The table is unique on ticket number, but not on customer.
There is no need to use a join at all; as you observed, that yields problematic performance.
What you need is a plain window function with a frame_clause that considers the next 3 months for each ticket.
Example (self-explanatory):
count(*) over (partition by customer_id order by initial_sale_date
range between current row and '3 months' following) ticket_cnt
Here is a full query with simplified sample data and the result:
with dt as (
select * from (values
(1, 1, date'2020-01-01', 10),
(1, 2, date'2020-02-01', 15),
(1, 3, date'2020-03-01', 20),
(1, 4, date'2020-04-01', 25),
(1, 5, date'2020-05-01', 30),
(2, 6, date'2020-01-01', 15),
(2, 7, date'2020-02-01', 20),
(2, 7, date'2021-01-01', 25)
) tab (customer_id, ticket_id, initial_sale_date, final_fare_value)
)
select
customer_id, ticket_id, initial_sale_date, final_fare_value,
count(*) over (partition by customer_id order by initial_sale_date range between current row and '3 months' following) ticket_cnt,
sum(final_fare_value) over (partition by customer_id order by initial_sale_date range between current row and '3 months' following) ticket_sum
from dt;
customer_id|ticket_id|initial_sale_date|final_fare_value|ticket_cnt|ticket_sum|
-----------+---------+-----------------+----------------+----------+----------+
1| 1| 2020-01-01| 10| 4| 70|
1| 2| 2020-02-01| 15| 4| 90|
1| 3| 2020-03-01| 20| 3| 75|
1| 4| 2020-04-01| 25| 2| 55|
1| 5| 2020-05-01| 30| 1| 30|
2| 6| 2020-01-01| 15| 2| 35|
2| 7| 2020-02-01| 20| 1| 20|
2| 7| 2021-01-01| 25| 1| 25|
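For the a_aov column from the original query, the same frame can be reused; a quick, untested sketch (alias name assumed):
avg(final_fare_value) over (partition by customer_id order by initial_sale_date range between current row and '3 months' following) ticket_aov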
Related
I have a table with this format (id is the PK):
id|timestamps |year|month|day|groups_ids|status |SCHEDULED |uid|
--|-------------------|----|-----|---|----------|-------|-------------------|---|
1|2021-02-04 17:18:24|2020| 8| 9| 1|OK |2020-08-09 00:00:00| 1|
2|2021-02-04 17:18:09|2020| 9| 9| 1|OK |2020-09-09 00:00:00| 1|
3|2021-02-04 17:19:51|2020| 10| 9| 1|HOLD |2020-10-09 00:00:00| 1|
4|2021-02-04 17:19:04|2020| 10| 10| 2|HOLD |2020-10-09 00:00:00| 1|
5|2021-02-04 17:18:30|2020| 10| 11| 2|HOLD |2020-10-09 00:00:00| 1|
6|2021-02-04 17:18:57|2020| 10| 12| 2|OK |2020-10-09 00:00:00| 1|
7|2021-02-04 17:18:24|2020| 8| 9| 1|HOLD |2020-08-09 00:00:00| 2|
8|2021-02-04 17:18:09|2020| 9| 9| 2|HOLD |2020-09-09 00:00:00| 2|
9|2021-02-04 17:19:51|2020| 10| 9| 2|HOLD |2020-10-09 00:00:00| 2|
10|2021-02-04 17:19:04|2020| 10| 10| 2|HOLD |2020-10-09 00:00:00| 2|
11|2021-02-04 17:18:30|2020| 10| 11| 2|HOLD |2020-10-09 00:00:00| 2|
12|2021-02-04 17:18:57|2020| 10| 12| 2|HOLD |2020-10-09 00:00:00| 2|
The job is: I want to extract every groups_ids for each uid where the status is OK, ordered by SCHEDULED ascending; if no OK is found in the records for that uid, it should take the latest HOLD based on year, month and day. After that, I want to compute a weighted score from the groups_ids:
group_ids > score
1 > 100
2 > 80
3 > 60
4 > 50
5 > 10
6 > 50
7 > 0
So [1,1,2] becomes (100+100+80) = 280, and the result will look like this:
ids|uid|pattern|score|
---|---|-------|-----|
1| 1|[1,1,2]| 280|
2| 2|[2] | 80|
It's pretty hard since I cannot find anything like Python's for loop and append operations in PostgreSQL.
step-by-step demo: db<>fiddle
SELECT
s.uid, s.values,
sum(v.value) as score
FROM (
SELECT DISTINCT ON (uid)
uid,
CASE
WHEN cardinality(ok_count) > 0 THEN ok_count
ELSE ARRAY[last_value]
END as values
FROM (
SELECT
*,
ARRAY_AGG(groups_ids) FILTER (WHERE status = 'OK') OVER (PARTITION BY uid ORDER BY scheduled) as ok_count,
first_value(groups_ids) OVER (PARTITION BY uid ORDER BY year, month DESC) as last_value
FROM mytable
) s
ORDER BY uid, scheduled DESC
) s,
unnest(values) as u_group_id
JOIN (VALUES
(1, 100), (2, 80), (3, 60), (4, 50), (5,10), (6, 50), (7, 0)
) v(group_id, value) ON v.group_id = u_group_id
GROUP BY s.uid, s.values
Phew... quite complex. Let's have a look at the steps:
a)
SELECT
*,
-- 1:
ARRAY_AGG(groups_ids) FILTER (WHERE status = 'OK') OVER (PARTITION BY uid ORDER BY scheduled) as ok_count,
-- 2:
first_value(groups_ids) OVER (PARTITION BY uid ORDER BY year, month DESC) as last_value
FROM mytable
Using the array_agg() window function creates an array of the groups_ids without losing the other data, as we would with a simple GROUP BY. The FILTER clause puts only the status = 'OK' records into the array.
Find the last groups_ids of a group (partition) using the first_value() window function: in descending order it returns the last value.
b)
SELECT DISTINCT ON (uid) -- 2
uid,
CASE -- 1
WHEN cardinality(ok_count) > 0 THEN ok_count
ELSE ARRAY[last_value]
END as values
FROM (
...
) s
ORDER BY uid, scheduled DESC -- 2
The CASE expression either takes the previously created array (from step a1) or, if there is none, takes the last value (from step a2) and wraps it in a one-element array.
The DISTINCT ON clause returns only the first row of an ordered group. The group is your uid and the order is given by the column scheduled. Since you want the last record within the group rather than the first, you have to order it DESC so that the most recent one becomes the topmost record, which is the one DISTINCT ON keeps.
c)
SELECT
uid,
group_id
FROM (
...
) s,
unnest(values) as group_id -- 1
The arrays should be extracted into one record per element. That helps to join the weighted values later.
d)
SELECT
s.uid, s.values,
sum(v.weighted_value) as score -- 2
FROM (
...
) s,
unnest(values) as u_group_id
JOIN (VALUES
(1, 100), (2, 80), ...
) v(group_id, weighted_value) ON v.group_id = u_group_id -- 1
GROUP BY s.uid, s.values -- 2
Join your weighted values to the array elements. Naturally, the weights can come from a table, a query, or anything else.
Regroup by uid to calculate the SUM() of the weighted values.
Additional note:
You should avoid storing duplicate data. You don't need to store the date parts year, month and day if you also store the complete date; you can always calculate them from it.
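For example, with the sample table above, the parts can always be derived on the fly (a quick sketch):
select
    id,
    timestamps,
    extract(year from timestamps) as year,
    extract(month from timestamps) as month,
    extract(day from timestamps) as day
from mytable;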
I need to compute the sum of the values over the last 30 days (exclusive) relative to each date, for every product in every store.
Assuming all months have 30 days:
date|store|product|values
2020-06-30|Store1|Product1|1
2020-07-02|Store1|Product2|4
2020-07-01|Store2|Product1|3
2020-07-18|Store1|Product1|4
2020-07-18|Store1|Product2|2
2020-07-18|Store2|Product1|2
2020-07-30|Store1|Product1|1
2020-08-01|Store1|Product1|1
2020-08-01|Store1|Product2|1
2020-08-01|Store2|Product1|6
For the lines of day 2020-08-01, sum the values from (2020-08-01 - 30 days) up to 2020-07-31 and put the result in the 2020-08-01 line, like this:
(the first line doesn't include '2020-06-30' because it is more than 30 days before, nor '2020-08-01' itself because it is the same day, and this goes on...)
date|store|product|sum_values_over_last_30_days_to_this_date
2020-08-01|Store1|Product1|5
2020-08-01|Store1|Product2|6
2020-08-01|Store2|Product1|5
....
I tried this below, and it didn't work either:
spark.sql("""
SELECT
a.date,
a.store,
a.product,
SUM(a.values) OVER (PARTITION BY a.product,a.store ORDER BY a.date BETWEEN a.date - INTERVAL '1' DAY AND a.date - INTERVAL '30' DAY) AS sum
FROM table a
""").show()
Can anybody help me?
You can try a self-join rather than a window function; maybe this kind of join will work:
SELECT
a.date,
a.store,
a.product,
SUM(IFNULL(b.value,0))
FROM
table a
LEFT JOIN
(
SELECT
a.date,
a.store,
a.product,
a.value
FROM
table a
)b
ON
a.store = b.store
AND
a.product = b.product
AND
b.date >= a.date - INTERVAL 30 DAYS
AND b.date < a.date
GROUP BY
1,2,3
Make sure to sum the value from the inner query, so you get the sum up to (but not including) each date.
Here is my attempt with a sample dataframe.
+----------+------+--------+------+
| date| store| product|values|
+----------+------+--------+------+
|2020-08-10|Store1|Product1| 1|
|2020-08-11|Store1|Product1| 1|
|2020-08-12|Store1|Product1| 1|
|2020-08-13|Store1|Product2| 1|
|2020-08-14|Store1|Product2| 1|
|2020-08-15|Store1|Product2| 1|
|2020-08-16|Store1|Product1| 1|
|2020-08-17|Store1|Product1| 1|
|2020-08-18|Store1|Product1| 1|
|2020-08-19|Store1|Product2| 1|
|2020-08-20|Store1|Product2| 1|
|2020-08-21|Store1|Product2| 1|
|2020-08-22|Store1|Product1| 1|
|2020-08-21|Store1|Product1| 1|
|2020-08-22|Store1|Product1| 1|
|2020-08-20|Store1|Product2| 1|
|2020-08-21|Store1|Product2| 1|
|2020-08-22|Store1|Product2| 1|
+----------+------+--------+------+
df.withColumn("date", to_date($"date"))
.createOrReplaceTempView("table")
spark.sql("""
SELECT
date,
store,
product,
COALESCE(SUM(values) OVER (PARTITION BY 1 ORDER BY date RANGE BETWEEN 3 PRECEDING AND 1 PRECEDING), 0) as sum
FROM table
""").show()
+----------+------+--------+---+
| date| store| product|sum|
+----------+------+--------+---+
|2020-08-10|Store1|Product1|0.0|
|2020-08-11|Store1|Product1|1.0|
|2020-08-12|Store1|Product1|2.0|
|2020-08-13|Store1|Product2|3.0|
|2020-08-14|Store1|Product2|3.0|
|2020-08-15|Store1|Product2|3.0|
|2020-08-16|Store1|Product1|3.0|
|2020-08-17|Store1|Product1|3.0|
|2020-08-18|Store1|Product1|3.0|
|2020-08-19|Store1|Product2|3.0|
|2020-08-20|Store1|Product2|3.0|
|2020-08-20|Store1|Product2|3.0|
|2020-08-21|Store1|Product2|4.0|
|2020-08-21|Store1|Product1|4.0|
|2020-08-21|Store1|Product2|4.0|
|2020-08-22|Store1|Product1|6.0|
|2020-08-22|Store1|Product1|6.0|
|2020-08-22|Store1|Product2|6.0|
+----------+------+--------+---+
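The demo above uses a single partition and a small 3-day frame just to show the mechanics. For the actual requirement (per store and product, the previous 30 days, excluding the current date), the same pattern might look like the sketch below; like the demo, it assumes integer RANGE offsets on a date ordering column are interpreted as days:
spark.sql("""
SELECT
    date,
    store,
    product,
    -- previous 30 days, excluding the current date
    COALESCE(SUM(values) OVER (PARTITION BY store, product ORDER BY date RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING), 0) AS sum_last_30_days
FROM table
""").show()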
I am a new Spark Scala developer and I have a simple table like this:
City Dept employee_id employee_salary
NY FI 10 10000
NY FI null 20000
WDC IT 30 100000
LA IT 40 500
What I want to do is calculate:
the number of employees by city & department (a simple count(*))
the number of employees with a non-null id
the number of employees with a small (< 500), medium (< 1000), or high (> 1000) salary => 3 additional counts!
So, as output, I should have something like this:
City Dept total_emp total_emp_non_null_id total_emp_small total_emp_medium total_emp_high
NY FI 100 90 10 70 10
WDC IT 200 100
LA IT 10 10
The challenge here is that I want to optimize those counts, because as you can see there are many counts to do. Without optimisation, the brute-force approach is to do one count per request, generate a new DF result after each count, and at the end do a left join on the fixed columns (city & dept) to add those new columns. But that would be too heavy since my table contains a lot of data.
I think window functions could simplify this, but I am not sure.
Can you help me with at least two of the cases (id != null and salary < 500)?
Thank you in advance
scala> df.show
+----+----+-----------+---------------+
|City|Dept|employee_id|employee_salary|
+----+----+-----------+---------------+
| NY| FI| 10| 10000|
| NY| FI| null| 20000|
| WDC| IT| 30| 100000|
| LA| IT| 40| 500|
| LA| IT| 40| 600|
| LA| IT| null| 200|
+----+----+-----------+---------------+
scala> val df1 = df.withColumn("NonNullEmpID", when(col("employee_id").isNotNull, lit(1)).otherwise(lit(0)))
.withColumn("Salary_Cat",
when(col("employee_salary") < 500, lit("SmallSalary")).
when(col("employee_salary") < 1000, lit("MediumSalary")).
when(col("employee_salary") >= 1000, lit("HighSalary")))
scala> val SalaryMap = df1.groupBy("Salary_Cat").agg(count(lit(1)).alias("count")).collect.flatMap(x => Map(x(0).toString -> x.toSeq.slice(1, x.length).mkString)).toMap
//the number of employees by city & department (a simple count(*))
//the number of employees with a non null id
//the number of employees with a small (< 500) or medium (< 1000) or high salary (> 1000) => 3 additional counts !
scala> df1.groupBy("City", "Dept")
.agg(count(lit(1)).alias("#Emp_per_City_Dept"), sum(col("NonNullEmpID")).alias("NonNullEmpID"))
.withColumn("total_emp_small", lit(SalaryMap("SmallSalary")))
.withColumn("total_emp_medium", lit(SalaryMap("MediumSalary")))
.withColumn("total_emp_high", lit(SalaryMap("HighSalary")))
.show()
+----+----+------------------+------------+---------------+----------------+--------------+
|City|Dept|#Emp_per_City_Dept|NonNullEmpID|total_emp_small|total_emp_medium|total_emp_high|
+----+----+------------------+------------+---------------+----------------+--------------+
| NY| FI| 2| 1| 1| 2| 3|
| LA| IT| 3| 2| 1| 2| 3|
| WDC| IT| 1| 1| 1| 2| 3|
+----+----+------------------+------------+---------------+----------------+--------------+
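If the buckets are needed per (City, Dept) group rather than globally, a single-pass alternative (a sketch, not part of the answer above) is conditional aggregation; it assumes the original DataFrame is registered as a temporary view (the name employees is made up here) and reuses the salary boundaries from the answer:
df.createOrReplaceTempView("employees")  // hypothetical view name

spark.sql("""
SELECT
    City,
    Dept,
    COUNT(*) AS total_emp,
    COUNT(employee_id) AS total_emp_non_null_id, -- COUNT(col) skips nulls
    SUM(CASE WHEN employee_salary < 500 THEN 1 ELSE 0 END) AS total_emp_small,
    SUM(CASE WHEN employee_salary >= 500 AND employee_salary < 1000 THEN 1 ELSE 0 END) AS total_emp_medium,
    SUM(CASE WHEN employee_salary >= 1000 THEN 1 ELSE 0 END) AS total_emp_high
FROM employees
GROUP BY City, Dept
""").show()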
I have a dataframe in Spark with a name column and dates, and I would like to find all continuous sequences of consecutive dates (day after day) for each name and calculate their durations. The output should contain the name, the start date of the sequence, and the duration of that time period (number of days).
How can I do this with Spark functions?
A consecutive sequence of dates example:
2019-03-12
2019-03-13
2019-03-14
2019-03-15
I have come up with this solution, but it calculates the overall number of days per name and does not split it into sequences:
val result = allDataDf
.groupBy($"name")
.agg(count($"date").as("timePeriod"))
.orderBy($"timePeriod".desc)
.head()
Also, I tried with ranks, but the count column only contains 1s, for some reason:
val names = Window
.partitionBy($"name")
.orderBy($"date")
val result = allDataDf
.select($"name", $"date", rank over names as "rank")
.groupBy($"name", $"date", $"rank")
.agg(count($"*") as "count")
The output looks like this:
+-----------+----------+----+-----+
|stationName| date|rank|count|
+-----------+----------+----+-----+
| NAME|2019-03-24| 1| 1|
| NAME|2019-03-25| 2| 1|
| NAME|2019-03-27| 3| 1|
| NAME|2019-03-28| 4| 1|
| NAME|2019-01-29| 5| 1|
| NAME|2019-03-30| 6| 1|
| NAME|2019-03-31| 7| 1|
| NAME|2019-04-02| 8| 1|
| NAME|2019-04-05| 9| 1|
| NAME|2019-04-07| 10| 1|
+-----------+----------+----+-----+
Finding consecutive dates is fairly easy in SQL. You could do it with a query like:
WITH s AS (
SELECT
stationName,
date,
date_add(date, -(row_number() over (partition by stationName order by date))) as discriminator
FROM stations
)
SELECT
stationName,
MIN(date) as start,
COUNT(1) AS duration
FROM s GROUP BY stationName, discriminator
Fortunately, we can use SQL in Spark. Let's check if it works (I used different dates):
val df = Seq(
("NAME1", "2019-03-22"),
("NAME1", "2019-03-23"),
("NAME1", "2019-03-24"),
("NAME1", "2019-03-25"),
("NAME1", "2019-03-27"),
("NAME1", "2019-03-28"),
("NAME2", "2019-03-27"),
("NAME2", "2019-03-28"),
("NAME2", "2019-03-30"),
("NAME2", "2019-03-31"),
("NAME2", "2019-04-04"),
("NAME2", "2019-04-05"),
("NAME2", "2019-04-06")
).toDF("stationName", "date")
.withColumn("date", date_format(col("date"), "yyyy-MM-dd"))
df.createTempView("stations");
val result = spark.sql(
"""
|WITH s AS (
| SELECT
| stationName,
| date,
| date_add(date, -(row_number() over (partition by stationName order by date)) + 1) as discriminator
| FROM stations
|)
|SELECT
| stationName,
| MIN(date) as start,
| COUNT(1) AS duration
|FROM s GROUP BY stationName, discriminator
""".stripMargin)
result.show()
It seems to output the correct dataset:
+-----------+----------+--------+
|stationName| start|duration|
+-----------+----------+--------+
| NAME1|2019-03-22| 4|
| NAME1|2019-03-27| 2|
| NAME2|2019-03-27| 2|
| NAME2|2019-03-30| 2|
| NAME2|2019-04-04| 3|
+-----------+----------+--------+
I have tried numerous approaches to turn the following:
Gender, Age, Value
1, 20, 21
2, 23, 22
1, 26, 23
2, 29, 24
into
Male_Age, Male_Value, Female_Age, Female_Value
20, 21, 23, 22
26, 23, 29, 24
What I need to do is group by gender and, instead of using an aggregate like sum, count or avg, I need to create List[age] and List[value]. This should be possible because I am using a Dataset, which allows functional operations.
If the number of rows for males and females are not the same, the columns should be filled with nulls.
One approach I tried was to make a new dataframe using the columns of other dataframes, like so:
df
.select(male.select("sex").where('sex === 1).col("sex"),
female.select("sex").where('sex === 2).col("sex"))
However, this bizarrely produces output like so:
sex, sex,
1, 1
2, 2
1, 1
2, 2
I can't see how that is possible.
I also tried using pivot, but it forces me to aggregate after the group by:
df.withColumn("sex2", df.col("sex"))
.groupBy("sex")
.pivot("sex2")
.agg(
sum('value').as("mean"),
stddev('value).as("std. dev") )
.show()
|sex| 1.0_mean| 1.0_std. dev| 2.0_mean| 2.0_std. dev|
|1.0|0.4926065526| 1.8110632697| | |
|2.0| | |0.951250372|1.75060275400785|
The following code does what I need in Oracle SQL, so it should be possible in Spark SQL too, I reckon...
drop table mytable
CREATE TABLE mytable
( gender number(10) NOT NULL,
age number(10) NOT NULL,
value number(10) );
insert into mytable values (1,20,21);
insert into mytable values(2,23,22);
insert into mytable values (1,26,23);
insert into mytable values (2,29,24);
insert into mytable values (1,30,25);
select * from mytable;
SELECT A.VALUE AS MALE,
B.VALUE AS FEMALE
FROM
(select value, rownum RN from mytable where gender = 1) A
FULL OUTER JOIN
(select value, rownum RN from mytable where gender = 2) B
ON A.RN = B.RN
The following should give you the result.
val df = Seq(
(1, 20, 21),
(2, 23, 22),
(1, 26, 23),
(2, 29, 24)
).toDF("Gender", "Age", "Value")
scala> df.show
+------+---+-----+
|Gender|Age|Value|
+------+---+-----+
| 1| 20| 21|
| 2| 23| 22|
| 1| 26| 23|
| 2| 29| 24|
+------+---+-----+
// Gender 1 = Male
// Gender 2 = Female
import org.apache.spark.sql.expressions.Window
val byGender = Window.partitionBy("gender").orderBy("gender")
val males = df
.filter("gender = 1")
.select($"age" as "male_age",
$"value" as "male_value",
row_number() over byGender as "RN")
scala> males.show
+--------+----------+---+
|male_age|male_value| RN|
+--------+----------+---+
| 20| 21| 1|
| 26| 23| 2|
+--------+----------+---+
val females = df
.filter("gender = 2")
.select($"age" as "female_age",
$"value" as "female_value",
row_number() over byGender as "RN")
scala> females.show
+----------+------------+---+
|female_age|female_value| RN|
+----------+------------+---+
| 23| 22| 1|
| 29| 24| 2|
+----------+------------+---+
scala> males.join(females, Seq("RN"), "outer").show
+---+--------+----------+----------+------------+
| RN|male_age|male_value|female_age|female_value|
+---+--------+----------+----------+------------+
| 1| 20| 21| 23| 22|
| 2| 26| 23| 29| 24|
+---+--------+----------+----------+------------+
Given a DataFrame called df with columns gender, age, and value, you can do this:
df.groupBy($"gender")
.agg(collect_list($"age"), collect_list($"value")).rdd.map { row =>
val ages: Seq[Int] = row.getSeq(1)
val values: Seq[Int] = row.getSeq(2)
(row.getInt(0), ages.head, ages.last, values.head, values.last)
}.toDF("gender", "male_age", "female_age", "male_value", "female_value")
This uses the collect_list aggregating function in the very helpful Spark functions library to aggregate the values you want. (As you can see, there is also a collect_set as well.)
After that, I don't know of any higher-level DataFrame functions to expand those columnar arrays into individual columns of their own, so I fall back to the lower-level RDD API our ancestors used. I simply expand everything into a Tuple and then turn it back into a DataFrame. The commenters above mention corner cases I have not addressed; using functions like headOption and tailOption might be useful there. But this should be enough to get you moving.