Create deciles to group and label records where the sum of a value is the same for each decile - pyspark

I have something similar to the following table, which is a randomly ordered list of thousands of transactions with a Customer_ID and an order_cost for each transaction.
Customer_ID  order_cost
1            $503
53           $7
4            $80
13           $76
6            $270
78           $2
8            $45
910          $89
10           $3
1130         $43
etc...       etc...
I want to group the transactions by Customer_ID, aggregate the cost of all the orders into a spending column, and then create a new "decile" column that assigns a number 1-10 to each customer so that when the "spending" for all customers in a decile is added up, each decile contains 10% of all the spending.
The resulting table would look something like the one below, where each ascending decile contains fewer customers, but the total sum of "spending" across the records in each decile group is the same for deciles 1-10. (The numbers in this sample don't actually add up; it's just to show the concept.)
Customer_ID  spending  Decile
45           $500      1
3            $700      1
349          $800      1
23           $1,000    1
64           $2,000    1
718          $2,100    1
3452         $2,300    1
1276         $2,600    2
10           $3,000    2
34           $4,000    2
etc...       etc...    etc...
So far I have grouped by Customer_ID, aggregated order_cost into a spending column, ordered the customers in ascending order of spending, and then partitioned them into 5,000 groups. From there I manually found the cutoff values for each .when statement so that deciles 1-10 each contain the right number of customers for every decile to hold 10% of the total spending. Finding the right bucket configuration by trial and error is pretty time-consuming.
I'm trying to find a way to automate this process so I don't have to find the right bucketing ratio for each decile by trial and error.
This is my code so far:
import pyspark.sql.functions as F
from pyspark.sql import window as W  # needed for W.Window below
deciles = (table
.groupBy('Customer_ID')
.agg(F.sum('order_cost').alias('spending')).alias('a')
.withColumn('rank', F.ntile(5000).over(W.Window.partitionBy()
.orderBy(F.asc('spending'))))
.withColumn('rank', F.when(F.col('rank')<=4628, F.lit(1))
.when(F.col('rank')<=4850, F.lit(2))
.when(F.col('rank')<=4925, F.lit(3))
.when(F.col('rank')<=4965, F.lit(4))
.when(F.col('rank')<=4980, F.lit(5))
.when(F.col('rank')<=4987, F.lit(6))
.when(F.col('rank')<=4993, F.lit(7))
.when(F.col('rank')<=4997, F.lit(8))
.when(F.col('rank')<=4999, F.lit(9))
.when(F.col('rank')<=5000, F.lit(10))
.otherwise(F.lit(0)))
)
end_table = (table.alias('a').join(deciles.alias('b'), ['Customer_ID'], 'left')
.selectExpr('a.*', 'b.rank')
)
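One way to automate the cutoffs (a rough sketch along the lines of the code above, not a tested solution): compute each customer's cumulative share of total spending with a running-sum window and derive the decile directly from that fraction, so each decile holds as close to 10% of the spending as whole customers allow. Column names match the question; the decile column name and the ceil-based bucketing are assumptions of mine.
import pyspark.sql.functions as F
from pyspark.sql import Window

# Aggregate order_cost per customer, exactly as in the question.
spending = (table
    .groupBy('Customer_ID')
    .agg(F.sum('order_cost').alias('spending')))

# Running total of spending (ascending) divided by the grand total gives each
# customer's cumulative share of all spending, a value in (0, 1].
# Note: a window without partitionBy pulls everything into one partition,
# which is fine for thousands of customers.
w_cum = (Window.orderBy(F.asc('spending'))
         .rowsBetween(Window.unboundedPreceding, Window.currentRow))
w_all = Window.partitionBy()

deciles = (spending
    .withColumn('cum_share',
                F.sum('spending').over(w_cum) / F.sum('spending').over(w_all))
    .withColumn('decile',
                F.least(F.lit(10), F.ceil(F.col('cum_share') * 10)).cast('int'))
    .drop('cum_share'))

end_table = (table.alias('a')
    .join(deciles.alias('b'), ['Customer_ID'], 'left')
    .selectExpr('a.*', 'b.decile'))
A boundary customer can still push an individual decile slightly above or below exactly 10%, since one customer cannot be split across two deciles.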

Related

How to divide an individual unit value of a column by another column's aggregate value?

I have been trying to take individual scores from my database and divide them by the collective sum of a secondary score for students that share the same name, without aggregating the rows away, i.e. keeping the individual cells.
I have a database like this
name score 1 score 2
reed 30 10
reed 50 20
brick 60 30
brick 60 12
and I want this output for a new column Score % 2:
name score 1 score 2 score % 2
reed 30 10 10/(30+50)=0.125
reed 50 20 20/(30+50)=0.25
brick 60 30 30/(60+60)=0.25
brick 60 12 12/(60+60)=0.1
So I figured my query would be something like: Score 1/SUM(Score 2) OVER (PARTITION BY Name)
but this doesn't really work, probably because the PARTITION BY is trying to sum by name while the first part of the query refers to individual, unit-level data.
Is what I want even possible? Thank you!
You can just join in the SUM like so.
SELECT table.score1,
table.score2,
sums.sum_score1,
table.score2 * 1.0 / sums.sum_score1 AS [score % 2]
FROM table INNER JOIN (
SELECT name,
SUM(score1) AS sum_score1
FROM table
GROUP BY name
) AS sums ON table.name = sums.name
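Since the main question above is in PySpark, here is a rough equivalent of the same idea using a window function instead of a join (a minimal sketch; df, name, score1 and score2 are assumed names matching the example data):
import pyspark.sql.functions as F
from pyspark.sql import Window

# Per-name total of score1, attached to every row without collapsing the rows.
w = Window.partitionBy('name')
result = df.withColumn('score_pct_2', F.col('score2') / F.sum('score1').over(w))
The window keeps every individual row, which matches the "without aggregating them" requirement.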

find the difference between a column in two rows, grouped by one column and sorted by another column

I need to find, in a single query, the sequential difference in the worth column between consecutive rows, grouped by the brand column and ordered by the bill_id column, along with an average of that difference.
I have this data:
brand bill_id worth
Moto 1 2550
Samsung 1 3430
Samsung 2 3450
Moto 2 2500
Moto 3 2530
Expected Output
brand bill_id worth net_diff avg_diff
Moto 1 2550 0 00
Moto 2 2560 10 5
Moto 3 2540 -20 -5
Samsung 1 3430 0 0
Samsung 2 3450 20 10
With the following data:
CREATE TABLE T (brand VARCHAR(16), bill_id INT, worth DECIMAL(16,2));
INSERT INTO T VALUES
('Moto', 1, 2550),
('Samsung', 1, 3430),
('Samsung', 2, 3450),
('Moto', 2, 2500),
('Moto', 3, 2530);
One possible solution could be:
WITH
T0 AS
(
SELECT *, worth - COALESCE(LAG(worth) OVER(PARTITION BY brand ORDER BY bill_id), worth) AS net_diff
FROM T
)
SELECT *, AVG(net_diff) OVER(PARTITION BY brand ORDER BY bill_id) AS avg_diff
FROM T0;
But I do not understand the computation formula for AVG in your example...
It appears that by average you are looking for 1/2 the difference between 2 consecutive bill_ids for a brand. You can get this by applying the lag() function twice (with the answer from #SQLpro as a base), arriving at:
with bill_net(brand, bill_id, worth, net_diff) as
( select billing.*, worth - coalesce(lag(worth) over(partition by brand order by bill_id), worth)
from billing
)
select brand, bill_id, worth, net_diff, coalesce(round(((net_diff - lag(net_diff) over(partition by brand order by bill_id))/2.0),2),0.00) as avg_diff
from bill_net;
NOTE: Due to inconsistency between Input and Results, it does not exactly produce your expected results.
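For anyone wanting the same lag-based calculation in PySpark (the environment of the main question), a rough sketch follows; bills, brand, bill_id and worth are assumed names mirroring the example:
import pyspark.sql.functions as F
from pyspark.sql import Window

w = Window.partitionBy('brand').orderBy('bill_id')

result = (bills
    # Difference to the previous bill's worth within the brand, 0 for the first bill.
    .withColumn('net_diff', F.col('worth') - F.coalesce(F.lag('worth').over(w), F.col('worth')))
    # Half the change in net_diff between consecutive bills, 0 when there is no previous row.
    .withColumn('avg_diff', F.coalesce(
        F.round((F.col('net_diff') - F.lag('net_diff').over(w)) / 2.0, 2),
        F.lit(0.0))))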

Calculate Subtotals using fixed categories in postgresql query

I have a query that returns the count and sum of certain fields on my tables, and also a total. It goes like this:
example:
with foo as(
select s.subcat as subcategory,
sum(s.cost) as costs,
round(((sum(s.cost) / (s.tl)::numeric )*100),2)|| ' %' as pct_cost
from (select ...big query)s group by s.subcat
)
select * from foo
union
select 'Total costs' as subcategory,
sum(costs) as costs,
null as pct_cost
from foo
order by...
Category         Cost  Percentage
x_subcategory 1  5     0.5%
x_subcategory 2  1     0.1%
x_subcategory 3  18    1.8%
y_subcategory 1  7     0.7%
y_subcategory 2  10    1.0%
...              ...   ...
Total            41    5.8%
And what I need to do for another report is to get the totals by category. I have to assign these categories based on the value of the subcategory name; the point is how to partition the result so I can get something like this:
Category         Cost  Percentage
x_subcategory 1  5     0.5%
x_subcategory 2  1     0.1%
x_subcategory 3  18    1.8%
X category       24    2.4%
y_subcategory 1  7     0.7%
y_subcategory 2  10    1.0%
Y category       17    1.7%
With GROUP BY and GROUP BY GROUPING SETS I don't get what I want, and with PARTITION I'm getting syntax errors. I'm able to use it in simpler queries, but this one turned out to be very complicated, and I wonder if it's possible to achieve this in a single query in PostgreSQL.
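To illustrate the idea only (this is not a PostgreSQL answer): in PySpark, the environment of the main question, per-category subtotals can be produced by deriving a category column from the subcategory name and then using rollup. foo, subcategory and costs are assumed names mirroring the CTE above, and the split on '_subcategory' is an assumption about the naming convention:
import pyspark.sql.functions as F

# Derive the category prefix from the subcategory name (assumed naming convention).
with_cat = foo.withColumn('category', F.split(F.col('subcategory'), '_subcategory').getItem(0))

report = (with_cat
    .rollup('category', 'subcategory')
    .agg(F.sum('costs').alias('costs'))
    .where(F.col('category').isNotNull())   # drop the grand-total row
    # Subcategory rows first, then the per-category subtotal row (subcategory is null there).
    .orderBy('category', F.col('subcategory').isNull(), 'subcategory'))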

PostgreSQL select statement to return rows after where condition

I am working on a query to return the next 7 days worth of data every time an event happens indicated by "where event = 1". The goal is to then group all the data by the user id and perform aggregate functions on this data after the event happens - the event is encoded as binary [0, 1].
So far, I have been attempting to use nested select statements to structure the data how I would like to have it, but using the window functions is starting to restrict me. I am now thinking a self join could be more appropriate but need help in constructing such a query.
The query currently first creates daily aggregate values grouped by user and date (3rd level nested select). Then, the 2nd level sums the data "value_x" to obtain an aggregate value grouped by the user. Then, the 1st level nested select statement uses the lead function to grab the next rows value over and partitioned by each user which acts as selecting the next day's value when event = 1. Lastly, the select statement uses an aggregate function to calculate the average "sum_next_day_value_after_event" grouped by user and where event = 1. Put together, where event = 1, the query returns the avg(value_x) of the next row's total value_x.
However, this doesn't follow my time rule: where event = 1, return the next 7 days' worth of data after the event happens. If there is not 7 days' worth of data, then return whatever data is <= 7 days out. Yes, I currently only have one lead with an offset of 1, but I could add 6 more of these functions to grab the next 6 rows. However, the lead function just grabs the next row without regard to the date, so theoretically the next row's "value_x" could actually be 15 days after the row where "event = 1". Also, as can be seen below in the data table, a user may have more than one row per day.
Here is the following query I have so far:
select
f.user_id,
avg(f.sum_next_day_value_after_event) as sum_next_day_values
from (
select
bld.user_id,
lead(bld.sum_daily_value_x, 1) over(partition by bld.user_id order by bld.daily) as sum_next_day_value_after_event
from (
select
l.user_id,
l.daily,
sum(l.value_x) as sum_daily_value_x
from (
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
group by date_part('day', day_ts), user_id, value_x) l
group by l.user_id, l.daily
order by l.user_id) bld) f
group by f.user_id
Below is a snippet of the data from table_1:
user_id  day_ts         value_x  event
50       4/2/21 07:37   25       0
50       4/2/21 07:42   45       0
50       4/2/21 09:14   67       1
50       4/5/21 10:09   8        0
50       4/5/21 10:24   75       0
50       4/8/21 11:08   34       0
50       4/15/21 13:09  32       1
50       4/16/21 14:23  12       0
50       4/29/21 14:34  90       0
55       4/4/21 15:31   12       0
55       4/5/21 15:23   34       0
55       4/17/21 18:58  32       1
55       4/17/21 19:00  66       1
55       4/18/21 19:57  54       0
55       4/23/21 20:02  34       0
55       4/29/21 20:39  57       0
55       4/30/21 21:46  43       0
Technical details:
PostgreSQL, supported by EDB, version = 14.1
pgAdmin4, version 5.7
Thanks for the help!
"The query currently first creates daily aggregate values"
I don't see any aggregate function in your first query, so the GROUP BY clause is useless.
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
group by date_part('day', day_ts), user_id, value_x
could be simplified as
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
which in turn provides no real added value, so this first query could be removed and the second query would become:
select user_id
, date_part('day', day_ts) as daily
, sum(value_x) as sum_daily_value_x
from table_1
group by user_id, date_part('day', day_ts)
The order by user_id clause can also be removed at this step.
Now if you want to calculate the average value of sum_daily_value_x over the period of 7 days after the event (I'm referring to the avg() function in your top query), you can use avg() as a window function restricted to that 7-day period:
select f.user_id
, avg(f.sum_daily_value_x) over (order by f.daily range between current row and '7 days' following) as sum_next_day_values
from (
select user_id
, date_part('day', day_ts) as daily
, sum(value_x) as sum_daily_value_x
from table_1
group by user_id, date_part('day', day_ts)
) AS f
group by f.user_id
The partition by f.user_id clause in the window function is useless because the rows have already been grouped by f.user_id before the window function is applied.
You can replace the avg() window function with any other one, for instance sum(), which would better fit the alias sum_next_day_values.
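To relate this back to PySpark (the tag on the main question), a comparable 7-day range frame can be expressed over a unix-timestamp ordering. This is a sketch only; daily_df, user_id, day and sum_daily_value_x are assumed names for a per-user, per-day pre-aggregated DataFrame:
import pyspark.sql.functions as F
from pyspark.sql import Window

# Range frame covering the current day plus the following 7 days, measured in seconds.
w = (Window.partitionBy('user_id')
     .orderBy(F.col('day').cast('timestamp').cast('long'))
     .rangeBetween(0, 7 * 24 * 60 * 60))

result = daily_df.withColumn('sum_next_7_days', F.sum('sum_daily_value_x').over(w))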

Insert rownumber repeatedly in records in t-sql

I want to insert a row number into the records that repeatedly counts rows over a fixed range. Example output:
RowNumber ID Name
1 20 a
2 21 b
3 22 c
1 23 d
2 24 e
3 25 f
1 26 g
2 27 h
3 28 i
1 29 j
2 30 k
I would rather try using row_number() over (partition by ... order by column name), but my real records don't contain a column that could be used to partition the rows into groups of 1-3.
I already tried looping over each record to insert a row count of 1-3, but this loop affects the performance of the query. The query will be used for an RDL report, which is why the performance of the query must be as good as possible.
Any suggestions are welcome. Thanks
Have you tried modulo-ing row_number()?
SELECT
((row_number() over (order by ID)-1) % 3) +1 as RowNumber
FROM table
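The same modulo trick also works in PySpark, in case it is useful alongside the main question (a small sketch; df and ID are assumed names):
import pyspark.sql.functions as F
from pyspark.sql import Window

# Produces 1, 2, 3, 1, 2, 3, ... down the rows ordered by ID.
df = df.withColumn('RowNumber', (F.row_number().over(Window.orderBy('ID')) - 1) % 3 + 1)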