How to Calculate Median Price Per Unit Using PERCENTILE_CONT and GROUP BY id - postgresql

I'm using Postgres 9.5 and trying to calculate the median and average price per unit with a GROUP BY id. Here is the query in DBFIDDLE.
Here is the data:
 id | price | units
----+-------+-------
  1 |   100 |    15
  1 |    90 |    10
  1 |    50 |     8
  1 |    40 |     8
  1 |    30 |     7
  2 |   110 |    22
  2 |    60 |     8
  2 |    50 |    11
Using percentile_cont this is my query:
SELECT id,
ceil(avg(price)) as avg_price,
percentile_cont(0.5) within group (order by price) as median_price,
ceil( sum (price) / sum (units) ) AS avg_pp_unit,
ceil( percentile_cont(0.5) within group (order by price) /
percentile_cont(0.5) within group (order by units) ) as median_pp_unit
FROM t
GROUP by id
This query returns:
 id | avg_price | median_price | avg_pp_unit | median_pp_unit
----+-----------+--------------+-------------+----------------
  1 |        62 |           50 |           6 |              7
  2 |        74 |           60 |           5 |              5
I'm pretty sure average calculation is correct. Is this the correct way to calculate median price per unit?
This post suggests this is correct (although performance is poor) but I'm curious if the division in the median calculation could skew the result.
Calculating median with PERCENTILE_CONT and grouping

The median is the value separating the higher half from the lower half of a data sample (a population or a probability distribution). For a data set, it may be thought of as the "middle" value.
https://en.wikipedia.org/wiki/Median
So your median price is 55, and the median units is 9
Sort by price Sort by units
id | price | units | | id | price | units
-------|-----------|--------| |-------|---------|----------
1 | 30 | 7 | | 1 | 30 | 7
1 | 40 | 8 | | 1 | 40 | 8
1 | 50 | 8 | | 1 | 50 | 8
>>> 2 | 50 | 11 | | 2 | 60 | 8 <<<<
>>> 2 | 60 | 8 | | 1 | 90 | 10 <<<<
1 | 90 | 10 | | 2 | 50 | 11
1 | 100 | 15 | | 1 | 100 | 15
2 | 110 | 22 | | 2 | 110 | 22
| | | | | |
(50+60)/2 (8+10)/2
55 9
I'm unsure what you intend for "median price per unit":
CREATE TABLE t(
id INTEGER NOT NULL
,price INTEGER NOT NULL
,units INTEGER NOT NULL
);
INSERT INTO t(id,price,units) VALUES (1,30,7);
INSERT INTO t(id,price,units) VALUES (1,40,8);
INSERT INTO t(id,price,units) VALUES (1,50,8);
INSERT INTO t(id,price,units) VALUES (2,50,11);
INSERT INTO t(id,price,units) VALUES (2,60,8);
INSERT INTO t(id,price,units) VALUES (1,90,10);
INSERT INTO t(id,price,units) VALUES (1,100,15);
INSERT INTO t(id,price,units) VALUES (2,110,22);
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY price) med_price
, percentile_cont(0.5) WITHIN GROUP (ORDER BY units) med_units
FROM
t;
 med_price | med_units
-----------+-----------
        55 |         9
If column "price" represents a "unit price" then you don't need to divide 55 by 9, but if "price" is an "order total" then you would need to divide by units: 55/9 = 6.11

Related

Filter a sum of values until a certain threshold is reached

DbFiddle
Stuck. Need SO :)
Consider the following distribution of values.
ID | CNT | SEC | SHOW (bool)
---+-----+-----+------------
 1 |  10 |  1  |
 2 |   1 |  1  |
 3 |  25 |  1  |
 4 |   1 |  1  |
 5 |   2 |  1  |
 6 |  10 |  1  |
 7 |  50 |  2  |
 8 |  90 |  2  |
My goal is to filter by sec, then
sort by cnt ascending,
sort by id ascending,
and then flag all rows as show = false where cnt is < 5, and keep hiding rows until the sum of cnt over all hidden rows (show = false) is >= 5.
So the sum of all "hidden" rows may never end up < 5.
Expected outcome for sec=1:
| id | cnt | cnt_sum | show |
|----|-----|---------|-------|
| 2 | 1 | 1 | false |
| 4 | 1 | 2 | false |
| 5 | 2 | 4 | false |
| 1 | 10 | 14 | false | -- The sum of all hidden rows before this point is 4
| 6 | 10 | 24 | true | -- The total of all hidden rows is now >= 5.
| 3 | 25 | 49 | true |
Expected outcome for sec=2:
| id | cnt | cnt_sum | show |
|----|-----|---------|-------|
| 7 | 50 | 50 | true |
| 8 | 90 | 140 | true |
I can already sort the values and create the sums etc. What I have not figured out is how to determine the cutoff point, i.e. when "hiding" is no longer necessary.
I am already doing this in "client code" and I want to migrate it to SQL.
LAG() will help achieve what you want. You can write your query like this:
with cte as (
  SELECT
    id, cnt, sec,
    -- running total of cnt per sec, in (cnt, id) order
    sum(cnt) over (partition by sec order by cnt, id) sum_
  FROM
    tbl )
select
  id, cnt, sum_,
  case
    -- hide the row while the running total, or the previous row's running
    -- total, is still below the threshold of 5
    when sum_ < 5 or lag(sum_) over (partition by sec order by cnt, id) < 5 then 'false'
    else 'true'
  end as "show"
from cte
DEMO
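For reference, a minimal setup sketch for trying the query above; the table name tbl is taken from the query and the rows from the question's sample data:

CREATE TABLE tbl (id int, cnt int, sec int);
INSERT INTO tbl (id, cnt, sec) VALUES
  (1, 10, 1), (2, 1, 1), (3, 25, 1), (4, 1, 1),
  (5, 2, 1), (6, 10, 1), (7, 50, 2), (8, 90, 2);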

postgres tablefunc, sales data grouped by product, with crosstab of months

TIL about tablefunc and crosstab. At first I wanted to "group data by columns" but that doesn't really mean anything.
My product sales look like this
product_id | units | date
-----------------------------------
10 | 1 | 1-1-2018
10 | 2 | 2-2-2018
11 | 3 | 1-1-2018
11 | 10 | 1-2-2018
12 | 1 | 2-1-2018
13 | 10 | 1-1-2018
13 | 10 | 2-2-2018
I would like to produce a table of products with months as columns
product_id | 01-01-2018 | 02-01-2018 | etc.
-----------------------------------
10 | 1 | 2
11 | 13 | 0
12 | 0 | 1
13 | 20 | 0
First I would group by month, then invert and group by product, but I cannot figure out how to do this.
After enabling the tablefunc extension,
SELECT product_id, coalesce("2018-1-1", 0) as "2018-1-1"
, coalesce("2018-2-1", 0) as "2018-2-1"
FROM crosstab(
$$SELECT product_id, date_trunc('month', date)::date as month, sum(units) as units
FROM test
GROUP BY product_id, month
ORDER BY 1$$
, $$VALUES ('2018-1-1'::date), ('2018-2-1')$$
) AS ct (product_id int, "2018-1-1" int, "2018-2-1" int);
yields
| product_id | 2018-1-1 | 2018-2-1 |
|------------+----------+----------|
| 10 | 1 | 2 |
| 11 | 13 | 0 |
| 12 | 0 | 1 |
| 13 | 10 | 10 |
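For reference, crosstab() comes from the tablefunc extension, and the query above assumes a table named test. A setup sketch that reproduces the output above (the question's dates are read as month-day):

CREATE EXTENSION IF NOT EXISTS tablefunc;
CREATE TABLE test (product_id int, units int, date date);
INSERT INTO test (product_id, units, date) VALUES
  (10, 1, '2018-01-01'), (10, 2, '2018-02-02'),
  (11, 3, '2018-01-01'), (11, 10, '2018-01-02'),
  (12, 1, '2018-02-01'),
  (13, 10, '2018-01-01'), (13, 10, '2018-02-02');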

Calculate out price in FIFO SQL

Using Postgres 11
Using FIFO, I would like to calculate the price of items taken from the inventory, to keep track of the value of the total inventory.
Dataset is as follows:
ID   | prodno | amount_purchased | amount_taken | price | created_at
-----+--------+------------------+--------------+-------+-------------
uuid | 13976  | 10               | NULL         | 130   | <timestamp>
uuid | 13976  | 10               | NULL         | 150   | <timestamp>
uuid | 13976  | 10               | NULL         | 110   | <timestamp>
uuid | 13976  | 10               | NULL         | 100   | <timestamp>
uuid | 13976  | NULL             | 14           | ??    | <timestamp>
Before inserting the row with amount_taken, I would need to calculate what the average price of each of the 14 items is, which in this case would be 135.71. But how do I calculate this relatively efficiently?
My initial idea was to split the rows into two temp tables, one where amount_taken is NULL and one where it is not, and then work down through all the rows. But seeing as this table could become rather large rather fast (since most of the time only 1 item is taken from the inventory), I worry this would be a decent solution in the short term but would slow down as the table grows. So, what's the better solution, internet?
Given this setup:
CREATE TABLE test (
id int
, prodno int
, quantity numeric
, price numeric
, created_at timestamp
);
INSERT INTO test VALUES
(1, 13976, 10, 130, NOW())
, (2, 13976, 10, 150, NOW()+'1 hours')
, (3, 13976, 10, 110, NOW()+'2 hours')
, (4, 13976, 10, 100, NOW()+'3 hours')
, (5, 13976, -14, NULL, NOW()+'4 hours')
, (6, 13976, -1, NULL, NOW()+'5 hours')
, (7, 13976, -10, NULL, NOW()+'6 hours')
;
then the SQL
SELECT id, prodno, created_at, qty_sold
-- 5
, round((cum_sold_cost - coalesce(lag(cum_sold_cost) over w, 0))/qty_sold, 2) as fifo_price
, qty_bought, prev_bought, total_cost
, prev_total_cost
, cum_sold_cost
, coalesce(lag(cum_sold_cost) over w, 0) as prev_cum_sold_cost
FROM (
SELECT id, tneg.prodno, created_at, qty_sold, tpos.qty_bought, prev_bought, total_cost, prev_total_cost
-- 4
, round(prev_total_cost + ((tneg.cum_sold - tpos.prev_bought)/(tpos.qty_bought - tpos.prev_bought))*(total_cost-prev_total_cost), 2) as cum_sold_cost
FROM (
SELECT id, prodno, created_at, -quantity as qty_sold
, sum(-quantity) over w as cum_sold
FROM test
WHERE quantity < 0
WINDOW w AS (PARTITION BY prodno ORDER BY created_at)
-- 1
) tneg
LEFT JOIN (
SELECT prodno
, sum(quantity) over w as qty_bought
, coalesce(sum(quantity) over prevw, 0) as prev_bought
, quantity * price as cost
, sum(quantity * price) over w as total_cost
, coalesce(sum(quantity * price) over prevw, 0) as prev_total_cost
FROM test
WHERE quantity > 0
WINDOW w AS (PARTITION BY prodno ORDER BY created_at)
, prevw AS (PARTITION BY prodno ORDER BY created_at ROWS BETWEEN unbounded preceding AND 1 preceding)
-- 2
) tpos
-- 3
ON tneg.cum_sold BETWEEN tpos.prev_bought AND tpos.qty_bought
AND tneg.prodno = tpos.prodno
) t
WINDOW w AS (PARTITION BY prodno ORDER BY created_at)
yields
| id | prodno | created_at | qty_sold | fifo_price | qty_bought | prev_bought | total_cost | prev_total_cost | cum_sold_cost | prev_cum_sold_cost |
|----+--------+----------------------------+----------+------------+------------+-------------+------------+-----------------+---------------+--------------------|
| 5 | 13976 | 2019-03-07 21:07:13.267218 | 14 | 135.71 | 20 | 10 | 2800 | 1300 | 1900.00 | 0 |
| 6 | 13976 | 2019-03-07 22:07:13.267218 | 1 | 150.00 | 20 | 10 | 2800 | 1300 | 2050.00 | 1900.00 |
| 7 | 13976 | 2019-03-07 23:07:13.267218 | 10 | 130.00 | 30 | 20 | 3900 | 2800 | 3350.00 | 2050.00 |
tneg contains information about quantities sold
| id | prodno | created_at | qty_sold | cum_sold |
|----+--------+----------------------------+----------+----------|
| 5 | 13976 | 2019-03-07 21:07:13.267218 | 14 | 14 |
| 6 | 13976 | 2019-03-07 22:07:13.267218 | 1 | 15 |
| 7 | 13976 | 2019-03-07 23:07:13.267218 | 10 | 25 |
tpos contains information about quantities bought
| prodno | qty_bought | prev_bought | cost | total_cost | prev_total_cost |
|--------+------------+-------------+------+------------+-----------------|
| 13976 | 10 | 0 | 1300 | 1300 | 0 |
| 13976 | 20 | 10 | 1500 | 2800 | 1300 |
| 13976 | 30 | 20 | 1100 | 3900 | 2800 |
| 13976 | 40 | 30 | 1000 | 4900 | 3900 |
We match rows in tneg with rows in tpos on the condition that cum_sold is between qty_bought and prev_bought.
cum_sold is the cumulative amount sold, qty_bought is the cumulative amount bought, and prev_bought is the previous value of qty_bought.
| id | prodno | created_at | qty_sold | cum_sold | qty_bought | prev_bought | total_cost | prev_total_cost | cum_sold_cost |
|----+--------+----------------------------+----------+----------+------------+-------------+------------+-----------------+---------------|
| 5 | 13976 | 2019-03-07 21:07:13.267218 | 14 | 14 | 20 | 10 | 2800 | 1300 | 1900.00 |
| 6 | 13976 | 2019-03-07 22:07:13.267218 | 1 | 15 | 20 | 10 | 2800 | 1300 | 2050.00 |
| 7 | 13976 | 2019-03-07 23:07:13.267218 | 10 | 25 | 30 | 20 | 3900 | 2800 | 3350.00 |
The fraction
((tneg.cum_sold - tpos.prev_bought)/(tpos.qty_bought - tpos.prev_bought)) as frac
measures how far cum_sold lies between prev_bought and qty_bought. We use this fraction to compute
cum_sold_cost, the cumulative cost associated with buying cum_sold items:
cum_sold_cost lies a fraction frac of the way between prev_total_cost and total_cost.
Once you obtain cum_sold_cost, you have everything you need to compute marginal FIFO unit prices.
For each line of tneg, the difference between cum_sold_cost and its previous value is the cost of the qty_sold.
FIFO price is simply the ratio of this cost and qty_sold.
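As a worked example from the output above: for id = 5 the join matches the purchase batch with prev_bought = 10 and qty_bought = 20, so cum_sold_cost = 1300 + ((14 - 10) / (20 - 10)) * (2800 - 1300) = 1900, and fifo_price = (1900 - 0) / 14 = 135.71.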

Groupwise select nth row Postgres

I have a problem that falls into the "greatest-n-per-group" category, but with a slight twist. I have a table along the lines of the following:
| t_id | t_amount | b_id | b_amount |
|------|----------|------|----------|
| 1 | 50 | 7 | 50 |
| 1 | 50 | 15 | 50 |
| 1 | 50 | 80 | 50 |
| 3 | 50 | 7 | 50 |
| 3 | 50 | 15 | 50 |
| 3 | 50 | 80 | 50 |
| 17 | 50 | 7 | 50 |
| 17 | 50 | 15 | 50 |
| 17 | 50 | 80 | 50 |
What I'd like to do is essentially partition this table by t_id and then select the first row of the first partition, the second row of the second partition, and the third row of the third partition, with the results looking like this:
| t_id | t_amount | b_id | b_amount |
|------|----------|------|----------|
| 1 | 50 | 7 | 50 |
| 3 | 50 | 15 | 50 |
| 17 | 50 | 80 | 50 |
It seems like a window function or something with distinct on might do the trick, but I haven't yet put it together.
I'm using Postgres 10 on a *nix system.
Using window functions dense_rank and row_number would do it
https://www.postgresql.org/docs/10/static/functions-window.html
Solution: db<>fiddle
SELECT
t_id,
t_amount,
b_id,
b_amount
FROM
(
SELECT
*,
dense_rank() over (ORDER BY t_id) as group_number, -- A
row_number() over (PARTITION BY t_id ORDER BY t_id, b_id)
as row_number_in_group -- B
FROM
test_data) s
WHERE
group_number = row_number_in_group
A: dense_rank assigns one number per distinct t_id (ordered by t_id), so every t_id gets its own group number.
B: row_number counts the rows within each t_id partition.
I illustrate the result of the subquery here:
t_id t_amount b_id b_amount dense_rank row_number
---- -------- ---- -------- ---------- ----------
1 50 7 50 1 1
1 50 15 50 1 2
1 50 80 50 1 3
3 50 7 50 2 1
3 50 15 50 2 2
3 50 80 50 2 3
17 50 7 50 3 1
17 50 15 50 3 2
17 50 80 50 3 3
Now filter for the rows where the group number equals the row number within the group, and you get your expected result.
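For reference, a minimal setup sketch for the test_data table used in the solution, populated from the question's sample rows:

CREATE TABLE test_data (t_id int, t_amount int, b_id int, b_amount int);
INSERT INTO test_data (t_id, t_amount, b_id, b_amount) VALUES
  (1, 50, 7, 50), (1, 50, 15, 50), (1, 50, 80, 50),
  (3, 50, 7, 50), (3, 50, 15, 50), (3, 50, 80, 50),
  (17, 50, 7, 50), (17, 50, 15, 50), (17, 50, 80, 50);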

Generate a histogram of values grouped by a column

I have the following data in a reviews table for a certain set of items, using a score system that ranges from 0 to 100:
+-----------+---------+-------+
| review_id | item_id | score |
+-----------+---------+-------+
| 1 | 1 | 90 |
+-----------+---------+-------+
| 2 | 1 | 40 |
+-----------+---------+-------+
| 3 | 1 | 10 |
+-----------+---------+-------+
| 4 | 2 | 90 |
+-----------+---------+-------+
| 5 | 2 | 90 |
+-----------+---------+-------+
| 6 | 2 | 70 |
+-----------+---------+-------+
| 7 | 3 | 80 |
+-----------+---------+-------+
| 8 | 3 | 80 |
+-----------+---------+-------+
| 9 | 3 | 80 |
+-----------+---------+-------+
| 10 | 3 | 80 |
+-----------+---------+-------+
| 11 | 4 | 10 |
+-----------+---------+-------+
| 12 | 4 | 30 |
+-----------+---------+-------+
| 13 | 4 | 50 |
+-----------+---------+-------+
| 14 | 4 | 80 |
+-----------+---------+-------+
I am trying to create a histogram of the score values with five bins. My goal is to generate a histogram per item. A histogram of the entire table can be built with the width_bucket function, and this can also be tuned to operate on a per-item basis:
SELECT item_id, g.n as bucket, COUNT(m.score) as count
FROM generate_series(1, 5) g(n) LEFT JOIN
review as m
ON width_bucket(score, 0, 100, 4) = g.n
GROUP BY item_id, g.n
ORDER BY item_id, g.n;
However, the result looks like this:
+---------+--------+-------+
| item_id | bucket | count |
+---------+--------+-------+
| 1 | 5 | 1 |
+---------+--------+-------+
| 1 | 3 | 1 |
+---------+--------+-------+
| 1 | 1 | 1 |
+---------+--------+-------+
| 2 | 5 | 2 |
+---------+--------+-------+
| 2 | 4 | 2 |
+---------+--------+-------+
| 3 | 4 | 4 |
+---------+--------+-------+
| 4 | 1 | 1 |
+---------+--------+-------+
| 4 | 2 | 1 |
+---------+--------+-------+
| 4 | 3 | 1 |
+---------+--------+-------+
| 4 | 4 | 1 |
+---------+--------+-------+
That is, bins with no entries are not included. While I find this not to be a bad solution, I would rather have all buckets, with 0 for those with no entries. Even better, I would like this structure:
+---------+----------+----------+----------+----------+----------+
| item_id | bucket_1 | bucket_2 | bucket_3 | bucket_4 | bucket_5 |
+---------+----------+----------+----------+----------+----------+
| 1 | 1 | 0 | 1 | 0 | 1 |
+---------+----------+----------+----------+----------+----------+
| 2 | 0 | 0 | 0 | 2 | 2 |
+---------+----------+----------+----------+----------+----------+
| 3 | 0 | 0 | 0 | 4 | 0 |
+---------+----------+----------+----------+----------+----------+
| 4 | 1 | 1 | 1 | 1 | 0 |
+---------+----------+----------+----------+----------+----------+
I prefer this solution as it uses a row per item (instead of 5n), which is simpler to query and minimizes memory consumption and data transfer costs. My current approach is as follows:
select item_id,
(sum(case when score >= 0 and score <= 19 then 1 else 0 end)) as bucket_1,
(sum(case when score >= 20 and score <= 39 then 1 else 0 end)) as bucket_2,
(sum(case when score >= 40 and score <= 59 then 1 else 0 end)) as bucket_3,
(sum(case when score >= 60 and score <= 79 then 1 else 0 end)) as bucket_4,
(sum(case when score >= 80 and score <= 100 then 1 else 0 end)) as bucket_5
from review
group by item_id;
Even though this query satisfies my requirements, I am curious whether there might be a more elegant approach. So many CASE statements are not easy to read, and changes in the bin criteria might require updating every sum. I am also curious about any performance concerns this query might have.
The second query can be rewritten to use ranges to make editing and writing the query a bit easier:
with buckets (b1, b2, b3, b4, b5) as (
values (
int4range(0, 20), int4range(20, 40), int4range(40, 60), int4range(60, 80), int4range(80, 100)
)
)
select item_id,
count(*) filter (where b1 @> score) as bucket_1,
count(*) filter (where b2 @> score) as bucket_2,
count(*) filter (where b3 @> score) as bucket_3,
count(*) filter (where b4 @> score) as bucket_4,
count(*) filter (where b5 @> score) as bucket_5
from review
cross join buckets
group by item_id
order by item_id;
A range constructed with int4range(0,20) includes the lower end and excludes the upper end.
The CTE named buckets only creates a single row, so the cross join does not change the number of rows from the review table.
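A quick way to confirm those bound semantics (a sketch, not part of the original answer):

SELECT int4range(0, 20) @> 0  AS includes_lower,  -- true
       int4range(0, 20) @> 20 AS includes_upper;  -- false

Note that, with these bounds, a score of exactly 100 would fall outside int4range(80, 100); if that matters, the last bucket can be built with an inclusive upper bound, e.g. int4range(80, 100, '[]').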
I found this post useful
CREATE FUNCTION temp_histogram(table_name_or_subquery text, column_name text)
RETURNS TABLE(bucket int, "range" numrange, freq bigint, bar text)
AS $func$
BEGIN
RETURN QUERY EXECUTE format('
WITH
source AS (
SELECT * FROM %s
),
min_max AS (
SELECT min(%s) AS min, max(%s) AS max FROM source
),
temp_histogram AS (
SELECT
width_bucket(%s, min_max.min, min_max.max, 100) AS bucket,
numrange(min(%s)::numeric, max(%s)::numeric, ''[]'') AS "range",
count(%s) AS freq
FROM source, min_max
WHERE %s IS NOT NULL
GROUP BY bucket
ORDER BY bucket
)
SELECT
bucket,
"range",
freq::bigint,
repeat(''*'', (freq::float / (max(freq) over() + 1) * 15)::int) AS bar
FROM temp_histogram',
table_name_or_subquery,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name,
column_name
);
END
$func$ LANGUAGE plpgsql;
Adjust the bucket count (100 in the above script) to your needs.
Invoke like this:
SELECT * FROM temp_histogram($table_name_or_subquery, $column_name);
Example:
SELECT * FROM temp_histogram('transactions_tbl', 'amount_colm');
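Since the first parameter is a table name or subquery, it should also be possible to pass a parenthesized subquery with an alias; an untested sketch, reusing the names from the example above:

SELECT * FROM temp_histogram('(SELECT amount_colm FROM transactions_tbl WHERE amount_colm > 0) AS sub', 'amount_colm');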