join vs subquery in searches between ranges - postgresql

Subquery vs. join performance is an inexact science; Google turns up cases where each comes out ahead. It depends on the data structure, so you have to test both to find what holds for your data.
I have a subquery that I couldn't replace with a join, so I couldn't compare their performance.
Assume you have a price history table to which you add a record every time the price or its characteristics change. Take this simple example: sql fiddle simple sample!
create table price_hist
( hid serial,
product int,
start_day date,
price numeric,
max_discount numeric,
promo_code character(4) );
create table deliveries
( del_id serial,
del_date date,
product int,
quantity int,
u_price numeric);
insert into price_hist (product, start_day,price,max_discount,promo_code)
values
(21,'2018-03-14',56.22, .022, 'Sam2'),
(18,'2018-02-24',11.25, .031, 'pax3'),
(21,'2017-12-28',50.12, .019, 'titi'),
(21,'2017-12-01',51.89, .034, 'any7'),
(18,'2017-12-26',11.52, .039, 'jun3'),
(18,'2017-12-10',10.99, .029, 'sep9');
insert into deliveries(del_date, product, quantity)
values
('2017-12-05',21,4),
('2017-12-20',18,3),
('2017-12-28',21,2),
('2018-05-08',18,1),
('2018-08-20',21,5);
select d.del_id, d.del_date, d.product, d.quantity,
(select price from price_hist h where h.product=d.product order by h.start_day desc limit 1) u_price,
(select max_discount from price_hist h where h.product=d.product order by h.start_day desc limit 1) max_discount,
(select price from price_hist h where h.product=d.product order by h.start_day desc limit 1)*d.quantity total
from deliveries d;
The subqueries find values between date ranges; I have not been able to write the join in PostgreSQL that does the same.

You can use distinct on to get values from price_hist for the latest start_day:
select distinct on(product)
product, price, max_discount
from price_hist h
order by product, start_day desc
product | price | max_discount
---------+-------+--------------
18 | 11.25 | 0.031
21 | 56.22 | 0.022
(2 rows)
Use it as a derived table to join it with deliveries:
select
d.del_id, d.del_date, d.product, d.quantity,
h.price as u_price, h.max_discount, h.price * d.quantity as total
from deliveries d
join (
select distinct on(product)
product, price, max_discount
from price_hist
order by product, start_day desc
) h using(product)
SqlFiddle.
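The question's title mentions searching between date ranges. If what you actually need is the price record in effect on each delivery date (rather than the latest record overall), a LATERAL join can express the same correlated lookup in the FROM clause. A sketch, assuming the applicable price is the latest row with start_day <= del_date:
select d.del_id, d.del_date, d.product, d.quantity,
       h.price as u_price, h.max_discount, h.price * d.quantity as total
from deliveries d
left join lateral (
    select price, max_discount
    from price_hist
    where product = d.product
      and start_day <= d.del_date  -- only price records already in effect
    order by start_day desc
    limit 1
) h on true;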

Related

Postgres Distinct Order by

I have two tables that I want to join, order by two timestamps, and get the distinct values (over several columns) as the result. But it doesn't work.
See examples below:
CREATE TABLE t1(myid int, myyear int, mycol int, mdate timestamp);
INSERT INTO t1 VALUES
(11833,2022,1059,'2022-11-03 22:02:00'),(11834,2022,1059,'2022-11-17 19:56:41'),(11832,2021,1058,'2021-11-16 16:38:21'),
(11839,2021,1057,'2021-11-10 18:08:09'),(11847,2021,1055,'2022-05-31 12:13:11'),(11847,2021,1055,'2022-05-31 12:13:11'),
(11850,2021,1049,'2021-09-29 16:11:31'),(11853,2021,1046,'2022-01-24 11:44:41'),(11855,2021,1045,'2022-01-24 11:38:05'),
(11865,2021,1044,'2022-01-24 11:23:51'),(11856,2021,1043,'2022-01-24 11:00:24'),(11840,2021,1042,'2021-11-30 12:28:13'),
(11831,2021,1042,'2021-11-30 12:22:30'),(11846,2022,1042,'2022-11-02 15:06:00'),(11829,2022,1036,'2022-11-02 02:37:00'),
(11826,2021,1035,'2021-09-24 13:07:48'),(11825,2021,1034,'2021-10-06 08:22:23'),(11830,2022,1033,'2022-11-03 21:18:00'),
(11827,2022,1033,'2022-11-15 21:46:04'),(11828,2022,1032,'2022-11-08 16:44:08'),(11824,2022,1031,'2022-10-25 18:09:03'),
(11823,2022,1031,'2022-11-02 03:10:00'),(11822,2022,1030,'2022-10-24 14:59:25')
;
CREATE TABLE t2(myid int, name varchar,idate timestamp);
INSERT INTO t2 VALUES
(11833,'Name1684','2023-01-10 15:52:55'),(11834,'Name1727','2023-01-10 15:52:55'),(11832,'Name609','2023-01-10 15:52:54'),
(11839,'Name608','2023-01-10 15:52:59'),(11847,'Name606','2023-01-10 15:53:03'),(11847,'Name607','2023-01-10 15:53:03'),
(11850,'Name605','2023-01-10 15:53:04'),(11853,'Name604','2023-01-10 15:53:05'),(11855,'Name603','2023-01-10 15:53:06'),
(11865,'Name602','2023-01-10 15:53:10'),(11856,'Name601','2023-01-10 15:53:07'),(11840,'Name600','2023-01-10 15:52:59'),
(11831,'Name1726','2023-01-10 15:52:53'),(11846,'Name1683','2023-01-10 15:53:03'),(11829,'Name1682','2023-01-10 15:52:52'),
(11826,'Name599','2023-01-10 15:52:50'),(11825,'Name598','2023-01-10 15:52:49'),(11830,'Name1681','2023-01-10 15:52:52'),
(11827,'Name1725','2023-01-10 15:52:51'),(11828,'Name1680','2023-01-10 15:52:51'),(11824,'Name1678','2023-01-10 15:52:48'),
(11823,'Name1679','2023-01-10 15:52:48'),(11822,'Name1677','2023-01-10 15:52:47')
;
Here is an example showing the plain join, before any ordering and distinct:
Select
*
from t1
join t2
on t1.myid=t2.myid where t1.mycol =1059
=> Gives me this result:
 myid  | myyear | mycol |        mdate        | myid  |   name   |        idate
-------+--------+-------+---------------------+-------+----------+---------------------
 11833 |   2022 |  1059 | 2022-11-03 22:02:00 | 11833 | Name1684 | 2023-01-10 15:52:55
 11834 |   2022 |  1059 | 2022-11-17 19:56:41 | 11834 | Name1727 | 2023-01-10 15:52:55
I want to order first by mdate, then by idate (both descending, to see the youngest dates), and then keep only the distinct values of (myyear, mycol):
CREATE TABLE expectedresult(myid int, myyear int,mycol int, mdate timestamp,name varchar,idate timestamp);
INSERT INTO expectedresult VALUES
(11834,2022,1059,'2022-11-17 19:56:41','Name1727','2023-01-10 15:52:55')
 myid  | myyear | mycol |        mdate        |   name   |        idate
-------+--------+-------+---------------------+----------+---------------------
 11834 |   2022 |  1059 | 2022-11-17 19:56:41 | Name1727 | 2023-01-10 15:52:55
This is what I have tried:
create table t3 as(
select distinct on (subq1.myyear,subq1.mycol)
*
from(
Select
t1.myid,
t1.myyear,
t1.mycol,
t1.mdate,
t2.name,
t2.idate
from t1
join t2
on t1.myid=t2.myid
order by t1.mdate desc, t2.idate desc) subq1)
But it "distincts" the wrong row(because a younger mdate is available):
select * from t3 where mycol =1059
 myid  | myyear | mycol |        mdate        |   name   |        idate
-------+--------+-------+---------------------+----------+---------------------
 11833 |   2022 |  1059 | 2022-11-03 22:02:00 | Name1684 | 2023-01-10 15:52:55
Here it is also as a fiddle:
https://dbfiddle.uk/eS5FoBeq
SELECT DISTINCT ON (t1.myyear, t1.mycol)
*
FROM
t1
JOIN t2 ON t1.myid = t2.myid
ORDER BY
t1.myyear,
t1.mycol,
t1.mdate DESC,
t2.idate DESC;
or rewrite your query as:
SELECT DISTINCT ON (subq1.myyear, subq1.mycol)
*
FROM (
SELECT
t1.myid,
t1.myyear,
t1.mycol,
t1.mdate,
t2.name,
t2.idate
FROM
t1
JOIN t2 ON t1.myid = t2.myid
ORDER BY
t1.mdate DESC,
t2.idate DESC) subq1
ORDER BY
subq1.myyear,
subq1.mycol,
subq1.mdate DESC,
subq1.idate DESC;
If you DISTINCT ON (x, y), then your ORDER BY should be ORDER BY x, y, z.
x, y are the columns for which you want one unique row. Within each (x, y) group there are many rows, but you only want one, so you need ORDER BY z to pick that row deterministically; otherwise you get an arbitrary row from each (x, y) group.
In general I try to avoid using distinct.
You can use ROW_NUMBER() to number the rows that share the same "myyear" and "mycol", ordered by the newest date, and then select the first row of each group (rn = 1):
with cte as(
Select
t1.myid,
t1.myyear,
t1.mycol,
t1.mdate,
t2.name,
t2.idate,
ROW_NUMBER() OVER (PARTITION BY myyear, mycol ORDER BY mdate DESC) as rn
from t1
join t2
on t1.myid=t2.myid
)
Select *
from cte
where rn = 1
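The question also wants ties broken by idate ("order first by mdate, then by idate"). The window's ORDER BY above only uses mdate; adding idate as a tiebreaker (a small variation on this answer, not part of the original) makes the pick deterministic when two rows share an mdate:
with cte as (
    select
        t1.myid, t1.myyear, t1.mycol, t1.mdate,
        t2.name, t2.idate,
        -- idate breaks ties between rows with equal mdate
        row_number() over (partition by myyear, mycol
                           order by mdate desc, idate desc) as rn
    from t1
    join t2 on t1.myid = t2.myid
)
select *
from cte
where rn = 1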

Express Nearest Neighbor Join in Postgresql?

I have two tables Q and T, both containing a column of float numbers.
What I want to do is, for each number in Q, I want to find a number in T that has the smallest distance to it.
For example, for T={1,7,9} and Q={2,6,10}, I want to return Q,T pairs as {(2,1),(6,7),(10,9)}.
How should I express this query with SQL?
In addition, is it possible to accelerate this join with an index, e.g. by adding an operator class that binds "FOR ORDER BY <->" to a fabs-based distance calculation?
create table t (val_t integer);
create table q (val_q integer);
insert into t values (1),(7),(9);
insert into q values (2),(6),(10);
Start with a query that cross joins the two tables and adds a rank based on the difference:
SELECT val_q, val_t, rank() OVER (PARTITION BY val_q ORDER BY abs(val_t - val_q))
FROM t
JOIN q ON true ;
Use this query in a CTE or subquery and filter by rank:
WITH src AS(
SELECT val_q, val_t, rank() OVER (PARTITION BY val_q ORDER BY abs(val_t - val_q))
FROM t
JOIN q ON true )
SELECT val_q, val_t FROM src
WHERE rank = 1;
val_q | val_t
-------+-------
2 | 1
6 | 7
10 | 9
See https://www.postgresql.org/docs/12/tutorial-window.html
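On the indexing half of the question: as an aside that goes beyond this answer, the btree_gist extension ships a <-> distance operator for plain scalar types, so a GiST index plus a LATERAL subquery should let each lookup run as a nearest-neighbor index scan instead of ranking the whole cross join. A sketch against the t and q tables above:
create extension if not exists btree_gist;
create index on t using gist (val_t);
select q.val_q, n.val_t
from q
cross join lateral (
    select val_t
    from t
    order by val_t <-> q.val_q  -- KNN scan: nearest value first
    limit 1
) n;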
Given this schema:
create table t (tn float);
insert into t values (1), (7), (9);
create table q (qn float);
insert into q values (2), (6), (10);
DISTINCT ON is the most straightforward way:
select distinct on (qn) qn, tn
from q
cross join t
order by qn, abs(qn - tn);
Exploiting a numeric range may perform better depending on your data sizes. If performance is an issue, then you can create an actual temp table for the range_tn CTE and put a gist index on it:
with all_tn as (
select tn
from t
union select null
), range_tn as (
select numrange(tn::numeric, (lead(tn) over w)::numeric, '[]') as tr
from all_tn
window w as (order by tn nulls first)
)
select qn,
case
when lower_inf(tr) then upper(tr)
when upper_inf(tr) then lower(tr)
when 2 * qn - lower(tr) - upper(tr) > 0 then upper(tr)
else lower(tr)
end as tn
from q
join range_tn
on qn::numeric <# tr;
Fiddle here

How can I SUM distinct records in a Postgres database where there are duplicate records?

Imagine a table that looks like this:
The SQL to get this data was just SELECT *
The first column is "row_id" the second is "id" - which is the order ID and the third is "total" - which is the revenue.
I'm not sure why there are duplicate rows in the database, but when I do a SUM(total) it includes the second entry even though the order ID is the same, which makes my numbers larger than if I SELECT DISTINCT id, total, export to Excel, and sum the values manually.
So my question is: how can I SUM over just the distinct order IDs so that I get the same revenue as if I exported every distinct order-ID row to Excel?
Easy - just divide by the count:
select id, sum(total) / count(id)
from orders
group by id
See live demo.
This also handles any level of duplication, e.g. triplicates.
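As a quick sanity check (hypothetical demo table, not from the original post), three identical rows still come out right, since 60 / 3 = 20:
create table orders_demo (id int, total numeric);
insert into orders_demo values (9, 20), (9, 20), (9, 20);
select id, sum(total) / count(id) as total
from orders_demo
group by id;  -- returns 9 | 20
Note this assumes the duplicated rows of an id always carry the same total; if they differ, the division returns their average instead.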
You can try something like this (with your example):
Table
create table test (
row_id int,
id int,
total decimal(15,2)
);
insert into test values
(6395, 1509, 112), (22986, 1509, 112),
(1393, 3284, 40.37), (24360, 3284, 40.37);
Query
with distinct_records as (
select distinct id, total from test
)
select a.id, b.actual_total, array_agg(a.row_id) as row_ids
from test a
inner join (select id, sum(total) as actual_total from distinct_records group by id) b
on a.id = b.id
group by a.id, b.actual_total
Result
| id | actual_total | row_ids |
|------|--------------|------------|
| 1509 | 112 | 6395,22986 |
| 3284 | 40.37 | 1393,24360 |
Explanation
We do not know why orders and totals appear more than once with different row_id values. So, using a common table expression (CTE) introduced by the WITH ... clause, we get the distinct id and total.
Below the CTE, we use this distinct data to do the totaling: we join id in the original table with the aggregation over the distinct values, then comma-separate the row_ids so the information looks cleaner.
SQLFiddle example
http://sqlfiddle.com/#!15/72639/3
Create a custom aggregate:
CREATE OR REPLACE FUNCTION sum_func (
double precision, pg_catalog.anyelement, double precision
)
RETURNS double precision AS
$body$
SELECT case when $3 is not null then COALESCE($1, 0) + $3 else $1 end
$body$
LANGUAGE 'sql';
CREATE AGGREGATE dist_sum (
pg_catalog."any",
double precision)
(
SFUNC = sum_func,
STYPE = float8
);
And then calculate the distinct sum like this:
select dist_sum(distinct id, total)
from orders
SQLFiddle
You can use DISTINCT in your aggregate functions:
SELECT id, SUM(DISTINCT total) FROM orders GROUP BY id
Documentation here: https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SYNTAX-AGGREGATES
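One caveat worth adding (my note, not from this answer): SUM(DISTINCT total) dedupes by value within each group, so if a single id legitimately had two rows with the same total, they would collapse into one. A hypothetical illustration:
create table sum_distinct_demo (row_id int, id int, total numeric);
insert into sum_distinct_demo values (1, 7, 50), (2, 7, 50);
select id, sum(distinct total) from sum_distinct_demo group by id;
-- returns 7 | 50, even if both rows were genuine line items worth 100 together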
If we can trust that the total for one order is repeated identically on every duplicate row, we can eliminate the duplicates in a sub-query by selecting the MAX of the PK id column. An example:
CREATE TABLE test2 (id int, order_id int, total int);
insert into test2 values (1,1,50);
insert into test2 values (2,1,50);
insert into test2 values (5,1,50);
insert into test2 values (3,2,100);
insert into test2 values (4,2,100);
select order_id, sum(total)
from test2 t
join (
select max(id) as id
from test2
group by order_id) as sq
on t.id = sq.id
group by order_id
sql fiddle
In difficult cases:
select
id,
(
SELECT SUM(value::int4)
FROM jsonb_each_text(jsonb_object_agg(row_id, total))
) as total
from orders
group by id
I would suggest just using a sub-query:
SELECT "a"."id", SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
GROUP BY "a"."id"
The above will give you the total for each id.
Use the query below if you want the grand total with duplicates removed:
SELECT SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
Using subselect (http://sqlfiddle.com/#!7/cef1c/51):
select sum(total) from (
select distinct id, total
from orders
) t
Using CTE (http://sqlfiddle.com/#!7/cef1c/53):
with distinct_records as (
select distinct id, total from orders
)
select sum(total) from distinct_records;

postgres - get top category purchased by customer

I have a denormalized table with the columns:
buyer_id
order_id
item_id
item_price
item_category
I would like to return one row per buyer_id:
buyer_id, sum(item_price), item_category
-- but ONLY for the category with the highest sales for that specific buyer_id.
I can't get row_number() or partition to work because I need to order by the sum of item_price relative to item_category relative to buyer. Am I overlooking anything obvious?
You need a few layers of fudging here:
SELECT buyer_id, item_sum, item_category
FROM (
SELECT buyer_id,
rank() OVER (PARTITION BY buyer_id ORDER BY item_sum DESC) AS rnk,
item_sum, item_category
FROM (
SELECT buyer_id, sum(item_price) AS item_sum, item_category
FROM my_table
GROUP BY 1, 3) AS sub2) AS sub
WHERE rnk = 1;
In sub2 you calculate the sum of 'item_price' for each 'item_category' for each 'buyer_id'. In sub you rank these with a window function by 'buyer_id', ordering by 'item_sum' in descending order (so the highest 'item_sum' comes first). In the main query you select those rows where rnk = 1.
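Since DISTINCT ON comes up throughout this page, it is worth noting (as a sketch, not part of the original answer) that one layer can be peeled off: DISTINCT ON is applied after GROUP BY, so the ranking subquery is avoidable:
SELECT DISTINCT ON (buyer_id)
       buyer_id, sum(item_price) AS item_sum, item_category
FROM my_table
GROUP BY buyer_id, item_category
ORDER BY buyer_id, item_sum DESC;
One behavioral difference: on a tie for the top sum, rnk = 1 returns both rows, while DISTINCT ON keeps an arbitrary single one.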

array_agg group by and null

Given this table:
SELECT * FROM CommodityPricing order by dateField
"SILVER";60.45;"2002-01-01"
"GOLD";130.45;"2002-01-01"
"COPPER";96.45;"2002-01-01"
"SILVER";70.45;"2003-01-01"
"GOLD";140.45;"2003-01-01"
"COPPER";99.45;"2003-01-01"
"GOLD";150.45;"2004-01-01"
"MERCURY";60;"2004-01-01"
"SILVER";80.45;"2004-01-01"
As of 2004, COPPER was dropped and MERCURY was introduced.
How can I get the value of (array_agg(value order by date desc) ) [1] as NULL for COPPER?
select commodity,(array_agg(value order by date desc) ) --[1]
from CommodityPricing
group by commodity
"COPPER";"{99.45,96.45}"
"GOLD";"{150.45,140.45,130.45}"
"MERCURY";"{60}"
"SILVER";"{80.45,70.45,60.45}"
SQL Fiddle
select
commodity,
array_agg(
case when commodity = 'COPPER' then null else price end
order by date desc
)
from CommodityPricing
group by commodity
;
To "pad" missing rows with NULL values in the resulting array, build your query on full grid of rows and LEFT JOIN actual values to the grid.
Given this table definition:
CREATE TEMP TABLE price (
commodity text
, value numeric
, ts timestamp -- using ts instead of the inappropriate name date
);
I use generate_series() to get a list of timestamps representing the years and CROSS JOIN to a unique list of all commodities (SELECT DISTINCT ...).
SELECT commodity, (array_agg(value ORDER BY ts DESC)) AS years
FROM generate_series ('2002-01-01 00:00:00'::timestamp
, '2004-01-01 00:00:00'::timestamp
, '1y') t(ts)
CROSS JOIN (SELECT DISTINCT commodity FROM price) c(commodity)
LEFT JOIN price p USING (ts, commodity)
GROUP BY commodity;
Result:
COPPER {NULL,99.45,96.45}
GOLD {150.45,140.45,130.45}
MERCURY {60,NULL,NULL}
SILVER {80.45,70.45,60.45}
SQL Fiddle.
I cast the array to text in the fiddle, because the display sucks and would swallow NULL values otherwise.