I have two tables that I want to join, order by two timestamps, and get distinct values (for several columns) as the result. But it doesn't work.
See examples below:
CREATE TABLE t1(myid int, myyear int, mycol int, mdate timestamp);
INSERT INTO t1 VALUES
(11833,2022,1059,'2022-11-03 22:02:00'),(11834,2022,1059,'2022-11-17 19:56:41'),(11832,2021,1058,'2021-11-16 16:38:21'),
(11839,2021,1057,'2021-11-10 18:08:09'),(11847,2021,1055,'2022-05-31 12:13:11'),(11847,2021,1055,'2022-05-31 12:13:11'),
(11850,2021,1049,'2021-09-29 16:11:31'),(11853,2021,1046,'2022-01-24 11:44:41'),(11855,2021,1045,'2022-01-24 11:38:05'),
(11865,2021,1044,'2022-01-24 11:23:51'),(11856,2021,1043,'2022-01-24 11:00:24'),(11840,2021,1042,'2021-11-30 12:28:13'),
(11831,2021,1042,'2021-11-30 12:22:30'),(11846,2022,1042,'2022-11-02 15:06:00'),(11829,2022,1036,'2022-11-02 02:37:00'),
(11826,2021,1035,'2021-09-24 13:07:48'),(11825,2021,1034,'2021-10-06 08:22:23'),(11830,2022,1033,'2022-11-03 21:18:00'),
(11827,2022,1033,'2022-11-15 21:46:04'),(11828,2022,1032,'2022-11-08 16:44:08'),(11824,2022,1031,'2022-10-25 18:09:03'),
(11823,2022,1031,'2022-11-02 03:10:00'),(11822,2022,1030,'2022-10-24 14:59:25')
;
CREATE TABLE t2(myid int, name varchar,idate timestamp);
INSERT INTO t2 VALUES
(11833,'Name1684','2023-01-10 15:52:55'),(11834,'Name1727','2023-01-10 15:52:55'),(11832,'Name609','2023-01-10 15:52:54'),
(11839,'Name608','2023-01-10 15:52:59'),(11847,'Name606','2023-01-10 15:53:03'),(11847,'Name607','2023-01-10 15:53:03'),
(11850,'Name605','2023-01-10 15:53:04'),(11853,'Name604','2023-01-10 15:53:05'),(11855,'Name603','2023-01-10 15:53:06'),
(11865,'Name602','2023-01-10 15:53:10'),(11856,'Name601','2023-01-10 15:53:07'),(11840,'Name600','2023-01-10 15:52:59'),
(11831,'Name1726','2023-01-10 15:52:53'),(11846,'Name1683','2023-01-10 15:53:03'),(11829,'Name1682','2023-01-10 15:52:52'),
(11826,'Name599','2023-01-10 15:52:50'),(11825,'Name598','2023-01-10 15:52:49'),(11830,'Name1681','2023-01-10 15:52:52'),
(11827,'Name1725','2023-01-10 15:52:51'),(11828,'Name1680','2023-01-10 15:52:51'),(11824,'Name1678','2023-01-10 15:52:48'),
(11823,'Name1679','2023-01-10 15:52:48'),(11822,'Name1677','2023-01-10 15:52:47')
;
Here is an example of the join before applying the ordering and DISTINCT:
select *
from t1
join t2
    on t1.myid = t2.myid
where t1.mycol = 1059
=> Gives me this result:
 myid  | myyear | mycol | mdate               | myid  | name     | idate
-------+--------+-------+---------------------+-------+----------+---------------------
 11833 |   2022 |  1059 | 2022-11-03 22:02:00 | 11833 | Name1684 | 2023-01-10 15:52:55
 11834 |   2022 |  1059 | 2022-11-17 19:56:41 | 11834 | Name1727 | 2023-01-10 15:52:55
I want to order first by column mdate, then by idate (both descending, so the most recent dates come first) and then keep only the distinct values of (myyear, mycol):
CREATE TABLE expectedresult(myid int, myyear int,mycol int, mdate timestamp,name varchar,idate timestamp);
INSERT INTO expectedresult VALUES
(11834,2022,1059,'2022-11-17 19:56:41','Name1727','2023-01-10 15:52:55')
;
 myid  | myyear | mycol | mdate               | name     | idate
-------+--------+-------+---------------------+----------+---------------------
 11834 |   2022 |  1059 | 2022-11-17 19:56:41 | Name1727 | 2023-01-10 15:52:55
This is what I have tried:
create table t3 as (
select distinct on (subq1.myyear, subq1.mycol)
    *
from (
    select
        t1.myid,
        t1.myyear,
        t1.mycol,
        t1.mdate,
        t2.name,
        t2.idate
    from t1
    join t2
        on t1.myid = t2.myid
    order by t1.mdate desc, t2.idate desc) subq1
);
But it "distincts" the wrong row(because a younger mdate is available):
select * from t3 where mycol =1059
 myid  | myyear | mycol | mdate               | name     | idate
-------+--------+-------+---------------------+----------+---------------------
 11833 |   2022 |  1059 | 2022-11-03 22:02:00 | Name1684 | 2023-01-10 15:52:55
Here it is as a fiddle as well:
https://dbfiddle.uk/eS5FoBeq
SELECT DISTINCT ON (t1.myyear, t1.mycol)
*
FROM
t1
JOIN t2 ON t1.myid = t2.myid
ORDER BY
t1.myyear,
t1.mycol,
t1.mdate DESC,
t2.idate DESC;
or rewrite your query as:
SELECT DISTINCT ON (subq1.myyear, subq1.mycol)
*
FROM (
SELECT
t1.myid,
t1.myyear,
t1.mycol,
t1.mdate,
t2.name,
t2.idate
FROM
t1
JOIN t2 ON t1.myid = t2.myid
ORDER BY
t1.mdate DESC,
t2.idate DESC) subq1
ORDER BY
subq1.myyear,
subq1.mycol,
subq1.mdate DESC,
subq1.idate DESC;
If you use DISTINCT ON (x, y), then your ORDER BY should be ORDER BY x, y, z.
Here x, y are the columns you want one unique row for. A group (x, y) may contain many rows, but you only want one, so you need ORDER BY z to pick that single row deterministically; otherwise PostgreSQL returns an arbitrary row from each (x, y) group.
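A minimal sketch of the rule (the table demo and its columns are illustrative):
-- Keep exactly one row per (x, y): the one with the largest z.
-- PostgreSQL requires the DISTINCT ON expressions to be the leftmost
-- ORDER BY expressions; without z in the ORDER BY, the surviving row
-- of each group would be arbitrary.
SELECT DISTINCT ON (x, y) x, y, z
FROM demo
ORDER BY x, y, z DESC;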
In general I try to avoid using DISTINCT.
You can use ROW_NUMBER() to number the rows that share the same "myyear" and "mycol", ordered by the newest date, and then select the first row of each group (rn = 1).
with cte as (
select
    t1.myid,
    t1.myyear,
    t1.mycol,
    t1.mdate,
    t2.name,
    t2.idate,
    -- tie-break on idate as well, matching the requested ordering
    ROW_NUMBER() OVER (PARTITION BY t1.myyear, t1.mycol
                       ORDER BY t1.mdate DESC, t2.idate DESC) as rn
from t1
join t2
    on t1.myid = t2.myid
)
Select *
from cte
where rn = 1
I have two tables Q and T, both containing a column of float numbers.
What I want to do is, for each number in Q, I want to find a number in T that has the smallest distance to it.
For example, for T={1,7,9} and Q={2,6,10}, I want to return Q,T pairs as {(2,1),(6,7),(10,9)}.
How should I express this query with SQL?
In addition, is it possible to accelerate this join with an index, e.g. by adding an operator class that binds "FOR ORDER BY <->" to a fabs() calculation?
create table t (val_t integer);
create table q (val_q integer);
insert into t values (1),(7),(9);
insert into q values (2),(6),(10);
Start with a query that cross joins the two tables and adds a rank based on the difference:
SELECT val_q, val_t, rank() OVER (PARTITION BY val_q ORDER BY abs(val_t - val_q))
FROM t
JOIN q ON true ;
Use this query in a cte or subquery and filter by rank:
WITH src AS(
SELECT val_q, val_t, rank() OVER (PARTITION BY val_q ORDER BY abs(val_t - val_q))
FROM t
JOIN q ON true )
SELECT val_q, val_t FROM src
WHERE rank = 1;
val_q | val_t
-------+-------
2 | 1
6 | 7
10 | 9
See https://www.postgresql.org/docs/12/tutorial-window.html
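Regarding the index question: a sketch assuming the btree_gist extension, which provides a <-> distance operator for plain scalar types so that a GiST index can return rows ordered by distance (the index and alias names are illustrative):

CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX t_val_t_gist ON t USING gist (val_t);

-- LATERAL fetches the nearest val_t for each val_q, index-ordered by distance:
SELECT q.val_q, nn.val_t
FROM q
CROSS JOIN LATERAL (
    SELECT val_t
    FROM t
    ORDER BY val_t <-> q.val_q
    LIMIT 1
) nn;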
Given this schema:
create table t (tn float);
insert into t values (1), (7), (9);
create table q (qn float);
insert into q values (2), (6), (10);
DISTINCT ON is the most straightforward way:
select distinct on (qn) qn, tn
from q
cross join t
order by qn, abs(qn - tn);
Exploiting a numeric range may perform better, depending on your data sizes. If performance is an issue, you can materialize the range_tn CTE as an actual temp table and put a GiST index on it (a sketch follows the query):
with all_tn as (
select tn
from t
union select null
), range_tn as (
select numrange(tn::numeric, (lead(tn) over w)::numeric, '[]') as tr
from all_tn
window w as (order by tn nulls first)
)
select qn,
case
when lower_inf(tr) then upper(tr)
when upper_inf(tr) then lower(tr)
when 2 * qn - lower(tr) - upper(tr) > 0 then upper(tr)
else lower(tr)
end as tn
from q
join range_tn
on qn::numeric <@ tr;
Fiddle here
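A hedged sketch of that materialization (the temp table and index names are illustrative):

CREATE TEMP TABLE range_tn_tmp AS
WITH all_tn AS (
    SELECT tn FROM t
    UNION SELECT NULL
)
SELECT numrange(tn::numeric, (lead(tn) OVER w)::numeric, '[]') AS tr
FROM all_tn
WINDOW w AS (ORDER BY tn NULLS FIRST);

-- GiST supports range containment (<@), so the join can use this index:
CREATE INDEX range_tn_tmp_tr_idx ON range_tn_tmp USING gist (tr);
ANALYZE range_tn_tmp;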
Imagine a table that looks like this:
The SQL to get this data was just SELECT *
The first column is "row_id", the second is "id" (the order ID), and the third is "total" (the revenue).
I'm not sure why there are duplicate rows in the database, but when I do a SUM(total), the second entry is included even though the order ID is the same. That makes my numbers larger than when I SELECT DISTINCT id, total, export to Excel, and sum the values manually.
So my question is: how can I SUM over just the distinct order IDs, so that I get the same revenue as if I exported every distinct order-ID row to Excel?
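A minimal repro (the orders table name and these values are assumptions based on the description above):

CREATE TABLE orders (row_id int, id int, total numeric);
INSERT INTO orders VALUES
    (6395, 1509, 112), (22986, 1509, 112);  -- the same order appears twice

SELECT SUM(total) FROM orders;  -- 224: the duplicate inflates the sum
SELECT SUM(total)
FROM (SELECT DISTINCT id, total FROM orders) d;  -- 112: what I actually want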
Thanks in advance!
Easy - just divide by the count:
select id, sum(total) / count(id)
from orders
group by id
See live demo.
It also handles any level of duplication, e.g. triplicates.
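If you want a single overall revenue figure, a sketch wrapping the query above:

select sum(t) as revenue
from (
    select sum(total) / count(id) as t
    from orders
    group by id
) per_order;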
You can try something like this (with your example):
Table
create table test (
row_id int,
id int,
total decimal(15,2)
);
insert into test values
(6395, 1509, 112), (22986, 1509, 112),
(1393, 3284, 40.37), (24360, 3284, 40.37);
Query
with distinct_records as (
select distinct id, total from test
)
select a.id, b.actual_total, array_agg(a.row_id) as row_ids
from test a
inner join (select id, sum(total) as actual_total from distinct_records group by id) b
on a.id = b.id
group by a.id, b.actual_total
Result
| id | actual_total | row_ids |
|------|--------------|------------|
| 1509 | 112 | 6395,22986 |
| 3284 | 40.37 | 1393,24360 |
Explanation
We do not know why orders and totals appear more than once with different row_id values. So, using a common table expression (CTE) introduced by the WITH clause, we first get the distinct id and total pairs.
Below the CTE, we use this de-duplicated data for the totaling: we join id in the original table with the aggregation over the distinct values. Then we collect the row_ids with array_agg so the information looks cleaner.
SQLFiddle example
http://sqlfiddle.com/#!15/72639/3
Create a custom aggregate:
CREATE OR REPLACE FUNCTION sum_func (
double precision, pg_catalog.anyelement, double precision
)
RETURNS double precision AS
$body$
SELECT case when $3 is not null then COALESCE($1, 0) + $3 else $1 end
$body$
LANGUAGE 'sql';
CREATE AGGREGATE dist_sum (
pg_catalog."any",
double precision)
(
SFUNC = sum_func,
STYPE = float8
);
And then calculate the distinct sum; the DISTINCT applies to the (id, total) argument pair, so each duplicated pair feeds the aggregate only once:
select dist_sum(distinct id, total)
from orders
SQLFiddle
You can use DISTINCT in your aggregate functions:
SELECT id, SUM(DISTINCT total) FROM orders GROUP BY id
Documentation here: https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SYNTAX-AGGREGATES
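Caveat: within each id group this also collapses rows that legitimately share the same total (two distinct rows of one order each worth 10 would sum to 10, not 20), so it is only safe when duplicate rows always carry identical totals.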
If we can trust that the total for one order is identical on every duplicated row, we can eliminate the duplicates in a sub-query by selecting the MAX of the PK id column. An example:
CREATE TABLE test2 (id int, order_id int, total int);
insert into test2 values (1,1,50);
insert into test2 values (2,1,50);
insert into test2 values (5,1,50);
insert into test2 values (3,2,100);
insert into test2 values (4,2,100);
select order_id, sum(total)
from test2 t
join (
select max(id) as id
from test2
group by order_id) as sq
on t.id = sq.id
group by order_id
sql fiddle
In difficult cases:
select
id,
(
SELECT SUM(value::int4)
FROM jsonb_each_text(jsonb_object_agg(row_id, total))
) as total
from orders
group by id
I would suggest just using a sub-query:
SELECT "a"."id", SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
GROUP BY "a"."id"
The above will give you the total of each id.
Use the following if you want the overall total with every duplicate removed:
SELECT SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
Using subselect (http://sqlfiddle.com/#!7/cef1c/51):
select sum(total) from (
    select distinct id, total
    from orders
) as distinct_orders;
Using CTE (http://sqlfiddle.com/#!7/cef1c/53):
with distinct_records as (
select distinct id, total from orders
)
select sum(total) from distinct_records;
Given this table:
SELECT * FROM CommodityPricing order by dateField
"SILVER";60.45;"2002-01-01"
"GOLD";130.45;"2002-01-01"
"COPPER";96.45;"2002-01-01"
"SILVER";70.45;"2003-01-01"
"GOLD";140.45;"2003-01-01"
"COPPER";99.45;"2003-01-01"
"GOLD";150.45;"2004-01-01"
"MERCURY";60;"2004-01-01"
"SILVER";80.45;"2004-01-01"
As of 2004, COPPER was dropped and MERCURY was introduced.
How can I get the value of (array_agg(value order by date desc) ) [1] as NULL for COPPER?
select commodity,(array_agg(value order by date desc) ) --[1]
from CommodityPricing
group by commodity
"COPPER";"{99.45,96.45}"
"GOLD";"{150.45,140.45,130.45}"
"MERCURY";"{60}"
"SILVER";"{80.45,70.45,60.45}"
SQL Fiddle
select
commodity,
array_agg(
case when commodity = 'COPPER' then null else value end
order by date desc
)
from CommodityPricing
group by commodity
;
To "pad" missing rows with NULL values in the resulting array, build your query on full grid of rows and LEFT JOIN actual values to the grid.
Given this table definition:
CREATE TEMP TABLE price (
commodity text
, value numeric
, ts timestamp -- using ts instead of the inappropriate name date
);
I use generate_series() to get a list of timestamps representing the years, and CROSS JOIN it to a unique list of all commodities (SELECT DISTINCT ...).
SELECT commodity, (array_agg(value ORDER BY ts DESC)) AS years
FROM generate_series ('2002-01-01 00:00:00'::timestamp
, '2004-01-01 00:00:00'::timestamp
, '1y') t(ts)
CROSS JOIN (SELECT DISTINCT commodity FROM price) c(commodity)
LEFT JOIN price p USING (ts, commodity)
GROUP BY commodity;
Result:
COPPER {NULL,99.45,96.45}
GOLD {150.45,140.45,130.45}
MERCURY {60,NULL,NULL}
SILVER {80.45,70.45,60.45}
SQL Fiddle.
I cast the array to text in the fiddle, because the display sucks and would swallow NULL values otherwise.