How can I SUM distinct records in a Postgres database where there are duplicate records? - postgresql

Imagine a table that looks like this:
The SQL to get this data was just SELECT *
The first column is "row_id", the second is "id" (the order ID), and the third is "total" (the revenue).
I'm not sure why there are duplicate rows in the database, but when I do a SUM(total) it includes the second entry even though the order ID is the same. That makes my numbers larger than if I SELECT DISTINCT id, total, export to Excel, and then sum the values manually.
So my question is: how can I SUM over just the distinct order IDs so that I get the same revenue as if I exported every distinct order ID row to Excel?
Thanks in advance!

Easy - just divide by the count:
select id, sum(total) / count(id)
from orders
group by id
See live demo.
This also handles any level of duplication, e.g. triplicates etc.
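If you want the single revenue figure the question asks for, the same idea can be wrapped in a subquery (a sketch, assuming an orders table as in the query above):
-- each order's duplicated total is divided by its duplicate count,
-- so every order contributes its total exactly once
select sum(per_order_total) as revenue
from (
    select id, sum(total) / count(id) as per_order_total
    from orders
    group by id
) t;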

You can try something like this (with your example):
Table
create table test (
row_id int,
id int,
total decimal(15,2)
);
insert into test values
(6395, 1509, 112), (22986, 1509, 112),
(1393, 3284, 40.37), (24360, 3284, 40.37);
Query
with distinct_records as (
select distinct id, total from test
)
select a.id, b.actual_total, array_agg(a.row_id) as row_ids
from test a
inner join (select id, sum(total) as actual_total from distinct_records group by id) b
on a.id = b.id
group by a.id, b.actual_total
Result
| id | actual_total | row_ids |
|------|--------------|------------|
| 1509 | 112 | 6395,22986 |
| 3284 | 40.37 | 1393,24360 |
Explanation
We do not know why the same order and total appear more than once with different row_id values. So, using a common table expression (CTE) introduced with the with ... clause, we first get the distinct id and total pairs.
Below the CTE, we use this de-duplicated data to do the totalling: we join the original table to an aggregation over the distinct values on id. Finally we collect the row_ids with array_agg so the information reads more cleanly.
SQLFiddle example
http://sqlfiddle.com/#!15/72639/3

Create a custom aggregate:
-- sum_func($1 = running total, $2 = the DISTINCT key (order id), $3 = the value to add)
CREATE OR REPLACE FUNCTION sum_func (
    double precision, pg_catalog.anyelement, double precision
)
RETURNS double precision AS
$body$
    SELECT case when $3 is not null then COALESCE($1, 0) + $3 else $1 end
$body$
LANGUAGE sql;

-- dist_sum(DISTINCT key, value): sums the value once per distinct (key, value) pair
CREATE AGGREGATE dist_sum (
    pg_catalog."any",
    double precision)
(
    SFUNC = sum_func,
    STYPE = float8
);
And then calculate the distinct sum like:
select dist_sum(distinct id, total)
from orders
SQLFiddle
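As a quick check against the sample test table created in the earlier answer (assuming both snippets are run in the same database), each duplicated order should now be counted only once:
-- expected: roughly 152.37 (112 + 40.37) rather than the double-counted 304.74
select dist_sum(distinct id, total) from test;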

You can use DISTINCT in your aggregate functions:
SELECT id, SUM(DISTINCT total) FROM orders GROUP BY id
Documentation here: https://www.postgresql.org/docs/9.6/static/sql-expressions.html#SYNTAX-AGGREGATES

If we can trust that the total for one order really belongs on a single row, we can eliminate the duplicates in a sub-query by selecting the MAX of the PK id column. An example:
CREATE TABLE test2 (id int, order_id int, total int);
insert into test2 values (1,1,50);
insert into test2 values (2,1,50);
insert into test2 values (5,1,50);
insert into test2 values (3,2,100);
insert into test2 values (4,2,100);
select order_id, sum(total)
from test2 t
join (
select max(id) as id
from test2
group by order_id) as sq
on t.id = sq.id
group by order_id
sql fiddle

In difficult cases:
select
id,
(
SELECT SUM(value::numeric)
FROM jsonb_each_text(jsonb_object_agg(row_id, total))
) as total
from orders
group by id

I would suggest just using a sub-query:
SELECT "a"."id", SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
GROUP BY "a"."id"
The above will give you the total for each id.
Use the query below if you want one overall total with the duplicates removed:
SELECT SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM "Database"."Schema"."Table") AS "a"
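One caveat with this pattern: DISTINCT ON is normally paired with an ORDER BY that starts with the same expression, so that the row kept for each id is deterministic. Here any of the duplicates works because they carry the same total, but being explicit costs nothing. A sketch against an orders table with the question's columns:
SELECT SUM("a"."total")
FROM (SELECT DISTINCT ON ("id") * FROM orders ORDER BY "id", "row_id") AS "a"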

Using subselect (http://sqlfiddle.com/#!7/cef1c/51):
select sum(total) from (
select distinct id, total
from orders
) t
Using CTE (http://sqlfiddle.com/#!7/cef1c/53):
with distinct_records as (
select distinct id, total from orders
)
select sum(total) from distinct_records;

Related

Postgresql - select query with aggregated decisions column as json

I have table which contains specified columns:
id - bigint
decision - varchar(80)
type - varchar(258)
I want to make a select query that returns something like this (id, decision values with their counts as JSON, type):
id decisions type
1 {"firstDecisionsValue":countOfThisValue, "secondDecisionsValue": countOfThisValue} entryType
I heard that I can try playing with json_agg, but it does not allow a COUNT inside it, so I tried to use json_agg with this query:
SELECT ac.id,
json_agg(ac.decision),
ac.type
FROM myTable ac
GROUP BY ac.id, ac.type;
but it ends with this (for the entry with id 1 there are two occurrences of firstDecisionsValue and one occurrence of secondDecisionsValue):
id decisions type
1 {"firstDecisionsValue", "firstDecisionsValue", "secondDecisionsValue"} entryType
minimal reproducible example
CREATE TABLE myTable
(
id bigint,
decisions varchar(80),
type varchar(258)
);
INSERT INTO myTable
VALUES (1, 'firstDecisionsValue', 'myType');
INSERT INTO myTable
VALUES (1, 'firstDecisionsValue', 'myType');
INSERT INTO myTable
VALUES (1, 'secondDecisionsValue', 'myType');
Can you give me any tips on how to make it return the expected result?
1, {"firstDecisionsValue":2, "secondDecisionsValue":1}, entryType
You can try this
SELECT a.id, jsonb_object_agg(a.decisions, a.count), a.type
FROM
( SELECT id, type, decisions, count(*) AS count
FROM myTable
GROUP BY id, type, decisions
) AS a
GROUP BY a.id, a.type
see the result in dbfiddle.
First, you calculate the count for each (id, type, decisions) combination; after that, you use jsonb_object_agg to build the JSON.
Demo
with data as (
select
ac.id,
ac.type,
ac.decisions,
count(*)
from
myTable ac
group by
ac.id,
ac.type,
ac.decisions
)
select
d.id,
d.type,
json_object_agg(d.decisions, d.count)
from
data d
group by
d.id,
d.type

Getting the first and last score from the Player ID

I'm trying to get the first and last record from this table:
CREATE TABLE record(
id SERIAL PRIMARY KEY,
player_id BIGINT,
score INT
);
INSERT INTO record(player_id,score) VALUES(603,15);
INSERT INTO record(player_id,score) VALUES(603,5);
INSERT INTO record(player_id,score) VALUES(604,15);
INSERT INTO record(player_id,score) VALUES(604,10);
INSERT INTO record(player_id,score) VALUES(604,15);
INSERT INTO record(player_id,score) VALUES(612,20);
INSERT INTO record(player_id,score) VALUES(612,5);
INSERT INTO record(player_id,score) VALUES(612,10);
INSERT INTO record(player_id,score) VALUES(612,15);
How do I get the first and last score in a single row per player_id? The result should look like this:
|==========|============|===============|
|player_id | first_score| last_score |
|==========|============|===============|
|603 | 15 | 5 |
|604 | 15 | 15 |
|612 | 20 | 15 |
|==========|============|===============|
Here's what I have so far in DB-Fiddle, and I can't seem to get it right:
https://www.db-fiddle.com/f/gLuJ1dbwg5Eor3SUrubpbe/1
This is a greatest-n-per-group question which is typically done using distinct on () in Postgres.
You need one query to get the first score and one to get the last (assuming that "first" and "last" are defined by the id column):
select fs.player_id,
fs.score as first_score,
ls.score as last_score
from (
-- "first" score per player
select distinct on (player_id) *
from record
order by player_id, id
) fs
left join (
-- "last" score per player
select distinct on (player_id) *
from record
order by player_id, id desc
) ls on ls.player_id = fs.player_id
order by player_id;
Another option (but probably less efficient) is to use a window function that numbers the scores per player. Then use filtered aggregation to "pivot" the result into two columns.
with numbered as (
select player_id,
row_number() over (partition by player_id order by id) as rn_first,
row_number() over (partition by player_id order by id desc) as rn_last,
score
from record
)
select player_id,
max(score) filter (where rn_first = 1) as first_score,
max(score) filter (where rn_last = 1) as last_score
from numbered
where rn_first = 1 or rn_last = 1
group by player_id
order by player_id;
Online example
I think I got it... but let me know if there is a more efficient way to do this:
SELECT
first_scores.player_id,
first_score,
last_score
FROM
(
SELECT
record.player_id,
record.id,
score AS first_score
FROM record
INNER JOIN (
SELECT
player_id,
MIN(record.id) AS min_record_id
FROM record
GROUP BY player_id
ORDER BY player_id
) AS min_scores
ON min_record_id = record.id
) AS first_scores
INNER JOIN
(
SELECT
record.player_id,
record.id,
score AS last_score
FROM record
INNER JOIN (
SELECT
player_id,
MAX(record.id) AS max_record_id
FROM record
GROUP BY player_id
ORDER BY player_id
) AS max_scores
ON max_record_id = record.id
) AS last_scores ON first_scores.player_id = last_scores.player_id
Here's the DB fiddle:
https://www.db-fiddle.com/f/gLuJ1dbwg5Eor3SUrubpbe/4

Express Nearest Neighbor Join in Postgresql?

I have two tables Q and T, both containing a column of float numbers.
What I want to do is, for each number in Q, I want to find a number in T that has the smallest distance to it.
For example, for T={1,7,9} and Q={2,6,10}, I want to return Q,T pairs as {(2,1),(6,7),(10,9)}.
How should I express this query with SQL?
In addition, is it possible to accelerate this join with an index, e.g. by adding an operator class that binds "FOR ORDER BY <->" to a fabs-style distance calculation?
create table t (val_t integer);
create table q (val_q integer);
insert into t values (1),(7),(9);
insert into q values (2),(6),(10);
Start with a query that cross joins the two tables and adds a rank based on the difference:
SELECT val_q, val_t, rank() OVER (PARTITION BY val_q ORDER BY abs(val_t - val_q))
FROM t
JOIN q ON true ;
Use this query in a CTE or subquery and filter by rank:
WITH src AS(
SELECT val_q, val_t, rank() OVER (PARTITION BY val_q ORDER BY abs(val_t - val_q))
FROM t
JOIN q ON true )
SELECT val_q, val_t FROM src
WHERE rank = 1;
val_q | val_t
-------+-------
2 | 1
6 | 7
10 | 9
See https://www.postgresql.org/docs/12/tutorial-window.html
Given this schema:
create table t (tn float);
insert into t values (1), (7), (9);
create table q (qn float);
insert into q values (2), (6), (10);
DISTINCT ON is the most straightforward way:
select distinct on (qn) qn, tn
from q
cross join t
order by qn, abs(qn - tn);
Exploiting a numeric range may perform better depending on your data sizes. If performance is an issue, then you can create an actual temp table for the range_tn CTE and put a gist index on it:
with all_tn as (
select tn
from t
union select null
), range_tn as (
select numrange(tn::numeric, (lead(tn) over w)::numeric, '[]') as tr
from all_tn
window w as (order by tn nulls first)
)
select qn,
case
when lower_inf(tr) then upper(tr)
when upper_inf(tr) then lower(tr)
when 2 * qn - lower(tr) - upper(tr) > 0 then upper(tr)
else lower(tr)
end as tn
from q
join range_tn
on qn::numeric <# tr;
Fiddle here
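The temp-table-plus-GiST idea mentioned above would look roughly like this (a sketch; the table and index names are illustrative, and range columns get GiST support out of the box):
-- materialise the ranges once, then index them
create temporary table range_tn as
select numrange(tn::numeric, (lead(tn) over w)::numeric, '[]') as tr
from (select tn from t union select null) all_tn
window w as (order by tn nulls first);

create index range_tn_tr_idx on range_tn using gist (tr);

-- the final select stays the same as above, and the join condition
-- qn::numeric <# tr can now use the index instead of scanning every range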

Selecting other columns not in count, group by

So I have a table as follows
product_id sender_id timestamp ...other columns...
1 2 1222
1 2 3423
1 2 1231
2 2 890
3 4 234
2 3 234234
I want to get rows where sender_id = 2, but I want to count and group by product_id and sort by timestamp descending. This means I need the following result
product_id sender_id timestamp count ...other columns...
1 2 3423 3
2 2 890 1
I tried the following query:
SELECT product_id, sender_id, timestamp, count(product_id), ...other columns...
FROM table
WHERE sender_id = 2
GROUP BY product_id
But I get the following error: Error in query: ERROR: column "table.sender_id" must appear in the GROUP BY clause or be used in an aggregate function
Seems like I cannot SELECT columns that are not in the GROUP BY. Another method which I found online was to join
SELECT product_id, sender_id, timestamp, count, ...other columns...
FROM table
JOIN (
SELECT product_id, COUNT(product_id) AS count
FROM table
GROUP BY (product_id)
) table1 ON table.product_id = table1.product_id
WHERE sender_id = 2
GROUP BY product_id
But doing this simply lists all rows without grouping or counting. My guess is that the ON part simply extends table again.
Try grouping using product_id, sender_id
select product_id, sender_id, count(product_id), max(timestamp) maxtm
from t
where sender_id = 2
group by product_id, sender_id
order by maxtm desc
If you want other columns too:
select t.*, t1.product_count
from t
inner join (
select product_id, sender_id, count(product_id) product_count, max(timestamp) maxtm
from t
where sender_id = 2
group by product_id, sender_id
) t1
on t.product_id = t1.product_id and t.sender_id = t1.sender_id and t.timestamp = t1.maxtm
order by t1.maxtm desc
Just work through it with your data:
CREATE TABLE products (product_id INTEGER,
sender_id INTEGER,
time_stamp INTEGER)
INSERT INTO products VALUES
(1,2,1222),
(1,2,3423),
(1,2,1231),
(2,2,890),
(3,4,234),
(2,3,234234)
SELECT product_id,sender_id,string_agg(time_stamp::text,','),count(product_id)
FROM products
WHERE sender_id=2
GROUP BY product_id,sender_id
Here each row has a distinct time_stamp, so you either need to apply some aggregate to it or just remove that column from the select statement.
If you remove time_stamp from the select statement, it becomes very easy, like below:
SELECT product_id,sender_id,count(product_id)
FROM products
WHERE sender_id=2
GROUP BY product_id,sender_id
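A sketch of one more Postgres-specific option (not from the answers above), using DISTINCT ON plus a window function over the products sample table: it keeps the other columns, picks the latest time_stamp, and returns the per-product count in one pass:
select distinct on (product_id)
    product_id, sender_id, time_stamp,
    count(*) over (partition by product_id) as product_count
from products
where sender_id = 2
order by product_id, time_stamp desc;
-- wrap this in a subquery and order by time_stamp desc if you need that final sort order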

T-SQL how to count the number of duplicate rows then print the outcome?

I have a table ProductNumberDuplicates_backups, which has two columns named ProductID and ProductNumber. There are some duplicate ProductNumbers. How can I count the distinct number of products, then print out an outcome like "() products were backed up."? Because this is inside a stored procedure, I have to use a variable @numrecord as the distinct number of rows. I put my code like this:
set @numrecord= select distinct ProductNumber
from ProductNumberDuplicates_backups where COUNT(*) > 1
group by ProductID
having Count(ProductNumber)>1
Print cast(@numrecord as varchar)+' product(s) were backed up.'
Obviously the error was after the = sign, as the select cannot follow it. I've searched for similar cases but they are just plain select statements. Please help. Many thanks!
Try
select @numrecord= count(distinct ProductNumber)
from ProductNumberDuplicates_backups
Print cast(@numrecord as varchar)+' product(s) were backed up.'
begin tran
create table ProductNumberDuplicates_backups (
ProductNumber int
)
insert ProductNumberDuplicates_backups(ProductNumber)
select 1
union all
select 2
union all
select 1
union all
select 3
union all
select 2
select * from ProductNumberDuplicates_backups
declare @numRecord int
select @numRecord = count(ProductNumber) from
(select ProductNumber, ROW_NUMBER()
over (partition by ProductNumber order by ProductNumber) RowNumber
from ProductNumberDuplicates_backups) p
where p.RowNumber > 1
print cast(@numRecord as varchar) + ' product(s) were backed up.'
rollback