Improving performance of a GROUP BY ... HAVING COUNT(...) > 1 in PostgreSQL

I'm trying to select the orders that are part of a trip with multiple orders.
I have tried many approaches but can't find a performant query.
To reproduce the problem, here is the setup (100 000 rows here, but it takes more like 1 000 000 rows to see the timeout on db-fiddle).
Schema (PostgreSQL v14)
create table trips (id bigint primary key);
create table orders (id bigint primary key, trip_id bigint);
create index trips_idx on trips (id);
create index orders_idx on orders (id);
create index orders_trip_idx on orders (trip_id);
insert into trips (id) select seq from generate_series(1,100000) seq;
insert into orders (id, trip_id) select seq, floor(random() * 100000 + 1) from generate_series(1,100000) seq;
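If you reproduce this, it may also help to refresh the planner statistics after the bulk load (my addition, not part of the original setup), so the plans below reflect realistic estimates:
analyze trips;
analyze orders;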
Query #1
explain analyze select orders.id
from orders
inner join trips on trips.id = orders.trip_id
inner join orders trips_orders on trips_orders.trip_id = trips.id
group by orders.id, trips.id
having count(trips_orders) > 1
limit 50
;
Here is what pgmustard gives me on the real query: (screenshot omitted)

Do you actually need the join on trips? You could try
SELECT shared.id
FROM orders shared
WHERE EXISTS (SELECT *
              FROM orders other
              WHERE other.trip_id = shared.trip_id
                AND other.id != shared.id);
to replace the group by with a hash join, or
SELECT unnest(array_agg(orders.id))
FROM orders
GROUP BY trip_id
HAVING count(*) > 1
;
to hopefully get Postgres to just use the trip_id index.
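A further variant worth benchmarking (my own sketch, not part of the original answer): a window function counts the orders per trip in a single scan of orders, avoiding both the join and the array aggregation:
select id
from (select id,
             count(*) over (partition by trip_id) as orders_in_trip
      from orders) t
where orders_in_trip > 1
limit 50;
Whether this beats the EXISTS form depends on the data distribution, so compare the explain analyze plans.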

Related

Generate data with at least one occurrence

I have three tables:
create table genres
(
genre_id serial primary key,
genre_name varchar NOT NULL UNIQUE
);
create table movies
(
movie_id serial primary key,
movie_name varchar NOT NULL
);
create table movie_genres
(
movie_id integer references movies NOT NULL,
genre_id integer references genres NOT NULL,
PRIMARY KEY(movie_id, genre_id)
);
Tables genres and movies are already filled with data, and I want to generate some random data for the table movie_genres so that every movie has at least one genre.
I tried it this way, but then it is possible for a movie to end up without any genre. Can anyone help me with that, please?
insert into movie_genres
select movie_id, genre_id
from genres cross join movies
where random() < 0.15;
Hmm, you can try joining a derived table in which you first select one random genre and then UNION some more at random.
INSERT INTO movie_genres
(movie_id,
genre_id)
SELECT m.movie_id,
rg.genre_id
FROM movies m
CROSS JOIN ((SELECT g.genre_id
FROM genres g
ORDER BY random()
LIMIT 1)
UNION
(SELECT g.genre_id
FROM genres g
WHERE random() < 0.15)) rg;
That however means that the one genre selected first is the same for every movie. To overcome this and have the first genre be random per movie, a lateral join can be used. (Remark: you need to use some column from the outer table in the derived table, as otherwise the optimizer seems to optimize the LATERAL away.)
INSERT INTO movie_genres
(movie_id,
genre_id)
SELECT rg.movie_id,
rg.genre_id
FROM movies m
CROSS JOIN LATERAL ((SELECT g.genre_id,
m.movie_id -- that's just here to force the optimizer to keep the join lateral
FROM genres g
ORDER BY random()
LIMIT 1)
UNION
(SELECT g.genre_id,
m.movie_id
FROM genres g
WHERE random() < 0.15)) rg;
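A two-step variant (my own sketch, not part of the original answer): keep the question's random insert as step one, then backfill one random genre for every movie that ended up with none. The same lateral trick keeps the backfilled pick random per movie:
INSERT INTO movie_genres (movie_id, genre_id)
SELECT movie_id, genre_id
FROM movies CROSS JOIN genres
WHERE random() < 0.15;

INSERT INTO movie_genres (movie_id, genre_id)
SELECT m.movie_id, rg.genre_id
FROM movies m
CROSS JOIN LATERAL (SELECT g.genre_id,
                           m.movie_id -- again only here to keep the join lateral
                    FROM genres g
                    ORDER BY random()
                    LIMIT 1) rg
WHERE NOT EXISTS (SELECT 1
                  FROM movie_genres mg
                  WHERE mg.movie_id = m.movie_id);
Step two cannot violate the primary key, because it only touches movies that have no genre rows at all.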

PostgreSQL: count other IDs that have the same value of another column

Let's say we have the following table that stores the id of an observation and its address_id. You can create the table with the following code:
drop table if exists schema.pl_address_cnt;
create table schema.pl_address_cnt (
id serial,
address_id int);
insert into schema.pl_address_cnt(address_id) values
(100), (101), (100), (101), (100), (125), (128), (200), (200), (100);
My task is to count, for each id, how many other ids (hence the -1) have the same address_id. I've come up with a solution that turns out to be quite expensive on the original dataset (according to explain). I wonder whether my solution can somehow be optimised.
with tmp_table as (
    select address_id, count(distinct id) as id_count
    from schema.pl_address_cnt
    group by address_id
)
select id, id_count - 1
from schema.pl_address_cnt as pac
left join tmp_table as tt on tt.address_id = pac.address_id;
You can try omitting the CTE and doing a self left join on a common address but a different ID, then aggregating:
SELECT pac1.id,
count(pac2.id)
FROM pl_address_cnt pac1
LEFT JOIN pl_address_cnt pac2
ON pac1.address_id = pac2.address_id
AND pac1.id <> pac2.id
GROUP BY pac1.id
ORDER BY pac1.id;
For performance you can try indexes on (address_id, id) and (id).
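A window function is another candidate here (my own sketch, not from the original answer): it computes the same counts in a single scan, avoiding the row multiplication that the self join causes for frequent address_ids:
SELECT id,
       count(*) OVER (PARTITION BY address_id) - 1 AS id_count
FROM schema.pl_address_cnt
ORDER BY id;
Since id is unique (serial), counting rows per address_id is equivalent to counting distinct ids.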

LEFT JOIN trouble with multiple tables

I have the following query
SELECT a.account_id, sum(p.amount) AS amount
FROM accounts a
LEFT JOIN users_accounts ua
    JOIN users u
        JOIN payments p ON p.meta_id = u.user_id
    ON u.user_id = ua.user_id
ON ua.account_id = a.account_id
WHERE p.date_prcsd BETWEEN '2017-08-01 00:00:00' AND '2017-08-31 23:59:59'
GROUP BY a.account_id
ORDER BY account_id ASC;
What I want is all the rows from accounts a, with zeroes where amount data is missing. Instead I get the same result set for different types of joins and different join structures: only the rows that have some payments in p.
Where do I go wrong?
Simplified:
SELECT a.account_id
,sum(coalesce(p2.amount, 0)) AS amount
FROM accounts a
LEFT JOIN users_accounts ua ON (a.account_id = ua.account_id)
LEFT JOIN users u ON (ua.user_id = u.user_id)
LEFT JOIN (
SELECT p.meta_id
,p.amount
FROM payments p
WHERE p.date BETWEEN '2017-08-01' AND '2017-08-10'
) AS p2 ON (u.user_id = p2.meta_id)
GROUP BY a.account_id
ORDER BY account_id ASC;
Result:
account_id | amount
------------+--------
1 | 4
2 | 0
3 | 0
(3 rows)
Explanation: you need to take care of the NULL values that a left join produces; coalesce() does that for you. The WHERE clause is the real problem in your query: it is applied after the join, so it filters out the payment-less rows you want in your end result, effectively turning the left joins into inner joins. Filtering payments in a subquery (or in the join's ON clause) avoids that. On top of that, you left out the LEFT JOIN for the other tables. I created a simplified test db:
$ cat tables.sql
drop table users_accounts;
drop table payments;
drop table users;
drop table accounts;
create table accounts (account_id serial primary key, name varchar not null);
create table users (user_id serial primary key, name varchar not null);
create table users_accounts(user_id int references users(user_id),
                            account_id int references accounts(account_id));
create table payments(meta_id int references users(user_id), amount int not null, date date);
insert into accounts (account_id, name) values (1, 'Account A'), (2, 'Account B'), (3, 'Account C');
insert into users (user_id, name) values (1, 'Marc'), (2, 'Ruben'), (3, 'Isaak');
insert into users_accounts (user_id, account_id) values (1,1),(2,1);
insert into payments(meta_id, amount, date) values (1,1, '2017-08-01'), (1,2, '2017-08-11'), (1,3, '2017-08-03'), (2,1, null), (2,2, null), (2,3, null);
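Equivalently (a sketch of my own against this simplified schema), the date filter can live in the LEFT JOIN's ON clause instead of a derived table; a predicate in the ON clause of a left join only restricts the joined side, so unmatched accounts survive:
SELECT a.account_id,
       coalesce(sum(p.amount), 0) AS amount
FROM accounts a
LEFT JOIN users_accounts ua ON a.account_id = ua.account_id
LEFT JOIN users u ON ua.user_id = u.user_id
LEFT JOIN payments p ON u.user_id = p.meta_id
                    AND p.date BETWEEN '2017-08-01' AND '2017-08-10'
GROUP BY a.account_id
ORDER BY a.account_id;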

SQLite very slow SELECT time

I got some strange behavior in my program, and maybe you can shed some light on it.
Today I was testing some code and realized that a specific query was really slow (it took about 2 minutes).
Here is the select:
select distinct table1.someName
from table1
INNER JOIN table2 ON table2.id = table1.t2_id
INNER JOIN table3 ON table1.id = table3.t1_id
INNER JOIN table4 ON Table3.id = table4.t3_id
INNER JOIN table5 ON table5.id = table4.t5_id
INNER JOIN table6 ON table4.id = table6.t4_id
where t4_name = 'whatever'
and t2_name = 'moarWhatever'
and timestamp_till is null
order by someName
So the thing is: the result is about 120 records, and the INNER JOINs narrow the timestamp_till is null check down to about 20 rows per record.
What bugs me most: as a test I copied the whole of table6 into a newly created table and renamed timestamp_till to ende. On that table the select finishes in about 0.1 seconds ...
Is timestamp_till some sort of reserved name in SQLite3? Could this be a bug in the SQLite engine? Is it my fault? oO
Edit: added the EXPLAIN QUERY PLAN output.
When querying with the and timestamp_till is null condition, it gives:
0|0|4|SEARCH TABLE table5 USING COVERING INDEX sqlite_autoindex_table5_1 (t4_name=?) (~1 rows)
0|1|3|SEARCH TABLE table4 USING INDEX table4.fk_table4_1_idx (t5_id=?) (~10 rows)
0|2|2|SEARCH TABLE table3 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|3|0|SEARCH TABLE table1 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|4|1|SEARCH TABLE table2 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|5|5|SEARCH TABLE table6 USING INDEX table6.fk_table6_ts_till (timestamp_till=?) (~2 rows)
0|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|USE TEMP B-TREE FOR DISTINCT
and the fast one is:
select distinct table1.someName
from table1
INNER JOIN table2 ON table2.id = table1.t2_id
INNER JOIN table3 ON table1.id = table3.t1_id
INNER JOIN table4 ON Table3.id = table4.t3_id
INNER JOIN table5 ON table5.id = table4.t5_id
INNER JOIN table6 ON table4.id = table6.t4_id
where t4_name = 'whatever'
and t2_name = 'moarWhatever'
order by someName
and its result:
0|0|4|SEARCH TABLE table5 USING COVERING INDEX sqlite_autoindex_table5_1 (t4_name=?) (~1 rows)
0|1|3|SEARCH TABLE table4 USING INDEX table4.fk_table4_1_idx (t5_id=?) (~10 rows)
0|2|2|SEARCH TABLE table3 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|3|0|SEARCH TABLE table1 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|4|1|SEARCH TABLE table2 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|5|5|SEARCH TABLE table6 USING COVERING INDEX sqlite_autoindex_table6_1 (id=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|USE TEMP B-TREE FOR DISTINCT
With the test table that is a copy of table6:
0|0|4|SEARCH TABLE table5 USING COVERING INDEX sqlite_autoindex_table5_1 (t4_name=?) (~1 rows)
0|1|3|SEARCH TABLE table4 USING INDEX table4.fk_t5_idx (t5_id=?) (~10 rows)
0|2|2|SEARCH TABLE table3 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|3|0|SEARCH TABLE table1 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|4|1|SEARCH TABLE table2 USING INTEGER PRIMARY KEY (rowid=?) (~1 rows)
0|5|5|SEARCH TABLE test USING INDEX test.fk_test__idx (id=?) (~2 rows)
0|0|0|USE TEMP B-TREE FOR GROUP BY
0|0|0|USE TEMP B-TREE FOR DISTINCT
Create script for the test table:
CREATE TABLE "test"(
"id" INTEGER NOT NULL,
"t12_id" INTEGER NOT NULL,
"value" DECIMAL NOT NULL,
"anfang" INTEGER NOT NULL,
"ende" INTEGER DEFAULT NULL,
PRIMARY KEY("id","t12_id","anfang"),
CONSTRAINT "fk_test_t12_id"
FOREIGN KEY("t12_id")
REFERENCES "table12"("id"),
CONSTRAINT "fk_test_id"
FOREIGN KEY("id")
REFERENCES "id_col"("id"),
CONSTRAINT "fk_test_anfang"
FOREIGN KEY("anfang")
REFERENCES "ts_col"("id"),
CONSTRAINT "fk_test_ende"
FOREIGN KEY("ende")
REFERENCES "ts_col"("id")
);
CREATE INDEX "test.fk_test_idx_t12_id" ON "test"("t12_id");
CREATE INDEX "test.fk_test_idx_id" ON "test"("id");
CREATE INDEX "test.fk_test_anfang" ON "test"("anfang");
CREATE INDEX "test.fk_test_ende" ON "test"("ende");
So long, zai
A first note: SQLite will use at most one index per table in a query, never more (with the current version).
Thus, here is what SQLite is doing:
Slow query: use the index on timestamp_till
Fast query (no timestamp_till): use the (auto) index on table6.id.
I see two workarounds.
You could use a subquery:
select distinct SomeName FROM
(
select table1.someName as "SomeName", timestamp_till
from table1
INNER JOIN table2 ON table2.id = table1.t2_id
INNER JOIN table3 ON table1.id = table3.t1_id
INNER JOIN table4 ON Table3.id = table4.t3_id
INNER JOIN table5 ON table5.id = table4.t5_id
INNER JOIN table6 ON table4.id = table6.t4_id
where t4_name = 'whatever'
and t2_name = 'moarWhatever'
) Q
where timestamp_till is null
order by SomeName;
Or you can drop your index on timestamp_till, if you don't need it elsewhere.
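A third option (not in the original answer, but a documented SQLite idiom): prefix the column with a unary + in the WHERE clause. This makes the expression ineligible for index use without you having to drop the index:
select distinct table1.someName
from table1
INNER JOIN table2 ON table2.id = table1.t2_id
INNER JOIN table3 ON table1.id = table3.t1_id
INNER JOIN table4 ON Table3.id = table4.t3_id
INNER JOIN table5 ON table5.id = table4.t5_id
INNER JOIN table6 ON table4.id = table6.t4_id
where t4_name = 'whatever'
and t2_name = 'moarWhatever'
and +timestamp_till is null -- the unary + disqualifies the index on timestamp_till
order by someName;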
There are perhaps also some speed gains to be made by reordering your joins. Usually putting the smallest table first is faster, but this can vary greatly.

Postgres group by update - slow query

I am using postgres 9.X.
I have two tables:
create table tablea (id integer);
create table tableb (id integer, value integer);
Both tables are indexed on id.
Table A can have duplicate IDs.
Example:
Table A
ID
1
1
1
2
1
I intend to insert the number of occurrences of each ID into table B (this table already has all the IDs that are in Table A, with value initially 0):
Table B
ID Value
1 4
2 1
3 0
4 0
Here is my SQL statement
update tableB set value = value + sq.total
from
( select id, count(*) as total from TableA group by id ) as sq
where sq.id = tableB.id;
With 3-10 million entries in TableA, it is taking an awfully long time. Is there a way I can optimize this query?
Do you need tableB to be initially populated? An INSERT...SELECT from tableA into an empty tableB (with no indexes on tableB) should be faster:
insert into tableb (id, value)
select id, count(*)
from tablea
group by id;
and then add your indexes to tableB once the data is there.
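Note that this INSERT...SELECT only creates rows for IDs that actually occur in TableA; in the example above, IDs 3 and 4 (value 0) would be missing from the result. If those zero rows are needed, one option (a sketch; all_ids stands for whatever hypothetical table holds the complete ID list) is a left join, since count(a.id) yields 0 where nothing matches:
insert into tableb (id, value)
select i.id, count(a.id)
from all_ids i
left join tablea a on a.id = i.id
group by i.id;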