SELECT FOR UPDATE returns zero rows - postgresql

UPDATE: Problem solved!
That was a bug in PostgreSQL; Tom Lane fixed it in this commit.
Why does SELECT FOR UPDATE return 0 rows in the scenario below? If I just execute the SQL query from the second transaction on its own, it always returns 1 row.
TRANSACTION 1:
BEGIN;
-- This query sets t1c3 to its current value; it doesn't change anything
UPDATE t1 SET t1c3 = 'string_value_1' WHERE t1c1 = 123456789;
-- Query returned successfully: one row affected, 51 msec execution time.
TRANSACTION 2:
WITH
cte1 AS (
SELECT t2c2 FROM t2 WHERE t2c1 = 'string_value_2'
),
cte2 AS (
SELECT * FROM t1
WHERE
t1c1 = 123456789
AND t1c2 = (SELECT t2c2 FROM cte1)
FOR UPDATE
)
SELECT * FROM cte2
-- Waiting
TRANSACTION 1:
COMMIT;
-- Query returned successfully with no result in 41 msec.
TRANSACTION 2:
-- Returned 0 rows
Example:
CREATE TABLE t1 (_pk serial, t1c1 integer, t1c2 integer, t1c3 text);
CREATE TABLE t2 (_pk serial, t2c1 text, t2c2 integer);
insert into t1 (t1c1, t1c2, t1c3) values(123456789, 100, 'string_value_1');
insert into t2 (t2c1, t2c2) values('string_value_2', 100);

Interesting question! With EXPLAIN (ANALYZE, VERBOSE), I get the following query plan:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
CTE Scan on cte2  (cost=51.13..51.15 rows=1 width=44) (actual time=4544.488..4544.488 rows=0 loops=1)
  Output: cte2._pk, cte2.t1c1, cte2.t1c2, cte2.t1c3
  CTE cte1
    ->  Seq Scan on public.t2  (cost=0.00..24.50 rows=6 width=4) (actual time=0.002..0.003 rows=1 loops=1)
          Output: t2.t2c2
          Filter: (t2.t2c1 = 'string_value_2'::text)
  CTE cte2
    ->  LockRows  (cost=0.12..26.63 rows=1 width=50) (actual time=4544.485..4544.485 rows=0 loops=1)
          Output: t1._pk, t1.t1c1, t1.t1c2, t1.t1c3, t1.ctid
          InitPlan 2 (returns $1)
            ->  CTE Scan on cte1  (cost=0.00..0.12 rows=6 width=4) (actual time=0.005..0.006 rows=1 loops=1)
                  Output: cte1.t2c2
          ->  Seq Scan on public.t1  (cost=0.00..26.50 rows=1 width=50) (actual time=0.018..0.019 rows=1 loops=1)
                Output: t1._pk, t1.t1c1, t1.t1c2, t1.t1c3, t1.ctid
                Filter: ((t1.t1c1 = 123456789) AND (t1.t1c2 = $1))
Planning time: 0.116 ms
Execution time: 4544.535 ms
(17 rows)
The outer "CTE Scan on cte2" seems to drop the row that was still there during the "LockRows" step. Postgres is known to re-evaluate WHERE clauses after acquiring a lock (see this example with work queues). Perhaps the query plan contains a WHERE clause on the invisible ctid row identifier, which does change after any UPDATE?
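For illustration, here is a minimal two-session sketch of that re-check behaviour on a plain table (the jobs table is hypothetical; this demonstrates the documented recheck semantics, not the bug above):

CREATE TABLE jobs (id int PRIMARY KEY, status text);
INSERT INTO jobs VALUES (1, 'pending');

-- Session A:
BEGIN;
UPDATE jobs SET status = 'done' WHERE id = 1;

-- Session B (blocks on the row lock held by session A):
SELECT * FROM jobs WHERE status = 'pending' FOR UPDATE;

-- Session A:
COMMIT;

-- Session B now wakes up, re-evaluates the WHERE clause against the new
-- row version (status = 'done'), and returns 0 rows. Here that is correct,
-- because the predicate genuinely stopped holding; in the question above
-- the UPDATE changed nothing the predicate depends on, so 0 rows was a bug.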
I've asked this question on the Postgres mailing list to see if other people are able to clarify what's happening here.

I'd say this is a bug in PostgreSQL, and you should report it.

FOR UPDATE has no meaning inside a CTE. This should work, though:
WITH
cte1 AS (
SELECT t2c2 FROM t2 WHERE t2c1 = 'string_value_2'
),
cte2 AS (
SELECT * FROM t1
WHERE
t1c1 = 123456789
AND t1c2 = (SELECT t2c2 FROM cte1)
)
SELECT * FROM cte2 FOR UPDATE

Related

Bad execution plan on Postgresql

I'm trying to migrate from SQL Server to PostgreSQL.
Here is my PostgreSQL code:
Create View person_names As
SELECT lp."Code", n."Name", n."Type"
from "Persons" lp
Left Join LATERAL
(
Select *
From "Names" n
Where n.id = lp.id
Order By "Date" desc
Limit 1
) n on true
limit 100;
Explain
Select "Code" From person_names;
It prints
"Subquery Scan on person_names (cost=0.42..448.85 rows=100 width=10)"
" -> Limit (cost=0.42..447.85 rows=100 width=56)"
" -> Nested Loop Left Join (cost=0.42..303946.91 rows=67931 width=56)"
" -> Seq Scan on ""Persons"" lp (cost=0.00..1314.31 rows=67931 width=10)"
" -> Limit (cost=0.42..4.44 rows=1 width=100)"
" -> Index Only Scan Backward using ""IX_Names_Person"" on ""Names"" n (cost=0.42..4.44 rows=1 width=100)"
" Index Cond: ("id" = (lp."id")::numeric)"
Why is there an "Index Only Scan" on the "Names" table? That table is not required to produce the result. On SQL Server I get only a single scan over the "Persons" table.
How can I tune Postgres to get better query plans? I'm on the latest version, PostgreSQL 15 beta 3.
Here is SQL Server version:
Create View person_names As
SELECT top 100 lp."Code", n."Name", n."Type"
from "Persons" lp
Outer Apply
(
Select Top 1 *
From "Names" n
Where n.id = lp.id
Order By "Date" desc
) n
GO
SET SHOWPLAN_TEXT ON;
GO
Select "Code" From person_names;
It gives the correct execution plan:
|--Top(TOP EXPRESSION:((100)))
|--Index Scan(OBJECT:([Persons].[IX_Persons] AS [lp]))
Change the lateral join to a regular left join; Postgres is then able to remove the scan on the Names table:
create View person_names
As
SELECT lp.Code, n.Name, n.Type
from Persons lp
Left Join (
Select distinct on (id) *
From Names n
Order By id, Date desc
) n on n.id = lp.id
limit 100;
The following index will support the distinct on () in case you do include columns from the Names table:
create index on "Names"(id, "Date" desc);
For select Code from person_names this gives me this plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on persons lp (cost=0.00..309.00 rows=20000 width=7) (actual time=0.009..1.348 rows=20000 loops=1)
Planning Time: 0.262 ms
Execution Time: 1.738 ms
For select Code, name, type From person_names; this gives me this plan:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Right Join  (cost=559.42..14465.93 rows=20000 width=25) (actual time=5.585..68.545 rows=20000 loops=1)
  Hash Cond: (n.id = lp.id)
  ->  Unique  (cost=0.42..13653.49 rows=20074 width=26) (actual time=0.053..57.323 rows=20000 loops=1)
        ->  Index Scan using names_id_date_idx on names n  (cost=0.42..12903.49 rows=300000 width=26) (actual time=0.052..41.125 rows=300000 loops=1)
  ->  Hash  (cost=309.00..309.00 rows=20000 width=11) (actual time=5.407..5.407 rows=20000 loops=1)
        Buckets: 32768  Batches: 1  Memory Usage: 1116kB
        ->  Seq Scan on persons lp  (cost=0.00..309.00 rows=20000 width=11) (actual time=0.011..2.036 rows=20000 loops=1)
Planning Time: 0.460 ms
Execution Time: 69.180 ms
Of course I had to guess the table structures as you haven't provided any DDL.
Online example
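The underlying optimization is PostgreSQL's join removal: a LEFT JOIN can be dropped from the plan entirely when nothing from its right-hand side is referenced and the join key is provably unique on that side. A minimal sketch on hypothetical tables a and b:

create table a (id int primary key);
create table b (id int primary key, a_id int unique);

-- Nothing from b is referenced and b.a_id is unique,
-- so the planner can remove the join:
explain select a.id from a left join b on b.a_id = a.id;
-- expected plan: a single Seq Scan on a, with no join node

The distinct on (id) subquery in the rewritten view gives the planner the same at-most-one-row guarantee for the Names side.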
Change your view definition like this:
create view person_names as
select p."Code",
(select "Name"
from "Names" n
where n.id = p.id
order by "Date" desc
limit 1)
from "Persons" p
limit 100;

Can a LEFT JOIN be deferred to only apply to matching rows?

When joining on a table and then filtering (LIMIT 30 for instance), Postgres will apply the JOIN operation to all rows, even if the columns from those rows are only used in the returned columns and not as a filtering predicate.
This would be understandable for an INNER JOIN (PG has to know if the row will be returned or not) or for a LEFT JOIN without a unique constraint (PG has to know if more than one row will be returned or not), but for a LEFT JOIN on a UNIQUE column, this seems wasteful: if the query matches 10k rows, then 10k joins will be performed, and then only 30 will be returned.
It would seem more efficient to "delay", or defer, the join, as much as possible, and this is something that I've seen happen on some other queries.
Splitting this into a subquery (SELECT * FROM (SELECT * FROM main WHERE x LIMIT 30) LEFT JOIN secondary) works, by ensuring that only 30 items are returned from the main table before joining them, but it feels like I'm missing something, and the "standard" form of the query should also apply the same optimization.
Looking at the EXPLAIN plans, however, I can see that the number of rows joined is always the total number of rows, without "early bailing out" as you could see when, for instance, running a Seq Scan with a LIMIT 5.
Example schema, with a main table and a secondary one: secondary columns will only be returned, never filtered on.
drop table if exists secondary;
drop table if exists main;
create table main(id int primary key not null, main_column int);
create index main_column on main(main_column);
insert into main(id, main_column) SELECT i, i % 3000 from generate_series( 1, 1000000, 1) i;
create table secondary(id serial primary key not null, main_id int references main(id) not null, secondary_column int);
create unique index secondary_main_id on secondary(main_id);
insert into secondary(main_id, secondary_column) SELECT i, (i + 17) % 113 from generate_series( 1, 1000000, 1) i;
analyze main;
analyze secondary;
Example query:
explain analyze verbose select main.id, main_column, secondary_column
from main
left join secondary on main.id = secondary.main_id
where main_column = 5
order by main.id
limit 50;
This is the most "obvious" way of writing the query, takes on average around 5ms on my computer.
Explain:
Limit  (cost=3742.93..3743.05 rows=50 width=12) (actual time=5.010..5.322 rows=50 loops=1)
  Output: main.id, main.main_column, secondary.secondary_column
  ->  Sort  (cost=3742.93..3743.76 rows=332 width=12) (actual time=5.006..5.094 rows=50 loops=1)
        Output: main.id, main.main_column, secondary.secondary_column
        Sort Key: main.id
        Sort Method: top-N heapsort  Memory: 27kB
        ->  Nested Loop Left Join  (cost=11.42..3731.90 rows=332 width=12) (actual time=0.123..4.446 rows=334 loops=1)
              Output: main.id, main.main_column, secondary.secondary_column
              Inner Unique: true
              ->  Bitmap Heap Scan on public.main  (cost=11.00..1036.99 rows=332 width=8) (actual time=0.106..1.021 rows=334 loops=1)
                    Output: main.id, main.main_column
                    Recheck Cond: (main.main_column = 5)
                    Heap Blocks: exact=334
                    ->  Bitmap Index Scan on main_column  (cost=0.00..10.92 rows=332 width=0) (actual time=0.056..0.057 rows=334 loops=1)
                          Index Cond: (main.main_column = 5)
              ->  Index Scan using secondary_main_id on public.secondary  (cost=0.42..8.12 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=334)
                    Output: secondary.id, secondary.main_id, secondary.secondary_column
                    Index Cond: (secondary.main_id = main.id)
Planning Time: 0.761 ms
Execution Time: 5.423 ms
The manually deferred version:
explain analyze verbose select m.id, main_column, secondary_column
from (
select main.id, main_column
from main
where main_column = 5
order by main.id
limit 50
) m
left join secondary on m.id = secondary.main_id
where main_column = 5
order by m.id
limit 50
This returns the same results, in 2ms.
The total EXPLAIN cost is also about three times lower, in line with the performance gain we're seeing.
Limit  (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.219..2.027 rows=50 loops=1)
  Output: m.id, m.main_column, secondary.secondary_column
  ->  Nested Loop Left Join  (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.216..1.900 rows=50 loops=1)
        Output: m.id, m.main_column, secondary.secondary_column
        Inner Unique: true
        ->  Subquery Scan on m  (cost=1048.02..1048.77 rows=1 width=8) (actual time=1.201..1.515 rows=50 loops=1)
              Output: m.id, m.main_column
              Filter: (m.main_column = 5)
              ->  Limit  (cost=1048.02..1048.14 rows=50 width=8) (actual time=1.196..1.384 rows=50 loops=1)
                    Output: main.id, main.main_column
                    ->  Sort  (cost=1048.02..1048.85 rows=332 width=8) (actual time=1.194..1.260 rows=50 loops=1)
                          Output: main.id, main.main_column
                          Sort Key: main.id
                          Sort Method: top-N heapsort  Memory: 27kB
                          ->  Bitmap Heap Scan on public.main  (cost=11.00..1036.99 rows=332 width=8) (actual time=0.054..0.753 rows=334 loops=1)
                                Output: main.id, main.main_column
                                Recheck Cond: (main.main_column = 5)
                                Heap Blocks: exact=334
                                ->  Bitmap Index Scan on main_column  (cost=0.00..10.92 rows=332 width=0) (actual time=0.029..0.030 rows=334 loops=1)
                                      Index Cond: (main.main_column = 5)
        ->  Index Scan using secondary_main_id on public.secondary  (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=50)
              Output: secondary.id, secondary.main_id, secondary.secondary_column
              Index Cond: (secondary.main_id = m.id)
Planning Time: 0.161 ms
Execution Time: 2.115 ms
This is a toy dataset here, but on a real DB, the IO difference is significant (no need to fetch 1000 rows when 30 are enough), and the timing difference also quickly adds up (up to an order of magnitude slower).
So my question: is there any way to get the planner to understand that the JOIN can be applied much later in the process?
It seems like something that could be applied automatically to gain a sizeable performance boost.
Deferred joins are good. It's usually helpful to run the limit operation on a subquery that yields only the id values. The ORDER BY ... LIMIT operation then has less data to sort just to discard it.
select main.id, main.main_column, secondary.secondary_column
from main
join (
select id
from main
where main_column = 5
order by id
limit 50
) selection on main.id = selection.id
left join secondary on main.id = secondary.main_id
order by main.id
limit 50
It's also possible adding id to your main_column index will help. With a BTREE index the query planner knows it can get the id values in ascending order from the index, so it may be able to skip the sort step entirely and just scan the first 50 values.
drop index main_column; -- the single-column index is superseded by the one below
create index main_column on main(main_column, id);
Edit: In a large table, the heavy lifting of your query will be the selection of the 50 main.id values to process. To get those 50 id values as cheaply as possible, you can scan the covering index I proposed from within the subquery I proposed. Once you've got your 50 id values, looking up 50 rows' worth of details from your various tables by main.id and secondary.main_id is trivial: you have the correct indexes in place, and it's a limited number of rows, so it won't take much time.
It looks like your table sizes are too small for various optimizations to have much effect, though. Query plans change a lot when tables are larger.
Alternative query, using row_number() instead of LIMIT (I think you could even omit LIMIT here):
-- prepare q3 AS
select m.id, main_column, secondary_column
from (
select id, main_column
, row_number() OVER (ORDER BY id, main_column) AS rn
from main
where main_column = 5
) m
left join secondary on m.id = secondary.main_id
WHERE m.rn <= 50
ORDER BY m.id
LIMIT 50
;
Putting the subsetting into a CTE can prevent it from being merged into the main query:
PREPARE q6 AS
WITH
-- MATERIALIZED -- keyword needed from version 12 on; earlier versions always materialize CTEs
xxx AS (
SELECT DISTINCT x.id
FROM main x
WHERE x.main_column = 5
ORDER BY x.id
LIMIT 50
)
select m.id, m.main_column, s.secondary_column
from main m
left join secondary s on m.id = s.main_id
WHERE EXISTS (
SELECT *
FROM xxx x WHERE x.id = m.id
)
order by m.id
-- limit 50
;

PostgreSQL delete-statement using big/long-sub-query hangs/fails

I am currently trying to delete 1+ million rows in a table (actually 30+ million, but I have made a subset, due to problems arising with that as well), where the condition is that the row must not be referenced as a foreign key in any other table, and I delete in batches of 30000 rows at a time.
So the query looks like:
DELETE FROM table_name tn WHERE tn.id IN (
SELECT tn2.id FROM table_name as tn2
LEFT JOIN table_name_join_1 ON table_name_join_1.table_id = tn2.id
LEFT JOIN table_name_join_2 ON table_name_join_2.table_id = tn2.id
...
LEFT JOIN table_name_join_19 ON table_name_join_19.table_id = tn2.id
WHERE table_name_join_1.table_id IS NULL
AND table_name_join_2.table_id IS NULL
...
AND table_name_join_19.table_id IS NULL
LIMIT 30000 OFFSET x
)
The table is referenced by 19 different tables, hence the many left joins in the sub-query, and counting the total number of rows that will be affected takes 61 seconds without LIMIT & OFFSET.
The problem is that the query just hangs when used in the delete statement, but works when just counting using COUNT(1). I am not sure if there is a better way of deleting a lot of rows in a table, or whether it is a matter of examining the tables referencing the table in question and seeing if some indexes are off in some way.
Hope someone can help :D It's quite annoying to see a query work and then just hang/fail straight afterwards when used as a sub-query.
I use psycopg2 on Python 2.7.17 (a work thing). I have also wondered when to close the cursor from the psycopg2 connection to improve speed; currently I create the cursor outside the loop running the delete, and close it along with the db connection when the script is done. Previously the cursor was closed after each commit of a delete statement, but that seemed a bit much to me, I don't know. The current loop looks like:
cursor = conn.cursor()
while count >= offset:
    ...
    delete(cursor, batch_size, offset)
    ...
    offset += batch_size
Also, is it a bad idea to commit() after each delete statement is executed, or should I wait until the loop has finished executing all the delete statements and then commit? If so, shouldn't I look into using transactions instead?
Basically, I hope someone can tell me why this is so slow / fails, even though a count without LIMIT and OFFSET "only" takes 60 seconds.
DELETE FROM xxx accepts almost the same syntax as SELECT COUNT(*) FROM xxx, so just to test the plan, you can run the fragment below and check whether you get an indexed plan:
EXPLAIN
SELECT COUNT(*)
FROM table_name tn
WHERE NOT EXISTS ( SELECT *
FROM table_name_join_1 x1 WHERE x1.table_id = tn.id
)
--
AND NOT EXISTS ( SELECT *
FROM table_name_join_2 x2 WHERE x2.table_id = tn.id
)
--
AND NOT EXISTS ( SELECT *
FROM table_name_join_3 x3 WHERE x3.table_id = tn.id
)
--
-- et cetera
--
;
Create some data, since it is hard to benchmark pseudocode:
SELECT version();
CREATE TABLE table_name
( id serial NOT NULL PRIMARY KEY
, name text
);
INSERT INTO table_name ( name )
SELECT 'Name_' || gs::text
FROM generate_series(1,100000) gs;
--
CREATE TABLE table_name_join_2
( id serial NOT NULL PRIMARY KEY
, table_id INTEGER REFERENCES table_name(id)
, name text
);
INSERT INTO table_name_join_2(table_id,name)
SELECT src.id , 'Name_' || src.id :: text
FROM table_name src
WHERE src.id % 2 = 0
;
--
CREATE TABLE table_name_join_3
( id serial NOT NULL PRIMARY KEY
, table_id INTEGER REFERENCES table_name(id)
, name text
);
INSERT INTO table_name_join_3(table_id,name)
SELECT src.id , 'Name_' || src.id :: text
FROM table_name src
WHERE src.id % 3 = 0
;
--
CREATE TABLE table_name_join_5
( id serial NOT NULL PRIMARY KEY
, table_id INTEGER REFERENCES table_name(id)
, name text
);
INSERT INTO table_name_join_5(table_id,name)
SELECT src.id , 'Name_' || src.id :: text
FROM table_name src
WHERE src.id % 5 = 0
;
--
CREATE TABLE table_name_join_7
( id serial NOT NULL PRIMARY KEY
, table_id INTEGER REFERENCES table_name(id)
, name text
);
INSERT INTO table_name_join_7(table_id,name)
SELECT src.id , 'Name_' || src.id :: text
FROM table_name src
WHERE src.id % 7 = 0
;
--
CREATE TABLE table_name_join_11
( id serial NOT NULL PRIMARY KEY
, table_id INTEGER REFERENCES table_name(id)
, name text
);
INSERT INTO table_name_join_11(table_id,name)
SELECT src.id , 'Name_' || src.id :: text
FROM table_name src
WHERE src.id % 11 = 0
;
Now, run the DELETE query:
VACUUM ANALYZE table_name;
VACUUM ANALYZE table_name_join_2;
VACUUM ANALYZE table_name_join_3;
VACUUM ANALYZE table_name_join_5;
VACUUM ANALYZE table_name_join_7;
EXPLAIN ANALYZE
DELETE
FROM table_name tn
WHERE 1=1
AND NOT EXISTS ( SELECT * FROM table_name_join_2 x2 WHERE x2.table_id = tn.id)
--
AND NOT EXISTS ( SELECT * FROM table_name_join_3 x3 WHERE x3.table_id = tn.id)
--
AND NOT EXISTS ( SELECT * FROM table_name_join_5 x5 WHERE x5.table_id = tn.id)
--
AND NOT EXISTS ( SELECT * FROM table_name_join_7 x7 WHERE x7.table_id = tn.id)
--
AND NOT EXISTS ( SELECT * FROM table_name_join_11 x11 WHERE x11.table_id = tn.id)
--
-- et cetera
--
;
SELECT count(*) FROM table_name;
Now, exactly the same, but with supporting indexes on the FKs:
CREATE INDEX table_name_join_2_2 ON table_name_join_2( table_id);
CREATE INDEX table_name_join_3_3 ON table_name_join_3( table_id);
CREATE INDEX table_name_join_5_5 ON table_name_join_5( table_id);
CREATE INDEX table_name_join_7_7 ON table_name_join_7( table_id);
CREATE INDEX table_name_join_11_11 ON table_name_join_11( table_id);
VACUUM ANALYZE table_name;
VACUUM ANALYZE table_name_join_2;
VACUUM ANALYZE table_name_join_3;
VACUUM ANALYZE table_name_join_5;
VACUUM ANALYZE table_name_join_7;
EXPLAIN ANALYZE
DELETE
FROM table_name tn
WHERE 1=1
...
;
----------
Query plan#1:
----------
DROP SCHEMA
CREATE SCHEMA
SET
version
----------------------------------------------------------------------------------------------------------
PostgreSQL 11.6 on armv7l-unknown-linux-gnueabihf, compiled by gcc (Raspbian 8.3.0-6+rpi1) 8.3.0, 32-bit
(1 row)
CREATE TABLE
INSERT 0 100000
CREATE TABLE
INSERT 0 50000
CREATE TABLE
INSERT 0 33333
CREATE TABLE
INSERT 0 20000
CREATE TABLE
INSERT 0 14285
CREATE TABLE
INSERT 0 9090
SET
SET
VACUUM
VACUUM
VACUUM
VACUUM
VACUUM
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on table_name tn  (cost=3969.52..7651.94 rows=11429 width=36) (actual time=812.010..812.011 rows=0 loops=1)
  ->  Hash Anti Join  (cost=3969.52..7651.94 rows=11429 width=36) (actual time=206.775..712.982 rows=20779 loops=1)
        Hash Cond: (tn.id = x7.table_id)
        ->  Hash Anti Join  (cost=3557.10..7088.09 rows=13334 width=34) (actual time=183.070..654.030 rows=24242 loops=1)
              Hash Cond: (tn.id = x5.table_id)
              ->  Hash Anti Join  (cost=2979.10..6329.25 rows=16667 width=28) (actual time=149.870..578.173 rows=30303 loops=1)
                    Hash Cond: (tn.id = x3.table_id)
                    ->  Hash Anti Join  (cost=2016.11..5124.59 rows=25000 width=22) (actual time=95.589..461.053 rows=45455 loops=1)
                          Hash Cond: (tn.id = x2.table_id)
                          ->  Merge Anti Join  (cost=572.11..3271.21 rows=50000 width=16) (actual time=14.486..261.955 rows=90910 loops=1)
                                Merge Cond: (tn.id = x11.table_id)
                                ->  Index Scan using table_name_pkey on table_name tn  (cost=0.29..2344.99 rows=100000 width=10) (actual time=0.031..118.968 rows=100000 loops=1)
                                ->  Sort  (cost=571.82..589.22 rows=6960 width=10) (actual time=14.446..20.365 rows=9090 loops=1)
                                      Sort Key: x11.table_id
                                      Sort Method: quicksort  Memory: 612kB
                                      ->  Seq Scan on table_name_join_11 x11  (cost=0.00..127.60 rows=6960 width=10) (actual time=0.029..6.939 rows=9090 loops=1)
                          ->  Hash  (cost=819.00..819.00 rows=50000 width=10) (actual time=80.439..80.440 rows=50000 loops=1)
                                Buckets: 65536  Batches: 1  Memory Usage: 2014kB
                                ->  Seq Scan on table_name_join_2 x2  (cost=0.00..819.00 rows=50000 width=10) (actual time=0.019..36.848 rows=50000 loops=1)
                    ->  Hash  (cost=546.33..546.33 rows=33333 width=10) (actual time=53.678..53.678 rows=33333 loops=1)
                          Buckets: 65536  Batches: 1  Memory Usage: 1428kB
                          ->  Seq Scan on table_name_join_3 x3  (cost=0.00..546.33 rows=33333 width=10) (actual time=0.027..24.132 rows=33333 loops=1)
              ->  Hash  (cost=328.00..328.00 rows=20000 width=10) (actual time=32.884..32.885 rows=20000 loops=1)
                    Buckets: 32768  Batches: 1  Memory Usage: 832kB
                    ->  Seq Scan on table_name_join_5 x5  (cost=0.00..328.00 rows=20000 width=10) (actual time=0.017..15.135 rows=20000 loops=1)
        ->  Hash  (cost=233.85..233.85 rows=14285 width=10) (actual time=23.542..23.542 rows=14285 loops=1)
              Buckets: 16384  Batches: 1  Memory Usage: 567kB
              ->  Seq Scan on table_name_join_7 x7  (cost=0.00..233.85 rows=14285 width=10) (actual time=0.016..10.742 rows=14285 loops=1)
Planning Time: 4.470 ms
Trigger for constraint table_name_join_2_table_id_fkey: time=172949.350 calls=20779
Trigger for constraint table_name_join_3_table_id_fkey: time=116772.757 calls=20779
Trigger for constraint table_name_join_5_table_id_fkey: time=71218.348 calls=20779
Trigger for constraint table_name_join_7_table_id_fkey: time=51760.503 calls=20779
Trigger for constraint table_name_join_11_table_id_fkey: time=36120.128 calls=20779
Execution Time: 449783.490 ms
(35 rows)
count
-------
79221
(1 row)
Query plan#2:
SET
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
CREATE INDEX
SET
VACUUM
VACUUM
VACUUM
VACUUM
VACUUM
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on table_name tn  (cost=1.73..6762.95 rows=11429 width=36) (actual time=776.987..776.988 rows=0 loops=1)
  ->  Merge Anti Join  (cost=1.73..6762.95 rows=11429 width=36) (actual time=0.212..676.794 rows=20779 loops=1)
        Merge Cond: (tn.id = x7.table_id)
        ->  Merge Anti Join  (cost=1.44..6322.99 rows=13334 width=34) (actual time=0.191..621.986 rows=24242 loops=1)
              Merge Cond: (tn.id = x5.table_id)
              ->  Merge Anti Join  (cost=1.16..5706.94 rows=16667 width=28) (actual time=0.172..550.669 rows=30303 loops=1)
                    Merge Cond: (tn.id = x3.table_id)
                    ->  Merge Anti Join  (cost=0.87..4661.02 rows=25000 width=22) (actual time=0.147..438.036 rows=45455 loops=1)
                          Merge Cond: (tn.id = x2.table_id)
                          ->  Merge Anti Join  (cost=0.58..2938.75 rows=50000 width=16) (actual time=0.125..250.082 rows=90910 loops=1)
                                Merge Cond: (tn.id = x11.table_id)
                                ->  Index Scan using table_name_pkey on table_name tn  (cost=0.29..2344.99 rows=100000 width=10) (actual time=0.031..116.630 rows=100000 loops=1)
                                ->  Index Scan using table_name_join_11_11 on table_name_join_11 x11  (cost=0.29..230.14 rows=9090 width=10) (actual time=0.090..11.228 rows=9090 loops=1)
                          ->  Index Scan using table_name_join_2_2 on table_name_join_2 x2  (cost=0.29..1222.29 rows=50000 width=10) (actual time=0.019..59.500 rows=50000 loops=1)
                    ->  Index Scan using table_name_join_3_3 on table_name_join_3 x3  (cost=0.29..816.78 rows=33333 width=10) (actual time=0.022..40.473 rows=33333 loops=1)
              ->  Index Scan using table_name_join_5_5 on table_name_join_5 x5  (cost=0.29..491.09 rows=20000 width=10) (actual time=0.016..23.105 rows=20000 loops=1)
        ->  Index Scan using table_name_join_7_7 on table_name_join_7 x7  (cost=0.29..351.86 rows=14285 width=10) (actual time=0.017..16.903 rows=14285 loops=1)
Planning Time: 4.737 ms
Trigger for constraint table_name_join_2_table_id_fkey: time=1114.497 calls=20779
Trigger for constraint table_name_join_3_table_id_fkey: time=1096.065 calls=20779
Trigger for constraint table_name_join_5_table_id_fkey: time=1094.951 calls=20779
Trigger for constraint table_name_join_7_table_id_fkey: time=1090.509 calls=20779
Trigger for constraint table_name_join_11_table_id_fkey: time=1173.987 calls=20779
Execution Time: 6426.626 ms
(24 rows)
count
-------
79221
(1 row)
So, the query speeds up from 450 seconds to 7 seconds, and most of the time appears to be spent on checking the FK constraints after the actual delete in the base table. [These constraints are implemented as invisible triggers in Postgres.]
Summary table:
query type | indexes on all 5 FKs | work_mem | total time(ms) | time for triggers
----------------+-----------------------+---------------+-----------------------+-------------------
NOT EXISTS() | No | 4Mb | 449783.490 | 448821.083
NOT EXISTS() | Yes | 4Mb | 6426.626 | 5570.009
NOT EXISTS() | Yes | 64Kb | 6405.273 | 5545.352
NOT IN() | No | 4Mb | 449435.530 | 448829.179
NOT IN() | Yes | 4Mb | 6113.690 | 5443.505
NOT IN() | Yes | 64Kb | 8595341.467 | 5545.796
Conclusion: it is up to you to decide whether you want indexes on your foreign keys.
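If you want a quick inventory of foreign keys that lack a supporting index on the referencing side, a rough catalog sketch like the one below can serve as a first pass. Note the containment check ignores column order, so a multi-column index that merely contains the FK columns (not as a leading prefix) will hide a real problem:

SELECT c.conrelid::regclass AS referencing_table,
       c.conname            AS fk_constraint
FROM pg_constraint c
WHERE c.contype = 'f'
  AND NOT EXISTS (
        SELECT 1
        FROM pg_index i
        WHERE i.indrelid = c.conrelid
          -- conkey holds the FK's referencing columns; indkey the index columns
          AND c.conkey <@ i.indkey::int2[]
      );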

cross join with a table with single row

My table1 has 25,000+ rows and table2 has only 1 row. Both of them have almost 30 columns. I need to append all the columns of table2 (which has only one row) to the columns of table1 so I can do further calculations. One way to do it is:
select * from table1 cross join table2
It gives the desired results but the performance is not good.
I am wondering if there is a better or faster way to get the combined table. I am using PostgreSQL.
Here is the output from
explain analyze select * from table1 cross join table2
Nested Loop  (cost=0.00..195264.90 rows=15533650 width=336) (actual time=0.013..46.189 rows=25465 loops=1)
  ->  Seq Scan on table1  (cost=0.00..1076.65 rows=25465 width=232) (actual time=0.007..6.912 rows=25465 loops=1)
  ->  Materialize  (cost=0.00..19.15 rows=610 width=104) (actual time=0.000..0.000 rows=1 loops=25465)
        ->  Seq Scan on table2  (cost=0.00..16.10 rows=610 width=104) (actual time=0.001..0.002 rows=1 loops=1)
Planning time: 0.153 ms
Execution time: 50.868 ms
Thanks.

PostgreSQL increase group by over 30 million rows

Is there any way to increase the speed of a dynamic GROUP BY query? I have a table with 30 million rows.
create table if not exists tb
(
id serial not null constraint tb_pkey primary key,
week integer,
month integer,
year integer,
starttime varchar(20),
endtime varchar(20),
brand smallint,
category smallint,
value real
);
The query below takes 8.5 seconds.
SELECT category from tb group by category
Is there any way to increase the speed? I have tried with and without an index.
For that exact query, not really; doing this operation requires scanning every row. No way around it.
But if you're looking to be able to quickly get the set of unique categories, and you have an index on that column, you can use a variation of the WITH RECURSIVE example shown in the edit to the question here (look towards the end of the question):
Counting distinct rows using recursive cte over non-distinct index
You'll need to change it to return the set of unique values instead of counting them, but it looks like a simple change:
testdb=# create table tb(id bigserial, category smallint);
CREATE TABLE
testdb=# insert into tb(category) select 2 from generate_series(1, 10000)
testdb-# ;
INSERT 0 10000
testdb=# insert into tb(category) select 1 from generate_series(1, 10000);
INSERT 0 10000
testdb=# insert into tb(category) select 3 from generate_series(1, 10000);
INSERT 0 10000
testdb=# create index on tb(category);
CREATE INDEX
testdb=# WITH RECURSIVE cte AS
(
    (SELECT category
     FROM tb
     WHERE category >= 0
     ORDER BY 1
     LIMIT 1)
    UNION ALL
    SELECT (SELECT category
            FROM tb
            WHERE category > c.category
            ORDER BY 1
            LIMIT 1)
    FROM cte c
    WHERE category IS NOT NULL
)
SELECT category
FROM cte
WHERE category IS NOT NULL;
category
----------
1
2
3
(3 rows)
And here's the EXPLAIN ANALYZE:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on cte  (cost=40.66..42.68 rows=100 width=2) (actual time=0.057..0.127 rows=3 loops=1)
  Filter: (category IS NOT NULL)
  Rows Removed by Filter: 1
  CTE cte
    ->  Recursive Union  (cost=0.29..40.66 rows=101 width=2) (actual time=0.052..0.119 rows=4 loops=1)
          ->  Limit  (cost=0.29..0.33 rows=1 width=2) (actual time=0.051..0.051 rows=1 loops=1)
                ->  Index Only Scan using tb_category_idx on tb tb_1  (cost=0.29..1363.29 rows=30000 width=2) (actual time=0.050..0.050 rows=1 loops=1)
                      Index Cond: (category >= 0)
                      Heap Fetches: 1
          ->  WorkTable Scan on cte c  (cost=0.00..3.83 rows=10 width=2) (actual time=0.015..0.015 rows=1 loops=4)
                Filter: (category IS NOT NULL)
                Rows Removed by Filter: 0
                SubPlan 1
                  ->  Limit  (cost=0.29..0.36 rows=1 width=2) (actual time=0.016..0.016 rows=1 loops=3)
                        ->  Index Only Scan using tb_category_idx on tb  (cost=0.29..755.95 rows=10000 width=2) (actual time=0.015..0.015 rows=1 loops=3)
                              Index Cond: (category > c.category)
                              Heap Fetches: 2
Planning time: 0.224 ms
Execution time: 0.191 ms
(19 rows)
The number of loops over the WorkTable Scan node will be equal to the number of unique categories you have plus one, so it should stay very fast up to, say, hundreds of unique values.
Another route you can take is to add a side table where you just store the unique values of tb.category, and have application logic check that table and insert the value when updating/inserting that column. This can also be done database-side with triggers; that solution is also discussed in the answers to the linked question.
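A minimal sketch of the trigger route (names are illustrative, and it only ever adds values, so categories deleted from tb will linger in the side table until cleaned up; EXECUTE FUNCTION needs PostgreSQL 11+, older versions spell it EXECUTE PROCEDURE):

CREATE TABLE tb_categories (category smallint PRIMARY KEY);

CREATE FUNCTION tb_track_category() RETURNS trigger AS $$
BEGIN
    -- keep exactly one row per distinct category value
    INSERT INTO tb_categories (category)
    VALUES (NEW.category)
    ON CONFLICT (category) DO NOTHING;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tb_track_category
AFTER INSERT OR UPDATE OF category ON tb
FOR EACH ROW EXECUTE FUNCTION tb_track_category();

-- Reading the distinct categories is then a trivial lookup:
SELECT category FROM tb_categories;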