Optimize Postgres UPDATE query with JOIN statement on large tables - postgresql

I have 2 tables A and B.
A contains ~6M records and B ~1M records.
B contains a foreign key to A, in a column named a_id.
I need to reverse the relationship between the tables, so that table A will contain b_id and a_id can be removed from B.
I created this migration:
UPDATE A
SET b_id = B.id
FROM B
INNER JOIN A ON B.a_id = A.id;
This is the EXPLAIN output:
"Update on A (cost=64444.71..138800306174.79 rows=7984529761815 width=915)"
" -> Nested Loop (cost=64444.71..138800306174.79 rows=7984529761815 width=915)"
" -> Seq Scan on A (cost=0.00..485941.45 rows=5779245 width=899)"
" -> Materialize (cost=64444.71..642396.62 rows=1381587 width=16)"
" -> Hash Join (cost=64444.71..628741.69 rows=1381587 width=16)"
" Hash Cond: (A.id = B.a_id)"
" -> Seq Scan on A (cost=0.00..485941.45 rows=5779245 width=10)"
" -> Hash (cost=40427.87..40427.87 rows=1381587 width=14)"
" -> Seq Scan on B (cost=0.00..40427.87 rows=1381587 width=14)"
This is taking too long (more than half an hour), which exceeds the time window I have for the migration. Is there anything I can do to optimize the UPDATE statement?
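(For comparison: the conventional UPDATE ... FROM form references the target table only once and correlates it in the WHERE clause. Whether this matches the intended migration is an assumption, but it avoids joining A a second time in the FROM clause, which is what produces the enormous row estimate in the plan above.)
-- Sketch only: A appears just once, as the update target,
-- so each row of A is matched against B directly.
UPDATE A
SET b_id = B.id
FROM B
WHERE B.a_id = A.id;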

Related

Attempt to avoid duplicated code with CTE takes far too long

I need to create a view with some calculated and aggregated values. So I need certain values multiple times, like total_dist_pts in the example below.
There is a loc_a_run table with about 350 rows so far (constantly growing) and a loc_a_pline table with somewhat more than 4 million rows (also growing):
prod=# \d loc_a_run
Table "public.loc_a_run"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+-----------------------------
loc_a_run_id | integer | | not null | generated always as identity
execution_time | timestamp with time zone | | not null | CURRENT_TIMESTAMP(0)
run_type_id | smallint | | not null |
...
has_errors | boolean | | |
Indexes:
"loc_a_run_pkey" PRIMARY KEY, btree (loc_a_run_id)
prod=# \d loc_a_pline
Table "public.loc_a_pline"
Column | Type | Collation | Nullable | Default
----------------+-----------+-----------+----------+-----------------------------
loc_a_pline_id | bigint | | not null | generated always as identity
loc_a_run_id | integer | | not null |
is_top | boolean | | not null |
...
dist_left | numeric | | not null |
dist_right | numeric | | not null |
...
Indexes:
"loc_a_pline_pkey" PRIMARY KEY, btree (loc_a_pline_id)
Foreign-key constraints:
"loc_a_pline_loc_a_run_id_fkey" FOREIGN KEY (loc_a_run_id) REFERENCES loc_a_run(loc_a_run_id) ON UPDATE CASCADE ON DELETE CASCADE
The solution I use right now:
SELECT run.loc_a_run_id AS run_id
, run_type.run_type
, SUM(
CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END)
AS total_dist_pts
, COUNT(pline.loc_a_pline_id) AS total_plines
, SUM(
CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END)
/ COUNT(pline.loc_a_pline_id)
AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=154201.17..154577.71 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=154201.17..154519.69 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=153201.15..153204.56 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=153113.01..153130.07 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Hash Join (cost=21.67..120633.75 rows=1623963 width=62)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112756.41 rows=1623963 width=32)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107766.55 rows=1867855 width=30)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes around 14.2s to execute. I lack the experience to assess how good or bad the performance is for this part, but I could live with it. Of course, faster would be an advantage.
Because this contains duplicated code I tried to get rid of it by using a CTE (in the final view I need this for a few more calculations, but the pattern is the same):
WITH dist_pts AS (
SELECT loc_a_run_id
, CASE
WHEN is_top IS true
THEN ROUND(dist_right - dist_left, 2)
ELSE ROUND(dist_left - dist_right, 2)
END AS pts
FROM loc_a_pline
)
SELECT run.loc_a_run_id AS run_id
, run_type.run_type
, SUM(dist_pts.pts) AS total_dist_pts
, COUNT(pline.loc_a_pline_id) AS total_plines
, SUM(dist_pts.pts) / COUNT(pline.loc_a_pline_id) AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN dist_pts USING (loc_a_run_id)
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=575677889.59..575678266.13 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=575677889.59..575678208.12 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=575676889.57..575676892.98 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=575676801.43..575676818.49 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Parallel Hash Join (cost=155366.81..111024852.15 rows=23232597464 width=62)"
" Hash Cond: (loc_a_pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline (cost=0.00..107877.85 rows=1869785 width=22)"
" -> Parallel Hash (cost=120758.30..120758.30 rows=1625641 width=48)"
" -> Hash Join (cost=21.67..120758.30 rows=1625641 width=48)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112872.83 rows=1625641 width=18)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107877.85 rows=1869785 width=12)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes forever and seems to be the wrong approach. I struggle to understand the query plan to find my mistake(s).
So my questions are:
Why does the CTE approach take so much time?
What would be the smartest solution to avoid duplicated code and eventually reduce execution time?
Is there a way to SUM(dist_pts.pts) only once?
Is there a way to COUNT(pline.loc_a_pline_id) in the same go as the subtraction in the CTE instead of accessing the big loc_a_pline table again? (is it accessed again at all?)
Any help is highly appreciated.
Consider creating an index on loc_a_pline.loc_a_run_id. Postgres doesn't automatically create indexes on the referencing side of FK relationships. That should drastically improve the speed and remove the sequential scans over loc_a_pline from the execution plan.
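For example (the index name is illustrative):
CREATE INDEX loc_a_pline_loc_a_run_id_idx ON loc_a_pline (loc_a_run_id);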
Additionally, I'd suggest gathering all the data you want in the first portion of the CTE and then separating the aggregates out into their own portion. Something like this, which accesses each table once and aggregates over the set once:
WITH dist_pts AS (
SELECT run.loc_a_run_id rid
, pline.loc_a_pline_id pid
, rt.run_type
, CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END AS pts
FROM loc_a_run run
JOIN loc_a_pline pline ON run.loc_a_run_id = pline.loc_a_run_id
JOIN run_type rt ON run.run_type_id = rt.run_type_id
WHERE run.has_errors IS FALSE
), aggs AS
(
SELECT SUM(dist_pts.pts)::NUMERIC total_dist_pts
, COUNT(dist_pts.pid)::NUMERIC total_plines
, dist_pts.rid
FROM dist_pts
GROUP BY dist_pts.rid
)
SELECT dp.rid
, dp.run_type
, aggs.total_dist_pts
, aggs.total_plines
, (aggs.total_dist_pts / aggs.total_plines) dist_pts_per_pline
FROM dist_pts dp
JOIN aggs ON dp.rid = aggs.rid
GROUP BY dp.rid, dp.run_type, aggs.total_dist_pts, aggs.total_plines
;

PostgreSQL - Slow Count

I need to write a one-time query. It will be run once, and the data will be moved to another system (AWS Personalize). It does not need to be fully optimized, but it should at least be sped up enough that the data migration is possible at all.
Coming from MySQL, I thought this would not be a problem, but from reading around it seems that COUNT is handled differently in PostgreSQL. With all that said, this is the query, reduced in size. There are several other joins (removed from this example), but they do not present an issue, at least judging by the query plan.
explain
SELECT DISTINCT ON (p.id)
'plan_progress' AS EVENT_TYPE,
'-1' AS EVENT_VALUE,
extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
INNER JOIN schedules sch ON p.id = sch.plan_id
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
AND (select Count(id) FROM schedules s WHERE s.plan_id = sch.plan_id AND s.status = 'DONE') = 1
The issue is here:
select Count(id) FROM schedules s WHERE s.plan_id = sch.plan_id AND s.status = 'DONE'
The id field in the schedules table is uuid.
I have tried lots of things, but they all end up performing the same, or worse.
I have read somewhere it is possible to use row estimate in these cases, but I have honestly no idea how to do that in this case.
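(For reference, the "row estimate" trick usually means asking the planner for its estimate instead of counting; below is a rough sketch adapted from the widely circulated count_estimate() helper on the PostgreSQL wiki, with names purely illustrative. An estimate is rarely exact enough to test for a count of exactly 1, which is presumably why the answer below takes the EXISTS route instead.)
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS bigint AS $$
DECLARE
    rec record;
    n   bigint;
BEGIN
    -- Run EXPLAIN on the given query and pull out the planner's row estimate
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        n := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN n IS NOT NULL;
    END LOOP;
    RETURN n;
END;
$$ LANGUAGE plpgsql;
-- e.g. SELECT count_estimate('SELECT 1 FROM schedules WHERE status = ''DONE''');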
This is the query plan:
Unique (cost=0.99..25152516038.36 rows=100054 width=88)
-> Nested Loop (cost=0.99..25152515788.22 rows=100054 width=88)
-> Index Only Scan using idx_schedules_plan_id_done_date on schedules sch (cost=0.56..25152152785.84 rows=107641 width=16)
Filter: ((SubPlan 1) = 1)
SubPlan 1
-> Aggregate (cost=1168.28..1168.29 rows=1 width=8)
-> Bitmap Heap Scan on schedules s (cost=14.78..1168.13 rows=58 width=16)
Recheck Cond: (plan_id = sch.plan_id)
Filter: ((status)::text = 'DONE'::text)
-> Bitmap Index Scan on idx_schedules_plan_id_done_date (cost=0.00..14.77 rows=294 width=0)
Index Cond: (plan_id = sch.plan_id)
-> Index Scan using plans_pkey on plans p (cost=0.42..3.37 rows=1 width=24)
Index Cond: (id = sch.plan_id)
Filter: ((continuous IS NOT TRUE) AND ((status)::text = 'ENDED'::text))
You are not selecting any column from the schedules table, so it can be omitted from the main query and put into an EXISTS() term.
DISTINCT is probably not needed, assuming id is a PK.
Maybe you don't need the COUNT() to be exactly one, but just > 0:
SELECT DISTINCT ON (p.id)
'plan_progress' AS EVENT_TYPE
, '-1' AS EVENT_VALUE
, extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
AND EXISTS (
SELECT *
FROM schedules sch
WHERE p.id = sch.plan_id
)
AND EXISTS(
select *
FROM schedules s
WHERE s.plan_id = p.id
AND s.status = 'DONE' -- <<-- Must there be EXACTLY ONE schedules record?
) ;
Now you can see that the first EXISTS() is actually not needed: if the second one yields true, the first EXISTS() must yield true, too:
SELECT -- DISTINCT ON (p.id)
'plan_progress' AS EVENT_TYPE
, '-1' AS EVENT_VALUE
, extract(EPOCH FROM p.created_at) AS CREATION_TIMESTAMP
FROM plans p
WHERE p.status = 'ENDED' AND p.continuous IS NOT TRUE
AND EXISTS(
select *
FROM schedules s
WHERE s.plan_id = p.id
AND s.status = 'DONE'
) ;
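If the count really does have to be exactly one, a variant of the same idea (a sketch only) stops scanning after a second match instead of counting every DONE row:
AND (SELECT count(*)
     FROM (SELECT 1
           FROM schedules s
           WHERE s.plan_id = p.id
             AND s.status = 'DONE'
           LIMIT 2   -- never reads more than two matching rows per plan
          ) t
    ) = 1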

UPDATE query for 180k rows in 10M row table unexpectedly slow

I have a table that is getting too big and I want to reduce its size with an UPDATE query. Some of the data in this table is redundant, and I should be able to reclaim a lot of space by setting the redundant "cells" to NULL. However, my UPDATE queries are taking excessive amounts of time to complete.
Table details
-- table1 10M rows (estimated)
-- 45 columns
-- Table size 2200 MB
-- Toast Table size 17 GB
-- Indexes Size 1500 MB
-- **columns in query**
-- id integer primary key
-- testid integer foreign key
-- band integer
-- date timestamptz indexed
-- data1 real[]
-- data2 real[]
-- data3 real[]
This was my first attempt at an update query. I broke it up into some temporary tables just to get the ids to update. Further, to reduce the scope of the query, I selected a date range for June 2020:
CREATE TEMP TABLE A as
SELECT testid
FROM table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND band = 3;
CREATE TEMP TABLE B as -- this table has 180k rows
SELECT id
FROM table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND testid in (SELECT testid FROM A)
AND band > 1;
UPDATE table1
SET data1 = Null, data2 = Null, data3 = Null
WHERE id in (SELECT id FROM B)
Queries for creating TEMP tables execute in under 1 sec. I ran the UPDATE query for an hour(!) before I finally killed it. Only 180k
rows needed to be updated. It doesn't seem like it should take that much
time to update that many rows. Temp table B identifies exactly which
rows to update.
Here is the EXPLAIN from the above UPDATE query. One of the odd features of this explain is that it shows 4.88M rows, but there are only 180k rows to update.
Update on table1 (cost=3212.43..4829.11 rows=4881014 width=309)
-> Nested Loop (cost=3212.43..4829.11 rows=4881014 width=309)
-> HashAggregate (cost=3212.00..3214.00 rows=200 width=10)
-> Seq Scan on b (cost=0.00..2730.20 rows=192720 width=10)
-> Index Scan using table1_pkey on table1 (cost=0.43..8.07 rows=1 width=303)
Index Cond: (id = b.id)
Another way to run this query is in one shot:
WITH t as (
SELECT id from table1
WHERE testid in (
SELECT testid
from table1
WHERE date BETWEEN '2020-06-01' AND '2020-07-01'
AND band = 3
)
)
UPDATE table1 a
SET data1 = Null, data2 = Null, data3 = Null
FROM t
WHERE a.id = t.id
I only ran this one for about 10 minutes before I killed it. It feels like I should be able to run this query in much less time if I just knew the tricks. The EXPLAIN for this query is below. It shows 195k rows, which is closer to what I expected, but the cost is much higher, at 1.3M to 1.7M.
Update on testlog a (cost=1337986.60..1740312.98 rows=195364 width=331)
CTE t
-> Hash Join (cost=8834.60..435297.00 rows=195364 width=4)
Hash Cond: (testlog.testid = testlog_1.testid)
-> Seq Scan on testlog (cost=0.00..389801.27 rows=9762027 width=8)
-> Hash (cost=8832.62..8832.62 rows=158 width=4)
-> HashAggregate (cost=8831.04..8832.62 rows=158 width=4)
-> Index Scan using amptest_testlog_date_idx on testlog testlog_1 (cost=0.43..8820.18 rows=4346 width=4)
Index Cond: ((date >= '2020-06-01 00:00:00-07'::timestamp with time zone) AND (date <= '2020-07-01 00:00:00-07'::timestamp with time zone))
Filter: (band = 3)
-> Hash Join (cost=902689.61..1305015.99 rows=195364 width=331)
Hash Cond: (t.id = a.id)
-> CTE Scan on t (cost=0.00..3907.28 rows=195364 width=32)
-> Hash (cost=389801.27..389801.27 rows=9762027 width=303)
-> Seq Scan on testlog a (cost=0.00..389801.27 rows=9762027 width=303)
Edit: one of the suggestions in the accepted answer was to drop any indexes before the update and then add them back later. This is what I went with, with a twist: I needed another table to hold indexed data from the dropped indexes to make the A and B queries faster:
CREATE TABLE tempid AS
SELECT id, testid, band, date
FROM table1
I made indexes on this table for id, testid, and date. Then I replaced table1 in the A and B queries with tempid. It still went slower than I would have liked, but it did get the job done.
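For illustration, the indexes on tempid would look something like this (index names invented here):
CREATE INDEX tempid_id_idx ON tempid (id);
CREATE INDEX tempid_testid_idx ON tempid (testid);
CREATE INDEX tempid_date_idx ON tempid (date);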
You might have another table with a foreign key that references one or more of the columns you are setting to NULL, and that referencing table might not have an index on its side of the foreign key.
Each time you set a value to NULL, the database has to check the referencing table - maybe it has a row that references the value you are removing.
If this is the case, you should be able to speed the update up by adding an index on that referencing table.
For example if you have a table like this:
create table table2 (
id serial primary key,
band integer references table1(data1)
)
Then you can create a partial index:
create index table2_band_nnull_idx on table2 (band) where band is not null;
But you suggested that all columns you are setting to NULL have array type. This means that it is unlikely that they are referenced. Still it is worth checking.
Another possibility is that you have a trigger on the table that works slowly.
Another possibility is that you have a lot of indexes on the table. Each index has to be updated for each row you update, and that maintenance can use only a single processor core.
Sometimes it is faster to drop all indexes, do the bulk update, and then recreate them all afterwards. Creating indexes can use multiple cores - one core per index.
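A sketch of that pattern (the index and column names are placeholders based on the question; any index needed by the UPDATE's own WHERE clause, such as the primary key, should be kept):
DROP INDEX table1_date_idx;   -- repeat for the other non-essential indexes
UPDATE table1
SET data1 = Null, data2 = Null, data3 = Null
WHERE id in (SELECT id FROM B);
CREATE INDEX table1_date_idx ON table1 (date);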
Another possibility is that your query is waiting for some other query to finish and release its locks. You should check with:
select now()-query_start, * from pg_stat_activity where state<>'idle' order by 1;

Lock one table at update and another in subquery, which one will be locked first?

I have a query like this:
UPDATE table1 SET
col = 'some value'
WHERE id = X
RETURNING col1, (SELECT col2 FROM table2 WHERE id = table1.table2_id FOR UPDATE);
So, this query will lock both tables, table1 and table2, right? But which one will be locked first?
The execution plan for the query will probably look like this:
QUERY PLAN
-------------------------------------------------------------------------------------------
Update on laurenz.table1
Output: table1.col1, (SubPlan 1)
-> Index Scan using table1_pkey on laurenz.table1
Output: table1.id, table1.table2_id, 'some value'::text, table1.col1, table1.ctid
Index Cond: (table1.id = 42)
SubPlan 1
-> LockRows
Output: table2.col2, table2.ctid
-> Index Scan using table2_pkey on laurenz.table2
Output: table2.col2, table2.ctid
Index Cond: (table2.id = table1.table2_id)
That suggests that the row in table1 is locked first.
Looking into the code, I see that ExecUpdate first calls EvalPlanQual, where the updated tuple is locked, and only after that calls ExecProcessReturning where the RETURNING clause is processed.
So yes, the row in table1 is locked first.
So far, I have treated row locks, but there are also the ROW EXCLUSIVE locks on the tables themselves:
The tables are all locked in InitPlan in execMain.c, and it seems to me that again table1 will be locked before table2 here.

Why can't PostgreSQL do this simple FULL JOIN?

Here's a minimal setup with 2 tables a and b each with 3 rows:
CREATE TABLE a (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON a (value);
CREATE TABLE b (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON b (value);
INSERT INTO a (value) VALUES ('x'), ('y'), (NULL);
INSERT INTO b (value) VALUES ('y'), ('z'), (NULL);
Here is a LEFT JOIN that works fine as expected:
SELECT * FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value;
with output:
id | value | id | value
----+-------+----+-------
1 | x | |
2 | y | 1 | y
3 | | 3 |
(3 rows)
Changing "LEFT JOIN" to "FULL JOIN" gives an error:
SELECT * FROM a
FULL JOIN b ON a.value IS NOT DISTINCT FROM b.value;
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
Can someone please answer:
What is a "merge-joinable or hash-joinable join condition" and why joining on a.value IS NOT DISTINCT FROM b.value doesn't fulfill this condition, but a.value = b.value is perfectly fine?
It seems that the only difference is how NULL values are handled. Since the value column is indexed in both tables, running an EXPLAIN on a NULL lookup is just as efficient as looking up values that are non-NULL:
EXPLAIN SELECT * FROM a WHERE value = 'x';
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.67 rows=6 width=36)
Recheck Cond: (value = 'x'::text)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value = 'x'::text)
EXPLAIN SELECT * FROM a WHERE value ISNULL;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.65 rows=6 width=36)
Recheck Cond: (value IS NULL)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value IS NULL)
This has been tested with PostgreSQL 9.6.3 and 10beta1.
There has been discussion about this issue, but it doesn't directly answer the above question.
PostgreSQL implements FULL OUTER JOIN with either a hash or a merge join.
To be eligible for such a join, the join condition has to have the form
<expression using only left table> <operator> <expression using only right table>
Now your join condition does look like this, but PostgreSQL does not have a special IS NOT DISTINCT FROM operator, so it parses your condition into:
(NOT ($1 IS DISTINCT FROM $2))
And such an expression cannot be used for hash or merge joins, hence the error message.
I can think of a way to work around it:
SELECT a_id, NULLIF(a_value, '<null>'),
b_id, NULLIF(b_value, '<null>')
FROM (SELECT id AS a_id,
COALESCE(value, '<null>') AS a_value
FROM a
) x
FULL JOIN
(SELECT id AS b_id,
COALESCE(value, '<null>') AS b_value
FROM b
) y
ON x.a_value = y.b_value;
That works if <null> does not appear anywhere in the value columns.
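Another workaround sometimes used (a sketch, not tested against this schema) is to emulate the FULL JOIN as a LEFT JOIN plus the rows of b that found no partner; this sidesteps the restriction because only the FULL JOIN itself requires a hash- or merge-joinable condition:
SELECT a.id, a.value, b.id, b.value
FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value
UNION ALL
SELECT NULL, NULL, b.id, b.value
FROM b
WHERE NOT EXISTS (
    SELECT 1 FROM a WHERE a.value IS NOT DISTINCT FROM b.value
);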
I just solved such a case by replacing the ON condition with "TRUE", and moving the original "ON" condition into a WHERE clause. I don't know the performance impact of this, though.