Postgres update from large table - postgresql

I have a Postgres 12 table called A with more than 20 million rows (partitioned on customer_column_id) and a table called B with 2.2 million rows. I want to update B with data selected from A, but the update sometimes takes 3 minutes and sometimes fails with an "exception while reading from stream" error or times out. The query I am using is below.
UPDATE B
set "udd_erhvervsfaglig_forloeb" = A.value
from A
where A.customer_column_id = 60
and B.customer_id = A.customer_id
and coalesce(A.value, '') != ''
Query Plan
-> Nested Loop (cost=0.43..15739311932.47 rows=6273088796225 width=85506)
-> Seq Scan on B(cost=0.00..3509632.93 rows=112625893 width=84980)
-> Index Scan using A_66_customer_id_idx1 on A_66 cev (cost=0.43..0.46 rows=1 width=17)
Index Cond: (customer_id = B.customer_id)
Filter: (((COALESCE(value, ''::character varying))::text <> ''::text) AND (customer_column_id = 66))

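One way to try to avoid the nested loop is EXISTS. The subquery needs its own alias, and the outer FROM A must stay joined to B; otherwise A inside the subquery refers to the subquery's own table and every row of B is paired with every row of A. With that fixed, your query can be written like this:
UPDATE B
SET "udd_erhvervsfaglig_forloeb" = A.value
FROM A
WHERE B.customer_id = A.customer_id
AND A.customer_column_id = 60
AND coalesce(A.value, '') != ''
AND EXISTS (SELECT 1
FROM A a2
WHERE a2.customer_id = B.customer_id
AND a2.customer_column_id = 60
AND coalesce(a2.value, '') != '');
Check the result with EXPLAIN; the planner may still choose a nested loop.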

Related

restriction in second position - index not used - why?

I have created the example below and do not understand why the planner does not use index i2 for the query. As pg_stats shows, the planner knows that column uniqueIds contains unique values and that column fourOtherIds contains only 4 distinct values. Shouldn't a search of index i2 then be by far the fastest way, looking for uniqueIds under only four different fourOtherIds values in the index? What is wrong with my understanding of how an index works? Why does the planner think using i1 makes more sense here, even though it has to filter out 333,333 rows? In my understanding it should use i2 to first find the one row (or few rows, as there is no unique constraint) that has uniqueIds 4000 and then apply fourIds = 1 as a filter.
create table t (fourIds int, uniqueIds int,fourOtherIds int);
insert into t ( select 1,*,5 from generate_series(1 ,1000000));
insert into t ( select 2,*,6 from generate_series(1000001,2000000));
insert into t ( select 3,*,7 from generate_series(2000001,3000000));
insert into t ( select 4,*,8 from generate_series(3000001,4000000));
create index i1 on t (fourIds);
create index i2 on t (fourOtherIds,uniqueIds);
analyze t;
select n_distinct,attname from pg_stats where tablename = 't';
/*
n_distinct|attname |
----------+------------+
4.0|fourids |
-1.0|uniqueids |
4.0|fourotherids|
*/
explain analyze select * from t where fourIds = 1 and uniqueIds = 4000;
/*
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------+
Gather (cost=1000.43..22599.09 rows=1 width=12) (actual time=0.667..46.818 rows=1 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Index Scan using i1 on t (cost=0.43..21598.99 rows=1 width=12) (actual time=25.227..39.852 rows=0 loops=3)|
Index Cond: (fourids = 1) |
Filter: (uniqueids = 4000) |
Rows Removed by Filter: 333333 |
Planning Time: 0.107 ms |
Execution Time: 46.859 ms |
*/
Not every conceivable optimization has been implemented. You are looking for a variant of an index skip scan, also known as a loose index scan. PostgreSQL does not implement those automatically (yet; people were working on it, but I don't know if they still are. I also think I've read that one of the third-party extensions or forks, Citus maybe, has implemented it). You can emulate one yourself using a recursive CTE, but that would be quite annoying to do.
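To make the recursive CTE idea concrete, here is a minimal sketch. It runs against Python's built-in sqlite3 module only so that it is self-contained, and it uses a 20,000-row stand-in for table t; the CTE name lead_vals is made up for illustration. The CTE walks the distinct fourOtherIds values one at a time, and each value is then combined with uniqueIds so that both columns of i2 can be used for the lookup. In PostgreSQL you would run the SQL text directly with literals in place of the ? placeholders, and whether the planner actually picks i2 would still have to be checked with EXPLAIN.

```python
import sqlite3

# Small stand-in for table t from the question (20,000 rows instead of 4 million).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (fourIds int, uniqueIds int, fourOtherIds int)")
rows = [
    (g + 1, i, g + 5)
    for g in range(4)
    for i in range(g * 5000 + 1, (g + 1) * 5000 + 1)
]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX i2 ON t (fourOtherIds, uniqueIds)")

# lead_vals (hypothetical name) enumerates the distinct leading-column values
# by repeatedly asking for the smallest value greater than the previous one.
SKIP_SCAN_SQL = """
WITH RECURSIVE lead_vals(v) AS (
    SELECT min(fourOtherIds) FROM t
    UNION ALL
    SELECT (SELECT min(fourOtherIds) FROM t WHERE fourOtherIds > v)
    FROM lead_vals
    WHERE v IS NOT NULL
)
SELECT t.fourIds, t.uniqueIds, t.fourOtherIds
FROM lead_vals
JOIN t ON t.fourOtherIds = lead_vals.v AND t.uniqueIds = ?
WHERE t.fourIds = ?
"""

# One probe per distinct fourOtherIds value, each on (fourOtherIds, uniqueIds).
result = conn.execute(SKIP_SCAN_SQL, (4000, 1)).fetchall()
print(result)
```

The recursion ends when no larger fourOtherIds value exists: the subquery then yields NULL, which matches no row in the join and stops the next step.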

Time Consuming SQL Update Statement

In PostgreSQL (version 9.2), I need to update a table with values from another table. The UPDATE statement below works and completes quickly on a small data set (1K records). With a large number of records (600K+), the statement has not completed after more than two hours. I don't know if it is just taking a long time or is hung.
UPDATE training_records r SET cid =
(SELECT cid_main FROM account_events e
WHERE e.user_ekey = r.ekey
AND e.type = 't'
AND r.enroll_date < e.time
ORDER BY e.time ASC LIMIT 1)
WHERE r.cid IS NULL;
Is there a problem with this statement? Is there a more efficient way to do this?
About the operation: training_records holds course enrollment records for member accounts (identified by ekey) that are organized in groups. cid is the group id. account_events holds account-changing events, including transfers between groups (e.type='t'), where cid_main is the group id before the transfer. I am trying to retroactively patch the newly added cid column in training_records so that it accurately reflects the group membership at the time of enrollment. There could be multiple transfers, so I am picking the group id (cid_main) from the earliest transfer after the time of enrollment. Hope this makes sense.
The table training_records has close to 700K records, and account_events has 560K+ records.
Output of EXPLAIN {command above}
Update on training_records r (cost=0.00..13275775666.76 rows=664913 width=74)
-> Seq Scan on training_records r (cost=0.00..13275775666.76 rows=664913 width=74)
Filter: (cid IS NULL)
SubPlan 1
-> Limit (cost=19966.15..19966.16 rows=1 width=12)
-> Sort (cost=19966.15..19966.16 rows=1 width=12)
Sort Key: e."time"
-> Seq Scan on account_events e (cost=0.00..19966.15 rows=1 width=12)
Filter: ((r.enroll_date < "time") AND (user_ekey = r.ekey) AND (type = 't'::bpchar))
(9 rows)
One more update:
By adding an additional condition in WHERE, I limited the number of records from training_records to about 10K. The update took about 15 minutes. If the time is anywhere close to linear in the number of records of this one table, 700K records would take about 17+ hours.
Thank you for your help!
Update: It took close to 9 hours, but the original command completed.
Try to transform it to something that does not force a nested loop join:
UPDATE training_records r
SET cid = e.cid_main
FROM account_events e
WHERE e.user_ekey = r.ekey
AND e.type = 't'
AND r.enroll_date < e.time
AND NOT EXISTS (SELECT 1 FROM account_events e1
WHERE e1.user_ekey = r.ekey
AND e1.type = 't'
AND r.enroll_date < e1.time
AND e1.time < e.time)
AND r.cid IS NULL;
The statement actually isn't equivalent: if there is no matching account_events row, your statement will update cid to NULL, while my statement will not update that row.
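The difference between the two forms can be seen on a tiny data set. The sketch below is an illustration only: it uses Python's built-in sqlite3, plain integers in place of timestamps, made-up rows, and SELECTs that mirror the logic of the two UPDATE statements rather than the UPDATEs themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE training_records (ekey int, enroll_date int, cid int);
CREATE TABLE account_events (user_ekey int, type text, time int, cid_main int);
-- ekey 1 has two transfers after enrollment; ekey 2 has none after enrollment.
INSERT INTO training_records VALUES (1, 10, NULL), (2, 10, NULL);
INSERT INTO account_events VALUES
    (1, 't', 20, 100), (1, 't', 30, 200), (1, 't', 5, 50), (2, 't', 5, 70);
""")

# Scalar-subquery form: evaluated once per training_records row,
# and it yields NULL when no event qualifies.
subquery_form = conn.execute("""
SELECT r.ekey,
       (SELECT e.cid_main FROM account_events e
        WHERE e.user_ekey = r.ekey AND e.type = 't' AND r.enroll_date < e.time
        ORDER BY e.time ASC LIMIT 1) AS cid
FROM training_records r
WHERE r.cid IS NULL
ORDER BY r.ekey
""").fetchall()

# Join form: keep the event for which no earlier qualifying event exists.
# Rows without any qualifying event drop out of the join entirely.
join_form = conn.execute("""
SELECT r.ekey, e.cid_main
FROM training_records r
JOIN account_events e
  ON e.user_ekey = r.ekey AND e.type = 't' AND r.enroll_date < e.time
WHERE r.cid IS NULL
  AND NOT EXISTS (SELECT 1 FROM account_events e1
                  WHERE e1.user_ekey = r.ekey AND e1.type = 't'
                    AND r.enroll_date < e1.time AND e1.time < e.time)
ORDER BY r.ekey
""").fetchall()

print(subquery_form)
print(join_form)
```

Both forms pick cid_main 100 for ekey 1 (the earliest transfer after enrollment). Only the subquery form produces a row for ekey 2, with a NULL cid, which corresponds to the row the join-based UPDATE would leave untouched.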

Avoid using Nested Loop Join while using a non Equi join condition

Postgres is using a Nested Loop Join algorithm when I use a non-equi join condition in my update query. I understand that the Nested Loop Join can be very costly, as the right relation is scanned once for every row found in the left relation, as per https://www.postgresql.org/docs/8.3/planner-optimizer.html
The update query and the execution plan are below.
Query
explain analyze
UPDATE target_tbl tgt
set descr = stage.descr,
prod_name = stage.prod_name,
item_name = stage.item_name,
url = stage.url,
col1_name = stage.col1_name,
col2_name = stage.col2_name,
col3_name = stage.col3_name,
col4_name = stage.col4_name,
col5_name = stage.col5_name,
col6_name = stage.col6_name,
col7_name = stage.col7_name,
col8_name = stage.col8_name,
flag = stage.flag
from tbl1 stage
where tgt.col1 = stage.col1
and tgt.col2 = stage.col2
and coalesce(tgt.col3, 'col3'::text) = coalesce(stage.col3, 'col3'::text)
and coalesce(tgt.col4, 'col4'::text) = coalesce(stage.col4, 'col4'::text)
and stage.row_number::int >= 1::int
and stage.row_number::int < 50001::int;
Execution Plan
Update on target_tbl tgt (cost=0.56..3557.91 rows=1 width=813) (actual time=346153.460..346153.460 rows=0 loops=1)
-> Nested Loop (cost=0.56..3557.91 rows=1 width=813) (actual time=4.326..163876.029 rows=50000 loops=1)
-> Seq Scan on tbl1 stage (cost=0.00..2680.96 rows=102 width=759) (actual time=3.060..2588.745 rows=50000 loops=1)
Filter: (((row_number)::integer >= 1) AND ((row_number)::integer < 50001))
-> Index Scan using tbl_idx on target_tbl tgt (cost=0.56..8.59 rows=1 width=134) (actual time=3.152..3.212 rows=1 loops=50000)
Index Cond: ((col1 = stage.col1) AND (col2 = stage.col2) AND (COALESCE(col3, 'col3'::text) = COALESCE(stage.col3, 'col3'::text)) AND (COALESCE(col4, 'col4'::text) = COALESCE(stage.col4, 'col4'::text)))
Planning time: 17.700 ms
Execution time: 346157.168 ms
Is there any way to avoid the nested loop join during the execution of the above query?
Or is there a way to reduce the cost of the nested loop scan? Currently it takes 6-7 minutes to update just 50000 records.
PostgreSQL can choose a different join strategy in that case. The reason why it doesn't is the gross mis-estimate in the sequential scan: 102 instead of 50000.
Fix that problem, and things will get better:
ANALYZE tbl1;
If that is not enough, collect more detailed statistics:
ALTER TABLE tbl1 ALTER row_number SET STATISTICS 1000;
ANALYZE tbl1;
All this assumes that row_number is an integer and the type cast is redundant. If you made the mistake of using a different data type, an index is your only hope:
CREATE INDEX ON tbl1 ((row_number::integer));
ANALYZE tbl1;
I understand that the Nested Loop Join can be very costly as the right relation is scanned once for every row found in the left relation
But the "right relation" here is an index scan, not a scan of the full table.
You can get it to stop using the index by changing the leading column of the join condition to something like where tgt.col1+0 = stage.col1 .... Upon doing that, it will probably change to a hash join or a merge join, but you will have to try it and see if it does. Also, the new plan may not actually be faster. (And fixing the estimation problem would be preferable, if that works)
Or is there a way that can help me to reduce the cost of the nested loop scan, currently it takes 6-7 minutes to update just 50000 records?
Your plan shows that over half the time is spent on the update itself, so probably reducing the cost of just the nested loop scan can have only a muted impact on the overall time. Do you have a lot of indexes on the table? The maintenance of those indexes might be a major bottleneck.
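The "over half the time" figure can be checked with a few lines of arithmetic on the numbers in the plan. The sketch below only copies timings from the EXPLAIN ANALYZE output in the question; the variable names are made up for illustration.

```python
# Timings copied from the EXPLAIN ANALYZE output in the question (milliseconds).
total_ms = 346153.460         # "Update on target_tbl": update node, children included
nested_loop_ms = 163876.029   # "Nested Loop": join node, children included
index_scan_ms_per_loop = 3.212
loops = 50000

# Time attributed to the update node itself (row changes, index maintenance, ...).
update_only_ms = total_ms - nested_loop_ms
update_share = update_only_ms / total_ms

# Per-loop time multiplied by the loop count gives the index scan's total.
index_scan_total_ms = index_scan_ms_per_loop * loops

print(f"update itself: {update_only_ms:.0f} ms ({update_share:.1%} of total)")
print(f"index scans:   {index_scan_total_ms:.0f} ms across {loops} loops")
```

So roughly 182 seconds go to the update node itself and about 161 seconds to the 50000 index probes, which is why speeding up only the join cannot cut the total time by much more than half.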

Postgres: Optimize a huge GROUP BY

I have such a table:
CREATE TABLE values (first_id varchar(26), sec_id int, mode varchar(6), external1_id varchar(23), external2_id varchar(26), x int, y int);
There may be multiple rows with the same first_id. My goal is to flatten all the rows related to each first_id into JSON and store the result in another table.
I do it this way:
INSERT INTO othervalues(first_id, results)
SELECT first_id, json_agg(values) AS results FROM values GROUP BY first_id;
In the results column I get a JSON array of all the rows, which I can use later as it is.
My problem is that this is very slow with a huge table: with about 100 000 000 rows in values, it slows down my computer (I am testing locally, on an Ubuntu machine) until it dies.
Using EXPLAIN I noticed that it used a GroupAggregate, so I added:
SET work_mem = '1GB';
Now it uses a HashAggregate, but this still kills my computer. EXPLAIN gives me:
Insert on othervalues (cost=2537311.89..2537316.89 rows=200 width=64)
-> Subquery Scan on "*SELECT*" (cost=2537311.89..2537316.89 rows=200 width=64)
-> HashAggregate (cost=2537311.89..2537314.39 rows=200 width=206)
-> Seq Scan on values (cost=0.00..2251654.26 rows=57131526 width=206)
Any idea how to optimize it?
The solution I finally used is to split the GROUP BY into multiple batches:
First I create a temporary table with the unique IDs of the rows I want to group. This lets me fetch only part of the results at a time, like OFFSET and LIMIT would, but those can be very slow on huge result sets (a big offset means the executor still has to walk through all the preceding rows):
CREATE TEMP TABLE tempids AS SELECT ROW_NUMBER() OVER (ORDER BY theid) AS rn, theid FROM (SELECT DISTINCT theid FROM sourcetable) sourcetable;
Then in a WHILE loop:
DO $$DECLARE
i INTEGER := 0;
step INTEGER := 500000;
size INTEGER := (SELECT COUNT(*) FROM tempids);
BEGIN
WHILE i < size
LOOP
INSERT INTO target(theid, theresult)
SELECT ...
-- half-open window (i, i + step] so that no id is skipped or processed twice
WHERE tempids.rn > i AND tempids.rn <= i + step
GROUP BY tempids.theid;
i := i + step;
END LOOP;
END$$;
This looks more like ordinary procedural code than nice SQL, but it works.
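The same batching idea can also be driven from client code. The following is a minimal sketch, not the original solution: it uses Python's built-in sqlite3 so that it runs anywhere, a made-up 50-row sourcetable, group_concat as a stand-in for json_agg, an INTEGER PRIMARY KEY column rn in place of ROW_NUMBER(), and a tiny step so that several batches occur.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sourcetable (theid text, x int);
CREATE TABLE target (theid text, theresult text);
-- rn is assigned 1..n in insertion order, standing in for ROW_NUMBER().
CREATE TEMP TABLE tempids (rn INTEGER PRIMARY KEY, theid text);
""")
conn.executemany(
    "INSERT INTO sourcetable VALUES (?, ?)",
    [(f"id{n % 7}", n) for n in range(50)],  # made-up data: 7 distinct ids
)
conn.execute(
    "INSERT INTO tempids (theid) "
    "SELECT DISTINCT theid FROM sourcetable ORDER BY theid"
)

step = 3  # hypothetical batch size; the answer used 500000
size = conn.execute("SELECT count(*) FROM tempids").fetchone()[0]
i = 0
while i < size:
    # Half-open window (i, i + step]: every id lands in exactly one batch.
    conn.execute(
        """
        INSERT INTO target (theid, theresult)
        SELECT s.theid, group_concat(s.x)
        FROM tempids k
        JOIN sourcetable s ON s.theid = k.theid
        WHERE k.rn > ? AND k.rn <= ?
        GROUP BY s.theid
        """,
        (i, i + step),
    )
    conn.commit()  # each batch is committed on its own
    i += step

batched = conn.execute(
    "SELECT theid, theresult FROM target ORDER BY theid"
).fetchall()
print(len(batched), "groups written")
```

Committing after every batch keeps each aggregation small, so only one slice of the ids has to be held in memory at a time.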

Redshift SELECT * performance versus COUNT(*) for non existent row

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:
SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 36.75s
versus
SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 0.2s
Given that the table is sorted by project_id and then id, I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2 sec, which is about what I would expect. The SELECT * ... returns 0 results in 36.75 sec. That's a huge difference for the same result, and I don't understand why.
If it helps schema as follows:
CREATE TABLE profile (
project_id integer not null,
id varchar(256) not null,
created timestamp not null,
/* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);
Explain from SELECT COUNT(*) ...
XN Aggregate (cost=435.70..435.70 rows=1 width=0)
-> XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=0)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Explain from SELECT * ...
XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=7356)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Why is the non count much slower? Surely Redshift knows the row doesn't exist?
The reason is that in many RDBMSs the answer to a count(*) question often comes without an actual data scan, just from index or table statistics. Redshift stores the minimum and maximum value for each block, and it uses them to answer exists/does-not-exist questions such as the one in the described case. If the requested value falls inside a block's min/max boundaries, the scan is performed only on the data of the filtering columns. If the requested value is below or above the block boundaries, the answer is given much faster on the basis of the stored statistics. For a "select *" query, Redshift actually scans the data of all columns, as asked for by the "*", but filters only by the columns in the "where" clause.
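The min/max idea described above can be pictured with a toy model. This is only a sketch of the explanation in this answer, not Redshift's actual implementation: the Block class, the block boundaries, and the column counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    lo: str  # smallest value stored in the block
    hi: str  # largest value stored in the block


def blocks_to_read(zone_map: List[Block], wanted: str) -> List[int]:
    """Return the indexes of blocks whose [lo, hi] range could contain `wanted`."""
    return [n for n, b in enumerate(zone_map) if b.lo <= wanted <= b.hi]


# Hypothetical min/max metadata for the sorted `id` column.
id_zone_map = [Block("a000", "f999"), Block("g000", "m999"), Block("n000", "t999")]

outside = blocks_to_read(id_zone_map, "zzz")   # beyond every block's maximum
inside = blocks_to_read(id_zone_map, "h123")   # within the second block's range

# Per the explanation above: count(*) touches only the filter columns of the
# candidate blocks, while select * touches every column of them.
filter_columns = 2   # id and project_id, the columns in the WHERE clause
all_columns = 53     # hypothetical: 3 listed columns plus roughly 50 others
count_star_reads = len(inside) * filter_columns
select_star_reads = len(inside) * all_columns

print(outside, inside, count_star_reads, select_star_reads)
```

A value outside every block's range needs no block reads at all, while a value that merely falls inside a range forces a read even if the row does not exist, and that read is far wider for select * than for count(*).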