Optimise a simple update query for large dataset - postgresql

I have some data migration that has to occur between a parent and child table. For the sake of simplicity, the schemas are as follows:
-----------     -------------
| event   |     | parameter |
-----------     -------------
| id      |     | id        |
| order   |     | eventId   |
-----------     | order     |
                -------------
Because of an oversight in some business logic that needs to be performed, we need to update parameter.order to match the parent event.order. I have come up with the following SQL to do that:
UPDATE "parameter"
SET "order" = e."order"
FROM "event" e
WHERE "eventId" = e.id
The problem is that this query didn't resolve after over 4 hours and I had to clock out, so I cancelled it.
There are 11 million rows on parameter and 4 million rows on event. I've run EXPLAIN on the query and it tells me this:
Update on parameter  (cost=706691.80..1706622.39 rows=11217313 width=155)
  ->  Hash Join  (cost=706691.80..1706622.39 rows=11217313 width=155)
        Hash Cond: (parameter."eventId" = e.id)
        ->  Seq Scan on parameter  (cost=0.00..435684.13 rows=11217313 width=145)
        ->  Hash  (cost=557324.91..557324.91 rows=7724791 width=26)
              ->  Seq Scan on event e  (cost=0.00..557324.91 rows=7724791 width=26)
Based on this article, the "cost" referenced by EXPLAIN is an "arbitrary unit of computation".
Ultimately, this update needs to be performed, but I would accept it happening in one of two ways:
I am advised of a better way to do this query that executes in a timely manner (I'm open to all suggestions, including updating schemas, indexing, etc.)
The query remains the same, but I can somehow get an accurate prediction of the execution time (even if it's hours long). That way, at least, I can manage the expectations of the team. I understand that the exact time can't be known without actually running the query, but is there an easy way to "convert" these arbitrary units into some millisecond execution time?
Edit for Jim Jones' comment:
I executed the following query:
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
I got 9 identical rows like the following:
  pid  | locktype |      mode       |    query    |     query_start     | state
-------+----------+-----------------+-------------+---------------------+--------
 23192 | relation | AccessShareLock | <see below> | 2021-10-26 14:10:01 | active
query column:
--update parameter
--set "order" = e."order"
--from "event" e
--where "eventId" = e.id
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
Edit 2: I think I've been stupid here... The query produced by checking these locks is just the commented query. I think that means there's actually nothing to report.

If some rows already have the target value, you can skip those empty updates, which would otherwise be carried out at full cost. Like:
UPDATE parameter p
SET "order" = e."order"
FROM event e
WHERE p."eventId" = e.id
AND p."order" IS DISTINCT FROM e."order"; -- this
If both "order" columns are defined NOT NULL, simplify to:
...
AND p."order" <> e."order";
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
If you have to update all or most rows - and can afford it! - writing a new table may be cheaper overall, like Mike already mentioned. But concurrency and depending objects may stand in the way.
Aside: use legal, lower-case identifiers, so you don't have to double-quote. Makes your life with Postgres easier.
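For illustration only (the column names below are made up, not a suggestion to rename anything right now), a lower-cased version of the schema needs no quoting at all:
CREATE TABLE parameter (
    id       bigint PRIMARY KEY,
    event_id bigint,      -- instead of "eventId"
    ord      integer      -- instead of "order", which is a reserved word
);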

The query will be slow because for each UPDATE operation, it has to look up the index by id. Even with an index, on a large table, this is a per-row read/write so it is slow.
I'm not sure how to get a good estimate, maybe do 1% of the table and multiply?
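One rough way to do that is sketched below: run the same update restricted to roughly 1% of rows inside a transaction you roll back, time it with EXPLAIN (ANALYZE, BUFFERS), and extrapolate. It is only an estimate (each run still scans both tables in full, so multiplying up tends to overestimate), and nothing is actually changed because of the ROLLBACK:
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)       -- executes the update and reports real timings
UPDATE parameter p
SET    "order" = e."order"
FROM   event e
WHERE  p."eventId" = e.id
AND    random() < 0.01;          -- touch roughly 1% of the matching rows
ROLLBACK;                        -- undo the sample update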
I suggest creating a new table, then dropping the old one and renaming the new table.
CREATE TABLE parameter_new AS
SELECT
    parameter.id,
    parameter."eventId",
    e."order"
FROM
    parameter
    JOIN event AS e
        ON e.id = parameter."eventId";
Later, once you verify things:
ALTER TABLE parameter RENAME TO parameter_old;
ALTER TABLE parameter_new RENAME TO parameter;
Later, once you're completely certain:
DROP TABLE parameter_old;
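One caveat worth adding: CREATE TABLE ... AS copies no indexes, constraints, or defaults, so after the swap you would recreate whatever the old table had. A sketch, with illustrative names only (the real schema may differ):
ALTER TABLE parameter ADD PRIMARY KEY (id);
CREATE INDEX parameter_event_id_idx ON parameter ("eventId");
ALTER TABLE parameter
    ADD CONSTRAINT parameter_event_fk
    FOREIGN KEY ("eventId") REFERENCES event (id);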

Related

PostgreSQL 13 - Performance Improvement to delete large table data

I am using PostgreSQL 13 and have intermediate-level experience with PostgreSQL.
I have a table named tbl_employee. It stores employee details for a number of customers.
Below is my table structure, followed by datatype and index access method
   Column    |     Data Type     |   Index name    | Idx Access Type
-------------+-------------------+-----------------+-----------------
 id          | bigint            |                 |
 name        | character varying |                 |
 customer_id | bigint            | idx_customer_id | btree
 is_active   | boolean           | idx_is_active   | btree
 is_delete   | boolean           | idx_is_delete   | btree
I want to delete employees for specific customer by customer_id.
The table has 1,800,000+ records in total.
When I execute the query below for customer_id 1001 it returns a count of 85,000.
SELECT COUNT(*) FROM tbl_employee WHERE customer_id=1001;
When I perform the delete operation for this customer using the query below, it takes 2 hours 45 minutes to delete the records.
DELETE FROM tbl_employee WHERE customer_id=1001
Problem
My concern is that this query should take less than a minute to delete the records. Is it normal for it to take such a long time, or is there any way we can optimise it and reduce the execution time?
Below is the EXPLAIN output of the delete query.
The settings are seq_page_cost = 1 and random_page_cost = 4.
Below is the number of pages occupied by the table "tbl_employee" according to pg_class.
Please guide. Thanks
During:
DELETE FROM tbl_employee WHERE customer_id=1001
is there any other operation accessing this table? If only this SQL is accessing the table, I don't think it should take so much time.
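One way to check while the DELETE is running (a sketch; assumes you are allowed to see other sessions in pg_stat_activity):
SELECT pid, state, wait_event_type, wait_event, query
FROM   pg_stat_activity
WHERE  query ILIKE '%tbl_employee%'
AND    pid <> pg_backend_pid();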
In RDBMS systems each SQL statement is also a transaction, unless it's wrapped in BEGIN; and COMMIT; to make multi-statement transactions.
It's possible your multirow DELETE statement is generating a very large transaction that's forcing PostgreSQL to thrash -- to spill its transaction logs from RAM to disk.
You can try repeating this statement until you've deleted all the rows you need to delete:
DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000;
Doing it this way will keep your transactions smaller, and may avoid the thrashing.
The statement DELETE FROM tbl_employee WHERE customer_id=1001 LIMIT 1000; will not work, though: PostgreSQL's DELETE does not support a LIMIT clause.
To make the batch delete smaller, you can try this:
DELETE FROM tbl_employee WHERE ctid IN (SELECT ctid FROM tbl_employee where customer_id=1001 limit 1000)
Until there is nothing to delete.
Here the "ctid" is an internal column of Postgresql Tables. It can locate the rows.

PostgreSQL UPDATE doesn't seem to update some rows

I am trying to update a table from another table, but a few rows simply don't update, while the other million rows work just fine.
The statement I am using is as follows:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
It says 647 rows were updated, but I can't see the change.
I've also tried without the is null clause, results are the same.
If I do a join it seems to work as expected, the join query I used is this one:
SELECT sql, l.quali_ambiental, c.quali_ambiental FROM lotes_infos l
JOIN sirgas_lotes_centroid c
USING (sql)
WHERE l.quali_ambiental IS NULL;
It returns 787 rows, (some are both null, that's ok), this is a sample from the result from the join:
    sql     | l.quali_ambiental | c.quali_ambiental
------------+-------------------+-------------------
 1880040001 |                   | PA 10
 1880040001 |                   | PA 10
 0863690003 |                   | PA 4
 0850840001 |                   | PA 4
 3090500003 |                   | PA 4
 1330090001 |                   | PA 10
 1201410001 |                   | PA 9
 0550620002 |                   | PA 6
 0430790001 |                   | PA 1
 1340180002 |                   | PA 9
I used QGIS to visualize the results, and could not find any tips to why it is happening. The sirgas_lotes_centroid comes from the other table, the geometry being the centroid for the polygon. I used the centroid to perform faster spatial joins and now need to place the information into the table with the original polygon.
The sql column is type text, quali_ambiental is varchar(6) for both.
If I directly update one row using the following query, it works just fine:
UPDATE lotes_infos
SET quali_ambiental = 'PA 1'
WHERE sql LIKE '0040510001';
If you don't see results of a seemingly sound data-modifying query, the first question to ask is:
Did you commit your transaction?
Many clients work with auto-commit by default, but some do not. And even in the standard client psql you can start an explicit transaction with BEGIN (or syntax variants) to disable auto-commit. Then results are not visible to other transactions before the transaction is actually committed with COMMIT. It might hang indefinitely (which creates additional problems), or be rolled back by some later interaction.
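In psql, for example, the explicit form would look like this (shown only to make clear where the COMMIT has to happen):
BEGIN;
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM   sirgas_lotes_centroid s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;
COMMIT;   -- only now do other sessions see the change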
That said, you mention: some are both null, that's ok. You'll want to avoid costly empty updates with something like:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql
AND l.quali_ambiental IS NULL
AND s.quali_ambiental IS NOT NULL; --!
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The duplicate 1880040001 in your sample can have two explanations. Either lotes_infos.sql is not UNIQUE (even after filtering with l.quali_ambiental IS NULL). Or sirgas_lotes_centroid.sql is not UNIQUE. Or both.
If it's just lotes_infos.sql, your query should still work. But duplicates in sirgas_lotes_centroid.sql make the query non-deterministic (as @jjanes also pointed out). A target row in lotes_infos can have multiple candidates in sirgas_lotes_centroid. The outcome is arbitrary for lack of definition. If one of them has quali_ambiental IS NULL, it can explain what you observed.
My suggested query fixes the observed problem superficially, in that it excludes NULL values in the source table. But if there can be more than one non-null, distinct quali_ambiental for the same sirgas_lotes_centroid.sql, your query remains broken, as the result is arbitrary. You'll have to define which source row to pick and translate that into SQL.
Here is one example how to do that (chapter "Multiple matches..."):
Updating the value of a column
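Purely for illustration, one way to make the pick deterministic is DISTINCT ON over the source table; the tie-break below (smallest quali_ambiental per sql) is arbitrary and stands in for whatever rule you actually want:
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM  (
   SELECT DISTINCT ON (sql) sql, quali_ambiental
   FROM   sirgas_lotes_centroid
   WHERE  quali_ambiental IS NOT NULL
   ORDER  BY sql, quali_ambiental   -- arbitrary tie-break; adjust to your rule
   ) s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;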
Always include exact table definitions (CREATE TABLE statements) with any such question. This would save a lot of time wasted for speculation.
Aside: Why are the sql columns type text? Values like 1880040001 strike me as integer or bigint. If so, text is a costly design error.
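If (and only if) the values are really plain numbers, a type change might look like the sketch below. Caution: sample values like 0863690003 have a leading zero, which bigint would drop, so this is only valid if those zeros carry no meaning; the joining column in sirgas_lotes_centroid would need the same change to keep the join cheap.
ALTER TABLE lotes_infos
    ALTER COLUMN sql TYPE bigint USING sql::bigint;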

Using results from PostgreSQL query in FROM clause of another query

I have a table that's designed as follows.
master_table
id -> serial
timestamp -> timestamp without time zone
fk_slave_id -> integer
fk_id -> id of the table
fk_table1_id -> foreign key relationship with table1
...
fk_table30_id -> foreign key relationship with table30
Every time a new table is added, this table gets altered to include a new column to link. I've been told it was designed as such to allow for deletes in the tables to cascade in the master.
The issue I'm having is finding a proper solution to linking the master table to the other tables. I can do it programmatically using loops and such, but that would be incredibly inefficient.
Here's the query being used to grab the id of the table and the id of the row within that table.
SELECT fk_slave_id, concat(fk_table1_id,...,fk_table30_id) AS id
FROM master_table
ORDER BY id DESC
LIMIT 100;
The results are.
 fk_slave_id | id
-------------+-----
          30 | 678
          25 | 677
          29 | 676
           1 | 675
          15 | 674
           9 | 673
The next step is using this data to formulate the table required to get the required data. For example, data is required from table30 with id 678.
This is where I'm stuck. If I use WITH it doesn't seem to accept the output in the FROM clause.
WITH items AS (
SELECT fk_slave_id, concat(fk_table1_id,...,fk_table30_id) AS id
FROM master_table
ORDER BY id DESC
LIMIT 100
)
SELECT data
FROM concat('table', items.fk_slave_id)
WHERE id = items.id;
This produces the following error.
ERROR: missing FROM-clause entry for table "items"
LINE x: FROM string_agg('table', items.fk_slave_id)
plpgsql is an option, using EXECUTE with format(), but then I'd have to loop through each result and process it with EXECUTE.
Is there any way to achieve what I'm after using SQL or is it a matter of needing to do it programmatically?
Apologies on the bad title. I can't think of another way to word this question.
edit 1: Replaced rows with items
edit 2: Based on the responses it doesn't seem like this can be accomplished cleanly. I'll be resorting to creating an additional column and using triggers instead.
I don't think you can reference a dynamically named table like that in your FROM clause:
FROM concat('table', rows.fk_slave_id)
Have you tried building and executing that SQL from a stored procedure/function? You can create the SQL you want to execute as a string and then just EXECUTE it.
Take a look at this one:
PostgreSQL - Writing dynamic sql in stored procedure that returns a result set
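A minimal sketch of that approach (assumptions: every tableN has an id column and a text column called data; the function name and return type are made up for illustration):
CREATE OR REPLACE FUNCTION fetch_slave_row(_slave_id int, _id bigint)
  RETURNS TABLE (data text)
  LANGUAGE plpgsql AS
$$
BEGIN
   RETURN QUERY EXECUTE format(
      'SELECT data FROM %I WHERE id = $1',
      'table' || _slave_id)            -- e.g. table30
   USING _id;
END
$$;

-- hypothetical usage: SELECT * FROM fetch_slave_row(30, 678);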

optimizing postgres view for timestamps and aggregation of fields from another table

I've greatly simplified the examples to hopefully produce a clear enough question that can be answered:
Consider a table of events
CREATE TABLE alertable_events
(
  unique_id text NOT NULL DEFAULT ''::text,
  generated_on timestamp without time zone NOT NULL DEFAULT now(),
  message_text text NOT NULL DEFAULT ''::text,
  CONSTRAINT pk_alertable_events PRIMARY KEY (unique_id)
);
with the following data:
COPY alertable_events (unique_id,message_text,generated_on) FROM stdin;
one message one 2014-03-20 06:00:00.000000
two message two 2014-03-21 06:00:00.000000
three message three 2014-03-22 06:00:00.000000
four message four 2014-03-23 06:00:00.000000
five message five 2014-03-24 06:00:00.000000
\.
And for each event, there is a list of fields
CREATE TABLE alertable_event_fields
(
  unique_id text NOT NULL DEFAULT ''::text,
  field_name text NOT NULL,
  field_value text NOT NULL DEFAULT ''::text,
  CONSTRAINT pk_alertable_event_fields PRIMARY KEY (unique_id, field_name),
  CONSTRAINT fk_alertable_event_fields_0 FOREIGN KEY (unique_id)
      REFERENCES alertable_events (unique_id) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE CASCADE
);
with the following data:
COPY alertable_event_fields (unique_id,field_name,field_value) FROM stdin;
one field1 a
one field2 b
two field1 z
two field2 y
three field1 a
three field2 m
four field1 a
four field2 b
five field1 z
five field2 y
\.
I want to define a view that produces the following:
| unique_id | fields | message_text  | generated_on               | updated_on                 | count |
| five      | z|y    | message five  | 2014-03-21 06:00:00.000000 | 2014-03-24 06:00:00.000000 | 2     |
| four      | a|b    | message four  | 2014-03-20 06:00:00.000000 | 2014-03-23 06:00:00.000000 | 2     |
| three     | a|m    | message three | 2014-03-22 06:00:00.000000 | 2014-03-22 06:00:00.000000 | 1     |
Notably:
fields is a pipe delimited string (or any serialization of) the field values (json encoding of field_name:field_value pairs would be even better ... but I can work with pipe_delim for now)
the output is grouped by matching fields. Update 3/30 12:45am: The values are ordered alphabetically by their field_name, therefore a|b would not match b|a
a count is produced of the events that match that field set. Updated 3/30 12:45am: there can be a different number of fields per unique_id; a match requires matching all fields, not just a subset of the fields.
generated_on is the timestamp of the first event
updated_on is the timestamp of the most recent event
message_text is the message_text of the most recent event
I've produced this view, and it works for small data sets, however, as the alertable_events table grows, it becomes exceptionally slow. I can only assume I'm doing something wrong in the view because I have never dealt with anything quite so ugly.
Update 3/30 12:15PM EDT It looks like I may have server tuning problems causing this high run-times, see added explain for more info. If you see a glaring issue there, I'd be greatly interested in tweaking the server's configuration.
Can anyone piece together a view that handles large datasets well and has a significantly better run time than this? Perhaps using hstore? (I'm running 9.2 preferably, though 9.3 if I can have a nice json encoding of the fields.)
Updated 3/30 11:30AM I'm beginning to think my issue may be server tuning (which means I'll need to talk to the SA) Here's a very simple explain (analyze,buffers) which is showing a ridiculous run-time for as few as 8k rows in the unduplicated_event_fields
Update 3/30 7:20PM I bumped my available memory to 5MB using SET WORK_MEM='5MB' (which is plenty for the query below); strangely, even though the planner switched to an in-memory quicksort, it actually took on average 100ms longer!
explain (analyze,buffers)
SELECT a.unique_id,
array_to_string(array_agg(a.field_value order by a.field_name),'|') AS "values"
FROM alertable_event_fields a
GROUP BY a.unique_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=771.11..892.79 rows=4056 width=80) (actual time=588.679..630.989 rows=4056 loops=1)
   Buffers: shared hit=143, temp read=90 written=90
   ->  Sort  (cost=771.11..791.39 rows=8112 width=80) (actual time=588.591..592.622 rows=8112 loops=1)
         Sort Key: unique_id
         Sort Method: external merge  Disk: 712kB
         Buffers: shared hit=143, temp read=90 written=90
         ->  Seq Scan on alertable_event_fields a  (cost=0.00..244.40 rows=8112 width=80) (actual time=0.018..5.478 rows=8112 loops=1)
               Filter: (message_name = 'LIMIT_STATUS'::text)
               Buffers: shared hit=143
 Total runtime: 632.323 ms
(10 rows)
Update 3/30 4:10AM EDT I'm still not completely satisfied and would be interested in any further optimization. I have a requirement to support 500msgs/sec steady state, and although most of those should not be "events", I get a little backlogged right now when stress testing.
Update 3/30 12:00PM EDT Here's my most readable iteration yet; unfortunately, for 4000 rows I'm still looking at 600ms runtimes! ... (see above, as it's mostly contained within the innermost query) Any help here would be greatly appreciated
CREATE OR REPLACE VIEW views.unduplicated_events AS
SELECT a.unique_id, a.message_text,
       b."values", b.generated_on, b.updated_on, b.count
FROM   alertable_events a
JOIN  (
   SELECT b."values",
          min(a.generated_on) AS generated_on,
          max(a.generated_on) AS updated_on,
          count(*) AS count
   FROM   alertable_events a
   JOIN  (
      SELECT a.unique_id,
             array_to_string(array_agg(a.field_value ORDER BY a.field_name), '|') AS "values"
      FROM   alertable_event_fields a
      GROUP  BY a.unique_id
      ) b USING (unique_id)
   GROUP  BY b."values"
   ) b ON a.generated_on = b.updated_on
ORDER  BY updated_on DESC;
Update 3/30 12:00PM EDT removed old stuff as this is getting too long
Some pointers
Invalid query
Your current query is incorrect unless generated_on is unique, which is not declared in the question and probably is not the case:
CREATE OR REPLACE VIEW views.unduplicated_events AS
SELECT ...
FROM alertable_events a
JOIN ( ... ) b ON a.generated_on=b.updated_on -- !! unreliable
Possibly faster
SELECT DISTINCT ON (f.fields)
       unique_id        -- most recent
     , f.fields
     , e.message_text   -- most recent
     , min(e.generated_on) OVER (PARTITION BY f.fields) AS generated_on  -- "first"
     , e.generated_on AS updated_on  -- most recent
     , count(*) OVER (PARTITION BY f.fields) AS ct
FROM   alertable_events e
JOIN  (
   SELECT unique_id, array_to_string(array_agg(field_value), '|') AS fields
   FROM  (
      SELECT unique_id, field_value
      FROM   alertable_event_fields
      ORDER  BY 1, field_name  -- a bit of a hack, but much faster
      ) f
   GROUP  BY 1
   ) f USING (unique_id)
ORDER  BY f.fields, e.generated_on DESC;
SQL Fiddle.
The result is currently sorted by fields. If you need a different sort order, you'd need to wrap it in another subquery ...
Major points
The output column name generated_on conflicts with the input column generated_on. You have to table-qualify the column e.generated_on to refer to the input column. I added table-qualification everywhere to make it clear, but it is only actually necessary in the ORDER BY clause. The manual:
If an ORDER BY expression is a simple name that matches both an
output column name and an input column name, ORDER BY will interpret
it as the output column name. This is the opposite of the choice that
GROUP BY will make in the same situation. This inconsistency is made
to be compatible with the SQL standard.
The updated query should also be faster (as intended all along). Run EXPLAIN ANALYZE again.
For the whole query, indexes will hardly be of use. Only if you select specific rows ... One possible exception: a covering index for alertable_event_fields:
CREATE INDEX f_idx1
ON alertable_event_fields (unique_id, field_name, field_value);
Lots of write operations might void the benefit, though.
array_agg(field_value ORDER BY ...) tends to be slower for big sets than pre-sorting in a subquery.
DISTINCT ON is convenient here. Not sure, whether it's actually faster, though, since ct and generated_on have to be computed in separate window functions, which requires another sort step.
work_mem: setting it too high can actually harm performance. More in the Postgres Wiki, or in "Craig's list".
Generally this is hard to optimize. Indexes fail because the sort order depends on two tables. If you can work with a snapshot, consider a MATERIALIZED VIEW.
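For completeness, a sketch of the snapshot idea (assumes Postgres 9.3+, since materialized views don't exist in 9.2; the name is illustrative):
CREATE MATERIALIZED VIEW unduplicated_events_mv AS
SELECT DISTINCT ON (f.fields)
       unique_id
     , f.fields
     , e.message_text
     , min(e.generated_on) OVER (PARTITION BY f.fields) AS generated_on
     , e.generated_on AS updated_on
     , count(*) OVER (PARTITION BY f.fields) AS ct
FROM   alertable_events e
JOIN  (
   SELECT unique_id, array_to_string(array_agg(field_value), '|') AS fields
   FROM  (
      SELECT unique_id, field_value
      FROM   alertable_event_fields
      ORDER  BY 1, field_name
      ) f
   GROUP  BY 1
   ) f USING (unique_id)
ORDER  BY f.fields, e.generated_on DESC;

-- refresh whenever a stale snapshot is acceptable:
REFRESH MATERIALIZED VIEW unduplicated_events_mv;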

lock the rows until next select postgres

Is there a way in Postgres to lock rows until the next select query execution from the same system? And one more thing: there will be no update process on the locked rows.
The scenario is something like this:
If the table1 contains data like
id | txt
-------------------
1 | World
2 | Text
3 | Crawler
4 | Solution
5 | Nation
6 | Under
7 | Padding
8 | Settle
9 | Begin
10 | Large
11 | Someone
12 | Dance
If sys1 executes
select * from table1 order by id limit 5;
then it should lock the rows with id 1 to 5 against other systems that execute select statements concurrently.
Later, if sys1 executes another select query like
select * from table1 where id>10 order by id limit 5;
then the previously locked rows should be released.
I don't think this is possible. You cannot block a read only access to a table (unless that select is done FOR UPDATE)
As far as I can tell, the only chance you have is to use the pg_advisory_lock() function.
http://www.postgresql.org/docs/current/static/functions-admin.html#FUNCTIONS-ADVISORY-LOCKS
But this requires a "manual" release of the locks obtained through it. You won't get an automatic unlocking with that.
To lock the rows you would need something like this:
select pg_advisory_lock(id), *
from (
   select * from table1 order by id limit 5
) t
(Note the use of the derived table for the LIMIT part. See the manual link I posted for an explanation)
Then you need to store the retrieved IDs and later call pg_advisory_unlock() for each ID.
If each process is always releasing all IDs at once, you could simply use pg_advisory_unlock_all() instead. Then you will not need to store the retrieved IDs.
Note that this will not prevent others from reading the rows using "normal" selects. It will only work if every process that accesses that table uses the same pattern of obtaining the locks.
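For completeness, releasing looks like this (lock and unlock must happen in the same session, using the same key):
SELECT pg_advisory_unlock(3);      -- release one advisory lock by its key
SELECT pg_advisory_unlock_all();   -- or drop every advisory lock this session holds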
It looks like you really have a transaction which transcends the borders of your database, and all the change happens in another system.
My idea is SELECT ... FOR UPDATE NOWAIT to lock the relevant rows, then offload the data into another system, then roll back to unlock the rows. No two SELECT ... FOR UPDATE queries will select the same row, and the second select will fail immediately rather than wait and proceed.
But you don't seem to mark offloaded records in any way; I don't see why two non-consecutive selects wouldn't happily select overlapping ranges. So I'd still update the records with a flag and/or a target user name, and would only select records with the flag unset.
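A sketch of that flag-based variant, assuming a new boolean column offloaded DEFAULT false has been added to table1 (newer Postgres versions could use FOR UPDATE SKIP LOCKED here instead of NOWAIT):
WITH claimed AS (
   SELECT id
   FROM   table1
   WHERE  NOT offloaded
   ORDER  BY id
   LIMIT  5
   FOR UPDATE NOWAIT                -- a concurrent claimer errors out instead of waiting
)
UPDATE table1 t
SET    offloaded = true
FROM   claimed c
WHERE  t.id = c.id
RETURNING t.*;                      -- the five rows this session just claimed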
I tried both select...for update and pg_try_advisory_lock and managed to get near my requirement.
/*rows are locking but limit is the problem*/
select * from table1 where pg_try_advisory_lock( id) limit 5;
.
.
$_SESSION['rows'] = $rowcount; // no of row to process
.
.
/*afer each process of word*/
$_SESSION['rows'] -=1;
.
.
/*and finally unlock locked rows*/
if ($_SESSION['rows']===0)
select pg_advisory_unlock_all() from table1
But there are two problems with this:
1. As the LIMIT is applied before the lock, every instance ends up trying to lock the same rows each time.
2. I'm not sure whether pg_advisory_unlock_all will unlock only the rows locked by the current instance, or those locked by all instances.