PostgreSQL: UPDATE large table

I have a large PostgreSQL table of 29 million rows (almost 9 GB, according to the stats tab in pgAdmin). The table is PostGIS-enabled, with an empty geometry column.
I want to UPDATE the geometry column using ST_GeomFromText, reading from X and Y coordinate columns (SRID: 27700) stored in the same table. However, running this query on the whole table at once results in 'out of disk space' and 'connection to server lost' errors... the latter being less frequent.
To overcome this, should I UPDATE the 29 million rows in batches/stages? How can I update the first 1 million rows, then the next 1 million, and so on until I reach 29 million?
Or are there other more efficient ways of updating large tables like this?
I should add, the table is hosted in AWS.
My UPDATE query is:
UPDATE schema.table
SET geom = ST_GeomFromText('POINT(' || eastingcolumn || ' ' || northingcolumn || ')',27700);

You did not give any server specs; writing 9 GB can be pretty fast on recent hardware.
You should be OK with a single long UPDATE, unless you have concurrent writes to this table.
A common trick to overcome this problem (a very long transaction locking writes to the table) is to split the UPDATE into ranges based on the primary key, run in separate transactions.
/* Use PK or any attribute with a known distribution pattern */
UPDATE schema.table SET ... WHERE id BETWEEN 0 AND 1000000;
UPDATE schema.table SET ... WHERE id BETWEEN 1000001 AND 2000000;
For a high level of concurrent writes, people use more subtle tricks (such as SELECT FOR UPDATE / NOWAIT, lightweight locks, or retry logic).
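One way to generate those range statements mechanically is with generate_series and psql's \gexec meta-command (psql 9.6+), which executes each row of the previous result as a statement; with autocommit on, every batch then commits on its own. A sketch only, assuming an integer id primary key spanning roughly 0 to 29 million (the table and column names are the asker's placeholders):

```sql
-- Generate one UPDATE per 1M-id range, then run each with \gexec.
-- With autocommit on, each generated UPDATE commits separately.
SELECT format(
    'UPDATE schema.table SET geom = ST_SetSRID(ST_Point(eastingcolumn, northingcolumn), 27700) WHERE id BETWEEN %s AND %s',
    lo, lo + 999999)
FROM generate_series(0, 28000000, 1000000) AS lo
\gexec
```

If the id sequence has gaps this still works; each range simply updates fewer rows.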

From my original question:
However, running this query on the whole table at once results in 'out of disk space' and 'connection to server lost' errors... the latter being less frequent.
Turns out our Amazon AWS instance database was running out of space, stopping my original ST_GeomFromText query from completing. Freeing up space fixed it.
On an important note, as suggested by @mlinth, ST_Point ran my query far quicker than ST_GeomFromText (24 minutes vs. 2 hours).
My final query being:
UPDATE schema.tablename
SET geom = ST_SetSRID(ST_Point(eastingcolumn,northingcolumn),27700);

Related

Azure Postgres AUTOVACUUM AND ANALYZE THRESHOLD - How to change it?

I am coming again with another Postgres question. We are using the Managed Service from Azure that uses autovacuum. Both vacuum and statistics are automatic.
The problem I am getting is that for a specific query, when it runs at specific hours, the plan is not good. I realized that after collecting statistics manually, the plan behaves well again.
From the documentation of Azure I got the following:
The vacuum process reads physical pages and checks for dead tuples. Every page in shared_buffers is considered to have a cost of 1 (vacuum_cost_page_hit). All other pages are considered to have a cost of 20 (vacuum_cost_page_dirty), if dead tuples exist, or 10 (vacuum_cost_page_miss), if no dead tuples exist. The vacuum operation stops when the process exceeds the autovacuum_vacuum_cost_limit. After the limit is reached, the process sleeps for the duration specified by the autovacuum_vacuum_cost_delay parameter before it starts again. If the limit isn't reached, autovacuum starts after the value specified by the autovacuum_naptime parameter.

In summary, the autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit parameters control how much data cleanup is allowed per unit of time. Note that the default values are too low for most pricing tiers. The optimal values for these parameters are pricing tier-dependent and should be configured accordingly.

The autovacuum_max_workers parameter determines the maximum number of autovacuum processes that can run simultaneously.

With PostgreSQL, you can set these parameters at the table level or instance level. Today, you can set these parameters at the table level only in Azure Database for PostgreSQL.
Let's imagine that I want to tune the default values for specific tables, as currently all of them use the defaults set for the whole database.
With that in mind, I could try the following (where X is what I don't know):
ALTER TABLE tablename SET (autovacuum_vacuum_threshold = X );
ALTER TABLE tablename SET (autovacuum_vacuum_scale_factor = X);
ALTER TABLE tablename SET (autovacuum_vacuum_cost_limit = X );
ALTER TABLE tablename SET (autovacuum_vacuum_cost_delay = X );
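Several of these storage parameters can also be combined in a single ALTER TABLE. For example (the numbers here are purely illustrative placeholders, not recommendations), a common pattern for heavily updated tables is to zero the scale factor and rely on a fixed threshold, so the trigger point stops growing with the table:

```sql
-- Illustrative values only: fire auto-vacuum/analyze after ~10000 row
-- changes regardless of table size, by zeroing the scale factors.
ALTER TABLE applications SET (
    autovacuum_analyze_scale_factor = 0,
    autovacuum_analyze_threshold    = 10000,
    autovacuum_vacuum_scale_factor  = 0,
    autovacuum_vacuum_threshold     = 10000
);
```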
Currently I have these values in pg_stat_all_tables
SELECT schemaname,relname,n_tup_ins,n_tup_upd,n_tup_del,last_analyze,last_autovacuum,last_autoanalyze,analyze_count,autoanalyze_count
FROM pg_stat_all_tables where schemaname = 'swp_am_hcbe_pro'
and relname in ( 'submissions','applications' )
schemaname      | relname      | n_tup_ins | n_tup_upd | n_tup_del | last_analyze                  | last_autovacuum               | last_autoanalyze              | analyze_count | autoanalyze_count
----------------+--------------+-----------+-----------+-----------+-------------------------------+-------------------------------+-------------------------------+---------------+------------------
swp_am_hcbe_pro | applications |    264615 |  11688533 |     18278 | 2021-11-11 08:45:45.878654+00 | 2021-11-11 13:50:27.498745+00 | 2021-11-10 12:02:04.690082+00 |             1 |               152
swp_am_hcbe_pro | submissions  |    663107 |    687757 |     51603 | 2021-11-11 08:46:48.054731+00 | 2021-11-07 04:41:30.205468+00 | 2021-11-04 15:25:45.758618+00 |             1 |                20
Those two tables are by far the ones getting most of the DML activity.
Questions
How can I determine the best values for those specific autovacuum parameters for tables with huge DML activity?
How can I force Postgres to run the automatic analyze more often for these tables, so that I get more up-to-date statistics? According to the documentation:
autovacuum_analyze_threshold
Specifies the minimum number of inserted, updated or deleted tuples needed to trigger an ANALYZE in any one table. The default is 50 tuples. This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
Does it mean that an auto-analyze is triggered once the number of deleted, updated, or inserted tuples reaches 50? Because I am not seeing this behaviour.
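For context on why a flat 50 is not observed: the full documented trigger condition is autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples, and the stock scale factor is 0.1, i.e. 10% of the table. A sketch of estimating the trigger point per table (assumes the stock defaults of 50 and 0.1; substitute your own settings):

```sql
-- Estimated number of changed tuples needed before auto-analyze fires,
-- using the documented formula: threshold + scale_factor * reltuples.
SELECT s.relname,
       50 + 0.1 * c.reltuples AS analyze_trigger_rows,
       s.n_mod_since_analyze  AS changes_so_far
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE s.relname IN ('submissions', 'applications');
```

For a table with ~11.7 million rows, that works out to roughly 1.17 million changed tuples before an auto-analyze fires under the defaults.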
If I change the values for the tables, should I do the same for their indexes? Is there any option like CASCADE or similar, so that changing the table also applies the values to the corresponding indexes?
Thank you in advance for any advice. If you need any further details, let me know.

PostgreSQL Database size is not equal to sum of size of all tables

I am using an AWS RDS PostgreSQL instance, and the query below to get the size of all databases.
SELECT datname, pg_size_pretty(pg_database_size(datname))
from pg_database
order by pg_database_size(datname) desc
One database's size is 23 GB, but when I ran the query below to get the sum of the sizes of all individual tables in this particular database, the total was only around 8 GB.
select pg_size_pretty(sum(pg_total_relation_size(table_schema || '.' || table_name)))
from information_schema.tables
As it is an AWS RDS instance, I don't have rights on pg_toast schema.
How can I find out which database objects are consuming the space?
Thanks in advance.
The documentation says:
pg_total_relation_size ( regclass ) → bigint
Computes the total disk space used by the specified table, including all indexes and TOAST data. The result is equivalent to pg_table_size + pg_indexes_size.
So TOAST tables are covered, and so are indexes.
One simple explanation could be that you are connected to a different database than the one that is shown to be 23GB in size.
Another likely explanation would be materialized views, which consume space, but do not show up in information_schema.tables.
Yet another explanation could be that there have been crashes that left some garbage files behind, for example after an out-of-space condition during the rewrite of a table or index.
This is of course harder to debug on a hosted platform, where you don't have shell access...
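One way to check the materialized-view theory without shell access is to size everything through pg_class, which covers relation kinds that information_schema.tables omits. A sketch:

```sql
-- Largest relations of any kind; relkind 'm' marks materialized views.
-- pg_total_relation_size includes indexes and TOAST data.
SELECT n.nspname, c.relname, c.relkind,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')          -- plain tables and matviews
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 20;
```

If the totals here still fall well short of pg_database_size, leftover garbage files from a crash become the more likely culprit.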

Postgres - Bulk transferring of data from one table to another

I need to transfer a large amount of data (several million rows) from one table to another. So far I’ve tried doing this….
INSERT INTO TABLE_A (field1, field2)
SELECT field1, field2 FROM TABLE_A_20180807_BCK;
This worked (eventually) for a table with about 10 million rows in it (took 24 hours). The problem is that I have several other tables that need the same process applied and they’re all a lot larger (the biggest is 20 million rows). I have attempted a similar load with a table holding 12 million rows and it failed to complete in 48 hours so I had to cancel it.
Other issues that probably affect performance: 1) TABLE_A has a field based on an auto-generated sequence, and 2) TABLE_A has an AFTER INSERT trigger that parses each new record and adds a second record to TABLE_B.
A number of other threads have suggested doing a pg_dump of TABLE_A_20180807_BCK and then loading the data back into TABLE_A. I'm not sure pg_dump would actually work for me because I'm only interested in a couple of fields from TABLE_A, not the whole lot.
Instead I was wondering about the following….
Export to a CSV file…..
COPY TABLE_A_20180807_BCK (field1,field2) to 'd:\tmp\dump\table_a.dump' DELIMITER ',' CSV;
Import back into the desired table….
COPY TABLE_A(field1,field2) FROM 'd:\tmp\dump\table_a.dump' DELIMITER ',' CSV
Is the export/import method likely to be any quicker – I’d like some guidance on this before I start on another job that may take days to run, and may not even work any better! The obvious answer of "just try it and see" isn't really an option, I can't afford more downtime!
(this is follow-on question from this, if any background details are required)
Update....
I don't think there are any significant problems with the trigger. Under normal circumstances records are inserted into TABLE_A at a rate of about 1000/sec (including trigger time). I think the issue is likely the size of the transaction: under normal circumstances records are inserted in blocks of 100 records per INSERT, whereas the statement shown above attempts to add 10 million records in a single transaction. My guess is that this is the problem, but I've no way of knowing if it really is, or whether there's a suitable workaround (or if the export/import method I've proposed would be any quicker).
Maybe I should have emphasized this earlier, every insert into TABLE_A fires a trigger that adds record to TABLE_B. It's the data that's in TABLE_B that's the final objective, so disabling the trigger isn't an option! This whole problem came about because I accidentally disabled the trigger for a few days, and the preferred solution to the question 'how to run a trigger on existing rows' seemed to be 'remove the rows and add them back again' - see the original post (link above) for details.
My current attempt involves using the COPY command with a WHERE clause to split the contents of TABLE_A_20180807_BCK into a dozen small files and then re-load them one at a time. This may not give me any overall time saving, but although I can't afford 24 hours of continuous downtime, I can afford 6 hours of downtime for 4 nights.
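For reference, COPY itself has no WHERE clause (PostgreSQL 12 later added one, but only for COPY FROM); wrapping a query inside COPY achieves the same split. A sketch, assuming an integer id column to chunk on (paths follow the asker's server-side layout):

```sql
-- Export one key-range chunk at a time by wrapping a query in COPY:
COPY (SELECT field1, field2
      FROM table_a_20180807_bck
      WHERE id BETWEEN 1 AND 1000000)
TO 'd:\tmp\dump\table_a_chunk01.csv' WITH (FORMAT csv);

-- ...and later load each chunk back in its own transaction:
COPY table_a (field1, field2)
FROM 'd:\tmp\dump\table_a_chunk01.csv' WITH (FORMAT csv);
```

Each chunk load is its own transaction, which sidesteps the single 10-million-row transaction while still firing the trigger for every inserted row.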
Preparation: if you have access and can restart your server, edit postgresql.conf and set checkpoint_segments to 32 or perhaps more. This will reduce the frequency and number of checkpoints during this operation; you can undo it when you're finished. This step is not strictly necessary, but it should speed up writes considerably. (Note: checkpoint_segments was removed in PostgreSQL 9.5; on later versions, raise max_wal_size instead.)
Step 1: drop/delete all indexes and triggers on table A.
EDIT: Step 1a
ALTER TABLE table_a SET UNLOGGED;
(repeat step 1 for each table you're inserting into)
Step 2. (unnecessary if you do one table at a time)
begin transaction;
Step 3.
INSERT INTO TABLE_A (field1, field2)
SELECT field1, field2 FROM TABLE_A_20180807_BCK;
(repeat step 3 for all tables being inserted into)
Step 4. (unnecessary if you do one table at a time)
commit;
Step 5 re-enable indexes and triggers on all tables.
Step 5a.
ALTER TABLE table_a SET LOGGED;
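Put together, the sequence for a single table might look like the sketch below. This is illustrative only: ALTER TABLE ... SET UNLOGGED needs PostgreSQL 9.5+, the index name is a placeholder, and (per the question) the trigger on TABLE_A must stay enabled, so no trigger is dropped here:

```sql
-- Step 1a: skip WAL during the load (crash = data loss for this table)
ALTER TABLE table_a SET UNLOGGED;
-- Step 1: drop indexes so they aren't maintained row-by-row
DROP INDEX IF EXISTS table_a_field1_idx;   -- placeholder name

-- Step 3: the bulk load itself
INSERT INTO table_a (field1, field2)
SELECT field1, field2 FROM table_a_20180807_bck;

-- Step 5: rebuild the index, then make the table WAL-logged again
CREATE INDEX table_a_field1_idx ON table_a (field1);
ALTER TABLE table_a SET LOGGED;            -- rewrites the table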

Select * from table_name is running slow

The table contains around 700,000 rows. Is there any way to make the query run faster?
This table is stored on a server.
I have tried running the query selecting only specific columns.
If select * from table_name is unusually slow, check for these things:
Network speed. How large is the data and how fast is your network? For large queries you may want to think about your data in bytes instead of rows. Run select bytes/1024/1024/1024 gb from dba_segments where segment_name = 'TABLE_NAME'; and compare that with your network speed.
Row fetch size. If the application or IDE is fetching one-row-at-a-time, each row has a large overhead with network lag. You may need to increase that setting.
Empty segment. In a few weird cases the table's segment size can increase and never shrink. For example, if the table used to have billions of rows, and they were deleted but not truncated, the space would not be released. Then a select * from table_name may need to read a lot of empty extents to get to the real data. If the GB size from the above query seems too large, run alter table table_name move; to rebuild the table and possibly save space.
Recursive query. A query that simple almost couldn't have a bad execution plan. It's possible, but rare, for a recursive query to have a bad execution plan. While the query is running, look at select * from gv$sql where users_executing > 0;. There might be a data dictionary query that's really slow and needs to be tuned.

How does this PostgreSQL query slow down when the number of rows increases?

I have a table briefly structured like this:
CREATE TABLE tn (
    id        integer NOT NULL PRIMARY KEY DEFAULT nextval('tn_sequence'),
    create_dt timestamp NOT NULL DEFAULT now(),
    ...............
    deleted   boolean
);
create_dt is the timestamp when the row is inserted into the database.
deleted indicates whether the row is still useful.
And I have the following queries:
select * from tn where create_dt > ( NOW() - interval '150 seconds' ) and deleted = FALSE;
select * from tn where create_dt < ( NOW() - interval '150 seconds' ) and deleted = FALSE;
My question is how these queries will slow down as the number of rows increases. For instance, when the number of rows exceeds 10K, 20K, or 100K, will it make a big impact on the speed? Is there any way I can optimize these queries? Note that every 5 seconds I set the deleted column to TRUE for rows older than 150 seconds.
The effect of table growth on performance will depend on the query plan chosen, available indexes, the selectivity of the query, and lots of other factors. EXPLAIN ANALYZE on the query might help. In short, if your query only selects a few rows and can use a simple b-tree index then it won't usually slow down tons, only a little as the index grows. On the other hand queries using complex non-indexed conditions or returning lots of rows could perform very badly indeed.
Your issue appears to mirror that in the question How should we handle rows which won't be queried once they are old in PostgreSQL?
The advice given there should apply:
Use a partial index with the condition WHERE (not deleted); or
partition on 'deleted' with constraint exclusion enabled.
For example, you might:
CREATE INDEX create_dt_when_not_deleted_idx
ON tn (create_dt)
WHERE (NOT deleted);
This includes only rows where deleted = 'f' (assuming deleted is not null) in the index. This isn't the same as having them gone from the table completely.
Nothing changes with full table sequential scans, the deleted='t' rows must still be scanned; and
There's more I/O than if the deleted = 't' rows weren't there because any given heap page is likely to contain a mix of deleted = 't' and deleted = 'f' rows.
You can reduce the impact of the latter by CLUSTERing on an index that includes deleted. Again, this will have no effect on sequential scans. To help with sequential scans you would have to partition the table on deleted.
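The CLUSTER suggestion above might look like the following (the index name is illustrative; note that CLUSTER takes an exclusive lock and is a one-time physical sort, not maintained as new rows arrive):

```sql
-- Physically order the heap so deleted='t' and deleted='f' rows are
-- no longer interleaved on the same pages.
CREATE INDEX tn_deleted_create_dt_idx ON tn (deleted, create_dt);
CLUSTER tn USING tn_deleted_create_dt_idx;
```

Because the ordering decays as rows are updated, you would re-run CLUSTER periodically during a maintenance window if you rely on it.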
Pg 9.2's index only scans should (I think, haven't tested) use the partial index. When an index only scan is possible the partial index should be as fast as an index on a table containing only the deleted = 'f' rows.
Note that you'll need to keep table and index bloat under control. Ensure autovacuum runs very frequently, and use a current version of PostgreSQL that doesn't need things like a manually-managed free space map and has the latest, best-behaved autovacuum. I'd recommend 9.0 or above, preferably 9.1 or 9.2. Tune autovacuum to run aggressively.
When tuning and testing performance - test your queries with EXPLAIN ANALYZE, don't just guess.