I have run into a case where Postgres always prefers a sequential scan for a table that has around 70M rows. (An index scan is ideal for that query, and I have confirmed it by setting enable_seq_scan=off; the query sped up by about 200x.)
So, in order to help Postgres understand my data better, I executed this:
ALTER TABLE tablename ALTER COLUMN columnname SET STATISTICS 1000;
Unfortunately this requires a SHARE UPDATE EXCLUSIVE lock, which is taken on the entire table (too much locking for my case).
Is there a way to avoid locking for this statement?
Data sharding is done for this table based on the primary key range, so I would also like Postgres to understand my PK better, so that it knows which user has a large amount of data. Would it help if I increased the statistics target of the primary key column as well?
From the very docs you linked
SET STATISTICS
This form sets the per-column statistics-gathering target for subsequent ANALYZE operations. The target can be set in the range 0 to 10000; alternatively, set it to -1 to revert to using the system default statistics target (default_statistics_target). For more information on the use of statistics by the PostgreSQL query planner, refer to Section 14.2.
SET STATISTICS acquires a SHARE UPDATE EXCLUSIVE lock.
And, from the docs on Explicit Locking:
SHARE UPDATE EXCLUSIVE
Conflicts with the SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE lock modes. This mode protects a table against concurrent schema changes and VACUUM runs.
Acquired by VACUUM (without FULL), ANALYZE, CREATE INDEX CONCURRENTLY, and ALTER TABLE VALIDATE and other ALTER TABLE variants (for full details see ALTER TABLE).
So you can't change the schema or run VACUUM on the table while the statistics target is being changed. So what? The ALTER itself is only a catalog update; it should complete almost instantly.
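If the concern is that the ALTER could queue behind a long-running transaction and then make other sessions queue behind it, a common precaution (not part of the answer above, just a standard technique) is to cap the wait with lock_timeout and retry later if it fires. A minimal sketch, reusing the placeholder names from the question:

SET lock_timeout = '2s';  -- give up instead of queueing indefinitely behind other locks
ALTER TABLE tablename ALTER COLUMN columnname SET STATISTICS 1000;
RESET lock_timeout;
-- gather statistics with the new target for just that column
ANALYZE tablename (columnname);

If the ALTER times out, nothing has changed and you can simply retry at a quieter moment; the lock is only needed for the instant it takes to update the catalog.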
Related
We use PostgreSQL for analytics. Three typical operations we do on tables are:
Create table as select
Create table followed by insert in table
Drop table
We are not doing any UPDATE, DELETE etc.
For this situation, can we assume that the estimate from the following query would just be accurate?
SELECT reltuples AS estimate FROM pg_class where relname = 'mytable';
With autovacuum running (which is the default), ANALYZE and VACUUM are fired up automatically, and both of them update reltuples. The basic configuration parameters for ANALYZE (which typically runs more often), quoting the manual:
autovacuum_analyze_threshold (integer)
Specifies the minimum number of inserted, updated or deleted tuples
needed to trigger an ANALYZE in any one table. The default is 50
tuples. This parameter can only be set in the postgresql.conf file
or on the server command line; but the setting can be overridden for
individual tables by changing table storage parameters.
autovacuum_analyze_scale_factor (floating point)
Specifies a fraction of the table size to add to
autovacuum_analyze_threshold when deciding whether to trigger an
ANALYZE. The default is 0.1 (10% of table size). This parameter can
only be set in the postgresql.conf file or on the server command
line; but the setting can be overridden for individual tables by
changing table storage parameters.
Another quote gives insight into the details:
For efficiency reasons, reltuples and relpages are not updated
on-the-fly, and so they usually contain somewhat out-of-date values.
They are updated by VACUUM, ANALYZE, and a few DDL commands such
as CREATE INDEX. A VACUUM or ANALYZE operation that does not
scan the entire table (which is commonly the case) will incrementally
update the reltuples count on the basis of the part of the table it
did scan, resulting in an approximate value. In any case, the planner
will scale the values it finds in pg_class to match the current
physical table size, thus obtaining a closer approximation.
Estimates are kept up to date accordingly. You can change autovacuum settings to be more aggressive, even per table. See:
Aggressive Autovacuum on PostgreSQL
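As a sketch of such a per-table override (the table name and values are only illustrative, not a recommendation):

ALTER TABLE mytable SET (
    autovacuum_analyze_scale_factor = 0.02,  -- analyze after 2% of rows have changed ...
    autovacuum_analyze_threshold    = 1000   -- ... plus this base number of rows
);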
On top of that, you can scale estimates the way Postgres itself does. See:
Fast way to discover the row count of a table in PostgreSQL
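The idea, as a sketch (mytable is a placeholder): take the rows-per-page ratio from pg_class and multiply it by the relation's current number of pages, just like the planner does.

SELECT (reltuples / NULLIF(relpages, 0))
     * (pg_relation_size(oid) / current_setting('block_size')::int) AS estimated_rows
FROM   pg_class
WHERE  oid = 'mytable'::regclass;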
Note that VACUUM (of secondary relevance to your case) wasn't triggered by only INSERTs before Postgres 13. Quoting the release notes:
Allow inserts, not only updates and deletes, to trigger vacuuming
activity in autovacuum (Laurenz Albe, Darafei
Praliaskouski)
Previously, insert-only activity would trigger auto-analyze but not
auto-vacuum, on the grounds that there could not be any dead tuples to
remove. However, a vacuum scan has other useful side-effects such as
setting page-all-visible bits, which improves the efficiency of
index-only scans. Also, allowing an insert-only table to receive
periodic vacuuming helps to spread out the work of “freezing” old
tuples, so that there is not suddenly a large amount of freezing work
to do when the entire table reaches the anti-wraparound threshold all
at once.
If necessary, this behavior can be adjusted with the new parameters
autovacuum_vacuum_insert_threshold and
autovacuum_vacuum_insert_scale_factor, or the equivalent
table storage options.
I know the statistical information is updated by VACUUM ANALYZE and CREATE INDEX, but I'm not sure about some other situations:
insert new data into a table
let the database do nothing (and wait for autovacuum?)
delete some rows in a table
truncate a partition of a table
CREATE INDEX does not cause new column statistics to be calculated (it only refreshes reltuples and relpages in pg_class).
The autovacuum daemon will run an ANALYZE for every table that has had more than 10% of its rows changed (this is the default configuration). These changes are INSERTs, UPDATEs, or DELETEs. TRUNCATE will clear the statistics for a table.
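To check when a table's statistics were last refreshed, and how many rows have been modified since, you can query the statistics views (a sketch; mytable is a placeholder):

SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM   pg_stat_user_tables
WHERE  relname = 'mytable';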
I'm trying to obtain an indefinite lock on my PostgreSQL database (specifically on a table called orders) for QA purposes. In short, I want to know whether certain locks on a table prevent or indefinitely block database migrations that add columns (I think ALTER TABLE grabs an ACCESS EXCLUSIVE lock).
My plan is to:
grab a table lock or a row lock on the orders table
run the migration to add a column (an ALTER TABLE statement that grabs an ACCESS EXCLUSIVE LOCK)
issue a read statement to see if (2) is blocked (the ACCESS EXCLUSIVE LOCK blocks reads, and so this would be a problem that I'm trying to QA).
How would one do this? How do I grab a row lock on a table called orders via the Rails Console? How else could I do this?
Does my plan make sense?
UPDATE
It turns out that open transactions holding row-level locks do block ALTER TABLE statements that need an ACCESS EXCLUSIVE lock, such as migrations that add columns. For example, when I run this code in one process:
Order.first.with_lock do  # opens a transaction and takes a row lock (SELECT ... FOR UPDATE)
  binding.pry             # pause here, keeping the transaction and its row lock open
end
It blocks my migration in another process that adds a column to the orders table. That migration's ACCESS EXCLUSIVE lock request in turn blocks all subsequent reads of the orders table, causing problems for end users.
Why is this?
Let's say you're in a transaction, selecting rows from a table with various WHERE clauses. Halfway through, some other transaction adds a column to that table. Now you are getting back more fields than you did before. How is your application supposed to handle this? That is why ALTER TABLE ... ADD COLUMN takes an ACCESS EXCLUSIVE lock and has to wait for your open transaction to finish.
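For the QA setup in the question, the whole blocking chain can be reproduced with plain SQL in separate psql sessions (a sketch, using the orders table from the question and a hypothetical column name):

-- Session 1: hold a row lock indefinitely
BEGIN;
SELECT * FROM orders LIMIT 1 FOR UPDATE;
-- leave this transaction open

-- Session 2: the ALTER now waits for its ACCESS EXCLUSIVE lock behind session 1
ALTER TABLE orders ADD COLUMN qa_test_col text;

-- Session 3: even a plain read queues behind session 2's pending lock request
SELECT count(*) FROM orders;

-- In a fourth session, inspect who is waiting on whom
SELECT pid, state, wait_event_type, left(query, 60) AS query
FROM   pg_stat_activity
WHERE  datname = current_database();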
I think I read somewhere that running ALTER TABLE foo ADD COLUMN baz text on a Postgres database will not cause a read or write lock. Setting a default value causes locking, but allowing a null default prevents a lock.
I can't find this in the documentation, though. Can anyone point to a place that says, definitively, if this is true or not?
The different kinds of locks and when they're used are described in the documentation under Table-level Locks. For instance, Postgres 11's ALTER TABLE may acquire a SHARE UPDATE EXCLUSIVE, SHARE ROW EXCLUSIVE, or ACCESS EXCLUSIVE lock.
Postgres 9.1 through 9.3 claimed to support two of the above three but actually forced ACCESS EXCLUSIVE for all variants of this command. This limitation was lifted in Postgres 9.4, but ADD COLUMN remains at ACCESS EXCLUSIVE by design.
It's easy to check in the source code because there's a function dedicated to establishing the lock level needed for this command in various cases: AlterTableGetLockLevel in src/backend/commands/tablecmds.c.
Concerning how much time the lock is held, once acquired:
When the column's default value is NULL, adding the column should be very quick because it doesn't require a table rewrite: it's only an update in the catalog.
When the column has a non-NULL default value, it depends on the PostgreSQL version: with version 11 or newer there is no immediate rewriting of all the rows, so it should be as fast as the NULL case; with version 10 or older the table is entirely rewritten, which may be quite expensive depending on the table's size.
Adding a new nullable column locks the table only for a very short time, since there is no need to rewrite all the data on disk. Adding a column with a default value, on the other hand, requires PostgreSQL (before version 11) to write new versions of all rows to disk, and the table stays locked during that rewrite.
So when you need to add a column with a default value to a big table, it's recommended to add it as a nullable column first and then fill in the values in small batches, as sketched below. This way you avoid a heavy load on the disk and let autovacuum do its job, so you don't end up doubling the table size.
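A minimal sketch of that approach (the table and column names are hypothetical, and it assumes an indexed id column to batch on):

ALTER TABLE big_table ADD COLUMN new_flag boolean;  -- nullable, catalog-only, fast

-- backfill in small batches; repeat until no rows are updated
UPDATE big_table
SET    new_flag = false
WHERE  id IN (
    SELECT id FROM big_table
    WHERE  new_flag IS NULL
    LIMIT  10000
);

-- once the backfill is complete, optionally add the default and the constraint
ALTER TABLE big_table ALTER COLUMN new_flag SET DEFAULT false;
ALTER TABLE big_table ALTER COLUMN new_flag SET NOT NULL;  -- scans the table to validate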
http://www.postgresql.org/docs/current/static/sql-altertable.html#AEN57290
"Adding a column with a non-null default or changing the type of an existing column will require the entire table and indexes to be rewritten."
So the documentation only spells out the cases in which the table is rewritten; in the remaining cases no rewrite is needed.
There will always be a lock, but it will be held only very briefly if the table does not have to be rewritten.
PostgreSQL stores statistics about tables in the system catalog pg_class. The query planner accesses this table for every query. These statistics can only be updated with the ANALYZE command. If ANALYZE is not run often, the statistics in this table may be inaccurate and the query planner may make poor decisions, which can degrade system performance. An alternative strategy would be for the query planner to generate these statistics for each query (including SELECTs, INSERTs, UPDATEs, and DELETEs). This approach would give the query planner the most up-to-date statistics possible.
Why does Postgres always rely on pg_class instead?
pg_class doesn't contain all the statistics needed by the planner; it only contains information about the structure of the table. The statistics generated by the ANALYZE command describe the values present in each column, so when executing a command like:
SELECT * FROM tab WHERE cname = 'pg';
the planner knows how many rows are in the table and how many of them have the value 'pg' in the column cname. This information does not exist in pg_class.
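Those per-column statistics can be inspected through the pg_stats view (a sketch, reusing the hypothetical tab and cname names from the example above):

SELECT n_distinct, null_frac, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'tab' AND attname = 'cname';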
Another nice feature of PostgreSQL is autovacuum. In 99.9999% of cases it should be enabled, so that the database refreshes statistics as soon as a certain (configurable) number of rows has changed. That minimizes the chance of a bad execution plan caused by stale table statistics.