PostgreSQL concurrent index creation - invalid index

We have a requirement to contain index size without locking tables. I tried CREATE INDEX CONCURRENTLY, but on one of the systems it resulted in an INVALID index being created. We tried:
- drop index
- drop index concurrently
- reindex table
However, these commands were also getting stuck intermittently. This makes the whole approach of creating indexes concurrently via script vulnerable.
Any ideas how this can be made foolproof without manual intervention? If not, what are other effective ways to contain index sizes on PostgreSQL in an automated fashion on large and busy tables?

To benefit from building indexes concurrently in a script, you need to add checking logic between the commands (since you can't wrap CREATE INDEX CONCURRENTLY in a transaction). Simply check whether the new index is INVALID in the next step, and if it is, abort the script.
Also, I would reverse the order of operations: first build the new index concurrently; if that succeeds, drop the old one, as sketched below.
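A minimal sketch of that check; the table and index names here are placeholders:

-- Build the new index without blocking writes (cannot run inside a transaction).
CREATE INDEX CONCURRENTLY idx_mytable_col_new ON mytable (col);

-- Verify the build succeeded; an interrupted or failed build leaves
-- an INVALID index behind.
SELECT i.indisvalid
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname = 'idx_mytable_col_new';

-- If indisvalid is false: drop the broken index and abort the script.
--   DROP INDEX CONCURRENTLY idx_mytable_col_new;
-- If indisvalid is true: drop the old index.
--   DROP INDEX CONCURRENTLY idx_mytable_col_old;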
https://www.postgresql.org/docs/current/static/sql-createindex.html:
If a problem arises while scanning the table, such as a deadlock or a uniqueness violation in a unique index, the CREATE INDEX command will fail but leave behind an "invalid" index. This index will be ignored for querying purposes because it might be incomplete; however it will still consume update overhead.
and
Another difference is that a regular CREATE INDEX command can be performed within a transaction block, but CREATE INDEX CONCURRENTLY cannot.

Related

CREATE INDEX CONCURRENTLY is executed but the CONCURRENTLY option is lost after creation?

In my Postgres 14.3 database on AWS RDS, I want to create an index without blocking other database operations, so I used the CONCURRENTLY option and executed the following statement successfully.
CREATE INDEX CONCURRENTLY idx_test
ON public.ap_identifier USING btree (cluster_id);
But when checking the database with:
SELECT * FROM pg_indexes WHERE indexname = 'idx_test';
I only see the index definition without the CONCURRENTLY option.
I expected the index to be created with the CONCURRENTLY option.
Is there any database switch to turn this feature on, or why does it seem to ignore CONCURRENTLY?
As has been commented, CONCURRENTLY is not a property of the index, but an instruction to create the index without blocking concurrent writes. The resulting index does not remember that option in any way. Read the chapter "Building Indexes Concurrently" in the manual.
Creating indexes on big tables can take a while. The system view pg_stat_progress_create_index can be queried for progress reporting. While the build is running, CONCURRENTLY is still reflected in its command column.
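For example, from a second session while the build is in progress:

-- Monitor an in-progress build; command shows CREATE INDEX CONCURRENTLY.
SELECT pid, command, phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;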
As consolation: once created, all indexes are "concurrent" anyway, in the sense that they are maintained automatically without blocking concurrent reads or writes (except for UNIQUE indexes that prevent duplicates).

To what degree does PostgreSQL support parallel DDL?

Looking here, it is clear that Oracle supports execution of DDL commands in parallel with scenarios clearly listed. I was wondering whether Postgres does indeed offer such functionality? I can find a lot of material on "parallel queries" for PostgreSQL but not so much when DDL is involved.
For example, can I execute multiple 'CREATE TABLE...AS SELECT' in parallel? And if not, how can I achieve such functionality? What happens if I have a temporary table (CREATE TEMP TABLE)? Do I need to configure something for locks?
From here:
Even when it is in general possible for parallel query plans to be generated, the planner will not generate them for a given query if any of the following are true:
The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within a CTE, no parallel plans for that query will be generated.
(emphasis mine).
This seems to suggest that Postgres will not "parallelize" any query that modifies the database structure, under any circumstances.
Running multiple queries simultaneously in Postgres requires one connection per running query.
That covers generic DDL statements; there are, however, index operations and partition operations that can be parallelized.
If you check the Notes section of the CREATE INDEX statement, you'll see that parallel index building is supported :
PostgreSQL can build indexes while leveraging multiple CPUs in order to process the table rows faster. This feature is known as parallel index build. For index methods that support building indexes in parallel (currently, only B-tree), maintenance_work_mem specifies the maximum amount of memory that can be used by each index build operation as a whole, regardless of how many worker processes were started. Generally, a cost model automatically determines how many worker processes should be requested, if any.
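A sketch of how to influence a parallel B-tree build; the table name and setting values below are illustrative, and the cost model may still choose fewer workers:

-- Allow up to 4 parallel workers for maintenance commands.
SET max_parallel_maintenance_workers = 4;
SET maintenance_work_mem = '1GB';
CREATE INDEX idx_big_table_col ON big_table (col);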
Update
I suspect the real question is about CREATE TABLE ... AS though.
This is essentially a CREATE TABLE followed by an INSERT .. SELECT. The CREATE TABLE part can't be parallelized and doesn't have to - it's essentially a metadata operation. The SELECT on the other hand, could be parallelized easily. INSERT is a bit harder, but it's a matter of implementation.
As a_horse_with_no_name explains in a comment to this question, parallelization for CREATE TABLE ... AS was added in PostgreSQL 11:
Improvements to parallelism, including:
- CREATE INDEX can now use parallel processing while building a B-tree index
- Parallelization is now possible in CREATE TABLE ... AS, CREATE MATERIALIZED VIEW, and certain queries using UNION
- Parallelized hash joins and parallelized sequential scans now perform better
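A sketch on PostgreSQL 11+; big_source is a placeholder table, and the SELECT part may get a parallel plan while the table creation itself stays serial:

CREATE TABLE summary AS
SELECT category, count(*) AS n
FROM big_source
GROUP BY category;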

What is the difference between postgresql rebuild index and recreate index, which one is better?

I have a table whose index size is too big (about 2 GB). When I restore the database to a VM, the size is only 200 MB, so I need to rebuild/recreate the index, and I will probably do this online.
What is the difference between re-building (reindex) and re-creating the index, and which one is better when I do it online? Particularly, which option allows querying the DB during the operation?
A plain REINDEX locks out writes (but not reads) of the parent table until the command has completed, and it takes an exclusive lock on the index being processed, which blocks reads that attempt to use that index. If you can afford that kind of maintenance window, it's perfectly fine.
The alternative for an online rebuild is to create a new index using CREATE INDEX CONCURRENTLY and then drop the old one, as sketched below. This takes longer to complete, but allows access to the table while the index is being rebuilt.
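A sketch of that pattern on Postgres versions before 12; the index and table names are placeholders:

CREATE INDEX CONCURRENTLY idx_orders_customer_new ON orders (customer_id);
DROP INDEX CONCURRENTLY idx_orders_customer;
ALTER INDEX idx_orders_customer_new RENAME TO idx_orders_customer;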
Postgres 12 added a REINDEX INDEX CONCURRENTLY command, which does exactly what you want here: https://paquier.xyz/postgresql-2/postgres-12-reindex-concurrently/ and https://www.depesz.com/2019/03/29/waiting-for-postgresql-12-reindex-concurrently/
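On Postgres 12+ the whole operation reduces to a single statement (the index name is a placeholder):

REINDEX INDEX CONCURRENTLY idx_orders_customer;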

PostgreSQL: OK to allow errors?

Before I try to insert a row into a PostgreSQL table, should I query whether the insert would violate a constraint?
I do check when the insert would cause unwanted side-effects (e.g., auto-increment) upon an error.
But, if there are no possible side effects, is it OK to just blindly try to insert into a table? Or, is it better practice to prevent errors by anticipating them when possible (as advised in Objective-C)?
Also, when performing the insert inside an SQL function, will other queries (e.g., CTEs) inside the function get rolled back if the insert fails?
In general, testing beforehand is not a good idea, because it requires you to explicitly lock tables to prevent other clients from changing or inserting data between your test and your insert. Explicit locking is bad for concurrency.
Serials getting auto-incremented by failed inserts are in general not a problem. Just don't assume the values inserted into the database are consecutive.
A database and Objective-C are two completely different things. Let the database check for problems; it is much easier to add the appropriate constraints to your schema than it is to check everything in your client program.
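A sketch of letting the database enforce the rule; the table and constraint are illustrative, and ON CONFLICT (PostgreSQL 9.5+) avoids raising an error at all:

CREATE TABLE users (
    id    serial PRIMARY KEY,
    email text UNIQUE NOT NULL
);

-- No pre-check needed; the UNIQUE constraint handles duplicates.
INSERT INTO users (email)
VALUES ('a@example.com')
ON CONFLICT (email) DO NOTHING;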
The default is to roll back to the start of the transaction, but you can control that with savepoints and ROLLBACK TO SAVEPOINT. A CTE, however, is part of the query, and queries are always rolled back completely when part of them fails. You might be able to work around that by splitting the CTE off into a separate query that creates a temp table; then you can use the temp table instead of the CTE.
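A sketch of savepoint-based error handling, assuming the users table above:

BEGIN;
INSERT INTO users (email) VALUES ('a@example.com');
SAVEPOINT before_risky_insert;
-- If this fails (duplicate key), only work after the savepoint is lost:
INSERT INTO users (email) VALUES ('a@example.com');
ROLLBACK TO SAVEPOINT before_risky_insert;
COMMIT;  -- the first insert is preserved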

Set index statistics after insert in Firebird database

I would like to automate the process of setting index statistics in a Firebird database so that it doesn't require a database administrator to run the command, or a user to click a button.
Since the statistics only need to be recalculated after a large number of inserts or deletes, I am considering using an After Insert and After Delete trigger to keep track of how many inserts or deletes have taken place, and then run a procedure to set index statistics based on that value.
My question is whether there is anything to watch out for when setting the index statistics in this manner on a live database. To be clear, I am not rebuilding indexes, but recalculating index statistics only. It is quite possible that this would occur during a mass import or delete operation. Would calculating index statistics during a mass import or delete have the potential to cause any problems?
It is safe to recalculate index statistics on a live database while it is in use. It is also safe to do that in PSQL, e.g. in a stored procedure. For example, I run a scheduled batch job at night which executes a stored procedure that recalculates statistics for all indexes.
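A sketch of such a procedure; the procedure name is made up, and RDB$INDICES is Firebird's system table of index metadata:

SET TERM ^ ;
CREATE PROCEDURE RECALC_INDEX_STATS
AS
DECLARE VARIABLE idx_name VARCHAR(63);
BEGIN
  -- Skip system indexes; recalculate statistics for each user index.
  FOR SELECT TRIM(RDB$INDEX_NAME) FROM RDB$INDICES
      WHERE COALESCE(RDB$SYSTEM_FLAG, 0) = 0
      INTO :idx_name
  DO
    EXECUTE STATEMENT 'SET STATISTICS INDEX ' || :idx_name;
END^
SET TERM ; ^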
I'm not sure it is wise to do that in a trigger, because triggers in Firebird fire per row, not per statement, so you have to make sure to run it in some kind of conditional branch in your PSQL body.