PostgreSQL - When do indices get built and when to use CONCURRENTLY? - postgresql

I'm fairly inexperienced with SQL (here, PostgreSQL) and I'm trying to understand and use indices correctly.
PostgreSQL has a CONCURRENTLY option for CREATE INDEX and the documentation says:
"When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially use the index to terminate. Thus this method requires more total work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment."
Does this mean that an INDEX is only created at startup or during a migration process?
I know that one can re-index tables if they get fragmented over time (I'm not sure how this actually happens and why an index isn't just kept "up-to-date") and that re-indexing helps the database become more efficient again.
Can I benefit from CONCURRENTLY during such a re-index process?
And besides that, I'm asking myself:
Are there situations where I should avoid CONCURRENTLY, or would it hurt to use CONCURRENTLY on every INDEX I create?

If it were sensible to always create index ... concurrently, it'd be the default.
What it does is build the index with weaker locks held on the table being indexed, so you can continue to insert, update, delete, etc.
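For example, a minimal sketch (the table, column, and index names are made up):

-- Plain build: takes a lock that blocks writes to the table until it finishes.
CREATE INDEX orders_customer_idx ON orders (customer_id);

-- Concurrent build: writes continue, at the cost of the caveats below.
CREATE INDEX CONCURRENTLY orders_customer_idx ON orders (customer_id);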
This comes at a price:
You can't use create index ... concurrently in a transaction, unlike almost all other DDL
The index build can take longer
The resulting index may be less efficiently laid out (slower, bigger)
Rarely, the create index can fail, leaving an invalid index, so you have to drop and recreate it
You can't easily use this to re-create an existing index. PostgreSQL doesn't yet support reindex ... concurrently. There are workarounds where you create a new index, then swap the old and new indexes (sketched below), but it's very difficult if you're trying to do it for a unique index or primary key that's the target of a foreign key constraint.
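A rough sketch of that swap workaround for an ordinary (non-unique, non-PK) index; all names here are made up:

CREATE INDEX CONCURRENTLY orders_customer_idx_new ON orders (customer_id);
DROP INDEX CONCURRENTLY orders_customer_idx;   -- remove the old, bloated index
ALTER INDEX orders_customer_idx_new RENAME TO orders_customer_idx;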
Unless you know you need it, just use create index without concurrently.

Related

TSQL Lock behaviour on Index Creation/Partition Switch

I currently have the use case of inserting lots of data (3.5 million rows, somewhere around 200 GB) into multiple staging tables and then switching them into the destination tables. Now, because of the amount of data, we discovered that it would be faster to insert the data into an empty heap table, then create the columnstore index so the structure is identical to the destination table, and then switch - all within one transaction.
All the tables are in the same database, but they do not depend on each other, so the best case would be to fill table A-Stage and table B-Stage at the same time, create the corresponding indexes on them at the same time, and then switch them at the same time.
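Roughly, per staging table, the flow looks like this (a hedged sketch; all object names and the partition number are illustrative assumptions):

BEGIN TRANSACTION;
-- 1. Bulk-insert into the empty staging heap.
INSERT INTO dbo.StageA WITH (TABLOCK) SELECT * FROM dbo.SourceA;
-- 2. Build the clustered columnstore index so the structure matches the destination (takes a SCH-M lock on StageA).
CREATE CLUSTERED COLUMNSTORE INDEX cci_StageA ON dbo.StageA;
-- 3. Switch into the destination partition (metadata-only, but needs SCH-M locks on both tables).
ALTER TABLE dbo.StageA SWITCH TO dbo.DestA PARTITION 1;
COMMIT TRANSACTION;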
Obviously, with creating indexes and switching partitions, there are plenty of locks involved. Now I was curious whether those locks can cause a deadlock at any point, especially when it comes to sys tables.
All tables involved will get a SCH-M lock, and certain sys tables will also get locked, but from what I can see, they get locked at the PAGE/KEY/EXTENT level.
Now I guess my question is:
Are the sys tables and other structures I might be missing stored in a way that lets me alter indexes/partitions without running into locks, as long as those are different tables/objects that do not depend on each other (no foreign keys, for example)? Or will I eventually run into a scenario where table B has to wait for table A to finish before even starting, or worse, a deadlock?
Thanks in advance!
I tried creating a clustered columnstore index/switching partitions and saw that certain sys objects couldn't be accessed, and wondered whether this will cause locks in the future or if the locks for different objects will always work out.

Postgres Create Index command hangs

This is similar to a recent problem I posted where a COPY command was hanging for a large data set. In that instance, it was due to a foreign key constraint. But in this case I'm creating an index, so I would think an FK wouldn't be an issue, even though I still disabled triggers on the table just in case. I'm trying to add a regular btree index on a table with 10 billion rows. The index is on two int fields. I tried running it and it was taking forever, so I thought it might just be too slow; I increased max_parallel_maintenance_workers to 8 and maintenance_work_mem to 2047MB (I'm on Windows, so that's the max).
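For reference, roughly what I ran (a sketch; the table and column names here are made up):

SET maintenance_work_mem = '2047MB';
SET max_parallel_maintenance_workers = 8;
CREATE INDEX big_table_a_b_idx ON big_table (a, b);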
At that point, things seemed to go faster, but the same problem happened: I can see the files growing in the pgsql_tmp/pgsql_tmpxxxx.x.sharedfileset folder, until they just stop, but the index creation never seems to finish.
I wondered if I'd set too many workers for whatever reason, so I tried setting it to 4; same problem. Files were last modified around 3:20 AM; it's 7:35 AM and it's still running. The files in the folder total 261 GB, which looks about right compared to the table size, and every time I run the process it stalls at that size, so I assume it's done with creating the index; I just have no clue what it might be doing at this point. In case it matters, the table has a foreign key on another table that has 1 billion records, but the triggers are disabled on the table, which has worked for me when loading data into the table. I checked for locks; there are none, and it's not waiting on any lock, which makes sense because this is a test database with dummy data that I created to test some things, so nobody else even knows it exists or has any use for it.
Creating an index runs in several stages. The table has to be read, the values have to be sorted, and the index has to be created on disk.
In certain stages you will see temporary files growing, in others not, even though CREATE INDEX is still working. Perhaps it is writing the index file at the moment.
So be patient, it will finish.
If you are nervous, look into pg_locks to see if the CREATE INDEX is blocked by something. That may be the case if it is a CREATE INDEX CONCURRENTLY, which has to do more complicated processing.
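A quick sketch of what to look at, using the standard monitoring views:

SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE query ILIKE 'CREATE INDEX%';   -- is the backend running or waiting?

SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted;                   -- any ungranted lock requests at all?

On PostgreSQL 12 or later, pg_stat_progress_create_index reports the build phase directly:

SELECT phase, blocks_done, blocks_total FROM pg_stat_progress_create_index;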

Postgres include REINDEX in UPDATE statement

I have a database with a table which is incrementally patched and has many indexes. But sometimes the patching does not happen and the next patch becomes very large, which in practice makes it smarter to delete the indexes, patch the table, and recreate the indexes. But this seems horrible, and with users using the table it is not an option in practice. So I thought there was a possibility to RESET the index during the update statement, or even better, to have Postgres itself check whether that is optimal. (I'm using Postgres 10; this might be a problem that is solved by upgrading.)
I hope you can help me.
No, there is no good solution, nor any on the horizon for future versions.
Either you keep the indexes and must maintain them during the "patch"; or you drop them in the same transaction as the "patch" and rebuild them later, in which case the table is locked against all other uses; or you drop them in a separate transaction and rebuild them later, in which case other sessions can see the table in an unindexed state.
There are in principle ways this could be improved (for example, ending the "patch" the same way create-index-concurrently ends, with a merge join between the index and the table. But since CIC must be in its own transaction, it is not clear how these could be shoehorned together), but I am not aware of any promising work going on currently.
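A minimal sketch of the middle option (drop, patch, and rebuild in one transaction); the table, index, and patched_value function are all made up:

BEGIN;
DROP INDEX idx_t_col;                    -- takes an ACCESS EXCLUSIVE lock on t
UPDATE t SET col = patched_value(col);   -- the large "patch"
CREATE INDEX idx_t_col ON t (col);       -- rebuilt once, with no per-row maintenance
COMMIT;                                  -- t stays locked against other uses until here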

Postgres building an index concurrently

I'm reading this from the Postgres docs:
Building Indexes Concurrently
...
PostgreSQL supports building indexes without locking out writes. This method is invoked by specifying the CONCURRENTLY option of CREATE INDEX. When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially modify or use the index to terminate. Thus this method requires more total work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment....
In a concurrent index build, the index is actually entered into the system catalogs in one transaction, then two table scans occur in two more transactions. Before each table scan, the index build must wait for existing transactions that have modified the table to terminate. After the second scan, the index build must wait for any transactions that have a snapshot (see Chapter 13) predating the second scan to terminate. Then finally the index can be marked ready for use, and the CREATE INDEX command terminates. Even then, however, the index may not be immediately usable for queries: in the worst case, it cannot be used as long as transactions exist that predate the start of the index build.
What is a system catalog? What is a table scan? So it sounds like the index is built first, then it must wait for existing transactions (ones that occurred during the index build?) to terminate, and then wait for any transactions that have a snapshot predating the second scan to terminate. (What does this mean? How does it differ from the first statement?) What are these scans?
What is a system catalog? - https://www.postgresql.org/docs/current/static/catalogs.html
What is a table scan? - It reads the table to get the values of the column(s) you are building the index on.
ones that occurred during the index build? - No, ones that could still change the data after the first table scan.
what does this mean? - It means the index build waits for those transactions to end before the index is marked valid.
What are these scans? - The first scan reads the table and builds the initial index while still allowing changes to the table (that's how the long lock is avoided). When that build is done, a second scan picks up roughly the difference (the rows that changed in the meantime), applies a short lock, and marks the index as usable. It differs from plain CREATE INDEX in that the latter locks the table and permits no data changes, while CONCURRENTLY scans twice but allows data changes while the index is being built.
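You can watch this from another session; a sketch using the system catalogs (the index name is made up):

SELECT i.indisready, i.indisvalid
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.relname = 'my_concurrent_idx';   -- indisvalid = false means queries can't use it yet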

What is the overhead of ensureIndex({field:1}) when an index already exists?

I'd like to always ensure that my collections are indexed, and I'm adding and dropping them on a semi-regular basis.
Assuming that I make a new connection to the DB with every web request, would it be okay to execute a few db.collection.ensureIndex({field:true}) statements every time I connect?
As I understand it, MongoDB will simply query the system collection to see if the index exists before it creates it ...
http://www.mongodb.org/display/DOCS/Indexes#Indexes-AdditionalNotesonIndexes
> db.system.indexes.find();
You can run getIndexes() to see a Collection's indexes
> db.things.getIndexes();
So really, you'd just be adding one query; it would not rebuild it or do anything else non-obvious.
That said, I don't think this would be a particularly good idea. It would add unneeded overhead and, worse, might lock your database as the index is created ... since by default creating an index blocks your database (unless you run it in the background), like so:
> db.things.ensureIndex({x:1}, {background:true});
However, note ...
"... background mode building uses an incremental approach to building the index which is slower than the default foreground mode: time to build the index will be greater."
I think it would be much better to do this in code when you add the collections instead of every time you connect to the database. Why are you adding and dropping them anyhow?