PostgreSQL Query Planner/Optimizer: Is there a way to get candidate plans?

In PostgreSQL, we can use EXPLAIN ANALYZE to get the query plan of a given SQL query. While this is useful, is there any way to get information on the other candidate plans that the optimizer generated (and subsequently discarded)?
This is so that we can do our own analysis of some of the candidates (e.g. the top 3) generated by the DBMS.

No. The planner discards incipient plans as early as it can, before they are even completely formed. Once it decides a plan can't be the best, it never finishes constructing it, so it can't display it.
You can usually use the various enable_* settings or the *_cost settings to force it to make a different choice and show the plan for that, but it can be hard to control exactly what that different choice is.
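For example, a minimal sketch of steering the planner with a session-local setting (the table, column, and query here are hypothetical):
SET enable_seqscan = off;  -- only discourages sequential scans; the planner can still use one if there is no alternative
EXPLAIN SELECT * FROM big_table WHERE some_column = 42;
RESET enable_seqscan;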
You can also temporarily drop an index to see what the planner would do without that index. If you DROP an index inside a transaction, then run the EXPLAIN, then ROLLBACK the transaction, the rollback undoes the DROP INDEX so the index doesn't need to be rebuilt; it is simply revived. But be warned that DROP INDEX takes a strong lock on the table and holds it until the ROLLBACK, so this method is not completely free of consequences.
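A sketch of that trick, again with hypothetical names:
BEGIN;
DROP INDEX big_table_some_column_idx;  -- holds an ACCESS EXCLUSIVE lock on the table until ROLLBACK
EXPLAIN SELECT * FROM big_table WHERE some_column = 42;
ROLLBACK;  -- revives the index; no rebuild is needed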
If you just want to see what the other plan is, you just need EXPLAIN, not EXPLAIN ANALYZE. This is faster and, if the statement has side effects, also safer.

Related

Postgres: include REINDEX in UPDATE statement

I have a database with a table that is incrementally patched and has many indexes. But sometimes the patching does not happen and the next patch becomes very large, which in practice makes it smarter to drop the indexes, patch the table, and recreate the indexes. But this seems horrible, and with users actively using the table it is not an option. So I thought there might be a possibility to REINDEX during the UPDATE statement, or even better, have Postgres itself check whether that is optimal. (I'm using Postgres 10; this might be a problem that is solved by upgrading.)
I hope you can help me.
No, there is no good solution, nor any on the horizon for future versions.
Either you keep the indexes and must maintain them during the "patch"; or you drop them in the same transaction as the "patch" and rebuild them later in which case the table is locked against all other uses; or you drop them in a separate transaction and rebuild them later in which case other sessions can see the table in an unindexed state.
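For illustration, a minimal sketch of the second option, with hypothetical names; remember the table stays locked against other sessions from the DROP INDEX until the COMMIT:
BEGIN;
DROP INDEX big_table_some_column_idx;
-- apply the large "patch" here (the bulk INSERT/UPDATE statements)
CREATE INDEX big_table_some_column_idx ON big_table (some_column);
COMMIT;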
There are in principle ways this could be improved (for example, ending the "patch" the same way CREATE INDEX CONCURRENTLY ends, with a merge join between the index and the table; but since CIC must run in its own transaction, it is not clear how these could be shoehorned together). However, I am not aware of any promising work going on currently.

Parallel queries on CTE for writing operations in PostgreSQL

From PostgreSQL 9.6 Release Notes:
Only strictly read-only queries where the driving table is accessed via a sequential scan can be parallelized.
My question is: if a CTE (WITH clause) contains only read operations, but its result is used to feed a writing operation, like an INSERT or UPDATE, is it also disallowed to parallelize its sequential scans?
I mean, since a CTE is much like a temporary table that only exists for the currently executing query, can I assume that its inner query can take advantage of the brand-new parallel seq scan of PostgreSQL 9.6? Or is it treated like an ordinary subquery and therefore cannot use a parallel scan?
For example, consider this query:
WITH foobarbaz AS (
    SELECT foo
    FROM bar
    WHERE some_expensive_function(baz)
)
DELETE FROM bar
USING foobarbaz
WHERE bar.foo = foobarbaz.foo;
Is that foobarbaz calculation expected to be parallelizable, or is that disallowed because of the DELETE statement?
If it isn't allowed, I thought I could replace the CTE with a CREATE TEMPORARY TABLE statement. But I think I would run into the same issue, as CREATE TABLE is a write operation. Am I wrong?
Lastly, a last resort I could try is to perform it as a pure read operation and use its result as input for INSERT and/or UPDATE operations. Outside of a transaction it should work. But the question is: if the read operation and the INSERT/UPDATE are between BEGIN and COMMIT statements, will it not be allowed anyway? I understand they are two completely different operations, but they are in the same transaction and the same Postgres session.
To be clear, my concern is that I have an awful mass of hard-to-read and hard-to-redesign SQL queries that involve multiple sequential scans with low-performance function calls and that perform complex changes over two tables. The whole process runs in a single transaction because, otherwise, the mess in case of failure would be totally unrecoverable.
My hope is to be able to parallelize some sequential scans to take advantage of the machine's 8 CPU cores and complete the process sooner.
Please don't answer that I need to fully redesign that mess: I know, and I'm working on it. But it is a big project and we need to keep working in the meantime.
Anyway, any suggestion will be appreciated.
EDIT:
Here is a brief report of what I have been able to discover so far:
As @a_horse_with_no_name says in his comment (thanks), the CTE and the rest of the query form a single DML statement, and if it contains a write operation, even outside of the CTE, then the CTE cannot be parallelized (I tested this as well).
I also found this wiki page with more concise information about parallel scans than what I found in the release notes link.
An interesting point I could verify thanks to that wiki page is that I need to declare the involved functions as parallel safe. I did so and it worked (in a test without writes).
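For reference, marking a function parallel safe looks like this (the argument type is a guess, since the real signature isn't shown here):
ALTER FUNCTION some_expensive_function(integer) PARALLEL SAFE;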
Another interesting point is what @a_horse_with_no_name says in his second comment: using dblink to perform a pure read-only query. But, investigating that a bit, I saw that postgres_fdw, which the wiki explicitly mentions as not supporting parallel scans, provides roughly the same functionality using a more modern and standards-compliant infrastructure.
And, on the other hand, even if it worked, I would end up getting data from outside the transaction, which in some cases would be acceptable for me but, I think, is not a good idea as a general solution.
Finally, I checked that it is possible to perform a parallel scan in a read-only query inside a transaction, even if the transaction later performs write operations (no exception is triggered and I could commit).
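Roughly, the pattern that worked (the column values are hypothetical):
BEGIN;
SELECT foo FROM bar WHERE some_expensive_function(baz);  -- this read-only statement can use a parallel seq scan
UPDATE bar SET foo = foo + 1 WHERE foo = 1;              -- a later write in the same transaction is fine
COMMIT;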
...in summary, I think that my best bet (if not the only one) would be to refactor the script so that it reads the data into memory first and performs the write operations later in the same transaction.
That will increase I/O overhead but, given the latencies I am dealing with, it will still be an improvement.

PostgreSQL - When do indices get built and when to use CONCURRENTLY?

I'm fairly inexperienced with SQL (or here PostgreSQL) and I'm trying to understand and use indices correctly.
PostgreSQL has a CONCURRENTLY option for CREATE INDEX and the documentation says:
"When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially use the index to terminate. Thus this method requires more total work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment."
Does this mean that an INDEX is only created at startup or during a migration process?
I know that one can re-index tables if they get fragmented over time (not sure how this actually happens and why an index is just not kept "up-to-date") and that re-indexing helps the database to get more efficient again.
Can I benefit from CONCURRENTLY during such a re-index process?
and besides that I'm asking myself
Are there situation where I should avoid CONCURRENTLY or would it hurt to use CONCURRENTLY just on every INDEX I create?
If it were sensible to always use create index ... concurrently, it would be the default.
What it does is build the index while holding weaker locks on the table being indexed, so you can continue to insert, update, delete, and so on.
This comes at a price:
You can't use create index ... concurrently in a transaction, unlike almost all other DDL
The index build can take longer
The index built may be less efficiently laid out (slower, bigger)
Rarely, the create index can fail, leaving an invalid index that you have to drop and recreate
You can't easily use this to re-create an existing index. PostgreSQL doesn't yet support reindex ... concurrently. There are workarounds where you create a new index, then swap the old and new indexes, but it's very difficult if you're trying to do it for a unique index or a primary key that's the target of a foreign key constraint.
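A minimal sketch of that swap workaround for a plain, non-unique index (names are hypothetical):
CREATE INDEX CONCURRENTLY my_index_new ON my_table (my_column);
DROP INDEX my_index;  -- takes a brief ACCESS EXCLUSIVE lock on my_table
ALTER INDEX my_index_new RENAME TO my_index;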
Unless you know you need it, just use create index without concurrently.

What is a visibility check in an index scan?

I am reading up on query optimization in Postgres.
I don't understand this statement:
Index scans involve random disk access and still have to read the underlying data blocks for visibility checks.
what does "visibility check" mean here?
PostgreSQL uses a technique called Multi-Version Concurrency Control (MVCC) for managing concurrent access to data. Data is not visible until the transaction that inserted it commits. In most other cases the data is silently ignored by transactions that shouldn't see it (exceptions involve explicit locks and higher isolation levels).
What this means is that PostgreSQL must check the transaction IDs on the actual rows to make sure they are visible to the current transaction. Since 9.2 (iirc), PostgreSQL can skip this check when all tuples in a page are known to be visible; otherwise it has to check row by row.
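You can observe this with EXPLAIN ANALYZE: after a VACUUM has marked pages all-visible, an index-only scan can skip the heap (table and column here are hypothetical):
VACUUM my_table;  -- among other things, updates the visibility map
EXPLAIN ANALYZE SELECT id FROM my_table WHERE id < 100;
-- an "Index Only Scan" node reporting "Heap Fetches: 0" means no per-row visibility checks touched the table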

Synchronizing two tables, best practice

I need to synchronize two tables across databases whenever either one changes, on any update, delete, or insert. The tables are NOT identical.
So far the easiest and best solution I have been able to find is adding SQL triggers,
which I have slowly started doing, and it seems to be working fine. But before I finish, I want to be sure that this is a good idea and, in general, good practice.
If not, what is a better option for this scenario?
Thank you in advance
Regards
Daniel.
Triggers will work, but there are quite a few different options available to consider.
Are all data modifications to these tables done through stored procedures? If so, consider putting the logic in the stored procedures instead of in a trigger.
Do the updates have to be real-time? If not, consider a job that regularly synchronizes the tables instead of a trigger. This probably gets tricky with deletes, though. Not impossible, just tricky.
We had one situation where the tables were very similar, but had slightly different column names or orders. In that case, we created a view to the original table that let the application use the view instead of the second copy of the table. We were also able to use a Synonym one time to point to the original table, but that requires that the table structures be the same.
Generally speaking, a lot of people try to avoid unnecessary triggers as they're just too easy to miss when doing other work in the database. That doesn't make them bad, but can lead to interesting times when trying to troubleshoot problems.
In your scenario, I'd probably briefly explore other options before continuing with the triggers. Just watch out for cascading trigger effects, where your one update causes the second table to update, which passes the update back to the first table, then the second, and so on. You can guard against this a little by checking nesting levels; otherwise you run the risk of hitting the maximum recursion level and throwing errors.
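If you do stay with triggers and are on PostgreSQL, one way to guard against that loop is pg_trigger_depth(); here is a minimal sketch with hypothetical tables and columns:
CREATE FUNCTION sync_to_table_b() RETURNS trigger AS $$
BEGIN
    IF pg_trigger_depth() > 1 THEN
        RETURN NEW;  -- we were fired by another trigger: bail out to break the cascade
    END IF;
    INSERT INTO table_b (id, payload)
    VALUES (NEW.id, NEW.payload)
    ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER table_a_sync
AFTER INSERT OR UPDATE ON table_a
FOR EACH ROW EXECUTE PROCEDURE sync_to_table_b();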