Execution plan cache for PL/pgSQL functions in PostgreSQL

Execution plan cache for PL/pgSQL functions in PostgreSQL - postgresql

If I change PL/pgSQL function that is in use in another PL/pgSQL function will the PostgreSQL rebuild execution plan for both of them or only for changed one?
Say I have 2 functions using third one. Say function check_permission(user_id) is used by get_company(user_id) and get_location(user_id).
And I they got cached their execution plan somehow.
And then I change check_permission, would the execution plan caches for get_company(user_id) and get_location(user_id) be deleted and rebuilt on demand?

PostgreSQL tracks dependencies, and it flushes caches pretty aggressively when things change.
If you change a function, it'll invalidate at least the plans of all functions that depend on it. In practice, IIRC it just flushes all cached query plans entirely.
The same is true of views that depend on other views, prepared statements that reference views, etc.
If you find a case where it fails to do so you have found a bug. Please report it with a complete reproducible test case.

Related

can i somehow detect that a new statement has started

I have a plv8 function (but I assume the same thing applies to any language). It could be called a lot in one statement (select say). It computes expensive stuff so I dont want to recompute every time. But the stuff it computes depends on the contents of the database. I can naively cache things in the function but that will never get cleared. So after the first call I am always operating on old data. If would be nice to flush at the start of each statement execution.
Note that triggering off changes to the tables the cache depends on doesnt work. The cache exists in connectionA, the DB can be changed by connectionB (or C,...)

PostgreSQL approaching plan caching on identical queries?

I am running some benchmarks tests on a lot of queries. I have a set of queries and they will be run multiple times after each other. I know that PostgreSQL caches query plans so this is important to consider but as far as I know this does not always happen.
So I have two approaches. I am considering to either (a) force the query plan to be generated each time I run a query or either (b) to 'warm up' a bit so that a plan is cached and it is reused each time. How can I perform either or what precautions can I take to ensure that one or the other is happening?
It would be great if I could monitor plans in the cache but I am not sure if it is possible.
UPDATE: My queries are complex SELECTs to retrieve data, no DELETEs/INSERTs etc. Does this mean I should not give so much respect to the query planner in benchmarks?

PostgreSQL only caches query plans if
you use prepared statements
the statement is executed inside a PL/pgSQL function
So if you want to benchmark how much faster your queries become if you avoid the overhead of planning, you should create a prepared statement and execute it al least six times (because the first five runs will always generate a custom plan).
If your queries are complex, odds are that you might even lose if you cache query plans, particularly if the runtime of the queries is long. In such a case, it is usually better to spend more effort on planning each query. The biggest win with prepared statements is when the execution time of the queries is low.

Execute multiple functions together without losing performance

I have this process that has to make a series of queries, using pl/pgsql:
--process:
SELECT function1();
SELECT function2();
SELECT function3();
SELECT function4();
To be able to execute everything in one call, I created a process function as such:
CREATE OR REPLACE FUNCTION process()
RETURNS text AS
$BODY$
BEGIN
PERFORM function1();
PERFORM function2();
PERFORM function3();
PERFORM function4();
RETURN 'process ended';
END;
$BODY$
LANGUAGE plpgsql
The problem is, when I sum the time that each function takes by itself, the total is 200 seconds, while the time that the function process() takes is more than one hour!
Maybe it's a memory issue, but I don't know which configuration on postgresql.conf should I change.
The DB is running on PostgreSQL 9.4, in a Debian 8.

You commented that the 4 functions have to run consecutively. So it's safe to assume that each function works with data from tables that have been modified by the previous function. That's my prime suspect.
Any Postgres function runs inside the transaction of the outer context. So all functions share the same transaction context if packed into another function. Each can see effects on data from previous functions, obviously. (Even though effects are still invisible to other concurrent transactions.) But statistics are not updated immediately.
Query plans are based on statistics on involved objects. PL/pgSQL does not plan statements until they are actually executed, that would work in your favor. Per documentation:
As each expression and SQL command is first executed in the function,
the PL/pgSQL interpreter parses and analyzes the command to create a
prepared statement, using the SPI manager's SPI_prepare function.
PL/pgSQL can cache query plans, but only within the same session and (in pg 9.2+ at least) only after a couple of executions have shown the same query plan to work best repeatedly. If you suspect this going wrong for you, you can work around it with dynamic SQL which forces a new plan every time:
EXECUTE 'SELECT function1()';
However, the most likely candidate I see is invalidated statistics that lead to inferior query plans. SELECT / PERFORM statements (same thing) inside the function are run in quick succession, there is no chance for autovacuum to kick in and update statistics between one function and the next. If one function substantially alters data in a table the next function is working with, the next function might base its query plan on outdated information. Typical example: A table with a few rows is filled with many thousands of rows, but the next plan still thinks a sequential scan is fastest for the "small" table. You state:
when I sum the time that each function takes by itself, the total is
200 seconds, while the time that the function process() takes is more
than one hour!
What exactly does "by itself" mean? Did you run them in a single transaction or in individual transactions? Maybe even with some time in between? That would allow autovacuum to update statistics (it's typically rather quick) and possibly lead to completely different query plans based on the changed statistic.
You can inspect query plans inside plpgsql functions with auto-explain
Postgres query plan of a UDF invocation written in pgpsql
If you can identify such an issue, you can force ANALYZE in between statements. Being at it, for just a couple of SELECT / PERFORM statements you might as well use a simpler SQL function and avoid plan caching altogether (but see below!):
CREATE OR REPLACE FUNCTION process()
RETURNS text
LANGUAGE sql AS
$func$
SELECT function1();
ANALYZE some_substantially_affected_table;
SELECT function2();
SELECT function3();
ANALYZE some_other_table;
SELECT function4();
SELECT 'process ended'; -- only last result is returned
$func$;
Also, as long as we don't see the actual code of your called functions, there can be any number of other hidden effects.
Example: you could SET LOCAL ... some configuration parameter to improve the performance of your function1(). If called in separate transactions that won't affect the rest. The effect only last till the end of the transaction. But if called in a single transaction it affects the rest, too ...
Basics:
Difference between language sql and language plpgsql in PostgreSQL functions
PostgreSQL Stored Procedure Performance
Plus: transactions accumulate locks, which binds an increasing amount of resources and may cause increasing friction with concurrent processes. All locks are released at the end of a transaction. It's better to run big functions in separate transactions if at all possible, not wrapped in a single function (and thus transaction). That last item is related to what #klin and IMSoP already covered.

Warning for future readers (2015-05-30).
The technique described in the question is one of the smartest ways to effectively block the server.
In some corporations the use of this technology can meet with the reward in the form of immediate termination of the employment contract.
Attempts to improve this method are useless. It is simple, beautiful and sufficiently effective.
In RDMS the support of transactions is very expensive. When executing a transaction the server must create and store information on all changes made to the database to make these changes visible in environment (other concurrent processes) in case of a successful completion, and in case of failure, to restore the state before the transaction as soon as possible. Therefore the natural principle affecting server performance is to include in one transaction a minimum number of database operations, ie. only as much as is necessary.
A Postgres function is executed in one transaction. Placing in it many operations that could be run independently is a serious violation of the above rule.
The answer is simple: just do not do it. A function execution is not a mere execution of a script.
In the procedural languages used to write applications there are many other possibilities to simplify the code by using functions or scripts. There is also the possibility to run scripts with shell.
The use a Postgres function for this purpose would make sense if there were a possibility of using transactions within the function. At present, such a possibility does not exist, although discussions on this issue have already long history (you can read about it e.g. in postgres mailing lists).

Vastly different query run time in application

I'm having a scaling issue with an application that uses a PostgreSQL 9 backend. I have one table who's size is about 40 million records and growing and the conditional queries against it have slowed down dramatically.
To help figure out what's going wrong, I've taken a development snapshot of the database and dump the queries with the execution time into the log.
Now for the confusing part, and the gist of the question ....
The run times for my queries in the log are vastly different (an order of magnitude+) that what I get when I run the 'exact' same query in DbVisualizer to get the explain plan.
I say 'exact' but really the difference is, the application is using a prepared statement to which I bind values at runtime while the queries I run in DbVisualizer has those values in place already. The values themselves are exactly as I pulled them from the log.
Could the use of prepared statements make that big of a difference?

The answer is YES. Prepared statements cut both ways.
On the one hand, the query does not have to be re-planned for every execution, saving some overhead. This can make a difference or be hardly noticeable, depending on the complexity of the query.
On the other hand, with uneven data distribution, a one-size-fits-all query plan may be a bad choice. Called with particular values another query plan could be (much) better suited.
Running the query with parameter values in place can lead to a different query plan. More planning overhead, possibly a (much) better query plan.
Also consider unnamed prepared statements like #peufeu provided. Those re-plan the query considering parameters every time - and you still have safe parameter handling.
Similar considerations apply to queries inside PL/pgSQL functions, where queries can be treated as prepared statements internally - unless executed dynamically with EXECUTE. I quote the manual on Executing Dynamic Commands:
The important difference is that EXECUTE will re-plan the command on
each execution, generating a plan that is specific to the current
parameter values; whereas PL/pgSQL may otherwise create a generic plan
and cache it for re-use. In situations where the best plan depends
strongly on the parameter values, it can be helpful to use EXECUTE to
positively ensure that a generic plan is not selected.
Apart from that, general guidelines for performance optimization apply.

Erwin nails it, but let me add that the extended query protocol allows you to use more flavors of prepared statements. Besides avoiding re-parsing and re-planning, one big advantage of prepared statements is to send parameter values separately, which avoids escaping and parsing overhead, not to mention the opportunity for SQL injections and bugs if you don't use an API that handles parameters in a manner you can't forget to escape them.
http://www.postgresql.org/docs/9.1/static/protocol-flow.html
Query planning for named prepared-statement objects occurs when the
Parse message is processed. If a query will be repeatedly executed
with different parameters, it might be beneficial to send a single
Parse message containing a parameterized query, followed by multiple
Bind and Execute messages. This will avoid replanning the query on
each execution.
The unnamed prepared statement is likewise planned during Parse
processing if the Parse message defines no parameters. But if there
are parameters, query planning occurs every time Bind parameters are
supplied. This allows the planner to make use of the actual values of
the parameters provided by each Bind message, rather than use generic
estimates.
So, if your DB interface supports it, you can use unnamed prepared statements. It's a bit of a middle ground between a query and a usual prepared statement.
If you use PHP with PDO, please note that PDO's prepared statement implementation is rather useless for postgres, since it uses named prepared statements, but re-prepares every time you call prepare(), no plan caching takes place. So you get the worst of both : many roundtrips and plan without parameters. I've seen it be 1000x slower than pg_query() and pg_query_params() on specific queries where the postgres optimizer really needs to know the parameters to produce the optimal plan. pg_query uses raw queries, pg_query_params uses unnamed prepared statements. Usually one is faster than the other, that depends on the size of parameter data.

How does PostgreSQL cache statements and data?

In Oracle, SQL statements will be cached in shared_pool, and data which is selected frequently will be cached in db_cache.
What does PostgreSQL do? Will SQL statements and data be cached in shared_buffers?

Generally, only the contents of table and index files will be cached in the shared buffer space.
Query plans are cached in some circumstances. The best way to ensure this is to PREPARE the query once, then EXECUTE it each time.
The results of a query are not automatically cached. If you rerun the same query -- even if it's letter-for-letter identical, and no updates have been performed on the DB -- it will still execute the whole plan. It will, of course, make use of any table/index data that's already in the shared buffers cache; so it will not necessarily have to read all the data from disk again.
Update on plan caching
Plan caching is generally done per session. This means only the connection that makes the plan can use the cached version. Other connections have to make and use their own cached versions. This isn't really a performance issue because the saving you get from reusing a plan is almost always miniscule compared to the cost of connecting anyway. (Unless your queries are really complicated.)
It does cache if you use PREPARE: http://www.postgresql.org/docs/current/static/sql-prepare.html
It does cache when the query is in a PL/plSQL function: http://www.postgresql.org/docs/current/static/plpgsql-implementation.html#PLPGSQL-PLAN-CACHING
It does not cache ad-hoc queries entered in psql.
Hopefully someone else can elaborate on any other cases of query plan caching.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse