Execute multiple functions together without losing performance - postgresql

I have this process that has to make a series of queries, using pl/pgsql:
--process:
SELECT function1();
SELECT function2();
SELECT function3();
SELECT function4();
To be able to execute everything in one call, I created a process() function like this:
CREATE OR REPLACE FUNCTION process()
  RETURNS text AS
$BODY$
BEGIN
   PERFORM function1();
   PERFORM function2();
   PERFORM function3();
   PERFORM function4();
   RETURN 'process ended';
END;
$BODY$
LANGUAGE plpgsql;
The problem is, when I sum the time that each function takes by itself, the total is 200 seconds, while the time that the function process() takes is more than one hour!
Maybe it's a memory issue, but I don't know which setting in postgresql.conf I should change.
The DB is running on PostgreSQL 9.4 on Debian 8.

You commented that the 4 functions have to run consecutively. So it's safe to assume that each function works with data from tables that have been modified by the previous function. That's my prime suspect.
Any Postgres function runs inside the transaction of the outer context. So all functions share the same transaction context if packed into another function. Each can see effects on data from previous functions, obviously. (Even though effects are still invisible to other concurrent transactions.) But statistics are not updated immediately.
Query plans are based on statistics for the involved objects. PL/pgSQL does not plan statements until they are actually executed, which works in your favor. Per the documentation:
As each expression and SQL command is first executed in the function,
the PL/pgSQL interpreter parses and analyzes the command to create a
prepared statement, using the SPI manager's SPI_prepare function.
PL/pgSQL can cache query plans, but only within the same session and (in Postgres 9.2+ at least) only after a couple of executions have shown the same query plan to work best repeatedly. If you suspect this is going wrong for you, you can work around it with dynamic SQL, which forces a new plan every time:
EXECUTE 'SELECT function1()';
However, the most likely candidate I see is invalidated statistics that lead to inferior query plans. SELECT / PERFORM statements (same thing) inside the function are run in quick succession, so there is no chance for autovacuum to kick in and update statistics between one function and the next. If one function substantially alters data in a table the next function is working with, the next function might base its query plan on outdated information. A typical example: a table with a few rows is filled with many thousands of rows, but the next plan still thinks a sequential scan is fastest for the "small" table. You state:
when I sum the time that each function takes by itself, the total is
200 seconds, while the time that the function process() takes is more
than one hour!
What exactly does "by itself" mean? Did you run them in a single transaction or in individual transactions? Maybe even with some time in between? That would allow autovacuum to update statistics (it's typically rather quick) and possibly lead to completely different query plans based on the changed statistics.
You can inspect query plans inside PL/pgSQL functions with auto_explain:
Postgres query plan of a UDF invocation written in pgpsql
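A sketch of how enabling it for a single session might look (settings are illustrative; LOAD may require superuser, and the plans end up in the server log):
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log the plan of every statement
SET auto_explain.log_nested_statements = on;  -- include statements run inside functions
SELECT process();                             -- plans of the nested statements now appear in the log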
If you can identify such an issue, you can force ANALYZE in between statements. While you're at it, for just a couple of SELECT / PERFORM statements you might as well use a simpler SQL function and avoid plan caching altogether (but see below!):
CREATE OR REPLACE FUNCTION process()
RETURNS text
LANGUAGE sql AS
$func$
SELECT function1();
ANALYZE some_substantially_affected_table;
SELECT function2();
SELECT function3();
ANALYZE some_other_table;
SELECT function4();
SELECT 'process ended'; -- only last result is returned
$func$;
Also, as long as we don't see the actual code of your called functions, there can be any number of other hidden effects.
Example: you could SET LOCAL ... some configuration parameter to improve the performance of your function1(). If called in separate transactions, that won't affect the rest: the effect only lasts until the end of the transaction. But if called in a single transaction, it affects the rest, too ...
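A minimal illustration of the separate-transaction case (work_mem is just an example parameter):
BEGIN;
SET LOCAL work_mem = '256MB';  -- visible only for the rest of this transaction
SELECT function1();            -- benefits from the setting
COMMIT;                        -- the setting is discarded here

SELECT function2();            -- runs with the default work_mem again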
Basics:
Difference between language sql and language plpgsql in PostgreSQL functions
PostgreSQL Stored Procedure Performance
Plus: transactions accumulate locks, which ties up an increasing amount of resources and may cause increasing friction with concurrent processes. All locks are released at the end of a transaction. It's better to run big functions in separate transactions if at all possible, not wrapped in a single function (and thus a single transaction). That last item is related to what #klin and IMSoP already covered.

Warning for future readers (2015-05-30).
The technique described in the question is one of the smartest ways to effectively block the server.
In some companies, using this technique can be rewarded with immediate termination of the employment contract.
Attempts to improve this method are useless. It is simple, beautiful and sufficiently effective.
In an RDBMS, transaction support is very expensive. While executing a transaction, the server must create and store information on all changes made to the database, so that on successful completion the changes become visible to the environment (other concurrent processes), and on failure the previous state can be restored as quickly as possible. Therefore a natural principle affecting server performance is to include in one transaction only the minimum number of database operations, i.e. only as many as necessary.
A Postgres function is executed in one transaction. Packing into it many operations that could run independently is a serious violation of the above rule.
The answer is simple: just don't do it. Executing a function is not merely executing a script.
In the procedural languages used to write applications there are many other ways to simplify the code by using functions or scripts, and you can also run scripts from the shell.
Using a Postgres function for this purpose would make sense if it were possible to use transactions within the function. At present, no such possibility exists, although discussions on the issue already have a long history (you can read about it e.g. in the Postgres mailing lists).

Related

row caching with Postgres

Being new to Postgres, I have a question regarding caching. I do not know how much is handled by the database, vs. how much I need to handle myself.
I have rows which are part of a time series. For the sake of example, let's say I have 1 row every second.
My query is to get the last minute (60 rows), on a rolling basis (from now - 1min to now).
Ideally, I could cache the last result in the client and request the last second.
Unfortunately, the data has holes in it: some rows are missing and may be added a few seconds later. So I need to query the whole thing anyway.
I could find where the holes are and make a query just for those; this is also an option.
What I was wondering is: how much caching does Postgres perform on its own?
If I have a query returning rows 3, 4, 5, 6 and my next query returns 4, 5, 6, 7, would rows 4, 5, 6 have to be fetched again, or are they cached by the database?
Postgres includes an extensive caching system. It is very likely that subsequent executions of the same query will use the cached data, but we have virtually no influence on it because the cache works autonomously. What can and should be done is to use prepared statements. Per the documentation:
A prepared statement is a server-side object that can be used to optimize performance. When the PREPARE statement is executed, the specified statement is parsed, analyzed, and rewritten. When an EXECUTE command is subsequently issued, the prepared statement is planned and executed. This division of labor avoids repetitive parse analysis work, while allowing the execution plan to depend on the specific parameter values supplied.
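A minimal sketch of a prepared statement for the rolling-window query (the table and column names are assumptions):
PREPARE recent_rows (timestamptz) AS
    SELECT *
    FROM   measurements
    WHERE  ts >= $1
    ORDER  BY ts;

EXECUTE recent_rows (now() - interval '1 minute');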
However, the scenario described in the question suggests a completely different approach to the issue, namely the use of the NOTIFY / LISTEN feature. In brief:
the server sends a notification every time a row is inserted into the table,
the application listens on the agreed channel and receives newly entered rows.
In this solution, the application performs the query only once and completes the data from notifications.
Sample trigger function:
create or replace function before_insert_on_my_table()
returns trigger language plpgsql as $$
begin
perform pg_notify('my_data', new::text);
return new;
end $$;
create trigger before_insert_on_my_table
before insert on my_table
for each row execute procedure before_insert_on_my_table();
The implementation of the LISTEN statement in an application depends on the language used. For example, the Python driver psycopg2 has convenient tools for it.
Read more about NOTIFY in the docs.
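For a quick manual test from psql (the channel name matches the trigger above; this is just a sketch):
LISTEN my_data;

-- every INSERT into my_table now raises a notification on this channel;
-- you can also fire one by hand to test the plumbing:
NOTIFY my_data, 'test payload';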
PostgreSQL caches data automatically. Data are read, written and cached in units of 8kB, and whenever you access a row, the block containing the row will be cached.
All you have to make sure is that you have an index on the timestamp of your table, otherwise PostgreSQL has to read the whole table to compute the result. With an index it will read (and cache) only the blocks that contain the data you need, and that will speed up your next query for these data.
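A minimal sketch, again with assumed names for the table and its timestamp column:
CREATE INDEX measurements_ts_idx ON measurements (ts);

-- the rolling-window query can then use an index scan instead of reading the whole table:
SELECT *
FROM   measurements
WHERE  ts >= now() - interval '1 minute';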

Can I rely on every single UPDATE statement in PostgreSQL being atomic?

I have read material saying that each UPDATE statement in PostgreSQL is atomic.
For example,
update set column_1 = column_1 + 100 where column_2 = 10;
Even though I have multiple processes calling the update simultaneously, I can rely on them happening in sequence, because each update is atomic behind the scenes and the "read_modify_write" cycle is encapsulated as one unit.
However, what if the update statement looks like the following:
update set column_1 = myFunction(column_1) where column_2 = 10;
Here, myFunction() is a stored procedure created by me. In this function, I will apply different math operations to column_1 depending on its amount. Something like:
if(column_1 < 10):
// do something
else if (column_1 < 20):
// do something
else
// do something
In this case, when the single update statement contains a self-defined function, does it remain atomic?
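For concreteness, a hypothetical PL/pgSQL sketch of what such a myFunction() could look like (thresholds and operations are made up):
CREATE OR REPLACE FUNCTION myFunction(amount numeric)
  RETURNS numeric
  LANGUAGE plpgsql AS
$func$
BEGIN
   IF amount < 10 THEN
      RETURN amount * 2;      -- do something
   ELSIF amount < 20 THEN
      RETURN amount + 5;      -- do something else
   ELSE
      RETURN amount;          -- do something else again
   END IF;
END
$func$;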
OK, #Schwern's knowledge of Perl may well be world class but as regards PostgreSQL transactions, I can correct him :-)
Every statement in PostgreSQL is executed within a transaction, either an explicit one you BEGIN/COMMIT yourself or an implicit one wrapping the statement. For the duration of a statement you will see a stable view of the entire database.
If you write myFunction as an in-database custom function in pl/pgsql or some such, then it too will be in the same transaction as the statement that calls it. If it doesn't run its own queries, just operates on its parameters, then you don't need to think any further.
If you are reading from tables within your function then you will need a passing familiarity with transaction isolation levels. In particular, make sure you understand what "read committed" implies about seeing other processes' activities.
The blog article you refer to is discussing performing operations outside of the database. The solution it proposes is exactly what you are asking about - an atomic update.
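For illustration only (the table and column names are invented): if your function reads rows that concurrent sessions may be updating, an explicit row lock is one way to avoid acting on stale data under read committed:
BEGIN;
SELECT balance
FROM   accounts
WHERE  id = 42
FOR UPDATE;   -- blocks concurrent writers of this row until the transaction ends

-- compute and apply the new value in the same transaction
UPDATE accounts SET balance = balance + 100 WHERE id = 42;
COMMIT;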
The update, subqueries, and queries in function calls should all see a consistent view of the data.
From Chapter 13. Concurrency Control.
Internally, data consistency is maintained by using a multiversion model (Multiversion Concurrency Control, MVCC). This means each SQL statement sees a snapshot of data (a database version) as it was some time ago, regardless of the current state of the underlying data. This prevents statements from viewing inconsistent data produced by concurrent transactions performing updates on the same data rows, providing transaction isolation for each database session.
What that means, I believe, is that for each statement Postgres keeps a version of the data. Each statement sees a consistent version of the database for the entirety of its run. That includes sub-queries and function calls.
This doesn't mean you don't have to think about concurrency. It just means that update will see consistent data.

Multiple deletes in Postgres with a commit after each one

I have about 30 million records to delete from a table, and deleting even 10,000 is taking 30 minutes. I'm concerned about issuing the delete command for all the 30 million records, so I'd like to do the delete in batches.
So my approach was to do a loop deleting a batch, then commit, then loop to delete the next batch. But that generates the following error:
LOCATION: exec_stmt_raise, pl_exec.c:3216
ERROR: 0A000: cannot begin/end transactions in PL/pgSQL
HINT: Use a BEGIN block with an EXCEPTION clause instead.
This is the code I wrote:
DO $$
BEGIN
    FOR i IN 1..30000 LOOP
        DELETE FROM my_table
        WHERE id IN (
            SELECT id
            FROM my_table
            WHERE should_delete = true
            LIMIT 1000
        );
        RAISE NOTICE 'Done with batch %', i;
        COMMIT;
    END LOOP;
END
$$;
What is an alternative to achieving this?
A few things jump out at me:
1. Transaction overhead in PostgreSQL is significant. It's possible you would see a substantial performance improvement if you just did this all in one big transaction.
2. It sounds as if you are experimenting on a live production database. Don't do that. Set up a test instance with statistically similar (or identical) data and a similar workload, and use that. That way, if you manage to break something, you'll only hurt the test instance.
3. Once you have a test instance via (2), you can use it to test (1) and see how long it takes. Then you won't have to guess at which approach is superior.
4. You mention that other jobs run at fixed times. So this is not a database backing (something like) a web server exposed to the internet. It is a batch system which runs scheduled jobs at predetermined times. Since you already have a scheduling system, use it to schedule your deletes at times when the system is not in use (or when it is unlikely to be in use).
5. If you decide that you must use multiple transactions, use something other than PL/pgSQL to do the actual looping. You could, for instance, use a shell script or another programming language like Python or Java. Anything with Postgres bindings will do (see the sketch after this list).
6. The really drastic approach is to replicate the whole database, perform the deletes on the replica, and then replace the original with the replica. The swap may require putting the database into read-only mode for a short time to avoid write inconsistencies and to allow the replica to converge. Since yours is a batch system, that might not matter. Obviously this is the most resource-intensive approach since it requires a whole extra database.
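If you would rather stay in psql than write a client program, one possible sketch for point 5 uses the \gexec meta-command (psql 9.6 or newer): the outer query generates one DELETE per batch, and under psql's default autocommit each generated statement runs and commits as its own transaction. Table and column names are taken from the question.
SELECT 'DELETE FROM my_table
        WHERE id IN (SELECT id FROM my_table WHERE should_delete = true LIMIT 1000)'
FROM generate_series(1, 30000) \gexec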

Parallel queries on CTE for writing operations in PostgreSQL

From PostgreSQL 9.6 Release Notes:
Only strictly read-only queries where the driving table is accessed via a sequential scan can be parallelized.
My question is: if a CTE (WITH clause) contains only read operations, but its result is used to feed a write operation, like an INSERT or UPDATE, is it also disallowed to parallelize the sequential scans?
I mean, since a CTE is much like a temporary table that only exists for the currently executing query, can I assume that its inner query can take advantage of the brand-new parallel sequential scan of PostgreSQL 9.6? Or is it treated like a subquery and cannot use a parallel scan?
For example, consider this query:
WITH foobarbaz AS (
SELECT foo FROM bar
WHERE some_expensive_function(baz)
)
DELETE FROM bar
USING foobarbaz
WHERE bar.foo = foobarbaz.foo
;
Is that foobarbaz calculation expected to be parallelizable, or is it disallowed because of the DELETE statement?
If it isn't allowed, I thought I could replace the CTE with a CREATE TEMPORARY TABLE statement. But I think I would run into the same issue, as CREATE TABLE is a write operation. Am I wrong?
Lastly, a last resort I could try is to run the read as a pure read-only query and use its result as input for the insert and/or update operations. Outside of a transaction that should work. But the question is: if the read operation and the insert/update are between BEGIN and COMMIT statements, will it be disallowed anyway? I understand they are two completely different operations, but they run in the same transaction and the same Postgres session.
To be clear, my concern is that I have an awful mass of hard-to-read and hard-to-redesign SQL queries that involve multiple sequential scans with low-performance function calls and that perform complex changes over two tables. The whole process runs in a single transaction because, if not, the mess in case of failure would be totally unrecoverable.
My hope is to be able to parallelize some sequential scans to take advantage of the machine's 8 CPU cores and complete the process sooner.
Please don't answer that I need to fully redesign that mess: I know, and I'm working on it. But it is a large project and we need to keep working in the meantime.
Anyway, any suggestion would be appreciated.
EDIT:
Here is a brief report of what I have discovered so far:
As #a_horse_with_no_name says in his comment (thanks), the CTE and the rest of the query form a single DML statement and, if it contains a write operation, even outside of the CTE, then the CTE cannot be parallelized (I also tested it).
Also I found this wiki page with more concise information about parallel scans than what I found in the release notes link.
An interesting point I could check thanks to that wiki page is that I need to declare the involved functions as parallel safe. I did so and it worked (in a test without writes).
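The declaration might look roughly like this (the function name comes from the example query; the argument type is a placeholder):
ALTER FUNCTION some_expensive_function(text) PARALLEL SAFE;
-- or, equivalently, add PARALLEL SAFE to the CREATE FUNCTION statement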
Another interesting point is what #a_horse_with_no_name says in his second comment: using dblink to perform a pure read-only query. But, investigating a bit, I saw that postgres_fdw, which the wiki explicitly mentions as not supporting parallel scans, provides roughly the same functionality using a more modern and standards-compliant infrastructure.
And, on the other hand, even if it worked, I would end up getting data from outside the transaction, which in some cases would be acceptable for me but, I think, is not a good idea as a general solution.
Finally, I checked that it is possible to perform a parallel scan in a read-only query inside a transaction, even if the transaction later performs write operations (no exception is triggered and I could commit).
...in summary, I think my best bet (if not the only one) is to refactor the script so that it reads the data into memory first and then performs the write operations in the same transaction.
That will increase I/O overhead but, given the latencies I am dealing with, it will still come out ahead.

Vastly different query run time in application

I'm having a scaling issue with an application that uses a PostgreSQL 9 backend. I have one table whose size is about 40 million records and growing, and conditional queries against it have slowed down dramatically.
To help figure out what's going wrong, I've taken a development snapshot of the database and dumped the queries with their execution times into the log.
Now for the confusing part, and the gist of the question ....
The run times for my queries in the log are vastly different (an order of magnitude or more) from what I get when I run the 'exact' same query in DbVisualizer to get the explain plan.
I say 'exact', but really the difference is that the application uses a prepared statement to which I bind values at runtime, while the queries I run in DbVisualizer have those values in place already. The values themselves are exactly as I pulled them from the log.
Could the use of prepared statements make that big of a difference?
The answer is YES. Prepared statements cut both ways.
On the one hand, the query does not have to be re-planned for every execution, saving some overhead. This can make a difference or be hardly noticeable, depending on the complexity of the query.
On the other hand, with uneven data distribution, a one-size-fits-all query plan may be a bad choice. Called with particular values another query plan could be (much) better suited.
Running the query with parameter values in place can lead to a different query plan. More planning overhead, possibly a (much) better query plan.
Also consider unnamed prepared statements like #peufeu provided. Those re-plan the query considering parameters every time - and you still have safe parameter handling.
Similar considerations apply to queries inside PL/pgSQL functions, where queries can be treated as prepared statements internally - unless executed dynamically with EXECUTE. I quote the manual on Executing Dynamic Commands:
The important difference is that EXECUTE will re-plan the command on
each execution, generating a plan that is specific to the current
parameter values; whereas PL/pgSQL may otherwise create a generic plan
and cache it for re-use. In situations where the best plan depends
strongly on the parameter values, it can be helpful to use EXECUTE to
positively ensure that a generic plan is not selected.
Apart from that, general guidelines for performance optimization apply.
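As a sketch of that dynamic variant inside a PL/pgSQL function (table and column names are invented):
CREATE OR REPLACE FUNCTION count_recent(_since timestamptz)
  RETURNS bigint
  LANGUAGE plpgsql AS
$func$
DECLARE
   _result bigint;
BEGIN
   -- EXECUTE re-plans on every call, using the actual parameter value
   EXECUTE 'SELECT count(*) FROM events WHERE created_at >= $1'
   INTO   _result
   USING  _since;

   RETURN _result;
END
$func$;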
Erwin nails it, but let me add that the extended query protocol allows you to use more flavors of prepared statements. Besides avoiding re-parsing and re-planning, one big advantage of prepared statements is that parameter values are sent separately, which avoids escaping and parsing overhead, not to mention the risk of SQL injection and bugs if you don't use an API that handles parameters in a way that makes it impossible to forget to escape them.
http://www.postgresql.org/docs/9.1/static/protocol-flow.html
Query planning for named prepared-statement objects occurs when the
Parse message is processed. If a query will be repeatedly executed
with different parameters, it might be beneficial to send a single
Parse message containing a parameterized query, followed by multiple
Bind and Execute messages. This will avoid replanning the query on
each execution.
The unnamed prepared statement is likewise planned during Parse
processing if the Parse message defines no parameters. But if there
are parameters, query planning occurs every time Bind parameters are
supplied. This allows the planner to make use of the actual values of
the parameters provided by each Bind message, rather than use generic
estimates.
So, if your DB interface supports it, you can use unnamed prepared statements. It's a bit of a middle ground between a query and a usual prepared statement.
If you use PHP with PDO, note that PDO's prepared statement implementation is rather useless for Postgres: it uses named prepared statements but re-prepares every time you call prepare(), so no plan caching takes place. You get the worst of both worlds: many round trips and a plan made without parameter values. I've seen it be 1000x slower than pg_query() and pg_query_params() on specific queries where the Postgres optimizer really needs to know the parameters to produce the optimal plan. pg_query uses raw queries, pg_query_params uses unnamed prepared statements. Usually one is faster than the other, depending on the size of the parameter data.