Why are functions that return tables so much slower than running the actual query? - postgresql

I'm pretty new to PostgreSQL, so I guess I'm missing some basic information that I couldn't find while googling, probably because I didn't know the right keywords. Hopefully I'll get the missing information here :)
I'm using PostgreSQL 11.4.
I've repeatedly run into the same issue: when I create a function that returns a query result as a table, it executes about 50 times slower than running the actual query, sometimes even more than that.
I understand that IMMUTABLE can be used when there are no table scans and the function only manipulates and returns data based on its parameters, and STABLE when the query does scan tables but always returns the same results for the same parameters.
So the format of my function creation is this:
CREATE FUNCTION fnc_name(columns...)
RETURNS TABLE (columns...) STABLE AS $func$
BEGIN
RETURN QUERY
SELECT ...;
END $func$ LANGUAGE plpgsql;
I can't show the query here since it's work related, but there is something I don't understand about creating functions: why is it so slow? I need to fully understand this issue because I need to create many more functions, and right now it seems that I have to run the actual query to get proper performance instead of using functions, and I still don't have a clue as to why!
Any information regarding this issue would be greatly appreciated.

It all depends on how the function is used and on the size of the returned relation.
First I have to say: don't write these functions. It is a known antipattern. Use views instead. I'll try to explain why.
The result of a table function written in a higher-level PL language such as Perl, Python or PL/pgSQL is materialized. When the result is small (it fits in work_mem) it is stored in memory. Bigger results are stored in a temp file. This can add significant overhead.
A function is a black box for the optimizer: it is not possible to push down predicates, there are no correct statistics, and the planner cannot change the form or the order of joins. So some non-trivial queries can be slower (a little bit or significantly) because these optimizations are impossible.
There is one exception to these rules: simple SQL functions. SQL functions (functions with a single SQL statement) can be inlined (when some prerequisites are met). With inlining, the body of the function is merged into the body of the outer SQL query, and the result is the same as if you had written the subquery directly. So the result is not materialized and the function is not a barrier for optimization.
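A minimal sketch of the difference, assuming a hypothetical orders table (the table, column and function names are illustrative and not from the question). Both functions wrap the same query; comparing the two EXPLAIN outputs should show whether the SQL version was inlined on your server.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE orders (
    id          integer PRIMARY KEY,
    customer_id integer NOT NULL,
    total       numeric NOT NULL
);

-- PL/pgSQL wrapper: the planner treats it as a black box.
CREATE FUNCTION orders_plpgsql(p_customer integer)
RETURNS TABLE (id integer, total numeric)
LANGUAGE plpgsql STABLE AS $$
BEGIN
    RETURN QUERY
    SELECT o.id, o.total FROM orders o WHERE o.customer_id = p_customer;
END
$$;

-- Single-statement SQL function: a candidate for inlining.
CREATE FUNCTION orders_sql(p_customer integer)
RETURNS TABLE (id integer, total numeric)
LANGUAGE sql STABLE AS $$
    SELECT o.id, o.total FROM orders o WHERE o.customer_id = p_customer;
$$;

-- Expect a Function Scan node here: the outer filter is applied
-- only after the function has produced its whole result.
EXPLAIN SELECT * FROM orders_plpgsql(42) WHERE id = 7;

-- If inlining applies, the plan scans orders directly and the
-- outer filter can be combined with the one inside the function.
EXPLAIN SELECT * FROM orders_sql(42) WHERE id = 7;
```

If the second plan also shows a Function Scan, one of the inlining prerequisites is probably not met (for example, the function is declared VOLATILE, STRICT or SECURITY DEFINER).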
There is a basic rule: use functions only when you cannot calculate some data with SQL. Don't try to hide or encapsulate SQL in functions (to simplify complex queries, use views, not functions). The same rules are valid for all SQL databases (Oracle, DB2, MSSQL). Postgres is not an exception.
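As a rough illustration of the view-based alternative, assuming the same kind of hypothetical orders table (all names here are made up for the example):

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE orders (
    id          integer PRIMARY KEY,
    customer_id integer NOT NULL,
    total       numeric NOT NULL
);

-- The complex query is given a name, but stays visible to the planner.
CREATE VIEW customer_order_totals AS
SELECT customer_id,
       count(*)   AS order_count,
       sum(total) AS total_spent
FROM orders
GROUP BY customer_id;

-- The view is expanded into the outer query, so the planner can
-- consider applying this filter before aggregating.
SELECT * FROM customer_order_totals WHERE customer_id = 42;
```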
This note is not against stored procedures (functions). It is a great technology, but it requires a specific style of programming. Wrapping queries into functions (when the function does nothing else) is bad.

Related

Parallel queries on CTE for writing operations in PostgreSQL

From PostgreSQL 9.6 Release Notes:
Only strictly read-only queries where the driving table is accessed via a sequential scan can be parallelized.
My question is: if a CTE (WITH clause) contains only read operations, but its result is used to feed a write operation, like an insert or update, is it also disallowed to parallelize sequential scans?
I mean, as a CTE is much like a temporary table that only exists for the currently executing query, can I assume that its inner query can take advantage of the brand new parallel seq scan of PostgreSQL 9.6? Or is it treated like a subquery, so that it cannot use a parallel scan?
For example, consider this query:
WITH foobarbaz AS (
SELECT foo FROM bar
WHERE some_expensive_function(baz)
)
DELETE FROM bar
USING foobarbaz
WHERE bar.foo = foobarbaz.foo
;
Can the foobarbaz calculation be expected to be parallelized, or is that disallowed because of the DELETE statement?
If it isn't allowed, I thought I could replace the CTE with a CREATE TEMPORARY TABLE statement. But I think I would run into the same issue, since CREATE TABLE is a write operation. Am I wrong?
Lastly, one more thing I could try is to perform it as a pure read operation and use its result as input for the insert and/or update operations. Outside of a transaction it should work. But the question is: if the read operation and the insert/update are between BEGIN and COMMIT statements, will it be disallowed anyway? I understand they are two completely different operations, but they run in the same transaction and on the same Postgres server.
To be clear, my concern is that I have an awful mass of hard-to-read and hard-to-redesign SQL queries that involve multiple sequential scans with low-performance function calls and that perform complex changes to two tables. The whole process runs in a single transaction because otherwise the mess in case of failure would be totally unrecoverable.
My hope is to be able to parallelize some sequential scans to take advantage of the machine's 8 CPU cores and complete the process sooner.
Please don't answer that I need to fully redesign that mess: I know, and I'm working on it. But it is a big project and we need to keep working in the meantime.
Anyway, any suggestion will be appreciated.
EDIT:
I'm adding a brief report of what I've discovered so far:
As @a_horse_with_no_name says in his comment (thanks), the CTE and the rest of the query form a single DML statement and, if it contains a write operation, even outside of the CTE, then the CTE cannot be parallelized (I also tested it).
I also found this wiki page, with more concise information about parallel scans than what I found in the release notes link.
An interesting point I could verify thanks to that wiki page is that I need to declare the functions involved as parallel safe. I did so and it worked (in a test without writes).
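For reference, declaring a function parallel safe might look like the sketch below. The function body and its integer argument are assumptions made up for the example; only the name some_expensive_function comes from the query above.

```sql
-- Hypothetical stand-in for the expensive function from the question.
CREATE FUNCTION some_expensive_function(baz integer) RETURNS boolean
LANGUAGE sql IMMUTABLE PARALLEL SAFE AS $$
    SELECT baz % 2 = 0
$$;

-- An already existing function can be relabeled without recreating it.
ALTER FUNCTION some_expensive_function(integer) PARALLEL SAFE;

-- Check the label: 's' = safe, 'r' = restricted, 'u' = unsafe (the default).
SELECT proname, proparallel
FROM pg_proc
WHERE proname = 'some_expensive_function';
```

The label is a promise to the planner, not something PostgreSQL verifies, so it should only be applied to functions that really have no side effects.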
Another interesting point is what @a_horse_with_no_name says in his second comment: using dblink to perform a pure read-only query. But, investigating a bit, I saw that postgres_fdw, which is explicitly mentioned in the wiki as not supporting parallel scans, provides roughly the same functionality using a more modern and standards-compliant infrastructure.
And, on the other hand, even if it worked, I would end up getting data from outside the transaction, which would be acceptable for me in some cases but is not, I think, a good idea as a general solution.
Finally, I checked that it is possible to perform a parallel scan in a read-only query inside a transaction, even if the transaction later performs write operations (no exception is raised and I could commit).
In summary, I think my best bet (if not the only one) would be to refactor the script so that it reads the data into memory first and then performs the write operations in the same transaction.
It will increase I/O overhead but, given the latencies I'm dealing with, it should still be better overall.

CTE vs TVF Performance

Which performs better: common table expressions or table-valued functions? I'm designing a process that could use either and am unable to find any real data either way. Whichever route I choose would be executed via a stored procedure, and the data would ultimately update a table connected through a linked server (unfortunately there is no way around this). Insights appreciated.
This isn't really a performance question. You are comparing tuna fish and watermelons. A CTE is an inline view that can be used by the next query only. A TVF is a complete unit of work that can function on its own, unlike a CTE. They both have their place and, when used correctly, are incredibly powerful tools.
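To make the distinction concrete, here is a small T-SQL sketch. The dbo.Orders table, its columns and the function name are hypothetical and only serve the example.

```sql
-- T-SQL; hypothetical table, for illustration only.
CREATE TABLE dbo.Orders (
    OrderId   int PRIMARY KEY,
    OrderDate date NOT NULL,
    Total     money NOT NULL
);
GO

-- A CTE exists only for the single statement that follows it.
WITH RecentOrders AS (
    SELECT OrderId, Total
    FROM dbo.Orders
    WHERE OrderDate >= '2020-01-01'
)
SELECT OrderId, Total FROM RecentOrders;
GO

-- An inline table-valued function is a stored object
-- that any later query can call with its own parameter.
CREATE FUNCTION dbo.RecentOrdersSince (@since date)
RETURNS TABLE
AS
RETURN (
    SELECT OrderId, Total
    FROM dbo.Orders
    WHERE OrderDate >= @since
);
GO

SELECT OrderId, Total FROM dbo.RecentOrdersSince('2020-01-01');
```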

Vastly different query run time in application

I'm having a scaling issue with an application that uses a PostgreSQL 9 backend. I have one table whose size is about 40 million records and growing, and the conditional queries against it have slowed down dramatically.
To help figure out what's going wrong, I've taken a development snapshot of the database and am dumping the queries with their execution times into the log.
Now for the confusing part, and the gist of the question ....
The run times for my queries in the log are vastly different (an order of magnitude or more) from what I get when I run the 'exact' same query in DbVisualizer to get the explain plan.
I say 'exact', but really the difference is that the application is using a prepared statement to which I bind values at runtime, while the queries I run in DbVisualizer have those values in place already. The values themselves are exactly as I pulled them from the log.
Could the use of prepared statements make that big of a difference?
The answer is YES. Prepared statements cut both ways.
On the one hand, the query does not have to be re-planned for every execution, saving some overhead. This can make a difference or be hardly noticeable, depending on the complexity of the query.
On the other hand, with uneven data distribution, a one-size-fits-all query plan may be a bad choice. Called with particular values another query plan could be (much) better suited.
Running the query with parameter values in place can lead to a different query plan. More planning overhead, possibly a (much) better query plan.
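The effect can be explored with a sketch like the following, assuming a hypothetical events table with a deliberately skewed status column (all names are made up). Which plan is chosen depends on the server version and settings, so no particular output is claimed here.

```sql
-- Hypothetical table with skewed data: 'rare' is 0.1% of the rows.
CREATE TABLE events (id serial PRIMARY KEY, status text NOT NULL);

INSERT INTO events (status)
SELECT CASE WHEN g % 1000 = 0 THEN 'rare' ELSE 'common' END
FROM generate_series(1, 100000) AS g;

CREATE INDEX events_status_idx ON events (status);
ANALYZE events;

-- Prepared statement: the plan may be built without knowing $1.
PREPARE by_status(text) AS
    SELECT count(*) FROM events WHERE status = $1;

EXPLAIN EXECUTE by_status('rare');
EXPLAIN EXECUTE by_status('common');

-- Literal values: the planner sees the value and its statistics.
EXPLAIN SELECT count(*) FROM events WHERE status = 'rare';
EXPLAIN SELECT count(*) FROM events WHERE status = 'common';

DEALLOCATE by_status;
```

Comparing the EXPLAIN output of the prepared and the literal variants for the same value is a quick way to check whether a generic plan explains the difference seen in the application log.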
Also consider unnamed prepared statements, as @peufeu describes. Those re-plan the query with the actual parameters every time, and you still get safe parameter handling.
Similar considerations apply to queries inside PL/pgSQL functions, where queries can be treated as prepared statements internally - unless executed dynamically with EXECUTE. I quote the manual on Executing Dynamic Commands:
The important difference is that EXECUTE will re-plan the command on
each execution, generating a plan that is specific to the current
parameter values; whereas PL/pgSQL may otherwise create a generic plan
and cache it for re-use. In situations where the best plan depends
strongly on the parameter values, it can be helpful to use EXECUTE to
positively ensure that a generic plan is not selected.
Apart from that, general guidelines for performance optimization apply.
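A minimal sketch of the EXECUTE approach described in the quoted manual passage, assuming a hypothetical events table (the table and function names are illustrative):

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE events (id serial PRIMARY KEY, status text NOT NULL);

CREATE FUNCTION count_events_dynamic(p_status text) RETURNS bigint
LANGUAGE plpgsql STABLE AS $$
DECLARE
    n bigint;
BEGIN
    -- EXECUTE plans the statement on every call, using the current
    -- parameter value; USING passes the value safely, without
    -- concatenating it into the query string.
    EXECUTE 'SELECT count(*) FROM events WHERE status = $1'
        INTO n
        USING p_status;
    RETURN n;
END
$$;

SELECT count_events_dynamic('rare');
```

The trade-off is the one described above: planning cost is paid on every call in exchange for a plan that fits the actual value.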
Erwin nails it, but let me add that the extended query protocol allows you to use more flavors of prepared statements. Besides avoiding re-parsing and re-planning, one big advantage of prepared statements is that parameter values are sent separately, which avoids escaping and parsing overhead, not to mention the opportunity for SQL injections and bugs if you don't use an API that handles parameters in such a way that you cannot forget to escape them.
http://www.postgresql.org/docs/9.1/static/protocol-flow.html
Query planning for named prepared-statement objects occurs when the
Parse message is processed. If a query will be repeatedly executed
with different parameters, it might be beneficial to send a single
Parse message containing a parameterized query, followed by multiple
Bind and Execute messages. This will avoid replanning the query on
each execution.
The unnamed prepared statement is likewise planned during Parse
processing if the Parse message defines no parameters. But if there
are parameters, query planning occurs every time Bind parameters are
supplied. This allows the planner to make use of the actual values of
the parameters provided by each Bind message, rather than use generic
estimates.
So, if your DB interface supports it, you can use unnamed prepared statements. It's a bit of a middle ground between a query and a usual prepared statement.
If you use PHP with PDO, please note that PDO's prepared statement implementation is rather useless for postgres, since it uses named prepared statements but re-prepares every time you call prepare(), so no plan caching takes place. So you get the worst of both: many round trips and a plan made without knowing the parameters. I've seen it be 1000x slower than pg_query() and pg_query_params() on specific queries where the postgres optimizer really needs to know the parameters to produce the optimal plan. pg_query uses raw queries, pg_query_params uses unnamed prepared statements. Usually one is faster than the other, depending on the size of the parameter data.

Using TSQL for the first time some basic instructions

I am writing an app that will use many tables, and I have been told that using stored procs in the app is not the way to go, that it is too slow.
It has been suggested that I use TSQL. I have only used stored procs until now. In what way is using TSQL different, and how can I get up to speed? In fact, is this the way to go for faster data access, or are there other methods?
TSQL is the Microsoft and Sybase SQL dialect, so your stored procedures are written in TSQL if you use SQL Server.
In most cases, properly written stored procedures outperform ad hoc queries.
On the other hand, coding procedures requires more skill, and debugging is quite a tedious process. It's really hard to give advice without seeing your procedures, but there are some common things that slow down SPs.
The execution plan is generated on the first run, but sometimes the optimal plan depends on the input parameters. See here for more details.
Another thing that prevents generating an optimal plan is using conditional branches in the SP body.
For example,
IF (something)
BEGIN
SELECT ... FROM table1
INNER JOIN table2 ...
.....
END
ELSE
BEGIN
SELECT ... FROM table2
INNER JOIN table3 ...
.....
END
should be refactored to
IF (something)
EXEC proc1; -- create a new SP and move the code from the IF branch there
ELSE
EXEC proc2; -- create a new SP and move the code from the ELSE branch there
The traditional argument for using SPs was always that they're compiled so they run faster. That hasn't been true for many years but nor is it true, in general, that SPs run slower.
If the reference is to development time rather than runtime then there may be some truth to this but, considering your skills, it may be that learning a new approach would slow you down more than using SPs.
If your system uses Object-Relational Mapping (ORM) then SPs will probably get in your way but then you wouldn't really be using T-SQL either - it'll be done for you.
Stored procs are written in T-SQL, so it's a bit odd that someone would make such a statement.
Daniel is right, ORM is a good option. If you're doing any data intensive operations (such as parsing content), I'd look at the database first and foremost. You might want to do some reading on SP as speed isn't everything... there are other benefits. This was one hit from Google, but you can do more research yourself:
http://msdn.microsoft.com/en-us/library/ms973918.aspx

what are the advantages of using plpgsql in postgresql

Besides the syntactic sugar and expressive power, what are the differences in runtime efficiency? I mean, can plpgsql be faster than, let's say, plpythonu or pljava? Or are they all approximately equal?
We are using stored procedures for the task of detecting near-duplicate records of people in a moderately sized database (around 10M records).
plpgsql provides greater type safety I believe, you have to perform explicit casts if you want to perform operations using two different columns of similar type, like varchar and text or int4 and int8. This is important because if you need to have your stored proc use indexes, postgres requires that the types match exactly between join conditions (edit: for equality checks too I think).
There may be a facility for this in the other languages though, I haven't used them. In any case, I hope this gives you a better starting point for your investigation.
plpgsql is very well integrated with SQL, so the source code should be very clean and readable. In external languages like PLJava or PLPython, SQL statements have to be isolated, because SQL isn't part of the language, so you have to write a little more code. If your procedure has a lot of SQL statements, then a plpgsql procedure should be cleaner, shorter and a little bit faster. When your procedure has no SQL statements, then procedures in external languages can be faster, but external languages (interpreters) need some time for initialisation, so for simple tasks, procedures in SQL or plpgsql should be faster.
External languages are used when you need some functionality like access to net, access to filesystem - http://www.postgres.cz/index.php/PL/Perlu_-_Untrusted_Perl_%28en%29
As far as I know, people usually use a combination of PL languages: (SQL, plpgsql, plperl) or (SQL, plpgsql, plpython).
Without doing actual testing, I would expect plpgsql to be somewhat more efficient than other languages, because it's small. Having said that, remember that SQL functions are likely to be even faster than plpgsql, if a function is simple enough that you can write it in just SQL.
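As a small illustration of a function that is "simple enough" for plain SQL, here is the same hypothetical calculation written both ways. The function names and the 20% rate are invented for the example, and no timing is claimed; measure on your own data.

```sql
-- Plain SQL version: a single expression the planner may be able to inline.
CREATE FUNCTION add_tax_sql(amount numeric) RETURNS numeric
LANGUAGE sql IMMUTABLE AS $$
    SELECT amount * 1.2
$$;

-- Equivalent PL/pgSQL version: always goes through the PL/pgSQL interpreter.
CREATE FUNCTION add_tax_plpgsql(amount numeric) RETURNS numeric
LANGUAGE plpgsql IMMUTABLE AS $$
BEGIN
    RETURN amount * 1.2;
END
$$;

-- Both return the same value for the same input.
SELECT add_tax_sql(100), add_tax_plpgsql(100);
```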