Executing queries dynamically in PL/pgSQL - postgresql

I have found solutions (I think) to the problem I'm about to ask for on Oracle and SQL Server, but can't seem to translate this into a Postgres solution. I am using Postgres 9.3.6.
The idea is to be able to generate "metadata" about the table content for profiling purposes. This can only be done (AFAIK) by having queries run for each column so as to find out, say... min/max/count values and such. In order to automate the procedure, it is preferable to have the queries generated by the DB, then executed.
With an example salesdata table, I'm able to generate a select query for each column, returning the min() value, using the following snippet:
SELECT 'SELECT min('||column_name||') as minval_'||column_name||' from salesdata '
FROM information_schema.columns
WHERE table_name = 'salesdata'
The advantage being that the db will generate the code regardless of the number of columns.
Now there's a myriad places I had in mind for storing these queries, either a variable of some sort, or a table column, the idea being to then have these queries execute.
I thought of storing the generated queries in a variable then executing them using the EXECUTE (or EXECUTE IMMEDIATE) statement which is the approach employed here (see right pane), but Postgres won't let me declare a variable outside a function and I've been scratching my head with how this would fit together, whether that's even the direction to follow, perhaps there's something simpler.
Would you have any pointers, I'm currently trying something like this, inspired by this other question but have no idea whether I'm headed in the right direction:
CREATE OR REPLACE FUNCTION foo()
RETURNS void AS
$$
DECLARE
dyn_sql text;
BEGIN
dyn_sql := SELECT 'SELECT min('||column_name||') from salesdata'
FROM information_schema.columns
WHERE table_name = 'salesdata';
execute dyn_sql
END
$$ LANGUAGE PLPGSQL;

System statistics
Before you roll your own, have a look at the system table pg_statistic or the view pg_stats:
This view allows access only to rows of pg_statistic that correspond
to tables the user has permission to read, and therefore it is safe to
allow public read access to this view.
It might already have some of the statistics you are about to compute. It's populated by ANALYZE, so you might run that for new (or any) tables before checking.
-- ANALYZE tbl; -- optionally, to init / refresh
SELECT * FROM pg_stats
WHERE tablename = 'tbl'
AND schemaname = 'public';
Generic dynamic plpgsql function
You want to return the minimum value for every column in a given table. This is not a trivial task, because a function (like SQL in general) demands to know the return type at creation time - or at least at call time with the help of polymorphic data types.
This function does everything automatically and safely. Works for any table, as long as the aggregate function min() is allowed for every column. But you need to know your way around PL/pgSQL.
CREATE OR REPLACE FUNCTION f_min_of(_tbl anyelement)
RETURNS SETOF anyelement
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE (
SELECT format('SELECT (t::%2$s).* FROM (SELECT min(%1$s) FROM %2$s) t'
, string_agg(quote_ident(attname), '), min(' ORDER BY attnum)
, pg_typeof(_tbl)::text)
FROM pg_attribute
WHERE attrelid = pg_typeof(_tbl)::text::regclass
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
);
END
$func$;
Call (important!):
SELECT * FROM f_min_of(NULL::tbl); -- tbl being the table name
db<>fiddle here
Old sqlfiddle
You need to understand these concepts:
Dynamic SQL in plpgsql with EXECUTE
Polymorphic types
Row types and table types in Postgres
How to defend against SQL injection
Aggregate functions
System catalogs
Related answer with detailed explanation:
Table name as a PostgreSQL function parameter
Refactor a PL/pgSQL function to return the output of various SELECT queries
Postgres data type cast
How to set value of composite variable field using dynamic SQL
How to check if a table exists in a given schema
Select columns with particular column names in PostgreSQL
Generate series of dates - using date type as input
Special difficulty with type mismatch
I am taking advantage of Postgres defining a row type for every existing table. Using the concept of polymorphic types I am able to create one function that works for any table.
However, some aggregate functions return related but different data types as compared to the underlying column. For instance, min(varchar_column) returns text, which is bit-compatible, but not exactly the same data type. PL/pgSQL functions have a weak spot here and insist on data types exactly as declared in the RETURNS clause. No attempt to cast, not even implicit casts, not to speak of assignment casts.
That should be improved. Tested with Postgres 9.3. Did not retest with 9.4, but I am pretty sure, nothing has changed in this area.
That's where this construct comes in as workaround:
SELECT (t::tbl).* FROM (SELECT ... FROM tbl) t;
By casting the whole row to the row type of the underlying table explicitly we force assignment casts to get original data types for every column.
This might fail for some aggregate function. sum() returns numeric for a sum(bigint_column) to accommodate for a sum overflowing the base data type. Casting back to bigint might fail ...

#Erwin Brandstetter, Many thanks for the extensive answer. pg_stats does indeed provide a few things, but what I really need to draw a complete profile is a variety of things, min, max values, counts, count of nulls, mean etc... so a bunch of queries have to be ran for each columns, some with GROUP BY and such.
Also, thanks for highlighting the importance of data types, i was sort of expecting this to throw a spanner in the works at some point, my main concern was with how to automate the query generation, and its execution, this last bit being my main concern.
I have tried the function you provide (I probably will need to start learning some plpgsql) but get a error at the SELECT (t::tbl) :
ERROR: type "tbl" does not exist
btw, what is the (t::abc) notation referred as, in python this would be a list slice, but it’s probably not the case in PLPGSQL

Related

Is it possible to write polymorphic Postgres functions using RECORD parameters?

I want to write a PL/pgSQL function that can take records of different types, check the type of record provided, then do something with the record. Example:
CREATE FUNCTION polymorphic_input(arg_rec RECORD) RETURNS TEXT LANGUAGE plpgsql AS
$plpgsql$
BEGIN
IF pg_typeof(arg_rec)::text = 'information_schema.tables' THEN
RETURN (arg_rec::information_schema.tables).table_name;
ELSIF pg_typeof(arg_rec)::text = 'information_schema.columns' THEN
RETURN (arg_rec::information_schema.columns).column_name;
ELSE
RETURN 'unknown';
END IF;
END;
$plpgsql$;
When you call the function with a row from the information_schema.tables table, it should return the name of the table and it does so when you call it like this:
-- this returns table name "pg_type"
SELECT polymorphic_input((SELECT t FROM information_schema.tables t WHERE table_name = 'pg_type' LIMIT 1));
When you call the function with a row from the information_schema.columns table, it should return the name of the column and it does so when you call it like this:
-- this returns column name "objsubid"
SELECT polymorphic_input((SELECT t FROM information_schema.columns t WHERE t.column_name = 'objsubid' LIMIT 1));
The problem is you CAN'T call the function twice in a row with different row types. For example, if you call it with a row from information_schema.columns it works, then when you call it with a row form information_schema.tables, you get an error like this:
type of parameter 1 (information_schema.tables) does not match that when preparing the plan (information_schema.columns)
The words "when preparing the plan" gave me a hint that Postgres is caching the plans, so I figured running DISCARD PLANS; before each call to the function would work, and indeed it does when you run this entire query:
DISCARD PLANS; SELECT polymorphic_input((SELECT t FROM information_schema.tables t WHERE table_name = 'pg_type' LIMIT 1));
DISCARD PLANS; SELECT polymorphic_input((SELECT t FROM information_schema.columns t WHERE t.column_name = 'objsubid' LIMIT 1));
Running DISCARD PLANS; seems like the nuclear option and would no doubt affect performance in a real-world scenario. After some experimentation, I saw that using the pg_typeof function is what forces the plans to be cached. We can rewrite the function to avoid pg_typeof by adding a parameter that specifies what record type to expect:
CREATE FUNCTION polymorphic_input2(arg_rec RECORD, arg_type text) RETURNS TEXT LANGUAGE plpgsql AS
$plpgsql$
BEGIN
IF arg_type = 'tables' THEN
RETURN (arg_rec::information_schema.tables).table_name;
ELSIF arg_type = 'columns' THEN
RETURN (arg_rec::information_schema.columns).column_name;
ELSE
RETURN 'unknown';
END IF;
END;
$plpgsql$;
You can then call polymorphic_input2 multiple times in a row with different row types without error as follows:
-- no need for DISCARD PLANS here...these calls work fine.
SELECT polymorphic_input2((SELECT t FROM information_schema.tables t WHERE table_name = 'pg_type' LIMIT 1), 'tables');
SELECT polymorphic_input2((SELECT t FROM information_schema.columns t WHERE t.column_name = 'objsubid' LIMIT 1), 'columns');
The problem with polymorphic_input2 is that you have to manually give it a hint as to the type of the record to expect. My question: is it possible to implement a polymorphic function that can figure out the type of record passed to it, without the cached plan errors?
The docs mention the plan_cache_mode setting:
Prepared statements (either explicitly prepared or implicitly generated, for example by PL/pgSQL) can be executed using custom or generic plans. Custom plans are made afresh for each execution using its specific set of parameter values, while generic plans do not rely on the parameter values and can be re-used across executions....The allowed values are auto (the default), force_custom_plan and force_generic_plan...
I tried removing the error by running SET plan_cache_mode = force_custom_plan; but that didn't help (which is probably a bug because the docs imply it should force a custom plan in each call, but Postgres is still caching the plan and causing errors). Only DISCARD PLANS worked.
The docs on plan caching seem to recognize this issue and say:
The mutable nature of record variables presents another problem in this connection. When fields of a record variable are used in expressions or statements, the data types of the fields must not change from one call of the function to the next, since each expression will be analyzed using the data type that is present when the expression is first reached. EXECUTE can be used to get around this problem when necessary.
...and a little further down the docs indicate this shouldn't be happening:
Likewise, functions having polymorphic argument types have a separate statement cache for each combination of actual argument types they have been invoked for, so that data type differences do not cause unexpected failures.
This is further confirmed by the docs EXECUTE which say:
Also, there is no plan caching for commands executed via EXECUTE. Instead, the command is always planned each time the statement is run. Thus the command string can be dynamically created within the function to perform actions on different tables and columns.
So I tried another variant that tries to run pg_typeof via EXECUTE:
CREATE FUNCTION polymorphic_input3(arg_rec RECORD) RETURNS TEXT LANGUAGE plpgsql AS
$plpgsql$
DECLARE
rec_type text;
BEGIN
EXECUTE 'SELECT pg_typeof($1)' INTO rec_type USING arg_rec;
IF rec_type = 'information_schema.tables' THEN
RETURN (arg_rec::information_schema.tables).table_name;
ELSIF rec_type = 'information_schema.columns' THEN
RETURN (arg_rec::information_schema.columns).column_name;
ELSE
RETURN 'unknown';
END IF;
END;
$plpgsql$;
...but that still produces the same error as the variant which calls pg_typeof directly.
My question once again: is it possible (in Postgres 14) to implement a polymorphic function that can figure out the type of record passed to it, without the cached plan errors?

PostgreSQL. Is such function vulnerable to SQL injection or is it safe?

Functions that looks problematic
I'm exploring postgresql database and I see a recurring pattern:
CREATE OR REPLACE FUNCTION paginated_class(_orderby text DEFAULT NULL, _limit int DEFAULT 10, _offset int DEFAULT 0)
RETURNS SETOF pg_class
LANGUAGE PLPGSQL
AS $$
BEGIN
return query execute'
select * from pg_class
order by '|| coalesce (_orderby, 'relname ASC') ||'
limit $1 offset $1
'
USING _limit, _offset;
END;
$$;
Sample usage:
SELECT * FROM paginated_class(_orderby:='reltype DESC, relowner ASC ')
Repeating is:
_orderby is passed as text. It could be any combination of fields of returned SETOF type. E.g. 'relname ASC, reltype DESC'
_orderby parameter is not sanitized or checked in any way
_limit and _offset are integers
DB Fiddle for that: https://www.db-fiddle.com/f/vF6bCN37yDrjBiTEsdEwX6/1
Question: is such function vulnerable to SQL injection or not?
By external signs it's possible to suspect that such function is vulnerable to sql injection.
But all my attempts to find combination of params failed.
E.g.
CREATE TABLE T(id int);
SELECT * FROM paginated_class(_orderby:='reltype; DROP TABLE T; SELECT * FROM pg_class');
will return "Query Error: error: cannot open multi-query plan as cursor".
I did not found a way to exploit vulnerability if it exists with UPDATE/INSERT/DELETE.
So can we conclude that such function is actually safe?
If so: then why?
UPDATE. Possible plan for attack
Maybe I was not clear: I'm not asking about general guidelines but for experimental exploit of vulnerability or proof that such exploit is not possible.
DB Fiddle for this: https://www.db-fiddle.com/f/vF6bCN37yDrjBiTEsdEwX6/4 (or you can provide other of course)
My conclusions so far
A. Such attack could be possible if _orderby will have parts:
sql code that suppresses output of first SELECT
do something harmful
select * from pg_class so that it satisfies RETURNS SETOF pg_class
E.g.
SELECT * FROM paginated_class(_orderby:='relname; DELETE FROM my_table; SELECT * FROM pg_class')
It's easy for 2 and 3. I don't know a way to do 1st part.
This will generate: "error: cannot open multi-query plan as cursor"
B. If it's not possible to suppress first SELECT
Then:
every postgresql function works in separate transaction
because of error this transaction will be rollbacked
there is no autonomous transactions as in Oracle
for non-transactional ops: I know only about sequence-related ops
everything else be that DML or DDL is transactional
So? Can we conclude that such function is actually safe?
Or I'm missing something?
UPDATE 2. Attack using prepared function
From answer https://stackoverflow.com/a/69189090/1168212
A. It's possible to implement Denial-of-service attack putting expensive calculation
B. Side-effects:
If you put a function with side effects into the ORDER BY clause, you may also be able to modify data.
Let's try the latter:
CREATE FUNCTION harmful_fn()
RETURNS bool
LANGUAGE SQL
AS '
DELETE FROM my_table;
SELECT true;
';
SELECT * FROM paginated_class(_orderby:='harmful_fn()', _limit:=1);
https://www.db-fiddle.com/f/vF6bCN37yDrjBiTEsdEwX6/8
Yes.
So if an attacker has right to create functions: non-DOS attack is possible too.
I accept Laurenz Albe answer but: is it possible to do non-DOS attack without function?
Ideas?
No, that is not safe. An attacker could put any code into your ORDER BY clause via the _orderby parameter.
For example, you can pass an arbitrary subquery, as long as it returns only a single row: (SELECT max(i) FROM generate_series(1, 100000000000000) AS i). That can easily be used for a denial-of-service attack, if the query is expensive enough. Or, like with this example, you can cause a (brief) out-of-space condition with temporary files.
If you put a function with side effects into the ORDER BY clause, you may also be able to modify data.

Lack of user-defined table types for passing data between stored procedures in PostgreSQL

So I know there's already similar questions on this, but most of them are very old, or they have non-answers, like "why would you even want to do this?", or "table types aren't performant and we don't want them here", or even "you need to rethink your whole approach".
So what I would ideally want to do is to declare a user-defined table type like this:
CREATE TYPE my_table AS TABLE (
a int,
b date);
Then use this in a procedure as a parameter, like this:
CREATE PROCEDURE my_procedure (
my_table_parameter my_table)
Then be able to do stuff like this:
INSERT INTO
my_temp_table
SELECT
m.a,
m.b,
o.useful_data
FROM
my_table m
INNER JOIN my_schema.my_other_table o ON o.a = m.a;
This is for a billing system, let's make it a mobile phone billing system (it isn't but it's similar enough to work). Here's several ways I might call my procedure:
I sometimes want to call my procedure for one row, to create an adhoc bill for one customer. I want to do this while they are on the phone and get them an immediate result. Maybe I just fixed something wrong with their bill, and they're angry!
I sometimes want to bill everyone who's due a bill on a specific date. Maybe this is their preferred billing date, and they're on a monthly billing cycle?
I sometimes want to bill in bulk, but my data comes from a CSV file. Maybe this is a custom billing run that I have no way of understanding the motivation for?
Maybe I want to final bill customers who recently left?
Sometimes I might need to rebill customers because a mistake was made. Maybe a tariff rate was uploaded incorrectly, and everyone on that tariff needs their bill regenerating?
I want to split my code up into modules, it's easier to work like this, and it allows a large degree of reusability. So what I don't want to do is to write n billing systems, where each one handles one of the use cases above. I want a generic billing engine that is a stored procedure, uses set-based queries where possible, and works just about as well for one customer as it does for 1,000,000. If anything I want it optimised for lots of customers, as long as it only takes a few seconds to run for one customer.
If I had SQL Server I would create a user-defined table type, and this would contain a list of customers, the date they need billing to, and maybe anything else that would be useful. But let's just leave it at the simplest case possible, an integer representing a customer and a date to say what date I want to bill them up to, like my example above.
I've spent some days now looking at the options available in PostgreSQL, and these are the conclusions I have reached. I would be extremely grateful for any help with this, correcting my incorrect assumptions, or telling me of another way I have overlooked.
Process-Keyed Table
Create a table that looks like this:
CREATE TABLE customer_list (
process_key int,
customer_id int,
bill_to_date date);
When I want to call my billing system I get a unique key (probably from a sequence), load up the rows with my list of customers/ dates to bill them to, and add the unique key to every row. Now I can simply pass the unique key to my billing engine, and it can scoop up the data at the other side.
This seems the most optional way to proceed, but it's clunky, like something I would have done in SQL Server 20 years ago, when there weren't better options, it's prone to leaving data lying around, and it doesn't seem like it would be optimal, as the data will have to be squirted to physical storage, and read back into memory.
Use a Temporary Table
So I'm thinking that I create a temporary table, call it customer_temp, and make it ON COMMIT DROP. When I call my stored procedure to bill customers it picks the data out of the temporary table, does what it needs to do, and then when it ends the table is vacuumed away.
But this doesn't work if I call the billing engine more than once at a time. So I need to give my temporary tables unique names, and also pass this name into my billing engine, which has to use some vile dynamic SQL to get the data into some usable area (probably another temporary table?).
Use a TYPE
When I first saw this I thought I had the answer, but it turns out to not work for multidimensional arrays (or I'm doing something wrong). I quickly learned that for a single dimensional array I could get this working by just pretending that a PostgreSQL TYPE was a user defined table type. But it obviously isn't.
So passing in an array of integers, e.g. customer_list int[]; works fine, and I can use the ARRAY command to populate that array from a query, and then it's easy to access it with =ANY(customer_list) at the other side. It's not ideal, and I bet it's awful for large data sets, but it's neat.
However, it doesn't work for multidimensional arrays, the ARRAY command can't cope with a query that has more than one column in it, and the other side becomes more awkward, needing an UNNEST command to unpack the data into a format where it's usable.
I defined my type like this:
CREATE TYPE customer_list (
customer_id int,
bill_to_date date);
...and then used it in my procedure parameter list as customer_list[], which seems to work, but I have no nice way to populate this structure from the calling procedure.
I feel I'm missing something here, as I never got it to work properly as a prototype, but I also feel this is a potential dead end anyway, as it won't cope with large numbers of rows in a performant way, and this isn't what arrays are meant for. The first thing I do with the array at the other side, is unpack it back into a table again, which seems counterintuitive.
Ref Cursors
I read that you can use REF CURSORs, but I'm not quite sure how this works. It seems that you open a cursor in one procedure, and then pass a handle to it to another procedure. It doesn't seem like this is going to be set-based, but I could be wrong, and I just haven't found a way to convert a cursor back into a table again?
Write Everything as One Massive Procedure
I'm not ruling this out, as PostgreSQL seems to be leading me this way. If I write one enormous billing engine that copes with every eventuality, then my only issue will be when this is called using an externally provided list. Every other issue can be solved by just not having to pass data between procedures.
I can probably cope with this by loading the data into a batch table, and feeding this in as one of the options. It's awful, and it's like going back to the 1990s, but if this is what PostgreSQL wants me to do, then so be it.
Final Thoughts
I'm sure I'm going to be asked for code examples, which I will happily provide, but I avoided because this post is already uber-long, and what I'm trying to achieve is actually quite simple I feel.
Having typed all of this out, I'm still feeling that there must be a way of working around the "temporary table names must be unique", as this would work nicely if I found a way to let it be called in a multithreaded way.
Okay, taking the bits I was missing I came up with this, which seems to work:
CREATE TYPE IF NOT EXISTS bill_list AS (
customer_id int,
bill_date date);
CREATE TABLE IF NOT EXISTS billing.pending_bill (
customer_id int,
bill_date date);
CREATE TABLE IF NOT EXISTS billing.customer (
customer_id int,
billed boolean,
last_billed date);
INSERT INTO billing.customer
VALUES
(1, false, NULL::date),
(2, false, NULL::date),
(3, false, NULL::date);
INSERT INTO billing.pending_bill
VALUES
(1, '20210108'::date),
(2, '20210105'::date),
(3, '20210104'::date);
CREATE OR REPLACE PROCEDURE billing.bill_customer_list (
pending bill_list[])
LANGUAGE PLPGSQL
AS
$$
BEGIN
UPDATE
billing.customer c
SET
billed = true,
last_billed = p.bill_date
FROM
UNNEST(pending) p
WHERE
p.customer_id = c.customer_id;
END;
$$
CREATE OR REPLACE PROCEDURE billing.test ()
LANGUAGE PLPGSQL
AS
$$
DECLARE pending bill_list[];
BEGIN
pending := ARRAY(SELECT p FROM billing.pending_bill p);
CALL billing.bill_customer_list (pending);
END;
$$
Your select in the procedure returns multiple columns. But you want to create an array of a custom type. So your SELECT list needs to return the type, not *.
You don't need the bill_list type either, as every table has a corresponding type and you can simply pass an array of the table's type.
So you can use the following:
CREATE PROCEDURE bill_customer_list (
pending pending_bill[])
LANGUAGE PLPGSQL
AS
$$
BEGIN
UPDATE
customer c
SET
billed = true
FROM unnest(pending) p --<< treat the array as a table
WHERE
p.customer_id = c.customer_id;
END;
$$
;
CREATE PROCEDURE test ()
LANGUAGE PLPGSQL
AS
$$
DECLARE
pending pending_bill[];
BEGIN
pending := ARRAY(SELECT p FROM pending_bill p);
CALL bill_customer_list (pending);
END;
$$
;
The select p returns a record (of the same type as the table) as a single column in the result.
The := is the assignment operator in PL/pgSQL and typically much faster than a SELECT .. INTO variable. Although in this case the performance difference wouldn't matter much I guess.
Online example
If you do want to keep the extra type bill_list around because it e.g. contains less columns than pending_bill you need to select only those columns that match the type's column and create a record by enclosing them in parentheses. (a,b) is a single column with an anonymous record type (and two fields). a,b are two distinct columns
CREATE PROCEDURE test ()
LANGUAGE PLPGSQL
AS
$$
DECLARE
pending bill_list[];
BEGIN
pending := ARRAY(SELECT (id, customer_id) FROM pending_bill p);
CALL bill_customer_list (pending);
END;
$$
;
You should also note that DECLARE starts a block in PL/pgSQL where multiple variables can be defined. There is no need to write one DECLARE for each variable (your formatting of the DECLARE block let's me think that you assumed you need one DECLARE per variable as is the case in T-SQL)

Postgres using functions inside queries

I have a table with common word values to match against brands - so when someone types in "coke" I want to match any possible brand names associated with it as well as the original term.
CREATE TABLE word_association ( commonterm TEXT, assocterm TEXT);
INSERT INTO word_association ('coke', 'coca-cola'), ('coke', 'cocacola'), ('coke', 'coca-cola');
I have a function to create a list of these values in a pipe-delim string for pattern matching:
CREATE OR REPLACE FUNCTION usp_get_search_terms(userterm text)
RETURNS text AS
$BODY$DECLARE
returnstr TEXT DEFAULT '';
BEGIN
SET DATESTYLE TO DMY;
returnstr := userterm;
IF EXISTS (SELECT 1 FROM word_association WHERE LOWER(commonterm) = LOWER(userterm)) THEN
SELECT returnstr || '|' || string_agg(assocterm, '|') INTO returnstr
FROM word_association
WHERE commonterm = userterm;
END IF;
RETURN returnstr;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION usp_get_search_terms(text)
OWNER TO customer_role;
If you call SELECT * FROM usp_get_search_terms('coke') you end up with
coke|coca-cola|cocacola|coca cola
EDIT: this function runs <100ms so it works fine.
I want to run a query with this text inserted e.g.
SELECT X.article_number, X.online_description
FROM articles X
WHERE LOWER(X.online_description) % usp_get_search_terms ('coke');
This takes approx 56s to run against my table of ~500K records.
If I get the raw text and use it in the query it takes ~300ms e.g.
SELECT X.article_number, X.online_description
FROM articles X
WHERE X.online_description % '(coke|coca-cola|cocacola|coca cola)';
The result sets are identical.
I've tried modifying what the output string from the function to e.g. enclose it in quotes and parentheses but it doesn't seem to make a difference.
Can someone please advise why there is a difference here? Is it the data type or something about calling functions inside queries? Thanks.
Your function might take 100ms, but it's not calling your function once; it's calling it 500,000 times.
It's because your function is declared VOLATILE. This tells Postgres that either the function returns different values when called multiple times within a query (like clock_timestamp() or random()), or that it alters the state of the database in some way (for example, by inserting records).
If your function contains only SELECTs, with no INSERTs, calls to other VOLATILE functions, or other side-effects, then you can declare it STABLE instead. This tells the planner that it can call the function just once and reuse the result without affecting the outcome of the query.
But your function does have side-effects, due to the SET DATESTYLE statement, which takes effect for the rest of the session. I doubt this was the intention, however. You may be able to remove it, as it doesn't look like date formatting is relevant to anything in there. But if it is necessary, the correct approach is to use the SET clause of the CREATE FUNCTION statement to change it only for the duration of the function call:
...
$BODY$
LANGUAGE plpgsql STABLE
SET DATESTYLE TO DMY
COST 100;
The other issue with the slow version of the query is the call to LOWER(X.online_description), which will prevent the query from utilising the index (since online_description is indexed, but LOWER(online_description) is not).
With these changes, the performance of both queries is the same; see this SQLFiddle.
So the answer came to me about dawn this morning - CTEs to the rescue!
Particularly as this is the "simple" version of a very large query, it helps to get this defined once in isolation, then do the matching against it. The alternative (given I'm calling this from a NodeJS platform) is to have one request retrieve the string of terms, then make another request to pass the string back. Not elegant.
WITH matches AS
( SELECT * FROM usp_get_search_terms('coke') )
, main AS
( SELECT X.article_number, X.online_description
FROM articles X
JOIN matches M ON X.online_description % M.usp_get_search_terms )
SELECT * FROM main
Execution time is somewhere around 300-500ms depending on term searched and articles returned.
Thanks for all your input guys - I've learned a few things about PostGres that my MS-SQL background didn't necessarily prepare me for :)
Have you tried removing the IF EXISTS() and simply using:
SELECT returnstr || '|' || string_agg(assocterm, '|') INTO returnstr
FROM word_association
WHERE LOWER(commonterm) = LOWER(userterm)
In instead of calling the function for each row call it once:
select x.article_number, x.online_description
from
woolworths.articles x
cross join
woolworths.usp_get_search_terms ('coke') c (s)
where lower(x.online_description) % s

Why does PostgreSQL treat my query differently in a function?

I have a very simple query that is not much more complicated than:
select *
from table_name
where id = 1234
...it takes less than 50 milliseconds to run.
Took that query and put it into a function:
CREATE OR REPLACE FUNCTION pie(id_param integer)
RETURNS SETOF record AS
$BODY$
BEGIN
RETURN QUERY SELECT *
FROM table_name
where id = id_param;
END
$BODY$
LANGUAGE plpgsql STABLE;
This function when executed select * from pie(123); takes 22 seconds.
If I hard code an integer in place of id_param, the function executes in under 50 milliseconds.
Why does the fact that I am using a parameter in the where statement cause my function to run slow?
Edit to add concrete example:
CREATE TYPE test_type AS (gid integer, geocode character varying(9))
CREATE OR REPLACE FUNCTION geocode_route_by_geocode(geocode_param character)
RETURNS SETOF test_type AS
$BODY$
BEGIN
RETURN QUERY EXECUTE
'SELECT gs.geo_shape_id AS gid,
gs.geocode
FROM geo_shapes gs
WHERE geocode = $1
AND geo_type = 1
GROUP BY geography, gid, geocode' USING geocode_param;
END;
$BODY$
LANGUAGE plpgsql STABLE;
ALTER FUNCTION geocode_carrier_route_by_geocode(character)
OWNER TO root;
--Runs in 20 seconds
select * from geocode_route_by_geocode('999xyz');
--Runs in 10 milliseconds
SELECT gs.geo_shape_id AS gid,
gs.geocode
FROM geo_shapes gs
WHERE geocode = '9999xyz'
AND geo_type = 1
GROUP BY geography, gid, geocode
Update in PostgreSQL 9.2
There was a major improvement, I quote the release notes here:
Allow the planner to generate custom plans for specific parameter
values even when using prepared statements (Tom Lane)
In the past, a prepared statement always had a single "generic" plan
that was used for all parameter values, which was frequently much
inferior to the plans used for non-prepared statements containing
explicit constant values. Now, the planner attempts to generate custom
plans for specific parameter values. A generic plan will only be used
after custom plans have repeatedly proven to provide no benefit. This
change should eliminate the performance penalties formerly seen from
use of prepared statements (including non-dynamic statements in
PL/pgSQL).
Original answer for PostgreSQL 9.1 or older
A plpgsql functions has a similar effect as the PREPARE statement: queries are parsed and the query plan is cached.
The advantage is that some overhead is saved for every call.
The disadvantage is that the query plan is not optimized for the particular parameter values it is called with.
For queries on tables with even data distribution, this will generally be no problem and PL/pgSQL functions will perform somewhat faster than raw SQL queries or SQL functions. But if your query can use certain indexes depending on the actual values in the WHERE clause or, more generally, chose a better query plan for the particular values, you may end up with a sub-optimal query plan. Try an SQL function or use dynamic SQL with EXECUTE to force a the query to be re-planned for every call. Could look like this:
CREATE OR REPLACE FUNCTION pie(id_param integer)
RETURNS SETOF record AS
$BODY$
BEGIN
RETURN QUERY EXECUTE
'SELECT *
FROM table_name
where id = $1'
USING id_param;
END
$BODY$
LANGUAGE plpgsql STABLE;
Edit after comment:
If this variant does not change the execution time, there must be other factors at play that you may have missed or did not mention. Different database? Different parameter values? You would have to post more details.
I add a quote from the manual to back up my above statements:
An EXECUTE with a simple constant command string and some USING
parameters, as in the first example above, is functionally equivalent
to just writing the command directly in PL/pgSQL and allowing
replacement of PL/pgSQL variables to happen automatically. The
important difference is that EXECUTE will re-plan the command on each
execution, generating a plan that is specific to the current parameter
values; whereas PL/pgSQL normally creates a generic plan and caches it
for re-use. In situations where the best plan depends strongly on the
parameter values, EXECUTE can be significantly faster; while when the
plan is not sensitive to parameter values, re-planning will be a
waste.