Creating Postgres/PostGIS function to update table values based on spatial query - postgresql

I have a fair bit of experience with Postgres and PostGIS running queries and scripts, but no experience creating functions. I feel like what I want to achieve is rather more complex than the examples I've seen so I'm hoping someone can help me out.
I'm creating a web application that allows users to update lot boundary records stored in a spatial table based on the intersection with a polygon drawn on a map and some values entered into a form. I don't know if there is a way of storing a subset of database records in an array and iterating over it, updating each record in turn, or if I have to run separate update scripts within the function. I'm also not sure if its possible to pass in a table name as an argument to a function, as I'd like to just run the function and have it work on different tables.
If I would create a function that would do all the things I wanted it to do by simply running a bunch of separate UPDATE scripts, it might look like the following (function not actually tested):
CREATE OR REPLACE FUNCTION updateLots(wkt_geom text, tablename varchar(25), landuse varchar(25), density NUMERIC(4,1))
RETURNs VOID AS
$$
BEGIN
UPDATE [tablename] SET landuse = [landuse] WHERE ST_Intersection(geom, GeomFromWKT([wkt_geom], 3857));
UPDATE [tablename] SET density = [density] WHERE ST_Intersection(geom, GeomFromWKT([wkt_geom], 3857)) WHERE landuse = 'Residential';
UPDATE [tablename] SET density = NULL WHERE ST_Intersection(geom, GeomFromWKT([wkt_geom], 3857)) WHERE landuse != 'Residential';
UPDATE [tablename] SET yield = area / 10000 * [density] WHERE ST_Intersection(geom, GeomFromWKT([wkt_geom], 3857));
END;
$$
LANGUAGE plpgsql;
While this approach would save me from running several nested database scripts from the server, it seems inefficient and Postgres will not accept tablename as an argument as is. Hence I'm wondering the following 2 things:
Is there a way to create a subset of a table based on the spatial intersection with a provided geometry and iterate over each record, performing the necessary updates? If yes, how might this function be specified?
Can I provide a table name as an argument to a function?
I'm not sure of the best way to proceed, so if someone can tell me if what I want to do is possible and if so get me started with specifying a function I'd be very appreciative.
Cheers.

In PostgreSQL you can pass a table name to a function and then operate on that table, but you would have to EXECUTE a dynamic query, which is inefficient because the query has to be parsed and planned on every function call. If you have only a few tables then it is probably better to just put the command for each of the few tables in one function: the function is bigger but you have to call it only once and the queries can be planned and stored for future usage by the query planner.
Making a subset of the table based on the intersection between geometries is probably not a good thing. Instead, work on your UPDATE command, which can be greatly optimized:
CREATE FUNCTION updateLots(wkt_geom text, lu varchar(25), dens NUMERIC(4,1))
RETURNS void AS $$
BEGIN
UPDATE t1 SET landuse = lu,
density = (CASE WHEN lu = 'Residential' THEN dens END), -- ELSE NULL
yield = area * 0.0001 * dens
WHERE ST_Intersection(geom, GeomFromWKT(wkt_geom, 3857));
...; -- Same for other tables
END; $$ LANGUAGE plpgsql STRICT;
A few notes:
I renamed the parameters such that they are different from column names.
Update all columns in a single command. This means that ST_Intersection() gets called only once - potentially a huge cost saving.
Multiply by 0.0001 instead of dividing by 10000.
The function is declared STRICT: don't call the function if parameters are not supplied.
Another potential big cost saver is to pass the wkt_geom as a geometry, not as text. If this can be done in your case then you do not have to do the expensive ST_GeomFromWKT().
Since you want to run the function with a passed table name, you should use the following version:
CREATE FUNCTION updateLots(wkt_geom geometry, tablename varchar(25), lu varchar(25), dens NUMERIC(4,1))
RETURNS void AS $$
BEGIN
EXECUTE format('
UPDATE %I SET landuse = %L,
density = (CASE WHEN %2$L = ''Residential'' THEN $1 END),
yield = area * 0.0001 * $1
WHERE ST_Intersection(geom, $2)', tablename, lu)
USING dens, wkt_geom;
END; $$ LANGUAGE plpgsql STRICT;
In this case you should definitely convert the wkt_geom data to a geometry before calling this function once for each table name.

Related

Postgres using functions inside queries

I have a table with common word values to match against brands - so when someone types in "coke" I want to match any possible brand names associated with it as well as the original term.
CREATE TABLE word_association ( commonterm TEXT, assocterm TEXT);
INSERT INTO word_association ('coke', 'coca-cola'), ('coke', 'cocacola'), ('coke', 'coca-cola');
I have a function to create a list of these values in a pipe-delim string for pattern matching:
CREATE OR REPLACE FUNCTION usp_get_search_terms(userterm text)
RETURNS text AS
$BODY$DECLARE
returnstr TEXT DEFAULT '';
BEGIN
SET DATESTYLE TO DMY;
returnstr := userterm;
IF EXISTS (SELECT 1 FROM word_association WHERE LOWER(commonterm) = LOWER(userterm)) THEN
SELECT returnstr || '|' || string_agg(assocterm, '|') INTO returnstr
FROM word_association
WHERE commonterm = userterm;
END IF;
RETURN returnstr;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION usp_get_search_terms(text)
OWNER TO customer_role;
If you call SELECT * FROM usp_get_search_terms('coke') you end up with
coke|coca-cola|cocacola|coca cola
EDIT: this function runs <100ms so it works fine.
I want to run a query with this text inserted e.g.
SELECT X.article_number, X.online_description
FROM articles X
WHERE LOWER(X.online_description) % usp_get_search_terms ('coke');
This takes approx 56s to run against my table of ~500K records.
If I get the raw text and use it in the query it takes ~300ms e.g.
SELECT X.article_number, X.online_description
FROM articles X
WHERE X.online_description % '(coke|coca-cola|cocacola|coca cola)';
The result sets are identical.
I've tried modifying what the output string from the function to e.g. enclose it in quotes and parentheses but it doesn't seem to make a difference.
Can someone please advise why there is a difference here? Is it the data type or something about calling functions inside queries? Thanks.
Your function might take 100ms, but it's not calling your function once; it's calling it 500,000 times.
It's because your function is declared VOLATILE. This tells Postgres that either the function returns different values when called multiple times within a query (like clock_timestamp() or random()), or that it alters the state of the database in some way (for example, by inserting records).
If your function contains only SELECTs, with no INSERTs, calls to other VOLATILE functions, or other side-effects, then you can declare it STABLE instead. This tells the planner that it can call the function just once and reuse the result without affecting the outcome of the query.
But your function does have side-effects, due to the SET DATESTYLE statement, which takes effect for the rest of the session. I doubt this was the intention, however. You may be able to remove it, as it doesn't look like date formatting is relevant to anything in there. But if it is necessary, the correct approach is to use the SET clause of the CREATE FUNCTION statement to change it only for the duration of the function call:
...
$BODY$
LANGUAGE plpgsql STABLE
SET DATESTYLE TO DMY
COST 100;
The other issue with the slow version of the query is the call to LOWER(X.online_description), which will prevent the query from utilising the index (since online_description is indexed, but LOWER(online_description) is not).
With these changes, the performance of both queries is the same; see this SQLFiddle.
So the answer came to me about dawn this morning - CTEs to the rescue!
Particularly as this is the "simple" version of a very large query, it helps to get this defined once in isolation, then do the matching against it. The alternative (given I'm calling this from a NodeJS platform) is to have one request retrieve the string of terms, then make another request to pass the string back. Not elegant.
WITH matches AS
( SELECT * FROM usp_get_search_terms('coke') )
, main AS
( SELECT X.article_number, X.online_description
FROM articles X
JOIN matches M ON X.online_description % M.usp_get_search_terms )
SELECT * FROM main
Execution time is somewhere around 300-500ms depending on term searched and articles returned.
Thanks for all your input guys - I've learned a few things about PostGres that my MS-SQL background didn't necessarily prepare me for :)
Have you tried removing the IF EXISTS() and simply using:
SELECT returnstr || '|' || string_agg(assocterm, '|') INTO returnstr
FROM word_association
WHERE LOWER(commonterm) = LOWER(userterm)
In instead of calling the function for each row call it once:
select x.article_number, x.online_description
from
woolworths.articles x
cross join
woolworths.usp_get_search_terms ('coke') c (s)
where lower(x.online_description) % s

Executing queries dynamically in PL/pgSQL

I have found solutions (I think) to the problem I'm about to ask for on Oracle and SQL Server, but can't seem to translate this into a Postgres solution. I am using Postgres 9.3.6.
The idea is to be able to generate "metadata" about the table content for profiling purposes. This can only be done (AFAIK) by having queries run for each column so as to find out, say... min/max/count values and such. In order to automate the procedure, it is preferable to have the queries generated by the DB, then executed.
With an example salesdata table, I'm able to generate a select query for each column, returning the min() value, using the following snippet:
SELECT 'SELECT min('||column_name||') as minval_'||column_name||' from salesdata '
FROM information_schema.columns
WHERE table_name = 'salesdata'
The advantage being that the db will generate the code regardless of the number of columns.
Now there's a myriad places I had in mind for storing these queries, either a variable of some sort, or a table column, the idea being to then have these queries execute.
I thought of storing the generated queries in a variable then executing them using the EXECUTE (or EXECUTE IMMEDIATE) statement which is the approach employed here (see right pane), but Postgres won't let me declare a variable outside a function and I've been scratching my head with how this would fit together, whether that's even the direction to follow, perhaps there's something simpler.
Would you have any pointers, I'm currently trying something like this, inspired by this other question but have no idea whether I'm headed in the right direction:
CREATE OR REPLACE FUNCTION foo()
RETURNS void AS
$$
DECLARE
dyn_sql text;
BEGIN
dyn_sql := SELECT 'SELECT min('||column_name||') from salesdata'
FROM information_schema.columns
WHERE table_name = 'salesdata';
execute dyn_sql
END
$$ LANGUAGE PLPGSQL;
System statistics
Before you roll your own, have a look at the system table pg_statistic or the view pg_stats:
This view allows access only to rows of pg_statistic that correspond
to tables the user has permission to read, and therefore it is safe to
allow public read access to this view.
It might already have some of the statistics you are about to compute. It's populated by ANALYZE, so you might run that for new (or any) tables before checking.
-- ANALYZE tbl; -- optionally, to init / refresh
SELECT * FROM pg_stats
WHERE tablename = 'tbl'
AND schemaname = 'public';
Generic dynamic plpgsql function
You want to return the minimum value for every column in a given table. This is not a trivial task, because a function (like SQL in general) demands to know the return type at creation time - or at least at call time with the help of polymorphic data types.
This function does everything automatically and safely. Works for any table, as long as the aggregate function min() is allowed for every column. But you need to know your way around PL/pgSQL.
CREATE OR REPLACE FUNCTION f_min_of(_tbl anyelement)
RETURNS SETOF anyelement
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE (
SELECT format('SELECT (t::%2$s).* FROM (SELECT min(%1$s) FROM %2$s) t'
, string_agg(quote_ident(attname), '), min(' ORDER BY attnum)
, pg_typeof(_tbl)::text)
FROM pg_attribute
WHERE attrelid = pg_typeof(_tbl)::text::regclass
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
);
END
$func$;
Call (important!):
SELECT * FROM f_min_of(NULL::tbl); -- tbl being the table name
db<>fiddle here
Old sqlfiddle
You need to understand these concepts:
Dynamic SQL in plpgsql with EXECUTE
Polymorphic types
Row types and table types in Postgres
How to defend against SQL injection
Aggregate functions
System catalogs
Related answer with detailed explanation:
Table name as a PostgreSQL function parameter
Refactor a PL/pgSQL function to return the output of various SELECT queries
Postgres data type cast
How to set value of composite variable field using dynamic SQL
How to check if a table exists in a given schema
Select columns with particular column names in PostgreSQL
Generate series of dates - using date type as input
Special difficulty with type mismatch
I am taking advantage of Postgres defining a row type for every existing table. Using the concept of polymorphic types I am able to create one function that works for any table.
However, some aggregate functions return related but different data types as compared to the underlying column. For instance, min(varchar_column) returns text, which is bit-compatible, but not exactly the same data type. PL/pgSQL functions have a weak spot here and insist on data types exactly as declared in the RETURNS clause. No attempt to cast, not even implicit casts, not to speak of assignment casts.
That should be improved. Tested with Postgres 9.3. Did not retest with 9.4, but I am pretty sure, nothing has changed in this area.
That's where this construct comes in as workaround:
SELECT (t::tbl).* FROM (SELECT ... FROM tbl) t;
By casting the whole row to the row type of the underlying table explicitly we force assignment casts to get original data types for every column.
This might fail for some aggregate function. sum() returns numeric for a sum(bigint_column) to accommodate for a sum overflowing the base data type. Casting back to bigint might fail ...
#Erwin Brandstetter, Many thanks for the extensive answer. pg_stats does indeed provide a few things, but what I really need to draw a complete profile is a variety of things, min, max values, counts, count of nulls, mean etc... so a bunch of queries have to be ran for each columns, some with GROUP BY and such.
Also, thanks for highlighting the importance of data types, i was sort of expecting this to throw a spanner in the works at some point, my main concern was with how to automate the query generation, and its execution, this last bit being my main concern.
I have tried the function you provide (I probably will need to start learning some plpgsql) but get a error at the SELECT (t::tbl) :
ERROR: type "tbl" does not exist
btw, what is the (t::abc) notation referred as, in python this would be a list slice, but it’s probably not the case in PLPGSQL

PostgreSQL insert or update trigger function volatility category

Assume, i have 2 tables in my DB (postgresql-9.x)
CREATE TABLE FOLDER (
KEY BIGSERIAL PRIMARY KEY,
PATH TEXT,
NAME TEXT
);
CREATE TABLE FOLDERFILE (
FILEID BIGINT,
PATH TEXT,
PATHKEY BIGINT
);
I automatically update FOLDERFILE.PATHKEY from FOLDER.KEY whenever i insert into or update FOLDERFILE:
CREATE OR REPLACE FUNCTION folderfile_fill_pathkey() RETURNS trigger AS $$
DECLARE
pathkey bigint;
changed boolean;
BEGIN
IF tg_op = 'INSERT' THEN
changed := TRUE;
ELSE IF old.FILEID != new.FILEID THEN
changed := TRUE;
END IF;
END IF;
IF changed THEN
SELECT INTO pathkey key FROM FOLDER WHERE PATH = new.path;
IF FOUND THEN
new.pathkey = pathkey;
ELSE
new.pathkey = NULL;
END IF;
END IF;
RETURN new;
END
$$ LANGUAGE plpgsql VOLATILE;
CREATE TRIGGER folderfile_fill_pathkey_trigger AFTER INSERT OR UPDATE
ON FOLDERFILE FOR EACH ROW EXECUTE PROCEDURE fcliplink_fill_pathkey();
So the question is about function folderfile_fill_pathkey() volatility. Documentations says
Any function with side-effects must be labeled VOLATILE
But as far as i understand – this function does not change any data in the tables it rely on, so i can mark this function as IMMUTABLE. It that correct?
Would there be any problem with IMMUTABLE trigger function if I bulk-insert many rows into FOLDERFILE within the same transaction, like:
BEGIN;
INSERT INTO FOLDERFILE ( ... );
...
INSERT INTO FOLDERFILE ( ... );
COMMIT;
Firstly, as #pozs already pointed out, the function definition you have provided is most definitely STABLE rather than IMMUTABLE since it performs database look-ups. This means that the result is not simply derived from the input parameters (as IMMUTABLE would suggest), but also from the data stored in your FOLDER table (which is bound to change). As per the documentation:
STABLE indicates that the function cannot modify the database, and
that within a single table scan it will consistently return the same
result for the same argument values, but that its result could change
across SQL statements. This is the appropriate selection for functions
whose results depend on database lookups, parameter variables (such as
the current time zone), etc.
Secondly, adding stability modifiers (IMMUTABLE/STABLE/VOLATILE) to your trigger functions serves an illustrative purpose at best, since AFAIK PostgreSQL doesn't actually perform any planning that would warrant their use. The following post from the pgsql-hackers mailing list seems to support my claim:
Volatility is a complete no-op for a trigger function anyway, as are
other planner parameters such as cost/rows, because there is no
planning involved in trigger calls.
To sum up: you're probably better off avoiding the stability keywords in your trigger(!) procedures for now, since including them seems to add little to no benefit but entails several unexpected caveats/pitfalls (see the end of #pozs's first comment).

Why does PostgreSQL treat my query differently in a function?

I have a very simple query that is not much more complicated than:
select *
from table_name
where id = 1234
...it takes less than 50 milliseconds to run.
Took that query and put it into a function:
CREATE OR REPLACE FUNCTION pie(id_param integer)
RETURNS SETOF record AS
$BODY$
BEGIN
RETURN QUERY SELECT *
FROM table_name
where id = id_param;
END
$BODY$
LANGUAGE plpgsql STABLE;
This function when executed select * from pie(123); takes 22 seconds.
If I hard code an integer in place of id_param, the function executes in under 50 milliseconds.
Why does the fact that I am using a parameter in the where statement cause my function to run slow?
Edit to add concrete example:
CREATE TYPE test_type AS (gid integer, geocode character varying(9))
CREATE OR REPLACE FUNCTION geocode_route_by_geocode(geocode_param character)
RETURNS SETOF test_type AS
$BODY$
BEGIN
RETURN QUERY EXECUTE
'SELECT gs.geo_shape_id AS gid,
gs.geocode
FROM geo_shapes gs
WHERE geocode = $1
AND geo_type = 1
GROUP BY geography, gid, geocode' USING geocode_param;
END;
$BODY$
LANGUAGE plpgsql STABLE;
ALTER FUNCTION geocode_carrier_route_by_geocode(character)
OWNER TO root;
--Runs in 20 seconds
select * from geocode_route_by_geocode('999xyz');
--Runs in 10 milliseconds
SELECT gs.geo_shape_id AS gid,
gs.geocode
FROM geo_shapes gs
WHERE geocode = '9999xyz'
AND geo_type = 1
GROUP BY geography, gid, geocode
Update in PostgreSQL 9.2
There was a major improvement, I quote the release notes here:
Allow the planner to generate custom plans for specific parameter
values even when using prepared statements (Tom Lane)
In the past, a prepared statement always had a single "generic" plan
that was used for all parameter values, which was frequently much
inferior to the plans used for non-prepared statements containing
explicit constant values. Now, the planner attempts to generate custom
plans for specific parameter values. A generic plan will only be used
after custom plans have repeatedly proven to provide no benefit. This
change should eliminate the performance penalties formerly seen from
use of prepared statements (including non-dynamic statements in
PL/pgSQL).
Original answer for PostgreSQL 9.1 or older
A plpgsql functions has a similar effect as the PREPARE statement: queries are parsed and the query plan is cached.
The advantage is that some overhead is saved for every call.
The disadvantage is that the query plan is not optimized for the particular parameter values it is called with.
For queries on tables with even data distribution, this will generally be no problem and PL/pgSQL functions will perform somewhat faster than raw SQL queries or SQL functions. But if your query can use certain indexes depending on the actual values in the WHERE clause or, more generally, chose a better query plan for the particular values, you may end up with a sub-optimal query plan. Try an SQL function or use dynamic SQL with EXECUTE to force a the query to be re-planned for every call. Could look like this:
CREATE OR REPLACE FUNCTION pie(id_param integer)
RETURNS SETOF record AS
$BODY$
BEGIN
RETURN QUERY EXECUTE
'SELECT *
FROM table_name
where id = $1'
USING id_param;
END
$BODY$
LANGUAGE plpgsql STABLE;
Edit after comment:
If this variant does not change the execution time, there must be other factors at play that you may have missed or did not mention. Different database? Different parameter values? You would have to post more details.
I add a quote from the manual to back up my above statements:
An EXECUTE with a simple constant command string and some USING
parameters, as in the first example above, is functionally equivalent
to just writing the command directly in PL/pgSQL and allowing
replacement of PL/pgSQL variables to happen automatically. The
important difference is that EXECUTE will re-plan the command on each
execution, generating a plan that is specific to the current parameter
values; whereas PL/pgSQL normally creates a generic plan and caches it
for re-use. In situations where the best plan depends strongly on the
parameter values, EXECUTE can be significantly faster; while when the
plan is not sensitive to parameter values, re-planning will be a
waste.

In postgres (plpgsql), how to make a function that returns select * on a variable table_name?

Basically, at least for proof of concept, I want a function where I can run:
SELECT res('table_name'); and this will give me the results of SELECT * FROM table_name;.
The issue I am having is schema...in the declaration of the function I have:
CREATE OR REPLACE FUNCTION res(table_name TEXT) RETURNS SETOF THISISTHEPROBLEM AS
The problem is that I do not know how to declare my return, as it wants me to specify a table or a schema, and I won't have that until the function is actually run.
Any ideas?
You can do this, but as mentioned before you have to add a column definiton list in the SELECT query.
CREATE OR REPLACE FUNCTION res(table_name TEXT) RETURNS SETOF record AS $$
BEGIN
RETURN QUERY EXECUTE 'SELECT * FROM ' || table_name;
END;
$$ LANGUAGE plpgsql;
SELECT * FROM res('sometable') sometable (col1 INTEGER, col2 INTEGER, col3 SMALLINT, col4 TEXT);
Why for any real practical purpose would you just want to pass in table and select * from it? For fun maybe?
You can't do it without defining some kind of known output like jack and rudi show. Or doing it like depesz does here using output parameters http://www.depesz.com/index.php/2008/05/03/waiting-for-84-return-query-execute-and-cursor_tuple_fraction/.
A few hack around the wall approachs are to issue raise notices in a loop and print out a result set one row at a time. Or you could create a function called get_rows_TABLENAME that has a definition for every table you want to return. Just use code to generate the procedures creations. But again not sure how much value doing a select * from a table, especially with no constraints is other than for fun or making the DBA's blood boil.
Now in SQL Server you can have a stored procedure return a dynamic result set. This is both a blessing and curse as you can't be certain what comes back without looking up the definition. For me I look at PostgreSQL's implementation to be the more sound way to go about it.
Even if you manage to do this (see rudi-moore's answer for a way if you have 8.4 or above), You will have to expand the type explicitly in the select - eg:
SELECT res('table_name') as foo(id int,...)