Why can't immutable UDFs be called within certain clauses of a CTE in Redshift? - postgresql

The issue can be simplified to this: within a view, any CTE that refers to some_immutable_func breaks, except in the WHERE and HAVING clauses, producing the following error:
create or replace function some_immutable_func ()
returns int
immutable as $$
SELECT 1
$$ language sql;
create view some_view as
WITH some_cte AS (
SELECT some_immutable_func()
)
SELECT * FROM some_cte;
FATAL: Query processing failed due to an internal error.
CONTEXT: SQL function "some_immutable_func"
SSL connection has been closed unexpectedly
The connection to the server was lost. Attempting reset: Succeeded.
-- but this is okay
create view some_view as
WITH some_cte AS (
SELECT * FROM some_table
WHERE ...some_immutable_func()...
HAVING ...some_immutable_func()...
)
SELECT * FROM some_cte;
CREATE VIEW
Built-in immutable functions such as ABS(-3) work fine. Simply changing the UDF to STABLE fixes the issue, but I am looking to optimize query performance within a complex view, where the STABLE nature of a certain UDF apparently slows it down nearly 100x. Ideally, I would also like to minimize changes to the view's structure, hopefully doing a simple replace-all instead of rearranging all references to the UDF into WHERE and HAVING clauses.
I believe the problem may lie with the query optimizer, but I'm surprised there is so little information out there on the finer details of Redshift/Postgres and immutability.
EDIT: I've also discovered that changing the UDF language to Python seems to work fine. However, it has quite slow performance in my specific use case, potentially worse than just using the STABLE SQL UDF.
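For reference, a sketch of the Python-language variant mentioned in the edit (plpythonu is the language name Redshift uses for Python UDFs; whether it helps will depend on your workload):
create or replace function some_immutable_func ()
returns int
immutable as $$
    return 1  # same constant result as the SQL version above
$$ language plpythonu;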

Just wanted to confirm that this has been resolved since the question was posted.
The provided example now works as expected.

Related

Executing queries dynamically in PL/pgSQL

I have found solutions (I think) to the problem I'm about to ask about for Oracle and SQL Server, but can't seem to translate them into a Postgres solution. I am using Postgres 9.3.6.
The idea is to be able to generate "metadata" about the table content for profiling purposes. This can only be done (AFAIK) by running queries for each column to find out, say, min/max/count values and such. In order to automate the procedure, it is preferable to have the queries generated by the DB, then executed.
With an example salesdata table, I'm able to generate a select query for each column, returning the min() value, using the following snippet:
SELECT 'SELECT min('||column_name||') as minval_'||column_name||' from salesdata '
FROM information_schema.columns
WHERE table_name = 'salesdata'
The advantage being that the db will generate the code regardless of the number of columns.
Now there are myriad places I had in mind for storing these queries (a variable of some sort, or a table column), the idea being to then have these queries executed.
I thought of storing the generated queries in a variable and then executing them using the EXECUTE (or EXECUTE IMMEDIATE) statement, which is the approach employed here (see right pane). But Postgres won't let me declare a variable outside a function, and I've been scratching my head over how this would fit together, whether that's even the direction to follow, or whether there's something simpler.
Would you have any pointers? I'm currently trying something like this, inspired by this other question, but have no idea whether I'm headed in the right direction:
CREATE OR REPLACE FUNCTION foo()
RETURNS void AS
$$
DECLARE
dyn_sql text;
BEGIN
SELECT 'SELECT min('||column_name||') from salesdata'
INTO dyn_sql
FROM information_schema.columns
WHERE table_name = 'salesdata';
EXECUTE dyn_sql;
END
$$ LANGUAGE plpgsql;
System statistics
Before you roll your own, have a look at the system table pg_statistic or the view pg_stats:
This view allows access only to rows of pg_statistic that correspond
to tables the user has permission to read, and therefore it is safe to
allow public read access to this view.
It might already have some of the statistics you are about to compute. It's populated by ANALYZE, so you might run that for new (or any) tables before checking.
-- ANALYZE tbl; -- optionally, to init / refresh
SELECT * FROM pg_stats
WHERE tablename = 'tbl'
AND schemaname = 'public';
Generic dynamic plpgsql function
You want to return the minimum value for every column in a given table. This is not a trivial task, because a function (like SQL in general) demands to know the return type at creation time - or at least at call time with the help of polymorphic data types.
This function does everything automatically and safely. Works for any table, as long as the aggregate function min() is allowed for every column. But you need to know your way around PL/pgSQL.
CREATE OR REPLACE FUNCTION f_min_of(_tbl anyelement)
RETURNS SETOF anyelement
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE (
SELECT format('SELECT (t::%2$s).* FROM (SELECT min(%1$s) FROM %2$s) t'
, string_agg(quote_ident(attname), '), min(' ORDER BY attnum)
, pg_typeof(_tbl)::text)
FROM pg_attribute
WHERE attrelid = pg_typeof(_tbl)::text::regclass
AND NOT attisdropped -- no dropped (dead) columns
AND attnum > 0 -- no system columns
);
END
$func$;
Call (important!):
SELECT * FROM f_min_of(NULL::tbl); -- tbl being the table name
You need to understand these concepts:
Dynamic SQL in plpgsql with EXECUTE
Polymorphic types
Row types and table types in Postgres
How to defend against SQL injection
Aggregate functions
System catalogs
Related answers with detailed explanation:
Table name as a PostgreSQL function parameter
Refactor a PL/pgSQL function to return the output of various SELECT queries
Postgres data type cast
How to set value of composite variable field using dynamic SQL
How to check if a table exists in a given schema
Select columns with particular column names in PostgreSQL
Generate series of dates - using date type as input
Special difficulty with type mismatch
I am taking advantage of Postgres defining a row type for every existing table. Using the concept of polymorphic types I am able to create one function that works for any table.
However, some aggregate functions return related but different data types as compared to the underlying column. For instance, min(varchar_column) returns text, which is bit-compatible, but not exactly the same data type. PL/pgSQL functions have a weak spot here and insist on data types exactly as declared in the RETURNS clause. No attempt to cast, not even implicit casts, not to speak of assignment casts.
That should be improved. Tested with Postgres 9.3. Did not retest with 9.4, but I am pretty sure nothing has changed in this area.
That's where this construct comes in as workaround:
SELECT (t::tbl).* FROM (SELECT ... FROM tbl) t;
By casting the whole row to the row type of the underlying table explicitly we force assignment casts to get original data types for every column.
This might fail for some aggregate functions. sum() returns numeric for sum(bigint_column) to accommodate a sum overflowing the base data type. Casting back to bigint might then fail ...
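A hypothetical illustration of that failure mode (table and values invented here):
CREATE TABLE big_tbl (val bigint);
INSERT INTO big_tbl VALUES (9223372036854775807), (1);  -- max bigint, plus one more row

SELECT (t::big_tbl).* FROM (SELECT sum(val) FROM big_tbl) t;
-- ERROR:  bigint out of range
-- sum() returns numeric; the total no longer fits bigint, so the cast back fails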
@Erwin Brandstetter, many thanks for the extensive answer. pg_stats does indeed provide a few things, but what I really need to draw a complete profile is a variety of things: min/max values, counts, count of nulls, mean, etc., so a bunch of queries have to be run for each column, some with GROUP BY and such.
Also, thanks for highlighting the importance of data types; I was sort of expecting this to throw a spanner in the works at some point. My main concern was with how to automate the query generation and, above all, its execution.
I have tried the function you provided (I probably will need to start learning some plpgsql) but get an error at the SELECT (t::tbl):
ERROR: type "tbl" does not exist
btw, what is the (t::abc) notation referred to as? In Python this would be a slice, but that's probably not the case in PL/pgSQL.

Postgres Rules Preventing CTE Queries

Using Postgres 9.3:
I am attempting to automatically populate a table when an insert is performed on another table. This seems like a good use for rules, but after adding the rule to the first table, I am no longer able to perform inserts into the second table using the writable CTE. Here is an example:
CREATE TABLE foo (
id INT PRIMARY KEY
);
CREATE TABLE bar (
id INT PRIMARY KEY REFERENCES foo
);
CREATE RULE insertFoo AS ON INSERT TO foo DO INSERT INTO bar VALUES (NEW.id);
WITH a AS (SELECT * FROM (VALUES (1), (2)) b)
INSERT INTO foo SELECT * FROM a
When this is run, I get the error
"ERROR: WITH cannot be used in a query that is rewritten by rules
into multiple queries".
I have searched for that error string, but am only able to find links to the source code. I know that I can perform the above using row-level triggers instead, but it seems like I should be able to do this at the statement level. Why can I not use the writable CTE, when queries like this can (in this case) be easily re-written as:
INSERT INTO foo SELECT * FROM (VALUES (1), (2)) a
Does anyone know of another way that would accomplish what I am attempting to do other than 1) using rules, which prevents the use of "with" queries, or 2) using row-level triggers? Thanks,
        
TL;DR: use triggers, not rules.
Generally speaking, prefer triggers over rules, unless rules are absolutely necessary. (Which, in practice, they never are.)
Using rules introduces heaps of problems which will needlessly complicate your life down the road. You've run into one here. Another (major) one is, for instance, that the number of affected rows will correspond to that of the very last query -- if you're relying on FOUND somewhere and your query is incorrectly reporting that no rows were affected by a query, you'll be in for painful bugs.
Moreover, there's occasional talk of deprecating Postgres rules outright:
http://postgresql.nabble.com/Deprecating-RULES-td5727689.html
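For reference, a minimal trigger-based sketch of the insertFoo rule from the question (the function name is invented here; it assumes the same foo/bar tables):
CREATE FUNCTION foo_to_bar() RETURNS trigger
LANGUAGE plpgsql AS
$$
BEGIN
  INSERT INTO bar VALUES (NEW.id);  -- mirror the insert, as the rule did
  RETURN NEW;
END
$$;
CREATE TRIGGER insert_foo
AFTER INSERT ON foo
FOR EACH ROW EXECUTE PROCEDURE foo_to_bar();
-- the writable CTE from the question now works unchanged:
WITH a AS (SELECT * FROM (VALUES (1), (2)) b)
INSERT INTO foo SELECT * FROM a;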
Like the other answer, I definitely recommend using INSTEAD OF triggers before RULEs.
However, if for some reason you don't want to change existing VIEW RULEs and still want to use WITH, you can do so by wrapping the insert in a stored procedure:
create function insert_foo(int) returns void as $$
insert into foo values ($1)
$$ language sql;
WITH a AS (SELECT * FROM (VALUES (1), (2)) b)
SELECT insert_foo(a.column1) from a;
This could be useful when using some legacy db through some system that wraps statements with CTEs.

How is this code prone to SQL injection?

I was reading the PostgreSQL documentation here, and came across the following code snippet:
EXECUTE 'SELECT count(*) FROM mytable WHERE inserted_by = $1 AND inserted <= $2'
INTO c
USING checked_user, checked_date;
The documentation states that "This method is often preferable to inserting data values into the command string as text: it avoids run-time overhead of converting the values to text and back, and it is much less prone to SQL-injection attacks since there is no need for quoting or escaping".
Can you show me how this code is prone to SQL injection at all?
Edit: in all other RDBMSs I have worked with, this would completely prevent SQL injection. What is implemented differently in PostgreSQL?
The quick answer is that it isn't, by itself, prone to SQL injection. As I understand your question, you are asking why we don't just say so. Since you are looking for scenarios where this might lead to SQL injection, consider that mytable might be a view, and so could have additional functions behind it. Those functions might be vulnerable to SQL injection.
So you can't look at a query and conclude that it is definitely not susceptible to SQL injection. The best you can do is indicate that, at the level provided, this specific level of your application does not raise SQL injection concerns.
Here is an example of a case where SQL injection might very well happen.
CREATE OR REPLACE FUNCTION ban_user() returns trigger
language plpgsql security definer as
$$
begin
insert into banned_users (username) values (new.username);
execute 'alter user ' || new.username || ' WITH VALID UNTIL ''YESTERDAY''';
return new;
end;
$$;
Note that utility statements like ALTER USER cannot take parameters the way you indicate, and we forgot quote_ident() around new.username, thus making the field vulnerable.
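(For comparison, a sketch of how that line could be patched using format() with %I, which applies quote_ident()-style quoting to the identifier:
execute format('alter user %I with valid until ''YESTERDAY''', new.username);
That closes this particular hole, though as noted further down, quoting alone doesn't address every layer.)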
CREATE OR REPLACE VIEW banned_users_today AS
SELECT username FROM banned_users where banned_date = 'today';
CREATE TRIGGER i_banned_users_today INSTEAD OF INSERT ON banned_users_today
FOR EACH ROW EXECUTE PROCEDURE ban_user();
EXECUTE 'insert into banned_users_today (username) values ($1)'
USING 'postgres with password ''boo''; drop function ban_user() cascade; --';
So, no, it doesn't completely solve the problem even if used everywhere it can be used. And proper use of quote_literal() and quote_ident() doesn't always solve the problem either.
The thing is that the problem can always be at a lower level than the query you are executing.
The bound parameters prevent malicious input from manipulating the statement into doing anything other than what it's intended to do.
This guarantees no possibility for SQL-injection attacks short of a Postgres bug. (See H2C03's link for examples of what could go wrong.)
I imagine the "much less prone to SQL-injection attacks" amounts to CYA verbiage should such a thing ever arise.
SQL injection is usually associated with large data dumps on pastebin.com, and that scenario won't work here even if the example used concatenation instead of variables, because the COUNT(*) will aggregate all the data you'd be trying to steal.
But I can imagine scenarios where the count of arbitrary records would be sufficiently valuable information, e.g. the number of a competitor's clients, the number of products sold, etc. And recalling some of the really tricky blind SQL injection methods, it might be possible to build a query that, using COUNT alone, would allow an attacker to iteratively recover actual text from the database.
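A hypothetical sketch of that idea, assuming values were concatenated rather than bound (the probed table and guess are invented here):
-- attacker-controlled text appended to the WHERE clause turns the
-- count into a yes/no oracle, probed one character at a time
SELECT count(*) FROM mytable
WHERE inserted_by = 'x'
   OR (SELECT substr(usename, 1, 1) FROM pg_user LIMIT 1) = 'a';
-- a correct guess makes the OR branch true for every row, so the count
-- jumps to the full table size; a wrong guess leaves it unchanged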
It would also be much easier to exploit on a database old and misconfigured enough to allow the ; separator, in which case the attacker might just append a completely separate query.

Is ST_GeomFromText better than providing direct geometry?

I have been working with PostGIS recently, and in my query, using ST_GeomFromText executes faster than running a sub-query to get the geom.
I thought ST_GeomFromText would be more expensive, but after running many tests I got the result faster every time. My question: is there any explanation behind this?
To me, getting the geom directly in a sub-query seemed better than getting the geom as text and then adding it via GeomFromText.
Thanks,
Sara
The issue is that the two approaches do different work. ST_GeomFromText is an immutable function, so the output depends on the input only. This means the planner can execute it once at the beginning of the query. Running a subquery means you are having to look up the geometry, which means disk access, etc. In the first case you have a little bit of CPU activity, executed once for the query; in the second you have disk lookups.
So the answer to some extent depends on what you are doing with it. In general, you can assume that the optimizer will handle things like type conversions on input, where those are not dependent on settings, quite well.
Think about it this way.
SELECT * FROM mytable WHERE my_geom = ST_GeomFromText(....);
This gets converted into something like the following pseudocode:
private_geom = ST_GeomFromText(....);
SELECT * FROM mytable WHERE my_geom = private_geom;
Then that query gets planned and executed.
Obviously you don't want to be adding round trips just to avoid in-query lookups, but where you know the geometry, you might as well specify it via ST_GeomFromText(....) in your query.
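For instance (hypothetical geometry and SRID, reusing the mytable/my_geom names from above):
SELECT *
FROM mytable
WHERE my_geom = ST_GeomFromText('POINT(-71.06 42.36)', 4326);
-- the constant geometry is evaluated once at plan time, instead of a per-row lookup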

Is it possible to use CASE with IN?

I'm trying to construct a T-SQL statement with a WHERE clause determined by an input parameter. Something like:
SELECT * FROM table
WHERE id IN
CASE WHEN @param THEN
(1,2,4,5,8)
ELSE
(9,7,3)
END
I've tried every combination of moving the IN, CASE, etc. around that I can think of. Is this (or something like it) possible?
try this:
SELECT * FROM table
WHERE (@param='??' AND id IN (1,2,4,5,8))
OR (@param!='??' AND id IN (9,7,3))
this will have a problem using an index.
The key with dynamic search conditions is to make sure an index is used, rather than asking how to easily reuse code, eliminate duplication in a query, or do everything with the same query. Here is a very comprehensive article on how to handle this topic:
Dynamic Search Conditions in T-SQL by Erland Sommarskog
It covers all the issues and methods of trying to write queries with multiple optional search conditions. The main thing you need to be concerned with is not the duplication of code, but the use of an index. If your query fails to use an index, it will perform poorly. There are several techniques that can be used, which may or may not allow an index to be used.
here is the table of contents:
Introduction
The Case Study: Searching Orders
The Northgale Database
Dynamic SQL
Introduction
Using sp_executesql
Using the CLR
Using EXEC()
When Caching Is Not Really What You Want
Static SQL
Introduction
x = @x OR @x IS NULL
Using IF statements
Umachandar's Bag of Tricks
Using Temp Tables
x = @x AND @x IS NOT NULL
Handling Complex Conditions
Hybrid Solutions – Using both Static and Dynamic SQL
Using Views
Using Inline Table Functions
Conclusion
Feedback and Acknowledgements
Revision History
If you are on the proper version of SQL Server 2008, there is an additional technique that can be used; see: Dynamic Search Conditions in T-SQL Version for SQL 2008 (SP1 CU5 and later).
If you are on that proper release of SQL Server 2008, you can just add OPTION (RECOMPILE) to the query and the local variable's value at run time is used for the optimizations.
Consider this: OPTION (RECOMPILE) will take this code (where no index can be used with this mess of ORs):
WHERE
(@search1 IS NULL or Column1=@Search1)
AND (@search2 IS NULL or Column2=@Search2)
AND (@search3 IS NULL or Column3=@Search3)
and optimize it at run time to be (provided that only @Search2 was passed in with a value):
WHERE
Column2=@Search2
and an index can be used (if you have one defined on Column2)
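Putting it together, a sketch of the full pattern (the table name is a placeholder; the columns are from the snippet above):
SELECT *
FROM MyTable
WHERE (@search1 IS NULL OR Column1 = @search1)
  AND (@search2 IS NULL OR Column2 = @search2)
  AND (@search3 IS NULL OR Column3 = @search3)
OPTION (RECOMPILE);  -- re-optimized per execution with the actual parameter values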
if @param = 'whatever'
select * from tbl where id in (1,2,4,5,8)
else
select * from tbl where id in (9,7,3)