I would like to allow the execution of a PL/pgSQL function (my_function) only if its argument (my_table.x) belongs to a predefined interval (e.g. [100,1000]).
Let's take the following example query:
(q) SELECT * FROM my_table WHERE my_function(my_table.x);
I would like this query to be rewritten automatically so that it checks whether my_table.x belongs to the interval [100,1000]:
(q') SELECT * FROM my_table WHERE (my_table.x BETWEEN 100 AND 1000) AND my_function(my_table.x);
The command EXPLAIN ANALYSE shows that the second query is significantly faster than the first one.
How can I change the query execution plan in order to automate the rewriting of q into q'?
Where can I suitably store the metadata about the interval [100,1000] associated with my_function?
Thanks in advance,
Thomas Girault
The help I need will benefit a project on the integration of fuzzy logic into PostgreSQL: https://github.com/postgresqlf/PostgreSQL_f/
The fastest way to catch it is something like this at the top of the function body:
IF $1 BETWEEN 100 AND 1000 THEN
    -- proceed
ELSE
    RETURN NULL; -- or whatever you want to return in this case
END IF;
This should be very fast.
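For context, here is what that guard could look like in a complete function. This is a minimal sketch only: the numeric argument and boolean return type are assumptions, since the original my_function is not shown.

CREATE OR REPLACE FUNCTION my_function(x numeric)
  RETURNS boolean AS
$BODY$
BEGIN
    IF x BETWEEN 100 AND 1000 THEN
        -- placeholder for the actual computation
        RETURN true;
    ELSE
        RETURN NULL; -- or false, depending on the desired semantics
    END IF;
END;
$BODY$ LANGUAGE plpgsql;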
Actual query rewriting is done with the RULE system in PostgreSQL. But rules apply to tables and views, not to functions. You could wrap your query in a view - but then you can add the additional condition explicitly, which is cheaper.
CREATE VIEW v_tbl_only_valid_x AS
SELECT *
FROM tbl
WHERE x BETWEEN 100 AND 1000;
Call:
SELECT * FROM v_tbl_only_valid_x WHERE my_function(x);
This way the query planner gets the information about the selectivity of the query on the column x explicitly, which may result in a different query plan.
But wouldn't it be simpler to just add the second WHERE condition in your query like you have it in q'?
Related
There are two identical tables. A dynamic parameter will determine which of the two tables to get data from. The approach I took is to union results from both tables and add a predicate to each subquery to filter on the parameter passed in (please see below). This has produced an estimated execution plan that reads both tables and then applies the filter. Is there a way to optimize the query without using a condition (like IF) so that only the appropriate table is queried?
create table Flip (ID uniqueidentifier)
create table Flop (ID uniqueidentifier)
The query
declare @useFlip bit
select * from Flip where @useFlip=1
union
select * from Flop where @useFlip<>1
Execution plan (screenshot omitted): both tables are read, then the filter is applied.
I would personally move to a dynamic statement or IF...ELSE logic. As you're only ever going to query one table, it makes more sense to create a plan that's relevant.
Dynamic approach:
DECLARE @SQL nvarchar(MAX);
SET @SQL = N'SELECT * FROM ' + QUOTENAME(CASE @UseFlip WHEN 1 THEN N'Flip' ELSE N'Flop' END) + N';';
EXEC sp_executesql @SQL;
IF...ELSE Approach:
IF @UseFlip = 1
    SELECT * FROM Flip;
ELSE
    SELECT * FROM Flop;
Edit: The above has been completely invalidated, as the OP commented below this answer to advise they are using an inline table-valued function (this information is not in the question at this time). Neither of these statements can be used in an iTVF.
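If the logic must live in an inline table-valued function, one commonly suggested pattern (my addition, not part of the original answer) is UNION ALL with variable-only predicates. SQL Server typically compiles these into startup filters, so only the matching branch is actually executed at run time, even though both tables still appear in the plan:

-- Hypothetical iTVF; an iTVF may contain only a single statement
CREATE FUNCTION dbo.FlipOrFlop (@UseFlip bit)
RETURNS TABLE
AS
RETURN
    SELECT ID FROM Flip WHERE @UseFlip = 1
    UNION ALL
    SELECT ID FROM Flop WHERE @UseFlip <> 1;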
I have a table with common word values to match against brands - so when someone types in "coke" I want to match any possible brand names associated with it as well as the original term.
CREATE TABLE word_association ( commonterm TEXT, assocterm TEXT);
INSERT INTO word_association VALUES ('coke', 'coca-cola'), ('coke', 'cocacola'), ('coke', 'coca cola');
I have a function to create a list of these values in a pipe-delimited string for pattern matching:
CREATE OR REPLACE FUNCTION usp_get_search_terms(userterm text)
RETURNS text AS
$BODY$DECLARE
returnstr TEXT DEFAULT '';
BEGIN
SET DATESTYLE TO DMY;
returnstr := userterm;
IF EXISTS (SELECT 1 FROM word_association WHERE LOWER(commonterm) = LOWER(userterm)) THEN
SELECT returnstr || '|' || string_agg(assocterm, '|') INTO returnstr
FROM word_association
WHERE commonterm = userterm;
END IF;
RETURN returnstr;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
ALTER FUNCTION usp_get_search_terms(text)
OWNER TO customer_role;
If you call SELECT * FROM usp_get_search_terms('coke') you end up with
coke|coca-cola|cocacola|coca cola
EDIT: this function runs in <100ms, so it works fine.
I want to run a query with this text inserted e.g.
SELECT X.article_number, X.online_description
FROM articles X
WHERE LOWER(X.online_description) % usp_get_search_terms ('coke');
This takes approx 56s to run against my table of ~500K records.
If I get the raw text and use it in the query it takes ~300ms e.g.
SELECT X.article_number, X.online_description
FROM articles X
WHERE X.online_description % '(coke|coca-cola|cocacola|coca cola)';
The result sets are identical.
I've tried modifying the output string from the function, e.g. enclosing it in quotes and parentheses, but it doesn't seem to make a difference.
Can someone please advise why there is a difference here? Is it the data type or something about calling functions inside queries? Thanks.
Your function might take 100ms, but the query isn't calling your function once; it's calling it 500,000 times.
It's because your function is declared VOLATILE. This tells Postgres that either the function returns different values when called multiple times within a query (like clock_timestamp() or random()), or that it alters the state of the database in some way (for example, by inserting records).
If your function contains only SELECTs, with no INSERTs, calls to other VOLATILE functions, or other side-effects, then you can declare it STABLE instead. This tells the planner that it can call the function just once and reuse the result without affecting the outcome of the query.
But your function does have side-effects, due to the SET DATESTYLE statement, which takes effect for the rest of the session. I doubt this was the intention, however. You may be able to remove it, as it doesn't look like date formatting is relevant to anything in there. But if it is necessary, the correct approach is to use the SET clause of the CREATE FUNCTION statement to change it only for the duration of the function call:
...
$BODY$
LANGUAGE plpgsql STABLE
SET DATESTYLE TO DMY
COST 100;
The other issue with the slow version of the query is the call to LOWER(X.online_description), which will prevent the query from utilising the index (since online_description is indexed, but LOWER(online_description) is not).
With these changes, the performance of both queries is the same; see this SQLFiddle.
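As an aside (my addition, not covered by the SQLFiddle above): if you do want to keep LOWER() in the predicate, a trigram expression index would let it use an index, assuming the % operator comes from the pg_trgm extension, as its syntax suggests:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GIN trigram index on the lowercased expression,
-- usable by the % similarity operator
CREATE INDEX idx_articles_descr_trgm
    ON articles
    USING gin (LOWER(online_description) gin_trgm_ops);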
So the answer came to me about dawn this morning - CTEs to the rescue!
Particularly as this is the "simple" version of a very large query, it helps to get this defined once in isolation, then do the matching against it. The alternative (given I'm calling this from a NodeJS platform) is to have one request retrieve the string of terms, then make another request to pass the string back. Not elegant.
WITH matches AS (
    SELECT * FROM usp_get_search_terms('coke')
), main AS (
    SELECT X.article_number, X.online_description
    FROM articles X
    JOIN matches M ON X.online_description % M.usp_get_search_terms
)
SELECT * FROM main;
Execution time is somewhere around 300-500ms depending on term searched and articles returned.
Thanks for all your input guys - I've learned a few things about Postgres that my MS-SQL background didn't necessarily prepare me for :)
Have you tried removing the IF EXISTS() and simply using:
SELECT returnstr || '|' || string_agg(assocterm, '|') INTO returnstr
FROM word_association
WHERE LOWER(commonterm) = LOWER(userterm)
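One caveat with dropping the IF EXISTS (my note, not part of the original suggestion): string_agg over zero rows yields NULL, and concatenating NULL makes the whole expression NULL, so a COALESCE is needed to preserve the original term when there are no matches:

SELECT COALESCE(returnstr || '|' || string_agg(assocterm, '|'), returnstr)
  INTO returnstr
  FROM word_association
 WHERE LOWER(commonterm) = LOWER(userterm);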
Instead of calling the function for each row, call it once:
select x.article_number, x.online_description
from
woolworths.articles x
cross join
woolworths.usp_get_search_terms ('coke') c (s)
where lower(x.online_description) % s
I have the following problem with PostgreSQL 9.3.
There is a view encapsulating a non-trivial query over some resources (e.g., documents). Let's illustrate it as simply as:
CREATE VIEW vw_resources AS
SELECT * FROM documents; -- there are several joined tables in fact...
The client application uses the view usually with some WHERE conditions on several fields, and might also use paging of the results, so OFFSET and LIMIT may also be applied.
Now, on top of the actual resource list computed by vw_resources, I only want to display resources that the current user is allowed to access. There is quite a complex set of rules regarding privileges (they depend on several attributes of the resources in question, explicit ACLs, implicit rules based on user roles or relations to other users...), so I wanted to encapsulate all of them in a single function. To prevent repetitive costly queries for each resource, the function takes a list of resource IDs, evaluates the privileges for all of them at once, and returns the set of requested resource IDs together with the corresponding privileges (read/write is distinguished). It looks roughly like this:
CREATE FUNCTION list_privileges(resource_ids BIGINT[])
RETURNS TABLE (resource_id BIGINT, privilege TEXT)
AS $function$
BEGIN
-- the function lists privileges for a user that would get passed in an argument - omitting that for simplicity
RAISE NOTICE 'list_privileges called'; -- for diagnostic purposes
-- for illustration, let's simply grant write privileges for any odd resource:
RETURN QUERY SELECT id, (CASE WHEN id % 2 = 1 THEN 'write' ELSE 'none' END)
FROM unnest(resource_ids) id;
END;
$function$ LANGUAGE plpgsql STABLE;
The question is how to integrate such a function into the vw_resources view so that it gives only resources the user is privileged for (i.e., has the 'read' or 'write' privilege).
A trivial solution would use a CTE:
CREATE VIEW vw_resources AS
WITH base_data AS (
SELECT * FROM documents
)
SELECT base_data.*, priv.privilege
FROM base_data
JOIN list_privileges((SELECT array_agg(resource_id) FROM base_data)) AS priv USING (resource_id)
WHERE privilege IN ('read', 'write');
The problem is that the view itself returns too many rows - some WHERE conditions and OFFSET/LIMIT clauses are only applied on top of the view, as in SELECT * FROM vw_resources WHERE resource_id IN (1,2,3) LIMIT 10 (any complex filtering might be requested by the client application). And since PostgreSQL treats a CTE as an optimization fence and does not push conditions down into it, the list_privileges(BIGINT[]) function ends up evaluating privileges for all resources in the database, which effectively kills the performance.
So I attempted to use a window function to collect resource IDs over the whole result set and join the list_privileges(BIGINT[]) function in an outer query, as illustrated below. But the list_privileges(BIGINT[]) function ends up being called repeatedly, once per row (as witnessed by the 'list_privileges called' notices), which kind of ruins the previous effort:
CREATE VIEW vw_resources AS
SELECT d.*, priv.privilege
FROM (
SELECT *, array_agg(resource_id) OVER () AS collected
FROM documents
) AS d
JOIN list_privileges(d.collected) AS priv USING (resource_id)
WHERE privilege IN ('read', 'write');
I could resort to forcing clients to issue two separate queries: the first querying vw_resources without privileges applied, the second calling the list_privileges(BIGINT[]) function with the list of resource IDs fetched by the first query, and filtering out the disallowed resources on the client side. That is quite clumsy for the client, though, and obtaining e.g. the first 20 allowed resources would be practically impossible, since limiting the first query simply does not achieve it - if some resources are filtered out due to privileges, then we simply don't have 20 rows in the overall result...
Any help welcome!
P.S. For the sake of completeness, I append a sample documents table:
CREATE TABLE documents (resource_id BIGINT, content TEXT);
INSERT INTO documents VALUES (1,'a'),(2,'b'),(3,'c');
If you must use PL/pgSQL, then create the function taking no arguments:
create function list_privileges()
returns table (resource_id bigint, privilege text)
as $function$
begin
    raise notice 'list_privileges called'; -- for diagnostic purposes
    return query select 1::bigint, case when 1 % 2 = 1 then 'write' else 'none' end;
end;
$function$ language plpgsql stable;
And join it to the complex query to form the vw_resources view:
create view vw_resources as
select *
from
documents d
inner join
list_privileges() using(resource_id)
The filter conditions will then be added at query time:
select *
from vw_resources
where
resource_id in (1,2,3)
and
privilege in ('read', 'write')
Let the planner do its optimization magic and check the explain output before any "premature optimization".
This is just a conjecture: The function might make it harder or impossible for the planner to optimize.
If plpgsql is not really necessary - and very often it is not - I would just create a view instead of the function:
create view vw_list_privileges as
select
1 as resource_id,
case when 1 % 2 = 1 then 'write' else 'none' end as privilege
And join it to the complex query in the same way:
create view vw_resources as
select *
from
documents d
inner join
vw_list_privileges using(resource_id)
In my table, results from the column work_time (interval type) display as 200:00:00. Is it possible to cut off the seconds part, so it is displayed as 200:00? Or, even better: 200h00min (I've seen that it accepts the h unit on insert, so why not load it like this?).
Preferably, by altering work_time column, not by changing the select query.
This is not something you should do by altering a column but by changing the select query in some way. If you change the column you are changing storage and functional uses, and that's not good. To change it on output, you need to modify how it is retrieved.
You have two basic options. The first is to modify your select queries directly, using to_char(myintervalcol, 'HH24:MI').
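For example, applied to the work_time column from the question (a sketch; the table name work_log is hypothetical, and double-quoted text in a to_char pattern is copied to the output literally):

-- 200:00:00 -> 200:00
SELECT to_char(work_time, 'HH24:MI') FROM work_log;

-- 200:00:00 -> 200h00min
SELECT to_char(work_time, 'HH24"h"MI"min"') FROM work_log;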
However if your issue is that you have a common format you want to have universal access to in your select query, PostgreSQL has a neat trick I call "table methods." You can attach a function to a table in such a way that you can call it in a similar (but not quite identical) syntax to a new column. In this case you would do something like:
CREATE OR REPLACE FUNCTION myinterval_nosecs(mytable) RETURNS text LANGUAGE SQL
IMMUTABLE AS
$$
SELECT to_char($1.myintervalcol, 'HH24:MI');
$$;
This works on the row input, not on the underlying table. As it always returns the same information for the same input, you can mark it IMMUTABLE and even index its output (meaning it can be evaluated at plan time and indexes can be used).
To call this, you'd do something like:
SELECT myinterval_nosecs(m) FROM mytable m;
But you can then use the special syntax above to rewrite that as:
SELECT m.myinterval_nosecs FROM mytable m;
Note that since myinterval_nosecs is a function you cannot omit the m. at the beginning. This is because the query planner will rewrite the query in the former syntax and will not guess as to which relation you mean to run it against.
I have a page on my site which has multiple drop down boxes as filters.
So the SQL procedure for that page would be something like this
IF @Filter1 = 0 AND @Filter2 = 0 AND @Filter3 = 0
BEGIN
    SELECT * FROM Table1
END
ELSE IF @Filter1 = 1 AND @Filter2 = 0 AND @Filter3 = 0
BEGIN
    SELECT * FROM Table2
END
At the beginning, there were only a few results per filter so there weren't that many permutations. However, more filters have been added such that there are over 20 IF ELSE checks now.
So if each filter has 5 options, I will need to do 5*5*5 = 125 IF ELSE checks to return data dependent on the filters.
Update
The first filter alters the WHERE condition, the second filter adds more tables to the result set, and the third filter alters the ORDER BY condition.
How can I make this query more scalable, so that I don't have to write a new bunch of IF ELSE statements to check for every condition every time a new filter is added to the list - besides using dynamic SQL?
You could maintain a rule table with formulas (perhaps bitwise flags), construct the query by plugging variable data from that table into a string, and then use dynamic SQL to run it.
As much as I dislike dynamic SQL, this may be the time for it. You can build the query a little at a time, then execute it at the end.
If you're unfamiliar, the syntax is something like:
DECLARE @SQL VARCHAR(1000)
SELECT @SQL = 'SELECT * FROM ' + 'SOME_TABLE'
EXEC(@SQL)
Make sure you deal with SQL injection attacks, proper spacing, etc.
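When filter values do have to influence the statement, sp_executesql with parameters keeps the values out of the SQL text entirely. A sketch, assuming the filters from the question; the table and column names are hypothetical:

DECLARE @SQL nvarchar(MAX) = N'SELECT * FROM SOME_TABLE WHERE 1 = 1';

-- Append predicates only for the filters that are actually set
IF @Filter1 <> 0 SET @SQL += N' AND Col1 = @p1';
IF @Filter2 <> 0 SET @SQL += N' AND Col2 = @p2';

EXEC sp_executesql @SQL,
    N'@p1 int, @p2 int', -- parameter definitions
    @p1 = @Filter1, @p2 = @Filter2;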
In this case, I'd do my best to put this logic in application code, but that's not always possible. If you're using LINQ-to-SQL or another LINQ framework, you should be able to do this safely, but it may take some creativity to get the LINQ query built properly.
You can set up a bunch of views, one for each "filter" and then select from the appropriate view based on which "filter" was selected.
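For instance (a sketch with hypothetical names; each view bakes in one filter combination, and the procedure only decides which one to read):

CREATE VIEW v_Default AS SELECT * FROM Table1;
CREATE VIEW v_Filter1 AS SELECT * FROM Table2;

IF @Filter1 = 1
    SELECT * FROM v_Filter1;
ELSE
    SELECT * FROM v_Default;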