PostgreSQL function with parameters is not using an existing index

I have a PostgreSQL PL/pgSQL function and noticed that it does not use an existing index on the MATERIALIZED VIEW (my_view) it queries.
The special thing about this function is that it invokes another function (check_user), passing along its parameter v_username. The result is a boolean, and depending on that value the function decides via a CASE construct whether the user gets all data from the given view (my_view) back, or only the rows that join with another table.
CREATE OR REPLACE function my_view_secured (v_username text)
RETURNS setof my_view
LANGUAGE plpgsql stable
AS
$function$
declare
  v_show_all boolean := check_user(v_username);
begin
  CASE
    WHEN v_show_all then
      return query select * from my_view;
    WHEN v_show_all = false then
      return query select st.* from my_view st join other_table st2 on st2.id = st.id;
  end case;
end;
$function$
;
When executing both queries defined in the CASE/WHEN branches directly, i.e. without the function, PostgreSQL uses an existing index and the query returns data quite fast (50 ms).
When invoking the wrapper function (my_view_secured), I assume the index is not used, because it takes about 10-20 seconds to return.
select * from my_view --takes some ms
vs
select * from my_view_secured('RETURNS_TRUE') -- takes 10-20 secs, although same underlying query as the one above
(Maybe it is my DBeaver settings that falsely give me the impression that query 1 takes just a few ms.)
I have read here PostgreSQL is not using index when passed by param that this happens because, due to the function parameter, PostgreSQL cannot optimize the query. My first attempt was to rewrite this PL/pgSQL-style function as a pure SQL query, but I struggle to find the correct logic for it.
From the client's perspective, I simply want to call the function with
SELECT .. FROM my_view_secured ('someusername')
and have the function take care of which data is returned (either all rows or the joined result, depending on the return value of the check_user call). The index itself is set on an id field which is used by the join.
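For illustration, this is roughly the kind of single-statement SQL variant I am aiming for (just a sketch of my intent, not verified to be equivalent; my_view_secured_sql is only a placeholder name):
CREATE OR REPLACE function my_view_secured_sql (v_username text)
RETURNS setof my_view
LANGUAGE sql stable
AS
$function$
    -- show_all users get every row; everyone else only rows that have a match in other_table
    -- (relies on other_table.id being unique, so the EXISTS does not multiply rows)
    select st.*
    from my_view st
    where check_user(v_username)
       or exists (select 1 from other_table st2 where st2.id = st.id);
$function$
;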
Does anyone have an idea how to solve this issue?
Some additional information:
PGSQL-Version: 13.6
my_view (MV): 19 million rows (index on id col)
other_table: 60k rows (unique index on id col)
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 1;
SET auto_explain.log_nested_statements = ON;
SET client_min_messages TO log;
1) Query MV directly:
explain (analyze, verbose, buffers) select * from my_view
Seq Scan on my_view (cost=0.00..598633.52 rows=18902952 width=185) (actual time=0.807..15754.467 rows=18902952 loops=1)
Output: <removed>
Buffers: shared read=409604
Planning:
Buffers: shared hit=67 read=8
Planning Time: 2.870 ms
Execution Time: 16662.400 ms
2a) Query MV via wrapper function (which returns all data / no join):
explain (analyze, verbose, buffers) select * from my_view_secured ('some_username_returning_all_data')
Function Scan on my_view_secured (cost=0.25..10.25 rows=1000 width=3462) (actual time=9006.965..11887.518 rows=18902952 loops=1)
Output: <removed>
Function Call: my_view_secured('some_username_returning_to_all_data'::text)
Buffers: shared hit=174 read=409590, temp read=353030 written=353030
Planning Time: 0.052 ms
Execution Time: 13091.509 ms
2b) Query MV via wrapper function (which returns joined data):
explain (analyze, verbose, buffers) select * from my_view_secured ('some_username_triggering_join')
Function Scan on my_view_secured (cost=0.25..10.25 rows=1000 width=3462) (actual time=10183.590..11756.417 rows=8624367 loops=1)
Output: <removed>
Function Call: my_view_secured('some_username_triggering_join'::text)
Buffers: shared hit=126 read=409792, temp read=161138 written=161138
Planning Time: 0.050 ms
Execution Time: 12434.169 ms

I just recreated your scenario and I get index scans for the nested queries as expected. Postgres absolutely can use indexes here.
PL/pgSQL handles nested SQL DML statements like this: every statement reached by control is parsed, planned and executed. Since neither of the two SELECT statements involves any parameters, those plans are saved immediately and reused on repeated execution. Either way, if a plain select * from my_view; "uses indexes", exactly the same should be the case for the nested statement.
There must be something going on that is not reflected in your question.
A couple of notes:
You misunderstood the linked answer. Your case is different as neither query involves parameters to begin with.
About ...
does not use an existing index on the views (my_view), it is querying.
Maybe just phrased ambiguously, but to be clear: there are no indexes on views. Tables (incl. MATERIALIZED VIEWs) can have indexes. A VIEW is basically just a stored query with some added secondary settings attached to an empty staging table with rewrite rules. Underlying tables may be indexed.
How do you know the nested queries do "not use an existing index" to begin with? It's not trivial to inspect query plans for nested statements like that. See:
Postgres query plan of a function invocation written in plpgsql
It would seem you are barking up the wrong tree.
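If in doubt, auto_explain (already loaded in the question) makes the nested plans visible; a minimal sketch with illustrative settings:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log every statement
SET auto_explain.log_nested_statements = on;  -- include statements executed inside functions
SET auto_explain.log_analyze = on;            -- actual rows and timings, not just estimates
SET client_min_messages TO log;

SELECT * FROM my_view_secured('some_username');
-- the plans of the nested SELECT statements now show up in the log / message output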

Related

Why does postgres use index scan over sequential scan even with a mismatching data type on the indexed column and query condition

I have the following PostgreSQL table:
CREATE TABLE staff (
id integer primary key,
full_name VARCHAR(100) NOT NULL,
department VARCHAR(100) NULL,
tier bigint
);
I filled this table with random data using the following block:
do $$
declare
begin
    FOR counter IN 1 .. 100000 LOOP
        INSERT INTO staff (id, full_name, department, tier)
        VALUES (nextval('staff_sequence'),
                random_string(10),
                get_department(),
                floor(random() * 5 + 1)::bigint);
    end LOOP;
end; $$;
After the data is populated, I created an index on this table on the tier column:
create index staff_tier_idx on staff(tier);
Although I created this index, when I execute a query using this column, I want this index NOT to be used. To accomplish this, I tried to execute this query:
select count(*) from staff where tier=1::numeric;
Due to the mismatching data types between the indexed column and the query condition, I thought the index would not be used and a sequential scan would be executed instead. However, when I run EXPLAIN ANALYZE on the above query I get the following output:
Aggregate (cost=2349.54..2349.55 rows=1 width=8) (actual time=17.078..17.079 rows=1 loops=1)
-> Index Only Scan using staff_tier_idx on staff (cost=0.29..2348.29 rows=500 width=0) (actual time=0.022..15.925 rows=19942 loops=1)
Filter: ((tier)::numeric = '1'::numeric)
Rows Removed by Filter: 80058
Heap Fetches: 0
Planning Time: 0.305 ms
Execution Time: 17.130 ms
Showing that the index has indeed been used.
How do I change this so that the query uses a sequential scan instead of the index? This is purely for testing/learning purposes.
If it's of any importance, I am running this on an Amazon RDS database instance.
From the "Filter" rows of the plan like
Rows Removed by Filter: 80058
you can see that the index is not being used as a real index, but just as a skinny table, testing the casted condition for each row. This appears favorable because the index is less than 1/4 the size of the table, while the default ratio of random_page_cost/seq_page_cost = 4.
In addition to just outright disabling index scans as Adrian already suggested, you could also discourage this "skinny table" usage by just increasing random_page_cost, since pages of indexes are assumed to be read in random order.
Another method would be to change the query so it can't use the index-only scan. For example, just using count(full_name) would do that, as PostgreSQL then needs to visit the table to make sure full_name is not NULL (even though it has a constraint asserting that already--sometimes it is not very clever)
Which method is better depends on what it is you are wanting to test/learn.
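For illustration, the options above could be tried per session like this (values are only examples):
-- 1) Disable index scans for the session (as already suggested):
SET enable_indexscan = off;
SET enable_indexonlyscan = off;
SET enable_bitmapscan = off;   -- otherwise the planner may still use the index via a bitmap scan

-- 2) Discourage the "skinny table" usage by making random page access look expensive:
SET random_page_cost = 100;    -- default is 4 (seq_page_cost defaults to 1)

-- 3) Defeat the index-only scan by referencing a column that is not in the index:
SELECT count(full_name) FROM staff WHERE tier = 1::numeric;

-- reset when done
RESET enable_indexscan; RESET enable_indexonlyscan; RESET enable_bitmapscan; RESET random_page_cost;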

How to access internal representation of JSONb?

In big-data queries the intermediary "CAST to text" is a performance bottleneck... The binary information is already there, in the JSONB datatype: how can it be rescued?
Typical "select where" example:
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE (j->>'flag1')::boolean AND NOT((j->>'flag2')::boolean)
The the "casting to text" is a big loss of performance. Ideal is a mechanism to do direct, from JSONb to Boolean, something as
WHERE (j->'flag1')::magic_boolean AND NOT((j->'flag2')::magic_boolean)
PS: it is possible using C++? Is possible a CREATE CAST C++ implementation to solve this problem?
The feature is implemented in Postgres 11:
E.4.3.4. Data Types
[...]
Add casts from JSONB scalars to numeric and boolean data types (Anastasia Lubennikova)
Db<>Fiddle.
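With that cast in place, the query from the question can skip the text detour entirely, for example:
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE (j->'flag1')::boolean AND NOT((j->'flag2')::boolean);  -- direct jsonb -> boolean cast, Postgres 11+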
TL;DR
Performance-wise it's best to use #> with an appropriate index covering all JSON attributes including type conversions (to avoid type conversions when accessing the index): https://dbfiddle.uk/?rdbms=postgres_11&fiddle=4da77576874651f4d2cf801142ae34d2
CREATE INDEX idx_flags_btree_jsonb ON t ((j#>'{flag1}'), (j#>'{flag2}'));
Times (all selecting the same 5,195 rows out of 1,000,000):
->>::boolean | ~75 ms
->::boolean | ~55 ms
@> (GIN) | ~80 ms
#> (btree) | ~40 ms
Scalability:
Interestingly, a local test with 40M rows (all cached in memory, no I/O latency here) revealed the following (best) numbers out of 10 runs (excluding the first and last run) for each query:
->>::boolean | 222.333 ms
->::boolean | 268.002 ms
@> (GIN) | 1644.605 ms
#> (btree) | 207.230 ms
So, in fact, the new cast seems to slow things down on larger data sets (which I suspect is due to the fact that it still converts to text before converting to boolean but within a wrapper, not directly).
We also can see that the @> containment operator using the GIN index doesn't scale very well here, which is expected, as it is much more generic than the other special-purpose indexes and hence needs to do a lot more under the hood.
However, in case these special purpose btree indexes cannot be put in place or I/O becomes a bottleneck, then the GIN index will be superior as it consumes only a fraction of the space on disk (and also in memory), increasing the chance of an index buffer hit.
But that depends on a lot of factors and needs to be decided with all accessing applications understood.
Details:
Preferably use the @> containment operator with a single GIN index, as it saves a lot of special-purpose indexes:
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE j @> '{"flag1":true, "flag2":false}'::jsonb;
...which gives a plan like:
QUERY PLAN
-----------------------------------------------------------
CTE Scan on t (cost=0.01..0.03 rows=1 width=32)
Filter: (j @> '{"flag1": true, "flag2": false}'::jsonb)
CTE t
-> Result (cost=0.00..0.01 rows=1 width=64)
(4 rows)
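The single GIN index behind that containment query could be created like this (a sketch, assuming t is a real table as in the fiddle; the index names are placeholders):
CREATE INDEX idx_j_gin ON t USING gin (j);
-- or, smaller and faster for pure @> containment, at the cost of supporting fewer operators:
CREATE INDEX idx_j_gin_path ON t USING gin (j jsonb_path_ops);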
As an alternative (if you can afford creating special-purpose indexes and the resulting write penalty), use the #> operator instead of -> or ->>, and thereby skip any costly type conversions, e.g.
with t(x,j) as (select 'hello','{"flag1":true,"flag2":false}'::jsonb)
SELECT x FROM t
WHERE j#>'{flag1}' = 'true'::jsonb AND j#>'{flag2}' = 'false'::jsonb;
...resulting in a plan like:
QUERY PLAN
--------------------------------------------------------------------------------------------------------
CTE Scan on t (cost=0.01..0.04 rows=1 width=32)
Filter: (((j #> '{flag1}'::text[]) = 'true'::jsonb) AND ((j #> '{flag2}'::text[]) = 'false'::jsonb))
CTE t
-> Result (cost=0.00..0.01 rows=1 width=64)
(4 rows)
So, no more implicit type conversion here (only for the given constants, but that's a one-time operation, not for every row).

How does postgres decide whether to use index scan or seq scan?

explain analyze shows that postgres will use index scanning for my query that fetches rows and performs filtering by date (i.e., 2017-04-14 05:27:51.039):
explain analyze select * from tbl t where updated > '2017-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using updated on tbl t (cost=0.43..7317.12 rows=10418 width=93) (actual time=0.011..0.515 rows=1179 loops=1)
Index Cond: (updated > '2017-04-14 05:27:51.039'::timestamp without time zone)
Planning time: 0.102 ms
Execution time: 0.720 ms
however running the same query but with different date filter '2016-04-14 05:27:51.039' shows that postgres will run the query using seq scan instead:
explain analyze select * from tbl t where updated > '2016-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl t (cost=0.00..176103.94 rows=5936959 width=93) (actual time=0.008..2005.455 rows=5871963 loops=1)
Filter: (updated > '2016-04-14 05:27:51.039'::timestamp without time zone)
Rows Removed by Filter: 947
Planning time: 0.100 ms
Execution time: 2910.086 ms
How does postgres decide on what to use, specifically when performing filtering by date?
The Postgres query planner bases its decisions on cost estimates and column statistics, which are gathered by ANALYZE and opportunistically by some other utility commands. That all happens automatically when autovacuum is on (by default).
The manual:
Most queries retrieve only a fraction of the rows in a table, due to
WHERE clauses that restrict the rows to be examined. The planner thus
needs to make an estimate of the selectivity of WHERE clauses, that
is, the fraction of rows that match each condition in the WHERE
clause. The information used for this task is stored in the
pg_statistic system catalog. Entries in pg_statistic are updated by
the ANALYZE and VACUUM ANALYZE commands, and are always approximate
even when freshly updated.
There is a row count (in pg_class), a list of most common values, etc.
The more rows Postgres expects to find, the more likely it will switch to a sequential scan, which is cheaper for retrieving large portions of a table.
Generally, it's index scan -> bitmap index scan -> sequential scan, the more rows are expected to be retrieved.
For your particular example, the important statistic is histogram_bounds, which give Postgres a rough idea how many rows have a greater value than the given one. There is the more convenient view pg_stats for the human eye:
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'tbl'
AND attname = 'updated';
There is a dedicated chapter explaining row estimation in the manual.
Obviously, optimization of queries is tricky. This answer is not intended to dive into the specifics of the Postgres optimizer. Instead, it is intended to give you some background on how the decision to use an index is made.
Your first query is estimated to return 10,418 rows. When using an index, the following operations happen:
The engine uses the index to find the first value meeting the condition.
The engine then loops over the values, finishing when the condition is no longer true.
For each value in the index, the engine then looks up the data on the data page.
In other words, there is a little bit of overhead when using the index -- initializing the index and then looking up each data page individually.
When the engine does a full table scan it:
Starts with the first record on the first page
Does the comparison and accepts or rejects the record
Continues sequentially through all data pages
There is no additional overhead. Further, the engine can "pre-load" the next pages to be scanned while processing the current page. This overlap of I/O and processing is a big win.
The point I'm trying to make is that getting the balance between these two can be tricky. Somewhere between 10,418 and 5,936,959, Postgres decides that the index overhead (and fetching the pages randomly) costs more than just scanning the whole table.
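You can watch that balance tip by pricing both plans yourself, e.g. by temporarily discouraging one of them (for experimentation only):
explain analyze select * from tbl t where updated > '2016-04-14 05:27:51.039';

SET enable_seqscan = off;   -- adds a huge cost penalty to sequential scans, so the index plan is chosen
explain analyze select * from tbl t where updated > '2016-04-14 05:27:51.039';
RESET enable_seqscan;
-- comparing the estimated costs (and actual times) of the two plans shows why the planner switched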

Postgresql 9.4: index not working in a pattern search

I have a table called "doctors" and a field called "fullname" which will store names with accents.
What I need to do is an "accent insensitive + case insensitive" search, something like:
SELECT *
FROM doctors
WHERE unaccent_t(fullname) ~* 'unaccented_and_lowercase_string';
where the value to search will come unaccented+lowercase and unaccent_t is a function defined as:
CREATE FUNCTION unaccent_t(text, lowercase boolean DEFAULT false)
RETURNS text AS
$BODY$
SELECT CASE
WHEN $2 THEN unaccent('unaccent', lower(trim($1)))
ELSE unaccent('unaccent', trim($1))
END;
$BODY$ LANGUAGE sql IMMUTABLE SET search_path = public, pg_temp;
(I already installed 'unaccent' extension).
So, I went ahead and created the index for "fullname" field:
CREATE INDEX doctors_fullname ON doctors (unaccent_t(fullname) text_pattern_ops);
(I also tried with varchar_pattern_ops and also no specifying ops at all)
In the doctors table, I have around 15K rows.
The query works and I get the expected results, but when I add the explain analyze to the query, I don't see that the index is used:
Seq Scan on doctors (cost=0.00..4201.76 rows=5 width=395) (actual time=0.282..182.025 rows=15000 loops=1)
Filter: (unaccent_t((fullname)::text, false) ~* 'garcia'::text)
Rows Removed by Filter: 1
Planning time: 0.207 ms
Execution time: 183.387 ms
I also tried removing the optional parameter from unaccent_t but I got the same results.
In a scenario like this, how should I define the index so it gets used in a query like the one above?
Btree indexes are usable to speed up operations only when the pattern is left anchored.
Starting from PostgreSQL 9.3 you can speed up generic regular expression searches using a GIN or GiST index with the operator classes provided by the pg_trgm contrib module.
You can read more about it in the PostgreSQL manual at http://www.postgresql.org/docs/9.4/static/pgtrgm.html#AEN163078
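A trigram index on the same expression could look like this (a sketch; the index name is a placeholder, and the indexed expression must match the one used in the query):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX doctors_fullname_trgm
ON doctors USING gin (unaccent_t(fullname) gin_trgm_ops);
-- usable for ~*, LIKE and ILIKE patterns, including non-left-anchored ones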

PostgreSQL - max number of parameters in "IN" clause?

In Postgres, you can specify an IN clause, like this:
SELECT * FROM user WHERE id IN (1000, 1001, 1002)
Does anyone know what's the maximum number of parameters you can pass into IN?
According to the source code located here, starting at line 850, PostgreSQL doesn't explicitly limit the number of arguments.
The following is a code comment from line 870:
/*
* We try to generate a ScalarArrayOpExpr from IN/NOT IN, but this is only
* possible if the inputs are all scalars (no RowExprs) and there is a
* suitable array type available. If not, we fall back to a boolean
* condition tree with multiple copies of the lefthand expression.
* Also, any IN-list items that contain Vars are handled as separate
* boolean conditions, because that gives the planner more scope for
* optimization on such clauses.
*
* First step: transform all the inputs, and detect whether any are
* RowExprs or contain Vars.
*/
This is not really an answer to the present question, but it might help others too.
At least I can tell there is a technical limit of 32767 values (= Short.MAX_VALUE) passable to the PostgreSQL backend when using PostgreSQL's JDBC driver 9.1.
This is a test of "delete from x where id in (... 100k values...)" with the postgresql jdbc driver:
Caused by: java.io.IOException: Tried to send an out-of-range integer as a 2-byte value: 100000
at org.postgresql.core.PGStream.SendInteger2(PGStream.java:201)
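One common way around that driver-side limit (not part of the original answer, just a sketch) is to bind a single array parameter instead of thousands of scalar parameters; in plain SQL terms:
PREPARE by_ids (integer[]) AS
SELECT * FROM "user" WHERE id = ANY ($1);   -- one array parameter, any number of elements

EXECUTE by_ids ('{1000,1001,1002}');        -- "user" is quoted here because it is a reserved word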
explain select * from test where id in (1, 2);
QUERY PLAN
Seq Scan on test (cost=0.00..1.38 rows=2 width=208)
Filter: (id = ANY ('{1,2}'::bigint[]))
But if you try the 2nd query:
explain select * from test where id = any (values (1), (2));
QUERY PLAN
Hash Semi Join (cost=0.05..1.45 rows=2 width=208)
Hash Cond: (test.id = "*VALUES*".column1)
-> Seq Scan on test (cost=0.00..1.30 rows=30 width=208)
-> Hash (cost=0.03..0.03 rows=2 width=4)
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=4)
We can see that Postgres materializes the VALUES list as a relation and hash-joins against it.
As someone more experienced with Oracle DB, I was concerned about this limit too. I carried out a performance test for a query with ~10'000 parameters in an IN-list, fetching prime numbers up to 100'000 from a table with the first 100'000 integers by actually listing all the prime numbers as query parameters.
My results indicate that you need not worry about overloading the query plan optimizer or getting plans without index usage, since it will transform the query to use = ANY({...}::integer[]) where it can leverage indices as expected:
-- prepare statement, runs instantaneous:
PREPARE hugeplan (integer, integer, integer, ...) AS
SELECT *
FROM primes
WHERE n IN ($1, $2, $3, ..., $9592);
-- fetch the prime numbers:
EXECUTE hugeplan(2, 3, 5, ..., 99991);
-- EXPLAIN ANALYZE output for the EXECUTE:
"Index Scan using n_idx on primes (cost=0.42..9750.77 rows=9592 width=5) (actual time=0.024..15.268 rows=9592 loops=1)"
" Index Cond: (n = ANY ('{2,3,5,7, (...)"
"Execution time: 16.063 ms"
-- setup, should you care:
CREATE TABLE public.primes
(
n integer NOT NULL,
prime boolean,
CONSTRAINT n_idx PRIMARY KEY (n)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.primes
OWNER TO postgres;
INSERT INTO public.primes
SELECT generate_series(1,100000);
However, this (rather old) thread on the pgsql-hackers mailing list indicates that there is still a non-negligible cost in planning such queries, so take my word with a grain of salt.
There is no limit on the number of elements you can pass to an IN clause. If there are many elements, PostgreSQL treats the list as an array, and for each row scanned it checks whether the value is contained in that array. This approach does not scale well. Instead of using an IN clause, try an INNER JOIN with a temp table; see http://www.xaprb.com/blog/2006/06/28/why-large-in-clauses-are-problematic/ for more info. An INNER JOIN scales well because the query optimizer can use a hash join and other optimizations, whereas with an IN clause the optimizer has far less room to optimize the query. I have noticed a speedup of at least 2x with this change.
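A minimal sketch of that join-based rewrite (table and column names follow the question; how the temp table gets filled depends on your client):
CREATE TEMP TABLE wanted_ids (id integer PRIMARY KEY);
INSERT INTO wanted_ids VALUES (1000), (1001), (1002);  -- typically filled via COPY or batched inserts

SELECT u.*
FROM "user" u
JOIN wanted_ids w ON w.id = u.id;   -- the planner can now pick a hash or merge join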
I just tried it. The answer is:
out-of-range integer as a 2-byte value: 32768
You might want to consider refactoring that query instead of adding an arbitrarily long list of ids... You could use a range if the ids indeed follow the pattern in your example:
SELECT * FROM user WHERE id >= minValue AND id <= maxValue;
Another option is to add an inner select:
SELECT *
FROM user
WHERE id IN (
SELECT userId
FROM ForumThreads ft
WHERE ft.id = X
);
If you have a query like:
SELECT * FROM user WHERE id IN (1, 2, 3, 4 -- and thousands of other keys)
you may increase performance if you rewrite your query like:
SELECT * FROM user WHERE id = ANY(VALUES (1), (2), (3), (4) -- and thousands of other keys)