PostgreSQL - max number of parameters in "IN" clause? - postgresql

In Postgres, you can specify an IN clause, like this:
SELECT * FROM user WHERE id IN (1000, 1001, 1002)
Does anyone know the maximum number of parameters you can pass to IN?

According to the source code located here, starting at line 850, PostgreSQL doesn't explicitly limit the number of arguments.
The following is a code comment from line 870:
/*
* We try to generate a ScalarArrayOpExpr from IN/NOT IN, but this is only
* possible if the inputs are all scalars (no RowExprs) and there is a
* suitable array type available. If not, we fall back to a boolean
* condition tree with multiple copies of the lefthand expression.
* Also, any IN-list items that contain Vars are handled as separate
* boolean conditions, because that gives the planner more scope for
* optimization on such clauses.
*
* First step: transform all the inputs, and detect whether any are
* RowExprs or contain Vars.
*/
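As a quick illustration of that transformation (using a throwaway table of my own, not from the question), the planner folds the whole IN list into a single = ANY array condition rather than a chain of OR comparisons; a minimal sketch:
-- Throwaway table just to show the rewrite; the IN list becomes one array condition.
CREATE TEMP TABLE t (id integer);
EXPLAIN SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5);
-- The plan shows something like:
--   Seq Scan on t
--     Filter: (id = ANY ('{1,2,3,4,5}'::integer[]))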

This is not really an answer to the question as asked, but it might help others.
At least I can say that there is a technical limit of 32767 values (= Short.MAX_VALUE) that can be passed to the PostgreSQL backend when using PostgreSQL's JDBC driver 9.1.
This is a test of "delete from x where id in (... 100k values...)" with the PostgreSQL JDBC driver:
Caused by: java.io.IOException: Tried to send an out-of-range integer as a 2-byte value: 100000
at org.postgresql.core.PGStream.SendInteger2(PGStream.java:201)
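If you hit that driver limit, one SQL-side workaround (a sketch only, assuming the id column of the x table from the test above is bigint) is to bind a single array and compare with = ANY, so the whole list counts as one parameter:
-- One bind parameter, no matter how many ids the array holds.
PREPARE delete_by_ids (bigint[]) AS
  DELETE FROM x WHERE id = ANY ($1);
EXECUTE delete_by_ids ('{1, 2, 3}');  -- the array literal can carry far more than 32767 ids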

explain select * from test where id in (values (1), (2));
QUERY PLAN
Seq Scan on test (cost=0.00..1.38 rows=2 width=208)
Filter: (id = ANY ('{1,2}'::bigint[]))
But if we try the second query:
explain select * from test where id = any (values (1), (2));
QUERY PLAN
Hash Semi Join (cost=0.05..1.45 rows=2 width=208)
Hash Cond: (test.id = "*VALUES*".column1)
-> Seq Scan on test (cost=0.00..1.30 rows=30 width=208)
-> Hash (cost=0.03..0.03 rows=2 width=4)
-> Values Scan on "*VALUES*" (cost=0.00..0.03 rows=2 width=4)
We can see that Postgres builds a temporary table from the VALUES list and joins against it.

As someone more experienced with Oracle DB, I was concerned about this limit too. I carried out a performance test for a query with ~10'000 parameters in an IN-list, fetching prime numbers up to 100'000 from a table with the first 100'000 integers by actually listing all the prime numbers as query parameters.
My results indicate that you need not worry about overloading the query planner or getting plans without index usage, since PostgreSQL transforms the query to use = ANY({...}::integer[]), where it can leverage indexes as expected:
-- prepare statement, runs instantaneous:
PREPARE hugeplan (integer, integer, integer, ...) AS
SELECT *
FROM primes
WHERE n IN ($1, $2, $3, ..., $9592);
-- fetch the prime numbers:
EXECUTE hugeplan(2, 3, 5, ..., 99991);
-- EXPLAIN ANALYZE output for the EXECUTE:
"Index Scan using n_idx on primes (cost=0.42..9750.77 rows=9592 width=5) (actual time=0.024..15.268 rows=9592 loops=1)"
" Index Cond: (n = ANY ('{2,3,5,7, (...)"
"Execution time: 16.063 ms"
-- setup, should you care:
CREATE TABLE public.primes
(
n integer NOT NULL,
prime boolean,
CONSTRAINT n_idx PRIMARY KEY (n)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.primes
OWNER TO postgres;
INSERT INTO public.primes
SELECT generate_series(1,100000);
However, this (rather old) thread on the pgsql-hackers mailing list indicates that there is still a non-negligible cost in planning such queries, so take my word with a grain of salt.

There is no limit on the number of elements you can pass to an IN clause. If there are many elements, PostgreSQL treats the list as an array, and for each row scanned it checks whether the value is contained in that array. This approach does not scale well. Instead of an IN clause, try an INNER JOIN against a temporary table; see http://www.xaprb.com/blog/2006/06/28/why-large-in-clauses-are-problematic/ for more info. An INNER JOIN scales well because the query optimizer can use a hash join and other optimizations, whereas with a huge IN list the optimizer has far fewer options. I have noticed a speedup of at least 2x with this change.
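A rough sketch of the temp-table approach (the temp table and its name are made up for illustration; "user" has to be quoted because it is a reserved word):
-- Load the keys once, then join instead of using a huge IN (...) list.
CREATE TEMP TABLE wanted_ids (id integer PRIMARY KEY);
INSERT INTO wanted_ids (id) VALUES (1000), (1001), (1002);  -- ... thousands more
ANALYZE wanted_ids;
SELECT u.*
FROM "user" u
INNER JOIN wanted_ids w ON w.id = u.id;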

Just tried it. The answer is:
out-of-range integer as a 2-byte value: 32768

You might want to consider refactoring that query instead of adding an arbitrarily long list of ids... You could use a range if the ids indeed follow the pattern in your example:
SELECT * FROM user WHERE id >= minValue AND id <= maxValue;
Another option is to add an inner select:
SELECT *
FROM user
WHERE id IN (
SELECT userId
FROM ForumThreads ft
WHERE ft.id = X
);

If you have a query like:
SELECT * FROM user WHERE id IN (1, 2, 3, 4 -- and thousands of another keys)
you may improve performance by rewriting the query like this:
SELECT * FROM user WHERE id = ANY(VALUES (1), (2), (3), (4) -- and thousands of another keys)

Related

Why does postgres use index scan over sequential scan even with a mismatching data type on the indexed column and query condition

I have the following PostgreSQL table:
CREATE TABLE staff (
id integer primary key,
full_name VARCHAR(100) NOT NULL,
department VARCHAR(100) NULL,
tier bigint
);
I filled the table with random data using the following block:
do $$
begin
  FOR counter IN 1 .. 100000 LOOP
    -- random_string() and get_department() are helper functions defined elsewhere
    INSERT INTO staff (id, full_name, department, tier)
    VALUES (nextval('staff_sequence'),
            random_string(10),
            get_department(),
            floor(random() * 5 + 1)::bigint);
  END LOOP;
end; $$;
After the data is populated, I created an index on this table on the tier column:
create index staff_tier_idx on staff(tier);
Although I created this index, when I execute a query using this column, I want this index NOT to be used. To accomplish this, I tried to execute this query:
select count(*) from staff where tier=1::numeric;
Because of the mismatching data types between the indexed column and the query condition, I thought the index would not be used and a sequential scan would be executed instead. However, when I run EXPLAIN ANALYZE on the above query I get the following output:
Aggregate (cost=2349.54..2349.55 rows=1 width=8) (actual time=17.078..17.079 rows=1 loops=1)
-> Index Only Scan using staff_tier_idx on staff (cost=0.29..2348.29 rows=500 width=0) (actual time=0.022..15.925 rows=19942 loops=1)
Filter: ((tier)::numeric = '1'::numeric)
Rows Removed by Filter: 80058
Heap Fetches: 0
Planning Time: 0.305 ms
Execution Time: 17.130 ms
This shows that the index has indeed been used.
How do I change this so that the query uses a sequential scan instead of the index? This is purely for testing/learning purposes.
If it's of any importance, I am running this on an Amazon RDS database instance.
From the "Filter" rows of the plan like
Rows Removed by Filter: 80058
you can see that the index is not being used as a real index, but just as a skinny table, testing the casted condition for each row. This appears favorable because the index is less than 1/4 the size of the table, while the default ratio of random_page_cost/seq_page_cost = 4.
In addition to just outright disabling index scans as Adrian already suggested, you could also discourage this "skinny table" usage by just increasing random_page_cost, since pages of indexes are assumed to be read in random order.
Another method would be to change the query so it can't use the index-only scan. For example, just using count(full_name) would do that, as PostgreSQL then needs to visit the table to make sure full_name is not NULL (even though a constraint already asserts that; sometimes the planner is not very clever).
Which method is better depends on what it is you are wanting to test/learn.
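For reference, here is a minimal sketch of the options mentioned above; the settings are session-local and the values are only illustrative:
-- Option 1: disable index and index-only scans for this session.
SET enable_indexscan = off;
SET enable_indexonlyscan = off;
SELECT count(*) FROM staff WHERE tier = 1::numeric;
RESET enable_indexscan;
RESET enable_indexonlyscan;
-- Option 2: make random (index) page reads look expensive.
SET random_page_cost = 100;
SELECT count(*) FROM staff WHERE tier = 1::numeric;
RESET random_page_cost;
-- Option 3: defeat the index-only scan by referencing a column the index does not cover.
SELECT count(full_name) FROM staff WHERE tier = 1::numeric;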

Postgres SLOWER when a LIMIT is set: how to fix besides adding a dummy `ORDER BY`?

In Postgres, some queries are a whole lot slower when adding a LIMIT:
The queries:
SELECT * FROM review WHERE clicker_id=28 ORDER BY done DESC LIMIT 4; -- 51 sec
SELECT * FROM review WHERE clicker_id=28 ORDER BY id, done DESC LIMIT 4; -- 0.020s
SELECT * FROM review WHERE clicker_id=28 LIMIT 4; -- 0.007s
SELECT * FROM review WHERE clicker_id=28 ORDER BY id; -- 0.007s
As you can see, I need to add a dummy id to the ORDER BY in order for things to go fast. I'm trying to understand why.
Running EXPLAIN on them:
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY done DESC LIMIT 4;
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id, done DESC LIMIT 4;
EXPLAIN SELECT * FROM review WHERE clicker_id=28 LIMIT 4;
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id;
gives this:
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY done DESC LIMIT 4
Limit (cost=0.44..249.76 rows=4 width=56)
-> Index Scan using review_done on review (cost=0.44..913081.13 rows=14649 width=56)
Filter: (clicker_id = 28)
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id, done DESC LIMIT 4
Limit (cost=11970.75..11970.76 rows=4 width=56)
-> Sort (cost=11970.75..12007.37 rows=14649 width=56)
Sort Key: id, done DESC
-> Index Scan using review_clicker_id on review (cost=0.44..11751.01 rows=14649 width=56)
Index Cond: (clicker_id = 28)
EXPLAIN SELECT * FROM review WHERE clicker_id=28 LIMIT 4
Limit (cost=0.44..3.65 rows=4 width=56)
-> Index Scan using review_clicker_id on review (cost=0.44..11751.01 rows=14649 width=56)
Index Cond: (clicker_id = 28)
EXPLAIN SELECT * FROM review WHERE clicker_id=28 ORDER BY id
Sort (cost=12764.61..12801.24 rows=14649 width=56)
Sort Key: id
-> Index Scan using review_clicker_id on review (cost=0.44..11751.01 rows=14649 width=56)
Index Cond: (clicker_id = 28)
I'm no SQL expert, but I take it Postgres expected the query to be faster than it actually is, and so used a way to fetch the data that's actually inappropriate, correct?
The database:
The review table:
Contains 22+ million rows.
A given user has at most 7,066 rows.
The one in the test (id 28) has 288 at the time of writing.
Has this structure:
id: bigint Auto Increment [nextval('public.review_id_seq')]
type: review_type NULL
iteration: smallint NULL
repetitions: smallint NULL
due: timestamptz NULL
done: timestamptz NULL
added: timestamptz NULL
clicker_id: bigint NULL
monologue_id: bigint NULL
Has these indexes:
UNIQUE type, clicker_id, monologue_id, iteration
INDEX clicker_id
INDEX done, due, monologue_id
INDEX id
INDEX done DESC
INDEX type
Additional details:
Environment:
The queries were run in development with Postgres 9.6.14.
Running the queries in production (Heroku Postgres, version 9.6.16), the difference is less dramatic but still not great: the slow queries can take 600 ms.
Variable speed:
Sometimes, the same queries (be it the exact same, or for a different clicker_id) run a lot faster (under 1 sec), but I don't understand why. I need them to be consistently fast.
If I use LIMIT 288 for a user that has 288 rows, then it's so much faster (< 1sec), but if I do the same for a user with say 7066 rows then it's back to super slow.
Before I figured out the dummy ORDER BY workaround, I tried these:
Re-importing the database.
analyze review;
Setting the index for done to DESC (used to be set to default/ASC.) [The challenge then was that there's no proper way to check if/when the index is done rebuilding.]
None helped.
The question:
My issue in itself is solved, but I'm dissatisfied with it:
Is there a name for this "pattern" that consists of adding a dummy ORDER BY to speed things up?
How can I spot such issues in the future? (This took ages to figure out.) Unless I missed something, the EXPLAIN output is not that useful:
For the slow query, the cost is misleadingly low, while for the fast variant it's misleadingly high.
Alternative: is there another way to handle this? (Because this solution feels like a hack.)
Thanks!
Similar questions:
PostgreSQL query very slow with limit 1 is almost the same question, except his queries were slow with LIMIT 1 and fine with LIMIT 3 or higher. And then of course the data is not the same.
The underlying issue here is what's called an abort-early query plan. Here's a thread from pgsql-hackers describing something similar:
https://www.postgresql.org/message-id/541A2335.3060100%40agliodbs.com
Quoting from there, this is why the planner is using the often-extremely-slow index scan when the ORDER BY done DESC is present:
As usual, PostgreSQL is dramatically undercounting n_distinct: it shows
chapters.user_id at 146,000 and the ratio of to_user_id:from_user_id as
being 1:105 (as opposed to 1:6, which is about the real ratio). This
means that PostgreSQL thinks it can find the 20 rows within the first 2%
of the index ... whereas it actually needs to scan 50% of the index to
find them.
In your case, the planner thinks that if it just starts going through rows ordered by done desc (IOW, using the review_done index), it will find 4 rows with clicker_id=28 quickly. Since the rows need to be returned in "done" descending order, it thinks this will save a sort step and be faster than retrieving all rows for clicker 28 and then sorting. Given the real-world distribution of rows, this can often turn out not to be the case, requiring it to skip a huge number of rows before finding 4 with clicker=28.
A more general way of handling it is to use a CTE (which, in 9.6, is still an optimization fence; this changes in PG 12, FYI) to fetch the rows without an ORDER BY, then add the ORDER BY in the outer query. Given that fetching all rows for a user, sorting them, and returning however many you need is completely reasonable for your dataset (even the 7k-row clicker shouldn't be an issue), you can prevent the planner from believing an abort-early plan will be fastest by not having an ORDER BY or a LIMIT in the CTE, giving you a query something like:
WITH clicker_rows as (SELECT * FROM review WHERE clicker_id=28)
SELECT * FROM clicker_rows ORDER BY done DESC LIMIT 4;
This should be reliably fast while still respecting the ORDER BY and the LIMIT you want. I'm not sure if there's a name for this pattern, though.
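As a side note, if you later move to PostgreSQL 12 or newer, where CTEs are no longer an optimization fence by default, you would need to mark the CTE as MATERIALIZED to keep the same behavior; a sketch:
WITH clicker_rows AS MATERIALIZED (
  SELECT * FROM review WHERE clicker_id = 28
)
SELECT * FROM clicker_rows ORDER BY done DESC LIMIT 4;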

PostgreSQL - existence of index causes hash-join

I was looking at the EXPLAIN output of a natural-join query over two simple tables. At first, the PostgreSQL planner uses a merge join. Then I add an index on the join attribute, and that causes the planner to use a hash join instead (and with a sequential read of the data!).
So my question is: why does the existence of an index cause a hash join?
Additional data & code:
I defined two relations, R(A,B) and S(B,C), without primary keys or the like.
I filled the tables with a few rows of data (~5 each, such that there are common values of attribute B in R and S).
Then I executed:
EXPLAIN VERBOSE SELECT * FROM R NATURAL JOIN S;
which resulted
Merge Join (cost=317.01..711.38 rows=25538 width=12)...
and finally, executed:
CREATE INDEX SI on S(B);
EXPLAIN VERBOSE SELECT * FROM R NATURAL JOIN S;
which resulted
Hash Join (cost=1.09..42.62 rows=45 width=12)...
Seq Scan on "user".s (cost=0.00..1.04 rows=4 width=8)
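For reference, a minimal reproduction of the setup described in this question might look like the following sketch (column types and sample values are assumed, since the original DDL is not shown):
CREATE TABLE r (a integer, b integer);
CREATE TABLE s (b integer, c integer);
-- A handful of rows with overlapping values of B in both tables.
INSERT INTO r (a, b) VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5);
INSERT INTO s (b, c) VALUES (1, 10), (2, 20), (3, 30), (6, 60), (7, 70);
EXPLAIN VERBOSE SELECT * FROM r NATURAL JOIN s;  -- the question reports a merge join here
CREATE INDEX si ON s (b);
EXPLAIN VERBOSE SELECT * FROM r NATURAL JOIN s;  -- and a hash join afterwards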

Postgresql Sorting a Joined Table with an index

I'm currently working on a complex sorting problem in Postgres 9.2
You can find the Source Code used in this Question(simplified) here: http://sqlfiddle.com/#!12/9857e/11
I have a huge (>>20 million rows) table containing various columns of different types.
CREATE TABLE data_table
(
id bigserial PRIMARY KEY,
column_a character(1),
column_b integer
-- ~100 more columns
);
Let's say I want to sort this table over two columns (ASC).
But I don't want to do that with a simple ORDER BY, because later I might need to insert rows into the sorted output, and the user probably only wants to see 100 rows of the sorted output at a time.
To achieve these goals I do the following:
CREATE TABLE meta_table
(
id bigserial PRIMARY KEY,
id_data bigint NOT NULL -- refers to the data_table
);
--Function to get the Column A of the current row
CREATE OR REPLACE FUNCTION get_column_a(bigint)
RETURNS character AS
'SELECT column_a FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Function to get the Column B of the current row
CREATE OR REPLACE FUNCTION get_column_b(bigint)
RETURNS integer AS
'SELECT column_b FROM data_table WHERE id=$1'
LANGUAGE sql IMMUTABLE STRICT;
--Creating a index on expression:
CREATE INDEX meta_sort_index
ON meta_table
USING btree
(get_column_a(id_data), get_column_b(id_data), id_data);
And then I copy the IDs from the data_table into the meta_table:
INSERT INTO meta_table(id_data) (SELECT id FROM data_table);
Later I can add additional rows to the table with a similar simple insert.
To get rows 900000 - 900099 (100 rows) I can now use:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
ORDER BY 1,2,3 OFFSET 900000 LIMIT 100;
(With an additional INNER JOIN on data_table if I want all the data.)
The Resulting Plan is:
Limit (cost=498956.59..499012.03 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..554396.21 rows=1000000 width=8)
This is a pretty efficient plan (Index Only Scans are new in Postgres 9.2).
But what if I want to get rows 20'000'000 - 20'000'099 (100 rows)? Same plan, much longer execution time. Well, to improve the OFFSET performance (see Improving OFFSET performance in PostgreSQL) I can do the following (let's assume I saved every 100'000th row away into another table):
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE (get_column_a(id_data), get_column_b(id_data), id_data ) >= (get_column_a(587857), get_column_b(587857), 587857 )
ORDER BY 1,2,3 LIMIT 100;
This runs much faster. The Resulting Plan is:
Limit (cost=0.51..61.13 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.51..193379.65 rows=318954 width=8)
Index Cond: (ROW((get_column_a(id_data)), (get_column_b(id_data)), id_data) >= ROW('Z'::bpchar, 27857, 587857))
So far everything works perfectly and Postgres does a great job!
Let's assume I now want to change the order of the second column to DESC.
But then I would have to change my WHERE clause, because the row-wise > operator compares both columns in ascending order. The same query as above (ASC ordering) could also be written as:
SELECT get_column_a(id_data), get_column_b(id_data), id_data
FROM meta_table
WHERE
(get_column_a(id_data) > get_column_a(587857))
OR (get_column_a(id_data) = get_column_a(587857) AND ((get_column_b(id_data) > get_column_b(587857))
OR ( (get_column_b(id_data) = get_column_b(587857)) AND (id_data >= 587857))))
ORDER BY 1,2,3 LIMIT 100;
Now the Plan Changes and the Query becomes slow:
Limit (cost=0.00..1095.94 rows=100 width=8)
-> Index Only Scan using meta_sort_index on meta_table (cost=0.00..1117877.41 rows=102002 width=8)
Filter: (((get_column_a(id_data)) > 'Z'::bpchar) OR (((get_column_a(id_data)) = 'Z'::bpchar) AND (((get_column_b(id_data)) > 27857) OR (((get_column_b(id_data)) = 27857) AND (id_data >= 587857)))))
How can I get the efficient plan from above with DESC ordering on the second column?
Do you have any better ideas for solving this problem?
(I already tried declaring my own type with its own operator classes, but that's too slow.)
You need to rethink your approach. Where to begin? This is a clear example of the performance limits of the functional approach you are taking to SQL. Functions are largely opaque to the planner, and you are forcing two separate lookups on data_table for every row retrieved because the functions' plans cannot be folded together.
Now, far worse, you are indexing one table based on data in another. This might work for append-only workloads (inserts allowed but no updates) but it will not work if data_table can ever have updates applied. If the data in data_table ever changes, you will have the index return wrong results.
In these cases, you are almost always better off writing in the join as explicit, and letting the planner figure out the best way to retrieve the data.
Now your problem is that the index becomes a lot less useful (and a lot more disk-I/O-intensive) when you change the sort order of your second column. On the other hand, if you had two different indexes directly on data_table and wrote the join out explicitly, PostgreSQL could handle this much more easily.
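A minimal sketch of what that could look like, with two plain indexes directly on data_table (one per sort direction; the index names are my own) and the join written out explicitly:
-- One index per sort direction of column_b, directly on the data.
CREATE INDEX data_table_sort_asc_idx  ON data_table (column_a, column_b, id);
CREATE INDEX data_table_sort_desc_idx ON data_table (column_a, column_b DESC, id);
-- Explicit join; the planner can now pick the index matching either direction.
SELECT d.column_a, d.column_b, d.id
FROM meta_table m
JOIN data_table d ON d.id = m.id_data
ORDER BY d.column_a, d.column_b DESC, d.id
LIMIT 100;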

Postgres query optimization (forcing an index scan)

Below is my query. I am trying to get it to use an index scan, but it will only seq scan.
By the way the metric_data table has 130 million rows. The metrics table has about 2000 rows.
metric_data table columns:
metric_id integer
, t timestamp
, d double precision
, PRIMARY KEY (metric_id, t)
How can I get this query to use my PRIMARY KEY index?
SELECT
S.metric,
D.t,
D.d
FROM metric_data D
INNER JOIN metrics S
ON S.id = D.metric_id
WHERE S.NAME = ANY (ARRAY ['cpu', 'mem'])
AND D.t BETWEEN '2012-02-05 00:00:00'::TIMESTAMP
AND '2012-05-05 00:00:00'::TIMESTAMP;
EXPLAIN:
Hash Join (cost=271.30..3866384.25 rows=294973 width=25)
Hash Cond: (d.metric_id = s.id)
-> Seq Scan on metric_data d (cost=0.00..3753150.28 rows=29336784 width=20)
Filter: ((t >= '2012-02-05 00:00:00'::timestamp without time zone)
AND (t <= '2012-05-05 00:00:00'::timestamp without time zone))
-> Hash (cost=270.44..270.44 rows=68 width=13)
-> Seq Scan on metrics s (cost=0.00..270.44 rows=68 width=13)
Filter: ((sym)::text = ANY ('{cpu,mem}'::text[]))
For testing purposes you can force the use of the index by "disabling" sequential scans - best in your current session only:
SET enable_seqscan = OFF;
Do not use this on a production server. Details in the manual here.
I put "disabling" in quotes because you cannot actually disable sequential table scans; the setting just makes any other available option preferable to the planner. This will prove that the multicolumn index on (metric_id, t) can be used, just not as effectively as an index with t as the leading column.
You will probably get better results by switching the order of columns in your PRIMARY KEY (and of the index that implements it behind the curtains) to (t, metric_id), or by creating an additional index with the columns reversed like that.
Is a composite index also good for queries on the first field?
You do not normally have to force better query plans by manual intervention. If setting enable_seqscan = OFF leads to a much better plan, something is probably not right in your database. Consider this related answer:
Keep PostgreSQL from sometimes choosing a bad query plan
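A quick way to try both suggestions from the answer above in a scratch session (the index name is my own):
-- Session-local, for testing only: penalize sequential scans heavily.
SET enable_seqscan = OFF;
EXPLAIN ANALYZE
SELECT S.metric, D.t, D.d
FROM metric_data D
INNER JOIN metrics S ON S.id = D.metric_id
WHERE S.NAME = ANY (ARRAY['cpu', 'mem'])
  AND D.t BETWEEN '2012-02-05 00:00:00'::TIMESTAMP
              AND '2012-05-05 00:00:00'::TIMESTAMP;
RESET enable_seqscan;
-- Additional index with reversed column order, so t is the leading column.
CREATE INDEX metric_data_t_metric_id_idx ON metric_data (t, metric_id);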
You cannot force an index scan in this case because it will not make the query faster.
You currently have an index on metric_data (metric_id, t), but the server cannot take advantage of it for your query, because it would need to filter by metric_data.t alone (without metric_id), and there is no such index. The server can use the columns of a compound index, but only starting from the leading column; for example, searching by metric_id alone could employ this index.
If you create another index on metric_data (t), your query will use that index and run much faster.
Also make sure that you have an index on metrics (id).
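A minimal sketch of those two indexes (the names are my own; the second one is redundant if metrics.id is already the primary key):
CREATE INDEX metric_data_t_idx ON metric_data (t);
CREATE INDEX metrics_id_idx ON metrics (id);  -- skip if id is already the PK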
Have you tried using:
WHERE S.NAME = ANY (VALUES ('cpu'), ('mem'))
instead of the ARRAY construct, as suggested here?
It appears you are lacking suitable FK constraints:
CREATE TABLE metric_data
( metric_id integer
, t timestamp
, d double precision
, PRIMARY KEY (metric_id, t)
, CONSTRAINT metrics_xxx_fk FOREIGN KEY (metric_id) REFERENCES metrics (id)
)
and in table metrics:
CREATE TABLE metrics
( id INTEGER PRIMARY KEY
...
);
Also check whether your statistics are sufficient (and fine-grained enough, since you intend to select about 0.2 % of the metric_data table).
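If the statistics turn out to be too coarse, something along these lines raises the statistics target for the timestamp column (the value 1000 is only illustrative):
ALTER TABLE metric_data ALTER COLUMN t SET STATISTICS 1000;
ANALYZE metric_data;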