PostgreSQL: speeding up GROUP BY over 30 million rows

Is there any way to speed up a dynamic GROUP BY query? I have a table with 30 million rows.
create table if not exists tb
(
id serial not null constraint tb_pkey primary key,
week integer,
month integer,
year integer,
starttime varchar(20),
endtime varchar(20),
brand smallint,
category smallint,
value real
);
The query below takes 8.5 seconds.
SELECT category from tb group by category
Is there any way to increase the speed? I have tried with and without an index.

For that exact query, not really; doing this operation requires scanning every row. No way around it.
But if you're looking to be able to quickly get the set of unique categories, and you have an index on that column, you can use a variation of the WITH RECURSIVE example shown in the edit to the question here (look towards the end of the question):
Counting distinct rows using recursive cte over non-distinct index
You'll need to change it to return the set of unique values instead of counting them, but it looks like a simple change:
testdb=# create table tb(id bigserial, category smallint);
CREATE TABLE
testdb=# insert into tb(category) select 2 from generate_series(1, 10000)
testdb-# ;
INSERT 0 10000
testdb=# insert into tb(category) select 1 from generate_series(1, 10000);
INSERT 0 10000
testdb=# insert into tb(category) select 3 from generate_series(1, 10000);
INSERT 0 10000
testdb=# create index on tb(category);
CREATE INDEX
testdb=# WITH RECURSIVE cte AS
(
(SELECT category
FROM tb
WHERE category >= 0
ORDER BY 1
LIMIT 1)
UNION ALL SELECT
(SELECT category
FROM tb
WHERE category > c.category
ORDER BY 1
LIMIT 1)
FROM cte c
WHERE category IS NOT NULL)
SELECT category
FROM cte
WHERE category IS NOT NULL;
category
----------
1
2
3
(3 rows)
And here's the EXPLAIN ANALYZE:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
CTE Scan on cte (cost=40.66..42.68 rows=100 width=2) (actual time=0.057..0.127 rows=3 loops=1)
Filter: (category IS NOT NULL)
Rows Removed by Filter: 1
CTE cte
-> Recursive Union (cost=0.29..40.66 rows=101 width=2) (actual time=0.052..0.119 rows=4 loops=1)
-> Limit (cost=0.29..0.33 rows=1 width=2) (actual time=0.051..0.051 rows=1 loops=1)
-> Index Only Scan using tb_category_idx on tb tb_1 (cost=0.29..1363.29 rows=30000 width=2) (actual time=0.050..0.050 rows=1 loops=1)
Index Cond: (category >= 0)
Heap Fetches: 1
-> WorkTable Scan on cte c (cost=0.00..3.83 rows=10 width=2) (actual time=0.015..0.015 rows=1 loops=4)
Filter: (category IS NOT NULL)
Rows Removed by Filter: 0
SubPlan 1
-> Limit (cost=0.29..0.36 rows=1 width=2) (actual time=0.016..0.016 rows=1 loops=3)
-> Index Only Scan using tb_category_idx on tb (cost=0.29..755.95 rows=10000 width=2) (actual time=0.015..0.015 rows=1 loops=3)
Index Cond: (category > c.category)
Heap Fetches: 2
Planning time: 0.224 ms
Execution time: 0.191 ms
(19 rows)
The number of loops it has to do the WorkTable scan node will be equal to the number of unique categories you have plus one, so it should stay very fast up to, say, hundreds of unique values.
Another route you can take is to add another table where you just store unique values of tb.category and have application logic check that table and insert their value when updating/inserting that column. This can also be done database-side with triggers; that solution is also discussed in the answers to the linked question.
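A minimal sketch of the trigger-based variant, for illustration only. The table name tb_categories, the function name tb_track_category and the trigger name are hypothetical, and ON CONFLICT assumes PostgreSQL 9.5 or later. It only ever adds categories; removing a category once its last row is deleted is not handled here.

```sql
-- Hypothetical lookup table holding one row per distinct tb.category.
CREATE TABLE IF NOT EXISTS tb_categories (
    category smallint PRIMARY KEY
);

-- Seed it once from the existing data (this pays for the full scan a single time).
INSERT INTO tb_categories (category)
SELECT DISTINCT category FROM tb WHERE category IS NOT NULL
ON CONFLICT DO NOTHING;

-- Record every new category value as rows are inserted or updated.
CREATE OR REPLACE FUNCTION tb_track_category() RETURNS trigger AS $$
BEGIN
    IF NEW.category IS NOT NULL THEN
        INSERT INTO tb_categories (category)
        VALUES (NEW.category)
        ON CONFLICT DO NOTHING;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tb_track_category_trg
AFTER INSERT OR UPDATE OF category ON tb
FOR EACH ROW EXECUTE PROCEDURE tb_track_category();

-- Reading the distinct categories no longer touches the 30-million-row table.
SELECT category FROM tb_categories ORDER BY category;
```

The trade-off is a small extra cost on every write to tb in exchange for a near-instant read of the category list.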

Related

Can a LEFT JOIN be deferred to only apply to matching rows?

When joining on a table and then filtering (LIMIT 30 for instance), Postgres will apply a JOIN operation on all rows, even if the columns from those rows are only used in the returned columns, and not as a filtering predicate.
This would be understandable for an INNER JOIN (PG has to know if the row will be returned or not) or for a LEFT JOIN without a unique constraint (PG has to know if more than one row will be returned or not), but for a LEFT JOIN on a UNIQUE column, this seems wasteful: if the query matches 10k rows, then 10k joins will be performed, and then only 30 will be returned.
It would seem more efficient to "delay", or defer, the join, as much as possible, and this is something that I've seen happen on some other queries.
Splitting this into a subquery (SELECT * FROM (SELECT * FROM main WHERE x LIMIT 30) LEFT JOIN secondary) works, by ensuring that only 30 items are returned from the main table before joining them, but it feels like I'm missing something, and the "standard" form of the query should also apply the same optimization.
Looking at the EXPLAIN plans, however, I can see that the number of rows joined is always the total number of rows, without "early bailing out" as you could see when, for instance, running a Seq Scan with a LIMIT 5.
Example schema, with a main table and a secondary one: secondary columns will only be returned, never filtered on.
drop table if exists secondary;
drop table if exists main;
create table main(id int primary key not null, main_column int);
create index main_column on main(main_column);
insert into main(id, main_column) SELECT i, i % 3000 from generate_series( 1, 1000000, 1) i;
create table secondary(id serial primary key not null, main_id int references main(id) not null, secondary_column int);
create unique index secondary_main_id on secondary(main_id);
insert into secondary(main_id, secondary_column) SELECT i, (i + 17) % 113 from generate_series( 1, 1000000, 1) i;
analyze main;
analyze secondary;
Example query:
explain analyze verbose select main.id, main_column, secondary_column
from main
left join secondary on main.id = secondary.main_id
where main_column = 5
order by main.id
limit 50;
This is the most "obvious" way of writing the query; it takes around 5ms on average on my computer.
Explain:
Limit (cost=3742.93..3743.05 rows=50 width=12) (actual time=5.010..5.322 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
-> Sort (cost=3742.93..3743.76 rows=332 width=12) (actual time=5.006..5.094 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop Left Join (cost=11.42..3731.90 rows=332 width=12) (actual time=0.123..4.446 rows=334 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Inner Unique: true
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.106..1.021 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.056..0.057 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.12 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=334)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = main.id)
Planning Time: 0.761 ms
Execution Time: 5.423 ms
explain analyze verbose select m.id, main_column, secondary_column
from (
select main.id, main_column
from main
where main_column = 5
order by main.id
limit 50
) m
left join secondary on m.id = secondary.main_id
where main_column = 5
order by m.id
limit 50
This returns the same results, in 2ms.
The total EXPLAIN cost is also about three times lower (1057 versus 3743), in line with the performance gain we're seeing.
Limit (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.219..2.027 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
-> Nested Loop Left Join (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.216..1.900 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
Inner Unique: true
-> Subquery Scan on m (cost=1048.02..1048.77 rows=1 width=8) (actual time=1.201..1.515 rows=50 loops=1)
Output: m.id, m.main_column
Filter: (m.main_column = 5)
-> Limit (cost=1048.02..1048.14 rows=50 width=8) (actual time=1.196..1.384 rows=50 loops=1)
Output: main.id, main.main_column
-> Sort (cost=1048.02..1048.85 rows=332 width=8) (actual time=1.194..1.260 rows=50 loops=1)
Output: main.id, main.main_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.054..0.753 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.029..0.030 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=50)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = m.id)
Planning Time: 0.161 ms
Execution Time: 2.115 ms
This is a toy dataset here, but on a real DB, the IO difference is significant (no need to fetch 1000 rows when 30 are enough), and the timing difference also quickly adds up (up to an order of magnitude slower).
So my question: is there any way to get the planner to understand that the JOIN can be applied much later in the process?
It seems like something that could be applied automatically to gain a sizeable performance boost.
Deferred joins are good. It's usually helpful to run the limit operation on a subquery that yields only the id values. The order by....limit operation has to sort less data just to discard it.
select main.id, main.main_column, secondary.secondary_column
from main
join (
select id
from main
where main_column = 5
order by id
limit 50
) selection on main.id = selection.id
left join secondary on main.id = secondary.main_id
order by main.id
limit 50
It's also possible that adding id to your main_column index will help. With a BTREE index the query planner knows it can get the id values in ascending order from the index, so it may be able to skip the sort step entirely and just scan the first 50 values. (The name main_column is already taken by the existing single-column index, so this one needs a different name.)
create index main_column_id on main(main_column, id);
Edit In a large table, the heavy lifting of your query will be the selection of the 50 main.id values to process. To get those 50 id values as cheaply as possible you can use a scan of the covering index I proposed with the subquery I proposed. Once you've got your 50 id values, looking up 50 rows' worth of details from your various tables by main.id and secondary.main_id is trivial; you have the correct indexes in place and it's a limited number of rows. Because it's a limited number of rows it won't take much time.
It looks like your table sizes are too small for various optimizations to have much effect, though. Query plans change a lot when tables are larger.
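One way to check whether the composite index actually removes the sort is to look at the plan for the id-only subquery in isolation. This is a sketch that assumes an index on main(main_column, id) has already been created as suggested above; the plan you get depends on table size and statistics, so no particular output is promised.

```sql
-- Refresh statistics so the planner knows about the new index contents.
ANALYZE main;

-- Inspect the plan of the subquery that picks the 50 ids.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM main
WHERE main_column = 5
ORDER BY id
LIMIT 50;
```

If the plan shows a Limit directly above an index scan with no Sort node in between, the index is supplying the rows already ordered by id and the scan can stop after 50 entries.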
Alternative query, using row_number() instead of LIMIT (I think you could even omit LIMIT here):
-- prepare q3 AS
select m.id, main_column, secondary_column
from (
select id, main_column
, row_number() OVER (ORDER BY id, main_column) AS rn
from main
where main_column = 5
) m
left join secondary on m.id = secondary.main_id
WHERE m.rn <= 50
ORDER BY m.id
LIMIT 50
;
Putting the subsetting into a CTE can prevent it from being merged into the main query:
PREPARE q6 AS
WITH
-- MATERIALIZED -- not needed before version 12
xxx AS (
SELECT DISTINCT x.id
FROM main x
WHERE x.main_column = 5
ORDER BY x.id
LIMIT 50
)
select m.id, m.main_column, s.secondary_column
from main m
left join secondary s on m.id = s.main_id
WHERE EXISTS (
SELECT *
FROM xxx x WHERE x.id = m.id
)
order by m.id
-- limit 50
;
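From PostgreSQL 12 on, a CTE that is referenced only once may be inlined into the outer query, so the keyword has to be written out to keep the fence. A sketch of the same idea with the version 12+ syntax follows; DISTINCT is dropped here because main.id is the primary key, which is a simplification of the query above rather than the answerer's exact text.

```sql
WITH xxx AS MATERIALIZED (
    -- Pick the 50 ids first; MATERIALIZED keeps this step separate from the join.
    SELECT x.id
    FROM main x
    WHERE x.main_column = 5
    ORDER BY x.id
    LIMIT 50
)
SELECT m.id, m.main_column, s.secondary_column
FROM main m
LEFT JOIN secondary s ON m.id = s.main_id
WHERE EXISTS (
    SELECT 1
    FROM xxx x
    WHERE x.id = m.id
)
ORDER BY m.id;
```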

Postgresql max query on big indexed table has slow performance

I have a table inside my PostgreSQL database, called consumer_actions. It contains all the actions done by consumers registered in my app. At the moment, this table has ~500 million records. What I'm trying to do is get the maximum id for a given system that the actions came from.
The definition of the table is:
CREATE TABLE public.consumer_actions (
id int4 NOT NULL,
system_id int4 NOT NULL,
consumer_id int4 NOT NULL,
action_id int4 NOT NULL,
payload_json jsonb NULL,
external_system_date timestamptz NULL,
local_system_date timestamptz NULL,
CONSTRAINT consumer_actions_pkey PRIMARY KEY (id, system_id)
);
CREATE INDEX consumer_actions_ext_date ON public.consumer_actions USING btree (external_system_date);
CREATE INDEX consumer_actions_system_consumer_id ON public.consumer_actions USING btree (system_id, consumer_id);
When I run
select max(id) from consumer_actions where system_id = 1
it takes less than one second, but if I run the same query to get the max(id) for system_id = 2, it takes more than an hour.
select max(id) from consumer_actions where system_id = 2
I have also checked the query plans, which look similar for both queries; I also reran VACUUM ANALYZE on the table and did a REINDEX. Neither helped. Any idea what I can do to improve the second query's time?
Here are the query plans for both queries, and the current size of the table:
explain analyze
select max(id) from consumer_actions where system_id = 1;
Result (cost=1.49..1.50 rows=1 width=4) (actual time=0.062..0.063 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..1.49 rows=1 width=4) (actual time=0.057..0.057 rows=1 loops=1)
-> Index Only Scan Backward using consumer_actions_pkey on consumer_actions ca (cost=0.57..524024735.49 rows=572451344 width=4) (actual time=0.055..0.055 rows=1 loops=1)
Index Cond: ((id IS NOT NULL) AND (system_id = 1))
Heap Fetches: 1
Planning Time: 0.173 ms
Execution Time: 0.092 ms
explain analyze
select max(id) from consumer_actions where system_id = 2;
Result (cost=6.46..6.47 rows=1 width=4) (actual time=7099484.855..7099484.858 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..6.46 rows=1 width=4) (actual time=7099484.839..7099484.841 rows=1 loops=1)
-> Index Only Scan Backward using consumer_actions_pkey on consumer_actions ca (cost=0.57..20205843.58 rows=3436129 width=4) (actual time=7099484.833..7099484.834 rows=1 loops=1)
Index Cond: ((id IS NOT NULL) AND (system_id = 2))
Heap Fetches: 1
Planning Time: 3.078 ms
Execution Time: 7099484.992 ms
(8 rows)
select count(*) from consumer_actions; --result is 577408504
Instead of using an aggregation function like max() that has to potentially scan and aggregate large numbers of rows for a table like yours you could get similar results with a query designed to return the fewest rows possible:
SELECT id FROM consumer_actions WHERE system_id = ? ORDER BY id DESC LIMIT 1;
This should still benefit significantly in performance from the existing indices.
I think that you should create an index like this one
CREATE INDEX consumer_actions_system_system_id_id ON public.consumer_actions USING btree (system_id, id);
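The plans in the question suggest why this index should help: both queries walk consumer_actions_pkey, which is ordered by (id, system_id), backwards from the highest id, and stop at the first entry with the requested system_id. If the rows for system_id = 2 all have comparatively low ids, that walk has to pass over most of the index before it finds one. With (system_id, id), the largest id for a given system is the last entry in that system's range. A sketch of building and checking it, assuming PostgreSQL 9.5 or later for IF NOT EXISTS:

```sql
-- CONCURRENTLY avoids blocking writes on a 500-million-row table,
-- but it cannot be run inside a transaction block.
CREATE INDEX CONCURRENTLY IF NOT EXISTS consumer_actions_system_system_id_id
    ON public.consumer_actions USING btree (system_id, id);

ANALYZE public.consumer_actions;

-- The plan should now name the new index instead of consumer_actions_pkey.
EXPLAIN
SELECT max(id) FROM public.consumer_actions WHERE system_id = 2;
```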

PostgreSQL query takes a very long time

I have a table with 3 columns and a composite primary key over all 3 columns. Each individual column has a lot of duplicates, and I have a separate btree index on each of them. The table has around 10 million records.
My query with just a condition with a hardcoded value for single column would always return more than a million records. It takes more than 40 secs whereas it takes very few seconds if I limit the query to 1 or 2 million rows without any condition.
Any help to optimize it as there is no bitmap index in Postgres? All 3 columns have lots of duplicates, would it help if I drop the btree index on them?
SELECT t1.filterid,
t1.filterby,
t1.filtertype
FROM echo_sm.usernotificationfilters t1
WHERE t1.filtertype = 9
UNION
SELECT t1.filterid, '-1' AS filterby, 9 AS filtertype
FROM echo_sm.usernotificationfilters t1
WHERE NOT EXISTS (SELECT 1
FROM echo_sm.usernotificationfilters t2
WHERE t2.filtertype = 9 AND t2.filterid = t1.filterid);
The filtertype column is integer and the other two are varchar(50). All 3 columns have separate btree indexes on them.
Explain plan:
Unique (cost=2168171.15..2201747.47 rows=3357632 width=154) (actual time=32250.340..36371.928 rows=3447159 loops=1)
-> Sort (cost=2168171.15..2176565.23 rows=3357632 width=154) (actual time=32250.337..35544.050 rows=4066447 loops=1)
Sort Key: usernotificationfilters.filterid, usernotificationfilters.filterby, usernotificationfilters.filtertype
Sort Method: external merge Disk: 142696kB
-> Append (cost=62854.08..1276308.41 rows=3357632 width=154) (actual time=150.155..16025.874 rows=4066447 loops=1)
-> Bitmap Heap Scan on usernotificationfilters (cost=62854.08..172766.46 rows=3357631 width=25) (actual time=150.154..574.297 rows=3422522 loops=1)
Recheck Cond: (filtertype = 9)
Heap Blocks: exact=39987
-> Bitmap Index Scan on index_sm_usernotificationfilters_filtertype (cost=0.00..62014.67 rows=3357631 width=0) (actual time=143.585..143.585 rows=3422522 loops=1)
Index Cond: (filtertype = 9)
-> Gather (cost=232131.85..1069965.63 rows=1 width=50) (actual time=3968.492..15133.812 rows=643925 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Hash Anti Join (cost=231131.85..1068965.53 rows=1 width=50) (actual time=4135.235..12945.029 rows=214642 loops=3)
Hash Cond: ((usernotificationfilters_1.filterid)::text = (usernotificationfilters_1_1.filterid)::text)
-> Parallel Seq Scan on usernotificationfilters usernotificationfilters_1 (cost=0.00..106879.18 rows=3893718 width=14) (actual time=0.158..646.432 rows=3114974 loops=3)
-> Hash (cost=172766.46..172766.46 rows=3357631 width=14) (actual time=4133.991..4133.991 rows=3422522 loops=3)
Buckets: 131072 Batches: 64 Memory Usage: 3512kB
-> Bitmap Heap Scan on usernotificationfilters usernotificationfilters_1_1 (cost=62854.08..172766.46 rows=3357631 width=14) (actual time=394.775..1891.931 rows=3422522 loops=3)
Recheck Cond: (filtertype = 9)
Heap Blocks: exact=39987
-> Bitmap Index Scan on index_sm_usernotificationfilters_filtertype (cost=0.00..62014.67 rows=3357631 width=0) (actual time=383.635..383.635 rows=3422522 loops=3)
Index Cond: (filtertype = 9)
Planning time: 0.467 ms
Execution time: 36531.763 ms
The second subquery in your UNION takes about 15 seconds all by itself, and that could possibly be optimized separately from the rest of the query.
The sort to implement the duplicate removal implied by UNION takes about 20 seconds all by itself. It spills to disk. You could increase "work_mem" until it either stops spilling to disk, or starts using a hash rather than a sort. Of course you do need to have the RAM to back up your setting of "work_mem".
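A sketch of trying a larger work_mem for a single transaction only, so the server-wide setting stays untouched. The value 512MB is an assumption, not a recommendation; an in-memory sort generally needs more memory than the 142696kB the on-disk sort reported.

```sql
BEGIN;

-- Assumed value; applies only until the end of this transaction.
SET LOCAL work_mem = '512MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.filterid,
       t1.filterby,
       t1.filtertype
FROM echo_sm.usernotificationfilters t1
WHERE t1.filtertype = 9
UNION
SELECT t1.filterid, '-1' AS filterby, 9 AS filtertype
FROM echo_sm.usernotificationfilters t1
WHERE NOT EXISTS (SELECT 1
                  FROM echo_sm.usernotificationfilters t2
                  WHERE t2.filtertype = 9 AND t2.filterid = t1.filterid);

ROLLBACK;
```

In the new plan, look for "Sort Method: quicksort Memory: ..." or a HashAggregate node in place of "external merge Disk: ...".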
A third possibility would be not to treat these steps in isolation. If you had an index which would allow the data to be read from the 2nd branch of the union already in order, then it might not have to re-sort the whole thing. That would probably be an index on (filterid, filterby, filtertype).
This is a separate independent way to approach it.
I think your
WHERE NOT EXISTS (SELECT 1...
could be correctly changed to
WHERE t1.filtertype <> 9 AND NOT EXISTS (SELECT 1...
because the case where t1.filtertype=9 would filter itself out. Is that correct? If so, you could try writing it that way, as the planner is probably not smart enough to make that transformation on its own. Once you have done that, then maybe a filtered index, something like the below, would come in useful.
create index on echo_sm.usernotificationfilters (filterid, filterby, filtertype)
where filtertype <> 9
But, unless you get rid of or speed up that sort, there is only so much improvement you can get with other things.
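Written out in full, the rewritten query from this answer might look like the sketch below. It relies on filtertype never being NULL, which holds here because the column is part of the primary key; with a nullable column the added condition would silently drop the NULL rows.

```sql
SELECT t1.filterid,
       t1.filterby,
       t1.filtertype
FROM echo_sm.usernotificationfilters t1
WHERE t1.filtertype = 9
UNION
SELECT t1.filterid, '-1' AS filterby, 9 AS filtertype
FROM echo_sm.usernotificationfilters t1
WHERE t1.filtertype <> 9   -- rows with filtertype = 9 would be removed by NOT EXISTS anyway
  AND NOT EXISTS (SELECT 1
                  FROM echo_sm.usernotificationfilters t2
                  WHERE t2.filtertype = 9 AND t2.filterid = t1.filterid);
```

The explicit filtertype <> 9 condition matches the predicate of the filtered index, which is what allows the planner to consider that index for the second branch.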
It appears you only want to retrieve one record per filterid: a record with filtertype = 9 if available, or just another one, with dummy values for the other columns. This can be done by ordering by (filtertype<>9), filtertype and picking only the first row via row_number() = 1:
-- EXPLAIN ANALYZE
SELECT xx.filterid
, case(xx.filtertype) when 9 then xx.filterby ELSE '-1' END AS filterby
, 9 AS filtertype -- xx.filtertype
-- , xx.rn
FROM (
SELECT t1.filterid , t1.filterby , t1.filtertype
, row_number() OVER (PARTITION BY t1.filterid ORDER BY (filtertype<>9), filtertype ) AS rn
FROM userfilters t1
) xx
WHERE xx.rn = 1
-- ORDER BY xx.filterid, xx.rn
;
This query can be supported by an index on the same expression:
CREATE INDEX ON userfilters ( filterid , (filtertype<>9), filtertype ) ;
But, on my machine the UNION ALL version is faster (using the same index):
EXPLAIN ANALYZE
SELECT t1.filterid
, t1.filterby
, t1.filtertype
FROM userfilters t1
WHERE t1.filtertype = 9
UNION ALL
SELECT DISTINCT t1.filterid , '-1' AS filterby ,9 AS filtertype
FROM userfilters t1
WHERE NOT EXISTS (
SELECT *
FROM userfilters t2
WHERE t2.filtertype = 9 AND t2.filterid = t1.filterid
)
;
Even simpler (and faster!) is to use DISTINCT ON(), supported by the same expression index:
-- EXPLAIN ANALYZE
SELECT DISTINCT ON (t1.filterid)
t1.filterid
, case(t1.filtertype) when 9 then t1.filterby ELSE '-1' END AS filterby
, 9 AS filtertype -- t1.filtertype
FROM userfilters t1
ORDER BY t1.filterid , (t1.filtertype<>9), t1.filtertype
;

Is there a way to use pg_trgm like operator with btree indexes on PostgreSQL?

I have two tables:
table_1 with ~1 million lines, with columns id_t1: integer, c1_t1: varchar, etc.
table_2 with ~50 million lines, with columns id_t2: integer, ref_id_t1: integer, c1_t2: varchar, etc.
ref_id_t1 is filled with id_t1 values , however they are not linked by a foreign key as table_2 doesn't know about table_1.
I need to do a request on both table like the following:
SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%');
Without any change or with basic indexes the request takes about a minute to complete, as a sequential scan is performed on table_2. To prevent this I created a GIN index with the gin_trgm_ops operator class:
CREATE EXTENSION pg_trgm;
CREATE INDEX c1_t2_gin_index ON table_2 USING gin (c1_t2 gin_trgm_ops);
However this does not solve the problem as the inner request still takes a very long time.
EXPLAIN ANALYSE SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%'
Gives the following
Bitmap Heap Scan on table_2 t2 (cost=664.20..189671.00 rows=65058 width=4) (actual time=5101.286..22854.838 rows=69631 loops=1)
Recheck Cond: ((c1_t2 )::text ~~ '%1.1%'::text)
Rows Removed by Index Recheck: 49069703
Heap Blocks: exact=611548
-> Bitmap Index Scan on gin_trg (cost=0.00..647.94 rows=65058 width=0) (actual time=4911.125..4911.125 rows=49139334 loops=1)
Index Cond: ((c1_t2)::text ~~ '%1.1%'::text)
Planning time: 0.529 ms
Execution time: 22863.017 ms
The Bitmap Index Scan is fast, but as we need t2.ref_id_t1, PostgreSQL needs to perform a Bitmap Heap Scan, which is not quick on 65000 lines of data.
The solution to avoid the Bitmap Heap Scan would be to perform an Index Only Scan. This is possible using multiple column with btree indexes, see https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html
If I change the request to search on the beginning of c1_t2, and create a btree index on c1_t2 and ref_id_t1, the request takes just over a second, even with the inner request returning 90000 lines.
CREATE INDEX c1_t2_ref_id_t1_index
ON table_2 USING btree
(c1_t2 varchar_pattern_ops ASC NULLS LAST, ref_id_t1 ASC NULLS LAST)
EXPLAIN ANALYSE SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE 'aaa%');
Hash Join (cost=56561.99..105233.96 rows=1 width=2522) (actual time=953.647..1068.488 rows=36 loops=1)
Hash Cond: (t1.id_t1 = t2.ref_id_t1)
-> Seq Scan on table_1 t1 (cost=0.00..48669.65 rows=615 width=2522) (actual time=0.088..667.576 rows=790 loops=1)
Filter: (c1_t1 = 'A')
Rows Removed by Filter: 1083798
-> Hash (cost=56553.74..56553.74 rows=660 width=4) (actual time=400.657..400.657 rows=69632 loops=1)
Buckets: 131072 (originally 1024) Batches: 1 (originally 1) Memory Usage: 3472kB
-> HashAggregate (cost=56547.14..56553.74 rows=660 width=4) (actual time=380.280..391.871 rows=69632 loops=1)
Group Key: t2.ref_id_t1
-> Index Only Scan using c1_t2_ref_id_t1_index on table_2 t2 (cost=0.56..53907.28 rows=1055943 width=4) (actual time=0.014..202.034 rows=974737 loops=1)
Index Cond: ((c1_t2 ~>=~ 'aaa'::text) AND (c1_t2 ~<~ 'chb'::text))
Filter: ((c1_t2 )::text ~~ 'aaa%'::text)
Heap Fetches: 0
Planning time: 1.512 ms
Execution time: 1069.712 ms
However this is not possible with gin indexes, as these indexes don't store all data in the key.
Is there a way to use a pg_trgm-like extension with btree indexes so we can have an index-only scan with LIKE '%abc%' requests?

Is there a way to query for an integer value or NULL without using OR?

I'd like to query for a (list of) values or NULL but not use OR. The reasoning behind trying to not use OR is, that I need to use an index on that field to speed up a query.
A simple example to illustrate my question:
CREATE TABLE fruits
(
name text,
quantity integer
);
(The real table has lots of additional integer columns.)
The query that I'm not happy with is
SELECT * FROM fruits WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
The query I'm hoping for would be something like
SELECT * FROM fruits WHERE quantity MAGIC (1,2,3,4,NULL);
I'm using Postgresql 9.1.
As far as I can tell from the docs (e.g. http://www.postgresql.org/docs/9.1/static/functions-comparisons.html) and tests there is no way to do this. But I'm hoping one of you has some magic insight.
Test table with 10k rows:
create table fruits (name text, quantity integer);
insert into fruits (name, quantity)
select left(md5(i::text), 6), i
from generate_series(1, 10000) s(i);
With plain index on quantity:
create index fruits_index on fruits(quantity);
analyze fruits;
The query with or:
explain analyze
SELECT * FROM fruits WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on fruits (cost=21.29..34.12 rows=4 width=11) (actual time=0.032..0.032 rows=4 loops=1)
Recheck Cond: ((quantity = ANY ('{1,2,3,4}'::integer[])) OR (quantity IS NULL))
-> BitmapOr (cost=21.29..21.29 rows=4 width=0) (actual time=0.025..0.025 rows=0 loops=1)
-> Bitmap Index Scan on fruits_index (cost=0.00..17.03 rows=4 width=0) (actual time=0.019..0.019 rows=4 loops=1)
Index Cond: (quantity = ANY ('{1,2,3,4}'::integer[]))
-> Bitmap Index Scan on fruits_index (cost=0.00..4.26 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (quantity IS NULL)
Total runtime: 0.089 ms
Without or:
explain analyze
SELECT * FROM fruits WHERE quantity IN (1,2,3,4);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Scan using fruits_index on fruits (cost=0.00..21.07 rows=4 width=11) (actual time=0.026..0.038 rows=4 loops=1)
Index Cond: (quantity = ANY ('{1,2,3,4}'::integer[]))
Total runtime: 0.085 ms
The coalesce version proposed by wildplasser leads to a sequential scan:
explain analyze
SELECT *
FROM fruits
WHERE COALESCE(quantity, -1) IN (-1,1,2,3,4);
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on fruits (cost=0.00..217.50 rows=250 width=11) (actual time=0.023..4.358 rows=4 loops=1)
Filter: (COALESCE(quantity, (-1)) = ANY ('{-1,1,2,3,4}'::integer[]))
Rows Removed by Filter: 9996
Total runtime: 4.395 ms
Unless a coalesce expression index is created:
create index fruits_coalesce_index on fruits(coalesce(quantity, -1));
analyze fruits;
explain analyze
SELECT *
FROM fruits
WHERE COALESCE(quantity, -1) IN (-1,1,2,3,4);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Index Scan using fruits_coalesce_index on fruits (cost=0.00..25.34 rows=5 width=11) (actual time=0.112..0.124 rows=4 loops=1)
Index Cond: (COALESCE(quantity, (-1)) = ANY ('{-1,1,2,3,4}'::integer[]))
Total runtime: 0.172 ms
But it is still worse than the plain or query with a plain index on quantity.
Ugly hack with COALESCE:
SELECT *
FROM fruits
WHERE COALESCE(quantity,1) IN (1,2,3,4)
;
Please check the resulting plan. IIRC, the optimiser knows about COALESCE() in cases like this.
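UPDATE: Alternative: use the EXISTS(NOT EXISTS(NOT IN)) trick (which generates a different plan here). Note that it assumes fruits has a unique id column, which the simplified table in the question does not show.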
-- EXPLAIN ANALYZE
SELECT *
FROM fruits fr
WHERE EXISTS (
SELECT * FROM fruits ex
WHERE ex.id = fr.id
AND NOT EXISTS (
SELECT * FROM fruits nx
WHERE nx.id = ex.id
AND nx.quantity NOT IN (1,2,3,4)
)
)
;
BTW: while testing (up to 1 million rows, with only 4 plus a few qualifying), the first query (which does not use an index) is always faster than the second (which uses indices and a hash anti-join). YMMV.
UPDATE 2: the original query IS NULL OR IN() is a clear winner here:
-- EXPLAIN ANALYZE
SELECT *
FROM fruits
WHERE quantity IS NULL
OR quantity IN (1,2,3,4)
;
This is not an answer to your exact question, but you could build a partial index tailored for your query:
CREATE INDEX idx_partial ON fruits (quantity)
WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
From the docs: http://www.postgresql.org/docs/current/interactive/indexes-partial.html
This index should then be used by your query and speed it up.
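A quick way to see whether the planner picks the partial index is to compare the plans below. This is a sketch that assumes the idx_partial index from this answer already exists; on a small test table the planner may still prefer the plain index on quantity, so the first plan is not guaranteed to mention idx_partial.

```sql
ANALYZE fruits;

-- Same predicate as the index definition: eligible for the partial index.
EXPLAIN
SELECT * FROM fruits
WHERE quantity IN (1,2,3,4) OR quantity IS NULL;

-- Asks for a value (5) that the index predicate excludes:
-- the partial index cannot be used for this query.
EXPLAIN
SELECT * FROM fruits
WHERE quantity IN (1,2,3,4,5) OR quantity IS NULL;
```

The second query shows the main limitation: a partial index only serves queries whose WHERE clause implies the index predicate, so the list of values is effectively fixed at index creation time.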