How can I speed up a window query with string prefix matching? - postgresql

I have a table with about 7.5million records and am trying to implement an autocomplete form based on said table, but the performance is pretty bad.
The schema (irrelevant fields omitted) is as follows:
COMPANIES
---------
sid (integer primary key)
world_hq_sid (integer)
name (varchar(64))
marketing_alias (varchar(64))
address_country_code (char(4))
address_state (varchar(64))
sort_order integer
search_weight integer
annual_sales integer
The fields passed in are the optional country_code and state, along with a search term. What I want is for the search term to match (case insensitive) the beginning of either name or marketing_alias. I want the top ten results, with those results that also match country and state at the top, then country only, then no state/country match. After that, I want the results sorted by sort_order.
Also, I only want one match per world_hq_sid. Finally, when I have the top match per world_hq_sid, I want the final results to be sorted by search_weight.
I'm using a window query to achieve the world_hq_sid part. Here is the query:
SELECT * FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY world_hq_sid ORDER BY CASE WHEN address_country_code = 'US' AND address_state = 'CA' THEN 2 WHEN address_country_code = 'US' THEN 1 ELSE 0 END desc, sort_order asc) AS r,
companies.*
FROM companies
WHERE ((upper(name) LIKE upper('co%')) OR (upper(marketing_alias) LIKE upper('co%')))
) x
WHERE x.r = 1
ORDER BY CASE WHEN address_country_code = 'US' AND address_state = 'CA' THEN 2 WHEN address_state = 'CA' THEN 1 ELSE 0 END desc, search_weight asc, annual_sales desc
LIMIT 10;
I have normal btree indexes on address_state, address_country_code, world_hq_sid, sort_order, and search_weight.
I have the following indexes on the name and marketing_alias fields:
CREATE INDEX companies_alias_pattern_upper_idx ON companies(upper(marketing_alias) varchar_pattern_ops);
CREATE INDEX companies_name_pattern_upper_idx ON companies(upper(name) varchar_pattern_ops)
And here is the explain analyze when I pass CA as the state and 'co' as the search term
Limit (cost=676523.01..676523.03 rows=10 width=939) (actual time=18695.686..18695.687 rows=10 loops=1)
-> Sort (cost=676523.01..676526.67 rows=1466 width=939) (actual time=18695.686..18695.687 rows=10 loops=1)
Sort Key: x.search_weight, x.annual_sales
Sort Method: top-N heapsort Memory: 30kB
-> Subquery Scan on x (cost=665492.58..676491.33 rows=1466 width=939) (actual time=18344.715..18546.830 rows=151527 loops=1)
Filter: (x.r = 1)
Rows Removed by Filter: 20672
-> WindowAgg (cost=665492.58..672825.08 rows=293300 width=931) (actual time=18344.710..18511.625 rows=172199 loops=1)
-> Sort (cost=665492.58..666225.83 rows=293300 width=931) (actual time=18344.702..18359.145 rows=172199 loops=1)
Sort Key: companies.world_hq_sid, (CASE WHEN ((companies.address_state)::text = 'CA'::text) THEN 1 ELSE 0 END), companies.sort_order
Sort Method: quicksort Memory: 108613kB
-> Bitmap Heap Scan on companies (cost=17236.64..518555.98 rows=293300 width=931) (actual time=1861.665..17999.806 rows=172199 loops=1)
Recheck Cond: ((upper((name)::text) ~~ 'CO%'::text) OR (upper((marketing_alias)::text) ~~ 'CO%'::text))
Filter: ((upper((name)::text) ~~ 'CO%'::text) OR (upper((marketing_alias)::text) ~~ 'CO%'::text))
-> BitmapOr (cost=17236.64..17236.64 rows=196219 width=0) (actual time=1829.061..1829.061 rows=0 loops=1)
-> Bitmap Index Scan on companies_name_pattern_upper_idx (cost=0.00..8987.98 rows=97772 width=0) (actual time=971.331..971.331 rows=169390 loops=1)
Index Cond: ((upper((name)::text) ~>=~ 'CO'::text) AND (upper((name)::text) ~<~ 'CP'::text))
-> Bitmap Index Scan on companies_alias_pattern_upper_idx (cost=0.00..8102.02 rows=98447 width=0) (actual time=857.728..857.728 rows=170616 loops=1)
Index Cond: ((upper((marketing_alias)::text) ~>=~ 'CO'::text) AND (upper((marketing_alias)::text) ~<~ 'CP'::text))
I've bumped work_mem and shared_buffers to 100M.
As you can see, this query returns in 18 seconds. What is odd is that the results are all over the board for different starting characters, from 400ms (acceptable) to 30 seconds (very not acceptable). Postgres gurus, my question is, am I just expecting too much of postgresql to perform such a query quickly consistently? Is there a way I can speed this up?

select *
from (
select distinct on (world_hq_sid)
world_hq_sid,
(address_country_code = 'US')::int + (address_state = 'CA')::int address_weight,
sort_order,
search_weight, annual_sales,
sid, name, marketing_alias,
address_country_code, address_state
from companies
where
upper(name) LIKE upper('co%')
OR upper(marketing_alias) LIKE upper('co%')
order by 1, 2 desc, 3
) s
order by
address_weight desc,
search_weight,
annual_sales desc
limit 10

For autocomplete it's possible to use trigram search.
pg_trgm module.
CREATE EXTENSION pg_trgm;
ALTER TABLE companies ADD COLUMN name_trgm TEXT NULL;
UPDATE companies SET name_trgm = UPPER(name);
CREATE INDEX companies_name_trgm_gin_idx ON companies USING GIN (name_trgm gin_trgm_ops);

Related

Postgres trigram search is slow

I have a table with around 3 million rows.
I created single gin index on multiple columns of the table.
CREATE INDEX search_idx ON customer USING gin (name gin_trgm_ops, id gin_trgm_ops, data gin_trgm_ops)
I am running following query (simplified to use single column in criteria) but it takes around 4 seconds:
EXPLAIN ANALYSE
SELECT c.id, similarity(c.name, 'john') sml
FROM customer c WHERE
c.name % 'john'
ORDER BY sml DESC
LIMIT 10
The output query plan is:
Limit (cost=9255.12..9255.14 rows=10 width=30) (actual time=3771.661..3771.665 rows=10 loops=1)
-> Sort (cost=9255.12..9260.43 rows=2126 width=30) (actual time=3771.659..3771.661 rows=10 loops=1)
Sort Key: (similarity((name)::text, 'john'::text)) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Bitmap Heap Scan on customer c (cost=1140.48..9209.18 rows=2126 width=30) (actual time=140.665..3770.478 rows=3605 loops=1)
Recheck Cond: ((name)::text % 'john'::text)
Rows Removed by Index Recheck: 939598
Heap Blocks: exact=158055 lossy=132577
-> Bitmap Index Scan on search_idx (cost=0.00..1139.95 rows=2126 width=0) (actual time=105.609..105.610 rows=458131 loops=1)
Index Cond: ((name)::text % 'john'::text)
Planning Time: 0.102 ms
I fail to understand that why the rows are not SORTED and LIMITed to 10 when retrieved from the search_idx in first step, and afterwards only 10 rows are fetched from the customer table (instead of 2126 rows)
Any ideas how can this query be made faster.
I tried gist index but i see no performance gains.
I also tried increasing the work_mem from 4MB to 32MB and I can see improvement of 1 sec but not more.
Also I noticed that even if I remove c.id in SELECT clause, postgres does not perform index only scan and still joins with main table.
Thanks for the help.
Update1:
After Laurenz Albe suggestion below the query performance increased and it is now around 600 ms. Plan looks like this now:
Subquery Scan on q (cost=0.41..78.29 rows=1 width=12) (actual time=63.150..574.536 rows=10 loops=1)
Filter: ((q.name)::text % 'john'::text)
-> Limit (cost=0.41..78.16 rows=10 width=40) (actual time=63.148..574.518 rows=10 loops=1)
-> Index Scan using search_name_idx on customer c (cost=0.41..2232864.76 rows=287182 width=40) (actual time=63.146..574.513 rows=10 loops=1)
Order By: ((name)::text <-> 'john'::text)
Planning Time: 42.671 ms
Execution Time: 585.554 ms
To get the 10 closest matches with index support, you should create a GiST index and query like this:
SELECT id, sml
FROM (SELECT c.id,
c.name,
similarity(c.name, 'john') sml
FROM customer c
ORDER BY c.name <-> 'john'
LIMIT 10) AS q
WHERE name % 'john';
The subquery can use the GiST index, and the outer query eliminates all results that do not exceed the pg_trgm.similarity_threshold.
UPDATE: Referring https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search.html
I updated the query to use word_similarity with multicolumn GIST index and it improved performance alot.
EXPLAIN (analyze, buffers)
WITH input AS (SELECT 'new york' AS search)
SELECT *
FROM customer, input
WHERE
(
input.search <% id
OR input.search <% name
OR input.search <% city
)
ORDER BY
input.search <<-> id,
input.search <<-> name,
input.search <<-> city
LIMIT 10
Explain plan showsindex only scan.

Can a LEFT JOIN be deferred to only apply to matching rows?

When joining on a table and then filtering (LIMIT 30 for instance), Postgres will apply a JOIN operation on all rows, even if the columns from those rows is only used in the returned column, and not as a filtering predicate.
This would be understandable for an INNER JOIN (PG has to know if the row will be returned or not) or for a LEFT JOIN without a unique constraint (PG has to know if more than one row will be returned or not), but for a LEFT JOIN on a UNIQUE column, this seems wasteful: if the query matches 10k rows, then 10k joins will be performed, and then only 30 will be returned.
It would seem more efficient to "delay", or defer, the join, as much as possible, and this is something that I've seen happen on some other queries.
Splitting this into a subquery (SELECT * FROM (SELECT * FROM main WHERE x LIMIT 30) LEFT JOIN secondary) works, by ensuring that only 30 items are returned from the main table before joining them, but it feels like I'm missing something, and the "standard" form of the query should also apply the same optimization.
Looking at the EXPLAIN plans, however, I can see that the number of rows joined is always the total number of rows, without "early bailing out" as you could see when, for instance, running a Seq Scan with a LIMIT 5.
Example schema, with a main table and a secondary one: secondary columns will only be returned, never filtered on.
drop table if exists secondary;
drop table if exists main;
create table main(id int primary key not null, main_column int);
create index main_column on main(main_column);
insert into main(id, main_column) SELECT i, i % 3000 from generate_series( 1, 1000000, 1) i;
create table secondary(id serial primary key not null, main_id int references main(id) not null, secondary_column int);
create unique index secondary_main_id on secondary(main_id);
insert into secondary(main_id, secondary_column) SELECT i, (i + 17) % 113 from generate_series( 1, 1000000, 1) i;
analyze main;
analyze secondary;
Example query:
explain analyze verbose select main.id, main_column, secondary_column
from main
left join secondary on main.id = secondary.main_id
where main_column = 5
order by main.id
limit 50;
This is the most "obvious" way of writing the query, takes on average around 5ms on my computer.
Explain:
Limit (cost=3742.93..3743.05 rows=50 width=12) (actual time=5.010..5.322 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
-> Sort (cost=3742.93..3743.76 rows=332 width=12) (actual time=5.006..5.094 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop Left Join (cost=11.42..3731.90 rows=332 width=12) (actual time=0.123..4.446 rows=334 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Inner Unique: true
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.106..1.021 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.056..0.057 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.12 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=334)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = main.id)
Planning Time: 0.761 ms
Execution Time: 5.423 ms
explain analyze verbose select m.id, main_column, secondary_column
from (
select main.id, main_column
from main
where main_column = 5
order by main.id
limit 50
) m
left join secondary on m.id = secondary.main_id
where main_column = 5
order by m.id
limit 50
This returns the same results, in 2ms.
The total EXPLAIN cost is also three times higher, in line with the performance gain we're seeing.
Limit (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.219..2.027 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
-> Nested Loop Left Join (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.216..1.900 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
Inner Unique: true
-> Subquery Scan on m (cost=1048.02..1048.77 rows=1 width=8) (actual time=1.201..1.515 rows=50 loops=1)
Output: m.id, m.main_column
Filter: (m.main_column = 5)
-> Limit (cost=1048.02..1048.14 rows=50 width=8) (actual time=1.196..1.384 rows=50 loops=1)
Output: main.id, main.main_column
-> Sort (cost=1048.02..1048.85 rows=332 width=8) (actual time=1.194..1.260 rows=50 loops=1)
Output: main.id, main.main_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.054..0.753 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.029..0.030 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=50)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = m.id)
Planning Time: 0.161 ms
Execution Time: 2.115 ms
This is a toy dataset here, but on a real DB, the IO difference is significant (no need to fetch 1000 rows when 30 are enough), and the timing difference also quickly adds up (up to an order of magnitude slower).
So my question: is there any way to get the planner to understand that the JOIN can be applied much later in the process?
It seems like something that could be applied automatically to gain a sizeable performance boost.
Deferred joins are good. It's usually helpful to run the limit operation on a subquery that yields only the id values. The order by....limit operation has to sort less data just to discard it.
select main.id, main.main_column, secondary.secondary_column
from main
join (
select id
from main
where main_column = 5
order by id
limit 50
) selection on main.id = selection.id
left join secondary on main.id = secondary.main_id
order by main.id
limit 50
It's also possible adding id to your main_column index will help. With a BTREE index the query planner knows it can get the id values in ascending order from the index, so it may be able to skip the sort step entirely and just scan the first 50 values.
create index main_column on main(main_column, id);
Edit In a large table, the heavy lifting of your query will be the selection of the 50 main.id values to process. To get those 50 id values as cheaply as possible you can use a scan of the covering index I proposed with the subquery I proposed. Once you've got your 50 id values, looking up 50 rows' worth of details from your various tables by main.id and secondary.main_id is trivial; you have the correct indexes in place and it's a limited number of rows. Because it's a limited number of rows it won't take much time.
It looks like your table sizes are too small for various optimizations to have much effect, though. Query plans change a lot when tables are larger.
Alternative query, using row_number() instead of LIMIT (I think you could even omit LIMIT here):
-- prepare q3 AS
select m.id, main_column, secondary_column
from (
select id, main_column
, row_number() OVER (ORDER BY id, main_column) AS rn
from main
where main_column = 5
) m
left join secondary on m.id = secondary.main_id
WHERE m.rn <= 50
ORDER BY m.id
LIMIT 50
;
Puttting the subsetting into a CTE can avoid it to be merged into the main query:
PREPARE q6 AS
WITH
-- MATERIALIZED -- not needed before version 12
xxx AS (
SELECT DISTINCT x.id
FROM main x
WHERE x.main_column = 5
ORDER BY x.id
LIMIT 50
)
select m.id, m.main_column, s.secondary_column
from main m
left join secondary s on m.id = s.main_id
WHERE EXISTS (
SELECT *
FROM xxx x WHERE x.id = m.id
)
order by m.id
-- limit 50
;

Postgresql max query on big indexed table has slow performance

I have a table inside my Postgresql database, called consumer_actions. It contains all the actions done by consumers registered in my app. At the moment, this table has ~ 500 million records. What i'm trying to do is to get the maximum id, based on the system that the action came from.
The definition of the table is:
CREATE TABLE public.consumer_actions (
id int4 NOT NULL,
system_id int4 NOT NULL,
consumer_id int4 NOT NULL,
action_id int4 NOT NULL,
payload_json jsonb NULL,
external_system_date timestamptz NULL,
local_system_date timestamptz NULL,
CONSTRAINT consumer_actions_pkey PRIMARY KEY (id, system_id)
);
CREATE INDEX consumer_actions_ext_date ON public.consumer_actions USING btree (external_system_date);
CREATE INDEX consumer_actions_system_consumer_id ON public.consumer_actions USING btree (system_id, consumer_id);
when i'm trying
select max(id) from consumer_actions where system_id = 1
it takes less than one second, but if i try to use the same index (consumer_actions_system_consumer_id) to get the max(id) by system_id = 2, it takes more than an hour.
select max(id) from consumer_actions where system_id = 2
I have also checked the query planner, is looks similar for both queries; i also rerun vacuum analyze on the table and a reindex. Neither of them helped. Any idea what i can do to improve the second query time?
Here are the query planners for both tables, and the size at the moment of this table:
explain analyze
select max(id) from consumer_actions where system_id = 1;
Result (cost=1.49..1.50 rows=1 width=4) (actual time=0.062..0.063 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..1.49 rows=1 width=4) (actual time=0.057..0.057 rows=1 loops=1)
-> Index Only Scan Backward using consumer_actions_pkey on consumer_actions ca (cost=0.57..524024735.49 rows=572451344 width=4) (actual time=0.055..0.055 rows=1 loops=1)
Index Cond: ((id IS NOT NULL) AND (system_id = 1))
Heap Fetches: 1
Planning Time: 0.173 ms
Execution Time: 0.092 ms
explain analyze
select max(id) from consumer_actions where system_id = 2;
Result (cost=6.46..6.47 rows=1 width=4) (actual time=7099484.855..7099484.858 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..6.46 rows=1 width=4) (actual time=7099484.839..7099484.841 rows=1 loops=1)
-> Index Only Scan Backward using consumer_actions_pkey on consumer_actions ca (cost=0.57..20205843.58 rows=3436129 width=4) (actual time=7099484.833..7099484.834 rows=1 loops=1)
Index Cond: ((id IS NOT NULL) AND (system_id = 2))
Heap Fetches: 1
Planning Time: 3.078 ms
Execution Time: 7099484.992 ms
(8 rows)
select count(*) from consumer_actions; --result is 577408504
Instead of using an aggregation function like max() that has to potentially scan and aggregate large numbers of rows for a table like yours you could get similar results with a query designed to return the fewest rows possible:
SELECT id FROM consumer_actions WHERE system_id = ? ORDER BY id DESC LIMIT 1;
This should still benefit significantly in performance from the existing indices.
I think that you should create an index like this one
CREATE INDEX consumer_actions_system_system_id_id ON public.consumer_actions USING btree (system_id, id);

PostgreSQL table indexing

I want to index my tables for the following query:
select
t.*
from main_transaction t
left join main_profile profile on profile.id = t.profile_id
left join main_customer customer on (customer.id = profile.user_id)
where
(upper(t.request_no) like upper(('%'||#requestNumber||'%')) or OR upper(c.phone) LIKE upper(concat('%',||#phoneNumber||,'%')))
and t.service_type = 'SERVICE_1'
and t.status = 'SUCCESS'
and t.mode = 'AUTO'
and t.transaction_type = 'WITHDRAW'
and customer.client = 'corp'
and t.pub_date>='2018-09-05' and t.pub_date<='2018-11-05'
order by t.pub_date desc, t.id asc
LIMIT 1000;
This is how I tried to index my tables:
CREATE INDEX main_transaction_pr_id ON main_transaction (profile_id);
CREATE INDEX main_profile_user_id ON main_profile (user_id);
CREATE INDEX main_customer_client ON main_customer (client);
CREATE INDEX main_transaction_gin_req_no ON main_transaction USING gin (upper(request_no) gin_trgm_ops);
CREATE INDEX main_customer_gin_phone ON main_customer USING gin (upper(phone) gin_trgm_ops);
CREATE INDEX main_transaction_general ON main_transaction (service_type, status, mode, transaction_type); --> don't know if this one is true!!
After indexing like above my query is spending over 4.5 seconds for just selecting 1000 rows!
I am selecting from the following table which has 34 columns including 3 FOREIGN KEYs and it has over 3 million data rows:
CREATE TABLE main_transaction (
id integer NOT NULL DEFAULT nextval('main_transaction_id_seq'::regclass),
description character varying(255) NOT NULL,
request_no character varying(18),
account character varying(50),
service_type character varying(50),
pub_date" timestamptz(6) NOT NULL,
"service_id" varchar(50) COLLATE "pg_catalog"."default",
....
);
I am also joining two tables (main_profile, main_customer) for searching customer.phone and for selecting customer.client. To get to the main_customer table from main_transaction table, I can only go by main_profile
My question is how can I index my table too increase performance for above query?
Please, do not use UNION for OR for this case (upper(t.request_no) like upper(('%'||#requestNumber||'%')) or OR upper(c.phone) LIKE upper(concat('%',||#phoneNumber||,'%'))) instead can we use case when condition? Because, I have to convert my PostgreSQL query into Hibernate JPA! And I don't know how to convert UNION except Hibernate - Native SQL which I am not allowed to use.
Explain:
Limit (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.381 rows=1 loops=1)
-> Sort (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.380 rows=1 loops=1)
Sort Key: t.pub_date DESC, t.id
Sort Method: quicksort Memory: 27kB
-> Hash Join (cost=20817.10..411600.73 rows=38 width=1906) (actual time=3214.473..3885.369 rows=1 loops=1)
Hash Cond: (t.profile_id = profile.id)
Join Filter: ((upper((t.request_no)::text) ~~ '%20181104-2158-2723948%'::text) OR (upper((customer.phone)::text) ~~ '%20181104-2158-2723948%'::text))
Rows Removed by Join Filter: 593118
-> Seq Scan on main_transaction t (cost=0.00..288212.28 rows=205572 width=1906) (actual time=0.068..1527.677 rows=593119 loops=1)
Filter: ((pub_date >= '2016-09-05 00:00:00+05'::timestamp with time zone) AND (pub_date <= '2018-11-05 00:00:00+05'::timestamp with time zone) AND ((service_type)::text = 'SERVICE_1'::text) AND ((status)::text = 'SUCCESS'::text) AND ((mode)::text = 'AUTO'::text) AND ((transaction_type)::text = 'WITHDRAW'::text))
Rows Removed by Filter: 2132732
-> Hash (cost=17670.80..17670.80 rows=180984 width=16) (actual time=211.211..211.211 rows=181516 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 3166kB
-> Hash Join (cost=6936.09..17670.80 rows=180984 width=16) (actual time=46.846..183.689 rows=181516 loops=1)
Hash Cond: (customer.id = profile.user_id)
-> Seq Scan on main_customer customer (cost=0.00..5699.73 rows=181106 width=16) (actual time=0.013..40.866 rows=181618 loops=1)
Filter: ((client)::text = 'corp'::text)
Rows Removed by Filter: 16920
-> Hash (cost=3680.04..3680.04 rows=198404 width=8) (actual time=46.087..46.087 rows=198404 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 2966kB
-> Seq Scan on main_profile profile (cost=0.00..3680.04 rows=198404 width=8) (actual time=0.008..20.099 rows=198404 loops=1)
Planning time: 0.757 ms
Execution time: 3885.680 ms
With the restriction to not use UNION, you won't get a good plan.
You can slightly speed up processing with the following indexes:
main_transaction ((service_type::text), (status::text), (mode::text),
(transaction_type::text), pub_date)
main_customer ((client::text))
These should at least get rid of the sequential scans, but the hash join that takes the lion's share of the processing time will remain.

Is there a way to query for an integer value or NULL without using OR?

I'd like to query for a (list of) values or NULL but not use OR. The reasoning behind trying to not use OR is, that I need to use an index on that field to speed up a query.
A simple example to illustrate my question:
CREATE TABLE fruits
(
name text,
quantity integer
);
(The real table has lots of additional integer columns.)
The query that I'm not happy with is
SELECT * FROM fruits WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
The query I'm hoping for would be something like
SELECT * FROM fruits WHERE quantity MAGIC (1,2,3,4,NULL);
I'm using Postgresql 9.1.
As far as I can tell from the docs (e.g. http://www.postgresql.org/docs/9.1/static/functions-comparisons.html) and tests there is no way to do this. But I'm hoping one of you has some magic insight.
Test table with 100k rows:
create table fruits (name text, quantity integer);
insert into fruits (name, quantity)
select left(md5(i::text), 6), i
from generate_series(1, 10000) s(i);
With plain index on quantity:
create index fruits_index on fruits(quantity);
analyze fruits;
The query with or:
explain analyze
SELECT * FROM fruits WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on fruits (cost=21.29..34.12 rows=4 width=11) (actual time=0.032..0.032 rows=4 loops=1)
Recheck Cond: ((quantity = ANY ('{1,2,3,4}'::integer[])) OR (quantity IS NULL))
-> BitmapOr (cost=21.29..21.29 rows=4 width=0) (actual time=0.025..0.025 rows=0 loops=1)
-> Bitmap Index Scan on fruits_index (cost=0.00..17.03 rows=4 width=0) (actual time=0.019..0.019 rows=4 loops=1)
Index Cond: (quantity = ANY ('{1,2,3,4}'::integer[]))
-> Bitmap Index Scan on fruits_index (cost=0.00..4.26 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (quantity IS NULL)
Total runtime: 0.089 ms
Without or:
explain analyze
SELECT * FROM fruits WHERE quantity IN (1,2,3,4);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Scan using fruits_index on fruits (cost=0.00..21.07 rows=4 width=11) (actual time=0.026..0.038 rows=4 loops=1)
Index Cond: (quantity = ANY ('{1,2,3,4}'::integer[]))
Total runtime: 0.085 ms
The coalesce version proposed by wildplasser leads to a sequential scan:
explain analyze
SELECT *
FROM fruits
WHERE COALESCE(quantity, -1) IN (-1,1,2,3,4);
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on fruits (cost=0.00..217.50 rows=250 width=11) (actual time=0.023..4.358 rows=4 loops=1)
Filter: (COALESCE(quantity, (-1)) = ANY ('{-1,1,2,3,4}'::integer[]))
Rows Removed by Filter: 9996
Total runtime: 4.395 ms
Unless a coalesce expression index is created:
create index fruits_coalesce_index on fruits(coalesce(quantity, -1));
analyze fruits;
explain analyze
SELECT *
FROM fruits
WHERE COALESCE(quantity, -1) IN (-1,1,2,3,4);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Index Scan using fruits_coalesce_index on fruits (cost=0.00..25.34 rows=5 width=11) (actual time=0.112..0.124 rows=4 loops=1)
Index Cond: (COALESCE(quantity, (-1)) = ANY ('{-1,1,2,3,4}'::integer[]))
Total runtime: 0.172 ms
But it is still worse than the plain or query with a plain index on quantity.
Ugly hack with COALESCE:
SELECT *
FROM fruits
WHERE COALESCE(quantity,1) IN (1,2,3,4)
;
Please check the resulting plan. IIRC, the optimiser knows about COALESCE() in cases like this.
UPDATE: Alternative: use the EXISTS(NOT EXISTS(NOT IN)) trick (which generates a different plan here)
-- EXPLAIN ANALYZE
SELECT *
FROM fruits fr
WHERE EXISTS (
SELECT * FROM fruits ex
WHERE ex.id = fr.id
AND NOT EXISTS (
SELECT * FROM fruits nx
WHERE nx.id = ex.id
AND nx.quantity NOT IN (1,2,3,4)
)
)
;
BTW: while testing, (upto 1 million rows, with only 4+ a few qualifying) , the first query (which does not use an index) is always faster than the second (which uses indices and hash anti-join) YMMV.
UPDATE 2: the original query IS NULL OR IN() is a clear winner here:
-- EXPLAIN ANALYZE
SELECT *
FROM fruits
WHERE quantity IS NULL
OR quantity IN (1,2,3,4)
;
This is not an answer to your exact question, but you could build a partial index tailored for your query:
CREATE INDEX idx_partial (quantity) ON fruits
WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
From the docs: http://www.postgresql.org/docs/current/interactive/indexes-partial.html
This index should then be used by your query and speed it up.