PostgreSQL COALESCE performance problem

I have this table in Postgresql:
CREATE TABLE my_table
(
id bigint NOT NULL,
value bigint,
CONSTRAINT my_table_pkey PRIMARY KEY (id)
);
There are ~50000 rows in my_table.
The question is why this query:
SELECT * FROM my_table WHERE id = COALESCE(null, id) and value = ?
is slower than this one:
SELECT * FROM my_table WHERE value = ?
Is there any solution other than optimizing the query string in the application layer?
EDIT: Practically, the question is how to rewrite the query select * from my_table where id=coalesce(?, id) and value=? so that its worst-case performance is no worse than that of select * from my_table where value=? in PostgreSQL 9.0.

Try rewriting the query in the form
SELECT *
FROM my_table
WHERE value = ?
AND (? IS NULL OR id = ?)
From my own quick tests:
INSERT INTO my_table select generate_series(1,50000),1;
UPDATE my_table SET value = id%17;
CREATE INDEX val_idx ON my_table(value);
VACUUM ANALYZE my_table;
\set idval 17
\set pval 0
explain analyze
SELECT *
FROM my_table
WHERE value = :pval
AND (:idval IS NULL OR id = :idval);
Index Scan using my_table_pkey on my_table (cost=0.00..8.29 rows=1 width=16) (actual time=0.034..0.035 rows=1 loops=1)
Index Cond: (id = 17)
Filter: (value = 0)
Total runtime: 0.064 ms
\set idval null
explain analyze
SELECT *
FROM my_table
WHERE value = :pval
AND (:idval IS NULL OR id = :idval);
Bitmap Heap Scan on my_table (cost=58.59..635.62 rows=2882 width=16) (actual time=0.373..1.594 rows=2941 loops=1)
Recheck Cond: (value = 0)
-> Bitmap Index Scan on validx (cost=0.00..57.87 rows=2882 width=0) (actual time=0.324..0.324 rows=2941 loops=1)
Index Cond: (value = 0)
Total runtime: 1.811 ms
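As a sanity check on the rewrite, the two predicates can be compared for result equivalence. The following is a minimal sketch that uses Python's built-in sqlite3 module only because it runs without a server; it says nothing about PostgreSQL plans or timings. The helper names run_coalesce and run_or are made up for illustration, and the equivalence relies on id being NOT NULL, as in the table definition from the question.

```python
import sqlite3

# In-memory stand-in for my_table; SQLite is used only to compare results,
# not to reproduce PostgreSQL's planner behaviour.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (id INTEGER PRIMARY KEY NOT NULL, value INTEGER)"
)
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(i, i % 17) for i in range(1, 1001)],
)

# Original form from the question.
COALESCE_FORM = (
    "SELECT id FROM my_table "
    "WHERE id = COALESCE(?, id) AND value = ? ORDER BY id"
)
# Rewritten form from this answer; the id parameter is bound twice.
OR_FORM = (
    "SELECT id FROM my_table "
    "WHERE value = ? AND (? IS NULL OR id = ?) ORDER BY id"
)


def run_coalesce(id_param, value):
    """Run the COALESCE form; id_param may be None for 'no id filter'."""
    return [row[0] for row in conn.execute(COALESCE_FORM, (id_param, value))]


def run_or(id_param, value):
    """Run the IS NULL OR form with the same logical parameters."""
    return [
        row[0]
        for row in conn.execute(OR_FORM, (value, id_param, id_param))
    ]
```

Calling both helpers with the same arguments, once with an id and once with None, should return identical lists.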

From creating a similar table, populating it, updating statistics, and finally looking at the output of EXPLAIN ANALYZE, the only difference I see is that the first query filters like this:
Filter: ((id = COALESCE(id)) AND (value = 3))
and the second one filters like this:
Filter: (value = 3)
I see substantially different performance and execution plans when there's an index on the column "value". In the first case:
Bitmap Heap Scan on my_table (cost=19.52..552.60 rows=5 width=16) (actual time=19.311..20.679 rows=1000 loops=1)
Recheck Cond: (value = 3)
Filter: (id = COALESCE(id))
-> Bitmap Index Scan on t2 (cost=0.00..19.52 rows=968 width=0) (actual time=19.260..19.260 rows=1000 loops=1)
Index Cond: (value = 3)
Total runtime: 22.138 ms
and in the second:
Bitmap Heap Scan on my_table (cost=19.76..550.42 rows=968 width=16) (actual time=0.302..1.293 rows=1000 loops=1)
Recheck Cond: (value = 3)
-> Bitmap Index Scan on t2 (cost=0.00..19.52 rows=968 width=0) (actual time=0.276..0.276 rows=1000 loops=1)
Index Cond: (value = 3)
Total runtime: 2.174 ms
So I'd say it's slower because the database engine (a) evaluates the COALESCE() expression rather than optimizing it away, and (b) has to apply it as an additional filter condition.

Related

Need to reduce the query optimization time in postgres

Use case: I need to find the index and the total count for a particular id in the table.
I have a table ann_details with 60 million records and, based on a WHERE condition, I need to retrieve the matching rows along with the index of that id.
Query:
with a as (
select an.id, row_number() over (partition by created_at) as rn
from annotation an
where ( an.layer_id = '47afb169-aed2-4378-ab13-897836275da3' or an.job_id = '' or an.task_id = '') and
an.category_id in (10019)
) select (select count(1) from a ) as totalCount , rn-1 as index from a where a.id= '47afb169-aed2-4378-ab13-897836275da3_a93f0758-8fe0-4c76-992f-0be17e5618bf_484484101';
Output:
totalCount index
1797124,1791143
Execution Time: 5 sec 487 ms
explain and analyze
CTE Scan on a (cost=872778.54..907545.00 rows=7722 width=16) (actual time=5734.572..5735.989 rows=1 loops=1)
Filter: ((id)::text = '47afb169-aed2-4378-ab13-897836275da3_a93f0758-8fe0-4c76-992f-0be17e5618bf_484484101'::text)
Rows Removed by Filter: 1797123
CTE a
-> WindowAgg (cost=0.68..838031.38 rows=1544318 width=97) (actual time=133.660..3831.998 rows=1797124 loops=1)
-> Index Only Scan using test_index_test_2 on annotation an (cost=0.68..814866.61 rows=1544318 width=89) (actual time=133.647..2660.009 rows=1797124 loops=1)
Index Cond: (category_id = 10019)
Filter: (((layer_id)::text = '47afb169-aed2-4378-ab13-897836275da3'::text) OR ((job_id)::text = ''::text) OR ((task_id)::text = ''::text))
Rows Removed by Filter: 3773007
Heap Fetches: 101650
InitPlan 2 (returns $1)
-> Aggregate (cost=34747.15..34747.17 rows=1 width=8) (actual time=2397.391..2397.392 rows=1 loops=1)
-> CTE Scan on a a_1 (cost=0.00..30886.36 rows=1544318 width=0) (actual time=0.017..2156.210 rows=1797124 loops=1)
Planning time: 0.487 ms
Execution time: 5771.080 ms
Index:
CREATE INDEX test_index_test_2 ON public.annotation USING btree (category_id,created_at,layer_id,job_id,task_id,id);
The application passes one of job_id, task_id or layer_id; the other two are passed as empty strings.
I need help optimizing the query so that it responds within 2 seconds.
Query Plan: https://explain.depesz.com/s/mXme

How does postgres query planner work on partitioned tables?

We have a Postgres 11 database with tables that have quite a large number of rows, so we use Postgres declarative partitioning to ensure query performance.
Today, while writing a database function, I noticed some strange behaviour of the Postgres query planner:
In this particular case we have two tables track.track and sensor.location.
The function shall return all locations for a given track.
The relation between track.track and sensor.location is given by a user_vehicle_id and a time range.
sensor.location is partitioned monthly by range using the column time
A query for this problem could look like this:
WITH single_track AS (
SELECT
start_time, stop_time, user_vehicle_id
FROM
track.track
WHERE
id = 1350000744800)
SELECT *
FROM
sensor.location as l, single_track as t
WHERE
l.time >= t.start_time AND
l.time <= t.stop_time AND
l.user_vehicle_id = t.user_vehicle_id
I would expect the query planner to look only at those partitions of location which match the given time frame from start_time to stop_time.
Instead it performs a Bitmap Heap/Index scan on all partitions:
Nested Loop (cost=8.59..9308018.00 rows=722021 width=106) (actual time=1.796..2.296 rows=1025 loops=1)
CTE single_track
-> Index Scan using track_pkey on track (cost=0.42..8.44 rows=1 width=24) (actual time=0.023..0.024 rows=1 loops=1)
Index Cond: (id = '1350000744800'::bigint)
-> CTE Scan on single_track t (cost=0.00..0.02 rows=1 width=24) (actual time=0.027..0.029 rows=1 loops=1)
-> Append (cost=0.15..9286171.84 rows=2183770 width=82) (actual time=1.750..1.998 rows=1025 loops=1)
-> Index Scan using location_p2011_01_pkey on location_p2011_01 l (cost=0.15..8.83 rows=1 width=136) (never executed)
Index Cond: (("time" >= t.start_time) AND ("time" <= t.stop_time) AND (user_vehicle_id = t.user_vehicle_id))
-> Seq Scan on location_p2011_02 l_1 (cost=0.00..7.71 rows=1 width=82) (never executed)
Filter: (("time" >= t.start_time) AND ("time" <= t.stop_time) AND (t.user_vehicle_id = user_vehicle_id))
-> Bitmap Heap Scan on location_p2011_03 l_2 (cost=643.94..3370.03 rows=2087 width=114) (never executed)
Recheck Cond: (("time" >= t.start_time) AND ("time" <= t.stop_time) AND (user_vehicle_id = t.user_vehicle_id))
...
-> Index Scan using location_p2020_10_pkey on location_p2020_10 l_117 (cost=0.15..8.83 rows=1 width=136) (never executed)
Index Cond: (("time" >= t.start_time) AND ("time" <= t.stop_time) AND (user_vehicle_id = t.user_vehicle_id))
-> Index Scan using location_p2020_11_pkey on location_p2020_11 l_118 (cost=0.15..8.83 rows=1 width=136) (never executed)
Index Cond: (("time" >= t.start_time) AND ("time" <= t.stop_time) AND (user_vehicle_id = t.user_vehicle_id))
-> Index Scan using location_p2020_12_pkey on location_p2020_12 l_119 (cost=0.15..8.83 rows=1 width=136) (never executed)
Index Cond: (("time" >= t.start_time) AND ("time" <= t.stop_time) AND (user_vehicle_id = t.user_vehicle_id))
Planning Time: 11.046 ms
Execution Time: 4.144 ms
While playing around I discovered that using the same query but passing the times explicitly:
EXPLAIN ANALYSE
WITH single_track AS (
SELECT
start_time,
stop_time,
user_vehicle_id
FROM
track.track
WHERE
id = 1350000744800)
SELECT *
FROM
sensor.location as l, single_track as t
WHERE
l.time >= '2016-04-12 18:04:59' AND
l.time <= '2016-04-12 18:22:49' AND
l.user_vehicle_id = t.user_vehicle_id
produces the expected behaviour:
Nested Loop (cost=9.00..2111.73 rows=141 width=102) (actual time=0.085..2.408 rows=1025 loops=1)
CTE single_track
-> Index Scan using track_pkey on track (cost=0.42..8.44 rows=1 width=24) (actual time=0.017..0.018 rows=1 loops=1)
Index Cond: (id = '1350000744800'::bigint)
-> CTE Scan on single_track t (cost=0.00..0.02 rows=1 width=24) (actual time=0.021..0.022 rows=1 loops=1)
-> Append (cost=0.56..2099.99 rows=328 width=78) (actual time=0.060..2.081 rows=1025 loops=1)
-> Index Scan using location_p2016_04_pkey on location_p2016_04 l (cost=0.56..2098.35 rows=328 width=78) (actual time=0.058..1.994 rows=1025 loops=1)
Index Cond: (("time" >= '2016-04-12 18:04:59'::timestamp without time zone) AND ("time" <= '2016-04-12 18:22:49'::timestamp without time zone) AND (user_vehicle_id = t.user_vehicle_id))
Planning Time: 4.709 ms
Execution Time: 2.494 ms
Can anyone explain this behaviour and help me how to overcome this issue?
This seems to be too complicated for the PostgreSQL executor.
I'd suggest trying
SELECT *
FROM
sensor.location as l, single_track as t
WHERE
l.time >= (SELECT start_time FROM track.track WHERE id = 1350000744800) AND
l.time <= (SELECT stop_time FROM track.track WHERE id = 1350000744800) AND
l.user_vehicle_id = (SELECT user_vehicle_id FROM track.track WHERE id = 1350000744800)
Then PostgreSQL knows at least that there will only be a single value.
If that still doesn't work, split the query in two parts:
First, get the values from track.track.
Then construct a query using the results and execute that.
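The two-step approach could be sketched in application code roughly as follows. This is only an illustration under stated assumptions: locations_for_track is a hypothetical helper name, cur is assumed to be a Python DB-API cursor from a driver that uses the %s parameter style (psycopg2 is one such driver), and the table and column names are taken from the question.

```python
def locations_for_track(cur, track_id):
    """Fetch all locations for one track using two separate statements.

    `cur` is assumed to be a DB-API cursor using the %s parameter style.
    """
    # Step 1: get the time range and vehicle for the track.
    cur.execute(
        "SELECT start_time, stop_time, user_vehicle_id "
        "FROM track.track WHERE id = %s",
        (track_id,),
    )
    row = cur.fetchone()
    if row is None:
        # Unknown track id: nothing to look up in sensor.location.
        return []
    start_time, stop_time, user_vehicle_id = row

    # Step 2: query the partitioned table with the concrete values, so the
    # time bounds no longer depend on another relation in the same query.
    cur.execute(
        "SELECT * FROM sensor.location "
        "WHERE time >= %s AND time <= %s AND user_vehicle_id = %s",
        (start_time, stop_time, user_vehicle_id),
    )
    return cur.fetchall()
```

With a driver that substitutes the values into the statement text on the client side, the second statement arrives at the server looking like the hand-written query with explicit timestamps shown in the question.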
I also tried this, which may be close to what Laurenz Albe suggested. I wasn't able to confirm that it causes the right behavior, as EXPLAIN ANALYSE does not show the query plans of statements executed inside a PL/pgSQL function.
CREATE OR REPLACE FUNCTION location_from_track_id(
_track_id bigint)
RETURNS SETOF sensor.location
LANGUAGE 'plpgsql'
AS
$BODY$
DECLARE
_user_vehicle_id bigint;
_start_time timestamp without time zone;
_stop_time timestamp without time zone;
BEGIN
SELECT
user_vehicle_id,
start_time,
stop_time
INTO
_user_vehicle_id,
_start_time,
_stop_time
FROM
track.track
WHERE id=_track_id;
RETURN QUERY
SELECT *
FROM sensor.location
WHERE
time BETWEEN _start_time AND _stop_time AND
user_vehicle_id = _user_vehicle_id;
END;
$BODY$;

PostgreSQL table indexing

I want to index my tables for the following query:
select
t.*
from main_transaction t
left join main_profile profile on profile.id = t.profile_id
left join main_customer customer on (customer.id = profile.user_id)
where
(upper(t.request_no) like upper('%'||#requestNumber||'%') or upper(customer.phone) like upper('%'||#phoneNumber||'%'))
and t.service_type = 'SERVICE_1'
and t.status = 'SUCCESS'
and t.mode = 'AUTO'
and t.transaction_type = 'WITHDRAW'
and customer.client = 'corp'
and t.pub_date>='2018-09-05' and t.pub_date<='2018-11-05'
order by t.pub_date desc, t.id asc
LIMIT 1000;
This is how I tried to index my tables:
CREATE INDEX main_transaction_pr_id ON main_transaction (profile_id);
CREATE INDEX main_profile_user_id ON main_profile (user_id);
CREATE INDEX main_customer_client ON main_customer (client);
CREATE INDEX main_transaction_gin_req_no ON main_transaction USING gin (upper(request_no) gin_trgm_ops);
CREATE INDEX main_customer_gin_phone ON main_customer USING gin (upper(phone) gin_trgm_ops);
CREATE INDEX main_transaction_general ON main_transaction (service_type, status, mode, transaction_type); --> don't know if this one is true!!
Even with the indexes above, my query takes over 4.5 seconds just to select 1000 rows!
I am selecting from the following table which has 34 columns including 3 FOREIGN KEYs and it has over 3 million data rows:
CREATE TABLE main_transaction (
id integer NOT NULL DEFAULT nextval('main_transaction_id_seq'::regclass),
description character varying(255) NOT NULL,
request_no character varying(18),
account character varying(50),
service_type character varying(50),
"pub_date" timestamptz(6) NOT NULL,
"service_id" varchar(50) COLLATE "pg_catalog"."default",
....
);
I am also joining two tables (main_profile, main_customer) to search on customer.phone and to filter on customer.client. The only way to get from the main_transaction table to the main_customer table is via main_profile.
My question is: how can I index my tables to increase performance for the above query?
Please do not use UNION in place of the OR in this condition: (upper(t.request_no) like upper('%'||#requestNumber||'%') or upper(customer.phone) like upper('%'||#phoneNumber||'%')). Could a CASE WHEN condition be used instead? I have to convert my PostgreSQL query to Hibernate JPA, and I don't know how to express UNION there except through Hibernate native SQL, which I am not allowed to use.
Explain:
Limit (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.381 rows=1 loops=1)
-> Sort (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.380 rows=1 loops=1)
Sort Key: t.pub_date DESC, t.id
Sort Method: quicksort Memory: 27kB
-> Hash Join (cost=20817.10..411600.73 rows=38 width=1906) (actual time=3214.473..3885.369 rows=1 loops=1)
Hash Cond: (t.profile_id = profile.id)
Join Filter: ((upper((t.request_no)::text) ~~ '%20181104-2158-2723948%'::text) OR (upper((customer.phone)::text) ~~ '%20181104-2158-2723948%'::text))
Rows Removed by Join Filter: 593118
-> Seq Scan on main_transaction t (cost=0.00..288212.28 rows=205572 width=1906) (actual time=0.068..1527.677 rows=593119 loops=1)
Filter: ((pub_date >= '2016-09-05 00:00:00+05'::timestamp with time zone) AND (pub_date <= '2018-11-05 00:00:00+05'::timestamp with time zone) AND ((service_type)::text = 'SERVICE_1'::text) AND ((status)::text = 'SUCCESS'::text) AND ((mode)::text = 'AUTO'::text) AND ((transaction_type)::text = 'WITHDRAW'::text))
Rows Removed by Filter: 2132732
-> Hash (cost=17670.80..17670.80 rows=180984 width=16) (actual time=211.211..211.211 rows=181516 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 3166kB
-> Hash Join (cost=6936.09..17670.80 rows=180984 width=16) (actual time=46.846..183.689 rows=181516 loops=1)
Hash Cond: (customer.id = profile.user_id)
-> Seq Scan on main_customer customer (cost=0.00..5699.73 rows=181106 width=16) (actual time=0.013..40.866 rows=181618 loops=1)
Filter: ((client)::text = 'corp'::text)
Rows Removed by Filter: 16920
-> Hash (cost=3680.04..3680.04 rows=198404 width=8) (actual time=46.087..46.087 rows=198404 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 2966kB
-> Seq Scan on main_profile profile (cost=0.00..3680.04 rows=198404 width=8) (actual time=0.008..20.099 rows=198404 loops=1)
Planning time: 0.757 ms
Execution time: 3885.680 ms
With the restriction to not use UNION, you won't get a good plan.
You can slightly speed up processing with the following indexes:
main_transaction ((service_type::text), (status::text), (mode::text),
(transaction_type::text), pub_date)
main_customer ((client::text))
These should at least get rid of the sequential scans, but the hash join that takes the lion's share of the processing time will remain.

Loose index scan in Postgres on more than one field?

I have several large tables in Postgres 9.2 (millions of rows) where I need to generate a unique code based on the combination of two fields, 'source' (varchar) and 'id' (int). I can do this by generating row_numbers over the result of:
SELECT source,id FROM tablename GROUP BY source,id
but the results can take a while to process. It has been recommended that if the fields are indexed, and there are a proportionally small number of index values (which is my case), that a loose index scan may be a better option: http://wiki.postgresql.org/wiki/Loose_indexscan
WITH RECURSIVE
t AS (SELECT min(col) AS col FROM tablename
UNION ALL
SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
SELECT col FROM t WHERE col IS NOT NULL
UNION ALL
SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);
The example operates on a single field though. Trying to return more than one field generates an error: subquery must return only one column. One possibility might be to try retrieving an entire ROW - e.g. SELECT ROW(min(source),min(id)..., but then I'm not sure what the syntax of the WHERE statement would need to look like to work with individual row elements.
The question is: can the recursion-based code be modified to work with more than one column, and if so, how? I'm committed to using Postgres, but it looks like MySQL has implemented loose index scans for more than one column: http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html
As recommended, I'm attaching my EXPLAIN ANALYZE results.
For my situation - where I'm selecting distinct values for 2 columns using GROUP BY, it's the following:
HashAggregate (cost=1645408.44..1654099.65 rows=869121 width=34) (actual time=35411.889..36008.475 rows=1233080 loops=1)
-> Seq Scan on tablename (cost=0.00..1535284.96 rows=22024696 width=34) (actual time=4413.311..25450.840 rows=22025768 loops=1)
Total runtime: 36127.789 ms
(3 rows)
I don't know how to do a 2-column index scan (that's the question), but for purposes of comparison, using a GROUP BY on one column, I get:
HashAggregate (cost=1590346.70..1590347.69 rows=99 width=8) (actual time=32310.706..32310.722 rows=100 loops=1)
-> Seq Scan on tablename (cost=0.00..1535284.96 rows=22024696 width=8) (actual time=4764.609..26941.832 rows=22025768 loops=1)
Total runtime: 32350.899 ms
(3 rows)
But for a loose index scan on one column, I get:
Result (cost=181.28..198.07 rows=101 width=8) (actual time=0.069..1.935 rows=100 loops=1)
CTE t
-> Recursive Union (cost=1.74..181.28 rows=101 width=8) (actual time=0.062..1.855 rows=101 loops=1)
-> Result (cost=1.74..1.75 rows=1 width=0) (actual time=0.061..0.061 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Limit (cost=0.00..1.74 rows=1 width=8) (actual time=0.057..0.057 rows=1 loops=1)
-> Index Only Scan using tablename_id on tablename (cost=0.00..38379014.12 rows=22024696 width=8) (actual time=0.055..0.055 rows=1 loops=1)
Index Cond: (id IS NOT NULL)
Heap Fetches: 0
-> WorkTable Scan on t (cost=0.00..17.75 rows=10 width=8) (actual time=0.017..0.017 rows=1 loops=101)
Filter: (id IS NOT NULL)
Rows Removed by Filter: 0
SubPlan 3
-> Result (cost=1.75..1.76 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=100)
InitPlan 2 (returns $3)
-> Limit (cost=0.00..1.75 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=100)
-> Index Only Scan using tablename_id on tablename (cost=0.00..12811462.41 rows=7341565 width=8) (actual time=0.015..0.015 rows=1 loops=100)
Index Cond: ((id IS NOT NULL) AND (id > t.id))
Heap Fetches: 0
-> Append (cost=0.00..16.79 rows=101 width=8) (actual time=0.067..1.918 rows=100 loops=1)
-> CTE Scan on t (cost=0.00..2.02 rows=100 width=8) (actual time=0.067..1.899 rows=100 loops=1)
Filter: (id IS NOT NULL)
Rows Removed by Filter: 1
-> Result (cost=13.75..13.76 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
One-Time Filter: $5
InitPlan 5 (returns $5)
-> Index Only Scan using tablename_id on tablename (cost=0.00..13.75 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id IS NULL)
Heap Fetches: 0
Total runtime: 2.040 ms
The full table definition looks like this:
CREATE TABLE tablename
(
source character(25),
id bigint NOT NULL,
time_ timestamp without time zone,
height numeric,
lon numeric,
lat numeric,
distance numeric,
status character(3),
geom geometry(PointZ,4326),
relid bigint
)
WITH (
OIDS=FALSE
);
CREATE INDEX tablename_height
ON public.tablename
USING btree
(height);
CREATE INDEX tablename_geom
ON public.tablename
USING gist
(geom);
CREATE INDEX tablename_id
ON public.tablename
USING btree
(id);
CREATE INDEX tablename_lat
ON public.tablename
USING btree
(lat);
CREATE INDEX tablename_lon
ON public.tablename
USING btree
(lon);
CREATE INDEX tablename_relid
ON public.tablename
USING btree
(relid);
CREATE INDEX tablename_sid
ON public.tablename
USING btree
(source COLLATE pg_catalog."default", id);
CREATE INDEX tablename_source
ON public.tablename
USING btree
(source COLLATE pg_catalog."default");
CREATE INDEX tablename_time
ON public.tablename
USING btree
(time_);
Answer selection:
I took some time comparing the approaches that were provided. It's at times like this that I wish more than one answer could be accepted, but in this case I'm giving the tick to @jjanes. The reason is that his solution matches the question as originally posed more closely, and I was able to get some insight into the form of the required WHERE statement. In the end, the HashAggregate is actually the fastest approach (for me), but that's due to the nature of my data, not any problem with the algorithms. I've attached the EXPLAIN ANALYZE output for the different approaches below, and will be giving +1 to both jjanes and joop.
HashAggregate:
HashAggregate (cost=1018669.72..1029722.08 rows=1105236 width=34) (actual time=24164.735..24686.394 rows=1233080 loops=1)
-> Seq Scan on tablename (cost=0.00..908548.48 rows=22024248 width=34) (actual time=0.054..14639.931 rows=22024982 loops=1)
Total runtime: 24787.292 ms
Loose Index Scan modification
CTE Scan on t (cost=13.84..15.86 rows=100 width=112) (actual time=0.916..250311.164 rows=1233080 loops=1)
Filter: (source IS NOT NULL)
Rows Removed by Filter: 1
CTE t
-> Recursive Union (cost=0.00..13.84 rows=101 width=112) (actual time=0.911..249295.872 rows=1233081 loops=1)
-> Limit (cost=0.00..0.04 rows=1 width=34) (actual time=0.910..0.911 rows=1 loops=1)
-> Index Only Scan using tablename_sid on tablename (cost=0.00..965442.32 rows=22024248 width=34) (actual time=0.908..0.908 rows=1 loops=1)
Heap Fetches: 0
-> WorkTable Scan on t (cost=0.00..1.18 rows=10 width=112) (actual time=0.201..0.201 rows=1 loops=1233081)
Filter: (source IS NOT NULL)
Rows Removed by Filter: 0
SubPlan 1
-> Limit (cost=0.00..0.05 rows=1 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
-> Index Only Scan using tablename_sid on tablename (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
Index Cond: (ROW(source, id) > ROW(t.source, t.id))
Heap Fetches: 0
SubPlan 2
-> Limit (cost=0.00..0.05 rows=1 width=34) (actual time=0.099..0.099 rows=1 loops=1233080)
-> Index Only Scan using tablename_sid on tablename (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.098..0.098 rows=1 loops=1233080)
Index Cond: (ROW(source, id) > ROW(t.source, t.id))
Heap Fetches: 0
Total runtime: 250491.559 ms
Merge Anti Join
Merge Anti Join (cost=0.00..12099015.26 rows=14682832 width=42) (actual time=48.710..541624.677 rows=1233080 loops=1)
Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
Join Filter: (nx.time_ > src.time_)
Rows Removed by Join Filter: 363464177
-> Index Only Scan using tablename_pkey on tablename src (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=48.566..5064.551 rows=22024982 loops=1)
Heap Fetches: 0
-> Materialize (cost=0.00..1115255.89 rows=22024248 width=42) (actual time=0.011..40551.997 rows=363464177 loops=1)
-> Index Only Scan using tablename_pkey on tablename nx (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=0.008..8258.890 rows=22024982 loops=1)
Heap Fetches: 0
Total runtime: 541750.026 ms
Rather hideous, but this seems to work:
WITH RECURSIVE
t AS (
select a,b from (select a,b from foo order by a,b limit 1) asdf union all
select (select a from foo where (a,b) > (t.a,t.b) order by a,b limit 1),
(select b from foo where (a,b) > (t.a,t.b) order by a,b limit 1)
from t where t.a is not null)
select * from t where t.a is not null;
I don't really understand why the "is not nulls" are needed, as where do the nulls come from in the first place?
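On where the NULLs come from: once the walk reaches the last (a,b) pair, the scalar subqueries find no further row, and a scalar subquery that returns no rows yields NULL. The recursive step therefore emits one final (NULL, NULL) row; the inner "is not null" stops the recursion at that point and the outer one drops the row from the result. The sketch below makes this visible. It uses Python's built-in sqlite3 module purely so that it runs without a server (SQLite 3.15 or later is needed for the row-value comparison), and the table contents and index name are invented for illustration; it is not a statement about PostgreSQL plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INTEGER NOT NULL, b INTEGER NOT NULL)")
conn.execute("CREATE INDEX foo_a_b ON foo (a, b)")
# Made-up sample data with duplicate (a, b) pairs.
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, 1), (1, 1), (1, 2), (2, 1), (2, 1), (3, 5)],
)

# Same shape as the recursive query above: start at the smallest pair,
# then repeatedly jump to the next pair strictly greater than the current one.
WALK = """
WITH RECURSIVE t(a, b) AS (
    SELECT a, b FROM (SELECT a, b FROM foo ORDER BY a, b LIMIT 1)
    UNION ALL
    SELECT (SELECT a FROM foo WHERE (a, b) > (t.a, t.b) ORDER BY a, b LIMIT 1),
           (SELECT b FROM foo WHERE (a, b) > (t.a, t.b) ORDER BY a, b LIMIT 1)
    FROM t WHERE t.a IS NOT NULL
)
SELECT a, b FROM t
"""

# Without the outer filter, the terminating (NULL, NULL) row is included.
all_rows = conn.execute(WALK).fetchall()
# With the outer filter, only the distinct pairs remain.
distinct_pairs = conn.execute(WALK + " WHERE a IS NOT NULL").fetchall()
```

Comparing all_rows with distinct_pairs shows exactly one extra row, the (None, None) terminator.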
DROP SCHEMA zooi CASCADE;
CREATE SCHEMA zooi ;
SET search_path=zooi,public,pg_catalog;
CREATE TABLE tablename
( source character(25) NOT NULL
, id bigint NOT NULL
, time_ timestamp without time zone NOT NULL
, height numeric
, lon numeric
, lat numeric
, distance numeric
, status character(3)
, geom geometry(PointZ,4326)
, relid bigint
, PRIMARY KEY (source,id,time_) -- <<-- Primary key here
) WITH ( OIDS=FALSE);
-- invent some bogus data
INSERT INTO tablename(source,id,time_)
SELECT 'SRC_'|| (gs%10)::text
,gs/10
,gt
FROM generate_series(1,1000) gs
, generate_series('2013-12-01', '2013-12-07', '1hour'::interval) gt
;
Select unique values for two key fields:
VACUUM ANALYZE tablename;
EXPLAIN ANALYZE
SELECT source,id,time_
FROM tablename src
WHERE NOT EXISTS (
SELECT * FROM tablename nx
WHERE nx.source =src.source
AND nx.id = src.id
AND time_ > src.time_
)
;
Generates this plan here (Pg-9.3):
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=4981.00..12837.82 rows=96667 width=42) (actual time=547.218..1194.335 rows=1000 loops=1)
Hash Cond: ((src.source = nx.source) AND (src.id = nx.id))
Join Filter: (nx.time_ > src.time_)
Rows Removed by Join Filter: 145000
-> Seq Scan on tablename src (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.010..210.810 rows=145000 loops=1)
-> Hash (cost=2806.00..2806.00 rows=145000 width=42) (actual time=546.497..546.497 rows=145000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 9063kB
-> Seq Scan on tablename nx (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.006..259.864 rows=145000 loops=1)
Total runtime: 1197.374 ms
(9 rows)
The hash-joins will probably disappear once the data outgrows the work_mem:
Merge Anti Join (cost=0.83..8779.56 rows=29832 width=120) (actual time=0.981..2508.912 rows=1000 loops=1)
Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
Join Filter: (nx.time_ > src.time_)
Rows Removed by Join Filter: 184051
-> Index Scan using tablename_sid on tablename src (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.055..250.621 rows=145000 loops=1)
-> Index Scan using tablename_sid on tablename nx (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.008..603.403 rows=328906 loops=1)
Total runtime: 2510.505 ms
Lateral joins give you cleaner code for selecting multiple columns in nested selects, and no NULL checks are needed because there are no subqueries in the select clause.
-- Assuming you want to get one '(a,b)' for every 'a'.
with recursive t as (
(select a, b from foo order by a, b limit 1)
union all
(select s.* from t, lateral(
select a, b from foo f
where f.a > t.a
order by a, b limit 1) s)
)
select * from t;

Is there a way to query for an integer value or NULL without using OR?

I'd like to query for a (list of) values or NULL without using OR. The reason for avoiding OR is that I need an index on that field to be used to speed up the query.
A simple example to illustrate my question:
CREATE TABLE fruits
(
name text,
quantity integer
);
(The real table has lots of additional integer columns.)
The query that I'm not happy with is
SELECT * FROM fruits WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
The query I'm hoping for would be something like
SELECT * FROM fruits WHERE quantity MAGIC (1,2,3,4,NULL);
I'm using Postgresql 9.1.
As far as I can tell from the docs (e.g. http://www.postgresql.org/docs/9.1/static/functions-comparisons.html) and tests there is no way to do this. But I'm hoping one of you has some magic insight.
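The reason a NULL inside the IN list cannot play the role of MAGIC is SQL's three-valued logic: a comparison with NULL is never true, so quantity IN (1,2,3,4,NULL) does not match rows where quantity is NULL. A small sketch of this, using Python's built-in sqlite3 module only so that it runs without a server; the sample rows and the helper name names are invented for illustration, and the COALESCE variant assumes -1 never occurs as a real quantity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruits (name TEXT, quantity INTEGER)")
# Made-up sample rows; 'fig' has no quantity.
conn.executemany(
    "INSERT INTO fruits VALUES (?, ?)",
    [("apple", 1), ("pear", 3), ("plum", 7), ("fig", None)],
)


def names(where):
    """Return the fruit names matching a fixed WHERE clause, sorted by name."""
    sql = "SELECT name FROM fruits WHERE " + where + " ORDER BY name"
    return [row[0] for row in conn.execute(sql)]


# NULL in the list: the NULL row is not matched.
with_null_in_list = names("quantity IN (1, 2, 3, 4, NULL)")
# Explicit IS NULL: the NULL row is matched.
with_or = names("quantity IN (1, 2, 3, 4) OR quantity IS NULL")
# COALESCE maps NULL to a sentinel value that is then listed explicitly.
with_coalesce = names("COALESCE(quantity, -1) IN (-1, 1, 2, 3, 4)")
```

Only the second and third forms include the row with the missing quantity.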
Test table with 10k rows:
create table fruits (name text, quantity integer);
insert into fruits (name, quantity)
select left(md5(i::text), 6), i
from generate_series(1, 10000) s(i);
With plain index on quantity:
create index fruits_index on fruits(quantity);
analyze fruits;
The query with or:
explain analyze
SELECT * FROM fruits WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on fruits (cost=21.29..34.12 rows=4 width=11) (actual time=0.032..0.032 rows=4 loops=1)
Recheck Cond: ((quantity = ANY ('{1,2,3,4}'::integer[])) OR (quantity IS NULL))
-> BitmapOr (cost=21.29..21.29 rows=4 width=0) (actual time=0.025..0.025 rows=0 loops=1)
-> Bitmap Index Scan on fruits_index (cost=0.00..17.03 rows=4 width=0) (actual time=0.019..0.019 rows=4 loops=1)
Index Cond: (quantity = ANY ('{1,2,3,4}'::integer[]))
-> Bitmap Index Scan on fruits_index (cost=0.00..4.26 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (quantity IS NULL)
Total runtime: 0.089 ms
Without or:
explain analyze
SELECT * FROM fruits WHERE quantity IN (1,2,3,4);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Scan using fruits_index on fruits (cost=0.00..21.07 rows=4 width=11) (actual time=0.026..0.038 rows=4 loops=1)
Index Cond: (quantity = ANY ('{1,2,3,4}'::integer[]))
Total runtime: 0.085 ms
The coalesce version proposed by wildplasser leads to a sequential scan:
explain analyze
SELECT *
FROM fruits
WHERE COALESCE(quantity, -1) IN (-1,1,2,3,4);
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Seq Scan on fruits (cost=0.00..217.50 rows=250 width=11) (actual time=0.023..4.358 rows=4 loops=1)
Filter: (COALESCE(quantity, (-1)) = ANY ('{-1,1,2,3,4}'::integer[]))
Rows Removed by Filter: 9996
Total runtime: 4.395 ms
Unless a coalesce expression index is created:
create index fruits_coalesce_index on fruits(coalesce(quantity, -1));
analyze fruits;
explain analyze
SELECT *
FROM fruits
WHERE COALESCE(quantity, -1) IN (-1,1,2,3,4);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Index Scan using fruits_coalesce_index on fruits (cost=0.00..25.34 rows=5 width=11) (actual time=0.112..0.124 rows=4 loops=1)
Index Cond: (COALESCE(quantity, (-1)) = ANY ('{-1,1,2,3,4}'::integer[]))
Total runtime: 0.172 ms
But it is still worse than the plain or query with a plain index on quantity.
Ugly hack with COALESCE:
SELECT *
FROM fruits
WHERE COALESCE(quantity,1) IN (1,2,3,4)
;
Please check the resulting plan. IIRC, the optimiser knows about COALESCE() in cases like this.
UPDATE: Alternative: use the EXISTS(NOT EXISTS(NOT IN)) trick (which generates a different plan here)
-- EXPLAIN ANALYZE
SELECT *
FROM fruits fr
WHERE EXISTS (
SELECT * FROM fruits ex
WHERE ex.id = fr.id
AND NOT EXISTS (
SELECT * FROM fruits nx
WHERE nx.id = ex.id
AND nx.quantity NOT IN (1,2,3,4)
)
)
;
BTW: while testing (up to 1 million rows, with only 4 plus a few qualifying), the first query (which does not use an index) is always faster than the second (which uses indices and a hash anti-join). YMMV.
UPDATE 2: the original query IS NULL OR IN() is a clear winner here:
-- EXPLAIN ANALYZE
SELECT *
FROM fruits
WHERE quantity IS NULL
OR quantity IN (1,2,3,4)
;
This is not an answer to your exact question, but you could build a partial index tailored for your query:
CREATE INDEX idx_partial ON fruits (quantity)
WHERE quantity IN (1,2,3,4) OR quantity IS NULL;
From the docs: http://www.postgresql.org/docs/current/interactive/indexes-partial.html
This index should then be used by your query and speed it up.