PostgreSQL - Index scan - Slow filtering

I am trying to improve query performance on a big (500M rows) time-partitioned table. Here is the simplified table structure:
CREATE TABLE execution (
start_time TIMESTAMP WITH TIME ZONE NOT NULL,
end_time TIMESTAMP WITH TIME ZONE,
restriction_criteria VARCHAR(36) NOT NULL
) PARTITION BY RANGE (start_time);
Time partitioning is based on the start_time column because the end_time value is not known when the row is created. It is also used to implement the retention policy efficiently.
Requests follow this generic pattern:
SELECT *
FROM execution
WHERE start_time BETWEEN :from AND :to
AND restriction_criteria IN ('123', '456')
ORDER BY end_time DESC, id
FETCH NEXT 20 ROWS ONLY;
I get the "best" performance using this index:
CREATE INDEX IF NOT EXISTS end_time_desc_start_time_index ON execution USING btree (end_time DESC, start_time);
Yet performance is still not good enough:
Limit (cost=1303.21..27189.31 rows=20 width=1674) (actual time=6791.191..6791.198 rows=20 loops=1)
-> Incremental Sort (cost=1303.21..250693964.74 rows=193689 width=1674) (actual time=6791.189..6791.194 rows=20 loops=1)
" Sort Key: execution.end_time DESC, execution.id"
Presorted Key: execution.end_time
Full-sort Groups: 1 Sort Method: quicksort Average Memory: 64kB Peak Memory: 64kB
-> Merge Append (cost=8.93..250685248.74 rows=193689 width=1674) (actual time=4082.161..6791.047 rows=21 loops=1)
Sort Key: execution.end_time DESC
Subplans Removed: 15
-> Index Scan using execution_2021_10_end_time_start_time_idx on execution_2021_10 execution_1 (cost=0.56..113448316.66 rows=93103 width=1674) (actual time=578.896..578.896 rows=1 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 734
-> Index Scan using execution_2021_11_end_time_start_time_idx on execution_2021_11 execution_2 (cost=0.56..113653576.54 rows=87605 width=1674) (actual time=116.841..116.841 rows=1 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 200
-> Index Scan using execution_2021_12_end_time_start_time_idx on execution_2021_12 execution_3 (cost=0.56..16367185.18 rows=12966 width=1674) (actual time=3386.416..6095.261 rows=21 loops=1)
Index Cond: ((start_time <= '2021-12-05 02:00:04+00'::timestamp with time zone) AND (start_time >= '2021-10-02 02:00:04+00'::timestamp with time zone))
" Filter: (((restriction_criteria)::text = ANY ('{123,456}'::text[])))"
Rows Removed by Filter: 5934
Planning Time: 4.108 ms
Execution Time: 6791.317 ms
The Filter step of the index scan looks very slow.
I set up a multi-column index hoping the filtering would be done in the Index Cond, but it doesn't work:
CREATE INDEX IF NOT EXISTS pagination_index ON execution USING btree (end_time DESC, start_time, restriction_criteria);
My feeling is that the first index column should be end_time because we want to leverage the btree index's sorting capability. The second one should be restriction_criteria, so that an Index Cond filters out rows which don't match the restriction_criteria. However, this doesn't work because the query planner also needs to check the start_time clause.
The alternative I imagine is to get rid of the partitioning, because a multi-column (end_time, restriction_criteria) index would work just fine.
Yet, this is not a perfect solution because dealing with our retention policy would become a pain.
Is there another alternative that allows us to keep the start_time partitioning?

I set up a multi-column index hoping the filtering would be done in the Index cond
The index machinery is very circumspect about what code it runs inside the index. It won't call any operators that it doesn't 'trust', because if the operator throws an error then the whole query will error out, possibly due to rows that weren't even user 'visible' in the first place (i.e. ones that were already deleted or created but never committed). No one wants that. Now the =ANY construct could be considered trustable, but it is not. That means it won't be applied in the Index Cond, but must be applied against the table row, which in turn means you need to visit the table, which is probably where all your time is going, visiting random table rows.
I don't know what it would take code-wise to make =ANY trusted. I've made efforts to investigate that in the past but really never got anywhere, the code around the ANY is too complicated for me to grasp. That would be a nice improvement for the future, but won't help you now anyway.
One way around this is to get an index-only scan. At that point it will call arbitrary code in the index, as it already knows the tuple is visible. But it won't do that for you, because you are selecting at least one column not in the index (and also not shown in your CREATE command, but obviously present anyway).
If you create an index like your widest one but with "id" added to the end, and only select from among those columns, then you should get much faster index-only scans with a merge append.
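A minimal sketch of that suggestion, assuming the table really has an "id" column (it appears in the ORDER BY but not in the simplified CREATE TABLE). The index name is made up for illustration, and :from / :to are the placeholders from the question:

```sql
-- Hypothetical index name; same columns as the widest index in the question, plus id.
CREATE INDEX IF NOT EXISTS pagination_covering_index
    ON execution USING btree (end_time DESC, start_time, restriction_criteria, id);

-- Select only columns that are stored in the index, so that an
-- index-only scan is possible and the =ANY filter runs without heap visits.
SELECT id, start_time, end_time, restriction_criteria
FROM execution
WHERE start_time BETWEEN :from AND :to
  AND restriction_criteria IN ('123', '456')
ORDER BY end_time DESC, id
FETCH NEXT 20 ROWS ONLY;
```

Index-only scans only avoid the heap for pages marked all-visible, so the partitions should be reasonably well vacuumed for this to pay off.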
If you have even more columns than the ones you've shown plus "id", and you really need to select those columns, and don't want to add all of them to the index, then you can use a trick to use an index-only scan anyway by doing a dummy self join:
with t as (SELECT id
FROM execution
WHERE start_time BETWEEN :from AND :to
AND restriction_criteria IN ('123', '456')
ORDER BY end_time DESC, id
FETCH NEXT 20 ROWS ONLY
)
select real.* from execution real join t using (id)
ORDER BY end_time DESC, id
(If "id" is not unique, then you might need to join on additional column. Also, you would need an index on "id", which you probably already have)
This one will still need to visit the table to fetch the extra columns, but only for the 20 rows being returned, not for all the ones failing the restriction_criteria.
If the restriction_criteria is very selective, another approach might be better: an index on or leading with that column. It will need to read and sort all of those rows (in the relevant partitions) before applying the LIMIT, but if it is very selective this will not take long.

While you can have the output sorted if the leading column is end_time, you can reduce the amount of data processed if you use start_time as the leading column.
Since your filter on start_time and restriction_criteria is excluding ~7000 rows in order to retrieve 20, maybe speeding up the filtering is more important than speeding up the sorting.
CREATE INDEX IF NOT EXISTS execution_start_time_restriction_idx
ON execution USING btree (start_time, restriction_criteria);
CREATE INDEX IF NOT EXISTS execution_restriction_start_time_idx
ON execution USING btree (restriction_criteria, start_time);
ANALYZE execution;
If the number of rows returned by
SELECT count(*)
FROM execution
WHERE start_time BETWEEN :from AND :to
AND restriction_criteria IN ('123', '456')
is more than the number of rows removed by the filter, then having end_time as the leading column might be a good idea. But the planner should be able to figure that out for you.
In the end, if some of those indexes are not used, you can drop them.

Related

Incorrect index usage Postgresql Version 12

Query Plan:
db=> explain
db-> SELECT MIN("id"), MAX("id") FROM "public"."tablename" WHERE ( "updated_at" >= '2022-07-24 09:08:05.926533' AND "updated_at" < '2022-07-28 09:16:54.95459' );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=128.94..128.95 rows=1 width=16)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..64.47 rows=1 width=8)
-> Index Scan using tablename_pkey on tablename (cost=0.57..416250679.26 rows=6513960 width=8)
Index Cond: (id IS NOT NULL)
Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
InitPlan 2 (returns $1)
-> Limit (cost=0.57..64.47 rows=1 width=8)
-> Index Scan Backward using tablename_pkey on tablename tablename_1 (cost=0.57..416250679.26 rows=6513960 width=8)
Index Cond: (id IS NOT NULL)
Filter: ((updated_at >= '2022-07-24 09:08:05.926533'::timestamp without time zone) AND (updated_at < '2022-07-28 09:16:54.95459'::timestamp without time zone))
(11 rows)
Indexes:
"tablename_pkey" PRIMARY KEY, btree (id)
"tablename_updated_at_incl_id_partial_idx" btree (updated_at) INCLUDE (id) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone
The question is: when there is already a partial index which only covers a small subset of the records, why is the query doing an index scan on the primary key instead of using tablename_updated_at_incl_id_partial_idx? Also, this is a heap table, not a clustered table.
Because you're using MIN and MAX, try redefining your second index so id is part of the BTREE index, not just INCLUDEd in it. That may make searching for the MIN and MAX items faster.
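A sketch of that redefinition, keeping the predicate of the existing partial index. The index name is hypothetical, and there is no guarantee the planner will pick it over the primary key:

```sql
-- Hypothetical name; id is now a key column rather than an INCLUDE payload.
CREATE INDEX tablename_updated_at_id_partial_idx
    ON tablename (updated_at, id)
    WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone;
```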
Since a small fraction of your table really is over 6e6 rows, your data must be huge. And I am guessing that id and updated_at are nearly perfectly correlated with each other, so selecting specifically for recent updated_at means you are also selecting for higher id. But the planner doesn't know about that. It thinks that by walking up the id index it can stop after walking about 1/6513960 of it, once it finds the first row qualifying on the time column. But instead it has to walk most of the index before finding that row.
The simplest solution is probably to introduce some dummy arithmetic into the aggregates: SELECT MIN("id"+0), MAX("id"+0) ... This will force it not to use the index on id. This will probably be the most robust and simplest solution as long as you have the flexibility to change the query text in your app. But even if you can't change the app, this should at least allow you to verify my assumptions and capture an EXPLAIN (ANALYZE) of it while it is not using the pk index.
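Written out against the query from the question, the check could look like this (the BUFFERS option is an optional extra that also reports how many pages were touched):

```sql
-- "id" + 0 is an expression, not the bare column, so the MIN/MAX
-- optimization that walks tablename_pkey no longer applies.
EXPLAIN (ANALYZE, BUFFERS)
SELECT MIN("id" + 0), MAX("id" + 0)
FROM "public"."tablename"
WHERE "updated_at" >= '2022-07-24 09:08:05.926533'
  AND "updated_at" <  '2022-07-28 09:16:54.95459';
```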
None of PostgreSQL's advanced statistics will (as of yet) fix this problem, so you are stuck with fixing it by changing the query or the indexes. Changing the query in the silly way I described is the best currently available solution, but if you need to do it with indexes alone there are some other, less good options which will likely still be better than what you currently have.
One is to make the horrible index scan at least into a horrible index-only scan. You could replace your existing primary key index with one like create unique index on tablename (id) include (updated_at). Here the INCLUDE is necessary because otherwise the UNIQUE would not do what you want. It will still have to walk a large part of the index, but at least it won't need to keep jumping between index and table to fetch the time column. (Make sure the table is well-vacuumed)
Or, you could provide a partial index that the planner would find attractive, by switching the order of the columns in it: create index on tablename (id, updated_at) WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone. The only thing that makes this better than your existing partial index is that this one would actually get used.
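The two index-only options above, written out as statements. The index names are made up here, and replacing the primary key itself (dropping and re-adding the constraint) is left out of this sketch:

```sql
-- Option 1: make the walk over id an index-only scan.
-- INCLUDE requires PostgreSQL 11 or later.
CREATE UNIQUE INDEX tablename_id_incl_updated_at_idx
    ON tablename (id) INCLUDE (updated_at);

-- Index-only scans rely on the visibility map, so keep the table vacuumed.
VACUUM (ANALYZE) tablename;

-- Option 2: a partial index led by id, which the MIN/MAX plan can walk
-- from either end while checking updated_at inside the index.
CREATE INDEX tablename_id_updated_at_partial_idx
    ON tablename (id, updated_at)
    WHERE updated_at >= '2022-07-01 00:00:00'::timestamp without time zone;
```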

How can I optimize this join on timestamps in PostgreSQL

PostgreSQL version 10
Windows 10
16GB RAM
SSD
I'm ashamed to admit that, despite searching the hundred years of PG support archives, I cannot figure out this most basic problem. But here it is...
I have big_table with 45 million rows and little_table with 12,000 rows. I need to do a left join to include all big_table rows, along with the id's of little_table rows where big_table's timestamp overlaps with two timestamps in little_table.
This doesn't seem like it should be an extreme operation for PG, but it is taking 2 1/2 hours!
Any ideas on what I can do here? Or do you think I have unwittingly come up against the limitations of my software/hardware combo given the table size?
Thanks!
little_table with 12,000 rows
CREATE TABLE public.little_table
(
id bigint,
start_time timestamp without time zone,
stop_time timestamp without time zone
);
CREATE INDEX idx_little_table
ON public.little_table USING btree
(start_time, stop_time DESC);
big_table with 45 million rows
CREATE TABLE public.big_table
(
id bigint,
datetime timestamp without time zone
) ;
CREATE INDEX idx_big_table
ON public.big_table USING btree
(datetime);
Query
explain analyze
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime between lt.start_time and lt.stop_time)
Explain Results
Nested Loop Left Join (cost=0.29..3260589190.64 rows=64945831346 width=16) (actual time=0.672..9163998.367 rows=1374445323 loops=1)
-> Seq Scan on big_table bt (cost=0.00..694755.92 rows=45097792 width=16) (actual time=0.014..10085.746 rows=45098790 loops=1)
-> Index Scan using idx_little_table on little_table lt (cost=0.29..57.89 rows=1440 width=24) (actual time=0.188..0.199 rows=30 loops=45098790)
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
Planning time: 0.165 ms
Execution time: 9199473.052 ms
NOTE: My actual query criteria is a bit more complex, but this seems to be the root of the problem. If I can fix this part, I think I can fix the rest.
This query cannot perform any faster.
Since there is no equality operator (=) in your join condition, the only strategy left to PostgreSQL is a nested loop join. 45 million repetitions of an index scan on the small table just take a while.
I would suggest trying to change the start_time and stop_time columns in the little table to a single tsrange column. According to the docs, this datatype supports a GiST index, which can speed up the "range contains element" operator @>. Maybe this will do better than the index scan on your current btree.
Generating 1.3 billion rows seems pretty extreme to me. How often do you need to do this, and how fast do you need it to be?
To explain a bit about your current plan:
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
While it is not obvious from what is displayed above, this always scans about half the index. It starts at the beginning of the index and stops once start_time > bt.datetime, using bt.datetime <= stop_time as an in-index filter that needs to examine each row before rejecting it.
To flesh out Bergi's answer, you could do this:
alter table little_table add range tsrange;
update little_table set range =tsrange(start_time,stop_time,'[]');
create index on little_table using gist(range);
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime <@ lt.range)
In my hands, that is about 4 times faster than your current method.
If your join did not need to be a left join, then you could get some more efficient operations by joining the tables in the opposite order. Perhaps you could get better performance by separating this into two operations, an inner join and then a probe for missing values, and combining the results.
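One way to sketch that two-step idea, using only the columns from the question. The inner join leaves the planner free to drive from little_table and probe idx_big_table, and a NOT EXISTS probe adds the big_table rows that have no match. This is untested against the real data, and whether it is faster depends on how the rows are distributed:

```sql
-- Step 1: rows with at least one overlapping interval.
SELECT bt.id AS bt_id, lt.id AS lt_id
FROM big_table bt
JOIN little_table lt
  ON bt.datetime BETWEEN lt.start_time AND lt.stop_time

UNION ALL

-- Step 2: rows with no overlapping interval, which the left join
-- would have returned with a NULL lt_id.
SELECT bt.id AS bt_id, NULL::bigint AS lt_id
FROM big_table bt
WHERE NOT EXISTS (
    SELECT 1
    FROM little_table lt
    WHERE bt.datetime BETWEEN lt.start_time AND lt.stop_time
);
```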

Why doesn't PostgreSQL use all columns in a multi-column index?

I am using the extension
CREATE EXTENSION btree_gin;
I have an index that looks like this...
create index boundaries2 on rets USING GIN(source, isonlastsync, status, (geoinfo::jsonb->'boundaries'), ctcvalidto, searchablePrice, ctcSortOrder);
before I started messing with it, the index looked like this, with the same results that I'm about to share, so it seems minor variations in the index definition don't make a difference:
create index boundaries on rets USING GIN((geoinfo::jsonb->'boundaries'), source, status, isonlastsync, ctcvalidto, searchablePrice, ctcSortOrder);
I give pgsql 11 this query:
explain analyze select id from rets where ((geoinfo::jsonb->'boundaries' ?| array['High School: Torrey Pines']) AND source='SDMLS'
AND searchablePrice>=800000 AND searchablePrice<=1200000 AND YrBlt>=2000 AND EstSF>=2300
AND Beds>=3 AND FB>=2 AND ctcSortOrder>'2019-07-05 16:02:54 UTC' AND Status IN ('ACTIVE','BACK ON MARKET')
AND ctcvalidto='9999-12-31 23:59:59 UTC' AND isonlastsync='true') order by LstDate desc, ctcSortOrder desc LIMIT 3000;
with result...
Limit (cost=120.06..120.06 rows=1 width=23) (actual time=472.849..472.850 rows=1 loops=1)
-> Sort (cost=120.06..120.06 rows=1 width=23) (actual time=472.847..472.848 rows=1 loops=1)
Sort Key: lstdate DESC, ctcsortorder DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on rets (cost=116.00..120.05 rows=1 width=23) (actual time=472.748..472.841 rows=1 loops=1)
Recheck Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Rows Removed by Index Recheck: 93
Filter: (isonlastsync AND (yrblt >= 2000) AND (estsf >= 2300) AND (beds >= 3) AND (fb >= 2) AND (status = ANY ('{ACTIVE,"BACK ON MARKET"}'::text[])))
Rows Removed by Filter: 10
Heap Blocks: exact=102
-> Bitmap Index Scan on boundaries2 (cost=0.00..116.00 rows=1 width=0) (actual time=471.762..471.762 rows=104 loops=1)
Index Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Planning Time: 0.333 ms
Execution Time: 474.311 ms
(14 rows)
The Question
Why are the columns status and isonlastsync not used by the Bitmap Index Scan on boundaries2?
It can do so if it predicts that filtering out those columns will be faster. This is usually the case if cardinality on columns is very low and you will fetch large enough portion of all rows; this is true for boolean like isonlastsync and usually true for status columns with just a few distinct values.
Rows Removed by Filter: 10 is very little to filter out, either because your table does not hold a large number of rows or because most of them fit the condition you specified for those two columns. You might try generating more data in that table or selecting rows with a rare status.
I suggest doing partial indexes (with WHERE condition), at least for boolean value and remove those two columns to make this index a bit more lightweight.
I cannot tell you why, but I can help you optimize the query.
You should not use a multi-column GIN index, but a GIN index on only the jsonb expression and a b-tree index on the other columns.
The order of columns matters: put the ones used in an equality condition first, with the most selective at the beginning. Next put the column with the most selective inequality or IN condition. For the following columns, the order doesn't matter, as they will only act as filters in the index scan.
Make sure that the indexes are cached in RAM.
I'd expect you to be faster that way.
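A sketch of that split, with made-up index names. The column order in the b-tree is only a guess at selectivity (equality columns first, then the IN column, then the price range), so treat it as a starting point and check it against the real data:

```sql
-- GIN index on the jsonb expression only; btree_gin is not needed for this one.
CREATE INDEX rets_boundaries_gin_idx
    ON rets USING gin ((geoinfo::jsonb -> 'boundaries'));

-- Plain b-tree on the scalar columns. Assumed order: equality conditions
-- first, then the IN condition, then the range condition.
CREATE INDEX rets_filters_btree_idx
    ON rets (source, ctcvalidto, isonlastsync, status, searchablePrice);
```

The planner can then combine both indexes with a BitmapAnd, or use only the more selective one.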
I think you're asking yourself the wrong question. As Lukasz answered already, PostgreSQL may find it inefficient to check all columns in the index. The problem here is that your index is too big on disk.
Probably by trying to make this SQL faster you added as many columns as possible to the index, but it is backfiring.
The trick is to realize how much data PostgreSQL has to read to find your records. If your index contains too much data, it will have to read a lot. Also, be aware that low cardinality columns don't play well with BTree and common index types; generally you want to avoid indexing them.
To keep the index as small as possible and lookups quick, find the column with the highest cardinality or, better, the column that will return the fewest rows for your query. My guess is "ctcSortOrder". This will be the first column of your index.
Next, after filtering by the first column, look at which column now has the highest cardinality or will filter out the most rows. I have no idea about your data, but "source" looks like a good candidate.
Try to avoid jsonb searches unless they are the primary source of the cardinality, and keep the index as a Btree. BTree is several times faster.
And like Lukasz suggested, look at partial indexes. For example, add "WHERE Status IN ('ACTIVE','BACK ON MARKET') AND isonlastsync='true'", as these two may be common to all your searches.
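Put together, such a partial b-tree index could look like the sketch below. The name and the choice of key columns are assumptions based on the guesses above, not something measured on the real table:

```sql
-- Hypothetical name. The two low-cardinality conditions move into the
-- predicate, so they no longer take up space as key columns.
CREATE INDEX rets_active_synced_idx
    ON rets (ctcSortOrder, source)
    WHERE status IN ('ACTIVE', 'BACK ON MARKET') AND isonlastsync = 'true';
```

The index is only considered when the query's WHERE clause repeats those two conditions in a form the planner can match against the predicate.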
Bottom line is, having a simpler, smaller index is faster than having all columns indexed. And the order of the columns does matter a lot. Stick with BTree unless there is a good reason (lots of cardinality in non-btree compatible types).
If your table is huge (>10M rows) consider table partitioning, for example by ctcSortOrder. But I don't think this is your case.

Slow execution time for a postgres query with multiple column index

We are running a PostgreSQL 9.6.11 database on Amazon RDS. The execution time of one of the queries is 6633.645 ms. This seems very slow. What changes can I make to improve the execution time for this query?
The query is selecting 3 columns where the data matches 6 of the columns.
select
platform,
publisher_platform,
adset_id
FROM "adsets"
WHERE
(("adsets"."account_id" IN ('1595321963838425', '1320001405', 'urn:li:sponsoredAccount:507697540')) AND
("adsets"."date" >= '2019-05-06 00:00:00.000000+0000') AND ("adsets"."date" <= '2019-05-13 23:59:59.999999+0000'))
GROUP BY
"adsets"."platform",
"adsets"."publisher_platform",
"adsets"."adset_id"
ORDER BY
"adsets"."platform",
"adsets"."publisher_platform",
"adsets"."adset_id";
The query is based on a table called adsets. The table has the following columns:
account_id | text
campaign_id | text
adset_id | text
name | text
date | timestamp without time zone
publisher_platform | text
and 15 other columns which are a mix of integers and text fields.
We have added the following indexes -
"adsets_composite_unique_key" UNIQUE CONSTRAINT, btree (platform, account_id, campaign_id, adset_id, date, publisher_platform)
"adsets_account_id_date_idx" btree (account_id DESC, date DESC) CLUSTER
"adsets_account_id_index" btree (account_id)
"adsets_adset_id_index" btree (adset_id)
"adsets_campaign_id_index" btree (campaign_id)
"adsets_name_index" btree (name)
"adsets_platform_platform_id_publisher_platform" btree (account_id, platform, publisher_platform, adset_id)
"idx_account_date_adsets" btree (account_id, date)
"platform_pub_index" btree (platform, publisher_platform, adset_id).
The work_mem on postgres has been set to 125MB
Explain (analyse) shows
Group (cost=33447.55..33532.22 rows=8437 width=29) (actual time=6625.170..6633.062 rows=2807 loops=1)
Group Key: platform, publisher_platform, adset_id
-> Sort (cost=33447.55..33468.72 rows=8467 width=29) (actual time=6625.168..6629.271 rows=22331 loops=1)
Sort Key: platform, publisher_platform, adset_id
Sort Method: quicksort Memory: 2513kB
-> Bitmap Heap Scan on adsets (cost=433.63..32895.18 rows=8467 width=29) (actual time=40.003..6471.898 rows=22331 loops=1)
Recheck Cond: ((account_id = ANY ('{1595321963838425,1320001405,urn:li:sponsoredAccount:507697540}'::text[])) AND (date >= '2019-05-06 00:00:00'::timestamp without time zone) AND (date <= '2019-05-13 23:59:59.999999'::timestamp without time zone))
Heap Blocks: exact=52907
-> Bitmap Index Scan on idx_account_date_adsets (cost=0.00..431.51 rows=8467 width=0) (actual time=27.335..27.335 rows=75102 loops=1)
Index Cond: ((account_id = ANY ('{1595321963838425,1320001405,urn:li:sponsoredAccount:507697540}'::text[])) AND (date >= '2019-05-06 00:00:00'::timestamp without time zone) AND (date <= '2019-05-13 23:59:59.999999'::timestamp without time zone))
Planning time: 5.380 ms
Execution time: 6633.645 ms
(12 rows)
First of all, you are using GROUP BY without actually selecting any aggregates. You might as well just do SELECT DISTINCT in your query. This aside, here is the B tree index which you probably should be using:
CREATE INDEX idx ON adsets (account_id, date, platform, publisher_platform,
adset_id);
The problem with your current index is that, while it does cover the columns you are selecting, it does not involve the columns which appear in the WHERE clause. This means that Postgres might choose to not even use the index, and rather just scan the entire table.
Note that my suggestion still does nothing to deal with the select distinct portion of the query, but at least it might speed up everything which comes before that part of the query.
Here is your updated query:
SELECT DISTINCT
platform,
publisher_platform,
adset_id
FROM adsets
WHERE
account_id IN ('1595321963838425', '1320001405',
'urn:li:sponsoredAccount:507697540') AND
date >= '2019-05-06' AND date < '2019-05-14';
Your problem is the many "false positives" that are found during the bitmap index scan phase and removed during the heap scan phase. Since there is no additional filter, I guess that the extra rows must be removed because they are not visible.
See if a VACUUM adsets will improve the query performance.
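A sketch of how to test that guess: vacuum the table, then capture the plan again and compare the row count reported by the Bitmap Index Scan with the rows that survive the heap scan. The VERBOSE and BUFFERS options are optional extras added here for diagnostics:

```sql
-- Remove dead row versions and refresh planner statistics.
VACUUM (VERBOSE, ANALYZE) adsets;

-- Re-run the query; if dead rows were the cause, the Bitmap Index Scan
-- row count should drop towards the number of rows actually returned.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT platform, publisher_platform, adset_id
FROM adsets
WHERE account_id IN ('1595321963838425', '1320001405',
                     'urn:li:sponsoredAccount:507697540')
  AND date >= '2019-05-06' AND date < '2019-05-14';
```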

Postgres not using partial timestamp index on interval queries (e.g., now() - interval '7 days' )

I have a simple table that stores precipitation readings from online gauges. Here's the table definition:
CREATE TABLE public.precip
(
gauge_id smallint,
inches numeric(8, 2),
reading_time timestamp with time zone
);
CREATE INDEX idx_precip3_id
ON public.precip USING btree
(gauge_id);
CREATE INDEX idx_precip3_reading_time
ON public.precip USING btree
(reading_time);
CREATE INDEX idx_precip_last_five_days
ON public.precip USING btree
(reading_time)
TABLESPACE pg_default WHERE reading_time > '2017-02-26 00:00:00+00'::timestamp with time zone;
It's grown quite large: about 38 million records that go back 18 months. Queries rarely request rows that are more than 7 days old and I created the partial index on the reading_time field so Postgres can traverse a much smaller index. But it's not using the partial index on all queries. It does use the partial index on
explain analyze select * from precip where gauge_id = 208 and reading_time > '2017-02-27'
Bitmap Heap Scan on precip (cost=8371.94..12864.51 rows=1169 width=16) (actual time=82.216..162.127 rows=2046 loops=1)
Recheck Cond: ((gauge_id = 208) AND (reading_time > '2017-02-27 00:00:00+00'::timestamp with time zone))
-> BitmapAnd (cost=8371.94..8371.94 rows=1169 width=0) (actual time=82.183..82.183 rows=0 loops=1)
-> Bitmap Index Scan on idx_precip3_id (cost=0.00..2235.98 rows=119922 width=0) (actual time=20.754..20.754 rows=125601 loops=1)
Index Cond: (gauge_id = 208)
-> Bitmap Index Scan on idx_precip_last_five_days (cost=0.00..6135.13 rows=331560 width=0) (actual time=60.099..60.099 rows=520867 loops=1)
Total runtime: 162.631 ms
But it does not use the partial index on the following. Instead, it uses the full index on reading_time:
explain analyze select * from precip where gauge_id = 208 and reading_time > now() - interval '7 days'
Bitmap Heap Scan on precip (cost=8460.10..13007.47 rows=1182 width=16) (actual time=154.286..228.752 rows=2067 loops=1)
Recheck Cond: ((gauge_id = 208) AND (reading_time > (now() - '7 days'::interval)))
-> BitmapAnd (cost=8460.10..8460.10 rows=1182 width=0) (actual time=153.799..153.799 rows=0 loops=1)
-> Bitmap Index Scan on idx_precip3_id (cost=0.00..2235.98 rows=119922 width=0) (actual time=15.852..15.852 rows=125601 loops=1)
Index Cond: (gauge_id = 208)
-> Bitmap Index Scan on idx_precip3_reading_time (cost=0.00..6223.28 rows=335295 width=0) (actual time=136.162..136.162 rows=522993 loops=1)
Index Cond: (reading_time > (now() - '7 days'::interval))
Total runtime: 228.647 ms
Note that today is 3/5/2017, so these two queries are essentially requesting the same rows. But it seems like Postgres won't use the partial index unless the timestamp in the where clause is "hard coded". Is the query planner not evaluating now() - interval '7 days' before deciding which index to use? I posted the query plans as suggested by one of the first people to respond.
I've written several functions (stored procedures) that summarize rain fall in the last 6 hours, 12 hours .... 72 hours. They all use the interval approach in the query (e.g., reading_time > now() - interval '7 days'). I don't want to move this code into the application to send Postgres the hard coded timestamp. That would create a lot of messy php code that shouldn't be necessary.
Suggestions on how to encourage Postgres to use the partial index instead? My plan is to redefine the date range on the partial index nightly (drop index --> create index), but that seems a bit silly if Postgres isn't going to use it.
Thanks,
Alex
Generally speaking, an index can be used, when the indexed column(s) is/are compared to constants (literal values), function calls, which are marked at least STABLE (which means that within a single statement, multiple calls of the functions -- with same parameters -- will produce the same results), and combination of those.
now() (which is an alias of current_timestamp) is marked as STABLE, and timestamp_mi_interval() (the function behind the <timestamp> - <interval> operator) is marked as IMMUTABLE, which is even stricter than STABLE. (now(), current_timestamp and transaction_timestamp() return the start of the transaction and statement_timestamp() returns the start of the statement, so they are all still STABLE; clock_timestamp() returns the time as seen on a clock, thus it is VOLATILE.)
So in theory, the WHERE reading_time > now() - interval '7 days' should be able to use an index on the reading_time column. And it really does. But, since you defined a partial index, the planner needs to prove the following:
However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time.
And that is what is happening with your query, which has and reading_time > now() - interval '7 days'. By the time now() - interval '7 days' is evaluated, the planning already happened. And PostgreSQL couldn't prove that the predicate (reading_time > '2017-02-26 00:00:00+00') will be true. But when you used reading_time > '2017-02-27' it could prove that.
You could "guide" the planner with constant values, like this:
where gauge_id = 208
and reading_time > '2017-02-26 00:00:00+00'
and reading_time > now() - interval '7 days'
This way the planner realizes that it can use the partial index, because indexed_col > index_condition and indexed_col > something_else implies that indexed_col will be larger than (at least) index_condition. Maybe it will be larger than something_else too, but it doesn't matter for using the index.
I'm not sure if that is the answer you were looking for though. IMHO, if you have a really large amount of data (and PostgreSQL 9.5+) a single BRIN index might suit your needs better.
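A sketch of the BRIN alternative, with a made-up index name. It assumes readings are inserted roughly in reading_time order, which is what makes a BRIN index effective:

```sql
-- BRIN stores one small summary per block range, so the index stays tiny
-- even for tens of millions of rows. Requires PostgreSQL 9.5 or later.
CREATE INDEX idx_precip_reading_time_brin
    ON public.precip USING brin (reading_time);
```

Unlike the partial index, it has no fixed date in its definition, so it does not need to be dropped and recreated every night.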
Queries are planned before they are executed, and the plan, including the choice of indexes, may be cached for later use. Your query includes now(), which is STABLE rather than IMMUTABLE, so its value is not known at planning time. The planner therefore cannot prove that the condition implies the predicate of the partial index, and the partial index cannot be used. Any human reading the query will understand that the partial index would be a match, but the planner does not evaluate now() while planning.
A better solution in your case would be to partition the table into smaller chunks based on the reading_time. Properly crafted queries will then only access a single partition.
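A minimal sketch of such a layout, assuming PostgreSQL 10 or later for declarative partitioning (the question does not state a version). The table and partition names, and the monthly granularity, are illustrative:

```sql
-- Hypothetical partitioned copy of the precip table.
CREATE TABLE precip_partitioned (
    gauge_id     smallint,
    inches       numeric(8, 2),
    reading_time timestamp with time zone
) PARTITION BY RANGE (reading_time);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE precip_2017_02 PARTITION OF precip_partitioned
    FOR VALUES FROM ('2017-02-01') TO ('2017-03-01');
CREATE TABLE precip_2017_03 PARTITION OF precip_partitioned
    FOR VALUES FROM ('2017-03-01') TO ('2017-04-01');

-- On PostgreSQL 10, indexes are created on each partition separately.
CREATE INDEX ON precip_2017_03 (gauge_id, reading_time);
```

As far as I know, pruning on a now()-based condition needs PostgreSQL 11 or later, where partitions can be eliminated at executor startup. On version 10 the query would still need a literal timestamp for the planner to skip old partitions.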