Does Postgres use indexes if casting timestamp to date? - postgresql

Let's say I have a table with some columns and a column dt which is of type TIMESTAMP.
I create a (non functional) index on this column.
Then I execute a query
SELECT *
FROM tbl
WHERE
dt::DATE = NOW()::DATE
The question is will Postgres use the index I've created earlier and under which circumstances it will/will not?
I understand that a functional index would cover this case, but does a simple index cover both cases or not when it's a TIMESTAMP -> DATE type conversion?
EDIT:
performing an EXPLAIN ANALYZE on the query tells us it does not use index and performs a Seq scan (table with 3+ mil records:
Seq Scan on tbl (cost=0.00..192289.92 rows=17043 width=12) (actual time=7.237..2493.496 rows=4928 loops=1)
Filter: ((dt)::date = (now())::date)
Rows Removed by Filter: 3397155
Total runtime: 2494.546 ms
Let me ask a question differently then, is it possible to make Postgres utilize this index or should I create another one?

A simple index will not work in this case; try it with EXPLAIN.
What you could do to use the simple index is
WHERE dt >= current_date::timestamptz
AND dt < (current_date + 1)::timestamptz
I think that this is pretty readable and the best solution, but if you want to go with your current query, you'll have to add a second index on (dt::date).
Don't forget that every additional index costs space and slows down the performance of data modifying statements.

Related

How can I optimize this join on timestamps in PostgreSQL

PostgreSQL version 10
Windows 10
16GB RAM
SSD
I'm ashamed to admit that, despite searching the hundred years of PG support archives, I cannot figure out this most basic problem. But here it is...
I have big_table with 45 million rows and little_table with 12,000 rows. I need to do a left join to include all big_table rows, along with the id's of little_table rows where big_table's timestamp overlaps with two timestamps in little_table.
This doesn't seem like it should be an extreme operation for PG, but it is taking 2 1/2 hours!
Any ideas on what I can do here? Or do you think I have unwittingly come up against the limitations of my software/hardware combo given the table size?
Thanks!
little_table with 12,000 rows
CREATE TABLE public.little_table
(
id bigint,
start_time timestamp without time zone,
stop_time timestamp without time zone
);
CREATE INDEX idx_little_table
ON public.little_table USING btree
(start_time, stop_time DESC);
big_table with 45 million rows
CREATE TABLE public.big_table
(
id bigint,
datetime timestamp without time zone
) ;
CREATE INDEX idx_big_table
ON public.big_table USING btree
(datetime);
Query
explain analyze
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime between lt.start_time and lt.stop_time)
Explain Results
Nested Loop Left Join (cost=0.29..3260589190.64 rows=64945831346 width=16) (actual time=0.672..9163998.367 rows=1374445323 loops=1)
-> Seq Scan on big_table bt (cost=0.00..694755.92 rows=45097792 width=16) (actual time=0.014..10085.746 rows=45098790 loops=1)
-> Index Scan using idx_little_table on little_table lt (cost=0.29..57.89 rows=1440 width=24) (actual time=0.188..0.199 rows=30 loops=45098790)
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
Planning time: 0.165 ms
Execution time: 9199473.052 ms
NOTE: My actual query criteria is a bit more complex, but this seems to be the root of the problem. If I can fix this part, I think I can fix the rest.
This query cannot perform any faster.
Since there is no equality operator (=) in your join condition, the only strategy left to PostgreSQL is a nested loop join. 45 million repetitions of an index scan on the small table just take a while.
I would suggest trying to change the start_time and end_time columns in the
little table to a single tsrange column. According to the docs, this datatype supports a GIST index which can speed up the "range contains element" operator #>. Maybe this will do better than the index scan on your current btree.
Generating 1.3 billion rows seems pretty extreme to me. How often do you need to do this, and how fast do you need it to be?
To explain a bit about your current plan:
Index Cond: ((bt.datetime >= start_time) AND (bt.datetime <= stop_time))
While it is not obvious from what is displayed above, this always scans about half the index. It starts at the beginning of the index, and stops once start_time > bt.datetime, using bt.datetime <= stop_time as an in-index filter that need to examine each row before rejecting it.
To flesh out Bergi's answer, you could do this:
alter table little_table add range tsrange;
update little_table set range =tsrange(start_time,stop_time,'[]');
create index on little_table using gist(range);
select
bt.id as bt_id,
lt.id as lt_id
from
big_table bt
left join
little_table lt
on
(bt.datetime <# lt.range)
In my hands, that is about 4 times faster than your current method.
If your join did not need to do a left join, then you could get some more efficient operations by joining the tables in the opposite order. Perhaps you could get better performance by separating this into 2 operations, and inner join and then a probe for missing values, and combining the results.

Why doesn't postresql use all columns in a multi-column index?

I am using the extension
CREATE EXTENSION btree_gin;
I have an index that looks like this...
create index boundaries2 on rets USING GIN(source, isonlastsync, status, (geoinfo::jsonb->'boundaries'), ctcvalidto, searchablePrice, ctcSortOrder);
before I started messing with it, the index looked like this, with the same results that I'm about to share, so it seems minor variations in the index definition don't make a difference:
create index boundaries on rets USING GIN((geoinfo::jsonb->'boundaries'), source, status, isonlastsync, ctcvalidto, searchablePrice, ctcSortOrder);
I give pgsql 11 this query:
explain analyze select id from rets where ((geoinfo::jsonb->'boundaries' ?| array['High School: Torrey Pines']) AND source='SDMLS'
AND searchablePrice>=800000 AND searchablePrice<=1200000 AND YrBlt>=2000 AND EstSF>=2300
AND Beds>=3 AND FB>=2 AND ctcSortOrder>'2019-07-05 16:02:54 UTC' AND Status IN ('ACTIVE','BACK ON MARKET')
AND ctcvalidto='9999-12-31 23:59:59 UTC' AND isonlastsync='true') order by LstDate desc, ctcSortOrder desc LIMIT 3000;
with result...
Limit (cost=120.06..120.06 rows=1 width=23) (actual time=472.849..472.850 rows=1 loops=1)
-> Sort (cost=120.06..120.06 rows=1 width=23) (actual time=472.847..472.848 rows=1 loops=1)
Sort Key: lstdate DESC, ctcsortorder DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on rets (cost=116.00..120.05 rows=1 width=23) (actual time=472.748..472.841 rows=1 loops=1)
Recheck Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Rows Removed by Index Recheck: 93
Filter: (isonlastsync AND (yrblt >= 2000) AND (estsf >= 2300) AND (beds >= 3) AND (fb >= 2) AND (status = ANY ('{ACTIVE,"BACK ON MARKET"}'::text[])))
Rows Removed by Filter: 10
Heap Blocks: exact=102
-> Bitmap Index Scan on boundaries2 (cost=0.00..116.00 rows=1 width=0) (actual time=471.762..471.762 rows=104 loops=1)
Index Cond: ((source = 'SDMLS'::text) AND (((geoinfo)::jsonb -> 'boundaries'::text) ?| '{"High School: Torrey Pines"}'::text[]) AND (ctcvalidto = '9999-12-31 23:59:59+00'::timestamp with time zone) AND (searchableprice >= 800000) AND (searchableprice <= 1200000) AND (ctcsortorder > '2019-07-05 16:02:54+00'::timestamp with time zone))
Planning Time: 0.333 ms
Execution Time: 474.311 ms
(14 rows)
The Question
Why are the columns status and isonlastsync not used by the Bitmap Index Scan on boundaries2?
It can do so if it predicts that filtering out those columns will be faster. This is usually the case if cardinality on columns is very low and you will fetch large enough portion of all rows; this is true for boolean like isonlastsync and usually true for status columns with just a few distinct values.
Rows Removed by Filter: 10 this is very little to filter out, either because your table does not hold large number of rows or most of them fit into condition you specified for those two columns. You might try generating more data in that table or selecting rows with rare status.
I suggest doing partial indexes (with WHERE condition), at least for boolean value and remove those two columns to make this index a bit more lightweight.
I cannot tell you why, but I can help you optimize the query.
You should not use a multi-column GIN index, but a GIN index on only the jsonb expression and a b-tree index on the other columns.
The order of columns matters: put the oned used in an equality condition first, with the most selective in the beginning. Next put the column with the must selective inequality or IN conditions. For the following columns, the order doesn't matter, as they will only act as filters in the index scan.
Make sure that the indexes are cached in RAM.
I'd expect you to be faster that way.
I think you're asking yourself the wrong question. As Lukasz answered already, PostgreSQL may find inefficient to check all columns in the index. The problem here is that your index is too big on disk.
Probably by trying to make this SQL faster you added as many columns as possible to the index, but it is backfiring.
The trick is to realize how much data PostgreSQL has to read to find your records. If your index contains too much data, it will have to read a lot. Also, be aware that low cardinality columns don't play well with BTree and common index types; generally you want to avoid indexing them.
To have an index as small as possible and it's quick to do lookups you have to find the column with more cardinality, or better, the column that will return less rows for your query. My guess is "ctcSortOrder". This will be the first column of your index.
Now look, after filtering by the 1st column, which column has now the most cardinality or will filter out most rows. I have no idea on your data, but "source" looks like a good candidate.
Try to avoid jsonb searches unless they are the primary source of the cardinality, and keep the index as a Btree. BTree is several times faster.
And like Lukasz suggested, look on partial indexes. For example add "WHERE Status IN ('ACTIVE','BACK ON MARKET') AND isonlastsync='true'" as these two may be common for all your searches.
Bottom line is, having a simpler, smaller index is faster than having all columns indexed. And the order of the columns does matter a lot. Stick with BTree unless there is a good reason (lots of cardinality in non-btree compatible types).
If your table is huge (>10M rows) consider table partitioning, for example by ctcSortOrder. But I don't think this is your case.

Suggest appropriate index on my database

I have a table product
Product(id BIGINT,
... Some more columns here
expired DATE);
I want to create index on expired field for faster retrieval.
Majority of time my where clause is
...
WHERE (expired IS NULL OR expired > now());
Please can you suggest which index is more suitable for me.
When I execute explain analyze for the above query
EXPLAIN ANALYZE
SELECT 1
FROM product
WHERE (expired IS NULL) OR (expired > now());
it gave me following result. In which it is not using index which I have created.
Seq Scan on product (cost=0.00..190711.22 rows=5711449 width=0) (actual time=0.009..8653.380 rows=7163105 loops=1)
Filter: ((expired IS NULL) OR (expired > now()))
Rows Removed by Filter: 43043
Planning time: 0.117 ms
Execution time: 15679.478 ms
(5 rows)
I guess that is because of "OR" condition. i tried to create function base index but it gave me following error
ERROR: functions in index expression must be marked IMMUTABLE
Is there any alternate way we can do?
You should change the WHERE clause to look like this:
... WHERE COALESCE(expired, DATE 'infinity') > current_date;
This is equivalent to your query, but now you can use the following index:
CREATE INDEX ON product (COALESCE(expired, DATE 'infinity'));
Probably the default B-tree index is the most appropriate one for you; hash indexes only handle "equals" comparisons, and the GiST and GIN indexes are for more complex data types that what you are using:
https://www.postgresql.org/docs/current/static/indexes-types.html
In fact the B-tree is the default, so all you need to do is something like:
CREATE INDEX Products_expired_idx ON TABLE Products (expired)

Postgres query optimization (forcing an index scan)

Below is my query. I am trying to get it to use an index scan, but it will only seq scan.
By the way the metric_data table has 130 million rows. The metrics table has about 2000 rows.
metric_data table columns:
metric_id integer
, t timestamp
, d double precision
, PRIMARY KEY (metric_id, t)
How can I get this query to use my PRIMARY KEY index?
SELECT
S.metric,
D.t,
D.d
FROM metric_data D
INNER JOIN metrics S
ON S.id = D.metric_id
WHERE S.NAME = ANY (ARRAY ['cpu', 'mem'])
AND D.t BETWEEN '2012-02-05 00:00:00'::TIMESTAMP
AND '2012-05-05 00:00:00'::TIMESTAMP;
EXPLAIN:
Hash Join (cost=271.30..3866384.25 rows=294973 width=25)
Hash Cond: (d.metric_id = s.id)
-> Seq Scan on metric_data d (cost=0.00..3753150.28 rows=29336784 width=20)
Filter: ((t >= '2012-02-05 00:00:00'::timestamp without time zone)
AND (t <= '2012-05-05 00:00:00'::timestamp without time zone))
-> Hash (cost=270.44..270.44 rows=68 width=13)
-> Seq Scan on metrics s (cost=0.00..270.44 rows=68 width=13)
Filter: ((sym)::text = ANY ('{cpu,mem}'::text[]))
For testing purposes you can force the use of the index by "disabling" sequential scans - best in your current session only:
SET enable_seqscan = OFF;
Do not use this on a productive server. Details in the manual here.
I quoted "disabling", because you cannot actually disable sequential table scans. But any other available option is now preferable for Postgres. This will prove that the multicolumn index on (metric_id, t) can be used - just not as effective as an index on the leading column.
You probably get better results by switching the order of columns in your PRIMARY KEY (and the index used to implement it behind the curtains with it) to (t, metric_id). Or create an additional index with reversed columns like that.
Is a composite index also good for queries on the first field?
You do not normally have to force better query plans by manual intervention. If setting enable_seqscan = OFF leads to a much better plan, something is probably not right in your database. Consider this related answer:
Keep PostgreSQL from sometimes choosing a bad query plan
You cannot force index scan in this case because it will not make it faster.
You currently have index on metric_data (metric_id, t), but server cannot take advantage of this index for your query, because it needs to be able to discriminate by metric_data.t only (without metric_id), but there is no such index. Server can use sub-fields in compound indexes, but only starting from the beginning. For example, searching by metric_id will be able to employ this index.
If you create another index on metric_data (t), your query will make use of that index and will work much faster.
Also, you should make sure that you have an index on metrics (id).
Have you tried to use:
WHERE S.NAME = ANY (VALUES ('cpu'), ('mem'))
instead of
ARRAY
like here
It appears you are lacking suitable FK constraints:
CREATE TABLE metric_data
( metric_id integer
, t timestamp
, d double precision
, PRIMARY KEY (metric_id, t)
, FOREIGN KEY metrics_xxx_fk (metric_id) REFERENCES metrics (id)
)
and in table metrics:
CREATE TABLE metrics
( id INTEGER PRIMARY KEY
...
);
Also check if your statistics are sufficient (and fine-grained enough, since you intend to select 0.2 % of the metrics_data table)

PostgreSQL index not used for query on IP ranges

I'm using PostgreSQL 9.2 and have a table of IP ranges. Here's the SQL:
CREATE TABLE ips (
id serial NOT NULL,
begin_ip_num bigint,
end_ip_num bigint,
country_name character varying(255),
CONSTRAINT ips_pkey PRIMARY KEY (id )
)
I've added plain B-tree indices on both begin_ip_num and end_ip_num:
CREATE INDEX index_ips_on_begin_ip_num ON ips (begin_ip_num);
CREATE INDEX index_ips_on_end_ip_num ON ips (end_ip_num );
The query being used is:
SELECT ips.* FROM ips
WHERE 3065106743 BETWEEN begin_ip_num AND end_ip_num;
The problem is that my BETWEEN query is only using the index on begin_ip_num. After using the index, it filters the result using end_ip_num. Here's the EXPLAIN ANALYZE result:
Index Scan using index_ips_on_begin_ip_num on ips (cost=0.00..2173.83 rows=27136 width=76) (actual time=16.349..16.350 rows=1 loops=1)
Index Cond: (3065106743::bigint >= begin_ip_num)
Filter: (3065106743::bigint <= end_ip_num)
Rows Removed by Filter: 47596
Total runtime: 16.425 ms
I've already tried various combinations of indices including adding a composite index on both begin_ip_num and end_ip_num.
Try a multicolumn index, but with reversed order on the second column:
CREATE INDEX index_ips_begin_end_ip_num ON ips (begin_ip_num, end_ip_num DESC);
Ordering is mostly irrelevant for a single-column index, since it can be scanned backwards almost as fast. But it is important for multicolumn indexes.
With the index I propose, Postgres can scan the first column and find the address, where the rest of the index fulfills the first condition. Then it can, for each value of the first column, return all rows that fulfill the second condition, until the first one fails. Then jump to the next value of the first column, etc.
This is still not very efficient and Postgres may be faster just scanning the first index column and filtering for the second. Very much depends on your data distribution.
Either way, CLUSTER using the multicolumn index from above can help performance:
CLUSTER ips USING index_ips_begin_end_ip_num
This way, candidates fulfilling your first condition are packed onto the same or adjacent data pages. Can help performance a lot with if you have lots of rows per value of the first column. Else it is hardly effective.
(There are also non-blocking external tools for the purpose: pg_repack or pg_squeeze.)
Also, is autovacuum running and configured properly or have you run ANALYZE on the table? You need current statistics for Postgres to pick appropriate query plans.
What would really help here is a GiST index for a int8range column, available since PostgreSQL 9.2. See:
Optimizing queries on a range of timestamps (two columns)
If your IP ranges can be covered with one of the built-in network types inet or cidr, consider to replace your two bigint columns. Or, better yet, look to the additional module ip4r by Andrew Gierth (not in the standard distribution. The indexing strategy changes accordingly.
Barring that, you can check out this related answer on dba.SE with using a sophisticated regime with partial indexes. Advanced stuff, but it delivers great performance:
Can spatial index help a “range - order by - limit” query
I had exactly this same problem on a nearly identical dataset from maxmind.com's free geiop table. I solved it using Erwin's tip about range types and GiST indexes. The GiST index was key. Without it I was querying at best about 3 rows per second. With it I queried nearly 500000 rows in under 10 seconds! Since Erwin didn't post detailed instructions on how to do this, I thought I'd add them, here...
First of all, you must add a new column having the range type, note that int8range is required for bigint types. Next set its values appropriately, note that the '[]' parameter indicates to make the range inclusive at lower and upper bounds (rtfm). Finally add the index, note that the GiST index is where all the performance advantage comes from.
alter table ips add column iprange int8range;
update ips set iprange=int8range(begin_ip_num, end_ip_num, '[]');
create index index_ips_on_iprange on ips using gist (iprange);
Having laid the groundwork, you can now use the '<#' contained-by operator to search specific addresses against the table. See http://www.postgresql.org/docs/9.2/static/functions-range.html
SELECT "ips".* FROM "ips" WHERE (3065106743::bigint <# iprange);
I'm a bit late to this party, but this is what works really well for me.
Consider installing ip4r extension. It basically allows you to define a column that can hold IP ranges. The name of the extension implies it is just for IPv4, but currently it is also support IPv6.
After you populate table with ranges within that column all you need, is to create GIST index:
CREATE INDEX ip_zip_ip4_range ON ip_zip USING gist (ip4_range);
I have almost 10 million ranges in my database, but queries take fraction of a milisecond:
region=> select count(*) from ip_zip ;
count
---------
9566133
region=> explain analyze select * from ip_zip where '8.8.8.8'::ip4 <<= ip4_range;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ip_zip (cost=234.55..25681.29 rows=9566 width=22) (actual time=0.085..0.086 rows=1 loops=1)
Recheck Cond: ('8.8.8.8'::ip4r <<= ip4_range)
Heap Blocks: exact=1
-> Bitmap Index Scan on ip_zip_ip4_range (cost=0.00..232.16 rows=9566 width=0) (actual time=0.055..0.055 rows=1 loops=1)
Index Cond: ('8.8.8.8'::ip4r <<= ip4_range)
Planning time: 0.106 ms
Execution time: 0.118 ms
(7 rows)
region=> explain analyze select * from ip_zip where '254.50.22.54'::ip4 <<= ip4_range;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on ip_zip (cost=234.55..25681.29 rows=9566 width=22) (actual time=0.059..0.059 rows=1 loops=1)
Recheck Cond: ('254.50.22.54'::ip4r <<= ip4_range)
Heap Blocks: exact=1
-> Bitmap Index Scan on ip_zip_ip4_range (cost=0.00..232.16 rows=9566 width=0) (actual time=0.048..0.048 rows=1 loops=1)
Index Cond: ('254.50.22.54'::ip4r <<= ip4_range)
Planning time: 0.102 ms
Execution time: 0.145 ms
(7 rows)
I believe your query looks like WHERE [constant] BETWEEN begin_ip_num AND end_ipnum or
As far as I know Postgres doesn't have "AND-EQUAL " access plan, so you need to add a composite index on 2 columns as suggested by Erwin Brandstetter.