PostgreSQL GiST index

I have a table with two date columns, like dateTo and dateFrom. I would like to use the daterange approach in queries together with a GiST index, but it doesn't seem to work. The table looks like:
CREATE TABLE test (
    id bigserial,
    begin_date date,
    end_date date
);
CREATE INDEX "idx1"
ON test
USING gist (daterange(begin_date, end_date));
Then when I try to EXPLAIN a query like:
SELECT t.*
FROM test t
WHERE daterange(t.begin_date, t.end_date, '[]')
   && daterange('2015-12-30 00:00:00.0', '2016-10-28 00:00:00.0', '[]')
I get a Seq Scan.
Is this usage of a GiST index wrong, or is this scenario not feasible?

You have an index on the expression daterange(begin_date, end_date), but you query your table with daterange(begin_date, end_date, '[]') && .... PostgreSQL won't do the math for you. To re-phrase your problem: it is as if you had indexed (int_col + 2) and queried WHERE int_col + 1 > 2. Because the two expressions are different, the index will not be used under any circumstances. But as you can see below, you can sometimes do the math yourself (i.e. re-phrase the formula).
You'll either need:
CREATE INDEX idx1 ON test USING gist (daterange(begin_date, end_date, '[]'));
Or:
CREATE INDEX idx2 ON test USING gist (daterange(begin_date, end_date + 1));
Note: both of them create a range which includes end_date. The latter uses the fact that daterange is a discrete range type, so the half-open range [begin_date, end_date + 1) contains exactly the same days as the closed range [begin_date, end_date].
And use the following predicates for each of the indexes above:
WHERE daterange(begin_date, end_date, '[]') && daterange(?, ?, ?)
Or:
WHERE daterange(begin_date, end_date + 1) && daterange(?, ?, ?)
Note: the third parameter of the range constructor on the right side of && does not matter (in the context of index usage).
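As a quick check (a sketch; the exact plan node depends on row counts and statistics), the first rewritten predicate should now match idx1 as redefined above:
EXPLAIN
SELECT t.*
FROM test t
WHERE daterange(t.begin_date, t.end_date, '[]')
   && daterange('2015-12-30', '2016-10-28', '[]');
After loading data and running ANALYZE, this should show a Bitmap Index Scan (or Index Scan) on idx1 instead of the Seq Scan from the question.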

Related

Range contains element filter not using index of the element column

I have a table with a timestamp column that has a btree index.
CREATE TABLE tstest AS
SELECT '2019-11-10'::timestamp + random() * '10 day'::interval ts
FROM generate_series(1,10000) s;
CREATE INDEX ON tstest(ts);
I would like to find all rows within a time range. Both ends of the range can be "infinite"/null, the start of the range must be exclusive and the end inclusive. So this forms a query:
SELECT * FROM tstest WHERE ts <@ tsrange($1, $2, '(]');
The result is correct but the index of the ts column is not used and a seq scan is done instead.
To use the index correctly I have to do the query like this:
SELECT * FROM tstest WHERE ($1 IS null OR $1 < ts) AND ($2 IS null OR ts <= $2);
I like the <@ syntax more. It is easier to understand and it is shorter.
Is there something I can do differently to utilize the index and make the query faster? Maybe a different type of index instead?
I have also tried adding a GiST index for the ts column using the btree_gist module, but that didn't change the situation.
I have tested this with PostgreSQL 9.6 and 12.0.
A btree index (which you are using) doesn't support the <@ operator, and a GiST index wouldn't help because the timestamp column isn't a range type.
But you can still make the btree index usable by using coalesce and infinity, which removes the need for the OR conditions:
SELECT *
FROM tstest
WHERE ts > coalesce($1, '-infinity')
  AND ts <= coalesce($2, 'infinity');
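To verify the index is used even with parameters, you can wrap the query in a prepared statement (a sketch; the statement name and parameter values are made up):
PREPARE ts_between(timestamp, timestamp) AS
SELECT *
FROM tstest
WHERE ts > coalesce($1, '-infinity')
  AND ts <= coalesce($2, 'infinity');
EXPLAIN EXECUTE ts_between('2019-11-12', NULL);
Each coalesce(...) collapses to a single comparison value at execution time, so both conditions are plain btree-indexable comparisons and the plan should show an Index Scan (or Bitmap Index Scan) on the ts index.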

PostgreSQL using different index for same query

I have a SQL query which uses an inner join on two tables and filters data based on several parameters. Going by the query plan, for different values of the query params (like different date ranges), Postgres uses a different index.
I am aware that Postgres decides whether to use an index depending on the estimated number of rows in the result set. But why does Postgres choose a different index for the same query? The query time varies by a factor of 10 between the two cases. How can I optimise the query, given that Postgres does not allow the user to specify which index a query should use?
Edit:
explain (analyze, buffers, verbose)
SELECT COUNT(*)
FROM "bookings"
INNER JOIN "hotels" ON "hotels"."id" = "bookings"."hotel_id"
WHERE "bookings"."hotel_id" = 37016
  AND bookings.status in (0,1,2,3,4,5,6,7,9,10,11,12)
  AND (bookings.source in (0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70)
       or bookings.status in (0,1,2,3,4,5,6,7,8,9,10,11,13))
  AND (
        bookings.source in (4,66,65)
        OR date(timezone('+05:30', bookings.created_at)) > checkin
        OR (
              (
                date(timezone('+05:30', bookings.created_at)) = checkin
                and extract(epoch from COALESCE(cancellation_time, NOW()) - bookings.created_at) > 600
              )
              OR (
                date(timezone('+05:30', bookings.created_at)) < checkin
                and extract(epoch from COALESCE(cancellation_time, NOW()) - bookings.created_at) > 600
                and (
                  extract(epoch from ((bookings.checkin||' '||hotels.checkin_time)::timestamp - COALESCE(cancellation_time, bookings.checkin))) < extract(epoch from '16 hours'::interval)
                  OR (DATE(bookings.checkout) - DATE(bookings.checkin)) * (COALESCE(bookings.oyo_rooms,0) + COALESCE(bookings.owner_rooms,0)) > 3
                )
              )
        )
      )
  AND bookings.checkin >= '2018-11-21'
  AND bookings.checkin <= '2019-05-19'
  AND "bookings"."hotel_id" = '37016'
  AND "bookings"."status" IN (0, 1, 2, 3, 12);
Query plan: https://explain.depesz.com/s/SPeb
explain (analyze, buffers, verbose)
SELECT COUNT(*)
FROM "bookings"
INNER JOIN "hotels" ON "hotels"."id" = 37016
WHERE "bookings"."hotel_id" = 37016
  AND bookings.status in (0,1,2,3,4,5,6,7,9,10,11,12)
  AND (bookings.source in (0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70)
       or bookings.status in (0,1,2,3,4,5,6,7,8,9,10,11,13))
  AND (
        bookings.source in (4,66,65)
        OR date(timezone('+05:30', bookings.created_at)) > checkin
        OR (
              (
                date(timezone('+05:30', bookings.created_at)) = checkin
                and extract(epoch from COALESCE(cancellation_time, now()) - bookings.created_at) > 600
              )
              OR (
                date(timezone('+05:30', bookings.created_at)) < checkin
                and extract(epoch from COALESCE(cancellation_time, now()) - bookings.created_at) > 600
                and (
                  extract(epoch from ((bookings.checkin||' '||hotels.checkin_time)::timestamp - COALESCE(cancellation_time, bookings.checkin))) < extract(epoch from '16 hours'::interval)
                  OR (DATE(bookings.checkout) - DATE(bookings.checkin)) * (COALESCE(bookings.oyo_rooms,0) + COALESCE(bookings.owner_rooms,0)) > 3
                )
              )
        )
      )
  AND bookings.checkin >= '2018-11-22'
  AND bookings.checkin <= '2019-05-19'
  AND "bookings"."hotel_id" = '37016'
  AND "bookings"."status" IN (0,1,2,3,4,12);
Query plan: https://explain.depesz.com/s/DWD
Finally found the solution to this problem. I am querying on the basis of more than 10 possible values of a column (status in this case). If I break this query into multiple sub-queries, each querying upon only one status value, and aggregate the results using UNION ALL, then each subquery's plan uses the optimal index.
Results: the query time decreased by a factor of 10 with this change.
A possible explanation for this behaviour: the planner estimates far fewer rows for each single-status subquery and picks the better-suited index in that case. I am not sure whether this is the correct explanation.
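For illustration, a reduced sketch of that rewrite (hypothetical: only the hotel_id, status and checkin filters are shown; the remaining predicates from the original query would be repeated in every branch):
SELECT SUM(cnt) AS total
FROM (
    SELECT COUNT(*) AS cnt
    FROM bookings
    WHERE hotel_id = 37016 AND status = 0
      AND checkin BETWEEN '2018-11-21' AND '2019-05-19'
    UNION ALL
    SELECT COUNT(*)
    FROM bookings
    WHERE hotel_id = 37016 AND status = 1
      AND checkin BETWEEN '2018-11-21' AND '2019-05-19'
    UNION ALL
    SELECT COUNT(*)
    FROM bookings
    WHERE hotel_id = 37016 AND status = 2
      AND checkin BETWEEN '2018-11-21' AND '2019-05-19'
) per_status;
The idea is that the planner estimates each single-status branch separately, so each branch can get the index that is cheapest for its own row count.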

How can I find a null field efficiently using index

I have a query where I'm trying to find a NULL field among millions of records. There will only be one or two.
The query looks like this:
SELECT *
FROM "table"
WHERE "id" = $1
  AND "end_time" IS NULL
ORDER BY "start_time" DESC
LIMIT 1
How can I make this query more performant, e.g. using indexes in the database?
Try a partial index, something like:
create index iname on "table" (id, start_time) where end_time is null;
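With that index in place, the query should be able to walk the index backwards and stop at the first match (a sketch; 42 is a made-up id value and the exact plan node may vary):
EXPLAIN
SELECT *
FROM "table"
WHERE "id" = 42
  AND "end_time" IS NULL
ORDER BY "start_time" DESC
LIMIT 1;
Because the partial index contains only the one or two rows where end_time is null, stored ordered by (id, start_time), the ORDER BY ... LIMIT 1 needs no separate sort.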

PostgreSQL Check Constraints not Passed in Join

Consider the following structures, a header and line table, both of which are partitioned by date:
create table stage.order_header (
    order_id int not null,
    order_date date not null
);
create table stage.order_line (
    order_id int not null,
    order_date date not null,
    order_line int not null
);
create table stage.order_header_2013 (
    constraint order_header_2013_ck1
        check (order_date >= '2013-01-01' and order_date < '2014-01-01')
) inherits (stage.order_header);
create table stage.order_header_2014 (
    constraint order_header_2014_ck1
        check (order_date >= '2014-01-01' and order_date < '2015-01-01')
) inherits (stage.order_header);
create table stage.order_line_2013 (
    constraint order_line_2013_ck1
        check (order_date >= '2013-01-01' and order_date < '2014-01-01')
) inherits (stage.order_line);
create table stage.order_line_2014 (
    constraint order_line_2014_ck1
        check (order_date >= '2014-01-01' and order_date < '2015-01-01')
) inherits (stage.order_line);
If I look at the explain plan on the following query:
select *
from stage.order_header h
join stage.order_line l on
    h.order_id = l.order_id and
    h.order_date = l.order_date
where h.order_date = '2014-04-01'
It invokes both check constraints and only physically scans the "2014" partitions.
However, if I use an inequality:
where
    h.order_date > '2014-04-01' and
    h.order_date < '2014-05-01'
The check constraint is invoked on the header, but not on the line, and the query will scan the entire line_2013 table, even though the records cannot exist there. My thought was that since order_date is included in the join, any limits on it in one table would propagate to the joined table, but that doesn't appear to be the case.
If I explicitly do this:
where
    h.order_date > '2014-04-01' and
    h.order_date < '2014-05-01' and
    l.order_date > '2014-04-01' and
    l.order_date < '2014-05-01'
Then everything works as expected.
My question is this: I now know this and can add the extra limitations in the where clause, but my concern is with everyone else using the database who doesn't know to do this. Is there a structural (or other) change I can make that would resolve this? I tried adding foreign key constraints, but that didn't change the plan.
Also, the query itself is really physically scanning the 2013 table. It's not just the explain plan.
EDIT:
I did submit this to the bugs list, but it appears this behavior is unlikely to change... this is what prompted me to seek a workaround.
The response to my report was:
If I specifically invoke the range on both the h and l tables, it will
work fine, but since the join specifies those fields have to be the
same, can that condition be propagated automatically?
No. We currently deduce equality transitively, so the planner is able
to extract the constraint l.transaction_date = '2014-03-01' from your
query (and then use that to reason about the check constraints on l's
children). But there's nothing comparable for inequalities, and it's
not clear that adding such logic to the planner would be a net win.
It would be more complicated than the equality case and less often
useful.
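Given that answer, one possible structural workaround (my sketch, not from the bug-report thread) is to hide the duplicated range predicate behind a set-returning SQL function, so other users of the database only pass the bounds once:
create function stage.order_header_line(p_from date, p_to date)
returns table (order_id int, order_date date, order_line int)
language sql stable as $$
    select h.order_id, h.order_date, l.order_line
    from stage.order_header h
    join stage.order_line l on
        h.order_id = l.order_id and
        h.order_date = l.order_date
    where h.order_date > p_from and h.order_date < p_to
      and l.order_date > p_from and l.order_date < p_to;
$$;
-- usage:
select * from stage.order_header_line('2014-04-01', '2014-05-01');
A simple SQL function like this can be inlined by the planner, so both sets of bounds still reach the constraint-exclusion logic.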

Is it possible to optimize a SELECT COUNT(*) query using a filtered index as a hint to achieve constant speed?

I'd like to count all the Orders that are not urgent and whose order status = 1 (shipped).
This should be a very simple query to optimize. I'd like to put a simple filtered index on the Orders table to cover this query and make it a constant-time/O(1) operation. However, when I look at the query plan, it's using an Index Scan, which doesn't make sense. Ideally, this query should just return the number of items in the index.
The table looks like this (simplified to get to the essence):
CREATE TABLE [dbo].[Orders](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [IsUrgent] [bit] NOT NULL,
    [Status] [tinyint] NOT NULL,
    CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED ( [Id] ASC )
);
I've created this filtered index:
CREATE INDEX IX_Orders_ShippedNonUrgent ON Orders(Id) WHERE IsUrgent = 0 AND Status = 1;
Now, when I do this query:
SELECT COUNT(*) FROM Orders WHERE IsUrgent = 0 AND Status = 1
I see that the query plan is using IX_Orders_ShippedNonUrgent, but it's doing an Index Scan and performing around 200 reads across the ~150,000 rows in Orders.
Is it possible to always have this query run in constant time assuming the filtered index is kept up to date? Ideally, it should only perform 1 read to get the size of the index.
If I switch to a non-filtered index like this:
CREATE INDEX IX_Orders_IsUrgentStatus ON Orders(IsUrgent, Status);
The query plan uses an Index Seek, but still performs many more reads than should be necessary to answer this simple query.
UPDATE
I'm able to do this
SELECT TOP 1 rows
FROM sys.partitions p
INNER JOIN sys.indexes i
    ON i.name = 'IX_Orders_ShippedNonUrgent'
    AND i.object_id = p.object_id
    AND i.index_id = p.index_id
and get the result in 9 reads, but it seems like there should be a much easier and less brittle way of doing this with the simple COUNT(*) query.
It seems like what I want isn't possible. The best answer was left in the comments by Nikola Markovinović, which is to forget about the filtered index and use an indexed view instead:
CREATE VIEW [dbo].vw_Orders_TotalShippedNonUrgent WITH SCHEMABINDING
AS
SELECT COUNT_BIG(*) AS TotalOrders
FROM dbo.Orders WHERE IsUrgent = 0 AND Status = 1;
with
CREATE UNIQUE CLUSTERED INDEX IX_vw_Orders_TotalShippedNonUrgent ON vw_Orders_TotalShippedNonUrgent(TotalOrders);
This requires creating a view and its index for each summary statistic that I want, as well as rewriting queries to ask the view instead of taking the simple approach, but it is fast at only 2 reads.
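Reading the count back is then trivial; on editions of SQL Server that don't match indexed views automatically, the NOEXPAND hint forces the view's index to be used (a sketch):
SELECT TotalOrders
FROM dbo.vw_Orders_TotalShippedNonUrgent WITH (NOEXPAND);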
I'll leave this question open for awhile in case anyone has a simpler approach that's just as fast.