PostgreSQL using different index for same query - postgresql

I have a SQL query that inner-joins two tables and filters the data on several parameters. According to the query plan, Postgres uses a different index for different parameter values (for example, a different date range).
I am aware that Postgres decides whether to use an index based on the estimated number of rows in the result set. But why does it choose a different index for the same query? The query time varies by a factor of 10 between the two cases. How can I optimise the query, given that Postgres does not allow the user to specify which index a query should use?
Edit:
explain (analyze, buffers, verbose) SELECT COUNT(*) FROM "bookings" INNER JOIN "hotels" ON "hotels"."id" = "bookings"."hotel_id" WHERE "bookings"."hotel_id" = 37016 AND (bookings.status in (0,1,2,3,4,5,6,7,9,10,11,12)) AND (bookings.source in (0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70) or bookings.status in (0,1,2,3,4,5,6,7,8,9,10,11,13)) AND (
bookings.source in (4,66,65)
OR
date(timezone('+05:30',bookings.created_at))>checkin
OR
(
( date(timezone('+05:30',bookings.created_at))=checkin
and
extract (epoch from COALESCE(cancellation_time,NOW())-bookings.created_at)>600
)
OR
( date(timezone('+05:30',bookings.created_at))<checkin
and
extract (epoch from COALESCE(cancellation_time,NOW())-bookings.created_at)>600
and
(
extract (epoch from ((bookings.checkin||' '||hotels.checkin_time)::timestamp -COALESCE(cancellation_time,bookings.checkin))) < extract(epoch from '16 hours'::interval)
OR
(DATE(bookings.checkout)-DATE(bookings.checkin))*(COALESCE(bookings.oyo_rooms,0)+COALESCE(bookings.owner_rooms,0)) > 3
)
)
)
) AND (bookings.checkin >= '2018-11-21') AND (bookings.checkin <= '2019-05-19') AND "bookings"."hotel_id" = '37016' AND "bookings"."status" IN (0, 1, 2, 3, 12);
QueryPlan : https://explain.depesz.com/s/SPeb
explain (analyze, buffers, verbose) SELECT COUNT(*) FROM "bookings" INNER JOIN "hotels" ON "hotels"."id" = 37016 WHERE "bookings"."hotel_id" = 37016 AND (bookings.status in (0,1,2,3,4,5,6,7,9,10,11,12)) AND (bookings.source in (0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70) or bookings.status in (0,1,2,3,4,5,6,7,8,9,10,11,13)) AND (
bookings.source in (4,66,65)
OR
date(timezone('+05:30',bookings.created_at))>checkin
OR
(
( date(timezone('+05:30',bookings.created_at))=checkin
and
extract (epoch from COALESCE(cancellation_time,now())-bookings.created_at)>600
)
OR
( date(timezone('+05:30',bookings.created_at))<checkin
and
extract (epoch from COALESCE(cancellation_time,now())-bookings.created_at)>600
and
(extract (epoch from ((bookings.checkin||' '||hotels.checkin_time)::timestamp -COALESCE(cancellation_time,bookings.checkin))) < extract(epoch from '16 hours'::interval)
OR
(DATE(bookings.checkout)-DATE(bookings.checkin))*(COALESCE(bookings.oyo_rooms,0)+COALESCE(bookings.owner_rooms,0)) > 3
)
)
)
) AND (bookings.checkin >= '2018-11-22') AND (bookings.checkin <= '2019-05-19') AND "bookings"."hotel_id" = '37016' AND "bookings"."status" IN (0,1,2,3,4,12);
QueryPlan: https://explain.depesz.com/s/DWD

I finally found a solution to this problem. The query filters on more than 10 possible values of one column (status in this case). If I break it into multiple subqueries, each filtering on only one status value, and combine the results with UNION ALL, the executed plan uses the optimized index for each subquery.
Results: this change cut the query time by a factor of 10.
Possible explanation: the planner expects fewer rows for each subquery and therefore picks the optimized index. I am not sure whether this is the correct explanation.
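The rewrite described above can be sketched as follows. This is only an illustration of the query shape, not the original query: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, a made-up miniature bookings table, and just the hotel_id and status filters. It shows that summing one COUNT(*) per status value, glued together with UNION ALL, returns the same number as the single IN-list query. Which index PostgreSQL picks for each branch still has to be checked with EXPLAIN on the real data.

```python
import sqlite3

# Hypothetical miniature "bookings" table; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (id INTEGER PRIMARY KEY, hotel_id INTEGER, status INTEGER)"
)
conn.executemany(
    "INSERT INTO bookings (hotel_id, status) VALUES (?, ?)",
    [(37016, i % 13) for i in range(100)] + [(1, 0), (1, 3)],
)

statuses = [0, 1, 2, 3, 12]

# Original shape: one query with an IN list on status.
placeholders = ", ".join("?" for _ in statuses)
single_count = conn.execute(
    f"SELECT COUNT(*) FROM bookings WHERE hotel_id = ? AND status IN ({placeholders})",
    [37016, *statuses],
).fetchone()[0]

# Rewritten shape: one subquery per status value, combined with UNION ALL and summed.
branches = " UNION ALL ".join(
    "SELECT COUNT(*) AS n FROM bookings WHERE hotel_id = ? AND status = ?"
    for _ in statuses
)
params = [p for s in statuses for p in (37016, s)]
union_count = conn.execute(f"SELECT SUM(n) FROM ({branches})", params).fetchone()[0]

print(single_count, union_count)
```

Both counts agree because the status values are distinct, so no row can be counted by two branches.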

Related

Extremely slow planning for query with lot of joins in PostgreSQL

(Postgres v13)
I've got a query which takes 2 - 5 seconds to plan. The query joins my languages table and translations table to get translation results for multiple languages. When I add even more languages/translations to load, the execution time grows exponentially.
select
key0_.id as col_0_0_,
key0_.name as col_1_0_,
(select
count(screenshot60_.id)
from
screenshot screenshot60_
inner join
key key61_
on screenshot60_.key_id=key61_.id
where
key0_.id=key61_.id) as col_2_0_,
languages2_.tag as col_3_0_,
translatio31_.id as col_4_0_,
translatio31_.text as col_5_0_,
translatio31_.state as col_6_0_,
translatio31_.auto as col_7_0_,
translatio31_.mt_provider as col_8_0_,
languages3_.tag as col_11_0_,
translatio32_.id as col_12_0_,
translatio32_.text as col_13_0_,
translatio32_.state as col_14_0_,
translatio32_.auto as col_15_0_,
translatio32_.mt_provider as col_16_0_,
... the same over and over many times ...
languages30_.tag as col_227_0_,
translatio59_.id as col_228_0_,
translatio59_.text as col_229_0_,
translatio59_.state as col_230_0_,
translatio59_.auto as col_231_0_,
translatio59_.mt_provider as col_232_0_,
0 as col_233_0_,
0 as col_234_0_
from
key key0_
inner join
project project1_
on key0_.project_id=project1_.id
inner join
language languages2_
on project1_.id=languages2_.project_id
and (
languages2_.tag='en-US'
)
inner join
language languages3_
on project1_.id=languages3_.project_id
and (
languages3_.tag='es-PE'
)
... many times the same ...
inner join
language languages30_
on project1_.id=languages30_.project_id
and (
languages30_.tag='es-MX'
)
left outer join
translation translatio31_
on key0_.id=translatio31_.key_id
and (
translatio31_.language_id=languages2_.id
)
... many times the same ...
left outer join
translation translatio59_
on key0_.id=translatio59_.key_id
and (
translatio59_.language_id=languages30_.id
)
where
(
key0_.name in (
'base_administrative_notes.desc'
)
)
and
key0_.project_id=836
group by
key0_.id ,
languages2_.tag ,
translatio31_.id ,
languages3_.tag ,
translatio32_.id ,
... many times the same ...
languages30_.tag ,
translatio59_.id
order by
key0_.name asc nulls first,
key0_.id asc nulls first limit 1
The visualised EXPLAIN ANALYSE result: https://explain.dalibo.com/plan/uWS (the full query can be found there as well as raw output from explain (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)).
I found in other threads that this can be caused by using too many indexes on the tables, but I only have a unique index on my translations table on key_id and language_id columns.
EDIT:
I've found out that setting join_collapse_limit to some value between 1 and 5 reduces the planning time to under 200ms. I don't know if this is the best solution, but I am going to use it as a workaround for now.
As Laurenz Albe explained, the planner is probably trying to reorder the joins to optimize the query.
With n tables, the number of possible join orders is n! (n factorial).
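To get a feel for how quickly n! grows, here is a small Python sketch. It counts only the permutations of n tables, as in the sentence above. The real planner considers other plan shapes as well and prunes the search, so these figures are an illustration of the growth rate rather than the number of plans PostgreSQL actually evaluates.

```python
import math


def join_orders(n_tables: int) -> int:
    """Number of ways to order n tables in a join sequence, i.e. n!."""
    return math.factorial(n_tables)


# Print the count for a few table counts to show how fast it explodes.
for n in (3, 5, 8, 12):
    print(n, join_orders(n))
```

This is why capping how many joins the planner may reorder (join_collapse_limit, whose default is 8) has such a large effect on planning time for queries with dozens of joins.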
My suggestion is to:
make sure the join order in your query is the best one
set that particular parameter to 1 before the query
run the query
reset the parameter
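The steps above can be sketched as a sequence of statements. The helper name with_fixed_join_order is made up for this illustration, and it assumes the statements are sent over a single connection by whatever driver you use. SET LOCAL is used so that the setting lasts only until the end of the transaction, which takes care of the reset step.

```python
from typing import List


def with_fixed_join_order(query: str) -> List[str]:
    """Statements that run `query` with join reordering disabled for one transaction.

    Hypothetical helper: it only builds the statement list, it does not execute it.
    """
    return [
        "BEGIN",
        # With a limit of 1 the planner keeps explicit JOINs in the written order.
        "SET LOCAL join_collapse_limit = 1",
        query,
        # SET LOCAL is undone automatically when the transaction ends.
        "COMMIT",
    ]


statements = with_fixed_join_order("SELECT 1")
for statement in statements:
    print(statement)
```

Because the join order is then taken literally from the query text, the first step in the list (writing the joins in a good order) matters.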
You can check Alicja's slide deck (from slide 22) where she illustrates that particular problem with examples here: https://www.postgresql.eu/events/pgconfeu2017/sessions/session/1617/slides/9/FromMinutesToMilliseconds.pdf

Postgres select on large (15m rows) table extremely slow, even with index

I'm trying to run EXPLAIN ANALYZE but it simply won't finish because it's so slow. If it does, I'll post the results, but for now, here is the EXPLAIN.
Query:
EXPLAIN SELECT
*
FROM
"Posts" AS "Post"
WHERE
(
"Post"."featurePostOnDate" > '2020-06-25 19:28:07.816 +00:00'
OR (
"Post"."featurePostOnDate" IS NULL
AND "Post"."userId" IN (6863684)
)
)
AND "Post"."private" IS NULL
ORDER BY
"Post"."featurePostOnDate" DESC NULLS LAST,
"Post"."createdAt" DESC NULLS LAST
LIMIT 10;
Result:
Limit (cost=0.56..110.92 rows=10 width=1136)
-> Index Scan using posts_updated_following_feed_idx on "Posts" "Post" (cost=0.56..284949.60 rows=25819 width=1136)
Filter: (("featurePostOnDate" > '2020-06-25 19:28:07.816+00'::timestamp with time zone) OR (("featurePostOnDate" IS NULL) AND ("userId" = 6863684)))
Index:
CREATE INDEX "posts_updated_following_feed_idx" ON "public"."Posts" USING btree (
"featurePostOnDate" DESC NULLS LAST,
"createdAt" DESC NULLS LAST
)
WHERE
private IS NULL;
You would need to write it as two separate queries, one for each branch of the OR. Apply the limit to each query, then combine them and apply the limit again jointly. But if the first branch finds ten rows, the second one doesn't need to run at all as all non-NULL dates already come first.
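A minimal sketch of that two-query approach follows. It runs against Python's built-in sqlite3 with a made-up posts table and simplified column names, so it only demonstrates the logic, not PostgreSQL's plans. SQLite sorts NULLs last in descending order by default, which matches DESC NULLS LAST here. The second branch is given only the remaining row budget, which is equivalent to limiting both branches and limiting again after combining them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "feature_on TEXT, created_at TEXT, private INTEGER)"
)
conn.executemany(
    "INSERT INTO posts (user_id, feature_on, created_at, private) VALUES (?, ?, ?, ?)",
    [
        (1, "2020-07-01", "2020-06-01", None),        # featured after the cutoff
        (2, "2020-06-30", "2020-06-02", None),        # featured after the cutoff
        (6863684, None, "2020-06-03", None),          # the user's own unfeatured post
        (6863684, None, "2020-06-04", None),          # the user's own unfeatured post
        (6863684, "2020-01-01", "2020-06-05", None),  # featured before the cutoff
        (3, None, "2020-06-06", None),                # someone else's unfeatured post
    ],
)


def feed(cutoff, user_id, limit=10):
    """Run the first OR branch; run the second only if rows are still missing."""
    featured = conn.execute(
        "SELECT id FROM posts WHERE feature_on > ? AND private IS NULL "
        "ORDER BY feature_on DESC, created_at DESC LIMIT ?",
        (cutoff, limit),
    ).fetchall()
    if len(featured) >= limit:
        return [row[0] for row in featured]
    own = conn.execute(
        "SELECT id FROM posts WHERE feature_on IS NULL AND user_id = ? "
        "AND private IS NULL ORDER BY created_at DESC LIMIT ?",
        (user_id, limit - len(featured)),
    ).fetchall()
    # Rows with a non-NULL date sort before NULL dates, so appending keeps the order.
    return [row[0] for row in featured + own]


# The original single query with OR, for comparison.
reference = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM posts WHERE (feature_on > ? OR "
        "(feature_on IS NULL AND user_id = ?)) AND private IS NULL "
        "ORDER BY feature_on DESC, created_at DESC LIMIT 10",
        ("2020-06-25", 6863684),
    )
]
print(feed("2020-06-25", 6863684), reference)
```

In PostgreSQL each branch can then be served by its own index, for example the existing partial index for the first branch and a hypothetical index starting with "userId" for the second.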
EXPLAIN ANALYZE actually executes the query, so on a table with 15m rows it takes as long as the query itself; see https://www.postgresql.org/docs/9.1/sql-explain.html.
Your WHERE clause also filters on userId, which is not part of the index:
WHERE
(
"Post"."featurePostOnDate" > '2020-06-25 19:28:07.816 +00:00'
OR (
"Post"."featurePostOnDate" IS NULL
AND "Post"."userId" IN (6863684)
)
)
AND "Post"."private" IS NULL
So it is actually walking the index in sort order and applying this filter to every row it visits
Filter: (("featurePostOnDate" > '2020-06-25 19:28:07.816+00'::timestamp with time zone) OR (("featurePostOnDate" IS NULL) AND ("userId" = 6863684)))
That might be the reason your query is slow.
You might need compound indexes on (featurePostOnDate, userId, private) and (featurePostOnDate, private).
I hope this helps.

PostgreSQL gist index

I have a table with two date columns, dateFrom and dateTo. I would like to use the daterange approach in queries together with a gist index, but it doesn't seem to work. The table looks like:
CREATE TABLE test (
id bigserial,
begin_date date,
end_date date
);
CREATE INDEX "idx1"
ON test
USING gist (daterange(begin_date, end_date));
Then when I try to explain a query like:
SELECT t.*
FROM test t
WHERE daterange(t.begin_date,t.end_date,'[]') && daterange('2015-12-30 00:00:00.0','2016-10-28 00:00:00.0','[]')
I get a Seq Scan.
Is this usage of the gist index wrong, or is this scenario not feasible?
You have an index on the expression daterange(begin_date, end_date), but you query your table with daterange(begin_date, end_date, '[]') && .... PostgreSQL won't do the math for you. To rephrase your problem, it is like indexing (int_col + 2) and querying WHERE int_col + 1 > 2. Because the two expressions are different, the index will not be used under any circumstances. But as you can see, sometimes you can do the math (i.e. rephrase the formula) yourself.
You'll either need:
CREATE INDEX idx1 ON test USING gist (daterange(begin_date, end_date, '[]'));
Or:
CREATE INDEX idx2 ON test USING gist (daterange(begin_date, end_date + 1));
Note: both of them create a range which includes end_date. The latter uses the fact that daterange is discrete.
And use the following predicates for each of the indexes above:
WHERE daterange(begin_date, end_date, '[]') && daterange(?, ?, ?)
Or:
WHERE daterange(begin_date, end_date + 1) && daterange(?, ?, ?)
Note: the third parameter of the range constructor on the right side of && does not matter (in the context of index usage).
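The remark that daterange is discrete can be checked outside the database. The following Python sketch (helper names are made up for this illustration) enumerates the days in a closed range and in a half-open range whose end is shifted by one day, and confirms that they contain the same days. That is why daterange(begin_date, end_date, '[]') and daterange(begin_date, end_date + 1) describe the same set of dates, even though the index only matches whichever expression it was built on.

```python
from datetime import date, timedelta


def days_closed(begin, end):
    """All days in the closed range [begin, end]."""
    return {begin + timedelta(days=i) for i in range((end - begin).days + 1)}


def days_half_open(begin, end):
    """All days in the half-open range [begin, end)."""
    return {begin + timedelta(days=i) for i in range((end - begin).days)}


begin, end = date(2015, 12, 30), date(2016, 1, 2)
# '[]' bounds versus the default '[)' bounds with the end moved one day later.
same = days_closed(begin, end) == days_half_open(begin, end + timedelta(days=1))
print(same)
```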

Compare counts in PostgreSQL

I want to compare the results of two queries on the same table by checking the resulting row counts, but Postgres doesn't support column aliases in the WHERE clause.
select id from article where version=1308
and exists(
select count(ident) as count1 from artprice AS p1
where p1.valid_to<to_timestamp(1586642400000) or p1.valid_from>to_timestamp(1672441199000)
and p1.article=article.id
and p1.count1=(select count(ident) from artprice where article=article.id)
)
I also cannot use aggregate functions in the where clause, so
select id from article where version=1308
and exists(
select count(ident) as count1 from artprice AS p1
where p1.valid_to<to_timestamp(1586642400000) or p1.valid_from>to_timestamp(1672441199000)
and p1.article=article.id
and p1.count(ident)=(select count(ident) from artprice where article=article.id)
)
also doesn't work. Any ideas?
EDIT:
What I want to get are articles where every article price is outside of a valid range defined by validFrom and validTo.
I now changed the statement by negating the positive conditions:
Select distinct article.id from Article article, ArtPrice price
where
(
(article.version=?)
and
(
(
(
(
(not(price.valid_from>=?)) or (not(price.valid_to<=?))
)
and
(
(not(price.valid_from<=?)) or (not(price.valid_to>=?))
)
)
and
(
(not(price.valid_to>=?)) or (not(price.valid_to<=?))
)
)
and
(
(not(price.valid_from>=?)) or (not(price.valid_from<=?))
)
)
) and article.id=price.article
Probably not the most elegant solution, but it works.
Aggregates are not allowed in the WHERE clause, but there's the HAVING clause for them.
EDIT: What I want to get are articles where every article price is outside of a valid range defined by validFrom and validTo.
I think that bool_or() would be a good fit here when combined with range operations:
SELECT article.id
FROM Article AS article
JOIN ArtPrice AS price ON price.article = article.id
WHERE article.version = 1308
GROUP BY article.id
HAVING NOT bool_or(tsrange(price.valid_from, price.valid_to)
&& tsrange(to_timestamp(1586642400000),
to_timestamp(1672441199000)))
This reads as "...those having not any price tsrange overlap with given tsrange".
PostgreSQL also supports the SQL OVERLAPS operator:
(price.valid_from, price.valid_to) OVERLAPS (to_timestamp(1586642400000),
to_timestamp(1672441199000))
As a note, it operates on half-open intervals start <= time < end.
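The logic of HAVING NOT bool_or(... && ...) with half-open intervals can be sketched in plain Python. The function names and the sample data are invented for this illustration; any() plays the role of bool_or(), and articles without any price rows are skipped because the inner JOIN in the SQL would not return them either.

```python
from datetime import datetime


def overlaps(start_a, end_a, start_b, end_b):
    """Half-open overlap test: [start_a, end_a) intersects [start_b, end_b)."""
    return start_a < end_b and start_b < end_a


def articles_without_valid_price(prices, window_start, window_end):
    """prices maps an article id to a list of (valid_from, valid_to) pairs."""
    return sorted(
        article
        for article, ranges in prices.items()
        # `ranges and` mirrors the inner JOIN: articles with no prices are dropped.
        if ranges
        and not any(overlaps(f, t, window_start, window_end) for f, t in ranges)
    )


window = (datetime(2020, 4, 12), datetime(2022, 12, 31))
prices = {
    1: [(datetime(2019, 1, 1), datetime(2020, 1, 1))],   # entirely before the window
    2: [(datetime(2019, 1, 1), datetime(2020, 1, 1)),
        (datetime(2021, 1, 1), datetime(2021, 6, 1))],   # one price inside the window
    3: [(datetime(2022, 12, 31), datetime(2024, 1, 1))], # starts exactly at the window end
    4: [],                                               # no prices at all
}
result = articles_without_valid_price(prices, *window)
print(result)
```

Article 3 is reported because a price that starts exactly where the window ends does not overlap it under half-open semantics.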

Need to copy all rows with C_PROV_TYPE = '014' and C_SPECIALTY = '300' and insert back 3 rows with the same data + max sequence number + 1, i.e. 4, 5, 6

I need 3 rows for each one of the two valid rows displayed below.
The primary key is C_PROCEDURE + C_PROV_TYPE + SPEC_SEQ_NO!
The output should be like below.
You could try something like this:
INSERT INTO YourTable (
C_PROCEDURE,
C_PROV_TYPE,
I_PT_SPEC_SEQ_NO,
C_SPECIALTY
)
SELECT
s.C_PROCEDURE,
s.C_PROV_TYPE,
s.MaxSeq + ROW_NUMBER() OVER (
PARTITION BY s.C_PROCEDURE, s.C_PROV_TYPE
ORDER BY v.rn, s.I_PT_SPEC_SEQ_NO),
s.C_SPECIALTY + v.rn
FROM (
SELECT
*,
MAX(I_PT_SPEC_SEQ_NO) OVER (
PARTITION BY C_PROCEDURE, C_PROV_TYPE
) AS MaxSeq
FROM YourTable
) s
CROSS JOIN (
VALUES (1), (2), (3)
) v (rn)
WHERE s.C_PROV_TYPE = '014'
AND s.C_SPECIALTY = '300'
;
Basically, the subquery returns all the YourTable rows supplied with the maximum values of I_PT_SPEC_SEQ_NO for every partition of (C_PROCEDURE, C_PROV_TYPE) using the windowing MAX() function (MAX(...) OVER (...)).
The resulting set of that subquery is then cross-joined to an inline 3-row table (which produces three copies of every row returned) and filtered by the specified values of C_PROV_TYPE and C_SPECIALTY.
New data rows pull C_PROCEDURE and C_PROV_TYPE directly from the subquery. The new C_SPECIALTY values are produced using those from the subquery and the rn values of the inline table. The new sequence numbers are generated with the help of the ROW_NUMBER() function and the maximum sequence numbers returned by the subquery.
As I didn't have access to a working installation of DB2, I was testing my script in SQL Server 2008, trying to stick to features that I understood DB2 supported as well as SQL Server. This SQL Fiddle demo also uses a SQL Server 2008 instance to demonstrate how the query works.
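As a further sanity check, the same statement can be run against Python's built-in sqlite3, again only as a stand-in and not as evidence of DB2 behaviour. The sample rows are made up, the inline VALUES table is replaced by a UNION ALL subquery because SQLite does not accept the v (rn) column-alias syntax, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE YourTable (
        C_PROCEDURE TEXT, C_PROV_TYPE TEXT, I_PT_SPEC_SEQ_NO INTEGER, C_SPECIALTY TEXT,
        PRIMARY KEY (C_PROCEDURE, C_PROV_TYPE, I_PT_SPEC_SEQ_NO)
    )""")
# Hypothetical sample data: one procedure with sequence numbers 1 to 3.
conn.executemany("INSERT INTO YourTable VALUES (?, ?, ?, ?)", [
    ("P1", "014", 1, "100"),
    ("P1", "014", 2, "200"),
    ("P1", "014", 3, "300"),
])

conn.execute("""
    INSERT INTO YourTable (C_PROCEDURE, C_PROV_TYPE, I_PT_SPEC_SEQ_NO, C_SPECIALTY)
    SELECT s.C_PROCEDURE,
           s.C_PROV_TYPE,
           -- new sequence numbers continue after the partition's current maximum
           s.MaxSeq + ROW_NUMBER() OVER (
               PARTITION BY s.C_PROCEDURE, s.C_PROV_TYPE
               ORDER BY v.rn, s.I_PT_SPEC_SEQ_NO),
           s.C_SPECIALTY + v.rn
    FROM (SELECT *, MAX(I_PT_SPEC_SEQ_NO) OVER (
              PARTITION BY C_PROCEDURE, C_PROV_TYPE) AS MaxSeq
          FROM YourTable) s
    CROSS JOIN (SELECT 1 AS rn UNION ALL SELECT 2 UNION ALL SELECT 3) v
    WHERE s.C_PROV_TYPE = '014' AND s.C_SPECIALTY = '300'
""")

new_rows = conn.execute(
    "SELECT I_PT_SPEC_SEQ_NO, C_SPECIALTY FROM YourTable "
    "WHERE I_PT_SPEC_SEQ_NO > 3 ORDER BY I_PT_SPEC_SEQ_NO"
).fetchall()
print(new_rows)
```

The single matching row (sequence 3, specialty 300) is copied three times, and the copies receive the next free sequence numbers within its (C_PROCEDURE, C_PROV_TYPE) partition.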