PostgreSQL index not used in inner join

I am using PostgreSQL 9.1.
Here is the query:
select a.id,b.flag
from a, b
where b.starting_date>='2002-01-01'::date
and a.zip_code= b.zip_code
and abs( a.date1-b.date1 ) <=60
and abs( a.balance-b.balance ) <=1000
and abs( a.value-b.value ) <=3000
Tables a and b have every single field indexed individually (single-column btree indexes, no multi-column indexes).
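(For reference, those single-column btree indexes would look roughly like this; the two zip_code index names appear in the plan below, the rest are hypothetical, and only the columns used in the query are shown.)
-- zip_code index names taken from the plan below; the other names are hypothetical
CREATE INDEX indx_a_zip_code ON a (zip_code);
CREATE INDEX indx_a_date1 ON a (date1);
CREATE INDEX indx_a_balance ON a (balance);
CREATE INDEX indx_a_value ON a (value);
CREATE INDEX indx_zip_code ON b (zip_code);
CREATE INDEX indx_b_date1 ON b (date1);
CREATE INDEX indx_b_balance ON b (balance);
CREATE INDEX indx_b_value ON b (value);
CREATE INDEX indx_b_starting_date ON b (starting_date);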
Here is explain analyze verbose results:
Merge Join (cost=0.00..448431434.24 rows=440104742 width=16) (actual time=384.268..15151912.652 rows=672144 loops=1)
Output: a.id, b.flag
Merge Cond: (a.zip_code = b.zip_code)
Join Filter: ((abs((a.date1 - b.date1)) <= 60) AND (abs((a.balance - b.balance)) <= 1000) AND (abs((a.value - b.value)) <= 3000))
-> Index Scan using indx_a_zip_code on a (cost=0.00..950851.26 rows=6800857 width=32) (actual time=0.028..22292.274 rows=2080440 loops=1)
Output: a.id, a.zip_code, a.date1, a.balance, a.value
-> Materialize (cost=0.00..1906889.40 rows=19744024 width=28) (actual time=0.032..6148075.701 rows=6472114362 loops=1)
Output: b.balance, b.date1, b.zip_code, b.starting_date, b.value
-> Index Scan using indx_zip_code on b (cost=0.00..1857529.34 rows=19744024 width=28) (actual time=0.025..76893.104 rows=19078422 loops=1)
Output: b.balance, b.date1, b.zip_code, b.starting_date
Filter: (b.starting_date >= '2002-01-01'::date)
Total runtime: 15155983.643 ms
It appears to me that the query planner is using the index for zip_code but not for the other fields. I have 8 million rows in table a and 20 million rows in table b, and the query takes 3 hours to finish. (The server has 64 GB of RAM, and all critical PostgreSQL server configuration settings have been tuned according to https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server.)
Any advice/suggestion is appreciated. Thanks a million!

Related

Can a LEFT JOIN be deferred to only apply to matching rows?

When joining on a table and then filtering (LIMIT 30 for instance), Postgres will apply a JOIN operation on all rows, even if the columns from those rows are only used in the returned columns and not as a filtering predicate.
This would be understandable for an INNER JOIN (PG has to know if the row will be returned or not) or for a LEFT JOIN without a unique constraint (PG has to know if more than one row will be returned or not), but for a LEFT JOIN on a UNIQUE column, this seems wasteful: if the query matches 10k rows, then 10k joins will be performed, and then only 30 will be returned.
It would seem more efficient to "delay", or defer, the join, as much as possible, and this is something that I've seen happen on some other queries.
Splitting this into a subquery (SELECT * FROM (SELECT * FROM main WHERE x LIMIT 30) LEFT JOIN secondary) works, by ensuring that only 30 items are returned from the main table before joining them. But it feels like I'm missing something, and the "standard" form of the query should also apply the same optimization.
Looking at the EXPLAIN plans, however, I can see that the number of rows joined is always the total number of rows, without "early bailing out" as you could see when, for instance, running a Seq Scan with a LIMIT 5.
Example schema, with a main table and a secondary one: secondary columns will only be returned, never filtered on.
drop table if exists secondary;
drop table if exists main;
create table main(id int primary key not null, main_column int);
create index main_column on main(main_column);
insert into main(id, main_column) SELECT i, i % 3000 from generate_series( 1, 1000000, 1) i;
create table secondary(id serial primary key not null, main_id int references main(id) not null, secondary_column int);
create unique index secondary_main_id on secondary(main_id);
insert into secondary(main_id, secondary_column) SELECT i, (i + 17) % 113 from generate_series( 1, 1000000, 1) i;
analyze main;
analyze secondary;
Example query:
explain analyze verbose select main.id, main_column, secondary_column
from main
left join secondary on main.id = secondary.main_id
where main_column = 5
order by main.id
limit 50;
This is the most "obvious" way of writing the query; it takes around 5 ms on average on my computer.
Explain:
Limit (cost=3742.93..3743.05 rows=50 width=12) (actual time=5.010..5.322 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
-> Sort (cost=3742.93..3743.76 rows=332 width=12) (actual time=5.006..5.094 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop Left Join (cost=11.42..3731.90 rows=332 width=12) (actual time=0.123..4.446 rows=334 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Inner Unique: true
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.106..1.021 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.056..0.057 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.12 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=334)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = main.id)
Planning Time: 0.761 ms
Execution Time: 5.423 ms
explain analyze verbose select m.id, main_column, secondary_column
from (
select main.id, main_column
from main
where main_column = 5
order by main.id
limit 50
) m
left join secondary on m.id = secondary.main_id
where main_column = 5
order by m.id
limit 50
This returns the same results, in 2 ms.
The total EXPLAIN cost is also about three times lower, in line with the performance gain we're seeing.
Limit (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.219..2.027 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
-> Nested Loop Left Join (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.216..1.900 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
Inner Unique: true
-> Subquery Scan on m (cost=1048.02..1048.77 rows=1 width=8) (actual time=1.201..1.515 rows=50 loops=1)
Output: m.id, m.main_column
Filter: (m.main_column = 5)
-> Limit (cost=1048.02..1048.14 rows=50 width=8) (actual time=1.196..1.384 rows=50 loops=1)
Output: main.id, main.main_column
-> Sort (cost=1048.02..1048.85 rows=332 width=8) (actual time=1.194..1.260 rows=50 loops=1)
Output: main.id, main.main_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.054..0.753 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.029..0.030 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=50)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = m.id)
Planning Time: 0.161 ms
Execution Time: 2.115 ms
This is a toy dataset here, but on a real DB, the IO difference is significant (no need to fetch 1000 rows when 30 are enough), and the timing difference also quickly adds up (up to an order of magnitude slower).
So my question: is there any way to get the planner to understand that the JOIN can be applied much later in the process?
It seems like something that could be applied automatically to gain a sizeable performance boost.
Deferred joins are good. It's usually helpful to run the limit operation on a subquery that yields only the id values; the ORDER BY ... LIMIT operation then has to sort less data just to discard it.
select main.id, main.main_column, secondary.secondary_column
from main
join (
select id
from main
where main_column = 5
order by id
limit 50
) selection on main.id = selection.id
left join secondary on main.id = secondary.main_id
order by main.id
limit 50
It's also possible adding id to your main_column index will help. With a BTREE index the query planner knows it can get the id values in ascending order from the index, so it may be able to skip the sort step entirely and just scan the first 50 values.
create index main_column on main(main_column, id);
Edit: In a large table, the heavy lifting of your query will be the selection of the 50 main.id values to process. To get those 50 id values as cheaply as possible, you can use a scan of the covering index I proposed together with the subquery I proposed. Once you've got your 50 id values, looking up 50 rows' worth of details from your various tables by main.id and secondary.main_id is trivial: you have the correct indexes in place, and because it's a limited number of rows it won't take much time.
It looks like your table sizes are too small for various optimizations to have much effect, though. Query plans change a lot when tables are larger.
Alternative query, using row_number() instead of LIMIT (I think you could even omit LIMIT here):
-- prepare q3 AS
select m.id, main_column, secondary_column
from (
select id, main_column
, row_number() OVER (ORDER BY id, main_column) AS rn
from main
where main_column = 5
) m
left join secondary on m.id = secondary.main_id
WHERE m.rn <= 50
ORDER BY m.id
LIMIT 50
;
Putting the subsetting into a CTE can prevent it from being merged into the main query:
PREPARE q6 AS
WITH
-- MATERIALIZED -- not needed before version 12
xxx AS (
SELECT DISTINCT x.id
FROM main x
WHERE x.main_column = 5
ORDER BY x.id
LIMIT 50
)
select m.id, m.main_column, s.secondary_column
from main m
left join secondary s on m.id = s.main_id
WHERE EXISTS (
SELECT *
FROM xxx x WHERE x.id = m.id
)
order by m.id
-- limit 50
;

Very slow query on GROUP BY

I have a really slow query (~100 minutes). I have omitted a lot of the inner child nodes, denoting them with a ... suffix.
HashAggregate (cost=6449635645.84..6449635742.59 rows=1290 width=112) (actual time=5853093.882..5853095.159 rows=785 loops=1)
Group Key: p.processid
-> Nested Loop (cost=10851145.36..6449523319.09 rows=832050 width=112) (actual time=166573.289..5853043.076 rows=3904 loops=1)
Join Filter: (SubPlan 2)
Rows Removed by Join Filter: 617040
-> Merge Left Join (cost=5425572.68..5439530.95 rows=1290 width=799) (actual time=80092.782..80114.828 rows=788 loops=1) ...
-> Materialize (cost=5425572.68..5439550.30 rows=1290 width=112) (actual time=109.689..109.934 rows=788 loops=788) ...
SubPlan 2
-> Limit (cost=3869.12..3869.13 rows=5 width=8) (actual time=9.155..9.156 rows=5 loops=620944) ...
Planning time: 1796.764 ms
Execution time: 5853316.418 ms
(2836 rows)
The above query plan is a query executed to the view, schema below (simplified)
create or replace view foo_bar_view(processid, column_1, run_count) as
SELECT
q.processid,
q.column_1,
q.run_count
FROM
(
SELECT
r.processid,
avg(h.some_column) AS column_1,
-- many more aggregate function on many more columns
count(1) AS run_count
FROM
foo_bar_table r,
foo_bar_table h
WHERE (h.processid IN (SELECT p.processid
FROM process p
LEFT JOIN bar i ON p.barid = i.id
LEFT JOIN foo ii ON i.fooid = ii.fooid
JOIN foofoobar pt ON p.typeid = pt.typeid AND pt.displayname ~~
((SELECT ('%'::text || property.value) || '%'::text
FROM property
WHERE property.name = 'something'::text))
WHERE p.processid < r.processid
AND (ii.name = r.foo_name OR ii.name IS NULL AND r.foo_name IS NULL)
ORDER BY p.processid DESC
LIMIT 5))
GROUP BY r.processid
) q;
I would just like to understand: does this mean that most of the time is spent performing the GROUP BY processid?
If not, what is causing the issue? I can't think of a reason why this query is so slow.
The aggregate functions used are avg, min, max, stddev.
A total of 52 of them were used, 4 on each of the 13 columns.
Update: expanding the child node of SubPlan 2, we can see that the Bitmap Index Scan on process_pkey is the bottleneck.
-> Bitmap Heap Scan on process p_30 (cost=1825.89..3786.00 rows=715 width=24) (actual time=8.642..8.833 rows=394 loops=620944)
Recheck Cond: ((typeid = pt_30.typeid) AND (processid < p.processid))
Heap Blocks: exact=185476288
-> BitmapAnd (cost=1825.89..1825.89 rows=715 width=0) (actual time=8.611..8.611 rows=0 loops=620944)
-> Bitmap Index Scan on ix_process_typeid (cost=0.00..40.50 rows=2144 width=0) (actual time=0.077..0.077 rows=788 loops=620944)
Index Cond: (typeid = pt_30.typeid)
-> Bitmap Index Scan on process_pkey (cost=0.00..1761.20 rows=95037 width=0) (actual time=8.481..8.481 rows=145093 loops=620944)
Index Cond: (processid < p.processid)
What I am unable to figure out is why it is using a Bitmap Index Scan and not an Index Scan. From what I can tell, there should only be 788 rows that need to be compared, so wouldn't that be faster? If not, how can I optimise this query?
processid is of type bigint and has an index.
The complete execution plan is here.
You conveniently left out the names of the tables in the execution plan, but I assume that the nested loop join is between foo_bar_table r and foo_bar_table h, and the subplan is the IN condition.
The high execution time is caused by the subplan, which is executed for each potential join result, that is 788 * 788 = 620944 times. 620944 * 9.156 accounts for 5685363 milliseconds.
Create this index:
CREATE INDEX ON process (typeid, processid, installationid);
And run VACUUM:
VACUUM process;
That should give you a fast index-only scan.

Find all points in t2 within 1000m of all points in t1

I have 2 tables, t1 and t2, each with a geography type column called pts_geog and a column id which is a unit identifier. I want to add a column to t1 which counts how many units from t2 are within a distance of 1000 m of any given point in t1. Both tables are reasonably large, each with about 150,000 rows. Computing the distance from each point in t1 to each point in t2 is therefore a very expensive operation, so I am looking for some guidance as to whether what I'm doing has any hope. I have never been able to complete this operation because it runs out of memory. I could split the operation somehow (with a WHERE along another dimension of t1), but I need more help. Here is the select that I would like to use:
select
count(nullif(
ST_DWithin(
g1.pts_geog,
g2.pts_geog,
1000,
false),
false)) as close_1000
from
t1 as g1,
t2 as g2
where
g1.pts_geog IS NOT NULL
and
g2.pts_geog IS NOT NULL
GROUP BY g1.id
suggested answer and EXPLAIN:
airbnb=> EXPLAIN ANALYZE
airbnb-> SELECT t1.listing_id, count(*)
airbnb-> FROM paris as t1
airbnb-> JOIN airdna_property as t2
airbnb-> ON ST_DWithin( t1.pts_geog, t2.pts_geog,1000 )
airbnb-> WHERE t2.city='Paris'
airbnb-> group by t1.listing_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1030317.33..1030386.39 rows=6906 width=8) (actual time=2802071.616..2802084.109 rows=54400 loops=1)
Group Key: t1.listing_id
-> Nested Loop (cost=0.41..1030282.80 rows=6906 width=8) (actual time=0.827..2604319.421 rows=785571807 loops=1)
-> Seq Scan on airdna_property t2 (cost=0.00..74893.44 rows=141004 width=56) (actual time=0.131..738.133 rows=141506 loops=1)
Filter: (city = 'Paris'::text)
Rows Removed by Filter: 400052
-> Index Scan using paris_pts_geog_idx on paris t1 (cost=0.41..6.77 rows=1 width=64) (actual time=0.133..17.865 rows=5552 loops=141506)
Index Cond: (pts_geog && _st_expand(t2.pts_geog, '1000'::double precision))
Filter: ((t2.pts_geog && _st_expand(pts_geog, '1000'::double precision)) AND _st_dwithin(pts_geog, t2.pts_geog, '1000'::double precision, true))
Rows Removed by Filter: 3260
Planning time: 0.197 ms
Execution time: 2802086.005 ms
output of version:
version | postgis_version
----------------------------------------------------------------------------------------------------------+---------------------------------------
PostgreSQL 9.5.7 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005, 64-bit | 2.2 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
Update 2
This is after creating the indices as suggested. Notice that the number of rows increased slightly because I added new data, but the problem is still the same size. It takes 52 minutes. It still says Seq Scan on city, and I don't understand: why doesn't it do an index scan there, given that I created one?
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=904989.83..905049.21 rows=5938 width=8) (actual time=3118569.759..3118581.444 rows=54400 loops=1)
Group Key: t1.listing_id
-> Nested Loop (cost=0.41..904960.14 rows=5938 width=8) (actual time=2.624..2881694.755 rows=837837851 loops=1)
-> Seq Scan on airdna_property t2 (cost=0.00..74842.84 rows=121245 width=56) (actual time=2.263..949.073 rows=151018 loops=1)
Filter: (city = 'Paris'::text)
Rows Removed by Filter: 435564
-> Index Scan using paris_pts_geog_idx on paris t1 (cost=0.41..6.84 rows=1 width=64) (actual time=0.139..18.555 rows=5548 loops=151018)
Index Cond: (pts_geog && _st_expand(t2.pts_geog, '1000'::double precision))
Filter: ((t2.pts_geog && _st_expand(pts_geog, '1000'::double precision)) AND _st_dwithin(pts_geog, t2.pts_geog, '1000'::double precision, true))
Rows Removed by Filter: 3257
Planning time: 0.377 ms
Execution time: 3118583.203 ms
(12 rows)
All you're doing is selecting the count, so just move the clause out of the select list and into the join to trim it up.
SELECT t1.id, count(*)
FROM t1
JOIN t2
ON ST_DWithin( t1.pts_geog, t2.pts_geog, 1000 )
GROUP BY t1.id;
If you need an index, which ST_DWithin can use, run this:
CREATE INDEX ON t1 USING gist (pts_geog);
CREATE INDEX ON t2 USING gist (pts_geog);
VACUUM ANALYZE t1;
VACUUM ANALYZE t2;
Now run the SELECT query above.
Update 2
Your plan shows a seq scan on city, so create an index on city, and then we'll see what more we can do:
CREATE INDEX ON airdna_property (city);
ANALYZE airdna_property;

Need help optimizing this simple query

My end goal is to optimize my query using indexes, but I'm having trouble adding the right index. Everything I've tried results in the same cost in the Explain diagram, and no indication that it's even using any indexes.
I have two tables:
event that has two date columns: start_date and end_date (can be null).
fiscal_date that has:
two date columns start_date and end_date (cannot be null)
a fiscal_year column of type char(4)
a fiscal_quarter column of type char(1)
There's another table, address, that's just one-to-one with a foreign key in event. There are no indexes on it save the primary key.
I have a query that I can't change that figures out what fiscal quarter and year the event starts in:
SELECT
e.*,
(select 'Q' || fd.fiscal_quarter || ' FY' || fd.fiscal_year
from fiscal_date fd
where e.start_date between fd.start_date and fd.end_date
limit 1) as fiscal_quarter_year,
(select 'Q' || fd.fiscal_quarter
from fiscal_date fd
where e.start_date between fd.start_date and fd.end_date
limit 1) as fiscal_quarter,
(select 'FY' || fd.fiscal_year
from fiscal_date fd
where e.start_date between fd.start_date and fd.end_date
limit 1) as fiscal_year,
a.street1, a.street2, a.street3, a.city, a.state, a.country, a.postal_code
FROM event AS e
LEFT OUTER JOIN address a ON e.address_id=a.address_id;
Here's an EXPLAIN of the query (notice all the expensive seq scans on the left):
As requested, here's the output of explain analyze:
Hash Left Join (cost=115.78..2846.64 rows=1649 width=5087) (actual time=18.334..134.279 rows=1649 loops=1)
Hash Cond: (e.address_id = a.address_id)
-> Seq Scan on event e (cost=0.00..323.49 rows=1649 width=5031) (actual time=0.223..19.808 rows=1649 loops=1)
-> Hash (cost=68.68..68.68 rows=3768 width=60) (actual time=17.797..17.797 rows=3768 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 248kB
-> Seq Scan on address a (cost=0.00..68.68 rows=3768 width=60) (actual time=0.004..9.071 rows=3768 loops=1)
SubPlan 1
-> Limit (cost=0.00..0.49 rows=1 width=28) (actual time=0.011..0.014 rows=1 loops=1649)
-> Seq Scan on fiscal_date fd (cost=0.00..1.46 rows=3 width=28) (actual time=0.006..0.006 rows=1 loops=1649)
Filter: (($0 >= start_date) AND ($0 <= end_date))
SubPlan 2
-> Limit (cost=0.00..0.48 rows=1 width=8) (actual time=0.010..0.012 rows=1 loops=1649)
-> Seq Scan on fiscal_date fd (cost=0.00..1.43 rows=3 width=8) (actual time=0.006..0.006 rows=1 loops=1649)
Filter: (($1 >= start_date) AND ($1 <= end_date))
SubPlan 3
-> Limit (cost=0.00..0.48 rows=1 width=20) (actual time=0.010..0.012 rows=1 loops=1649)
-> Seq Scan on fiscal_date fd (cost=0.00..1.43 rows=3 width=20) (actual time=0.005..0.005 rows=1 loops=1649)
Filter: (($2 >= start_date) AND ($2 <= end_date))
Total runtime: 138.008 ms
I've tried adding indexes to event that index the start and end dates (both together and individually) and adding an index to fiscal_date's date columns, but nothing seems to reduce the cost calculation of this query.
How do I optimize this query, or isn't it possible?
OK, so your problem isn't the fiscal_date sequential scans; the number of rows is so small that a sequential scan is probably the correct thing to do. You probably want an index on address_id in both tables.
If address_id is the primary key of the address table, it is already indexed there.
Also, just to be sure, run VACUUM FULL and VACUUM ANALYZE on all tables.
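Something along these lines, for example (the index name is just a placeholder; address_id in address is presumably already covered by its primary key):
-- hypothetical index name; indexes the FK column used in the hash join
CREATE INDEX event_address_id_idx ON event (address_id);
-- reclaim dead space and refresh planner statistics on all tables
VACUUM FULL event;
VACUUM ANALYZE event;
VACUUM FULL address;
VACUUM ANALYZE address;
VACUUM FULL fiscal_date;
VACUUM ANALYZE fiscal_date;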
EDIT:
The performance seems really bad considering there are so few rows (under 10,000 is nothing). Are the tables really big, or is the hardware ancient? If not, you should probably take a serious look at the configuration (work_mem etc.).
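For example, to check the current memory settings and try a larger value for the session (the value below is only a placeholder, not a recommendation):
-- inspect the current settings
SHOW work_mem;
SHOW shared_buffers;
-- placeholder value; tune to your workload and available RAM
SET work_mem = '64MB';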

Why is my query slow?

SELECT "Series".*
,"SeriesTranslations"."id" AS "SeriesTranslations.id"
,"SeriesTranslations"."title" AS "SeriesTranslations.title"
,"SeriesTranslations"."subtitle" AS "SeriesTranslations.subtitle"
,"SeriesTranslations"."slug" AS "SeriesTranslations.slug"
,"SeriesTranslations"."language" AS "SeriesTranslations.language"
,"SeriesTranslations"."seoTitle" AS "SeriesTranslations.seoTitle"
,"SeriesTranslations"."seoDescription" AS "SeriesTranslations.seoDescription"
,"Posts"."id" AS "Posts.id"
,"Posts"."type" AS "Posts.type"
,"Posts"."contentDuration" AS "Posts.contentDuration"
,"Posts"."publishDate" AS "Posts.publishDate"
,"Posts"."publishedAt" AS "Posts.publishedAt"
,"Posts"."thumbnailUrl" AS "Posts.thumbnailUrl"
,"Posts"."imageUrl" AS "Posts.imageUrl"
,"Posts"."media" AS "Posts.media"
,"Posts.PostTranslations"."id" AS "Posts.PostTranslations.id"
,"Posts.PostTranslations"."slug" AS "Posts.PostTranslations.slug"
,"Posts.PostTranslations"."title" AS "Posts.PostTranslations.title"
,"Posts.PostTranslations"."subtitle" AS "Posts.PostTranslations.subtitle"
,"Posts.PostTranslations"."language" AS "Posts.PostTranslations.language"
FROM (
SELECT "Series"."id"
,"Series"."thumbnailUrl"
,"Series"."imageUrl"
,"Series"."coverUrl"
FROM "Series" AS "Series"
WHERE EXISTS (
SELECT *
FROM "SeriesTranslations" AS t
WHERE t.LANGUAGE IN ('en-us')
AND t.slug = 'in-residence-architecture-design-video-series'
AND t."SeriesId" = "Series"."id" LIMIT 1
) LIMIT 1
) AS "Series"
INNER JOIN "SeriesTranslations" AS "SeriesTranslations" ON "Series"."id" = "SeriesTranslations"."SeriesId"
AND "SeriesTranslations"."language" IN ('en-us')
LEFT JOIN "Posts" AS "Posts" ON "Series"."id" = "Posts"."SeriesId"
AND EXISTS (
SELECT *
FROM "PostTranslations" AS pt
WHERE pt.LANGUAGE IN ('en-us')
AND pt."PostId" = "Posts"."id" LIMIT 1
)
LEFT JOIN "PostTranslations" AS "Posts.PostTranslations" ON "Posts"."id" = "Posts.PostTranslations"."PostId"
AND "Posts.PostTranslations"."language" IN ('en-us')
ORDER BY "Posts"."publishDate" DESC;
It loads data from 4 tables: "Series", "SeriesTranslations", "Posts" and "PostTranslations". It retrieves a single "Series" based on the "SeriesTranslations" slug, and also all "Posts" that belong to this series together with their translations.
This query takes ~1.5 sec when the series is returned with 14 posts (14 rows in total are returned from the query). In the DB there are just a few series (no more than 5), each with 2 translations. However, there are many posts in the DB (around 2000), and each one has 2 translations, so around 4k rows in PostTranslations...
Here is the EXPLAIN result.
I have unique indexes on ("slug", "language") in "SeriesTranslations" and "PostTranslations", and I also have foreign keys on "Posts"."SeriesId", "SeriesTranslations"."SeriesId" and "PostTranslations"."PostId".
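Roughly, those correspond to the following DDL (the unique index name on "SeriesTranslations" is taken from the query plan below; the "PostTranslations" index name and the foreign-key statements are just sketches of what I have):
CREATE UNIQUE INDEX "SeriesTranslations-slug-language-unique"
    ON "SeriesTranslations" (slug, language);
-- hypothetical name; same shape as above
CREATE UNIQUE INDEX "PostTranslations-slug-language-unique"
    ON "PostTranslations" (slug, language);
ALTER TABLE "Posts" ADD FOREIGN KEY ("SeriesId") REFERENCES "Series" (id);
ALTER TABLE "SeriesTranslations" ADD FOREIGN KEY ("SeriesId") REFERENCES "Series" (id);
ALTER TABLE "PostTranslations" ADD FOREIGN KEY ("PostId") REFERENCES "Posts" (id);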
EXPLAIN here http://explain.depesz.com/s/fhm
I simplified the query as suggested (removed one subquery and moved conditions into the inner join); however, the query is still slow...
SELECT "Series"."id"
,"Series"."thumbnailUrl"
,"Series"."imageUrl"
,"Series"."coverUrl"
,"SeriesTranslations"."id" AS "SeriesTranslations.id"
,"SeriesTranslations"."title" AS "SeriesTranslations.title"
,"SeriesTranslations"."subtitle" AS "SeriesTranslations.subtitle"
,"SeriesTranslations"."slug" AS "SeriesTranslations.slug"
,"SeriesTranslations"."language" AS "SeriesTranslations.language"
,"SeriesTranslations"."seoTitle" AS "SeriesTranslations.seoTitle"
,"SeriesTranslations"."seoDescription" AS "SeriesTranslations.seoDescription"
,"Posts"."id" AS "Posts.id"
,"Posts"."type" AS "Posts.type"
,"Posts"."contentDuration" AS "Posts.contentDuration"
,"Posts"."publishDate" AS "Posts.publishDate"
,"Posts"."publishedAt" AS "Posts.publishedAt"
,"Posts"."thumbnailUrl" AS "Posts.thumbnailUrl"
,"Posts"."imageUrl" AS "Posts.imageUrl"
,"Posts"."media" AS "Posts.media"
,"Posts.PostTranslations"."id" AS "Posts.PostTranslations.id"
,"Posts.PostTranslations"."slug" AS "Posts.PostTranslations.slug"
,"Posts.PostTranslations"."title" AS "Posts.PostTranslations.title"
,"Posts.PostTranslations"."subtitle" AS "Posts.PostTranslations.subtitle"
,"Posts.PostTranslations"."language" AS "Posts.PostTranslations.language"
FROM "Series" AS "Series"
INNER JOIN "SeriesTranslations" AS "SeriesTranslations" ON "Series"."id" = "SeriesTranslations"."SeriesId"
AND "SeriesTranslations"."language" IN ('en-us')
AND "SeriesTranslations"."slug" = 'sdf'
LEFT JOIN "Posts" AS "Posts" ON "Series"."id" = "Posts"."SeriesId"
AND EXISTS (
SELECT *
FROM "PostTranslations" AS pt
WHERE pt.LANGUAGE IN ('en-us')
AND pt."PostId" = "Posts"."id" LIMIT 1
)
LEFT JOIN "PostTranslations" AS "Posts.PostTranslations" ON "Posts"."id" = "Posts.PostTranslations"."PostId"
AND "Posts.PostTranslations"."language" IN ('en-us')
WHERE (1 = 1)
ORDER BY "Posts"."publishDate" DESC
,"Posts"."id" DESC;
And here is the new query plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1014671.76..1014671.76 rows=1 width=695) (actual time=2140.906..2140.908 rows=14 loops=1)
Sort Key: "Posts"."publishDate", "Posts".id
Sort Method: quicksort Memory: 45kB
-> Nested Loop Left Join (cost=0.03..1014671.76 rows=1 width=695) (actual time=85.862..2140.745 rows=14 loops=1)
Join Filter: ("Posts".id = "Posts.PostTranslations"."PostId")
Rows Removed by Join Filter: 28266
-> Nested Loop (cost=0.03..1014165.24 rows=1 width=564) (actual time=85.307..2042.304 rows=14 loops=1)
Join Filter: ("Series".id = "SeriesTranslations"."SeriesId")
Rows Removed by Join Filter: 35
-> Index Scan using "SeriesTranslations-slug-language-unique" on "SeriesTranslations" (cost=0.03..4.03 rows=1 width=200) (actual time=0.044..0.046 rows=1 loops=1)
Index Cond: ((slug = 'in-residence-architecture-design-video-series'::text) AND (language = 'en-us'::text))
-> Nested Loop Left Join (cost=0.00..1014159.63 rows=450 width=368) (actual time=85.243..2042.207 rows=49 loops=1)
Join Filter: ("Series".id = "Posts"."SeriesId")
Rows Removed by Join Filter: 18131
-> Seq Scan on "Series" (cost=0.00..11.35 rows=450 width=100) (actual time=0.006..0.046 rows=9 loops=1)
-> Materialize (cost=0.00..1012330.79 rows=1010 width=272) (actual time=4.422..226.499 rows=2020 loops=9)
-> Seq Scan on "Posts" (cost=0.00..1012329.78 rows=1010 width=272) (actual time=39.785..2020.448 rows=2020 loops=1)
Filter: (SubPlan 1)
SubPlan 1
-> Limit (cost=0.00..500.94 rows=1 width=1267) (actual time=0.995..0.995 rows=1 loops=2020)
-> Seq Scan on "PostTranslations" pt (cost=0.00..500.94 rows=1 width=1267) (actual time=0.992..0.992 rows=1 loops=2020)
Filter: ((language = 'en-us'::text) AND ("PostId" = "Posts".id))
Rows Removed by Filter: 1591
-> Seq Scan on "PostTranslations" "Posts.PostTranslations" (cost=0.00..499.44 rows=2020 width=135) (actual time=0.003..3.188 rows=2020 loops=14)
Filter: (language = 'en-us'::text)
Rows Removed by Filter: 964
Total runtime: 2141.432 ms
(27 rows)
An index on the FKs might help the JOINs:
CREATE INDEX ON "PostTranslations" ("PostId"); -- for the FK
VACUUM ANALYZE "PostTranslations"; -- refresh statistics
CREATE INDEX ON "SeriesTranslations" ("SeriesId"); -- FK
VACUUM ANALYZE "SeriesTranslations";
CREATE INDEX ON "Posts" ("SeriesId"); -- FK
VACUUM ANALYZE "Posts";
And REMOVE the LIMIT 1 from the EXISTS(...) subqueries. They can only do harm.
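For example, the join condition on "Posts" would then read (the same as in your query, just without the LIMIT 1):
LEFT JOIN "Posts" AS "Posts" ON "Series"."id" = "Posts"."SeriesId"
    AND EXISTS (
        SELECT *
        FROM "PostTranslations" AS pt
        WHERE pt.language IN ('en-us')
            AND pt."PostId" = "Posts"."id"
    )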