I have two tables:
fccuser=# select count(*) from public.fine_collection where user_id = 5000;
count
-------
2500
(1 row)
fccuser=# select count(*) from public.police_notice where user_id = 5000;
count
-------
1011
(1 row)
And when I issue
fccuser=# select count(*)
from public.fine_collection, public.police_notice
where fine_collection.user_id = 5000
and fine_collection.user_id = police_notice.user_id;
I was expecting 2500 but I got
count
2527500
(1 row)
i.e., the size of a Cartesian product of the two. And the EXPLAIN ANALYZE output is:
fccuser=# explain analyze verbose select count(*) from public.fine_collection, public.police_notice where fine_collection.user_id = 5000 and fine_collection.user_id = police_notice.user_id;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=47657.20..47657.21 rows=1 width=0) (actual time=1991.552..1991.552 rows=1 loops=1)
Output: count(*)
-> Nested Loop (cost=0.86..39760.60 rows=3158640 width=0) (actual time=0.448..1462.155 rows=2527500 loops=1)
-> Index Only Scan using idx_user_id on public.fine_collection (cost=0.43..265.98 rows=8774 width=8) (actual time=0.213..2.448 rows=2500 loops=1)
Output: fine_collection.user_id
Index Cond: (fine_collection.user_id = 5000)
Heap Fetches: 1771
-> Materialize (cost=0.42..12.52 rows=360 width=2) (actual time=0.000..0.205 rows=1011 loops=2500)
Output: police_notice.user_id
-> Index Only Scan using idx_pn_userid on public.police_notice (cost=0.42..10.72 rows=360 width=2) (actual time=0.217..1.101 rows=1011 loops=1)
Output: police_notice.user_id
Index Cond: (police_notice.user_id = 5000)
Heap Fetches: 751
Planning time: 2.126 ms
Execution time: 1991.697 ms
(15 rows)
The Postgres documentation states that when a join is performed on non-primary-key columns, it conceptually first creates a Cartesian product (cross join) and then applies the filter. But I think the Cartesian product would consist entirely of rows with the same user_id in my case, so I don't see how the filter could reduce it.
The same happens with LEFT JOIN, INNER JOIN, etc.; only a subquery seems to give the expected result of 2500.
I am reasonably sure it does not work this way in MySQL, though. Any thoughts?
Thank you
The result of your join is correct. You join every fine_collection row that has user_id 5000 with every police_notice row that has the same user_id. You have 2500 rows joined with 1011 rows, and this produces 2527500 result rows.
You're using legacy join syntax, so here's the query rephrased to use a readable ANSI join.
SELECT
count(*)
FROM public.fine_collection
INNER JOIN public.police_notice
ON (fine_collection.user_id = police_notice.user_id)
WHERE fine_collection.user_id = 5000;
So, you're doing a count(*). This counts all rows in the cross product of both tables that match the join condition and where clause.
In other words, the result is the number of rows with user_id = 5000 in each table, multiplied together.
Your query does the same thing as
SELECT
(SELECT count(*) FROM public.fine_collection WHERE user_id = 5000)
*
(SELECT count(*) FROM public.police_notice WHERE user_id = 5000);
and yes, 2500 * 1011 = 2527500 so that's exactly right.
If you expect 2500, you need to join on or group by a key of fine_collection, so that each fine_collection row is counted only once.
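For example, a minimal sketch using EXISTS, which counts each fine_collection row at most once instead of once per matching police_notice row (assuming that is the count you are after):
SELECT count(*)
FROM public.fine_collection fc
WHERE fc.user_id = 5000
  AND EXISTS (
    SELECT 1
    FROM public.police_notice pn
    WHERE pn.user_id = fc.user_id
  );
This returns 2500 here, because every fine_collection row with user_id = 5000 has at least one matching police_notice row.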
Related
When joining on a table and then filtering (LIMIT 30 for instance), Postgres will apply a JOIN operation on all rows, even if the columns from those rows are only used in the returned columns, and not as a filtering predicate.
This would be understandable for an INNER JOIN (PG has to know if the row will be returned or not) or for a LEFT JOIN without a unique constraint (PG has to know if more than one row will be returned or not), but for a LEFT JOIN on a UNIQUE column, this seems wasteful: if the query matches 10k rows, then 10k joins will be performed, and then only 30 will be returned.
It would seem more efficient to "delay", or defer, the join, as much as possible, and this is something that I've seen happen on some other queries.
Splitting this into a subquery (SELECT * FROM (SELECT * FROM main WHERE x LIMIT 30) LEFT JOIN secondary) works, by ensuring that only 30 items are returned from the main table before joining them, but it feels like I'm missing something, and the "standard" form of the query should also apply the same optimization.
Looking at the EXPLAIN plans, however, I can see that the number of rows joined is always the total number of rows, without "early bailing out" as you could see when, for instance, running a Seq Scan with a LIMIT 5.
Example schema, with a main table and a secondary one: secondary columns will only be returned, never filtered on.
drop table if exists secondary;
drop table if exists main;
create table main(id int primary key not null, main_column int);
create index main_column on main(main_column);
insert into main(id, main_column) SELECT i, i % 3000 from generate_series( 1, 1000000, 1) i;
create table secondary(id serial primary key not null, main_id int references main(id) not null, secondary_column int);
create unique index secondary_main_id on secondary(main_id);
insert into secondary(main_id, secondary_column) SELECT i, (i + 17) % 113 from generate_series( 1, 1000000, 1) i;
analyze main;
analyze secondary;
Example query:
explain analyze verbose select main.id, main_column, secondary_column
from main
left join secondary on main.id = secondary.main_id
where main_column = 5
order by main.id
limit 50;
This is the most "obvious" way of writing the query; it takes on average around 5ms on my computer.
Explain:
Limit (cost=3742.93..3743.05 rows=50 width=12) (actual time=5.010..5.322 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
-> Sort (cost=3742.93..3743.76 rows=332 width=12) (actual time=5.006..5.094 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop Left Join (cost=11.42..3731.90 rows=332 width=12) (actual time=0.123..4.446 rows=334 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Inner Unique: true
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.106..1.021 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.056..0.057 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.12 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=334)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = main.id)
Planning Time: 0.761 ms
Execution Time: 5.423 ms
explain analyze verbose select m.id, main_column, secondary_column
from (
select main.id, main_column
from main
where main_column = 5
order by main.id
limit 50
) m
left join secondary on m.id = secondary.main_id
where main_column = 5
order by m.id
limit 50
This returns the same results, in 2ms.
The total EXPLAIN cost is also about three times lower, in line with the performance gain we're seeing.
Limit (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.219..2.027 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
-> Nested Loop Left Join (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.216..1.900 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
Inner Unique: true
-> Subquery Scan on m (cost=1048.02..1048.77 rows=1 width=8) (actual time=1.201..1.515 rows=50 loops=1)
Output: m.id, m.main_column
Filter: (m.main_column = 5)
-> Limit (cost=1048.02..1048.14 rows=50 width=8) (actual time=1.196..1.384 rows=50 loops=1)
Output: main.id, main.main_column
-> Sort (cost=1048.02..1048.85 rows=332 width=8) (actual time=1.194..1.260 rows=50 loops=1)
Output: main.id, main.main_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.054..0.753 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.029..0.030 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=50)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = m.id)
Planning Time: 0.161 ms
Execution Time: 2.115 ms
This is a toy dataset here, but on a real DB, the IO difference is significant (no need to fetch 1000 rows when 30 are enough), and the timing difference also quickly adds up (up to an order of magnitude slower).
So my question: is there any way to get the planner to understand that the JOIN can be applied much later in the process?
It seems like something that could be applied automatically to gain a sizeable performance boost.
Deferred joins are good. It's usually helpful to run the limit operation in a subquery that yields only the id values, so the order by ... limit step has less data to sort only to discard most of it.
select main.id, main.main_column, secondary.secondary_column
from main
join (
select id
from main
where main_column = 5
order by id
limit 50
) selection on main.id = selection.id
left join secondary on main.id = secondary.main_id
order by main.id
limit 50
It's also possible that adding id to your main_column index will help. With a BTREE index the query planner knows it can read the id values in ascending order from the index, so it may be able to skip the sort step entirely and just scan the first 50 values.
drop index if exists main_column;
create index main_column on main(main_column, id);
Edit: In a large table, the heavy lifting of your query will be the selection of the 50 main.id values to process. To get those 50 id values as cheaply as possible, you can use a scan of the covering index I proposed together with the subquery I proposed. Once you've got your 50 id values, looking up 50 rows' worth of details from your various tables by main.id and secondary.main_id is trivial; you have the correct indexes in place, and because it's a limited number of rows it won't take much time.
It looks like your table sizes are too small for various optimizations to have much effect, though. Query plans change a lot when tables are larger.
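As a rough check that the 50 ids really do come cheaply from the covering index, you can run the inner query on its own (a sketch only; no plan output is shown here, but the shape to look for is a Limit sitting directly on a scan of the (main_column, id) index, with no separate Sort node):
explain (analyze, verbose)
select id
from main
where main_column = 5
order by id
limit 50;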
Alternative query, using row_number() instead of LIMIT (I think you could even omit LIMIT here):
-- prepare q3 AS
select m.id, main_column, secondary_column
from (
select id, main_column
, row_number() OVER (ORDER BY id, main_column) AS rn
from main
where main_column = 5
) m
left join secondary on m.id = secondary.main_id
WHERE m.rn <= 50
ORDER BY m.id
LIMIT 50
;
Putting the subsetting into a CTE can prevent it from being merged into the main query:
PREPARE q6 AS
WITH
-- MATERIALIZED -- not needed before version 12
xxx AS (
SELECT DISTINCT x.id
FROM main x
WHERE x.main_column = 5
ORDER BY x.id
LIMIT 50
)
select m.id, m.main_column, s.secondary_column
from main m
left join secondary s on m.id = s.main_id
WHERE EXISTS (
SELECT *
FROM xxx x WHERE x.id = m.id
)
order by m.id
-- limit 50
;
I have a table in Redshift with a few billion rows, which looks like this:
CREATE TABLE channels (
  fact_key TEXT NOT NULL distkey,
  job_key BIGINT,
  channel_key TEXT NOT NULL
)
diststyle key
compound sortkey(job_key, channel_key);
When I query by job_key + channel_key my seq scan is properly restricted by the full sortkey if I use specific values for channel_key in my query.
EXPLAIN
SELECT * FROM channels scd
WHERE scd.job_key = 1 AND scd.channel_key IN ('1234', '1235', '1236', '1237')
XN Seq Scan on channels scd (cost=0.00..3178474.92 rows=3428929 width=77)
Filter: ((((channel_key)::text = '1234'::text) OR ((channel_key)::text = '1235'::text) OR ((channel_key)::text = '1236'::text) OR ((channel_key)::text = '1237'::text)) AND (job_key = 1))
However, if I query against channel_key using IN + a subquery, Redshift does not use the sortkey.
EXPLAIN
SELECT * FROM channels scd
WHERE scd.job_key = 1 AND scd.channel_key IN (select distinct channel_key from other_channel_list where job_key = 14 order by 1)
XN Hash IN Join DS_DIST_ALL_NONE (cost=3.75..3540640.36 rows=899781 width=77)
Hash Cond: (("outer".channel_key)::text = ("inner".channel_key)::text)
-> XN Seq Scan on channels scd (cost=0.00..1765819.40 rows=141265552 width=77)
Filter: (job_key = 1)
-> XN Hash (cost=3.75..3.75 rows=1 width=402)
-> XN Subquery Scan "IN_subquery" (cost=0.00..3.75 rows=1 width=402)
-> XN Unique (cost=0.00..3.74 rows=1 width=29)
-> XN Seq Scan on other_channel_list (cost=0.00..3.74 rows=1 width=29)
Filter: (job_key = 14)
Is it possible to get this to work? My ultimate goal is to turn this into a view, so pre-defining my list of channel_keys won't work.
Edit to provide more context:
This is part of a larger query and the results of this get hash joined to some other data. If I hard-code the channel_keys then the input to the hash join is ~2 million rows. If I use the IN condition with the subquery (nothing else changes) then the input to the hash join is 400 million rows. The total query time goes from ~40 seconds to 15+ minutes.
Does this give you a better plan than the subquery version?
with other_channels as (
select distinct channel_key from other_channel_list where job_key = 14 order by 1
)
SELECT *
FROM channels scd
JOIN other_channels ocd on scd.channel_key = ocd.channel_key
WHERE scd.job_key = 1
I'm working with PostgreSQL. I have two tables; assume for the sake of this problem that me can take multiple ID values. The first table, Table1, deals with messages sent:
me | friends | messages_sent
---+---------+---------------
 0 |       1 |            10
 0 |       2 |             7
 0 |       3 |             7
 0 |       4 |             6
 1 |       1 |             5
 1 |       2 |            12
...
The second Table2 deals with messages received:
me | friends | messages_received
---+---------+-------------------
 0 |       4 |                17
 0 |       2 |                 7
 0 |       1 |                 9
 0 |       3 |                 0
...
How can I get a table like the following (the order of friends is not important):
me | friends | messages_total
---+---------+----------------
 0 |       1 |             19
 0 |       2 |             14
 0 |       3 |              7
 0 |       4 |             23
...
The part I'm stumped on is how to join both tables on me and then add up the values for each friend for a given value of me ... thoughts?
You should join the two tables using both fields me and friends and then simply add up the messages received and sent. Using a FULL JOIN ensures that all situations, such as me sending but not receiving from a friend and vice-versa, are included.
SELECT me, friends,
coalesce(messages_sent, 0) + coalesce(messages_received, 0) AS messages_total
FROM Table1
FULL JOIN Table2 USING (me, friends)
ORDER BY me;
You can simply generate the union of the two tables and then use a GROUP BY to group combinations of me and friends adding the message counts with an aggregate function:
SELECT me, friends, sum(count) AS messages_total
FROM (
SELECT me, friends, messages_sent AS count FROM Table1
UNION ALL
SELECT me, friends, messages_received FROM Table2
) AS t
GROUP BY me, friends;
Edit: I was about to edit my answer in order to add a note recommending Patrick's answer as being better, but I decided for fun to run a simple benchmark. So if we have the following setup (1 million rows for each table):
CREATE TABLE table1 (
me integer not null,
friends integer not null,
messages_sent integer not null
);
CREATE TABLE table2 (
me integer not null,
friends integer not null,
messages_received integer not null
);
INSERT INTO table1 SELECT n1, n2, floor(random()*10)::integer FROM generate_series(1, 1000) t1(n1), generate_series(1, 1000) t2(n2);
INSERT INTO table2 SELECT n1, n2, floor(random()*10)::integer FROM generate_series(1, 1000) t1(n1), generate_series(1, 1000) t2(n2);
CREATE INDEX ON table1(me, friends);
CREATE INDEX ON table2(me, friends);
ANALYZE;
Then the first solution:
$ EXPLAIN ANALYZE
SELECT me, friends, sum(count) AS messages_total
FROM (
SELECT me, friends, messages_sent AS count FROM Table1
UNION ALL
SELECT me, friends, messages_received FROM Table2
) AS t
GROUP BY me, friends;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=45812.00..46212.00 rows=40000 width=12) (actual time=1201.602..1499.285 rows=1000000 loops=1)
Group Key: table1.me, table1.friends
-> Append (cost=0.00..30812.00 rows=2000000 width=12) (actual time=0.022..299.260 rows=2000000 loops=1)
-> Seq Scan on table1 (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.020..91.357 rows=1000000 loops=1)
-> Seq Scan on table2 (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.004..77.672 rows=1000000 loops=1)
Planning time: 0.255 ms
Execution time: 1529.642 ms
And the second solution:
$ EXPLAIN ANALYZE
SELECT me, friends,
coalesce(messages_sent, 0) + coalesce(messages_received, 0) AS messages_total
FROM Table1
FULL JOIN Table2 USING (me, friends)
ORDER BY me;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=219582.13..222082.13 rows=1000000 width=24) (actual time=1501.873..1583.915 rows=1000000 loops=1)
Sort Key: (COALESCE(table1.me, table2.me))
Sort Method: external sort Disk: 21512kB
-> Merge Full Join (cost=0.85..99414.29 rows=1000000 width=24) (actual time=0.074..912.598 rows=1000000 loops=1)
Merge Cond: ((table1.me = table2.me) AND (table1.friends = table2.friends))
-> Index Scan using table1_me_friends_idx on table1 (cost=0.42..38483.49 rows=1000000 width=12) (actual time=0.039..165.772 rows=1000000 loops=1)
-> Index Scan using table2_me_friends_idx on table2 (cost=0.42..38483.49 rows=1000000 width=12) (actual time=0.018..194.177 rows=1000000 loops=1)
Planning time: 1.091 ms
Execution time: 1615.011 ms
So surprisingly, the solution with the FULL JOIN performs slightly worse, even though it can utilize the index. I guess this has to do with the full join; for other types of join it would have been much better.
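For comparison, a sketch of the inner-join variant hinted at above; note it is only equivalent to the other two queries if every (me, friends) pair exists in both tables, otherwise rows are silently dropped:
SELECT me, friends,
       messages_sent + messages_received AS messages_total
FROM table1
JOIN table2 USING (me, friends)
ORDER BY me;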
If I want to select 0.5% of rows, or even 5% of rows, from the following table via the PK, the query planner correctly chooses to use the PK index. Here is the table:
create table weather as
with numbers as(
select generate_series as id from generate_series(0,1048575))
select id,
50 + 50*sin(id) as temperature_in_f,
50 + 50*sin(id) as humidity_in_percent
from numbers;
alter table weather
add constraint pk_weather primary key(id);
vacuum analyze weather;
The stats are up-to-date, and the following query does use the PK index:
explain analyze select sum(w.id), sum(humidity_in_percent), count(*)
from weather as w
where w.id between 1 and 66720;
Suppose, however, that we need to join this table with another, much smaller, one:
create table lightnings
as
select id as weather_id
from weather
where humidity_in_percent between 99.99 and 100;
alter table lightnings
add constraint pk_lightnings
primary key(weather_id);
analyze lightnings;
Here is my join, in four logically equivalent forms:
explain analyze select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
where l.weather_id=w.id);
explain analyze select sum(w.id), count(*)
from weather as w
join lightnings as l
on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
explain analyze select sum(w.id), count(*)
from lightnings as l
join weather as w
on l.weather_id=w.id
where w.humidity_in_percent between 99.99 and 100;
-- replaced explicit join with where clause
explain analyze select sum(w.id), count(*)
from lightnings as l, weather as w
where w.humidity_in_percent between 99.99 and 100
and l.weather_id=w.id;
Unfortunately the query planner resorts to scanning the whole weather table:
"Aggregate (cost=22645.68..22645.69 rows=1 width=4) (actual time=167.427..167.427 rows=1 loops=1)"
" -> Hash Join (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)"
" Hash Cond: (w.id = l.weather_id)"
" -> Seq Scan on weather w (cost=0.00..22407.64 rows=5106 width=4) (actual time=0.013..158.593 rows=6672 loops=1)"
" Filter: ((humidity_in_percent >= 99.99::double precision) AND (humidity_in_percent <= 100::double precision))"
" Rows Removed by Filter: 1041904"
" -> Hash (cost=96.72..96.72 rows=6672 width=4) (actual time=2.479..2.479 rows=6672 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 235kB"
" -> Seq Scan on lightnings l (cost=0.00..96.72 rows=6672 width=4) (actual time=0.009..0.908 rows=6672 loops=1)"
"Planning time: 0.326 ms"
"Execution time: 167.581 ms"
The query planner's estimate of how many rows in the weather table will be selected is rows=5106. This is reasonably close to the exact value of 6672. If I select this small number of rows in the weather table via id, the PK index is used. If I select the same amount via a join with another table, the query planner opts for scanning the whole table.
What am I missing?
select version()
"PostgreSQL 9.4.0"
Edit: if I remove the condition on humidity, the query planner correctly recognizes that the condition on weather.id is quite selective, and chooses to use the index on PK:
explain analyze select sum(w.id), count(*) from weather as w
where exists(select * from lightnings as l
where l.weather_id=w.id);
"Aggregate (cost=14677.84..14677.85 rows=1 width=4) (actual time=37.200..37.200 rows=1 loops=1)"
" -> Nested Loop (cost=0.42..14644.48 rows=6672 width=4) (actual time=0.022..36.189 rows=6672 loops=1)"
" -> Seq Scan on lightnings l (cost=0.00..96.72 rows=6672 width=4) (actual time=0.011..0.868 rows=6672 loops=1)"
" -> Index Only Scan using pk_weather on weather w (cost=0.42..2.17 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=6672)"
" Index Cond: (id = l.weather_id)"
" Heap Fetches: 0"
"Planning time: 0.321 ms"
"Execution time: 37.254 ms"
Yet adding a condition totally confuses the query planner.
Expecting the optimiser to use an index on the PK of the larger table implies that you expect the query to be driven from the smaller table. Of course, you know that the rows that the smaller table will join to in the larger one are the same as those selected by the predicate on it, but the optimiser does not.
Look at the line on the plan:
Hash Join (cost=180.12..22645.52 rows=32 width=4) (actual time=2.500..166.444 rows=6672 loops=1)"
It expects 32 rows to result from the join, but 6672 actually result.
Anyway, it pretty much has the option of:
1. A full scan on the smaller table, and an index lookup on the larger, with the predicate being used to filter out rows subsequent to the join (and expecting most of the rows to then be filtered out).
2. A full scan on both tables, with rows being removed by the predicate on the larger table, and a hash join of the result.
3. A scan of the larger table with rows being removed by the predicate, and an index lookup on the smaller table that may fail to find a value.
The second of these has been judged to be the lowest cost, and I think it is correct to do so based on the evidence it has, as hash joins are very efficient for joining many rows.
Of course it would probably be more efficient to place an index on weather(humidity_in_percent,id) in this particular case, but I suspect that this is a modified version of your real situation (the sum of the id column?) so specific advice may not be applicable.
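For what it's worth, a sketch of that index (the name is just illustrative); it covers the humidity predicate and includes id, so the relevant weather rows can be found, and potentially read, from the index alone:
-- covering index for the humidity range predicate, with id included
create index ix_weather_humidity_id on weather(humidity_in_percent, id);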
I believe the difference you're seeing between the first query, which uses the index, and the other three, which don't, is in the WHERE clause.
In the first query, your WHERE clause is on w.id, which is indexed.
In the other three, the effective WHERE clause is on w.humidity_in_percent. I tested the following ...
create index wh_idx on weather(humidity_in_percent);
explain analyse select sum(w.id), count(*) from weather as w
where w.humidity_in_percent between 99.99 and 100
and exists(select * from lightnings as l
where l.weather_id=w.id);
and get a much better plan. I tried to post the actual plan returned, but I'm having trouble formatting it for proper display, sorry.
EXPLAIN ANALYZE
SELECT count(*)
FROM "businesses"
WHERE (
source = 'facebook'
OR EXISTS(
SELECT *
FROM provider_business_map pbm
WHERE
pbm.hotstepper_business_id=businesses.id
AND pbm.provider_name='facebook'
)
);
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=233538965.74..233538965.75 rows=1 width=0) (actual time=116169.720..116169.721 rows=1 loops=1)
-> Seq Scan on businesses (cost=0.00..233521096.48 rows=7147706 width=0) (actual time=11.284..116165.646 rows=3693 loops=1)
Filter: (((source)::text = 'facebook'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
SubPlan 1
-> Index Scan using idx_provider_hotstepper_business on provider_business_map pbm (cost=0.00..16.29 rows=1 width=0) (never executed)
Index Cond: (((provider_name)::text = 'facebook'::text) AND (hotstepper_business_id = businesses.id))
SubPlan 2
-> Index Scan using idx_provider_hotstepper_business on provider_business_map pbm (cost=0.00..16.28 rows=1 width=4) (actual time=0.045..5.685 rows=3858 loops=1)
Index Cond: ((provider_name)::text = 'facebook'::text)
Total runtime: 116169.820 ms
(10 rows)
This query takes over a minute, and it's doing a count that results in ~3000 rows. It seems the bottleneck is the sequential scan, but I'm not sure what index I would need to optimize this. It's also worth noting that I haven't tuned Postgres, so if there's any tuning that would help, it may be worth considering. My DB is 15GB, though, and I don't plan on trying to fit all of that in memory anytime soon, so I'm not sure changing RAM-related settings would help a lot.
OR is notorious for lousy performance. Try splitting it into a union of two completely separate queries on the two tables:
SELECT COUNT(*) FROM (
SELECT id
FROM businesses
WHERE source = 'facebook'
UNION -- union makes the ids unique in the result
SELECT hotstepper_business_id
FROM provider_business_map
WHERE provider_name = 'facebook'
AND hotstepper_business_id IS NOT NULL
) x
If hotstepper_business_id cannot be null, you may remove the line
AND hotstepper_business_id IS NOT NULL
If you want the whole business row, you could simply wrap the above query with an IN (...):
SELECT * FROM businesses
WHERE ID IN (
-- above inner query
)
But a much better-performing query would be to modify the above query to use a join:
SELECT *
FROM businesses
WHERE source = 'facebook'
UNION
SELECT b.*
FROM provider_business_map m
JOIN businesses b
ON b.id = m.hotstepper_business_id
WHERE provider_name = 'facebook'
I'd at least try rewriting the dependent subquery as:
SELECT COUNT(DISTINCT b.*)
FROM businesses b
LEFT JOIN provider_business_map pbm
ON b.id=pbm.hotstepper_business_id
WHERE b.source = 'facebook'
OR pbm.provider_name = 'facebook';
Unless I'm mis-reading something, an index on businesses.id exists, but make sure there are also indexes on provider_business_map.hotstepper_business_id, businesses.source and provider_business_map.provider_name for best performance.
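A sketch of what those indexes could look like if any of them are missing (the names are illustrative, and the plan above suggests idx_provider_hotstepper_business already covers provider_name and hotstepper_business_id, so create only what you don't already have):
-- illustrative names; create only the indexes that don't already exist
create index idx_businesses_source on businesses(source);
create index idx_pbm_hotstepper_business_id on provider_business_map(hotstepper_business_id);
create index idx_pbm_provider_name on provider_business_map(provider_name);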
create index index_name on businesses(source);
Since there are 3,693 matching rows out of more than 7 million, it will probably use the index. Do not forget to
analyse businesses;