Let's say we have the following two tables:
purchases
-> id
-> classic_id (indexed TEXT)
-> other columns
purchase_items_2 (a temporary table)
-> id
-> order_id (indexed TEXT)
-> other columns
I want to join the two tables like so:
SELECT pi.id, pi.order_id, p.id
FROM purchase_items_2 pi
INNER JOIN purchases p ON pi.order_id = p.classic_id
Shouldn't this use the indexes? It does not. Any clue why?
This is the EXPLAIN output of the query:
QUERY PLAN
---------------------------------------------------------------------------------
Hash Join (cost=433.80..744.69 rows=5848 width=216)
Hash Cond: ((purchase_items_2.order_id)::text = (purchases.classic_id)::text)
-> Seq Scan on purchase_items_2 (cost=0.00..230.48 rows=5848 width=208)
-> Hash (cost=282.80..282.80 rows=12080 width=16)
-> Seq Scan on purchases (cost=0.00..282.80 rows=12080 width=16)
(5 rows)
But when I run a query with a WHERE condition on the same column:
Select pi.id
from purchase_items_2 pi
where pi.order_id = 'gigel'
it uses the index:
QUERY PLAN
--------------------------------------------------------------------------------------------------
Bitmap Heap Scan on purchase_items_2 (cost=4.51..80.78 rows=29 width=208)
Recheck Cond: ((order_id)::text = 'gigel'::text)
-> Bitmap Index Scan on index_purchase_items_2_on_order_id (cost=0.00..4.50 rows=29 width=0)
Index Cond: ((order_id)::text = 'gigel'::text)
(4 rows)
Since you have no WHERE condition, the query has to read all rows of both tables anyway. And since the hash table built by the hash join fits in work_mem, a hash join (which has to perform a sequential scan on both tables) is the most efficient join strategy.
PostgreSQL doesn't use the indexes because it is faster without them for this specific query.
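If you want to verify that, you can temporarily discourage sequential scans for your session; a minimal sketch, for experimentation only:

-- Planner test sketch: penalize seq scans for this session only.
SET enable_seqscan = off;

EXPLAIN
SELECT pi.id, pi.order_id, p.id
FROM purchase_items_2 pi
INNER JOIN purchases p ON pi.order_id = p.classic_id;
-- The plan should now use the indexes, typically at a higher estimated cost
-- than the hash join, which is exactly why the planner avoided them.

RESET enable_seqscan;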
Related
When joining on a table and then filtering (LIMIT 30 for instance), Postgres will apply a JOIN operation on all rows, even if the columns from those rows are only used in the returned columns, not as filtering predicates.
This would be understandable for an INNER JOIN (PG has to know if the row will be returned or not) or for a LEFT JOIN without a unique constraint (PG has to know if more than one row will be returned or not), but for a LEFT JOIN on a UNIQUE column, this seems wasteful: if the query matches 10k rows, then 10k joins will be performed, and then only 30 will be returned.
It would seem more efficient to "delay", or defer, the join as much as possible, and this is something I've seen happen on some other queries.
Splitting this into a subquery (SELECT * FROM (SELECT * FROM main WHERE x LIMIT 30) LEFT JOIN secondary) works, by ensuring that only 30 items are returned from the main table before joining them, but it feels like I'm missing something, and the "standard" form of the query should also apply the same optimization.
Looking at the EXPLAIN plans, however, I can see that the number of rows joined is always the total number of rows, without "early bailing out" as you could see when, for instance, running a Seq Scan with a LIMIT 5.
Example schema, with a main table and a secondary one: secondary columns will only be returned, never filtered on.
drop table if exists secondary;
drop table if exists main;
create table main(id int primary key not null, main_column int);
create index main_column on main(main_column);
insert into main(id, main_column) SELECT i, i % 3000 from generate_series( 1, 1000000, 1) i;
create table secondary(id serial primary key not null, main_id int references main(id) not null, secondary_column int);
create unique index secondary_main_id on secondary(main_id);
insert into secondary(main_id, secondary_column) SELECT i, (i + 17) % 113 from generate_series( 1, 1000000, 1) i;
analyze main;
analyze secondary;
Example query:
explain analyze verbose select main.id, main_column, secondary_column
from main
left join secondary on main.id = secondary.main_id
where main_column = 5
order by main.id
limit 50;
This is the most "obvious" way of writing the query; it takes around 5 ms on average on my computer.
Explain:
Limit (cost=3742.93..3743.05 rows=50 width=12) (actual time=5.010..5.322 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
-> Sort (cost=3742.93..3743.76 rows=332 width=12) (actual time=5.006..5.094 rows=50 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Nested Loop Left Join (cost=11.42..3731.90 rows=332 width=12) (actual time=0.123..4.446 rows=334 loops=1)
Output: main.id, main.main_column, secondary.secondary_column
Inner Unique: true
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.106..1.021 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.056..0.057 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.12 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=334)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = main.id)
Planning Time: 0.761 ms
Execution Time: 5.423 ms
explain analyze verbose select m.id, main_column, secondary_column
from (
select main.id, main_column
from main
where main_column = 5
order by main.id
limit 50
) m
left join secondary on m.id = secondary.main_id
where main_column = 5
order by m.id
limit 50
This returns the same results, in 2ms.
The total EXPLAIN cost is also three times lower, in line with the performance gain we're seeing.
Limit (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.219..2.027 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
-> Nested Loop Left Join (cost=1048.44..1057.21 rows=1 width=12) (actual time=1.216..1.900 rows=50 loops=1)
Output: m.id, m.main_column, secondary.secondary_column
Inner Unique: true
-> Subquery Scan on m (cost=1048.02..1048.77 rows=1 width=8) (actual time=1.201..1.515 rows=50 loops=1)
Output: m.id, m.main_column
Filter: (m.main_column = 5)
-> Limit (cost=1048.02..1048.14 rows=50 width=8) (actual time=1.196..1.384 rows=50 loops=1)
Output: main.id, main.main_column
-> Sort (cost=1048.02..1048.85 rows=332 width=8) (actual time=1.194..1.260 rows=50 loops=1)
Output: main.id, main.main_column
Sort Key: main.id
Sort Method: top-N heapsort Memory: 27kB
-> Bitmap Heap Scan on public.main (cost=11.00..1036.99 rows=332 width=8) (actual time=0.054..0.753 rows=334 loops=1)
Output: main.id, main.main_column
Recheck Cond: (main.main_column = 5)
Heap Blocks: exact=334
-> Bitmap Index Scan on main_column (cost=0.00..10.92 rows=332 width=0) (actual time=0.029..0.030 rows=334 loops=1)
Index Cond: (main.main_column = 5)
-> Index Scan using secondary_main_id on public.secondary (cost=0.42..8.44 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=50)
Output: secondary.id, secondary.main_id, secondary.secondary_column
Index Cond: (secondary.main_id = m.id)
Planning Time: 0.161 ms
Execution Time: 2.115 ms
This is a toy dataset here, but on a real DB, the IO difference is significant (no need to fetch 1000 rows when 30 are enough), and the timing difference also quickly adds up (up to an order of magnitude slower).
So my question: is there any way to get the planner to understand that the JOIN can be applied much later in the process?
It seems like something that could be applied automatically to gain a sizeable performance boost.
Deferred joins are good. It's usually helpful to run the LIMIT operation on a subquery that yields only the id values, so the ORDER BY ... LIMIT has to sort less data only to discard most of it.
select main.id, main.main_column, secondary.secondary_column
from main
join (
select id
from main
where main_column = 5
order by id
limit 50
) selection on main.id = selection.id
left join secondary on main.id = secondary.main_id
order by main.id
limit 50
It's also possible adding id to your main_column index will help. With a BTREE index the query planner knows it can get the id values in ascending order from the index, so it may be able to skip the sort step entirely and just scan the first 50 values.
create index main_column on main(main_column, id);
Edit: In a large table, the heavy lifting of your query will be the selection of the 50 main.id values to process. To get those 50 id values as cheaply as possible, you can use a scan of the covering index I proposed together with the subquery I proposed. Once you have your 50 id values, looking up 50 rows' worth of details from your various tables by main.id and secondary.main_id is trivial: you have the correct indexes in place, and it's a limited number of rows, so it won't take much time.
It looks like your table sizes are too small for various optimizations to have much effect, though. Query plans change a lot when tables are larger.
Alternative query, using row_number() instead of LIMIT (I think you could even omit LIMIT here):
-- prepare q3 AS
select m.id, main_column, secondary_column
from (
select id, main_column
, row_number() OVER (ORDER BY id, main_column) AS rn
from main
where main_column = 5
) m
left join secondary on m.id = secondary.main_id
WHERE m.rn <= 50
ORDER BY m.id
LIMIT 50
;
Putting the subsetting into a CTE can prevent it from being merged into the main query:
PREPARE q6 AS
WITH
-- MATERIALIZED -- keyword only needed from version 12 on; before 12, CTEs are always materialized
xxx AS (
SELECT DISTINCT x.id
FROM main x
WHERE x.main_column = 5
ORDER BY x.id
LIMIT 50
)
select m.id, m.main_column, s.secondary_column
from main m
left join secondary s on m.id = s.main_id
WHERE EXISTS (
SELECT *
FROM xxx x WHERE x.id = m.id
)
order by m.id
-- limit 50
;
I have two tables in a Postgres 11 database:
client table
--------
client_id integer
client_name character varying
file table
--------
file_id integer
client_id integer
file_name character varying
The client table is not partitioned, the file table is partitioned by client_id (partition by list). When a new client is inserted into the client table, a trigger creates a new partition for the file table.
The file table has a foreign key constraint referencing the client table on client_id.
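A minimal sketch of that schema (the partition bounds and the trigger-created partition names are assumptions, chosen to match the plans below):

CREATE TABLE client (
    client_id   integer PRIMARY KEY,
    client_name character varying
);

CREATE TABLE file (
    file_id   integer,
    client_id integer NOT NULL REFERENCES client (client_id),
    file_name character varying
) PARTITION BY LIST (client_id);

-- One partition per client, normally created by the trigger mentioned above:
CREATE TABLE file_p1 PARTITION OF file FOR VALUES IN (1);
CREATE TABLE file_p4 PARTITION OF file FOR VALUES IN (4);
CREATE TABLE file_pdefault PARTITION OF file DEFAULT;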
When I execute this SQL (where c.client_id = 1), everything seems fine:
explain
select *
from client c
join file f using (client_id)
where c.client_id = 1;
Partition pruning is used, only the partition file_p1 is scanned:
Nested Loop (cost=0.00..3685.05 rows=100001 width=82)
-> Seq Scan on client c (cost=0.00..1.02 rows=1 width=29)
Filter: (client_id = 1)
-> Append (cost=0.00..2684.02 rows=100001 width=57)
-> Seq Scan on file_p1 f (cost=0.00..2184.01 rows=100001 width=57)
Filter: (client_id = 1)
But when I use a WHERE clause like "where c.client_name = 'test'", the database scans all partitions and does not recognize that client_name 'test' corresponds to client_id 1:
explain
select *
from client c
join file f using (client_id)
where c.client_name = 'test';
Execution plan:
Hash Join (cost=1.04..6507.57 rows=100001 width=82)
Hash Cond: (f.client_id = c.client_id)
-> Append (cost=0.00..4869.02 rows=200002 width=57)
-> Seq Scan on file_p1 f (cost=0.00..1934.01 rows=100001 width=57)
-> Seq Scan on file_p4 f_1 (cost=0.00..1934.00 rows=100000 width=57)
-> Seq Scan on file_pdefault f_2 (cost=0.00..1.00 rows=1 width=556)
-> Hash (cost=1.02..1.02 rows=1 width=29)
-> Seq Scan on client c (cost=0.00..1.02 rows=1 width=29)
Filter: ((name)::text = 'test'::text)
So for this SQL, all partitions of the file table are scanned.
Does every SELECT have to filter on the column the table is partitioned by? Is the database not able to derive the partition pruning criteria?
Edit:
To add some information:
In the past, I have been working with Oracle databases most of the time.
The execution plan there would be something like
Do a full table scan on the client table with the client name to find the client_id.
Do a "PARTITION LIST" access on the file table, where SQL Developer states PARTITION_START = KEY and PARTITION_STOP = KEY to indicate that the exact partition is not known when the execution plan is calculated, but that access will be limited to a list of partitions derived from the client_id found in the client table.
This is what I would have expected in Postgresql as well.
The documentation states that dynamic partition pruning is possible
(...) During actual execution of the query plan. Partition pruning may also be performed here to remove partitions using values which are only known during actual query execution. This includes values from subqueries and values from execution-time parameters such as those from parameterized nested loop joins.
If I understand it correctly, it applies to prepared statements or to queries with subqueries that provide the partition key value as a parameter. Use EXPLAIN ANALYZE to see dynamic pruning in action (my sample data contains a million rows in three partitions):
explain analyze
select *
from file
where client_id = (
select client_id
from client
where client_name = 'test');
Append (cost=25.88..22931.88 rows=1000000 width=14) (actual time=0.091..96.139 rows=333333 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on client (cost=0.00..25.88 rows=6 width=4) (actual time=0.040..0.042 rows=1 loops=1)
Filter: (client_name = 'test'::text)
Rows Removed by Filter: 2
-> Seq Scan on file_p1 (cost=0.00..5968.66 rows=333333 width=14) (actual time=0.039..70.026 rows=333333 loops=1)
Filter: (client_id = $0)
-> Seq Scan on file_p2 (cost=0.00..5968.68 rows=333334 width=14) (never executed)
Filter: (client_id = $0)
-> Seq Scan on file_p3 (cost=0.00..5968.66 rows=333333 width=14) (never executed)
Filter: (client_id = $0)
Planning Time: 0.423 ms
Execution Time: 109.189 ms
Note that scans for partitions p2 and p3 were never executed.
Answering your exact question: the partition pruning for joins as described in the question is not implemented in Postgres (yet?).
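A common workaround, sketched under the schema above, is to resolve the partition key in a separate query (or in application code) so that plan-time pruning can apply:

-- Step 1: resolve the partition key from the client name.
SELECT client_id FROM client WHERE client_name = 'test';  -- say this returns 1

-- Step 2: query with the literal key; the planner prunes to file_p1 alone.
SELECT * FROM file WHERE client_id = 1;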
I am trying to improve the performance of a SQL query on a Postgres 9.4 database. I managed to re-write the query so that it would use indexes and it is superfast now! But I do not quite understand why.
This is the original query:
SELECT DISTINCT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.optimized_localized_day sldt ON sldt.day_id = dt.id
INNER JOIN public.day_template_place dtp ON dtp.day_template_id = dt.id
INNER JOIN public.optimized_place op ON op.geoname_id = dtp.geoname_id
WHERE
op.alternate_localized_names ILIKE unaccent('%query%') OR
lower(sldt.unaccent_title) LIKE unaccent(lower('%query%')) OR
lower(sldt.unaccent_description) LIKE unaccent(lower('%query%'))
ORDER BY dt.updated_at DESC
LIMIT 100;
I have placed 3 trigram indexes using pg_trgm on op.alternate_localized_names, lower(sldt.unaccent_title) and lower(sldt.unaccent_description).
But Postgres is not using them; instead it performs a Seq Scan on the full tables to join them, as shown by EXPLAIN:
Limit
-> Unique
-> Sort
Sort Key: dt.updated_at, dt.id
-> Hash Join
Hash Cond: (sldt.day_id = dt.id)
Join Filter: ((op.alternate_localized_names ~~* unaccent('%query%'::text)) OR (lower(sldt.unaccent_title) ~~ unaccent('%query%'::text)) OR (lower(sldt.unaccent_description) ~~ unaccent('%query%'::text)))
-> Seq Scan on optimized_localized_day sldt
-> Hash
-> Hash Join
Hash Cond: (dtp.geoname_id = op.geoname_id)
-> Hash Join
Hash Cond: (dtp.day_template_id = dt.id)
-> Seq Scan on day_template_place dtp
-> Hash
-> Seq Scan on day dt
-> Hash
-> Seq Scan on optimized_place op
However, when I split the query in 2, one to search on public.optimized_localized_day and one on public.optimized_place, it now uses their indexes:
SELECT DISTINCT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.day_template_place dtp ON dtp.day_template_id = dt.id
INNER JOIN public.optimized_place op ON op.geoname_id = dtp.geoname_id
WHERE op.alternate_localized_names ILIKE unaccent('%query%')
UNION
SELECT DISTINCT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.optimized_localized_day sldt ON sldt.day_id = dt.id
WHERE lower(sldt.unaccent_title) LIKE unaccent(lower('%query%'))
OR lower(sldt.unaccent_description) LIKE unaccent(lower('%query%'));
And the EXPLAIN:
HashAggregate
-> Append
-> HashAggregate
-> Nested Loop
-> Nested Loop
-> Bitmap Heap Scan on optimized_place op
Recheck Cond: (alternate_localized_names ~~* unaccent('%query%'::text))
-> Bitmap Index Scan on idx_trgm_place_lower
Index Cond: (alternate_localized_names ~~* unaccent('%query%'::text))
-> Bitmap Heap Scan on day_template_place dtp
Recheck Cond: (geoname_id = op.geoname_id)
-> Bitmap Index Scan on day_template_place_geoname_idx
Index Cond: (geoname_id = op.geoname_id)
-> Index Scan using day_pkey on day dt
Index Cond: (id = dtp.day_template_id)
-> HashAggregate
-> Nested Loop
-> Bitmap Heap Scan on optimized_localized_day sldt
Recheck Cond: ((lower(unaccent_title) ~~ unaccent('%query%'::text)) OR (lower(unaccent_description) ~~ unaccent('%query%'::text)))
-> BitmapOr
-> Bitmap Index Scan on tgrm_idx_localized_day_title
Index Cond: (lower(unaccent_title) ~~ unaccent('%query%'::text))
-> Bitmap Index Scan on tgrm_idx_localized_day_description
Index Cond: (lower(unaccent_description) ~~ unaccent('%query%'::text))
-> Index Scan using day_pkey on day dt_1
Index Cond: (id = sldt.day_id)
From what I understand, having conditions on two separate tables in an OR clause forces Postgres to join the tables first and then filter them, but I am not sure about this. The second thing that puzzles me: I would like to understand how Postgres manages the filtering in the second query.
Do you know how Postgres handles those two cases?
Thanks :)
The transformation of the original query to the UNION cannot be made automatically.
Consider a simplified case:
SELECT x.a, y.b
FROM x JOIN y USING (c)
WHERE x.a = 0 OR y.b = 0;
Imagine it has three result rows:
a | b
---+---
0 | 0
1 | 0
1 | 0
If you replace this with
SELECT x.a, y.b
FROM x JOIN y USING (c)
WHERE x.a = 0
UNION
SELECT x.a, y.b
FROM x JOIN y USING (c)
WHERE y.b = 0;
the result will only have two rows, because UNION removes duplicates.
If you use UNION ALL instead, the result will have four rows, because the row with the two zeros will appear twice, once from each branch of the query.
So this transformation cannot always be made safely.
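To make this concrete, here is a runnable toy reproduction; the table contents are assumptions chosen to produce exactly the three result rows above:

CREATE TABLE x (a int, c int);
CREATE TABLE y (b int, c int);
INSERT INTO x VALUES (0, 1), (1, 2), (1, 3);
INSERT INTO y VALUES (0, 1), (0, 2), (0, 3);

SELECT x.a, y.b FROM x JOIN y USING (c)
WHERE x.a = 0 OR y.b = 0;
-- 3 rows: (0,0), (1,0), (1,0)

SELECT x.a, y.b FROM x JOIN y USING (c) WHERE x.a = 0
UNION
SELECT x.a, y.b FROM x JOIN y USING (c) WHERE y.b = 0;
-- 2 rows: UNION deduplicates both within and across the branches

-- UNION ALL would return 4 rows: the (0,0) row appears once per branch.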
In your case, you can get away with it, because you remove duplicates anyway.
By the way: if you use UNION, you don't need the DISTINCT any more, because duplicates will be removed anyway. Your query will become cheaper if you remove the DISTINCTs.
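A sketch of that simplification, using the same tables and predicates as the second query above:

SELECT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.day_template_place dtp ON dtp.day_template_id = dt.id
INNER JOIN public.optimized_place op ON op.geoname_id = dtp.geoname_id
WHERE op.alternate_localized_names ILIKE unaccent('%query%')
UNION
SELECT dt.id, dt.updated_at
FROM public.day dt
INNER JOIN public.optimized_localized_day sldt ON sldt.day_id = dt.id
WHERE lower(sldt.unaccent_title) LIKE unaccent(lower('%query%'))
   OR lower(sldt.unaccent_description) LIKE unaccent(lower('%query%'));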
In the second branch of your second query, PostgreSQL can handle the OR with index scans because the conditions are on the same table. In that case, PostgreSQL can perform a bitmap index scan:
The index is scanned, and PostgreSQL builds a bitmap in memory that contains 1 for each table row where the index scan results in a match and 0 otherwise.
This bitmap is ordered in the physical order of the table rows.
The same thing happens for the other condition with the other index.
The two bitmaps are then combined with a bitwise OR operation.
The resulting bitmap is used to fetch the matching rows from the table.
A trigram index is only a filter that can have false positive results, so the original condition has to be re-checked during that table scan.
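For reference, trigram indexes like the ones named in the plans would look something like this; the GIN access method is an assumption (pg_trgm also supports GiST):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_trgm_place_lower
    ON public.optimized_place USING gin (alternate_localized_names gin_trgm_ops);
CREATE INDEX tgrm_idx_localized_day_title
    ON public.optimized_localized_day USING gin (lower(unaccent_title) gin_trgm_ops);
CREATE INDEX tgrm_idx_localized_day_description
    ON public.optimized_localized_day USING gin (lower(unaccent_description) gin_trgm_ops);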
I have an accounts table, units table and reports table. An account has many units (foreign key of units is account_id), and a unit has many reports (foreign key of reports is unit_id). I want to select account name, total number of units for that account, and the last report time:
SELECT accounts.name AS account_name,
COUNT(units.id) AS unit_count,
(SELECT reports.time FROM reports INNER JOIN units ON units.id = reports.unit_id ORDER BY time desc LIMIT 1) AS last_reported_time
FROM accounts
INNER JOIN units ON accounts.id = units.account_id
INNER JOIN reports ON units.id = reports.unit_id
GROUP BY account_name, last_reported_time
ORDER BY unit_count desc;
This query has been running forever, and I am not sure it's doing what I expect.
An account has many units and a unit has many reports. I want to display the time of the newest report from all the units associated for each given account. Is this query correct? If not, how can I accomplish my task (if possible without using a scripting language)?
The result of EXPLAIN:
Sort (cost=21466114.58..21466547.03 rows=172980 width=38)
Sort Key: (count(public.units.id))
InitPlan 1 (returns $0)
-> Limit (cost=0.00..12.02 rows=1 width=8)
-> Nested Loop (cost=0.00..928988485.04 rows=77309416 width=8)
-> Index Scan Backward using index_reports_on_time on reports (cost=0.00..296291138.34 rows=77309416 width=12)
-> Index Scan using units_pkey on units (cost=0.00..8.17 rows=1 width=4)
Index Cond: (public.units.id = public.reports.unit_id)
-> GroupAggregate (cost=20807359.99..21446321.09 rows=172980 width=38)
-> Sort (cost=20807359.99..20966559.70 rows=63679885 width=38)
Sort Key: accounts.name, public.units.last_reported_time
-> Hash Join (cost=975.50..3846816.82 rows=63679885 width=38)
Hash Cond: (public.reports.unit_id = public.units.id)
-> Seq Scan on reports (cost=0.00..2919132.16 rows=77309416 width=4)
-> Hash (cost=961.43..961.43 rows=1126 width=38)
-> Hash Join (cost=16.37..961.43 rows=1126 width=38)
Hash Cond: (public.units.account_id = accounts.id)
-> Seq Scan on units (cost=0.00..928.67 rows=1367 width=28)
-> Hash (cost=11.72..11.72 rows=372 width=18)
-> Seq Scan on accounts (cost=0.00..11.72 rows=372 width=18)
About 95% of the cost of the query is here:
-> Sort (cost=20807359.99..20966559.70 rows=63679885 width=38)
Sort Key: accounts.name, public.units.last_reported_time
-> Hash Join (cost=975.50..3846816.82 rows=63679885 width=38)
Hash Cond: (public.reports.unit_id = public.units.id)
-> Seq Scan on reports (cost=0.00..2919132.16 rows=77309416 width=4)
Do you have an index on reports.unit_id? If not, you should definitely add one.
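If it is missing, something like this should help (the index name is illustrative):

CREATE INDEX index_reports_on_unit_id ON reports (unit_id);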
Other than that, the output column unit_count appears to give the number of units per account, but calculating it after all the joins and then ordering by it is very wasteful. The sub-query in the select list is something of a mystery to me; I presume you want the most recent reporting time for each account, but as written it gives only the last report time over all units combined. Try this instead:
SELECT a.name AS account_name, u.unit_count, r.last_reported_time
FROM accounts a
JOIN (
SELECT account_id, COUNT(*) AS unit_count
FROM units
GROUP BY 1) u ON u.account_id = a.id
LEFT JOIN ( -- allow for units that have not yet submitted a report
SELECT u.account_id, max(r.time) AS last_reported_time
FROM reports r
JOIN units u ON u.id = r.unit_id
GROUP BY 1) r ON r.account_id = a.id
ORDER BY 2 DESC;
EXPLAIN ANALYZE
SELECT count(*)
FROM "businesses"
WHERE (
source = 'facebook'
OR EXISTS(
SELECT *
FROM provider_business_map pbm
WHERE
pbm.hotstepper_business_id=businesses.id
AND pbm.provider_name='facebook'
)
);
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=233538965.74..233538965.75 rows=1 width=0) (actual time=116169.720..116169.721 rows=1 loops=1)
-> Seq Scan on businesses (cost=0.00..233521096.48 rows=7147706 width=0) (actual time=11.284..116165.646 rows=3693 loops=1)
Filter: (((source)::text = 'facebook'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
SubPlan 1
-> Index Scan using idx_provider_hotstepper_business on provider_business_map pbm (cost=0.00..16.29 rows=1 width=0) (never executed)
Index Cond: (((provider_name)::text = 'facebook'::text) AND (hotstepper_business_id = businesses.id))
SubPlan 2
-> Index Scan using idx_provider_hotstepper_business on provider_business_map pbm (cost=0.00..16.28 rows=1 width=4) (actual time=0.045..5.685 rows=3858 loops=1)
Index Cond: ((provider_name)::text = 'facebook'::text)
Total runtime: 116169.820 ms
(10 rows)
This query takes over a minute to produce a count of only ~3,000. The bottleneck seems to be the sequential scan, but I'm not sure which index would optimize this. It's also worth noting that I haven't tuned Postgres, so any tuning that would help may be worth considering; my DB is 15 GB, though, and I don't plan on fitting all of it in memory anytime soon, so I'm not sure changing RAM-related settings would help much.
OR is notorious for lousy performance. Try splitting it into a union of two completely separate queries on the two tables:
SELECT COUNT(*) FROM (
SELECT id
FROM businesses
WHERE source = 'facebook'
UNION -- union makes the ids unique in the result
SELECT hotstepper_business_id
FROM provider_business_map
WHERE provider_name = 'facebook'
AND hotstepper_business_id IS NOT NULL
) x
If hotstepper_business_id can not be null, you may remove the line
AND hotstepper_business_id IS NOT NULL
If you want the whole business rows, you could simply wrap the above query in an IN (...):
SELECT * FROM businesses
WHERE ID IN (
-- above inner query
)
But a much better performing query would be to modify the above query to use a join:
SELECT *
FROM businesses
WHERE source = 'facebook'
UNION
SELECT b.*
FROM provider_business_map m
JOIN businesses b
ON b.id = m.hotstepper_business_id
WHERE provider_name = 'facebook'
I'd at least try rewriting the dependent subquery as:
SELECT COUNT(DISTINCT b.*)
FROM businesses b
LEFT JOIN provider_business_map pbm
ON b.id=pbm.hotstepper_business_id
WHERE b.source = 'facebook'
OR pbm.provider_name = 'facebook';
Unless I'm misreading something, an index on businesses.id exists, but for best performance make sure there are also indexes on provider_business_map.hotstepper_business_id, businesses.source, and provider_business_map.provider_name.
create index index_name on businesses(source);
Since only 3,693 of the more than 7 million rows match, it will probably use the index. Do not forget to run:
analyse businesses;