I want to improve the performance of these 2 queries:
select object into latitude
from data
where predicate = 'latitude'
and subject = ( select object
from data
where subject = (select object
from data
where subject = 'url1' and predicate = '#isLocatedAt')
and predicate = 'http://schema.org/geo');
select object into Longitude
from data
where predicate = 'longitude'
and subject = ( select object
from data
where subject = (select object
from data
where subject = 'url1' and predicate = '#isLocatedAt')
and predicate = 'http://schema.org/geo');
I don't have an index on data.object because it's a text column and its values are too big.
EXPLAIN ANALYZE (for one query):
QUERY PLAN
Index Scan using subjectetpredicate on data (cost=25.19..36.76 rows=3 width=63) (actual time=34.299..34.303 rows=1 loops=1)
Index Cond: (((subject)::text = $1) AND ((predicate)::text = 'latitude'::text))
InitPlan 2 (returns $1)
-> Index Scan using subjectetpredicate on data data_2 (cost=12.94..24.50 rows=3 width=63) (actual time=31.374..31.379 rows=1 loops=1)
Index Cond: (((subject)::text = $0) AND ((predicate)::text = 'geo'::text))
InitPlan 1 (returns $0)
-> Index Scan using subjectetpredicate on data data_1 (cost=0.69..12.25 rows=3 width=63) (actual time=0.329..0.332 rows=1 loops=1)
Index Cond: (((subject)::text = 'url1'::text) AND ((predicate)::text = '#isLocatedAt'::text))
Planning Time: 1.071 ms
Execution Time: 34.359 ms
Your query plan shows that you're doing 6 index scans sequentially. Rewrite your script so that you have to do only 4, by storing the shared result in a temporary variable:
select object into geolocation
-- ^^^^^^^^^^^^^^^^
from data
where predicate = 'http://schema.org/geo'
and subject = (select object
from data
where predicate = '#isLocatedAt'
and subject = 'url1');
select object into latitude
from data
where predicate = 'latitude'
and subject = geolocation;
-- ^^^^^^^^^^^
select object into longitude
from data
where predicate = 'longitude'
and subject = geolocation;
-- ^^^^^^^^^^^
You can achieve the same in a single query (which doesn't necessarily make it any faster or easier to read) by using a CTE or simply by flipping your subqueries to get a result with multiple columns:
select
(
select object
from data
where predicate = 'longitude'
and subject = geolocation
), (
select object
from data
where predicate = 'latitude'
and subject = geolocation
)
into longitude, latitude
from (
select object as geolocation
from data
where predicate = 'http://schema.org/geo'
and subject = (select object
from data
where predicate = '#isLocatedAt'
and subject = 'url1')
) as temp;
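For completeness, the CTE form mentioned above would look roughly like this (a sketch; the CTE name is illustrative):
with geo as (
    select object as geolocation
    from data
    where predicate = 'http://schema.org/geo'
      and subject = (select object
                     from data
                     where predicate = '#isLocatedAt'
                       and subject = 'url1')
)
select
    (select object from data
     where predicate = 'longitude' and subject = geo.geolocation),
    (select object from data
     where predicate = 'latitude' and subject = geo.geolocation)
into longitude, latitude
from geo;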
Instead of nested subqueries, we can rethink this as a series of self-joins and flip it on its head. I find this much easier to understand, and it's much faster (at least on this tiny dataset).
url1, #isLocatedAt, y
y, http://schema.org/geo, z
z, (latitude, longitude)
To do it in a single query, filter it on where predicate in ('latitude', 'longitude').
select
l2.predicate, l2.object
from data l0
join data l1 on l0.object = l1.subject and l1.predicate = 'http://schema.org/geo'
join data l2 on l1.object = l2.subject and l2.predicate in ('latitude', 'longitude')
where l0.subject = 'url1' and l0.predicate = '#isLocatedAt'
This is a couple of orders of magnitude faster, though it would have to be run against a realistic amount of data to matter.
QUERY PLAN
Nested Loop (cost=0.43..24.51 rows=1 width=548) (actual time=0.040..0.042 rows=2 loops=1)
-> Nested Loop (cost=0.29..16.34 rows=1 width=32) (actual time=0.022..0.023 rows=1 loops=1)
Join Filter: (l0.object = (l1.subject)::text)
-> Index Scan using subjectetpredicate on data l0 (cost=0.14..8.16 rows=1 width=32) (actual time=0.011..0.012 rows=1 loops=1)
Index Cond: (((subject)::text = 'url1'::text) AND ((predicate)::text = '#isLocatedAt'::text))
-> Index Scan using predicate on data l1 (cost=0.14..8.16 rows=1 width=548) (actual time=0.007..0.008 rows=1 loops=1)
Index Cond: ((predicate)::text = 'http://schema.org/geo'::text)
-> Index Scan using subjectetpredicate on data l2 (cost=0.14..8.16 rows=1 width=1064) (actual time=0.017..0.018 rows=2 loops=1)
Index Cond: ((subject)::text = l1.object)
Filter: ((predicate)::text = ANY ('{latitude,longitude}'::text[]))
Planning Time: 0.192 ms
Execution Time: 0.072 ms
This approach is also a step towards generalizing it as a recursive CTE.
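A minimal sketch of that generalization, assuming the same data(subject, predicate, object) layout (the predicate array and hop counter are illustrative):
with recursive path (hop, node) as (
    select 1, 'url1'::text          -- start at the root subject
  union all
    select p.hop + 1, d.object     -- follow one predicate per hop
    from path p
    join data d
      on d.subject = p.node
     and d.predicate = (array['#isLocatedAt', 'http://schema.org/geo'])[p.hop]
    where p.hop <= 2
)
select d.predicate, d.object
from path p
join data d on d.subject = p.node
where p.hop = 3                     -- the node reached after the last hop
  and d.predicate in ('latitude', 'longitude');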
Note that you don't need an index on subject because you have an index on (subject, predicate).
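For reference, that composite index would be defined along these lines (hypothetical DDL; the actual definition isn't shown in the question):
-- a lookup on subject alone can still use this index,
-- because subject is its leading column
create index subjectetpredicate on data (subject, predicate);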
We are working on a platform right now, and I'm having difficulty optimizing one of our queries; it's actually a view.
When I EXPLAIN the query, a full table scan is performed on my user_role table.
The query behind the view looks like what I've posted below (I used SELECT * instead of listing the columns of the different tables, just to showcase my issue). All results are based on that simplified query.
SELECT t.*
FROM task t
LEFT JOIN project p ON t.project_id = p.id
LEFT JOIN company comp ON comp.id = p.company_id
LEFT JOIN company_users cu ON cu.companies_id = comp.id
LEFT JOIN user_table u ON u.email= cu.users_email
LEFT JOIN user_role role ON u.email= role.user_email
WHERE lower(t.type) = 'project'
AND t.project_id IS NOT NULL
AND role.role::text in ('SOME_ROLE_1', 'SOME_ROLE_2')
Basically this query runs OK like this. If I EXPLAIN the query, it uses all my indexes. Nice!
But... as soon as I start adding extra WHERE clauses, the issue arises. I simply add this:
and u.email = 'my#email.com'
and comp.id = 4
and p.id = 3
And all of a sudden, a full table scan is performed on the tables company_users and user_role. Not on the table project.
The full query plan is:
Nested Loop (cost=0.98..22.59 rows=1 width=97) (actual time=0.115..4.448 rows=189 loops=1)
-> Nested Loop (cost=0.84..22.02 rows=1 width=632) (actual time=0.099..3.091 rows=252 loops=1)
-> Nested Loop (cost=0.70..20.10 rows=1 width=613) (actual time=0.082..1.774 rows=252 loops=1)
-> Nested Loop (cost=0.56..19.81 rows=1 width=621) (actual time=0.068..0.919 rows=252 loops=1)
-> Nested Loop (cost=0.43..19.62 rows=1 width=101) (actual time=0.058..0.504 rows=63 loops=1)
-> Index Scan using task_project_id_index on task t (cost=0.28..11.43 rows=1 width=97) (actual time=0.041..0.199 rows=63 loops=1)
Index Cond: (project_id IS NOT NULL)
Filter: (lower((type)::text) = 'project'::text)
-> Index Scan using project_id_uindex on project p (cost=0.15..8.17 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=63)
Index Cond: (id = t.project_id)
-> Index Scan using company_users_companies_id_index on company_users cu (cost=0.14..0.17 rows=1 width=520) (actual time=0.002..0.004 rows=4 loops=63)
Index Cond: (companies_id = p.company_id)
-> Index Only Scan using company_id_index on company comp (cost=0.14..0.29 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=252)
Index Cond: (id = p.company_id)
Heap Fetches: 252
-> Index Only Scan using user_table_email_index on user_table u (cost=0.14..1.81 rows=1 width=19) (actual time=0.004..0.004 rows=1 loops=252)
Index Cond: (email = (cu.users_email)::text)
Heap Fetches: 252
-> Index Scan using user_role_user_email_index on user_role role (cost=0.14..0.56 rows=1 width=516) (actual time=0.004..0.004 rows=1 loops=252)
Index Cond: ((user_email)::text = (u.email)::text)
Filter: ((role)::text = ANY ('{COMPANY_ADMIN,COMPANY_USER}'::text[]))
Rows Removed by Filter: 0
Planning time: 2.581 ms
Execution time: 4.630 ms
The explanation for company_users in particular is:
SEQ_SCAN (Seq Scan) table: company_users; 1 1.44 0.0 Parent Relationship = Inner;
Parallel Aware = false;
Alias = cu;
Plan Width = 520;
Filter = ((companies_id = 4) AND ((users_email)::text = 'my#email.com'::text));
However I already created an index on the company_users table: create index if not exists company_users_users_email_companies_id_index on company_users (users_email, companies_id);.
The same goes for the user_role table.
The explanation is:
SEQ_SCAN (Seq Scan) table: user_role; 1 1.45 0.0 Parent Relationship = Inner;
Parallel Aware = false;
Alias = role;
Plan Width = 516;
Filter = (((role)::text = ANY ('{SOME_ROLE_1,SOME_ROLE_2}'::text[])) AND ((user_email)::text = 'my#email.com'::text));
And thus for user_role I also have an index in place on columns role and user_email: create index if not exists user_role_role_user_email_index on user_role (role, user_email);
I did not expect it to change anything, but I did try adapting the query to include only a single role in the WHERE clause, and I also tried OR conditions instead of an IN, but none of it makes any difference!
The weird thing to me is that without my 3 extra WHERE filters it works perfectly, and as soon as I add them it does not, even though I have indexes created for the fields mentioned in the explanations...
We are using exactly the same approach in a few other queries and they all suffer the same issue on those two fields...
So the big question is, how can I further improve my queries and make it use the index instead of a full table scan?
Create an index and calculate statistics:
CREATE INDEX ON task (lower(type)) WHERE project_id IS NOT NULL;
ANALYZE task;
That should improve the estimate, so that PostgreSQL chooses a better join strategy.
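To check whether the estimate improved, you can re-run EXPLAIN on the filtered scan and compare the estimated row count against the ~63 actual rows from your plan (a sketch):
EXPLAIN
SELECT *
FROM task t
WHERE lower(t.type) = 'project'
  AND t.project_id IS NOT NULL;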
We have a PostgreSQL query with multiple tables and left outer joins, and it is running very slowly.
It completes in 25-40 s, and we want to optimize it to bring the run time down to 1-2 s.
select a.campaignid, b.campaign_name , case when b.message_type_id = 1 then 'Promotional'
when b.message_type_id = 2 then 'Transactional'
else 'Other' end as Campaign_type, c.username , aggregator_type,
e.cli_manager_id as senderID,
b.schedule_time as campaign_schedule_date,
count(a.mobile) as campaign_submitted_count, count(case when a.status = 'DELIVRD' then mobile end) as Delivered,
count(a.mobile) as Total_count,
count(case when a.status = 'FAILED' then mobile end) as failure_count,
count(case when a.status = 'DND_check_failed' then mobile end) as DND_count,
sum(credits_used) as credits_used
from tbl_cdr_test a left outer join tbl_campaign b
on a.campaignid = b.tbl_campaign_id left outer join tbl_users_master c
on b.user_id =c.user_master_id
left outer join tbl_cli_manager e on b.user_id = e.user_id
left outer join tbl_user_channel f on b.user_id =f.user_id
left outer join tbl_user_configurations g on b.user_id = g.user_id
where date(insert_datetime) between '2020-05-23' and '2020-06-23'
and c.username = coalesce(null, c.username)
and g.msg_cat_id = coalesce(null, g.msg_cat_id)
and a.campaignid = coalesce(null, a.campaignid)
and e.cli_manager_id = coalesce(null, e.cli_manager_id)
group by a.campaignid, b.campaign_name , b.message_type_id,c.username , b.schedule_time,
aggregator_type, e.cli_manager_id;
We have created appropriate indexes as well, but it is still taking time.
Moreover, there is an "external merge Disk" sort method in the execution plan. To resolve it I have set work_mem = 50MB, but it still uses a disk sort instead of memory. Please suggest.
Below is the execution plan:
GroupAggregate (cost=4872.01..4872.07 rows=1 width=543) (actual time=20564.239..27415.264 rows=8 loops=1)
Group Key: a.campaignid, b.campaign_name, b.message_type_id, c.username, b.schedule_time, f.aggregator_type, e.cli_manager_id
-> Sort (cost=4872.01..4872.01 rows=1 width=483) (actual time=19627.424..25020.702 rows=3206196 loops=1)
Sort Key: a.campaignid, b.campaign_name, b.message_type_id, c.username, b.schedule_time, f.aggregator_type, e.cli_manager_id
Sort Method: external merge Disk: 281456kB
-> Nested Loop (cost=22.03..4872.00 rows=1 width=483) (actual time=99.704..12086.244 rows=3206196 loops=1)
Join Filter: (b.user_id = g.user_id)
-> Nested Loop Left Join (cost=21.89..4871.79 rows=1 width=495) (actual time=99.688..4518.533 rows=3206196 loops=1)
-> Nested Loop (cost=21.75..4871.54 rows=1 width=77) (actual time=99.664..935.689 rows=356244 loops=1)
-> Nested Loop (cost=21.33..31.57 rows=1 width=65) (actual time=0.295..2.376 rows=588 loops=1)
Join Filter: (b.user_id = c.user_master_id)
-> Merge Join (cost=21.18..30.22 rows=6 width=46) (actual time=0.246..0.663 rows=588 loops=1)
Merge Cond: (e.user_id = b.user_id)
-> Index Scan using "idx_FK_7hc6agd_tbl_cli_ma_1592228110_32" on tbl_cli_manager e (cost=0.42..6281.84 rows=762 width=12) (actual time=0.014..0.035 rows=5 loops=1)
Filter: (cli_manager_id = COALESCE(cli_manager_id))
-> Sort (cost=20.76..21.13 rows=147 width=34) (actual time=0.225..0.333 rows=585 loops=1)
Sort Key: b.user_id
Sort Method: quicksort Memory: 36kB
-> Seq Scan on tbl_campaign b (cost=0.00..15.47 rows=147 width=34) (actual time=0.013..0.154 rows=147 loops=1)
-> Index Scan using ind_user_master_c_user on tbl_users_master c (cost=0.14..0.21 rows=1 width=19) (actual time=0.002..0.002 rows=1 loops=588)
Index Cond: (user_master_id = e.user_id)
Filter: ((username)::text = (COALESCE(username))::text)
-> Append (cost=0.42..4839.94 rows=3 width=20) (actual time=0.546..1.426 rows=606 loops=588)
-> Index Scan using testh11_campaignid_idx on testh11 a (cost=0.42..4253.99 rows=2 width=20) (actual time=0.543..0.543 rows=0 loops=588)
Index Cond: (campaignid = b.tbl_campaign_id)
Filter: ((campaignid = COALESCE(campaignid)) AND (date(insert_datetime) >= '2020-05-23'::date) AND (date(insert_datetime) <= '2020-06-23'::date))
Rows Removed by Filter: 656
-> Index Scan using testh21_campaignid_idx on testh21 a_1 (cost=0.42..585.94 rows=1 width=20) (actual time=0.002..0.796 rows=606 loops=588)
Index Cond: (campaignid = b.tbl_campaign_id)
Filter: ((campaignid = COALESCE(campaignid)) AND (date(insert_datetime) >= '2020-05-23'::date) AND (date(insert_datetime) <= '2020-06-23'::date))
-> Index Scan using idx_user_id_tbl_user_c_1592227657_19 on tbl_user_channel f (cost=0.14..0.24 rows=1 width=422) (actual time=0.002..0.004 rows=9 loops=356244)
Index Cond: (user_id = b.user_id)
-> Index Scan using "idx_FK_6958qvy_tbl_user_c_1592228774_151" on tbl_user_configurations g (cost=0.14..0.20 rows=1 width=8) (actual time=0.002..0.002 rows=1 loops=3206196)
Index Cond: (user_id = e.user_id)
Filter: (msg_cat_id = COALESCE(msg_cat_id))
Planning Time: 6.561 ms
Execution Time: 27477.860 ms
There is a gross underestimate of the result rows for the index scan on testh21. The consequence is that PostgreSQL chooses nested loop joins, which is where your time is spent.
Try the following:
New statistics:
ANALYZE testh21;
If that improves the estimate, make sure that autoanalyze processes the table more often (see the sketch after the last suggestion).
Prevent bad estimates caused by correlation:
CREATE STATISTICS testh21_stat (dependencies)
ON campaignid, insert_datetime FROM testh21;
ANALYZE testh21;
Perhaps there is a correlation between the columns; if so, the extended statistics will improve the estimate.
More detailed statistics: try raising default_statistics_target before you ANALYZE the table (also sketched below).
If you cannot improve the estimates, take the hammer and set enable_nestloop = off for the duration of the query.
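Hedged sketches of the autoanalyze, statistics-target, and enable_nestloop suggestions (all values are illustrative, not tuned recommendations):
-- Make autoanalyze process testh21 more often:
ALTER TABLE testh21 SET (autovacuum_analyze_scale_factor = 0.02);
-- Gather more detailed statistics before re-analyzing:
SET default_statistics_target = 1000;
ANALYZE testh21;
-- Last resort: disable nested loops for this query only.
BEGIN;
SET LOCAL enable_nestloop = off;
-- ... run the slow query here ...
COMMIT;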
I have the following SQL query
EXPLAIN ANALYZE
SELECT
full_address,
street_address,
street.street,
(
select
city
from
city
where
city.id = property.city_id
)
AS city,
(
select
state_code
from
state
where
id = property.state_id
)
AS state_code,
(
select
zipcode
from
zipcode
where
zipcode.id = property.zipcode_id
)
AS zipcode
FROM
property
INNER JOIN
street
ON street.id = property.street_id
WHERE
street.street = 'W San Miguel Ave'
AND property.zipcode_id =
(
SELECT
id
FROM
zipcode
WHERE
zipcode = '85340'
)
Below are the EXPLAIN ANALYZE results:
Gather (cost=1008.86..226541.68 rows=1 width=161) (actual time=59.311..21956.143 rows=184 loops=1)
Workers Planned: 2
Params Evaluated: $3
Workers Launched: 2
InitPlan 4 (returns $3)
-> Index Scan using zipcode_zipcode_county_id_state_id_index on zipcode zipcode_1 (cost=0.28..8.30 rows=1 width=16) (actual time=0.039..0.040 rows=1 loops=1)
Index Cond: (zipcode = '85340'::citext)
-> Nested Loop (cost=0.56..225508.35 rows=1 width=113) (actual time=7430.172..14723.451 rows=61 loops=3)
-> Parallel Seq Scan on street (cost=0.00..13681.63 rows=1 width=28) (actual time=108.023..108.053 rows=1 loops=3)
Filter: (street = 'W San Miguel Ave'::citext)
Rows Removed by Filter: 99131
-> Index Scan using property_street_address_street_id_city_id_state_id_zipcode_id_c on property (cost=0.56..211826.71 rows=1 width=117) (actual time=10983.195..21923.063 rows=92 loops=2)
Index Cond: ((street_id = street.id) AND (zipcode_id = $3))
SubPlan 1
-> Index Scan using city_id_pk on city (cost=0.28..8.30 rows=1 width=9) (actual time=0.003..0.003 rows=1 loops=184)
Index Cond: (id = property.city_id)
SubPlan 2
-> Index Scan using state_id_pk on state (cost=0.27..8.34 rows=1 width=3) (actual time=0.002..0.002 rows=1 loops=184)
Index Cond: (id = property.state_id)
SubPlan 3
-> Index Scan using zipcode_id_pk on zipcode (cost=0.28..8.30 rows=1 width=6) (actual time=0.002..0.003 rows=1 loops=184)
Index Cond: (id = property.zipcode_id)
Planning Time: 1.228 ms
Execution Time: 21956.246 ms
Is it possible to speed up this query by adding more indexes?
The query can be rewritten using joins rather than subselects. This may be faster and easier to index.
SELECT
full_address,
street_address,
street.street,
city.city as city,
state.state_code as state_code,
    zipcode.zipcode as zipcode
FROM
property
INNER JOIN street ON street.id = property.street_id
INNER JOIN city ON city.id = property.city_id
INNER JOIN state ON state.id = property.state_id
INNER JOIN zipcode ON zipcode.id = property.zipcode_id
WHERE
street.street = 'W San Miguel Ave'
AND zipcode.zipcode = '85340'
Assuming all the foreign keys (property.street_id, property.city_id, etc.) are indexed, this now becomes a search on street.street and zipcode.zipcode. As long as those are indexed too, the query should take milliseconds.
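If any of these indexes are missing, they could be created along the following lines (a sketch; index names are illustrative):
CREATE INDEX IF NOT EXISTS property_street_id_idx  ON property (street_id);
CREATE INDEX IF NOT EXISTS property_city_id_idx    ON property (city_id);
CREATE INDEX IF NOT EXISTS property_state_id_idx   ON property (state_id);
CREATE INDEX IF NOT EXISTS property_zipcode_id_idx ON property (zipcode_id);
CREATE INDEX IF NOT EXISTS street_street_idx       ON street (street);
CREATE INDEX IF NOT EXISTS zipcode_zipcode_idx     ON zipcode (zipcode);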
I have a query where Postgres performs a hash join with sequential scans instead of a nested loop with index scans when I use an OR condition. This causes the query to take 2 seconds instead of completing in < 100 ms. I have run VACUUM ANALYZE and rebuilt the index on the PATIENTCHARTNOTE table (which is about 5GB), but it still uses a hash join. Do you have any suggestions on how I can improve this?
explain analyze
SELECT Count (_pcn.id) AS total_open_note
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND ( _pt.assigned_to_user_id = '136964'
OR _pcn.createdby_id = '136964'
);
Aggregate (cost=237655.59..237655.60 rows=1 width=8) (actual time=1602.069..1602.069 rows=1 loops=1)
-> Hash Join (cost=83095.43..237645.30 rows=4117 width=4) (actual time=944.850..1602.014 rows=241 loops=1)
Hash Cond: (_appt.patient_id = _pt.id)
Join Filter: ((_pt.assigned_to_user_id = 136964) OR (_pcn.createdby_id = 136964))
Rows Removed by Join Filter: 94036
-> Hash Join (cost=46650.68..182243.64 rows=556034 width=12) (actual time=415.862..1163.812 rows=94457 loops=1)
Hash Cond: (_pcn.appointment_id = _appt.id)
-> Seq Scan on patientchartnote _pcn (cost=0.00..112794.20 rows=1073978 width=12) (actual time=0.016..423.262 rows=1073618 loops=1)
Filter: (active AND (title IS NOT NULL) AND ((title)::text <> ''::text))
Rows Removed by Filter: 22488
-> Hash (cost=35223.61..35223.61 rows=696486 width=8) (actual time=414.749..414.749 rows=692839 loops=1)
Buckets: 131072 Batches: 16 Memory Usage: 2732kB
-> Seq Scan on appointment _appt (cost=0.00..35223.61 rows=696486 width=8) (actual time=0.010..271.208 rows=692839 loops=1)
Filter: (datecomplete IS NULL)
Rows Removed by Filter: 652426
-> Hash (cost=24698.57..24698.57 rows=675694 width=12) (actual time=351.566..351.566 rows=674929 loops=1)
Buckets: 131072 Batches: 16 Memory Usage: 2737kB
-> Seq Scan on patient _pt (cost=0.00..24698.57 rows=675694 width=12) (actual time=0.013..197.268 rows=674929 loops=1)
Filter: active
Rows Removed by Filter: 17426
Planning time: 1.533 ms
Execution time: 1602.715 ms
When I replace "OR _pcn.createdby_id = '136964'" with "AND _pcn.createdby_id = '136964'", Postgres performs an index scan:
Aggregate (cost=29167.56..29167.57 rows=1 width=8) (actual time=937.743..937.743 rows=1 loops=1)
-> Nested Loop (cost=1.28..29167.55 rows=7 width=4) (actual time=19.136..937.669 rows=37 loops=1)
-> Nested Loop (cost=0.85..27393.03 rows=1654 width=4) (actual time=2.154..910.250 rows=1649 loops=1)
-> Index Scan using patient_activeassigned_idx on patient _pt (cost=0.42..3075.00 rows=1644 width=8) (actual time=1.599..11.820 rows=1627 loops=1)
Index Cond: ((active = true) AND (assigned_to_user_id = 136964))
Filter: active
-> Index Scan using appointment_datepatient_idx on appointment _appt (cost=0.43..14.75 rows=4 width=8) (actual time=0.543..0.550 rows=1 loops=1627)
Index Cond: ((patient_id = _pt.id) AND (datecomplete IS NULL))
-> Index Scan using patientchartnote_activeappointment_idx on patientchartnote _pcn (cost=0.43..1.06 rows=1 width=8) (actual time=0.014..0.014 rows=0 loops=1649)
Index Cond: ((active = true) AND (createdby_id = 136964) AND (appointment_id = _appt.id) AND (title IS NOT NULL))
Filter: (active AND ((title)::text <> ''::text))
Planning time: 1.489 ms
Execution time: 937.910 ms
(13 rows)
Using OR in SQL queries usually results in bad performance.
That is because, unlike AND, it does not restrict but extends the number of rows in the query result. With AND, you can use an index scan for one part of the condition and further restrict the result set with a filter on the second condition. That is not possible with OR.
So PostgreSQL does the only thing left: it computes the whole join and then filters out all rows that do not match the condition. Of course that is very inefficient when you are joining three tables (I didn't count the outer join).
Assuming that all columns called id are primary keys, you could rewrite the query as follows:
SELECT count(*) FROM
(SELECT _pcn.id
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND _pt.assigned_to_user_id = '136964'
UNION
SELECT _pcn.id
FROM patientchartnote _pcn
INNER JOIN appointment _appt
ON _appt.id = _pcn.appointment_id
INNER JOIN patient _pt
ON _pt.id = _appt.patient_id
LEFT OUTER JOIN person _ps
ON _ps.id = _pt.appuser_id
WHERE _pcn.active = true
AND _pt.active = true
AND _appt.datecomplete IS NULL
AND _pcn.title IS NOT NULL
AND _pcn.title <> ''
AND _pcn.createdby_id = '136964'
) q;
Even though this is running the query twice, indexes can be used to filter out all but a few rows early on, so this query should perform better.
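If matching indexes don't already exist (your second plan suggests they largely do), partial indexes aligned with each branch's filter would look roughly like this (a sketch; index names and column choices are illustrative):
CREATE INDEX IF NOT EXISTS patient_assigned_partial_idx
    ON patient (assigned_to_user_id) WHERE active;
CREATE INDEX IF NOT EXISTS patientchartnote_createdby_partial_idx
    ON patientchartnote (createdby_id, appointment_id) WHERE active;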
I'm looking for ways to abstract database access to PostgreSQL. In my examples I will use a hypothetical Twitter clone in Node.js, but in the end it's a question about how Postgres handles prepared statements, so the language and library don't really matter:
Suppose I want to be able to access a list of all tweets from a user by username:
name: "tweets by username"
text: "SELECT (SELECT * FROM tweets WHERE tweets.user_id = users.user_id) FROM users WHERE users.username = $1"
values: [username]
That works fine, but it seems inefficient, in both practical and code-quality terms, to have to write another function to handle getting tweets by email rather than by username:
name: "tweets by email"
text: "SELECT (SELECT * FROM tweets WHERE tweets.user_id = users.user_id) FROM users WHERE users.email = $1"
values: [email]
Is it possible to include a field as a parameter to the prepared statement?
name: "tweets by user"
text: "SELECT (SELECT * FROM tweets WHERE tweets.user_id = users.user_id) FROM users WHERE users.$1 = $2"
values: [field, value]
While it's true that this might be a bit less efficient in the corner case of accessing tweets by user_id, that's a trade-off I'm willing to make to improve code quality, and hopefully to improve overall efficiency by reducing the number of query templates from 3+ to 1.
@Clodoaldo's answer is correct in that it provides the capability you desire and should return the right results. Unfortunately, it produces rather slow execution.
I set up an experimental database with tweets and users, populated 10K users with 100 tweets each (1M tweet records), and indexed the PKs u.id and t.id, the FK t.user_id, and the predicate fields u.name and u.email.
create table t(id serial PRIMARY KEY, data integer, user_id bigint);
create index t1 on t(user_id);
create table u(id serial PRIMARY KEY, name text, email text);
create index u1 on u(name);
create index u2 on u(email);
insert into u(name,email) select i::text, i::text from generate_series(1,10000) i;
insert into t(data,user_id) select i, (i/100)::bigint from generate_series(1,1000000) i;
analyze t;
analyze u;
A simple query using one field as predicate is very fast:
prepare qn as select t.* from t join u on t.user_id = u.id where u.name = $1;
explain analyze execute qn('1111');
Nested Loop (cost=0.00..19.81 rows=1 width=16) (actual time=0.030..0.057 rows=100 loops=1)
-> Index Scan using u1 on u (cost=0.00..8.46 rows=1 width=4) (actual time=0.020..0.020 rows=1 loops=1)
Index Cond: (name = $1)
-> Index Scan using t1 on t (cost=0.00..10.10 rows=100 width=16) (actual time=0.007..0.023 rows=100 loops=1)
Index Cond: (t.user_id = u.id)
Total runtime: 0.093 ms
A query using CASE in the WHERE clause, as @Clodoaldo proposed, takes almost 30 seconds:
prepare qen as select t.* from t join u on t.user_id = u.id
where case $2 when 'e' then u.email = $1 when 'n' then u.name = $1 end;
explain analyze execute qen('1111','n');
Merge Join (cost=25.61..38402.69 rows=500000 width=16) (actual time=27.771..26345.439 rows=100 loops=1)
Merge Cond: (t.user_id = u.id)
-> Index Scan using t1 on t (cost=0.00..30457.35 rows=1000000 width=16) (actual time=0.023..17.741 rows=111200 loops=1)
-> Index Scan using u_pkey on u (cost=0.00..42257.36 rows=500000 width=4) (actual time=0.325..26317.384 rows=1 loops=1)
Filter: CASE $2 WHEN 'e'::text THEN (u.email = $1) WHEN 'n'::text THEN (u.name = $1) ELSE NULL::boolean END
Total runtime: 26345.535 ms
Observing that plan, I thought that using a union subselect, then filtering its results to get the id appropriate to the parameterized predicate choice, would allow the planner to use specific indexes for each predicate. It turns out I was right:
prepare qen2 as
select t.*
from t
join (
SELECT id from
(
SELECT 'n' as fld, id from u where u.name = $1
UNION ALL
SELECT 'e' as fld, id from u where u.email = $1
) poly
where poly.fld = $2
) uu
on t.user_id = uu.id;
explain analyze execute qen2('1111','n');
Nested Loop (cost=0.00..28.31 rows=100 width=16) (actual time=0.058..0.120 rows=100 loops=1)
-> Subquery Scan poly (cost=0.00..16.96 rows=1 width=4) (actual time=0.041..0.073 rows=1 loops=1)
Filter: (poly.fld = $2)
-> Append (cost=0.00..16.94 rows=2 width=4) (actual time=0.038..0.070 rows=2 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..8.47 rows=1 width=4) (actual time=0.038..0.038 rows=1 loops=1)
-> Index Scan using u1 on u (cost=0.00..8.46 rows=1 width=4) (actual time=0.038..0.038 rows=1 loops=1)
Index Cond: (name = $1)
-> Subquery Scan "*SELECT* 2" (cost=0.00..8.47 rows=1 width=4) (actual time=0.031..0.032 rows=1 loops=1)
-> Index Scan using u2 on u (cost=0.00..8.46 rows=1 width=4) (actual time=0.030..0.031 rows=1 loops=1)
Index Cond: (email = $1)
-> Index Scan using t1 on t (cost=0.00..10.10 rows=100 width=16) (actual time=0.015..0.028 rows=100 loops=1)
Index Cond: (t.user_id = poly.id)
Total runtime: 0.170 ms
For reference, the CASE-in-WHERE approach discussed above looks like this:
SELECT t.*
FROM tweets t
inner join users u on t.user_id = u.user_id
WHERE case $2
when 'username' then u.username = $1
when 'email' then u.email = $1
else u.user_id = $1
end