I have been given a requirement to query 8(+) different Redshift tables from a Lambda function and return some of the fields where email = the provided email address.
The current implementation is a JDBC connection using Spring Data JDBC:
@Query("select * from
(select product_name, email, firstname, lastname, accountid, phone
from companyA) companyA
natural full join
(select product_name, email, firstname, lastname, accountid, phone
from companyB) companyB
natural full join
(select product_name, email, firstname, lastname, accountid, phone
from companyC) companyC
natural full join
(select product_name, email, firstname, lastname, accountid, phone
from companyD) companyD
natural full join
(select product_name, email, firstname, lastname, accountid, phone
from companyE) companyE
natural full join
(select product_name, email, firstname, lastname, accountid, phone
from companyF) companyF
natural full join
(select product_name, email, firstname, lastname, accountid, phone
from companyG) companyG
natural full join
(select product_name, email, firstname, lastname, accountid, phone
from companyH) companyH
where email = :emailId")
public List<Contact> getContactSortByDate(String emailId);
The problem with this approach is that each of these tables is quite large, so the query can take 45+ seconds. To make it worse, I will eventually need to implement a one-to-many join on "accountid" to 8 account tables to grab some fields there.
And... my boss wants all of this to happen in less than 5 seconds. I don't think that will be possible, but I wanted to know if there are things I can do to speed this search up as much as possible.
Questions:
Is there a way to make this query faster?
If I am able to switch to using the Redshift client from inside the VPC instead of a JDBC connection, is there still any hope of getting below (or anywhere near) 5 seconds?
EDIT:
As requested, here are the updated results of EXPLAIN:
XN Subquery Scan alltables (cost=35.45..142252362.72 rows=35 width=520)
-> XN Append (cost=35.45..142252362.37 rows=35 width=1440)
-> XN Subquery Scan "*SELECT* 1" (cost=35.45..67200090.91 rows=11 width=1440)
-> XN Hash Join DS_BCAST_INNER (cost=35.45..67200090.80 rows=11 width=1440)
Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
-> XN Seq Scan on companyA_accounts acc (cost=0.00..4.91 rows=491 width=816)
-> XN Hash (cost=35.43..35.43 rows=10 width=663)
-> XN Seq Scan on companyA usr (cost=0.00..35.43 rows=10 width=663)
Filter: ((email)::text = 'foobybar#barfoo.com'::text)
-> XN Subquery Scan "*SELECT* 2" (cost=41.85..41600165.37 rows=11 width=943)
-> XN Hash Join DS_BCAST_INNER (cost=41.85..41600165.26 rows=11 width=943)
Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
-> XN Seq Scan on companyB_accounts acc (cost=0.00..10.96 rows=1096 width=550)
-> XN Hash (cost=41.83..41.83 rows=10 width=406)
-> XN Seq Scan on companyB usr (cost=0.00..41.83 rows=10 width=406)
Filter: ((email)::text = 'foobybar#barfoo.com'::text)
-> XN Subquery Scan "*SELECT* 3" (cost=512.54..6241501.45 rows=2 width=1374)
-> XN Hash Join DS_BCAST_INNER (cost=512.54..6241501.43 rows=2 width=1374)
Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
-> XN Seq Scan on companyC_accounts acc (cost=0.00..439.50 rows=43950 width=771)
-> XN Hash (cost=512.54..512.54 rows=1 width=614)
-> XN Seq Scan on companyC usr (cost=0.00..512.54 rows=1 width=614)
Filter: ((email)::text = 'foobybar#barfoo.com'::text)
-> XN Subquery Scan "*SELECT* 4" (cost=2178.78..4403808.58 rows=3 width=985)
-> XN Hash Join DS_BCAST_INNER (cost=2178.78..4403808.55 rows=3 width=985)
Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
-> XN Seq Scan on companyD_accounts acc (cost=0.00..501.46 rows=50146 width=809)
-> XN Hash (cost=2178.78..2178.78 rows=2 width=212)
-> XN Seq Scan on companyD usr (cost=0.00..2178.78 rows=2 width=212)
Filter: (((email)::text = 'foobybar#barfoo.com'::text) AND (accountid IS NOT NULL))
-> XN Subquery Scan "*SELECT* 5" (cost=3534.94..4244248.75 rows=2 width=511)
-> XN Hash Join DS_BCAST_INNER (cost=3534.94..4244248.73 rows=2 width=511)
Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
-> XN Seq Scan on companyE_accounts acc (cost=0.00..219.62 rows=21962 width=347)
-> XN Hash (cost=3534.94..3534.94 rows=2 width=203)
-> XN Seq Scan on companyE usr (cost=0.00..3534.94 rows=2 width=203)
Filter: (((email)::text = 'foobybar#barfoo.com'::text) AND (accountid IS NOT NULL))
-> XN Subquery Scan "*SELECT* 6" (cost=810.77..1921107.33 rows=1 width=1175)
-> XN Hash Join DS_BCAST_INNER (cost=810.77..1921107.32 rows=1 width=1175)
Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
-> XN Seq Scan on companyF_accounts acc (cost=0.00..131.80 rows=13180 width=1030)
-> XN Hash (cost=810.76..810.76 rows=1 width=184)
-> XN Seq Scan on companyF usr (cost=0.00..810.76 rows=1 width=184)
Filter: ((email)::text = 'foobybar#barfoo.com'::text)
-> XN Subquery Scan "*SELECT* 7" (cost=705.19..9200982.48 rows=3 width=1204)
-> XN Hash Join DS_BCAST_INNER (cost=705.19..9200982.45 rows=3 width=1204)
Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
-> XN Seq Scan on companyG_accounts acc (cost=0.00..85.30 rows=8530 width=790)
-> XN Hash (cost=705.19..705.19 rows=2 width=449)
-> XN Seq Scan on companyG usr (cost=0.00..705.19 rows=2 width=449)
Filter: ((email)::text = 'foobybar#barfoo.com'::text)
-> XN Subquery Scan "*SELECT* 8" (cost=420.21..7440457.49 rows=2 width=1135)
-> XN Hash Join DS_BCAST_INNER (cost=420.21..7440457.47 rows=2 width=1135)
Hash Cond: (("outer".id)::text = ("inner".accountid)::text)
-> XN Seq Scan on companyH_accounts acc (cost=0.00..11.46 rows=1146 width=784)
-> XN Hash (cost=420.20..420.20 rows=2 width=362)
-> XN Seq Scan on companyH usr (cost=0.00..420.20 rows=2 width=362)
Filter: ((email)::text = 'foobybar#barfoo.com'::text)
Solution:
As O.Jones pointed out, UNION ALL sped it up by quite a bit! I still have a bunch of joins, but it brought the time down by a lot. Here is the final solution:
SELECT *
FROM (
SELECT usr.product_name, acc.product_loc, acc.phone, usr.email, usr.firstname, usr.lastname, usr.accountid
FROM companyA usr
JOIN companyA_accounts acc ON usr.accountid = acc.id
UNION ALL
SELECT usr.product_name, acc.product_loc, acc.phone, usr.email, usr.firstname, usr.lastname, usr.accountid
FROM companyB usr
JOIN companyB_accounts acc ON usr.accountid = acc.id
/* UNION ALL SELECT repeated for the rest of the tables */
) alltables
WHERE alltables.email = :emailId;
I am still open to information on best practices / efficiency if switching to RedshiftClient though!
Have you tried something like this query, using an ordinary UNION ALL rather than all those JOINs?
SELECT *
FROM (
SELECT product_name, email, firstname, lastname, accountid, phone
FROM companyA
UNION ALL
SELECT product_name, email, firstname, lastname, accountid, phone
FROM companyB
/* UNION ALL SELECT repeated for the rest of the tables */
) alltables
JOIN some_other_table ON alltables.accountid = some_other_table.accountid
WHERE alltables.email = :emailId;
A query like this can exploit any indexes on the email columns of any tables that have them.
And, as my example shows, you can do the joins with fairly clean SQL.
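One caveat on the index point: Redshift has no conventional secondary indexes, so the closest equivalents are sort keys (so the email filter can skip blocks) and distribution keys (so the account joins run collocated instead of broadcasting, which is what the DS_BCAST_INNER steps in the plan above indicate). A minimal sketch for one pair of tables, assuming the tables permit in-place key changes and that these columns suit the rest of your workload:

-- Sort the user table on the filter column so the email predicate prunes blocks.
ALTER TABLE companyA ALTER COMPOUND SORTKEY (email);

-- Distribute both sides of the account join on the join column so the join
-- can run collocated rather than broadcasting the inner table.
ALTER TABLE companyA ALTER DISTKEY accountid;
ALTER TABLE companyA_accounts ALTER DISTKEY id;

-- Refresh statistics so the planner sees the new layout.
ANALYZE companyA;
ANALYZE companyA_accounts;

The same pattern would be repeated for companyB through companyH and their account tables.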
Related
I have no idea how to simplify this problem, so this is going to be a long question.
For openers, for reasons I won't get into, I normalized out long paragraphs to a table named shared.notes.
Next I have a complicated view with a number of paragraph lookups. Each note_id field is (a) indexed and (b) has a foreign key constraint to the notes table. Pseudo code below:
CREATE VIEW shared.vw_get_the_whole_kit_and_kaboodle AS
SELECT
yada yada
, mi.electrical_note_id
, electrical_notes.note AS electrical_notes
, mi.hvac_note_id
, hvac_notes.note AS hvac_notes
, mi.network_note_id
, network_notes.note AS network_notes
, mi.plumbing_note_id
, plumbing_notes.note AS plumbing_notes
, mi.specification_note_id
, specification_notes.note AS specification_notes
, mi.structural_note_id
, structural_notes.note AS structural_notes
FROM shared.a_table AS mi
JOIN shared.generic_items AS gi
ON mi.generic_item_id = gi.generic_item_id
JOIN shared.manufacturers AS mft
ON mi.manufacturer_id = mft.manufacturer_id
JOIN shared.notes AS electrical_notes
ON mi.electrical_note_id = electrical_notes.note_id
JOIN shared.notes AS hvac_notes
ON mi.hvac_note_id = hvac_notes.note_id
JOIN shared.notes AS plumbing_notes
ON mi.plumbing_note_id = plumbing_notes.note_id
JOIN shared.notes AS specification_notes
ON mi.specification_note_id = specification_notes.note_id
JOIN shared.notes AS structural_notes
ON mi.structural_note_id = structural_notes.note_id
JOIN shared.notes AS network_notes
ON mi.network_note_id = network_notes.note_id
JOIN shared.connectivity AS nc
ON mi.connectivity_id = nc.connectivity_id
WHERE
mi.deletion_date IS NULL;
Then I select against this view:
SELECT
lots of columns...
FROM shared.vw_get_the_whole_kit_and_kaboodle
WHERE
is_active = TRUE
AND is_inventory = FALSE;
Strangely, in the cloud GCP databases, I've not run into problems yet, and there are thousands of rows involved in a number of these tables.
Meanwhile back at the ranch, on my local PC, I've got a test version of the database. SAME EXACT SQL, down to the last letter. Trust me on that. For table definitions, view definitions, indexes... everything.
The cloud will return queries nearly instantaneously.
The local PC will hang--this despite the fact that the PC database has a mere handful of rows each in the various tables. So if one should hang, it ought to be the cloud databases. But it's the other way around; the tiny-dataset database is the one that fails.
Add this plot twist in: if I remove the filter for is_inventory, the query on the PC returns instantaneously. Also, if I just remove, one by one, the joins to the notes table, after about half of them are gone, the PC starts to finish instantaneously. It's almost like it's upset to be hitting the same table so many times with one query.
If I run EXPLAIN (without the ANALYZE option), here's the NO-hang version:
Hash Left Join (cost=31.55..40.09 rows=43 width=751)
Hash Cond: (mi.mounting_location_id = ml.mounting_location_id)
-> Hash Left Join (cost=30.34..38.76 rows=43 width=719)
Hash Cond: (mi.price_type_id = pt.price_type_id)
-> Hash Join (cost=29.25..37.53 rows=43 width=687)
Hash Cond: (mi.connectivity_id = nc.connectivity_id)
-> Nested Loop (cost=28.16..36.21 rows=43 width=655)
Join Filter: (mi.network_note_id = network_notes.note_id)
-> Seq Scan on notes network_notes (cost=0.00..1.01 rows=1 width=48)
-> Nested Loop (cost=28.16..34.66 rows=43 width=623)
Join Filter: (mi.plumbing_note_id = plumbing_notes.note_id)
-> Seq Scan on notes plumbing_notes (cost=0.00..1.01 rows=1 width=48)
-> Hash Join (cost=28.16..33.11 rows=43 width=591)
Hash Cond: (mi.generic_item_id = gi.generic_item_id)
-> Hash Join (cost=5.11..9.95 rows=43 width=559)
Hash Cond: (mi.structural_note_id = structural_notes.note_id)
-> Hash Join (cost=4.09..8.57 rows=43 width=527)
Hash Cond: (mi.specification_note_id = specification_notes.note_id)
-> Hash Join (cost=3.07..7.37 rows=43 width=495)
Hash Cond: (mi.hvac_note_id = hvac_notes.note_id)
-> Hash Join (cost=2.04..5.99 rows=43 width=463)
Hash Cond: (mi.electrical_note_id = electrical_notes.note_id)
-> Hash Join (cost=1.02..4.70 rows=43 width=431)
Hash Cond: (mi.manufacturer_id = mft.manufacturer_id)
-> Seq Scan on mft_items mi (cost=0.00..3.44 rows=43 width=399)
Filter: ((deletion_date IS NULL) AND is_active)
-> Hash (cost=1.01..1.01 rows=1 width=48)
-> Seq Scan on manufacturers mft (cost=0.00..1.01 rows=1 width=48)
-> Hash (cost=1.01..1.01 rows=1 width=48)
-> Seq Scan on notes electrical_notes (cost=0.00..1.01 rows=1 width=48)
-> Hash (cost=1.01..1.01 rows=1 width=48)
-> Seq Scan on notes hvac_notes (cost=0.00..1.01 rows=1 width=48)
-> Hash (cost=1.01..1.01 rows=1 width=48)
-> Seq Scan on notes specification_notes (cost=0.00..1.01 rows=1 width=48)
-> Hash (cost=1.01..1.01 rows=1 width=48)
-> Seq Scan on notes structural_notes (cost=0.00..1.01 rows=1 width=48)
-> Hash (cost=15.80..15.80 rows=580 width=48)
-> Seq Scan on generic_items gi (cost=0.00..15.80 rows=580 width=48)
-> Hash (cost=1.04..1.04 rows=4 width=36)
-> Seq Scan on connectivity nc (cost=0.00..1.04 rows=4 width=36)
-> Hash (cost=1.04..1.04 rows=4 width=36)
-> Seq Scan on price_types pt (cost=0.00..1.04 rows=4 width=36)
-> Hash (cost=1.09..1.09 rows=9 width=48)
-> Seq Scan on mounting_locations ml (cost=0.00..1.09 rows=9 width=48)
And this is the hang version:
Hash Left Join (cost=26.43..38.57 rows=16 width=751)
Hash Cond: (mi.mounting_location_id = ml.mounting_location_id)
-> Hash Left Join (cost=25.23..37.32 rows=16 width=719)
Hash Cond: (mi.price_type_id = pt.price_type_id)
-> Hash Join (cost=24.14..36.18 rows=16 width=687)
Hash Cond: (mi.connectivity_id = nc.connectivity_id)
-> Nested Loop (cost=23.05..35.00 rows=16 width=655)
Join Filter: (mi.network_note_id = network_notes.note_id)
-> Seq Scan on notes network_notes (cost=0.00..1.01 rows=1 width=48)
-> Nested Loop (cost=23.05..33.79 rows=16 width=623)
Join Filter: (mi.structural_note_id = structural_notes.note_id)
-> Seq Scan on notes structural_notes (cost=0.00..1.01 rows=1 width=48)
-> Nested Loop (cost=23.05..32.58 rows=16 width=591)
Join Filter: (mi.electrical_note_id = electrical_notes.note_id)
-> Seq Scan on notes electrical_notes (cost=0.00..1.01 rows=1 width=48)
-> Nested Loop (cost=23.05..31.37 rows=16 width=559)
Join Filter: (mi.specification_note_id = specification_notes.note_id)
-> Seq Scan on notes specification_notes (cost=0.00..1.01 rows=1 width=48)
-> Nested Loop (cost=23.05..30.16 rows=16 width=527)
Join Filter: (mi.plumbing_note_id = plumbing_notes.note_id)
-> Seq Scan on notes plumbing_notes (cost=0.00..1.01 rows=1 width=48)
-> Nested Loop (cost=23.05..28.95 rows=16 width=495)
Join Filter: (mi.hvac_note_id = hvac_notes.note_id)
-> Seq Scan on notes hvac_notes (cost=0.00..1.01 rows=1 width=48)
-> Nested Loop (cost=23.05..27.74 rows=16 width=463)
Join Filter: (mi.manufacturer_id = mft.manufacturer_id)
-> Seq Scan on manufacturers mft (cost=0.00..1.01 rows=1 width=48)
-> Hash Join (cost=23.05..26.53 rows=16 width=431)
Hash Cond: (mi.generic_item_id = gi.generic_item_id)
-> Seq Scan on mft_items mi (cost=0.00..3.44 rows=16 width=399)
Filter: ((deletion_date IS NULL) AND is_active AND (NOT is_inventory))
-> Hash (cost=15.80..15.80 rows=580 width=48)
-> Seq Scan on generic_items gi (cost=0.00..15.80 rows=580 width=48)
-> Hash (cost=1.04..1.04 rows=4 width=36)
-> Seq Scan on connectivity nc (cost=0.00..1.04 rows=4 width=36)
-> Hash (cost=1.04..1.04 rows=4 width=36)
-> Seq Scan on price_types pt (cost=0.00..1.04 rows=4 width=36)
-> Hash (cost=1.09..1.09 rows=9 width=48)
-> Seq Scan on mounting_locations ml (cost=0.00..1.09 rows=9 width=48)
I'd like to understand what I should be doing differently to escape this hang condition. Unfortunately, I'm not clear on what I'm doing wrong.
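A diagnostic sketch (my suggestion, not something described above): refresh the local planner statistics for the tables behind the view, then capture actual run-time numbers for the slow shape, bounded by a timeout so the session cannot hang indefinitely.

-- A freshly restored local database often has stale or missing statistics,
-- which can lead the planner toward nested loops; ANALYZE refreshes them.
ANALYZE shared.a_table;
ANALYZE shared.notes;
ANALYZE shared.generic_items;

-- Bound the run and collect actual row counts and buffer usage for comparison
-- against the estimate-only EXPLAIN output above.
SET statement_timeout = '30s';
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM shared.vw_get_the_whole_kit_and_kaboodle
WHERE is_active = TRUE
  AND is_inventory = FALSE;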
I have the following SQL query:
EXPLAIN ANALYZE
SELECT
full_address,
street_address,
street.street,
(
select
city
from
city
where
city.id = property.city_id
)
AS city,
(
select
state_code
from
state
where
id = property.state_id
)
AS state_code,
(
select
zipcode
from
zipcode
where
zipcode.id = property.zipcode_id
)
AS zipcode
FROM
property
INNER JOIN
street
ON street.id = property.street_id
WHERE
street.street = 'W San Miguel Ave'
AND property.zipcode_id =
(
SELECT
id
FROM
zipcode
WHERE
zipcode = '85340'
)
Below are the EXPLAIN ANALYZE results:
Gather (cost=1008.86..226541.68 rows=1 width=161) (actual time=59.311..21956.143 rows=184 loops=1)
Workers Planned: 2
Params Evaluated: $3
Workers Launched: 2
InitPlan 4 (returns $3)
-> Index Scan using zipcode_zipcode_county_id_state_id_index on zipcode zipcode_1 (cost=0.28..8.30 rows=1 width=16) (actual time=0.039..0.040 rows=1 loops=1)
Index Cond: (zipcode = '85340'::citext)
-> Nested Loop (cost=0.56..225508.35 rows=1 width=113) (actual time=7430.172..14723.451 rows=61 loops=3)
-> Parallel Seq Scan on street (cost=0.00..13681.63 rows=1 width=28) (actual time=108.023..108.053 rows=1 loops=3)
Filter: (street = 'W San Miguel Ave'::citext)
Rows Removed by Filter: 99131
-> Index Scan using property_street_address_street_id_city_id_state_id_zipcode_id_c on property (cost=0.56..211826.71 rows=1 width=117) (actual time=10983.195..21923.063 rows=92 loops=2)
Index Cond: ((street_id = street.id) AND (zipcode_id = $3))
SubPlan 1
-> Index Scan using city_id_pk on city (cost=0.28..8.30 rows=1 width=9) (actual time=0.003..0.003 rows=1 loops=184)
Index Cond: (id = property.city_id)
SubPlan 2
-> Index Scan using state_id_pk on state (cost=0.27..8.34 rows=1 width=3) (actual time=0.002..0.002 rows=1 loops=184)
Index Cond: (id = property.state_id)
SubPlan 3
-> Index Scan using zipcode_id_pk on zipcode (cost=0.28..8.30 rows=1 width=6) (actual time=0.002..0.003 rows=1 loops=184)
Index Cond: (id = property.zipcode_id)
Planning Time: 1.228 ms
Execution Time: 21956.246 ms
Is it possible to speed up this query by adding more indexes?
The query can be rewritten using joins rather than subselects. This may be faster and easier to index.
SELECT
full_address,
street_address,
street.street,
city.city as city,
state.state_code as state_code,
zipcode.zipcode as zipcode
FROM
property
INNER JOIN street ON street.id = property.street_id
INNER JOIN city ON city.id = property.city_id
INNER JOIN state ON state.id = property.state_id
INNER JOIN zipcode ON zipcode.id = property.zipcode_id
WHERE
street.street = 'W San Miguel Ave'
AND zipcode.zipcode = '85340'
Assuming all the foreign keys (property.street_id, property.city_id, etc.) are indexed, this now becomes a search on street.street and zipcode.zipcode. As long as those are indexed, the query should take milliseconds.
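For completeness, a sketch of the indexes that rewrite relies on; the index names are hypothetical, and any that already exist (the plan above already shows an index on zipcode and a compound index on property) can be skipped:

-- Foreign-key columns on property, plus the two lookup columns used in the WHERE clause.
CREATE INDEX IF NOT EXISTS property_street_id_idx  ON property (street_id);
CREATE INDEX IF NOT EXISTS property_city_id_idx    ON property (city_id);
CREATE INDEX IF NOT EXISTS property_state_id_idx   ON property (state_id);
CREATE INDEX IF NOT EXISTS property_zipcode_id_idx ON property (zipcode_id);
CREATE INDEX IF NOT EXISTS street_street_idx       ON street (street);
CREATE INDEX IF NOT EXISTS zipcode_zipcode_idx     ON zipcode (zipcode);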
I am looking for ways to abstract database access to Postgres. In my examples I will use a hypothetical Twitter clone in Node.js, but in the end it's a question about how Postgres handles prepared statements, so the language and library don't really matter:
Suppose I want to be able to access a list of all tweets from a user by username:
name: "tweets by username"
text: "SELECT (SELECT * FROM tweets WHERE tweets.user_id = users.user_id) FROM users WHERE users.username = $1"
values: [username]
That works fine, but it seems inefficient, both in practical terms and code-quality terms, to have to make another function to handle getting tweets by email rather than by username:
name: "tweets by email"
text: "SELECT (SELECT * FROM tweets WHERE tweets.user_id = users.user_id) FROM users WHERE users.email = $1"
values: [email]
Is it possible to include a field as a parameter to the prepared statement?
name: "tweets by user"
text: "SELECT (SELECT * FROM tweets WHERE tweets.user_id = users.user_id) FROM users WHERE users.$1 = $2"
values: [field, value]
While it's true that this might be a bit less efficient in the corner case of accessing tweets by user_id, that's a trade I'm willing to make to improve code quality, and hopefully overall improve efficiency by reducing the number of query templates to 1 instead of 3+.
@Clodoaldo's answer is correct in that it allows the capability you desire and should return the right results. Unfortunately it produces rather slow execution.
I set up an experimental database with tweets and users, populated 10K users each with 100 tweets (1M tweet records), and indexed the PKs u.id and t.id, the FK t.user_id, and the predicate fields u.name and u.email.
create table t(id serial PRIMARY KEY, data integer, user_id bigint);
create index t1 on t(user_id);
create table u(id serial PRIMARY KEY, name text, email text);
create index u1 on u(name);
create index u2 on u(email);
insert into u(name,email) select i::text, i::text from generate_series(1,10000) i;
insert into t(data,user_id) select i, (i/100)::bigint from generate_series(1,1000000) i;
analyze t;
analyze u;
A simple query using one field as predicate is very fast:
prepare qn as select t.* from t join u on t.user_id = u.id where u.name = $1;
explain analyze execute qn('1111');
Nested Loop (cost=0.00..19.81 rows=1 width=16) (actual time=0.030..0.057 rows=100 loops=1)
-> Index Scan using u1 on u (cost=0.00..8.46 rows=1 width=4) (actual time=0.020..0.020 rows=1 loops=1)
Index Cond: (name = $1)
-> Index Scan using t1 on t (cost=0.00..10.10 rows=100 width=16) (actual time=0.007..0.023 rows=100 loops=1)
Index Cond: (t.user_id = u.id)
Total runtime: 0.093 ms
A query using CASE in the WHERE clause, as @Clodoaldo proposed, takes almost 30 seconds:
prepare qen as select t.* from t join u on t.user_id = u.id
where case $2 when 'e' then u.email = $1 when 'n' then u.name = $1 end;
explain analyze execute qen('1111','n');
Merge Join (cost=25.61..38402.69 rows=500000 width=16) (actual time=27.771..26345.439 rows=100 loops=1)
Merge Cond: (t.user_id = u.id)
-> Index Scan using t1 on t (cost=0.00..30457.35 rows=1000000 width=16) (actual time=0.023..17.741 rows=111200 loops=1)
-> Index Scan using u_pkey on u (cost=0.00..42257.36 rows=500000 width=4) (actual time=0.325..26317.384 rows=1 loops=1)
Filter: CASE $2 WHEN 'e'::text THEN (u.email = $1) WHEN 'n'::text THEN (u.name = $1) ELSE NULL::boolean END
Total runtime: 26345.535 ms
Observing that plan, I thought that using a UNION ALL subselect and then filtering its results to get the id appropriate to the parameterized predicate choice would allow the planner to use a specific index for each predicate. Turns out I was right:
prepare qen2 as
select t.*
from t
join (
SELECT id from
(
SELECT 'n' as fld, id from u where u.name = $1
UNION ALL
SELECT 'e' as fld, id from u where u.email = $1
) poly
where poly.fld = $2
) uu
on t.user_id = uu.id;
explain analyze execute qen2('1111','n');
Nested Loop (cost=0.00..28.31 rows=100 width=16) (actual time=0.058..0.120 rows=100 loops=1)
-> Subquery Scan poly (cost=0.00..16.96 rows=1 width=4) (actual time=0.041..0.073 rows=1 loops=1)
Filter: (poly.fld = $2)
-> Append (cost=0.00..16.94 rows=2 width=4) (actual time=0.038..0.070 rows=2 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..8.47 rows=1 width=4) (actual time=0.038..0.038 rows=1 loops=1)
-> Index Scan using u1 on u (cost=0.00..8.46 rows=1 width=4) (actual time=0.038..0.038 rows=1 loops=1)
Index Cond: (name = $1)
-> Subquery Scan "*SELECT* 2" (cost=0.00..8.47 rows=1 width=4) (actual time=0.031..0.032 rows=1 loops=1)
-> Index Scan using u2 on u (cost=0.00..8.46 rows=1 width=4) (actual time=0.030..0.031 rows=1 loops=1)
Index Cond: (email = $1)
-> Index Scan using t1 on t (cost=0.00..10.10 rows=100 width=16) (actual time=0.015..0.028 rows=100 loops=1)
Index Cond: (t.user_id = poly.id)
Total runtime: 0.170 ms
SELECT t.*
FROM tweets t
inner join users u on t.user_id = u.user_id
WHERE case $2
when 'username' then u.username = $1
when 'email' then u.email = $1
else u.user_id = $1
end
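A usage sketch of the query above (the statement name and parameter values are hypothetical, and the ELSE branch casts user_id to text so a single text parameter works for all three predicates):

PREPARE tweets_by_user(text, text) AS
SELECT t.*
FROM tweets t
JOIN users u ON t.user_id = u.user_id
WHERE CASE $2
        WHEN 'username' THEN u.username = $1
        WHEN 'email'    THEN u.email = $1
        ELSE u.user_id::text = $1
      END;

-- Same prepared statement, different lookup fields.
EXECUTE tweets_by_user('someone@example.com', 'email');
EXECUTE tweets_by_user('someone', 'username');

Keep in mind from the benchmark above that this CASE form can defeat index usage; the UNION ALL subselect variant kept the lookups indexed.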
The docs for Pg's Window function say:
The rows considered by a window function are those of the "virtual table" produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any. For example, a row removed because it does not meet the WHERE condition is not seen by any window function. A query can contain multiple window functions that slice up the data in different ways by means of different OVER clauses, but they all act on the same collection of rows defined by this virtual table.
However, I'm not seeing this. In the plan below, the Filter on fkey_style sits at the very top of the plan tree (i.e. it is applied last), rather than being pushed down below the WindowAgg nodes:
=# EXPLAIN SELECT * FROM chrome_nvd.view_options where fkey_style = 303451;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Subquery Scan view_options (cost=2098450.26..2142926.28 rows=14825 width=180)
Filter: (view_options.fkey_style = 303451)
-> Sort (cost=2098450.26..2105862.93 rows=2965068 width=189)
Sort Key: o.sequence
-> WindowAgg (cost=1446776.02..1506077.38 rows=2965068 width=189)
-> Sort (cost=1446776.02..1454188.69 rows=2965068 width=189)
Sort Key: h.name, k.name
-> WindowAgg (cost=802514.45..854403.14 rows=2965068 width=189)
-> Sort (cost=802514.45..809927.12 rows=2965068 width=189)
Sort Key: h.name
-> Hash Join (cost=18.52..210141.57 rows=2965068 width=189)
Hash Cond: (o.fkey_opt_header = h.id)
-> Hash Join (cost=3.72..169357.09 rows=2965068 width=166)
Hash Cond: (o.fkey_opt_kind = k.id)
-> Seq Scan on options o (cost=0.00..128583.68 rows=2965068 width=156)
-> Hash (cost=2.21..2.21 rows=121 width=18)
-> Seq Scan on opt_kind k (cost=0.00..2.21 rows=121 width=18)
-> Hash (cost=8.80..8.80 rows=480 width=31)
-> Seq Scan on opt_header h (cost=0.00..8.80 rows=480 width=31)
(19 rows)
These two WindowAgg nodes essentially change the plan into something that seems to never finish, compared with the much faster:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Subquery Scan view_options (cost=329.47..330.42 rows=76 width=164) (actual time=20.263..20.403 rows=42 loops=1)
-> Sort (cost=329.47..329.66 rows=76 width=189) (actual time=20.258..20.300 rows=42 loops=1)
Sort Key: o.sequence
Sort Method: quicksort Memory: 35kB
-> Hash Join (cost=18.52..327.10 rows=76 width=189) (actual time=19.427..19.961 rows=42 loops=1)
Hash Cond: (o.fkey_opt_header = h.id)
-> Hash Join (cost=3.72..311.25 rows=76 width=166) (actual time=17.679..18.085 rows=42 loops=1)
Hash Cond: (o.fkey_opt_kind = k.id)
-> Index Scan using options_pkey on options o (cost=0.00..306.48 rows=76 width=156) (actual time=17.152..17.410 rows=42 loops=1)
Index Cond: (fkey_style = 303451)
-> Hash (cost=2.21..2.21 rows=121 width=18) (actual time=0.432..0.432 rows=121 loops=1)
-> Seq Scan on opt_kind k (cost=0.00..2.21 rows=121 width=18) (actual time=0.042..0.196 rows=121 loops=1)
-> Hash (cost=8.80..8.80 rows=480 width=31) (actual time=1.687..1.687 rows=480 loops=1)
-> Seq Scan on opt_header h (cost=0.00..8.80 rows=480 width=31) (actual time=0.030..0.748 rows=480 loops=1)
Total runtime: 20.893 ms
(15 rows)
What is going on, and how do I fix it? I'm using Postgresql 8.4.8. Here is what the actual view is doing:
SELECT o.fkey_style, h.name AS header, k.name AS kind
, o.code, o.name AS option_name, o.description
, count(*) OVER (PARTITION BY h.name) AS header_count
, count(*) OVER (PARTITION BY h.name, k.name) AS header_kind_count
FROM chrome_nvd.options o
JOIN chrome_nvd.opt_header h ON h.id = o.fkey_opt_header
JOIN chrome_nvd.opt_kind k ON k.id = o.fkey_opt_kind
ORDER BY o.sequence;
No, PostgreSQL will only push down a WHERE clause on a VIEW that does not have an aggregate. (Window functions are considered aggregates.)
< x> I think that's just an implementation limitation
< EvanCarroll> x: I wonder what would have to be done to push the
WHERE clause down in this case.
< EvanCarroll> the planner would have to know that the WindowAgg doesn't itself add selectivity and therefore it is safe to push the WHERE down?
< x> EvanCarroll; a lot of very complicated work with the planner, I'd presume
And,
< a> EvanCarroll: nope. a filter condition on a view applies to the output of the view and only gets pushed down if the view does not involve aggregates
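To illustrate the consequence: because the filter is not pushed down through the WindowAgg nodes, the whole options table is joined, sorted, and windowed before the fkey_style predicate is applied. A sketch of a manual workaround, assuming the per-header counts are only needed within the selected style (note this changes their meaning relative to the view, where they are computed across all styles):

SELECT o.fkey_style, h.name AS header, k.name AS kind
     , o.code, o.name AS option_name, o.description
     , count(*) OVER (PARTITION BY h.name) AS header_count
     , count(*) OVER (PARTITION BY h.name, k.name) AS header_kind_count
FROM chrome_nvd.options o
JOIN chrome_nvd.opt_header h ON h.id = o.fkey_opt_header
JOIN chrome_nvd.opt_kind k ON k.id = o.fkey_opt_kind
WHERE o.fkey_style = 303451   -- applied before the window functions run
ORDER BY o.sequence;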
Can I optimize this query, or modify the table structure in order to shorten the execution time? I don't really understand the output of EXPLAIN. Am I missing some index?
EXPLAIN SELECT COUNT(*) AS count,
q.query_str
FROM click_fact cf,
query q,
date_dim dd,
queries_p_day_mv qpd
WHERE dd.date_dim_id = qpd.date_dim_id
AND qpd.query_id = q.query_id
AND type = 'S'
AND cf.query_id = q.query_id
AND dd.pg_date BETWEEN '2010-12-29' AND '2011-01-28'
AND qpd.interface_id IN (SELECT DISTINCT interface_id from interface WHERE lang = 'sv')
GROUP BY q.query_str
ORDER BY count DESC;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=19170.15..19188.80 rows=7460 width=12)
Sort Key: (count(*))
-> HashAggregate (cost=18597.03..18690.28 rows=7460 width=12)
-> Nested Loop (cost=10.20..18559.73 rows=7460 width=12)
-> Nested Loop (cost=10.20..14975.36 rows=2452 width=20)
Join Filter: (qpd.interface_id = interface.interface_id)
-> Unique (cost=1.03..1.04 rows=1 width=4)
-> Sort (cost=1.03..1.04 rows=1 width=4)
Sort Key: interface.interface_id
-> Seq Scan on interface (cost=0.00..1.02 rows=1 width=4)
Filter: (lang = 'sv'::text)
-> Nested Loop (cost=9.16..14943.65 rows=2452 width=24)
-> Hash Join (cost=9.16..14133.58 rows=2452 width=8)
Hash Cond: (qpd.date_dim_id = dd.date_dim_id)
-> Seq Scan on queries_p_day_mv qpd (cost=0.00..11471.93 rows=700793 width=12)
-> Hash (cost=8.81..8.81 rows=28 width=4)
-> Index Scan using date_dim_pg_date_index on date_dim dd (cost=0.00..8.81 rows=28 width=4)
Index Cond: ((pg_date >= '2010-12-29'::date) AND (pg_date <= '2011-01-28'::date))
-> Index Scan using query_pkey on query q (cost=0.00..0.32 rows=1 width=16)
Index Cond: (q.query_id = qpd.query_id)
-> Index Scan using click_fact_query_id_index on click_fact cf (cost=0.00..1.01 rows=36 width=4)
Index Cond: (cf.query_id = qpd.query_id)
Filter: (cf.type = 'S'::bpchar)
Updated with EXPLAIN ANALYZE:
EXPLAIN ANALYZE SELECT COUNT(*) AS count,
q.query_str
FROM click_fact cf,
query q,
date_dim dd,
queries_p_day_mv qpd
WHERE dd.date_dim_id = qpd.date_dim_id
AND qpd.query_id = q.query_id
AND type = 'S'
AND cf.query_id = q.query_id
AND dd.pg_date BETWEEN '2010-12-29' AND '2011-01-28'
AND qpd.interface_id IN (SELECT DISTINCT interface_id from interface WHERE lang = 'sv')
GROUP BY q.query_str
ORDER BY count DESC;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=19201.06..19220.52 rows=7784 width=12) (actual time=51017.162..51046.102 rows=17586 loops=1)
Sort Key: (count(*))
Sort Method: external merge Disk: 632kB
-> HashAggregate (cost=18600.67..18697.97 rows=7784 width=12) (actual time=50935.411..50968.678 rows=17586 loops=1)
-> Nested Loop (cost=10.20..18561.75 rows=7784 width=12) (actual time=42.079..43666.404 rows=3868592 loops=1)
-> Nested Loop (cost=10.20..14975.91 rows=2453 width=20) (actual time=23.678..14609.282 rows=700803 loops=1)
Join Filter: (qpd.interface_id = interface.interface_id)
-> Unique (cost=1.03..1.04 rows=1 width=4) (actual time=0.104..0.110 rows=1 loops=1)
-> Sort (cost=1.03..1.04 rows=1 width=4) (actual time=0.100..0.102 rows=1 loops=1)
Sort Key: interface.interface_id
Sort Method: quicksort Memory: 25kB
-> Seq Scan on interface (cost=0.00..1.02 rows=1 width=4) (actual time=0.038..0.041 rows=1 loops=1)
Filter: (lang = 'sv'::text)
-> Nested Loop (cost=9.16..14944.20 rows=2453 width=24) (actual time=23.550..12553.786 rows=700808 loops=1)
-> Hash Join (cost=9.16..14133.80 rows=2453 width=8) (actual time=18.283..3885.700 rows=700808 loops=1)
Hash Cond: (qpd.date_dim_id = dd.date_dim_id)
-> Seq Scan on queries_p_day_mv qpd (cost=0.00..11472.08 rows=700808 width=12) (actual time=0.014..1587.106 rows=700808 loops=1)
-> Hash (cost=8.81..8.81 rows=28 width=4) (actual time=18.221..18.221 rows=31 loops=1)
-> Index Scan using date_dim_pg_date_index on date_dim dd (cost=0.00..8.81 rows=28 width=4) (actual time=14.388..18.152 rows=31 loops=1)
Index Cond: ((pg_date >= '2010-12-29'::date) AND (pg_date <= '2011-01-28'::date))
-> Index Scan using query_pkey on query q (cost=0.00..0.32 rows=1 width=16) (actual time=0.005..0.006 rows=1 loops=700808)
Index Cond: (q.query_id = qpd.query_id)
-> Index Scan using click_fact_query_id_index on click_fact cf (cost=0.00..1.01 rows=36 width=4) (actual time=0.005..0.022 rows=6 loops=700803)
Index Cond: (cf.query_id = qpd.query_id)
Filter: (cf.type = 'S'::bpchar)
You may try to eliminate the subquery:
SELECT COUNT(*) AS count,
q.query_str
FROM click_fact cf,
query q,
date_dim dd,
queries_p_day_mv qpd,
interface
WHERE dd.date_dim_id = qpd.date_dim_id
AND qpd.query_id = q.query_id
AND type = 'S'
AND cf.query_id = q.query_id
AND dd.pg_date BETWEEN '2010-12-29' AND '2011-01-28'
AND qpd.interface_id = interface.interface_id
AND interface.lang = 'sv'
GROUP BY q.query_str
ORDER BY count DESC;
Also, if the interface table is big, creating an index on lang may help. An index on queries_p_day_mv(date_dim_id) may help too.
Generally, the first thing to try is to look for Seq Scans and try to make them index scans by creating indexes.
HTH
SELECT COUNT(*) AS count,
q.query_str
FROM date_dim dd
JOIN queries_p_day_mv qpd
ON qpd.date_dim_id = dd.date_dim_id
AND qpd.interface_id IN
(
SELECT interface_id
FROM interface
WHERE lang = 'sv'
)
JOIN query q
ON q.query_id = qpd.query_id
JOIN click_fact cf
ON cf.query_id = q.query_id
AND cf.type = 'S'
WHERE dd.pg_date BETWEEN '2010-12-29' AND '2011-01-28'
GROUP BY
q.query_str
ORDER BY
count DESC
Create the following indexes (in addition to your existing ones):
queries_p_day_mv (interface_id, date_dim_id)
interface (lang)
click_fact (query_id, type)
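Expressed as DDL, those would look something like this (the index names are hypothetical):

-- Covers the interface_id filter from the IN subquery plus the join to date_dim.
CREATE INDEX queries_p_day_mv_interface_date_idx ON queries_p_day_mv (interface_id, date_dim_id);
-- Supports the lang = 'sv' lookup.
CREATE INDEX interface_lang_idx ON interface (lang);
-- Supports the join on query_id together with the type = 'S' filter.
CREATE INDEX click_fact_query_id_type_idx ON click_fact (query_id, type);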
Could you please post the definitions of your tables?