My table1 has 25,000+ rows and table2 has only 1 row. Both of them have almost 30 columns. I need to add all the columns in table2 (which has only one row) to the columns in table1 so I can do further calculations. One way to do it is
select * from table1 cross join table2
It gives the desired results but the performance is not good.
I am wondering if there is a better or faster way to get the combined table. I am using PostgreSQL
Here is the output from
explain analyze select * from table1 cross join table2
Nested Loop (cost=0.00..195264.90 rows=15533650 width=336) (actual time=0.013..46.189 rows=25465 loops=1)
-> Seq Scan on table1 (cost=0.00..1076.65 rows=25465 width=232) (actual time=0.007..6.912 rows=25465 loops=1)
-> Materialize (cost=0.00..19.15 rows=610 width=104) (actual time=0.000..0.000 rows=1 loops=25465)
-> Seq Scan on table2 (cost=0.00..16.10 rows=610 width=104) (actual time=0.001..0.002 rows=1 loops=1)
Planning time: 0.153 ms
Execution time: 50.868 ms
Thanks.
Related
While executing below two queries, I notice serious difference in query plan. Why is that?
select * from table1
where id = 'dummy' or id in (select id from table2 where id = 'dummy')
Query plan
Seq Scan on table1 (cost=8.30..49611.63 rows=254478 width=820) (actual time=535.477..557.431 rows=1 loops=1)
Filter: (((code)::text = 'dummy'::text) OR (hashed SubPlan 1))
Rows Removed by Filter: 510467
SubPlan 1
-> Index Scan using idx on table2 (cost=0.29..8.30 rows=1 width=8) (actual time=0.009..0.012 rows=0 loops=1)
Index Cond: ((id)::text = 'dummy'::text)
Planning Time: 0.165 ms
Execution Time: 557.517 ms
select * from table1
where id = 'dummy'
union
select * from table1
where id in (select id from table2 where id = 'dummy')
Unique (cost=25.22..25.42 rows=2 width=5818) (actual time=0.045..0.047 rows=1 loops=1)
-> Sort (cost=25.22..25.23 rows=2 width=5818) (actual time=0.045..0.046 rows=1 loops=1)
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.42..25.21 rows=2 width=5818) (actual time=0.016..0.026 rows=1 loops=1)
-> Index Scan using id on table1 (cost=0.42..8.44 rows=1 width=820) (actual time=0.015..0.016 rows=1 loops=1)
Index Cond: ((id)::text = 'dummy'::text)
-> Nested Loop (cost=0.71..16.74 rows=1 width=820) (actual time=0.009..0.009 rows=0 loops=1)
-> Index Scan using idx on table2 (cost=0.29..8.30 rows=1 width=8) (actual time=0.008..0.008 rows=0 loops=1)
Index Cond: ((id)::text = 'dummy'::text)
-> Index Scan using pkey on table1 (cost=0.42..8.44 rows=1 width=820) (never executed)
Index Cond: (id = table2.id)
Planning Time: 0.753 ms
Execution Time: 0.131 ms
So the main difference you can see is the first query returns 254478 rows but the second just returns 2 rows.
Why is that?
Please do another test -- run both these queries -- do they give the same results as the queries without my changes?
select * from table1
where table1.id = 'dummy' or
table1.id in (select table2.id from table2 where table2.id = 'dummy')
select * from table1
where table1.id = 'dummy'
union
select * from table1
where table1.id in (select table2.id from table2 where table2.id = 'dummy')
I don't think you are sharing your actual code with us -- because as written your code makes little sense -- you are returning a list of ids in the sub-query that equal 'dummy' -- so you will just get a list of dummy multiple times.
Note these comments are not true since they had no impact on the results -- the order of operations was working as expected
What result do you get when when you do this:
select * from table1
where (id = 'dummy') or id in (select id from table2 where id = 'dummy')
The reason your query was giving more results is because it was selecting records from table1 where id equals dummy or id = id. The query in the original post gives you all the records. The OR was being applied to the first expression not splitting two expressions.
I have this query that is highly inefficient, if I remove all the count columns, it takes 10 seconds to query (the tables are quite large, around 750mb each). But if I add 1 count column, it takes 36 seconds to execute, if I leave it all in, it doesn't finish at all
I tried sum(case when r.value is not null then 1 else 0 end) in place of count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses, but it gets incorrect counts
SELECT c.id,
c.title,
count(DISTINCT cc.*) AS contacts,
count(DISTINCT m.user_id) AS texters,
count(DISTINCT cc.*) FILTER (WHERE cc.assignment_id IS NULL) AS needs_assignment,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = false) AS sent_messages,
count(DISTINCT m.*) FILTER (WHERE m.is_from_contact = true) AS received_messages,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = false) AS contacted,
count(DISTINCT cc.*) FILTER (WHERE m.is_from_contact = true) AS received_reply,
count(DISTINCT cc.*) FILTER (WHERE cc.message_status = 'needsResponse'::text AND NOT cc.is_opted_out) AS needs_response,
count(DISTINCT cc.*) FILTER (WHERE cc.is_opted_out = true) AS opt_outs,
count(DISTINCT m.*) FILTER (WHERE m.error_code IS NOT NULL AND m.error_code <> 0) AS errors,
c.is_started,
c.is_archived,
c.use_dynamic_assignment,
c.due_by,
c.created_at,
creator.email AS creator_email,
concat(c.join_token, '/join/', c.id) AS join_path,
count(DISTINCT r.*) FILTER (WHERE r.value IS NOT NULL) AS responses,
c.batch_size,
c.texting_hours_start,
c.texting_hours_end,
c.timezone,
c.organization_id
FROM campaign c
JOIN "user" creator ON c.creator_id = creator.id
LEFT JOIN campaign_contact cc ON c.id = cc.campaign_id
LEFT JOIN message m ON m.campaign_contact_id = cc.id
LEFT JOIN question_response r ON cc.id = r.campaign_contact_id
GROUP BY c.id, creator.email
Any direction is appreciated, thank you!
Create some test data...
create unlogged table users( user_id serial primary key, login text unique not null );
insert into users (login) select 'user'||n from generate_series(1,100000) n;
create unlogged table messages( message_id serial primary key, sender_id integer not null, receiver_id integer not null);
insert into messages (sender_id,receiver_id) select random()*100000+1, random()*100000+1 from generate_series(1,1000000);
create index messages_s on messages(sender_id);
create index messages_r on messages(receiver_id);
vacuum analyze users,messages;
And then:
EXPLAIN ANALYZE
SELECT user_id, count(DISTINCT m1.message_id), count(DISTINCT m2.message_id)
FROM users u
LEFT JOIN messages m1 ON m1.receiver_id = user_id
LEFT JOIN messages m2 ON m2.sender_id = user_id
GROUP BY user_id;
GroupAggregate (cost=4.39..326190.22 rows=100000 width=20) (actual time=4.023..3331.031 rows=100000 loops=1)
Group Key: u.user_id
-> Merge Left Join (cost=4.39..250190.22 rows=10000000 width=12) (actual time=3.987..2161.032 rows=9998915 loops=1)
Merge Cond: (u.user_id = m1.receiver_id)
-> Merge Left Join (cost=2.11..56522.26 rows=1000000 width=8) (actual time=3.978..515.730 rows=1000004 loops=1)
Merge Cond: (u.user_id = m2.sender_id)
-> Index Only Scan using users_pkey on users u (cost=0.29..2604.29 rows=100000 width=4) (actual time=0.016..10.149 rows=100000 loops=1)
Heap Fetches: 0
-> Index Scan using messages_s on messages m2 (cost=0.42..41168.40 rows=1000000 width=8) (actual time=0.011..397.128 rows=999996 loops=1)
-> Materialize (cost=0.42..43668.42 rows=1000000 width=8) (actual time=0.008..746.748 rows=9998810 loops=1)
-> Index Scan using messages_r on messages m1 (cost=0.42..41168.42 rows=1000000 width=8) (actual time=0.006..392.426 rows=999997 loops=1)
Execution Time: 3432.131 ms
Since I put in 100k users and 1M messages, each user has about 100 messages as sender and 100 also as receiver, which means the joins generate 100*100=10k rows per user which then have to be processed by the count(DISTINCT ...) aggregates. Postgres doesn't realize this is all unnecessary because the counts and group by's should really be moved inside the joined tables, which means this is extremely slow.
The solution is to move the aggregation inside the joined tables manually, to avoid generating all these unnecessary extra rows.
EXPLAIN ANALYZE
SELECT user_id, m1.cnt, m2.cnt
FROM users u
LEFT JOIN (SELECT receiver_id, count(*) cnt FROM messages GROUP BY receiver_id) m1 ON m1.receiver_id = user_id
LEFT JOIN (SELECT sender_id, count(*) cnt FROM messages GROUP BY sender_id) m2 ON m2.sender_id = user_id;
Hash Left Join (cost=46780.40..48846.42 rows=100000 width=20) (actual time=469.699..511.613 rows=100000 loops=1)
Hash Cond: (u.user_id = m2.sender_id)
-> Hash Left Join (cost=23391.68..25195.19 rows=100000 width=12) (actual time=237.435..262.545 rows=100000 loops=1)
Hash Cond: (u.user_id = m1.receiver_id)
-> Seq Scan on users u (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.015..5.162 rows=100000 loops=1)
-> Hash (cost=22243.34..22243.34 rows=91867 width=12) (actual time=237.252..237.253 rows=99991 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m1 (cost=20406.00..22243.34 rows=91867 width=12) (actual time=210.817..227.793 rows=99991 loops=1)
-> HashAggregate (cost=20406.00..21324.67 rows=91867 width=12) (actual time=210.815..222.794 rows=99991 loops=1)
Group Key: messages.receiver_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.010..47.173 rows=1000000 loops=1)
-> Hash (cost=22241.52..22241.52 rows=91776 width=12) (actual time=232.003..232.004 rows=99992 loops=1)
Buckets: 131072 Batches: 1 Memory Usage: 5321kB
-> Subquery Scan on m2 (cost=20406.00..22241.52 rows=91776 width=12) (actual time=205.401..222.517 rows=99992 loops=1)
-> HashAggregate (cost=20406.00..21323.76 rows=91776 width=12) (actual time=205.400..217.518 rows=99992 loops=1)
Group Key: messages_1.sender_id
Batches: 1 Memory Usage: 14353kB
-> Seq Scan on messages messages_1 (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.008..43.402 rows=1000000 loops=1)
Planning Time: 0.574 ms
Execution Time: 515.753 ms
I used a schema that is a bit different from yours, but you get the idea: instead of generating lots of duplicate rows by doing what is essentially a cross product, push aggregations into the joined tables so they return only one row per value of whatever column you're joining on, then remove the GROUP BY from the main query since it is no longer necessary.
Note that count(DISTINCT table.*) is not smart enough to understand that it can do this by looking only at the primary key of the table if there is one, so it will pull the whole row to run the distinct on it. When a table is named "message" or "question_response" it smells like it has a largish TEXT column in it, which will make this very slow. So in case you really need a count(distinct ...) you should use count(DISTINCT table.primarykey):
explain analyze SELECT count(distinct user_id) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=15.220..15.221 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=4) (actual time=0.016..5.830 rows=100000 loops=1)
Execution Time: 15.263 ms
explain analyze SELECT count(distinct users.*) from users;
Aggregate (cost=1791.00..1791.01 rows=1 width=8) (actual time=90.896..90.896 rows=1 loops=1)
-> Seq Scan on users (cost=0.00..1541.00 rows=100000 width=37) (actual time=0.038..38.497 rows=100000 loops=1)
Execution Time: 90.958 ms
The problem is the DISTINCT in the aggregate functions. PostgreSQL is not very smart about processing these.
No knowing your data model, I cannot tell if the DISTINCT is really needed. Omit it if you can.
I'm trying to take advantages of partitioning in one case:
I have table "events" which partitioned by list by field "dt_pk" which is foreign key to table "dates".
-- Schema
drop schema if exists test cascade;
create schema test;
-- Tables
create table if not exists test.dates (
id bigint primary key,
dt date not null
);
create sequence test.seq_events_id;
create table if not exists test.events
(
id bigint not null,
dt_pk bigint not null,
content_int bigint,
foreign key (dt_pk) references test.dates(id) on delete cascade,
primary key (dt_pk, id)
)
partition by list (dt_pk);
-- Partitions
create table test.events_1 partition of test.events for values in (1);
create table test.events_2 partition of test.events for values in (2);
create table test.events_3 partition of test.events for values in (3);
-- Fill tables
insert into test.dates (id, dt)
select id, dt
from (
select 1 id, '2020-01-01'::date as dt
union all
select 2 id, '2020-01-02'::date as dt
union all
select 3 id, '2020-01-03'::date as dt
) t;
do $$
declare
dts record;
begin
for dts in (
select id
from test.dates
) loop
for k in 1..10000 loop
insert into test.events (id, dt_pk, content_int)
values (nextval('test.seq_events_id'), dts.id, random_between(1, 1000000));
end loop;
commit;
end loop;
end;
$$;
vacuum analyze test.dates, test.events;
I want to run select like this:
select *
from test.events e
join test.dates d on e.dt_pk = d.id
where d.dt between '2020-01-02'::date and '2020-01-03'::date;
But in this case partition pruning doesn't work. It's clear, I don't have constant for partition key. But from documentation I know that there is partition pruning at execution time, which works with value obtained from a subquery:
Partition pruning can be performed not only during the planning of a
given query, but also during its execution. This is useful as it can
allow more partitions to be pruned when clauses contain expressions
whose values are not known at query planning time, for example,
parameters defined in a PREPARE statement, using a value obtained from
a subquery, or using a parameterized value on the inner side of a
nested loop join.
So I rewrite my query like this and I expected partitionin pruning:
select *
from test.events e
where e.dt_pk in (
select d.id
from test.dates d
where d.dt between '2020-01-02'::date and '2020-01-03'::date
);
But explain for this select says:
Hash Join (cost=1.07..833.07 rows=20000 width=24) (actual time=3.581..15.989 rows=20000 loops=1)
Hash Cond: (e.dt_pk = d.id)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.005..6.361 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.104 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.005..1.127 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.097 rows=10000 loops=1)
-> Hash (cost=1.04..1.04 rows=2 width=8) (actual time=0.006..0.006 rows=2 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.206 ms
Execution Time: 17.237 ms
So, we read all partitions. I even tried to the planner to use nested loop join, because I read in documentation "parameterized value on the inner side of a nested loop join", but it didn't work:
set enable_hashjoin to off;
set enable_mergejoin to off;
And again:
Nested Loop (cost=0.00..1443.05 rows=20000 width=24) (actual time=9.160..25.252 rows=20000 loops=1)
Join Filter: (e.dt_pk = d.id)
Rows Removed by Join Filter: 30000
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.008..6.280 rows=30000 loops=1)
-> Seq Scan on events_1 e (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.105 rows=10000 loops=1)
-> Seq Scan on events_2 e_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.008..1.047 rows=10000 loops=1)
-> Seq Scan on events_3 e_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.007..1.082 rows=10000 loops=1)
-> Materialize (cost=0.00..1.05 rows=2 width=8) (actual time=0.000..0.000 rows=2 loops=30000)
-> Seq Scan on dates d (cost=0.00..1.04 rows=2 width=8) (actual time=0.004..0.004 rows=2 loops=1)
Filter: ((dt >= '2020-01-02'::date) AND (dt <= '2020-01-03'::date))
Rows Removed by Filter: 1
Planning Time: 0.202 ms
Execution Time: 26.516 ms
Then I noticed that in every example of "partition pruning at execution time" I see only = condition, not in.
And it really works that way:
explain (analyze) select * from test.events e where e.dt_pk = (select id from test.dates where id = 2);
Append (cost=1.04..718.04 rows=30000 width=24) (actual time=0.014..3.018 rows=10000 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on dates (cost=0.00..1.04 rows=1 width=8) (actual time=0.007..0.008 rows=1 loops=1)
Filter: (id = 2)
Rows Removed by Filter: 2
-> Seq Scan on events_1 e (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
-> Seq Scan on events_2 e_1 (cost=0.00..189.00 rows=10000 width=24) (actual time=0.004..2.009 rows=10000 loops=1)
Filter: (dt_pk = $0)
-> Seq Scan on events_3 e_2 (cost=0.00..189.00 rows=10000 width=24) (never executed)
Filter: (dt_pk = $0)
Planning Time: 0.135 ms
Execution Time: 3.639 ms
And here is my final question: does partition pruning at execution time work only with subquery returning one item, or there is a way to get advantages of partition pruning with subquery returning a list?
And why doesn't it work with nested loop join, did I understand something wrong in words:
This includes values from subqueries and values from execution-time
parameters such as those from parameterized nested loop joins.
Or "parameterized nested loop joins" is something different from regular nested loop joins?
There is no partition pruning in your nested loop join because the partitioned table is on the outer side, which is always scanned completely. The inner side is scanned with the join key from the outer side as parameter (hence parameterized scan), so if the partitioned table were on the inner side of the nested loop join, partition pruning could happen.
Partition pruning with IN lists can take place if the list vales are known at plan time:
EXPLAIN (COSTS OFF)
SELECT * FROM test.events WHERE dt_pk IN (1, 2);
QUERY PLAN
---------------------------------------------------
Append
-> Seq Scan on events_1
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
-> Seq Scan on events_2
Filter: (dt_pk = ANY ('{1,2}'::bigint[]))
(5 rows)
But no attempts are made to flatten a subquery, and PostgreSQL doesn't use partition pruning, even if you force the partitioned table to be on the inner side (enable_material = off, enable_hashjoin = off, enable_mergejoin = off):
EXPLAIN (ANALYZE)
SELECT * FROM test.events WHERE dt_pk IN (SELECT 1 UNION SELECT 2);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.06..2034.09 rows=20000 width=24) (actual time=0.057..15.523 rows=20000 loops=1)
Join Filter: (events_1.dt_pk = (1))
Rows Removed by Join Filter: 40000
-> Unique (cost=0.06..0.07 rows=2 width=4) (actual time=0.026..0.029 rows=2 loops=1)
-> Sort (cost=0.06..0.07 rows=2 width=4) (actual time=0.024..0.025 rows=2 loops=1)
Sort Key: (1)
Sort Method: quicksort Memory: 25kB
-> Append (cost=0.00..0.05 rows=2 width=4) (actual time=0.006..0.009 rows=2 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=1)
-> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
-> Append (cost=0.00..642.00 rows=30000 width=24) (actual time=0.012..4.334 rows=30000 loops=2)
-> Seq Scan on events_1 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.011..1.057 rows=10000 loops=2)
-> Seq Scan on events_2 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.004..0.641 rows=10000 loops=2)
-> Seq Scan on events_3 (cost=0.00..164.00 rows=10000 width=24) (actual time=0.002..0.594 rows=10000 loops=2)
Planning Time: 0.531 ms
Execution Time: 16.567 ms
(16 rows)
I am not certain, but it may be because the tables are so small. You might want to try with bigger tables.
If you care more about get it working than the fine details, and you haven't tried this yet: you can rewrite the query to something like
explain analyze select *
from test.dates d
join test.events e on e.dt_pk = d.id
where
d.dt between '2020-01-02'::date and '2020-01-03'::date
and e.dt_pk in (extract(day from '2020-01-02'::date)::int,
extract(day from '2020-01-03'::date)::int);
which will give the expected pruning.
I have two tables:
table_1 with ~1 million lines, with columns id_t1: integer, c1_t1: varchar, etc.
table_2 with ~50 million lines, with columns id_t2: integer, ref_id_t1: integer, c1_t2: varchar, etc.
ref_id_t1 is filled with id_t1 values , however they are not linked by a foreign key as table_2 doesn't know about table_1.
I need to do a request on both table like the following:
SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%');
Without any change or with basic indexes the request takes about a minute to complete as a sequencial scan is peformed on table_2. To prevent this I created a GIN idex with gin_trgm_ops option:
CREATE EXTENSION pg_trgm;
CREATE INDEX c1_t2_gin_index ON table_2 USING gin (c1_t2, gin_trgm_ops);
However this does not solve the problem as the inner request still takes a very long time.
EXPLAIN ANALYSE SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE '%abc%'
Gives the following
Bitmap Heap Scan on table_2 t2 (cost=664.20..189671.00 rows=65058 width=4) (actual time=5101.286..22854.838 rows=69631 loops=1)
Recheck Cond: ((c1_t2 )::text ~~ '%1.1%'::text)
Rows Removed by Index Recheck: 49069703
Heap Blocks: exact=611548
-> Bitmap Index Scan on gin_trg (cost=0.00..647.94 rows=65058 width=0) (actual time=4911.125..4911.125 rows=49139334 loops=1)
Index Cond: ((c1_t2)::text ~~ '%1.1%'::text)
Planning time: 0.529 ms
Execution time: 22863.017 ms
The Bitmap Index Scan is fast, but as we need t2.ref_id_t1 PostgreSQL needs to perform an Bitmap Heap Scan which is not quick on 65000 lines of data.
The solution to avoid the Bitmap Heap Scan would be to perform an Index Only Scan. This is possible using multiple column with btree indexes, see https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html
If I change the request like to search the begining of c1_t2, even with the inner request returning 90000 lines, and if I create a btree index on c1_t2 and ref_id_t1 the request takes just over a second.
CREATE INDEX c1_t2_ref_id_t1_index
ON table_2 USING btree
(c1_t2 varchar_pattern_ops ASC NULLS LAST, ref_id_t1 ASC NULLS LAST)
EXPLAIN ANALYSE SELECT * FROM table_1 t1 WHERE t1.c1_t1= 'A' AND t1.id_t1 IN
(SELECT t2.ref_id_t1 FROM table_2 t2 WHERE t2.c1_t2 LIKE 'aaa%');
Hash Join (cost=56561.99..105233.96 rows=1 width=2522) (actual time=953.647..1068.488 rows=36 loops=1)
Hash Cond: (t1.id_t1 = t2.ref_id_t1)
-> Seq Scan on table_1 t1 (cost=0.00..48669.65 rows=615 width=2522) (actual time=0.088..667.576 rows=790 loops=1)
Filter: (c1_t1 = 'A')
Rows Removed by Filter: 1083798
-> Hash (cost=56553.74..56553.74 rows=660 width=4) (actual time=400.657..400.657 rows=69632 loops=1)
Buckets: 131072 (originally 1024) Batches: 1 (originally 1) Memory Usage: 3472kB
-> HashAggregate (cost=56547.14..56553.74 rows=660 width=4) (actual time=380.280..391.871 rows=69632 loops=1)
Group Key: t2.ref_id_t1
-> Index Only Scan using c1_t2_ref_id_t1_index on table_2 t2 (cost=0.56..53907.28 rows=1055943 width=4) (actual time=0.014..202.034 rows=974737 loops=1)
Index Cond: ((c1_t2 ~>=~ 'aaa'::text) AND (c1_t2 ~<~ 'chb'::text))
Filter: ((c1_t2 )::text ~~ 'aaa%'::text)
Heap Fetches: 0
Planning time: 1.512 ms
Execution time: 1069.712 ms
However this is not possible with gin indexes, as these indexes don't store all data in the key.
Is there a way to use pg_trmg like extension with btree index so we can have index only scan with LIKE '%abc%' requests?
I have 2 tables, t1 and t2, each with a geography type column called pts_geog, and each with a column id which is a unit identifier. I want to add a column to t1 which counts how many units from t2 are within a distance of 1000m to the any given point in t1. Both tables reasonably large, with each about 150000 rows. To compute the distance of each point in t1 to each point in t2 however results in a very expensive operation, so I am looking for some guidance as to what I'm doing has any hope. I have never been able to complete this operation because out of memory. I could split the operation somehow (with a where along another dimension of t1), but I need more help. Here is the select that I would like to use:
select
count(nullif(
ST_DWithin(
g1.pts_geog,
g2.gts_geog,
1000,
false),
false)) as close_1000
from
t1 as g1,
t2 as g2
where
g1.pts_geog IS NOT NULL
and
g2.pts_geog IS NOT NULL
GROUP BY g1.id
suggested answer and EXPLAIN:
airbnb=> EXPLAIN ANALYZE
airbnb-> SELECT t1.listing_id, count(*)
airbnb-> FROM paris as t1
airbnb-> JOIN airdna_property as t2
airbnb-> ON ST_DWithin( t1.pts_geog, t2.pts_geog,1000 )
airbnb-> WHERE t2.city='Paris'
airbnb-> group by t1.listing_id;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=1030317.33..1030386.39 rows=6906 width=8) (actual time=2802071.616..2802084.109 rows=54400 loops=1)
Group Key: t1.listing_id
-> Nested Loop (cost=0.41..1030282.80 rows=6906 width=8) (actual time=0.827..2604319.421 rows=785571807 loops=1)
-> Seq Scan on airdna_property t2 (cost=0.00..74893.44 rows=141004 width=56) (actual time=0.131..738.133 rows=141506 loops=1)
Filter: (city = 'Paris'::text)
Rows Removed by Filter: 400052
-> Index Scan using paris_pts_geog_idx on paris t1 (cost=0.41..6.77 rows=1 width=64) (actual time=0.133..17.865 rows=5552 loops=141506)
Index Cond: (pts_geog && _st_expand(t2.pts_geog, '1000'::double precision))
Filter: ((t2.pts_geog && _st_expand(pts_geog, '1000'::double precision)) AND _st_dwithin(pts_geog, t2.pts_geog, '1000'::double precision, true))
Rows Removed by Filter: 3260
Planning time: 0.197 ms
Execution time: 2802086.005 ms
output of version:
version | postgis_version
----------------------------------------------------------------------------------------------------------+---------------------------------------
PostgreSQL 9.5.7 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005, 64-bit | 2.2 USE_GEOS=1 USE_PROJ=1 USE_STATS=1
Update 2
This is after creating the indices as suggested. notice that the number of rows slightly increased because I added new data, but this is still the same size of problem. it takes 52 minutes. It still says Seq Scan on city, and I don't understand: why doesn't it do an index scan there, given I created one?
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=904989.83..905049.21 rows=5938 width=8) (actual time=3118569.759..3118581.444 rows=54400 loops=1)
Group Key: t1.listing_id
-> Nested Loop (cost=0.41..904960.14 rows=5938 width=8) (actual time=2.624..2881694.755 rows=837837851 loops=1)
-> Seq Scan on airdna_property t2 (cost=0.00..74842.84 rows=121245 width=56) (actual time=2.263..949.073 rows=151018 loops=1)
Filter: (city = 'Paris'::text)
Rows Removed by Filter: 435564
-> Index Scan using paris_pts_geog_idx on paris t1 (cost=0.41..6.84 rows=1 width=64) (actual time=0.139..18.555 rows=5548 loops=151018)
Index Cond: (pts_geog && _st_expand(t2.pts_geog, '1000'::double precision))
Filter: ((t2.pts_geog && _st_expand(pts_geog, '1000'::double precision)) AND _st_dwithin(pts_geog, t2.pts_geog, '1000'::double precision, true))
Rows Removed by Filter: 3257
Planning time: 0.377 ms
Execution time: 3118583.203 ms
(12 rows)
All you're doing is selecting the count just move the clause out of the select list to trim up the join.
SELECT t1.id, count(*)
FROM t1
JOIN t2
ON ST_DWithin( t1.pts_geog, t2.pts_geog, 1000 )
GROUP BY t1.id;
If you need an index, which ST_DWithin can use run this..
CREATE INDEX ON t1 USING gist (pts_geog);
CREATE INDEX ON t2 USING gist (pts_geog);
VACUUM ANALYZE t1;
VACUUM ANALYZE t2;
Now run the SELECT query above.
Update 2
Your plan shows that you have seq scan on city, so create an index on city and then we'll see what more we can do
CREATE INDEX ON airdna_property (city);
ANALYZE airdna_property;