LEFT JOIN doesn't play nicely with partitioned tables in Postgres?

I have a query in Postgres that joins two tables, one of which is partitioned. The execution plan showed that the query would hit all of the partitions of the table, even though I'm including the partitioning key in the query.
I finally realized the issue was a LEFT JOIN I had. If I replace A LEFT JOIN B with A, B, then Postgres targets only 1 partition instead of all of them.
Does anybody know whether this is a bug in the Postgres planner?
Here is the simplified query:
EXPLAIN SELECT in_memory_table.my_col
FROM (SELECT '2019-08-09 19:00:00+00:00') in_memory_table(my_col)
LEFT JOIN partitioned_table
ON (partitioned_table.id = '1111111111111111122222222222222'
AND partitioned_table.my_col = in_memory_table.my_col);
This yields this execution plan:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join  (cost=0.56..58.28 rows=1 width=8)
  Join Filter: (partitioned_table_201908_partition.my_col = ('2019-08-09 19:00:00+00'::timestamp with time zone))
  ->  Result  (cost=0.00..0.01 rows=1 width=8)
  ->  Append  (cost=0.56..58.17 rows=7 width=8)
        ->  Index Scan using partitioned_table_201908_partition_pkey on partitioned_table_201908_partition  (cost=0.56..8.58 rows=1 width=8)
              Index Cond: (id = '1111111111111111122222222222222'::uuid)
        ->  Index Scan using partitioned_table_201909_partition_pkey on partitioned_table_201909_partition  (cost=0.15..8.17 rows=1 width=8)
              Index Cond: (id = '1111111111111111122222222222222'::uuid)
        ->  Index Scan using partitioned_table_201910_partition_pkey on partitioned_table_201910_partition  (cost=0.15..8.17 rows=1 width=8)
              Index Cond: (id = '1111111111111111122222222222222'::uuid)
        ->  Index Scan using partitioned_table_201911_partition_pkey on partitioned_table_201911_partition  (cost=0.15..8.17 rows=1 width=8)
              Index Cond: (id = '1111111111111111122222222222222'::uuid)
        ->  Index Scan using partitioned_table_201912_partition_pkey on partitioned_table_201912_partition  (cost=0.15..8.17 rows=1 width=8)
              Index Cond: (id = '1111111111111111122222222222222'::uuid)
        ->  Index Scan using partitioned_table_202001_partition_pkey on partitioned_table_202001_partition  (cost=0.15..8.17 rows=1 width=8)
              Index Cond: (id = '1111111111111111122222222222222'::uuid)
        ->  Index Scan using partitioned_table_pkey on partitioned_table_000000_partition  (cost=0.70..8.72 rows=1 width=8)
              Index Cond: (id = '1111111111111111122222222222222'::uuid)
(18 rows)
If I instead do this:
EXPLAIN SELECT in_memory_table.my_col
FROM (SELECT '2019-08-09 19:00:00+00:00') in_memory_table(my_col),
partitioned_table
WHERE (partitioned_table.id = '1111111111111111122222222222222'
AND partitioned_table.my_col = in_memory_table.my_col);
Then the execution plan looks how I expected it to look:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Append  (cost=0.56..8.59 rows=1 width=8)
  ->  Index Scan using partitioned_table_201908_partition_pkey on partitioned_table_201908_partition  (cost=0.56..8.58 rows=1 width=8)
        Index Cond: (id = '1111111111111111122222222222222'::uuid)
        Filter: (result_timestamp = '2019-08-09 19:00:00+00'::timestamp with time zone)
Why is the LEFT JOIN version hitting all partitions?
Other info
I'm using Postgres 11.1
my_col is the partitioning key (partitioned by range)
Why am I using LEFT JOIN? This is a simplified version; the real query uses a LEFT OUTER JOIN, which is harder to ask about (I cannot transform the ON into a WHERE in that one).
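One possible workaround, sketched here untested and not part of the original question: when the partition-key value is a constant, repeat it as a restriction on the partitioned table inside the ON clause. Quals on the nullable side of a LEFT JOIN that reference only that side are pushed down to its scans, so plan-time pruning should be able to use the constant while the LEFT JOIN semantics are preserved.

```sql
-- Untested sketch: duplicate the constant partition-key value in the ON clause.
EXPLAIN SELECT in_memory_table.my_col
FROM (SELECT '2019-08-09 19:00:00+00:00') in_memory_table(my_col)
LEFT JOIN partitioned_table
ON (partitioned_table.id = '1111111111111111122222222222222'
    AND partitioned_table.my_col = in_memory_table.my_col
    AND partitioned_table.my_col = '2019-08-09 19:00:00+00:00');
```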

Related

PostgreSQL: Scan relevant partitions only with subquery

I have a table foundation.data which is partitioned over an asset_id with hash partitioning.
Whenever I do a query such as
with deleted_unprocessed_data as (
delete from foundation.unprocessed_ids d
where id = any(select id from foundation.unprocessed_ids up order by up.asset_id, up.data_point_timestamp asc limit 1000)
returning id, asset_id, data_point_timestamp
)
, ids_to_process as
(
insert into foundation.processed_ids select * from deleted_unprocessed_data
returning id, asset_id
)
select jh.asset_id,
min(jh.data_point_timestamp) as minborder,
max(jh.data_point_timestamp) as maxborder
from
(
select id, asset_id, data_point_timestamp FROM
foundation.DATA fd
where id = any(select id from ids_to_process)
and fd.asset_id = any(select asset_id from ids_to_process)
) jh
group by asset_id;
the explain will show that it accesses all partitions:
...
->  Append  (cost=0.42..1822.75 rows=256 width=20) (actual time=0.285..0.617 rows=1 loops=1000)
      ->  Index Scan using asset_id_hash_0_id_idx on asset_id_hash_0 fd  (cost=0.42..7.10 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1000)
            Index Cond: (id = ids_to_process.id)
      ->  Index Scan using asset_id_hash_1_id_idx on asset_id_hash_1 fd_1  (cost=0.42..6.66 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1000)
            Index Cond: (id = ids_to_process.id)
      ...
      ->  Index Scan using asset_id_hash_72_id_idx on asset_id_hash_72 fd_72  (cost=0.43..8.35 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1000)
            Index Cond: (id = ids_to_process.id)
      ...
      ->  Index Scan using asset_id_hash_255_id_idx on asset_id_hash_255 fd_255  (cost=0.42..6.87 rows=1 width=20) (actual time=0.002..0.002 rows=0 loops=1000)
            Index Cond: (id = ids_to_process.id)
How do I force the planner to only access the relevant partitions?
The subquery is accessing only one partition:
->  Hash  (cost=145.14..145.14 rows=200 width=4)
      ->  Group  (cost=0.00..143.14 rows=200 width=4)
            Group Key: asset_7800_1.asset_id
            ->  Seq Scan on asset_7800 asset_7800_1  (cost=0.00..135.11 rows=3209 width=4)
                  Filter: (asset_id = 7800)
The outer query on the other hand is accessing all partitions. I think this is because the planner can't be sure that the subquery will return rows belonging to only one partition.
I don't think there is an obvious way to do this, unfortunately. Try running the EXPLAIN with the ANALYZE option. That will tell you what Postgres actually did, and you might see that it only ended up scanning one partition.
Answering my own question here, but I don't understand the dynamics behind this, so I'm happy for hints from your side as to why it changes so much.
I used a join instead:
with deleted_unprocessed_data as (
delete from foundation.unprocessed_ids d
where id = any(select id from foundation.unprocessed_ids up order by up.asset_id, up.data_point_timestamp asc limit 1000
-- case
-- when numberOfItems<1000 then numberOfItems
-- when numberOfItems>999 then 1000
-- end
)
returning id, asset_id, data_point_timestamp
)
, ids_to_process as
(
insert into foundation.processed_ids select * from deleted_unprocessed_data
returning id, asset_id
)
select jh.asset_id,
min(jh.data_point_timestamp) as minborder,
max(jh.data_point_timestamp) as maxborder
from
(
select fd.id, fd.asset_id, fd.data_point_timestamp FROM
foundation.DATA fd
join ids_to_process itp
on fd.asset_id = itp.asset_id
and fd.id = itp.id
) jh
group by asset_id;
Then I get the following EXPLAIN ANALYZE output, which shows that the index scans on all but one partition were never executed:
...
->  Append  (cost=0.42..1823.39 rows=256 width=20) (actual time=0.002..0.003 rows=1 loops=1000)
      ->  Index Scan using asset_id_hash_0_id_idx on asset_id_hash_0 fd  (cost=0.42..7.10 rows=1 width=20) (never executed)
            Index Cond: (id = itp.id)
            Filter: (itp.asset_id = asset_id)
      ->  Index Scan using asset_id_hash_1_id_idx on asset_id_hash_1 fd_1  (cost=0.42..6.67 rows=1 width=20) (never executed)
            Index Cond: (id = itp.id)
            Filter: (itp.asset_id = asset_id)
      ...
      ->  Index Scan using asset_id_hash_72_id_idx on asset_id_hash_72 fd_72  (cost=0.43..8.35 rows=1 width=20) (actual time=0.002..0.002 rows=1 loops=1000)
            Index Cond: (id = itp.id)
            Filter: (itp.asset_id = asset_id)
      ...
      ->  Index Scan using asset_id_hash_255_id_idx on asset_id_hash_255 fd_255  (cost=0.42..6.87 rows=1 width=20) (never executed)
            Index Cond: (id = itp.id)
            Filter: (itp.asset_id = asset_id)
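A plausible explanation (my reading, not verified against this exact setup): with `= any(subquery)`, the subquery result is not known at plan time and is executed as a hashed SubPlan filter, which does not trigger partition pruning. Rewriting it as a join lets the planner build a parameterized nested loop over the partitioned table, and on PostgreSQL 11+ the executor can then prune partitions per outer row, which is exactly what the `(never executed)` annotations indicate. A minimal sketch of that shape, with assumed table names:

```sql
-- Assumed minimal setup: a hash-partitioned table plus a driving table of ids.
CREATE TABLE data (id bigint, asset_id int) PARTITION BY HASH (asset_id);
CREATE TABLE data_p0 PARTITION OF data FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE data_p1 PARTITION OF data FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE data_p2 PARTITION OF data FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE data_p3 PARTITION OF data FOR VALUES WITH (MODULUS 4, REMAINDER 3);
CREATE TABLE ids (id bigint, asset_id int);

-- The join form can be planned as a nested loop with the partition key as a
-- join parameter; partitions that never match show up as "(never executed)".
EXPLAIN (ANALYZE)
SELECT d.*
FROM data d
JOIN ids i ON d.asset_id = i.asset_id AND d.id = i.id;
```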

Postgres index issue

We are working on a platform right now and I'm having difficulty optimizing one of our queries; it's actually a view.
When explaining the query a full table scan is performed on my user_role table.
The query behind the view looks like what I've posted below (I used * for select instead of different columns of the different tables, just to showcase my issue). All results are based on that simplified query.
SELECT t.*
FROM task t
LEFT JOIN project p ON t.project_id = p.id
LEFT JOIN company comp ON comp.id = p.company_id
LEFT JOIN company_users cu ON cu.companies_id = comp.id
LEFT JOIN user_table u ON u.email= cu.users_email
LEFT JOIN user_role role ON u.email= role.user_email
WHERE lower(t.type) = 'project'
AND t.project_id IS NOT NULL
AND role.role::text in ('SOME_ROLE_1', 'SOME_ROLE_2')
Basically this query runs fine like this. If I EXPLAIN the query, it uses all my indexes in place. Nice!
But as soon as I start adding extra WHERE clauses, the issue arises. I simply add this:
and u.email = 'my#email.com'
and comp.id = 4
and p.id = 3
And all of a sudden, a full table scan is performed on the company_users and user_role tables, but not on the project table.
The full query plan is:
Nested Loop  (cost=0.98..22.59 rows=1 width=97) (actual time=0.115..4.448 rows=189 loops=1)
  ->  Nested Loop  (cost=0.84..22.02 rows=1 width=632) (actual time=0.099..3.091 rows=252 loops=1)
        ->  Nested Loop  (cost=0.70..20.10 rows=1 width=613) (actual time=0.082..1.774 rows=252 loops=1)
              ->  Nested Loop  (cost=0.56..19.81 rows=1 width=621) (actual time=0.068..0.919 rows=252 loops=1)
                    ->  Nested Loop  (cost=0.43..19.62 rows=1 width=101) (actual time=0.058..0.504 rows=63 loops=1)
                          ->  Index Scan using task_project_id_index on task t  (cost=0.28..11.43 rows=1 width=97) (actual time=0.041..0.199 rows=63 loops=1)
                                Index Cond: (project_id IS NOT NULL)
                                Filter: (lower((type)::text) = 'project'::text)
                          ->  Index Scan using project_id_uindex on project p  (cost=0.15..8.17 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=63)
                                Index Cond: (id = t.project_id)
                    ->  Index Scan using company_users_companies_id_index on company_users cu  (cost=0.14..0.17 rows=1 width=520) (actual time=0.002..0.004 rows=4 loops=63)
                          Index Cond: (companies_id = p.company_id)
              ->  Index Only Scan using company_id_index on company comp  (cost=0.14..0.29 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=252)
                    Index Cond: (id = p.company_id)
                    Heap Fetches: 252
        ->  Index Only Scan using user_table_email_index on user_table u  (cost=0.14..1.81 rows=1 width=19) (actual time=0.004..0.004 rows=1 loops=252)
              Index Cond: (email = (cu.users_email)::text)
              Heap Fetches: 252
  ->  Index Scan using user_role_user_email_index on user_role role  (cost=0.14..0.56 rows=1 width=516) (actual time=0.004..0.004 rows=1 loops=252)
        Index Cond: ((user_email)::text = (u.email)::text)
        Filter: ((role)::text = ANY ('{COMPANY_ADMIN,COMPANY_USER}'::text[]))
        Rows Removed by Filter: 0
Planning time: 2.581 ms
Execution time: 4.630 ms
The explanation for company_users in particular is:
SEQ_SCAN (Seq Scan) table: company_users; 1 1.44 0.0 Parent Relationship = Inner;
Parallel Aware = false;
Alias = cu;
Plan Width = 520;
Filter = ((companies_id = 4) AND ((users_email)::text = 'my#email.com'::text));
However I already created an index on the company_users table: create index if not exists company_users_users_email_companies_id_index on company_users (users_email, companies_id);.
The same goes for the user_role table.
The explanation is
SEQ_SCAN (Seq Scan) table: user_role; 1 1.45 0.0 Parent Relationship = Inner;
Parallel Aware = false;
Alias = role;
Plan Width = 516;
Filter = (((role)::text = ANY ('{SOME_ROLE_1,SOME_ROLE_2}'::text[])) AND ((user_email)::text = 'my#email.com'::text));
And thus for user_role I also have an index in place on columns role and user_email: create index if not exists user_role_role_user_email_index on user_role (role, user_email);
I did not expect it to change anything, but I also tried adapting the query to include only a single role in the WHERE clause, and tried OR statements instead of IN; none of it makes a difference!
The weird thing to me is that without my extra 3 WHERE filters it works perfectly, and as soon as I add them it does not, even though I have indexes created for the fields mentioned in the explanations...
We are using exactly the same approach in a few other queries, and they all suffer the same issue on those two fields...
So the big question is, how can I further improve my queries and make it use the index instead of a full table scan?
Create an index and calculate statistics:
CREATE INDEX ON task (lower(type)) WHERE project_id IS NOT NULL;
ANALYZE task;
That should improve the estimate, so that PostgreSQL chooses a better join strategy.
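The key point is the statistics on the expression rather than the index itself: without an expression index, Postgres has no statistics for lower(type) and falls back to a default selectivity estimate, which here made the planner expect a single row from task and pick cascading nested loops. A quick way to check whether the estimate improved (a sketch, comparing estimated rows against actual rows):

```sql
-- Sketch: after CREATE INDEX ... (lower(type)) and ANALYZE, the estimated row
-- count for this predicate should be close to the actual count in the output.
EXPLAIN ANALYZE
SELECT *
FROM task t
WHERE lower(t.type) = 'project'
  AND t.project_id IS NOT NULL;
```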

Postgres using unperformant plan when rerun

I'm importing a non circular graph and flattening the ancestors to an array per code. This works fine (for a bit): ~45s for 400k codes over ~900k edges.
However, after the first successful execution Postgres decides to stop using the Nested Loop and the update query performance drops drastically: ~2s per code.
I can force the issue by putting a VACUUM right before the UPDATE, but I am curious why the plan degrades in the first place.
DROP TABLE IF EXISTS tmp_anc;
DROP TABLE IF EXISTS tmp_rel;
DROP TABLE IF EXISTS tmp_edges;
DROP TABLE IF EXISTS tmp_codes;
CREATE TABLE tmp_rel (
    from_id BIGINT,
    to_id BIGINT
);
COPY tmp_rel FROM 'rel.txt' WITH DELIMITER E'\t' CSV HEADER;
CREATE TABLE tmp_edges(
start_node BIGINT,
end_node BIGINT
);
INSERT INTO tmp_edges(start_node, end_node)
SELECT from_id AS start_node, to_id AS end_node
FROM tmp_rel;
CREATE INDEX tmp_edges_end ON tmp_edges (end_node);
CREATE TABLE tmp_codes (
    id BIGINT,
    active SMALLINT
);
COPY tmp_codes FROM 'codes.txt' WITH DELIMITER E'\t' CSV HEADER;
CREATE TABLE tmp_anc(
code BIGINT,
ancestors BIGINT[]
);
INSERT INTO tmp_anc
SELECT DISTINCT(id)
FROM tmp_codes
WHERE active = 1;
CREATE INDEX tmp_anc_codes ON tmp_anc (code);
VACUUM; -- Need this for the update to execute optimally
UPDATE tmp_anc sa SET ancestors = (
WITH RECURSIVE ancestors(code) AS (
SELECT start_node FROM tmp_edges WHERE end_node = sa.code
UNION
SELECT se.start_node
FROM tmp_edges se, ancestors a
WHERE se.end_node = a.code
)
SELECT array_agg(code) FROM ancestors
);
Table stats:
tmp_rel 507 MB 0 bytes
tmp_edges 74 MB 37 MB
tmp_codes 32 MB 0 bytes
tmp_anc 22 MB 8544 kB
Explains:
Without VACUUM before UPDATE:
Update on tmp_anc sa  (cost=10000000000.00..11081583053.74 rows=10 width=46) (actual time=38294.005..38294.005 rows=0 loops=1)
  ->  Seq Scan on tmp_anc sa  (cost=10000000000.00..11081583053.74 rows=10 width=46) (actual time=3300.974..38292.613 rows=10 loops=1)
        SubPlan 2
          ->  Aggregate  (cost=108158305.25..108158305.26 rows=1 width=32) (actual time=3829.253..3829.253 rows=1 loops=10)
                CTE ancestors
                  ->  Recursive Union  (cost=81.97..66015893.05 rows=1872996098 width=8) (actual time=0.037..3827.917 rows=45 loops=10)
                        ->  Bitmap Heap Scan on tmp_edges  (cost=81.97..4913.18 rows=4328 width=8) (actual time=0.022..0.022 rows=2 loops=10)
                              Recheck Cond: (end_node = sa.code)
                              Heap Blocks: exact=12
                              ->  Bitmap Index Scan on tmp_edges_end  (cost=0.00..80.89 rows=4328 width=0) (actual time=0.014..0.014 rows=2 loops=10)
                                    Index Cond: (end_node = sa.code)
                        ->  Merge Join  (cost=4198.89..2855105.79 rows=187299177 width=8) (actual time=163.746..425.295 rows=10 loops=90)
                              Merge Cond: (a.code = se.end_node)
                              ->  Sort  (cost=4198.47..4306.67 rows=43280 width=8) (actual time=0.012..0.016 rows=5 loops=90)
                                    Sort Key: a.code
                                    Sort Method: quicksort  Memory: 25kB
                                    ->  WorkTable Scan on ancestors a  (cost=0.00..865.60 rows=43280 width=8) (actual time=0.000..0.001 rows=5 loops=90)
                              ->  Materialize  (cost=0.42..43367.08 rows=865523 width=16) (actual time=0.010..337.592 rows=537171 loops=90)
                                    ->  Index Scan using tmp_edges_end on edges se  (cost=0.42..41203.27 rows=865523 width=16) (actual time=0.009..247.547 rows=537171 loops=90)
                ->  CTE Scan on ancestors  (cost=0.00..37459921.96 rows=1872996098 width=8) (actual time=1.227..3829.159 rows=45 loops=10)
With VACUUM before UPDATE:
Update on tmp_anc sa  (cost=0.00..2949980136.43 rows=387059 width=14) (actual time=74701.329..74701.329 rows=0 loops=1)
  ->  Seq Scan on tmp_anc sa  (cost=0.00..2949980136.43 rows=387059 width=14) (actual time=0.336..70324.848 rows=387059 loops=1)
        SubPlan 2
          ->  Aggregate  (cost=7621.50..7621.51 rows=1 width=8) (actual time=0.180..0.180 rows=1 loops=387059)
                CTE ancestors
                  ->  Recursive Union  (cost=0.42..7583.83 rows=1674 width=8) (actual time=0.005..0.162 rows=32 loops=387059)
                        ->  Index Scan using tmp_edges_end on tmp_edges  (cost=0.42..18.93 rows=4 width=8) (actual time=0.004..0.005 rows=2 loops=387059)
                              Index Cond: (end_node = sa.code)
                        ->  Nested Loop  (cost=0.42..753.14 rows=167 width=8) (actual time=0.003..0.019 rows=10 loops=2700448)
                              ->  WorkTable Scan on ancestors a  (cost=0.00..0.80 rows=40 width=8) (actual time=0.000..0.001 rows=5 loops=2700448)
                              ->  Index Scan using tmp_edges_end on tmp_edges se  (cost=0.42..18.77 rows=4 width=16) (actual time=0.003..0.003 rows=2 loops=12559395)
                                    Index Cond: (end_node = a.code)
                ->  CTE Scan on ancestors  (cost=0.00..33.48 rows=1674 width=8) (actual time=0.007..0.173 rows=32 loops=387059)
The first execution plan has really bad estimates (Bitmap Index Scan on tmp_edges_end estimates 4328 instead of 2 rows), while the second execution has good estimates and thus chooses a good plan.
So something between the two executions you quote above must have changed the estimates.
Moreover, you say that the first execution of the UPDATE (for which we have no EXPLAIN (ANALYZE) output) was fast.
The only good explanation for the initial performance drop is that it takes the autovacuum daemon some time to collect statistics for the new tables. This normally improves query performance, but of course it can also work the other way around.
Also, a VACUUM usually doesn't fix performance issues. Could it be that you used VACUUM (ANALYZE)?
It would be interesting to know how things are when you collect statistics before your initial UPDATE:
ANALYZE tmp_edges;
When I read your query more closely, however, I wonder why you use a correlated subquery for that. Maybe it would be faster to do something like:
UPDATE tmp_anc sa
SET ancestors = a.codes
FROM (WITH RECURSIVE ancestors(code, start_node) AS
(SELECT tmp_anc.code, tmp_edges.start_node
FROM tmp_edges
JOIN tmp_anc ON tmp_edges.end_node = tmp_anc.code
UNION
SELECT a.code, se.start_node
FROM tmp_edges se
JOIN ancestors a ON se.end_node = a.code
)
SELECT code,
array_agg(start_node) AS codes
FROM ancestors
GROUP BY (code)
) a
WHERE sa.code = a.code;
(This is untested, so there may be mistakes.)

Loose index scan in Postgres on more than one field?

I have several large tables in Postgres 9.2 (millions of rows) where I need to generate a unique code based on the combination of two fields, 'source' (varchar) and 'id' (int). I can do this by generating row_numbers over the result of:
SELECT source,id FROM tablename GROUP BY source,id
but the results can take a while to process. It has been recommended that if the fields are indexed, and there are a proportionally small number of index values (which is my case), that a loose index scan may be a better option: http://wiki.postgresql.org/wiki/Loose_indexscan
WITH RECURSIVE
t AS (SELECT min(col) AS col FROM tablename
UNION ALL
SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
SELECT col FROM t WHERE col IS NOT NULL
UNION ALL
SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);
The example operates on a single field though. Trying to return more than one field generates an error: subquery must return only one column. One possibility might be to try retrieving an entire ROW - e.g. SELECT ROW(min(source),min(id)..., but then I'm not sure what the syntax of the WHERE statement would need to look like to work with individual row elements.
The question is: can the recursion-based code be modified to work with more than one column, and if so, how? I'm committed to using Postgres, but it looks like MySQL has implemented loose index scans for more than one column: http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html
As recommended, I'm attaching my EXPLAIN ANALYZE results.
For my situation - where I'm selecting distinct values for 2 columns using GROUP BY, it's the following:
HashAggregate  (cost=1645408.44..1654099.65 rows=869121 width=34) (actual time=35411.889..36008.475 rows=1233080 loops=1)
  ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=34) (actual time=4413.311..25450.840 rows=22025768 loops=1)
Total runtime: 36127.789 ms
(3 rows)
I don't know how to do a 2-column index scan (that's the question), but for purposes of comparison, using a GROUP BY on one column, I get:
HashAggregate  (cost=1590346.70..1590347.69 rows=99 width=8) (actual time=32310.706..32310.722 rows=100 loops=1)
  ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=8) (actual time=4764.609..26941.832 rows=22025768 loops=1)
Total runtime: 32350.899 ms
(3 rows)
But for a loose index scan on one column, I get:
Result  (cost=181.28..198.07 rows=101 width=8) (actual time=0.069..1.935 rows=100 loops=1)
  CTE t
    ->  Recursive Union  (cost=1.74..181.28 rows=101 width=8) (actual time=0.062..1.855 rows=101 loops=1)
          ->  Result  (cost=1.74..1.75 rows=1 width=0) (actual time=0.061..0.061 rows=1 loops=1)
                InitPlan 1 (returns $1)
                  ->  Limit  (cost=0.00..1.74 rows=1 width=8) (actual time=0.057..0.057 rows=1 loops=1)
                        ->  Index Only Scan using tablename_id on tablename  (cost=0.00..38379014.12 rows=22024696 width=8) (actual time=0.055..0.055 rows=1 loops=1)
                              Index Cond: (id IS NOT NULL)
                              Heap Fetches: 0
          ->  WorkTable Scan on t  (cost=0.00..17.75 rows=10 width=8) (actual time=0.017..0.017 rows=1 loops=101)
                Filter: (id IS NOT NULL)
                Rows Removed by Filter: 0
                SubPlan 3
                  ->  Result  (cost=1.75..1.76 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=100)
                        InitPlan 2 (returns $3)
                          ->  Limit  (cost=0.00..1.75 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=100)
                                ->  Index Only Scan using tablename_id on tablename  (cost=0.00..12811462.41 rows=7341565 width=8) (actual time=0.015..0.015 rows=1 loops=100)
                                      Index Cond: ((id IS NOT NULL) AND (id > t.id))
                                      Heap Fetches: 0
  ->  Append  (cost=0.00..16.79 rows=101 width=8) (actual time=0.067..1.918 rows=100 loops=1)
        ->  CTE Scan on t  (cost=0.00..2.02 rows=100 width=8) (actual time=0.067..1.899 rows=100 loops=1)
              Filter: (id IS NOT NULL)
              Rows Removed by Filter: 1
        ->  Result  (cost=13.75..13.76 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
              One-Time Filter: $5
              InitPlan 5 (returns $5)
                ->  Index Only Scan using tablename_id on tablename  (cost=0.00..13.75 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                      Index Cond: (id IS NULL)
                      Heap Fetches: 0
Total runtime: 2.040 ms
The full table definition looks like this:
CREATE TABLE tablename
(
source character(25),
id bigint NOT NULL,
time_ timestamp without time zone,
height numeric,
lon numeric,
lat numeric,
distance numeric,
status character(3),
geom geometry(PointZ,4326),
relid bigint
)
WITH (
OIDS=FALSE
);
CREATE INDEX tablename_height
ON public.tablename
USING btree
(height);
CREATE INDEX tablename_geom
ON public.tablename
USING gist
(geom);
CREATE INDEX tablename_id
ON public.tablename
USING btree
(id);
CREATE INDEX tablename_lat
ON public.tablename
USING btree
(lat);
CREATE INDEX tablename_lon
ON public.tablename
USING btree
(lon);
CREATE INDEX tablename_relid
ON public.tablename
USING btree
(relid);
CREATE INDEX tablename_sid
ON public.tablename
USING btree
(source COLLATE pg_catalog."default", id);
CREATE INDEX tablename_source
ON public.tablename
USING btree
(source COLLATE pg_catalog."default");
CREATE INDEX tablename_time
ON public.tablename
USING btree
(time_);
Answer selection:
I took some time in comparing the approaches that were provided. It's at times like this that I wish that more than one answer could be accepted, but in this case, I'm giving the tick to #jjanes. The reason for this is that his solution matches the question as originally posed more closely, and I was able to get some insights as to the form of the required WHERE statement. In the end, the HashAggregate is actually the fastest approach (for me), but that's due to the nature of my data, not any problems with the algorithms. I've attached the EXPLAIN ANALYZE for the different approaches below, and will be giving +1 to both jjanes and joop.
HashAggregate:
HashAggregate  (cost=1018669.72..1029722.08 rows=1105236 width=34) (actual time=24164.735..24686.394 rows=1233080 loops=1)
  ->  Seq Scan on tablename  (cost=0.00..908548.48 rows=22024248 width=34) (actual time=0.054..14639.931 rows=22024982 loops=1)
Total runtime: 24787.292 ms
Loose Index Scan modification
CTE Scan on t  (cost=13.84..15.86 rows=100 width=112) (actual time=0.916..250311.164 rows=1233080 loops=1)
  Filter: (source IS NOT NULL)
  Rows Removed by Filter: 1
  CTE t
    ->  Recursive Union  (cost=0.00..13.84 rows=101 width=112) (actual time=0.911..249295.872 rows=1233081 loops=1)
          ->  Limit  (cost=0.00..0.04 rows=1 width=34) (actual time=0.910..0.911 rows=1 loops=1)
                ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..965442.32 rows=22024248 width=34) (actual time=0.908..0.908 rows=1 loops=1)
                      Heap Fetches: 0
          ->  WorkTable Scan on t  (cost=0.00..1.18 rows=10 width=112) (actual time=0.201..0.201 rows=1 loops=1233081)
                Filter: (source IS NOT NULL)
                Rows Removed by Filter: 0
                SubPlan 1
                  ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                        ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                              Index Cond: (ROW(source, id) > ROW(t.source, t.id))
                              Heap Fetches: 0
                SubPlan 2
                  ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.099..0.099 rows=1 loops=1233080)
                        ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.098..0.098 rows=1 loops=1233080)
                              Index Cond: (ROW(source, id) > ROW(t.source, t.id))
                              Heap Fetches: 0
Total runtime: 250491.559 ms
Merge Anti Join
Merge Anti Join  (cost=0.00..12099015.26 rows=14682832 width=42) (actual time=48.710..541624.677 rows=1233080 loops=1)
  Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
  Join Filter: (nx.time_ > src.time_)
  Rows Removed by Join Filter: 363464177
  ->  Index Only Scan using tablename_pkey on tablename src  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=48.566..5064.551 rows=22024982 loops=1)
        Heap Fetches: 0
  ->  Materialize  (cost=0.00..1115255.89 rows=22024248 width=42) (actual time=0.011..40551.997 rows=363464177 loops=1)
        ->  Index Only Scan using tablename_pkey on tablename nx  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=0.008..8258.890 rows=22024982 loops=1)
              Heap Fetches: 0
Total runtime: 541750.026 ms
Rather hideous, but this seems to work:
WITH RECURSIVE
t AS (
select a,b from (select a,b from foo order by a,b limit 1) asdf union all
select (select a from foo where (a,b) > (t.a,t.b) order by a,b limit 1),
(select b from foo where (a,b) > (t.a,t.b) order by a,b limit 1)
from t where t.a is not null)
select * from t where t.a is not null;
I don't really understand why the "is not null"s are needed; where do the nulls come from in the first place?
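The NULLs come from the scalar subqueries: once t holds the last (a,b) pair, `(select a from foo where (a,b) > (t.a,t.b) ...)` finds no row, and a scalar subquery that returns no row yields NULL, so one all-NULL row is appended before the `is not null` checks stop the recursion and filter it out. A variant of the same idea (my untested sketch, against the same assumed table foo) that fetches both columns with a single subquery per step by returning a whole-row value:

```sql
-- Untested sketch: one scalar subquery per step, returning the whole row of
-- foo as a composite value c, then unpacked with (c).a and (c).b.
WITH RECURSIVE t AS (
    SELECT a, b FROM foo ORDER BY a, b LIMIT 1
    UNION ALL
    SELECT (s.c).a, (s.c).b
    FROM (SELECT (SELECT f FROM foo f
                  WHERE (f.a, f.b) > (t.a, t.b)
                  ORDER BY f.a, f.b LIMIT 1) AS c
          FROM t) s
    WHERE (s.c).a IS NOT NULL
)
SELECT * FROM t;
```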
DROP SCHEMA zooi CASCADE;
CREATE SCHEMA zooi ;
SET search_path=zooi,public,pg_catalog;
CREATE TABLE tablename
( source character(25) NOT NULL
, id bigint NOT NULL
, time_ timestamp without time zone NOT NULL
, height numeric
, lon numeric
, lat numeric
, distance numeric
, status character(3)
, geom geometry(PointZ,4326)
, relid bigint
, PRIMARY KEY (source,id,time_) -- <<-- Primary key here
) WITH ( OIDS=FALSE);
-- invent some bogus data
INSERT INTO tablename(source,id,time_)
SELECT 'SRC_'|| (gs%10)::text
,gs/10
,gt
FROM generate_series(1,1000) gs
, generate_series('2013-12-01', '2013-12-07', '1hour'::interval) gt
;
Select unique values for two key fields:
VACUUM ANALYZE tablename;
EXPLAIN ANALYZE
SELECT source,id,time_
FROM tablename src
WHERE NOT EXISTS (
SELECT * FROM tablename nx
WHERE nx.source =src.source
AND nx.id = src.id
AND time_ > src.time_
)
;
Generates this plan here (Pg-9.3):
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join  (cost=4981.00..12837.82 rows=96667 width=42) (actual time=547.218..1194.335 rows=1000 loops=1)
  Hash Cond: ((src.source = nx.source) AND (src.id = nx.id))
  Join Filter: (nx.time_ > src.time_)
  Rows Removed by Join Filter: 145000
  ->  Seq Scan on tablename src  (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.010..210.810 rows=145000 loops=1)
  ->  Hash  (cost=2806.00..2806.00 rows=145000 width=42) (actual time=546.497..546.497 rows=145000 loops=1)
        Buckets: 16384  Batches: 1  Memory Usage: 9063kB
        ->  Seq Scan on tablename nx  (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.006..259.864 rows=145000 loops=1)
Total runtime: 1197.374 ms
(9 rows)
The hash-joins will probably disappear once the data outgrows the work_mem:
Merge Anti Join  (cost=0.83..8779.56 rows=29832 width=120) (actual time=0.981..2508.912 rows=1000 loops=1)
  Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
  Join Filter: (nx.time_ > src.time_)
  Rows Removed by Join Filter: 184051
  ->  Index Scan using tablename_sid on tablename src  (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.055..250.621 rows=145000 loops=1)
  ->  Index Scan using tablename_sid on tablename nx  (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.008..603.403 rows=328906 loops=1)
Total runtime: 2510.505 ms
Lateral joins give you clean code for selecting multiple columns in nested selects, with no null checks needed, since there are no subqueries in the select clause.
-- Assuming you want to get one '(a,b)' for every 'a'.
with recursive t as (
(select a, b from foo order by a, b limit 1)
union all
(select s.* from t, lateral(
select a, b from foo f
where f.a > t.a
order by a, b limit 1) s)
)
select * from t;

Postgres Slow group by query with max

I am using Postgres 9.1 and I have a table with about 3.5M rows of eventtype (varchar) and eventtime (timestamp), plus some other fields. There are only about 20 different eventtypes, and the event times span about 4 years.
I want to get the last timestamp of each event type. If I run a query like:
select eventtype, max(eventtime)
from allevents
group by eventtype
it takes around 20 seconds. Selecting distinct eventtypes is equally slow. The query plan shows a full sequential scan of the table, so it's not surprising that it is slow.
Explain analyse for the above query gives:
HashAggregate  (cost=84591.47..84591.68 rows=21 width=21) (actual time=20918.131..20918.141 rows=21 loops=1)
  ->  Seq Scan on allevents  (cost=0.00..66117.98 rows=3694698 width=21) (actual time=0.021..4831.793 rows=3694392 loops=1)
Total runtime: 20918.204 ms
If I add a where clause to select a specific eventtype, it takes anywhere from 40ms to 150ms which is at least decent.
Query plan when selecting specific eventtype:
GroupAggregate  (cost=343.87..24942.71 rows=1 width=21) (actual time=98.397..98.397 rows=1 loops=1)
  ->  Bitmap Heap Scan on allevents  (cost=343.87..24871.07 rows=14325 width=21) (actual time=6.820..89.610 rows=19736 loops=1)
        Recheck Cond: ((eventtype)::text = 'TEST_EVENT'::text)
        ->  Bitmap Index Scan on allevents_idx2  (cost=0.00..340.28 rows=14325 width=0) (actual time=6.121..6.121 rows=19736 loops=1)
              Index Cond: ((eventtype)::text = 'TEST_EVENT'::text)
Total runtime: 98.482 ms
Primary key is (eventtype, eventtime). I also have the following indexes:
allevents_idx (event time desc, eventtype)
allevents_idx2 (eventtype).
How can I speed up the query?
The query plan for the correlated subquery suggested by #denis below, with 14 manually entered values, gives:
Function Scan on unnest val  (cost=0.00..185.40 rows=100 width=32) (actual time=0.121..8983.134 rows=14 loops=1)
  SubPlan 2
    ->  Result  (cost=1.83..1.84 rows=1 width=0) (actual time=641.644..641.645 rows=1 loops=14)
          InitPlan 1 (returns $1)
            ->  Limit  (cost=0.00..1.83 rows=1 width=8) (actual time=641.640..641.641 rows=1 loops=14)
                  ->  Index Scan using allevents_idx on allevents  (cost=0.00..322672.36 rows=175938 width=8) (actual time=641.638..641.638 rows=1 loops=14)
                        Index Cond: ((eventtime IS NOT NULL) AND ((eventtype)::text = val.val))
Total runtime: 8983.203 ms
Using the recursive query suggested by @jjanes, the query runs between 4 and 5 seconds with the following plan:
CTE Scan on t (cost=260.32..448.63 rows=101 width=32) (actual time=0.146..4325.598 rows=22 loops=1)
CTE t
-> Recursive Union (cost=2.52..260.32 rows=101 width=32) (actual time=0.075..1.449 rows=22 loops=1)
-> Result (cost=2.52..2.53 rows=1 width=0) (actual time=0.074..0.074 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Limit (cost=0.00..2.52 rows=1 width=13) (actual time=0.070..0.071 rows=1 loops=1)
-> Index Scan using allevents_idx2 on allevents (cost=0.00..9315751.37 rows=3696851 width=13) (actual time=0.070..0.070 rows=1 loops=1)
Index Cond: ((eventtype)::text IS NOT NULL)
-> WorkTable Scan on t (cost=0.00..25.58 rows=10 width=32) (actual time=0.059..0.060 rows=1 loops=22)
Filter: (eventtype IS NOT NULL)
SubPlan 3
-> Result (cost=2.53..2.54 rows=1 width=0) (actual time=0.059..0.059 rows=1 loops=21)
InitPlan 2 (returns $3)
-> Limit (cost=0.00..2.53 rows=1 width=13) (actual time=0.057..0.057 rows=1 loops=21)
-> Index Scan using allevents_idx2 on allevents (cost=0.00..3114852.66 rows=1232284 width=13) (actual time=0.055..0.055 rows=1 loops=21)
Index Cond: (((eventtype)::text IS NOT NULL) AND ((eventtype)::text > t.eventtype))
SubPlan 6
-> Result (cost=1.83..1.84 rows=1 width=0) (actual time=196.549..196.549 rows=1 loops=22)
InitPlan 5 (returns $6)
-> Limit (cost=0.00..1.83 rows=1 width=8) (actual time=196.546..196.546 rows=1 loops=22)
-> Index Scan using allevents_idx on allevents (cost=0.00..322946.21 rows=176041 width=8) (actual time=196.544..196.544 rows=1 loops=22)
Index Cond: ((eventtime IS NOT NULL) AND ((eventtype)::text = t.eventtype))
Total runtime: 4325.694 ms
What you need is a "skip scan" or "loose index scan". PostgreSQL's planner does not yet implement those automatically, but you can trick it into using one with a recursive query.
WITH RECURSIVE t AS (
    SELECT min(eventtype) AS eventtype FROM allevents
    UNION ALL
    SELECT (SELECT min(eventtype) FROM allevents WHERE eventtype > t.eventtype)
    FROM t WHERE t.eventtype IS NOT NULL
)
SELECT eventtype,
       (SELECT max(eventtime) FROM allevents WHERE eventtype = t.eventtype)
FROM t;
There may be a way to collapse the max(eventtime) into the recursive query rather than doing it outside that query, but if so I have not hit upon it.
This needs an index on (eventtype, eventtime) in order to be efficient. You can have it be DESC on the eventtime, but that is not necessary. This is efficient only if eventtype has only a few distinct values (21 of them, in your case).
Based on the question you already have the relevant index.
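To see the shape of the recursive skip-scan trick end to end, here is a minimal runnable sketch. It uses SQLite through Python's stdlib sqlite3 module rather than Postgres (both support recursive CTEs, so the query carries over almost verbatim); the table and column names mirror the question, and the sample rows are made up for illustration:

```python
import sqlite3

# In-memory stand-in for the allevents table from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE allevents (eventtype TEXT, eventtime TEXT, "
    "PRIMARY KEY (eventtype, eventtime))"
)
rows = (
    [("LOGIN", "2013-01-0%d 10:00:00" % d) for d in range(1, 4)]
    + [("LOGOUT", "2013-01-0%d 11:00:00" % d) for d in range(1, 3)]
    + [("TEST_EVENT", "2013-01-05 09:30:00")]
)
conn.executemany("INSERT INTO allevents VALUES (?, ?)", rows)

# The recursive CTE walks the distinct eventtypes one at a time
# (min, then the next one greater than the last), then a correlated
# subquery fetches the max eventtime for each type.
result = conn.execute("""
    WITH RECURSIVE t AS (
        SELECT min(eventtype) AS eventtype FROM allevents
        UNION ALL
        SELECT (SELECT min(eventtype) FROM allevents
                WHERE eventtype > t.eventtype)
        FROM t WHERE t.eventtype IS NOT NULL
    )
    SELECT eventtype,
           (SELECT max(eventtime) FROM allevents
            WHERE eventtype = t.eventtype) AS eventtime
    FROM t WHERE eventtype IS NOT NULL
    ORDER BY eventtype
""").fetchall()
for eventtype, eventtime in result:
    print(eventtype, eventtime)
```

With an index on (eventtype, eventtime), each step of the walk and each max() lookup is a single index probe, so the cost scales with the number of distinct types rather than the number of rows.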
If upgrading to Postgres 9.3 or adding an index on (eventtype, eventtime desc) doesn't make a difference, this is a case where rewriting the query to use a correlated subquery works very well, provided you can enumerate all of the event types manually:
select val as eventtype,
(select max(eventtime)
from allevents
where allevents.eventtype = val
) as eventtime
from unnest('{type1,type2,…}'::text[]) as val;
Here are the plans I get when running similar queries:
denis=# select version();
version
-----------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.3.1 on x86_64-apple-darwin11.4.2, compiled by Apple LLVM version 4.2 (clang-425.0.28) (based on LLVM 3.2svn), 64-bit
(1 row)
Test data:
denis=# create table test (evttype int, evttime timestamp, primary key (evttype, evttime));
CREATE TABLE
denis=# insert into test (evttype, evttime) select i, now() + (i % 3) * interval '1 min' - j * interval '1 sec' from generate_series(1,10) i, generate_series(1,10000) j;
INSERT 0 100000
denis=# create index on test (evttime, evttype);
CREATE INDEX
denis=# vacuum analyze test;
VACUUM
First query:
denis=# explain analyze select evttype, max(evttime) from test group by evttype;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=2041.00..2041.10 rows=10 width=12) (actual time=54.983..54.987 rows=10 loops=1)
-> Seq Scan on test (cost=0.00..1541.00 rows=100000 width=12) (actual time=0.009..15.954 rows=100000 loops=1)
Total runtime: 55.045 ms
(3 rows)
Second query:
denis=# explain analyze select val as evttype, (select max(evttime) from test where test.evttype = val) as evttime from unnest('{1,2,3,4,5,6,7,8,9,10}'::int[]) val;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on unnest val (cost=0.00..48.39 rows=100 width=4) (actual time=0.086..0.292 rows=10 loops=1)
SubPlan 2
-> Result (cost=0.46..0.47 rows=1 width=0) (actual time=0.024..0.024 rows=1 loops=10)
InitPlan 1 (returns $1)
-> Limit (cost=0.42..0.46 rows=1 width=8) (actual time=0.021..0.021 rows=1 loops=10)
-> Index Only Scan Backward using test_pkey on test (cost=0.42..464.42 rows=10000 width=8) (actual time=0.019..0.019 rows=1 loops=10)
Index Cond: ((evttype = val.val) AND (evttime IS NOT NULL))
Heap Fetches: 0
Total runtime: 0.370 ms
(9 rows)
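The same correlated-subquery approach can be sketched as a self-contained script, again using SQLite via Python's stdlib sqlite3 in place of Postgres. SQLite has no unnest(), so a VALUES list in a CTE (named `types` here, an arbitrary choice) stands in for the manually enumerated event types; the schema follows the test table above, with made-up timestamps:

```python
import sqlite3

# Test table shaped like the one in the answer: 3 event types,
# 10 timestamps each, all values fabricated for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (evttype INTEGER, evttime TEXT, "
    "PRIMARY KEY (evttype, evttime))"
)
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(i, "2013-11-%02d 00:00:00" % j)
     for i in range(1, 4) for j in range(1, 11)],
)

# A VALUES list replaces unnest(); the correlated subquery fetches
# max(evttime) per enumerated type, one index probe each.
rows = conn.execute("""
    WITH types(v) AS (VALUES (1), (2), (3))
    SELECT v AS evttype,
           (SELECT max(evttime) FROM test
            WHERE test.evttype = types.v) AS evttime
    FROM types
""").fetchall()
for evttype, evttime in rows:
    print(evttype, evttime)
```

As in the Postgres plan above, the point is that the aggregate never scans the whole table: each enumerated type triggers one backward index lookup on the (evttype, evttime) primary key.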
An index on (eventtype, eventtime desc) should help, or reindexing the primary key index. I would also recommend changing the type of eventtype to an enum (if the number of types is fixed) or to int/smallint. This will decrease the size of the data and indexes, so queries will run faster.