Postgres selecting sub-optimal query plan in production - postgresql

I've got a query with an ORDER and a LIMIT to support a paginated interface:
SELECT segment_members.id AS t0_r0,
segment_members.segment_id AS t0_r1,
segment_members.account_id AS t0_r2,
segment_members.score AS t0_r3,
segment_members.created_at AS t0_r4,
segment_members.updated_at AS t0_r5,
segment_members.posts_count AS t0_r6,
accounts.id AS t1_r0,
accounts.platform AS t1_r1,
accounts.username AS t1_r2,
accounts.created_at AS t1_r3,
accounts.updated_at AS t1_r4,
accounts.remote_id AS t1_r5,
accounts.name AS t1_r6,
accounts.language AS t1_r7,
accounts.description AS t1_r8,
accounts.timezone AS t1_r9,
accounts.profile_image_url AS t1_r10,
accounts.post_count AS t1_r11,
accounts.follower_count AS t1_r12,
accounts.following_count AS t1_r13,
accounts.uri AS t1_r14,
accounts.location AS t1_r15,
accounts.favorite_count AS t1_r16,
accounts.raw AS t1_r17,
accounts.followers_completed_at AS t1_r18,
accounts.followings_completed_at AS t1_r19,
accounts.followers_started_at AS t1_r20,
accounts.followings_started_at AS t1_r21,
accounts.profile_fetched_at AS t1_r22,
accounts.managed_source_id AS t1_r23
FROM segment_members
INNER JOIN accounts ON accounts.id = segment_members.account_id
WHERE segment_members.segment_id = 1
ORDER BY accounts.follower_count ASC LIMIT 20
OFFSET 0;
Here are the indexes on the tables:
accounts
"accounts_pkey" PRIMARY KEY, btree (id)
"index_accounts_on_remote_id_and_platform" UNIQUE, btree (remote_id, platform)
"index_accounts_on_description" btree (description)
"index_accounts_on_favorite_count" btree (favorite_count)
"index_accounts_on_follower_count" btree (follower_count)
"index_accounts_on_following_count" btree (following_count)
"index_accounts_on_lower_username_and_platform" btree (lower(username::text), platform)
"index_accounts_on_post_count" btree (post_count)
"index_accounts_on_profile_fetched_at_and_platform" btree (profile_fetched_at, platform)
"index_accounts_on_username" btree (username)
segment_members
"segment_members_pkey" PRIMARY KEY, btree (id)
"index_segment_members_on_segment_id_and_account_id" UNIQUE, btree (segment_id, account_id)
"index_segment_members_on_account_id" btree (account_id)
"index_segment_members_on_segment_id" btree (segment_id)
In my development and staging databases, the query plan looks like the following, and the query executes very quickly.
Limit (cost=4802.15..4802.20 rows=20 width=2086)
-> Sort (cost=4802.15..4803.20 rows=421 width=2086)
Sort Key: accounts.follower_count
-> Nested Loop (cost=20.12..4790.95 rows=421 width=2086)
-> Bitmap Heap Scan on segment_members (cost=19.69..1244.24 rows=421 width=38)
Recheck Cond: (segment_id = 1)
-> Bitmap Index Scan on index_segment_members_on_segment_id_and_account_id (cost=0.00..19.58 rows=421 width=0)
Index Cond: (segment_id = 1)
-> Index Scan using accounts_pkey on accounts (cost=0.43..8.41 rows=1 width=2048)
Index Cond: (id = segment_members.account_id)
In production, however, the query plan is the following, and the query takes forever (several minutes until it hits the statement timeout).
Limit (cost=0.86..25120.72 rows=20 width=2130)
-> Nested Loop (cost=0.86..4614518.64 rows=3674 width=2130)
-> Index Scan using index_accounts_on_follower_count on accounts (cost=0.43..2779897.53 rows=3434917 width=2092)
-> Index Scan using index_segment_members_on_segment_id_and_account_id on segment_members (cost=0.43..0.52 rows=1 width=38)
Index Cond: ((segment_id = 1) AND (account_id = accounts.id))
accounts has about 6m rows in staging and 3m in production. segment_members has about 300k rows in staging and 4m in production. Is it the differences in table sizes that is causing the differences in the query plan selection? Is there any way I can get Postgres to use the faster query plan in production?
Update:
Here's the EXPLAIN ANALYZE from the slow production server:
Limit (cost=0.86..22525.66 rows=20 width=2127) (actual time=173.148..187568.247 rows=20 loops=1)
-> Nested Loop (cost=0.86..4654749.92 rows=4133 width=2127) (actual time=173.141..187568.193 rows=20 loops=1)
-> Index Scan using index_accounts_on_follower_count on accounts (cost=0.43..2839731.81 rows=3390197 width=2089) (actual time=0.110..180374.279 rows=1401278 loops=1)
-> Index Scan using index_segment_members_on_segment_id_and_account_id on segment_members (cost=0.43..0.53 rows=1 width=38) (actual time=0.003..0.003 rows=0 loops=1401278)
Index Cond: ((segment_id = 1) AND (account_id = accounts.id))
Total runtime: 187568.318 ms
(6 rows)

Either your table statistics are not up to date or the two queries you present run against very different data. The second plan estimates retrieving 3.5M rows (rows=3434917). ORDER BY / LIMIT 20 is forced to sort all 3.5 million rows to find the top 20, which is going to be extremely expensive - unless you have a matching index.
The first query plan expects to sort 421 rows. Not even close. Different query plans are no surprise.
It would be interesting to see the output of EXPLAIN ANALYZE, not just EXPLAIN. (Expensive for the second query!)
It very much depends on how many account_ids there are per segment_id. If segment_id is not selective, the query cannot be fast. Your only other option is a MATERIALIZED VIEW with the top n rows per segment_id and an appropriate regime to keep it up to date.
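As a minimal sketch (the view name and the top-n cutoff of 100 are illustrative assumptions, not from the question):
-- Hypothetical materialized view precomputing the lowest follower counts per segment
CREATE MATERIALIZED VIEW segment_members_top AS
SELECT segment_id, account_id, follower_count, rn
FROM (
   SELECT sm.segment_id, sm.account_id, a.follower_count,
          row_number() OVER (PARTITION BY sm.segment_id
                             ORDER BY a.follower_count) AS rn
   FROM   segment_members sm
   JOIN   accounts a ON a.id = sm.account_id
) sub
WHERE rn <= 100;

CREATE INDEX ON segment_members_top (segment_id, rn);

-- Refresh on whatever schedule keeps it fresh enough:
REFRESH MATERIALIZED VIEW segment_members_top;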
If your statistics are not up to date, just run ANALYZE on both tables and retry.
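In the simplest form:
ANALYZE segment_members;
ANALYZE accounts;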
It might help to increase the statistics target for selected columns:
ALTER TABLE segment_members ALTER segment_id SET STATISTICS 1000;
ALTER TABLE segment_members ALTER account_id SET STATISTICS 1000;
ALTER TABLE accounts ALTER id SET STATISTICS 1000;
ALTER TABLE accounts ALTER follower_count SET STATISTICS 1000;
ANALYZE segment_members(segment_id, account_id);
ANALYZE accounts (id, follower_count);
Details:
Check statistics targets in PostgreSQL
Keep PostgreSQL from sometimes choosing a bad query plan
Better indexes
In addition to your existing UNIQUE constraint index_segment_members_on_segment_id_and_account_id on segment_members, I suggest a multicolumn index on accounts:
CREATE INDEX index_accounts_on_id_and_follower_count ON accounts (id, follower_count);
Again, run ANALYZE after creating the index.
Some indexes useless?
All other indexes in your question are irrelevant for this query. They may be useful for other purposes, or simply useless.
This index is 100% dead freight, drop it. (Detailed explanation here.)
"index_segment_members_on_segment_id" btree (segment_id)
This one may be useless:
"index_accounts_on_description" btree (description)
A "description" is typically free text, which is hardly ever used to order rows or in a WHERE condition with an operator a btree index could support. But that's just an educated guess.

Related

SQL Performance problem with like query after migration from MySQL to PostgreSQL

I migrated my database from MySQL to PostgreSQL with pgloader. It's globally much more efficient, but one query with a LIKE condition is much slower on PostgreSQL.
MySQL : ~1ms
PostgreSQL : ~110 ms
Table info:
105 columns
23 indexes
1.6M records
Columns info:
name character varying(30) COLLATE pg_catalog."default" NOT NULL,
ratemax3v3 integer NOT NULL DEFAULT 0,
The query is:
SELECT name, realm, region, class, id
FROM personnages
WHERE blacklisted = 0 AND name LIKE 'Krok%' AND region = 'eu'
ORDER BY ratemax3v3 DESC LIMIT 5;
EXPLAIN ANALYSE (PostgreSQL)
Limit (cost=629.10..629.12 rows=5 width=34) (actual time=111.128..111.130 rows=5 loops=1)
-> Sort (cost=629.10..629.40 rows=117 width=34) (actual time=111.126..111.128 rows=5 loops=1)
Sort Key: ratemax3v3 DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on personnages (cost=9.63..627.16 rows=117 width=34) (actual time=110.619..111.093 rows=75 loops=1)
Recheck Cond: ((name)::text ~~ 'Krok%'::text)
Rows Removed by Index Recheck: 1
Filter: ((blacklisted = 0) AND ((region)::text = 'eu'::text))
Rows Removed by Filter: 13
Heap Blocks: exact=88
-> Bitmap Index Scan on trgm_idx_name (cost=0.00..9.60 rows=158 width=0) (actual time=110.581..110.582 rows=89 loops=1)
Index Cond: ((name)::text ~~ 'Krok%'::text)
Planning Time: 0.268 ms
Execution Time: 111.174 ms
pgloader created the following indexes on ratemax3v3 and name:
CREATE INDEX idx_24683_ratemax3v3
ON wow.personnages USING btree
(ratemax3v3 ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX idx_24683_name
ON wow.personnages USING btree
(name COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
I created a new index on name:
CREATE INDEX trgm_idx_name ON wow.personnages USING GIST (name gist_trgm_ops);
I'm quite a beginner with postgresql at the moment.
Do you see anything I could do?
Don't hesitate to ask me if you need more information!
Antoine
To support a LIKE query like that (left anchored) you need to use a special "operator class":
CREATE INDEX ON wow.personnages(name varchar_pattern_ops);
But for your given query, an index on multiple columns would probably be more efficient:
CREATE INDEX ON wow.personnages(region, blacklisted, name varchar_pattern_ops);
Or maybe even a partial ("filtered") index, if e.g. blacklisted = 0 is a static condition and relatively few rows match that condition.
CREATE INDEX ON wow.personnages(region, name varchar_pattern_ops)
WHERE blacklisted = 0;
If the majority of the rows have blacklisted = 0, that won't really help (and adding the column to the index wouldn't help either). In that case an index on just (region, name varchar_pattern_ops) is probably more efficient, as sketched below.
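That simpler index would look like this:
CREATE INDEX ON wow.personnages (region, name varchar_pattern_ops);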
If your pattern is anchored at the beginning, the following index would perform better:
CREATE INDEX ON personnages (name text_pattern_ops);
Besides, GIN indexes usually perform better than GiST indexes in a case like this. Try with a GIN index.
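A sketch of the GIN variant (this assumes the pg_trgm extension is already installed, which it must be for the existing GiST index; the index name is illustrative):
-- Same trigram index as trgm_idx_name, but using GIN instead of GiST
CREATE INDEX trgm_gin_idx_name ON wow.personnages USING GIN (name gin_trgm_ops);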
Finally, it is possible that the trigrams k, kr, kro, rok and ok occur very frequently, which would also make the index perform bad.

Optimizing indexes for query on large table with multiple joins

We have an images table containing around 25 million records. When I query the table based on values from several joins, the planner's row-count estimates are quite different from the actual results. We have other queries that are roughly the same without all of the joins, and they are much faster. I would like to know what steps I can take to debug and optimize the query. Also, is it better to have one index covering all columns included in the joins and the WHERE clause, or multiple indexes, one for each join column and then another with all of the fields in the WHERE clause?
The query:
EXPLAIN ANALYZE
SELECT "images".* FROM "images"
INNER JOIN "locations" ON "locations"."id" = "images"."location_id"
INNER JOIN "users" ON "images"."creator_id" = "users"."id"
INNER JOIN "user_groups" ON "users"."id" = "user_groups"."user_id"
WHERE "images"."deleted_at" IS NULL
AND "user_groups"."group_id" = 7
AND "images"."creator_type" = 'User'
AND "images"."status" = 2
AND "locations"."active" = TRUE
ORDER BY date_uploaded DESC
LIMIT 50
OFFSET 0;
The explain:
Limit (cost=25670.61..25670.74 rows=50 width=585) (actual time=1556.250..1556.278 rows=50 loops=1)
-> Sort (cost=25670.61..25674.90 rows=1714 width=585) (actual time=1556.250..1556.264 rows=50 loops=1)
Sort Key: images.date_uploaded
Sort Method: top-N heapsort Memory: 75kB
-> Nested Loop (cost=1.28..25613.68 rows=1714 width=585) (actual time=0.097..1445.777 rows=160886 loops=1)
-> Nested Loop (cost=0.85..13724.04 rows=1753 width=585) (actual time=0.069..976.326 rows=161036 loops=1)
-> Nested Loop (cost=0.29..214.87 rows=22 width=8) (actual time=0.023..0.786 rows=22 loops=1)
-> Seq Scan on user_groups (cost=0.00..95.83 rows=22 width=4) (actual time=0.008..0.570 rows=22 loops=1)
Filter: (group_id = 7)
Rows Removed by Filter: 5319
-> Index Only Scan using users_pkey on users (cost=0.29..5.40 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=22)
Index Cond: (id = user_groups.user_id)
Heap Fetches: 18
-> Index Scan using creator_date_uploaded_Where_pub_not_del on images (cost=0.56..612.08 rows=197 width=585) (actual time=0.062..40.992 rows=7320 loops=22)
Index Cond: ((creator_id = users.id) AND ((creator_type)::text = 'User'::text) AND (status = 2))
-> Index Scan using locations_pkey on locations (cost=0.43..6.77 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=161036)
Index Cond: (id = images.location_id)
Filter: active
Rows Removed by Filter: 0
Planning time: 1.694 ms
Execution time: 1556.352 ms
We are running Postgres 9.4 on an RDS db.m4.large instance.
As for the query itself, the only thing you can do is skip the users table. From the EXPLAIN you can see that it only does an Index Only Scan without actually touching the table. So, technically your query could look like this:
SELECT images.* FROM images
INNER JOIN locations ON locations.id = images.location_id
INNER JOIN user_groups ON images.creator_id = user_groups.user_id
WHERE images.deleted_at IS NULL
AND user_groups.group_id = 7
AND images.creator_type = 'User'
AND images.status = 2
AND locations.active = TRUE
ORDER BY date_uploaded DESC
OFFSET 0 LIMIT 50
The rest is about indexes. locations seems to have very little data, so optimization here will gain you nothing. user_groups on the other hand could benefit from an index ON (user_id) WHERE group_id = 7 or ON (group_id, user_id). This should remove some extra filtering on table content.
-- Option 1
CREATE INDEX ix_usergroups_userid_groupid7
ON user_groups (user_id)
WHERE group_id = 7;
-- Option 2
CREATE INDEX ix_usergroups_groupid_userid
ON user_groups (group_id, user_id);
Of course, the biggest thing here is images. Currently, the planner does an index scan on creator_date_uploaded_Where_pub_not_del, which I suspect does not fully match the requirements. Here, multiple options come to mind depending on your usage pattern - from one where the search parameters are rather common:
-- Option 1
CREATE INDEX ix_images_creatorid_typeuser_status2_notdel
ON images (creator_id)
WHERE creator_type = 'User' AND status = 2 AND deleted_at IS NULL;
to one with completely dynamic parameters:
-- Option 2
CREATE INDEX ix_images_status_creatortype_creatorid_notdel
ON images (status, creator_type, creator_id)
WHERE deleted_at IS NULL;
The first index is preferable as it is smaller (values are filtered-out rather than indexed).
To summarize, unless you are limited by memory (or other factors), I would add indexes on user_groups and images. Correct choice of indexes must be confirmed empirically, as multiple options are usually available and the situation depends on statistical distribution of data.
Here's a different approach:
I think one of the problems is that you are doing the joins an estimated 1714 times and then returning just the first 50 results, so we'll want to avoid the extra joins as early as possible.
For this, we'll try an index on date_uploaded first, and then filter by the rest of the columns. We also add creator_id to allow an index-only scan:
CREATE INDEX ix_images_sort_test
ON images (date_uploaded desc, creator_id)
WHERE creator_type = 'User' AND status = 2 AND deleted_at IS NULL;
You may also use the generic (unfiltered) version, but it should perform somewhat worse: since the first column is date_uploaded, we will need to read the whole index while filtering on the rest of the columns.
CREATE INDEX ix_images_sort_test
ON images (date_uploaded desc, status, creator_type, creator_id)
WHERE deleted_at IS NULL;
The pity here is that you are also filtering by group_id, which is in another table. Even so, it may be worth trying this approach.
Also, verify that all joined tables have an index on the foreign key.
So, add an index for user_groups on (user_id, group_id).
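A sketch of that index (the name is illustrative):
CREATE INDEX ix_usergroups_userid_groupid
ON user_groups (user_id, group_id);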
Also, as Boris noticed, you may remove the "Users" join.

Postgres Proper Index for Sorting and Join

I have a simple schema and query, but am experiencing consistent awful performance with certain parameters.
Schema:
CREATE TABLE locations (
id integer NOT NULL,
barcode_id integer NOT NULL
);
CREATE TABLE barcodes (
id integer NOT NULL,
value citext NOT NULL
);
ALTER TABLE ONLY locations ADD CONSTRAINT locations_pkey PRIMARY KEY (id);
ALTER TABLE ONLY barcodes ADD CONSTRAINT barcodes_pkey PRIMARY KEY (id);
ALTER TABLE ONLY locations ADD CONSTRAINT fk_locations_barcodes FOREIGN KEY (barcode_id) REFERENCES barcodes(id);
CREATE INDEX index_barcodes_on_value ON barcodes (value);
CREATE INDEX index_locations_on_barcode_id ON locations (barcode_id);
Query:
EXPLAIN ANALYZE
SELECT *
FROM locations
JOIN barcodes ON locations.barcode_id = barcodes.id
ORDER BY barcodes.value ASC
LIMIT 50;
Analysis:
Limit (cost=0.71..3564.01 rows=50 width=34) (actual time=0.043..683.025 rows=50 loops=1)
-> Nested Loop (cost=0.71..4090955.00 rows=57404 width=34) (actual time=0.043..683.017 rows=50 loops=1)
-> Index Scan using index_barcodes_on_value on barcodes (cost=0.42..26865.99 rows=496422 width=15) (actual time=0.023..218.775 rows=372138 loops=1)
-> Index Scan using index_locations_on_barcode_id on locations (cost=0.29..5.32 rows=287 width=8) (actual time=0.001..0.001 rows=0 loops=372138)
Index Cond: (barcode_id = barcodes.id)
Planning time: 0.167 ms
Execution time: 683.078 ms
500+ ms for the number of entries in my schema (500,000 barcodes and 60,000 locations) doesn't make sense. Can I do anything to improve the performance?
Note:
Even stranger, the execution time depends on the data. In drafting this question I attempted to include seeded random data, but the seeded data turns out to be performant:
Seed:
INSERT INTO barcodes (id, value) SELECT seed.id, gen_random_uuid() FROM generate_series(1,500000) AS seed(id);
INSERT INTO locations (id, barcode_id) SELECT seed.id, (RANDOM() * 500000) FROM generate_series(1,60000) AS seed(id);
Analysis:
Limit (cost=0.71..3602.63 rows=50 width=86) (actual time=0.089..1.123 rows=50 loops=1)
-> Nested Loop (cost=0.71..4330662.42 rows=60116 width=86) (actual time=0.088..1.115 rows=50 loops=1)
-> Index Scan using index_barcodes_on_value on barcodes (cost=0.42..44972.42 rows=500000 width=41) (actual time=0.006..0.319 rows=376 loops=1)
-> Index Scan using index_locations_on_barcode_id on locations (cost=0.29..5.56 rows=301 width=8) (actual time=0.002..0.002 rows=0 loops=376)
Index Cond: (barcode_id = barcodes.id)
Planning time: 0.213 ms
Execution time: 1.152 ms
Edit:
Analysis of the tables:
ANALYZE VERBOSE barcodes;
INFO: analyzing "public.barcodes"
INFO: "barcodes": scanned 2760 of 2760 pages, containing 496157 live rows and 0 dead rows; 30000 rows in sample, 496157 estimated total rows
ANALYZE
Time: 62.937 ms
ANALYZE VERBOSE locations;
INFO: analyzing "public.locations"
INFO: "locations": scanned 254 of 254 pages, containing 57394 live rows and 0 dead rows; 30000 rows in sample, 57394 estimated total rows
ANALYZE
Time: 21.447 ms
The problem is that the barcodes with low values don't have matches in locations, which PostgreSQL cannot know. So its plan to fetch the barcodes in the correct output order via the index and then join values from locations until it found 50 of them is much worse than it expected.
I would ANALYZE the tables and
DROP INDEX index_barcodes_on_value;
That should keep PostgreSQL from choosing that plan.
I don't know what plan PostgreSQL will choose then.
For a nested loop the following index might help:
CREATE INDEX ON locations(id);

Optimising UUID Lookups in Postgres

All uuid columns below use the native Postgres uuid column type.
I have a lookup table where the uuid (UUID version 4, so as random as can feasibly be) is the primary key. I regularly pull a sequence of rows, say 10,000, from this lookup table.
Then I wish to use that set of UUIDs to query other tables, typically two others (tables A and B). The UUIDs in tables A and B are not primary keys, but those uuid columns have UNIQUE constraints (btree indexes).
Currently I'm not doing this merging with a JOIN of any kind, just simple steps:
1. Query the lookup table, get the uuids.
2. Query table A using the uuids from (1).
3. Query table B using the uuids from (1).
The issue is that queries (2) and (3) are surprisingly slow: for around 4,000 rows in tables A and B, particularly table A, typically 30-50 seconds. Table A has around 60M rows.
Dealing with just table A: EXPLAIN ANALYZE reports an Index Scan on the uuid column of table A, with an Index Cond in the output.
I've experimented with various WHERE clauses:
uuid = ANY ('{
uuid = ANY(VALUES('
uuid ='uuid1' OR uuid='uuid2' etc ....
And I've experimented with how table A's uuid column is indexed: btree and hash. By far the fastest (which is still relatively slow) is a btree index combined with "ANY ('{" in the WHERE clause.
Various opinions I've read:
Actually doing a proper JOIN e.g. LEFT OUTER JOIN across the three tables.
That the use of uuid type 4 is the problem, it being a randomly generated id, as opposed to a sequence based id.
Possibly experimenting with work_mem (see the sketch just below).
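For example (a session-level sketch; 256MB is an arbitrary illustrative value, not a recommendation):
-- Applies to the current session only
SET work_mem = '256MB';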
Anyway. Wondered if anyone else had any other suggestions?
Table: "lookup"
uuid: type uuid. not null. plain storage.
datetime_stamp: type bigint. not null. plain storage.
harvest_date_stamp: type bigint. not null. plain storage.
state: type smallint. not null. plain storage.
Indexes:
"lookup_pkey" PRIMARY KEY, btree (uuid)
"lookup_32ff3898" btree (datetime_stamp)
"lookup_6c8369bc" btree (harvest_date_stamp)
"lookup_9ed39e2e" btree (state)
Has OIDs: no
Table: "article_data"
id: type integer. not null default nextval('article_data_id_seq'::regclass). plain storage.
title: text.
text: text.
insertion_date: date
harvest_date: timestamp with time zone.
uuid: uuid.
Indexes:
"article_data_pkey" PRIMARY KEY, btree (id)
"article_data_uuid_key" UNIQUE CONSTRAINT, btree (uuid)
Has OIDs: no
Both lookup and article_data have around 65m rows. Two queries:
SELECT uuid FROM lookup WHERE state = 200 LIMIT 4000;
OUTPUT FROM EXPLAIN (ANALYZE, BUFFERS):
Limit (cost=0.00..4661.02 rows=4000 width=16) (actual time=0.009..1.036 rows=4000 loops=1)
Buffers: shared hit=42
-> Seq Scan on lookup (cost=0.00..1482857.00 rows=1272559 width=16) (actual time=0.008..0.777 rows=4000 loops=1)
Filter: (state = 200)
Rows Removed by Filter: 410
Buffers: shared hit=42
Total runtime: 1.196 ms
(7 rows)
Question: Why does this do a sequential scan and not an index scan when there is a btree on state?
SELECT article_data.id, article_data.uuid, article_data.title, article_data.text
FROM article_data
WHERE uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65,..... 3999 more uuid's ....}'::uuid[]);
OUTPUT FROM EXPLAIN (ANALYZE, BUFFERS):
Index Scan using article_data_uuid_key on article_data (cost=5.56..34277.00 rows=4000 width=581) (actual time=0.063..66029.031 rows=4000 loops=1)
Index Cond: (uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65,5618754f-544b-4700-9d24-c364fd0ba4e9,958e37e3-6e6e-4b2a-b854-48e88ac1fdb7,ba56b483-59b2-4ae5-ae44-910401f3221b,aa4aca60-a320-4ed3-b7b4-829e6ca63592,05f1c0b9-1f9b-4e1c-8f41-07545d694e6b,7aa4dee9-be17-49df-b0ca-d6e63b0dc023,e9037826-86c4-4bbc-a9d5-6977ff7458af,db5852bf-a447-4a1d-9673-ead2f7045589,6704d89 .......}'::uuid[]))
Buffers: shared hit=16060 read=4084 dirtied=292
Total runtime: 66041.443 ms
(4 rows)
Question: Why is this so slow, even though it's reading from disk?
Without seeing your table structure and the output of explain analyze..., I'd expect an inner join on the lookup table to give the best performance. (My table_a has about 10 million rows.)
select *
from table_a
inner join
-- Brain dead way to get about 1000 rows
-- from a renamed scratch table.
(select test_uuid from lookup_table
where test_id < 10000) t
on table_a.test_uuid = t.test_uuid;
"Nested Loop (cost=0.72..8208.85 rows=972 width=36) (actual time=0.041..11.825 rows=999 loops=1)"
" -> Index Scan using uuid_test_2_test_id_key on lookup_table (cost=0.29..39.30 rows=972 width=16) (actual time=0.015..0.414 rows=999 loops=1)"
" Index Cond: (test_id Index Scan using uuid_test_test_uuid_key on table_a (cost=0.43..8.39 rows=1 width=20) (actual time=0.010..0.011 rows=1 loops=999)"
" Index Cond: (test_uuid = lookup_table.test_uuid)"
"Planning time: 0.503 ms"
"Execution time: 11.953 ms"

postgreSQL get last ID in partitioned tables

My question is basically the same as this one, but I couldn't find an answer there; it's also noted as "to be solved in the next release" and "easy for min/max scans":
PostgreSQL+table partitioning: inefficient max() and min()
CREATE TABLE mc_handstats
(
id integer NOT NULL DEFAULT nextval('mc_handst_id_seq'::regclass),
playerid integer NOT NULL,
CONSTRAINT mc_handst_pkey PRIMARY KEY (id)
);
The table is partitioned over playerid.
CREATE TABLE mc_handst_0000 ( CHECK ( playerid >= 0 AND playerid < 10000) ) INHERITS (mc_handst) TABLESPACE ssd01;
CREATE TABLE mc_handst_0010 ( CHECK ( playerid >= 10000 AND playerid < 30000) ) INHERITS (mc_handst) TABLESPACE ssd02;
CREATE TABLE mc_handst_0030 ( CHECK ( playerid >= 30000 AND playerid < 50000) ) INHERITS (mc_handst) TABLESPACE ssd03;
...
CREATE INDEX mc_handst_0000_PlayerID ON mc_handst_0000 (playerid);
CREATE INDEX mc_handst_0010_PlayerID ON mc_handst_0010 (playerid);
CREATE INDEX mc_handst_0030_PlayerID ON mc_handst_0030 (playerid);
...
plus an INSERT trigger routing rows by playerid.
I want to get the last id (I could also get the value from the sequence, but I'm used to working with tables/columns), but PostgreSQL seems to be rather stupid about it, scanning the whole table:
EXPLAIN ANALYZE select max(id) from mc_handstats; (the real query runs forever)
"Aggregate (cost=9080859.04..9080859.05 rows=1 width=4) (actual time=181867.626..181867.626 rows=1 loops=1)"
" -> Append (cost=0.00..8704322.43 rows=150614644 width=4) (actual time=2.460..163638.343 rows=151134891 loops=1)"
" -> Seq Scan on mc_handstats (cost=0.00..0.00 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=1)"
" -> Seq Scan on mc_handst_0000 mc_handstats (cost=0.00..728523.69 rows=12580969 width=4) (actual time=2.457..10800.539 rows=12656647 loops=1)"
...
ALL TABLES
...
"Total runtime: 181867.819 ms"
EXPLAIN ANALYZE select max(id) from mc_handst_1000
"Aggregate (cost=83999.50..83999.51 rows=1 width=4) (actual time=1917.933..1917.933 rows=1 loops=1)"
" -> Seq Scan on mc_handst_1000 (cost=0.00..80507.40 rows=1396840 width=4) (actual time=0.007..1728.268 rows=1396717 loops=1)"
"Total runtime: 1918.494 ms"
The runtime for the single partition is a snap, but completely off the charts for the master table. (PostgreSQL 9.2)
\d mc_handstats (only the indexes)
Indexes:
"mc_handst_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"mc_handst_playerid_fkey" FOREIGN KEY (playerid) REFERENCES mc_players(id)
Triggers:
mc_handst_insert_trigger BEFORE INSERT ON mc_handstats FOR EACH ROW EXECUTE PROCEDURE mc_handst_insert_function()
Number of child tables: 20 (Use \d+ to list them.)
\d mc_handst_1000
Indexes:
"mc_handst_1000_playerid" btree (playerid)
Check constraints:
"mc_handst_1000_playerid_check" CHECK (playerid >= 1000000 AND playerid < 1100000)
Hm, no PK index on the subtables. While I don't understand why max(id) is pretty fast on the subtables (as there is no index) and slow on the master table, it seems I need to add an index on the PK for all subtables too. Maybe that solves it.
CREATE INDEX mc_handst_0010_ID ON mc_handst_0010 (id);
... plus many more ...
and everything is fine. It's still strange that it was fast on the subtables before (which made me think they were indexed), but I don't care too much.
thanks for this!
The first thing you need to do is index all the child tables on (id) and see if max(id) is smart enough to do an index scan on each table. I think it should be, but I am not entirely sure.
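Following the naming scheme from the question, that could look like this, repeated for every child table:
CREATE INDEX mc_handst_0000_id ON mc_handst_0000 (id);
CREATE INDEX mc_handst_0010_id ON mc_handst_0010 (id);
-- ... and so on for each remaining partition ...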
If not, here's what I would do: I would start with currval([sequence_name]) and work back until a record is found. You could check blocks of 10 at a time, or the like, in what is essentially a sparse scan. This could be done with a CTE like the following (again, this relies on the indexes):
-- Walk back from the sequence's current value in blocks of 10
-- until a block containing an id is found (sparse scan).
WITH RECURSIVE ids AS (
    SELECT (SELECT max(id) FROM mc_handst
            WHERE id BETWEEN currval('mc_handst_id_seq') - 10
                         AND currval('mc_handst_id_seq')) AS max_id,
           currval('mc_handst_id_seq') - 10 AS min_block
    UNION ALL
    -- PostgreSQL forbids aggregates at the top level of the recursive
    -- term, so the max() goes into a scalar subquery.
    SELECT (SELECT max(id) FROM mc_handst
            WHERE id BETWEEN i.min_block - 10 AND i.min_block),
           i.min_block - 10
    FROM ids i
    WHERE i.max_id IS NULL
)
SELECT max(max_id) FROM ids;
That should do a sparse scan if the planner won't use an index once the partitions are indexed. In most cases it should only do one scan but it will repeat as necessary to find an id. Note that it might run forever on an empty table.
Assuming a parent table like this:
CREATE TABLE parent (
    id integer NOT NULL DEFAULT nextval('parent_id_seq'::regclass)
    -- ... other columns ...
);
Whether you're using a rule or a trigger to divert the INSERTs into the child tables, immediately after the INSERT you may use:
SELECT currval('parent_id_seq'::regclass);
to get the last id inserted by your session, independently of concurrent INSERTs, each session having its own copy of the last sequence value it has obtained.
https://dba.stackexchange.com/questions/58497/return-id-from-partitioned-table-in-postgres