I am currently working on a PostgreSQL DB where we implemented a multi-tenant infrastructure.
Here's how it works:
We have several tables (table1, table2, ...) where we added a tenant column. This column is used to filter rows based on different DB users. We have several users (tenant1, tenant2, ...) and a superuser (no tenant applied to it).
I want to optimize the following simple query:
SELECT id
FROM table
WHERE UPPER("table"."column"::text) = UPPER('blablabla')
So I created a functional index:
CREATE INDEX "upper_idx" ON "table" (upper("column") );
If I connect to the DB as superuser and execute my SELECT query, it runs smoothly and fast.
Bitmap Heap Scan on table (cost=71.66..9225.47 rows=2998 width=4)
Recheck Cond: (upper((column)::text) = 'blablabla'::text)
-> Bitmap Index Scan on upper_idx (cost=0.00..70.91 rows=2998 width=0)
Index Cond: (upper((column)::text) = 'blablabla'::text)
However, when I connect as tenant1, the index is not picked up and the DB runs a sequential scan instead:
Gather (cost=1000.00..44767.19 rows=15 width=4)
Workers Planned: 2
-> Parallel Seq Scan on table (cost=0.00..43765.69 rows=6 width=4)
Filter: (((tenant)::text = (CURRENT_USER)::text) AND (upper((column)::text) = 'blablabla'::text))
Do you know how to make it work in this case?
EDIT - added EXPLAIN (ANALYZE, BUFFERS)
Gather (cost=1000.00..44767.19 rows=15 width=4) (actual time=502.601..503.466 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=781 read=36160
-> Parallel Seq Scan on table (cost=0.00..43765.69 rows=6 width=4) (actual time=498.978..498.978 rows=0 loops=3)
Filter: (((tenant)::text = (CURRENT_USER)::text) AND (upper((column)::text) = 'blablabla'::text))
Rows Removed by Filter: 199846
Buffers: shared hit=781 read=36160
Planning Time: 1.650 ms
Execution Time: 503.510 ms
EDIT 2 - added the (truncated) CREATE TABLE statement
-- public.table definition
-- Drop table
-- DROP TABLE public.table;
CREATE TABLE public.table (
id serial NOT NULL,
created timestamptz NOT NULL,
modified timestamptz NOT NULL,
...
column varchar(100) NULL,
"tenant" tenant NOT NULL,
...
);
...
CREATE INDEX upper_idx ON public.table USING btree (upper((column)::text), tenant);
CREATE INDEX table_column_91bdd18f ON public.table USING btree (column);
CREATE INDEX table_column_91bdd18f_like ON public.table USING btree (column varchar_pattern_ops);
...
-- Table Triggers
create trigger archive_deleted_rows after
delete
on
public.table for each row execute procedure archive.archive('{id}');
create trigger set_created_modified before
insert
or
update
on
public.table for each row execute procedure set_created_modified();
create trigger set_tenant before
insert
or
update
on
public.table for each row execute procedure set_tenant();
-- public.table foreign keys
...
EDIT 3 - dump of \d table
Table "public.table"
Column | Type | Collation | Nullable | Default
------------------------------------------+--------------------------+-----------+----------+--------------------------------------------
id | integer | | not null | nextval('table_id_seq'::regclass)
........
column | character varying(100) | | |
tenant | tenant | | not null |
........
Indexes:
"table_pkey" PRIMARY KEY, btree (id)
..........
"table__column_upper_idx" btree (upper(column::text), tenant)
"table_column_91bdd18f" btree (column)
"table_column_91bdd18f_like" btree (column varchar_pattern_ops)
.........
Check constraints:
.........
Foreign-key constraints:
.........
Referenced by:
.........
Policies:
POLICY "tenant_policy"
TO tenant1,tenant2
USING (((tenant)::text = (CURRENT_USER)::text))
Triggers:
........
set_tenant BEFORE INSERT OR UPDATE ON table FOR EACH ROW EXECUTE PROCEDURE set_tenant()
EDIT 4 - Added tenant data type
CREATE TYPE tenant AS ENUM (
'tenant1',
'tenant2');
You should add a multi-column index like this:
CREATE INDEX "upper_column_for_tenant_idx" ON "table" (upper("column"), tenant);
In order to keep a single index, place upper("column") first and then tenant.
PostgreSQL docs state:
A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.
I've tried to recreate your setup in db<>fiddle. You can see that the plan for the query
EXPLAIN ANALYZE
SELECT * FROM public.table WHERE upper(("column")::text) = '50' and tenant='5'
is:
QUERY PLAN
Bitmap Heap Scan on "table" (cost=14.20..2333.45 rows=954 width=24) (actual time=0.093..0.831 rows=1000 loops=1)
Recheck Cond: ((upper(("column")::text) = '50'::text) AND ((tenant)::text = '5'::text))
Heap Blocks: exact=74
-> Bitmap Index Scan on upper_idx (cost=0.00..13.97 rows=954 width=0) (actual time=0.065..0.065 rows=1000 loops=1)
Index Cond: ((upper(("column")::text) = '50'::text) AND ((tenant)::text = '5'::text))
Planning Time: 0.361 ms
Execution Time: 0.997 ms
You should create an index on both columns. Since tenant is an enum data type and you compare it to a function result of type name, both sides are cast to the "common denominator" text. So use this index (the second column is an expression, so it needs its own parentheses):
CREATE INDEX ON "table" (upper("column"), (tenant::text));
Then calculate new statistics for the table:
ANALYZE "table";
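To verify the fix, you can re-run the query as one of the tenant roles, so that the row-level security policy (and its CURRENT_USER comparison) is actually applied. A quick check, assuming the role names from the question:

```sql
SET ROLE tenant1;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM "table"
WHERE upper("column"::text) = upper('blablabla');

RESET ROLE;
```

The plan should then show a bitmap or index scan on the new two-column index instead of a parallel sequential scan.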
Related
I have a query which is taking 2.5 seconds to run. On checking the query plan, I got to know that postgres is heavily underestimating the number of rows leading to nested loops.
Following is the query
explain analyze
SELECT
reprocessed_videos.video_id AS reprocessed_videos_video_id
FROM
reprocessed_videos
JOIN commit_info ON commit_info.id = reprocessed_videos.commit_id
WHERE
commit_info.tag = 'stop_sign_tbc_inertial_fix'
AND reprocessed_videos.reprocess_type_id = 28
AND reprocessed_videos.classification_crop_type_id = 0
AND reprocessed_videos.reprocess_status = 'success';
Following is the explain analyze output.
Nested Loop (cost=0.84..22941.18 rows=1120 width=4) (actual time=31.169..2650.181 rows=179524 loops=1)
-> Index Scan using commit_info_tag_key on commit_info (cost=0.28..8.29 rows=1 width=4) (actual time=0.395..0.397 rows=1 loops=1)
Index Cond: ((tag)::text = 'stop_sign_tbc_inertial_fix'::text)
-> Index Scan using ix_reprocessed_videos_commit_id on reprocessed_videos (cost=0.56..22919.99 rows=1289 width=8) (actual time=30.770..2634.546 rows=179524 loops=1)
Index Cond: (commit_id = commit_info.id)
Filter: ((reprocess_type_id = 28) AND (classification_crop_type_id = 0) AND ((reprocess_status)::text = 'success'::text))
Rows Removed by Filter: 1190
Planning Time: 0.326 ms
Execution Time: 2657.724 ms
As we can see, the index scan using ix_reprocessed_videos_commit_id anticipated 1289 rows, whereas there were 179524. I have been trying to find the reason for this but have been unsuccessful with whatever I tried.
Following are the things I tried.
Vacuuming and analyzing all the involved tables (helped a little, but not much, maybe because the tables were already auto-vacuumed and auto-analyzed).
Increasing the statistics target for the commit_id column: alter table reprocessed_videos alter column commit_id set statistics 1000; (helped a little)
I read about extended statistics, but not sure if they are of any use here.
Following are the number of tuples in each of these tables
kpis=> SELECT relname, reltuples FROM pg_class where relname in ('reprocessed_videos', 'video_catalog', 'commit_info');
relname | reltuples
--------------------+---------------
commit_info | 1439
reprocessed_videos | 3.1563756e+07
Following is some information related to table schemas
Table "public.reprocessed_videos"
Column | Type | Collation | Nullable | Default
-----------------------------+-----------------------------+-----------+----------+------------------------------------------------
id | integer | | not null | nextval('reprocessed_videos_id_seq'::regclass)
video_id | integer | | |
reprocess_status | character varying | | |
commit_id | integer | | |
reprocess_type_id | integer | | |
classification_crop_type_id | integer | | |
Indexes:
"reprocessed_videos_pkey" PRIMARY KEY, btree (id)
"ix_reprocessed_videos_commit_id" btree (commit_id)
"ix_reprocessed_videos_video_id" btree (video_id)
"reprocessed_videos_video_commit_reprocess_crop_key" UNIQUE CONSTRAINT, btree (video_id, commit_id, reprocess_type_id, classification_crop_type_id)
Foreign-key constraints:
"reprocessed_videos_commit_id_fkey" FOREIGN KEY (commit_id) REFERENCES commit_info(id)
Table "public.commit_info"
Column | Type | Collation | Nullable | Default
------------------------+-------------------+-----------+----------+-----------------------------------------
id | integer | | not null | nextval('commit_info_id_seq'::regclass)
tag | character varying | | |
commit | character varying | | |
Indexes:
"commit_info_pkey" PRIMARY KEY, btree (id)
"commit_info_tag_key" UNIQUE CONSTRAINT, btree (tag)
I am sure that postgres should not use nested loops in this case, but is using them because of bad row estimates. Any help is highly appreciated.
Following are the experiments I tried.
Disabling index scan
Nested Loop (cost=734.59..84368.70 rows=1120 width=4) (actual time=274.694..934.965 rows=179524 loops=1)
-> Bitmap Heap Scan on commit_info (cost=4.29..8.30 rows=1 width=4) (actual time=0.441..0.444 rows=1 loops=1)
Recheck Cond: ((tag)::text = 'stop_sign_tbc_inertial_fix'::text)
Heap Blocks: exact=1
-> Bitmap Index Scan on commit_info_tag_key (cost=0.00..4.29 rows=1 width=0) (actual time=0.437..0.439 rows=1 loops=1)
Index Cond: ((tag)::text = 'stop_sign_tbc_inertial_fix'::text)
-> Bitmap Heap Scan on reprocessed_videos (cost=730.30..84347.51 rows=1289 width=8) (actual time=274.250..920.137 rows=179524 loops=1)
Recheck Cond: (commit_id = commit_info.id)
Filter: ((reprocess_type_id = 28) AND (classification_crop_type_id = 0) AND ((reprocess_status)::text = 'success'::text))
Rows Removed by Filter: 1190
Heap Blocks: exact=5881
-> Bitmap Index Scan on ix_reprocessed_videos_commit_id (cost=0.00..729.98 rows=25256 width=0) (actual time=273.534..273.534 rows=180714 loops=1)
Index Cond: (commit_id = commit_info.id)
Planning Time: 0.413 ms
Execution Time: 941.874 ms
I also updated the statistics for the commit_id column and observed approximately a 3x speed increase.
On trying to disable bitmapscan, the query does a sequential scan and takes 19 seconds to run.
The nested loop is the perfect join strategy, because there is only one row from commit_info. Any other join strategy would lose.
The question is whether the index scan on reprocessed_videos is really too slow. To experiment, try again after SET enable_indexscan = off; to get a bitmap index scan and see if that is better. Then also SET enable_bitmapscan = off; to get a sequential scan. I suspect that your current plan will win, but the bitmap index scan has a good chance.
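The two experiments can be sketched as follows; the enable_* settings are session-local planner hints, not permanent changes:

```sql
-- Force the planner away from a plain index scan: a bitmap index scan remains
SET enable_indexscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT reprocessed_videos.video_id
FROM reprocessed_videos
JOIN commit_info ON commit_info.id = reprocessed_videos.commit_id
WHERE commit_info.tag = 'stop_sign_tbc_inertial_fix'
  AND reprocessed_videos.reprocess_type_id = 28
  AND reprocessed_videos.classification_crop_type_id = 0
  AND reprocessed_videos.reprocess_status = 'success';

-- Additionally rule out the bitmap scan: only a sequential scan remains
SET enable_bitmapscan = off;
-- ... re-run the same EXPLAIN (ANALYZE, BUFFERS) query here ...

-- Restore the defaults for this session
RESET enable_indexscan;
RESET enable_bitmapscan;
```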
If the bitmap index scan is better, you should indeed try to improve the estimate:
ALTER TABLE reprocessed_videos ALTER commit_id SET STATISTICS 1000;
ANALYZE reprocessed_videos;
You can try with other values; pick the lowest that gives you a good enough estimate.
Another thing to try are extended statistics:
CREATE STATISTICS corr (dependencies)
ON (reprocess_type_id, classification_crop_type_id, reprocess_status)
FROM reprocessed_videos;
ANALYZE reprocessed_videos;
Perhaps you don't need even all three columns in there; play with it.
If the bitmap index scan does not offer enough benefit, there is one way you can speed up the current index scan:
CLUSTER reprocessed_videos USING ix_reprocessed_videos_commit_id;
That rewrites the table in index order (and blocks concurrent access while it is running, so be careful!). After that, the index scan could be considerably faster. However, the order is not maintained, so you'll have to repeat the CLUSTER occasionally if enough of the table has been modified.
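Whether CLUSTER is likely to pay off can be gauged beforehand from the planner's physical-order statistic (a quick check against the pg_stats view):

```sql
-- correlation near +1 or -1: the heap is already roughly in commit_id order,
-- so CLUSTER gains little; a value near 0 suggests it could help a lot
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'reprocessed_videos'
  AND attname = 'commit_id';
```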
Create a covering index: one that has all the condition columns (first, in descending order of cardinality) and the value columns (last) needed for your query, which means the index alone can be used, avoiding accessing the table:
create index covering_index on reprocessed_videos(
reprocess_type_id,
classification_crop_type_id,
reprocess_status,
commit_id,
video_id
);
And ensure there's an index on commit_info(id) too. Note that in Postgres a primary key automatically gets a unique index (commit_info_pkey already covers id); it is plain foreign-key columns that are not indexed automatically:
create index commit_info__id on commit_info(id);
To get more accurate query plans, you can manually set the cardinality of condition columns, for example:
select count(distinct reprocess_type_id) from reprocessed_videos;
Then set that value to the column:
alter table reprocessed_videos alter column reprocess_type_id set (n_distinct = number_from_above_query);
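For example, assuming the count query returned 4 (a made-up value for illustration), the override and the required re-ANALYZE would look like this:

```sql
ALTER TABLE reprocessed_videos
  ALTER COLUMN reprocess_type_id SET (n_distinct = 4);

-- The override only takes effect at the next statistics collection
ANALYZE reprocessed_videos;
```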
I have a table inside my PostgreSQL database called consumer_actions. It contains all the actions done by consumers registered in my app. At the moment, this table has ~500 million records. What I'm trying to do is get the maximum id based on the system that the action came from.
The definition of the table is:
CREATE TABLE public.consumer_actions (
id int4 NOT NULL,
system_id int4 NOT NULL,
consumer_id int4 NOT NULL,
action_id int4 NOT NULL,
payload_json jsonb NULL,
external_system_date timestamptz NULL,
local_system_date timestamptz NULL,
CONSTRAINT consumer_actions_pkey PRIMARY KEY (id, system_id)
);
CREATE INDEX consumer_actions_ext_date ON public.consumer_actions USING btree (external_system_date);
CREATE INDEX consumer_actions_system_consumer_id ON public.consumer_actions USING btree (system_id, consumer_id);
When I run
select max(id) from consumer_actions where system_id = 1;
it takes less than one second. But if I try to use the same index (consumer_actions_system_consumer_id) to get the max(id) for system_id = 2, it takes more than an hour:
select max(id) from consumer_actions where system_id = 2;
I have also checked the query plans; they look similar for both queries. I also re-ran VACUUM ANALYZE on the table and a REINDEX. Neither of them helped. Any idea what I can do to improve the second query's time?
Here are the query plans for both queries, and the current size of the table:
explain analyze
select max(id) from consumer_actions where system_id = 1;
Result (cost=1.49..1.50 rows=1 width=4) (actual time=0.062..0.063 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..1.49 rows=1 width=4) (actual time=0.057..0.057 rows=1 loops=1)
-> Index Only Scan Backward using consumer_actions_pkey on consumer_actions ca (cost=0.57..524024735.49 rows=572451344 width=4) (actual time=0.055..0.055 rows=1 loops=1)
Index Cond: ((id IS NOT NULL) AND (system_id = 1))
Heap Fetches: 1
Planning Time: 0.173 ms
Execution Time: 0.092 ms
explain analyze
select max(id) from consumer_actions where system_id = 2;
Result (cost=6.46..6.47 rows=1 width=4) (actual time=7099484.855..7099484.858 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.57..6.46 rows=1 width=4) (actual time=7099484.839..7099484.841 rows=1 loops=1)
-> Index Only Scan Backward using consumer_actions_pkey on consumer_actions ca (cost=0.57..20205843.58 rows=3436129 width=4) (actual time=7099484.833..7099484.834 rows=1 loops=1)
Index Cond: ((id IS NOT NULL) AND (system_id = 2))
Heap Fetches: 1
Planning Time: 3.078 ms
Execution Time: 7099484.992 ms
(8 rows)
select count(*) from consumer_actions; --result is 577408504
Instead of using an aggregate function like max(), which potentially has to scan and aggregate a large number of rows in a table like yours, you could get the same result with a query designed to return the fewest rows possible:
SELECT id FROM consumer_actions WHERE system_id = ? ORDER BY id DESC LIMIT 1;
This should still benefit significantly from the existing indices.
I think that you should create an index like this one
CREATE INDEX consumer_actions_system_system_id_id ON public.consumer_actions USING btree (system_id, id);
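With a (system_id, id) btree in place, the planner can satisfy either form of the query with a short backward walk over just the matching system_id entries, instead of scanning the whole primary key backwards. A quick way to verify (a sketch):

```sql
-- After creating the (system_id, id) index, both forms should show an
-- Index Only Scan with "Index Cond: (system_id = 2)" in the plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT max(id) FROM consumer_actions WHERE system_id = 2;

EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM consumer_actions WHERE system_id = 2
ORDER BY id DESC LIMIT 1;
```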
I migrated my database from MySQL to PostgreSQL with pgloader. It's globally much more efficient, but a query with a LIKE condition is much slower on PostgreSQL:
MySQL: ~1 ms
PostgreSQL: ~110 ms
Table info:
105 columns
23 indexes
1.6M records
Columns info:
name character varying(30) COLLATE pg_catalog."default" NOT NULL,
ratemax3v3 integer NOT NULL DEFAULT 0,
The query is:
SELECT name, realm, region, class, id
FROM personnages
WHERE blacklisted = 0 AND name LIKE 'Krok%' AND region = 'eu'
ORDER BY ratemax3v3 DESC LIMIT 5;
EXPLAIN ANALYSE (PostgreSQL)
Limit (cost=629.10..629.12 rows=5 width=34) (actual time=111.128..111.130 rows=5 loops=1)
-> Sort (cost=629.10..629.40 rows=117 width=34) (actual time=111.126..111.128 rows=5 loops=1)
Sort Key: ratemax3v3 DESC
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on personnages (cost=9.63..627.16 rows=117 width=34) (actual time=110.619..111.093 rows=75 loops=1)
Recheck Cond: ((name)::text ~~ 'Krok%'::text)
Rows Removed by Index Recheck: 1
Filter: ((blacklisted = 0) AND ((region)::text = 'eu'::text))
Rows Removed by Filter: 13
Heap Blocks: exact=88
-> Bitmap Index Scan on trgm_idx_name (cost=0.00..9.60 rows=158 width=0) (actual time=110.581..110.582 rows=89 loops=1)
Index Cond: ((name)::text ~~ 'Krok%'::text)
Planning Time: 0.268 ms
Execution Time: 111.174 ms
pgloader created indexes on ratemax3v3 and name like this:
CREATE INDEX idx_24683_ratemax3v3
ON wow.personnages USING btree
(ratemax3v3 ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX idx_24683_name
ON wow.personnages USING btree
(name COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
I created a new index on name:
CREATE INDEX trgm_idx_name ON wow.personnages USING GIST (name gist_trgm_ops);
I'm quite a beginner with PostgreSQL at the moment.
Do you see anything I could do?
Don't hesitate to ask me if you need more information!
Antoine
To support a LIKE query like that (left anchored) you need to use a special "operator class":
CREATE INDEX ON wow.personnages(name varchar_pattern_ops);
But for your given query, an index on multiple columns would probably be more efficient:
CREATE INDEX ON wow.personnages(region, blacklisted, name varchar_pattern_ops);
Or maybe even a partial (filtered) index, if e.g. blacklisted = 0 is a static condition and relatively few rows match that condition.
CREATE INDEX ON wow.personnages(region, name varchar_pattern_ops)
WHERE blacklisted = 0;
If the majority of the rows have blacklisted = 0, that won't really help (and adding the column to the index wouldn't help either). In that case, just an index on (region, name varchar_pattern_ops) is probably more efficient.
If your pattern is anchored at the beginning, the following index would perform better:
CREATE INDEX ON personnages (name text_pattern_ops);
Besides, GIN indexes usually perform better than GiST indexes in a case like this. Try with a GIN index.
Finally, it is possible that the trigrams k, kr, kro, rok and ok occur very frequently, which would also make the index perform badly.
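Both suggestions can be sketched like this (the index names are made up; gin_trgm_ops requires the pg_trgm extension to be installed):

```sql
-- For the left-anchored LIKE 'Krok%': a btree with the pattern operator class
CREATE INDEX personnages_name_pattern_idx
  ON personnages (name text_pattern_ops);

-- For general trigram matching: GIN instead of GiST
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX personnages_name_trgm_idx
  ON personnages USING gin (name gin_trgm_ops);
```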
Table definition:
CREATE TABLE schema.mylogoperation (
    id_mylogoperation serial,
    data DATE,
    myschema VARCHAR(255),
    column_var_2 VARCHAR(255),
    "user" VARCHAR(255),
    action TEXT,
    column_var_1 TEXT,
    log_old VARCHAR,
    log_new VARCHAR,
    CONSTRAINT pk_mylogoperation PRIMARY KEY (id_mylogoperation)
)
WITH (oids = false);
12 million rows
I tried to explain analyze:
explain analyze
SELECT
column_var_1,
column_var_2,
column_var_3,
"user",
action,
data,
log_old,
log_new
FROM schema.mylogoperation
WHERE
myschema = 'schema'
AND column_var_2 IN ('mydata1', 'mydata2', 'mydata3')
AND log_old <> log_new
AND column_var_1 LIKE 'mydata%';
indexes ( pk_mylogoperation only)
QUERY PLAN
Seq Scan on myschema (cost=0.00..713948.14 rows=660 width=222) (actual time=380.308..4467.364 rows=48 loops=1)
Filter: (((log_old)::text <> (log_new)::text) AND (column_var_1 ~~ 'mydata%'::text) AND ((schema)::text = 'schema'::text) AND ((column_var_2)::text = ANY ('{mydata1,mydata2,mydata3}'::text[])))
Rows Removed by Filter: 12525296
Total runtime: 4467.425 ms
Then I tried to create a some index for better performance:
CREATE INDEX idx_mylogoperation_1 ON schema.mylogoperation (myschema, column_var_2);
reindex table schema.mylogoperation;
analyze schema.mylogoperation;
pk_mylogoperation + idx_mylogoperation_1
QUERY PLAN
Index Scan using idx_mylogoperation_qry1 on mylogoperation (cost=0.56..589836.84 rows=658 width=223) (actual time=331.679..4997.507 rows=48 loops=1)
Index Cond: (((myschema)::text = 'schema'::text) AND ((column_var_2)::text = ANY ('{mydata1,mydata2,mydata3}'::text[])))
Filter: (((log_old)::text <> (log_new)::text) AND (column_var_1 ~~ 'mydata%'::text))
Rows Removed by Filter: 7441986
Total runtime: 4997.580 ms
Then I tried again to create a some index for better performance:
CREATE INDEX idx_mylogoperation_2 ON schema.mylogoperation USING gin (column_var_1 gin_trgm_ops);
reindex table schema.mylogoperation;
analyze schema.mylogoperation;
pk_mylogoperation + idx_mylogoperation_1 + idx_mylogoperation_2
QUERY PLAN
Bitmap Heap Scan on idx_mylogoperation_var_1 (cost=1398.58..2765.08 rows=663 width=222) (actual time=5303.481..5303.906 rows=48 loops=1)
Recheck Cond: (column_var_1 ~~ 'mydata%'::text)
Filter: (((log_old)::text <> (log_new)::text) AND ((myschema)::text = 'schema'::text) AND ((column_var_2)::text = ANY ('{mydata1,mydata2,mydata3}'::text[])))
Rows Removed by Filter: 248
-> Bitmap Index Scan on idx_mylogoperation_var_1 (cost=0.00..1398.41 rows=1215 width=0) (actual time=5303.203..5303.203 rows=296 loops=1)
Index Cond: (column_var_1 ~~ 'mydata%'::text)
Total runtime: 5303.950 ms
The question
The cost estimate decreased, but the runtime was practically the same. Why?
Notes:
I do not want to make changes to the SELECT statement, just to the database structure.
This test was performed on a server that is in use. But was creating these indexes worthwhile? Or should I rather not use them?
I am using Postgres 9.3.22 on Linux 64-bit Red Hat.
This index:
CREATE INDEX idx_mylogoperation_1 ON schema.mylogoperation (myschema, column_var_2);
didn't help because the relevant portion of your where clause matched ~2/3 of the table. The index didn't narrow down the results very much, but the filter did:
Filter: (((log_old)::text <> (log_new)::text) AND (column_var_1 ~~ 'mydata%'::text))
Rows Removed by Filter: 7441986
I'm not sure which of those two things in the filter removed more, but you could try a partial index like:
CREATE INDEX idx_mylogoperation_1 ON schema.mylogoperation (myschema, column_var_2) WHERE log_old <> log_new;
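If it turns out the LIKE condition is what removes most rows, an alternative sketch is a partial index led by the pattern column (the index name is made up; text_pattern_ops makes the left-anchored LIKE indexable):

```sql
CREATE INDEX idx_mylogoperation_3
  ON schema.mylogoperation (column_var_1 text_pattern_ops)
  WHERE log_old <> log_new;
```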
All uuid columns below use the native Postgres uuid column type.
I have a lookup table where the uuid (UUID version 4, so as random as can feasibly be) is the primary key. I regularly pull a sequence of rows, say 10,000, from this lookup table.
Then I wish to use that set of UUIDs retrieved from the lookup table to query other tables, typically two others (tables A and B), using the UUIDs just retrieved. The UUIDs in tables A and B are not primary keys; the uuid columns in A and B have UNIQUE constraints (btree indices).
Currently I'm not doing this merging using a JOIN of any kind, just simply:
Query lookup table, get uuids.
Query table A using uuids from (1)
Query table B using uuids from (1)
The issue is that queries (2) and (3) are surprisingly slow: for around 4,000 rows from tables A and B, particularly table A, typically around 30-50 seconds. Table A has around 60M rows.
Dealing with just table A: EXPLAIN ANALYZE reports an Index Scan on the uuid column of table A, with an Index Cond in the output.
I've experimented with various WHERE clauses:
uuid = ANY ('{
uuid = ANY(VALUES('
uuid ='uuid1' OR uuid='uuid2' etc ....
And I've experimented with the index on table A's uuid column: btree (distinct) and hash.
By far the fastest (which is still relatively slow) is a btree index combined with "ANY ('{" in the WHERE clause.
Various opinions I've read:
Actually doing a proper JOIN e.g. LEFT OUTER JOIN across the three tables.
That the use of UUID version 4 is the problem, it being a randomly generated id, as opposed to a sequence-based id.
Possibly experimenting with work_mem.
Anyway. Wondered if anyone else had any other suggestions?
Table: "lookup"
uuid: type uuid. not null. plain storage.
datetime_stamp: type bigint. not null. plain storage.
harvest_date_stamp: type bigint. not null. plain storage.
state: type smallint. not null. plain storage.
Indexes:
"lookup_pkey" PRIMARY KEY, btree (uuid)
"lookup_32ff3898" btree (datetime_stamp)
"lookup_6c8369bc" btree (harvest_date_stamp)
"lookup_9ed39e2e" btree (state)
Has OIDs: no
Table: "article_data"
id: type integer. not null default nextval('article_data_id_seq'::regclass). plain storage.
title: text.
text: text.
insertion_date: date
harvest_date: timestamp with time zone.
uuid: uuid.
Indexes:
"article_data_pkey" PRIMARY KEY, btree (id)
"article_data_uuid_key" UNIQUE CONSTRAINT, btree (uuid)
Has OIDs: no
Both lookup and article_data have around 65m rows. Two queries:
SELECT uuid FROM lookup WHERE state = 200 LIMIT 4000;
OUTPUT FROM EXPLAIN (ANALYZE, BUFFERS):
Limit (cost=0.00..4661.02 rows=4000 width=16) (actual time=0.009..1.036 rows=4000 loops=1)
Buffers: shared hit=42
-> Seq Scan on lookup (cost=0.00..1482857.00 rows=1272559 width=16) (actual time=0.008..0.777 rows=4000 loops=1)
Filter: (state = 200)
Rows Removed by Filter: 410
Buffers: shared hit=42
Total runtime: 1.196 ms
(7 rows)
Question: Why does this do a sequential scan and not an index scan when there is a btree index on state?
SELECT article_data.id, article_data.uuid, article_data.title, article_data.text
FROM article_data
WHERE uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65,..... 3999 more uuid's ....}'::uuid[]);
OUTPUT FROM EXPLAIN (ANALYZE, BUFFERS):
Index Scan using article_data_uuid_key on article_data (cost=5.56..34277.00 rows=4000 width=581) (actual time=0.063..66029.031 rows=4000 loops=1)
Index Cond: (uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65,5618754f-544b-4700-9d24-c364fd0ba4e9,958e37e3-6e6e-4b2a-b854-48e88ac1fdb7,ba56b483-59b2-4ae5-ae44-910401f3221b,aa4aca60-a320-4ed3-b7b4-829e6ca63592,05f1c0b9-1f9b-4e1c-8f41-07545d694e6b,7aa4dee9-be17-49df-b0ca-d6e63b0dc023,e9037826-86c4-4bbc-a9d5-6977ff7458af,db5852bf-a447-4a1d-9673-ead2f7045589,6704d89 .......}'::uuid[]))
Buffers: shared hit=16060 read=4084 dirtied=292
Total runtime: 66041.443 ms
(4 rows)
Question: Why is this so slow, even though it's reading from disk?
Without seeing your table structure and the output of explain analyze..., I'd expect an inner join on the lookup table to give the best performance. (My table_a has about 10 million rows.)
select *
from table_a
inner join
-- Brain dead way to get about 1000 rows
-- from a renamed scratch table.
(select test_uuid from lookup_table
where test_id < 10000) t
on table_a.test_uuid = t.test_uuid;
"Nested Loop (cost=0.72..8208.85 rows=972 width=36) (actual time=0.041..11.825 rows=999 loops=1)"
" -> Index Scan using uuid_test_2_test_id_key on lookup_table (cost=0.29..39.30 rows=972 width=16) (actual time=0.015..0.414 rows=999 loops=1)"
"     Index Cond: (test_id < 10000)"
"  -> Index Scan using uuid_test_test_uuid_key on table_a (cost=0.43..8.39 rows=1 width=20) (actual time=0.010..0.011 rows=1 loops=999)"
" Index Cond: (test_uuid = lookup_table.test_uuid)"
"Planning time: 0.503 ms"
"Execution time: 11.953 ms"
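Applied to the tables from the question, the same idea would look roughly like this (a sketch: it folds steps (1) and (2) into one statement, so the 4,000 UUIDs never make a round trip to the client and the planner can drive a nested-loop join over article_data_uuid_key):

```sql
SELECT a.id, a.uuid, a.title, a.text
FROM article_data a
JOIN (
    -- step (1): pull the batch of UUIDs from the lookup table
    SELECT uuid FROM lookup WHERE state = 200 LIMIT 4000
) l ON a.uuid = l.uuid;
```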